# Category Archives: Sampling Methods

## A Gentle Introduction to Markov Chain Monte Carlo (MCMC)

Applying probabilistic models to data usually involves integrating a complex, multi-dimensional probability distribution. For example, calculating the expectation/mean of a model distribution involves such an integration. Many (most) times, these integrals are not calculable due to the high dimensionality of the distribution or because there is no closed-form expression for the integral available using calculus. Markov Chain Monte Carlo (MCMC) is a method that allows one to approximate complex integrals using stochastic sampling routines. As MCMC’s name indicates, the method is composed of two components, the Markov chain and Monte Carlo integration.

Monte Carlo integration is a powerful technique that exploits stochastic sampling of the distribution in question in order to approximate the difficult integration. However, in order to use Monte Carlo integration it is necessary to be able to sample from the probability distribution in question, which may be difficult or impossible to do directly. This is where the second component of MCMC, the Markov chain, comes in. A Markov chain is a sequential model that transitions from one state to another in a probabilistic fashion, where the next state that the chain takes is conditioned on the previous state. Markov chains are useful in that if they are constructed properly, and allowed to run for a long time, the states that a chain will take also sample from a target probability distribution. Therefore we can construct Markov chains to sample from the distribution whose integral we would like to approximate, then use Monte Carlo integration to perform the approximation.

Here I introduce a series of posts where I describe the basic concepts underlying MCMC, starting off by describing Monte Carlo Integration, then giving a brief introduction of Markov chains and how they can be constructed to sample from a target probability distribution. Given these foundation principles, we can then discuss MCMC techniques such as the Metropolis and Metropolis-Hastings algorithms, the Gibbs sampler, and the Hybrid Monte Carlo algorithm.

As always, each post has a somewhat formal/mathematical introduction, along with an example and simple Matlab implementations of the associated algorithms.

## MCMC: The Gibbs Sampler

In the previous post, we compared using block-wise and component-wise implementations of the Metropolis-Hastings algorithm for sampling from a multivariate probability distribution$p(\bold x)$. Component-wise updates for MCMC algorithms are generally more efficient for multivariate problems than blockwise updates in that we are more likely to accept a proposed sample by drawing each component/dimension independently of the others. However, samples may still be rejected, leading to excess computation that is never used. The Gibbs sampler, another popular MCMC sampling technique, provides a means of avoiding such wasted computation. Like the component-wise implementation of the Metropolis-Hastings algorithm, the Gibbs sampler also uses component-wise updates. However, unlike in the Metropolis-Hastings algorithm, all proposed samples are accepted, so there is no wasted computation.

The Gibbs sampler is applicable for certain classes of problems, based on two main criterion. Given a target distribution $p(\bold x)$, where $\bold x = (x_1, x_2, \dots, x_D$, ),  The first criterion is 1) that it is necessary that we have an analytic (mathematical) expression for the conditional distribution of each variable in the joint distribution given all other variables in the joint. Formally, if the target distribution $p(\bold x)$ is $D$-dimensional, we must have $D$ individual expressions for

$p(x_i|x_1,x_2,\dots,x_{i-1},x_{i+1},\dots,x_D)$

$= p(x_i | x_j), j\neq i$.

Each of these expressions defines the probability of the $i$-th dimension given that we have values for all other ($j \neq i$) dimensions. Having the conditional distribution for each variable means that we don’t need a proposal distribution or an accept/reject criterion, like in the Metropolis-Hastings algorithm. Therefore, we can simply sample from each conditional while keeping all other variables held fixed. This leads to the second criterion 2) that we must be able to sample from each conditional distribution. This caveat is obvious if we want an implementable algorithm.

The Gibbs sampler works in much the same way as the component-wise Metropolis-Hastings algorithms except that instead drawing from a proposal distribution for each dimension, then accepting or rejecting the proposed sample, we simply draw a value for that dimension according to the variable’s corresponding conditional distribution. We also accept all values that are drawn. Similar to the component-wise Metropolis-Hastings algorithm, we step through each variable sequentially, sampling it while keeping all other variables fixed. The Gibbs sampling procedure is outlined below

1. set $t = 0$
2. generate an initial state $\bold x^{(0)} \sim \pi^{(0)}$
3. repeat until $t = M$

set $t = t+1$

for each dimension $i = 1..D$

draw $x_i$ from $p(x_i|x_1,x_2,\dots,x_{i-1},x_{i+1},\dots,x_D)$

To get a better understanding of the Gibbs sampler at work, let’s implement the Gibbs sampler to solve the same multivariate sampling problem addressed in the previous post.

### Example: Sampling from a bivariate a Normal distribution

This example parallels the examples in the previous post where we sampled from a 2-D Normal distribution using block-wise and component-wise Metropolis-Hastings algorithms. Here, we show how to implement a Gibbs sampler to draw samples from the same target distribution. As a reminder, the target distribution $p(\bold x)$ is a Normal form with following parameterization:

$p(\bold x) = \mathcal N (\bold{\mu}, \bold \Sigma)$

with mean

$\mu = [\mu_1,\mu_2]= [0, 0]$

and covariance

$\bold \Sigma = \begin{bmatrix} 1 & \rho_{12} \\ \rho_{21} & 1\end{bmatrix} = \begin{bmatrix} 1 & 0.8 \\ 0.8 & 1\end{bmatrix}$

In order to sample from this distribution using a Gibbs sampler, we need to have in hand the conditional distributions for variables/dimensions $x_1$ and $x_2$:

$p(x_1 | x_2^{(t-1)})$ (i.e. the conditional for the first dimension, $x_1$)

and

$p(x_2 | x_1^{(t)})$ (the conditional for the second dimension, $x_2$)

Where $x_2^{(t-1)}$ is the previous state of the second dimension, and $x_1^{(t)}$ is the state of the first dimension after drawing from $p(x_1 | x_2^{(t-1)})$. The reason for the discrepancy between updating $x_1$ and $x_2$ using states $(t-1)$ and $(t)$, can be is seen in step 3 of the algorithm outlined in the previous section. At iteration $t$ we first sample a new state for variable $x_1$ conditioned on the most recent state of variable $x_2$, which is from iteration $(t-1)$. We then sample a new state for the variable $x_2$ conditioned on the most recent state of variable $x_1$, which is now from the current iteration, $t$.

After some math (which which I will skip for some brevity, but see the following for some details), we find that the two conditional distributions for the target Normal distribution are:

$p(x_1 | x_2^{(t-1)}) = \mathcal N(\mu_1 + \rho_{21}(x_2^{(t-1)} - \mu_2),\sqrt{1-\rho_{21}^2})$

and

$p(x_2 | x_1^{(t)})=\mathcal N(\mu_2 + \rho_{12}(x_1^{(t)}-\mu_1),\sqrt{1-\rho_{12}^2})$,

which are both univariate Normal distributions, each with a mean that is dependent on the value of the most recent state of the conditioning variable, and a variance that is dependent on the target covariances between the two variables.

Using the above expressions for the conditional probabilities of variables $x_1$ and $x_2$, we implement the Gibbs sampler using MATLAB below. The output of the sampler is shown here:

Gibbs sampler Markov chain and samples for bivariate Normal target distribution

Inspecting the figure above, note how at each iteration the Markov chain for the Gibbs sampler first takes a step only along the $x_1$ direction, then only along the $x_2$ direction.  This shows how the Gibbs sampler sequentially samples the value of each variable separately, in a component-wise fashion.

% EXAMPLE: GIBBS SAMPLER FOR BIVARIATE NORMAL
rand('seed' ,12345);
nSamples = 5000;

mu = [0 0]; % TARGET MEAN
rho(1) = 0.8; % rho_21
rho(2) = 0.8; % rho_12

% INITIALIZE THE GIBBS SAMPLER
propSigma = 1; % PROPOSAL VARIANCE
minn = [-3 -3];
maxx = [3 3];

% INITIALIZE SAMPLES
x = zeros(nSamples,2);
x(1,1) = unifrnd(minn(1), maxx(1));
x(1,2) = unifrnd(minn(2), maxx(2));

dims = 1:2; % INDEX INTO EACH DIMENSION

% RUN GIBBS SAMPLER
t = 1;
while t < nSamples
t = t + 1;
T = [t-1,t];
for iD = 1:2 % LOOP OVER DIMENSIONS
% UPDATE SAMPLES
nIx = dims~=iD; % *NOT* THE CURRENT DIMENSION
% CONDITIONAL MEAN
muCond = mu(iD) + rho(iD)*(x(T(iD),nIx)-mu(nIx));
% CONDITIONAL VARIANCE
varCond = sqrt(1-rho(iD)^2);
% DRAW FROM CONDITIONAL
x(t,iD) = normrnd(muCond,varCond);
end
end

% DISPLAY SAMPLING DYNAMICS
figure;
h1 = scatter(x(:,1),x(:,2),'r.');

% CONDITIONAL STEPS/SAMPLES
hold on;
for t = 1:50
plot([x(t,1),x(t+1,1)],[x(t,2),x(t,2)],'k-');
plot([x(t+1,1),x(t+1,1)],[x(t,2),x(t+1,2)],'k-');
h2 = plot(x(t+1,1),x(t+1,2),'ko');
end

h3 = scatter(x(1,1),x(1,2),'go','Linewidth',3);
legend([h1,h2,h3],{'Samples','1st 50 Samples','x(t=0)'},'Location','Northwest')
hold off;
xlabel('x_1');
ylabel('x_2');
axis square


## Wrapping Up

The Gibbs sampler is a popular MCMC method for sampling from complex, multivariate probability distributions. However, the Gibbs sampler cannot be used for general sampling problems. For many target distributions, it may difficult or impossible to obtain a closed-form expression for all the needed conditional distributions. In other scenarios, analytic expressions may exist for all conditionals but it may be difficult to sample from any or all of the conditional distributions (in these scenarios it is common to use univariate sampling methods such as rejection sampling and (surprise!) Metropolis-type MCMC techniques to approximate samples from each conditional). Gibbs samplers are very popular for Bayesian methods where models are often devised in such a way that conditional expressions for all model variables are easily obtained and take well-known forms that can be sampled from efficiently.

Gibbs sampling, like many MCMC techniques suffer from what is often called “slow mixing.” Slow mixing occurs when the underlying Markov chain takes a long time to sufficiently explore the values of $\bold x$ in order to give a good characterization of $p(\bold x)$. Slow mixing is due to a number of factors including the “random walk” nature of the Markov chain, as well as the tendency of the Markov chain to get “stuck,” only sampling a single region of $\bold x$ having high-probability under $p(\bold x)$. Such behaviors are bad for sampling distributions with multiple modes or heavy tails. More advanced techniques, such as Hybrid Monte Carlo have been developed to incorporate additional dynamics that increase the efficiency of the Markov chain path. We will discuss Hybrid Monte Carlo in a future post.

## MCMC: Multivariate Distributions, Block-wise, & Component-wise Updates

In the previous posts on MCMC methods, we focused on how to sample from univariate target distributions. This was done mainly to give the reader some intuition about MCMC implementations with fairly tangible examples that can be visualized. However, MCMC can easily be extended to sample multivariate distributions.

In this post we will discuss two flavors of MCMC update procedure for sampling distributions in multiple dimensions: block-wise, and component-wise update procedures. We will show how these two different procedures can give rise to different implementations of the Metropolis-Hastings sampler to solve the same problem.

## Block-wise Sampling

The first approach for performing multidimensional sampling is to use block-wise updates. In this approach the proposal distribution $q(\bold{x})$ has the same dimensionality as the target distribution $p(\bold x)$. Specifically, if $p(\bold x)$ is a distribution over $D$ variables, ie. $\bold{x} = (x_1, x_2, \dots, x_D)$, then we must design a proposal distribution that is also a distribution involving $D$ variables. We then accept or reject a proposed state $\bold x^*$ sampled from the proposal distribution $q(\bold x)$ in exactly the same way as for the univariate Metropolis-Hastings algorithm. To generate $M$ multivariate samples we perform the following block-wise sampling procedure:

1. set $t = 0$
2. generate an initial state $\bold x^{(0)} \sim \pi^{(0)}$
3. repeat until $t = M$

set $t = t+1$

generate a proposal state $\bold x^*$ from $q(\bold x | \bold x^{(t-1)})$

calculate the proposal correction factor $c = \frac{q(\bold x^{(t-1)} | \bold x^*) }{q(\bold x^*|\bold x^{(t-1)})}$

calculate the acceptance probability $\alpha = \text{min} \left (1,\frac{p(\bold x^*)}{p(\bold x^{(t-1)})} \times c\right )$

draw a random number $u$ from $\text{Unif}(0,1)$

if $u \leq \alpha$ accept the proposal state $\bold x^*$ and set $\bold x^{(t)}=\bold x^*$

else set $\bold x^{(t)} = \bold x^{(t-1)}$

Let’s take a look at the block-wise sampling routine in action.

### Example 1: Block-wise Metropolis-Hastings for sampling of bivariate Normal distribution

In this example we use block-wise Metropolis-Hastings algorithm to sample from a bivariate (i.e. $D = 2$) Normal distribution:

$p(\bold x) = \mathcal N (\bold{\mu}, \bold \Sigma)$

with mean

$\mu = [0, 0]$

and covariance

$\bold \Sigma = \begin{bmatrix} 1 & 0.8 \\ 0.8 & 1\end{bmatrix}$

Usually the target distribution $p(\bold x)$ will have a complex mathematical form, but for this example we’ll circumvent that by using MATLAB’s built-in function $\text{mvnpdf}$ to evaluate $p(\bold x)$. For our proposal distribution, $q(\bold x)$, let’s use a circular Normal centered at the the previous state/sample of the Markov chain/sampler, i.e:

$q(\bold x | \bold x^{(t-1)}) \sim \mathcal N (\bold x^{(t-1)}, \bold I)$,

where $\bold I$ is a 2-D identity matrix, giving the proposal distribution unit variance along both dimensions $x_1$ and $x_2$, and zero covariance. You can find an MATLAB implementation of the block-wise sampler at the end of the section. The display of the samples and the target distribution output by the sampler implementation are shown below:

Samples drawn from block-wise Metropolis-Hastings sampler

We can see from the output that the block-wise sampler does a good job of drawing samples from the target distribution.

Note that our proposal distribution in this example is symmetric, therefore it was not necessary to calculate the correction factor $c$ per se. This means that this Metropolis-Hastings implementation is identical to the simpler Metropolis sampler.

%------------------------------------------------------
% EXAMPLE 1: METROPOLIS-HASTINGS
% BLOCK-WISE SAMPLER (BIVARIATE NORMAL)
rand('seed' ,12345);

D = 2; % # VARIABLES
nBurnIn = 100;

% TARGET DISTRIBUTION IS A 2D NORMAL WITH STRONG COVARIANCE
p = inline('mvnpdf(x,[0 0],[1 0.8;0.8 1])','x');

% PROPOSAL DISTRIBUTION STANDARD 2D GUASSIAN
q = inline('mvnpdf(x,mu)','x','mu')

nSamples = 5000;
minn = [-3 -3];
maxx = [3 3];

% INITIALIZE BLOCK-WISE SAMPLER
t = 1;
x = zeros(nSamples,2);
x(1,:) = randn(1,D);

% RUN SAMPLER
while t < nSamples
t = t + 1;

% SAMPLE FROM PROPOSAL
xStar = mvnrnd(x(t-1,:),eye(D));

% CORRECTION FACTOR (SHOULD EQUAL 1)
c = q(x(t-1,:),xStar)/q(xStar,x(t-1,:));

% CALCULATE THE M-H ACCEPTANCE PROBABILITY
alpha = min([1, p(xStar)/p(x(t-1,:))]);

% ACCEPT OR REJECT?
u = rand;
if u < alpha
x(t,:) = xStar;
else
x(t,:) = x(t-1,:);
end
end

% DISPLAY
nBins = 20;
bins1 = linspace(minn(1), maxx(1), nBins);
bins2 = linspace(minn(2), maxx(2), nBins);

% DISPLAY SAMPLED DISTRIBUTION
ax = subplot(121);
bins1 = linspace(minn(1), maxx(1), nBins);
bins2 = linspace(minn(2), maxx(2), nBins);
sampX = hist3(x, 'Edges', {bins1, bins2});
hist3(x, 'Edges', {bins1, bins2});
view(-15,40)

% COLOR HISTOGRAM BARS ACCORDING TO HEIGHT
colormap hot
set(gcf,'renderer','opengl');
set(get(gca,'child'),'FaceColor','interp','CDataMode','auto');
xlabel('x_1'); ylabel('x_2'); zlabel('Frequency');
axis square
set(ax,'xTick',[minn(1),0,maxx(1)]);
set(ax,'yTick',[minn(2),0,maxx(2)]);
title('Sampled Distribution');

% DISPLAY ANALYTIC DENSITY
ax = subplot(122);
[x1 ,x2] = meshgrid(bins1,bins2);
probX = p([x1(:), x2(:)]);
probX = reshape(probX ,nBins, nBins);
surf(probX); axis xy
view(-15,40)
xlabel('x_1'); ylabel('x_2'); zlabel('p({\bfx})');
colormap hot
axis square
set(ax,'xTick',[1,round(nBins/2),nBins]);
set(ax,'xTickLabel',[minn(1),0,maxx(1)]);
set(ax,'yTick',[1,round(nBins/2),nBins]);
set(ax,'yTickLabel',[minn(2),0,maxx(2)]);
title('Analytic Distribution')


## Component-wise Sampling

A problem with block-wise updates, particularly when the number of dimensions $D$ becomes large, is that finding a suitable proposal distribution is difficult. This leads to a large proportion of the samples being rejected. One way to remedy this is to simply loop over the the $D$ dimensions of $\bold x$ in sequence, sampling each dimension independently from the others. This is what is known as using component-wise updates. Note that now the proposal distribution $q(x)$ is univariate, working only in one dimension, namely the current dimension that we are trying to sample. The component-wise Metropolis-Hastings algorithm is outlined below.

1. set $t = 0$
2. generate an initial state $\bold x^{(0)} \sim \pi^{(0)}$
3. repeat until $t = M$

set $t = t+1$

for each dimension $i = 1..D$

generate a proposal state $x_i^*$ from $q(x_i | x_i^{(t-1)})$

calculate the proposal correction factor $c = \frac{q(x_i^{(t-1)}) | x_i^*)}{q(x_i^* | x_i^{(t-1)})}$

calculate the acceptance probability $\alpha = \text{min} \left (1,\frac{p( x_i^*, \bold x_j^{(t-1)})}{p( x_i^{(t-1)}, \bold x_j^{(t-1)})} \times c\right )$

draw a random number $u$ from $\text{Unif}(0,1)$

if $u \leq \alpha$ accept the proposal state $x_i^*$ and set $x_i^{(t)}=x_i^*$

else set $x_i^{(t)} = x_i^{(t-1)}$

Note that in the component-wise implementation a sample for the $i$-th dimension is proposed, then  accepted or rejected while all other dimensions ($j \neq i$) are held fixed. We then move on to the next ($(i + 1)$-th) dimension and repeat the process while holding all other variables ($j \neq (i + 1)$) fixed. In each successive step we are using updated values for the dimensions that have occurred since increasing $(t -1) \rightarrow t$.

### Example 2: Component-wise Metropolis-Hastings for sampling of bivariate Normal distribution

In this example we draw samples from the same bivariate Normal target distribution described in Example 1, but using component-wise updates. Therefore $p(x)$ is the same, however, the proposal distribution $q(x)$ is now a univariate Normal distribution with unit unit variance in the direction of the $i$-th dimension to be sampled. The MATLAB implementation of the component-wise sampler is at the end of the section. The samples and comparison to the analytic target distribution are shown below.

Samples drawn from component-wise Metropolis-Hastings algorithm compared to target distribution

Again, we see that we get a good characterization of the bivariate target distribution.

%--------------------------------------------------
% EXAMPLE 2: METROPOLIS-HASTINGS
% COMPONENT-WISE SAMPLING OF BIVARIATE NORMAL
rand('seed' ,12345);

% TARGET DISTRIBUTION
p = inline('mvnpdf(x,[0 0],[1 0.8;0.8 1])','x');

nSamples = 5000;
propSigma = 1;		% PROPOSAL VARIANCE
minn = [-3 -3];
maxx = [3 3];

% INITIALIZE COMPONENT-WISE SAMPLER
x = zeros(nSamples,2);
xCurrent(1) = randn;
xCurrent(2) = randn;
dims = 1:2; % INDICES INTO EACH DIMENSION
t = 1;
x(t,1) = xCurrent(1);
x(t,2) = xCurrent(2);

% RUN SAMPLER
while t < nSamples
t = t + 1;
for iD = 1:2 % LOOP OVER DIMENSIONS

% SAMPLE PROPOSAL
xStar = normrnd(xCurrent(:,iD), propSigma);

% NOTE: CORRECTION FACTOR c=1 BECAUSE
% N(mu,1) IS SYMMETRIC, NO NEED TO CALCULATE

% CALCULATE THE ACCEPTANCE PROBABILITY
pratio = p([xStar xCurrent(dims~=iD)])/ ...
p([xCurrent(1) xCurrent(2)]);
alpha = min([1, pratio]);

% ACCEPT OR REJECT?
u = rand;
if u < alpha
xCurrent(iD) = xStar;
end
end

% UPDATE SAMPLES
x(t,:) = xCurrent;
end

% DISPLAY
nBins = 20;
bins1 = linspace(minn(1), maxx(1), nBins);
bins2 = linspace(minn(2), maxx(2), nBins);

% DISPLAY SAMPLED DISTRIBUTION
figure;
ax = subplot(121);
bins1 = linspace(minn(1), maxx(1), nBins);
bins2 = linspace(minn(2), maxx(2), nBins);
sampX = hist3(x, 'Edges', {bins1, bins2});
hist3(x, 'Edges', {bins1, bins2});
view(-15,40)

% COLOR HISTOGRAM BARS ACCORDING TO HEIGHT
colormap hot
set(gcf,'renderer','opengl');
set(get(gca,'child'),'FaceColor','interp','CDataMode','auto');
xlabel('x_1'); ylabel('x_2'); zlabel('Frequency');
axis square
set(ax,'xTick',[minn(1),0,maxx(1)]);
set(ax,'yTick',[minn(2),0,maxx(2)]);
title('Sampled Distribution');

% DISPLAY ANALYTIC DENSITY
ax = subplot(122);
[x1 ,x2] = meshgrid(bins1,bins2);
probX = p([x1(:), x2(:)]);
probX = reshape(probX ,nBins, nBins);
surf(probX); axis xy
view(-15,40)
xlabel('x_1'); ylabel('x_2'); zlabel('p({\bfx})');
colormap hot
axis square
set(ax,'xTick',[1,round(nBins/2),nBins]);
set(ax,'xTickLabel',[minn(1),0,maxx(1)]);
set(ax,'yTick',[1,round(nBins/2),nBins]);
set(ax,'yTickLabel',[minn(2),0,maxx(2)]);
title('Analytic Distribution')



## Wrapping Up

Here we saw how we can use block- and component-wise updates to derive two different implementations of the Metropolis-Hastings algorithm. In the next post we will use component-wise updates introduced above to motivate the Gibbs sampler, which is often used to increase the efficiency of sampling well-defined probability multivariate distributions.

## MCMC: The Metropolis-Hastings Sampler

In an earlier post we discussed how the Metropolis sampling algorithm can draw samples from a complex and/or unnormalized target probability distributions using a Markov chain. The Metropolis algorithm first proposes a possible new state $x^*$ in the Markov chain, based on a previous state $x^{(t-1)}$, according to the proposal distribution $q(x^* | x^{(t-1)})$. The algorithm accepts or rejects the proposed state based on the density of the the target distribution $p(x)$ evaluated at $x^*$. (If any of this Markov-speak is gibberish to the reader, please refer to the previous posts on Markov Chains, MCMC, and the Metropolis Algorithm for some clarification).

One constraint of the Metropolis sampler is that the proposal distribution $q(x^* | x^{(t-1)})$ must be symmetric. The constraint originates from using a Markov Chain to draw samples: a necessary condition for drawing from a Markov chain’s stationary distribution is that at any given point in time $t$, the probability of moving from $x^{(t-1)} \rightarrow x^{(t)}$ must be equal to the probability of moving from $x^{(t-1)} \rightarrow x^{(t)}$, a condition known as reversibility or detailed balance. However, a symmetric proposal distribution may be ill-fit for many problems, like when we want to sample from distributions that are bounded on semi infinite intervals (e.g. $[0, \infty)$).

In order to be able to use an asymmetric proposal distributions, the Metropolis-Hastings algorithm implements an additional correction factor $c$, defined from the proposal distribution as

$c = \frac{q(x^{(t-1)} | x^*) }{q(x^* | x^{(t-1)})}$

The correction factor adjusts the transition operator to ensure that the probability of moving from $x^{(t-1)} \rightarrow x^{(t)}$ is equal to the probability of moving from $x^{(t-1)} \rightarrow x^{(t)}$, no matter the proposal distribution.

The Metropolis-Hastings algorithm is implemented with essentially the same procedure as the Metropolis sampler, except that the correction factor is used in the evaluation of acceptance probability $\alpha$.  Specifically, to draw $M$ samples using the Metropolis-Hastings sampler:

1. set t = 0
2. generate an initial state $x^{(0)} \sim \pi^{(0)}$
3. repeat until $t = M$

set $t = t+1$

generate a proposal state $x^*$ from $q(x | x^{(t-1)})$

calculate the proposal correction factor $c = \frac{q(x^{(t-1)} | x^*) }{q(x^*|x^{(t-1)})}$

calculate the acceptance probability $\alpha = \text{min} \left (1,\frac{p(x^*)}{p(x^{(t-1)})} \times c\right )$

draw a random number $u$ from $\text{Unif}(0,1)$

if $u \leq \alpha$ accept the proposal state $x^*$ and set $x^{(t)}=x^*$

else set $x^{(t)} = x^{(t-1)}$

Many consider the Metropolis-Hastings algorithm to be a generalization of the Metropolis algorithm. This is because when the proposal distribution is symmetric, the correction factor is equal to one, giving the transition operator for the Metropolis sampler.

## Example: Sampling from a Bayesian posterior with improper prior

For a number of applications, including regression and density estimation, it is usually necessary to determine a set of parameters $\theta$ to an assumed model $p(y | \theta)$ such that the model can best account for some observed data $y$. The model function $p(y | \theta)$ is often referred to as the likelihood function. In Bayesian methods there is often an explicit prior distribution $p(\theta)$ that is placed on the model parameters and controls the values that the parameters can take.

The parameters are determined based on the posterior distribution $p(\theta | y)$, which is a probability distribution over the possible parameters based on the observed data. The posterior can be determined using Bayes’ theorem:

$p(\theta | y) = \frac{p(y | \theta) p(\theta)}{p(y)}$

where, $p(y)$ is a normalization constant that is often quite difficult to determine explicitly, as it involves computing sums over every possible value that the parameters and $y$ can take.

Let’s say that we assume the following model (likelihood function):

$p(y | \theta) = \text{Gamma}(y;A,B)$, where

$\text{Gamma}(y;A,B) = \frac{B^A}{\Gamma(A)} y^{A-1}e^{-By}$, where

$\Gamma( )$ is the gamma function. Thus, the model parameters are

$\theta = [A,B]$

The parameter $A$ controls the shape of the distribution, and $B$ controls the scale. The likelihood surface for $B = 1$, and a number of values of $A$ ranging from zero to five are shown below.

Likelihood surface and conditional probability p(y|A=2,B=1) in green

The conditional distribution $p(y | A=2, B = 1)$ is plotted in green along the likelihood surface. You can verify this is a valid conditional in MATLAB with the following command:

 plot(0:.1:10,gampdf(0:.1:10,4,1)); % GAMMA(4,1)

Now, let’s assume the following priors on the model parameters:

$p(B = 1) = 1$

and

$p(A) = \text{sin}(\pi A)^2$

The first prior states that $B$ only takes a single value (i.e. 1), therefore we can treat it as a constant. The second (rather non-conventional) prior states that the probability of $A$ varies as a sinusoidal function. (Note that both of these prior distributions are called improper priors because they do not integrate to one). Because $B$ is constant, we only need to estimate the value of $A$.

It turns out that even though the normalization constant $p(y)$ may be difficult to compute, we can sample from $p(A | y)$ without knowing $p(x)$ using the Metropolis-Hastings algorithm. In particular, we can ignore the normalization constant $p(x)$ and sample from the unnormalized posterior:

$p(A | y) \propto p(y |A) p(A)$

The surface of the (unnormalized) posterior for $y$ ranging from zero to ten are shown below. The prior $p(A)$ is displayed in blue on the right of the plot. Let’s say that we have a datapoint $y = 1.5$ and would like to estimate the posterior distribution $p(A|y=1.5)$ using the Metropolis-Hastings algorithm. This particular target distribution is plotted in magenta in the plot below.

Posterior surface, prior distribution (blue), and target distribution (pink)

Using a symmetric proposal distribution like the Normal distribution is not efficient for sampling from $p(A|y=1.5)$ due to the fact that the posterior only has support on the real positive numbers $A \in [0 ,\infty)$. An asymmetric proposal distribution with the same support, would provide a better coverage of the posterior. One distribution that operates on the positive real numbers is the exponential distribution.

$q(A) = \text{Exp}(\mu) = \mu e^{-\mu A}$,

This distribution is parameterized by a single variable $\mu$ that controls the scale and location of the distribution probability mass. The target posterior and a proposal distribution (for $\mu = 5$) are shown in the plot below.

Target posterior p(A|y) and proposal distribution q(A)

We see that the proposal has a fairly good coverage of the posterior distribution. We run the Metropolis-Hastings sampler in the block of MATLAB code at the bottom. The Markov chain path and the resulting samples are shown in plot below.

Metropolis-Hastings Markov chain and samples

As an aside, note that the proposal distribution for this sampler does not depend on past samples, but only on the parameter $\mu$ (see line 88 in the MATLAB code below). Each proposal states $x^*$ is drawn independently of the previous state. Therefore this is an example of an independence sampler, a specific type of Metropolis-Hastings sampling algorithm. Independence samplers are notorious for being either very good or very poor sampling routines. The quality of the routine depends on the choice of the proposal distribution, and its coverage of the target distribution. Identifying such a proposal distribution is often difficult in practice.

The MATLAB  code for running the Metropolis-Hastings sampler is below. Use the copy icon in the upper right of the code block to copy it to your clipboard. Paste in a MATLAB terminal to output the figures above.

% METROPOLIS-HASTINGS BAYESIAN POSTERIOR
rand('seed',12345)

% PRIOR OVER SCALE PARAMETERS
B = 1;

% DEFINE LIKELIHOOD
likelihood = inline('(B.^A/gamma(A)).*y.^(A-1).*exp(-(B.*y))','y','A','B');

% CALCULATE AND VISUALIZE THE LIKELIHOOD SURFACE
yy = linspace(0,10,100);
AA = linspace(0.1,5,100);
likeSurf = zeros(numel(yy),numel(AA));
for iA = 1:numel(AA); likeSurf(:,iA)=likelihood(yy(:),AA(iA),B); end;

figure;
surf(likeSurf); ylabel('p(y|A)'); xlabel('A'); colormap hot

% DISPLAY CONDITIONAL AT A = 2
hold on; ly = plot3(ones(1,numel(AA))*40,1:100,likeSurf(:,40),'g','linewidth',3)
xlim([0 100]); ylim([0 100]);  axis normal
set(gca,'XTick',[0,100]); set(gca,'XTickLabel',[0 5]);
set(gca,'YTick',[0,100]); set(gca,'YTickLabel',[0 10]);
view(65,25)
legend(ly,'p(y|A=2)','Location','Northeast');
hold off;
title('p(y|A)');

% DEFINE PRIOR OVER SHAPE PARAMETERS
prior = inline('sin(pi*A).^2','A');

% DEFINE THE POSTERIOR
p = inline('(B.^A/gamma(A)).*y.^(A-1).*exp(-(B.*y)).*sin(pi*A).^2','y','A','B');

% CALCULATE AND DISPLAY THE POSTERIOR SURFACE
postSurf = zeros(size(likeSurf));
for iA = 1:numel(AA); postSurf(:,iA)=p(yy(:),AA(iA),B); end;

figure
surf(postSurf); ylabel('y'); xlabel('A'); colormap hot

% DISPLAY THE PRIOR
hold on; pA = plot3(1:100,ones(1,numel(AA))*100,prior(AA),'b','linewidth',3)

% SAMPLE FROM p(A | y = 1.5)
y = 1.5;
target = postSurf(16,:);

% DISPLAY POSTERIOR
psA = plot3(1:100, ones(1,numel(AA))*16,postSurf(16,:),'m','linewidth',3)
xlim([0 100]); ylim([0 100]);  axis normal
set(gca,'XTick',[0,100]); set(gca,'XTickLabel',[0 5]);
set(gca,'YTick',[0,100]); set(gca,'YTickLabel',[0 10]);
view(65,25)
legend([pA,psA],{'p(A)','p(A|y = 1.5)'},'Location','Northeast');
hold off
title('p(A|y)');

% INITIALIZE THE METROPOLIS-HASTINGS SAMPLER
% DEFINE PROPOSAL DENSITY
q = inline('exppdf(x,mu)','x','mu');

% MEAN FOR PROPOSAL DENSITY
mu = 5;

% DISPLAY TARGET AND PROPOSAL
figure; hold on;
th = plot(AA,target,'m','Linewidth',2);
qh = plot(AA,q(AA,mu),'k','Linewidth',2)
legend([th,qh],{'Target, p(A)','Proposal, q(A)'});
xlabel('A');

% SOME CONSTANTS
nSamples = 5000;
burnIn = 500;
minn = 0.1; maxx = 5;

% INTIIALZE SAMPLER
x = zeros(1 ,nSamples);
x(1) = mu;
t = 1;

% RUN METROPOLIS-HASTINGS SAMPLER
while t < nSamples
t = t+1;

% SAMPLE FROM PROPOSAL
xStar = exprnd(mu);

% CORRECTION FACTOR
c = q(x(t-1),mu)/q(xStar,mu);

% CALCULATE THE (CORRECTED) ACCEPTANCE RATIO
alpha = min([1, p(y,xStar,B)/p(y,x(t-1),B)*c]);

% ACCEPT OR REJECT?
u = rand;
if u < alpha
x(t) = xStar;
else
x(t) = x(t-1);
end
end

% DISPLAY MARKOV CHAIN
figure;
subplot(211);
stairs(x(1:t),1:t, 'k');
hold on;
hb = plot([0 maxx/2],[burnIn burnIn],'g--','Linewidth',2)
ylabel('t'); xlabel('samples, A');
set(gca , 'YDir', 'reverse');
ylim([0 t])
axis tight;
xlim([0 maxx]);
title('Markov Chain Path');
legend(hb,'Burnin');

% DISPLAY SAMPLES
subplot(212);
nBins = 100;
sampleBins = linspace(minn,maxx,nBins);
counts = hist(x(burnIn:end), sampleBins);
bar(sampleBins, counts/sum(counts), 'k');
xlabel('samples, A' ); ylabel( 'p(A | y)' );
title('Samples');
xlim([0 10])

% OVERLAY TARGET DISTRIBUTION
hold on;
plot(AA, target/sum(target) , 'm-', 'LineWidth', 2);
legend('Sampled Distribution',sprintf('Target Posterior'))
axis tight


## Wrapping Up

Here we explored how the Metorpolis-Hastings sampling algorithm can be used to generalize the Metropolis algorithm in order to sample from complex (an unnormalized) probability distributions using asymmetric proposal distributions. One shortcoming of the Metropolis-Hastings algorithm is that not all of the proposed samples are accepted, wasting valuable computational resources. This becomes even more of an issue for sampling distributions in higher dimensions. This is where Gibbs sampling comes in. We’ll see in a later post that Gibbs sampling can be used to keep all proposal states in the Markov chain by taking advantage of conditional probabilities.

## MCMC: The Metropolis Sampler

As discussed in an earlier post, we can use a Markov chain to sample from some target probability distribution $p(x)$ from which drawing samples directly is difficult. To do so, it is necessary to design a transition operator for the Markov chain which makes the chain’s stationary distribution match the target distribution. The Metropolis sampling algorithm  (and the more general Metropolis-Hastings sampling algorithm) uses simple heuristics to implement such a transition operator.

## Metropolis Sampling

Starting from some random initial state $x^{(0)} \sim \pi^{(0)}$, the algorithm first draws a possible sample $x^*$ from a  proposal distribution $q(x | x^{(t-1)})$.  Much like a conventional transition operator for a Markov chain, the proposal distribution depends only on the previous state in the chain. However, the transition operator for the Metropolis algorithm has an additional step that assesses whether or not the target distribution has a sufficiently large density near the proposed state to warrant accepting the proposed state as a sample and setting it to the next state in the chain. If the density of $p(x)$ is low near the proposed state, then it is likely (but not guaranteed) that it will be rejected. The criterion for accepting or rejecting a proposed state are defined by the following heuristics:

1. If $p(x^*) \geq p(x^{(t-1)})$,  the proposed state is kept $x^*$ as a sample and is set as the next state in the chain (i.e. move the chain’s state to a location  where $p(x)$ has equal or greater density).
2. If $p(x^*) < p(x^{(t-1)})$–indicating that $p(x)$ has low density near $x^*$–then the proposed state may still be accepted, but only randomly, and with a probability $\frac{p(x^*)}{p(x^{(t-1)})}$

These heuristics can be instantiated by calculating the acceptance probability for the proposed state.

$\alpha = \min \left(1, \frac{p(x^*)}{p(x^{(t-1)})}\right)$

Having the acceptance probability in hand, the transition operator for the metropolis algorithm works like this: if a random uniform number $u$ is less than or equal to $\alpha$, then the state $x^*$ is accepted (as in (1) above), if not, it is rejected and another state is proposed (as in (2) above). In order to collect $M$ samples using  Metropolis sampling we run the following algorithm:

1. set t = 0
2. generate an initial state $x^{(0)}$ from a prior distribution $\pi^{(0)}$ over initial states
3. repeat until $t = M$

set $t = t+1$

generate a proposal state $x^*$ from $q(x | x^{(t-1)})$

calculate the acceptance probability $\alpha = \min \left(1, \frac{p(x^*)}{p(x^{(t-1)})}\right)$

draw a random number $u$ from $\text{Unif}(0,1)$

if $u \leq \alpha$, accept the proposal and set $x^{(t)} = x^*$

else  set $x^{(t)} = x^{(t-1)}$

### Example: Using the Metropolis algorithm to sample from an unknown distribution

Say that we have some mysterious function

$p(x) = (1 + x^2)^{-1}$

from which we would like to draw samples. To do so using Metropolis sampling we need to define two things: (1) the prior distribution $\pi^{(0)}$ over the initial state of the Markov chain, and (2)  the proposal distribution $q(x | x^{(t-1)})$. For this example we define:

$\pi^{(0)} \sim \mathcal N(0,1)$

$q(x | x^{(t-1)}) \sim \mathcal N(x^{(t-1)},1)$,

both of which are simply a Normal distribution, one centered at zero, the other centered at previous state of the chain. The following chunk of MATLAB code runs the Metropolis sampler with this proposal distribution and prior.

% METROPOLIS SAMPLING EXAMPLE
randn('seed',12345);

% DEFINE THE TARGET DISTRIBUTION
p = inline('(1 + x.^2).^-1','x')

% SOME CONSTANTS
nSamples = 5000;
burnIn = 500;
nDisplay = 30;
sigma = 1;
minn = -20; maxx = 20;
xx = 3*minn:.1:3*maxx;
target = p(xx);
pauseDur = .8;

% INITIALZE SAMPLER
x = zeros(1 ,nSamples);
x(1) = randn;
t = 1;

% RUN SAMPLER
while t < nSamples
t = t+1;

% SAMPLE FROM PROPOSAL
xStar = normrnd(x(t-1) ,sigma);
proposal = normpdf(xx,x(t-1),sigma);

% CALCULATE THE ACCEPTANCE PROBABILITY
alpha = min([1, p(xStar)/p(x(t-1))]);

% ACCEPT OR REJECT?
u = rand;
if u < alpha
x(t) = xStar;
str = 'Accepted';
else
x(t) = x(t-1);
str = 'Rejected';
end

% DISPLAY SAMPLING DYNAMICS
if t < nDisplay + 1
figure(1);
subplot(211);
cla
plot(xx,target,'k');
hold on;
plot(xx,proposal,'r');
line([x(t-1),x(t-1)],[0 p(x(t-1))],'color','b','linewidth',2)
scatter(xStar,0,'ro','Linewidth',2)
line([xStar,xStar],[0 p(xStar)],'color','r','Linewidth',2)
plot(x(1:t),zeros(1,t),'ko')
legend({'Target','Proposal','p(x^{(t-1)})','x^*','p(x^*)','Kept Samples'})

switch str
case 'Rejected'
scatter(xStar,p(xStar),'rx','Linewidth',3)
case 'Accepted'
scatter(xStar,p(xStar),'rs','Linewidth',3)
end
scatter(x(t-1),p(x(t-1)),'bo','Linewidth',3)
title(sprintf('Sample % d %s',t,str))
xlim([minn,maxx])
subplot(212);
hist(x(1:t),50); colormap hot;
xlim([minn,maxx])
title(['Sample ',str]);
drawnow
pause(pauseDur);
end
end

% DISPLAY MARKOV CHAIN
figure(1); clf
subplot(211);
stairs(x(1:t),1:t, 'k');
hold on;
hb = plot([-10 10],[burnIn burnIn],'b--')
ylabel('t'); xlabel('samples, x');
set(gca , 'YDir', 'reverse');
ylim([0 t])
axis tight;
xlim([-10 10]);
title('Markov Chain Path');
legend(hb,'Burnin');

% DISPLAY SAMPLES
subplot(212);
nBins = 200;
sampleBins = linspace(minn,maxx,nBins);
counts = hist(x(burnIn:end), sampleBins);
bar(sampleBins, counts/sum(counts), 'k');
xlabel('samples, x' ); ylabel( 'p(x)' );
title('Samples');

% OVERLAY ANALYTIC DENSITY OF STUDENT T
nu = 1;
y = tpdf(sampleBins,nu)
hold on;
plot(sampleBins, y/sum(y) , 'r-', 'LineWidth', 2);
legend('Samples',sprintf('Theoretic\nStudent''s t'))
axis tight
xlim([-10 10]);


Using the Metropolis algorithm to sample from a continuous distribution (black)

In the figure above, we visualize the first 50 iterations of the Metropolis sampler.The black curve represents the target distribution $p(x)$. The red curve that is bouncing about the x-axis is the proposal distribution $q(x | x^{(t-1)})$ (if the figure is not animated, just click on it). The vertical blue line (about which the bouncing proposal distribution is centered) represents the quantity $p(x^{(t-1)})$, and the vertical red line represents the quantity $p(x^*)$, for a proposal state $x^*$ sampled according to the red  curve. At every iteration, if the vertical red line is longer than the blue line, then the sample $x^*$ is accepted, and the proposal distribution becomes centered about the newly accepted sample. If the blue line is longer, the sample is randomly rejected or accepted.

But why randomly keep “bad” proposal samples? It turns out that doing this allows the Markov chain to every-so-often visit states of low probability under the target distribution. This is a desirable property if we want the chain to adequately sample the entire target distribution, including any tails.

An attractive property of the Metropolis algorithm is that the target distribution $p(x)$ does not have to be a properly normalized probability distribution. This is due to the fact that the acceptance probability is based on the ratio of two values of the target distribution. I’ll show you what I mean. If $p(x)$ is an unnormalized distribution and

$p^*(x) = \frac{p(x)}{Z}$

is a properly normalized probability distribution with normalizing constant $Z$, then

$p(x) = Zp^*(x)$

and a ratio like that used in calculating the acceptance probability $\alpha$ is

$\frac{p(a)}{p(b)} = \frac{Zp^*(a)}{Zp^*(b)} = \frac{p^*(a)}{p^*(b)}$

The normalizing constants $Z$ cancel! This attractive property is quite useful in the context of Bayesian methods, where determining the normalizing constant for a distribution may be impractical to calculate directly. This property is demonstrated in current example. It turns out that the “mystery” distribution that we sampled from using the Metropolis algorithm is an unnormalized form of the Student’s-t distribution with one degree of freedom. Comparing $p(x)$ to the definition of the definition Student’s-t

$Student(x,\nu) = \frac{\Gamma\left(\frac{\nu + 1}{2} \right)}{\sqrt{\nu\pi} \Gamma\left( \frac{\nu}{2}\right)}\left( 1+\frac{x^2}{\nu}\right)^{-\frac{\nu+1}{2}} = \frac{(1 + x^2)^{-1}}{Z} = \frac{p(x)}{Z}$

we see that $p(x)$ is a Student’s-t distribution with degrees of freedom $\nu=1$, but missing the normalizing constant

$Z = \left(\frac{\Gamma\left(\frac{\nu + 1}{2} \right)}{\sqrt{\nu\pi} \Gamma\left( \frac{\nu}{2}\right)}\right )^{-1}$

Below is additional output from the code above showing that the samples from Metropolis sampler draws samples that follow a normalized Student’s-t distribution, even though $p(x)$ is not normalized.

Metropolis samples from an unnormalized t-distribution follow the normalized distribution

The upper plot shows the progression of the Markov chain’s progression from state $x^{(0)}$ (top) to state $x^{(5000)}$ (bottom). The burn in period for this chain was chosen to be 500 transitions, and is indicated by the dashed blue line (for more on burnin see this previous post).

The bottom plot shows samples from the Markov chain in black (with burn in samples removed). The theoretical curve for the Student’s-t with one degree of freedom is overlayed in red. We see that the states kept by the Metropolis sampler transition operator sample from values that follow the Student’s-t, even though the function $p(x)$ used in the transition operator was not a properly normalized probability distribution.

## Reversibility of the transition operator

It turns out that there is a theoretical constraint on the Markov chain the transition operator in order for it settle into a stationary distribution (i.e. a target distribution we care about). The constraint states that the probability of the transition $x^{(t)} \to x^{(t+1)}$ must be equal to the probability of the reverse transition $x^{(t+1)} \to x^{(t)}$. This reversibility property is often referred to as detailed balance. Using the Metropolis algorithm transition operator, reversibility is assured if the proposal distribution $q(x|x^{(t-1)})$ is symmetric. Such symmetric proposal distributions are the Normal, Cauchy, Student’s-t, and Uniform distributions.

However, using a symmetric proposal distribution may not be reasonable to adequately or efficiently sample all possible target distributions. For instance if a target distribution is bounded on the positive numbers $0 < x \leq \infty$, we would like to use a proposal distribution that has the same support, and will thus be assymetric. This is where the Metropolis-Hastings sampling algorithm comes in. We will discuss in a later post how the Metropolis-Hastings sampler uses a simple change to the calculation of the acceptance probability which allows us to use non-symmetric proposal distributions.

## A Brief Introduction to Markov Chains

Markov chains are an essential component of Markov chain Monte Carlo (MCMC) techniques. Under MCMC, the Markov chain is used to sample from some target distribution. To get a better understanding of what a Markov chain is, and further, how it can be used to sample form a distribution, this post introduces and applies a few basic concepts.

A Markov chain is a stochastic process that operates sequentially (e.g. temporally), transitioning from one state to another within an allowed set of states.

$x^{(0)} \rightarrow x^{(1)} \rightarrow x^{(2)} \dots \rightarrow x^{(t)} \rightarrow \dots$

A Markov chain is defined by three elements:

1. A state space $x$, which is a set of values that the chain is allowed to take
2. A transition operator $p(x^{(t+1)} | x^{(t)})$ that defines  the probability of moving from state $x^{(t)}$ to $x^{(t+1)}$ .
3. An initial condition distribution $\pi^{(0)}$ which defines the probability of being in any one of the possible states at the initial iteration $t = 0$.

The Markov chain starts at some initial state, which is sampled from $\pi^{(0)}$, then transitions from one state to another according to the transition operator $p(x^{(t+1)} | x^{(t)})$.

A Markov chain is called memoryless if the next state only depends on the current state and not on any of the states previous to the current:

$p(x^{(t+1)}|x^{(t)},x^{(t-1)},...x^{(0)}) = p(x^{(t+1)}|x^{(t)})$

(This memoryless property is formally know as the Markov property).

If the transition operator for a Markov chain does not change across transitions, the Markov chain is called time homogenous.  A nice  property of time homogenous Markov chains is that as the chain runs for a long time and $t \rightarrow \infty$, the chain will reach an equilibrium that is called the chain’s stationary distribution:

$p(x^{(t+1)} | x^{(t)}) = p(x^{(t)} | x^{(t-1)})$

We’ll see later how the stationary distribution of a Markov chain is important for sampling from probability distributions, a technique that is at the heart of Markov Chain Monte Carlo (MCMC) methods.

## Finite state-space (time homogenous) Markov chain

If the state space of a Markov chain takes on a finite number of distinct values, and it is time homogenous, then the transition operator can be defined by a matrix $P$, where the entries of $P$ are:

$p_{ij} = p(X^{(t+1)} = j | x^{(t)} = i)$

This means that if the chain is currently in the $i$-th state, the transition operator assigns the probability of moving to the  $j$-th state by the entries of $i$-th row of $P$ (i.e.  each row of $P$ defines a conditional probability distribution on the state space). Let’s take a look at a finite state-space Markov chain in action with a simple example.

### Example: Predicting the weather with a finite state-space Markov chain

In Berkeley, CA, there are (literally) only 3 types of weather: sunny, foggy, or rainy (this is analogous to a state-space that takes on three discrete values). The weather patterns are very stable there, so a Berkeley weatherman can easily predict the weather next week based on the weather today with the following transition rules:

If it is sunny today, then

• it is highly likely that it will be sunny next week
• $p(X^{(week)}=sunny | X^{(today)}=sunny)=0.8$,
• it is very unlikely that it will be raining next week
• $p(X^{(week)}=rainy | X^{(today)}=sunny)=0.05$
• and somewhat likely that it will foggy next week
• $p(X^{(week)}=foggy | X^{(today)}=sunny)=0.15$

If it is foggy today then

• it is somewhat likely that it will be sunny next week
• $p(X^{(week)}=sunny | X^{(today)}=foggy)=0.4$
• but slightly less likely that it will be foggy next week
• $p(X^{(week)}=foggy | X^{(today)}=foggy)=0.5$,
• and fairly unlikely that it will be raining next week.
• $p(X^{(week)}=rainy | X^{(today)}=foggy)=0.1$,

If it is rainy today then

• it is unlikely that it will be sunny next week
• $p(X^{(week)}=sunny | X^{(today)}=rainy)=0.1$,
• it is somewhat likely that it will be foggy next week
• $p(X^{(week)}=foggy | X^{(today)}=rainy)=0.3$,
• and it is fairly likely that it will be rainy next week
• $p(X^{(week)}=rainy | X^{(today)}=rainy)=0.6$,

All of these transition rules can be instantiated in a single 3 x 3 transition operator matrix:

$P = \begin{bmatrix} 0.8 & 0.15 & 0.05\\ 0.4 & 0.5 & 0.1\\ 0.1& 0.3 & 0.6 \end{bmatrix}$

Where each row of $P$ corresponds to the weather at iteration $t$ (today), and each column corresponds to the weather the next week. Let’s say that it is rainy today, what is the probability it will be sunny next week, in two weeks, or in 6 months? We can answer these questions by running a Markov chain from the initial state of “rainy,” transitioning according to $P$. The following chunk of MATLAB code runs the Markov chain.

% FINITE STATE-SPACE MARKOV CHAIN EXAMPLE

% TRANSITION OPERATOR
%     S  F  R
%     U  O  A
%     N  G  I
%     N  G  N
%     Y  Y  Y
P = [.8 .15 .05;  % SUNNY
.4 .5  .1;   % FOGGY
.1 .3  .6];  % RAINY
nWeeks = 25

% INITIAL STATE IS RAINY
X(1,:) = [0 0 1];

% RUN MARKOV CHAIN
for iB = 2:nWeeks
X(iB,:) = X(iB-1,:)*P; % TRANSITION
end

% DISPLAY
figure; hold on
h(1) = plot(1:nWeeks,X(:,1),'r','Linewidth',2);
h(2) = plot(1:nWeeks,X(:,2),'k','Linewidth',2);
h(3) = plot(1:nWeeks,X(:,3),'b','Linewidth',2);
h(4) = plot([15 15],[0 1],'g--','Linewidth',2);
hold off
legend(h, {'Sunny','Foggy','Rainy','Burn In'});
xlabel('Week')
ylabel('p(Weather)')
xlim([1,nWeeks]);
ylim([0 1]);

% PREDICTIONS
fprintf('\np(weather) in 1 week -->'), disp(X(2,:))
fprintf('\np(weather) in 2 weeks -->'), disp(X(3,:))
fprintf('\np(weather) in 6 months -->'), disp(X(25,:))


Finite state-space Markov chain for predicting the weather

Here we see that at week 1 the probability of sunny weather is 0.1. The next week, the probability of sunny weather is 0.26, and in 6 months, there is a 60% chance that it will be sunny. Also note that after approximately 15 weeks the Markov chain has reached the equilibrium/stationary distribution and, chances are, the weather will be sunny. This 15-week period is what is known as the burn in period for the Markov chain, and is the number of transitions it takes the chain to move from the initial conditions to the stationary distribution.

A cool thing about finite state-space time-homogeneous Markov chain is that it is not necessary  to run the chain sequentially through all iterations in order to predict a state in the future. Instead we can predict by first raising the transition operator to the $T$-th power, where $T$ is the iteration at which we want to predict, then multiplying the result by the distribution over the initial state, $\pi^{(0)}$. For instance, to predict the probability of the weather in 2 weeks, knowing that it is rainy today (i.e. $\pi^{(0)} = [0,0,1]$):

$p(x^{(week2)}) = \pi^{(0)}P^2 = [0.26, 0.345, 0.395]$

and in six months:

$p(x^{(week24)}) = \pi^{(0)}P^{24} = [0.596, 0.263, 0.140]$

These are the same results we get by running the Markov chain sequentially through each number of transitions. Therefore we can calculate an approximation to the stationary distribution from $P$ by setting $T$ to a large number. It turns out that it is also possible to analytically derive the stationary distribution from $P$ (hint: think about the properties of eigenvectors).

## Continuous state-space Markov chains

A Markov chain can also have a continuous state space that exists in the real numbers $x \in \mathbb{R}^N$. In this case the transition operator cannot be instantiated simply as a matrix, but is instead some continuous function on the real numbers. Note that the continuous state-space Markov chain also has a burn in period and a stationary distribution. However, the stationary distribution will also be over a continuous set of variables. To get a better understanding of the workings of a continuous state-space Markov chain, let’s look at a simple example.

### Example: Sampling from a continuous distribution using continuous state-space Markov chains

We can use the stationary distribution of a continuous state-space Markov chain in order to sample from a continuous probability distribution: we  run a Markov chain for a sufficient amount of time so that it has reached its stationary distribution, then keep the states that the chain visits as samples from that stationary distribution.

In the following example we define a continuous state-space Markov chain. The transition operator is a Normal distribution with unit variance and a mean that is half the distance between zero and the previous state, and the distribution over initial conditions is a Normal distribution with zero mean and unit variance.

To ensure that the chain has moved sufficiently far from the initial conditions and that we are sampling  from the chain’s stationary distribution,  we will choose to throw away the first 50 burn in states of the chain. We can also run multiple chains simultaneously in order to sample the stationary distribution more densely. Here we choose to run 5 chains simultaneously.

% EXAMPLE OF CONTINUOUS STATE-SPACE MARKOV CHAIN

% INITIALIZE
randn('seed',12345)
nBurnin = 50; % # BURNIN
nChains = 5;  % # MARKOV CHAINS

% DEFINE TRANSITION OPERATOR
P = inline('normrnd(.5*x,1,1,nChains)','x','nChains');
nTransitions = 1000;
x = zeros(nTransitions,nChains);
x(1,:) = randn(1,nChains);

% RUN THE CHAINS
for iT = 2:nTransitions
x(iT,:) = P(x(iT-1),nChains);
end

% DISPLAY BURNIN
figure
subplot(221); plot(x(1:100,:)); hold on;
minn = min(x(:));
maxx = max(x(:));
l = line([nBurnin nBurnin],[minn maxx],'color','k','Linewidth',2);
ylim([minn maxx])
legend(l,'~Burn-in','Location','SouthEast')
title('First 100 Samples'); hold off

% DISPLAY ENTIRE MARKOV CHAIN
subplot(223); plot(x);hold on;
l = line([nBurnin nBurnin],[minn maxx],'color','k','Linewidth',2);
legend(l,'~Burn-in','Location','SouthEast')
title('Entire Chain');

% DISPLAY SAMPLES FROM STATIONARY DISTRIBUTION
samples = x(nBurnin+1:end,:);
subplot(122);
[counts,bins] = hist(samples(:),100); colormap hot
b = bar(bins,counts);
legend(b,sprintf('Markov Chain\nSamples'));
title(['\mu=',num2str(mean(samples(:))),' \sigma=',num2str(var(samples(:)))])


Sampling from the stationary distribution of a continuous state-space Markov chain

In the upper left panel of the code output we see a close up of the first 100 of the 1000 transitions made by the 5 simultaneous Markov chains; the burn in cutoff is marked by the black line. In the lower left panel we see the entire sequence of transitions for the Markov chains. In the right panel, we can tell from the sampled states that the stationary distribution for this chain is a Normal distribution, with mean equal to zero, and a variance equal to 1.3.

## Wrapping Up

In the previous example we were able to deduce the stationary distribution of the Markov chain by looking at the samples generated from the chain after the burn in period. However, in order to use Markov chains to sample from a specific target distribution, we have to design the transition operator such that the resulting chain reaches a stationary distribution that matches the target distribution. This is where MCMC methods like the Metropolis sampler, the Metropolis-Hastings sampler, and the Gibbs sampler come to rescue. We will discuss each of these Markov-chain-based sampling methods separately in later posts.

# Monte Carlo Approximation for Integration

Using statistical methods we often run into integrals that take the form:

$I = \int_a^b h(x)g(x) dx$

For instance, the expected value of a some function $f(x)$ of a random variable $X$

$\mathbb E[x] = \int_{p(x)} p(x) f(x) dx$

and many quantities essential for Bayesian methods such as the marginal likelihood a.k.a “model evidence”

$p(x) = \int_\theta p(x | \theta) p(\theta) dx$

involve integrals of this form.  Sometimes (not often) such an integral can be evaluated analytically. When a closed form solution does not exist, numeric integration methods can be applied. However numerical methods quickly become intractable for any practical application that requires more than a small number of dimensions. This is where Monte Carlo approximation comes in. Monte Carlo approximation allows us to calculate an estimate for the value of $I$ by transforming the  integration problem into a procedure of sampling values from a tractable probability distribution and calculating the average of those samples. Here’s what I mean:

If the function $g(x)$ fullfills two simple criteria, namely that the function is always positive on the interval $(a,b)$

$g(x) \geq 0, x \in (a,b)$

and that the integral of the function is finite

$\int_a^b g(x) = C < \infty$

then we can define a corresponding probability distribution on the interval $(a,b)$:

$p(x) = \frac{g(x)}{C}$

Another way to think of it is that $g(x)$ is a probability distribution scaled by a constant $C$.

Using this link between probability distributions $p(x)$ and $g(x)$, we can restate the original integration as

$I = C \int_a^b h(x) p(x)dx = C \mathbb E_{p(x)}[h(x)]$

This statement says that if we can sample values of $x$ using $p(x)$, then the value of  the original integral $I$ is simply a scaled version of the expected value of the integrand function $h(x)$ calculated using those samples. Turns out that the expected value $\mathbb E_{p(x)}[h(x)]$ can be easily approximated by the sample mean:

$\mathbb E_{p(x)} [h(x)] \approx \frac{1}{N} \sum_i^N h(x_i)$

where samples $x_i, i = 1..N$ are drawn independently from $p(x)$. This leads to a simple 4-Step Procedure for performing Monte Carlo approximation to the integral $I$:

1. Identify $h(x)$
2. Identify $g(x)$ and from it determine $p(x)$ and $C$
3. Draw $N$ independent samples from $p(x)$
4. Evaluate $I = C \mathbb E[h(x)] \approx \frac{C}{N} \sum_i^N h(x_i)$

The larger the number of samples $N$ we draw, the better our approximation to the actual value of $I$. This 4-step procedure is demonstrated in a some toy examples below:

### Example 1: Approximating the integral $xe^x$

Say we want to calculate the integral:

$I = \int_0^1 x e^x dx$

We can calculate the closed form solution of this integral using integration by parts:

$u = x, dv = e^x$

$du = dx, v = e^x$

and

$I = uv - \int v du$

$= xe^x - \int e^x dx$

$= \left. xe^x - e^x \right|_0^1$

$= \left. e^x(x-1)\right|_0^1$

$= 0 - (-1) = 1$

Orr…we could calculate the  Monte Carlo approximation of this integral.

Step 1 we identify

$h(x) = xe^x$

Step 2 we identify

$g(x) = 1$

and from this can also determine the probability distribution function $p(x) \in (a,b) = (0,1)$. According to the definition expression for $p(x)$ given above we detemine $p(x)$ to be:

$p(x) = \frac{g(x)}{\int_a^b g(x)dx} = \frac{1}{b-a}$.

Step 3: The expression on the right is the definition for the uniform distribution $Unif(0,1)$, which is easy to sample from using the MATLAB $\text{rand()}$ (Notice too that the constant $C = b-a = 1$).

Step 4: we calculate the Monte Carlo approximation as

$I = C \mathbb E_{p(x)} h(x)$

$\approx \frac{1}{N}\sum_{i=1}^N x_i e^{x_i}$,

where each $x_i$ is sampled from the standard uniform distribution. Below is some MATLAB code running the Monte Carlo Approximation for two different values of $N$

% MONTE CARLO APPROXIMATION OF INT(xexp(x))dx
% FOR TWO DIFFERENT SAMPLE SIZES
rand('seed',12345);

% THE FIRST APPROXIMATION USING N1 = 100 SAMPLES
N1 = 100;
x = rand(N1,1);
Ihat1 = sum(x.*exp(x))/N1

% A SECOND APPROXIMATION USING N2 = 5000 SAMPLES
N2 = 5000;
x = rand(N2,1);
Ihat2 = sum(x.*exp(x))/N2


Comparing the values of the variables $\text{Ihat1}$ and $\text{Ihat2}$ we see that the Monte Carlo approximation is better for a larger number of samples.

### Example 2: Approximating the expected value of the Beta distribution

Lets look at how the 4-step Monte Carlo approximation procedure can be used to calculate expectations. In this example we will calculate

$\mathbb E[x] = \int_{p(x)} p(x)x dx$,

where $x \sim p(x) = Beta(\alpha_1,\alpha_2)$.

Step 1: we identify $h(x) = x$.

Step 2: the function $g(x)$ is simply the probability density function $p(x)$ due the expression for $p(x)$ above:

$p(x)^\star = \frac{p(x)}{\int p(x)dx} = p(x)$.

Step 3 we can use MATLAB to easily draw $N$ independent samples $p(x)$ using the function $\text{betarnd()}$. And finally,

Step 4 we approximate the expectation with the expression

$\mathbb E[x]_{Beta(\alpha_1,\alpha_2)} \approx \frac{1}{N} \sum_i x_i$

Below is some MATLAB code that performs this approximation of the expected value.

rand('seed',12345);
alpha1 = 2; alpha2 = 10;
N = 10000;
x = betarnd(alpha1,alpha2,1,N);

% MONTE CARLO EXPECTATION
expectMC = mean(x);

% ANALYTIC EXPRESSION FOR BETA MEAN
expectAnalytic = alpha1/(alpha1 + alpha2);

% DISPLAY
figure;
bins = linspace(0,1,50);
counts = histc(x,bins);
probSampled = counts/sum(counts);
probTheory = betapdf(bins,alpha1,alpha2);
b = bar(bins,probSampled); colormap hot; hold on;
t = plot(bins,probTheory/sum(probTheory),'r','Linewidth',2)
m = plot([expectMC,expectMC],[0 .1],'g')
e = plot([expectAnalytic,expectAnalytic],[0 .1],'b')
xlim([expectAnalytic - .05,expectAnalytic + 0.05])
legend([b,t,m,e],{'Samples','Theory','$\hat{E}$','$E_{Theory}$'},'interpreter','latex');
title(['E_{MC} = ',num2str(expectMC),'; E_{Theory} = ',num2str(expectAnalytic)])
hold off


And the output of the code:

Monte Carlo Approximation of the the Expected value of Beta(2,10)

The analytical solution for the expected value of this Beta distribution:

$\mathbb E_{Beta(2,10)}[x] = \frac{\alpha_1}{\alpha_1 + \alpha_2} = \frac{2}{12} = 0.167$

is quite close to our approximation (also indicated by the small distance between the blue and green lines on the plot above).

## Monte Carlo Approximation for Optimization

Monte Carlo Approximation can also be used to solve optimization problems of the form:

$\hat x = \underset{x \in (a,b)}{\text{argmax }} g(x)$

If $g(x)$ fulfills the same criteria described above (namely that it is a scaled version of a probability distribution), then (as above) we can define the probability function

$p(x) = \frac{g(x)}{C}$

This allows us to instead solve the problem

$\hat x = \underset{x}{\text{argmax }}C p(x)$

If we can sample from $p(x)$, the solution $\hat x$ is easily found by drawing samples from $p(x)$ and determining the location of the samples that has the highest density (Note that the solution is not dependent of the value of $C$). The following example demonstrates Monte Carlo optimization:

### Example: Monte Carlo Optimization of $g(x) = e^{-\frac{(x-4)^2}{2}}$

Say we would like to find the value of $x_{opt}$ which optimizes the function $g(x) = e^{-((x-4)^2)/2}$. In other words we want to solve the problem

$x_{opt} = \underset{x}{\text{argmax }} e^{-\frac{(x-4)^2}{2}}$

We could solve for $x_{opt}$ using standard calculus methods, but a clever trick is to use Monte Carlo approximation to solve the problem. First, notice that $g(x)$  is  a scaled version of a Normal distribution with mean equal to 4 and unit variance:

$g(x) = C \times \frac{1}{\sqrt{2\pi}} e^{-\frac{(x-4)^2}{2}}$

$= C \times \mathcal N(4,1)$

where $C = \sqrt{2\pi}$. This means we can solve for $x_{opt}$ by drawing samples from the normal distribution and determining where those samples have the highest density. The following chunk of matlab code solves the optimization problem in this way.

% MONTE CARLO OPTIMIZATION OF exp(x-4)^2
randn('seed',12345)

% INITIALZIE
N = 100000;
x = 0:.1:6;
C = sqrt(2*pi);
g = inline('exp(-.5*(x-4).^2)','x');
ph = plot(x,g(x)/C,'r','Linewidth',3); hold on
gh = plot(x,g(x),'b','Linewidth',2); hold on;
y = normpdf(x,4,1);

% CALCULATE MONTE CARLO APPROXIMATION
x = normrnd(4,1,1,N);
bins = linspace(min(x),max(x),100);
counts = histc(x,bins);
[~,optIdx] = max(counts);
xHat = bins(optIdx);

% OPTIMA AND ESTIMATED OPTIMA
oh = plot([4 4],[0,1],'k');
hh = plot([xHat,xHat],[0,1],'g');

set(gca,'fontsize',16)
legend([gh,ph,oh,hh],{'g(x)','$p(x)=\frac{g(x)}{C}$','$x_{opt}$','$\hat{x}$'},'interpreter','latex','Location','Northwest');


Monte Carlo Optimization

In the code output above we see the function $g(x)$ we want to optimize in blue and the Normal distribution $p(x)$ from which we draw samples in red. The Monte Carlo method provides a good approximation (green) to the real solution (black).

## Wrapping Up

In the toy examples above it was easy to sample from $p(x)$. However, for practical problems the distributions we want to sample from are often complex and operate in many dimensions. For these problems more clever sampling methods have to be used. Such sampling methods include Inverse Transform Sampling, Rejection Sampling, Importance Sampling, and Markov Chain Monte Carlo methods such as the Metropolis Hasting algorithm and the Gibbs sampler, each of which I plan to cover in separate posts.

## Sampling From the Normal Distribution Using the Box-Muller Transform

The Normal Distribution is the workhorse of many common statistical analyses and being able to draw samples from this distribution lies at the heart of many statistical/machine learning algorithms. There have been a number of methods developed to sample from the Normal distribution including Inverse Transform Sampling, the Ziggurat Algorithm, and the Ratio Method (a rejection sampling technique). In this post we will focus on an elegant method called the Box-Muller transform.

## A quick review of Cartesian and polar coordinates.

Before we can talk about using the Box-Muller transform, let’s refresh our understanding of the relationship between Cartesian and polar coordinates. You may remember from geometry that if x and y are two points in the Cartesian plane they can be represented in polar coordinates with a radius $r$ and an angle $\theta$ using the following relationships:

$r^2 = x^2 + y^2$

$\tan(\theta) = \frac{y}{x}$, and therefore

$x = r\cos(\theta)$

$y = r\sin(\theta)$

Notice that if $r \leq 1$ and $\theta \in [0 , 2\pi]$, then we map out values contained in the unit circle, as shown in the figure below. Also note that random variables in such a circle can be generated by transforming values sampled from the uniform distribution. Specifically, radii can be sampled from $r \sim Unif(0,1)$ and angle can be sampled from $\theta \sim 2\pi \times Unif(0,1)$. A similar mechanism (i.e. drawing points in a circle using uniform variables) is at the heart of the Box-Muller transform for sampling Normal random variables.

Example of the relationship between Cartesian and polar coordinates

## Drawing Normally-distributed samples with the Box-Muller transform

Ok, now that we’ve discussed how Cartesian coordinates are represented in polar coordinates, let’s move on to how we can use this relationship to generate random variables. Box-Muller sampling is based on representing the joint distribution of two independent standard Normal random Cartesian variables $X$ and $Y$

$X \sim N(0,1)$

$Y \sim N(0,1)$

in polar coordinates. The joint distribution $p(x,y)$ (which is circular-symmetric) is:

$p(x,y) = p(x)p(y) = \frac{1}{\sqrt{2\pi}}e^{-\frac{x^2}{2}}\frac{1}{\sqrt{2\pi}}e^{-\frac{y^2}{2}}$

$= \frac{1}{2\pi}e^{-\frac{x^2 + y^2}{2}}$

If we notice that the $x^2 + y^2$ term in the numerator of the exponent is equal to $r^2$ (as above) we can make the connection between the the Cartesian representation of the joint Normal distribution and its polar representation:

$p(x,y) = \left ( \frac{1}{2\pi} \right ) \left ( e^{\frac{-r^2}{2}} \right )$

which is the product of two density functions, an exponential distribution over squared radii:

$r^2 \sim Exp(\frac{1}{2})$

and a uniform distribution over angles:

$\theta \sim Unif(0,2\pi)$

just like those mentioned above when generating points on the unit circle. Now, if we make another connection between the exponential distribution and the uniform distribution, namely that:

$Exp(\lambda) = \frac{-\log (Unif(0,1))}{\lambda}$

then $r \sim \sqrt{-2\log (Unif(0,1))}$

This gives us a way to generate points from the joint Gaussian distribution by sampling from two independent uniform distributions, one for $r$ and another for $\theta$, and transforming them into Cartesian coordinates via the relationships above. In detail, the procedure goes as follows:

1. Draw, $u_1,u_2 \sim Unif(0,1)$
2. Transform the variables into radius and angle representation $r = \sqrt{-2\log(u_1)}$, and $\theta = 2\pi u_2$
3. Transform radius and angle into Cartesian coordinates: $x = r \cos(\theta), y = r \sin(\theta)$

What results are two independent Normal random variables, $X$ and $Y$. A MATLAB implementation of the Box-Muller algorithm is shown below:

% NORMAL SAMPLES USING BOX-MUELLER METHOD
% DRAW SAMPLES FROM PROPOSAL DISTRIBUTION
u = rand(2,100000);
r = sqrt(-2*log(u(1,:)));
theta = 2*pi*u(2,:);
x = r.*cos(theta);
y = r.*sin(theta);

% DISPLAY BOX-MULLER SAMPLES
figure
% X SAMPLES
subplot(121);
hist(x,100);
colormap hot;axis square
title(sprintf('Box-Muller Samples Y\n Mean = %1.2f\n Variance = %1.2f\n Kurtosis = %1.2f',mean(x),var(x),3-kurtosis(x)))
xlim([-6 6])

% Y SAMPLES
subplot(122);
hist(y,100);
colormap hot;axis square
title(sprintf('Box-Muller Samples X\n Mean = %1.2f\n Variance = %1.2f\n Kurtosis = %1.2f',mean(y),var(y),3-kurtosis(y)))
xlim([-6 6])


Box-Muller Samples for Normal Distribution

## Wrapping Up

The output of the MATLAB code is shown above. Notice the first, second, and fourth central moments (mean, variance, and kurtosis)  of the generated samples are consistent with the standard normal. The Box-Muller transform is another example of of how uniform variables on the interval (0,1) and can be  transformed in order to sample from a more complicated distribution.

## Rejection Sampling

Suppose that we want to sample from a distribution $f(x)$ that is difficult or impossible to sample from directly, but instead have a simpler distribution $q(x)$ from which sampling is easy.  The idea behind Rejection sampling (aka Acceptance-rejection sampling) is to sample from $q(x)$ and apply some rejection/acceptance criterion such that the samples that are accepted are distributed according to $f(x)$.

## Envelope distribution and rejection criterion

In order to be able to reject samples from $q(x)$ such that they are sampled from $f(x)$, $q(x)$ must “cover” or envelop the distribution $f(x)$. This is generally done by choosing a constant $c > 1$ such that  $cq(x) > f(x)$ for all $x$. For this reason $cq(x)$ is often called the envelope distribution. A common criterion for accepting samples from $x \sim q(x)$ is based on the ratio of the target distribution to that of the envelope distribution. The samples are accepted if

$\frac{f(x)}{cq(x)} > u$

where $u \sim Unif(0,1)$, and rejected otherwise. If the ratio is close to one, then $f(x)$ must have a large amount of probability mass around $x$ and that sample should  be more likely accepted. If the ratio is small, then it means that $f(x)$ has low probability mass around $x$ and we should be less likely to accept the sample. This criterion is demonstrated in the chunk of MATLAB code and the resulting figure below:

rand('seed',12345);
x = -10:.1:10;
% CREATE A "COMPLEX DISTRIBUTION" f(x) AS A MIXTURE OF TWO NORMAL
% DISTRIBUTIONS
f = inline('normpdf(x,3,2) + normpdf(x,-5,1)','x');
t = plot(x,f(x),'b','linewidth',2); hold on;

% PROPOSAL IS A CENTERED NORMAL DISTRIBUTION
q = inline('normpdf(x,0,4)','x');

% DETERMINE SCALING CONSTANT
c = max(f(x)./q(x))

%PLOT SCALED PROPOSAL/ENVELOP DISTRIBUTION
p = plot(x,c*q(x),'k--');

% DRAW A SAMPLE FROM q(x);
qx = normrnd(0,4);
fx = f(qx);

% PLOT THE RATIO OF f(q(x)) to cq(x)
a = plot([qx,qx],[0 fx],'g','Linewidth',2);
r = plot([qx,qx],[fx,c*q(qx)],'r','Linewidth',2);
legend([t,p,a,r],{'Target','Proposal','Accept','Reject'});
xlabel('x');


Rejection Sampling with a Normal proposal distribution

Here a zero-mean Normal distribution is used as the proposal distribution. This distribution is scaled by a factor $c = 9.2$, determined from $f(x)$ and $q(x)$ to ensure that the proposal distribution covers $f(x)$. We then sample from $q(x)$, and compare the proportion of $cq(x)$ occupied by $f(x)$. If we compare this proportion to a random number sampled  from $Unif(0,1)$ (i.e. the criterion outlined above), then we would accept this sample with probability proportional to the length of the green line segment and reject the sample with probability proportional to the length of the red line segment.

## Rejection sampling of a random discrete distribution

This next example shows how rejection sampling can be used to sample from any arbitrary distribution, continuous or not, and with or without an analytic probability density function.

Random Discrete Target Distribution and Proposal that Bounds It.

The figure above shows a random discrete probability density function $f(x)$ generated on the interval (0,15). We will use rejection sampling as described above to sample from $f(x)$. Our proposal/envelope distribution is the uniform discrete distribution on the same interval (i.e. any of the integers from 1-15 are equally probable) multiplied by a constant $c$ that is determined such that the maximum value of $f(x)$ lies under (or equal to) $cq(x)$.

Rejection Samples For Discrete Distribution on interval [1 15]

Plotted above is the target distribution (in red) along with the discrete samples obtained using the rejection sampling. The MATLAB code used to sample from the target distribution and display the plot above is here:

rand('seed',12345)
randn('seed',12345)

fLength = 15;
% CREATE A RANDOM DISTRIBUTION ON THE INTERVAL [1 fLength]
f = rand(1,fLength); f = f/sum(f);

figure; h = plot(f,'r','Linewidth',2);
hold on;
l = plot([1 fLength],[max(f) max(f)],'k','Linewidth',2);

legend([h,l],{'f(x)','q(x)'},'Location','Southwest');
xlim([0 fLength + 1])
xlabel('x');
ylabel('p(x)');
title('Target (f(x)) and Proposal (q(x)) Distributions');

% OUR PROPOSAL IS THE DISCRETE UNIFORM ON THE INTERVAL [1 fLength]
% SO OUR CONSTANT IS
c = max(f/(1/fLength));

nSamples = 10000;
i = 1;
while i < nSamples
proposal = unidrnd(fLength);
q = c*1/fLength; % ENVELOPE DISTRIBUTION
if rand < f(proposal)/q
samps(i) = proposal;
i = i + 1;
end
end

% DISPLAY THE SAMPLES AND COMPARE TO THE TARGET DISTRIBUTION
bins = 1:fLength;
counts = histc(samps,bins);
figure
b = bar(1:fLength,counts/sum(counts),'FaceColor',[.8 .8 .8])
hold on;
h = plot(f,'r','Linewidth',2)
legend([h,b],{'f(x)','samples'});
xlabel('x'); ylabel('p(x)');
xlim([0 fLength + 1]);


## Rejection sampling from the unit circle to estimate $\pi$

Though the ratio-based acceptance-rejection criterion introduced above is a common choice for drawing samples from complex distributions, it is not the only criterion we could use. For instance we could use a different set of criteria to generate some geometrically-bounded distribution. If we wanted to generate points uniformly within the unit circle (i.e. a circle centered at $(y,x) = 0$ and with radius $r = 1$), we could do so by sampling Cartesian spatial coordinates $x$ and $y$ uniformly from the interval (-1,1)–which samples form a square centered at (0,0)–and reject those points that lie outside of the radius $r = \sqrt{x^2 + y^2} = 1$

Unit Circle Inscribed in Square

Something clever that we can do with such a set of samples is to approximate the value $\pi$: Because a square that inscribes the unit circle has area:

$A_{square} = (2r)^2 = 4r^2$

and the unit circle has the area:

$A_{circle} = \pi r^2$

We can use the ratio of their areas to approximate $\pi$:

$\pi = 4\frac{A_{circle}}{A_{square}}$

The figure below shows the rejection sampling process and the resulting estimate of $\pi$ from the samples. One-hundred thousand 2D points are sampled uniformly from the interval (-1,1).  Those points that lie within the unit circle are plotted as blue dots. Those points that lie outside of the unit circle are plotted as red x’s. If we take four times the ratio of the area in blue to the entire area, we get a very close approximation to 3.14 for $\pi$.

Rejection Criterion

The MATLAB code used to generate the example figures is below:

% DISPLAY A CIRCLE INSCRIBED IN A SQUARE

figure;
a = 0:.01:2*pi;
x = cos(a); y = sin(a);
hold on
plot(x,y,'k','Linewidth',2)

t = text(0.5, 0.05,'r');
l = line([0 1],[0 0],'Linewidth',2);
axis equal
box on
xlim([-1 1])
ylim([-1 1])
title('Unit Circle Inscribed in a Square')

pause;
rand('seed',12345)
randn('seed',12345)
delete(l); delete(t);

% DRAW SAMPLES FROM PROPOSAL DISTRIBUTION
samples = 2*rand(2,100000) - 1;

% REJECTION
reject = sum(samples.^2) > 1;

% DISPLAY REJECTION CRITERION
scatter(samples(1,~reject),samples(2,~reject),'b.')
scatter(samples(1,reject),samples(2,reject),'rx')
hold off
xlim([-1 1])
ylim([-1 1])

piHat = mean(sum(samples.*samples)<1)*4;

title(['Estimate of \pi = ',num2str(piHat)]);


## Wrapping Up

Rejection sampling is a simple way to generate samples from complex distributions. However, Rejection sampling also has a number of weaknesses:

• Finding a proposal distribution that can cover the support of the target distribution is a non-trivial task.
• Additionally, as the dimensionality of the target distribution increases, the proportion of points that are rejected also increases. This curse of dimensionality makes rejection sampling an inefficient technique for sampling multi-dimensional distributions, as the majority of the points proposed are not accepted as valid samples.
• Some of these problems are solved by changing the form of the proposal distribution to “hug” the target distribution as we gain knowledge of the target from observing accepted samples. Such a process is called Adaptive Rejection Sampling, which will be covered in another post.

## Inverse Transform Sampling

There are a number of sampling methods used in machine learning, each of which has various strengths and/or weaknesses depending on the nature of the sampling task at hand. One simple method for generating samples from distributions with closed-form descriptions is Inverse Transform (IT) Sampling.

The idea behind IT Sampling is that the probability mass for a random variable $X$ distributed according to the probability density function $f(x)$ integrates to one and therefore the cumulative distribution function $C(x)$ can be used to map from values in the interval $(0,1)$ (i.e. probabilities) to the domain of $f(x)$. Because it is easy to sample values $z$ uniformly from the interval $(0,1)$, we can use the inverse of the CDF $C(x)^{-1}$ to transform these sampled probabilities into samples $x$. The code below demonstrates this process at work in order to sample from a student’s t distribution with 10 degrees of freedom.

rand('seed',12345)

% DEGREES OF FREEDOM
dF = 10;
x = -3:.1:3;
Cx = cdf('t',x,dF)
z = rand;

% COMPARE VALUES OF
zIdx = min(find(Cx>z));

% DRAW SAMPLE
sample = x(zIdx);

% DISPLAY
figure; hold on
plot(x,Cx,'k','Linewidth',2);
plot([x(1),x(zIdx)],[Cx(zIdx),Cx(zIdx)],'r','LineWidth',2);
plot([x(zIdx),x(zIdx)],[Cx(zIdx),0],'b','LineWidth',2);
plot(x(zIdx),z,'ko','LineWidth',2);
text(x(1)+.1,z + .05,'z','Color','r')
text(x(zIdx)+.05,.05,'x_{sampled}','Color','b')
ylabel('C(x)')
xlabel('x')
hold off


IT Sampling from student’s-t(10)

However, the scheme used to create to plot above is inefficient in that one must compare current values of $z$ with the $C(x)$ for all values of $x$. A much more efficient method is to evaluate $C^{-1}$ directly:

1. Derive $C^{-1}(x)$ (or a good approximation) from $f(x)$
2. for $i = 1:n$
• – draw $z_i$ from $Unif(0,1)$
• $x_i = CDF^{-1}(z_i)$
• – end for

The IT sampling process is demonstrated in the next chunk of code to sample from the Beta distribution, a distribution for which $C^{-1}$  is easy to approximate using Netwon’s method (which we let MATLAB do for us within the function icdf.m)

rand('seed',12345)
nSamples = 1000;

% BETA PARAMETERS
alpha = 2; beta = 10;

% DRAW PROPOSAL SAMPLES
z = rand(1,nSamples);

% EVALUATE PROPOSAL SAMPLES AT INVERSE CDF
samples = icdf('beta',z,alpha,beta);
bins = linspace(0,1,50);
counts = histc(samples,bins);
probSampled = counts/sum(counts)
probTheory = betapdf(bins,alpha,beta);

% DISPLAY
b = bar(bins,probSampled,'FaceColor',[.9 .9 .9]);
hold on;
t = plot(bins,probTheory/sum(probTheory),'r','LineWidth',2);
xlim([0 1])
xlabel('x')
ylabel('p(x)')
legend([t,b],{'Theory','IT Samples'})
hold off


Inverse Transform Sampling of Beta(2,10)

## Wrapping Up

The IT sampling method is generally only used for univariate distributions where $C^{-1}$ can be computed in closed form, or approximated. However, it is a nice example of how uniform random variables can be used to sample from much more complicated distributions.