# Blog Archives

## A Gentle Introduction to Markov Chain Monte Carlo (MCMC)

Applying probabilistic models to data usually involves integrating a complex, multi-dimensional probability distribution. For example, calculating the expectation/mean of a model distribution involves such an integration. Many (most) times, these integrals are not calculable due to the high dimensionality of the distribution or because there is no closed-form expression for the integral available using calculus. Markov Chain Monte Carlo (MCMC) is a method that allows one to approximate complex integrals using stochastic sampling routines. As MCMC’s name indicates, the method is composed of two components, the Markov chain and Monte Carlo integration.

Monte Carlo integration is a powerful technique that exploits stochastic sampling of the distribution in question in order to approximate the difficult integration. However, in order to use Monte Carlo integration it is necessary to be able to sample from the probability distribution in question, which may be difficult or impossible to do directly. This is where the second component of MCMC, the Markov chain, comes in. A Markov chain is a sequential model that transitions from one state to another in a probabilistic fashion, where the next state that the chain takes is conditioned on the previous state. Markov chains are useful in that if they are constructed properly, and allowed to run for a long time, the states that a chain will take also sample from a target probability distribution. Therefore we can construct Markov chains to sample from the distribution whose integral we would like to approximate, then use Monte Carlo integration to perform the approximation.

Here I introduce a series of posts where I describe the basic concepts underlying MCMC, starting off by describing Monte Carlo Integration, then giving a brief introduction of Markov chains and how they can be constructed to sample from a target probability distribution. Given these foundation principles, we can then discuss MCMC techniques such as the Metropolis and Metropolis-Hastings algorithms, the Gibbs sampler, and the Hybrid Monte Carlo algorithm.

As always, each post has a somewhat formal/mathematical introduction, along with an example and simple Matlab implementations of the associated algorithms.

## MCMC: Hamiltonian Monte Carlo (a.k.a. Hybrid Monte Carlo)

The random-walk behavior of many Markov Chain Monte Carlo (MCMC) algorithms makes Markov chain convergence to a target stationary distribution $p(x)$ inefficient, resulting in slow mixing. Hamiltonian/Hybrid Monte Carlo (HMC), is a MCMC method that adopts physical system dynamics rather than a probability distribution to propose future states in the Markov chain. This allows the Markov chain to explore the target distribution much more efficiently, resulting in faster convergence. Here we introduce basic analytic and numerical concepts for simulation of Hamiltonian dynamics. We then show how Hamiltonian dynamics can be used as the Markov chain proposal function for an MCMC sampling algorithm (HMC).

## First off, a brief physics lesson in Hamiltonian dynamics

Before we can develop Hamiltonian Monte Carlo, we need to become familiar with the concept of Hamiltonian dynamics. Hamiltonian dynamics is one way that physicists describe how objects move throughout a system. Hamiltonian dynamics describe an object’s motion in terms of its location $\bold x$ and momentum $\bold p$ (equivalent to the object’s mass times its velocity) at some time $t$. For each location the object takes, there is an associated potential energy $U(\bold x)$, and for each momentum there is an associated kinetic energy $K(\bold p)$. The total energy of the system is constant and known as the Hamiltonian $H(\bold x, \bold p)$, defined simply as the sum of the potential and kinetic energies:

$H(\bold x,\bold p) = U(\bold x) + K(\bold p)$

Hamiltonian dynamics describe how kinetic energy is converted to potential energy (and vice versa) as an object moves throughout a system in time. This description is implemented quantitatively via a set of differential equations known as the Hamiltonian equations:

$\frac{\partial x_i}{\partial t} = \frac{\partial H}{\partial p_i} = \frac{\partial K(\bold p)}{\partial p_i}$
$\frac{\partial p_i}{\partial t} = -\frac{\partial H}{\partial x_i} = - \frac{\partial U(\bold x)}{\partial x_i}$

Therefore, if we have expressions for $\frac{\partial U(\bold x)}{\partial x_i}$ and $\frac{\partial K(\bold p)}{\partial p_i}$ and a set of initial conditions (i.e. an initial position $\bold x_0$ and initial momentum $\bold p_0$ at time $t_0$), it is possible to predict the location and momentum of an object at any point in time $t = t_0 + T$ by simulating these dynamics for a duration $T$

## Simulating Hamiltonian dynamics — the Leap Frog Method

The Hamiltonian equations describe an object’s motion in time, which is a continuous variable. In order to simulate Hamiltonian dynamics numerically on a computer, it is necessary to approximate the Hamiltonian equations by discretizing  time. This is done by splitting the interval $T$ up into a series of smaller intervals of length $\delta$. The smaller the value of $\delta$ the closer the approximation is to the dynamics in continuous time. There are a number of procedures that have been developed for discretizing time including Euler’s method and the Leap Frog Method, which I will introduce briefly in the context of Hamiltonian dynamics. The Leap Frog method updates the momentum and position variables sequentially, starting by simulating the momentum dynamics over a small interval of time $\delta /2$, then simulating the position dynamics over a slightly longer interval in time $\delta$, then completing the momentum simulation over another small interval of time $\delta /2$ so that $\bold x$ and $\bold p$ now exist at the same point in time. Specifically, the Leap Frog method is as follows: 1. Take a half step in time to update the momentum variable:

$p_i(t + \delta/2) = p_i(t) - (\delta /2)\frac{\partial U}{\partial x_i(t)}$

2. Take a full step in time to update the position variable

$x_i(t + \delta) = x_i(t) + \delta \frac{\partial K}{\partial p_i(t + \delta/2)}$

3. Take the remaining half step in time to finish updating the momentum variable

$p_i(t + \delta) = p_i(t + \delta/2) - (\delta/2) \frac{\partial U}{\partial x_i(t+\delta)}$

The Leap Fog method can be run for $L$ steps to simulate dynamics over $L \times \delta$ units of time. This particular discretization method has a number of properties that make it preferable to other approximation methods like Euler’s method, particularly for use in MCMC, but discussion of these properties are beyond the scope of this post. Let’s see how we can use the Leap Frog method to simulate Hamiltonian dynamics in a simple 1D example.

### Example 1: Simulating Hamiltonian dynamics of an harmonic oscillator

Imagine a ball with mass equal to one is attached to a horizontally-oriented spring. The spring exerts a force on the ball equal to

$F = -kx$

which works to restore the ball’s position to the equilibrium position of the spring  at $x = 0$. Let’s assume that the spring constant $k$, which defines the strength of the restoring force is also equal to one. If the ball is displaced by some distance $x$ from equilibrium, then the potential energy is

$U(x) = \int F dx = \int -x dx = \frac{x^2}{2}$

In addition, the kinetic energy an object with mass $m$ moving with velocity $v$ within a linear system is known to be

$K(v) = \frac{(mv)^2}{2m} = \frac{v^2}{2} = \frac{p^2}{2} = K(p)$,

if the object’s mass is equal to one, like the ball this example. Notice that we now have in hand the expressions for both $U(x)$ and $K(p)$. In order to simulate the Hamiltonian dynamics of the system using the Leap Frog method, we also need expressions for the partial derivatives of each variable (in this 1D example there are only one for each variable):

$\frac{\partial U(x)}{\partial x} =x$ $\frac{\partial K(p)}{\partial p} = p$

Therefore one iteration the Leap Frog algorithm for simulating Hamiltonian dynamics in this system is:

1.  $p(t + \delta/2) = p(t) - (\delta/2)x(t)$
2. $x(t + \delta) = x(t) + (\delta) p(t + \delta/2)$ 3. $p(t + \delta) = p(t + \delta /2) - (\delta/2)x(t + \delta)$

We simulate the dynamics of the spring-mass system described using the Leap Frog method in Matlab below (if the graph is not animated, try clicking on it to open up the linked .gif). The left bar in the bottom left subpanel of the simulation output demonstrates the trade-off between potential and kinetic energy described by Hamiltonian dynamics. The cyan portion of the bar is the proportion of the Hamiltonian contributed by  the potential energy $U(x)$, and the yellow portion  represents is the contribution of the kinetic energy $K(p)$. The right bar (in all yellow), is the total value of the Hamiltonian $H(x,p)$. Here we see that the ball oscillates about the equilibrium position of the spring with a constant period/frequency.  As the ball passes the equilibrium position $x= 0$, it has a minimum potential energy and maximum kinetic energy. At the extremes of the ball’s trajectory, the potential energy is at a maximum, while the kinetic energy is minimized. The  procession of momentum and position map out positions in what is referred to as phase space, which is displayed in the bottom right subpanel of the output. The harmonic oscillator maps out an ellipse in phase space. The size of the ellipse depends on the energy of the system defined by initial conditions.

Simple example of Hamiltonian Dynamics: 1D Harmonic Oscillator (Click to see animated)

You may also notice that the value of the Hamiltonian $H$ is not a exactly constant in the simulation, but oscillates slightly. This is an artifact known as energy drift due to approximations used to the discretize time.

% EXAMPLE 1: SIMULATING HAMILTONIAN DYNAMICS
%            OF HARMONIC OSCILLATOR
% STEP SIZE
delta = 0.1;

% # LEAP FROG
L = 70;

% DEFINE KINETIC ENERGY FUNCTION
K = inline('p^2/2','p');

% DEFINE POTENTIAL ENERGY FUNCTION FOR SPRING (K =1)
U = inline('1/2*x^2','x');

% DEFINE GRADIENT OF POTENTIAL ENERGY
dU = inline('x','x');

% INITIAL CONDITIONS
x0 = -4; % POSTIION
p0 = 1;  % MOMENTUM
figure

%% SIMULATE HAMILTONIAN DYNAMICS WITH LEAPFROG METHOD
% FIRST HALF STEP FOR MOMENTUM
pStep = p0 - delta/2*dU(x0)';

% FIRST FULL STEP FOR POSITION/SAMPLE
xStep = x0 + delta*pStep;

% FULL STEPS
for jL = 1:L-1
% UPDATE MOMENTUM
pStep = pStep - delta*dU(xStep);

% UPDATE POSITION
xStep = xStep + delta*pStep;

% UPDATE DISPLAYS
subplot(211), cla
hold on;
xx = linspace(-6,xStep,1000);
plot(xx,sin(6*linspace(0,2*pi,1000)),'k-');
plot(xStep+.5,0,'bo','Linewidth',20)
xlim([-6 6]);ylim([-1 1])
hold off;
title('Harmonic Oscillator')
subplot(223), cla
b = bar([U(xStep),K(pStep);0,U(xStep)+K(pStep)],'stacked');
set(gca,'xTickLabel',{'U+K','H'})
ylim([0 10]);
title('Energy')
subplot(224);
plot(xStep,pStep,'ko','Linewidth',20);
xlim([-6 6]); ylim([-6 6]); axis square
xlabel('x'); ylabel('p');
title('Phase Space')
pause(.1)
end
% (LAST HALF STEP FOR MOMENTUM)
pStep = pStep - delta/2*dU(xStep);


## Hamiltonian dynamics and the target distribution $p(\bold x)$

Now that we have a better understanding of what Hamiltonian dynamics are and how they can be simulated, let’s now discuss how we can use Hamiltonian dynamics for MCMC. The main idea behind Hamiltonian/Hibrid Monte Carlo is to develop a Hamiltonian function $H(\bold x, \bold p)$ such that the resulting Hamiltonian dynamics allow us to efficiently explore some target distribution $p(\bold x)$. How can we choose such a Hamiltonian function? It turns out it is pretty simple to relate a $H(\bold x, \bold p)$ to $p(\bold x)$ using a basic concept adopted from statistical mechanics known as the canonical distribution. For any energy function $E(\bf\theta)$ over a set of variables $\theta$, we can define the corresponding canonical distribution as: $p(\theta) = \frac{1}{Z}e^{-E(\bf\theta)}$ where we simply take the exponential of the negative of the energy function. The variable $Z$ is a normalizing constant called the partition function that scales the canonical distribution such that is sums to one, creating a valid probability distribution. Don’t worry about $Z$, it isn’t really important because, as you may recall from an earlier post, MCMC methods can sample from unscaled probability distributions. Now, as we saw above, the energy function for Hamiltonian dynamics is a combination of potential and kinetic energies: $E(\theta) = H(\bold x,\bold p) = U(\bold x) + K(\bold p)$

Therefore the canoncial distribution for the Hamiltonian dynamics energy function is

$p(\bold x,\bold p) \propto e^{-H(\bold x,\bold p)} \\ = e^{-[U(\bold x) - K(\bold p)]} \\ = e^{-U(\bold x)}e^{-K(\bold p)} \\ \propto p(\bold x)p(\bold p)$

Here we see that joint (canonical) distribution for $\bold x$ and $\bold p$ factorizes. This means that the two variables are independent, and the canoncial distribution $p(\bold x)$ is independent of the analogous distribution for the momentum. Therefore, as we’ll see shortly, we can use Hamiltonian dynamics to sample from the joint canonical distribution over $\bold p$ and $\bold x$ and simply ignore the momentum contributions. Note that this is an example of introducing auxiliary variables to facilitate the Markov chain path. Introducing the auxiliary variable $\bold p$ allows us to use Hamiltonian dynamics, which are unavailable without them. Because the canonical distribution for $\bold x$ is independent of the canonical distribution for $\bold p$, we can choose any distribution from which to sample the momentum variables. A common choice is to use a zero-mean Normal distribution with unit variance:

$p(\bold p) \propto \frac{\bold{p^Tp}}{2}$

Note that this is equivalent to having a quadratic potential energy term in the Hamiltonian:

$K(\bold p) = \frac{\bold{p^Tp}}{2}$

Recall that this is is the exact quadratic kinetic energy function (albeit in 1D) used in the harmonic oscillator example above. This is a convenient choice for the kinetic energy function as all partial derivatives are easy to compute. Now that we have defined a kinetic energy function, all we have to do is find a potential energy function $U(\bold x)$ that when negated and run through the exponential function, gives the target distribution $p(\bold x)$ (or an unscaled version of it). Another way of thinking of it is that we can define the potential energy function as

$U(\bold x) = -\log p(\bold x)$.

If we can calculate $-\frac{\partial \log(p(\bold x)) }{\partial x_i}$, then we’re in business and we can simulate Hamiltonian dynamics that can be used in an MCMC technique.

## Hamiltonian Monte Carlo

In HMC we use Hamiltonian dynamics as a proposal function for a Markov Chain in order to explore the target (canonical) density $p(\bold x)$ defined by $U(\bold x)$ more efficiently than using a proposal probability distribution. Starting at an initial state $[\bold x_0, \bold p_0]$, we simulate Hamiltonian dynamics for a short time using the Leap Frog method. We then use the state of the position and momentum variables at the end of the simulation as our proposed states variables $\bold x^*$ and $\bold p^*$. The proposed state is accepted using an update rule analogous to the Metropolis acceptance criterion. Specifically if the probability of the proposed state after Hamiltonian dynamics

$p(\bold x^*, \bold p^*) \propto e^{-[U(\bold x^*) + K{\bold p^*}]}$

is greater than probability of the state prior to the Hamiltonian dynamics

$p(\bold x_0,\bold p_0) \propto e^{-[U(\bold x^{(t-1)}), K(\bold p^{(t-1)})]}$

then the proposed state is accepted, otherwise, the proposed state is accepted randomly. If the state is rejected, the next state of the Markov chain is set as the state at $(t-1)$. For a given set of initial conditions, Hamiltonian dynamics will follow contours of constant energy in phase space (analogous to the circle traced out in phase space in the example above). Therefore we must randomly perturb the dynamics so as to explore all of $p(\bold x)$. This is done by simply drawing a random momentum from the corresponding canonical distribution $p(\bold p)$  before running the dynamics prior to each sampling iteration $t$. Combining these steps, sampling random momentum, followed by Hamiltonian dynamics and Metropolis acceptance criterion defines the HMC algorithm for drawing $M$ samples from a target distribution:

1. set $t = 0$
2. generate an initial position state $\bold x^{(0)} \sim \pi^{(0)}$
3. repeat until $t = M$

set $t = t+1$

– sample a new initial momentum variable from the momentum canonical distribution $\bold p_0 \sim p(\bold p)$

– set $\bold x_0 = \bold x^{(t-1)}$

– run Leap Frog algorithm starting at $[\bold x_0, \bold p_0]$ for $L$ steps and stepsize $\delta$ to obtain proposed states $\bold x^*$ and $\bold p^*$

– calculate the Metropolis acceptance probability:

$\alpha = \text{min}(1,\exp(-U(\bold x^*) + U(\bold x_0) - K(\bold p^*) + K(\bold p_0)))$

– draw a random number $u$ from $\text{Unif}(0,1)$

if $u \leq \alpha$ accept the proposed state position $\bold x^*$ and set the next state in the Markov chain $\bold x^{(t)}=\bold x^*$

else set $\bold x^{(t)} = \bold x^{(t-1)}$

In the next example we implement HMC  to sample from a multivariate target distribution that we have sampled from previously using multi-variate Metropolis-Hastings, the bivariate Normal. We also qualitatively compare the sampling dynamics of HMC to multivariate Metropolis-Hastings for the sampling the same distribution.

### Example 2: Hamiltonian Monte for sampling a Bivariate Normal distribution

As a reminder, the target distribution $p(\bold x)$ for this exampleis a Normal form with following parameterization:

$p(\bold x) = \mathcal N (\bold{\mu}, \bold \Sigma)$

with mean $\mu = [\mu_1,\mu_2]= [0, 0]$

and covariance

$\bold \Sigma = \begin{bmatrix} 1 & \rho_{12} \\ \rho_{21} & 1\end{bmatrix} = \begin{bmatrix} 1 & 0.8 \\ 0.8 & 1\end{bmatrix}$

In order to sample from $p(\bold x)$ (assuming that we are using a quadratic energy function), we need to determine the expressions for $U(\bold x)$ and

$\frac{\partial U(\bold x) }{ \partial x_i}$.

Recall that the target potential energy function can be defined from the canonical form as

$U(\bold x) = -\log(p(\bold x))$

If we take the negative log of the Normal distribution outline above, this defines the following potential energy function:

$E(\bold x) = -\log \left(e^{-\frac{\bold{x^T \Sigma^{-1} x}}{2}}\right) - \log Z$

Where $Z$ is the normalizing constant for a Normal distribution (and can be ignored because it will eventually cancel). The potential energy function is then simply:

$U(\bold x) = \frac{\bold{x^T \Sigma^{-1}x}}{2}$

with partial derivatives

$\frac{\partial U(\bold x)}{\partial x_i} = x_i$

Using these expressions for the potential energy and its partial derivatives, we implement HMC for sampling from the bivariate Normal in Matlab:

Hybrid Monte Carlo Samples from bivariate Normal target distribution

In the graph above we display HMC samples of the target distribution, starting from an initial position very far from the mean of the target. We can see that HMC rapidly approaches areas of high density under the target distribution. We compare these samples with samples drawn using the Metropolis-Hastings (MH) algorithm below. The MH algorithm converges much slower than HMC, and consecutive samples have much higher autocorrelation than samples drawn using HMC.

Metropolis-Hasting (MH) samples of the same target distribution. Autocorrelation is evident. HMC is much more efficient than MH.

The Matlab code for the HMC sampler:

% EXAMPLE 2: HYBRID MONTE CARLO SAMPLING -- BIVARIATE NORMAL
rand('seed',12345);
randn('seed',12345);

% STEP SIZE
delta = 0.3;
nSamples = 1000;
L = 20;

% DEFINE POTENTIAL ENERGY FUNCTION
U = inline('transp(x)*inv([1,.8;.8,1])*x','x');

% DEFINE GRADIENT OF POTENTIAL ENERGY
dU = inline('transp(x)*inv([1,.8;.8,1])','x');

% DEFINE KINETIC ENERGY FUNCTION
K = inline('sum((transp(p)*p))/2','p');

% INITIAL STATE
x = zeros(2,nSamples);
x0 = [0;6];
x(:,1) = x0;

t = 1;
while t < nSamples
t = t + 1;

% SAMPLE RANDOM MOMENTUM
p0 = randn(2,1);

%% SIMULATE HAMILTONIAN DYNAMICS
% FIRST 1/2 STEP OF MOMENTUM
pStar = p0 - delta/2*dU(x(:,t-1))';

% FIRST FULL STEP FOR POSITION/SAMPLE
xStar = x(:,t-1) + delta*pStar;

% FULL STEPS
for jL = 1:L-1
% MOMENTUM
pStar = pStar - delta*dU(xStar)';
% POSITION/SAMPLE
xStar = xStar + delta*pStar;
end

% LAST HALP STEP
pStar = pStar - delta/2*dU(xStar)';

% COULD NEGATE MOMENTUM HERE TO LEAVE
% THE PROPOSAL DISTRIBUTION SYMMETRIC.
% HOWEVER WE THROW THIS AWAY FOR NEXT
% SAMPLE, SO IT DOESN'T MATTER

% EVALUATE ENERGIES AT
% START AND END OF TRAJECTORY
U0 = U(x(:,t-1));
UStar = U(xStar);

K0 = K(p0);
KStar = K(pStar);

% ACCEPTANCE/REJECTION CRITERION
alpha = min(1,exp((U0 + K0) - (UStar + KStar)));

u = rand;
if u < alpha
x(:,t) = xStar;
else
x(:,t) = x(:,t-1);
end
end

% DISPLAY
figure
scatter(x(1,:),x(2,:),'k.'); hold on;
plot(x(1,1:50),x(2,1:50),'ro-','Linewidth',2);
xlim([-6 6]); ylim([-6 6]);
legend({'Samples','1st 50 States'},'Location','Northwest')
title('Hamiltonian Monte Carlo')


## Wrapping up

In this post we introduced the Hamiltonian/Hybrid Monte Carlo algorithm for more efficient MCMC sampling. The HMC algorithm is extremely powerful for sampling distributions that can be represented terms of a potential energy function and its partial derivatives. Despite the efficiency and elegance of HMC, it is an underrepresented sampling routine in the literature. This may be due to the popularity of simpler algorithms such as Gibbs sampling or Metropolis-Hastings, or perhaps due to the fact that one must select hyperparameters such as the number of Leap Frog steps and Leap Frog step size when using HMC. However, recent research has provided effective heuristics such as adapting the Leap Frog step size in order to maintain a constant Metropolis rejection rate, which facilitate the use of HMC for general applications.

## MCMC: The Metropolis Sampler

As discussed in an earlier post, we can use a Markov chain to sample from some target probability distribution $p(x)$ from which drawing samples directly is difficult. To do so, it is necessary to design a transition operator for the Markov chain which makes the chain’s stationary distribution match the target distribution. The Metropolis sampling algorithm  (and the more general Metropolis-Hastings sampling algorithm) uses simple heuristics to implement such a transition operator.

## Metropolis Sampling

Starting from some random initial state $x^{(0)} \sim \pi^{(0)}$, the algorithm first draws a possible sample $x^*$ from a  proposal distribution $q(x | x^{(t-1)})$.  Much like a conventional transition operator for a Markov chain, the proposal distribution depends only on the previous state in the chain. However, the transition operator for the Metropolis algorithm has an additional step that assesses whether or not the target distribution has a sufficiently large density near the proposed state to warrant accepting the proposed state as a sample and setting it to the next state in the chain. If the density of $p(x)$ is low near the proposed state, then it is likely (but not guaranteed) that it will be rejected. The criterion for accepting or rejecting a proposed state are defined by the following heuristics:

1. If $p(x^*) \geq p(x^{(t-1)})$,  the proposed state is kept $x^*$ as a sample and is set as the next state in the chain (i.e. move the chain’s state to a location  where $p(x)$ has equal or greater density).
2. If $p(x^*) < p(x^{(t-1)})$–indicating that $p(x)$ has low density near $x^*$–then the proposed state may still be accepted, but only randomly, and with a probability $\frac{p(x^*)}{p(x^{(t-1)})}$

These heuristics can be instantiated by calculating the acceptance probability for the proposed state.

$\alpha = \min \left(1, \frac{p(x^*)}{p(x^{(t-1)})}\right)$

Having the acceptance probability in hand, the transition operator for the metropolis algorithm works like this: if a random uniform number $u$ is less than or equal to $\alpha$, then the state $x^*$ is accepted (as in (1) above), if not, it is rejected and another state is proposed (as in (2) above). In order to collect $M$ samples using  Metropolis sampling we run the following algorithm:

1. set t = 0
2. generate an initial state $x^{(0)}$ from a prior distribution $\pi^{(0)}$ over initial states
3. repeat until $t = M$

set $t = t+1$

generate a proposal state $x^*$ from $q(x | x^{(t-1)})$

calculate the acceptance probability $\alpha = \min \left(1, \frac{p(x^*)}{p(x^{(t-1)})}\right)$

draw a random number $u$ from $\text{Unif}(0,1)$

if $u \leq \alpha$, accept the proposal and set $x^{(t)} = x^*$

else  set $x^{(t)} = x^{(t-1)}$

### Example: Using the Metropolis algorithm to sample from an unknown distribution

Say that we have some mysterious function

$p(x) = (1 + x^2)^{-1}$

from which we would like to draw samples. To do so using Metropolis sampling we need to define two things: (1) the prior distribution $\pi^{(0)}$ over the initial state of the Markov chain, and (2)  the proposal distribution $q(x | x^{(t-1)})$. For this example we define:

$\pi^{(0)} \sim \mathcal N(0,1)$

$q(x | x^{(t-1)}) \sim \mathcal N(x^{(t-1)},1)$,

both of which are simply a Normal distribution, one centered at zero, the other centered at previous state of the chain. The following chunk of MATLAB code runs the Metropolis sampler with this proposal distribution and prior.

% METROPOLIS SAMPLING EXAMPLE
randn('seed',12345);

% DEFINE THE TARGET DISTRIBUTION
p = inline('(1 + x.^2).^-1','x')

% SOME CONSTANTS
nSamples = 5000;
burnIn = 500;
nDisplay = 30;
sigma = 1;
minn = -20; maxx = 20;
xx = 3*minn:.1:3*maxx;
target = p(xx);
pauseDur = .8;

% INITIALZE SAMPLER
x = zeros(1 ,nSamples);
x(1) = randn;
t = 1;

% RUN SAMPLER
while t < nSamples
t = t+1;

% SAMPLE FROM PROPOSAL
xStar = normrnd(x(t-1) ,sigma);
proposal = normpdf(xx,x(t-1),sigma);

% CALCULATE THE ACCEPTANCE PROBABILITY
alpha = min([1, p(xStar)/p(x(t-1))]);

% ACCEPT OR REJECT?
u = rand;
if u < alpha
x(t) = xStar;
str = 'Accepted';
else
x(t) = x(t-1);
str = 'Rejected';
end

% DISPLAY SAMPLING DYNAMICS
if t < nDisplay + 1
figure(1);
subplot(211);
cla
plot(xx,target,'k');
hold on;
plot(xx,proposal,'r');
line([x(t-1),x(t-1)],[0 p(x(t-1))],'color','b','linewidth',2)
scatter(xStar,0,'ro','Linewidth',2)
line([xStar,xStar],[0 p(xStar)],'color','r','Linewidth',2)
plot(x(1:t),zeros(1,t),'ko')
legend({'Target','Proposal','p(x^{(t-1)})','x^*','p(x^*)','Kept Samples'})

switch str
case 'Rejected'
scatter(xStar,p(xStar),'rx','Linewidth',3)
case 'Accepted'
scatter(xStar,p(xStar),'rs','Linewidth',3)
end
scatter(x(t-1),p(x(t-1)),'bo','Linewidth',3)
title(sprintf('Sample % d %s',t,str))
xlim([minn,maxx])
subplot(212);
hist(x(1:t),50); colormap hot;
xlim([minn,maxx])
title(['Sample ',str]);
drawnow
pause(pauseDur);
end
end

% DISPLAY MARKOV CHAIN
figure(1); clf
subplot(211);
stairs(x(1:t),1:t, 'k');
hold on;
hb = plot([-10 10],[burnIn burnIn],'b--')
ylabel('t'); xlabel('samples, x');
set(gca , 'YDir', 'reverse');
ylim([0 t])
axis tight;
xlim([-10 10]);
title('Markov Chain Path');
legend(hb,'Burnin');

% DISPLAY SAMPLES
subplot(212);
nBins = 200;
sampleBins = linspace(minn,maxx,nBins);
counts = hist(x(burnIn:end), sampleBins);
bar(sampleBins, counts/sum(counts), 'k');
xlabel('samples, x' ); ylabel( 'p(x)' );
title('Samples');

% OVERLAY ANALYTIC DENSITY OF STUDENT T
nu = 1;
y = tpdf(sampleBins,nu)
hold on;
plot(sampleBins, y/sum(y) , 'r-', 'LineWidth', 2);
legend('Samples',sprintf('Theoretic\nStudent''s t'))
axis tight
xlim([-10 10]);


Using the Metropolis algorithm to sample from a continuous distribution (black)

In the figure above, we visualize the first 50 iterations of the Metropolis sampler.The black curve represents the target distribution $p(x)$. The red curve that is bouncing about the x-axis is the proposal distribution $q(x | x^{(t-1)})$ (if the figure is not animated, just click on it). The vertical blue line (about which the bouncing proposal distribution is centered) represents the quantity $p(x^{(t-1)})$, and the vertical red line represents the quantity $p(x^*)$, for a proposal state $x^*$ sampled according to the red  curve. At every iteration, if the vertical red line is longer than the blue line, then the sample $x^*$ is accepted, and the proposal distribution becomes centered about the newly accepted sample. If the blue line is longer, the sample is randomly rejected or accepted.

But why randomly keep “bad” proposal samples? It turns out that doing this allows the Markov chain to every-so-often visit states of low probability under the target distribution. This is a desirable property if we want the chain to adequately sample the entire target distribution, including any tails.

An attractive property of the Metropolis algorithm is that the target distribution $p(x)$ does not have to be a properly normalized probability distribution. This is due to the fact that the acceptance probability is based on the ratio of two values of the target distribution. I’ll show you what I mean. If $p(x)$ is an unnormalized distribution and

$p^*(x) = \frac{p(x)}{Z}$

is a properly normalized probability distribution with normalizing constant $Z$, then

$p(x) = Zp^*(x)$

and a ratio like that used in calculating the acceptance probability $\alpha$ is

$\frac{p(a)}{p(b)} = \frac{Zp^*(a)}{Zp^*(b)} = \frac{p^*(a)}{p^*(b)}$

The normalizing constants $Z$ cancel! This attractive property is quite useful in the context of Bayesian methods, where determining the normalizing constant for a distribution may be impractical to calculate directly. This property is demonstrated in current example. It turns out that the “mystery” distribution that we sampled from using the Metropolis algorithm is an unnormalized form of the Student’s-t distribution with one degree of freedom. Comparing $p(x)$ to the definition of the definition Student’s-t

$Student(x,\nu) = \frac{\Gamma\left(\frac{\nu + 1}{2} \right)}{\sqrt{\nu\pi} \Gamma\left( \frac{\nu}{2}\right)}\left( 1+\frac{x^2}{\nu}\right)^{-\frac{\nu+1}{2}} = \frac{(1 + x^2)^{-1}}{Z} = \frac{p(x)}{Z}$

we see that $p(x)$ is a Student’s-t distribution with degrees of freedom $\nu=1$, but missing the normalizing constant

$Z = \left(\frac{\Gamma\left(\frac{\nu + 1}{2} \right)}{\sqrt{\nu\pi} \Gamma\left( \frac{\nu}{2}\right)}\right )^{-1}$

Below is additional output from the code above showing that the samples from Metropolis sampler draws samples that follow a normalized Student’s-t distribution, even though $p(x)$ is not normalized.

Metropolis samples from an unnormalized t-distribution follow the normalized distribution

The upper plot shows the progression of the Markov chain’s progression from state $x^{(0)}$ (top) to state $x^{(5000)}$ (bottom). The burn in period for this chain was chosen to be 500 transitions, and is indicated by the dashed blue line (for more on burnin see this previous post).

The bottom plot shows samples from the Markov chain in black (with burn in samples removed). The theoretical curve for the Student’s-t with one degree of freedom is overlayed in red. We see that the states kept by the Metropolis sampler transition operator sample from values that follow the Student’s-t, even though the function $p(x)$ used in the transition operator was not a properly normalized probability distribution.

## Reversibility of the transition operator

It turns out that there is a theoretical constraint on the Markov chain the transition operator in order for it settle into a stationary distribution (i.e. a target distribution we care about). The constraint states that the probability of the transition $x^{(t)} \to x^{(t+1)}$ must be equal to the probability of the reverse transition $x^{(t+1)} \to x^{(t)}$. This reversibility property is often referred to as detailed balance. Using the Metropolis algorithm transition operator, reversibility is assured if the proposal distribution $q(x|x^{(t-1)})$ is symmetric. Such symmetric proposal distributions are the Normal, Cauchy, Student’s-t, and Uniform distributions.

However, using a symmetric proposal distribution may not be reasonable to adequately or efficiently sample all possible target distributions. For instance if a target distribution is bounded on the positive numbers $0 < x \leq \infty$, we would like to use a proposal distribution that has the same support, and will thus be assymetric. This is where the Metropolis-Hastings sampling algorithm comes in. We will discuss in a later post how the Metropolis-Hastings sampler uses a simple change to the calculation of the acceptance probability which allows us to use non-symmetric proposal distributions.