# MCMC: The Metropolis-Hastings Sampler

In an earlier post we discussed how the Metropolis sampling algorithm can draw samples from a complex and/or unnormalized target probability distribution using a Markov chain. The Metropolis algorithm first proposes a possible new state $x^*$ in the Markov chain, based on the previous state $x^{(t-1)}$, according to the proposal distribution $q(x^* | x^{(t-1)})$. The algorithm accepts or rejects the proposed state based on the density of the target distribution $p(x)$ evaluated at $x^*$. (If any of this Markov-speak is gibberish to the reader, please refer to the previous posts on Markov chains, MCMC, and the Metropolis algorithm for clarification.)

One constraint of the Metropolis sampler is that the proposal distribution $q(x^* | x^{(t-1)})$ must be symmetric. The constraint originates from using a Markov chain to draw samples: a necessary condition for drawing from a Markov chain’s stationary distribution is that at any given point in time $t$, the probability of moving from $x^{(t-1)} \rightarrow x^{(t)}$ must be equal to the probability of moving from $x^{(t)} \rightarrow x^{(t-1)}$, a condition known as reversibility or detailed balance. However, a symmetric proposal distribution may be ill-suited for many problems, such as sampling from distributions that are bounded on semi-infinite intervals (e.g. $[0, \infty)$).

In order to be able to use an asymmetric proposal distribution, the Metropolis-Hastings algorithm introduces an additional correction factor $c$, defined from the proposal distribution as

$c = \frac{q(x^{(t-1)} | x^*) }{q(x^* | x^{(t-1)})}$

The correction factor adjusts the transition operator to ensure that the probability of moving from $x^{(t-1)} \rightarrow x^{(t)}$ is equal to the probability of moving from $x^{(t)} \rightarrow x^{(t-1)}$, no matter the proposal distribution.

The Metropolis-Hastings algorithm is implemented with essentially the same procedure as the Metropolis sampler, except that the correction factor is used in the evaluation of the acceptance probability $\alpha$. Specifically, to draw $M$ samples using the Metropolis-Hastings sampler:

1. Set $t = 0$
2. Generate an initial state $x^{(0)} \sim \pi^{(0)}$
3. Repeat until $t = M$:
    1. Set $t = t+1$
    2. Generate a proposal state $x^*$ from $q(x | x^{(t-1)})$
    3. Calculate the proposal correction factor $c = \frac{q(x^{(t-1)} | x^*) }{q(x^*|x^{(t-1)})}$
    4. Calculate the acceptance probability $\alpha = \text{min} \left (1,\frac{p(x^*)}{p(x^{(t-1)})} \times c\right )$
    5. Draw a random number $u$ from $\text{Unif}(0,1)$
    6. If $u \leq \alpha$, accept the proposal state $x^*$ and set $x^{(t)}=x^*$; else set $x^{(t)} = x^{(t-1)}$
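The steps above can be sketched in code. This is an illustrative Python translation (the post's own code, given later, is MATLAB); the target and proposal here are my own choices for demonstration, an unnormalized Gamma(2,1) target density sampled with an asymmetric exponential proposal:

```python
import numpy as np
from scipy import stats

def metropolis_hastings(p, q_sample, q_pdf, x0, M, rng):
    """Draw M samples from a (possibly unnormalized) target density p.

    q_sample(x_prev, rng) draws a proposal state; q_pdf(a, b) evaluates
    the proposal density q(a | b).
    """
    x = np.empty(M + 1)
    x[0] = x0
    for t in range(1, M + 1):
        x_star = q_sample(x[t - 1], rng)
        # correction factor c = q(x_prev | x*) / q(x* | x_prev)
        c = q_pdf(x[t - 1], x_star) / q_pdf(x_star, x[t - 1])
        alpha = min(1.0, p(x_star) / p(x[t - 1]) * c)
        if rng.uniform() <= alpha:
            x[t] = x_star          # accept the proposal
        else:
            x[t] = x[t - 1]        # reject: repeat the previous state
    return x

# Example: unnormalized Gamma(2,1) target, exponential proposal with mean 5
rng = np.random.default_rng(0)
p = lambda y: y * np.exp(-y)                       # Gamma(2,1) up to a constant
q_sample = lambda x_prev, rng: rng.exponential(5.0)
q_pdf = lambda a, b: stats.expon.pdf(a, scale=5.0)
samples = metropolis_hastings(p, q_sample, q_pdf, x0=1.0, M=5000, rng=rng)
print(samples[500:].mean())  # should be near the Gamma(2,1) mean of 2
```

Note that because the proposal here ignores the previous state, the correction factor reduces to a ratio of proposal densities at the old and new states.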

Many consider the Metropolis-Hastings algorithm to be a generalization of the Metropolis algorithm: when the proposal distribution is symmetric, the correction factor equals one, recovering the transition operator of the Metropolis sampler.
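To see this concretely (an illustrative Python check, with arbitrary numbers of my own choosing): for a Gaussian proposal centered on the previous state, $q(a|b) = q(b|a)$, so the correction factor is exactly one:

```python
from scipy import stats

x_prev, x_star, sigma = 1.3, 2.7, 0.5
q = lambda a, b: stats.norm.pdf(a, loc=b, scale=sigma)  # q(a | b)
c = q(x_prev, x_star) / q(x_star, x_prev)
print(c)  # 1.0 -- with a symmetric proposal, Metropolis-Hastings reduces to Metropolis
```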

## Example: Sampling from a Bayesian posterior with improper prior

For a number of applications, including regression and density estimation, it is usually necessary to determine a set of parameters $\theta$ to an assumed model $p(y | \theta)$ such that the model can best account for some observed data $y$. The model function $p(y | \theta)$ is often referred to as the likelihood function. In Bayesian methods there is often an explicit prior distribution $p(\theta)$ that is placed on the model parameters and controls the values that the parameters can take.

The parameters are determined based on the posterior distribution $p(\theta | y)$, which is a probability distribution over the possible parameters based on the observed data. The posterior can be determined using Bayes’ theorem:

$p(\theta | y) = \frac{p(y | \theta) p(\theta)}{p(y)}$

where $p(y)$ is a normalization constant that is often quite difficult to determine explicitly, as it involves integrating or summing over every possible value that the parameters can take.
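For intuition, consider a toy example (not from the post, with made-up numbers) in which the parameter is discrete: the normalizer $p(y)$ is just a sum over parameter values, and the posterior is the normalized product of likelihood and prior:

```python
import numpy as np

# Hypothetical discrete parameter taking 3 values
prior = np.array([0.2, 0.5, 0.3])          # p(theta)
likelihood = np.array([0.10, 0.40, 0.70])  # p(y | theta) at the observed y
p_y = np.sum(likelihood * prior)           # normalizer p(y)
posterior = likelihood * prior / p_y       # Bayes' theorem
print(posterior.sum())  # 1.0 -- a proper probability distribution
```

With a continuous parameter the sum becomes an integral, which is what makes $p(y)$ hard to compute in general.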

Let’s say that we assume the following model (likelihood function):

$p(y | \theta) = \text{Gamma}(y;A,B)$, where

$\text{Gamma}(y;A,B) = \frac{B^A}{\Gamma(A)} y^{A-1}e^{-By}$,

where $\Gamma(\cdot)$ is the gamma function. Thus, the model parameters are

$\theta = [A,B]$

The parameter $A$ controls the shape of the distribution, and $B$ (a rate parameter) controls the scale. The likelihood surface for $B = 1$ and a number of values of $A$ ranging from zero to five is shown below.
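The likelihood formula can be checked against a library implementation. Here is a Python sketch (rather than the post's MATLAB); note that SciPy parameterizes the Gamma distribution by shape $A$ and scale $1/B$:

```python
import numpy as np
from scipy import stats
from scipy.special import gamma as gamma_fn

def gamma_likelihood(y, A, B):
    """Gamma(y; A, B) with shape A and rate B, as defined above."""
    return (B**A / gamma_fn(A)) * y**(A - 1) * np.exp(-B * y)

y = np.linspace(0.1, 10, 50)
A, B = 2.0, 1.0
# agrees with SciPy's gamma pdf (shape A, scale 1/B)
assert np.allclose(gamma_likelihood(y, A, B), stats.gamma.pdf(y, A, scale=1/B))
```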

Likelihood surface and conditional probability p(y|A=2,B=1) in green

The conditional distribution $p(y | A=2, B = 1)$ is plotted in green along the likelihood surface. You can verify this is a valid conditional distribution in MATLAB with the following command (note that MATLAB's `gampdf(y,A,1/B)` takes a shape and a scale parameter):

 plot(0:.1:10,gampdf(0:.1:10,2,1)); % Gamma(A=2,B=1)

Now, let’s assume the following priors on the model parameters:

$p(B = 1) = 1$

and

$p(A) = \text{sin}(\pi A)^2$

The first prior states that $B$ only takes a single value (i.e. 1), therefore we can treat it as a constant. The second (rather non-conventional) prior states that the probability of $A$ varies as a sinusoidal function. (Note that both of these prior distributions are called improper priors because they do not integrate to one). Because $B$ is constant, we only need to estimate the value of $A$.

It turns out that even though the normalization constant $p(y)$ may be difficult to compute, we can sample from $p(A | y)$ without knowing $p(y)$ using the Metropolis-Hastings algorithm. In particular, we can ignore the normalization constant and sample from the unnormalized posterior:

$p(A | y) \propto p(y |A) p(A)$
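Concretely, this unnormalized posterior can be written as follows (a Python sketch of the post's MATLAB; the function name is my own):

```python
import numpy as np
from scipy.special import gamma as gamma_fn

def unnorm_posterior(A, y, B=1.0):
    """p(A | y) up to a constant: Gamma likelihood times the sin^2 prior."""
    likelihood = (B**A / gamma_fn(A)) * y**(A - 1) * np.exp(-B * y)
    prior = np.sin(np.pi * A) ** 2
    return likelihood * prior

A_grid = np.linspace(0.1, 5, 100)
target = unnorm_posterior(A_grid, y=1.5)
# the posterior has two nearly equal modes, near A ~ 1.5 and A ~ 2.5
print(A_grid[np.argmax(target)])
```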

The surface of the (unnormalized) posterior for $y$ ranging from zero to ten is shown below. The prior $p(A)$ is displayed in blue on the right of the plot. Let’s say that we have a datapoint $y = 1.5$ and would like to estimate the posterior distribution $p(A|y=1.5)$ using the Metropolis-Hastings algorithm. This particular target distribution is plotted in magenta in the plot below.

Posterior surface, prior distribution (blue), and target distribution (pink)

Using a symmetric proposal distribution like the Normal distribution is inefficient for sampling from $p(A|y=1.5)$ because the posterior only has support on the positive real numbers $A \in [0, \infty)$. An asymmetric proposal distribution with the same support would provide better coverage of the posterior. One distribution defined on the positive real numbers is the exponential distribution:

$q(A) = \text{Exp}(\mu) = \frac{1}{\mu} e^{-A/\mu}$

This distribution is parameterized by a single variable $\mu$, its mean, which controls the scale and spread of the distribution's probability mass. The target posterior and the proposal distribution (for $\mu = 5$) are shown in the plot below.
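A parameterization note: MATLAB's `exppdf(x,mu)`, used in the code below, treats $\mu$ as the mean (scale) of the exponential, not the rate. The equivalent in Python (a hedged sketch) is:

```python
import numpy as np
from scipy import stats

mu = 5.0  # mean of the exponential proposal, as in the post

def q(A, mu):
    """Exponential proposal density with mean mu: (1/mu) * exp(-A/mu)."""
    return (1.0 / mu) * np.exp(-A / mu)

A = np.linspace(0.01, 5, 50)
# matches SciPy's exponential pdf with scale=mu (i.e. MATLAB's exppdf(A, mu))
assert np.allclose(q(A, mu), stats.expon.pdf(A, scale=mu))
```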

Target posterior p(A|y) and proposal distribution q(A)

We see that the proposal distribution provides fairly good coverage of the posterior. We run the Metropolis-Hastings sampler in the block of MATLAB code at the bottom of the post. The resulting Markov chain path and samples are shown in the plot below.

Metropolis-Hastings Markov chain and samples

As an aside, note that the proposal distribution for this sampler does not depend on past samples, but only on the parameter $\mu$ (see the call to `exprnd(mu)` in the MATLAB code below). Each proposal state $x^*$ is drawn independently of the previous state. This is therefore an example of an independence sampler, a specific type of Metropolis-Hastings sampling algorithm. Independence samplers are notorious for being either very good or very poor sampling routines: the quality of the routine depends on the choice of the proposal distribution and its coverage of the target distribution. Identifying such a proposal distribution is often difficult in practice.

The MATLAB code for running the Metropolis-Hastings sampler and producing the figures above is below.

```matlab
% METROPOLIS-HASTINGS BAYESIAN POSTERIOR
rand('seed',12345)

% PRIOR OVER SCALE PARAMETERS
B = 1;

% DEFINE LIKELIHOOD
likelihood = inline('(B.^A/gamma(A)).*y.^(A-1).*exp(-(B.*y))','y','A','B');

% CALCULATE AND VISUALIZE THE LIKELIHOOD SURFACE
yy = linspace(0,10,100);
AA = linspace(0.1,5,100);
likeSurf = zeros(numel(yy),numel(AA));
for iA = 1:numel(AA); likeSurf(:,iA) = likelihood(yy(:),AA(iA),B); end;

figure;
surf(likeSurf); ylabel('p(y|A)'); xlabel('A'); colormap hot

% DISPLAY CONDITIONAL AT A = 2
hold on; ly = plot3(ones(1,numel(AA))*40,1:100,likeSurf(:,40),'g','linewidth',3);
xlim([0 100]); ylim([0 100]); axis normal
set(gca,'XTick',[0,100]); set(gca,'XTickLabel',[0 5]);
set(gca,'YTick',[0,100]); set(gca,'YTickLabel',[0 10]);
view(65,25)
legend(ly,'p(y|A=2)','Location','Northeast');
hold off;
title('p(y|A)');

% DEFINE PRIOR OVER SHAPE PARAMETERS
prior = inline('sin(pi*A).^2','A');

% DEFINE THE POSTERIOR
p = inline('(B.^A/gamma(A)).*y.^(A-1).*exp(-(B.*y)).*sin(pi*A).^2','y','A','B');

% CALCULATE AND DISPLAY THE POSTERIOR SURFACE
postSurf = zeros(size(likeSurf));
for iA = 1:numel(AA); postSurf(:,iA) = p(yy(:),AA(iA),B); end;

figure
surf(postSurf); ylabel('y'); xlabel('A'); colormap hot

% DISPLAY THE PRIOR
hold on; pA = plot3(1:100,ones(1,numel(AA))*100,prior(AA),'b','linewidth',3);

% SAMPLE FROM p(A | y = 1.5)
y = 1.5;
target = postSurf(16,:);

% DISPLAY POSTERIOR
psA = plot3(1:100,ones(1,numel(AA))*16,postSurf(16,:),'m','linewidth',3);
xlim([0 100]); ylim([0 100]); axis normal
set(gca,'XTick',[0,100]); set(gca,'XTickLabel',[0 5]);
set(gca,'YTick',[0,100]); set(gca,'YTickLabel',[0 10]);
view(65,25)
legend([pA,psA],{'p(A)','p(A|y = 1.5)'},'Location','Northeast');
hold off
title('p(A|y)');

% INITIALIZE THE METROPOLIS-HASTINGS SAMPLER
% DEFINE PROPOSAL DENSITY
q = inline('exppdf(x,mu)','x','mu');

% MEAN FOR PROPOSAL DENSITY
mu = 5;

% DISPLAY TARGET AND PROPOSAL
figure; hold on;
th = plot(AA,target,'m','Linewidth',2);
qh = plot(AA,q(AA,mu),'k','Linewidth',2);
legend([th,qh],{'Target, p(A)','Proposal, q(A)'});
xlabel('A');

% SOME CONSTANTS
nSamples = 5000;
burnIn = 500;
minn = 0.1; maxx = 5;

% INITIALIZE SAMPLER
x = zeros(1,nSamples);
x(1) = mu;
t = 1;

% RUN METROPOLIS-HASTINGS SAMPLER
while t < nSamples
    t = t+1;

    % SAMPLE FROM PROPOSAL
    xStar = exprnd(mu);

    % CORRECTION FACTOR
    c = q(x(t-1),mu)/q(xStar,mu);

    % CALCULATE THE (CORRECTED) ACCEPTANCE PROBABILITY
    alpha = min([1, p(y,xStar,B)/p(y,x(t-1),B)*c]);

    % ACCEPT OR REJECT?
    u = rand;
    if u <= alpha
        x(t) = xStar;
    else
        x(t) = x(t-1);
    end
end

% DISPLAY MARKOV CHAIN
figure;
subplot(211);
stairs(x(1:t),1:t,'k');
hold on;
hb = plot([0 maxx/2],[burnIn burnIn],'g--','Linewidth',2);
ylabel('t'); xlabel('samples, A');
set(gca,'YDir','reverse');
ylim([0 t])
axis tight;
xlim([0 maxx]);
title('Markov Chain Path');
legend(hb,'Burnin');

% DISPLAY SAMPLES
subplot(212);
nBins = 100;
sampleBins = linspace(minn,maxx,nBins);
counts = hist(x(burnIn:end),sampleBins);
bar(sampleBins,counts/sum(counts),'k');
xlabel('samples, A'); ylabel('p(A | y)');
title('Samples');
xlim([0 10])

% OVERLAY TARGET DISTRIBUTION
hold on;
plot(AA,target/sum(target),'m-','LineWidth',2);
legend('Sampled Distribution','Target Posterior')
axis tight
```


## Wrapping Up

Here we explored how the Metropolis-Hastings sampling algorithm generalizes the Metropolis algorithm in order to sample from complex (and unnormalized) probability distributions using asymmetric proposal distributions. One shortcoming of the Metropolis-Hastings algorithm is that not all of the proposed samples are accepted, wasting valuable computational resources. This becomes even more of an issue when sampling distributions in higher dimensions. This is where Gibbs sampling comes in. We’ll see in a later post that Gibbs sampling can be used to keep all proposal states in the Markov chain by taking advantage of conditional probabilities.

I recently received my PhD from UC Berkeley where I studied computational neuroscience and machine learning.

Posted on October 20, 2012, in Algorithms, Sampling Methods, Simulations, Statistics.
