MCMC: The Metropolis-Hastings Sampler
In an earlier post we discussed how the Metropolis sampling algorithm can draw samples from a complex and/or unnormalized target probability distribution using a Markov chain. The Metropolis algorithm first proposes a possible new state $x^*$ in the Markov chain, based on a previous state $x^{(t-1)}$, according to the proposal distribution $q(x^* | x^{(t-1)})$. The algorithm accepts or rejects the proposed state based on the density of the target distribution $p(x)$ evaluated at $x^*$. (If any of this Markov-speak is gibberish to the reader, please refer to the previous posts on Markov Chains, MCMC, and the Metropolis Algorithm for some clarification.)
One constraint of the Metropolis sampler is that the proposal distribution must be symmetric. The constraint originates from using a Markov chain to draw samples: a necessary condition for drawing from a Markov chain's stationary distribution is that at any given point in time $t$, the probability of moving from $x^{(t-1)} \rightarrow x^{(t)}$ must be equal to the probability of moving from $x^{(t)} \rightarrow x^{(t-1)}$, a condition known as reversibility or detailed balance. However, a symmetric proposal distribution may be ill-fit for many problems, like when we want to sample from distributions that are bounded on semi-infinite intervals (e.g. $x \geq 0$).
In order to be able to use asymmetric proposal distributions, the Metropolis-Hastings algorithm implements an additional correction factor $c$, defined from the proposal distribution as

$$c = \frac{q(x^{(t-1)} | x^*)}{q(x^* | x^{(t-1)})}$$

The correction factor adjusts the transition operator to ensure that the probability of moving from $x^{(t-1)} \rightarrow x^{(t)}$ is equal to the probability of moving from $x^{(t)} \rightarrow x^{(t-1)}$, no matter the proposal distribution.
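For concreteness, here is a minimal sketch (not from the original post) of computing this correction factor for a hypothetical asymmetric proposal, a log-normal random walk centered on the previous state; the values of sigma and xPrev are arbitrary choices for the illustration:

% Hypothetical asymmetric proposal: log-normal random walk centered on the
% previous state. The correction factor is c = q(x(t-1)|x*) / q(x*|x(t-1)).
sigma = 0.5;                                % proposal spread (assumed)
xPrev = 2;                                  % hypothetical previous state x(t-1)
xStar = lognrnd(log(xPrev), sigma);         % proposed state x* ~ q(x | x(t-1))
c = lognpdf(xPrev, log(xStar), sigma) / lognpdf(xStar, log(xPrev), sigma)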
The Metropolis-Hastings algorithm is implemented with essentially the same procedure as the Metropolis sampler, except that the correction factor is used in the evaluation of the acceptance probability $\alpha$. Specifically, to draw $M$ samples using the Metropolis-Hastings sampler (a minimal MATLAB sketch of the procedure follows the steps below):

- set $t = 0$
- generate an initial state $x^{(0)} \sim \pi^{(0)}$
- repeat until $t = M$:
  - set $t = t + 1$
  - generate a proposal state $x^*$ from $q(x | x^{(t-1)})$
  - calculate the proposal correction factor $c = \frac{q(x^{(t-1)} | x^*)}{q(x^* | x^{(t-1)})}$
  - calculate the acceptance probability $\alpha = \min\left(1, \frac{p(x^*)}{p(x^{(t-1)})} \times c\right)$
  - draw a random number $u$ from $\text{Unif}(0,1)$
    - if $u \leq \alpha$, accept the proposal state $x^*$ and set $x^{(t)} = x^*$
    - else, set $x^{(t)} = x^{(t-1)}$
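Below is a minimal MATLAB sketch of these steps (not the post's full example), assuming the target density p, a proposal sampler qrnd, and the proposal density qpdf are supplied as function handles; these names and the file name mhSample.m are assumptions for this sketch.

function x = mhSample(p, qrnd, qpdf, x0, M)
% Minimal Metropolis-Hastings sketch (save as mhSample.m). Assumed inputs:
%   p(x)                (possibly unnormalized) target density
%   qrnd(xPrev)         draws a proposal x* from q(x | xPrev)
%   qpdf(xEval, xGiven) evaluates the proposal density q(xEval | xGiven)
%   x0, M               initial state and number of samples to draw
x = zeros(1, M);
x(1) = x0;
for t = 2:M
    xStar = qrnd(x(t-1));                              % propose x* ~ q(x | x(t-1))
    c     = qpdf(x(t-1), xStar) / qpdf(xStar, x(t-1)); % correction factor
    alpha = min(1, p(xStar) / p(x(t-1)) * c);          % acceptance probability
    if rand <= alpha
        x(t) = xStar;                                  % accept the proposal
    else
        x(t) = x(t-1);                                 % reject; keep previous state
    end
end
end

For instance, with the exponential proposal used later in this post, a hypothetical call would be x = mhSample(@(A) p(1.5,A,1), @(xOld) exprnd(5), @(xEval,xGiven) exppdf(xEval,5), 5, 5000);, where p is the posterior function defined in the full script at the bottom.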
Many consider the Metropolis-Hastings algorithm to be a generalization of the Metropolis algorithm. This is because when the proposal distribution is symmetric, the correction factor is equal to one, giving the transition operator for the Metropolis sampler.
Example: Sampling from a Bayesian posterior with improper prior
For a number of applications, including regression and density estimation, it is usually necessary to determine a set of parameters $\theta$ to an assumed model $p(y | \theta)$ such that the model can best account for some observed data $y$. The model function $p(y | \theta)$ is often referred to as the likelihood function. In Bayesian methods there is often an explicit prior distribution $p(\theta)$ that is placed on the model parameters and controls the values that the parameters can take.

The parameters are determined based on the posterior distribution $p(\theta | y)$, which is a probability distribution over the possible parameters based on the observed data. The posterior can be determined using Bayes' theorem:

$$p(\theta | y) = \frac{p(y | \theta)p(\theta)}{p(y)}$$

where $p(y)$ is a normalization constant that is often quite difficult to determine explicitly, as it involves computing sums over every possible value that the parameters $\theta$ can take.
Let’s say that we assume the following model (likelihood function):

$$p(y | \theta) = \text{Gamma}(y; A, B),$$ where

$$\text{Gamma}(y; A, B) = \frac{B^A}{\Gamma(A)} y^{A-1} e^{-By},$$ where

$\Gamma(\cdot)$ is the gamma function. Thus, the model parameters are

$$\theta = [A, B]$$

The parameter $A$ controls the shape of the distribution, and $B$ controls the scale. The likelihood surface for $B = 1$ and a number of values of $A$ ranging from zero to five is shown below.
The conditional distribution $p(y | A = 2)$ is plotted in green along the likelihood surface. You can verify this conditional in MATLAB with the following command (gampdf uses the shape/scale parameterization, which matches the density above when $B = 1$):

plot(0:.1:10,gampdf(0:.1:10,2,1)); % GAMMA(2,1)
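As a quick sanity check (a sketch added here, not part of the post's script), the hand-written density with rate $B$ corresponds to gampdf(y, A, 1/B); with $B = 1$ the two coincide:

% Sketch: compare the hand-written Gamma density (shape A, rate B) with
% MATLAB's gampdf (shape A, scale 1/B); the values of y and A are arbitrary.
y = 0:.1:10; A = 2; B = 1;
manual      = (B.^A./gamma(A)).*y.^(A-1).*exp(-B.*y);
fromToolbox = gampdf(y, A, 1/B);
max(abs(manual - fromToolbox))   % should be ~0, up to floating point error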
Now, let’s assume the following priors on the model parameters:

$$p(B = 1) = 1$$

and

$$p(A) = \sin^2(\pi A)$$

The first prior states that $B$ only takes a single value (i.e. 1), therefore we can treat it as a constant. The second (rather non-conventional) prior states that the probability of $A$ varies as a sinusoidal function. (Note that both of these prior distributions are called improper priors because they do not integrate to one.) Because $B$ is constant, we only need to estimate the value of $A$.
It turns out that even though the normalization constant $p(y)$ may be difficult to compute, we can sample from $p(A | y)$ without knowing $p(y)$ using the Metropolis-Hastings algorithm. In particular, we can ignore the normalization constant $p(y)$ and sample from the unnormalized posterior:

$$p(A | y) \propto p(y | A) p(A)$$
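A minimal sketch of this idea in MATLAB (using anonymous functions instead of the inline calls in the full script below; the names are assumptions for this sketch):

% Sketch: unnormalized posterior p(A|y) proportional to p(y|A) * p(A), B fixed at 1.
B = 1;
likelihood = @(y,A) (B.^A./gamma(A)).*y.^(A-1).*exp(-B.*y);  % Gamma(y; A, B)
prior      = @(A) sin(pi*A).^2;                              % improper prior on A
postUnnorm = @(A,y) likelihood(y,A).*prior(A);               % known only up to p(y)

This unnormalized function is all the Metropolis-Hastings acceptance ratio requires, since the unknown constant $p(y)$ cancels in the ratio $p(A^* | y) / p(A^{(t-1)} | y)$.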
The surface of the (unnormalized) posterior for $y$ ranging from zero to ten is shown below. The prior $p(A)$ is displayed in blue on the right of the plot. Let’s say that we have a datapoint $y = 1.5$ and would like to estimate the posterior distribution $p(A | y = 1.5)$ using the Metropolis-Hastings algorithm. This particular target distribution is plotted in magenta in the plot below.
Using a symmetric proposal distribution like the Normal distribution is not efficient for sampling from $p(A | y = 1.5)$ due to the fact that the posterior only has support on the positive real numbers $A \in (0, \infty)$. An asymmetric proposal distribution with the same support would provide better coverage of the posterior. One distribution that operates on the positive real numbers is the exponential distribution

$$q(A) = \text{Exp}(\mu) = \frac{1}{\mu} e^{-\frac{A}{\mu}}$$

This distribution is parameterized by a single variable $\mu$ that controls the scale and location of the distribution's probability mass. The target posterior and a proposal distribution (for $\mu = 5$) are shown in the plot below.
We see that the proposal has fairly good coverage of the posterior distribution. We run the Metropolis-Hastings sampler in the block of MATLAB code at the bottom. The Markov chain path and the resulting samples are shown in the plot below.
As an aside, note that the proposal distribution for this sampler does not depend on past samples, but only on the parameter $\mu$ (see the proposal step xStar = exprnd(mu) in the MATLAB code below). Each proposal state $x^*$ is drawn independently of the previous state. Therefore this is an example of an independence sampler, a specific type of Metropolis-Hastings sampling algorithm. Independence samplers are notorious for being either very good or very poor sampling routines. The quality of the routine depends on the choice of the proposal distribution and its coverage of the target distribution. Identifying such a proposal distribution is often difficult in practice.
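To make that structure explicit, here is a sketch of the correction factor for this independence sampler (the numeric values are arbitrary, for illustration only); because $q(x^* | x^{(t-1)}) = q(x^*)$, the factor reduces to a ratio of marginal proposal densities, exactly the line c = q(x(t-1),mu)/q(xStar,mu) in the script below:

% Independence sampler: the proposal ignores the previous state, so the
% correction factor is c = q(x(t-1)) / q(x*).
mu    = 5;                        % proposal mean
xPrev = 2;                        % hypothetical previous state
xStar = exprnd(mu);               % proposal drawn independently of xPrev
c = exppdf(xPrev, mu) / exppdf(xStar, mu)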
The MATLAB code for running the Metropolis-Hastings sampler is shown below; running it in MATLAB will reproduce the figures above.
% METROPOLIS-HASTINGS BAYESIAN POSTERIOR
rand('seed',12345)

% PRIOR OVER SCALE PARAMETERS
B = 1;

% DEFINE LIKELIHOOD
likelihood = inline('(B.^A/gamma(A)).*y.^(A-1).*exp(-(B.*y))','y','A','B');

% CALCULATE AND VISUALIZE THE LIKELIHOOD SURFACE
yy = linspace(0,10,100);
AA = linspace(0.1,5,100);
likeSurf = zeros(numel(yy),numel(AA));
for iA = 1:numel(AA); likeSurf(:,iA)=likelihood(yy(:),AA(iA),B); end;

figure;
surf(likeSurf); ylabel('p(y|A)'); xlabel('A'); colormap hot

% DISPLAY CONDITIONAL AT A = 2
hold on;
ly = plot3(ones(1,numel(AA))*40,1:100,likeSurf(:,40),'g','linewidth',3)
xlim([0 100]); ylim([0 100]); axis normal
set(gca,'XTick',[0,100]); set(gca,'XTickLabel',[0 5]);
set(gca,'YTick',[0,100]); set(gca,'YTickLabel',[0 10]);
view(65,25)
legend(ly,'p(y|A=2)','Location','Northeast');
hold off;
title('p(y|A)');

% DEFINE PRIOR OVER SHAPE PARAMETERS
prior = inline('sin(pi*A).^2','A');

% DEFINE THE POSTERIOR
p = inline('(B.^A/gamma(A)).*y.^(A-1).*exp(-(B.*y)).*sin(pi*A).^2','y','A','B');

% CALCULATE AND DISPLAY THE POSTERIOR SURFACE
postSurf = zeros(size(likeSurf));
for iA = 1:numel(AA); postSurf(:,iA)=p(yy(:),AA(iA),B); end;

figure
surf(postSurf); ylabel('y'); xlabel('A'); colormap hot

% DISPLAY THE PRIOR
hold on;
pA = plot3(1:100,ones(1,numel(AA))*100,prior(AA),'b','linewidth',3)

% SAMPLE FROM p(A | y = 1.5)
y = 1.5;
target = postSurf(16,:);

% DISPLAY POSTERIOR
psA = plot3(1:100, ones(1,numel(AA))*16,postSurf(16,:),'m','linewidth',3)
xlim([0 100]); ylim([0 100]); axis normal
set(gca,'XTick',[0,100]); set(gca,'XTickLabel',[0 5]);
set(gca,'YTick',[0,100]); set(gca,'YTickLabel',[0 10]);
view(65,25)
legend([pA,psA],{'p(A)','p(A|y = 1.5)'},'Location','Northeast');
hold off
title('p(A|y)');

% INITIALIZE THE METROPOLIS-HASTINGS SAMPLER
% DEFINE PROPOSAL DENSITY
q = inline('exppdf(x,mu)','x','mu');

% MEAN FOR PROPOSAL DENSITY
mu = 5;

% DISPLAY TARGET AND PROPOSAL
figure; hold on;
th = plot(AA,target,'m','Linewidth',2);
qh = plot(AA,q(AA,mu),'k','Linewidth',2)
legend([th,qh],{'Target, p(A)','Proposal, q(A)'});
xlabel('A');

% SOME CONSTANTS
nSamples = 5000;
burnIn = 500;
minn = 0.1; maxx = 5;

% INITIALIZE SAMPLER
x = zeros(1,nSamples);
x(1) = mu;
t = 1;

% RUN METROPOLIS-HASTINGS SAMPLER
while t < nSamples
    t = t+1;

    % SAMPLE FROM PROPOSAL
    xStar = exprnd(mu);

    % CORRECTION FACTOR
    c = q(x(t-1),mu)/q(xStar,mu);

    % CALCULATE THE (CORRECTED) ACCEPTANCE RATIO
    alpha = min([1, p(y,xStar,B)/p(y,x(t-1),B)*c]);

    % ACCEPT OR REJECT?
    u = rand;
    if u < alpha
        x(t) = xStar;
    else
        x(t) = x(t-1);
    end
end

% DISPLAY MARKOV CHAIN
figure;
subplot(211);
stairs(x(1:t),1:t, 'k');
hold on;
hb = plot([0 maxx/2],[burnIn burnIn],'g--','Linewidth',2)
ylabel('t'); xlabel('samples, A');
set(gca,'YDir','reverse');
ylim([0 t])
axis tight;
xlim([0 maxx]);
title('Markov Chain Path');
legend(hb,'Burnin');

% DISPLAY SAMPLES
subplot(212);
nBins = 100;
sampleBins = linspace(minn,maxx,nBins);
counts = hist(x(burnIn:end), sampleBins);
bar(sampleBins, counts/sum(counts), 'k');
xlabel('samples, A'); ylabel('p(A | y)');
title('Samples'); xlim([0 10])

% OVERLAY TARGET DISTRIBUTION
hold on;
plot(AA, target/sum(target), 'm-', 'LineWidth', 2);
legend('Sampled Distribution',sprintf('Target Posterior'))
axis tight
Wrapping Up
Here we explored how the Metropolis-Hastings sampling algorithm can be used to generalize the Metropolis algorithm in order to sample from complex (and unnormalized) probability distributions using asymmetric proposal distributions. One shortcoming of the Metropolis-Hastings algorithm is that not all of the proposed samples are accepted, wasting valuable computational resources. This becomes even more of an issue for sampling distributions in higher dimensions. This is where Gibbs sampling comes in. We’ll see in a later post that Gibbs sampling can be used to keep all proposal states in the Markov chain by taking advantage of conditional probabilities.