# Monthly Archives: October 2012

## MCMC: The Metropolis-Hastings Sampler

In an earlier post we discussed how the Metropolis sampling algorithm can draw samples from a complex and/or unnormalized target probability distribution using a Markov chain. The Metropolis algorithm first proposes a possible new state $x^*$ in the Markov chain, based on a previous state $x^{(t-1)}$, according to the proposal distribution $q(x^* | x^{(t-1)})$. The algorithm accepts or rejects the proposed state based on the density of the target distribution $p(x)$ evaluated at $x^*$. (If any of this Markov-speak is gibberish to the reader, please refer to the previous posts on Markov Chains, MCMC, and the Metropolis Algorithm for some clarification.)

One constraint of the Metropolis sampler is that the proposal distribution $q(x^* | x^{(t-1)})$ must be symmetric. The constraint originates from using a Markov chain to draw samples: the chain is guaranteed to have the target as its stationary distribution if, at any given point in time $t$, the probability of moving from $x^{(t-1)} \rightarrow x^{(t)}$ is equal to the probability of moving from $x^{(t)} \rightarrow x^{(t-1)}$, a condition known as *reversibility* or *detailed balance*. However, a symmetric proposal distribution may be ill-fit for many problems, like when we want to sample from distributions that are bounded on semi-infinite intervals (e.g. $[0, \infty)$).

In order to be able to use asymmetric proposal distributions, the Metropolis-Hastings algorithm implements an additional correction factor $c$, defined from the proposal distribution as

$$c = \frac{q(x^{(t-1)} | x^*)}{q(x^* | x^{(t-1)})}$$

The correction factor adjusts the transition operator to ensure that the probability of moving from $x^{(t-1)} \rightarrow x^{(t)}$ is equal to the probability of moving from $x^{(t)} \rightarrow x^{(t-1)}$, no matter the proposal distribution.
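To see why this correction restores reversibility, it can help to write the balance check out. The following is a sketch of the standard argument rather than something taken from the post: $T(x \rightarrow x^*)$ is shorthand introduced here for the probability of proposing $x^*$ from $x$ and then accepting it, where the acceptance probability is $\min\left(1, \frac{p(x^*)}{p(x)} c\right)$ with $c = \frac{q(x | x^*)}{q(x^* | x)}$.

```latex
\begin{aligned}
p(x)\,T(x \rightarrow x^*)
  &= p(x)\,q(x^* | x)\,\min\left(1,\ \frac{p(x^*)\,q(x | x^*)}{p(x)\,q(x^* | x)}\right) \\
  &= \min\Big(p(x)\,q(x^* | x),\ p(x^*)\,q(x | x^*)\Big)
\end{aligned}
```

The last expression is unchanged when $x$ and $x^*$ are swapped, so $p(x)\,T(x \rightarrow x^*) = p(x^*)\,T(x^* \rightarrow x)$ holds for any proposal $q$, symmetric or not.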

The Metropolis-Hastings algorithm is implemented with essentially the same procedure as the Metropolis sampler, except that the correction factor is used in the evaluation of the acceptance probability $\alpha$. Specifically, to draw samples using the Metropolis-Hastings sampler:

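- set t = 0
- generate an initial state $x^{(0)}$
- repeat until $t = M$, the desired number of samples
    - set $t = t+1$
    - generate a proposal state $x^*$ from $q(x | x^{(t-1)})$
    - calculate the proposal correction factor $c = \frac{q(x^{(t-1)} | x^*)}{q(x^* | x^{(t-1)})}$
    - calculate the acceptance probability $\alpha = \min\left(1, \frac{p(x^*)}{p(x^{(t-1)})} \times c\right)$
    - draw a random number $u$ from $\text{Unif}(0,1)$
    - if $u \leq \alpha$, accept the proposal state $x^*$ and set $x^{(t)} = x^*$
    - else set $x^{(t)} = x^{(t-1)}$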
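The steps above can be sketched in a few lines of base MATLAB. This is only a minimal illustration under assumptions that are not part of the post: the target is taken to be $p(x) = x e^{-x}$ on $x > 0$ (an unnormalized Gamma density with shape 2 and rate 1), and the proposal is a log-normal random walk with width `s`, which is asymmetric. The names `s`, `M`, and `sampleMean` are likewise chosen here for the example.

```matlab
% Minimal Metropolis-Hastings sketch (base MATLAB, no toolboxes).
% ASSUMED for illustration: target p(x) = x*exp(-x) on x > 0, and a
% log-normal random-walk proposal centered (on the log scale) at x(t-1).
rng(1);                                   % fix the seed for repeatability
p = @(x) x .* exp(-x);                    % unnormalized target density
s = 0.8;                                  % proposal width on the log scale

% Proposal density q(xNew | xOld), up to a constant that cancels in c
q = @(xNew, xOld) exp(-(log(xNew) - log(xOld)).^2 ./ (2*s^2)) ./ xNew;

M = 20000;                                % number of samples to draw
x = zeros(1, M);
x(1) = 1;                                 % initial state
for t = 2:M
    xStar = x(t-1) * exp(s * randn);      % draw proposal from q(. | x(t-1))
    c = q(x(t-1), xStar) / q(xStar, x(t-1));       % correction factor
    alpha = min(1, p(xStar) / p(x(t-1)) * c);      % acceptance probability
    if rand <= alpha
        x(t) = xStar;                     % accept the proposal
    else
        x(t) = x(t-1);                    % reject: repeat the previous state
    end
end
sampleMean = mean(x(1001:end));           % discard the first 1000 as burn-in
```

For this assumed target the true mean is 2, so `sampleMean` should land close to that value. Dropping the factor `c` from `alpha` would bias the chain, because this proposal favors some moves over their reverses.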

Many consider the Metropolis-Hastings algorithm to be a generalization of the Metropolis algorithm. This is because when the proposal distribution is symmetric, the correction factor $c$ is equal to one, giving the transition operator for the Metropolis sampler.

## Example: Sampling from a Bayesian posterior with improper prior

For a number of applications, including regression and density estimation, it is usually necessary to determine a set of parameters $\theta$ to an assumed model $p(y | \theta)$ such that the model can best account for some observed data $y$. The model function $p(y | \theta)$ is often referred to as the likelihood function. In Bayesian methods there is often an explicit prior distribution $p(\theta)$ that is placed on the model parameters and controls the values that the parameters can take.

The parameters are determined based on the posterior distribution $p(\theta | y)$, which is a probability distribution over the possible parameters based on the observed data. The posterior can be determined using Bayes’ theorem:

$$p(\theta | y) = \frac{p(y | \theta) p(\theta)}{p(y)}$$

where $p(y)$ is a normalization constant that is often quite difficult to determine explicitly, as it involves summing (or integrating) $p(y | \theta) p(\theta)$ over every possible value that the parameters $\theta$ can take.

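Let’s say that we assume the following model (likelihood function):

$$p(y | \theta) = \text{Gamma}(y; A, B)$$

where

$$\text{Gamma}(y; A, B) = \frac{B^A}{\Gamma(A)} y^{A-1} e^{-By}$$

and $\Gamma(\cdot)$ is the gamma function. Thus, the model parameters are

$$\theta = [A, B]$$

The parameter $A$ controls the shape of the distribution, and $B$ controls the scale (in this parameterization $B$ acts as a rate, i.e. an inverse scale). The likelihood surface for $B = 1$ and a number of values of $A$ ranging from zero to five is shown below.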

The conditional distribution $p(y | A = 2, B = 1)$ is plotted in green along the likelihood surface. You can verify this is a valid conditional in MATLAB with the following command:

    plot(0:.1:10,gampdf(0:.1:10,2,1)); % GAMMA(2,1)

Now, let’s assume the following priors on the model parameters:

$$p(B = 1) = 1$$

and

$$p(A) = \sin(\pi A)^2$$

The first prior states that $B$ only takes a single value (i.e. 1), therefore we can treat it as a constant. The second (rather non-conventional) prior states that the probability of $A$ varies as a sinusoidal function. (Note that the prior on $A$ is called an *improper prior* because it does not integrate to one, or to any finite value, over $A > 0$.) Because $B$ is constant, we only need to estimate the value of $A$.

It turns out that even though the normalization constant $p(y)$ may be difficult to compute, we can sample from $p(A | y)$ without knowing $p(y)$ using the Metropolis-Hastings algorithm. In particular, we can ignore the normalization constant $p(y)$ and sample from the unnormalized posterior:

$$p(A | y) \propto p(y | A) p(A)$$

The surface of the (unnormalized) posterior for $y$ ranging from zero to ten is shown below. The prior $p(A)$ is displayed in blue on the right of the plot. Let’s say that we have a datapoint $y = 1.5$ and would like to estimate the posterior distribution $p(A | y = 1.5)$ using the Metropolis-Hastings algorithm. This particular target distribution is plotted in magenta in the plot below.
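As a rough illustration of what ignoring the normalization constant buys us, the sketch below evaluates the unnormalized posterior for $y = 1.5$ on a grid using only base MATLAB functions, then normalizes it numerically. The grid limits (0.1 to 5), the grid size, and the variable names are choices made here for illustration; because the grid is truncated, `Z` only approximates the constant on that interval.

```matlab
% Unnormalized posterior p(A | y), proportional to p(y | A, B) * p(A),
% evaluated on a grid of A values (grid limits are an ASSUMED choice).
B = 1;                                        % fixed by the prior on B
y = 1.5;                                      % the observed datapoint
likelihood = @(y, A, B) (B.^A ./ gamma(A)) .* y.^(A-1) .* exp(-B.*y);
prior      = @(A) sin(pi*A).^2;

A = linspace(0.1, 5, 500);
unnorm = likelihood(y, A, B) .* prior(A);     % unnormalized posterior
Z = trapz(A, unnorm);                         % numerical constant on [0.1, 5]
posterior = unnorm / Z;                       % integrates to 1 on the grid

% Ratios between two grid points are the same with or without Z
ratioUnnorm = unnorm(50) / unnorm(250);
ratioNorm   = posterior(50) / posterior(250);
```

Since the sampler only ever uses ratios like `ratioUnnorm`, it never needs `Z`. In one dimension the grid integral is cheap, but it becomes impractical as the number of parameters grows.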

Using a symmetric proposal distribution like the Normal distribution is not efficient for sampling from $p(A | y)$ because the posterior only has support on the positive real numbers $A \in [0, \infty)$. An asymmetric proposal distribution with the same support would provide better coverage of the posterior. One distribution that operates on the positive real numbers is the exponential distribution,

$$q(A) = \text{Exp}(\mu) = \frac{1}{\mu} e^{-A/\mu}$$

This distribution is parameterized by a single variable $\mu$ that controls the scale and location of the distribution's probability mass. Written this way, $\mu$ is the mean, which matches the convention of MATLAB's `exppdf(x,mu)` and `exprnd(mu)`. The target posterior and a proposal distribution (for $\mu = 5$) are shown in the plot below.

We see that the proposal has fairly good coverage of the posterior distribution. We run the Metropolis-Hastings sampler in the block of MATLAB code at the bottom. The Markov chain path and the resulting samples are shown in the plot below.

As an aside, note that the proposal distribution for this sampler does not depend on past samples, but only on the parameter $\mu$ (see the `xStar = exprnd(mu);` line in the MATLAB code below). Each proposal state is drawn independently of the previous state. Therefore this is an example of an **independence sampler**, a specific type of Metropolis-Hastings sampling algorithm. Independence samplers are notorious for being either very good or very poor sampling routines. The quality of the routine depends on the choice of the proposal distribution and its coverage of the target distribution. Identifying such a proposal distribution is often difficult in practice.
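For an independence sampler with an exponential proposal, the correction factor has a simple closed form, which the following base-MATLAB sketch checks numerically. The current state `xPrev` is a made-up value, and drawing the proposal with `-mu*log(rand)` (the inverse-CDF method) is a toolbox-free stand-in used here in place of `exprnd(mu)`.

```matlab
% Correction factor for an independence sampler with an Exp(mean mu) proposal.
rng(2);                                % fix the seed for repeatability
mu = 5;                                % proposal mean
q  = @(x) exp(-x ./ mu) ./ mu;         % exponential density with mean mu

xPrev = 1.3;                           % HYPOTHETICAL current state
xStar = -mu * log(rand);               % inverse-CDF draw; ignores xPrev

c       = q(xPrev) / q(xStar);         % correction factor from the densities
cClosed = exp((xStar - xPrev) / mu);   % the same ratio, simplified by hand
```

Because `q` does not condition on the current state, `c` is just a ratio of the proposal density at the two points. It exceeds one when the proposal lies further out in the tail than the current state.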

The MATLAB code for running the Metropolis-Hastings sampler is below. Paste it into a MATLAB session to output the figures above.

    % METROPOLIS-HASTINGS BAYESIAN POSTERIOR
    % (gampdf-style helpers exppdf and exprnd need the Statistics Toolbox)
    rng(12345);   % seed the random number generator

    % PRIOR OVER SCALE PARAMETERS
    B = 1;

    % DEFINE LIKELIHOOD
    likelihood = @(y,A,B) (B.^A./gamma(A)).*y.^(A-1).*exp(-(B.*y));

    % CALCULATE AND VISUALIZE THE LIKELIHOOD SURFACE
    yy = linspace(0,10,100);
    AA = linspace(0.1,5,100);
    likeSurf = zeros(numel(yy),numel(AA));
    for iA = 1:numel(AA)
        likeSurf(:,iA) = likelihood(yy(:),AA(iA),B);
    end

    figure;
    surf(likeSurf); ylabel('y'); xlabel('A'); zlabel('p(y|A)');
    colormap hot

    % DISPLAY CONDITIONAL AT A = 2 (AA(40) is approximately 2)
    hold on;
    ly = plot3(ones(1,numel(AA))*40,1:100,likeSurf(:,40),'g','linewidth',3);
    xlim([0 100]); ylim([0 100]); axis normal
    set(gca,'XTick',[0,100]); set(gca,'XTickLabel',[0 5]);
    set(gca,'YTick',[0,100]); set(gca,'YTickLabel',[0 10]);
    view(65,25)
    legend(ly,'p(y|A=2)','Location','Northeast');
    hold off;
    title('p(y|A)');

    % DEFINE PRIOR OVER SHAPE PARAMETERS
    prior = @(A) sin(pi*A).^2;

    % DEFINE THE (UNNORMALIZED) POSTERIOR
    p = @(y,A,B) (B.^A./gamma(A)).*y.^(A-1).*exp(-(B.*y)).*sin(pi*A).^2;

    % CALCULATE AND DISPLAY THE POSTERIOR SURFACE
    postSurf = zeros(size(likeSurf));
    for iA = 1:numel(AA)
        postSurf(:,iA) = p(yy(:),AA(iA),B);
    end

    figure
    surf(postSurf); ylabel('y'); xlabel('A'); colormap hot

    % DISPLAY THE PRIOR
    hold on;
    pA = plot3(1:100,ones(1,numel(AA))*100,prior(AA),'b','linewidth',3);

    % SAMPLE FROM p(A | y = 1.5) (yy(16) is approximately 1.5)
    y = 1.5;
    target = postSurf(16,:);

    % DISPLAY POSTERIOR
    psA = plot3(1:100,ones(1,numel(AA))*16,postSurf(16,:),'m','linewidth',3);
    xlim([0 100]); ylim([0 100]); axis normal
    set(gca,'XTick',[0,100]); set(gca,'XTickLabel',[0 5]);
    set(gca,'YTick',[0,100]); set(gca,'YTickLabel',[0 10]);
    view(65,25)
    legend([pA,psA],{'p(A)','p(A|y = 1.5)'},'Location','Northeast');
    hold off
    title('p(A|y)');

    % INITIALIZE THE METROPOLIS-HASTINGS SAMPLER
    % DEFINE PROPOSAL DENSITY
    q = @(x,mu) exppdf(x,mu);

    % MEAN FOR PROPOSAL DENSITY
    mu = 5;

    % DISPLAY TARGET AND PROPOSAL
    figure; hold on;
    th = plot(AA,target,'m','Linewidth',2);
    qh = plot(AA,q(AA,mu),'k','Linewidth',2);
    legend([th,qh],{'Target, p(A|y)','Proposal, q(A)'});
    xlabel('A');

    % SOME CONSTANTS
    nSamples = 5000;
    burnIn = 500;
    minn = 0.1; maxx = 5;

    % INITIALIZE SAMPLER
    x = zeros(1,nSamples);
    x(1) = mu;
    t = 1;

    % RUN METROPOLIS-HASTINGS SAMPLER
    while t < nSamples
        t = t+1;

        % SAMPLE FROM PROPOSAL (INDEPENDENT OF x(t-1))
        xStar = exprnd(mu);

        % CORRECTION FACTOR
        c = q(x(t-1),mu)/q(xStar,mu);

        % CALCULATE THE (CORRECTED) ACCEPTANCE RATIO
        alpha = min([1, p(y,xStar,B)/p(y,x(t-1),B)*c]);

        % ACCEPT OR REJECT?
        u = rand;
        if u < alpha
            x(t) = xStar;
        else
            x(t) = x(t-1);
        end
    end

    % DISPLAY MARKOV CHAIN
    figure;
    subplot(211);
    stairs(x(1:t),1:t,'k');
    hold on;
    hb = plot([0 maxx/2],[burnIn burnIn],'g--','Linewidth',2);
    ylabel('t'); xlabel('samples, A');
    set(gca,'YDir','reverse');
    ylim([0 t])
    axis tight;
    xlim([0 maxx]);
    title('Markov Chain Path');
    legend(hb,'Burnin');

    % DISPLAY SAMPLES (BURN-IN REMOVED)
    subplot(212);
    nBins = 100;
    sampleBins = linspace(minn,maxx,nBins);
    counts = hist(x(burnIn+1:end),sampleBins);
    bar(sampleBins,counts/sum(counts),'k');
    xlabel('samples, A'); ylabel('p(A | y)');
    title('Samples');
    xlim([0 10])

    % OVERLAY TARGET DISTRIBUTION
    hold on;
    plot(AA,target/sum(target),'m-','LineWidth',2);
    legend('Sampled Distribution','Target Posterior')
    axis tight

## Wrapping Up

Here we explored how the Metropolis-Hastings sampling algorithm generalizes the Metropolis algorithm in order to sample from complex (and unnormalized) probability distributions using asymmetric proposal distributions. One shortcoming of the Metropolis-Hastings algorithm is that not all of the proposed samples are accepted, wasting valuable computational resources. This becomes even more of an issue for sampling distributions in higher dimensions. This is where Gibbs sampling comes in. We’ll see in a later post that Gibbs sampling can be used to keep all proposal states in the Markov chain by taking advantage of conditional probabilities.
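One simple way to quantify how many proposals are wasted is the acceptance rate of a chain. The sketch below estimates it from a short, made-up chain (the values in `x` are purely illustrative). It assumes a continuous proposal, so that an accepted move practically always changes the state and a repeated value marks a rejection.

```matlab
% Estimate the acceptance rate from a stored chain (illustrative values).
x = [5 5 2.1 2.1 2.1 3.4 1.0 1.0];     % HYPOTHETICAL chain of 8 states
moved = diff(x) ~= 0;                  % true where the state changed
accRate = mean(moved);                 % fraction of accepted proposals
% Here 3 of the 7 transitions changed the state, so accRate equals 3/7.
```

The same two lines can be applied to the vector `x` produced by either sampler in these posts.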

## MCMC: The Metropolis Sampler

As discussed in an earlier post, we can use a Markov chain to sample from some **target probability distribution** $p(x)$ from which drawing samples directly is difficult. To do so, it is necessary to design a transition operator for the Markov chain which makes the chain’s stationary distribution match the target distribution. The Metropolis sampling algorithm (and the more general Metropolis-Hastings sampling algorithm) uses simple heuristics to implement such a transition operator.

## Metropolis Sampling

Starting from some random initial state $x^{(0)} \sim \pi^{(0)}$, the algorithm first draws a possible sample $x^*$ from a **proposal distribution** $q(x | x^{(t-1)})$. Much like a conventional transition operator for a Markov chain, the proposal distribution depends only on the previous state in the chain. However, the transition operator for the Metropolis algorithm has an additional step that assesses whether or not the target distribution $p(x)$ has a sufficiently large density near the proposed state to warrant accepting the proposed state as a sample and setting it to the next state in the chain. If the density of $p(x)$ is low near the proposed state, then it is likely (but not guaranteed) that it will be rejected. The criteria for accepting or rejecting a proposed state are defined by the following heuristics:

1. If $p(x^*) \geq p(x^{(t-1)})$, the proposed state is kept as a sample and is set as the next state in the chain (i.e. move the chain’s state to a location where $p(x)$ has equal or greater density).
2. If $p(x^*) < p(x^{(t-1)})$, indicating that $p(x)$ has low density near $x^*$, then the proposed state may still be accepted, but only randomly, and with a probability $\frac{p(x^*)}{p(x^{(t-1)})}$.

These heuristics can be instantiated by calculating the **acceptance probability** $\alpha$ for the proposed state:

$$\alpha = \min\left(1, \frac{p(x^*)}{p(x^{(t-1)})}\right)$$
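A small numeric check may make the two heuristics concrete. The sketch below computes the acceptance probability as the minimum of one and the ratio of target densities at the proposed and current states. It borrows the unnormalized target $(1 + x^2)^{-1}$ from the example further down, and the current and proposed states are values picked here for illustration.

```matlab
% Acceptance probability for an uphill and a downhill proposal.
p = @(x) 1 ./ (1 + x.^2);                  % unnormalized target density
xPrev = 1;                                 % HYPOTHETICAL current state, p = 0.5

xUp = 0.5;                                 % proposal with higher density, p = 0.8
alphaUp = min(1, p(xUp) / p(xPrev));       % ratio is 1.6, so alpha is capped at 1

xDown = 3;                                 % proposal with lower density, p = 0.1
alphaDown = min(1, p(xDown) / p(xPrev));   % ratio is 0.2, so alpha is 0.2
```

The uphill proposal is always accepted, while the downhill proposal is accepted about one time in five.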

Having the acceptance probability in hand, the transition operator for the Metropolis algorithm works like this: if a random uniform number $u$ is less than or equal to $\alpha$, then the state $x^*$ is accepted (as in (1) above); if not, it is rejected, the chain stays at its current state, and another state is proposed on the next iteration (as in (2) above). In order to collect samples using Metropolis sampling we run the following algorithm:

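- set t = 0
- generate an initial state $x^{(0)}$ from a prior distribution $\pi^{(0)}$ over initial states
- repeat until $t = M$, the desired number of samples
    - set $t = t+1$
    - generate a proposal state $x^*$ from $q(x | x^{(t-1)})$
    - calculate the acceptance probability $\alpha = \min\left(1, \frac{p(x^*)}{p(x^{(t-1)})}\right)$
    - draw a random number $u$ from $\text{Unif}(0,1)$
    - if $u \leq \alpha$, accept the proposal and set $x^{(t)} = x^*$
    - else set $x^{(t)} = x^{(t-1)}$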

### Example: Using the Metropolis algorithm to sample from an unknown distribution

Say that we have some mysterious function

$$p(x) = (1 + x^2)^{-1}$$

from which we would like to draw samples. To do so using Metropolis sampling we need to define two things: (1) the prior distribution $\pi^{(0)}$ over the initial state of the Markov chain, and (2) the proposal distribution $q(x | x^{(t-1)})$. For this example we define:

$$\pi^{(0)} \sim \mathcal{N}(0, 1)$$

$$q(x | x^{(t-1)}) \sim \mathcal{N}(x^{(t-1)}, 1)$$

both of which are simply Normal distributions, one centered at zero, the other centered at the previous state of the chain. The following chunk of MATLAB code runs the Metropolis sampler with this proposal distribution and prior.

    % METROPOLIS SAMPLING EXAMPLE
    % (normrnd, normpdf and tpdf need the Statistics Toolbox)
    rng(12345);   % seed the random number generator

    % DEFINE THE TARGET DISTRIBUTION
    p = @(x) (1 + x.^2).^-1;

    % SOME CONSTANTS
    nSamples = 5000;
    burnIn = 500;
    nDisplay = 30;
    sigma = 1;
    minn = -20; maxx = 20;
    xx = 3*minn:.1:3*maxx;
    target = p(xx);
    pauseDur = .8;

    % INITIALIZE SAMPLER
    x = zeros(1,nSamples);
    x(1) = randn;
    t = 1;

    % RUN SAMPLER
    while t < nSamples
        t = t+1;

        % SAMPLE FROM PROPOSAL
        xStar = normrnd(x(t-1),sigma);
        proposal = normpdf(xx,x(t-1),sigma);

        % CALCULATE THE ACCEPTANCE PROBABILITY
        alpha = min([1, p(xStar)/p(x(t-1))]);

        % ACCEPT OR REJECT?
        u = rand;
        if u < alpha
            x(t) = xStar;
            str = 'Accepted';
        else
            x(t) = x(t-1);
            str = 'Rejected';
        end

        % DISPLAY SAMPLING DYNAMICS
        if t < nDisplay + 1
            figure(1);
            subplot(211);
            cla
            plot(xx,target,'k');
            hold on;
            plot(xx,proposal,'r');
            line([x(t-1),x(t-1)],[0 p(x(t-1))],'color','b','linewidth',2)
            scatter(xStar,0,'ro','Linewidth',2)
            line([xStar,xStar],[0 p(xStar)],'color','r','Linewidth',2)
            plot(x(1:t),zeros(1,t),'ko')
            legend({'Target','Proposal','p(x^{(t-1)})','x^*','p(x^*)','Kept Samples'})

            switch str
                case 'Rejected'
                    scatter(xStar,p(xStar),'rx','Linewidth',3)
                case 'Accepted'
                    scatter(xStar,p(xStar),'rs','Linewidth',3)
            end
            scatter(x(t-1),p(x(t-1)),'bo','Linewidth',3)
            title(sprintf('Sample %d: %s',t,str))
            xlim([minn,maxx])

            subplot(212);
            hist(x(1:t),50); colormap hot;
            xlim([minn,maxx])
            title(['Sample ',str]);
            drawnow
            pause(pauseDur);
        end
    end

    % DISPLAY MARKOV CHAIN
    figure(1); clf
    subplot(211);
    stairs(x(1:t),1:t,'k');
    hold on;
    hb = plot([-10 10],[burnIn burnIn],'b--');
    ylabel('t'); xlabel('samples, x');
    set(gca,'YDir','reverse');
    ylim([0 t])
    axis tight;
    xlim([-10 10]);
    title('Markov Chain Path');
    legend(hb,'Burnin');

    % DISPLAY SAMPLES (BURN-IN REMOVED)
    subplot(212);
    nBins = 200;
    sampleBins = linspace(minn,maxx,nBins);
    counts = hist(x(burnIn+1:end),sampleBins);
    bar(sampleBins,counts/sum(counts),'k');
    xlabel('samples, x'); ylabel('p(x)');
    title('Samples');

    % OVERLAY ANALYTIC DENSITY OF STUDENT T
    nu = 1;
    y = tpdf(sampleBins,nu);
    hold on;
    plot(sampleBins,y/sum(y),'r-','LineWidth',2);
    legend('Samples',sprintf('Theoretic\nStudent''s t'))
    axis tight
    xlim([-10 10]);

In the figure above, we visualize the first 50 iterations of the Metropolis sampler. The black curve represents the target distribution $p(x)$. The red curve that is bouncing about the x-axis is the proposal distribution $q(x | x^{(t-1)})$. The vertical blue line (about which the bouncing proposal distribution is centered) represents the quantity $p(x^{(t-1)})$, and the vertical red line represents the quantity $p(x^*)$, for a proposal state $x^*$ sampled according to the red curve. At every iteration, if the vertical red line is longer than the blue line, then the sample is accepted, and the proposal distribution becomes centered about the newly accepted sample. If the blue line is longer, the sample is randomly rejected or accepted.

But why randomly keep “bad” proposal samples? It turns out that doing this allows the Markov chain to every-so-often visit states of low probability under the target distribution. This is a desirable property if we want the chain to adequately sample the entire target distribution, including any tails.

An attractive property of the Metropolis algorithm is that the target distribution $p(x)$ does not have to be a properly normalized probability distribution. This is due to the fact that the acceptance probability is based on the ratio of two values of the target distribution. I’ll show you what I mean. If $p(x)$ is an unnormalized distribution and

$$p^*(x) = \frac{p(x)}{Z}$$

is a properly normalized probability distribution with normalizing constant $Z$, then

$$p(x) = Z p^*(x)$$

and a ratio like that used in calculating the acceptance probability $\alpha$ is

$$\frac{p(a)}{p(b)} = \frac{Z p^*(a)}{Z p^*(b)} = \frac{p^*(a)}{p^*(b)}$$

The normalizing constants $Z$ cancel! This attractive property is quite useful in the context of Bayesian methods, where determining the normalizing constant for a distribution may be impractical to calculate directly. This property is demonstrated in the current example. It turns out that the “mystery” distribution that we sampled from using the Metropolis algorithm is an unnormalized form of the Student’s-t distribution with one degree of freedom. Comparing $p(x)$ to the definition of the Student’s-t

$$\text{Student}(x, \nu) = \frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\sqrt{\nu\pi}\,\Gamma\left(\frac{\nu}{2}\right)} \left(1 + \frac{x^2}{\nu}\right)^{-\frac{\nu+1}{2}}$$

we see that $p(x)$ is a Student’s-t distribution with degrees of freedom $\nu = 1$, but missing the normalizing constant

$$Z = \left(\frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\sqrt{\nu\pi}\,\Gamma\left(\frac{\nu}{2}\right)}\right)^{-1} = \pi \quad \text{for } \nu = 1$$
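The cancellation is easy to confirm numerically. The sketch below, which uses only base MATLAB, compares density ratios for the unnormalized target $(1 + x^2)^{-1}$ and for the same function divided by $\pi$ (the Student’s-t density with one degree of freedom). The two test points `a` and `b` and the integration range are arbitrary choices made here.

```matlab
% The normalizing constant cancels in any ratio of target values.
p     = @(x) 1 ./ (1 + x.^2);          % unnormalized target
Z     = pi;                            % normalizing constant for nu = 1
pNorm = @(x) p(x) ./ Z;                % normalized Student's-t, nu = 1

a = 0.7;  b = -2.4;                    % two arbitrary states
ratioUnnorm = p(a) / p(b);
ratioNorm   = pNorm(a) / pNorm(b);     % identical up to rounding

% Sanity check: pNorm integrates to (nearly) one over a wide range;
% the small shortfall is the mass in the heavy tails beyond +/-1000.
xx   = linspace(-1000, 1000, 200001);
area = trapz(xx, pNorm(xx));
```

Since the sampler only ever evaluates ratios like `ratioUnnorm`, running it with `p` or with `pNorm` gives the same chain for the same random numbers.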

Below is additional output from the code above showing that the Metropolis sampler draws samples that follow a *normalized* Student’s-t distribution, even though $p(x)$ is not normalized.

The upper plot shows the progression of the Markov chain from state $x^{(0)}$ (top) to state $x^{(5000)}$ (bottom). The burn-in period for this chain was chosen to be 500 transitions, and is indicated by the dashed blue line (for more on burn-in see this previous post).

The bottom plot shows samples from the Markov chain in black (with burn-in samples removed). The theoretical curve for the Student’s-t with one degree of freedom is overlaid in red. We see that the states kept by the Metropolis sampler transition operator follow the Student’s-t, even though the function used in the transition operator was not a properly normalized probability distribution.

## Reversibility of the transition operator

It turns out that there is a theoretical constraint on the Markov chain’s transition operator in order for it to settle into a stationary distribution (i.e. a target distribution we care about). The constraint states that the probability of the transition $x^{(t)} \rightarrow x^{(t+1)}$ must be equal to the probability of the reverse transition $x^{(t+1)} \rightarrow x^{(t)}$. This reversibility property is often referred to as *detailed balance*. Using the Metropolis algorithm transition operator, reversibility is assured if the proposal distribution $q(x | x^{(t-1)})$ is symmetric. Examples of symmetric proposal distributions are the Normal, Cauchy, Student’s-t, and Uniform distributions, when centered on the current state.
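Detailed balance can be checked directly on a toy problem. The following sketch builds the Metropolis transition matrix for a three-state discrete target with a symmetric (uniform) proposal, then verifies that the probability flow between every pair of states is the same in both directions. The target weights and all variable names are illustrative choices, not part of the example in this post.

```matlab
% Detailed-balance check for a Metropolis chain on 3 discrete states.
p = [1 4 2];                        % ILLUSTRATIVE unnormalized target
n = numel(p);
Q = ones(n) / n;                    % symmetric proposal: Q(i,j) = Q(j,i)

T = zeros(n);                       % transition matrix, rows sum to one
for i = 1:n
    for j = 1:n
        if i ~= j
            % propose j from i, then accept with probability min(1, p(j)/p(i))
            T(i,j) = Q(i,j) * min(1, p(j) / p(i));
        end
    end
    T(i,i) = 1 - sum(T(i,:));       % self-proposals and rejections stay at i
end

flow = diag(p) * T;                 % flow(i,j) = p(i) * T(i,j)
balanceGap = max(max(abs(flow - flow')));        % zero under detailed balance

piStat = p / sum(p);                % normalized target
stationaryGap = max(abs(piStat * T - piStat));   % zero if piStat is stationary
```

Both gaps come out at rounding-error level, showing that the normalized target is left unchanged by the transition operator even though only the unnormalized weights `p` were used to build it.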

However, using a symmetric proposal distribution may not be reasonable to adequately or efficiently sample all possible target distributions. For instance, if a target distribution is bounded on the positive numbers $0 \leq x < \infty$, we would like to use a proposal distribution that has the same support, and will thus be asymmetric. This is where the **Metropolis-Hastings** sampling algorithm comes in. We will discuss in a later post how the Metropolis-Hastings sampler uses a simple change to the calculation of the acceptance probability which allows us to use non-symmetric proposal distributions.