# Monthly Archives: October 2012

## MCMC: The Metropolis-Hastings Sampler

In an earlier post we discussed how the Metropolis sampling algorithm can draw samples from a complex and/or unnormalized target probability distribution using a Markov chain. The Metropolis algorithm first proposes a possible new state $x^*$ in the Markov chain, based on a previous state $x^{(t-1)}$, according to the proposal distribution $q(x^* | x^{(t-1)})$. The algorithm accepts or rejects the proposed state based on the density of the target distribution $p(x)$ evaluated at $x^*$. (If any of this Markov-speak is gibberish to the reader, please refer to the previous posts on Markov Chains, MCMC, and the Metropolis Algorithm for some clarification.)

One constraint of the Metropolis sampler is that the proposal distribution $q(x^* | x^{(t-1)})$ must be symmetric. The constraint originates from using a Markov chain to draw samples: a necessary condition for drawing from a Markov chain's stationary distribution is that at any given point in time $t$, the probability of moving from $x^{(t-1)} \rightarrow x^{(t)}$ must be equal to the probability of moving from $x^{(t)} \rightarrow x^{(t-1)}$, a condition known as reversibility or detailed balance. However, a symmetric proposal distribution may be ill-fit for many problems, such as sampling from distributions that are bounded on semi-infinite intervals (e.g. $[0, \infty)$).

In order to be able to use asymmetric proposal distributions, the Metropolis-Hastings algorithm implements an additional correction factor $c$, defined from the proposal distribution as

$c = \frac{q(x^{(t-1)} | x^*) }{q(x^* | x^{(t-1)})}$

The correction factor adjusts the transition operator to ensure that the probability of moving from $x^{(t-1)} \rightarrow x^{(t)}$ is equal to the probability of moving from $x^{(t)} \rightarrow x^{(t-1)}$, no matter the proposal distribution.

The Metropolis-Hastings algorithm is implemented with essentially the same procedure as the Metropolis sampler, except that the correction factor is used in the evaluation of the acceptance probability $\alpha$. Specifically, to draw $M$ samples using the Metropolis-Hastings sampler:

1. set $t = 0$
2. generate an initial state $x^{(0)} \sim \pi^{(0)}$
3. repeat until $t = M$:
   - set $t = t+1$
   - generate a proposal state $x^*$ from $q(x | x^{(t-1)})$
   - calculate the proposal correction factor $c = \frac{q(x^{(t-1)} | x^*) }{q(x^*|x^{(t-1)})}$
   - calculate the acceptance probability $\alpha = \min \left (1,\frac{p(x^*)}{p(x^{(t-1)})} \times c\right )$
   - draw a random number $u$ from $\text{Unif}(0,1)$
   - if $u \leq \alpha$, accept the proposal state $x^*$ and set $x^{(t)}=x^*$
   - else, set $x^{(t)} = x^{(t-1)}$
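The steps above can be sketched as a short, generic routine. The code in this post is MATLAB, but for readers following along in another language, here is an illustrative Python translation (function and variable names are my own), demonstrated on a simple symmetric-proposal case where the correction factor is identically one:

```python
import math
import random

def metropolis_hastings(p, q_draw, q_pdf, x0, n_samples):
    """Metropolis-Hastings: draw n_samples from the (possibly unnormalized) target p.

    q_draw(x_prev)      -- samples a proposal state x* given the previous state
    q_pdf(x_new, x_old) -- evaluates the proposal density q(x_new | x_old)
    """
    samples = [x0]
    for _ in range(n_samples - 1):
        x_prev = samples[-1]
        x_star = q_draw(x_prev)
        # correction factor c = q(x^(t-1) | x*) / q(x* | x^(t-1))
        c = q_pdf(x_prev, x_star) / q_pdf(x_star, x_prev)
        # corrected acceptance probability
        alpha = min(1.0, p(x_star) / p(x_prev) * c)
        if random.random() <= alpha:
            samples.append(x_star)   # accept: the proposal becomes the next state
        else:
            samples.append(x_prev)   # reject: repeat the previous state
    return samples

# demo: a symmetric Normal proposal, for which c is identically 1
random.seed(0)
target = lambda x: math.exp(-0.5 * x * x)               # unnormalized N(0, 1)
q_draw = lambda xp: random.gauss(xp, 1.0)
q_pdf = lambda xn, xo: math.exp(-0.5 * (xn - xo) ** 2)  # symmetric in its arguments
xs = metropolis_hastings(target, q_draw, q_pdf, 0.0, 5000)
```

Because `q_pdf` here is symmetric in its arguments, `c` is always one and the routine reduces to the plain Metropolis sampler; using an asymmetric proposal is just a matter of passing a different `q_draw`/`q_pdf` pair.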

Many consider the Metropolis-Hastings algorithm to be a generalization of the Metropolis algorithm. This is because when the proposal distribution is symmetric, the correction factor is equal to one, giving the transition operator for the Metropolis sampler.

## Example: Sampling from a Bayesian posterior with improper prior

For a number of applications, including regression and density estimation, it is usually necessary to determine a set of parameters $\theta$ to an assumed model $p(y | \theta)$ such that the model can best account for some observed data $y$. The model function $p(y | \theta)$ is often referred to as the likelihood function. In Bayesian methods there is often an explicit prior distribution $p(\theta)$ that is placed on the model parameters and controls the values that the parameters can take.

The parameters are determined based on the posterior distribution $p(\theta | y)$, which is a probability distribution over the possible parameters based on the observed data. The posterior can be determined using Bayes’ theorem:

$p(\theta | y) = \frac{p(y | \theta) p(\theta)}{p(y)}$

where $p(y)$ is a normalization constant that is often quite difficult to determine explicitly, as it involves computing sums or integrals over every possible value that the parameters can take.

Let’s say that we assume the following model (likelihood function):

$p(y | \theta) = \text{Gamma}(y;A,B)$, where

$\text{Gamma}(y;A,B) = \frac{B^A}{\Gamma(A)} y^{A-1}e^{-By}$

and $\Gamma(\cdot)$ is the gamma function. Thus, the model parameters are

$\theta = [A,B]$

The parameter $A$ controls the shape of the distribution, and $B$ controls the scale. The likelihood surface for $B = 1$ and a number of values of $A$ ranging from zero to five is shown below.

Likelihood surface and conditional probability p(y|A=2,B=1) in green

The conditional distribution $p(y | A=2, B = 1)$ is plotted in green along the likelihood surface. You can verify this is a valid conditional in MATLAB with the following command:

 plot(0:.1:10,gampdf(0:.1:10,2,1)); % GAMMA(2,1)
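For readers without MATLAB, a similar sanity check can be sketched in Python using the rate parameterization defined above (the helper name `gamma_pdf` is mine): evaluating the density at a point and crudely integrating it over $y$ confirms that the conditional is a valid density.

```python
import math

def gamma_pdf(y, A, B):
    """Gamma density in the rate parameterization used above."""
    return B ** A / math.gamma(A) * y ** (A - 1) * math.exp(-B * y)

# spot-check: for A = 2, B = 1 the density is y * exp(-y)
val = gamma_pdf(1.0, 2.0, 1.0)

# crude rectangle-rule integration over y in (0, 50] -- should be close to 1
dy = 0.001
total = sum(gamma_pdf(i * dy, 2.0, 1.0) for i in range(1, 50001)) * dy
```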

Now, let’s assume the following priors on the model parameters:

$p(B = 1) = 1$

and

$p(A) = \sin^2(\pi A)$

The first prior states that $B$ only takes a single value (i.e. 1); therefore we can treat it as a constant. The second (rather unconventional) prior states that the probability of $A$ varies as a sinusoidal function. (Note that both of these prior distributions are called improper priors because they do not integrate to one.) Because $B$ is constant, we only need to estimate the value of $A$.

It turns out that even though the normalization constant $p(y)$ may be difficult to compute, we can sample from $p(A | y)$ without knowing $p(y)$ using the Metropolis-Hastings algorithm. In particular, we can ignore the normalization constant and sample from the unnormalized posterior:

$p(A | y) \propto p(y |A) p(A)$
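A quick way to get a feel for this unnormalized posterior is to code it directly (an illustrative Python sketch; the function name is mine). Note that the $\sin^2(\pi A)$ prior forces the posterior to zero at every integer value of $A$:

```python
import math

def unnorm_posterior(A, y, B=1.0):
    """Unnormalized p(A | y) = Gamma(y; A, B) * sin^2(pi * A), with B fixed at 1."""
    likelihood = B ** A / math.gamma(A) * y ** (A - 1) * math.exp(-B * y)
    prior = math.sin(math.pi * A) ** 2
    return likelihood * prior

y = 1.5
between = unnorm_posterior(2.5, y)     # positive density between integers
at_integer = unnorm_posterior(2.0, y)  # the sin^2 prior zeroes out integer A
```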

The surface of the (unnormalized) posterior for $y$ ranging from zero to ten is shown below. The prior $p(A)$ is displayed in blue on the right of the plot. Let’s say that we have a datapoint $y = 1.5$ and would like to estimate the posterior distribution $p(A|y=1.5)$ using the Metropolis-Hastings algorithm. This particular target distribution is plotted in magenta in the plot below.

Posterior surface, prior distribution (blue), and target distribution (pink)

Using a symmetric proposal distribution like the Normal distribution is inefficient for sampling from $p(A|y=1.5)$ because the posterior only has support on the positive real numbers $A \in [0 ,\infty)$. An asymmetric proposal distribution with the same support would provide better coverage of the posterior. One distribution that operates on the positive real numbers is the exponential distribution:

$q(A) = \text{Exp}(\mu) = \frac{1}{\mu} e^{-A/\mu}$,

This distribution is parameterized by a single variable $\mu$, the mean, which controls the scale of the distribution’s probability mass (this is the parameterization used by MATLAB’s exppdf). The target posterior and a proposal distribution (for $\mu = 5$) are shown in the plot below.

Target posterior p(A|y) and proposal distribution q(A)

We see that the proposal has a fairly good coverage of the posterior distribution. We run the Metropolis-Hastings sampler in the block of MATLAB code at the bottom. The Markov chain path and the resulting samples are shown in plot below.

Metropolis-Hastings Markov chain and samples

As an aside, note that the proposal distribution for this sampler does not depend on past samples, but only on the parameter $\mu$ (see the line `xStar = exprnd(mu);` in the MATLAB code below). Each proposal state $x^*$ is drawn independently of the previous state. Therefore this is an example of an independence sampler, a specific type of Metropolis-Hastings sampling algorithm. Independence samplers are notorious for being either very good or very poor sampling routines. The quality of the routine depends on the choice of the proposal distribution and its coverage of the target distribution. Identifying such a proposal distribution is often difficult in practice.
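For concreteness, the independence sampler can be sketched in a few lines of Python (an illustrative translation of the MATLAB code below, with my own names): the exponential proposal is drawn without reference to the previous state, and the correction factor depends only on the proposal density evaluated at the two states.

```python
import math
import random

def exp_pdf(x, mu):
    """Exponential density with mean mu (the parameterization MATLAB's exppdf uses)."""
    return math.exp(-x / mu) / mu

def unnorm_posterior(A, y=1.5, B=1.0):
    """Unnormalized p(A | y): rate-form Gamma likelihood times the sin^2(pi*A) prior."""
    likelihood = B ** A / math.gamma(A) * y ** (A - 1) * math.exp(-B * y)
    return likelihood * math.sin(math.pi * A) ** 2

random.seed(1)
mu, n_samples, burn_in = 5.0, 5000, 500
x = [2.5]                                   # start between the prior's zeros
for _ in range(n_samples - 1):
    x_prev = x[-1]
    x_star = random.expovariate(1.0 / mu)   # proposal ignores the previous state
    c = exp_pdf(x_prev, mu) / exp_pdf(x_star, mu)
    alpha = min(1.0, unnorm_posterior(x_star) / unnorm_posterior(x_prev) * c)
    x.append(x_star if random.random() <= alpha else x_prev)
samples = x[burn_in:]
```

After discarding the burn-in period, the remaining states approximate draws from $p(A | y = 1.5)$.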

The MATLAB code for running the Metropolis-Hastings sampler is below. Running it will reproduce the figures above.

% METROPOLIS-HASTINGS BAYESIAN POSTERIOR
rand('seed',12345)

% PRIOR OVER SCALE PARAMETERS
B = 1;

% DEFINE LIKELIHOOD
likelihood = inline('(B.^A/gamma(A)).*y.^(A-1).*exp(-(B.*y))','y','A','B');

% CALCULATE AND VISUALIZE THE LIKELIHOOD SURFACE
yy = linspace(0,10,100);
AA = linspace(0.1,5,100);
likeSurf = zeros(numel(yy),numel(AA));
for iA = 1:numel(AA); likeSurf(:,iA)=likelihood(yy(:),AA(iA),B); end;

figure;
surf(likeSurf); ylabel('y'); xlabel('A'); colormap hot

% DISPLAY CONDITIONAL AT A = 2
hold on; ly = plot3(ones(1,numel(AA))*40,1:100,likeSurf(:,40),'g','linewidth',3)
xlim([0 100]); ylim([0 100]);  axis normal
set(gca,'XTick',[0,100]); set(gca,'XTickLabel',[0 5]);
set(gca,'YTick',[0,100]); set(gca,'YTickLabel',[0 10]);
view(65,25)
legend(ly,'p(y|A=2)','Location','Northeast');
hold off;
title('p(y|A)');

% DEFINE PRIOR OVER SHAPE PARAMETERS
prior = inline('sin(pi*A).^2','A');

% DEFINE THE POSTERIOR
p = inline('(B.^A/gamma(A)).*y.^(A-1).*exp(-(B.*y)).*sin(pi*A).^2','y','A','B');

% CALCULATE AND DISPLAY THE POSTERIOR SURFACE
postSurf = zeros(size(likeSurf));
for iA = 1:numel(AA); postSurf(:,iA)=p(yy(:),AA(iA),B); end;

figure
surf(postSurf); ylabel('y'); xlabel('A'); colormap hot

% DISPLAY THE PRIOR
hold on; pA = plot3(1:100,ones(1,numel(AA))*100,prior(AA),'b','linewidth',3)

% SAMPLE FROM p(A | y = 1.5)
y = 1.5;
target = postSurf(16,:);

% DISPLAY POSTERIOR
psA = plot3(1:100, ones(1,numel(AA))*16,postSurf(16,:),'m','linewidth',3)
xlim([0 100]); ylim([0 100]);  axis normal
set(gca,'XTick',[0,100]); set(gca,'XTickLabel',[0 5]);
set(gca,'YTick',[0,100]); set(gca,'YTickLabel',[0 10]);
view(65,25)
legend([pA,psA],{'p(A)','p(A|y = 1.5)'},'Location','Northeast');
hold off
title('p(A|y)');

% INITIALIZE THE METROPOLIS-HASTINGS SAMPLER
% DEFINE PROPOSAL DENSITY
q = inline('exppdf(x,mu)','x','mu');

% MEAN FOR PROPOSAL DENSITY
mu = 5;

% DISPLAY TARGET AND PROPOSAL
figure; hold on;
th = plot(AA,target,'m','Linewidth',2);
qh = plot(AA,q(AA,mu),'k','Linewidth',2)
legend([th,qh],{'Target, p(A)','Proposal, q(A)'});
xlabel('A');

% SOME CONSTANTS
nSamples = 5000;
burnIn = 500;
minn = 0.1; maxx = 5;

% INITIALIZE SAMPLER
x = zeros(1 ,nSamples);
x(1) = mu;
t = 1;

% RUN METROPOLIS-HASTINGS SAMPLER
while t < nSamples
    t = t+1;

    % SAMPLE FROM PROPOSAL
    xStar = exprnd(mu);

    % CORRECTION FACTOR
    c = q(x(t-1),mu)/q(xStar,mu);

    % CALCULATE THE (CORRECTED) ACCEPTANCE RATIO
    alpha = min([1, p(y,xStar,B)/p(y,x(t-1),B)*c]);

    % ACCEPT OR REJECT?
    u = rand;
    if u < alpha
        x(t) = xStar;
    else
        x(t) = x(t-1);
    end
end

% DISPLAY MARKOV CHAIN
figure;
subplot(211);
stairs(x(1:t),1:t, 'k');
hold on;
hb = plot([0 maxx/2],[burnIn burnIn],'g--','Linewidth',2)
ylabel('t'); xlabel('samples, A');
set(gca , 'YDir', 'reverse');
ylim([0 t])
axis tight;
xlim([0 maxx]);
title('Markov Chain Path');
legend(hb,'Burnin');

% DISPLAY SAMPLES
subplot(212);
nBins = 100;
sampleBins = linspace(minn,maxx,nBins);
counts = hist(x(burnIn:end), sampleBins);
bar(sampleBins, counts/sum(counts), 'k');
xlabel('samples, A' ); ylabel( 'p(A | y)' );
title('Samples');
xlim([0 10])

% OVERLAY TARGET DISTRIBUTION
hold on;
plot(AA, target/sum(target) , 'm-', 'LineWidth', 2);
legend('Sampled Distribution',sprintf('Target Posterior'))
axis tight


## Wrapping Up

Here we explored how the Metropolis-Hastings sampling algorithm can be used to generalize the Metropolis algorithm in order to sample from complex (and unnormalized) probability distributions using asymmetric proposal distributions. One shortcoming of the Metropolis-Hastings algorithm is that not all of the proposed samples are accepted, wasting valuable computational resources. This becomes even more of an issue for sampling distributions in higher dimensions. This is where Gibbs sampling comes in. We’ll see in a later post that Gibbs sampling can be used to keep all proposal states in the Markov chain by taking advantage of conditional probabilities.

## MCMC: The Metropolis Sampler

As discussed in an earlier post, we can use a Markov chain to sample from some target probability distribution $p(x)$ from which drawing samples directly is difficult. To do so, it is necessary to design a transition operator for the Markov chain which makes the chain’s stationary distribution match the target distribution. The Metropolis sampling algorithm  (and the more general Metropolis-Hastings sampling algorithm) uses simple heuristics to implement such a transition operator.

## Metropolis Sampling

Starting from some random initial state $x^{(0)} \sim \pi^{(0)}$, the algorithm first draws a possible sample $x^*$ from a proposal distribution $q(x | x^{(t-1)})$. Much like a conventional transition operator for a Markov chain, the proposal distribution depends only on the previous state in the chain. However, the transition operator for the Metropolis algorithm has an additional step that assesses whether or not the target distribution has a sufficiently large density near the proposed state to warrant accepting the proposed state as a sample and setting it as the next state in the chain. If the density of $p(x)$ is low near the proposed state, then it is likely (but not guaranteed) that it will be rejected. The criteria for accepting or rejecting a proposed state are defined by the following heuristics:

1. If $p(x^*) \geq p(x^{(t-1)})$, the proposed state $x^*$ is kept as a sample and set as the next state in the chain (i.e. move the chain’s state to a location where $p(x)$ has equal or greater density).
2. If $p(x^*) < p(x^{(t-1)})$, indicating that $p(x)$ has low density near $x^*$, the proposed state may still be accepted, but only randomly, with probability $\frac{p(x^*)}{p(x^{(t-1)})}$

These heuristics can be instantiated by calculating the acceptance probability for the proposed state.

$\alpha = \min \left(1, \frac{p(x^*)}{p(x^{(t-1)})}\right)$
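A tiny sketch (illustrative Python) makes explicit how this single expression implements both heuristics: when the proposed density is at least as large as the current one, the ratio is at least one and $\alpha$ clamps to one (always accept); otherwise $\alpha$ is exactly the ratio from heuristic (2).

```python
def acceptance_probability(p_star, p_prev):
    """Metropolis acceptance probability: alpha = min(1, p(x*) / p(x^(t-1)))."""
    return min(1.0, p_star / p_prev)

# heuristic 1: equal or greater density -> always accepted
a1 = acceptance_probability(0.8, 0.5)
# heuristic 2: lower density -> accepted with probability p(x*)/p(x^(t-1))
a2 = acceptance_probability(0.2, 0.5)
```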

Having the acceptance probability in hand, the transition operator for the Metropolis algorithm works like this: if a random uniform number $u$ is less than or equal to $\alpha$, the state $x^*$ is accepted and becomes the next state in the chain; otherwise the proposal is rejected, the chain remains at $x^{(t-1)}$, and a new state is proposed on the next iteration. In order to collect $M$ samples using Metropolis sampling we run the following algorithm:

1. set $t = 0$
2. generate an initial state $x^{(0)}$ from a prior distribution $\pi^{(0)}$ over initial states
3. repeat until $t = M$:
   - set $t = t+1$
   - generate a proposal state $x^*$ from $q(x | x^{(t-1)})$
   - calculate the acceptance probability $\alpha = \min \left(1, \frac{p(x^*)}{p(x^{(t-1)})}\right)$
   - draw a random number $u$ from $\text{Unif}(0,1)$
   - if $u \leq \alpha$, accept the proposal and set $x^{(t)} = x^*$
   - else, set $x^{(t)} = x^{(t-1)}$

### Example: Using the Metropolis algorithm to sample from an unknown distribution

Say that we have some mysterious function

$p(x) = (1 + x^2)^{-1}$

from which we would like to draw samples. To do so using Metropolis sampling we need to define two things: (1) the prior distribution $\pi^{(0)}$ over the initial state of the Markov chain, and (2)  the proposal distribution $q(x | x^{(t-1)})$. For this example we define:

$\pi^{(0)} \sim \mathcal N(0,1)$

$q(x | x^{(t-1)}) \sim \mathcal N(x^{(t-1)},1)$,

both of which are simply Normal distributions, one centered at zero, the other centered at the previous state of the chain. The following chunk of MATLAB code runs the Metropolis sampler with this proposal distribution and prior.

% METROPOLIS SAMPLING EXAMPLE
randn('seed',12345);

% DEFINE THE TARGET DISTRIBUTION
p = inline('(1 + x.^2).^-1','x');

% SOME CONSTANTS
nSamples = 5000;
burnIn = 500;
nDisplay = 30;
sigma = 1;
minn = -20; maxx = 20;
xx = 3*minn:.1:3*maxx;
target = p(xx);
pauseDur = .8;

% INITIALIZE SAMPLER
x = zeros(1 ,nSamples);
x(1) = randn;
t = 1;

% RUN SAMPLER
while t < nSamples
    t = t+1;

    % SAMPLE FROM PROPOSAL
    xStar = normrnd(x(t-1),sigma);
    proposal = normpdf(xx,x(t-1),sigma);

    % CALCULATE THE ACCEPTANCE PROBABILITY
    alpha = min([1, p(xStar)/p(x(t-1))]);

    % ACCEPT OR REJECT?
    u = rand;
    if u < alpha
        x(t) = xStar;
        str = 'Accepted';
    else
        x(t) = x(t-1);
        str = 'Rejected';
    end

    % DISPLAY SAMPLING DYNAMICS
    if t < nDisplay + 1
        figure(1);
        subplot(211);
        cla
        plot(xx,target,'k');
        hold on;
        plot(xx,proposal,'r');
        line([x(t-1),x(t-1)],[0 p(x(t-1))],'color','b','linewidth',2)
        scatter(xStar,0,'ro','Linewidth',2)
        line([xStar,xStar],[0 p(xStar)],'color','r','Linewidth',2)
        plot(x(1:t),zeros(1,t),'ko')
        legend({'Target','Proposal','p(x^{(t-1)})','x^*','p(x^*)','Kept Samples'})

        switch str
            case 'Rejected'
                scatter(xStar,p(xStar),'rx','Linewidth',3)
            case 'Accepted'
                scatter(xStar,p(xStar),'rs','Linewidth',3)
        end
        scatter(x(t-1),p(x(t-1)),'bo','Linewidth',3)
        title(sprintf('Sample % d %s',t,str))
        xlim([minn,maxx])
        subplot(212);
        hist(x(1:t),50); colormap hot;
        xlim([minn,maxx])
        title(['Sample ',str]);
        drawnow
        pause(pauseDur);
    end
end

% DISPLAY MARKOV CHAIN
figure(1); clf
subplot(211);
stairs(x(1:t),1:t, 'k');
hold on;
hb = plot([-10 10],[burnIn burnIn],'b--')
ylabel('t'); xlabel('samples, x');
set(gca , 'YDir', 'reverse');
ylim([0 t])
axis tight;
xlim([-10 10]);
title('Markov Chain Path');
legend(hb,'Burnin');

% DISPLAY SAMPLES
subplot(212);
nBins = 200;
sampleBins = linspace(minn,maxx,nBins);
counts = hist(x(burnIn:end), sampleBins);
bar(sampleBins, counts/sum(counts), 'k');
xlabel('samples, x' ); ylabel( 'p(x)' );
title('Samples');

% OVERLAY ANALYTIC DENSITY OF STUDENT T
nu = 1;
y = tpdf(sampleBins,nu)
hold on;
plot(sampleBins, y/sum(y) , 'r-', 'LineWidth', 2);
legend('Samples',sprintf('Theoretic\nStudent''s t'))
axis tight
xlim([-10 10]);


Using the Metropolis algorithm to sample from a continuous distribution (black)

In the figure above, we visualize the first 30 iterations of the Metropolis sampler. The black curve represents the target distribution $p(x)$. The red curve that is bouncing about the x-axis is the proposal distribution $q(x | x^{(t-1)})$ (if the figure is not animated, just click on it). The vertical blue line (about which the bouncing proposal distribution is centered) represents the quantity $p(x^{(t-1)})$, and the vertical red line represents the quantity $p(x^*)$, for a proposal state $x^*$ sampled according to the red curve. At every iteration, if the vertical red line is longer than the blue line, then the sample $x^*$ is accepted, and the proposal distribution becomes centered about the newly accepted sample. If the blue line is longer, the sample is randomly accepted or rejected.

But why randomly keep “bad” proposal samples? It turns out that doing this allows the Markov chain to every-so-often visit states of low probability under the target distribution. This is a desirable property if we want the chain to adequately sample the entire target distribution, including any tails.

An attractive property of the Metropolis algorithm is that the target distribution $p(x)$ does not have to be a properly normalized probability distribution. This is due to the fact that the acceptance probability is based on the ratio of two values of the target distribution. I’ll show you what I mean. If $p(x)$ is an unnormalized distribution and

$p^*(x) = \frac{p(x)}{Z}$

is a properly normalized probability distribution with normalizing constant $Z$, then

$p(x) = Zp^*(x)$

and a ratio like that used in calculating the acceptance probability $\alpha$ is

$\frac{p(a)}{p(b)} = \frac{Zp^*(a)}{Zp^*(b)} = \frac{p^*(a)}{p^*(b)}$
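This cancellation is easy to verify numerically (a Python spot-check; it uses this post's target $p(x) = (1 + x^2)^{-1}$, whose normalizing constant is $\pi$):

```python
import math

p = lambda x: 1.0 / (1.0 + x * x)   # unnormalized target
Z = math.pi                          # normalizing constant of (1 + x^2)^(-1)
p_norm = lambda x: p(x) / Z          # properly normalized (Cauchy) density

a, b = 0.5, 2.0
ratio_unnorm = p(a) / p(b)           # ratio used in the acceptance probability
ratio_norm = p_norm(a) / p_norm(b)   # same ratio with the normalized density
```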

The normalizing constants $Z$ cancel! This attractive property is quite useful in the context of Bayesian methods, where determining the normalizing constant for a distribution may be impractical to calculate directly. This property is demonstrated in the current example. It turns out that the “mystery” distribution that we sampled from using the Metropolis algorithm is an unnormalized form of the Student’s-t distribution with one degree of freedom. Comparing $p(x)$ to the definition of the Student’s-t distribution

$Student(x,\nu) = \frac{\Gamma\left(\frac{\nu + 1}{2} \right)}{\sqrt{\nu\pi} \Gamma\left( \frac{\nu}{2}\right)}\left( 1+\frac{x^2}{\nu}\right)^{-\frac{\nu+1}{2}} = \frac{(1 + x^2)^{-1}}{Z} = \frac{p(x)}{Z}$

we see that $p(x)$ is a Student’s-t distribution with degrees of freedom $\nu=1$, but missing the normalizing constant

$Z = \left(\frac{\Gamma\left(\frac{\nu + 1}{2} \right)}{\sqrt{\nu\pi} \Gamma\left( \frac{\nu}{2}\right)}\right )^{-1}$
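Plugging $\nu = 1$ into this expression gives $Z = \sqrt{\pi}\,\Gamma(1/2) = \pi$, since $\Gamma(1) = 1$ and $\Gamma(1/2) = \sqrt{\pi}$. A one-line check in Python:

```python
import math

nu = 1.0
# Z = (Gamma((nu+1)/2) / (sqrt(nu*pi) * Gamma(nu/2)))^(-1)
Z = (math.gamma((nu + 1) / 2) / (math.sqrt(nu * math.pi) * math.gamma(nu / 2))) ** -1
```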

Below is additional output from the code above showing that the Metropolis sampler draws samples that follow a normalized Student’s-t distribution, even though $p(x)$ is not normalized.

Metropolis samples from an unnormalized t-distribution follow the normalized distribution

The upper plot shows the progression of the Markov chain from state $x^{(0)}$ (top) to state $x^{(5000)}$ (bottom). The burn-in period for this chain was chosen to be 500 transitions, and is indicated by the dashed blue line (for more on burn-in see this previous post).

The bottom plot shows the samples from the Markov chain in black (with burn-in samples removed). The theoretical curve for the Student’s-t with one degree of freedom is overlaid in red. We see that the states kept by the Metropolis transition operator follow the Student’s-t distribution, even though the function $p(x)$ used by the transition operator was not a properly normalized probability distribution.

## Reversibility of the transition operator

It turns out that there is a theoretical constraint on the Markov chain’s transition operator in order for it to settle into a stationary distribution (i.e. a target distribution we care about). The constraint states that the probability of the transition $x^{(t)} \to x^{(t+1)}$ must be equal to the probability of the reverse transition $x^{(t+1)} \to x^{(t)}$. This reversibility property is often referred to as detailed balance. Using the Metropolis algorithm transition operator, reversibility is assured if the proposal distribution $q(x|x^{(t-1)})$ is symmetric. Examples of symmetric proposal distributions are the Normal, Cauchy, Student’s-t, and Uniform distributions.
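Detailed balance is commonly written as $p(a)\,T(a \to b) = p(b)\,T(b \to a)$ for all pairs of states. For the Metropolis operator, $T(b \,|\, a) = q(b|a)\min\left(1, \frac{p(b)}{p(a)}\right)$ when $b \neq a$, and symmetry of $q$ makes the identity hold. A numerical spot-check (illustrative Python, using this example's target):

```python
import math

p = lambda x: 1.0 / (1.0 + x * x)               # unnormalized target
q = lambda b, a: math.exp(-0.5 * (b - a) ** 2)  # symmetric Normal-shaped proposal

def T(b, a):
    """Metropolis transition density from state a to state b (b != a)."""
    return q(b, a) * min(1.0, p(b) / p(a))

a, b = 0.3, 1.7
lhs = p(a) * T(b, a)   # flow from a to b
rhs = p(b) * T(a, b)   # flow from b to a
```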

However, a symmetric proposal distribution may not adequately or efficiently sample all possible target distributions. For instance, if a target distribution is bounded on the positive numbers $x \in (0, \infty)$, we would like to use a proposal distribution with the same support, which will thus be asymmetric. This is where the Metropolis-Hastings sampling algorithm comes in. We will discuss in a later post how the Metropolis-Hastings sampler uses a simple change to the calculation of the acceptance probability which allows us to use non-symmetric proposal distributions.