## fMRI in Neuroscience: Modeling the HRF With FIR Basis Functions

In the previous post on fMRI methods, we discussed how to model the selectivity of a voxel using the General Linear Model (GLM). One of the basic assumptions that we must make in order to use the GLM is that we also have an accurate model of the **Hemodynamic Response Function (HRF)** for the voxel. A common practice is to use a canonical HRF model established from previous empirical studies of fMRI timeseries. However, voxels throughout the brain and across subjects exhibit a variety of shapes, so the canonical model is often incorrect. Therefore it becomes necessary to estimate the shape of the HRF for each voxel.

There are a number of ways that have been developed for estimating HRFs, most of them are based on temporal basis function models. (For details on basis function models, see this previous post.). There are a number of basis function sets available, but in this post we’ll discuss modeling the HRF using a flexible basis set composed of a set of delayed impulses called **Finite Impulse Response (FIR)** basis.

## Modeling HRFs With a Set of Time-delayed Impulses

Let’s say that we have an HRF with the following shape.

We would like to be able to model the HRF as a weighted combination of simple basis functions. The simplest set of basis functions is the FIR basis, which is a series of distinct unit-magnitude (i.e. equal to one) impulses, each of which is delayed in time by TRs. An example of modeling the HRF above using FIR basis functions is below:

%% REPRESENTING AN HRF WITH FIR BASIS FUNCTIONS % CREATE ACTUAL HRF (AS MEASURED BY MRI SCANNER) rand('seed',12345) TR = 1 % REPETITION TIME t = 1:TR:20; % MEASUREMENTS h = gampdf(t,6) + -.5*gampdf(t,10); % ACTUAL HRF h = h/max(h); % DISPLAY THE HRF figure; stem(t,h,'k','Linewidth',2) axis square xlabel(sprintf('Basis Function Contribution\nTo HRF')) title(sprintf('HRF as a Series of \nWeighted FIR Basis Functions')) % CREATE/DISPLAY FIR REPRESENTATION figure; hold on cnt = 1; % COLORS BASIS FUNCTIONS ACCORDING TO HRF WEIGHT map = jet(64); cRange = linspace(min(h),max(h),64); for iT = numel(h):-1:1 firSignal = ones(size(h)); firSignal(cnt) = 2; [~,cIdx] = min(abs(cRange-h(cnt))); color = map(cIdx,:); plot(1:numel(h),firSignal + 2*(iT-1),'Color',color,'Linewidth',2) cnt = cnt+1; end colormap(map); colorbar; caxis([min(h) max(h)]); % DISPLAY axis square; ylabel('Basis Function') xlabel('Time (TR)') set(gca,'YTick',0:2:39,'YTickLabel',20:-1:1) title(sprintf('Weighted FIR Basis\n Set (20 Functions)'));

Each of the basis functions has an unit impulse that occurs at time ; otherwise it is equal to zero. Weighting each basis function with the corresponding value of the HRF at each time point , followed by a sum across all the functions gives the target HRF in the first plot above. The FIR basis model makes no assumptions about the shape of the HRF–the weight applied to each basis function can take any value–which allows the model to capture a wide range of HRF profiles.

Given an experiment where various stimuli are presented to a subject and BOLD responses evoked within the subject’s brain, the goal is to determine the HRF to each of the stimuli within each voxel. Let’s take a look at a concrete example of how we can use the FIR basis to simultaneously estimate HRFs to many stimuli for multiple voxels with distint tuning properties.

## Estimating the HRF of Simulated Voxels Using the FIR Basis

For this example we revisit a simulation of voxels with 4 different types of tuning (for details, see the previous post on fMRI in Neuroscience). One voxel is strongly tuned for visual stimuli (such as a light), the second voxel is weakly tuned for auditory stimuli (such as a tone), the third is moderately tuned for somatosensory stimuli (such as warmth applied to the palm), and the final voxel is unselective (i.e. weakly and equally selective for all three types of stimuli). We simulate an experiment where the blood-oxygen-level dependent (BOLD) signals evoked in each voxel by a series of stimuli consisting of nonoverlapping lights, tones, and applications of warmth to the palm, are measured over fMRI measurments (TRs). Below is the simulation of the experiment and the resulting simulated BOLD signals:

%% SIMULATE AN EXPERIMENT % SOME CONSTANTS trPerStim = 30; nRepeat = 10; nTRs = trPerStim*nRepeat + length(h); nCond = 3; nVox = 4; impulseTrain0 = zeros(1,nTRs); % RANDOM ONSET TIMES (TRs) onsetIdx = randperm(nTRs-length(h)); % VISUAL STIMULUS impulseTrainLight = impulseTrain0; impulseTrainLight(onsetIdx(1:nRepeat)) = 1; onsetIdx(1:nRepeat) = []; % AUDITORY STIMULUS impulseTrainTone = impulseTrain0; impulseTrainTone(onsetIdx(1:nRepeat)) = 1; onsetIdx(1:nRepeat) = []; % SOMATOSENSORY STIMULUS impulseTrainHeat = impulseTrain0; impulseTrainHeat(onsetIdx(1:nRepeat)) = 1; % EXPERIMENT DESIGN / STIMULUS SEQUENCE D = [impulseTrainLight',impulseTrainTone',impulseTrainHeat']; X = conv2(D,h'); X = X(1:nTRs,:); %% SIMULATE RESPONSES OF VOXELS WITH VARIOUS SELECTIVITIES visualTuning = [4 0 0]; % VISUAL VOXEL TUNING auditoryTuning = [0 2 0]; % AUDITORY VOXEL TUNING somatoTuning = [0 0 3]; % SOMATOSENSORY VOXEL TUNING noTuning = [1 1 1]; % NON-SELECTIVE beta = [visualTuning', ... auditoryTuning', ... somatoTuning', ... noTuning']; y0 = X*beta; SNR = 5; noiseSTD = max(y0)/SNR; noise = bsxfun(@times,randn(size(y0)),noiseSTD); y = y0 + noise; % VOXEL RESPONSES % DISPLAY VOXEL TIMECOURSES voxNames = {'Visual','Auditory','Somat.','Unselective'}; cols = lines(4); figure; for iV = 1:4 subplot(4,1,iV) plot(y(:,iV),'Color',cols(iV,:),'Linewidth',2); xlim([0,nTRs]); ylabel('BOLD Signal') legend(sprintf('%s Voxel',voxNames{iV})) end xlabel('Time (TR)') set(gcf,'Position',[100,100,880,500])

Now let’s estimate the HRF of each voxel to each of the stimulus conditions using an FIR basis function model. To do so, we create a design matrix composed of successive sets of delayed impulses, where each set of impulses begins at the onset of each stimulus condition. For the -sized stimulus onset matrix , we calculate an FIR design matrix , where is the assumed length of the HRF we are trying to estimate. The code for creating and displaying the design matrix for an assumed HRF length is below:

%% ESTIMATE HRF USING FIR BASIS SET % CREATE FIR DESIGN MATRIX hrfLen = 16; % WE ASSUME HRF IS 16 TRS LONG % BASIS SET FOR EACH CONDITOIN IS A TRAIN OF INPULSES X_FIR = zeros(nTRs,hrfLen*nCond); for iC = 1:nCond onsets = find(D(:,iC)); idxCols = (iC-1)*hrfLen+1:iC*hrfLen; for jO = 1:numel(onsets) idxRows = onsets(jO):onsets(jO)+hrfLen-1; for kR = 1:numel(idxRows); X_FIR(idxRows(kR),idxCols(kR)) = 1; end end end % DISPLAY figure; subplot(121); imagesc(D); colormap gray; set(gca,'XTickLabel',{'Light','Tone','Som.'}) title('Stimulus Train'); subplot(122); imagesc(X_FIR); colormap gray; title('FIR Design Matrix'); set(gca,'XTick',[8,24,40]) set(gca,'XTickLabel',{'Light','Tone','Som.'}) set(gcf,'Position',[100,100,550,400])In the right panel of the plot above, we see the form of the FIR design matrix for the stimulus onset on the left. For each voxel, we want to determine the weight on each column of that will best explain the BOLD signals measured from each voxel. We can form this problem in terms of a General Linear Model:

Where are the weights on each column of the FIR design matrix. If we set the values of such as to minimize the sum of the squared errors (SSE) between the model above and the measured actual responses

,

then we can use the Ordinary Least Squares (OLS) solution discussed earlier to solve the for . Specifically, we solve for the weights as:

Once determined, the resulting matrix of weights has the HRF of each of the different voxels to each stimulus condition along its columns. The first (1-16) of the weights along a column define the HRF to the first stimulus (the light). The second (17-32) weights along a column determine the HRF to the second stimulus (the tone), etc… Below we parse out these weights and display the resulting HRFs for each voxel:

% ESTIMATE HRF FOR EACH CONDITION AND VOXEL betaHatFIR = pinv(X_FIR'*X_FIR)*X_FIR'*y; % RESHAPE HRFS hHatFIR = reshape(betaHatFIR,hrfLen,nCond,nVox); % DISPLAY figure cols = lines(4); names = {'Visual','Auditory','Somat.','Unselective'}; for iV = 1:nVox subplot(2,2,iV) hold on; for jC = 1:nCond hl = plot(1:hrfLen,hHatFIR(:,jC,iV),'Linewidth',2); set(hl,'Color',cols(jC,:)) end hl = plot(1:numel(h),h,'Linewidth',2); xlabel('TR') legend({'Light','Tone','Heat','True HRF'}) set(hl,'Color','k') xlim([0 hrfLen]) grid on axis tight title(sprintf('%s Voxel',names{iV})); end set(gcf,'Position',[100,100,880,500])

Here we see that estimated HRFs accurately capture both the shape of the HRF and the selectivity of each of the voxels. For instance, the HRFs estimated from the responses of first voxel indicate strong tuning for the light stimulus. The HRF estimated for the light stimulus has an amplitude that is approximately 4 times that of the true HRF. This corresponds with the actual tuning of the voxel (compare this to the value of ). Additionally, time delay till the maximum value (time-to-peak) of the HRF to the light is the same as the true HRF. The first voxel’s HRFs estimated for the other stimuli are essentially noise around baseline. This (correctly) indicates that the first voxel has no selectivity for those stimuli. Further inspection of the remaining estimated HRFs indicate accurate tuning and HRF shape is recovered for the other three voxels as well.

## Wrapping Up

In this post we discussed how to apply a simple basis function model (the FIR basis) to estimate the HRF profile and get an idea of the tuning of individual voxels. Though the FIR basis model can accurately model any HRF shape, it is often times too flexible. In scenarios where voxel signals are very noisy, the FIR basis model will tend to model the noise.

Additionally, the FIR basis set needs to incorporate a basis function for each time measurement. For the example above, we assumed the HRF had a length of 16 TRs. The FIR basis therefore had 16 tuneable weights for each condition. This leads to a model with 48 () tunable parameters for the GLM model. For experiments with many different stimulus conditions, the number of parameters can grow quickly (as ). If the number of parameters is comparable (or more) than the number of BOLD signal measurements, it will be difficult accurately estimate . As we’ll see in later posts, we can often improve upon the FIR basis set by using more clever basis functions.

Another important but indirect issue that effects estimating the HRF is the experimental design, or rather the schedule used to present the stimuli. In the example above, the stimuli were presented in random, non-overlapping order. What if the stimuli were presented in the same order every time, with some set frequency? We’ll discuss in a later post the concept of design efficiency and how it affects our ability to characterize the shape of the HRF and, consequently, voxel selectivity.

## Basis Function Models

Often times we want to model data that emerges from some underlying function of independent variables such that for some future input we’ll be able to accurately predict the future output values. There are various methods for devising such a model, all of which make particular assumptions about the types of functions the model can emulate. In this post we’ll focus on one set of methods called * Basis Function Models* (BFMs).

## Basis Sets and Linear Independence

The idea behind BFMs is to model the complex target function as a linear combination of a set of simpler functions, for which we have closed form expressions. This set of simpler functions is called a * basis set*, and work in a similar manner to bases that compose vector spaces in linear algebra. For instance, any vector in the 2D spatial coordinate system (which is a vector space in ) can be composed of linear combinations of the and directions. This is demonstrated in the figures below:

Above we see a target vector in black pointing from the origin (at xy coordinates (0,0)) to the xy coordinates (2,3), and the coordinate basis vectors and , each of which point one unit along the x- (in blue) and y- (in red) directions.

We can compose the target vector as as a linear combination of the x- and y- basis vectors. Namely the target vector can be composed by adding (in the vector sense) 2 times the basis to 3 times the basis :

One thing that is important to note about the bases and is that they are **linearly independent.**** **This means that no matter how hard you try, you can’t compose the basis vector as a linear combination of the other basis vector , and vice versa. In the 2D vector space, we can easily see this because the red and blue lines are perpendicular to one another (a condition called * orthogonality*). But we can formally determine if two (column) vectors are independent by calculating the (column) rank of a matrix that is composed by concatenating the two vectors.

The rank of a matrix is the number of linearly independent columns in the matrix. If the rank of has the same value as the number of columns in the matrix, then the columns of forms a linearly independent set of vectors. The rank of above is 2. So is the number of columns. Therefore the basis vectors and are indeed linearly independent. We can use this same matrix rank-based test to verify if vectors of much higher dimension than two are independent. Linear independence of the basis set is important if we want to be able to define a unique model.

%% EXAMPLE OF COMPOSING A VECTOR OF BASIS VECTORS figure; targetVector = [0 0; 2 3] basisX = [0 0; 1 0]; basisY = [0 0; 0 1]; hv = plot(targetVector(:,1),targetVector(:,2),'k','Linewidth',2) hold on; hx = plot(basisX(:,1),basisX(:,2),'b','Linewidth',2); hy = plot(basisY(:,1),basisY(:,2),'r','Linewidth',2); xlim([-4 4]); ylim([-4 4]); xlabel('x-direction'), ylabel('y-direction') axis square grid legend([hv,hx,hy],{'Target','b^{(x)}','b^{(y)}'},'Location','bestoutside'); figure hv = plot(targetVector(:,1),targetVector(:,2),'k','Linewidth',2); hold on; hx = plot(2*basisX(:,1),2*basisX(:,2),'b','Linewidth',2); hy = plot(3*basisY(:,1),3*basisY(:,2),'r','Linewidth',2); xlim([-4 4]); ylim([-4 4]); xlabel('x-direction'), ylabel('y-direction'); axis square grid legend([hv,hx,hy],{'Target','2b^{(x)}','3b^{(y)}'},'Location','bestoutside') A = [1 0; 0 1]; % TEST TO SEE IF basisX AND basisY ARE % LINEARLY INDEPENDENT isIndependent = rank(A) == size(A,2)

## Modeling Functions with Linear Basis Sets

In a similar fashion to creating arbitrary vectors with vector bases, we can compose arbitrary functions in “function space” as a linear combination of simpler basis functions (note that basis functions are also sometimes called * kernels*). One such set of basis functions is the set of polynomials:

Here each basis function is a polynomial of order . We can then compose a basis set of functions, where the function is , then model the function as a linear combinations of these polynomial bases:

where is the weight on the -th basis function. In matrix format this model takes the form

Here, again the matrix is the concatenation of each of the polynomial bases into its columns. What we then want to do is determine all the weights such that is as close to as possible. We can do this by using Ordinary Least Squares (OLS) regression, which was discussed in earlier posts. The optimal solution for the weights under OLS is:

Let’s take a look at a concrete example, where we use a set of polynomial basis functions to model a complex data trend.

## Example: Modeling with Polynomial Basis Functions

In this example we model a set of data whose underlying function is:

In particular we’ll create a polynomial basis set of degree 10 and fit the weights using OLS. The Matlab code for this example, and the resulting graphical output are below:

%% EXAMPLE: MODELING A TARGET FUNCTION x = [0:.1:20]'; f = inline('cos(.5*x) + sin(x)','x'); % CREATE A POLYNOMIAL BASIS SET polyBasis = []; nPoly = 10; px = linspace(-10,10,numel(x))'; for iP = 1:nPoly polyParams = zeros(1,nPoly); polyParams(iP) = 1; polyBasis = [polyBasis,polyval(polyParams,px)]; end % SCALE THE BASIS SET TO HAVE MAX AMPLTUDE OF 1 polyBasis = fliplr(bsxfun(@rdivide,polyBasis,max(polyBasis))); % CHECK LINEAR INDEPENDENCE isIndependent = rank(polyBasis) == size(polyBasis,2) % SAMPLE SOME DATA FROM THE TARGET FUNCTION randIdx = randperm(numel(x)); xx = x(randIdx(1:30)); y = f(xx) + randn(size(xx))*.2; % FIT THE POLYNOMIAL BASIS MODEL TO THE DATA(USING polyfit.m) basisWeights = polyfit(xx,y,nPoly); % MODEL OF TARGET FUNCTION yHat = polyval(basisWeights,x); % DISPLAY BASIS SET AND AND MODEL subplot(131) plot(polyBasis,'Linewidth',2) axis square xlim([0,numel(px)]) ylim([-1.2 1.2]) title(sprintf('Polynomial Basis Set\n(%d Functions)',nPoly)) subplot(132) bar(fliplr(basisWeights)); axis square xlim([0 nPoly + 1]); colormap hot xlabel('Basis Function') ylabel('Estimated Weight') title('Model Weights on Basis Functions') subplot(133); hy = plot(x,f(x),'b','Linewidth',2); hold on hd = scatter(xx,y,'ko'); hh = plot(x,yHat,'r','Linewidth',2); xlim([0,max(x)]) axis square legend([hy,hd,hh],{'f(x)','y','Model'},'Location','Best') title('Model Fit') hold off;

First off, let’s make sure that the polynomial basis is indeed linearly independent. As above, we’ll compute the rank of the matrix composed of the basis functions along its columns. The rank of the basis matrix has a value of 10, which is also the number of columns of the matrix (line 19 in the code above). This proves that the basis functions are linearly independent.

We fit the model using Matlab’s internal function , which performs OLS on the basis set matrix. We see that the basis set of 10 polynomial functions (including the zeroth-bias term) does a pretty good job of modeling a very complex function . We essentially get to model a highly nonlinear function using simple linear regression (i.e. OLS).

## Wrapping up

Though the polynomial basis set works well in many modeling problems, it may be a poor fit for some applications. Luckily we aren’t limited to using only polynomial basis functions. Other basis sets include Gaussian basis functions, Sigmoid basis functions, and finite impulse response (FIR) basis functions, just to name a few (a future post, we’ll demonstrate how the FIR basis set can be used to model the hemodynamic response function (HRF) of an fMRI voxel measured from brain).

## fMRI in Neuroscience: Estimating Voxel Selectivity & the General Linear Model (GLM)

In a typical fMRI experiment a series of stimuli are presented to an observer and evoked brain activity–in the form of blood-oxygen-level-dependent (BOLD) signals–are measured from tiny chunks of the brain called voxels. The task of the researcher is then to infer the tuning of the voxels to features in the presented stimuli based on the evoked BOLD signals. In order to make this inference quantitatively, it is necessary to have a model of how BOLD signals are evoked in the presence of stimuli. In this post we’ll develop a model of evoked BOLD signals, and from this model recover the tuning of individual voxels measured during an fMRI experiment.

## Modeling the Evoked BOLD Signals — The Stimulus and Design Matrices

Suppose we are running an event-related fMRI experiment where we present different stimulus conditions to an observer while recording the BOLD signals evoked in their brain over a series of consecutive fMRI measurements (TRs). We can represent the stimulus presentation quantitatively with a binary * Stimulus Matrix,* , whose entries indicate the onset of each stimulus condition (columns) at each point in time (rows). Now let’s assume that we have an accurate model of how a voxel is activated by a single, very short stimulus. This activation model is called hemodynamic response function (HRF), , for the voxel, and, as we’ll discuss in a later post, can be estimated from the measured BOLD signals. Let’s assume for now that the voxel is also activated to an equal degree to all stimuli. In this scenario we can represent the BOLD signal evoked over the entire experiment with another matrix called the

*that is the convolution of the stimulus matrix with the voxel’s HRF .*

**Design Matrix**Note that this model of the BOLD signal is an example of the Finite Impulse Response (FIR) model that was introduced in the previous post on fMRI Basics.

To make the concepts of and more concrete, let’s say our experiment consists of different stimulus conditions: a light, a tone, and heat applied to the palm. Each stimulus condition is presented twice in a staggered manner during 80 TRs of fMRI measurements. The stimulus matrix and the design matrix are simulated here in Matlab:

TR = 1; % REPETITION TIME t = 1:TR:20; % MEASUREMENTS h = gampdf(t,6) + -.5*gampdf(t,10); % HRF MODEL h = h/max(h); % SCALE HRF TO HAVE MAX AMPLITUDE OF 1 trPerStim = 30; % # TR PER STIMULUS nRepeat = 2; % # OF STIMULUS REPEATES nTRs = trPerStim*nRepeat + length(h); impulseTrain0 = zeros(1,nTRs); % VISUAL STIMULUS impulseTrainLight = impulseTrain0; impulseTrainLight(1:trPerStim:trPerStim*nRepeat) = 1; % AUDITORY STIMULUS impulseTrainTone = impulseTrain0; impulseTrainTone(5:trPerStim:trPerStim*nRepeat) = 1; % SOMATOSENSORY STIMULUS impulseTrainHeat = impulseTrain0; impulseTrainHeat(9:trPerStim:trPerStim*nRepeat) = 1; % COMBINATION OF ALL STIMULI impulseTrainAll = impulseTrainLight + impulseTrainTone + impulseTrainHeat; % SIMULATE VOXELS WITH VARIOUS SELECTIVITIES visualTuning = [4 0 0]; % VISUAL VOXEL TUNING auditoryTuning = [0 2 0]; % AUDITORY VOXEL TUNING somatoTuning = [0 0 3]; % SOMATOSENSORY VOXEL TUNING noTuning = [1 1 1]; % NON-SELECTIVE beta = [visualTuning', ... auditoryTuning', ... somatoTuning', ... noTuning']; % EXPERIMENT DESIGN / STIMULUS SEQUENCE D = [impulseTrainLight',impulseTrainTone',impulseTrainHeat']; % CREATE DESIGN MATRIX FOR THE THREE STIMULI X = conv2(D,h'); % X = D * h X(nTRs+1:end,:) = []; % REMOVE EXCESS FROM CONVOLUTION % DISPLAY STIMULUS AND DESIGN MATRICES subplot(121); imagesc(D); colormap gray; xlabel('Stimulus Condition') ylabel('Time (TRs)'); title('Stimulus Train, D'); set(gca,'XTick',1:3); set(gca,'XTickLabel',{'Light','Tone','Heat'}); subplot(122); imagesc(X); xlabel('Stimulus Condition') ylabel('Time (TRs)'); title('Design Matrix, X = D * h') set(gca,'XTick',1:3); set(gca,'XTickLabel',{'Light','Tone','Heat'});

Each column of the design matrix above (the right subpanel in the above figure) is essentially a model of the BOLD signal evoked independently by each stimulus condition, and the total signal is simply a sum of these independent signals.

## Modeling Voxel Tuning — The Selectivity Matrix

In order to develop the concept of the design matrix we assumed that our theoretical voxel is equally tuned to all stimuli. However, few voxels in the brain exhibit such non-selective tuning. For instance, a voxel located in visual cortex will be more selective for the light than for the tone or the heat stimulus. A voxel in auditory cortex will be more selective for the tone than for the other two stimuli. A voxel in the somoatorsensory cortex will likely be more selective for the heat than the visual or auditory stimuli. How can we represent the tuning of these different voxels?

A simple way to model tuning to the stimulus conditions in an experiment is to multiplying each column of the design matrix by a weight that modulates the BOLD signal according to the presence of the corresponding stimulus condition. For example, we could model a visual cortex voxel by weighting the first column of with a positive value, and the remaining two columns with much smaller values (or even negative values to model suppression). It turns out that we can model the selectivity of individual voxels simultaneously through a * Selectivity Matrix*, . Each entry in is the amount that the -th voxel (columns) is tuned to the -th stimulus condition (rows). Given the design matrix and the selectivity matrix, we can then predict the BOLD signals of selectively-tuned voxels with a simple matrix multiplication:

Keeping with our example experiment, let’s assume that we are modeling the selectivity of four different voxels: a strongly-tuned visual voxel, a moderately-tuned somatosensory voxel, a weakly tuned auditory voxel, and an unselective voxel that is very weakly tuned to all three stimulus conditions. We can represent the tuning of these four voxels with a selectivity matrix. Below we define a selectivity matrix that represents the tuning of these 4 theoretical voxels and simulate the evoked BOLD signals to our 3-stimulus experiment.

% SIMULATE NOISELESS VOXELS' BOLD SIGNAL % (ASSUMING VARIABLES FROM ABOVE STILL IN WORKSPACE) y0 = X*beta; figure; subplot(211); imagesc(beta); colormap hot; axis tight ylabel('Condition') set(gca,'YTickLabel',{'Visual','Auditory','Somato.'}) xlabel('Voxel'); set(gca,'XTick',1:4) title('Voxel Selectivity, \beta') subplot(212); plot(y0,'Linewidth',2); legend({'Visual Voxel','Auditory Voxel','Somato. Voxel','Unselective'}); xlabel('Time (TRs)'); ylabel('BOLD Signal'); title('Activity for Voxels with Different Stimulus Tuning') set(gcf,'Position',[100 100 750 540]) subplot(211); colorbar

The top subpanel in the simulation output visualizes the selectivity matrix defined for the four theoretical voxels. The bottom subpanel plots the columns of the matrix of voxel responses . We see that the maximum response of the strongly-tuned visual voxel (plotted in blue) is larger than that of the other voxels, corresponding to the larger weight upper left of the selectivity matrix. Also note that the response for the unselective voxel (plotted in cyan) demonstrates the linearity property of the FIR model. The attenuated but complex BOLD signal from the unselective voxel results from the sum of small independent signals evoked by each stimulus.

## Modeling Voxel Noise

The example above demonstrates how we can model BOLD signals evoked in noisless theoretical voxels. Though this noisless scenario is helpful for developing a modeling framework, real-world voxels exhibit variable amounts of * noise *(noise is any signal that cannot be accounted by the FIR model). Therefore we need to incorporate a noise term into our BOLD signal model.

The noise in a voxel is often modeled as a random variable . A common choice for the noise model is a zero-mean Normal/Gaussian distribution with some variance :

Though the variance of the noise model may not be known apriori, there are methods for estimating it from data. We’ll get to estimating noise variance in a later post when we discuss various sources of noise and how to account for them using more advance techniques. For simplicity, let’s just assume that the noise variance is 1 as we proceed.

## Putting It All Together — The General Linear Model (GLM)

So far we have introduced on the concepts of the stimulus matrix, the HRF, the design matrix, selectivity matrix, and the noise model. We can combine all of these to compose a comprehensive quantitative model of BOLD signals measured from a set of voxels during an experiment:

This is referred to as the **General Linear Model ****(****GLM****)**.

In a typical fMRI experiment the researcher controls the stimulus presentation , and measures the evoked BOLD responses from a set of voxels. The problem then is to estimate the selectivities of the voxels based on these measurments. Specifically, we want to determine the parameters that best explain the measured BOLD signals during our experiment. The most common way to do this is a method known as * Ordinary Least Squares (OLS) Regression*. Using OLS the idea is to adjust the values of such that the predicted model BOLD signals are as similar to the measured signals as possible. In other words, the goal is to infer the selectivity each voxel would have to exhibit in order to produce the measured BOLD signals. I showed in an earlier post that the optimal OLS solution for the selectivities is given by:

Therefore, given a design matrix and a set of voxel responses associated with the design matrix, we can calculate the selectivities of voxels to the stimulus conditions represented by the columns of the design matrix. This works even when the BOLD signals are noisy. To get a better idea of this process at work let’s look at a quick example based on our toy fMRI experiment.

## Example: Recovering Voxel Selectivity Using OLS

Here the goal is to recover the selectivities of the four voxels in our toy experiment they have been corrupted with noise. First, we add noise to the voxel responses. In this example the variance of the added noise is based on a concept known as * signal-to-noise-ration* or

*. As the name suggests, SNR is the ratio of the underlying signal to the noise “on top of” the signal. SNR is a very important concept when interpreting fMRI analyses. If a voxel exhibits a low SNR, it will be far more difficult to estimate its tuning. Though there are many ways to define SNR, in this example it is defined as the ratio of the maximum signal amplitude to the variance of the noise model. The underlying noise model variance is adjusted to be one-fifth of the maximum amplitude of the BOLD signal, i.e. an SNR of 5. Feel free to try different values of SNR by changing the value of the variable in the Matlab simulation. Noisy versions of the 4 model BOLD signals are plotted in the top subpanel of the figure below. We see that the noisy signals are very different from the actual underlying BOLD signals.*

**SNR**Here we estimate the selectivities from the GLM using OLS, and then predict the BOLD signals in our experiment with this estimate. We see in the bottom subpanel of the above figure that the resulting GLM predictions of are quite accurate. We also compare the estimated selectivity matrix to the actual selectivity matrix below. We see that OLS is able to recover the selectivity of all the voxels.

% SIMULATE NOISY VOXELS & ESTIMATE TUNING % (ASSUMING VARIABLES FROM ABOVE STILL IN WORKSPACE) SNR = 5; % (APPROX.) SIGNAL-TO-NOISE RATIO noiseSTD = max(y0(:))./SNR; % NOISE LEVEL FOR EACH VOXEL noise = bsxfun(@times,randn(size(y0)),noiseSTD); y = y0 + noise; betaHat = inv(X'*X)*X'*y % OLS yHat = X*betaHat; % GLM PREDICTION figure subplot(211); plot(y,'Linewidth',3); xlabel('Time (s)'); ylabel('BOLD Signal'); legend({'Visual Voxel','Auditory Voxel','Somato. Voxel','Unselective'}); title('Noisy Voxel Responses'); subplot(212) h1 = plot(y0,'Linewidth',3); hold on h2 = plot(yHat,'-o'); legend([h1(end),h2(end)],{'Actual Responses','Predicted Responses'}) xlabel('Time (s)'); ylabel('BOLD Signal'); title('Model Predictions') set(gcf,'Position',[100 100 750 540]) figure subplot(211); imagesc(beta); colormap hot(5); axis tight ylabel('Condition') set(gca,'YTickLabel',{'Visual','Auditory','Somato.'}) xlabel('Voxel'); set(gca,'XTick',1:4) title('Actual Selectivity, \beta') subplot(212) imagesc(betaHat); colormap hot(5); axis tight ylabel('Condition') set(gca,'YTickLabel',{'Visual','Auditory','Somato.'}) xlabel('Voxel'); set(gca,'XTick',1:4) title('Noisy Estimated Selectivity') drawnow

## Wrapping Up

Here we introduced the GLM commonly used for fMRI data analyses and used the GLM framework to recover the selectivities of simulated voxels. We saw that the GLM is quite powerful of recovering the selectivity in the presence of noise. However, there are a few details left out of the story.

First, we assumed that we had an accurate (albeit exact) model for each voxel’s HRF. This is generally not the case. In real-world scenarios the HRF is either assumed to have some canonical shape, or the shape of the HRF is estimated the experiment data. Though assuming a canonical HRF shape has been validated for block design studies of peripheral sensory areas, this assumption becomes dangerous when using event-related designs, or when studying other areas of the brain.

Additionally, we did not include any physiological noise signals in our theoretical voxels. In real voxels, the BOLD signal changes due to physiological processes such as breathing and heartbeat can be far larger than the signal change due to underlying neural activation. It then becomes necessary to either account for the nuisance signals in the GLM framework, or remove them before using the model described above. In two upcoming posts we’ll discuss these two issues: estimating the HRF shape from data, and dealing with nuisance signals.

## A Gentle Introduction to Markov Chain Monte Carlo (MCMC)

Applying probabilistic models to data usually involves integrating a complex, multi-dimensional probability distribution. For example, calculating the expectation/mean of a model distribution involves such an integration. Many (most) times, these integrals are not calculable due to the high dimensionality of the distribution or because there is no closed-form expression for the integral available using calculus. Markov Chain Monte Carlo (MCMC) is a method that allows one to approximate complex integrals using stochastic sampling routines. As MCMC’s name indicates, the method is composed of two components, the ** Markov chain** and

**.**

*Monte Carlo integration*Monte Carlo integration* *is a powerful technique that exploits stochastic sampling of the distribution in question in order to approximate the difficult integration. However, in order to use* *Monte Carlo integration it is necessary to be able to sample from the probability distribution in question, which may be difficult or impossible to do directly. This is where the second component of MCMC, the Markov chain,** **comes in. A Markov chain is a sequential model that transitions from one state to another in a probabilistic fashion, where the next state that the chain takes is conditioned on the previous state. Markov chains are useful in that if they are constructed properly, and allowed to run for a long time, the states that a chain will take also sample from a target probability distribution. Therefore we can construct Markov chains to sample from the distribution whose integral we would like to approximate, then use Monte Carlo integration to perform the approximation.

Here I introduce a series of posts where I describe the basic concepts underlying MCMC, starting off by describing Monte Carlo Integration, then giving a brief introduction of Markov chains and how they can be constructed to sample from a target probability distribution. Given these foundation principles, we can then discuss MCMC techniques such as the Metropolis and Metropolis-Hastings algorithms, the Gibbs sampler, and the Hybrid Monte Carlo algorithm.

As always, each post has a somewhat formal/mathematical introduction, along with an example and simple Matlab implementations of the associated algorithms.

## MCMC: Hamiltonian Monte Carlo (a.k.a. Hybrid Monte Carlo)

The random-walk behavior of many Markov Chain Monte Carlo (MCMC) algorithms makes Markov chain convergence to a target stationary distribution inefficient, resulting in slow mixing. Hamiltonian/Hybrid Monte Carlo (HMC), is a MCMC method that adopts physical system dynamics rather than a probability distribution to propose future states in the Markov chain. This allows the Markov chain to explore the target distribution much more efficiently, resulting in faster convergence. Here we introduce basic analytic and numerical concepts for simulation of Hamiltonian dynamics. We then show how Hamiltonian dynamics can be used as the Markov chain proposal function for an MCMC sampling algorithm (HMC).

## First off, a brief physics lesson in Hamiltonian dynamics

Before we can develop Hamiltonian Monte Carlo, we need to become familiar with the concept of Hamiltonian dynamics. Hamiltonian dynamics is one way that physicists describe how objects move throughout a system. Hamiltonian dynamics describe an object’s motion in terms of its location and momentum (equivalent to the object’s mass times its velocity) at some time . For each location the object takes, there is an associated potential energy , and for each momentum there is an associated kinetic energy . The total energy of the system is constant and known as the Hamiltonian , defined simply as the sum of the potential and kinetic energies:

Hamiltonian dynamics describe how kinetic energy is converted to potential energy (and vice versa) as an object moves throughout a system in time. This description is implemented quantitatively via a set of differential equations known as the Hamiltonian equations:

Therefore, if we have expressions for and and a set of initial conditions (i.e. an initial position and initial momentum at time ), it is possible to predict the location and momentum of an object at any point in time by simulating these dynamics for a duration

## Simulating Hamiltonian dynamics — the Leap Frog Method

The Hamiltonian equations describe an object’s motion in time, which is a continuous variable. In order to simulate Hamiltonian dynamics numerically on a computer, it is necessary to approximate the Hamiltonian equations by discretizing time. This is done by splitting the interval up into a series of smaller intervals of length . The smaller the value of the closer the approximation is to the dynamics in continuous time. There are a number of procedures that have been developed for discretizing time including Euler’s method and the Leap Frog Method, which I will introduce briefly in the context of Hamiltonian dynamics. The Leap Frog method updates the momentum and position variables sequentially, starting by simulating the momentum dynamics over a small interval of time , then simulating the position dynamics over a slightly longer interval in time , then completing the momentum simulation over another small interval of time so that and now exist at the same point in time. Specifically, the Leap Frog method is as follows: 1. Take a half step in time to update the momentum variable:

2. Take a full step in time to update the position variable

3. Take the remaining half step in time to finish updating the momentum variable

The Leap Fog method can be run for steps to simulate dynamics over units of time. This particular discretization method has a number of properties that make it preferable to other approximation methods like Euler’s method, particularly for use in MCMC, but discussion of these properties are beyond the scope of this post. Let’s see how we can use the Leap Frog method to simulate Hamiltonian dynamics in a simple 1D example.

### Example 1: Simulating Hamiltonian dynamics of an harmonic oscillator

Imagine a ball with mass equal to one is attached to a horizontally-oriented spring. The spring exerts a force on the ball equal to

which works to restore the ball’s position to the equilibrium position of the spring at . Let’s assume that the spring constant , which defines the strength of the restoring force is also equal to one. If the ball is displaced by some distance from equilibrium, then the potential energy is

In addition, the kinetic energy an object with mass moving with velocity within a linear system is known to be

,

if the object’s mass is equal to one, like the ball this example. Notice that we now have in hand the expressions for both and . In order to simulate the Hamiltonian dynamics of the system using the Leap Frog method, we also need expressions for the partial derivatives of each variable (in this 1D example there are only one for each variable):

Therefore one iteration the Leap Frog algorithm for simulating Hamiltonian dynamics in this system is:

1.

2. 3.

We simulate the dynamics of the spring-mass system described using the Leap Frog method in Matlab below (if the graph is not animated, try clicking on it to open up the linked .gif). The left bar in the bottom left subpanel of the simulation output demonstrates the trade-off between potential and kinetic energy described by Hamiltonian dynamics. The cyan portion of the bar is the proportion of the Hamiltonian contributed by the potential energy , and the yellow portion represents is the contribution of the kinetic energy . The right bar (in all yellow), is the total value of the Hamiltonian . Here we see that the ball oscillates about the equilibrium position of the spring with a constant period/frequency. As the ball passes the equilibrium position , it has a minimum potential energy and maximum kinetic energy. At the extremes of the ball’s trajectory, the potential energy is at a maximum, while the kinetic energy is minimized. The procession of momentum and position map out positions in what is referred to as ** phase space**, which is displayed in the bottom right subpanel of the output. The harmonic oscillator maps out an ellipse in phase space. The size of the ellipse depends on the energy of the system defined by initial conditions.

You may also notice that the value of the Hamiltonian is not a exactly constant in the simulation, but oscillates slightly. This is an artifact known as ** energy drift** due to approximations used to the discretize time.

% EXAMPLE 1: SIMULATING HAMILTONIAN DYNAMICS % OF HARMONIC OSCILLATOR % STEP SIZE delta = 0.1; % # LEAP FROG L = 70; % DEFINE KINETIC ENERGY FUNCTION K = inline('p^2/2','p'); % DEFINE POTENTIAL ENERGY FUNCTION FOR SPRING (K =1) U = inline('1/2*x^2','x'); % DEFINE GRADIENT OF POTENTIAL ENERGY dU = inline('x','x'); % INITIAL CONDITIONS x0 = -4; % POSTIION p0 = 1; % MOMENTUM figure %% SIMULATE HAMILTONIAN DYNAMICS WITH LEAPFROG METHOD % FIRST HALF STEP FOR MOMENTUM pStep = p0 - delta/2*dU(x0)'; % FIRST FULL STEP FOR POSITION/SAMPLE xStep = x0 + delta*pStep; % FULL STEPS for jL = 1:L-1 % UPDATE MOMENTUM pStep = pStep - delta*dU(xStep); % UPDATE POSITION xStep = xStep + delta*pStep; % UPDATE DISPLAYS subplot(211), cla hold on; xx = linspace(-6,xStep,1000); plot(xx,sin(6*linspace(0,2*pi,1000)),'k-'); plot(xStep+.5,0,'bo','Linewidth',20) xlim([-6 6]);ylim([-1 1]) hold off; title('Harmonic Oscillator') subplot(223), cla b = bar([U(xStep),K(pStep);0,U(xStep)+K(pStep)],'stacked'); set(gca,'xTickLabel',{'U+K','H'}) ylim([0 10]); title('Energy') subplot(224); plot(xStep,pStep,'ko','Linewidth',20); xlim([-6 6]); ylim([-6 6]); axis square xlabel('x'); ylabel('p'); title('Phase Space') pause(.1) end % (LAST HALF STEP FOR MOMENTUM) pStep = pStep - delta/2*dU(xStep);

## Hamiltonian dynamics and the target distribution

Now that we have a better understanding of what Hamiltonian dynamics are and how they can be simulated, let’s now discuss how we can use Hamiltonian dynamics for MCMC. The main idea behind Hamiltonian/Hibrid Monte Carlo is to develop a Hamiltonian function such that the resulting Hamiltonian dynamics allow us to efficiently explore some target distribution . How can we choose such a Hamiltonian function? It turns out it is pretty simple to relate a to using a basic concept adopted from statistical mechanics known as the ** canonical distribution**. For any energy function over a set of variables , we can define the corresponding canonical distribution as: where we simply take the exponential of the negative of the energy function. The variable is a normalizing constant called the

**that scales the canonical distribution such that is sums to one, creating a valid probability distribution. Don’t worry about , it isn’t really important because, as you may recall from an earlier post, MCMC methods can sample from unscaled probability distributions. Now, as we saw above, the energy function for Hamiltonian dynamics is a combination of potential and kinetic energies:**

*partition function*Therefore the canoncial distribution for the Hamiltonian dynamics energy function is

Here we see that joint (canonical) distribution for and factorizes. This means that the two variables are independent, and the canoncial distribution is independent of the analogous distribution for the momentum. Therefore, as we’ll see shortly, we can use Hamiltonian dynamics to sample from the joint canonical distribution over and and simply ignore the momentum contributions. Note that this is an example of introducing ** auxiliary variables **to facilitate the Markov chain path. Introducing the auxiliary variable allows us to use Hamiltonian dynamics, which are unavailable without them. Because the canonical distribution for is independent of the canonical distribution for , we can choose any distribution from which to sample the momentum variables. A common choice is to use a zero-mean Normal distribution with unit variance:

Note that this is equivalent to having a quadratic potential energy term in the Hamiltonian:

Recall that this is is the exact quadratic kinetic energy function (albeit in 1D) used in the harmonic oscillator example above. This is a convenient choice for the kinetic energy function as all partial derivatives are easy to compute. Now that we have defined a kinetic energy function, all we have to do is find a potential energy function that when negated and run through the exponential function, gives the target distribution (or an unscaled version of it). Another way of thinking of it is that we can define the potential energy function as

.

If we can calculate , then we’re in business and we can simulate Hamiltonian dynamics that can be used in an MCMC technique.

## Hamiltonian Monte Carlo

In HMC we use Hamiltonian dynamics as a proposal function for a Markov Chain in order to explore the target (canonical) density defined by more efficiently than using a proposal probability distribution. Starting at an initial state , we simulate Hamiltonian dynamics for a short time using the Leap Frog method. We then use the state of the position and momentum variables at the end of the simulation as our proposed states variables and . The proposed state is accepted using an update rule analogous to the Metropolis acceptance criterion. Specifically if the probability of the proposed state after Hamiltonian dynamics

is greater than probability of the state prior to the Hamiltonian dynamics

then the proposed state is accepted, otherwise, the proposed state is accepted randomly. If the state is rejected, the next state of the Markov chain is set as the state at . For a given set of initial conditions, Hamiltonian dynamics will follow contours of constant energy in phase space (analogous to the circle traced out in phase space in the example above). Therefore we must randomly perturb the dynamics so as to explore all of . This is done by simply drawing a random momentum from the corresponding canonical distribution before running the dynamics prior to each sampling iteration . Combining these steps, sampling random momentum, followed by Hamiltonian dynamics and Metropolis acceptance criterion defines the HMC algorithm for drawing samples from a target distribution:

- set
- generate an initial position state
- repeat until

set

– sample a new initial momentum variable from the momentum canonical distribution

– set

– run Leap Frog algorithm starting at for steps and stepsize to obtain proposed states and

– calculate the Metropolis acceptance probability:

– draw a random number from

if accept the proposed state position and set the next state in the Markov chain

else set

In the next example we implement HMC to sample from a multivariate target distribution that we have sampled from previously using multi-variate Metropolis-Hastings, the bivariate Normal. We also qualitatively compare the sampling dynamics of HMC to multivariate Metropolis-Hastings for the sampling the same distribution.

### Example 2: Hamiltonian Monte for sampling a Bivariate Normal distribution

As a reminder, the target distribution for this exampleis a Normal form with following parameterization:

with mean

and covariance

In order to sample from (assuming that we are using a quadratic energy function), we need to determine the expressions for and

.

Recall that the target potential energy function can be defined from the canonical form as

If we take the negative log of the Normal distribution outline above, this defines the following potential energy function:

Where is the normalizing constant for a Normal distribution (and can be ignored because it will eventually cancel). The potential energy function is then simply:

with partial derivatives

Using these expressions for the potential energy and its partial derivatives, we implement HMC for sampling from the bivariate Normal in Matlab:

In the graph above we display HMC samples of the target distribution, starting from an initial position very far from the mean of the target. We can see that HMC rapidly approaches areas of high density under the target distribution. We compare these samples with samples drawn using the Metropolis-Hastings (MH) algorithm below. The MH algorithm converges much slower than HMC, and consecutive samples have much higher autocorrelation than samples drawn using HMC.

The Matlab code for the HMC sampler:

% EXAMPLE 2: HYBRID MONTE CARLO SAMPLING -- BIVARIATE NORMAL rand('seed',12345); randn('seed',12345); % STEP SIZE delta = 0.3; nSamples = 1000; L = 20; % DEFINE POTENTIAL ENERGY FUNCTION U = inline('transp(x)*inv([1,.8;.8,1])*x','x'); % DEFINE GRADIENT OF POTENTIAL ENERGY dU = inline('transp(x)*inv([1,.8;.8,1])','x'); % DEFINE KINETIC ENERGY FUNCTION K = inline('sum((transp(p)*p))/2','p'); % INITIAL STATE x = zeros(2,nSamples); x0 = [0;6]; x(:,1) = x0; t = 1; while t < nSamples t = t + 1; % SAMPLE RANDOM MOMENTUM p0 = randn(2,1); %% SIMULATE HAMILTONIAN DYNAMICS % FIRST 1/2 STEP OF MOMENTUM pStar = p0 - delta/2*dU(x(:,t-1))'; % FIRST FULL STEP FOR POSITION/SAMPLE xStar = x(:,t-1) + delta*pStar; % FULL STEPS for jL = 1:L-1 % MOMENTUM pStar = pStar - delta*dU(xStar)'; % POSITION/SAMPLE xStar = xStar + delta*pStar; end % LAST HALP STEP pStar = pStar - delta/2*dU(xStar)'; % COULD NEGATE MOMENTUM HERE TO LEAVE % THE PROPOSAL DISTRIBUTION SYMMETRIC. % HOWEVER WE THROW THIS AWAY FOR NEXT % SAMPLE, SO IT DOESN'T MATTER % EVALUATE ENERGIES AT % START AND END OF TRAJECTORY U0 = U(x(:,t-1)); UStar = U(xStar); K0 = K(p0); KStar = K(pStar); % ACCEPTANCE/REJECTION CRITERION alpha = min(1,exp((U0 + K0) - (UStar + KStar))); u = rand; if u < alpha x(:,t) = xStar; else x(:,t) = x(:,t-1); end end % DISPLAY figure scatter(x(1,:),x(2,:),'k.'); hold on; plot(x(1,1:50),x(2,1:50),'ro-','Linewidth',2); xlim([-6 6]); ylim([-6 6]); legend({'Samples','1st 50 States'},'Location','Northwest') title('Hamiltonian Monte Carlo')

## Wrapping up

In this post we introduced the Hamiltonian/Hybrid Monte Carlo algorithm for more efficient MCMC sampling. The HMC algorithm is extremely powerful for sampling distributions that can be represented terms of a potential energy function and its partial derivatives. Despite the efficiency and elegance of HMC, it is an underrepresented sampling routine in the literature. This may be due to the popularity of simpler algorithms such as Gibbs sampling or Metropolis-Hastings, or perhaps due to the fact that one must select hyperparameters such as the number of Leap Frog steps and Leap Frog step size when using HMC. However, recent research has provided effective heuristics such as adapting the Leap Frog step size in order to maintain a constant Metropolis rejection rate, which facilitate the use of HMC for general applications.

## MCMC: The Gibbs Sampler

In the previous post, we compared using block-wise and component-wise implementations of the Metropolis-Hastings algorithm for sampling from a multivariate probability distribution. Component-wise updates for MCMC algorithms are generally more efficient for multivariate problems than blockwise updates in that we are more likely to accept a proposed sample by drawing each component/dimension independently of the others. However, samples may still be rejected, leading to excess computation that is never used. The Gibbs sampler, another popular MCMC sampling technique, provides a means of avoiding such wasted computation. Like the component-wise implementation of the Metropolis-Hastings algorithm, the Gibbs sampler also uses component-wise updates. However, unlike in the Metropolis-Hastings algorithm, all proposed samples are accepted, so there is no wasted computation.

The Gibbs sampler is applicable for certain classes of problems, based on two main criterion. Given a target distribution , where , ), The first criterion is 1) that it is necessary that we have an analytic (mathematical) expression for the conditional distribution of each variable in the joint distribution given all other variables in the joint. Formally, if the target distribution is -dimensional, we must have individual expressions for

.

Each of these expressions defines the probability of the -th dimension given that we have values for all other () dimensions. Having the conditional distribution for each variable means that we don’t need a proposal distribution or an accept/reject criterion, like in the Metropolis-Hastings algorithm. Therefore, we can simply sample from each conditional while keeping all other variables held fixed. This leads to the second criterion 2) that we must be able to sample from each conditional distribution. This caveat is obvious if we want an implementable algorithm.

The Gibbs sampler works in much the same way as the component-wise Metropolis-Hastings algorithms except that instead drawing from a proposal distribution for each dimension, then accepting or rejecting the proposed sample, we simply draw a value for that dimension according to the variable’s corresponding conditional distribution. We also accept all values that are drawn. Similar to the component-wise Metropolis-Hastings algorithm, we step through each variable sequentially, sampling it while keeping all other variables fixed. The Gibbs sampling procedure is outlined below

- set
- generate an initial state
- repeat until

set

for each dimension

draw from

To get a better understanding of the Gibbs sampler at work, let’s implement the Gibbs sampler to solve the same multivariate sampling problem addressed in the previous post.

### Example: Sampling from a bivariate a Normal distribution

This example parallels the examples in the previous post where we sampled from a 2-D Normal distribution using block-wise and component-wise Metropolis-Hastings algorithms. Here, we show how to implement a Gibbs sampler to draw samples from the same target distribution. As a reminder, the target distribution is a Normal form with following parameterization:

with mean

and covariance

In order to sample from this distribution using a Gibbs sampler, we need to have in hand the conditional distributions for variables/dimensions and :

(i.e. the conditional for the first dimension, )

and

(the conditional for the second dimension, )

Where is the previous state of the second dimension, and is the state of the first dimension after drawing from . The reason for the discrepancy between updating and using states and , can be is seen in step 3 of the algorithm outlined in the previous section. At iteration we first sample a new state for variable conditioned on the most recent state of variable , which is from iteration . We then sample a new state for the variable conditioned on the most recent state of variable , which is now from the current iteration, .

After some math (which which I will skip for some brevity, but see the following for some details), we find that the two conditional distributions for the target Normal distribution are:

and

,

which are both univariate Normal distributions, each with a mean that is dependent on the value of the most recent state of the conditioning variable, and a variance that is dependent on the target covariances between the two variables.

Using the above expressions for the conditional probabilities of variables and , we implement the Gibbs sampler using MATLAB below. The output of the sampler is shown here:

Inspecting the figure above, note how at each iteration the Markov chain for the Gibbs sampler first takes a step only along the direction, then only along the direction. This shows how the Gibbs sampler sequentially samples the value of each variable separately, in a component-wise fashion.

% EXAMPLE: GIBBS SAMPLER FOR BIVARIATE NORMAL rand('seed' ,12345); nSamples = 5000; mu = [0 0]; % TARGET MEAN rho(1) = 0.8; % rho_21 rho(2) = 0.8; % rho_12 % INITIALIZE THE GIBBS SAMPLER propSigma = 1; % PROPOSAL VARIANCE minn = [-3 -3]; maxx = [3 3]; % INITIALIZE SAMPLES x = zeros(nSamples,2); x(1,1) = unifrnd(minn(1), maxx(1)); x(1,2) = unifrnd(minn(2), maxx(2)); dims = 1:2; % INDEX INTO EACH DIMENSION % RUN GIBBS SAMPLER t = 1; while t < nSamples t = t + 1; T = [t-1,t]; for iD = 1:2 % LOOP OVER DIMENSIONS % UPDATE SAMPLES nIx = dims~=iD; % *NOT* THE CURRENT DIMENSION % CONDITIONAL MEAN muCond = mu(iD) + rho(iD)*(x(T(iD),nIx)-mu(nIx)); % CONDITIONAL VARIANCE varCond = sqrt(1-rho(iD)^2); % DRAW FROM CONDITIONAL x(t,iD) = normrnd(muCond,varCond); end end % DISPLAY SAMPLING DYNAMICS figure; h1 = scatter(x(:,1),x(:,2),'r.'); % CONDITIONAL STEPS/SAMPLES hold on; for t = 1:50 plot([x(t,1),x(t+1,1)],[x(t,2),x(t,2)],'k-'); plot([x(t+1,1),x(t+1,1)],[x(t,2),x(t+1,2)],'k-'); h2 = plot(x(t+1,1),x(t+1,2),'ko'); end h3 = scatter(x(1,1),x(1,2),'go','Linewidth',3); legend([h1,h2,h3],{'Samples','1st 50 Samples','x(t=0)'},'Location','Northwest') hold off; xlabel('x_1'); ylabel('x_2'); axis square

## Wrapping Up

The Gibbs sampler is a popular MCMC method for sampling from complex, multivariate probability distributions. However, the Gibbs sampler cannot be used for general sampling problems. For many target distributions, it may difficult or impossible to obtain a closed-form expression for all the needed conditional distributions. In other scenarios, analytic expressions may exist for all conditionals but it may be difficult to sample from any or all of the conditional distributions (in these scenarios it is common to use univariate sampling methods such as rejection sampling and (surprise!) Metropolis-type MCMC techniques to approximate samples from each conditional). Gibbs samplers are very popular for Bayesian methods where models are often devised in such a way that conditional expressions for all model variables are easily obtained and take well-known forms that can be sampled from efficiently.

Gibbs sampling, like many MCMC techniques suffer from what is often called “slow mixing.” Slow mixing occurs when the underlying Markov chain takes a long time to sufficiently explore the values of in order to give a good characterization of . Slow mixing is due to a number of factors including the “random walk” nature of the Markov chain, as well as the tendency of the Markov chain to get “stuck,” only sampling a single region of having high-probability under . Such behaviors are bad for sampling distributions with multiple modes or heavy tails. More advanced techniques, such as Hybrid Monte Carlo have been developed to incorporate additional dynamics that increase the efficiency of the Markov chain path. We will discuss Hybrid Monte Carlo in a future post.

## MCMC: Multivariate Distributions, Block-wise, & Component-wise Updates

In the previous posts on MCMC methods, we focused on how to sample from univariate target distributions. This was done mainly to give the reader some intuition about MCMC implementations with fairly tangible examples that can be visualized. However, MCMC can easily be extended to sample multivariate distributions.

In this post we will discuss two flavors of MCMC update procedure for sampling distributions in multiple dimensions: block-wise, and component-wise update procedures. We will show how these two different procedures can give rise to different implementations of the Metropolis-Hastings sampler to solve the same problem.

## Block-wise Sampling

The first approach for performing multidimensional sampling is to use ** block-wise updates**. In this approach the proposal distribution has the same dimensionality as the target distribution . Specifically, if is a distribution over variables, ie. , then we must design a proposal distribution that is also a distribution involving variables. We then accept or reject a proposed state sampled from the proposal distribution in exactly the same way as for the univariate Metropolis-Hastings algorithm. To generate multivariate samples we perform the following block-wise sampling procedure:

- set
- generate an initial state
- repeat until

set

generate a proposal state from

calculate the proposal correction factor

calculate the acceptance probability

draw a random number from

if accept the proposal state and set

else set

Let’s take a look at the block-wise sampling routine in action.

### Example 1: Block-wise Metropolis-Hastings for sampling of bivariate Normal distribution

In this example we use block-wise Metropolis-Hastings algorithm to sample from a bivariate (i.e. ) Normal distribution:

with mean

and covariance

Usually the target distribution will have a complex mathematical form, but for this example we’ll circumvent that by using MATLAB’s built-in function to evaluate . For our proposal distribution, , let’s use a circular Normal centered at the the previous state/sample of the Markov chain/sampler, i.e:

,

where is a 2-D identity matrix, giving the proposal distribution unit variance along both dimensions and , and zero covariance. You can find an MATLAB implementation of the block-wise sampler at the end of the section. The display of the samples and the target distribution output by the sampler implementation are shown below:

We can see from the output that the block-wise sampler does a good job of drawing samples from the target distribution.

Note that our proposal distribution in this example is symmetric, therefore it was not necessary to calculate the correction factor per se. This means that this Metropolis-Hastings implementation is identical to the simpler Metropolis sampler.

%------------------------------------------------------ % EXAMPLE 1: METROPOLIS-HASTINGS % BLOCK-WISE SAMPLER (BIVARIATE NORMAL) rand('seed' ,12345); D = 2; % # VARIABLES nBurnIn = 100; % TARGET DISTRIBUTION IS A 2D NORMAL WITH STRONG COVARIANCE p = inline('mvnpdf(x,[0 0],[1 0.8;0.8 1])','x'); % PROPOSAL DISTRIBUTION STANDARD 2D GUASSIAN q = inline('mvnpdf(x,mu)','x','mu') nSamples = 5000; minn = [-3 -3]; maxx = [3 3]; % INITIALIZE BLOCK-WISE SAMPLER t = 1; x = zeros(nSamples,2); x(1,:) = randn(1,D); % RUN SAMPLER while t < nSamples t = t + 1; % SAMPLE FROM PROPOSAL xStar = mvnrnd(x(t-1,:),eye(D)); % CORRECTION FACTOR (SHOULD EQUAL 1) c = q(x(t-1,:),xStar)/q(xStar,x(t-1,:)); % CALCULATE THE M-H ACCEPTANCE PROBABILITY alpha = min([1, p(xStar)/p(x(t-1,:))]); % ACCEPT OR REJECT? u = rand; if u < alpha x(t,:) = xStar; else x(t,:) = x(t-1,:); end end % DISPLAY nBins = 20; bins1 = linspace(minn(1), maxx(1), nBins); bins2 = linspace(minn(2), maxx(2), nBins); % DISPLAY SAMPLED DISTRIBUTION ax = subplot(121); bins1 = linspace(minn(1), maxx(1), nBins); bins2 = linspace(minn(2), maxx(2), nBins); sampX = hist3(x, 'Edges', {bins1, bins2}); hist3(x, 'Edges', {bins1, bins2}); view(-15,40) % COLOR HISTOGRAM BARS ACCORDING TO HEIGHT colormap hot set(gcf,'renderer','opengl'); set(get(gca,'child'),'FaceColor','interp','CDataMode','auto'); xlabel('x_1'); ylabel('x_2'); zlabel('Frequency'); axis square set(ax,'xTick',[minn(1),0,maxx(1)]); set(ax,'yTick',[minn(2),0,maxx(2)]); title('Sampled Distribution'); % DISPLAY ANALYTIC DENSITY ax = subplot(122); [x1 ,x2] = meshgrid(bins1,bins2); probX = p([x1(:), x2(:)]); probX = reshape(probX ,nBins, nBins); surf(probX); axis xy view(-15,40) xlabel('x_1'); ylabel('x_2'); zlabel('p({\bfx})'); colormap hot axis square set(ax,'xTick',[1,round(nBins/2),nBins]); set(ax,'xTickLabel',[minn(1),0,maxx(1)]); set(ax,'yTick',[1,round(nBins/2),nBins]); set(ax,'yTickLabel',[minn(2),0,maxx(2)]); title('Analytic Distribution')

## Component-wise Sampling

A problem with block-wise updates, particularly when the number of dimensions becomes large, is that finding a suitable proposal distribution is difficult. This leads to a large proportion of the samples being rejected. One way to remedy this is to simply loop over the the dimensions of in sequence, sampling each dimension independently from the others. This is what is known as using * component-wise updates*. Note that now the proposal distribution is univariate, working only in one dimension, namely the current dimension that we are trying to sample. The component-wise Metropolis-Hastings algorithm is outlined below.

- set
- generate an initial state
- repeat until

set

for each dimension

generate a proposal state from

calculate the proposal correction factor

calculate the acceptance probability

draw a random number from

if accept the proposal state and set

else set

Note that in the component-wise implementation a sample for the -th dimension is proposed, then accepted or rejected while all other dimensions () are held fixed. We then move on to the next (-th) dimension and repeat the process while holding all other variables () fixed. In each successive step we are using updated values for the dimensions that have occurred since increasing .

### Example 2: Component-wise Metropolis-Hastings for sampling of bivariate Normal distribution

In this example we draw samples from the same bivariate Normal target distribution described in Example 1, but using component-wise updates. Therefore is the same, however, the proposal distribution is now a univariate Normal distribution with unit unit variance in the direction of the -th dimension to be sampled. The MATLAB implementation of the component-wise sampler is at the end of the section. The samples and comparison to the analytic target distribution are shown below.

Again, we see that we get a good characterization of the bivariate target distribution.

%-------------------------------------------------- % EXAMPLE 2: METROPOLIS-HASTINGS % COMPONENT-WISE SAMPLING OF BIVARIATE NORMAL rand('seed' ,12345); % TARGET DISTRIBUTION p = inline('mvnpdf(x,[0 0],[1 0.8;0.8 1])','x'); nSamples = 5000; propSigma = 1; % PROPOSAL VARIANCE minn = [-3 -3]; maxx = [3 3]; % INITIALIZE COMPONENT-WISE SAMPLER x = zeros(nSamples,2); xCurrent(1) = randn; xCurrent(2) = randn; dims = 1:2; % INDICES INTO EACH DIMENSION t = 1; x(t,1) = xCurrent(1); x(t,2) = xCurrent(2); % RUN SAMPLER while t < nSamples t = t + 1; for iD = 1:2 % LOOP OVER DIMENSIONS % SAMPLE PROPOSAL xStar = normrnd(xCurrent(:,iD), propSigma); % NOTE: CORRECTION FACTOR c=1 BECAUSE % N(mu,1) IS SYMMETRIC, NO NEED TO CALCULATE % CALCULATE THE ACCEPTANCE PROBABILITY pratio = p([xStar xCurrent(dims~=iD)])/ ... p([xCurrent(1) xCurrent(2)]); alpha = min([1, pratio]); % ACCEPT OR REJECT? u = rand; if u < alpha xCurrent(iD) = xStar; end end % UPDATE SAMPLES x(t,:) = xCurrent; end % DISPLAY nBins = 20; bins1 = linspace(minn(1), maxx(1), nBins); bins2 = linspace(minn(2), maxx(2), nBins); % DISPLAY SAMPLED DISTRIBUTION figure; ax = subplot(121); bins1 = linspace(minn(1), maxx(1), nBins); bins2 = linspace(minn(2), maxx(2), nBins); sampX = hist3(x, 'Edges', {bins1, bins2}); hist3(x, 'Edges', {bins1, bins2}); view(-15,40) % COLOR HISTOGRAM BARS ACCORDING TO HEIGHT colormap hot set(gcf,'renderer','opengl'); set(get(gca,'child'),'FaceColor','interp','CDataMode','auto'); xlabel('x_1'); ylabel('x_2'); zlabel('Frequency'); axis square set(ax,'xTick',[minn(1),0,maxx(1)]); set(ax,'yTick',[minn(2),0,maxx(2)]); title('Sampled Distribution'); % DISPLAY ANALYTIC DENSITY ax = subplot(122); [x1 ,x2] = meshgrid(bins1,bins2); probX = p([x1(:), x2(:)]); probX = reshape(probX ,nBins, nBins); surf(probX); axis xy view(-15,40) xlabel('x_1'); ylabel('x_2'); zlabel('p({\bfx})'); colormap hot axis square set(ax,'xTick',[1,round(nBins/2),nBins]); set(ax,'xTickLabel',[minn(1),0,maxx(1)]); set(ax,'yTick',[1,round(nBins/2),nBins]); set(ax,'yTickLabel',[minn(2),0,maxx(2)]); title('Analytic Distribution')

## Wrapping Up

Here we saw how we can use block- and component-wise updates to derive two different implementations of the Metropolis-Hastings algorithm. In the next post we will use component-wise updates introduced above to motivate the Gibbs sampler, which is often used to increase the efficiency of sampling well-defined probability multivariate distributions.

## MCMC: The Metropolis-Hastings Sampler

In an earlier post we discussed how the Metropolis sampling algorithm can draw samples from a complex and/or unnormalized target probability distributions using a Markov chain. The Metropolis algorithm first proposes a possible new state in the Markov chain, based on a previous state , according to the proposal distribution . The algorithm accepts or rejects the proposed state based on the density of the the target distribution evaluated at . (If any of this Markov-speak is gibberish to the reader, please refer to the previous posts on Markov Chains, MCMC, and the Metropolis Algorithm for some clarification).

One constraint of the Metropolis sampler is that the proposal distribution must be symmetric. The constraint originates from using a Markov Chain to draw samples: a necessary condition for drawing from a Markov chain’s stationary distribution is that at any given point in time , the probability of moving from must be equal to the probability of moving from , a condition known as * reversibility* or

**. However, a symmetric proposal distribution may be ill-fit for many problems, like when we want to sample from distributions that are bounded on semi infinite intervals (e.g. ).**

*detailed balance*In order to be able to use an asymmetric proposal distributions, the Metropolis-Hastings algorithm implements an additional correction factor , defined from the proposal distribution as

The correction factor adjusts the transition operator to ensure that the probability of moving from is equal to the probability of moving from , no matter the proposal distribution.

The Metropolis-Hastings algorithm is implemented with essentially the same procedure as the Metropolis sampler, except that the correction factor is used in the evaluation of acceptance probability . Specifically, to draw samples using the Metropolis-Hastings sampler:

- set t = 0
- generate an initial state
- repeat until

set

generate a proposal state from

calculate the proposal correction factor

calculate the acceptance probability

draw a random number from

if accept the proposal state and set

else set

Many consider the Metropolis-Hastings algorithm to be a generalization of the Metropolis algorithm. This is because when the proposal distribution is symmetric, the correction factor is equal to one, giving the transition operator for the Metropolis sampler.

## Example: Sampling from a Bayesian posterior with improper prior

For a number of applications, including regression and density estimation, it is usually necessary to determine a set of parameters to an assumed model such that the model can best account for some observed data . The model function is often referred to as the likelihood function. In Bayesian methods there is often an explicit prior distribution that is placed on the model parameters and controls the values that the parameters can take.

The parameters are determined based on the posterior distribution , which is a probability distribution over the possible parameters based on the observed data. The posterior can be determined using Bayes’ theorem:

where, is a normalization constant that is often quite difficult to determine explicitly, as it involves computing sums over every possible value that the parameters and can take.

Let’s say that we assume the following model (likelihood function):

, where

, where

is the gamma function. Thus, the model parameters are

The parameter controls the shape of the distribution, and controls the scale. The likelihood surface for , and a number of values of ranging from zero to five are shown below.

The conditional distribution is plotted in green along the likelihood surface. You can verify this is a valid conditional in MATLAB with the following command:

plot(0:.1:10,gampdf(0:.1:10,4,1)); % GAMMA(4,1)

Now, let’s assume the following priors on the model parameters:

and

The first prior states that only takes a single value (i.e. 1), therefore we can treat it as a constant. The second (rather non-conventional) prior states that the probability of varies as a sinusoidal function. (Note that both of these prior distributions are called * improper priors* because they do not integrate to one). Because is constant, we only need to estimate the value of .

It turns out that even though the normalization constant may be difficult to compute, we can sample from without knowing using the Metropolis-Hastings algorithm. In particular, we can ignore the normalization constant and sample from the unnormalized posterior:

The surface of the (unnormalized) posterior for ranging from zero to ten are shown below. The prior is displayed in blue on the right of the plot. Let’s say that we have a datapoint and would like to estimate the posterior distribution using the Metropolis-Hastings algorithm. This particular target distribution is plotted in magenta in the plot below.

Using a symmetric proposal distribution like the Normal distribution is not efficient for sampling from due to the fact that the posterior only has support on the real positive numbers . An asymmetric proposal distribution with the same support, would provide a better coverage of the posterior. One distribution that operates on the positive real numbers is the exponential distribution.

,

This distribution is parameterized by a single variable that controls the scale and location of the distribution probability mass. The target posterior and a proposal distribution (for ) are shown in the plot below.

We see that the proposal has a fairly good coverage of the posterior distribution. We run the Metropolis-Hastings sampler in the block of MATLAB code at the bottom. The Markov chain path and the resulting samples are shown in plot below.

As an aside, note that the proposal distribution for this sampler does not depend on past samples, but only on the parameter (see line 88 in the MATLAB code below). Each proposal states is drawn independently of the previous state. Therefore this is an example of an ** independence sampler**, a specific type of Metropolis-Hastings sampling algorithm. Independence samplers are notorious for being either very good or very poor sampling routines. The quality of the routine depends on the choice of the proposal distribution, and its coverage of the target distribution. Identifying such a proposal distribution is often difficult in practice.

The MATLAB code for running the Metropolis-Hastings sampler is below. Use the copy icon in the upper right of the code block to copy it to your clipboard. Paste in a MATLAB terminal to output the figures above.

% METROPOLIS-HASTINGS BAYESIAN POSTERIOR rand('seed',12345) % PRIOR OVER SCALE PARAMETERS B = 1; % DEFINE LIKELIHOOD likelihood = inline('(B.^A/gamma(A)).*y.^(A-1).*exp(-(B.*y))','y','A','B'); % CALCULATE AND VISUALIZE THE LIKELIHOOD SURFACE yy = linspace(0,10,100); AA = linspace(0.1,5,100); likeSurf = zeros(numel(yy),numel(AA)); for iA = 1:numel(AA); likeSurf(:,iA)=likelihood(yy(:),AA(iA),B); end; figure; surf(likeSurf); ylabel('p(y|A)'); xlabel('A'); colormap hot % DISPLAY CONDITIONAL AT A = 2 hold on; ly = plot3(ones(1,numel(AA))*40,1:100,likeSurf(:,40),'g','linewidth',3) xlim([0 100]); ylim([0 100]); axis normal set(gca,'XTick',[0,100]); set(gca,'XTickLabel',[0 5]); set(gca,'YTick',[0,100]); set(gca,'YTickLabel',[0 10]); view(65,25) legend(ly,'p(y|A=2)','Location','Northeast'); hold off; title('p(y|A)'); % DEFINE PRIOR OVER SHAPE PARAMETERS prior = inline('sin(pi*A).^2','A'); % DEFINE THE POSTERIOR p = inline('(B.^A/gamma(A)).*y.^(A-1).*exp(-(B.*y)).*sin(pi*A).^2','y','A','B'); % CALCULATE AND DISPLAY THE POSTERIOR SURFACE postSurf = zeros(size(likeSurf)); for iA = 1:numel(AA); postSurf(:,iA)=p(yy(:),AA(iA),B); end; figure surf(postSurf); ylabel('y'); xlabel('A'); colormap hot % DISPLAY THE PRIOR hold on; pA = plot3(1:100,ones(1,numel(AA))*100,prior(AA),'b','linewidth',3) % SAMPLE FROM p(A | y = 1.5) y = 1.5; target = postSurf(16,:); % DISPLAY POSTERIOR psA = plot3(1:100, ones(1,numel(AA))*16,postSurf(16,:),'m','linewidth',3) xlim([0 100]); ylim([0 100]); axis normal set(gca,'XTick',[0,100]); set(gca,'XTickLabel',[0 5]); set(gca,'YTick',[0,100]); set(gca,'YTickLabel',[0 10]); view(65,25) legend([pA,psA],{'p(A)','p(A|y = 1.5)'},'Location','Northeast'); hold off title('p(A|y)'); % INITIALIZE THE METROPOLIS-HASTINGS SAMPLER % DEFINE PROPOSAL DENSITY q = inline('exppdf(x,mu)','x','mu'); % MEAN FOR PROPOSAL DENSITY mu = 5; % DISPLAY TARGET AND PROPOSAL figure; hold on; th = plot(AA,target,'m','Linewidth',2); qh = plot(AA,q(AA,mu),'k','Linewidth',2) legend([th,qh],{'Target, p(A)','Proposal, q(A)'}); xlabel('A'); % SOME CONSTANTS nSamples = 5000; burnIn = 500; minn = 0.1; maxx = 5; % INTIIALZE SAMPLER x = zeros(1 ,nSamples); x(1) = mu; t = 1; % RUN METROPOLIS-HASTINGS SAMPLER while t < nSamples t = t+1; % SAMPLE FROM PROPOSAL xStar = exprnd(mu); % CORRECTION FACTOR c = q(x(t-1),mu)/q(xStar,mu); % CALCULATE THE (CORRECTED) ACCEPTANCE RATIO alpha = min([1, p(y,xStar,B)/p(y,x(t-1),B)*c]); % ACCEPT OR REJECT? u = rand; if u < alpha x(t) = xStar; else x(t) = x(t-1); end end % DISPLAY MARKOV CHAIN figure; subplot(211); stairs(x(1:t),1:t, 'k'); hold on; hb = plot([0 maxx/2],[burnIn burnIn],'g--','Linewidth',2) ylabel('t'); xlabel('samples, A'); set(gca , 'YDir', 'reverse'); ylim([0 t]) axis tight; xlim([0 maxx]); title('Markov Chain Path'); legend(hb,'Burnin'); % DISPLAY SAMPLES subplot(212); nBins = 100; sampleBins = linspace(minn,maxx,nBins); counts = hist(x(burnIn:end), sampleBins); bar(sampleBins, counts/sum(counts), 'k'); xlabel('samples, A' ); ylabel( 'p(A | y)' ); title('Samples'); xlim([0 10]) % OVERLAY TARGET DISTRIBUTION hold on; plot(AA, target/sum(target) , 'm-', 'LineWidth', 2); legend('Sampled Distribution',sprintf('Target Posterior')) axis tight

## Wrapping Up

Here we explored how the Metorpolis-Hastings sampling algorithm can be used to generalize the Metropolis algorithm in order to sample from complex (an unnormalized) probability distributions using asymmetric proposal distributions. One shortcoming of the Metropolis-Hastings algorithm is that not all of the proposed samples are accepted, wasting valuable computational resources. This becomes even more of an issue for sampling distributions in higher dimensions. This is where Gibbs sampling comes in. We’ll see in a later post that Gibbs sampling can be used to keep all proposal states in the Markov chain by taking advantage of conditional probabilities.

## MCMC: The Metropolis Sampler

As discussed in an earlier post, we can use a Markov chain to sample from some ** target probability distribution** from which drawing samples directly is difficult. To do so, it is necessary to design a transition operator for the Markov chain which makes the chain’s stationary distribution match the target distribution. The Metropolis sampling algorithm (and the more general Metropolis-Hastings sampling algorithm) uses simple heuristics to implement such a transition operator.

## Metropolis Sampling

Starting from some random initial state , the algorithm first draws a possible sample from a ** proposal distribution** . Much like a conventional transition operator for a Markov chain, the proposal distribution depends only on the previous state in the chain. However, the transition operator for the Metropolis algorithm has an additional step that assesses whether or not the target distribution has a sufficiently large density near the proposed state to warrant accepting the proposed state as a sample and setting it to the next state in the chain. If the density of is low near the proposed state, then it is likely (but not guaranteed) that it will be rejected. The criterion for accepting or rejecting a proposed state are defined by the following heuristics:

- If , the proposed state is kept as a sample and is set as the next state in the chain (i.e. move the chain’s state to a location where has equal or greater density).
- If –indicating that has low density near –then the proposed state may still be accepted, but only randomly, and with a probability

These heuristics can be instantiated by calculating the ** acceptance probability **for the proposed state.

Having the acceptance probability in hand, the transition operator for the metropolis algorithm works like this: if a random uniform number is less than or equal to , then the state is accepted (as in (1) above), if not, it is rejected and another state is proposed (as in (2) above). In order to collect samples using Metropolis sampling we run the following algorithm:

- set t = 0
- generate an initial state from a prior distribution over initial states
- repeat until

set

generate a proposal state from

calculate the acceptance probability

draw a random number from

if , accept the proposal and set

else set

### Example: Using the Metropolis algorithm to sample from an unknown distribution

Say that we have some mysterious function

from which we would like to draw samples. To do so using Metropolis sampling we need to define two things: (1) the prior distribution over the initial state of the Markov chain, and (2) the proposal distribution . For this example we define:

,

both of which are simply a Normal distribution, one centered at zero, the other centered at previous state of the chain. The following chunk of MATLAB code runs the Metropolis sampler with this proposal distribution and prior.

% METROPOLIS SAMPLING EXAMPLE randn('seed',12345); % DEFINE THE TARGET DISTRIBUTION p = inline('(1 + x.^2).^-1','x') % SOME CONSTANTS nSamples = 5000; burnIn = 500; nDisplay = 30; sigma = 1; minn = -20; maxx = 20; xx = 3*minn:.1:3*maxx; target = p(xx); pauseDur = .8; % INITIALZE SAMPLER x = zeros(1 ,nSamples); x(1) = randn; t = 1; % RUN SAMPLER while t < nSamples t = t+1; % SAMPLE FROM PROPOSAL xStar = normrnd(x(t-1) ,sigma); proposal = normpdf(xx,x(t-1),sigma); % CALCULATE THE ACCEPTANCE PROBABILITY alpha = min([1, p(xStar)/p(x(t-1))]); % ACCEPT OR REJECT? u = rand; if u < alpha x(t) = xStar; str = 'Accepted'; else x(t) = x(t-1); str = 'Rejected'; end % DISPLAY SAMPLING DYNAMICS if t < nDisplay + 1 figure(1); subplot(211); cla plot(xx,target,'k'); hold on; plot(xx,proposal,'r'); line([x(t-1),x(t-1)],[0 p(x(t-1))],'color','b','linewidth',2) scatter(xStar,0,'ro','Linewidth',2) line([xStar,xStar],[0 p(xStar)],'color','r','Linewidth',2) plot(x(1:t),zeros(1,t),'ko') legend({'Target','Proposal','p(x^{(t-1)})','x^*','p(x^*)','Kept Samples'}) switch str case 'Rejected' scatter(xStar,p(xStar),'rx','Linewidth',3) case 'Accepted' scatter(xStar,p(xStar),'rs','Linewidth',3) end scatter(x(t-1),p(x(t-1)),'bo','Linewidth',3) title(sprintf('Sample % d %s',t,str)) xlim([minn,maxx]) subplot(212); hist(x(1:t),50); colormap hot; xlim([minn,maxx]) title(['Sample ',str]); drawnow pause(pauseDur); end end % DISPLAY MARKOV CHAIN figure(1); clf subplot(211); stairs(x(1:t),1:t, 'k'); hold on; hb = plot([-10 10],[burnIn burnIn],'b--') ylabel('t'); xlabel('samples, x'); set(gca , 'YDir', 'reverse'); ylim([0 t]) axis tight; xlim([-10 10]); title('Markov Chain Path'); legend(hb,'Burnin'); % DISPLAY SAMPLES subplot(212); nBins = 200; sampleBins = linspace(minn,maxx,nBins); counts = hist(x(burnIn:end), sampleBins); bar(sampleBins, counts/sum(counts), 'k'); xlabel('samples, x' ); ylabel( 'p(x)' ); title('Samples'); % OVERLAY ANALYTIC DENSITY OF STUDENT T nu = 1; y = tpdf(sampleBins,nu) hold on; plot(sampleBins, y/sum(y) , 'r-', 'LineWidth', 2); legend('Samples',sprintf('Theoretic\nStudent''s t')) axis tight xlim([-10 10]);

In the figure above, we visualize the first 50 iterations of the Metropolis sampler.The black curve represents the target distribution . The red curve that is bouncing about the x-axis is the proposal distribution (if the figure is not animated, just click on it). The vertical blue line (about which the bouncing proposal distribution is centered) represents the quantity , and the vertical red line represents the quantity , for a proposal state sampled according to the red curve. At every iteration, if the vertical red line is longer than the blue line, then the sample is accepted, and the proposal distribution becomes centered about the newly accepted sample. If the blue line is longer, the sample is randomly rejected or accepted.

But why randomly keep “bad” proposal samples? It turns out that doing this allows the Markov chain to every-so-often visit states of low probability under the target distribution. This is a desirable property if we want the chain to adequately sample the entire target distribution, including any tails.

An attractive property of the Metropolis algorithm is that the target distribution does not have to be a properly normalized probability distribution. This is due to the fact that the acceptance probability is based on the ratio of two values of the target distribution. I’ll show you what I mean. If is an unnormalized distribution and

is a properly normalized probability distribution with normalizing constant , then

and a ratio like that used in calculating the acceptance probability is

The normalizing constants cancel! This attractive property is quite useful in the context of Bayesian methods, where determining the normalizing constant for a distribution may be impractical to calculate directly. This property is demonstrated in current example. It turns out that the “mystery” distribution that we sampled from using the Metropolis algorithm is an unnormalized form of the Student’s-t distribution with one degree of freedom. Comparing to the definition of the definition Student’s-t

we see that is a Student’s-t distribution with degrees of freedom , but missing the normalizing constant

Below is additional output from the code above showing that the samples from Metropolis sampler draws samples that follow a *normalized *Student’s-t distribution, even though is not normalized.

The upper plot shows the progression of the Markov chain’s progression from state (top) to state (bottom). The burn in period for this chain was chosen to be 500 transitions, and is indicated by the dashed blue line (for more on burnin see this previous post).

The bottom plot shows samples from the Markov chain in black (with burn in samples removed). The theoretical curve for the Student’s-t with one degree of freedom is overlayed in red. We see that the states kept by the Metropolis sampler transition operator sample from values that follow the Student’s-t, even though the function used in the transition operator was not a properly normalized probability distribution.

## Reversibility of the transition operator

It turns out that there is a theoretical constraint on the Markov chain the transition operator in order for it settle into a stationary distribution (i.e. a target distribution we care about). The constraint states that the probability of the transition must be equal to the probability of the reverse transition . This reversibility property is often referred to as * detailed balance*. Using the Metropolis algorithm transition operator, reversibility is assured if the proposal distribution is symmetric. Such symmetric proposal distributions are the Normal, Cauchy, Student’s-t, and Uniform distributions.

However, using a symmetric proposal distribution may not be reasonable to adequately or efficiently sample all possible target distributions. For instance if a target distribution is bounded on the positive numbers , we would like to use a proposal distribution that has the same support, and will thus be assymetric. This is where the ** Metropolis-Hastings** sampling algorithm comes in. We will discuss in a later post how the Metropolis-Hastings sampler uses a simple change to the calculation of the acceptance probability which allows us to use non-symmetric proposal distributions.