# Monthly Archives: November 2012

## fMRI in Neuroscience: Estimating Voxel Selectivity & the General Linear Model (GLM)

In a typical fMRI experiment  a series of stimuli are presented to an observer and evoked brain activity–in the form of blood-oxygen-level-dependent (BOLD) signals–are measured from tiny chunks of the brain called voxels. The task of the researcher is then to infer the tuning of the voxels to features in the presented stimuli based on the evoked BOLD signals. In order to make this inference quantitatively, it is necessary to have a  model of how BOLD signals are evoked in the presence of stimuli. In this post we’ll develop a model of evoked BOLD signals, and from this model recover the tuning of individual voxels measured during an fMRI experiment.

## Modeling the Evoked BOLD Signals — The Stimulus and Design Matrices

Suppose we are running an event-related fMRI experiment where we present $C$ different stimulus conditions to an observer while recording the BOLD signals evoked in their brain over a series of $T$ consecutive fMRI measurements (TRs). We can represent the stimulus presentation quantitatively with a $T \times C$ binary Stimulus Matrix, $D$, whose entries indicate the onset of each stimulus condition (columns) at each point in time (rows). Now let’s assume that we have an accurate model of how a voxel is activated by a single, very short stimulus. This activation model is called hemodynamic response function (HRF), $h$, for the voxel, and, as we’ll discuss in a later post, can be estimated from the measured BOLD signals. Let’s assume for now that the voxel is also activated to an equal degree to all stimuli. In this scenario we can represent the BOLD signal evoked over the entire experiment with another $T \times C$ matrix $X$ called the Design Matrix that is the convolution of the stimulus matrix $D$ with the voxel’s HRF $h$.

$X = D * h$

Note that this model of the BOLD signal is an example of the Finite Impulse Response (FIR) model that was introduced in the previous post on fMRI Basics.

To make the concepts of $D$ and $X$ more concrete, let’s say our experiment consists of $C = 3$ different stimulus conditions: a light, a tone, and heat applied to the palm. Each stimulus condition is presented twice in a staggered manner during 80 TRs of fMRI measurements. The stimulus matrix and the design matrix are simulated here in Matlab:

TR = 1; % REPETITION TIME
t = 1:TR:20; % MEASUREMENTS
h = gampdf(t,6) + -.5*gampdf(t,10); % HRF MODEL
h = h/max(h); % SCALE HRF TO HAVE MAX AMPLITUDE OF 1

trPerStim = 30; % # TR PER STIMULUS
nRepeat = 2; % # OF STIMULUS REPEATES
nTRs = trPerStim*nRepeat + length(h);
impulseTrain0 = zeros(1,nTRs);

% VISUAL STIMULUS
impulseTrainLight = impulseTrain0;
impulseTrainLight(1:trPerStim:trPerStim*nRepeat) = 1;

% AUDITORY STIMULUS
impulseTrainTone = impulseTrain0;
impulseTrainTone(5:trPerStim:trPerStim*nRepeat) = 1;

% SOMATOSENSORY STIMULUS
impulseTrainHeat = impulseTrain0;
impulseTrainHeat(9:trPerStim:trPerStim*nRepeat) = 1;

% COMBINATION OF ALL STIMULI
impulseTrainAll = impulseTrainLight + impulseTrainTone + impulseTrainHeat;

% SIMULATE VOXELS WITH VARIOUS SELECTIVITIES
visualTuning = [4 0 0]; % VISUAL VOXEL TUNING
auditoryTuning = [0 2 0]; % AUDITORY VOXEL TUNING
somatoTuning = [0 0 3]; % SOMATOSENSORY VOXEL TUNING
noTuning = [1 1 1]; % NON-SELECTIVE

beta = [visualTuning', ...
auditoryTuning', ...
somatoTuning', ...
noTuning'];

% EXPERIMENT DESIGN / STIMULUS SEQUENCE
D = [impulseTrainLight',impulseTrainTone',impulseTrainHeat'];

% CREATE DESIGN MATRIX FOR THE THREE STIMULI
X = conv2(D,h'); % X = D * h
X(nTRs+1:end,:) = []; % REMOVE EXCESS FROM CONVOLUTION

% DISPLAY STIMULUS AND DESIGN MATRICES
subplot(121); imagesc(D); colormap gray;
xlabel('Stimulus Condition')
ylabel('Time (TRs)');
title('Stimulus Train, D');
set(gca,'XTick',1:3); set(gca,'XTickLabel',{'Light','Tone','Heat'});

subplot(122);
imagesc(X);
xlabel('Stimulus Condition')
ylabel('Time (TRs)');
title('Design Matrix, X = D * h')
set(gca,'XTick',1:3); set(gca,'XTickLabel',{'Light','Tone','Heat'});



Stimulus presentation matrix, D (left) and the Design Matrix X for an experiment with three stimulus conditions: a light, a tone, and heat applied to the palm

Each column of the design matrix above (the right subpanel in the above figure) is essentially a model of the BOLD signal evoked independently by each stimulus condition, and the total signal is simply a sum of these independent signals.

## Modeling Voxel Tuning — The Selectivity Matrix

In order to develop the concept of the design matrix we assumed that our theoretical voxel is equally tuned to all stimuli. However, few voxels in the brain exhibit such non-selective tuning. For instance, a voxel located in visual cortex will be more selective for the light than for the tone or the heat stimulus. A voxel in auditory cortex will be more selective for the tone than for the other two stimuli. A voxel in the somoatorsensory cortex will likely be more selective for the heat than the visual or auditory stimuli. How can we represent the tuning of these different voxels?

A simple way to model tuning to the stimulus conditions in an experiment is to multiplying each column of the design matrix by a weight that modulates the BOLD signal according to the presence of the corresponding stimulus condition. For example, we could model a visual cortex voxel by weighting the first column of $X$ with a positive value, and the remaining two columns with much smaller values (or even negative values to model suppression). It turns out that we can model the selectivity of $V$ individual voxels simultaneously through a $C \times V$ Selectivity Matrix$\beta$. Each entry in $\beta$ is the amount that the $v$-th voxel (columns) is tuned to the $c$-th stimulus condition (rows). Given the design matrix and the selectivity matrix, we can then predict the BOLD signals $y$ of selectively-tuned voxels with a simple matrix multiplication:

$y = X\beta$

Keeping with our example experiment, let’s assume that we are modeling the selectivity of four different voxels: a strongly-tuned visual voxel, a moderately-tuned somatosensory voxel, a weakly tuned auditory voxel, and an unselective voxel that is very weakly tuned to all three stimulus conditions. We can represent the tuning of these four voxels with a $3 \times 4$ selectivity matrix. Below we define a selectivity matrix that represents the tuning of these 4 theoretical voxels and simulate the evoked BOLD signals to our 3-stimulus experiment.

% SIMULATE NOISELESS VOXELS' BOLD SIGNAL
% (ASSUMING VARIABLES FROM ABOVE STILL IN WORKSPACE)
y0 = X*beta;

figure;
subplot(211);
imagesc(beta); colormap hot;
axis tight
ylabel('Condition')
set(gca,'YTickLabel',{'Visual','Auditory','Somato.'})
xlabel('Voxel');
set(gca,'XTick',1:4)
title('Voxel Selectivity, \beta')

subplot(212);
plot(y0,'Linewidth',2);
legend({'Visual Voxel','Auditory Voxel','Somato. Voxel','Unselective'});
xlabel('Time (TRs)'); ylabel('BOLD Signal');
title('Activity for Voxels with Different Stimulus Tuning')
set(gcf,'Position',[100 100 750 540])
subplot(211); colorbar


Selectivity matrix (top) for four theoretical voxels and GLM BOLD signals (bottom) for a simple experiment

The top subpanel in the simulation output visualizes the selectivity matrix defined for the four theoretical voxels. The bottom subpanel plots the columns of the $T \times V$ matrix of voxel responses $y$. We see that the maximum response of the strongly-tuned visual voxel (plotted in blue) is larger than that of the other voxels, corresponding to the larger weight upper left of the selectivity matrix. Also note that the response for the unselective voxel (plotted in cyan) demonstrates the linearity property of the FIR model. The attenuated but complex BOLD signal from the unselective voxel results from the sum of small independent signals evoked by each stimulus.

## Modeling Voxel Noise

The example above demonstrates how we can model BOLD signals evoked in noisless theoretical voxels. Though this noisless scenario is helpful for developing a modeling framework, real-world voxels exhibit variable amounts of noise (noise is any signal that cannot be accounted by the FIR model). Therefore we need to incorporate a noise term into our BOLD signal model.

The noise in a voxel is often modeled as a random variable $\epsilon$. A common choice for the noise model is a zero-mean Normal/Gaussian distribution with some variance $\sigma^2$:

$\epsilon \sim \mathcal N(0,\sigma^2)$

Though the variance of the noise model may not be known apriori, there are methods for estimating it from data. We’ll get to estimating noise variance in a later post when we discuss various sources of noise and how to account for them using more advance techniques. For simplicity, let’s just assume that the noise variance is 1 as we proceed.

## Putting It All Together — The General Linear Model (GLM)

So far we have introduced on the concepts of the stimulus matrix, the HRF, the design matrix, selectivity matrix, and the noise model. We can combine all of these to compose a comprehensive quantitative model of BOLD signals measured from a set of voxels during an experiment:

$y = X\beta + \epsilon \\ = (D * h)\beta + \epsilon$

This is referred to as the General Linear Model (GLM).

In a typical fMRI experiment the researcher controls the stimulus presentation $D$, and measures the evoked BOLD responses $y$ from a set of voxels. The problem then is to estimate the selectivities of the voxels based on these measurments. Specifically, we want to determine the parameters $\hat \beta$ that best explain the measured BOLD signals during our experiment. The most common way to do this is a method known as Ordinary Least Squares (OLS) Regression. Using OLS the idea is to adjust the values of $\hat \beta$ such that the predicted model BOLD signals are as similar to the measured signals as possible. In other words, the goal is to infer the selectivity each voxel would have to exhibit in order to produce the measured BOLD signals. I showed in an earlier post that the optimal OLS solution for the selectivities $\hat \beta$ is given by:

$\hat \beta = (X^T X)^{-1} X^T y$

Therefore, given a design matrix $X$ and a set of voxel responses $y$ associated with the design matrix, we can calculate the selectivities of voxels to the stimulus conditions represented by the columns of the design matrix. This works even when the BOLD signals are noisy. To get a better idea of this process at work let’s look at a quick example based on our toy fMRI experiment.

## Example: Recovering Voxel Selectivity Using OLS

Here the goal is to recover the selectivities of the four voxels in our toy experiment they have been corrupted with noise. First, we add noise to the voxel responses. In this example the variance of the added noise is based on a concept known as signal-to-noise-ration or SNR.  As the name suggests, SNR is the ratio of the underlying signal to the noise “on top of” the signal. SNR is a very important concept when interpreting fMRI analyses. If a voxel exhibits a low SNR, it will be far more difficult to estimate its tuning. Though there are many ways to define SNR, in this example it is defined as the ratio of the maximum signal amplitude to the variance of the noise model. The underlying noise model variance is adjusted to be one-fifth of the maximum amplitude of the BOLD signal, i.e. an SNR of 5.  Feel free to try different values of SNR by changing the value of the variable $\text{SNR}$ in the Matlab simulation. Noisy versions of the 4 model BOLD signals are plotted in the top subpanel of the figure below. We see that the noisy signals are very different from the actual underlying BOLD signals.

Noisy BOLD signals from 4 voxels (top) and GLM predictions (bottom) of the underlying BOLD signals

Here we estimate the selectivities $\hat \beta$ from the GLM using OLS, and then predict the BOLD signals in our experiment with this estimate. We see in the bottom subpanel of the above figure that the resulting GLM predictions of are quite accurate. We also compare the estimated selectivity matrix $\hat \beta$ to the actual selectivity matrix $\beta$ below. We see that OLS is able to recover the selectivity of all the voxels.

Actual (top) and estimated (bottom) selectivity matrices.

% SIMULATE NOISY VOXELS & ESTIMATE TUNING
% (ASSUMING VARIABLES FROM ABOVE STILL IN WORKSPACE)

SNR = 5; % (APPROX.) SIGNAL-TO-NOISE RATIO
noiseSTD = max(y0(:))./SNR; % NOISE LEVEL FOR EACH VOXEL
noise = bsxfun(@times,randn(size(y0)),noiseSTD);
y = y0 + noise;

betaHat = inv(X'*X)*X'*y % OLS
yHat = X*betaHat; % GLM PREDICTION

figure
subplot(211);
plot(y,'Linewidth',3);
xlabel('Time (s)'); ylabel('BOLD Signal');
legend({'Visual Voxel','Auditory Voxel','Somato. Voxel','Unselective'});
title('Noisy Voxel Responses');

subplot(212)
h1 = plot(y0,'Linewidth',3); hold on
h2 = plot(yHat,'-o');
legend([h1(end),h2(end)],{'Actual Responses','Predicted Responses'})
xlabel('Time (s)'); ylabel('BOLD Signal');
title('Model Predictions')
set(gcf,'Position',[100 100 750 540])

figure
subplot(211);
imagesc(beta); colormap hot(5);
axis tight
ylabel('Condition')
set(gca,'YTickLabel',{'Visual','Auditory','Somato.'})
xlabel('Voxel');
set(gca,'XTick',1:4)
title('Actual Selectivity, \beta')

subplot(212)
imagesc(betaHat); colormap hot(5);
axis tight
ylabel('Condition')
set(gca,'YTickLabel',{'Visual','Auditory','Somato.'})
xlabel('Voxel');
set(gca,'XTick',1:4)
title('Noisy Estimated Selectivity')
drawnow


## Wrapping Up

Here we introduced the GLM commonly used for fMRI data analyses and used the GLM framework to recover the selectivities of simulated voxels. We saw that the GLM is quite powerful of recovering the selectivity in the presence of noise. However, there are a few details left out of the story.

First, we assumed that we had an accurate (albeit exact) model for each voxel’s HRF. This is generally not the case. In real-world scenarios the HRF is either assumed to have some canonical shape, or the shape of the HRF is estimated the experiment data. Though assuming a canonical HRF shape has been validated for block design studies of peripheral sensory areas, this assumption becomes dangerous when using event-related designs, or when studying other areas of the brain.

Additionally, we did not include any  physiological noise signals in our theoretical voxels. In real voxels, the BOLD signal changes due to physiological processes such as breathing and heartbeat can be far larger than the signal change due to underlying neural activation. It then becomes necessary to either account for the nuisance signals in the GLM framework, or remove them before using the model described above. In two upcoming posts we’ll discuss these two issues: estimating the HRF shape from data, and dealing with nuisance signals.

## fMRI In Neuroscience: The Basics

Magnetic Resonance Imaging (MRI) is a procedure used to essentially “look inside” of materials in a non-invasive manner. As the name suggests, the procedure forms images by measuring the differences in resonances of different materials while in the presence of a strong magnetic field (the details of the MRI procedure are pretty fascinating, but I will save them for another post). In the setting of neuroscience, MRI is often used to image the inside of the brain while it is performing basic functions like hearing a tone, or pressing a button. This flavor of MRI is appropriately referred to as Functional MRI, or fMRI.

The main concept behind fMRI is that as the neurons in the brain function they consume fuel (sugar) and oxygen, which is supplied by blood flow. The harder the neurons in one part of the brain work, the more blood and therefore the more oxygen flows to that part of the brain. The levels of oxygen in the blood will also vary proportionately to how quickly oxygen is being consumed by the active neurons; the more activity, the faster oxygen is consumed. It turns out the resonant frequency of a tissue will vary depending on the level of oxygen present in the tissue. fMRI essentially measures the differences in resonances of your brain tissue based on these functionally-dependent levels of blood oxygen. This is commonly referred to as the Blood-Oxygen-Level-Dependent (BOLD) signal.

The BOLD signal is not measured from individual neurons, but rather from small (on the order of 2-3 mm) cubic regions of the brain called voxels (imagine your brain being composed of hundreds of thousands of tiny Leggos). Each voxel contains hundreds of thousands of neurons, so the BOLD signal measured from a voxel is indicative of the group activity of the neurons located within that voxel. As the neurons in a voxel become active due to brain function, the BOLD signal in each of voxel will vary over time. The work of many a neuroscientist is to accurately characterize the BOLD signal change and to relate (i.e. correlate) it to brain function.

## Characterizing the BOLD Signal — the Hemodynamic Response Function (HRF)

Because the BOLD signal is based on blood flow, it is delayed in time from the onset of neural activity due to the period it takes for blood to flow into (and out of) the voxel. The BOLD signal generally peaks 4 to 6 seconds after the onset of neural activity, after which it decreases back toward baseline, even undershooting baseline amplitude around 8 to 12 seconds after onset. This undershoot is believed to be caused by ongoing neuronal metobolism that overconsumes the initial supply of oxygen to the voxel to sub-baseline levels. If there is no addition neural activity the BOLD signal eventually (after approximately 20 seconds) returns to baseline levels. These signal dynamics are referred to as a voxel’s Hemodynamic Response Function, or HRF. A common quantiative model of the HRF is a sum of two Gamma distributions. One of the distributions models the initial peak of the BOLD signal, and another (inverted) distribution models the undershoot. An example of a model HRF is shown below.

Model Hemodynamic Response Function (HRF)

%% MODEL OF THE HRF
t = 0:.1:20;
hrfModel = gampdf(t,6) + -.5*gampdf(t,10);

% DISPLAY
figure
plot(t,hrfModel,'r','Linewidth',2);
hold on;
hb = plot(t,zeros(size(t)),'k--');
title('Model HRF');
xlabel('Time From Activity Onset (s)');
ylabel('BOLD Signal');
legend(hb,'Baseline')


It is important to note that the MRI scanner doesn’t necessarily measure the BOLD signal at such a high temporal resolution as indicated above. Because of limitations imposed by both hardware and software, the required period for an individual fMRI measurement–or TR, short for Repetition Time–is on the order of 1 to 2 seconds. (However recent advances in parallell imaging and multiband excitation technologies are vastly shortening the required TR of fMRI measurements.)

## Relating the HRF to Brain Activity — The Finite Impulse Response (FIR) Model

The HRF provides a model for how the BOLD signal in a voxel will vary due to a short burst of neural activity. Therefore, a useful interpretation of BOLD signal dynamics comes from signal processing. If we treat the HRF as a filter $h$ that operates over some finite length of time equal to the length of the HRF, then the measured BOLD signal $y(t)$ evoked by a series of bursts in neural activity can be modeled as a convolution of an impulse train $D(t)$ with the filter:

$y(t) = D(t) * h$

where $D(t)$ equals one at each  point in time where a burst of neural activity  occurs, and zero otherwise. The $(*)$ is the convolution operator (if you’re not familiar with convolution, it’s not terribly important here; just think of it as the operation of applying  a filter to some data). This is identical to the Finite Impulse Response (FIR) model used in signal processing. An example of the FIR model used to model the BOLD signal evoked by a sequence of three bursts of neural activity is shown below.

Demonstration of the FIR model of the BOLD signal

%% FIR MODEL OF BOLD RESPONSE

% HRF AS MEASURED BY MRI SCANNER
TR = 1;			% REPETITION TIME
t = 1:TR:20;	% MEASUREMENTS
h = gampdf(t,6) + -.5*gampdf(t,10);

% CREATE A STIMULUS IMPULSE TRAIN
% FOR THREE SEPARATE STIMULUS CONDITIONS
trPerStim = 30;			% # TR PER STIMULUS
nRepeat = 2;
nTRs = trPerStim*nRepeat + length(h);
impulseTrain0 = zeros(1,nTRs);

impulseTrainModel = impulseTrain0;
impulseTrainModel([1,24,28]) = 1;
boldModel = conv(impulseTrainModel,h);
boldModel(nTRs+1:end) = [];

% DISPLAY AN EXAMPLE OF FIR RESPONSE
figure
stem(impulseTrainModel*max(h),'k');
hold on;
plot(boldModel,'r','Linewidth',2); xlim([0,nTRs]);
title('HRF Convolved with Impulse Stimulus Train');
xlabel('Time (TRs)'); ylabel('BOLD Signal');
legend({'Activity Onset', 'Voxel Response'})
set(gcf,'Position',[100 100 750 380])


A useful property of the FIR model is that BOLD signals from overlapping HRFs sum linearly. This effect is displayed in the figure above. The BOLD signals evoked by the second and third impulses, which appear close to one another in time, combine linearly to form a larger and more complicated BOLD signal than that evoked by the first impulse alone. As we will see later, this linearity property makes it possible to determine the selectivity of neurons within a voxel even in the presence of rapidly-presented stimuli.

## Block and Event-related fMRI Experiments

Using fMRI, neuroscientists design experiments that present stimuli having particular features that are hypothesized to be encoded by neurons in a particular region of interest in the brain. Given the evoked BOLD signal measured from a voxel during stimulus presentation, along with the HRF for the voxel, and the assumptions of FIR model framework, it is possible to calculate the degree to which the neurons in the voxel must be selective for each the stimulus features in order to produce the measured BOLD signals. We’ll delve more into how this selectivity is calculated in a later post, but for now let’s take look at two flavors of experiment designs.

### The Block Design

One flavor of experimental design is to simply present a stimulus continuosly for a long period, followed by absence of stimuli for a long period. This is what is called a Block Design experiment. From the view of the FIR model, presenting the stimulus for a long period is equivalent to composing $D(t)$ of many consecutive impulses. Because the signal from the HRF of each impulse are assumed to sum linearly, the BOLD signal evoked by the block of impulses will be much larger than the signal evoked by a single brief stimulus presentation. This is demonstrated in the simulation below:

%% SIMULATE A BLOCK-DESIGN EXPERIMENT
%% (ASSUMING VARIABLES FRON ABOVE ARE IN WORKSPACE)

% CREATE BLOCK STIMULUS TRAIN
blocks = repmat([ones(8,1);zeros(8,1)], round(nTRs-10)/16,1);
blockImpulseTrain = impulseTrain0;
blockImpulseTrain(1:numel(blocks)) = blocks;
boldBlock = conv(blockImpulseTrain,h);

% DISPLAY BOLD RESPONSES FROM BLOCK DESIGN
figure
stem(blockImpulseTrain*max(h),'k');
hold on;
plot(boldBlock(1:nTRs),'r','Linewidth',2); xlim([0,nTRs]);
title('Simulated Block Design');
xlabel('Time (TRs)'); ylabel('BOLD Signal');
legend({'Stimulus Train', 'Voxel Response'})
set(gcf,'Position',[100 100 750 380])


Simulation of BOLD signals evoked by Block Design experiment

Here the height of the stimulus onset indicators (in black) are scaled to be the maximum height of the HRF from a single impulse. We see that the peak bold signal from the block design is much larger. The block design is used when the number of features/stimuli that the experimenter would like to probe is small. In this scenario, the increased signal amplitude results in a much more sensitive measurement of selectivity for the probed stimulus features, but at the cost of an inefficient model design. Therefore many separate block design experiments will have to be run to test various hypotheses.

### The Event-related Design

An alternative to the block design is to instead show many types of stimuli for short duations. This is what is known as an Event-related design. For instance, say are interested in how visual, auditory, and somatosensory information are encoded in the brain. We could run three separate block design experiments, one for each stimulus modality, or we could run a single event-related design experiment, where we intermittently present a subject a light, a tone, and ask them to press a button. If there exists a voxel that is involved in encoding each of these stimuli to an equal degree, it would be difficult to discover this relationship by running three separate block design experiments. A simulation of such an experiment and the responses from such a voxel is shown below.

Simulation of an Event-related experiment

%% SIMULATE AN EVENT-RELATED EXPERIMENT
%% (ASSUMING VARIABLES FRON ABOVE ARE IN WORKSPACE)

% VISUAL STIMULUS
impulseTrainLight = impulseTrain0;
impulseTrainLight(1:trPerStim:trPerStim*nRepeat) = 1;

% AUDITORY STIMULUS
impulseTrainTone = impulseTrain0;
impulseTrainTone(5:trPerStim:trPerStim*nRepeat) = 1;

% SOMATOSENSORY STIMULUS
impulseTrainPress = impulseTrain0;
impulseTrainPress(9:trPerStim:trPerStim*nRepeat) = 1;

% COMBINATION OF ALL STIMULI
impulseTrainAll = impulseTrainLight + impulseTrainTone + impulseTrainPress;

% SIMULATE BOLD SIGNAL EVOKED BY EACH CONDITION
boldLight = conv(impulseTrainLight,h);
boldTone = conv(impulseTrainTone,h);
boldPress = conv(impulseTrainPress,h);
boldAll = conv(impulseTrainAll,h);

% DISPLAY STIMULUS ONSETS FOR EACH CONDITION
figure
subplot(211)
hold on
stem(impulseTrainLight,'k');
stem(impulseTrainTone,'b');
stem(impulseTrainPress,'g');
xlim([0,nTRs]);
xlabel('Time (TRs)'); ylabel('BOLD Signal');
legend({'Light Stimulus Train', 'Tone Stimulus Train','Press Stimulus Train'})
title('Impulse Trains for 3 Different Stimuli');

% DISPLAY COMBINATION OF BOLD RESPONSES FROM EACH CONDITION
subplot(212)
hold on;
plot(boldLight(1:nTRs),'k');
plot(boldTone(1:nTRs),'b');
plot(boldPress(1:nTRs),'g');
plot(boldAll(1:nTRs),'r','Linewidth',2);
xlim([0,nTRs]);
xlabel('Time (TRs)'); ylabel('BOLD Signal');
legend({'Response to Light','Response to Tone','Response to Press','Total Response'});
title('Simulation of BOLD Signal from Overlapping Stimuli');
set(gcf,'Position',[100 100 750 540])


In this example notice how the responses evoked by each class of stimulus overlap, resulting in a complex BOLD signal. However, if each class of stimulus were shown separately, as in the case of the block design, the evoked BOLD signal would look very different, as indicated by the individual responses to each stimulus (black, blue and green responses, respectively). Event-related designs are powerful in that we can probe many features simultaneously and therefore uncover correlated selectivities/tuning. However, event-related designs require that we have a very accurate model for the HRF. This requirement is greatly relaxed for block design experiments (in fact many block-designs assume very simplistic HRF shapes such as a triangle or square waveform).

## Wrapping Up

In this post we introduced some of the basic concepts in fMRI used for neuroscience research, including the BOLD signal, the HRF, as well as Block and Event-related experiment designs. However, the underlying story used to present these concepts has been dramatically simplified. Here we ignore the effects of physiological noise, which can be a debilatating factor in many fMRI analyses. Also missing are the details of how to calculate a voxel’s selectivity to a set of stimulus features given measured data, an HRF, and an experimental paradigm. This is where the General Linear Model (GLM) that predominates fMRI research comes in. We also assume that the proposed model HRF is an accurate characterization of a voxel’s BOLD response dynamics. Though this is often a reasonable assumption for peripheral sensory areas of the brain, it is often a very poor assumption for other areas of the brain. It is therefore often necessary to estimate the shape of the HRF from fMRI data. We’ll discuss all of these issues: physiological noise, tuning/selectivity estimation using the GLM, as well as HRF estimation in a following series of posts on fMRI methods in neuroscience.

## A Gentle Introduction to Markov Chain Monte Carlo (MCMC)

Applying probabilistic models to data usually involves integrating a complex, multi-dimensional probability distribution. For example, calculating the expectation/mean of a model distribution involves such an integration. Many (most) times, these integrals are not calculable due to the high dimensionality of the distribution or because there is no closed-form expression for the integral available using calculus. Markov Chain Monte Carlo (MCMC) is a method that allows one to approximate complex integrals using stochastic sampling routines. As MCMC’s name indicates, the method is composed of two components, the Markov chain and Monte Carlo integration.

Monte Carlo integration is a powerful technique that exploits stochastic sampling of the distribution in question in order to approximate the difficult integration. However, in order to use Monte Carlo integration it is necessary to be able to sample from the probability distribution in question, which may be difficult or impossible to do directly. This is where the second component of MCMC, the Markov chain, comes in. A Markov chain is a sequential model that transitions from one state to another in a probabilistic fashion, where the next state that the chain takes is conditioned on the previous state. Markov chains are useful in that if they are constructed properly, and allowed to run for a long time, the states that a chain will take also sample from a target probability distribution. Therefore we can construct Markov chains to sample from the distribution whose integral we would like to approximate, then use Monte Carlo integration to perform the approximation.

Here I introduce a series of posts where I describe the basic concepts underlying MCMC, starting off by describing Monte Carlo Integration, then giving a brief introduction of Markov chains and how they can be constructed to sample from a target probability distribution. Given these foundation principles, we can then discuss MCMC techniques such as the Metropolis and Metropolis-Hastings algorithms, the Gibbs sampler, and the Hybrid Monte Carlo algorithm.

As always, each post has a somewhat formal/mathematical introduction, along with an example and simple Matlab implementations of the associated algorithms.

## MCMC: Hamiltonian Monte Carlo (a.k.a. Hybrid Monte Carlo)

The random-walk behavior of many Markov Chain Monte Carlo (MCMC) algorithms makes Markov chain convergence to a target stationary distribution $p(x)$ inefficient, resulting in slow mixing. Hamiltonian/Hybrid Monte Carlo (HMC), is a MCMC method that adopts physical system dynamics rather than a probability distribution to propose future states in the Markov chain. This allows the Markov chain to explore the target distribution much more efficiently, resulting in faster convergence. Here we introduce basic analytic and numerical concepts for simulation of Hamiltonian dynamics. We then show how Hamiltonian dynamics can be used as the Markov chain proposal function for an MCMC sampling algorithm (HMC).

## First off, a brief physics lesson in Hamiltonian dynamics

Before we can develop Hamiltonian Monte Carlo, we need to become familiar with the concept of Hamiltonian dynamics. Hamiltonian dynamics is one way that physicists describe how objects move throughout a system. Hamiltonian dynamics describe an object’s motion in terms of its location $\bold x$ and momentum $\bold p$ (equivalent to the object’s mass times its velocity) at some time $t$. For each location the object takes, there is an associated potential energy $U(\bold x)$, and for each momentum there is an associated kinetic energy $K(\bold p)$. The total energy of the system is constant and known as the Hamiltonian $H(\bold x, \bold p)$, defined simply as the sum of the potential and kinetic energies:

$H(\bold x,\bold p) = U(\bold x) + K(\bold p)$

Hamiltonian dynamics describe how kinetic energy is converted to potential energy (and vice versa) as an object moves throughout a system in time. This description is implemented quantitatively via a set of differential equations known as the Hamiltonian equations:

$\frac{\partial x_i}{\partial t} = \frac{\partial H}{\partial p_i} = \frac{\partial K(\bold p)}{\partial p_i}$
$\frac{\partial p_i}{\partial t} = -\frac{\partial H}{\partial x_i} = - \frac{\partial U(\bold x)}{\partial x_i}$

Therefore, if we have expressions for $\frac{\partial U(\bold x)}{\partial x_i}$ and $\frac{\partial K(\bold p)}{\partial p_i}$ and a set of initial conditions (i.e. an initial position $\bold x_0$ and initial momentum $\bold p_0$ at time $t_0$), it is possible to predict the location and momentum of an object at any point in time $t = t_0 + T$ by simulating these dynamics for a duration $T$

## Simulating Hamiltonian dynamics — the Leap Frog Method

The Hamiltonian equations describe an object’s motion in time, which is a continuous variable. In order to simulate Hamiltonian dynamics numerically on a computer, it is necessary to approximate the Hamiltonian equations by discretizing  time. This is done by splitting the interval $T$ up into a series of smaller intervals of length $\delta$. The smaller the value of $\delta$ the closer the approximation is to the dynamics in continuous time. There are a number of procedures that have been developed for discretizing time including Euler’s method and the Leap Frog Method, which I will introduce briefly in the context of Hamiltonian dynamics. The Leap Frog method updates the momentum and position variables sequentially, starting by simulating the momentum dynamics over a small interval of time $\delta /2$, then simulating the position dynamics over a slightly longer interval in time $\delta$, then completing the momentum simulation over another small interval of time $\delta /2$ so that $\bold x$ and $\bold p$ now exist at the same point in time. Specifically, the Leap Frog method is as follows: 1. Take a half step in time to update the momentum variable:

$p_i(t + \delta/2) = p_i(t) - (\delta /2)\frac{\partial U}{\partial x_i(t)}$

2. Take a full step in time to update the position variable

$x_i(t + \delta) = x_i(t) + \delta \frac{\partial K}{\partial p_i(t + \delta/2)}$

3. Take the remaining half step in time to finish updating the momentum variable

$p_i(t + \delta) = p_i(t + \delta/2) - (\delta/2) \frac{\partial U}{\partial x_i(t+\delta)}$

The Leap Fog method can be run for $L$ steps to simulate dynamics over $L \times \delta$ units of time. This particular discretization method has a number of properties that make it preferable to other approximation methods like Euler’s method, particularly for use in MCMC, but discussion of these properties are beyond the scope of this post. Let’s see how we can use the Leap Frog method to simulate Hamiltonian dynamics in a simple 1D example.

### Example 1: Simulating Hamiltonian dynamics of an harmonic oscillator

Imagine a ball with mass equal to one is attached to a horizontally-oriented spring. The spring exerts a force on the ball equal to

$F = -kx$

which works to restore the ball’s position to the equilibrium position of the spring  at $x = 0$. Let’s assume that the spring constant $k$, which defines the strength of the restoring force is also equal to one. If the ball is displaced by some distance $x$ from equilibrium, then the potential energy is

$U(x) = \int F dx = \int -x dx = \frac{x^2}{2}$

In addition, the kinetic energy an object with mass $m$ moving with velocity $v$ within a linear system is known to be

$K(v) = \frac{(mv)^2}{2m} = \frac{v^2}{2} = \frac{p^2}{2} = K(p)$,

if the object’s mass is equal to one, like the ball this example. Notice that we now have in hand the expressions for both $U(x)$ and $K(p)$. In order to simulate the Hamiltonian dynamics of the system using the Leap Frog method, we also need expressions for the partial derivatives of each variable (in this 1D example there are only one for each variable):

$\frac{\partial U(x)}{\partial x} =x$ $\frac{\partial K(p)}{\partial p} = p$

Therefore one iteration the Leap Frog algorithm for simulating Hamiltonian dynamics in this system is:

1.  $p(t + \delta/2) = p(t) - (\delta/2)x(t)$
2. $x(t + \delta) = x(t) + (\delta) p(t + \delta/2)$ 3. $p(t + \delta) = p(t + \delta /2) - (\delta/2)x(t + \delta)$

We simulate the dynamics of the spring-mass system described using the Leap Frog method in Matlab below (if the graph is not animated, try clicking on it to open up the linked .gif). The left bar in the bottom left subpanel of the simulation output demonstrates the trade-off between potential and kinetic energy described by Hamiltonian dynamics. The cyan portion of the bar is the proportion of the Hamiltonian contributed by  the potential energy $U(x)$, and the yellow portion  represents is the contribution of the kinetic energy $K(p)$. The right bar (in all yellow), is the total value of the Hamiltonian $H(x,p)$. Here we see that the ball oscillates about the equilibrium position of the spring with a constant period/frequency.  As the ball passes the equilibrium position $x= 0$, it has a minimum potential energy and maximum kinetic energy. At the extremes of the ball’s trajectory, the potential energy is at a maximum, while the kinetic energy is minimized. The  procession of momentum and position map out positions in what is referred to as phase space, which is displayed in the bottom right subpanel of the output. The harmonic oscillator maps out an ellipse in phase space. The size of the ellipse depends on the energy of the system defined by initial conditions.

Simple example of Hamiltonian Dynamics: 1D Harmonic Oscillator (Click to see animated)

You may also notice that the value of the Hamiltonian $H$ is not a exactly constant in the simulation, but oscillates slightly. This is an artifact known as energy drift due to approximations used to the discretize time.

% EXAMPLE 1: SIMULATING HAMILTONIAN DYNAMICS
%            OF HARMONIC OSCILLATOR
% STEP SIZE
delta = 0.1;

% # LEAP FROG
L = 70;

% DEFINE KINETIC ENERGY FUNCTION
K = inline('p^2/2','p');

% DEFINE POTENTIAL ENERGY FUNCTION FOR SPRING (K =1)
U = inline('1/2*x^2','x');

% DEFINE GRADIENT OF POTENTIAL ENERGY
dU = inline('x','x');

% INITIAL CONDITIONS
x0 = -4; % POSTIION
p0 = 1;  % MOMENTUM
figure

%% SIMULATE HAMILTONIAN DYNAMICS WITH LEAPFROG METHOD
% FIRST HALF STEP FOR MOMENTUM
pStep = p0 - delta/2*dU(x0)';

% FIRST FULL STEP FOR POSITION/SAMPLE
xStep = x0 + delta*pStep;

% FULL STEPS
for jL = 1:L-1
% UPDATE MOMENTUM
pStep = pStep - delta*dU(xStep);

% UPDATE POSITION
xStep = xStep + delta*pStep;

% UPDATE DISPLAYS
subplot(211), cla
hold on;
xx = linspace(-6,xStep,1000);
plot(xx,sin(6*linspace(0,2*pi,1000)),'k-');
plot(xStep+.5,0,'bo','Linewidth',20)
xlim([-6 6]);ylim([-1 1])
hold off;
title('Harmonic Oscillator')
subplot(223), cla
b = bar([U(xStep),K(pStep);0,U(xStep)+K(pStep)],'stacked');
set(gca,'xTickLabel',{'U+K','H'})
ylim([0 10]);
title('Energy')
subplot(224);
plot(xStep,pStep,'ko','Linewidth',20);
xlim([-6 6]); ylim([-6 6]); axis square
xlabel('x'); ylabel('p');
title('Phase Space')
pause(.1)
end
% (LAST HALF STEP FOR MOMENTUM)
pStep = pStep - delta/2*dU(xStep);


## Hamiltonian dynamics and the target distribution $p(\bold x)$

Now that we have a better understanding of what Hamiltonian dynamics are and how they can be simulated, let’s now discuss how we can use Hamiltonian dynamics for MCMC. The main idea behind Hamiltonian/Hibrid Monte Carlo is to develop a Hamiltonian function $H(\bold x, \bold p)$ such that the resulting Hamiltonian dynamics allow us to efficiently explore some target distribution $p(\bold x)$. How can we choose such a Hamiltonian function? It turns out it is pretty simple to relate a $H(\bold x, \bold p)$ to $p(\bold x)$ using a basic concept adopted from statistical mechanics known as the canonical distribution. For any energy function $E(\bf\theta)$ over a set of variables $\theta$, we can define the corresponding canonical distribution as: $p(\theta) = \frac{1}{Z}e^{-E(\bf\theta)}$ where we simply take the exponential of the negative of the energy function. The variable $Z$ is a normalizing constant called the partition function that scales the canonical distribution such that is sums to one, creating a valid probability distribution. Don’t worry about $Z$, it isn’t really important because, as you may recall from an earlier post, MCMC methods can sample from unscaled probability distributions. Now, as we saw above, the energy function for Hamiltonian dynamics is a combination of potential and kinetic energies: $E(\theta) = H(\bold x,\bold p) = U(\bold x) + K(\bold p)$

Therefore the canoncial distribution for the Hamiltonian dynamics energy function is

$p(\bold x,\bold p) \propto e^{-H(\bold x,\bold p)} \\ = e^{-[U(\bold x) - K(\bold p)]} \\ = e^{-U(\bold x)}e^{-K(\bold p)} \\ \propto p(\bold x)p(\bold p)$

Here we see that joint (canonical) distribution for $\bold x$ and $\bold p$ factorizes. This means that the two variables are independent, and the canoncial distribution $p(\bold x)$ is independent of the analogous distribution for the momentum. Therefore, as we’ll see shortly, we can use Hamiltonian dynamics to sample from the joint canonical distribution over $\bold p$ and $\bold x$ and simply ignore the momentum contributions. Note that this is an example of introducing auxiliary variables to facilitate the Markov chain path. Introducing the auxiliary variable $\bold p$ allows us to use Hamiltonian dynamics, which are unavailable without them. Because the canonical distribution for $\bold x$ is independent of the canonical distribution for $\bold p$, we can choose any distribution from which to sample the momentum variables. A common choice is to use a zero-mean Normal distribution with unit variance:

$p(\bold p) \propto \frac{\bold{p^Tp}}{2}$

Note that this is equivalent to having a quadratic potential energy term in the Hamiltonian:

$K(\bold p) = \frac{\bold{p^Tp}}{2}$

Recall that this is is the exact quadratic kinetic energy function (albeit in 1D) used in the harmonic oscillator example above. This is a convenient choice for the kinetic energy function as all partial derivatives are easy to compute. Now that we have defined a kinetic energy function, all we have to do is find a potential energy function $U(\bold x)$ that when negated and run through the exponential function, gives the target distribution $p(\bold x)$ (or an unscaled version of it). Another way of thinking of it is that we can define the potential energy function as

$U(\bold x) = -\log p(\bold x)$.

If we can calculate $-\frac{\partial \log(p(\bold x)) }{\partial x_i}$, then we’re in business and we can simulate Hamiltonian dynamics that can be used in an MCMC technique.

## Hamiltonian Monte Carlo

In HMC we use Hamiltonian dynamics as a proposal function for a Markov Chain in order to explore the target (canonical) density $p(\bold x)$ defined by $U(\bold x)$ more efficiently than using a proposal probability distribution. Starting at an initial state $[\bold x_0, \bold p_0]$, we simulate Hamiltonian dynamics for a short time using the Leap Frog method. We then use the state of the position and momentum variables at the end of the simulation as our proposed states variables $\bold x^*$ and $\bold p^*$. The proposed state is accepted using an update rule analogous to the Metropolis acceptance criterion. Specifically if the probability of the proposed state after Hamiltonian dynamics

$p(\bold x^*, \bold p^*) \propto e^{-[U(\bold x^*) + K{\bold p^*}]}$

is greater than probability of the state prior to the Hamiltonian dynamics

$p(\bold x_0,\bold p_0) \propto e^{-[U(\bold x^{(t-1)}), K(\bold p^{(t-1)})]}$

then the proposed state is accepted, otherwise, the proposed state is accepted randomly. If the state is rejected, the next state of the Markov chain is set as the state at $(t-1)$. For a given set of initial conditions, Hamiltonian dynamics will follow contours of constant energy in phase space (analogous to the circle traced out in phase space in the example above). Therefore we must randomly perturb the dynamics so as to explore all of $p(\bold x)$. This is done by simply drawing a random momentum from the corresponding canonical distribution $p(\bold p)$  before running the dynamics prior to each sampling iteration $t$. Combining these steps, sampling random momentum, followed by Hamiltonian dynamics and Metropolis acceptance criterion defines the HMC algorithm for drawing $M$ samples from a target distribution:

1. set $t = 0$
2. generate an initial position state $\bold x^{(0)} \sim \pi^{(0)}$
3. repeat until $t = M$

set $t = t+1$

– sample a new initial momentum variable from the momentum canonical distribution $\bold p_0 \sim p(\bold p)$

– set $\bold x_0 = \bold x^{(t-1)}$

– run Leap Frog algorithm starting at $[\bold x_0, \bold p_0]$ for $L$ steps and stepsize $\delta$ to obtain proposed states $\bold x^*$ and $\bold p^*$

– calculate the Metropolis acceptance probability:

$\alpha = \text{min}(1,\exp(-U(\bold x^*) + U(\bold x_0) - K(\bold p^*) + K(\bold p_0)))$

– draw a random number $u$ from $\text{Unif}(0,1)$

if $u \leq \alpha$ accept the proposed state position $\bold x^*$ and set the next state in the Markov chain $\bold x^{(t)}=\bold x^*$

else set $\bold x^{(t)} = \bold x^{(t-1)}$

In the next example we implement HMC  to sample from a multivariate target distribution that we have sampled from previously using multi-variate Metropolis-Hastings, the bivariate Normal. We also qualitatively compare the sampling dynamics of HMC to multivariate Metropolis-Hastings for the sampling the same distribution.

### Example 2: Hamiltonian Monte for sampling a Bivariate Normal distribution

As a reminder, the target distribution $p(\bold x)$ for this exampleis a Normal form with following parameterization:

$p(\bold x) = \mathcal N (\bold{\mu}, \bold \Sigma)$

with mean $\mu = [\mu_1,\mu_2]= [0, 0]$

and covariance

$\bold \Sigma = \begin{bmatrix} 1 & \rho_{12} \\ \rho_{21} & 1\end{bmatrix} = \begin{bmatrix} 1 & 0.8 \\ 0.8 & 1\end{bmatrix}$

In order to sample from $p(\bold x)$ (assuming that we are using a quadratic energy function), we need to determine the expressions for $U(\bold x)$ and

$\frac{\partial U(\bold x) }{ \partial x_i}$.

Recall that the target potential energy function can be defined from the canonical form as

$U(\bold x) = -\log(p(\bold x))$

If we take the negative log of the Normal distribution outline above, this defines the following potential energy function:

$E(\bold x) = -\log \left(e^{-\frac{\bold{x^T \Sigma^{-1} x}}{2}}\right) - \log Z$

Where $Z$ is the normalizing constant for a Normal distribution (and can be ignored because it will eventually cancel). The potential energy function is then simply:

$U(\bold x) = \frac{\bold{x^T \Sigma^{-1}x}}{2}$

with partial derivatives

$\frac{\partial U(\bold x)}{\partial x_i} = x_i$

Using these expressions for the potential energy and its partial derivatives, we implement HMC for sampling from the bivariate Normal in Matlab:

Hybrid Monte Carlo Samples from bivariate Normal target distribution

In the graph above we display HMC samples of the target distribution, starting from an initial position very far from the mean of the target. We can see that HMC rapidly approaches areas of high density under the target distribution. We compare these samples with samples drawn using the Metropolis-Hastings (MH) algorithm below. The MH algorithm converges much slower than HMC, and consecutive samples have much higher autocorrelation than samples drawn using HMC.

Metropolis-Hasting (MH) samples of the same target distribution. Autocorrelation is evident. HMC is much more efficient than MH.

The Matlab code for the HMC sampler:

% EXAMPLE 2: HYBRID MONTE CARLO SAMPLING -- BIVARIATE NORMAL
rand('seed',12345);
randn('seed',12345);

% STEP SIZE
delta = 0.3;
nSamples = 1000;
L = 20;

% DEFINE POTENTIAL ENERGY FUNCTION
U = inline('transp(x)*inv([1,.8;.8,1])*x','x');

% DEFINE GRADIENT OF POTENTIAL ENERGY
dU = inline('transp(x)*inv([1,.8;.8,1])','x');

% DEFINE KINETIC ENERGY FUNCTION
K = inline('sum((transp(p)*p))/2','p');

% INITIAL STATE
x = zeros(2,nSamples);
x0 = [0;6];
x(:,1) = x0;

t = 1;
while t < nSamples
t = t + 1;

% SAMPLE RANDOM MOMENTUM
p0 = randn(2,1);

%% SIMULATE HAMILTONIAN DYNAMICS
% FIRST 1/2 STEP OF MOMENTUM
pStar = p0 - delta/2*dU(x(:,t-1))';

% FIRST FULL STEP FOR POSITION/SAMPLE
xStar = x(:,t-1) + delta*pStar;

% FULL STEPS
for jL = 1:L-1
% MOMENTUM
pStar = pStar - delta*dU(xStar)';
% POSITION/SAMPLE
xStar = xStar + delta*pStar;
end

% LAST HALP STEP
pStar = pStar - delta/2*dU(xStar)';

% COULD NEGATE MOMENTUM HERE TO LEAVE
% THE PROPOSAL DISTRIBUTION SYMMETRIC.
% HOWEVER WE THROW THIS AWAY FOR NEXT
% SAMPLE, SO IT DOESN'T MATTER

% EVALUATE ENERGIES AT
% START AND END OF TRAJECTORY
U0 = U(x(:,t-1));
UStar = U(xStar);

K0 = K(p0);
KStar = K(pStar);

% ACCEPTANCE/REJECTION CRITERION
alpha = min(1,exp((U0 + K0) - (UStar + KStar)));

u = rand;
if u < alpha
x(:,t) = xStar;
else
x(:,t) = x(:,t-1);
end
end

% DISPLAY
figure
scatter(x(1,:),x(2,:),'k.'); hold on;
plot(x(1,1:50),x(2,1:50),'ro-','Linewidth',2);
xlim([-6 6]); ylim([-6 6]);
legend({'Samples','1st 50 States'},'Location','Northwest')
title('Hamiltonian Monte Carlo')


## Wrapping up

In this post we introduced the Hamiltonian/Hybrid Monte Carlo algorithm for more efficient MCMC sampling. The HMC algorithm is extremely powerful for sampling distributions that can be represented terms of a potential energy function and its partial derivatives. Despite the efficiency and elegance of HMC, it is an underrepresented sampling routine in the literature. This may be due to the popularity of simpler algorithms such as Gibbs sampling or Metropolis-Hastings, or perhaps due to the fact that one must select hyperparameters such as the number of Leap Frog steps and Leap Frog step size when using HMC. However, recent research has provided effective heuristics such as adapting the Leap Frog step size in order to maintain a constant Metropolis rejection rate, which facilitate the use of HMC for general applications.

## MCMC: The Gibbs Sampler

In the previous post, we compared using block-wise and component-wise implementations of the Metropolis-Hastings algorithm for sampling from a multivariate probability distribution$p(\bold x)$. Component-wise updates for MCMC algorithms are generally more efficient for multivariate problems than blockwise updates in that we are more likely to accept a proposed sample by drawing each component/dimension independently of the others. However, samples may still be rejected, leading to excess computation that is never used. The Gibbs sampler, another popular MCMC sampling technique, provides a means of avoiding such wasted computation. Like the component-wise implementation of the Metropolis-Hastings algorithm, the Gibbs sampler also uses component-wise updates. However, unlike in the Metropolis-Hastings algorithm, all proposed samples are accepted, so there is no wasted computation.

The Gibbs sampler is applicable for certain classes of problems, based on two main criterion. Given a target distribution $p(\bold x)$, where $\bold x = (x_1, x_2, \dots, x_D$, ),  The first criterion is 1) that it is necessary that we have an analytic (mathematical) expression for the conditional distribution of each variable in the joint distribution given all other variables in the joint. Formally, if the target distribution $p(\bold x)$ is $D$-dimensional, we must have $D$ individual expressions for

$p(x_i|x_1,x_2,\dots,x_{i-1},x_{i+1},\dots,x_D)$

$= p(x_i | x_j), j\neq i$.

Each of these expressions defines the probability of the $i$-th dimension given that we have values for all other ($j \neq i$) dimensions. Having the conditional distribution for each variable means that we don’t need a proposal distribution or an accept/reject criterion, like in the Metropolis-Hastings algorithm. Therefore, we can simply sample from each conditional while keeping all other variables held fixed. This leads to the second criterion 2) that we must be able to sample from each conditional distribution. This caveat is obvious if we want an implementable algorithm.

The Gibbs sampler works in much the same way as the component-wise Metropolis-Hastings algorithms except that instead drawing from a proposal distribution for each dimension, then accepting or rejecting the proposed sample, we simply draw a value for that dimension according to the variable’s corresponding conditional distribution. We also accept all values that are drawn. Similar to the component-wise Metropolis-Hastings algorithm, we step through each variable sequentially, sampling it while keeping all other variables fixed. The Gibbs sampling procedure is outlined below

1. set $t = 0$
2. generate an initial state $\bold x^{(0)} \sim \pi^{(0)}$
3. repeat until $t = M$

set $t = t+1$

for each dimension $i = 1..D$

draw $x_i$ from $p(x_i|x_1,x_2,\dots,x_{i-1},x_{i+1},\dots,x_D)$

To get a better understanding of the Gibbs sampler at work, let’s implement the Gibbs sampler to solve the same multivariate sampling problem addressed in the previous post.

### Example: Sampling from a bivariate a Normal distribution

This example parallels the examples in the previous post where we sampled from a 2-D Normal distribution using block-wise and component-wise Metropolis-Hastings algorithms. Here, we show how to implement a Gibbs sampler to draw samples from the same target distribution. As a reminder, the target distribution $p(\bold x)$ is a Normal form with following parameterization:

$p(\bold x) = \mathcal N (\bold{\mu}, \bold \Sigma)$

with mean

$\mu = [\mu_1,\mu_2]= [0, 0]$

and covariance

$\bold \Sigma = \begin{bmatrix} 1 & \rho_{12} \\ \rho_{21} & 1\end{bmatrix} = \begin{bmatrix} 1 & 0.8 \\ 0.8 & 1\end{bmatrix}$

In order to sample from this distribution using a Gibbs sampler, we need to have in hand the conditional distributions for variables/dimensions $x_1$ and $x_2$:

$p(x_1 | x_2^{(t-1)})$ (i.e. the conditional for the first dimension, $x_1$)

and

$p(x_2 | x_1^{(t)})$ (the conditional for the second dimension, $x_2$)

Where $x_2^{(t-1)}$ is the previous state of the second dimension, and $x_1^{(t)}$ is the state of the first dimension after drawing from $p(x_1 | x_2^{(t-1)})$. The reason for the discrepancy between updating $x_1$ and $x_2$ using states $(t-1)$ and $(t)$, can be is seen in step 3 of the algorithm outlined in the previous section. At iteration $t$ we first sample a new state for variable $x_1$ conditioned on the most recent state of variable $x_2$, which is from iteration $(t-1)$. We then sample a new state for the variable $x_2$ conditioned on the most recent state of variable $x_1$, which is now from the current iteration, $t$.

After some math (which which I will skip for some brevity, but see the following for some details), we find that the two conditional distributions for the target Normal distribution are:

$p(x_1 | x_2^{(t-1)}) = \mathcal N(\mu_1 + \rho_{21}(x_2^{(t-1)} - \mu_2),\sqrt{1-\rho_{21}^2})$

and

$p(x_2 | x_1^{(t)})=\mathcal N(\mu_2 + \rho_{12}(x_1^{(t)}-\mu_1),\sqrt{1-\rho_{12}^2})$,

which are both univariate Normal distributions, each with a mean that is dependent on the value of the most recent state of the conditioning variable, and a variance that is dependent on the target covariances between the two variables.

Using the above expressions for the conditional probabilities of variables $x_1$ and $x_2$, we implement the Gibbs sampler using MATLAB below. The output of the sampler is shown here:

Gibbs sampler Markov chain and samples for bivariate Normal target distribution

Inspecting the figure above, note how at each iteration the Markov chain for the Gibbs sampler first takes a step only along the $x_1$ direction, then only along the $x_2$ direction.  This shows how the Gibbs sampler sequentially samples the value of each variable separately, in a component-wise fashion.

% EXAMPLE: GIBBS SAMPLER FOR BIVARIATE NORMAL
rand('seed' ,12345);
nSamples = 5000;

mu = [0 0]; % TARGET MEAN
rho(1) = 0.8; % rho_21
rho(2) = 0.8; % rho_12

% INITIALIZE THE GIBBS SAMPLER
propSigma = 1; % PROPOSAL VARIANCE
minn = [-3 -3];
maxx = [3 3];

% INITIALIZE SAMPLES
x = zeros(nSamples,2);
x(1,1) = unifrnd(minn(1), maxx(1));
x(1,2) = unifrnd(minn(2), maxx(2));

dims = 1:2; % INDEX INTO EACH DIMENSION

% RUN GIBBS SAMPLER
t = 1;
while t < nSamples
t = t + 1;
T = [t-1,t];
for iD = 1:2 % LOOP OVER DIMENSIONS
% UPDATE SAMPLES
nIx = dims~=iD; % *NOT* THE CURRENT DIMENSION
% CONDITIONAL MEAN
muCond = mu(iD) + rho(iD)*(x(T(iD),nIx)-mu(nIx));
% CONDITIONAL VARIANCE
varCond = sqrt(1-rho(iD)^2);
% DRAW FROM CONDITIONAL
x(t,iD) = normrnd(muCond,varCond);
end
end

% DISPLAY SAMPLING DYNAMICS
figure;
h1 = scatter(x(:,1),x(:,2),'r.');

% CONDITIONAL STEPS/SAMPLES
hold on;
for t = 1:50
plot([x(t,1),x(t+1,1)],[x(t,2),x(t,2)],'k-');
plot([x(t+1,1),x(t+1,1)],[x(t,2),x(t+1,2)],'k-');
h2 = plot(x(t+1,1),x(t+1,2),'ko');
end

h3 = scatter(x(1,1),x(1,2),'go','Linewidth',3);
legend([h1,h2,h3],{'Samples','1st 50 Samples','x(t=0)'},'Location','Northwest')
hold off;
xlabel('x_1');
ylabel('x_2');
axis square


## Wrapping Up

The Gibbs sampler is a popular MCMC method for sampling from complex, multivariate probability distributions. However, the Gibbs sampler cannot be used for general sampling problems. For many target distributions, it may difficult or impossible to obtain a closed-form expression for all the needed conditional distributions. In other scenarios, analytic expressions may exist for all conditionals but it may be difficult to sample from any or all of the conditional distributions (in these scenarios it is common to use univariate sampling methods such as rejection sampling and (surprise!) Metropolis-type MCMC techniques to approximate samples from each conditional). Gibbs samplers are very popular for Bayesian methods where models are often devised in such a way that conditional expressions for all model variables are easily obtained and take well-known forms that can be sampled from efficiently.

Gibbs sampling, like many MCMC techniques suffer from what is often called “slow mixing.” Slow mixing occurs when the underlying Markov chain takes a long time to sufficiently explore the values of $\bold x$ in order to give a good characterization of $p(\bold x)$. Slow mixing is due to a number of factors including the “random walk” nature of the Markov chain, as well as the tendency of the Markov chain to get “stuck,” only sampling a single region of $\bold x$ having high-probability under $p(\bold x)$. Such behaviors are bad for sampling distributions with multiple modes or heavy tails. More advanced techniques, such as Hybrid Monte Carlo have been developed to incorporate additional dynamics that increase the efficiency of the Markov chain path. We will discuss Hybrid Monte Carlo in a future post.

## MCMC: Multivariate Distributions, Block-wise, & Component-wise Updates

In the previous posts on MCMC methods, we focused on how to sample from univariate target distributions. This was done mainly to give the reader some intuition about MCMC implementations with fairly tangible examples that can be visualized. However, MCMC can easily be extended to sample multivariate distributions.

In this post we will discuss two flavors of MCMC update procedure for sampling distributions in multiple dimensions: block-wise, and component-wise update procedures. We will show how these two different procedures can give rise to different implementations of the Metropolis-Hastings sampler to solve the same problem.

## Block-wise Sampling

The first approach for performing multidimensional sampling is to use block-wise updates. In this approach the proposal distribution $q(\bold{x})$ has the same dimensionality as the target distribution $p(\bold x)$. Specifically, if $p(\bold x)$ is a distribution over $D$ variables, ie. $\bold{x} = (x_1, x_2, \dots, x_D)$, then we must design a proposal distribution that is also a distribution involving $D$ variables. We then accept or reject a proposed state $\bold x^*$ sampled from the proposal distribution $q(\bold x)$ in exactly the same way as for the univariate Metropolis-Hastings algorithm. To generate $M$ multivariate samples we perform the following block-wise sampling procedure:

1. set $t = 0$
2. generate an initial state $\bold x^{(0)} \sim \pi^{(0)}$
3. repeat until $t = M$

set $t = t+1$

generate a proposal state $\bold x^*$ from $q(\bold x | \bold x^{(t-1)})$

calculate the proposal correction factor $c = \frac{q(\bold x^{(t-1)} | \bold x^*) }{q(\bold x^*|\bold x^{(t-1)})}$

calculate the acceptance probability $\alpha = \text{min} \left (1,\frac{p(\bold x^*)}{p(\bold x^{(t-1)})} \times c\right )$

draw a random number $u$ from $\text{Unif}(0,1)$

if $u \leq \alpha$ accept the proposal state $\bold x^*$ and set $\bold x^{(t)}=\bold x^*$

else set $\bold x^{(t)} = \bold x^{(t-1)}$

Let’s take a look at the block-wise sampling routine in action.

### Example 1: Block-wise Metropolis-Hastings for sampling of bivariate Normal distribution

In this example we use block-wise Metropolis-Hastings algorithm to sample from a bivariate (i.e. $D = 2$) Normal distribution:

$p(\bold x) = \mathcal N (\bold{\mu}, \bold \Sigma)$

with mean

$\mu = [0, 0]$

and covariance

$\bold \Sigma = \begin{bmatrix} 1 & 0.8 \\ 0.8 & 1\end{bmatrix}$

Usually the target distribution $p(\bold x)$ will have a complex mathematical form, but for this example we’ll circumvent that by using MATLAB’s built-in function $\text{mvnpdf}$ to evaluate $p(\bold x)$. For our proposal distribution, $q(\bold x)$, let’s use a circular Normal centered at the the previous state/sample of the Markov chain/sampler, i.e:

$q(\bold x | \bold x^{(t-1)}) \sim \mathcal N (\bold x^{(t-1)}, \bold I)$,

where $\bold I$ is a 2-D identity matrix, giving the proposal distribution unit variance along both dimensions $x_1$ and $x_2$, and zero covariance. You can find an MATLAB implementation of the block-wise sampler at the end of the section. The display of the samples and the target distribution output by the sampler implementation are shown below:

Samples drawn from block-wise Metropolis-Hastings sampler

We can see from the output that the block-wise sampler does a good job of drawing samples from the target distribution.

Note that our proposal distribution in this example is symmetric, therefore it was not necessary to calculate the correction factor $c$ per se. This means that this Metropolis-Hastings implementation is identical to the simpler Metropolis sampler.

%------------------------------------------------------
% EXAMPLE 1: METROPOLIS-HASTINGS
% BLOCK-WISE SAMPLER (BIVARIATE NORMAL)
rand('seed' ,12345);

D = 2; % # VARIABLES
nBurnIn = 100;

% TARGET DISTRIBUTION IS A 2D NORMAL WITH STRONG COVARIANCE
p = inline('mvnpdf(x,[0 0],[1 0.8;0.8 1])','x');

% PROPOSAL DISTRIBUTION STANDARD 2D GUASSIAN
q = inline('mvnpdf(x,mu)','x','mu')

nSamples = 5000;
minn = [-3 -3];
maxx = [3 3];

% INITIALIZE BLOCK-WISE SAMPLER
t = 1;
x = zeros(nSamples,2);
x(1,:) = randn(1,D);

% RUN SAMPLER
while t < nSamples
t = t + 1;

% SAMPLE FROM PROPOSAL
xStar = mvnrnd(x(t-1,:),eye(D));

% CORRECTION FACTOR (SHOULD EQUAL 1)
c = q(x(t-1,:),xStar)/q(xStar,x(t-1,:));

% CALCULATE THE M-H ACCEPTANCE PROBABILITY
alpha = min([1, p(xStar)/p(x(t-1,:))]);

% ACCEPT OR REJECT?
u = rand;
if u < alpha
x(t,:) = xStar;
else
x(t,:) = x(t-1,:);
end
end

% DISPLAY
nBins = 20;
bins1 = linspace(minn(1), maxx(1), nBins);
bins2 = linspace(minn(2), maxx(2), nBins);

% DISPLAY SAMPLED DISTRIBUTION
ax = subplot(121);
bins1 = linspace(minn(1), maxx(1), nBins);
bins2 = linspace(minn(2), maxx(2), nBins);
sampX = hist3(x, 'Edges', {bins1, bins2});
hist3(x, 'Edges', {bins1, bins2});
view(-15,40)

% COLOR HISTOGRAM BARS ACCORDING TO HEIGHT
colormap hot
set(gcf,'renderer','opengl');
set(get(gca,'child'),'FaceColor','interp','CDataMode','auto');
xlabel('x_1'); ylabel('x_2'); zlabel('Frequency');
axis square
set(ax,'xTick',[minn(1),0,maxx(1)]);
set(ax,'yTick',[minn(2),0,maxx(2)]);
title('Sampled Distribution');

% DISPLAY ANALYTIC DENSITY
ax = subplot(122);
[x1 ,x2] = meshgrid(bins1,bins2);
probX = p([x1(:), x2(:)]);
probX = reshape(probX ,nBins, nBins);
surf(probX); axis xy
view(-15,40)
xlabel('x_1'); ylabel('x_2'); zlabel('p({\bfx})');
colormap hot
axis square
set(ax,'xTick',[1,round(nBins/2),nBins]);
set(ax,'xTickLabel',[minn(1),0,maxx(1)]);
set(ax,'yTick',[1,round(nBins/2),nBins]);
set(ax,'yTickLabel',[minn(2),0,maxx(2)]);
title('Analytic Distribution')


## Component-wise Sampling

A problem with block-wise updates, particularly when the number of dimensions $D$ becomes large, is that finding a suitable proposal distribution is difficult. This leads to a large proportion of the samples being rejected. One way to remedy this is to simply loop over the the $D$ dimensions of $\bold x$ in sequence, sampling each dimension independently from the others. This is what is known as using component-wise updates. Note that now the proposal distribution $q(x)$ is univariate, working only in one dimension, namely the current dimension that we are trying to sample. The component-wise Metropolis-Hastings algorithm is outlined below.

1. set $t = 0$
2. generate an initial state $\bold x^{(0)} \sim \pi^{(0)}$
3. repeat until $t = M$

set $t = t+1$

for each dimension $i = 1..D$

generate a proposal state $x_i^*$ from $q(x_i | x_i^{(t-1)})$

calculate the proposal correction factor $c = \frac{q(x_i^{(t-1)}) | x_i^*)}{q(x_i^* | x_i^{(t-1)})}$

calculate the acceptance probability $\alpha = \text{min} \left (1,\frac{p( x_i^*, \bold x_j^{(t-1)})}{p( x_i^{(t-1)}, \bold x_j^{(t-1)})} \times c\right )$

draw a random number $u$ from $\text{Unif}(0,1)$

if $u \leq \alpha$ accept the proposal state $x_i^*$ and set $x_i^{(t)}=x_i^*$

else set $x_i^{(t)} = x_i^{(t-1)}$

Note that in the component-wise implementation a sample for the $i$-th dimension is proposed, then  accepted or rejected while all other dimensions ($j \neq i$) are held fixed. We then move on to the next ($(i + 1)$-th) dimension and repeat the process while holding all other variables ($j \neq (i + 1)$) fixed. In each successive step we are using updated values for the dimensions that have occurred since increasing $(t -1) \rightarrow t$.

### Example 2: Component-wise Metropolis-Hastings for sampling of bivariate Normal distribution

In this example we draw samples from the same bivariate Normal target distribution described in Example 1, but using component-wise updates. Therefore $p(x)$ is the same, however, the proposal distribution $q(x)$ is now a univariate Normal distribution with unit unit variance in the direction of the $i$-th dimension to be sampled. The MATLAB implementation of the component-wise sampler is at the end of the section. The samples and comparison to the analytic target distribution are shown below.

Samples drawn from component-wise Metropolis-Hastings algorithm compared to target distribution

Again, we see that we get a good characterization of the bivariate target distribution.

%--------------------------------------------------
% EXAMPLE 2: METROPOLIS-HASTINGS
% COMPONENT-WISE SAMPLING OF BIVARIATE NORMAL
rand('seed' ,12345);

% TARGET DISTRIBUTION
p = inline('mvnpdf(x,[0 0],[1 0.8;0.8 1])','x');

nSamples = 5000;
propSigma = 1;		% PROPOSAL VARIANCE
minn = [-3 -3];
maxx = [3 3];

% INITIALIZE COMPONENT-WISE SAMPLER
x = zeros(nSamples,2);
xCurrent(1) = randn;
xCurrent(2) = randn;
dims = 1:2; % INDICES INTO EACH DIMENSION
t = 1;
x(t,1) = xCurrent(1);
x(t,2) = xCurrent(2);

% RUN SAMPLER
while t < nSamples
t = t + 1;
for iD = 1:2 % LOOP OVER DIMENSIONS

% SAMPLE PROPOSAL
xStar = normrnd(xCurrent(:,iD), propSigma);

% NOTE: CORRECTION FACTOR c=1 BECAUSE
% N(mu,1) IS SYMMETRIC, NO NEED TO CALCULATE

% CALCULATE THE ACCEPTANCE PROBABILITY
pratio = p([xStar xCurrent(dims~=iD)])/ ...
p([xCurrent(1) xCurrent(2)]);
alpha = min([1, pratio]);

% ACCEPT OR REJECT?
u = rand;
if u < alpha
xCurrent(iD) = xStar;
end
end

% UPDATE SAMPLES
x(t,:) = xCurrent;
end

% DISPLAY
nBins = 20;
bins1 = linspace(minn(1), maxx(1), nBins);
bins2 = linspace(minn(2), maxx(2), nBins);

% DISPLAY SAMPLED DISTRIBUTION
figure;
ax = subplot(121);
bins1 = linspace(minn(1), maxx(1), nBins);
bins2 = linspace(minn(2), maxx(2), nBins);
sampX = hist3(x, 'Edges', {bins1, bins2});
hist3(x, 'Edges', {bins1, bins2});
view(-15,40)

% COLOR HISTOGRAM BARS ACCORDING TO HEIGHT
colormap hot
set(gcf,'renderer','opengl');
set(get(gca,'child'),'FaceColor','interp','CDataMode','auto');
xlabel('x_1'); ylabel('x_2'); zlabel('Frequency');
axis square
set(ax,'xTick',[minn(1),0,maxx(1)]);
set(ax,'yTick',[minn(2),0,maxx(2)]);
title('Sampled Distribution');

% DISPLAY ANALYTIC DENSITY
ax = subplot(122);
[x1 ,x2] = meshgrid(bins1,bins2);
probX = p([x1(:), x2(:)]);
probX = reshape(probX ,nBins, nBins);
surf(probX); axis xy
view(-15,40)
xlabel('x_1'); ylabel('x_2'); zlabel('p({\bfx})');
colormap hot
axis square
set(ax,'xTick',[1,round(nBins/2),nBins]);
set(ax,'xTickLabel',[minn(1),0,maxx(1)]);
set(ax,'yTick',[1,round(nBins/2),nBins]);
set(ax,'yTickLabel',[minn(2),0,maxx(2)]);
title('Analytic Distribution')



## Wrapping Up

Here we saw how we can use block- and component-wise updates to derive two different implementations of the Metropolis-Hastings algorithm. In the next post we will use component-wise updates introduced above to motivate the Gibbs sampler, which is often used to increase the efficiency of sampling well-defined probability multivariate distributions.