# Category Archives: Simulations

## Model Selection: Underfitting, Overfitting, and the Bias-Variance Tradeoff

In machine learning and pattern recognition, there are many ways (an infinite number, really) of solving any one problem. Thus it is important to have an objective criterion for assessing the accuracy of candidate approaches and for selecting the right model for the data set at hand. In this post we’ll discuss the concepts of under- and overfitting and how these phenomena are related to the statistical quantities bias and variance. Finally, we will discuss how these concepts can be applied to select a model that will accurately generalize to novel scenarios/data sets.

## Models for Regression

When performing regression analyses we would like to characterize how the value of some dependent variable changes as some independent variable $x$ is varied. For example, say we would like to characterize the firing rate of a neuron in visual cortex as we vary the orientation of a grating pattern presented to the eye. We assume that there is some true relationship function $f(x)$ that maps the independent variable values (i.e. the angle of the grating pattern) onto the dependent variable values (i.e. firing rate). We would like to determine the form of the function $f(x)$ from observations of independent-dependent value pairs (I may also refer to these as input-output pairs, as we can think of the function $f(x)$ taking $x$ as input and producing an output). However, in the real world, we don’t get to observe $f(x)$ directly, but instead get noisy observations $y$, where

(1) $y = f(x) + \epsilon$

Here we will assume that $\epsilon$ is a random variable distributed according to a zero-mean Gaussian with variance $\sigma^2$. Note that because $\epsilon$ is a random variable, $y$ is also a random variable (with a mean that is conditioned on both $x$ and $f(x)$, and a variance $\sigma^2$).
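To make this observation model concrete, here is a small Python/NumPy sketch (a stand-in for the Matlab used below) that draws noisy observations $y$ from a known target function; the residuals $y - f(x)$ are zero-mean with standard deviation $\sigma$:

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(np.pi * x)   # the (normally unknown) target function
sigma = 0.1                       # standard deviation of the observation noise

# Noisy observations y = f(x) + eps, with eps ~ N(0, sigma^2)
x = rng.uniform(-1.0, 1.0, size=1000)
y = f(x) + rng.normal(0.0, sigma, size=x.shape)

# Because eps is zero-mean, the conditional mean of y is f(x),
# and the residuals have standard deviation sigma
residuals = y - f(x)
```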

As an example, say that the true function $f(x)$ we want to determine has the following form (though we don’t know it):

$f(x) = \sin(\pi x)$

Thus the observations $y$ we get to see have the following distribution.

$y = \sin(\pi x) + \mathcal N(0,\sigma^2)$

Below we define the function $f(x)$ and display it, then draw a few observation samples $y$, and display them as well:

% CREATE A RANDOM DATASET
rng(10)
nData = 10; % N
x = 2*(rand(1,nData)-.5);
xGrid = linspace(-1,1,100);

% DEFINE THE TARGET FUNCTION f(x)
f = @(x) sin(pi*x);

h = [];
h(1) = plot(xGrid,f(xGrid),'k','Linewidth',2);
xlabel('x')
hold on

% DEFINE AND DISPLAY y
noiseSTD = .1;
y = f(x) + noiseSTD*randn(size(x));
h(2) = scatter(x,y,'ko');
legend(h,{'f(x)','y'},'Location','Northwest');


Again, the goal here is to characterize the function $f(x)$. However, since we don’t know the functional form of $f(x)$, we must instead estimate some other function $g(x)$ that we think will be a good approximation to $f(x)$. Thus we call $g(x)$ an estimator of $f(x)$. In general, an estimator is some parameterized model that can capture a wide range of functional forms. One such class of estimators is the weighted combination of ordered polynomials:

$g_N(x) = \theta_0 + \theta_1x + \theta_2x^2 + \dots \theta_N x^N$

As the polynomial order $N$ increases, the functions $g_N(x)$ are able to capture increasingly complex behavior. For example, $g_0(x)$ describes a horizontal line with an adjustable vertical offset, $g_1(x)$ describes a line with adjustable vertical offset and adjustable slope, and $g_2(x)$ describes a function that also includes a quadratic term. We thus try to fit the values of the parameters for a given estimator $g_N(x)$ to best account for observed data in the hopes that we will also accurately approximate $f(x)$.
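To make the estimator family concrete, here is a short Python sketch (the coefficient values are made up purely for illustration) that evaluates $g_N(x)$ as a weighted sum of powers of $x$:

```python
import numpy as np

def g(theta, x):
    """Evaluate g_N(x) = theta_0 + theta_1*x + ... + theta_N*x^N."""
    x = np.asarray(x, dtype=float)
    return sum(t * x**n for n, t in enumerate(theta))

# Illustrative (made-up) parameter values:
theta_g0 = [0.5]              # g_0: horizontal line with vertical offset 0.5
theta_g1 = [0.5, 2.0]         # g_1: offset 0.5, slope 2
theta_g2 = [0.5, 2.0, -1.0]   # g_2: adds a quadratic term
```

For example, `g(theta_g1, 1.0)` evaluates to $0.5 + 2 \cdot 1 = 2.5$.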

Below we fit three polynomial models of increasing complexity (using Matlab’s $\texttt{polyfit.m}$) to the sampled data displayed above. Specifically, we estimate the functions $g_1(x)$, $g_3(x)$ and $g_{10}(x)$.

% FIT POLYNOMIAL MODELS & DISPLAY
% (ASSUMING PREVIOUS PLOT ABOVE STILL AVAILABLE)
degree = [1,3,10];
theta = {};
cols = [.8 .05 0.05; 0.05 .6 0.05; 0.05 0.05 .6];
for iD = 1:numel(degree)
figure(1)
theta{iD} = polyfit(x,y,degree(iD));
fit{iD} = polyval(theta{iD},xGrid);
h(end+1) = plot(xGrid,fit{iD},'color',cols(iD,:),'Linewidth',2);
xlim([-1 1])
ylim([-1 1])
end
legend(h,'f(x)','y','g_1(x)','g_3(x)','g_{10}(x)','Location','Northwest')


Qualitatively, we see that the estimator $g_1(x)$ (red line) provides a poor fit to the observed data, as well as a poor approximation to the function $f(x)$ (black curve). We see that the estimator $g_{10}(x)$ (blue curve) provides a very accurate fit to the data points, but varies wildly to do so, and therefore provides an inaccurate approximation of $f(x)$. Finally, we see that the estimator $g_3(x)$ (green curve) provides a fairly good fit to the observed data and does a much better job of approximating $f(x)$.

Our original goal was to approximate $f(x)$, not the data points per se. Therefore $g_3(x)$, at least qualitatively, provides a more desirable estimate of $f(x)$ than the other two estimators. The fits for $g_1(x)$ and $g_{10}(x)$ are examples of “underfitting” and “overfitting” to the observed data. Underfitting occurs when an estimator $g(x)$ is not flexible enough to capture the underlying trends in the observed data. Overfitting occurs when an estimator is too flexible, allowing it to capture illusory trends in the data. These illusory trends are often the result of the noise in the observations $y$.

## Variance and Bias of an Estimator

The model fits for $g_N(x)$ discussed above were based on a single, randomly-sampled data set of observations $y$. However, there are, in principle, an infinite number of data sets that could be observed (drawn from $y$). In order to determine a good model of $f(x)$, it would be helpful to have an idea of how an estimator will perform on any or all of these potential datasets. To get an idea of how each of the estimators discussed above performs in general, we can repeat the model fitting procedure for many data sets.

Here we perform such an analysis, sampling 50 independent data sets according to Equation (1) (with $\sigma=0.1$), then fitting the parameters of the polynomial functions of model order $N = [1,3,10]$ to each dataset.

% FIT MODELS TO K INDEPENDENT DATASETS
K = 50;
for iS = 1:K
ySim = f(x) + noiseSTD*randn(size(x));
for jD = 1:numel(degree)
% FIT THE MODEL USING polyfit.m
thetaTmp = polyfit(x,ySim,degree(jD));
% EVALUATE THE MODEL FIT USING polyval.m
simFit{jD}(iS,:) = polyval(thetaTmp,xGrid);
end
end

% DISPLAY ALL THE MODEL FITS
h = [];
for iD = 1:numel(degree)
figure(iD+1)
hold on
% PLOT THE FUNCTION FIT TO EACH DATASET
for iS = 1:K
h(1) = plot(xGrid,simFit{iD}(iS,:),'color',brighten(cols(iD,:),.6));
end
% PLOT THE AVERAGE FUNCTION ACROSS ALL FITS
h(2) = plot(xGrid,mean(simFit{iD}),'color',cols(iD,:),'Linewidth',5);
% PLOT THE UNDERLYING FUNCTION f(x)
h(3) = plot(xGrid,f(xGrid),'color','k','Linewidth',3);
% CALCULATE THE SQUARED ERROR AT EACH POINT, AVERAGED ACROSS ALL DATASETS
squaredError = (mean(simFit{iD})-f(xGrid)).^2;
% PLOT THE SQUARED ERROR
h(4) = plot(xGrid,squaredError,'k--','Linewidth',3);
uistack(h(2),'top')
hold off
axis square
xlim([-1 1])
ylim([-1 1])
legend(h,{sprintf('Individual g_{%d}(x)',degree(iD)),'Mean of All Fits','f(x)','Squared Error'},'Location','WestOutside')
title(sprintf('Model Order=%d',degree(iD)))
end


Here are the results for the estimator $g_1(x)$

…and for the estimator $g_3(x)$, …

… and for $g_{10}(x).$

The lightly-colored curves in each of the three plots above are an individual polynomial model fit to one of the 50 sampled data sets. The darkly-colored curve in each plot is the average over the 50 individual fits. The dark black curve is the true, underlying function $f(x)$.

We see that for the estimator $g_1(x)$ (red curves), model fits do not vary much from data set to data set. Thus the expected (mean) estimator fit over all the data sets, formally written as $\mathbb E[g(x)]$, is very similar to all the individual fits.

A common formal definition used in statistics to describe how much a single estimator deviates from the average estimator over datasets is the variance of the estimator. Formally defined as

$variance=\mathbb E[(g(x)-\mathbb E[g(x)])^2]$

the variance is the average squared difference between any single data-set-dependent estimate of $g(x)$ and the average value of $g(x)$ estimated over all datasets.

According to the definition of variance, we can say that the estimator $g_1(x)$ exhibits low variance.

A commonly-used metric in statistics for assessing how well an estimator $g(x)$ approximates a target function $f(x)$, based on its behavior over many observed data sets, is the bias of the estimator, formally defined as:

$bias = \mathbb E[g(x)] - f(x)$

The bias describes how much the average estimator fit over datasets $\mathbb E[g(x)]$ deviates from the value of the underlying target function $f(x)$.
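Both quantities can be approximated empirically by refitting an estimator to many simulated datasets. Here is a Python/NumPy sketch of that procedure (separate from the Matlab simulations; it assumes the same $\sin(\pi x)$ target and $\sigma = 0.1$ noise):

```python
import numpy as np

rng = np.random.default_rng(1)
f = lambda x: np.sin(np.pi * x)
x = np.linspace(-1, 1, 10)       # design points for every simulated dataset
x_eval = np.linspace(-1, 1, 100) # grid on which estimators are compared
K, sigma = 500, 0.1              # number of datasets; observation noise std

def bias2_and_variance(order):
    # Fit a polynomial of the given order to K independently sampled datasets
    fits = np.array([
        np.polyval(np.polyfit(x, f(x) + sigma * rng.normal(size=x.shape), order), x_eval)
        for _ in range(K)
    ])
    mean_fit = fits.mean(axis=0)                  # E[g(x)], estimated over datasets
    bias2 = np.mean((mean_fit - f(x_eval)) ** 2)  # squared bias, averaged over x
    variance = np.mean(fits.var(axis=0))          # variance, averaged over x
    return bias2, variance

b1, v1 = bias2_and_variance(1)   # underfit: large bias, small variance
b3, v3 = bias2_and_variance(3)   # good fit: small bias, small variance
```

With these settings, the order-1 estimator shows a much larger squared bias than the order-3 estimator, while the order-3 estimator shows somewhat larger variance.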

We can see from the plot for $g_1(x)$ that $\mathbb E[g_1(x)]$ deviates significantly from $f(x)$. Thus we can say that the estimator $g_1(x)$ exhibits large bias when approximating the function $f(x)$.

Investigating the results for the estimator $g_{10}(x)$ (blue curves), we see that each individual model fit varies dramatically from one data set to another. Thus we can say that this estimator exhibits large variance. Looking at $\mathbb E[g_{10}(x)]$ (the dark blue curve), we see that on average the estimator $g_{10}(x)$ provides a better approximation to $f(x)$ than the estimator $g_1(x)$. Thus we can say that $g_{10}(x)$ exhibits a lower bias than the estimator $g_1(x)$.

Investigating the fits for the estimator $g_3(x)$ (green curves), we find that this estimator has low variance: the individual fits vary little from dataset to dataset. Furthermore, the average estimator $\mathbb E[g_3(x)]$ (dark green curve) accurately approximates the true function $f(x)$, telling us that the estimator $g_3(x)$ also has low bias.

We established earlier that the estimator $g_3(x)$ provided a qualitatively better fit to the function $f(x)$ than the other two polynomial estimators for a single dataset. It appears that this is also the case over many datasets. We also find that estimator $g_3(x)$ exhibits low bias and low variance, whereas the other two, less-desirable estimators, have either high bias or high variance. Thus it would appear that having both low bias and low variance is a reasonable criterion for selecting an accurate model of $f(x)$.

Included in each of the three plots above is a dashed black line representing the squared difference between the average estimator $\mathbb E[g_N(x)]$ and the true function $f(x)$. Calculating squared model errors is a common practice for quantifying the goodness of a model fit. If we calculate the expected value of each of the dashed black curves (and assuming that all values of $x$ are equally likely to occur), we would obtain a single value for each estimator that is the mean squared error (MSE) between the expected estimator and the true function.

$\mathbb E[(\mathbb E[g(x)]-f(x))^2] = \frac{1}{N}\sum_{i=1}^N (\mathbb E[g(x_i)]-f(x_i))^2$

For the estimator $g_3(x)$, the MSE will be very small, as the dashed black curve for this estimator is near zero for all values of $x$. The estimators $g_1(x)$ and $g_{10}(x)$ would have significantly larger MSE values. Now, because a low MSE, like low bias and variance, is indicative of a good estimator, it would be reasonable to suspect that squared model error is directly related to bias and variance. The next section provides some formal evidence for this notion.

## Expected Prediction Error and the Bias-variance Tradeoff

For a given estimator $g(x)$ fit to a data set of $xy$ pairs, we would like to know, given all the possible datasets out there, what the expected prediction error will be for a new data point $x^*, y^* = f(x^*) + \epsilon$. If we define prediction error to be the squared difference between the model prediction $g(x^*)$ and the observation $y^*$, the expected prediction error is then:

$\mathbb E[(g(x^*) - y^*)^2]$

If we expand this a little and use a few identities, something interesting happens:

(2) $\mathbb E[(g(x^*) - y^*)^2] = \mathbb E[g(x^*)^2-2g(x^*)y^*+y^{*2}]=$

(3) $\mathbb E[g(x^*)^2]-2\mathbb E[g(x^*)y^*]+\mathbb E[y^{*2}]=$

(4) $\mathbb E[(g(x^*)-\mathbb E[g(x^*)])^2] + \mathbb E[g(x^*)]^2-2 \mathbb E[g(x^*)]f(x^*) + \mathbb E[(y^*-f(x^*))^2] + f(x^*)^2$

where we have applied the identity $\mathbb E[X^2] = \mathbb E[(X-\mathbb E[X])^2] + \mathbb E[X]^2$ to the first and third terms of Equation (3), and used the facts that $\mathbb E[g(x^*)y^*] = \mathbb E[g(x^*)]f(x^*)$ (the noise in a new observation is independent of the fitted estimator) and that $\mathbb E[y^*] = f(x^*)$ (think of averaging over an infinite number of datasets sampled from $y$; all the noise averages out, leaving $f(x)$). Rearranging Equation (4), we obtain

(5) $\mathbb E[(g(x^*)-\mathbb E[g(x^*)])^2] + \mathbb E[g(x^*)]^2 - 2 \mathbb E[g(x^*)]f(x^*) + f(x^*)^2 + \mathbb E[(y^*-f(x^*))^2]$

which can be further simplified and grouped into three terms

(6) $\mathbb E[(g(x^*) - \mathbb E[g(x^*)])^2] + (\mathbb E[g(x^*)]-f(x^*))^2 + \mathbb E[(y^*-f(x^*))^2]$

1. The first term is the variance of the estimator introduced above.
2. The second term is the square of the bias of the estimator, also introduced above.
3. The third term is the variance of the observation noise and describes how much the observations $y$ vary from the true function $f(x)$. Notice that the noise term does not depend on the estimator $g(x)$. Thus the noise term is a constant that places a lower bound on expected prediction error.

Here we find that the expected prediction error on new data $(x^*,y^*)$ (in the squared differences sense) is the combination of three terms:

Expected prediction error = estimator variance + squared estimator bias + noise

Thus the expected prediction error on new data can be used as a quantitative criterion for selecting the best model from a candidate set of estimators! It turns out that, given $N$ new data points $(\mathbf x^*,\mathbf y^*)$, the expected prediction error can be easily approximated as the mean squared error over the data pairs:

$\mathbb E[(g(\mathbf x^*) - \mathbf y^*)^2] \approx \frac{1}{N}\sum_{i=1}^N(g(x_i^*)-y_i^*)^2$
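As a numerical sanity check on the decomposition, the following Python/NumPy sketch (under the same assumptions as the earlier simulations: a $\sin(\pi x)$ target, Gaussian noise, a cubic estimator) estimates the expected prediction error at a single new input $x^*$ and compares it to the sum of variance, squared bias, and noise:

```python
import numpy as np

rng = np.random.default_rng(2)
f = lambda x: np.sin(np.pi * x)
sigma, K, order = 0.1, 2000, 3

x = np.linspace(-1, 1, 10)   # training inputs for every simulated dataset
x_star = 0.5                 # a single new test input

preds, sq_errs = [], []
for _ in range(K):
    # Fit the estimator to a fresh dataset, then predict at x_star
    theta = np.polyfit(x, f(x) + sigma * rng.normal(size=x.shape), order)
    g_star = np.polyval(theta, x_star)
    y_star = f(x_star) + sigma * rng.normal()  # a new, independent observation
    preds.append(g_star)
    sq_errs.append((g_star - y_star) ** 2)

preds = np.array(preds)
epe = np.mean(sq_errs)                      # estimated expected prediction error
variance = preds.var()                      # estimator variance at x_star
bias2 = (preds.mean() - f(x_star)) ** 2     # squared estimator bias at x_star
noise = sigma ** 2                          # irreducible observation noise
# epe should approximately equal variance + bias2 + noise
```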

Below we demonstrate these findings with another set of simulations. We simulate 100 independent datasets, each with 25 $xy$ pairs. We then partition each dataset into two non-overlapping sets: a training set used for fitting model parameters, and a testing set used to calculate prediction error. We then fit the parameters for estimators of varying complexity. Complexity is varied by using polynomial functions that range in model order from 1 (least complex) to 12 (most complex). We then calculate and display the squared bias, variance, and error on the testing set for each of the estimators:

N = 25; % # OF OBSERVATIONS PER DATASET
K = 100;% # OF DATASETS
noiseSTD = .5; % NOISE STANDARD DEV.
nTrain = ceil(N*.9); % # OF TRAINING POINTS
nPolyMax = 12; % MAXIMUM MODEL COMPLEXITY

% INITIALIZE SOME VARIABLES
xGrid = linspace(-1,1,N);
meanPrediction = zeros(K,N);
thetaHat = {};
x = linspace(-1,1,N);
x = x(randperm(N));
for iS = 1:K % LOOP OVER DATASETS
% CREATE OBSERVED DATA, y
y = f(x) + noiseSTD*randn(size(x));

% CREATE TRAINING SET
xTrain = x(1:nTrain);
yTrain = y(1:nTrain);

% CREATE TESTING SET
xTest = x(nTrain+1:end);
yTest = y(nTrain+1:end);

% FIT MODELS
for jD = 1:nPolyMax

% MODEL PARAMETER ESTIMATES
thetaHat{jD}(iS,:) = polyfit(xTrain,yTrain,jD);

% PREDICTIONS
yHatTrain{jD}(iS,:) = polyval([thetaHat{jD}(iS,:)],xTrain); % TRAINING SET
yHatTest{jD}(iS,:) = polyval([thetaHat{jD}(iS,:)],xTest);% TESTING SET

% MEAN SQUARED ERROR
trainErrors{jD}(iS) = mean((yHatTrain{jD}(iS,:) - yTrain).^2); % TRAINING
testErrors{jD}(iS) = mean((yHatTest{jD}(iS,:) - yTest).^2); % TESTING
end
end

% CALCULATE AVERAGE PREDICTION ERROR, BIAS, AND VARIANCE
for iD = 1:nPolyMax
trainError(iD) = mean(trainErrors{iD});
testError(iD) = mean(testErrors{iD});
biasSquared(iD) = mean((mean(yHatTest{iD})-f(xTest)).^2);
variance(iD) = mean(var(yHatTest{iD},1));
end
[~,bestModel] = min(testError);

% DISPLAY
figure;
hold on;
plot(testError,'k','Linewidth',2);
plot(biasSquared,'r','Linewidth',2);
plot(variance,'b','Linewidth',2);
plot(biasSquared + variance,'m-.','Linewidth',2);
yl = ylim;
plot([bestModel,bestModel],[yl(1),yl(2)],'k--');
xlim([1,nPolyMax]);
xlabel('Model Complexity (Polynomial Order)')
legend('Test Error','Bias^2','Variance','Bias^2+Var.','Best Model')
hold off;


Here we see how, as the model complexity increases, the estimator variance (blue curve) gradually increases. Additionally, as model complexity increases, the squared bias (red curve) decreases. Thus there is a tradeoff between bias and variance that comes with model complexity: models that are too complex will have high variance and low bias; models that are too simple will have high bias and low variance. The best model will have both low bias and low variance. In this example, we highlight the best estimator in terms of prediction error on the testing set (black curve) with a dashed black vertical line. The best estimator corresponds to a polynomial model of order $N=3$. Notice that the vertical black line is located where the function defined by the sum of the squared bias and variance (dashed magenta curve) is also at a minimum.

Notice also that the sum of the squared bias and variance has the same shape as the curve defined by the prediction error on the testing set. This exemplifies how the error on novel data can be used as a proxy for determining the best estimator from a candidate set based on squared bias and variance. The noise term in Equation (6) is also represented in the figure, as the vertical displacement between the black curve and the dashed magenta curve.

It is very important to point out that all of these results are based on evaluating prediction error on novel data not used to estimate the model parameters. In fact, assessing model performance based on prediction error calculated on the same data used to estimate the model parameters is highly problematic, as it will always favor overfitting. In plain terms, this means that we will always favor a more complex model if we assess goodness of fit on the training data, as a more complex model will be better able to capture small, random trends in the data due to noise.

This overfitting phenomenon is demonstrated below. We plot the error calculated on the training set (Train Error, green curve) along with the error calculated on the testing set (Test Error, black curve) for the above simulation. We also identify the best estimator as we did above.

% DISPLAY
figure, hold on;
plot(trainError,'g','Linewidth',2);
plot(testError,'k','Linewidth',2);
yl = ylim;
plot([bestModel,bestModel],[yl(1),yl(2)],'k--');
xlim([1,nPolyMax]);
xlabel('Model Complexity (Polynomial Order)')
legend('Train Error','Test Error','Best Model');
hold off;


We see here that as model complexity increases, the error calculated on the training set continues to decrease, whereas the error on the testing set increases past the optimal polynomial order $N=3$. We showed above that error calculated on the testing set is the true indicator of how well an estimator will generalize to new data points. The error calculated on the training set strongly disagrees with the error calculated on the testing set once the optimal model complexity has been exceeded. Since, in general, the whole point of modeling a data set is to generalize to novel data, assessing model predictions on the training set data should be avoided.

## Wrapping Up

Here we discussed how the bias and variance of an estimator are related to squared prediction error. Though we focused on regression, these concepts can also be applied to classification problems. We found that an optimal estimator will have both low variance and low bias. We further found that information about squared bias and variance is contained in expected prediction error calculated on a testing set of data not used to fit a model’s parameters.

The concepts of estimator bias and variance are generally only clear in the context of an ensemble of datasets. However, in real-world applications, there is generally only a single observed dataset. In such cases the roles of bias and variance are less obvious (though it is possible to calculate estimates of variance and bias using resampling methods such as bootstrapping). However, the direct connection we made between bias and variance and the mean-squared error calculated on a testing set gives us a direct means for assessing a group of candidate estimators in light of a single data set. We only need to partition the available data set into separate portions: one used to fit model parameters (a training set), and another used to assess prediction accuracy (a testing set). Comparing prediction accuracy across potential estimators is equivalent to assessing their biases and variances across many datasets. Note that resampling methods such as cross-validation can prove helpful here, particularly when the amount of observed data is small.
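The partitioning procedure is only a few lines in practice. Here is a Python/NumPy sketch (with a larger dataset than the 25-point simulations above, purely so the illustration is stable) that selects a polynomial order by held-out error:

```python
import numpy as np

rng = np.random.default_rng(3)
f = lambda x: np.sin(np.pi * x)
sigma = 0.5

# One observed dataset, partitioned into a training and a testing portion
x = rng.uniform(-1, 1, size=200)
y = f(x) + sigma * rng.normal(size=x.shape)
n_train = 150
x_tr, y_tr, x_te, y_te = x[:n_train], y[:n_train], x[n_train:], y[n_train:]

# Fit estimators of increasing complexity on the training set,
# then score each by mean squared error on the held-out testing set
test_err = {}
for order in range(1, 13):
    theta = np.polyfit(x_tr, y_tr, order)
    test_err[order] = np.mean((np.polyval(theta, x_te) - y_te) ** 2)

best_order = min(test_err, key=test_err.get)
```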

Note: all the code for this post, contained in a single script, is here:

clear
clc
close all

% CREATE A RANDOM DATASET
rng(10)
nData = 10; % N
x = 2*(rand(1,nData)-.5);

xGrid = linspace(-1,1,100);

% DEFINE THE TARGET FUNCTION f(x)
f = @(x) sin(pi*x);

h = [];
h(1) = plot(xGrid,f(xGrid),'k','Linewidth',2);
xlabel('x')
hold on

% DEFINE AND DISPLAY y
noiseSTD = .1;
y = f(x) + noiseSTD*randn(size(x));
h(2) = scatter(x,y,'ko');
legend(h,{'f(x)','y'},'Location','Northwest');

% FIT POLYNOMIAL MODELS & DISPLAY
% (ASSUMING PREVIOUS PLOT ABOVE STILL AVAILABLE)
degree = [1,3,10];
theta = {};
cols = [.8 .05 0.05; 0.05 .6 0.05; 0.05 0.05 .6];
for iD = 1:numel(degree)
figure(1)
theta{iD} = polyfit(x,y,degree(iD));
fit{iD} = polyval(theta{iD},xGrid);
h(end+1) = plot(xGrid,fit{iD},'color',cols(iD,:),'Linewidth',2);
xlim([-1 1])
ylim([-1 1])
end
legend(h,'f(x)','y','g_1(x)','g_3(x)','g_{10}(x)','Location','Northwest')

% FIT MODELS TO K INDEPENDENT DATASETS
K = 50;
for iS = 1:K
ySim = f(x) + noiseSTD*randn(size(x));
for jD = 1:numel(degree)
% FIT THE MODEL USING polyfit.m
thetaTmp = polyfit(x,ySim,degree(jD));
% EVALUATE THE MODEL FIT USING polyval.m
simFit{jD}(iS,:) = polyval(thetaTmp,xGrid);
end
end

% DISPLAY ALL THE MODEL FITS
h = [];
for iD = 1:numel(degree)
figure(iD+1)
hold on
% PLOT THE FUNCTION FIT TO EACH DATASET
for iS = 1:K
h(1) = plot(xGrid,simFit{iD}(iS,:),'color',brighten(cols(iD,:),.6));
end
% PLOT THE AVERAGE FUNCTION ACROSS ALL FITS
h(2) = plot(xGrid,mean(simFit{iD}),'color',cols(iD,:),'Linewidth',5);
% PLOT THE UNDERLYING FUNCTION f(x)
h(3) = plot(xGrid,f(xGrid),'color','k','Linewidth',3);
% CALCULATE THE SQUARED ERROR AT EACH POINT, AVERAGED ACROSS ALL DATASETS
squaredError = (mean(simFit{iD})-f(xGrid)).^2;
% PLOT THE SQUARED ERROR
h(4) = plot(xGrid,squaredError,'k--','Linewidth',3);
uistack(h(2),'top')
hold off
axis square
xlim([-1 1])
ylim([-1 1])
legend(h,{sprintf('Individual g_{%d}(x)',degree(iD)),'Mean of All Fits','f(x)','Squared Error'},'Location','WestOutside')
title(sprintf('Model Order=%d',degree(iD)))
end

N = 25; % # OF OBSERVATIONS PER DATASET
K = 100;% # OF DATASETS
noiseSTD = .5; % NOISE STANDARD DEV.
nTrain = ceil(N*.9); % # OF TRAINING POINTS
nPolyMax = 12; % MAXIMUM MODEL COMPLEXITY

% INITIALIZE SOME VARIABLES
xGrid = linspace(-1,1,N);
meanPrediction = zeros(K,N);
thetaHat = {};
x = linspace(-1,1,N);
x = x(randperm(N));
for iS = 1:K % LOOP OVER DATASETS

% CREATE OBSERVED DATA, y
y = f(x) + noiseSTD*randn(size(x));

% CREATE TRAINING SET
xTrain = x(1:nTrain);
yTrain = y(1:nTrain);

% CREATE TESTING SET
xTest = x(nTrain+1:end);
yTest = y(nTrain+1:end);

% FIT MODELS
for jD = 1:nPolyMax

% MODEL PARAMETER ESTIMATES
thetaHat{jD}(iS,:) = polyfit(xTrain,yTrain,jD);

% PREDICTIONS
yHatTrain{jD}(iS,:) = polyval([thetaHat{jD}(iS,:)],xTrain); % TRAINING SET
yHatTest{jD}(iS,:) = polyval([thetaHat{jD}(iS,:)],xTest);% TESTING SET

% MEAN SQUARED ERROR
trainErrors{jD}(iS) = mean((yHatTrain{jD}(iS,:) - yTrain).^2); % TRAINING
testErrors{jD}(iS) = mean((yHatTest{jD}(iS,:) - yTest).^2); % TESTING
end
end

% CALCULATE AVERAGE PREDICTION ERROR, BIAS, AND VARIANCE
for iD = 1:nPolyMax
trainError(iD) = mean(trainErrors{iD});
testError(iD) = mean(testErrors{iD});
biasSquared(iD) = mean((mean(yHatTest{iD})-f(xTest)).^2);
variance(iD) = mean(var(yHatTest{iD},1));
end
[~,bestModel] = min(testError);

% DISPLAY
figure;
hold on;
plot(testError,'k','Linewidth',2);
plot(biasSquared,'r','Linewidth',2);
plot(variance,'b','Linewidth',2);
plot(biasSquared + variance,'m-.','Linewidth',2);
yl = ylim;
plot([bestModel,bestModel],[yl(1),yl(2)],'k--');
xlim([1,nPolyMax]);
xlabel('Model Complexity (Polynomial Order)')
legend('Test Error','Bias^2','Variance','Bias^2+Var.','Best Model')
hold off;

% DISPLAY
figure, hold on;
plot(trainError,'g','Linewidth',2);
plot(testError,'k','Linewidth',2);
yl = ylim;
plot([bestModel,bestModel],[yl(1),yl(2)],'k--');
xlim([1,nPolyMax]);
xlabel('Model Complexity (Polynomial Order)')
legend('Train Error','Test Error','Best Model');
hold off;


## fMRI In Neuroscience: Efficiency of Event-related Experiment Designs

Event-related fMRI experiments are used to detect selectivity in the brain to stimuli presented over short durations. An event is generally modeled as an impulse function that occurs at the onset of the stimulus in question. Event-related designs are flexible in that many different classes of stimuli can be intermixed. These designs can minimize confounding behavioral effects due to subject adaptation or expectation. Furthermore, stimulus onsets can be modeled at timescales shorter than the repetition time (TR) of the scanner. However, given such flexibility in design and modeling, how does one determine the schedule for presenting a series of stimuli? Do we space out stimulus onsets periodically across a scan period? Or do we randomize stimulus onsets? Furthermore, what is the logic for or against either approach? Which approach is more efficient for gaining insight into the selectivity in the brain?

## Simulating Two fMRI Experiments: Periodic and Random Stimulus Onsets

To get a better understanding of the problem of choosing an efficient experiment design, let’s simulate two simple fMRI experiments. In the first experiment, a stimulus is presented periodically 20 times, once every 4 seconds, for a run of 80 seconds in duration. We then simulate a noiseless BOLD signal evoked in a voxel with a known HRF. In the second experiment, we simulate the noiseless BOLD signal evoked by 20 stimulus onsets that occur at random times over the course of the 80-second run. The code for simulating the signals and displaying the output is shown below:

rand('seed',12345);
randn('seed',12345);
TR = 1; % REPETITION TIME
t = 1:TR:20; % MEASUREMENTS
h = gampdf(t,6) + -.5*gampdf(t,10); % ACTUAL HRF
h = h/max(h); % SCALE TO MAX OF 1

% SOME CONSTANTS...
trPerStim = 4; % # TR PER STIMULUS FOR PERIODIC EXPERIMENT
nRepeat = 20; % # OF TOTAL STIMULI SHOWN
nTRs = trPerStim*nRepeat; % TOTAL # OF TRs
stimulusTrain0 = zeros(1,nTRs);

beta = 3; % SELECTIVITY/HRF GAIN

% SET UP TWO DIFFERENT STIMULUS PARADIGM...
% A. PERIODIC, NON-RANDOM STIMULUS ONSET TIMES
D_periodic = stimulusTrain0;
D_periodic(1:trPerStim:trPerStim*nRepeat) = 1;

% UNDERLYING MODEL FOR (A)
X_periodic = conv2(D_periodic,h);
X_periodic = X_periodic(1:nTRs);
y_periodic = X_periodic*beta;

% B. RANDOM, UNIFORMLY-DISTRIBUTED STIMULUS ONSET TIMES
D_random = stimulusTrain0;
randIdx = randperm(numel(stimulusTrain0)-5);
D_random(randIdx(1:nRepeat)) = 1;

% UNDERLYING MODEL FOR (B)
X_random = conv2(D_random,h);
X_random = X_random(1:nTRs);
y_random = X_random*beta;

% DISPLAY STIMULUS ONSETS AND EVOKED RESPONSES
% FOR EACH EXPERIMENT
figure
subplot(121)
stem(D_periodic,'k');
hold on;
plot(y_periodic,'r','linewidth',2);
xlabel('Time (TR)');
title(sprintf('Responses Evoked by\nPeriodic Stimulus Onset\nVariance=%1.2f',var(y_periodic)))

subplot(122)
stem(D_random,'k');
hold on;
plot(y_random,'r','linewidth',2);
xlabel('Time (TR)');
title(sprintf('Responses Evoked by\nRandom Stimulus Onset\nVariance=%1.2f',var(y_random)))


BOLD signals evoked by periodic (left) and random (right) stimulus onsets.

The black stick functions in the simulation output indicate the stimulus onsets and each red function is the simulated noiseless BOLD signal evoked by those stimuli. The first thing to notice is the dramatically different variances of the BOLD signals evoked by the two stimulus presentation schedules. For the periodic stimuli, the BOLD signal quickly saturates, then oscillates around an effective baseline activation. The estimated variance of the periodic-based signal is 0.18. In contrast, the signal evoked by the random stimulus presentation schedule varies wildly, reaching a maximum amplitude that is roughly 2.5 times as large as the maximum amplitude of the signal evoked by the periodic stimuli. The estimated variance of the signal evoked by the random stimuli is 7.4, roughly 40 times the variance of the signal evoked by the periodic stimulus.
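The variance gap between the two designs is easy to reproduce. Below is a Python/NumPy sketch (mirroring, not reproducing, the Matlab above; the HRF is the same difference of gamma densities, $\texttt{gampdf}(t,6) - 0.5\,\texttt{gampdf}(t,10)$, scaled to a maximum of 1) comparing the variance of the noiseless signals evoked by periodic versus random onsets:

```python
import numpy as np

rng = np.random.default_rng(4)
n_trs, n_events, tr_per_stim = 80, 20, 4

# HRF matching gampdf(t,6) - 0.5*gampdf(t,10) with unit scale parameter
t = np.arange(1.0, 21.0)
hrf = t**5 * np.exp(-t) / 120.0 - 0.5 * t**9 * np.exp(-t) / 362880.0
hrf /= hrf.max()

def evoked_signal_variance(onset_idx):
    d = np.zeros(n_trs)
    d[onset_idx] = 1.0                      # impulse at each stimulus onset
    bold = np.convolve(d, hrf)[:n_trs]      # noiseless predicted BOLD signal
    return bold.var()

periodic_onsets = np.arange(0, n_trs, tr_per_stim)           # every 4 TRs
random_onsets = rng.choice(n_trs - 5, size=n_events, replace=False)

var_periodic = evoked_signal_variance(periodic_onsets)
var_random = evoked_signal_variance(random_onsets)
# Overlapping responses in the random design sum, producing larger swings
```

Because the random onsets here differ from the Matlab run, the variance values will not match 0.18 and 7.4 exactly, but the ordering — random well above periodic — holds.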

So which stimulus schedule allows us to better estimate the HRF and, more importantly, the amplitude of the HRF, as it is the amplitude that is the common proxy for voxel selectivity/activation? Below we repeat the above experiment 50 times. However, instead of simulating noiseless BOLD responses, we introduce 50 distinct, uncorrelated noise conditions, and from the simulated noisy responses we estimate the HRF using an FIR basis set for each repeated trial. We then compare the estimated HRFs across the 50 trials for the periodic and random stimulus presentation schedules. Note that for each trial, the noise is exactly the same for the two stimulus presentation schedules. Further, we simulate a selectivity/tuning gain of 3 times the maximum HRF amplitude and assume that the HRF to be estimated is 16 TRs (i.e. 16 seconds) in length. The simulation and output are below:

%% SIMULATE MULTIPLE TRIALS OF EACH EXPERIMENT
%% AND ESTIMATE THE HRF FOR EACH
%% (ASSUME THE VARIABLES DEFINED ABOVE ARE IN WORKSPACE)

% CREATE AN FIR DESIGN MATRIX
% FOR EACH EXPERIMENT
hrfLen = 16;  % WE ASSUME TO-BE-ESTIMATED HRF IS 16 TRS LONG

% CREATE FIR DESIGN MATRIX FOR THE PERIODIC STIMULI
X_FIR_periodic = zeros(nTRs,hrfLen);
onsets = find(D_periodic);
idxCols = 1:hrfLen;
for jO = 1:numel(onsets)
idxRows = onsets(jO):onsets(jO)+hrfLen-1;
for kR = 1:numel(idxRows);
X_FIR_periodic(idxRows(kR),idxCols(kR)) = 1;
end
end
X_FIR_periodic = X_FIR_periodic(1:nTRs,:);

% CREATE FIR DESIGN MATRIX FOR THE RANDOM STIMULI
X_FIR_random = zeros(nTRs,hrfLen);
onsets = find(D_random);
idxCols = 1:hrfLen;
for jO = 1:numel(onsets)
idxRows = onsets(jO):onsets(jO)+hrfLen-1;
for kR = 1:numel(idxRows);
X_FIR_random(idxRows(kR),idxCols(kR)) = 1;
end
end
X_FIR_random = X_FIR_random(1:nTRs,:);

% SIMULATE AND ESTIMATE HRF WEIGHTS VIA OLS
nTrials = 50;

% CREATE NOISE TO ADD TO SIGNALS
% NOTE: SAME NOISE CONDITIONS FOR BOTH EXPERIMENTS
noiseSTD = beta*2;
noise = bsxfun(@times,randn(nTrials,numel(X_periodic)),noiseSTD);

%% ESTIMATE HRF FROM PERIODIC STIMULUS TRIALS
beta_periodic = zeros(nTrials,hrfLen);
for iT = 1:nTrials
y = y_periodic + noise(iT,:);
beta_periodic(iT,:) = X_FIR_periodic\y';
end

% CALCULATE MEAN AND STANDARD ERROR OF HRF ESTIMATES
beta_periodic_mean = mean(beta_periodic);
beta_periodic_se = std(beta_periodic)/sqrt(nTrials);

%% ESTIMATE HRF FROM RANDOM STIMULUS TRIALS
beta_random = zeros(nTrials,hrfLen);
for iT = 1:nTrials
y = y_random + noise(iT,:);
beta_random(iT,:) = X_FIR_random\y';
end

% CALCULATE MEAN AND STANDARD ERROR OF HRF ESTIMATES
beta_random_mean = mean(beta_random);
beta_random_se = std(beta_random)/sqrt(nTrials);

% DISPLAY HRF ESTIMATES
figure
% ...FOR THE PERIODIC STIMULI
subplot(121);
hold on;
h0 = plot(h*beta,'k');
h1 = plot(beta_periodic_mean,'linewidth',2);
h2 = plot(beta_periodic_mean+beta_periodic_se,'r','linewidth',2);
plot(beta_periodic_mean-beta_periodic_se,'r','linewidth',2);
xlabel('Time (TR)')
legend([h0, h1,h2],'Actual HRF','Average \beta_{periodic}','Standard Error')
title('Periodic HRF Estimate')

% ...FOR THE RANDOMLY-PRESENTED STIMULI
subplot(122);
hold on;
h0 = plot(h*beta,'k');
h1 = plot(beta_random_mean,'linewidth',2);
h2 = plot(beta_random_mean+beta_random_se,'r','linewidth',2);
plot(beta_random_mean-beta_random_se,'r','linewidth',2);
xlabel('Time (TR)')
legend([h0,h1,h2],'Actual HRF','Average \beta_{random}','Standard Error')
title('Random HRF Estimate')


Estimated HRFs from 50 trials of periodic (left) and random (right) stimulus schedules

In the simulation outputs, the average HRF for the random stimulus presentation (right) closely follows the actual HRF. There is also little variability in the HRF estimates, as indicated by the small standard error at each time point. In addition, the selectivity/gain term is accurately recovered, giving a mean HRF with nearly the same amplitude as the underlying model. In contrast, the HRF estimated from the periodic experiment is much more variable, as indicated by the large standard errors. Such variability reduces our confidence in the HRF estimate from any single trial. Additionally, the scale of the mean HRF estimate is off by nearly 30% of the actual value.

From these results it is obvious that the random stimulus schedule gives rise to more accurate, and less variable, estimates of the HRF. What may not be so obvious is why this is the case, as there were the same number of stimuli and the same number of signal measurements in each experiment. To get a better understanding, let’s refer back to the variances of the evoked noiseless signals. These are the signals underlying the noisy responses used to estimate the HRF. Added noise impedes detection of the underlying trends that are useful for estimating the HRF, so it is important that the variance of the underlying signal be large relative to the noise in order for the signal to be detected.

For the periodic stimulus schedule, we saw that the variation in the noiseless BOLD signal was much smaller than the variation in the signals evoked by the randomly-presented stimuli. The signals evoked by the random schedule therefore provide a better characterization of the underlying response in the presence of the same amount of noise, and thus more information for estimating the HRF. With this in mind, we can think of maximizing the efficiency of an experiment design as maximizing the variance of the BOLD signals the experiment evokes.
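The variance comparison is easy to check numerically outside of Matlab. Below is a minimal Python/NumPy sketch; the toy double-gamma HRF, the 300-TR duration, and the 4-TR periodic spacing are illustrative stand-ins for the MATLAB variables defined earlier, not the exact simulation values:

```python
import numpy as np

rng = np.random.default_rng(0)
n_trs = 300

# Toy double-gamma HRF (mirrors gampdf(t,6) - .5*gampdf(t,10)), peak = 1
t = np.arange(1, 21, dtype=float)
hrf = t**5 * np.exp(-t) / 120.0 - 0.5 * t**9 * np.exp(-t) / 362880.0
hrf /= hrf.max()

# Periodic schedule: one stimulus every 4 TRs
d_periodic = np.zeros(n_trs)
d_periodic[::4] = 1.0

# Random schedule: the same number of onsets, scattered uniformly in time
n_stim = int(d_periodic.sum())
d_random = np.zeros(n_trs)
d_random[rng.choice(n_trs, size=n_stim, replace=False)] = 1.0

# Noiseless evoked signals: onset trains convolved with the HRF
y_periodic = np.convolve(d_periodic, hrf)[:n_trs]
y_random = np.convolve(d_random, hrf)[:n_trs]

var_periodic = y_periodic.var()
var_random = y_random.var()  # random schedule evokes more signal variance
```

The periodic response settles into a near-constant steady state once the overlapping responses pile up, so its variance is small; the random train produces gaps and clusters that keep the evoked signal fluctuating.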

## An Alternative Perspective: The Frequency Power Spectrum

Another helpful interpretation comes from a signal processing perspective. If we assume that neural activity corresponds directly to the onset of a stimulus event, then we can interpret the train of stimulus onsets as a direct signal of the evoked neural activity. Furthermore, we can interpret the HRF as a low-pass filter that acts to “smooth” the neural signal in time. Each of these signals, the neural/stimulus signal and the HRF filter, has an associated power spectrum. The power spectrum of a signal captures the amount of power per unit time that the signal has at a particular frequency $\omega$. The power spectrum for a discrete signal can be calculated from its discrete Fourier transform (DFT) $F(\omega)$ as follows

$P(\omega) = | F(\omega)|^2$

Below, we use Matlab’s $\text{fft.m}$ function to calculate the DFT and the associated power spectrum for each of the stimulus/neural signals, as well as the HRF.

%% POWER SPECTRUM ANALYSES
%% (ASSUME THE VARIABLES DEFINED ABOVE ARE IN WORKSPACE)

% MAKE SURE WE PAD SUFFICIENTLY
% FOR CIRCULAR CONVOLUTION
N = 2^nextpow2(nTRs + numel(h)-1);
nUnique = ceil(1+N/2); % TAKE ONLY POSITIVE SPECTRA

% CALCULATE POWER SPECTRUM FOR PERIODIC STIMULI EXPERIMENT
ft_D_periodic = fft(D_periodic,N)/N; % DFT
P_D_periodic = abs(ft_D_periodic).^2; % POWER
P_D_periodic = 2*P_D_periodic(2:nUnique-1); % REMOVE ZEROTH & NYQUIST

% CALCULATE POWER SPECTRUM FOR RANDOM STIMULI EXPERIMENT
ft_D_random = fft(D_random,N)/N; % DFT
P_D_random = abs(ft_D_random).^2; % POWER
P_D_random = 2*P_D_random(2:nUnique-1); % REMOVE ZEROTH & NYQUIST

% CALCULATE POWER SPECTRUM OF HRF
ft_h = fft(h,N)/N; % DFT
P_h = abs(ft_h).^2; % POWER
P_h = 2*P_h(2:nUnique-1); % REMOVE ZEROTH & NYQUIST

% CREATE A FREQUENCY SPACE FOR PLOTTING
F = 1/N*[1:N/2-1];

% DISPLAY STIMULI POWER SPECTRA
figure
subplot(131)
hhd = plot(F,P_D_periodic,'b','linewidth',2);
axis square; hold on;
hhr = plot(F,P_D_random,'g','linewidth',2);
xlim([0 .3]); xlabel('Frequency (Hz)');
set(gca,'Ytick',[]); ylabel('Magnitude');
legend([hhd,hhr],'Periodic','Random')
title('Stimulus Power, P_{stim}')

% DISPLAY HRF POWER SPECTRUM
subplot(132)
plot(F,P_h,'r','linewidth',2);
axis square
xlim([0 .3]); xlabel('Frequency (Hz)');
set(gca,'Ytick',[]); ylabel('Magnitude');
title('HRF Power, P_{HRF}')

% DISPLAY EVOKED SIGNAL POWER SPECTRA
subplot(133)
hhd = plot(F,P_D_periodic.*P_h,'b','linewidth',2);
hold on;
hhr = plot(F,P_D_random.*P_h,'g','linewidth',2);
axis square
xlim([0 .3]); xlabel('Frequency (Hz)');
set(gca,'Ytick',[]); ylabel('Magnitude');
legend([hhd,hhr],'Periodic','Random')
title('Signal Power, P_{stim}.*P_{HRF}')


Power spectrum of neural/stimulus (left), HRF (center), and evoked BOLD (right) signals

On the left of the output we see the power spectra for the stimulus signals. The blue line corresponds to the spectrum for the periodic stimuli, and the green line to the spectrum for the randomly-presented stimuli. The large peak in the blue spectrum at 0.25 Hz holds the majority of the stimulus power, as this is the fundamental frequency of the periodic stimulus presentation (i.e. one stimulus every 4 seconds). However, there is little power at any other frequency. In contrast, the green spectrum indicates that the random stimulus presentation has power at multiple frequencies.

If we interpret the HRF as a filter, then we can think of the HRF power spectrum as modulating the power spectrum of the neural signals to produce the power of the evoked BOLD signals. The power spectrum for the HRF is plotted in red in the center plot. Notice how the majority of the power for the HRF is at frequencies less than 0.1 Hz, and there is very little power at frequencies above 0.2 Hz. When the neural signal power is modulated by the HRF power, there is little resultant power in the BOLD signals evoked by the periodic stimulus presentation (blue spectrum in the right plot). In contrast, because the power of the neural signals evoked by the random stimuli is spread across the frequency domain, a number of their frequencies overlap with those for which the HRF also has power. Thus, after modulating the neural/stimulus power with the HRF power, the spectrum of the BOLD signals evoked by the randomly-presented stimuli has much more power across the relevant frequency range than that of the signals evoked by the periodic stimuli. This is indicated by the larger area under the green curve in the right plot.

The signal processing perspective lets us see directly how the limitations of a particular experiment design are rooted in the frequency spectrum of the HRF. Accordingly, another way to think of maximizing the efficiency of an experimental design is maximizing the amount of power in the evoked BOLD responses.
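The fact that the evoked power spectrum is the product of the stimulus spectrum and the HRF spectrum is just the convolution theorem, and it can be checked numerically. The following Python/NumPy sketch (with a hypothetical onset train and a toy gamma-shaped HRF) verifies that, with sufficient zero-padding, the power spectrum of the convolved signal equals the product of the individual power spectra:

```python
import numpy as np

rng = np.random.default_rng(1)
n_trs = 128
stim = np.zeros(n_trs)
stim[rng.choice(n_trs, size=12, replace=False)] = 1.0  # random onset train

t = np.arange(1, 21, dtype=float)
hrf = t**5 * np.exp(-t) / 120.0  # gamma-shaped HRF stand-in

# Pad to at least n_trs + len(hrf) - 1 so circular convolution == linear
n_fft = int(2 ** np.ceil(np.log2(n_trs + hrf.size - 1)))
bold = np.convolve(stim, hrf)    # noiseless evoked signal

p_bold = np.abs(np.fft.fft(bold, n_fft)) ** 2
p_product = np.abs(np.fft.fft(stim, n_fft)) ** 2 * np.abs(np.fft.fft(hrf, n_fft)) ** 2

# Convolution theorem: the two spectra agree (up to float error)
spectra_match = np.allclose(p_bold, p_product)
```

This is exactly why the right panel of the figure above is computed as the element-wise product of the two left panels.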

## Yet Another Perspective Based in Statistics: Efficiency Metric

Taking a statistics-based approach leads to a formal definition of efficiency, and further, a nice metric for testing the efficiency of an experimental design. Recall that when determining the shape of the HRF, a common approach is to use the GLM model

$y = X \beta + \epsilon$

Here $y$ is the evoked BOLD signal and $X$ is a design matrix that links a set of linear model parameters $\beta$ to those responses. The variable $\epsilon$ is a noise term that is unexplained by the model. Using an FIR basis formulation of the model, the weights in $\beta$ represent the HRF to a stimulus condition.

Because fMRI data are a continuous time series, the underlying noise $\epsilon$ is generally correlated in time. We can model this noise as a Gaussian process with zero mean and a constant multivariate covariance $C_{\epsilon}$. Note that this is analogous to the Generalized Least Squares (GLS) formulation of the GLM. In general, the values that comprise $C_{\epsilon}$ are unknown and have to be estimated from the fMRI data themselves.

For a known or estimated noise covariance, the Maximum Likelihood Estimator (MLE) for the model parameters $\beta$ (derivation not shown) is:

$\hat \beta = (X^TC_{\epsilon}^{-1}X)^{-1}X^TC_{\epsilon}^{-1}y$

Because the ML estimator of the HRF is a linear combination of the design matrix $X$ and the corresponding responses, which are both random variables ($X$ can represent any possible experiment design, and $y$ is by definition random), the estimator is itself a random variable, and the estimate of the HRF therefore has a variance. (We demonstrated how $\hat \beta$ is a random variable in the 50 simulations above: for each simulation $X$ was held fixed, but due to the added noise $y$ varied, and the estimate of $\beta$ took on different values for each noise condition.) We saw above how an HRF estimator with a large variance is undesirable, as it reduces our confidence in the estimates of the HRF shape and scale. Therefore we would like an estimator with minimum overall variance.

A formal metric for efficiency of a least-squares estimator is directly related to the variance of the estimator. The efficiency is defined to be the inverse of the sum of the estimator variances. An estimator that has a large sum of variances will have a low efficiency, and vice versa. But how do we obtain the values of the variances for the estimator? The variances can be recovered from the diagonal elements of the estimator covariance matrix $C_{\hat \beta}$, giving the following definition for the efficiency, $E$

$E = 1/trace(C_{\hat \beta})$

In an earlier post we found that the covariance matrix $C_{\hat \beta}$ for the GLS estimator (i.e. the formulation above) with a given noise covariance $C_{\epsilon}$ is:

$C_{\hat \beta} = (X^T C_{\epsilon}^{-1} X)^{-1}$.

Thus the efficiency for the HRF estimator is

$E = 1/trace((X^T C_{\epsilon}^{-1}X)^{-1})$

Here we see that the efficiency depends only on the noise covariance (or an estimate of it) and the design matrix used in the model, but not on the shape of the HRF. In general the noise covariance is out of the experimenter’s control (but see the take-homes below) and must be dealt with post hoc. However, because the design matrix is directly related to the experimental design, the above expression gives a direct way to test the efficiency of experimental designs before they are ever used!
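As a sanity check on the covariance expression above, we can compare the analytic estimator covariance against the empirical covariance of estimates over repeated noise draws. The Python/NumPy sketch below uses the white-noise special case ($C_{\epsilon} = \sigma^2 I$, so the GLS estimator reduces to OLS) and a small hypothetical design matrix, not the FIR matrices from the simulation:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 200, 3
X = rng.standard_normal((n, p))         # toy design matrix
beta_true = np.array([1.0, -2.0, 0.5])  # hypothetical weights
sigma = 0.5                             # white noise: C_eps = sigma^2 * I

# Analytic estimator covariance: sigma^2 (X^T X)^{-1}
cov_analytic = sigma ** 2 * np.linalg.inv(X.T @ X)
efficiency = 1.0 / np.trace(cov_analytic)   # the efficiency metric E

# Empirical covariance of the OLS estimates across repeated noise draws
estimates = np.empty((5000, p))
for i in range(5000):
    y = X @ beta_true + sigma * rng.standard_normal(n)
    estimates[i] = np.linalg.lstsq(X, y, rcond=None)[0]
cov_empirical = np.cov(estimates.T)

max_err = np.abs(cov_empirical - cov_analytic).max()  # small sampling error
```

The same machinery gives the efficiency metric directly: a design with a smaller trace of $C_{\hat \beta}$ has a larger $E$.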

In the simulations above, the noise processes are drawn from an independent multivariate Gaussian distribution, so the noise covariance is equal to the identity (i.e. the noise is uncorrelated). We also estimated the HRF using the FIR basis set, so our model design matrix was $X_{FIR}$. This gives the following efficiency estimate for the simulated experiments:

$E_{simulation} = 1/trace((X_{FIR}^T X_{FIR})^{-1})$

Below we calculate the efficiency for the FIR estimates under the simulated experiments with periodic and random stimulus presentation designs.

%% ESTIMATE DESIGN EFFICIENCY
%% (ASSUME THE VARIABLES DEFINED ABOVE ARE IN WORKSPACE)

% CALCULATE EFFICIENCY OF PERIODIC EXPERIMENT
E_periodic = 1/trace(pinv(X_FIR_periodic'*X_FIR_periodic));

% CALCULATE EFFICIENCY OF RANDOM EXPERIMENT
E_random = 1/trace(pinv(X_FIR_random'*X_FIR_random));

% DISPLAY EFFICIENCY ESTIMATES
figure
bar([E_periodic,E_random]);
set(gca,'XTick',[1,2],'XTickLabel',{'E_periodic','E_random'});
title('Efficiency of Experimental Designs');
colormap hot;


Estimated efficiency for simulated periodic (left) and random (right) stimulus schedules.

Here we see that the efficiency metric does indeed indicate that the randomly-presented stimulus paradigm is far more efficient than the periodically-presented paradigm.

## Wrapping Up

In this post we addressed the efficiency of an fMRI experiment design. A few take-homes from the discussion are:

1. Randomize stimulus onset times. These onset times should take into account the low-pass characteristics (i.e. the power spectrum) of the HRF.
2. Try to model selectivity to events that occur close in time. The reason for this is that noise covariances in fMRI are highly non-stationary. There are many sources of low-frequency physiological noise, such as breathing, pulse, and blood pressure, all of which dramatically affect the noise in fMRI timecourses. Thus any estimate of noise covariances from data recorded far apart in time will likely be erroneous.
3. Check an experimental design against other candidate designs using the Efficiency metric.

Above there is mention of the effects of low-frequency physiological noise. Until now, our simulations have assumed that all noise is independent in time, greatly simplifying the picture of estimating HRFs and corresponding selectivity. However, in a later post we’ll address how to deal with more realistic time courses that are heavily influenced by sources of physiological noise. Additionally, we’ll tackle how to go about estimating the noise covariance $C_{\epsilon}$ from more realistic fMRI time series.

## fMRI in Neuroscience: Modeling the HRF With FIR Basis Functions

In the previous post on fMRI methods, we discussed how to model the selectivity of a voxel using the General Linear Model (GLM). One of the basic assumptions we must make in order to use the GLM is that we also have an accurate model of the Hemodynamic Response Function (HRF) for the voxel. A common practice is to use a canonical HRF model established from previous empirical studies of fMRI timeseries. However, voxels throughout the brain, and across subjects, exhibit HRFs with a variety of shapes, so the canonical model is often incorrect. Therefore it becomes necessary to estimate the shape of the HRF for each voxel.

There are a number of methods for estimating HRFs, most of which are based on temporal basis function models. (For details on basis function models, see this previous post.) A number of basis sets are available, but in this post we’ll discuss modeling the HRF using a flexible set of delayed impulses called the Finite Impulse Response (FIR) basis.

## Modeling HRFs With a Set of Time-delayed Impulses

Let’s say that we have an HRF with the following shape.

A Model HRF.

We would like to be able to model the HRF as a weighted combination of simple basis functions. The simplest such set is the FIR basis: a series of $H$ distinct unit-magnitude (i.e. equal to one) impulses, each delayed in time by $t = 1 \dots H$ TRs. An example of modeling the HRF above using FIR basis functions is below:

%% REPRESENTING AN HRF WITH FIR BASIS FUNCTIONS
% CREATE ACTUAL HRF (AS MEASURED BY MRI SCANNER)
rand('seed',12345)
TR = 1;                             % REPETITION TIME
t = 1:TR:20;                        % MEASUREMENTS
h = gampdf(t,6) + -.5*gampdf(t,10); % ACTUAL HRF
h = h/max(h);

% DISPLAY THE HRF
figure;
stem(t,h,'k','Linewidth',2)
axis square
xlabel('Time (TR)')
title('A Model HRF')

% CREATE/DISPLAY FIR REPRESENTATION
figure; hold on
cnt = 1;

% COLORS BASIS FUNCTIONS ACCORDING TO HRF WEIGHT
map = jet(64);
cRange = linspace(min(h),max(h),64);

for iT = numel(h):-1:1
firSignal = ones(size(h));
firSignal(cnt) = 2;
[~,cIdx] = min(abs(cRange-h(cnt)));
color = map(cIdx,:);
plot(1:numel(h),firSignal + 2*(iT-1),'Color',color,'Linewidth',2)
cnt = cnt+1;
end
colormap(map); colorbar; caxis([min(h) max(h)]);

% DISPLAY
axis square;
ylabel('Basis Function')
xlabel('Time (TR)')
set(gca,'YTick',0:2:39,'YTickLabel',20:-1:1)
title(sprintf('Weighted FIR Basis\n Set (20 Functions)'));


Representing the HRF above as a weighted set of FIR basis functions.
The color of each of the 20 basis functions corresponds to its weight

Each of the basis functions $b_t$ has an unit impulse that occurs at time $t = 1 \dots 20$; otherwise it is equal to zero. Weighting each basis function $b_t$ with the corresponding value of the HRF at each time point $t$, followed by a sum across all the functions gives the target HRF in the first plot above. The FIR basis model makes no assumptions about the shape of the  HRF–the weight applied to each basis function can take any value–which allows the model to capture a wide range of HRF profiles.
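Because each FIR basis function is a unit impulse at a distinct delay, the basis matrix over an $H$-sample window is simply the identity matrix, and the weight on basis function $t$ is just the HRF value at delay $t$. A minimal Python/NumPy sketch of this identity, using a toy double-gamma HRF in place of the one defined in the MATLAB code:

```python
import numpy as np

t = np.arange(1, 21, dtype=float)
# Toy double-gamma HRF (mirrors gampdf(t,6) - .5*gampdf(t,10) above)
hrf = t**5 * np.exp(-t) / 120.0 - 0.5 * t**9 * np.exp(-t) / 362880.0
hrf /= hrf.max()

B = np.eye(t.size)    # FIR basis: column t is a unit impulse at delay t
weights = hrf.copy()  # the weight on basis t equals the HRF value at t

# Weighted sum of the basis functions reproduces the HRF exactly
reconstruction = B @ weights
```

This is what makes the FIR basis maximally flexible: any sampled HRF shape is reachable, since the basis spans the whole $H$-dimensional space.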

Given an experiment in which various stimuli are presented to a subject and BOLD responses are evoked within the subject’s brain, the goal is to determine the HRF to each stimulus within each voxel. Let’s take a look at a concrete example of how we can use the FIR basis to simultaneously estimate HRFs to many stimuli for multiple voxels with distinct tuning properties.

## Estimating the HRF of Simulated Voxels Using the FIR Basis

For this example we revisit a simulation of voxels with 4 different types of tuning (for details, see the previous post on fMRI in Neuroscience). One voxel is strongly tuned for visual stimuli (such as a light), the second voxel is weakly tuned for auditory stimuli (such as a tone), the third is moderately tuned for somatosensory stimuli (such as warmth applied to the palm), and the final voxel is unselective (i.e. weakly and equally selective for all three types of stimuli). We simulate an experiment in which the blood-oxygen-level-dependent (BOLD) signals evoked in each voxel by a series of non-overlapping lights, tones, and applications of warmth to the palm are measured over $T=330$ fMRI measurements (TRs). Below is the simulation of the experiment and the resulting simulated BOLD signals:

%% SIMULATE AN EXPERIMENT
% SOME CONSTANTS
trPerStim = 30;
nRepeat = 10;
nTRs = trPerStim*nRepeat + length(h);
nCond = 3;
nVox = 4;
impulseTrain0 = zeros(1,nTRs);

% RANDOM ONSET TIMES (TRs)
onsetIdx = randperm(nTRs-length(h));

% VISUAL STIMULUS
impulseTrainLight = impulseTrain0;
impulseTrainLight(onsetIdx(1:nRepeat)) = 1;
onsetIdx(1:nRepeat) = [];

% AUDITORY STIMULUS
impulseTrainTone = impulseTrain0;
impulseTrainTone(onsetIdx(1:nRepeat)) = 1;
onsetIdx(1:nRepeat) = [];

% SOMATOSENSORY STIMULUS
impulseTrainHeat = impulseTrain0;
impulseTrainHeat(onsetIdx(1:nRepeat)) = 1;

% EXPERIMENT DESIGN / STIMULUS SEQUENCE
D = [impulseTrainLight',impulseTrainTone',impulseTrainHeat'];
X = conv2(D,h');
X = X(1:nTRs,:);

%% SIMULATE RESPONSES OF VOXELS WITH VARIOUS SELECTIVITIES
visualTuning =   [4 0 0]; % VISUAL VOXEL TUNING
auditoryTuning = [0 2 0]; % AUDITORY VOXEL TUNING
somatoTuning =   [0 0 3]; % SOMATOSENSORY VOXEL TUNING
noTuning =       [1 1 1]; % NON-SELECTIVE

beta = [visualTuning', ...
auditoryTuning', ...
somatoTuning', ...
noTuning'];

y0 = X*beta;
SNR = 5;
noiseSTD = max(y0)/SNR;
noise = bsxfun(@times,randn(size(y0)),noiseSTD);
y = y0 + noise; % VOXEL RESPONSES

% DISPLAY VOXEL TIMECOURSES
voxNames = {'Visual','Auditory','Somat.','Unselective'};
cols = lines(4);
figure;
for iV = 1:4
subplot(4,1,iV)
plot(y(:,iV),'Color',cols(iV,:),'Linewidth',2); xlim([0,nTRs]);
ylabel('BOLD Signal')
legend(sprintf('%s Voxel',voxNames{iV}))
end
xlabel('Time (TR)')
set(gcf,'Position',[100,100,880,500])


Four simulated voxels, each with strong visual, weak auditory,
moderate somatosensory and unselective tuning.

Now let’s estimate the HRF of each voxel to each of the $C = 3$ stimulus conditions using an FIR basis function model. To do so, we create a design matrix composed of successive sets of delayed impulses, where each set of impulses begins at the onset of each stimulus condition. For the $[T \times C]$-sized stimulus onset matrix $D$, we calculate an $[T \times HC]$ FIR design matrix $X_{FIR}$, where $H$ is the assumed length of the HRF we are trying to estimate. The code for creating and displaying the design matrix for an assumed HRF length $H=16$ is below:

%% ESTIMATE HRF USING FIR BASIS SET
% CREATE FIR DESIGN MATRIX
hrfLen = 16;  % WE ASSUME HRF IS 16 TRS LONG

% BASIS SET FOR EACH CONDITION IS A TRAIN OF IMPULSES
X_FIR = zeros(nTRs,hrfLen*nCond);

for iC = 1:nCond
onsets = find(D(:,iC));
idxCols = (iC-1)*hrfLen+1:iC*hrfLen;
for jO = 1:numel(onsets)
idxRows = onsets(jO):onsets(jO)+hrfLen-1;
for kR = 1:numel(idxRows)
X_FIR(idxRows(kR),idxCols(kR)) = 1;
end
end
end

% DISPLAY
figure;
subplot(121);
imagesc(D);
colormap gray;
set(gca,'XTickLabel',{'Light','Tone','Som.'})
title('Stimulus Train');

subplot(122);
imagesc(X_FIR);
colormap gray;
title('FIR Design Matrix');
set(gca,'XTick',[8,24,40])
set(gca,'XTickLabel',{'Light','Tone','Som.'})
set(gcf,'Position',[100,100,550,400])


Left: The stimulus onset matrix (size = [T x 3]).
Right the corresponding Design Matrix (size = [T x 3*H])

In the right panel of the plot above we see the form of the FIR design matrix $X_{FIR}$ for the stimulus onsets on the left. For each voxel, we want to determine the weight on each column of $X_{FIR}$ that best explains the BOLD signals $y$ measured from that voxel. We can frame this problem in terms of a General Linear Model:

$y = X_{FIR}\beta_{FIR} + \epsilon$

Where $\beta_{FIR}$ are the weights on each column of the FIR design matrix. If we set the values of $\beta_{FIR}$ so as to minimize the sum of squared errors (SSE) between the model above and the measured responses

$SSE = \sum_{i=1}^N(y^{(i)} - X_{FIR}^{(i)}\beta_{FIR})^2$,

then we can use the Ordinary Least Squares (OLS) solution discussed earlier to solve for $\beta_{FIR}$. Specifically, we solve for the weights as:

$\hat \beta_{FIR} = (X_{FIR}^T X_{FIR})^{-1} X_{FIR}^T y$

Once determined, the resulting $[CH \times V]$ matrix of weights $\hat \beta_{FIR}$ has the HRF of each of the $V=4$ different voxels to each stimulus condition along its columns. The first $H$ (1-16) of the weights along a column define the HRF to the first stimulus (the light). The second $H$ (17-32) weights along a column determine the HRF to the second stimulus (the tone), etc… Below we parse out these weights and display the resulting HRFs for each voxel:

% ESTIMATE HRF FOR EACH CONDITION AND VOXEL
betaHatFIR = pinv(X_FIR'*X_FIR)*X_FIR'*y;

% RESHAPE HRFS
hHatFIR = reshape(betaHatFIR,hrfLen,nCond,nVox);

% DISPLAY
figure
cols = lines(4);
names = {'Visual','Auditory','Somat.','Unselective'};
for iV = 1:nVox
subplot(2,2,iV)
hold on;
for jC = 1:nCond
hl = plot(1:hrfLen,hHatFIR(:,jC,iV),'Linewidth',2);
set(hl,'Color',cols(jC,:))
end
hl = plot(1:numel(h),h,'Linewidth',2);
xlabel('TR')
legend({'Light','Tone','Heat','True HRF'})
set(hl,'Color','k')
xlim([0 hrfLen])
grid on
axis tight
title(sprintf('%s Voxel',names{iV}));
end
set(gcf,'Position',[100,100,880,500])


HRF estimates for each voxel to each of the 3 stimuli. Black plots show the shape of the true HRF.

Here we see that the estimated HRFs accurately capture both the shape of the HRF and the selectivity of each voxel. For instance, the HRFs estimated from the responses of the first voxel indicate strong tuning for the light stimulus: the HRF estimated for the light has an amplitude approximately 4 times that of the true HRF, which corresponds to the actual tuning of the voxel (compare this to the value of $\beta(1,1)$). Additionally, the time to the maximum value (time-to-peak) of the HRF to the light is the same as for the true HRF. The first voxel’s HRFs estimated for the other stimuli are essentially noise around baseline, which (correctly) indicates that the first voxel has no selectivity for those stimuli. Inspection of the remaining estimated HRFs shows that accurate tuning and HRF shape are recovered for the other three voxels as well.

## Wrapping Up

In this post we discussed how to apply a simple basis function model (the FIR basis) to estimate the HRF profile and get an idea of the tuning of individual voxels. Though the FIR basis model can accurately model any HRF shape, it is often too flexible. In scenarios where voxel signals are very noisy, the FIR basis model will tend to model the noise.

Additionally, the FIR basis set needs to incorporate a basis function for each time point of the HRF. In the example above we assumed the HRF was 16 TRs long, so the FIR basis had 16 tunable weights per condition. This leads to a GLM with 48 ($C \times H = 3 \times 16$) tunable parameters. For experiments with many stimulus conditions, the number of parameters grows quickly (as $HC$). If the number of parameters is comparable to (or larger than) the number of BOLD signal measurements, it will be difficult to accurately estimate $\hat \beta_{FIR}$. As we’ll see in later posts, we can often improve upon the FIR basis set by using more clever basis functions.

Another important but indirect issue that affects estimation of the HRF is the experimental design, or rather the schedule used to present the stimuli. In the example above, the stimuli were presented in a random, non-overlapping order. What if the stimuli were presented in the same order every time, with some set frequency? We’ll discuss in a later post the concept of design efficiency and how it affects our ability to characterize the shape of the HRF and, consequently, voxel selectivity.

## fMRI in Neuroscience: Estimating Voxel Selectivity & the General Linear Model (GLM)

In a typical fMRI experiment a series of stimuli are presented to an observer and the evoked brain activity–in the form of blood-oxygen-level-dependent (BOLD) signals–is measured from tiny chunks of the brain called voxels. The task of the researcher is then to infer the tuning of the voxels to features of the presented stimuli based on the evoked BOLD signals. In order to make this inference quantitatively, it is necessary to have a model of how BOLD signals are evoked in the presence of stimuli. In this post we’ll develop a model of evoked BOLD signals, and use this model to recover the tuning of individual voxels measured during an fMRI experiment.

## Modeling the Evoked BOLD Signals — The Stimulus and Design Matrices

Suppose we are running an event-related fMRI experiment in which we present $C$ different stimulus conditions to an observer while recording the BOLD signals evoked in their brain over a series of $T$ consecutive fMRI measurements (TRs). We can represent the stimulus presentation quantitatively with a $T \times C$ binary Stimulus Matrix, $D$, whose entries indicate the onset of each stimulus condition (columns) at each point in time (rows). Now let’s assume that we have an accurate model of how a voxel is activated by a single, very short stimulus. This activation model is called the hemodynamic response function (HRF), $h$, of the voxel and, as we’ll discuss in a later post, can be estimated from the measured BOLD signals. Let’s also assume for now that the voxel is activated equally by all stimuli. In this scenario we can represent the BOLD signal evoked over the entire experiment with another $T \times C$ matrix $X$, called the Design Matrix, that is the convolution of the stimulus matrix $D$ with the voxel’s HRF $h$.

$X = D * h$

Note that this model of the BOLD signal is an example of the Finite Impulse Response (FIR) model that was introduced in the previous post on fMRI Basics.

To make the concepts of $D$ and $X$ more concrete, let’s say our experiment consists of $C = 3$ different stimulus conditions: a light, a tone, and heat applied to the palm. Each stimulus condition is presented twice in a staggered manner during 80 TRs of fMRI measurements. The stimulus matrix and the design matrix are simulated here in Matlab:

TR = 1; % REPETITION TIME
t = 1:TR:20; % MEASUREMENTS
h = gampdf(t,6) + -.5*gampdf(t,10); % HRF MODEL
h = h/max(h); % SCALE HRF TO HAVE MAX AMPLITUDE OF 1

trPerStim = 30; % # TR PER STIMULUS
nRepeat = 2; % # OF STIMULUS REPEATES
nTRs = trPerStim*nRepeat + length(h);
impulseTrain0 = zeros(1,nTRs);

% VISUAL STIMULUS
impulseTrainLight = impulseTrain0;
impulseTrainLight(1:trPerStim:trPerStim*nRepeat) = 1;

% AUDITORY STIMULUS
impulseTrainTone = impulseTrain0;
impulseTrainTone(5:trPerStim:trPerStim*nRepeat) = 1;

% SOMATOSENSORY STIMULUS
impulseTrainHeat = impulseTrain0;
impulseTrainHeat(9:trPerStim:trPerStim*nRepeat) = 1;

% COMBINATION OF ALL STIMULI
impulseTrainAll = impulseTrainLight + impulseTrainTone + impulseTrainHeat;

% SIMULATE VOXELS WITH VARIOUS SELECTIVITIES
visualTuning = [4 0 0]; % VISUAL VOXEL TUNING
auditoryTuning = [0 2 0]; % AUDITORY VOXEL TUNING
somatoTuning = [0 0 3]; % SOMATOSENSORY VOXEL TUNING
noTuning = [1 1 1]; % NON-SELECTIVE

beta = [visualTuning', ...
auditoryTuning', ...
somatoTuning', ...
noTuning'];

% EXPERIMENT DESIGN / STIMULUS SEQUENCE
D = [impulseTrainLight',impulseTrainTone',impulseTrainHeat'];

% CREATE DESIGN MATRIX FOR THE THREE STIMULI
X = conv2(D,h'); % X = D * h
X(nTRs+1:end,:) = []; % REMOVE EXCESS FROM CONVOLUTION

% DISPLAY STIMULUS AND DESIGN MATRICES
subplot(121); imagesc(D); colormap gray;
xlabel('Stimulus Condition')
ylabel('Time (TRs)');
title('Stimulus Train, D');
set(gca,'XTick',1:3); set(gca,'XTickLabel',{'Light','Tone','Heat'});

subplot(122);
imagesc(X);
xlabel('Stimulus Condition')
ylabel('Time (TRs)');
title('Design Matrix, X = D * h')
set(gca,'XTick',1:3); set(gca,'XTickLabel',{'Light','Tone','Heat'});



Stimulus presentation matrix, D (left) and the Design Matrix X for an experiment with three stimulus conditions: a light, a tone, and heat applied to the palm

Each column of the design matrix above (the right subpanel in the above figure) is essentially a model of the BOLD signal evoked independently by each stimulus condition, and the total signal is simply a sum of these independent signals.
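The additivity just described is a direct consequence of the linearity of convolution, and can be verified numerically. The Python/NumPy sketch below (with hypothetical onset times and a toy gamma-shaped HRF standing in for the MATLAB variables) checks that convolving the summed onset trains with the HRF gives the same signal as summing the per-condition columns of $X$:

```python
import numpy as np

n_trs = 80
t = np.arange(1, 21, dtype=float)
hrf = t**5 * np.exp(-t) / 120.0   # toy gamma-shaped HRF

# Stimulus matrix D: three conditions, two staggered onsets each
D = np.zeros((n_trs, 3))
for c, onset in enumerate([0, 4, 8]):
    D[[onset, onset + 30], c] = 1.0

# Design matrix X: column-wise convolution of D with the HRF,
# as in X = conv2(D, h') in the MATLAB code
X = np.stack([np.convolve(D[:, c], hrf)[:n_trs] for c in range(3)], axis=1)

# Linearity: the response to all stimuli at once equals the sum of the
# independent per-condition responses (the columns of X)
y_all = np.convolve(D.sum(axis=1), hrf)[:n_trs]
linear = np.allclose(y_all, X.sum(axis=1))
```

This linearity is what lets the GLM treat the total BOLD signal as a weighted sum of per-condition regressors.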

## Modeling Voxel Tuning — The Selectivity Matrix

In order to develop the concept of the design matrix we assumed that our theoretical voxel is equally tuned to all stimuli. However, few voxels in the brain exhibit such non-selective tuning. For instance, a voxel located in visual cortex will be more selective for the light than for the tone or the heat stimulus. A voxel in auditory cortex will be more selective for the tone than for the other two stimuli. A voxel in somatosensory cortex will likely be more selective for the heat than for the visual or auditory stimuli. How can we represent the tuning of these different voxels?

A simple way to model tuning to the stimulus conditions in an experiment is to multiply each column of the design matrix by a weight that modulates the BOLD signal according to the presence of the corresponding stimulus condition. For example, we could model a visual cortex voxel by weighting the first column of $X$ with a positive value, and the remaining two columns with much smaller values (or even negative values, to model suppression). It turns out that we can model the selectivity of $V$ individual voxels simultaneously through a $C \times V$ Selectivity Matrix $\beta$. Each entry in $\beta$ is the amount that the $v$-th voxel (columns) is tuned to the $c$-th stimulus condition (rows). Given the design matrix and the selectivity matrix, we can then predict the BOLD signals $y$ of selectively-tuned voxels with a simple matrix multiplication:

$y = X\beta$

Keeping with our example experiment, let’s assume that we are modeling the selectivity of four different voxels: a strongly-tuned visual voxel, a moderately-tuned somatosensory voxel, a weakly tuned auditory voxel, and an unselective voxel that is very weakly tuned to all three stimulus conditions. We can represent the tuning of these four voxels with a $3 \times 4$ selectivity matrix. Below we define a selectivity matrix that represents the tuning of these 4 theoretical voxels and simulate the evoked BOLD signals to our 3-stimulus experiment.

% SIMULATE NOISELESS VOXELS' BOLD SIGNAL
% (ASSUMING VARIABLES FROM ABOVE STILL IN WORKSPACE)
y0 = X*beta;

figure;
subplot(211);
imagesc(beta); colormap hot;
axis tight
ylabel('Condition')
set(gca,'YTickLabel',{'Visual','Auditory','Somato.'})
xlabel('Voxel');
set(gca,'XTick',1:4)
title('Voxel Selectivity, \beta')

subplot(212);
plot(y0,'Linewidth',2);
legend({'Visual Voxel','Auditory Voxel','Somato. Voxel','Unselective'});
xlabel('Time (TRs)'); ylabel('BOLD Signal');
title('Activity for Voxels with Different Stimulus Tuning')
set(gcf,'Position',[100 100 750 540])
subplot(211); colorbar


Selectivity matrix (top) for four theoretical voxels and GLM BOLD signals (bottom) for a simple experiment

The top subpanel in the simulation output visualizes the selectivity matrix defined for the four theoretical voxels. The bottom subpanel plots the columns of the $T \times V$ matrix of voxel responses $y$. We see that the maximum response of the strongly-tuned visual voxel (plotted in blue) is larger than that of the other voxels, corresponding to the large weight in the upper left of the selectivity matrix. Also note that the response of the unselective voxel (plotted in cyan) demonstrates the linearity property of the FIR model: the attenuated but complex BOLD signal from the unselective voxel results from the sum of small independent signals evoked by each stimulus.

## Modeling Voxel Noise

The example above demonstrates how we can model BOLD signals evoked in noiseless theoretical voxels. Though this noiseless scenario is helpful for developing a modeling framework, real-world voxels exhibit variable amounts of noise (noise being any signal that cannot be accounted for by the FIR model). Therefore we need to incorporate a noise term into our BOLD signal model.

The noise in a voxel is often modeled as a random variable $\epsilon$. A common choice for the noise model is a zero-mean Normal/Gaussian distribution with some variance $\sigma^2$:

$\epsilon \sim \mathcal N(0,\sigma^2)$

Though the variance of the noise model may not be known a priori, there are methods for estimating it from data. We’ll get to estimating noise variance in a later post, when we discuss various sources of noise and how to account for them using more advanced techniques. For simplicity, let’s just assume for now that the noise variance is 1.
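As a quick illustration (a Python/NumPy sketch; the post’s own simulations are in MATLAB), noise drawn from this model has an empirical mean near zero and variance near $\sigma^2$:

```python
import numpy as np

# Draw additive voxel noise from a zero-mean Gaussian with variance
# sigma^2 = 1 and check its empirical statistics.
rng = np.random.default_rng(0)
sigma2 = 1.0                       # assumed noise variance
eps = rng.normal(0.0, np.sqrt(sigma2), size=100_000)

print(round(eps.mean(), 2))        # close to 0
print(round(eps.var(), 2))         # close to sigma^2 = 1
```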

## Putting It All Together — The General Linear Model (GLM)

So far we have introduced the concepts of the stimulus matrix, the HRF, the design matrix, the selectivity matrix, and the noise model. We can combine all of these to compose a comprehensive quantitative model of BOLD signals measured from a set of voxels during an experiment:

$y = X\beta + \epsilon \\ = (D * h)\beta + \epsilon$

This is referred to as the General Linear Model (GLM).

In a typical fMRI experiment the researcher controls the stimulus presentation $D$, and measures the evoked BOLD responses $y$ from a set of voxels. The problem then is to estimate the selectivities of the voxels based on these measurements. Specifically, we want to determine the parameters $\hat \beta$ that best explain the measured BOLD signals during our experiment. The most common way to do this is a method known as Ordinary Least Squares (OLS) Regression. Using OLS, the idea is to adjust the values of $\hat \beta$ such that the predicted model BOLD signals are as similar to the measured signals as possible. In other words, the goal is to infer the selectivity each voxel would have to exhibit in order to produce the measured BOLD signals. I showed in an earlier post that the optimal OLS solution for the selectivities $\hat \beta$ is given by:

$\hat \beta = (X^T X)^{-1} X^T y$

Therefore, given a design matrix $X$ and a set of voxel responses $y$ associated with the design matrix, we can calculate the selectivities of voxels to the stimulus conditions represented by the columns of the design matrix. This works even when the BOLD signals are noisy. To get a better idea of this process at work let’s look at a quick example based on our toy fMRI experiment.
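The estimator can be sanity-checked on synthetic data. The sketch below (Python/NumPy rather than the post’s MATLAB, with toy sizes of my own choosing and a random stand-in for the convolved design matrix) generates noisy responses and recovers $\beta$:

```python
import numpy as np

# OLS sanity check: beta_hat = (X^T X)^{-1} X^T y recovers the true
# selectivity matrix from noisy synthetic responses.
rng = np.random.default_rng(1)
T, C, V = 200, 3, 4                 # time points, conditions, voxels (assumed)
X = rng.normal(size=(T, C))         # stand-in design matrix
beta = rng.normal(size=(C, V))      # true selectivity matrix
y = X @ beta + 0.1 * rng.normal(size=(T, V))   # noisy BOLD signals

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # OLS, avoiding an explicit inverse
print(np.allclose(beta_hat, beta, atol=0.1))   # recovered up to noise
```

Solving the normal equations with `solve` rather than forming $(X^TX)^{-1}$ explicitly is the numerically preferred route.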

## Example: Recovering Voxel Selectivity Using OLS

Here the goal is to recover the selectivities of the four voxels in our toy experiment after their responses have been corrupted with noise. First, we add noise to the voxel responses. In this example the level of the added noise is based on a concept known as the signal-to-noise ratio, or SNR. As the name suggests, SNR is the ratio of the underlying signal to the noise “on top of” the signal. SNR is a very important concept when interpreting fMRI analyses. If a voxel exhibits a low SNR, it will be far more difficult to estimate its tuning. Though there are many ways to define SNR, in this example it is defined as the ratio of the maximum signal amplitude to the standard deviation of the noise model. The noise standard deviation is set to one-fifth of the maximum amplitude of the BOLD signal, i.e. an SNR of 5. Feel free to try different values of SNR by changing the value of the variable $\text{SNR}$ in the Matlab simulation. Noisy versions of the 4 model BOLD signals are plotted in the top subpanel of the figure below. We see that the noisy signals are very different from the actual underlying BOLD signals.

Noisy BOLD signals from 4 voxels (top) and GLM predictions (bottom) of the underlying BOLD signals

Here we estimate the selectivities $\hat \beta$ from the GLM using OLS, and then predict the BOLD signals in our experiment from this estimate. We see in the bottom subpanel of the above figure that the resulting GLM predictions are quite accurate. We also compare the estimated selectivity matrix $\hat \beta$ to the actual selectivity matrix $\beta$ below. We see that OLS is able to recover the selectivity of all the voxels.

Actual (top) and estimated (bottom) selectivity matrices.

% SIMULATE NOISY VOXELS & ESTIMATE TUNING
% (ASSUMING VARIABLES FROM ABOVE STILL IN WORKSPACE)

SNR = 5; % (APPROX.) SIGNAL-TO-NOISE RATIO
noiseSTD = max(y0(:))./SNR; % NOISE STANDARD DEVIATION
noise = bsxfun(@times,randn(size(y0)),noiseSTD);
y = y0 + noise;

betaHat = (X'*X)\(X'*y) % OLS (BACKSLASH AVOIDS EXPLICIT INVERSE)
yHat = X*betaHat; % GLM PREDICTION

figure
subplot(211);
plot(y,'Linewidth',3);
xlabel('Time (TRs)'); ylabel('BOLD Signal');
legend({'Visual Voxel','Auditory Voxel','Somato. Voxel','Unselective'});
title('Noisy Voxel Responses');

subplot(212)
h1 = plot(y0,'Linewidth',3); hold on
h2 = plot(yHat,'-o');
legend([h1(end),h2(end)],{'Actual Responses','Predicted Responses'})
xlabel('Time (TRs)'); ylabel('BOLD Signal');
title('Model Predictions')
set(gcf,'Position',[100 100 750 540])

figure
subplot(211);
imagesc(beta); colormap(hot(5));
axis tight
ylabel('Condition')
set(gca,'YTickLabel',{'Visual','Auditory','Somato.'})
xlabel('Voxel');
set(gca,'XTick',1:4)
title('Actual Selectivity, \beta')

subplot(212)
imagesc(betaHat); colormap(hot(5));
axis tight
ylabel('Condition')
set(gca,'YTickLabel',{'Visual','Auditory','Somato.'})
xlabel('Voxel');
set(gca,'XTick',1:4)
title('Noisy Estimated Selectivity')
drawnow


## Wrapping Up

Here we introduced the GLM commonly used for fMRI data analyses and used the GLM framework to recover the selectivities of simulated voxels. We saw that the GLM is quite powerful at recovering voxel selectivity in the presence of noise. However, there are a few details left out of the story.

First, we assumed that we had an accurate (in fact, exact) model for each voxel’s HRF. This is generally not the case. In real-world scenarios the HRF is either assumed to have some canonical shape, or the shape of the HRF is estimated from the experimental data. Though assuming a canonical HRF shape has been validated for block design studies of peripheral sensory areas, this assumption becomes dangerous when using event-related designs, or when studying other areas of the brain.

Additionally, we did not include any physiological noise signals in our theoretical voxels. In real voxels, BOLD signal changes due to physiological processes such as breathing and heartbeat can be far larger than the signal change due to underlying neural activation. It then becomes necessary to either account for these nuisance signals in the GLM framework, or remove them before using the model described above. In two upcoming posts we’ll discuss these two issues: estimating the HRF shape from data, and dealing with nuisance signals.

## MCMC: Hamiltonian Monte Carlo (a.k.a. Hybrid Monte Carlo)

The random-walk behavior of many Markov Chain Monte Carlo (MCMC) algorithms makes Markov chain convergence to a target stationary distribution $p(x)$ inefficient, resulting in slow mixing. Hamiltonian/Hybrid Monte Carlo (HMC) is an MCMC method that adopts physical system dynamics rather than a probability distribution to propose future states in the Markov chain. This allows the Markov chain to explore the target distribution much more efficiently, resulting in faster convergence. Here we introduce basic analytic and numerical concepts for the simulation of Hamiltonian dynamics. We then show how Hamiltonian dynamics can be used as the Markov chain proposal function for an MCMC sampling algorithm (HMC).

## First off, a brief physics lesson in Hamiltonian dynamics

Before we can develop Hamiltonian Monte Carlo, we need to become familiar with the concept of Hamiltonian dynamics. Hamiltonian dynamics is one way that physicists describe how objects move throughout a system. Hamiltonian dynamics describe an object’s motion in terms of its location $\bold x$ and momentum $\bold p$ (equivalent to the object’s mass times its velocity) at some time $t$. For each location the object takes, there is an associated potential energy $U(\bold x)$, and for each momentum there is an associated kinetic energy $K(\bold p)$. The total energy of the system is constant and known as the Hamiltonian $H(\bold x, \bold p)$, defined simply as the sum of the potential and kinetic energies:

$H(\bold x,\bold p) = U(\bold x) + K(\bold p)$

Hamiltonian dynamics describe how kinetic energy is converted to potential energy (and vice versa) as an object moves throughout a system in time. This description is implemented quantitatively via a set of differential equations known as the Hamiltonian equations:

$\frac{\partial x_i}{\partial t} = \frac{\partial H}{\partial p_i} = \frac{\partial K(\bold p)}{\partial p_i}$
$\frac{\partial p_i}{\partial t} = -\frac{\partial H}{\partial x_i} = - \frac{\partial U(\bold x)}{\partial x_i}$

Therefore, if we have expressions for $\frac{\partial U(\bold x)}{\partial x_i}$ and $\frac{\partial K(\bold p)}{\partial p_i}$ and a set of initial conditions (i.e. an initial position $\bold x_0$ and initial momentum $\bold p_0$ at time $t_0$), it is possible to predict the location and momentum of an object at any point in time $t = t_0 + T$ by simulating these dynamics for a duration $T$.
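For intuition, in the 1D case with $K(p) = p^2/2$ and $U(x) = x^2/2$ (the oscillator used later in this post), the Hamiltonian equations reduce to $dx/dt = p$ and $dp/dt = -x$. A quick numerical check (a Python/NumPy sketch) confirms that the exact solution of these equations conserves $H$:

```python
import numpy as np

# Exact solution of dx/dt = p, dp/dt = -x with x(0) = 1, p(0) = 0 is
# x(t) = cos(t), p(t) = -sin(t); the Hamiltonian stays constant along it.
t = np.linspace(0.0, 10.0, 1000)
x, p = np.cos(t), -np.sin(t)
H = x**2 / 2 + p**2 / 2            # U(x) + K(p)
print(np.allclose(H, 0.5))         # total energy is conserved
```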

## Simulating Hamiltonian dynamics — the Leap Frog Method

The Hamiltonian equations describe an object’s motion in time, which is a continuous variable. In order to simulate Hamiltonian dynamics numerically on a computer, it is necessary to approximate the Hamiltonian equations by discretizing time. This is done by splitting the interval $T$ up into a series of smaller intervals of length $\delta$. The smaller the value of $\delta$, the closer the approximation is to the dynamics in continuous time. A number of procedures have been developed for discretizing time, including Euler’s method and the Leap Frog method, which I will introduce briefly in the context of Hamiltonian dynamics. The Leap Frog method updates the momentum and position variables sequentially: it starts by simulating the momentum dynamics over a small interval of time $\delta /2$, then simulates the position dynamics over a full interval $\delta$, then completes the momentum simulation over another small interval $\delta /2$, so that $\bold x$ and $\bold p$ now exist at the same point in time. Specifically, the Leap Frog method is as follows:

1. Take a half step in time to update the momentum variable:

$p_i(t + \delta/2) = p_i(t) - (\delta /2)\frac{\partial U}{\partial x_i(t)}$

2. Take a full step in time to update the position variable

$x_i(t + \delta) = x_i(t) + \delta \frac{\partial K}{\partial p_i(t + \delta/2)}$

3. Take the remaining half step in time to finish updating the momentum variable

$p_i(t + \delta) = p_i(t + \delta/2) - (\delta/2) \frac{\partial U}{\partial x_i(t+\delta)}$

The Leap Frog method can be run for $L$ steps to simulate dynamics over $L \times \delta$ units of time. This particular discretization method has a number of properties that make it preferable to other approximation methods like Euler’s method, particularly for use in MCMC, but discussion of these properties is beyond the scope of this post. Let’s see how we can use the Leap Frog method to simulate Hamiltonian dynamics in a simple 1D example.
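The three steps above can be sketched as a small function (Python/NumPy here, rather than the post’s MATLAB; the names `leapfrog`, `dU`, and `dK` are my own) and checked on the oscillator from the next example:

```python
import numpy as np

def leapfrog(x, p, dU, dK, delta, L):
    """Run L Leap Frog steps of size delta (the three-step update above)."""
    p = p - (delta / 2) * dU(x)          # 1. half step for momentum
    for _ in range(L - 1):
        x = x + delta * dK(p)            # 2. full step for position
        p = p - delta * dU(p * 0 + x)    #    full step for momentum
    x = x + delta * dK(p)                # last full position step
    p = p - (delta / 2) * dU(x)          # 3. final half step for momentum
    return x, p

# Oscillator: U(x) = x^2/2, K(p) = p^2/2, so dU(x) = x and dK(p) = p.
x0, p0 = -4.0, 1.0
xT, pT = leapfrog(x0, p0, dU=lambda x: x, dK=lambda p: p, delta=0.1, L=70)
H0 = x0**2 / 2 + p0**2 / 2
HT = xT**2 / 2 + pT**2 / 2
print(abs(HT - H0) < 0.1)                # energy approximately conserved
```

The small residual energy error is the “energy drift” discussed below; it shrinks as $\delta$ decreases.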

### Example 1: Simulating Hamiltonian dynamics of a harmonic oscillator

Imagine a ball with mass equal to one is attached to a horizontally-oriented spring. The spring exerts a force on the ball equal to

$F = -kx$

which works to restore the ball’s position to the equilibrium position of the spring at $x = 0$. Let’s assume that the spring constant $k$, which defines the strength of the restoring force, is also equal to one. If the ball is displaced by some distance $x$ from equilibrium, then the potential energy is

$U(x) = -\int F \, dx = \int x \, dx = \frac{x^2}{2}$

In addition, the kinetic energy of an object with mass $m$ moving with velocity $v$ within a linear system is known to be

$K(v) = \frac{(mv)^2}{2m} = \frac{v^2}{2} = \frac{p^2}{2} = K(p)$,

if the object’s mass is equal to one, like the ball in this example. Notice that we now have in hand the expressions for both $U(x)$ and $K(p)$. In order to simulate the Hamiltonian dynamics of the system using the Leap Frog method, we also need expressions for the partial derivatives of each variable (in this 1D example there is only one for each variable):

$\frac{\partial U(x)}{\partial x} = x$

$\frac{\partial K(p)}{\partial p} = p$

Therefore one iteration of the Leap Frog algorithm for simulating Hamiltonian dynamics in this system is:

1. $p(t + \delta/2) = p(t) - (\delta/2)x(t)$
2. $x(t + \delta) = x(t) + \delta \, p(t + \delta/2)$
3. $p(t + \delta) = p(t + \delta /2) - (\delta/2)x(t + \delta)$

We simulate the dynamics of the spring-mass system described above using the Leap Frog method in Matlab below (if the graph is not animated, try clicking on it to open up the linked .gif). The left bar in the bottom left subpanel of the simulation output demonstrates the trade-off between potential and kinetic energy described by Hamiltonian dynamics. The cyan portion of the bar is the proportion of the Hamiltonian contributed by the potential energy $U(x)$, and the yellow portion represents the contribution of the kinetic energy $K(p)$. The right bar (all yellow) is the total value of the Hamiltonian $H(x,p)$. Here we see that the ball oscillates about the equilibrium position of the spring with a constant period/frequency. As the ball passes the equilibrium position $x = 0$, it has minimum potential energy and maximum kinetic energy. At the extremes of the ball’s trajectory, the potential energy is at a maximum, while the kinetic energy is minimized. The progression of momentum and position maps out a trajectory in what is referred to as phase space, which is displayed in the bottom right subpanel of the output. The harmonic oscillator maps out an ellipse in phase space. The size of the ellipse depends on the energy of the system defined by the initial conditions.

Simple example of Hamiltonian Dynamics: 1D Harmonic Oscillator (Click to see animated)

You may also notice that the value of the Hamiltonian $H$ is not exactly constant in the simulation, but oscillates slightly. This is an artifact known as energy drift, due to the approximations used to discretize time.

% EXAMPLE 1: SIMULATING HAMILTONIAN DYNAMICS
%            OF HARMONIC OSCILLATOR
% STEP SIZE
delta = 0.1;

% # LEAP FROG
L = 70;

% DEFINE KINETIC ENERGY FUNCTION
K = @(p) p.^2/2;

% DEFINE POTENTIAL ENERGY FUNCTION FOR SPRING (k = 1)
U = @(x) x.^2/2;

% DEFINE GRADIENT OF POTENTIAL ENERGY
dU = @(x) x;

% INITIAL CONDITIONS
x0 = -4; % POSITION
p0 = 1;  % MOMENTUM
figure

%% SIMULATE HAMILTONIAN DYNAMICS WITH LEAPFROG METHOD
% FIRST HALF STEP FOR MOMENTUM
pStep = p0 - delta/2*dU(x0)';

% FIRST FULL STEP FOR POSITION/SAMPLE
xStep = x0 + delta*pStep;

% FULL STEPS
for jL = 1:L-1
% UPDATE MOMENTUM
pStep = pStep - delta*dU(xStep);

% UPDATE POSITION
xStep = xStep + delta*pStep;

% UPDATE DISPLAYS
subplot(211), cla
hold on;
xx = linspace(-6,xStep,1000);
plot(xx,sin(6*linspace(0,2*pi,1000)),'k-');
plot(xStep+.5,0,'bo','Linewidth',20)
xlim([-6 6]);ylim([-1 1])
hold off;
title('Harmonic Oscillator')
subplot(223), cla
b = bar([U(xStep),K(pStep);0,U(xStep)+K(pStep)],'stacked');
set(gca,'xTickLabel',{'U+K','H'})
ylim([0 10]);
title('Energy')
subplot(224);
plot(xStep,pStep,'ko','Linewidth',20);
xlim([-6 6]); ylim([-6 6]); axis square
xlabel('x'); ylabel('p');
title('Phase Space')
pause(.1)
end
% (LAST HALF STEP FOR MOMENTUM)
pStep = pStep - delta/2*dU(xStep);


## Hamiltonian dynamics and the target distribution $p(\bold x)$

Now that we have a better understanding of what Hamiltonian dynamics are and how they can be simulated, let’s discuss how we can use them for MCMC. The main idea behind Hamiltonian/Hybrid Monte Carlo is to develop a Hamiltonian function $H(\bold x, \bold p)$ such that the resulting Hamiltonian dynamics allow us to efficiently explore some target distribution $p(\bold x)$. How can we choose such a Hamiltonian function? It turns out it is pretty simple to relate $H(\bold x, \bold p)$ to $p(\bold x)$ using a basic concept adopted from statistical mechanics known as the canonical distribution. For any energy function $E(\bold\theta)$ over a set of variables $\bold\theta$, we can define the corresponding canonical distribution as

$p(\theta) = \frac{1}{Z}e^{-E(\bold\theta)}$

where we simply take the exponential of the negative of the energy function. The variable $Z$ is a normalizing constant called the partition function that scales the canonical distribution such that it sums to one, creating a valid probability distribution. Don’t worry about $Z$; it isn’t really important because, as you may recall from an earlier post, MCMC methods can sample from unscaled probability distributions. Now, as we saw above, the energy function for Hamiltonian dynamics is a combination of potential and kinetic energies:

$E(\theta) = H(\bold x,\bold p) = U(\bold x) + K(\bold p)$

Therefore the canoncial distribution for the Hamiltonian dynamics energy function is

$p(\bold x,\bold p) \propto e^{-H(\bold x,\bold p)} \\ = e^{-[U(\bold x) + K(\bold p)]} \\ = e^{-U(\bold x)}e^{-K(\bold p)} \\ \propto p(\bold x)p(\bold p)$

Here we see that the joint (canonical) distribution for $\bold x$ and $\bold p$ factorizes. This means that the two variables are independent, and the canonical distribution $p(\bold x)$ is independent of the analogous distribution for the momentum. Therefore, as we’ll see shortly, we can use Hamiltonian dynamics to sample from the joint canonical distribution over $\bold p$ and $\bold x$ and simply ignore the momentum contributions. Note that this is an example of introducing auxiliary variables to facilitate the Markov chain path: introducing the auxiliary variable $\bold p$ allows us to use Hamiltonian dynamics, which are unavailable without it. Because the canonical distribution for $\bold x$ is independent of the canonical distribution for $\bold p$, we can choose any distribution from which to sample the momentum variables. A common choice is a zero-mean Normal distribution with unit variance:

$p(\bold p) \propto e^{-\frac{\bold{p^Tp}}{2}}$

Note that this is equivalent to having a quadratic kinetic energy term in the Hamiltonian:

$K(\bold p) = \frac{\bold{p^Tp}}{2}$

Recall that this is the exact quadratic kinetic energy function (albeit in 1D) used in the harmonic oscillator example above. This is a convenient choice for the kinetic energy function, as all its partial derivatives are easy to compute. Now that we have defined a kinetic energy function, all we have to do is find a potential energy function $U(\bold x)$ that, when negated and run through the exponential function, gives the target distribution $p(\bold x)$ (or an unscaled version of it). Another way of thinking of it is that we can define the potential energy function as

$U(\bold x) = -\log p(\bold x)$.

If we can calculate $-\frac{\partial \log(p(\bold x)) }{\partial x_i}$, then we’re in business and we can simulate Hamiltonian dynamics that can be used in an MCMC technique.
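As a quick sanity check (a Python/NumPy sketch rather than MATLAB): for a 1D standard Normal target, $-\log p(x)$ is exactly $x^2/2$ plus a constant, i.e. the quadratic potential of the oscillator example:

```python
import numpy as np

# For a 1D standard Normal, -log p(x) = x^2/2 + log(sqrt(2*pi));
# the additive constant cancels in the Metropolis ratio.
xs = np.array([0.0, 1.0, 2.0])
pdf = np.exp(-xs**2 / 2) / np.sqrt(2 * np.pi)
U = xs**2 / 2 + 0.5 * np.log(2 * np.pi)
print(np.allclose(U, -np.log(pdf)))
```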

## Hamiltonian Monte Carlo

In HMC we use Hamiltonian dynamics as the proposal function for a Markov chain in order to explore the target (canonical) density $p(\bold x)$ defined by $U(\bold x)$ more efficiently than is possible using a proposal probability distribution. Starting at an initial state $[\bold x_0, \bold p_0]$, we simulate Hamiltonian dynamics for a short time using the Leap Frog method. We then use the state of the position and momentum variables at the end of the simulation as our proposed state variables $\bold x^*$ and $\bold p^*$. The proposed state is accepted using an update rule analogous to the Metropolis acceptance criterion. Specifically, if the probability of the proposed state after the Hamiltonian dynamics

$p(\bold x^*, \bold p^*) \propto e^{-[U(\bold x^*) + K(\bold p^*)]}$

is greater than the probability of the state prior to the Hamiltonian dynamics

$p(\bold x_0,\bold p_0) \propto e^{-[U(\bold x_0) + K(\bold p_0)]}$

then the proposed state is always accepted; otherwise, it is accepted with probability equal to the ratio of the two probabilities. If the state is rejected, the next state of the Markov chain is set to the state at $(t-1)$. For a given set of initial conditions, Hamiltonian dynamics will follow contours of constant energy in phase space (analogous to the ellipse traced out in phase space in the example above). Therefore we must randomly perturb the dynamics so as to explore all of $p(\bold x)$. This is done by simply drawing a random momentum from the corresponding canonical distribution $p(\bold p)$ before running the dynamics at each sampling iteration $t$. Combining these steps (sampling a random momentum, simulating Hamiltonian dynamics, and applying the Metropolis acceptance criterion) defines the HMC algorithm for drawing $M$ samples from a target distribution:

1. set $t = 0$
2. generate an initial position state $\bold x^{(0)} \sim \pi^{(0)}$
3. repeat until $t = M$

– set $t = t+1$

– sample a new initial momentum variable from the momentum canonical distribution $\bold p_0 \sim p(\bold p)$

– set $\bold x_0 = \bold x^{(t-1)}$

– run Leap Frog algorithm starting at $[\bold x_0, \bold p_0]$ for $L$ steps and stepsize $\delta$ to obtain proposed states $\bold x^*$ and $\bold p^*$

– calculate the Metropolis acceptance probability:

$\alpha = \text{min}(1,\exp(-U(\bold x^*) + U(\bold x_0) - K(\bold p^*) + K(\bold p_0)))$

– draw a random number $u$ from $\text{Unif}(0,1)$

if $u \leq \alpha$ accept the proposed state position $\bold x^*$ and set the next state in the Markov chain $\bold x^{(t)}=\bold x^*$

else set $\bold x^{(t)} = \bold x^{(t-1)}$

In the next example we implement HMC to sample from a multivariate target distribution that we have sampled from previously using multivariate Metropolis-Hastings: the bivariate Normal. We also qualitatively compare the sampling dynamics of HMC to those of multivariate Metropolis-Hastings on the same distribution.

### Example 2: Hamiltonian Monte Carlo for sampling a Bivariate Normal distribution

As a reminder, the target distribution $p(\bold x)$ for this example is a bivariate Normal with the following parameterization:

$p(\bold x) = \mathcal N (\bold{\mu}, \bold \Sigma)$

with mean $\mu = [\mu_1,\mu_2]= [0, 0]$

and covariance

$\bold \Sigma = \begin{bmatrix} 1 & \rho_{12} \\ \rho_{21} & 1\end{bmatrix} = \begin{bmatrix} 1 & 0.8 \\ 0.8 & 1\end{bmatrix}$

In order to sample from $p(\bold x)$ (assuming that we are using the quadratic kinetic energy function above), we need to determine the expressions for $U(\bold x)$ and

$\frac{\partial U(\bold x) }{ \partial x_i}$.

Recall that the target potential energy function can be defined from the canonical form as

$U(\bold x) = -\log(p(\bold x))$

If we take the negative log of the Normal distribution outlined above, this defines the following potential energy function:

$U(\bold x) = -\log \left(\frac{1}{Z}e^{-\frac{\bold{x^T \Sigma^{-1} x}}{2}}\right) = \frac{\bold{x^T \Sigma^{-1} x}}{2} + \log Z$

where $Z$ is the normalizing constant for a Normal distribution (the constant $\log Z$ term can be ignored because it eventually cancels in the Metropolis ratio). The potential energy function is then simply:

$U(\bold x) = \frac{\bold{x^T \Sigma^{-1}x}}{2}$

with gradient

$\frac{\partial U(\bold x)}{\partial \bold x} = \bold \Sigma^{-1}\bold x$
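A finite-difference check (a Python/NumPy sketch) confirms that the gradient of this quadratic potential is $\bold \Sigma^{-1}\bold x$, which is what the MATLAB sampler below computes:

```python
import numpy as np

# Central-difference check that grad of U(x) = x^T Sigma^{-1} x / 2
# equals Sigma^{-1} x for the covariance used in this example.
Sigma = np.array([[1.0, 0.8], [0.8, 1.0]])
Sinv = np.linalg.inv(Sigma)
U = lambda x: x @ Sinv @ x / 2

x = np.array([0.5, -1.2])          # arbitrary test point
h = 1e-6
grad_fd = np.array([(U(x + h * e) - U(x - h * e)) / (2 * h)
                    for e in np.eye(2)])
print(np.allclose(grad_fd, Sinv @ x, atol=1e-5))
```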

Using these expressions for the potential energy and its partial derivatives, we implement HMC for sampling from the bivariate Normal in Matlab:

Hybrid Monte Carlo Samples from bivariate Normal target distribution

In the graph above we display HMC samples of the target distribution, starting from an initial position very far from the mean of the target. We can see that HMC rapidly approaches areas of high density under the target distribution. We compare these samples with samples drawn using the Metropolis-Hastings (MH) algorithm below. The MH algorithm converges much slower than HMC, and consecutive samples have much higher autocorrelation than samples drawn using HMC.

Metropolis-Hasting (MH) samples of the same target distribution. Autocorrelation is evident. HMC is much more efficient than MH.

The Matlab code for the HMC sampler:

% EXAMPLE 2: HYBRID MONTE CARLO SAMPLING -- BIVARIATE NORMAL
rng(12345); % SET RANDOM SEED (rand/randn 'seed' SYNTAX IS DEPRECATED)

% STEP SIZE
delta = 0.3;
nSamples = 1000;
L = 20;

% DEFINE POTENTIAL ENERGY FUNCTION (NOTE THE FACTOR OF 1/2)
U = @(x) x'*inv([1,.8;.8,1])*x/2;

% DEFINE GRADIENT OF POTENTIAL ENERGY
dU = @(x) x'*inv([1,.8;.8,1]);

% DEFINE KINETIC ENERGY FUNCTION
K = @(p) p'*p/2;

% INITIAL STATE
x = zeros(2,nSamples);
x0 = [0;6];
x(:,1) = x0;

t = 1;
while t < nSamples
t = t + 1;

% SAMPLE RANDOM MOMENTUM
p0 = randn(2,1);

%% SIMULATE HAMILTONIAN DYNAMICS
% FIRST 1/2 STEP OF MOMENTUM
pStar = p0 - delta/2*dU(x(:,t-1))';

% FIRST FULL STEP FOR POSITION/SAMPLE
xStar = x(:,t-1) + delta*pStar;

% FULL STEPS
for jL = 1:L-1
% MOMENTUM
pStar = pStar - delta*dU(xStar)';
% POSITION/SAMPLE
xStar = xStar + delta*pStar;
end

% LAST HALF STEP
pStar = pStar - delta/2*dU(xStar)';

% COULD NEGATE MOMENTUM HERE TO LEAVE
% THE PROPOSAL DISTRIBUTION SYMMETRIC.
% HOWEVER WE THROW THIS AWAY FOR NEXT
% SAMPLE, SO IT DOESN'T MATTER

% EVALUATE ENERGIES AT
% START AND END OF TRAJECTORY
U0 = U(x(:,t-1));
UStar = U(xStar);

K0 = K(p0);
KStar = K(pStar);

% ACCEPTANCE/REJECTION CRITERION
alpha = min(1,exp((U0 + K0) - (UStar + KStar)));

u = rand;
if u < alpha
x(:,t) = xStar;
else
x(:,t) = x(:,t-1);
end
end

% DISPLAY
figure
scatter(x(1,:),x(2,:),'k.'); hold on;
plot(x(1,1:50),x(2,1:50),'ro-','Linewidth',2);
xlim([-6 6]); ylim([-6 6]);
legend({'Samples','1st 50 States'},'Location','Northwest')
title('Hamiltonian Monte Carlo')


## Wrapping up

In this post we introduced the Hamiltonian/Hybrid Monte Carlo algorithm for more efficient MCMC sampling. The HMC algorithm is extremely powerful for sampling distributions that can be represented in terms of a potential energy function and its partial derivatives. Despite the efficiency and elegance of HMC, it is an underrepresented sampling routine in the literature. This may be due to the popularity of simpler algorithms such as Gibbs sampling or Metropolis-Hastings, or perhaps to the fact that one must select hyperparameters such as the number of Leap Frog steps and the Leap Frog step size when using HMC. However, recent research has provided effective heuristics, such as adapting the Leap Frog step size to maintain a constant Metropolis rejection rate, that facilitate the use of HMC for general applications.

## MCMC: Multivariate Distributions, Block-wise, & Component-wise Updates

In the previous posts on MCMC methods, we focused on how to sample from univariate target distributions. This was done mainly to give the reader some intuition about MCMC implementations with fairly tangible examples that can be visualized. However, MCMC can easily be extended to sample multivariate distributions.

In this post we will discuss two flavors of MCMC update procedure for sampling distributions in multiple dimensions: block-wise, and component-wise update procedures. We will show how these two different procedures can give rise to different implementations of the Metropolis-Hastings sampler to solve the same problem.

## Block-wise Sampling

The first approach for performing multidimensional sampling is to use block-wise updates. In this approach the proposal distribution $q(\bold{x})$ has the same dimensionality as the target distribution $p(\bold x)$. Specifically, if $p(\bold x)$ is a distribution over $D$ variables, i.e. $\bold{x} = (x_1, x_2, \dots, x_D)$, then we must design a proposal distribution that is also a distribution involving $D$ variables. We then accept or reject a proposed state $\bold x^*$ sampled from the proposal distribution $q(\bold x)$ in exactly the same way as for the univariate Metropolis-Hastings algorithm. To generate $M$ multivariate samples we perform the following block-wise sampling procedure:

1. set $t = 0$
2. generate an initial state $\bold x^{(0)} \sim \pi^{(0)}$
3. repeat until $t = M$

set $t = t+1$

generate a proposal state $\bold x^*$ from $q(\bold x | \bold x^{(t-1)})$

calculate the proposal correction factor $c = \frac{q(\bold x^{(t-1)} | \bold x^*) }{q(\bold x^*|\bold x^{(t-1)})}$

calculate the acceptance probability $\alpha = \text{min} \left (1,\frac{p(\bold x^*)}{p(\bold x^{(t-1)})} \times c\right )$

draw a random number $u$ from $\text{Unif}(0,1)$

if $u \leq \alpha$ accept the proposal state $\bold x^*$ and set $\bold x^{(t)}=\bold x^*$

else set $\bold x^{(t)} = \bold x^{(t-1)}$
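The procedure above can be sketched in a few lines. This is a Python/NumPy illustration (the post’s MATLAB version appears below), targeting the same bivariate Normal used in the next example, with a symmetric unit-variance Gaussian proposal so that the correction factor $c = 1$:

```python
import numpy as np

# Block-wise Metropolis(-Hastings) sampler for a correlated 2D Normal.
rng = np.random.default_rng(7)
Sigma = np.array([[1.0, 0.8], [0.8, 1.0]])
Sinv = np.linalg.inv(Sigma)
logp = lambda x: -0.5 * x @ Sinv @ x      # log target, up to a constant

M = 20_000
x = np.zeros((M, 2))
for t in range(1, M):
    x_star = x[t - 1] + rng.normal(size=2)          # symmetric proposal
    if np.log(rng.uniform()) <= logp(x_star) - logp(x[t - 1]):
        x[t] = x_star                               # accept
    else:
        x[t] = x[t - 1]                             # reject

print(np.cov(x[1000:].T))                           # roughly Sigma
```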

Let’s take a look at the block-wise sampling routine in action.

### Example 1: Block-wise Metropolis-Hastings for sampling of bivariate Normal distribution

In this example we use a block-wise Metropolis-Hastings algorithm to sample from a bivariate (i.e. $D = 2$) Normal distribution:

$p(\bold x) = \mathcal N (\bold{\mu}, \bold \Sigma)$

with mean

$\mu = [0, 0]$

and covariance

$\bold \Sigma = \begin{bmatrix} 1 & 0.8 \\ 0.8 & 1\end{bmatrix}$

Usually the target distribution $p(\bold x)$ will have a complex mathematical form, but for this example we’ll circumvent that by using MATLAB’s built-in function $\text{mvnpdf}$ to evaluate $p(\bold x)$. For our proposal distribution, $q(\bold x)$, let’s use a circular Normal centered at the previous state/sample of the Markov chain/sampler, i.e:

$q(\bold x | \bold x^{(t-1)}) \sim \mathcal N (\bold x^{(t-1)}, \bold I)$,

where $\bold I$ is a 2-D identity matrix, giving the proposal distribution unit variance along both dimensions $x_1$ and $x_2$, and zero covariance. You can find a MATLAB implementation of the block-wise sampler at the end of the section. The samples and the target distribution output by the sampler implementation are displayed below:

Samples drawn from block-wise Metropolis-Hastings sampler

We can see from the output that the block-wise sampler does a good job of drawing samples from the target distribution.

Note that our proposal distribution in this example is symmetric; therefore it was not necessary to calculate the correction factor $c$ per se. This means that this Metropolis-Hastings implementation is identical to the simpler Metropolis sampler.
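A quick numerical check of the symmetry claim (a Python/NumPy sketch, writing out the $\mathcal N(\mu, \bold I)$ density by hand): $q(\bold a | \bold b) = q(\bold b | \bold a)$, so the correction factor is 1.

```python
import numpy as np

# Circular (isotropic, unit-variance) 2D Gaussian proposal density
# centered at mu; it depends only on |x - mu|, hence is symmetric.
def q(x, mu):
    d = x - mu
    return np.exp(-d @ d / 2) / (2 * np.pi)

a, b = np.array([0.3, -1.0]), np.array([1.5, 0.2])
print(np.isclose(q(a, b), q(b, a)))
```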

%------------------------------------------------------
% EXAMPLE 1: METROPOLIS-HASTINGS
% BLOCK-WISE SAMPLER (BIVARIATE NORMAL)
rand('seed' ,12345);

D = 2; % # VARIABLES
nBurnIn = 100;

% TARGET DISTRIBUTION IS A 2D NORMAL WITH STRONG COVARIANCE
p = inline('mvnpdf(x,[0 0],[1 0.8;0.8 1])','x');

% PROPOSAL DISTRIBUTION IS A STANDARD 2D GAUSSIAN
q = inline('mvnpdf(x,mu)','x','mu');

nSamples = 5000;
minn = [-3 -3];
maxx = [3 3];

% INITIALIZE BLOCK-WISE SAMPLER
t = 1;
x = zeros(nSamples,2);
x(1,:) = randn(1,D);

% RUN SAMPLER
while t < nSamples
    t = t + 1;

    % SAMPLE FROM PROPOSAL
    xStar = mvnrnd(x(t-1,:),eye(D));

    % CORRECTION FACTOR (EQUALS 1 FOR THIS SYMMETRIC PROPOSAL)
    c = q(x(t-1,:),xStar)/q(xStar,x(t-1,:));

    % CALCULATE THE M-H ACCEPTANCE PROBABILITY
    alpha = min([1, c*p(xStar)/p(x(t-1,:))]);

    % ACCEPT OR REJECT?
    u = rand;
    if u < alpha
        x(t,:) = xStar;
    else
        x(t,:) = x(t-1,:);
    end
end

% DISPLAY
nBins = 20;
bins1 = linspace(minn(1), maxx(1), nBins);
bins2 = linspace(minn(2), maxx(2), nBins);

% DISPLAY SAMPLED DISTRIBUTION
ax = subplot(121);
hist3(x, 'Edges', {bins1, bins2});
view(-15,40)

% COLOR HISTOGRAM BARS ACCORDING TO HEIGHT
colormap hot
set(gcf,'renderer','opengl');
set(get(gca,'child'),'FaceColor','interp','CDataMode','auto');
xlabel('x_1'); ylabel('x_2'); zlabel('Frequency');
axis square
set(ax,'xTick',[minn(1),0,maxx(1)]);
set(ax,'yTick',[minn(2),0,maxx(2)]);
title('Sampled Distribution');

% DISPLAY ANALYTIC DENSITY
ax = subplot(122);
[x1 ,x2] = meshgrid(bins1,bins2);
probX = p([x1(:), x2(:)]);
probX = reshape(probX ,nBins, nBins);
surf(probX); axis xy
view(-15,40)
xlabel('x_1'); ylabel('x_2'); zlabel('p({\bfx})');
colormap hot
axis square
set(ax,'xTick',[1,round(nBins/2),nBins]);
set(ax,'xTickLabel',[minn(1),0,maxx(1)]);
set(ax,'yTick',[1,round(nBins/2),nBins]);
set(ax,'yTickLabel',[minn(2),0,maxx(2)]);
title('Analytic Distribution')


## Component-wise Sampling

A problem with block-wise updates, particularly when the number of dimensions $D$ becomes large, is that finding a suitable proposal distribution is difficult. This leads to a large proportion of the samples being rejected. One way to remedy this is to simply loop over the $D$ dimensions of $\bold x$ in sequence, sampling each dimension independently of the others. This is what is known as using component-wise updates. Note that the proposal distribution $q(x)$ is now univariate, working only in a single dimension, namely the current dimension that we are trying to sample. The component-wise Metropolis-Hastings algorithm is outlined below.

1. set $t = 0$
2. generate an initial state $\bold x^{(0)} \sim \pi^{(0)}$
3. repeat until $t = M$

set $t = t+1$

for each dimension $i = 1..D$

generate a proposal state $x_i^*$ from $q(x_i | x_i^{(t-1)})$

calculate the proposal correction factor $c = \frac{q(x_i^{(t-1)} | x_i^*)}{q(x_i^* | x_i^{(t-1)})}$

calculate the acceptance probability $\alpha = \text{min} \left (1,\frac{p( x_i^*, \bold x_j^{(t-1)})}{p( x_i^{(t-1)}, \bold x_j^{(t-1)})} \times c\right )$

draw a random number $u$ from $\text{Unif}(0,1)$

if $u \leq \alpha$ accept the proposal state $x_i^*$ and set $x_i^{(t)}=x_i^*$

else set $x_i^{(t)} = x_i^{(t-1)}$

Note that in the component-wise implementation a sample for the $i$-th dimension is proposed, then accepted or rejected while all other dimensions ($j \neq i$) are held fixed. We then move on to the next ($(i + 1)$-th) dimension and repeat the process while holding all other variables ($j \neq (i + 1)$) fixed. In each successive step we use the most recent values of the dimensions already updated during the transition from $(t-1) \rightarrow t$.
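A minimal Python sketch of the component-wise scheme (standard library only; my own illustration, not the post's MATLAB code) for the same bivariate Normal target: each dimension is proposed and accepted or rejected in turn while the other is held fixed.

```python
import math
import random

# Same unnormalized bivariate Normal target as Example 1
# (zero mean, unit variances, covariance 0.8).
def p(x, rho=0.8):
    z = (x[0] ** 2 - 2 * rho * x[0] * x[1] + x[1] ** 2) / (1.0 - rho ** 2)
    return math.exp(-0.5 * z)

def componentwise_mh(n_samples, seed=0):
    rng = random.Random(seed)
    x = [0.0, 0.0]
    samples = []
    for _ in range(n_samples):
        for i in range(2):  # loop over dimensions i = 1..D
            x_star = list(x)
            x_star[i] = x[i] + rng.gauss(0.0, 1.0)  # univariate proposal
            # all dimensions j != i held fixed; symmetric q => c = 1
            alpha = min(1.0, p(x_star) / p(x))
            if rng.random() < alpha:
                x = x_star
        samples.append(tuple(x))
    return samples
```

As with the block-wise sketch, the sample correlation between the two coordinates should come out near 0.8.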

### Example 2: Component-wise Metropolis-Hastings for sampling of bivariate Normal distribution

In this example we draw samples from the same bivariate Normal target distribution described in Example 1, but using component-wise updates. Therefore $p(x)$ is the same; however, the proposal distribution $q(x)$ is now a univariate Normal distribution with unit variance in the direction of the $i$-th dimension to be sampled. The MATLAB implementation of the component-wise sampler is at the end of the section. The samples and comparison to the analytic target distribution are shown below.

Samples drawn from component-wise Metropolis-Hastings algorithm compared to target distribution

Again, we see that we get a good characterization of the bivariate target distribution.

%--------------------------------------------------
% EXAMPLE 2: METROPOLIS-HASTINGS
% COMPONENT-WISE SAMPLING OF BIVARIATE NORMAL
rand('seed' ,12345);

% TARGET DISTRIBUTION
p = inline('mvnpdf(x,[0 0],[1 0.8;0.8 1])','x');

nSamples = 5000;
propSigma = 1;		% PROPOSAL VARIANCE
minn = [-3 -3];
maxx = [3 3];

% INITIALIZE COMPONENT-WISE SAMPLER
x = zeros(nSamples,2);
xCurrent(1) = randn;
xCurrent(2) = randn;
dims = 1:2; % INDICES INTO EACH DIMENSION
t = 1;
x(t,1) = xCurrent(1);
x(t,2) = xCurrent(2);

% RUN SAMPLER
while t < nSamples
    t = t + 1;
    for iD = 1:2 % LOOP OVER DIMENSIONS

        % SAMPLE PROPOSAL FOR THE CURRENT DIMENSION
        xStar = normrnd(xCurrent(iD), propSigma);

        % NOTE: CORRECTION FACTOR c=1 BECAUSE
        % N(mu,1) IS SYMMETRIC, NO NEED TO CALCULATE

        % CALCULATE THE ACCEPTANCE PROBABILITY
        % (ALL OTHER DIMENSIONS HELD FIXED)
        xProp = xCurrent;
        xProp(iD) = xStar;
        pratio = p(xProp)/p(xCurrent);
        alpha = min([1, pratio]);

        % ACCEPT OR REJECT?
        u = rand;
        if u < alpha
            xCurrent(iD) = xStar;
        end
    end

% UPDATE SAMPLES
x(t,:) = xCurrent;
end

% DISPLAY
nBins = 20;
bins1 = linspace(minn(1), maxx(1), nBins);
bins2 = linspace(minn(2), maxx(2), nBins);

% DISPLAY SAMPLED DISTRIBUTION
figure;
ax = subplot(121);
hist3(x, 'Edges', {bins1, bins2});
view(-15,40)

% COLOR HISTOGRAM BARS ACCORDING TO HEIGHT
colormap hot
set(gcf,'renderer','opengl');
set(get(gca,'child'),'FaceColor','interp','CDataMode','auto');
xlabel('x_1'); ylabel('x_2'); zlabel('Frequency');
axis square
set(ax,'xTick',[minn(1),0,maxx(1)]);
set(ax,'yTick',[minn(2),0,maxx(2)]);
title('Sampled Distribution');

% DISPLAY ANALYTIC DENSITY
ax = subplot(122);
[x1 ,x2] = meshgrid(bins1,bins2);
probX = p([x1(:), x2(:)]);
probX = reshape(probX ,nBins, nBins);
surf(probX); axis xy
view(-15,40)
xlabel('x_1'); ylabel('x_2'); zlabel('p({\bfx})');
colormap hot
axis square
set(ax,'xTick',[1,round(nBins/2),nBins]);
set(ax,'xTickLabel',[minn(1),0,maxx(1)]);
set(ax,'yTick',[1,round(nBins/2),nBins]);
set(ax,'yTickLabel',[minn(2),0,maxx(2)]);
title('Analytic Distribution')



## Wrapping Up

Here we saw how we can use block- and component-wise updates to derive two different implementations of the Metropolis-Hastings algorithm. In the next post we will use the component-wise updates introduced above to motivate the Gibbs sampler, which is often used to improve the efficiency of sampling from well-defined multivariate probability distributions.

## MCMC: The Metropolis-Hastings Sampler

In an earlier post we discussed how the Metropolis sampling algorithm can draw samples from a complex and/or unnormalized target probability distribution using a Markov chain. The Metropolis algorithm first proposes a possible new state $x^*$ in the Markov chain, based on a previous state $x^{(t-1)}$, according to the proposal distribution $q(x^* | x^{(t-1)})$. The algorithm accepts or rejects the proposed state based on the density of the target distribution $p(x)$ evaluated at $x^*$. (If any of this Markov-speak is gibberish to the reader, please refer to the previous posts on Markov Chains, MCMC, and the Metropolis Algorithm for some clarification).

One constraint of the Metropolis sampler is that the proposal distribution $q(x^* | x^{(t-1)})$ must be symmetric. The constraint originates from using a Markov chain to draw samples: a necessary condition for drawing from a Markov chain’s stationary distribution is that at any given point in time $t$, the probability of moving from $x^{(t-1)} \rightarrow x^{(t)}$ must be equal to the probability of moving from $x^{(t)} \rightarrow x^{(t-1)}$, a condition known as reversibility or detailed balance. However, a symmetric proposal distribution may be ill-fit for many problems, like when we want to sample from distributions that are bounded on semi-infinite intervals (e.g. $[0, \infty)$).

In order to be able to use an asymmetric proposal distribution, the Metropolis-Hastings algorithm implements an additional correction factor $c$, defined from the proposal distribution as

$c = \frac{q(x^{(t-1)} | x^*) }{q(x^* | x^{(t-1)})}$

The correction factor adjusts the transition operator to ensure that the probability of moving from $x^{(t-1)} \rightarrow x^{(t)}$ is equal to the probability of moving from $x^{(t)} \rightarrow x^{(t-1)}$, no matter the proposal distribution.

The Metropolis-Hastings algorithm is implemented with essentially the same procedure as the Metropolis sampler, except that the correction factor is used in the evaluation of acceptance probability $\alpha$.  Specifically, to draw $M$ samples using the Metropolis-Hastings sampler:

1. set t = 0
2. generate an initial state $x^{(0)} \sim \pi^{(0)}$
3. repeat until $t = M$

set $t = t+1$

generate a proposal state $x^*$ from $q(x | x^{(t-1)})$

calculate the proposal correction factor $c = \frac{q(x^{(t-1)} | x^*) }{q(x^*|x^{(t-1)})}$

calculate the acceptance probability $\alpha = \text{min} \left (1,\frac{p(x^*)}{p(x^{(t-1)})} \times c\right )$

draw a random number $u$ from $\text{Unif}(0,1)$

if $u \leq \alpha$ accept the proposal state $x^*$ and set $x^{(t)}=x^*$

else set $x^{(t)} = x^{(t-1)}$

Many consider the Metropolis-Hastings algorithm to be a generalization of the Metropolis algorithm. This is because when the proposal distribution is symmetric, the correction factor is equal to one, giving the transition operator for the Metropolis sampler.
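To see the correction factor doing real work, here is a Python sketch (standard library only; my own illustration, not from the post) that uses a genuinely asymmetric proposal — an exponential whose mean equals the current state — to sample an unnormalized Gamma-shaped target $p(x) \propto x e^{-x}$ on $[0, \infty)$:

```python
import math
import random

# Unnormalized target: p(x) proportional to x * exp(-x), i.e. a
# Gamma(shape=2, rate=1) density supported on x > 0 (its mean is 2).
def p(x):
    return x * math.exp(-x) if x > 0 else 0.0

# Asymmetric proposal: exponential with mean equal to the current state,
# q(x_new | x_old) = (1 / x_old) * exp(-x_new / x_old).
def q(x_new, x_old):
    return (1.0 / x_old) * math.exp(-x_new / x_old)

def metropolis_hastings(n_samples, seed=0):
    rng = random.Random(seed)
    x = 1.0
    samples = []
    for _ in range(n_samples):
        x_star = rng.expovariate(1.0 / x)  # draw from q(x* | x^(t-1))
        c = q(x, x_star) / q(x_star, x)    # Hastings correction factor
        alpha = min(1.0, (p(x_star) / p(x)) * c)
        if rng.random() < alpha:
            x = x_star
        samples.append(x)
    return samples
```

Dropping the factor $c$ here would bias the chain; with it, the sample mean converges to the target's mean of 2.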

## Example: Sampling from a Bayesian posterior with improper prior

For a number of applications, including regression and density estimation, it is usually necessary to determine a set of parameters $\theta$ to an assumed model $p(y | \theta)$ such that the model can best account for some observed data $y$. The model function $p(y | \theta)$ is often referred to as the likelihood function. In Bayesian methods there is often an explicit prior distribution $p(\theta)$ that is placed on the model parameters and controls the values that the parameters can take.

The parameters are determined based on the posterior distribution $p(\theta | y)$, which is a probability distribution over the possible parameters based on the observed data. The posterior can be determined using Bayes’ theorem:

$p(\theta | y) = \frac{p(y | \theta) p(\theta)}{p(y)}$

where, $p(y)$ is a normalization constant that is often quite difficult to determine explicitly, as it involves computing sums over every possible value that the parameters and $y$ can take.

Let’s say that we assume the following model (likelihood function):

$p(y | \theta) = \text{Gamma}(y;A,B)$, where

$\text{Gamma}(y;A,B) = \frac{B^A}{\Gamma(A)} y^{A-1}e^{-By}$

and $\Gamma(\cdot)$ is the gamma function. Thus, the model parameters are

$\theta = [A,B]$

The parameter $A$ controls the shape of the distribution, and $B$ controls the rate (i.e. inverse scale). The likelihood surface for $B = 1$ and a number of values of $A$ ranging from zero to five is shown below.

Likelihood surface and conditional probability p(y|A=2,B=1) in green

The conditional distribution $p(y | A=2, B = 1)$ is plotted in green along the likelihood surface. You can verify this is a valid conditional in MATLAB with the following command:

 plot(0:.1:10,gampdf(0:.1:10,2,1)); % GAMMA(2,1)

Now, let’s assume the following priors on the model parameters:

$p(B = 1) = 1$

and

$p(A) = \text{sin}(\pi A)^2$

The first prior states that $B$ only takes a single value (i.e. 1), therefore we can treat it as a constant. The second (rather non-conventional) prior states that the probability of $A$ varies as a sinusoidal function. (Note that both of these prior distributions are called improper priors because they do not integrate to one). Because $B$ is constant, we only need to estimate the value of $A$.

It turns out that even though the normalization constant $p(y)$ may be difficult to compute, we can sample from $p(A | y)$ without knowing $p(y)$ using the Metropolis-Hastings algorithm. In particular, we can ignore the normalization constant and sample from the unnormalized posterior:

$p(A | y) \propto p(y |A) p(A)$

The surface of the (unnormalized) posterior for $y$ ranging from zero to ten is shown below. The prior $p(A)$ is displayed in blue on the right of the plot. Let’s say that we have a datapoint $y = 1.5$ and would like to estimate the posterior distribution $p(A|y=1.5)$ using the Metropolis-Hastings algorithm. This particular target distribution is plotted in magenta in the plot below.

Posterior surface, prior distribution (blue), and target distribution (pink)

Using a symmetric proposal distribution like the Normal distribution is not efficient for sampling from $p(A|y=1.5)$ because the posterior only has support on the positive real numbers $A \in [0 ,\infty)$. An asymmetric proposal distribution with the same support would provide better coverage of the posterior. One distribution that operates on the positive real numbers is the exponential distribution.

$q(A) = \text{Exp}(A; \mu) = \frac{1}{\mu} e^{-A/\mu}$,

This distribution is parameterized by a single variable $\mu$, its mean, which controls the scale and location of the distribution’s probability mass. The target posterior and a proposal distribution (for $\mu = 5$) are shown in the plot below.

Target posterior p(A|y) and proposal distribution q(A)

We see that the proposal has a fairly good coverage of the posterior distribution. We run the Metropolis-Hastings sampler in the block of MATLAB code at the bottom. The Markov chain path and the resulting samples are shown in plot below.

Metropolis-Hastings Markov chain and samples

As an aside, note that the proposal distribution for this sampler does not depend on past samples, but only on the parameter $\mu$ (see the line `xStar = exprnd(mu);` in the MATLAB code below). Each proposal state $x^*$ is drawn independently of the previous state. Therefore this is an example of an independence sampler, a specific type of Metropolis-Hastings sampling algorithm. Independence samplers are notorious for being either very good or very poor sampling routines. The quality of the routine depends on the choice of the proposal distribution and its coverage of the target distribution. Identifying such a proposal distribution is often difficult in practice.
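For illustration, the same independence sampler can be sketched in standard-library Python (the unnormalized posterior below re-implements the Gamma likelihood with $B = 1$ times the $\sin^2$ prior; function names are my own):

```python
import math
import random

Y = 1.5    # observed datapoint
MU = 5.0   # mean of the exponential proposal

# Unnormalized posterior p(A | y) proportional to Gamma(y; A, B=1) * sin(pi*A)^2
def post(a):
    if a <= 0:
        return 0.0
    likelihood = Y ** (a - 1) * math.exp(-Y) / math.gamma(a)
    prior = math.sin(math.pi * a) ** 2
    return likelihood * prior

# Proposal density: exponential with mean MU (ignores the current state)
def q(a):
    return (1.0 / MU) * math.exp(-a / MU)

def independence_sampler(n_samples, burn_in=500, seed=0):
    rng = random.Random(seed)
    a = 0.5  # start where the prior is nonzero
    kept = []
    for t in range(n_samples):
        a_star = rng.expovariate(1.0 / MU)  # proposal independent of a
        c = q(a) / q(a_star)                # correction factor
        alpha = min(1.0, (post(a_star) / post(a)) * c)
        if rng.random() < alpha:
            a = a_star
        if t >= burn_in:
            kept.append(a)
    return kept
```

The retained samples concentrate on the half-integer lobes of the $\sin^2$ prior where the Gamma likelihood for $y = 1.5$ is large.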

The MATLAB code for running the Metropolis-Hastings sampler is below. Paste it into a MATLAB terminal to output the figures above.

% METROPOLIS-HASTINGS BAYESIAN POSTERIOR
rand('seed',12345)

% PRIOR OVER SCALE PARAMETERS
B = 1;

% DEFINE LIKELIHOOD
likelihood = inline('(B.^A/gamma(A)).*y.^(A-1).*exp(-(B.*y))','y','A','B');

% CALCULATE AND VISUALIZE THE LIKELIHOOD SURFACE
yy = linspace(0,10,100);
AA = linspace(0.1,5,100);
likeSurf = zeros(numel(yy),numel(AA));
for iA = 1:numel(AA); likeSurf(:,iA)=likelihood(yy(:),AA(iA),B); end;

figure;
surf(likeSurf); ylabel('p(y|A)'); xlabel('A'); colormap hot

% DISPLAY CONDITIONAL AT A = 2
hold on; ly = plot3(ones(1,numel(AA))*40,1:100,likeSurf(:,40),'g','linewidth',3)
xlim([0 100]); ylim([0 100]);  axis normal
set(gca,'XTick',[0,100]); set(gca,'XTickLabel',[0 5]);
set(gca,'YTick',[0,100]); set(gca,'YTickLabel',[0 10]);
view(65,25)
legend(ly,'p(y|A=2)','Location','Northeast');
hold off;
title('p(y|A)');

% DEFINE PRIOR OVER SHAPE PARAMETERS
prior = inline('sin(pi*A).^2','A');

% DEFINE THE POSTERIOR
p = inline('(B.^A/gamma(A)).*y.^(A-1).*exp(-(B.*y)).*sin(pi*A).^2','y','A','B');

% CALCULATE AND DISPLAY THE POSTERIOR SURFACE
postSurf = zeros(size(likeSurf));
for iA = 1:numel(AA); postSurf(:,iA)=p(yy(:),AA(iA),B); end;

figure
surf(postSurf); ylabel('y'); xlabel('A'); colormap hot

% DISPLAY THE PRIOR
hold on; pA = plot3(1:100,ones(1,numel(AA))*100,prior(AA),'b','linewidth',3)

% SAMPLE FROM p(A | y = 1.5)
y = 1.5;
target = postSurf(16,:);

% DISPLAY POSTERIOR
psA = plot3(1:100, ones(1,numel(AA))*16,postSurf(16,:),'m','linewidth',3)
xlim([0 100]); ylim([0 100]);  axis normal
set(gca,'XTick',[0,100]); set(gca,'XTickLabel',[0 5]);
set(gca,'YTick',[0,100]); set(gca,'YTickLabel',[0 10]);
view(65,25)
legend([pA,psA],{'p(A)','p(A|y = 1.5)'},'Location','Northeast');
hold off
title('p(A|y)');

% INITIALIZE THE METROPOLIS-HASTINGS SAMPLER
% DEFINE PROPOSAL DENSITY
q = inline('exppdf(x,mu)','x','mu');

% MEAN FOR PROPOSAL DENSITY
mu = 5;

% DISPLAY TARGET AND PROPOSAL
figure; hold on;
th = plot(AA,target,'m','Linewidth',2);
qh = plot(AA,q(AA,mu),'k','Linewidth',2)
legend([th,qh],{'Target, p(A)','Proposal, q(A)'});
xlabel('A');

% SOME CONSTANTS
nSamples = 5000;
burnIn = 500;
minn = 0.1; maxx = 5;

% INTIIALZE SAMPLER
x = zeros(1 ,nSamples);
x(1) = mu;
t = 1;

% RUN METROPOLIS-HASTINGS SAMPLER
while t < nSamples
    t = t+1;

    % SAMPLE FROM PROPOSAL
    xStar = exprnd(mu);

    % CORRECTION FACTOR
    c = q(x(t-1),mu)/q(xStar,mu);

    % CALCULATE THE (CORRECTED) ACCEPTANCE RATIO
    alpha = min([1, p(y,xStar,B)/p(y,x(t-1),B)*c]);

    % ACCEPT OR REJECT?
    u = rand;
    if u < alpha
        x(t) = xStar;
    else
        x(t) = x(t-1);
    end
end

% DISPLAY MARKOV CHAIN
figure;
subplot(211);
stairs(x(1:t),1:t, 'k');
hold on;
hb = plot([0 maxx/2],[burnIn burnIn],'g--','Linewidth',2)
ylabel('t'); xlabel('samples, A');
set(gca , 'YDir', 'reverse');
ylim([0 t])
axis tight;
xlim([0 maxx]);
title('Markov Chain Path');
legend(hb,'Burnin');

% DISPLAY SAMPLES
subplot(212);
nBins = 100;
sampleBins = linspace(minn,maxx,nBins);
counts = hist(x(burnIn:end), sampleBins);
bar(sampleBins, counts/sum(counts), 'k');
xlabel('samples, A' ); ylabel( 'p(A | y)' );
title('Samples');
xlim([0 10])

% OVERLAY TARGET DISTRIBUTION
hold on;
plot(AA, target/sum(target) , 'm-', 'LineWidth', 2);
legend('Sampled Distribution',sprintf('Target Posterior'))
axis tight


## Wrapping Up

Here we explored how the Metropolis-Hastings sampling algorithm can be used to generalize the Metropolis algorithm in order to sample from complex (and unnormalized) probability distributions using asymmetric proposal distributions. One shortcoming of the Metropolis-Hastings algorithm is that not all of the proposed samples are accepted, wasting valuable computational resources. This becomes even more of an issue for sampling distributions in higher dimensions. This is where Gibbs sampling comes in. We’ll see in a later post that Gibbs sampling can be used to keep all proposal states in the Markov chain by taking advantage of conditional probabilities.

## Sampling From the Normal Distribution Using the Box-Muller Transform

The Normal Distribution is the workhorse of many common statistical analyses and being able to draw samples from this distribution lies at the heart of many statistical/machine learning algorithms. There have been a number of methods developed to sample from the Normal distribution including Inverse Transform Sampling, the Ziggurat Algorithm, and the Ratio Method (a rejection sampling technique). In this post we will focus on an elegant method called the Box-Muller transform.

## A quick review of Cartesian and polar coordinates.

Before we can talk about using the Box-Muller transform, let’s refresh our understanding of the relationship between Cartesian and polar coordinates. You may remember from geometry that a point $(x, y)$ in the Cartesian plane can be represented in polar coordinates with a radius $r$ and an angle $\theta$ using the following relationships:

$r^2 = x^2 + y^2$

$\tan(\theta) = \frac{y}{x}$, and therefore

$x = r\cos(\theta)$

$y = r\sin(\theta)$

Notice that if $r \leq 1$ and $\theta \in [0 , 2\pi]$, then we map out values contained in the unit circle, as shown in the figure below. Also note that random points in such a circle can be generated by transforming values sampled from the uniform distribution. Specifically, radii can be sampled from $r \sim Unif(0,1)$ and angles can be sampled from $\theta \sim 2\pi \times Unif(0,1)$. A similar mechanism (i.e. drawing points in a circle using uniform variables) is at the heart of the Box-Muller transform for sampling Normal random variables.

Example of the relationship between Cartesian and polar coordinates
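The polar-to-Cartesian mapping above is easy to sanity-check in a few lines of Python (a quick illustration, not part of the original post). Note that drawing $r$ uniformly concentrates points near the origin rather than filling the disk uniformly, but every point stays inside the unit circle:

```python
import math
import random

rng = random.Random(1)

# Draw a point inside the unit circle: sample a radius and an angle
# from uniform distributions, then convert polar -> Cartesian.
def random_point_in_unit_circle():
    r = rng.random()                      # r ~ Unif(0, 1)
    theta = 2.0 * math.pi * rng.random()  # theta ~ 2*pi * Unif(0, 1)
    return r * math.cos(theta), r * math.sin(theta)

points = [random_point_in_unit_circle() for _ in range(2000)]
```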

## Drawing Normally-distributed samples with the Box-Muller transform

Ok, now that we’ve discussed how Cartesian coordinates are represented in polar coordinates, let’s move on to how we can use this relationship to generate random variables. Box-Muller sampling is based on representing the joint distribution of two independent standard Normal random Cartesian variables $X$ and $Y$

$X \sim N(0,1)$

$Y \sim N(0,1)$

in polar coordinates. The joint distribution $p(x,y)$ (which is circular-symmetric) is:

$p(x,y) = p(x)p(y) = \frac{1}{\sqrt{2\pi}}e^{-\frac{x^2}{2}}\frac{1}{\sqrt{2\pi}}e^{-\frac{y^2}{2}}$

$= \frac{1}{2\pi}e^{-\frac{x^2 + y^2}{2}}$

If we notice that the $x^2 + y^2$ term in the numerator of the exponent is equal to $r^2$ (as above), we can make the connection between the Cartesian representation of the joint Normal distribution and its polar representation:

$p(x,y) = \left ( \frac{1}{2\pi} \right ) \left ( e^{\frac{-r^2}{2}} \right )$

which is the product of two density functions, an exponential distribution over squared radii:

$r^2 \sim Exp(\frac{1}{2})$

and a uniform distribution over angles:

$\theta \sim Unif(0,2\pi)$

just like those mentioned above when generating points on the unit circle. Now, if we make another connection between the exponential distribution and the uniform distribution, namely that:

$Exp(\lambda) = \frac{-\log (Unif(0,1))}{\lambda}$

then $r = \sqrt{-2\log (Unif(0,1))}$

This gives us a way to generate points from the joint Gaussian distribution by sampling from two independent uniform distributions, one for $r$ and another for $\theta$, and transforming them into Cartesian coordinates via the relationships above. In detail, the procedure goes as follows:

1. Draw, $u_1,u_2 \sim Unif(0,1)$
2. Transform the variables into radius and angle representation $r = \sqrt{-2\log(u_1)}$, and $\theta = 2\pi u_2$
3. Transform radius and angle into Cartesian coordinates: $x = r \cos(\theta), y = r \sin(\theta)$

What results are two independent Normal random variables, $X$ and $Y$. A MATLAB implementation of the Box-Muller algorithm is shown below:

% NORMAL SAMPLES USING THE BOX-MULLER METHOD
% DRAW UNIFORM SAMPLES
u = rand(2,100000);
r = sqrt(-2*log(u(1,:)));
theta = 2*pi*u(2,:);
x = r.*cos(theta);
y = r.*sin(theta);

% DISPLAY BOX-MULLER SAMPLES
figure
% X SAMPLES
subplot(121);
hist(x,100);
colormap hot;axis square
title(sprintf('Box-Muller Samples X\n Mean = %1.2f\n Variance = %1.2f\n Excess Kurtosis = %1.2f',mean(x),var(x),kurtosis(x)-3))
xlim([-6 6])

% Y SAMPLES
subplot(122);
hist(y,100);
colormap hot;axis square
title(sprintf('Box-Muller Samples Y\n Mean = %1.2f\n Variance = %1.2f\n Excess Kurtosis = %1.2f',mean(y),var(y),kurtosis(y)-3))
xlim([-6 6])
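For comparison, the same three-step procedure in standard-library Python (an illustrative sketch of the transform, not the post's MATLAB code):

```python
import math
import random

rng = random.Random(3)

# Box-Muller: two independent uniforms -> two independent standard Normals.
def box_muller():
    u1 = 1.0 - rng.random()              # in (0, 1]; avoids log(0)
    u2 = rng.random()
    r = math.sqrt(-2.0 * math.log(u1))   # radius
    theta = 2.0 * math.pi * u2           # angle
    return r * math.cos(theta), r * math.sin(theta)

pairs = [box_muller() for _ in range(50000)]
```

Both coordinates of the generated pairs should have mean near zero and variance near one.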


Box-Muller Samples for Normal Distribution

## Wrapping Up

The output of the MATLAB code is shown above. Notice that the first, second, and fourth central moments (mean, variance, and kurtosis) of the generated samples are consistent with the standard Normal. The Box-Muller transform is another example of how uniform variables on the interval (0,1) can be transformed in order to sample from a more complicated distribution.

## Inverse Transform Sampling

There are a number of sampling methods used in machine learning, each of which has various strengths and/or weaknesses depending on the nature of the sampling task at hand. One simple method for generating samples from distributions with closed-form descriptions is Inverse Transform (IT) Sampling.

The idea behind IT Sampling is that the probability mass for a random variable $X$ distributed according to the probability density function $f(x)$ integrates to one, and therefore the cumulative distribution function $C(x)$ can be used to map from values in the interval $(0,1)$ (i.e. probabilities) to the domain of $f(x)$. Because it is easy to sample values $z$ uniformly from the interval $(0,1)$, we can use the inverse of the CDF, $C^{-1}(z)$, to transform these sampled probabilities into samples $x$. The code below demonstrates this process at work in order to sample from a Student’s t-distribution with 10 degrees of freedom.

rand('seed',12345)

% DEGREES OF FREEDOM
dF = 10;
x = -3:.1:3;
Cx = cdf('t',x,dF);
z = rand;

% FIND THE FIRST INDEX WHERE C(x) EXCEEDS z
zIdx = min(find(Cx>z));

% DRAW SAMPLE
sample = x(zIdx);

% DISPLAY
figure; hold on
plot(x,Cx,'k','Linewidth',2);
plot([x(1),x(zIdx)],[Cx(zIdx),Cx(zIdx)],'r','LineWidth',2);
plot([x(zIdx),x(zIdx)],[Cx(zIdx),0],'b','LineWidth',2);
plot(x(zIdx),z,'ko','LineWidth',2);
text(x(1)+.1,z + .05,'z','Color','r')
text(x(zIdx)+.05,.05,'x_{sampled}','Color','b')
ylabel('C(x)')
xlabel('x')
hold off


IT Sampling from student’s-t(10)

However, the scheme used to create the plot above is inefficient in that one must compare each sampled value of $z$ with $C(x)$ for all values of $x$. A much more efficient method is to evaluate $C^{-1}$ directly:

1. Derive $C^{-1}(z)$ (or a good approximation) from $f(x)$
2. for $i = 1:n$
• draw $z_i$ from $Unif(0,1)$
• set $x_i = C^{-1}(z_i)$
• end for
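As a quick illustration of step 2 when $C^{-1}$ is available directly, Python's standard library exposes the Normal inverse CDF via `statistics.NormalDist.inv_cdf`, so IT sampling of a standard Normal takes only a few lines (my own sketch, not from the post):

```python
import random
from statistics import NormalDist, fmean, pstdev

rng = random.Random(5)
nd = NormalDist()  # standard Normal: C is its cdf, C^{-1} its inv_cdf

# draw z ~ Unif(0,1), then map it through the inverse CDF
# (`or 0.5` guards against the measure-zero draw z == 0.0)
samples = [nd.inv_cdf(rng.random() or 0.5) for _ in range(20000)]
```

The transformed uniforms should have mean near zero and standard deviation near one.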

The IT sampling process is demonstrated in the next chunk of code to sample from the Beta distribution, a distribution for which $C^{-1}$ is easy to approximate using Newton’s method (which we let MATLAB do for us within the function icdf.m):

rand('seed',12345)
nSamples = 1000;

% BETA PARAMETERS
alpha = 2; beta = 10;

% DRAW PROPOSAL SAMPLES
z = rand(1,nSamples);

% EVALUATE PROPOSAL SAMPLES AT INVERSE CDF
samples = icdf('beta',z,alpha,beta);
bins = linspace(0,1,50);
counts = histc(samples,bins);
probSampled = counts/sum(counts);
probTheory = betapdf(bins,alpha,beta);

% DISPLAY
b = bar(bins,probSampled,'FaceColor',[.9 .9 .9]);
hold on;
t = plot(bins,probTheory/sum(probTheory),'r','LineWidth',2);
xlim([0 1])
xlabel('x')
ylabel('p(x)')
legend([t,b],{'Theory','IT Samples'})
hold off


Inverse Transform Sampling of Beta(2,10)

## Wrapping Up

The IT sampling method is generally only used for univariate distributions where $C^{-1}$ can be computed in closed form, or approximated. However, it is a nice example of how uniform random variables can be used to sample from much more complicated distributions.

## Emergence of Various Probability Distributions: Some Insight via Simulations

There are a large number of probability distributions that arise in statistical modeling and inference procedures. These distributions can take many complex analytic forms and can have what appear to be arbitrary parameterizations. Though the definitions of many of these distributions are mathematically esoteric, it is often very helpful to see how they can emerge from simple simulation procedures. In this post we perform a number of simple manipulations of standard Normal/Gaussian-distributed variables that give rise to some of the probability distributions that predominate in statistics: the $\chi^2$, the F, and the Student’s t-distributions.

# The Standard Normal Distribution

Before we begin the simulations, we need to first describe an important building block, the standard Normal distribution. The Normal distribution is a smooth continuous function that is symmetric about an axis and is often described as having a “bell” shape. The distribution lies at the heart of describing a vast number of physical processes. Also, many statistical quantities and distributions emerge from  simply transforming values drawn from the standard Normal distribution (I’ll describe what I mean by “standard” shortly).

Any Normal distribution is characterized by two parameters: a mean parameter $\mu$, which characterizes the location of the distribution center,  and a variance parameter $\sigma^2$, which characterizes the width of the distribution. Formally, the Normal distribution defines the probability $p(x)$ of some value $x$ occurring as:

$p(x) = \frac{1}{\sqrt{2 \pi \sigma^2}}e^{-\frac{1}{2 \sigma^2}(x - \mu)^2}$

The standard Normal distribution has zero mean and unit variance (i.e. $\sigma^2 = 1$), and therefore takes the simple form:

$p(x) = \frac{1}{\sqrt{2 \pi}}e^{-\frac{1}{2}x^2}$

As I will show, a number of common probability distributions emerge from performing simple manipulations of values that follow the standard Normal distribution. In order to simulate these distributions we need to be able to draw values that follow the standard Normal distribution. It turns out that we can generate pseudo-random samples from a standard Normal distribution using a clever concept known as the Box-Muller transformation. Though I won’t go into the details, the Box-Muller transformation relates measurements on the unit circle (i.e. in polar coordinates) to norms of random vectors in Cartesian space (i.e. x-y coordinates). If none of that makes any sense, don’t worry, it’s not important. What is important is that if we know how to independently draw two random numbers $R_1$ and $R_2$ that take on values 0 to 1, then it is possible to draw a sample $N$ from the standard Normal via the following relationship:

$N = \sqrt{-2 \log R_1} \cos(2 \pi R_2)$

or also

$N = \sqrt{-2 \log R_1} \sin(2 \pi R_2)$

since sin(x) and cos(x) are orthogonal to one another. Figure 1 compares the empirical probability distribution of values drawn from a standard Normal distribution using the MATLAB function $\text{randn}$ to the distribution of values drawn from the uniform distribution using the function $\text{rand}$ and transforming those values using the Box-Muller transformation.

Figure 1: Theoretical and simulated random Normal variables. Simulated variables are calculated using the Box-Muller transformation of uniform variables on the interval (0 1).

Note that the Box-Muller transformation is not the only (or even the best/most efficient) method for generating random variables from the standard Normal distribution (for instance, see the Polar Method). The reason for demonstrating the Box-Muller method is its simplicity. For a MATLAB example of drawing samples using the Box-Muller method, see lines (16-18) in the source code below.

Ok, so now that we have a way generating samples from the standard Normal distribution, we can now begin transforming these samples in order to develop other distributions.

# Simulating the $\chi^2$ Distribution

The $\chi^2$ distribution is a common probability distribution used in statistical inference (i.e. the $\chi^2$ test) and assessing the goodness of fit of linear models to data. Specifically, a random variable $Z$ drawn from the $\chi^2$ distribution with $\nu$ degrees of freedom is obtained by drawing $\nu$ independent variables $N$ from the standard Normal distribution, squaring each value drawn, and taking the sum of those squared values. Formally, this process is described as:

$Z = \sum_{i=1}^\nu N_i^2$
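This sum-of-squares construction is simple to simulate; the sketch below (standard-library Python, an illustration rather than the post's MATLAB code) draws $\chi^2$ variables and checks them against the known mean $\nu$ and variance $2\nu$:

```python
import random

rng = random.Random(11)

# A chi-squared draw with nu degrees of freedom: the sum of nu
# squared standard-Normal draws, exactly as in the formula above.
def chi2_draw(nu):
    return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(nu))

nu = 5
draws = [chi2_draw(nu) for _ in range(20000)]
```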

For a MATLAB example of simulating $\chi^2$-distributed variables, see lines (37-54) in the source code below. The results of this simulation are shown in right of Figure 2, and are compared to the theoretical distribution, which takes the mathematical form:

$p(x) = \frac{x^{(\nu/2) -1} e^{-(x/2)}}{2^{\nu/2} \Gamma(\frac{\nu}{2})}$

for values of $x \geq 0$. The $\Gamma$ is what is known as the Gamma function, and can be thought of as an extension of the factorial (i.e. !) operator to continuous values (! only works on integers).

Figure 2: Theoretical (left) and simulated (right) Chi-Squared distributions having 1-10 degrees of freedom (dF)

Perhaps it is just me, but I feel that it is far more natural to interpret the $\chi^2$ distribution as a sum of squares of standard Normal variables, than this fairly complicated expression. The interpretation also lends itself well to testing the goodness of fit of a linear model to data using the sum of squares loss function. This brings us to a second probability distribution, which is often used in the analysis of variance of linear model fits.

# Simulating the F-Distribution

In order to test the goodness of fit of a linear model to some data, it is often helpful to look at the ratio of the mean squared error of the model prediction to the mean squared deviation of the data from its mean (i.e. the variance of the data). This ratio can be thought of as the proportion of the variance in the data that is explained by the model. Because both the numerator and the denominator of this ratio result from squaring and summing deviations or errors, we are equivalently taking the ratio of two $\chi^2$-distributed values. If the numerator and denominator are each scaled by their respective degrees of freedom (i.e. the number of model parameters minus one for the numerator, and the number of datapoints minus one for the denominator), then the value that results is what is known as an F-statistic. These F-statistics can be used to quantify how well the model fits (i.e. the statistical significance of the fit) by comparing them to the distribution of F-statistic values, which takes the form:

$F(x,\nu_1,\nu_2) = \frac{\sqrt{\frac{(\nu_1 x)^{\nu_1}\nu_2^{\nu_2}}{(\nu_1 x + \nu_2)^{\nu_1+\nu_2}}}}{xB(\nu_1/2,\nu_2/2)}$

for variables having $\nu_1$ and $\nu_2$ degrees of freedom. The function $B$ is known as the Beta function or Beta integral.

We can also simulate distributions of F-statistics by generating two $\chi^2$ variables with respective degrees of freedom $\nu_1$ and $\nu_2$ (just as we did above), scaling each variable by its respective degrees of freedom, and taking their ratio. Formally, this is described by:

$F(\nu_1,\nu_2) = \frac{\chi_{\nu_1}^2/\nu_1}{\chi_{\nu_2}^2/\nu_2}$
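The same ratio-of-scaled-$\chi^2$s recipe in a short Python sketch (degrees of freedom and sample count are arbitrary choices for illustration); a handy check is that the theoretical mean of an F-distribution is $\nu_2/(\nu_2 - 2)$ for $\nu_2 > 2$:

```python
import random

rng = random.Random(7)

def chi2_draw(nu):
    """One chi-squared draw: the sum of nu squared standard normals."""
    return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(nu))

nu1, nu2 = 10, 50
# F-variate: ratio of two chi-squared draws, each scaled by its own dF
f = [(chi2_draw(nu1) / nu1) / (chi2_draw(nu2) / nu2) for _ in range(20000)]

mean = sum(f) / len(f)   # theory: nu2 / (nu2 - 2)
```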

A MATLAB example of this simulation is shown in lines (77-104) in the code below. The results of the simulation are displayed in the bottom row of Figure 3.

Figure 3: Theoretical (top row) and simulated (bottom row) F-distributions with 2, 10, 50, & 100 degrees of freedom (dF). Columns represent theoretical and simulated distributions resulting from a numerator Chi-Squared distribution with each dF. Each curve corresponds to the distribution resulting from a denominator Chi-Squared distribution with a particular dF.

Again, it is insightful to interpret the F-distribution simply as the ratio of scaled $\chi^2$ variables, rather than through the complicated mathematical expression for the F-distribution shown above.

# Simulating the Student’s t-distribution

During statistical analyses, it is often helpful to determine whether the mean $\bar x$ of a dataset of size $n$ differs from some specified value $\mu_0$. In order to test this hypothesis, it is customary to calculate what is called a test statistic:

$t = \frac{\bar x - \mu_0}{\sqrt{s^2/n}}$

Here the sample mean is

$\bar x = \frac{1}{n}\sum_{i=1}^n x_i$

and the sample variance is

$s^2 = \frac{1}{n-1}\sum_{i=1}^n(x_i - \bar x)^2$
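Putting the three pieces above together, a small Python sketch of the t-statistic calculation, applied repeatedly to simulated datasets (sample size and draw count are arbitrary illustration choices):

```python
import math
import random

rng = random.Random(3)

def t_statistic(sample, mu0=0.0):
    """t-value testing whether the sample mean differs from mu0."""
    n = len(sample)
    xbar = sum(sample) / n                                  # sample mean
    s2 = sum((x - xbar) ** 2 for x in sample) / (n - 1)     # sample variance
    return (xbar - mu0) / math.sqrt(s2 / n)

# repeatedly draw small standard-normal datasets and compute their t-values
t_vals = [t_statistic([rng.gauss(0.0, 1.0) for _ in range(10)])
          for _ in range(20000)]
mean = sum(t_vals) / len(t_vals)   # the t-distribution is symmetric about 0
```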

The distribution of these t-values is known as the t-distribution. Like the Normal distribution, the t-distribution is symmetric and “bell” shaped, but it has heavier tails than the Normal distribution and is parameterized by a single parameter $\nu$ that corresponds to the degrees of freedom in the sample (i.e. the sample size). Formally, the t-distribution has the following form:

$p(t) = \frac{\Gamma(\frac{\nu+1}{2})}{\sqrt{\nu \pi}\Gamma(\nu/2)} \left( 1 + \frac{t^2}{\nu}\right)^{-\frac{\nu+1}{2}}$

The t-distribution is used in a number of other statistical tests as well, particularly when the available sample size is small. Also, because it was developed in order to make Guinness the fine beer that it is today, the Student’s t-distribution is likely even more important than the Normal distribution!

The t-distribution with degrees of freedom $\nu$ can be simulated by repeatedly sampling $\nu$ independent datapoints from the standard normal distribution and calculating t-values from these samples. A MATLAB simulation of various t-distributions is shown in Figure 4, along with comparisons to the theoretical function shown above (see lines (130-148) of the code below).

Figure 4: Theoretical (left) and simulated (right) Student’s-t distribution with 2,10,50,& 100 degrees of freedom (dF).

Another way to simulate a t-distribution (not shown here) is to sample values that are the ratio of a standard normal variable $N$ to the square root of a $\chi^2$-distributed variable $Z_\nu$ scaled by its degrees of freedom:

$T(\nu) = \frac{N}{\sqrt{Z_{\nu}/\nu}}$
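This alternative recipe is also only a few lines of Python; a convenient check is that the variance of a t-variate with $\nu$ degrees of freedom is $\nu/(\nu-2)$ for $\nu > 2$:

```python
import math
import random

rng = random.Random(11)
nu = 10   # degrees of freedom (arbitrary illustration choice)

def t_draw():
    """One t-variate: N / sqrt(chi-squared_nu / nu)."""
    z = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(nu))  # chi-squared, nu dF
    return rng.gauss(0.0, 1.0) / math.sqrt(z / nu)

t_vals = [t_draw() for _ in range(50000)]
m = sum(t_vals) / len(t_vals)
var = sum((t - m) ** 2 for t in t_vals) / len(t_vals)  # theory: nu / (nu - 2)
```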

# Wrapping up

So, hopefully these simulations give you some insight into how some standard probability distributions can come about. It is just another example of how “complex” ideas can be derived from simple manipulations of a general mechanism, but are often hidden behind mathematical detail and formalism.

# MATLAB Code


clear all; close all

% SOME CONSTANTS
nNu = 10;
nSimu = 1e6;
dx = 0.1;
x = -6:dx:6;

% PART 1: NORMAL DISTRIBUTION
%-----------------------------
% ONE CAN GENERATE NORMALLY-DISTRIBUTED RANDOM VARIABLES FROM
% UNIFORMLY DISTRIBUTED VARIABLES USING A GEOMETRIC METHOD KNOWN
% AS THE BOX-MULLER TRANSFORM

fprintf('\nSimulating Normal Distribution From Uniform Random Variables.');
R = rand(nSimu,2);
NSimu = sqrt(-2*log(R(:,1))).*cos(2*pi*R(:,2));
NTheory = randn(1,nSimu);	% MATLAB'S BUILT-IN GENERATOR, FOR COMPARISON
% OR...
% N = sqrt(-2*log(R(:,1))).*sin(2*pi*R(:,2));
% (THESE ACTUALLY COME IN PAIRS)

figure('Name','Simulated Normal Distribution');
[nHistTheory,nBin] = hist(NTheory,x);
nHistTheory = nHistTheory/(sum(nHistTheory)*dx);
bar(nBin,nHistTheory,'k')
hold on;
[nHist,nBin] = hist(NSimu,x);
nHist = nHist/(dx*sum(nHist));
stairs(nBin,nHist,'r','Linewidth',2)
legend({'Theory','Simulated'})
title(sprintf('Simulated Mean = %1.4f\n Simulated Variance = %1.4f',mean(NSimu),var(NSimu)))
xlim([-6 6]);
fprintf('\nPress any key to continue.\n')
pause

% PART 2: CHI-SQUARED-DISTRIBUTION
%----------------------------------
% SIMULATE AND CALCULATE THEORETICAL VALUES
% FOR CHI-SQUARE DISTRIBUTION
fprintf('\nSimulating Chi-Squared Distribution\n');
zFun0 = 'randn(1,nSimu).^2'; zFun = zFun0;
dx = 0.01;
x = 0:dx:15;
legStr = {};
for nu = 1:nNu
    fprintf('\r dF = %d',nu);
    zTmp = eval(zFun);
    [zTmp,zBin] = hist(zTmp,x);
    zSimu(:,nu) = zTmp/(dx*sum(zTmp));	% SIMULATED
    zTheory(:,nu) = chi2pdf(x,nu);		% THEORETICAL
    zFun = [zFun,'+',zFun0];
    legStr{end+1} = sprintf('dF = %d',nu);
end

% DISPLAY CHI-SQUARE SIMULATIONS
figure('Name','Simulated Chi-Squared Distribution')
subplot(121);plot(zBin,zTheory,'Linewidth',2)
xlim([0 15]); ylim([0,.4]);
legend(legStr)
xlabel('x'); ylabel('p(x)')
title('Theoretical')

subplot(122);plot(zBin,zSimu)
xlim([0 15]); ylim([0,.4]);
legend(legStr)
xlabel('x'); ylabel('p(x)')
title('Simulated')
fprintf('\nPress any key to continue.\n')
pause

% PART 3: F-DISTRIBUTION
%------------------------
% SIMULATE AND CALCULATE THEORETICAL VALUES
% FOR F-DISTRIBUTION
fprintf('\nSimulating F-Distribution\n');
nSimu = 1e5;
cnt = 1;
nu = [2,10,50,100];
x = 0:dx:5;
zFun0 = 'randn(1,nSimu).^2'; zFun = zFun0;
legStr = {};

for nuTop = nu
    zFunTop = zFun0;
    for iF = 1:nuTop-1
        zFunTop = [zFunTop,'+',zFun0];
    end
    for nuBottom = nu
        fprintf('\r dF1 = %d, dF2 = %d',nuTop,nuBottom);
        zFunBottom = zFun0;
        for iF = 1:nuBottom-1
            zFunBottom = [zFunBottom,'+',zFun0];
        end
        zTop = eval(zFunTop)/nuTop;	% CHI-2 DISTRIBUTIONS SCALED BY dF
        zBottom = eval(zFunBottom)/nuBottom;
        Ftmp = zTop./zBottom;   	% F IS THE RATIO OF SCALED CHI-2s
        [Ftmp,zBin] = hist(Ftmp,x);
        FSimu(:,cnt) = Ftmp/(dx*sum(Ftmp));
        FTheory(:,cnt) = fpdf(x,nuTop,nuBottom);
        cnt = cnt+1;
    end
    legStr{end+1} = sprintf('dF = %d',nuTop);
end

% DISPLAY F-DISTRIBUTION SIMULATIONS
figure('Name','Simulated F Distribution')
axCnt = cnt;
for iP = 1:numel(nu)
    subplot(2,numel(nu),iP)
    plot(x,FTheory(:,(iP-1)*numel(nu)+1:iP*numel(nu)),'Linewidth',2);
    xlim([0 5]); ylim([0 2]);
    legend(legStr);
    title(sprintf('dF = %d',nu(iP)))
    xlabel('x'); ylabel('p(x) Theoretical');

    subplot(2,numel(nu),numel(nu)+iP);
    plot(x,FSimu(:,(iP-1)*numel(nu)+1:iP*numel(nu)));
    xlim([0 5]); ylim([0 2]);
    legend(legStr);
    title(sprintf('dF = %d',nu(iP)));
    xlabel('x'); ylabel('p(x) Simulated');
end
fprintf('\nPress any key to continue.\n')
pause

% PART 4 STUDENT'S T-DISTRIBUTION
%---------------------------------
fprintf('\nSimulating Student''s t Distribution\n');
nSimu = 1e6;
dx = 0.01;
x = -6:dx:6;
nu = [2 10 50 100];
legStr = {};
cnt = 1;
for dF = nu
    fprintf('\r dF = %d',dF);
    x0 = randn(dF,nSimu);
    xBar = mean(x0);
    s = var(x0);
    mu = 0;                             % randn IS ZERO MEAN
    tTmp = (xBar - mu)./sqrt(s/dF);
    [tTmp,tBin] = hist(tTmp,x);
    tSimu(:,cnt) = tTmp/(dx*sum(tTmp));	% SIMULATED
    tTheory(:,cnt) = tpdf(x,dF);		% THEORETICAL
    legStr{end+1} = sprintf('dF = %d',dF);
    cnt = cnt+1;
end

% DISPLAY T-DISTRIBUTION SIMULATIONS
figure('Name','Simulated Student''s t Distribution')
subplot(121);plot(tBin,tTheory,'Linewidth',2)
xlim([-6 6]); ylim([0 .4]);
legend(legStr)
xlabel('x'); ylabel('p(x)')
title('Theoretical')

subplot(122);plot(tBin,tSimu,'Linewidth',2)
xlim([-6 6]); ylim([0 .4]);
legend(legStr)
xlabel('x'); ylabel('p(x)')
title('Simulated')
fprintf('\nPress any key to finish demo.\n')
pause

clear all; close all