# Category Archives: Statistics

## Derivation: Maximum Likelihood for Boltzmann Machines

In this post I will review the gradient descent algorithm that is commonly used to train the general class of models known as Boltzmann machines. Though the primary goal of the post is to supplement another post on restricted Boltzmann machines, I hope that those readers who are curious about how Boltzmann machines are trained, but have found it difficult to track down a complete or straight-forward derivation of the maximum likelihood learning algorithm for these models (as I have), will also find the post informative.

First, a little background: Boltzmann machines are stochastic neural networks that can be thought of as the probabilistic extension of the Hopfield network. The goal of the Boltzmann machine is to model a set of observed data in terms of a set of visible random variables $v$  and a set of latent/unobserved random variables $h$. Due to the relationship between Boltzmann machines and neural networks, the random variables are often are often referred to as “units.” The role of the visible units is to approximate the true distribution of the data, while the role of the latent variables it to extend the expressiveness of the model by capturing underlying features in the observed data. The latent variables are often referred to as hidden units, as they do not result directly from the observed data and are generally marginalized over to obtain the likelihood of the observed data,  i.e.

$\Large{\begin{array}{rcl} p(v;\theta) &=& \sum_h p(v,h; \theta) \end{array}}$,

where $p(v,h; \theta)$ is the joint probability distribution over the visible and hidden units based on the current model parameters $\theta$. The general Boltzmann machine defines $p(v,h; \theta)$ through a set of weighted,  symmetric connections between all visible and hidden units (but no connections from any unit to itself). The graphical model for the general Boltzmann machine is shown in Figure 1.

Figure 1: Graphical Model of the Boltzmann machine (biases not depicted).

Given the current state of the visible and hidden units, the overall configuration of the model network is described by a connectivity function $E(v,h;\theta)$, parameterized by $\theta = {W, A, B, a, b}$:

$\Large{\begin{array}{rcl} E(v,h; \theta) &=& v^T W h + h^T A h + v^T B v + h^T a + v^T b \end{array}}.$

The parameter matrix $W$ defines the connection strength between the visible and hidden units. The parameters $A$ and $B$ define the connection strength amongst hidden units and visible units, respectively. The model also includes a set of  biases $a$ and $b$ that capture offsets for each of the hidden and visible units.

The Boltzmann machine has been used for years in field of statistical mechanics to model physical systems based on the principle of energy minimization. In the statistical mechanics, the connectivity function is often referred to the “energy function,” a term that is has also been standardized in the statistical learning literature. Note that the energy function returns a single scalar value for any configuration of the network parameters and random variable states.

Given the energy function, the Boltzmann machine models the joint probability of the visible and hidden unit states as a Boltzmann distribution:

$\Large{\begin{array}{rcl} p(v,h; \theta) &=& \frac{\mathrm{e}^{-E(v,h; \theta)}}{Z(\theta)} \text{ , where} \\ \\ Z(\theta) &=& \sum_{v'} \sum_{h'} \mathrm{e}^{-E(v',h'; \theta)}\end{array}}$

The partition function $Z(\theta)$ is a normalizing constant that is calculated by summing over all possible states of the network $(v', h') \in (V',H')$. Here we assume that all random variables take on discrete values, but the analogous derivation holds for continuous or mixed variable types by replacing the sums with integrals accordingly.

The common way to train the Boltzmann machine is to determine the parameters that maximize the likelihood of the observed data. To determine the parameters, we perform gradient descent on the log of the likelihood function (In order to simplify the notation in the remainder of the derivation, we do not include the explicit dependency on the parameters $\theta$. To further simplify things, let’s also assume that we calculate the gradient of the likelihood based on a single observation.):

$\Large{ \begin{array}{rcl} l(v; \theta) &=& \log p(v) \\ &=& \log \sum_h p(v,h) \\ &=& \log \frac{\sum_h \mathrm{e}^{-E(v,h)}}{Z} \\ &=& \log \sum_h \mathrm{e}^{-E(v,h)} - \log Z \\ &=& \log \sum_h \mathrm{e}^{-E(v,h)} - \sum_{v'} \sum_{h'} \mathrm{e}^{-E(v',h')} \end{array}}$

The gradient calculation is as follows:

$\Large{ \begin{array}{rcl} \frac{\partial l(v;\theta)}{\partial \theta} &=& \frac{\partial}{\partial \theta}\log \sum_h \mathrm{e}^{-E(v,h)} - \frac{\partial}{\partial \theta} \log \sum_{v'}\sum_{h'}\mathrm{e}^{-E(v',h')} \\ &=& \frac{1}{\sum_h \mathrm{e}^{-E(v,h)}} \frac{\partial}{\partial \theta} \sum_h \mathrm{e}^{-E(v,h)} - \frac{1}{\sum_{v'}\sum_{h'}\mathrm{e}^{-E(v',h')}} \frac{\partial}{\partial \theta} \sum_{v'}\sum_{h'}\mathrm{e}^{-E(v',h')} \\ &=& - \frac{1}{\sum_h \mathrm{e}^{-E(v,h)}} \sum_h \mathrm{e}^{-E(v,h)}\frac{\partial E(v,h)}{\partial \theta} + \frac{1}{\sum_{v'}\sum_{h'}\mathrm{e}^{-E(v',h')}} \sum_{v'}\sum_{h'}\mathrm{e}^{-E(v',h')}\frac{\partial E(v',h')}{\partial \theta} \end{array}}$

Here we can simplify the expression somewhat by noting that $\mathrm{e}^{-E(v,h)} = Z p(v,h)$, that $Z = \sum_{v'}\sum_{h'}\mathrm{e}^{-E(v',h')}$, and also that $Z$ is a constant:

$\Large{ \begin{array}{rcl} \frac{\partial l(v;\theta)}{\partial \theta} &=& - \frac{1}{Z\sum_h p(v,h)} Z \sum_h p(v,h) \frac{\partial E(v,h)}{\partial \theta} + \frac{1}{Z} Z \sum_{v'}\sum_{h'}p(v',h')\frac{\partial E(v',h')}{\partial \theta} \\ &=& - \frac{1}{\sum_h p(v,h)} \sum_h p(v,h) \frac{\partial E(v,h)}{\partial \theta} + \sum_{v'}\sum_{h'}p(v',h')\frac{\partial E(v',h')}{\partial \theta} \\ \end{array}}$

If we also note that $\sum_h p(v,h)= p(v)$, and use the definition of conditional probability $p(h|v) = \frac{p(v,h)}{p(v)}$, we can further simplify the expression for the gradient:

$\Large{ \begin{array}{rcl} \frac{\partial l(v;\theta)}{\partial \theta} &=& - \frac{1}{p(v)} \sum_h p(v,h) \frac{\partial E(v,h)}{\partial \theta} + \sum_{v'}\sum_{h'}p(v',h')\frac{\partial E(v',h')}{\partial \theta} \\ &=& -\sum_h \frac{p(v,h)}{p(v)} \frac{\partial E(v,h)}{\partial \theta} + \sum_{v'}\sum_{h'}p(v',h')\frac{\partial E(v',h')}{\partial \theta} \\ &=& -\sum_h p(h | v) \frac{\partial E(v,h)}{\partial \theta} + \sum_{v'}\sum_{h'}p(v',h')\frac{\partial E(v',h')}{\partial \theta} \\ &=& -\mathbb{E}_{p(h | v)} \frac{\partial E(v,h)}{\partial \theta} + \mathbb{E}_{p(v',h')}\frac{\partial E(v',h')}{\partial \theta}. \\ \end{array}}$

Here $\mathbb{E}_{p(*)}$ is the expected value under the distribution $p(*)$. Thus the gradient of the likelihood function is composed of two parts. The first part is expected gradient of the energy function with respect to the conditional distribution $p(h|v)$. The second part is expected gradient of the energy function with respect to the joint distribution over all variable states. However, calculating these expectations is generally infeasible for any realistically-sized model, as it involves summing over a huge number of possible states/configurations. The general approach for solving this problem is to use Markov Chain Monte Carlo (MCMC) to approximate these sums:

$\Large{\begin{array}{rcl} \frac{\partial l(v;\theta)}{\partial \theta} &\approx& -\left \langle \frac{\partial E(v,h)}{\partial \theta} \right \rangle_{p(h_{\text{data}}|v_{\text{data}})} + \left \langle \frac{\partial E(v,h)}{\partial \theta} \right \rangle_{p(h_{\text{model}}|v_{\text{model}})} \\ \end{array}}.$

Here $\langle \rangle_{p(*)}$ is the sample average of samples drawn according to the process $p(*)$. The first term is calculated by taking the average value of the energy function gradient when the visible and hidden units are being driven by observed data samples. In practice, this first term is generally straightforward to calculate. Calculating the second term is generally more complicated and involves running a set of Markov chains until they reach the current model’s equilibrium distribution (i.e. via Gibbs sampling, Metropolis-Hastings, or the like), then taking the average energy function gradient based on those samples. See this post on MCMC methods for details. It turns out that there is a subclass of Boltzmann machines that, due to a restricted connectivity/energy function (specifically, the parameters $(A, B)=0$), allow for efficient MCMC by way of blocked Gibbs sampling. These models, known as restricted Boltzman machines have become an important component for unsupervised pretraining in the field of deep learning and will be the focus of a related post.

## Model Selection: Underfitting, Overfitting, and the Bias-Variance Tradeoff

In machine learning and pattern recognition, there are many ways (an infinite number, really) of solving any one problem. Thus it is important to have an objective criterion for assessing the accuracy of candidate approaches and for selecting the right model for a data set at hand. In this post we’ll discuss the concepts of under- and overfitting and how these phenomena are related to the statistical quantities bias and variance. Finally, we will discuss how these concepts can be applied to select a model that will accurately generalize to novel scenarios/data sets.

## Models for Regression

When performing regression analyses we would like to characterize how the value of some dependent variable changes as some independent variable $x$ is varied. For example, say we would like to characterize the firing rate of a neuron in visual cortex as we vary the orientation of a grating pattern presented to the eye. We assume that there is some true relationship function $f(x)$ that maps the independent variable values (i.e. the angle of the grating pattern) onto the dependent variable values (i.e. firing rate). We would like to determine the form of the function $f(x)$ from observations of independent-dependent value pairs (I may also refer to these as input-output pairs, as we can think of the function $f(x)$ taking $x$ as input and producing an output). However, in the real world, we don’t get to observe $f(x)$ directly, but instead get noisy observations $y$, where

(1) $y = f(x) + \epsilon$

Here we will assume that $\epsilon$ is random variable distributed according to a zero-mean Gaussian with standard deviation $\sigma^2$. Note that because $\epsilon$ is a random variable, $y$ is also a random variable (with a mean that is on conditioned on both $x$ and $f(x)$, and a variance $\sigma^2)$.

As an example, say that the true function $f(x)$ we want to determine has the the following form (though we don’t know it):

$f(x) = \sin(\pi x)$

Thus the observations $y$ we get to see have the following distribution.

$y = \sin(\pi x) + \mathcal N(0,\sigma^2)$

Below we define the function $f(x)$ and display it, then draw a few observation samples $y$, and display them as well:

% CREATE A RANDOM DATASET
rng(10)
nData = 10; % N
x = 2*(rand(1,nData)-.5);
xGrid = linspace(-1,1,100);

% DEFINE AND TARGET FUNCTION f(x)
f = inline('sin(pi*x)','x');

h = [];
h(1) = plot(xGrid,f(xGrid),'k','Linewidth',2);
xlabel('x')
hold on

% DEFINE AND DISPLAY y
noiseSTD = .1;
y = f(x) + noiseSTD*randn(size(x));
h(2) = scatter(x,y,'ko');
legend(h,{'f(x)','y'},'Location','Northwest');

Again, the goal here is to characterized the function $f(x)$. However, since we don’t know the function form of $f(x)$, we must instead estimate some other function $g(x)$ that we think will be a good approximation to $f(x)$. Thus we call $g(x)$ an estimator of $f(x)$. In general, an estimator is some parameterized model that can capture a wide range of functional forms. One such class of estimators is the weighted combination of ordered polynomials:

$g_N(x) = \theta_0 + \theta_1x + \theta_2x^2 + \dots \theta_N x^N$

As the polynomial order $N$ increases, the functions $g_N(x)$ are able to capture increasingly complex behavior. For example, $g_0(x)$ desribes a horizontal line with an adjustable vertical offset, $g_1(x)$ desribes a line with adjustable vertical offset and adjustable slope, $g_2(x)$ describes a function that also includes a quadratic term. We thus try to fit the values of the parameters for a given estimator $g_N(x)$  to best account for observed data in the hopes that we will also accurately approximate $f(x)$.

Below we estimate the parameters of three polynomial model functions of increasing complexity (using Matlab’s $\texttt{polyfit.m}$) to the sampled data displayed above. Specifically, we estimate the functions $g_1(x)$, $g_3(x)$ and $g_{10}(x)$.

% FIT POLYNOMIAL MODELS & DISPLAY
% (ASSUMING PREVIOUS PLOT ABOVE STILL AVAILABLE)
degree = [1,3,10];
theta = {};
cols = [.8 .05 0.05; 0.05 .6 0.05; 0.05 0.05 .6];
for iD = 1:numel(degree)
figure(1)
theta{iD} = polyfit(x,y,degree(iD));
fit{iD} = polyval(theta{iD},xGrid);
h(end+1) = plot(xGrid,fit{iD},'color',cols(iD,:),'Linewidth',2);
xlim([-1 1])
ylim([-1 1])
end
legend(h,'f(x)','y','g_1(x)','g_3(x)','g_{10}(x)','Location','Northwest')

Qualitatively, we see that the estimator $g_1(x)$ (red line) provides a poor fit to the observed data, as well as a poor approximation to the function $f(x)$ (black curve). We see that the estimator $g_{10}(x)$ (blue curve) provides a very accurate fit to the data points, but varies wildly to do so, and therefore provides an inaccurate approximation of $f(x)$. Finally, we see that the estimator $g_3(x)$ (green curve) provides a fairly good fit to the observed data, and a much better job at approximating $f(x)$.

Our original goal was to approximate $f(x)$, not the data points per se. Therefore $g_3(x)$, at least qualitatively, provides a more desirable estimate of $f(x)$ than the other two estimators. The fits for $g_1(x)$ and $g_{10}(x)$ are examples of “underfitting” and “overfitting” to the observed data. Underfitting occurs when an estimator $g(x)$ is not flexible enough to capture the underlying trends in the observed data. Overfitting occurs when an estimator is too flexible, allowing it to capture illusory trends in the data. These illusory trends are often the result of the noise in the observations $y$.

## Variance and Bias of an Estimator

The model fits for $g_N(x)$ discussed above were based on a single, randomly-sampled data set of observations $y$. However, there are in principle, a potentially infinite number of data sets that can be observed (drawn from $y$). In order to determine a good model of $f(x)$, it would be helpful to have an idea of how an estimator will perform on any or all of these potential datasets.  To get an idea of how each of the estimators discussed above performs in general we can repeat the model fitting procedure for many data sets.

Here we perform such an analyses, sampling 50 independent data sets according to Equation (1) (with $\sigma=0.1$), then fitting the parameters for the polynomial functions of model order $N = [1,3,10]$ to each dataset.

% FIT MODELS TO K INDEPENDENT DATASETS
K = 50;
for iS = 1:K
ySim = f(x) + noiseSTD*randn(size(x));
for jD = 1:numel(degree)
% FIT THE MODEL USING polyfit.m
thetaTmp = polyfit(x,ySim,degree(jD));
% EVALUATE THE MODEL FIT USING polyval.m
simFit{jD}(iS,:) = polyval(thetaTmp,xGrid);
end
end

% DISPLAY ALL THE MODEL FITS
h = [];
for iD = 1:numel(degree)
figure(iD+1)
hold on
% PLOT THE FUNCTION FIT TO EACH DATASET
for iS = 1:K
h(1) = plot(xGrid,simFit{iD}(iS,:),'color',brighten(cols(iD,:),.6));
end
% PLOT THE AVERAGE FUNCTION ACROSS ALL FITS
h(2) = plot(xGrid,mean(simFit{iD}),'color',cols(iD,:),'Linewidth',5);
% PLOT THE UNDERLYING FUNCTION f(x)
h(3) = plot(xGrid,f(xGrid),'color','k','Linewidth',3);
% CALCULATE THE SQUARED ERROR AT EACH POINT, AVERAGED ACROSS ALL DATASETS
squaredError = (mean(simFit{iD})-f(xGrid)).^2;
% PLOT THE SQUARED ERROR
h(4) = plot(xGrid,squaredError,'k--','Linewidth',3);
uistack(h(2),'top')
hold off
axis square
xlim([-1 1])
ylim([-1 1])
legend(h,{sprintf('Individual g_{%d}(x)',degree(iD)),'Mean of All Fits','f(x)','Squared Error'},'Location','WestOutside')
title(sprintf('Model Order=%d',degree(iD)))
end

Here are the results for the estimator $g_1(x)$

…and for the estimator $g_3(x)$, …

… and for $g_{10}(x).$

The lightly-colored curves in each of the three plots above are an individual polynomial model fit to one of the 50 sampled data sets. The darkly-colored curve in each plot is the average over the 50 individual fits. The dark black curve is the true, underlying function $f(x)$.

We see that for the estimator $g_1(x)$ (red curves), model fits do not vary much from data set to data set. Thus the expected (mean) estimator fit over all the data sets, formally written as $\mathbb E[g(x)]$, is very similar to all the individual fits.

A common formal definition used in statistics to describe how much a single estimator deviates from the average estimator over datasets is the variance of the estimator. Formally defined as

$variance=\mathbb E[(g(x)-\mathbb E[g(x)])^2]$

the variance is the average squared difference between any single data-set-dependent estimate of $g(x)$ and the average value of $g(x)$ estimated over all datasets.

According to the definition of variance, we can say that the estimator $g_1(x)$ exhibits low variance.

A commonly-used metric in statistics for assessing how an estimator $g(x)$ approximates a target function $f(x)$, based on behavior over many observed data sets is what is called the bias of the estimator. Formally defined as:

$bias = \mathbb E[g(x)] - f(x)$

The bias describes how much the average estimator fit over datasets $\mathbb E[g(x)]$ deviates from the value of the underlying target function $f(x)$.

We can see from the plot for $g(x)_1$ that $\mathbb E[g_1(x)]$ deviates significantly from $f(x)$. Thus we can say that the estimator $g_1(x)$ exhibits large bias when approximating the function $f(x)$.

Investigating the results for the estimator $g_{10}(x)$ (blue curves), we see that each individual model fit varies dramatically from one data set to another. Thus we can say that this estimator exhibits large variance. Looking at $\mathbb E[g_{10}(x)]$ (the dark blue curve), we see that on average the estimator $g_{10}(x)$ provides a better approximation to $f(x)$ than the estimator $g_1(x)$. Thus we can say that $g_{10}(x)$ exhibits a lower bias than the estimator $g_1(x)$.

Investigating the fits for the estimator $g_3(x)$ (green curves), we find that this estimator has low bias. Furthermore, the average estimator $\mathbb E[g_3(x)]$ (dark green curve) accurately approximates the true function $f(x)$, telling us that that the estimator $g_3(x)$ also has low bias.

We established earlier that the estimator $g_3(x)$ provided a qualitatively better fit to the function $f(x)$ than the other two polynomial estimators for a single dataset. It appears that this is also the case over many datasets. We also find that estimator $g_3(x)$ exhibits low bias and low variance, whereas the other two, less-desirable estimators, have either high bias or high variance. Thus it would appear that having both low bias and low variance is a reasonable criterion for selecting an accurate model of $f(x)$.

Included in each of the three plots above is a dashed black line representing the squared difference between the average estimator $\mathbb E[g_N(x)]$ and the true function $f(x)$. Calculating squared model errors is a common practice for quantifying the goodness of a model fit. If we calculate the expected value of each of the dashed black curves (and assuming that all values of $x$ are equally likely to occur), we would obtain a single value for each estimator that is the mean squared error (MSE) between the expected estimator and the true function.

$\mathbb E[(\mathbb E[g(x)]-f(x))^2] = \frac{1}{N}\sum_{i=1}^N (\mathbb E[g(x)]-f(x))^2$

For the estimator $g_3(x)$, the MSE will be very small, as the dashed black curve for this estimator is near zero for all values of $x$. The estimators $g_1(x)$ and $g_{10}(x)$ would have significantly larger values. Now, because exhibiting both a low MSE, as well as having both low bias and variance are indicative of a good estimator, it would be reasonable to assume that squared model error is directly related to bias and variance. The next section provides some formal evidence for this notion.

## Expected Prediction Error and the Bias-variance Tradeoff

For a given estimator $g(x)$ fit to a data set of $xy$ pairs, we would like to know, given all the possible datasets out there, what is the expected prediction error we will observe for a new data point $x^*, y^* = f(x^*) + \epsilon$. If we define prediction error to be the squared difference in model prediction $g(x^*)$ and observations $y^*$, the expected prediction error is then:

$\mathbb E[(g(x^*) - y^*)^2]$

If we expand this a little and use a few identities, something interesting happens:

(2) $\mathbb E[(g(x^*) - y^*)^2] = \mathbb E[g(x^*)^2-2g(x^*)y^*+y^{*2}]=$

(3) $\mathbb E[g(x^*)^2]-2\mathbb E[g(x^*)y^*]+\mathbb E[y^{*2}]=$

(4) $\mathbb E[(g(x^*)-\mathbb E[g(x^*)])^2] + \mathbb E[g(x^*)]^2-2 \mathbb E[g(x^*)]f(x^*) + \mathbb E[(y^*-f(x^*))^2] + f(x^*)^2$

where we have applied the following Lemma to the first and third terms of Equation (3), and use the fact to $\mathbb E[y] = f(x)$ (Think of averaging over an infinite number of datasets sampled from y; all noise will average out, leaving $f(x)$). Rearranging Equation (4), we obtain

(5) $\mathbb E[(g(x^*)-\mathbb E[g(x^*)])^2] + \mathbb E[g(x^*)]^2 - 2 \mathbb E[g(x^*)]f(x^*) + f(x^*)^2 + \mathbb E[(y^*-f(x^*))^2]$

which can be further simplified and grouped into three terms

(6) $\mathbb E[(g(x^*) - \mathbb E[g(x^*)])^2]$ $+\mathbb (E[g(x^*)]-f(x^*))^2$ $+ \mathbb E[(y^*-f(x^*))^2]$

1. The first term is the variance of the estimator introduced above.
2. The second term is the square of the bias of the estimator, also introduced above.
3. The third term is the variance of the observation noise and describes how much the observations $y$ vary from the true function $f(x)$. Notice that the noise term does not depend on the estimator $g(x)$. Thus the noise term is a constant that places a lower bound on expected prediction error.

Here we find that the expected prediction error on new data $(x^*,y^*)$ (in the squared differences sense) is the combination of three terms:

Expected prediction error = estimator variance + squared estimator bias + noise

Thus the expected prediction error on new data can be used as a quantitative criterion for selecting the best model from a candidate set of estimators! It turns out that, given $N$ new data points $(\bold x^*,\bold y^*)$, the expected prediction error can be easily approximated as the mean squared error over data pairs:

$\mathbb E[(g(\bold x^*) - \bold y^*)^2] \approx \frac{1}{N}\sum_{i=1}^N(g(x_i^*)-y_i^*)^2$

Below we demonstrate these findings with another set of simulations. We simulate 100 independent datasets, each with 25 $xy$ pairs. We then partition each dataset into two non-overlapping sets: a training set using for fitting model parameters, and a testing set used to calculate prediction error. We then fit the parameters for estimators of varying complexity. Complexity is varied by using polynomial functions that range in model order from 1 (least complex) to 12 (most complex). We then calculate and display the squared bias, variance, and error on testing set for each of the estimators:

N = 25; % # OF OBSERVATIONS PER DATASET
K = 100;% # OF DATASETS
noiseSTD = .5; % NOISE STANDARDE DEV.
nTrain = ceil(N*.9); % # OF TRAINING POINTS
nPolyMax = 12; % MAXIMUM MODEL COMPLEXITY

% # INITIALIZE SOME VARIABLES
xGrid = linspace(-1,1,N);
meanPrediction = zeros(K,N);
thetaHat = {};
x = linspace(-1,1,N);
x = x(randperm(N));
for iS = 1:K % LOOP OVER DATASETS
% CREATE OBSERVED DATA, y
y = f(x) + noiseSTD*randn(size(x));

% CREATE TRAINING SET
xTrain = x(1:nTrain);
yTrain = y(1:nTrain);

% CREATE TESTING SET
xTest = x(nTrain+1:end);
yTest = y(nTrain+1:end);

% FIT MODELS
for jD = 1:nPolyMax

% MODEL PARAMETER ESTIMATES
thetaHat{jD}(iS,:) = polyfit(xTrain,yTrain,jD);

% PREDICTIONS
yHatTrain{jD}(iS,:) = polyval([thetaHat{jD}(iS,:)],xTrain); TRAINING SET
yHatTest{jD}(iS,:) = polyval([thetaHat{jD}(iS,:)],xTest);% TESTING SET

% MEAN SQUARED ERROR
trainErrors{jD}(iS) = mean((yHatTrain{jD}(iS,:) - yTrain).^2); % TRAINING
testErrors{jD}(iS) = mean((yHatTest{jD}(iS,:) - yTest).^2); % TESTING
end
end

% CALCULATE AVERAGE PREDICTION ERROR, BIAS, AND VARIANCE
for iD = 1:nPolyMax
trainError(iD) = mean(trainErrors{iD});
testError(iD) = mean(testErrors{iD});
biasSquared(iD) = mean((mean(yHatTest{iD})-f(xTest)).^2);
variance(iD) = mean(var(yHatTest{iD},1));
end
[~,bestModel] = min(testError);

% DISPLAY
figure;
hold on;
plot(testError,'k','Linewidth',2);
plot(biasSquared,'r','Linewidth',2);
plot(variance,'b','Linewidth',2);
plot(biasSquared + variance,'m-.','Linewidth',2);
yl = ylim;
plot([bestModel,bestModel],[yl(1),yl(2)],'k--');
xlim([1,nPolyMax]);
xlabel('Model Complexity (Polynomial Order)')
legend('Test Error','Bias^2','Variance','Bias^2+Var.','Best Model')
hold off;

Here we see how, as the model complexity increases, the estimator variance (blue curve) gradually increases. Additionally, as model complexity increases, the squared bias (red curve) decreases. Thus there is a tradeoff between bias and variance that comes with model complexity: models that are too complex will have high variance and low bias; models that are too simple will have high bias and low variance. The best model will have both low bias and low variance. In this example, we highlight the best estimator in terms of prediction error on the testing set (black curve) with a dashed black vertical line. The best estimator corresponds to a polynomial model of order of $N=3$. Notice that the vertical black line is located where function defined by the sum of the squared bias and variance (dashed magenta curve) is also at a minimum.

Notice also how the sum of the squared bias and variance also has the same shape as curve defined by the prediction error on the testing set. This exemplifies how the error on novel data can be used as a proxy for determining the best estimator from a candidate set based on squared bias and variance. The noise term in Equation (6) is also represented in the figure by the vertical displacement between the black curve and dashed magenta curve.

It is very important to point out that all of these results are based on evaluating prediction error on novel data, not used to estimate model parameters. In fact assessing a model performance based on prediction error calculated on the same data used to estimate the model parameters is highly problematic, as it causes models to always overfit. In plain terms, this means that we will always favor a more complex model if we assess goodness of model fits on the training data, as a more complex model will be better able to capture small, random trends in the data due to noise.

This overfitting phenomenon is demonstrated below. We plot the error calculated on the training set (Train Error, green curve) along with the error calculated on the testing set (Test Error, black curve) for the above simulation. We also identify the best estimator as we did above.

% DISPLAY
figure, hold on;
plot(trainError,'g','Linewidth',2);
plot(testError,'k','Linewidth',2);
yl = ylim;
plot([bestModel,bestModel],[yl(1),yl(2)],'k--');
xlim([1,nPolyMax]);
xlabel('Model Complexity (Polynomial Order)')
legend('Train Error','Test Error','Best Model');
hold off;

We see here that as model complexity increases, the error calculated on the training set continues to decrease, where as the error on the testing set increases past the optimal polynomial order $N=3$. We  showed above that error calculated on the testing set is the true indicator of how well an estimator will generalize to new data points. The error calculated on the training set strongly disagrees with the error calculated on the testing set after the optimal model complexity has been reached. Since, in general, the whole point of modeling a data set is to generalize to novel data, assessing model predictions on the training set data should be avoided.

## Wrapping Up

Here we discussed how the bias and variance of an estimator are related to squared prediction error. Though we focused on regression, these concepts can also be applied to classification problems. We found that an optimal estimator will have both low variance and low bias. We further found that information about squared bias and variance is contained in expected prediction error calculated on a testing set of data not used to fit a model’s parameters.

The concepts of estimator bias and variance are generally only clear in the context of an ensemble of datasets. However, in real-world applications, there is generally only a single observed dataset. In such cases the roles of bias and variance are less obvious (though, it is possible to calculate estimates of variance and bias using resampling methods such as bootstrapping). However, the direct connection we made between bias, variance with the mean-squared error calculated on a testing set give us a direct means for assessing a group of candidate estimators in light of a single data set. We only need to partition the available data set into separate portions: one used to fit model parameters (a training set), and another used to assess prediction accuracy (testing set). Comparing prediction accuracy across potential estimators is equivalent to assessing biases and variances of the estimators across many datasets. Note that resampling methods such as cross-validation can prove helpful here, particularly when the amount of observed data is small.

Note, all the code for this post, containe in a single script is here:

clear
clc
close all

% CREATE A RANDOM DATASET
rng(10)
nData = 10; % N
x = 2*(rand(1,nData)-.5);

xGrid = linspace(-1,1,100);

% DEFINE AND TARGET FUNCTION f(x)
f = inline('sin(pi*x)','x');

h = [];
h(1) = plot(xGrid,f(xGrid),'k','Linewidth',2);
xlabel('x')
hold on

% DEFINE AND DISPLAY y
noiseSTD = .1;
y = f(x) + noiseSTD*randn(size(x));
h(2) = scatter(x,y,'ko');
legend(h,{'f(x)','y'},'Location','Northwest');

% FIT POLYNOMIAL MODELS & DISPLAY
% (ASSUMING PREVIOUS PLOT ABOVE STILL AVAILABLE)
degree = [1,3,10];
theta = {};
cols = [.8 .05 0.05; 0.05 .6 0.05; 0.05 0.05 .6];
for iD = 1:numel(degree)
figure(1)
theta{iD} = polyfit(x,y,degree(iD));
fit{iD} = polyval(theta{iD},xGrid);
h(end+1) = plot(xGrid,fit{iD},'color',cols(iD,:),'Linewidth',2);
xlim([-1 1])
ylim([-1 1])
end
legend(h,'f(x)','y','g_1(x)','g_3(x)','g_{10}(x)','Location','Northwest')

% FIT MODELS TO K INDEPENDENT DATASETS
K = 50;
for iS = 1:K
ySim = f(x) + noiseSTD*randn(size(x));
for jD = 1:numel(degree)
% FIT THE MODEL USING polyfit.m
thetaTmp = polyfit(x,ySim,degree(jD));
% EVALUATE THE MODEL FIT USING polyval.m
simFit{jD}(iS,:) = polyval(thetaTmp,xGrid);
end
end

% DISPLAY ALL THE MODEL FITS
h = [];
for iD = 1:numel(degree)
figure(iD+1)
hold on
% PLOT THE FUNCTION FIT TO EACH DATASET
for iS = 1:K
h(1) = plot(xGrid,simFit{iD}(iS,:),'color',brighten(cols(iD,:),.6));
end
% PLOT THE AVERAGE FUNCTION ACROSS ALL FITS
h(2) = plot(xGrid,mean(simFit{iD}),'color',cols(iD,:),'Linewidth',5);
% PLOT THE UNDERLYING FUNCTION f(x)
h(3) = plot(xGrid,f(xGrid),'color','k','Linewidth',3);
% CALCULATE THE SQUARED ERROR AT EACH POINT, AVERAGED ACROSS ALL DATASETS
squaredError = (mean(simFit{iD})-f(xGrid)).^2;
% PLOT THE SQUARED ERROR
h(4) = plot(xGrid,squaredError,'k--','Linewidth',3);
uistack(h(2),'top')
hold off
axis square
xlim([-1 1])
ylim([-1 1])
legend(h,{sprintf('Individual g_{%d}(x)',degree(iD)),'Mean of All Fits','f(x)','Squared Error'},'Location','WestOutside')
title(sprintf('Model Order=%d',degree(iD)))
end

N = 25; % # OF OBSERVATIONS PER DATASET
K = 100;% # OF DATASETS
noiseSTD = .5; % NOISE STANDARDE DEV.
nTrain = ceil(N*.9); % # OF TRAINING POINTS
nPolyMax = 12; % MAXIMUM MODEL COMPLEXITY

% # INITIALIZE SOME VARIABLES
xGrid = linspace(-1,1,N);
meanPrediction = zeros(K,N);
thetaHat = {};
x = linspace(-1,1,N);
x = x(randperm(N));
for iS = 1:K % LOOP OVER DATASETS

% CREATE OBSERVED DATA, y
y = f(x) + noiseSTD*randn(size(x));

% CREATE TRAINING SET
xTrain = x(1:nTrain);
yTrain = y(1:nTrain);

% CREATE TESTING SET
xTest = x(nTrain+1:end);
yTest = y(nTrain+1:end);

% FIT MODELS
for jD = 1:nPolyMax

% MODEL PARAMETER ESTIMATES
thetaHat{jD}(iS,:) = polyfit(xTrain,yTrain,jD);

% PREDICTIONS
yHatTrain{jD}(iS,:) = polyval([thetaHat{jD}(iS,:)],xTrain); % TRAINING SET
yHatTest{jD}(iS,:) = polyval([thetaHat{jD}(iS,:)],xTest);% TESTING SET

% MEAN SQUARED ERROR
trainErrors{jD}(iS) = mean((yHatTrain{jD}(iS,:) - yTrain).^2); % TRAINING
testErrors{jD}(iS) = mean((yHatTest{jD}(iS,:) - yTest).^2); % TESTING
end
end

% CALCULATE AVERAGE PREDICTION ERROR, BIAS, AND VARIANCE
for iD = 1:nPolyMax
trainError(iD) = mean(trainErrors{iD});
testError(iD) = mean(testErrors{iD});
biasSquared(iD) = mean((mean(yHatTest{iD})-f(xTest)).^2);
variance(iD) = mean(var(yHatTest{iD},1));
end
[~,bestModel] = min(testError);

% DISPLAY
figure;
hold on;
plot(testError,'k','Linewidth',2);
plot(biasSquared,'r','Linewidth',2);
plot(variance,'b','Linewidth',2);
plot(biasSquared + variance,'m-.','Linewidth',2);
yl = ylim;
plot([bestModel,bestModel],[yl(1),yl(2)],'k--');
xlim([1,nPolyMax]);
xlabel('Model Complexity (Polynomial Order)')
legend('Test Error','Bias^2','Variance','Bias^2+Var.','Best Model')
hold off;

% DISPLAY
figure, hold on;
plot(trainError,'g','Linewidth',2);
plot(testError,'k','Linewidth',2);
yl = ylim;
plot([bestModel,bestModel],[yl(1),yl(2)],'k--');
xlim([1,nPolyMax]);
xlabel('Model Complexity (Polynomial Order)')
legend('Train Error','Test Error','Best Model');
hold off;

## The Statistical Whitening Transform

In a number of modeling scenarios, it is beneficial to transform the to-be-modeled data such that it has an identity covariance matrix, a procedure known as Statistical Whitening. When data have an identity covariance, all dimensions are statistically independent, and the variance of the data along each of the dimensions is equal to one. (To get a better idea of what an identity covariance entails, see the following post.)

Enforcing statistical independence is useful for a number of reasons. For example, in probabilistic models of data that exist in multiple dimensions, the joint distribution–which may be very complex and difficult to characterize–can factorize into a product of many simpler distributions when the dimensions are statistically independent. Forcing all dimensions to have unit variance is also useful. For instance, scaling all variables to have the same variance treats each dimension with equal importance.

In the remainder of this post we derive how to transform data such that it has an identity covariance matrix, give some examples of applying such a transformation to real data, and address some interpretations of statistical whitening in the scope of theoretical neuroscience.

## Decorrelation: Transforming Data to Have a Diagonal Covariance Matrix

Let’s say we have some data matrix $X$ composed of $K$ dimensions and $n$ observations ($X$ has  size $[K \times n]$).  Let’s also assume that the rows of $X$ have been centered (the mean has been subracted across all observations) . The covariance $\Sigma$ of each of the dimensions with respect to the other is

$\Sigma = Cov(X) = \mathbb E[X X^T]$                                                                                        (1)

Where the covariance $\mathbb E[X X^T]$ can be estimated from the data matrix as follows:

$\mathbb E[X X^T] \approx \frac{X X^T}{n}$                                                                                            (2)

The covariance matrix $\Sigma$, by definition (Equation 2) is symmetric and positive semi-definite (if you don’t know what that means, don’t worry it’s not terribly important for this discussion). Thus we can write the matrix as the product of two simpler matrices $E$ and $D$, using a procedure known as Eigenvalue Decomposition:

$\Sigma = EDE^{-1}$                                                                                                 (3)

The matrix $E$ is an $[K \times K]$-sized matrix, where each column is an eigenvector of $\Sigma$, and $D$ is a diagonal matrix whose diagonal elements $D_{ii}$ are eigenvalues that correspond to the eigenvectors of the $i$-th column of $E$.  For more details on eigenvectors and eigenvalues see the following. From Equation (3), and using a little algebra, we can transform $\Sigma$ into the diagonal matrix $D$

$E^{-1} \Sigma E = D$                                                                                                 (4)

Now, imagine the goal is to transform the data matrix $X$ into a new data matrix $Y$

$Y = W_DX$                                                                                                   (5)

whose dimensions are uncorrelated (i.e. $Y$ has a diagonal covariance $D$). Thus we want to determine the transformation $W_D$ that makes:

$D = Cov(Y) = \mathbb E[YY^T]$                                                                                   (6)

Here we derive the expression for $W_D$ using Equations (2), (4), (5), and (6):

$D = \frac{W_DX(W_DX)^T}{n}$                                                       (a la Equations (5) and (6))

$D = W_D W_D^T \Sigma$                                                                       (via Equation (2))

$E^{-1}\Sigma E = W_D W_D^T \Sigma$                                                                   (via Equation (4))

$\Sigma^{-1}E^{-1} \Sigma E = \Sigma^{-1}W_D W_D^T \Sigma$

now, because $E^{-1} = E^T$                                             (see following link for details)

$E^TE = W_DW_D^T$ and thus

$W_D = E^T$                                                                                                   (7)

This means that we can transform $X$ into an uncorrelated (i.e. orthogonal) set of variables by premultiplying data matrix $X$ with the transpose of the the eigenvectors of data covariance matrix $\Sigma$.

## Whitening: Transforming data to have an Identity Covariance matrix

Ok, so now we have a way of transforming our data so that the dimensions are uncorrelated. However, this only gives us a diagonal covariance matrix, not an Identity covariance matrix. In order to obtain an Identity covariance, we also need to scale each dimension so that its variance is equal to one. How can we determine this transformation? We know how to transform our data so that the covariance is equal to $D$. If we can determine the transformation that leaves $D = I$, then we can apply this transformation to our decorrelated covariance to give us the desired whitening transform. We can determine this from the somewhat trivial notion that

$D^{-1}D = I$                                                                                                        (8)

and further that

$D^{-1} = D^{-1/2}ID^{-1/2}$                                                                                             (9)

Now, using Equation (4) along with Equation (8), we can see that

$D^{-1/2}E^{-1}\Sigma E D^{-1/2} = I$                                                                                      (10)

Now say that we define a variable $Y = W_W X$, where $W_W$ is the desired whitening transform, that leaves the covariance of $Y$ equal to the identity matrix. Using essentially the same set of derivation steps as above to solve for $W_D$, but starting from Equation (9) we find that

$W_W = D^{-1/2}E^T$                                                                                                  (11)

$= D^{-1/2}W_D$                                                                                                 (12)

Thus, the whitening transform is simply the decorrelation transform, but scaled by the inverse of the square root of the $D$ (here the inverse and square root can be performed element-wise because $D$ is a diagonal matrix).

## Interpretation of the Whitening Transform

So what does the whitening transformation actually do to the data (below, blue points)? We investigate this transformation below: The first operation decorrelates the data by premultiplying the data with the eigenvector matrix $E^T$, calculated from the data covariance. This decorrelation can be thought of as a rotation that reorients the data so that the principal axes of the data are aligned with the axes along which the data has the largest (orthogonal) variance. This rotation is essentially the same procedure as the oft-used Principal Components Analysis (PCA), and is shown in the middle row.

The second operation, scaling by $D^{-1/2}$ can be thought of squeezing the data–if the variance along a dimension is larger than one–or stretching the data–if the variance along a dimension is less than one. The stretching and squeezing forms the data into a sphere about the origin (which is why whitening is also referred to as “sphering”). This scaling operation is depicted in the bottom row in the plot above.

The MATLAB to make make the plot above is here:

% INITIALIZE SOME CONSTANTS
mu = [0 0];
S = [1 .9; .9 3];

% SAMPLE SOME DATAPOINTS
nSamples = 1000;
samples = mvnrnd(mu,S,nSamples)';

% WHITEN THE DATA POINTS...
[E,D] = eig(S);

% ROTATE THE DATA
samplesRotated = E*samples;

% TAKE D^(-1/2)
D = diag(diag(D).^-.5);

% SCALE DATA BY D
samplesRotatedScaled = D*samplesRotated;

% DISPLAY
figure;

subplot(311);
plot(samples(1,:),samples(2,:),'b.')
axis square, grid
xlim([-5 5]);ylim([-5 5]);
title('Original Data');

subplot(312);
plot(samplesRotated(1,:),samplesRotated(2,:),'r.'),
axis square, grid
xlim([-5 5]);ylim([-5 5]);
title('Decorrelate: Rotate by V');

subplot(313);
plot(samplesRotatedScaled(1,:),samplesRotatedScaled(2,:),'ko')
axis square, grid
xlim([-5 5]);ylim([-5 5]);
title('Whiten: scale by D^{-1/2}');

The transformation in Equation (11) and implemented above  whitens the data but leaves the data aligned with principle axes of the original data. In order to observe the data in the original space, it is often customary “un-rotate” the data back into it’s original space. This is done by just multiplying the whitening transform by the inverse of the rotation operation defined by the eigenvector matrix. This gives the whitening transform:

$W =E^{-1}D^{-1/2}E^T$                                                                                                   (13)

Let’s take a look an example of using statistical whitening for a more complex problem: whitening patches of images sampled from natural scenes.

## Example: Whitening Natural Scene Image Patches

Modeling the local spatial structure of pixels in natural scene images is important in many fields including computer vision and computational neuroscience. An interesting model of natural scenes is one that can account for interesting, high-order statistical dependencies between pixels. However, because natural scenes are generally composed of continuous objects or surfaces, a vast majority of the spatial correlations in natural image data can be explained by local pairwise dependencies. For example, observe the image below.

% LOAD AND DISPLAY A NATURAL IMAGE
figure
imagesc(im); colormap gray; axis image; axis off;
title('Base Image')

Given one of the gray pixels in the upper portion of the image, it is very likely that all pixels within the local neighborhood will also be gray. Thus there is a large amount of correlation between pixels in local regions of natural scenes. Statistical models of local structure applied to natural scenes will be dominated by these pairwise correlations, unless they are removed by preprocessing. Whitening provides such a preprocessing procedure.

Below we create and display a dataset of local image patches of size $16 \times 16$ extracted at random from the image above. Each patch is rastered out into a column vector of size $(16)16 \times 1$. Each of these patches can be thought of as samples of the local structure of this natural scene. Below we use the whitening transformation to remove pairwise correlations between pixels in each patch and scale the variance of each pixel to be one.

On the left is the dataset of extracted image patches, along with the corresponding covariance matrix for the image patches on the right. The large local correlation within the neighborhood of each pixel is indicated by the large bright diagonal regions throughout the covariance matrix.

The MATLAB code to extract and display the patches shown above is here:

% CREATE PATCHES DATASET FROM NATURAL IMAGE
rng(12345)
imSize = 256;
nPatches = 400;  % (MAKE SURE SQUARE)
patchSize = 16;
patches = zeros(patchSize*patchSize,nPatches);
patchIm = zeros(sqrt(nPatches)*patchSize);

% PAD IMAGE FOR EDGE EFFECTS

% EXTRACT PATCHES...
for iP = 1:nPatches
pix = ceil(rand(2,1)*imSize);
rows = pix(1):pix(1)+patchSize-1;
cols = pix(2):pix(2)+patchSize-1;
tmp = im(rows,cols);
patches(:,iP) = reshape(tmp,patchSize*patchSize,1);
rowIdx = (ceil(iP/sqrt(nPatches)) - 1)*patchSize + ...
1:ceil(iP/sqrt(nPatches))*patchSize;
colIdx = (mod(iP-1,sqrt(nPatches)))*patchSize+1:patchSize* ...
((mod(iP-1,sqrt(nPatches)))+1);
patchIm(rowIdx,colIdx) = tmp;
end

% CENTER IMAGE PATCHES
patchesCentered = bsxfun(@minus,patches,mean(patches,2));

% CALCULATE COVARIANCE MATRIX
S = patchesCentered*patchesCentered'/nPatches;

% DISPLAY PATCHES
figure;
subplot(121);
imagesc(patchIm);
axis image; axis off; colormap gray;
title('Extracted Patches')

% DISPLAY COVARIANCE
subplot(122);
imagesc(S);
axis image; axis off; colormap gray;
title('Extracted Patches Covariance')

Below we implement the whitening transformation described above to the extracted image patches and display the whitened patches that result.

On the left, we see that the whitening procedure zeros out all areas in the extracted patches that have the same value (zero is indicated by gray). The whitening procedure also boosts the areas of high-contrast (i.e. edges). The right plots the covariance matrix for the whitened patches. The covarance matrix is diagonal, indicating that pixels are now independent. In addition, all diagonal entries have the same value, indicating the that all pixels now have the same variance (i.e. 1). The MATLAB code used to whiten the image patches and create the display above is here:

%% MAIN WHITENING

% DETERMINE EIGENECTORS & EIGENVALUES
% OF COVARIANCE MATRIX
[E,D] = eig(S);

% CALCULATE D^(-1/2)
d = diag(D);
d = real(d.^-.5);
D = diag(d);

% CALCULATE WHITENING TRANSFORM
W = E*D*E';

% WHITEN THE PATCHES
patchesWhitened = W*patchesCentered;

% DISPLAY THE WHITENED PATCHES
wPatchIm = zeros(size(patchIm));
for iP = 1:nPatches
rowIdx = (ceil(iP/sqrt(nPatches)) - 1)*patchSize + 1:ceil(iP/sqrt(nPatches))*patchSize;
colIdx = (mod(iP-1,sqrt(nPatches)))*patchSize+1:patchSize* ...
((mod(iP-1,sqrt(nPatches)))+1);
wPatchIm(rowIdx,colIdx) = reshape(patchesWhitened(:,iP),...
[patchSize,patchSize]);
end

figure
subplot(121);
imagesc(wPatchIm);
axis image; axis off; colormap gray; caxis([-5 5]);
title('Whitened Patches')

subplot(122);
imagesc(cov(patchesWhitened'));
axis image; axis off; colormap gray; %colorbar
title('Whitened Patches Covariance');

## Investigating the Whitening Matrix: implications for theoretical neuroscience

So what does the whitening matrix look like, and what does it do? Below is the whitening matrix $W$ calculated for the image patches dataset:

% DISPLAY THE WHITENING MATRIX
figure; imagesc(W);
axis image; colormap gray; colorbar
title('The Whitening Matrix W')

Each column of $W$ is the operation that scales the variance of the corresponding pixel to be equal to one and forces that pixel independent of the others in the $16 \times 16$ patch. So what exactly does such an operation look like? We can get an idea by reshaping a column of W back into the shape of the image patches. Below we show what the 86th column of W looks like when reshaped in such a way (the index 86 has no particular significance, it was chosen at random):

% DISPLAY A COLUMN OF THE WHITENING MATRIX
figure; imagesc(reshape(W(:,86),16,16)),
colormap gray,
axis image, colorbar
title('Column 86 of W')

We see that the operation is essentially an impulse centered on the 86th pixel in the image (counting pixels starting in the upper left corner, proceeding down columns). This impulse is surrounded by inhibitory weights. If we were to look at the remaining columns of $W$, we would find that that the same center-surround operation is being replicated at every pixel location in each image patch. Essentially, the whitening transformation is performing a convolution of each image patch with a center-surround filter whose properties are estimated from the patches dataset. Similar techniques are common in computer vision edge-detection algorithms.

## Implications for theoretical neuroscience

A theoretical function of the primate retina is data compression: a large number of photoreceptors  pass data from the retina into a physiological bottleneck, the optic nerve, which has far fewer fibers than retinal photoreceptors. Thus removing redundant information is an important task that the retina must perform. When observing the whitened image patches above, we see that redundant information is nullified; pixels that have similar local values to one another are zeroed out. Thus, statistical whitening is a viable form of data compression

It turns out that there is a large class of ganglion cells in the retina whose spatial receptive fields exhibit…that’s right center-surround activation-inhibition like the operation of the whitening matrix shown above! Thus it appears that the primate visual system may be performing data compression at the retina by means of a similar operation to statistical whitening. Above, we derived the center-surround whitening operation based on data sampled from a natural scene. Thus it is seems reasonable that the primate visual system may have evolved a similar data-compression mechanism through experience with natural scenes, either through evolution, or development.

## Covariance Matrices and Data Distributions

Correlation between variables in a $K$-dimensional dataset are often summarized by a $K \times K$ covariance matrix. To get a better understanding of how correlation matrices characterize correlations between data points, we plot data points drawn from 3 different 2-dimensional Gaussian distributions, each of which is defined by a different covariance matrix.

The left plots below display the $2 \times 2$ covariance matrix for each Gaussian distribution. The values along the diagonal represent the variance of the data along each dimension, and the off-diagonal values represent the covariances between the dimensions. Thus the $i,j$-th entry of each matrix represents the correlation between the $i$-th and $j$-th dimensions. The right plots show data drawn from the corresponding 2D Gaussian.

The top row plot display a covariance matrix equal to the identity matrix, and the points drawn from the corresponding Gaussian distribution. The diagonal values are 1, indicating the data have variance of 1 along both of the dimensions. Additionally, the off-diagonal elements are zero, meaning that the two dimensions are uncorrelated.  We can see this in the data drawn from the distribution as well. The data are distributed in a sphere about origin. For such a distribution of points, it is difficult (impossible) to draw any single regression line that can predict the second dimension from the first, and vice versa. Thus an identity covariance matrix is equivalent to having independent dimensions, each of which has unit (i.e. 1) variance. Such a dataset is often called “white” (this naming convention comes from the notion that white noise signals–which can be sampled from independent Gaussian distributions–have equal power at all frequencies in the Fourier domain).

The middle row plots the points that result from a diagonal, but not identity covariance matrix. The off-diagonal elements are still zero, indicating that the dimensions are uncorrelated. However, the variances along each dimension are not equal to one, and are not equal. This is demonstrated by the elongated distribution in red. The elongation is along the second dimension, as indicated by the larger value in the bottom-right (point $(i,j) = (2,2)$) of the covariance matrix.

The bottom row plots points that result from a non-diagonal covariance matrix. Here the off-diagonal elements of covariance matrix have non-zero values, indicating a correlation between the dimensions. This correlation is reflected in the distribution of drawn datapoints (in blue). We can see that the primary axis along which the points are distributed is not along either of the dimensions, but a linear combination of the dimensions.

The MATLAB code to create the above plots is here

% INITIALIZE SOME CONSTANTS
mu = [0 0];         % ZERO MEAN
S = [1 .9; .9 3];   % NON-DIAGONAL COV.
SDiag = [1 0; 0 3]; % DIAGONAL COV.
SId = eye(2);       % IDENTITY COV.

% SAMPLE SOME DATAPOINTS
nSamples = 1000;
samples = mvnrnd(mu,S,nSamples)';
samplesId = mvnrnd(mu,SId,nSamples)';
samplesDiag = mvnrnd(mu,SDiag,nSamples)';

% DISPLAY
subplot(321);
imagesc(SId); axis image,
caxis([0 1]), colormap hot, colorbar
title('Identity Covariance')

subplot(322)
plot(samplesId(1,:),samplesId(2,:),'ko'); axis square
xlim([-5 5]), ylim([-5 5])
grid
title('White Data')

subplot(323);
imagesc(SDiag); axis image,
caxis([0 3]), colormap hot, colorbar
title('Diagonal Covariance')

subplot(324)
plot(samplesDiag(1,:),samplesDiag(2,:),'r.'); axis square
xlim([-5 5]), ylim([-5 5])
grid
title('Uncorrelated Data')

subplot(325);
imagesc(S); axis image,
caxis([0 3]), colormap hot, colorbar
title('Non-diagonal Covariance')

subplot(326)
plot(samples(1,:),samples(2,:),'b.'); axis square
xlim([-5 5]), ylim([-5 5])
grid
title('Correlated Data')

## fMRI In Neuroscience: Efficiency of Event-related Experiment Designs

Event-related fMRI experiments are used to detect selectivity in the brain to stimuli presented over short durations. An event is generally modeled as an impulse function that occurs at the onset of the stimulus in question. Event-related designs are flexible in that many different classes of stimuli can be intermixed. These designs can minimize confounding behavioral effects due to subject adaptation or expectation. Furthermore, stimulus onsets can be modeled at frequencies that are shorter than the repetition time (TR) of the scanner. However, given such flexibility in design and modeling, how does one determine the schedule for presenting a series of stimuli? Do we space out stimulus onsets periodically across a scan period? Or do we randomize stimulus onsets? Furthermore what is the logic for or against either approach? Which approach is more efficient for gaining incite into the selectivity in the brain?

## Simulating Two fMRI Experiments: Periodic and Random Stimulus Onsets

To get a better understanding of the problem of choosing efficient experiment design, let’s simulate two simple fMRI experiments. In the first experiment, a stimulus is presented periodically 20 times, once every 4 seconds, for a run of 80 seconds in duration. We then simulate a noiseless BOLD signal evoked in a voxel with a known HRF. In the second experiment, we simulate the noiseless BOLD signal evoked by 20 stimulus onsets that occur at random times over the course of the 80 second run duration.  The code for simulating the signals and displaying output are shown below:

rand('seed',12345);
randn('seed',12345);
TR = 1 % REPETITION TIME
t = 1:TR:20; % MEASUREMENTS
h = gampdf(t,6) + -.5*gampdf(t,10); % ACTUAL HRF
h = h/max(h); % SCALE TO MAX OF 1

% SOME CONSTANTS...
trPerStim = 4; % # TR PER STIMULUS FOR PERIODIC EXERIMENT
nRepeat = 20; % # OF TOTAL STIMULI SHOWN
nTRs = trPerStim*nRepeat
stimulusTrain0 = zeros(1,nTRs);

beta = 3; % SELECTIVITY/HRF GAIN

% SET UP TWO DIFFERENT STIMULUS PARADIGM...
% A. PERIODIC, NON-RANDOM STIMULUS ONSET TIMES
D_periodic = stimulusTrain0;
D_periodic(1:trPerStim:trPerStim*nRepeat) = 1;

% UNDERLYING MODEL FOR (A)
X_periodic = conv2(D_periodic,h);
X_periodic = X_periodic(1:nTRs);
y_periodic = X_periodic*beta;

% B. RANDOM, UNIFORMLY-DISTRIBUTED STIMULUS ONSET TIMES
D_random = stimulusTrain0;
randIdx = randperm(numel(stimulusTrain0)-5);
D_random(randIdx(1:nRepeat)) = 1;

% UNDERLYING MODEL FOR (B)
X_random = conv2(D_random,h);
X_random = X_random(1:nTRs);
y_random = X_random*beta;

% DISPLAY STIMULUS ONSETS AND EVOKED RESPONSES
% FOR EACH EXPERIMENT
figure
subplot(121)
stem(D_periodic,'k');
hold on;
plot(y_periodic,'r','linewidth',2);
xlabel('Time (TR)');
title(sprintf('Responses Evoked by\nPeriodic Stimulus Onset\nVariance=%1.2f',var(y_periodic)))

subplot(122)
stem(D_random,'k');
hold on;
plot(y_random,'r','linewidth',2);
xlabel('Time (TR)');
title(sprintf('Responses Evoked by\nRandom Stimulus Onset\nVariance=%1.2f',var(y_random)))

BOLD signals evoked by periodic (left) and random (right) stimulus onsets.

The black stick functions in the simulation output indicate the stimulus onsets and each red function is the simulated noiseless BOLD signal to those stimuli. The first thing to notice is the dramatically different variances of the BOLD signals evoked for the two stimulus presentation schedules. For the periodic stimuli, the BOLD signal quickly saturates, then oscillates around an effective baseline activation. The estimated variance of the periodic-based signal is 0.18. In contrast, the signal evoked by the random stimulus presentation schedule varies wildly, reaching a maximum amplitude that is roughly 2.5 times as large the maximum amplitude of the signal evoked by periodic stimuli. The estimated variance of the signal evoked by the random stimuli is 7.4, roughly 40 times the variance of the signal evoked by the periodic stimulus.

So which stimulus schedule allows us to better estimate the HRF and, more importantly, the amplitude of the HRF, as it is the amplitude that is the common proxy for voxel selectivity/activation? Below we repeat the above experiment 50 times. However, instead of simulating noiseless BOLD responses, we introduce 50 distinct, uncorrelated noise conditions, and from the simulated noisy responses, we estimate the HRF using an FIR basis set for each  repeated trial. We then compare the estimated HRFs across the 50 trials for the periodic and random stimulus presentation schedules. Note that for each trial, the noise is exactly the same for the two stimulus presentation schedules. Further, we simulate a selectivity/tuning gain of 3 times the maximum HRF amplitude and assume that the HRF to be estimated is 16 TRs/seconds in length. The simulation and output are below:

%% SIMULATE MULTIPLE TRIALS OF EACH EXPERIMENT
%% AND ESTIMATE THE HRF FOR EACH
%% (ASSUME THE VARIABLES DEFINED ABOVE ARE IN WORKSPACE)

% CREATE AN FIR DESIGN MATRIX
% FOR EACH EXPERIMENT
hrfLen = 16;  % WE ASSUME TO-BE-ESTIMATED HRF IS 16 TRS LONG

% CREATE FIR DESIGN MATRIX FOR THE PERIODIC STIMULI
X_FIR_periodic = zeros(nTRs,hrfLen);
onsets = find(D_periodic);
idxCols = 1:hrfLen;
for jO = 1:numel(onsets)
idxRows = onsets(jO):onsets(jO)+hrfLen-1;
for kR = 1:numel(idxRows);
X_FIR_periodic(idxRows(kR),idxCols(kR)) = 1;
end
end
X_FIR_periodic = X_FIR_periodic(1:nTRs,:);

% CREATE FIR DESIGN MATRIX FOR THE RANDOM STIMULI
X_FIR_random = zeros(nTRs,hrfLen);
onsets = find(D_random);
idxCols = 1:hrfLen;
for jO = 1:numel(onsets)
idxRows = onsets(jO):onsets(jO)+hrfLen-1;
for kR = 1:numel(idxRows);
X_FIR_random(idxRows(kR),idxCols(kR)) = 1;
end
end
X_FIR_random = X_FIR_random(1:nTRs,:);

% SIMULATE AND ESTIMATE HRF WEIGHTS VIA OLS
nTrials = 50;

% CREATE NOISE TO ADD TO SIGNALS
% NOTE: SAME NOISE CONDITIONS FOR BOTH EXPERIMENTS
noiseSTD = beta*2;
noise = bsxfun(@times,randn(nTrials,numel(X_periodic)),noiseSTD);

%% ESTIMATE HRF FROM PERIODIC STIMULUS TRIALS
beta_periodic = zeros(nTrials,hrfLen);
for iT = 1:nTrials
y = y_periodic + noise(iT,:);
beta_periodic(iT,:) = X_FIR_periodic\y';
end

% CALCULATE MEAN AND STANDARD ERROR OF HRF ESTIMATES
beta_periodic_mean = mean(beta_periodic);
beta_periodic_se = std(beta_periodic)/sqrt(nTrials);

%% ESTIMATE HRF FROM RANDOM STIMULUS TRIALS
beta_random = zeros(nTrials,hrfLen);
for iT = 1:nTrials
y = y_random + noise(iT,:);
beta_random(iT,:) = X_FIR_random\y';
end

% CALCULATE MEAN AND STANDARD ERROR OF HRF ESTIMATES
beta_random_mean = mean(beta_random);
beta_random_se = std(beta_random)/sqrt(nTrials);

% DISPLAY HRF ESTIMATES
figure
% ...FOR THE PERIODIC STIMULI
subplot(121);
hold on;
h0 = plot(h*beta,'k')
h1 = plot(beta_periodic_mean,'linewidth',2);
h2 = plot(beta_periodic_mean+beta_periodic_se,'r','linewidth',2);
plot(beta_periodic_mean-beta_periodic_se,'r','linewidth',2);
xlabel('Time (TR)')
legend([h0, h1,h2],'Actual HRF','Average \beta_{periodic}','Standard Error')
title('Periodic HRF Estimate')

% ...FOR THE RANDOMLY-PRESENTED STIMULI
subplot(122);
hold on;
h0 = plot(h*beta,'k');
h1 = plot(beta_random_mean,'linewidth',2);
h2 = plot(beta_random_mean+beta_random_se,'r','linewidth',2);
plot(beta_random_mean-beta_random_se,'r','linewidth',2);
xlabel('Time (TR)')
legend([h0,h1,h2],'Actual HRF','Average \beta_{random}','Standard Error')
title('Random HRF Estimate')

Estimated HRFs from 50 trials of periodic (left) and random (right) stimulus schedules

In the simulation outputs, the average HRF for the random stimulus presentation (right) closely follows the actual HRF tuning. Also, there is little variability of the HRF estimates, as is indicated by the small standard error estimates for each time points. As well, the selectivity/gain term is accurately recovered, giving a mean HRF with nearly the same amplitude as the underlying model. In contrast, the HRF estimated from the periodic-based experiment is much more variable, as indicated by the large standard error estimates. Such variability in the estimates of the HRF reduce our confidence in the estimate for any single trial. Additionally, the scale of the mean HRF estimate is off by nearly 30% of the actual value.

From these results, it is obvious that the random stimulus presentation rate gives rise to more accurate, and less variable estimates of the HRF function. What may not be so obvious is why this is the case, as there were the same number of stimuli and  the same number of signal measurements in each experiment. To get a better understanding of why this is occurring, let’s refer back to the variances of the evoked noiseless signals. These are the signals that are underlying the noisy signals used to estimate the HRF. When noise is added it impedes the detection of the underlying trends that are useful for estimating the HRF.  Thus it is important that the variance of the underlying signal is large compared to the noise so that the signal can be detected.

For the periodic stimulus presentation schedule, we saw that the variation in the BOLD signal was much smaller than the variation in the BOLD signals evoked during the randomly-presented stimuli. Thus the signal evoked by random stimulus schedule provide a better characterization of the underlying signal in the presence of the same amount of noise, and thus provide more information to estimate the HRF. With this in mind we can think of maximizing the efficiency of the an experiment design as maximizing the variance of the BOLD signals evoked by the experiment.

## An Alternative Perspective: The Frequency Power Spectrum

Another helpful interpretation is based on a signal processing perspective. If we assume that neural activity is directly correspondent with the onset of a stimulus event, then we can interpret the train of stimulus onsets as a direct signal of the evoked neural activity. Furthermore, we can interpret the HRF as a low-pass-filter that acts to “smooth” the available neural signal in time. Each of these signals–the neural/stimulus signal and the HRF filtering signal–has with it an associated power spectrum. The power spectrum for a signal captures the amount of power per unit time that the signal has as a particular frequency $\omega$. The power spectrum for a discrete signal can be calculated from the discrete Fourier transform (DFT) of the signal $F(\omega)$ as follows

$P(\omega) = | F(\omega)|^2$

Below, we use Matlab’s $\text{fft.m}$ function to calculate the DFT and the associated power spectrum for each of the stimulus/neural signals, as well as the HRF.

%% POWER SPECTRUM ANALYSES
%% (ASSUME THE VARIABLES DEFINED ABOVE ARE IN WORKSPACE)

% MAKE SURE WE PAD SUFFICIENTLY
% FOR CIRCULAR CONVOLUTION
N = 2^nextpow2(nTRs + numel(h)-1);
nUnique = ceil(1+N/2); % TAKE ONLY POSITIVE SPECTRA

% CALCULATE POWER SPECTRUM FOR PERIODIC STIMULI EXPERIMENT
ft_D_periodic = fft(D_periodic,N)/N; % DFT
P_D_periodic = abs(ft_D_periodic).^2; % POWER
P_D_periodic = 2*P_D_periodic(2:nUnique-1); % REMOVE ZEROTH & NYQUIST

% CALCULATE POWER SPECTRUM FOR RANDOM STIMULI EXPERIMENT
ft_D_random = fft(D_random,N)/N; % DFT
P_D_random = abs(ft_D_random).^2; % POWER
P_D_random = 2*P_D_random(2:nUnique-1); % REMOVE ZEROTH & NYQUIST

% CALCULATE POWER SPECTRUM OF HRF
ft_h = fft(h,N)/N; % DFT
P_h = abs(ft_h).^2; % POWER
P_h = 2*P_h(2:nUnique-1); % REMOVE ZEROTH & NYQUIST

% CREATE A FREQUENCY SPACE FOR PLOTTING
F = 1/N*[1:N/2-1];

% DISPLAY STIMULI POWER SPECTRA
figure
subplot(131)
hhd = plot(F,P_D_periodic,'b','linewidth',2);
axis square; hold on;
hhr = plot(F,P_D_random,'g','linewidth',2);
xlim([0 .3]); xlabel('Frequency (Hz)');
set(gca,'Ytick',[]); ylabel('Magnitude');
legend([hhd,hhr],'Periodic','Random')
title('Stimulus Power, P_{stim}')

% DISPLAY HRF POWER SPECTRUM
subplot(132)
plot(F,P_h,'r','linewidth',2);
axis square
xlim([0 .3]); xlabel('Frequency (Hz)');
set(gca,'Ytick',[]); ylabel('Magnitude');
title('HRF Power, P_{HRF}')

% DISPLAY EVOKED SIGNAL POWER SPECTRA
subplot(133)
hhd = plot(F,P_D_periodic.*P_h,'b','linewidth',2);
hold on;
hhr = plot(F,P_D_random.*P_h,'g','linewidth',2);
axis square
xlim([0 .3]); xlabel('Frequency (Hz)');
set(gca,'Ytick',[]); ylabel('Magnitude');
legend([hhd,hhr],'Periodic','Random')
title('Signal Power, P_{stim}.*P_{HRF}')

Power spectrum of neural/stimulus (left), HRF (center), and evoked BOLD (right) signals

On the left of the output we see the power spectra for the stimulus signals. The blue line corresponds to the spectrum for the periodic stimuli, and the green line the spectrum for the randomly-presented stimuli. The large peak in the blue spectrum corresponds to the majority of the stimulus power at 0.25 Hz for the periodic stimuli, as this the fundamental frequency of the periodic stimulus presentation (i.e. every 4 seconds). However, there is little power at any other stimulus frequencies. In contrast the green spectrum indicates that the random stimulus presentation has power at multiple frequencies.

If we interpret the HRF as a filter, then we can think of the HRF power spectrum as modulating the power spectrum of the neural signals to produce the power of the evoked BOLD signals. The power spectrum for the HRF is plotted in red in the center plot. Notice how a majority of the power for the HRF is at frequencies less than 0.1 Hz, and there is very little power at frequencies above 0.2 Hz. If the neural signal power is modulated by the HRF signal power, we see that there is little resultant power in the BOLD signals evoked by periodic stimulus presentation (blue spectrum in the right plot). In contrast, because the power for the neural signals evoked by random stimuli are spread across the frequency domain, there are a number of frequencies that overlap with those frequencies for which the HRF also has power. Thus after modulating neural/stimulus power with the HRF power, the spectrum of the BOLD signals evoked by the randomly-presented stimuli have much more power across the relevant frequency spectrum than those evoked by the periodic stimuli. This is indicated by the larger area under the green curve in the right plot.

Using the signal processing perspective allows us to directly gain perspective on the limitations of a particular experiment design which are rooted in the frequency spectrum of the HRF. Therefore, another way we can think of maximizing the efficiency of an experimental design is maximizing the amount of power in the resulting evoked BOLD responses.

## Yet Another Perspective Based in Statistics: Efficiency Metric

Taking a statistics-based approach leads to a formal definition of efficiency, and further, a nice metric for testing the efficiency of an experimental design. Recall that when determining the shape of the HRF, a common approach is to use the GLM model

$y = X \beta + \epsilon$

Here $y$ is the evoked BOLD signal and $X$ is a design matrix that links a set of linear model parameters $\beta$ to those responses. The variable $\epsilon$ is a noise term that is unexplained by the model. Using an FIR basis formulation of the model, the weights in $\beta$ represent the HRF to a stimulus condition.

Because fMRI data are a continuous time series, the underlying noise $\epsilon$ is generally correlated in time. We can model this noise as a Gaussian process with zero mean and a constant multivariate covariance $C_{\epsilon}$. Note that this is analogous to the Generalized Least Squares (GLS) formulation of the GLM. In general, the values that comprise $C_{\epsilon}$ are unknown and have to be estimated from the fMRI data themselves.

For a known or estimated noise covariance, the Maximum Likelihood Estimator (MLE) for the model parameters $\beta$(derivation not shown) is:

$\hat \beta = (X^TC_{\epsilon}^{-1}X)X^TC_{\epsilon}^{-1}y$

Because the ML estimator of the HRF is a linear combination of the design matrix $X$ and a set of corresponding responses, which are both random variables ($X$ can represent any possible experiment design, and $y$ is by definition random), the estimator is itself a random variable. It thus follows that the estimate for the HRF also has a variance. (We demonstrated how $\beta$ is a random variable in the 50 simulations above, where for each simulation X was held fixed, but due to the added noise $y$ was a random variable. For each noise condition, the estimate for $\beta$ took on different values.) We saw above how an HRF estimator with a large variance is undesirable, as it reduces our confidence in the estimates of the HRF shape and scale. Therefore we would like to determine an estimator that has a minimum overall variance.

A formal metric for efficiency of a least-squares estimator is directly related to the variance of the estimator. The efficiency is defined to be the inverse of the sum of the estimator variances. An estimator that has a large sum of variances will have a low efficiency, and vice versa. But how do we obtain the values of the variances for the estimator? The variances can be recovered from the diagonal elements of the estimator covariance matrix $C_{\hat \beta}$, giving the following definition for the efficiency, $E$

$E = 1/trace(C_{\hat \beta})$

In earlier post we found that the covariance matrix $C_{\hat \beta}$ for the GLS estimator (i.e. the formulation above) with a given noise covariance $C_{\epsilon}$ is:

$C_{\hat \beta} = (X^T C_{\epsilon}^{-1} X)^{-1}$.

Thus the efficiency for the HRF estimator is

$E = 1/trace((X^T C_{\epsilon}^{-1}X)^{-1})$

Here we see that the efficiency depends only on the known noise covariance (or an estimate of it), and the design matrix used in the model, but not the shape of the HRF. In general the noise covariance is out of the experimenter’s control (but see the take-homes below ), and must be dealt with post hoc. However, because the design matrix is directly related to the experimental design, the above expression gives a direct way to test the efficiency of experimental designs before they are ever used!

In the simulations above, the noise processes are drawn from an independent multivariate Gaussian distribution, therefore the noise covariance is equal to the identity (i.e. uncorrelated). We also estimated the HRF using the FIR basis set, thus our model design matrix was $X_{FIR}$. This gives the estimate the efficiency for the simulation experiments:

$E_{simulation} = 1/trace(X_{FIR}^T X_{FIR})$

Below we calculate the efficiency for the FIR estimates under the simulated experiments with periodic and random stimulus presentation designs.

%% ESTIMATE DESIGN EFFICIENCY
%% (ASSUME THE VARIABLES DEFINED ABOVE ARE IN WORKSPACE)

% CALCULATE EFFICIENCY OF PERIODIC EXPERIMENT
E_periodic = 1/trace(pinv(X_FIR_periodic'*X_FIR_periodic));

% CALCULATE EFFICIENCY OF RANDOM EXPERIMENT
E_random = 1/trace(pinv(X_FIR_random'*X_FIR_random));

% DISPLAY EFFICIENCY ESTIMATES
figure
bar([E_periodic,E_random]);
set(gca,'XTick',[1,2],'XTickLabel',{'E_periodic','E_random'});
title('Efficiency of Experimental Designs');
colormap hot;

Estimated efficiency for simulated periodic (left) and random (right) stimulus schedules.

Here we see that the efficiency metric does indeed indicate that the randomly-presented stimulus paradigm is far more efficient than the periodically-presented paradigm.

## Wrapping Up

In this post we addressed the efficiency of an fMRI experiment design. A few take-homes from the discussion are:

1. Randomize stimulus onset times. These onset times should take into account the low-pass characteristics (i.e. the power spectrum) of the HRF.
2. Try to model selectivity to events that occur close in time. The reason for this is that noise covariances in fMRI are highly non-stationary. There are many sources of low-frequency physiological noise such as breathing, pulse, blood pressure, etc, all of which dramatically effect the noise in the fMRI timecourses. Thus any estimate of noise covariances from data recorded far apart in time will likely be erroneous.
3. Check an experimental design against other candidate designs using the Efficiency metric.

Above there is mention of the effects of low-frequency physiological noise. Until now, our simulations have assumed that all noise is independent in time, greatly simplifying the picture of estimating HRFs and corresponding selectivity. However, in a later post we’ll address how to deal with more realistic time courses that are heavily influenced by sources of physiological noise. Additionally, we’ll tackle how to go about estimating the noise covariance $C_{\epsilon}$ from more realistic fMRI time series.

## Derivation: The Covariance Matrix of an OLS Estimator (and applications to GLS)

We showed in an earlier post that for the linear regression model

$y = X\beta + \epsilon$,

the optimal Ordinary Least Squares (OLS) estimator for model parameters $\beta$ is

$\hat \beta = (X^TX)^{-1}X^Ty$

However, because independent variables $X$ and responses $y$ can take on any value, they are both random variables. And, because $\hat \beta$ is a linear combination of $X$ and $y$, it is also a random variable, and therefore has a covariance. The definition of the covariance matrix $C_{\hat \beta}$ for the OLS estimator is defined as:

$C_{\hat \beta} = E[(\hat \beta - \beta)(\hat \beta - \beta)^T]$

where, $E[*]$ denotes the expected value operator. In order to find an expression for $C_{\hat \beta}$, we first need an expression for  $(\hat \beta - \beta)$. The following derives this expression:

$\hat \beta = (X^TX)^{-1}X^T(X\beta + \epsilon)$,

where we use the fact that

$y = X\beta + \epsilon$.

It follows that

$\hat \beta = (X^TX)^{-1}X^TX \beta + (X^TX)^{-1}\epsilon$

$\hat \beta = \beta + (X^TX)^{-1}X^T \epsilon$

and therefore

$(\hat \beta - \beta) = (X^TX)^{-1}X^T \epsilon$

Now following the original definition for $C_{\hat \beta}$

$C_{\hat \beta} = E[(\hat \beta - \beta)(\hat \beta - \beta)^T]$

$= E[(X^TX)^{-1}X^T\epsilon((X^TX)^{-1}X^T \epsilon)^T]$

$= E[(X^TX)^{-1}X^T\epsilon \epsilon^T X(X^TX)^{-1}]$

where we take advantage of $(AB)^T = B^T A^T$ in order to rewrite the second term in the product of the expectation. If we take $X$ to be fixed for a given estimator of $\hat \beta$ (in other words we don’t randomly resample the independent variables), then the expectation only depends on the remaining stochastic/random variable, namely $\epsilon$. Therefore the above expression can be written as

$C_{\hat \beta} = (X^TX)^{-1}X^T E[\epsilon \epsilon^T] X(X^TX)^{-1}$.

where $E[\epsilon \epsilon^T]$ is the covariance of the noise term in the model. Because OLS assumes uncorrelated noise, the noise covariance is equal to $\sigma^2 I$, where $\sigma^2$ is the variance along each dimension, and $I$ is an identity matrix of size equal to the number of dimensions. The expression for the estimator covariance is now:

$C_{\hat \beta} = (X^TX)^{-1}X^T (\sigma^2 I) X(X^TX)^{-1}$,

$= \sigma^2 I (X^TX)^{-1} X^T X(X^TX)^{-1}$

which simplifies to

$C_{\hat \beta} = \sigma^2 (X^T X)^{-1}$

A further simplifying assumption made by OLS that is often made is that $\epsilon$ is drawn from a zero mean multivariate Guassian distribution of unit variances (i.e. $\sigma^2 = 1$), resulting in a noise covariance equal to the identity. Thus

$C_{\hat \beta} = (X^TX)^{-1}$

## Applying the derivation results to Generalized Least Squares

Notice that the expression for the OLS estimator covariance is equal to first inverse term in the expression for the OLS estimator. Identitying the covariance for the OLS estimator in this way gives a helpful heuristic to easily identify the covariance of related estimators that do not make the simplifying assumptions about the covariance that are made in OLS. For instance in Generalized Least Squares (GLS), it is possible for the noise terms to co-vary. The covariance is represented as a noise covariance matrix $C_{\epsilon}$. This gives the model form

$y = X \beta + \epsilon$,

where $E[\epsilon | X] = 0; Var[\epsilon | X] = C_{\epsilon}$.

In otherwords, under GLS, the noise terms have zero mean, and covariance $C_{\epsilon}$.  It turns out that estimator for the GLS model parameters is

$\hat \beta_{GLS} = (X^T C_{\epsilon}^{-1} X)^{-1} X^T C_{\epsilon}^{-1}y$.

Notice the similarity between the GLS and OLS estimators. The only difference is that in GLS, the solution for the parameters is scaled by the inverse of the noise covariance. And, in a similar fashion to the OLS estimator, the covariance for the GLS estimator is first term in the product that defines the GLS estimator:

$C_{\hat \beta, GLS} = (X^T C_{\epsilon}^{-1}X)^{-1}$

## A Gentle Introduction to Markov Chain Monte Carlo (MCMC)

Applying probabilistic models to data usually involves integrating a complex, multi-dimensional probability distribution. For example, calculating the expectation/mean of a model distribution involves such an integration. Many (most) times, these integrals are not calculable due to the high dimensionality of the distribution or because there is no closed-form expression for the integral available using calculus. Markov Chain Monte Carlo (MCMC) is a method that allows one to approximate complex integrals using stochastic sampling routines. As MCMC’s name indicates, the method is composed of two components, the Markov chain and Monte Carlo integration.

Monte Carlo integration is a powerful technique that exploits stochastic sampling of the distribution in question in order to approximate the difficult integration. However, in order to use Monte Carlo integration it is necessary to be able to sample from the probability distribution in question, which may be difficult or impossible to do directly. This is where the second component of MCMC, the Markov chain, comes in. A Markov chain is a sequential model that transitions from one state to another in a probabilistic fashion, where the next state that the chain takes is conditioned on the previous state. Markov chains are useful in that if they are constructed properly, and allowed to run for a long time, the states that a chain will take also sample from a target probability distribution. Therefore we can construct Markov chains to sample from the distribution whose integral we would like to approximate, then use Monte Carlo integration to perform the approximation.

Here I introduce a series of posts where I describe the basic concepts underlying MCMC, starting off by describing Monte Carlo Integration, then giving a brief introduction of Markov chains and how they can be constructed to sample from a target probability distribution. Given these foundation principles, we can then discuss MCMC techniques such as the Metropolis and Metropolis-Hastings algorithms, the Gibbs sampler, and the Hybrid Monte Carlo algorithm.

As always, each post has a somewhat formal/mathematical introduction, along with an example and simple Matlab implementations of the associated algorithms.

## MCMC: The Gibbs Sampler

In the previous post, we compared using block-wise and component-wise implementations of the Metropolis-Hastings algorithm for sampling from a multivariate probability distribution$p(\bold x)$. Component-wise updates for MCMC algorithms are generally more efficient for multivariate problems than blockwise updates in that we are more likely to accept a proposed sample by drawing each component/dimension independently of the others. However, samples may still be rejected, leading to excess computation that is never used. The Gibbs sampler, another popular MCMC sampling technique, provides a means of avoiding such wasted computation. Like the component-wise implementation of the Metropolis-Hastings algorithm, the Gibbs sampler also uses component-wise updates. However, unlike in the Metropolis-Hastings algorithm, all proposed samples are accepted, so there is no wasted computation.

The Gibbs sampler is applicable for certain classes of problems, based on two main criterion. Given a target distribution $p(\bold x)$, where $\bold x = (x_1, x_2, \dots, x_D$, ),  The first criterion is 1) that it is necessary that we have an analytic (mathematical) expression for the conditional distribution of each variable in the joint distribution given all other variables in the joint. Formally, if the target distribution $p(\bold x)$ is $D$-dimensional, we must have $D$ individual expressions for

$p(x_i|x_1,x_2,\dots,x_{i-1},x_{i+1},\dots,x_D)$

$= p(x_i | x_j), j\neq i$.

Each of these expressions defines the probability of the $i$-th dimension given that we have values for all other ($j \neq i$) dimensions. Having the conditional distribution for each variable means that we don’t need a proposal distribution or an accept/reject criterion, like in the Metropolis-Hastings algorithm. Therefore, we can simply sample from each conditional while keeping all other variables held fixed. This leads to the second criterion 2) that we must be able to sample from each conditional distribution. This caveat is obvious if we want an implementable algorithm.

The Gibbs sampler works in much the same way as the component-wise Metropolis-Hastings algorithms except that instead drawing from a proposal distribution for each dimension, then accepting or rejecting the proposed sample, we simply draw a value for that dimension according to the variable’s corresponding conditional distribution. We also accept all values that are drawn. Similar to the component-wise Metropolis-Hastings algorithm, we step through each variable sequentially, sampling it while keeping all other variables fixed. The Gibbs sampling procedure is outlined below

1. set $t = 0$
2. generate an initial state $\bold x^{(0)} \sim \pi^{(0)}$
3. repeat until $t = M$

set $t = t+1$

for each dimension $i = 1..D$

draw $x_i$ from $p(x_i|x_1,x_2,\dots,x_{i-1},x_{i+1},\dots,x_D)$

To get a better understanding of the Gibbs sampler at work, let’s implement the Gibbs sampler to solve the same multivariate sampling problem addressed in the previous post.

### Example: Sampling from a bivariate a Normal distribution

This example parallels the examples in the previous post where we sampled from a 2-D Normal distribution using block-wise and component-wise Metropolis-Hastings algorithms. Here, we show how to implement a Gibbs sampler to draw samples from the same target distribution. As a reminder, the target distribution $p(\bold x)$ is a Normal form with following parameterization:

$p(\bold x) = \mathcal N (\bold{\mu}, \bold \Sigma)$

with mean

$\mu = [\mu_1,\mu_2]= [0, 0]$

and covariance

$\bold \Sigma = \begin{bmatrix} 1 & \rho_{12} \\ \rho_{21} & 1\end{bmatrix} = \begin{bmatrix} 1 & 0.8 \\ 0.8 & 1\end{bmatrix}$

In order to sample from this distribution using a Gibbs sampler, we need to have in hand the conditional distributions for variables/dimensions $x_1$ and $x_2$:

$p(x_1 | x_2^{(t-1)})$ (i.e. the conditional for the first dimension, $x_1$)

and

$p(x_2 | x_1^{(t)})$ (the conditional for the second dimension, $x_2$)

Where $x_2^{(t-1)}$ is the previous state of the second dimension, and $x_1^{(t)}$ is the state of the first dimension after drawing from $p(x_1 | x_2^{(t-1)})$. The reason for the discrepancy between updating $x_1$ and $x_2$ using states $(t-1)$ and $(t)$, can be is seen in step 3 of the algorithm outlined in the previous section. At iteration $t$ we first sample a new state for variable $x_1$ conditioned on the most recent state of variable $x_2$, which is from iteration $(t-1)$. We then sample a new state for the variable $x_2$ conditioned on the most recent state of variable $x_1$, which is now from the current iteration, $t$.

After some math (which which I will skip for some brevity, but see the following for some details), we find that the two conditional distributions for the target Normal distribution are:

$p(x_1 | x_2^{(t-1)}) = \mathcal N(\mu_1 + \rho_{21}(x_2^{(t-1)} - \mu_2),\sqrt{1-\rho_{21}^2})$

and

$p(x_2 | x_1^{(t)})=\mathcal N(\mu_2 + \rho_{12}(x_1^{(t)}-\mu_1),\sqrt{1-\rho_{12}^2})$,

which are both univariate Normal distributions, each with a mean that is dependent on the value of the most recent state of the conditioning variable, and a variance that is dependent on the target covariances between the two variables.

Using the above expressions for the conditional probabilities of variables $x_1$ and $x_2$, we implement the Gibbs sampler using MATLAB below. The output of the sampler is shown here:

Gibbs sampler Markov chain and samples for bivariate Normal target distribution

Inspecting the figure above, note how at each iteration the Markov chain for the Gibbs sampler first takes a step only along the $x_1$ direction, then only along the $x_2$ direction.  This shows how the Gibbs sampler sequentially samples the value of each variable separately, in a component-wise fashion.

% EXAMPLE: GIBBS SAMPLER FOR BIVARIATE NORMAL
rand('seed' ,12345);
nSamples = 5000;

mu = [0 0]; % TARGET MEAN
rho(1) = 0.8; % rho_21
rho(2) = 0.8; % rho_12

% INITIALIZE THE GIBBS SAMPLER
propSigma = 1; % PROPOSAL VARIANCE
minn = [-3 -3];
maxx = [3 3];

% INITIALIZE SAMPLES
x = zeros(nSamples,2);
x(1,1) = unifrnd(minn(1), maxx(1));
x(1,2) = unifrnd(minn(2), maxx(2));

dims = 1:2; % INDEX INTO EACH DIMENSION

% RUN GIBBS SAMPLER
t = 1;
while t < nSamples
t = t + 1;
T = [t-1,t];
for iD = 1:2 % LOOP OVER DIMENSIONS
% UPDATE SAMPLES
nIx = dims~=iD; % *NOT* THE CURRENT DIMENSION
% CONDITIONAL MEAN
muCond = mu(iD) + rho(iD)*(x(T(iD),nIx)-mu(nIx));
% CONDITIONAL VARIANCE
varCond = sqrt(1-rho(iD)^2);
% DRAW FROM CONDITIONAL
x(t,iD) = normrnd(muCond,varCond);
end
end

% DISPLAY SAMPLING DYNAMICS
figure;
h1 = scatter(x(:,1),x(:,2),'r.');

% CONDITIONAL STEPS/SAMPLES
hold on;
for t = 1:50
plot([x(t,1),x(t+1,1)],[x(t,2),x(t,2)],'k-');
plot([x(t+1,1),x(t+1,1)],[x(t,2),x(t+1,2)],'k-');
h2 = plot(x(t+1,1),x(t+1,2),'ko');
end

h3 = scatter(x(1,1),x(1,2),'go','Linewidth',3);
legend([h1,h2,h3],{'Samples','1st 50 Samples','x(t=0)'},'Location','Northwest')
hold off;
xlabel('x_1');
ylabel('x_2');
axis square

## Wrapping Up

The Gibbs sampler is a popular MCMC method for sampling from complex, multivariate probability distributions. However, the Gibbs sampler cannot be used for general sampling problems. For many target distributions, it may difficult or impossible to obtain a closed-form expression for all the needed conditional distributions. In other scenarios, analytic expressions may exist for all conditionals but it may be difficult to sample from any or all of the conditional distributions (in these scenarios it is common to use univariate sampling methods such as rejection sampling and (surprise!) Metropolis-type MCMC techniques to approximate samples from each conditional). Gibbs samplers are very popular for Bayesian methods where models are often devised in such a way that conditional expressions for all model variables are easily obtained and take well-known forms that can be sampled from efficiently.

Gibbs sampling, like many MCMC techniques suffer from what is often called “slow mixing.” Slow mixing occurs when the underlying Markov chain takes a long time to sufficiently explore the values of $\bold x$ in order to give a good characterization of $p(\bold x)$. Slow mixing is due to a number of factors including the “random walk” nature of the Markov chain, as well as the tendency of the Markov chain to get “stuck,” only sampling a single region of $\bold x$ having high-probability under $p(\bold x)$. Such behaviors are bad for sampling distributions with multiple modes or heavy tails. More advanced techniques, such as Hybrid Monte Carlo have been developed to incorporate additional dynamics that increase the efficiency of the Markov chain path. We will discuss Hybrid Monte Carlo in a future post.

## MCMC: Multivariate Distributions, Block-wise, & Component-wise Updates

In the previous posts on MCMC methods, we focused on how to sample from univariate target distributions. This was done mainly to give the reader some intuition about MCMC implementations with fairly tangible examples that can be visualized. However, MCMC can easily be extended to sample multivariate distributions.

In this post we will discuss two flavors of MCMC update procedure for sampling distributions in multiple dimensions: block-wise, and component-wise update procedures. We will show how these two different procedures can give rise to different implementations of the Metropolis-Hastings sampler to solve the same problem.

## Block-wise Sampling

The first approach for performing multidimensional sampling is to use block-wise updates. In this approach the proposal distribution $q(\bold{x})$ has the same dimensionality as the target distribution $p(\bold x)$. Specifically, if $p(\bold x)$ is a distribution over $D$ variables, ie. $\bold{x} = (x_1, x_2, \dots, x_D)$, then we must design a proposal distribution that is also a distribution involving $D$ variables. We then accept or reject a proposed state $\bold x^*$ sampled from the proposal distribution $q(\bold x)$ in exactly the same way as for the univariate Metropolis-Hastings algorithm. To generate $M$ multivariate samples we perform the following block-wise sampling procedure:

1. set $t = 0$
2. generate an initial state $\bold x^{(0)} \sim \pi^{(0)}$
3. repeat until $t = M$

set $t = t+1$

generate a proposal state $\bold x^*$ from $q(\bold x | \bold x^{(t-1)})$

calculate the proposal correction factor $c = \frac{q(\bold x^{(t-1)} | \bold x^*) }{q(\bold x^*|\bold x^{(t-1)})}$

calculate the acceptance probability $\alpha = \text{min} \left (1,\frac{p(\bold x^*)}{p(\bold x^{(t-1)})} \times c\right )$

draw a random number $u$ from $\text{Unif}(0,1)$

if $u \leq \alpha$ accept the proposal state $\bold x^*$ and set $\bold x^{(t)}=\bold x^*$

else set $\bold x^{(t)} = \bold x^{(t-1)}$

Let’s take a look at the block-wise sampling routine in action.

### Example 1: Block-wise Metropolis-Hastings for sampling of bivariate Normal distribution

In this example we use block-wise Metropolis-Hastings algorithm to sample from a bivariate (i.e. $D = 2$) Normal distribution:

$p(\bold x) = \mathcal N (\bold{\mu}, \bold \Sigma)$

with mean

$\mu = [0, 0]$

and covariance

$\bold \Sigma = \begin{bmatrix} 1 & 0.8 \\ 0.8 & 1\end{bmatrix}$

Usually the target distribution $p(\bold x)$ will have a complex mathematical form, but for this example we’ll circumvent that by using MATLAB’s built-in function $\text{mvnpdf}$ to evaluate $p(\bold x)$. For our proposal distribution, $q(\bold x)$, let’s use a circular Normal centered at the the previous state/sample of the Markov chain/sampler, i.e:

$q(\bold x | \bold x^{(t-1)}) \sim \mathcal N (\bold x^{(t-1)}, \bold I)$,

where $\bold I$ is a 2-D identity matrix, giving the proposal distribution unit variance along both dimensions $x_1$ and $x_2$, and zero covariance. You can find an MATLAB implementation of the block-wise sampler at the end of the section. The display of the samples and the target distribution output by the sampler implementation are shown below:

Samples drawn from block-wise Metropolis-Hastings sampler

We can see from the output that the block-wise sampler does a good job of drawing samples from the target distribution.

Note that our proposal distribution in this example is symmetric, therefore it was not necessary to calculate the correction factor $c$ per se. This means that this Metropolis-Hastings implementation is identical to the simpler Metropolis sampler.

%------------------------------------------------------
% EXAMPLE 1: METROPOLIS-HASTINGS
% BLOCK-WISE SAMPLER (BIVARIATE NORMAL)
rand('seed' ,12345);

D = 2; % # VARIABLES
nBurnIn = 100;

% TARGET DISTRIBUTION IS A 2D NORMAL WITH STRONG COVARIANCE
p = inline('mvnpdf(x,[0 0],[1 0.8;0.8 1])','x');

% PROPOSAL DISTRIBUTION STANDARD 2D GUASSIAN
q = inline('mvnpdf(x,mu)','x','mu')

nSamples = 5000;
minn = [-3 -3];
maxx = [3 3];

% INITIALIZE BLOCK-WISE SAMPLER
t = 1;
x = zeros(nSamples,2);
x(1,:) = randn(1,D);

% RUN SAMPLER
while t < nSamples
t = t + 1;

% SAMPLE FROM PROPOSAL
xStar = mvnrnd(x(t-1,:),eye(D));

% CORRECTION FACTOR (SHOULD EQUAL 1)
c = q(x(t-1,:),xStar)/q(xStar,x(t-1,:));

% CALCULATE THE M-H ACCEPTANCE PROBABILITY
alpha = min([1, p(xStar)/p(x(t-1,:))]);

% ACCEPT OR REJECT?
u = rand;
if u < alpha
x(t,:) = xStar;
else
x(t,:) = x(t-1,:);
end
end

% DISPLAY
nBins = 20;
bins1 = linspace(minn(1), maxx(1), nBins);
bins2 = linspace(minn(2), maxx(2), nBins);

% DISPLAY SAMPLED DISTRIBUTION
ax = subplot(121);
bins1 = linspace(minn(1), maxx(1), nBins);
bins2 = linspace(minn(2), maxx(2), nBins);
sampX = hist3(x, 'Edges', {bins1, bins2});
hist3(x, 'Edges', {bins1, bins2});
view(-15,40)

% COLOR HISTOGRAM BARS ACCORDING TO HEIGHT
colormap hot
set(gcf,'renderer','opengl');
set(get(gca,'child'),'FaceColor','interp','CDataMode','auto');
xlabel('x_1'); ylabel('x_2'); zlabel('Frequency');
axis square
set(ax,'xTick',[minn(1),0,maxx(1)]);
set(ax,'yTick',[minn(2),0,maxx(2)]);
title('Sampled Distribution');

% DISPLAY ANALYTIC DENSITY
ax = subplot(122);
[x1 ,x2] = meshgrid(bins1,bins2);
probX = p([x1(:), x2(:)]);
probX = reshape(probX ,nBins, nBins);
surf(probX); axis xy
view(-15,40)
xlabel('x_1'); ylabel('x_2'); zlabel('p({\bfx})');
colormap hot
axis square
set(ax,'xTick',[1,round(nBins/2),nBins]);
set(ax,'xTickLabel',[minn(1),0,maxx(1)]);
set(ax,'yTick',[1,round(nBins/2),nBins]);
set(ax,'yTickLabel',[minn(2),0,maxx(2)]);
title('Analytic Distribution')

## Component-wise Sampling

A problem with block-wise updates, particularly when the number of dimensions $D$ becomes large, is that finding a suitable proposal distribution is difficult. This leads to a large proportion of the samples being rejected. One way to remedy this is to simply loop over the the $D$ dimensions of $\bold x$ in sequence, sampling each dimension independently from the others. This is what is known as using component-wise updates. Note that now the proposal distribution $q(x)$ is univariate, working only in one dimension, namely the current dimension that we are trying to sample. The component-wise Metropolis-Hastings algorithm is outlined below.

1. set $t = 0$
2. generate an initial state $\bold x^{(0)} \sim \pi^{(0)}$
3. repeat until $t = M$

set $t = t+1$

for each dimension $i = 1..D$

generate a proposal state $x_i^*$ from $q(x_i | x_i^{(t-1)})$

calculate the proposal correction factor $c = \frac{q(x_i^{(t-1)}) | x_i^*)}{q(x_i^* | x_i^{(t-1)})}$

calculate the acceptance probability $\alpha = \text{min} \left (1,\frac{p( x_i^*, \bold x_j^{(t-1)})}{p( x_i^{(t-1)}, \bold x_j^{(t-1)})} \times c\right )$

draw a random number $u$ from $\text{Unif}(0,1)$

if $u \leq \alpha$ accept the proposal state $x_i^*$ and set $x_i^{(t)}=x_i^*$

else set $x_i^{(t)} = x_i^{(t-1)}$

Note that in the component-wise implementation a sample for the $i$-th dimension is proposed, then  accepted or rejected while all other dimensions ($j \neq i$) are held fixed. We then move on to the next ($(i + 1)$-th) dimension and repeat the process while holding all other variables ($j \neq (i + 1)$) fixed. In each successive step we are using updated values for the dimensions that have occurred since increasing $(t -1) \rightarrow t$.

### Example 2: Component-wise Metropolis-Hastings for sampling of bivariate Normal distribution

In this example we draw samples from the same bivariate Normal target distribution described in Example 1, but using component-wise updates. Therefore $p(x)$ is the same, however, the proposal distribution $q(x)$ is now a univariate Normal distribution with unit unit variance in the direction of the $i$-th dimension to be sampled. The MATLAB implementation of the component-wise sampler is at the end of the section. The samples and comparison to the analytic target distribution are shown below.

Samples drawn from component-wise Metropolis-Hastings algorithm compared to target distribution

Again, we see that we get a good characterization of the bivariate target distribution.

%--------------------------------------------------
% EXAMPLE 2: METROPOLIS-HASTINGS
% COMPONENT-WISE SAMPLING OF BIVARIATE NORMAL
rand('seed' ,12345);

% TARGET DISTRIBUTION
p = inline('mvnpdf(x,[0 0],[1 0.8;0.8 1])','x');

nSamples = 5000;
propSigma = 1;		% PROPOSAL VARIANCE
minn = [-3 -3];
maxx = [3 3];

% INITIALIZE COMPONENT-WISE SAMPLER
x = zeros(nSamples,2);
xCurrent(1) = randn;
xCurrent(2) = randn;
dims = 1:2; % INDICES INTO EACH DIMENSION
t = 1;
x(t,1) = xCurrent(1);
x(t,2) = xCurrent(2);

% RUN SAMPLER
while t < nSamples
t = t + 1;
for iD = 1:2 % LOOP OVER DIMENSIONS

% SAMPLE PROPOSAL
xStar = normrnd(xCurrent(:,iD), propSigma);

% NOTE: CORRECTION FACTOR c=1 BECAUSE
% N(mu,1) IS SYMMETRIC, NO NEED TO CALCULATE

% CALCULATE THE ACCEPTANCE PROBABILITY
pratio = p([xStar xCurrent(dims~=iD)])/ ...
p([xCurrent(1) xCurrent(2)]);
alpha = min([1, pratio]);

% ACCEPT OR REJECT?
u = rand;
if u < alpha
xCurrent(iD) = xStar;
end
end

% UPDATE SAMPLES
x(t,:) = xCurrent;
end

% DISPLAY
nBins = 20;
bins1 = linspace(minn(1), maxx(1), nBins);
bins2 = linspace(minn(2), maxx(2), nBins);

% DISPLAY SAMPLED DISTRIBUTION
figure;
ax = subplot(121);
bins1 = linspace(minn(1), maxx(1), nBins);
bins2 = linspace(minn(2), maxx(2), nBins);
sampX = hist3(x, 'Edges', {bins1, bins2});
hist3(x, 'Edges', {bins1, bins2});
view(-15,40)

% COLOR HISTOGRAM BARS ACCORDING TO HEIGHT
colormap hot
set(gcf,'renderer','opengl');
set(get(gca,'child'),'FaceColor','interp','CDataMode','auto');
xlabel('x_1'); ylabel('x_2'); zlabel('Frequency');
axis square
set(ax,'xTick',[minn(1),0,maxx(1)]);
set(ax,'yTick',[minn(2),0,maxx(2)]);
title('Sampled Distribution');

% DISPLAY ANALYTIC DENSITY
ax = subplot(122);
[x1 ,x2] = meshgrid(bins1,bins2);
probX = p([x1(:), x2(:)]);
probX = reshape(probX ,nBins, nBins);
surf(probX); axis xy
view(-15,40)
xlabel('x_1'); ylabel('x_2'); zlabel('p({\bfx})');
colormap hot
axis square
set(ax,'xTick',[1,round(nBins/2),nBins]);
set(ax,'xTickLabel',[minn(1),0,maxx(1)]);
set(ax,'yTick',[1,round(nBins/2),nBins]);
set(ax,'yTickLabel',[minn(2),0,maxx(2)]);
title('Analytic Distribution')

## Wrapping Up

Here we saw how we can use block- and component-wise updates to derive two different implementations of the Metropolis-Hastings algorithm. In the next post we will use component-wise updates introduced above to motivate the Gibbs sampler, which is often used to increase the efficiency of sampling well-defined probability multivariate distributions.

## MCMC: The Metropolis-Hastings Sampler

In an earlier post we discussed how the Metropolis sampling algorithm can draw samples from a complex and/or unnormalized target probability distributions using a Markov chain. The Metropolis algorithm first proposes a possible new state $x^*$ in the Markov chain, based on a previous state $x^{(t-1)}$, according to the proposal distribution $q(x^* | x^{(t-1)})$. The algorithm accepts or rejects the proposed state based on the density of the the target distribution $p(x)$ evaluated at $x^*$. (If any of this Markov-speak is gibberish to the reader, please refer to the previous posts on Markov Chains, MCMC, and the Metropolis Algorithm for some clarification).

One constraint of the Metropolis sampler is that the proposal distribution $q(x^* | x^{(t-1)})$ must be symmetric. The constraint originates from using a Markov Chain to draw samples: a necessary condition for drawing from a Markov chain’s stationary distribution is that at any given point in time $t$, the probability of moving from $x^{(t-1)} \rightarrow x^{(t)}$ must be equal to the probability of moving from $x^{(t-1)} \rightarrow x^{(t)}$, a condition known as reversibility or detailed balance. However, a symmetric proposal distribution may be ill-fit for many problems, like when we want to sample from distributions that are bounded on semi infinite intervals (e.g. $[0, \infty)$).

In order to be able to use an asymmetric proposal distributions, the Metropolis-Hastings algorithm implements an additional correction factor $c$, defined from the proposal distribution as

$c = \frac{q(x^{(t-1)} | x^*) }{q(x^* | x^{(t-1)})}$

The correction factor adjusts the transition operator to ensure that the probability of moving from $x^{(t-1)} \rightarrow x^{(t)}$ is equal to the probability of moving from $x^{(t-1)} \rightarrow x^{(t)}$, no matter the proposal distribution.

The Metropolis-Hastings algorithm is implemented with essentially the same procedure as the Metropolis sampler, except that the correction factor is used in the evaluation of acceptance probability $\alpha$.  Specifically, to draw $M$ samples using the Metropolis-Hastings sampler:

1. set t = 0
2. generate an initial state $x^{(0)} \sim \pi^{(0)}$
3. repeat until $t = M$

set $t = t+1$

generate a proposal state $x^*$ from $q(x | x^{(t-1)})$

calculate the proposal correction factor $c = \frac{q(x^{(t-1)} | x^*) }{q(x^*|x^{(t-1)})}$

calculate the acceptance probability $\alpha = \text{min} \left (1,\frac{p(x^*)}{p(x^{(t-1)})} \times c\right )$

draw a random number $u$ from $\text{Unif}(0,1)$

if $u \leq \alpha$ accept the proposal state $x^*$ and set $x^{(t)}=x^*$

else set $x^{(t)} = x^{(t-1)}$

Many consider the Metropolis-Hastings algorithm to be a generalization of the Metropolis algorithm. This is because when the proposal distribution is symmetric, the correction factor is equal to one, giving the transition operator for the Metropolis sampler.

## Example: Sampling from a Bayesian posterior with improper prior

For a number of applications, including regression and density estimation, it is usually necessary to determine a set of parameters $\theta$ to an assumed model $p(y | \theta)$ such that the model can best account for some observed data $y$. The model function $p(y | \theta)$ is often referred to as the likelihood function. In Bayesian methods there is often an explicit prior distribution $p(\theta)$ that is placed on the model parameters and controls the values that the parameters can take.

The parameters are determined based on the posterior distribution $p(\theta | y)$, which is a probability distribution over the possible parameters based on the observed data. The posterior can be determined using Bayes’ theorem:

$p(\theta | y) = \frac{p(y | \theta) p(\theta)}{p(y)}$

where, $p(y)$ is a normalization constant that is often quite difficult to determine explicitly, as it involves computing sums over every possible value that the parameters and $y$ can take.

Let’s say that we assume the following model (likelihood function):

$p(y | \theta) = \text{Gamma}(y;A,B)$, where

$\text{Gamma}(y;A,B) = \frac{B^A}{\Gamma(A)} y^{A-1}e^{-By}$, where

$\Gamma( )$ is the gamma function. Thus, the model parameters are

$\theta = [A,B]$

The parameter $A$ controls the shape of the distribution, and $B$ controls the scale. The likelihood surface for $B = 1$, and a number of values of $A$ ranging from zero to five are shown below.

Likelihood surface and conditional probability p(y|A=2,B=1) in green

The conditional distribution $p(y | A=2, B = 1)$ is plotted in green along the likelihood surface. You can verify this is a valid conditional in MATLAB with the following command:

plot(0:.1:10,gampdf(0:.1:10,4,1)); % GAMMA(4,1)

Now, let’s assume the following priors on the model parameters:

$p(B = 1) = 1$

and

$p(A) = \text{sin}(\pi A)^2$

The first prior states that $B$ only takes a single value (i.e. 1), therefore we can treat it as a constant. The second (rather non-conventional) prior states that the probability of $A$ varies as a sinusoidal function. (Note that both of these prior distributions are called improper priors because they do not integrate to one). Because $B$ is constant, we only need to estimate the value of $A$.

It turns out that even though the normalization constant $p(y)$ may be difficult to compute, we can sample from $p(A | y)$ without knowing $p(x)$ using the Metropolis-Hastings algorithm. In particular, we can ignore the normalization constant $p(x)$ and sample from the unnormalized posterior:

$p(A | y) \propto p(y |A) p(A)$

The surface of the (unnormalized) posterior for $y$ ranging from zero to ten are shown below. The prior $p(A)$ is displayed in blue on the right of the plot. Let’s say that we have a datapoint $y = 1.5$ and would like to estimate the posterior distribution $p(A|y=1.5)$ using the Metropolis-Hastings algorithm. This particular target distribution is plotted in magenta in the plot below.

Posterior surface, prior distribution (blue), and target distribution (pink)

Using a symmetric proposal distribution like the Normal distribution is not efficient for sampling from $p(A|y=1.5)$ due to the fact that the posterior only has support on the real positive numbers $A \in [0 ,\infty)$. An asymmetric proposal distribution with the same support, would provide a better coverage of the posterior. One distribution that operates on the positive real numbers is the exponential distribution.

$q(A) = \text{Exp}(\mu) = \mu e^{-\mu A}$,

This distribution is parameterized by a single variable $\mu$ that controls the scale and location of the distribution probability mass. The target posterior and a proposal distribution (for $\mu = 5$) are shown in the plot below.

Target posterior p(A|y) and proposal distribution q(A)

We see that the proposal has a fairly good coverage of the posterior distribution. We run the Metropolis-Hastings sampler in the block of MATLAB code at the bottom. The Markov chain path and the resulting samples are shown in plot below.

Metropolis-Hastings Markov chain and samples

As an aside, note that the proposal distribution for this sampler does not depend on past samples, but only on the parameter $\mu$ (see line 88 in the MATLAB code below). Each proposal states $x^*$ is drawn independently of the previous state. Therefore this is an example of an independence sampler, a specific type of Metropolis-Hastings sampling algorithm. Independence samplers are notorious for being either very good or very poor sampling routines. The quality of the routine depends on the choice of the proposal distribution, and its coverage of the target distribution. Identifying such a proposal distribution is often difficult in practice.

The MATLAB  code for running the Metropolis-Hastings sampler is below. Use the copy icon in the upper right of the code block to copy it to your clipboard. Paste in a MATLAB terminal to output the figures above.

% METROPOLIS-HASTINGS BAYESIAN POSTERIOR
rand('seed',12345)

% PRIOR OVER SCALE PARAMETERS
B = 1;

% DEFINE LIKELIHOOD
likelihood = inline('(B.^A/gamma(A)).*y.^(A-1).*exp(-(B.*y))','y','A','B');

% CALCULATE AND VISUALIZE THE LIKELIHOOD SURFACE
yy = linspace(0,10,100);
AA = linspace(0.1,5,100);
likeSurf = zeros(numel(yy),numel(AA));
for iA = 1:numel(AA); likeSurf(:,iA)=likelihood(yy(:),AA(iA),B); end;

figure;
surf(likeSurf); ylabel('p(y|A)'); xlabel('A'); colormap hot

% DISPLAY CONDITIONAL AT A = 2
hold on; ly = plot3(ones(1,numel(AA))*40,1:100,likeSurf(:,40),'g','linewidth',3)
xlim([0 100]); ylim([0 100]);  axis normal
set(gca,'XTick',[0,100]); set(gca,'XTickLabel',[0 5]);
set(gca,'YTick',[0,100]); set(gca,'YTickLabel',[0 10]);
view(65,25)
legend(ly,'p(y|A=2)','Location','Northeast');
hold off;
title('p(y|A)');

% DEFINE PRIOR OVER SHAPE PARAMETERS
prior = inline('sin(pi*A).^2','A');

% DEFINE THE POSTERIOR
p = inline('(B.^A/gamma(A)).*y.^(A-1).*exp(-(B.*y)).*sin(pi*A).^2','y','A','B');

% CALCULATE AND DISPLAY THE POSTERIOR SURFACE
postSurf = zeros(size(likeSurf));
for iA = 1:numel(AA); postSurf(:,iA)=p(yy(:),AA(iA),B); end;

figure
surf(postSurf); ylabel('y'); xlabel('A'); colormap hot

% DISPLAY THE PRIOR
hold on; pA = plot3(1:100,ones(1,numel(AA))*100,prior(AA),'b','linewidth',3)

% SAMPLE FROM p(A | y = 1.5)
y = 1.5;
target = postSurf(16,:);

% DISPLAY POSTERIOR
psA = plot3(1:100, ones(1,numel(AA))*16,postSurf(16,:),'m','linewidth',3)
xlim([0 100]); ylim([0 100]);  axis normal
set(gca,'XTick',[0,100]); set(gca,'XTickLabel',[0 5]);
set(gca,'YTick',[0,100]); set(gca,'YTickLabel',[0 10]);
view(65,25)
legend([pA,psA],{'p(A)','p(A|y = 1.5)'},'Location','Northeast');
hold off
title('p(A|y)');

% INITIALIZE THE METROPOLIS-HASTINGS SAMPLER
% DEFINE PROPOSAL DENSITY
q = inline('exppdf(x,mu)','x','mu');

% MEAN FOR PROPOSAL DENSITY
mu = 5;

% DISPLAY TARGET AND PROPOSAL
figure; hold on;
th = plot(AA,target,'m','Linewidth',2);
qh = plot(AA,q(AA,mu),'k','Linewidth',2)
legend([th,qh],{'Target, p(A)','Proposal, q(A)'});
xlabel('A');

% SOME CONSTANTS
nSamples = 5000;
burnIn = 500;
minn = 0.1; maxx = 5;

% INTIIALZE SAMPLER
x = zeros(1 ,nSamples);
x(1) = mu;
t = 1;

% RUN METROPOLIS-HASTINGS SAMPLER
while t < nSamples
t = t+1;

% SAMPLE FROM PROPOSAL
xStar = exprnd(mu);

% CORRECTION FACTOR
c = q(x(t-1),mu)/q(xStar,mu);

% CALCULATE THE (CORRECTED) ACCEPTANCE RATIO
alpha = min([1, p(y,xStar,B)/p(y,x(t-1),B)*c]);

% ACCEPT OR REJECT?
u = rand;
if u < alpha
x(t) = xStar;
else
x(t) = x(t-1);
end
end

% DISPLAY MARKOV CHAIN
figure;
subplot(211);
stairs(x(1:t),1:t, 'k');
hold on;
hb = plot([0 maxx/2],[burnIn burnIn],'g--','Linewidth',2)
ylabel('t'); xlabel('samples, A');
set(gca , 'YDir', 'reverse');
ylim([0 t])
axis tight;
xlim([0 maxx]);
title('Markov Chain Path');
legend(hb,'Burnin');

% DISPLAY SAMPLES
subplot(212);
nBins = 100;
sampleBins = linspace(minn,maxx,nBins);
counts = hist(x(burnIn:end), sampleBins);
bar(sampleBins, counts/sum(counts), 'k');
xlabel('samples, A' ); ylabel( 'p(A | y)' );
title('Samples');
xlim([0 10])

% OVERLAY TARGET DISTRIBUTION
hold on;
plot(AA, target/sum(target) , 'm-', 'LineWidth', 2);
legend('Sampled Distribution',sprintf('Target Posterior'))
axis tight

## Wrapping Up

Here we explored how the Metorpolis-Hastings sampling algorithm can be used to generalize the Metropolis algorithm in order to sample from complex (an unnormalized) probability distributions using asymmetric proposal distributions. One shortcoming of the Metropolis-Hastings algorithm is that not all of the proposed samples are accepted, wasting valuable computational resources. This becomes even more of an issue for sampling distributions in higher dimensions. This is where Gibbs sampling comes in. We’ll see in a later post that Gibbs sampling can be used to keep all proposal states in the Markov chain by taking advantage of conditional probabilities.