# Blog Archives

## Covariance Matrices and Data Distributions

Correlation between variables in a $K$-dimensional dataset are often summarized by a $K \times K$ covariance matrix. To get a better understanding of how correlation matrices characterize correlations between data points, we plot data points drawn from 3 different 2-dimensional Gaussian distributions, each of which is defined by a different covariance matrix.

The left plots below display the $2 \times 2$ covariance matrix for each Gaussian distribution. The values along the diagonal represent the variance of the data along each dimension, and the off-diagonal values represent the covariances between the dimensions. Thus the $i,j$-th entry of each matrix represents the correlation between the $i$-th and $j$-th dimensions. The right plots show data drawn from the corresponding 2D Gaussian.

The top row plot display a covariance matrix equal to the identity matrix, and the points drawn from the corresponding Gaussian distribution. The diagonal values are 1, indicating the data have variance of 1 along both of the dimensions. Additionally, the off-diagonal elements are zero, meaning that the two dimensions are uncorrelated.  We can see this in the data drawn from the distribution as well. The data are distributed in a sphere about origin. For such a distribution of points, it is difficult (impossible) to draw any single regression line that can predict the second dimension from the first, and vice versa. Thus an identity covariance matrix is equivalent to having independent dimensions, each of which has unit (i.e. 1) variance. Such a dataset is often called “white” (this naming convention comes from the notion that white noise signals–which can be sampled from independent Gaussian distributions–have equal power at all frequencies in the Fourier domain).

The middle row plots the points that result from a diagonal, but not identity covariance matrix. The off-diagonal elements are still zero, indicating that the dimensions are uncorrelated. However, the variances along each dimension are not equal to one, and are not equal. This is demonstrated by the elongated distribution in red. The elongation is along the second dimension, as indicated by the larger value in the bottom-right (point $(i,j) = (2,2)$) of the covariance matrix.

The bottom row plots points that result from a non-diagonal covariance matrix. Here the off-diagonal elements of covariance matrix have non-zero values, indicating a correlation between the dimensions. This correlation is reflected in the distribution of drawn datapoints (in blue). We can see that the primary axis along which the points are distributed is not along either of the dimensions, but a linear combination of the dimensions.

The MATLAB code to create the above plots is here

% INITIALIZE SOME CONSTANTS
mu = [0 0];         % ZERO MEAN
S = [1 .9; .9 3];   % NON-DIAGONAL COV.
SDiag = [1 0; 0 3]; % DIAGONAL COV.
SId = eye(2);       % IDENTITY COV.

% SAMPLE SOME DATAPOINTS
nSamples = 1000;
samples = mvnrnd(mu,S,nSamples)';
samplesId = mvnrnd(mu,SId,nSamples)';
samplesDiag = mvnrnd(mu,SDiag,nSamples)';

% DISPLAY
subplot(321);
imagesc(SId); axis image,
caxis([0 1]), colormap hot, colorbar
title('Identity Covariance')

subplot(322)
plot(samplesId(1,:),samplesId(2,:),'ko'); axis square
xlim([-5 5]), ylim([-5 5])
grid
title('White Data')

subplot(323);
imagesc(SDiag); axis image,
caxis([0 3]), colormap hot, colorbar
title('Diagonal Covariance')

subplot(324)
plot(samplesDiag(1,:),samplesDiag(2,:),'r.'); axis square
xlim([-5 5]), ylim([-5 5])
grid
title('Uncorrelated Data')

subplot(325);
imagesc(S); axis image,
caxis([0 3]), colormap hot, colorbar
title('Non-diagonal Covariance')

subplot(326)
plot(samples(1,:),samples(2,:),'b.'); axis square
xlim([-5 5]), ylim([-5 5])
grid
title('Correlated Data')


## MCMC: Multivariate Distributions, Block-wise, & Component-wise Updates

In the previous posts on MCMC methods, we focused on how to sample from univariate target distributions. This was done mainly to give the reader some intuition about MCMC implementations with fairly tangible examples that can be visualized. However, MCMC can easily be extended to sample multivariate distributions.

In this post we will discuss two flavors of MCMC update procedure for sampling distributions in multiple dimensions: block-wise, and component-wise update procedures. We will show how these two different procedures can give rise to different implementations of the Metropolis-Hastings sampler to solve the same problem.

## Block-wise Sampling

The first approach for performing multidimensional sampling is to use block-wise updates. In this approach the proposal distribution $q(\bold{x})$ has the same dimensionality as the target distribution $p(\bold x)$. Specifically, if $p(\bold x)$ is a distribution over $D$ variables, ie. $\bold{x} = (x_1, x_2, \dots, x_D)$, then we must design a proposal distribution that is also a distribution involving $D$ variables. We then accept or reject a proposed state $\bold x^*$ sampled from the proposal distribution $q(\bold x)$ in exactly the same way as for the univariate Metropolis-Hastings algorithm. To generate $M$ multivariate samples we perform the following block-wise sampling procedure:

1. set $t = 0$
2. generate an initial state $\bold x^{(0)} \sim \pi^{(0)}$
3. repeat until $t = M$

set $t = t+1$

generate a proposal state $\bold x^*$ from $q(\bold x | \bold x^{(t-1)})$

calculate the proposal correction factor $c = \frac{q(\bold x^{(t-1)} | \bold x^*) }{q(\bold x^*|\bold x^{(t-1)})}$

calculate the acceptance probability $\alpha = \text{min} \left (1,\frac{p(\bold x^*)}{p(\bold x^{(t-1)})} \times c\right )$

draw a random number $u$ from $\text{Unif}(0,1)$

if $u \leq \alpha$ accept the proposal state $\bold x^*$ and set $\bold x^{(t)}=\bold x^*$

else set $\bold x^{(t)} = \bold x^{(t-1)}$

Let’s take a look at the block-wise sampling routine in action.

### Example 1: Block-wise Metropolis-Hastings for sampling of bivariate Normal distribution

In this example we use block-wise Metropolis-Hastings algorithm to sample from a bivariate (i.e. $D = 2$) Normal distribution:

$p(\bold x) = \mathcal N (\bold{\mu}, \bold \Sigma)$

with mean

$\mu = [0, 0]$

and covariance

$\bold \Sigma = \begin{bmatrix} 1 & 0.8 \\ 0.8 & 1\end{bmatrix}$

Usually the target distribution $p(\bold x)$ will have a complex mathematical form, but for this example we’ll circumvent that by using MATLAB’s built-in function $\text{mvnpdf}$ to evaluate $p(\bold x)$. For our proposal distribution, $q(\bold x)$, let’s use a circular Normal centered at the the previous state/sample of the Markov chain/sampler, i.e:

$q(\bold x | \bold x^{(t-1)}) \sim \mathcal N (\bold x^{(t-1)}, \bold I)$,

where $\bold I$ is a 2-D identity matrix, giving the proposal distribution unit variance along both dimensions $x_1$ and $x_2$, and zero covariance. You can find an MATLAB implementation of the block-wise sampler at the end of the section. The display of the samples and the target distribution output by the sampler implementation are shown below:

Samples drawn from block-wise Metropolis-Hastings sampler

We can see from the output that the block-wise sampler does a good job of drawing samples from the target distribution.

Note that our proposal distribution in this example is symmetric, therefore it was not necessary to calculate the correction factor $c$ per se. This means that this Metropolis-Hastings implementation is identical to the simpler Metropolis sampler.

%------------------------------------------------------
% EXAMPLE 1: METROPOLIS-HASTINGS
% BLOCK-WISE SAMPLER (BIVARIATE NORMAL)
rand('seed' ,12345);

D = 2; % # VARIABLES
nBurnIn = 100;

% TARGET DISTRIBUTION IS A 2D NORMAL WITH STRONG COVARIANCE
p = inline('mvnpdf(x,[0 0],[1 0.8;0.8 1])','x');

% PROPOSAL DISTRIBUTION STANDARD 2D GUASSIAN
q = inline('mvnpdf(x,mu)','x','mu')

nSamples = 5000;
minn = [-3 -3];
maxx = [3 3];

% INITIALIZE BLOCK-WISE SAMPLER
t = 1;
x = zeros(nSamples,2);
x(1,:) = randn(1,D);

% RUN SAMPLER
while t < nSamples
t = t + 1;

% SAMPLE FROM PROPOSAL
xStar = mvnrnd(x(t-1,:),eye(D));

% CORRECTION FACTOR (SHOULD EQUAL 1)
c = q(x(t-1,:),xStar)/q(xStar,x(t-1,:));

% CALCULATE THE M-H ACCEPTANCE PROBABILITY
alpha = min([1, p(xStar)/p(x(t-1,:))]);

% ACCEPT OR REJECT?
u = rand;
if u < alpha
x(t,:) = xStar;
else
x(t,:) = x(t-1,:);
end
end

% DISPLAY
nBins = 20;
bins1 = linspace(minn(1), maxx(1), nBins);
bins2 = linspace(minn(2), maxx(2), nBins);

% DISPLAY SAMPLED DISTRIBUTION
ax = subplot(121);
bins1 = linspace(minn(1), maxx(1), nBins);
bins2 = linspace(minn(2), maxx(2), nBins);
sampX = hist3(x, 'Edges', {bins1, bins2});
hist3(x, 'Edges', {bins1, bins2});
view(-15,40)

% COLOR HISTOGRAM BARS ACCORDING TO HEIGHT
colormap hot
set(gcf,'renderer','opengl');
set(get(gca,'child'),'FaceColor','interp','CDataMode','auto');
xlabel('x_1'); ylabel('x_2'); zlabel('Frequency');
axis square
set(ax,'xTick',[minn(1),0,maxx(1)]);
set(ax,'yTick',[minn(2),0,maxx(2)]);
title('Sampled Distribution');

% DISPLAY ANALYTIC DENSITY
ax = subplot(122);
[x1 ,x2] = meshgrid(bins1,bins2);
probX = p([x1(:), x2(:)]);
probX = reshape(probX ,nBins, nBins);
surf(probX); axis xy
view(-15,40)
xlabel('x_1'); ylabel('x_2'); zlabel('p({\bfx})');
colormap hot
axis square
set(ax,'xTick',[1,round(nBins/2),nBins]);
set(ax,'xTickLabel',[minn(1),0,maxx(1)]);
set(ax,'yTick',[1,round(nBins/2),nBins]);
set(ax,'yTickLabel',[minn(2),0,maxx(2)]);
title('Analytic Distribution')


## Component-wise Sampling

A problem with block-wise updates, particularly when the number of dimensions $D$ becomes large, is that finding a suitable proposal distribution is difficult. This leads to a large proportion of the samples being rejected. One way to remedy this is to simply loop over the the $D$ dimensions of $\bold x$ in sequence, sampling each dimension independently from the others. This is what is known as using component-wise updates. Note that now the proposal distribution $q(x)$ is univariate, working only in one dimension, namely the current dimension that we are trying to sample. The component-wise Metropolis-Hastings algorithm is outlined below.

1. set $t = 0$
2. generate an initial state $\bold x^{(0)} \sim \pi^{(0)}$
3. repeat until $t = M$

set $t = t+1$

for each dimension $i = 1..D$

generate a proposal state $x_i^*$ from $q(x_i | x_i^{(t-1)})$

calculate the proposal correction factor $c = \frac{q(x_i^{(t-1)}) | x_i^*)}{q(x_i^* | x_i^{(t-1)})}$

calculate the acceptance probability $\alpha = \text{min} \left (1,\frac{p( x_i^*, \bold x_j^{(t-1)})}{p( x_i^{(t-1)}, \bold x_j^{(t-1)})} \times c\right )$

draw a random number $u$ from $\text{Unif}(0,1)$

if $u \leq \alpha$ accept the proposal state $x_i^*$ and set $x_i^{(t)}=x_i^*$

else set $x_i^{(t)} = x_i^{(t-1)}$

Note that in the component-wise implementation a sample for the $i$-th dimension is proposed, then  accepted or rejected while all other dimensions ($j \neq i$) are held fixed. We then move on to the next ($(i + 1)$-th) dimension and repeat the process while holding all other variables ($j \neq (i + 1)$) fixed. In each successive step we are using updated values for the dimensions that have occurred since increasing $(t -1) \rightarrow t$.

### Example 2: Component-wise Metropolis-Hastings for sampling of bivariate Normal distribution

In this example we draw samples from the same bivariate Normal target distribution described in Example 1, but using component-wise updates. Therefore $p(x)$ is the same, however, the proposal distribution $q(x)$ is now a univariate Normal distribution with unit unit variance in the direction of the $i$-th dimension to be sampled. The MATLAB implementation of the component-wise sampler is at the end of the section. The samples and comparison to the analytic target distribution are shown below.

Samples drawn from component-wise Metropolis-Hastings algorithm compared to target distribution

Again, we see that we get a good characterization of the bivariate target distribution.

%--------------------------------------------------
% EXAMPLE 2: METROPOLIS-HASTINGS
% COMPONENT-WISE SAMPLING OF BIVARIATE NORMAL
rand('seed' ,12345);

% TARGET DISTRIBUTION
p = inline('mvnpdf(x,[0 0],[1 0.8;0.8 1])','x');

nSamples = 5000;
propSigma = 1;		% PROPOSAL VARIANCE
minn = [-3 -3];
maxx = [3 3];

% INITIALIZE COMPONENT-WISE SAMPLER
x = zeros(nSamples,2);
xCurrent(1) = randn;
xCurrent(2) = randn;
dims = 1:2; % INDICES INTO EACH DIMENSION
t = 1;
x(t,1) = xCurrent(1);
x(t,2) = xCurrent(2);

% RUN SAMPLER
while t < nSamples
t = t + 1;
for iD = 1:2 % LOOP OVER DIMENSIONS

% SAMPLE PROPOSAL
xStar = normrnd(xCurrent(:,iD), propSigma);

% NOTE: CORRECTION FACTOR c=1 BECAUSE
% N(mu,1) IS SYMMETRIC, NO NEED TO CALCULATE

% CALCULATE THE ACCEPTANCE PROBABILITY
pratio = p([xStar xCurrent(dims~=iD)])/ ...
p([xCurrent(1) xCurrent(2)]);
alpha = min([1, pratio]);

% ACCEPT OR REJECT?
u = rand;
if u < alpha
xCurrent(iD) = xStar;
end
end

% UPDATE SAMPLES
x(t,:) = xCurrent;
end

% DISPLAY
nBins = 20;
bins1 = linspace(minn(1), maxx(1), nBins);
bins2 = linspace(minn(2), maxx(2), nBins);

% DISPLAY SAMPLED DISTRIBUTION
figure;
ax = subplot(121);
bins1 = linspace(minn(1), maxx(1), nBins);
bins2 = linspace(minn(2), maxx(2), nBins);
sampX = hist3(x, 'Edges', {bins1, bins2});
hist3(x, 'Edges', {bins1, bins2});
view(-15,40)

% COLOR HISTOGRAM BARS ACCORDING TO HEIGHT
colormap hot
set(gcf,'renderer','opengl');
set(get(gca,'child'),'FaceColor','interp','CDataMode','auto');
xlabel('x_1'); ylabel('x_2'); zlabel('Frequency');
axis square
set(ax,'xTick',[minn(1),0,maxx(1)]);
set(ax,'yTick',[minn(2),0,maxx(2)]);
title('Sampled Distribution');

% DISPLAY ANALYTIC DENSITY
ax = subplot(122);
[x1 ,x2] = meshgrid(bins1,bins2);
probX = p([x1(:), x2(:)]);
probX = reshape(probX ,nBins, nBins);
surf(probX); axis xy
view(-15,40)
xlabel('x_1'); ylabel('x_2'); zlabel('p({\bfx})');
colormap hot
axis square
set(ax,'xTick',[1,round(nBins/2),nBins]);
set(ax,'xTickLabel',[minn(1),0,maxx(1)]);
set(ax,'yTick',[1,round(nBins/2),nBins]);
set(ax,'yTickLabel',[minn(2),0,maxx(2)]);
title('Analytic Distribution')



## Wrapping Up

Here we saw how we can use block- and component-wise updates to derive two different implementations of the Metropolis-Hastings algorithm. In the next post we will use component-wise updates introduced above to motivate the Gibbs sampler, which is often used to increase the efficiency of sampling well-defined probability multivariate distributions.