Covariance Matrices and Data Distributions

Correlation between variables in a $K$-dimensional dataset are often summarized by a $K \times K$ covariance matrix. To get a better understanding of how correlation matrices characterize correlations between data points, we plot data points drawn from 3 different 2-dimensional Gaussian distributions, each of which is defined by a different covariance matrix.

The left plots below display the $2 \times 2$ covariance matrix for each Gaussian distribution. The values along the diagonal represent the variance of the data along each dimension, and the off-diagonal values represent the covariances between the dimensions. Thus the $i,j$-th entry of each matrix represents the correlation between the $i$-th and $j$-th dimensions. The right plots show data drawn from the corresponding 2D Gaussian.

The top row plot display a covariance matrix equal to the identity matrix, and the points drawn from the corresponding Gaussian distribution. The diagonal values are 1, indicating the data have variance of 1 along both of the dimensions. Additionally, the off-diagonal elements are zero, meaning that the two dimensions are uncorrelated.  We can see this in the data drawn from the distribution as well. The data are distributed in a sphere about origin. For such a distribution of points, it is difficult (impossible) to draw any single regression line that can predict the second dimension from the first, and vice versa. Thus an identity covariance matrix is equivalent to having independent dimensions, each of which has unit (i.e. 1) variance. Such a dataset is often called “white” (this naming convention comes from the notion that white noise signals–which can be sampled from independent Gaussian distributions–have equal power at all frequencies in the Fourier domain).

The middle row plots the points that result from a diagonal, but not identity covariance matrix. The off-diagonal elements are still zero, indicating that the dimensions are uncorrelated. However, the variances along each dimension are not equal to one, and are not equal. This is demonstrated by the elongated distribution in red. The elongation is along the second dimension, as indicated by the larger value in the bottom-right (point $(i,j) = (2,2)$) of the covariance matrix.

The bottom row plots points that result from a non-diagonal covariance matrix. Here the off-diagonal elements of covariance matrix have non-zero values, indicating a correlation between the dimensions. This correlation is reflected in the distribution of drawn datapoints (in blue). We can see that the primary axis along which the points are distributed is not along either of the dimensions, but a linear combination of the dimensions.

The MATLAB code to create the above plots is here

% INITIALIZE SOME CONSTANTS
mu = [0 0];         % ZERO MEAN
S = [1 .9; .9 3];   % NON-DIAGONAL COV.
SDiag = [1 0; 0 3]; % DIAGONAL COV.
SId = eye(2);       % IDENTITY COV.

% SAMPLE SOME DATAPOINTS
nSamples = 1000;
samples = mvnrnd(mu,S,nSamples)';
samplesId = mvnrnd(mu,SId,nSamples)';
samplesDiag = mvnrnd(mu,SDiag,nSamples)';

% DISPLAY
subplot(321);
imagesc(SId); axis image,
caxis([0 1]), colormap hot, colorbar
title('Identity Covariance')

subplot(322)
plot(samplesId(1,:),samplesId(2,:),'ko'); axis square
xlim([-5 5]), ylim([-5 5])
grid
title('White Data')

subplot(323);
imagesc(SDiag); axis image,
caxis([0 3]), colormap hot, colorbar
title('Diagonal Covariance')

subplot(324)
plot(samplesDiag(1,:),samplesDiag(2,:),'r.'); axis square
xlim([-5 5]), ylim([-5 5])
grid
title('Uncorrelated Data')

subplot(325);
imagesc(S); axis image,
caxis([0 3]), colormap hot, colorbar
title('Non-diagonal Covariance')

subplot(326)
plot(samples(1,:),samples(2,:),'b.'); axis square
xlim([-5 5]), ylim([-5 5])
grid
title('Correlated Data')


I recently received my PhD from UC Berkeley where I studied computational neuroscience and machine learning.

Posted on March 29, 2013, in Statistics and tagged , , , , , , . Bookmark the permalink. 2 Comments.