The Statistical Whitening Transform

In a number of modeling scenarios, it is beneficial to transform the to-be-modeled data so that it has an identity covariance matrix, a procedure known as statistical whitening. When data have an identity covariance, all dimensions are uncorrelated (and, if the data are jointly Gaussian, statistically independent), and the variance of the data along each dimension is equal to one. (To get a better idea of what an identity covariance entails, see the following post.)

Removing correlations between dimensions is useful for a number of reasons. For example, in probabilistic models of data that exist in multiple dimensions, the joint distribution (which may be very complex and difficult to characterize) factorizes into a product of many simpler distributions when the dimensions are statistically independent, and decorrelation is a necessary step toward that independence. Forcing all dimensions to have unit variance is also useful: scaling all variables to have the same variance treats each dimension with equal importance.

In the remainder of this post we derive how to transform data such that it has an identity covariance matrix, give some examples of applying such a transformation to real data, and address some interpretations of statistical whitening in the scope of theoretical neuroscience.

Decorrelation: Transforming Data to Have a Diagonal Covariance Matrix

Let’s say we have some data matrix X composed of K dimensions and n observations (X has size [K \times n]). Let’s also assume that the rows of X have been centered (the mean across all observations has been subtracted from each row). The covariance \Sigma of each of the dimensions with respect to the others is

\Sigma = Cov(X) = \mathbb E[X X^T]                                                                                        (1)

where the expectation \mathbb E[X X^T] can be estimated from the data matrix as follows:

\mathbb E[X X^T] \approx \frac{X X^T}{n}                                                                                            (2)
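As a quick numerical illustration of Equation (2), here is a minimal sketch that centers a small synthetic data matrix and estimates its covariance. The sizes, the variable names, and the way the correlation is introduced are all assumptions made for the example; they are not taken from the analysis later in the post.

```matlab
% Minimal sketch: estimate the covariance of a K-by-n data matrix.
K = 3;                            % number of dimensions (illustrative)
n = 500;                          % number of observations (illustrative)
X = randn(K, n);                  % synthetic data; rows are dimensions
X(2,:) = X(2,:) + 0.8*X(1,:);     % make rows 1 and 2 correlated

% Center each row (subtract the mean across observations).
Xc = bsxfun(@minus, X, mean(X, 2));

% Equation (2): for centered data, Sigma is approximated by X*X'/n.
Sigma = (Xc * Xc') / n;

% cov() expects observations in rows and normalizes by n-1,
% so rescaling by (n-1)/n should reproduce Sigma.
SigmaRef = cov(X') * (n - 1) / n;
maxDiff = max(max(abs(Sigma - SigmaRef)));
```

The only difference from the built-in cov is the normalization (n rather than n-1), which is negligible for large n.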

The covariance matrix \Sigma is, by construction (Equation 2), symmetric and positive semi-definite (if you don’t know what that means, don’t worry; it’s not terribly important for this discussion). Thus we can write the matrix as the product of two simpler matrices E and D, using a procedure known as Eigenvalue Decomposition:

\Sigma = EDE^{-1}                                                                                                 (3)

The matrix E is a [K \times K]-sized matrix, where each column is an eigenvector of \Sigma, and D is a diagonal matrix whose diagonal element D_{ii} is the eigenvalue that corresponds to the eigenvector in the i-th column of E. For more details on eigenvectors and eigenvalues see the following. From Equation (3), and using a little algebra, we can transform \Sigma into the diagonal matrix D

E^{-1} \Sigma E = D                                                                                                 (4)
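Equations (3) and (4) are easy to check numerically. The sketch below assumes a small hand-picked symmetric covariance matrix and relies on the fact that, for a symmetric input, eig returns orthonormal eigenvectors, so the transpose of E can stand in for its inverse.

```matlab
% Minimal sketch: verify the eigenvalue decomposition of a covariance matrix.
Sigma = [1 .9; .9 3];        % illustrative symmetric covariance matrix
[E, D] = eig(Sigma);         % columns of E: eigenvectors; diag(D): eigenvalues

orthErr  = norm(E'*E - eye(2));      % E is orthogonal, so E^{-1} = E'
reconErr = norm(E*D*E' - Sigma);     % Equation (3)
diagErr  = norm(E'*Sigma*E - D);     % Equation (4)
```

All three error terms should be at the level of floating-point round-off.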

Now, imagine the goal is to transform the data matrix X into a new data matrix Y

Y = W_DX                                                                                                   (5)

whose dimensions are uncorrelated (i.e. Y has a diagonal covariance D). Thus we want to determine the transformation W_D that makes:

D = Cov(Y) = \mathbb E[YY^T]                                                                                   (6)

Here we derive the expression for W_D using Equations (2), (4), (5), and (6):

D = \frac{W_DX(W_DX)^T}{n}                                                       (a la Equations (5) and (6))

D = W_D \frac{XX^T}{n} W_D^T = W_D \Sigma W_D^T                                  (via Equation (2))

E^{-1}\Sigma E = W_D \Sigma W_D^T                                                (via Equation (4))

now, because E is orthogonal, E^{-1} = E^T                                       (see following link for details)

E^T \Sigma E = W_D \Sigma W_D^T

Matching terms on the two sides, this is satisfied by

W_D = E^T                                                                                                   (7)

This means that we can transform X into a set of uncorrelated variables by premultiplying the data matrix X with the transpose of the eigenvector matrix of the data covariance matrix \Sigma.
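To make the decorrelation step concrete, here is a minimal sketch under a few assumptions: the data are synthetic 2-D Gaussian samples, drawn with a Cholesky factor rather than a toolbox sampler, and the names W_D, Y and covY are chosen to mirror the notation above.

```matlab
% Minimal sketch: decorrelate synthetic data with W_D = E' (Equation 7).
n = 1000;                                  % number of observations (illustrative)
S = [1 .9; .9 3];                          % covariance used to generate the data
X = chol(S, 'lower') * randn(2, n);        % correlated samples, one per column
X = bsxfun(@minus, X, mean(X, 2));         % center each row

Sigma = X*X'/n;                            % sample covariance (Equation 2)
Sigma = (Sigma + Sigma')/2;                % guard against round-off asymmetry
[E, D] = eig(Sigma);

W_D = E';                                  % decorrelation transform
Y = W_D * X;                               % Equation (5)

covY = Y*Y'/n;                             % should equal D (Equation 6)
offDiag = abs(covY(1,2));                  % remaining correlation
```

Because E is computed from the sample covariance itself, the off-diagonal entry vanishes up to round-off for any random draw, not just on average.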

Whitening: Transforming Data to Have an Identity Covariance Matrix

Ok, so now we have a way of transforming our data so that the dimensions are uncorrelated. However, this only gives us a diagonal covariance matrix, not an identity covariance matrix. In order to obtain an identity covariance, we also need to scale each dimension so that its variance is equal to one. How can we determine this transformation? We know how to transform our data so that the covariance is equal to D. If we can determine the transformation that turns D into I, then we can apply this transformation to our decorrelated data to give us the desired whitening transform. We can determine this from the somewhat trivial notion that

D^{-1}D = I                                                                                                        (8)

and further that

D^{-1} = D^{-1/2}ID^{-1/2}                                                                                             (9)

Combining Equations (8) and (9) gives D^{-1/2} D D^{-1/2} = I (diagonal matrices commute). Substituting Equation (4) for D, we can see that

D^{-1/2}E^{-1}\Sigma E D^{-1/2} = I                                                                                      (10)

Now say that we define a variable Y = W_W X, where W_W is the desired whitening transform that leaves the covariance of Y equal to the identity matrix. Using essentially the same set of derivation steps as above to solve for W_D, but starting from Equation (10), we find that

W_W = D^{-1/2}E^T                                                                                                  (11)

= D^{-1/2}W_D                                                                                                 (12)

Thus, the whitening transform is simply the decorrelation transform, scaled by the inverse of the square root of D (here the inverse and square root can be performed element-wise because D is a diagonal matrix).
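The following minimal sketch checks Equations (11) and (12) numerically. It assumes synthetic 3-D data produced by a hypothetical mixing matrix A; the matrix and the variable names are illustrative only.

```matlab
% Minimal sketch: whiten synthetic data with W_W = D^(-1/2) * E'.
n = 2000;                                  % number of observations (illustrative)
A = [2 0 0; 1 1 0; .5 -1 .3];              % hypothetical mixing matrix
X = A * randn(3, n);                       % correlated data, unequal variances
X = bsxfun(@minus, X, mean(X, 2));         % center each row

Sigma = X*X'/n;                            % sample covariance (Equation 2)
Sigma = (Sigma + Sigma')/2;                % guard against round-off asymmetry
[E, D] = eig(Sigma);

W_D = E';                                  % decorrelation (Equation 7)
W_W = diag(diag(D).^(-1/2)) * W_D;         % whitening (Equations 11 and 12)

Y = W_W * X;
covY = Y*Y'/n;                             % should be the identity matrix
whiteErr = norm(covY - eye(3));
```

Note that the element-wise power is applied only to the diagonal of D; applying it to the full matrix would divide by the zero off-diagonal entries.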

Interpretation of the Whitening Transform

So what does the whitening transformation actually do to the data (below, blue points)? We investigate this transformation below: The first operation decorrelates the data by premultiplying the data with the eigenvector matrix E^T, calculated from the data covariance. This decorrelation can be thought of as a rotation that reorients the data so that the directions along which the data have the largest (orthogonal) variance are aligned with the coordinate axes. This rotation is essentially the same procedure as the oft-used Principal Components Analysis (PCA), and is shown in the middle row.


The second operation, scaling by D^{-1/2}, can be thought of as squeezing the data (if the variance along a dimension is larger than one) or stretching the data (if the variance along a dimension is less than one). The stretching and squeezing forms the data into a sphere about the origin (which is why whitening is also referred to as “sphering”). This scaling operation is depicted in the bottom row in the plot above.

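The MATLAB code to make the plot above is here:

mu = [0 0];
S = [1 .9; .9 3];

% DRAW CORRELATED SAMPLES; COLUMNS ARE OBSERVATIONS
nSamples = 1000;
samples = mvnrnd(mu,S,nSamples)';

% EIGENVECTORS (COLUMNS OF E) AND EIGENVALUES (DIAGONAL OF D)
[E,D] = eig(S);

% DECORRELATE: ROTATE BY E^T (EQUATION 7)
samplesRotated = E'*samples;

% TAKE D^(-1/2)
D = diag(diag(D).^-.5);

% WHITEN: SCALE EACH ROTATED DIMENSION TO UNIT VARIANCE (EQUATION 11)
samplesRotatedScaled = D*samplesRotated;

% DISPLAY (PLOT CALLS RECONSTRUCTED; THE ORIGINAL LINES WERE LOST)
figure;
subplot(311);
plot(samples(1,:),samples(2,:),'b.');
axis square, grid
xlim([-5 5]);ylim([-5 5]);
title('Original Data');

subplot(312);
plot(samplesRotated(1,:),samplesRotated(2,:),'b.');
axis square, grid
xlim([-5 5]);ylim([-5 5]);
title('Decorrelate: Rotate by E^T');

subplot(313);
plot(samplesRotatedScaled(1,:),samplesRotatedScaled(2,:),'b.');
axis square, grid
xlim([-5 5]);ylim([-5 5]);
title('Whiten: scale by D^{-1/2}');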

The transformation in Equation (11), implemented above, whitens the data but leaves the data aligned with the principal axes of the original data. In order to observe the data in the original space, it is often customary to “un-rotate” the data back into its original space. This is done by just multiplying the whitening transform by the inverse of the rotation operation defined by the eigenvector matrix. Because the rotation is E^T, its inverse is E, which gives the whitening transform:

W = ED^{-1/2}E^T                                                                                                   (13)
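Here is a minimal sketch of the un-rotated transform, built by multiplying D^{-1/2}E^T on the left by E (the inverse of the rotation E^T). It assumes a small hand-picked covariance matrix and checks three properties: the result is symmetric, it maps the covariance to the identity, and it matches the inverse of the matrix square root of the covariance. The sqrtm comparison is included only as a cross-check.

```matlab
% Minimal sketch: the un-rotated whitening transform W = E * D^(-1/2) * E'.
Sigma = [1 .9; .9 3];                      % illustrative covariance matrix
[E, D] = eig(Sigma);

W = E * diag(diag(D).^(-1/2)) * E';        % rotate, scale, rotate back

symErr   = norm(W - W');                   % W is symmetric
whiteErr = norm(W*Sigma*W' - eye(2));      % covariance after applying W
sqrtErr  = norm(W - inv(sqrtm(Sigma)));    % W equals Sigma^(-1/2)
```

Rotating back does not undo the whitening: multiplying whitened data by any orthogonal matrix leaves its identity covariance unchanged.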

Let’s take a look at an example of using statistical whitening for a more complex problem: whitening patches of images sampled from natural scenes.

Example: Whitening Natural Scene Image Patches

Modeling the local spatial structure of pixels in natural scene images is important in many fields including computer vision and computational neuroscience. A useful model of natural scenes is one that can account for interesting, high-order statistical dependencies between pixels. However, because natural scenes are generally composed of continuous objects or surfaces, a vast majority of the spatial correlations in natural image data can be explained by local pairwise dependencies. For example, observe the image below.

im = double(imread('cameraman.tif'));
imagesc(im); colormap gray; axis image; axis off;
title('Base Image')


Given one of the gray pixels in the upper portion of the image, it is very likely that all pixels within the local neighborhood will also be gray. Thus there is a large amount of correlation between pixels in local regions of natural scenes. Statistical models of local structure applied to natural scenes will be dominated by these pairwise correlations, unless they are removed by preprocessing. Whitening provides such a preprocessing procedure.

Below we create and display a dataset of local image patches of size 16 \times 16 extracted at random from the image above. Each patch is rastered out into a column vector of size (16 \cdot 16) \times 1 = 256 \times 1. Each of these patches can be thought of as a sample of the local structure of this natural scene. Below we use the whitening transformation to remove pairwise correlations between pixels in each patch and scale the variance of each pixel to be one.


On the left is the dataset of extracted image patches, along with the corresponding covariance matrix for the image patches on the right. The large local correlation within the neighborhood of each pixel is indicated by the large bright diagonal regions throughout the covariance matrix.

The MATLAB code to extract and display the patches shown above is here:

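imSize = 256;
nPatches = 400;  % (MAKE SURE SQUARE)
patchSize = 16;
patches = zeros(patchSize*patchSize,nPatches);
patchIm = zeros(sqrt(nPatches)*patchSize);

% PAD THE IMAGE SO PATCHES NEAR THE BORDER STAY IN BOUNDS
im = padarray(im,[patchSize,patchSize],'symmetric');

% EXTRACT PATCHES AT RANDOM LOCATIONS
for iP = 1:nPatches
	pix = ceil(rand(2,1)*imSize);
	rows = pix(1):pix(1)+patchSize-1;
	cols = pix(2):pix(2)+patchSize-1;
	tmp = im(rows,cols);
	patches(:,iP) = reshape(tmp,patchSize*patchSize,1);
	% PLACE THE PATCH IN THE DISPLAY GRID (THE ORIGINAL INDEX LINES WERE
	% TRUNCATED; THE RANGES BELOW ARE A RECONSTRUCTION)
	gridRow = ceil(iP/sqrt(nPatches));
	gridCol = mod(iP-1,sqrt(nPatches)) + 1;
	rowIdx = (gridRow-1)*patchSize+1:gridRow*patchSize;
	colIdx = (gridCol-1)*patchSize+1:gridCol*patchSize;
	patchIm(rowIdx,colIdx) = tmp;
end

% CENTER EACH PIXEL ACROSS PATCHES
patchesCentered = bsxfun(@minus,patches,mean(patches,2));

% SAMPLE COVARIANCE (EQUATION 2)
S = patchesCentered*patchesCentered'/nPatches;

% DISPLAY (FIGURE CALLS RECONSTRUCTED; THE ORIGINAL LINES WERE LOST)
figure;
subplot(121);
imagesc(patchIm);
axis image; axis off; colormap gray;
title('Extracted Patches')

subplot(122);
imagesc(S);
axis image; axis off; colormap gray;
title('Extracted Patches Covariance')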

Below we apply the whitening transformation described above to the extracted image patches and display the whitened patches that result.

On the left, we see that the whitening procedure zeros out all areas in the extracted patches that have the same value (zero is indicated by gray). The whitening procedure also boosts the areas of high contrast (i.e. edges). The right plots the covariance matrix for the whitened patches. The covariance matrix is diagonal, indicating that pixels are now uncorrelated. In addition, all diagonal entries have the same value, indicating that all pixels now have the same variance (i.e. 1).

The MATLAB code used to whiten the image patches and create the display above is here:


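[E,D] = eig(S);

% CALCULATE D^(-1/2)
d = diag(D);
d = real(d.^-.5);
D = diag(d);

% WHITENING MATRIX (EQUATION 13)
W = E*D*E';

patchesWhitened = W*patchesCentered;

% ARRANGE THE WHITENED PATCHES IN A DISPLAY GRID (THE ORIGINAL INDEX
% LINES WERE TRUNCATED; THE RANGES BELOW ARE A RECONSTRUCTION)
wPatchIm = zeros(size(patchIm));
for iP = 1:nPatches
	gridRow = ceil(iP/sqrt(nPatches));
	gridCol = mod(iP-1,sqrt(nPatches)) + 1;
	rowIdx = (gridRow-1)*patchSize+1:gridRow*patchSize;
	colIdx = (gridCol-1)*patchSize+1:gridCol*patchSize;
	wPatchIm(rowIdx,colIdx) = reshape(patchesWhitened(:,iP),patchSize,patchSize);
end

% DISPLAY (FIGURE CALLS RECONSTRUCTED; THE ORIGINAL LINES WERE LOST)
figure;
subplot(121);
imagesc(wPatchIm);
axis image; axis off; colormap gray; caxis([-5 5]);
title('Whitened Patches')

subplot(122);
imagesc(patchesWhitened*patchesWhitened'/nPatches);
axis image; axis off; colormap gray; %colorbar
title('Whitened Patches Covariance');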

Investigating the Whitening Matrix: implications for theoretical neuroscience

So what does the whitening matrix look like, and what does it do? Below is the whitening matrix W calculated for the image patches dataset:

figure; imagesc(W);
axis image; colormap gray; colorbar
title('The Whitening Matrix W')


Each column of W is the operation that scales the variance of the corresponding pixel to be equal to one and decorrelates that pixel from the others in the 16 \times 16 patch (W is symmetric, so its rows and columns coincide). So what exactly does such an operation look like? We can get an idea by reshaping a column of W back into the shape of the image patches. Below we show what the 86th column of W looks like when reshaped in such a way (the index 86 has no particular significance; it was chosen at random):

figure; imagesc(reshape(W(:,86),16,16)),
colormap gray,
axis image, colorbar
title('Column 86 of W')


We see that the operation is essentially an impulse centered on the 86th pixel in the image (counting pixels starting in the upper left corner, proceeding down columns). This impulse is surrounded by inhibitory weights. If we were to look at the remaining columns of W, we would find that the same center-surround operation is replicated at every pixel location in each image patch. Essentially, the whitening transformation performs a convolution of each image patch with a center-surround filter whose properties are estimated from the patches dataset. Similar techniques are common in computer vision edge-detection algorithms.
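The center-surround structure can be reproduced in a much smaller setting. The sketch below is a 1-D analogue under stated assumptions: instead of image patches it uses an idealized covariance in which the correlation between two "pixels" decays geometrically with their distance (the decay rate rho and the number of pixels K are illustrative choices, not estimates from the image data).

```matlab
% Minimal sketch: a 1-D analogue of the center-surround whitening filter.
K   = 21;                                  % number of "pixels" (illustrative)
rho = 0.9;                                 % correlation between neighbors (assumed)
S   = toeplitz(rho.^(0:K-1));              % S(i,j) = rho^|i-j|

[E, D] = eig(S);
W = E * diag(diag(D).^(-1/2)) * E';        % un-rotated whitening matrix

c = (K + 1)/2;                             % index of the middle pixel
filt = W(:, c);                            % filter associated with that pixel

centerWeight    = filt(c);                 % excitatory center
neighborWeights = filt([c-1, c+1]);        % inhibitory flanks
whiteErr = norm(W*S*W' - eye(K));          % W whitens S
```

Plotting filt (for example with plot(filt)) should show a positive peak at the center pixel flanked by negative weights, the 1-D counterpart of the reshaped column of W shown above.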

Implications for theoretical neuroscience

A theoretical function of the primate retina is data compression: a large number of photoreceptors pass data from the retina into a physiological bottleneck, the optic nerve, which has far fewer fibers than the retina has photoreceptors. Thus removing redundant information is an important task that the retina must perform. When observing the whitened image patches above, we see that redundant information is nullified; pixels that have similar local values to one another are zeroed out. Thus, statistical whitening is a viable form of data compression.

It turns out that there is a large class of ganglion cells in the retina whose spatial receptive fields exhibit (that’s right) center-surround activation-inhibition, like the operation of the whitening matrix shown above! Thus it appears that the primate visual system may be performing data compression at the retina by means of an operation similar to statistical whitening. Above, we derived the center-surround whitening operation based on data sampled from a natural scene. Thus it seems reasonable that the primate visual system may have acquired a similar data-compression mechanism through experience with natural scenes, either through evolution or development.

About dustinstansbury

I recently received my PhD from UC Berkeley where I studied computational neuroscience and machine learning.

Posted on March 30, 2013, in Data Preprocessing, Derivations, Statistics. 7 Comments.

  1. Hi Dustin,

    Thanks for the nice writeup. I wanted to clarify one thing, in the first paragraph you write

    “When data have an identity covariance, all dimensions are statistically independent, and the variance of the data along each of the dimensions is equal to one.”

    Isn’t there a difference between statistical independence and variance along each dimension being one? I am thinking of this

  2. This was extremely helpful, and interesting. Thank you!

  3. Muralikrishna

    Good work… Thank you for the detailed explaination

  4. Here is a discussion and comparison of five natural whitening procedures, including PCA whitening as shown above: Kessy et. al. 2015. Optimal whitening and decorrelation.

  5. These posts are wonderful! Thank you so much.

  6. Adored this post, this is actually the first-time I have left a comment however ,
    Ive signed in and shared it with my social media friends – all the best with the blog.

  7. They also seem to do only the uncorrelation and stretching:

    So they do not rotate the data back… Now I’m not sure who’s right
