Derivation: Error Backpropagation & Gradient Descent for Neural Networks

The material in this post has been migrated, with Python implementations, to my GitHub Pages website.

Derivation: The Covariance Matrix of an OLS Estimator (and applications to GLS)

We showed in an earlier post that for the linear regression model

y = X\beta + \epsilon,

the optimal Ordinary Least Squares (OLS) estimator for model parameters \beta is

\hat \beta = (X^TX)^{-1}X^Ty
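As a quick illustration, the estimator above can be sketched in NumPy. The function name `ols_estimate` and the simulated data (sizes, true parameters, noise level) are assumptions made for this example only. The sketch solves the normal equations rather than forming the inverse explicitly, which gives the same result and is usually better behaved numerically.

```python
import numpy as np

def ols_estimate(X, y):
    """Return the OLS estimate by solving the normal equations (X^T X) b = X^T y."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.linalg.solve(X.T @ X, X.T @ y)

# Simulated data (hypothetical values, chosen only for illustration)
rng = np.random.default_rng(0)
n_obs, n_params = 200, 3
X = rng.normal(size=(n_obs, n_params))
beta_true = np.array([1.5, -2.0, 0.5])
y = X @ beta_true + 0.1 * rng.normal(size=n_obs)

beta_hat = ols_estimate(X, y)
print(beta_hat)
```

With low noise and a few hundred observations, the printed estimate should land close to `beta_true`.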

However, because the independent variables X and responses y can take on any value, they are both random variables. And, because \hat \beta is a function of X and y (and is linear in y), it is also a random variable, and therefore has a covariance. The covariance matrix C_{\hat \beta} of the OLS estimator is defined as:

C_{\hat \beta} = E[(\hat \beta - \beta)(\hat \beta - \beta)^T]

where E[\cdot] denotes the expected value operator. In order to find an expression for C_{\hat \beta}, we first need an expression for (\hat \beta - \beta). The following derives this expression:

\hat \beta = (X^TX)^{-1}X^T(X\beta + \epsilon),

where we use the fact that

y = X\beta + \epsilon.

It follows that

\hat \beta = (X^TX)^{-1}X^TX \beta + (X^TX)^{-1}X^T \epsilon

\hat \beta = \beta + (X^TX)^{-1}X^T \epsilon

and therefore

(\hat \beta - \beta) = (X^TX)^{-1}X^T \epsilon
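The identity above is easy to check numerically. The following is a minimal sketch on simulated data, where (unlike in practice) the true \beta and the noise \epsilon are known. All sizes and values are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n_obs, n_params = 50, 2
X = rng.normal(size=(n_obs, n_params))
beta = np.array([0.7, -1.2])      # true parameters (known only because we simulate)
eps = rng.normal(size=n_obs)      # noise realization
y = X @ beta + eps

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y

estimation_error = beta_hat - beta        # left-hand side of the identity
predicted_error = XtX_inv @ X.T @ eps     # right-hand side of the identity
identity_holds = np.allclose(estimation_error, predicted_error)
print(identity_holds)
```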

Now following the original definition for C_{\hat \beta}

C_{\hat \beta} = E[(\hat \beta - \beta)(\hat \beta - \beta)^T]

= E[(X^TX)^{-1}X^T\epsilon((X^TX)^{-1}X^T \epsilon)^T]

= E[(X^TX)^{-1}X^T\epsilon \epsilon^T X(X^TX)^{-1}]

where we take advantage of (AB)^T = B^T A^T, together with the symmetry of (X^TX)^{-1}, in order to rewrite the transposed factor inside the expectation. If we take X to be fixed for a given estimate \hat \beta (in other words we don’t randomly resample the independent variables), then the expectation only depends on the remaining stochastic/random variable, namely \epsilon. Therefore the above expression can be written as

C_{\hat \beta} = (X^TX)^{-1}X^T E[\epsilon \epsilon^T] X(X^TX)^{-1}.

Here E[\epsilon \epsilon^T] is the covariance of the noise term in the model (the noise is assumed to have zero mean). Because OLS assumes uncorrelated noise with the same variance for every observation, the noise covariance is equal to \sigma^2 I, where \sigma^2 is the variance of each noise term, and I is an identity matrix whose size equals the number of observations. The expression for the estimator covariance is now:

C_{\hat \beta} = (X^TX)^{-1}X^T (\sigma^2 I) X(X^TX)^{-1},

= \sigma^2 (X^TX)^{-1} X^T X(X^TX)^{-1}

which simplifies to

C_{\hat \beta} = \sigma^2 (X^T X)^{-1}
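One way to sanity-check this result is a small Monte Carlo simulation: hold X fixed, redraw the noise many times, re-estimate \hat \beta each time, and compare the empirical covariance of the estimates with \sigma^2 (X^TX)^{-1}. The sketch below assumes Gaussian noise and arbitrary illustrative values for \sigma, \beta, and the sample sizes.

```python
import numpy as np

rng = np.random.default_rng(2)
n_obs, n_params, n_trials = 100, 2, 20000
sigma = 0.5
X = rng.normal(size=(n_obs, n_params))   # design matrix, held fixed across trials
beta = np.array([1.0, -0.5])

XtX_inv = np.linalg.inv(X.T @ X)
theoretical_cov = sigma**2 * XtX_inv

# One noise vector per row; each row of Y is one simulated response vector
eps = sigma * rng.normal(size=(n_trials, n_obs))
Y = X @ beta + eps

# Row-wise OLS: beta_hat^T = y^T X (X^T X)^{-1}, using symmetry of the inverse
beta_hats = Y @ X @ XtX_inv

# Empirical estimate of E[(beta_hat - beta)(beta_hat - beta)^T]
errors = beta_hats - beta
empirical_cov = errors.T @ errors / n_trials

print(theoretical_cov)
print(empirical_cov)
```

The two printed matrices should agree up to sampling error, which shrinks as `n_trials` grows.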

A further simplifying assumption that is often made is that \epsilon is drawn from a zero-mean multivariate Gaussian distribution with unit variances (i.e. \sigma^2 = 1), resulting in a noise covariance equal to the identity. Thus

C_{\hat \beta} = (X^TX)^{-1}

Applying the derivation results to Generalized Least Squares

Notice that the expression for the OLS estimator covariance is equal to the first inverse term in the expression for the OLS estimator (up to the factor \sigma^2, which equals 1 under the unit-variance assumption). Identifying the covariance for the OLS estimator in this way gives a helpful heuristic to easily identify the covariance of related estimators that do not make the simplifying assumptions about the noise covariance that are made in OLS. For instance, in Generalized Least Squares (GLS), it is possible for the noise terms to co-vary. The covariance is represented as a noise covariance matrix C_{\epsilon}. This gives the model form

y = X \beta + \epsilon,

where E[\epsilon | X] = 0; Var[\epsilon | X] = C_{\epsilon}.

In other words, under GLS, the noise terms have zero mean and covariance C_{\epsilon}. It turns out that the estimator for the GLS model parameters is

\hat \beta_{GLS} = (X^T C_{\epsilon}^{-1} X)^{-1} X^T C_{\epsilon}^{-1}y.

Notice the similarity between the GLS and OLS estimators. The only difference is that in GLS, the solution for the parameters is weighted by the inverse of the noise covariance. And, in a similar fashion to the OLS estimator, the covariance of the GLS estimator is the first term in the product that defines the GLS estimator (repeating the derivation above with the GLS estimator and E[\epsilon \epsilon^T] = C_{\epsilon} confirms this):

C_{\hat \beta, GLS} = (X^T C_{\epsilon}^{-1}X)^{-1}
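The GLS estimator and its covariance can be sketched in the same way. The function name `gls_estimate`, the AR(1)-style noise covariance, and all numerical values below are assumptions for illustration. The sketch also assumes C_{\epsilon} is known and positive definite. As a consistency check, passing a noise covariance proportional to the identity should reproduce the OLS results.

```python
import numpy as np

def gls_estimate(X, y, C_eps):
    """Return the GLS estimate and its covariance for a known noise covariance C_eps."""
    # Solve C_eps Z = X and C_eps w = y instead of forming C_eps^{-1} explicitly
    Cinv_X = np.linalg.solve(C_eps, X)
    Cinv_y = np.linalg.solve(C_eps, y)
    cov_beta = np.linalg.inv(X.T @ Cinv_X)    # (X^T C_eps^{-1} X)^{-1}
    beta_hat = cov_beta @ (X.T @ Cinv_y)      # cov_beta @ X^T C_eps^{-1} y
    return beta_hat, cov_beta

rng = np.random.default_rng(3)
n_obs = 60
X = np.column_stack([np.ones(n_obs), np.linspace(0.0, 1.0, n_obs)])
beta = np.array([2.0, -1.0])

# Hypothetical correlated noise: C[i, j] = 0.2 * 0.8**|i - j|
idx = np.arange(n_obs)
C_eps = 0.2 * 0.8 ** np.abs(idx[:, None] - idx[None, :])

eps = rng.multivariate_normal(np.zeros(n_obs), C_eps)
y = X @ beta + eps
beta_gls, cov_gls = gls_estimate(X, y, C_eps)

# Special case: with C_eps = sigma^2 I, GLS reduces to OLS
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
beta_gls_identity, cov_identity = gls_estimate(X, y, 0.2 * np.eye(n_obs))

print(beta_gls)
print(cov_gls)
```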