Derivation: The Covariance Matrix of an OLS Estimator (and applications to GLS)

We showed in an earlier post that for the linear regression model $y = X\beta + \epsilon$,

the optimal Ordinary Least Squares (OLS) estimator for model parameters $\beta$ is $\hat \beta = (X^TX)^{-1}X^Ty$
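As a quick sanity check, here is a minimal NumPy sketch of the closed-form estimator $\hat \beta = (X^TX)^{-1}X^Ty$; the data sizes, noise scale, and `true_beta` are arbitrary choices for the demo:

```python
import numpy as np

# Simulate a small regression problem (all sizes/values are illustrative)
rng = np.random.default_rng(0)
n, p = 100, 3
X = rng.normal(size=(n, p))
true_beta = np.array([2.0, -1.0, 0.5])
y = X @ true_beta + rng.normal(scale=0.1, size=n)

# beta_hat = (X^T X)^{-1} X^T y; solve() is preferred to forming the inverse
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Agrees with NumPy's built-in least-squares solver
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(beta_hat, beta_lstsq)
```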

However, because the independent variables $X$ and responses $y$ can take on any value, they are random variables; and because $\hat \beta$ is a function of $X$ and $y$, it too is a random variable, and therefore has a covariance. The covariance matrix $C_{\hat \beta}$ of the OLS estimator is defined as: $C_{\hat \beta} = E[(\hat \beta - \beta)(\hat \beta - \beta)^T]$

where $E[\cdot]$ denotes the expected value operator. In order to find an expression for $C_{\hat \beta}$, we first need an expression for $(\hat \beta - \beta)$. Substituting the model into the estimator gives: $\hat \beta = (X^TX)^{-1}X^T(X\beta + \epsilon)$,

where we use the fact that $y = X\beta + \epsilon$.

It follows that

$\hat \beta = (X^TX)^{-1}X^TX \beta + (X^TX)^{-1}X^T \epsilon$

$\hat \beta = \beta + (X^TX)^{-1}X^T \epsilon$

and therefore $(\hat \beta - \beta) = (X^TX)^{-1}X^T \epsilon$
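This identity is easy to verify numerically. A small sketch, with all names and sizes chosen purely for illustration:

```python
import numpy as np

# Check that (beta_hat - beta) equals (X^T X)^{-1} X^T eps on simulated data
rng = np.random.default_rng(1)
n, p = 50, 2
X = rng.normal(size=(n, p))
beta = np.array([1.0, -3.0])
eps = rng.normal(size=n)
y = X @ beta + eps

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
lhs = beta_hat - beta
rhs = np.linalg.solve(X.T @ X, X.T @ eps)
assert np.allclose(lhs, rhs)
```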

Now, following the original definition for $C_{\hat \beta}$:

$C_{\hat \beta} = E[(\hat \beta - \beta)(\hat \beta - \beta)^T]$

$= E[(X^TX)^{-1}X^T\epsilon((X^TX)^{-1}X^T \epsilon)^T]$

$= E[(X^TX)^{-1}X^T\epsilon \epsilon^T X(X^TX)^{-1}]$

where we take advantage of $(AB)^T = B^T A^T$ in order to rewrite the second term in the product of the expectation. If we take $X$ to be fixed for a given estimator of $\hat \beta$ (in other words we don’t randomly resample the independent variables), then the expectation only depends on the remaining stochastic/random variable, namely $\epsilon$. Therefore the above expression can be written as $C_{\hat \beta} = (X^TX)^{-1}X^T E[\epsilon \epsilon^T] X(X^TX)^{-1}$.

where $E[\epsilon \epsilon^T]$ is the covariance of the noise term in the model. Because OLS assumes uncorrelated noise, the noise covariance is equal to $\sigma^2 I$, where $\sigma^2$ is the noise variance along each dimension, and $I$ is an identity matrix of size equal to the number of observations. The expression for the estimator covariance is now:

$C_{\hat \beta} = (X^TX)^{-1}X^T (\sigma^2 I) X(X^TX)^{-1}$

$= \sigma^2 (X^TX)^{-1} X^T X(X^TX)^{-1}$

which simplifies to $C_{\hat \beta} = \sigma^2 (X^T X)^{-1}$
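The result $C_{\hat \beta} = \sigma^2 (X^T X)^{-1}$ can be checked by Monte Carlo: hold $X$ fixed, redraw the noise many times, and compare the sample covariance of $\hat \beta$ against the formula. A sketch, with simulation sizes and $\sigma$ chosen arbitrarily:

```python
import numpy as np

# Empirical covariance of beta_hat over repeated noise draws
# should approach sigma^2 (X^T X)^{-1}
rng = np.random.default_rng(2)
n, p, sigma, n_sims = 50, 2, 0.5, 20000
X = rng.normal(size=(n, p))   # X is fixed across simulations
beta = np.array([1.0, 2.0])

betas = np.empty((n_sims, p))
for i in range(n_sims):
    y = X @ beta + rng.normal(scale=sigma, size=n)
    betas[i] = np.linalg.solve(X.T @ X, X.T @ y)

C_empirical = np.cov(betas, rowvar=False)
C_theory = sigma**2 * np.linalg.inv(X.T @ X)
assert np.allclose(C_empirical, C_theory, atol=1e-3)
```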

A further simplifying assumption often made under OLS is that $\epsilon$ is drawn from a zero-mean multivariate Gaussian distribution with unit variance (i.e. $\sigma^2 = 1$), making the noise covariance equal to the identity. Thus $C_{\hat \beta} = (X^TX)^{-1}$

Applying the derivation results to Generalized Least Squares

Notice that the expression for the OLS estimator covariance is equal to the first inverse term in the expression for the OLS estimator itself. Identifying the covariance of the OLS estimator in this way gives a helpful heuristic for identifying the covariance of related estimators that do not make OLS's simplifying assumptions about the noise covariance. For instance, in Generalized Least Squares (GLS), the noise terms are allowed to co-vary; their covariance is represented by a noise covariance matrix $C_{\epsilon}$. This gives the model form $y = X \beta + \epsilon$,

where $E[\epsilon | X] = 0; Var[\epsilon | X] = C_{\epsilon}$.

In other words, under GLS the noise terms have zero mean and covariance $C_{\epsilon}$. It turns out that the estimator for the GLS model parameters is $\hat \beta_{GLS} = (X^T C_{\epsilon}^{-1} X)^{-1} X^T C_{\epsilon}^{-1}y$.

Notice the similarity between the GLS and OLS estimators. The only difference is that in GLS, the solution for the parameters is weighted by the inverse of the noise covariance. And, in a similar fashion to the OLS estimator, the covariance of the GLS estimator is the first inverse term in the product that defines the GLS estimator: $C_{\hat \beta, GLS} = (X^T C_{\epsilon}^{-1}X)^{-1}$
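To make the GLS formulas concrete, here is a minimal sketch assuming a known noise covariance $C_{\epsilon}$; the AR(1)-style correlation structure (`0.8 ** |i - j|`) and all sizes are made up for the demo:

```python
import numpy as np

# Simulate correlated noise and fit the GLS estimator
rng = np.random.default_rng(3)
n, p = 60, 2
X = rng.normal(size=(n, p))
beta = np.array([0.5, -1.5])

# Noise covariance C_eps[i, j] = 0.8**|i - j| (AR(1)-style, for illustration)
idx = np.arange(n)
C_eps = 0.8 ** np.abs(idx[:, None] - idx[None, :])

# Draw noise with covariance C_eps via its Cholesky factor
L = np.linalg.cholesky(C_eps)
y = X @ beta + L @ rng.normal(size=n)

# beta_hat_GLS = (X^T C_eps^{-1} X)^{-1} X^T C_eps^{-1} y
C_eps_inv = np.linalg.inv(C_eps)
A = X.T @ C_eps_inv @ X
beta_hat_gls = np.linalg.solve(A, X.T @ C_eps_inv @ y)

# The estimator covariance is the first inverse term in the product
C_beta_gls = np.linalg.inv(A)
```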