Blog Archives

The material in this post has been migrated, with Python implementations, to my GitHub Pages website.

Posted in Algorithms, Classification, Derivations, Gradient Descent, Machine Learning, Neural Networks, Optimization, Regression, Theory

18 Comments

Tags: backprop derivation, backpropagation algorithm, backpropagation derivation, Derivation, Machine Learning, Neural Networks

Derivation: The Covariance Matrix of an OLS Estimator (and applications to GLS)

Jan 14

Posted by dustinstansbury

We showed in an earlier post that for the linear regression model

$y = X\beta + \epsilon$ ,

the optimal Ordinary Least Squares (OLS) estimator for model parameters $\beta$ is

$\hat \beta = (X^TX)^{-1}X^Ty$

However, because the independent variables $X$ and responses $y$ can take on any value, they can both be treated as random variables. And, because $\hat \beta$ is a function of $X$ and $y$ (linear in $y$), it is also a random variable, and therefore has a covariance. The covariance matrix $C_{\hat \beta}$ for the OLS estimator is defined as:

$C_{\hat \beta} = E[(\hat \beta - \beta)(\hat \beta - \beta)^T]$

where $E[\cdot]$ denotes the expected value operator. In order to find an expression for $C_{\hat \beta}$ , we first need an expression for $(\hat \beta - \beta)$ . The following derives this expression:

$\hat \beta = (X^TX)^{-1}X^T(X\beta + \epsilon)$ ,

where we use the fact that

$y = X\beta + \epsilon$ .

It follows that

$\hat \beta = (X^TX)^{-1}X^TX \beta + (X^TX)^{-1}X^T\epsilon$

$\hat \beta = \beta + (X^TX)^{-1}X^T \epsilon$

and therefore

$(\hat \beta - \beta) = (X^TX)^{-1}X^T \epsilon$
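The identity above can be checked numerically. The following is a minimal sketch using NumPy; the problem sizes, the "true" parameter vector, and the noise scale are illustrative assumptions, not values from this post.

```python
import numpy as np

rng = np.random.default_rng(0)

n_obs, n_params = 50, 3                    # illustrative sizes (assumed)
X = rng.normal(size=(n_obs, n_params))     # design matrix
beta = np.array([1.0, -2.0, 0.5])          # "true" parameters (assumed)
eps = rng.normal(scale=0.3, size=n_obs)    # a single draw of the noise
y = X @ beta + eps

# OLS estimate from the normal equations; solve() avoids an explicit inverse
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Right-hand side of the identity: (X^T X)^{-1} X^T eps
projected_noise = np.linalg.solve(X.T @ X, X.T @ eps)

# The two sides should agree up to floating-point error
print(np.allclose(beta_hat - beta, projected_noise))
```

The check only works in simulation, where $\beta$ and $\epsilon$ are known; with real data neither is observed.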

Now following the original definition for $C_{\hat \beta}$ …

$C_{\hat \beta} = E[(\hat \beta - \beta)(\hat \beta - \beta)^T]$

$= E[(X^TX)^{-1}X^T\epsilon((X^TX)^{-1}X^T \epsilon)^T]$

$= E[(X^TX)^{-1}X^T\epsilon \epsilon^T X(X^TX)^{-1}]$

where we take advantage of $(AB)^T = B^T A^T$, together with the symmetry of $(X^TX)^{-1}$, in order to rewrite the second factor of the product inside the expectation. If we take $X$ to be fixed for a given estimate $\hat \beta$ (in other words, we don’t randomly resample the independent variables), then the expectation only depends on the remaining stochastic/random variable, namely $\epsilon$ . Therefore the above expression can be written as

$C_{\hat \beta} = (X^TX)^{-1}X^T E[\epsilon \epsilon^T] X(X^TX)^{-1}$ ,

where $E[\epsilon \epsilon^T]$ is the covariance of the noise term in the model (recall that the noise has zero mean). Because OLS assumes uncorrelated noise with the same variance for every observation, the noise covariance is equal to $\sigma^2 I$ , where $\sigma^2$ is the variance of each noise term, and $I$ is an identity matrix whose size equals the number of observations. The expression for the estimator covariance is now:

$C_{\hat \beta} = (X^TX)^{-1}X^T (\sigma^2 I) X(X^TX)^{-1}$ ,

$= \sigma^2 I (X^TX)^{-1} X^T X(X^TX)^{-1}$

which simplifies to

$C_{\hat \beta} = \sigma^2 (X^T X)^{-1}$
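One way to sanity-check this result is a small Monte Carlo simulation: hold $X$ fixed, redraw the noise many times, refit the OLS estimator each time, and compare the empirical covariance of the estimates against $\sigma^2 (X^TX)^{-1}$. The sketch below assumes Gaussian noise and uses arbitrary illustrative sizes and parameter values; none of these come from the derivation itself.

```python
import numpy as np

rng = np.random.default_rng(1)

n_obs, n_params, n_trials = 40, 2, 20000   # illustrative sizes (assumed)
sigma = 0.5                                # noise standard deviation (assumed)
X = rng.normal(size=(n_obs, n_params))     # held fixed across all trials
beta = np.array([2.0, -1.0])               # "true" parameters (assumed)

# Analytic covariance from the derivation: sigma^2 (X^T X)^{-1}
C_analytic = sigma**2 * np.linalg.inv(X.T @ X)

# Each row of eps is an independent noise draw, so each row of Y is one dataset
eps = rng.normal(scale=sigma, size=(n_trials, n_obs))
Y = X @ beta + eps

# Fit every dataset at once by solving (X^T X) B = X^T Y^T for B
beta_hats = np.linalg.solve(X.T @ X, X.T @ Y.T).T   # shape (n_trials, n_params)

# Sample covariance of the estimates (rows are trials, columns are parameters)
C_empirical = np.cov(beta_hats, rowvar=False)

print("analytic:\n", C_analytic)
print("empirical:\n", C_empirical)
```

The two matrices should agree up to sampling error, which shrinks as `n_trials` grows. Note that `np.cov` centers on the sample mean of the estimates rather than on the true $\beta$; because the OLS estimator is unbiased, the difference is negligible here.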

A further simplifying assumption that is often made in OLS is that $\epsilon$ is drawn from a zero-mean multivariate Gaussian distribution with unit variances (i.e. $\sigma^2 = 1$ ), resulting in a noise covariance equal to the identity. Thus

$C_{\hat \beta} = (X^TX)^{-1}$

Applying the derivation results to Generalized Least Squares

Notice that, under the unit-variance assumption, the expression for the OLS estimator covariance is equal to the first inverse term in the expression for the OLS estimator. Identifying the covariance for the OLS estimator in this way gives a helpful heuristic for identifying the covariance of related estimators that do not make the simplifying assumptions about the noise covariance that are made in OLS. For instance, in Generalized Least Squares (GLS), the noise terms are allowed to co-vary. This is represented by a noise covariance matrix $C_{\epsilon}$ , which gives the model form

$y = X \beta + \epsilon$ ,

where $E[\epsilon | X] = 0; Var[\epsilon | X] = C_{\epsilon}$ .

In other words, under GLS, the noise terms have zero mean and covariance $C_{\epsilon}$ . It turns out that the estimator for the GLS model parameters is

$\hat \beta_{GLS} = (X^T C_{\epsilon}^{-1} X)^{-1} X^T C_{\epsilon}^{-1}y$ .
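As a rough illustration, the GLS estimator can be computed directly from this formula, or equivalently by "whitening" the data with a Cholesky factor of $C_{\epsilon}$ and then running ordinary least squares. The whitening route is a standard equivalent view rather than something derived in this post, and the AR(1)-style noise covariance, sizes, and parameter values below are assumptions chosen only for the example.

```python
import numpy as np

rng = np.random.default_rng(2)

n_obs, n_params = 30, 2                    # illustrative sizes (assumed)
X = rng.normal(size=(n_obs, n_params))
beta = np.array([1.5, -0.5])               # "true" parameters (assumed)

# Assumed noise covariance: correlation decays as rho**|i - j|
rho = 0.6
idx = np.arange(n_obs)
C_eps = rho ** np.abs(idx[:, None] - idx[None, :])

# Draw correlated noise using the Cholesky factor, C_eps = L L^T
L = np.linalg.cholesky(C_eps)
eps = L @ rng.normal(size=n_obs)
y = X @ beta + eps

# Direct GLS: (X^T C^{-1} X)^{-1} X^T C^{-1} y, using solves instead of inverses
C_inv_X = np.linalg.solve(C_eps, X)
C_inv_y = np.linalg.solve(C_eps, y)
beta_gls = np.linalg.solve(X.T @ C_inv_X, X.T @ C_inv_y)

# Whitening route: premultiply by L^{-1}, then run plain OLS
X_w = np.linalg.solve(L, X)
y_w = np.linalg.solve(L, y)
beta_whitened, *_ = np.linalg.lstsq(X_w, y_w, rcond=None)

# Both routes should give the same estimate up to floating-point error
print(np.allclose(beta_gls, beta_whitened))
```

Using `np.linalg.solve` rather than forming $C_{\epsilon}^{-1}$ explicitly is a numerical-stability choice, not a change to the estimator.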

Notice the similarity between the GLS and OLS estimators. The only difference is that in GLS, the terms of the solution are weighted by the inverse of the noise covariance. And, in a similar fashion to the OLS estimator, the covariance for the GLS estimator is the first term in the product that defines the GLS estimator:

$C_{\hat \beta, GLS} = (X^T C_{\epsilon}^{-1}X)^{-1}$
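As a sketch of why this heuristic holds, the OLS steps can be repeated for GLS, assuming that $C_{\epsilon}$ is symmetric and invertible and that $X$ is held fixed. Substituting $y = X\beta + \epsilon$ into the GLS estimator gives

$(\hat \beta_{GLS} - \beta) = (X^T C_{\epsilon}^{-1} X)^{-1} X^T C_{\epsilon}^{-1} \epsilon$

Plugging this into the definition of the covariance and using $E[\epsilon \epsilon^T] = C_{\epsilon}$ ,

$C_{\hat \beta, GLS} = (X^T C_{\epsilon}^{-1} X)^{-1} X^T C_{\epsilon}^{-1} E[\epsilon \epsilon^T] C_{\epsilon}^{-1} X (X^T C_{\epsilon}^{-1} X)^{-1}$

$= (X^T C_{\epsilon}^{-1} X)^{-1} (X^T C_{\epsilon}^{-1} X) (X^T C_{\epsilon}^{-1} X)^{-1}$

$= (X^T C_{\epsilon}^{-1} X)^{-1}$

Setting $C_{\epsilon} = \sigma^2 I$ recovers the OLS result $\sigma^2 (X^TX)^{-1}$ .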

Posted in Derivations, Regression, Statistics

2 Comments

Tags: Covariance Matrix, Derivation, Expected Value, Generalize Least Squares, Noise Covariance, OLS, OLS Estimator, Ordinary Least Squares

The OG Clever Machine

Topics in Computational Neuroscience & Machine Learning
