# Derivation: Ordinary Least Squares Solution and Normal Equations

In a linear regression framework, we assume an output variable $y$ is a linear combination of a set of independent input variables $X$ plus independent noise $\epsilon$. The way the input variables are combined is defined by a parameter vector $\beta$:

$\Large{\begin{array}{rcl} y &=& X \beta + \epsilon \end{array}}$

We also assume that the noise term $\epsilon$ is drawn from a standard multivariate Normal distribution, with zero mean and identity covariance:

$\Large{ \begin{array}{rcl}\epsilon &\sim& N(0,I)\end{array}}$

For some estimate of the model parameters $\hat \beta$, the model’s prediction errors/residuals $e$ are the difference between the observed output values and the model predictions:

$\Large{\begin{array}{rcl} e &=& y - X\hat \beta \end{array}}$
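To make the setup concrete, here is a minimal NumPy sketch that simulates data from the model and computes the residuals for an arbitrary estimate. The sizes, the seed, and the values in `beta_true` are made up for illustration; nothing here comes from a real dataset.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility

n_obs, n_params = 100, 3                   # hypothetical problem size
X = rng.normal(size=(n_obs, n_params))     # input variables (design matrix)
beta_true = np.array([1.5, -2.0, 0.5])     # hypothetical "true" parameters
eps = rng.standard_normal(n_obs)           # noise, eps ~ N(0, I)
y = X @ beta_true + eps                    # y = X beta + eps

beta_hat = np.zeros(n_params)              # an arbitrary (and poor) estimate
e = y - X @ beta_hat                       # residuals e = y - X beta_hat
```

With `beta_hat` set to all zeros the model predicts zero everywhere, so the residuals are just `y`; a better estimate should shrink them.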

The Ordinary Least Squares (OLS) solution to the problem (i.e. determining an optimal solution for $\hat \beta$) involves minimizing the sum of the squared errors with respect to the model parameters, $\hat \beta$. The sum of squared errors is equal to the inner product of the residuals vector with itself $\sum e_i^2 = e^Te$ :

$\Large{\begin{array}{rcl} e^T e &=& (y - X \hat \beta)^T (y - X \hat \beta) \\ &=& y^Ty - y^T (X \hat \beta) - (X \hat \beta)^T y + (X \hat \beta)^T (X \hat \beta) \\ &=& y^Ty - (X \hat \beta)^T y - (X \hat \beta)^T y + (X \hat \beta)^T (X \hat \beta) \\ &=& y^Ty - 2(X \hat \beta)^T y + (X \hat \beta)^T (X \hat \beta) \\ &=& y^Ty - 2\hat \beta^T X^T y + \hat \beta^T X^T X \hat \beta \end{array}}$

The third line uses the fact that $y^T (X \hat \beta)$ is a scalar and therefore equals its own transpose, $(X \hat \beta)^T y$.
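As a quick sanity check on the expansion, the sketch below evaluates the sum of squared errors three ways on random data: as an explicit sum, as the inner product $e^Te$, and using the final expanded form. The data and the seed are arbitrary placeholders.

```python
import numpy as np

rng = np.random.default_rng(1)             # arbitrary seed
X = rng.normal(size=(50, 4))               # random inputs (illustrative only)
y = rng.normal(size=50)                    # random outputs (illustrative only)
beta_hat = rng.normal(size=4)              # any candidate parameter vector

e = y - X @ beta_hat                       # residuals

sse_direct = np.sum(e ** 2)                # sum_i e_i^2
sse_inner = e @ e                          # e^T e
# y^T y - 2 beta_hat^T X^T y + beta_hat^T X^T X beta_hat
sse_expanded = y @ y - 2 * beta_hat @ X.T @ y + beta_hat @ X.T @ X @ beta_hat

print(np.isclose(sse_direct, sse_inner), np.isclose(sse_inner, sse_expanded))
```

All three quantities should agree up to floating-point rounding, whatever `beta_hat` is.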

To find the $\hat \beta$ that minimizes the sum of squared residuals, we take its derivative with respect to the parameters and set it to zero:

$\Large{\begin{array}{rcl} \frac{\partial}{\partial \hat \beta} \left[ e^T e \right] &=& -2X^Ty + 2X^TX \hat \beta \\ &=& 0 \text{, and thus} \\ X^Ty &=& X^TX \hat \beta \end{array}}$

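due to the identities $\frac{\partial \mathbf{a}^T \mathbf{b}}{\partial \mathbf{a}} = \mathbf{b}$, for vectors $\mathbf{a}$ and $\mathbf{b}$, and $\frac{\partial \mathbf{a}^T A \mathbf{a}}{\partial \mathbf{a}} = 2A\mathbf{a}$, for a symmetric matrix $A$ (here $A = X^TX$). This relationship is the matrix form of the Normal Equations. Provided $X^TX$ is invertible (i.e. the columns of $X$ are linearly independent), solving for $\hat \beta$ gives the analytical solution to the Ordinary Least Squares problem.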
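The gradient expression can be checked numerically. The sketch below compares $-2X^Ty + 2X^TX\hat \beta$ against a central finite-difference approximation of the derivative of $e^Te$. The function names `sse` and `sse_gradient`, the step size `h`, and the random data are all choices made for this illustration.

```python
import numpy as np

def sse(beta, X, y):
    """Sum of squared errors e^T e for a given parameter vector."""
    e = y - X @ beta
    return e @ e

def sse_gradient(beta, X, y):
    """Analytic gradient: -2 X^T y + 2 X^T X beta."""
    return -2 * X.T @ y + 2 * X.T @ X @ beta

rng = np.random.default_rng(2)             # arbitrary seed
X = rng.normal(size=(50, 4))
y = rng.normal(size=50)
beta_hat = rng.normal(size=4)

h = 1e-6                                   # finite-difference step (assumed)
# Perturb one coordinate at a time using the rows of the identity matrix.
numeric = np.array([
    (sse(beta_hat + h * unit, X, y) - sse(beta_hat - h * unit, X, y)) / (2 * h)
    for unit in np.eye(len(beta_hat))
])
analytic = sse_gradient(beta_hat, X, y)

print(np.allclose(numeric, analytic, rtol=1e-5, atol=1e-5))
```

Because $e^Te$ is quadratic in $\hat \beta$, the central difference is exact apart from rounding, so the two gradients should match closely.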

$\Large{\begin{array}{rcl} \hat \beta &=& (X^TX)^{-1}X^Ty \end{array}}$

Boom.
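For a numerical illustration of the closed-form solution, here is a small NumPy sketch on simulated data. The sizes, the seed, and `beta_true` are invented for the example. It solves the Normal Equations with `np.linalg.solve` rather than forming $(X^TX)^{-1}$ explicitly, which gives the same answer in exact arithmetic and is generally the better-behaved choice numerically, then compares the result with NumPy's own least-squares routine.

```python
import numpy as np

rng = np.random.default_rng(3)             # arbitrary seed
n_obs, n_params = 200, 3                   # hypothetical problem size
X = rng.normal(size=(n_obs, n_params))
beta_true = np.array([2.0, -1.0, 0.5])     # hypothetical "true" parameters
y = X @ beta_true + rng.standard_normal(n_obs)

# Normal Equations: X^T X beta_hat = X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Reference solution from NumPy's least-squares solver
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# The gradient of e^T e should vanish at the OLS solution
grad = -2 * X.T @ y + 2 * X.T @ X @ beta_hat

print(beta_hat)
print(np.allclose(beta_hat, beta_lstsq), np.allclose(grad, 0.0, atol=1e-8))
```

The estimate should land near `beta_true`, though not exactly on it, since the simulated `y` includes noise.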

I recently received my PhD from UC Berkeley where I studied computational neuroscience and machine learning.

Posted on September 1, 2012, in Derivations, Regression, Statistics, Theory.