# Category Archives: Derivations

## Derivation: Maximum Likelihood for Boltzmann Machines

In this post I will review the gradient-based learning algorithm that is commonly used to train the general class of models known as Boltzmann machines. Though the primary goal of the post is to supplement another post on restricted Boltzmann machines, I hope that those readers who are curious about how Boltzmann machines are trained, but have found it difficult to track down a complete or straightforward derivation of the maximum likelihood learning algorithm for these models (as I have), will also find the post informative.

First, a little background: Boltzmann machines are stochastic neural networks that can be thought of as the probabilistic extension of the Hopfield network. The goal of the Boltzmann machine is to model a set of observed data in terms of a set of visible random variables $v$ and a set of latent/unobserved random variables $h$. Due to the relationship between Boltzmann machines and neural networks, the random variables are often referred to as “units.” The role of the visible units is to approximate the true distribution of the data, while the role of the latent variables is to extend the expressiveness of the model by capturing underlying features in the observed data. The latent variables are often referred to as hidden units, as they do not result directly from the observed data and are generally marginalized over to obtain the likelihood of the observed data, i.e.

$\Large{\begin{array}{rcl} p(v;\theta) &=& \sum_h p(v,h; \theta) \end{array}}$,

where $p(v,h; \theta)$ is the joint probability distribution over the visible and hidden units based on the current model parameters $\theta$. The general Boltzmann machine defines $p(v,h; \theta)$ through a set of weighted,  symmetric connections between all visible and hidden units (but no connections from any unit to itself). The graphical model for the general Boltzmann machine is shown in Figure 1.

Figure 1: Graphical Model of the Boltzmann machine (biases not depicted).

Given the current state of the visible and hidden units, the overall configuration of the model network is described by a connectivity function $E(v,h;\theta)$, parameterized by $\theta = \{W, A, B, a, b\}$:

$\Large{\begin{array}{rcl} E(v,h; \theta) &=& v^T W h + h^T A h + v^T B v + h^T a + v^T b \end{array}}.$

The parameter matrix $W$ defines the connection strength between the visible and hidden units. The parameters $A$ and $B$ define the connection strength amongst hidden units and visible units, respectively. The model also includes a set of  biases $a$ and $b$ that capture offsets for each of the hidden and visible units.

The Boltzmann machine has been used for years in the field of statistical mechanics to model physical systems based on the principle of energy minimization. In statistical mechanics, the connectivity function is often referred to as the “energy function,” a term that has also become standard in the statistical learning literature. Note that the energy function returns a single scalar value for any configuration of the network parameters and random variable states.

Given the energy function, the Boltzmann machine models the joint probability of the visible and hidden unit states as a Boltzmann distribution:

$\Large{\begin{array}{rcl} p(v,h; \theta) &=& \frac{\mathrm{e}^{-E(v,h; \theta)}}{Z(\theta)} \text{ , where} \\ \\ Z(\theta) &=& \sum_{v'} \sum_{h'} \mathrm{e}^{-E(v',h'; \theta)}\end{array}}$

The partition function $Z(\theta)$ is a normalizing constant that is calculated by summing over all possible states of the network $(v', h') \in (V',H')$. Here we assume that all random variables take on discrete values, but the analogous derivation holds for continuous or mixed variable types by replacing the sums with integrals accordingly.
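To make these definitions concrete, here is a minimal NumPy sketch (toy sizes and randomly chosen parameters, not a trained model) that evaluates the energy function and the Boltzmann distribution for a tiny network, computing the partition function $Z(\theta)$ by brute-force enumeration of all binary states, which is only feasible for very small models:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n_v, n_h = 3, 2   # tiny network so all joint states can be enumerated

# Arbitrary toy parameters theta = {W, A, B, a, b}
W = rng.normal(scale=0.1, size=(n_v, n_h))
A = rng.normal(scale=0.1, size=(n_h, n_h))
B = rng.normal(scale=0.1, size=(n_v, n_v))
a = rng.normal(scale=0.1, size=n_h)
b = rng.normal(scale=0.1, size=n_v)

def energy(v, h):
    """Connectivity/energy function E(v, h; theta) as defined above."""
    return v @ W @ h + h @ A @ h + v @ B @ v + h @ a + v @ b

states_v = [np.array(s) for s in itertools.product([0, 1], repeat=n_v)]
states_h = [np.array(s) for s in itertools.product([0, 1], repeat=n_h)]

# Partition function: sum of e^{-E} over all 2^(n_v + n_h) joint states
Z = sum(np.exp(-energy(v, h)) for v in states_v for h in states_h)

def p_joint(v, h):
    """Boltzmann distribution p(v, h; theta)."""
    return np.exp(-energy(v, h)) / Z

total = sum(p_joint(v, h) for v in states_v for h in states_h)
```

Summing $p(v,h)$ over every state recovers 1, confirming that $Z(\theta)$ normalizes the distribution; for realistically-sized models this enumeration is intractable, which is what motivates the approximations discussed below.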

The common way to train the Boltzmann machine is to determine the parameters that maximize the likelihood of the observed data. To determine the parameters, we perform gradient ascent on the log of the likelihood function (equivalently, gradient descent on the negative log-likelihood). In order to simplify the notation in the remainder of the derivation, we do not include the explicit dependency on the parameters $\theta$. To further simplify things, let’s also assume that we calculate the gradient of the likelihood based on a single observation:

$\Large{ \begin{array}{rcl} l(v; \theta) &=& \log p(v) \\ &=& \log \sum_h p(v,h) \\ &=& \log \frac{\sum_h \mathrm{e}^{-E(v,h)}}{Z} \\ &=& \log \sum_h \mathrm{e}^{-E(v,h)} - \log Z \\ &=& \log \sum_h \mathrm{e}^{-E(v,h)} - \log \sum_{v'} \sum_{h'} \mathrm{e}^{-E(v',h')} \end{array}}$

The gradient calculation is as follows:

$\Large{ \begin{array}{rcl} \frac{\partial l(v;\theta)}{\partial \theta} &=& \frac{\partial}{\partial \theta}\log \sum_h \mathrm{e}^{-E(v,h)} - \frac{\partial}{\partial \theta} \log \sum_{v'}\sum_{h'}\mathrm{e}^{-E(v',h')} \\ &=& \frac{1}{\sum_h \mathrm{e}^{-E(v,h)}} \frac{\partial}{\partial \theta} \sum_h \mathrm{e}^{-E(v,h)} - \frac{1}{\sum_{v'}\sum_{h'}\mathrm{e}^{-E(v',h')}} \frac{\partial}{\partial \theta} \sum_{v'}\sum_{h'}\mathrm{e}^{-E(v',h')} \\ &=& - \frac{1}{\sum_h \mathrm{e}^{-E(v,h)}} \sum_h \mathrm{e}^{-E(v,h)}\frac{\partial E(v,h)}{\partial \theta} + \frac{1}{\sum_{v'}\sum_{h'}\mathrm{e}^{-E(v',h')}} \sum_{v'}\sum_{h'}\mathrm{e}^{-E(v',h')}\frac{\partial E(v',h')}{\partial \theta} \end{array}}$

Here we can simplify the expression somewhat by noting that $\mathrm{e}^{-E(v,h)} = Z p(v,h)$, that $Z = \sum_{v'}\sum_{h'}\mathrm{e}^{-E(v',h')}$, and also that $Z$ is a constant:

$\Large{ \begin{array}{rcl} \frac{\partial l(v;\theta)}{\partial \theta} &=& - \frac{1}{Z\sum_h p(v,h)} Z \sum_h p(v,h) \frac{\partial E(v,h)}{\partial \theta} + \frac{1}{Z} Z \sum_{v'}\sum_{h'}p(v',h')\frac{\partial E(v',h')}{\partial \theta} \\ &=& - \frac{1}{\sum_h p(v,h)} \sum_h p(v,h) \frac{\partial E(v,h)}{\partial \theta} + \sum_{v'}\sum_{h'}p(v',h')\frac{\partial E(v',h')}{\partial \theta} \\ \end{array}}$

If we also note that $\sum_h p(v,h)= p(v)$, and use the definition of conditional probability $p(h|v) = \frac{p(v,h)}{p(v)}$, we can further simplify the expression for the gradient:

$\Large{ \begin{array}{rcl} \frac{\partial l(v;\theta)}{\partial \theta} &=& - \frac{1}{p(v)} \sum_h p(v,h) \frac{\partial E(v,h)}{\partial \theta} + \sum_{v'}\sum_{h'}p(v',h')\frac{\partial E(v',h')}{\partial \theta} \\ &=& -\sum_h \frac{p(v,h)}{p(v)} \frac{\partial E(v,h)}{\partial \theta} + \sum_{v'}\sum_{h'}p(v',h')\frac{\partial E(v',h')}{\partial \theta} \\ &=& -\sum_h p(h | v) \frac{\partial E(v,h)}{\partial \theta} + \sum_{v'}\sum_{h'}p(v',h')\frac{\partial E(v',h')}{\partial \theta} \\ &=& -\mathbb{E}_{p(h | v)} \frac{\partial E(v,h)}{\partial \theta} + \mathbb{E}_{p(v',h')}\frac{\partial E(v',h')}{\partial \theta}. \\ \end{array}}$

Here $\mathbb{E}_{p(*)}$ is the expected value under the distribution $p(*)$. Thus the gradient of the likelihood function is composed of two parts. The first part is the expected gradient of the energy function with respect to the conditional distribution $p(h|v)$. The second part is the expected gradient of the energy function with respect to the joint distribution over all variable states. However, calculating these expectations is generally infeasible for any realistically-sized model, as it involves summing over a huge number of possible states/configurations. The general approach for solving this problem is to use Markov Chain Monte Carlo (MCMC) to approximate these sums:

$\Large{\begin{array}{rcl} \frac{\partial l(v;\theta)}{\partial \theta} &\approx& -\left \langle \frac{\partial E(v,h)}{\partial \theta} \right \rangle_{p(h_{\text{data}}|v_{\text{data}})} + \left \langle \frac{\partial E(v,h)}{\partial \theta} \right \rangle_{p(v_{\text{model}},h_{\text{model}})} \\ \end{array}}.$

Here $\langle \rangle_{p(*)}$ is the sample average of samples drawn according to the process $p(*)$. The first term is calculated by taking the average value of the energy function gradient when the visible and hidden units are being driven by observed data samples. In practice, this first term is generally straightforward to calculate. Calculating the second term is generally more complicated and involves running a set of Markov chains until they reach the current model’s equilibrium distribution (i.e. via Gibbs sampling, Metropolis-Hastings, or the like), then taking the average energy function gradient based on those samples. See this post on MCMC methods for details. It turns out that there is a subclass of Boltzmann machines that, due to a restricted connectivity/energy function (specifically, the parameters $(A, B)=0$), allow for efficient MCMC by way of blocked Gibbs sampling. These models, known as restricted Boltzmann machines, have become an important component for unsupervised pretraining in the field of deep learning and will be the focus of a related post.
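Although MCMC is required in practice, the two-expectation form of the gradient can be verified exactly on a model small enough to enumerate. The following sketch assumes a toy energy with $(A, B)=0$ and no biases, so that $\partial E/\partial W = vh^T$, and random parameters; it computes the positive and negative phases by brute force and checks them against a numerical derivative of the log-likelihood:

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
n_v, n_h = 3, 2

# Toy energy with (A, B) = 0 and no biases: E(v, h) = v^T W h,
# so dE/dW = outer(v, h). The W values are arbitrary.
W = rng.normal(scale=0.1, size=(n_v, n_h))

states_v = [np.array(s) for s in itertools.product([0, 1], repeat=n_v)]
states_h = [np.array(s) for s in itertools.product([0, 1], repeat=n_h)]

def energy(v, h, Wm):
    return v @ Wm @ h

def p_joint(v, h, Wm):
    Z = sum(np.exp(-energy(vv, hh, Wm)) for vv in states_v for hh in states_h)
    return np.exp(-energy(v, h, Wm)) / Z

v_data = np.array([1, 0, 1])   # a single "observed" visible vector

# Positive phase: E_{p(h|v)}[dE/dW] with visible units clamped to the data
p_v = sum(p_joint(v_data, h, W) for h in states_h)
pos = sum((p_joint(v_data, h, W) / p_v) * np.outer(v_data, h)
          for h in states_h)

# Negative phase: E_{p(v,h)}[dE/dW] under the model's joint distribution
neg = sum(p_joint(v, h, W) * np.outer(v, h)
          for v in states_v for h in states_h)

grad_W = -pos + neg   # gradient of the log-likelihood l(v) w.r.t. W

# Numerical check against a centered finite difference of l(v)
def log_lik(Wm):
    Z = sum(np.exp(-energy(v, h, Wm)) for v in states_v for h in states_h)
    return np.log(sum(np.exp(-energy(v_data, h, Wm)) for h in states_h) / Z)

eps = 1e-6
grad_numeric = np.zeros_like(W)
for i in range(n_v):
    for j in range(n_h):
        Wp, Wm = W.copy(), W.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        grad_numeric[i, j] = (log_lik(Wp) - log_lik(Wm)) / (2 * eps)
```

In a real restricted Boltzmann machine the positive phase is computed from data-driven samples and the negative phase from MCMC chains, but the two-term structure is exactly the one derived above.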

## Introduction

When constructing Artificial Neural Network (ANN) models, one of the primary considerations is choosing activation functions for hidden and output layers that are differentiable. This is because calculating the backpropagated error signal that is used to determine ANN parameter updates requires the gradient of the activation function. Three of the most commonly-used activation functions in ANNs are the identity function, the logistic sigmoid function, and the hyperbolic tangent function. Examples of these functions and their associated gradients (derivatives in 1D) are plotted in Figure 1.

Figure 1: Common activation functions used in artificial neural networks, along with their derivatives.

In the remainder of this post, we derive the derivatives/gradients for each of these common activation functions.

## The Identity Activation Function

The simplest activation function, one that is commonly used for the output layer activation function in regression problems,  is the identity/linear activation function:

$\Large{ \begin{array}{rcl}g_{\text{linear}}(z) = z \end{array}}$

(Figure 1, red curves). This activation function simply maps the pre-activation to itself and can output values that range $(-\infty, \infty)$. Why would one want to use an identity activation function? After all, a multi-layered network with linear activations at each layer can be equally-formulated as a single-layered linear network. It turns out that the identity activation function is surprisingly useful. For example, a multi-layer network that has nonlinear activation functions amongst the hidden units and an output layer that uses the identity activation function implements a powerful form of nonlinear regression. Specifically, the network can predict continuous target values using a linear combination of signals that arise from one or more layers of nonlinear transformations of the input.

The derivative of $g_{\text{linear}}$, $g'_{\text{linear}}$, is simply 1, in the case of 1D inputs. For vector inputs of length $D$ the gradient is $\vec{1}^{1 \times D}$, a vector of ones of length $D$.

## The Logistic Sigmoid Activation Function

Another function that is often used as the output activation function for binary classification problems (i.e. outputs values that range $(0, 1)$) is the logistic sigmoid. The logistic sigmoid has the following form:

$\Large{\begin{array}{rcl} g_{\text{logistic}}(z) = \frac{1}{1 + e^{-z}}\end{array}}$

(Figure 1, blue curves) and outputs values that range $(0, 1)$. The logistic sigmoid is motivated somewhat by biological neurons and can be interpreted as the probability of an artificial neuron “firing” given its inputs. (It turns out that the logistic sigmoid can also be derived as the maximum likelihood solution for logistic regression in statistics.) Calculating the derivative of the logistic sigmoid function makes use of the quotient rule and a clever trick that both adds and subtracts a one from the numerator:

$\Large{\begin{array}{rcl} g'_{\text{logistic}}(z) &=& \frac{\partial}{\partial z} \left ( \frac{1}{1 + e^{-z}}\right ) \\ &=& \frac{e^{-z}}{(1 + e^{-z})^2} \text{(chain rule)} \\ &=& \frac{1 + e^{-z} - 1}{(1 + e^{-z})^2} \\ &=& \frac{1 + e^{-z}}{(1 + e^{-z})^2} - \left ( \frac{1}{1+e^{-z}} \right )^2 \\ &=& \frac{1}{(1 + e^{-z})} - \left ( \frac{1}{1+e^{-z}} \right )^2 \\ &=& g_{\text{logistic}}(z)- g_{\text{logistic}}(z)^2 \\ &=& g_{\text{logistic}}(z)(1 - g_{\text{logistic}}(z)) \end{array}}$

Here we see that $g'_{\text{logistic}}(z)$ evaluated at $z$ is simply $g_{\text{logistic}}(z)$ weighted by 1-minus-$g_{\text{logistic}}(z)$. This turns out to be a convenient form for efficiently calculating gradients used in neural networks: if one keeps in memory the feed-forward activations of the logistic function for a given layer, the gradients for that layer can be evaluated using simple multiplication and subtraction rather than re-evaluating the sigmoid function, which requires extra exponentiation.
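The caching trick can be illustrated in a few lines of NumPy (a sketch; the function names are ours, not a library API):

```python
import numpy as np

def logistic(z):
    """Logistic sigmoid: 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def logistic_grad_from_activation(a):
    """Derivative computed from the cached activation a = logistic(z)."""
    return a * (1.0 - a)

z = np.linspace(-5.0, 5.0, 11)
a = logistic(z)                                  # stored during the forward pass
grad_cached = logistic_grad_from_activation(a)   # no extra exponentiation

# Sanity check against a centered finite difference
eps = 1e-6
grad_numeric = (logistic(z + eps) - logistic(z - eps)) / (2 * eps)
```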

## The Hyperbolic Tangent Activation Function

Though the logistic sigmoid has a nice biological interpretation, it turns out that the logistic sigmoid can cause a neural network to get “stuck” during training. This is due in part to the fact that if a strongly-negative input is provided to the logistic sigmoid, it outputs values very near zero. Since neural networks use the feed-forward activations to calculate parameter gradients (again, see this previous post for details), this can result in model parameters that are updated less regularly than we would like, and are thus “stuck” in their current state.

An alternative to the logistic sigmoid is the hyperbolic tangent, or tanh function (Figure 1, green curves):

$\Large{\begin{array}{rcl} g_{\text{tanh}}(z) &=& \frac{\text{sinh}(z)}{\text{cosh}(z)} \\ &=& \frac{\mathrm{e}^z - \mathrm{e}^{-z}}{\mathrm{e}^z + \mathrm{e}^{-z}}\end{array}}$.

Like the logistic sigmoid, the tanh function is also sigmoidal (“s”-shaped), but instead outputs values that range $(-1, 1)$. Thus strongly negative inputs to the tanh will map to negative outputs. Additionally, only zero-valued inputs are mapped to near-zero outputs. These properties make the network less likely to get “stuck” during training. Calculating the gradient for the tanh function also uses the quotient rule:

$\Large{\begin{array}{rcl} g'_{\text{tanh}}(z) &=& \frac{\partial}{\partial z} \frac{\text{sinh}(z)}{\text{cosh}(z)} \\ &=& \frac{\frac{\partial}{\partial z} \text{sinh}(z) \times \text{cosh}(z) - \frac{\partial}{\partial z} \text{cosh}(z) \times \text{sinh}(z)}{\text{cosh}^2(z)} \\ &=& \frac{\text{cosh}^2(z) - \text{sinh}^2(z)}{\text{cosh}^2(z)} \\ &=& 1 - \frac{\text{sinh}^2(z)}{\text{cosh}^2(z)} \\ &=& 1 - \text{tanh}^2(z)\end{array}}$

Similar to the derivative for the logistic sigmoid, the derivative of $g_{\text{tanh}}(z)$ is a function of the feed-forward activation evaluated at $z$, namely $(1-g_{\text{tanh}}(z)^2)$. Thus the same caching trick can be used for layers that implement tanh activation functions.
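The same caching pattern for tanh, again as a NumPy sketch:

```python
import numpy as np

def tanh_grad_from_activation(a):
    """Derivative of tanh computed from the cached activation a = tanh(z)."""
    return 1.0 - a ** 2

z = np.linspace(-3.0, 3.0, 13)
a = np.tanh(z)                             # stored during the forward pass
grad_cached = tanh_grad_from_activation(a)

# Sanity check against a centered finite difference
eps = 1e-6
grad_numeric = (np.tanh(z + eps) - np.tanh(z - eps)) / (2 * eps)
```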

## Wrapping Up

In this post we reviewed a few commonly-used activation functions in neural network literature and their derivative calculations. These activation functions are motivated by biology and/or provide some handy implementation tricks like calculating derivatives using cached feed-forward activation values. Note that there are also many other options for activation functions not covered here: e.g. rectification, soft rectification, polynomial kernels, etc. Indeed, finding and evaluating novel activation functions is an active subfield of machine learning research. However, the three basic activations covered here can be used to solve a majority of the machine learning problems one will likely face.

## Introduction

Artificial neural networks (ANNs) are a powerful class of models used for nonlinear regression and classification tasks that are motivated by biological neural computation. The general idea behind ANNs is pretty straightforward: map some input onto a desired target value using a distributed cascade of nonlinear transformations (see Figure 1). However, for many, myself included, the learning algorithm used to train ANNs can be difficult to get your head around at first. In this post I give a step-by-step walk-through of the derivation of the gradient descent learning algorithm commonly used to train ANNs (aka the backpropagation algorithm) and try to provide some high-level insights into the computations being performed during learning.

Figure 1: Diagram of an artificial neural network with one hidden layer

### Some Background and Notation

An ANN consists of an input layer, an output layer, and any number (including zero) of hidden layers situated between the input and output layers. Figure 1 diagrams an ANN with a single hidden layer. The feed-forward computations performed by the ANN are as follows: The signals from the input layer $a_i$ are multiplied by a set of fully-connected weights $w_{ij}$ connecting the input layer to the hidden layer. These weighted signals are then summed and combined with a bias $b_j$ (not displayed in the graphical model in Figure 1). This calculation forms the pre-activation signal $z_j = b_j + \sum_i a_i w_{ij}$ for the hidden layer. The pre-activation signal is then transformed by the hidden layer activation function $g_j$ to form the feed-forward activation signals leaving the hidden layer $a_j$. In a similar fashion, the hidden layer activation signals $a_j$ are multiplied by the weights connecting the hidden layer to the output layer $w_{jk}$, a bias $b_k$ is added, and the resulting signal is transformed by the output activation function $g_k$ to form the network output $a_k$. The output is then compared to a desired target $t_k$ and the error between the two is calculated.

Training a neural network involves determining the set of parameters $\theta = \{\mathbf{W},\mathbf{b}\}$ that minimize the errors that the network makes. A common choice for the error function is the sum of the squared differences between the target values $t_k$ and the network output $a_k$:

$\Large{\begin{array}{rcl} E &=& \frac{1}{2} \sum_{k \in K}(a_k - t_k)^2 \end{array}}$

Equation (1)

This problem can be solved using gradient descent, which requires determining $\frac{\partial E}{\partial \theta}$ for all $\theta$ in the model. Note that, in general, there are two sets of parameters: those parameters that are associated with the output layer (i.e. $\theta_k = \{w_{jk}, b_k\}$), and thus directly affect the network output error; and the remaining parameters that are associated with the hidden layer(s), and thus affect the output error indirectly.

Before we begin, let’s define the notation that will be used in remainder of the derivation. Please refer to Figure 1 for any clarification.

• ${z_j}$: input to node $j$ for layer $l$
• ${g_j}$: activation function for node $j$ in layer $l$ (applied to ${z_j}$)
• $a_j=g_j(z_j)$: output/activation of node $j$ in layer $l$
• ${w_{ij}}$: weights connecting node $i$ in layer $(l-1)$ to node $j$ in layer $l$
• ${b_{j}}$: bias for unit $j$ in layer $l$
• ${t_{k}}$: target value for node $k$ in the output layer

## Gradients for Output Layer Weights

### Output layer connection weights, $w_{jk}$

Since the output layer parameters directly affect the value of the error function, determining the gradients for those parameters is fairly straightforward:

$\Large{\begin{array}{rcl} \frac{\partial E }{\partial w_{jk}} &=& \frac{\partial}{\partial w_{jk}} \frac{1}{2} \sum_{k \in K}(a_k - t_k)^2 \\ &=& (a_k - t_k)\frac{\partial}{\partial w_{jk}}(a_k - t_k) \end{array}}$

Equation (2)

Here, we’ve used the Chain Rule. (Also notice that the summation disappears in the derivative. This is because when we take the partial derivative with respect to $w_{jk}$, the only term in the sum over $K$ that survives is the $k$-th, and thus we can ignore the remaining terms in the summation.) The derivative with respect to $t_k$ is zero because it does not depend on $w_{jk}$. Also, we note that $a_k = g(z_k)$. Thus

$\Large{\begin{array}{rcl}\frac{\partial E }{\partial w_{jk}} &=& (a_k - t_k)\frac{\partial}{\partial w_{jk}}a_k \\ &=& (a_k - t_k)\frac{\partial}{\partial w_{jk}}g_k(z_k) \\ &=& (a_k - t_k)g_k'(z_k)\frac{\partial}{\partial w_{jk}}z_k, \end{array}}$

Equation (3)

where, again, we use the Chain Rule. Now, recall that $z_k = b_k + \sum_j g_j(z_j)w_{jk}$ and thus $\frac{\partial z_{k}}{\partial w_{jk}} = g_j(z_j) = a_j$, giving:

$\Large{\begin{array}{rcl} \frac{\partial E }{\partial w_{jk}} &=& (a_k - t_k)g_k'(z_k)a_j \end{array}}$

Equation (4)

The gradient of the error function with respect to the output layer weights is a product of three terms. The first term is the difference between the network output $a_k$ and the target value $t_k$. The second term is the derivative of the output layer activation function. And the third term is the activation output of node $j$ in the hidden layer.

If we define $\delta_k$ to be all the terms that involve index $k$:

$\Large{\begin{array}{rcl} \delta_k &=& (a_k - t_k)g_k'(z_k)\end{array}}$

we obtain the following expression for the derivative of the error with respect to the output weights $w_{jk}$:

$\Large{\begin{array}{rcl} \frac{\partial E }{\partial w_{jk}} = \delta_k a_j \end{array}}$

Equation (5)

Here the $\delta_k$ terms can be interpreted as the network output error after being back-propagated through the output activation function, thus creating an error “signal”. Loosely speaking, Equation (5) can be interpreted as determining how much each $w_{jk}$ contributes to the error signal by weighting the error signal by the magnitude of the output activation from the previous (hidden) layer associated with each weight (see Figure 1). The gradients with respect to each parameter are thus considered to be the “contribution” of the parameter to the error signal and should be negated during learning. Thus the output weights are updated as $w_{jk}\leftarrow w_{jk} - \eta \frac{\partial E }{\partial w_{jk}}$, where $\eta$ is some step size (“learning rate”) along the negative gradient.
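Equations (4), (5) and the update rule can be sketched in NumPy (toy sizes, a logistic output activation assumed for concreteness, with a finite-difference sanity check of the analytic gradient):

```python
import numpy as np

rng = np.random.default_rng(0)
n_hidden, n_out = 4, 3

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

a_j = rng.random(n_hidden)                 # hidden-layer activations (given)
w_jk = rng.normal(size=(n_hidden, n_out))  # hidden-to-output weights
b_k = rng.normal(size=n_out)               # output biases
t_k = rng.random(n_out)                    # target values

z_k = b_k + a_j @ w_jk                     # output pre-activation
a_k = logistic(z_k)                        # output activation g_k(z_k)

delta_k = (a_k - t_k) * a_k * (1 - a_k)    # error signal; g'_k = a_k(1 - a_k)
grad_w_jk = np.outer(a_j, delta_k)         # dE/dw_jk = delta_k * a_j, Eq. (5)

# Finite-difference check of the analytic gradient
def error(W):
    return 0.5 * np.sum((logistic(b_k + a_j @ W) - t_k) ** 2)

eps = 1e-6
grad_numeric = np.zeros_like(w_jk)
for j in range(n_hidden):
    for k in range(n_out):
        Wp, Wm = w_jk.copy(), w_jk.copy()
        Wp[j, k] += eps
        Wm[j, k] -= eps
        grad_numeric[j, k] = (error(Wp) - error(Wm)) / (2 * eps)

eta = 0.1
w_jk = w_jk - eta * grad_w_jk              # gradient-descent update
```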

As we’ll see shortly, the process of backpropagating the error signal can iterate all the way back to the input layer by successively projecting $\delta_k$ back through $w_{jk}$, then through the activation function for the hidden layer via $g'_j$ to give the error signal $\delta_j$, and so on. This backpropagation concept is central to training neural networks with more than one layer.

### Output layer biases, $\Large{b_{k}}$

For the gradient with respect to the output layer biases, we follow the same routine as above for $w_{jk}$. However, the third term in Equation (3) becomes $\frac{\partial}{\partial b_k} z_k = \frac{\partial}{\partial b_k} \left[ b_k + \sum_j g_j(z_j) w_{jk}\right] = 1$, giving the following gradient for the output biases:

$\Large{\begin{array}{rcl} \frac{\partial E }{\partial b_k} &=& (a_k - t_k)g_k'(z_k)(1) \\ &=& \delta_k \end{array}}$

Equation (6)

Thus the gradient for the biases is simply the back-propagated error from the output units. One interpretation of this is that the biases are weights on activations that are always equal to one, regardless of the feed-forward signal. Thus the bias gradients aren’t affected by the feed-forward signal, only by the error.

## Gradients for Hidden Layer Weights

Due to the indirect effect of the hidden layer on the output error, calculating the gradients for the hidden layer weights $w_{ij}$ is somewhat more involved. However, the process starts just the same:

$\Large{\begin{array}{rcl} \frac{\partial E }{\partial w_{ij}}&=& \frac{\partial}{\partial w_{ij}} \frac{1}{2} \sum_{k \in K}(a_k - t_k)^2 \\ &=& \sum_{k \in K} (a_k - t_k) \frac{\partial}{\partial w_{ij}}a_k \end{array}}$

Notice here that the sum does not disappear: because the layers are fully connected, each of the hidden unit outputs affects the state of each output unit. Continuing on, and noting that $a_k = g_k(z_k)$:

$\Large{\begin{array}{rcl} \frac{\partial E }{\partial w_{ij}}&=& \sum_{k \in K} (a_k - t_k) \frac{\partial }{\partial w_{ij}}g_k(z_k) \\ &=& \sum_{k \in K} (a_k - t_k)g'_k(z_k)\frac{\partial }{\partial w_{ij}}z_k \end{array}}$

Equation (7)

Here, again, we use the Chain Rule. Ok, now here’s where things get “slightly more involved”. Notice that the partial derivative in the third term in Equation (7) is with respect to $w_{ij}$, but $z_k$ is indexed by $k$, not $j$. How the heck do we deal with that!? Well, if we expand $z_k$, we find that it is composed of other sub functions (also see Figure 1):

$\Large{\begin{array}{rcl} z_k &=& b_k + \sum_j a_jw_{jk} \\ &=& b_k + \sum_j g_j(z_j)w_{jk} \\ &=& b_k + \sum_j g_j(b_j + \sum_i a_i w_{ij})w_{jk}\end{array}}$

Equation (8)

From the last term in Equation (8) we see that $z_k$ is indirectly dependent on $w_{ij}$.  Equation (8) also suggests that we can use the Chain Rule to calculate $\frac{\partial z_k }{\partial w_{ij}}$. This is probably the trickiest part of the derivation, and goes like…

$\Large{\begin{array}{rcl} \frac{\partial z_k }{\partial w_{ij}} &=& \frac{\partial z_k}{\partial a_j}\frac{\partial a_j}{\partial w_{ij}} \\ &=& \frac{\partial}{\partial a_j}a_jw_{jk}\frac{\partial a_j}{\partial w_{ij}} \\ &=& w_{jk}\frac{\partial a_j}{\partial w_{ij}} \\ &=& w_{jk}\frac{\partial g_j(z_j)}{\partial w_{ij}} \\ &=& w_{jk}g_j'(z_j)\frac{\partial z_j}{\partial w_{ij}} \\ &=& w_{jk}g_j'(z_j)\frac{\partial}{\partial w_{ij}}(b_j + \sum_i a_i w_{ij}) \\ &=& w_{jk}g_j'(z_j)a_i \end{array}}$

Equation (9)

Now, plugging Equation (9) into $\frac{\partial z_k}{\partial w_{ij}}$ in Equation (7) gives the following for $\frac{\partial E}{\partial w_{ij}}$:

$\Large{\begin{array}{rcl} \frac{\partial E }{\partial w_{ij}}&=& \sum_{k \in K} (a_k - t_k)g'_k(z_k)w_{jk} g'_j(z_j)a_i \\ &=& g'_j(z_j)a_i \sum_{k \in K} (a_k - t_k)g'_k(z_k)w_{jk} \\ &=& a_i g'_j(z_j) \sum_{k \in K} \delta_k w_{jk} \end{array}}$

Equation (10)

Notice that the gradient for the hidden layer weights has a similar form to that of the gradient for the output layer weights. Namely the gradient is some term weighted by the output activations from the layer below ($a_i$). For the output weight gradients, the term that was weighted by $a_j$ was the back-propagated error signal $\delta_k$ (i.e. Equation (5)). Here, the weighted term includes $\delta_k$, but the error signal is further projected onto $w_{jk}$ and then weighted by the derivative of the hidden layer activation function $g'_j$. Thus, the gradient for the hidden layer weights is simply the output error signal backpropagated to the hidden layer, then weighted by the input to the hidden layer. To make this idea more explicit, we can define the resulting error signal backpropagated to layer $j$ as $\delta_j$, which includes all terms in Equation (10) that involve index $j$. This definition results in the following gradient for the hidden unit weights:

$\Large{\begin{array}{rcl} \frac{\partial E }{\partial w_{ij}}&=& a_i g'_j(z_j) \sum_{k \in K} \delta_k w_{jk} \\ &=& \delta_j a_i \\ \text{where} \\ \delta_j &=& g'_j(z_j) \sum_{k \in K} \delta_k w_{jk} \end{array}}$

Equation (11)

This suggests that in order to calculate the weight gradients at any layer $l$ in an arbitrarily-deep neural network, we simply need to calculate the backpropagated error signal that reaches that layer, $\delta_l$, and weight it by the feed-forward signal $a_{l-1}$ feeding into that layer! Analogously, the gradient for the hidden layer weights can be interpreted as a proxy for the “contribution” of the weights to the output error signal, which can only be observed (from the point of view of the weights) by backpropagating the error signal to the hidden layer.

### Hidden layer biases, $\Large{b_{j}}$

Calculating the gradients for the hidden layer biases follows a very similar procedure to that for the hidden layer weights where, as in Equation (9), we use the Chain Rule to calculate $\frac{\partial z_k}{\partial b_j}$. However, unlike Equation (9), the third term that results for the biases is slightly different:

$\Large{\begin{array}{rcl} \frac{\partial z_k }{\partial b_j} &=& w_{jk}g_j'(z_j)\frac{\partial z_j}{\partial b_j} \\ &=& w_{jk}g_j'(z_j)\frac{\partial}{\partial b_j}(b_j + \sum_i a_i w_{ij}) \\ &=& w_{jk}g_j'(z_j)(1), \\ \text{giving} \\ \frac{\partial E }{\partial b_j}&=& g'_j(z_j) \sum_{k \in K} \delta_k w_{jk} \\ &=& \delta_j \end{array}}$

Equation (12)

In a similar fashion to calculation of the bias gradients for the output layer, the gradients for the hidden layer biases are simply the backpropagated error signal reaching that layer. This suggests that we can also calculate the bias gradients at any layer $l$ in an arbitrarily-deep network by simply calculating the backpropagated error signal reaching that layer $\delta_l$!

## Wrapping up

In this post we went over some of the formal details of the backpropagation learning algorithm. The math covered in this post allows us to train arbitrarily deep neural networks by re-applying the same basic computations. Those computations are:

1. Calculate the feed-forward signals from the input to the output.
2. Calculate the output error $E$ based on the predictions $a_k$ and the target $t_k$.
3. Backpropagate the error signals by weighting them by the weights in previous layers and the gradients of the associated activation functions.
4. Calculate the gradients $\frac{\partial E}{\partial \theta}$ for the parameters based on the backpropagated error signals and the feed-forward signals from the inputs.
5. Update the parameters using the calculated gradients: $\theta \leftarrow \theta - \eta\frac{\partial E}{\partial \theta}$.
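The five steps above can be sketched as a minimal NumPy implementation for the one-hidden-layer network of Figure 1 (logistic hidden units, identity output for regression; the data, layer sizes, and learning rate are arbitrary toy choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, n_out = 2, 5, 1

# Toy regression data: rows are observations
X = rng.random((100, n_in))
T = (np.sin(3 * X[:, :1]) + X[:, 1:]) / 2   # continuous targets

W1 = rng.normal(scale=0.5, size=(n_in, n_hid)); b1 = np.zeros(n_hid)
W2 = rng.normal(scale=0.5, size=(n_hid, n_out)); b2 = np.zeros(n_out)

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def mean_error(W1, b1, W2, b2):
    return 0.5 * np.mean((logistic(X @ W1 + b1) @ W2 + b2 - T) ** 2)

error_before = mean_error(W1, b1, W2, b2)
eta = 0.5
for epoch in range(2000):
    # 1. Feed-forward signals
    a1 = logistic(X @ W1 + b1)                   # hidden activations
    a2 = a1 @ W2 + b2                            # identity output activation
    # 2-3. Output error and backpropagated error signals
    delta_k = a2 - T                             # g'_linear = 1
    delta_j = (delta_k @ W2.T) * a1 * (1 - a1)   # Equation (11)
    # 4. Gradients = error signals x feed-forward signals (averaged)
    gW2 = a1.T @ delta_k / len(X); gb2 = delta_k.mean(axis=0)
    gW1 = X.T @ delta_j / len(X);  gb1 = delta_j.mean(axis=0)
    # 5. Parameter updates along the negative gradient
    W2 -= eta * gW2; b2 -= eta * gb2
    W1 -= eta * gW1; b1 -= eta * gb1

error_after = mean_error(W1, b1, W2, b2)
```

Stacking more hidden layers only repeats step 3: each layer's $\delta$ is the next layer's $\delta$ projected back through the weights and scaled by the local activation derivative.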

The only real constraint on model construction is ensuring that the error function $E$ and the activation functions $g_l$ are differentiable. For more details on implementing ANNs and seeing them at work, stay tuned for the next post.

## The Statistical Whitening Transform

In a number of modeling scenarios, it is beneficial to transform the to-be-modeled data such that it has an identity covariance matrix, a procedure known as Statistical Whitening. When data have an identity covariance, all dimensions are uncorrelated (statistically independent, in the Gaussian case), and the variance of the data along each of the dimensions is equal to one. (To get a better idea of what an identity covariance entails, see the following post.)

Enforcing statistical independence is useful for a number of reasons. For example, in probabilistic models of data that exist in multiple dimensions, the joint distribution–which may be very complex and difficult to characterize–can factorize into a product of many simpler distributions when the dimensions are statistically independent. Forcing all dimensions to have unit variance is also useful. For instance, scaling all variables to have the same variance treats each dimension with equal importance.

In the remainder of this post we derive how to transform data such that it has an identity covariance matrix, give some examples of applying such a transformation to real data, and address some interpretations of statistical whitening in the scope of theoretical neuroscience.

## Decorrelation: Transforming Data to Have a Diagonal Covariance Matrix

Let’s say we have some data matrix $X$ composed of $K$ dimensions and $n$ observations ($X$ has size $[K \times n]$). Let’s also assume that the rows of $X$ have been centered (the mean has been subtracted across all observations). The covariance $\Sigma$ of each of the dimensions with respect to the others is

$\Sigma = Cov(X) = \mathbb E[X X^T]$                                                                                        (1)

where the expectation $\mathbb E[X X^T]$ can be estimated from the data matrix as follows:

$\mathbb E[X X^T] \approx \frac{X X^T}{n}$                                                                                            (2)
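As a quick sanity check, Equation (2) can be computed directly in NumPy and compared against the library's biased covariance estimator (toy data; `bias=True` makes `np.cov` divide by $n$ rather than $n-1$, matching Equation (2)):

```python
import numpy as np

rng = np.random.default_rng(0)
K, n = 3, 5000

# Correlated toy data: K dimensions (rows) by n observations (columns)
X = rng.normal(size=(K, K)) @ rng.normal(size=(K, n))
X = X - X.mean(axis=1, keepdims=True)   # center each dimension

Sigma = (X @ X.T) / n                   # Equation (2)
```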

The covariance matrix $\Sigma$ is, by definition (Equation 1), symmetric and positive semi-definite (if you don’t know what that means, don’t worry, it’s not terribly important for this discussion). Thus we can factor the matrix in terms of two simpler matrices $E$ and $D$, using a procedure known as Eigenvalue Decomposition:

$\Sigma = EDE^{-1}$                                                                                                 (3)

The matrix $E$ is a $[K \times K]$-sized matrix whose columns are the eigenvectors of $\Sigma$, and $D$ is a diagonal matrix whose diagonal entry $D_{ii}$ is the eigenvalue corresponding to the eigenvector in the $i$-th column of $E$.  For more details on eigenvectors and eigenvalues see the following. From Equation (3), and using a little algebra, we can transform $\Sigma$ into the diagonal matrix $D$
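This decomposition is easy to check numerically. Below is a NumPy sketch (simulated data; all names are illustrative) that estimates the covariance as in Equation (2), decomposes it, and verifies Equations (3) and (4):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate centered data: K = 2 dimensions, n = 10000 observations
K, n = 2, 10_000
X = np.array([[1.0, 0.0], [0.9, 1.0]]) @ rng.standard_normal((K, n))
X -= X.mean(axis=1, keepdims=True)   # center each dimension

# Estimate the covariance as in Equation (2)
Sigma = X @ X.T / n

# Eigenvalue decomposition (Equation 3); eigh exploits symmetry
evals, E = np.linalg.eigh(Sigma)
D = np.diag(evals)

assert np.allclose(E @ D @ E.T, Sigma)   # Equation (3), with E^{-1} = E^T
assert np.allclose(E.T @ Sigma @ E, D)   # Equation (4)
```

Because $\Sigma$ is symmetric, `np.linalg.eigh` returns an orthonormal eigenvector matrix, so $E^{-1} = E^T$ here.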

$E^{-1} \Sigma E = D$                                                                                                 (4)

Now, imagine the goal is to transform the data matrix $X$ into a new data matrix $Y$

$Y = W_DX$                                                                                                   (5)

whose dimensions are uncorrelated (i.e. $Y$ has a diagonal covariance $D$). Thus we want to determine the transformation $W_D$ that makes:

$D = Cov(Y) = \mathbb E[YY^T]$                                                                                   (6)

Here we derive the expression for $W_D$ using Equations (2), (4), (5), and (6):

$D = \frac{W_DX(W_DX)^T}{n}$                                                       (a la Equations (5) and (6))

$D = W_D \left(\frac{XX^T}{n}\right) W_D^T = W_D \Sigma W_D^T$                                                                       (via Equation (2))

$E^{-1}\Sigma E = W_D \Sigma W_D^T$                                                                   (via Equation (4))

now, because $\Sigma$ is symmetric, its eigenvector matrix is orthogonal: $E^{-1} = E^T$                                             (see following link for details)

$E^T\Sigma E = W_D \Sigma W_D^T$, which is satisfied by taking

$W_D = E^T$                                                                                                   (7)

This means that we can transform $X$ into an uncorrelated (i.e. orthogonal) set of variables by premultiplying the data matrix $X$ with the transpose of the eigenvectors of the data covariance matrix $\Sigma$.
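As a sanity check of Equation (7), the sketch below (NumPy, simulated data; names are illustrative) applies $W_D = E^T$ and confirms that the transformed data have a diagonal covariance:

```python
import numpy as np

rng = np.random.default_rng(1)

# Correlated, centered 2-D data (rows are dimensions, as in the text)
X = np.array([[1.0, 0.0], [0.9, 1.0]]) @ rng.standard_normal((2, 5000))
X -= X.mean(axis=1, keepdims=True)
Sigma = X @ X.T / X.shape[1]

# Decorrelating transform W_D = E^T (Equation 7)
evals, E = np.linalg.eigh(Sigma)
W_D = E.T

# Y = W_D X has diagonal covariance D (Equations 5 and 6)
Y = W_D @ X
cov_Y = Y @ Y.T / Y.shape[1]
assert np.allclose(cov_Y, np.diag(evals))
```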

## Whitening: Transforming data to have an Identity Covariance matrix

Ok, so now we have a way of transforming our data so that the dimensions are uncorrelated. However, this only gives us a diagonal covariance matrix, not an Identity covariance matrix. In order to obtain an Identity covariance, we also need to scale each dimension so that its variance is equal to one. How can we determine this transformation? We know how to transform our data so that the covariance is equal to $D$. If we can determine the additional scaling that maps $D$ onto $I$, then composing it with the decorrelating transform gives us the desired whitening transform. We can determine this from the somewhat trivial notion that

$D^{-1}D = I$                                                                                                        (8)

and further that

$D^{-1} = D^{-1/2}ID^{-1/2}$                                                                                             (9)

Now, combining Equation (4) with Equations (8) and (9), we can see that

$D^{-1/2}E^{-1}\Sigma E D^{-1/2} = I$                                                                                      (10)

Now say that we define a variable $Y = W_W X$, where $W_W$ is the desired whitening transform that leaves the covariance of $Y$ equal to the identity matrix. Using essentially the same derivation steps as above for $W_D$, but starting from Equation (10), we find that

$W_W = D^{-1/2}E^T$                                                                                                  (11)

$= D^{-1/2}W_D$                                                                                                 (12)

Thus, the whitening transform is simply the decorrelation transform, but scaled by the inverse square root of $D$ (here the inverse and square root can be performed element-wise along the diagonal because $D$ is a diagonal matrix).
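Putting Equations (11) and (12) to work, a NumPy sketch (simulated data; purely illustrative) confirms that $W_W = D^{-1/2}E^T$ yields an identity covariance:

```python
import numpy as np

rng = np.random.default_rng(2)

# Correlated, centered 2-D data
X = np.array([[1.0, 0.0], [0.9, 1.0]]) @ rng.standard_normal((2, 5000))
X -= X.mean(axis=1, keepdims=True)
Sigma = X @ X.T / X.shape[1]

# Whitening transform W_W = D^{-1/2} E^T (Equations 11 and 12)
evals, E = np.linalg.eigh(Sigma)
W_W = np.diag(evals ** -0.5) @ E.T

# The whitened data have identity covariance
Y = W_W @ X
assert np.allclose(Y @ Y.T / Y.shape[1], np.eye(2))
```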

## Interpretation of the Whitening Transform

So what does the whitening transformation actually do to the data (below, blue points)? We investigate this transformation below: The first operation decorrelates the data by premultiplying the data with the eigenvector matrix $E^T$, calculated from the data covariance. This decorrelation can be thought of as a rotation that reorients the data so that the principal axes of the data are aligned with the axes along which the data has the largest (orthogonal) variance. This rotation is essentially the same procedure as the oft-used Principal Components Analysis (PCA), and is shown in the middle row.

The second operation, scaling by $D^{-1/2}$, can be thought of as squeezing the data–if the variance along a dimension is larger than one–or stretching the data–if the variance along a dimension is less than one. The stretching and squeezing forms the data into a sphere about the origin (which is why whitening is also referred to as “sphering”). This scaling operation is depicted in the bottom row in the plot above.

The MATLAB code to make the plot above is here:

% INITIALIZE SOME CONSTANTS
mu = [0 0];
S = [1 .9; .9 3];

% SAMPLE SOME DATAPOINTS
nSamples = 1000;
samples = mvnrnd(mu,S,nSamples)';

% WHITEN THE DATA POINTS...
[E,D] = eig(S);

% ROTATE THE DATA BY E^T
samplesRotated = E'*samples;

% TAKE D^(-1/2)
D = diag(diag(D).^-.5);

% SCALE DATA BY D^(-1/2)
samplesRotatedScaled = D*samplesRotated;

% DISPLAY
figure;

subplot(311);
plot(samples(1,:),samples(2,:),'b.')
axis square, grid
xlim([-5 5]);ylim([-5 5]);
title('Original Data');

subplot(312);
plot(samplesRotated(1,:),samplesRotated(2,:),'r.'),
axis square, grid
xlim([-5 5]);ylim([-5 5]);
title('Decorrelate: Rotate by E^T');

subplot(313);
plot(samplesRotatedScaled(1,:),samplesRotatedScaled(2,:),'ko')
axis square, grid
xlim([-5 5]);ylim([-5 5]);
title('Whiten: scale by D^{-1/2}');


The transformation in Equation (11), and implemented above, whitens the data but leaves the data aligned with the principal axes of the original data. In order to observe the data in the original space, it is often customary to “un-rotate” the data back into its original space. This is done by premultiplying the whitening transform by the eigenvector matrix $E$, which inverts the rotation performed by $E^T$ (recall that $E^{-1} = E^T$). This gives the whitening transform:

$W =E D^{-1/2}E^T$                                                                                                   (13)
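Since $E$ is orthogonal for a symmetric $\Sigma$, left-multiplying by $E$ undoes the rotation by $E^T$, so the un-rotated transform is $E D^{-1/2} E^T$ (often called ZCA whitening). This can also be checked numerically; a NumPy sketch with simulated data (illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)

# Correlated, centered 2-D data
X = np.array([[1.0, 0.0], [0.9, 1.0]]) @ rng.standard_normal((2, 5000))
X -= X.mean(axis=1, keepdims=True)
Sigma = X @ X.T / X.shape[1]

# Rotate with E^T, scale by D^{-1/2}, rotate back with E
evals, E = np.linalg.eigh(Sigma)
W = E @ np.diag(evals ** -0.5) @ E.T

Y = W @ X
assert np.allclose(Y @ Y.T / Y.shape[1], np.eye(2))  # identity covariance
assert np.allclose(W, W.T)  # the un-rotated transform is symmetric
```

Unlike $W_W$, this transform is symmetric, and the whitened data remain in the orientation of the original space.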

Let’s take a look an example of using statistical whitening for a more complex problem: whitening patches of images sampled from natural scenes.

## Example: Whitening Natural Scene Image Patches

Modeling the local spatial structure of pixels in natural scene images is important in many fields including computer vision and computational neuroscience. An interesting model of natural scenes is one that can account for high-order statistical dependencies between pixels. However, because natural scenes are generally composed of continuous objects or surfaces, the vast majority of the spatial correlations in natural image data can be explained by local pairwise dependencies. For example, observe the image below.

% LOAD AND DISPLAY A NATURAL IMAGE
% (ASSUMES A 256x256 NATURAL IMAGE FILE ON THE PATH; SWAP IN YOUR OWN)
im = double(imread('naturalScene.jpg'));
if size(im,3) > 1, im = mean(im,3); end  % CONVERT TO GRAYSCALE
figure
imagesc(im); colormap gray; axis image; axis off;
title('Base Image')


Given one of the gray pixels in the upper portion of the image, it is very likely that all pixels within the local neighborhood will also be gray. Thus there is a large amount of correlation between pixels in local regions of natural scenes. Statistical models of local structure applied to natural scenes will be dominated by these pairwise correlations, unless they are removed by preprocessing. Whitening provides such a preprocessing procedure.

Below we create and display a dataset of local image patches of size $16 \times 16$ extracted at random from the image above. Each patch is rastered out into a column vector of size $256 \times 1$. Each of these patches can be thought of as a sample of the local structure of this natural scene. Below we use the whitening transformation to remove pairwise correlations between pixels in each patch and to scale the variance of each pixel to one.

On the left is the dataset of extracted image patches, along with the corresponding covariance matrix for the image patches on the right. The large local correlation within the neighborhood of each pixel is indicated by the large bright diagonal regions throughout the covariance matrix.

The MATLAB code to extract and display the patches shown above is here:

% CREATE PATCHES DATASET FROM NATURAL IMAGE
rng(12345)
imSize = 256;
nPatches = 400;  % (MAKE SURE SQUARE)
patchSize = 16;
patches = zeros(patchSize*patchSize,nPatches);
patchIm = zeros(sqrt(nPatches)*patchSize);

% EXTRACT PATCHES (KEEP INDICES IN BOUNDS TO AVOID EDGE EFFECTS)
for iP = 1:nPatches
    pix = ceil(rand(2,1)*(imSize-patchSize));
    rows = pix(1):pix(1)+patchSize-1;
    cols = pix(2):pix(2)+patchSize-1;
    tmp = im(rows,cols);
    patches(:,iP) = reshape(tmp,patchSize*patchSize,1);
    rowIdx = (ceil(iP/sqrt(nPatches)) - 1)*patchSize + ...
             1:ceil(iP/sqrt(nPatches))*patchSize;
    colIdx = (mod(iP-1,sqrt(nPatches)))*patchSize+1:patchSize* ...
             ((mod(iP-1,sqrt(nPatches)))+1);
    patchIm(rowIdx,colIdx) = tmp;
end

% CENTER IMAGE PATCHES
patchesCentered = bsxfun(@minus,patches,mean(patches,2));

% CALCULATE COVARIANCE MATRIX
S = patchesCentered*patchesCentered'/nPatches;

% DISPLAY PATCHES
figure;
subplot(121);
imagesc(patchIm);
axis image; axis off; colormap gray;
title('Extracted Patches')

% DISPLAY COVARIANCE
subplot(122);
imagesc(S);
axis image; axis off; colormap gray;
title('Extracted Patches Covariance')


Below we implement the whitening transformation described above to the extracted image patches and display the whitened patches that result.

On the left, we see that the whitening procedure zeros out all areas in the extracted patches that have the same value (zero is indicated by gray). The whitening procedure also boosts the areas of high contrast (i.e. edges). On the right is the covariance matrix for the whitened patches. The covariance matrix is diagonal, indicating that pixels are now uncorrelated. In addition, all diagonal entries have the same value, indicating that all pixels now have the same variance (i.e. 1). The MATLAB code used to whiten the image patches and create the display above is here:

%% MAIN WHITENING

% DETERMINE EIGENVECTORS & EIGENVALUES
% OF COVARIANCE MATRIX
[E,D] = eig(S);

% CALCULATE D^(-1/2)
% (REAL() GUARDS AGAINST NUMERICALLY TINY NEGATIVE EIGENVALUES)
d = diag(D);
d = real(d.^-.5);
D = diag(d);

% CALCULATE WHITENING TRANSFORM
W = E*D*E';

% WHITEN THE PATCHES
patchesWhitened = W*patchesCentered;

% DISPLAY THE WHITENED PATCHES
wPatchIm = zeros(size(patchIm));
for iP = 1:nPatches
    rowIdx = (ceil(iP/sqrt(nPatches)) - 1)*patchSize + 1:ceil(iP/sqrt(nPatches))*patchSize;
    colIdx = (mod(iP-1,sqrt(nPatches)))*patchSize+1:patchSize* ...
             ((mod(iP-1,sqrt(nPatches)))+1);
    wPatchIm(rowIdx,colIdx) = reshape(patchesWhitened(:,iP),...
                                      [patchSize,patchSize]);
end

figure
subplot(121);
imagesc(wPatchIm);
axis image; axis off; colormap gray; caxis([-5 5]);
title('Whitened Patches')

subplot(122);
imagesc(cov(patchesWhitened'));
axis image; axis off; colormap gray; %colorbar
title('Whitened Patches Covariance');


## Investigating the Whitening Matrix: implications for theoretical neuroscience

So what does the whitening matrix look like, and what does it do? Below is the whitening matrix $W$ calculated for the image patches dataset:

% DISPLAY THE WHITENING MATRIX
figure; imagesc(W);
axis image; colormap gray; colorbar
title('The Whitening Matrix W')


Each column of $W$ defines the operation that scales the variance of the corresponding pixel to one and decorrelates that pixel from the others in the $16 \times 16$ patch. So what exactly does such an operation look like? We can get an idea by reshaping a column of $W$ back into the shape of the image patches. Below we show what the 86th column of $W$ looks like when reshaped in this way (the index 86 has no particular significance; it was chosen at random):

% DISPLAY A COLUMN OF THE WHITENING MATRIX
figure; imagesc(reshape(W(:,86),16,16)),
colormap gray,
axis image, colorbar
title('Column 86 of W')


We see that the operation is essentially an impulse centered on the 86th pixel in the image (counting pixels starting in the upper left corner, proceeding down columns). This impulse is surrounded by inhibitory weights. If we were to look at the remaining columns of $W$, we would find that the same center-surround operation is replicated at every pixel location in each image patch. Essentially, the whitening transformation is performing a convolution of each image patch with a center-surround filter whose properties are estimated from the patches dataset. Similar techniques are common in computer vision edge-detection algorithms.

## Implications for theoretical neuroscience

A theoretical function of the primate retina is data compression: a large number of photoreceptors pass data from the retina into a physiological bottleneck, the optic nerve, which has far fewer fibers than there are retinal photoreceptors. Thus removing redundant information is an important task that the retina must perform. When observing the whitened image patches above, we see that redundant information is nullified; pixels that have similar local values to one another are zeroed out. Thus, statistical whitening is a viable form of data compression.

It turns out that there is a large class of ganglion cells in the retina whose spatial receptive fields exhibit…that’s right, center-surround activation-inhibition like the operation of the whitening matrix shown above! Thus it appears that the primate visual system may be performing data compression at the retina by means of an operation similar to statistical whitening. Above, we derived the center-surround whitening operation from data sampled from a natural scene. Thus it seems reasonable that the primate visual system may have acquired a similar data-compression mechanism through experience with natural scenes, whether over evolution or during development.

## Derivation: The Covariance Matrix of an OLS Estimator (and applications to GLS)

We showed in an earlier post that for the linear regression model

$y = X\beta + \epsilon$,

the optimal Ordinary Least Squares (OLS) estimator for model parameters $\beta$ is

$\hat \beta = (X^TX)^{-1}X^Ty$

However, because the independent variables $X$ and responses $y$ can take on any value, they are both random variables. And, because $\hat \beta$ is a linear combination of $X$ and $y$, it is also a random variable, and therefore has a covariance. The covariance matrix $C_{\hat \beta}$ of the OLS estimator is defined as:

$C_{\hat \beta} = E[(\hat \beta - \beta)(\hat \beta - \beta)^T]$

where $E[\cdot]$ denotes the expected value operator. In order to find an expression for $C_{\hat \beta}$, we first need an expression for $(\hat \beta - \beta)$. The following derives this expression:

$\hat \beta = (X^TX)^{-1}X^T(X\beta + \epsilon)$,

where we use the fact that

$y = X\beta + \epsilon$.

It follows that

$\hat \beta = (X^TX)^{-1}X^TX \beta + (X^TX)^{-1}X^T\epsilon$

$\hat \beta = \beta + (X^TX)^{-1}X^T \epsilon$

and therefore

$(\hat \beta - \beta) = (X^TX)^{-1}X^T \epsilon$

Now following the original definition for $C_{\hat \beta}$

$C_{\hat \beta} = E[(\hat \beta - \beta)(\hat \beta - \beta)^T]$

$= E[(X^TX)^{-1}X^T\epsilon((X^TX)^{-1}X^T \epsilon)^T]$

$= E[(X^TX)^{-1}X^T\epsilon \epsilon^T X(X^TX)^{-1}]$

where we take advantage of $(AB)^T = B^T A^T$ in order to rewrite the second term in the product of the expectation. If we take $X$ to be fixed for a given estimator of $\hat \beta$ (in other words we don’t randomly resample the independent variables), then the expectation only depends on the remaining stochastic/random variable, namely $\epsilon$. Therefore the above expression can be written as

$C_{\hat \beta} = (X^TX)^{-1}X^T E[\epsilon \epsilon^T] X(X^TX)^{-1}$.

where $E[\epsilon \epsilon^T]$ is the covariance of the noise term in the model. Because OLS assumes uncorrelated noise, the noise covariance is equal to $\sigma^2 I$, where $\sigma^2$ is the variance along each dimension, and $I$ is an identity matrix of size equal to the number of dimensions. The expression for the estimator covariance is now:

$C_{\hat \beta} = (X^TX)^{-1}X^T (\sigma^2 I) X(X^TX)^{-1}$,

$= \sigma^2 (X^TX)^{-1} X^T X(X^TX)^{-1}$

which simplifies to

$C_{\hat \beta} = \sigma^2 (X^T X)^{-1}$

A further simplifying assumption often made under OLS is that $\epsilon$ is drawn from a zero-mean multivariate Gaussian distribution with unit variances (i.e. $\sigma^2 = 1$), resulting in a noise covariance equal to the identity. Thus

$C_{\hat \beta} = (X^TX)^{-1}$
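This result lends itself to a Monte Carlo check: holding $X$ fixed and resampling unit-variance noise, the empirical covariance of $\hat \beta$ should approach $(X^TX)^{-1}$. A NumPy sketch (simulated data; all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)

# Fixed design matrix and true parameters; sigma^2 = 1
n, p = 50, 2
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
beta = np.array([2.0, -1.0])

XtX_inv = np.linalg.inv(X.T @ X)
H = XtX_inv @ X.T                 # maps y to beta_hat

# Many datasets: same X, fresh unit-variance noise in each column
trials = 100_000
Y = (X @ beta)[:, None] + rng.standard_normal((n, trials))
beta_hats = H @ Y                 # p x trials matrix of OLS estimates

# Empirical covariance of the estimates matches (X^T X)^{-1}
emp_cov = np.cov(beta_hats)
assert np.allclose(emp_cov, XtX_inv, rtol=0.1, atol=1e-3)
```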

## Applying the derivation results to Generalized Least Squares

Notice that the expression for the OLS estimator covariance is equal to the first inverse term in the expression for the OLS estimator. Identifying the covariance for the OLS estimator in this way gives a helpful heuristic for identifying the covariance of related estimators that do not make the simplifying assumptions about the noise covariance made by OLS. For instance, in Generalized Least Squares (GLS), the noise terms may co-vary, with their covariance represented by a noise covariance matrix $C_{\epsilon}$. This gives the model form

$y = X \beta + \epsilon$,

where $E[\epsilon | X] = 0; Var[\epsilon | X] = C_{\epsilon}$.

In other words, under GLS the noise terms have zero mean and covariance $C_{\epsilon}$. It turns out that the estimator for the GLS model parameters is

$\hat \beta_{GLS} = (X^T C_{\epsilon}^{-1} X)^{-1} X^T C_{\epsilon}^{-1}y$.

Notice the similarity between the GLS and OLS estimators. The only difference is that in GLS, the terms in the solution for the parameters are weighted by the inverse of the noise covariance. And, in a similar fashion to the OLS estimator, the covariance of the GLS estimator is the first term in the product that defines the GLS estimator:

$C_{\hat \beta, GLS} = (X^T C_{\epsilon}^{-1}X)^{-1}$
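A NumPy sketch of the GLS estimator and its covariance, using an AR(1)-style noise covariance chosen purely for illustration (simulated data; names are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(5)

# Model with correlated noise: y = X beta + eps, Cov(eps) = C_eps
n, p = 100, 2
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
beta = np.array([1.0, 3.0])

# AR(1)-style noise covariance: C_eps[i, j] = rho^|i - j|
rho = 0.7
idx = np.arange(n)
C_eps = rho ** np.abs(idx[:, None] - idx[None, :])

# Draw one dataset with correlated noise via a Cholesky factor
L = np.linalg.cholesky(C_eps)
y = X @ beta + L @ rng.standard_normal(n)

# GLS estimate and its covariance (the expressions above)
Ci = np.linalg.inv(C_eps)
beta_gls = np.linalg.solve(X.T @ Ci @ X, X.T @ Ci @ y)
C_beta_gls = np.linalg.inv(X.T @ Ci @ X)

assert beta_gls.shape == (p,)
assert np.allclose(C_beta_gls, C_beta_gls.T)  # a valid covariance is symmetric
```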

## Derivation: Ordinary Least Squares Solution and Normal Equations

In a linear regression framework, we assume some output variable $y$ is a linear combination of some independent input variables $X$ plus some independent noise $\epsilon$. The way the independent variables are combined is defined by a parameter vector $\beta$:

$\Large{\begin{array}{rcl} y &=& X \beta + \epsilon \end{array}}$

We also assume that the noise term $\epsilon$ is drawn from a standard Normal distribution:

$\Large{ \begin{array}{rcl}\epsilon &\sim& N(0,I)\end{array}}$

For some estimate of the model parameters $\hat \beta$, the model’s prediction errors/residuals $e$ are the difference between the model prediction and the observed output values

$\Large{\begin{array}{rcl} e = y - X\hat \beta \end{array}}$

The Ordinary Least Squares (OLS) solution to the problem (i.e. determining an optimal solution for $\hat \beta$) involves minimizing the sum of the squared errors with respect to the model parameters, $\hat \beta$. The sum of squared errors is equal to the inner product of the residuals vector with itself $\sum e_i^2 = e^Te$ :

$\Large{\begin{array}{rcl} e^T e &=& (y - X \hat \beta)^T (y - X \hat \beta) \\ &=& y^Ty - y^T (X \hat \beta) - (X \hat \beta)^T y + (X \hat \beta)^T (X \hat \beta) \\ &=& y^Ty - (X \hat \beta)^T y - (X \hat \beta)^T y + (X \hat \beta)^T (X \hat \beta) \\ &=& y^Ty - 2(X \hat \beta)^T y + (X \hat \beta)^T (X \hat \beta) \\ &=& y^Ty - 2\hat \beta^T X^T y + \hat \beta^T X^T X \hat \beta \\ \end{array}}$

To determine the parameters, $\hat \beta$, we minimize the sum of squared residuals with respect to the parameters.

$\Large{\begin{array}{rcl} \frac{\partial}{\partial \hat \beta} \left[ e^T e \right] &=& -2X^Ty + 2X^TX \hat \beta \end{array}}$

due to the identities $\frac{\partial \mathbf{a}^T \mathbf{b}}{\partial \mathbf{a}} = \mathbf{b}$ and $\frac{\partial \mathbf{a}^T B \mathbf{a}}{\partial \mathbf{a}} = 2B\mathbf{a}$ (for symmetric $B$), for vectors $\mathbf{a}, \mathbf{b}$ and matrix $B$. Setting this gradient to zero gives

$\Large{\begin{array}{rcl} X^Ty &=& X^TX \hat \beta \end{array}}$

This relationship is the matrix form of the Normal Equations. Solving for $\hat \beta$ gives the analytical solution to the Ordinary Least Squares problem:

$\Large{\begin{array}{rcl} \hat \beta &=& (X^TX)^{-1}X^Ty \end{array}}$
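As a quick check, the normal equations can be solved directly and compared against a library least-squares routine; a NumPy sketch with simulated data (illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)

# Simulated regression problem
n, p = 200, 3
X = rng.standard_normal((n, p))
beta_true = np.array([0.5, -2.0, 1.0])
y = X @ beta_true + rng.standard_normal(n)

# Solve the normal equations X^T y = X^T X beta_hat
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Agrees with the library least-squares solver
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(beta_hat, beta_lstsq)
```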

Boom.