Blog Archives

Derivation: Error Backpropagation & Gradient Descent for Neural Networks


Artificial neural networks (ANNs) are a powerful class of models used for nonlinear regression and classification tasks that are motivated by biological neural computation. The general idea behind ANNs is pretty straightforward: map some input onto a desired target value using a distributed cascade of nonlinear transformations (see Figure 1). However, for many, myself included, the learning algorithm used to train ANNs can be difficult to get your head around at first. In this post I give a step-by-step walk-through of the derivation of gradient descent learning algorithm commonly used to train ANNs (aka the backpropagation algorithm) and try to provide some high-level insights into the computations being performed during learning.

Artificial  Neural Network

Figure 1: Diagram of an artificial neural network with one hidden layer


Some Background and Notation

An ANN consists of an input layer, an output layer, and any number (including zero) of hidden layers situated between the input and output layers. Figure 1 diagrams an ANN with a single hidden layer. The feed-forward computations performed by the ANN are as follows: The signals from the input layer a_i are multiplied by a set of fully-connected weights w_{ij} connecting the input layer to the hidden layer. These weighted signals are then summed and combined with a bias b_i (not displayed in the graphical model in Figure 1). This calculation forms the pre-activation signal z_j = b_j + \sum_i a_i w_{ij} for the hidden layer. The pre-activation signal is then transformed by the hidden layer activation function g_j to form the feed-forward activation signals leaving leaving the hidden layer a_j. In a similar fashion, the hidden layer activation signals a_j are multiplied by the weights connecting the hidden layer to the output layer w_{jk}, a bias b_k is added, and the resulting signal is transformed by the output activation function g_k to form the network output a_k. The output is then compared to a desired target t_k and the error between the two is calculated.

Training a neural network involves determining the set of parameters \theta = \{\mathbf{W},\mathbf{b}\} that minimize the errors that the network makes. Often the choice for the error function is the sum of the squared difference between the target values t_k and the network output a_k (for more detail on this choice of error function see):

\Large{\begin{array}{rcl} E &=& \frac{1}{2} \sum_{k \in K}(a_k - t_k)^2 \end{array}}

Equation (1)

This problem can be solved using gradient descent, which requires determining \frac{\partial E}{\partial \theta} for all \theta in the model. Note that, in general, there are two sets of parameters: those parameters that are associated with the output layer (i.e. \theta_k = \{w_{jk}, b_k\}), and thus directly affect the network output error; and the remaining parameters that are associated with the hidden layer(s), and thus affect the output error indirectly.

Before we begin, let’s define the notation that will be used in remainder of the derivation. Please refer to Figure 1 for any clarification.

  • {z_j}: input to node j for layer l
  • {g_j}: activation function for node j in layer l (applied to {z_j})
  • a_j=g_j(z_j): ouput/activation of node j in layer l
  • {w_{ij}}: weights connecting node i in layer (l-1) to node j in layer l
  • {b_{j}}: bias for unit j in layer l
  • {t_{k}}: target value for node k in the output layer

Gradients for Output Layer Weights

Output layer connection weights, w_{jk}

Since the output layer parameters directly affect the value of the error function, determining the gradients for those parameters is fairly straight-forward:

\Large{\begin{array}{rcl} \frac{\partial E }{\partial w_{jk}} &=& \frac{1}{2} \sum_{k \in K}(a_k - t_k)^2 \\  &=& (a_k - t_k)\frac{\partial}{\partial w_{jk}}(a_k - t_k) \end{array}}

Equation (2)

Here, we’ve used the Chain Rule. (Also notice that the summation disappears in the derivative. This is because when we take the partial derivative with respect to the j-th dimension/node, the only term that survives in the error gradient is j-th, and thus we can ignore the remaining terms in the summation). The derivative with respect to t_k is zero because it does not depend on w_{jk}. Also, we note that a_k = g(z_k). Thus

\Large{\begin{array}{rcl}\frac{\partial E }{\partial w_{jk}} &=& (a_k - t_k)\frac{\partial}{\partial w_{jk}}a_k \\  &=& (a_k - t_k)\frac{\partial}{\partial w_{jk}}g_k(z_k) \\  &=& (a_k - t_k)g_k'(z_k)\frac{\partial}{\partial w_{jk}}z_k, \end{array}}

Equation (3)

where, again we use the Chain Rule. Now, recall that z_k = b_j + \sum_j g_j(z_j)w_{jk} and thus \frac{\partial z_{k}}{\partial w_{jk}} = g_j(z_j) = a_j, giving:

\Large{\begin{array}{rcl} \frac{\partial E }{\partial w_{jk}} &=& (a_k - t_k)g_k'(z_k)a_j \end{array}}

Equation (4)

The gradient of the error function with respect to the output layer weights is a product of three terms. The first term is the difference between the network output and the target value t_k. The second term is the derivative of output layer activation function. And the third term is the activation output of node j in the hidden layer.

If we define \delta_k to be all the terms that involve index k:

\Large{\begin{array}{rcl} \delta_k &=& (a_k - t_k)g_k'(z_k)\end{array}}

we obtain the following expression for the derivative of the error with respect to the output weights w_{jk}:

\Large{\begin{array}{rcl} \frac{\partial E }{\partial w_{jk}} = \delta_k a_j \end{array}}

Equation (5)

Here the \delta_k terms can be interpreted as the network output error after being back-propagated through the output activation function, thus creating an error “signal”. Loosely speaking, Equation (5) can be interpreted as determining how much each w_{jk} contributes to the error signal by weighting the error signal by the magnitude of the output activation from the previous (hidden) layer associated with each weight (see Figure 1). The gradients with respect to each parameter are thus considered to be the “contribution” of the parameter to the error signal and should be negated during learning. Thus the output weights are updated as w_{jk}\leftarrow w_{jk} - \eta \frac{\partial E }{\partial w_{jk}}, where \eta is some step size (“learning rate”) along the negative gradient.

As we’ll see shortly, the process of backpropagating the error signal can iterate all the way back to the input layer by successively projecting \delta_k back through w_{jk}, then through the activation function for the hidden layer via g'_j to give the error signal \delta_j, and so on. This backpropagation concept is central to training neural networks with more than one layer.

Output layer biases, \Large{b_{k}}

As far as the gradient with respect to the output layer biases, we follow the same routine as above for w_{jk}. However, the third term in Equation (3) is \frac{\partial}{\partial b_k} z_k = \frac{\partial}{\partial b_k} \left[ b_k + \sum_j g_j(z_j)\right] = 1, giving the following gradient for the output biases:

\Large{\begin{array}{rcl} \frac{\partial E }{\partial b_k} &=& (a_k - t_k)g_k'(z_k)(1) \\  &=& \delta_k \end{array}}

Equation (6)

Thus the gradient for the biases is simply the back-propagated error from the output units. One interpretation of this is that the biases are weights on activations that are always equal to one, regardless of the feed-forward signal. Thus the bias gradients aren’t affected by the feed-forward signal, only by the error.


Gradients for Hidden Layer Weights

Due to the indirect affect of the hidden layer on the output error, calculating the gradients for the hidden layer weights w_{ij}  is somewhat more involved. However, the process starts just the same:

\Large{\begin{array}{rcl} \frac{\partial E }{\partial w_{ij}}&=&\frac{1}{2} \sum_{k \in K}(a_k - t_k)^2 \\  &=& \sum_{k \in K} (a_k - t_k) \frac{\partial}{\partial w_{ij}}a_k  \end{array}}

Notice here that the sum does not disappear because, due to the fact that the layers are fully connected, each of the hidden unit outputs affects the state of each output unit. Continuing on, noting that a_k = g_k(z_k)

\Large{\begin{array}{rcl} \frac{\partial E }{\partial w_{ij}}&=& \sum_{k \in K} (a_k - t_k) \frac{\partial }{\partial w_{ij}}g_k(z_k) \\  &=& \sum_{k \in K} (a_k - t_k)g'_k(z_k)\frac{\partial }{\partial w_{ij}}z_k  \end{array}}

Equation (7)

Here, again we use the Chain Rule. Ok, now here’s where things get “slightly more involved”. Notice that the partial derivative in the third term in Equation (7) is with respect to w_{ij}, but the target z_j is a function of index j. How the heck do we deal with that!? Well, if we expand z_k, we find that it is composed of other sub functions (also see Figure 1):

\Large{\begin{array}{rcl} z_k &=& b_k + \sum_j a_jw_{jk} \\  &=& b_k + \sum_j g_j(z_j)w_{jk} \\  &=& b_k + \sum_j g_j(b_i + \sum_i z_i w_{ij})w_{jk}\end{array}}

Equation (8)

From the last term in Equation (8) we see that z_k is indirectly dependent on w_{ij}.  Equation (8) also suggests that we can use the Chain Rule to calculate \frac{\partial z_k }{\partial w_{ij}}. This is probably the trickiest part of the derivation, and goes like…

\Large{\begin{array}{rcl} \frac{\partial z_k }{\partial w_{ij}} &=& \frac{\partial z_k}{\partial a_j}\frac{\partial a_j}{\partial w_{ij}} \\  &=& \frac{\partial}{\partial a_j}a_jw_{jk}\frac{\partial a_j}{\partial w_{ij}} \\  &=& w_{jk}\frac{\partial a_j}{\partial w_{ij}} \\  &=& w_{jk}\frac{\partial g_j(z_j)}{\partial w_{ij}} \\  &=& w_{jk}g_j'(z_j)\frac{\partial z_j}{\partial w_{ij}} \\  &=& w_{jk}g_j'(z_j)\frac{\partial}{\partial w_{ij}}(b_i + \sum_i a_i w_{ij}) \\  &=& w_{jk}g_j'(z_j)a_i \end{array}}

Equation (9)

Now, plugging Equation (9) into z_k in Equation (7) gives the following for \frac{\partial E}{\partial w_{ij}}:

\Large{\begin{array}{rcl} \frac{\partial E }{\partial w_{ij}}&=& \sum_{k \in K} (a_k - t_k)g'_k(z_k)w_{jk} g'_j(z_j)a_i \\  &=& g'_j(z_j)a_i \sum_{k \in K} (a_k - t_k)g'_k(z_k)w_{jk} \\  &=& a_i g'_j(z_j) \sum_{k \in K} \delta_k w_{jk} \end{array}}

Equation (10)

Notice that the gradient for the hidden layer weights has a similar form to that of the gradient for the output layer weights. Namely the gradient is some term weighted by the output activations from the layer below (a_i). For the output weight gradients, the term that was weighted by a_j was the back-propagated error signal \delta_k (i.e. Equation (5)). Here, the weighted term includes \delta_k, but the error signal is further projected onto w_{jk} and then weighted by the derivative of hidden layer activation function g'_j. Thus, the gradient for the hidden layer weights is simply the output error signal backpropagated to the hidden layer, then weighted by the input to the hidden layer. To make this idea more explicit, we can define the resulting error signal backpropagated to layer j as \delta_j, and includes all terms in Equation (10) that involve index j. This definition results in the following gradient for the hidden unit weights:

\Large{\begin{array}{rcl} \frac{\partial E }{\partial w_{ij}}&=& a_i g'_j(z_j) \sum_{k \in K} \delta_k w_{jk} \\  &=& \delta_j a_i \\  \text{where} \\  \delta_j &=& g'_j(z_j) \sum_{k \in K} \delta_k w_{jk} \end{array}}

Equation (11)

This suggests that in order to calculate the weight gradients at any layer l in an arbitrarily-deep neural network, we simply need to calculate the backpropagated error signal that reaches that layer \delta_l and weight it by the feed-forward signal a_{l-1}feeding into that layer! Analogously, the gradient for the hidden layer weights can be interpreted as a proxy for the “contribution” of the weights to the output error signal, which can only be observed–from the point of view of the weights–by backpropagating the error signal to the hidden layer.

Output layer biases, \Large{w_{ij}}

Calculating the gradients for the hidden layer biases follows a very similar procedure to that for the hidden layer weights where, as in Equation (9), we use the Chain Rule to calculate \frac{\partial z_k}{\partial b_i}. However, unlike Equation (9) the third term that results for the biases is slightly different:

\Large{\begin{array}{rcl} \frac{\partial z_k }{\partial b_i} &=& w_{jk}g_j'(z_j)\frac{\partial z_j}{\partial b_i} \\  &=& w_{jk}g_j'(z_j)\frac{\partial}{\partial b_i}(b_i + \sum_i a_i w_{ij}) \\  &=& w_{jk}g_j'(z_j)(1), \\  \text{giving} \\  \frac{\partial E }{\partial b_i}&=& g'_j(z_j) \sum_{k \in K} \delta_k w_{jk} \\  &=& \delta_j \end{array}}

Equation (12)

In a similar fashion to calculation of the bias gradients for the output layer, the gradients for the hidden layer biases are simply the backpropagated error signal reaching that layer. This suggests that we can also calculate the bias gradients at any layer l in an arbitrarily-deep network by simply calculating the backpropagated error signal reaching that layer \delta_l!

Wrapping up

In this post we went over some of the formal details of the backpropagation learning algorithm. The math covered in this post allows us to train arbitrarily deep neural networks by re-applying the same basic computations. Those computations are:

  1. Calculated the feed-forward signals from the input to the output.
  2. Calculate output error E based on the predictions a_k and the target t_k
  3. Backpropagate the error signals by weighting it by the weights in previous layers and the gradients of the associated activation functions
  4. Calculating the gradients \frac{\partial E}{\partial \theta} for the parameters based on the backpropagated error signal and the feedforward signals from the inputs.
  5. Update the parameters using the calculated gradients \theta \leftarrow \theta - \eta\frac{\partial E}{\partial \theta}

The only real constraints on model construction is ensuring that the error function E and the activation functions g_l are differentiable. For more details on implementing ANNs and seeing them at work, stay tuned for the next post.

Derivation: The Covariance Matrix of an OLS Estimator (and applications to GLS)

We showed in an earlier post that for the linear regression model

y = X\beta + \epsilon,

the optimal Ordinary Least Squares (OLS) estimator for model parameters \beta is

\hat \beta = (X^TX)^{-1}X^Ty

However, because independent variables X and responses y can take on any value, they are both random variables. And, because \hat \beta is a linear combination of X and y, it is also a random variable, and therefore has a covariance. The definition of the covariance matrix C_{\hat \beta} for the OLS estimator is defined as:

C_{\hat \beta} = E[(\hat \beta - \beta)(\hat \beta - \beta)^T]

where, E[*] denotes the expected value operator. In order to find an expression for C_{\hat \beta}, we first need an expression for  (\hat \beta - \beta). The following derives this expression:

\hat \beta = (X^TX)^{-1}X^T(X\beta + \epsilon),

where we use the fact that

y = X\beta + \epsilon.

It follows that

\hat \beta = (X^TX)^{-1}X^TX \beta + (X^TX)^{-1}\epsilon

\hat \beta = \beta + (X^TX)^{-1}X^T \epsilon

and therefore

(\hat \beta - \beta) = (X^TX)^{-1}X^T \epsilon

Now following the original definition for C_{\hat \beta}

C_{\hat \beta} = E[(\hat \beta - \beta)(\hat \beta - \beta)^T]

= E[(X^TX)^{-1}X^T\epsilon((X^TX)^{-1}X^T \epsilon)^T]

= E[(X^TX)^{-1}X^T\epsilon \epsilon^T X(X^TX)^{-1}]

where we take advantage of (AB)^T = B^T A^T in order to rewrite the second term in the product of the expectation. If we take X to be fixed for a given estimator of \hat \beta (in other words we don’t randomly resample the independent variables), then the expectation only depends on the remaining stochastic/random variable, namely \epsilon. Therefore the above expression can be written as

C_{\hat \beta} = (X^TX)^{-1}X^T E[\epsilon \epsilon^T] X(X^TX)^{-1}.

where E[\epsilon \epsilon^T] is the covariance of the noise term in the model. Because OLS assumes uncorrelated noise, the noise covariance is equal to \sigma^2 I, where \sigma^2 is the variance along each dimension, and I is an identity matrix of size equal to the number of dimensions. The expression for the estimator covariance is now:

C_{\hat \beta} = (X^TX)^{-1}X^T (\sigma^2 I) X(X^TX)^{-1},

= \sigma^2 I (X^TX)^{-1} X^T X(X^TX)^{-1}

which simplifies to

C_{\hat \beta} = \sigma^2 (X^T X)^{-1}

A further simplifying assumption made by OLS that is often made is that \epsilon is drawn from a zero mean multivariate Guassian distribution of unit variances (i.e. \sigma^2 = 1), resulting in a noise covariance equal to the identity. Thus

C_{\hat \beta} = (X^TX)^{-1}

Applying the derivation results to Generalized Least Squares

Notice that the expression for the OLS estimator covariance is equal to first inverse term in the expression for the OLS estimator. Identitying the covariance for the OLS estimator in this way gives a helpful heuristic to easily identify the covariance of related estimators that do not make the simplifying assumptions about the covariance that are made in OLS. For instance in Generalized Least Squares (GLS), it is possible for the noise terms to co-vary. The covariance is represented as a noise covariance matrix C_{\epsilon}. This gives the model form

y = X \beta + \epsilon,

where E[\epsilon | X] = 0; Var[\epsilon | X] = C_{\epsilon}.

In otherwords, under GLS, the noise terms have zero mean, and covariance C_{\epsilon}.  It turns out that estimator for the GLS model parameters is

\hat \beta_{GLS} = (X^T C_{\epsilon}^{-1} X)^{-1} X^T C_{\epsilon}^{-1}y.

Notice the similarity between the GLS and OLS estimators. The only difference is that in GLS, the solution for the parameters is scaled by the inverse of the noise covariance. And, in a similar fashion to the OLS estimator, the covariance for the GLS estimator is first term in the product that defines the GLS estimator:

C_{\hat \beta, GLS} = (X^T C_{\epsilon}^{-1}X)^{-1}