# Category Archives: Optimization

## Derivation: Maximum Likelihood for Boltzmann Machines

In this post I will review the gradient-based algorithm that is commonly used to train the general class of models known as Boltzmann machines. Though the primary goal of the post is to supplement another post on restricted Boltzmann machines, I hope that those readers who are curious about how Boltzmann machines are trained, but have found it difficult to track down a complete or straightforward derivation of the maximum likelihood learning algorithm for these models (as I have), will also find the post informative.

First, a little background: Boltzmann machines are stochastic neural networks that can be thought of as the probabilistic extension of the Hopfield network. The goal of the Boltzmann machine is to model a set of observed data in terms of a set of visible random variables $v$ and a set of latent/unobserved random variables $h$. Due to the relationship between Boltzmann machines and neural networks, the random variables are often referred to as “units.” The role of the visible units is to approximate the true distribution of the data, while the role of the latent variables is to extend the expressiveness of the model by capturing underlying features in the observed data. The latent variables are often referred to as hidden units, as they do not result directly from the observed data and are generally marginalized over to obtain the likelihood of the observed data, i.e.

$$p(v; \theta) = \sum_h p(v, h; \theta),$$

where $p(v, h; \theta)$ is the joint probability distribution over the visible and hidden units based on the current model parameters $\theta$. The general Boltzmann machine defines $p(v, h; \theta)$ through a set of weighted, symmetric connections between all visible and hidden units (but no connections from any unit to itself). The graphical model for the general Boltzmann machine is shown in Figure 1.

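Given the current state of the visible and hidden units, the overall configuration of the model network is described by a connectivity function $E(v, h; \theta)$, parameterized by $\theta = \{W, A, B, a, b\}$:

$$E(v, h; \theta) = -\left( v^\top W h + h^\top A h + v^\top B v + a^\top h + b^\top v \right)$$

The function is written here with the sign convention in which lower values correspond to more probable configurations. The parameter matrix $W$ defines the connection strength between the visible and hidden units. The parameters $A$ and $B$ define the connection strength amongst hidden units and visible units, respectively. The model also includes a set of biases $a$ and $b$ that capture offsets for each of the hidden and visible units.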

The Boltzmann machine has been used for years in the field of statistical mechanics to model physical systems based on the principle of energy minimization. In statistical mechanics, the connectivity function is often referred to as the “energy function,” a term that has also been standardized in the statistical learning literature. Note that the energy function returns a single scalar value for any configuration of the network parameters and random variable states.

Given the energy function $E(v, h; \theta)$, the Boltzmann machine models the joint probability of the visible and hidden unit states as a Boltzmann distribution:

$$p(v, h; \theta) = \frac{e^{-E(v, h; \theta)}}{Z(\theta)}, \qquad Z(\theta) = \sum_{v', h'} e^{-E(v', h'; \theta)}$$

The partition function $Z(\theta)$ is a normalizing constant that is calculated by summing over all possible states of the network $(v', h')$. Here we assume that all random variables take on discrete values, but the analogous derivation holds for continuous or mixed variable types by replacing the sums with integrals accordingly.
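To make the role of the partition function concrete, here is a minimal sketch that enumerates every state of a tiny Boltzmann machine. It assumes binary (0/1) units and an energy of the form $E = -(v^\top W h + h^\top A h + v^\top B v + a^\top h + b^\top v)$; the function names, the network size, and the random parameter values are all illustrative choices, not part of the original derivation.

```python
import itertools
import numpy as np

def energy(v, h, W, A, B, a, b):
    """Energy of one joint configuration (assumed form; lower = more probable)."""
    return -(v @ W @ h + h @ A @ h + v @ B @ v + a @ h + b @ v)

def symmetric_no_self(rng, n):
    """Random symmetric connection matrix with a zero diagonal (no self-connections)."""
    M = rng.normal(size=(n, n))
    S = (M + M.T) / 2.0
    np.fill_diagonal(S, 0.0)
    return S

def joint_distribution(W, A, B, a, b):
    """Enumerate all binary states and return (states, probabilities, Z)."""
    n_v, n_h = W.shape
    states = [
        (np.array(v, dtype=float), np.array(h, dtype=float))
        for v in itertools.product([0, 1], repeat=n_v)
        for h in itertools.product([0, 1], repeat=n_h)
    ]
    unnormalized = np.array(
        [np.exp(-energy(v, h, W, A, B, a, b)) for v, h in states]
    )
    Z = unnormalized.sum()  # partition function: sum over every network state
    return states, unnormalized / Z, Z

rng = np.random.default_rng(0)
n_v, n_h = 3, 2  # hypothetical sizes, small enough to enumerate
W = rng.normal(size=(n_v, n_h))
A = symmetric_no_self(rng, n_h)
B = symmetric_no_self(rng, n_v)
a = rng.normal(size=n_h)
b = rng.normal(size=n_v)

states, probs, Z = joint_distribution(W, A, B, a, b)
print(f"{len(states)} joint states, probabilities sum to {probs.sum():.6f}")
```

The number of states doubles with every added unit, which is why this brute-force normalization is only usable for toy models.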

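The common way to train the Boltzmann machine is to determine the parameters that maximize the likelihood of the observed data. To determine the parameters, we perform gradient ascent on the log of the likelihood function (equivalently, gradient descent on the negative log-likelihood). In order to simplify the notation in the remainder of the derivation, we do not include the explicit dependency on the parameters $\theta$. To further simplify things, let’s also assume that we calculate the gradient of the likelihood based on a single observation $v$:

$$
\begin{aligned}
l(v) = \log p(v) &= \log \sum_h p(v, h) \\
&= \log \frac{\sum_h e^{-E(v, h)}}{Z} \\
&= \log \sum_h e^{-E(v, h)} - \log \sum_{v', h'} e^{-E(v', h')}
\end{aligned}
$$

The gradient calculation is as follows:

$$
\begin{aligned}
\frac{\partial l(v)}{\partial \theta} &= \frac{\partial}{\partial \theta} \log \sum_h e^{-E(v, h)} - \frac{\partial}{\partial \theta} \log \sum_{v', h'} e^{-E(v', h')} \\
&= \frac{1}{\sum_h e^{-E(v, h)}} \sum_h \frac{\partial}{\partial \theta} e^{-E(v, h)} - \frac{1}{\sum_{v', h'} e^{-E(v', h')}} \sum_{v', h'} \frac{\partial}{\partial \theta} e^{-E(v', h')} \\
&= -\frac{1}{\sum_h e^{-E(v, h)}} \sum_h e^{-E(v, h)} \frac{\partial E(v, h)}{\partial \theta} + \frac{1}{\sum_{v', h'} e^{-E(v', h')}} \sum_{v', h'} e^{-E(v', h')} \frac{\partial E(v', h')}{\partial \theta}
\end{aligned}
$$

Here we can simplify the expression somewhat by noting that $\sum_h e^{-E(v, h)} = Z\, p(v)$, that $\sum_{v', h'} e^{-E(v', h')} = Z$, and also that $Z$ is a constant with respect to the states being summed over, so it can be moved inside the sums:

$$
\frac{\partial l(v)}{\partial \theta} = -\sum_h \frac{e^{-E(v, h)}}{Z\, p(v)} \frac{\partial E(v, h)}{\partial \theta} + \sum_{v', h'} \frac{e^{-E(v', h')}}{Z} \frac{\partial E(v', h')}{\partial \theta}
$$

If we also note that $\frac{e^{-E(v, h)}}{Z} = p(v, h)$, and use the definition of conditional probability $p(h \mid v) = \frac{p(v, h)}{p(v)}$, we can further simplify the expression for the gradient:

$$
\begin{aligned}
\frac{\partial l(v)}{\partial \theta} &= -\sum_h p(h \mid v) \frac{\partial E(v, h)}{\partial \theta} + \sum_{v', h'} p(v', h') \frac{\partial E(v', h')}{\partial \theta} \\
&= -\mathbb{E}_{p(h \mid v)}\left[ \frac{\partial E(v, h)}{\partial \theta} \right] + \mathbb{E}_{p(v, h)}\left[ \frac{\partial E(v, h)}{\partial \theta} \right]
\end{aligned}
$$

Here $\mathbb{E}_{p}[\cdot]$ is the expected value under the distribution $p$. Thus the gradient of the likelihood function is composed of two parts. The first part is the expected gradient of the energy function with respect to the conditional distribution $p(h \mid v)$. The second part is the expected gradient of the energy function with respect to the joint distribution $p(v, h)$ over all variable states. However, calculating these expectations is generally infeasible for any realistically-sized model, as it involves summing over a huge number of possible states/configurations. The general approach for solving this problem is to use Markov Chain Monte Carlo (MCMC) to approximate these sums:

$$
\frac{\partial l(v)}{\partial \theta} \approx -\left\langle \frac{\partial E(v, h)}{\partial \theta} \right\rangle_{p(h \mid v)} + \left\langle \frac{\partial E(v, h)}{\partial \theta} \right\rangle_{p(v, h)}
$$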

Here $\langle \cdot \rangle_{p}$ is the sample average of samples drawn according to the process $p$. The first term is calculated by taking the average value of the energy function gradient when the visible and hidden units are being driven by observed data samples. In practice, this first term is generally straightforward to calculate. Calculating the second term is generally more complicated and involves running a set of Markov chains until they reach the current model’s equilibrium distribution (e.g. via Gibbs sampling, Metropolis-Hastings, or the like), then taking the average energy function gradient based on those samples. See this post on MCMC methods for details. It turns out that there is a subclass of Boltzmann machines that, due to a restricted connectivity/energy function (specifically, the intra-layer connection parameters $A$ and $B$ are set to zero), allow for efficient MCMC by way of blocked Gibbs sampling. These models, known as *restricted Boltzmann machines*, have become an important component for unsupervised pretraining in the field of deep learning and will be the focus of a related post.
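As a sanity check on the two-expectation form of the gradient, the following sketch compares it against a finite-difference estimate for a model small enough that both expectations can be computed exactly by enumeration, so no MCMC is involved. It assumes binary units and the restricted case with energy $E = -(v^\top W h + a^\top h + b^\top v)$, for which $\partial E / \partial W = -v h^\top$; all names and sizes are illustrative.

```python
import itertools
import numpy as np

def all_states(n):
    """All binary vectors of length n, as rows of a float array."""
    return np.array(list(itertools.product([0, 1], repeat=n)), dtype=float)

def neg_energy(V, H, W, a, b):
    """-E(v, h) for every (row of V, row of H) pair; shape (len(V), len(H))."""
    return V @ W @ H.T + (V @ b)[:, None] + (H @ a)[None, :]

def log_likelihood(v, W, a, b):
    """Exact log p(v): log sum_h exp(-E(v, h)) - log Z, by enumeration."""
    V, H = all_states(W.shape[0]), all_states(W.shape[1])
    log_Z = np.log(np.exp(neg_energy(V, H, W, a, b)).sum())
    return np.log(np.exp(neg_energy(v[None, :], H, W, a, b)).sum()) - log_Z

def grad_W(v, W, a, b):
    """d log p(v) / dW = E_{p(h|v)}[v h^T] - E_{p(v,h)}[v h^T]."""
    V, H = all_states(W.shape[0]), all_states(W.shape[1])
    # Data-driven term: hidden units follow p(h | v) with v clamped.
    cond = np.exp(neg_energy(v[None, :], H, W, a, b))[0]
    cond /= cond.sum()
    data_term = np.outer(v, cond @ H)
    # Model term: expectation under the joint distribution p(v, h).
    joint = np.exp(neg_energy(V, H, W, a, b))
    joint /= joint.sum()
    model_term = V.T @ joint @ H
    return data_term - model_term

rng = np.random.default_rng(0)
W = rng.normal(scale=0.5, size=(3, 2))
a = rng.normal(scale=0.5, size=2)
b = rng.normal(scale=0.5, size=3)
v = np.array([1.0, 0.0, 1.0])  # a single hypothetical observation

analytic = grad_W(v, W, a, b)

# Central finite differences on the exact log-likelihood.
eps = 1e-6
numeric = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        dW = np.zeros_like(W)
        dW[i, j] = eps
        numeric[i, j] = (
            log_likelihood(v, W + dW, a, b) - log_likelihood(v, W - dW, a, b)
        ) / (2 * eps)

max_abs_difference = np.max(np.abs(analytic - numeric))
print("max |analytic - numeric| =", max_abs_difference)
```

In a realistically sized model the two normalized arrays `cond` and `joint` cannot be enumerated, and their roles are taken by samples from the corresponding Markov chains.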

## Derivation: Error Backpropagation & Gradient Descent for Neural Networks

## Introduction

Artificial neural networks (ANNs) are a powerful class of models used for nonlinear regression and classification tasks that are motivated by biological neural computation. The general idea behind ANNs is pretty straightforward: map some input onto a desired target value using a distributed cascade of nonlinear transformations (see Figure 1). However, for many, myself included, the learning algorithm used to train ANNs can be difficult to get your head around at first. In this post I give a step-by-step walk-through of the derivation of the gradient descent learning algorithm commonly used to train ANNs (aka the *backpropagation algorithm*) and try to provide some high-level insights into the computations being performed during learning.

### Some Background and Notation

An ANN consists of an input layer, an output layer, and any number (including zero) of hidden layers situated between the input and output layers. Figure 1 diagrams an ANN with a single hidden layer. The feed-forward computations performed by the ANN are as follows: The signals from the input layer $a_i$ are multiplied by a set of fully-connected weights $w_{ij}$ connecting the input layer to the hidden layer. These weighted signals are then summed and combined with a bias $b_j$ (not displayed in the graphical model in Figure 1). This calculation forms the pre-activation signal $z_j = b_j + \sum_i a_i w_{ij}$ for the hidden layer. The pre-activation signal is then transformed by the hidden layer activation function $g_j$ to form the feed-forward activation signals leaving the hidden layer $a_j$. In a similar fashion, the hidden layer activation signals $a_j$ are multiplied by the weights connecting the hidden layer to the output layer $w_{jk}$, a bias $b_k$ is added, and the resulting signal $z_k$ is transformed by the output activation function $g_k$ to form the network output $a_k$. The output is then compared to a desired target $t_k$ and the error between the two is calculated.
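The feed-forward pass described above can be sketched in a few lines of NumPy. This is only an illustration: the post does not fix a particular activation function, so a logistic sigmoid is assumed for both layers, and the layer sizes, variable names, and random weights are hypothetical.

```python
import numpy as np

def sigmoid(z):
    """Logistic activation, used here for both g_j and g_k (an assumption)."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(a_i, W_ij, b_j, W_jk, b_k, g_j=sigmoid, g_k=sigmoid):
    """Feed-forward pass for a network with a single hidden layer."""
    z_j = b_j + a_i @ W_ij   # hidden pre-activation
    a_j = g_j(z_j)           # hidden activation
    z_k = b_k + a_j @ W_jk   # output pre-activation
    a_k = g_k(z_k)           # network output
    return z_j, a_j, z_k, a_k

rng = np.random.default_rng(0)
a_i = np.array([0.5, -1.0, 2.0])   # one input with 3 features
W_ij = rng.normal(size=(3, 4))     # input -> hidden weights
b_j = np.zeros(4)
W_jk = rng.normal(size=(4, 2))     # hidden -> output weights
b_k = np.zeros(2)
t_k = np.array([1.0, 0.0])         # desired target

z_j, a_j, z_k, a_k = forward(a_i, W_ij, b_j, W_jk, b_k)
error = 0.5 * np.sum((a_k - t_k) ** 2)  # sum-of-squares error
print("output:", a_k, "error:", error)
```

Returning the pre-activations alongside the activations is a deliberate choice, since both are needed again when the error is propagated backwards.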

Training a neural network involves determining the set of parameters $\theta = \{W, b\}$ that minimize the errors that the network makes. Often the choice for the error function is the sum of the squared difference between the target values $t_k$ and the network output $a_k$ (for more detail on this choice of error function see):

$$E = \frac{1}{2} \sum_{k \in K} (a_k - t_k)^2 \tag{1}$$

This problem can be solved using gradient descent, which requires determining $\frac{\partial E}{\partial \theta}$ for all $\theta$ in the model. Note that, in general, there are two sets of parameters: those parameters that are associated with the output layer (i.e. $\theta_k = \{w_{jk}, b_k\}$), and thus directly affect the network output error; and the remaining parameters that are associated with the hidden layer(s), and thus affect the output error indirectly.

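Before we begin, let’s define the notation that will be used in the remainder of the derivation. Throughout, the indices $i$, $j$, and $k$ refer to nodes in the input, hidden, and output layers, respectively. Please refer to Figure 1 for any clarification.

- $z_j$: input to node $j$ for layer $l$
- $g_j$: activation function for node $j$ in layer $l$ (applied to $z_j$)
- $a_j = g_j(z_j)$: output/activation of node $j$ in layer $l$
- $w_{ij}$: weights connecting node $i$ in layer $(l-1)$ to node $j$ in layer $l$
- $b_j$: bias for unit $j$ in layer $l$
- $t_k$: target value for node $k$ in the output layer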

## Gradients for Output Layer Weights

### Output layer connection weights, $w_{jk}$

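Since the output layer parameters directly affect the value of the error function, determining the gradients for those parameters is fairly straightforward:

$$
\begin{aligned}
\frac{\partial E}{\partial w_{jk}} &= \frac{1}{2} \sum_{k \in K} \frac{\partial}{\partial w_{jk}} (a_k - t_k)^2 \\
&= (a_k - t_k) \frac{\partial}{\partial w_{jk}} (a_k - t_k)
\end{aligned} \tag{2}
$$

Here, we’ve used the Chain Rule. (Also notice that the summation disappears in the derivative. This is because when we take the partial derivative with respect to the $k$-th dimension/node, the only term that survives in the error gradient is the $k$-th, and thus we can ignore the remaining terms in the summation). The derivative with respect to $t_k$ is zero because it does not depend on $w_{jk}$. Also, we note that $a_k = g_k(z_k)$. Thus

$$
\begin{aligned}
\frac{\partial E}{\partial w_{jk}} &= (a_k - t_k) \frac{\partial}{\partial w_{jk}} a_k \\
&= (a_k - t_k) \frac{\partial}{\partial w_{jk}} g_k(z_k) \\
&= (a_k - t_k)\, g_k'(z_k)\, \frac{\partial}{\partial w_{jk}} z_k
\end{aligned} \tag{3}
$$

where, again, we use the Chain Rule. Now, recall that $z_k = b_k + \sum_j g_j(z_j) w_{jk}$ and thus $\frac{\partial z_k}{\partial w_{jk}} = g_j(z_j) = a_j$, giving:

$$\frac{\partial E}{\partial w_{jk}} = (a_k - t_k)\, g_k'(z_k)\, a_j \tag{4}$$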

The gradient of the error function with respect to the output layer weights $w_{jk}$ is a product of three terms. The first term is the difference between the network output $a_k$ and the target value $t_k$. The second term is the derivative of the output layer activation function $g_k'(z_k)$. And the third term is the activation output $a_j$ of node $j$ in the hidden layer.

If we define $\delta_k$ to be all the terms that involve index $k$:

$$\delta_k = (a_k - t_k)\, g_k'(z_k)$$

we obtain the following expression for the derivative of the error with respect to the output weights $w_{jk}$:

$$\frac{\partial E}{\partial w_{jk}} = \delta_k a_j \tag{5}$$

Here the $\delta_k$ terms can be interpreted as the network output error after being back-propagated through the output activation function, thus creating an error “signal”. Loosely speaking, Equation (5) can be interpreted as determining how much each $w_{jk}$ contributes to the error signal by weighting the error signal by the magnitude of the output activation from the previous (hidden) layer associated with each weight (see Figure 1). The gradients with respect to each parameter are thus considered to be the “contribution” of the parameter to the error signal and should be negated during learning. Thus the output weights are updated as $w_{jk} \leftarrow w_{jk} - \eta \frac{\partial E}{\partial w_{jk}}$, where $\eta$ is some step size (“learning rate”) along the negative gradient.
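The output-layer gradient and update rule can be sketched as follows. The sketch assumes sigmoid output units (so that $g_k'(z_k) = g_k(z_k)(1 - g_k(z_k))$), treats the hidden activations $a_j$ as given, and uses made-up sizes, values, and a hypothetical learning rate of 0.1.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    """Derivative of the sigmoid, g'(z) = g(z) * (1 - g(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

def output_error(a_j, W_jk, b_k, t_k):
    """Sum-of-squares error of the output layer for fixed hidden activations."""
    a_k = sigmoid(b_k + a_j @ W_jk)
    return 0.5 * np.sum((a_k - t_k) ** 2)

def output_weight_gradient(a_j, W_jk, b_k, t_k):
    """Return (dE/dW_jk, delta_k), with dE/dw_jk = delta_k * a_j."""
    z_k = b_k + a_j @ W_jk
    delta_k = (sigmoid(z_k) - t_k) * sigmoid_prime(z_k)  # error signal
    return np.outer(a_j, delta_k), delta_k

rng = np.random.default_rng(0)
a_j = rng.uniform(size=4)          # hidden activations (assumed given)
W_jk = rng.normal(size=(4, 2))
b_k = rng.normal(size=2)
t_k = np.array([1.0, 0.0])

grad_W_jk, delta_k = output_weight_gradient(a_j, W_jk, b_k, t_k)

eta = 0.1                          # learning rate (illustrative value)
W_jk_new = W_jk - eta * grad_W_jk  # step along the negative gradient

error_before = output_error(a_j, W_jk, b_k, t_k)
error_after = output_error(a_j, W_jk_new, b_k, t_k)
print("error before:", error_before, "after one step:", error_after)
```

The outer product builds the whole gradient matrix at once: entry `[j, k]` is `a_j[j] * delta_k[k]`, matching Equation (5) element by element.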

As we’ll see shortly, the process of backpropagating the error signal can iterate all the way back to the input layer by successively projecting $\delta_k$ back through $w_{jk}$, then through the activation function for the hidden layer via $g_j'$ to give the error signal $\delta_j$, and so on. This backpropagation concept is central to training neural networks with more than one layer.

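### Output layer biases, $b_k$

As far as the gradient with respect to the output layer biases, we follow the same routine as above for $w_{jk}$. However, the third term in Equation (3) is $\frac{\partial}{\partial b_k} z_k = \frac{\partial}{\partial b_k} \left[ b_k + \sum_j g_j(z_j) w_{jk} \right] = 1$, giving the following gradient for the output biases:

$$\frac{\partial E}{\partial b_k} = (a_k - t_k)\, g_k'(z_k)\,(1) = \delta_k \tag{6}$$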

Thus the gradient for the biases is simply the back-propagated error from the output units. One interpretation of this is that the biases are weights on activations that are always equal to one, regardless of the feed-forward signal. Thus the bias gradients aren’t affected by the feed-forward signal, only by the error.

## Gradients for Hidden Layer Weights

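Due to the indirect effect of the hidden layer on the output error, calculating the gradients for the hidden layer weights $w_{ij}$ is somewhat more involved. However, the process starts just the same:

$$
\begin{aligned}
\frac{\partial E}{\partial w_{ij}} &= \frac{1}{2} \sum_{k \in K} \frac{\partial}{\partial w_{ij}} (a_k - t_k)^2 \\
&= \sum_k (a_k - t_k) \frac{\partial}{\partial w_{ij}} a_k
\end{aligned}
$$

Notice here that the sum does not disappear because, due to the fact that the layers are fully connected, each of the hidden unit outputs affects the state of each output unit. Continuing on, noting that $a_k = g_k(z_k)$…

$$
\begin{aligned}
\frac{\partial E}{\partial w_{ij}} &= \sum_k (a_k - t_k) \frac{\partial}{\partial w_{ij}} g_k(z_k) \\
&= \sum_k (a_k - t_k)\, g_k'(z_k)\, \frac{\partial}{\partial w_{ij}} z_k
\end{aligned} \tag{7}
$$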

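Here, again, we use the Chain Rule. Ok, now here’s where things get “slightly more involved”. Notice that the partial derivative in the third term in Equation (7) is with respect to $w_{ij}$, but the quantity being differentiated, $z_k$, is a function of index $k$. How the heck do we deal with that!? Well, if we expand $z_k$, we find that it is composed of other sub functions (also see Figure 1):

$$
\begin{aligned}
z_k &= b_k + \sum_j a_j w_{jk} \\
&= b_k + \sum_j g_j(z_j)\, w_{jk} \\
&= b_k + \sum_j g_j\!\left( b_j + \sum_i a_i w_{ij} \right) w_{jk}
\end{aligned} \tag{8}
$$

From the last term in Equation (8) we see that $z_k$ is *indirectly* dependent on $w_{ij}$. Equation (8) also suggests that we can use the Chain Rule to calculate $\frac{\partial z_k}{\partial w_{ij}}$. This is probably the trickiest part of the derivation, and goes like…

$$
\begin{aligned}
\frac{\partial z_k}{\partial w_{ij}} &= \frac{\partial z_k}{\partial a_j} \frac{\partial a_j}{\partial w_{ij}} \\
&= w_{jk} \frac{\partial a_j}{\partial w_{ij}} \\
&= w_{jk} \frac{\partial g_j(z_j)}{\partial w_{ij}} \\
&= w_{jk}\, g_j'(z_j)\, \frac{\partial z_j}{\partial w_{ij}} \\
&= w_{jk}\, g_j'(z_j)\, \frac{\partial}{\partial w_{ij}} \left( b_j + \sum_i a_i w_{ij} \right) \\
&= w_{jk}\, g_j'(z_j)\, a_i
\end{aligned} \tag{9}
$$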

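Now, plugging Equation (9) into $\frac{\partial z_k}{\partial w_{ij}}$ in Equation (7) gives the following for $\frac{\partial E}{\partial w_{ij}}$:

$$
\begin{aligned}
\frac{\partial E}{\partial w_{ij}} &= \sum_k (a_k - t_k)\, g_k'(z_k)\, w_{jk}\, g_j'(z_j)\, a_i \\
&= g_j'(z_j)\, a_i \sum_k (a_k - t_k)\, g_k'(z_k)\, w_{jk} \\
&= a_i\, g_j'(z_j) \sum_k \delta_k w_{jk}
\end{aligned} \tag{10}
$$

Notice that the gradient for the hidden layer weights has a similar form to that of the gradient for the output layer weights. Namely, the gradient is some term weighted by the output activations from the layer below ($a_i$). For the output weight gradients, the term that was weighted by $a_j$ was the back-propagated error signal $\delta_k = (a_k - t_k)\, g_k'(z_k)$ (i.e. Equation (5)). Here, the weighted term includes $\delta_k$, but the error signal is further projected onto $w_{jk}$ and then weighted by the derivative of the hidden layer activation function $g_j'(z_j)$. Thus, the gradient for the hidden layer weights is simply the output error signal backpropagated to the hidden layer, then weighted by the input to the hidden layer. To make this idea more explicit, we can define the resulting error signal backpropagated to layer $j$ as $\delta_j$, which includes all terms in Equation (10) that involve index $j$. This definition results in the following gradient for the hidden unit weights:

$$\frac{\partial E}{\partial w_{ij}} = \delta_j a_i, \qquad \text{where} \quad \delta_j = g_j'(z_j) \sum_k \delta_k w_{jk} \tag{11}$$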

This suggests that in order to calculate the weight gradients at any layer in an arbitrarily-deep neural network, we simply need to calculate the backpropagated error signal that reaches that layer and weight it by the feed-forward signal feeding into that layer! Analogously, the gradient for the hidden layer weights can be interpreted as a proxy for the “contribution” of the weights to the output error signal, which can only be observed–from the point of view of the weights–by backpropagating the error signal to the hidden layer.
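A quick way to convince yourself that the backpropagated error signal gives the right hidden-layer gradient is to compare it against finite differences. The sketch below does this for a single-hidden-layer network, assuming sigmoid activations in both layers; the sizes, names, and random values are illustrative only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def network_error(a_i, W_ij, b_j, W_jk, b_k, t_k):
    """Sum-of-squares error after a full feed-forward pass."""
    a_j = sigmoid(b_j + a_i @ W_ij)
    a_k = sigmoid(b_k + a_j @ W_jk)
    return 0.5 * np.sum((a_k - t_k) ** 2)

def hidden_weight_gradient(a_i, W_ij, b_j, W_jk, b_k, t_k):
    """dE/dW_ij computed by backpropagating the output error signal."""
    z_j = b_j + a_i @ W_ij
    a_j = sigmoid(z_j)
    z_k = b_k + a_j @ W_jk
    a_k = sigmoid(z_k)
    delta_k = (a_k - t_k) * sigmoid_prime(z_k)         # output error signal
    delta_j = sigmoid_prime(z_j) * (W_jk @ delta_k)    # g'_j(z_j) * sum_k delta_k w_jk
    return np.outer(a_i, delta_j)                      # dE/dw_ij = delta_j * a_i

rng = np.random.default_rng(0)
a_i = rng.normal(size=3)
W_ij, b_j = rng.normal(size=(3, 4)), rng.normal(size=4)
W_jk, b_k = rng.normal(size=(4, 2)), rng.normal(size=2)
t_k = np.array([1.0, 0.0])

analytic = hidden_weight_gradient(a_i, W_ij, b_j, W_jk, b_k, t_k)

# Central finite differences, one hidden-layer weight at a time.
eps = 1e-6
numeric = np.zeros_like(W_ij)
for i in range(W_ij.shape[0]):
    for j in range(W_ij.shape[1]):
        dW = np.zeros_like(W_ij)
        dW[i, j] = eps
        numeric[i, j] = (
            network_error(a_i, W_ij + dW, b_j, W_jk, b_k, t_k)
            - network_error(a_i, W_ij - dW, b_j, W_jk, b_k, t_k)
        ) / (2 * eps)

max_abs_difference = np.max(np.abs(analytic - numeric))
print("max |analytic - numeric| =", max_abs_difference)
```

A check like this is cheap for small networks and is a common way to catch indexing or sign mistakes in a hand-written backward pass.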

### Hidden layer biases, $b_j$

Calculating the gradients for the hidden layer biases follows a very similar procedure to that for the hidden layer weights where, as in Equation (9), we use the Chain Rule to calculate $\frac{\partial z_k}{\partial b_j}$. However, unlike Equation (9), the third term that results for the biases is slightly different, namely $\frac{\partial z_j}{\partial b_j} = 1$ rather than $a_i$:

$$
\begin{aligned}
\frac{\partial E}{\partial b_j} &= \sum_k (a_k - t_k)\, g_k'(z_k)\, w_{jk}\, g_j'(z_j)\, \frac{\partial z_j}{\partial b_j} \\
&= g_j'(z_j) \sum_k (a_k - t_k)\, g_k'(z_k)\, w_{jk}\,(1) \\
&= \delta_j
\end{aligned} \tag{12}
$$

In a similar fashion to the calculation of the bias gradients for the output layer, the gradients for the hidden layer biases are simply the backpropagated error signal reaching that layer. This suggests that we can also calculate the bias gradients at any layer in an arbitrarily-deep network by simply calculating the backpropagated error signal reaching that layer!

## Wrapping up

In this post we went over some of the formal details of the backpropagation learning algorithm. The math covered in this post allows us to train arbitrarily deep neural networks by re-applying the same basic computations. Those computations are:

- Calculate the feed-forward signals from the input to the output.
- Calculate the output error based on the predictions and the targets.
- Backpropagate the error signals by weighting them by the weights in previous layers and the gradients of the associated activation functions.
- Calculate the gradients for the parameters based on the backpropagated error signal and the feed-forward signals from the inputs.
- Update the parameters using the calculated gradients.
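The five steps above can be strung together into a small training loop. The following is a minimal sketch rather than a reference implementation: it assumes sigmoid activations in both layers, sums the error and gradients over all training examples (full-batch gradient descent), and uses a toy XOR-style dataset with arbitrary hyperparameters. The function name `train` and all of its arguments are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, T, n_hidden=4, eta=0.1, n_epochs=2000, seed=0):
    """Full-batch gradient descent for a network with one hidden layer."""
    rng = np.random.default_rng(seed)
    W_ij = rng.normal(scale=0.5, size=(X.shape[1], n_hidden))
    b_j = np.zeros(n_hidden)
    W_jk = rng.normal(scale=0.5, size=(n_hidden, T.shape[1]))
    b_k = np.zeros(T.shape[1])
    errors = []
    for _ in range(n_epochs):
        # 1. Feed-forward signals (one row per training example).
        A_j = sigmoid(b_j + X @ W_ij)
        A_k = sigmoid(b_k + A_j @ W_jk)
        # 2. Output error (sum of squares over outputs and examples).
        errors.append(0.5 * np.sum((A_k - T) ** 2))
        # 3. Backpropagate the error signals; sigmoid'(z) = a * (1 - a).
        delta_k = (A_k - T) * A_k * (1.0 - A_k)
        delta_j = (delta_k @ W_jk.T) * A_j * (1.0 - A_j)
        # 4. Gradients: error signal weighted by the incoming activations.
        grad_W_jk, grad_b_k = A_j.T @ delta_k, delta_k.sum(axis=0)
        grad_W_ij, grad_b_j = X.T @ delta_j, delta_j.sum(axis=0)
        # 5. Update the parameters along the negative gradient.
        W_jk -= eta * grad_W_jk
        b_k -= eta * grad_b_k
        W_ij -= eta * grad_W_ij
        b_j -= eta * grad_b_j
    return (W_ij, b_j, W_jk, b_k), errors

# Toy dataset: XOR inputs and targets.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
T = np.array([[0.0], [1.0], [1.0], [0.0]])

params, errors = train(X, T)
print("initial error:", errors[0], "final error:", errors[-1])
```

How far the error falls depends on the initialization, learning rate, and number of epochs, so the values above should be treated as starting points to experiment with.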

The only real constraint on model construction is that the error function and the activation functions be differentiable. For more details on implementing ANNs and seeing them at work, stay tuned for the next post.