# Category Archives: Gradient Descent

## Derivation: Maximum Likelihood for Boltzmann Machines

In this post I will review the gradient descent algorithm that is commonly used to train the general class of models known as Boltzmann machines. Though the primary goal of the post is to supplement another post on restricted Boltzmann machines, I hope that those readers who are curious about how Boltzmann machines are trained, but have found it difficult to track down a complete or straight-forward derivation of the maximum likelihood learning algorithm for these models (as I have), will also find the post informative.

First, a little background: Boltzmann machines are stochastic neural networks that can be thought of as the probabilistic extension of the Hopfield network. The goal of the Boltzmann machine is to model a set of observed data in terms of a set of visible random variables and a set of latent/unobserved random variables . Due to the relationship between Boltzmann machines and neural networks, the random variables are often are often referred to as “units.” The role of the visible units is to approximate the true distribution of the data, while the role of the latent variables it to extend the expressiveness of the model by capturing underlying features in the observed data. The latent variables are often referred to as hidden units, as they do not result directly from the observed data and are generally marginalized over to obtain the likelihood of the observed data, i.e.

,

where is the joint probability distribution over the visible and hidden units based on the current model parameters . The general Boltzmann machine defines through a set of weighted, symmetric connections between all visible and hidden units (but no connections from any unit to itself). The graphical model for the general Boltzmann machine is shown in Figure 1.

Given the current state of the visible and hidden units, the overall configuration of the model network is described by a connectivity function , parameterized by :

The parameter matrix defines the connection strength between the visible and hidden units. The parameters and define the connection strength amongst hidden units and visible units, respectively. The model also includes a set of biases and that capture offsets for each of the hidden and visible units.

The Boltzmann machine has been used for years in field of statistical mechanics to model physical systems based on the principle of energy minimization. In the statistical mechanics, the connectivity function is often referred to the “energy function,” a term that is has also been standardized in the statistical learning literature. Note that the energy function returns a single scalar value for any configuration of the network parameters and random variable states.

Given the energy function, the Boltzmann machine models the joint probability of the visible and hidden unit states as a Boltzmann distribution:

The partition function is a normalizing constant that is calculated by summing over all possible states of the network . Here we assume that all random variables take on discrete values, but the analogous derivation holds for continuous or mixed variable types by replacing the sums with integrals accordingly.

The common way to train the Boltzmann machine is to determine the parameters that maximize the likelihood of the observed data. To determine the parameters, we perform gradient descent on the log of the likelihood function (In order to simplify the notation in the remainder of the derivation, we do not include the explicit dependency on the parameters . To further simplify things, let’s also assume that we calculate the gradient of the likelihood based on a single observation.):

The gradient calculation is as follows:

Here we can simplify the expression somewhat by noting that , that , and also that is a constant:

If we also note that , and use the definition of conditional probability , we can further simplify the expression for the gradient:

Here is the expected value under the distribution . Thus the gradient of the likelihood function is composed of two parts. The first part is expected gradient of the energy function with respect to the conditional distribution . The second part is expected gradient of the energy function with respect to the joint distribution over all variable states. However, calculating these expectations is generally infeasible for any realistically-sized model, as it involves summing over a huge number of possible states/configurations. The general approach for solving this problem is to use Markov Chain Monte Carlo (MCMC) to approximate these sums:

Here is the sample average of samples drawn according to the process . The first term is calculated by taking the average value of the energy function gradient when the visible and hidden units are being driven by observed data samples. In practice, this first term is generally straightforward to calculate. Calculating the second term is generally more complicated and involves running a set of Markov chains until they reach the current model’s equilibrium distribution (i.e. via Gibbs sampling, Metropolis-Hastings, or the like), then taking the average energy function gradient based on those samples. See this post on MCMC methods for details. It turns out that there is a subclass of Boltzmann machines that, due to a restricted connectivity/energy function (specifically, the parameters ), allow for efficient MCMC by way of blocked Gibbs sampling. These models, known as *restricted Boltzman machines* have become an important component for unsupervised pretraining in the field of deep learning and will be the focus of a related post.

## A Gentle Introduction to Artificial Neural Networks

## Introduction

Though many phenomena in the world can be adequately modeled using linear regression or classification, most interesting phenomena are generally nonlinear in nature. In order to deal with nonlinear phenomena, there have been a diversity of nonlinear models developed. For example parametric models assume that data follow some parameteric class of nonlinear function (e.g. polynomial, power, or exponential), then fine-tune the shape of the parametric function to fit observed data. However this approach is only helpful if data are fit nicely by the available catalog of parametric functions. Another approach, kernel-based methods, transforms data non-linearly into an abstract space that measures distances between observations, then predicts new values or classes based on these distances. However, kernel methods generally involve constructing a kernel matrix that depends on the number of training observations and can thus be prohibitive for large data sets. Another class of models, the ones that are the focus of this post, are artificial neural networks (ANNs). ANNs are nonlinear models motivated by the physiological architecture of the nervous system. They involve a cascade of simple nonlinear computations that when aggregated can implement robust and complex nonlinear functions. In fact, depending on how they are constructed, ANNs can approximate any nonlinear function, making them a quite powerful class of models (note that this property is not reserved for ANNs; kernel methods are also considered “universal approximators”; however, it turns out that neural networks with multiple layers are more efficient at approximating arbitrary functions than other methods. I refer the interested reader to more in-depth discussion on the topic.).

In recent years ANNs that use multiple stages of nonlinear computation (aka “deep learning”) have been able obtain outstanding performance on an array of complex tasks ranging from visual object recognition to natural language processing. I find ANNs super interesting due to their computational power and their intersection with computational neuroscience. However, I’ve found that most of the available tutorials on ANNs are either dense with formal details and contain little information about implementation or any examples, while others skip a lot of the mathematical detail and provide implementations that seem to come from thin air. This post aims at giving a more complete overview of ANNs, including (varying degrees of) the math behind ANNs, how ANNs are implemented in code, and finally some toy examples that point out the strengths and weaknesses of ANNs.

## Single-layer Neural Networks

The simplest ANN (Figure 1) takes a set of observed inputs , multiplies each of them by their own associated weight , and sums the weighted values to form a pre-activation .Oftentimes there is also a bias that is tied to an input that is always +1 included in the preactivation calculation. The network then transforms the pre-activation using a nonlinear activation function to output a final activation .

There are many options available for the form of the activation function , and the choice generally depends on the task we would like the network to perform. For instance, if the activation function is the identity function:

,

which outputs continuous values , then the network implements a linear model akin to used in standard linear regression. Another choice for the activation function is the logistic sigmoid:

,

which outputs values . When the network outputs use the logistic sigmoid activation function, the network implements linear binary classification. Binary classification can also be implemented using the hyperbolic tangent function, , which outputs values (note that the classes must also be coded as either -1 or 1 when using . Single-layered neural networks used for classification are often referred to as “perceptrons,” a name given to them when they were first developed in the late 1950s.

To get a better idea of what these activation function do, their outputs for a given range of input values are plotted in the left of Figure 2. We see that the logistic and tanh activation functions (blue and green) have the quintessential sigmoidal “s” shape that saturates for inputs of large magnitude. This behavior makes them useful for categorization. The identity / linear activation (red), however forms a linear mapping between the input to the activation function, which makes it useful for predicting continuous values.

A key property of these activation functions is that they are all smooth and differentiable. We’ll see later in this post why differentiability is important for training neural networks. The derivatives for each of these common activation functions are given by (for mathematical details on calculating these derivatives, see this post):

Each of the derivatives are plotted in the right of Figure 2. What is interesting about these derivatives is that they are either a constant (i.e. 1), or are can be defined in terms of the original function. This makes them extremely convenient for efficiently training neural networks, as we can implement the gradient using simple manipulations of the feed-forward states of the network.

**Code Block 1: Defines standard activation functions and generates Figure 2:**

% DEFINE A FEW COMMON ACTIVATION FUNCTIONS gLinear = inline('z','z'); gSigmoid = inline('1./(1+exp(-z))','z'); gTanh = inline('tanh(z)','z'); % ...DEFINE THEIR DERIVATIVES gPrimeLinear = inline('ones(size(z))','z'); gPrimeSigmoid = inline('1./(1+exp(-z)).*(1-1./(1+exp(-z)))','z'); gPrimeTanh = inline('1-tanh(z).^2','z'); % VISUALIZE EACH g(z) z = linspace(-4,4,100); figure set(gcf,'Position',[100,100,960,420]) subplot(121); hold on; h(1) = plot(z,gLinear(z),'r','Linewidth',2); h(2) = plot(z,gSigmoid(z),'b','Linewidth',2); h(3) = plot(z,gTanh(z),'g','Linewidth',2); set(gca,'fontsize',16) xlabel('z') legend(h,{'g_{linear}(z)','g_{logistic}(z)','g_{tanh}(z)'},'Location','Southeast') title('Some Common Activation Functions') hold off, axis square, grid ylim([-1.1 1.1]) % VISUALIZE EACH g'(z) subplot(122); hold on h(1) = plot(z,gPrimeLinear(z),'r','Linewidth',2); h(2) = plot(z,gPrimeSigmoid(z),'b','Linewidth',2); h(3) = plot(z,gPrimeTanh(z),'g','Linewidth',2); set(gca,'fontsize',16) xlabel('z') legend(h,{'g''_{linear}(z)','g''_{logistic}(z)','g''_{tanh}(z)'},'Location','South') title('Activation Function Derivatives') hold off, axis square, grid ylim([-.5 1.1])

## Multi-layer Neural Networks

As was mentioned above, single-layered networks implement linear models, which doesn’t really help us if we want to model nonlinear phenomena. However, by considering the single layer network diagrammed in Figure 1 to be a basic building block, we can construct more complicated networks, ones that perform powerful, nonlinear computations. Figure 3 demonstrates this concept. Instead of a single layer of weights between inputs and output, we introduce a set of single-layer networks between the two. This set of intermediate networks is often referred to as a “hidden” layer, as it doesn’t directly observe input or directly compute the output. By using a hidden layer, we form a mult-layered ANN. Though there are many different conventions for declaring the actual number of layers in a multi-layer network, for this discussion we will use the convention of the number of *distinct sets of trainable weights* as the number of layers. For example, the network in Figure 3 would be considered a 2-layer ANN because it has two layers of weights: those connecting the inputs to the hidden layer (), and those connecting the output of the hidden layer to the output layer().

Multi-layer neural networks form compositional functions that map the inputs nonlinearly to outputs. If we associate index i with the input layer, index j with the hidden layer, and index k with the output layer, then an output unit in the network diagrammed in Figure 3 computes an output value given and input via the following compositional function:

.

Here is the is the pre-activation values for units for layer , is the activation function for units in that layer (assuming they are the same), and is the output activation for units in that layer. The weight links the outputs of units feeding into layer to the activation function of units for that layer. The term is the bias for units in layer .

As with the single-layered ANN, the choice of activation function for the output layer will depend on the task that we would like the network to perform (i.e. categorization or regression), and follows similar rules outlined above. However, it is generally desirable for the hidden units to have* nonlinear* activation functions (e.g. logistic sigmoid or tanh). This is because multiple layers of linear computations can be equally formulated as a single layer of linear computations. Thus using linear activations for the hidden layers doesn’t buy us much. However, as we’ll see shortly, using linear activations for the output unit activation function (in conjunction with nonlinear activations for the hidden units) allows the network to perform nonlinear regression.

## Training neural networks & gradient descent

Training neural networks involves determining the network parameters that minimize the errors that the network makes. This first requires that we have a way of quantifying error. A standard way of quantifying error is to take the squared difference between the network output and the target value:

(Note that the squared error is not chosen arbitrarily, but has a number of theoretical benefits and considerations. For more detail, see the following post) With an error function in hand, we then aim to find the setting of parameters that minimizes this error function. This concept can be interpreted spatially by imagining a “parameter space” whose dimensions are the values of each of the model parameters, and for which the error function will form a surface of varying height depending on its value for each parameter. Model training is thus equivalent to finding point in parameter space that makes the height of the error surface small.

To get a better intuition behind this concept, let’s define a super simple neural network, one that has a single input and a single output (Figure 4, bottom left). For further simplicity, we’ll assume the network has no bias term and thus has a single parameter, . We will also assume that the output layer uses the logistic sigmoid activation function. Accordingly, the network will map some input value onto a predicted output via the following function.

Now let’s say we want this simple network to learn the identity function: given an input of 1 it should return a target value of 1. Given this target value we can now calculate the value of the error function for each setting of . Varying the value of from -10 to 10 results in the error surface displayed in the left of Figure 4. We see that the error is small for large positive values of , while the error is large for strongly negative values of . This not surprising, given that the output activation function is the logistic sigmoid, which will map large values onto an output of 1.

Things become more interesting when we move from a single-layered network to a multi-layered network. Let’s repeat the above exercise, but include a single hidden node between the input and the output (Figure 4, bottom right). Again, we will assume no biases, and logistic sigmoid activations for both the hidden and output nodes. Thus the network will have two parameters: . Accordingly the 2-layered network will predict an output with the following function:

Now, if we vary both and , we obtain the error surface in the right of Figure 4.

We see that the error function is minimized when both and are large and positive. We also see that the error surface is more complex than for the single-layered model, exhibiting a number of wide plateau regions. It turns out that the error surface gets more and more complicated as you increase the number of layers in the network and the number of units in each hidden layer. Thus, it is important to consider these phenomena when constructing neural network models.

**Code Block 2: generates Figure 4 (assumes you have run Code Block 1):**

% VISUALIZE ERROR SURFACE OF SIMPLE ANNS E = {}; [w1,w2] = meshgrid(linspace(-10,10,50)); g = gSigmoid; target = 1; net1Output = g(w1.*target); net2Output = g(w2.*g(w1.*target)); E{1} = (net1Output - target).^2; E{2} = (net2Output - target).^2; figure for ii = 1:2 set(gcf,'Position',[100,100,960,420]) subplot(1,2,ii) surf(w1,w2,E{ii}); shading faceted; colormap(flipud(hot)); caxis([0,max(max(E{ii}))]) set(gca,'fontsize',16) xlabel('w_{1}'), ylabel('w_{2}'), zlabel('E(w)') axis square; title(sprintf('Error Surface: %d-layer Network',ii)) [az, el] = view; view([az + 180, el]); set(gcf,'position',[100,100,1020,440]) drawnow end

The examples in Figure 4 gives us a qualitative idea of how to train the parameters of an ANN, but we would like a more automatic way of doing so. Generally this problem is solved using *gradient descent*: The gradient descent algorithm first calculates the derivative / gradient of the error function with respect to each of the model parameters. This gradient information will give us the direction in parameter space that decreases the height of the error surface. We then take a step in that direction and repeat, iteratively calculating the gradient and taking steps in parameter space.

## The backpropagation algorithm

It turns out that the gradient information for the ANN error surface can be calculated efficiently using a message passing algorithm known as the *backpropagation algorithm*. During backpropagation, input signals are forward-propagated through the network toward the outputs, and network errors are then calculated with respect to target variables and back-propagated backwards towards the inputs. The forward and backward signals are then used to determine the direction in the parameter space to move that lowers the network error.

The formal calculations behind the backpropagation algorithm can be somewhat mathematically involved and may detract from the general ideas behind the learning algorithm. For those readers who are interested in the math, I have provided the formal derivation of the backpropagation algorithm in the following post (for those of you who are not interested in the math, I would also encourage you go over the derivation and try to make connections to the source code implementations provided later in the post).

Figure 5 demonstrates the key steps of the backpropagation algorithm. The main concept underlying the algorithm is that for a given observation we want to determine the degree of “responsibility” that each network parameter has for mis-predicting a target value associated with the observation. We then change that parameter according to this responsibility so that it reduces the network error.

In order to determine the network error, we first propagate the observed input forward through the network layers. This is Step I of the backpropagation algorithm, and is demonstrated in Figure 5-I. Note that in the Figure could be considered network output (for a network with one hidden layer) or the output of a hidden layer that projects the remainder of the network (in the case of a network with more than one hidden layer). For this discussion, however, we assume that the index k is associated with the output layer of the network, and thus each of the network outputs is designated by . Also note that when implementing this forward-propagation step, we should keep track of the feed-forward pre-activations and activations for all layers , as these will be used for calculating backpropagated errors and error function gradients.

Step II of the algorithm is to calculate the network output error and backpropagate it toward the input. Let’s again that we are using the sum of squared differences error function:

,

where we sum over the values of all output units (one in this example). We can now define an “error signal” at the output node that will be backpropagated toward the input. The error signal is calculated as follows:

.

Thus the error signal essentially weights the gradient of the error function by the gradient of the output activation function (notice there is a term is used in this calculation, which is why we keep it around during the forward-propagation step). We can continue backpropagating the error signal toward the input by passing through the output layer weights , summing over all output nodes, and passing the result through the gradient of the activation function at the hidden layer (Figure 5-II). Performing these operations results in the back-propagated error signal for the hidden layer, :

,

For networks that have more than one hidden layer, this error backpropagation procedure can continue for layers , etc.

Step III of the backpropagation algorithm is to calculate the gradients of the error function with respect to the model parameters at each layer using the forward signals , and the backward error signals . If one considers the model weights at a layer as linking the forward signal to the error signal (Figure 5-III), then the gradient of the error function with respect to those weights is:

Note that this result is closely related to the concept of Hebbian learning in neuroscience. Thus the gradient of the error function with respect to the model weight at each layer can be efficiently calculated by simply keeping track of the forward-propagated activations feeding into that layer from below, and weighting those activations by the backward-propagated error signals feeding into that layer from above!

What about the bias parameters? It turns out that the same gradient rule used for the weight weights applies, except that “feed-forward activations” for biases are always +1 (see Figure 1). Thus the bias gradients for layer are simply:

The fourth and final step of the backpropagation algorithm is to update the model parameters based on the gradients calculated in Step III. Note that the gradients point in the direction in parameter space that will *increase *the value of the error function. Thus when updating the model parameters we should choose to go in the opposite direction. How far do we travel in that direction? That is generally determined by a user-defined step size (aka learning rate) parameter, . Thu,s given the parameter gradients and the step size, the weights and biases for a given layer are updated accordingly:

.

To train an ANN, the four steps outlined above and in Figure 5 are repeated iteratively by observing many input-target pairs and updating the parameters until either the network error reaches a tolerably low value, the parameters cease to update (convergence), or a set number of parameter updates has been achieved. Some readers may find the steps of the backpropagation somewhat ad hoc. However, keep in mind that these steps are formally coupled to the calculus of the optimization problem. Thus I again refer the curious reader to check out the derivation in order to make connections between the algorithm and the math.

## Example: learning the OR & AND logical operators using a single layer neural network

Here we go over an example of training a single-layered neural network to perform a classification problem. The network is trained to learn a set of logical operators including the AND, OR, or XOR. To train the network we first generate training data. The inputs consist of 2-dimensional coordinates that span the input values values for a 2-bit truth table:

We then perturb these observations by adding Normally-distributed noise. To generate target variables, we categorize each observations by applying one of logic operators (See Figure 6) to the original (no-noisy) coordinates. We then train the network with the noisy inputs and binary categories targets using the gradient descent / backpropagation algorithm. The code implementation of the network and training procedures, as well as the resulting learning process are displayed below. (Note that in this implementation, I do not use the feed-forward activations to calculate the gradients as suggested above. This is simply to make the implementation of the learning algorithm more explicit in terms of the math. The same situation also applies to the other examples in this post).

**Code Block 3: Implements and trains a single-layer neural network for classification to learn logical operators (assumes you have run Code Block 1):**

%% EXAMPLE: SINGLE-LAYERED NETWORK % DEFINE DATA AND TARGETS data = [0 0; 0 1; 1 0; 1 1;]; classAND = and(data(:,1)>0,data(:,2)>0); classOR = or(data(:,1)>0,data(:,2)>0); classXOR = xor(data(:,1)>0,data(:,2)>0); % THE TYPE OF TRUTH TABLE TO LEARN (UNCOMMENT FOR OTHERS) classes = classOR % classes = classAND; % classes = classXOR; % MAKE MULTIPLE NOISY TRAINING OBSERVATIONS nRepats = 30; data = repmat(data, [nRepats, 1]); classes = repmat(classes, [nRepats, 1]); data = data + .15*randn(size(data)); % SHUFFLE DATA shuffleIdx = randperm(size(data,1)); data = data(shuffleIdx,:); classes = classes(shuffleIdx); % INITIALIZE MODEL PARAMETERS [nObs,nInput] = size(data); % # OF INPUT DIMENSIONS nOutput = 1; % # OF TARGET/OUTPUT DIMENSIONS lRate = 3; % LEARNING RATE FOR PARAMETERS UPDATE nIters = 80; % # OF ITERATIONS % DECLARE ACTIVATION FUNCTIONS (AND DERIVATIVES) g_out = gSigmoid; gPrime_out = gPrimeSigmoid; % INITIALIZE RANDOM WEIGHTS W_out = (rand(nInput,nOutput)-.5); b_out = (rand(1,nOutput)-.5); % SOME OTHER INITIALIZATIONS % (FOR VISUALIZATION) visRange = [-.2 1.2]; [xx,yy] = meshgrid(linspace(visRange(1), visRange(2),100)); iter = 1; mse = zeros(1,nIters); figure set(gcf,'Position',[100,100,960,420]) while 1 err = zeros(1,nObs); % LOOP THROUGH THE EXAMPLES for iO = 1:nObs % GET CURRENT NETWORK INPUT DATA AND TARGET input = data(iO,:); target = classes(iO); %% I. FORWARD PROPAGATE DATA THROUGH NETWORK z_out = input*W_out + b_out; % OUTPUT UNIT PRE-ACTIVATIONS a_out = g_out(z_out); % OUTPUT UNIT ACTIVATIONS %% II. BACKPROPAGATE ERROR SIGNAL % CALCULATE ERROR DERIVATIVE W.R.T. OUTPUT delta_out = gPrime_out(z_out).*(a_out - target); %% III. CALCULATE GRADIENT W.R.T. PARAMETERS... dEdW_out = delta_out*input; dEdb_out = delta_out*1; %% IV. UPDATE NETWORK PARAMETERS W_out = W_out - lRate*dEdW_out'; b_out = b_out - lRate*dEdb_out'; % CALCULATE ERROR FUNCTION err(iO) = .5*(a_out-target).^2; end mse(iter) = mean(err); % DISPLAY LEARNING clf; subplot(121); hold on; set(gca,'fontsize',16) netOut = g_out(bsxfun(@plus,[xx(:),yy(:)]*W_out, b_out)); contourf(xx,yy,reshape(netOut,100,100)); colormap(flipud(spring)) hold on; gscatter(data(:,1),data(:,2),classes,[0 0 0 ; 1 1 1],[],20,'off'); title(sprintf('Iteration %d',iter)) xlim([visRange(1) visRange(2)]),ylim([visRange(1) visRange(2)]); axis square subplot(122); set(gca,'fontsize',16) plot(1:iter,mse(1:iter)); xlabel('Iteration') ylabel('Mean Squared Error') axis square m1(iter) = getframe(gcf); if iter >= nIters break end iter = iter + 1; end

Figure 7 displays the procedure for learning the OR mapping. The left plot displays the training data and the network output at each iteration. White dots are training points categorized “1” while black dots are categorized “0”. Yellow regions are where the network predicts values of “0”, while magenta highlights areas where the network predicts “1”. We see that the single-layer network is able to easily separate the two classes. The right plot shows how the error function decreases with each training iteration. The smooth trajectory of the error indicates that the error surface is also fairly smooth.

Figure 8 demonstrates an analogous example, but instead learning the AND operator (by executing Code Block 3, after un-commenting line 11). Again, the categories can be easily separated by a plane, and thus the single-layered network easily learns an accurate predictor of the data.

## Going Deeper: nonlinear classification and multi-layer neural networks

Figures 7 and 8 demonstrate how a single-layered ANN can easily learn the OR and AND operators. This is because the categorization criterion for these logical operators can be represented in the input space by a single linear function (i.e. line/plane). What about more complex categorization criterion that cannot be represented by a single plane? An example of a more complex binary classification criterion is the XOR operator (Figure 6, far right column).

Below we attempt to train the single-layer network to learn the XOR operator (by executing Code Block 3, after un-commenting line 12). The single layer network is unable to learn this nonlinear mapping between the inputs and the targets. However, it turns out we can learn the XOR operator using a multi-layered neural network.

Below we train a two-layer neural network on the XOR dataset. The network incorporates a hidden layer with 3 hidden units and logistic sigmoid activation functions for all units in the hidden and output layers (see Code Block 4, lines 32-33).

**Code Block 4: Implements and trains a two-layer neural network for classification to learn XOR operator and more difficult “ring” problem (Figures 10 & 11; assumes you have run Code Block 1):**

%% EXAMPLE: MULTI-LAYER NEURAL NETWORK FOR CLASSIFICATION data = [0 0; 0 1; 1 0; 1 1;]; classXOR = xor(data(:,1)>0,data(:,2)>0); % THE TYPE OF TRUTH TABLE TO LEARN classes = classXOR; % UNCOMMENT FOR MOR DIFFICULT DATA... % data = [data; .5 .5; 1 .5; 0 .5; .5 0; .5 1]; % classRing = [1; 1; 1; 1; 0; 1; 1; 1; 1]; % classes = classRing; % CREATE MANY NOISY OBSERVATIONS nRepats = 30; data = repmat(data, [nRepats, 1]); classes = repmat(classes, [nRepats, 1]); data = data + .15*randn(size(data)); % SHUFFLE OBSERVATIONS shuffleIdx = randperm(size(data,1)); data = data(shuffleIdx,:); classes = classes(shuffleIdx); % INITIALIZE MODEL PARAMETERS [nObs,nInput] = size(data); % # OF INPUT DIMENSIONS nHidden = 3; % # OF HIDDEN UNITS lRate = 2; % LEARNING RATE FOR PARAMETERS UPDATE nIters = 300; % # OF ITERATIONS % DECLARE ACTIVATION FUNCTIONS (AND DERIVATIVES) g_hid = gSigmoid; gPrime_hid = gPrimeSigmoid; g_out = gSigmoid; gPrime_out = gPrimeSigmoid; % INITIALIZE WEIGHTS W_hid = (rand(nInput,nHidden)-.5); b_hid = (rand(1,nHidden)-.5); W_out = (rand(nHidden,nOutput)-.5); b_out = (rand(1,nOutput)-.5); iter = 1; mse = zeros(1,nIters); figure set(gcf,'Position',[100,100,960,420]) % MAIN TRAINING ALGORITHM while 1 err = zeros(1,nObs); % LOOP THROUGH THE EXAMPLES for iO = 1:nObs % GET CURRENT NETWORK INPUT DATA AND TARGET input = data(iO,:); target = classes(iO); %% I. FORWARD PROPAGATE DATA THROUGH NETWORK z_hid = input*W_hid + b_hid; % HIDDEN UNIT PRE-ACTIVATIONS a_hid = g_hid(z_hid); % HIDDEN UNIT ACTIVATIONS z_out = a_hid*W_out + b_out; % OUTPUT UNIT PRE-ACTIVATIONS a_out = g_out(z_out); % OUTPUT UNIT ACTIVATIONS %% II. BACKPROPAGATE ERROR SIGNAL % CALCULATE ERROR DERIVATIVE W.R.T. OUTPUT delta_out = gPrime_out(z_out).*(a_out - target); % CALCULATE ERROR CONTRIBUTIONS FOR HIDDEN NODES... delta_hid = gPrime_hid(z_hid)'.*(delta_out*W_out); %% III. CALCULATE GRADIENT W.R.T. PARAMETERS... dEdW_out = delta_out*a_hid; dEdb_out = delta_out*1; dEdW_hid = delta_hid*input; dEdb_hid = delta_hid*1; %% IV. UPDATE NETWORK PARAMETERS W_out = W_out - lRate*dEdW_out'; b_out = b_out - lRate*dEdb_out'; W_hid = W_hid - lRate*dEdW_hid'; b_hid = b_hid - lRate*dEdb_hid'; % CALCULATE ERROR FUNCTION err(iO) = .5*(a_out-target).^2; end mse(iter) = mean(err); % DISPLAY LEARNING clf; subplot(121); hold on; set(gca,'fontsize',16) netOut = g_out(bsxfun(@plus,g_hid(bsxfun(@plus,[xx(:),yy(:)]*W_hid, b_hid))*W_out, b_out)); contourf(xx,yy,reshape(netOut,100,100)); colormap(flipud(spring)) hold on; gscatter(data(:,1),data(:,2),classes,[0 0 0; 1 1 1],[],20,'off'); title(sprintf('Iteration %d',iter)) xlim([visRange(1), visRange(2)]),ylim([visRange(1), visRange(2)]); axis square subplot(122); set(gca,'fontsize',16) plot(1:iter,mse(1:iter)); xlabel('Iteration') ylabel('Mean Squared Error') axis square m2(iter) = getframe(gcf); if iter >= nIters break end iter = iter + 1; end

Figure 10 displays the learning process for the 2-layer network. The formatting for Figure 10 is analogous to that for Figures 7-9. The 2-layer network is easily able to learn the XOR operator. We see that by adding a hidden layer between the input and output, the ANN is able to learn the nonlinear categorization criterion!

Figure 11 shows the results for learning a even more difficult nonlinear categorization function: points in and around are categorized as “0”, while points in a ring surrounding the “0” datapoints are categorized as a “1” (Figure 11). This example is run by executing Code Block 4 after un-commenting lines 9-11.

Figure 11 shows the learning process. Again formatting is analogous to the formatting in Figures 8-10. The 2-layer ANN is able to learn this difficult classification criterion.

## Example: Neural Networks for Regression

The previous examples demonstrated how ANNs can be used for classification by using a logistic sigmoid as the output activation function. Here we demonstrate how, by making the output activation function the linear/identity function, the same 2-layer network architecture can be used to implement nonlinear regression.

For this example we define a dataset comprised of 1D inputs, that range from . We then generate noisy targets according to the function:

where is a nonlinear data-generating function and is Normally-distributed noise. We then construct a two-layered network with tanh activation functions used in the hidden layer and linear outputs. For this example we set the number of hidden units to 3 and train the model as we did for categorization using gradient descent / backpropagation. The results of the example are displayed below.

**Code Block 5: Trains two-layer network for regression problems (Figures 11 & 12; assumes you have run Code Block 1):**

%% EXAMPLE: NONLINEAR REGRESSION % DEFINE DATA-GENERATING FUNCTIONS f(x) xMin = -5; xMax = 5; xx = linspace(xMin, xMax, 100); f = inline('2.5 + sin(x)','x'); % f = inline('abs(x)','x'); % UNCOMMENT FOR FIGURE 13 yy = f(xx) + randn(size(xx))*.5; % FOR SHUFFLING OBSERVATIONS shuffleIdx = randperm(length(xx)); data = xx; targets = yy; % INITIALIZE MODEL PARAMETERS nObs = length(data); % # OF INPUT DIMENSIONS nInput = 1; % # OF INPUTS nHidden = 3; % # OF HIDDEN UNITS nOutput = 1; % # OF TARGET/OUTPUT DIMENSIONS lRate = .15; % LEARNING RATE FOR PARAMETERS UPDATE nIters = 200; % # OF ITERATIONS cols = lines(nHidden); % DECLARE ACTIVATION FUNCTIONS (AND DERIVATIVES) g_hid = gTanh; % HIDDEN UNIT ACTIVATION gPrime_hid = gPrimeTanh; % GRAD OF HIDDEN UNIT ACTIVATION g_out = gLinear; % OUTPUT ACTIVATION gPrime_out = gPrimeLinear; % GRAD. OF OUTPUT ATIVATION % % INITIALIZE WEIGHTS W_hid = (rand(nInput,nHidden)-.5); b_hid = (rand(1,nHidden)-.5); W_out = (rand(nHidden,nOutput)-.5); b_out = (rand(1,nOutput)-.5); % INITIALIZE SOME THINGS.. % (FOR VISUALIZATION) mse = zeros(1,nIters); visRange = [xMin, xMax]; figure set(gcf,'Position',[100,100,960,420]) iter = 1; while 1 err = zeros(1,nObs); % LOOP THROUGH THE EXAMPLES for iO = 1:nObs % GET CURRENT NETWORK INPUT DATA AND TARGET input = data(iO); target = targets(iO); %% I. FORWARD PROPAGATE DATA THROUGH NETWORK z_hid = input*W_hid + b_hid; % HIDDEN UNIT PRE-ACTIVATIONS a_hid = g_hid(z_hid); % HIDDEN UNIT ACTIVATIONS z_out = a_hid*W_out + b_out; % OUTPUT UNIT PRE-ACTIVATIONS a_out = g_out(z_out); % OUTPUT UNIT ACTIVATIONS %% II. BACKPROPAGATE ERROR SIGNAL % CALCULATE ERROR DERIVATIVE W.R.T. OUTPUT delta_out = gPrime_out(z_out).*(a_out - target); %% CALCULATE ERROR CONTRIBUTIONS FOR HIDDEN NODES... delta_hid = gPrime_hid(z_hid)'.*(delta_out*W_out); %% III. CALCULATE GRADIENT W.R.T. PARAMETERS... dEdW_out = delta_out*a_hid; dEdb_out = delta_out*1; dEdW_hid = delta_hid*input; dEdb_hid = delta_hid*1; %% IV. UPDATE NETWORK PARAMETERS W_out = W_out - lRate*dEdW_out'; b_out = b_out - lRate*dEdb_out'; W_hid = W_hid - lRate*dEdW_hid'; b_hid = b_hid - lRate*dEdb_hid'; % CALCULATE ERROR FUNCTION FOR BATCH err(iO) = .5*(a_out-target).^2; end mse(iter) = mean(err); % UPDATE ERROR % DISPLAY LEARNING clf; subplot(121); hold on; set(gca,'fontsize',14) plot(xx,f(xx),'m','linewidth',2); hold on; scatter(xx, yy ,'m'); % PLOT TOTAL NETWORK OUTPUT netOut = g_out(g_hid(bsxfun(@plus, xx'*W_hid, b_hid))*W_out + b_out); plot(xx, netOut, 'k','linewidth', 2) % PLOT EACH HIDDEN UNIT'S OUTPUT FUNCTION for iU = 1:nHidden plot(xx,g_hid(xx*W_hid(iU) + b_hid(iU)),'color',cols(iU,:),'Linewidth',2, ... 'Linestyle','--'); end % TITLE AND LEGEND title(sprintf('Iteration %d',iter)) xlim([visRange(1) visRange(2)]),ylim([visRange(1) visRange(2)]); axis square legend('f(x)', 'Targets', 'Network Output','Hidden Unit Outputs','Location','Southwest') % PLOT ERROR subplot(122); set(gca,'fontsize',14) plot(1:iter,mse(1:iter)); xlabel('Iteration') ylabel('Mean Squared Error') axis square; drawnow % ANNEAL LEARNING RATE lRate = lRate *.99; if iter >= nIters break end iter = iter + 1; end

The training procedure for is visualized in the left plot of Figure 12. The data-generating function is plotted as the solid magenta line, and the noisy target values used to train the network are plotted as magenta circles. The output of the network at each training iteration is plotted in solid black while the output of each of the tanh hidden units is plotted in dashed lines. This visualization demonstrates how multiple nonlinear functions can be combined to form the complex output target function. The mean squared error at each iteration is plotted in the right plot of Figure 12. We see that the error does not follow a simple trajectory during learning, but rather undulates, demonstrating the non-convexity of the error surface.

Figure 13 visualizes the training procedure for trying to learn a different nonlinear function, namely (by running Code Block 5, after un-commenting out line 7). Again, we see how the outputs of the hidden units are combined to fit the desired data-generating function. The mean squared error again follows an erratic path during learning.

Notice for this example that I added an extra implementation detail known as simulated annealing (line 118) that was absent in the classification examples. This technique decreases the learning rate after every iteration thus making the algorithm take smaller and smaller steps in parameter space. This technique can be useful when the gradient updates begin oscillating between two or more locations in the parameter space. It is also helpful for influencing the algorithm to settle down into a steady state.

## Wrapping up

In this post we covered the main ideas behind artificial neural networks including: single- and multi-layer ANNs, activation functions and their derivatives, a high-level description of the backpropagation algorithm, and a number of classification and regression examples. ANNs, particularly mult-layer ANNs, are a robust and powerful class of models that can be used to learn complex, nonlinear functions. However, there are a number of considerations when using neural networks including:

- How many hidden layers should one use?
- How many hidden units in each layer?
- How do these relate to overfitting and generalization?
- Are there better error functions than the squared difference?
- What should the learning rate be?
- What can we do about the complexity of error surface with deep networks?
- Should we use simulated annealing?
- What about other activation functions?

It turns out that there are no easy or definite answers to any of these questions, and there is active research focusing on each topic. This is why using ANNs is often considered as much as a “black art” as it is a quantitative technique.

One primary limitation of ANNs is that they are supervised algorithms, requiring a target value for each input observation in order to train the network. This can be prohibitive for training large networks that may require lots of training data to adequately adjust the parameters. However, there are a set of unsupervised variants of ANNs that can be used to learn an initial condition for the ANN (rather than from randomly-generated initial weights) without the need of target values. This technique of “unsupervised pretraining” has been an important component of many “deep learning” models used in AI and machine learning. In future posts, I look forward to covering two of these unsupervised neural networks: autoencoders and restricted Boltzmann machines.

## Derivation: Error Backpropagation & Gradient Descent for Neural Networks

## Introduction

Artificial neural networks (ANNs) are a powerful class of models used for nonlinear regression and classification tasks that are motivated by biological neural computation. The general idea behind ANNs is pretty straightforward: map some input onto a desired target value using a distributed cascade of nonlinear transformations (see Figure 1). However, for many, myself included, the learning algorithm used to train ANNs can be difficult to get your head around at first. In this post I give a step-by-step walk-through of the derivation of gradient descent learning algorithm commonly used to train ANNs (aka the *backpropagation algorithm*) and try to provide some high-level insights into the computations being performed during learning.

### Some Background and Notation

An ANN consists of an input layer, an output layer, and any number (including zero) of hidden layers situated between the input and output layers. Figure 1 diagrams an ANN with a single hidden layer. The feed-forward computations performed by the ANN are as follows: The signals from the input layer are multiplied by a set of fully-connected weights connecting the input layer to the hidden layer. These weighted signals are then summed and combined with a bias (not displayed in the graphical model in Figure 1). This calculation forms the pre-activation signal for the hidden layer. The pre-activation signal is then transformed by the hidden layer activation function to form the feed-forward activation signals leaving leaving the hidden layer . In a similar fashion, the hidden layer activation signals are multiplied by the weights connecting the hidden layer to the output layer , a bias is added, and the resulting signal is transformed by the output activation function to form the network output . The output is then compared to a desired target and the error between the two is calculated.

Training a neural network involves determining the set of parameters that minimize the errors that the network makes. Often the choice for the error function is the sum of the squared difference between the target values and the network output (for more detail on this choice of error function see):

Equation (1)

This problem can be solved using gradient descent, which requires determining for all in the model. Note that, in general, there are two sets of parameters: those parameters that are associated with the output layer (i.e. ), and thus directly affect the network output error; and the remaining parameters that are associated with the hidden layer(s), and thus affect the output error indirectly.

Before we begin, let’s define the notation that will be used in remainder of the derivation. Please refer to Figure 1 for any clarification.

- : input to node for layer
- : activation function for node in layer (applied to )
- : ouput/activation of node in layer
- : weights connecting node in layer to node in layer
- : bias for unit in layer
- : target value for node in the output layer

## Gradients for Output Layer Weights

### Output layer connection weights,

Since the output layer parameters directly affect the value of the error function, determining the gradients for those parameters is fairly straight-forward:

Equation (2)

Here, we’ve used the Chain Rule. (Also notice that the summation disappears in the derivative. This is because when we take the partial derivative with respect to the -th dimension/node, the only term that survives in the error gradient is -th, and thus we can ignore the remaining terms in the summation). The derivative with respect to is zero because it does not depend on . Also, we note that . Thus

Equation (3)

where, again we use the Chain Rule. Now, recall that and thus , giving:

Equation (4)

The gradient of the error function with respect to the output layer weights is a product of three terms. The first term is the difference between the network output and the target value . The second term is the derivative of output layer activation function. And the third term is the activation output of node j in the hidden layer.

If we define to be all the terms that involve index k:

we obtain the following expression for the derivative of the error with respect to the output weights :

Equation (5)

Here the terms can be interpreted as the network output error after being back-propagated through the output activation function, thus creating an error “signal”. Loosely speaking, Equation (5) can be interpreted as determining how much each contributes to the error signal by weighting the error signal by the magnitude of the output activation from the previous (hidden) layer associated with each weight (see Figure 1). The gradients with respect to each parameter are thus considered to be the “contribution” of the parameter to the error signal and should be negated during learning. Thus the output weights are updated as , where is some step size (“learning rate”) along the negative gradient.

As we’ll see shortly, the process of backpropagating the error signal can iterate all the way back to the input layer by successively projecting back through , then through the activation function for the hidden layer via to give the error signal , and so on. This backpropagation concept is central to training neural networks with more than one layer.

### Output layer biases,

As far as the gradient with respect to the output layer biases, we follow the same routine as above for . However, the third term in Equation (3) is , giving the following gradient for the output biases:

Equation (6)

Thus the gradient for the biases is simply the back-propagated error from the output units. One interpretation of this is that the biases are weights on activations that are always equal to one, regardless of the feed-forward signal. Thus the bias gradients aren’t affected by the feed-forward signal, only by the error.

## Gradients for Hidden Layer Weights

Due to the indirect affect of the hidden layer on the output error, calculating the gradients for the hidden layer weights is somewhat more involved. However, the process starts just the same:

Notice here that the sum does not disappear because, due to the fact that the layers are fully connected, each of the hidden unit outputs affects the state of each output unit. Continuing on, noting that …

Equation (7)

Here, again we use the Chain Rule. Ok, now here’s where things get “slightly more involved”. Notice that the partial derivative in the third term in Equation (7) is with respect to , but the target is a function of index . How the heck do we deal with that!? Well, if we expand , we find that it is composed of other sub functions (also see Figure 1):

Equation (8)

From the last term in Equation (8) we see that is *indirectly* dependent on . Equation (8) also suggests that we can use the Chain Rule to calculate . This is probably the trickiest part of the derivation, and goes like…

Equation (9)

Now, plugging Equation (9) into in Equation (7) gives the following for :

Equation (10)

Notice that the gradient for the hidden layer weights has a similar form to that of the gradient for the output layer weights. Namely the gradient is some term weighted by the output activations from the layer below (). For the output weight gradients, the term that was weighted by was the back-propagated error signal (i.e. Equation (5)). Here, the weighted term includes , but the error signal is further projected onto and then weighted by the derivative of hidden layer activation function . Thus, the gradient for the hidden layer weights is simply the output error signal backpropagated to the hidden layer, then weighted by the input to the hidden layer. To make this idea more explicit, we can define the resulting error signal backpropagated to layer as , and includes all terms in Equation (10) that involve index . This definition results in the following gradient for the hidden unit weights:

Equation (11)

This suggests that in order to calculate the weight gradients at any layer in an arbitrarily-deep neural network, we simply need to calculate the backpropagated error signal that reaches that layer and weight it by the feed-forward signal feeding into that layer! Analogously, the gradient for the hidden layer weights can be interpreted as a proxy for the “contribution” of the weights to the output error signal, which can only be observed–from the point of view of the weights–by backpropagating the error signal to the hidden layer.

### Output layer biases,

Calculating the gradients for the hidden layer biases follows a very similar procedure to that for the hidden layer weights where, as in Equation (9), we use the Chain Rule to calculate . However, unlike Equation (9) the third term that results for the biases is slightly different:

Equation (12)

In a similar fashion to calculation of the bias gradients for the output layer, the gradients for the hidden layer biases are simply the backpropagated error signal reaching that layer. This suggests that we can also calculate the bias gradients at any layer in an arbitrarily-deep network by simply calculating the backpropagated error signal reaching that layer !

## Wrapping up

In this post we went over some of the formal details of the backpropagation learning algorithm. The math covered in this post allows us to train arbitrarily deep neural networks by re-applying the same basic computations. Those computations are:

- Calculated the feed-forward signals from the input to the output.
- Calculate output error based on the predictions and the target
- Backpropagate the error signals by weighting it by the weights in previous layers and the gradients of the associated activation functions
- Calculating the gradients for the parameters based on the backpropagated error signal and the feedforward signals from the inputs.
- Update the parameters using the calculated gradients

The only real constraints on model construction is ensuring that the error function and the activation functions are differentiable. For more details on implementing ANNs and seeing them at work, stay tuned for the next post.