# Category Archives: Theory

## Derivation: Error Backpropagation & Gradient Descent for Neural Networks

## Introduction

Artificial neural networks (ANNs) are a powerful class of models used for nonlinear regression and classification tasks that are motivated by biological neural computation. The general idea behind ANNs is pretty straightforward: map some input onto a desired target value using a distributed cascade of nonlinear transformations (see Figure 1). However, for many, myself included, the learning algorithm used to train ANNs can be difficult to get your head around at first. In this post I give a step-by-step walk-through of the derivation of gradient descent learning algorithm commonly used to train ANNs (aka the *backpropagation algorithm*) and try to provide some high-level insights into the computations being performed during learning.

### Some Background and Notation

An ANN consists of an input layer, an output layer, and any number (including zero) of hidden layers situated between the input and output layers. Figure 1 diagrams an ANN with a single hidden layer. The feed-forward computations performed by the ANN are as follows: The signals from the input layer are multiplied by a set of fully-connected weights connecting the input layer to the hidden layer. These weighted signals are then summed and combined with a bias (not displayed in the graphical model in Figure 1). This calculation forms the pre-activation signal for the hidden layer. The pre-activation signal is then transformed by the hidden layer activation function to form the feed-forward activation signals leaving leaving the hidden layer . In a similar fashion, the hidden layer activation signals are multiplied by the weights connecting the hidden layer to the output layer , a bias is added, and the resulting signal is transformed by the output activation function to form the network output . The output is then compared to a desired target and the error between the two is calculated.

Training a neural network involves determining the set of parameters that minimize the errors that the network makes. Often the choice for the error function is the sum of the squared difference between the target values and the network output (for more detail on this choice of error function see):

Equation (1)

This problem can be solved using gradient descent, which requires determining for all in the model. Note that, in general, there are two sets of parameters: those parameters that are associated with the output layer (i.e. ), and thus directly affect the network output error; and the remaining parameters that are associated with the hidden layer(s), and thus affect the output error indirectly.

Before we begin, let’s define the notation that will be used in remainder of the derivation. Please refer to Figure 1 for any clarification.

- : input to node for layer
- : activation function for node in layer (applied to )
- : ouput/activation of node in layer
- : weights connecting node in layer to node in layer
- : bias for unit in layer
- : target value for node in the output layer

## Gradients for Output Layer Weights

### Output layer connection weights,

Since the output layer parameters directly affect the value of the error function, determining the gradients for those parameters is fairly straight-forward:

Equation (2)

Here, we’ve used the Chain Rule. (Also notice that the summation disappears in the derivative. This is because when we take the partial derivative with respect to the -th dimension/node, the only term that survives in the error gradient is -th, and thus we can ignore the remaining terms in the summation). The derivative with respect to is zero because it does not depend on . Also, we note that . Thus

Equation (3)

where, again we use the Chain Rule. Now, recall that and thus , giving:

Equation (4)

The gradient of the error function with respect to the output layer weights is a product of three terms. The first term is the difference between the network output and the target value . The second term is the derivative of output layer activation function. And the third term is the activation output of node j in the hidden layer.

If we define to be all the terms that involve index k:

we obtain the following expression for the derivative of the error with respect to the output weights :

Equation (5)

Here the terms can be interpreted as the network output error after being back-propagated through the output activation function, thus creating an error “signal”. Loosely speaking, Equation (5) can be interpreted as determining how much each contributes to the error signal by weighting the error signal by the magnitude of the output activation from the previous (hidden) layer associated with each weight (see Figure 1). The gradients with respect to each parameter are thus considered to be the “contribution” of the parameter to the error signal and should be negated during learning. Thus the output weights are updated as , where is some step size (“learning rate”) along the negative gradient.

As we’ll see shortly, the process of backpropagating the error signal can iterate all the way back to the input layer by successively projecting back through , then through the activation function for the hidden layer via to give the error signal , and so on. This backpropagation concept is central to training neural networks with more than one layer.

### Output layer biases,

As far as the gradient with respect to the output layer biases, we follow the same routine as above for . However, the third term in Equation (3) is , giving the following gradient for the output biases:

Equation (6)

Thus the gradient for the biases is simply the back-propagated error from the output units. One interpretation of this is that the biases are weights on activations that are always equal to one, regardless of the feed-forward signal. Thus the bias gradients aren’t affected by the feed-forward signal, only by the error.

## Gradients for Hidden Layer Weights

Due to the indirect affect of the hidden layer on the output error, calculating the gradients for the hidden layer weights is somewhat more involved. However, the process starts just the same:

Notice here that the sum does not disappear because, due to the fact that the layers are fully connected, each of the hidden unit outputs affects the state of each output unit. Continuing on, noting that …

Equation (7)

Here, again we use the Chain Rule. Ok, now here’s where things get “slightly more involved”. Notice that the partial derivative in the third term in Equation (7) is with respect to , but the target is a function of index . How the heck do we deal with that!? Well, if we expand , we find that it is composed of other sub functions (also see Figure 1):

Equation (8)

From the last term in Equation (8) we see that is *indirectly* dependent on . Equation (8) also suggests that we can use the Chain Rule to calculate . This is probably the trickiest part of the derivation, and goes like…

Equation (9)

Now, plugging Equation (9) into in Equation (7) gives the following for :

Equation (10)

Notice that the gradient for the hidden layer weights has a similar form to that of the gradient for the output layer weights. Namely the gradient is some term weighted by the output activations from the layer below (). For the output weight gradients, the term that was weighted by was the back-propagated error signal (i.e. Equation (5)). Here, the weighted term includes , but the error signal is further projected onto and then weighted by the derivative of hidden layer activation function . Thus, the gradient for the hidden layer weights is simply the output error signal backpropagated to the hidden layer, then weighted by the input to the hidden layer. To make this idea more explicit, we can define the resulting error signal backpropagated to layer as , and includes all terms in Equation (10) that involve index . This definition results in the following gradient for the hidden unit weights:

Equation (11)

This suggests that in order to calculate the weight gradients at any layer in an arbitrarily-deep neural network, we simply need to calculate the backpropagated error signal that reaches that layer and weight it by the feed-forward signal feeding into that layer! Analogously, the gradient for the hidden layer weights can be interpreted as a proxy for the “contribution” of the weights to the output error signal, which can only be observed–from the point of view of the weights–by backpropagating the error signal to the hidden layer.

### Output layer biases,

Calculating the gradients for the hidden layer biases follows a very similar procedure to that for the hidden layer weights where, as in Equation (9), we use the Chain Rule to calculate . However, unlike Equation (9) the third term that results for the biases is slightly different:

Equation (12)

In a similar fashion to calculation of the bias gradients for the output layer, the gradients for the hidden layer biases are simply the backpropagated error signal reaching that layer. This suggests that we can also calculate the bias gradients at any layer in an arbitrarily-deep network by simply calculating the backpropagated error signal reaching that layer !

## Wrapping up

In this post we went over some of the formal details of the backpropagation learning algorithm. The math covered in this post allows us to train arbitrarily deep neural networks by re-applying the same basic computations. Those computations are:

- Calculated the feed-forward signals from the input to the output.
- Calculate output error based on the predictions and the target
- Backpropagate the error signals by weighting it by the weights in previous layers and the gradients of the associated activation functions
- Calculating the gradients for the parameters based on the backpropagated error signal and the feedforward signals from the inputs.
- Update the parameters using the calculated gradients

The only real constraints on model construction is ensuring that the error function and the activation functions are differentiable. For more details on implementing ANNs and seeing them at work, stay tuned for the next post.

## Model Selection: Underfitting, Overfitting, and the Bias-Variance Tradeoff

In machine learning and pattern recognition, there are many ways (an infinite number, really) of solving any one problem. Thus it is important to have an objective criterion for assessing the accuracy of candidate approaches and for selecting the right model for a data set at hand. In this post we’ll discuss the concepts of under- and overfitting and how these phenomena are related to the statistical quantities bias and variance. Finally, we will discuss how these concepts can be applied to select a model that will accurately generalize to novel scenarios/data sets.

## Models for Regression

When performing regression analyses we would like to characterize how the value of some dependent variable changes as some independent variable is varied. For example, say we would like to characterize the firing rate of a neuron in visual cortex as we vary the orientation of a grating pattern presented to the eye. We assume that there is some true relationship function that maps the independent variable values (i.e. the angle of the grating pattern) onto the dependent variable values (i.e. firing rate). We would like to determine the form of the function from observations of independent-dependent value pairs (I may also refer to these as input-output pairs, as we can think of the function taking as input and producing an output). However, in the real world, we don’t get to observe directly, but instead get noisy observations , where

(1)

Here we will assume that is random variable distributed according to a zero-mean Gaussian with standard deviation . Note that because is a random variable, is also a random variable (with a mean that is on conditioned on both and , and a variance .

As an example, say that the true function we want to determine has the the following form (though we don’t know it):

Thus the observations we get to see have the following distribution.

Below we define the function and display it, then draw a few observation samples , and display them as well:

% CREATE A RANDOM DATASET rng(10) nData = 10; % N x = 2*(rand(1,nData)-.5); xGrid = linspace(-1,1,100); % DEFINE AND TARGET FUNCTION f(x) f = inline('sin(pi*x)','x'); h = []; h(1) = plot(xGrid,f(xGrid),'k','Linewidth',2); xlabel('x') hold on % DEFINE AND DISPLAY y noiseSTD = .1; y = f(x) + noiseSTD*randn(size(x)); h(2) = scatter(x,y,'ko'); legend(h,{'f(x)','y'},'Location','Northwest');

Again, the goal here is to characterized the function . However, since we don’t know the function form of , we must instead estimate some other function that we think will be a good approximation to . Thus we call an estimator of . In general, an estimator is some parameterized model that can capture a wide range of functional forms. One such class of estimators is the weighted combination of ordered polynomials:

As the polynomial order increases, the functions are able to capture increasingly complex behavior. For example, desribes a horizontal line with an adjustable vertical offset, desribes a line with adjustable vertical offset and adjustable slope, describes a function that also includes a quadratic term. We thus try to fit the values of the parameters for a given estimator to best account for observed data in the hopes that we will also accurately approximate .

Below we estimate the parameters of three polynomial model functions of increasing complexity (using Matlab’s ) to the sampled data displayed above. Specifically, we estimate the functions , and .

% FIT POLYNOMIAL MODELS & DISPLAY % (ASSUMING PREVIOUS PLOT ABOVE STILL AVAILABLE) degree = [1,3,10]; theta = {}; cols = [.8 .05 0.05; 0.05 .6 0.05; 0.05 0.05 .6]; for iD = 1:numel(degree) figure(1) theta{iD} = polyfit(x,y,degree(iD)); fit{iD} = polyval(theta{iD},xGrid); h(end+1) = plot(xGrid,fit{iD},'color',cols(iD,:),'Linewidth',2); xlim([-1 1]) ylim([-1 1]) end legend(h,'f(x)','y','g_1(x)','g_3(x)','g_{10}(x)','Location','Northwest')

Qualitatively, we see that the estimator (red line) provides a poor fit to the observed data, as well as a poor approximation to the function (black curve). We see that the estimator (blue curve) provides a very accurate fit to the data points, but varies wildly to do so, and therefore provides an inaccurate approximation of . Finally, we see that the estimator (green curve) provides a fairly good fit to the observed data, and a much better job at approximating .

Our original goal was to approximate , not the data points per se. Therefore , at least qualitatively, provides a more desirable estimate of than the other two estimators. The fits for and are examples of “underfitting” and “overfitting” to the observed data. * Underfitting* occurs when an estimator is not flexible enough to capture the underlying trends in the observed data.

*occurs when an estimator is too flexible, allowing it to capture illusory trends in the data. These illusory trends are often the result of the noise in the observations .*

**Overfitting**## Variance and Bias of an Estimator

The model fits for discussed above were based on a single, randomly-sampled data set of observations . However, there are in principle, a potentially infinite number of data sets that can be observed (drawn from ). In order to determine a good model of , it would be helpful to have an idea of how an estimator will perform on any or all of these potential datasets. To get an idea of how each of the estimators discussed above performs in general we can repeat the model fitting procedure for many data sets.

Here we perform such an analyses, sampling 50 independent data sets according to Equation (1) (with ), then fitting the parameters for the polynomial functions of model order to each dataset.

% FIT MODELS TO K INDEPENDENT DATASETS K = 50; for iS = 1:K ySim = f(x) + noiseSTD*randn(size(x)); for jD = 1:numel(degree) % FIT THE MODEL USING polyfit.m thetaTmp = polyfit(x,ySim,degree(jD)); % EVALUATE THE MODEL FIT USING polyval.m simFit{jD}(iS,:) = polyval(thetaTmp,xGrid); end end % DISPLAY ALL THE MODEL FITS h = []; for iD = 1:numel(degree) figure(iD+1) hold on % PLOT THE FUNCTION FIT TO EACH DATASET for iS = 1:K h(1) = plot(xGrid,simFit{iD}(iS,:),'color',brighten(cols(iD,:),.6)); end % PLOT THE AVERAGE FUNCTION ACROSS ALL FITS h(2) = plot(xGrid,mean(simFit{iD}),'color',cols(iD,:),'Linewidth',5); % PLOT THE UNDERLYING FUNCTION f(x) h(3) = plot(xGrid,f(xGrid),'color','k','Linewidth',3); % CALCULATE THE SQUARED ERROR AT EACH POINT, AVERAGED ACROSS ALL DATASETS squaredError = (mean(simFit{iD})-f(xGrid)).^2; % PLOT THE SQUARED ERROR h(4) = plot(xGrid,squaredError,'k--','Linewidth',3); uistack(h(2),'top') hold off axis square xlim([-1 1]) ylim([-1 1]) legend(h,{sprintf('Individual g_{%d}(x)',degree(iD)),'Mean of All Fits','f(x)','Squared Error'},'Location','WestOutside') title(sprintf('Model Order=%d',degree(iD))) end

Here are the results for the estimator …

…and for the estimator , …

… and for

The lightly-colored curves in each of the three plots above are an individual polynomial model fit to one of the 50 sampled data sets. The darkly-colored curve in each plot is the average over the 50 individual fits. The dark black curve is the true, underlying function .

We see that for the estimator (red curves), model fits do not vary much from data set to data set. Thus the expected (mean) estimator fit over all the data sets, formally written as , is very similar to all the individual fits.

A common formal definition used in statistics to describe how much a single estimator deviates from the average estimator over datasets is the * variance of the estimator*. Formally defined as

the variance is the average squared difference between any single data-set-dependent estimate of and the average value of estimated over all datasets.

According to the definition of variance, we can say that the estimator exhibits *low variance*.

A commonly-used metric in statistics for assessing how an estimator approximates a target function , based on behavior over many observed data sets is what is called the * bias of the estimator.* Formally defined as:

The bias describes how much the average estimator fit over datasets deviates from the value of the underlying target function .

We can see from the plot for that deviates significantly from . Thus we can say that the estimator exhibits *large bias* when approximating the function .

Investigating the results for the estimator (blue curves), we see that each individual model fit varies dramatically from one data set to another. Thus we can say that this estimator exhibits *large variance*. Looking at (the dark blue curve), we see that on average the estimator provides a better approximation to than the estimator . Thus we can say that exhibits a *lower bias* than the estimator .

Investigating the fits for the estimator (green curves), we find that this estimator has low bias. Furthermore, the average estimator (dark green curve) accurately approximates the true function , telling us that that the estimator also has *low bias*.

We established earlier that the estimator provided a qualitatively better fit to the function than the other two polynomial estimators for a single dataset. It appears that this is also the case over many datasets. We also find that estimator exhibits low bias and low variance, whereas the other two, less-desirable estimators, have either high bias or high variance. Thus it would appear that having both low bias and low variance is a reasonable criterion for selecting an accurate model of .

Included in each of the three plots above is a dashed black line representing the squared difference between the average estimator and the true function . Calculating squared model errors is a common practice for quantifying the goodness of a model fit. If we calculate the expected value of each of the dashed black curves (and assuming that all values of are equally likely to occur), we would obtain a single value for each estimator that is the mean squared error (MSE) between the expected estimator and the true function.

For the estimator , the MSE will be very small, as the dashed black curve for this estimator is near zero for all values of . The estimators and would have significantly larger values. Now, because exhibiting both a low MSE, as well as having both low bias and variance are indicative of a good estimator, it would be reasonable to assume that squared model error is directly related to bias and variance. The next section provides some formal evidence for this notion.

## Expected Prediction Error and the Bias-variance Tradeoff

For a given estimator fit to a data set of pairs, we would like to know, given all the possible datasets out there, what is the expected prediction error we will observe for a new data point . If we define prediction error to be the squared difference in model prediction and observations , the expected prediction error is then:

If we expand this a little and use a few identities, something interesting happens:

(2)

(3)

(4)

where we have applied the following Lemma to the first and third terms of Equation (3), and use the fact to (Think of averaging over an infinite number of datasets sampled from y; all noise will average out, leaving ). Rearranging Equation (4), we obtain

(5)

which can be further simplified and grouped into three terms

(6)

- The first term is the
introduced above.**variance of the estimator** - The second term is the square of the
, also introduced above.**bias of the estimator** - The third term is the
and describes how much the observations vary from the true function . Notice that the noise term does not depend on the estimator . Thus the noise term is a constant that places a lower bound on expected prediction error.**variance of the observation noise**

Here we find that the expected prediction error on new data (in the squared differences sense) is the combination of three terms:

* Expected prediction error* =

*+*

**estimator variance***+*

**squared estimator bias**

**noise**Thus the expected prediction error on new data can be used as a quantitative criterion for selecting the best model from a candidate set of estimators! It turns out that, given new data points , the expected prediction error can be easily approximated as the mean squared error over data pairs:

Below we demonstrate these findings with another set of simulations. We simulate 100 independent datasets, each with 25 pairs. We then partition each dataset into two non-overlapping sets: a training set using for fitting model parameters, and a testing set used to calculate prediction error. We then fit the parameters for estimators of varying complexity. Complexity is varied by using polynomial functions that range in model order from 1 (least complex) to 12 (most complex). We then calculate and display the squared bias, variance, and error on testing set for each of the estimators:

N = 25; % # OF OBSERVATIONS PER DATASET K = 100;% # OF DATASETS noiseSTD = .5; % NOISE STANDARDE DEV. nTrain = ceil(N*.9); % # OF TRAINING POINTS nPolyMax = 12; % MAXIMUM MODEL COMPLEXITY % # INITIALIZE SOME VARIABLES xGrid = linspace(-1,1,N); meanPrediction = zeros(K,N); thetaHat = {}; x = linspace(-1,1,N); x = x(randperm(N)); for iS = 1:K % LOOP OVER DATASETS % CREATE OBSERVED DATA, y y = f(x) + noiseSTD*randn(size(x)); % CREATE TRAINING SET xTrain = x(1:nTrain); yTrain = y(1:nTrain); % CREATE TESTING SET xTest = x(nTrain+1:end); yTest = y(nTrain+1:end); % FIT MODELS for jD = 1:nPolyMax % MODEL PARAMETER ESTIMATES thetaHat{jD}(iS,:) = polyfit(xTrain,yTrain,jD); % PREDICTIONS yHatTrain{jD}(iS,:) = polyval([thetaHat{jD}(iS,:)],xTrain); TRAINING SET yHatTest{jD}(iS,:) = polyval([thetaHat{jD}(iS,:)],xTest);% TESTING SET % MEAN SQUARED ERROR trainErrors{jD}(iS) = mean((yHatTrain{jD}(iS,:) - yTrain).^2); % TRAINING testErrors{jD}(iS) = mean((yHatTest{jD}(iS,:) - yTest).^2); % TESTING end end % CALCULATE AVERAGE PREDICTION ERROR, BIAS, AND VARIANCE for iD = 1:nPolyMax trainError(iD) = mean(trainErrors{iD}); testError(iD) = mean(testErrors{iD}); biasSquared(iD) = mean((mean(yHatTest{iD})-f(xTest)).^2); variance(iD) = mean(var(yHatTest{iD},1)); end [~,bestModel] = min(testError); % DISPLAY figure; hold on; plot(testError,'k','Linewidth',2); plot(biasSquared,'r','Linewidth',2); plot(variance,'b','Linewidth',2); plot(biasSquared + variance,'m-.','Linewidth',2); yl = ylim; plot([bestModel,bestModel],[yl(1),yl(2)],'k--'); xlim([1,nPolyMax]); xlabel('Model Complexity (Polynomial Order)') legend('Test Error','Bias^2','Variance','Bias^2+Var.','Best Model') hold off;

Here we see how, as the model complexity increases, the estimator variance (blue curve) gradually increases. Additionally, as model complexity increases, the squared bias (red curve) decreases. Thus ** there is a tradeoff between bias and variance that comes with model complexity**: models that are too complex will have high variance and low bias; models that are too simple will have high bias and low variance. The best model will have both low bias and low variance. In this example, we highlight the best estimator in terms of prediction error on the testing set (black curve) with a dashed black vertical line. The best estimator corresponds to a polynomial model of order of . Notice that the vertical black line is located where function defined by the sum of the squared bias and variance (dashed magenta curve) is also at a minimum.

Notice also how the sum of the squared bias and variance also has the same shape as curve defined by the prediction error on the testing set. This exemplifies how the error on novel data can be used as a proxy for determining the best estimator from a candidate set based on squared bias and variance. The noise term in Equation (6) is also represented in the figure by the vertical displacement between the black curve and dashed magenta curve.

It is very important to point out that all of these results are based on evaluating prediction error on **novel data**, not used to estimate model parameters. In fact assessing a model performance based on prediction error calculated on the same data used to estimate the model parameters is highly problematic, as it causes models to always overfit. In plain terms, this means that we will always favor a more complex model if we assess goodness of model fits on the training data, as a more complex model will be better able to capture small, random trends in the data due to noise.

This overfitting phenomenon is demonstrated below. We plot the error calculated on the training set (Train Error, green curve) along with the error calculated on the testing set (Test Error, black curve) for the above simulation. We also identify the best estimator as we did above.

% DISPLAY figure, hold on; plot(trainError,'g','Linewidth',2); plot(testError,'k','Linewidth',2); yl = ylim; plot([bestModel,bestModel],[yl(1),yl(2)],'k--'); xlim([1,nPolyMax]); xlabel('Model Complexity (Polynomial Order)') legend('Train Error','Test Error','Best Model'); hold off;

We see here that as model complexity increases, the error calculated on the training set continues to decrease, where as the error on the testing set increases past the optimal polynomial order . We showed above that error calculated on the testing set is the true indicator of how well an estimator will generalize to new data points. The error calculated on the training set strongly disagrees with the error calculated on the testing set after the optimal model complexity has been reached. Since, in general, the whole point of modeling a data set is to generalize to novel data, assessing model predictions on the training set data should be avoided.

## Wrapping Up

Here we discussed how the bias and variance of an estimator are related to squared prediction error. Though we focused on regression, these concepts can also be applied to classification problems. We found that an optimal estimator will have both low variance and low bias. We further found that information about squared bias and variance is contained in expected prediction error calculated on a testing set of data not used to fit a model’s parameters.

The concepts of estimator bias and variance are generally only clear in the context of an ensemble of datasets. However, in real-world applications, there is generally only a single observed dataset. In such cases the roles of bias and variance are less obvious (though, it is possible to calculate estimates of variance and bias using resampling methods such as bootstrapping). However, the direct connection we made between bias, variance with the mean-squared error calculated on a testing set give us a direct means for assessing a group of candidate estimators in light of a single data set. We only need to partition the available data set into separate portions: one used to fit model parameters (a training set), and another used to assess prediction accuracy (testing set). Comparing prediction accuracy across potential estimators is equivalent to assessing biases and variances of the estimators across many datasets. Note that resampling methods such as cross-validation can prove helpful here, particularly when the amount of observed data is small.

Note, all the code for this post, containe in a single script is here:

clear clc close all % CREATE A RANDOM DATASET rng(10) nData = 10; % N x = 2*(rand(1,nData)-.5); xGrid = linspace(-1,1,100); % DEFINE AND TARGET FUNCTION f(x) f = inline('sin(pi*x)','x'); h = []; h(1) = plot(xGrid,f(xGrid),'k','Linewidth',2); xlabel('x') hold on % DEFINE AND DISPLAY y noiseSTD = .1; y = f(x) + noiseSTD*randn(size(x)); h(2) = scatter(x,y,'ko'); legend(h,{'f(x)','y'},'Location','Northwest'); % FIT POLYNOMIAL MODELS & DISPLAY % (ASSUMING PREVIOUS PLOT ABOVE STILL AVAILABLE) degree = [1,3,10]; theta = {}; cols = [.8 .05 0.05; 0.05 .6 0.05; 0.05 0.05 .6]; for iD = 1:numel(degree) figure(1) theta{iD} = polyfit(x,y,degree(iD)); fit{iD} = polyval(theta{iD},xGrid); h(end+1) = plot(xGrid,fit{iD},'color',cols(iD,:),'Linewidth',2); xlim([-1 1]) ylim([-1 1]) end legend(h,'f(x)','y','g_1(x)','g_3(x)','g_{10}(x)','Location','Northwest') % FIT MODELS TO K INDEPENDENT DATASETS K = 50; for iS = 1:K ySim = f(x) + noiseSTD*randn(size(x)); for jD = 1:numel(degree) % FIT THE MODEL USING polyfit.m thetaTmp = polyfit(x,ySim,degree(jD)); % EVALUATE THE MODEL FIT USING polyval.m simFit{jD}(iS,:) = polyval(thetaTmp,xGrid); end end % DISPLAY ALL THE MODEL FITS h = []; for iD = 1:numel(degree) figure(iD+1) hold on % PLOT THE FUNCTION FIT TO EACH DATASET for iS = 1:K h(1) = plot(xGrid,simFit{iD}(iS,:),'color',brighten(cols(iD,:),.6)); end % PLOT THE AVERAGE FUNCTION ACROSS ALL FITS h(2) = plot(xGrid,mean(simFit{iD}),'color',cols(iD,:),'Linewidth',5); % PLOT THE UNDERLYING FUNCTION f(x) h(3) = plot(xGrid,f(xGrid),'color','k','Linewidth',3); % CALCULATE THE SQUARED ERROR AT EACH POINT, AVERAGED ACROSS ALL DATASETS squaredError = (mean(simFit{iD})-f(xGrid)).^2; % PLOT THE SQUARED ERROR h(4) = plot(xGrid,squaredError,'k--','Linewidth',3); uistack(h(2),'top') hold off axis square xlim([-1 1]) ylim([-1 1]) legend(h,{sprintf('Individual g_{%d}(x)',degree(iD)),'Mean of All Fits','f(x)','Squared Error'},'Location','WestOutside') title(sprintf('Model Order=%d',degree(iD))) end N = 25; % # OF OBSERVATIONS PER DATASET K = 100;% # OF DATASETS noiseSTD = .5; % NOISE STANDARDE DEV. nTrain = ceil(N*.9); % # OF TRAINING POINTS nPolyMax = 12; % MAXIMUM MODEL COMPLEXITY % # INITIALIZE SOME VARIABLES xGrid = linspace(-1,1,N); meanPrediction = zeros(K,N); thetaHat = {}; x = linspace(-1,1,N); x = x(randperm(N)); for iS = 1:K % LOOP OVER DATASETS % CREATE OBSERVED DATA, y y = f(x) + noiseSTD*randn(size(x)); % CREATE TRAINING SET xTrain = x(1:nTrain); yTrain = y(1:nTrain); % CREATE TESTING SET xTest = x(nTrain+1:end); yTest = y(nTrain+1:end); % FIT MODELS for jD = 1:nPolyMax % MODEL PARAMETER ESTIMATES thetaHat{jD}(iS,:) = polyfit(xTrain,yTrain,jD); % PREDICTIONS yHatTrain{jD}(iS,:) = polyval([thetaHat{jD}(iS,:)],xTrain); % TRAINING SET yHatTest{jD}(iS,:) = polyval([thetaHat{jD}(iS,:)],xTest);% TESTING SET % MEAN SQUARED ERROR trainErrors{jD}(iS) = mean((yHatTrain{jD}(iS,:) - yTrain).^2); % TRAINING testErrors{jD}(iS) = mean((yHatTest{jD}(iS,:) - yTest).^2); % TESTING end end % CALCULATE AVERAGE PREDICTION ERROR, BIAS, AND VARIANCE for iD = 1:nPolyMax trainError(iD) = mean(trainErrors{iD}); testError(iD) = mean(testErrors{iD}); biasSquared(iD) = mean((mean(yHatTest{iD})-f(xTest)).^2); variance(iD) = mean(var(yHatTest{iD},1)); end [~,bestModel] = min(testError); % DISPLAY figure; hold on; plot(testError,'k','Linewidth',2); plot(biasSquared,'r','Linewidth',2); plot(variance,'b','Linewidth',2); plot(biasSquared + variance,'m-.','Linewidth',2); yl = ylim; plot([bestModel,bestModel],[yl(1),yl(2)],'k--'); xlim([1,nPolyMax]); xlabel('Model Complexity (Polynomial Order)') legend('Test Error','Bias^2','Variance','Bias^2+Var.','Best Model') hold off; % DISPLAY figure, hold on; plot(trainError,'g','Linewidth',2); plot(testError,'k','Linewidth',2); yl = ylim; plot([bestModel,bestModel],[yl(1),yl(2)],'k--'); xlim([1,nPolyMax]); xlabel('Model Complexity (Polynomial Order)') legend('Train Error','Test Error','Best Model'); hold off;

## Derivation: Ordinary Least Squares Solution and Normal Equations

In a linear regression framework, we assume some output variable is a linear combination of some independent input variables plus some independent noise . The way the independent variables are combined is defined by a parameter vector :

We also assume that the noise term is drawn from a standard Normal distribution:

For some estimate of the model parameters , the model’s prediction errors/residuals are the difference between the model prediction and the observed ouput values

The Ordinary Least Squares (OLS) solution to the problem (i.e. determining an optimal solution for ) involves minimizing the sum of the squared errors with respect to the model parameters, . The sum of squared errors is equal to the inner product of the residuals vector with itself :

To determine the parameters, , we minimize the sum of squared residuals with respect to the parameters.

due to the identity , for vectors and . This relationship is matrix form of the Normal Equations. Solving for gives the analytical solution to the Ordinary Least Squares problem.

Boom.

## Cutting Your Losses: Loss Functions & the Sum of Squares Loss

Often times, particularly in a regression framework, we are given a set of inputs (independent variables) and a set outputs (dependent variables) , and we want to devise a **model function**

that predicts the outputs given some inputs as best as possible. But what does it mean for a model to predict “as best as possible” exactly? In order to make the notion of how good a model is explicit, it is common to adopt a **loss function**

,

The loss function is some function of the model’s ** prediction errors** (a.k.a. residuals) at predicting outputs given the inputs (the loss function is also often referred to as the

*cost function,*as it makes explicit the “cost” of incorrect prediction). “Good” models of a dataset will have small prediction errors, and therefore produce small loss function values. Determining the “best” model is equivalent to finding model function that minimizes the loss function. A common choice for this loss function is the sum of squared of the errors (SSE) loss. If there are input-output pairs, the SSE Loss function is formally:

This formula states that, for each output predicted by the model, we determine how far away the prediction is from the actual value (i.e. subtraction). Each of the individual distances are then squared and added to give a single number indicating how well (or badly) the model function captures the structure of the data across all the datapoints. The “best” model under this loss is called the least sum of squares (LSS) solution.

But why square the errors before summing them? At first, this seems somewhat unintuitive (or even ad hoc!). Surely there are other, more straight-forward loss functions we can devise. An initial notion of just adding the errors leads to a dead end because adding many positive and negative errors (i.e. resulting from data located below and above the model function) just cancels out; we want our measure of errors to be all positive (or all negative). Therefore, another idea would be to just take the absolute value of the errors before summing. Turns out, this is a known loss function, called the sum of absolute errors (SAE) or sum of absolute deviations (SAD) loss function. Finding the “best” SAE/SAD model is called the least absolute error LAE/LAD solution and such a solution was actually proposed decades before LSS. Though LAE is indeed used in contemporary methods (we’ll talk more about LAE later), the sum of squares loss function is far more popular in practice. Why does sum of squares always make the cut?

# Useful Interpretations of the Sum of Squares Loss for Linear Regression

## Areas of squares

**Figure 1** demonstrates a set of 2D data (blue dots) and the LSS linear function (black line) of the form

,

where the parameters (the offet of the line from ) and (the slope) have been estimated (LSS) to “best” fit the data.

**Figures 2**. Each red square is a

*literal*interpretation of the squared error for linear function fit in

**Figure 1**. Here we see that no matter if the errors occur from predictions being greater than or less than the actual output values, the error term is always positive.

In this interpretation, the goal of finding the LSS solution is to find the line that results in the smallest red area. This interpretation is also useful for understanding the important regression metric known as the coefficient of determination , which is an indicator of how well a linear model function explains or predicts a dataset. Imagine that instead of the line fit in Figures 1-2, we instead fit a simpler model that has no slope parameter, and only a bias/offset parameter (**Figure 3**). In this case the simpler model only captures the mean value of the data along the y-dimension.

The squared error in this model corresponds to the (unscaled) variance of the data. Lets denote the total area of the green squares in **Figure 3** as the total sum of squares (TSS) error, and the area of the red squares in **Figure 2** as the residual sum of squares (RSS) of the linear model fit. The metric is related to the red and green areas as follows

If the linear model is doing a good job of fitting the data, then the variance of the model errors/residuals (RSS) term will be small compared to the variance of the dataset (TSS), and the metric will be close to one. If the model is doing a poor job of fitting the data, then the variance residuals will approach that of the data itself, and the metric will be close to zero (Note too that the value of can also take negative values, in the case when the RSS is larger than the TSS, indicating a very poor model).

## A bar suspended by springs

We can gain some important insight to the importance of the least squares loss by developing concepts within the framework of a physical system (**Figure 4). **In this formulation, a set of springs (red, dashed lines, our errors ) suspend a bar (solid black line, our linear function to a set of anchors (blue datapoints, our outputs ). Note that in this formulation, the springs are constrained to operate only along the vertical direction (*y*-dimension). This constraint is equivalent to saying that there is only error in our measurement of the dependent variables, and is often an assumption made in regression frameworks.

From Hooke’s Law, the force created by each spring on the bar is proportional to the distance (error) from the bar (linear function) to its corresponding anchor (datapoint) :

Further, there is a potential energy associated with each spring (datapoint). The total potential energy for the entire system is as follows:

*squared*error (distance) between the bar (line function) and the anchors (datapoints).

From the equation corresponding to the first zero-net-force condition, we can solve for the bias parameter of the linear function that describes the orientation of the bar:

Here the (pronounced “bar”) means the average value. Plugging this expression into the second second zero-net-torque condition equation, we discover that the slope of the line has an interesting interpretation related to the variances of the data:

The expressions for the parameters and tell us that, under the least squares linear regression framework, The average of the dependent variables is equal to a scaled version of the average of independent variables plus an offset:

Further, the scaling factor (the slope) is equal to the ratio of the covariance between the dependent and independent variables to the variance of the independent variable. Therefore if and are positively correlated, the slope will be positive, if they are negatively correlated, the slope will be negative.

Because of these relationships, the LSS solution has a number of useful properties:

- The sum of the residuals under the LSS solution is zero(this is equivalent to the first zero-net-force condition above).
- Because of 1., the average residual of the LSS solution is zero
- The covariance between the independent variables and the residuals is zero.
- The LLS solution always passes through the mean (center of mass) of the sample
- The LSS solution minimizes the variance of the residuals/model errors.

Therefore, the least squares loss function directly relates model residuals to how the independent and dependent variables co-vary. These relationships are not available with other loss functions such as the least absolute deviation.

What’s interesting, is that the two physical constraint equations derived from the physical system above are also obtained through other analytic analyses of linear regression including defining the LSS problem using both maximum likelihood estimation and method of moments.

# Wrapping up

There are many other reasons, albeit suggestions, as to why squared errors are often preferred to other rectifying functions of the errors (i.e. making all errors be positive or zero):

- The Least Squares solution can be derived in closed form, allowing simple analytic implementations and fast computation of model parameters.
- Unlike the LAE loss, the SSE loss is differentiable (i.e. is smooth) everywhere, which allows model parameters to be estimated using straight-forward, gradient-based optimizations
- Squared errors have deep ties in statistics and maximum likelihood estimation methods (as mentioned above), particularly when the errors are distributed according to the Golden Boy of statistical distributions, the Normal distribution.
- There are a number of geometric and linear algebra theorems that support using least squares. For instance the Gauss-Markov theorem states that if errors of a linear function are distributed Normally about the mean of the line, then the LSS solution gives the best unbiased estimator for the parameters .
- Squared functions have a long history of facilitating calculus calculations used throughout the physical sciences.

The SSE loss does have a number of downfalls as well. For instance, because each error is squared, any outliers in the dataset can dominate the parameter estimation process. For this reason, the LSS loss is said to lack *robustness.* Therefore preprocessing of the the dataset (i.e. removing or thresholding outlier values) may be necessary when using the LSS loss.

# MATLAB Code

The code to produce the figures in this post is below. The code can be directly copied to your clipboard using the toolbar at the top right of the code display.

close all; clear randn('state',12345); x = -5:5; % INPUTS y = x + .5*randn(size(x)); % OUTPUTS % PART 1 - LINEAR MODEL % PLOT DATA figure(1); h1 = scatter(x,y,'filled'); xlim([min(x)-1 max(x)+1]); xlim([min(y)-1 max(y)+1]); axis square xlabel('x'); ylabel('y'); fprintf('\nHere are a set of 2D data pairs...'); pause clc % FIND LSS SOLUTION bias = ones(size(x)); beta=regress(y',[bias;x]'); intercept = beta(1); slope = beta(2); yHat = x*slope + intercept; % PLOT MODEL FIT hold on; h2 = plot(x,yHat,'k-','Linewidth',2); fprintf('\n...and the LSS linear function determined for the points.\n'); pause clc % CALCULATE PREDICTION ERROR e = y - yHat; posErrors = find(e>=0); negErrors = setdiff(1:numel(x),posErrors); % PLOT "SQUARED" ERRORS cnt = 1; for iP = 1:numel(posErrors); xs = [x(posErrors(iP))-e(posErrors(iP)), ... x(posErrors(iP)), ... x(posErrors(iP)), ... x(posErrors(iP))-e(posErrors(iP))]; ys = [y(posErrors(iP))-e(posErrors(iP)), ... y(posErrors(iP))-e(posErrors(iP)), ... y(posErrors(iP)), ... y(posErrors(iP))]; hS(cnt)=patch(xs,ys,'r'); set(hS(cnt),'EdgeColor','r'); set(hS(cnt),'FaceAlpha',.5); cnt = cnt+1; end for iN = 1:numel(negErrors); xs = [x(negErrors(iN))-e(negErrors(iN)), ... x(negErrors(iN)), ... x(negErrors(iN)), ... x(negErrors(iN))-e(negErrors(iN))]; ys = [y(negErrors(iN)), ... y(negErrors(iN)), ... y(negErrors(iN))-e(negErrors(iN)), ... y(negErrors(iN))-e(negErrors(iN))]; hS(cnt)= patch(xs,ys,'r'); set(hS(cnt),'EdgeColor','r'); set(hS(cnt),'FaceAlpha',.5); cnt = cnt+1; end uistack(h2,'top'); uistack(h1,'top'); fprintf('\nOne helpful interpretation is to represent the') fprintf('\nsquared errors literally as the area spanned') fprintf('\nin the space (red squares).\n') fprintf('\nFinding the LSS solution is equivalent to minimizing') fprintf('\nthe sum of the area of these squares.\n') pause clc % PART 2 - TRIVIAL LINEAR MODEL figure(2); h1 = scatter(x,y,'filled'); xlim([min(x)-1 max(x)+1]); xlim([min(y)-1 max(y)+1]); axis square xlabel('x'); ylabel('y'); yHat0 = mean(y).*ones(size(x)); e0 = y - yHat0; posErrors0 = find(e0>=0); negErrors0 = setdiff(1:numel(x),posErrors0); % PLOT TRIVIAL MODEL FIT hold on; h2 = plot(x,yHat0,'k-','Linewidth',2); fprintf('\nNow, imagine that we fit a simpler model to the datapoints') fprintf('\nthat is a line with no slope parameter and an offset parameter.') fprintf('\nIn this case we''re essentially fitting the mean of the data.') pause clc % PLOT TRIVIAL MODEL "SQUARED" ERRORS cnt = 1; for iP = 1:numel(posErrors0); xs = [x(posErrors0(iP))-e0(posErrors0(iP)), ... x(posErrors0(iP)), ... x(posErrors0(iP)), ... x(posErrors0(iP))-e0(posErrors0(iP))]; ys = [y(posErrors0(iP))-e0(posErrors0(iP)), ... y(posErrors0(iP))-e0(posErrors0(iP)), ... y(posErrors0(iP)), ... y(posErrors0(iP))]; hS(cnt)=patch(xs,ys,'g'); set(hS(cnt),'EdgeColor','g'); set(hS(cnt),'FaceAlpha',.5); cnt = cnt+1; end for iN = 1:numel(negErrors0); xs = [x(negErrors0(iN))-e0(negErrors0(iN)), ... x(negErrors0(iN)), ... x(negErrors0(iN)), ... x(negErrors0(iN))-e0(negErrors0(iN))]; ys = [y(negErrors0(iN)), ... y(negErrors0(iN)), ... y(negErrors0(iN))-e0(negErrors0(iN)), ... y(negErrors0(iN))-e0(negErrors0(iN))]; hS(cnt)= patch(xs,ys,'g'); set(hS(cnt),'EdgeColor','g'); set(hS(cnt),'FaceAlpha',.5); cnt = cnt+1; end uistack(h2,'top'); uistack(h1,'top'); fprintf('\nTherefore the sum of residuals for this model area equal to the') fprintf('\n(unscaled) variance of the data.\n') pause fprintf('\nThe ratio of the area of the green boxes in Figure 1 to the area of') fprintf('\nthe red boxes in Figure 2 is related to the important metric known') fprintf('\nas the coefficient of determination, R^2. Specifically:\n') fprintf('\nR^2 = 1 - Red/Green\n') fprintf('\nNote that as the linear model fit improves, the area of the red') fprintf('\nboxes decreases and the value of R^2 approaches one.') pause clc % PART 3- "SPRINGS" INTERPRETATION figure(3) h1 = scatter(x,y,'filled'); xlim([min(x)-1 max(x)+1]); xlim([min(y)-1 max(y)+1]);hold on h2 = plot(x,yHat,'k-','Linewidth',2); axis square xlabel('x'); ylabel('y'); h3 = line([x;x],[y;yHat],'color','r','linestyle','--','Linewidth',2); uistack(h1,'top'); fprintf('\nIt can also be helpful to think of errors corresponding to') fprintf('\nindividual datapoints as springs (Figuree 3, red dashes)') fprintf('\nattached to a suspended bar (black line)\n') fprintf('\nIf the springs are limited to only operate in the the y-direction,') fprintf('\nthen sum of the potential energies stored in the springs when') fprintf('\nthe bar has reached its equilibtium position directly corresponds') fprintf('\nto the sum of squares error function:\n') fprintf('\ne = y - yHat;\nU = integral(ke)de = 1/2ke^2\n') fprintf('\nif k = 1, then:\n') fprintf('\nU = 1/2(y - yHat)^2, \nthe least squares error function\n') pause clc close all; clear all; clc