# Blog Archives

## Introduction

Though many phenomena in the world can be adequately modeled using linear regression or classification, most interesting phenomena are generally nonlinear in nature. In order to deal with nonlinear phenomena, there have been a diversity of nonlinear models developed. For example parametric models assume that data follow some parameteric class of nonlinear function (e.g. polynomial, power, or exponential), then fine-tune the shape of the parametric function to fit observed data. However this approach is only helpful if data are fit nicely by the available catalog of parametric functions. Another approach, kernel-based methods, transforms data non-linearly into an abstract space that measures distances between observations, then predicts new values or classes based on these distances. However, kernel methods generally involve constructing a kernel matrix that depends on the number of training observations and can thus be prohibitive for large data sets. Another class of models, the ones that are the focus of this post, are artificial neural networks (ANNs). ANNs are nonlinear models motivated by the physiological architecture of the nervous system. They involve a cascade of simple nonlinear computations that when aggregated can implement robust and complex nonlinear functions. In fact, depending on how they are constructed, ANNs can approximate any nonlinear function, making them a quite powerful class of models (note that this property is not reserved for ANNs; kernel methods are also considered “universal approximators”; however, it turns out that neural networks with multiple layers are more efficient at approximating arbitrary functions than other methods. I refer the interested reader to more in-depth discussion on the topic.).

In recent years ANNs that use multiple stages of nonlinear computation (aka “deep learning”)  have been able obtain outstanding performance on an array of complex tasks ranging from visual object recognition to natural language processing. I find ANNs super interesting due to their computational power and their intersection with computational neuroscience.  However, I’ve found that most of the available tutorials on ANNs are either dense with formal details and contain little information about implementation or any examples, while others skip a lot of the mathematical detail and provide implementations that seem to come from thin air.  This post aims at giving a more complete overview of ANNs, including (varying degrees of) the math behind ANNs, how ANNs are implemented in code, and finally some toy examples that point out the strengths and weaknesses of ANNs.

## Single-layer Neural Networks

The simplest ANN (Figure 1) takes a set of observed inputs $\mathbf{a}=(a_1, a_2, ..., a_N)$, multiplies each of them by their own associated weight $\mathbf{w} = (w_1, w_2, ...w_N)$, and sums the weighted values to form a pre-activation $z$.Oftentimes there is also a bias $b$ that is tied to an input that is always +1 included in the preactivation calculation. The network then transforms the pre-activation using a nonlinear activation function $g(z)$ to output a final activation $a_{\text{out}}$.

Figure 1: Diagram of a single-layered artificial neural network.

There are many options available for the form of the activation function $g(z)$, and the choice generally depends on the task we would like the network to perform. For instance, if the activation function is the identity function:

$\Large{\begin{array}{rcl}g_{\text{linear}}(z) = z\end{array}}$,

which outputs continuous values $a_{linear}\in (-\infty, \infty)$, then the network implements a linear model akin to used in standard linear regression. Another choice for the activation function is the logistic sigmoid:

$\Large{ \begin{array}{rcl}g_{\text{logistic}}(z) = \frac{1}{1+e^{-z}}\end{array}}$,

which outputs values $a_{logistic} \in (0,1)$. When the network outputs use the logistic sigmoid activation function, the network implements linear binary classification. Binary classification can also be implemented using the hyperbolic tangent function,  $\text{tanh}(z)$, which outputs values $a_{\text{tanh}}\in (-1, 1)$ (note that the classes must also be coded as either -1 or 1 when using $\text{tanh}$. Single-layered neural networks used for classification are often referred to as “perceptrons,” a name given to them when they were first developed in the late 1950s.

Figure 2: Common activation functions functions used in artificial neural, along with their derivatives

To get a better idea of what these activation function do, their outputs for a given range of input values are plotted in the left of Figure 2. We see that the logistic and tanh activation functions (blue and green) have the quintessential sigmoidal “s” shape that saturates for inputs of large magnitude. This behavior makes them useful for categorization. The identity / linear activation (red), however forms a linear mapping between the input to the activation function, which makes it useful for predicting continuous values.

A key property of these activation functions is that they are all smooth and differentiable. We’ll see later in this post why differentiability is important for training neural networks. The derivatives for each of these common activation functions are given by (for mathematical details on calculating these derivatives,  see this post):

$\Large{\begin{array}{rcl} g'_{\text{linear}}(z) &=& 1 \\ g'_{\text{logistic}}(z) &=& g_{\text{logistic}}(z)(1- g_{\text{logistic}}(z)) \\ g'_{\text{tanh}}(z) &=& 1 - g_{\text{tanh}}^2(z) \\ \end{array}}$

Each of the derivatives are plotted in the right of Figure 2. What is interesting about these derivatives is that they are either a constant (i.e. 1), or are can be defined in terms of the original function. This makes them extremely convenient for efficiently training neural networks, as we can implement the gradient using simple manipulations of the feed-forward states of the network.

Code Block 1: Defines standard activation functions and generates Figure 2:

% DEFINE A FEW COMMON ACTIVATION FUNCTIONS
gLinear = inline('z','z');
gSigmoid = inline('1./(1+exp(-z))','z');
gTanh = inline('tanh(z)','z');

% ...DEFINE THEIR DERIVATIVES
gPrimeLinear = inline('ones(size(z))','z');
gPrimeSigmoid = inline('1./(1+exp(-z)).*(1-1./(1+exp(-z)))','z');
gPrimeTanh = inline('1-tanh(z).^2','z');

% VISUALIZE EACH g(z)
z = linspace(-4,4,100);
figure
set(gcf,'Position',[100,100,960,420])
subplot(121);  hold on;
h(1) = plot(z,gLinear(z),'r','Linewidth',2);
h(2) = plot(z,gSigmoid(z),'b','Linewidth',2);
h(3) = plot(z,gTanh(z),'g','Linewidth',2);
set(gca,'fontsize',16)
xlabel('z')
legend(h,{'g_{linear}(z)','g_{logistic}(z)','g_{tanh}(z)'},'Location','Southeast')
title('Some Common Activation Functions')
hold off, axis square, grid
ylim([-1.1 1.1])

% VISUALIZE EACH g'(z)
subplot(122); hold on
h(1) = plot(z,gPrimeLinear(z),'r','Linewidth',2);
h(2) = plot(z,gPrimeSigmoid(z),'b','Linewidth',2);
h(3) = plot(z,gPrimeTanh(z),'g','Linewidth',2);
set(gca,'fontsize',16)
xlabel('z')
legend(h,{'g''_{linear}(z)','g''_{logistic}(z)','g''_{tanh}(z)'},'Location','South')
title('Activation Function Derivatives')
hold off, axis square, grid
ylim([-.5 1.1])


## Multi-layer Neural Networks

As was mentioned above, single-layered networks implement linear models, which doesn’t really help us if we want to model nonlinear phenomena. However, by considering the single layer network  diagrammed in Figure 1 to be a basic building block, we can construct more complicated networks, ones that perform powerful, nonlinear computations. Figure 3 demonstrates this concept. Instead of a single layer of weights between inputs and output, we introduce a set of  single-layer networks between the two. This set of intermediate networks is often referred to as a “hidden” layer, as it doesn’t directly observe input or directly compute the output. By using a hidden layer, we form a mult-layered ANN. Though there are many different conventions for declaring the actual number of layers in a multi-layer network, for this discussion we will use the convention of the number of distinct sets of trainable weights as the number of layers. For example, the network in Figure 3 would be considered a 2-layer ANN because it has two layers of weights: those connecting the inputs to the hidden layer ($w_{ij}$), and those connecting the output of the hidden layer to the output layer($w_{jk}$).

Figure 3: Diagram of a multi-layer ANN. Each node in the network can be considered a single-layered ANN (for simplicity, biases are not visualized in graphical model)

Multi-layer neural networks form compositional functions that map the inputs nonlinearly to outputs. If we associate index i with the input layer, index j with the hidden layer, and index k with the output layer, then an output unit in the network diagrammed in Figure 3 computes an output value $a_k$ given and input $a_i$ via the following compositional function:

$\Large{ \begin{array}{rcl}a_{\text{out}} = a_k = g_k(b_k + \sum_jg_j(b_j + \sum_i a_i w_{ij})w_{jk}\end{array}}$.

Here $z_l$ is the  is the pre-activation values for units for layer $l$, $g_l()$ is the activation function for units  in that layer (assuming they are the same), and $a_l = g_l(z_l)$ is the output activation for units in that layer. The weight $w_{l-1, l}$ links the outputs of units feeding into layer $l$ to the activation function of units for that layer. The term $b_l$ is the bias for units in layer $l$.

As with the single-layered ANN, the choice of activation function for the output layer will depend on the task that we would like the network to perform (i.e. categorization or regression), and follows similar rules outlined above. However, it is generally desirable for the hidden units to have nonlinear activation functions (e.g. logistic sigmoid or tanh). This is because multiple layers of linear computations can be equally formulated as a single layer of linear computations. Thus using linear activations for the hidden layers doesn’t buy us much. However, as we’ll see shortly, using linear activations for the output unit activation function (in conjunction with nonlinear activations for the hidden units) allows the network to perform nonlinear regression.

## Training neural networks & gradient descent

Training neural networks involves determining the network parameters that minimize the errors that the network makes. This first requires that we have a way of quantifying error. A standard way of quantifying error is to take the squared difference between the network output and the target value:

$\Large{\begin{array}{rcl}E &=& \frac{1}{2}(\text{output} - \text{target})^2\end{array}}$

(Note that the squared error is not chosen arbitrarily, but has a number of theoretical benefits and considerations. For more detail, see the following post) With an error function in hand, we then aim to find the setting of parameters that minimizes this error function. This concept can be interpreted spatially by imagining a “parameter space” whose dimensions are the values of each of the model parameters, and for which the error function will form a surface of varying height depending on its value for each parameter. Model training is thus equivalent to finding point in parameter space that makes the height of the error surface small.

To get a better intuition behind this concept, let’s define a super simple neural network, one that has a single input and a single output (Figure 4, bottom left). For further simplicity, we’ll assume the network has no bias term and thus has a single parameter, $w_1$. We will also assume that the output layer uses the logistic sigmoid activation function. Accordingly, the network will map some input value $a_0$ onto a predicted output $a_{\text{out}}$ via the following function.

$\Large{\begin{array}{rcl}a_{\text{out}} = g_{\text{logistic}}(a_0w_1)\end{array}}$

Now let’s say we want this simple network to learn the identity function: given an input of 1 it should return a target value of 1. Given this target value we can now calculate the value of the error function for each setting of $w_1$. Varying the value of $w_1$ from -10 to 10 results in the error surface displayed in the left of Figure 4.  We see that the error is small for large positive values of $w_1$, while the error is large for strongly negative values of $w_1$. This not surprising, given that the output activation function is the logistic sigmoid, which will map large values onto an output of 1.

Things become more interesting when we move from a single-layered network to a multi-layered network. Let’s repeat the above exercise, but include a single hidden node between the input and the output (Figure 4, bottom right). Again, we will assume no biases, and logistic sigmoid activations for both the hidden and output nodes. Thus the network will have two parameters: $\Large{(w_1, w_2)}$. Accordingly the 2-layered network will predict an output with the  following function:

$\Large{\begin{array}{rcl}a_{\text{out}} = g_{\text{logistic}}(g_{\text{logistic}}(a_0w_1)w_2)\end{array}}$

Now, if we vary both $w_1$ and $w_2$, we obtain the error surface in the right of Figure 4.

Figure 4: Error surface for a simple, single-layer neural network (left) and a 2-layer network (right). The goal is to map the input value 1 to the output value 1.

We see that the error function is minimized when both $w_1$ and $w_2$ are large and positive. We also see that the error surface is more complex than for the single-layered model, exhibiting a number of wide plateau regions. It turns out that the error surface gets more and more complicated as you increase the number of layers in the network and the number of units in each hidden layer. Thus, it is important to consider these phenomena when constructing neural network models.

Code Block 2: generates Figure 4 (assumes you have run Code Block 1):

% VISUALIZE ERROR SURFACE OF SIMPLE ANNS
E = {};
[w1,w2] = meshgrid(linspace(-10,10,50));
g = gSigmoid;
target = 1;
net1Output = g(w1.*target);
net2Output = g(w2.*g(w1.*target));
E{1} = (net1Output - target).^2;
E{2} = (net2Output - target).^2;
figure
for ii = 1:2
set(gcf,'Position',[100,100,960,420])
subplot(1,2,ii)
colormap(flipud(hot)); caxis([0,max(max(E{ii}))])
set(gca,'fontsize',16)
xlabel('w_{1}'), ylabel('w_{2}'), zlabel('E(w)')
axis square;
title(sprintf('Error Surface: %d-layer Network',ii))
[az, el] = view;
view([az + 180, el]);
set(gcf,'position',[100,100,1020,440])
drawnow
end



The examples in Figure 4 gives us a qualitative idea of how to train the parameters of an ANN, but we would like a more automatic way of doing so. Generally this problem is solved  using gradient descent: The gradient descent algorithm first calculates the derivative / gradient of the error function with respect  to each of the model parameters. This gradient information will give us the direction in parameter space that decreases the height of the error surface. We then take a step in that direction and repeat, iteratively calculating the gradient and taking steps in parameter space.

## The backpropagation algorithm

It turns out that the gradient information for the ANN error surface can be calculated efficiently using a message passing algorithm known as the backpropagation algorithm. During backpropagation, input signals are forward-propagated through the network toward the outputs, and network errors are then calculated with respect to target variables and back-propagated backwards towards the inputs. The forward and backward signals are then used to determine the direction in the parameter space to move that lowers the network error.

The formal calculations behind the backpropagation algorithm can be somewhat mathematically involved and may detract from the general ideas behind the learning algorithm. For those readers who are interested in the math, I have provided the formal derivation of the backpropagation algorithm in the following post (for those of you who are not interested in the math, I would also encourage you go over the derivation and try to make connections to the source code implementations provided later in the post).

Figure 5 demonstrates the key steps of the backpropagation algorithm. The main concept underlying the algorithm is that for a given observation we want to determine the degree of “responsibility” that each network parameter has for mis-predicting a target value associated with the observation. We then change that parameter according to this responsibility so that it reduces the network error.

Figure 5: The 4 main steps of the bacdkpropagation algorithm: I Forward propagate error signals to output, II Calculate output error E, and backpropagate error signal, III Use forward signal and backward signals to calculate parameter gradients, IV update network parameters.

In order to determine the network error, we first propagate the observed input forward through the network layers. This is Step I of the backpropagation algorithm, and is demonstrated in Figure 5-I. Note that in the Figure $a_k$ could be considered network output  (for a network with one hidden layer) or the output of a hidden layer that projects the remainder of the network (in the case of a network with more than one hidden layer). For this discussion, however, we assume that the index k is associated with the output layer of the network, and thus each of the network outputs is designated by $a_k$. Also note that when implementing this forward-propagation step, we should keep track of the feed-forward pre-activations $z_l$ and activations $a_l$ for all layers $l$, as these will be used for calculating backpropagated errors and error function gradients.

Step II of the algorithm is to calculate the network output error and backpropagate it toward the input. Let’s again that we are using the sum of squared differences error function:

$\Large{\begin{array}{rcl}E = \frac{1}{2}\sum_{k \in K}(a_k - t_k)^2\end{array}}$,

where we sum over the values of all $k$ output units (one in this example). We can now define an “error signal”  $\delta_k$ at the output node that will be backpropagated toward the input. The error signal is calculated as follows:

$\Large{\begin{array}{rcl} \begin{array}{rcl} \delta_k &=& g_k'(z_k)E'(a_k,t_k) \\ &=& g_k'(z_k)(a_k - t_k)\end{array}\end{array}}$.

Thus the error signal essentially weights the gradient of the error function by the gradient of the output activation function (notice there is a $z_k$ term is used in this calculation, which is why we keep it around during the forward-propagation step). We can continue backpropagating the error signal toward the input by passing $\delta_k$ through the output layer weights $w_{jk}$, summing over all output nodes, and passing the result through the gradient of the activation function at the hidden layer $g_j'(z_j)$ (Figure 5-II). Performing these operations results in the back-propagated error signal for the hidden layer, $\delta_j$:

$\Large{\begin{array}{rcl} \delta_j = g_j'(z_j)\sum_k \delta_k w_{jk}\end{array}}$,

For networks that have more than one hidden layer, this error backpropagation procedure can continue for layers $j-1, j-2, ...$, etc.

Step III of the backpropagation algorithm is to calculate the gradients of the error function with respect to the model parameters at each layer $l$ using the forward signals $a_{l-1}$, and the backward error signals $\delta_l$ . If one considers the model weights $w_{l-1, l}$ at a layer $l$ as linking the forward signal $a_{l-1}$ to the error signal $\delta_l$ (Figure 5-III), then the gradient of the error function with respect to those weights is:

$\Large{ \begin{array}{rcl} \frac{\partial E}{\partial w_{l-1, l}} = a_{l-1}\delta_l\end{array}}$

Note that this result is closely related to the concept of Hebbian learning in neuroscience. Thus the gradient of the error function with respect to the model weight at each layer can be efficiently calculated by simply keeping track of the forward-propagated activations feeding into that layer from below, and weighting those activations by the backward-propagated error signals feeding into that layer from above!

What about the bias parameters? It turns out that the same gradient rule used for the weight weights applies, except that “feed-forward activations” for biases are always +1 (see Figure 1). Thus the bias gradients for layer $l$ are simply:

$\Large{\begin{array}{rcl}\frac{\partial E}{\partial b_{l}} = (1)\delta_l = \delta_l \end{array}}$

The fourth and final step of the backpropagation algorithm is to update the model parameters based on the gradients calculated in Step III. Note that the gradients point in the direction in parameter space that will increase the value of the error function. Thus when updating the model parameters we should choose to go in the opposite direction. How far do we travel in that direction? That is generally determined by a user-defined step size (aka learning rate) parameter, $\eta$. Thu,s given the parameter gradients and the step size, the weights and biases for a given layer are updated accordingly:

$\Large{ \begin{array}{rcl} w_{l-1,l} &\leftarrow& w_{l-1,l} - \eta \frac{\partial E}{\partial w_{l-1, l}} \\ b_l &\leftarrow& b_{l} - \eta \frac{\partial E}{\partial b_{l}}\end{array}}$.

To train an ANN, the four steps outlined above and in Figure 5 are repeated iteratively by observing many input-target pairs and updating the parameters until either the network error reaches a tolerably low value, the parameters cease to update (convergence), or a set number of parameter updates has been achieved. Some readers may find the steps of the backpropagation somewhat ad hoc. However, keep in mind that these steps are formally coupled to the calculus of the optimization problem. Thus I again refer the curious reader to check out the derivation in order to make connections between the algorithm and the math.

## Example: learning the OR & AND logical operators using a single layer neural network

Here we go over an example of training a single-layered neural network to perform a classification problem. The network is trained to learn a set of logical operators including the  AND, OR, or XOR. To train the network we first generate training data. The inputs consist of 2-dimensional coordinates that span the input values $(x_1, x_2)$ values for a 2-bit truth table:

Figure 6: Truth table values learned in classification examples

We then perturb these observations by adding Normally-distributed noise. To generate target variables, we categorize each observations by applying one of logic operators (See Figure 6) to the original (no-noisy) coordinates. We then train the network with the noisy inputs and binary categories targets using the gradient descent / backpropagation algorithm. The code implementation of the network and training procedures, as well as the resulting learning process are displayed below. (Note that in this implementation, I do not use the feed-forward activations to calculate the gradients as suggested above. This is simply to make the implementation of the learning algorithm more explicit in terms of the math. The same situation also applies to the other examples in this post).

Code Block 3: Implements and trains a single-layer neural network for classification to learn logical operators (assumes you have run Code Block 1):

%% EXAMPLE: SINGLE-LAYERED NETWORK

% DEFINE DATA AND TARGETS
data = [0 0; 0 1; 1 0; 1 1;];
classAND = and(data(:,1)&gt;0,data(:,2)&gt;0);
classOR = or(data(:,1)&gt;0,data(:,2)&gt;0);
classXOR = xor(data(:,1)&gt;0,data(:,2)&gt;0);

% THE TYPE OF TRUTH TABLE TO LEARN (UNCOMMENT FOR OTHERS)
classes = classOR
% classes = classAND;
% classes = classXOR;

% MAKE MULTIPLE NOISY TRAINING OBSERVATIONS
nRepats = 30;
data = repmat(data, [nRepats, 1]);
classes = repmat(classes, [nRepats, 1]);
data = data + .15*randn(size(data));

% SHUFFLE DATA
shuffleIdx = randperm(size(data,1));
data = data(shuffleIdx,:);
classes = classes(shuffleIdx);

% INITIALIZE MODEL PARAMETERS
[nObs,nInput] = size(data); % # OF INPUT DIMENSIONS
nOutput = 1;    			% # OF TARGET/OUTPUT DIMENSIONS
lRate = 3;    				% LEARNING RATE FOR PARAMETERS UPDATE
nIters = 80;  				% # OF ITERATIONS

% DECLARE ACTIVATION FUNCTIONS (AND DERIVATIVES)
g_out = gSigmoid; gPrime_out = gPrimeSigmoid;

% INITIALIZE RANDOM WEIGHTS
W_out = (rand(nInput,nOutput)-.5);
b_out = (rand(1,nOutput)-.5);

% SOME OTHER INITIALIZATIONS
% (FOR VISUALIZATION)
visRange = [-.2 1.2];
[xx,yy] = meshgrid(linspace(visRange(1), visRange(2),100));
iter = 1;
mse = zeros(1,nIters);

figure
set(gcf,'Position',[100,100,960,420])
while 1
err = zeros(1,nObs);
% LOOP THROUGH THE EXAMPLES
for iO = 1:nObs

% GET CURRENT NETWORK INPUT DATA AND TARGET
input = data(iO,:);
target = classes(iO);

%% I. FORWARD PROPAGATE DATA THROUGH NETWORK
z_out = input*W_out + b_out; % OUTPUT UNIT PRE-ACTIVATIONS
a_out = g_out(z_out);        % OUTPUT UNIT ACTIVATIONS

%% II. BACKPROPAGATE ERROR SIGNAL
% CALCULATE ERROR DERIVATIVE W.R.T. OUTPUT
delta_out = gPrime_out(z_out).*(a_out - target);

%% III. CALCULATE GRADIENT W.R.T. PARAMETERS...
dEdW_out = delta_out*input;
dEdb_out = delta_out*1;

%% IV. UPDATE NETWORK PARAMETERS
W_out = W_out - lRate*dEdW_out';
b_out = b_out - lRate*dEdb_out';

% CALCULATE ERROR FUNCTION
err(iO) = .5*(a_out-target).^2;
end
mse(iter) = mean(err);

% DISPLAY LEARNING
clf; subplot(121); hold on;
set(gca,'fontsize',16)
netOut = g_out(bsxfun(@plus,[xx(:),yy(:)]*W_out, b_out));
contourf(xx,yy,reshape(netOut,100,100)); colormap(flipud(spring))
hold on;
gscatter(data(:,1),data(:,2),classes,[0 0 0 ; 1 1 1],[],20,'off');
title(sprintf('Iteration %d',iter))
xlim([visRange(1) visRange(2)]),ylim([visRange(1) visRange(2)]);
axis square

subplot(122);
set(gca,'fontsize',16)
plot(1:iter,mse(1:iter));
xlabel('Iteration')
ylabel('Mean Squared Error')
axis square
m1(iter) = getframe(gcf);

if iter &gt;= nIters
break
end
iter = iter + 1;
end


Figure 7: Single layer neural network (perceptron) learning a noisy OR mapping.

Figure 7 displays the procedure for learning the OR mapping. The left plot displays the training data and the network output at each iteration. White dots are training points categorized “1” while black dots are categorized “0”. Yellow regions are where the network predicts values of “0”, while magenta highlights areas where the network predicts “1”. We see that the single-layer network is able to easily separate the two classes.  The right plot shows how the error function decreases with each training iteration. The smooth trajectory of the error indicates that the error surface is also fairly smooth.

Figure 8: Single layer neural network (perceptron) learning a noisy AND mapping.

Figure 8 demonstrates an analogous example, but instead learning the AND operator (by executing Code Block 3, after un-commenting line 11). Again, the  categories can be easily separated by a plane, and thus the single-layered network easily learns an accurate predictor of the data.

## Going Deeper: nonlinear classification and multi-layer neural networks

Figures 7 and 8 demonstrate how a single-layered ANN can easily learn the OR and AND operators. This is because the categorization criterion for these logical operators can be represented in the input space by a single linear function (i.e. line/plane). What about more complex categorization criterion that cannot be represented by a single plane? An example of a more complex binary classification criterion is the XOR operator (Figure 6, far right column).

Below we attempt to train the single-layer network to learn the XOR operator (by executing Code Block 3, after un-commenting line 12). The single layer network is unable to learn this nonlinear mapping between the inputs and the targets. However, it turns out we can learn the XOR operator using a multi-layered neural network.

Figure 9: Single layer neural network (perceptron) attempting to learn a noisy XOR mapping. The single layer network chokes on this nonlinear problem.

Below we train a two-layer neural network on the XOR dataset. The network incorporates a hidden layer with 3 hidden units and logistic sigmoid activation functions for all units in the hidden and output layers (see Code Block 4, lines 32-33).

Code Block 4: Implements and trains a two-layer neural network for classification to learn XOR operator and more difficult “ring” problem (Figures 10 & 11; assumes you have run Code Block 1):


%% EXAMPLE: MULTI-LAYER NEURAL NETWORK FOR CLASSIFICATION
data = [0 0; 0 1; 1 0; 1 1;];
classXOR = xor(data(:,1)&gt;0,data(:,2)&gt;0);

% THE TYPE OF TRUTH TABLE TO LEARN
classes = classXOR;

% UNCOMMENT FOR MOR DIFFICULT DATA...
% data = [data; .5 .5; 1 .5; 0 .5; .5 0; .5 1];
% classRing = [1; 1; 1; 1; 0; 1; 1; 1; 1];
% classes = classRing;

% CREATE MANY NOISY OBSERVATIONS
nRepats = 30;
data = repmat(data, [nRepats, 1]);
classes = repmat(classes, [nRepats, 1]);
data = data + .15*randn(size(data));

% SHUFFLE OBSERVATIONS
shuffleIdx = randperm(size(data,1));
data = data(shuffleIdx,:);
classes = classes(shuffleIdx);

% INITIALIZE MODEL PARAMETERS
[nObs,nInput] = size(data);   	% # OF INPUT DIMENSIONS
nHidden = 3;    				% # OF HIDDEN UNITS

lRate = 2;    	% LEARNING RATE FOR PARAMETERS UPDATE
nIters = 300;   % # OF ITERATIONS

% DECLARE ACTIVATION FUNCTIONS (AND DERIVATIVES)
g_hid = gSigmoid; gPrime_hid = gPrimeSigmoid;
g_out = gSigmoid; gPrime_out = gPrimeSigmoid;

% INITIALIZE WEIGHTS
W_hid = (rand(nInput,nHidden)-.5);
b_hid = (rand(1,nHidden)-.5);
W_out = (rand(nHidden,nOutput)-.5);
b_out = (rand(1,nOutput)-.5);

iter = 1;
mse = zeros(1,nIters);
figure
set(gcf,'Position',[100,100,960,420])
% MAIN TRAINING ALGORITHM
while 1
err = zeros(1,nObs);
% LOOP THROUGH THE EXAMPLES
for iO = 1:nObs

% GET CURRENT NETWORK INPUT DATA AND TARGET
input = data(iO,:);
target = classes(iO);

%% I. FORWARD PROPAGATE DATA THROUGH NETWORK
z_hid = input*W_hid + b_hid; % HIDDEN UNIT PRE-ACTIVATIONS
a_hid = g_hid(z_hid);        % HIDDEN UNIT ACTIVATIONS
z_out = a_hid*W_out + b_out; % OUTPUT UNIT PRE-ACTIVATIONS
a_out = g_out(z_out);        % OUTPUT UNIT ACTIVATIONS

%% II.  BACKPROPAGATE ERROR SIGNAL
% CALCULATE ERROR DERIVATIVE W.R.T. OUTPUT
delta_out = gPrime_out(z_out).*(a_out - target);

% CALCULATE ERROR CONTRIBUTIONS FOR HIDDEN NODES...
delta_hid = gPrime_hid(z_hid)'.*(delta_out*W_out);

%% III. CALCULATE GRADIENT W.R.T. PARAMETERS...
dEdW_out = delta_out*a_hid;
dEdb_out = delta_out*1;
dEdW_hid = delta_hid*input;
dEdb_hid = delta_hid*1;

%% IV. UPDATE NETWORK PARAMETERS
W_out = W_out - lRate*dEdW_out';
b_out = b_out - lRate*dEdb_out';

W_hid = W_hid - lRate*dEdW_hid';
b_hid = b_hid - lRate*dEdb_hid';

% CALCULATE ERROR FUNCTION
err(iO) = .5*(a_out-target).^2;
end
mse(iter) = mean(err);

% DISPLAY LEARNING
clf; subplot(121); hold on;
set(gca,'fontsize',16)

netOut = g_out(bsxfun(@plus,g_hid(bsxfun(@plus,[xx(:),yy(:)]*W_hid, b_hid))*W_out, b_out));
contourf(xx,yy,reshape(netOut,100,100)); colormap(flipud(spring))
hold on;
gscatter(data(:,1),data(:,2),classes,[0 0 0; 1 1 1],[],20,'off');
title(sprintf('Iteration %d',iter))
xlim([visRange(1), visRange(2)]),ylim([visRange(1), visRange(2)]);
axis square

subplot(122);
set(gca,'fontsize',16)
plot(1:iter,mse(1:iter));
xlabel('Iteration')
ylabel('Mean Squared Error')
axis square
m2(iter) = getframe(gcf);

if iter &gt;= nIters
break
end
iter = iter + 1;
end


Figure 10: A multi-layer neural network (perceptron) attempting to learn a noisy XOR mapping. The multi-layer network easily learns this nonlinear problem.

Figure 10 displays the learning process for the 2-layer network. The formatting for Figure 10 is analogous to that for Figures 7-9. The 2-layer network is easily able to learn the XOR operator. We see that by adding a hidden layer between the input and output, the ANN is able to learn the nonlinear categorization criterion!

Figure 11 shows the results for learning a even more difficult nonlinear categorization function: points in and around $(x1, x2) = (0.5 0.5)$ are categorized as “0”, while points in a ring surrounding the “0” datapoints are categorized as a “1” (Figure 11). This example is run by executing Code Block 4 after un-commenting  lines 9-11.

Figure 11: Multilayer neural network learning a nonlinear binary classification task

Figure 11 shows the learning process. Again formatting is analogous to the formatting in Figures 8-10. The 2-layer ANN is able to learn this difficult classification criterion.

## Example: Neural Networks for Regression

The previous examples demonstrated how ANNs can be used for classification by using a logistic sigmoid as the output activation function. Here we demonstrate how, by making the output activation function the linear/identity function, the same 2-layer network architecture can be used to implement nonlinear regression.

For this example we define a dataset comprised of 1D inputs, $\mathbf{x}$ that range from $(-5, 5)$. We then generate noisy targets $\mathbf y$ according to the function:

$\Large{\begin{array}{rcl}\mathbf{y} = f(\mathbf{x}) + \mathbf{\epsilon}\end{array}}$

where $f(x)$ is a nonlinear data-generating function and $\mathbf \epsilon$ is Normally-distributed noise. We then construct a two-layered network with tanh activation functions used in the hidden layer and linear outputs. For this example we set the number of hidden units to 3 and train the model as we did for categorization using gradient descent / backpropagation. The results of the example are displayed below.

Code Block 5: Trains two-layer network for regression problems (Figures 11 & 12; assumes you have run Code Block 1):

%% EXAMPLE: NONLINEAR REGRESSION

% DEFINE DATA-GENERATING FUNCTIONS f(x)
xMin = -5; xMax = 5;
xx = linspace(xMin, xMax, 100);
f = inline('2.5 + sin(x)','x');
% f = inline('abs(x)','x'); % UNCOMMENT FOR FIGURE 13
yy = f(xx) + randn(size(xx))*.5;

% FOR SHUFFLING OBSERVATIONS
shuffleIdx = randperm(length(xx));
data = xx;
targets = yy;

% INITIALIZE MODEL PARAMETERS
nObs = length(data); 	% # OF INPUT DIMENSIONS
nInput = 1;				% # OF INPUTS
nHidden = 3; 			% # OF HIDDEN UNITS
nOutput = 1;			% # OF TARGET/OUTPUT DIMENSIONS
lRate = .15;   			% LEARNING RATE FOR PARAMETERS UPDATE
nIters = 200;  			% # OF ITERATIONS

cols = lines(nHidden);

% DECLARE ACTIVATION FUNCTIONS (AND DERIVATIVES)
g_hid = gTanh;  		   % HIDDEN UNIT ACTIVATION
gPrime_hid = gPrimeTanh;   % GRAD OF HIDDEN UNIT ACTIVATION
g_out = gLinear; 		   % OUTPUT ACTIVATION
gPrime_out = gPrimeLinear; % GRAD. OF OUTPUT ATIVATION

% % INITIALIZE WEIGHTS
W_hid = (rand(nInput,nHidden)-.5);
b_hid = (rand(1,nHidden)-.5);
W_out = (rand(nHidden,nOutput)-.5);
b_out = (rand(1,nOutput)-.5);

% INITIALIZE SOME THINGS..
% (FOR VISUALIZATION)
mse = zeros(1,nIters);
visRange = [xMin, xMax];
figure
set(gcf,'Position',[100,100,960,420])
iter = 1;
while 1

err = zeros(1,nObs);
% LOOP THROUGH THE EXAMPLES
for iO = 1:nObs

% GET CURRENT NETWORK INPUT DATA AND TARGET
input = data(iO);
target = targets(iO);

%% I. FORWARD PROPAGATE DATA THROUGH NETWORK
z_hid = input*W_hid + b_hid; % HIDDEN UNIT PRE-ACTIVATIONS
a_hid = g_hid(z_hid);        % HIDDEN UNIT ACTIVATIONS
z_out = a_hid*W_out + b_out; % OUTPUT UNIT PRE-ACTIVATIONS
a_out = g_out(z_out);        % OUTPUT UNIT ACTIVATIONS

%% II. BACKPROPAGATE ERROR SIGNAL
% CALCULATE ERROR DERIVATIVE W.R.T. OUTPUT
delta_out = gPrime_out(z_out).*(a_out - target);

%% CALCULATE ERROR CONTRIBUTIONS FOR HIDDEN NODES...
delta_hid = gPrime_hid(z_hid)'.*(delta_out*W_out);

%% III. CALCULATE GRADIENT W.R.T. PARAMETERS...
dEdW_out = delta_out*a_hid;
dEdb_out = delta_out*1;
dEdW_hid = delta_hid*input;
dEdb_hid = delta_hid*1;

%% IV. UPDATE NETWORK PARAMETERS
W_out = W_out - lRate*dEdW_out';
b_out = b_out - lRate*dEdb_out';

W_hid = W_hid - lRate*dEdW_hid';
b_hid = b_hid - lRate*dEdb_hid';

% CALCULATE ERROR FUNCTION FOR BATCH
err(iO) = .5*(a_out-target).^2;
end
mse(iter) = mean(err); % UPDATE ERROR

% DISPLAY LEARNING
clf; subplot(121); hold on;
set(gca,'fontsize',14)

plot(xx,f(xx),'m','linewidth',2);
hold on;
scatter(xx, yy ,'m');

% PLOT TOTAL NETWORK OUTPUT
netOut = g_out(g_hid(bsxfun(@plus, xx'*W_hid, b_hid))*W_out + b_out);
plot(xx, netOut, 'k','linewidth', 2)

% PLOT EACH HIDDEN UNIT'S OUTPUT FUNCTION
for iU = 1:nHidden
plot(xx,g_hid(xx*W_hid(iU) + b_hid(iU)),'color',cols(iU,:),'Linewidth',2, ...
'Linestyle','--');
end

% TITLE AND LEGEND
title(sprintf('Iteration %d',iter))
xlim([visRange(1) visRange(2)]),ylim([visRange(1) visRange(2)]);
axis square
legend('f(x)', 'Targets', 'Network Output','Hidden Unit Outputs','Location','Southwest')

% PLOT ERROR
subplot(122);
set(gca,'fontsize',14)
plot(1:iter,mse(1:iter));
xlabel('Iteration')
ylabel('Mean Squared Error')
axis square; drawnow

% ANNEAL LEARNING RATE
lRate = lRate *.99;
if iter &gt;= nIters
break
end
iter = iter + 1;
end


Figure 12: A two-layered ANN used for regression. The network approximates the function f(x) = sin(x) + 2.5

The training procedure for $f(x): \sin(x) + 2.5$ is visualized in the left plot of Figure 12. The data-generating function $f(x)$ is plotted as the solid magenta line, and the noisy target values used to train the network are plotted as magenta circles. The output of the network at each training iteration is plotted in solid black while the output of each of the tanh hidden units is plotted in dashed lines. This visualization demonstrates how multiple nonlinear functions can be combined to form the complex output target function. The mean squared error at each iteration is plotted in the right plot of Figure 12. We see that the error does not follow a simple trajectory during learning, but rather undulates, demonstrating the non-convexity of the error surface.

Figure 13 visualizes the training procedure for trying to learn a different nonlinear function, namely $f(x): \text{abs}(x)$ (by running Code Block 5, after un-commenting out line 7). Again, we see how the outputs of the hidden units are combined to fit the desired data-generating function. The mean squared error again follows an erratic path during learning.

Figure 13: A two-layered ANN used for regression. The network approximates the function f(x) = abs(x)

Notice for this example that I added an extra implementation detail known as simulated annealing (line 118) that was absent in the classification examples. This technique decreases the learning rate after every iteration thus making the algorithm take smaller and smaller steps in parameter space.  This technique can be useful when the gradient updates begin oscillating between two or more locations in the parameter space. It is also helpful for influencing the algorithm to settle down into a steady state.

## Wrapping up

In this post we covered the main ideas behind artificial neural networks including: single- and multi-layer ANNs, activation functions and their derivatives, a high-level description of the backpropagation algorithm, and a number of classification and regression examples. ANNs, particularly mult-layer ANNs, are a robust and powerful class of models that can be used to learn complex, nonlinear functions. However, there are a number of considerations when using neural networks including:

• How many hidden layers should one use?
• How many hidden units in each layer?
• How do these relate to overfitting and generalization?
• Are there better error functions than the squared difference?
• What should the learning rate be?
• What can we do about the complexity of error surface with deep networks?
• Should we use simulated annealing?
• What about other activation functions?

It turns out that there are no easy or definite answers to any of these questions, and there is active research focusing on each topic. This is why using ANNs is often considered as much as a “black art” as it is a quantitative technique.

One primary limitation of ANNs is that they are supervised algorithms, requiring a target value for each input observation in order to train the network. This can be prohibitive for training large networks that may require lots of training data to adequately adjust the parameters. However, there are a set of unsupervised variants of ANNs that can be used to learn an initial condition for the ANN (rather than from randomly-generated initial weights) without the need of target values. This technique of “unsupervised pretraining” has been an important component of many “deep learning” models used in AI and machine learning. In future posts, I look forward to covering two of these unsupervised neural networks: autoencoders and restricted Boltzmann machines.

## Model Selection: Underfitting, Overfitting, and the Bias-Variance Tradeoff

In machine learning and pattern recognition, there are many ways (an infinite number, really) of solving any one problem. Thus it is important to have an objective criterion for assessing the accuracy of candidate approaches and for selecting the right model for a data set at hand. In this post we’ll discuss the concepts of under- and overfitting and how these phenomena are related to the statistical quantities bias and variance. Finally, we will discuss how these concepts can be applied to select a model that will accurately generalize to novel scenarios/data sets.

## Models for Regression

When performing regression analyses we would like to characterize how the value of some dependent variable changes as some independent variable $x$ is varied. For example, say we would like to characterize the firing rate of a neuron in visual cortex as we vary the orientation of a grating pattern presented to the eye. We assume that there is some true relationship function $f(x)$ that maps the independent variable values (i.e. the angle of the grating pattern) onto the dependent variable values (i.e. firing rate). We would like to determine the form of the function $f(x)$ from observations of independent-dependent value pairs (I may also refer to these as input-output pairs, as we can think of the function $f(x)$ taking $x$ as input and producing an output). However, in the real world, we don’t get to observe $f(x)$ directly, but instead get noisy observations $y$, where

(1) $y = f(x) + \epsilon$

Here we will assume that $\epsilon$ is random variable distributed according to a zero-mean Gaussian with standard deviation $\sigma^2$. Note that because $\epsilon$ is a random variable, $y$ is also a random variable (with a mean that is on conditioned on both $x$ and $f(x)$, and a variance $\sigma^2)$.

As an example, say that the true function $f(x)$ we want to determine has the the following form (though we don’t know it):

$f(x) = \sin(\pi x)$

Thus the observations $y$ we get to see have the following distribution.

$y = \sin(\pi x) + \mathcal N(0,\sigma^2)$

Below we define the function $f(x)$ and display it, then draw a few observation samples $y$, and display them as well:

% CREATE A RANDOM DATASET
rng(10)
nData = 10; % N
x = 2*(rand(1,nData)-.5);
xGrid = linspace(-1,1,100);

% DEFINE AND TARGET FUNCTION f(x)
f = inline('sin(pi*x)','x');

h = [];
h(1) = plot(xGrid,f(xGrid),'k','Linewidth',2);
xlabel('x')
hold on

% DEFINE AND DISPLAY y
noiseSTD = .1;
y = f(x) + noiseSTD*randn(size(x));
h(2) = scatter(x,y,'ko');
legend(h,{'f(x)','y'},'Location','Northwest');


Again, the goal here is to characterized the function $f(x)$. However, since we don’t know the function form of $f(x)$, we must instead estimate some other function $g(x)$ that we think will be a good approximation to $f(x)$. Thus we call $g(x)$ an estimator of $f(x)$. In general, an estimator is some parameterized model that can capture a wide range of functional forms. One such class of estimators is the weighted combination of ordered polynomials:

$g_N(x) = \theta_0 + \theta_1x + \theta_2x^2 + \dots \theta_N x^N$

As the polynomial order $N$ increases, the functions $g_N(x)$ are able to capture increasingly complex behavior. For example, $g_0(x)$ desribes a horizontal line with an adjustable vertical offset, $g_1(x)$ desribes a line with adjustable vertical offset and adjustable slope, $g_2(x)$ describes a function that also includes a quadratic term. We thus try to fit the values of the parameters for a given estimator $g_N(x)$  to best account for observed data in the hopes that we will also accurately approximate $f(x)$.

Below we estimate the parameters of three polynomial model functions of increasing complexity (using Matlab’s $\texttt{polyfit.m}$) to the sampled data displayed above. Specifically, we estimate the functions $g_1(x)$, $g_3(x)$ and $g_{10}(x)$.

% FIT POLYNOMIAL MODELS & DISPLAY
% (ASSUMING PREVIOUS PLOT ABOVE STILL AVAILABLE)
degree = [1,3,10];
theta = {};
cols = [.8 .05 0.05; 0.05 .6 0.05; 0.05 0.05 .6];
for iD = 1:numel(degree)
figure(1)
theta{iD} = polyfit(x,y,degree(iD));
fit{iD} = polyval(theta{iD},xGrid);
h(end+1) = plot(xGrid,fit{iD},'color',cols(iD,:),'Linewidth',2);
xlim([-1 1])
ylim([-1 1])
end
legend(h,'f(x)','y','g_1(x)','g_3(x)','g_{10}(x)','Location','Northwest')


Qualitatively, we see that the estimator $g_1(x)$ (red line) provides a poor fit to the observed data, as well as a poor approximation to the function $f(x)$ (black curve). We see that the estimator $g_{10}(x)$ (blue curve) provides a very accurate fit to the data points, but varies wildly to do so, and therefore provides an inaccurate approximation of $f(x)$. Finally, we see that the estimator $g_3(x)$ (green curve) provides a fairly good fit to the observed data, and a much better job at approximating $f(x)$.

Our original goal was to approximate $f(x)$, not the data points per se. Therefore $g_3(x)$, at least qualitatively, provides a more desirable estimate of $f(x)$ than the other two estimators. The fits for $g_1(x)$ and $g_{10}(x)$ are examples of “underfitting” and “overfitting” to the observed data. Underfitting occurs when an estimator $g(x)$ is not flexible enough to capture the underlying trends in the observed data. Overfitting occurs when an estimator is too flexible, allowing it to capture illusory trends in the data. These illusory trends are often the result of the noise in the observations $y$.

## Variance and Bias of an Estimator

The model fits for $g_N(x)$ discussed above were based on a single, randomly-sampled data set of observations $y$. However, there are in principle, a potentially infinite number of data sets that can be observed (drawn from $y$). In order to determine a good model of $f(x)$, it would be helpful to have an idea of how an estimator will perform on any or all of these potential datasets.  To get an idea of how each of the estimators discussed above performs in general we can repeat the model fitting procedure for many data sets.

Here we perform such an analyses, sampling 50 independent data sets according to Equation (1) (with $\sigma=0.1$), then fitting the parameters for the polynomial functions of model order $N = [1,3,10]$ to each dataset.

% FIT MODELS TO K INDEPENDENT DATASETS
K = 50;
for iS = 1:K
ySim = f(x) + noiseSTD*randn(size(x));
for jD = 1:numel(degree)
% FIT THE MODEL USING polyfit.m
thetaTmp = polyfit(x,ySim,degree(jD));
% EVALUATE THE MODEL FIT USING polyval.m
simFit{jD}(iS,:) = polyval(thetaTmp,xGrid);
end
end

% DISPLAY ALL THE MODEL FITS
h = [];
for iD = 1:numel(degree)
figure(iD+1)
hold on
% PLOT THE FUNCTION FIT TO EACH DATASET
for iS = 1:K
h(1) = plot(xGrid,simFit{iD}(iS,:),'color',brighten(cols(iD,:),.6));
end
% PLOT THE AVERAGE FUNCTION ACROSS ALL FITS
h(2) = plot(xGrid,mean(simFit{iD}),'color',cols(iD,:),'Linewidth',5);
% PLOT THE UNDERLYING FUNCTION f(x)
h(3) = plot(xGrid,f(xGrid),'color','k','Linewidth',3);
% CALCULATE THE SQUARED ERROR AT EACH POINT, AVERAGED ACROSS ALL DATASETS
squaredError = (mean(simFit{iD})-f(xGrid)).^2;
% PLOT THE SQUARED ERROR
h(4) = plot(xGrid,squaredError,'k--','Linewidth',3);
uistack(h(2),'top')
hold off
axis square
xlim([-1 1])
ylim([-1 1])
legend(h,{sprintf('Individual g_{%d}(x)',degree(iD)),'Mean of All Fits','f(x)','Squared Error'},'Location','WestOutside')
title(sprintf('Model Order=%d',degree(iD)))
end


Here are the results for the estimator $g_1(x)$

…and for the estimator $g_3(x)$, …

… and for $g_{10}(x).$

The lightly-colored curves in each of the three plots above are an individual polynomial model fit to one of the 50 sampled data sets. The darkly-colored curve in each plot is the average over the 50 individual fits. The dark black curve is the true, underlying function $f(x)$.

We see that for the estimator $g_1(x)$ (red curves), model fits do not vary much from data set to data set. Thus the expected (mean) estimator fit over all the data sets, formally written as $\mathbb E[g(x)]$, is very similar to all the individual fits.

A common formal definition used in statistics to describe how much a single estimator deviates from the average estimator over datasets is the variance of the estimator. Formally defined as

$variance=\mathbb E[(g(x)-\mathbb E[g(x)])^2]$

the variance is the average squared difference between any single data-set-dependent estimate of $g(x)$ and the average value of $g(x)$ estimated over all datasets.

According to the definition of variance, we can say that the estimator $g_1(x)$ exhibits low variance.

A commonly-used metric in statistics for assessing how an estimator $g(x)$ approximates a target function $f(x)$, based on behavior over many observed data sets is what is called the bias of the estimator. Formally defined as:

$bias = \mathbb E[g(x)] - f(x)$

The bias describes how much the average estimator fit over datasets $\mathbb E[g(x)]$ deviates from the value of the underlying target function $f(x)$.

We can see from the plot for $g(x)_1$ that $\mathbb E[g_1(x)]$ deviates significantly from $f(x)$. Thus we can say that the estimator $g_1(x)$ exhibits large bias when approximating the function $f(x)$.

Investigating the results for the estimator $g_{10}(x)$ (blue curves), we see that each individual model fit varies dramatically from one data set to another. Thus we can say that this estimator exhibits large variance. Looking at $\mathbb E[g_{10}(x)]$ (the dark blue curve), we see that on average the estimator $g_{10}(x)$ provides a better approximation to $f(x)$ than the estimator $g_1(x)$. Thus we can say that $g_{10}(x)$ exhibits a lower bias than the estimator $g_1(x)$.

Investigating the fits for the estimator $g_3(x)$ (green curves), we find that this estimator has low bias. Furthermore, the average estimator $\mathbb E[g_3(x)]$ (dark green curve) accurately approximates the true function $f(x)$, telling us that that the estimator $g_3(x)$ also has low bias.

We established earlier that the estimator $g_3(x)$ provided a qualitatively better fit to the function $f(x)$ than the other two polynomial estimators for a single dataset. It appears that this is also the case over many datasets. We also find that estimator $g_3(x)$ exhibits low bias and low variance, whereas the other two, less-desirable estimators, have either high bias or high variance. Thus it would appear that having both low bias and low variance is a reasonable criterion for selecting an accurate model of $f(x)$.

Included in each of the three plots above is a dashed black line representing the squared difference between the average estimator $\mathbb E[g_N(x)]$ and the true function $f(x)$. Calculating squared model errors is a common practice for quantifying the goodness of a model fit. If we calculate the expected value of each of the dashed black curves (and assuming that all values of $x$ are equally likely to occur), we would obtain a single value for each estimator that is the mean squared error (MSE) between the expected estimator and the true function.

$\mathbb E[(\mathbb E[g(x)]-f(x))^2] = \frac{1}{N}\sum_{i=1}^N (\mathbb E[g(x)]-f(x))^2$

For the estimator $g_3(x)$, the MSE will be very small, as the dashed black curve for this estimator is near zero for all values of $x$. The estimators $g_1(x)$ and $g_{10}(x)$ would have significantly larger values. Now, because exhibiting both a low MSE, as well as having both low bias and variance are indicative of a good estimator, it would be reasonable to assume that squared model error is directly related to bias and variance. The next section provides some formal evidence for this notion.

## Expected Prediction Error and the Bias-variance Tradeoff

For a given estimator $g(x)$ fit to a data set of $xy$ pairs, we would like to know, given all the possible datasets out there, what is the expected prediction error we will observe for a new data point $x^*, y^* = f(x^*) + \epsilon$. If we define prediction error to be the squared difference in model prediction $g(x^*)$ and observations $y^*$, the expected prediction error is then:

$\mathbb E[(g(x^*) - y^*)^2]$

If we expand this a little and use a few identities, something interesting happens:

(2) $\mathbb E[(g(x^*) - y^*)^2] = \mathbb E[g(x^*)^2-2g(x^*)y^*+y^{*2}]=$

(3) $\mathbb E[g(x^*)^2]-2\mathbb E[g(x^*)y^*]+\mathbb E[y^{*2}]=$

(4) $\mathbb E[(g(x^*)-\mathbb E[g(x^*)])^2] + \mathbb E[g(x^*)]^2-2 \mathbb E[g(x^*)]f(x^*) + \mathbb E[(y^*-f(x^*))^2] + f(x^*)^2$

where we have applied the following Lemma to the first and third terms of Equation (3), and use the fact to $\mathbb E[y] = f(x)$ (Think of averaging over an infinite number of datasets sampled from y; all noise will average out, leaving $f(x)$). Rearranging Equation (4), we obtain

(5) $\mathbb E[(g(x^*)-\mathbb E[g(x^*)])^2] + \mathbb E[g(x^*)]^2 - 2 \mathbb E[g(x^*)]f(x^*) + f(x^*)^2 + \mathbb E[(y^*-f(x^*))^2]$

which can be further simplified and grouped into three terms

(6) $\mathbb E[(g(x^*) - \mathbb E[g(x^*)])^2]$ $+\mathbb (E[g(x^*)]-f(x^*))^2$ $+ \mathbb E[(y^*-f(x^*))^2]$

1. The first term is the variance of the estimator introduced above.
2. The second term is the square of the bias of the estimator, also introduced above.
3. The third term is the variance of the observation noise and describes how much the observations $y$ vary from the true function $f(x)$. Notice that the noise term does not depend on the estimator $g(x)$. Thus the noise term is a constant that places a lower bound on expected prediction error.

Here we find that the expected prediction error on new data $(x^*,y^*)$ (in the squared differences sense) is the combination of three terms:

Expected prediction error = estimator variance + squared estimator bias + noise

Thus the expected prediction error on new data can be used as a quantitative criterion for selecting the best model from a candidate set of estimators! It turns out that, given $N$ new data points $(\bold x^*,\bold y^*)$, the expected prediction error can be easily approximated as the mean squared error over data pairs:

$\mathbb E[(g(\bold x^*) - \bold y^*)^2] \approx \frac{1}{N}\sum_{i=1}^N(g(x_i^*)-y_i^*)^2$

Below we demonstrate these findings with another set of simulations. We simulate 100 independent datasets, each with 25 $xy$ pairs. We then partition each dataset into two non-overlapping sets: a training set using for fitting model parameters, and a testing set used to calculate prediction error. We then fit the parameters for estimators of varying complexity. Complexity is varied by using polynomial functions that range in model order from 1 (least complex) to 12 (most complex). We then calculate and display the squared bias, variance, and error on testing set for each of the estimators:

N = 25; % # OF OBSERVATIONS PER DATASET
K = 100;% # OF DATASETS
noiseSTD = .5; % NOISE STANDARDE DEV.
nTrain = ceil(N*.9); % # OF TRAINING POINTS
nPolyMax = 12; % MAXIMUM MODEL COMPLEXITY

% # INITIALIZE SOME VARIABLES
xGrid = linspace(-1,1,N);
meanPrediction = zeros(K,N);
thetaHat = {};
x = linspace(-1,1,N);
x = x(randperm(N));
for iS = 1:K % LOOP OVER DATASETS
% CREATE OBSERVED DATA, y
y = f(x) + noiseSTD*randn(size(x));

% CREATE TRAINING SET
xTrain = x(1:nTrain);
yTrain = y(1:nTrain);

% CREATE TESTING SET
xTest = x(nTrain+1:end);
yTest = y(nTrain+1:end);

% FIT MODELS
for jD = 1:nPolyMax

% MODEL PARAMETER ESTIMATES
thetaHat{jD}(iS,:) = polyfit(xTrain,yTrain,jD);

% PREDICTIONS
yHatTrain{jD}(iS,:) = polyval([thetaHat{jD}(iS,:)],xTrain); TRAINING SET
yHatTest{jD}(iS,:) = polyval([thetaHat{jD}(iS,:)],xTest);% TESTING SET

% MEAN SQUARED ERROR
trainErrors{jD}(iS) = mean((yHatTrain{jD}(iS,:) - yTrain).^2); % TRAINING
testErrors{jD}(iS) = mean((yHatTest{jD}(iS,:) - yTest).^2); % TESTING
end
end

% CALCULATE AVERAGE PREDICTION ERROR, BIAS, AND VARIANCE
for iD = 1:nPolyMax
trainError(iD) = mean(trainErrors{iD});
testError(iD) = mean(testErrors{iD});
biasSquared(iD) = mean((mean(yHatTest{iD})-f(xTest)).^2);
variance(iD) = mean(var(yHatTest{iD},1));
end
[~,bestModel] = min(testError);

% DISPLAY
figure;
hold on;
plot(testError,'k','Linewidth',2);
plot(biasSquared,'r','Linewidth',2);
plot(variance,'b','Linewidth',2);
plot(biasSquared + variance,'m-.','Linewidth',2);
yl = ylim;
plot([bestModel,bestModel],[yl(1),yl(2)],'k--');
xlim([1,nPolyMax]);
xlabel('Model Complexity (Polynomial Order)')
legend('Test Error','Bias^2','Variance','Bias^2+Var.','Best Model')
hold off;


Here we see how, as the model complexity increases, the estimator variance (blue curve) gradually increases. Additionally, as model complexity increases, the squared bias (red curve) decreases. Thus there is a tradeoff between bias and variance that comes with model complexity: models that are too complex will have high variance and low bias; models that are too simple will have high bias and low variance. The best model will have both low bias and low variance. In this example, we highlight the best estimator in terms of prediction error on the testing set (black curve) with a dashed black vertical line. The best estimator corresponds to a polynomial model of order of $N=3$. Notice that the vertical black line is located where function defined by the sum of the squared bias and variance (dashed magenta curve) is also at a minimum.

Notice also how the sum of the squared bias and variance also has the same shape as curve defined by the prediction error on the testing set. This exemplifies how the error on novel data can be used as a proxy for determining the best estimator from a candidate set based on squared bias and variance. The noise term in Equation (6) is also represented in the figure by the vertical displacement between the black curve and dashed magenta curve.

It is very important to point out that all of these results are based on evaluating prediction error on novel data, not used to estimate model parameters. In fact assessing a model performance based on prediction error calculated on the same data used to estimate the model parameters is highly problematic, as it causes models to always overfit. In plain terms, this means that we will always favor a more complex model if we assess goodness of model fits on the training data, as a more complex model will be better able to capture small, random trends in the data due to noise.

This overfitting phenomenon is demonstrated below. We plot the error calculated on the training set (Train Error, green curve) along with the error calculated on the testing set (Test Error, black curve) for the above simulation. We also identify the best estimator as we did above.

% DISPLAY
figure, hold on;
plot(trainError,'g','Linewidth',2);
plot(testError,'k','Linewidth',2);
yl = ylim;
plot([bestModel,bestModel],[yl(1),yl(2)],'k--');
xlim([1,nPolyMax]);
xlabel('Model Complexity (Polynomial Order)')
legend('Train Error','Test Error','Best Model');
hold off;


We see here that as model complexity increases, the error calculated on the training set continues to decrease, where as the error on the testing set increases past the optimal polynomial order $N=3$. We  showed above that error calculated on the testing set is the true indicator of how well an estimator will generalize to new data points. The error calculated on the training set strongly disagrees with the error calculated on the testing set after the optimal model complexity has been reached. Since, in general, the whole point of modeling a data set is to generalize to novel data, assessing model predictions on the training set data should be avoided.

## Wrapping Up

Here we discussed how the bias and variance of an estimator are related to squared prediction error. Though we focused on regression, these concepts can also be applied to classification problems. We found that an optimal estimator will have both low variance and low bias. We further found that information about squared bias and variance is contained in expected prediction error calculated on a testing set of data not used to fit a model’s parameters.

The concepts of estimator bias and variance are generally only clear in the context of an ensemble of datasets. However, in real-world applications, there is generally only a single observed dataset. In such cases the roles of bias and variance are less obvious (though, it is possible to calculate estimates of variance and bias using resampling methods such as bootstrapping). However, the direct connection we made between bias, variance with the mean-squared error calculated on a testing set give us a direct means for assessing a group of candidate estimators in light of a single data set. We only need to partition the available data set into separate portions: one used to fit model parameters (a training set), and another used to assess prediction accuracy (testing set). Comparing prediction accuracy across potential estimators is equivalent to assessing biases and variances of the estimators across many datasets. Note that resampling methods such as cross-validation can prove helpful here, particularly when the amount of observed data is small.

Note, all the code for this post, containe in a single script is here:

clear
clc
close all

% CREATE A RANDOM DATASET
rng(10)
nData = 10; % N
x = 2*(rand(1,nData)-.5);

xGrid = linspace(-1,1,100);

% DEFINE AND TARGET FUNCTION f(x)
f = inline('sin(pi*x)','x');

h = [];
h(1) = plot(xGrid,f(xGrid),'k','Linewidth',2);
xlabel('x')
hold on

% DEFINE AND DISPLAY y
noiseSTD = .1;
y = f(x) + noiseSTD*randn(size(x));
h(2) = scatter(x,y,'ko');
legend(h,{'f(x)','y'},'Location','Northwest');

% FIT POLYNOMIAL MODELS & DISPLAY
% (ASSUMING PREVIOUS PLOT ABOVE STILL AVAILABLE)
degree = [1,3,10];
theta = {};
cols = [.8 .05 0.05; 0.05 .6 0.05; 0.05 0.05 .6];
for iD = 1:numel(degree)
figure(1)
theta{iD} = polyfit(x,y,degree(iD));
fit{iD} = polyval(theta{iD},xGrid);
h(end+1) = plot(xGrid,fit{iD},'color',cols(iD,:),'Linewidth',2);
xlim([-1 1])
ylim([-1 1])
end
legend(h,'f(x)','y','g_1(x)','g_3(x)','g_{10}(x)','Location','Northwest')

% FIT MODELS TO K INDEPENDENT DATASETS
K = 50;
for iS = 1:K
ySim = f(x) + noiseSTD*randn(size(x));
for jD = 1:numel(degree)
% FIT THE MODEL USING polyfit.m
thetaTmp = polyfit(x,ySim,degree(jD));
% EVALUATE THE MODEL FIT USING polyval.m
simFit{jD}(iS,:) = polyval(thetaTmp,xGrid);
end
end

% DISPLAY ALL THE MODEL FITS
h = [];
for iD = 1:numel(degree)
figure(iD+1)
hold on
% PLOT THE FUNCTION FIT TO EACH DATASET
for iS = 1:K
h(1) = plot(xGrid,simFit{iD}(iS,:),'color',brighten(cols(iD,:),.6));
end
% PLOT THE AVERAGE FUNCTION ACROSS ALL FITS
h(2) = plot(xGrid,mean(simFit{iD}),'color',cols(iD,:),'Linewidth',5);
% PLOT THE UNDERLYING FUNCTION f(x)
h(3) = plot(xGrid,f(xGrid),'color','k','Linewidth',3);
% CALCULATE THE SQUARED ERROR AT EACH POINT, AVERAGED ACROSS ALL DATASETS
squaredError = (mean(simFit{iD})-f(xGrid)).^2;
% PLOT THE SQUARED ERROR
h(4) = plot(xGrid,squaredError,'k--','Linewidth',3);
uistack(h(2),'top')
hold off
axis square
xlim([-1 1])
ylim([-1 1])
legend(h,{sprintf('Individual g_{%d}(x)',degree(iD)),'Mean of All Fits','f(x)','Squared Error'},'Location','WestOutside')
title(sprintf('Model Order=%d',degree(iD)))
end

N = 25; % # OF OBSERVATIONS PER DATASET
K = 100;% # OF DATASETS
noiseSTD = .5; % NOISE STANDARDE DEV.
nTrain = ceil(N*.9); % # OF TRAINING POINTS
nPolyMax = 12; % MAXIMUM MODEL COMPLEXITY

% # INITIALIZE SOME VARIABLES
xGrid = linspace(-1,1,N);
meanPrediction = zeros(K,N);
thetaHat = {};
x = linspace(-1,1,N);
x = x(randperm(N));
for iS = 1:K % LOOP OVER DATASETS

% CREATE OBSERVED DATA, y
y = f(x) + noiseSTD*randn(size(x));

% CREATE TRAINING SET
xTrain = x(1:nTrain);
yTrain = y(1:nTrain);

% CREATE TESTING SET
xTest = x(nTrain+1:end);
yTest = y(nTrain+1:end);

% FIT MODELS
for jD = 1:nPolyMax

% MODEL PARAMETER ESTIMATES
thetaHat{jD}(iS,:) = polyfit(xTrain,yTrain,jD);

% PREDICTIONS
yHatTrain{jD}(iS,:) = polyval([thetaHat{jD}(iS,:)],xTrain); % TRAINING SET
yHatTest{jD}(iS,:) = polyval([thetaHat{jD}(iS,:)],xTest);% TESTING SET

% MEAN SQUARED ERROR
trainErrors{jD}(iS) = mean((yHatTrain{jD}(iS,:) - yTrain).^2); % TRAINING
testErrors{jD}(iS) = mean((yHatTest{jD}(iS,:) - yTest).^2); % TESTING
end
end

% CALCULATE AVERAGE PREDICTION ERROR, BIAS, AND VARIANCE
for iD = 1:nPolyMax
trainError(iD) = mean(trainErrors{iD});
testError(iD) = mean(testErrors{iD});
biasSquared(iD) = mean((mean(yHatTest{iD})-f(xTest)).^2);
variance(iD) = mean(var(yHatTest{iD},1));
end
[~,bestModel] = min(testError);

% DISPLAY
figure;
hold on;
plot(testError,'k','Linewidth',2);
plot(biasSquared,'r','Linewidth',2);
plot(variance,'b','Linewidth',2);
plot(biasSquared + variance,'m-.','Linewidth',2);
yl = ylim;
plot([bestModel,bestModel],[yl(1),yl(2)],'k--');
xlim([1,nPolyMax]);
xlabel('Model Complexity (Polynomial Order)')
legend('Test Error','Bias^2','Variance','Bias^2+Var.','Best Model')
hold off;

% DISPLAY
figure, hold on;
plot(trainError,'g','Linewidth',2);
plot(testError,'k','Linewidth',2);
yl = ylim;
plot([bestModel,bestModel],[yl(1),yl(2)],'k--');
xlim([1,nPolyMax]);
xlabel('Model Complexity (Polynomial Order)')
legend('Train Error','Test Error','Best Model');
hold off;


## Cutting Your Losses: Loss Functions & the Sum of Squares Loss

Often times, particularly in a regression framework, we are given a set of inputs (independent variables) $\bold{x}$ and a set outputs (dependent variables) $\bold{y}$, and we want to devise a model function

$f(\bold{x})=\bold{y}$

that predicts the outputs given some inputs as best as possible. But what does it mean for a model to predict “as best as possible” exactly? In order to make the notion of how good a model is explicit, it is common to adopt a loss function

$J(f(\bold{x});\bold{y})$,

The loss function is some function of the model’s prediction errors (a.k.a. residuals) $\bold{e} = \bold{y} - f(\bold{x})$ at predicting outputs $\bold y$ given the inputs $\bold x$ (the loss function is also often referred to as the cost function, as it makes explicit the “cost” of incorrect prediction). “Good” models of a dataset will have small prediction errors, and therefore produce small loss function values. Determining the “best” model is equivalent to finding model function that minimizes the loss function. A common choice for this loss function is the sum of squared of the errors (SSE) loss. If there are $M$ input-output pairs, the SSE Loss function is formally:

$J(f(\bold{x});\bold{y}) =\sum_{i=1}^M (y_i - f(x_i))^2$

This formula states that, for each output predicted by the model, we determine how far away the prediction is from the actual value $y_i$ (i.e. subtraction). Each of the $M$ individual distances are then squared and added to give a single number indicating how well (or badly) the model function captures the structure of the data across all the datapoints. The “best” model under this loss is called the least sum of squares (LSS) solution.

But why square the errors before summing them? At first, this seems somewhat unintuitive (or even ad hoc!). Surely there are other, more straight-forward loss functions we can devise. An initial notion of just adding the errors leads to a dead end because adding many positive and negative errors (i.e. resulting from data located below and above the model function) just cancels out; we want our measure of errors to be all positive (or all negative). Therefore, another idea would be to just take the absolute value of the errors $|\bold{e}|$ before summing. Turns out, this is a known loss function, called the sum of absolute errors (SAE) or sum of absolute deviations (SAD) loss function. Finding the “best” SAE/SAD model is called the least absolute error LAE/LAD solution and such a solution was actually proposed decades before LSS. Though LAE is indeed used in contemporary methods (we’ll talk more about LAE later), the sum of squares loss function is far more popular in practice. Why does sum of squares always make the cut?

# Useful Interpretations of the Sum of Squares Loss for Linear Regression

## Areas of squares

Figure 1 demonstrates a set of 2D data (blue dots) and the LSS linear function (black line) of the form

$\bold{\hat y} = f(\bold x) = \beta_0 + \beta_1 \bold x$,

where the parameters $\beta_0$ (the offet of the line from $y = 0$) and $\beta_1$ (the slope) have been estimated (LSS) to “best” fit the data.

Figure 1 – Linear model fit to some data points

One helpful interpretation of SSE loss function is demonstrated in Figures 2. Each red square is a literal interpretation of the squared error for linear function fit in Figure 1. Here we see that no matter if the errors occur from predictions being greater than or less than the actual output values, the error term is always positive.

Figure 2 — Least squares loss function represented as areas

In this interpretation, the goal of finding the LSS solution is to find the line that results in the smallest red area. This interpretation is also useful for understanding the important regression metric known as the coefficient of determination $R^2$, which is an indicator of how well a linear model function explains or predicts a dataset. Imagine that instead of the line fit in Figures 1-2, we instead fit a simpler model that has no slope parameter, and only a bias/offset parameter (Figure 3). In this case the simpler model only captures the mean value of the data along the y-dimension.

Figure 3 — Least squares loss for a linear function that only captures the mean of the dependent variable

The squared error in this model corresponds to the (unscaled) variance of the data. Lets denote the total area of the green squares in Figure 3 as the total sum of squares (TSS) error, and the area of the red squares in Figure 2 as the residual sum of squares (RSS) of the linear model fit. The metric $R^2$ is related to the red and green areas as follows

$R^2 = 1 - \frac{RSS}{TSS}$

If the linear model is doing a good job of fitting the data, then the variance of the model errors/residuals (RSS) term will be small compared to the variance of the dataset (TSS), and the $R^2$ metric will be close to one. If the model is doing a poor job of fitting the data, then the variance residuals will approach that of the data itself, and the metric will be close to zero (Note too that the value of  $R^2$ can also take negative values, in the case when the RSS is larger than the TSS, indicating a very poor model).

## A bar suspended by springs

We can gain some important insight to the importance of the least squares loss by developing concepts within the framework of a physical system (Figure 4). In this formulation, a set of springs (red, dashed lines, our errors $e$) suspend a bar (solid black line, our linear function $f(\bold{x}))$ to a set of anchors (blue datapoints, our outputs $\bold{y}$). Note that in this formulation, the springs are constrained to operate only along the vertical direction (y-dimension). This constraint is equivalent to saying that there is only error in our measurement of the dependent variables, and is often an assumption made in regression frameworks.

Figure 4 — Least squares interpreted in terms of a physical system of a bar suspended by springs

From Hooke’s Law, the force created by each spring on the bar is proportional to the distance (error) from the bar (linear function) to its corresponding anchor (datapoint) :

$F_i = -ke_i$

Further, there is a potential energy $U_i$ associated with each spring (datapoint). The total potential energy for the entire system is as follows:

$\sum_i U_i = \sum_i \int -k e_i de_i \\= \sum_i \frac{1}{2} k e_i^2 \\ = \sum_i (y_i - f(x_i))^2$

(assuming a spring constant of $k=2$). This demonstrates that the equilibrium state of this system (i.e. the arrangement of the bar that minimizes the potential energy of the system) is analogous to the state that minimizes the sum of the squared error (distance) between the bar (line function) and the anchors (datapoints).
The physical interpretation given above can also be used to derive how linear regression solutions are related to the variances of the independent variables $\bold x$ and the covariance between $\bold x$ and $\bold y$. When the bar is in the equilibrium position (optimal solution), the net force  exerted on the bar zero. Because $\bold{\hat y} = \beta_0 + \beta_1 \bold x$, this first zero-net-force condition is formally described as:
$\sum_i^M y_i - \beta_0 - \beta_1x_i = 0$
The second condition that is fullfilled during equilibrium is that there are no torquing forces on the bar (i.e. the bar is not rotating about an axis). Because torque created about an axis is the force times distance away from the origin (average x-value; the origin), this second zero-net-torque condition is formally described by:
$\sum_i^M x_i(y_i - \beta_0 - \beta_1x_i) = 0$

From the equation corresponding to the first zero-net-force condition, we can solve for the bias parameter $\beta_0$ of the linear function that describes the orientation of the bar:

$\beta_0=\frac{1}{M}\sum_i (y_i - \beta_1 x_i)$

$\beta_0= \bar y - \beta_1 \bar x$

Here the $\bar \cdot$ (pronounced “bar”) means the average value. Plugging this expression into the second second zero-net-torque condition equation, we discover that the slope of the line has an interesting interpretation related to the variances of the data:

$\sum_i x_i(y_i - \beta_0 - \beta_1x_i) = 0$

$\sum_i x_i(y_i - (\bar y - \beta_1 \bar x) - \beta_1x_i) = 0$

$\sum_i x_i(y_i - \bar y) = \beta_1 \sum_i x_i(x_i - \bar x)$

$\sum_i (x_i - \bar x)(y_i - \bar y) = \beta_1 \sum_i (x_i -\bar x)^2$

$\beta_1 = \frac{\sum_i (x_i - \bar x)(y_i - \bar y)}{\sum_i (x_i -\bar x)^2} = \frac{\text{cov}(x,y)}{\text{var}(x)}$

The expressions for the parameters $\beta_0$ and $\beta_1$ tell us that, under the least squares linear regression framework, The average of the dependent variables is equal to a scaled version of the average of independent variables plus an offset:

$\bar y = \beta_0 + \beta_1 \bar x$

Further, the scaling factor (the slope) is equal to the ratio of the covariance between the dependent and independent variables to the variance of the independent variable. Therefore if $x$ and $y$ are positively correlated, the slope will be positive, if they are negatively correlated, the slope will be negative.

Because of these relationships, the LSS solution has a number of useful properties:

1. The sum of the residuals under the LSS solution is zero(this is equivalent to the first zero-net-force condition above).
2. Because of 1., the average residual of the LSS solution is zero
3. The covariance between the independent variables and the residuals is zero.
4. The LLS solution always passes through the mean (center of mass) of the sample
5. The LSS solution minimizes the variance of the residuals/model errors.

Therefore, the least squares loss function directly relates model residuals to how the independent and dependent variables co-vary. These relationships are not available with other loss functions such as the least absolute deviation.

What’s interesting, is that the two physical constraint equations derived from the physical system above are also obtained through other analytic analyses of linear regression including defining the LSS problem using both maximum likelihood estimation and method of moments.

# Wrapping up

There are many other reasons, albeit suggestions, as to why squared errors are often preferred to other rectifying functions of the errors (i.e. making all errors be positive or zero):

1. The Least Squares solution can be derived in closed form, allowing simple analytic implementations and fast computation of model parameters.
2. Unlike the LAE loss, the SSE loss is differentiable (i.e. is smooth) everywhere, which allows model parameters to be estimated using straight-forward, gradient-based optimizations
3. Squared errors have deep ties in statistics and maximum likelihood estimation methods (as mentioned above), particularly when the errors are distributed according to the Golden Boy of statistical distributions, the Normal distribution.
4. There are a number of geometric and linear algebra theorems that support using least squares. For instance the Gauss-Markov theorem states that if errors of a linear function are distributed Normally about the mean of the line, then the LSS solution gives the best unbiased estimator for the parameters $\bold \beta$.
5. Squared functions have a long history of  facilitating calculus calculations used throughout the physical sciences.

The SSE loss does have a number of downfalls as well. For instance, because each error is squared, any outliers in the dataset can dominate the parameter estimation process. For this reason, the LSS loss is said to lack robustness. Therefore preprocessing of the the dataset (i.e. removing or thresholding outlier values) may be necessary when using the LSS loss.

# MATLAB Code

The code to produce the figures in this post is below. The code can be directly copied to your clipboard using the toolbar at the top right of the code display.

close all; clear
randn('state',12345);
x = -5:5; % INPUTS
y = x + .5*randn(size(x)); % OUTPUTS

% PART 1 - LINEAR MODEL

% PLOT DATA
figure(1);
h1 = scatter(x,y,'filled');
xlim([min(x)-1 max(x)+1]);
xlim([min(y)-1 max(y)+1]);
axis square
xlabel('x');
ylabel('y');

fprintf('\nHere are a set of 2D data pairs...');
pause
clc

% FIND LSS SOLUTION
bias = ones(size(x));
beta=regress(y',[bias;x]');
intercept = beta(1);
slope = beta(2);
yHat = x*slope + intercept;

% PLOT MODEL FIT
hold on;
h2 = plot(x,yHat,'k-','Linewidth',2);

fprintf('\n...and the LSS linear function determined for the points.\n');
pause
clc

% CALCULATE PREDICTION ERROR
e = y - yHat;
posErrors = find(e&gt;=0);
negErrors = setdiff(1:numel(x),posErrors);

% PLOT &quot;SQUARED&quot; ERRORS
cnt = 1;
for iP = 1:numel(posErrors);
xs = [x(posErrors(iP))-e(posErrors(iP)), ...
x(posErrors(iP)), ...
x(posErrors(iP)), ...
x(posErrors(iP))-e(posErrors(iP))];

ys = [y(posErrors(iP))-e(posErrors(iP)), ...
y(posErrors(iP))-e(posErrors(iP)), ...
y(posErrors(iP)), ...
y(posErrors(iP))];

hS(cnt)=patch(xs,ys,'r');
set(hS(cnt),'EdgeColor','r');
set(hS(cnt),'FaceAlpha',.5);
cnt = cnt+1;
end
for iN = 1:numel(negErrors);
xs = [x(negErrors(iN))-e(negErrors(iN)), ...
x(negErrors(iN)), ...
x(negErrors(iN)), ...
x(negErrors(iN))-e(negErrors(iN))];

ys = [y(negErrors(iN)), ...
y(negErrors(iN)), ...
y(negErrors(iN))-e(negErrors(iN)), ...
y(negErrors(iN))-e(negErrors(iN))];

hS(cnt)= patch(xs,ys,'r');
set(hS(cnt),'EdgeColor','r');
set(hS(cnt),'FaceAlpha',.5);
cnt = cnt+1;
end

uistack(h2,'top');
uistack(h1,'top');

fprintf('\nOne helpful interpretation is to represent the')
fprintf('\nsquared errors literally as the area spanned')
fprintf('\nin the space (red squares).\n')
fprintf('\nFinding the LSS solution is equivalent to minimizing')
fprintf('\nthe sum of the area of these squares.\n')
pause
clc

% PART 2 - TRIVIAL LINEAR MODEL
figure(2);
h1 = scatter(x,y,'filled');
xlim([min(x)-1 max(x)+1]);
xlim([min(y)-1 max(y)+1]);
axis square
xlabel('x');
ylabel('y');

yHat0 = mean(y).*ones(size(x));
e0 = y - yHat0;
posErrors0 = find(e0&gt;=0);
negErrors0 = setdiff(1:numel(x),posErrors0);

% PLOT TRIVIAL MODEL FIT
hold on;
h2 = plot(x,yHat0,'k-','Linewidth',2);

fprintf('\nNow, imagine that we fit a simpler model to the datapoints')
fprintf('\nthat is a line with no slope parameter and an offset parameter.')
fprintf('\nIn this case we''re essentially fitting the mean of the data.')
pause
clc

% PLOT TRIVIAL MODEL &quot;SQUARED&quot; ERRORS
cnt = 1;
for iP = 1:numel(posErrors0);
xs = [x(posErrors0(iP))-e0(posErrors0(iP)), ...
x(posErrors0(iP)), ...
x(posErrors0(iP)), ...
x(posErrors0(iP))-e0(posErrors0(iP))];

ys = [y(posErrors0(iP))-e0(posErrors0(iP)), ...
y(posErrors0(iP))-e0(posErrors0(iP)), ...
y(posErrors0(iP)), ...
y(posErrors0(iP))];

hS(cnt)=patch(xs,ys,'g');
set(hS(cnt),'EdgeColor','g');
set(hS(cnt),'FaceAlpha',.5);
cnt = cnt+1;
end
for iN = 1:numel(negErrors0);
xs = [x(negErrors0(iN))-e0(negErrors0(iN)), ...
x(negErrors0(iN)), ...
x(negErrors0(iN)), ...
x(negErrors0(iN))-e0(negErrors0(iN))];

ys = [y(negErrors0(iN)), ...
y(negErrors0(iN)), ...
y(negErrors0(iN))-e0(negErrors0(iN)), ...
y(negErrors0(iN))-e0(negErrors0(iN))];

hS(cnt)= patch(xs,ys,'g');
set(hS(cnt),'EdgeColor','g');
set(hS(cnt),'FaceAlpha',.5);
cnt = cnt+1;
end

uistack(h2,'top');
uistack(h1,'top');

fprintf('\nTherefore the sum of residuals for this model area equal to the')
fprintf('\n(unscaled) variance of the data.\n')
pause
fprintf('\nThe ratio of the area of the green boxes in Figure 1 to the area of')
fprintf('\nthe red boxes in Figure 2 is related to the important metric known')
fprintf('\nas the coefficient of determination, R^2. Specifically:\n')
fprintf('\nR^2 = 1 - Red/Green\n')
fprintf('\nNote that as the linear model fit improves, the area of the red')
fprintf('\nboxes decreases and the value of R^2 approaches one.')
pause
clc

% PART 3- &quot;SPRINGS&quot; INTERPRETATION
figure(3)
h1 = scatter(x,y,'filled');
xlim([min(x)-1 max(x)+1]);
xlim([min(y)-1 max(y)+1]);hold on
h2 = plot(x,yHat,'k-','Linewidth',2);
axis square
xlabel('x');
ylabel('y');

h3 = line([x;x],[y;yHat],'color','r','linestyle','--','Linewidth',2);
uistack(h1,'top');

fprintf('\nIt can also be helpful to think of errors corresponding to')
fprintf('\nindividual datapoints as springs (Figuree 3, red dashes)')
fprintf('\nattached to a suspended bar (black line)\n')
fprintf('\nIf the springs are limited to only operate in the the y-direction,')
fprintf('\nthen sum of the potential energies stored in the springs when')
fprintf('\nthe bar has reached its equilibtium position directly corresponds')
fprintf('\nto the sum of squares error function:\n')
fprintf('\ne = y - yHat;\nU = integral(ke)de = 1/2ke^2\n')
fprintf('\nif k = 1, then:\n')
fprintf('\nU = 1/2(y - yHat)^2, \nthe least squares error function\n')
pause
clc

close all; clear all; clc