# Monthly Archives: April 2013

## Model Selection: Underfitting, Overfitting, and the Bias-Variance Tradeoff

In machine learning and pattern recognition, there are many ways (an infinite number, really) of solving any one problem. Thus it is important to have an objective criterion for assessing the accuracy of candidate approaches and for selecting the right model for a data set at hand. In this post we’ll discuss the concepts of under- and overfitting and how these phenomena are related to the statistical quantities bias and variance. Finally, we will discuss how these concepts can be applied to select a model that will accurately generalize to novel scenarios/data sets.

## Models for Regression

When performing regression analyses we would like to characterize how the value of some dependent variable changes as some independent variable is varied. For example, say we would like to characterize the firing rate of a neuron in visual cortex as we vary the orientation of a grating pattern presented to the eye. We assume that there is some true relationship function that maps the independent variable values (i.e. the angle of the grating pattern) onto the dependent variable values (i.e. firing rate). We would like to determine the form of the function from observations of independent-dependent value pairs (I may also refer to these as input-output pairs, as we can think of the function taking as input and producing an output). However, in the real world, we don’t get to observe directly, but instead get noisy observations , where

(1)

Here we will assume that is random variable distributed according to a zero-mean Gaussian with standard deviation . Note that because is a random variable, is also a random variable (with a mean that is on conditioned on both and , and a variance .

As an example, say that the true function we want to determine has the the following form (though we don’t know it):

Thus the observations we get to see have the following distribution.

Below we define the function and display it, then draw a few observation samples , and display them as well:

% CREATE A RANDOM DATASET rng(10) nData = 10; % N x = 2*(rand(1,nData)-.5); xGrid = linspace(-1,1,100); % DEFINE AND TARGET FUNCTION f(x) f = inline('sin(pi*x)','x'); h = []; h(1) = plot(xGrid,f(xGrid),'k','Linewidth',2); xlabel('x') hold on % DEFINE AND DISPLAY y noiseSTD = .1; y = f(x) + noiseSTD*randn(size(x)); h(2) = scatter(x,y,'ko'); legend(h,{'f(x)','y'},'Location','Northwest');

Again, the goal here is to characterized the function . However, since we don’t know the function form of , we must instead estimate some other function that we think will be a good approximation to . Thus we call an estimator of . In general, an estimator is some parameterized model that can capture a wide range of functional forms. One such class of estimators is the weighted combination of ordered polynomials:

As the polynomial order increases, the functions are able to capture increasingly complex behavior. For example, desribes a horizontal line with an adjustable vertical offset, desribes a line with adjustable vertical offset and adjustable slope, describes a function that also includes a quadratic term. We thus try to fit the values of the parameters for a given estimator to best account for observed data in the hopes that we will also accurately approximate .

Below we estimate the parameters of three polynomial model functions of increasing complexity (using Matlab’s ) to the sampled data displayed above. Specifically, we estimate the functions , and .

% FIT POLYNOMIAL MODELS & DISPLAY % (ASSUMING PREVIOUS PLOT ABOVE STILL AVAILABLE) degree = [1,3,10]; theta = {}; cols = [.8 .05 0.05; 0.05 .6 0.05; 0.05 0.05 .6]; for iD = 1:numel(degree) figure(1) theta{iD} = polyfit(x,y,degree(iD)); fit{iD} = polyval(theta{iD},xGrid); h(end+1) = plot(xGrid,fit{iD},'color',cols(iD,:),'Linewidth',2); xlim([-1 1]) ylim([-1 1]) end legend(h,'f(x)','y','g_1(x)','g_3(x)','g_{10}(x)','Location','Northwest')

Qualitatively, we see that the estimator (red line) provides a poor fit to the observed data, as well as a poor approximation to the function (black curve). We see that the estimator (blue curve) provides a very accurate fit to the data points, but varies wildly to do so, and therefore provides an inaccurate approximation of . Finally, we see that the estimator (green curve) provides a fairly good fit to the observed data, and a much better job at approximating .

Our original goal was to approximate , not the data points per se. Therefore , at least qualitatively, provides a more desirable estimate of than the other two estimators. The fits for and are examples of “underfitting” and “overfitting” to the observed data. * Underfitting* occurs when an estimator is not flexible enough to capture the underlying trends in the observed data.

*occurs when an estimator is too flexible, allowing it to capture illusory trends in the data. These illusory trends are often the result of the noise in the observations .*

**Overfitting**## Variance and Bias of an Estimator

The model fits for discussed above were based on a single, randomly-sampled data set of observations . However, there are in principle, a potentially infinite number of data sets that can be observed (drawn from ). In order to determine a good model of , it would be helpful to have an idea of how an estimator will perform on any or all of these potential datasets. To get an idea of how each of the estimators discussed above performs in general we can repeat the model fitting procedure for many data sets.

Here we perform such an analyses, sampling 50 independent data sets according to Equation (1) (with ), then fitting the parameters for the polynomial functions of model order to each dataset.

% FIT MODELS TO K INDEPENDENT DATASETS K = 50; for iS = 1:K ySim = f(x) + noiseSTD*randn(size(x)); for jD = 1:numel(degree) % FIT THE MODEL USING polyfit.m thetaTmp = polyfit(x,ySim,degree(jD)); % EVALUATE THE MODEL FIT USING polyval.m simFit{jD}(iS,:) = polyval(thetaTmp,xGrid); end end % DISPLAY ALL THE MODEL FITS h = []; for iD = 1:numel(degree) figure(iD+1) hold on % PLOT THE FUNCTION FIT TO EACH DATASET for iS = 1:K h(1) = plot(xGrid,simFit{iD}(iS,:),'color',brighten(cols(iD,:),.6)); end % PLOT THE AVERAGE FUNCTION ACROSS ALL FITS h(2) = plot(xGrid,mean(simFit{iD}),'color',cols(iD,:),'Linewidth',5); % PLOT THE UNDERLYING FUNCTION f(x) h(3) = plot(xGrid,f(xGrid),'color','k','Linewidth',3); % CALCULATE THE SQUARED ERROR AT EACH POINT, AVERAGED ACROSS ALL DATASETS squaredError = (mean(simFit{iD})-f(xGrid)).^2; % PLOT THE SQUARED ERROR h(4) = plot(xGrid,squaredError,'k--','Linewidth',3); uistack(h(2),'top') hold off axis square xlim([-1 1]) ylim([-1 1]) legend(h,{sprintf('Individual g_{%d}(x)',degree(iD)),'Mean of All Fits','f(x)','Squared Error'},'Location','WestOutside') title(sprintf('Model Order=%d',degree(iD))) end

Here are the results for the estimator …

…and for the estimator , …

… and for

The lightly-colored curves in each of the three plots above are an individual polynomial model fit to one of the 50 sampled data sets. The darkly-colored curve in each plot is the average over the 50 individual fits. The dark black curve is the true, underlying function .

We see that for the estimator (red curves), model fits do not vary much from data set to data set. Thus the expected (mean) estimator fit over all the data sets, formally written as , is very similar to all the individual fits.

A common formal definition used in statistics to describe how much a single estimator deviates from the average estimator over datasets is the * variance of the estimator*. Formally defined as

the variance is the average squared difference between any single data-set-dependent estimate of and the average value of estimated over all datasets.

According to the definition of variance, we can say that the estimator exhibits *low variance*.

A commonly-used metric in statistics for assessing how an estimator approximates a target function , based on behavior over many observed data sets is what is called the * bias of the estimator.* Formally defined as:

The bias describes how much the average estimator fit over datasets deviates from the value of the underlying target function .

We can see from the plot for that deviates significantly from . Thus we can say that the estimator exhibits *large bias* when approximating the function .

Investigating the results for the estimator (blue curves), we see that each individual model fit varies dramatically from one data set to another. Thus we can say that this estimator exhibits *large variance*. Looking at (the dark blue curve), we see that on average the estimator provides a better approximation to than the estimator . Thus we can say that exhibits a *lower bias* than the estimator .

Investigating the fits for the estimator (green curves), we find that this estimator has low bias. Furthermore, the average estimator (dark green curve) accurately approximates the true function , telling us that that the estimator also has *low bias*.

We established earlier that the estimator provided a qualitatively better fit to the function than the other two polynomial estimators for a single dataset. It appears that this is also the case over many datasets. We also find that estimator exhibits low bias and low variance, whereas the other two, less-desirable estimators, have either high bias or high variance. Thus it would appear that having both low bias and low variance is a reasonable criterion for selecting an accurate model of .

Included in each of the three plots above is a dashed black line representing the squared difference between the average estimator and the true function . Calculating squared model errors is a common practice for quantifying the goodness of a model fit. If we calculate the expected value of each of the dashed black curves (and assuming that all values of are equally likely to occur), we would obtain a single value for each estimator that is the mean squared error (MSE) between the expected estimator and the true function.

For the estimator , the MSE will be very small, as the dashed black curve for this estimator is near zero for all values of . The estimators and would have significantly larger values. Now, because exhibiting both a low MSE, as well as having both low bias and variance are indicative of a good estimator, it would be reasonable to assume that squared model error is directly related to bias and variance. The next section provides some formal evidence for this notion.

## Expected Prediction Error and the Bias-variance Tradeoff

For a given estimator fit to a data set of pairs, we would like to know, given all the possible datasets out there, what is the expected prediction error we will observe for a new data point . If we define prediction error to be the squared difference in model prediction and observations , the expected prediction error is then:

If we expand this a little and use a few identities, something interesting happens:

(2)

(3)

(4)

where we have applied the following Lemma to the first and third terms of Equation (3), and use the fact to (Think of averaging over an infinite number of datasets sampled from y; all noise will average out, leaving ). Rearranging Equation (4), we obtain

(5)

which can be further simplified and grouped into three terms

(6)

- The first term is the
introduced above.**variance of the estimator** - The second term is the square of the
, also introduced above.**bias of the estimator** - The third term is the
and describes how much the observations vary from the true function . Notice that the noise term does not depend on the estimator . Thus the noise term is a constant that places a lower bound on expected prediction error.**variance of the observation noise**

Here we find that the expected prediction error on new data (in the squared differences sense) is the combination of three terms:

* Expected prediction error* =

*+*

**estimator variance***+*

**squared estimator bias**

**noise**Thus the expected prediction error on new data can be used as a quantitative criterion for selecting the best model from a candidate set of estimators! It turns out that, given new data points , the expected prediction error can be easily approximated as the mean squared error over data pairs:

Below we demonstrate these findings with another set of simulations. We simulate 100 independent datasets, each with 25 pairs. We then partition each dataset into two non-overlapping sets: a training set using for fitting model parameters, and a testing set used to calculate prediction error. We then fit the parameters for estimators of varying complexity. Complexity is varied by using polynomial functions that range in model order from 1 (least complex) to 12 (most complex). We then calculate and display the squared bias, variance, and error on testing set for each of the estimators:

N = 25; % # OF OBSERVATIONS PER DATASET K = 100;% # OF DATASETS noiseSTD = .5; % NOISE STANDARDE DEV. nTrain = ceil(N*.9); % # OF TRAINING POINTS nPolyMax = 12; % MAXIMUM MODEL COMPLEXITY % # INITIALIZE SOME VARIABLES xGrid = linspace(-1,1,N); meanPrediction = zeros(K,N); thetaHat = {}; x = linspace(-1,1,N); x = x(randperm(N)); for iS = 1:K % LOOP OVER DATASETS % CREATE OBSERVED DATA, y y = f(x) + noiseSTD*randn(size(x)); % CREATE TRAINING SET xTrain = x(1:nTrain); yTrain = y(1:nTrain); % CREATE TESTING SET xTest = x(nTrain+1:end); yTest = y(nTrain+1:end); % FIT MODELS for jD = 1:nPolyMax % MODEL PARAMETER ESTIMATES thetaHat{jD}(iS,:) = polyfit(xTrain,yTrain,jD); % PREDICTIONS yHatTrain{jD}(iS,:) = polyval([thetaHat{jD}(iS,:)],xTrain); TRAINING SET yHatTest{jD}(iS,:) = polyval([thetaHat{jD}(iS,:)],xTest);% TESTING SET % MEAN SQUARED ERROR trainErrors{jD}(iS) = mean((yHatTrain{jD}(iS,:) - yTrain).^2); % TRAINING testErrors{jD}(iS) = mean((yHatTest{jD}(iS,:) - yTest).^2); % TESTING end end % CALCULATE AVERAGE PREDICTION ERROR, BIAS, AND VARIANCE for iD = 1:nPolyMax trainError(iD) = mean(trainErrors{iD}); testError(iD) = mean(testErrors{iD}); biasSquared(iD) = mean((mean(yHatTest{iD})-f(xTest)).^2); variance(iD) = mean(var(yHatTest{iD},1)); end [~,bestModel] = min(testError); % DISPLAY figure; hold on; plot(testError,'k','Linewidth',2); plot(biasSquared,'r','Linewidth',2); plot(variance,'b','Linewidth',2); plot(biasSquared + variance,'m-.','Linewidth',2); yl = ylim; plot([bestModel,bestModel],[yl(1),yl(2)],'k--'); xlim([1,nPolyMax]); xlabel('Model Complexity (Polynomial Order)') legend('Test Error','Bias^2','Variance','Bias^2+Var.','Best Model') hold off;

Here we see how, as the model complexity increases, the estimator variance (blue curve) gradually increases. Additionally, as model complexity increases, the squared bias (red curve) decreases. Thus ** there is a tradeoff between bias and variance that comes with model complexity**: models that are too complex will have high variance and low bias; models that are too simple will have high bias and low variance. The best model will have both low bias and low variance. In this example, we highlight the best estimator in terms of prediction error on the testing set (black curve) with a dashed black vertical line. The best estimator corresponds to a polynomial model of order of . Notice that the vertical black line is located where function defined by the sum of the squared bias and variance (dashed magenta curve) is also at a minimum.

Notice also how the sum of the squared bias and variance also has the same shape as curve defined by the prediction error on the testing set. This exemplifies how the error on novel data can be used as a proxy for determining the best estimator from a candidate set based on squared bias and variance. The noise term in Equation (6) is also represented in the figure by the vertical displacement between the black curve and dashed magenta curve.

It is very important to point out that all of these results are based on evaluating prediction error on **novel data**, not used to estimate model parameters. In fact assessing a model performance based on prediction error calculated on the same data used to estimate the model parameters is highly problematic, as it causes models to always overfit. In plain terms, this means that we will always favor a more complex model if we assess goodness of model fits on the training data, as a more complex model will be better able to capture small, random trends in the data due to noise.

This overfitting phenomenon is demonstrated below. We plot the error calculated on the training set (Train Error, green curve) along with the error calculated on the testing set (Test Error, black curve) for the above simulation. We also identify the best estimator as we did above.

% DISPLAY figure, hold on; plot(trainError,'g','Linewidth',2); plot(testError,'k','Linewidth',2); yl = ylim; plot([bestModel,bestModel],[yl(1),yl(2)],'k--'); xlim([1,nPolyMax]); xlabel('Model Complexity (Polynomial Order)') legend('Train Error','Test Error','Best Model'); hold off;

We see here that as model complexity increases, the error calculated on the training set continues to decrease, where as the error on the testing set increases past the optimal polynomial order . We showed above that error calculated on the testing set is the true indicator of how well an estimator will generalize to new data points. The error calculated on the training set strongly disagrees with the error calculated on the testing set after the optimal model complexity has been reached. Since, in general, the whole point of modeling a data set is to generalize to novel data, assessing model predictions on the training set data should be avoided.

## Wrapping Up

Here we discussed how the bias and variance of an estimator are related to squared prediction error. Though we focused on regression, these concepts can also be applied to classification problems. We found that an optimal estimator will have both low variance and low bias. We further found that information about squared bias and variance is contained in expected prediction error calculated on a testing set of data not used to fit a model’s parameters.

The concepts of estimator bias and variance are generally only clear in the context of an ensemble of datasets. However, in real-world applications, there is generally only a single observed dataset. In such cases the roles of bias and variance are less obvious (though, it is possible to calculate estimates of variance and bias using resampling methods such as bootstrapping). However, the direct connection we made between bias, variance with the mean-squared error calculated on a testing set give us a direct means for assessing a group of candidate estimators in light of a single data set. We only need to partition the available data set into separate portions: one used to fit model parameters (a training set), and another used to assess prediction accuracy (testing set). Comparing prediction accuracy across potential estimators is equivalent to assessing biases and variances of the estimators across many datasets. Note that resampling methods such as cross-validation can prove helpful here, particularly when the amount of observed data is small.

Note, all the code for this post, containe in a single script is here:

clear clc close all % CREATE A RANDOM DATASET rng(10) nData = 10; % N x = 2*(rand(1,nData)-.5); xGrid = linspace(-1,1,100); % DEFINE AND TARGET FUNCTION f(x) f = inline('sin(pi*x)','x'); h = []; h(1) = plot(xGrid,f(xGrid),'k','Linewidth',2); xlabel('x') hold on % DEFINE AND DISPLAY y noiseSTD = .1; y = f(x) + noiseSTD*randn(size(x)); h(2) = scatter(x,y,'ko'); legend(h,{'f(x)','y'},'Location','Northwest'); % FIT POLYNOMIAL MODELS & DISPLAY % (ASSUMING PREVIOUS PLOT ABOVE STILL AVAILABLE) degree = [1,3,10]; theta = {}; cols = [.8 .05 0.05; 0.05 .6 0.05; 0.05 0.05 .6]; for iD = 1:numel(degree) figure(1) theta{iD} = polyfit(x,y,degree(iD)); fit{iD} = polyval(theta{iD},xGrid); h(end+1) = plot(xGrid,fit{iD},'color',cols(iD,:),'Linewidth',2); xlim([-1 1]) ylim([-1 1]) end legend(h,'f(x)','y','g_1(x)','g_3(x)','g_{10}(x)','Location','Northwest') % FIT MODELS TO K INDEPENDENT DATASETS K = 50; for iS = 1:K ySim = f(x) + noiseSTD*randn(size(x)); for jD = 1:numel(degree) % FIT THE MODEL USING polyfit.m thetaTmp = polyfit(x,ySim,degree(jD)); % EVALUATE THE MODEL FIT USING polyval.m simFit{jD}(iS,:) = polyval(thetaTmp,xGrid); end end % DISPLAY ALL THE MODEL FITS h = []; for iD = 1:numel(degree) figure(iD+1) hold on % PLOT THE FUNCTION FIT TO EACH DATASET for iS = 1:K h(1) = plot(xGrid,simFit{iD}(iS,:),'color',brighten(cols(iD,:),.6)); end % PLOT THE AVERAGE FUNCTION ACROSS ALL FITS h(2) = plot(xGrid,mean(simFit{iD}),'color',cols(iD,:),'Linewidth',5); % PLOT THE UNDERLYING FUNCTION f(x) h(3) = plot(xGrid,f(xGrid),'color','k','Linewidth',3); % CALCULATE THE SQUARED ERROR AT EACH POINT, AVERAGED ACROSS ALL DATASETS squaredError = (mean(simFit{iD})-f(xGrid)).^2; % PLOT THE SQUARED ERROR h(4) = plot(xGrid,squaredError,'k--','Linewidth',3); uistack(h(2),'top') hold off axis square xlim([-1 1]) ylim([-1 1]) legend(h,{sprintf('Individual g_{%d}(x)',degree(iD)),'Mean of All Fits','f(x)','Squared Error'},'Location','WestOutside') title(sprintf('Model Order=%d',degree(iD))) end N = 25; % # OF OBSERVATIONS PER DATASET K = 100;% # OF DATASETS noiseSTD = .5; % NOISE STANDARDE DEV. nTrain = ceil(N*.9); % # OF TRAINING POINTS nPolyMax = 12; % MAXIMUM MODEL COMPLEXITY % # INITIALIZE SOME VARIABLES xGrid = linspace(-1,1,N); meanPrediction = zeros(K,N); thetaHat = {}; x = linspace(-1,1,N); x = x(randperm(N)); for iS = 1:K % LOOP OVER DATASETS % CREATE OBSERVED DATA, y y = f(x) + noiseSTD*randn(size(x)); % CREATE TRAINING SET xTrain = x(1:nTrain); yTrain = y(1:nTrain); % CREATE TESTING SET xTest = x(nTrain+1:end); yTest = y(nTrain+1:end); % FIT MODELS for jD = 1:nPolyMax % MODEL PARAMETER ESTIMATES thetaHat{jD}(iS,:) = polyfit(xTrain,yTrain,jD); % PREDICTIONS yHatTrain{jD}(iS,:) = polyval([thetaHat{jD}(iS,:)],xTrain); % TRAINING SET yHatTest{jD}(iS,:) = polyval([thetaHat{jD}(iS,:)],xTest);% TESTING SET % MEAN SQUARED ERROR trainErrors{jD}(iS) = mean((yHatTrain{jD}(iS,:) - yTrain).^2); % TRAINING testErrors{jD}(iS) = mean((yHatTest{jD}(iS,:) - yTest).^2); % TESTING end end % CALCULATE AVERAGE PREDICTION ERROR, BIAS, AND VARIANCE for iD = 1:nPolyMax trainError(iD) = mean(trainErrors{iD}); testError(iD) = mean(testErrors{iD}); biasSquared(iD) = mean((mean(yHatTest{iD})-f(xTest)).^2); variance(iD) = mean(var(yHatTest{iD},1)); end [~,bestModel] = min(testError); % DISPLAY figure; hold on; plot(testError,'k','Linewidth',2); plot(biasSquared,'r','Linewidth',2); plot(variance,'b','Linewidth',2); plot(biasSquared + variance,'m-.','Linewidth',2); yl = ylim; plot([bestModel,bestModel],[yl(1),yl(2)],'k--'); xlim([1,nPolyMax]); xlabel('Model Complexity (Polynomial Order)') legend('Test Error','Bias^2','Variance','Bias^2+Var.','Best Model') hold off; % DISPLAY figure, hold on; plot(trainError,'g','Linewidth',2); plot(testError,'k','Linewidth',2); yl = ylim; plot([bestModel,bestModel],[yl(1),yl(2)],'k--'); xlim([1,nPolyMax]); xlabel('Model Complexity (Polynomial Order)') legend('Train Error','Test Error','Best Model'); hold off;

## Supplemental Proof 1

Proof:

Let’s start off with the following expression:

(1)

(2)

(3)

(here we take advantage of the notion that is a constant, namely )

(4)

Combining the terms in Line (1) and Line (4) and rearranging, we find that

(5)