# Blog Archives

## Model Selection: Underfitting, Overfitting, and the Bias-Variance Tradeoff

In machine learning and pattern recognition, there are many ways (an infinite number, really) of solving any one problem. Thus it is important to have an objective criterion for assessing the accuracy of candidate approaches and for selecting the right model for a data set at hand. In this post we’ll discuss the concepts of under- and overfitting and how these phenomena are related to the statistical quantities bias and variance. Finally, we will discuss how these concepts can be applied to select a model that will accurately generalize to novel scenarios/data sets.

## Models for Regression

When performing regression analyses we would like to characterize how the value of some dependent variable changes as some independent variable $x$ is varied. For example, say we would like to characterize the firing rate of a neuron in visual cortex as we vary the orientation of a grating pattern presented to the eye. We assume that there is some true relationship function $f(x)$ that maps the independent variable values (i.e. the angle of the grating pattern) onto the dependent variable values (i.e. firing rate). We would like to determine the form of the function $f(x)$ from observations of independent-dependent value pairs (I may also refer to these as input-output pairs, as we can think of the function $f(x)$ taking $x$ as input and producing an output). However, in the real world, we don’t get to observe $f(x)$ directly, but instead get noisy observations $y$, where

(1) $y = f(x) + \epsilon$

Here we will assume that $\epsilon$ is random variable distributed according to a zero-mean Gaussian with standard deviation $\sigma^2$. Note that because $\epsilon$ is a random variable, $y$ is also a random variable (with a mean that is on conditioned on both $x$ and $f(x)$, and a variance $\sigma^2)$.

As an example, say that the true function $f(x)$ we want to determine has the the following form (though we don’t know it):

$f(x) = \sin(\pi x)$

Thus the observations $y$ we get to see have the following distribution.

$y = \sin(\pi x) + \mathcal N(0,\sigma^2)$

Below we define the function $f(x)$ and display it, then draw a few observation samples $y$, and display them as well:

% CREATE A RANDOM DATASET
rng(10)
nData = 10; % N
x = 2*(rand(1,nData)-.5);
xGrid = linspace(-1,1,100);

% DEFINE AND TARGET FUNCTION f(x)
f = inline('sin(pi*x)','x');

h = [];
h(1) = plot(xGrid,f(xGrid),'k','Linewidth',2);
xlabel('x')
hold on

% DEFINE AND DISPLAY y
noiseSTD = .1;
y = f(x) + noiseSTD*randn(size(x));
h(2) = scatter(x,y,'ko');
legend(h,{'f(x)','y'},'Location','Northwest');


Again, the goal here is to characterized the function $f(x)$. However, since we don’t know the function form of $f(x)$, we must instead estimate some other function $g(x)$ that we think will be a good approximation to $f(x)$. Thus we call $g(x)$ an estimator of $f(x)$. In general, an estimator is some parameterized model that can capture a wide range of functional forms. One such class of estimators is the weighted combination of ordered polynomials:

$g_N(x) = \theta_0 + \theta_1x + \theta_2x^2 + \dots \theta_N x^N$

As the polynomial order $N$ increases, the functions $g_N(x)$ are able to capture increasingly complex behavior. For example, $g_0(x)$ desribes a horizontal line with an adjustable vertical offset, $g_1(x)$ desribes a line with adjustable vertical offset and adjustable slope, $g_2(x)$ describes a function that also includes a quadratic term. We thus try to fit the values of the parameters for a given estimator $g_N(x)$  to best account for observed data in the hopes that we will also accurately approximate $f(x)$.

Below we estimate the parameters of three polynomial model functions of increasing complexity (using Matlab’s $\texttt{polyfit.m}$) to the sampled data displayed above. Specifically, we estimate the functions $g_1(x)$, $g_3(x)$ and $g_{10}(x)$.

% FIT POLYNOMIAL MODELS & DISPLAY
% (ASSUMING PREVIOUS PLOT ABOVE STILL AVAILABLE)
degree = [1,3,10];
theta = {};
cols = [.8 .05 0.05; 0.05 .6 0.05; 0.05 0.05 .6];
for iD = 1:numel(degree)
figure(1)
theta{iD} = polyfit(x,y,degree(iD));
fit{iD} = polyval(theta{iD},xGrid);
h(end+1) = plot(xGrid,fit{iD},'color',cols(iD,:),'Linewidth',2);
xlim([-1 1])
ylim([-1 1])
end
legend(h,'f(x)','y','g_1(x)','g_3(x)','g_{10}(x)','Location','Northwest')


Qualitatively, we see that the estimator $g_1(x)$ (red line) provides a poor fit to the observed data, as well as a poor approximation to the function $f(x)$ (black curve). We see that the estimator $g_{10}(x)$ (blue curve) provides a very accurate fit to the data points, but varies wildly to do so, and therefore provides an inaccurate approximation of $f(x)$. Finally, we see that the estimator $g_3(x)$ (green curve) provides a fairly good fit to the observed data, and a much better job at approximating $f(x)$.

Our original goal was to approximate $f(x)$, not the data points per se. Therefore $g_3(x)$, at least qualitatively, provides a more desirable estimate of $f(x)$ than the other two estimators. The fits for $g_1(x)$ and $g_{10}(x)$ are examples of “underfitting” and “overfitting” to the observed data. Underfitting occurs when an estimator $g(x)$ is not flexible enough to capture the underlying trends in the observed data. Overfitting occurs when an estimator is too flexible, allowing it to capture illusory trends in the data. These illusory trends are often the result of the noise in the observations $y$.

## Variance and Bias of an Estimator

The model fits for $g_N(x)$ discussed above were based on a single, randomly-sampled data set of observations $y$. However, there are in principle, a potentially infinite number of data sets that can be observed (drawn from $y$). In order to determine a good model of $f(x)$, it would be helpful to have an idea of how an estimator will perform on any or all of these potential datasets.  To get an idea of how each of the estimators discussed above performs in general we can repeat the model fitting procedure for many data sets.

Here we perform such an analyses, sampling 50 independent data sets according to Equation (1) (with $\sigma=0.1$), then fitting the parameters for the polynomial functions of model order $N = [1,3,10]$ to each dataset.

% FIT MODELS TO K INDEPENDENT DATASETS
K = 50;
for iS = 1:K
ySim = f(x) + noiseSTD*randn(size(x));
for jD = 1:numel(degree)
% FIT THE MODEL USING polyfit.m
thetaTmp = polyfit(x,ySim,degree(jD));
% EVALUATE THE MODEL FIT USING polyval.m
simFit{jD}(iS,:) = polyval(thetaTmp,xGrid);
end
end

% DISPLAY ALL THE MODEL FITS
h = [];
for iD = 1:numel(degree)
figure(iD+1)
hold on
% PLOT THE FUNCTION FIT TO EACH DATASET
for iS = 1:K
h(1) = plot(xGrid,simFit{iD}(iS,:),'color',brighten(cols(iD,:),.6));
end
% PLOT THE AVERAGE FUNCTION ACROSS ALL FITS
h(2) = plot(xGrid,mean(simFit{iD}),'color',cols(iD,:),'Linewidth',5);
% PLOT THE UNDERLYING FUNCTION f(x)
h(3) = plot(xGrid,f(xGrid),'color','k','Linewidth',3);
% CALCULATE THE SQUARED ERROR AT EACH POINT, AVERAGED ACROSS ALL DATASETS
squaredError = (mean(simFit{iD})-f(xGrid)).^2;
% PLOT THE SQUARED ERROR
h(4) = plot(xGrid,squaredError,'k--','Linewidth',3);
uistack(h(2),'top')
hold off
axis square
xlim([-1 1])
ylim([-1 1])
legend(h,{sprintf('Individual g_{%d}(x)',degree(iD)),'Mean of All Fits','f(x)','Squared Error'},'Location','WestOutside')
title(sprintf('Model Order=%d',degree(iD)))
end


Here are the results for the estimator $g_1(x)$

…and for the estimator $g_3(x)$, …

… and for $g_{10}(x).$

The lightly-colored curves in each of the three plots above are an individual polynomial model fit to one of the 50 sampled data sets. The darkly-colored curve in each plot is the average over the 50 individual fits. The dark black curve is the true, underlying function $f(x)$.

We see that for the estimator $g_1(x)$ (red curves), model fits do not vary much from data set to data set. Thus the expected (mean) estimator fit over all the data sets, formally written as $\mathbb E[g(x)]$, is very similar to all the individual fits.

A common formal definition used in statistics to describe how much a single estimator deviates from the average estimator over datasets is the variance of the estimator. Formally defined as

$variance=\mathbb E[(g(x)-\mathbb E[g(x)])^2]$

the variance is the average squared difference between any single data-set-dependent estimate of $g(x)$ and the average value of $g(x)$ estimated over all datasets.

According to the definition of variance, we can say that the estimator $g_1(x)$ exhibits low variance.

A commonly-used metric in statistics for assessing how an estimator $g(x)$ approximates a target function $f(x)$, based on behavior over many observed data sets is what is called the bias of the estimator. Formally defined as:

$bias = \mathbb E[g(x)] - f(x)$

The bias describes how much the average estimator fit over datasets $\mathbb E[g(x)]$ deviates from the value of the underlying target function $f(x)$.

We can see from the plot for $g(x)_1$ that $\mathbb E[g_1(x)]$ deviates significantly from $f(x)$. Thus we can say that the estimator $g_1(x)$ exhibits large bias when approximating the function $f(x)$.

Investigating the results for the estimator $g_{10}(x)$ (blue curves), we see that each individual model fit varies dramatically from one data set to another. Thus we can say that this estimator exhibits large variance. Looking at $\mathbb E[g_{10}(x)]$ (the dark blue curve), we see that on average the estimator $g_{10}(x)$ provides a better approximation to $f(x)$ than the estimator $g_1(x)$. Thus we can say that $g_{10}(x)$ exhibits a lower bias than the estimator $g_1(x)$.

Investigating the fits for the estimator $g_3(x)$ (green curves), we find that this estimator has low bias. Furthermore, the average estimator $\mathbb E[g_3(x)]$ (dark green curve) accurately approximates the true function $f(x)$, telling us that that the estimator $g_3(x)$ also has low bias.

We established earlier that the estimator $g_3(x)$ provided a qualitatively better fit to the function $f(x)$ than the other two polynomial estimators for a single dataset. It appears that this is also the case over many datasets. We also find that estimator $g_3(x)$ exhibits low bias and low variance, whereas the other two, less-desirable estimators, have either high bias or high variance. Thus it would appear that having both low bias and low variance is a reasonable criterion for selecting an accurate model of $f(x)$.

Included in each of the three plots above is a dashed black line representing the squared difference between the average estimator $\mathbb E[g_N(x)]$ and the true function $f(x)$. Calculating squared model errors is a common practice for quantifying the goodness of a model fit. If we calculate the expected value of each of the dashed black curves (and assuming that all values of $x$ are equally likely to occur), we would obtain a single value for each estimator that is the mean squared error (MSE) between the expected estimator and the true function.

$\mathbb E[(\mathbb E[g(x)]-f(x))^2] = \frac{1}{N}\sum_{i=1}^N (\mathbb E[g(x)]-f(x))^2$

For the estimator $g_3(x)$, the MSE will be very small, as the dashed black curve for this estimator is near zero for all values of $x$. The estimators $g_1(x)$ and $g_{10}(x)$ would have significantly larger values. Now, because exhibiting both a low MSE, as well as having both low bias and variance are indicative of a good estimator, it would be reasonable to assume that squared model error is directly related to bias and variance. The next section provides some formal evidence for this notion.

## Expected Prediction Error and the Bias-variance Tradeoff

For a given estimator $g(x)$ fit to a data set of $xy$ pairs, we would like to know, given all the possible datasets out there, what is the expected prediction error we will observe for a new data point $x^*, y^* = f(x^*) + \epsilon$. If we define prediction error to be the squared difference in model prediction $g(x^*)$ and observations $y^*$, the expected prediction error is then:

$\mathbb E[(g(x^*) - y^*)^2]$

If we expand this a little and use a few identities, something interesting happens:

(2) $\mathbb E[(g(x^*) - y^*)^2] = \mathbb E[g(x^*)^2-2g(x^*)y^*+y^{*2}]=$

(3) $\mathbb E[g(x^*)^2]-2\mathbb E[g(x^*)y^*]+\mathbb E[y^{*2}]=$

(4) $\mathbb E[(g(x^*)-\mathbb E[g(x^*)])^2] + \mathbb E[g(x^*)]^2-2 \mathbb E[g(x^*)]f(x^*) + \mathbb E[(y^*-f(x^*))^2] + f(x^*)^2$

where we have applied the following Lemma to the first and third terms of Equation (3), and use the fact to $\mathbb E[y] = f(x)$ (Think of averaging over an infinite number of datasets sampled from y; all noise will average out, leaving $f(x)$). Rearranging Equation (4), we obtain

(5) $\mathbb E[(g(x^*)-\mathbb E[g(x^*)])^2] + \mathbb E[g(x^*)]^2 - 2 \mathbb E[g(x^*)]f(x^*) + f(x^*)^2 + \mathbb E[(y^*-f(x^*))^2]$

which can be further simplified and grouped into three terms

(6) $\mathbb E[(g(x^*) - \mathbb E[g(x^*)])^2]$ $+\mathbb (E[g(x^*)]-f(x^*))^2$ $+ \mathbb E[(y^*-f(x^*))^2]$

1. The first term is the variance of the estimator introduced above.
2. The second term is the square of the bias of the estimator, also introduced above.
3. The third term is the variance of the observation noise and describes how much the observations $y$ vary from the true function $f(x)$. Notice that the noise term does not depend on the estimator $g(x)$. Thus the noise term is a constant that places a lower bound on expected prediction error.

Here we find that the expected prediction error on new data $(x^*,y^*)$ (in the squared differences sense) is the combination of three terms:

Expected prediction error = estimator variance + squared estimator bias + noise

Thus the expected prediction error on new data can be used as a quantitative criterion for selecting the best model from a candidate set of estimators! It turns out that, given $N$ new data points $(\bold x^*,\bold y^*)$, the expected prediction error can be easily approximated as the mean squared error over data pairs:

$\mathbb E[(g(\bold x^*) - \bold y^*)^2] \approx \frac{1}{N}\sum_{i=1}^N(g(x_i^*)-y_i^*)^2$

Below we demonstrate these findings with another set of simulations. We simulate 100 independent datasets, each with 25 $xy$ pairs. We then partition each dataset into two non-overlapping sets: a training set using for fitting model parameters, and a testing set used to calculate prediction error. We then fit the parameters for estimators of varying complexity. Complexity is varied by using polynomial functions that range in model order from 1 (least complex) to 12 (most complex). We then calculate and display the squared bias, variance, and error on testing set for each of the estimators:

N = 25; % # OF OBSERVATIONS PER DATASET
K = 100;% # OF DATASETS
noiseSTD = .5; % NOISE STANDARDE DEV.
nTrain = ceil(N*.9); % # OF TRAINING POINTS
nPolyMax = 12; % MAXIMUM MODEL COMPLEXITY

% # INITIALIZE SOME VARIABLES
xGrid = linspace(-1,1,N);
meanPrediction = zeros(K,N);
thetaHat = {};
x = linspace(-1,1,N);
x = x(randperm(N));
for iS = 1:K % LOOP OVER DATASETS
% CREATE OBSERVED DATA, y
y = f(x) + noiseSTD*randn(size(x));

% CREATE TRAINING SET
xTrain = x(1:nTrain);
yTrain = y(1:nTrain);

% CREATE TESTING SET
xTest = x(nTrain+1:end);
yTest = y(nTrain+1:end);

% FIT MODELS
for jD = 1:nPolyMax

% MODEL PARAMETER ESTIMATES
thetaHat{jD}(iS,:) = polyfit(xTrain,yTrain,jD);

% PREDICTIONS
yHatTrain{jD}(iS,:) = polyval([thetaHat{jD}(iS,:)],xTrain); TRAINING SET
yHatTest{jD}(iS,:) = polyval([thetaHat{jD}(iS,:)],xTest);% TESTING SET

% MEAN SQUARED ERROR
trainErrors{jD}(iS) = mean((yHatTrain{jD}(iS,:) - yTrain).^2); % TRAINING
testErrors{jD}(iS) = mean((yHatTest{jD}(iS,:) - yTest).^2); % TESTING
end
end

% CALCULATE AVERAGE PREDICTION ERROR, BIAS, AND VARIANCE
for iD = 1:nPolyMax
trainError(iD) = mean(trainErrors{iD});
testError(iD) = mean(testErrors{iD});
biasSquared(iD) = mean((mean(yHatTest{iD})-f(xTest)).^2);
variance(iD) = mean(var(yHatTest{iD},1));
end
[~,bestModel] = min(testError);

% DISPLAY
figure;
hold on;
plot(testError,'k','Linewidth',2);
plot(biasSquared,'r','Linewidth',2);
plot(variance,'b','Linewidth',2);
plot(biasSquared + variance,'m-.','Linewidth',2);
yl = ylim;
plot([bestModel,bestModel],[yl(1),yl(2)],'k--');
xlim([1,nPolyMax]);
xlabel('Model Complexity (Polynomial Order)')
legend('Test Error','Bias^2','Variance','Bias^2+Var.','Best Model')
hold off;


Here we see how, as the model complexity increases, the estimator variance (blue curve) gradually increases. Additionally, as model complexity increases, the squared bias (red curve) decreases. Thus there is a tradeoff between bias and variance that comes with model complexity: models that are too complex will have high variance and low bias; models that are too simple will have high bias and low variance. The best model will have both low bias and low variance. In this example, we highlight the best estimator in terms of prediction error on the testing set (black curve) with a dashed black vertical line. The best estimator corresponds to a polynomial model of order of $N=3$. Notice that the vertical black line is located where function defined by the sum of the squared bias and variance (dashed magenta curve) is also at a minimum.

Notice also how the sum of the squared bias and variance also has the same shape as curve defined by the prediction error on the testing set. This exemplifies how the error on novel data can be used as a proxy for determining the best estimator from a candidate set based on squared bias and variance. The noise term in Equation (6) is also represented in the figure by the vertical displacement between the black curve and dashed magenta curve.

It is very important to point out that all of these results are based on evaluating prediction error on novel data, not used to estimate model parameters. In fact assessing a model performance based on prediction error calculated on the same data used to estimate the model parameters is highly problematic, as it causes models to always overfit. In plain terms, this means that we will always favor a more complex model if we assess goodness of model fits on the training data, as a more complex model will be better able to capture small, random trends in the data due to noise.

This overfitting phenomenon is demonstrated below. We plot the error calculated on the training set (Train Error, green curve) along with the error calculated on the testing set (Test Error, black curve) for the above simulation. We also identify the best estimator as we did above.

% DISPLAY
figure, hold on;
plot(trainError,'g','Linewidth',2);
plot(testError,'k','Linewidth',2);
yl = ylim;
plot([bestModel,bestModel],[yl(1),yl(2)],'k--');
xlim([1,nPolyMax]);
xlabel('Model Complexity (Polynomial Order)')
legend('Train Error','Test Error','Best Model');
hold off;


We see here that as model complexity increases, the error calculated on the training set continues to decrease, where as the error on the testing set increases past the optimal polynomial order $N=3$. We  showed above that error calculated on the testing set is the true indicator of how well an estimator will generalize to new data points. The error calculated on the training set strongly disagrees with the error calculated on the testing set after the optimal model complexity has been reached. Since, in general, the whole point of modeling a data set is to generalize to novel data, assessing model predictions on the training set data should be avoided.

## Wrapping Up

Here we discussed how the bias and variance of an estimator are related to squared prediction error. Though we focused on regression, these concepts can also be applied to classification problems. We found that an optimal estimator will have both low variance and low bias. We further found that information about squared bias and variance is contained in expected prediction error calculated on a testing set of data not used to fit a model’s parameters.

The concepts of estimator bias and variance are generally only clear in the context of an ensemble of datasets. However, in real-world applications, there is generally only a single observed dataset. In such cases the roles of bias and variance are less obvious (though, it is possible to calculate estimates of variance and bias using resampling methods such as bootstrapping). However, the direct connection we made between bias, variance with the mean-squared error calculated on a testing set give us a direct means for assessing a group of candidate estimators in light of a single data set. We only need to partition the available data set into separate portions: one used to fit model parameters (a training set), and another used to assess prediction accuracy (testing set). Comparing prediction accuracy across potential estimators is equivalent to assessing biases and variances of the estimators across many datasets. Note that resampling methods such as cross-validation can prove helpful here, particularly when the amount of observed data is small.

Note, all the code for this post, containe in a single script is here:

clear
clc
close all

% CREATE A RANDOM DATASET
rng(10)
nData = 10; % N
x = 2*(rand(1,nData)-.5);

xGrid = linspace(-1,1,100);

% DEFINE AND TARGET FUNCTION f(x)
f = inline('sin(pi*x)','x');

h = [];
h(1) = plot(xGrid,f(xGrid),'k','Linewidth',2);
xlabel('x')
hold on

% DEFINE AND DISPLAY y
noiseSTD = .1;
y = f(x) + noiseSTD*randn(size(x));
h(2) = scatter(x,y,'ko');
legend(h,{'f(x)','y'},'Location','Northwest');

% FIT POLYNOMIAL MODELS & DISPLAY
% (ASSUMING PREVIOUS PLOT ABOVE STILL AVAILABLE)
degree = [1,3,10];
theta = {};
cols = [.8 .05 0.05; 0.05 .6 0.05; 0.05 0.05 .6];
for iD = 1:numel(degree)
figure(1)
theta{iD} = polyfit(x,y,degree(iD));
fit{iD} = polyval(theta{iD},xGrid);
h(end+1) = plot(xGrid,fit{iD},'color',cols(iD,:),'Linewidth',2);
xlim([-1 1])
ylim([-1 1])
end
legend(h,'f(x)','y','g_1(x)','g_3(x)','g_{10}(x)','Location','Northwest')

% FIT MODELS TO K INDEPENDENT DATASETS
K = 50;
for iS = 1:K
ySim = f(x) + noiseSTD*randn(size(x));
for jD = 1:numel(degree)
% FIT THE MODEL USING polyfit.m
thetaTmp = polyfit(x,ySim,degree(jD));
% EVALUATE THE MODEL FIT USING polyval.m
simFit{jD}(iS,:) = polyval(thetaTmp,xGrid);
end
end

% DISPLAY ALL THE MODEL FITS
h = [];
for iD = 1:numel(degree)
figure(iD+1)
hold on
% PLOT THE FUNCTION FIT TO EACH DATASET
for iS = 1:K
h(1) = plot(xGrid,simFit{iD}(iS,:),'color',brighten(cols(iD,:),.6));
end
% PLOT THE AVERAGE FUNCTION ACROSS ALL FITS
h(2) = plot(xGrid,mean(simFit{iD}),'color',cols(iD,:),'Linewidth',5);
% PLOT THE UNDERLYING FUNCTION f(x)
h(3) = plot(xGrid,f(xGrid),'color','k','Linewidth',3);
% CALCULATE THE SQUARED ERROR AT EACH POINT, AVERAGED ACROSS ALL DATASETS
squaredError = (mean(simFit{iD})-f(xGrid)).^2;
% PLOT THE SQUARED ERROR
h(4) = plot(xGrid,squaredError,'k--','Linewidth',3);
uistack(h(2),'top')
hold off
axis square
xlim([-1 1])
ylim([-1 1])
legend(h,{sprintf('Individual g_{%d}(x)',degree(iD)),'Mean of All Fits','f(x)','Squared Error'},'Location','WestOutside')
title(sprintf('Model Order=%d',degree(iD)))
end

N = 25; % # OF OBSERVATIONS PER DATASET
K = 100;% # OF DATASETS
noiseSTD = .5; % NOISE STANDARDE DEV.
nTrain = ceil(N*.9); % # OF TRAINING POINTS
nPolyMax = 12; % MAXIMUM MODEL COMPLEXITY

% # INITIALIZE SOME VARIABLES
xGrid = linspace(-1,1,N);
meanPrediction = zeros(K,N);
thetaHat = {};
x = linspace(-1,1,N);
x = x(randperm(N));
for iS = 1:K % LOOP OVER DATASETS

% CREATE OBSERVED DATA, y
y = f(x) + noiseSTD*randn(size(x));

% CREATE TRAINING SET
xTrain = x(1:nTrain);
yTrain = y(1:nTrain);

% CREATE TESTING SET
xTest = x(nTrain+1:end);
yTest = y(nTrain+1:end);

% FIT MODELS
for jD = 1:nPolyMax

% MODEL PARAMETER ESTIMATES
thetaHat{jD}(iS,:) = polyfit(xTrain,yTrain,jD);

% PREDICTIONS
yHatTrain{jD}(iS,:) = polyval([thetaHat{jD}(iS,:)],xTrain); % TRAINING SET
yHatTest{jD}(iS,:) = polyval([thetaHat{jD}(iS,:)],xTest);% TESTING SET

% MEAN SQUARED ERROR
trainErrors{jD}(iS) = mean((yHatTrain{jD}(iS,:) - yTrain).^2); % TRAINING
testErrors{jD}(iS) = mean((yHatTest{jD}(iS,:) - yTest).^2); % TESTING
end
end

% CALCULATE AVERAGE PREDICTION ERROR, BIAS, AND VARIANCE
for iD = 1:nPolyMax
trainError(iD) = mean(trainErrors{iD});
testError(iD) = mean(testErrors{iD});
biasSquared(iD) = mean((mean(yHatTest{iD})-f(xTest)).^2);
variance(iD) = mean(var(yHatTest{iD},1));
end
[~,bestModel] = min(testError);

% DISPLAY
figure;
hold on;
plot(testError,'k','Linewidth',2);
plot(biasSquared,'r','Linewidth',2);
plot(variance,'b','Linewidth',2);
plot(biasSquared + variance,'m-.','Linewidth',2);
yl = ylim;
plot([bestModel,bestModel],[yl(1),yl(2)],'k--');
xlim([1,nPolyMax]);
xlabel('Model Complexity (Polynomial Order)')
legend('Test Error','Bias^2','Variance','Bias^2+Var.','Best Model')
hold off;

% DISPLAY
figure, hold on;
plot(trainError,'g','Linewidth',2);
plot(testError,'k','Linewidth',2);
yl = ylim;
plot([bestModel,bestModel],[yl(1),yl(2)],'k--');
xlim([1,nPolyMax]);
xlabel('Model Complexity (Polynomial Order)')
legend('Train Error','Test Error','Best Model');
hold off;


## Emergence of Various Probability Distributions: Some Incite via Simulations

There a a large number of probability distributions that arise in statistical modeling and inference procedures. These distributions can take many complex analytic forms, and can have, what appear to be, arbitrary parameterizations. Though the definitions of many of these distributions are mathematically esoteric, it is often very helpful to see how they can emerge from simple simulation procedures. In this post we perform a number of simple manipulations to standard Normal/Gaussian-distributed variables that give rise to some probability distributions that predominate statistics:the  $\chi^2$, the F, and the Student’s t-distributions

# The Standard Normal Distribution

Before we begin the simulations, we need to first describe an important building block, the standard Normal distribution. The Normal distribution is a smooth continuous function that is symmetric about an axis and is often described as having a “bell” shape. The distribution lies at the heart of describing a vast number of physical processes. Also, many statistical quantities and distributions emerge from  simply transforming values drawn from the standard Normal distribution (I’ll describe what I mean by “standard” shortly).

Any Normal distribution is characterized by two parameters: a mean parameter $\mu$, which characterizes the location of the distribution center,  and a variance parameter $\sigma^2$, which characterizes the width of the distribution. Formally, the Normal distribution defines the probability $p(x)$ of some value $x$ occurring as:

$p(x) = \frac{1}{2 \pi \sigma^2}e^{-\frac{1}{2 \sigma^2}(x - \mu)^2}$

The standard Normal distribution has zero mean and and unit  variance (i.e. equal to 1), and therefore takes the simple form:

$p(x) = \frac{1}{2 \pi}e^{-\frac{1}{2}x^2}$

As I will show, a number of common probability distributions emerge from performing simple manipulations of values that follow the standard Normal distribution. In order to simulate these distributions we need to be able to draw values that follow the standard Normal distribution. Turns out that we can generate pseudo-random samples from a standard Normal distribution using a clever concept known as the Box-Muller transformation. Though I won’t go into the details, the Box-Muller transformation relates measurements on the unit circle (i.e. in polar coordinates) to norms of random vectors in Cartesian space (i.e. x-y coordinates). If none of that makes any sense, don’t worry it’s not important. What is important is that if we know how to independently draw two random numbers $R_1$ and $R_2$ that take on values 0 to 1, then it is possible to draw a sample $N$ from the standard Normal via the following relationship

$N = \sqrt{-2 \log R_1} \cos(2 \pi R_2)$

or also

$N = \sqrt{-2 \log R_1} \sin(2 \pi R_2)$

since sin(x) and cos(x) are orthogonal to one another. Figure 1 compares the empirical probability distribution of values drawn from a standard Normal distribution using the MATLAB function $\text{randn}$ to the distribution of values drawn from the uniform distribution using the function $\text{rand}$ and transforming those values using the Box-Muller transformation.

Figure 1: Theoretical and simulated random Normal variables. Simulated variables are calculated using the Box-Muller transformation of uniform variables on the interval (0 1).

Note that the Box-Muller transformation is not the only (or even the best/most efficient) method for generating random variables from the standard Normal distribution (for instance, see the Polar Method). The reason for demonstrating the Box-Muller method is its simplicity. For a MATLAB example of drawing samples using the Box-Muller method, see lines (16-18) in the source code below.

Ok, so now that we have a way generating samples from the standard Normal distribution, we can now begin transforming these samples in order to develop other distributions.

# Simulating the $\chi^2$ Distribution

The $\chi^2$ distribution is a common probability distribution used in statistical inference (i.e. the $\chi^2$ test) and assessing the goodness of fit of linear models to data. Specifically, a random variable $Z$ drawn from the $\chi^2$ with $\nu$ degrees of freedom is obtained by drawing $\nu$ independent variables $N$ from the standard normal distribution, squaring each value drawn and taking the sum of those squared values. Formally, this process is described as:

$Z = \sum_{i=1}^\nu N_i^2$

For a MATLAB example of simulating $\chi^2$-distributed variables, see lines (37-54) in the source code below. The results of this simulation are shown in right of Figure 2, and are compared to the theoretical distribution, which takes the mathematical form:

$p(x) = \frac{x^{(\nu/2) -1} e^{-(x/2)}}{2^\nu \Gamma(\frac{\nu}{2})}$

For values of $x \geq 0$ The $\Gamma$ is what is known as the Gamma function, and can be thought of as an extension to the factorial (i.e. !) operator to continuous values (i.e. ! only works on integers).

Figure 2: Theoretical (left) and simulated (right) Chi-Squared distributions having 1-10 degrees of freedom (dF)

Perhaps it is just me, but I feel that it is far more natural to interpret the $\chi^2$ distribution as a sum of squares of standard Normal variables, than this fairly complicated expression. The interpretation also lends itself well to testing the goodness of fit of a linear model to data using the sum of squares loss function. This brings us to a second probability distribution, which is often used in the analysis of variance of linear model fits.

# Simulating the F-Distribution

In order to test the goodness of fit of a linear model to some data, it is often helpful to look at the ratio of  the mean squared error of the model prediction to the mean squared deviation of the data from its mean (i.e. the variance of the data). This ratio can be thought of as the proportion of the variance in the data that is explained by the model. Because this ratio is composed of two values that result from squaring and summing deviations or errors, we are equivalently taking the ratio of values that follow two $\chi^2$ distributions. If the numerator and denominator of this ratio are scaled by their respective degrees of freedom (i.e. the number of model parameters minus one for the numerator value and the number of datapoints minus one for the denominator), then the value that results is what is known as an F-statistic.  These F-statistics can be used to quantify how well (i.e. the statistical significance) of the model fit by comparing it to the distribution of F-statistics values that follow the form:

$F(x,\nu_1,\nu_2) = \frac{\sqrt{\frac{\nu_1 x^{\nu_1}\nu_2^{\nu_2}}{(\nu_1 x + \nu_2)^{\nu_1+\nu_2}}}}{xB(\nu_1/2,\nu_2/2)}$

for variables having $\nu_1$ and $\nu_2$ degrees of freedom. The funcion $B$ is known as the Beta function or integral.

We can also simulate distributions of F-statistics by generating two $\chi^2$ variables with respective degrees of freedom $\nu_1$ and $\nu_2$ (just as we did above), scaling each variable by it’s relative degrees of freedom and taking their ratio. Formally this is described by:

$F(\nu_1,\nu_2) = \frac{\chi_{\nu_1}^2/\nu_1}{\chi_{\nu_2}^2/\nu_2}$

A MATLAB example of this simulation is shown in lines (77-104) in the code below. The results of the simulation are displayed in the bottom row of Figure 3.

Figure 3: Theoretical (top row) and simulated (bottom row) F-distributions with 2,10,50, & l00 degrees of freedom (dF). Columns represent theoretical and simulated distributions resulting from having a numerator Chi-Squared distribution with each dF. Each function corresponds to the distribution resulting form a denominator Chi-Squared distribution with a particular dF.

Again, it insightful to interpret the F-distribution as simply the ratio of scaled  $\chi^2$ distributions rather than the complicated mathematical expression for the F-distribution shown above.

# Simulating the Student’s t-distribution

During statistics analyses, it is often helpful to determine if the mean $\bar x$ of a dataset of size $N$  is different than some specified value $\mu_0$. In order to test the hypothesis is customary to calculate what is called a test statistic:

$t = \frac{\bar x - \mu_0}{\sqrt{s^2/n}}$

Here the sample mean is

$\bar x = \frac{1}{n}\sum_{i=1}^n x_n$

and the sample variance is

$s^2 = \frac{1}{n-1}\sum_{i=1}^n(x - \bar x)^2$

The distribution of these t-values is known as the t-distribution.  Like the Normal distribution, the t-distribution is symmetric and “bell” shaped, but has heavier tails than a Normal distribution and is parameterized by a single parameter $\nu$  that corresponds to the degrees of freedom in the sample (i.e. sample size). Formally, the t-distribution has the following form

$p(t) = \frac{\Gamma(\frac{\nu+1}{2})}{\sqrt{\nu \pi}\Gamma(\nu/2)} \left( 1 + \frac{t^2}{\nu}\right)^{\frac{\nu+1}{2}}$

The t-distribution is used  in a number of other statistical tests as well, particularly when an available sample size is small. Also, because it was developed in order to make Guiness the fine beer that it is today, the Student’s t-distribution is likely even more important than the Normal distribution!

The t-distribution with degrees for freedom $\nu$ can be simulated by repeatedly sampling $\nu$ independent datapoints from the standard normal distributions and calculating t-values on these samples. A MATLAB simulation of various t-distributions is shown in Figure 4, along with comparisons to the theoretical function shown above (see lines (130-148) of the code below).

Figure 4: Theoretical (left) and simulated (right) Student’s-t distribution with 2,10,50,& 100 degrees of freedom (dF).

Also, another way to simulate a t-distribution (not shown here) is to sample values that are the ratio of a standard normal variable $N$ to the square root of a $\chi^2$-distributed variable scaled by its degrees of freedom:

$T(\nu) = \frac{N}{\sqrt{Z_{\nu}/\nu}}$

# Wrapping up

So, hopefully these simulations give you some incite on how some standard probability distributions can come about. Just another example of how “complex” ideas  can be derived from simple manipulations of a general mechanism, but are often hidden in the mathematical detail and formalisms.

# MATLAB Code

(Use the toolbar in the upper right corner of the source in order to copy to your clipboard)

clear all; close all

% SOME CONSTANTS
nNu = 10;
nSimu = 1e6;
dx = 0.1;
x = -6:dx:6;

% PART 1: NORMAL DISTRIBUTION
%-----------------------------
% ONE CAN GENERATE NORMALLY-DISTRIBUTED RANDOM VARIABLES FROM
% UNIFORMLY DISTRIBUTED VARIABLES USING A GEOMETRIC METHOD KNOWN
% AS THE BOX-MULLER TRANSFORM

fprintf('\nSimulating Normal Distribution From Uniform Random Variables.');
R = rand(nSimu,2);
NSimu = sqrt(-2*log(R(:,1))).*cos(2*pi*R(:,2));
NTheory = randn(1,nSimu);
% OR...
% N = sqrt(-2*log(R(:,1))).*sin(2*pi*R(:,2));
% (THESE ACTUALLY COME IN PAIRS)

figure('Name','Simulated Normal Distribution');
[nHistTheory,nBin] = hist(NTheory,x);
nHistTheory = nHistTheory/(sum(nHistTheory)*dx);
bar(nBin,nHistTheory,'k')
hold on;
[nHist,nBin] = hist(NSimu,x);
nHist = nHist/(dx*sum(nHist));
stairs(nBin,nHist,'r','Linewidth',2)
legend({'Theory','Simulated'})
title(sprintf('Simulated Mean = %1.4f\n Simulated Variance = %1.4f',mean(NSimu),var(NSimu)))
xlim([-6 6]);
fprintf('\nPress any key to continue.\n')
pause

% PART 2: CHI-SQUARED-DISTRIBUTION
%----------------------------------
% SIMULATE AND CALCULATE THEORETICAL VALUES
% FOR CHI-SQUARE DISTRIBUTION
fprintf('\nSimulating Chi-Squared Distribution\n');
zFun0 = 'randn(1,nSimu).^2'; zFun = zFun0;
dx = 0.01;
x = 0:dx:15;
legStr = {};
for nu = 1:nNu
fprintf('\r dF = %d',nu);
zTmp = eval(zFun);
[zTmp,zBin] = hist(zTmp,x);
zSimu(:,nu) = zTmp/(dx*sum(zTmp));	% SIMULATED
zTheory(:,nu) = chi2pdf(x,nu);		% THEORETICAL
zFun = [zFun,'+',zFun0];
legStr{end+1} = sprintf('dF = %d',nu);
end

% DISPLAY CHI-SQUARE SIMULATIONS
figure('Name','Simulated Chi-Squared Distribution')
subplot(121);plot(zBin,zTheory,'Linewidth',2)
xlim([0 15]); ylim([0,.4]);
legend(legStr)
xlabel('x'); ylabel('p(x)')
title('Theoretical')

subplot(122);plot(zBin,zSimu)
xlim([0 15]); ylim([0,.4]);
legend(legStr)
xlabel('x'); ylabel('p(x)')
title('Simulated')
fprintf('\nPress any key to continue.\n')
pause

% PART 3: F-DISTRIBUTION
%------------------------
% SIMULATE AND CALCULATE THEORETICAL VALUES
% FOR F-DISTRIBUTION
fprintf('\nSimulating F-Distribution\n');
nSimu = 1e5;
cnt = 1;
nu = [2,10,50,100];
x = 0:dx:5;
zFun0 = 'randn(1,nSimu).^2'; zFun = zFun0;
legStr = {};

for nuTop = nu
zFunTop = zFun0;
for iF = 1:nuTop-1
zFunTop = [zFunTop,'+',zFun0];
end
for nuBottom = nu
fprintf('\r dF1 = %d, dF2 = %d',nuTop,nuBottom);
zFunBottom = zFun0;
for iF = 1:nuBottom-1
zFunBottom = [zFunBottom,'+',zFun0];
end
zTop = eval(zFunTop)/nuTop;	% CHI-2 DISTRIBUTIONS SCALED BY dF
zBottom = eval(zFunBottom)/nuBottom;
Ftmp = zTop./zBottom;   	% F IS THE RATIO OF SCALED CHI-2s
[Ftmp,zBin] = hist(Ftmp,x);
FSimu(:,cnt) = Ftmp/(dx*sum(Ftmp));
FTheory(:,cnt) = fpdf(x,nuTop,nuBottom);
cnt = cnt+1;
end
legStr{end+1} = sprintf('dF = %d',nuTop);
end

% DISPLAY F-DISTRIBUTION SIMULATIONS
figure('Name','Simulated F Distribution')
axCnt = cnt;
for iP = 1:numel(nu);
subplot(2,numel(nu),iP)
plot(x,FTheory(:,(iP-1)*numel(nu)+1:iP*numel(nu)),'Linewidth',2);
xlim([0 5]); ylim([0 2]);
legend(legStr);
title(sprintf('dF = %d',nu(iP)))
xlabel('x'); ylabel('p(x) Theoretical');

subplot(2,numel(nu),numel(nu)+iP);
plot(x,FSimu(:,(iP-1)*numel(nu)+1:iP*numel(nu)));
xlim([0 5]); ylim([0 2]);
legend(legStr);
title(sprintf('dF = %d',nu(iP)));
xlabel('x'); ylabel('p(x) Simulated');
end
fprintf('\nPress any key to continue.\n')
pause

% PART 4 STUDENT'S T-DISTRIBUTION
%---------------------------------
fprintf('\nSimulating Student''s t Distribution\n');
nSimu = 1e6;
dx = 0.01;
x = -6:dx:6;
nu = [2 10 50 100];
legStr = {};
cnt = 1;
for dF = nu
fprintf('\r dF = %d',dF);
x0 = randn(dF,nSimu);
xBar = mean(x0);
s = var(x0);
mu = 0;                             % randn IS ZERO MEAN
tTmp = (xBar - mu)./sqrt(s/dF);
[tTmp,tBin] = hist(tTmp,x);
tSimu(:,cnt) = tTmp/(dx*sum(tTmp));	% SIMULATED
tTheory(:,cnt) = tpdf(x,dF);		% THEORETICAL
legStr{end+1} = sprintf('dF = %d',dF);
cnt = cnt+1;
end

% DISPLAY T-DISTRIBUTION SIMULATIONS
figure('Name','Simulated Student''s t Distribution')
subplot(121);plot(tBin,tTheory,'Linewidth',2)
xlim([-6 6]); ylim([0 .4]);
legend(legStr)
xlabel('x'); ylabel('p(x)')
title('Theoretical')

subplot(122);plot(tBin,tSimu,'Linewidth',2)
xlim([-6 6]); ylim([0 .4]);
legend(legStr)
xlabel('x'); ylabel('p(x)')
title('Simulated')
fprintf('\nPress any key to finish demo.\n')
pause

clear all; close all