Category Archives: Uncategorized

Monte Carlo Approximations

Monte Carlo Approximation for Integration

Using statistical methods we often run into integrals that take the form:

I = \int_a^b h(x)g(x) dx

For instance, the expected value of a some function f(x) of a random variable X

\mathbb E[x] = \int_{p(x)} p(x) f(x) dx

and many quantities essential for Bayesian methods such as the marginal likelihood a.k.a “model evidence”

p(x) = \int_\theta p(x | \theta) p(\theta) dx

involve integrals of this form.  Sometimes (not often) such an integral can be evaluated analytically. When a closed form solution does not exist, numeric integration methods can be applied. However numerical methods quickly become intractable for any practical application that requires more than a small number of dimensions. This is where Monte Carlo approximation comes in. Monte Carlo approximation allows us to calculate an estimate for the value of I by transforming the  integration problem into a procedure of sampling values from a tractable probability distribution and calculating the average of those samples. Here’s what I mean:

If the function g(x) fullfills two simple criteria, namely that the function is always positive on the interval (a,b)

g(x) \geq 0, x \in (a,b)

and that the integral of the function is finite

\int_a^b g(x) = C < \infty

then we can define a corresponding probability distribution on the interval (a,b):

p(x) = \frac{g(x)}{C}

Another way to think of it is that g(x) is a probability distribution scaled by a constant C.

Using this link between probability distributions p(x) and g(x), we can restate the original integration as

I = C \int_a^b h(x) p(x)dx = C \mathbb E_{p(x)}[h(x)]

This statement says that if we can sample values of x using p(x), then the value of  the original integral I is simply a scaled version of the expected value of the integrand function h(x) calculated using those samples. Turns out that the expected value \mathbb E_{p(x)}[h(x)] can be easily approximated by the sample mean:

\mathbb E_{p(x)} [h(x)] \approx \frac{1}{N} \sum_i^N h(x_i)

where samples x_i, i = 1..N are drawn independently from p(x). This leads to a simple 4-Step Procedure for performing Monte Carlo approximation to the integral I:

  1. Identify h(x)
  2. Identify g(x) and from it determine p(x) and C
  3. Draw N independent samples from p(x)
  4. Evaluate I = C \mathbb E[h(x)] \approx \frac{C}{N} \sum_i^N h(x_i)

The larger the number of samples N we draw, the better our approximation to the actual value of I. This 4-step procedure is demonstrated in a some toy examples below:

Example 1: Approximating the integral xe^x

Say we want to calculate the integral:

I = \int_0^1 x e^x dx

We can calculate the closed form solution of this integral using integration by parts:

u = x, dv = e^x

du = dx, v = e^x


I = uv - \int v du

= xe^x - \int e^x dx

= \left. xe^x - e^x \right|_0^1

= \left. e^x(x-1)\right|_0^1

= 0 - (-1) = 1

Orr…we could calculate the  Monte Carlo approximation of this integral.

Step 1 we identify

h(x) = xe^x

Step 2 we identify

g(x) = 1

and from this can also determine the probability distribution function p(x) \in (a,b) = (0,1). According to the definition expression for p(x) given above we detemine p(x) to be:

p(x) = \frac{g(x)}{\int_a^b g(x)dx} = \frac{1}{b-a}.

Step 3: The expression on the right is the definition for the uniform distribution Unif(0,1), which is easy to sample from using the MATLAB \text{rand()} (Notice too that the constant C = b-a = 1).

Step 4: we calculate the Monte Carlo approximation as

I = C \mathbb E_{p(x)} h(x)

\approx \frac{1}{N}\sum_{i=1}^N x_i e^{x_i},

where each x_i is sampled from the standard uniform distribution. Below is some MATLAB code running the Monte Carlo Approximation for two different values of N


N1 = 100;
x = rand(N1,1);
Ihat1 = sum(x.*exp(x))/N1

N2 = 5000;
x = rand(N2,1);
Ihat2 = sum(x.*exp(x))/N2

Comparing the values of the variables \text{Ihat1} and \text{Ihat2} we see that the Monte Carlo approximation is better for a larger number of samples.

Example 2: Approximating the expected value of the Beta distribution

Lets look at how the 4-step Monte Carlo approximation procedure can be used to calculate expectations. In this example we will calculate

\mathbb E[x] = \int_{p(x)} p(x)x dx,

where x \sim p(x) = Beta(\alpha_1,\alpha_2).

Step 1: we identify h(x) = x.

Step 2: the function g(x) is simply the probability density function p(x) due the expression for p(x) above:

p(x)^\star = \frac{p(x)}{\int p(x)dx} = p(x).

Step 3 we can use MATLAB to easily draw N independent samples p(x) using the function \text{betarnd()}. And finally,

Step 4 we approximate the expectation with the expression

\mathbb E[x]_{Beta(\alpha_1,\alpha_2)} \approx \frac{1}{N} \sum_i x_i

Below is some MATLAB code that performs this approximation of the expected value.

alpha1 = 2; alpha2 = 10;
N = 10000;
x = betarnd(alpha1,alpha2,1,N);

expectMC = mean(x);

expectAnalytic = alpha1/(alpha1 + alpha2);

bins = linspace(0,1,50);
counts = histc(x,bins);
probSampled = counts/sum(counts);
probTheory = betapdf(bins,alpha1,alpha2);
b = bar(bins,probSampled); colormap hot; hold on;
t = plot(bins,probTheory/sum(probTheory),'r','Linewidth',2)
m = plot([expectMC,expectMC],[0 .1],'g')
e = plot([expectAnalytic,expectAnalytic],[0 .1],'b')
xlim([expectAnalytic - .05,expectAnalytic + 0.05])
title(['E_{MC} = ',num2str(expectMC),'; E_{Theory} = ',num2str(expectAnalytic)])
hold off

And the output of the code:

Monte Carlo Approximation of the the Expected value of Beta(2,10)

The analytical solution for the expected value of this Beta distribution:

\mathbb E_{Beta(2,10)}[x] = \frac{\alpha_1}{\alpha_1 + \alpha_2} = \frac{2}{12} = 0.167

is quite close to our approximation (also indicated by the small distance between the blue and green lines on the plot above).

Monte Carlo Approximation for Optimization

Monte Carlo Approximation can also be used to solve optimization problems of the form:

\hat x = \underset{x \in (a,b)}{\text{argmax }} g(x)

If g(x) fulfills the same criteria described above (namely that it is a scaled version of a probability distribution), then (as above) we can define the probability function

p(x) = \frac{g(x)}{C}

This allows us to instead solve the problem

\hat x = \underset{x}{\text{argmax }}C p(x)

If we can sample from p(x), the solution \hat x is easily found by drawing samples from p(x) and determining the location of the samples that has the highest density (Note that the solution is not dependent of the value of C). The following example demonstrates Monte Carlo optimization:

Example: Monte Carlo Optimization of g(x) = e^{-\frac{(x-4)^2}{2}}

Say we would like to find the value of x_{opt} which optimizes the function g(x) = e^{-((x-4)^2)/2}. In other words we want to solve the problem

x_{opt} = \underset{x}{\text{argmax }} e^{-\frac{(x-4)^2}{2}}

We could solve for x_{opt} using standard calculus methods, but a clever trick is to use Monte Carlo approximation to solve the problem. First, notice that g(x)  is  a scaled version of a Normal distribution with mean equal to 4 and unit variance:

g(x) = C \times \frac{1}{\sqrt{2\pi}} e^{-\frac{(x-4)^2}{2}}

= C \times \mathcal N(4,1)

where C = \sqrt{2\pi}. This means we can solve for x_{opt} by drawing samples from the normal distribution and determining where those samples have the highest density. The following chunk of matlab code solves the optimization problem in this way.


N = 100000;
x = 0:.1:6;
C = sqrt(2*pi);
g = inline('exp(-.5*(x-4).^2)','x');
ph = plot(x,g(x)/C,'r','Linewidth',3); hold on
gh = plot(x,g(x),'b','Linewidth',2); hold on;
y = normpdf(x,4,1);

x = normrnd(4,1,1,N);
bins = linspace(min(x),max(x),100);
counts = histc(x,bins);
[~,optIdx] = max(counts);
xHat = bins(optIdx);

oh = plot([4 4],[0,1],'k');
hh = plot([xHat,xHat],[0,1],'g');


Monte Carlo Optimization

In the code output above we see the function g(x) we want to optimize in blue and the Normal distribution p(x) from which we draw samples in red. The Monte Carlo method provides a good approximation (green) to the real solution (black).

Wrapping Up

In the toy examples above it was easy to sample from p(x). However, for practical problems the distributions we want to sample from are often complex and operate in many dimensions. For these problems more clever sampling methods have to be used. Such sampling methods include Inverse Transform Sampling, Rejection Sampling, Importance Sampling, and Markov Chain Monte Carlo methods such as the Metropolis Hasting algorithm and the Gibbs sampler, each of which I plan to cover in separate posts.

Derivation: Ordinary Least Squares Solution and Normal Equations

In a linear regression framework, we assume some output variable y is a linear combination of some independent input variables X plus some independent noise \epsilon. The way the independent variables are combined is defined by a parameter vector \beta:

\Large{\begin{array}{rcl} y &=& X \beta + \epsilon \end{array}}

We also assume that the noise term \epsilon is drawn from a standard Normal distribution:

\Large{ \begin{array}{rcl}\epsilon &\sim& N(0,I)\end{array}}

For some estimate of the model parameters \hat \beta, the model’s prediction errors/residuals e are the difference between the model prediction and the observed ouput values

\Large{\begin{array}{rcl} e = y - X\hat \beta \end{array}}

The Ordinary Least Squares (OLS) solution to the problem (i.e. determining an optimal solution for \hat \beta) involves minimizing the sum of the squared errors with respect to the model parameters, \hat \beta. The sum of squared errors is equal to the inner product of the residuals vector with itself \sum e_i^2 = e^Te :

\Large{\begin{array}{rcl} e^T e &=& (y - X \hat \beta)^T (y - X \hat \beta) \\  &=& y^Ty - y^T (X \hat \beta) - (X \hat \beta)^T y + (X \hat \beta)^T (X \hat \beta) \\  &=& y^Ty - (X \hat \beta)^T y - (X \hat \beta)^T y + (X \hat \beta)^T (X \hat \beta) \\  &=& y^Ty - 2(X \hat \beta)^T y + (X \hat \beta)^T (X \hat \beta) \\  &=& y^Ty - 2\hat \beta^T X^T y + \hat \beta^T X^T X \hat \beta \\  \end{array}}

To determine the parameters, \hat \beta, we minimize the sum of squared residuals with respect to the parameters.

\Large{\begin{array}{rcl} \frac{\partial}{\partial \beta} \left[ e^T e \right] &=& 0 \\  &=& -2X^Ty + 2X^TX \hat \beta \text{, and thus} \\  X^Ty &=& X^TX \hat \beta  \end{array}}

due to the identity \frac{\partial \mathbf{a}^T \mathbf{b}}{\partial \mathbf{a}} = \mathbf{b}, for vectors \mathbf{a} and \mathbf{b}. This relationship is matrix form of the Normal Equations. Solving for \hat \beta gives  the analytical solution to the Ordinary Least Squares problem.

\Large{\begin{array}{rcl} \hat \beta &=& (X^TX)^{-1}X^Ty \end{array}}