# Monthly Archives: September 2012

## A Brief Introduction to Markov Chains

Markov chains are an essential component of Markov chain Monte Carlo (MCMC) techniques. Under MCMC, the Markov chain is used to sample from some target distribution. To get a better understanding of what a Markov chain is, and further, how it can be used to sample form a distribution, this post introduces and applies a few basic concepts.

A Markov chain is a stochastic process that operates sequentially (e.g. temporally), transitioning from one state to another within an allowed set of states.^{†}

A Markov chain is defined by three elements:

- A
, which is a set of values that the chain is allowed to take*state space* - A
that defines the probability of moving from state to .*transition operator* - An
which defines the probability of being in any one of the possible states at the initial iteration .*initial condition distribution*

The Markov chain starts at some initial state, which is sampled from , then transitions from one state to another according to the transition operator .

A Markov chain is called ** memoryless **if the next state only depends on the current state and not on any of the states previous to the current:

(This memoryless property is formally know as the *Markov property*).

If the transition operator for a Markov chain does not change across transitions, the Markov chain is called ** time homogenous**. A nice property of time homogenous Markov chains is that as the chain runs for a long time and , the chain will reach an equilibrium that is called the chain’s

**:**

*stationary distribution*We’ll see later how the stationary distribution of a Markov chain is important for sampling from probability distributions, a technique that is at the heart of Markov Chain Monte Carlo (MCMC) methods.

## Finite state-space (time homogenous) Markov chain

If the state space of a Markov chain takes on a finite number of distinct values, and it is time homogenous, then the transition operator can be defined by a matrix , where the entries of are:

This means that if the chain is currently in the -th state, the transition operator assigns the probability of moving to the -th state by the entries of -th row of (i.e. each row of defines a conditional probability distribution on the state space). Let’s take a look at a finite state-space Markov chain in action with a simple example.

### Example: Predicting the weather with a finite state-space Markov chain

In Berkeley, CA, there are (literally) only 3 types of weather: sunny, foggy, or rainy (this is analogous to a state-space that takes on three discrete values). The weather patterns are very stable there, so a Berkeley weatherman can easily predict the weather next week based on the weather today with the following transition rules:

If it is *sunny* today, then

- it is highly likely that it will be
*sunny*next week- ,

- it is very unlikely that it will be
*raining*next week - and somewhat likely that it will
*foggy*next week

If it is *foggy* today then

- it is somewhat likely that it will be
*sunny*next week - but slightly less likely that it will be
*foggy*next week- ,

- and fairly unlikely that it will be
*raining*next week.- ,

If it is *rainy* today then

- it is unlikely that it will be
*sunny*next week- ,

- it is somewhat likely that it will be
*foggy*next week- ,

- and it is fairly likely that it will be
*rainy*next week- ,

All of these transition rules can be instantiated in a single 3 x 3 transition operator matrix:

Where each row of corresponds to the weather at iteration (*today*), and each column corresponds to the weather the next week. Let’s say that it is rainy today, what is the probability it will be sunny next week, in two weeks, or in 6 months? We can answer these questions by running a Markov chain from the initial state of “rainy,” transitioning according to . The following chunk of MATLAB code runs the Markov chain.

% FINITE STATE-SPACE MARKOV CHAIN EXAMPLE % TRANSITION OPERATOR % S F R % U O A % N G I % N G N % Y Y Y P = [.8 .15 .05; % SUNNY .4 .5 .1; % FOGGY .1 .3 .6]; % RAINY nWeeks = 25 % INITIAL STATE IS RAINY X(1,:) = [0 0 1]; % RUN MARKOV CHAIN for iB = 2:nWeeks X(iB,:) = X(iB-1,:)*P; % TRANSITION end % DISPLAY figure; hold on h(1) = plot(1:nWeeks,X(:,1),'r','Linewidth',2); h(2) = plot(1:nWeeks,X(:,2),'k','Linewidth',2); h(3) = plot(1:nWeeks,X(:,3),'b','Linewidth',2); h(4) = plot([15 15],[0 1],'g--','Linewidth',2); hold off legend(h, {'Sunny','Foggy','Rainy','Burn In'}); xlabel('Week') ylabel('p(Weather)') xlim([1,nWeeks]); ylim([0 1]); % PREDICTIONS fprintf('\np(weather) in 1 week -->'), disp(X(2,:)) fprintf('\np(weather) in 2 weeks -->'), disp(X(3,:)) fprintf('\np(weather) in 6 months -->'), disp(X(25,:))

Here we see that at week 1 the probability of sunny weather is 0.1. The next week, the probability of sunny weather is 0.26, and in 6 months, there is a 60% chance that it will be sunny. Also note that after approximately 15 weeks the Markov chain has reached the equilibrium/stationary distribution and, chances are, the weather will be sunny. This 15-week period is what is known as the ** burn in** period for the Markov chain, and is the number of transitions it takes the chain to move from the initial conditions to the stationary distribution.

A cool thing about finite state-space time-homogeneous Markov chain is that it is not necessary to run the chain sequentially through all iterations in order to predict a state in the future. Instead we can predict by first raising the transition operator to the -th power, where is the iteration at which we want to predict, then multiplying the result by the distribution over the initial state, . For instance, to predict the probability of the weather in 2 weeks, knowing that it is rainy today (i.e. ):

and in six months:

These are the same results we get by running the Markov chain sequentially through each number of transitions. Therefore we can calculate an approximation to the stationary distribution from by setting to a large number. It turns out that it is also possible to analytically derive the stationary distribution from (hint: think about the properties of eigenvectors).

## Continuous state-space Markov chains

A Markov chain can also have a continuous state space that exists in the real numbers . In this case the transition operator cannot be instantiated simply as a matrix, but is instead some continuous function on the real numbers. Note that the continuous state-space Markov chain also has a burn in period and a stationary distribution. However, the stationary distribution will also be over a continuous set of variables. To get a better understanding of the workings of a continuous state-space Markov chain, let’s look at a simple example.

### Example: Sampling from a continuous distribution using continuous state-space Markov chains

We can use the stationary distribution of a continuous state-space Markov chain in order to sample from a continuous probability distribution: we run a Markov chain for a sufficient amount of time so that it has reached its stationary distribution, then keep the states that the chain visits as samples from that stationary distribution.

In the following example we define a continuous state-space Markov chain. The transition operator is a Normal distribution with unit variance and a mean that is half the distance between zero and the previous state, and the distribution over initial conditions is a Normal distribution with zero mean and unit variance.

To ensure that the chain has moved sufficiently far from the initial conditions and that we are sampling from the chain’s stationary distribution, we will choose to throw away the first 50 burn in states of the chain. We can also run multiple chains simultaneously in order to sample the stationary distribution more densely. Here we choose to run 5 chains simultaneously.

% EXAMPLE OF CONTINUOUS STATE-SPACE MARKOV CHAIN % INITIALIZE randn('seed',12345) nBurnin = 50; % # BURNIN nChains = 5; % # MARKOV CHAINS % DEFINE TRANSITION OPERATOR P = inline('normrnd(.5*x,1,1,nChains)','x','nChains'); nTransitions = 1000; x = zeros(nTransitions,nChains); x(1,:) = randn(1,nChains); % RUN THE CHAINS for iT = 2:nTransitions x(iT,:) = P(x(iT-1),nChains); end % DISPLAY BURNIN figure subplot(221); plot(x(1:100,:)); hold on; minn = min(x(:)); maxx = max(x(:)); l = line([nBurnin nBurnin],[minn maxx],'color','k','Linewidth',2); ylim([minn maxx]) legend(l,'~Burn-in','Location','SouthEast') title('First 100 Samples'); hold off % DISPLAY ENTIRE MARKOV CHAIN subplot(223); plot(x);hold on; l = line([nBurnin nBurnin],[minn maxx],'color','k','Linewidth',2); legend(l,'~Burn-in','Location','SouthEast') title('Entire Chain'); % DISPLAY SAMPLES FROM STATIONARY DISTRIBUTION samples = x(nBurnin+1:end,:); subplot(122); [counts,bins] = hist(samples(:),100); colormap hot b = bar(bins,counts); legend(b,sprintf('Markov Chain\nSamples')); title(['\mu=',num2str(mean(samples(:))),' \sigma=',num2str(var(samples(:)))])

In the upper left panel of the code output we see a close up of the first 100 of the 1000 transitions made by the 5 simultaneous Markov chains; the burn in cutoff is marked by the black line. In the lower left panel we see the entire sequence of transitions for the Markov chains. In the right panel, we can tell from the sampled states that the stationary distribution for this chain is a Normal distribution, with mean equal to zero, and a variance equal to 1.3.

## Wrapping Up

In the previous example we were able to deduce the stationary distribution of the Markov chain by looking at the samples generated from the chain after the burn in period. However, in order to use Markov chains to sample from a specific target distribution, we have to design the transition operator such that the resulting chain reaches a stationary distribution that matches the target distribution. This is where MCMC methods like the Metropolis sampler, the Metropolis-Hastings sampler, and the Gibbs sampler come to rescue. We will discuss each of these Markov-chain-based sampling methods separately in later posts.

## Monte Carlo Approximations

# Monte Carlo Approximation for Integration

Using statistical methods we often run into integrals that take the form:

For instance, the expected value of a some function of a random variable

and many quantities essential for Bayesian methods such as the marginal likelihood a.k.a “model evidence”

involve integrals of this form. Sometimes (not often) such an integral can be evaluated analytically. When a closed form solution does not exist, numeric integration methods can be applied. However numerical methods quickly become intractable for any practical application that requires more than a small number of dimensions. This is where Monte Carlo approximation comes in. Monte Carlo approximation allows us to calculate an estimate for the value of by transforming the integration problem into a procedure of sampling values from a tractable probability distribution and calculating the average of those samples. Here’s what I mean:

If the function fullfills two simple criteria, namely that the function is always positive on the interval

and that the integral of the function is finite

then we can define a corresponding probability distribution on the interval :

Another way to think of it is that is a probability distribution scaled by a constant .

Using this link between probability distributions and , we can restate the original integration as

This statement says that if we can sample values of using , then the value of the original integral is simply a scaled version of the expected value of the integrand function calculated using those samples. Turns out that the expected value can be easily approximated by the sample mean:

where samples are drawn independently from . This leads to a simple 4-Step Procedure for performing Monte Carlo approximation to the integral :

- Identify
- Identify and from it determine and
- Draw independent samples from
- Evaluate

The larger the number of samples we draw, the better our approximation to the actual value of . This 4-step procedure is demonstrated in a some toy examples below:

### Example 1: Approximating the integral

Say we want to calculate the integral:

We can calculate the closed form solution of this integral using integration by parts:

and

Orr…we could calculate the Monte Carlo approximation of this integral.

**Step 1** we identify

**Step 2** we identify

and from this can also determine the probability distribution function . According to the definition expression for given above we detemine to be:

.

**Step 3:** The expression on the right is the definition for the uniform distribution , which is easy to sample from using the MATLAB (Notice too that the constant ).

**Step 4:** we calculate the Monte Carlo approximation as

,

where each is sampled from the standard uniform distribution. Below is some MATLAB code running the Monte Carlo Approximation for two different values of

% MONTE CARLO APPROXIMATION OF INT(xexp(x))dx % FOR TWO DIFFERENT SAMPLE SIZES rand('seed',12345); % THE FIRST APPROXIMATION USING N1 = 100 SAMPLES N1 = 100; x = rand(N1,1); Ihat1 = sum(x.*exp(x))/N1 % A SECOND APPROXIMATION USING N2 = 5000 SAMPLES N2 = 5000; x = rand(N2,1); Ihat2 = sum(x.*exp(x))/N2

Comparing the values of the variables and we see that the Monte Carlo approximation is better for a larger number of samples.

### Example 2: Approximating the expected value of the Beta distribution

Lets look at how the 4-step Monte Carlo approximation procedure can be used to calculate expectations. In this example we will calculate

,

where .

**Step 1**: we identify .

**Step 2:** the function is simply the probability density function due the expression for above:

.

**Step 3** we can use MATLAB to easily draw independent samples using the function . And finally,

**Step 4** we approximate the expectation with the expression

Below is some MATLAB code that performs this approximation of the expected value.

rand('seed',12345); alpha1 = 2; alpha2 = 10; N = 10000; x = betarnd(alpha1,alpha2,1,N); % MONTE CARLO EXPECTATION expectMC = mean(x); % ANALYTIC EXPRESSION FOR BETA MEAN expectAnalytic = alpha1/(alpha1 + alpha2); % DISPLAY figure; bins = linspace(0,1,50); counts = histc(x,bins); probSampled = counts/sum(counts); probTheory = betapdf(bins,alpha1,alpha2); b = bar(bins,probSampled); colormap hot; hold on; t = plot(bins,probTheory/sum(probTheory),'r','Linewidth',2) m = plot([expectMC,expectMC],[0 .1],'g') e = plot([expectAnalytic,expectAnalytic],[0 .1],'b') xlim([expectAnalytic - .05,expectAnalytic + 0.05]) legend([b,t,m,e],{'Samples','Theory','$\hat{E}$','$E_{Theory}$'},'interpreter','latex'); title(['E_{MC} = ',num2str(expectMC),'; E_{Theory} = ',num2str(expectAnalytic)]) hold off

And the output of the code:

The analytical solution for the expected value of this Beta distribution:

is quite close to our approximation (also indicated by the small distance between the blue and green lines on the plot above).

## Monte Carlo Approximation for Optimization

Monte Carlo Approximation can also be used to solve optimization problems of the form:

If fulfills the same criteria described above (namely that it is a scaled version of a probability distribution), then (as above) we can define the probability function

This allows us to instead solve the problem

If we can sample from , the solution is easily found by drawing samples from and determining the location of the samples that has the highest density (Note that the solution is not dependent of the value of ). The following example demonstrates Monte Carlo optimization:

### Example: Monte Carlo Optimization of

Say we would like to find the value of which optimizes the function . In other words we want to solve the problem

We could solve for using standard calculus methods, but a clever trick is to use Monte Carlo approximation to solve the problem. First, notice that is a scaled version of a Normal distribution with mean equal to 4 and unit variance:

where . This means we can solve for by drawing samples from the normal distribution and determining where those samples have the highest density. The following chunk of matlab code solves the optimization problem in this way.

% MONTE CARLO OPTIMIZATION OF exp(x-4)^2 randn('seed',12345) % INITIALZIE N = 100000; x = 0:.1:6; C = sqrt(2*pi); g = inline('exp(-.5*(x-4).^2)','x'); ph = plot(x,g(x)/C,'r','Linewidth',3); hold on gh = plot(x,g(x),'b','Linewidth',2); hold on; y = normpdf(x,4,1); % CALCULATE MONTE CARLO APPROXIMATION x = normrnd(4,1,1,N); bins = linspace(min(x),max(x),100); counts = histc(x,bins); [~,optIdx] = max(counts); xHat = bins(optIdx); % OPTIMA AND ESTIMATED OPTIMA oh = plot([4 4],[0,1],'k'); hh = plot([xHat,xHat],[0,1],'g'); set(gca,'fontsize',16) legend([gh,ph,oh,hh],{'g(x)','$p(x)=\frac{g(x)}{C}$','$x_{opt}$','$\hat{x}$'},'interpreter','latex','Location','Northwest');

In the code output above we see the function we want to optimize in blue and the Normal distribution from which we draw samples in red. The Monte Carlo method provides a good approximation (green) to the real solution (black).

## Wrapping Up

In the toy examples above it was easy to sample from . However, for practical problems the distributions we want to sample from are often complex and operate in many dimensions. For these problems more clever sampling methods have to be used. Such sampling methods include Inverse Transform Sampling, Rejection Sampling, Importance Sampling, and Markov Chain Monte Carlo methods such as the Metropolis Hasting algorithm and the Gibbs sampler, each of which I plan to cover in separate posts.

## Sampling From the Normal Distribution Using the Box-Muller Transform

The Normal Distribution is the workhorse of many common statistical analyses and being able to draw samples from this distribution lies at the heart of many statistical/machine learning algorithms. There have been a number of methods developed to sample from the Normal distribution including Inverse Transform Sampling, the Ziggurat Algorithm, and the Ratio Method (a rejection sampling technique). In this post we will focus on an elegant method called the Box-Muller transform.

## A quick review of Cartesian and polar coordinates.

Before we can talk about using the Box-Muller transform, let’s refresh our understanding of the relationship between Cartesian and polar coordinates. You may remember from geometry that if x and y are two points in the Cartesian plane they can be represented in polar coordinates with a radius and an angle using the following relationships:

, and therefore

Notice that if and , then we map out values contained in the unit circle, as shown in the figure below. Also note that random variables in such a circle can be generated by transforming values sampled from the uniform distribution. Specifically, radii can be sampled from and angle can be sampled from . A similar mechanism (i.e. drawing points in a circle using uniform variables) is at the heart of the Box-Muller transform for sampling Normal random variables.

## Drawing Normally-distributed samples with the Box-Muller transform

Ok, now that we’ve discussed how Cartesian coordinates are represented in polar coordinates, let’s move on to how we can use this relationship to generate random variables. Box-Muller sampling is based on representing the joint distribution of two independent standard Normal random Cartesian variables and

in polar coordinates. The joint distribution (which is circular-symmetric) is:

If we notice that the term in the numerator of the exponent is equal to (as above) we can make the connection between the the Cartesian representation of the joint Normal distribution and its polar representation:

which is the product of two density functions, an exponential distribution over squared radii:

and a uniform distribution over angles:

just like those mentioned above when generating points on the unit circle. Now, if we make another connection between the exponential distribution and the uniform distribution, namely that:

then

This gives us a way to generate points from the joint Gaussian distribution by sampling from two independent uniform distributions, one for and another for , and transforming them into Cartesian coordinates via the relationships above. In detail, the procedure goes as follows:

- Draw,
- Transform the variables into radius and angle representation , and
- Transform radius and angle into Cartesian coordinates:

What results are two independent Normal random variables, and . A MATLAB implementation of the Box-Muller algorithm is shown below:

% NORMAL SAMPLES USING BOX-MUELLER METHOD % DRAW SAMPLES FROM PROPOSAL DISTRIBUTION u = rand(2,100000); r = sqrt(-2*log(u(1,:))); theta = 2*pi*u(2,:); x = r.*cos(theta); y = r.*sin(theta); % DISPLAY BOX-MULLER SAMPLES figure % X SAMPLES subplot(121); hist(x,100); colormap hot;axis square title(sprintf('Box-Muller Samples Y\n Mean = %1.2f\n Variance = %1.2f\n Kurtosis = %1.2f',mean(x),var(x),3-kurtosis(x))) xlim([-6 6]) % Y SAMPLES subplot(122); hist(y,100); colormap hot;axis square title(sprintf('Box-Muller Samples X\n Mean = %1.2f\n Variance = %1.2f\n Kurtosis = %1.2f',mean(y),var(y),3-kurtosis(y))) xlim([-6 6])

## Wrapping Up

The output of the MATLAB code is shown above. Notice the first, second, and fourth central moments (mean, variance, and kurtosis) of the generated samples are consistent with the standard normal. The Box-Muller transform is another example of of how uniform variables on the interval (0,1) and can be transformed in order to sample from a more complicated distribution.

## Rejection Sampling

Suppose that we want to sample from a distribution that is difficult or impossible to sample from directly, but instead have a simpler distribution from which sampling is easy. The idea behind Rejection sampling (aka Acceptance-rejection sampling) is to sample from and apply some rejection/acceptance criterion such that the samples that are accepted are distributed according to .

## Envelope distribution and rejection criterion

In order to be able to reject samples from such that they are sampled from , must “cover” or envelop the distribution . This is generally done by choosing a constant such that for all . For this reason is often called the *envelope distribution*. A common criterion for accepting samples from is based on the ratio of the target distribution to that of the envelope distribution. The samples are accepted if

where , and rejected otherwise. If the ratio is close to one, then must have a large amount of probability mass around and that sample should be more likely accepted. If the ratio is small, then it means that has low probability mass around and we should be less likely to accept the sample. This criterion is demonstrated in the chunk of MATLAB code and the resulting figure below:

rand('seed',12345); x = -10:.1:10; % CREATE A "COMPLEX DISTRIBUTION" f(x) AS A MIXTURE OF TWO NORMAL % DISTRIBUTIONS f = inline('normpdf(x,3,2) + normpdf(x,-5,1)','x'); t = plot(x,f(x),'b','linewidth',2); hold on; % PROPOSAL IS A CENTERED NORMAL DISTRIBUTION q = inline('normpdf(x,0,4)','x'); % DETERMINE SCALING CONSTANT c = max(f(x)./q(x)) %PLOT SCALED PROPOSAL/ENVELOP DISTRIBUTION p = plot(x,c*q(x),'k--'); % DRAW A SAMPLE FROM q(x); qx = normrnd(0,4); fx = f(qx); % PLOT THE RATIO OF f(q(x)) to cq(x) a = plot([qx,qx],[0 fx],'g','Linewidth',2); r = plot([qx,qx],[fx,c*q(qx)],'r','Linewidth',2); legend([t,p,a,r],{'Target','Proposal','Accept','Reject'}); xlabel('x');

Here a zero-mean Normal distribution is used as the proposal distribution. This distribution is scaled by a factor , determined from and to ensure that the proposal distribution covers . We then sample from , and compare the proportion of occupied by . If we compare this proportion to a random number sampled from (i.e. the criterion outlined above), then we would accept this sample with probability proportional to the length of the green line segment and reject the sample with probability proportional to the length of the red line segment.

## Rejection sampling of a random discrete distribution

This next example shows how rejection sampling can be used to sample from any arbitrary distribution, continuous or not, and with or without an analytic probability density function.

The figure above shows a random ** discrete** probability density function generated on the interval (0,15). We will use rejection sampling as described above to sample from . Our proposal/envelope distribution is the uniform discrete distribution on the same interval (i.e. any of the integers from 1-15 are equally probable) multiplied by a constant that is determined such that the maximum value of lies under (or equal to) .

rand('seed',12345) randn('seed',12345) fLength = 15; % CREATE A RANDOM DISTRIBUTION ON THE INTERVAL [1 fLength] f = rand(1,fLength); f = f/sum(f); figure; h = plot(f,'r','Linewidth',2); hold on; l = plot([1 fLength],[max(f) max(f)],'k','Linewidth',2); legend([h,l],{'f(x)','q(x)'},'Location','Southwest'); xlim([0 fLength + 1]) xlabel('x'); ylabel('p(x)'); title('Target (f(x)) and Proposal (q(x)) Distributions'); % OUR PROPOSAL IS THE DISCRETE UNIFORM ON THE INTERVAL [1 fLength] % SO OUR CONSTANT IS c = max(f/(1/fLength)); nSamples = 10000; i = 1; while i < nSamples proposal = unidrnd(fLength); q = c*1/fLength; % ENVELOPE DISTRIBUTION if rand < f(proposal)/q samps(i) = proposal; i = i + 1; end end % DISPLAY THE SAMPLES AND COMPARE TO THE TARGET DISTRIBUTION bins = 1:fLength; counts = histc(samps,bins); figure b = bar(1:fLength,counts/sum(counts),'FaceColor',[.8 .8 .8]) hold on; h = plot(f,'r','Linewidth',2) legend([h,b],{'f(x)','samples'}); xlabel('x'); ylabel('p(x)'); xlim([0 fLength + 1]);

## Rejection sampling from the unit circle to estimate

Though the ratio-based acceptance-rejection criterion introduced above is a common choice for drawing samples from complex distributions, it is not the only criterion we could use. For instance we could use a different set of criteria to generate some geometrically-bounded distribution. If we wanted to generate points uniformly within the unit circle (i.e. a circle centered at and with radius ), we could do so by sampling Cartesian spatial coordinates and uniformly from the interval (-1,1)–which samples form a square centered at (0,0)–and reject those points that lie outside of the radius

and the unit circle has the area:

We can use the ratio of their areas to approximate :

The figure below shows the rejection sampling process and the resulting estimate of from the samples. One-hundred thousand 2D points are sampled uniformly from the interval (-1,1). Those points that lie within the unit circle are plotted as blue dots. Those points that lie outside of the unit circle are plotted as red x’s. If we take four times the ratio of the area in blue to the entire area, we get a very close approximation to 3.14 for .

The MATLAB code used to generate the example figures is below:

% DISPLAY A CIRCLE INSCRIBED IN A SQUARE figure; a = 0:.01:2*pi; x = cos(a); y = sin(a); hold on plot(x,y,'k','Linewidth',2) t = text(0.5, 0.05,'r'); l = line([0 1],[0 0],'Linewidth',2); axis equal box on xlim([-1 1]) ylim([-1 1]) title('Unit Circle Inscribed in a Square') pause; rand('seed',12345) randn('seed',12345) delete(l); delete(t); % DRAW SAMPLES FROM PROPOSAL DISTRIBUTION samples = 2*rand(2,100000) - 1; % REJECTION reject = sum(samples.^2) > 1; % DISPLAY REJECTION CRITERION scatter(samples(1,~reject),samples(2,~reject),'b.') scatter(samples(1,reject),samples(2,reject),'rx') hold off xlim([-1 1]) ylim([-1 1]) piHat = mean(sum(samples.*samples)<1)*4; title(['Estimate of \pi = ',num2str(piHat)]);

## Wrapping Up

Rejection sampling is a simple way to generate samples from complex distributions. However, Rejection sampling also has a number of weaknesses:

- Finding a proposal distribution that can cover the support of the target distribution is a non-trivial task.
- Additionally, as the dimensionality of the target distribution increases, the proportion of points that are rejected also increases. This curse of dimensionality makes rejection sampling an inefficient technique for sampling multi-dimensional distributions, as the majority of the points proposed are not accepted as valid samples.
- Some of these problems are solved by changing the form of the proposal distribution to “hug” the target distribution as we gain knowledge of the target from observing accepted samples. Such a process is called
*Adaptive Rejection Sampling*, which will be covered in another post.

## Inverse Transform Sampling

There are a number of sampling methods used in machine learning, each of which has various strengths and/or weaknesses depending on the nature of the sampling task at hand. One simple method for generating samples from distributions with closed-form descriptions is Inverse Transform (IT) Sampling.

The idea behind IT Sampling is that the probability mass for a random variable distributed according to the probability density function integrates to one and therefore the cumulative distribution function can be used to map from values in the interval (i.e. probabilities) to the domain of . Because it is easy to sample values uniformly from the interval , we can use the inverse of the CDF to transform these sampled probabilities into samples . The code below demonstrates this process at work in order to sample from a student’s t distribution with 10 degrees of freedom.

rand('seed',12345) % DEGREES OF FREEDOM dF = 10; x = -3:.1:3; Cx = cdf('t',x,dF) z = rand; % COMPARE VALUES OF zIdx = min(find(Cx>z)); % DRAW SAMPLE sample = x(zIdx); % DISPLAY figure; hold on plot(x,Cx,'k','Linewidth',2); plot([x(1),x(zIdx)],[Cx(zIdx),Cx(zIdx)],'r','LineWidth',2); plot([x(zIdx),x(zIdx)],[Cx(zIdx),0],'b','LineWidth',2); plot(x(zIdx),z,'ko','LineWidth',2); text(x(1)+.1,z + .05,'z','Color','r') text(x(zIdx)+.05,.05,'x_{sampled}','Color','b') ylabel('C(x)') xlabel('x') hold off

However, the scheme used to create to plot above is inefficient in that one must compare current values of with the for all values of . A much more efficient method is to evaluate directly:

- Derive (or a good approximation) from
- for

- – draw from
- –
- – end for

The IT sampling process is demonstrated in the next chunk of code to sample from the Beta distribution, a distribution for which is easy to approximate using Netwon’s method (which we let MATLAB do for us within the function icdf.m)

rand('seed',12345) nSamples = 1000; % BETA PARAMETERS alpha = 2; beta = 10; % DRAW PROPOSAL SAMPLES z = rand(1,nSamples); % EVALUATE PROPOSAL SAMPLES AT INVERSE CDF samples = icdf('beta',z,alpha,beta); bins = linspace(0,1,50); counts = histc(samples,bins); probSampled = counts/sum(counts) probTheory = betapdf(bins,alpha,beta); % DISPLAY b = bar(bins,probSampled,'FaceColor',[.9 .9 .9]); hold on; t = plot(bins,probTheory/sum(probTheory),'r','LineWidth',2); xlim([0 1]) xlabel('x') ylabel('p(x)') legend([t,b],{'Theory','IT Samples'}) hold off

## Wrapping Up

The IT sampling method is generally only used for univariate distributions where can be computed in closed form, or approximated. However, it is a nice example of how uniform random variables can be used to sample from much more complicated distributions.

## Derivation: Ordinary Least Squares Solution and Normal Equations

In a linear regression framework, we assume some output variable is a linear combination of some independent input variables plus some independent noise . The way the independent variables are combined is defined by a parameter vector :

We also assume that the noise term is drawn from a standard Normal distribution:

For some estimate of the model parameters , the model’s prediction errors/residuals are the difference between the model prediction and the observed ouput values

The Ordinary Least Squares (OLS) solution to the problem (i.e. determining an optimal solution for ) involves minimizing the sum of the squared errors with respect to the model parameters, . The sum of squared errors is equal to the inner product of the residuals vector with itself :

To determine the parameters, , we minimize the sum of squared residuals with respect to the parameters.

due to the identity , for vectors and . This relationship is matrix form of the Normal Equations. Solving for gives the analytical solution to the Ordinary Least Squares problem.

Boom.