A Brief Introduction to Markov Chains

Markov chains are an essential component of Markov chain Monte Carlo (MCMC) techniques. Under MCMC, the Markov chain is used to sample from some target distribution. To get a better understanding of what a Markov chain is, and further, how it can be used to sample from a distribution, this post introduces and applies a few basic concepts.

A Markov chain is a stochastic process that operates sequentially (e.g. temporally), transitioning from one state to another within an allowed set of states.

x^{(0)} \rightarrow x^{(1)} \rightarrow x^{(2)} \dots \rightarrow x^{(t)} \rightarrow \dots

A Markov chain is defined by three elements:

  1. A state space x, which is the set of values that the chain is allowed to take
  2. A transition operator p(x^{(t+1)} | x^{(t)}) that defines the probability of moving from state x^{(t)} to state x^{(t+1)}.
  3. An initial condition distribution \pi^{(0)}, which defines the probability of being in any one of the possible states at the initial iteration t = 0.

The Markov chain starts at some initial state, which is sampled from \pi^{(0)}, then transitions from one state to another according to the transition operator p(x^{(t+1)} | x^{(t)}).

A Markov chain is memoryless: the next state depends only on the current state, and not on any of the states that preceded it:

p(x^{(t+1)}|x^{(t)},x^{(t-1)},...x^{(0)}) = p(x^{(t+1)}|x^{(t)})

(This memoryless property is formally known as the Markov property.)

If the transition operator for a Markov chain does not change across transitions, the Markov chain is called time homogeneous. A nice property of time homogeneous Markov chains (provided they are well behaved, i.e. irreducible and aperiodic) is that as the chain runs for a long time and t \rightarrow \infty, the chain will reach an equilibrium that is called the chain’s stationary distribution: the probability of being in any given state no longer changes from one transition to the next,

p(x^{(t+1)}) = p(x^{(t)})

We’ll see later how the stationary distribution of a Markov chain is important for sampling from probability distributions, a technique that is at the heart of Markov Chain Monte Carlo (MCMC) methods.

Finite state-space (time homogeneous) Markov chains

If the state space of a Markov chain takes on a finite number of distinct values, and the chain is time homogeneous, then the transition operator can be defined by a matrix P, where the entries of P are:

p_{ij} = p(X^{(t+1)} = j | X^{(t)} = i)

This means that if the chain is currently in the i-th state, the transition operator assigns the probabilities of moving to each of the other states according to the entries of the i-th row of P (i.e. each row of P defines a conditional probability distribution on the state space, and therefore sums to one). Let’s take a look at a finite state-space Markov chain in action with a simple example.

Example: Predicting the weather with a finite state-space Markov chain

In Berkeley, CA, there are (literally) only 3 types of weather: sunny, foggy, or rainy (this is analogous to a state-space that takes on three discrete values). The weather patterns are very stable there, so a Berkeley weatherman can easily predict the weather next week based on the weather today with the following transition rules:

If it is sunny today, then

  • it is highly likely that it will be sunny next week
    • p(X^{(week)}=sunny | X^{(today)}=sunny)=0.8
  • it is very unlikely that it will be raining next week
    • p(X^{(week)}=rainy | X^{(today)}=sunny)=0.05
  • and somewhat likely that it will be foggy next week
    • p(X^{(week)}=foggy | X^{(today)}=sunny)=0.15

If it is foggy today, then

  • it is somewhat likely that it will be sunny next week
    • p(X^{(week)}=sunny | X^{(today)}=foggy)=0.4
  • but slightly more likely that it will be foggy next week
    • p(X^{(week)}=foggy | X^{(today)}=foggy)=0.5
  • and fairly unlikely that it will be raining next week
    • p(X^{(week)}=rainy | X^{(today)}=foggy)=0.1

If it is rainy today, then

  • it is unlikely that it will be sunny next week
    • p(X^{(week)}=sunny | X^{(today)}=rainy)=0.1
  • it is somewhat likely that it will be foggy next week
    • p(X^{(week)}=foggy | X^{(today)}=rainy)=0.3
  • and it is fairly likely that it will be rainy next week
    • p(X^{(week)}=rainy | X^{(today)}=rainy)=0.6

All of these transition rules can be instantiated in a single 3 x 3 transition operator matrix:

P = \begin{bmatrix} 0.8 & 0.15 & 0.05\\ 0.4 & 0.5 & 0.1\\ 0.1& 0.3 & 0.6 \end{bmatrix}

where each row of P corresponds to the weather today (iteration t), and each column corresponds to the weather next week (iteration t+1). Let’s say that it is rainy today: what is the probability that it will be sunny next week, in two weeks, or in 6 months? We can answer these questions by running a Markov chain from the initial state “rainy,” transitioning according to P. The following chunk of MATLAB code runs the Markov chain.


%     S  F  R
%     U  O  A
%     N  G  I
%     N  G  N
%     Y  Y  Y
P = [.8 .15 .05;  % SUNNY
     .4 .5  .1;   % FOGGY
     .1 .3  .6];  % RAINY
nWeeks = 25;

% INITIAL STATE DISTRIBUTION: IT IS RAINY TODAY
X(1,:) = [0 0 1];

% RUN THE MARKOV CHAIN FORWARD
for iB = 2:nWeeks
    X(iB,:) = X(iB-1,:)*P; % TRANSITION
end

figure; hold on
h(1) = plot(1:nWeeks,X(:,1),'r','Linewidth',2);
h(2) = plot(1:nWeeks,X(:,2),'k','Linewidth',2);
h(3) = plot(1:nWeeks,X(:,3),'b','Linewidth',2);
h(4) = plot([15 15],[0 1],'g--','Linewidth',2);
hold off
legend(h, {'Sunny','Foggy','Rainy','Burn In'});
ylim([0 1]);

fprintf('\np(weather) in 1 week -->'), disp(X(2,:))
fprintf('\np(weather) in 2 weeks -->'), disp(X(3,:))
fprintf('\np(weather) in 6 months -->'), disp(X(25,:))

Finite state-space Markov chain for predicting the weather

Here we see that one week from today the probability of sunny weather is 0.1. The following week, the probability of sunny weather is 0.26, and in 6 months there is about a 60% chance that it will be sunny. Also note that after approximately 15 weeks the Markov chain has reached the equilibrium/stationary distribution and, chances are, the weather will be sunny. This 15-week period is what is known as the burn in period for the Markov chain: the number of transitions it takes the chain to move from the initial conditions to the stationary distribution.
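
Note that the code above propagates the entire probability distribution over states from week to week. If we instead want to simulate an actual sequence of weather states, we can draw each next state from the conditional distribution given by the current row of P. Here is a minimal sketch of how that might look (this snippet is not part of the original post, and the variable names are my own):

% A MINIMAL SKETCH: SAMPLE AN ACTUAL SEQUENCE OF WEATHER STATES
% STATES: 1 = SUNNY, 2 = FOGGY, 3 = RAINY
P = [.8 .15 .05;
     .4 .5  .1;
     .1 .3  .6];
nWeeks = 25;
state = zeros(1,nWeeks);
state(1) = 3;                           % START IN THE RAINY STATE
for iW = 2:nWeeks
    rowCdf = cumsum(P(state(iW-1),:));  % CDF OF THE CURRENT ROW OF P
    state(iW) = find(rand <= rowCdf,1); % SAMPLE THE NEXT STATE
end
disp(state)

Tallying how often each state is visited after the burn in period, across many such simulated sequences, gives another way to approximate the stationary distribution.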

A cool thing about finite state-space, time-homogeneous Markov chains is that it is not necessary to run the chain sequentially through all iterations in order to predict a state in the future. Instead, we can predict the state distribution at iteration T by raising the transition operator matrix to the T-th power, then left-multiplying the result by the distribution over the initial state, \pi^{(0)}. For instance, to predict the probability of the weather in 2 weeks, knowing that it is rainy today (i.e. \pi^{(0)} = [0,0,1]):

p(x^{(week2)}) = \pi^{(0)}P^2 = [0.26, 0.345, 0.395]

and in six months:

p(x^{(week24)}) = \pi^{(0)}P^{24} = [0.596, 0.263, 0.140]

These are the same results we get by running the Markov chain sequentially through each transition. Therefore we can calculate an approximation to the stationary distribution from P by setting T to a large number. It turns out that it is also possible to analytically derive the stationary distribution from P (hint: think about the properties of eigenvectors).
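
To illustrate both points, here is a short sketch (not part of the original post; the variable names are my own) that computes the predictions via matrix powers and recovers the stationary distribution analytically: since the stationary distribution \pi satisfies \pi P = \pi, it is the left eigenvector of P associated with eigenvalue 1 (equivalently, an eigenvector of P'), normalized to sum to one.

P = [.8 .15 .05;
     .4 .5  .1;
     .1 .3  .6];
pi0 = [0 0 1];               % IT IS RAINY TODAY

% PREDICT BY RAISING THE TRANSITION MATRIX TO A POWER
disp(pi0*P^2)                % WEATHER IN 2 WEEKS:   ~[0.260 0.345 0.395]
disp(pi0*P^24)               % WEATHER IN ~6 MONTHS: ~[0.596 0.263 0.140]

% STATIONARY DISTRIBUTION FROM THE EIGENDECOMPOSITION OF P'
[V,D] = eig(P');
[~,idx] = max(diag(D));      % EIGENVALUE EQUAL TO 1
piStationary = V(:,idx)'/sum(V(:,idx));
disp(piStationary)           % ~[0.596 0.263 0.140]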

Continuous state-space Markov chains

A Markov chain can also have a continuous state space that exists in the real numbers, x \in \mathbb{R}^N. In this case the transition operator cannot be instantiated simply as a matrix, but is instead a conditional probability density function over a continuous set of states. Note that the continuous state-space Markov chain also has a burn in period and a stationary distribution. However, the stationary distribution will also be over a continuous set of variables. To get a better understanding of the workings of a continuous state-space Markov chain, let’s look at a simple example.

Example: Sampling from a continuous distribution using continuous state-space Markov chains

We can use the stationary distribution of a continuous state-space Markov chain in order to sample from a continuous probability distribution: we  run a Markov chain for a sufficient amount of time so that it has reached its stationary distribution, then keep the states that the chain visits as samples from that stationary distribution.

In the following example we define a continuous state-space Markov chain. The transition operator is a Normal distribution with unit variance and a mean that is half the value of the previous state, i.e. p(x^{(t+1)} | x^{(t)}) = \text{Normal}(0.5 x^{(t)}, 1), and the distribution over initial conditions is a Normal distribution with zero mean and unit variance.

To ensure that the chain has moved sufficiently far from the initial conditions and that we are sampling from the chain’s stationary distribution, we will throw away the first 50 burn in states of the chain. We can also run multiple chains simultaneously in order to sample the stationary distribution more densely. Here we choose to run 5 chains simultaneously.


nBurnin = 50; % # BURNIN TRANSITIONS TO DISCARD
nChains = 5;  % # MARKOV CHAINS RUN IN PARALLEL

% TRANSITION OPERATOR: A NORMAL CENTERED AT HALF THE PREVIOUS STATE
P = @(x) normrnd(0.5*x, 1);
nTransitions = 1000;
x = zeros(nTransitions,nChains);
x(1,:) = randn(1,nChains);     % INITIAL STATES DRAWN FROM N(0,1)

for iT = 2:nTransitions
    x(iT,:) = P(x(iT-1,:));    % TRANSITION EACH CHAIN
end

figure
subplot(221); plot(x(1:100,:)); hold on;
minn = min(x(:));
maxx = max(x(:));
l = line([nBurnin nBurnin],[minn maxx],'color','k','Linewidth',2);
ylim([minn maxx])
title('First 100 Samples'); hold off

subplot(223); plot(x); hold on;
l = line([nBurnin nBurnin],[minn maxx],'color','k','Linewidth',2);
title('Entire Chain'); hold off

% HISTOGRAM OF POST-BURNIN STATES (RIGHT PANEL)
subplot(122);
samples = x(nBurnin+1:end,:);
[counts,bins] = hist(samples(:),100); colormap hot
b = bar(bins,counts);
legend(b,sprintf('Markov Chain\nSamples'));
title(['\mu=',num2str(mean(samples(:))),' \sigma^2=',num2str(var(samples(:)))])

Sampling from the stationary distribution of a continuous state-space Markov chain

In the upper left panel of the code output we see a close-up of the first 100 of the 1000 transitions made by the 5 simultaneous Markov chains; the burn in cutoff is marked by the black line. In the lower left panel we see the entire sequence of transitions for the Markov chains. In the right panel, we can tell from the sampled states that the stationary distribution for this chain is a Normal distribution with mean equal to zero and variance approximately equal to 1.3.
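
These values match what we would expect analytically (a quick check that is not part of the original post). Writing the transition operator as x^{(t+1)} = 0.5 x^{(t)} + \epsilon with \epsilon \sim \text{Normal}(0,1), the stationary mean must satisfy \mu = 0.5\mu, so \mu = 0, and the stationary variance must satisfy

\sigma^2 = 0.25\sigma^2 + 1 \Rightarrow \sigma^2 = \frac{4}{3} \approx 1.33

which is why the histogram of post-burn in samples is centered at zero with variance close to 1.3.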

Wrapping Up

In the previous example we were able to deduce the stationary distribution of the Markov chain by looking at the samples generated from the chain after the burn in period. However, in order to use Markov chains to sample from a specific target distribution, we have to design the transition operator such that the resulting chain reaches a stationary distribution that matches the target distribution. This is where MCMC methods like the Metropolis sampler, the Metropolis-Hastings sampler, and the Gibbs sampler come to the rescue. We will discuss each of these Markov-chain-based sampling methods separately in later posts.

On notation:

  • Here we use the shorthand notation p(x) to correspond to p(X = x), for some random variable X
  • A superscript in parentheses is an index into an iteration or point in time, not a power

