Category Archives: Neuroscience

A Gentle Introduction to Artificial Neural Networks

Introduction

Though many phenomena in the world can be adequately modeled using linear regression or classification, most interesting phenomena are generally nonlinear in nature. In order to deal with nonlinear phenomena, there have been a diversity of nonlinear models developed. For example parametric models assume that data follow some parameteric class of nonlinear function (e.g. polynomial, power, or exponential), then fine-tune the shape of the parametric function to fit observed data. However this approach is only helpful if data are fit nicely by the available catalog of parametric functions. Another approach, kernel-based methods, transforms data non-linearly into an abstract space that measures distances between observations, then predicts new values or classes based on these distances. However, kernel methods generally involve constructing a kernel matrix that depends on the number of training observations and can thus be prohibitive for large data sets. Another class of models, the ones that are the focus of this post, are artificial neural networks (ANNs). ANNs are nonlinear models motivated by the physiological architecture of the nervous system. They involve a cascade of simple nonlinear computations that when aggregated can implement robust and complex nonlinear functions. In fact, depending on how they are constructed, ANNs can approximate any nonlinear function, making them a quite powerful class of models (note that this property is not reserved for ANNs; kernel methods are also considered “universal approximators”; however, it turns out that neural networks with multiple layers are more efficient at approximating arbitrary functions than other methods. I refer the interested reader to more in-depth discussion on the topic.).

In recent years ANNs that use multiple stages of nonlinear computation (aka “deep learning”)  have been able obtain outstanding performance on an array of complex tasks ranging from visual object recognition to natural language processing. I find ANNs super interesting due to their computational power and their intersection with computational neuroscience.  However, I’ve found that most of the available tutorials on ANNs are either dense with formal details and contain little information about implementation or any examples, while others skip a lot of the mathematical detail and provide implementations that seem to come from thin air.  This post aims at giving a more complete overview of ANNs, including (varying degrees of) the math behind ANNs, how ANNs are implemented in code, and finally some toy examples that point out the strengths and weaknesses of ANNs.

Single-layer Neural Networks

The simplest ANN (Figure 1) takes a set of observed inputs \mathbf{a}=(a_1, a_2, ..., a_N), multiplies each of them by their own associated weight \mathbf{w} = (w_1, w_2, ...w_N) , and sums the weighted values to form a pre-activation z.Oftentimes there is also a bias b that is tied to an input that is always +1 included in the preactivation calculation. The network then transforms the pre-activation using a nonlinear activation function g(z) to output a final activation a_{\text{out}}.

Figure 1: Diagram of a single-layered artificial neural network.

Figure 1: Diagram of a single-layered artificial neural network.

There are many options available for the form of the activation function g(z), and the choice generally depends on the task we would like the network to perform. For instance, if the activation function is the identity function:

\Large{\begin{array}{rcl}g_{\text{linear}}(z) = z\end{array}},

which outputs continuous values a_{linear}\in (-\infty, \infty), then the network implements a linear model akin to used in standard linear regression. Another choice for the activation function is the logistic sigmoid:

\Large{ \begin{array}{rcl}g_{\text{logistic}}(z) = \frac{1}{1+e^{-z}}\end{array}},

which outputs values a_{logistic} \in (0,1). When the network outputs use the logistic sigmoid activation function, the network implements linear binary classification. Binary classification can also be implemented using the hyperbolic tangent function,  \text{tanh}(z), which outputs values a_{\text{tanh}}\in (-1, 1) (note that the classes must also be coded as either -1 or 1 when using \text{tanh}. Single-layered neural networks used for classification are often referred to as “perceptrons,” a name given to them when they were first developed in the late 1950s.

Figure 2: Common activation functions functions used in artificial neural, along with their derivatives

Figure 2: Common activation functions functions used in artificial neural, along with their derivatives

To get a better idea of what these activation function do, their outputs for a given range of input values are plotted in the left of Figure 2. We see that the logistic and tanh activation functions (blue and green) have the quintessential sigmoidal “s” shape that saturates for inputs of large magnitude. This behavior makes them useful for categorization. The identity / linear activation (red), however forms a linear mapping between the input to the activation function, which makes it useful for predicting continuous values.

A key property of these activation functions is that they are all smooth and differentiable. We’ll see later in this post why differentiability is important for training neural networks. The derivatives for each of these common activation functions are given by (for mathematical details on calculating these derivatives,  see this post):

\Large{\begin{array}{rcl} g'_{\text{linear}}(z) &=& 1 \\  g'_{\text{logistic}}(z) &=& g_{\text{logistic}}(z)(1- g_{\text{logistic}}(z)) \\  g'_{\text{tanh}}(z) &=& 1 - g_{\text{tanh}}^2(z) \\  \end{array}}

Each of the derivatives are plotted in the right of Figure 2. What is interesting about these derivatives is that they are either a constant (i.e. 1), or are can be defined in terms of the original function. This makes them extremely convenient for efficiently training neural networks, as we can implement the gradient using simple manipulations of the feed-forward states of the network.

Code Block 1: Defines standard activation functions and generates Figure 2:

% DEFINE A FEW COMMON ACTIVATION FUNCTIONS
gLinear = inline('z','z');
gSigmoid = inline('1./(1+exp(-z))','z');
gTanh = inline('tanh(z)','z');

% ...DEFINE THEIR DERIVATIVES
gPrimeLinear = inline('ones(size(z))','z');
gPrimeSigmoid = inline('1./(1+exp(-z)).*(1-1./(1+exp(-z)))','z');
gPrimeTanh = inline('1-tanh(z).^2','z');

% VISUALIZE EACH g(z)
z = linspace(-4,4,100);
figure
set(gcf,'Position',[100,100,960,420])
subplot(121);  hold on;
h(1) = plot(z,gLinear(z),'r','Linewidth',2);
h(2) = plot(z,gSigmoid(z),'b','Linewidth',2);
h(3) = plot(z,gTanh(z),'g','Linewidth',2);
set(gca,'fontsize',16)
xlabel('z')
legend(h,{'g_{linear}(z)','g_{logistic}(z)','g_{tanh}(z)'},'Location','Southeast')
title('Some Common Activation Functions')
hold off, axis square, grid
ylim([-1.1 1.1])

% VISUALIZE EACH g'(z)
subplot(122); hold on
h(1) = plot(z,gPrimeLinear(z),'r','Linewidth',2);
h(2) = plot(z,gPrimeSigmoid(z),'b','Linewidth',2);
h(3) = plot(z,gPrimeTanh(z),'g','Linewidth',2);
set(gca,'fontsize',16)
xlabel('z')
legend(h,{'g''_{linear}(z)','g''_{logistic}(z)','g''_{tanh}(z)'},'Location','South')
title('Activation Function Derivatives')
hold off, axis square, grid
ylim([-.5 1.1])

 

Multi-layer Neural Networks

As was mentioned above, single-layered networks implement linear models, which doesn’t really help us if we want to model nonlinear phenomena. However, by considering the single layer network  diagrammed in Figure 1 to be a basic building block, we can construct more complicated networks, ones that perform powerful, nonlinear computations. Figure 3 demonstrates this concept. Instead of a single layer of weights between inputs and output, we introduce a set of  single-layer networks between the two. This set of intermediate networks is often referred to as a “hidden” layer, as it doesn’t directly observe input or directly compute the output. By using a hidden layer, we form a mult-layered ANN. Though there are many different conventions for declaring the actual number of layers in a multi-layer network, for this discussion we will use the convention of the number of distinct sets of trainable weights as the number of layers. For example, the network in Figure 3 would be considered a 2-layer ANN because it has two layers of weights: those connecting the inputs to the hidden layer (w_{ij}), and those connecting the output of the hidden layer to the output layer(w_{jk}).

Diagram of a mult-layer neural network. Each neuron in the networ.k can be considered a perceptron

Figure 3: Diagram of a multi-layer ANN. Each node in the network can be considered a single-layered ANN (for simplicity, biases are not visualized in graphical model)

Multi-layer neural networks form compositional functions that map the inputs nonlinearly to outputs. If we associate index i with the input layer, index j with the hidden layer, and index k with the output layer, then an output unit in the network diagrammed in Figure 3 computes an output value a_k given and input a_i via the following compositional function:

\Large{ \begin{array}{rcl}a_{\text{out}} = a_k = g_k(b_k + \sum_jg_j(b_j + \sum_i a_i w_{ij})w_{jk}\end{array}}.

Here z_l is the  is the pre-activation values for units for layer l, g_l() is the activation function for units  in that layer (assuming they are the same), and a_l = g_l(z_l) is the output activation for units in that layer. The weight w_{l-1, l} links the outputs of units feeding into layer l to the activation function of units for that layer. The term b_l is the bias for units in layer l.

As with the single-layered ANN, the choice of activation function for the output layer will depend on the task that we would like the network to perform (i.e. categorization or regression), and follows similar rules outlined above. However, it is generally desirable for the hidden units to have nonlinear activation functions (e.g. logistic sigmoid or tanh). This is because multiple layers of linear computations can be equally formulated as a single layer of linear computations. Thus using linear activations for the hidden layers doesn’t buy us much. However, as we’ll see shortly, using linear activations for the output unit activation function (in conjunction with nonlinear activations for the hidden units) allows the network to perform nonlinear regression.

Training neural networks & gradient descent

Training neural networks involves determining the network parameters that minimize the errors that the network makes. This first requires that we have a way of quantifying error. A standard way of quantifying error is to take the squared difference between the network output and the target value:

\Large{\begin{array}{rcl}E &=& \frac{1}{2}(\text{output} - \text{target})^2\end{array}}

(Note that the squared error is not chosen arbitrarily, but has a number of theoretical benefits and considerations. For more detail, see the following post) With an error function in hand, we then aim to find the setting of parameters that minimizes this error function. This concept can be interpreted spatially by imagining a “parameter space” whose dimensions are the values of each of the model parameters, and for which the error function will form a surface of varying height depending on its value for each parameter. Model training is thus equivalent to finding point in parameter space that makes the height of the error surface small.

To get a better intuition behind this concept, let’s define a super simple neural network, one that has a single input and a single output (Figure 4, bottom left). For further simplicity, we’ll assume the network has no bias term and thus has a single parameter, w_1. We will also assume that the output layer uses the logistic sigmoid activation function. Accordingly, the network will map some input value a_0 onto a predicted output a_{\text{out}} via the following function.

\Large{\begin{array}{rcl}a_{\text{out}} = g_{\text{logistic}}(a_0w_1)\end{array}}

Now let’s say we want this simple network to learn the identity function: given an input of 1 it should return a target value of 1. Given this target value we can now calculate the value of the error function for each setting of w_1. Varying the value of w_1 from -10 to 10 results in the error surface displayed in the left of Figure 4.  We see that the error is small for large positive values of w_1, while the error is large for strongly negative values of w_1. This not surprising, given that the output activation function is the logistic sigmoid, which will map large values onto an output of 1.

Things become more interesting when we move from a single-layered network to a multi-layered network. Let’s repeat the above exercise, but include a single hidden node between the input and the output (Figure 4, bottom right). Again, we will assume no biases, and logistic sigmoid activations for both the hidden and output nodes. Thus the network will have two parameters: \Large{(w_1, w_2)}. Accordingly the 2-layered network will predict an output with the  following function:

\Large{\begin{array}{rcl}a_{\text{out}} = g_{\text{logistic}}(g_{\text{logistic}}(a_0w_1)w_2)\end{array}}

Now, if we vary both w_1 and w_2, we obtain the error surface in the right of Figure 4.

Figure 4: Error surface for a simple, single-layer neural network (left) and a 2-layer network (right). The goal is to map the input value 1 to the output value 1.

Figure 4: Error surface for a simple, single-layer neural network (left) and a 2-layer network (right). The goal is to map the input value 1 to the output value 1.

 

We see that the error function is minimized when both w_1 and w_2 are large and positive. We also see that the error surface is more complex than for the single-layered model, exhibiting a number of wide plateau regions. It turns out that the error surface gets more and more complicated as you increase the number of layers in the network and the number of units in each hidden layer. Thus, it is important to consider these phenomena when constructing neural network models.

Code Block 2: generates Figure 4 (assumes you have run Code Block 1):

% VISUALIZE ERROR SURFACE OF SIMPLE ANNS
E = {};
[w1,w2] = meshgrid(linspace(-10,10,50));
g = gSigmoid;
target = 1;
net1Output = g(w1.*target);
net2Output = g(w2.*g(w1.*target));
E{1} = (net1Output - target).^2;
E{2} = (net2Output - target).^2;
figure
for ii = 1:2
	set(gcf,'Position',[100,100,960,420])
	subplot(1,2,ii)
	surf(w1,w2,E{ii}); shading faceted;
	colormap(flipud(hot)); caxis([0,max(max(E{ii}))])
	set(gca,'fontsize',16)
	xlabel('w_{1}'), ylabel('w_{2}'), zlabel('E(w)')
	axis square;
	title(sprintf('Error Surface: %d-layer Network',ii))
	[az, el] = view;
	view([az + 180, el]);
	set(gcf,'position',[100,100,1020,440])
	drawnow
end

The examples in Figure 4 gives us a qualitative idea of how to train the parameters of an ANN, but we would like a more automatic way of doing so. Generally this problem is solved  using gradient descent: The gradient descent algorithm first calculates the derivative / gradient of the error function with respect  to each of the model parameters. This gradient information will give us the direction in parameter space that decreases the height of the error surface. We then take a step in that direction and repeat, iteratively calculating the gradient and taking steps in parameter space.

The backpropagation algorithm

It turns out that the gradient information for the ANN error surface can be calculated efficiently using a message passing algorithm known as the backpropagation algorithm. During backpropagation, input signals are forward-propagated through the network toward the outputs, and network errors are then calculated with respect to target variables and back-propagated backwards towards the inputs. The forward and backward signals are then used to determine the direction in the parameter space to move that lowers the network error.

The formal calculations behind the backpropagation algorithm can be somewhat mathematically involved and may detract from the general ideas behind the learning algorithm. For those readers who are interested in the math, I have provided the formal derivation of the backpropagation algorithm in the following post (for those of you who are not interested in the math, I would also encourage you go over the derivation and try to make connections to the source code implementations provided later in the post).

Figure 5 demonstrates the key steps of the backpropagation algorithm. The main concept underlying the algorithm is that for a given observation we want to determine the degree of “responsibility” that each network parameter has for mis-predicting a target value associated with the observation. We then change that parameter according to this responsibility so that it reduces the network error.

Figure 5: The 4 main steps of the bacdkpropagation algorithm: I Forward propagate error signals to output, II Calculate output error E, and backpropagate error signal, III Use forward signal and backward signals to calculate parameter gradients, IV update network parameters.

Figure 5: The 4 main steps of the bacdkpropagation algorithm: I Forward propagate error signals to output, II Calculate output error E, and backpropagate error signal, III Use forward signal and backward signals to calculate parameter gradients, IV update network parameters.

In order to determine the network error, we first propagate the observed input forward through the network layers. This is Step I of the backpropagation algorithm, and is demonstrated in Figure 5-I. Note that in the Figure a_k could be considered network output  (for a network with one hidden layer) or the output of a hidden layer that projects the remainder of the network (in the case of a network with more than one hidden layer). For this discussion, however, we assume that the index k is associated with the output layer of the network, and thus each of the network outputs is designated by a_k. Also note that when implementing this forward-propagation step, we should keep track of the feed-forward pre-activations z_l and activations a_l for all layers l, as these will be used for calculating backpropagated errors and error function gradients.

Step II of the algorithm is to calculate the network output error and backpropagate it toward the input. Let’s again that we are using the sum of squared differences error function:

\Large{\begin{array}{rcl}E = \frac{1}{2}\sum_{k \in K}(a_k - t_k)^2\end{array}},

where we sum over the values of all k output units (one in this example). We can now define an “error signal”  \delta_k at the output node that will be backpropagated toward the input. The error signal is calculated as follows:

\Large{\begin{array}{rcl} \begin{array}{rcl} \delta_k &=& g_k'(z_k)E'(a_k,t_k) \\  &=& g_k'(z_k)(a_k - t_k)\end{array}\end{array}}.

Thus the error signal essentially weights the gradient of the error function by the gradient of the output activation function (notice there is a z_k term is used in this calculation, which is why we keep it around during the forward-propagation step). We can continue backpropagating the error signal toward the input by passing \delta_k through the output layer weights w_{jk}, summing over all output nodes, and passing the result through the gradient of the activation function at the hidden layer g_j'(z_j) (Figure 5-II). Performing these operations results in the back-propagated error signal for the hidden layer, \delta_j:

\Large{\begin{array}{rcl} \delta_j = g_j'(z_j)\sum_k \delta_k w_{jk}\end{array}},

For networks that have more than one hidden layer, this error backpropagation procedure can continue for layers j-1, j-2, ..., etc.

Step III of the backpropagation algorithm is to calculate the gradients of the error function with respect to the model parameters at each layer l using the forward signals a_{l-1}, and the backward error signals \delta_l . If one considers the model weights w_{l-1, l} at a layer l as linking the forward signal a_{l-1} to the error signal \delta_l (Figure 5-III), then the gradient of the error function with respect to those weights is:

\Large{ \begin{array}{rcl} \frac{\partial E}{\partial w_{l-1, l}} = a_{l-1}\delta_l\end{array}}

Note that this result is closely related to the concept of Hebbian learning in neuroscience. Thus the gradient of the error function with respect to the model weight at each layer can be efficiently calculated by simply keeping track of the forward-propagated activations feeding into that layer from below, and weighting those activations by the backward-propagated error signals feeding into that layer from above!

What about the bias parameters? It turns out that the same gradient rule used for the weight weights applies, except that “feed-forward activations” for biases are always +1 (see Figure 1). Thus the bias gradients for layer l are simply:

\Large{\begin{array}{rcl}\frac{\partial E}{\partial b_{l}} = (1)\delta_l = \delta_l \end{array}}

The fourth and final step of the backpropagation algorithm is to update the model parameters based on the gradients calculated in Step III. Note that the gradients point in the direction in parameter space that will increase the value of the error function. Thus when updating the model parameters we should choose to go in the opposite direction. How far do we travel in that direction? That is generally determined by a user-defined step size (aka learning rate) parameter, \eta. Thu,s given the parameter gradients and the step size, the weights and biases for a given layer are updated accordingly:

\Large{ \begin{array}{rcl} w_{l-1,l} &\leftarrow& w_{l-1,l} - \eta \frac{\partial E}{\partial w_{l-1, l}} \\  b_l &\leftarrow& b_{l} - \eta \frac{\partial E}{\partial b_{l}}\end{array}}.

To train an ANN, the four steps outlined above and in Figure 5 are repeated iteratively by observing many input-target pairs and updating the parameters until either the network error reaches a tolerably low value, the parameters cease to update (convergence), or a set number of parameter updates has been achieved. Some readers may find the steps of the backpropagation somewhat ad hoc. However, keep in mind that these steps are formally coupled to the calculus of the optimization problem. Thus I again refer the curious reader to check out the derivation in order to make connections between the algorithm and the math.

Example: learning the OR & AND logical operators using a single layer neural network

Here we go over an example of training a single-layered neural network to perform a classification problem. The network is trained to learn a set of logical operators including the  AND, OR, or XOR. To train the network we first generate training data. The inputs consist of 2-dimensional coordinates that span the input values (x_1, x_2) values for a 2-bit truth table:

Truth table values learned in classification examples

Figure 6: Truth table values learned in classification examples

We then perturb these observations by adding Normally-distributed noise. To generate target variables, we categorize each observations by applying one of logic operators (See Figure 6) to the original (no-noisy) coordinates. We then train the network with the noisy inputs and binary categories targets using the gradient descent / backpropagation algorithm. The code implementation of the network and training procedures, as well as the resulting learning process are displayed below. (Note that in this implementation, I do not use the feed-forward activations to calculate the gradients as suggested above. This is simply to make the implementation of the learning algorithm more explicit in terms of the math. The same situation also applies to the other examples in this post).

Code Block 3: Implements and trains a single-layer neural network for classification to learn logical operators (assumes you have run Code Block 1):

%% EXAMPLE: SINGLE-LAYERED NETWORK

% DEFINE DATA AND TARGETS
data = [0 0; 0 1; 1 0; 1 1;];
classAND = and(data(:,1)>0,data(:,2)>0);
classOR = or(data(:,1)>0,data(:,2)>0);
classXOR = xor(data(:,1)>0,data(:,2)>0);

% THE TYPE OF TRUTH TABLE TO LEARN (UNCOMMENT FOR OTHERS)
classes = classOR
% classes = classAND;
% classes = classXOR;

% MAKE MULTIPLE NOISY TRAINING OBSERVATIONS
nRepats = 30;
data = repmat(data, [nRepats, 1]);
classes = repmat(classes, [nRepats, 1]);
data = data + .15*randn(size(data));

% SHUFFLE DATA
shuffleIdx = randperm(size(data,1));
data = data(shuffleIdx,:);
classes = classes(shuffleIdx);

% INITIALIZE MODEL PARAMETERS
[nObs,nInput] = size(data); % # OF INPUT DIMENSIONS
nOutput = 1;    			% # OF TARGET/OUTPUT DIMENSIONS
lRate = 3;    				% LEARNING RATE FOR PARAMETERS UPDATE
nIters = 80;  				% # OF ITERATIONS

% DECLARE ACTIVATION FUNCTIONS (AND DERIVATIVES)
g_out = gSigmoid; gPrime_out = gPrimeSigmoid;

% INITIALIZE RANDOM WEIGHTS
W_out = (rand(nInput,nOutput)-.5);
b_out = (rand(1,nOutput)-.5);

% SOME OTHER INITIALIZATIONS
% (FOR VISUALIZATION)
visRange = [-.2 1.2];
[xx,yy] = meshgrid(linspace(visRange(1), visRange(2),100));
iter = 1;
mse = zeros(1,nIters);

figure
set(gcf,'Position',[100,100,960,420])
while 1
	err = zeros(1,nObs);
	% LOOP THROUGH THE EXAMPLES
	for iO = 1:nObs

		% GET CURRENT NETWORK INPUT DATA AND TARGET
		input = data(iO,:);
		target = classes(iO);

		%% I. FORWARD PROPAGATE DATA THROUGH NETWORK
		z_out = input*W_out + b_out; % OUTPUT UNIT PRE-ACTIVATIONS
		a_out = g_out(z_out);        % OUTPUT UNIT ACTIVATIONS

		%% II. BACKPROPAGATE ERROR SIGNAL
		% CALCULATE ERROR DERIVATIVE W.R.T. OUTPUT
		delta_out = gPrime_out(z_out).*(a_out - target);

		%% III. CALCULATE GRADIENT W.R.T. PARAMETERS...
		dEdW_out = delta_out*input;
		dEdb_out = delta_out*1;

		%% IV. UPDATE NETWORK PARAMETERS
		W_out = W_out - lRate*dEdW_out';
		b_out = b_out - lRate*dEdb_out';

		% CALCULATE ERROR FUNCTION
		err(iO) = .5*(a_out-target).^2;
	end
	mse(iter) = mean(err);

	% DISPLAY LEARNING
	clf; subplot(121); hold on;
	set(gca,'fontsize',16)
	netOut = g_out(bsxfun(@plus,[xx(:),yy(:)]*W_out, b_out));
	contourf(xx,yy,reshape(netOut,100,100)); colormap(flipud(spring))
	hold on;
	gscatter(data(:,1),data(:,2),classes,[0 0 0 ; 1 1 1],[],20,'off');
	title(sprintf('Iteration %d',iter))
	xlim([visRange(1) visRange(2)]),ylim([visRange(1) visRange(2)]);
	axis square

	subplot(122);
	set(gca,'fontsize',16)
	plot(1:iter,mse(1:iter));
	xlabel('Iteration')
	ylabel('Mean Squared Error')
	axis square
	m1(iter) = getframe(gcf);

	if iter >= nIters
		break
	end
	iter = iter + 1;
end
Single layer neural network (perceptron) learning a noisy OR mapping.

Figure 7: Single layer neural network (perceptron) learning a noisy OR mapping.

Figure 7 displays the procedure for learning the OR mapping. The left plot displays the training data and the network output at each iteration. White dots are training points categorized “1” while black dots are categorized “0”. Yellow regions are where the network predicts values of “0”, while magenta highlights areas where the network predicts “1”. We see that the single-layer network is able to easily separate the two classes.  The right plot shows how the error function decreases with each training iteration. The smooth trajectory of the error indicates that the error surface is also fairly smooth.

Single layer neural network (perceptron) learning a noisy and mapping.

Figure 8: Single layer neural network (perceptron) learning a noisy AND mapping.

Figure 8 demonstrates an analogous example, but instead learning the AND operator (by executing Code Block 3, after un-commenting line 11). Again, the  categories can be easily separated by a plane, and thus the single-layered network easily learns an accurate predictor of the data.

Going Deeper: nonlinear classification and multi-layer neural networks

Figures 7 and 8 demonstrate how a single-layered ANN can easily learn the OR and AND operators. This is because the categorization criterion for these logical operators can be represented in the input space by a single linear function (i.e. line/plane). What about more complex categorization criterion that cannot be represented by a single plane? An example of a more complex binary classification criterion is the XOR operator (Figure 6, far right column).

Below we attempt to train the single-layer network to learn the XOR operator (by executing Code Block 3, after un-commenting line 12). The single layer network is unable to learn this nonlinear mapping between the inputs and the targets. However, it turns out we can learn the XOR operator using a multi-layered neural network.

Single layer neural network (perceptron) attempting to learn a noisy XOR mapping. The single layer network chokes on this nonlinear problem.

Figure 9: Single layer neural network (perceptron) attempting to learn a noisy XOR mapping. The single layer network chokes on this nonlinear problem.

Below we train a two-layer neural network on the XOR dataset. The network incorporates a hidden layer with 3 hidden units and logistic sigmoid activation functions for all units in the hidden and output layers (see Code Block 4, lines 32-33).

Code Block 4: Implements and trains a two-layer neural network for classification to learn XOR operator and more difficult “ring” problem (Figures 10 & 11; assumes you have run Code Block 1):


%% EXAMPLE: MULTI-LAYER NEURAL NETWORK FOR CLASSIFICATION
data = [0 0; 0 1; 1 0; 1 1;];
classXOR = xor(data(:,1)>0,data(:,2)>0);

% THE TYPE OF TRUTH TABLE TO LEARN
classes = classXOR;

% UNCOMMENT FOR MOR DIFFICULT DATA...
% data = [data; .5 .5; 1 .5; 0 .5; .5 0; .5 1];
% classRing = [1; 1; 1; 1; 0; 1; 1; 1; 1];
% classes = classRing;

% CREATE MANY NOISY OBSERVATIONS
nRepats = 30;
data = repmat(data, [nRepats, 1]);
classes = repmat(classes, [nRepats, 1]);
data = data + .15*randn(size(data));

% SHUFFLE OBSERVATIONS
shuffleIdx = randperm(size(data,1));
data = data(shuffleIdx,:);
classes = classes(shuffleIdx);

% INITIALIZE MODEL PARAMETERS
[nObs,nInput] = size(data);   	% # OF INPUT DIMENSIONS
nHidden = 3;    				% # OF HIDDEN UNITS

lRate = 2;    	% LEARNING RATE FOR PARAMETERS UPDATE
nIters = 300;   % # OF ITERATIONS

% DECLARE ACTIVATION FUNCTIONS (AND DERIVATIVES)
g_hid = gSigmoid; gPrime_hid = gPrimeSigmoid;
g_out = gSigmoid; gPrime_out = gPrimeSigmoid;

% INITIALIZE WEIGHTS
W_hid = (rand(nInput,nHidden)-.5);
b_hid = (rand(1,nHidden)-.5);
W_out = (rand(nHidden,nOutput)-.5);
b_out = (rand(1,nOutput)-.5);

iter = 1;
mse = zeros(1,nIters);
figure
set(gcf,'Position',[100,100,960,420])
% MAIN TRAINING ALGORITHM
while 1
	err = zeros(1,nObs);
	% LOOP THROUGH THE EXAMPLES
	for iO = 1:nObs

		% GET CURRENT NETWORK INPUT DATA AND TARGET
		input = data(iO,:);
		target = classes(iO);

		%% I. FORWARD PROPAGATE DATA THROUGH NETWORK
		z_hid = input*W_hid + b_hid; % HIDDEN UNIT PRE-ACTIVATIONS
		a_hid = g_hid(z_hid);        % HIDDEN UNIT ACTIVATIONS
		z_out = a_hid*W_out + b_out; % OUTPUT UNIT PRE-ACTIVATIONS
		a_out = g_out(z_out);        % OUTPUT UNIT ACTIVATIONS

		%% II.  BACKPROPAGATE ERROR SIGNAL
		% CALCULATE ERROR DERIVATIVE W.R.T. OUTPUT
		delta_out = gPrime_out(z_out).*(a_out - target);

		% CALCULATE ERROR CONTRIBUTIONS FOR HIDDEN NODES...
		delta_hid = gPrime_hid(z_hid)'.*(delta_out*W_out);

		%% III. CALCULATE GRADIENT W.R.T. PARAMETERS...
		dEdW_out = delta_out*a_hid;
		dEdb_out = delta_out*1;
		dEdW_hid = delta_hid*input;
		dEdb_hid = delta_hid*1;

		%% IV. UPDATE NETWORK PARAMETERS
		W_out = W_out - lRate*dEdW_out';
		b_out = b_out - lRate*dEdb_out';

		W_hid = W_hid - lRate*dEdW_hid';
		b_hid = b_hid - lRate*dEdb_hid';

		% CALCULATE ERROR FUNCTION
		err(iO) = .5*(a_out-target).^2;
	end
	mse(iter) = mean(err);

	% DISPLAY LEARNING
	clf; subplot(121); hold on;
	set(gca,'fontsize',16)

	netOut = g_out(bsxfun(@plus,g_hid(bsxfun(@plus,[xx(:),yy(:)]*W_hid, b_hid))*W_out, b_out));
	contourf(xx,yy,reshape(netOut,100,100)); colormap(flipud(spring))
	hold on;
	gscatter(data(:,1),data(:,2),classes,[0 0 0; 1 1 1],[],20,'off');
	title(sprintf('Iteration %d',iter))
	xlim([visRange(1), visRange(2)]),ylim([visRange(1), visRange(2)]);
	axis square

	subplot(122);
	set(gca,'fontsize',16)
	plot(1:iter,mse(1:iter));
	xlabel('Iteration')
	ylabel('Mean Squared Error')
	axis square
	m2(iter) = getframe(gcf);

	if iter >= nIters
		break
	end
	iter = iter + 1;
end
A multi-layer neural network (perceptron) attempting to learn a noisy XOR mapping. The multi-layer network easily learns this nonlinear problem.

Figure 10: A multi-layer neural network (perceptron) attempting to learn a noisy XOR mapping. The multi-layer network easily learns this nonlinear problem.

Figure 10 displays the learning process for the 2-layer network. The formatting for Figure 10 is analogous to that for Figures 7-9. The 2-layer network is easily able to learn the XOR operator. We see that by adding a hidden layer between the input and output, the ANN is able to learn the nonlinear categorization criterion!

Figure 11 shows the results for learning a even more difficult nonlinear categorization function: points in and around (x1, x2) = (0.5 0.5) are categorized as “0”, while points in a ring surrounding the “0” datapoints are categorized as a “1” (Figure 11). This example is run by executing Code Block 4 after un-commenting  lines 9-11.

Multilayer neural network learning a nonlinear binary classification task

Figure 11: Multilayer neural network learning a nonlinear binary classification task

Figure 11 shows the learning process. Again formatting is analogous to the formatting in Figures 8-10. The 2-layer ANN is able to learn this difficult classification criterion.

Example: Neural Networks for Regression

The previous examples demonstrated how ANNs can be used for classification by using a logistic sigmoid as the output activation function. Here we demonstrate how, by making the output activation function the linear/identity function, the same 2-layer network architecture can be used to implement nonlinear regression.

For this example we define a dataset comprised of 1D inputs, \mathbf{x} that range from (-5, 5). We then generate noisy targets \mathbf y according to the function:

\Large{\begin{array}{rcl}\mathbf{y} = f(\mathbf{x}) + \mathbf{\epsilon}\end{array}}

where f(x) is a nonlinear data-generating function and \mathbf \epsilon is Normally-distributed noise. We then construct a two-layered network with tanh activation functions used in the hidden layer and linear outputs. For this example we set the number of hidden units to 3 and train the model as we did for categorization using gradient descent / backpropagation. The results of the example are displayed below.

Code Block 5: Trains two-layer network for regression problems (Figures 11 & 12; assumes you have run Code Block 1):

%% EXAMPLE: NONLINEAR REGRESSION

% DEFINE DATA-GENERATING FUNCTIONS f(x)
xMin = -5; xMax = 5;
xx = linspace(xMin, xMax, 100);
f = inline('2.5 + sin(x)','x');
% f = inline('abs(x)','x'); % UNCOMMENT FOR FIGURE 13
yy = f(xx) + randn(size(xx))*.5;

% FOR SHUFFLING OBSERVATIONS
shuffleIdx = randperm(length(xx));
data = xx;
targets = yy;

% INITIALIZE MODEL PARAMETERS
nObs = length(data); 	% # OF INPUT DIMENSIONS
nInput = 1;				% # OF INPUTS
nHidden = 3; 			% # OF HIDDEN UNITS
nOutput = 1;			% # OF TARGET/OUTPUT DIMENSIONS
lRate = .15;   			% LEARNING RATE FOR PARAMETERS UPDATE
nIters = 200;  			% # OF ITERATIONS

cols = lines(nHidden);

% DECLARE ACTIVATION FUNCTIONS (AND DERIVATIVES)
g_hid = gTanh;  		   % HIDDEN UNIT ACTIVATION
gPrime_hid = gPrimeTanh;   % GRAD OF HIDDEN UNIT ACTIVATION
g_out = gLinear; 		   % OUTPUT ACTIVATION
gPrime_out = gPrimeLinear; % GRAD. OF OUTPUT ATIVATION

% % INITIALIZE WEIGHTS
W_hid = (rand(nInput,nHidden)-.5);
b_hid = (rand(1,nHidden)-.5);
W_out = (rand(nHidden,nOutput)-.5);
b_out = (rand(1,nOutput)-.5);

% INITIALIZE SOME THINGS..
% (FOR VISUALIZATION)
mse = zeros(1,nIters);
visRange = [xMin, xMax];
figure
set(gcf,'Position',[100,100,960,420])
iter = 1;
while 1

	err = zeros(1,nObs);
	% LOOP THROUGH THE EXAMPLES
	for iO = 1:nObs

		% GET CURRENT NETWORK INPUT DATA AND TARGET
		input = data(iO);
		target = targets(iO);

		%% I. FORWARD PROPAGATE DATA THROUGH NETWORK
		z_hid = input*W_hid + b_hid; % HIDDEN UNIT PRE-ACTIVATIONS
		a_hid = g_hid(z_hid);        % HIDDEN UNIT ACTIVATIONS
		z_out = a_hid*W_out + b_out; % OUTPUT UNIT PRE-ACTIVATIONS
		a_out = g_out(z_out);        % OUTPUT UNIT ACTIVATIONS

		%% II. BACKPROPAGATE ERROR SIGNAL
		% CALCULATE ERROR DERIVATIVE W.R.T. OUTPUT
		delta_out = gPrime_out(z_out).*(a_out - target);

		%% CALCULATE ERROR CONTRIBUTIONS FOR HIDDEN NODES...
		delta_hid = gPrime_hid(z_hid)'.*(delta_out*W_out);

		%% III. CALCULATE GRADIENT W.R.T. PARAMETERS...
		dEdW_out = delta_out*a_hid;
		dEdb_out = delta_out*1;
		dEdW_hid = delta_hid*input;
		dEdb_hid = delta_hid*1;

		%% IV. UPDATE NETWORK PARAMETERS
		W_out = W_out - lRate*dEdW_out';
		b_out = b_out - lRate*dEdb_out';

		W_hid = W_hid - lRate*dEdW_hid';
		b_hid = b_hid - lRate*dEdb_hid';

		% CALCULATE ERROR FUNCTION FOR BATCH
		err(iO) = .5*(a_out-target).^2;
	end
	mse(iter) = mean(err); % UPDATE ERROR

	% DISPLAY LEARNING
	clf; subplot(121); hold on;
	set(gca,'fontsize',14)

	plot(xx,f(xx),'m','linewidth',2);
	hold on;
	scatter(xx, yy ,'m');

	% PLOT TOTAL NETWORK OUTPUT
	netOut = g_out(g_hid(bsxfun(@plus, xx'*W_hid, b_hid))*W_out + b_out);
	plot(xx, netOut, 'k','linewidth', 2)

	% PLOT EACH HIDDEN UNIT'S OUTPUT FUNCTION
	for iU = 1:nHidden
		plot(xx,g_hid(xx*W_hid(iU) + b_hid(iU)),'color',cols(iU,:),'Linewidth',2, ...
                                                                           'Linestyle','--');
	end

	% TITLE AND LEGEND
	title(sprintf('Iteration %d',iter))
	xlim([visRange(1) visRange(2)]),ylim([visRange(1) visRange(2)]);
	axis square
	legend('f(x)', 'Targets', 'Network Output','Hidden Unit Outputs','Location','Southwest')

	% PLOT ERROR
	subplot(122);
	set(gca,'fontsize',14)
	plot(1:iter,mse(1:iter));
	xlabel('Iteration')
	ylabel('Mean Squared Error')
	axis square; drawnow

	% ANNEAL LEARNING RATE
	lRate = lRate *.99;
	if iter >= nIters
		break
	end
	iter = iter + 1;
end
A two-layered ANN used for regression. The network approximates the function f(x) = sin(x) + 2.5

Figure 12: A two-layered ANN used for regression. The network approximates the function f(x) = sin(x) + 2.5

The training procedure for f(x): \sin(x) + 2.5 is visualized in the left plot of Figure 12. The data-generating function f(x) is plotted as the solid magenta line, and the noisy target values used to train the network are plotted as magenta circles. The output of the network at each training iteration is plotted in solid black while the output of each of the tanh hidden units is plotted in dashed lines. This visualization demonstrates how multiple nonlinear functions can be combined to form the complex output target function. The mean squared error at each iteration is plotted in the right plot of Figure 12. We see that the error does not follow a simple trajectory during learning, but rather undulates, demonstrating the non-convexity of the error surface.

Figure 13 visualizes the training procedure for trying to learn a different nonlinear function, namely f(x): \text{abs}(x) (by running Code Block 5, after un-commenting out line 7). Again, we see how the outputs of the hidden units are combined to fit the desired data-generating function. The mean squared error again follows an erratic path during learning.

A two-layered ANN used for regression. The network approximates the function f(x) = abs(x)

Figure 13: A two-layered ANN used for regression. The network approximates the function f(x) = abs(x)

Notice for this example that I added an extra implementation detail known as simulated annealing (line 118) that was absent in the classification examples. This technique decreases the learning rate after every iteration thus making the algorithm take smaller and smaller steps in parameter space.  This technique can be useful when the gradient updates begin oscillating between two or more locations in the parameter space. It is also helpful for influencing the algorithm to settle down into a steady state.

Wrapping up

In this post we covered the main ideas behind artificial neural networks including: single- and multi-layer ANNs, activation functions and their derivatives, a high-level description of the backpropagation algorithm, and a number of classification and regression examples. ANNs, particularly mult-layer ANNs, are a robust and powerful class of models that can be used to learn complex, nonlinear functions. However, there are a number of considerations when using neural networks including:

  • How many hidden layers should one use?
  • How many hidden units in each layer?
  • How do these relate to overfitting and generalization?
  • Are there better error functions than the squared difference?
  • What should the learning rate be?
  • What can we do about the complexity of error surface with deep networks?
  • Should we use simulated annealing?
  • What about other activation functions?

It turns out that there are no easy or definite answers to any of these questions, and there is active research focusing on each topic. This is why using ANNs is often considered as much as a “black art” as it is a quantitative technique.

One primary limitation of ANNs is that they are supervised algorithms, requiring a target value for each input observation in order to train the network. This can be prohibitive for training large networks that may require lots of training data to adequately adjust the parameters. However, there are a set of unsupervised variants of ANNs that can be used to learn an initial condition for the ANN (rather than from randomly-generated initial weights) without the need of target values. This technique of “unsupervised pretraining” has been an important component of many “deep learning” models used in AI and machine learning. In future posts, I look forward to covering two of these unsupervised neural networks: autoencoders and restricted Boltzmann machines.

 

fMRI in Neuroscience: Modeling the HRF With FIR Basis Functions

In the previous post on fMRI methods, we discussed how to model the selectivity of a voxel using the General Linear Model (GLM). One of the basic assumptions that we must make in order to use the GLM is that we also have an accurate model of the Hemodynamic Response Function (HRF) for the voxel. A common practice is to use a canonical HRF model established from previous empirical studies of fMRI timeseries. However, voxels throughout the brain and across subjects exhibit a variety of shapes, so the canonical model is often incorrect. Therefore it becomes necessary to estimate the shape of the HRF for each voxel.

There are a number of ways that have been developed for estimating HRFs, most of them are based on temporal basis function models. (For details on basis function models, see this previous post.). There are a number of basis function sets available, but in this post we’ll discuss modeling the HRF using a flexible basis set composed of a set of delayed impulses called Finite Impulse Response (FIR) basis.

Modeling HRFs With a Set of Time-delayed Impulses

Let’s say that we have an HRF with the following shape.

The HRF represented as the sum of the weighted FIR basis functions above.

A Model HRF.

We would like to be able to model the HRF as a weighted combination of simple basis functions. The simplest set of basis functions is the FIR basis, which is a series of H distinct unit-magnitude (i.e. equal to one) impulses, each of which is delayed in time by t = 1 \dots H TRs. An example of  modeling the HRF above using FIR basis functions is below:

%% REPRESENTING AN HRF WITH FIR BASIS FUNCTIONS
% CREATE ACTUAL HRF (AS MEASURED BY MRI SCANNER)
rand('seed',12345)
TR = 1                              % REPETITION TIME
t = 1:TR:20;                        % MEASUREMENTS
h = gampdf(t,6) + -.5*gampdf(t,10); % ACTUAL HRF
h = h/max(h);

% DISPLAY THE HRF
figure;
stem(t,h,'k','Linewidth',2)
axis square
xlabel(sprintf('Basis Function Contribution\nTo HRF'))
title(sprintf('HRF as a Series of \nWeighted FIR Basis Functions'))

% CREATE/DISPLAY FIR REPRESENTATION
figure; hold on
cnt = 1;

% COLORS BASIS FUNCTIONS ACCORDING TO HRF WEIGHT
map = jet(64);
cRange = linspace(min(h),max(h),64);

for iT = numel(h):-1:1
	firSignal = ones(size(h));
	firSignal(cnt) = 2;
	[~,cIdx] = min(abs(cRange-h(cnt)));
	color = map(cIdx,:);
	plot(1:numel(h),firSignal + 2*(iT-1),'Color',color,'Linewidth',2)
	cnt = cnt+1;
end
colormap(map); colorbar; caxis([min(h) max(h)]);

% DISPLAY
axis square;
ylabel('Basis Function')
xlabel('Time (TR)')
set(gca,'YTick',0:2:39,'YTickLabel',20:-1:1)
title(sprintf('Weighted FIR Basis\n Set (20 Functions)'));
Representing HRF as a weighted set of FIR basis functions. The color of each of the 20 basis functions corresponds to its weight

Representing the HRF above as a weighted set of FIR basis functions.
The color of each of the 20 basis functions corresponds to its weight

Each of the basis functions b_t has an unit impulse that occurs at time t = 1 \dots 20; otherwise it is equal to zero. Weighting each basis function b_t with the corresponding value of the HRF at each time point t, followed by a sum across all the functions gives the target HRF in the first plot above. The FIR basis model makes no assumptions about the shape of the  HRF–the weight applied to each basis function can take any value–which allows the model to capture a wide range of HRF profiles.

Given an experiment where various stimuli are presented to a subject and BOLD responses evoked within the subject’s brain, the goal is to determine the HRF to each of the stimuli within each voxel. Let’s take a look at a concrete example of how we can use the FIR basis to simultaneously estimate HRFs to many stimuli for multiple voxels with distint tuning properties.

Estimating the HRF of Simulated Voxels Using the FIR Basis

For this example we revisit a simulation of voxels with 4 different types of tuning (for details, see the previous post on fMRI in Neuroscience). One voxel is strongly tuned for visual stimuli (such as a light), the second voxel is weakly tuned for auditory stimuli (such as a tone), the third is moderately tuned for somatosensory stimuli (such as warmth applied to the palm), and the final voxel is unselective (i.e. weakly and equally selective for all three types of stimuli). We simulate an experiment where the blood-oxygen-level dependent (BOLD) signals evoked  in each voxel by a series of stimuli consisting of nonoverlapping lights, tones, and applications of warmth to the palm, are measured over T=330 fMRI measurments (TRs). Below is the simulation of the experiment and the resulting simulated BOLD signals:

%% SIMULATE AN EXPERIMENT
% SOME CONSTANTS
trPerStim = 30;
nRepeat = 10;
nTRs = trPerStim*nRepeat + length(h);
nCond = 3;
nVox = 4;
impulseTrain0 = zeros(1,nTRs);

% RANDOM ONSET TIMES (TRs)
onsetIdx = randperm(nTRs-length(h));

% VISUAL STIMULUS
impulseTrainLight = impulseTrain0;
impulseTrainLight(onsetIdx(1:nRepeat)) = 1;
onsetIdx(1:nRepeat) = [];

% AUDITORY STIMULUS
impulseTrainTone = impulseTrain0;
impulseTrainTone(onsetIdx(1:nRepeat)) = 1;
onsetIdx(1:nRepeat) = [];

% SOMATOSENSORY STIMULUS
impulseTrainHeat = impulseTrain0;
impulseTrainHeat(onsetIdx(1:nRepeat)) = 1;

% EXPERIMENT DESIGN / STIMULUS SEQUENCE
D = [impulseTrainLight',impulseTrainTone',impulseTrainHeat'];
X = conv2(D,h');
X = X(1:nTRs,:);

%% SIMULATE RESPONSES OF VOXELS WITH VARIOUS SELECTIVITIES
visualTuning =   [4 0 0]; % VISUAL VOXEL TUNING
auditoryTuning = [0 2 0]; % AUDITORY VOXEL TUNING
somatoTuning =   [0 0 3]; % SOMATOSENSORY VOXEL TUNING
noTuning =       [1 1 1]; % NON-SELECTIVE

beta = [visualTuning', ...
        auditoryTuning', ...
        somatoTuning', ...
        noTuning'];

y0 = X*beta;
SNR = 5;
noiseSTD = max(y0)/SNR;
noise = bsxfun(@times,randn(size(y0)),noiseSTD);
y = y0 + noise; % VOXEL RESPONSES

% DISPLAY VOXEL TIMECOURSES
voxNames = {'Visual','Auditory','Somat.','Unselective'};
cols = lines(4);
figure;
for iV = 1:4
	subplot(4,1,iV)
	plot(y(:,iV),'Color',cols(iV,:),'Linewidth',2); xlim([0,nTRs]);
	ylabel('BOLD Signal')
	legend(sprintf('%s Voxel',voxNames{iV}))
end
xlabel('Time (TR)')
set(gcf,'Position',[100,100,880,500])
Four simulated voxels, each with strong visual, weak auditory, moderate somatosensory and no tuning.

Four simulated voxels, each with strong visual, weak auditory,
moderate somatosensory and unselective tuning.

Now let’s estimate the HRF of each voxel to each of the C = 3 stimulus conditions using an FIR basis function model. To do so, we create a design matrix composed of successive sets of delayed impulses, where each set of impulses begins at the onset of each stimulus condition. For the [T \times C]-sized stimulus onset matrix D, we calculate an [T \times HC] FIR design matrix X_{FIR}, where H is the assumed length of the HRF we are trying to estimate. The code for creating and displaying the design matrix for an assumed HRF length H=16 is below:

%% ESTIMATE HRF USING FIR BASIS SET
% CREATE FIR DESIGN MATRIX
hrfLen = 16;  % WE ASSUME HRF IS 16 TRS LONG

% BASIS SET FOR EACH CONDITOIN IS A TRAIN OF INPULSES
X_FIR = zeros(nTRs,hrfLen*nCond);

for iC = 1:nCond
	onsets = find(D(:,iC));
	idxCols = (iC-1)*hrfLen+1:iC*hrfLen;
	for jO = 1:numel(onsets)
		idxRows = onsets(jO):onsets(jO)+hrfLen-1;
		for kR = 1:numel(idxRows);
			X_FIR(idxRows(kR),idxCols(kR)) = 1;
		end
	end
end

% DISPLAY
figure;
subplot(121);
imagesc(D);
colormap gray;
set(gca,'XTickLabel',{'Light','Tone','Som.'})
title('Stimulus Train');

subplot(122);
imagesc(X_FIR);
colormap gray;
title('FIR Design Matrix');
set(gca,'XTick',[8,24,40])
set(gca,'XTickLabel',{'Light','Tone','Som.'})
set(gcf,'Position',[100,100,550,400])

Left: The stimulus onset matrix (size = [T x 3]). Right the corresponding Design Matrix (size = [T x 3*H])

Left: The stimulus onset matrix (size = [T x 3]).
Right the corresponding Design Matrix (size = [T x 3*H])

In the right panel of the plot above, we see the  form of the FIR design matrix X_{FIR} for the stimulus onset on the left. For each voxel, we want to determine the weight on each column of X_{FIR} that will best explain the BOLD signals y measured from each voxel. We can form this problem in terms of a General Linear Model:

y = X_{FIR}\beta_{FIR}

Where \beta_{FIR} are the weights on each column of the FIR design matrix. If we set the values of \beta_{HRF} such as to minimize the sum of the squared errors (SSE) between the model above and the measured actual responses

SSE = \sum_i^N(y^{(i)} - X_{FIR}^{(i)})^2,

then we can use the Ordinary Least Squares (OLS) solution discussed earlier to solve the for \beta_{HRF}.  Specifically, we solve for the weights as:

\hat \beta_{FIR} = (X_{FIR}^T X_{FIR})^{-1} X_{FIR} y

Once determined, the resulting [CH \times V] matrix of weights \hat \beta_{FIR} has the HRF of each of the V=4 different voxels to each stimulus condition along its columns. The first H (1-16) of the weights along a column define the HRF to the first stimulus (the light). The second H (17-32) weights along a column determine the HRF to the second stimulus (the tone), etc… Below we parse out these weights and display the resulting HRFs for each voxel:

% ESTIMATE HRF FOR EACH CONDITION AND VOXEL
betaHatFIR = pinv(X_FIR'*X_FIR)*X_FIR'*y;

% RESHAPE HRFS
hHatFIR = reshape(betaHatFIR,hrfLen,nCond,nVox);

% DISPLAY
figure
cols = lines(4);
names = {'Visual','Auditory','Somat.','Unselective'};
for iV = 1:nVox
	subplot(2,2,iV)
	hold on;
	for jC = 1:nCond
		hl = plot(1:hrfLen,hHatFIR(:,jC,iV),'Linewidth',2);
		set(hl,'Color',cols(jC,:))
	end
	hl = plot(1:numel(h),h,'Linewidth',2);
	xlabel('TR')
	legend({'Light','Tone','Heat','True HRF'})
	set(hl,'Color','k')
	xlim([0 hrfLen])
	grid on
	axis tight
	title(sprintf('%s Voxel',names{iV}));
end
set(gcf,'Position',[100,100,880,500])
HRF estimates for each voxel to each of the 3 stimuli. Black plots show the shape of the actual HRF.

HRF estimates for each voxel to each of the 3 stimuli. Black plots show the shape of the true HRF.

Here we see that estimated HRFs accurately capture both the shape of the HRF and the selectivity of each of the voxels. For instance, the HRFs estimated from the responses of first voxel indicate strong tuning for the light stimulus. The HRF estimated for the light stimulus has an amplitude that is approximately 4 times that of the true HRF. This corresponds with the actual tuning of the voxel (compare this to the value of  \beta(1,1)). Additionally, time delay till the maximum value (time-to-peak) of the HRF to the light is the same as the true HRF. The first voxel’s HRFs estimated for the other stimuli are essentially noise around baseline. This (correctly) indicates that the first voxel has no selectivity for those stimuli. Further inspection of the remaining estimated HRFs indicate accurate tuning and HRF shape is recovered for the other three voxels as well.

Wrapping Up

In this post we discussed how to apply a simple basis function model (the FIR basis) to estimate the HRF profile and get an idea of the tuning of individual voxels. Though the FIR basis model can accurately model any HRF shape, it is often times too flexible. In scenarios where voxel signals are very noisy, the FIR basis model will tend to model the noise.

Additionally, the FIR basis set needs to incorporate a basis function for each time measurement.  For the example above, we assumed the HRF had a length of 16 TRs. The FIR basis therefore had 16 tuneable weights for each condition. This leads to a model with 48 (C\times H = 3 \times 16) tunable parameters for the GLM model. For experiments with many different stimulus conditions, the number of parameters can grow quickly (as HC). If the number of parameters is comparable (or more) than the number of BOLD signal measurements, it will be difficult accurately estimate \hat \beta_{FIR}. As we’ll see in later posts, we can often improve upon the FIR basis set by using more clever basis functions.

Another important but indirect issue that effects estimating the HRF is the experimental design, or rather the schedule used to present the stimuli. In the example above, the stimuli were presented in random, non-overlapping order. What if the stimuli were presented in the same order every time, with some set frequency? We’ll discuss in a later post the concept of design efficiency and how it affects our ability to characterize the shape of the HRF and, consequently, voxel selectivity.

fMRI in Neuroscience: Estimating Voxel Selectivity & the General Linear Model (GLM)

In a typical fMRI experiment  a series of stimuli are presented to an observer and evoked brain activity–in the form of blood-oxygen-level-dependent (BOLD) signals–are measured from tiny chunks of the brain called voxels. The task of the researcher is then to infer the tuning of the voxels to features in the presented stimuli based on the evoked BOLD signals. In order to make this inference quantitatively, it is necessary to have a  model of how BOLD signals are evoked in the presence of stimuli. In this post we’ll develop a model of evoked BOLD signals, and from this model recover the tuning of individual voxels measured during an fMRI experiment.

Modeling the Evoked BOLD Signals — The Stimulus and Design Matrices

Suppose we are running an event-related fMRI experiment where we present C different stimulus conditions to an observer while recording the BOLD signals evoked in their brain over a series of T consecutive fMRI measurements (TRs). We can represent the stimulus presentation quantitatively with a T \times C binary Stimulus Matrix, D, whose entries indicate the onset of each stimulus condition (columns) at each point in time (rows). Now let’s assume that we have an accurate model of how a voxel is activated by a single, very short stimulus. This activation model is called hemodynamic response function (HRF), h, for the voxel, and, as we’ll discuss in a later post, can be estimated from the measured BOLD signals. Let’s assume for now that the voxel is also activated to an equal degree to all stimuli. In this scenario we can represent the BOLD signal evoked over the entire experiment with another T \times C matrix X called the Design Matrix that is the convolution of the stimulus matrix D with the voxel’s HRF h.

X = D * h

Note that this model of the BOLD signal is an example of the Finite Impulse Response (FIR) model that was introduced in the previous post on fMRI Basics.

To make the concepts of D and X more concrete, let’s say our experiment consists of C = 3 different stimulus conditions: a light, a tone, and heat applied to the palm. Each stimulus condition is presented twice in a staggered manner during 80 TRs of fMRI measurements. The stimulus matrix and the design matrix are simulated here in Matlab:

TR = 1; % REPETITION TIME
t = 1:TR:20; % MEASUREMENTS
h = gampdf(t,6) + -.5*gampdf(t,10); % HRF MODEL
h = h/max(h); % SCALE HRF TO HAVE MAX AMPLITUDE OF 1

trPerStim = 30; % # TR PER STIMULUS
nRepeat = 2; % # OF STIMULUS REPEATES
nTRs = trPerStim*nRepeat + length(h);
impulseTrain0 = zeros(1,nTRs);

% VISUAL STIMULUS
impulseTrainLight = impulseTrain0;
impulseTrainLight(1:trPerStim:trPerStim*nRepeat) = 1;

% AUDITORY STIMULUS
impulseTrainTone = impulseTrain0;
impulseTrainTone(5:trPerStim:trPerStim*nRepeat) = 1;

% SOMATOSENSORY STIMULUS
impulseTrainHeat = impulseTrain0;
impulseTrainHeat(9:trPerStim:trPerStim*nRepeat) = 1;

% COMBINATION OF ALL STIMULI
impulseTrainAll = impulseTrainLight + impulseTrainTone + impulseTrainHeat;

% SIMULATE VOXELS WITH VARIOUS SELECTIVITIES
visualTuning = [4 0 0]; % VISUAL VOXEL TUNING
auditoryTuning = [0 2 0]; % AUDITORY VOXEL TUNING
somatoTuning = [0 0 3]; % SOMATOSENSORY VOXEL TUNING
noTuning = [1 1 1]; % NON-SELECTIVE

beta = [visualTuning', ...
        auditoryTuning', ...
        somatoTuning', ...
        noTuning'];

% EXPERIMENT DESIGN / STIMULUS SEQUENCE
D = [impulseTrainLight',impulseTrainTone',impulseTrainHeat'];

% CREATE DESIGN MATRIX FOR THE THREE STIMULI
X = conv2(D,h'); % X = D * h
X(nTRs+1:end,:) = []; % REMOVE EXCESS FROM CONVOLUTION

% DISPLAY STIMULUS AND DESIGN MATRICES
subplot(121); imagesc(D); colormap gray;
xlabel('Stimulus Condition')
ylabel('Time (TRs)');
title('Stimulus Train, D');
set(gca,'XTick',1:3); set(gca,'XTickLabel',{'Light','Tone','Heat'});

subplot(122);
imagesc(X);
xlabel('Stimulus Condition')
ylabel('Time (TRs)');
title('Design Matrix, X = D * h')
set(gca,'XTick',1:3); set(gca,'XTickLabel',{'Light','Tone','Heat'});

Stimulus presentation matrix, D (left) and the Design Matrix X for an experiment with three stimulus conditions: a light, a tone, and heat applied to the palm

Each column of the design matrix above (the right subpanel in the above figure) is essentially a model of the BOLD signal evoked independently by each stimulus condition, and the total signal is simply a sum of these independent signals.

Modeling Voxel Tuning — The Selectivity Matrix

In order to develop the concept of the design matrix we assumed that our theoretical voxel is equally tuned to all stimuli. However, few voxels in the brain exhibit such non-selective tuning. For instance, a voxel located in visual cortex will be more selective for the light than for the tone or the heat stimulus. A voxel in auditory cortex will be more selective for the tone than for the other two stimuli. A voxel in the somoatorsensory cortex will likely be more selective for the heat than the visual or auditory stimuli. How can we represent the tuning of these different voxels?

A simple way to model tuning to the stimulus conditions in an experiment is to multiplying each column of the design matrix by a weight that modulates the BOLD signal according to the presence of the corresponding stimulus condition. For example, we could model a visual cortex voxel by weighting the first column of X with a positive value, and the remaining two columns with much smaller values (or even negative values to model suppression). It turns out that we can model the selectivity of V individual voxels simultaneously through a C \times V Selectivity Matrix\beta. Each entry in \beta is the amount that the v-th voxel (columns) is tuned to the c-th stimulus condition (rows). Given the design matrix and the selectivity matrix, we can then predict the BOLD signals y of selectively-tuned voxels with a simple matrix multiplication:

y = X\beta

Keeping with our example experiment, let’s assume that we are modeling the selectivity of four different voxels: a strongly-tuned visual voxel, a moderately-tuned somatosensory voxel, a weakly tuned auditory voxel, and an unselective voxel that is very weakly tuned to all three stimulus conditions. We can represent the tuning of these four voxels with a 3 \times 4 selectivity matrix. Below we define a selectivity matrix that represents the tuning of these 4 theoretical voxels and simulate the evoked BOLD signals to our 3-stimulus experiment.

% SIMULATE NOISELESS VOXELS' BOLD SIGNAL
% (ASSUMING VARIABLES FROM ABOVE STILL IN WORKSPACE)
y0 = X*beta;

figure;
subplot(211);
imagesc(beta); colormap hot;
axis tight
ylabel('Condition')
set(gca,'YTickLabel',{'Visual','Auditory','Somato.'})
xlabel('Voxel');
set(gca,'XTick',1:4)
title('Voxel Selectivity, \beta')

subplot(212);
plot(y0,'Linewidth',2);
legend({'Visual Voxel','Auditory Voxel','Somato. Voxel','Unselective'});
xlabel('Time (TRs)'); ylabel('BOLD Signal');
title('Activity for Voxels with Different Stimulus Tuning')
set(gcf,'Position',[100 100 750 540])
subplot(211); colorbar

Selectivity matrix (top) for four theoretical voxels and GLM BOLD signals (bottom) for a simple experiment

The top subpanel in the simulation output visualizes the selectivity matrix defined for the four theoretical voxels. The bottom subpanel plots the columns of the T \times V matrix of voxel responses y. We see that the maximum response of the strongly-tuned visual voxel (plotted in blue) is larger than that of the other voxels, corresponding to the larger weight upper left of the selectivity matrix. Also note that the response for the unselective voxel (plotted in cyan) demonstrates the linearity property of the FIR model. The attenuated but complex BOLD signal from the unselective voxel results from the sum of small independent signals evoked by each stimulus.

Modeling Voxel Noise

The example above demonstrates how we can model BOLD signals evoked in noisless theoretical voxels. Though this noisless scenario is helpful for developing a modeling framework, real-world voxels exhibit variable amounts of noise (noise is any signal that cannot be accounted by the FIR model). Therefore we need to incorporate a noise term into our BOLD signal model.

The noise in a voxel is often modeled as a random variable \epsilon. A common choice for the noise model is a zero-mean Normal/Gaussian distribution with some variance \sigma^2:

\epsilon \sim \mathcal N(0,\sigma^2)

Though the variance of the noise model may not be known apriori, there are methods for estimating it from data. We’ll get to estimating noise variance in a later post when we discuss various sources of noise and how to account for them using more advance techniques. For simplicity, let’s just assume that the noise variance is 1 as we proceed.

Putting It All Together — The General Linear Model (GLM)

So far we have introduced on the concepts of the stimulus matrix, the HRF, the design matrix, selectivity matrix, and the noise model. We can combine all of these to compose a comprehensive quantitative model of BOLD signals measured from a set of voxels during an experiment:

y = X\beta + \epsilon \\ = (D * h)\beta + \epsilon

This is referred to as the General Linear Model (GLM).

In a typical fMRI experiment the researcher controls the stimulus presentation D, and measures the evoked BOLD responses y from a set of voxels. The problem then is to estimate the selectivities of the voxels based on these measurments. Specifically, we want to determine the parameters \hat \beta that best explain the measured BOLD signals during our experiment. The most common way to do this is a method known as Ordinary Least Squares (OLS) Regression. Using OLS the idea is to adjust the values of \hat \beta such that the predicted model BOLD signals are as similar to the measured signals as possible. In other words, the goal is to infer the selectivity each voxel would have to exhibit in order to produce the measured BOLD signals. I showed in an earlier post that the optimal OLS solution for the selectivities \hat \beta is given by:

\hat \beta = (X^T X)^{-1} X^T y

Therefore, given a design matrix X and a set of voxel responses y associated with the design matrix, we can calculate the selectivities of voxels to the stimulus conditions represented by the columns of the design matrix. This works even when the BOLD signals are noisy. To get a better idea of this process at work let’s look at a quick example based on our toy fMRI experiment.

Example: Recovering Voxel Selectivity Using OLS

Here the goal is to recover the selectivities of the four voxels in our toy experiment they have been corrupted with noise. First, we add noise to the voxel responses. In this example the variance of the added noise is based on a concept known as signal-to-noise-ration or SNR.  As the name suggests, SNR is the ratio of the underlying signal to the noise “on top of” the signal. SNR is a very important concept when interpreting fMRI analyses. If a voxel exhibits a low SNR, it will be far more difficult to estimate its tuning. Though there are many ways to define SNR, in this example it is defined as the ratio of the maximum signal amplitude to the variance of the noise model. The underlying noise model variance is adjusted to be one-fifth of the maximum amplitude of the BOLD signal, i.e. an SNR of 5.  Feel free to try different values of SNR by changing the value of the variable \text{SNR} in the Matlab simulation. Noisy versions of the 4 model BOLD signals are plotted in the top subpanel of the figure below. We see that the noisy signals are very different from the actual underlying BOLD signals.

Noisy BOLD signals from 4 voxels (top) and GLM predictions (bottom) of the underlying BOLD signals

Here we estimate the selectivities \hat \beta from the GLM using OLS, and then predict the BOLD signals in our experiment with this estimate. We see in the bottom subpanel of the above figure that the resulting GLM predictions of are quite accurate. We also compare the estimated selectivity matrix \hat \beta to the actual selectivity matrix \beta below. We see that OLS is able to recover the selectivity of all the voxels.

Actual (top) and estimated (bottom) selectivity matrices.

% SIMULATE NOISY VOXELS & ESTIMATE TUNING
% (ASSUMING VARIABLES FROM ABOVE STILL IN WORKSPACE)

SNR = 5; % (APPROX.) SIGNAL-TO-NOISE RATIO
noiseSTD = max(y0(:))./SNR; % NOISE LEVEL FOR EACH VOXEL
noise = bsxfun(@times,randn(size(y0)),noiseSTD);
y = y0 + noise;

betaHat = inv(X'*X)*X'*y % OLS
yHat = X*betaHat; % GLM PREDICTION

figure
subplot(211);
plot(y,'Linewidth',3);
xlabel('Time (s)'); ylabel('BOLD Signal');
legend({'Visual Voxel','Auditory Voxel','Somato. Voxel','Unselective'});
title('Noisy Voxel Responses');

subplot(212)
h1 = plot(y0,'Linewidth',3); hold on
h2 = plot(yHat,'-o');
legend([h1(end),h2(end)],{'Actual Responses','Predicted Responses'})
xlabel('Time (s)'); ylabel('BOLD Signal');
title('Model Predictions')
set(gcf,'Position',[100 100 750 540])

figure
subplot(211);
imagesc(beta); colormap hot(5);
axis tight
ylabel('Condition')
set(gca,'YTickLabel',{'Visual','Auditory','Somato.'})
xlabel('Voxel');
set(gca,'XTick',1:4)
title('Actual Selectivity, \beta')

subplot(212)
imagesc(betaHat); colormap hot(5);
axis tight
ylabel('Condition')
set(gca,'YTickLabel',{'Visual','Auditory','Somato.'})
xlabel('Voxel');
set(gca,'XTick',1:4)
title('Noisy Estimated Selectivity')
drawnow

Wrapping Up

Here we introduced the GLM commonly used for fMRI data analyses and used the GLM framework to recover the selectivities of simulated voxels. We saw that the GLM is quite powerful of recovering the selectivity in the presence of noise. However, there are a few details left out of the story.

First, we assumed that we had an accurate (albeit exact) model for each voxel’s HRF. This is generally not the case. In real-world scenarios the HRF is either assumed to have some canonical shape, or the shape of the HRF is estimated the experiment data. Though assuming a canonical HRF shape has been validated for block design studies of peripheral sensory areas, this assumption becomes dangerous when using event-related designs, or when studying other areas of the brain.

Additionally, we did not include any  physiological noise signals in our theoretical voxels. In real voxels, the BOLD signal changes due to physiological processes such as breathing and heartbeat can be far larger than the signal change due to underlying neural activation. It then becomes necessary to either account for the nuisance signals in the GLM framework, or remove them before using the model described above. In two upcoming posts we’ll discuss these two issues: estimating the HRF shape from data, and dealing with nuisance signals.

fMRI In Neuroscience: The Basics

Magnetic Resonance Imaging (MRI) is a procedure used to essentially “look inside” of materials in a non-invasive manner. As the name suggests, the procedure forms images by measuring the differences in resonances of different materials while in the presence of a strong magnetic field (the details of the MRI procedure are pretty fascinating, but I will save them for another post). In the setting of neuroscience, MRI is often used to image the inside of the brain while it is performing basic functions like hearing a tone, or pressing a button. This flavor of MRI is appropriately referred to as Functional MRI, or fMRI.

The main concept behind fMRI is that as the neurons in the brain function they consume fuel (sugar) and oxygen, which is supplied by blood flow. The harder the neurons in one part of the brain work, the more blood and therefore the more oxygen flows to that part of the brain. The levels of oxygen in the blood will also vary proportionately to how quickly oxygen is being consumed by the active neurons; the more activity, the faster oxygen is consumed. It turns out the resonant frequency of a tissue will vary depending on the level of oxygen present in the tissue. fMRI essentially measures the differences in resonances of your brain tissue based on these functionally-dependent levels of blood oxygen. This is commonly referred to as the Blood-Oxygen-Level-Dependent (BOLD) signal.

The BOLD signal is not measured from individual neurons, but rather from small (on the order of 2-3 mm) cubic regions of the brain called voxels (imagine your brain being composed of hundreds of thousands of tiny Leggos). Each voxel contains hundreds of thousands of neurons, so the BOLD signal measured from a voxel is indicative of the group activity of the neurons located within that voxel. As the neurons in a voxel become active due to brain function, the BOLD signal in each of voxel will vary over time. The work of many a neuroscientist is to accurately characterize the BOLD signal change and to relate (i.e. correlate) it to brain function.

Characterizing the BOLD Signal — the Hemodynamic Response Function (HRF)

Because the BOLD signal is based on blood flow, it is delayed in time from the onset of neural activity due to the period it takes for blood to flow into (and out of) the voxel. The BOLD signal generally peaks 4 to 6 seconds after the onset of neural activity, after which it decreases back toward baseline, even undershooting baseline amplitude around 8 to 12 seconds after onset. This undershoot is believed to be caused by ongoing neuronal metobolism that overconsumes the initial supply of oxygen to the voxel to sub-baseline levels. If there is no addition neural activity the BOLD signal eventually (after approximately 20 seconds) returns to baseline levels. These signal dynamics are referred to as a voxel’s Hemodynamic Response Function, or HRF. A common quantiative model of the HRF is a sum of two Gamma distributions. One of the distributions models the initial peak of the BOLD signal, and another (inverted) distribution models the undershoot. An example of a model HRF is shown below.

Model Hemodynamic Response Function (HRF)

%% MODEL OF THE HRF
t = 0:.1:20;
hrfModel = gampdf(t,6) + -.5*gampdf(t,10);

% DISPLAY
figure
plot(t,hrfModel,'r','Linewidth',2);
hold on;
hb = plot(t,zeros(size(t)),'k--');
title('Model HRF');
xlabel('Time From Activity Onset (s)');
ylabel('BOLD Signal');
legend(hb,'Baseline')

It is important to note that the MRI scanner doesn’t necessarily measure the BOLD signal at such a high temporal resolution as indicated above. Because of limitations imposed by both hardware and software, the required period for an individual fMRI measurement–or TR, short for Repetition Time–is on the order of 1 to 2 seconds. (However recent advances in parallell imaging and multiband excitation technologies are vastly shortening the required TR of fMRI measurements.)

Relating the HRF to Brain Activity — The Finite Impulse Response (FIR) Model

The HRF provides a model for how the BOLD signal in a voxel will vary due to a short burst of neural activity. Therefore, a useful interpretation of BOLD signal dynamics comes from signal processing. If we treat the HRF as a filter h that operates over some finite length of time equal to the length of the HRF, then the measured BOLD signal y(t) evoked by a series of bursts in neural activity can be modeled as a convolution of an impulse train D(t) with the filter:

y(t) = D(t) * h

where D(t) equals one at each  point in time where a burst of neural activity  occurs, and zero otherwise. The (*) is the convolution operator (if you’re not familiar with convolution, it’s not terribly important here; just think of it as the operation of applying  a filter to some data). This is identical to the Finite Impulse Response (FIR) model used in signal processing. An example of the FIR model used to model the BOLD signal evoked by a sequence of three bursts of neural activity is shown below.

Demonstration of the FIR model of the BOLD signal

%% FIR MODEL OF BOLD RESPONSE

% HRF AS MEASURED BY MRI SCANNER
TR = 1;			% REPETITION TIME
t = 1:TR:20;	% MEASUREMENTS
h = gampdf(t,6) + -.5*gampdf(t,10);

% CREATE A STIMULUS IMPULSE TRAIN
% FOR THREE SEPARATE STIMULUS CONDITIONS
trPerStim = 30;			% # TR PER STIMULUS
nRepeat = 2;
nTRs = trPerStim*nRepeat + length(h);
impulseTrain0 = zeros(1,nTRs);

impulseTrainModel = impulseTrain0;
impulseTrainModel([1,24,28]) = 1;
boldModel = conv(impulseTrainModel,h);
boldModel(nTRs+1:end) = [];

% DISPLAY AN EXAMPLE OF FIR RESPONSE
figure
stem(impulseTrainModel*max(h),'k');
hold on;
plot(boldModel,'r','Linewidth',2); xlim([0,nTRs]);
title('HRF Convolved with Impulse Stimulus Train');
xlabel('Time (TRs)'); ylabel('BOLD Signal');
legend({'Activity Onset', 'Voxel Response'})
set(gcf,'Position',[100 100 750 380])

A useful property of the FIR model is that BOLD signals from overlapping HRFs sum linearly. This effect is displayed in the figure above. The BOLD signals evoked by the second and third impulses, which appear close to one another in time, combine linearly to form a larger and more complicated BOLD signal than that evoked by the first impulse alone. As we will see later, this linearity property makes it possible to determine the selectivity of neurons within a voxel even in the presence of rapidly-presented stimuli.

Block and Event-related fMRI Experiments

Using fMRI, neuroscientists design experiments that present stimuli having particular features that are hypothesized to be encoded by neurons in a particular region of interest in the brain. Given the evoked BOLD signal measured from a voxel during stimulus presentation, along with the HRF for the voxel, and the assumptions of FIR model framework, it is possible to calculate the degree to which the neurons in the voxel must be selective for each the stimulus features in order to produce the measured BOLD signals. We’ll delve more into how this selectivity is calculated in a later post, but for now let’s take look at two flavors of experiment designs.

The Block Design

One flavor of experimental design is to simply present a stimulus continuosly for a long period, followed by absence of stimuli for a long period. This is what is called a Block Design experiment. From the view of the FIR model, presenting the stimulus for a long period is equivalent to composing D(t) of many consecutive impulses. Because the signal from the HRF of each impulse are assumed to sum linearly, the BOLD signal evoked by the block of impulses will be much larger than the signal evoked by a single brief stimulus presentation. This is demonstrated in the simulation below:

%% SIMULATE A BLOCK-DESIGN EXPERIMENT
%% (ASSUMING VARIABLES FRON ABOVE ARE IN WORKSPACE)

% CREATE BLOCK STIMULUS TRAIN
blocks = repmat([ones(8,1);zeros(8,1)], round(nTRs-10)/16,1);
blockImpulseTrain = impulseTrain0;
blockImpulseTrain(1:numel(blocks)) = blocks;
boldBlock = conv(blockImpulseTrain,h);

% DISPLAY BOLD RESPONSES FROM BLOCK DESIGN
figure
stem(blockImpulseTrain*max(h),'k');
hold on;
plot(boldBlock(1:nTRs),'r','Linewidth',2); xlim([0,nTRs]);
title('Simulated Block Design');
xlabel('Time (TRs)'); ylabel('BOLD Signal');
legend({'Stimulus Train', 'Voxel Response'})
set(gcf,'Position',[100 100 750 380])

Simulation of BOLD signals evoked by Block Design experiment

Here the height of the stimulus onset indicators (in black) are scaled to be the maximum height of the HRF from a single impulse. We see that the peak bold signal from the block design is much larger. The block design is used when the number of features/stimuli that the experimenter would like to probe is small. In this scenario, the increased signal amplitude results in a much more sensitive measurement of selectivity for the probed stimulus features, but at the cost of an inefficient model design. Therefore many separate block design experiments will have to be run to test various hypotheses.

The Event-related Design

An alternative to the block design is to instead show many types of stimuli for short duations. This is what is known as an Event-related design. For instance, say are interested in how visual, auditory, and somatosensory information are encoded in the brain. We could run three separate block design experiments, one for each stimulus modality, or we could run a single event-related design experiment, where we intermittently present a subject a light, a tone, and ask them to press a button. If there exists a voxel that is involved in encoding each of these stimuli to an equal degree, it would be difficult to discover this relationship by running three separate block design experiments. A simulation of such an experiment and the responses from such a voxel is shown below.

Simulation of an Event-related experiment

%% SIMULATE AN EVENT-RELATED EXPERIMENT
%% (ASSUMING VARIABLES FRON ABOVE ARE IN WORKSPACE)

% VISUAL STIMULUS
impulseTrainLight = impulseTrain0;
impulseTrainLight(1:trPerStim:trPerStim*nRepeat) = 1;

% AUDITORY STIMULUS
impulseTrainTone = impulseTrain0;
impulseTrainTone(5:trPerStim:trPerStim*nRepeat) = 1;

% SOMATOSENSORY STIMULUS
impulseTrainPress = impulseTrain0;
impulseTrainPress(9:trPerStim:trPerStim*nRepeat) = 1;

% COMBINATION OF ALL STIMULI
impulseTrainAll = impulseTrainLight + impulseTrainTone + impulseTrainPress;

% SIMULATE BOLD SIGNAL EVOKED BY EACH CONDITION
boldLight = conv(impulseTrainLight,h);
boldTone = conv(impulseTrainTone,h);
boldPress = conv(impulseTrainPress,h);
boldAll = conv(impulseTrainAll,h);

% DISPLAY STIMULUS ONSETS FOR EACH CONDITION
figure
subplot(211)
hold on
stem(impulseTrainLight,'k');
stem(impulseTrainTone,'b');
stem(impulseTrainPress,'g');
xlim([0,nTRs]);
xlabel('Time (TRs)'); ylabel('BOLD Signal');
legend({'Light Stimulus Train', 'Tone Stimulus Train','Press Stimulus Train'})
title('Impulse Trains for 3 Different Stimuli');

% DISPLAY COMBINATION OF BOLD RESPONSES FROM EACH CONDITION
subplot(212)
hold on;
plot(boldLight(1:nTRs),'k');
plot(boldTone(1:nTRs),'b');
plot(boldPress(1:nTRs),'g');
plot(boldAll(1:nTRs),'r','Linewidth',2);
xlim([0,nTRs]);
xlabel('Time (TRs)'); ylabel('BOLD Signal');
legend({'Response to Light','Response to Tone','Response to Press','Total Response'});
title('Simulation of BOLD Signal from Overlapping Stimuli');
set(gcf,'Position',[100 100 750 540])

In this example notice how the responses evoked by each class of stimulus overlap, resulting in a complex BOLD signal. However, if each class of stimulus were shown separately, as in the case of the block design, the evoked BOLD signal would look very different, as indicated by the individual responses to each stimulus (black, blue and green responses, respectively). Event-related designs are powerful in that we can probe many features simultaneously and therefore uncover correlated selectivities/tuning. However, event-related designs require that we have a very accurate model for the HRF. This requirement is greatly relaxed for block design experiments (in fact many block-designs assume very simplistic HRF shapes such as a triangle or square waveform).

Wrapping Up

In this post we introduced some of the basic concepts in fMRI used for neuroscience research, including the BOLD signal, the HRF, as well as Block and Event-related experiment designs. However, the underlying story used to present these concepts has been dramatically simplified. Here we ignore the effects of physiological noise, which can be a debilatating factor in many fMRI analyses. Also missing are the details of how to calculate a voxel’s selectivity to a set of stimulus features given measured data, an HRF, and an experimental paradigm. This is where the General Linear Model (GLM) that predominates fMRI research comes in. We also assume that the proposed model HRF is an accurate characterization of a voxel’s BOLD response dynamics. Though this is often a reasonable assumption for peripheral sensory areas of the brain, it is often a very poor assumption for other areas of the brain. It is therefore often necessary to estimate the shape of the HRF from fMRI data. We’ll discuss all of these issues: physiological noise, tuning/selectivity estimation using the GLM, as well as HRF estimation in a following series of posts on fMRI methods in neuroscience.