An artificial neural network can be thought of as a meta-function that accepts a fixed number of numeric inputs and produces a fixed number of numeric outputs. In most situations, a neural network has a layer of hidden neurons where each hidden neuron is fully connected to the input neurons and the output neurons. Associated with each individual hidden neuron and each individual output neuron are a set of weight values and a single so-called bias value. The weights and biases determine the output values for a given set of input values.

When neural networks are used to model a set of existing data so that predictions can be made on new data, the main challenge is to find the set of weight and bias values that generate the outputs that best match the existing data. The most common technique for estimating optimal neural network weights and biases is called back-propagation. Although there are many excellent references that describe the complicated mathematics that underlie back-propagation, there are very few guides available for programmers that clearly explain how to program the back-propagation algorithm. This article explains how to implement back-propagation. I use the C# language, but you should have no trouble refactoring the code presented here to other languages.

The best way to see where I’m headed is to take a look at the screenshot of a demo program in **Figure 1**. The demo program creates a neural network that has three input neurons, a hidden layer with four neurons, and two output neurons. Neural networks with a single hidden layer need two activation functions. In many situations, though, the two activation functions are the same, typically the sigmoid function. But in this demo, in order to illustrate the relationship between activation functions and back-propagation, I use different activation functions: the sigmoid function for the input-to-hidden computations, and the tanh (hyperbolic tangent) function for the hidden-to-output computations.

**Figure 1 Back-Propagation Algorithm in Action**

A fully connected 3-4-2 neural network requires 3*4 + 4*2 = 20 weight values and 4+2 = 6 bias values for a total of 26 weights and biases. These weights and biases are initialized to more or less arbitrary values. The three dummy input values are set to 1.0, 2.0 and 3.0. With the initial weight, bias and input values, the initial output values are computed by the neural network to be {0.7225, -0.8779}. The demo program arbitrarily assumes that the two correct output values are {-0.8500, 0.7500}. The goal of the back-propagation algorithm is to find a new set of weights and biases that generate outputs that are very close to the correct values for inputs {1.0, 2.0, 3.0}.

Back-propagation requires two free parameters. The learning rate, usually given the Greek letter eta in back-propagation literature, controls how fast the algorithm converges to a final estimate. The momentum, usually given the Greek letter alpha, helps the back-propagation algorithm avoid situations in which the algorithm oscillates and never converges to a final estimate. The demo program sets the learning rate to 0.90 and the momentum to 0.04. Typically these values are found by trial and error.

Finding the best set of weights and biases for a neural network is sometimes called training the network. Training with back-propagation is an iterative process. At each iteration, back-propagation computes a new set of neural network weight and bias values that in theory generate output values that are closer to the target values. After the first training iteration of the demo program, the back-propagation algorithm found new weight and bias values that generated new outputs of {-0.8932, -0.8006}. The new first output value of -0.8932 was much closer to the first target output value of -0.8500. The second new output value of -0.8006 was still far away from its target value of 0.7500.

The training process can be terminated in a variety of ways. The demo program iterates training until the sum of the absolute differences between output values and target values is <= 0.01 or the training reaches 1,000 iterations. In the demo, after six iterations of training, back-propagation found a set of neural network weight and bias values that generated outputs of {-0.8423, 0.7481}, which were very close to the {-0.8500, 0.7500} desired target values.
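The stopping criterion just described, the sum of the absolute differences between outputs and targets, can be sketched in a few lines. The class name ErrorDemo and the method name Error below are illustrative, not the demo's actual code:

```
using System;

class ErrorDemo
{
  // Sum of absolute differences between computed outputs and targets;
  // training stops when this drops to 0.01 or below.
  public static double Error(double[] tValues, double[] yValues)
  {
    double sum = 0.0;
    for (int i = 0; i < tValues.Length; ++i)
      sum += Math.Abs(tValues[i] - yValues[i]);
    return sum;
  }

  static void Main()
  {
    double[] targets = { -0.8500, 0.7500 };
    double[] outputs = { -0.8423, 0.7481 };  // demo outputs after six iterations
    Console.WriteLine(Error(targets, outputs) <= 0.01);  // True
  }
}
```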

This article assumes you have expert-level programming skills and that you have a very basic understanding of neural networks. (For basic information on neural networks, see my May 2012 article, “Dive into Neural Networks,” at msdn.microsoft.com/magazine/hh975375.) The code for the demo program shown in **Figure 1** is a bit too long to present in this article, so I’ll concentrate on explaining the key parts of the algorithm. The complete source code for the demo program is available at msdn.microsoft.com/magazine/msdnmag1012.

## Defining a Neural Network Class

Coding a neural network that uses back-propagation lends itself nicely to an object-oriented approach. The class definition used for the demo program is listed in **Figure 2**.

**Figure 2 Neural Network Class**

```
class NeuralNetwork
{
  private int numInput;
  private int numHidden;
  private int numOutput;

  // 15 input, output, weight, bias, and other arrays here

  public NeuralNetwork(int numInput,
    int numHidden, int numOutput) {...}
  public void UpdateWeights(double[] tValues,
    double eta, double alpha) {...}
  public void SetWeights(double[] weights) {...}
  public double[] GetWeights() {...}
  public double[] ComputeOutputs(double[] xValues) {...}

  private static double SigmoidFunction(double x)
  {
    if (x < -45.0) return 0.0;
    else if (x > 45.0) return 1.0;
    else return 1.0 / (1.0 + Math.Exp(-x));
  }

  private static double HyperTanFunction(double x)
  {
    if (x < -10.0) return -1.0;
    else if (x > 10.0) return 1.0;
    else return Math.Tanh(x);
  }
}
```

Member fields numInput, numHidden and numOutput are the defining characteristics of the neural network architecture. In addition to a simple constructor, the class has four publicly accessible methods and two helper methods. Method UpdateWeights contains all the logic of the back-propagation algorithm. Method SetWeights accepts an array of weights and biases and copies those values sequentially into member arrays. Method GetWeights performs the reverse operation by copying the weights and biases into a single array and returning that array. Method ComputeOutputs determines the neural network output values using the current input, weight and bias values.
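To make the sequential copy concrete, here is a sketch of the kind of unrolling a SetWeights-style method performs for a 3-4-2 network. The ordering shown (input-to-hidden weights, then hidden biases, then hidden-to-output weights, then output biases) is an assumption for illustration; the demo's actual ordering may differ:

```
using System;

class SetWeightsDemo
{
  static void Main()
  {
    int numInput = 3, numHidden = 4, numOutput = 2;
    double[] weights = new double[26];  // 26 = 3*4 + 4 + 4*2 + 2
    for (int i = 0; i < weights.Length; ++i)
      weights[i] = 0.01 * (i + 1);      // dummy values

    double[][] ihWeights = new double[numInput][];
    for (int i = 0; i < numInput; ++i)
      ihWeights[i] = new double[numHidden];
    double[] ihBiases = new double[numHidden];
    double[][] hoWeights = new double[numHidden][];
    for (int i = 0; i < numHidden; ++i)
      hoWeights[i] = new double[numOutput];
    double[] hoBiases = new double[numOutput];

    int k = 0;  // pointer into the flat weights array
    for (int i = 0; i < numInput; ++i)
      for (int j = 0; j < numHidden; ++j)
        ihWeights[i][j] = weights[k++];
    for (int i = 0; i < numHidden; ++i)
      ihBiases[i] = weights[k++];
    for (int i = 0; i < numHidden; ++i)
      for (int j = 0; j < numOutput; ++j)
        hoWeights[i][j] = weights[k++];
    for (int i = 0; i < numOutput; ++i)
      hoBiases[i] = weights[k++];

    Console.WriteLine(k);  // 26 -- every flat value was consumed
  }
}
```

GetWeights simply runs the same loops in reverse, reading from the member arrays into a fresh flat array.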

Method SigmoidFunction is used as the input-to-hidden activation function. It accepts a real value (type double in C#) and returns a value between 0.0 and 1.0. Method HyperTanFunction also accepts a real value but returns a value between -1.0 and +1.0. The C# language has a built-in hyperbolic tangent function, Math.Tanh, but if you’re using a language that doesn’t have a native tanh function, you’ll have to code one from scratch.
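If you do need to code tanh from scratch, one common approach uses the identity tanh(x) = (e^(2x) - 1) / (e^(2x) + 1). This sketch is mine, not part of the demo, and reuses the same saturation guards as the class's HyperTanFunction:

```
using System;

class TanhDemo
{
  // tanh(x) = (e^(2x) - 1) / (e^(2x) + 1), with guards so that
  // extreme arguments don't overflow Math.Exp.
  public static double HyperTan(double x)
  {
    if (x < -10.0) return -1.0;
    else if (x > 10.0) return 1.0;
    double e2x = Math.Exp(2.0 * x);
    return (e2x - 1.0) / (e2x + 1.0);
  }

  static void Main()
  {
    Console.WriteLine(Math.Abs(HyperTan(0.5) - Math.Tanh(0.5)) < 1e-12);  // True
  }
}
```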

## Setting up the Arrays

One of the keys to successfully programming a neural network back-propagation algorithm is to fully understand the arrays that are being used to store weight and bias values, store different kinds of input and output values, store values from a previous iteration of the algorithm, and store scratch calculations. The large diagram in **Figure 3** contains all the information you need to know to understand how to program back-propagation. Your initial reaction to **Figure 3** is likely to be something along the lines of, “Forget it—this is too complicated.” Hang in there. Back-propagation is not trivial, but once you understand the diagram you’ll be able to implement back-propagation using any programming language.

**Figure 3 The Back-Propagation Algorithm**

**Figure 3** has primary inputs and outputs at the edges of the figure, but also several local input and output values that occur in the interior of the diagram. You should not underestimate the difficulty of coding a neural network and the need to keep the names and meanings of all these inputs and outputs clear. Based on my experience, a diagram like the one in **Figure 3** is absolutely essential.

The first five of the 15 arrays used in the neural network definition outlined in **Figure 2** deal with the input-to-hidden layers and are:

```
public class NeuralNetwork
{
  // Declare numInput, numHidden, numOutput
  private double[] inputs;
  private double[][] ihWeights;
  private double[] ihSums;
  private double[] ihBiases;
  private double[] ihOutputs;
  ...
```

The first array, named inputs, holds the numeric input values. These values typically come directly from some normalized data source such as a text file. The NeuralNetwork constructor instantiates inputs as:

`this.inputs = new double[numInput];`

Array ihWeights (input-to-hidden weights) is a virtual two-dimensional array implemented as an array of arrays. The first index indicates the input neuron and the second index indicates the hidden neuron. The array is instantiated by the constructor as:

`this.ihWeights = Helpers.MakeMatrix(numInput, numHidden);`

Here, Helpers is a utility class of static methods that help simplify the neural network class:

```
public static double[][] MakeMatrix(int rows, int cols)
{
  double[][] result = new double[rows][];
  for (int i = 0; i < rows; ++i)
    result[i] = new double[cols];
  return result;
}
```

Array ihSums is a scratch array that’s used to hold an intermediate calculation in the ComputeOutputs method. The array holds values that will become the local inputs for the hidden neurons and is instantiated as:

`this.ihSums = new double[numHidden];`

Array ihBiases holds the bias values for the hidden neurons. Neural network weight values are constants that are applied by multiplying them with a local input value. Bias values are added to an intermediate sum to produce a local output value, which becomes the local input to the next layer. Array ihBiases is instantiated as:

`this.ihBiases = new double[numHidden];`

Array ihOutputs holds the values that are emitted from the hidden-layer neurons (which become the inputs to the output layer).

The next four arrays in the NeuralNetwork class hold values related to the hidden-to-output layer:

```
private double[][] hoWeights;
private double[] hoSums;
private double[] hoBiases;
private double[] outputs;
```

These four arrays are instantiated in the constructor as:

```
this.hoWeights = Helpers.MakeMatrix(numHidden, numOutput);
this.hoSums = new double[numOutput];
this.hoBiases = new double[numOutput];
this.outputs = new double[numOutput];
```

The neural network class has six arrays that are directly related to the back-propagation algorithm. The first two arrays hold values called the gradients for the output- and hidden-layer neurons. A gradient is a value that indirectly describes how far off, and in what direction (positive or negative), local outputs are relative to the target outputs. Gradient values are used to compute delta values, which are added to current weight and bias values to produce new, better weights and biases. There’s one gradient value for each hidden-layer neuron and each output-layer neuron. The arrays are declared as:

```
private double[] oGrads; // Output gradients
private double[] hGrads; // Hidden gradients
```

The arrays are instantiated in the constructor as:

```
this.oGrads = new double[numOutput];
this.hGrads = new double[numHidden];
```

The final four arrays in class NeuralNetwork hold the deltas (not gradients) from the previous iteration of the training loop. These previous deltas are required if you use the momentum mechanism to prevent back-propagation non-convergence. I consider momentum essential, but if you decide not to implement momentum you can omit these arrays. They are declared as:

```
private double[][] ihPrevWeightsDelta; // For momentum
private double[] ihPrevBiasesDelta;
private double[][] hoPrevWeightsDelta;
private double[] hoPrevBiasesDelta;
```

These arrays are instantiated as:

```
ihPrevWeightsDelta = Helpers.MakeMatrix(numInput, numHidden);
ihPrevBiasesDelta = new double[numHidden];
hoPrevWeightsDelta = Helpers.MakeMatrix(numHidden, numOutput);
hoPrevBiasesDelta = new double[numOutput];
```

## Computing Outputs

Each iteration in the training loop shown in **Figure 1** has two parts. In the first part, outputs are computed using the current primary inputs, weights and biases. In the second part, back-propagation is used to modify the weights and biases. The diagram in **Figure 3** illustrates both parts of the training process.

Working from left to right, inputs x0, x1 and x2 are assigned values of 1.0, 2.0 and 3.0. These primary input values go into the input-layer neurons and are emitted without modification. Although input-layer neurons can modify their input, such as normalizing the values to be within a certain range, in most cases such processing is done externally. Because of this, neural network diagrams often use rectangles or square boxes for the input neurons to indicate they aren’t processing neurons in the same sense that the hidden-layer and output-layer neurons are. Additionally, this affects the terminology used. In some cases, the neural network shown in **Figure****3** would be called a three-layer network, but because the input layer doesn’t perform processing, the neural network shown is sometimes called a two-layer network.

Next, each of the hidden-layer neurons computes a local input and a local output. For example, the bottommost hidden neuron, with index [3], computes its scratch sum as (1.0)(0.4)+(2.0)(0.8)+(3.0)(1.2) = 5.6. The scratch sum is the sum of the products of each input value and its associated input-to-hidden weight. The values above each arrow are the weights. Next, the bias value, -7.0, is added to the scratch sum to yield a local input value of 5.6 + (-7.0) = -1.40. Then the input-to-hidden activation function is applied to this intermediate input value to yield the local output value of the neuron. In this case, the activation function is the sigmoid function, so the local output is 1 / (1 + exp(-(-1.40))) = 0.20.

The output-layer neurons compute their input and output similarly. For example, in **Figure 3**, the bottommost output-layer neuron, with index [1], computes its scratch sum as (0.86)(1.4)+(0.17)(1.6)+(0.98)(1.8)+(0.20)(2.0) = 3.64. The associated bias is added to give the local input: 3.64 + (-5.0) = -1.36. And the activation function is applied to give the primary output: tanh(-1.36) = -0.88. If you examine the code for ComputeOutputs, you'll see that the method computes outputs exactly as I've just described.
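The two worked computations above can be reproduced in a few lines. The weight and bias values are the ones read from **Figure 3**; the class name and code organization here are illustrative, not the demo's ComputeOutputs:

```
using System;

class ForwardPassDemo
{
  public static double Sigmoid(double x) => 1.0 / (1.0 + Math.Exp(-x));

  static void Main()
  {
    double[] inputs = { 1.0, 2.0, 3.0 };

    // Hidden neuron [3]: weights 0.4, 0.8, 1.2 and bias -7.0
    double[] ihW3 = { 0.4, 0.8, 1.2 };
    double sum = 0.0;
    for (int i = 0; i < inputs.Length; ++i)
      sum += inputs[i] * ihW3[i];          // scratch sum = 5.6
    double hOut3 = Sigmoid(sum + (-7.0));  // sigmoid(-1.40), about 0.20

    // Output neuron [1]: weights 1.4, 1.6, 1.8, 2.0 and bias -5.0
    double[] hOutputs = { 0.86, 0.17, 0.98, hOut3 };
    double[] hoW1 = { 1.4, 1.6, 1.8, 2.0 };
    double oSum = 0.0;
    for (int j = 0; j < hOutputs.Length; ++j)
      oSum += hOutputs[j] * hoW1[j];
    double output1 = Math.Tanh(oSum + (-5.0));  // about -0.88

    Console.WriteLine(hOut3.ToString("F2"));    // 0.20
    Console.WriteLine(output1.ToString("F2"));  // -0.88
  }
}
```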

## Back-Propagation

Although the math behind the theory of back-propagation is fairly complicated, once you know what those math results are, implementing back-propagation is not too difficult. Back-propagation starts by working from right to left in the diagram shown in **Figure 3**. The first step is to compute the gradient values for each output-layer neuron. Recall the gradient is a value that has information regarding the magnitude and direction of an error. The gradients for the output-layer neurons are computed differently from the gradients for the hidden-layer neurons.

The gradient of an output-layer neuron is equal to the target (desired) value minus the computed output value, times the calculus derivative of the output-layer activation function evaluated at the computed output value. For example, the gradient value of the bottommost output-layer neuron in **Figure 3**, with index [1], is computed as:

(0.75 – (-0.88)) * (1 – (-0.88)) * (1 + (-0.88)) = 0.37

The 0.75 is the desired value. The -0.88 is the computed output value from the forward-pass computation. Recall that in this example the output-layer activation function is the tanh function. The calculus derivative of tanh(x) is (1 - tanh(x)) * (1 + tanh(x)). The math analysis is a bit tricky but, ultimately, computing the gradient of an output-layer neuron is given by the formula described here.
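A minimal sketch of the output-layer gradient computation just described, using the example's numbers:

```
using System;

class OutputGradDemo
{
  static void Main()
  {
    double target = 0.75;   // desired output value
    double output = -0.88;  // computed output from the forward pass
    // (target - output) times the tanh derivative evaluated at the output
    double oGrad = (target - output) * (1 - output) * (1 + output);
    Console.WriteLine(oGrad.ToString("F2"));  // 0.37
  }
}
```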

The gradient of a hidden-layer neuron is equal to the calculus derivative of the hidden-layer activation function evaluated at the local output of the neuron, times the sum of the products of the output-layer gradients and their associated hidden-to-output weights. For example, in **Figure 3**, the gradient of the bottommost hidden-layer neuron, with index [3], is:

(0.20)(1 – 0.20) * [ (-0.76)(1.9) + (0.37)(2.0) ] = -0.11

If we call the sigmoid function g(x), it turns out that the calculus derivative of the sigmoid function is g(x) * (1 - g(x)). Recall that this example uses the sigmoid function for the input-to-hidden activation function. Here the 0.20 is the local output from the neuron. The -0.76 and 0.37 are the gradients of the output-layer neurons, and the 1.9 and 2.0 are the hidden-to-output weights associated with the two output-layer gradients.
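The hidden-layer gradient formula can be checked the same way; note that this gradient value of about -0.11 is the one used in the delta computations that follow:

```
using System;

class HiddenGradDemo
{
  static void Main()
  {
    double hOutput = 0.20;              // local output of hidden neuron [3]
    double[] oGrads = { -0.76, 0.37 };  // output-layer gradients
    double[] hoWeights = { 1.9, 2.0 };  // weights from hidden [3] to each output

    double sum = 0.0;
    for (int k = 0; k < oGrads.Length; ++k)
      sum += oGrads[k] * hoWeights[k];
    // sigmoid derivative at the local output, times the weighted gradient sum
    double hGrad = hOutput * (1 - hOutput) * sum;
    Console.WriteLine(hGrad.ToString("F2"));  // -0.11
  }
}
```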

## Computing the Weight and Bias Deltas

After all the output-layer gradients and hidden-layer gradients have been computed, the next step in the back-propagation algorithm is to use the gradient values to compute delta values for each weight and bias value. Unlike the gradients, which must be computed right to left, the delta values can be computed in any order. The delta value for any weight or bias is equal to eta times the gradient associated with the weight or bias, times the input value associated with the weight or bias. For example, the delta value for the input-to-hidden weight from input neuron [2] to hidden neuron [3] is:

delta i-h weight[2][3] = eta * hidden gradient[3] * input[2]

= 0.90 * (-0.11) * 3.0

= -0.297

The 0.90 is eta, which controls how fast back-propagation learns. Larger values of eta produce larger changes in delta, with the risk of overshooting a good answer. The -0.11 value is the gradient for hidden neuron [3]. The 3.0 value is the input value for input neuron [2]. In terms of the diagram in **Figure 3**, if a weight is represented as an arrow from one neuron to another, then to compute the delta for a particular weight you use the gradient value of the neuron the arrow points to (on the right) and the input value of the neuron the arrow points from (on the left).

When computing the deltas for bias values, notice that because bias values are simply added to an intermediate sum, they have no associated input value. So, to compute the delta for a bias value you can either omit the input value term altogether, or use a dummy 1.0 value as a form of documentation. For example, in **Figure 3**, the bottommost hidden-layer bias has value -7.0. The delta for that bias value is:

0.90 * gradient of neuron pointed to * 1.0

= 0.90 * (-0.11) * 1.0

= -0.099

## Adding a Momentum Term

After all weight and bias delta values have been computed, it's possible to update each weight and bias by simply adding the associated delta value. However, experience with neural networks has shown that with certain data sets, the back-propagation algorithm can oscillate, repeatedly overshooting and then undershooting the target value and never converging to a final set of weight and bias estimates. One technique for reducing this tendency is to add to each new weight and bias an additional term called momentum. The momentum for a weight (or bias) is just some small value (like 0.04 in the demo program) times the value of the previous delta for the weight. Using momentum adds a small amount of complexity to the back-propagation algorithm because the values of previous deltas must be stored. The math behind why this technique prevents oscillation is subtle, but the result is simple.
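Putting the pieces together, updating a single weight with momentum looks like this. The gradient (-0.11), input value (3.0) and starting weight (1.2, the weight from input [2] to hidden [3] in the running example) come from the discussion above; the previous delta of 0.05 is a made-up value for illustration:

```
using System;

class MomentumDemo
{
  static void Main()
  {
    double eta = 0.90;       // learning rate
    double alpha = 0.04;     // momentum
    double hGrad = -0.11;    // gradient for hidden neuron [3]
    double input = 3.0;      // value from input neuron [2]
    double weight = 1.2;     // current i-h weight from input [2] to hidden [3]
    double prevDelta = 0.05; // hypothetical delta from the previous iteration

    double delta = eta * hGrad * input;  // -0.297, as computed earlier
    weight += delta;                     // plain back-propagation step
    weight += alpha * prevDelta;         // momentum term
    prevDelta = delta;                   // save for the next iteration

    Console.WriteLine(weight.ToString("F3"));  // 0.905
  }
}
```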

To summarize, to update a weight (or bias) using back-propagation, the first step is to compute gradients for all output-layer neurons. The second step is to compute gradients for all hidden-layer neurons. The third step is to compute deltas for all weights using eta, the learning rate. The fourth step is to add the deltas to each weight. The fifth step is to add a momentum term to each weight.

## Coding with Visual Studio 2012

The explanation of back-propagation presented in this article, together with the sample code, should give you enough information to understand and code the back-propagation algorithm. Back-propagation is just one of several techniques that can be used to estimate the best weight and bias values for a data set. Compared to alternatives such as particle swarm optimization and evolutionary optimization algorithms, back-propagation tends to be faster. But back-propagation does have disadvantages. It can’t be used with neural networks that use non-differentiable activation functions. Determining good values for the learning rate and momentum parameters is more art than science and can be time-consuming.

There are several important topics that this article does not address, in particular how to deal with multiple target data items. I’ll explain this concept and other neural network techniques in future articles.

When I coded the demo program for this article, I used the beta version of Visual Studio 2012. Even though many of the new features in Visual Studio 2012 are related to Windows 8 apps, I wanted to see how Visual Studio 2012 handled good old console applications. I was pleasantly surprised that I wasn’t unpleasantly surprised by any of the new features in Visual Studio 2012. My transition to Visual Studio 2012 was effortless. Although I didn’t make use of the new Async feature in Visual Studio 2012, it could have been useful when computing all the delta values for each weight and bias. I tried out the new Call Hierarchy feature and found it useful and intuitive. My initial impressions of Visual Studio 2012 were favorable, and I plan to transition to it as soon as I’m able.

**Dr. James McCaffrey** *works for Volt Information Sciences Inc., where he manages technical training for software engineers working at the Microsoft Redmond, Wash., campus. He has worked on several Microsoft products including Internet Explorer and MSN Search. He’s the author of “.NET Test Automation Recipes” (Apress, 2006), and can be reached at jammc@microsoft.com.*

Thanks to the following technical expert for reviewing this article: Dan Liebling

## Comments

Shashi Sathyanarayana: Wednesday, July 23, 2014 10:36 AM

Thank you for making this resource available. A few people that I know programmed their first neural networks using your article.

To those who might say “Forget it—this is too complicated,” I would advise them to revisit your article after reading my tutorial, "A Gentle Introduction to Backpropagation" available at <http://numericinsight.blogspot.com/2014/07/a-gentle-introduction-to-backpropagation.html>.

Maxim Schwartz: Monday, May 26, 2014 12:26 PM

I implemented my own version of a neural network that supports either one or two hidden layers using this article as a guide. I tested the implementation on a two-input XOR function, four-input XOR function, and some other data. The implementation described by the author does work. Here are some additional tips for those who want to try this.

In my implementation, a single optimization run goes through all training cases. In order to handle multiple training cases, I use a variable called "cumulative_delta" for each bias and weight. For each training case, I add the bias or weight delta to this variable. After going through all training cases, I divide each of these cumulative_delta variables by the total number of training cases to get the average deltas. And these are the deltas I apply at the end of a single optimization run to the weights and biases. Optimization runs keep getting executed until a limit like 10000 is reached or the cumulative distance to targets (outputs) falls below some threshold, as described in the article.

One important thing I discovered is that the neural network is extremely sensitive to the initial values for weights and biases! Like most gradient descent schemes, it seems to have a strong tendency to get stuck in local minima. Do randomize the initial weights and biases. Assigning the same starting value to all weights and biases seems to make it unlikely that the NN will find a decent solution. For a particular hidden node configuration (number of layer 1 and layer 2 nodes), I actually try at least 20 different initial random values.

jmccaffrey: Friday, January 17, 2014 4:11 PM

(from the author to ValentinH) Yes, you are correct on both typos. Thank you for pointing them out.

jmccaffrey: Friday, January 17, 2014 4:10 PM

(from the author) Here is a link to a bigger picture of the back-propagation calculations: http://www.quaetrix.com/BackProp/BackPropCalculations.jpg

AristosQueue: Monday, December 30, 2013 11:29 AM

Thank you for this information. Do you have a higher resolution image of Figure 3?

ValentinH: Tuesday, December 24, 2013 11:27 PM

Another error when calculating the bias delta for hidden-neuron [3]:

0.90 * gradient of neuron pointed to * 1.0

= 0.90 * (-0.11) * 1.0

= 0.099

Shouldn't it be negative?

ValentinH: Tuesday, December 24, 2013 11:18 PM

Hello, there is an error in one of the calculations within the article.

The gradient of the hidden-layer neuron[3]:

(0.20)(1 – 0.20) * [ (-0.76)(1.9) + (0.37)(2.0) ] = -0.03 (incorrect)

The same calculation is shown in the diagram with the correct result.

Demaunt: Wednesday, March 6, 2013 7:34 AM

Hello.

Activation functions for hidden and output layer are different.

Why didnt u use same act functions for layers?

Sigmoid always give positive values. (sigmoid (any negative value) < 0.5; sigmoid(0)=0.5; sigmoid(any positive value)>0.5)

So any input value will make hidden neurons produce positive value.

Then you give hidden layer outputs as inputs for output layers. Output layers use tanh act function, but output layers will take only positive inputs, and this will cause only positive values from output neurons.

Or u used different act functions, to demonstrate gradient calculations?

Anyway great article "thumbs up"

Perry-: Tuesday, February 5, 2013 6:53 AM

Excellent article...

I understand basic of BP, but discussed solution uses only one training input... what about more of them ?

gregthatcher: Friday, December 28, 2012 3:51 PM

These articles are awesome! Please keep up the great work.

jmccaffrey: Friday, December 21, 2012 10:33 AM

(from the author)

My thanks to reader Reto Matter who pointed out a minor mistake in the source code. The previous deltas and biases for the input-to-hidden layers are not saved. Correction (add two lines of code) is:

```
// 3. Update input-to-hidden weights
for (int i = 0; i < ihWeights.Length; ++i)
{
  for (int j = 0; j < ihWeights[0].Length; ++j)
  {
    double delta = eta * hGrads[j] * inputs[i];
    ihWeights[i][j] += delta;
    ihWeights[i][j] += alpha * ihPrevWeightsDelta[i][j];
    ihPrevWeightsDelta[i][j] = delta; // add this line
  }
}

// 3b. Update input-to-hidden biases
for (int i = 0; i < ihBiases.Length; ++i)
{
  double delta = eta * hGrads[i] * 1.0;
  ihBiases[i] += delta;
  ihBiases[i] += alpha * ihPrevBiasesDelta[i];
  ihPrevBiasesDelta[i] = delta; // add this line
}
```

ITGuy327: Friday, October 19, 2012 12:33 PM

Very well written. Not to sound obtuse, but if I want to model energy prices or commodities, could this type of program be of use?

I'm very new to learning about Neural Networks. Thank you

Dodle: Sunday, October 14, 2012 10:26 AM

Good article!

One question though, I extended the program to handle timeslots in a time series and tried to forecast the sunspots based on yearly data from http://sidc.oma.be/sunspot-data/

The result is very poor though. I slot the data as 10y input and 3y output (30 hidden), and then slide forward by one year and redo the training. I achieve an error rate at best of 0.5 after 1000 iterations. I use this when forecasting as well, so I enter the last ten years and then forecast the coming 3 years (out of sample) based on that.

One reason for why the result could be so poor are the weights. The first 300 values are all around negative 1400, the following values are between negative 1.5 and 1. I assume this is not how it should be, any suggestions?

Dennis Thorsen: Monday, October 8, 2012 11:22 AM

Fantastic article

CW Jones: Tuesday, October 2, 2012 4:35 PM

Thank you so much for sharing your expertise!!!