March 2016

Volume 31 Number 3

[Test Run]

Neural Network Regression

By James McCaffrey

The goal of a regression problem is to predict the value of a numeric variable (usually called the dependent variable) based on the values of one or more predictor variables (the independent variables), which can be either numeric or categorical. For example, you might want to predict the annual income of a person based on age, sex (male or female) and years of education.

The simplest form of regression is called linear regression (LR). An LR prediction equation might look like this: income = 17.53 + (5.11 * age) + (-2.02 * male) + (-1.32 * female) + (6.09 * education). Although LR is useful for some problems, in many situations it’s not effective. But there are other common types of regression—polynomial regression, general linear model regression and neural network regression (NNR). Arguably, NN regression is the most powerful of these.
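
To make the equation concrete, suppose (hypothetically) that income is measured in thousands of dollars, age in years and education in years of schooling, and that for a male subject male = 1 and female = 0. Then the predicted income of a 30-year-old male with 16 years of education would be:

income = 17.53 + (5.11 * 30) + (-2.02 * 1) + (-1.32 * 0) + (6.09 * 16)
       = 17.53 + 153.30 - 2.02 + 0.00 + 97.44
       = 266.25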

The most common type of neural network (NN) is one that predicts a categorical variable. For example, you might want to predict a person’s political inclination (conservative, moderate, liberal) based on factors such as age, income and sex. An NN classifier has n output nodes, where n is the number of values that the dependent variable can take. The values of the n output nodes sum to 1.0 and can be loosely interpreted as probabilities. So, for predicting political inclination, an NN classifier would have three output nodes. If the output node values were (0.24, 0.61, 0.15), the NN classifier is predicting “moderate” because the middle node has the largest probability.

In NN regression, the NN has a single output node that holds the predicted value of the dependent numeric variable. So, for the example that predicts annual income, there would be three input nodes (one for age, one for sex where male = -1 and female = +1, and one for years of education), and one output node (annual income).

A good way to understand what NN regression is, and to see where this article is headed, is to take a look at the demo program in Figure 1. Rather than tackling a realistic problem, in order to keep the ideas of NN regression as clear as possible, the goal of the demo is to create an NN model that can predict the value of the sine function. In case your trigonometry knowledge is a bit rusty, the graph of the sine function is shown in Figure 2. The sine function accepts a single real input value from negative infinity to positive infinity and returns a value between -1.0 and +1.0. The sine function returns 0 when x = 0.0, x = pi (~3.14), x = 2 * pi, x = 3 * pi, and so on. The sine function is a surprisingly difficult function to model.

Figure 1 Neural Network Regression Demo

Figure 2 The Sin(x) Function

The demo starts by programmatically generating 80 data items to be used for training the NN model. Each of the 80 training items has a random x input value between 0 and 6.4 (a bit more than 2 * pi) and a corresponding y value, which is sin(x).

The demo creates a 1-12-1 NN, that is, an NN with one input node (for x), 12 hidden processing nodes (that effectively define the prediction equation), and one output node (the predicted sine of x). When working with NNs, there’s always experimentation involved; the number of hidden nodes was determined by trial and error.

NN classifiers have two activation functions, one for the hidden nodes and one for the output nodes. The output node activation function for a classifier is almost always the softmax function because softmax produces values that sum to 1.0. The hidden node activation function for a classifier is usually either the logistic sigmoid function or the hyperbolic tangent function (abbreviated tanh). In NN regression, by contrast, there’s a hidden node activation function but no output node activation function. The demo NN uses the tanh function for hidden node activation.

The output of an NN is determined by its input values and a set of constants called the weights and biases. Because biases are really just special kinds of weights, the term “weights” is sometimes used to refer to both. A neural network with i input nodes, j hidden nodes, and k output nodes has a total of (i * j) + j + (j * k) + k weights and biases. So the 1-12-1 demo NN has (1 * 12) + 12 + (12 * 1) + 1 = 37 weights and biases.

The process of determining the values of the weights and the biases is called training the model. The idea is to try different values of the weights and biases until the computed output values of the NN closely match the known correct output values of the training data.

There are several different algorithms that can be used to train an NN. By far the most common approach is to use the back-propagation algorithm. Back propagation is an iterative process in which the values of the weights and biases slowly change so that, in most cases, the NN computes increasingly accurate output values.

Back propagation uses two required parameters (maximum number of iterations and learning rate) and one optional parameter (the momentum rate). The maxEpochs parameter sets a limit on the number of algorithm iterations. The learnRate parameter controls how much the weights and bias values can change in each iteration. The momentum parameter speeds up training and also helps prevent the back-propagation algorithm from getting stuck at a poor solution. The demo sets the value of maxEpochs to 10,000, the value of learnRate to 0.005, and the value of momentum to 0.001. These values were determined by trial and error.

When using the back-propagation algorithm for NN training, there are three variations that can be used. In batch back propagation, all training items are examined first and then all the weights and bias values are adjusted. In stochastic back propagation (also called online back propagation), after each training item is examined, all weights and bias values are adjusted. In mini-batch back propagation, all weights and bias values are adjusted after examining a specified fraction of the training items. The demo program uses the most common variant, stochastic back propagation.
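
The Train method isn’t shown in its entirety in this article, but a simplified sketch of the stochastic approach, using hypothetical Shuffle and UpdateWeights helper methods, looks like this:

int epoch = 0;
int[] sequence = new int[trainData.Length];
for (int i = 0; i < sequence.Length; ++i)
  sequence[i] = i;
while (epoch < maxEpochs) {
  Shuffle(sequence); // Process training items in a random order
  for (int i = 0; i < trainData.Length; ++i) {
    int idx = sequence[i];
    double[] xValues = new double[] { trainData[idx][0] };
    double[] tValues = new double[] { trainData[idx][1] };
    ComputeOutputs(xValues); // Forward pass
    // Backward pass: adjust weights after each training item
    UpdateWeights(tValues, learnRate, momentum);
  }
  ++epoch;
}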

The demo program displays a measure of error every 1,000 training epochs. Notice that the error values jump around a bit. After training completed, the demo displayed the values of the 37 weights and biases that define the NN model. The values of NN weights and biases don’t have any obvious interpretation, but it’s important to examine the values to check for bad results, for example, when one weight has an extremely large value and all the other weights are close to zero.
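
The error measure isn’t defined in the code shown in this article. One common choice, sketched here as a hypothetical MeanSquaredError method that assumes the demo’s one-input data layout, is the mean squared difference between target and computed values:

// Hypothetical helper: average squared difference between
// the target y values and the NN's computed output values
private double MeanSquaredError(double[][] trainData)
{
  double sumSquaredError = 0.0;
  for (int i = 0; i < trainData.Length; ++i) {
    double[] xValues = new double[] { trainData[i][0] };
    double target = trainData[i][1];
    double[] computed = ComputeOutputs(xValues);
    double err = computed[0] - target;
    sumSquaredError += err * err;
  }
  return sumSquaredError / trainData.Length;
}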

The demo program concludes by evaluating the NN model. The NN’s predicted values of sin(x) for x = pi, pi / 2, and 3 * pi / 2 are all within 0.02 of the correct values. The predicted value for sin(6 * pi) is very far from the correct value. But this is an expected result because the NN was trained only to predict the values of sin(x) for x values between 0 and 2 * pi.

This article assumes you have at least intermediate-level programming skills, but doesn’t assume you know much about neural network regression. The demo program is coded using C#, but you shouldn’t have very much trouble refactoring the code to another language such as Visual Basic or Perl. The demo program is too long to present in its entirety in this article, but the complete source is available in the accompanying code download. All normal error checking was removed from the demo to keep the size of the code small and the key ideas as clear as possible.

Demo Program Structure

To create the demo program, I launched Visual Studio and selected the C# console application template from the File | New | Project menu action. I used Visual Studio 2015, but the demo has no significant .NET dependencies so any version of Visual Studio will work. I named the project NeuralRegression.

After the template code loaded into the Editor window, in the Solution Explorer window I selected file Program.cs, right-clicked on it, and renamed it to the somewhat more descriptive NeuralRegressionProgram.cs. I allowed Visual Studio to automatically rename class Program for me. At the top of the Editor code, I deleted all references to unused namespaces, leaving just the reference to the top-level System namespace.

The overall structure of the demo program, with a few minor edits to save space, is shown in Figure 3. All of the control statements are in the Main method. All of the neural network regression functionality is contained in a program-defined class named NeuralNetwork.

Figure 3 Neural Network Regression Program Structure

using System;
namespace NeuralRegression
{
  class NeuralRegressionProgram
  {
    static void Main(string[] args)
    {
      Console.WriteLine("Begin NN regression demo");
      Console.WriteLine("Goal is to predict sin(x)");
      // Create training data
      // Create neural network
      // Train neural network
      // Evaluate neural network
      Console.WriteLine("End demo");
      Console.ReadLine();
    }
    public static void ShowVector(double[] vector,
      int decimals, int lineLen, bool newLine) { . . }
    public static void ShowMatrix(double[][] matrix,
      int numRows, int decimals, bool indices) { . . }
  }
  public class NeuralNetwork
  {
    private int numInput; // Number of input nodes
    private int numHidden;
    private int numOutput;
    private double[] inputs; // Input nodes
    private double[] hiddens;
    private double[] outputs;
    private double[][] ihWeights; // Input-hidden
    private double[] hBiases;
    private double[][] hoWeights; // Hidden-output
    private double[] oBiases;
    private Random rnd;
    public NeuralNetwork(int numInput, int numHidden,
      int numOutput, int seed) { . . }
    // Misc. private helper methods
    public void SetWeights(double[] weights) { . . }
    public double[] GetWeights() { . . }
    public double[] ComputeOutputs(double[] xValues) { . . }
    public double[] Train(double[][] trainData,
      int maxEpochs, double learnRate,
      double momentum) { . . }
  } // class NeuralNetwork
} // ns

In the Main method, the training data is created by these statements:

int numItems = 80;
double[][] trainData = new double[numItems][];
Random rnd = new Random(1);
for (int i = 0; i < numItems; ++i) {
  double x = 6.4 * rnd.NextDouble();
  double sx = Math.Sin(x);
  trainData[i] = new double[] { x, sx };
}

As a general rule when dealing with neural networks, the more training data you have, the better. For modeling the sine function for x values between 0 and 2 * pi, I needed at least 80 items to get good results. The choice of a seed value of 1 for the random number object was arbitrary. The training data is stored in an array-of-arrays-style matrix. In realistic scenarios, you’d probably read training data from a text file.
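
For example, a hypothetical loader, assuming each line of a text file holds a comma-delimited x value and y value, could look like this:

// Hypothetical loader: each line of the file holds "x,y"
static double[][] LoadTrainData(string fileName)
{
  string[] lines = System.IO.File.ReadAllLines(fileName);
  double[][] result = new double[lines.Length][];
  for (int i = 0; i < lines.Length; ++i) {
    string[] tokens = lines[i].Split(',');
    result[i] = new double[] { double.Parse(tokens[0]),
      double.Parse(tokens[1]) };
  }
  return result;
}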

The neural network is created by these statements:

int numInput = 1;
int numHidden = 12;
int numOutput = 1;
int rndSeed = 0;
NeuralNetwork nn = new NeuralNetwork(numInput,
  numHidden, numOutput, rndSeed);

There’s only one input node because the target sine function accepts only a single value. For most neural network regression problems you’ll have several input nodes, one for each of the predictor (independent) variables. In most neural network regression problems there’s only a single output node, but it’s possible to predict two or more numeric values.

An NN needs a random object to initialize weight values and to scramble the order in which training items are processed. The demo NeuralNetwork constructor accepts a seed value for the internal random object. The value used, 0, was arbitrary.

The neural network is trained by these statements:

int maxEpochs = 10000;
double learnRate = 0.005;
double momentum  = 0.001;
double[] weights = nn.Train(trainData, maxEpochs,
  learnRate, momentum);
ShowVector(weights, 4, 8, true);

An NN is extremely sensitive to the training parameter values. Even a very small change can produce a dramatically different result.

The demo program evaluates the quality of the resulting NN model by predicting sin(x) for three standard values. The statements, with some minor edits, are:

double[] y = nn.ComputeOutputs(new double[] { Math.PI });
Console.WriteLine("Predicted = " + y[0]);
y = nn.ComputeOutputs(new double[] { Math.PI / 2 });
Console.WriteLine("Predicted = " + y[0]);
y = nn.ComputeOutputs(new double[] { 3 * Math.PI / 2.0 });
Console.WriteLine("Predicted = " + y[0]);

Notice that the demo NN stores its outputs in an array of output nodes, even though there’s just a single output value for this example. Returning an array allows you to predict multiple values without changing the source code.

The demo concludes by predicting sin(x) for an x value that’s well outside the range of the training data:

y = nn.ComputeOutputs(new double[] { 6 * Math.PI });
Console.WriteLine("Predicted = " + y[0]);
Console.WriteLine("End demo");

In most NN classifier scenarios, you call a method that calculates the classification accuracy, that is, the number of correct predictions divided by the total number of predictions. This is possible because a categorical output value is either correct or incorrect. But when working with NN regression, there’s no standard way to define accuracy. If you want to calculate some measure of accuracy, it will be problem-dependent. For example, for predicting sin(x) you could arbitrarily define a correct prediction as one that’s within 0.01 of the correct value.
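
A hypothetical Accuracy method along those lines, which counts a prediction as correct if it’s within a specified tolerance of the target value, could look like this:

// Hypothetical accuracy measure: the fraction of predictions
// that are within howClose of the target values
public double Accuracy(double[][] testData, double howClose)
{
  int numCorrect = 0;
  for (int i = 0; i < testData.Length; ++i) {
    double[] xValues = new double[] { testData[i][0] };
    double target = testData[i][1];
    double[] computed = ComputeOutputs(xValues);
    if (Math.Abs(computed[0] - target) <= howClose)
      ++numCorrect;
  }
  return (numCorrect * 1.0) / testData.Length;
}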

Computing Output Values

Most of the key differences between an NN designed for classification and one designed for regression occur in the methods that compute output and train the model. The definition of class NeuralNetwork method ComputeOutputs begins with:

public double[] ComputeOutputs(double[] xValues)
{
  double[] hSums = new double[numHidden];
  double[] oSums = new double[numOutput];
...

The method accepts an array that holds the values of the predictor (independent) variables. Local variables hSums and oSums are scratch arrays that hold preliminary (before activation) values of the hidden and output nodes. Next, the independent variable values are copied into the neural network’s input nodes:

for (int i = 0; i < numInput; ++i)
  this.inputs[i] = xValues[i];

Then the preliminary values of the hidden nodes are calculated by multiplying each input value by its corresponding input-to-hidden weight, and accumulating:

for (int j = 0; j < numHidden; ++j)
  for (int i = 0; i < numInput; ++i)
    hSums[j] += this.inputs[i] * this.ihWeights[i][j];

Next, the hidden node bias values are added:

for (int j = 0; j < numHidden; ++j)
  hSums[j] += this.hBiases[j];

The values of the hidden nodes are determined by applying the hidden node activation function to each preliminary sum:

for (int j = 0; j < numHidden; ++j)
  this.hiddens[j] = HyperTan(hSums[j]);
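
Method HyperTan isn’t shown in this article. A minimal sketch, which clamps large-magnitude inputs because tanh saturates at -1.0 and +1.0, could look like this:

// Sketch of a tanh activation helper; extreme inputs are
// clamped to the asymptotic values of the tanh function
private static double HyperTan(double x)
{
  if (x < -20.0) return -1.0;
  else if (x > 20.0) return 1.0;
  else return Math.Tanh(x);
}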

Next, the preliminary values of the output nodes are calculated by multiplying each hidden node value by its corresponding hidden-to-output weight, and accumulating:

for (int k = 0; k < numOutput; ++k)
  for (int j = 0; j < numHidden; ++j)
    oSums[k] += hiddens[j] * hoWeights[j][k];

Then the output node bias values are added:

for (int k = 0; k < numOutput; ++k)
  oSums[k] += oBiases[k];

Up to this point, computing output node values for a regression network is exactly the same as computing output node values for a classifier network. But in a classifier, the final output node values would be computed by applying the softmax activation function to each accumulated sum. For a regression network no activation function is applied. Therefore, method ComputeOutputs concludes by simply copying the values in the oSums scratch array directly to the output nodes:

...
  Array.Copy(oSums, this.outputs, outputs.Length);
  double[] retResult = new double[numOutput]; // Could define a GetOutputs
  Array.Copy(this.outputs, retResult, retResult.Length);
  return retResult;
}

For convenience, the values in the output nodes are also copied to a local return array so they can be easily accessed without calling a GetOutputs method of some sort.

When training an NN classifier using the back-propagation algorithm, the calculus derivatives of the two activation functions are used. For the hidden nodes the code looks like:

for (int j = 0; j < numHidden; ++j) {
  double sum = 0.0; // sums of output signals
  for (int k = 0; k < numOutput; ++k)
    sum += oSignals[k] * hoWeights[j][k];
  double derivative = (1 + hiddens[j]) * (1 - hiddens[j]);
  hSignals[j] = sum * derivative;
}

The value for the local variable named derivative is the calculus derivative of the tanh function and comes from quite complex theory. In an NN classifier, the calculation involving the derivative of the output node activation function is:

for (int k = 0; k < numOutput; ++k) {
  double derivative = (1 - outputs[k]) * outputs[k];
  oSignals[k] = (tValues[k] - outputs[k]) * derivative;
}

Here, the value for local variable derivative is the calculus derivative of the softmax function. However, because NN regression doesn’t use an activation function for output nodes, the code is:

for (int k = 0; k < numOutput; ++k) {
  double derivative = 1.0;
  oSignals[k] = (tValues[k] - outputs[k]) * derivative;
}

Of course, multiplying by 1.0 has no effect so you could simply drop the derivative term. Another way of thinking about this is that in NN regression, the output node activation function is the identity function f(x) = x. The calculus derivative of the identity function is the constant 1.0.

Wrapping Up

The demo code and explanation in this article should be enough to get you up and running if you want to explore neural network regression with one or more numeric predictor variables. If you have a predictor variable that’s categorical, you’ll need to encode the variable. For a categorical predictor variable that can take one of two possible values, such as sex (male, female), you’d encode one value as -1 and the other as +1.

For a categorical predictor variable that can take three or more possible values, you’d use what’s called 1-of-(N-1) encoding. For example, if a predictor variable is color that can take one of four possible values (red, blue, green, yellow), then red would be encoded as (1, 0, 0), blue as (0, 1, 0), green as (0, 0, 1), and yellow as (-1, -1, -1).
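
A hypothetical encoding helper for the four-value color example could look like this:

// Hypothetical 1-of-(N-1) encoder for a color predictor variable
static double[] EncodeColor(string color)
{
  if (color == "red") return new double[] { 1, 0, 0 };
  else if (color == "blue") return new double[] { 0, 1, 0 };
  else if (color == "green") return new double[] { 0, 0, 1 };
  else if (color == "yellow") return new double[] { -1, -1, -1 };
  else throw new Exception("Unknown color value");
}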


Dr. James McCaffrey works for Microsoft Research in Redmond, Wash. He has worked on several Microsoft products including Internet Explorer and Bing. Dr. McCaffrey can be reached at jammc@microsoft.com.

Thanks to the following Microsoft technical experts who reviewed this article: Gaz Iqbal and Umesh Madan