March 2015

Volume 30 Number 3


Test Run - Gradient Descent Training Using C#

By James McCaffrey

My informal definition of machine learning (ML) is a system that uses data to make predictions. Anyone who starts investigating ML quickly encounters the somewhat mysterious phrase “gradient descent.” In this article, I’ll explain what gradient descent is and demonstrate how to use it to train a logistic regression classification system.

To get an idea of where this article is headed, take a look at the demo program in Figure 1. The demo begins by generating 10,000 synthetic data items. Using artificial data rather than real data is often useful when investigating ML because you can control the characteristics of the data. Each data item has eight predictor variable values (often called features in ML terminology) followed by a single dependent variable, which can be either 0 or 1. The data was generated by using eight random weight values (-7.78, -0.65, ... -7.97) plus an additional constant (-5.02).

Figure 1 Training a Logistic Regression Classifier Using Gradient Descent

You can imagine that the synthetic data corresponds to a problem where you’re trying to predict the sex (male = 0, female = 1) of a person based on eight features such as age, annual income, credit score, and so on, where the feature values have all been scaled so they fall between -10.0 and +10.0.

After generating the 10,000 data items, the demo randomly splits that data into an 8,000-item set to be used to train a classifier, and a 2,000-item set to be used to estimate the predictive accuracy of the resulting model. Next, the demo creates a logistic regression binary classifier and then prepares for gradient descent training by setting values for variables maxEpochs (1,000) and learning rate (0.01). Gradient descent is an iterative process and variable maxEpochs sets a limit on the number of iterations. I’ll explain the learning rate parameter later, but for now you can think of the learning rate as a value that controls how much change occurs in the logistic regression classifier model in each training iteration.

The demo trains the classifier and displays the error of the model on the training data every 100 iterations. Gradient descent can be used in two different ways to train a logistic regression classifier. The first, more common, approach is called “stochastic” or “online” or “incremental.” (ML vocabulary is chaotic.) The second approach is called “batch” or “offline.” I’ll describe both approaches later, but the demo program uses the stochastic gradient descent training approach.

After training completes, the demo displays the best weight values found (-9.84, -14.88, ... -15.09). Notice the model weights are all about twice as large as the weights used to generate the random data. The demo program computes the accuracy of the resulting model on the training data (99.88 percent, or 7,990 out of 8,000 correct) and the accuracy on the test data (99.80 percent, or 1,996 out of 2,000 correct).

Logistic Regression Classification

Logistic regression classification is best explained using a concrete example. Suppose you want to predict the sex of a person (male = 0, female = 1) based on age (x1), annual income (x2) and education level (x3). If Y is the predicted value, a logistic regression model for this problem would take the form:

Z = b0 + b1(x1) + b2(x2) + b3(x3)
Y = 1.0 / (1.0 + e^-Z)

Here, b0, b1, b2 and b3 are weights, which are just numeric values that must be determined. In words, you compute an intermediate value Z that is the sum of the input values times their associated b-weights, plus the b0 constant, and then pass Z to the equation that uses the math constant e. The equation is called the logistic sigmoid function. Notice that each input variable (xi) has an associated weight (bi), and that there’s an additional weight (b0) not associated with any input.

It turns out Y will always be between 0 and 1. If Y is less than 0.5 (closer to 0), you conclude the predicted output is 0 and if Y is greater than 0.5, you conclude the output is 1. If there are n features, there will be n+1 b-weights. Not all data can be modeled using logistic regression, but because it’s one of the simplest classification techniques, logistic regression is a good place to start.

Suppose a person has a scaled age of x1 = +0.80 (older than average), annual income of x2 = -0.50 (slightly less than average) and education level x3 = -1.00 (less than average). And suppose that b0 = 3.0, b1 = -2.0, b2 = 2.0 and b3 = 1.5. Then Z = 3.0 + (-2.0)(0.80) + (2.0)(-0.50) + (1.5)(-1.00) = -1.10 and so Y = 1.0 / (1.0 + e^-(-1.10)) = 0.25. Because Y is closer to 0 (less than 0.5) than to 1, you’d predict the person is male.

Here’s the demo code that implements computing logistic regression output:

public double ComputeOutput(double[] dataItem, double[] weights)
{
  double z = 0.0;
  z += weights[0]; // Add b0 constant
  for (int i = 0; i < weights.Length - 1; ++i)
    z += (weights[i + 1] * dataItem[i]); // Skip b0
  return 1.0 / (1.0 + Math.Exp(-z)); // Logistic sigmoid
}
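
As a quick sanity check, the worked example from the previous section can be reproduced with a call like this (note that the b0 constant occupies the first cell of the weights array):

double[] weights = new double[] { 3.0, -2.0, 2.0, 1.5 }; // b0, b1, b2, b3
double[] inputs = new double[] { 0.80, -0.50, -1.00 };   // x1, x2, x3
double y = ComputeOutput(inputs, weights); // 0.2497, so predict 0 (male)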

The question is, where do the b-weights come from? The process of determining the values of the b-weights is called training the model. The idea is to use a set of training data that has known input and output values, and then try different values for the b-weights until you find a set of values that minimizes the error between computed outputs and the known, correct output values (often called the target values, or the desired values).

Finding the weight values that minimize error is difficult and there are many numerical optimization algorithms you can use. Each algorithm has different strengths and weaknesses. The most common optimization algorithms include simplex optimization, L-BFGS optimization, particle swarm optimization, iterated Newton-Raphson, plus about a dozen others. The most fundamental optimization algorithm is called gradient descent.

Understanding Gradient Descent

Let me try to explain gradient descent from a software developer’s point of view. I’ll take liberties with my explanation and terminology in order to make the ideas as clear as possible. Take a look at the graph in Figure 2. The graph plots error as a function of the value of some weight. As the value of a weight changes, the resulting logistic regression classifier’s error will change. The goal is to find the weight value where error is at a minimum. For Figure 2, this would be w = 5.0. Here, I use w to indicate any of the b-weights. Note that there’d be one graph like the one in Figure 2 for each weight.

Figure 2 Partial Derivatives and Gradient Descent

If you knew the shape of all the error graphs, determining each weight would be easy. But, unfortunately, you don’t know the shape of any error graph. You might think you could just try every possible weight value and then compute the resulting error, but there are an infinite number of possible weight values.

A calculus derivative of a function at some point is the slope of the tangent line at the point. A slope has a sign (+ or -) that indicates the direction of the tangent line, and a magnitude that indicates the steepness of the tangent. For example, in Figure 2, when w = 7.0, the slope of the tangent line (in other words, the derivative) is +2.15 (running from lower left to upper right, and steep). When w = -5.0, the derivative is -0.90 (running from upper left to lower right, and not too steep).

Each individual derivative is called a partial derivative (or just a “partial” for brevity) because there are derivatives for each weight. A gradient is the collection of all the partial derivatives. In casual usage, the terms gradient and partial derivative are often used interchangeably, mostly because a phrase like, “the partial derivative of the error function with respect to weight b2,” is a lot harder to say or write than, “the gradient.” Partial derivatives are often indicated using the special math symbol ∂, which resembles a backward 6.

So, what’s the point? If you look closely at Figure 2, you’ll see that a partial derivative can be used to move from a given weight value toward the weight value where error is minimized. The sign of the partial indicates the direction to move, and the magnitude of the partial gives a hint of how far to move; a larger magnitude means you can consider moving farther than a smaller magnitude. This technique is called gradient descent because you’re going down the error function toward a minimum value.

OK, so far so good, but how does this idea translate into usable code? Or, put another way, what is the pseudo-code to update a logistic regression weight? There are many resources on the Internet that show some fairly sophisticated calculus to derive the weight-update rule. The end result is:

wj = wj + a * (target - computed) * xj

In words, “for the jth weight, the new weight value is the old weight plus the product of a constant ‘a,’ times the difference between the target value in the training data and the computed output value, times the feature (input) value associated with the jth weight.” The update rule is often written using Greek letters with theta (θ) for the weight, and alpha (α) for the constant. The constant “a” is often called the learning rate.

Suppose you’re working with a weight where j = 2, b2. And suppose the current value of b2 is 0.50. If, for some training data item, the known target value is 1 and the computed output value (using all x values) is 0.80, and x2 (the single input value that corresponds to weight b2) has value 3.0, and the learning rate is 0.10, then the new b2 weight is:

b2 = 0.50 + 0.10 * (1 - 0.80) * 3.0
b2 = 0.50 + 0.06
b2 = 0.56
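
In code, that same update is a single statement:

double b2 = 0.50;
double alpha = 0.10;    // Learning rate 'a'
double target = 1.0;    // Known output from the training data
double computed = 0.80; // Output produced by ComputeOutput
double x2 = 3.0;        // Input value associated with b2
b2 = b2 + alpha * (target - computed) * x2; // 0.50 + 0.06 = 0.56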

The update rule is applied iteratively until a stopping condition is met. It’s almost too simple. Notice that the computed output (0.80) was too small compared to the target output (1.0), so the weight update rule increased the value of the weight. This will have the effect of increasing the computed output value on the next iteration of training. If the computed output had been too large compared to the target output, the weight update rule would’ve reduced the weight. Very neat!

There are two main ways to derive the weight update rule. The most common approach you’ll see in references on the Internet starts by defining the probability of getting a set of training data for a given set of weight values, and then uses a fairly complicated calculus technique called maximum likelihood estimation to find the values of the parameters that maximize the probability of the observed data.

An alternate approach starts by defining what’s meant by error, using one of two common error definitions: sum of squared deviations error or cross entropy error. This approach then uses calculus to find the set of weight values that minimizes error. When starting with cross entropy error, the resulting weight update rule is identical to the rule generated by maximizing probability. When starting with sum of squared deviations error, the resulting update rule has two additional terms:

wj = wj + a * (target - computed) * xj * computed * (1 - computed)

In the alternate update rule, because the computed term is always between 0 and 1, the product of computed and (1 - computed) will always be between 0 and 0.25, which means that updating weights using the alternate update rule just takes smaller steps than the simpler form. In practice, both update rules give similar results so the simpler form is almost always used. Luckily, you don’t need to know how to derive the weight update rule in order to train a logistic regression classifier.

The probability-derivation approach maximizes a probability by going up a gradient, so it’s called gradient ascent. The error derivation approach minimizes an error by going down a gradient and is called gradient descent. The point is that you’ll see training a logistic regression classifier using a gradient referred to as both the gradient descent technique and the gradient ascent technique. Both terms refer to the same weight update rule.

Demo Program Structure

The structure of the demo program, with some minor edits to save space, is presented in Figure 3. To create the demo, I launched Visual Studio and selected the C# Console Application template. I named the project LogisticGradient. The demo has no significant .NET dependencies so any version of Visual Studio will work. The demo is too long to present in its entirety, but all the source code is available in the download that accompanies this article. I removed all normal error checking to keep the main ideas as clear as possible.

Figure 3 Demo Program Structure

using System;
namespace LogisticGradient
{
  class LogisticGradientProgram
  {
    static void Main(string[] args)
    {
      Console.WriteLine("Begin classification demo");
      Console.WriteLine("Demonstrating gradient descent");
      ...
      Console.WriteLine("End demo");
      Console.ReadLine();
    }
    static double[][] MakeAllData(int numFeatures,
      int numRows, int seed) { . . }
    static void MakeTrainTest(double[][] allData, int seed,
      out double[][] trainData, out double[][] testData) { . . }
    static void ShowData(double[][] data, int numRows,
      int decimals, bool indices) { . . }
    static void ShowVector(double[] vector,
      int decimals, bool newLine) { . . }
  }
  public class LogisticClassifier
  {
    private int numFeatures;
    private double[] weights;
    private Random rnd;
    public LogisticClassifier(int numFeatures) { . . }
    public double[] Train(double[][] trainData,
      int maxEpochs, double alpha) { . . }
    private void Shuffle(int[] sequence) { . . }
    private double Error(double[][] trainData,
      double[] weights) { . . }
    private double ComputeOutput(double[] dataItem,
      double[] weights) { . . }
    private int ComputeDependent(double[] dataItem,
      double[] weights) { . . }
    public double Accuracy(double[][] trainData,
      double[] weights) { . . }
  }
}

After the template code loaded, in the Solution Explorer window, I right-clicked file Program.cs and renamed it to the more descriptive LogisticGradientProgram.cs, and Visual Studio automatically renamed class Program for me. In the editor window, at the top of the source code, I deleted all unneeded using statements, leaving just the one referencing the top-level System namespace.

The LogisticGradientProgram class contains helper methods MakeAllData, MakeTrainTest, ShowData, and ShowVector, which create and display the synthetic data. All the classification logic is contained in program-defined class LogisticClassifier. The Main method creates the synthetic data with these statements:

int numFeatures = 8;
int numRows = 10000;
int seed = 1; // Arbitrary
double[][] allData = MakeAllData(numFeatures, numRows, seed);

Method MakeAllData is essentially logistic regression classification in reverse. The method generates random weights and then iteratively generates random input values, combines the weights and input values using the logistic sigmoid function, and calculates the corresponding output value. The method doesn’t add any random noise to the data, which means that, in theory, 100 percent prediction accuracy is possible.
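The full implementation is in the accompanying code download, but a minimal sketch of the idea looks something like this (the value ranges and the 0.5 threshold here are assumptions for illustration, not necessarily the demo's exact values):

static double[][] MakeAllData(int numFeatures, int numRows, int seed)
{
  Random rnd = new Random(seed);
  double[] weights = new double[numFeatures + 1]; // Includes the b0 constant
  for (int i = 0; i < weights.Length; ++i)
    weights[i] = 20.0 * rnd.NextDouble() - 10.0; // Random in [-10.0, +10.0)

  double[][] result = new double[numRows][];
  for (int i = 0; i < numRows; ++i)
  {
    result[i] = new double[numFeatures + 1]; // Features + dependent variable
    double z = weights[0]; // The b0 constant
    for (int j = 0; j < numFeatures; ++j)
    {
      double x = 20.0 * rnd.NextDouble() - 10.0; // Random scaled input
      result[i][j] = x;
      z += weights[j + 1] * x;
    }
    double y = 1.0 / (1.0 + Math.Exp(-z)); // Logistic sigmoid
    result[i][numFeatures] = (y >= 0.5) ? 1.0 : 0.0; // Dependent value
  }
  return result;
}

The synthetic data is split into training and test sets, like so: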

double[][] trainData;
double[][] testData;
MakeTrainTest(allData, 0, out trainData, out testData);
ShowData(trainData, 3, 2, true);
ShowData(testData, 3, 2, true);

Method MakeTrainTest uses a hardcoded 80 percent to 20 percent train-test split. You might want to pass the training percentage as a parameter instead.
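A sketch of that parameterized version (hypothetical code, not the demo's exact method) might look like:

static void MakeTrainTest(double[][] allData, int seed, double trainPct,
  out double[][] trainData, out double[][] testData)
{
  Random rnd = new Random(seed);
  int totRows = allData.Length;
  int numTrainRows = (int)(totRows * trainPct); // E.g., trainPct = 0.80
  int numTestRows = totRows - numTrainRows;
  trainData = new double[numTrainRows][];
  testData = new double[numTestRows][];

  int[] sequence = new int[totRows];
  for (int i = 0; i < sequence.Length; ++i)
    sequence[i] = i;
  for (int i = 0; i < sequence.Length; ++i) // Fisher-Yates shuffle
  {
    int r = rnd.Next(i, sequence.Length);
    int tmp = sequence[r]; sequence[r] = sequence[i]; sequence[i] = tmp;
  }

  for (int i = 0; i < numTrainRows; ++i) // Copy row references
    trainData[i] = allData[sequence[i]];
  for (int i = 0; i < numTestRows; ++i)
    testData[i] = allData[sequence[i + numTrainRows]];
}

The logistic regression classifier is created and trained with: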

LogisticClassifier lc = new LogisticClassifier(numFeatures);
int maxEpochs = 1000;
double alpha = 0.01; // Learning rate
double[] weights = lc.Train(trainData, maxEpochs, alpha);
ShowVector(weights, 4, true);

The values for training parameters maxEpochs and alpha (the learning rate) were determined by trial and error. Tuning most ML training methods typically requires some experimentation to get good prediction accuracy. The quality of the trained model is evaluated like so:

double trainAcc = lc.Accuracy(trainData, weights);
Console.WriteLine(trainAcc.ToString("F4"));
double testAcc = lc.Accuracy(testData, weights);
Console.WriteLine(testAcc.ToString("F4"));

The accuracy of the model on the test data is the more relevant of the two accuracy values. It provides you with a rough estimate of how accurate the model would be when presented with new data with unknown output values.
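
Figure 3 shows only the signatures of methods Accuracy and ComputeDependent. Under the assumption that ComputeDependent simply applies the 0.5 decision rule described earlier, minimal implementations would resemble:

private int ComputeDependent(double[] dataItem, double[] weights)
{
  double y = ComputeOutput(dataItem, weights);
  return (y >= 0.5) ? 1 : 0; // The 0.5 decision threshold
}

public double Accuracy(double[][] trainData, double[] weights)
{
  int numCorrect = 0;
  for (int i = 0; i < trainData.Length; ++i)
  {
    int computed = ComputeDependent(trainData[i], weights); // 0 or 1
    double target = trainData[i][trainData[i].Length - 1];  // 0.0 or 1.0
    if (computed == (int)target)
      ++numCorrect;
  }
  return (numCorrect * 1.0) / trainData.Length;
}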

Implementing Gradient Descent Training

The definition of method Train begins with:

public double[] Train(double[][] trainData,
  int maxEpochs, double alpha)
{
  int epoch = 0;
  int[] sequence = new int[trainData.Length];
  for (int i = 0; i < sequence.Length; ++i)
    sequence[i] = i;
...

Variable epoch is the training loop counter variable. The array named sequence is initialized to the indices into the training data. The idea here is that the training data will be processed in a different, random order on every iteration. Next, the main weight-update loop begins:

while (epoch < maxEpochs)
{
  ++epoch;
  if (epoch % 100 == 0 && epoch != maxEpochs)
  {
    double mse = Error(trainData, weights);
    Console.Write("epoch = " + epoch);
    Console.WriteLine(" error = " + mse.ToString("F4"));
  }
  Shuffle(sequence); // Process data in random order
...

A measure of error is calculated and displayed every 100 epochs. Method Error returns the mean squared error, which is the average of the squared differences between computed and target output values. Note that this is slightly different from the definition of error that’s the basis of the gradient descent weight update rule. When using gradient descent training, the error is used implicitly, but not directly. Other training techniques, in particular particle swarm optimization, use the error explicitly. Method Shuffle scrambles the training data indices contained in the sequence array using the Fisher-Yates algorithm.
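
Plausible implementations of both helpers, consistent with these descriptions (sketches, not necessarily the demo's exact code), are:

private double Error(double[][] trainData, double[] weights)
{
  double sumSquaredError = 0.0;
  for (int i = 0; i < trainData.Length; ++i)
  {
    double computed = ComputeOutput(trainData[i], weights);
    double target = trainData[i][trainData[i].Length - 1]; // Last value
    sumSquaredError += (computed - target) * (computed - target);
  }
  return sumSquaredError / trainData.Length; // Mean squared error
}

private void Shuffle(int[] sequence)
{
  for (int i = 0; i < sequence.Length; ++i) // Fisher-Yates
  {
    int r = rnd.Next(i, sequence.Length);
    int tmp = sequence[r]; sequence[r] = sequence[i]; sequence[i] = tmp;
  }
}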

The heart of gradient descent training is short:

for (int ti = 0; ti < trainData.Length; ++ti)
{
  int i = sequence[ti];
  double computed = ComputeOutput(trainData[i], weights);
  int targetIndex = trainData[i].Length - 1;
  double target = trainData[i][targetIndex];
  weights[0] += alpha * (target - computed) * 1;
  for (int j = 1; j < weights.Length; ++j)
    weights[j] += alpha * (target - computed) * trainData[i][j - 1];
}

First, the index of the next random-order training item is retrieved from the scrambled sequence array. Method ComputeOutput uses the current weights to compute the output for that training item, which will be a value between 0.0 and 1.0. The target value, 0 or 1, is extracted from the current training item. The b0 weight is updated first. Recall that the weight update rule uses the input value associated with the weight being modified. However, the b0 weight isn’t associated with any actual input. To deal with this, logistic regression b0 weights are said to have a dummy input value always equal to 1.0. The demo code multiplies by the 1.0 value, which obviously has no effect, to illustrate the similarity between updating b0 and updating any other b-weight. Updating the regular b-weights is pretty straightforward and just requires some attention to the indexing details. Method Train concludes with:

...
} // While loop
  return this.weights;
} // Train

The method returns a reference to the actual weights in the LogisticClassifier weights array. For safety, you might want to consider creating a results array, then copying the weights into that array and returning a reference to the array.
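
For example, the end of method Train could be changed to something like this:

double[] result = new double[this.weights.Length];
Array.Copy(this.weights, result, result.Length);
return result;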

As noted earlier, the demo uses stochastic gradient descent. As each training item is encountered, the gradient for that one training item is calculated and used to update all weights. In batch gradient descent, in contrast, in each iteration gradients are accumulated over all training items first, and then the weights are updated. To use batch training, the heart of method Train would become the code shown in Figure 4.

Figure 4 The Train Method When Using Batch Training

double[] accumulatedGradients = new double[weights.Length];
for (int i = 0; i < trainData.Length; ++i)  // Accumulate
{
  double computed = ComputeOutput(trainData[i], weights);
  int targetIndex = trainData[i].Length - 1;
  double target = trainData[i][targetIndex];
  accumulatedGradients[0] += (target - computed) * 1; // For b0
  for (int j = 1; j < weights.Length; ++j)
    accumulatedGradients[j] += (target - computed) * trainData[i][j - 1];
}
for (int j = 0; j < weights.Length; ++j) // Update all wts
  weights[j] += alpha * accumulatedGradients[j];

With batch training, because all training items are processed before any weights are updated, there’s no advantage to processing the training data in random order.

When gradient descent training was first conceived, the batch approach was considered theoretically preferable because that technique uses all available information to find the weight gradient. However, ML practitioners quickly realized that training speed could be increased by using the gradient for just a single training item as an estimate for the overall gradient. In other words, stochastic (which means randomly determined) gradient descent uses one gradient to estimate the overall gradient.

Wrapping Up

Two related variations of basic gradient descent that are often used with logistic regression classifiers are called BFGS and L-BFGS. These two algorithms are an attempt to improve on basic gradient descent, at the expense of significantly increased complexity.

In addition to logistic regression classification, gradient descent can be used with several other ML techniques. In particular, gradient descent can be used to train a neural network. When gradient descent is used with neural networks, the technique is called back-propagation.


Dr. James McCaffrey works for Microsoft Research in Redmond, Wash. He has worked on several Microsoft products including Internet Explorer and Bing. Dr. McCaffrey can be reached at jammc@microsoft.com.

Thanks to the following technical expert at Microsoft Research for reviewing this article: Richard Hughes