January 2015

Volume 30 Number 1


Test Run - Logistic Regression Classification with Multi-Swarm Optimization

By James McCaffrey

One of the most fundamental forms of machine learning is logistic regression (LR) classification. The goal of LR classification is to create a model that predicts a variable that can take on one of two possible values. For example, you might want to predict which of two candidates a voter will select (“Smith” = 0, “Jones” = 1) based on the voter’s age (x1), sex (x2) and annual income (x3).

If Y is the predicted value, an LR model for this problem would take the form:

z = b0 + b1(x1) + b2(x2) + b3(x3)
Y = 1.0 / (1.0 + e^-z)

Here, b0, b1, b2 and b3 are weights, which are just numeric values that must be determined. In words, you compute a value z that is the sum of the input values times their b-weights, plus the b0 constant, then feed the z value to the equation that uses the math constant e. It turns out that Y will always be between 0 and 1. If Y is less than 0.5, you conclude the output is 0, and if Y is greater than 0.5, you conclude the output is 1.

For example, suppose a voter’s age is 32, sex is male (-1), and annual income in tens of thousands of dollars is 48.0. And suppose b0 = -9.0, b1 = 8.0, b2 = 5.0, and b3 = -5.0. Then z = -9.0 + (8.0)(32) + (5.0)(-1) + (-5.0)(48.0) = 2.0 and so Y = 1.0 / (1.0 + e^-2.0) = 0.88.

Because Y is greater than 0.5, you’d conclude the voter will pick candidate 1 (“Jones”). But where do the b-weight values come from? Training an LR classifier is the process of finding the values for the b-weights. The idea is to use training data that has known output values and then find the set of b values so that the difference between the computed output values and the known output values is minimized. This is a math problem that’s often called numerical optimization.
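To make the computation concrete, here's a minimal C# sketch of the output computation. ComputeOutput is a hypothetical helper for illustration; it's not part of the demo code:

static double ComputeOutput(double[] xValues, double[] bWeights)
{
  double z = bWeights[0]; // The b0 constant
  for (int i = 0; i < xValues.Length; ++i)
    z += bWeights[i + 1] * xValues[i]; // b1(x1) + b2(x2) + ...
  return 1.0 / (1.0 + Math.Exp(-z)); // The logistic sigmoid function
}

Calling ComputeOutput(new double[] { 32.0, -1.0, 48.0 }, new double[] { -9.0, 8.0, 5.0, -5.0 }) returns approximately 0.88, matching the voter example above.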

There are about a dozen major optimization algorithms used in machine learning. For logistic regression classification training, two of the most common algorithms are called iterative Newton-Raphson and L-BFGS. In this article, I present a technique called multi-swarm optimization (MSO). MSO is a variation of particle swarm optimization (PSO). In MSO, a virtual particle has a position that corresponds to a set of b-weight values. A swarm is a collection of particles that move in a way inspired by group behavior such as the flocking of birds. MSO maintains several swarms that interact with each other, as opposed to PSO, which uses just one swarm.

To see where this article is headed, and to get an idea of what LR with MSO is, take a look at the demo program in Figure 1. The goal of the demo program is to use MSO to create an LR model that predicts Y for a set of synthetic (programmatically generated) data.

Figure 1 Logistic Regression with Multi-Swarm Optimization in Action

The demo program begins by creating 10,000 random data items where there are five predictor variables (often called features in ML terminology). Each feature value is between -10.0 and +10.0, and the Y-value, 0 or 1, is in the last column of the data set. The feature values don’t correspond to a real problem.

The 10,000-item data set is randomly split into an 8,000-item training set used to create the LR model, and a 2,000-item hold-out test set used to evaluate the accuracy of the model after training. The demo program creates an LR classifier and then uses four swarms, each with three particles, to train the classifier. MSO is an iterative process and the maximum number of iterations, maxEpochs, is set to 100.

Every 10 epochs, the demo displays the best (smallest) error found by any particle. After training completes, the best weights found are (4.09, 10.00, 8.43, 1.86, -9.27, 1.84). Using these weights, the accuracy of the LR model is computed for the training data (99.98 percent correct, which is 7,998 out of 8,000) and for the test data (99.85 percent correct, which is 1,997 out of 2,000). The accuracy of the model on the test data gives you a rough approximation of how well the model would do when presented with new data that has unknown output values.

This article assumes you have at least intermediate programming skills, but doesn’t assume you know anything about logistic regression classification or multi-swarm optimization. The demo program is coded using C#, but you shouldn’t have too much difficulty refactoring the code to another language such as Visual Basic .NET or Python.

The demo code is too long to present here in its entirety, but the complete source code is available in the code download that accompanies this article. The demo code has all normal error checking removed to keep the main ideas as clear as possible and the size of the code small.

Overall Program Structure

The overall program structure, with a few minor edits to save space, is presented in Figure 2. To create the demo, I launched Visual Studio and created a new C# console application named LogisticWithMulti. The demo has no significant Microsoft .NET Framework dependencies, so any recent version of Visual Studio will work.

Figure 2 Overall Program Structure

using System;
namespace LogisticWithMulti
{
  class LogisticMultiProgram
  {
    static void Main(string[] args)
    {
      Console.WriteLine("Begin demo");
      int numFeatures = 5;
      int numRows = 10000;
      int seed = 0;
      Console.WriteLine("Generating " + numRows +
        " artificial data items with " + numFeatures + " features");
      double[][] allData = MakeAllData(numFeatures, numRows, seed);
      Console.WriteLine("Done");
      Console.WriteLine("Creating train and test matrices");
      double[][] trainData;
      double[][] testData;
      MakeTrainTest(allData, 0.80, seed, out trainData, out testData);
      Console.WriteLine("Done");
      Console.WriteLine("Training data: ");
      ShowData(trainData, 4, 2, true);
      Console.WriteLine("Test data: ");
      ShowData(testData, 3, 2, true);
      Console.WriteLine("Creating Logistic Regression classifier");
      LogisticClassifier lc = new LogisticClassifier(numFeatures);
      int numSwarms = 4;
      int numParticles = 3;
      int maxEpochs = 100;
      Console.WriteLine("Setting numSwarms = " + numSwarms);
      Console.WriteLine("Setting numParticles = " + numParticles);
      Console.WriteLine("Setting maxEpochs = " + maxEpochs);
      Console.WriteLine("\nStarting training");
      double[] bestWeights = lc.Train(trainData, maxEpochs,
        numSwarms, numParticles);
      Console.WriteLine("Training complete");
      Console.WriteLine("Best weights found:");
      ShowVector(bestWeights, 4, true);
      double trainAcc = lc.Accuracy(trainData, bestWeights);
      Console.WriteLine("Accuracy on training data = " +
        trainAcc.ToString("F4"));
      double testAcc = lc.Accuracy(testData, bestWeights);
      Console.WriteLine("Accuracy on test data = " +
        testAcc.ToString("F4"));
      Console.WriteLine("End demo");
      Console.ReadLine();
    } // Main
    static double[][] MakeAllData(int numFeatures,
      int numRows, int seed) { . . }
    static void MakeTrainTest(double[][] allData, 
      double trainPct, int seed,
      out double[][] trainData, out double[][] testData) { . . }
    static void ShowData(double[][] data, int numRows,
      int decimals, bool indices) { . . }
    static void ShowVector(double[] vector, int decimals,
      bool newLine) { . . }
  } // Program
  public class LogisticClassifier { . . }
} // ns

After the template code loaded into the Visual Studio editor, in the Solution Explorer window I renamed file Program.cs to the more descriptive LogisticMultiProgram.cs and Visual Studio automatically renamed class Program for me. At the top of the source code, I deleted all using statements that pointed to unneeded namespaces, leaving just the reference to the top-level System namespace.

The program class has helper methods MakeAllData, MakeTrainTest, ShowData and ShowVector. All of the logistic regression logic is contained in a single LogisticClassifier class. The LogisticClassifier class contains nested helper classes Particle, Swarm and MultiSwarm to encapsulate the MSO data and logic used during training. These helper classes could’ve been defined outside of the LogisticClassifier class.
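The definitions of the nested helper classes are in the code download. To give you an idea of their structure, here's a minimal sketch of the Particle class, consistent with how method Train uses it; the velocity initialization range and the seeded static Random are my assumptions, not necessarily what the download version uses:

private class Particle
{
  static Random rnd = new Random(0); // Assumption: seeded for reproducibility
  public double[] position; // Equivalent to a set of LR b-weights
  public double[] velocity;
  public double error;
  public double[] bestPosition;
  public double bestError;

  public Particle(int dim)
  {
    position = new double[dim];
    velocity = new double[dim];
    bestPosition = new double[dim];
    for (int k = 0; k < dim; ++k)
    {
      position[k] = 20.0 * rnd.NextDouble() - 10.0; // [minX, maxX)
      velocity[k] = 2.0 * rnd.NextDouble() - 1.0; // Assumption: small random velocity
    }
    Array.Copy(position, bestPosition, dim);
    error = double.MaxValue; // Error is supplied later by method Train
    bestError = double.MaxValue;
  }
}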

The Main method has a lot of WriteLine noise. The key calling statements are quite simple. The synthetic data is generated, like so:

int numFeatures = 5;
int numRows = 10000;
int seed = 0; // Gives representative demo
double[][] allData = MakeAllData(numFeatures, numRows, seed);

Method MakeAllData first creates random b-weight values between -10.0 and +10.0. Then, for each data item, random x-values between -10.0 and +10.0 are generated and combined with the b-weights to produce a z value, which in turn generates the item's Y-value. The synthetic data corresponds to a data set where the x-values have been normalized, and where there are roughly equal counts of Y = 0 and Y = 1 values.
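The full MakeAllData is in the code download. A minimal sketch consistent with the description above might look like this; the deterministic 0.5 labeling threshold is my assumption, and the download version may differ in details:

static double[][] MakeAllData(int numFeatures, int numRows, int seed)
{
  Random rnd = new Random(seed);
  double[] weights = new double[numFeatures + 1]; // Hidden generating b-weights
  for (int i = 0; i < weights.Length; ++i)
    weights[i] = 20.0 * rnd.NextDouble() - 10.0; // [-10.0, +10.0)
  double[][] result = new double[numRows][];
  for (int i = 0; i < numRows; ++i)
  {
    result[i] = new double[numFeatures + 1]; // Last column holds Y
    double z = weights[0]; // The b0 constant
    for (int j = 0; j < numFeatures; ++j)
    {
      double x = 20.0 * rnd.NextDouble() - 10.0; // Random feature value
      result[i][j] = x;
      z += weights[j + 1] * x;
    }
    double y = 1.0 / (1.0 + Math.Exp(-z));
    result[i][numFeatures] = (y > 0.5) ? 1.0 : 0.0; // Assumption: 0.5 threshold
  }
  return result;
}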

The data is split into training and test sets with these statements:

double[][] trainData;
double[][] testData;
MakeTrainTest(allData, 0.80, seed, out trainData, out testData);
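Method MakeTrainTest can be implemented in several ways. Here's one reasonable sketch that shuffles row indices with the Fisher-Yates algorithm and then splits the rows; this is my assumption about the approach, not necessarily how the download version works:

static void MakeTrainTest(double[][] allData, double trainPct,
  int seed, out double[][] trainData, out double[][] testData)
{
  Random rnd = new Random(seed);
  int totRows = allData.Length;
  int numTrainRows = (int)(totRows * trainPct); // 8,000 in the demo
  int numTestRows = totRows - numTrainRows; // 2,000 in the demo
  trainData = new double[numTrainRows][];
  testData = new double[numTestRows][];
  int[] indices = new int[totRows];
  for (int i = 0; i < totRows; ++i)
    indices[i] = i;
  for (int i = 0; i < totRows; ++i) // Fisher-Yates shuffle of row order
  {
    int r = rnd.Next(i, totRows);
    int tmp = indices[r]; indices[r] = indices[i]; indices[i] = tmp;
  }
  for (int i = 0; i < numTrainRows; ++i)
    trainData[i] = allData[indices[i]];
  for (int i = 0; i < numTestRows; ++i)
    testData[i] = allData[indices[i + numTrainRows]];
}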

The LR model is created and trained with these statements:

LogisticClassifier lc = new LogisticClassifier(numFeatures);
int numSwarms = 4;
int numParticles = 3;
int maxEpochs = 100;
double[] bestWeights = lc.Train(trainData, 
  maxEpochs, numSwarms, numParticles);

And the accuracy of the model is evaluated with these two statements:

double trainAcc = lc.Accuracy(trainData, bestWeights);
double testAcc = lc.Accuracy(testData, bestWeights);
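Method Accuracy isn't presented in this article. A minimal sketch, assuming numFeatures is a class field set by the LogisticClassifier constructor and using the 0.5 decision threshold described earlier:

public double Accuracy(double[][] dataSet, double[] weights)
{
  int numCorrect = 0;
  int numWrong = 0;
  for (int i = 0; i < dataSet.Length; ++i)
  {
    double z = weights[0]; // The b0 constant
    for (int j = 0; j < numFeatures; ++j)
      z += weights[j + 1] * dataSet[i][j];
    double computedY = 1.0 / (1.0 + Math.Exp(-z));
    double desiredY = dataSet[i][numFeatures]; // 0.0 or 1.0, in last column
    if ((desiredY == 0.0 && computedY < 0.5) ||
        (desiredY == 1.0 && computedY >= 0.5))
      ++numCorrect;
    else
      ++numWrong;
  }
  return (numCorrect * 1.0) / (numCorrect + numWrong);
}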

Understanding the MSO Algorithm

Here’s the MSO algorithm in high-level pseudo-code:

for-each swarm
  initialize each particle to a random position
end-for
loop maxEpochs times
  for-each swarm
    for-each particle in swarm
      compute new velocity
      use new velocity to compute new position
      check if position is a new best
      does particle die?
      does particle immigrate to another swarm?
    end-for
  end-for
end-loop
return best position found

The key part of the MSO algorithm is computing a particle’s velocity, which is just a set of values that control where the particle will move next. For example, for a problem with just two x-dimensions, if a particle is at (6.0, 8.0) and the velocity is (-2.0, 1.0), the particle’s new position will be (4.0, 9.0).

Velocity is calculated so that a particle tends to move in its current direction; tends to move toward its best position found to date; tends to move toward the best position found by any of its fellow swarm members; and tends to move toward the best position found by any particle in any swarm.

In math terms, if x(t) is a particle’s position at time t, then its new velocity, v(t+1), is calculated as:

v(t+1) = w * v(t) +
         (c1 * r1) * (p(t) - x(t)) +
         (c2 * r2) * (s(t) - x(t)) +
         (c3 * r3) * (g(t) - x(t))

Term p(t) is the particle’s best-known position. Term s(t) is the best-known position of any particle in the particle’s swarm. Term g(t) is the global best position of any particle in any swarm. Term w is a constant called the inertia factor. Terms c1, c2 and c3 are constants that establish a maximum change for each component of the new velocity. Terms r1, r2 and r3 are random values between 0 and 1 that provide a randomization effect to each velocity update.

Suppose a particle is currently at (20.0, 30.0) and its current velocity is (-1.0, -3.0). Also, the best-known position of the particle is (10.0, 12.0), the best-known position of any particle in the swarm is (8.0, 9.0), and the best-known position of any particle in any swarm is (5.0, 6.0). And suppose that constant w has value 0.7, constants c1 and c2 are both 1.4, and constant c3 is 0.4. Finally, suppose random values r1, r2 and r3 are all 0.2.

The new velocity of the particle (with the constant products rounded to one decimal) is:

v(t+1) = 0.7 * (-1.0, -3.0) +
         (1.4 * 0.2) * ((10.0, 12.0) - (20.0, 30.0)) +
         (1.4 * 0.2) * ((8.0, 9.0) - (20.0, 30.0)) +
         (0.4 * 0.2) * ((5.0, 6.0) - (20.0, 30.0))
       = 0.7 * (-1.0, -3.0) +
         0.3 * (-10.0, -18.0) +
         0.3 * (-12.0, -21.0) +
         0.1 * (-15.0, -24.0)
       = (-8.8, -16.2)

And so the particle’s new position is:

x(t+1) = (20.0, 30.0) + (-8.8, -16.2)
       = (11.2, 13.8)

The graph in Figure 3 illustrates the MSO process for a problem with two x-values, three swarms with five particles each, and where the optimal position is at (0, 0). The graph shows how the first particle in each swarm closes in on the optimal position. The spiral motion is characteristic of MSO.

Figure 3 The Multi-Swarm Optimization Algorithm Illustrated

In the MSO pseudo-code, a particle can die with some low probability. When a particle dies, it’s replaced with a new particle at a random position. A particle can immigrate with some low probability. When immigration occurs, a particle is swapped with another particle from a different swarm. The death and immigration mechanisms add an element of randomness to MSO and help prevent the algorithm from getting stuck in a non-optimal solution.

Implementing Logistic Regression Using MSO

The definition of method Train begins as:

public double[] Train(double[][] trainData, int maxEpochs,
  int numSwarms, int numParticles)
{
  int dim = numFeatures + 1;
  double minX = -10.0;
  double maxX = 10.0;
  MultiSwarm ms = new MultiSwarm(numSwarms, numParticles, dim);
...

The dimension of the problem is the number of predictor features, plus one to account for the b0 constant. Variables minX and maxX hold the smallest and largest values allowed for any single value in a Particle object’s position array. The LogisticClassifier class contains a nested MultiSwarm class. The MultiSwarm class constructor creates an array-of-arrays of Particle objects, each of which initially has a random position and a random velocity. Because the logistic regression Error method isn’t directly visible to the nested Particle definition, the MultiSwarm constructor doesn’t supply an error value for each particle, so method Train adds the error information. First, each particle gets an error value:

for (int i = 0; i < numSwarms; ++i)
{
  for (int j = 0; j < numParticles; ++j)
  {
    Particle p = ms.swarms[i].particles[j];
    p.error = Error(trainData, p.position);
    p.bestError = p.error;
    Array.Copy(p.position, p.bestPosition, dim);
...

Next, the current swarm’s best error and the overall global best error are computed:

...
    if (p.error < ms.swarms[i].bestError) // Swarm best?
    {
      ms.swarms[i].bestError = p.error;
      Array.Copy(p.position, ms.swarms[i].bestPosition, dim);
    }
    if (p.error < ms.bestError) // Global best?
    {
      ms.bestError = p.error;
      Array.Copy(p.position, ms.bestPosition, dim);
    }
  } // j
} // i

The main training loop is prepared, like so:

int epoch = 0;
double w = 0.729; // inertia
double c1 = 1.49445; // particle
double c2 = 1.49445; // swarm
double c3 = 0.3645; // multiswarm
double pDeath = 1.0 / maxEpochs;
double pImmigrate = 1.0 / maxEpochs;
int[] sequence = new int[numParticles];
for (int i = 0; i < sequence.Length; ++i)
  sequence[i] = i;

The values for constants w, c1 and c2 are the result of PSO research. Interestingly, compared to many numerical optimization algorithms, PSO and MSO are relatively insensitive to the values used for internal magic constants (called free parameters or hyperparameters). There’s little available research on the c3 constant, which influences the tendency of a particle to move toward the best-known position found to date by any particle in any swarm. The value I use, 0.3645, is half the value of the inertia constant, and has worked well for me in practice.

The sequence array holds the indices of the particles in a swarm. This array will be scrambled so that, in each iteration of the main training loop, the particles in each swarm are processed in a different order.
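Method Shuffle isn't shown in this article. A standard Fisher-Yates implementation, assuming rnd is a class-scope Random object (the demo uses rnd elsewhere inside Train), could look like:

private void Shuffle(int[] sequence)
{
  for (int i = 0; i < sequence.Length; ++i)
  {
    int r = rnd.Next(i, sequence.Length); // Random index in [i, Length)
    int tmp = sequence[r];
    sequence[r] = sequence[i];
    sequence[i] = tmp;
  }
}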

The main training loop begins:

while (epoch < maxEpochs)
{
  ++epoch;
  // Optionally print best error here
  for (int i = 0; i < numSwarms; ++i) // Each swarm
  {
    Shuffle(sequence);
    for (int pj = 0; pj < numParticles; ++pj) // Each particle
    {
      int j = sequence[pj];
      Particle p = ms.swarms[i].particles[j];
...

Knowing when to stop training is one of the most difficult challenges in machine learning. Here, a fixed number of iterations, maxEpochs, is used. This is a simple approach, but there’s a risk you might not train enough. Or you might train too much, which would result in a model that fits the training data extremely well but works poorly on new data. This is called over-fitting. There are dozens of techniques that can be used to combat over-fitting.

Next, the new velocity for the current particle is calculated, as explained earlier (see Figure 4).

Figure 4 New Velocity for Current Particle Is Calculated

for (int k = 0; k < dim; ++k)
{
  double r1 = rnd.NextDouble();
  double r2 = rnd.NextDouble();
  double r3 = rnd.NextDouble();
  p.velocity[k] = (w * p.velocity[k]) +
    (c1 * r1 * (p.bestPosition[k] - p.position[k])) +
    (c2 * r2 * (ms.swarms[i].bestPosition[k] - p.position[k])) +
    (c3 * r3 * (ms.bestPosition[k] - p.position[k]));
  if (p.velocity[k] < minX)
    p.velocity[k] = minX;
  else if (p.velocity[k] > maxX)
    p.velocity[k] = maxX;
} // k

After the velocity is calculated, each of its component values is checked to see whether the magnitude is large, and if so, the value is reined back in. This prevents a particle from moving a very great distance in any one iteration. For some ML training tasks, eliminating the velocity constraint appears to speed up the training, but there are no solid research results to give any specific advice here.

Next, the new velocity is used to calculate the current particle’s new position:

for (int k = 0; k < dim; ++k)
{
  p.position[k] += p.velocity[k];
  if (p.position[k] < minX)
    p.position[k] = minX;
  else if (p.position[k] > maxX)
    p.position[k] = maxX;
}

After the current particle has moved, its new error is calculated:

p.error = Error(trainData, p.position); // Expensive
if (p.error < p.bestError) // New best position for particle?
{
  p.bestError = p.error;
  Array.Copy(p.position, p.bestPosition, dim);
}
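Method Error, which dominates the running time, isn't presented in this article. A minimal sketch, assuming mean squared error between computed outputs and target values, and that numFeatures is a class field:

public double Error(double[][] trainData, double[] weights)
{
  double sumSquaredError = 0.0;
  for (int i = 0; i < trainData.Length; ++i)
  {
    double z = weights[0]; // The b0 constant
    for (int j = 0; j < numFeatures; ++j)
      z += weights[j + 1] * trainData[i][j];
    double computedY = 1.0 / (1.0 + Math.Exp(-z));
    double desiredY = trainData[i][numFeatures]; // Target, in last column
    sumSquaredError += (computedY - desiredY) * (computedY - desiredY);
  }
  return sumSquaredError / trainData.Length; // Mean squared error
}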

The current particle’s new error might be a new swarm best or global best:

if (p.error < ms.swarms[i].bestError) // New best for swarm?
{
  ms.swarms[i].bestError = p.error;
  Array.Copy(p.position, ms.swarms[i].bestPosition, dim);
}
if (p.error < ms.bestError) // New global best?
{
  ms.bestError = p.error;
  Array.Copy(p.position, ms.bestPosition, dim);
}

After the current particle has moved, it may die with some low probability. If a random value is below a small threshold, the particle is killed and a replacement particle is generated:

double p1 = rnd.NextDouble();
if (p1 < pDeath)
{
  Particle q = new Particle(dim); // A replacement
  q.error = Error(trainData, q.position);
  Array.Copy(q.position, q.bestPosition, dim);
  q.bestError = q.error;

Instead of using a fixed probability of death, an alternative is to gradually increase the probability of death, for example:

double pDeath = (maxProbDeath / maxEpochs) * epoch;

After the replacement particle has been created, that particle may be, by pure luck, a new swarm best or global best:

if (q.error < ms.swarms[i].bestError) // Best swarm error by pure luck?
{
  ms.swarms[i].bestError = q.error;
  Array.Copy(q.position, ms.swarms[i].bestPosition, dim);
  if (q.error < ms.bestError) // Best global error?
  {
    ms.bestError = q.error;
    Array.Copy(q.position, ms.bestPosition, dim);
  }
}

The death mechanism concludes by exchanging the current particle for the replacement, effectively killing the current particle:

...
  ms.swarms[i].particles[j] = q;
} // Die

Next, the (optional) immigration mechanism starts with:

double p2 = rnd.NextDouble();
if (p2 < pImmigrate)
{
  int ii = rnd.Next(0, numSwarms); // rnd swarm
  int jj = rnd.Next(0, numParticles); // rnd particle
  Particle q = ms.swarms[ii].particles[jj]; // q points to other
  ms.swarms[i].particles[j] = q;
  ms.swarms[ii].particles[jj] = p; // the exchange
...

The immigration mechanism is quite crude. The code selects a random swarm and then a random particle from that swarm, and then swaps the random particle with the current particle. As coded, there’s no guarantee the randomly selected particle will be in a swarm different from the current particle.

The immigration mechanism finishes by checking to see if swapping particles in two swarms resulted in one or two new swarm-best positions:

...
  if (q.error < ms.swarms[i].bestError) // Curr has new position
  {
    ms.swarms[i].bestError = q.error;
    Array.Copy(q.position, ms.swarms[i].bestPosition, dim);
  }
  if (p.error < ms.swarms[ii].bestError) // Other has new position
  {
    ms.swarms[ii].bestError = p.error;
    Array.Copy(p.position, ms.swarms[ii].bestPosition, dim);
  }
} // Immigrate

Method Train concludes by returning the best position found by any particle:

...
      } // j - each particle
    } // i - each swarm
  } // while
  return ms.bestPosition;
} // Train

The return value is a reference to the array that holds the best set of LR b-weights found by any particle. To reduce the chances of an unwanted side effect, a minor alternative is to copy the values into a new local array and return a reference to that array.
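In code, that defensive copy could look like:

double[] retResult = new double[dim];
Array.Copy(ms.bestPosition, retResult, dim);
return retResult;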

Wrapping Up

This article and accompanying code should get you up and running if you want to explore logistic regression classification using multi-swarm optimization for training. MSO is really more of a meta-heuristic than an algorithm. By that I mean MSO is a set of design principles that can be implemented in many different ways. There are dozens of modifications to the basic MSO design you can investigate.

Logistic regression is one of the simplest forms of machine learning classification. For many classification problems, LR just doesn’t work well. However, many ML practitioners, including me, usually begin investigating a classification problem using LR, and then use more sophisticated techniques, if necessary.

Based on my experience, compared to older, calculus-based numerical optimization techniques such as gradient descent and L-BFGS, training using MSO tends to produce better ML models, but MSO is almost always an order of magnitude slower.


Dr. James McCaffrey works for Microsoft Research in Redmond, Wash. He has worked on several Microsoft products including Internet Explorer and Bing. Dr. McCaffrey can be reached at jammc@microsoft.com.

Thanks to the following Microsoft technical experts for reviewing this article: Todd Bello and Alisson Sol