November 2014

Volume 29 Number 11


Test Run : Consensus Classification Using C#

James McCaffrey

In machine learning (ML), classification is the process of creating a model (typically a math equation of some sort, or a set of rules) that predicts a value that can take on discrete, non-numeric values. For example, you might want to predict the political party (Democrat or Republican) of a member of Congress based on his or her voting record. Training the model is the process of finding the set of constants (for a math equation model) or the set of rules (for a rule-based model) so that, when the model is presented with training data that has known dependent-variable values, the computed outputs closely match the known outputs. The model can then be used to make predictions for new data with unknown outputs.

Although there are many standard classification algorithms and techniques, including Naive Bayes classification, logistic regression classification and neural network classification, for some problems a custom classification algorithm is useful. This article presents a custom technique that, for lack of a better name, I call consensus classification. The best way to get a feel for what consensus classification is and to see where this article is headed is to take a look at the demo program in Figure 1.

Figure 1 Consensus Classification in Action

The goal of the demo program is to create a model that predicts the political party, Democrat or Republican, of a member of the U.S. House of Representatives based on the representative’s voting record on 16 legislative bills. The raw data set consists of 100 items, each of which corresponds to a Representative and has 17 fields. The first 16 fields in each item are votes where character “y” is a yes vote and character “n” is a no vote. The last field in each item is the Representative’s actual political party. For example, the first two data items are:

n, y, y, n, y, y, n, n, n, n, n, n, y, y, y, y, democrat
n, y, n, y, y, y, n, n, n, n, n, y, y, y, n, y, republican

The demo data is a subset of a well-known benchmark set called the Congressional Voting Records Data Set. The complete benchmark data set has 435 items, some of which have a vote value of "?", indicating an unknown vote, or abstention. The demo subset consists of the first 100 items from the full set that have no "?" votes. Additionally, the source data set has the political party in the first column; I moved party to the last column, which is much more convenient when programming.

Although it's not necessary to know which legislative bill each of the 16 votes corresponds to, the topic of each bill is suggested by the following 16 short terms: handicapped-infants, water-project, adopt-budget, physician-fees, el-salvador, school-religion, anti-satellite, nicaraguan-contras, mx-missile, immigration-bill, synfuels-cutback, education-spending, superfund-sue, crime-bill, duty-free, south-africa.

The demo program splits the 100-item data set into an 80-item set used to train the model and a 20-item set used to estimate the accuracy of the resulting model. A consensus classification model consists of a set of simple rules such as "If the Representative voted 'y' on bill 0 and 'n' on bill 3 and 'y' on bill 15, then the Representative is a Republican." In the demo, the number of Boolean conditions in each simple rule is set to 5, and the total number of rules is set to 500. Furthermore, each simple rule is required to be at least 90 percent accurate on the training data items for which the rule is applicable.

Behind the scenes, the training process generated the 500 simple rules. After the model was created, the demo program applied the model's rules to the training data and achieved 93.75 percent accuracy. The model was then applied to the 20-item test set, resulting in 84.21 percent accuracy: 16 correct predictions, three incorrect predictions, and one data item where none of the 500 rules in the model was applicable.

This article assumes you have at least intermediate programming skills and a basic knowledge of ML classification. The demo is coded using C#, but you shouldn’t have much trouble refactoring the code to another language, such as Visual Basic .NET or Python. The demo code is a bit too long to present in its entirety, but the entire source code is available in the download that accompanies this article at msdn.microsoft.com/magazine/msdnmag1114.

Overall Program Structure

The overall structure of the demo program, with some WriteLine statements removed and minor edits to save space, is presented in Figure 2. To create the demo, I launched Visual Studio and created a new C# console application and named it ConsensusClassification. The demo has no significant Microsoft .NET Framework dependencies, so any relatively recent version of Visual Studio will work. After the template-generated code loaded, in the Solution Explorer window I renamed file Program.cs to the more descriptive ConsensusProgram.cs and Visual Studio automatically renamed class Program for me.

Figure 2 Overall Program Structure

using System;
using System.Collections.Generic;
namespace ConsensusClassification
{
  class ConsensusProgram
  {
    static void Main(string[] args)
    {
      Console.WriteLine("Begin consensus classification demo");
      Console.WriteLine("Goal is predict political party");
      string[][] allData = new string[100][];
      allData[0] = new string[] { "n", "y", "y", "n", "y", "y", "n", "n",
        "n", "n", "n", "n", "y", "y", "y", "y", "democrat" };
      allData[1] = new string[] { "n", "y", "n", "y", "y", "y", "n", "n",
        "n", "n", "n", "y", "y", "y", "n", "y", "republican" };
      // Etc.
      allData[99] = new string[] { "y", "n", "y", "n", "n", "y", "y", "y",
        "y", "y", "y", "n", "n", "n", "y", "y", "democrat" };
      Console.WriteLine("All data: ");
      ShowData(allData, 5, true);
      Console.WriteLine("Creating 80-20 train-test data");
      string[][] trainData;
      string[][] testData;
      MakeTrainTest(allData, 0, out trainData, out testData); // 0 = seed
      Console.WriteLine("Training data: \n");
      ShowData(trainData, 3, true);
      Console.WriteLine("Test data: \n");
      ShowData(testData, 3, true);
      int numConditions = 5; // Conditions per rule
      int maxNumRules = 500;
      double minAccuracy = 0.90; // Min % rule accuracy
      Console.WriteLine("Setting number conditions per rule = " + 
        numConditions);
      Console.WriteLine("Setting max number simple rules    = " + 
        maxNumRules);
      Console.WriteLine("Setting simple rule min accuracy   = " +
        minAccuracy.ToString("F2"));
      ConsensusClassifier cc =
        new ConsensusClassifier(numConditions, maxNumRules);
      Console.WriteLine("Starting training");
      cc.Train(trainData, minAccuracy);
      Console.WriteLine("Done");
      Console.WriteLine("Created " + cc.RuleCount() + " simple rules");
      double trainAcc = cc.Accuracy(trainData);
      Console.WriteLine("Accuracy on train data = " + 
        trainAcc.ToString("F4"));
      int numCorrect, numWrong, numUnknown;
      double testAcc = cc.Accuracy(testData, out numCorrect,
        out numWrong, out numUnknown);
      Console.WriteLine("Accuracy on test data  = " + 
        testAcc.ToString("F4"));
      Console.WriteLine("Number correct = " + numCorrect);
      Console.WriteLine("Number wrong   = " + numWrong);
      Console.WriteLine("Number unknown = " + numUnknown);
      Console.WriteLine("End consensus classification demo\n");
      Console.ReadLine();
    }
    static void MakeTrainTest(string[][] allData, int seed,
      out string[][] trainData, out string[][] testData) { . . }
    static void ShowData(string[][] rawData, int numRows,
      bool indices) { . . }
  } // Program
  public class ConsensusClassifier { . . }
} // ns

All the program logic is contained in the Main method. The demo has two static helper methods, MakeTrainTest and ShowData. The classification logic is contained in a single program-defined class named ConsensusClassifier. The classifier exposes six public methods: a single constructor, RuleCount (the actual number of simple rules in the model), ComputeOutput (to make predictions after the model has been created), Train (to create the model) and an overloaded Accuracy (to compute the percentage of correct predictions).
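To make the discussion easier to follow, here's a sketch of what the classifier's skeleton might look like, using the same conventions as Figure 2. The private field names, such as rules and stringToInt, are my assumptions based on the descriptions that follow, not necessarily the names used in the code download:

public class ConsensusClassifier
{
  private int numConditions;  // Boolean conditions per rule
  private int maxNumRules;    // Maximum number of rules to generate
  private List<int[]> rules;  // The rule set (assumed field name)
  private Dictionary<string, int>[] stringToInt; // Per-column value-to-ID maps
  public ConsensusClassifier(int numConditions, int maxNumRules) { . . }
  public int RuleCount() { . . }
  public void Train(string[][] trainData, double minAccuracy) { . . }
  public int ComputeOutput(string[] data) { . . }
  public double Accuracy(string[][] data) { . . }
  public double Accuracy(string[][] data, out int numCorrect,
    out int numWrong, out int numUnknown) { . . }
}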

In the Main method, the demo program hardcodes the 100-item source data into an array-of-arrays-style string matrix:

string[][] allData = new string[100][];
allData[0] = new string[] { "n", "y", "y", "n", "y", "y", "n", "n",
  "n", "n", "n", "n", "y", "y", "y", "y", "democrat" };
...

In a non-demo scenario, your data would likely be in a text file and you’d load the data into memory using a helper method. The source data is split into a training set and a test set, like so:

string[][] trainData;
string[][] testData;
MakeTrainTest(allData, 0, out trainData, out testData);

The 80/20 split percentages are hardcoded. The 0-value argument passed to method MakeTrainTest is the seed for a System.Random object so that the train-test split can assign data items randomly. An alternative is to use a stratified approach so that the proportions of Democrat and Republican data items in the training and test matrices are roughly the same as the percentages in the overall source data.
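A minimal sketch of what MakeTrainTest might look like, assuming a Fisher-Yates shuffle of row references followed by the hardcoded 80/20 split (the implementation in the download may differ in detail):

static void MakeTrainTest(string[][] allData, int seed,
  out string[][] trainData, out string[][] testData)
{
  Random rnd = new Random(seed);
  int totRows = allData.Length;
  int numTrain = (int)(totRows * 0.80); // Hardcoded 80/20 split
  int numTest = totRows - numTrain;
  string[][] copy = new string[totRows][]; // Shuffle references, not data
  for (int i = 0; i < copy.Length; ++i)
    copy[i] = allData[i];
  for (int i = 0; i < copy.Length; ++i) // Fisher-Yates shuffle
  {
    int r = rnd.Next(i, copy.Length);
    string[] tmp = copy[r]; copy[r] = copy[i]; copy[i] = tmp;
  }
  trainData = new string[numTrain][];
  testData = new string[numTest][];
  for (int i = 0; i < numTrain; ++i)
    trainData[i] = copy[i];
  for (int i = 0; i < numTest; ++i)
    testData[i] = copy[i + numTrain];
}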

The demo creates the model with this code:

int numConditions = 5;
int maxNumRules = 500;
double minAccuracy = 0.90;
ConsensusClassifier cc = 
  new ConsensusClassifier(numConditions, maxNumRules);
cc.Train(trainData, minAccuracy);

Here, I decided to pass the number of Boolean conditions in each rule and the maximum number of rules to create to the constructor, and to pass a reference to the training data and the minimum accuracy each rule must meet to method Train. Many developers prefer to pass all relevant parameters to the constructor (at the expense of a constructor with a large number of parameters). Instead, I passed each value to the method it's most directly related to (at the expense of a rather arbitrary-looking calling interface).

The demo program concludes by computing and displaying the accuracy of the model:

double trainAcc = cc.Accuracy(trainData);
Console.WriteLine("Accuracy on train data = " + trainAcc.ToString("F4"));
int numCorrect, numWrong, numUnknown;
double testAcc = cc.Accuracy(testData, out numCorrect, out numWrong,
  out numUnknown);
Console.WriteLine("Accuracy on test data  = " + testAcc.ToString("F4"));
Console.WriteLine("Number correct = " + numCorrect);
Console.WriteLine("Number wrong   = " + numWrong);
Console.WriteLine("Number unknown = " + numUnknown);

The first overloaded Accuracy returns just the percentage of correct classifications. The second overloaded Accuracy returns, in addition, the number of correct, incorrect and non-applicable data items, where non-applicable means that none of the model's rules apply to a particular data item. For example, suppose the number-of-conditions parameter was set to 3 and one of the data items has vote [0] = 'y', vote [1] = 'y' and vote [2] = 'n'. Even with 500 rules, it's possible for none of the rules to take this combination of votes into account. An alternative is to design method Accuracy so that non-applicable data items aren't possible. One common way to do this is to add a final rule that resembles ". . . else the Representative is a Democrat," where Democrat is a default value, typically the most common dependent-variable value.
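Here's a sketch of what the count-returning Accuracy overload might look like. It assumes ComputeOutput returns -1 when no rule applies (the sentinel value is my assumption) and that non-applicable items are excluded from the denominator, which is consistent with the demo's 16-of-19 test result:

public double Accuracy(string[][] data, out int numCorrect,
  out int numWrong, out int numUnknown)
{
  numCorrect = 0; numWrong = 0; numUnknown = 0;
  int lastCol = data[0].Length - 1; // Dependent-variable column
  for (int i = 0; i < data.Length; ++i)
  {
    int predicted = this.ComputeOutput(data[i]); // -1 = no rule applied
    int actual = this.stringToInt[lastCol][data[i][lastCol]];
    if (predicted == -1) ++numUnknown;
    else if (predicted == actual) ++numCorrect;
    else ++numWrong;
  }
  return (numCorrect * 1.0) / (numCorrect + numWrong); // Unknowns excluded
}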

The demo program doesn't use the model to predict the political party of a new Representative. Predicting the political party of a Representative who voted "yes" on bills [0] through [7] and "no" on bills [8] through [15] could resemble this:

string[] newData = new string[] { "y", "y", "y", "y", "y", "y", "y", "y",
  "n", "n", "n", "n", "n", "n", "n", "n" };
int party = cc.ComputeOutput(newData);
Console.WriteLine("Predicted party = " + party);

Method ComputeOutput returns an integer value, 0 or 1, where 0 indicates Republican and 1 indicates Democrat.

The Consensus Algorithm

The heart of the consensus classifier is a rule set. The two primary issues to deal with are deciding how to represent a rule, and generating the rules. The demo program represents a rule as an integer array. The scheme is best explained with a concrete example, as shown in Figure 3. Suppose the number of Boolean conditions in each rule is set to three. A single rule is then represented as an array with seven cells. If the array has values { 3, 0, 8, 1, 15, 1, 0 }, the rule corresponds to "If vote [3] is 0 and vote [8] is 1 and vote [15] is 1, then party is 0." The rule set is a generic List collection of rules.

Figure 3 Consensus Classification Data Structures
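In code, this representation might look like the following, where each array has the layout { column, value, column, value, column, value, predicted class }:

// "If vote [3] is 0 and vote [8] is 1 and vote [15] is 1, then party is 0"
int[] rule = new int[] { 3, 0, 8, 1, 15, 1, 0 };
List<int[]> rules = new List<int[]>(); // The rule set
rules.Add(rule);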

The meaning of 0 and 1 can vary for each column/vote. The demo program constructs an array of Dictionary collections, one per column/vote. Zero-based integer IDs are assigned to each string value in the order the values are encountered in each column of the training data. For example, in column [0], for vote [0], "n" is encountered first and assigned 0, and "y" is encountered next and assigned 1. But in column [1], "y" is encountered first and assigned 0, and "n" is encountered second and assigned 1.

Similarly, in column [16] of the training data, the last column, which holds the dependent-variable values "democrat" and "republican," "republican" is encountered first and so is assigned 0, and "democrat" is assigned 1. The string-to-integer mapping information is stored in a class member array named stringToInt, as shown in Figure 3. So the expression stringToInt[1]["y"] returns the zero-based index value for a yes vote on vote [1], which, for the demo training data, is 0.
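A sketch of how the stringToInt array might be populated from the training data (17 columns for the demo: 16 votes plus the party column) could look like this:

Dictionary<string, int>[] stringToInt = new Dictionary<string, int>[17];
for (int col = 0; col < 17; ++col)
{
  stringToInt[col] = new Dictionary<string, int>();
  int idx = 0;
  for (int row = 0; row < trainData.Length; ++row)
  {
    string s = trainData[row][col];
    if (stringToInt[col].ContainsKey(s) == false)
      stringToInt[col].Add(s, idx++); // IDs assigned in order encountered
  }
}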

Here’s the high-level pseudo-code for creating the rules that define the classification model:

loop while numRules < maxRules && trial < maxTrials
  ++trial
  select a random row of training data
  select numConditions random columns of selected row
  create candidate rule from selected columns of selected row
  if candidate rule already in rule list, continue
  if candidate rule does not meet minimum accuracy, continue
  candidate rule is good so add rule to rule list
end loop

A concrete example may help clarify. In the main loop, suppose row [0] of the training data is randomly selected. Next, suppose numConditions has been set to 3 and the three randomly selected columns are [1], [2] and [15]. In the training data, these columns have values "y," "n," and "y" and the dependent variable has value "republican." The column values are looked up in the array-of-Dictionary collections and are 0, 0 and 0. Therefore, the candidate rule is an integer array with values { 1, 0, 2, 0, 15, 0, 0 }, which can be interpreted as "If column 1 is 0 and column 2 is 0 and column 15 is 0, then party is 0."

Next, the 80 items in the training data are scanned to see if the candidate rule meets the minimum accuracy criterion. Note that the candidate rule will be correct for at least one data item—the item that was used to create the rule. However, the candidate rule will not necessarily be applicable to all training items. For example, the candidate rule { 1, 0, 2, 0, 15, 0, 0 } that was generated from data item [0] isn’t applicable to data item [1] because column [2] has value 1 rather than the required 0.
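Putting these pieces together, here's one way the training loop could be realized in C#. This is a sketch under several assumptions: the field names match the earlier skeleton, the maxTrials cap is arbitrary, and the helpers RuleSetContains (checks the rule list for a duplicate) and RuleAccuracy (returns the rule's accuracy on the applicable training items) are hypothetical:

public void Train(string[][] trainData, double minAccuracy)
{
  Random rnd = new Random(0);
  int numCols = trainData[0].Length - 1; // Last column is the party
  int trial = 0;
  int maxTrials = 100 * this.maxNumRules; // Arbitrary cap so the loop ends
  while (this.rules.Count < this.maxNumRules && trial < maxTrials)
  {
    ++trial;
    int row = rnd.Next(0, trainData.Length); // Random row of training data
    int[] allCols = new int[numCols]; // Shuffle column indices, then
    for (int i = 0; i < numCols; ++i) // take the first numConditions
      allCols[i] = i;
    for (int i = 0; i < numCols; ++i)
    {
      int r = rnd.Next(i, numCols);
      int tmp = allCols[r]; allCols[r] = allCols[i]; allCols[i] = tmp;
    }
    // Candidate rule: { col, val, col, val, ..., predicted class }
    int[] cand = new int[2 * this.numConditions + 1];
    for (int i = 0; i < this.numConditions; ++i)
    {
      cand[2 * i] = allCols[i];
      cand[2 * i + 1] =
        this.stringToInt[allCols[i]][trainData[row][allCols[i]]];
    }
    cand[2 * this.numConditions] =
      this.stringToInt[numCols][trainData[row][numCols]];
    if (RuleSetContains(cand) == true) continue; // Already in rule list
    if (RuleAccuracy(cand, trainData) < minAccuracy) continue;
    this.rules.Add(cand); // Candidate rule is good
  }
}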

Once the rule set has been created, the output for a given set of input data is determined by what might be described as counting votes. For each data item, all of the rules in the rule set are analyzed. Suppose, as in the demo, the rule set has 500 rules. And suppose that for one data item, 220 of the rules predict a political party of Republican, 250 of the rules predict Democrat, and 30 of the rules aren’t applicable. The classifier would predict the person associated with the data is a Democrat. In other words, the output is a consensus of the predictions from the rule set.
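A sketch of the vote-counting logic in ComputeOutput, assuming two classes and a return value of -1 when there's a tie or no rule applies (both are assumptions of this sketch):

public int ComputeOutput(string[] data)
{
  int[] votes = new int[2]; // [0] = Republican, [1] = Democrat in the demo
  foreach (int[] rule in this.rules)
  {
    bool applicable = true;
    for (int i = 0; i < this.numConditions; ++i)
    {
      int col = rule[2 * i];
      int val = rule[2 * i + 1];
      if (this.stringToInt[col][data[col]] != val)
      {
        applicable = false; // Rule doesn't apply to this data item
        break;
      }
    }
    if (applicable == true)
      ++votes[rule[rule.Length - 1]]; // Rule's predicted class gets a vote
  }
  if (votes[0] > votes[1]) return 0;
  else if (votes[1] > votes[0]) return 1;
  else return -1; // Tie or no applicable rules
}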

A Few Comments

The motivation for the consensus classification technique presented in this article came from a research project I was working on some time ago. In particular, I was trying to predict the sex—male or female—of an instant message user based on variables such as the region of the user, the age category of the user and so on. I tried every kind of standard classification approach I could think of, using existing ML tools, but none of my approaches were able to generate an effective prediction model, even though my gut feeling was that the data contained enough information to yield a signal.

Of my attempts, Naive Bayes classification seemed to work best. Naive Bayes assumes each predictor variable is mathematically independent. Even though this assumption is often not true, Naive Bayes sometimes works quite well. For my problem, the predictor variables were almost certainly related in some way, but I wasn't sure which variables were related. So I decided to combine parts of several ML techniques to create the approach described in this article. Ultimately, I was able to create a prediction model that was significantly more accurate than any standard-technique model.

The demo problem is a binary classification problem because the dependent variable can be either Democrat or Republican. Furthermore, all of the predictors are binary categorical variables because they can be either yes or no. Consensus classification can deal with multinomial classification problems (for example, Democrat, Republican, Independent), and can handle categorical predictors that have more than two values (for example, yes, no, absent, abstained). In my experiments, it was unclear how effective consensus classification is when presented with problems where the predictor variables are numeric, as with age, and have been binned into categories (for example, young, medium and old).

Combining several rules or models to generate a classification result is often called ensemble learning in ML vocabulary. The idea of generating many simple rules is used in several ML techniques, which are sometimes referred to as boosting techniques. Let me note that in my Test Run column I almost always present ML techniques that have a solid research basis. This article is an exception. The technique presented here hasn't been studied formally or subjected to serious research. Several of my academic colleagues have taken me to task for having the colossal gall to use custom, non-traditional ML techniques, but my usual, somewhat snarky response is that I'm typically more concerned with getting results than with establishing an elegant math theorem. In my opinion, using standard ML techniques is the best way to initially approach an ML problem, but when traditional techniques fail, creating a custom solution can sometimes be surprisingly effective.


Dr. James McCaffrey works for Microsoft Research in Redmond, Wash. He has worked on several Microsoft products including Internet Explorer and Bing. He can be reached at jammc@microsoft.com.

Thanks to the following technical experts at Microsoft Research for reviewing this article: Marciano Moreno Diaz Covarrubias, Delbert Murphy, David Raskino and Alisson Sol