June 2019

Volume 34 Number 6

[Test Run]

Simplified Naive Bayes Classification Using C#

By James McCaffrey | June 2019 | Get the Code

The goal of a naive Bayes classification problem is to predict a discrete value. For example, you might want to predict the authenticity of a gemstone based on its color, size and shape (0 = fake, 1 = authentic). In this article I show how to implement a simplified naive Bayes classification algorithm using the C# language.

The best way to understand where this article is headed is to take a look at the demo run in Figure 1. The demo program sets up 40 dummy data items. Each item has three predictor values: color (Aqua, Blue, Cyan, Dune), size (Small, Large), and shape (Pointed, Rounded, Twisted), followed by a binary class to predict (0 or 1).

Figure 1 Simplified Naive Bayes Classification Demo Run

The demo sets up an item to predict: (Cyan, Small, Twisted). Naive Bayes classification is based on probabilities, which in turn are based on counts in the data. The demo scans the 40-item dataset and computes and displays six joint counts: items that have both Cyan and 0 (3 items), Cyan and 1 (5), Small and 0 (18), Small and 1 (15), Twisted and 0 (5), Twisted and 1 (3). The demo also counts the number of class 0 items (24) and class 1 items (16).

Using the count information, the demo calculates intermediate values called evidence terms (0.0082, 0.0131) and then uses the evidence terms to calculate pseudo-probabilities (0.3855, 0.6145) that correspond to predictions of class 0 and class 1. Because the second pseudo-probability is larger, the conclusion is that (Cyan, Small, Twisted) is class 1.

This article assumes you have intermediate or better programming skill with C# or another programming language such as Python or Java, but doesn't assume you know anything about naive Bayes classification. The complete demo code and the associated data are presented in this article. The source code and the data are also available in the accompanying download. All normal error checking has been removed to keep the main ideas as clear as possible.

Understanding Naive Bayes Classification

Naive Bayes classification is both simple and complicated. Implementation is relatively simple, but the underlying math ideas are very complex. There are many different math equations that define naive Bayes classification. Three examples are shown in Figure 2. Letter P means probability; Ck is one of k classes (0, 1, 2, . . .); the pipe symbol is read “given that”; and X is a vector of input values such as (Cyan, Small, Twisted). Unfortunately, these equations don’t provide much help when implementing naive Bayes classification until after you understand the technique.

Figure 2 Three Forms of Naive Bayes Classification Math
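
For reference, one commonly cited form of the math, written in plain text for the demo's three predictors, is Bayes' rule combined with the naive assumption that the predictor values are independent given the class:

P(Ck | X) = P(Ck) * P(x1 | Ck) * P(x2 | Ck) * P(x3 | Ck) / P(X)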

In my opinion, naive Bayes classification is best explained using a concrete example. The first step is to scan through the source data and compute joint counts. If there are nx predictor variables (three in the demo) and nc classes (two in the demo), then there are nx * nc joint counts to compute. Notice that the counting process means that predictor data must be discrete rather than numeric.

After calculating joint counts, 1 is added to each count. This is called Laplacian smoothing and is done to prevent any joint count from being 0, which would zero out the final results. For the demo data the smoothed joint counts are:

cyan and 0:     2 + 1 = 3
cyan and 1:     4 + 1 = 5
small and 0:   17 + 1 = 18
small and 1:   14 + 1 = 15
twisted and 0:  4 + 1 = 5
twisted and 1:  2 + 1 = 3

Next, the raw counts of class 0 items and class 1 items are calculated. Because these counts will always be greater than zero, no smoothing factor is needed. For the demo data, the class counts are:

0:  24
1:  16

Next, an evidence term for each class is calculated. For class 0, the evidence term is:

Z(0) = (3 / (24+3)) * (18 / (24+3)) * (5 / (24+3)) * (24 / 40)
        = 3/27 * 18/27 * 5/27 * 24/40
        = 0.1111 * 0.6667 * 0.1852 * 0.6000
        = 0.0082

The first three terms of the calculation for Z(0) use the smoothed joint counts for class 0, each divided by the class count for 0 (24) plus the number of predictor variables (nx = 3), which compensates for the three additions of 1 due to the Laplacian smoothing. The fourth term is P(class 0). The calculation of the class 1 evidence term follows the same pattern:

Z(1) = (5 / (16+3)) * (15 / (16+3)) * (3 / (16+3)) * (16 / 40)
        = 5/19 * 15/19 * 3/19 * 16/40
        = 0.2632 * 0.7895 * 0.1579 * 0.4000
        = 0.0131
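
Both evidence terms follow the same general pattern. For class k, with the three demo predictor values x1, x2, x3 and N = 40 data items, using the smoothed joint counts:

Z(k) = (joint count of x1 and k) / (class count of k + nx) *
        (joint count of x2 and k) / (class count of k + nx) *
        (joint count of x3 and k) / (class count of k + nx) *
        (class count of k) / N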

The last step is to compute pseudo-probabilities:

P(class 0) = Z(0) / (Z(0) + Z(1))
                 = 0.0082 / (0.0082 + 0.0131)
                 = 0.3855
P(class 1) = Z(1) / (Z(0) + Z(1))
                 = 0.0131 / (0.0082 + 0.0131)
                 = 0.6145

The denominator sum is called the evidence and is used to normalize the evidence terms so that they sum to 1.0 and can be loosely interpreted as probabilities. Note that if you’re just interested in prediction, you can simply use the largest evidence term and skip the evidence normalization step.
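
In code, that shortcut amounts to applying the ArgMax helper directly to the array of evidence terms (both the eTerms array and ArgMax are defined later, in Figure 3), skipping the normalization loop entirely:

// prediction only -- no normalization needed
int pc = ArgMax(eTerms);  // index of the largest evidence term
Console.WriteLine("Predicted class: " + pc);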

The Demo Program

The complete demo program, with a few minor edits to save space, is presented in Figure 3. To create the program, I launched Visual Studio and created a new console application named NaiveBayes. I used Visual Studio 2017, but the demo has no significant .NET Framework dependencies so any version of Visual Studio will work fine.

Figure 3 The Naive Bayes Classification Demo Program

using System;
using System.IO;
namespace CSharp
{
  class Program
  {
    static void Main(string[] args)
    {
      Console.WriteLine("\nBegin simple naive Bayes demo\n");
      string fn = "..\\..\\BayesData.txt";  // Data file shown in Figure 4
      int nx = 3;  // Number predictor variables
      int nc = 2;  // Number classes
      int N = 40;  // Number data items
      string[][] data = LoadData(fn, N, nx+1, ',');
      Console.WriteLine("Training data:");
      for (int i = 0; i < 5; ++i) {
        Console.Write("[" + i + "] ");
        for (int j = 0; j < nx+1; ++j) {
          Console.Write(data[i][j] + " ");
        }
        Console.WriteLine("");
      }
      Console.WriteLine(". . . \n");
      int[][] jointCts = MatrixInt(nx, nc);
      int[] yCts = new int[nc];
      string[] X = new string[] { "Cyan", "Small", "Twisted" };
      Console.WriteLine("Item to classify: ");
      for (int i = 0; i < nx; ++i)
        Console.Write(X[i] + " ");
      Console.WriteLine("\n");
      // Compute joint counts and y counts
      for (int i = 0; i < N; ++i) {
        int y = int.Parse(data[i][nx]);
        ++yCts[y];
        for (int j = 0; j < nx; ++j) {
          if (data[i][j] == X[j])
            ++jointCts[j][y];
        }
      }
      // Laplacian smoothing
      for (int i = 0; i < nx; ++i)
        for (int j = 0; j < nc; ++j)
          ++jointCts[i][j];
      Console.WriteLine("Joint counts: ");
      for (int i = 0; i < nx; ++i) {
        for (int j = 0; j < nc; ++j) {
          Console.Write(jointCts[i][j] + " ");
        }
        Console.WriteLine("");
      }
      Console.WriteLine("\nClass counts: ");
      for (int k = 0; k < nc; ++k)
        Console.Write(yCts[k] + " ");
      Console.WriteLine("\n");
      // Compute evidence terms
      double[] eTerms = new double[nc];
      for (int k = 0; k < nc; ++k) {
        double v = 1.0;
        for (int j = 0; j < nx; ++j) {
          v *= (double)(jointCts[j][k]) / (yCts[k] + nx);
        }
        v *= (double)(yCts[k]) / N;
        eTerms[k] = v;
      }
      Console.WriteLine("Evidence terms:");
      for (int k = 0; k < nc; ++k)
        Console.Write(eTerms[k].ToString("F4") + " ");
      Console.WriteLine("\n");
      double evidence = 0.0;
      for (int k = 0; k < nc; ++k)
        evidence += eTerms[k];
      double[] probs = new double[nc];
      for (int k = 0; k < nc; ++k)
         probs[k] = eTerms[k] / evidence;
      Console.WriteLine("Probabilities: ");
      for (int k = 0; k < nc; ++k)
        Console.Write(probs[k].ToString("F4") + " ");
      Console.WriteLine("\n");
      int pc = ArgMax(probs);
      Console.WriteLine("Predicted class: ");
      Console.WriteLine(pc);
      Console.WriteLine("\nEnd naive Bayes ");
      Console.ReadLine();
    } // Main
    static string[][] MatrixString(int rows, int cols)
    {
      string[][] result = new string[rows][];
      for (int i = 0; i < rows; ++i)
        result[i] = new string[cols];
      return result;
    }
    static int[][] MatrixInt(int rows, int cols)
    {
      int[][] result = new int[rows][];
      for (int i = 0; i < rows; ++i)
        result[i] = new int[cols];
      return result;
    }
    static string[][] LoadData(string fn, int rows,
      int cols, char delimit)
    {
      string[][] result = MatrixString(rows, cols);
      FileStream ifs = new FileStream(fn, FileMode.Open);
      StreamReader sr = new StreamReader(ifs);
      string[] tokens = null;
      string line = null;
      int i = 0;
      while ((line = sr.ReadLine()) != null)  {
        tokens = line.Split(delimit);
        for (int j = 0; j < cols; ++j)
          result[i][j] = tokens[j];
        ++i;
      }
      sr.Close(); ifs.Close();
      return result;
    }
    static int ArgMax(double[] vector)
    {
      int result = 0;
      double maxV = vector[0];
      for (int i = 0; i < vector.Length; ++i) {
        if (vector[i] > maxV) {
          maxV = vector[i];
          result = i;
        }
      }
      return result;
    }
  } // Program
} // ns

After the template code loaded, in the editor window I removed all unneeded namespace references and added a reference to the System.IO namespace. In the Solution Explorer window, I right-clicked on file Program.cs, renamed it to the more descriptive NaiveBayesProgram.cs, and allowed Visual Studio to automatically rename class Program.

After building the project, I used Notepad to create the 40-item dummy data file with the contents shown in Figure 4, and saved it as BayesData.txt in the project root directory.

Figure 4 File BayesData.txt Contents

Aqua,Small,Twisted,1
Blue,Small,Pointed,0
Dune,Large,Rounded,0
Dune,Small,Rounded,1
Cyan,Large,Rounded,0
Aqua,Small,Rounded,1
Aqua,Small,Rounded,0
Cyan,Small,Pointed,1
Cyan,Small,Pointed,1
Dune,Small,Rounded,1
Dune,Small,Rounded,0
Dune,Small,Rounded,1
Dune,Small,Rounded,1
Cyan,Small,Pointed,1
Dune,Small,Rounded,1
Dune,Large,Rounded,0
Cyan,Small,Twisted,1
Blue,Small,Rounded,0
Aqua,Small,Pointed,1
Aqua,Small,Pointed,1
Dune,Small,Twisted,0
Blue,Small,Rounded,0
Dune,Small,Rounded,0
Blue,Small,Twisted,0
Dune,Small,Rounded,0
Aqua,Large,Pointed,1
Dune,Large,Rounded,0
Dune,Small,Rounded,0
Dune,Small,Rounded,0
Cyan,Large,Rounded,0
Dune,Small,Twisted,0
Dune,Large,Twisted,0
Dune,Small,Rounded,0
Dune,Small,Rounded,0
Dune,Large,Rounded,0
Aqua,Large,Rounded,1
Aqua,Small,Rounded,0
Aqua,Small,Rounded,1
Dune,Small,Rounded,0
Blue,Small,Rounded,0

Loading Data into Memory

The demo uses a program-defined method named LoadData to read the data file into memory as an array-of-arrays style matrix of type string. Method LoadData assumes the class values are the last value on each line of data. An alternative design is to read the predictor values into a string matrix and read the 0-1 class values into an array of type integer.
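
A minimal sketch of that alternative design, assuming the same comma-delimited file layout used by the demo (the method name LoadDataAlt and the out-parameter are illustrative, not part of the demo program):

static string[][] LoadDataAlt(string fn, int rows, int nx,
  char delimit, out int[] labels)
{
  // nx predictor strings per line; the class label (last token) goes into labels[]
  string[][] preds = MatrixString(rows, nx);
  labels = new int[rows];
  string[] lines = File.ReadAllLines(fn);
  for (int i = 0; i < rows; ++i) {
    string[] tokens = lines[i].Split(delimit);
    for (int j = 0; j < nx; ++j)
      preds[i][j] = tokens[j];
    labels[i] = int.Parse(tokens[nx]);  // 0 or 1
  }
  return preds;
}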

Method LoadData calls a helper method named MatrixString, which creates an array-of-arrays string matrix with the specified number of rows (40) and columns (4). An alternative design is to programmatically compute the number of rows and columns, but in my opinion the hardcoded approach is simpler and better.
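
If you'd rather compute the dimensions than hardcode them, one possible approach (not part of the demo program) is to scan the file first:

// Determine rows and cols from the file instead of hardcoding 40 and 4
string[] allLines = File.ReadAllLines(fn);
int numRows = allLines.Length;                 // 40 for the demo file
int numCols = allLines[0].Split(',').Length;   // 4 = three predictors plus the class
string[][] data = LoadData(fn, numRows, numCols, ',');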

Program Logic

All of the program control logic is contained in the Main method. The joint counts are stored in an array-of-arrays integer matrix named jointCts, which is created using a helper method called MatrixInt. The row indices of jointCts correspond to the predictor values of the item to classify (Cyan, Small, Twisted) and the column indices correspond to the class values. For example, jointCts[2][1] holds the count of data items that have both Twisted and class 1. An integer array with length 2 named yCts holds the number of data items with class 0 and class 1.

The code that uses the joint counts and class counts to compute the two evidence terms is:

double[] eTerms = new double[nc];
for (int k = 0; k < nc; ++k) {
  double v = 1.0;
  for (int j = 0; j < nx; ++j) {
    v *= (double)(jointCts[j][k]) / (yCts[k] + nx);
  }
  v *= (double)(yCts[k]) / N;
  eTerms[k] = v;
}

Notice that v is the product of nx+1 fractions that are all less than 1.0. This isn’t a problem for the demo program because there are only nx = 3 predictor values and none of the fraction terms can ever be less than 1/27 = 0.0370. But in some situations, you could run into arithmetic trouble by multiplying many very small values.

A technique for avoiding arithmetic errors when computing the evidence terms is the log trick, which takes advantage of the facts that log(A * B) = log(A) + log(B) and log(A / B) = log(A) - log(B). By computing the log of each evidence term and then applying the exp function to that result, you replace the multiplication and division of many small values with addition and subtraction of their logarithms. One possible refactoring using the log trick is:

double[] eTerms = new double[nc];
for (int k = 0; k < nc; ++k) {
  double v = 0.0;
  for (int j = 0; j < nx; ++j) {
    v += Math.Log(jointCts[j][k]) - Math.Log(yCts[k] + nx);
  }
  v += Math.Log(yCts[k]) - Math.Log(N);
  eTerms[k] = Math.Exp(v);
}

Generating the Predicted Class

After the evidence terms have been computed, the demo program sums them and uses them to compute pseudo-probabilities. I call these values pseudo-probabilities because, although they sum to 1.0, they don't really represent the likelihood of a result from many repeated sampling experiments. However, you can cautiously interpret pseudo-probabilities as mild forms of confidence. For example, pseudo-probabilities of (0.97, 0.03) suggest class 0 more strongly than pseudo-probabilities of (0.56, 0.44) do.

The predicted class is generated by calling the program-defined ArgMax function, which returns the index of the largest value in a numeric array. For example, if an array holds values (0.20, 0.50, 0.90, 0.10) then ArgMax returns 2.

Wrapping Up

The demo program presented in this article performs binary classification because there are only two class values. The program logic can also be used without modification for multiclass classification. The predictor values in the demo program are all categorical. If your data includes numeric predictor values, such as a weight variable with values like 3.78 and 9.21, you can apply the technique presented in this article by binning the numeric data into categories such as light, medium and heavy.
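
A minimal sketch of such a binning helper, where the cut points 5.0 and 8.0 are hypothetical values you'd choose to suit your data:

// Map a numeric weight onto a categorical value before counting
static string BinWeight(double weight)
{
  if (weight < 5.0) return "light";        // hypothetical threshold
  else if (weight < 8.0) return "medium";  // hypothetical threshold
  else return "heavy";
}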

There are several other forms of naive Bayes classification in addition to the type presented in this article. One form uses the same underlying math principles as those used by the demo program, but can handle data where all the predictor values are numeric. However, you must make assumptions about the math properties of the data, such as assuming each predictor follows a normal (Gaussian) distribution with a certain mean and standard deviation.
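
With that assumption, each count-based fraction in an evidence term is replaced by a value computed from the normal distribution. A sketch of the Gaussian probability density function that would be used, where the mean and standard deviation are estimated from the training items of each class:

// Gaussian probability density -- substitutes for a count-based fraction
// when a predictor is numeric and assumed normally distributed
static double GaussianPdf(double x, double mean, double sd)
{
  double z = (x - mean) / sd;
  return Math.Exp(-0.5 * z * z) / (sd * Math.Sqrt(2.0 * Math.PI));
}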


Dr. James McCaffrey works for Microsoft Research in Redmond, Wash. He has worked on several key Microsoft products including Azure and Bing. Dr. McCaffrey can be reached at jamccaff@microsoft.com.

Thanks to the following Microsoft technical experts for reviewing this article: Chris Lee, Ricky Loynd, Kirk Olynyk

