March 2013

Volume 28 Number 03

Data Clustering - Data Clustering Using Naive Bayes Inference

By James McCaffrey | March 2013

Data clustering is a machine-learning technique that has many important practical applications, such as grouping sales data to reveal consumer-buying behavior, or grouping network data to give insights into communication patterns. Data clustering is also useful for identifying anomalous data points. In this article I present a complete C# program you can use as the basis for adding clustering features to a software system or for creating a powerful standalone clustering utility.

There are many different clustering algorithms, in part because the effectiveness of a clustering algorithm depends to some extent on the characteristics of the data being clustered. The most common algorithm is called k-means clustering. Unfortunately, this algorithm is applicable only for numeric data items. In contrast, the clustering algorithm I’ll present in this article is based on a technique called Naive Bayes inference, which works with either categorical or numeric data. Although all the ideas used in the clustering algorithm presented here are well-known, the overall algorithm and specific implementation have not, to the best of my knowledge, been published before. I call this algorithm and its implementation Iterative Naive Bayesian Inference Agglomerative Clustering (INBIAC) to distinguish it from other clustering techniques. Naive Bayes inference is a very common technique for performing data classification, but it’s not generally known that Naive Bayes can also be used for data clustering.

The best way to understand where I’m headed in this article is to take a look at Figure 1. The demo program begins by generating eight random data tuples that describe the location (urban, suburban or rural), income (low, medium, high or very high) and politics (liberal or conservative) of eight hypothetical people. The program then loads the raw string data into memory, and converts the data into int values to enable efficient processing. For example, the last tuple of (“Rural,” “Low,” “Conservative”) is stored as (2, 0, 1).

Figure 1 Data Clustering Using Naive Bayes Inference

Many clustering algorithms, including INBIAC, require the number of clusters to be specified. Here, variable numClusters is set to 3. The demo program clusters the data and then displays the final clustering of [2, 0, 2, 1, 1, 2, 1, 0]. Behind the scenes, the algorithm seeds clusters 0, 1 and 2 with tuples 1, 6 and 0, respectively. Then each of the remaining five tuples is assigned, one at a time, to a cluster. This type of algorithm is called agglomerative.

Because no training data or user intervention is required, clustering is sometimes termed “unsupervised learning.” Each index in the clustering array represents a tuple and the value in the array is a cluster ID. So tuple 2 (“Suburban,” “VeryHigh,” “Conservative”) is assigned to cluster 2, tuple 1 (“Rural,” “Medium,” “Conservative”) is assigned to cluster 0 and so on.

The demo program finishes up by displaying the original raw data in clustered form. As you can see, the final clustering seems reasonable. And if you look at the original data, I think you’ll agree that trying to cluster data manually, even for very small data sets, would be extremely difficult.

This article assumes you have advanced programming skills with a C-family language, but does not assume you have experience with Naive Bayes inference or clustering algorithms. The demo program shown in Figure 1 is a single C# console application. I coded it without using OOP techniques so you can more easily refactor to languages that don’t fully support OOP.

I removed all error checking to keep the ideas as clear as possible. The code for the demo program is too long to present in its entirety in this article, so I’ll focus on explaining the key parts of the algorithm. The complete source code for the demo program is available at msdn.com/magazine/msdnmag0313.

Program Structure

The overall program structure, with some comments and WriteLine statements removed, is listed in Figure 2.

Figure 2 Clustering Program Structure

using System;
using System.Collections.Generic;
namespace ClusteringBayesian
{
  class ClusteringBayesianProgram
  {
    static Random random = null;
    static void Main(string[] args)
    {
      try
      {
        Console.WriteLine("\nBegin data clustering using Naive Bayes demo\n");
        random = new Random(6); // Seed of 6 gives a nice demo
        int numTuples = 8;
        Console.WriteLine("Generating " + numTuples +
          "tuples of location-income-politics data");
        string[] rawData = MakeData(numTuples);
        Console.WriteLine("\nRaw data in string form is:\n");
        ShowData(rawData, numTuples);
        string[] attNames = new string[] { "Location", "Income", "Politics" };
        string[][] attValues = new string[attNames.Length][];
        attValues[0] = new string[] { "Urban", "Suburban", "Rural" };      // Location
        attValues[1] = new string[] { "Low", "Medium", "High", "VeryHigh" }; // Income
        attValues[2] = new string[] { "Liberal", "Conservative" };         // Politics
        Console.WriteLine("Loading raw data into memory\n");
        string[][] tuples = LoadData(rawData, attValues);
        Console.WriteLine("Converting raw data to int form and" +
          "storing in memory\n");
        int[][] tuplesAsInt = TuplesToInts(tuples, attValues);
        Console.WriteLine("Data in int form is:\n");
        ShowData(tuplesAsInt, numTuples);
        int numClusters = 3;
        int numTrials = 10;  // Times to get good seed indexes (different tuples)
        Console.WriteLine("\nSetting numClusters to " + numClusters);
        Console.WriteLine("\nInitializing Naive Bayes joint counts with " +
          "Laplacian smoothing");
        int[][][] jointCounts = InitJointCounts(tuplesAsInt, attValues,
          numClusters);
        int[] clusterCounts = new int[numClusters];
        Console.WriteLine("\nBegin clustering data using Naive Bayes " +
          "(with equal priors) ");
        int[] clustering = Cluster(tuplesAsInt, numClusters, numTrials,
          jointCounts, clusterCounts, true);
        Console.WriteLine("Clustering complete");
        Console.WriteLine("\nClustering in internal form is:\n");
        for (int i = 0; i < clustering.Length; ++i)
          Console.Write(clustering[i] + " ");
        Console.WriteLine("Raw data displayed in clusters is:\n");
        DisplayClustered(tuplesAsInt, numClusters, clustering, attValues);
        Console.WriteLine("\nEnd demo\n");
      }
      catch (Exception ex)
      {
        Console.WriteLine(ex.Message);
      }
    } // Main
    // Other methods go here
  } // class
} // ns

I used Visual Studio 2010 to create a C# console app named ClusteringBayesian. In the Solution Explorer window I renamed file Program.cs to the more descriptive ClusteringBayesianProgram.cs, which automatically renamed the single class. I removed unneeded template-generated references to the .NET namespaces; notice the program has few dependencies and needs only the System.Collections.Generic namespace.

I declare a class-scope Random object named random. This object is used to generate semi-random raw data, and to determine which tuples to use to seed each cluster. In the Main method, I instantiate random using a seed value of 6, only because this generates a nice-looking demo.

The next few lines of the demo program use helper methods MakeData and ShowData to generate and display eight lines of dummy data. MakeData calls sub-helper methods MakeLocation, MakeIncome and MakePolitics. If you want to experiment with specific hardcoded data tuples, you can replace this code with, for example:

int numTuples = 4;
string[] rawData = new string[] { "Urban,VeryHigh,Liberal",
  "Rural,Medium,Conservative", "Urban,Low,Liberal",
  "Suburban,Medium,Liberal" };

Next, the demo program hardcodes the attribute names Location, Income and Politics. In some scenarios, for example if your raw data is in a text file with a header or in a SQL table with column names, you may want to scan your data and programmatically determine the attribute names.

The demo program also hardcodes the attribute values Urban, Suburban and Rural for Location; Low, Medium, High and VeryHigh for Income; and Liberal and Conservative for Politics. Again, in some scenarios you may want to scan your raw data to programmatically determine the attribute values.

The demo program calls helper method LoadData to load the raw string data into an array of arrays in memory. For very large data sets, this might not be possible. In such situations you’ll need to either load blocks of data at a time or iterate through the externally stored raw data one line or row at a time.
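
Method LoadData isn’t central to the algorithm, so here’s just a minimal sketch, assuming each raw line is comma-delimited as in the example above. The attValues parameter could be used to validate each field, but that error checking is omitted:

static string[][] LoadData(string[] rawData, string[][] attValues)
{
  string[][] result = new string[rawData.Length][];
  for (int i = 0; i < rawData.Length; ++i)
    result[i] = rawData[i].Split(','); // e.g. "Rural,Low,Conservative" -> 3 fields
  return result;
}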

Although it’s possible to work with raw string data directly, it’s much more efficient to work with the data encoded as type int. So, next, helper method TuplesToInts accepts the array of arrays of strings, scans through that array, converts each string to a zero-based index, and stores all the data in an array of arrays of type int. Again, for very large data sets you may need to read raw data a line at a time and convert attribute values to ints on the fly.
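
A minimal sketch of TuplesToInts simply looks up the position of each string value within the corresponding attValues array (error checking is omitted, as in the rest of the demo):

static int[][] TuplesToInts(string[][] tuples, string[][] attValues)
{
  int[][] result = new int[tuples.Length][];
  for (int i = 0; i < tuples.Length; ++i)
  {
    result[i] = new int[tuples[i].Length];
    for (int j = 0; j < tuples[i].Length; ++j)
      result[i][j] = Array.IndexOf(attValues[j], tuples[i][j]); // e.g. "Rural" -> 2
  }
  return result;
}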

The demo program prepares the clustering routine by setting two parameter values. Parameter numClusters is the number of clusters to generate. In general, data clustering is an exploratory process and you must experiment with different values of numClusters (although there are some fascinating techniques for programmatically determining an optimal number of clusters). Parameter numTrials is used by the clustering routine when generating the initial clustering, as I’ll explain shortly.

The clustering method requires two arrays to hold the counts of assigned tuples at any given time in the INBIAC algorithm. Array jointCounts holds the number of clustered tuples that have a particular attribute value and a particular cluster. The jointCounts array is a bit tricky, and I’ll explain it in more detail shortly. But each value in jointCounts is initialized to 1 as part of an important step in the Bayesian technique called Laplacian smoothing. The second array, clusterCounts, holds the number of tuples assigned to each cluster at any given time. The index of clusterCounts represents a cluster, and the value is the associated count.

The clustering is performed by method Cluster. Method Cluster returns an array that encodes which cluster is assigned to each tuple. The demo program concludes by displaying the clustering array, and by displaying the raw data grouped by cluster using helper method DisplayClustered.
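
Helper method DisplayClustered walks each cluster in turn and prints the tuples assigned to it, mapping int values back to their string equivalents. A minimal sketch (the formatting details in the download may differ) is:

static void DisplayClustered(int[][] tuplesAsInt, int numClusters,
  int[] clustering, string[][] attValues)
{
  for (int c = 0; c < numClusters; ++c)
  {
    Console.WriteLine("===============");
    for (int i = 0; i < tuplesAsInt.Length; ++i)
    {
      if (clustering[i] != c) continue; // Tuple i is not in cluster c
      for (int j = 0; j < tuplesAsInt[i].Length; ++j)
        Console.Write(attValues[j][tuplesAsInt[i][j]] + " ");
      Console.WriteLine("");
    }
  }
  Console.WriteLine("===============");
}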

Naive Bayes Inference

The key to understanding the INBIAC algorithm so that you’ll be able to modify the demo code to meet your own needs is understanding Naive Bayes inference. Naive Bayes is best explained by example. Suppose you have the eight tuples shown in Figure 1 and you want to place each tuple into one of three clusters:

[0] Suburban VeryHigh  Liberal
[1] Rural    Medium    Conservative
[2] Suburban VeryHigh  Conservative
[3] Suburban Low       Liberal
[4] Suburban High      Liberal
[5] Suburban Low       Conservative
[6] Urban    Low       Liberal
[7] Rural    Low       Conservative

First, assume that each cluster receives a single seed tuple (in a way that will be explained later) so that tuple 1 is assigned to cluster 0, tuple 6 is assigned to cluster 1, and tuple 0 is assigned to cluster 2. There are several ways to represent this clustering, but the INBIAC algorithm would store the information as an array:

[  2,  0, -1, -1, -1, -1,  1, -1 ]

In this array, the array indexes are tuples, the array values are clusters, and a value of -1 indicates the associated tuple has not yet been assigned to a cluster. So, conceptually, the clustering at this point is:

c0 : Rural    Medium   Conservative  (tuple 1)
c1 : Urban    Low      Liberal       (tuple 6)
c2 : Suburban VeryHigh Liberal       (tuple 0)

Now the goal is to assign the first unassigned tuple, tuple 2 (Suburban, VeryHigh, Conservative), to one of the three clusters. Naive Bayes computes the probabilities that tuple 2 belongs to clusters 0, 1 and 2, and then assigns the tuple to the cluster that has the greatest probability. Expressed symbolically, if X = (Suburban, VeryHigh, Conservative) and c0 stands for cluster 0, we want:

P(c0 | X)
P(c1 | X)
P(c2 | X)

The first probability can be read as, “the probability of cluster 0 given the X values.” Now, bear with me for a moment. To compute these conditional probabilities, you have to compute terms I call partial probabilities (PP). After the partials for each of the conditionals are computed, the probability for each cluster is equal to the partial for the cluster divided by the sum of all the partials. Symbolically:

P(c0 | X) = PP(c0 | X) / [PP(c0 | X) + PP(c1 | X) + PP(c2 | X)]
P(c1 | X) = PP(c1 | X) / [PP(c0 | X) + PP(c1 | X) + PP(c2 | X)]
P(c2 | X) = PP(c2 | X) / [PP(c0 | X) + PP(c1 | X) + PP(c2 | X)]

The partial probability for P(c0 | X) is:

PP(c0 | X) = P(Suburban | c0) *
             P(VeryHigh | c0) *
             P(Conservative | c0) *
             P(c0)
           = count(Suburban     & co) / count c0 *
             count(VeryHigh     & co) / count c0 *
             count(Conservative & c0) / count c0 *
             P(c0)

These equations derive from Bayes’ theorem and aren’t at all obvious. The term “naive” in Naive Bayes means that each of the probability terms in the partial probabilities is assumed to be mathematically independent, which leads to much simpler calculations than if the probability terms were mathematically dependent.

The last term, P(c0), is the “probability of cluster 0.” This term is sometimes called a prior and can be computed in one of two ways. One way is to assume equal priors, in which case the P(c0) = P(c1) = P(c2) = 1/3. The other way is to not assume equal priors and use the current counts of assigned tuples, in which case P(c0) = count(c0) / (count(c0) + count(c1) + count(c2)). For the INBIAC algorithm, it’s preferable to assume equal priors.

Notice that the equation needs what are called joint counts. For example, count(Suburban & c0) is the count of assigned tuples, where the cluster is c0 and the tuple location is Suburban.

If you look back at the current clustering, you’ll see there’s a problem: at this point, with only the first three seed tuples assigned to clusters, the count(Suburban & c0) is 0, which makes its term 0, which zeros out the entire partial probability. To avoid this, the joint counts are all initialized with a value of 1. This is called Laplacian smoothing. Laplacian smoothing also adds 3, the number of clusters, to the denominators of the conditional probabilities (but not the unconditional probability term). So the modified computation for the partial for c0 is:

PP(c0 | X) = P(Suburban     | c0) *
             P(VeryHigh     | c0) *
             P(Conservative | c0) *
             P(c0)
           = count(Suburban     & co) / (count c0 + 3) *
             count(VeryHigh     & co) / (count c0 + 3) *
             count(Conservative & c0) / (count c0 + 3) *
             1 / numClusters
           = (1 / 4) * (1 / 4) * (2 / 4) * (1 / 3)
           = 0.0104

Similarly, the partial probabilities for c1 and c2 are:

PP(c1 | X) = count(Suburban     & c1) / (count c1 + 3) *
             count(VeryHigh     & c1) / (count c1 + 3) *
             count(Conservative & c1) / (count c1 + 3) *
             1 / numClusters
           = (1 / 4) * (1 / 4) * (1 / 4) * (1 / 3)
           = 0.0052
PP(c2 | X) = count(Suburban     & c2) / (count c2 + 3) *
             count(VeryHigh     & c2) / (count c2 + 3) *
             count(Conservative & c2) / (count c2 + 3) *
             1 / numClusters
           = (2 / 4) * (2 / 4) * (1 / 4) * (1 / 3)
           = 0.0208

After the partials for each cluster have been calculated, it’s easy to compute the probabilities for each cluster that are needed to assign tuple 2 to a cluster. Here are the computations:

P(c0 | X) = PP(c0 | X) / [PP(c0 | X) + PP(c1 | X) + PP(c2 | X)]
          = 0.0104 / (0.0104 + 0.0052 + 0.0208)
          = 0.2857
P(c1 | X) = PP(c1 | X) / [PP(c0 | X) + PP(c1 | X) + PP(c2 | X)]
          = 0.0052 / (0.0104 + 0.0052 + 0.0208)
          = 0.1429
P(c2 | X) = PP(c2 | X) / [PP(c0 | X) + PP(c1 | X) + PP(c2 | X)]
          = 0.0208 / (0.0104 + 0.0052 + 0.0208)
          = 0.5714

The tuple is assigned to the cluster with the greatest probability, which in this case is cluster c2.

To summarize, to assign a tuple to a cluster, the partial probabilities for each cluster are computed using the joint counts of tuples that have already been assigned. The partials are used to compute the probability that the tuple belongs to each cluster. The tuple is assigned to the cluster that has the greatest probability.

Key Data Structures

There are several ways to store the various data used by the INBIAC clustering algorithm. Figure 3 shows most of the data structures used by the demo program. Array jointCounts is used to compute the partial probabilities that in turn are used to compute cluster probabilities that in turn are used to assign a tuple to a cluster. There’s one joint count for every combination of attribute value and cluster. So, for the demo, because there are nine attribute values (Urban, Suburban, ... Conservative) and three clusters, there are 9 * 3 = 27 joint counts. The first index in jointCounts indicates the attribute, the second index indicates the attribute value and the third index indicates the cluster. For example, jointCounts[1][3][2] holds the count of assigned tuples where the Income (1) is VeryHigh (3) and cluster is (2).

Figure 3 Key Data Structures
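
To make the structure of jointCounts concrete, here’s a minimal sketch of method InitJointCounts. The allocation depends only on the attribute values and the number of clusters, so the tuplesAsInt parameter isn’t actually needed here; every cell starts at 1 for Laplacian smoothing:

static int[][][] InitJointCounts(int[][] tuplesAsInt, string[][] attValues,
  int numClusters)
{
  int[][][] jointCounts = new int[attValues.Length][][]; // [attribute]
  for (int i = 0; i < attValues.Length; ++i)
  {
    jointCounts[i] = new int[attValues[i].Length][];     // [attribute value]
    for (int j = 0; j < attValues[i].Length; ++j)
    {
      jointCounts[i][j] = new int[numClusters];          // [cluster]
      for (int k = 0; k < numClusters; ++k)
        jointCounts[i][j][k] = 1; // Laplacian smoothing: start every count at 1
    }
  }
  return jointCounts;
}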

Array clustering encodes how tuples are assigned to clusters. The index of array clustering represents a tuple, and the cell value represents a cluster. For example, if clustering[0] = 2, then tuple 0 is assigned to cluster 2. Cell values of -1 indicate the associated tuple has not yet been assigned to a cluster.

Implementing the Clustering Method

Method Cluster is listed in Figure 4. The method accepts as inputs the tuplesAsInt array (the data to be clustered), numClusters (the number of clusters to use), numTrials (used when selecting the initial seed tuples, as I’ll explain shortly), jointCounts (as explained earlier), clusterCounts (the number of tuples assigned to each cluster, needed if computing with non-equal priors) and equalPriors (a Boolean that indicates how to compute the probability of each cluster when computing partial probabilities).

Figure 4 Method Cluster

static int[] Cluster(int[][] tuplesAsInt, int numClusters,
  int numTrials, int[][][] jointCounts, int[] clusterCounts,
  bool equalPriors)
{
  int numRows = tuplesAsInt.Length;
  int[] clustering = new int[numRows];
  for (int i = 0; i < clustering.Length; ++i)
    clustering[i] = -1;
  int[] goodIndexes = GetGoodIndexes(tuplesAsInt, numClusters, numTrials);
  for (int i = 0; i < goodIndexes.Length; ++i)
  {
    int idx = goodIndexes[i];
    clustering[idx] = i;
  }
  for (int i = 0; i < goodIndexes.Length; ++i)
  {
    int idx = goodIndexes[i];
    for (int j = 0; j < tuplesAsInt[idx].Length; ++j)
    {
      int v = tuplesAsInt[idx][j];
      ++jointCounts[j][v][i];     // Very tricky indexing
    }
  }
  for (int i = 0; i < clusterCounts.Length; ++i)
    ++clusterCounts[i];
  for (int i = 0; i < tuplesAsInt.Length; ++i)
  {
    if (clustering[i] != -1) continue; // Tuple already clustered
    double[] allProbabilities = AllProbabilities(tuplesAsInt[i],
      jointCounts, clusterCounts, equalPriors);
    int c = IndexOfLargestProbability(allProbabilities);
    clustering[i] = c;  // Assign tuple i to cluster c
    for (int j = 0; j < tuplesAsInt[i].Length; ++j)
    {
      int v = tuplesAsInt[i][j];
      ++jointCounts[j][v][c];
    }
    ++clusterCounts[c];
  } // Main loop
  return clustering;
}

Method Cluster begins by allocating memory for the clustering array and assigning values of -1 in each cell, to indicate no cluster has been assigned to the associated tuple. Next, helper method GetGoodIndexes is called to get the indexes of the tuples that are maximally different from each other. I’ll explain method GetGoodIndexes shortly. Method Cluster next uses the good indexes to initialize the clustering array and then updates the jointCounts and clusterCounts arrays.

The main processing loop iterates through each data tuple in order and computes the probabilities that the current tuple belongs to each cluster using method AllProbabilities. Then the index of the greatest probability is determined using helper IndexOfLargestProbability. Because clusters are zero-based, that index also represents the best cluster, and it’s used to assign the current tuple to a cluster (in the clustering array), and update the jointCounts and clusterCounts arrays.
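
Method AllProbabilities isn’t listed in Figure 4, so here’s a sketch of how the partial probabilities described earlier map to code. This version multiplies the terms directly; for tuples with many attributes you might prefer to sum logs of the terms to guard against arithmetic underflow, and the implementation in the download may differ in such details:

static double[] AllProbabilities(int[] tupleAsInt, int[][][] jointCounts,
  int[] clusterCounts, bool equalPriors)
{
  int numClusters = clusterCounts.Length;
  int totalAssigned = 0;
  for (int c = 0; c < numClusters; ++c)
    totalAssigned += clusterCounts[c];
  double[] partials = new double[numClusters];
  for (int c = 0; c < numClusters; ++c)
  {
    double partial = 1.0;
    for (int j = 0; j < tupleAsInt.Length; ++j)
    {
      int v = tupleAsInt[j];
      // Conditional term: count(value & cluster) / (count(cluster) + numClusters)
      partial *= (double)jointCounts[j][v][c] / (clusterCounts[c] + numClusters);
    }
    if (equalPriors)
      partial *= 1.0 / numClusters;                       // P(c) assumed equal
    else
      partial *= (double)clusterCounts[c] / totalAssigned; // P(c) from current counts
    partials[c] = partial;
  }
  double sum = 0.0;
  for (int c = 0; c < numClusters; ++c)
    sum += partials[c];
  double[] result = new double[numClusters];
  for (int c = 0; c < numClusters; ++c)
    result[c] = partials[c] / sum; // Normalize partials into probabilities
  return result;
}

Helper IndexOfLargestProbability can be a simple linear scan:

static int IndexOfLargestProbability(double[] probabilities)
{
  int maxIndex = 0;
  for (int i = 1; i < probabilities.Length; ++i)
    if (probabilities[i] > probabilities[maxIndex])
      maxIndex = i;
  return maxIndex;
}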

Because the processing loop always starts at tuple [0], this effectively gives more weight to tuples with lower-numbered indexes. An alternative is to walk through the tuples in random order. Notice that the INBIAC algorithm assigns tuples based on the greatest probability of cluster membership. You could also compute and return the average of these greatest probabilities. This would be a measure of confidence in the cluster assignments. Then you could call method Cluster several times and return the clustering that was produced by the call that yielded the highest confidence.

Another option I often use is to post-process the clustering result produced by method Cluster to try and generate a better clustering. The idea in pseudo-code is:

loop n times
  select a random tuple index
  unassign the tuple from its curr cluster
  assign the tuple using non-equal priors computation
end loop

Recall that INBIAC builds up cluster assignment one tuple at a time by finding the cluster the current tuple belongs to with greatest probability. The probabilities are computed using equal priors, meaning the probabilities of each cluster are assumed to be equal. But after clustering, there’s now more information available about how likely each cluster is, and that information can be used to possibly improve the clustering result. The code download implements this option using a method named Refine.
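
I haven’t listed Refine here, but a rough sketch of the idea, assuming the same jointCounts and clusterCounts arrays are still available after the initial clustering, looks like this (the signature and details in the download may differ):

static void Refine(int n, int[][] tuplesAsInt, int[] clustering,
  int[][][] jointCounts, int[] clusterCounts)
{
  for (int trial = 0; trial < n; ++trial)
  {
    int i = random.Next(0, tuplesAsInt.Length); // Pick a random tuple
    int oldCluster = clustering[i];
    for (int j = 0; j < tuplesAsInt[i].Length; ++j)
      --jointCounts[j][tuplesAsInt[i][j]][oldCluster]; // Back out its joint counts
    --clusterCounts[oldCluster];
    double[] probs = AllProbabilities(tuplesAsInt[i], jointCounts,
      clusterCounts, false); // false = use non-equal priors
    int newCluster = IndexOfLargestProbability(probs);
    clustering[i] = newCluster; // Reassign, then restore the counts
    for (int j = 0; j < tuplesAsInt[i].Length; ++j)
      ++jointCounts[j][tuplesAsInt[i][j]][newCluster];
    ++clusterCounts[newCluster];
  }
}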

Getting Good Initial Seed Tuples

The INBIAC algorithm seeds each cluster with a single tuple. It’s important that these seed tuples be as different from each other as possible. There are many measures of dissimilarity used by clustering algorithms. Method GetGoodIndexes generates a set of random candidate indexes, then computes the total number of tuple attributes that are different, a metric called the Hamming distance. This process is repeated numTrials times, and the indexes of the tuples that have the greatest dissimilarity are returned.

For example, consider the data in Figure 1. Suppose the candidate indexes are 0, 1, 2. The corresponding data tuples are:

[0] Suburban VeryHigh Liberal
[1] Rural    Medium   Conservative
[2] Suburban VeryHigh Conservative

Tuples [0] and [1] differ in three positions; tuples [0] and [2] differ in one position; and tuples [1] and [2] differ in two positions, for a total delta of 3 + 1 + 2 = 6. In pseudo-code, method GetGoodIndexes is:

init a result array
loop numTrials times
  generate numClusters distinct random tuple indexes
  compute their dissimilarity
  if curr dissimilarity > greatest dissimilarity
    greatest dissimilarity = curr dissimilarity
    store curr indexes into result array
  end if
end loop
return result array
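
Translating that pseudo-code into C#, a minimal sketch of GetGoodIndexes might look like the following. I’m assuming GetRandomIndexes accepts the total number of tuples and the number of indexes to generate; the download may organize this slightly differently:

static int[] GetGoodIndexes(int[][] tuplesAsInt, int numClusters, int numTrials)
{
  int[] result = new int[numClusters];
  int bestDelta = -1;
  for (int trial = 0; trial < numTrials; ++trial)
  {
    int[] candidates = GetRandomIndexes(tuplesAsInt.Length, numClusters);
    int delta = 0;
    // Sum the number of differing attribute values over all candidate pairs
    for (int i = 0; i < candidates.Length - 1; ++i)
      for (int j = i + 1; j < candidates.Length; ++j)
        for (int k = 0; k < tuplesAsInt[0].Length; ++k)
          if (tuplesAsInt[candidates[i]][k] != tuplesAsInt[candidates[j]][k])
            ++delta;
    if (delta > bestDelta)
    {
      bestDelta = delta;
      result = candidates;
    }
  }
  return result;
}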

You may wish to consider alternative approaches. One advanced option is to observe that attribute Income—with values Low, Medium, High and VeryHigh—is inherently numeric. So you could modify method GetGoodIndexes so that the difference between Low and VeryHigh is greater than the difference between Low and Medium.

Generating the distinct candidate seed tuple indexes is an interesting little sub-problem. This is performed by helper method GetRandomIndexes. In pseudo-code, the method works like this:

init a dictionary to store result indexes
init a result array
set count = 0
while count < numClusters
  generate a random tuple index
  if index not in dictionary
    add index to result set
    add index to dictionary
    increment count
  end if
end while
return result

This technique is a fairly brute-force approach, but it has worked well for me in practice.
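
Here’s a minimal sketch of GetRandomIndexes following that pseudo-code (the parameter names and their order are assumptions on my part):

static int[] GetRandomIndexes(int numTuples, int numIndexes)
{
  Dictionary<int, bool> used = new Dictionary<int, bool>();
  int[] result = new int[numIndexes];
  int count = 0;
  while (count < numIndexes)
  {
    int idx = random.Next(0, numTuples); // Candidate tuple index
    if (used.ContainsKey(idx)) continue; // Already chosen; try again
    result[count] = idx;
    used.Add(idx, true);
    ++count;
  }
  return result;
}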

Wrapping Up

This article, along with the code for the demo program, should give you a solid basis for experimenting with Naive Bayesian clustering and adding clustering functionality to a software system. I developed the INBIAC clustering algorithm while working on a project that had an extremely large data set that contained both numeric and categorical data. I found that existing clustering tools were either too slow, didn’t work at all or gave poor results. The algorithm presented here can deal with both categorical and numeric data (if numeric data is binned into categories), can handle huge data sets and is very fast.

As I mentioned in the introduction, research suggests that there’s no single best clustering algorithm, but rather that you must experiment with different algorithms to get the best results. The ability to explore data sets with clustering based on Naive Bayes inference can be a valuable addition to your technical skill set.

Dr. James McCaffrey works for Volt Information Sciences Inc., where he manages technical training for software engineers working at the Microsoft Redmond, Wash., campus. He has worked on several Microsoft products including Internet Explorer and MSN Search. He’s the author of “.NET Test Automation Recipes” (Apress, 2006), and can be reached at jammc@microsoft.com.

Thanks to the following technical expert for reviewing this article: Dan Liebling