May 2013

Volume 28 Number 05

Test Run - Data Clustering Using Category Utility

By James McCaffrey | May 2013

Data clustering is the process of placing data items into different groups—clusters—in such a way that items in a particular group are similar to each other and different from those in other groups. Clustering is a machine learning technique that has many important practical uses. For example, cluster analysis can be used to determine what types of items are often purchased together so that targeted advertising can be aimed at consumers, or to determine which companies are similar so that stock market prices can be predicted.

The most basic form of clustering uses what’s called the k-means algorithm. (See my article, “Detecting Abnormal Data Using k-Means Clustering,” at msdn.microsoft.com/magazine/jj891054.) Unfortunately, k-means clustering can be used only in situations where the data items are completely numeric. Clustering categorical data (such as color, which can take on values such as “red” and “blue”) is a challenging problem. In this article I present a previously unpublished (as far as I can determine) clustering algorithm I designed that can handle either numeric or categorical data items, or a mixture of both. I call this algorithm Greedy Agglomerative Category Utility Clustering (GACUC) to distinguish it from the many other clustering algorithms. This article will give you all the information you need to experiment with data clustering, add clustering features to a .NET application or system, or create a powerful standalone data-clustering tool.

The best way to understand what clustering is and see where I’m headed in this article is to take a look at Figure 1. The demo program shown in the figure is clustering a small set of five dummy data items. Data items are sometimes called tuples in clustering terminology. Each tuple has three categorical attributes: color, length and rigidity. Color can take on one of four possible values: red, blue, green or yellow. Length can be short, medium or long. Rigidity can be false or true.

Figure 1 Clustering Categorical Data in Action

The demo program converts the raw string data into integer form for more-efficient processing. For color, red, blue, green and yellow are encoded as 0, 1, 2 and 3, respectively. For length, short, medium and long are encoded as 0, 1, 2, respectively. For rigidity, false is encoded as 0 and true is encoded as 1. So, the first raw data item, “Red Short True,” is encoded as “0 0 1.”
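
A minimal sketch of this encoding step is shown below. The demo calls a method named TuplesToInts; the signature here matches that call, but the body is my illustration of how such an encoding could be written rather than the demo's exact code:

static int[][] TuplesToInts(string[][] tuples, string[][] attValues)
{
  int[][] result = new int[tuples.Length][];
  for (int t = 0; t < tuples.Length; ++t)
  {
    result[t] = new int[tuples[t].Length];
    for (int i = 0; i < tuples[t].Length; ++i)
    {
      // The encoded value is the index of the string within the
      // attribute's list of possible values; e.g., "Green" -> 2
      result[t][i] = Array.IndexOf(attValues[i], tuples[t][i]);
    }
  }
  return result;
}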

Many clustering algorithms, including GACUC, require the number of clusters to be specified. In this case, the number of clusters is set to 2. The demo program then uses GACUC to find the best clustering of the data. Behind the scenes, the algorithm starts by placing seed tuples 0 and 4 into clusters 0 and 1, respectively. The clustering algorithm then iterates through the tuples, assigning each to the cluster that generates the best overall result. Because it doesn’t use any kind of training data, clustering is called unsupervised learning. After a preliminary clustering pass, the GACUC algorithm performs a refining pass to try to improve the clustering. No improvement is found in this example.

Internally, the demo program defines a clustering as an array of int. The index of the array indicates a tuple index, and the cell value in the array indicates a cluster. In Figure 1, the best clustering obtained by the algorithm is [0,0,1,1,1], which means tuple 0 (“Red Short True”) is in cluster 0; tuple 1 (“Red Long False”) is in cluster 0; tuple 2 (“Blue Medium True”) is in cluster 1; and so on. The demo program displays the final, best clustering in string format for readability, and also displays a key metric called the category utility (CU) of the best clustering, 0.3733 in this example. As you’ll see, category utility is the key to the GACUC clustering algorithm.

This article assumes you have advanced programming skills with a C-family language but does not assume you know anything about data clustering. I coded the demo program using a non-OOP approach with the C# language, but you shouldn’t have too much trouble refactoring the demo code to OOP or another language. I’ve removed all normal error checking for clarity. The demo program code is too long to present in its entirety in this article, so I focused on the GACUC algorithm itself; the explanation here should let you modify the demo code to meet your own needs. The complete source code for the demo program is available at msdn.com/magazine/msdnmag0513.

Category Utility

Data clustering involves solving two main problems. The first problem is defining exactly what makes a good clustering of data. The second problem is determining an effective technique to search through all possible clustering combinations to find the best one. Category utility addresses the first problem. CU is a clever metric that defines a clustering’s goodness. Small values of CU indicate poor clustering while larger values indicate better clustering. As far as I’ve been able to determine, CU was first defined by M. Gluck and J. Corter in a 1985 research paper titled “Information, Uncertainty, and the Utility of Categories.”

The mathematical equation for CU is a bit intimidating at first glance:

CU(C) = (1/m) * Σk P(Ck) * [ Σi Σj P(Ai = Vij | Ck)^2 - Σi Σj P(Ai = Vij)^2 ]

But the equation is actually simpler than it appears. In the equation, uppercase C is an overall clustering; lowercase m is the number of clusters; uppercase P means “probability of”; uppercase A means attribute (such as color); uppercase V means attribute value (such as red). Unless you’re a mathematician, computing category utility is best understood by example. Suppose the data set to be clustered is the one shown in Figure 1, and you want to compute the CU of the clustering:

k = 0
Red      Short    True
Red      Long     False
k = 1
Blue     Medium   True
Green    Medium   True
Green    Medium   False

The first step is to compute P(Ck), which are the probabilities of each cluster. For k = 0, because there are five tuples in the data set and two of them are in cluster 0, P(C0) = 2/5 = 0.40. Similarly, P(C1) = 3/5 = 0.60.

The second step is to compute the last double summation in the equation, called the unconditional probability term. The computation is the sum of N terms where N is the total number of different attribute values in the data set, and goes like this:

Red: (2/5)^2 = 0.1600
Blue: (1/5)^2 = 0.0400
Green: (2/5)^2 = 0.1600
Yellow: (0/5)^2 = 0.0000
Short: (1/5)^2 = 0.0400
Medium: (3/5)^2 = 0.3600
Long: (1/5)^2 = 0.0400
False: (2/5)^2 = 0.1600
True: (3/5)^2 = 0.3600
Unconditional sum = 1.3200

The third step is to compute the double summation in the middle term, called the conditional probability terms. There are m of these sums (where m is the number of clusters), each of which has N terms.

For k = 0 the computation goes:

Red: (2/2)^2 = 1.0000
Blue: (0/2)^2 = 0.0000
Green: (0/2)^2 = 0.0000
Yellow: (0/2)^2 = 0.0000
Short: (1/2)^2 = 0.2500
Medium: (0/2)^2 = 0.0000
Long: (1/2)^2 = 0.2500
False: (1/2)^2 = 0.2500
True: (1/2)^2 = 0.2500
Conditional k = 0 sum = 2.0000

For k = 1 the computation is:

Red: (0/3)^2 = 0.0000
Blue: (1/3)^2 = 0.1111
Green: (2/3)^2 = 0.4444
Yellow: (0/3)^2 = 0.0000
Short: (0/3)^2 = 0.0000
Medium: (3/3)^2 = 1.0000
Long: (0/3)^2 = 0.0000
False: (1/3)^2 = 0.1111
True: (2/3)^2 = 0.4444
Conditional k = 1 sum = 2.1111

The last step is to combine the computed sums according to the CU equation:

CU = 1/2 * [ 0.40 * (2.0000 - 1.3200) + 0.60 * (2.1111 - 1.3200) ]
   = 0.3733

A detailed explanation of exactly why CU measures the goodness of a clustering is fascinating, but unfortunately outside the scope of this article. The key point is that for any clustering of a data set containing categorical data, it’s possible to compute a value that describes how good the clustering is.

Searching for the Best Clustering

After defining a way to measure clustering goodness, the second problem to solve in any clustering algorithm is coming up with a technique to search through all possible clusterings for the best clustering. In general it’s not feasible to examine every possible clustering of a data set. For example, for a data set with only 100 tuples, and m = 2 clusters, there are 2^100 / 2! = 2^99 possible clusterings. Even if you could somehow examine one trillion clusterings per second, it would take roughly 19 billion years to check every possible clustering. (By comparison, the estimated age of the universe is about 14 billion years.)

Different clustering algorithms use different techniques to search through all possible clusterings. The GACUC algorithm uses what is called a greedy agglomerative approach. The idea is to begin by seeding each cluster with a single tuple and then, for each remaining tuple, determine which cluster k’, if the current tuple were added to it, would yield the best overall clustering. Then, the current tuple is actually assigned to cluster k’. This technique doesn’t guarantee that the optimal clustering will be found, but experiments have shown that the technique generally produces a reasonably good clustering. The final clustering produced by the GACUC algorithm depends on which m tuples are selected as seed tuples and the order in which the remaining tuples are assigned to clusters.

It turns out that selecting a seed tuple for each cluster is not trivial. One naive approach would be to simply select random tuples as the seeds. However, if the seed tuples are similar to each other, the resulting clustering will be poor. A better approach for selecting seed tuples is to choose m tuples that are as different as possible from each other. Here’s where category utility comes in handy again—the CU of any potential set of candidate seed tuples can be computed, and the set of tuples with the best CU (largest value, meaning most dissimilar) can be used as the seed tuples. As before, it’s generally not feasible to examine every possible set of seed tuples, so the approach is to repeatedly select m random tuples, compute the CU of those random tuples, and use the set that has the best CU as seeds.

Overall Program Structure

The Main method of the demo program shown running in Figure 1, with some WriteLine statements and comments removed, is listed in Figure 2. I used Visual Studio 2010 with the Microsoft .NET Framework 4, but the demo code has no significant dependencies and any version of Visual Studio that supports the .NET Framework 2.0 or greater should work fine. For simplicity, I created a single C# console application named ClusteringCategorical; you might want to implement clustering as a class library.

Figure 2 Overall Program Structure

using System;
using System.Collections.Generic;
namespace ClusteringCategorical
{
  class ClusteringCategoricalProgram
  {
    static Random random = null;
    static void Main(string[] args)
    {
      try
      {
        random = new Random(2);
        Console.WriteLine("\nBegin category utility demo\n");
        string[] attNames = new string[] { "Color", "Length", "Rigid" };
        string[][] attValues = new string[attNames.Length][];
        attValues[0] = new string[] { "Red", "Blue", "Green", "Yellow" };
        attValues[1] = new string[] { "Short", "Medium", "Long" };
        attValues[2] = new string[] { "False", "True" };
        string[][] tuples = new string[5][];
        tuples[0] = new string[] { "Red", "Short", "True" };
        tuples[1] = new string[] { "Red", "Long", "False" };
        tuples[2] = new string[] { "Blue", "Medium", "True" };
        tuples[3] = new string[] { "Green", "Medium", "True" };
        tuples[4] = new string[] { "Green", "Medium", "False" };
        Console.WriteLine("Tuples in raw (string) form:\n");
        DisplayMatrix(tuples);
        int[][] tuplesAsInt = TuplesToInts(tuples, attValues);
        Console.WriteLine("\nTuples in integer form:\n");
        DisplayMatrix(tuplesAsInt);
        int numClusters = 2;
        int numSeedTrials = 10;
        Console.WriteLine("Initializing clustering result array");
        int[] clustering = InitClustering(tuplesAsInt);
        int[][][] valueCounts = InitValueCounts(tuplesAsInt, attValues,
          numClusters);
        int[][] valueSums = InitValueSums(tuplesAsInt, attValues);
        int[] clusterCounts = new int[numClusters];
        Console.WriteLine("\nBeginning clustering routine");
        Cluster(tuplesAsInt, attValues, clustering, valueCounts,
          valueSums, clusterCounts, numSeedTrials);
        double cu = CategoryUtility(valueCounts, valueSums,
          clusterCounts);
        Console.WriteLine("\nCategory Utility of clustering = " +
          cu.ToString("F4"));
        DisplayVector(clustering);
        Refine(20, tuplesAsInt, clustering, valueCounts, valueSums,
          clusterCounts);
        Console.WriteLine("Refining complete");
        DisplayVector(clustering);
        Console.WriteLine("\nFinal clustering in string form:\n");
        DisplayClustering(numClusters, clustering, tuples);
        cu = CategoryUtility(valueCounts, valueSums, clusterCounts);
        Console.WriteLine("\nFinal clustering CU = " + cu.ToString("F4"));
        Console.WriteLine("\nEnd demo\n");
      }
      catch (Exception ex)
      {
        Console.WriteLine(ex.Message);
      }
    } // Main
    // All other methods here
  } // class Program
} // ns

In Figure 2, method Cluster performs one iteration of the basic GACUC clustering algorithm. Variable numSeedTrials is set to 10 and is passed to the routine that determines the initial seed tuples that are assigned to each cluster. Method Refine performs post-clustering passes through the data in an attempt to find a clustering that produces a better category utility.

The Key Data Structures

Although it’s possible to compute the category utility of a data set on the fly by iterating through each tuple, the clustering method needs to compute category utility many times, so a much better approach is to maintain counts of the attribute values of the tuples that are assigned to clusters at any given point in time. Figure 3 shows most of the key data structures used by the GACUC algorithm.

Figure 3 Key Data Structures

Array valueCounts holds the number of attribute values by cluster. For example, valueCounts[0][2][1] holds the number of tuples with attribute 0 (Color) and value 2 (Green) that are currently assigned to cluster 1. Array valueSums holds the sum of the counts, across all clusters, for each attribute value. For example, valueSums[0][2] holds the total number of tuples with attribute 0 (Color) and value 2 (Green) that are assigned to all clusters. Array clusterCounts holds the number of tuples that are currently assigned to each cluster. For example, if numClusters = 2 and clusterCounts = [2,2], then there are two tuples assigned to cluster 0 and two tuples assigned to cluster 1. Array clustering encodes the assignment of tuples to clusters. The index of clustering represents a tuple and a cell value represents a cluster, and a value of -1 indicates the associated tuple has not yet been assigned to any cluster. For example, if clustering[2] = 1, then tuple 2 is assigned to cluster 1.
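
To make the layout concrete, here’s one way the count arrays could be allocated. The demo’s Main method calls InitValueCounts and InitValueSums (see Figure 2); the bodies below are my sketch of that allocation, assuming all counts start at zero:

static int[][][] InitValueCounts(int[][] tuplesAsInt, string[][] attValues,
  int numClusters)
{
  // valueCounts[attribute][attributeValue][cluster], all zeros initially
  int[][][] result = new int[attValues.Length][][];
  for (int i = 0; i < attValues.Length; ++i)
  {
    result[i] = new int[attValues[i].Length][];
    for (int j = 0; j < attValues[i].Length; ++j)
      result[i][j] = new int[numClusters];
  }
  return result;
}

static int[][] InitValueSums(int[][] tuplesAsInt, string[][] attValues)
{
  // valueSums[attribute][attributeValue], counts across all clusters
  int[][] result = new int[attValues.Length][];
  for (int i = 0; i < attValues.Length; ++i)
    result[i] = new int[attValues[i].Length];
  return result;
}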

Coding the CategoryUtility Method

The code for the method that computes category utility is not conceptually difficult, but it is a bit tricky. The method’s definition begins:

static double CategoryUtility(int[][][] valueCounts, 
  int[][] valueSums,
  int[] clusterCounts)
{
  int numTuplesAssigned = 0;
  for (int k = 0; k < clusterCounts.Length; ++k)
    numTuplesAssigned += clusterCounts[k];
  int numClusters = clusterCounts.Length;
  double[] clusterProbs = new double[numClusters];   // P(Ck)
  for (int k = 0; k < numClusters; ++k)
    clusterProbs[k] = (clusterCounts[k] * 1.0) / numTuplesAssigned;

The three input arrays, valueCounts, valueSums and clusterCounts, are assumed to hold valid values that reflect the current clustering, as described in the previous section and in Figure 3. The method begins by scanning through the clusterCounts array to compute the total number of tuples that are currently assigned to clusters. The number of clusters is inferred from the Length property of the clusterCounts array, and the probabilities for the clusters are then computed and stored into local array clusterProbs.

The next step is to compute the single unconditional probability for the current clustering:

double unconditional = 0.0;
for (int i = 0; i < valueSums.Length; ++i)
{
  for (int j = 0; j < valueSums[i].Length; ++j)
  {
    double p = (valueSums[i][j] * 1.0) / numTuplesAssigned;
    unconditional += (p * p);
  }
}

After the unconditional probability has been computed, the next step is to compute a conditional probability for each cluster:

double[] conditionals = new double[numClusters];
for (int k = 0; k < numClusters; ++k)
{
  for (int i = 0; i < valueCounts.Length; ++i)
  {
    for (int j = 0; j < valueCounts[i].Length; ++j)
    {
      double p = (valueCounts[i][j][k] * 1.0) / clusterCounts[k];
      conditionals[k] += (p * p);
    }
  }
}

Now, with the probabilities of each cluster in array clusterProbs, the unconditional probability term in variable unconditional and the conditional probability terms in array conditionals, the category utility for the clustering can be determined:

double summation = 0.0;
for (int k = 0; k < numClusters; ++k)
  summation += clusterProbs[k] * 
    (conditionals[k] - unconditional);
return summation / numClusters;
}

A good way to understand the behavior of the CU function is to experiment with the demo code by supplying hardcoded clusterings. For example, you can modify the code in Figure 2 along the lines of:

string[] attNames = new string[] { "Color", "Length", "Rigid" };
// And so on, as in Figure 1
int[][] tuplesAsInt = TuplesToInts(tuples, attValues);
int[] clustering = new int[] { 0, 1, 0, 1, 0 };  // Hardcoded clustering
// Populate valueCounts, valueSums and clusterCounts to reflect this
// clustering (for example, by calling Assign for each tuple)
double cu = CategoryUtility(valueCounts, valueSums, clusterCounts);
Console.WriteLine("CU of clustering = " + cu.ToString("F4"));

Implementing the Cluster Method

With a category utility method in hand, implementing a method to cluster data is relatively simple. In high-level pseudo-code, method Cluster is:

define an empty clustering with all tuples unassigned
determine m good (dissimilar) seed tuples
assign one good seed tuple to each cluster
loop each unassigned tuple t
  compute the CU for each cluster if t were actually assigned
  determine the cluster that gives the best CU
  assign tuple t to best cluster
end loop (each tuple)
return clustering

For simplicity, method Cluster iterates through each unassigned tuple in order, starting at tuple [0]. This effectively gives more influence to tuples with lower-numbered indexes. An alternative approach I often use is to iterate through the unassigned tuples in random or scrambled order by using a linear congruential full-cycle generator.
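
Translating the pseudocode into C#, a minimal sketch of method Cluster might look like the following. The signature matches the call in Figure 2; the helpers GetGoodIndexes, Assign and Unassign are described later in this article, and their exact signatures here are my assumptions, so treat this as an illustration rather than the demo's exact code:

static void Cluster(int[][] tuplesAsInt, string[][] attValues,
  int[] clustering, int[][][] valueCounts, int[][] valueSums,
  int[] clusterCounts, int numSeedTrials)
{
  int numClusters = clusterCounts.Length;
  // Seed each cluster with one of m dissimilar tuples
  int[] goodIndexes = GetGoodIndexes(tuplesAsInt, attValues,
    numClusters, numSeedTrials);
  for (int k = 0; k < numClusters; ++k)
    Assign(goodIndexes[k], tuplesAsInt, k, clustering, valueCounts,
      valueSums, clusterCounts);
  for (int t = 0; t < tuplesAsInt.Length; ++t)  // Each remaining tuple
  {
    if (clustering[t] != -1) continue;  // Already assigned (seed tuple)
    int bestCluster = 0;
    double bestCU = double.MinValue;
    for (int k = 0; k < numClusters; ++k)  // Try each cluster
    {
      Assign(t, tuplesAsInt, k, clustering, valueCounts, valueSums,
        clusterCounts);
      double cu = CategoryUtility(valueCounts, valueSums, clusterCounts);
      if (cu > bestCU)
      {
        bestCU = cu;
        bestCluster = k;
      }
      Unassign(t, tuplesAsInt, k, clustering, valueCounts, valueSums,
        clusterCounts);
    }
    Assign(t, tuplesAsInt, bestCluster, clustering, valueCounts,
      valueSums, clusterCounts);
  }
}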

A surprisingly tricky part of the algorithm is finding m good seed tuples. In high-level pseudo-code, method GetGoodIndexes is:

loop numSeedTrials times
  create a mini-tuples array with length numClusters
  pick numClusters random tuple indexes
  populate the mini-tuples with values from data set
  compute CU of mini data set
  if CU > bestCU
    bestCU = CU
    indexes of best mini data set = indexes of data set
  end if
end loop
return indexes of best mini data set

Method GetGoodIndexes produces distinct random tuple indexes using brute force: generate a random index, then check whether it has already been used. This has worked well for me in practice.
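
Here’s a minimal C# sketch of GetGoodIndexes under those assumptions. The signature is my guess based on how the method is used; the scoring step builds count arrays for a mini clustering in which each candidate seed tuple sits in its own cluster, then calls CategoryUtility:

static int[] GetGoodIndexes(int[][] tuplesAsInt, string[][] attValues,
  int numClusters, int numSeedTrials)
{
  int[] bestIndexes = new int[numClusters];
  double bestCU = double.MinValue;
  for (int trial = 0; trial < numSeedTrials; ++trial)
  {
    // Pick numClusters distinct random tuple indexes (generate, then
    // check whether the index has already been used)
    int[] candidates = new int[numClusters];
    for (int k = 0; k < numClusters; ++k)
    {
      bool isUsed;
      do
      {
        candidates[k] = random.Next(0, tuplesAsInt.Length);
        isUsed = false;
        for (int j = 0; j < k; ++j)
          if (candidates[j] == candidates[k]) isUsed = true;
      } while (isUsed);
    }
    // Score the candidate set: place each candidate tuple in its own
    // cluster and compute the CU of that mini clustering
    int[][][] vc = InitValueCounts(tuplesAsInt, attValues, numClusters);
    int[][] vs = InitValueSums(tuplesAsInt, attValues);
    int[] cc = new int[numClusters];
    for (int k = 0; k < numClusters; ++k)
    {
      for (int i = 0; i < vc.Length; ++i)
      {
        int v = tuplesAsInt[candidates[k]][i];
        ++vc[i][v][k];
        ++vs[i][v];
      }
      ++cc[k];
    }
    double cu = CategoryUtility(vc, vs, cc);
    if (cu > bestCU)
    {
      bestCU = cu;
      candidates.CopyTo(bestIndexes, 0);
    }
  }
  return bestIndexes;
}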

Because method Cluster will, in general, return a good but not optimal clustering, the GACUC algorithm optionally calls a refine method that attempts to rearrange the tuple cluster assignments in an effort to find a clustering with a better category utility value. In pseudo-code, method Refine is:

loop numRefineTrials times
  select a random tuple that is in a cluster with at least two tuples
  determine the random tuple's cluster
  select a second cluster that is different from curr cluster
  unassign tuple from curr cluster
  assign tuple to different cluster
  compute new CU
  if new CU > old CU
    leave the cluster switch alone
  else
    unassign tuple from different cluster
    assign tuple back to original cluster
  end if
end loop
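
A C# sketch of this refinement, using the Assign and Unassign helpers described in the next section (and assuming Unassign takes the same parameters as Assign), could look like this:

static void Refine(int numRefineTrials, int[][] tuplesAsInt,
  int[] clustering, int[][][] valueCounts, int[][] valueSums,
  int[] clusterCounts)
{
  int numClusters = clusterCounts.Length;
  for (int trial = 0; trial < numRefineTrials; ++trial)
  {
    // Pick a random tuple that's in a cluster holding at least two tuples
    int t;
    do
    {
      t = random.Next(0, tuplesAsInt.Length);
    } while (clusterCounts[clustering[t]] < 2);
    int currCluster = clustering[t];
    // Pick a different cluster at random
    int newCluster;
    do
    {
      newCluster = random.Next(0, numClusters);
    } while (newCluster == currCluster);
    double oldCU = CategoryUtility(valueCounts, valueSums, clusterCounts);
    // Tentatively move the tuple, then keep or undo the move
    Unassign(t, tuplesAsInt, currCluster, clustering, valueCounts,
      valueSums, clusterCounts);
    Assign(t, tuplesAsInt, newCluster, clustering, valueCounts,
      valueSums, clusterCounts);
    double newCU = CategoryUtility(valueCounts, valueSums, clusterCounts);
    if (newCU <= oldCU)  // No improvement, so undo the move
    {
      Unassign(t, tuplesAsInt, newCluster, clustering, valueCounts,
        valueSums, clusterCounts);
      Assign(t, tuplesAsInt, currCluster, clustering, valueCounts,
        valueSums, clusterCounts);
    }
  }
}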

There are many other post-processing refinements with which you can experiment. The preceding refinement maintains a fixed number of clusters. One possibility is to define a refinement method that allows the number of clusters to increase or decrease.

A key helper method in the GACUC clustering algorithm is one that assigns a tuple to a cluster:

static void Assign(int tupleIndex, int[][] tuplesAsInt, 
  int cluster,  int[] clustering, 
  int[][][] valueCounts, int[][] valueSums,
  int[] clusterCounts)
{
  clustering[tupleIndex] = cluster;  // Assign
  for (int i = 0; i < valueCounts.Length; ++i)  // Update
  {
    int v = tuplesAsInt[tupleIndex][i];
    ++valueCounts[i][v][cluster];
    ++valueSums[i][v];
  }
  ++clusterCounts[cluster];  // Update
}

Notice that method Assign accepts a lot of parameters; this is a general weakness of using static methods rather than an OOP approach. Method Assign modifies the clustering array and then updates the arrays that hold the counts for each attribute value by cluster (valueCounts), the counts for each attribute value across all clusters (valueSums), and the number of tuples assigned to each cluster (clusterCounts). Method Unassign is essentially the opposite of Assign. The three key statements in method Unassign are:

clustering[tupleIndex] = -1;  // Unassign
--valueCounts[i][v][cluster]; // Update
--valueSums[i][v];            // Update
--clusterCounts[cluster];     // Update
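
Putting those statements together with the same parameter list as Assign, a full Unassign would look something like this (my reconstruction; the demo’s version may differ slightly):

static void Unassign(int tupleIndex, int[][] tuplesAsInt,
  int cluster, int[] clustering,
  int[][][] valueCounts, int[][] valueSums,
  int[] clusterCounts)
{
  clustering[tupleIndex] = -1;  // Unassign
  for (int i = 0; i < valueCounts.Length; ++i)  // Update
  {
    int v = tuplesAsInt[tupleIndex][i];
    --valueCounts[i][v][cluster];
    --valueSums[i][v];
  }
  --clusterCounts[cluster];  // Update
}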

Wrapping Up

The code download that accompanies this article, along with the explanation of the GACUC clustering algorithm presented here, should get you up and running if you want to experiment with data clustering, add clustering features to an application, or create a clustering utility tool. Unlike many clustering algorithms (including k-means), the GACUC algorithm can be easily modified to deal with numeric data or mixed numeric and categorical data. The idea is to preprocess numeric data by binning it into categories. For example, if your raw data contained people’s heights measured in inches, you could bin height so that values less than 60.0 are “short,” values between 60.0 and 74.0 are “medium,” and values greater than 74.0 are “tall.”
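
For instance, a tiny helper along these lines could preprocess a numeric height column into categories before clustering; the cut points 60.0 and 74.0 are just the example values above, and in practice you’d choose bins that suit your data:

static string BinHeight(double heightInInches)
{
  // Map a numeric value onto one of three categorical values
  if (heightInInches < 60.0) return "Short";
  else if (heightInInches <= 74.0) return "Medium";
  else return "Tall";
}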

There are many algorithms available for clustering categorical data, including algorithms that rely on category utility to define clustering goodness. However, the algorithm presented here is relatively simple, has worked well in practice, can be applied to both numeric and categorical data, and scales well to huge data sets. GACUC data clustering can be a valuable addition to your developer tool set.


Dr. James McCaffrey works for Volt Information Sciences Inc., where he manages technical training for software engineers working at the Microsoft Redmond, Wash., campus. He has worked on several Microsoft products including Internet Explorer and MSN Search. He’s the author of “.NET Test Automation Recipes” (Apress, 2006), and can be reached at jammc@microsoft.com.

Thanks to the following technical expert for reviewing this article: Dan Liebling (Microsoft Research)