MSDN Magazine, February 2013

Test Run

Naive Bayes Classification with C#

James McCaffrey

 

Naive Bayes classification is a machine-learning technique that can be used to predict to which category a particular data case belongs. In this article I explain how Naive Bayes classification works and present an example coded with the C# language.

There are plenty of standalone tools available that can perform Naive Bayes classification. However, these tools can be difficult or impossible to integrate directly into your application, and difficult to customize to meet specific needs. And they might have hidden copyright issues. This article will give you a solid foundation for adding Naive Bayes classification features to a .NET application, without relying on any external dependencies.

The best way to understand what Naive Bayes classification is and to see where I’m headed in the article is to examine the screenshot of a demo program in Figure 1. The demo program begins by generating 40 lines of data that will be used to train the classifier. In most cases you’d be using an existing data source, but I generated dummy data to keep the demo simple. The first line of data is “administrative,right,72.0,female.” The first field is an occupation, the second is hand dominance, the third is height in inches and the fourth is sex. The goal of the classifier is to predict sex from a given set of values for occupation, dominance and height. Because the dependent variable sex has two possible values, this is an example of binary classification.



Figure 1 Naive Bayes Classification Demo

After generating raw data, the demo program converts each numeric height field to a category—short, medium or tall—by binning height. As I’ll explain, binning numeric data into categorical data is an approach that has pros and cons. After the training data has been binned, the demo program scans the 40 lines of categorical data and computes joint counts. For example, the number of data cases where the person’s occupation is administrative and the person’s sex is male is 2. Additionally, the total numbers of each dependent value (the attribute to be predicted, male or female in this example) are computed. You can see that there are 24 males and 16 females in the training data.

The demo program then has all the information needed to classify the sex of a new data case where the occupation is education, the dominance is right and the height is tall. In this example, the demo determines that the probability the data case is male is 0.3855 and the probability it is female is 0.6145, so the system concludes the data case is most likely female.

In the sections that follow I’ll first explain exactly how Naive Bayes classification works, walk you through the code in the demo program, and describe how to modify the demo to meet your own needs. This article assumes you have at least beginning programming skills with a C-family language, but doesn’t assume you know anything about Naive Bayes classification. The code for the demo program is a bit too long to present in its entirety here, but the complete source is available from the MSDN download site at archive.msdn.microsoft.com/mag201302TestRun.

How Naive Bayes Classification Works

Using the example shown in Figure 1, the goal is to predict the sex (male or female) of a person whose occupation is education, who is right-handed and whose height is tall (greater than or equal to 71.0 inches). To do this, we can compute the probability that the person is male given that information, and the probability the person is female given the information, and then predict the sex with the larger probability. Expressed symbolically, we want to know P(male | X) (usually read as "the probability of male given independent variable values X") and P(female | X), where X is (education, right, tall). The term "naive" in Naive Bayes means that all X attributes are assumed to be mathematically independent, which greatly simplifies classification. You can find many online references that explain the rather interesting mathematics behind Naive Bayes classification, but the result is relatively simple. Symbolically:

P(male | X) =
  [ P(education | male) * P(right | male) * P(tall | male) * P(male) ] /
    [ PP(male | X) + PP(female | X) ]

Notice the equation is a fraction. The numerator, sometimes loosely called a partial probability, consists of four terms multiplied together. In this article I use the nonstandard notation of PP for a partial probability term. The denominator is the sum of two terms, one of which is the numerator. The first piece to compute is P(education | male), or the probability that a person’s occupation is education, given that he is male. This, as it turns out, can be estimated by the count of training cases where occupation is education and sex is male, divided by the number of cases that are male (with any occupation), so:

P(education | male) = count(education & male) / count(male) = 2/24 = 0.0833

Using the same logic:

P(right | male) = count(right & male) / count(male) = 17/24 = 0.7083
P(tall | male) = count(tall & male) / count(male) = 4/24 = 0.1667

The next piece of the puzzle is P(male). In Naive Bayes terminology, this is called a prior. There’s some debate about how best to compute priors. On the one hand, we can hypothesize that there’s no reason to believe that the presence of males is more or less likely than the presence of females and so assign 0.5 to P(male). On the other hand, we can use the fact that the training data has 24 males and 16 females and estimate a probability of 24/40 = 0.6000 for P(male). I prefer this approach, where priors are estimated using training data.

Now, if you refer to the earlier equation for P(male | X), you’ll note that its denominator contains the term PP(female | X). That denominator sum, PP(male | X) + PP(female | X), is sometimes called the evidence. The pieces for PP(female | X) are computed like so:

P(education | female) = count(education & female) / count(female) = 4/16 = 0.2500
P(right | female) = count(right & female) / count(female) = 14/16 = 0.8750
P(tall | female) = count(tall & female) / count(female) = 2/16 = 0.1250
P(female) = 16/40 = 0.4000

So the partial probability numerator for P(male | X) is:

PP(male | X) = 0.0833 * 0.7083 * 0.1667 * 0.6000 = 0.005903

Using the same logic, the partial probability for female given X = (education, right, tall) is:

PP(female | X) = 0.2500 * 0.8750 * 0.1250 * 0.4000 = 0.010938

And, finally, the overall probabilities of male and female are:

P(male | X) = 0.005903 / (0.005903 + 0.010938) = 0.3505
P(female | X) = 0.010938 / (0.005903 + 0.010938) = 0.6495

These overall probabilities are sometimes called the posteriors. Because P(female | X) is greater than P(male | X), the system concludes the sex of the unknown person is female. But wait. These two probabilities, 0.3505 and 0.6495, are close to but definitely not the same as the two probabilities, 0.3855 and 0.6145, shown in Figure 1. The reason for this discrepancy is that the demo program uses an important optional modification of basic Naive Bayes called Laplacian smoothing.
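The arithmetic above is easy to verify with a few lines of C#. The following is a standalone check, not part of the demo program; the counts come directly from Figure 1 and no smoothing is applied:

```csharp
using System;

class PosteriorCheck
{
  static void Main()
  {
    // Joint counts from Figure 1 for X = (education, right, tall), no smoothing
    double ppMale = (2.0 / 24) * (17.0 / 24) * (4.0 / 24) * (24.0 / 40);
    double ppFemale = (4.0 / 16) * (14.0 / 16) * (2.0 / 16) * (16.0 / 40);
    double evidence = ppMale + ppFemale;
    Console.WriteLine("P(male | X)   = {0:F4}", ppMale / evidence);   // 0.3505
    Console.WriteLine("P(female | X) = {0:F4}", ppFemale / evidence); // 0.6495
  }
}
```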

Laplacian Smoothing

If you refer to Figure 1, you’ll see that the count of training cases in which the person has occupation = construction and sex = female is 0. In the demo, the X values are (education, right, tall), which doesn’t include construction. But suppose X had been (construction, right, tall). In the computation of PP(female | X) it would be necessary to compute P(construction | female) = count(construction & female) / count(female), which would be 0, and which in turn would zero-out the entire partial probability. In short, it’s bad when a joint count is 0. The most common technique to avoid this situation is to simply add 1 to all joint counts. This has the feel of a hack but, in fact, has a solid mathematical basis. The technique is called add-one smoothing, which is a specific kind of Laplacian smoothing.

With Laplacian smoothing, if X = (education, right, tall) as in the previous section, P(male | X) and P(female | X) are calculated as follows:

P(education | male) =
  (count(education & male) + 1) / (count(male) + 3) = 3/27 = 0.1111
P(right | male) =
  (count(right & male) + 1) / (count(male) + 3) = 18/27 = 0.6667
P(tall | male) =
  (count(tall & male) + 1) / (count(male) + 3) = 5/27 = 0.1852
P(male) = 24/40 = 0.6000
P(education | female) =
  (count(education & female) + 1) / (count(female) + 3) = 5/19 = 0.2632
P(right | female) =
  (count(right & female) + 1) / (count(female) + 3) = 15/19 = 0.7895
P(tall | female) =
  (count(tall & female) + 1) / (count(female) + 3) = 3/19 = 0.1579
P(female) = 16/40 = 0.4000

The partial probabilities are:

PP(male | X) = 0.1111 * 0.6667 * 0.1852 * 0.6000 = 0.008230
PP(female | X) = 0.2632 * 0.7895 * 0.1579 * 0.4000 = 0.013121

And so the two final probabilities are:

P(male | X) = 0.008230 / (0.008230 + 0.013121) = 0.3855
P(female | X) = 0.013121 / (0.008230 + 0.013121) = 0.6145

These are the values shown in the screenshot in Figure 1. Notice that 1 is added to each joint count but that 3 is added to denominators count(male) and count(female). The 3 is to some extent arbitrary in the sense that Laplacian smoothing doesn’t specify any particular value to be used. In this case, it’s the number of X attributes (occupation, dominance, height). This is the most common value to add to denominators of partial probabilities in Laplacian smoothing, but you may wish to experiment with other values. The value to add to the denominator is often given the symbol k in math literature on Naive Bayes. Also, notice that the priors, P(male) and P(female), are typically not modified in Naive Bayes Laplacian smoothing.
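A small helper makes the role of k explicit. The method name SmoothedProb is my own, for illustration only; the demo program performs the equivalent arithmetic inline inside PartialProbability:

```csharp
using System;

class SmoothingDemo
{
  // Add-one (Laplacian) smoothed conditional probability.
  // k is the constant added to the denominator; the demo uses k = 3,
  // the number of X attributes.
  static double SmoothedProb(int jointCount, int dependentCount, int k)
  {
    return (jointCount + 1.0) / (dependentCount + k);
  }

  static void Main()
  {
    // P(education | male): (2 + 1) / (24 + 3) = 0.1111
    Console.WriteLine("{0:F4}", SmoothedProb(2, 24, 3));
    // P(tall | female): (2 + 1) / (16 + 3) = 0.1579
    Console.WriteLine("{0:F4}", SmoothedProb(2, 16, 3));
  }
}
```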

Overall Program Structure

The demo program shown running in Figure 1 is a single C# console application. The Main method, with some WriteLine statements removed, is listed in Figure 2.

Figure 2 Naive Bayes Program Structure

C#
using System;
namespace NaiveBayes
{
  class Program
  {
    static Random ran = new Random(25); // Arbitrary
    static void Main(string[] args)
    {
      try
      {
        string[] attributes = new string[] { "occupation", "dominance",
          "height", "sex"};
        string[][] attributeValues = new string[attributes.Length][];
        attributeValues[0] = new string[] { "administrative",
          "construction", "education", "technology" };
        attributeValues[1] = new string[] { "left", "right" };
        attributeValues[2] = new string[] { "short", "medium", "tall" };
        attributeValues[3] = new string[] { "male", "female" };
        double[][] numericAttributeBorders = new double[1][];
        numericAttributeBorders[0] = new double[] { 64.0, 71.0 };
        string[] data = MakeData(40);
        for (int i = 0; i < 4; ++i)
          Console.WriteLine(data[i]);
        string[] binnedData = BinData(data, attributeValues,
          numericAttributeBorders);
        for (int i = 0; i < 4; ++i)
          Console.WriteLine(binnedData[i]);
        int[][][] jointCounts = MakeJointCounts(binnedData, attributes,
          attributeValues);
        int[] dependentCounts = MakeDependentCounts(jointCounts, 2);
        Console.WriteLine("Total male = " + dependentCounts[0]);
        Console.WriteLine("Total female = " + dependentCounts[1]);
        ShowJointCounts(jointCounts, attributeValues);
        string occupation = "education";
        string dominance = "right";
        string height = "tall";
        bool withLaplacian = true;
        Console.WriteLine(" occupation = " + occupation);
        Console.WriteLine(" dominance = " + dominance);
        Console.WriteLine(" height = " + height);
        int c = Classify(occupation, dominance, height, jointCounts,
          dependentCounts, withLaplacian, 3);
        if (c == 0)
          Console.WriteLine("\nData case is most likely male");
        else if (c == 1)
          Console.WriteLine("\nData case is most likely female");
        Console.WriteLine("\nEnd demo\n");
      }
      catch (Exception ex)
      {
        Console.WriteLine(ex.Message);
      }
    } // End Main
    // Methods to create data
    // Method to bin data
    // Method to compute joint counts
    // Helper method to compute partial probabilities
    // Method to classify a data case
  } // End class Program
}

The program begins by setting up the hardcoded X attributes occupation, dominance, and height, and the dependent attribute sex. In some situations you may prefer to scan your existing data source to determine the attributes, especially when the source is a data file with headers or a SQL table with column names. The demo program also specifies the nine categorical X attribute values: (administrative, construction, education, technology) for occupation; (left, right) for dominance; and (short, medium, tall) for height. In this example there are two dependent variable attribute values: (male, female) for sex. Again, you may want to programmatically determine attribute values by scanning your data.

The demo sets up hardcoded boundary values of 64.0 and 71.0 to bin the numeric height values so that values less than or equal to 64.0 are categorized as short; heights between 64.0 and 71.0 are medium; and heights greater than or equal to 71.0 are tall. When binning numeric data for Naive Bayes, the number of boundary values will be one less than the number of categories. In this example, the 64.0 and 71.0 were determined by scanning the training data for minimum and maximum height values (57.0 and 78.0), computing the difference, 21.0, and then computing the interval size by dividing by the number of height categories, 3, giving 7.0. In most situations, you’ll want to determine the boundary values for numeric X attributes programmatically rather than manually.
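Computing equal-width boundary values programmatically might look like the following sketch. MakeBoundaries is a hypothetical helper name, not a method in the demo download:

```csharp
using System;
using System.Linq;

class BoundaryDemo
{
  // Equal-width bin boundaries: for numBins bins, returns numBins - 1 boundaries
  static double[] MakeBoundaries(double[] values, int numBins)
  {
    double min = values.Min();
    double max = values.Max();
    double width = (max - min) / numBins;
    double[] result = new double[numBins - 1];
    for (int i = 0; i < result.Length; ++i)
      result[i] = min + width * (i + 1);
    return result;
  }

  static void Main()
  {
    // Min and max match the demo training data (57.0 and 78.0)
    double[] heights = { 57.0, 66.5, 71.5, 78.0 };
    double[] b = MakeBoundaries(heights, 3);
    Console.WriteLine("{0} {1}", b[0], b[1]); // 64 71
  }
}
```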

The demo program calls a helper method MakeData to generate somewhat random training data. MakeData calls helpers MakeSex, MakeOccupation, MakeDominance and MakeHeight. For example, these helpers generate data so that male occupations are more likely to be construction and technology, male dominance is more likely to be right, and male height is most likely to be between 66.0 and 72.0 inches.

The key methods called in Main are BinData to categorize height data; MakeJointCounts to scan binned data and compute the joint counts; MakeDependentCounts to compute total number of males and females; and Classify, which uses joint counts and dependent counts to perform a Naive Bayes classification.

Binning Data

Method BinData is listed in Figure 3. The method accepts an array of comma-delimited strings where each string looks like “education,left,67.5,male.” In many situations, you’ll be reading training data from a text file where each line is a string. The method uses String.Split to parse each string into tokens. Element tokens[2] is the height. It’s converted from a string into type double using the double.Parse method. The numeric height is compared against the boundary values until the height’s interval is found, and then the corresponding height category as a string is determined. A result string is stitched together using the old tokens, comma delimiters and the new computed-height category string.

Figure 3 Method BinData for Categorizing Height

C#
static string[] BinData(string[] data, string[][] attributeValues,
  double[][] numericAttributeBorders)
{
  string[] result = new string[data.Length];
  string[] tokens;
  double heightAsDouble;
  string heightAsBinnedString;
  for (int i = 0; i < data.Length; ++i)
  {
    tokens = data[i].Split(',');
    heightAsDouble = double.Parse(tokens[2]);
    if (heightAsDouble <= numericAttributeBorders[0][0]) // Short
      heightAsBinnedString = attributeValues[2][0];
    else if (heightAsDouble >= numericAttributeBorders[0][1]) // Tall
      heightAsBinnedString = attributeValues[2][2];
    else
      heightAsBinnedString = attributeValues[2][1]; // Medium
    string s = tokens[0] + "," + tokens[1] + "," + heightAsBinnedString +
      "," + tokens[3];
    result[i] = s;
  }
  return result;
}

It’s not a requirement to bin numeric data when performing Naive Bayes classification. Naive Bayes can deal with numeric data directly, but those techniques are outside the scope of this article. Binning data has the advantages of simplicity and avoiding the need to make any particular explicit assumptions about the mathematical distribution (such as Gaussian or Poisson) of the data. However, binning essentially loses information and does require you to determine and specify into how many categories to divide the data.

Determining Joint Counts

The key to Naive Bayes classification is computing joint counts. In the demo example, there are nine total independent X attribute values (administrative, construction, … tall) and two dependent attribute values (male, female), so a total of 9 * 2 = 18 joint counts must be computed and stored. My preferred approach is to store joint counts in a three-dimensional array int[][][] jointCounts. The first index indicates the independent X attribute; the second index indicates the independent X attribute value; and the third index indicates the dependent attribute value. For example, jointCounts[0][3][1] means attribute 0 (occupation), attribute value 3 (technology) and sex 1 (female), or in other words the value at jointCounts[0][3][1] is the count of training cases where occupation is technology and sex is female. Method MakeJointCounts is listed in Figure 4.

Figure 4 Method MakeJointCounts

C#
static int[][][] MakeJointCounts(string[] binnedData, string[] attributes,
  string[][] attributeValues)
{
  int[][][] jointCounts = new int[attributes.Length - 1][][]; // -1 (no sex)
  jointCounts[0] = new int[4][]; // 4 occupations
  jointCounts[1] = new int[2][]; // 2 dominances
  jointCounts[2] = new int[3][]; // 3 heights
  jointCounts[0][0] = new int[2]; // 2 sexes for administrative
  jointCounts[0][1] = new int[2]; // construction
  jointCounts[0][2] = new int[2]; // education
  jointCounts[0][3] = new int[2]; // technology
  jointCounts[1][0] = new int[2]; // left
  jointCounts[1][1] = new int[2]; // right
  jointCounts[2][0] = new int[2]; // short
  jointCounts[2][1] = new int[2]; // medium
  jointCounts[2][2] = new int[2]; // tall
  for (int i = 0; i < binnedData.Length; ++i)
  {
    string[] tokens = binnedData[i].Split(',');
    int occupationIndex = AttributeValueToIndex(0, tokens[0]);
    int dominanceIndex = AttributeValueToIndex(1, tokens[1]);
    int heightIndex = AttributeValueToIndex(2, tokens[2]);
    int sexIndex = AttributeValueToIndex(3, tokens[3]);
    ++jointCounts[0][occupationIndex][sexIndex];
    ++jointCounts[1][dominanceIndex][sexIndex];
    ++jointCounts[2][heightIndex][sexIndex];
  }
  return jointCounts;
}

The implementation has many hardcoded values to make it easier to understand. For example, these three statements could be replaced by a single for loop that allocates space using Length properties in array attributeValues:

C#
jointCounts[0] = new int[4][]; // 4 occupations
jointCounts[1] = new int[2][]; // 2 dominances
jointCounts[2] = new int[3][]; // 3 heights
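For example, assuming the attributeValues array from Main is in scope, the allocations (including the nine int[2] allocations) could be generalized with a nested loop. This is a sketch that assumes, as in the demo, that the last attribute is the dependent one:

```csharp
using System;

class AllocationDemo
{
  static void Main()
  {
    string[][] attributeValues = new string[4][];
    attributeValues[0] = new string[] { "administrative", "construction",
      "education", "technology" };
    attributeValues[1] = new string[] { "left", "right" };
    attributeValues[2] = new string[] { "short", "medium", "tall" };
    attributeValues[3] = new string[] { "male", "female" };

    // Replaces the hardcoded allocations in MakeJointCounts
    int[][][] jointCounts = new int[attributeValues.Length - 1][][];
    for (int i = 0; i < jointCounts.Length; ++i)
    {
      jointCounts[i] = new int[attributeValues[i].Length][];
      for (int j = 0; j < jointCounts[i].Length; ++j)
        jointCounts[i][j] = new int[2]; // 2 dependent values (male, female)
    }

    Console.WriteLine(jointCounts[0].Length); // 4 occupations
    Console.WriteLine(jointCounts[2].Length); // 3 heights
  }
}
```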

Helper function AttributeValueToIndex accepts an attribute index and an attribute value string and returns the appropriate index. For example, AttributeValueToIndex(2, “medium”) returns the index of “medium” in the height attribute, which is 1.
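One simple way to implement AttributeValueToIndex is a linear search over the attribute values, here assuming attributeValues is available as a static field; the demo download’s exact implementation may differ:

```csharp
using System;

class LookupDemo
{
  static string[][] attributeValues = new string[][]
  {
    new string[] { "administrative", "construction", "education", "technology" },
    new string[] { "left", "right" },
    new string[] { "short", "medium", "tall" },
    new string[] { "male", "female" }
  };

  // Returns the index of value within the given attribute, or -1 if not found
  static int AttributeValueToIndex(int attribute, string value)
  {
    return Array.IndexOf(attributeValues[attribute], value);
  }

  static void Main()
  {
    Console.WriteLine(AttributeValueToIndex(2, "medium")); // 1
    Console.WriteLine(AttributeValueToIndex(3, "female")); // 1
  }
}
```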

The demo program uses a method MakeDependentCounts to determine the number of male and the number of female data cases. There are several ways to do this. If you refer to Figure 1, you’ll observe that one approach is to sum the joint counts across any one of the three attributes. For example, the number of males is the sum of count(administrative & male), count(construction & male), count(education & male) and count(technology & male):

C#
static int[] MakeDependentCounts(int[][][] jointCounts,
  int numDependents)
{
  int[] result = new int[numDependents];
  for (int k = 0; k < numDependents; ++k) 
  // Male then female
    for (int j = 0; j < jointCounts[0].Length; ++j)
    // Scanning attribute 0
      result[k] += jointCounts[0][j][k];
  return result;
}

Classifying a Data Case

Method Classify, shown in Figure 5, is short because it relies on helper methods.

Figure 5 Method Classify

C#
static int Classify(string occupation, string dominance, string height,
  int[][][] jointCounts, int[] dependentCounts, bool withSmoothing,
  int xClasses)
{
  double partProbMale = PartialProbability("male", occupation, dominance,
    height, jointCounts, dependentCounts, withSmoothing, xClasses);
  double partProbFemale = PartialProbability("female", occupation, dominance,
    height, jointCounts, dependentCounts, withSmoothing, xClasses);
  double evidence = partProbMale + partProbFemale;
  double probMale = partProbMale / evidence;
  double probFemale = partProbFemale / evidence;
  if (probMale > probFemale) return 0;
  else return 1;
}

Method Classify accepts the jointCounts and dependentCounts arrays; a Boolean parameter that indicates whether or not to use Laplacian smoothing; and parameter xClasses, which in this example is 3 because there are three independent variables (occupation, dominance, height). This parameter could also be inferred from the jointCounts parameter.

Method Classify returns an int that represents the index of the predicted dependent variable. Instead, you might want to return an array of probabilities for each dependent variable. Notice that the classification is based on probMale and probFemale, both of which are computed by dividing partial probabilities by the evidence value. You might want to simply omit the evidence term and just compare the values of the partial probabilities by themselves.

Method Classify returns the index of the dependent variable that has the largest probability. An alternative is to supply a threshold value. For example, suppose probMale is 0.5001 and probFemale is 0.4999. You may wish to consider these values too close to call and return a classification value representing “undetermined.”
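A threshold variant of the final comparison might look like the following sketch; the return value -1 for “undetermined” is my own convention, not part of the demo:

```csharp
using System;

class ThresholdDemo
{
  // Returns 0 (male), 1 (female) or -1 (undetermined) based on a probability gap
  static int ClassifyWithThreshold(double probMale, double probFemale,
    double threshold)
  {
    if (Math.Abs(probMale - probFemale) < threshold)
      return -1; // Too close to call
    return probMale > probFemale ? 0 : 1;
  }

  static void Main()
  {
    Console.WriteLine(ClassifyWithThreshold(0.5001, 0.4999, 0.05)); // -1
    Console.WriteLine(ClassifyWithThreshold(0.3855, 0.6145, 0.05)); // 1
  }
}
```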

Method PartialProbability does most of the work for Classify and is listed in Figure 6.

Figure 6 Method PartialProbability

C#
static double PartialProbability(string sex, string occupation, string dominance,
  string height, int[][][] jointCounts, int[] dependentCounts,
  bool withSmoothing, int xClasses)
{
  int sexIndex = AttributeValueToIndex(3, sex);
  int occupationIndex = AttributeValueToIndex(0, occupation);
  int dominanceIndex = AttributeValueToIndex(1, dominance);
  int heightIndex = AttributeValueToIndex(2, height);
  int totalMale = dependentCounts[0];
  int totalFemale = dependentCounts[1];
  int totalCases = totalMale + totalFemale;
  int totalToUse = 0;
  if (sex == "male") totalToUse = totalMale;
  else if (sex == "female") totalToUse = totalFemale;
  double p0 = (totalToUse * 1.0) / (totalCases); // Prob male or female
  double p1 = 0.0;
  double p2 = 0.0;
  double p3 = 0.0;
  if (withSmoothing == false)
  {
    p1 = (jointCounts[0][occupationIndex][sexIndex] * 1.0) / totalToUse;
    p2 = (jointCounts[1][dominanceIndex][sexIndex] * 1.0) / totalToUse;  
    p3 = (jointCounts[2][heightIndex][sexIndex] * 1.0) / totalToUse;     
  }
  else if (withSmoothing == true)
  {
    p1 = (jointCounts[0][occupationIndex][sexIndex] + 1) /
     ((totalToUse + xClasses) * 1.0); 
    p2 = (jointCounts[1][dominanceIndex][sexIndex] + 1) /
     ((totalToUse + xClasses) * 1.0);
    p3 = (jointCounts[2][heightIndex][sexIndex] + 1) /
     ((totalToUse + xClasses) * 1.0);
  }
  //return p0 * p1 * p2 * p3; // Risky if any very small values
  return Math.Exp(Math.Log(p0) + Math.Log(p1) + Math.Log(p2) + Math.Log(p3));
}

Method PartialProbability is mostly hardcoded for clarity. For example, there are four probability pieces, p0, p1, p2 and p3. You can make PartialProbability more general by using an array of probabilities where the size of the array is determined from the jointCounts array.

Notice that instead of returning the product of the four probability pieces, the method returns the equivalent Exp of the sum of the Log of each piece. Using log probabilities is a standard technique in machine-learning algorithms that’s used to prevent numeric errors that can occur with very small real numeric values.
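The benefit only becomes visible when there are many attributes. With four terms the direct product is safe, but a product of, say, 500 small probabilities underflows to 0.0, while the sum of their logs remains a perfectly usable number for comparing classes (in practice you’d compare the log sums directly rather than exponentiating). A minimal illustration:

```csharp
using System;

class LogProbDemo
{
  static void Main()
  {
    double product = 1.0;
    double logSum = 0.0;
    for (int i = 0; i < 500; ++i)
    {
      product *= 0.01;          // Underflows to 0.0 long before the loop ends
      logSum += Math.Log(0.01); // Stays finite: about -2302.6
    }
    Console.WriteLine(product); // 0
    Console.WriteLine(logSum);
  }
}
```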

Wrapping Up

The example presented here should give you a good foundation for adding Naive Bayes classification features to your .NET applications. Naive Bayes classification is a relatively crude technique, but it does have several advantages over more-sophisticated alternatives such as neural network classification, logistic regression classification and support vector machine classification. Naive Bayes is simple, relatively easy to implement and scales well to very large data sets. And Naive Bayes easily extends to multinomial classification problems—those with three or more dependent variables.

Dr. James McCaffrey works for Volt Information Sciences Inc., where he manages technical training for software engineers working at the Microsoft Redmond, Wash., campus. He has worked on several Microsoft products including Internet Explorer and MSN Search. He’s the author of “.NET Test Automation Recipes” (Apress, 2006), and can be reached at jammc@microsoft.com.

Thanks to the following Microsoft technical expert for reviewing this article: Rich Caruana

 
