Article
02/08/2019

February 2019

Volume 34 Number 2

[Test Run]

Rating Competitors Using Infer.NET

James McCaffrey Infer.NET is an open source code library that can be used to create probabilistic programming systems. I tend to think of an or-dinary computer program as one that's largely based on variables that have a value of a specified type, such as a char variable that has value "Q." Probabilistic programming is largely based on probability distributions, such as a Gaussian distribution that has mean 0.0 and standard deviation 1.0.

In this article, I show you how to get started with the Infer.NET library by computing ratings for a set of competitors using head-to-head results. A good way to get an idea of probabilistic programming and see where this article is headed is to take a look at the demo program in Figure 1. The demo sets up six competitors (think sports teams): Angels, Bruins, Comets, Demons, Eagles and Flyers.

Figure 1 Infer.NET Rating in Action

The six teams play against each other. Each team plays three times so there are a total of nine games. The demo pro-gram defines a probabilistic model that assumes each team has an intrinsic strength and that each strength can be described by a Gaussian (also called normal or bell-shaped) distribution with a mean of 2000 and a standard deviation of 200 arbitrary units.

Using the win-loss data, the demo program infers the strengths of the six teams. The Angels won three games and lost none and have an inferred strength of 2256.8 units, which is about 1.25 standard deviation units better than the assumed average strength of 2000 units. The Flyers lost all three of their games and have an inferred strength of 1739.7 units.

I use the term strengths here but you can think of the inferred numeric values as ratings. Notice that if you can infer the rat-ings of a set of items, you automatically get a ranking of the items. The final ranking, from best to worst, is Angels, Bruins, Com-ets, Eagles, Demons, Flyers.

This article assumes you have intermediate or better programming skills with C#, but doesn't assume you know anything about Infer.NET or probabilistic programming. Infer.NET supports only C# and F#, so you could refactor the demo to F# if you wish. Once you understand the basics of probabilistic programming, you should be able to rewrite the demo program using one of the many other probabilistic programming frameworks, such as Stan or Edward.

The complete source code for the demo program is presented in this article. The source code is also available in the accompa-nying file download. All normal error checking has been removed to keep the main ideas as clear as possible.

Understanding Random Variables

The demo program assumes that each team's strength is a Gaussian-distributed random variable with a specified mean (aver-age) and standard deviation. What exactly does this mean and from where does this assumption come?

There are many kinds of random variable distributions. Each has one or more characteristic parameters. For example, a ran-dom variable that follows a uniform distribution with a = 2.0 and b = 5.0 can be any value between 2.0 and 5.0, where each pos-sible value is equally likely. For the demo problem it would be possible to assume that team strength is uniformly distributed with a = 1000 and b = 3000, which would imply that many such strengths would have a mean of 0.5 * (1000 + 3000) = 2000.0 units and a standard deviation of sqrt((1/12) * (3000 - 1000)^2) = 577.35 units.

The demo assumes that team strengths are Gaussian with mean = 2000 and standard deviation = 200, and therefore the aver-age strength of many such teams would be about 2000 and about 99 percent of many such teams would have strengths between 2000 - (3 * 200) = 1400 and 2000 + (3 * 200) = 2600.

At this point, you may be thinking, “Great, so I have to have a Ph.D. in statistics distribution to use Infer.NET and create prob-abilistic programs." This isn't entirely true. Infer.NET supports many distributions, but in practice you usually need to under-stand just a few. The ones I use most frequently are Gaussian, uniform, beta, binomial, multinomial, gamma and Poisson. An analogy is the many types of data structures supported by the System.Collections namespace of the .NET Framework. Even though there are many, you typically use only a few (stacks, queues, dictionaries), and you learn about new ones as you encoun-ter problems that need them.

The mean is the average value of the data and standard deviation is a measure of how spread out the data is. The variance is the square of the standard deviation, and the precision is the inverse of the variance. The reason there are three different terms that contain the exact same information is that sometimes one form is more convenient than the others when using it in a math equation of some sort. The Infer.NET library tends to use variance and precision rather than standard deviation.

The Demo Program Structure

The complete source code for the demo program, with a few minor edits to save space, is presented in Figure 2.

Figure 2 Complete Source Code

using System;
using Microsoft.ML.Probabilistic.Models;
using Microsoft.ML.Probabilistic.Algorithms;
using Microsoft.ML.Probabilistic.Distributions;
// VS2017 (Framework 4.7) Infer.NET 0.3.1810.501

namespace InferStrengths
{
  class InferStrengthsProgram
  {
    static void Main(string[] args)
    {
      Console.WriteLine("Begin Infer.NET demo ");

      // ===== Set up teams and win-loss data =====================

      string[] teamNames = new string[] { "Angels", "Bruins",
        "Comets", "Demons", "Eagles", "Flyers" };
      int N = teamNames.Length; 

      int[] winTeamIDs =  new int[] { 0, 2, 1, 0, 1, 3, 0, 2, 4 };
      int[] loseTeamIDs = new int[] { 1, 3, 2, 4, 3, 5, 5, 4, 5 };

      Console.WriteLine("Data: \n");
      for (int i = 0; i < winTeamIDs.Length; ++i) {
        Console.WriteLine("game: " + i + "   winning team: " +
          teamNames[winTeamIDs[i]] + "   losing team: " +
          teamNames[loseTeamIDs[i]]);
      }

      // ===== Define a probabilistic model =======================

      Range teamIDsRange = new Range(N).Named("teamsIDRange");
      Range gameIDsRange =
        new Range(winTeamIDs.Length).Named("gameIDsRange");

      double mean = 2000.0;
      double sd = 200.0;
      double vrnc = sd * sd;
      
      Console.WriteLine("\nDefining Gaussian model mean = " +
        mean.ToString("F1") + " and sd = " + sd.ToString("F1"));
      VariableArray<double> strengths =
        Variable.Array<double>(teamIDsRange).Named("strengths");
      
      strengths[teamIDsRange] =
        Variable.GaussianFromMeanAndVariance(mean,
        vrnc).ForEach(teamIDsRange);

      VariableArray<int> winners =
        Variable.Array<int>(gameIDsRange).Named("winners");
      VariableArray<int> losers =
        Variable.Array<int>(gameIDsRange).Named("losers");

      winners.ObservedValue = winTeamIDs;
      losers.ObservedValue = loseTeamIDs;

      using (Variable.ForEach(gameIDsRange))
      {
        var ws = strengths[winners[gameIDsRange]];
        var ls = strengths[losers[gameIDsRange]];
        Variable<double> winnerPerf =
      Variable.GaussianFromMeanAndVariance(ws, 400).Named("winPerf");
        Variable<double> loserPerf =
      Variable.GaussianFromMeanAndVariance(ls, 400).
        Named("losePerf");

        Variable.ConstrainTrue(winnerPerf > loserPerf);
      }

      // ===== Infer team strengths using win-loss data ===========

      Console.WriteLine("\nInferring strengths from win-loss data");
      var iengine = new InferenceEngine();
      iengine.Algorithm = new ExpectationPropagation();
      iengine.NumberOfIterations = 40;
      // iengine.ShowFactorGraph = true;  // needs Graphviz install

      Gaussian[] inferredStrengths =
        iengine.Infer<Gaussian[]>(strengths);
      Console.WriteLine("Inference complete. Inferred strengths: ");

      // ===== Show results =======================================

      for (int i = 0; i < N; ++i) {
        double strength = inferredStrengths[i].GetMean();
        Console.WriteLine(teamNames[i] + ": " +
          strength.ToString("F1"));
      }

      Console.WriteLine("\nEnd demo ");
      Console.ReadLine();
    } // Main
  }
} // ns

To create the demo program, I launched Visual Studio 2017 Community Edition and selected the console application template project with .NET Framework version 4.7 and named it InferStrengths. Infer.NET can run in a .NET Core application, but I prefer the classic .NET Framework.

After the template code loaded, I right-clicked on file Program.cs in the Solution Explorer window and renamed the file to InferStrengthsProgram.cs and then allowed Visual Studio to automatically rename class Program for me.

I right-clicked on the project name in the Solution Explorer window and selected the Manage NuGet Packages option. In the NuGet window, I selected the Browse tab and then searched for 'Infer.NET.' In the results list I selected the Microsoft.ML.Probabilistic.Compiler package (version 0.3.1810.501) and clicked the install button. Microsoft plans to migrate Infer.NET into the ML.NET library at some point, so if you can't find Infer.NET as a standalone package, look for it inside the ML.NET package.

Setting up the Data

The demo sets up the six sports teams like so:

string[] teamNames = new string[] { "Angels", "Bruins",
  "Comets", "Demons", "Eagles", "Flyers" };
int N = teamNames.Length;

Because Infer.NET works primarily with random variables, in many programs all data is strictly numeric. Here, the team names are specified as strings rather than integer indexes just for readability. The win-loss data is:

int[] winTeamIDs =
  new int[]  { 0, 2, 1, 0, 1, 3, 0, 2, 4 };
int[] loseTeamIDs = 
  new int[]  { 1, 3, 2, 4, 3, 5, 5, 4, 5 };

The data means in game [0], team 0 (Angels) beat team 1 (Bruins). In game [1], team 2 (Comets) beat team 3 (Demons), and so on through game [8]. With numerical programming, using parallel arrays like this tends to be a more common pattern than placing the data into a class or struct object.

Notice that at this point the demo is using only native .NET data types. However, Infer.NET has its own type system and the data will be converted into types useable by Infer.NET shortly. The demo displays the source data:

for (int i = 0; i < winTeamIDs.Length; ++i) {
  Console.WriteLine("game: " + i + " winning team: " +
    teamNames[winTeamIDs[i]] + " losing team: " +
    teamNames[loseTeamIDs[i]]);
}

There is very little data in the demo. The ability to work with limited data is a strength of probabilistic programming compared to some machine learning techniques, such as neural networks or reinforcement learning, that often require very large amounts of training data to be effective.

Defining the Probabilistic Model

The demo prepares the Gaussian model with these statements:

Range teamIDsRange = new Range(N).Named("teamsIDRange");
Range gameIDsRange =
  new Range(winTeamIDs.Length).Named("gameIDsRange");
double mean = 2000.0;
double sd = 200.0;
double vrnc = sd * sd;

The idea of a Range object is important in Infer.NET. Unlike standard procedural programming where it’s common to explicitly iterate using a for loop or a foreach loop, in Infer.NET it's more common to apply meta operations via a Range object. This is a coding paradigm that can be a bit difficult to get used to.

Almost all Infer.NET objects can be supplied with an optional string name by using the Named method chained to the constructor call. The object string name doesn’t need to match the object identifier name, but it’s good practice to make them the same. As you’ll see shortly, using the Named method is useful because Infer.NET has a built-in way to display a model’s computational graph, and the graph will use a string name if available rather than a library-generated name like vdouble23.

The team strengths to infer are set up with these two statements:

VariableArray<double> strengths =
  Variable.Array<double>(teamIDsRange).Named("strengths");
strengths[teamIDsRange] =
  Variable.GaussianFromMeanAndVariance(mean,
  vrnc).ForEach(teamIDsRange);

The first statement sets up a single object as a special VariableArray type that consists of six random Variable objects. The second statement initializes each random variable as a Gaussian with mean = 2000 and variance = 4000 (which is equivalent to standard deviation = 200). It’s possible to set up an array of individual Variable objects along the lines of:

Variable<double>[] arr = new Variable<double>[6];
for (int i = 0; i < 6; ++i)
  arr[i] = Variable.GaussianFromMeanAndVariance(2000, 4000);

However, the Infer.NET documentation recommends using the VariableArray approach for performance reasons. Next, random variables of type int are created to hold the indexes of the winning and losing teams, and then the native .NET int data is transformed into Infer.NET objects:

VariableArray<int> winners =
  Variable.Array<int>(gameIDsRange).Named("winners");
VariableArray<int> losers =
  Variable.Array<int>(gameIDsRange).Named("losers");
winners.ObservedValue = winTeamIDs;
losers.ObservedValue = loseTeamIDs;

In this situation, the random variables don't follow a probability distribution; instead, they're essentially just ordinary integer values. The ObservedValue property is used to set fixed values in this situation.

The key statements that define the characteristics of the probabilistic model are:

using (Variable.ForEach(gameIDsRange)) {
  var ws = strengths[winners[gameIDsRange]];
  var ls = strengths[losers[gameIDsRange]];
  Variable<double> winnerPerf =
    Variable.GaussianFromMeanAndVariance(ws,
    400.0).Named("winPerf");
  Variable<double> loserPerf =
    Variable.GaussianFromMeanAndVariance(ls,
    400.0).Named("losePerf");

  Variable.ConstrainTrue(winnerPerf > loserPerf);
}

This code is quite subtle. The code is inside a Variable.ForEach block so the operations are applied to each team and each game outcome in a meta fashion. Random variables that correspond to the performance of the winning and losing teams for each game are defined so that they're based on the team's strength, but they have a noise component that's Gaussian-distributed with variance = 400. One way to think of this is that when the team strengths are being inferred, there's a chance that one team may slightly underperform and the opponent may slightly outperform. The noise variance value of 400 must be determined by trial and error.

The ConstrainTrue statement is critical and adds the logic that allows the inference engine to compute each team's strength. The inference engine uses a sophisticated algorithm to examine different values of means and variances for each of the six teams and then determines how likely the observed win-loss outcomes are, given the assumed means and variances. The inference algorithm finds the six means and variances that best match the observed data. Clever!

Inferring the Team Strengths

Once the probabilistic model is defined, an inference engine is created with these statements:

var iengine = new InferenceEngine();
iengine.Algorithm = new ExpectationPropagation();
iengine.NumberOfIterations = 40;
// iengine.ShowFactorGraph = true;

Infer.NET supports three different algorithms: expectation propagation, variational message passing and Gibbs sampling. Different algorithms apply to different types of probabilistic models. For the demo model, only expectation propagation is valid. Expectation propagation is unique to Infer.NET and minimizes the Kullback-Liebler divergence metric to approximate the probability distribution for a set of observed data. The value of the NumberOfIterations property must be determined by trial and error.

An interesting feature of Infer.NET is that you can automatically generate a visual representation of a model's computational graph, such as the one in Figure 3 that corresponds to the demo program. This feature has a dependency on the open source Graphviz program. If you set ShowFactorGraph to true and Graphviz is installed, a graph will be saved in SVG format in the current working directory, and it can be displayed in a browser.

Figure 3 Visual Representation of the Computational Graph

After the inference engine is created, it's easy to compute and display the team strengths using the Infer method:

Gaussian[] inferredStrengths =
  iengine.Infer<Gaussian[]>(strengths);
for (int i = 0; i < N; ++i) {
  double strength = inferredStrengths[i].GetMean();
  Console.WriteLine(teamNames[i] + ": " + strength);
}

The strengths object is passed to Infer. Recall that strengths is type VariableArray, which is a collection of Gaussian random variable objects that are associated to the winners and losers VariableArray objects, which have the win-loss data. Notice that an Infer.NET model is a collection of loosely connected objects rather than a single top-level object. For me, at least, this idea took a while to get used to when I was new to Infer.NET.

Wrapping Up

Probabilistic programming using Infer.NET has a much different feel to it than programming using standard procedural languages, and there’s a non-trivial learning curve involved. The example presented in this article is loosely based on the TrueSkill ranking system developed by Microsoft Research for use in Xbox matchmaking. But probabilistic programming can be applied to many problems other than rating and ranking. Somewhat unusually for a code library that has its origins in research, Infer.NET has excellent documentation. If you want to learn more about probabilistic programming, I recommend that you take a look at the tutorials at bit.ly/2rEr784.

Dr. James McCaffrey works for Microsoft Research in Redmond, Wash. He has worked on several Microsoft including Internet Explorer and Bing. Dr. McCaffrey can be reached at jamccaff@microsoft.com.

Thanks to the following Microsoft technical experts who reviewed this article: Chris Lee, Ricky Loynd

Discuss this article in the MSDN Magazine forum