July 2019

Volume 34 Number 7

[Machine Learning]

Create a Machine Learning Prediction System Using AutoML

By James McCaffrey

Microsoft ML.NET is a large, open source library of machine learning functions that allows you to create a prediction model using a C# language program, typically in Visual Studio. Writing a program that directly uses ML.NET to create a prediction model isn’t simple. The AutoML system uses the ML.NET command-line interface (CLI) tool to automatically create a prediction model for you, and also generates sample code that uses the model, which you can then customize.

A good way to understand what the ML.NET CLI and AutoML are, and to see where this article is headed, is to examine the screenshot of a demo system in Figure 1. The demo uses a file of training data named people_train.tsv to create a prediction model, and a file of test data named people_test.tsv to evaluate the accuracy of the prediction model. The goal of the demo is to predict the political leaning of a person (conservative, moderate, liberal) from their age, sex, geo-region and annual income.

Figure 1 AutoML and ML.NET CLI in Action

The demo program runs AutoML in a Windows CMD shell by executing the command:

mlnet auto-train ^
--task multiclass-classification ^
--dataset ".\Data\people_train.tsv" ^
--test-dataset ".\Data\people_test.tsv" ^
--label-column-name politic ^
--max-exploration-time 5

The caret character is used for line continuation in a CMD shell. The same AutoML command can be run in PowerShell, too, by using the backtick line-continuation character instead of the caret.
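For example, the PowerShell form of the demo command is identical except for the continuation character:

mlnet auto-train `
--task multiclass-classification `
--dataset ".\Data\people_train.tsv" `
--test-dataset ".\Data\people_test.tsv" `
--label-column-name politic `
--max-exploration-time 5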

AutoML automatically creates and evaluates several different machine learning models using different algorithms, such as SgdCalibratedOva (“stochastic gradient descent calibrated one versus all”) and AveragedPerceptronOva. For the demo run, AutoML identified the LightGbmMulti (“lightweight gradient boosting machine multiclass”) algorithm as the best option, with a prediction accuracy of 77.01 percent on the test data.

After examining the different prediction models, AutoML saved the best model as MLModel.zip and also generated sample C# code, along with a Visual Studio project file with the rather lengthy name SampleMulticlassClassification.Console.App.csproj, for using the model.

Figure 2 shows an example of how the trained model can be used from within a C# program. The generated code was edited to make a prediction for a person who is 33 years old, male, lives in the “central” region and has a $62,000.00 annual income. The prediction is that the person is a political “moderate.”

Figure 2 Using the AutoML Model to Make a Prediction

The demo program also displays the prediction probabilities for conservative, moderate, liberal: (0.0034, 0.9055, 0.0912). These aren’t true probabilities in the mathematical sense, but they do give you a loose suggestion of the confidence of the prediction. In this case, the model seems quite certain (p = 0.9055) that the person is “moderate.”

Both ML.NET CLI and AutoML are currently in Preview mode and are in rapid development, so some of the information presented here may have changed by the time you read this article. However, most of the changes should be in the form of additional features rather than in underlying architecture.

This article assumes you have intermediate or better skills with C#, and basic familiarity with working in a command shell, but doesn’t assume you know anything about ML.NET CLI or AutoML. All the demo code is presented in this article. The two demo data files are available in the download that accompanies this article.

Understanding the Data

Most machine learning problems start with analysis and preparation of the available data, and that’s the case when using ML.NET CLI and AutoML. The training data has 1,000 items and looks like:

sex   age  region   income    politic
False  26  eastern  53800.00  conservative
False  19  western  39200.00  moderate
True   19  central  80800.00  liberal
False  52  eastern  86700.00  conservative
False  56  eastern  89200.00  liberal
...

The test data has the same format and consists of 200 items. The data is synthetic and was generated programmatically. Both files are tab-delimited and have a .tsv extension to identify them as such to AutoML. AutoML also supports space-delimited (.txt) and comma-delimited (.csv) files. AutoML supports data files with or without a header line, but, as you’ll see shortly, supplying a header line is more convenient than not supplying one.

Although there are many kinds of prediction problems, there are three fundamental types: multiclass classification, binary classification and regression. The goal of a multiclass classification problem is to predict a discrete value where there are three or more possible values to consider. For example, predicting the political leaning (conservative, moderate, liberal) of a person based on their age, sex, geo-region and annual income, as in the demo program.

The goal of a binary classification problem is to predict a discrete value that can be one of just two possible values. For example, you might want to predict the sex (male or female) of a person based on their age, geo-region, income and political leaning. If you’re new to machine learning, you might find it a bit strange that binary classification and multiclass classification are considered different categories. It turns out that the two types of problems have some fundamental math differences.

The goal of a regression problem is to predict a single numeric value. For example, you might want to predict the annual income of a person based on their age, sex, geo-region and political leaning. AutoML currently supports multiclass classification, binary classification and regression. Support will eventually be added for other types of problems, such as ranking and clustering.

The demo data files illustrate the use of binary, integer, categorical and floating-point data. When working with binary data, such as the sex variable, you should use True and False (either uppercase or lowercase), rather than using 0 and 1. The demo uses False for male and True for female, so you can think of the sex variable as “is-female.”
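When AutoML later generates a C# project for this data (discussed below), the data columns map onto a generated ModelInput class. Here's a rough sketch of what that class might look like for the people data, assuming the Microsoft.ML.Data attribute style the generator uses; the actual generated ModelInput.cs may differ in details:

using Microsoft.ML.Data;

// Rough sketch only: column order matches the .tsv files.
public class ModelInput
{
  [ColumnName("sex"), LoadColumn(0)]
  public bool Sex { get; set; }

  [ColumnName("age"), LoadColumn(1)]
  public float Age { get; set; }

  [ColumnName("region"), LoadColumn(2)]
  public string Region { get; set; }

  [ColumnName("income"), LoadColumn(3)]
  public float Income { get; set; }

  [ColumnName("politic"), LoadColumn(4)]
  public string Politic { get; set; }
}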

It’s important to keep AutoML files and directories organized. I created a top-level directory named MLdotNET. Then, within MLdotNET, I created a directory named People to act as the root for AutoML files associated with the people data. Within the People directory I created a directory named Data and placed files people_train.tsv and people_test.tsv there. I executed AutoML commands while in the root People directory because AutoML generates subdirectories inside the root directory from which commands are issued.

Installing AutoML

As is often the case with pre-release software, I ran into several minor glitches while installing AutoML and you can expect a few hiccups, too. Briefly, there are three steps to getting AutoML up and running. First, install Visual Studio if necessary. Second, install the .NET Core SDK if necessary. Third, install the ML.NET CLI tool that contains AutoML.

It’s possible to use AutoML without Visual Studio, but the models created by AutoML are designed specifically for Visual Studio. I successfully used Visual Studio 2017 Professional and the free Visual Studio 2017 Community edition. The AutoML documentation states that AutoML works with Visual Studio 2019, but I was unable to get my models to load using it.

The AutoML system relies on the .NET Core framework, in particular the .NET Core SDK. After a bit of trial and error, I succeeded by installing SDK version 2.2.101. The installation process uses a standard self-extracting executable with a nice GUI. Installing the .NET Core SDK also gives you the Runtime environment, so you don’t have to install it separately.

After installing the .NET Core SDK, the last step is to install AutoML. AutoML isn't a standalone program; instead, it resides within a tool named mlnet, which is a bit confusing. To install the mlnet tool, you launch a shell (CMD or PowerShell) and issue the command > dotnet tool install -g mlnet. The command reaches out over the Internet (so you must be online) to a default repository and installs mlnet to a default location on your machine. After a couple of false starts I eventually got a “Tool ‘mlnet’ (version 0.3.0) was successfully installed” message. I verified the installation by issuing the command > dotnet tool list -g.

The diagram in Figure 3 shows the relationship between the key components used in an AutoML system.

Figure 3 AutoML Components

Multiclass Classification

To create a multiclass classification model using AutoML, a minimum of three arguments is needed, for example:

mlnet auto-train ^
--task multiclass-classification ^
--dataset ".\Data\people_train.tsv" ^
--label-column-name politic

The task type can be “multiclass-classification,” “binary-classification” or “regression.” The dataset argument specifies the path to the training data. You can use either Windows-style backslash characters or Linux-style forward slashes. The test-dataset argument is optional. If test-dataset isn’t present, AutoML will evaluate the trained model using the training data.

You can supply an optional validation-dataset argument to allow AutoML to use the train-validate-test paradigm. During training, AutoML monitors the error associated with the model on the validation data, and when the error starts to increase, training can stop early so that the model doesn’t become overfitted to the training data.
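For example, a hypothetical command that supplies a separate validation file looks like the following (the people_validate.tsv file name is made up for illustration; the demo download doesn't include one):

mlnet auto-train ^
--task multiclass-classification ^
--dataset ".\Data\people_train.tsv" ^
--validation-dataset ".\Data\people_validate.tsv" ^
--test-dataset ".\Data\people_test.tsv" ^
--label-column-name politic ^
--max-exploration-time 5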

The label-column-name argument specifies the name of the column that contains the variable to predict. If your training dataset doesn’t have a header, you can use the label-column-index argument with a 1-based column index, for example --label-column-index 5.

The table in Figure 4 summarizes the 14 arguments for AutoML. Each argument has a full name preceded by double hyphens and a single-letter, case-sensitive alias preceded by a single hyphen. The meaning of most of the arguments is self-explanatory. The --cache argument tells AutoML to load all data into memory if possible (on), not to (off), or to decide automatically (auto).

Figure 4 AutoML Command Summary

Argument                 Alias  Values                                                           Default Value
--task                   -T     multiclass-classification, binary-classification, regression
--dataset                -d     path to training data file
--test-dataset           -t     path to test data file                                           none
--validation-dataset     -v     path to validation data file                                     none
--label-column-name      -n     header name of the column holding the variable to predict
--label-column-index     -i     1-based index of the column holding the variable to predict
--ignore-columns         -I     comma-separated header names of columns to ignore                none
--has-header             -h     true, false                                                      true
--max-exploration-time   -x     time in seconds                                                  10
--verbosity              -V     quiet, minimal, diagnostic                                       minimal
--cache                  -c     on, off, auto                                                    auto
--name                   -N     name of created output project                                   Sample{task}
--output-path            -o     directory in which to place the output project                   current directory
--help                   -h
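To illustrate some of the optional arguments from Figure 4, a hypothetical command that names the output project and requests more detailed progress messages might look like:

mlnet auto-train ^
--task multiclass-classification ^
--dataset ".\Data\people_train.tsv" ^
--test-dataset ".\Data\people_test.tsv" ^
--label-column-name politic ^
--name PeoplePolitics ^
--verbosity diagnostic ^
--max-exploration-time 60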

The --max-exploration-time argument has the biggest impact on AutoML results. In general, the more time you allow AutoML to work, the better the prediction model that will be generated. AutoML does what’s in effect a double exploration. First, it tries different algorithms that are applicable to the type of prediction task. For example, for a multiclass classification problem, AutoML currently supports 10 algorithms: AveragedPerceptronOva, FastForestOva, FastTreeOva, LbfgsLogisticRegressionOva, LbfgsMaximumEntropyMulti, LightGbmMulti, LinearSvmOva, SdcaMaximumEntropyMulti, SgdCalibratedOva, SymbolicSgdLogisticRegressionOva.

Second, for each applicable algorithm, AutoML tries different values of the hyperparameters that are specific to the algorithm. For example, the FastTreeOva algorithm requires you to specify values for five parameters: NumberOfLeaves, MinimumExampleCountPerLeaf, NumberOfTrees, LearningRate and Shrinkage. The LightGbmMulti algorithm requires values for 13 parameters, including items such as NumberOfIterations, LearningRate and L2Regularization.
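To give a sense of what those hyperparameters look like in code, here's a hedged sketch of manually configuring the LightGbmMulti trainer through the ML.NET API. The specific values are arbitrary, not values AutoML found, and the sketch assumes the Microsoft.ML.LightGbm package is referenced:

// Sketch only: manually setting a few LightGBM hyperparameters that
// AutoML would otherwise search over automatically.
var options = new Microsoft.ML.Trainers.LightGbm.LightGbmMulticlassTrainer.Options
{
  LabelColumnName = "politic",
  FeatureColumnName = "Features",
  NumberOfIterations = 100,
  NumberOfLeaves = 20,
  LearningRate = 0.20
};
var trainer = mlContext.MulticlassClassification.Trainers.LightGbm(options);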

The number of different combinations of algorithms and hyperparameters is unimaginably large. Human machine learning experts rely on intuition and experience to find a good algorithm and a good set of hyperparameters, but the process is extremely tedious and time-consuming. AutoML does a sophisticated search automatically.

Interpreting the Results

If you refer to the screenshot in Figure 1, you’ll see that AutoML displays this information for the five best models found:

Trainer                MicroAccuracy  MacroAccuracy
1  LightGbmMulti          0.7701         0.7495
2  FastTreeOva            0.7471         0.7201
3  FastForestOva          0.7471         0.7236
4  AveragedPerceptronOva  0.4598         0.3333
5  LinearSvmOva           0.4598         0.3333

The MicroAccuracy and MacroAccuracy values give you two different metrics for model prediction accuracy. MicroAccuracy is the more important of the two. MicroAccuracy is ordinary accuracy: the number of correct predictions on the test data divided by the total number of test items. The test dataset has 200 items and the best algorithm, LightGbmMulti, scored 77.01 percent, which works out to 154 of 200 correct.

The MacroAccuracy is the average accuracy across the classes to predict. For example, suppose the 200-item test dataset had 60 conservative items, 90 moderate items and 50 liberal items. And suppose a model correctly predicted 45 of the 60 (0.7500) conservative items, 63 of the 90 (0.7000) moderate items, and 30 of the 50 (0.6000) liberal items. Then the MacroAccuracy for the model is (0.7500 + 0.7000 + 0.6000) / 3 = 0.6833.

MacroAccuracy is useful when a dataset is highly skewed toward one class. For example, if the test dataset had 180 conservative items, 10 moderate items and 10 liberal items, a model could just predict conservative for every item and score 180 / 200 = 0.9000 for MicroAccuracy, but its MacroAccuracy would be just (1.0000 + 0.0000 + 0.0000) / 3 = 0.3333. So, a big discrepancy between the MicroAccuracy and MacroAccuracy values should be investigated.
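As a concrete check, here's a minimal C# sketch of the MicroAccuracy and MacroAccuracy arithmetic for the hypothetical 45-of-60, 63-of-90, 30-of-50 example above (this is illustration code, not code generated by AutoML):

// Minimal sketch of micro vs. macro accuracy; requires using System.Linq.
int[] numCorrect = { 45, 63, 30 };  // correct predictions per class
int[] numItems = { 60, 90, 50 };    // test items per class
double micro = (double)numCorrect.Sum() / numItems.Sum();  // 138 / 200 = 0.6900
double macro = numCorrect.Zip(numItems,
  (c, n) => (double)c / n).Average();  // (0.75 + 0.70 + 0.60) / 3 = 0.6833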

Using the Generated Model

Creating a machine learning prediction model is interesting, but the whole point is to use the model to make predictions. AutoML creates a subdirectory named SampleMulticlassClassification in the root People directory. You can specify a more descriptive name using the --name argument. The subdirectory contains directories that hold the generated model in .zip format and an auto-generated C# console application that can be used as a template for making predictions.

The auto-generated code is clear and easy to interpret. I double-clicked on file SampleMulticlassClassification.sln, which launched Visual Studio 2017, and then used the Solution file to load the C# project. The sample code makes a prediction using the first data item in the test dataset file. I edited the template code to make a prediction of the political leaning for a new, previously unseen person, as shown in Figure 2.

The edited prediction code begins by loading the trained model into memory and using the model to create a PredictionEngine object:

static void Main(string[] args)
{
  MLContext mlContext = new MLContext();
  ITransformer mlModel =
    mlContext.Model.Load(GetAbsolutePath(MODEL_FILEPATH),
    out DataViewSchema inputSchema);
  var predEngine =
    mlContext.Model.CreatePredictionEngine<ModelInput,
    ModelOutput>(mlModel);...

Next, the custom code sets up the predictor values for a person:

Console.WriteLine("\nPredicting politic for Age = 33,
  Sex = Male, Region = central, Income = $62,000.00");
ModelInput X = new ModelInput();
X.Age = 33; X.Sex = false;
X.Region = "central"; X.Income = 62000.00f; ...

Recall that binary predictor variables such as Sex are Boolean-encoded. And notice that the Income variable has a trailing “f” to make the literal type float, which is the default floating-point type used by ML.NET systems. The Age variable is also type float, but doesn’t require a trailing “f” because the value 33 is an integer literal and is converted to type float automatically.

The prediction is made like so:

ModelOutput Y = predEngine.Predict(X);
string predPolitic = Y.Prediction;
float[] predProbs = Y.Score;...

The Prediction property is a string representation of the predicted class (“moderate” for the demo) and the Score is an array of float values that correspond to each possible class: (0.0034, 0.9055, 0.0912). The AutoML system uses the order in which class labels are first seen in the training data. Recall the training data looks like:

sex   age  region   income    politic
False  26  eastern  53800.00  conservative
False  19  western  39200.00  moderate
True   19  central  80800.00  liberal
False  52  eastern  86700.00  conservative

So “conservative” is [0], “moderate” is [1], and “liberal” is [2]. When I use AutoML I often rearrange the first few lines of my training data to get a nice order for the values to predict.
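Putting those two facts together, here's a short sketch of mapping the Score array back to a class label, assuming the label order established by the training data shown above (predProbs is the array from the prediction code):

// Sketch: find the class with the largest score and display its label.
string[] labels = { "conservative", "moderate", "liberal" };
int maxIdx = 0;
for (int i = 1; i < predProbs.Length; ++i)
  if (predProbs[i] > predProbs[maxIdx])
    maxIdx = i;
Console.WriteLine("Predicted politic = " + labels[maxIdx] +
  " (score = " + predProbs[maxIdx].ToString("F4") + ")");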

Binary Classification and Regression

Once you understand the principles for using AutoML to create and use a prediction model for a multiclass classification problem, it’s relatively simple to work with binary classification and regression problems. For example, you could issue the following command to create a model to predict a person’s sex based on age, region, income and political leaning:

mlnet auto-train ^
--task binary-classification ^
--dataset ".\Data\people_train.tsv" ^
--test-dataset ".\Data\people_test.tsv" ^
--label-column-name sex ^
--max-exploration-time 300

And you could use this command to predict annual income based on age, sex and political leaning, but not region:

mlnet auto-train ^
--task regression ^
--dataset ".\Data\people_train.tsv" ^
--test-dataset ".\Data\people_test.tsv" ^--ignore-columns region ^
--label-column-name income ^
--max-exploration-time 300

Binary classification and regression commands produce different result metrics than multiclass classification. Binary classification displays Accuracy, AUC, AUPRC and F1-score metrics. Briefly, AUC is the area under the receiver operating characteristic (ROC) curve and is a measure of how well two classes can be separated in a binary classification problem. Larger values of AUC are better. AUPRC is the “area under the precision-recall curve,” a somewhat similar metric where larger values are also better. The F1 score is the harmonic mean of precision and recall, both of which are metrics where larger values are better.
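For example, here's a tiny sketch of the F1 calculation, using hypothetical precision and recall values rather than values from the demo:

// Minimal sketch: F1 is the harmonic mean of precision and recall.
double precision = 0.80;  // hypothetical value
double recall = 0.70;     // hypothetical value
double f1 = (2.0 * precision * recall) / (precision + recall);  // 0.7467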

Using AutoML for regression displays R-squared, Absolute-loss, Squared-loss and RMS-loss. Larger values of R-squared are better, and smaller values of Absolute-loss, Squared-loss and RMS-loss are better. If you’re a relative newcomer to machine learning, don’t get overly concerned with all these statistics. It’s common practice in machine learning to report many metrics. Accuracy is usually the primary metric to pay attention to.

But notice that there’s no accuracy metric for a regression model. This is because there’s no inherent definition of what a correct prediction is for a regression problem. For example, if a predicted annual income is $58,001.00 and the true annual income is $58,000.00, is the prediction correct?

For regression problems you must define a problem-dependent meaning of accuracy. Typically, you specify an allowable percentage difference. For example, if you specify a percentage of 0.10 and a correct income is $60,000.00, then any predicted income between $54,000.00 and $66,000.00 will be considered a correct prediction.

The template code generated by AutoML makes it easy for you to compute a custom accuracy metric for a regression problem. In pseudo-code:

loop each line in test dataset file
  parse out sex, age, region, politic, and correct income
  use sex, age, region, politic to compute predicted income
  if predicted is within x% of correct
    increment number correct
  else
    increment number wrong
end-loop
return number correct / (number correct + number wrong)
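Here's a minimal C# sketch of that loop, under a few assumptions: it uses a predEngine object and the ModelInput and ModelOutput classes from the regression project that AutoML generates, it assumes ModelOutput.Score holds the predicted income, it uses a hypothetical ParseTestLine helper that you'd write yourself to convert one tab-delimited test line into a ModelInput plus the correct income, and it uses a 10 percent threshold for a correct prediction:

// Sketch only. ParseTestLine() is a hypothetical helper, not generated code.
int numCorrect = 0, numWrong = 0;
string[] lines = System.IO.File.ReadAllLines(@".\Data\people_test.tsv");
for (int i = 1; i < lines.Length; ++i)  // skip the header line
{
  ModelInput x;
  float correctIncome;
  ParseTestLine(lines[i], out x, out correctIncome);
  float predictedIncome = predEngine.Predict(x).Score;
  if (Math.Abs(predictedIncome - correctIncome) <= 0.10f * correctIncome)
    ++numCorrect;
  else
    ++numWrong;
}
double accuracy = (double)numCorrect / (numCorrect + numWrong);
Console.WriteLine("Accuracy (within 10%) = " + accuracy.ToString("F4"));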

These examples should give you a good idea of the types of problems that can be tackled by AutoML and ML.NET. One type of machine learning that AutoML and ML.NET don’t handle is prediction based on a neural network. Neural networks are significantly more complex than the traditional machine learning algorithms supported by AutoML. There has been discussion of adding neural network functionality to ML.NET and AutoML, but such functionality isn’t likely to be added in the short term.

An interesting benefit of using AutoML is that, in addition to generating template code to load a trained model and use it to make predictions, AutoML generates a ModelBuilder.cs file that contains the underlying code that was used to create, train and save the prediction model. For example, some of the code generated in the ModelBuilder.cs file for the multiclass classification example is:

// Set the training algorithm
var trainer = mlContext.MulticlassClassification.Trainers
  .LightGbm(labelColumnName: "politic", featureColumnName: "Features")
  .Append(mlContext.Transforms.Conversion.MapKeyToValue(
    "PredictedLabel", "PredictedLabel"));

var trainingPipeline = dataProcessPipeline.Append(trainer);

If you want to explore creating prediction models using ML.NET manually in Visual Studio instead of automatically using AutoML, you can use the ModelBuilder.cs code as a starting point. This is much, much easier than writing the code from scratch.
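For reference, the heart of that generated training code follows the standard ML.NET fit-then-save pattern. Here's a simplified sketch; the variable names follow the generated ModelBuilder.cs, and the exact code AutoML emits may differ:

// Simplified sketch of the train-and-save pattern in ModelBuilder.cs.
// trainingDataView and trainingPipeline come from the generated code.
ITransformer model = trainingPipeline.Fit(trainingDataView);
mlContext.Model.Save(model, trainingDataView.Schema, "MLModel.zip");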

Wrapping Up

I’m quite impressed by my first look at AutoML. I could describe its cool technical features, but more important is that AutoML “just feels right.” The CLI is simple and easy to use and the generated template code is clean and easy to modify. Put another way, AutoML feels like it’s helping me instead of fighting me.

AutoML is just one part of a rapidly growing ecosystem of machine learning tools and systems. This ecosystem is expanding so quickly that even my colleagues and I, who work very close to the sources of these new systems, are finding it challenging to stay on top of everything. Systems that have been around for a couple of years, such as Azure Cognitive Services, Azure Machine Learning Studio, and Azure Data Science Virtual Machines, are being joined by newcomers such as Azure Data Prep SDK, NimbusML, and the ONNX Runtime. It’s an exciting time to develop machine learning systems using .NET and open source technologies.


Dr. James McCaffrey works for Microsoft Research in Redmond, Wash. He has worked on several key Microsoft products including Azure and Bing. Dr. McCaffrey can be reached at jamccaff@microsoft.com.

Thanks to the following Microsoft technical experts who reviewed this article: Chris Lee, Ricky Loynd

