Code download available at msdn.com/magazine/0518magcode.

The goal of a regression problem is to make a prediction where the value to predict is a single numeric value. For example, you might want to predict the height of a person based on their weight, age and sex. There are many techniques that can be used to tackle a regression problem. In this article I’ll explain how to use the CNTK library to create a neural network regression model.

A good way to see where this article is headed is to take a look at the demo program in **Figure 1**. The demo program creates a regression model for the well-known Yacht Hydrodynamics Data Set benchmark. The goal is to predict a measure of resistance for a yacht hull, based on six predictor variables: center of buoyancy of the hull, prismatic coefficient, length-displacement ratio, beam-draught ratio, length-beam ratio and Froude number.

The demo program creates a neural network with two hidden layers, each of which has five processing nodes. After training, the model is used to make predictions for two of the data items. The first item has predictor values (0.52, 0.79, 0.55, 0.41, 0.65, 0.00). The predicted hull resistance is 0.0078 and the actual resistance is 0.0030. The second item has predictor values (1.00, 1.00, 0.55, 0.56, 0.46, 1.00). The predicted hull resistance is 0.8125 and the actual resistance is 0.8250. The model appears to be quite accurate.

This article assumes you have intermediate or better programming skills but doesn’t assume you know much about CNTK or neural networks. The demo is coded using Python, the default language for machine learning, but even if you don’t know Python you should be able to follow along without too much difficulty. The code for the demo program is presented in its entirety in this article. The yacht hull data file used by the demo program can be found at bit.ly/2Ibsm5D, and is also available in the download that accompanies this article.

## Understanding the Data

When creating a machine learning model, data preparation is almost always the most time-consuming part of the project. The raw data set has 308 items and looks like:

```
-2.3 0.568 4.78 3.99 3.17 0.125 0.11
-2.3 0.568 4.78 3.99 3.17 0.150 0.27
...
-5.0 0.530 4.78 3.75 3.15 0.125 0.09
...
-2.3 0.600 4.34 4.23 2.73 0.450 46.66
```

The file is space-delimited. The first six values are the predictor values (often called features in machine learning terminology). The last value on each line is the "residuary resistance per unit weight of displacement."

Because there’s more than one predictor variable, it’s not possible to show the complete data set in a graph. But you can get a rough idea of the structure of the data by examining the graph in **Figure 2**. The graph plots just the prismatic coefficient predictor values and the hull resistance. You can see that the prismatic coefficient values, by themselves, don’t give you enough information to make an accurate prediction of hull resistance.

When working with neural networks, it’s usually necessary to normalize the data in order to create a good prediction model. I used min-max normalization on the six predictor values and on the hull resistance values. I dropped the raw data into an Excel spreadsheet and, for each column, I computed the max and min values. Then, for each column, I replaced every value v with (v - min) / (max - min). For example, the minimum prismatic coefficient value is 0.53 and the maximum value is 0.60. The first value in the column is 0.568 and it’s normalized to (0.568 - 0.53) / (0.60 - 0.53) = 0.038 / 0.07 = 0.5429.
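As a rough alternative to the spreadsheet steps, the same per-column calculation can be sketched in plain Python. The function name min_max_normalize and the three-item sample column are illustrative assumptions, not part of the demo. Only the 0.53 and 0.60 bounds and the 0.568 value come from the text above.

```python
# Sketch of min-max normalization for a single column of values.
# The function name and the sample column are hypothetical.

def min_max_normalize(column):
    """Return a new list with each value v mapped to (v - min) / (max - min)."""
    lo = min(column)
    hi = max(column)
    if hi == lo:
        raise ValueError("column is constant; min-max normalization is undefined")
    return [(v - lo) / (hi - lo) for v in column]

# A made-up column holding the min (0.53), the max (0.60) and the
# first prismatic coefficient value (0.568) mentioned in the text.
prismatic = [0.568, 0.530, 0.600]
normalized = min_max_normalize(prismatic)
print(["%0.4f" % v for v in normalized])
```

The first printed value should agree with the 0.5429 worked out by hand, and the minimum and maximum map to 0 and 1.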

After normalizing, I inserted tags |predictors and |resistance into the Excel spreadsheet so the data can be easily read by a CNTK data reader object. Then I exported the data as a tab-delimited file. The resulting data looks like:

```
|predictors 0.540000 0.542857 . . |resistance 0.001602
|predictors 0.540000 0.542857 . . |resistance 0.004166
...
```
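If you’d rather skip the spreadsheet, lines in this tagged format can be produced with a few lines of Python. This is only a sketch: the helper name to_ctf_line is an assumption, the sample row reuses the first prediction item from the demo program, and separating values with single spaces is my choice to mirror how the sample is displayed here.

```python
# Sketch: format one normalized data item as a tagged line.
# to_ctf_line is a hypothetical helper, not part of the demo program.

def to_ctf_line(predictors, resistance):
    """Build a line of the form: |predictors v1 ... v6 |resistance r"""
    pred_text = " ".join("%0.6f" % v for v in predictors)
    return "|predictors %s |resistance %0.6f" % (pred_text, resistance)

# Normalized values for one item (six predictors, one resistance value).
row = [0.52, 0.785714, 0.55, 0.405512, 0.648352, 0.0]
line = to_ctf_line(row, 0.003044)
print(line)
```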

Alternatives to min-max normalization include z-score normalization and order-magnitude normalization.
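For comparison, here is a minimal sketch of z-score normalization, which replaces each value v with (v - mean) / sd. The function name and sample values are made up for illustration, and using the population standard deviation rather than the sample standard deviation is an assumption on my part. Either is used in practice.

```python
import statistics

# Sketch of z-score normalization for a single column.
# z_score_normalize and the sample values are hypothetical.

def z_score_normalize(column):
    """Map each value v to (v - mean) / sd, using the population sd."""
    mean = statistics.mean(column)
    sd = statistics.pstdev(column)  # assumption: population standard deviation
    if sd == 0:
        raise ValueError("column is constant; z-score normalization is undefined")
    return [(v - mean) / sd for v in column]

values = [2.0, 4.0, 6.0]
z = z_score_normalize(values)
print(["%0.4f" % v for v in z])
```

Unlike min-max normalization, the results aren’t confined to the range 0 to 1. They’re centered on zero instead.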

## The Demo Program

The complete demo program, with a few minor edits to save space, is presented in **Figure 3**. All normal error checking has been removed. I indent with two space characters instead of the usual four as a matter of personal preference and to save space. Note that the ‘\’ character is used by Python for line continuation.

```python
# hydro_reg.py
# CNTK 2.4 with Anaconda 4.1.1 (Python 3.5, NumPy 1.11.1)
# Predict yacht hull resistance based on six predictors

import numpy as np
import cntk as C

def create_reader(path, input_dim, output_dim, rnd_order, sweeps):
  x_strm = C.io.StreamDef(field='predictors',
    shape=input_dim, is_sparse=False)
  y_strm = C.io.StreamDef(field='resistance',
    shape=output_dim, is_sparse=False)
  streams = C.io.StreamDefs(x_src=x_strm, y_src=y_strm)
  deserial = C.io.CTFDeserializer(path, streams)
  mb_src = C.io.MinibatchSource(deserial,
    randomize=rnd_order, max_sweeps=sweeps)
  return mb_src

# ========================================================

def main():
  print("\nBegin yacht hull regression \n")
  print("Using CNTK version = " + \
    str(C.__version__) + "\n")

  input_dim = 6  # center of buoyancy, etc.
  hidden_dim = 5
  output_dim = 1  # residuary resistance
  train_file = ".\\Data\\hydro_data_cntk.txt"

  # data resembles:
  # |predictors 0.540 0.542 . . |resistance 0.001
  # |predictors 0.540 0.542 . . |resistance 0.004

  # 1. create neural network model
  X = C.ops.input_variable(input_dim, np.float32)
  Y = C.ops.input_variable(output_dim)

  print("Creating a 6-(5-5)-1 tanh regression NN for \
yacht hull dataset ")
  with C.layers.default_options():
    hLayer1 = C.layers.Dense(hidden_dim,
      activation=C.ops.tanh, name='hidLayer1')(X)
    hLayer2 = C.layers.Dense(hidden_dim,
      activation=C.ops.tanh, name='hidLayer2')(hLayer1)
    oLayer = C.layers.Dense(output_dim,
      activation=None, name='outLayer')(hLayer2)
  model = C.ops.alias(oLayer)  # alias

  # 2. create learner and trainer
  print("Creating a squared error batch=11 Adam \
fixed LR=0.005 Trainer \n")
  tr_loss = C.squared_error(model, Y)
  max_iter = 50000
  batch_size = 11
  learn_rate = 0.005
  learner = C.adam(model.parameters, learn_rate, 0.99)
  trainer = C.Trainer(model, (tr_loss), [learner])

  # 3. create reader for train data
  rdr = create_reader(train_file, input_dim, output_dim,
    rnd_order=True, sweeps=C.io.INFINITELY_REPEAT)
  hydro_input_map = {
    X : rdr.streams.x_src,
    Y : rdr.streams.y_src
  }

  # 4. train
  print("Starting training \n")
  for i in range(0, max_iter):
    curr_batch = rdr.next_minibatch(batch_size,
      input_map=hydro_input_map)
    trainer.train_minibatch(curr_batch)
    if i % int(max_iter/10) == 0:
      mcee = trainer.previous_minibatch_loss_average
      print("batch %6d: mean squared error = %8.4f" % \
        (i, mcee))

  print("\nTraining complete")

  # (could save model to disk here)

  # 5. use trained model to make some predictions
  np.set_printoptions(precision=2, suppress=True)
  inpts = np.array(
    [[0.520000, 0.785714, 0.550000, 0.405512, \
      0.648352, 0.000000],
     [1.000000, 1.000000, 0.550000, 0.562992, \
      0.461538, 1.000000]],
    dtype=np.float32)
  actuals = np.array([0.003044, 0.825028],
    dtype=np.float32)

  for i in range(len(inpts)):
    print("\nInput: ", inpts[i])
    pred = model.eval(inpts[i])
    print("predicted resistance: %0.4f" % pred[0][0])
    print("actual resistance: %0.4f" % actuals[i])

  print("\nEnd yacht hull regression ")

# ========================================================

if __name__ == "__main__":
  main()
```

Installing CNTK can be a bit tricky. First you install the Anaconda distribution of Python, which contains the necessary Python interpreter, required packages such as NumPy and SciPy, plus useful utilities such as pip. I used Anaconda3 4.1.1 64-bit, which has Python 3.5. After installing Anaconda, you install CNTK as a Python package, not a standalone system, using the pip utility. From an ordinary shell, the command I used was:

The hydro_reg.py demo has one helper function, create_reader. You can consider create_reader as boilerplate for a CNTK regression problem. The only thing you’ll need to change in most scenarios is the tag names in the data file.

All control logic is in a single main function. The code begins:

```python
def main():
  print("Begin yacht hull regression \n")
  print("Using CNTK version = " + \
    str(C.__version__) + "\n")
  input_dim = 6  # center of buoyancy, etc.
  hidden_dim = 5
  output_dim = 1  # residuary resistance
  train_file = ".\\Data\\hydro_data_cntk.txt"
...
```

Because CNTK is young and under continuous development, it’s a good idea to display the version that’s being used (2.4 in this case). The number of input nodes is determined by the structure of the data set. For a regression problem, the number of output nodes is always set to 1. The number of hidden layers and the number of processing nodes in each hidden layer are free parameters—they must be determined by trial and error.

The demo program uses all 308 items for training. An alternative approach is to split the data set into a training set (typically 80 percent of the data) and a test set (the remaining 20 percent). After training, you can compute loss and accuracy metrics of the model on the test data set to verify that the metrics are similar to those on the training data.
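The demo doesn’t perform this split, but the idea can be sketched in a few lines. Everything here is an assumption for illustration: the split_lines helper, the fixed random seed, and the stand-in list that takes the place of the 308 lines of the normalized data file.

```python
import random

# Sketch of an 80/20 train/test split. split_lines is a hypothetical helper.

def split_lines(lines, train_fraction=0.8, seed=1):
    """Shuffle a copy of the lines, then return (train, test) lists."""
    rnd = random.Random(seed)  # fixed seed so the split is repeatable
    shuffled = list(lines)
    rnd.shuffle(shuffled)
    n_train = int(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]

# Stand-in for the 308 lines read from the normalized data file.
all_lines = ["item %d" % i for i in range(308)]
train_lines, test_lines = split_lines(all_lines)
print(len(train_lines), len(test_lines))
```

Shuffling before splitting matters for this data set because the raw file appears to be ordered by hull geometry, so taking the first 80 percent of lines would likely leave some hull shapes out of the training set entirely.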

## Creating the Neural Network Model

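The demo sets up two CNTK objects to hold the predictor and true hull resistance values. As shown in **Figure 3**, these are X = C.ops.input_variable(input_dim, np.float32) and Y = C.ops.input_variable(output_dim).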

CNTK uses 32-bit values by default because 64-bit precision is rarely needed. The name of the input_variable function can be a bit confusing if you’re new to CNTK. Here, the “input_” refers to the fact that the return objects hold values that come from the input data (that correspond to both input and output of the neural network).

The neural network is created with these statements:

```python
print("Creating a 6-(5-5)-1 NN")
with C.layers.default_options():
  hLayer1 = C.layers.Dense(hidden_dim,
    activation=C.ops.tanh, name='hidLayer1')(X)
  hLayer2 = C.layers.Dense(hidden_dim,
    activation=C.ops.tanh, name='hidLayer2')(hLayer1)
  oLayer = C.layers.Dense(output_dim,
    activation=None, name='outLayer')(hLayer2)
model = C.ops.alias(oLayer)  # alias
```

There’s quite a bit going on here. The Python “with” statement can be used to pass a set of common parameter values to multiple functions. In this case, the neural network weight and bias values are initialized using CNTK default values. Neural networks are highly sensitive to initial weight and bias values, so supplying non-default values is one of the first things to try when your neural network fails to learn, which is a painfully common situation.

The neural network has two hidden layers. The X object acts as the input to the first hidden layer; the first hidden layer acts as input to the second hidden layer; and the second hidden layer acts as input to the output layer.

The two hidden layers use tanh (hyperbolic tangent) activation. The two main alternatives are logistic sigmoid and rectified linear unit (ReLU) activation. The output layer uses the “None” activation, which means the values of the output nodes aren’t modified. This is the design pattern to use for a regression problem. Using no activation is sometimes called using the identity activation function because the mathematical identity function is f(x) = x, which has no effect.
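To make the activation functions concrete, here are plain Python versions of the four mentioned above. These are textbook definitions written for illustration and are not taken from the CNTK source. The function names are my own.

```python
import math

# Textbook definitions of four activation functions (names are illustrative).

def tanh(x):
    """Hyperbolic tangent: output is between -1 and +1."""
    return math.tanh(x)

def logistic_sigmoid(x):
    """Logistic sigmoid: output is between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    """Rectified linear unit: negative inputs become 0."""
    return max(0.0, x)

def identity(x):
    """Identity: the value passes through unmodified."""
    return x

for f in (tanh, logistic_sigmoid, relu, identity):
    print("%-17s f(-2.0) = %7.4f   f(2.0) = %7.4f" %
          (f.__name__, f(-2.0), f(2.0)))
```

Notice that only the identity function can produce any real value, which is why it’s the natural choice for an output node that predicts an unbounded numeric quantity.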

The demo program creates an alias named “model” for the output layer. This technique is optional and is a bit subtle. The idea here is that a neural network is essentially a complex math function. The output nodes conceptually represent both a layer of the network and the network/model as a whole.
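The “complex math function” idea can be illustrated without CNTK at all. The sketch below computes the output of a 6-(5-5)-1 network in plain Python. The dense, make_layer and forward names are hypothetical, and the weights are arbitrary small random values rather than trained ones, so the output is meaningless as a prediction.

```python
import math
import random

# Sketch of a 6-(5-5)-1 network as an ordinary function.
# All names here are illustrative; the weights are arbitrary, not trained.

def dense(inputs, weights, biases, activation=None):
    """One fully connected layer: activation(sum of inputs * weights + bias)."""
    outputs = []
    for j in range(len(biases)):
        total = biases[j]
        for i in range(len(inputs)):
            total += inputs[i] * weights[i][j]
        outputs.append(activation(total) if activation else total)
    return outputs

def make_layer(n_in, n_out, rnd):
    """Create an n_in x n_out weight matrix and n_out zero biases."""
    weights = [[rnd.uniform(-0.1, 0.1) for _ in range(n_out)]
               for _ in range(n_in)]
    biases = [0.0] * n_out
    return weights, biases

rnd = random.Random(0)
w1, b1 = make_layer(6, 5, rnd)  # input -> hidden layer 1
w2, b2 = make_layer(5, 5, rnd)  # hidden layer 1 -> hidden layer 2
w3, b3 = make_layer(5, 1, rnd)  # hidden layer 2 -> output

def forward(x):
    """Six predictor values in, a one-element list out."""
    h1 = dense(x, w1, b1, math.tanh)
    h2 = dense(h1, w2, b2, math.tanh)
    return dense(h2, w3, b3)  # no activation on the output layer

x = [0.52, 0.785714, 0.55, 0.405512, 0.648352, 0.0]
y = forward(x)
print(y)
```

Here the forward function plays the role that the model alias plays in the demo: calling it evaluates the output layer, which in turn evaluates everything upstream.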

## Training the Model

The heart of CNTK functionality is the ability to train a neural network model. Training is prepared with these statements:

```python
tr_loss = C.squared_error(model, Y)
max_iter = 50000
batch_size = 11
learn_rate = 0.005
learner = C.adam(model.parameters, learn_rate, 0.99)
trainer = C.Trainer(model, (tr_loss), [learner])
```

A loss (error) function is required so the training object knows how to adjust weights and biases to reduce error. CNTK 2.4 has nine loss functions, but the simple squared_error is almost always suitable for a regression problem. The number of iterations corresponds to the number of update operations and must be determined by trial and error.
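For a single output node, squared error is just the square of the difference between the computed value and the actual value. Here is a small sketch, using the two predicted and actual values reported for the demo run. The function names are my own, and the calculation is meant to show the idea rather than reproduce how CNTK computes it internally.

```python
# Sketch of squared error for a model with one output node.
# Function names are illustrative.

def squared_error(predicted, actual):
    """Squared difference between one predicted and one actual value."""
    return (predicted - actual) ** 2

def mean_squared_error(predictions, actuals):
    """Average squared error over a batch of items."""
    total = sum(squared_error(p, a) for p, a in zip(predictions, actuals))
    return total / len(actuals)

# Predicted and actual hull resistance values from the demo run.
preds = [0.0078, 0.8125]
acts = [0.0030, 0.8250]
mse = mean_squared_error(preds, acts)
print("mean squared error = %0.6f" % mse)
```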

The Trainer object requires a Learner object. You can think of a Learner as an algorithm. CNTK supports eight learning algorithms. For regression problems, I typically get good results using either basic stochastic gradient descent or the more sophisticated Adam (“adaptive moment estimation”).

The batch size is used by CNTK to determine how often to perform weight and bias updates. The demo sets the batch size to 11. Therefore, the 308 items will be grouped into 308 / 11 = 28 randomly selected batches. Each batch is analyzed and then updates are performed. The learning rate controls the magnitude of the weight and bias adjustments. Determining good values for the batch size, the maximum number of iterations, and the learning rate are often the biggest challenges when creating a neural network prediction model.
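The batch arithmetic can be checked with a short sketch. The one_pass_of_batches helper is hypothetical and only mimics the idea of grouping shuffled items into batches. It isn’t how the CNTK reader is implemented.

```python
import random

n_items = 308
batch_size = 11
max_iter = 50000

batches_per_pass = n_items // batch_size  # batches in one pass over the data
passes = max_iter / batches_per_pass      # passes made in max_iter updates

def one_pass_of_batches(n_items, batch_size, rnd):
    """Shuffle item indices, then slice them into consecutive batches."""
    indices = list(range(n_items))
    rnd.shuffle(indices)
    return [indices[i:i + batch_size]
            for i in range(0, n_items, batch_size)]

batches = one_pass_of_batches(n_items, batch_size, random.Random(0))
print("batches per pass =", batches_per_pass)
print("approximate passes = %0.1f" % passes)
```

Because 308 divides evenly by 11, every batch is full. With a batch size that doesn’t divide the item count, the final batch of a pass in this sketch would be smaller than the rest.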

The demo calls the program-defined create_reader function to, well, create a reader object. And an input_map is created that tells the reader where the feature values are and where the value-to-predict is:

```python
rdr = create_reader(train_file, input_dim, output_dim,
  rnd_order=True, sweeps=C.io.INFINITELY_REPEAT)
hydro_input_map = {
  X : rdr.streams.x_src,
  Y : rdr.streams.y_src
}
```

Setting the rnd_order parameter to True ensures that the data items will be processed in a different order on each pass, which is important to prevent training from stalling out. The INFINITELY_REPEAT argument allows training over multiple passes through the 308-item data set.

After preparation, the model is trained like so:

```python
for i in range(0, max_iter):
  curr_batch = rdr.next_minibatch(batch_size,
    input_map=hydro_input_map)
  trainer.train_minibatch(curr_batch)
  if i % int(max_iter/10) == 0:
    mcee = trainer.previous_minibatch_loss_average
    print("batch %6d: mean squared error = %8.4f" % \
      (i, mcee))
```

The next_minibatch function pulls 11 items from the data. The train_minibatch function uses the Adam algorithm to update weights and biases based on squared error between computed hull resistance values and actual resistance values. The mean squared error on the current 11-item batch is displayed every 50,000 / 10 = 5,000 batches so you can visually monitor training progress. You want to see loss/error values that generally decrease.

## Using the Model

After the model has been trained, the demo program makes some predictions. First, the predictor values for two arbitrary items from the normalized data set are selected (items 99 and 238) and placed into an array-of-arrays style matrix:

```python
inpts = np.array(
  [[0.520000, 0.785714, 0.550000, 0.405512,
    0.648352, 0.000000],
   [1.000000, 1.000000, 0.550000, 0.562992,
    0.461538, 1.000000]],
  dtype=np.float32)
```

Next, the corresponding normalized actual hull resistance values are placed into an array named actuals, which **Figure 3** defines as np.array([0.003044, 0.825028], dtype=np.float32).

Then, the predictor values are used to compute the predicted values using the model.eval function, and predicted and actual values are displayed:

```python
for i in range(len(inpts)):
  print("\nInput: ", inpts[i])
  pred = model.eval(inpts[i])
  print("predicted resistance: %0.4f" % pred[0][0])
  print("actual resistance: %0.4f" % actuals[i])
print("End yacht hull regression ")
```

Notice that the predicted hull resistance value is returned as an array-of-arrays matrix with a single value. Therefore, the value itself is at [0][0] (row 0, column 0). Dealing with shapes of CNTK vectors and matrices is a significant syntax challenge. When working with CNTK I spend a lot of time printing objects and displaying their shape, along the lines of print(something.shape).

## Wrapping Up

When creating a neural network regression model, there’s no predefined accuracy metric. If you want to compute prediction accuracy you must define what it means for a predicted value to be close enough to the corresponding actual value in order to be considered correct. Typically, you’d specify a percentage/proportion, such as 0.10, and evaluate a predicted value as correct if it’s within that percentage of the actual value.
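One possible way to code such a metric is sketched below. The accuracy function and its pct_close parameter name are my own, and the two test items are the predictions reported for the demo run.

```python
# Sketch of a program-defined accuracy metric for regression.
# accuracy and pct_close are illustrative names.

def accuracy(predictions, actuals, pct_close=0.10):
    """Fraction of predictions within pct_close of the actual value."""
    n_correct = 0
    for p, a in zip(predictions, actuals):
        if abs(p - a) <= abs(pct_close * a):
            n_correct += 1
    return n_correct / len(actuals)

# Predicted and actual hull resistance values from the demo run.
preds = [0.0078, 0.8125]
acts = [0.0030, 0.8250]
acc = accuracy(preds, acts)
print("accuracy = %0.2f" % acc)
```

With a 10 percent threshold, the second item counts as correct but the first doesn’t, even though its absolute error is tiny. A relative threshold is harsh on actual values near zero, which is one reason the definition of “close enough” has to be chosen with the problem in mind.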

Because the demo model works with normalized data, if you use the model to make a prediction for new, previously unseen predictor values, you have to normalize them using the same min-max values that were used on the training data. Similarly, a predicted hull resistance value, pv, is normalized, so you’d have to de-normalize by computing pv * (max - min) + min.
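Both directions can be sketched as a pair of small helper functions. The function names are assumptions. The prismatic coefficient bounds (0.53 and 0.60) come from earlier in this article, but the resistance bounds below are placeholders invented for illustration, so substitute the min and max values from your own raw data.

```python
# Sketch of normalizing a new input and de-normalizing a prediction.
# normalize and denormalize are illustrative names.

def normalize(v, lo, hi):
    """Min-max normalize a raw value using the training data min and max."""
    return (v - lo) / (hi - lo)

def denormalize(pv, lo, hi):
    """Convert a normalized value back to raw units."""
    return pv * (hi - lo) + lo

# Prismatic coefficient min and max, as given in the article.
pc_min, pc_max = 0.53, 0.60
new_pc = normalize(0.568, pc_min, pc_max)
print("normalized prismatic coefficient = %0.4f" % new_pc)

# HYPOTHETICAL resistance min and max: placeholders, not the real range.
res_min, res_max = 0.0, 50.0
raw_resistance = denormalize(0.8125, res_min, res_max)
print("de-normalized resistance (placeholder range) = %0.3f" % raw_resistance)
```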

The term “regression” can have several different meanings. In this article the term refers to a problem scenario where the goal is to predict a single numeric value (hull resistance). The classical statistics linear regression technique is much simpler than neural network regression, but usually much less accurate. The machine learning logistic regression technique predicts a single numeric value between 0.0 and 1.0, which is interpreted as a probability and then used to predict a categorical value such as “male” (p < 0.5) or “female” (p > 0.5).

**Dr. James McCaffrey** *works for Microsoft Research in Redmond, Wash. He has worked on several Microsoft products, including Internet Explorer and Bing. Dr. McCaffrey can be reached at jamccaff@microsoft.com.*
