01/04/2019

July 2018

Volume 33 Number 7

[Test Run]

Introduction to DNN Image Classification Using CNTK

James McCaffrey

Image classification involves determining what category an input image belongs to, for example identifying a photograph as one containing “apples” or “oranges” or “bananas.” The two most common approaches for image classification are using a standard deep neural network (DNN) or using a convolutional neural network (CNN). In this article I’ll explain the DNN approach, using the CNTK library.

Take a look at Figure 1 to see where this article is headed. The demo program creates an image classification model for a subset of the Modified National Institute of Standards and Technology (MNIST) dataset. The demo training dataset consists of 1,000 images of handwritten digits. Each image is 28 pixels high by 28 pixels wide (784 pixels) and represents a digit, 0 through 9.

Figure 1 Image Classification Using a DNN with CNTK

The demo program creates a standard neural network with 784 input nodes (one for each pixel), two hidden processing layers (each with 400 nodes) and 10 output nodes (one for each possible digit). The model is trained using 10,000 iterations. The loss (also known as training error) slowly decreases and the prediction accuracy slowly increases, indicating training is working.

After training completes, the demo applies the trained model to a test dataset of 100 items. The model’s accuracy is 84.00 percent, so 84 of the 100 test images were correctly classified.

This article assumes you have intermediate or better programming skill with a C-family language, but doesn’t assume you know much about CNTK or neural networks. The demo is coded using Python, but even if you don’t know Python, you should be able to follow along without too much difficulty. The code for the demo program is presented in its entirety in this article. The two data files used are available in the download that accompanies this article.

Understanding the Data

The full MNIST dataset consists of 60,000 images for training and 10,000 images for testing. Somewhat unusually, the training set is contained in two files, one that holds all the pixel values and one that holds the associated label values (0 through 9). The test images are also contained in two files.

Additionally, the four source files are stored in a custom binary format. When working with deep neural networks, getting the data into a usable form is almost always time-consuming and difficult. Figure 2 shows the contents of the first training image. The key point is that each image has 784 pixels, and each pixel is a value between 00h (0 decimal) and FFh (255 decimal).

Figure 2 An MNIST Image

Before writing the demo program, I wrote a utility program to read the binary source files and write a subset of their contents to text files that can be easily consumed by a CNTK reader object. File mnist_train_1000_cntk.txt looks like:

|digit 0 0 0 0 0 1 0 0 0 0 |pixels 0 .. 170 52 .. 0
|digit 0 1 0 0 0 0 0 0 0 0 |pixels 0 .. 254 66 .. 0
etc.

Getting the raw MNIST binary data into CNTK format isn’t trivial. The source code for my utility program can be found at: bit.ly/2ErcCbw.

There are 1,000 lines of data and each represents one image. The tags “|digit” and “|pixels” indicate the start of the value-to-predict and the predictor values. The digit label is one-hot encoded where the position of the 1 bit (counting from 0) indicates the digit. Therefore, in the two data lines shown, the first two images represent a “5” and a “1.” Each line of data has 784 pixel values, each of which is between 0 and 255. File mnist_test_100_cntk.txt has 100 images and uses the same CNTK-friendly format.
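To make the format concrete, here is a minimal plain-Python sketch of one-hot encoding a label and parsing a line in this layout. The helper names one_hot and parse_ctf_line, and the shortened four-pixel sample line, are my own illustrations; they are not part of the demo program or of CNTK.

```python
def one_hot(digit, num_classes=10):
    """Return a list with a 1 at index `digit` and 0 elsewhere."""
    return [1 if i == digit else 0 for i in range(num_classes)]

def parse_ctf_line(line):
    """Split a '|digit ... |pixels ...' line into (digit, pixels)."""
    fields = {}
    for chunk in line.strip().split("|"):
        if not chunk.strip():
            continue  # skip the empty piece before the first '|'
        name, *values = chunk.split()
        fields[name] = [int(v) for v in values]
    digit = fields["digit"].index(1)  # position of the 1 is the label
    return digit, fields["pixels"]

# Shortened, made-up line: a real line has 784 pixel values.
sample = "|digit 0 0 0 0 0 1 0 0 0 0 |pixels 0 170 52 0"
digit, pixels = parse_ctf_line(sample)
print(digit, pixels)
```

In the demo itself this parsing is handled by the CNTK reader, so a helper like this is mainly useful for spot-checking a data file by hand.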

In most neural network problems, you want to normalize the predictor values. Instead of directly normalizing the pixel values in the data files, the demo program normalizes the data on the fly, as you’ll see shortly.

The Demo Program

The complete demo program, with a few minor edits to save space, is presented in Figure 3. All normal error checking has been removed. I indent with two space characters instead of the usual four to save space. Note that the “\” character is used by Python for line continuation.

Figure 3 Complete Demo Program Listing

# mnist_dnn.py
# MNIST using a 2-hidden layer DNN (not a CNN)
# Anaconda 4.1.1 (Python 3.5.2), CNTK 2.4
import numpy as np
import cntk as C
def create_reader(path, input_dim, output_dim, rnd_order, m_swps):
  x_strm = C.io.StreamDef(field='pixels', shape=input_dim,
    is_sparse=False)
  y_strm = C.io.StreamDef(field='digit', shape=output_dim,
    is_sparse=False)
  streams = C.io.StreamDefs(x_src=x_strm, y_src=y_strm)
  deserial = C.io.CTFDeserializer(path, streams)
  mb_src = C.io.MinibatchSource(deserial, randomize=rnd_order,
    max_sweeps=m_swps)
  return mb_src
# ===================================================================
def main():
  print("\nBegin MNIST classification using a DNN \n")
  train_file = ".\\Data\\mnist_train_1000_cntk.txt"
  test_file  = ".\\Data\\mnist_test_100_cntk.txt"
  C.cntk_py.set_fixed_random_seed(1)
  input_dim = 784  # 28 x 28 pixels
  hidden_dim = 400
  output_dim = 10  # 0 to 9
  X = C.ops.input_variable(input_dim, dtype=np.float32)
  Y = C.ops.input_variable(output_dim)  # float32 is default
  print("Creating a 784-(400-400)-10 ReLU classifier")
  with C.layers.default_options(init=\
    C.initializer.uniform(scale=0.01)):
    h_layer1 = C.layers.Dense(hidden_dim, activation=C.ops.relu,
      name='hidLayer1')(X/255) 
    h_layer2 = C.layers.Dense(hidden_dim, activation=C.ops.relu,
      name='hidLayer2')(h_layer1)
    o_layer = C.layers.Dense(output_dim, activation=None,
      name='outLayer')(h_layer2)
  dnn = o_layer               # train this
  model = C.ops.softmax(dnn)  # use for prediction
  tr_loss = C.cross_entropy_with_softmax(dnn, Y)
  tr_eror = C.classification_error(dnn, Y)
  max_iter = 10000   # num batches, not epochs
  batch_size = 50   
  learn_rate = 0.01
  learner = C.sgd(dnn.parameters, learn_rate)
  trainer = C.Trainer(dnn, (tr_loss, tr_eror), [learner]) 
  # 3. create reader for train data
  rdr = create_reader(train_file, input_dim, output_dim,
    rnd_order=True, m_swps=C.io.INFINITELY_REPEAT)
  mnist_input_map = {
    X : rdr.streams.x_src,
    Y : rdr.streams.y_src
  } 
  # 4. train
  print("\nStarting training \n")
  for i in range(0, max_iter):
    curr_batch = rdr.next_minibatch(batch_size, \
      input_map=mnist_input_map)
    trainer.train_minibatch(curr_batch)
    if i % int(max_iter/10) == 0:
      mcee = trainer.previous_minibatch_loss_average
      macc = (1.0 - trainer.previous_minibatch_evaluation_average) \
        * 100
      print("batch %4d: mean loss = %0.4f, accuracy = %0.2f%% " \
        % (i, mcee, macc))
  print("\nTraining complete \n")
  # 5. evaluate model on test data
  rdr = create_reader(test_file, input_dim, output_dim,
    rnd_order=False, m_swps=1)
  mnist_input_map = {
    X : rdr.streams.x_src,
    Y : rdr.streams.y_src
  }
  num_test = 100
  test_mb = rdr.next_minibatch(num_test, input_map=mnist_input_map)
  test_acc = (1.0 - trainer.test_minibatch(test_mb)) * 100
  print("Model accuracy on the %d test items = %0.2f%%" \
    % (num_test,test_acc)) 
  print("\nEnd MNIST classification using a DNN \n")
if __name__ == "__main__":
  main()

The mnist_dnn.py demo has one helper function, create_reader. All control logic is in the single main function. Because CNTK is young and under continuous development, it’s a good idea to add a comment detailing which version is being used (2.4 in this case).

Installing CNTK can be a bit tricky if you’re new to the Python world. First you install an Anaconda distribution of Python, which contains the required Python interpreter, the necessary packages such as NumPy and SciPy, and useful utilities such as pip. I used Anaconda3 4.1.1 64-bit, which includes Python 3.5. After installing Anaconda, you install CNTK as a Python package, not a standalone system, using the pip utility. From an ordinary shell, the command I used was:

>pip install https://cntk.ai/PythonWheel/CPU-Only/cntk-2.4-cp35-cp35m-win_amd64.whl

Note the “cp35” in the wheel file that indicates the file is for use with Python 3.5. Be careful; almost all the CNTK installation failures I’ve seen have been due to Anaconda-CNTK version incompatibilities.
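One quick way to see which wheel tag matches your interpreter is to build the tag from the running Python version. This is only a small sketch: the variable name wheel_tag is my own, and it covers just the CPython “cpXY” part of the wheel file name, not the platform part.

```python
import sys

# Build the CPython tag, for example "cp35" for Python 3.5.
wheel_tag = "cp%d%d" % (sys.version_info.major, sys.version_info.minor)
print("Look for a wheel whose name contains:", wheel_tag)
```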

The signature of the reader function is create_reader(path, input_dim, output_dim, rnd_order, m_swps). The path parameter points to a training or test file that’s in CNTK format. The rnd_order parameter is a Boolean flag that will be set to True for training data because you want to process training data in random order to prevent oscillating without making training progress. The parameter will be set to False when reading test data to evaluate model accuracy because order isn’t important then. The m_swps parameter (“maximum sweeps”) will be set to the constant INFINITELY_REPEAT for training data (so it can be processed repeatedly) and set to 1 for test data evaluation.
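The effect of the rnd_order and m_swps parameters can be pictured with a plain-Python generator. This is a conceptual sketch only, not how CNTK implements its reader: the function name minibatches and its parameters are hypothetical, and passing max_sweeps=None stands in for INFINITELY_REPEAT.

```python
import random

def minibatches(items, batch_size, randomize, max_sweeps=None, seed=1):
    """Yield lists of up to batch_size items.

    max_sweeps=None means keep sweeping through the data forever.
    """
    rng = random.Random(seed)
    sweep = 0
    while max_sweeps is None or sweep < max_sweeps:
        order = list(items)
        if randomize:
            rng.shuffle(order)  # new random order on every sweep
        for start in range(0, len(order), batch_size):
            yield order[start:start + batch_size]
        sweep += 1

# Test-style read: fixed order, one pass through the data.
test_batches = list(minibatches(range(5), 2, randomize=False, max_sweeps=1))
print(test_batches)

# Training-style read: random order, endless; take just three batches.
train_gen = minibatches(range(5), 2, randomize=True)
train_batches = [next(train_gen) for _ in range(3)]
print(train_batches)
```

The training-style generator never runs out, which is why a training loop that uses it needs its own stopping rule, such as a fixed number of iterations.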

Creating the Model

The demo prepares a deep neural network with:

train_file = ".\\Data\\mnist_train_1000_cntk.txt"
test_file  = ".\\Data\\mnist_test_100_cntk.txt"
C.cntk_py.set_fixed_random_seed(1)
input_dim = 784
hidden_dim = 400
output_dim = 10
X = C.ops.input_variable(input_dim, dtype=np.float32)
Y = C.ops.input_variable(output_dim)  # float32 is default

It’s usually a good idea to explicitly set the CNTK global random number seed so your results will be reproducible. The number of input and output nodes is determined by your data, but the number of hidden processing nodes is a free parameter and must be determined by trial and error. Using 32-bit variables is the default for CNTK and is typical for neural networks because the precision gained by using 64 bits isn’t worth the performance penalty incurred.
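To get a feel for the size of a 784-(400-400)-10 network and for what 32-bit versus 64-bit storage means, the following sketch counts the weights and biases. It assumes every hidden and output node has one bias value, and the variable names are mine rather than the demo’s.

```python
layer_sizes = [784, 400, 400, 10]  # input, hidden, hidden, output

# One weight per connection between adjacent layers.
num_weights = sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))
# One bias per non-input node (an assumption stated above).
num_biases = sum(layer_sizes[1:])
num_params = num_weights + num_biases

print("weights:", num_weights)
print("biases: ", num_biases)
print("float32 storage (bytes):", num_params * 4)
print("float64 storage (bytes):", num_params * 8)
```

Doubling the width of each value doubles the storage for the same network, which is part of the cost referred to above.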

The network is created like so:

with C.layers.default_options(init=
  C.initializer.uniform(scale=0.01)):
  h_layer1 = C.layers.Dense(hidden_dim,
    activation=C.ops.relu, name='hidLayer1')(X/255) 
  h_layer2 = C.layers.Dense(hidden_dim,
    activation=C.ops.relu, name='hidLayer2')(h_layer1)
  o_layer = C.layers.Dense(output_dim, activation=None,
    name='outLayer')(h_layer2)
dnn = o_layer               # train this
model = C.ops.softmax(dnn)  # use for prediction

The Python with statement here sets up a scope in which a set of default arguments applies to every layer created inside the block. In this case it’s used to initialize all network weights to random values between -0.01 and +0.01. The X object holds the 784 input values for an image. Notice that each value is normalized by dividing by 255 so the actual input values will be in the range [0.0, 1.0].
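The two ideas in this paragraph, small uniform random initial weights and dividing pixels by 255, can be illustrated without CNTK. The sketch below uses Python’s random module as a stand-in for the CNTK initializer, and the tiny 3-by-4 weight matrix and four-pixel input are made up for illustration.

```python
import random

rng = random.Random(1)  # fixed seed so the values are reproducible
scale = 0.01

# A tiny layer's worth of weights, drawn uniformly from [-scale, +scale].
weights = [[rng.uniform(-scale, scale) for _ in range(4)] for _ in range(3)]

# Raw pixel values (0 to 255) scaled into the range [0.0, 1.0].
raw_pixels = [0, 170, 52, 255]
normalized = [p / 255 for p in raw_pixels]
print(normalized)
```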

The normalized input values act as input to the first hidden layer. The outputs of the first hidden layer act as inputs to the second hidden layer. Then, the outputs of the second hidden layer are sent to the output layer. The two hidden layers use ReLU (rectified linear units) activation, which, for image classification, often works better than standard tanh activation.

Notice that there’s no activation applied to the output nodes. This is a quirk of CNTK: the cross_entropy_with_softmax training function applies softmax itself, so it expects raw, un-activated values. The dnn object is just a convenience alias. The model object has softmax activation so it can be used after training to make predictions. Because the model object is built on top of the dnn object and shares its weights and biases, training the dnn object also trains the model object.
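The difference between raw output values and softmax output values is easy to see in plain Python. In this sketch the relu and softmax functions are simple stand-ins for the CNTK operations, and the raw_outputs values are made up.

```python
import math

def relu(values):
    """Rectified linear unit: negative values become 0."""
    return [max(0.0, v) for v in values]

def softmax(values):
    """Scale values so they are positive and sum to 1."""
    m = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

raw_outputs = [1.5, -0.3, 4.0, 0.2]  # made-up un-activated output values
probs = softmax(raw_outputs)

print(relu([-2.0, 0.0, 3.5]))
print(sum(probs))
print(raw_outputs.index(max(raw_outputs)), probs.index(max(probs)))
```

Softmax preserves the ordering of the values, so the largest raw output and the largest softmax output are at the same index. The softmax version is simply easier to interpret.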

Training the Neural Network

The neural network is prepared for training with:

tr_loss = C.cross_entropy_with_softmax(dnn, Y)
tr_eror = C.classification_error(dnn, Y)
max_iter = 10000 
batch_size = 50   
learn_rate = 0.01
learner = C.sgd(dnn.parameters, learn_rate)
trainer = C.Trainer(dnn, (tr_loss, tr_eror), [learner])

The training loss (tr_loss) object tells CNTK how to measure error when training. The cross-entropy error is usually the best choice for classification problems. The training classification error (tr_eror) object can be used to automatically compute the percentage of incorrect predictions during training or after training. Specifying a loss function is required, but specifying a classification error function is optional.
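To show what these two measures compute, here is a plain-Python sketch for one-hot targets. The function names softmax_cross_entropy and error_rate are my own stand-ins, not the CNTK functions, and the sample outputs and targets are made up.

```python
import math

def softmax_cross_entropy(raw_outputs, one_hot_target):
    """Negative log of the softmax probability of the correct class."""
    m = max(raw_outputs)
    log_sum = m + math.log(sum(math.exp(v - m) for v in raw_outputs))
    correct = one_hot_target.index(1)
    return log_sum - raw_outputs[correct]

def error_rate(batch_outputs, batch_targets):
    """Fraction of items whose largest output is not the labeled class."""
    wrong = 0
    for outputs, target in zip(batch_outputs, batch_targets):
        if outputs.index(max(outputs)) != target.index(1):
            wrong += 1
    return wrong / len(batch_outputs)

outputs = [[2.0, 0.5, 0.1], [0.2, 0.1, 3.0], [1.0, 1.5, 0.3]]
targets = [[1, 0, 0], [0, 0, 1], [1, 0, 0]]

# Ten equal raw outputs: the network is guessing, so loss is ln(10).
uniform_loss = softmax_cross_entropy([0.0] * 10, [0, 0, 0, 0, 0, 1, 0, 0, 0, 0])
err = error_rate(outputs, targets)
print(uniform_loss)
print(err)
```

A 10-class network that has learned nothing gives a loss near ln(10), about 2.30, which is a handy reference point when reading the loss values printed during training.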

The values for the maximum number of training iterations, the number of items in a batch to train at a time, and the learning rate are all free parameters that must be determined by trial and error. You can think of the learner object as an algorithm, and the trainer object as the object that uses the learner to find good values for the neural network’s weights and biases. The stochastic gradient descent (sgd) learner is the most primitive algorithm but works well for simple problems. Alternatives include adaptive moment estimation (adam) and root mean square propagation (rmsprop).
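The basic stochastic gradient descent update is simple enough to sketch in a few lines. The example below applies it to a toy one-weight problem rather than to a neural network; the function name sgd_step and the toy problem are illustrative assumptions.

```python
def sgd_step(weights, gradients, learn_rate):
    """Basic SGD update: move each weight against its gradient."""
    return [w - learn_rate * g for w, g in zip(weights, gradients)]

# Toy problem: minimize (w - 3)^2, whose gradient is 2 * (w - 3).
w = [0.0]
for _ in range(100):
    grad = [2.0 * (w[0] - 3.0)]
    w = sgd_step(w, grad, learn_rate=0.1)
print(w[0])
```

In a real network the gradients come from back-propagation over a batch of training items, but the update applied to each weight and bias has this same form.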

A reader object for the training data is created with these statements:

rdr = create_reader(train_file, input_dim, output_dim,
  rnd_order=True, m_swps=C.io.INFINITELY_REPEAT)
mnist_input_map = {
  X : rdr.streams.x_src,
  Y : rdr.streams.y_src
}

If you examine the create_reader code in Figure 3, you’ll see that it specifies the tag names (“pixels” and “digit”) used in the data file. You can consider create_reader and the code to create a reader object as boilerplate code for DNN image classification problems. All you have to change is the tag names, and the name of the mapping dictionary (mnist_input_map).

After everything is prepared, training is performed, as shown in Figure 4.

Figure 4 Training

print("\nStarting training \n")
for i in range(0, max_iter):
  curr_batch = rdr.next_minibatch(batch_size, \
    input_map=mnist_input_map)
  trainer.train_minibatch(curr_batch)
  if i % int(max_iter/10) == 0:
    mcee = trainer.previous_minibatch_loss_average
    macc = (1.0 - \
      trainer.previous_minibatch_evaluation_average) \
        * 100
    print("batch %4d: mean loss = %0.4f, accuracy = %0.2f%% " \
      % (i, mcee, macc))

The demo program is designed so that each iteration processes one batch of training items. Many neural network libraries use the term “epoch” to refer to one pass through all training items. In this example, because there are 1,000 training items, and the batch size is set to 50, one epoch would be 20 iterations.
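The relationship between iterations and epochs for the demo’s settings works out as follows. The variable names iters_per_epoch, num_epochs and print_interval are mine; the three input values come from the demo.

```python
num_train_items = 1000
batch_size = 50
max_iter = 10000

iters_per_epoch = num_train_items // batch_size  # batches in one pass
num_epochs = max_iter // iters_per_epoch         # passes over the data
print_interval = int(max_iter / 10)              # progress message spacing

print(iters_per_epoch, num_epochs, print_interval)
```

So the demo’s 10,000 iterations correspond to roughly 500 passes through the 1,000 training items, with a progress message every 1,000 iterations.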

An alternative to training with a fixed number of iterations is to stop training when loss/error drops below some threshold. It’s important to display loss/error during training because training failure is the rule rather than the exception. Cross-entropy error is difficult to interpret directly, but you want to see values that tend to get smaller. Instead of displaying average classification error (“25 percent wrong”), the demo computes and prints the average classification accuracy (“75 percent correct”), which is a more natural metric in my opinion.
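The threshold-based stopping idea might be sketched like this. The function train_until is hypothetical, and the list of loss values is made up to stand in for what trainer.previous_minibatch_loss_average would report after each batch.

```python
def train_until(losses, threshold, max_iter):
    """Return the iteration at which loss first drops below threshold,
    or max_iter if that never happens. `losses` stands in for the
    per-batch loss a real training loop would produce."""
    for i, loss in enumerate(losses):
        if i >= max_iter:
            break
        if loss < threshold:
            return i
    return max_iter

# Made-up loss values that shrink over time.
fake_losses = [2.30, 1.80, 1.10, 0.60, 0.28, 0.15]
stopped_at = train_until(fake_losses, threshold=0.30, max_iter=10000)
print(stopped_at)
```

Because the loss on a single batch is noisy, a practical version would probably compare an average over several recent batches against the threshold, and would keep the iteration limit as a safety net.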

Evaluating and Using the Model

After an image classifier has been trained, you’ll usually want to evaluate the trained model on test data that has been held out. The demo computes classification accuracy as shown in Figure 5.

Figure 5 Computing Classification Accuracy

rdr = create_reader(test_file, input_dim, output_dim,
  rnd_order=False, m_swps=1)
mnist_input_map = {
  X : rdr.streams.x_src,
  Y : rdr.streams.y_src
}
num_test = 100
test_mb = rdr.next_minibatch(num_test,
  input_map=mnist_input_map)
test_acc = (1.0 - trainer.test_minibatch(test_mb)) * 100
print("Model accuracy on the %d test items = %0.2f%%" \
  % (num_test,test_acc))

A new data reader is created. Notice that unlike the reader used for training, the new reader doesn’t traverse the data in random order, and that the number of sweeps is set to 1. The mnist_input_map dictionary object is recreated. A common mistake is to try to reuse the original mapping, but because the rdr object now refers to a different reader, you need to recreate the mapping. The test_minibatch function returns the average classification error for its mini-batch argument, which in this case is the entire 100-item test set.

After training, or during training, you’ll usually want to save the model. In CNTK, saving would look like:

mdl_name = ".\\Models\\mnist_dnn.model"
model.save(mdl_name)

This would save using the default CNTK v2 format. An alternative is to use the Open Neural Network Exchange (ONNX) format. Notice that you’ll generally want to save the model object (with softmax activation) rather than the dnn object (no output activation). From a different program, a saved model could be loaded into memory along the lines of:

mdl_name = ".\\Models\\mnist_dnn.model"
model = C.ops.functions.Function.load(mdl_name)

After loading, the model can be used as if it had just been trained. The demo program doesn’t use the trained model to make a prediction. Prediction code could resemble this:

input_list = [0.55] * 784  # [0.55, 0.55, . . 0.55]
input_vec = np.array(input_list, dtype=np.float32)
pred_probs = model.eval(input_vec)
pred_digit = np.argmax(pred_probs)
print(pred_digit)

The input_list has a dummy input of 784 pixel values, each with value 0.55. Note that in this demo the division by 255 is part of the network itself (the X/255 term), so the model normalizes whatever it’s given; for a real image you’d feed in raw pixel values between 0 and 255 rather than values that have already been normalized. The pixel values are copied into a NumPy array. The call to the eval function would return an array of 10 values that sum to 1.0 and can loosely be interpreted as probabilities. The argmax function returns the index (0 through 9) of the largest value, which is conveniently the same as the predicted digit. Neat!

Wrapping Up

Using a deep neural network used to be the most common approach for simple image classification. However, DNNs have at least two key limitations. First, DNNs don’t scale well to images that have a huge number of pixels. Second, DNNs don’t explicitly take into account the geometry of image pixels. For example, in an MNIST image, a pixel that’s directly below a second pixel is 28 positions away from the first pixel in the input file.

Because of these limitations, and for other reasons, too, the use of a convolutional neural network (CNN) is now more common for image classification. That said, for simple image classification tasks, using a DNN is easier and often just as (or even more) effective than using a CNN.
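The geometry limitation mentioned above comes down to index arithmetic. The sketch below assumes the 784 pixel values are stored row by row; the function name flat_index is my own.

```python
WIDTH = 28  # MNIST images are 28 x 28

def flat_index(row, col, width=WIDTH):
    """Position of pixel (row, col) in the flattened 784-value vector."""
    return row * width + col

above = flat_index(10, 5)
below = flat_index(11, 5)  # same column, one row down
right = flat_index(10, 6)  # same row, one column over
print(below - above, right - above)
```

Two pixels that touch vertically end up 28 positions apart in the input, while two that touch horizontally are adjacent, and nothing in a DNN’s input layer records that both pairs are neighbors.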

Dr. James McCaffrey works for Microsoft Research in Redmond, Wash. He has worked on several Microsoft products, including Internet Explorer and Bing. Dr. McCaffrey can be reached at jamccaff@microsoft.com.

Thanks to the following Microsoft technical experts who reviewed this article: Chris Lee, Ricky Loynd, Ken Tran
