October 2019

Volume 34 Number 10

[Test Run]

Neural Binary Classification Using PyTorch

By James McCaffrey

The goal of a binary classification problem is to make a prediction where the result can be one of just two possible categorical values. For example, you might want to predict the sex (male or female) of a person based on their age, annual income and so on. Somewhat surprisingly, binary classification problems require a slightly different set of techniques than classification problems where the value to predict can be one of three or more possible values.

There are many different binary classification algorithms. In this article I’ll demonstrate how to perform binary classification using a deep neural network with the PyTorch code library. The best way to understand where this article is headed is to take a look at the demo program in Figure 1.

Figure 1 Binary Classification Using PyTorch

The demo program creates a prediction model on the Banknote Authentication dataset. The problem is to predict whether a banknote (think dollar bill or euro) is authentic or a forgery, based on four predictor variables. The demo loads a training subset into memory, then creates a 4-(8-8)-1 deep neural network.

After training for 100 iterations, the resulting model scores 98.18 percent accuracy on a held-out test dataset. The demo concludes by making a prediction for a hypothetical, previously unseen banknote. The probability that the unknown item is a forgery is only 0.0215, so the conclusion is that the banknote is authentic.

This article assumes you have intermediate or better programming skills with a C-family language and a basic familiarity with machine learning, but doesn’t assume you know anything about binary classification using PyTorch. All of the demo code is presented in this article. The code and the two data files used by the demo are available in the accompanying download. All normal error checking has been removed to keep the main ideas as clear as possible.

Installing PyTorch

PyTorch is a relatively low-level code library for creating neural networks. It’s roughly similar in terms of functionality to TensorFlow and CNTK. PyTorch is written in C++, but has a Python language API for easier programming.

Installing PyTorch involves two main steps. First, you install Python and several required auxiliary packages, such as NumPy and SciPy. Second, you install PyTorch as a Python add-on package. Although it’s possible to install Python and the packages required to run PyTorch separately, in most cases it’s much better to install a Python distribution. A distribution is a collection of code libraries containing the base Python interpreter and additional packages that are compatible with each other. For my demo, I installed the Anaconda3 5.2.0 distribution, which contains Python 3.6.5.

After installing Anaconda, I went to the pytorch.org Web site and selected the options for the Windows OS, pip installer, Python 3.6 and no-GPU version. This gave me a URL that pointed to the corresponding .whl (pronounced “wheel”) file, which I downloaded to my local machine. I downloaded PyTorch version 1.0.0. (If you’re new to the Python ecosystem, you can think of a Python .whl file as somewhat similar to a Windows .msi file.) I opened a command shell, navigated to the directory holding the .whl file and entered the command:

pip install torch-1.0.0-cp36-cp36m-win_amd64.whl

Understanding the Data

The Banknote Authentication dataset has 1,372 items. The raw data looks like:

3.6216, 8.6661, -2.8073, -0.44699, 0
4.5459, 8.1674, -2.4586, -1.4621, 0
...
-2.5419, -0.65804, 2.6842, 1.1952, 1

The first four values on each line are the predictor values. The last value on each line is either 0 (authentic) or 1 (forgery). The predictor values are from a digitized image of each banknote and include variance, skewness, kurtosis and entropy. All the predictors are numeric. If the data had a categorical predictor such as color, those values could’ve been converted to numeric values using either 1-of-(N-1) or one-hot encoding.
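
For example, if the data had included a color predictor with possible values red, green and blue, one-hot encoding could be applied with a minimal sketch like the following (the color variable and its values are hypothetical, not part of the Banknote data):

color_to_onehot = {
  "red":   [1, 0, 0],
  "green": [0, 1, 0],
  "blue":  [0, 0, 1] }
raw_colors = ["red", "blue", "green"]
encoded = [color_to_onehot[c] for c in raw_colors]
print(encoded)  # [[1, 0, 0], [0, 0, 1], [0, 1, 0]]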

Because there are four predictor variables, it isn’t possible to easily visualize the dataset, but you can get a rough idea of the data from the graph in Figure 2. The graph shows the kurtosis and entropy values for the first 100 of the 1,372 data items. Notice that simple linear prediction algorithms would likely perform poorly on this data because it isn’t linearly separable.

Figure 2 Partial Banknote Authentication Data

The first step in preparing the raw data is to randomly split the dataset into a training set and a test set. I used 80 percent of the items (1,097) for training and the remaining 20 percent (275) for testing. Next, when using a neural network, it’s advisable to normalize numeric predictors so that values with large magnitudes don’t overwhelm values with small magnitudes. I used min-max normalization on the four predictor variables in the training set.

For each predictor column, I computed the min value and the max value, and then for every value x, normalized as (x - min) / (max - min). After min-max normalization, all values will be between 0.0 and 1.0, where 0.0 maps to the smallest value, and 1.0 maps to the largest value. I saved the min-max values for each column and then normalized the test data using those values. Note that you should normalize test data using the training set min-max values rather than normalize each dataset independently.
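
As a minimal sketch of this approach (the data values here are made up for illustration; the demo computes the min-max values from the actual training file), the normalization can be done with NumPy like so:

import numpy as np
train = np.array([[4.0, 10.0], [2.0, 20.0], [6.0, 30.0]], dtype=np.float32)
test = np.array([[3.0, 25.0]], dtype=np.float32)
mins = train.min(axis=0)  # per-column min values from training data only
maxs = train.max(axis=0)  # per-column max values from training data only
train_norm = (train - mins) / (maxs - mins)
test_norm = (test - mins) / (maxs - mins)  # reuse the training min-max values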

During normalization I replaced the comma separators used in the raw data by tab characters. I saved the training and test data in a subdirectory named Data. The demo program code that loads the two datasets into memory is:

train_file = ".\\Data\\banknote_norm_train.txt"
test_file = ".\\Data\\banknote_norm_test.txt"
train_x = np.loadtxt(train_file, delimiter='\t',
  usecols=[0,1,2,3], dtype=np.float32)
train_y = np.loadtxt(train_file, delimiter='\t',
  usecols=[4], dtype=np.float32, ndmin=2)
test_x = np.loadtxt(test_file, delimiter='\t',
  usecols=[0,1,2,3], dtype=np.float32)
test_y = np.loadtxt(test_file, delimiter='\t',
  usecols=[4], dtype=np.float32, ndmin=2)

Notice that PyTorch wants the Y data (authentic or forgery) in a two-dimensional array, even when the data is conceptually one-dimensional (a vector of 0 and 1 values). The default data type for PyTorch neural networks is 32-bit floating point, because the precision gained by using 64-bit values usually isn’t worth the memory and performance penalty incurred.
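
For example, without the ndmin=2 argument, loadtxt returns a one-dimensional array, which wouldn’t match the shape of the network output. This quick shape check (illustrative only, not part of the demo) shows the difference:

y1 = np.loadtxt(train_file, delimiter='\t',
  usecols=[4], dtype=np.float32)           # shape (1097,)
y2 = np.loadtxt(train_file, delimiter='\t',
  usecols=[4], dtype=np.float32, ndmin=2)  # shape (1097, 1)
print(y1.shape, y2.shape)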

The Demo Program

The complete demo program, with a few minor edits to save space, is presented in Figure 3. I indent with two spaces rather than the usual four spaces to save space. Note that Python uses the “\” character for line continuation. I used Notepad to edit my program. Most of my colleagues prefer a more sophisticated editor, but I like the raw simplicity of Notepad.

Figure 3 The Binary Classification Demo Program

# banknote_bnn.py
# Anaconda3 5.2.0 (Python 3.6.5), PyTorch 1.0.0
# raw data looks like:
#  4.5459, 8.1674, -2.4586, -1.4621, 0
#  0 = authentic, 1 = fake
import numpy as np
import torch as T
# ------------------------------------------------------------
class Batcher:
  def __init__(self, num_items, batch_size, seed=0):
    self.indices = np.arange(num_items)
    self.num_items = num_items
    self.batch_size = batch_size
    self.rnd = np.random.RandomState(seed)
    self.rnd.shuffle(self.indices)
    self.ptr = 0
  def __iter__(self):
    return self
  def __next__(self):
    if self.ptr + self.batch_size > self.num_items:
      self.rnd.shuffle(self.indices)
      self.ptr = 0
      raise StopIteration  # exit calling for-loop
    else:
      result = self.indices[self.ptr:self.ptr+self.batch_size]
      self.ptr += self.batch_size
      return result
# ------------------------------------------------------------
def akkuracy(model, data_x, data_y):
  # data_x and data_y are numpy array-of-arrays matrices
  X = T.Tensor(data_x)
  Y = T.ByteTensor(data_y)   # a Tensor of 0s and 1s
  oupt = model(X)            # a Tensor of floats
  pred_y = oupt >= 0.5       # a Tensor of 0s and 1s
  num_correct = T.sum(Y==pred_y)  # a Tensor
  acc = (num_correct.item() * 100.0 / len(data_y))  # scalar
  return acc
# ------------------------------------------------------------
class Net(T.nn.Module):
  def __init__(self):
    super(Net, self).__init__()
    self.hid1 = T.nn.Linear(4, 8)  # 4-(8-8)-1
    self.hid2 = T.nn.Linear(8, 8)
    self.oupt = T.nn.Linear(8, 1)
    T.nn.init.xavier_uniform_(self.hid1.weight)
    T.nn.init.zeros_(self.hid1.bias)
    T.nn.init.xavier_uniform_(self.hid2.weight)
    T.nn.init.zeros_(self.hid2.bias)
    T.nn.init.xavier_uniform_(self.oupt.weight)
    T.nn.init.zeros_(self.oupt.bias)
  def forward(self, x):
    z = T.tanh(self.hid1(x))
    z = T.tanh(self.hid2(z))
    z = T.sigmoid(self.oupt(z))  # necessary
    return z
# ------------------------------------------------------------
def main():
  # 0. get started
  print("\nBanknote authentication using PyTorch \n")
  T.manual_seed(1)
  np.random.seed(1)
  # 1. load data
  print("Loading Banknote data into memory \n")
  train_file = ".\\Data\\banknote_norm_train.txt"
  test_file = ".\\Data\\banknote_norm_test.txt"
  train_x = np.loadtxt(train_file, delimiter='\t',
    usecols=[0,1,2,3], dtype=np.float32)
  train_y = np.loadtxt(train_file, delimiter='\t',
    usecols=[4], dtype=np.float32, ndmin=2)
  test_x = np.loadtxt(test_file, delimiter='\t',
    usecols=[0,1,2,3], dtype=np.float32)
  test_y = np.loadtxt(test_file, delimiter='\t',
    usecols=[4], dtype=np.float32, ndmin=2)
  # 2. define model
  print("Creating 4-(8-8)-1 binary NN classifier \n")
  net = Net()
  # 3. train model
  net = net.train()  # set training mode
  lrn_rate = 0.01
  bat_size = 16
  loss_func = T.nn.BCELoss()  # binary cross entropy
  optimizer = T.optim.SGD(net.parameters(), lr=lrn_rate)
  max_epochs = 100
  n_items = len(train_x)
  batcher = Batcher(n_items, bat_size)
# ------------------------------------------------------------
  print("Starting training")
  for epoch in range(0, max_epochs):
    if epoch > 0 and epoch % (max_epochs/10) == 0:
      print("epoch = %6d" % epoch, end="")
      print("  batch loss = %7.4f" % loss_obj.item(), end="")
      acc = akkuracy(net, train_x, train_y)
      print("  accuracy = %0.2f%%" % acc)
    for curr_bat in batcher:
      X = T.Tensor(train_x[curr_bat])
      Y = T.Tensor(train_y[curr_bat])
      optimizer.zero_grad()
      oupt = net(X)
      loss_obj = loss_func(oupt, Y)
      loss_obj.backward()
      optimizer.step()
  print("Training complete \n")
  # 4. evaluate model
  net = net.eval()  # set eval mode
  acc = akkuracy(net, test_x, test_y)
  print("Accuracy on test data = %0.2f%%" % acc)
  # 5. save model
  print("Saving trained model \n")
  path = ".\\Models\\banknote_model.pth"
  T.save(net.state_dict(), path)
# ------------------------------------------------------------
  # 6. make a prediction
  train_min_max = np.array([
    [-7.0421, 6.8248],
    [-13.7731, 12.9516],
    [-5.2861, 17.9274],
    [-7.8719, 2.1625]], dtype=np.float32)
  unknown_raw = np.array([[1.2345, 2.3456, 3.4567, 4.5678]],
    dtype=np.float32)
  unknown_norm = np.zeros(shape=(1,4), dtype=np.float32)
  for i in range(4):
    x = unknown_raw[0][i]
    mn = train_min_max[i][0]  # min
    mx = train_min_max[i][1]  # max
    unknown_norm[0][i] = (x - mn) / (mx - mn)
  np.set_printoptions(precision=4)
  print("Making prediction for banknote: ")
  print(unknown_raw)
  print("Normalized to:")
  print(unknown_norm)
  unknown = T.Tensor(unknown_norm)  # to Tensor
  raw_out = net(unknown)       # a Tensor
  pred_prob = raw_out.item()   # scalar, [0.0, 1.0]
  print("\nPrediction prob = %0.4f " % pred_prob)
  if pred_prob < 0.5:
    print("Prediction = authentic")
  else:
    print("Prediction = forgery")
if __name__=="__main__":
  main()

The demo program starts by importing the NumPy and PyTorch packages and assigning shortcut aliases. An alternative to importing the entire PyTorch package is to import just the necessary modules, for example, import torch.optim as opt.

Defining the Neural Network Architecture

The demo defines a 4-(8-8)-1 neural network model with these statements:

class Net(T.nn.Module):
  def __init__(self):
    super(Net, self).__init__()
    self.hid1 = T.nn.Linear(4, 8)  # 4-(8-8)-1
    self.hid2 = T.nn.Linear(8, 8)
    self.oupt = T.nn.Linear(8, 1)
...

The number of input nodes, four in this case, is determined by the data. For binary classification, by far the most common approach is to use a single output node, where a value less than 0.5 maps to class zero (authentic) and a value of 0.5 or greater maps to class one (forgery). The number of hidden layers (two in the demo) and the number of nodes in each hidden layer (eight in the demo) are hyperparameters that must be determined by trial and error.

The demo code explicitly initializes the hidden node and output node weights using the Xavier Uniform (also known as Glorot Uniform) algorithm, and initializes the biases to zero. Explicit initialization could’ve been omitted and PyTorch’s built-in default scheme used instead. But in my opinion, it’s good practice to initialize explicitly, because the default initialization scheme could change in a future version of the library.

The demo code specifies the hidden layer and output layer activation functions in the forward function:

def forward(self, x):
  z = T.tanh(self.hid1(x))
  z = T.tanh(self.hid2(z))
  z = T.sigmoid(self.oupt(z))
  return z

For relatively shallow neural networks, the tanh activation function often works well for hidden layer nodes, but for deep neural networks, ReLU (rectified linear units) activation is generally preferred. The output node has logistic sigmoid activation, which forces the output value to be between 0.0 and 1.0.
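
For example, a ReLU variant of the demo forward function (not used in the demo) would look like this:

def forward(self, x):
  z = T.relu(self.hid1(x))     # ReLU instead of tanh
  z = T.relu(self.hid2(z))
  z = T.sigmoid(self.oupt(z))  # keep logistic sigmoid on the output node
  return z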

The demo program uses a program-defined class, Net, to define the layer architecture and the input-output mechanism. An alternative is to create the network by using the Sequential class, for example:

net = T.nn.Sequential(
  T.nn.Linear(4,8), T.nn.Tanh(),
  T.nn.Linear(8,8), T.nn.Tanh(),
  T.nn.Linear(8,1), T.nn.Sigmoid())

Because PyTorch works at a relatively low level of abstraction, there are several different ways to implement each part of a prediction system. This gives you a lot of flexibility, but increases the difficulty of trying to understand code examples.

Training the Model

Training the model/network is prepared with these eight statements:

net = net.train()  # set training mode
lrn_rate = 0.01
bat_size = 16
loss_func = T.nn.BCELoss()
optimizer = T.optim.SGD(net.parameters(), lr=lrn_rate)
max_epochs = 100
n_items = len(train_x)
batcher = Batcher(n_items, bat_size)

The learning rate (0.01), batch size (16), and max epochs (100) must be determined by trial and error. For binary classification with a single logistic sigmoid output node, you can use either binary cross entropy or mean squared error loss, but not cross entropy (which is used for multiclass classification). The demo uses a program-defined class Batcher to serve up the indices of 16 training items at a time. An alternative approach is to use the built-in Dataset and DataLoader objects in the torch.utils.data module.
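
A minimal sketch of that alternative, which would replace the program-defined Batcher object and the inner batch loop, looks like this (shuffle=True serves the batches in a different order each epoch):

import torch.utils.data as dt
train_ds = dt.TensorDataset(T.Tensor(train_x), T.Tensor(train_y))
train_ldr = dt.DataLoader(train_ds, batch_size=bat_size, shuffle=True)
for (X, Y) in train_ldr:
  optimizer.zero_grad()
  oupt = net(X)
  loss_obj = loss_func(oupt, Y)
  loss_obj.backward()
  optimizer.step()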

An epoch is one complete pass through all training items. Because there are 1,097 training items and each batch holds 16 items, there are roughly 1097 / 16 = 68 weight and bias update operations per epoch (the Batcher serves only full batches, so the few leftover items are skipped in each epoch). During training, the prediction accuracy of the model is computed and displayed every 10 epochs using a program-defined function named akkuracy. The akkuracy function operates at the Tensor level using efficient aggregate operations. During the development of the demo, I used a function named accuracy that uses a less efficient approach.
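
A sketch of such an item-by-item approach (not the exact function used during development, but the same idea) might look like this:

def accuracy_slow(model, data_x, data_y):
  # process one data item at a time -- simple but slow
  n_correct = 0
  for i in range(len(data_x)):
    X = T.Tensor(data_x[i].reshape(1, -1))  # one item as a 1x4 matrix
    oupt = model(X)
    pred = 1 if oupt.item() >= 0.5 else 0
    if pred == int(data_y[i][0]):
      n_correct += 1
  return (n_correct * 100.0) / len(data_x)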

Making a Prediction

After the model was trained, the demo used the model to make a prediction for a new, previously unseen banknote. First, the four pairs of min-max values for each predictor variable in the training data are placed into a matrix:

train_min_max = np.array([
  [-7.0421, 6.8248],
  [-13.7731, 12.9516],
  [-5.2861, 17.9274],
  [-7.8719, 2.1625]], dtype=np.float32)

Recall that the first predictor variable is image variance. So, in the 1,097 training items, the smallest variance is -7.0421 and the largest variance is 6.8248.

The unknown banknote is set to arbitrary values (1.2345, 2.3456, 3.4567, 4.5678) and then min-max normalized, like so:

unknown_raw = np.array([[1.2345, 2.3456, 3.4567, 4.5678]],
  dtype=np.float32)
unknown_norm = np.zeros(shape=(1,4), dtype=np.float32)
for i in range(4):
  x = unknown_raw[0][i]
  mn = train_min_max[i][0]  # min
  mx = train_min_max[i][1]  # max
  unknown_norm[0][i] = (x - mn) / (mx - mn)

A PyTorch network expects two-dimensional input (though there are some exceptions), so the demo sets up input with one row and four columns. The prediction is made with these statements:

unknown = T.Tensor(unknown_norm)  # to Tensor
raw_out = net(unknown)       # a Tensor
pred_prob = raw_out.item()   # scalar, [0.0, 1.0]

The network requires a Tensor object so the NumPy matrix is converted to a Tensor. A quirk of PyTorch is that if a Tensor has a single value, the value can be extracted using the Tensor.item method.
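
Recall that the trained model’s state dictionary was saved to file earlier in the demo. A minimal sketch (assuming the Net class definition is available) of how a separate program could reload the model and make the same kind of prediction is:

net = Net()
net.load_state_dict(T.load(".\\Models\\banknote_model.pth"))
net = net.eval()  # set evaluation mode
unknown = T.Tensor(unknown_norm)
pred_prob = net(unknown).item()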

Wrapping Up

The field of neural machine learning is advancing with tremendous speed. Significant new algorithms and neural architectures are appearing every few months. At the time this article was written, three neural network code libraries appear to be distancing themselves from the dozens of alternatives available. PyTorch and TensorFlow are becoming the most commonly used libraries in situations where some customization or flexibility is needed. The Keras library is becoming the library of choice for situations where a relatively straightforward neural network can be used. But it’s too early to predict which of these libraries (if any) will become de facto standards.


Dr. James McCaffrey works for Microsoft Research in Redmond, Wash. He has worked on several key Microsoft products, including Azure and Bing. Dr. McCaffrey can be reached at jamccaff@microsoft.com.

Thanks to the following Microsoft technical experts for reviewing this article: Chris Lee, Ricky Loynd

