A kernel perceptron is a machine learning (ML) classifier that can be used to make binary predictions. For example, a kernel perceptron could predict the sex of a person (male = -1, female = +1) based on age, income, height and weight. Kernel perceptrons are an advanced variation of ordinary perceptrons and can handle more complex data.

A good way to see where this article is headed is to take a look at the demo program in **Figure 1** and the associated data in **Figure 2**. The goal of the demo program is to predict the class, -1 or +1, of dummy input data that has just two predictor variables, x0 and x1. For example, the first training data item is (2.0, 3.0, -1), which means that if the input values are x0 = 2.0 and x1 = 3.0, the correct class is -1.

**Figure 1 Kernel Perceptron Demo**

**Figure 2 Kernel Perceptron Training Data**

The 21 training data points have a circular geometry, which means that simple linear classification techniques, such as ordinary perceptrons or logistic regression, are ineffective. Such data is called non-linearly separable.

The demo program uses the training data to create a prediction model using a kernel perceptron. Unlike many prediction models, which consist of a set of numeric values called weights, a kernel perceptron model consists of a set of integer counter values, one for each training data item. The demo program displays the counter values for the first two data items (1 and 2) and the last two data items (0 and 1).

After training, the kernel perceptron model predicts all 21 data items correctly, which is to be expected. The trained model is applied to four new test data items that were not used during training. The first test item has inputs (2.0, 4.0) and a class of -1. The prediction model correctly predicts that item. Overall, the model correctly predicts three of the four test items for 0.75 accuracy.

Behind the scenes, the kernel perceptron uses a function called a radial basis function (RBF) kernel. The kernel function requires a parameter called sigma. The value of sigma must be determined by trial and error, and sigma = 1.5 in the demo. Note that by setting sigma to 0.5, the demo program would achieve 100 percent accuracy. I used sigma = 1.5 to point out that for most data sets you won’t achieve 100 percent accuracy.

This article assumes you have intermediate or higher programming skills, but doesn’t assume you know anything about kernel perceptrons. The demo program is coded using C#, but you should have no trouble refactoring the code to another language such as Python or JavaScript if you wish. All the key demo code, except for the hardcoded data and some display statements, is presented in this article. The complete demo source code, including data, is available in the accompanying download.

## Understanding Kernel Functions

In order to understand kernel perceptrons you must understand kernel functions in general. There are many kernel functions, but the most common one, and the type used by the demo program, is called the RBF kernel. The RBF kernel function accepts two vectors (that is, arrays) and a parameter named sigma, and returns a single value between 0.0 and 1.0 that’s a measure of similarity between the two arrays. A return value of exactly 1.0 means the two arrays are identical. The less similar the two arrays are, the smaller the value of RBF is, approaching but never quite reaching 0.0.

The equation for the RBF kernel is:

K(x1, x2) = exp( -||x1 - x2||^2 / (2 * sigma^2) )

Here, K stands for kernel, x1 and x2 are two arrays that have the same length, sigma is a free parameter with a value like 1.0 or 1.5, the || || indicates Euclidean distance (so ||x1 - x2||^2 is the squared distance between the two vectors), and exp means Euler's number (e) raised to a power.

The RBF is best explained by example. Suppose x1 = (2.0, 1.0, 3.0) and x2 = (0.0, 1.0, 5.0), and sigma is 1.0. First, you compute the squared Euclidean distance:

||x1 - x2||^2 = (2.0 - 0.0)^2 + (1.0 - 1.0)^2 + (3.0 - 5.0)^2 = 4.0 + 0.0 + 4.0 = 8.0

Next, you divide the squared distance by 2 times sigma squared:

8.0 / (2 * (1.0)^2) = 8.0 / 2.0 = 4.0

Last, you take Euler's number (e = approximately 2.71828) and raise it to the negative of the previous result:

K(x1, x2) = e^(-4.0) = 0.0183

Notice that if x1 equals x2, the squared Euclidean distance is 0, dividing 0 by 2 times sigma squared gives 0, and e raised to the 0th power is 1.0. The less similar x1 and x2 are, the larger the squared distance will be, and e raised to a large negative value gets very small, approaching 0.0. For a given pair of vectors, the larger the value of sigma is, the larger the value of K is, but K always ranges from 1.0 (identical vectors) down toward, but never quite reaching, 0.0 (very dissimilar vectors). The order of the array parameters doesn't matter, so K(x1, x2) = K(x2, x1).
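The three-step calculation above can be checked with a short, self-contained sketch (the RbfKernel name here is mine; the demo's own Kernel function, shown next, works the same way except that it skips the last array cell):

```csharp
using System;

class RbfDemo
{
  // RBF kernel: K(x1, x2) = exp(-||x1 - x2||^2 / (2 * sigma^2))
  public static double RbfKernel(double[] v1, double[] v2, double sigma)
  {
    double squaredDist = 0.0;  // squared Euclidean distance
    for (int i = 0; i < v1.Length; ++i)
      squaredDist += (v1[i] - v2[i]) * (v1[i] - v2[i]);
    double z = squaredDist / (2.0 * sigma * sigma);
    return Math.Exp(-z);
  }

  static void Main()
  {
    double[] x1 = { 2.0, 1.0, 3.0 };
    double[] x2 = { 0.0, 1.0, 5.0 };
    // squared distance = 4.0 + 0.0 + 4.0 = 8.0; 8.0 / 2.0 = 4.0; e^-4 = 0.0183
    Console.WriteLine(RbfKernel(x1, x2, 1.0).ToString("F4"));  // 0.0183
    Console.WriteLine(RbfKernel(x1, x1, 1.0).ToString("F4"));  // identical arrays: 1.0000
  }
}
```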

The demo program defines the RBF kernel function as:

```csharp
static double Kernel(double[] d1, double[] d2, double sigma)
{
  double num = 0.0;
  for (int i = 0; i < d1.Length - 1; ++i)
    num += (d1[i] - d2[i]) * (d1[i] - d2[i]);
  double denom = 2.0 * sigma * sigma;
  double z = num / denom;
  double result = Math.Exp(-z);
  return result;
}
```

The function assumes that the last cell of each array holds the class label (+1 or -1), so the last cell isn’t included in the calculation. There are several other kernel functions that can be used with a kernel perceptron, including the linear kernel, polynomial kernel, Fisher kernel, and sigmoid kernel. Calculating kernel functions is relatively easy. What’s not obvious, however, is that kernel functions have some remarkable mathematical properties that allow simple classifiers, like an ordinary perceptron, to be transformed into classifiers that can handle non-linearly separable data.

## Understanding Ordinary Perceptrons

An ordinary perceptron can perform binary classification for simple, linearly separable data. An ordinary perceptron consists of a set of weights, where the number of weights equals the number of predictor values, and an additional special weight called the bias. For example, suppose you have a binary classification problem where there are three predictor variables. And suppose the perceptron weights are w0 = 1.5, w1 = 1.1 and w2 = -3.0, and three predictor variables have values x0 = 4.0, x1 = 3.0, x2 = 5.0 and the bias is b = 2.5. The prediction is calculated as the sign (positive or negative) of the sum of the products of weight and input values, plus the bias:

```
sum = (w0)(x0) + (w1)(x1) + (w2)(x2) + b
    = (1.5)(4.0) + (1.1)(3.0) + (-3.0)(5.0) + 2.5
    = 6.0 + 3.3 + (-15.0) + 2.5
    = -3.2

prediction = sign(sum) = -1
```

It’s that simple. But where do the weights and bias values come from? You take training data that has known input and correct class values and then use a mathematical optimization algorithm to find values for the weights and bias so that the calculated class values closely match the known correct class values.
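The article doesn't list ordinary perceptron code, but the prediction calculation and the classic training update can be sketched as follows. The Predict name and the learning-rate value lr are my choices, not code from the demo:

```csharp
using System;

class PerceptronSketch
{
  // Prediction: sign of the weighted sum of inputs plus the bias
  public static int Predict(double[] x, double[] w, double b)
  {
    double sum = b;
    for (int i = 0; i < x.Length; ++i)
      sum += w[i] * x[i];
    return (sum >= 0.0) ? +1 : -1;
  }

  static void Main()
  {
    // Values from the worked example: sum = -3.2, so prediction is -1
    double[] w = { 1.5, 1.1, -3.0 };
    double b = 2.5;
    double[] x = { 4.0, 3.0, 5.0 };
    Console.WriteLine(Predict(x, w, b));  // -1

    // Classic perceptron update: on a wrong prediction, nudge the
    // weights and bias toward the desired class (lr is a typical small value)
    double lr = 0.01;
    int desired = +1;  // pretend the correct class is +1
    if (Predict(x, w, b) != desired)
    {
      for (int i = 0; i < w.Length; ++i)
        w[i] += lr * desired * x[i];
      b += lr * desired;
    }
  }
}
```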

Note that in perceptron literature, the bias value is often considered to be a regular weight associated with a dummy input value of 1.0. Using that scheme, the preceding example would have inputs x = (1.0, 4.0, 3.0, 5.0) and weights w = (2.5, 1.5, 1.1, -3.0), yielding the following (which is the same result as before):

```
sum = (2.5)(1.0) + (1.5)(4.0) + (1.1)(3.0) + (-3.0)(5.0)
    = 2.5 + 6.0 + 3.3 + (-15.0)
    = -3.2

prediction = sign(sum) = -1
```

In other words, a perceptron bias value can be explicit or implicit. You should not underestimate the confusion this can cause.

Now, unfortunately, ordinary perceptrons can classify only linearly separable data, which makes them very limited in practice. However, by using a kernel function in conjunction with a perceptron, it’s possible to create a classifier that can work with non-linearly separable data, such as the demo data shown in **Figure 2**.

## The Kernel Perceptron Algorithm

At first glance, the kernel perceptron algorithm appears unrelated to the ordinary perceptron algorithm, but in fact, the two are deeply related. Suppose there are just four training data items: td[0] = (2.0, 3.0, -1), td[1] = (3.0, 2.0, +1), td[2] = (3.0, 5.0, +1), td[3] = (4.0, 3.0, -1). And suppose the trained kernel perceptron model gave you wrong counter values of a = (1, 2, 3, 4). (I use “a” for the wrong counters because in research literature they’re usually given the symbol Greek alpha, which resembles lowercase English “a.”)

To predict the class of new data item x = (3.0, 6.0), you compute the weighted sum (by a-value and training item correct class) of the kernel function applied to the new data item and each of the four training items:

```
K(td[0], x) = 0.1083
K(td[1], x) = 0.0285
K(td[2], x) = 0.8007
K(td[3], x) = 0.1083

sum = (1)(0.1083)(-1) + (2)(0.0285)(+1) + (3)(0.8007)(+1) + (4)(0.1083)(-1)
    = +1.9175

prediction = sign(sum) = +1
```

Stated somewhat loosely, the kernel perceptron looks at the similarity of the data item to be classified and all training data items, and aggregates the similarity values—using the wrong counters as weights—into a single value that indicates predicted class.
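The four-item calculation above (which assumes the demo's sigma value of 1.5) can be reproduced with a short sketch:

```csharp
using System;

class KernelPredictSketch
{
  // RBF kernel over the predictor cells only (last cell is the class label)
  public static double Kernel(double[] d1, double[] d2, double sigma)
  {
    double num = 0.0;
    for (int i = 0; i < d1.Length - 1; ++i)
      num += (d1[i] - d2[i]) * (d1[i] - d2[i]);
    return Math.Exp(-num / (2.0 * sigma * sigma));
  }

  // Sum of (wrong counter) * (kernel similarity) * (training class)
  public static double WeightedSum(double[][] td, int[] a, double[] x, double sigma)
  {
    double sum = 0.0;
    for (int j = 0; j < td.Length; ++j)
    {
      int dj = (int)td[j][td[j].Length - 1];  // class of training item j
      sum += a[j] * Kernel(td[j], x, sigma) * dj;
    }
    return sum;
  }

  static void Main()
  {
    double[][] td = new double[4][];
    td[0] = new double[] { 2.0, 3.0, -1 };
    td[1] = new double[] { 3.0, 2.0, +1 };
    td[2] = new double[] { 3.0, 5.0, +1 };
    td[3] = new double[] { 4.0, 3.0, -1 };
    int[] a = { 1, 2, 3, 4 };            // wrong counters
    double[] x = { 3.0, 6.0, 0.0 };      // item to classify (dummy label cell)

    double sum = WeightedSum(td, a, x, 1.5);
    Console.WriteLine(sum.ToString("F4"));    // 1.9175
    Console.WriteLine(sum >= 0.0 ? +1 : -1);  // predicted class: 1
  }
}
```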

So, to make a kernel perceptron prediction you need the training data and the associated wrong counter values. When you train an ordinary perceptron, you use an iterative process. In each iteration, you use the current values of the weights and bias to calculate a predicted class. If the predicted class is incorrect (doesn’t match the class in the training data) you adjust the weights a bit so that the predicted class is closer to the known correct class value.

In a kernel perceptron, you use a similar iterative training process, but instead of adjusting weight values when a calculated class is wrong, you increment the wrong counter for the current training item. Quite remarkable! The math proof of why this works is stunningly beautiful and can be found in the Wikipedia entry for kernel perceptrons.

Expressed in high-level pseudo-code, the kernel perceptron training algorithm is:

```
loop a few times
  for each curr training item, i
    sum = 0.0
    for each other training item, j
      sum += a[j] * K(td[i], td[j], sigma) * y[j]
    end-for
    pred = sign(sum)
    if (pred is wrong) increment a[i]
  end-for
end-loop
```

The two parameters needed in the training algorithm are the number of times to iterate and the value of sigma for the kernel function. Both of these values must be determined by a bit of trial and error. The demo program iterates 10 times and uses 1.5 for sigma.

Note that the ordinary perceptron training algorithm uses training data to generate the weights and bias values, and then conceptually discards the training data. The kernel perceptron training algorithm generates wrong counter values that are mathematically related to weights, but the algorithm must keep the training data in order to make predictions.

## The Demo Program Structure

The overall structure of the demo program, with a few minor edits to save space, is presented in **Figure 3**. I used a static method style rather than an object-oriented programming style for simplicity. The Main method has all the control logic. There are two helper methods, Kernel and Accuracy.

```csharp
using System;
namespace KernelPerceptron
{
  class KernelPerceptronProgram
  {
    static void Main(string[] args)
    {
      Console.WriteLine("Begin demo ");
      int numFeatures = 2;
      Console.WriteLine("Goal is classification (-1/+1) ");
      Console.WriteLine("Setting up 21 training items ");
      double[][] trainData = new double[21][];
      trainData[0] = new double[] { 2.0, 3.0, -1 };
      trainData[1] = new double[] { 2.0, 5.0, -1 };
      . . .
      trainData[20] = new double[] { 5.0, 6.0, +1 };
      int numTrain = trainData.Length;

      double[][] testData = new double[4][];
      testData[0] = new double[] { 2.0, 4.0, -1 };
      . .
      testData[3] = new double[] { 5.5, 5.5, +1 };

      int[] a = new int[trainData.Length];
      int maxEpoch = 10;
      int epoch = 0;
      double sigma = 1.5;  // for the kernel function
      Console.WriteLine("Starting train, sigma = 1.5 ");
      ...
      Console.WriteLine("Training complete ");

      double trainAcc = Accuracy(trainData, a, trainData,
        sigma, false);  // silent
      Console.WriteLine("Accuracy = " + trainAcc.ToString("F4"));

      Console.WriteLine("Analyzing test data: ");
      double testAcc = Accuracy(testData, a, trainData,
        sigma, true);  // verbose

      Console.WriteLine("End kernel perceptron demo ");
      Console.ReadLine();
    } // Main

    static double Kernel(double[] d1, double[] d2, double sigma) { . . }

    static double Accuracy(double[][] data, int[] a, double[][] trainData,
      double sigma, bool verbose) { . . }
  } // Program
} // ns
```

To code the demo program, I launched Visual Studio and created a new C# console application program and named it KernelPerceptron. I used Visual Studio 2015, but the demo program has no significant .NET Framework dependencies so any recent version will work.

After the template code loaded into the editor window, I right-clicked on file Program.cs in the Solution Explorer window and renamed the file to KernelPerceptronProgram.cs, then allowed Visual Studio to automatically rename class Program for me. At the top of the template-generated code, I deleted all unnecessary using statements, leaving just the one that references the top-level System namespace.

The Main method sets up the training and test data like so:

```csharp
int numFeatures = 2;
double[][] trainData = new double[21][];
trainData[0] = new double[] { 2.0, 3.0, -1 };
. .
trainData[20] = new double[] { 5.0, 6.0, +1 };
int numTrain = trainData.Length;

double[][] testData = new double[4][];
testData[0] = new double[] { 2.0, 4.0, -1 };
. .
testData[3] = new double[] { 5.5, 5.5, +1 };
```

The demo uses two predictor variables (also called features in ML terminology) for simplicity, but kernel perceptrons can handle any number of predictor variables. The data is hardcoded but in a non-demo scenario you’d likely load the data from a text file. The demo uses -1 and +1 to represent the two possible classes. This encoding is typical for perceptrons, but classes can be encoded as 0 and 1 instead (though this encoding would require some changes to the code logic).
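For reference, a minimal loader might look like the following sketch; the comma-delimited file format and the LoadData name are my assumptions, not part of the demo download:

```csharp
using System;
using System.Globalization;
using System.IO;

class LoadDataSketch
{
  // Hypothetical loader: each line holds "x0,x1,class", such as "2.0,3.0,-1"
  public static double[][] LoadData(string path)
  {
    string[] lines = File.ReadAllLines(path);
    double[][] result = new double[lines.Length][];
    for (int i = 0; i < lines.Length; ++i)
    {
      string[] tokens = lines[i].Split(',');
      result[i] = new double[tokens.Length];
      for (int j = 0; j < tokens.Length; ++j)  // parse with "." as decimal point
        result[i][j] = double.Parse(tokens[j], CultureInfo.InvariantCulture);
    }
    return result;
  }
}
```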

Training is prepared with these statements:

```csharp
int[] a = new int[numTrain];  // "Wrong counters"
double sigma = 1.5;  // For the kernel function
int maxEpoch = 10;
int epoch = 0;
```

Array a holds the wrong counters for each training item, and sigma is the free parameter for the RBF kernel function. The value of sigma and the maxEpoch loop control were determined by trial and error. Next, all possible kernel function values are pre-calculated and stored into a matrix:

```csharp
double[][] kernelMatrix = new double[numTrain][];
for (int i = 0; i < kernelMatrix.Length; ++i)
  kernelMatrix[i] = new double[numTrain];
for (int i = 0; i < numTrain; ++i) {
  for (int j = 0; j < numTrain; ++j) {
    double k = Kernel(trainData[i], trainData[j], sigma);
    kernelMatrix[i][j] = kernelMatrix[j][i] = k;
  }
}
```

The idea is that during training, the kernel similarity between all pairs of training items will have to be used several times so it makes sense to pre-calculate these values.

The training loop is:

```csharp
while (epoch < maxEpoch) {
  for (int i = 0; i < numTrain; ++i) {
    // Get "desired" correct class into di
    for (int j = 0; j < numTrain; ++j) {
      // Get "other" desired class into dj
      // Compute y = weighted sum of products
    }
    if ((di == -1 && y >= 0.0) || (di == 1 && y <= 0.0))
      ++a[i];  // increment wrong counter
  }
  ++epoch;
}
```

The desired, correct class (-1 or +1) is pulled from the current training item with this code:

```csharp
int di = (int)trainData[i][numFeatures];
```

I use di here to stand for desired value. Two other common variable names are t (for target value) and y (which just stands for general output).

The inner nested for loop that calculates the weighted sum of kernel values is the heart of the kernel perceptron-learning algorithm:

```csharp
double y = 0.0;  // Weighted sum of kernel results
for (int j = 0; j < numTrain; ++j) {
  int dj = (int)trainData[j][numFeatures];
  double kern = kernelMatrix[i][j];
  y += a[j] * dj * kern;
}
```

The demo code calculates the kernel for all pairs of training items. But when an associated wrong counter has value 0, the product term will be 0. Therefore, an important optional optimization when the number of training items is large is to skip the calculation of the kernel when the value of a[i] or a[j] is 0. Equivalently, training items that have a wrong counter value of 0 can be removed entirely.
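A sketch of that optimization, applied to the weighted-sum loop (the WeightedSum wrapper name is mine; the variable names match the demo):

```csharp
using System;

class SkipZeroSketch
{
  // Weighted kernel sum for training item i that skips items whose
  // wrong counter is zero, since their product term is always zero.
  public static double WeightedSum(int i, int[] a, double[][] trainData,
    double[][] kernelMatrix, int numFeatures)
  {
    double y = 0.0;
    for (int j = 0; j < trainData.Length; ++j)
    {
      if (a[j] == 0) continue;  // a[j] * dj * kern would be 0.0; skip
      int dj = (int)trainData[j][numFeatures];
      y += a[j] * dj * kernelMatrix[i][j];
    }
    return y;
  }
}
```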

I don’t compute an explicit predicted class because it’s easier to check whether the predicted class is wrong directly:

```csharp
if ((di == -1 && y >= 0.0) || (di == 1 && y <= 0.0))
  ++a[i];  // wrong prediction: increment wrong counter
```

You have to be careful to check y >= 0.0 or y <= 0.0 rather than y > 0.0 or y < 0.0, because the first time through the training loop, all wrong counter values in the a array are zero, and so the weighted sum of products will be 0.0.

After the training loop terminates, the kernel perceptron is effectively defined by the training data, the wrong-counter array and the RBF kernel parameter sigma.

## Making Predictions

Helper function Accuracy makes predictions. The method’s definition starts with:

```csharp
static double Accuracy(double[][] data, int[] a, double[][] trainData,
  double sigma, bool verbose)
{
  int numFeatures = data[0].Length - 1;
  double[] x = new double[numFeatures];
  int numCorrect = 0;
  int numWrong = 0;
...
```

The parameter named data holds an array-of-arrays style matrix of data to evaluate. Parameter array a holds the wrong counter values generated by training.

Inside the body of the function, the key code is exactly like the training code, except instead of incrementing a wrong counter for the current training data item on an incorrect prediction, a single numWrong variable or numCorrect variable is incremented:

```csharp
if ((di == -1 && y >= 0.0) || (di == 1 && y <= 0.0))
  ++numWrong;
else
  ++numCorrect;
```

After all data items under investigation have been examined, the percentage of correct predictions is returned:

```csharp
return (numCorrect * 1.0) / (numCorrect + numWrong);
```

The Accuracy method has a verbose parameter, which if true, causes diagnostic information to be displayed:

```csharp
if (verbose == true) {
  Console.Write("Input: ");
  for (int j = 0; j < data[i].Length - 1; ++j)
    Console.Write(data[i][j].ToString("F1") + " ");
  // Etc.
}
```

Using the code from the Accuracy method, you could write a dedicated Predict method that accepts an array of x values (without a known correct class), the training data, the wrong-counter array, and the RBF kernel sigma value, and returns a +1 or -1 prediction value.
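Such a Predict method might look like the following sketch; the name and signature are my suggestion rather than code from the demo download:

```csharp
using System;

class PredictSketch
{
  // Hypothetical Predict: x holds just the predictor values (no class label).
  public static int Predict(double[] x, double[][] trainData, int[] a, double sigma)
  {
    int numFeatures = x.Length;
    double sum = 0.0;
    for (int j = 0; j < trainData.Length; ++j)
    {
      // RBF kernel over the predictor cells of training item j and x
      double num = 0.0;
      for (int k = 0; k < numFeatures; ++k)
        num += (trainData[j][k] - x[k]) * (trainData[j][k] - x[k]);
      double kern = Math.Exp(-num / (2.0 * sigma * sigma));
      int dj = (int)trainData[j][numFeatures];  // class label of item j
      sum += a[j] * dj * kern;
    }
    return (sum >= 0.0) ? +1 : -1;
  }
}
```

Applied to the four-item example from earlier, Predict(new double[] { 3.0, 6.0 }, td, a, 1.5) gives +1, matching the hand calculation.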

## Wrapping Up

Kernel perceptrons aren’t used very often. This is due in large part to the fact that there are more powerful binary classification techniques available, and that there is a lot of mystery surrounding ML kernel methods in general. If you do an Internet search for kernel perceptrons, you’ll find many references that show the beautiful mathematical relationships between ordinary perceptrons and kernel perceptrons, but very little practical implementation information. In my opinion, the primary value of understanding kernel perceptrons is that the knowledge makes it easier to understand more sophisticated ML kernel methods (which I’ll present in a future column).

One of the weaknesses of kernel perceptrons is that to make a prediction, the entire training data set (except for items that have a wrong counter value of 0) must be examined. If the training set is huge, this could make real-time predictions infeasible in some scenarios.

Kernel perceptrons are arguably the simplest type of kernel methods. The kernel trick can be applied to other linear classifiers. In fact, applying the kernel trick to a maximum margin linear classifier is the basis for support vector machine (SVM) classifiers, which were popular in the late 1990s.

**Dr. James McCaffrey** *works for Microsoft Research in Redmond, Wash. He has worked on several Microsoft products including Internet Explorer and Bing. Dr. McCaffrey can be reached at jammc@microsoft.com.*

Thanks to the following Microsoft technical experts who reviewed this article: Ani Anirudh and Chris Lee


Test Run - Kernel Perceptrons using C#


Apr 3, 2017