Kernel logistic regression (KLR) is a machine learning technique that can be used to make binary predictions. For example, KLR could predict if a person will repay a loan (fail to repay = 0, successfully repay = 1) based on predictor variables such as age, income and existing debt amount. KLR is an advanced variation of ordinary logistic regression.

A good way to see where this article is headed is to take a look at the demo program in **Figure 1** and the associated data in **Figure 2**. The goal of the demo program is to predict the class, 0 or 1, of dummy data that has just two predictor variables (sometimes called features), x0 and x1. For example, the first training data item is (2.0, 3.0, 0), which means that if the predictor values are x0 = 2.0 and x1 = 3.0, the correct class is 0. KLR can handle data with any number of predictor variables, but using just two allows you to visualize the technique easily.

The 21 training data points have a circular geometry, which means that simple linear classification techniques, such as ordinary logistic regression, are ineffective. Such data is called non-linearly separable.

Behind the scenes, KLR uses a function called a radial basis function (RBF) kernel. The RBF kernel function has a parameter called sigma. The value of sigma must be determined by trial and error, and sigma is set to 1.0 in the demo. Training a KLR model is an iterative process and the demo sets the maximum number of iterations to 1,000, and sets a learning rate, eta, to 0.001.

Training a KLR model creates a set of “alpha values,” one for each training data item, plus an additional “bias” value. The demo program displays the alpha values for the first three training items (-0.3071, -0.3043, -0.3071) and the last two items (0.8999, 0.6108) and the bias (-1.0722).

After training, the KLR model predicts all 21 data items correctly. Then the model is applied to four test data items, shown as black dots in **Figure 2**. The first test item has inputs (1.5, 4.5) and a correct class of 0. The prediction model correctly predicts that item, and the other three test items, too.

This article assumes you have intermediate or higher programming skills but doesn’t assume you know anything about KLR. The demo program is coded using C#, but you should have no trouble refactoring the code to another language, such as Java or Python, if you wish. The demo program is too long to present in its entirety, but the complete source code is available in the file download that accompanies this article.

## The RBF Kernel

A kernel function measures the similarity of two vectors or arrays. The RBF kernel function I mentioned earlier is the most common, and is the type used by the demo program. An RBF value of 1.0 means two vectors are identical. Smaller RBF values indicate two vectors are less similar.

The equation for RBF is:

K(v1, v2) = exp( - || v1 - v2 ||^2 / (2 * sigma^2) )

Here, K stands for kernel; v1 and v2 are two vectors that have the same length; sigma is a free parameter with a value like 1.0 or 1.5; the double-bar notation indicates Euclidean distance; and exp means Euler’s number (e ≈ 2.71828) raised to a power.

The RBF function is best explained by example. Suppose v1 = (3.0, 1.0, 2.0) and v2 = (1.0, 0.0, 5.0), and sigma is 1.5. First, you compute the squared Euclidean distance:

|| v1 - v2 ||^2 = (3.0 - 1.0)^2 + (1.0 - 0.0)^2 + (2.0 - 5.0)^2

= 4.0 + 1.0 + 9.0

= 14.0

Next, you divide the squared distance by 2 times sigma squared:

14.0 / (2 * (1.5)^2) = 14.0 / 4.5 = 3.11

Last, you take Euler’s number and raise it to the negative of the previous result:

K(v1, v2) = e^(-3.11) = 0.0446
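The arithmetic above is easy to double-check with a few lines of Python (a quick sketch, not part of the C# demo; the function name is mine):

```python
import math

def rbf_kernel(v1, v2, sigma):
    # RBF similarity: exp( -||v1 - v2||^2 / (2 * sigma^2) )
    sq_dist = sum((a - b) ** 2 for a, b in zip(v1, v2))
    return math.exp(-sq_dist / (2.0 * sigma * sigma))

k = rbf_kernel([3.0, 1.0, 2.0], [1.0, 0.0, 5.0], 1.5)
print(round(k, 4))  # 0.0446
```

Identical vectors give a squared distance of 0, so the kernel value is exp(0) = 1.0, the maximum possible similarity.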

The small kernel value indicates that v1 and v2 are not very similar. The demo program defines the RBF kernel function as:

```
static double Kernel(double[] v1, double[] v2, double sigma)
{
  double num = 0.0;
  for (int i = 0; i < v1.Length - 1; ++i)
    num += (v1[i] - v2[i]) * (v1[i] - v2[i]);
  double denom = 2.0 * sigma * sigma;
  double z = num / denom;
  return Math.Exp(-z);
}
```

The function assumes that the last cell of each array holds the class label (0 or 1) and so the last cell isn’t included in the calculation. KLR uses the kernel function to compare a given data item with all training items, and uses that information to determine a predicted class label.

## Ordinary Logistic Regression

Ordinary logistic regression (LR) is best explained by example. Suppose you have three predictor variables: x0 = 2.5, x1 = 1.7, and x2 = 3.4. A regular LR model creates a set of numeric constants called weights (wi), one for each predictor variable, and an additional numeric constant called the bias (b). Note that the bias in regular LR isn’t the same as the KLR bias shown in **Figure 1**.

**Figure 1 Kernel Logistic Regression Demo**

Suppose w0 = 0.11, w1 = 0.33, w2 = 0.22, b = 0.44. To predict the class label, 0 or 1, for the input data (2.5, 1.7, 3.4) you first compute the sum of the products of each x and its associated w, and add the bias:

z = (2.5)(0.11) + (1.7)(0.33) + (3.4)(0.22) + 0.44

= 2.024

Next, you compute p = 1.0 / (1.0 + exp(-z)):

p = 1.0 / (1.0 + exp(-2.024))

= 0.8833

The p value is the probability that the data item has class label = 1, so if p is less than 0.5, your prediction is 0 and if p is greater than 0.5 (as it is in this example), your prediction is 1.
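Because the computation is so simple, the LR prediction above can be verified with a short Python sketch (variable names are mine, not from the demo):

```python
import math

# Inputs, weights and bias from the worked example.
x = [2.5, 1.7, 3.4]
w = [0.11, 0.33, 0.22]
b = 0.44

# Sum of products of each x and its weight, plus the bias.
z = sum(xi * wi for xi, wi in zip(x, w)) + b
# Logistic sigmoid gives the probability of class 1.
p = 1.0 / (1.0 + math.exp(-z))
label = 1 if p > 0.5 else 0
print(round(z, 3), round(p, 4), label)  # 2.024 0.8833 1
```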

OK, but where do the weights and bias values come from in regular LR? You determine them from a set of training data that has known input values and known, correct class labels, using an optimization algorithm to find weight and bias values so that the predicted class labels closely match the known, correct labels. There are many algorithms that can be used, including gradient ascent with log likelihood, gradient descent with squared error, iterated Newton-Raphson, simplex optimization, L-BFGS and particle swarm optimization.

The primary disadvantage of regular LR is that it can handle only data that’s linearly separable. Regular LR can’t handle data that’s not linearly separable, such as the demo data shown in **Figure 2**.

**Figure 2 Kernel Logistic Regression Training Data**

## Understanding Kernel Logistic Regression

KLR is best explained by example. Let me state up front that at first glance KLR doesn’t appear very closely related to ordinary LR. However, the two techniques are closely related mathematically.

Suppose there are just four training data items:

td[0] = (2.0, 4.0, 0)

td[1] = (4.0, 1.0, 1)

td[2] = (5.0, 3.0, 0)

td[3] = (6.0, 7.0, 1)

Your goal is to predict the class label for x = (3.0, 5.0). Suppose the trained KLR model gave you alpha values and a bias of: alpha[0] = -0.3, alpha[1] = 0.4, alpha[2] = -0.2, alpha[3] = 0.6, b = 0.1.

The first step is to compute the RBF similarity between the data item to predict and each of the training items:

K(td[0], x) = 0.3679

K(td[1], x) = 0.0002

K(td[2], x) = 0.0183

K(td[3], x) = 0.0015

Notice that at this point, x is most similar to td[0] and td[2], which both have class label 0. Next, you compute the sum of products of each K value and the associated alpha, and add the bias value:

z = (0.3679)(-0.3) + (0.0002)(0.4) + (0.0183)(-0.2) + (0.0015)(0.6) + 0.1

= -0.0130

Now you compute p = 1.0 / (1.0 + exp(-z)):

p = 1.0 / (1.0 + exp(0.0130))

= 0.4967

If the p value is greater than 0.5, the predicted class is 1, and if the p value is less than 0.5, the predicted class is 0, as it is (just barely) for this example.
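Putting the pieces together, the four-item prediction can be sketched in Python. The sigma value isn't stated explicitly in this example, so the sketch assumes sigma = 1.0, which is the value that reproduces the four kernel values shown above:

```python
import math

def rbf_kernel(v1, v2, sigma):
    sq_dist = sum((a - b) ** 2 for a, b in zip(v1, v2))
    return math.exp(-sq_dist / (2.0 * sigma * sigma))

# Predictor values of the four training items, plus the
# trained alpha and bias values from the example.
train = [[2.0, 4.0], [4.0, 1.0], [5.0, 3.0], [6.0, 7.0]]
alphas = [-0.3, 0.4, -0.2, 0.6]
bias = 0.1
x = [3.0, 5.0]
sigma = 1.0  # assumed; reproduces the K values above

# Sum of products of each kernel value and its alpha, plus the bias.
z = sum(a * rbf_kernel(td, x, sigma)
        for a, td in zip(alphas, train)) + bias
p = 1.0 / (1.0 + math.exp(-z))
label = 1 if p > 0.5 else 0
print(label)  # 0, because p lands just under 0.5
```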

## Training a KLR Model

Training a KLR model is the process of using training data to find the alpha values and the bias value. Expressed in very high-level pseudo-code, the KLR training algorithm is:

```
compute K(td[i], td[j]) for all i, j
loop maxIter times
  for-each curr training item, i
    sum = 0.0
    for-each j: sum += alphas[j] * K(i, j)
    sum += bias
    y = 1.0 / (1.0 + exp(-sum))
    t = target class (0 or 1)
    for-each j: alpha[j] += eta * (t - y) * K(i, j)
    bias += eta * (t - y) * 1.0
  end-for
end-loop
```

The key statement in the demo code is alphas[j] += eta * (t - y) * kernelMatrix[i][j], which updates the alpha value for the training data item at index [j] based on the current training data item at index [i]. Here, t is the known, correct target class, 0 or 1, and y is a calculated probability that the item at [i] has class 1.

For example, suppose an alpha value for a training item is currently 0.1234 and the target class is 1 and the computed probability is 0.60. The current prediction would be correct, but you’d like the p value to be even closer to 1. Suppose the similarity between the two items is K(i, j) = 0.70 and the learning rate eta is 0.10. The new alpha value would be:

alpha = 0.1234 + 0.10 * (1 - 0.60) * 0.70

= 0.1234 + 0.0280

= 0.1514

Because alpha is a multiplier value in the probability calculation, the new, slightly larger value of alpha will increase p a little bit, making the prediction more accurate.
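The update arithmetic above can be checked in a couple of lines of Python (all variable names are mine):

```python
eta = 0.10    # learning rate
t = 1         # known, correct target class
y = 0.60      # computed probability that item i has class 1
k_ij = 0.70   # kernel similarity K(i, j)

alpha = 0.1234
alpha += eta * (t - y) * k_ij   # the KLR update rule
print(round(alpha, 4))  # 0.1514
```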

## The Demo Program

To code the demo program, I launched Visual Studio and created a new C# console application named KernelLogistic. I used Visual Studio 2015, but the demo program has no significant .NET Framework dependencies, so any recent version of Visual Studio will work.

After the template code loaded into the editor window, I right-clicked on file Program.cs in the Solution Explorer window, renamed the file to KernelLogisticProgram.cs and then allowed Visual Studio to automatically rename class Program for me. At the top of the template-generated code, I deleted all unnecessary using statements leaving just the one that references the top-level System namespace. Then I instantiated a Random object:

```
using System;
namespace KernelLogistic
{
  class KernelLogisticProgram
  {
    static Random rnd = new Random(0);
    static void Main(string[] args)
    {
      Console.WriteLine("Begin KLR demo");
      int numFeatures = 2;
...
```

For simplicity, I coded the demo using a static method approach rather than object-oriented programming, and removed all normal error checking. The Main method sets up the 21 training items and the 4 test items like so:

```
double[][] trainData = new double[21][];
trainData[0] = new double[] { 2.0, 3.0, 0 };
...
trainData[20] = new double[] { 5.0, 6.0, 1 };
double[][] testData = new double[4][];
testData[0] = new double[] { 1.5, 4.5, 0 };
...
testData[3] = new double[] { 5.5, 5.5, 1 };
```

In a non-demo scenario, you’d likely read data from a text file. Next, the alpha values are initialized:

```
int numTrain = trainData.Length;
int numTest = testData.Length;
double[] alphas = new double[numTrain + 1];
for (int i = 0; i < alphas.Length; ++i)
  alphas[i] = 0.0;
```

When coding machine learning systems, there are usually several ways to deal with bias values. Here, I store the KLR bias in the last cell of the alphas array. An alternative design is to create a separate standalone variable. Next, the kernel similarities between all pairs of training items are computed:

```
double[][] kernelMatrix = new double[numTrain][];
for (int i = 0; i < kernelMatrix.Length; ++i)
  kernelMatrix[i] = new double[numTrain];
double sigma = 1.0;
for (int i = 0; i < numTrain; ++i) {
  for (int j = 0; j < numTrain; ++j) {
    double k = Kernel(trainData[i], trainData[j], sigma);
    kernelMatrix[i][j] = kernelMatrix[j][i] = k;
  }
}
```

Because there are only 21 data items, I sacrifice efficiency for simplicity. I could’ve reduced the number of kernel calculations by using the facts that K(v1, v2) = K(v2, v1) and K(v, v) = 1. Next, the demo program prepares for training:

```
double eta = 0.001;
int iter = 0;
int maxIter = 1000;
int[] indices = new int[numTrain];
for (int i = 0; i < indices.Length; ++i)
  indices[i] = i;
```

The values of eta and maxIter were determined by trial and error. The idea behind the array named indices is that during training it's important to visit the training items in a random order on each pass, to avoid getting into a situation where training stalls or oscillates back and forth.  The main training loop begins:

```
while (iter < maxIter)
{
  Shuffle(indices);
  for (int idx = 0; idx < indices.Length; ++idx)
  {
    int i = indices[idx];
```

The Shuffle method is a helper that scrambles the order of the training items using the Fisher-Yates mini-algorithm. The target class label and the predicted probability of the current training item are computed like so:

```
double sum = 0.0;
for (int j = 0; j < alphas.Length - 1; ++j)
  sum += alphas[j] * kernelMatrix[i][j];
sum += alphas[alphas.Length - 1];
double y = 1.0 / (1.0 + Math.Exp(-sum));
double t = trainData[i][numFeatures];
```

Notice that this design assumes that the class label is in the last cell of a training data array. Next, the alpha values and the bias are updated:

```
for (int j = 0; j < alphas.Length - 1; ++j)
  alphas[j] = alphas[j] + (eta * (t - y) * kernelMatrix[i][j]);
alphas[alphas.Length - 1] = alphas[alphas.Length - 1] + (eta * (t - y)) * 1;
  } // for-each training item
  ++iter;
} // while (train)
```

Updating the bias value uses a dummy value of 1 in place of the kernel similarity value, just to make the symmetry of the relationship clear. Of course, you can remove the multiplication by 1 because it has no effect. After training, a few of the alpha values and the bias are displayed, as shown in **Figure 1**.
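If you'd rather experiment in Python than C#, the entire training scheme condenses into a short sketch. This version uses the four-item data from the earlier example rather than the 21-item demo data, keeps the bias as a separate variable instead of the last cell of the alphas array, and assumes sigma = 1.0:

```python
import math
import random

def rbf_kernel(v1, v2, sigma):
    sq = sum((a - b) ** 2 for a, b in zip(v1, v2))
    return math.exp(-sq / (2.0 * sigma * sigma))

# Tiny stand-in data set: (x0, x1, label) -- the four items from earlier.
train = [[2.0, 4.0, 0], [4.0, 1.0, 1], [5.0, 3.0, 0], [6.0, 7.0, 1]]
n = len(train)
sigma, eta, max_iter = 1.0, 0.001, 1000
rnd = random.Random(0)

# Precompute all pairwise kernel similarities (class labels excluded).
K = [[rbf_kernel(a[:-1], b[:-1], sigma) for b in train] for a in train]

alphas = [0.0] * n
bias = 0.0
indices = list(range(n))
for _ in range(max_iter):
    rnd.shuffle(indices)  # visit items in random order each pass
    for i in indices:
        z = sum(alphas[j] * K[i][j] for j in range(n)) + bias
        y = 1.0 / (1.0 + math.exp(-z))
        t = train[i][-1]
        for j in range(n):
            alphas[j] += eta * (t - y) * K[i][j]
        bias += eta * (t - y)

# After training, all four items should be classified correctly.
correct = 0
for i in range(n):
    z = sum(alphas[j] * K[i][j] for j in range(n)) + bias
    p = 1.0 / (1.0 + math.exp(-z))
    correct += int((p > 0.5) == (train[i][-1] == 1))
print(correct, "of", n, "correct")  # 4 of 4 correct
```

Because the bias lives in its own variable here, its update simply omits the kernel multiplier rather than multiplying by a dummy 1.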

The demo program concludes by computing the classification accuracy of the trained KLR model on the training and test data:

```
double accTrain = Accuracy(trainData, trainData,
  alphas, sigma, false);
Console.WriteLine("accuracy = " +
  accTrain.ToString("F4") + "\n");
double accTest = Accuracy(testData, trainData,
  alphas, sigma, true);  // verbose
```

The Boolean argument passed to method Accuracy indicates whether to compute in verbose mode (with diagnostic messages) or silent mode.

## Wrapping Up

Kernel logistic regression isn’t used very often, at least among my colleagues. Its major advantage is simplicity. The major disadvantage of KLR is that it doesn't scale well to large data sets because you either have to precompute all item-to-item kernel similarity values and save them, or you must keep all training data and then compute all similarity values on the fly for every prediction.

KLR is designed for binary classification. It’s possible to extend KLR to handle classification problems with three or more class values, but in my opinion, there are better alternatives to use, in particular a single hidden layer feed-forward neural network. KLR has some similarities to the K nearest neighbors (K-NN) classification algorithm, and also to support vector machine (SVM) classification.

**Dr. James McCaffrey** *works for Microsoft Research in Redmond, Wash. He has worked on several Microsoft products, including Internet Explorer and Bing. Dr. McCaffrey can be reached at jamccaff@microsoft.com.*

Thanks to the following Microsoft technical experts who reviewed this article: Chris Lee and Adith Swaminathan
