R Programming Language - Introduction to R for C# Programmers
By James McCaffrey | July 2015 | Get the Code: C# VB
The R language is used by data scientists and programmers for statistical computing. In part because of the increasing amounts of data collected by software systems, and the need to analyze that data, R is one of the fastest-growing technologies among my colleagues who use C#. A familiarity with R can be a valuable addition to your technical skill set.
The R language is a GNU project and is free software. R was derived from a language called S (for “statistics”), which was created at Bell Laboratories in the 1970s. There are many excellent online tutorials for R, but most of those tutorials assume you’re a university student studying statistics. This article is aimed at helping C# programmers get up to speed with R as quickly as possible.
The best way to see where this article is headed is to take a look at the example R session in Figure 1. The example session has two unrelated topics. The first few set of commands show what’s called a chi-square test (also called the chi-squared test) for a uniform distribution. The second set of commands shows an example of linear regression, which in my opinion is the Hello World technique of statistical computing.
Figure 1 An Example R Session
The R Web site is located at r-project.org. The site has links to several mirror sites where you can download and install R. The install is a simple self-extracting executable. R is officially supported on Windows XP and later, and also on most common non-Windows platforms. I’ve installed R on Windows 7 and Windows 8 machines without any issues. By default the installation process gives you both 32-bit and 64-bit versions.
This article assumes you have at least intermediate C# programming skills (so you can understand the explanations of the similarities and differences between C# and R), but doesn’t assume you know anything about R. It presents a C# demo program in its entirety, which is also available from the download file that accompanies this article.
The Chi-Square Test Using R
Looking at Figure 1, the first thing to notice is that using R is quite a bit different from using C#. Although it’s possible to write R scripts, R is most often used in interactive mode in a command shell. The first R example is an analysis to see if a normal six-sided die is fair or not. When rolled many times, a fair die would be expected to give approximately the same count for each of the six possible results.
The R prompt is indicated by the (>) token in the shell. The first statement typed in Figure 1 begins with the (#) character, which is the R token to indicate a comment.
The first actual command in the example session is:
This creates a vector named observed using the c (for concatenate) function. A vector holds objects with the same data type. A roughly equivalent C# statement would be:
The first value, 20, is the number of times a one-spot occurred, 28 is the number of times a two-spot occurred and so on. The sum of the six counts is 20 + 28 + . . + 36 = 150. You’d expect a fair die to have about 150/6 = 25 counts for each possible outcome. But the number of observed three-spots (12) looks suspiciously low.
After creating the vector, a chi-square test is performed using the chisq.test function:
In R, the dot (.) character is often used rather than the underscore (_) character to create variable and function names that are easier to read. The result of calling the chisq.test function is quite a bit of text:
Chi-squared test for given probabilities
In C# terms, most R functions return a data structure that can be ignored, but also contain many Console.WriteLine equivalent statements that expose output. Notice that it’s up to you to decipher the meaning of R output. Later in this article, I’ll show you how to create the equivalent chi-square test using raw (no libraries) C# code.
In this example, the “X-squared” value of 15.28 is the calculated chi-squared statistic (the Greek letter chi resembles uppercase X). A value of 0.0 indicates that the observed values are exactly what you’d expect if the die was fair. Larger values of chi-squared indicate an increasing likelihood that the observed counts could not have happened by chance if the die was fair. The df value of 5 is the “degrees of freedom,” which is one less than the number of observed values. The df is not too important for this analysis.
The p-value of 0.009231 is the probability that the observed counts could have occurred by chance if the die was fair. Because the p-value is so small, less than 1 percent, you’d conclude that the observed values are very unlikely to have occurred by chance and, therefore, there’s statistical evidence that the die in question is most likely biased.
Linear Regression Analysis Using R
The second set of statements in Figure 1 shows an example of linear regression. Linear regression is a statistical technique used to describe the relationship between a numeric variable (called the dependent variable in statistics) and one or more explanatory variables (called the independent variables) that can be either numeric or categorical. When there’s just one independent explanatory/predictor variable, the technique is called simple linear regression. When there are two or more independent variables, as in the demo example, the technique is called multiple linear regression.
Before doing the linear regression analysis, I created an eight-item, comma-delimited text file named DummyData.txt in directory C:\IntroToR with this content:
Color,Length,Width,Rate blue, 5.4, 1.8, 0.9 blue, 4.8, 1.5, 0.7 blue, 4.9, 1.6, 0.8 pink, 5.0, 1.9, 0.4 pink, 5.2, 1.5, 0.3 pink, 4.7, 1.9, 0.4 teal, 3.7, 2.2, 1.4 teal, 4.2, 1.9, 1.2
This file is supposed to represent flower data with the color of the flower, the length and width of the petal, and the growth rate. The idea is to predict Rate values (in the last column) from the Color, Length and Width values. After a comment statement, the first three R commands in the linear regression analysis are:
> setwd("C:\\IntroToR") > data <- read.table("DummyData.txt", header=TRUE, sep=",") > print(data)
The first command sets the working directory so I wouldn’t have to fully qualify the path to the source data file. Instead of using the (\\) token as is common with C#, I could have used (/) as is common on non-Windows platforms.
The second command loads the data into memory in a table object named data. Notice that R uses named parameters. The header parameter tells if the first line is header information (TRUE, or T in shortened form) or not (FALSE or F). R is case-sensitive and you can usually use either the (<-) operator to assign values, or the (=) operator. The choice is mostly a matter of personal preference. I typically use (<-) for object assignment and (=) for parameter value assignment.
The sep (separator) parameter indicates how values on each line are separated. For example, (\t) would indicate tab-delimited values, and (" ") would indicate space-delimited values.
The print function displays the data table in memory. The print function has many optional parameters. Notice that the output in Figure 1 displays data item indices starting at 1. For array, matrix and object indices, R is a 1-based language, rather than 0-based like the C# language.
The linear regression analysis is performed with these two R commands:
You can interpret the first command as, “Store into an object named model the result of the lm (linear model) function analysis where the dependent variable to predict is the Rate column in the table object (data$Rate), and the independent predictor variables are Color, Length and Width.” The second command means, “Display just the basic results of the analysis stored in the object named model.”
The lm function generates a large amount of information. Suppose you wanted to predict the Rate value when the input values are Color = pink, Length = 5.0 and Width = 1.9. (Notice that this corresponds to data item [4], which has an actual Rate value of 0.4.) To make a prediction you’d use the values in the Estimate column:
Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -0.14758 0.48286 -0.306 0.77986 data$Colorpink -0.49083 0.04507 -10.891 0.00166 ** data$Colorteal 0.35672 0.09990 3.571 0.03754 * data$Length 0.04159 0.07876 0.528 0.63406 data$Width 0.45200 0.11973 3.775 0.03255 *
If X represents the independent variable values, and if Y represents the predicted Rate, then:
X = (blue = NA, pink = 1, teal = 0, Length = 5.0, Width = 1.9) Y = -0.14758 + (-0.49083)(1) + (0.35672)(0) + (0.04159)(5.0) + (0.45200)(1.9) = -0.14758 + (-0.49083) + (0) + (0.20795) + (0.85880) = 0.42834
Notice the predicted Rate, 0.43, is quite close to the actual rate, 0.40.
In words, to make a prediction using the model, you calculate a linear sum of products of the Estimate values multiplied by their corresponding X values. The Intercept value is a constant not associated with any variable. When you have categorical explanatory variables, one of the values is dropped (blue in this case).
The information at the bottom of the output display indicates how well the independent variables, Color, Length and Width, explain the dependent variable, Rate:
Residual standard error: 0.05179 on 3 degrees of freedom Multiple R-squared: 0.9927, Adjusted R-squared: 0.9829 F-statistic: 101.6 on 4 and 3 DF, p-value: 0.00156
The multiple R-squared value (0.9927) is the percentage of variation in the dependent variable explained by the linear combination of the independent variables. Put slightly differently, R-squared is a value between 0 and 1 where higher values mean a better predictive model. Here, the R-squared value is extremely high, indicating Color, Length and Width can predict Rate very accurately. The F-statistic, adjusted R-squared value, and p-value are other measures of model fit.
One of the points of this example is that when programming using R, your biggest challenge by far is understanding the statistics behind the language functions. Most people learn R in an incremental way, by adding knowledge of one technique at a time, as needed to answer some specific question. A C# analogy would be learning about the various collection objects in the Collections.Generic namespace, such as the Hashtable and Queue classes. Most developers learn about one data structure at a time rather than trying to memorize information about all the classes before using any of them in a project.
Another Chi-Square Test
The type of chi-square test in Figure 1 is often called a test for uniform distribution because it tests if the observed data all have equal counts; that is, if the data is distributed uniformly. There are several other kinds of chi-square tests, including one called a chi-square test of independence.
Suppose you have a group of 100 people and you’re interested if sex (male, female) is independent from political party affiliation (Democrat, Republican, Other). Imagine your data is in a contingency matrix as shown in Figure 2. It appears that perhaps males are more likely to be Republican than females are.
Figure 2 Contingency Matrix
Dem | Rep | Other | ||
Male | 15 | 25 | 10 | 50 |
Female | 30 | 15 | 5 | 50 |
45 | 40 | 15 | 100 |
To use R to test if the two factors, sex and affiliation, are statistically independent, you’d first place the data into a numeric matrix with this command:
Notice that in R, matrix data is stored by columns (top to bottom, left to right) rather than rows (left to right, top to bottom) as in C#.
The R chi-square test command is:
The p-value result of the test is 0.01022, which indicates that at the 5 percent significance level, the two factors are not independent. In other words, there’s a statistical relationship between sex and affiliation.
Notice in the first chi-square dice example, the input parameter is a vector, but in the second sex-affiliation example, the input is a matrix. The chisq.test function has a total of seven parameters, one required (a vector or a matrix), followed by six optional named parameters.
Like many C# methods in the Microsoft .NET Framework namespaces, most R functions are heavily overloaded. In C#, overloading is usually implemented using multiple methods with the same name but with different parameters. The use of generics is also a form of overloading. In R, overloading is implemented using a single function with many optional named parameters.
Graphs with R
As a C# programmer, when I want to make a graph of some program output data, I typically run my program, copy the output data, Ctrl+V paste that data into Notepad to remove weird control characters, copy that data, paste it into Excel, and then create a graph using Excel. This is kind of hacky, but it works fine in most situations.
One of the strengths of the R system is its native ability to generate graphs. Take a look at an example in Figure 3. This is a type of graph that isn’t possible in Excel without an add-in of some sort.
Figure 3 A 3D Graph Using R
In addition to the shell program shown in Figure 1, R also has a semi-GUI interface, RGui.exe, for use when you want to make graphs.
The graph in Figure 3 shows the function z = f(x,y) = x * e^(-(x^2 + y^2)). The first three R commands to generate the graph are:
The rm function deletes an object from the current workspace in memory. The command used is a magic R incantation to delete all objects. The second and third commands create vectors of 25 values, evenly spaced, from -2 to +2 inclusive. I could’ve used the c function instead.
The next two commands are:
The first command shows how to define a function in R using the function keyword. The built-in R function named outer creates a matrix of values using vectors x and y, and a function definition f. The result is a 25 x 25 matrix where the value in each cell is the value of function f that corresponds to x and y.
The next two commands are:
The nrow and ncol functions return the number of rows or the number of columns in their matrix argument. Here, both values would be 25.
The next R command uses the colorRampPalette function to create a custom color gradient palette to paint the graph:
> jet.colors <- colorRampPalette(c("midnightblue", "blue", + "cyan", "green", "yellow", "orange", "red", "darkred"))
When typing a long line in R, if you hit the <Enter> key, the system will jump the cursor down to the next line and place a + character as a prompt to indicate your command is not yet complete. Next:
The result of these two commands is a vector named color that holds 64 different color values ranging from a very dark blue, through green and yellow, to a very dark red. Next:
> zfacet <- z[-1,-1] + z[-1,-ncz] + z[-nrz,-1] + z[-nrz,-ncz] > facetcol <- cut(zfacet,nbcol)
If you look closely at the graph in Figure 3, you’ll see the surface is made of 25 x 25 = 625 small squares, or facets. The two preceding commands create a 25 x 25 matrix where the value in each cell is one of the 64 colors to use for the corresponding surface facet.
Finally, the 3D graph is displayed using the persp (perspective graph) function:
> persp(x,y,z, col=color[facetcol], phi=20, theta=-35, + ticktype="detailed", d=5, r=1, shade=0.1, expand=0.7)
The persp function has a lot of optional, named parameters. The col parameter is the color, or colors, to use. Parameters phi and theta set the viewing angle (left and right, and up and down) of the graph. Parameter ticktype controls how values on the x, y and z axes are displayed. Parameters r and d control the perceived eye distance to the graph, and the perceived 3D effect. The parameter named shade controls simulated shading from a virtual light source. The parameter named expand controls the ratio of the height and width of the graph. The persp function has many more parameters, but the ones used here will be sufficient in most situations.
This example points out that R has very powerful and flexible native graphing capabilities, but they tend to be relatively low level and require a fair amount of effort.
The Chi-Square Test in C#
To gain an understanding of the similarities and differences between R language analyses and C# language programming, it’s useful to examine a C# implementation of a chi-square test. In addition, the C# code can make a nice addition to your personal code library. Take a look at the C# demo program in Figure 4.
Figure 4 Chi-Square Test Using C#
The demo program approximates the R language chi-square dice test shown in Figure 1. Notice the output values of the C# demo are exactly the same as those in the R session.
To create the demo program, I launched Visual Studio and created a new C# console application project named ChiSquare. After the template code loaded into the editor, in the Solution Explorer window I right-clicked on file Program.cs and renamed it to ChiSquareProgram.cs and allowed Visual Studio to automatically rename class Program to ChiSquareProgram.
The complete C# demo program is listed in Figure 5. You’ll notice that the source code is longer than you might expect. Implementing statistical programming using raw C# isn’t overwhelmingly difficult, but the code does tend to be long.
using System; namespace ChiSquare { class ChiSquareProgram { static void Main(string[] args) { try { Console.WriteLine("\nBegin Chi-square test using C# demo\n"); Console.WriteLine( "Goal is to see if one die from a set of dice is biased or not\n"); int[] observed = new int[] { 20, 28, 12, 32, 22, 36 }; Console.WriteLine("\nStarting chi-square test"); double p = ChiSquareTest(observed); Console.WriteLine("\nChi-square test complete"); double crit = 0.05; if (p < crit) { Console.WriteLine("\nBecause p-value is below critical value of " + crit.ToString("F2")); Console.WriteLine("the null hypothsis is rejected and we conclude"); Console.WriteLine("the data is unlikely to have happened by chance."); } else { Console.WriteLine("\nBecause p-value is not below critical value of " + crit.ToString("F2")); Console.WriteLine( "the null hypothsis is accepted (not rejected) and we conclude"); Console.WriteLine("the observed data could have happened by chance."); } Console.WriteLine("\nEnd\n"); Console.ReadLine(); } catch (Exception ex) { Console.WriteLine(ex.Message); Console.ReadLine(); } } // Main static void ShowVector(int[] v) { for (int i = 0; i < v.Length; ++i) Console.Write(v[i] + " "); Console.WriteLine(""); } static double ChiSquareTest(int[] observed) { Console.WriteLine("Observed frequencies are: "); ShowVector(observed); double x = ChiSquareStatistic(observed); Console.WriteLine("\nX-squared = " + x.ToString("F2")); int df = observed.Length - 1; Console.WriteLine("\ndf = " + df); double p = ChiSquareProb(x, df); Console.WriteLine("\np-value = " + p.ToString("F6")); return p; } static double ChiSquareStatistic(int[] observed) { double sumObs = 0.0; for (int i = 0; i < observed.Length; ++i) sumObs += observed[i]; double expected = (int)(sumObs / observed.Length); double result = 0.0; for (int i = 0; i < observed.Length; ++i) { result += ((observed[i] - expected) * (observed[i] - expected)) / expected; } return result; } public static double ChiSquareProb(double x, int df) { // x = a computed chi-square value. df = degrees of freedom. // output = prob. the x value occurred by chance. // So, for example, if result < 0.05 there is only a 5% chance // that the x value occurred by chance and, therefore, // we conclude that the actual data which produced x is // NOT the same as the expected data. // This function can be used to create a ChiSquareTest procedure. // ACM Algorithm 299 and update ACM TOMS June 1985. // Uses custom Exp() function below. if (x <= 0.0 || df < 1) throw new Exception("parameter x must be positive " + "and parameter df must be 1 or greater in ChiSquaredProb()"); double a = 0.0; // 299 variable names double y = 0.0; double s = 0.0; double z = 0.0; double e = 0.0; double c; bool even; // Is df even? a = 0.5 * x; if (df % 2 == 0) even = true; else even = false; if (df > 1) y = Exp(-a); // ACM update remark (4) if (even == true) s = y; else s = 2.0 * Gauss(-Math.Sqrt(x)); if (df > 2) { x = 0.5 * (df - 1.0); if (even == true) z = 1.0; else z = 0.5; if (a > 40.0) // ACM remark (5) { if (even == true) e = 0.0; else e = 0.5723649429247000870717135; c = Math.Log(a); // log base e while (z <= x) { e = Math.Log(z) + e; s = s + Exp(c * z - a - e); // ACM update remark (6) z = z + 1.0; } return s; } // a > 40.0 else { if (even == true) e = 1.0; else e = 0.5641895835477562869480795 / Math.Sqrt(a); c = 0.0; while (z <= x) { e = e * (a / z); // ACM update remark (7) c = c + e; z = z + 1.0; } return c * y + s; } } // df > 2 else { return s; } } // ChiSquare() private static double Exp(double x) // ACM update remark (3) { if (x < -40.0) // ACM update remark (8) return 0.0; else return Math.Exp(x); } public static double Gauss(double z) { // input = z-value (-inf to +inf) // output = p under Normal curve from -inf to z // e.g., if z = 0.0, function returns 0.5000 // ACM Algorithm #209 double y; // 209 scratch variable double p; // result. called 'z' in 209 double w; // 209 scratch variable if (z == 0.0) p = 0.0; else { y = Math.Abs(z) / 2; if (y >= 3.0) { p = 1.0; } else if (y < 1.0) { w = y * y; p = ((((((((0.000124818987 * w - 0.001075204047) * w + 0.005198775019) * w - 0.019198292004) * w + 0.059054035642) * w - 0.151968751364) * w + 0.319152932694) * w - 0.531923007300) * w + 0.797884560593) * y * 2.0; } else { y = y - 2.0; p = (((((((((((((-0.000045255659 * y + 0.000152529290) * y - 0.000019538132) * y - 0.000676904986) * y + 0.001390604284) * y - 0.000794620820) * y - 0.002034254874) * y + 0.006549791214) * y - 0.010557625006) * y + 0.011630447319) * y - 0.009279453341) * y + 0.005353579108) * y - 0.002141268741) * y + 0.000535310849) * y + 0.999936657524; } } if (z > 0.0) return (p + 1.0) / 2; else return (1.0 - p) / 2; } // Gauss() } // Program class } // ns
The Main method consists mostly of WriteLine statements. The essential calling code is:
int[] observed = new int[] { 20, 28, 12, 32, 22, 36 }; double p = ChiSquareTest(observed); double crit = 0.05; if (p < crit) { // Messages } else { // Messages }
The C# chi-square test accepts an array of observed values, computes the chi-square statistic value and displays it; computes and displays the degrees of freedom; and computes and returns the p-value.
Method ChiSquareTest calls three helper methods:
ShowVector(observed); double x = ChiSquareStatistic(observed); int df = observed.Length - 1; double p = ChiSquareProb(x, df);
Method ShowVector displays the input vector, similar to the approach used by the R chisq.test function of echoing input parameter values. Method ChiSquareStatistic returns the calculated chi-square (“X-squared” in the R output), and method ChiSquareProb uses the return from ChiSquareStatistic to compute a probability (the “p-value” in the R output).
Method ChiSquareStatistic is a simple test for uniform distribution. The statistics equation for the chi-squared statistic is:
So, before calculating chi-squared, you need to compute the expected values associated with the observed values. As explained earlier, to do this you add up the count value in the observed array (150 in the demo), then divide that sum by the number of values in the array (6 in the demo), to give the expected value (25) for all cells.
The most difficult part of writing a chi-square test is the calculation of the p-value. If you scan the definition of method ChiSquareProb in Figure 5, you’ll quickly realize it requires deep, specialized knowledge. Furthermore, the method calls a helper method named Gauss that’s equally complex.
Coding complicated numerical functions is actually quite easy because most functions have been solved for decades. Two of my most frequently used resources are the Association for Computing Machinery (ACM) Collected Algorithms repository, and the “Handbook of Mathematical Functions” by Abramowitz and Stegun (Dover Publications, 1965)—so well-known it’s often just called “A&S.” Both references are freely available in several places on the Web.
For example, method ChiSquareProb implements ACM Algorithm #299 and the helper method, Gauss, implements ACM Algorithm #209. An alternative to ACM #209 is A&S equation #7.1.26 plus a simple wrapping statement.
Many of the functions in the ACM Algorithms and A&S are implemented in existing C# libraries; however, most of those libraries are quite large. When possible, I prefer to code from scratch, using just the methods I need and avoiding external dependencies.
A Few Comments
This article shows only a tiny fraction of the R language but it should be enough to get you up and running. C# programmers tend to come into contact with R in two ways. First, you might use R to perform your own statistical analyses. As this article shows, using R is fairly easy, assuming you grasp the underlying statistics. Second, you might need to interact with someone who uses R as their primary language, and thus need to understand the R environment and terminology.
There are a few R language integrated development tools you can use. One well-known project is called RStudio. I generally prefer using the native R console. Several Microsoft products work with R. For example, the Microsoft Azure Machine Learning service can use R language scripts and I suspect support for R will be added to Visual Studio at some point.
Dr. James McCaffrey works for Microsoft Research in Redmond, Wash. He has worked on several Microsoft products including Internet Explorer and Bing. Dr. McCaffrey can be reached at jammc@microsoft.com.
Thanks to the following Microsoft Research technical expert for reviewing this article: Dan Liebling and Robin Reynolds-Haertle
Robin Reynolds-Haertle writes documentation for cross-platform development with Visual Studio. Her current interests are .NET Core, C#, Swift, and R
Receive the MSDN Flash e-mail newsletter every other week, with news and information personalized to your interests and areas of focus.