ScaleR tutorial using airplane flight data (Microsoft R)
This tutorial builds on what you learned in the quick start introduction to ScaleR by adding more steps and functions that broaden your experience with the functions. As before, you'll work with airline sample data to complete the steps.
ScaleR functions can be loosely categorized as data-oriented, or platform-specific for tapping into capabilities of a particular database system, operating system, or distributed file system. This tutorial focuses on data-oriented functions to minimize the dependencies.
To complete this tutorial, you will need a development tool such as R Tools for Visual Studio and ScaleR, which you can get through Microsoft R Client, R Server, or R in Azure.
What you will learn in this tutorial
Tasks you'll perform include the following:
Convert text data to the .xdf data file format.
Summarize your data.
Fit a linear model to the data.
Create a new .xdf file, sub-setting the original data set and performing data transformations.
Fit a logistic regression model.
Compute predicted values.
Get the airline data set
AirlineDemoSmall.csv is the dataset used in this tutorial. It is a subset of a data set containing information on flight arrival and departure details for all commercial flights within the USA, from October 1987 to April 2008. The AirlineDemoSmall.csv file contains three columns of data: two numeric columns, ArrDelay and CRSDepTime, and a column of strings, DayOfWeek. The file contains 600,000 rows of data in addition to a first row with variable names.
This dataset is built into ScaleR so there is nothing to download.
Import text data into .xdf
ScaleR provides a data file format (.xdf) designed to be very efficient for reading arbitrary rows and columns. To convert the AirlineDemoSmall.csv text file into the .xdf data format, use the function rxImport. Using this function, you can convert the string column, DayOfWeek, to a factor variable.
Enter the following command to create a variable named
sampleDataDirrepresenting the input file in the sample data directory. The location of this directory is stored as an option. It is initialized to the location of the SampleData directory included in the RevoScaleR package. You can use the rxGetOption function to retrieve this location and store it as
sampleDataDir <- rxGetOption("sampleDataDir")
The rxImport function uses the current working directory to store the imported data. To see this location, use the R function getwd(). The output will be something like
C:/Users/<your-name>/documents/visual studio 2015/projects/<folder-name>/<file-name>.
Enter the following script to set the input file and use rxImport to import the text file into an .xdf file named ADS in your current working directory.
inputFile <- file.path(sampleDataDir, "AirlineDemoSmall.csv") airDS <- rxImport(inData = inputFile, outFile = "ADS.xdf", missingValueString = "M", stringsAsFactors = TRUE)
The input .csv file uses the letter M to represent missing values, rather than the default NA, so we specify this with the missingValueString argument.
Setting stringsAsFactors to TRUE will set the levels specification for DayOfWeek to the unique strings found in that variable, listed in the order in which they are encountered. Since this order is arbitrary, and can easily vary from data set to data set, it is preferred to explicitly specify the levels in the desired order using the colInfo argument.
Let's rerun the import operation with a fixed order for days of the week. When rerunning an import command, append the overwrite argument to replace the file previously imported:
colInfo <- list(DayOfWeek = list(type = "factor", levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))) airDS <- rxImport(inData = inputFile, outFile = "ADS.xdf", missingValueString = "M", colInfo = colInfo, overwrite = TRUE)
Notice that once we supply the colInfo argument, we no longer need to specify stringsAsFactors; DayOfWeek is our only factor variable.
Examining .xdf files
Using the small airDS object representing the ADS.xdf file, you can apply standard R methods to get basic information about the data set:
nrow(airDS) ncol(airDS) head(airDS) > nrow(airDS)  6e+05 > ncol(airDS)  3 > head(airDS) ArrDelay CRSDepTime DayOfWeek 1 6 9.666666 Monday 2 -8 19.916666 Monday 3 -2 13.750000 Monday 4 1 11.750000 Monday 5 -2 6.416667 Monday 6 -14 13.833333 Monday
The rxGetVarInfo function provides additional variable information:
rxGetVarInfo(airDS) Var 1: ArrDelay, Type: integer, Low/High: (-86, 1490) Var 2: CRSDepTime, Type: numeric, Storage: float32, Low/High: (0.0167, 23.9833) Var 3: DayOfWeek 7 factor levels: Monday Tuesday Wednesday Thursday Friday Saturday Sunday
You can also read an arbitrary chunk of the data set into a data frame for further examination. For example, read 10 rows into a data frame starting with the 100,000th row:
myData <- rxReadXdf(airDS, numRows=10, startRow=100000) myData
This code should generate the following output:
ArrDelay CRSDepTime DayOfWeek 1 -2 11.416667 Saturday 2 39 9.916667 Friday 3 NA 10.033334 Monday 4 1 17.000000 Friday 5 -17 9.983334 Wednesday 6 8 21.250000 Saturday 7 -9 6.250000 Friday 8 -11 15.000000 Friday 9 4 20.916666 Sunday 10 -8 6.500000 Thursday
You can now look to see what the factor levels are for the DayOfWeek variable:
This command should generate the following output:
 "Monday" "Tuesday" "Wednesday" "Thursday" "Friday" "Saturday"  "Sunday"
Use the rxSummary function to obtain descriptive statistics for your .xdf data file. The rxSummary function takes a formula as its first argument, and the name of the data set as the second. To get summary statistics for all of the data in your data file, you can alternatively use the summary method for the airDS object.
adsSummary <- rxSummary(~ArrDelay+CRSDepTime+DayOfWeek, data = airDS) adsSummary or adsSummary <- summary( airDS ) adsSummary
Summary statistics will be computed on the variables in the formula. By default, rxSummary computes data summaries term-by-term, and missing values are omitted on a term-by-term basis. In earlier versions, summaries were computed on the complete table after observations with missing elements were omitted.
Call: rxSummary(formula = ~ArrDelay + CRSDepTime + DayOfWeek, data = airDS) Summary Statistics Results for: ~ArrDelay + CRSDepTime + DayOfWeek Data: airDS (RxXdfData Data Source) File name: C:\YourWorkingDir\ADS.xdf Number of valid observations: 6e+05 Name Mean StdDev Min Max ValidObs MissingObs ArrDelay 11.31794 40.688536 -86.000000 1490.00000 582628 17372 CRSDepTime 13.48227 4.697566 0.016667 23.98333 600000 0 Category Counts for DayOfWeek Number of categories: 7 Number of valid observations: 6e+05 Number of missing observations: 0 DayOfWeek Counts Monday 97975 Tuesday 77725 Wednesday 78875 Thursday 81304 Friday 82987 Saturday 86159 Sunday 94975
Notice that the summary information shows cell counts for categorical variables, and appropriately does not provide summary statistics such as Mean and StdDev. Also notice that the Call: line will show the actual call you entered or the call provided by summary, so will appear differently in different circumstances.
You can also compute summary information by one or more categories by using interactions of a numeric variable with a factor variable. For example, to compute summary statistics on Arrival Delay by Day of Week:
rxSummary(~ArrDelay:DayOfWeek, data = airDS)
The output shows the summary statistics for ArrDelay for each day of the week:
Call: rxSummary(formula = ~ArrDelay:DayOfWeek, data = airDS) Summary Statistics Results for: ~ArrDelay:DayOfWeek File name: C:\YourWorkingDir\ADS.xdf Number of valid observations: 6e+05 Name Mean StdDev Min Max ValidObs MissingObs ArrDelay:DayOfWeek 11.31794 40.68854 -86 1490 582628 17372 Statistics by category (7 categories): Category DayOfWeek Means StdDev Min Max ValidObs ArrDelay for DayOfWeek=Monday Monday 12.025604 40.02463 -76 1017 95298 ArrDelay for DayOfWeek=Tuesday Tuesday 11.293808 43.66269 -70 1143 74011 ArrDelay for DayOfWeek=Wednesday Wednesday 10.156539 39.58803 -81 1166 76786 ArrDelay for DayOfWeek=Thursday Thursday 8.658007 36.74724 -58 1053 79145 ArrDelay for DayOfWeek=Friday Friday 14.804335 41.79260 -78 1490 80142 ArrDelay for DayOfWeek=Saturday Saturday 11.875326 45.24540 -73 1370 83851 ArrDelay for DayOfWeek=Sunday Sunday 10.331806 37.33348 -86 1202 93395
To get a better feel for the data, we can draw histograms for each variable. Run each rxHistogram function one at a time. In RTVS, the histogram is rendered in the R Plot pane.
options("device.ask.default" = T) rxHistogram(~ArrDelay, data = airDS) rxHistogram(~CRSDepTime, data = airDS) rxHistogram(~DayOfWeek, data = airDS)
We can also easily extract a subsample of the data file into a data frame in memory. For example, we can look at just the flights that were between 4 and 5 hours late:
myData <- rxDataStep(inData = airDS, rowSelection = ArrDelay > 240 & ArrDelay <= 300, varsToKeep = c("ArrDelay", "DayOfWeek")) rxHistogram(~ArrDelay, data = myData)
Fitting a linear model
This section of the tutorial demonstrates several approaches for fitting a model to the sample data.
Fitting a Simple Model
Use the rxLinMod function to fit a linear model using your ADS.xdf file. Use a single dependent variable, the factor DayOfWeek:
arrDelayLm1 <- rxLinMod(ArrDelay ~ DayOfWeek, data = airDS) summary(arrDelayLm1)
The resulting output is:
Call: rxLinMod(formula = ArrDelay ~ DayOfWeek, data = airDS) Linear Regression Results for: ArrDelay ~ DayOfWeek File name: C:\YourWorkingDir\ADS.xdf Dependent variable(s): ArrDelay Total independent variables: 8 (Including number dropped: 1) Number of valid observations: 582628 Number of missing observations: 17372 Coefficients: (1 not defined because of singularities) Estimate Std. Error t value Pr(>|t|) (Intercept) 10.3318 0.1330 77.673 2.22e-16 *** DayOfWeek=Monday 1.6938 0.1872 9.049 2.22e-16 *** DayOfWeek=Tuesday 0.9620 0.2001 4.809 1.52e-06 *** DayOfWeek=Wednesday -0.1753 0.1980 -0.885 0.376 DayOfWeek=Thursday -1.6738 0.1964 -8.522 2.22e-16 *** DayOfWeek=Friday 4.4725 0.1957 22.850 2.22e-16 *** DayOfWeek=Saturday 1.5435 0.1934 7.981 2.22e-16 *** DayOfWeek=Sunday Dropped Dropped Dropped Dropped --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 40.65 on 582621 degrees of freedom Multiple R-squared: 0.001869 Adjusted R-squared: 0.001858 F-statistic: 181.8 on 6 and 582621 DF, p-value: < 2.2e-16 Condition number: 10.5595
Using the cube argument and plotting results
If you are using categorical data in your regression, you can use the cube argument. If cube is set to TRUE and the first term of the regression is categorical (a factor or an interaction of factors), the regression is done using a partitioned inverse, which may be faster and use less memory than standard regression computation. Averages/counts of the category "bins" can be computed in addition to standard regression results. The intercept will also be automatically dropped, so that each category level will have an estimated coefficient (unlike the previous example). If cube is set to TRUE and the first term is not categorical, you get an error message.
Re-estimate the linear model, this time setting cube to TRUE. Then print a summary of the results:
arrDelayLm2 <- rxLinMod(ArrDelay ~ DayOfWeek, data = airDS, cube = TRUE) summary(arrDelayLm2)
You should see the following output:
Call: rxLinMod(formula = ArrDelay ~ DayOfWeek, data = airDS, cube = TRUE) Cube Linear Regression Results for: ArrDelay ~ DayOfWeek File name: C:\YourWorkingDir\ADS.xdf Dependent variable(s): ArrDelay Total independent variables: 7 Number of valid observations: 582628 Number of missing observations: 17372 Coefficients: Estimate Std. Error t value Pr(>|t|) | Counts DayOfWeek=Monday 12.0256 0.1317 91.32 2.22e-16 *** | 95298 DayOfWeek=Tuesday 11.2938 0.1494 75.58 2.22e-16 *** | 74011 DayOfWeek=Wednesday 10.1565 0.1467 69.23 2.22e-16 *** | 76786 DayOfWeek=Thursday 8.6580 0.1445 59.92 2.22e-16 *** | 79145 DayOfWeek=Friday 14.8043 0.1436 103.10 2.22e-16 *** | 80142 DayOfWeek=Saturday 11.8753 0.1404 84.59 2.22e-16 *** | 83851 DayOfWeek=Sunday 10.3318 0.1330 77.67 2.22e-16 *** | 93395 --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 40.65 on 582621 degrees of freedom Multiple R-squared: 0.001869 (as if intercept included) Adjusted R-squared: 0.001858 F-statistic: 181.8 on 6 and 582621 DF, p-value: < 2.2e-16 Condition number: 1
The coefficient estimates are the mean arrival delay for each day of the week.
When the cube argument is set to TRUE and the model has only one term as an independent variable, the "countDF" component of the results object also contains category means. This data frame can be extracted using the rxResultsDF function, and is particularly convenient for plotting.
countsDF <- rxResultsDF(arrDelayLm2, type = "counts") countsDF
You should see the following output:
DayOfWeek ArrDelay Counts 1 Monday 12.025604 95298 2 Tuesday 11.293808 74011 3 Wednesday 10.156539 76786 4 Thursday 8.658007 79145 5 Friday 14.804335 80142 6 Saturday 11.875326 83851 7 Sunday 10.331806 93395
Now plot the average arrival delay for each day of the week:
rxLinePlot(ArrDelay~DayOfWeek, data = countsDF, main = "Average Arrival Delay by Day of Week")
The following plot is generated, showing the lowest average arrival delay on Thursdays:
A Linear Model with Multiple Independent Variables
We can run a more complex model examining the dependency of arrival delay on both day of week and the departure time. We’ll estimate the model using the F expression to have the CRSDepTime variable interpreted as a categorical or factor variable.
F() is not an R function, although it looks like one when used inside ScaleR formulas. An F expression tells ScaleR to create a factor by creating one level for each integer in the range (floor(min(x)), floor(max(x))) and binning all the observations into the resulting set of levels. You can look up the rxFormulapackage help for more information.
By interacting DayOfWeek with F(CRSDepTime) we are creating a dummy variable for every combination of departure hour and day of the week.
arrDelayLm3 <- rxLinMod(ArrDelay ~ DayOfWeek:F(CRSDepTime), data = airDS, cube = TRUE) arrDelayDT <- rxResultsDF(arrDelayLm3, type = "counts") head(arrDelayDT, 15)
The output shows the first fifteen rows of the counts data frame. The variable CRSDepTime gives the hour of the departure time and ArrDelay shows the average departure delay for that hour for that day of week. Counts gives the number of observations that contain that combination of day of week and departure hour.
DayOfWeek CRSDepTime ArrDelay Counts 1 Monday 0 7.4360902 133 2 Tuesday 0 7.0000000 107 3 Wednesday 0 3.7857143 98 4 Thursday 0 3.0097087 103 5 Friday 0 4.4752475 101 6 Saturday 0 4.5460993 141 7 Sunday 0 0.9243697 119 8 Monday 1 2.6590909 44 9 Tuesday 1 5.5428571 35 10 Wednesday 1 -5.9696970 33 11 Thursday 1 5.1714286 35 12 Friday 1 -4.9062500 32 13 Saturday 1 -4.2727273 44 14 Sunday 1 -1.7368421 38 15 Monday 2 -0.2000000 15
Now plot the results:
rxLinePlot( ArrDelay~CRSDepTime|DayOfWeek, data = arrDelayDT, title = "Average Arrival Delay by Day of Week by Departure Hour")
You should see the following plot:
Subset the data and compute a crosstab
Create a new data set containing a subset of rows and variables. This is convenient if you intend to do lots of analysis on a subset of a large data set. To do this, we use the rxDataStep function with the following arguments:
- outFile: the name of the new data set
- inData: the name of an .xdf file or an RxXdfData object representing the original data set you are subsetting
- varsToDrop: a character vector of variables to drop from the new data set, here CRSDepTime
- rowSelection: keep only the rows where the flight was more than 15 minutes late
The resulting call is as follows:
airLateDS <- rxDataStep(inData = airDS, outFile = "ADS1.xdf", varsToDrop = c("CRSDepTime"), rowSelection = ArrDelay > 15) ncol(airLateDS) nrow(airLateDS)
You will see that the new data set has only two variables and has dropped from 600,000 to 148,526 observations.
Compute a crosstab showing the mean arrival delay by day of week.
myTab <- rxCrossTabs(ArrDelay~DayOfWeek, data = airLateDS) summary(myTab, output = "means")
The results show that in this data set “late” flights are on average over 10 minutes later on Tuesdays than on Sundays:
Call: rxCrossTabs(formula = ArrDelay ~ DayOfWeek, data = airLateDS) Cross Tabulation Results for: ArrDelay ~ DayOfWeek File name: C:\YourWorkingDir\ADS1.xdf Dependent variable(s): ArrDelay Number of valid observations: 148526 Number of missing observations: 0 Statistic: means ArrDelay (means): means means % Monday 56.94491 13.96327 Tuesday 64.28248 15.76249 Wednesday 60.12724 14.74360 Thursday 55.07093 13.50376 Friday 56.11783 13.76047 Saturday 61.92247 15.18380 Sunday 53.35339 13.08261 Col Mean 57.96692
Create a new data set with variable transformations
Now use the rxDataStep function to create a new data set containing the variables in ADS.xdf plus additional variables created through transformations. Typically additional variables are created using the transforms argument. See the ScaleR User's Guide for information on doing more complex transformations using a transform function. Remember that all expressions used in transforms must be able to be processed on a chunk of data at a time.
In the example below, three new variables are created. The variable Late is a logical variable set to TRUE (or 1) if the flight was more than 15 minutes late in arriving. The variable DepHour is an integer variable indicating the hour of the departure. The variable Night is also a logical variable, set to TRUE (or 1) if the flight departed between 10 P.M. and 5 A.M.
The rxDataStep function will read the existing data set and perform the transformations chunk by chunk, and create a new data set.
airExtraDS <- rxDataStep(inData = airDS, outFile="ADS2.xdf", transforms=list( Late = ArrDelay > 15, DepHour = as.integer(CRSDepTime), Night = DepHour >= 20 | DepHour <= 5)) rxGetInfo(airExtraDS, getVarInfo=TRUE, numRows=5)
You should see the following information in your output:
File name: C:\YourWorkingDir\ADS2.xdf Number of observations: 6e+05 Number of variables: 6 Number of blocks: 2 Compression type: zlib Variable information: Var 1: ArrDelay, Type: integer, Low/High: (-86, 1490) Var 2: CRSDepTime, Type: numeric, Storage: float32, Low/High: (0.0167, 23.9833) Var 3: DayOfWeek 7 factor levels: Monday Tuesday Wednesday Thursday Friday Saturday Sunday Var 4: Late, Type: logical, Low/High: (0, 1) Var 5: DepHour, Type: integer, Low/High: (0, 23) Var 6: Night, Type: logical, Low/High: (0, 1) Data (5 rows starting with row 1): ArrDelay CRSDepTime DayOfWeek Late DepHour Night 1 6 9.666666 Monday FALSE 9 FALSE 2 -8 19.916666 Monday FALSE 19 FALSE 3 -2 13.750000 Monday FALSE 13 FALSE 4 1 11.750000 Monday FALSE 11 FALSE 5 -2 6.416667 Monday FALSE 6 FALSE
Run a logistic regression on the new data
The function rxLogit takes a binary dependent variable. Here we will use the variable Late, which is TRUE (or 1) if the plane was more than 15 minutes late arriving. For dependent variables we will use the DepHour, the departure hour, and Night, indicating whether or not the flight departed at night.
logitObj <- rxLogit(Late~DepHour + Night, data = airExtraDS) summary(logitObj)
You should see the following results:
Call: rxLogit(formula = Late ~ DepHour + Night, data = airExtraDS) Logistic Regression Results for: Late ~ DepHour + Night File name: C:\Users\sue\SVN\bigAnalytics\trunk\revoAnalytics\ADS2.xdf Dependent variable(s): Late Total independent variables: 3 Number of valid observations: 582628 Number of missing observations: 17372 -2*LogLikelihood: 649607.8613 (Residual deviance on 582625 degrees of freedom) Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -2.0990076 0.0104460 -200.94 2.22e-16 *** DepHour 0.0790215 0.0007671 103.01 2.22e-16 *** Night -0.3027030 0.0109914 -27.54 2.22e-16 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Condition number of final variance-covariance matrix: 3.0178 Number of iterations: 4
Computing predicted values
You can use the object returned from the call to rxLogit in the previous section to compute predicted values. In addition to the model object, we specify the data set on which to compute the predicted values and the data set in which to put the newly computed predicted values. In the call below, we use the same dataset for both. In general, the data set on which to compute the predicted values must be similar to the original data set used to estimate the model in the following ways; it should have the same variable names and types, and factor variables must have the same levels in the same order.
predictDS <- rxPredict(modelObject = logitObj, data = airExtraDS, outData = airExtraDS) rxGetInfo(predictDS, getVarInfo=TRUE, numRows=5)
You should see the following information:
File name: C:\YourWorkingDir\ADS2.xdf Number of observations: 6e+05 Number of variables: 7 Number of blocks: 2 Compression type: zlib Variable information: Var 1: ArrDelay, Type: integer, Low/High: (-86, 1490) Var 2: CRSDepTime, Type: numeric, Storage: float32, Low/High: (0.0167, 23.9833) Var 3: DayOfWeek 7 factor levels: Monday Tuesday Wednesday Thursday Friday Saturday Sunday Var 4: Late, Type: logical, Low/High: (0, 1) Var 5: DepHour, Type: integer, Low/High: (0, 23) Var 6: Night, Type: logical, Low/High: (0, 1) Var 7: Late_Pred, Type: numeric, Low/High: (0.0830, 0.3580) Data (5 rows starting with row 1): ArrDelay CRSDepTime DayOfWeek Late DepHour Night Late_Pred 1 6 9.666666 Monday FALSE 9 FALSE 0.1997569 2 -8 19.916666 Monday FALSE 19 FALSE 0.3548931 3 -2 13.750000 Monday FALSE 13 FALSE 0.2550745 4 1 11.750000 Monday FALSE 11 FALSE 0.2262214 5 -2 6.416667 Monday FALSE 6 FALSE 0.1645331
This tutorial demonstrated how to use several important functions, but on a small data set. Next up are two additional tutorials that explore ScaleR with bigger data sets, and customization approaches if built-in functions are not quite enough.
Try demo scripts
Another way to learn about ScaleR is through demo scripts. Scripts provided in your Microsoft R installation contain code that's very similar to what you see in this tutorial. You can highlight portions of the script, right-click Execute in Interactive to run the script in RTVS.
Demo scripts are located in the demoScripts subdirectory of your Microsoft R installation. On Windows, this is typically:
`C:\Program Files\Microsoft\R Client\R_SERVER\library\RevoScaleR\demoScripts`
Watch this video
This 30-minute video is the second in a 4-part video series. It demonstrates ScaleR functions for data ingestion.
Get function help
R packages typically include embedded package help reference and ScaleR is no exception. To view embedded help, use the R Help tab, located next to Solution Explorer.
- In R Help, click the Home button.
- Click Packages.
- Scroll down and click RevoScaleR to open the package help. All ScaleR functions are documented here. A subset of more commonly used functions have help pages on MSDN.
Get more information
Continue building up your knowledge of ScaleR with these additional guides and tutorials.
- ScaleR Getting Started with Hadoop
- ScaleR Getting Started with Teradata
- ScaleR User’s Guide
- ScaleR Distributed Computing Guide
- ScaleR ODBC Data Import Guide