Table of contents
TOC
Collapse the table of content
Expand the table of content

Example: Analyzing loan data with ScaleR (Microsoft R)

Heidi Steen|Last Updated: 11/6/2016

This example builds on what you learned in the quick start introduction to ScaleR by showing you how to import .csv files to create an .xdf file, and use statistical ScaleR functions to summarize the data. As before, you'll work with sample data to complete the steps.

Get the mortgage default data set

Data import and exploration are covered further on, but to complete those steps you will need a collection of .csv files providing the sample data.

The sample data used in this tutorial consists of simulated data on mortgage defaults. A small version of the data set is pre-installed with the RevoScaleR package that ships with R Client and R Server. It provides 100,000 observations. Alternatively, you can download a larger version of the data set providing 10 million observations.

Use the pre-installed .csv files

Small versions of the data sets can be found in the sample data directory:

    mortDefaultSmall2000.csv
    mortDefaultSmall2001.csv
    mortDefaultSmall2002.csv
    mortDefaultSmall2003.csv
    mortDefaultSmall2004.csv
    mortDefaultSmall2005.csv
    mortDefaultSmall2006.csv
    mortDefaultSmall2007.csv
    mortDefaultSmall2008.csv
    mortDefaultSmall2009.csv

Each file contains 10,000 rows, for a total of 100,000 observations.

Download the larger data set

To work with a larger dataset, you can download mortDefault, a set of ten comma-separated files, each of which contains one million observations of simulated data on mortgage defaults. You can download zipped from this web site. Windows users should download the zip version, mortDefault.zip. Linux users should download mortDefault.tar.gz. File size is approximately 220 MB unpacked.

When downloading these files, put them in a directory where you can easily access them. For example, create a directory "C:\MRS\BigData" and unpack the files there. When running examples using these files, you will want to specify this location as your bigDataDir. For example:

bigDataDir <- "C:\MRS\BigData"

Each download contains ten files, each of which contains one million observations for a total of 10 million:

    mortDefault2000.csv
    mortDefault2001.csv
    mortDefault2002.csv
    mortDefault2003.csv
    mortDefault2004.csv
    mortDefault2005.csv
    mortDefault2006.csv
    mortDefault2007.csv
    mortDefault2008.csv
    mortDefault2009.csv

About the data

This example uses simulated data at the individual level to analyze loan defaults. Data has been collected every year for 10 years on mortgage holders, stored in a comma delimited data set for each year. The data contains 5 variables:

  • default – a 0/1 binary variable indicating whether or not the mortgage holder defaulted on the loan
  • creditScore – a credit rating
  • yearsEmploy – the number of years the mortgage holder has been employed at their current job
  • ccDebt – the amount of credit card debt
  • houseAge – the age (in years) of the house
  • year – the year the data was collected

As in previous tutorials, you can use package help to get more information by typing?mortDefaultSmall at the > prompt in the R interactive window.

Specify the input files and output XDF file

The first step is to import the data into an XDF file format for analysis. First, specify the source files you will be using.

For the smaller data sets that exist in the sample data directory, enter:

sampleDataDir <- rxGetOption("sampleDataDir")
mortCsvDataName <- file.path(sampleDataDir, "mortDefaultSmall")

For the large files, enter the location of your downloaded data:

bigDataDir <- "C:/MRS/Data"
mortCsvDataName <- file.path(bigDataDir, "mortDefault", "mortDefault")

Notice that the file path uses a forward slash / character. This is the correct syntax, even for file paths on Windows operating systems that would otherwise use the back slash character.

Next, specify the name of the .xdf file you will create in your working directory (use "mortDefault" instead of "mortDefaultSmall" if you are using the large data sets):

mortXdfFileName <- "mortDefaultSmall.xdf"

or

mortXdfFileName <- "mortDefault.xdf"

Import a set of files in a loop using append

Use the rxImport function to import the data. When the first data file is read, a new XDF file is created using the dataFileName specified above. Subsequent data files are appended to that XDF file. Within the loop, the name of the imported data file is created.

append <- "none"
for (i in 2000:2009)
{
    importFile <- paste(mortCsvDataName, i, ".csv", sep="")
    mortDS <- rxImport(importFile, mortXdfFileName, append=append)
    append <- "rows"
}

Output will be a series of Rows Read notifications, one for each file.

If you have previously imported the data and created the .xdf data source, you can create an .xdf data source representing the file us RxXdfData:

To get a basic summary of the data set and show the first 5 rows enter:

rxGetInfo(mortDS, numRows=5)

Output for the small data files should be as follows:

File name: C:\YourWorkingDir\mortDefaultSmall.xdf
Number of observations: 1e+05
Number of variables: 6
Number of blocks: 10
Compression type: zlib
Data (5 rows starting with row 1):
  creditScore houseAge yearsEmploy ccDebt year default
1         691       16           9   6725 2000       0
2         691        4           4   5077 2000       0
3         743       18           3   3080 2000       0
4         728       22           1   4345 2000       0
5         745       17           3   2969 2000       0

Output for large data files:

File name: C:\YourWorkingDir\mortDefault.xdf
Number of observations: 1e+07
Number of variables: 6
Number of blocks: 20
Compression type: zlib
Data (5 rows starting with row 1):
  creditScore houseAge yearsEmploy ccDebt year default
1         615       10           5   2818 2000       0
2         780       34           5   3575 2000       0
3         735       12           1   3184 2000       0
4         713       15           5   6236 2000       0
5         689       10           5   6817 2000       0

Computing Summary Statistics

Use the rxSummary function to compute summary statistics for the variables in the data set, setting the blocksPerRead to 2.

The blocksPerRead argument is ignored if the script runs locally on R Client.

rxSummary(~., data = mortDS, blocksPerRead = 2)

The following output is returned (for the large data set):

Call:
rxSummary(formula = ~., data = mortDS, blocksPerRead = 2)

Summary Statistics Results for: ~.
File name: C:\YourWorkingDir\mortDefault.xdf
Number of valid observations: 1e+07
Number of missing observations: 0

 Name        Mean         StdDev       Min  Max   ValidObs MissingObs
 creditScore  700.0382475 5.000443e+01  432   955 1e+07    0         
 houseAge      20.0007868 7.646592e+00    0    40 1e+07    0         
 yearsEmploy    5.0042771 2.009815e+00    0    15 1e+07    0         
 ccDebt      5003.6681809 1.988664e+03    0 15566 1e+07    0         
 year        2004.5000000 2.872281e+00 2000  2009 1e+07    0         
 default        0.0049555 7.022068e-02    0     1 1e+07    0

Computing a Logistic Regression

Using the binary default variable as the dependent variable, estimate a logistic regression using year, creditScore, yearsEmploy, and ccDebt as independent variables. Year is an integer value; if we include it as-is in the regression, we would get an estimate of a single coefficient for it indicating the trend in mortgage defaults.

Alternatively, we can treat year as a categorical or factor variable by using the F function. The benefit is that we get a separate coefficient estimated for each year (except the last), telling us which years have higher default rates, while controlling for the other variables in the regression. The logistic regression is specified as follows:

The blocksPerRead argument is ignored if run locally using R Client. Learn more...

logitObj <- rxLogit(default~F(year) + creditScore +
       yearsEmploy + ccDebt,
        data = mortDS, blocksPerRead = 2,
      reportProgress = 1)
summary(logitObj)

You will see timings for each iteration and the final results printed. The results for the large data set are:

    Call:
    rxLogit(formula = default ~ F(year) + creditScore + yearsEmploy +
        ccDebt, data = mortDS, blocksPerRead = 2, reportProgress = 1)

    Logistic Regression Results for: default ~ F(year) + creditScore + yearsEmploy + ccDebt
    File name: C:\YourWorkingDir\mortDefault.xdf
    Dependent variable(s): default
    Total independent variables: 14 (Including number dropped: 1)
    Number of valid observations: 1e+07
    Number of missing observations: 0
    -2*LogLikelihood: 300023.0364 (Residual deviance on 9999987 degrees of freedom)

    Coefficients:
                  Estimate Std. Error z value Pr(>|z|)    
    (Intercept) -7.294e+00  7.773e-02  -93.84 2.22e-16 ***
    F_year=2000 -3.996e+00  3.428e-02 -116.57 2.22e-16 ***
    F_year=2001 -2.722e+00  2.148e-02 -126.71 2.22e-16 ***
    F_year=2002 -3.865e+00  3.236e-02 -119.43 2.22e-16 ***
    F_year=2003 -4.182e+00  3.661e-02 -114.24 2.22e-16 ***
    F_year=2004 -4.721e+00  4.625e-02 -102.09 2.22e-16 ***
    F_year=2005 -4.903e+00  4.930e-02  -99.45 2.22e-16 ***
    F_year=2006 -4.442e+00  4.098e-02 -108.39 2.22e-16 ***
    F_year=2007 -3.209e+00  2.546e-02 -126.02 2.22e-16 ***
    F_year=2008 -7.228e-01  1.240e-02  -58.28 2.22e-16 ***
    F_year=2009    Dropped    Dropped Dropped  Dropped    
    creditScore -6.761e-03  1.062e-04  -63.67 2.22e-16 ***
    yearsEmploy -2.736e-01  2.698e-03 -101.41 2.22e-16 ***
    ccDebt       1.365e-03  3.922e-06  348.02 2.22e-16 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Condition number of final variance-covariance matrix: 6.685
    Number of iterations: 10

Computing a Logistic Regression with Many Parameters

If you are using the large mortDefault data set, you can continue and estimate a logistic regression with many parameters.

One of the variables in the data set is houseAge. As with year, we have no reason to believe that the logistic expression is linear with respect to the age of the house. We can look at how the age of the house is related to the default rate by including houseAge in the formula as a categorical variable, again using the F function. A coefficient will be estimated for 40 of the 41 values of houseAge.

    system.time(
    logitObj <- rxLogit(default  ~ F(houseAge) + F(year)+  
        creditScore + yearsEmploy + ccDebt,
        data = mortDS, blocksPerRead = 2, reportProgress = 1))
    summary(logitObj)

The results of the estimation are:

    Call:
    rxLogit(formula = default ~ F(houseAge) + F(year) + creditScore +
        yearsEmploy + ccDebt, data = dataFileName, blocksPerRead = 2,
        reportProgress = 1)

    Logistic Regression Results for: default ~ F(houseAge) + F(year) +
        creditScore + yearsEmploy + ccDebt
    File name: mortDefault.xdf
    Dependent variable(s): default
    Total independent variables: 55 (Including number dropped: 2)
    Number of valid observations: 1e+07
    Number of missing observations: 0
    -2*LogLikelihood: 287165.5811 (Residual deviance on 9999947 degrees of freedom)

    Coefficients:
                    Estimate Std. Error  z value Pr(>|z|)    
    (Intercept)   -8.889e+00  2.043e-01  -43.515 2.22e-16 ***
    F_houseAge=0  -4.160e-01  2.799e-01   -1.486 0.137262    
    F_houseAge=1  -1.712e-01  2.526e-01   -0.678 0.497815    
    F_houseAge=2  -3.593e-01  2.433e-01   -1.477 0.139775    
    F_houseAge=3  -1.183e-01  2.274e-01   -0.520 0.602812    
    F_houseAge=4  -2.874e-02  2.177e-01   -0.132 0.895000    
    F_houseAge=5   8.370e-02  2.106e-01    0.397 0.691025    
    F_houseAge=6  -1.524e-01  2.088e-01   -0.730 0.465538    
    F_houseAge=7   5.902e-03  2.033e-01    0.029 0.976837    
    F_houseAge=8   9.319e-02  2.000e-01    0.466 0.641225    
    F_houseAge=9   1.180e-01  1.977e-01    0.597 0.550485    
    F_houseAge=10  2.850e-01  1.954e-01    1.459 0.144683    
    F_houseAge=11  4.472e-01  1.936e-01    2.310 0.020911 *  
    F_houseAge=12  4.951e-01  1.929e-01    2.567 0.010266 *  
    F_houseAge=13  7.642e-01  1.918e-01    3.985 6.75e-05 ***
    F_houseAge=14  9.484e-01  1.911e-01    4.963 6.93e-07 ***
    F_houseAge=15  1.141e+00  1.905e-01    5.991 2.09e-09 ***
    F_houseAge=16  1.305e+00  1.902e-01    6.863 6.76e-12 ***
    F_houseAge=17  1.481e+00  1.899e-01    7.797 2.22e-16 ***
    F_houseAge=18  1.622e+00  1.897e-01    8.552 2.22e-16 ***
    F_houseAge=19  1.821e+00  1.895e-01    9.610 2.22e-16 ***
    F_houseAge=20  1.895e+00  1.893e-01   10.012 2.22e-16 ***
    F_houseAge=21  2.063e+00  1.893e-01   10.893 2.22e-16 ***
    F_houseAge=22  2.056e+00  1.894e-01   10.859 2.22e-16 ***
    F_houseAge=23  2.094e+00  1.894e-01   11.056 2.22e-16 ***
    F_houseAge=24  2.052e+00  1.895e-01   10.831 2.22e-16 ***
    F_houseAge=25  1.980e+00  1.896e-01   10.440 2.22e-16 ***
    F_houseAge=26  1.901e+00  1.898e-01   10.014 2.22e-16 ***
    F_houseAge=27  1.748e+00  1.901e-01    9.193 2.22e-16 ***
    F_houseAge=28  1.613e+00  1.906e-01    8.466 2.22e-16 ***
    F_houseAge=29  1.397e+00  1.913e-01    7.304 2.22e-16 ***
    F_houseAge=30  1.340e+00  1.919e-01    6.987 2.82e-12 ***
    F_houseAge=31  1.127e+00  1.930e-01    5.840 5.23e-09 ***
    F_houseAge=32  8.557e-01  1.951e-01    4.386 1.15e-05 ***
    F_houseAge=33  6.801e-01  1.976e-01    3.442 0.000576 ***
    F_houseAge=34  6.015e-01  2.002e-01    3.004 0.002666 ***
    F_houseAge=35  5.077e-01  2.050e-01    2.477 0.013239 *  
    F_houseAge=36  2.856e-01  2.098e-01    1.361 0.173455    
    F_houseAge=37  1.988e-01  2.192e-01    0.907 0.364336    
    F_houseAge=38  1.050e-01  2.311e-01    0.454 0.649508    
    F_houseAge=39  1.778e-01  2.422e-01    0.734 0.462974    
    F_houseAge=40    Dropped    Dropped  Dropped  Dropped    
    F_year=2000   -4.111e+00  3.482e-02 -118.075 2.22e-16 ***
    F_year=2001   -2.802e+00  2.185e-02 -128.261 2.22e-16 ***
    F_year=2002   -3.972e+00  3.282e-02 -121.025 2.22e-16 ***
    F_year=2003   -4.307e+00  3.711e-02 -116.064 2.22e-16 ***
    F_year=2004   -4.852e+00  4.676e-02 -103.770 2.22e-16 ***
    F_year=2005   -5.019e+00  4.973e-02 -100.921 2.22e-16 ***
    F_year=2006   -4.563e+00  4.152e-02 -109.901 2.22e-16 ***
    F_year=2007   -3.303e+00  2.587e-02 -127.649 2.22e-16 ***
    F_year=2008   -7.461e-01  1.261e-02  -59.171 2.22e-16 ***
    F_year=2009      Dropped    Dropped  Dropped  Dropped    
    creditScore   -6.987e-03  1.079e-04  -64.747 2.22e-16 ***
    yearsEmploy   -2.821e-01  2.746e-03 -102.729 2.22e-16 ***
    ccDebt         1.406e-03  4.092e-06  343.586 2.22e-16 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Condition number of final variance-covariance matrix: 5254.541
    Number of iterations: 10

We can extract the coefficients from logitObj for the houseAge variables and plot them:

    cc <- coef(logitObj)
    df <- data.frame(Coefficient=cc[2:41], HouseAge=0:39)
    rxLinePlot(Coefficient~HouseAge,data=df, type="p")

The resulting plot shows that the age of the house is associated with a higher default rate in the middle of the range than for younger and older houses.

HouseAge Plot

Compute the Probability of Default

Using the rxPredict function, we can compute the probability of default for given characteristics using our estimated logistic regression model as input. First create some vectors containing values of interest for each of the independent variables:

    creditScore <- c(300, 700)
    yearsEmploy <- c( 2, 8)
    ccDebt <- c(5000, 10000)
    year <- c(2008, 2009)
    houseAge <- c(5, 20)

Now create a data frame with combinations of those values:

    predictDF <- data.frame(
        creditScore = rep(creditScore, times = 16),
        yearsEmploy = rep(rep(yearsEmploy, each = 2), times = 8),
        ccDebt      = rep(rep(ccDebt, each = 4), times = 4),
        year        = rep(rep(year, each = 8), times = 2),
        houseAge    = rep(houseAge, each = 16))

Using the rxPredict function, we can compute the predicted probability of default for each of the variable combinations. The predicted values will be added to the outData data set, if it already exists:

    predictDF <- rxPredict(modelObject = logitObj, data = predictDF,
        outData = predictDF)
    predictDF[order(predictDF$default_Pred, decreasing = TRUE),]

The results should be printed to your console, with the highest default rate at the top:

       creditScore yearsEmploy ccDebt year houseAge default_Pred
    29         300           2  10000 2009       20 9.879328e-01
    21         300           2  10000 2008       20 9.748884e-01
    31         300           8  10000 2009       20 9.377608e-01
    13         300           2  10000 2009        5 9.304742e-01
    23         300           8  10000 2008       20 8.772219e-01
    5          300           2  10000 2008        5 8.638767e-01
    30         700           2  10000 2009       20 8.334778e-01
    15         300           8  10000 2009        5 7.112344e-01
    22         700           2  10000 2008       20 7.035689e-01
    7          300           8  10000 2008        5 5.387369e-01
    32         700           8  10000 2009       20 4.794787e-01
    14         700           2  10000 2009        5 4.500066e-01
    24         700           8  10000 2008       20 3.040133e-01
    6          700           2  10000 2008        5 2.795344e-01
    16         700           8  10000 2009        5 1.308739e-01
    25         300           2   5000 2009       20 6.756520e-02
    8          700           8  10000 2008        5 6.664645e-02
    17         300           2   5000 2008       20 3.321951e-02
    27         300           8   5000 2009       20 1.316013e-02
    9          300           2   5000 2009        5 1.170657e-02
    19         300           8   5000 2008       20 6.284003e-03
    1          300           2   5000 2008        5 5.585628e-03
    26         700           2   5000 2009       20 4.410500e-03
    11         300           8   5000 2009        5 2.175239e-03
    18         700           2   5000 2008       20 2.096317e-03
    3          300           8   5000 2008        5 1.032677e-03
    28         700           8   5000 2009       20 8.146337e-04
    10         700           2   5000 2009        5 7.236564e-04
    20         700           8   5000 2008       20 3.864641e-04
    2          700           2   5000 2008        5 3.432878e-04
    12         700           8   5000 2009        5 1.332594e-04
    4          700           8   5000 2008        5 6.319589e-05

Next steps

See Also

Introduction to Microsoft R

Diving into data analysis in Microsoft R

RevoScaleR Functions

© 2017 Microsoft