
Estimating Correlation and Variance/Covariance Matrices

Rich Calaway | Last updated: 12/15/2016

The rxCovCor function in RevoScaleR calculates the covariance, correlation, or sum of squares/cross-product matrix for a set of variables in an .xdf file or data frame. The size of these matrices is determined by the number of variables rather than the number of observations, so they typically fit easily into memory in R. A broad range of analyses, including factor analysis and principal components analysis, can be computed from some form of cross-product matrix.

A cross-product matrix is a matrix of the form X'X, where X represents an arbitrary set of raw or standardized variables. More generally, this is a matrix of the form X'WX, where W is a diagonal weighting matrix.
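
To make the form of this matrix concrete, here is a small base R sketch (not part of the original example) that computes X'WX for an in-memory data matrix; rxCovCor performs the equivalent computation chunk by chunk on .xdf data. The iris data and unit weights are used purely for illustration.

# Base R sketch of the weighted cross-product X'WX on a small in-memory matrix.
X <- as.matrix(iris[, 1:4])        # raw variables
w <- rep(1, nrow(X))               # unit weights, for illustration only
XtWX <- t(X) %*% diag(w) %*% X     # X'WX; equals crossprod(X) when all w == 1
XtWX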

Computing Cross-Product Matrices

While rxCovCor is the primary tool for computing covariance, correlation, and other cross-product matrices, you will seldom call it directly. Instead, it is usually simpler to use one of the following convenience functions:

  • rxCov: Use rxCov to return the covariance matrix
  • rxCor: Use rxCor to return the correlation matrix
  • rxSSCP: Use rxSSCP to return the augmented cross-product matrix; that is, a column of 1s (if no weights are specified) or a column containing the square root of the weights is first appended to the data matrix, and the cross-product is then computed.
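
For instance, on a small in-memory data frame such as the built-in iris data, the three convenience functions can be called as follows (a quick sketch; it assumes the RevoScaleR package is loaded, and the iris variable names are not part of the census example that follows):

irisForm <- ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width
rxCov(irisForm, data = iris)    # covariance matrix
rxCor(irisForm, data = iris)    # Pearson correlation matrix
rxSSCP(irisForm, data = iris)   # augmented sum of squares/cross-product matrix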

Computing a Correlation Matrix for Use in Factor Analysis

The 5% sample of the U.S. 2000 census has over 14 million observations. In this example we will compute the correlation matrix for 18 variables derived from variables in the data set for individuals over the age of 20. This correlation matrix is then used as input into the standard R factor analysis function, factanal.

First, we specify the name and location of the data set:

######################################################## 
# Chapter 15: Estimating Correlation and Variance/Covariance Matrices
#  Computing a Correlation Matrix for Use in Factor Analysis
Ch15Start <- Sys.time()


bigDataDir <- "C:/MRS/Data"
bigCensusData <- file.path(bigDataDir, "Census5PCT2000.xdf")

Next, we can take a quick look at some basic demographic and socio-economic variables by calling rxSummary. (For more information on variables in the census data, see http://usa.ipums.org/usa-action/variables/group.) Throughout the analysis we will use the probability weighting variable, perwt, and restrict the analysis to people over the age of 20.

The blocksPerRead argument is ignored if run locally using R Client.

rxSummary(~phone + speakeng + wkswork1 + incwelfr + incss + educrec + metro +
    ownershd + marst + lingisol + nfams + yrsusa1 + movedin + racwht + age,
     data = bigCensusData, blocksPerRead = 5, pweights = "perwt", 
    rowSelection = age > 20)

This call provides summary information about the variables in this weighted sub-sample, including cell counts for factor variables:

Call:
rxSummary(formula = ~phone + speakeng + wkswork1 + incwelfr + 
    incss + educrec + metro + ownershd + marst + lingisol + nfams + 
    yrsusa1 + movedin + racwht + age, data = bigCensusData, pweights = "perwt", 
    rowSelection = age > 20, blocksPerRead = 5)

Summary Statistics Results for: ~phone + speakeng + wkswork1 + incwelfr
    + incss + educrec + metro + ownershd + marst + lingisol + nfams +
    yrsusa1 + movedin + racwht + age
File name: C:\MRS\Data\Census5PCT2000.xdf
Probability weights: perwt
Number of valid observations: 9822124 

 Name     Mean        StdDev       Min Max   SumOfWeights MissingWeights
 wkswork1   32.068473   23.2438663  0     52 196971131    0             
 incwelfr   61.155293  711.0955602  0  25500 196971131    0             
 incss    1614.604835 3915.7717233  0  26800 196971131    0             
 nfams       1.163434    0.5375238  1     48 196971131    0             
 yrsusa1     2.868573    9.0098343  0     90 196971131    0             
 age        46.813005   17.1797905 21     93 196971131    0             

Category Counts for phone
Number of categories: 3

 phone                  Counts   
 N/A                      5611380
 No, no phone available   3957030
 Yes, phone available   187402721

Category Counts for speakeng
Number of categories: 10

 speakeng                 Counts   
 N/A (Blank)                      0
 Does not speak English     2956934
 Yes, speaks English...           0
 Yes, speaks only English 162425091
 Yes, speaks very well     17664738
 Yes, speaks well           7713303
 Yes, but not well          6211065
 Unknown                          0
 Illegible                        0
 Blank                            0

Category Counts for educrec
Number of categories: 10

 educrec                 Counts  
 N/A  or No schooling           0
 None or preschool        2757363
 Grade 1, 2, 3, or 4      1439820
 Grade 5, 6, 7, or 8     10180870
 Grade 9                  4862980
 Grade 10                 5957922
 Grade 11                 5763456
 Grade 12                63691961
 1 to 3 years of college 55834997
 4+ years of college     46481762

Category Counts for metro
Number of categories: 5

 metro                                Counts  
 Not applicable                       13829398
 Not in metro area                    32533836
 In metro area, central city          32080416
 In metro area, outside central city  60836302
 Central city status unknown          57691179

Category Counts for ownershd
Number of categories: 8

 ownershd                    Counts  
 N/A                          5611380
 Owned or being bought              0
 Check mark (owns?)                 0
 Owned free and clear        40546259
 Owned with mortgage or loan 94626060
 Rents                              0
 No cash rent                 3169987
 With cash rent              53017445

Category Counts for marst
Number of categories: 6

 marst                      Counts   
 Married, spouse present    112784037
 Married, spouse absent       5896245
 Separated                    4686951
 Divorced                    21474299
 Widowed                     14605829
 Never married/single (N/A)  37523770

Category Counts for lingisol
Number of categories: 3

 lingisol                    Counts   
 N/A (group quarters/vacant)   5611380
 Not linguistically isolated 182633786
 Linguistically isolated       8725965

Category Counts for movedin
Number of categories: 7

 movedin                Counts  
 NA                     92708540
 This year or last year 20107246
 2-5 years ago          30328210
 6-10 years ago         16959897
 11-20 years ago        16406155
 21-30 years ago        10339278
 31+ years ago          10121805

Category Counts for racwht
Number of categories: 2

 racwht Counts   
 No      40684944
 Yes    156286187

Next we will call the rxCor function, a convenience function for rxCovCor that returns just the Pearson’s correlation matrix for the variables specified. We will make heavy use of the transforms argument to create a series of logical (or dummy) variables from factor variables to be used in the creation of the correlation matrix.

censusCor <- rxCor(formula=~poverty + noPhone + noEnglish  + onSocialSecurity + 
    onWelfare + working + incearn + noHighSchool + inCity + renter + 
    noSpouse + langIsolated + multFamilies + newArrival + recentMove + 
    white + sei + older, 
    data = bigCensusData, pweights = "perwt", blocksPerRead = 5, 
    rowSelection = age > 20,
    transforms= list(
        noPhone = phone == "No, no phone available",
        noEnglish = speakeng == "Does not speak English",
        working = wkswork1 > 20,
        onWelfare = incwelfr > 0,
        onSocialSecurity = incss > 0,
         noHighSchool = 
            !(educrec %in% 
            c("Grade 12", "1 to 3 years of college", "4+ years of college")),
        inCity = metro == "In metro area, central city",
        renter = ownershd %in% c("No cash rent", "With cash rent"),
        noSpouse = marst != "Married, spouse present",
        langIsolated = lingisol == "Linguistically isolated",
        multFamilies = nfams > 2,
        newArrival = yrsusa2 == "0-5 years",
        recentMove = movedin == "This year or last year",
        white = racwht == "Yes",
        older = age > 64    
         ))
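
The object returned by rxCor is an ordinary numeric matrix, so it can be inspected with standard R tools before being passed on; for example (a small hedged check, assuming the call above succeeded):

dim(censusCor)                   # should be 18 x 18
round(censusCor[1:4, 1:4], 2)    # peek at one corner of the matrix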

The resulting correlation matrix is used as input into the factor analysis function provided by the stats package in R. In interpreting the results, note that the variable poverty represents family income as a percentage of a poverty threshold, so increases as family income increases. First, specify two factors:

censusFa <- factanal(covmat = censusCor, factors=2)
print(censusFa, digits=2, cutoff = .2, sort= TRUE)

This results in:

Call:
factanal(factors = 2, covmat = censusCor)

Uniquenesses:
         poverty          noPhone        noEnglish onSocialSecurity 
            0.53             0.96             0.93             0.23 
       onWelfare          working          incearn     noHighSchool 
            0.96             0.51             0.68             0.82 
          inCity           renter         noSpouse     langIsolated 
            0.97             0.79             0.90             0.90 
    multFamilies       newArrival       recentMove            white 
            0.96             0.93             0.97             0.88 
             sei            older 
            0.57             0.24 

Loadings:
                 Factor1 Factor2
onSocialSecurity  0.83   -0.29  
working          -0.68          
sei              -0.59   -0.29  
older             0.82   -0.29  
poverty          -0.28   -0.63  
noPhone                         
noEnglish                 0.26  
onWelfare                       
incearn          -0.46   -0.34  
noHighSchool      0.31    0.29  
inCity                          
renter                    0.45  
noSpouse                  0.28  
langIsolated              0.31  
multFamilies                    
newArrival                0.26  
recentMove                      
white                    -0.34  

               Factor1 Factor2
SS loadings       2.60    1.67
Proportion Var    0.14    0.09
Cumulative Var    0.14    0.24

The degrees of freedom for the model is 118 and the fit was 0.6019  

We can use the same correlation matrix to estimate three factors:

censusFa <- factanal(covmat = censusCor, factors=3)
print(censusFa, digits=2, cutoff = .2, sort= TRUE)

  Call:
  factanal(factors = 3, covmat = censusCor)

  Uniquenesses:
           poverty          noPhone        noEnglish onSocialSecurity 
              0.49             0.96             0.72             0.24 
         onWelfare          working          incearn     noHighSchool 
              0.96             0.50             0.62             0.80 
            inCity           renter         noSpouse     langIsolated 
              0.96             0.80             0.90             0.59 
      multFamilies       newArrival       recentMove            white 
              0.96             0.79             0.97             0.88 
               sei            older 
              0.56             0.22 

  Loadings:
                   Factor1 Factor2 Factor3
  onSocialSecurity  0.87                  
  working          -0.62    0.34          
  sei              -0.51    0.42          
  older             0.88                  
  poverty                   0.68          
  incearn          -0.36    0.51          
  noEnglish                         0.53  
  langIsolated                      0.64  
  noPhone                                 
  onWelfare                -0.21          
  noHighSchool      0.25   -0.26    0.26  
  inCity                                  
  renter                   -0.36    0.24  
  noSpouse                 -0.30          
  multFamilies                            
  newArrival                        0.46  
  recentMove                              
  white                     0.24   -0.24  

                 Factor1 Factor2 Factor3
  SS loadings       2.41    1.50    1.18
  Proportion Var    0.13    0.08    0.07
  Cumulative Var    0.13    0.22    0.28

  The degrees of freedom for the model is 102 and the fit was 0.343


Computing a Covariance Matrix for Principal Components Analysis

Principal components analysis, or PCA, is a technique closely related to factor analysis. PCA seeks to find a set of orthogonal axes such that the first axis, or first principal component, accounts for as much variability as possible and subsequent axes or components are chosen to maximize variance while maintaining orthogonality with previous axes. Principal components are typically computed either by a singular value decomposition of the data matrix or an eigenvalue decomposition of a covariance or correlation matrix; the latter permits us to use rxCovCor and its relatives with the standard R function princomp.
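
To see the second route in miniature, the following base R sketch (an aside, not part of the original example) extracts principal components directly from a correlation matrix via an eigenvalue decomposition; princomp does essentially this when handed a correlation matrix:

R <- cor(iris[, 1:4])    # correlation matrix of the raw iris measurements
e <- eigen(R)            # eigenvalue decomposition
sqrt(e$values)           # standard deviations of the components
e$vectors                # loadings (columns may differ in sign from princomp)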

As an example, we will use the rxCov function to calculate a covariance matrix for the log of the iris data and pass it to the princomp function; this reproduces the example from pages 303-304 of Modern Applied Statistics with S:

#  Computing A Covariance Matrix for Principal Components Analysis

irisLog <- as.data.frame(lapply(iris[,1:4], log))
irisSpecies <- iris[,5]
irisCov <- rxCov(~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
    data=irisLog)
irisPca <- princomp(covmat=irisCov, cor=TRUE)
summary(irisPca)

This yields the following:

Importance of components:
                          Comp.1    Comp.2     Comp.3    Comp.4
Standard deviation     1.7124583 0.9523797 0.36470294 0.1656840
Proportion of Variance 0.7331284 0.2267568 0.03325206 0.0068628
Cumulative Proportion  0.7331284 0.9598851 0.99313720 1.0000000

The default plot method for objects of class princomp is a screeplot, which is a barplot of the variances of the principal components. We can obtain the plot as usual by calling plot with our principal components object:

plot(irisPca)

This yields the following plot:

Another useful bit of output is given by the loadings function, which returns a set of columns showing the linear combinations for each principal component:

loadings(irisPca)

  Loadings:
               Comp.1 Comp.2 Comp.3 Comp.4
  Sepal.Length  0.504 -0.455  0.709  0.191
  Sepal.Width  -0.302 -0.889 -0.331       
  Petal.Length  0.577        -0.219 -0.786
  Petal.Width   0.567        -0.583  0.580

                 Comp.1 Comp.2 Comp.3 Comp.4
  SS loadings      1.00   1.00   1.00   1.00
  Proportion Var   0.25   0.25   0.25   0.25
  Cumulative Var   0.25   0.50   0.75   1.00

You may have noticed that we supplied the flag cor=TRUE in the call to princomp; this tells princomp to use the correlation matrix rather than the covariance matrix to compute the principal components. We can obtain the same results by omitting the flag but submitting the correlation matrix as returned by rxCor instead:

irisCor <- rxCor(~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, 
    data=irisLog)
irisPca2 <- princomp(covmat=irisCor)
summary(irisPca2)
loadings(irisPca2)
plot(irisPca2)

A Large Data Principal Components Analysis

Stock market data for open, high, low, close, and adjusted close prices from 1962 to 2010 is available at https://github.com/thebigjc/HackReduce/blob/master/datasets/nyse/daily_prices/NYSE_daily_prices_subset.csv. The full data set includes 9.2 million observations of daily open-high-low-close data for some 2800 stocks. As you might expect, these data are highly correlated, and principal components analysis can be used for data reduction. We read the original data into an .xdf file, NYSE_daily_prices.xdf, using the same process we used in the Getting Started Guide to read our mortgage data (set bigDataDir to the full path of the NYSE directory containing the .csv files when you unpack the download):

#  A Large Data Principal Components Analysis

if (bHasNYSE){  

bigDataDir <- "C:/MRS/Data"
nyseCsvFiles <- file.path(bigDataDir, "NYSE_daily_prices","NYSE",
    "NYSE_daily_prices_")

nyseXdf <- "NYSE_daily_prices.xdf"
append <- "none"
for (i in LETTERS)
{
    importFile <- paste(nyseCsvFiles, i, ".csv", sep="")
    rxImport(importFile, nyseXdf, append=append)
    append <- "rows"
}

Once we have our .xdf file, we proceed as before:

stockCor <- rxCor(~ stock_price_open + stock_price_high + 
    stock_price_low + stock_price_close + 
    stock_price_adj_close, data="NYSE_daily_prices.xdf")
stockPca <- princomp(covmat=stockCor)
summary(stockPca)
loadings(stockPca)
plot(stockPca)

This yields the following output:

> summary(stockPca)
Importance of components:
                          Comp.1    Comp.2      Comp.3       Comp.4
Standard deviation     2.0756631 0.8063270 0.197632281 0.0454173922
Proportion of Variance 0.8616755 0.1300327 0.007811704 0.0004125479
Cumulative Proportion  0.8616755 0.9917081 0.999519853 0.9999324005
                             Comp.5
Standard deviation     1.838470e-02
Proportion of Variance 6.759946e-05
Cumulative Proportion  1.000000e+00
> loadings(stockPca)

Loadings:
                      Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
stock_price_open      -0.470 -0.166  0.867              
stock_price_high      -0.477 -0.151 -0.276  0.410 -0.711
stock_price_low       -0.477 -0.153 -0.282  0.417  0.704
stock_price_close     -0.477 -0.149 -0.305 -0.811       
stock_price_adj_close -0.309  0.951                     

               Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
SS loadings       1.0    1.0    1.0    1.0    1.0
Proportion Var    0.2    0.2    0.2    0.2    0.2
Cumulative Var    0.2    0.4    0.6    0.8    1.0

The screeplot is shown below:

Between them, the first two principal components explain 99% of the variance; we can therefore replace the five original variables by these two principal components with no appreciable loss of information.
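
As a hedged sketch of how this reduction might be applied (assuming the stockPca object above and a readable NYSE_daily_prices.xdf; standardization is done within the sampled chunk for simplicity), one can pull a small chunk of rows into memory and project the standardized price variables onto the first two loading vectors:

# Project a small in-memory chunk onto the first two principal components.
twoLoadings <- loadings(stockPca)[, 1:2]          # 5 x 2 matrix of weights
priceVars <- c("stock_price_open", "stock_price_high", "stock_price_low",
               "stock_price_close", "stock_price_adj_close")
sampleDF <- rxDataStep("NYSE_daily_prices.xdf", numRows = 1000,
                       varsToKeep = priceVars)    # returns a data frame
scores <- scale(sampleDF[, priceVars]) %*% twoLoadings   # n x 2 score matrix
head(scores)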

} # End of bHasNYSE

Ridge Regression

Another application of correlation matrices is to calculate ridge regression, a type of regression that can help deal with multicollinearity and is part of a broader class of models called Penalized Regression Models.

Where ordinary least squares regression minimizes the sum of squared residuals

$$\sum_{i=1}^{n} (y_i - x_i'\beta)^2,$$

ridge regression minimizes the slightly modified sum

$$\sum_{i=1}^{n} (y_i - x_i'\beta)^2 + \lambda \sum_{j=1}^{k} \beta_j^2.$$

The solution to the ridge regression is

$$\hat{\beta}_{ridge} = (X'X + \lambda I)^{-1} X'y,$$

where X is the model matrix. This is similar to the ordinary least squares solution \((X'X)^{-1}X'y\), but with a "ridge" of \(\lambda\) added along the diagonal of \(X'X\).

Because the ridge solution depends on the data only through cross-products, which rxCovCor returns in standardized form as the correlation matrix (together with the means and standard deviations needed to return to the original scale), the following function computes the ridge regression solution:

#  Ridge regression
rxRidgeReg <- function(formula, data, lambda, ...) {
  myTerms <- all.vars(formula)
  newForm <- as.formula(paste("~", paste(myTerms, collapse = "+")))
  myCor <- rxCovCor(newForm, data = data, type = "Cor", ...)
  n <- myCor$valid.obs
  k <- nrow(myCor$CovCor) - 1
  bridgeprime <- do.call(rbind, lapply(lambda, 
        function(l) qr.solve(myCor$CovCor[-1,-1] + l*diag(k), 
                             myCor$CovCor[-1,1])))
  bridge <-  myCor$StdDevs[1] * sweep(bridgeprime, 2, 
        myCor$StdDevs[-1], "/")
  bridge <- cbind(t(myCor$Means[1] - 
        tcrossprod(myCor$Means[-1], bridge)), bridge)
  rownames(bridge) <- format(lambda)
  return(bridge)
}

The following example shows how ridge regression dramatically improves the solution in a regression involving heavy multicollinearity:

set.seed(14)
x <- rnorm(100)
y <- rnorm(100, mean=x, sd=.01)
z <- rnorm(100, mean=2 + x +y)
data <- data.frame(x=x, y=y, z=z)
lm(z ~ x + y)

 Call:
 lm(formula = z ~ x + y)

 Coefficients:
 (Intercept)            x            y  
      1.827        4.359       -2.584  
rxRidgeReg(z ~ x + y, data=data, lambda=0.02)
                      x         y
 0.02 1.827674 0.8917924 0.8723334

You can supply a vector of lambdas as a quick way to compare various choices:

rxRidgeReg(z ~ x + y, data=data, lambda=c(0, 0.02, 0.2, 2, 20))
                        x           y
  0.00 1.827112 4.35917238 -2.58387778
  0.02 1.827674 0.89179239  0.87233344
  0.20 1.833899 0.81020130  0.80959911
  2.00 1.865339 0.44512940  0.44575060
 20.00 1.896779 0.08092296  0.08105322

Notice that for \(\lambda = 0\), the ridge regression is identical to ordinary least squares, while as \(\lambda \rightarrow \infty\), the coefficients of x and y approach 0.
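
A quick way to confirm the first of these claims (a sketch reusing the simulated data frame from above) is to compare the lambda = 0 ridge solution with the ordinary least squares coefficients:

# The ridge solution at lambda = 0 should reproduce lm()'s coefficients.
cbind(ridge = as.numeric(rxRidgeReg(z ~ x + y, data = data, lambda = 0)),
      ols   = coef(lm(z ~ x + y, data = data)))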
