# Estimating Correlation and Variance/Covariance Matrices

Rich Calaway|Last Updated: 12/15/2016
|
2 Contributors

The rxCovCor function in RevoScaleR calculates the covariance, correlation, or sum of squares/cross-product matrix for a set of variables in an .xdf file or data frame. The size of these matrices is determined by the number of variables rather than the number of observations, so typically the results can easily fit into memory in R. A broad category of analyses can be computed from some form of a cross-product matrix, for example, factor analysis and principal components.

A cross-product matrix is a matrix of the form X'X, where X represents an arbitrary set of raw or standardized variables. More generally, this is a matrix of the form X'WX, where W is a diagonal weighting matrix.

### Computing Cross-Product Matrices

While rxCovCor is the primary tool for computing covariance, correlation, and other cross-product matrices, you will seldom call it directly. Instead, it is usually simpler to use one of the following convenience functions:

• rxCov: Use rxCov to return the covariance matrix
• rxCor: Use rxCor to return the correlation matrix
• rxSSCP: Use rxSSCP to return the augmented cross-product matrix, that is, we first add a column of 1’s (if no weights are specified) or a column equaling the square root of the weights to the data matrix, and then compute the cross-product.

### Computing a Correlation Matrix for Use in Factor Analysis

The 5% sample of the U.S. 2000 census has over 14 million observations. In this example we will compute the correlation matrix for 16 variables derived from variables in the data set for individuals over the age of 20. This correlation matrix is then used as input into the standard R factor analysis function, factanal.

First, we specify the name and location of the data set:

########################################################
# Chapter 15: Estimating Correlation and Variance/Covariance Matrices
#  Computing a Correlation Matrix for Use in Factor Analysis
Ch15Start <- Sys.time()



Next, we can take a quick look at some basic demographic and socio-economic variables by calling rxSummary. (For more information on variables in the census data, see http://usa.ipums.org/usa-action/variables/group.) Throughout the analysis we will use the probability weighting variable, perwt, and restrict the analysis to people over the age of 20.

The blocksPerRead argument is ignored if run locally using R Client. Learn more...

rxSummary(~phone + speakeng + wkswork1 + incwelfr + incss + educrec + metro +
ownershd + marst + lingisol + nfams + yrsusa1 + movedin + racwht + age,
data = bigCensusData, blocksPerRead = 5, pweights = "perwt",
rowSelection = age > 20)


This call provides summary information about the variables in this weighted sub-sample, including cell counts for factor variables:

Call:
rxSummary(formula = ~phone + speakeng + wkswork1 + incwelfr +
incss + educrec + metro + ownershd + marst + lingisol + nfams +
yrsusa1 + movedin + racwht + age, data = censusData, pweights = "perwt",
rowSelection = age > 20, blocksPerRead = 5)

Summary Statistics Results for: ~phone + speakeng + wkswork1 + incwelfr
+ incss + educrec + metro + ownershd + marst + lingisol + nfams +
yrsusa1 + movedin + racwht + age
File name: C:\MRS\Data\Census5PCT2000.xdf
Probability weights: perwt
Number of valid observations: 9822124

Name     Mean        StdDev       Min Max   SumOfWeights MissingWeights
wkswork1   32.068473   23.2438663  0     52 196971131    0
incwelfr   61.155293  711.0955602  0  25500 196971131    0
incss    1614.604835 3915.7717233  0  26800 196971131    0
nfams       1.163434    0.5375238  1     48 196971131    0
yrsusa1     2.868573    9.0098343  0     90 196971131    0
age        46.813005   17.1797905 21     93 196971131    0

Category Counts for phone
Number of categories: 3

phone                  Counts
N/A                      5611380
No, no phone available   3957030
Yes, phone available   187402721

Category Counts for speakeng
Number of categories: 10

speakeng                 Counts
N/A (Blank)                      0
Does not speak English     2956934
Yes, speaks English...           0
Yes, speaks only English 162425091
Yes, speaks very well     17664738
Yes, speaks well           7713303
Yes, but not well          6211065
Unknown                          0
Illegible                        0
Blank                            0

Category Counts for educrec
Number of categories: 10

educrec                 Counts
N/A  or No schooling           0
None or preschool        2757363
Grade 1, 2, 3, or 4      1439820
Grade 5, 6, 7, or 8     10180870
1 to 3 years of college 55834997
4+ years of college     46481762

Category Counts for metro
Number of categories: 5

metro                                Counts
Not applicable                       13829398
Not in metro area                    32533836
In metro area, central city          32080416
In metro, area, outside central city 60836302
Central city status unknown          57691179

Category Counts for ownershd
Number of categories: 8

ownershd                    Counts
N/A                          5611380
Owned or being bought              0
Check mark (owns?)                 0
Owned free and clear        40546259
Owned with mortgage or loan 94626060
Rents                              0
No cash rent                 3169987
With cash rent              53017445

Category Counts for marst
Number of categories: 6

marst                      Counts
Married, spouse present    112784037
Married, spouse absent       5896245
Separated                    4686951
Divorced                    21474299
Widowed                     14605829
Never married/single (N/A)  37523770

Category Counts for lingisol
Number of categories: 3

lingisol                    Counts
N/A (group quarters/vacant)   5611380
Not linguistically isolated 182633786
Linguistically isolated       8725965

Category Counts for movedin
Number of categories: 7

movedin                Counts
NA                     92708540
This year or last year 20107246
2-5 years ago          30328210
6-10 years ago         16959897
11-20 years ago        16406155
21-30 years ago        10339278
31+ years ago          10121805

Category Counts for racwht
Number of categories: 2

racwht Counts
No      40684944
Yes    156286187


Next we will call the rxCor function, a convenience function for rxCovCor that returns just the Pearson’s correlation matrix for the variables specified. We will make heavy use of the transforms argument to create a series of logical (or dummy) variables from factor variables to be used in the creation of the correlation matrix.

censusCor <- rxCor(formula=~poverty + noPhone + noEnglish  + onSocialSecurity +
onWelfare + working + incearn + noHighSchool + inCity + renter +
noSpouse + langIsolated + multFamilies + newArrival + recentMove +
white + sei + older,
data = bigCensusData, pweightsb= "perwt", blocksPerRead = 5,
rowSelection = age > 20,
transforms= list(
noPhone = phone == "No, no phone available",
noEnglish = speakeng == "Does not speak English",
working = wkswork1 > 20,
onWelfare = incwelfr > 0,
onSocialSecurity = incss > 0,
noHighSchool =
!(educrec %in%
c("Grade 12", "1 to 3 years of college", "4+ years of college")),
inCity = metro == "In metro area, central city",
renter = ownershd %in% c("No cash rent", "With cash rent"),
noSpouse = marst != "Married, spouse present",
langIsolated = lingisol == "Linguistically isolated",
multFamilies = nfams > 2,
newArrival = yrsusa2 == "0-5 years",
recentMove = movedin == "This year or last year",
white = racwht == "Yes",
older = age > 64
))


The resulting correlation matrix is used as input into the factor analysis function provided by the stats package in R. In interpreting the results, note that the variable poverty represents family income as a percentage of a poverty threshold, so increases as family income increases. First, specify two factors:

censusFa <- factanal(covmat = censusCor, factors=2)
print(censusFa, digits=2, cutoff = .2, sort= TRUE)


This results in:

Call:
factanal(factors = 2, covmat = censusCor)

Uniquenesses:
poverty          noPhone        noEnglish onSocialSecurity
0.53             0.96             0.93             0.23
onWelfare          working          incearn     noHighSchool
0.96             0.51             0.68             0.82
inCity           renter         noSpouse     langIsolated
0.97             0.79             0.90             0.90
multFamilies       newArrival       recentMove            white
0.96             0.93             0.97             0.88
sei            older
0.57             0.24

Factor1 Factor2
onSocialSecurity  0.83   -0.29
working          -0.68
sei              -0.59   -0.29
older             0.82   -0.29
poverty          -0.28   -0.63
noPhone
noEnglish                 0.26
onWelfare
incearn          -0.46   -0.34
noHighSchool      0.31    0.29
inCity
renter                    0.45
noSpouse                  0.28
langIsolated              0.31
multFamilies
newArrival                0.26
recentMove
white                    -0.34

Factor1 Factor2
Proportion Var    0.14    0.09
Cumulative Var    0.14    0.24

The degrees of freedom for the model is 118 and the fit was 0.6019


We can use the same correlation matrix to estimate three factors:

censusFa <- factanal(covmat = censusCor, factors=3)
print(censusFa, digits=2, cutoff = .2, sort= TRUE)

Call:
factanal(factors = 3, covmat = censusCor)

Uniquenesses:
poverty          noPhone        noEnglish onSocialSecurity
0.49             0.96             0.72             0.24
onWelfare          working          incearn     noHighSchool
0.96             0.50             0.62             0.80
inCity           renter         noSpouse     langIsolated
0.96             0.80             0.90             0.59
multFamilies       newArrival       recentMove            white
0.96             0.79             0.97             0.88
sei            older
0.56             0.22

Factor1 Factor2 Factor3
onSocialSecurity  0.87
working          -0.62    0.34
sei              -0.51    0.42
older             0.88
poverty                   0.68
incearn          -0.36    0.51
noEnglish                         0.53
langIsolated                      0.64
noPhone
onWelfare                -0.21
noHighSchool      0.25   -0.26    0.26
inCity
renter                   -0.36    0.24
noSpouse                 -0.30
multFamilies
newArrival                        0.46
recentMove
white                     0.24   -0.24

Factor1 Factor2 Factor3
Proportion Var    0.13    0.08    0.07
Cumulative Var    0.13    0.22    0.28

The degrees of freedom for the model is 102 and the fit was 0.343


The degrees of freedom for the model is 102 and the fit was 0.343

### Computing A Covariance Matrix for Principal Components Analysis

Principal components analysis, or PCA, is a technique closely related to factor analysis. PCA seeks to find a set of orthogonal axes such that the first axis, or first principal component, accounts for as much variability as possible and subsequent axes or components are chosen to maximize variance while maintaining orthogonality with previous axes. Principal components are typically computed either by a singular value decomposition of the data matrix or an eigenvalue decomposition of a covariance or correlation matrix; the latter permits us to use rxCovCor and its relatives with the standard R function princomp.

As an example, we will use the rxCov function to calculate a covariance matrix for the log of the iris data, and pass this to the princomp function; this will reproduce the example from pages 303-304 of Modern Applied Statistics with S :

#  Computing A Covariance Matrix for Principal Components Analysis

irisLog <- as.data.frame(lapply(iris[,1:4], log))
irisSpecies <- iris[,5]
irisCov <- rxCov(~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
data=irisLog)
irisPca <- princomp(covmat=irisCov, cor=TRUE)
summary(irisPca)


This yields the following:

Importance of components:
Comp.1    Comp.2     Comp.3    Comp.4
Standard deviation     1.7124583 0.9523797 0.36470294 0.1656840
Proportion of Variance 0.7331284 0.2267568 0.03325206 0.0068628
Cumulative Proportion  0.7331284 0.9598851 0.99313720 1.0000000


The default plot method for objects of class princomp is a screeplot, which is a barplot of the variances of the principal components. We can obtain the plot as usual by calling plot with our principal components object:

plot(irisPca)


This yields the following plot:

Another useful bit of output is given by the loadings function, which returns a set of columns showing the linear combinations for each principal component:

loadings(irisPca)

Comp.1 Comp.2 Comp.3 Comp.4
Sepal.Length  0.504 -0.455  0.709  0.191
Sepal.Width  -0.302 -0.889 -0.331
Petal.Length  0.577        -0.219 -0.786
Petal.Width   0.567        -0.583  0.580

Comp.1 Comp.2 Comp.3 Comp.4
Proportion Var   0.25   0.25   0.25   0.25
Cumulative Var   0.25   0.50   0.75   1.00


You may have noticed that we supplied the flag cor=TRUE in the call to princomp; this tells princomp to use the correlation matrix rather than the covariance matrix to compute the principal components. We can obtain the same results by omitting the flag but submitting the correlation matrix as returned by rxCor instead:

irisCor <- rxCor(~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
data=irisLog)
irisPca2 <- princomp(covmat=irisCor)
summary(irisPca2)
plot(irisPca2)


### A Large Data Principal Components Analysis

Stock market data for open, high, low, close, and adjusted close from 1962 to 2010 is available at https://github.com/thebigjc/HackReduce/blob/master/datasets/nyse/daily_prices/NYSE_daily_prices_subset.csv. The full data set includes 9.2 million observations of daily open-high-low-close data for some 2800 stocks. As you might expect, these data are highly correlated, and principal components analysis can be used for data reduction. We read the original data into an .xdf file, NYSE_daily_prices.xdf, using the same process we used in the Getting Started Guide to read our mortgage data (set revoDataDir to the full path to the NYSE directory containing the .csv files when you unpack the download):

#  A Large Data Principal Components Analysis

if (bHasNYSE){

"NYSE_daily_prices_")

nyseXdf <- "NYSE_daily_prices.xdf"
append <- "none"
for (i in LETTERS)
{
importFile <- paste(nyseCsvFiles, i, ".csv", sep="")
rxImport(importFile, nyseXdf, append=append)
append <- "rows"
}


Once we have our .xdf file, we proceed as before:

stockCor <- rxCor(~ stock_price_open + stock_price_high +
stock_price_low + stock_price_close +
stockPca <- princomp(covmat=stockCor)
summary(stockPca)
plot(stockPca)


This yields the following output:

> summary(stockPca)
Importance of components:
Comp.1    Comp.2      Comp.3       Comp.4
Standard deviation     2.0756631 0.8063270 0.197632281 0.0454173922
Proportion of Variance 0.8616755 0.1300327 0.007811704 0.0004125479
Cumulative Proportion  0.8616755 0.9917081 0.999519853 0.9999324005
Comp.5
Standard deviation     1.838470e-02
Proportion of Variance 6.759946e-05
Cumulative Proportion  1.000000e+00

Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
stock_price_open      -0.470 -0.166  0.867
stock_price_high      -0.477 -0.151 -0.276  0.410 -0.711
stock_price_low       -0.477 -0.153 -0.282  0.417  0.704
stock_price_close     -0.477 -0.149 -0.305 -0.811

Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
Proportion Var    0.2    0.2    0.2    0.2    0.2
Cumulative Var    0.2    0.4    0.6    0.8    1.0


The screeplot is shown below:

Between them, the first two principal components explain 99% of the variance; we can therefore replace the five original variables by these two principal components with no appreciable loss of information.

} # End of bHasNYSE


### Ridge Regression

Another application of correlation matrices is to calculate ridge regression, a type of regression that can help deal with multicollinearity and is part of a broader class of models called Penalized Regression Models.

Where the ordinary least squares regression minimizes the sum of squared residuals

ridge regression minimizes the slightly modified sum

The solution to the ridge regression is

where is the model matrix. This is similar to the ordinary least squares regression solution with a “ridge” added along the diagonal.

Since the model matrix is embedded in the correlation matrix, the following function allows us to compute the ridge regression solution:

#  Ridge regression
rxRidgeReg <- function(formula, data, lambda, ...) {
myTerms <- all.vars(formula)
newForm <- as.formula(paste("~", paste(myTerms, collapse = "+")))
myCor <- rxCovCor(newForm, data = data, type = "Cor", ...)
n <- myCor$valid.obs k <- nrow(myCor$CovCor) - 1
bridgeprime <- do.call(rbind, lapply(lambda,
function(l) qr.solve(myCor$CovCor[-1,-1] + l*diag(k), myCor$CovCor[-1,1])))
bridge <-  myCor$StdDevs[1] * sweep(bridgeprime, 2, myCor$StdDevs[-1], "/")
bridge <- cbind(t(myCor$Means[1] - tcrossprod(myCor$Means[-1], bridge)), bridge)
rownames(bridge) <- format(lambda)
return(bridge)
}


The following example shows how ridge regression dramatically improves the solution in a regression involving heavy multicollinearity:

set.seed(14)
x <- rnorm(100)
y <- rnorm(100, mean=x, sd=.01)
z <- rnorm(100, mean=2 + x +y)
data <- data.frame(x=x, y=y, z=z)
lm(z ~ x + y)

Call:
lm(formula = z ~ x + y)

Coefficients:
(Intercept)            x            y
1.827        4.359       -2.584
rxRidgeReg(z ~ x + y, data=data, lambda=0.02)
x         y
0.02 1.827674 0.8917924 0.8723334


You can supply a vector of lambdas as a quick way to compare various choices:

rxRidgeReg(z ~ x + y, data=data, lambda=c(0, 0.02, 0.2, 2, 20))
x           y
0.00 1.827112 4.35917238 -2.58387778
0.02 1.827674 0.89179239  0.87233344
0.20 1.833899 0.81020130  0.80959911
2.00 1.865339 0.44512940  0.44575060
20.00 1.896779 0.08092296  0.08105322


Notice that for (\lambda = 0), the ridge regression is identical to ordinary least squares, while as (\backslash n\lambda \rightarrow \infty), the coefficients of x and y approach 0.