
Keeping an eye on confounds: a walk through for calculating a partial correlation matrix

An R demo illustrating two approaches for calculating partial correlation matrices

Alex daSilva
Towards Data Science
4 min read · Jul 26, 2019


Don’t forget about potential confounding variables! Photo from Wikimedia Commons

One of the most common steps analysts perform after data munging/pre-processing is to run a correlation analysis to check the pairwise associations among variables in a standardized way. It’s a quick and pretty effortless way to get a feel for your data. However, looking at the relationship between two variables in isolation can be misleading if other confounding variables lurking in your data set are related to both variables of interest.

Thus, calculating a matrix of partial correlations may be a better option in some circumstances. Such a matrix will allow one to quickly measure and view the association between 2 variables while controlling for all the other variables in the data set. Below, we’ll go over 2 ways to calculate a partial correlation matrix from scratch and also illustrate how to do so using the “ppcor” package.

For this illustration, we’ll use part of the built-in airquality data set, which contains daily air quality measurements for New York during part of 1973.

Manhattan in 1970. Photo from Wikimedia Commons
Overview of the “airquality” data set
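As a starting sketch, the built-in airquality data can be subset to the four measurement columns and rows with missing values dropped (Ozone and Solar.R contain NAs); the object name aq is just an assumption for the examples that follow:

```r
# Subset the built-in airquality data to the four measurements and
# drop rows with missing values (Ozone and Solar.R contain NAs)
aq <- na.omit(airquality[, c("Ozone", "Solar.R", "Wind", "Temp")])

str(aq)      # complete observations of 4 variables
summary(aq)
```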

To start, we’ll run a correlation analysis to check out the relationships between these quality measurements. To touch on a few of the results, we can see that higher levels of ozone are associated with increased temperatures, while higher wind levels are associated with lower levels of ozone. Wind and solar radiation are nearly orthogonal to each other.
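The correlation matrix itself can be computed with base R’s cor() (a sketch, assuming the NA-free subset described above):

```r
aq <- na.omit(airquality[, c("Ozone", "Solar.R", "Wind", "Temp")])

# Pairwise Pearson correlations among the four measurements
round(cor(aq), 2)
```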

We can make an improved visualization using the “corrplot” package.
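A minimal corrplot sketch, guarded in case the package isn’t installed (the particular styling arguments here are my own choices, not from the original post):

```r
# Improved visualization with corrplot
# (install.packages("corrplot") first if needed)
if (requireNamespace("corrplot", quietly = TRUE)) {
  aq <- na.omit(airquality[, c("Ozone", "Solar.R", "Wind", "Temp")])
  corrplot::corrplot(cor(aq), method = "color", type = "upper",
                     addCoef.col = "black", tl.col = "black")
}
```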

Now that we’ve looked at the correlation structure, let’s move on to calculating a partial correlation matrix. The idea here is the same as in multiple linear regression, where we attempt to “control” for other confounding variables (say, X2 and X3) while looking at the effect of a predictor variable of interest, X1, on the outcome variable, Y. The nice thing about a partial correlation matrix is that one can get all these associations in a standardized matrix for easy viewing.

The first approach we’ll cover for calculating a partial correlation matrix is to fit, for each pair of variables in the data set, two associated linear regression models in which each member of the pair is regressed on all the remaining variables. We then calculate the correlation between the residuals of the two models. To clarify, let’s say we want to calculate the partial correlation between Ozone and Solar.R.

We’d fit 2 models:
1) Model 1: Ozone ~ Wind + Temp
2) Model 2: Solar.R ~ Wind + Temp

And correlate the residuals to calculate the partial correlation, which comes out to be 0.242.
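A sketch of this two-model approach in base R (assuming the NA-free subset described earlier):

```r
aq <- na.omit(airquality[, c("Ozone", "Solar.R", "Wind", "Temp")])

# Regress each variable of the pair on the remaining (control) variables
m1 <- lm(Ozone ~ Wind + Temp, data = aq)
m2 <- lm(Solar.R ~ Wind + Temp, data = aq)

# The partial correlation is the correlation between the two residual vectors
round(cor(resid(m1), resid(m2)), 3)
```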

We’ll generalize this to find partial correlations in a pairwise fashion in the airquality data set below.
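One way to sketch that generalization in base R (the helper name partial_cor_residuals is hypothetical, not from the original post):

```r
# Build a partial correlation matrix by looping over variable pairs:
# regress each member of the pair on all remaining variables and
# correlate the two residual vectors
partial_cor_residuals <- function(df) {
  vars <- names(df)
  p <- length(vars)
  out <- diag(p)
  dimnames(out) <- list(vars, vars)
  for (i in seq_len(p - 1)) {
    for (j in (i + 1):p) {
      controls <- vars[-c(i, j)]
      r1 <- resid(lm(reformulate(controls, response = vars[i]), data = df))
      r2 <- resid(lm(reformulate(controls, response = vars[j]), data = df))
      out[i, j] <- out[j, i] <- cor(r1, r2)
    }
  }
  out
}

aq <- na.omit(airquality[, c("Ozone", "Solar.R", "Wind", "Temp")])
round(partial_cor_residuals(aq), 3)
```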

We can look at the correlation and partial correlation matrices side by side to see how the coefficients have changed. All coefficients shrink in magnitude, which is not surprising considering we are “controlling” for the other variables in the data set, but the reduction is more pronounced in some cases. For example, the correlation between Temp and Wind is -0.5, while the partial correlation between Temp and Wind, accounting for Solar.R and Ozone, is only -0.13.

The second method we’ll look at for calculating partial correlations uses matrix algebra and is a bit cleaner.

Working from a correlation matrix R, we first need to find the anti-image covariance matrix, AICOV. To find AICOV, we need D, a diagonal matrix whose entries are the reciprocals of the diagonal elements of R⁻¹. With D, we can calculate AICOV as follows: AICOV = DR⁻¹D.

With the AICOV matrix, we can calculate the anti-image correlation matrix, AICOR. Here, D is a diagonal matrix holding the diagonal of AICOV, and we rescale by its inverse square root: AICOR = D⁻¹/² AICOV D⁻¹/².

The off-diagonal elements of AICOR are the negatives of the partial correlations, so multiplying the off-diagonal elements through by -1 yields the partial correlation matrix. Below is a function that can take in a data set or a correlation matrix and return a partial correlation matrix.
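A sketch of such a function (partial_cor_matrix is a hypothetical name; it accepts either a data frame or an existing correlation matrix):

```r
# Matrix-algebra route to the partial correlation matrix
partial_cor_matrix <- function(x) {
  R <- if (is.data.frame(x)) cor(x) else x
  Rinv <- solve(R)                      # R^-1
  D <- diag(1 / diag(Rinv))             # reciprocals of diag(R^-1)
  AICOV <- D %*% Rinv %*% D             # anti-image covariance matrix
  Dhalf <- diag(1 / sqrt(diag(AICOV)))  # inverse square root of diag(AICOV)
  AICOR <- Dhalf %*% AICOV %*% Dhalf    # anti-image correlation matrix
  P <- -AICOR                           # negate off-diagonals
  diag(P) <- 1
  dimnames(P) <- dimnames(R)
  P
}

aq <- na.omit(airquality[, c("Ozone", "Solar.R", "Wind", "Temp")])
round(partial_cor_matrix(aq), 3)
```

The off-diagonal entries agree with the residual-based method above, which is a useful sanity check.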

Finally, to ensure that all of the prior calculations are correct, we can use the “ppcor” package and its “pcor” function to easily calculate partial correlation coefficients and obtain metrics for significance testing.
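A minimal usage sketch, guarded in case ppcor isn’t installed:

```r
# Cross-check with ppcor's pcor()
# (install.packages("ppcor") first if needed)
if (requireNamespace("ppcor", quietly = TRUE)) {
  aq <- na.omit(airquality[, c("Ozone", "Solar.R", "Wind", "Temp")])
  res <- ppcor::pcor(aq)
  print(round(res$estimate, 3))  # partial correlation matrix
  print(round(res$p.value, 3))   # p-values for significance testing
}
```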

In this demo, we looked at two different ways to calculate partial correlations from scratch and observed how drastically the coefficients we previously saw in the correlation matrix can change once the other variables in the data set are accounted for.

Hopefully this was useful and, if appropriate, you’ll check out the partial correlations when taking a look at your data!


Social Science and Data Science Nerd | PhD Candidate in Psychological and Brain Sciences at Dartmouth College | https://www.linkedin.com/in/alex-w-dasil/