
5 Steps to Understanding Correlation, Auto-Correlation and Partial Auto-Correlation

Pralhad Teggi
Published in Analytics Vidhya
Jan 3, 2020 · 8 min read

In my last story, I explained the concepts of covariance and correlation. In this story, let's understand the concepts of autocorrelation and partial autocorrelation.

We defined correlation as a standardized version of covariance that determines the direction and strength of the relationship between two variables. Correlation adds analytical value in bivariate analysis, but what about when we have univariate data?

1. Univariate Descriptive Statistics

Descriptive statistics provide a summarized view of the data in a sample. In univariate descriptive statistics, we are interested in describing an individual variable in a dataset, using measures of central tendency and measures of dispersion.

Let us consider a sample temperature dataset in which each observation is collected at a regular interval of time. This kind of dataset is a univariate time series. The following screenshot shows the dataset imported into the EViews software tool, along with its descriptive statistics.

In the figure, you can observe that the variable has 10 observations, and the tool reports measures of central tendency such as the mean, median, maximum and minimum. It also reports measures of dispersion such as the standard deviation, skewness and kurtosis. Skewness and kurtosis are important descriptive statistics for describing the distribution of the data: skewness measures the asymmetry of the distribution, and kurtosis measures its "tailedness".
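These summary statistics can be reproduced without EViews. Below is a minimal NumPy sketch, assuming the same 10-observation temperature series used in the code later in the article; the moment-based skewness and kurtosis formulas here are the standard population definitions, and EViews may apply slightly different sample corrections.

```python
import numpy as np

# Temperature series (assumed from the code listing later in the article)
data = np.array([23.2, 23.6, 25.3, 25.2, 25.1, 25.6, 24.6, 24.6, 23.9, 24.1])

mean = data.mean()
median = np.median(data)
z = data - mean
m2 = (z**2).mean()                    # population variance
skewness = (z**3).mean() / m2**1.5    # moment-based skewness
kurtosis = (z**4).mean() / m2**2      # "raw" kurtosis (a normal distribution gives 3)

print(mean, median, data.max(), data.min(), np.sqrt(m2), skewness, kurtosis)
```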

With univariate time series data like this, descriptive statistics give no idea of how the value observed at time t depends on what was observed in the past. We need additional statistics to understand the dependencies between present and past observations.

2. ACF — Autocorrelation Function

Autocorrelation is also called serial correlation. This type of correlation is used to understand how observations in a time series depend on values of the same series at previous times. The past observations in the series are called lags.

In the above dataset, the first column is the original time series, and the second column is the original series shifted by one step; it is called the Lag-1 series of the original data. The length of the Lag-1 series is one less than that of the original time series.
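Constructing a lag series is just an index shift. A small sketch, assuming the temperature series from later in the article:

```python
import numpy as np

data = np.array([23.2, 23.6, 25.3, 25.2, 25.1, 25.6, 24.6, 24.6, 23.9, 24.1])

# Pair each observation at time t with the observation at time t-1
current = data[1:]   # x_t
lag1 = data[:-1]     # x_(t-1), the Lag-1 series

print(len(lag1))  # 9, one less than the original 10 observations
```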

Let's look at the mathematical formulas used to compute the ACF. First compute the correlation coefficient between two time series X and Y, then extend it to compute the correlation of a series with itself.

Now we have computed the ACF at Lag 1, and we can extend the same formula to generalize to higher lag terms.

The value of ACF(Lk) is called the autocorrelation coefficient at lag k. The plot of the sample autocorrelations ACF(Lk) versus k (the time lags) is called the correlogram or autocorrelation plot.

The correlogram is a commonly used tool for checking randomness in a data set. This randomness is ascertained by computing autocorrelations for data values at varying time lags. If random, such autocorrelations should be near zero for any and all time-lag separations. If non-random, then one or more of the autocorrelations will be significantly non-zero.

3. Compute ACF

Let's compute the ACF at Lag 1 for the above temperature dataset, using the steps below.

ACF(Lag k = 1)
1. Compute the mean of the original time series
2. Compute the difference between the original data and the mean for all observations
3. Square the output of step (2)
4. Compute the SUM of the squared differences from step (3)
5. Compute the difference between the Lag-1 series and the mean for the (n-k) observations
6. Compute the element-wise product of the outputs of steps (2) and (5)
7. Compute the SUM of the output of step (6)
8. ACF at Lag 1 = Output(7) / Output(4)
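The eight steps above can be sketched in a few lines of NumPy (the series values are assumed from the code listing later in the article):

```python
import numpy as np

data = np.array([23.2, 23.6, 25.3, 25.2, 25.1, 25.6, 24.6, 24.6, 23.9, 24.1])
k = 1

# Steps 1-2: mean and deviations from the mean
mean = data.mean()
dev = data - mean

# Steps 3-4: sum of squared deviations (the denominator)
denominator = (dev**2).sum()

# Steps 5-7: products of deviations k steps apart, summed (the numerator)
numerator = (dev[k:] * dev[:-k]).sum()

acf_k = numerator / denominator
print(acf_k)  # close to the 0.410 that EViews reports for lag 1
```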

Let's compare our ACF calculation for Lag 1 with the EViews output. The screenshot below shows the correlogram for the above time series.

Our computed ACF value for Lag 1 is 0.4099, and the EViews output is 0.410.

4. ACVF — Autocovariance Function

For a given random process, the autocovariance is a function that gives the covariance of the process with itself at pairs of time points. Let Y be the data series, with each observation collected at a regular interval of time. The expectation of Y is μ = E[Y]. Subtracting μ from each observation of Y gives Z.

Multiplying the vector Z by its transpose gives the (N × N) matrix below.

So the expectation of the above matrix gives the covariance matrix.

The matrix E[Z Tran(Z)] is a variance-covariance matrix: a square matrix that contains the variances and covariances associated with several variables. The diagonal elements contain the variances of the variables, and the off-diagonal elements contain the covariances between all possible pairs of variables.

The auto covariance function (ACVF) at lag k for the time series is defined by

The ACVF at various lags can be read directly off the variance-covariance matrix: the values on the main diagonal are s0, the values on the diagonals immediately above and below the main diagonal are s1, the values two diagonals away are s2, and so on.

Let us compute the autocovariance function (ACVF) for the temperature dataset. Here I have written a simple Python snippet to compute the variance-covariance matrix.

import numpy as np

data = [23.2,23.6,25.3,25.2,25.1,25.6,24.6,24.6,23.9,24.1]
# Create a vector Z for the data set
Z = np.array(data)
# Compute the mean and subtract it from the data
mean = Z.mean()
Z = Z - mean
# Form the outer product of Z with its transpose (Z Z')
Z_Dash = np.transpose(Z)
R = np.dot(Z[:,None], Z_Dash[None,:])

The output looks as below:

From the above 10 × 10 matrix, let us compute the autocovariance function (ACVF) values. The simple Python code below generates these values.

l = len(data)
slist = []
for i in range(0, l):
    s = 0
    for j in range(0, l - i):
        s = s + R[j + i][j]
    slist.append(s)

The generated ACVF values are stored in slist list. Here is the output of slist.
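One useful check: dividing each ACVF value by s0 recovers the ACF, so the lag-1 ratio should reproduce the 0.410 value from the correlogram section. A self-contained sketch of that relationship:

```python
import numpy as np

data = np.array([23.2, 23.6, 25.3, 25.2, 25.1, 25.6, 24.6, 24.6, 23.9, 24.1])
z = data - data.mean()
n = len(z)

# s_k: sum of products of deviations k steps apart (the diagonals of the matrix above)
slist = [(z**2).sum()] + [(z[k:] * z[:-k]).sum() for k in range(1, n)]

# Dividing each s_k by s_0 yields the autocorrelation at lag k
acf = [s / slist[0] for s in slist]
print(acf[1])  # lag-1 ACF, matching the value computed earlier
```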

5. PACF — Partial Autocorrelation Function

We understood the autocorrelation function as the correlation between the observation at the current time and observations at previous times. The partial autocorrelation function (PACF) is also a correlation between the observation at the current time and an observation at a previous time, but it accounts for the fact that both observations are correlated with the observations at intervening times.
For example, today's (31 Dec) stock price can be correlated with the 29 Dec price, and yesterday's (30 Dec) price can also be correlated with the 29 Dec price. The PACF between today's and yesterday's prices is then the real correlation between 31 Dec and 30 Dec after taking out the influence of the 29 Dec price.

Consider one more example. Let us say we have input variables x1, x2 and x3, and an output variable y. The partial correlation between y and x3 is the correlation between the variables y and x3 determined after taking into account how both y and x3 are related to x1 and x2.

To understand how y and x3 are related to x1 and x2, perform two regressions:
1. Regression — predict y from x1 and x2.
2. Regression — predict x3 from x1 and x2.

Each regression leaves a residual: the part of the target variable not explained by x1 and x2. Taking the correlation between these two residuals gives the partial correlation between y and x3.
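The two-regression recipe can be sketched directly with least squares. The variables below are synthetic stand-ins for x1, x2, x3 and y (not data from the article), used only to illustrate the residual-correlation idea:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1, x2 = rng.normal(size=n), rng.normal(size=n)
x3 = 0.5 * x1 + 0.3 * x2 + rng.normal(size=n)
y = 0.4 * x1 - 0.2 * x2 + 0.6 * x3 + rng.normal(size=n)

def residuals(target, predictors):
    # Least-squares fit with an intercept; return what the predictors cannot explain
    X = np.column_stack([np.ones(n)] + predictors)
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    return target - X @ coef

# Partial correlation of y and x3, controlling for x1 and x2:
# correlate the two sets of regression residuals
r_y = residuals(y, [x1, x2])
r_x3 = residuals(x3, [x1, x2])
partial_corr = np.corrcoef(r_y, r_x3)[0, 1]
print(partial_corr)
```

Because y depends on x3 with a positive coefficient, the partial correlation comes out clearly positive even after x1 and x2 are controlled for.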

6. Compute PACF

For the above temperature dataset, let us compute the PACF at Lag 2. Now we have to solve the equation below.

Here I am following the Yule-Walker equations method to compute the PACF values. At a high level, the method looks as below.

I have written a simple Python snippet to compute the PACF for various lags. Let us compute the PACF at Lag 2 and compare it with the output of EViews.

lag = 2
acvf_lags = slist[1:lag+1]
mat = np.zeros((lag, lag))
for i in range(0, lag):
    for j in range(0, lag - i):
        mat[j + i][j] = slist[i]
        mat[j][j + i] = slist[i]

ainv = np.linalg.inv(mat)
result = np.matmul(ainv, acvf_lags)

In the above code I compute the matrix product of the inverse of the autocovariance matrix and the ACVF values. The output of the matrix multiplication is shown below.

The bottom value, -0.289, is PACF(2). The EViews output is essentially the same, -0.290. Similarly, the PACF can be computed for various lags.
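The Lag-2 calculation above generalizes to any lag by growing the Toeplitz matrix of autocovariances. A self-contained sketch (using np.linalg.solve instead of the explicit inverse, which is equivalent here but numerically preferable):

```python
import numpy as np

data = np.array([23.2, 23.6, 25.3, 25.2, 25.1, 25.6, 24.6, 24.6, 23.9, 24.1])
z = data - data.mean()
s = [(z**2).sum()] + [(z[k:] * z[:-k]).sum() for k in range(1, len(z))]

def pacf_yw(max_lag):
    # Solve the Yule-Walker system at each lag; PACF(k) is the last coefficient
    out = []
    for k in range(1, max_lag + 1):
        toeplitz = np.array([[s[abs(i - j)] for j in range(k)] for i in range(k)])
        phi = np.linalg.solve(toeplitz, np.array(s[1:k + 1]))
        out.append(phi[-1])
    return out

print(pacf_yw(3))  # PACF(1) equals ACF(1); PACF(2) is close to the -0.290 above
```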

Conclusion

Covariance, correlation, autocovariance, autocorrelation and partial autocorrelation are important topics in data analytics and should be well understood. They also play a key role in time series analysis.
