
5 Steps: To Get an Understanding of Covariance and Correlation

Pralhad Teggi
Published in Analytics Vidhya
7 min read · Nov 22, 2019

As more trees are cut down, global warming increases. In this statement there are two variables: the number of trees and global warming. They are associated with each other, making them statistically dependent. To measure the association between such statistical variables, we need mathematical tools.

With this introduction, here I am writing a short story to build an understanding of covariance and correlation.

1. Let's begin

Let me start with a simple question: what is the similarity between the numbers 5 and 6? Even though these two numbers are adjacent on the number line, to answer this question we have to look into their properties. The number 5 is odd and prime, whereas 6 is even and has the factors 2 and 3. This leads to the conclusion that these two numbers are not similar to each other.

Now let me add two more numbers to the situation. What is the similarity between [5, 3] and [6, 2]? Now they are not just numbers but lists. The properties to consider here are the mean, the variance, and the standard deviation.

import numpy as np

x = np.array([5, 3])
y = np.array([6, 2])
print("Mean(x)=", x.mean(), "Mean(y)=", y.mean())
print("Variance(x)=", x.var(), "Variance(y)=", y.var())
print("SD(x)=", x.std(), "SD(y)=", y.std())

Here is the output of the code :

Mean(x)= 4.0 Mean(y)= 4.0
Variance(x)= 1.0 Variance(y)= 4.0
SD(x)= 1.0 SD(y)= 2.0

Each of these lists has the same mean, namely 4.0. However, they have different standard deviations. The larger the standard deviation, the more spread out the data are. In this case, the second list is more spread out than the first.

2. Understanding the linear relationship

The equation y = 2x + c is linear, and the relationship between the variables x and y is also linear. A linear relationship is one where a change in one variable causes a proportional change in the other: every unit increase in x produces the same constant increase in y. In the special case c = 0, doubling one variable doubles the other as well.
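A tiny sketch of this proportionality, with a hypothetical function f standing in for y = 2x (the c = 0 case):

```python
# Linear relationship y = 2x (intercept c = 0): scaling x by a
# factor n scales y by the same factor n.
def f(x):
    return 2 * x

print(f(3), f(6))  # 6 12 -- doubling x doubles y
```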

Covariance determines the direction of the linear relationship between two variables. The direction can be positive, negative, or zero.

In the example below, there are six data points, and the covariance is computed in tabular form. The result is 130.2, indicating a positive linear relationship.
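The tabular computation can be reproduced with NumPy. The six (x, y) points below are the same ones used in the correlation-matrix example later in this story; np.cov normalizes by n − 1 by default, which gives the sample covariance of 130.2.

```python
import numpy as np

# The six data points (same ones used in the correlation-matrix
# example later in this story).
x = np.array([192, 218, 197, 192, 198, 191])
y = np.array([218, 251, 221, 219, 223, 218])

# Sample covariance: sum((xi - x_bar) * (yi - y_bar)) / (n - 1).
cov_xy = np.cov(x, y)[0, 1]
print(cov_xy)  # 130.2
```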

The figure below is a scatter plot of the same data points. It shows that the direction of the linear relationship is positive.

3. Comparing Covariances

Taking the same example, let us add one more attribute, z, and evaluate the covariances of (x, y) and (x, z).

The covariance of (x, z) is 5910.8, which is a much larger value than the covariance of (x, y), 130.2.
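A quick check with NumPy, using the z values from the correlation-matrix example later in this story:

```python
import numpy as np

x = np.array([192, 218, 197, 192, 198, 191])
z = np.array([6200, 5777, 4888, 4983, 5888, 2000])

# Sample covariance of (x, z); np.cov normalizes by n - 1 by default.
cov_xz = np.cov(x, z)[0, 1]
print(cov_xz)  # 5910.8
```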

Does this mean that the two attributes x and z have a better linear relationship than x and y?

To answer this question, let me list out a few points —

  • Covariance is computed from a product of two variables, so its unit is the product of their units. The covariances of (x, y) and (x, z) therefore have different units, and it does not make sense to compare the two directly. It is like comparing two distances, one measured in miles and the other in kilometers: they need a conversion before a comparison.
  • How can we bring these products of two units onto the same scale? We can make them unit-less by dividing each covariance by the product of the corresponding standard deviations. For example, the covariance of (x, y) is divided by sd(x) and sd(y): sd(x) has the same unit as (xi − x̄), and sd(y) has the same unit as (yi − ȳ).

Let's go back to our example and calculate the standard deviations of x, y, and z.

Now divide the covariances of (x, y) and (x, z) by the standard deviations, as below.
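This division can be sketched in NumPy. The x, y, and z arrays are the six data points from the correlation-matrix example later in this story; ddof=1 gives sample standard deviations, matching the n − 1 normalization of the sample covariances above.

```python
import numpy as np

x = np.array([192, 218, 197, 192, 198, 191])
y = np.array([218, 251, 221, 219, 223, 218])
z = np.array([6200, 5777, 4888, 4983, 5888, 2000])

# Divide each sample covariance by the product of the sample
# standard deviations (ddof=1 matches np.cov's n - 1 normalization).
r_xy = np.cov(x, y)[0, 1] / (x.std(ddof=1) * y.std(ddof=1))
r_xz = np.cov(x, z)[0, 1] / (x.std(ddof=1) * z.std(ddof=1))
print(round(r_xy, 3), round(r_xz, 3))  # approximately 0.989 and 0.376
```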

Now the result answers our previous question: the attribute pair (x, y) has a better positive linear relationship than the attribute pair (x, z). Instead of "better", I would prefer to say "stronger".

How about standardizing the data set first, and then comparing just the covariances to understand the strength and direction of the linear relationship?

Please comment your thoughts.

4. Correlation

Most of the time in data science, "the correlation coefficient" refers by default to the Pearson product-moment correlation coefficient. In statistics there are four common types of correlation: Pearson correlation, Kendall rank correlation, Spearman correlation, and point-biserial correlation. In this story, we discuss only the Pearson correlation.

Correlation is the standardized form of covariance, obtained by dividing the covariance by the standard deviation of each variable. In the previous step, we divided the covariance of (x, y) by sd(x) and sd(y) to get the correlation coefficient.
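As a quick sanity check, using the same six (x, y) points from the correlation-matrix example, the manual division matches NumPy's built-in np.corrcoef:

```python
import numpy as np

x = np.array([192, 218, 197, 192, 198, 191])
y = np.array([218, 251, 221, 219, 223, 218])

# Pearson correlation two ways: manual standardization of the
# sample covariance, and NumPy's built-in helper.
manual = np.cov(x, y)[0, 1] / (x.std(ddof=1) * y.std(ddof=1))
builtin = np.corrcoef(x, y)[0, 1]
print(np.isclose(manual, builtin))  # True
```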

As said, the correlation coefficient is unit-less and ranges between −1 and +1. It is used to find how strong the relationship between two attributes is. The formula returns a value between −1 and +1, where:

  • +1 indicates a perfect positive linear relationship.
  • −1 indicates a perfect negative linear relationship.
  • 0 indicates no linear relationship at all.

A value of ±1 indicates a perfect degree of linear association between the two variables. As the correlation coefficient moves towards 0, the relationship between the two variables becomes weaker.

5. Implementation of Correlation in Python

Let us work out a few examples to get a better understanding. In the code below, I generate two random arrays and compute the correlation coefficient.

import numpy as np
import matplotlib
import matplotlib.pyplot as plt

matplotlib.style.use('ggplot')

np.random.seed(1)
x = np.random.randint(0, 50, 1000)
y = x + np.random.normal(0, 10, 1000)
print(np.corrcoef(x, y))

plt.scatter(x, y)
plt.show()

The output shows that the linear relationship between x and y is strong and its direction is positive.

In the code below, I generate the random arrays such that as one increases, the other decreases.

import numpy as np
import matplotlib
import matplotlib.pyplot as plt

matplotlib.style.use('ggplot')

np.random.seed(1)
x = np.random.randint(0, 50, 1000)
y = 100 - x + np.random.normal(0, 5, 1000)
print(np.corrcoef(x, y))

plt.scatter(x, y)
plt.show()

The output shows that the linear relationship between x and y is strong and its direction is negative.

In the code below, I generate the random arrays such that there is no linear relationship between the two.

import numpy as np
import matplotlib
import matplotlib.pyplot as plt

matplotlib.style.use('ggplot')

np.random.seed(1)
x = np.random.randint(0, 50, 1000)
y = np.random.randint(0, 50, 1000)
print(np.corrcoef(x, y))

plt.scatter(x, y)
plt.show()

The output shows that there is NO linear relationship between x and y.

Correlation matrix

A correlation matrix is a table showing correlation coefficients between variables. A correlation matrix is used to summarize data, as an input into a more advanced analysis, and as a diagnostic for advanced analyses.

Let's compute the correlation matrix for the above data set with the three attributes x, y, and z.

import pandas as pd

dataframe = pd.DataFrame({'X': [192, 218, 197, 192, 198, 191],
                          'Y': [218, 251, 221, 219, 223, 218],
                          'Z': [6200, 5777, 4888, 4983, 5888, 2000]})
print(dataframe.corr())

The correlation matrix is printed as below (values rounded to three decimals):

       X      Y      Z
X  1.000  0.989  0.376
Y  0.989  1.000  0.319
Z  0.376  0.319  1.000

We can also plot the correlation matrix, as below:

import pandas as pd
import matplotlib.pyplot as plt

dataframe = pd.DataFrame({'X': [192, 218, 197, 192, 198, 191],
                          'Y': [218, 251, 221, 219, 223, 218],
                          'Z': [6200, 5777, 4888, 4983, 5888, 2000]})
plt.matshow(dataframe.corr())
plt.xticks(range(len(dataframe.columns)), dataframe.columns)
plt.yticks(range(len(dataframe.columns)), dataframe.columns)
plt.colorbar()
plt.show()

Here is the output —

Conclusion

Correlation is a most helpful and widely used statistical concept in data analysis, and especially in regression analysis. Correlation explains only the strength and direction of a linear relationship; it does not explain the causal relationship between the variables, nor which variable is the cause and which is the effect.
Correlation gives a summary statistic of the data, from which the journey begins to understand the complete story of the relationships in the data.

If you like this article, please encourage me by following, and don't forget to give a clap… or claps. If you have any questions, write a comment. Until then, happy learning.


Working in Micro Focus, Bangalore, India (14+ Years). Research interests include data science and machine learning.