Computing The Pearson Correlation Coefficient
$$r = \frac{s_{XY}}{s_X s_Y} \qquad (10.1)$$

where $s_{XY}$ is the covariance of X and Y, and $s_X$ and $s_Y$ are their standard deviations. Written out in terms of deviations from the means, formula (10.1) becomes:

$$r = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n}(X_i - \bar{X})^2}\,\sqrt{\sum_{i=1}^{n}(Y_i - \bar{Y})^2}} \qquad (10.2)$$
Consider the Ad Spending example at the start of this chapter. Many of the (X, Y) points are simultaneously above average, since companies that have higher than average Advertising Spending also have higher than average Impressions. Both $(X - \bar{X})$ and $(Y - \bar{Y})$ are positive for these companies. Therefore, the product $(X - \bar{X})(Y - \bar{Y})$ is positive for these companies. Most of the remaining companies have lower than average Spending and lower than average Impressions. Both $(X - \bar{X})$ and $(Y - \bar{Y})$ are negative for these companies, but the product $(X - \bar{X})(Y - \bar{Y})$ is still positive! Hence the numerator in (10.2) tends to be a large positive number for the Ad Spending data.
If the points were sloped downwards, then high X-values would tend to go with low Y-values, and the product $(X - \bar{X})(Y - \bar{Y})$ would be negative for these points. This is partly how the correlation formula (10.2) works. The denominator terms have been put in to ensure that r does not go beyond −1 or +1.
The Pearson correlation coefficient can be calculated by hand or on a graphing calculator such as the TI-89.
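The by-hand calculation can be sketched in a few lines (a minimal sketch; the spending and impressions figures are invented for illustration):

```python
# Formula (10.2): Pearson's r as the sum of deviation products divided
# by the product of the deviation "lengths".

def pearson_r(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Numerator: sum of (x - x̄)(y - ȳ), positive when points are
    # jointly above or jointly below average.
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Denominator: keeps r within [-1, +1].
    den = (sum((x - mean_x) ** 2 for x in xs)
           * sum((y - mean_y) ** 2 for y in ys)) ** 0.5
    return num / den

spending = [1, 2, 3, 4, 5]      # hypothetical ad spending
impressions = [2, 4, 5, 4, 6]   # hypothetical impressions
print(round(pearson_r(spending, impressions), 3))   # 0.853
```

A perfectly linear pair such as `pearson_r([1, 2, 3], [2, 4, 6])` returns exactly 1.0.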
What are the Possible Values for the Pearson Correlation?
The result will be between −1 and 1. You will very rarely see exactly 0, −1, or 1; you'll usually get a number somewhere in between. The closer the value of r gets to zero, the greater the variation of the data points around the line of best fit.
High correlation: 0.5 to 1.0 or −0.5 to −1.0.
Medium correlation: 0.3 to 0.5 or −0.3 to −0.5.
Low correlation: 0.1 to 0.3 or −0.1 to −0.3.
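These rule-of-thumb ranges can be captured in a small helper (a sketch; the function name and the "negligible" label for values below 0.1 are my additions, not part of the lesson):

```python
def correlation_strength(r):
    """Rough rule-of-thumb labels from the ranges above, based on |r|."""
    magnitude = abs(r)
    if magnitude >= 0.5:
        return "high"
    if magnitude >= 0.3:
        return "medium"
    if magnitude >= 0.1:
        return "low"
    return "negligible"   # below all the listed ranges (my label)

print(correlation_strength(-0.8))   # high
print(correlation_strength(0.2))    # low
```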
8. More data would be needed, but only three samples are shown for purposes of example.
9. Step two: Complete the chart using basic multiplication of the variable values.

Person | Age (x) | Score (y) | xy | x^2 | y^2
12. Step five: Once you complete the formula above by plugging in all the correct values, the result is your coefficient value. A negative value indicates a negative correlation and a positive value indicates a positive correlation; the magnitude indicates the strength of the relationship. Note: The above examples only show data for three people, but the ideal sample size to calculate a Pearson correlation coefficient is more than ten people.
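The chart columns feed the usual computational form of r (a sketch with invented three-person data; this assumes the elided "formula above" is the standard form built from the column sums Σx, Σy, Σxy, Σx², Σy²):

```python
from math import sqrt

# Invented three-person sample, as in the chart above.
ages = [20, 30, 40]     # x
scores = [3, 5, 6]      # y

n = len(ages)
sx = sum(ages)                                  # Σx
sy = sum(scores)                                # Σy
sxy = sum(x * y for x, y in zip(ages, scores))  # Σxy column
sxx = sum(x * x for x in ages)                  # Σx² column
syy = sum(y * y for y in scores)                # Σy² column

# Computational form of Pearson's r from the column totals.
r = (n * sxy - sx * sy) / sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))
print(round(r, 3))   # 0.982
```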
13. Examples
14. Let's say you were analyzing the relationship between your participants' age and reported level of income. You're curious whether there is a positive or negative relationship between someone's age and their income level. After conducting the test, your Pearson correlation coefficient value is +.20. Therefore, you would have a positive but weak correlation between the two variables: by the ranges above, a value of .20 indicates a low correlation. In other words, as people grow older, their income tends to increase slightly, but the relationship is not strong.
15. Perhaps you were interested in learning more about the relationship strength of your participants' anxiety score and the number of hours they work each week. After conducting the test, your Pearson correlation coefficient value is −.80. Therefore, you would have a negative correlation between the two variables, and the strength of the relationship would be strong: by the ranges above, a magnitude of .80 indicates a high correlation. You could confidently conclude there is a strong negative correlation between one's anxiety score and how many hours a week they report working. Those who scored high on anxiety would tend to report fewer hours of work per week, while those who scored lower on anxiety would tend to report more hours of work each week.
16. Significance
17. A discussion of the Pearson correlation coefficient wouldn't be complete if we didn't talk about statistical significance. When conducting statistical tests, you must also assess statistical significance in order to judge how likely it is that your result arose by chance rather than reflecting a real relationship.
You need two variables that are either ordinal, interval or ratio (see our Types of Variable guide if you need clarification).
Although you would normally hope to use a Pearson product-moment correlation on interval or ratio data, the Spearman
correlation can be used when the assumptions of the Pearson correlation are markedly violated. However, Spearman's
correlation determines the strength and direction of the monotonic relationship between your two variables rather than
the strength and direction of the linear relationship between your two variables, which is what Pearson's correlation
determines.
A monotonic relationship is a relationship that does one of the following: (1) as the value of one variable increases, so
does the value of the other variable; or (2) as the value of one variable increases, the other variable value decreases.
Examples of monotonic and non-monotonic relationships are presented in the diagram below:
Why is a monotonic relationship important to Spearman's correlation?
Spearman's correlation measures the strength and direction of monotonic association between two variables.
A monotonic relationship is "less restrictive" than a linear relationship. For example, the middle image above shows a
relationship that is monotonic, but not linear.
A monotonic relationship is not strictly an assumption of Spearman's correlation. That is, you can run a Spearman's
correlation on a non-monotonic relationship to determine if there is a monotonic component to the association.
However, you would normally pick a measure of association, such as Spearman's correlation, that fits the pattern of the
observed data. That is, if a scatterplot shows that the relationship between your two variables looks monotonic you would
run a Spearman's correlation because this will then measure the strength and direction of this monotonic relationship. On
the other hand if, for example, the relationship appears linear (assessed via scatterplot) you would run a Pearson's
correlation because this will measure the strength and direction of any linear relationship. You will not always be able to
visually check whether you have a monotonic relationship, so in this case, you might run a Spearman's correlation
anyway.
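The distinction can be illustrated with a short sketch (invented data; y = x³ is monotonic but not linear, so Spearman's coefficient on the ranks is exactly +1 while Pearson's on the raw values is below +1):

```python
def rank(values):
    # Simple ranking, 1 = smallest; fine here because there are no ties.
    # (Ranking both variables in the same direction leaves the
    # correlation between ranks unchanged.)
    order = sorted(values)
    return [order.index(v) + 1 for v in values]

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (sum((x - mx) ** 2 for x in xs) *
           sum((y - my) ** 2 for y in ys)) ** 0.5
    return num / den

xs = [1, 2, 3, 4, 5]
ys = [x ** 3 for x in xs]                     # 1, 8, 27, 64, 125
print(round(pearson(xs, ys), 3))              # 0.943: strong but not 1
print(round(pearson(rank(xs), rank(ys)), 3))  # 1.0: perfectly monotonic
```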
In some cases your data might already be ranked, but often you will find that you need to rank the data yourself (or
use SPSS Statistics to do it for you). Thankfully, ranking data is not a difficult task and is easily achieved by working
through your data in a table. Let us consider the following example data regarding the marks achieved in a maths and
English exam:
Marks achieved in each exam:

English: 56, 75, 45, 71, 61, 64, 58, 80, 76, 61
Maths: 66, 70, 40, 60, 65, 56, 59, 77, 67, 63
First, create a table with four columns and label them as below:

English (mark) | Maths (mark) | Rank (English) | Rank (Maths)
56 | 66 | 9 | 4
75 | 70 | 3 | 2
45 | 40 | 10 | 10
71 | 60 | 4 | 7
61 | 65 | 6.5 | 5
64 | 56 | 5 | 9
58 | 59 | 8 | 8
80 | 77 | 1 | 1
76 | 67 | 2 | 3
61 | 63 | 6.5 | 6
You need to rank the scores for maths and English separately. The score with the highest value should be labelled "1"
and the lowest score should be labelled "10" (if your data set has more than 10 cases then the lowest score will be how
many cases you have). Look carefully at the two individuals that scored 61 in the English exam (highlighted in bold).
Notice their joint rank of 6.5. This is because when you have two identical values in the data (called a "tie"), you need to
take the average of the ranks that they would have otherwise occupied. We do this because, in this example, we have no
way of knowing which score should be put in rank 6 and which score should be ranked 7. Therefore, you will notice that
the ranks of 6 and 7 do not exist for English. These two ranks have been averaged ((6 + 7)/2 = 6.5) and assigned to each
of these "tied" scores.
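The ranking procedure, including the tie-averaging step, can be sketched as follows (`rank_scores` is a hypothetical helper name; the data is the English column above):

```python
def rank_scores(scores):
    """Rank 1 for the highest score; tied values share the average of
    the ranks they would otherwise occupy."""
    ordered = sorted(scores, reverse=True)          # highest first
    ranks = {}
    for value in set(scores):
        # Positions this value occupies in the sorted order (1-based).
        positions = [i + 1 for i, v in enumerate(ordered) if v == value]
        ranks[value] = sum(positions) / len(positions)  # tie average
    return [ranks[s] for s in scores]

english = [56, 75, 45, 71, 61, 64, 58, 80, 76, 61]
print(rank_scores(english))
# Matches the English rank column above, including the two 6.5s
# for the tied scores of 61.
```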
There are two methods to calculate Spearman's correlation, depending on whether (1) your data does not have tied ranks or (2) your data has tied ranks. The formula for when there are no tied ranks is:

$$\rho = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)}$$

where $d_i$ = difference in paired ranks and $n$ = number of cases. The formula to use when there are tied ranks is the Pearson formula applied to the ranks:

$$\rho = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2}}$$

where $x_i$ and $y_i$ are the paired ranks.
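The no-ties formula translates directly into code (a minimal sketch; the example rankings are invented):

```python
def spearman_rho(rank_x, rank_y):
    """Spearman's rho for tie-free ranks: 1 - 6*Σd² / (n(n² - 1))."""
    n = len(rank_x)
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

# Perfectly reversed rankings give rho = -1; identical rankings give +1.
print(spearman_rho([1, 2, 3, 4], [4, 3, 2, 1]))   # -1.0
print(spearman_rho([1, 2, 3, 4], [1, 2, 3, 4]))   # 1.0
```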
Spearman's Rank correlation coefficient is used to identify and test the strength of a relationship between two sets of data. It is often used as a statistical method to aid with either proving or disproving a hypothesis, e.g. that the depth of a river does not progressively increase the further one moves from the river bank.
The Spearman's Rank Correlation Coefficient is used to discover the strength of a link between two sets of data. This example looks at the
strength of the link between the price of a convenience item (a 50cl bottle of water) and distance from the Contemporary Art Museum in El
Raval, Barcelona.
Example: The hypothesis tested is that prices should decrease with distance from the key area of gentrification surrounding the
Contemporary Art Museum. The line followed is Transect 2 in the map below, with continuous sampling of the price of a 50cl bottle water at
every convenience store.
Map to show the location of environmental gradients for transect lines in El Raval, Barcelona
Hypothesis
We might expect to find that the price of a bottle of water decreases as distance from the Contemporary Art Museum increases. Higher
property rents close to the museum should be reflected in higher prices in the shops.
The hypothesis might be written like this:
The price of a convenience item decreases as distance from the Contemporary Art Museum increases.
The more objective scientific research method is always to assume that no such price-distance relationship exists and to express the null
hypothesis as:
there is no significant relationship between the price of a convenience item and distance from the Contemporary Art Museum.
What can go wrong?
Having decided upon the wording of the hypothesis, you should consider whether there are any other factors that may influence the study.
Some factors that may influence prices may include:
The type of retail outlet. You must be consistent in your choice of retail outlet. For example, bars and restaurants often charge
significantly more for water than a convenience store. You should decide which type of outlet to use and stick with it for all your data
collection.
Some shops have different prices for the same item: a high tourist and lower local price, dependent upon the shopkeeper's perception
of the customer.
Shops near main roads may charge more than shops in less accessible back streets, due to the higher rents demanded for main road
retail sites.
The positive spread effects from other nearby areas of gentrification or from competing areas of tourist attraction.
The negative spread effects from nearby areas of urban decay.
Higher prices may be charged during the summer when demand is less flexible, making seasonal comparisons less reliable.
Cumulative sampling may distort the expected price-distance gradient if several shops cluster within a short area along the transect
line followed by a considerable gap before the next group of retail outlets.
You should mention such factors in your investigation.
Data collected (see data table below) suggests a fairly strong negative relationship as shown in this scatter graph:
Scatter graph to show the change in the price of a convenience item with distance from the Contemporary Art Museum. Roll
over image to see trend line.
The scatter graph shows the possibility of a negative correlation between the two variables and the Spearman's rank correlation technique
should be used to see if there is indeed a correlation, and to test the strength of the relationship.
Spearman’s Rank correlation coefficient
A correlation can easily be drawn as a scatter graph, but the most precise way to compare several pairs of data is to use a statistical test -
this establishes whether the correlation is really significant or if it could have been the result of chance alone.
Spearman’s Rank correlation coefficient is a technique which can be used to summarise the strength and direction (negative or positive) of a
relationship between two variables.
The result will always be between −1 and +1.
Method - calculating the coefficient
Create a table from your data.
Rank the two data sets. Ranking is achieved by giving the ranking '1' to the biggest number in a column, '2' to the second biggest
value and so on. The smallest value in the column will get the lowest ranking. This should be done for both sets of measurements.
Tied scores are given the mean (average) rank. For example, the three tied scores of 1 euro in the example below are ranked fifth in
order of price, but occupy three positions (fifth, sixth and seventh) in a ranking hierarchy of ten. The mean rank in this case is calculated
as (5+6+7) ÷ 3 = 6.
Find the difference in the ranks (d): This is the difference between the ranks of the two values on each row of the table. The rank of
the second value (price) is subtracted from the rank of the first (distance from the museum).
Square the differences (d²) to remove negative values, and then sum them (Σd²).
Convenience Store | Distance from CAM (m) | Rank distance | Price of 50cl bottle (€) | Rank price | Difference between ranks (d) | d²
1 | 50 | 10 | 1.80 | 2 | 8 | 64
3 | 270 | 8 | 2.00 | 1 | 7 | 49
4 | 375 | 7 | 1.00 | 6 | 1 | 1
5 | 425 | 6 | 1.00 | 6 | 0 | 0
7 | 710 | 4 | 0.80 | 9 | −5 | 25
8 | 790 | 3 | 0.60 | 10 | −7 | 49
9 | 890 | 2 | 1.00 | 6 | −4 | 16
10 | 980 | 1 | 0.85 | 8 | −7 | 49

Σd² = 285.5
In the example, plugging Σd² = 285.5 and n = 10 into the formula gives a coefficient of −0.73. A coefficient of this magnitude is significant at slightly less than the 5% level: the probability that a relationship this strong arose by chance alone is about 5 in 100, so you can reject the null hypothesis with roughly 95% confidence.
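As a quick check, the coefficient follows directly from the totals in the table (a sketch; n = 10 and Σd² = 285.5 are taken from the worked example above):

```python
# Spearman's rho from the worked example's totals.
n = 10            # number of stores sampled
sum_d2 = 285.5    # Σd² from the table above

rho = 1 - 6 * sum_d2 / (n * (n ** 2 - 1))
print(round(rho, 2))   # -0.73: a strong negative price-distance relationship
```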
The fact two variables correlate cannot prove anything - only further research can actually prove that one thing affects the other.
Data reliability is related to the size of the sample. The more data you collect, the more reliable your result.
What is correlation?
A correlation coefficient measures the extent to which two variables tend to change together. The coefficient describes both the
strength and the direction of the relationship. Minitab offers two different correlation analyses:
Pearson product moment correlation
The Pearson correlation evaluates the linear relationship between two continuous variables. A relationship is linear when
a change in one variable is associated with a proportional change in the other variable.
For example, you might use a Pearson correlation to evaluate whether increases in temperature at your production
facility are associated with decreasing thickness of your chocolate coating.
Spearman rank-order correlation
The Spearman correlation is often used to evaluate relationships involving ordinal variables. For example, you might use a
Spearman correlation to evaluate whether the order in which employees complete a test exercise is related to the
number of months they have been employed.
It is always a good idea to examine the relationship between variables with a scatterplot. Correlation coefficients only
measure linear (Pearson) or monotonic (Spearman) relationships. Other relationships are possible.
Comparison of Pearson and Spearman coefficients
The Pearson and Spearman correlation coefficients can range in value from −1 to +1. For the Pearson correlation
coefficient to be +1, when one variable increases then the other variable increases by a consistent amount. This
relationship forms a perfect line. The Spearman correlation coefficient is also +1 in this case.
When a relationship is random or non-existent, then both correlation coefficients are nearly zero.
Pearson = −0.093, Spearman = −0.093
If the relationship is a perfect line for a decreasing relationship, then both correlation coefficients are −1.
Pearson correlation coefficients measure only linear relationships. Spearman correlation coefficients measure only
monotonic relationships. So a meaningful relationship can exist even if the correlation coefficients are 0. Examine a
scatterplot to determine the form of the relationship.
Statistical analysis often uses probability distributions, and the two topics are often studied together. However, probability theory
contains much that is mostly of mathematical interest and not directly relevant to statistics. Moreover, many topics in statistics are
independent of probability theory.
Statistics overview
Categorical data displays
Two-way tables for categorical data
Dot plots and frequency tables
Histograms
Comparing features of distributions
Stem-and-leaf plots
Line graphs
Mean and median: The basics
More on mean and median
Range, Interquartile range (IQR), Mean absolute deviation (MAD)
Box and whisker plots
Population variance and standard deviation
Sample variance and standard deviation
Designing studies
Study design focuses on collecting data properly and making the most valid conclusions we can based on how the data was collected. This topic
covers samples, surveys, and experiments.
Random variables
Random variables can be any outcomes from some chance process, like how many heads will occur in a series of 20 flips. We calculate
probabilities of random variables and calculate expected value for different types of random variables.
Sampling distributions
A sampling distribution shows every possible result a statistic can take in every possible sample from a population and how often each result
happens. This topic covers how sample proportions and sample means behave in repeated samples.
Sample proportions
Sample means
Confidence intervals (one sample)
Confidence intervals give us a range of plausible values for some unknown value based on results from a sample. This topic covers confidence
intervals for means and proportions.
Nonlinear regression
Analysis of variance (ANOVA)
Analysis of variance, also called ANOVA, is a collection of methods for comparing multiple means across different groups.
Several sets of (x, y) points, with the Pearson correlation coefficient of x and y for each set. Note that the correlation reflects the noisiness and direction of
a linear relationship (top row), but not the slope of that relationship (middle), nor many aspects of nonlinear relationships (bottom). N.B.: the figure in the
center has a slope of 0 but in that case the correlation coefficient is undefined because the variance of Y is zero.
The Pearson correlation is defined only if both of the standard deviations are finite and nonzero. It is a corollary of the Cauchy–Schwarz
inequality that the correlation cannot exceed 1 in absolute value. The correlation coefficient is symmetric: corr(X,Y) = corr(Y,X).
The Pearson correlation is +1 in the case of a perfect direct (increasing) linear relationship (correlation), −1 in the case of a perfect
decreasing (inverse) linear relationship (anticorrelation),[5] and some value in the open interval (−1, 1) in all other cases, indicating the
degree of linear dependence between the variables. As it approaches zero there is less of a relationship (closer to uncorrelated). The
closer the coefficient is to either −1 or 1, the stronger the correlation between the variables.
If the variables are independent, Pearson's correlation coefficient is 0, but the converse is not true because the correlation coefficient
detects only linear dependencies between two variables. For example, suppose the random variable X is symmetrically distributed
about zero, and Y = X2. Then Y is completely determined by X, so that X and Y are perfectly dependent, but their correlation is zero;
they are uncorrelated. However, in the special case when X and Y are jointly normal, uncorrelatedness is equivalent to independence.
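The Y = X² example can be verified directly (a minimal sketch; the sample values, symmetric about zero, are invented):

```python
def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (sum((x - mx) ** 2 for x in xs) *
           sum((y - my) ** 2 for y in ys)) ** 0.5
    return num / den

xs = [-2, -1, 0, 1, 2]      # symmetric about zero
ys = [x ** 2 for x in xs]   # 4, 1, 0, 1, 4 — completely determined by x
print(pearson(xs, ys))      # 0.0: dependent, yet uncorrelated
```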
If we have a series of n measurements of X and Y written as xᵢ and yᵢ for i = 1, 2, ..., n, then the sample correlation coefficient can be
used to estimate the population Pearson correlation ρ between X and Y. The sample correlation coefficient is written:

$$r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{(n-1)\, s_x s_y}$$

where $\bar{x}$ and $\bar{y}$ are the sample means of X and Y, and $s_x$ and $s_y$ are the sample standard deviations of X and Y.
This can also be written as:

$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$$
If x and y are results of measurements that contain measurement error, the realistic limits on the correlation coefficient are not
−1 to +1 but a smaller range.[6]
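The two sample-correlation forms above can be checked against each other numerically (a sketch with invented data):

```python
from math import sqrt

def sample_stats(vals):
    """Sample mean and (n-1)-denominator sample standard deviation."""
    n = len(vals)
    m = sum(vals) / n
    s = sqrt(sum((v - m) ** 2 for v in vals) / (n - 1))
    return m, s

xs = [1, 2, 3, 4]
ys = [1, 3, 2, 4]
n = len(xs)
mx, sx = sample_stats(xs)
my, sy = sample_stats(ys)

# Form 1: sum of deviation products over (n-1)*sx*sy.
r1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / ((n - 1) * sx * sy)
# Form 2: average product of standardized scores.
r2 = sum(((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)) / (n - 1)

print(round(r1, 3), round(r2, 3))   # both 0.8 — the forms agree
```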
For the case of a linear model with a single independent variable, the coefficient of determination (R squared) is the square
of r, Pearson's product-moment coefficient.
Pearson/Spearman correlation coefficients between X and Y are shown when the two variables' ranges are unrestricted, and when the
range of X is restricted to the interval (0,1).
Most correlation measures are sensitive to the manner in which X and Y are sampled. Dependencies tend to be stronger if
viewed over a wider range of values. Thus, if we consider the correlation coefficient between the heights of fathers and
their sons over all adult males, and compare it to the same correlation coefficient calculated when the fathers are selected
to be between 165 cm and 170 cm in height, the correlation will be weaker in the latter case. Several techniques have
been developed that attempt to correct for range restriction in one or both variables, and are commonly used in meta-
analysis; the most common are Thorndike's case II and case III equations. [13]
Various correlation measures in use may be undefined for certain joint distributions of X and Y. For example, the Pearson
correlation coefficient is defined in terms of moments, and hence will be undefined if the moments are undefined.
Measures of dependence based on quantiles are always defined. Sample-based statistics intended to estimate population
measures of dependence may or may not have desirable statistical properties such as being unbiased, or asymptotically
consistent, based on the spatial structure of the population from which the data were sampled.
Sensitivity to the data distribution can be used to an advantage. For example, scaled correlation is designed to use the
sensitivity to the range in order to pick out correlations between fast components of time series. [14] By reducing the range of
values in a controlled manner, the correlations on long time scale are filtered out and only the correlations on short time
scales are revealed.
Correlation matrices
"Correlation matrix" redirects here. For correlation matrices in quantum physics, see Quark–lepton complementarity.
See also: Covariance matrix § Correlation matrix
The correlation matrix of n random variables X1, ..., Xn is the n × n matrix whose i,j entry is corr(Xi, Xj). If the measures of
correlation used are product-moment coefficients, the correlation matrix is the same as the covariance matrix of
the standardized random variables Xi / σ (Xi) for i = 1, ..., n. This applies to both the matrix of population correlations (in
which case "σ" is the population standard deviation), and to the matrix of sample correlations (in which case "σ" denotes
the sample standard deviation). Consequently, each is necessarily a positive-semidefinite matrix. Moreover, the correlation
matrix is strictly positive definite if no variable can have all its values exactly generated as a linear combination of the
others.
The correlation matrix is symmetric because the correlation between Xi and Xj is the same as the correlation
between Xj and Xi.
A correlation matrix appears, for example, in one formula for the coefficient of multiple determination, a measure of
goodness of fit in multiple regression.
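These properties, symmetry and positive semidefiniteness, can be checked numerically (a sketch using numpy; the data is random with one induced dependence):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(100, 3))   # 100 samples of 3 variables
data[:, 2] += data[:, 0]           # make variables 0 and 2 correlate

# Correlation matrix: 3 x 3, entry (i, j) = corr(X_i, X_j).
corr = np.corrcoef(data, rowvar=False)

print(np.allclose(corr, corr.T))                   # symmetric: True
print(np.all(np.linalg.eigvalsh(corr) >= -1e-12))  # PSD eigenvalues: True
```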
Common misconceptions
Correlation and causality
Main article: Correlation does not imply causation
See also: Normally distributed and uncorrelated does not imply independent
The conventional dictum that "correlation does not imply causation" means that correlation cannot be used to infer a causal
relationship between the variables. [15] This dictum should not be taken to mean that correlations cannot indicate the
potential existence of causal relations. However, the causes underlying the correlation, if any, may be indirect and
unknown, and high correlations also overlap with identity relations (tautologies), where no causal process exists.
Consequently, establishing a correlation between two variables is not a sufficient condition to establish a causal
relationship (in either direction).
A correlation between age and height in children is fairly causally transparent, but a correlation between mood and health
in people is less so. Does improved mood lead to improved health, or does good health lead to good mood, or both? Or
does some other factor underlie both? In other words, a correlation can be taken as evidence for a possible causal
relationship, but cannot indicate what the causal relationship, if any, might be.
See also