0% found this document useful (0 votes)

14 views

Correlation 1

Uploaded by

Bhavadharani R

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

14 views

Correlation 1

Uploaded by

Bhavadharani R

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 7

Correlation

1 Introduction
We are often interested in the relationship between two variables.

• Do people with more years of full-time education earn higher salaries?

• Do factories with more safety officers have fewer accidents?

Questions like this only make sense if the possible values of our variables have a natural
order. The techniques that we look at in this handout assume that variables are measured
on a scale that is at least ordinal. In discussing Pearson’s correlation coefficient, we shall
need to go further and assume that we have interval scale data - i.e. that equal intervals
correspond to equal changes in the characteristic that we are trying to measure.

2 Plotting the data

The first step in looking for a correlation is to draw a scatterplot of the data. Figure 1
shows four examples1 .

2.1 Interpreting a scatterplot

These are some of the points to look for.

• How strong is the relationship?

– In Figure 1, the relationship between gas consumption and outside temperature

is very strong, while the relationship between Educational level and Crime rate
is much weaker.

• Is the relationship increasing or decreasing?

– In the ‘Gas’ example, higher outside temperatures are associated with lower
gas consumption, but in the ‘Ice cream’ example, higher mean temperatures
go with higher levels of ice cream consumption.
1

1
r = − 0.971, ρ = − 0.958, τ = − 0.853 r = 0.776, ρ = 0.829, τ = 0.644

0.25 0.30 0.35 0.40 0.45 0.50 0.55

Ice cream (pints per head)

6
Gas consumption

5
4
3

0 2 4 6 8 10 −20 −15 −10 −5

Outside temperature (°C) Mean temperature (°C)

r = 0.323, ρ = 0.256, τ = 0.169 r = 0.935, ρ = 0.982, τ = 0.907

200

2.0
Direct current output
150

1.5
Crime rate

100

1.0
0.5
50

90 95 100 105 110 115 120 4 6 8 10

Educational level Wind velocity (mph)

Figure 1: Scatterplots showing strong and weak relationships

• Is the relationship roughly linear?

– The plot in the top left of Figure 1 shows a clear linear pattern, while the plot
in the bottom right suggests a non-linear relationship with the initial steep
slope leveling off as the wind speed increases.

• What is the slope of the relationship?

– Is an increase in one variable associated with a small, or a large, increase in

the other one? For example, factories with more safety officers may have fewer
accidents, but is the reduction in accidents large enough to justify the cost of
the additional safety officers?

• Are there any outliers?

– Figure 2 shows a plot of Police expenditure per head against Population size

2
for 47 US states2 . At first glance, there seems to be an increasing relationship,
with larger states spending more per head on policing. But if you cover up the
two points at the top right of the plot, the correlation seems to disappear. The
evidence for a correlation comes almost entirely from these two points, so we’ll
need to check our data source to make sure that the points really are correct.

r = 0.514, ρ = 0.301, τ = 0.219

160
140
Police expenditure per head

120
100
80
60
40

0 50 100 150

State population size

Figure 2: Effect of outliers

3
3 Correlation coefficients
A correlation coefficient gives a numerical summary of the degree of association between
two variables - e,g, to what degree do high values of one variable go with high values of
the other one?
Correlation coefficients vary from -1 to +1, with positive values indicating an increasing
relationship and negative values indicating a decreasing relationship.
We focus on two widely used measures of correlation - Pearson’s r and Kendall’s τ .

• Pearson’s coefficient

– measures degree to which a relationship conforms to a straight line

• Kendall’s coefficient

– measures degree to which a relationship is always increasing or always decreas-

ing

Spearman’s rank correlation coefficient, ρ behaves in much the same way as Kendall’s τ ,
but has a less direct interpretation.

3.1 Which coefficient should I use?

• Interval scale data and interested in linear relationships - e.g. wish to build linear
model

– Use Pearson’s coefficient

• Interested in any increasing/decreasing relationship

– Use Kendall’s coefficient

3.2 Pearson’s coefficient

Suppose we have n data pairs (xi , yi )
Pearson’s correlation coefficient is given by
Pn
i=1 (xi − x)(yi − y)
r = qP Pn
n
i=1 (xi − x)2 i=1 (yi − y)2

. . . where x and y are the means of the x and y values.

In practice, we always use statistical software to do the calculations.
Looking back at Figure 1, notice how the absolute size of the coefficient drops towards zero
as we get more and more scatter. While Pearson’s r is good at measuring the strength of

4
a linear association, it can be quite misleading in the presence of curvature. Look at the
wind turbine data in the bottom right plot of Figure 1. Pearson’s r is 0.935, suggesting a
strong linear association, but a linear model would clearly not be sensible here.
Because Pearson’s r is based on the idea of linearity, it only makes sense for data that is
measured on at least an interval scale. For ordinal data, use Kendall’s τ or Spearman’s ρ.

3.4 Modification of Kendall’s coefficient for tied data

If some of the pairs of observations are tied, Kendall’s coefficient cannot reach the limiting
values of ±1 even if all untied pairs are concordant (discordant). This is a particular
problem in the analysis of contingency tables, where there will usually be a large number
of ties. Kendall proposed the following as an alternative to the simpler coefficient defined
above.

C −D
τb = q
(n(n − 1)/2 − tx )(n(n − 1)/2 − ty )

5
. . . where tx is the number of tied x values and ty is the number of tied y values.
This version of Kendall’s τ is the one used by SPSS.

3.5 Interpreting a correlation coefficient

It’s easy to misinterpret a correlation coefficient. These are some of the points to watch.

• A correlation coefficient can be badly affected by one or two outlying observations.

For the ‘police expenditure’ data in Figure 2, the value of Pearson’s r is 0.514, but
if the two points at the top right of this plot are removed, the correlation drops to
0.237. Always look at a scatter plot before calculating a correlation coefficient!

• Correlation is not the same as causality. For example, factories with more safety
officers may have fewer accidents, but this doesn’t prove that the variation in acci-
dent levels is attributable to the provision of safety officers. The correlation may be
a spurious one induced by another factor such as the age of the factory.
One possible approach here is to use partial correlation. We ’adjust’ our two vari-
ables to remove any variation that can be accounted for by our third variable (age
of factory) and then look at the correlation between the two adjusted variables.

• Even if a relationship is genuine, a strong correlation doesn’t necessarily imply that

a change in one variable will produce a large change in the other one. The two
sets of data shown in Figure 3 give the same correlation coefficient, but say quite
different things about the effect of engine capacity on fuel economy.

Large effect (r = 0.832) Small effect (r = 0.832)

45
40

40
Fuel economy

Fuel economy
35

35
30

30
25

1500 2000 2500 3000 1500 2000 2500 3000

Engine capacity Engine capacity

Figure 3: Correlation and size of effect

• Correlation coefficients are subject to sampling variation and may give a misleading
picture of the correlation in the population we’re sampling. We can quantify the
uncertainty in an estimate of a correlation by quoting a confidence interval, or

6
range of plausible values. For the ’Ice cream’ data in Figure 1, the 95% confidence
interval for Pearson’s r is 0.576 to 0.888, so we can be fairly sure that the population
coefficient lies in this range.
For details of how to calculate confidence intervals for correlation coefficients, see
Howells(1994) and Hollander(1999).

3.6 Testing for zero correlation

Most statistical software packages allow us to check whether a sample correlation is com-
patible with zero correlation in the population we’re sampling. The test that is carried
out here first assumes that the population correlation is zero and calculates the chance of
obtaining a sample correlation as large or larger in absolute size than our observed value
- this chance is given as the p value. If the p value is very small, we conclude that our
sample correlation is probably incompatible with zero correlation in the population.
The limitation of a test for zero correlation is that it doesn’t tell us anything about the
size of the correlation. A correlation can be nonzero, but too small to be of any practical
interest. For example, if we test for zero correlation with the data in the plot in the
bottom left of Figure 1, we obtain p value of 0.027, which gives strong evidence for a
nonzero correlation. But would a relationship as weak as this be of any practical interest?

Rhonda Hamilton Case Scenario: Exhibit 1 Hamilton'S Regression Model For Electric Utility Industry
No ratings yet
Rhonda Hamilton Case Scenario: Exhibit 1 Hamilton'S Regression Model For Electric Utility Industry
34 pages
Correlation New
100% (1)
Correlation New
38 pages
Econometrics: A Simple Introduction
From Everand
Econometrics: A Simple Introduction
K.H. Erickson
3.5/5 (5)
Statistics: Correlation: 2.1 Interpreting A Scatterplot
No ratings yet
Statistics: Correlation: 2.1 Interpreting A Scatterplot
8 pages
4-Pearson-r
No ratings yet
4-Pearson-r
30 pages
Correl Explanation
No ratings yet
Correl Explanation
2 pages
SLG 9.2 Bivariate quantitative data - Correlation Coefficient
No ratings yet
SLG 9.2 Bivariate quantitative data - Correlation Coefficient
5 pages
Voorbeeldvragen_econometrie_2023
No ratings yet
Voorbeeldvragen_econometrie_2023
12 pages
LM09 Parametric and Non-Parametric Tests of Independence IFT Notes
No ratings yet
LM09 Parametric and Non-Parametric Tests of Independence IFT Notes
7 pages
Measures of Covariability
No ratings yet
Measures of Covariability
24 pages
Example:: Item and Distance From The Contemporary Art Museum
No ratings yet
Example:: Item and Distance From The Contemporary Art Museum
9 pages
Pearson's Correlation Coefficient
No ratings yet
Pearson's Correlation Coefficient
7 pages
Correlation Analysis
No ratings yet
Correlation Analysis
17 pages
Case2_1015_1018_1060_1116_1124
No ratings yet
Case2_1015_1018_1060_1116_1124
8 pages
Chapter 6-Correlation Analysis
No ratings yet
Chapter 6-Correlation Analysis
35 pages
Week 04
No ratings yet
Week 04
32 pages
CM8.1 Pearson Product Moment Correlation
No ratings yet
CM8.1 Pearson Product Moment Correlation
5 pages
Data Analysis Final Assignment
No ratings yet
Data Analysis Final Assignment
14 pages
Week 7 Statistics - Tagged
No ratings yet
Week 7 Statistics - Tagged
72 pages
SPSS - Correlation Assignment - Completed
No ratings yet
SPSS - Correlation Assignment - Completed
10 pages
Eda Reviewer
No ratings yet
Eda Reviewer
12 pages
Correlation and Regression-2023
No ratings yet
Correlation and Regression-2023
28 pages
What Is The Correlation Coefficient?: Coefficient. The Sample Value Is Called R, and The Population Value Is Called
No ratings yet
What Is The Correlation Coefficient?: Coefficient. The Sample Value Is Called R, and The Population Value Is Called
22 pages
CSL-410-L29
No ratings yet
CSL-410-L29
23 pages
2 Assignment For Data Analysis For Decision Making: Dipanwita Ghosh
No ratings yet
2 Assignment For Data Analysis For Decision Making: Dipanwita Ghosh
13 pages
Correlation-and-Regression-Handout-1
No ratings yet
Correlation-and-Regression-Handout-1
7 pages
Chapter 3
No ratings yet
Chapter 3
11 pages
Lecture 3
No ratings yet
Lecture 3
59 pages
Chap 15
No ratings yet
Chap 15
44 pages
Correlation
No ratings yet
Correlation
9 pages
Correlation
No ratings yet
Correlation
12 pages
Assignment SPSS - 6
No ratings yet
Assignment SPSS - 6
12 pages
Econometrics - Functional Forms
No ratings yet
Econometrics - Functional Forms
22 pages
CORRELATION ANALYSIS Pearson's R
No ratings yet
CORRELATION ANALYSIS Pearson's R
3 pages
SolomonAntonioVisuyanTandoyBallartaGumbocAretanoNaive - Ed104 - Pearson R & Simple Regression - April 24, 2021
No ratings yet
SolomonAntonioVisuyanTandoyBallartaGumbocAretanoNaive - Ed104 - Pearson R & Simple Regression - April 24, 2021
13 pages
Chap 15
No ratings yet
Chap 15
44 pages
Correlation Lecture
No ratings yet
Correlation Lecture
20 pages
P Value 0.000: Symmetric Measures
No ratings yet
P Value 0.000: Symmetric Measures
2 pages
Chapter 3 ODL
No ratings yet
Chapter 3 ODL
13 pages
Econmetrics Chapter 3
No ratings yet
Econmetrics Chapter 3
20 pages
Lecture
No ratings yet
Lecture
3 pages
Nadhratul Hikmah 1910533031 Multivariate Statistics
No ratings yet
Nadhratul Hikmah 1910533031 Multivariate Statistics
7 pages
1ststeps in Hyphothesis Testing
No ratings yet
1ststeps in Hyphothesis Testing
17 pages
Pearson Correlation
No ratings yet
Pearson Correlation
59 pages
Terro's Real Estate Agency
No ratings yet
Terro's Real Estate Agency
17 pages
Lesson 7:: Normal Distribution in Statistics
No ratings yet
Lesson 7:: Normal Distribution in Statistics
5 pages
Prac 3
No ratings yet
Prac 3
8 pages
Testing Assumptions: Normality and Equal Variances
No ratings yet
Testing Assumptions: Normality and Equal Variances
4 pages
Decision Science - June - 2023
No ratings yet
Decision Science - June - 2023
8 pages
Data Science Unit-3
No ratings yet
Data Science Unit-3
42 pages
5-LR Doc - R Sqared-Bias-Variance-Ridg-Lasso
No ratings yet
5-LR Doc - R Sqared-Bias-Variance-Ridg-Lasso
26 pages
Regcorr 5
No ratings yet
Regcorr 5
20 pages
QTT Lec Correlations
No ratings yet
QTT Lec Correlations
33 pages
Arjav - BMS - 2C - 22036 and Damia - BMS - 2C - 22056
No ratings yet
Arjav - BMS - 2C - 22036 and Damia - BMS - 2C - 22056
9 pages
Correlation and Regression Analysis
No ratings yet
Correlation and Regression Analysis
34 pages
DAFM
No ratings yet
DAFM
13 pages
Week8 Tutorial
No ratings yet
Week8 Tutorial
17 pages
Correlation and Regration
No ratings yet
Correlation and Regration
8 pages
Correlation and Regression: Six Sigma Thinking, #8
From Everand
Correlation and Regression: Six Sigma Thinking, #8
Sumeet Savant
5/5 (1)
GCSE Maths Revision: Cheeky Revision Shortcuts
From Everand
GCSE Maths Revision: Cheeky Revision Shortcuts
Scool Revision
3.5/5 (2)
PERFUMES
No ratings yet
PERFUMES
18 pages
Sebumeter
No ratings yet
Sebumeter
12 pages
Share OTC AND QUASI DRUGS
No ratings yet
Share OTC AND QUASI DRUGS
11 pages
Share PPD-hairdye
No ratings yet
Share PPD-hairdye
16 pages
DIMTOT
No ratings yet
DIMTOT
3 pages
What Is Bias Variance Decomposition - BbGoogle Search
No ratings yet
What Is Bias Variance Decomposition - BbGoogle Search
3 pages
Python Implementation of Random Forest Algorithm
No ratings yet
Python Implementation of Random Forest Algorithm
10 pages
Econometrics For Finance Ch6
No ratings yet
Econometrics For Finance Ch6
10 pages
Decision Science - Assignment
No ratings yet
Decision Science - Assignment
6 pages
Data Preparation
No ratings yet
Data Preparation
12 pages
De Chaisemartin D Haultfœuille 2020 Two Way Fixed Effects Estimators With Heterogeneous Treatment Effects
No ratings yet
De Chaisemartin D Haultfœuille 2020 Two Way Fixed Effects Estimators With Heterogeneous Treatment Effects
35 pages
Support Vector Machine: Scenario 1
No ratings yet
Support Vector Machine: Scenario 1
3 pages
EDA Unit IV
No ratings yet
EDA Unit IV
17 pages
Machine Learning in Statistical Arbitrage
No ratings yet
Machine Learning in Statistical Arbitrage
5 pages
Simple-Linear-Regression-Model-3 24
No ratings yet
Simple-Linear-Regression-Model-3 24
87 pages
Econometrics Assignment 1 Fall 2020
No ratings yet
Econometrics Assignment 1 Fall 2020
2 pages
ML L8 Decision Tree
No ratings yet
ML L8 Decision Tree
109 pages
Formula Sheet
No ratings yet
Formula Sheet
5 pages
QAM 2 End Term Deepak Prajapati
100% (1)
QAM 2 End Term Deepak Prajapati
7 pages
OUTPUTaul 2
No ratings yet
OUTPUTaul 2
25 pages
Performance Metrics (Classification) : Enrique J. de La Hoz D
100% (1)
Performance Metrics (Classification) : Enrique J. de La Hoz D
30 pages
Hands On Analysis - Simple Linear Regression - Baseball Data
No ratings yet
Hands On Analysis - Simple Linear Regression - Baseball Data
2 pages
Econometrics For MPM, LNotes 2
No ratings yet
Econometrics For MPM, LNotes 2
45 pages
CLM: Review: - OLS Estimation
No ratings yet
CLM: Review: - OLS Estimation
44 pages
Meinshausen & Bühlmann, High-Dimensional Graphs and Variable Selection With The Lasso 009053606000000281
No ratings yet
Meinshausen & Bühlmann, High-Dimensional Graphs and Variable Selection With The Lasso 009053606000000281
27 pages
Banking Data Practice
No ratings yet
Banking Data Practice
1 page
Cheatsheet Reflex Models
No ratings yet
Cheatsheet Reflex Models
4 pages
Correlation and Causal Comparative Research
No ratings yet
Correlation and Causal Comparative Research
34 pages
Point-Biserial and Biserial Correlation
0% (1)
Point-Biserial and Biserial Correlation
6 pages
Reliability
No ratings yet
Reliability
2 pages
Chapter7 Explanatory Models3 ARIMA
No ratings yet
Chapter7 Explanatory Models3 ARIMA
63 pages
Lab 4 - Support Vector Machines: Part B
No ratings yet
Lab 4 - Support Vector Machines: Part B
5 pages
Econometrics Practical 2
No ratings yet
Econometrics Practical 2
2 pages
Chapter-24 Multivariate Statistical Analysis
No ratings yet
Chapter-24 Multivariate Statistical Analysis
80 pages