
Topic 5

Relationship between two numerical variables

Bachelor in Global Studies


First Term 2022-2023
First Year

Content

1. Introduction to the association analysis between two numerical variables

2. Graphical method of summarizing bivariate data: Scatterplot

3. Quantifying the association of two numerical variables: covariance and correlation coefficient

4. Linear regression analysis: least squares regression and R²
1. Introduction

Statistics of two variables: An introduction

When one variable explains another:

1. Response variable (also known as the dependent variable): the variable that we want to explain, denoted by Y
2. Explanatory variable (also known as the covariate or independent variable): the variable that we use to explain the response variable, denoted by X

In other words, the variable Y is a function of the variable X:

Y = f(X)
Statistics of two variables: An introduction
Some examples of variables (categorical and/or numerical) that might be related to
each other:

1. Level of education and salary
2. Choice of university degree and professional background of parents
3. Marketing expenditure and sales volume of a given company
4. Country of origin and religious choice
5. Health status and country of residence
Statistics of two variables: An introduction

Linear versus non-linear relationship

Variables X and Y hold a perfect linear relationship if one can be expressed in terms of the other as:

Y = a + bX

If b > 0, we say they hold a perfect positive relationship
If b < 0, we say they hold a perfect negative relationship

Example of a non-linear relationship: Y = e^(a + bX)

→ During this lecture we will only discuss how to deal with linear relationships
Statistics of two variables: An introduction
Some initial concepts concerning relations between two variables:

Association versus causation

• Association (correlation): a relation between two variables
• Causation (causality): one variable (the independent variable) directly determines the other (the dependent variable)

Important: correlation is not causation!

Funny examples

2. Scatterplot

Scatterplot

• Scatterplots are used to graph two numerical variables, at least one of which should be continuous
• The scatterplot can be a first indicator of whether or not there is a linear relationship between the two variables
• We can form pairs (xi, yi) for each observation in the sample; each pair represents a point in the plane
• The pair of values corresponding to a given observation is taken as coordinates in a 2-D graph, so each dot corresponds to one observation (xi, yi)
Scatterplot
Graph interpretation: Form, direction and strength

[Three example scatterplots, left to right: linear relationship (negative), linear relationship (positive), no relationship]
Scatterplot
Graph interpretation: non-linear relationship (example: Quadratic)

Scatterplot
Example: Infant mortality and number of pediatricians per year in 10 cities

 i   Nº deaths (yi)   Nº doctors (xi)
 1        10                80
 2        20                85
 3        30                70
 4        35                60
 5        45                60
 6        55                50
 7        65                40
 8        75                30
 9        90                25
10       100                25

[Scatterplot "Child deaths and pediatricians (annual, per city)": Nº of pediatricians on the x-axis, Nº of deaths on the y-axis]
3. Covariance and correlation coefficient

Covariance

Once we identify a possible linear relationship by plotting the numerical data in a scatterplot, we can quantify this relation.

As a first attempt, we can use the covariance:

SXY = Σᵢ₌₁ⁿ (xᵢ − x̄)(yᵢ − ȳ) / (n − 1)
    = Σᵢ₌₁ⁿ xᵢ yᵢ / (n − 1) − (n / (n − 1)) · x̄ · ȳ
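The two expressions above are algebraically equivalent. A minimal Python sketch (with illustrative made-up data, not the slide's example) checking that both give the same value:

```python
# Sample covariance computed two ways, both with the (n - 1) denominator.
def covariance_deviations(x, y):
    """Sum of products of deviations from the means, divided by n - 1."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    return sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / (n - 1)

def covariance_products(x, y):
    """Equivalent 'shortcut' form: sum of xi*yi / (n-1) minus n/(n-1) * xbar * ybar."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    return sum(xi * yi for xi, yi in zip(x, y)) / (n - 1) - n / (n - 1) * xbar * ybar

x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 5.0, 9.0]
print(covariance_deviations(x, y))  # both calls print the same value
print(covariance_products(x, y))
```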

Covariance

Direction of the relationship:

• If SXY > 0, then there is a positive linear relationship
• If SXY < 0, then there is a negative linear relationship
• If SXY = 0, then there is no linear relationship

The problem with the covariance is that it is not bounded, and hence we
cannot determine if the relationship between variables is weak or strong

Correlation coefficient

A finer tool to quantify the relationship between two numerical variables is the correlation coefficient: the covariance normalized by the product of the standard deviations.

rXY = SXY / (SX × SY)

where SX and SY are the standard deviations of X and Y respectively

Correlation coefficient

It is bounded between -1 and 1, where rXY = ±1 represents a perfect relationship

As with the covariance, the sign of rXY tells us the direction of the relationship:

• If rXY > 0, then there is a positive relationship
• If rXY < 0, then there is a negative relationship
• If rXY = 0, then there is no linear relationship

Evans (1996) suggests the following interpretation of |rXY|:

• 0.00–0.19 “very weak”
• 0.20–0.39 “weak”
• 0.40–0.59 “moderate”
• 0.60–0.79 “strong”
• 0.80–1.00 “very strong”
Covariance and correlation coefficient
Example: Infant mortality and number of pediatricians per year in 10 cities

(data as in the table above)

Learned in previous sessions:

x̄ = 52.5    ȳ = 52.5
sx = 22.14   sy = 29.93
Covariance and correlation coefficient
Example: Infant mortality and number of pediatricians per year in 10 cities
(data as in the table above)

Σᵢ₌₁¹⁰ xᵢ yᵢ = 21750

x̄ = 52.5    ȳ = 52.5
sx = 22.14   sy = 29.93

sxy = 21750/9 − (10/9) × 52.5 × 52.5 = −645.83

rxy = −645.83 / (29.93 × 22.14) = −0.97
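The figures above can be reproduced in Python with the standard-library `statistics` module (the data come from the table above):

```python
import statistics

deaths = [10, 20, 30, 35, 45, 55, 65, 75, 90, 100]   # y: Nº deaths
doctors = [80, 85, 70, 60, 60, 50, 40, 30, 25, 25]   # x: Nº doctors

n = len(deaths)
xbar = statistics.mean(doctors)   # 52.5
ybar = statistics.mean(deaths)    # 52.5
sx = statistics.stdev(doctors)    # sample standard deviation, ~22.14
sy = statistics.stdev(deaths)     # ~29.93

# Covariance via the shortcut formula, then the correlation coefficient.
sxy = sum(x * y for x, y in zip(doctors, deaths)) / (n - 1) - n / (n - 1) * xbar * ybar
rxy = sxy / (sx * sy)

print(round(sxy, 2))  # -645.83
print(round(rxy, 2))  # -0.97
```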

4. Linear regression analysis

Linear regression analysis

Why regression analysis?

• It provides a summary of a possible relationship between two or more variables

• We can use the regression model to make predictions on the variable Y

• We can interpret how changes in X may affect changes in Y

Best fitting line (least squares line)
What is the best fitting line?
The idea: we want to construct a linear equation that does the best possible job at matching the true relationship.

We want to estimate:

Yi = a + bXi

Let's call our estimated line:

Ŷi = â + b̂ · Xi

We denote:

Yi − Ŷi = residual

We want to make the least mistakes (we want the residuals to be minimal) → the least squares line minimizes the sum of squared residuals.

[Scatterplot illustrating the observed points and the fitted line]
Best fitting line (least squares line)

The equation of the least squares line is:

ŷ = a + b · x     (a: intercept/constant, b: slope)

The values of a and b that minimize the sum of squared residuals are:

b = SXY / SX² = rXY × (SY / SX)

a = ȳ − b · x̄
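A quick Python check (illustrative data, not the slide's example) that the closed-form a and b above really minimize the sum of squared residuals: perturbing the solution in any direction cannot decrease the criterion.

```python
def least_squares(x, y):
    """Closed-form slope and intercept: b = Sxy / Sx^2, a = ybar - b * xbar."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / (n - 1)
    sxx = sum((xi - xbar) ** 2 for xi in x) / (n - 1)
    b = sxy / sxx
    a = ybar - b * xbar
    return a, b

def ssr(a, b, x, y):
    """Sum of squared residuals for the line y-hat = a + b * x."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]
a, b = least_squares(x, y)

# Small perturbations of (a, b) can only increase the sum of squared residuals.
best = ssr(a, b, x, y)
assert all(ssr(a + da, b + db, x, y) >= best
           for da in (-0.1, 0, 0.1) for db in (-0.1, 0, 0.1))
```

Note also that the formula a = ȳ − b·x̄ forces the fitted line to pass through the point (x̄, ȳ).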

Best fitting line (least squares line) - Computation
Example: Infant mortality and number of pediatricians per year in 10 cities

i Nº Nº
deaths doctors
We already had:
(yi) (xi)
1 10 80 𝑠+ = 22,14 𝑠, = 29,93 𝑟+, = −0,97
2 20 85
3 30 70
4 35 60 𝑥̅ = 52,5 𝑦V = 52,5
5 45 60
6 55 50
7 65 40
8 75 30
9 90 25
10 100 25

24
Best fitting line (least squares line) - Computation
Example: Infant mortality and number of pediatricians per year in 10 cities

(data as in the table above)

sx = 22.14   sy = 29.93   rxy = −0.97
x̄ = 52.5    ȳ = 52.5

Slope → b = −0.97 × (29.93 / 22.14) = −1.32

Intercept → a = 52.5 − (−1.32 × 52.5) = 121.8

Best fitting line: deaths-hat = 121.8 − 1.32 × doctors
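The slide's coefficients can be replayed from the raw data; note that the slope is rounded to two decimals before the intercept is computed, which is how the intercept comes out as 121.8:

```python
deaths = [10, 20, 30, 35, 45, 55, 65, 75, 90, 100]   # y
doctors = [80, 85, 70, 60, 60, 50, 40, 30, 25, 25]   # x

n = len(doctors)
xbar = sum(doctors) / n
ybar = sum(deaths) / n
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(doctors, deaths)) / (n - 1)
sxx = sum((x - xbar) ** 2 for x in doctors) / (n - 1)

b = round(sxy / sxx, 2)        # slope, rounded as on the slide: -1.32
a = round(ybar - b * xbar, 1)  # intercept from the rounded slope: 121.8

print(b, a)  # -1.32 121.8
```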

Best fitting line (least squares line) - Prediction

For a given value of x we can predict a value for y.

Example: number of child deaths (per year) and number of pediatricians in 10 cities.

How many annual child deaths would be expected in a city with 55 doctors?

deaths-hat = 121.8 − 1.32 × 55 = 49.2
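The prediction is a one-liner in Python, plugging the value of x into the fitted line (the coefficients are the rounded values from the slides):

```python
def predict_deaths(doctors):
    """Prediction from the fitted line: deaths-hat = 121.8 - 1.32 * doctors."""
    return 121.8 - 1.32 * doctors

print(round(predict_deaths(55), 1))  # 49.2
```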

Scatterplot with best fitting line
Example: Infant mortality and number of pediatricians per year in 10 cities

= = 121,8 – 1,32 𝑑𝑜𝑐𝑡𝑜𝑟𝑠


𝑑𝑒𝑎𝑡ℎ𝑠
i Nº Nº
deaths doctors
(yi) (xi) Child death (per year) and number of child-doctors in 10 cities

100
1 10 80
2 20 85
80
3 30 70
4 35 60
60

5 45 60
6 55 50
40

7 65 40
8 75 30
20

9 90 25
10 100 25
0

20 40 60 80
doctors

Observed deaths Expected deaths (BFL)

27
Coefficient of determination

We can also quantify how good the best fitting line (BFL) is in describing the
sample

The idea is to measure how much of the variance of Y the BFL captures through the variance of X

To measure this goodness of fit, we use the coefficient of determination:

R² = (rXY)²

The closer to 1, the better the fit of the line
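For simple linear regression this R² coincides with the share of the variance of Y explained by the fitted line, i.e. 1 − SSE/SST. A sketch (illustrative data) checking that the two computations agree:

```python
def r_squared_two_ways(x, y):
    """Return R^2 computed (1) as the squared correlation and (2) as 1 - SSE/SST."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    syy = sum((yi - ybar) ** 2 for yi in y)

    # Route 1: square of the correlation coefficient.
    r2_from_r = (sxy / (sxx * syy) ** 0.5) ** 2

    # Route 2: 1 - SSE / SST with the least squares line.
    b = sxy / sxx
    a = ybar - b * xbar
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    return r2_from_r, 1 - sse / syy

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.9]
r2a, r2b = r_squared_two_ways(x, y)
assert abs(r2a - r2b) < 1e-9
```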

Coefficient of determination

Example: Infant mortality and number of pediatricians per year in 10 cities

sxy = −645.83    r = −0.97    R² = 0.94

• The correlation coefficient tells us that there is a very strong negative linear relationship between the number of child deaths and the number of doctors

• The coefficient of determination tells us that the number of doctors in a given city is able to explain 94% of the variation in child deaths in the same city

Summary

To carry out a linear regression analysis:

1. Draw a scatterplot of the data (it will give you a first indication of whether or not there is a linear relationship between the variables)
2. Compute the correlation coefficient (to measure how strong the linear relationship between the variables is)
3. Find the linear regression model that describes the variable Y in terms of the variable X, also called the equation of the best fitting line
4. Graph the best fitting line in the scatterplot of the observed values
5. Compute the coefficient of determination, to measure how well the linear model describes Y
