Topic V
Content
1. Introduction
Statistics of two variables: An introduction
$Y = f(X)$
Statistics of two variables: An introduction
Some examples of variables (categorical and/or numerical) that might be related to
each other:
Statistics of two variables: An introduction
Variables X and Y have a perfect linear relationship if one can be expressed in
terms of the other as:
𝑌 = 𝑎 + 𝑏𝑋
→ In this lecture we will only discuss how to deal with linear relationships.
Statistics of two variables: An introduction
Some initial concepts concerning relations between two variables:
Funny examples
2. Scatterplot
Scatterplot
Scatterplot
Graph interpretation: Form, direction and strength
[Figure: three example scatterplots; the x-axis runs from 0 to 20 in each panel]
Scatterplot
Graph interpretation: non-linear relationship (example: Quadratic)
Scatterplot
Example: Infant mortality and number of pediatricians per year in 10 cities
 i    Nº deaths (yi)    Nº doctors (xi)
 1         10                 80
 2         20                 85
 3         30                 70
 4         35                 60
 5         45                 60
 6         55                 50
 7         65                 40
 8         75                 30
 9         90                 25
10        100                 25
[Figure: scatterplot "Child-death and Pediatricians (annual per city)"; x-axis: Nº of pediatricians, y-axis: Nº of deaths]
3. Covariance and correlation coefficient
Covariance
Once we identify a possible linear relationship by plotting the numerical data in
a scatterplot, we can quantify this relationship.
$S_{XY} = \sum_{i=1}^{n} \frac{(x_i - \bar{x})(y_i - \bar{y})}{n-1} = \sum_{i=1}^{n} \frac{x_i y_i}{n-1} - \frac{n}{n-1}\,(\bar{x} \cdot \bar{y})$
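As a quick check of the two forms of the covariance formula, the following minimal sketch evaluates both on the infant-mortality data from this deck. The function and variable names are illustrative choices, not part of the lecture.

```python
# Data from the infant-mortality example in this deck:
# x = number of pediatricians, y = number of child deaths, for 10 cities.
doctors = [80, 85, 70, 60, 60, 50, 40, 30, 25, 25]
deaths = [10, 20, 30, 35, 45, 55, 65, 75, 90, 100]


def covariance_definition(xs, ys):
    """Sample covariance: sum of products of deviations, divided by n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def covariance_shortcut(xs, ys):
    """Same quantity via the computational (second) form of the formula."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    return sum_xy / (n - 1) - n / (n - 1) * mean_x * mean_y


s_xy_definition = covariance_definition(doctors, deaths)
s_xy_shortcut = covariance_shortcut(doctors, deaths)
print(round(s_xy_definition, 2), round(s_xy_shortcut, 2))  # -645.83 -645.83
```

Both forms agree up to floating-point rounding. The negative sign indicates that cities with more pediatricians tend to have fewer deaths.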
Covariance
The problem with the covariance is that it is not bounded, and hence we
cannot determine from its value alone whether the relationship between the
variables is weak or strong.
Correlation coefficient
$r_{XY} = \frac{S_{XY}}{S_X \cdot S_Y}$
Correlation coefficient
As with the covariance, the sign of $r_{XY}$ tells us the direction of the
relationship:
Covariance and correlation coefficient
Example: Infant mortality and number of pediatricians per year in 10 cities
Learned in previous sessions:
$\bar{x} = 52.5$, $\bar{y} = 52.5$
$s_x = 22.14$, $s_y = 29.93$
Covariance and correlation coefficient
Example: Infant mortality and number of pediatricians per year in 10 cities
$\sum_{i=1}^{10} x_i y_i = 21{,}750$
$\bar{x} = 52.5$, $\bar{y} = 52.5$
$s_{xy} = \frac{21{,}750}{9} - \frac{10}{9} \cdot 52.5 \cdot 52.5 = -645.83$
$s_x = 22.14$, $s_y = 29.93$
$r_{xy} = \frac{-645.83}{29.93 \cdot 22.14} = -0.97$
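The worked example can be reproduced with a short script. This is a minimal sketch using only the standard library. The helper names `sample_std` and `sample_cov` are illustrative, and both use the n - 1 denominator that the covariance formula in this deck uses.

```python
import math

# x = number of pediatricians, y = number of child deaths, for 10 cities.
doctors = [80, 85, 70, 60, 60, 50, 40, 30, 25, 25]
deaths = [10, 20, 30, 35, 45, 55, 65, 75, 90, 100]


def sample_std(values):
    """Sample standard deviation (n - 1 in the denominator)."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))


def sample_cov(xs, ys):
    """Sample covariance (n - 1 in the denominator)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


s_x = sample_std(doctors)
s_y = sample_std(deaths)
r_xy = sample_cov(doctors, deaths) / (s_x * s_y)
print(round(s_x, 2), round(s_y, 2), round(r_xy, 2))  # 22.14 29.93 -0.97
```

Unlike the covariance, the result always lies between -1 and 1, so a value of -0.97 can be read directly as a very strong negative linear relationship.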
4. Linear regression analysis
Linear regression analysis
Best fitting line (least squares line)
What is the best fitting line?
The idea: we want to construct a linear equation that does the best possible job
at matching the true relationship.
We want to estimate: $Y_i = a + bX_i$
We denote: $Y_i - \hat{Y}_i = \text{residual}$, where $\hat{Y}_i$ is the estimated value.
We want to make the smallest possible errors (we want the residuals to be
minimal) → the least squares line minimizes the sum of squared residuals.
[Figure: scatterplot with the best fitting line; x-axis from 50 to 100, y-axis from 150 to 200]
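To make the least squares criterion concrete, the sketch below computes the sum of squared residuals for two candidate lines on the infant-mortality data from this deck. Both candidate lines are hypothetical values chosen only for illustration. Neither is claimed to be the least squares line.

```python
# x = number of pediatricians, y = number of child deaths, for 10 cities.
doctors = [80, 85, 70, 60, 60, 50, 40, 30, 25, 25]
deaths = [10, 20, 30, 35, 45, 55, 65, 75, 90, 100]


def sum_squared_residuals(intercept, slope, xs, ys):
    """Sum over all points of (observed y - estimated y) squared."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


# Two hypothetical candidate lines (illustrative values only).
ssr_candidate_1 = sum_squared_residuals(120.0, -1.3, doctors, deaths)
ssr_candidate_2 = sum_squared_residuals(100.0, -1.0, doctors, deaths)

print(ssr_candidate_1 < ssr_candidate_2)  # True: candidate 1 fits better
```

The least squares line is the choice of intercept and slope for which this quantity is as small as possible.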
Best fitting line (least squares line)
$\hat{y}_i = a + b x_i$
Slope: $b = r_{xy} \cdot \frac{s_y}{s_x}$
Intercept/constant: $a = \bar{y} - b \cdot \bar{x}$
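A short worked step, offered as a sketch. The computation slides in this deck obtain the slope as $b = r_{xy} \cdot s_y / s_x$. Substituting the definition of the correlation coefficient gives an equivalent form in terms of the covariance, and the intercept formula implies that the line passes through the point of means $(\bar{x}, \bar{y})$:

```latex
b = r_{xy}\,\frac{s_y}{s_x}
  = \frac{s_{xy}}{s_x\, s_y}\cdot\frac{s_y}{s_x}
  = \frac{s_{xy}}{s_x^{2}},
\qquad
a + b\,\bar{x} = (\bar{y} - b\,\bar{x}) + b\,\bar{x} = \bar{y}
```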
Best fitting line (least squares line) - Computation
Example: Infant mortality and number of pediatricians per year in 10 cities
We already had:
$s_x = 22.14$, $s_y = 29.93$, $r_{xy} = -0.97$
$\bar{x} = 52.5$, $\bar{y} = 52.5$
Best fitting line (least squares line) - Computation
Example: Infant mortality and number of pediatricians per year in 10 cities
$s_x = 22.14$, $s_y = 29.93$, $r_{xy} = -0.97$
Slope → $b = \frac{-0.97 \cdot 29.93}{22.14} = -1.32$
(using the unrounded $r_{xy} \approx -0.9745$; with the rounded $-0.97$ the quotient is $-1.31$)
$\bar{x} = 52.5$, $\bar{y} = 52.5$
Intercept → $a = 52.5 - (-1.32) \cdot 52.5 = 121.8$
$\widehat{\text{deaths}} = 121.8 - 1.32 \cdot \text{doctors}$
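The slope and intercept can also be computed directly from the raw data. The following minimal sketch uses the covariance form of the slope (covariance divided by the variance of x), which is algebraically the same as $r_{xy} \cdot s_y / s_x$. The function name `least_squares_line` is an illustrative choice.

```python
# x = number of pediatricians, y = number of child deaths, for 10 cities.
doctors = [80, 85, 70, 60, 60, 50, 40, 30, 25, 25]
deaths = [10, 20, 30, 35, 45, 55, 65, 75, 90, 100]


def least_squares_line(xs, ys):
    """Return (intercept, slope) of the least squares line of y on x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    var_x = sum((x - mean_x) ** 2 for x in xs) / (n - 1)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return intercept, slope


a, b = least_squares_line(doctors, deaths)
print(round(a, 2), round(b, 2))  # 121.66 -1.32
```

The unrounded intercept is about 121.66. The value 121.8 in the hand computation comes from rounding the slope to -1.32 before computing the intercept.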
Best fitting line (least squares line) - Prediction
Example: number of child deaths (per year) and number of child doctors in 10 cities
How many annual child deaths would be expected in a city with 55 doctors?
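One way to answer the question is to plug 55 into the fitted line. This minimal sketch uses the rounded coefficients from the hand computation in this deck (intercept 121.8, slope -1.32). The function name `predict_deaths` is illustrative.

```python
# Rounded coefficients of the best fitting line from the hand computation.
INTERCEPT = 121.8
SLOPE = -1.32


def predict_deaths(n_doctors):
    """Estimated annual child deaths for a city with n_doctors pediatricians."""
    return INTERCEPT + SLOPE * n_doctors


predicted = predict_deaths(55)
print(round(predicted, 1))  # 49.2
```

With these coefficients the line predicts about 49 deaths per year. Since 55 lies inside the observed range of 25 to 85 doctors, this is an interpolation rather than an extrapolation.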
Scatterplot with best fitting line
Example: Infant mortality and number of pediatricians per year in 10 cities
[Figure: scatterplot of the 10 cities with the best fitting line; x-axis: doctors]
Coefficient of determination
We can also quantify how well the best fitting line (BFL) describes the
sample.
The idea is to measure how much of the variance of Y the BFL captures, that is,
the share of the variance of Y that is explained by X.
$R^2 = (r_{XY})^2$
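As an illustration, the sketch below computes the coefficient of determination for the infant-mortality data in two ways: as the squared correlation coefficient, and as one minus the ratio of the residual sum of squares to the total sum of squares of Y. The second form is not stated in the slides. It is a standard identity for a least squares line with an intercept, and is included here only as a cross-check.

```python
import math

# x = number of pediatricians, y = number of child deaths, for 10 cities.
doctors = [80, 85, 70, 60, 60, 50, 40, 30, 25, 25]
deaths = [10, 20, 30, 35, 45, 55, 65, 75, 90, 100]

n = len(doctors)
mean_x = sum(doctors) / n
mean_y = sum(deaths) / n

# Sums of squared deviations and of cross products of deviations.
sxx = sum((x - mean_x) ** 2 for x in doctors)
syy = sum((y - mean_y) ** 2 for y in deaths)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(doctors, deaths))

# Form 1: square of the correlation coefficient.
r_xy = sxy / math.sqrt(sxx * syy)
r_squared = r_xy ** 2

# Form 2 (cross-check): share of the variation of y not left in the residuals.
slope = sxy / sxx
intercept = mean_y - slope * mean_x
ssr = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(doctors, deaths))
r_squared_from_residuals = 1 - ssr / syy

print(round(r_squared, 2), round(r_squared_from_residuals, 2))  # 0.95 0.95
```

About 95% of the variance in the number of deaths is captured by the line. Squaring the rounded value -0.97 instead gives 0.94, so the small difference is a rounding effect.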
Coefficient of determination
• The correlation coefficient tells us that there is a very strong negative linear
relationship between the number of child deaths and the number of doctors
Summary