
Topic 5

Relationship between two numerical variables

Bachelor in Global Studies


First Term 2022-2023
First Year

Content

1. Introduction to the association analysis between two numerical variables

2. Graphical method of summarizing bivariate data: Scatterplot

3. Quantifying the association of two numerical variables: covariance and correlation coefficient

4. Linear regression analysis: least squares regression and R²
1. Introduction

Statistics of two variables: An introduction

When one variable explains another:

1. Response variable (also known as the dependent variable): the variable that we want to explain, denoted by Y
2. Explanatory variable (also known as the covariate or independent variable): the variable that we use to explain the response variable, denoted by X

In other words, the variable Y is a function of the variable X:

Y = f(X)
Statistics of two variables: An introduction
Some examples of variables (categorical and/or numerical) that might be related to
each other:

1. Level of education and salary
2. Choice of university degree and professional background of parents
3. Marketing expenditure and sales volume of a given company
4. Country of origin and religious choice
5. Health status and country of residence
Statistics of two variables: An introduction

Linear versus non-linear relationship

Variables X and Y hold a perfect linear relationship if one can be expressed in terms of the other as:

Y = a + bX

If b > 0, we say they hold a perfect positive relationship
If b < 0, we say they hold a perfect negative relationship

Example of a non-linear relationship: Y = e^(a + bX)

→ During this lecture we will only discuss how to deal with linear relationships
Statistics of two variables: An introduction
Some initial concepts concerning relations between two variables:

Association versus causation

• Association (correlation): a relation between two variables
• Causation (causality): one variable (the independent variable) directly determines the other (the dependent variable)

Important: correlation is not causation!

Funny examples

2. Scatterplot

Scatterplot

• Scatterplots are used to graph two numerical variables, at least one of which should be continuous
• The scatterplot can be a first indicator of whether or not there is a linear relationship between the two variables
• We can form pairs (xi, yi) for each observation in the sample; each pair represents a point in the plane
• The pair of values corresponding to a given observation is taken as coordinates in a 2-D graph, so each dot corresponds to one observation (xi, yi)
Scatterplot
Graph interpretation: Form, direction and strength

[Three example scatterplots, left to right: linear relationship (negative), linear relationship (positive), no relationship]
Scatterplot
Graph interpretation: non-linear relationship (example: Quadratic)

Scatterplot
Example: Infant mortality and number of pediatricians per year in 10 cities

 i   Nº deaths (yi)   Nº doctors (xi)
 1        10                80
 2        20                85
 3        30                70
 4        35                60
 5        45                60
 6        55                50
 7        65                40
 8        75                30
 9        90                25
10       100                25

[Scatterplot "Child deaths and pediatricians (annual, per city)": Nº of pediatricians on the x-axis, Nº of deaths on the y-axis]
3. Covariance and correlation coefficient

Covariance

Once we identify a possible linear relationship by plotting the numerical data in a scatterplot, we can quantify this relation.

As a first attempt, we can use the covariance:

SXY = Σᵢ₌₁ⁿ (xᵢ − x̄)(yᵢ − ȳ) / (n − 1)
    = Σᵢ₌₁ⁿ xᵢ yᵢ / (n − 1) − (n / (n − 1)) · x̄ · ȳ
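The two expressions above are algebraically equivalent. A minimal Python sketch (with illustrative made-up data, not the slide's example) checking that both give the same value:

```python
# Sample covariance computed two ways, both with the (n - 1) denominator.
def covariance_deviations(x, y):
    """Sum of products of deviations from the means, divided by n - 1."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    return sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / (n - 1)

def covariance_products(x, y):
    """Equivalent 'shortcut' form: sum of xi*yi / (n-1) minus n/(n-1) * xbar * ybar."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    return sum(xi * yi for xi, yi in zip(x, y)) / (n - 1) - n / (n - 1) * xbar * ybar

x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 5.0, 9.0]
print(covariance_deviations(x, y))  # both calls print the same value
print(covariance_products(x, y))
```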

Covariance

Direction of the relationship:

• If SXY > 0, then there is a positive linear relationship
• If SXY < 0, then there is a negative linear relationship
• If SXY = 0, then there is no linear relationship

The problem with the covariance is that it is not bounded, and hence we
cannot determine if the relationship between variables is weak or strong

Correlation coefficient

A finer tool to quantify the relationship between two numerical variables is the correlation coefficient: the covariance normalized by the product of the standard deviations.

rXY = SXY / (SX × SY)

where SX and SY are the standard deviations of X and Y respectively

Correlation coefficient

It is bounded between -1 and 1, where rXY = ±1 represents a perfect relationship

As with the covariance, the sign of rXY tells us the direction of the relationship:

• If rXY > 0, then there is a positive relationship
• If rXY < 0, then there is a negative relationship
• If rXY = 0, then there is no linear relationship

Evans (1996) suggests the following interpretation of |rXY|:

• 0.00–0.19 “very weak”
• 0.20–0.39 “weak”
• 0.40–0.59 “moderate”
• 0.60–0.79 “strong”
• 0.80–1.00 “very strong”
Covariance and correlation coefficient
Example: Infant mortality and number of pediatricians per year in 10 cities

(data as in the table above)

Learned in previous sessions:

x̄ = 52.5    ȳ = 52.5
sx = 22.14   sy = 29.93
Covariance and correlation coefficient
Example: Infant mortality and number of pediatricians per year in 10 cities
(data as in the table above)

Σᵢ₌₁¹⁰ xᵢ yᵢ = 21750

x̄ = 52.5    ȳ = 52.5
sx = 22.14   sy = 29.93

sxy = 21750/9 − (10/9) × 52.5 × 52.5 = −645.83

rxy = −645.83 / (29.93 × 22.14) = −0.97
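The figures above can be reproduced in Python with the standard-library `statistics` module (the data come from the table above):

```python
import statistics

deaths = [10, 20, 30, 35, 45, 55, 65, 75, 90, 100]   # y: Nº deaths
doctors = [80, 85, 70, 60, 60, 50, 40, 30, 25, 25]   # x: Nº doctors

n = len(deaths)
xbar = statistics.mean(doctors)   # 52.5
ybar = statistics.mean(deaths)    # 52.5
sx = statistics.stdev(doctors)    # sample standard deviation, ~22.14
sy = statistics.stdev(deaths)     # ~29.93

# Covariance via the shortcut formula, then the correlation coefficient.
sxy = sum(x * y for x, y in zip(doctors, deaths)) / (n - 1) - n / (n - 1) * xbar * ybar
rxy = sxy / (sx * sy)

print(round(sxy, 2))  # -645.83
print(round(rxy, 2))  # -0.97
```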

4. Linear regression analysis

Linear regression analysis

Why regression analysis?

• It provides a summary of a possible relationship between two or more variables

• We can use the regression model to make predictions on the variable Y

• We can interpret how changes in X may affect changes in Y

Best fitting line (least squares line)
What is the best fitting line?
The idea: we want to construct a linear equation that does the best possible job at matching the true relationship.

We want to estimate:

Yi = a + bXi

Let's call our estimated line:

Ŷi = â + b̂ · Xi

We denote:

Yi − Ŷi = residual

We want to make the least mistakes (we want the residuals to be minimal) → the least squares line minimizes the sum of squared residuals.

[Scatterplot illustrating the observed points and the fitted line]
Best fitting line (least squares line)

The equation of the least squares line is:

ŷ = a + b · x     (a: intercept/constant, b: slope)

The values of a and b that minimize the sum of squared residuals are:

b = SXY / SX² = rXY × (SY / SX)

a = ȳ − b · x̄
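A quick Python check (illustrative data, not the slide's example) that the closed-form a and b above really minimize the sum of squared residuals: perturbing the solution in any direction cannot decrease the criterion.

```python
def least_squares(x, y):
    """Closed-form slope and intercept: b = Sxy / Sx^2, a = ybar - b * xbar."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / (n - 1)
    sxx = sum((xi - xbar) ** 2 for xi in x) / (n - 1)
    b = sxy / sxx
    a = ybar - b * xbar
    return a, b

def ssr(a, b, x, y):
    """Sum of squared residuals for the line y-hat = a + b * x."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]
a, b = least_squares(x, y)

# Small perturbations of (a, b) can only increase the sum of squared residuals.
best = ssr(a, b, x, y)
assert all(ssr(a + da, b + db, x, y) >= best
           for da in (-0.1, 0, 0.1) for db in (-0.1, 0, 0.1))
```

Note also that the formula a = ȳ − b·x̄ forces the fitted line to pass through the point (x̄, ȳ).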

Best fitting line (least squares line) - Computation
Example: Infant mortality and number of pediatricians per year in 10 cities

i Nº Nº
deaths doctors
We already had:
(yi) (xi)
1 10 80 𝑠+ = 22,14 𝑠, = 29,93 𝑟+, = −0,97
2 20 85
3 30 70
4 35 60 𝑥̅ = 52,5 𝑦V = 52,5
5 45 60
6 55 50
7 65 40
8 75 30
9 90 25
10 100 25

24
Best fitting line (least squares line) - Computation
Example: Infant mortality and number of pediatricians per year in 10 cities

(data as in the table above)

sx = 22.14   sy = 29.93   rxy = −0.97
x̄ = 52.5    ȳ = 52.5

Slope → b = −0.97 × (29.93 / 22.14) = −1.32

Intercept → a = 52.5 − (−1.32 × 52.5) = 121.8

Best fitting line: deaths-hat = 121.8 − 1.32 × doctors
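The slide's coefficients can be replayed from the raw data; note that the slope is rounded to two decimals before the intercept is computed, which is how the intercept comes out as 121.8:

```python
deaths = [10, 20, 30, 35, 45, 55, 65, 75, 90, 100]   # y
doctors = [80, 85, 70, 60, 60, 50, 40, 30, 25, 25]   # x

n = len(doctors)
xbar = sum(doctors) / n
ybar = sum(deaths) / n
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(doctors, deaths)) / (n - 1)
sxx = sum((x - xbar) ** 2 for x in doctors) / (n - 1)

b = round(sxy / sxx, 2)        # slope, rounded as on the slide: -1.32
a = round(ybar - b * xbar, 1)  # intercept from the rounded slope: 121.8

print(b, a)  # -1.32 121.8
```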

Best fitting line (least squares line) - Prediction

For a given value of x we can predict a value for y.

Example: number of child deaths (per year) and number of pediatricians in 10 cities.

How many annual child deaths would be expected in a city with 55 doctors?

deaths-hat = 121.8 − 1.32 × 55 = 49.2
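The prediction is a one-liner in Python, plugging the value of x into the fitted line (the coefficients are the rounded values from the slides):

```python
def predict_deaths(doctors):
    """Prediction from the fitted line: deaths-hat = 121.8 - 1.32 * doctors."""
    return 121.8 - 1.32 * doctors

print(round(predict_deaths(55), 1))  # 49.2
```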

Scatterplot with best fitting line
Example: Infant mortality and number of pediatricians per year in 10 cities

= = 121,8 – 1,32 𝑑𝑜𝑐𝑡𝑜𝑟𝑠


𝑑𝑒𝑎𝑡ℎ𝑠
i Nº Nº
deaths doctors
(yi) (xi) Child death (per year) and number of child-doctors in 10 cities

100
1 10 80
2 20 85
80
3 30 70
4 35 60
60

5 45 60
6 55 50
40

7 65 40
8 75 30
20

9 90 25
10 100 25
0

20 40 60 80
doctors

Observed deaths Expected deaths (BFL)

27
Coefficient of determination

We can also quantify how good the best fitting line (BFL) is in describing the
sample

The idea is to measure how much of the variance of Y the BFL captures through the variance of X

To measure this goodness of fit, we use the coefficient of determination:

R² = (rXY)²

The closer to 1, the better the fit of the line
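For simple linear regression this R² coincides with the share of the variance of Y explained by the fitted line, i.e. 1 − SSE/SST. A sketch (illustrative data) checking that the two computations agree:

```python
def r_squared_two_ways(x, y):
    """Return R^2 computed (1) as the squared correlation and (2) as 1 - SSE/SST."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    syy = sum((yi - ybar) ** 2 for yi in y)

    # Route 1: square of the correlation coefficient.
    r2_from_r = (sxy / (sxx * syy) ** 0.5) ** 2

    # Route 2: 1 - SSE / SST with the least squares line.
    b = sxy / sxx
    a = ybar - b * xbar
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    return r2_from_r, 1 - sse / syy

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.9]
r2a, r2b = r_squared_two_ways(x, y)
assert abs(r2a - r2b) < 1e-9
```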

Coefficient of determination

Example: Infant mortality and number of pediatricians per year in 10 cities

sxy = −645.83    r = −0.97    R² = 0.94

• The correlation coefficient tells us that there is a very strong negative linear relationship between the number of child deaths and the number of doctors

• The coefficient of determination tells us that the number of doctors in a given city is able to explain 94% of the variation in child deaths in the same city

Summary

To carry out a linear regression analysis:

1. Draw a scatterplot of the data (it will give you a first indication of whether or not there is a linear relationship between the variables)
2. Compute the correlation coefficient (to measure how strong the linear relationship between the variables is)
3. Find the linear regression model that describes the variable Y in terms of the variable X, also called the equation of the best fitting line
4. Graph the best fitting line in the scatterplot of the observed values
5. Compute the coefficient of determination, to measure how well the linear model describes Y
