
Analysis of dependence

Descriptive statistics, lecture 4


The subject of analysis of dependence
• Analysis of dependence examines relationships between two or more
statistical features.
Methods of analysis of dependence
• graphical (scatterplot),
• analytical:
• correlation analysis:
• Tschuprow's T coefficient (Txy = Tyx),
• Spearman’s rank correlation coefficient (Rxy = Ryx),
• Correlation ratios (exy, eyx),
• Pearson product-moment correlation coefficient (rxy = ryx).
• regression analysis:
• empirical regression lines,
• theoretical regression lines.
Methods of data presentation
• Correlation series,
• Contingency table.
Correlation series
Values of variable x (xi)    Values of variable y (yi)
x1                           y1
x2                           y2
⁝                            ⁝
xn                           yn
Contingency table
Variants of        Variants of variable y
variable x         y1     y2     …     yl        ni.
x1                 n11    n12    …     n1l       n1.
x2                 n21    n22    …     n2l       n2.
⁝                  ⁝      ⁝      ⋱     ⁝         ⁝
xk                 nk1    nk2    …     nkl       nk.
n.j                n.1    n.2    …     n.l       n
Tschuprow's T coefficient
T_{xy} = T_{yx} = \sqrt{\frac{\chi^2}{n \cdot \sqrt{(k-1) \cdot (l-1)}}},

where:

\chi^2 = \sum_{i=1}^{k} \sum_{j=1}^{l} \frac{\left(n_{ij} - \hat{n}_{ij}\right)^2}{\hat{n}_{ij}}, \qquad \hat{n}_{ij} = \frac{n_{i.} \cdot n_{.j}}{n},

k – number of variants of the variable x (number of rows in the contingency table),
l – number of variants of the variable y (number of columns in the contingency table),
n_ij – empirical numbers,
n̂_ij – theoretical numbers.
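A minimal computational sketch of the formulas above (Python with NumPy; the 2×3 table of counts is an assumed example, not data from the lecture):

import numpy as np

# Assumed example 2x3 contingency table of empirical counts n_ij
counts = np.array([[20, 30, 10],
                   [15, 10, 15]], dtype=float)

n = counts.sum()
row_totals = counts.sum(axis=1, keepdims=True)    # n_i.
col_totals = counts.sum(axis=0, keepdims=True)    # n_.j
expected = row_totals @ col_totals / n            # theoretical counts n-hat_ij

chi2 = ((counts - expected) ** 2 / expected).sum()
k, l = counts.shape                               # rows and columns of the table
T = np.sqrt(chi2 / (n * np.sqrt((k - 1) * (l - 1))))
print(T)                                          # a value in [0, 1]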
Tschuprow's T coefficient – properties
• It is symmetric (Txy = Tyx).
• It can be calculated only for a contingency table.
• Both variables can be nominal (qualitative, non-measurable) – it is
calculated on the basis of numbers, not the variants of variables.
• It takes values from the interval [0, 1] – it measures only the
correlation strength, not the direction.
• If its value is 0 – then there is no correlation between the two features,
if it is equal to 1 – then the correlation is functional. The closer to 1 it
is, the stronger the correlation is.
• Each empirical (n_ij) or theoretical (n̂_ij) count must be at least 5.
Correlation ratios
e_{yx} = \sqrt{\frac{S^2(\bar{y}_i)}{S^2(y)}}; \qquad e_{xy} = \sqrt{\frac{S^2(\bar{x}_j)}{S^2(x)}};

where:

S^2(\bar{y}_i) = \frac{\sum_{i=1}^{k} \left(\bar{y}_i - \bar{y}\right)^2 \cdot n_{i.}}{n}; \qquad S^2(\bar{x}_j) = \frac{\sum_{j=1}^{l} \left(\bar{x}_j - \bar{x}\right)^2 \cdot n_{.j}}{n};

\bar{y}_i = \frac{\sum_{j=1}^{l} y_j \cdot n_{ij}}{n_{i.}}; \qquad \bar{x}_j = \frac{\sum_{i=1}^{k} x_i \cdot n_{ij}}{n_{.j}};

ȳ_i – conditional means of the variable y,
x̄_j – conditional means of the variable x.
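A sketch of the computation for a contingency table whose variants are numerical (Python; the arrays x_vals, y_vals, and counts are assumed example data):

import numpy as np

x_vals = np.array([1.0, 2.0, 3.0])          # variants of x (rows), assumed example
y_vals = np.array([10.0, 20.0])             # variants of y (columns), assumed example
counts = np.array([[5, 15],
                   [10, 10],
                   [15, 5]], dtype=float)   # empirical counts n_ij

n = counts.sum()
n_i = counts.sum(axis=1)                     # n_i.
n_j = counts.sum(axis=0)                     # n_.j

y_bar_i = counts @ y_vals / n_i              # conditional means of y
x_bar_j = counts.T @ x_vals / n_j            # conditional means of x

x_bar = (x_vals * n_i).sum() / n             # overall means
y_bar = (y_vals * n_j).sum() / n

S2_y = ((y_vals - y_bar) ** 2 * n_j).sum() / n       # total variances
S2_x = ((x_vals - x_bar) ** 2 * n_i).sum() / n
S2_ybar = ((y_bar_i - y_bar) ** 2 * n_i).sum() / n   # variances of conditional means
S2_xbar = ((x_bar_j - x_bar) ** 2 * n_j).sum() / n

e_yx = np.sqrt(S2_ybar / S2_y)
e_xy = np.sqrt(S2_xbar / S2_x)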
Correlation ratios – properties
• They are not symmetric (exy ≠ eyx).
• They can be calculated only for a contingency table.
• At least one variable – the dependent one (y for the coefficient eyx and x for
the coefficient exy) – must be numerical (measurable).
• They take values from the interval [0, 1] – they measure only the
correlation strength, not the direction.
• If its value is 0 – then there is no correlation between the two features,
if it is equal to 1 – then the correlation is functional. The closer to 1 it
is, the stronger the correlation is.
Pearson product-moment correlation
coefficient
r_{xy} = r_{yx} = \frac{\mathrm{cov}(x, y)}{S(x) \cdot S(y)};

where:

\mathrm{cov}(x, y) = \overline{x \cdot y} - \bar{x} \cdot \bar{y};

\overline{x \cdot y} = \frac{\sum_{i=1}^{n} x_i \cdot y_i}{n} – for the correlation series,

\overline{x \cdot y} = \frac{\sum_{i=1}^{k} \sum_{j=1}^{l} x_i \cdot y_j \cdot n_{ij}}{n} – for the contingency table.
r_{xy}^2 \cdot 100\% = r_{yx}^2 \cdot 100\% – coefficient of linear determination. It tells
what percentage of the changes of one variable is determined by changes
of the other one.
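A sketch for the correlation-series case (Python; the two series are assumed example data; np.std uses ddof=0, matching the descriptive S(x)):

import numpy as np

# Assumed example correlation series (paired observations)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

xy_mean = (x * y).mean()                 # mean of products, correlation-series case
cov_xy = xy_mean - x.mean() * y.mean()
r = cov_xy / (x.std() * y.std())         # Pearson coefficient
print(r, r ** 2 * 100)                   # coefficient and linear determination in %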
Pearson product-moment correlation
coefficient – properties
• It is symmetric (rxy = ryx).
• It can be calculated for both the correlation series and the contingency table.
• Both variables must be strictly numerical.
• The relation between the variables must be linear – if it is not, the value of the
coefficient will be underestimated.
• It takes the values from the interval [-1, 1] – it measures both the correlation strength, and
the direction.
• If the correlation is negative, then if one variable increases, the other decreases and vice
versa.
• If the correlation is positive, then if one variable increases, the other also increases and
vice versa.
• If its value is 0 – then there is no correlation between the two features, if it is equal to -1
or 1 – then the correlation is functional. The closer to -1 or 1 it is, the stronger the
correlation is.
Estimation of the degree of nonlinearity (only
for the contingency table)
As the correlation ratios (exy and eyx) are always at least equal to
|rxy| = |ryx|, the formulas:

m_{xy} = \sqrt{e_{xy}^2 - r_{xy}^2}, \qquad m_{yx} = \sqrt{e_{yx}^2 - r_{yx}^2}

measure the degree of nonlinearity of the relationship (mxy – x is dependent
on y, and myx – y is dependent on x).
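A one-step sketch of the computation (Python; the values of e_yx and r are assumed for illustration and satisfy e_yx ≥ |r|):

import numpy as np

e_yx, r = 0.80, 0.70                     # assumed illustrative values
m_yx = np.sqrt(e_yx ** 2 - r ** 2)       # degree of nonlinearity of y's dependence on x
print(m_yx)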
Spearman’s rank coefficient
• We use this coefficient when:
• variables are numerical, but conditions for the Pearson product-moment
correlation coefficient (linearity of relationship and normality of variables) are
not satisfied,
• when at least one variable is measured on the ordinal scale.
• In the first step we assign ranks to the variables:
• We set the values of both variables in the ascending or descending order (but
we must be consistent – both variables must be set in the same order).
• We assign ranks to the subsequent values.
• If two or more units have the same value, we assign each of them the mean of
the subsequent ranks that would be assigned to them (see the sketch after the
tie-corrected formula below).
Spearman’s rank coefficient
R_{xy} = R_{yx} = \frac{\mathrm{cov}(r_x, r_y)}{S(r_x) \cdot S(r_y)},

where:
r_x – ranks of the variable x,
r_y – ranks of the variable y.
Spearman’s rank coefficient
When all ranks are distinct, then the formula simplifies to:
R_{xy} = R_{yx} = 1 - \frac{6 \cdot \sum_{i=1}^{n} d_i^2}{n \cdot \left(n^2 - 1\right)},

where:
d_i = \mathrm{rank}(x_i) - \mathrm{rank}(y_i).
Spearman’s rank coefficient
In case of tied ranks, we obtain:
R_{xy} = R_{yx} = \frac{\frac{n^3 - n}{6} - \sum_{i=1}^{n} d_i^2 - T_x - T_y}{\sqrt{\left(\frac{n^3 - n}{6} - 2 \cdot T_x\right) \cdot \left(\frac{n^3 - n}{6} - 2 \cdot T_y\right)}},

where:

T_x = \frac{1}{12} \sum_j \left(t_j^3 - t_j\right), \qquad T_y = \frac{1}{12} \sum_k \left(u_k^3 - u_k\right),

t_j – number of observations having the same, j-th tied rank value of the variable x,
u_k – number of observations having the same, k-th tied rank value of the variable y.
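A sketch of both steps – assigning mean ranks to ties and applying the tie-corrected formula (Python; the two short series are assumed example data). Note that with no ties T_x = T_y = 0 and the expression reduces to the simplified formula above:

import numpy as np
from collections import Counter

def mean_ranks(values):
    # Tied values receive the mean of the subsequent ranks they occupy.
    order = np.argsort(values)
    sorted_vals = np.asarray(values)[order]
    ranks = np.empty(len(values))
    i = 0
    while i < len(values):
        j = i
        while j + 1 < len(values) and sorted_vals[j + 1] == sorted_vals[i]:
            j += 1
        ranks[order[i:j + 1]] = (i + j) / 2 + 1   # mean of ranks i+1 .. j+1
        i = j + 1
    return ranks

def tie_term(ranks):
    # T = (1/12) * sum(t^3 - t) over groups of equal ranks
    return sum(t ** 3 - t for t in Counter(ranks.tolist()).values()) / 12

# Assumed example data with ties
x = np.array([3.0, 1.0, 4.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 3.0, 5.0, 5.0])

rx, ry = mean_ranks(x), mean_ranks(y)
n = len(x)
d2 = ((rx - ry) ** 2).sum()
Tx, Ty = tie_term(rx), tie_term(ry)

num = (n ** 3 - n) / 6 - d2 - Tx - Ty
den = np.sqrt(((n ** 3 - n) / 6 - 2 * Tx) * ((n ** 3 - n) / 6 - 2 * Ty))
R = num / den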
Spearman’s rank coefficient – properties
• It is symmetric (Rxy = Ryx).
• It can be calculated only for the correlation series.
• Both variables must be at least on the ordinal scale.
• It takes the values from the interval [-1, 1] – it measures both the correlation
strength, and the direction.
• If the correlation is negative, then if one variable increases, the other
decreases and vice versa.
• If the correlation is positive, then if one variable increases, the other also
increases and vice versa.
• If its value is 0 – then there is no correlation between the two features, if it
is equal to -1 or 1 – then the correlation is functional. The closer to -1 or 1 it
is, the stronger the correlation is.
Regression analysis – empirical regression
lines
• They are drawn on the basis of the contingency table.
• They are based on the conditional means.
• Both variables must be strictly numerical (measurable).
• We draw two lines joining the following points:
xi     ȳi          x̄j     yj
x1     ȳ1          x̄1     y1
x2     ȳ2          x̄2     y2
⁝      ⁝           ⁝      ⁝
xk     ȳk          x̄l     yl
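A sketch of drawing the two lines from the conditional means (Python with matplotlib; x_vals, y_vals, and counts are assumed example data, as in the correlation-ratio sketch above):

import numpy as np
import matplotlib.pyplot as plt

x_vals = np.array([1.0, 2.0, 3.0])           # assumed example variants of x
y_vals = np.array([10.0, 20.0])              # assumed example variants of y
counts = np.array([[5, 15],
                   [10, 10],
                   [15, 5]], dtype=float)    # empirical counts n_ij

y_bar_i = counts @ y_vals / counts.sum(axis=1)    # conditional means of y
x_bar_j = counts.T @ x_vals / counts.sum(axis=0)  # conditional means of x

plt.plot(x_vals, y_bar_i, marker="o", label="points (xi, ȳi)")
plt.plot(x_bar_j, y_vals, marker="s", label="points (x̄j, yj)")
plt.legend()
plt.show()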
Empirical regression lines – properties
• The empirical regression lines cross each other at one point.
• The smaller the angle between them, the stronger the dependence.
• If the empirical regression lines coincide, the dependence is functional.
• If the angle between them is 90 degrees, there is no dependence between
the variables.
• If one empirical line is ascending, the other is also ascending and the
dependence between the variables is positive, and vice versa.
Theoretical regression lines
• By theoretical regression lines we mean the fitted mathematical
function that describes dependence between both variables.
• Let us assume the linear regression between analysed variables:
𝑦 = 𝑎𝑦 ∙ 𝑥 + 𝑏𝑦 – variable y is the dependent one and x – independent
𝑥 = 𝑎𝑥 ∙ 𝑦 + 𝑏𝑥 – variable x is the dependent one and y – independent
ay, ax – slope parameters,
by, bx – intercepts.
Parameters estimation
• Parameters are estimated by means of the Ordinary Least Squares method (OLS).
• Parameters estimates:
a_y = \frac{\mathrm{cov}(x, y)}{S^2(x)} = \frac{r_{yx} \cdot S(y)}{S(x)}; \qquad b_y = \bar{y} - a_y \cdot \bar{x};

a_x = \frac{\mathrm{cov}(x, y)}{S^2(y)} = \frac{r_{xy} \cdot S(x)}{S(y)}; \qquad b_x = \bar{x} - a_x \cdot \bar{y}.
ay – tells by how much the variable y will change if the variable x increases by one
unit.
ax – tells by how much the variable x will change if the variable y increases by one
unit.
by, bx – generally have no economic interpretation.
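A sketch of the OLS estimates for a correlation series (Python; the series is an assumed example; np.var uses ddof=0, matching the descriptive S²(x)):

import numpy as np

# Assumed example correlation series
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

cov_xy = (x * y).mean() - x.mean() * y.mean()
a_y = cov_xy / x.var()            # slope of y = a_y * x + b_y
b_y = y.mean() - a_y * x.mean()
a_x = cov_xy / y.var()            # slope of x = a_x * y + b_x
b_x = x.mean() - a_x * y.mean()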
Correlation between more than two variables
• multiple correlation – the total influence of all independent variables
on the dependent one;
• partial correlation – the correlation between two variables, with the
omission of the influence of remaining ones.
Multiple correlation
Multiple correlation coefficient is calculated by means of the following formula:
R_{y.x_1, x_2, \ldots, x_k} = R_w = \sqrt{1 - \frac{\det R_n}{\det R_m}},

where:
R_n – correlation matrix,
R_m – correlation matrix after removing the row and the column that refer to the
dependent variable.
For three variables, the formula can be rewritten as follows:

R_{y.x_1, x_2} = R_{1.23} = \sqrt{\frac{r_{12}^2 + r_{13}^2 - 2 \cdot r_{12} \cdot r_{13} \cdot r_{23}}{1 - r_{23}^2}}.
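A sketch of both variants of the formula (Python; the correlation matrix R_n is an assumed example with the dependent variable in the first row and column):

import numpy as np

# Assumed example correlation matrix R_n; variable 1 = y, variables 2, 3 = x1, x2
Rn = np.array([[1.0, 0.6, 0.5],
               [0.6, 1.0, 0.3],
               [0.5, 0.3, 1.0]])

Rm = Rn[1:, 1:]                     # remove the dependent variable's row and column
Rw = np.sqrt(1 - np.linalg.det(Rn) / np.linalg.det(Rm))

# Three-variable closed form, for comparison (gives the same value)
r12, r13, r23 = Rn[0, 1], Rn[0, 2], Rn[1, 2]
R123 = np.sqrt((r12**2 + r13**2 - 2 * r12 * r13 * r23) / (1 - r23**2))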
Multiple correlation coefficient – properties
• It takes values from the interval [0, 1] – it measures only the
correlation strength, not the direction.
• If its value is 0 – then there is no correlation, if it is equal to 1 – then
the correlation is functional. The closer to 1 it is, the stronger the
correlation is.
• The squared multiple correlation coefficient gives the coefficient of
linear determination, which tells what percentage of the changes of the
dependent variable is explained by changes of the independent ones.
Partial correlation
Partial correlation coefficient is calculated by means of the following
formula:
r_{12.3} = \frac{-R_{12}}{\sqrt{R_{11} \cdot R_{22}}} = \frac{r_{12} - r_{13} \cdot r_{23}}{\sqrt{\left(1 - r_{13}^2\right) \cdot \left(1 - r_{23}^2\right)}},

where:
R_ij – the cofactor of the element of the matrix R_n standing in the i-th row
and the j-th column:

R_{ij} = (-1)^{i+j} \cdot M_{ij},

where:
M_ij – the minor, i.e. the determinant of the submatrix obtained by removing
the i-th row and the j-th column from the matrix R_n.
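A sketch of the partial correlation in both its three-variable and cofactor forms (Python; the pairwise correlations are assumed example values):

import numpy as np

# Assumed example pairwise correlations
r12, r13, r23 = 0.6, 0.5, 0.3

r12_3 = (r12 - r13 * r23) / np.sqrt((1 - r13**2) * (1 - r23**2))

# Equivalent cofactor form, built from the same correlation matrix R_n
Rn = np.array([[1.0, r12, r13],
               [r12, 1.0, r23],
               [r13, r23, 1.0]])

def cofactor(m, i, j):
    # (-1)^(i+j) times the minor obtained by deleting row i and column j
    sub = np.delete(np.delete(m, i, axis=0), j, axis=1)
    return (-1) ** (i + j) * np.linalg.det(sub)

r12_3_alt = -cofactor(Rn, 0, 1) / np.sqrt(cofactor(Rn, 0, 0) * cofactor(Rn, 1, 1))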
Properties of the partial correlation
coefficient
• It takes the values from the interval [-1, 1] – it measures both the
correlation strength, and the direction.
• If the correlation is negative, then if one variable increases, the other
decreases and vice versa.
• If the correlation is positive, then if one variable increases, the other
also increases and vice versa.
• If its value is 0 – then there is no correlation between the two features,
if it is equal to -1 or 1 – then the correlation is functional. The closer to
-1 or 1 it is, the stronger the correlation is.
