Multiple Regression
Simple Regression
One cause, one effect:
Independent variable → Dependent variable

Multiple Regression
Several causes, one effect:
Independent variables (Oil Prices, Government, S&P 500 Share Index) → Dependent variable (Bond Yields)
Regression Line:
y = A + Bx

[Scatter plot of data points (x1, y1), (x2, y2), (x3, y3), …, (xn, yn) with the fitted regression line]

Causes: Dow Jones index, price of oil → Effect: Exxon stock
Multiple Regression
Regression Equation:
EXXONt = A + B DOWt + C OILt

Stacking the n daily observations in matrix form:

[E1]     [1]     [D1]     [O1]   [e1]
[E2]     [1]     [D2]     [O2]   [e2]
[E3] = A [1] + B [D3] + C [O3] + [e3]
[…]      […]     […]      […]    […]
[En]     [1]     [Dn]     [On]   [en]

Ei = % return on Exxon stock on day i
Di = % return of Dow Jones index on day i
Oi = % change in price of oil on day i
Multiple Regression
Regression Equation:
y = A + Bx + Cz

[y1]     [1]     [x1]     [z1]   [e1]
[y2]     [1]     [x2]     [z2]   [e2]
[y3] = A [1] + B [x3] + C [z3] + [e3]
[…]      […]     […]      […]    […]
[yn]     [1]     [xn]     [zn]   [en]
Multiple Regression
Regression Equation:
y = A + Bx + Cz

Equivalently, collecting the regressors into a single design matrix:

[y1]   [1  x1  z1]   [A]   [e1]
[y2]   [1  x2  z2]   [B]   [e2]
[y3] = [1  x3  z3] * [C] + [e3]
[…]    […  …   … ]         […]
[yn]   [1  xn  zn]         [en]

n Rows,    n Rows,     3 Rows,    n Rows,
1 Column   3 Columns   1 Column   1 Column
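The stacked system above can be fitted directly with least squares. A minimal numpy sketch with made-up data (the true coefficients A = 2, B = 3, C = -1.5 are invented for illustration):

```python
import numpy as np

# Hypothetical data: n = 100 observations of two regressors x and z.
rng = np.random.default_rng(0)
n = 100
x = rng.normal(size=n)
z = rng.normal(size=n)
e = rng.normal(scale=0.1, size=n)
y = 2.0 + 3.0 * x - 1.5 * z + e  # invented truth: A=2, B=3, C=-1.5

# Build the n-by-3 design matrix [1, x, z] from the slide.
X = np.column_stack([np.ones(n), x, z])

# Solve for [A, B, C] by least squares.
coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
A, B, C = coeffs
print(A, B, C)  # close to 2.0, 3.0, -1.5
```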
Multiple Regression
2 Causes, 1 Effect: Dow Jones index, price of oil → Exxon stock

Multiple Regression
k Causes, 1 Effect: Dow Jones index, price of oil, bond yields… → Exxon stock
Multiple Regression
Regression Equation:
y = C1 + C2x1 + … + Ck+1xk

[y1]      [1]      [x11]            [x1k]   [e1]
[y2]      [1]      [x21]            [x2k]   [e2]
[y3] = C1 [1] + C2 [x31] + … + Ck+1 [x3k] + [e3]
[…]       […]      [ … ]            [ … ]   […]
[yn]      [1]      [xn1]            [xnk]   [en]
Multiple Regression
Regression Equation:
y = C1 + C2x1 + … + Ck+1xk

[y1]   [1  x11 … x1k]   [C1 ]   [e1]
[y2]   [1  x21 … x2k]   [C2 ]   [e2]
[y3] = [1  x31 … x3k] * [ … ] + [e3]
[…]    […   …  …  … ]   [Ck+1]  […]
[yn]   [1  xn1 … xnk]           [en]

n Rows,    n Rows,         k+1 Rows,   n Rows,
1 Column   k+1 Columns     1 Column    1 Column
Multiple Regression
Regression Equation:
y = C1 + C2x1 + … + Ckxk-1

How to choose the coefficients? Method of moments | Method of least squares | Maximum likelihood estimation

[y1]   [1  x11 … x1,k-1]   [C1]
[y2]   [1  x21 … x2,k-1]   [C2]
[y3] = [1  x31 … x3,k-1] * [… ]
[…]    […   …  …    …  ]   [Ck]
[yn]   [1  xn1 … xn,k-1]

n x 1       n x k          k x 1
(n Rows,    (n Rows,       (k Rows,
1 Column)   k Columns)     1 Column)
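The method of least squares has a closed form via the normal equations, C = (X'X)^-1 X'y. A sketch with invented data and coefficients:

```python
import numpy as np

# Invented data: intercept plus two regressors, true coefficients below.
rng = np.random.default_rng(7)
n = 120
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # [1, x1, x2]
true_c = np.array([1.0, -2.0, 0.5])
y = X @ true_c + rng.normal(scale=0.1, size=n)

# Normal equations: solve (X'X) C = X'y for the coefficient vector C.
C = np.linalg.solve(X.T @ X, X.T @ y)
print(C)  # close to [1.0, -2.0, 0.5]
```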
Multiple Regression
Regression Equation:
y = C1 + C2x1 + … + Ckxk-1

[y1]   [1  x11 … x1,k-1]   [C1]
[y2]   [1  x21 … x2,k-1]   [C2]
[y3] = [1  x31 … x3,k-1] * [… ]
[…]    […   …  …    …  ]   [Ck]
[yn]   [1  xn1 … xn,k-1]
Bad News: Multicollinearity Detected
Regress one explanatory variable (say X1) on the others (X2 … Xk): a high R2 means the regressors move together (multicollinearity); a low R2 means they do not.
[E1]     [1]     [D1]     [O1]
[E2]     [1]     [D2]     [O2]
[E3] = A [1] + B [D3] + C [O3]
[…]      […]     […]      […]
[En]     [1]     [Dn]     [On]

Ei = % return on Exxon stock on day i
Di = % return of Dow Jones index on day i
Oi = % change in price of oil on day i

Good News: No Multicollinearity Detected
Regressing DOW returns on OIL returns gives a low R2.
[E1]     [1]     [D1]     [N1]
[E2]     [1]     [D2]     [N2]
[E3] = A [1] + B [D3] + C [N3]
[…]      […]     […]      […]
[En]     [1]     [Dn]     [Nn]

Ei = % return on Exxon stock on day i
Di = % return of Dow Jones index on day i
Ni = % return of NASDAQ index on day i

Bad News: Multicollinearity Detected
Regressing DOW returns on NASDAQ returns gives a high R2.
Dow Jones Industrial Average: 30 large-cap US stocks
NASDAQ 100 Index: 100 large tech stocks
Oil Prices: price of a barrel of oil

Common Sense
Proposed Regression Equation:
EXXONt = A + B DOWt + C NASDAQt + D OILt

The Dow and the NASDAQ 100 track each other closely, so keeping both invites multicollinearity. Drop the NASDAQ 100 and keep:

Dow Jones Industrial Average: 30 large-cap US stocks
Oil Prices: price of a barrel of oil
[Chart: many interest-rate series plotted over time]
20-dimensional data → 3-dimensional data

Factor Analysis
From 1 column for each interest rate out there to 3 columns, 1 for each factor
Dimensionality Reduction via Factor Analysis

[x11  …  x1,k-1]      [f11  f12  f13]
[x21  …  x2,k-1]      [f21  f22  f23]
[x31  …  x3,k-1]  →   [f31  f32  f33]
[ …   …    …   ]      [ …    …    … ]
[xn1  …  xn,k-1]      [fn1  fn2  fn3]
Factor Analysis
Factor Analysis is a dimensionality-reduction technique to identify a few underlying causes in data.
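A sketch of one such reduction, using principal components computed via SVD (a close cousin of the factor-analysis step described above) on simulated interest-rate data; the 20 rates and 3 underlying factors are invented for illustration:

```python
import numpy as np

# Invented setup: 20 correlated rate series driven by 3 hidden factors.
rng = np.random.default_rng(3)
n, k, n_factors = 250, 20, 3
true_factors = rng.normal(size=(n, n_factors))
loadings = rng.normal(size=(n_factors, k))
rates = true_factors @ loadings + rng.normal(scale=0.05, size=(n, k))

# Center the columns, then keep the top 3 principal components
# as the 3 factor columns.
centered = rates - rates.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
factors = centered @ Vt[:3].T          # n rows, 3 columns

# Share of total variance captured by the 3 retained components.
explained = (S[:3] ** 2).sum() / (S ** 2).sum()
print(factors.shape, round(explained, 3))  # (250, 3), near 1.0
```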
Multiple Regression
Proposed Regression Equation:
HOMEt = A + B 5-yeart + C 10-yeart + D 2-yeart + E 1-yeart + F 3-montht + G 1-dayt + …

Principal Component Analysis
R2: Measures overall quality of fit; the higher the better (up to a point)
Residuals: Check if regression assumptions are violated
Standard Errors of coefficients: Gauge the significance of individual coefficients
e = y - y'
=> y = y' + e
=> Variance(y) = Variance(y' + e)
=> Variance(y) = Variance(y') + Variance(e) + 2 Covariance(y', e)
A Leap of Faith
Treat the covariance term as zero - this is important; more on why in a bit.

Variance(y) = Variance(y') + Variance(e)

Variance Explained
Variance of the dependent variable can be decomposed into the variance of the regression fitted values and that of the residuals.
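The leap of faith can be checked numerically: for a least-squares fit that includes an intercept, fitted values and residuals are exactly uncorrelated, so the covariance term vanishes. A sketch with made-up data:

```python
import numpy as np

# Invented data: simple regression with intercept 1 and slope 2.
rng = np.random.default_rng(2)
n = 200
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)

# Fit by least squares (design matrix includes an intercept column).
X = np.column_stack([np.ones(n), x])
coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
y_fit = X @ coeffs
e = y - y_fit

tss = y.var()       # Variance(y)
ess = y_fit.var()   # Variance(y')
rss = e.var()       # Variance(e)
print(np.isclose(tss, ess + rss))  # True: the decomposition is exact here
```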
TSS = Variance(y) ESS = Variance(y’) RSS = Variance(e)
R2 = ESS / TSS
R2
The percentage of total variance explained by the regression. Usually, the higher the R2, the better the quality of the regression (upper bound is 100%).
R2 = ESS / TSS

R2
In multiple regression, adding explanatory variables always increases R2, even if those variables are irrelevant and increase the danger of multicollinearity.

Adjusted-R2 = R2 x (Penalty for adding irrelevant variables)
Adjusted-R2
Increases if irrelevant variables are deleted
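One common form of the penalty (an assumption here, since the slide does not spell the formula out) is Adjusted R2 = 1 - (1 - R2)(n - 1)/(n - k - 1), with n observations and k explanatory variables:

```python
# Assumed penalty formula for Adjusted R^2 (n observations,
# k explanatory variables, not counting the intercept).
def adjusted_r2(r2, n, k):
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# An irrelevant extra variable nudges R^2 up slightly,
# yet pushes Adjusted R^2 down.
print(adjusted_r2(0.80, n=100, k=2))   # ~0.7959
print(adjusted_r2(0.801, n=100, k=3))  # ~0.7948: lower despite higher R^2
```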
Adding A Dummy Variable
Regression Line:
y = A + Bx

[Plot: two parallel lines with slope B, intercept A1 for males and A2 for females]

y = A1 + (A2 - A1)D + Bx

Adding A Dummy Variable
Regression Line For Males: y = A1 + Bx
Regression Line For Females: y = A2 + Bx

For males, D = 0:
y = A1 + (A2 - A1)(0) + Bx
  = A1 + Bx

For females, D = 1:
y = A1 + (A2 - A1)(1) + Bx
  = A2 + Bx
Adding A Dummy Variable
Original Regression Equation:
y = A + Bx
y = Height of individual; x = Average height of parents

[Plot: two lines with different intercepts and slopes; the female line is y = A2 + B2x]

Allowing different slopes as well:
y = A1 + (A2 - A1)D1 + B1x + (B2 - B1)D2

For males: D1 = 0, D2 = 0
y = A1 + B1x
Adding A Dummy Variable
Regression Line For Males: y = A1 + B1x
Regression Line For Females: y = A2 + B2x

For females: D1 = 1, D2 = x
y = A1 + (A2 - A1)(1) + B1x + (B2 - B1)x
  = A2 + B2x
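The intercept-dummy setup can be sketched in numpy with invented heights (the group levels 168 cm and 155 cm and the slope 0.1 are assumptions for illustration):

```python
import numpy as np

# Invented data: D = 1 for females, 0 for males, so fitting
# y = A1 + (A2 - A1) * D + B * x gives each group its own intercept.
rng = np.random.default_rng(5)
n = 300
x = rng.normal(loc=170, scale=5, size=n)   # parents' average height
d = rng.integers(0, 2, size=n)             # 1 = female, 0 = male
y = np.where(d == 1, 155.0, 168.0) + 0.1 * (x - 170) + rng.normal(size=n)

# Design matrix [1, D, x]; least squares recovers A1, (A2 - A1), B.
X = np.column_stack([np.ones(n), d, x])
(a1, gap, b), *_ = np.linalg.lstsq(X, y, rcond=None)
print("male intercept A1 =", a1)
print("gap A2 - A1 =", gap)   # near -13 by construction
print("slope B =", b)         # near 0.1 by construction
```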
Dummy Variables
Dummy explanatory variable (X): linear regression works as usual.
Dummy dependent variable (Y): use logistic regression.
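A minimal sketch of logistic regression for a 0/1 dependent variable, fit by plain gradient descent on simulated data; in practice a library such as statsmodels or scikit-learn would be used, and the coefficients here are invented:

```python
import numpy as np

# Invented truth: log-odds of y = 1 are 0.5 + 2x.
rng = np.random.default_rng(6)
n = 500
x = rng.normal(size=n)
p = 1 / (1 + np.exp(-(0.5 + 2.0 * x)))
y = (rng.random(n) < p).astype(float)   # 0/1 outcomes

# Gradient descent on the average log-loss.
X = np.column_stack([np.ones(n), x])
w = np.zeros(2)
for _ in range(2000):
    pred = 1 / (1 + np.exp(-X @ w))     # predicted probabilities
    w -= 0.1 * X.T @ (pred - y) / n     # gradient of the log-loss
print(w)  # near [0.5, 2.0]
```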
Normal Distribution
N(μ, σ): average (mean) is μ, standard deviation is σ
Standard Errors
E(α) = A, E(β) = B

[Plots: a high-R2 fit (tight scatter, small standard errors) versus a low-R2 fit (loose scatter, large standard errors)]

y1 = A + Bx1 + e1
y2 = A + Bx2 + e2
y3 = A + Bx3 + e3
…
yn = A + Bxn + en
Sample Regression Line
Regression Equation:
y = A + Bx
Residuals
y1 = A + Bx1 + e1
y2 = A + Bx2 + e2
y3 = A + Bx3 + e3
… …
yn = A + Bxn + en
RSS = Variance(e)
E(α) = A, E(β) = B

[Plots: sample regression line y = A + Bx versus population regression line y = α + βx]

Under the null hypothesis, E(α) = 0; the estimate A lies t x SE(α) away from zero.
E(α) = 0, E(β) = 0
A lies 0.85 x SE(α) from zero; B lies 9.01 x SE(β) from zero.

t-stat(α) = A/SE(α) = 0.85
t-stat(β) = B/SE(β) = 9.01
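The t-statistics can be computed from scratch; the standard errors come from the usual OLS formula Var(coefficients) = s^2 (X'X)^-1. A sketch with simulated data (the true intercept 0.1 and slope 1.0 are assumptions):

```python
import numpy as np

# Invented data: small intercept, clearly nonzero slope.
rng = np.random.default_rng(4)
n = 80
x = rng.normal(size=n)
y = 0.1 + 1.0 * x + rng.normal(size=n)

# Fit, then estimate the standard errors of the coefficients.
X = np.column_stack([np.ones(n), x])
coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ coeffs
s2 = resid @ resid / (n - 2)                  # residual variance (2 params)
se = np.sqrt(np.diag(s2 * np.linalg.inv(X.T @ X)))

t_stats = coeffs / se                         # t-stat = estimate / SE
print(t_stats)  # slope t-stat is large; the intercept's is much smaller
```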
E(α) = 0, E(β) = 0
t-stat(α) = 0.85 => p-value 0.39
t-stat(β) = 9.01 => p-value 2 x 10^-15
χ2 Distribution
Never mind the fine print about degrees of freedom for now
Null Hypotheses
β = B, α = A

[Plots: sample regression line y = A + Bx versus population regression line y = α + βx]
Adding A Dummy Variable
Using one dummy per group (no common intercept):
Regression Line For Males: y = A1 + Bx
Regression Line For Females: y = A2 + Bx

For males: y = A1(1) + A2(0) + Bx
  = A1 + Bx

For females: y = A1(0) + A2(1) + Bx
  = A2 + Bx

Adding A Dummy Variable
Original Regression Equation:
y = A + Bx
y = Height of individual; x = Average height of parents
R2: Excel and R usually agree; Python statsmodels sometimes differs.