Module 01: Linear Regression
Example data:

Obs    x      y
1      1.5    1.91
2      1.4    1.83
3      2.7    0.86
4      1.1    1.72
5      0.9    1.28
6      0.8    1.09
7      2.9    0.79
8      2.2    1.1
9      3.3    0.81
10     1.8    1.67
Glimpse of Linear Regression
lm1 <- lm(mpg ~ hp, data = mtcars)   # fit a simple linear regression of mpg on hp
summary(lm1)                         # coefficient estimates, t-tests, R-squared
anova(lm1)                           # ANOVA table for the regression
Normal Probability Plot of Residuals
General data layout for multiple regression (n observations, k predictors):

x1     x2     ...   xk     y
x11    x12    ...   x1k    y1
x21    x22    ...   x2k    y2
...    ...          ...    ...
xn1    xn2    ...   xnk    yn
• Comment: The process involves a degree of subjectivity and intuition about the
physical system and what model form makes sense and helps to answer the
relevant questions.
Estimation of Coefficients
Model form: $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + \epsilon$, with $\epsilon \sim N(0, \sigma^2)$
• For simple linear regression ($k = 1$): $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$
Estimation of Coefficients
• $L = \sum_{i=1}^{n} \epsilon_i^2 = \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{k} \beta_j x_{ij} \right)^2 = SSE$
• Minimize this function with respect to the $\beta$'s.
• How do we do this? Set $\dfrac{\partial L}{\partial \beta_j} = 0$ for $j = 0, 1, \dots, k$.
• In matrix form: $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}$
• $SSE = (\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})^{T}(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})$ (minimize with respect to $\hat{\boldsymbol{\beta}}$)
• Find $\hat{\boldsymbol{\beta}}$ for which the derivative of SSE is 0:
  $\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$
• Predicted response: $\hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\beta}}$; residuals: $\mathbf{e} = \mathbf{y} - \hat{\mathbf{y}}$ (see the R sketch below)
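A minimal R sketch of the matrix least-squares solution above, using mtcars with hp and wt as predictors (chosen here only for illustration, not from the slides):

X <- cbind(1, hp = mtcars$hp, wt = mtcars$wt)   # model matrix with an intercept column
y <- mtcars$mpg
beta_hat <- solve(t(X) %*% X, t(X) %*% y)       # beta_hat = (X'X)^{-1} X'y
y_hat    <- X %*% beta_hat                      # predicted response
e        <- y - y_hat                           # residuals
beta_hat
coef(lm(mpg ~ hp + wt, data = mtcars))          # lm() gives the same estimates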
• For a $2^k$ factorial design: $\hat{\beta}_j = \mathrm{Effect}_j / 2$ and $\hat{\beta}_0 = \bar{y}$.
Example #2
2² design with a center point (experimental design):

x1    x2    y
-1    -1    4
 1    -1    3
-1     1    1
 1     1    0
 0     0    4

Model equations: $y_i = b_0 + b_1 x_{i1} + b_2 x_{i2} + e_i$

Design matrix and response vector:

      1  -1  -1          4
      1   1  -1          3
X =   1  -1   1     y =  1
      1   1   1          0
      1   0   0          4
Example #2: Estimation

        5  0  0                    0.20  0     0
X'X =   0  4  0      (X'X)⁻¹ =     0     0.25  0
        0  0  4                    0     0     0.25

                      2.4
b = (X'X)⁻¹ X'y =    -0.5
                     -1.5

Prediction equation: ŷ = 2.4 − 0.5 x1 − 1.5 x2 (reproduced in the R sketch below)
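The same numbers can be reproduced in R directly from the design table above (a short sketch, not part of the original slides):

X <- cbind(1,
           x1 = c(-1,  1, -1, 1, 0),
           x2 = c(-1, -1,  1, 1, 0))    # 2^2 design plus the center point
y <- c(4, 3, 1, 0, 4)
t(X) %*% X                              # X'X = diag(5, 4, 4)
solve(t(X) %*% X)                       # (X'X)^-1 = diag(0.20, 0.25, 0.25)
solve(t(X) %*% X, t(X) %*% y)           # b = (2.4, -0.5, -1.5)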
Example #2: New Model, Same Array

Same 2² design with a center point; the functional form of the fitted model now includes a quadratic term:

$y_i = b_0 + b_1 x_{i1} + b_2 x_{i2} + b_3 x_{i2}^2 + e_i$

x1    x2    y
-1    -1    4
 1    -1    3
-1     1    1
 1     1    0
 0     0    4

Design matrix (now with an x2² column) and response vector:

      1  -1  -1   1          4
      1   1  -1   1          3
X =   1  -1   1   1     y =  1
      1   1   1   1          0
      1   0   0   0          4

Different design matrix for the same experimental array (an R sketch follows).
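A short R sketch (not from the slides) of how the extra x2² column enters the design matrix and the normal equations; the numeric results are left to the reader:

X2 <- cbind(1,
            x1   = c(-1,  1, -1, 1, 0),
            x2   = c(-1, -1,  1, 1, 0),
            x2sq = c( 1,  1,  1, 1, 0))   # x2 squared
y <- c(4, 3, 1, 0, 4)
solve(t(X2) %*% X2, t(X2) %*% y)          # coefficient estimates for the quadratic model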
Hypothesis Testing in Multiple Regression
• $H_0: \beta_1 = \dots = \beta_k = 0$
• $H_1: \beta_j \neq 0$ for at least one $j$
• $SST = SSR + SSE$, where
  $SSR = \hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y} - \dfrac{\left(\sum_{i=1}^{n} y_i\right)^2}{n}$;  $SST = \mathbf{y}'\mathbf{y} - \dfrac{\left(\sum_{i=1}^{n} y_i\right)^2}{n}$
• Test statistic: $F_0 = \dfrac{SSR/k}{SSE/(n-k-1)} = \dfrac{MSR}{MSE}$
• Compare $F_0$ with $F_{crit} = F_{1-\alpha,\,k,\,n-k-1}$ to determine significance (see the R sketch below)
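The overall F test can be computed by hand in R; the sketch below uses mtcars with two predictors (hp and wt, chosen here only for illustration):

fit <- lm(mpg ~ hp + wt, data = mtcars)
n   <- nrow(mtcars); k <- 2
SST <- sum((mtcars$mpg - mean(mtcars$mpg))^2)   # total sum of squares
SSE <- sum(resid(fit)^2)                        # error sum of squares
SSR <- SST - SSE                                # regression sum of squares
F0  <- (SSR / k) / (SSE / (n - k - 1))          # test statistic MSR/MSE
F0
qf(0.95, k, n - k - 1)                          # F_{1-alpha, k, n-k-1} for alpha = 0.05
summary(fit)                                    # reports the same F statistic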
DOE vs On-hand Data
• Check VIFs, R², p-values, etc.
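A minimal sketch of the VIF check in R, computing $VIF_j = 1/(1 - R_j^2)$ by regressing each predictor on the others (mtcars predictors chosen only for illustration):

fit <- lm(mpg ~ hp + wt + disp, data = mtcars)
X   <- model.matrix(fit)[, -1]                  # predictor columns, intercept dropped
vifs <- sapply(colnames(X), function(v) {
  r2 <- summary(lm(X[, v] ~ X[, colnames(X) != v]))$r.squared
  1 / (1 - r2)                                  # variance inflation factor
})
vifs
# car::vif(fit) returns the same values if the car package is available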
Model Selection: Example 1
Model Selection: Example 2
Forward Selection
• Step 1: Model m1 <- fit a null (intercept-only) model
• Step 2: Add variables one at a time ($p$ simple linear regression models)
• Step 3: Pick the variable whose model has the lowest RSS and add it to m1
• Step 4: With the remaining $p-1$ variables, add each to m1 one at a time and keep the one that gives the best (lowest) RSS
• Step 5: Continue until some stopping criterion is satisfied (a base-R sketch follows this list)
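Base R's step() implements this idea with AIC (rather than raw RSS) as the criterion; a minimal sketch on mtcars, with the candidate predictors chosen only for illustration:

null_model <- lm(mpg ~ 1, data = mtcars)        # Step 1: null (intercept-only) model
step(null_model,
     scope = list(lower = ~ 1, upper = ~ hp + wt + disp + qsec),
     direction = "forward")                     # adds one variable per step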
Backward Selection
• Step 1: Start with all the variables in the model
• Step 2: Remove the least significant variable (largest p-value)
• Step 3: Refit with the remaining $p-1$ variables
• Step 4: Continue dropping variables until some stopping criterion is met (e.g., a threshold on p-values); see the sketch below
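A corresponding backward-elimination sketch in base R; step() drops terms by AIC, while drop1() with an F test supports the p-value rule described above (predictors chosen only for illustration):

full_model <- lm(mpg ~ hp + wt + disp + qsec, data = mtcars)
step(full_model, direction = "backward")        # drop the worst term at each step (AIC)
drop1(full_model, test = "F")                   # p-values for dropping each single term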
Other methods
• Mallows' Cp
• AIC
• BIC
• Cross-validation
• Adjusted 𝑅2
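Several of these criteria are available directly in base R; a brief sketch comparing two candidate models (predictors chosen only for illustration):

m_small <- lm(mpg ~ wt, data = mtcars)
m_large <- lm(mpg ~ wt + hp, data = mtcars)
AIC(m_small, m_large)                           # Akaike information criterion
BIC(m_small, m_large)                           # Bayesian information criterion
summary(m_large)$adj.r.squared                  # adjusted R^2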
Linear Regression
Prof. Sayak Roychowdhury
Hat Matrix
• Residuals are given by $r_i = y_i - \hat{y}_i$
• Now $\hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}$
• $\mathbf{H} = \mathbf{X}(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}$ is called the "hat matrix", as it converts $\mathbf{y}$ to $\hat{\mathbf{y}}$ (y-hat)
• $E(\hat{\mathbf{y}}) = \boldsymbol{\mu}$; $\operatorname{var}(\hat{\mathbf{y}}) = \mathbf{H}\sigma^2$
• Residuals can be obtained by $\mathbf{r} = (\mathbf{I} - \mathbf{H})\mathbf{y}$
• $E(\mathbf{r}) = \mathbf{0}$; $\operatorname{Cov}(\mathbf{r}) = (\mathbf{I} - \mathbf{H})\sigma^2$
• The variance of the $i$th residual is $\operatorname{var}(r_i) = (1 - h_{ii})\sigma^2$
• This shows that residuals may have different variances even though the original observations have constant variance $\sigma^2$.
Leverage
• The locations of the points in the X space determine the model properties
• The element $h_{ij}$ of $\mathbf{H}$ may be interpreted as the amount of leverage exerted by $y_j$ on $\hat{y}_i$
• $h_{ii}$ is the $i$th diagonal element, with $0 \le h_{ii} \le 1$
• $h_{ii}$ is called the "leverage", or potential influence, of the $i$th observation
• Observations with high leverage need special attention, as the fit may be overly dependent on them
Leverage
• $\sum_{i=1}^{n} h_{ii} = \operatorname{rank}(\mathbf{H}) = \operatorname{rank}(\mathbf{X}) = M$
• The average size of a diagonal element is $M/n$
• As a rule of thumb, if $h_{ii} > 2M/n$ the $i$th observation is a high-leverage point (see the R sketch below)
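A minimal R sketch of the leverage check, using hatvalues() for the diagonal of H (the model is chosen only for illustration):

fit <- lm(mpg ~ hp + wt, data = mtcars)
h   <- hatvalues(fit)                           # diagonal elements h_ii of the hat matrix
M   <- sum(h)                                   # equals rank(X), the number of coefficients
n   <- length(h)
which(h > 2 * M / n)                            # flag potential high-leverage points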
Residuals
• $r_i = y_i - \hat{y}_i$: ordinary residuals
• Standardized residuals: $d_i = \dfrac{r_i}{\hat{\sigma}}$
• If $d_i$ is not within $-3 \le d_i \le 3$, the point may be an outlier.
• For both $r_i$ and $d_i$, the variances vary depending on the location of the point $x$.
• $s_i = \dfrac{r_i}{\hat{\sigma}\sqrt{1 - h_{ii}}}$ is called the studentized residual
• $V(s_i) = 1$, regardless of the location of the point
• Observations with $|s_i| > 2$ should be scrutinized further
• For large datasets the variance stabilizes, and standardized and studentized residuals differ little (an R sketch follows).
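These diagnostics are available from a fitted model in R; note the naming: rstandard() divides by $\hat{\sigma}\sqrt{1-h_{ii}}$ and so matches the studentized residual $s_i$ above (a short sketch, model chosen only for illustration):

fit <- lm(mpg ~ hp + wt, data = mtcars)
r <- resid(fit)                                 # ordinary residuals r_i
d <- r / summary(fit)$sigma                     # standardized residuals d_i
s <- rstandard(fit)                             # studentized residuals s_i
which(abs(s) > 2)                               # observations to scrutinize further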
Residuals
• Predictive residual (PRESS): $e_{(i)} = y_i - \hat{y}_{(i)}$, where $\hat{y}_{(i)}$ is obtained by omitting the $i$th observation and fitting the model
• PRESS statistic $= \sum_{i=1}^{n} e_{(i)}^2 = \sum_{i=1}^{n} \left(y_i - \hat{y}_{(i)}\right)^2$
• Computed this way, PRESS requires fitting $n$ linear models, one for each observation
• It is possible to calculate PRESS from just one model, using the hat matrix:
  PRESS residual: $e_{(i)} = \dfrac{r_i}{1 - h_{ii}}$
  PRESS statistic $= \sum_{i=1}^{n} e_{(i)}^2 = \sum_{i=1}^{n} \left(\dfrac{r_i}{1 - h_{ii}}\right)^2$
• A high PRESS residual indicates a high-influence point
• A large difference between the ordinary residual and the PRESS residual indicates a point where the model fits the data well, but a model built without that point predicts poorly.
Residuals
• PRESS can be used to compute $R^2_{pred}$:
  $R^2_{pred} = 1 - \dfrac{PRESS}{SST}$ (see the R sketch below)
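A minimal R sketch of the hat-matrix shortcut for PRESS and the resulting $R^2_{pred}$ (model chosen only for illustration):

fit   <- lm(mpg ~ hp + wt, data = mtcars)
r     <- resid(fit)
h     <- hatvalues(fit)
PRESS <- sum((r / (1 - h))^2)                   # sum of squared PRESS residuals
SST   <- sum((mtcars$mpg - mean(mtcars$mpg))^2)
1 - PRESS / SST                                 # predictive R^2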
Cook's Distance to Estimate Actual Influence
• While $h_{ii}$ gives the "potential influence" based only on the location of the point $x$,
• it is useful to consider both the location of the point and the response value, and their effect on $\hat{\boldsymbol{\beta}}$.
• Cook (1977, 1979) proposed a measure of influence based on the location of the point as well as the response value.
• It indicates the extent to which the parameter estimates would change if the $i$th observation were omitted.
• It is given by the standardized difference between $\hat{\boldsymbol{\beta}}_{(i)}$, the estimate obtained omitting the $i$th observation, and $\hat{\boldsymbol{\beta}}$.
Cook’s Distance
• Cook's distance can be obtained easily using $h_{ii}$:
  $D_i = \dfrac{\left(\hat{\boldsymbol{\beta}}_{(i)} - \hat{\boldsymbol{\beta}}\right)^{T} \mathbf{X}^{T}\mathbf{X} \left(\hat{\boldsymbol{\beta}}_{(i)} - \hat{\boldsymbol{\beta}}\right)}{p \, MSE} = \dfrac{s_i^2}{p} \cdot \dfrac{h_{ii}}{1 - h_{ii}}$
• $s_i$ is the $i$th studentized residual; it indicates how well the model fits the $i$th observation
• $p$ is (number of predictors) + 1
• The ratio $\dfrac{h_{ii}}{1 - h_{ii}}$ represents the distance of the vector $\mathbf{x}_i$ from the remaining data
• If $D_i > F_{0.5}(p, n-p)$, the median of the F distribution (roughly 1), the $i$th observation may be considered influential.
• Sometimes $D_i > 1$ is also suggested as a cut-off. (An R sketch follows.)
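Cook's distance is built into R as cooks.distance(); a short sketch applying both cut-offs discussed above (model chosen only for illustration):

fit <- lm(mpg ~ hp + wt, data = mtcars)
D   <- cooks.distance(fit)
p   <- length(coef(fit))                        # (number of predictors) + 1
n   <- nrow(mtcars)
which(D > qf(0.5, p, n - p))                    # median of F(p, n - p) as cut-off
which(D > 1)                                    # simpler D_i > 1 rule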