Module 01: Linear Regression

Linear Regression

Prof. Sayak Roychowdhury


Linear Regression
• The most widely used dependence technique and statistical data model
• Used for both prediction and explanation
• Applications range from research questions about the relation between factor(s) and an outcome, to business forecasting, econometric models, marketing, etc.
• The explanatory capability of the model is one of its most important advantages over more complicated black-box models (such as neural nets)
• Multiple regression and its variants provide a framework for in-depth understanding of the process being investigated
• Extensively used in academic research (knowledge creation) and for managerial insights (impact of potential factors)
Glimpse of Linear Regression
• Minitab: Stat > Regression > Fit Regression Model > select the Response and the Factors > OK

  Data Point   Body Height   Flight Time
  1            1.5           1.91
  2            1.4           1.83
  3            2.7           0.86
  4            1.1           1.72
  5            0.9           1.28
  6            0.8           1.09
  7            2.9           0.79
  8            2.2           1.1
  9            3.3           0.81
  10           1.8           1.67
Glimpse of Linear Regression
lm1 <- lm(mpg ~ hp, data = mtcars)  # fit mpg on horsepower (built-in mtcars data)
summary(lm1)                        # coefficient estimates, t-tests, R-squared
anova(lm1)                          # ANOVA table for the overall F-test
Normal Probability Plot of Residuals
Data layout (n observations, k predictors):

  x1    x2    ...   xk    y
  x11   x12   ...   x1k   y1
  x21   x22   ...   x2k   y2
  ...   ...   ...   ...   ...
  xn1   xn2   ...   xnk   yn

Residual = $y_{obs} - y_{pred}$


Matrix Plot
Correlation
Questions to Ask
• Are any of the predictors $X = \{X_1, \dots, X_p\}$ important in predicting the response $Y$?
  Ans: ANOVA
• Which of the predictors are important?
  Ans: All-subsets or best-subsets regression
• How well does the model fit the data?
  Ans: $R^2$, $R^2_{adjusted}$
• Given a set of predictors, how accurate is the prediction?
  Ans: RSS, cross-validation, etc.
• What is the effect of an individual observation on the model?
  Ans: Leverage, Cook's distance
Assumptions for Linear Regression
• The primary assumptions for linear regression are
1. Linearity of the observed phenomenon
2. Constant variance of error terms
3. Normality of the error term distribution
4. Independence of error terms
• Adherence to the assumptions is tested through graphical methods such as residual plots and the normal probability plot of residuals (see the R sketch below)
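As a minimal sketch of these graphical checks in R, assuming the mtcars model fitted earlier, the standard diagnostic plots can be produced with plot() on the fitted object:

par(mfrow = c(2, 2))                 # 2 x 2 panel of diagnostic plots
lm1 <- lm(mpg ~ hp, data = mtcars)   # example model from the earlier slide
plot(lm1)                            # residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage
par(mfrow = c(1, 1))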
Steps to do Regression
• Step 1. Create a flat file (ready for the software to use when done)
• Step 2. Start with a first-order model (usually)
• Step 3. Fit the current model form.
• Step 4. Perform model diagnostics. If defensible, stop. Otherwise, try a different
form, possibly adding or removing factors. Return to Step 3.
• Step 5. (Sometimes optional) t-test the coefficients and/or make decisions.

• Comment: The process involves a degree of subjectivity and intuition about the
physical system and what model form makes sense and helps to answer the
relevant questions.
Estimation of Coefficients
Model form: $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + \epsilon$, with $\epsilon \sim N(0, \sigma^2)$

$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_{i1} + \hat{\beta}_2 x_{i2} + \dots + \hat{\beta}_k x_{ik}$

$\hat{y}_i$ = estimate of the response $y_i$

$\hat{\beta}_j$ = estimate of the coefficient $\beta_j$, $j = 0, 1, \dots, k$
Estimation of Coefficients (SLR)
• $L = \sum_{i=1}^{n} \epsilon_i^2 = \sum_{i=1}^{n} \left( y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i) \right)^2 = SSE$
• Minimize the above function with respect to the $\beta$'s.
• How do we do this?
• Set $\partial L / \partial \hat{\beta}_j = 0$ for $j = 0, 1$
• $\hat{\beta}_1 = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$
• $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$
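As a quick sketch, these closed-form SLR estimates can be computed directly in R and checked against lm(), using the body height / flight time data from the earlier slide:

x <- c(1.5, 1.4, 2.7, 1.1, 0.9, 0.8, 2.9, 2.2, 3.3, 1.8)            # body height
y <- c(1.91, 1.83, 0.86, 1.72, 1.28, 1.09, 0.79, 1.1, 0.81, 1.67)   # flight time

b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)     # slope
b0 <- mean(y) - b1 * mean(x)                                        # intercept
c(b0 = b0, b1 = b1)
coef(lm(y ~ x))   # same estimates from lm()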
Estimation of Coefficients
• $L = \sum_{i=1}^{n} \epsilon_i^2 = \sum_{i=1}^{n} \left( y_i - \hat{\beta}_0 - \sum_{j=1}^{k} \hat{\beta}_j x_{ij} \right)^2 = SSE$
• Minimize the above function with respect to the $\beta$'s.
• How do we do this?
• Set $\partial L / \partial \hat{\beta}_j = 0$ for $j = 0, 1, \dots, k$
• In matrix form: $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}$
• $SSE = (\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})^{T}(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})$ (minimize w.r.t. $\hat{\boldsymbol{\beta}}$)
• (Find $\hat{\boldsymbol{\beta}}$ for which the derivative of SSE is 0):
  $\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$
  Predicted response: $\hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\beta}}$;  residuals: $\mathbf{e} = \mathbf{y} - \hat{\mathbf{y}}$
• For a $2^k$ factorial design, $\hat{\beta}_j = \mathrm{Effect}_j / 2$ and $\hat{\beta}_0 = \bar{y}$
Example #2
• $2^2$ design with a center point (experimental design)
• Model equations: $y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \epsilon_i$

  x1   x2   y
  -1   -1   4
   1   -1   3
  -1    1   1
   1    1   0
   0    0   4

Design matrix and response vector:
$X = \begin{bmatrix} 1 & -1 & -1 \\ 1 & 1 & -1 \\ 1 & -1 & 1 \\ 1 & 1 & 1 \\ 1 & 0 & 0 \end{bmatrix}$,  $y = \begin{bmatrix} 4 \\ 3 \\ 1 \\ 0 \\ 4 \end{bmatrix}$
Example #2 – Estimation

$X'X = \begin{bmatrix} 5 & 0 & 0 \\ 0 & 4 & 0 \\ 0 & 0 & 4 \end{bmatrix}$,  $(X'X)^{-1} = \begin{bmatrix} 0.20 & 0 & 0 \\ 0 & 0.25 & 0 \\ 0 & 0 & 0.25 \end{bmatrix}$

$b = (X'X)^{-1}X'y = \begin{bmatrix} 2.4 \\ -0.5 \\ -1.5 \end{bmatrix}$

Prediction equation: $\hat{y} = 2.4 - 0.5\,x_1 - 1.5\,x_2$
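The estimates above can be verified in a few lines of R by building the design matrix from the slide:

X <- cbind(1, c(-1, 1, -1, 1, 0), c(-1, -1, 1, 1, 0))  # 2^2 design plus center point
y <- c(4, 3, 1, 0, 4)

XtX  <- t(X) %*% X              # diag(5, 4, 4)
beta <- solve(XtX, t(X) %*% y)  # (X'X)^{-1} X'y
beta                            # 2.4, -0.5, -1.5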
Example #2 – New Model, Same Array
• Same $2^2$ design with a center point
• Functional form of the fit model: $y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i2}^2 + \epsilon_i$

  x1   x2   y
  -1   -1   4
   1   -1   3
  -1    1   1
   1    1   0
   0    0   4

Different design matrix:
$X = \begin{bmatrix} 1 & -1 & -1 & 1 \\ 1 & 1 & -1 & 1 \\ 1 & -1 & 1 & 1 \\ 1 & 1 & 1 & 1 \\ 1 & 0 & 0 & 0 \end{bmatrix}$,  $y = \begin{bmatrix} 4 \\ 3 \\ 1 \\ 0 \\ 4 \end{bmatrix}$
Hypothesis Testing in Multiple Regression
• $H_0: \beta_1 = \dots = \beta_k = 0$
• $H_1: \beta_j \neq 0$ for at least one $j$
• $SST = SSR + SSE$
• $SSR = \hat{\beta}'X'y - \dfrac{\left(\sum_{i=1}^{n} y_i\right)^2}{n}$;  $SST = y'y - \dfrac{\left(\sum_{i=1}^{n} y_i\right)^2}{n}$
• Test statistic: $F_0 = \dfrac{SSR/k}{SSE/(n-k-1)} = \dfrac{MSR}{MSE}$
• Compare $F_0$ with $F_{crit} = F_{1-\alpha,\,k,\,n-k-1}$ to determine significance
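A small sketch of the overall F-test in R, again assuming the mtcars example model; the F statistic and its degrees of freedom come from summary(), and qf() gives the critical value:

lm1   <- lm(mpg ~ hp, data = mtcars)
fstat <- summary(lm1)$fstatistic                              # F0 with numerator/denominator df
Fcrit <- qf(0.95, df1 = fstat["numdf"], df2 = fstat["dendf"])
fstat["value"] > Fcrit                                        # TRUE: reject H0 that all slopes are zero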
DOE vs. On-hand Data
(Comparison table; legend: √ = guaranteed, X = loss of credibility, ? = unclear)
$R^2$
• $R^2$: how much of the variation in the response can be explained by the model
• $SST = y'y - \dfrac{\left(\sum_{i=1}^{n} y_i\right)^2}{n}$ (equivalently $y'y - \frac{1}{n}\,y'Qy$, where $Q$ is the $n \times n$ matrix of ones and $n$ = number of runs)
• $SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = y'y - \hat{\beta}'X'y$
• $y$: response vector
• $R^2 = 1 - \dfrac{SSE}{SST} = \dfrac{SSR}{SST}$
• $R^2_{adj} = 1 - \dfrac{n-1}{n-k-1}\cdot\dfrac{SSE}{SST}$  ($k$ = number of predictors)
• $R^2_{pred} = 1 - \dfrac{PRESS}{SST}$
• $PRESS = \sum_{i=1}^{n} \left( y_i - \hat{y}_{(i)} \right)^2$ (leave-one-out cross-validation; $\hat{y}_{(i)}$ is the prediction for observation $i$ from the model fit without it)
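As a sketch (mtcars model assumed), $R^2$ and $R^2_{adj}$ can be computed from these formulas and checked against summary():

lm1 <- lm(mpg ~ hp, data = mtcars)
y   <- mtcars$mpg
n   <- length(y)
k   <- length(coef(lm1)) - 1            # number of predictors

SST <- sum((y - mean(y))^2)
SSE <- sum(resid(lm1)^2)

R2     <- 1 - SSE / SST
R2_adj <- 1 - ((n - 1) / (n - k - 1)) * (SSE / SST)
c(R2, summary(lm1)$r.squared)           # should agree
c(R2_adj, summary(lm1)$adj.r.squared)   # should agree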
Regression Diagnostics
• Summary statistics: measure goodness of fit. $R^2$ describes the fraction of the variation that is explainable from the data; $R^2 > 30\%$ is acceptable.
• Variance Inflation Factors (VIFs): numbers to assess the severity of multicollinearity in the model. Common rule: VIF < 10.
  $VIF_i = \dfrac{1}{1 - R_i^2}$
  where $R_i^2$ is the $R^2$ value from regressing the $i$th variable on the other covariates.
• Normal plot of residuals: indicates whether hypothesis tests on the estimates can be trusted (points should roughly adhere to the straight line, with no apparent pattern or curvature).
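A sketch of the VIF definition above in R; a multi-predictor mtcars model (mpg ~ hp + wt + disp) is assumed purely for illustration, and the car package's vif() would give the same values:

fit <- lm(mpg ~ hp + wt + disp, data = mtcars)
X   <- model.matrix(fit)[, -1]              # predictor columns (drop intercept)

vifs <- sapply(colnames(X), function(v) {
  r2 <- summary(lm(X[, v] ~ X[, colnames(X) != v]))$r.squared  # R_i^2 on the other covariates
  1 / (1 - r2)
})
vifs                                        # compare against the VIF < 10 rule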
Hypothesis Testing for Individual Coefficients
Wald Test
• $H_0: \beta_j = 0$
• $H_1: \beta_j \neq 0$
• Test statistic: $t_0 = \dfrac{\hat{\beta}_j}{se(\hat{\beta}_j)}$, where $se(\hat{\beta}_j) = \sqrt{\hat{\sigma}^2 C_{jj}}$
• $C_{jj}$ is the diagonal element of $(X'X)^{-1}$ corresponding to element $j$
• If $|t_0| > t_{\alpha/2,\,n-k-1}$, reject the null hypothesis
• CI (95%): $\hat{\beta}_j - t_{\alpha/2,\,n-k-1}\,se(\hat{\beta}_j) \le \beta_j \le \hat{\beta}_j + t_{\alpha/2,\,n-k-1}\,se(\hat{\beta}_j)$
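In R, the per-coefficient t statistics and confidence intervals come straight from summary() and confint(); a sketch with the mtcars example model:

lm1 <- lm(mpg ~ hp, data = mtcars)
summary(lm1)$coefficients   # estimate, std. error, t0, p-value for each coefficient
confint(lm1, level = 0.95)  # beta_hat_j +/- t_{alpha/2, n-k-1} * se(beta_hat_j)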
Confidence Interval on the Mean Response
• Suppose we want the mean response value at a point $x_0 = [1, x_{01}, \dots, x_{0k}]'$
• The mean response at that point is $\mu_{y|x_0} = \beta_0 + \beta_1 x_{01} + \dots + \beta_k x_{0k} = x_0'\beta$
• The estimated mean response is $\hat{y}(x_0) = x_0'\hat{\beta}$
• The variance of $\hat{y}(x_0)$ is $\sigma^2\, x_0'(X'X)^{-1}x_0$
• The $100(1-\alpha)\%$ confidence interval on the mean response at $x_0$ is
  $\hat{y}(x_0) \pm t_{\alpha/2,\,n-k-1}\,\sqrt{\hat{\sigma}^2\, x_0'(X'X)^{-1}x_0}$
Prediction Interval for a New Response
• Suppose we want to predict the actual response at a point $x_0 = [1, x_{01}, \dots, x_{0k}]'$
• The point estimate of a future observation is the same as the estimated mean response: $\hat{y}(x_0) = x_0'\hat{\beta}$
• The $100(1-\alpha)\%$ prediction interval for a new response at $x_0$ is
  $\hat{y}(x_0) \pm t_{\alpha/2,\,n-k-1}\,\sqrt{\hat{\sigma}^2\left(1 + x_0'(X'X)^{-1}x_0\right)}$
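Both intervals are available from predict() on a fitted lm; a sketch evaluating them at a hypothetical point (hp = 150 is an assumed value, not from the slides):

lm1 <- lm(mpg ~ hp, data = mtcars)
x0  <- data.frame(hp = 150)   # assumed new point

predict(lm1, newdata = x0, interval = "confidence", level = 0.95)  # CI on the mean response
predict(lm1, newdata = x0, interval = "prediction", level = 0.95)  # PI for a new observation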
Multicollinearity
• Multicollinearity exists when two or more of the
predictors in a regression model are moderately or highly
correlated with one another.
• Structural Multicollinearity: is a mathematical artifact
caused by creating new predictors from other predictors
— such as, creating the predictor 𝑥 2 from the predictor 𝑥.
• Dataset Multicollinearity: is a result of a poorly designed
experiment, reliance on purely observational data, or the
inability to manipulate the system on which the data are
collected.
(source: https://online.stat.psu.edu/stat462/node/177/)
Multicollinearity
• When predictor variables are correlated:
• The estimated regression coefficient of any one variable
depends on which other predictor variables are included in the
model.
• The precision of the estimated regression coefficients decreases
as more predictor variables are added to the model.
• The marginal contribution of any one predictor variable in
reducing the error sum of squares varies depending on which
other variables are already in the model.
• Hypothesis tests for βk = 0 may yield different conclusions
depending on which predictor variables are in the model. (This
effect is a direct consequence of the three previous effects.)
Regression Flow Chart
(Figure: regression flow chart; includes a step to check VIFs, R², p-values, etc.)
Model Selection: Example 1
Model Selection: Example 2
Forward Selection
• Step 1: Fit a null model m1
• Step 2: Add variables one at a time ($p$ simple linear regression models)
• Step 3: Pick the variable whose model has the lowest RSS and add it to m1
• Step 4: With the remaining $p-1$ variables, add each to m1 one at a time and pick the model that gives the best RSS
• Step 5: Continue until some stopping criterion is satisfied
Backward Selection
• Step 1: Start with all the variables in the model
• Step 2: Remove the variable which is least significant
(largest p-value)
• Step 3: Fit with remaining 𝑝 − 1 variables
• Step 4: Continue dropping variables until some stopping
criterion is met (threshold on p-value)
Other methods
• Mallows' Cp
• AIC
• BIC
• Cross-validation
• Adjusted 𝑅2
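A sketch of both stepwise directions using R's step(), which compares candidate models by AIC; the set of candidate predictors (hp, wt, disp, qsec) is assumed for illustration:

full <- lm(mpg ~ hp + wt + disp + qsec, data = mtcars)
null <- lm(mpg ~ 1, data = mtcars)

step(full, direction = "backward")                        # backward: drop terms from the full model
step(null, scope = formula(full), direction = "forward")  # forward: add terms starting from the null model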
Linear Regression
Prof. Sayak Roychowdhury
Hat Matrix
• Residuals are given by $r_i = y_i - \hat{y}_i$
• Now $\hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$
• $\mathbf{H} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'$ is called the "hat matrix", as it converts $\mathbf{y}$ to $\hat{\mathbf{y}}$ (y-hat)
• $E(\hat{\mathbf{y}}) = \boldsymbol{\mu}$;  $var(\hat{\mathbf{y}}) = \mathbf{H}\sigma^2$
• The residuals can be obtained as $\mathbf{r} = (\mathbf{I} - \mathbf{H})\mathbf{y}$
• $E(\mathbf{r}) = \mathbf{0}$;  $Cov(\mathbf{r}) = (\mathbf{I} - \mathbf{H})\sigma^2$
• The variance of the $i$th residual is $var(r_i) = (1 - h_{ii})\sigma^2$
• This result shows that the residuals may have different variances even though the original observations have constant variance $\sigma^2$.
Leverage
• The location of the points in the X space determines the model properties
• The elements $h_{ij}$ of $\mathbf{H}$ may be interpreted as the amount of leverage exerted by $y_j$ on $\hat{y}_i$.
• $h_{ii}$ is the $i$th diagonal element, with $0 \le h_{ii} \le 1$.
• $h_{ii}$ is called the "leverage" or potential influence of the $i$th observation
• Observations with high leverage need special attention, as the fit may be overly dependent on them
Leverage
• $\sum_{i=1}^{n} h_{ii} = rank(\mathbf{H}) = rank(\mathbf{X}) = M$
• The average size of the diagonal elements is $M/n$
• As a rule of thumb, if $h_{ii} > \dfrac{2M}{n}$ then the $i$th observation is a high-leverage point
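A sketch building the hat matrix explicitly and applying the 2M/n rule of thumb; the two-predictor mtcars model is assumed, and hatvalues() returns the same diagonal:

fit <- lm(mpg ~ hp + wt, data = mtcars)   # assumed example model
X   <- model.matrix(fit)

H <- X %*% solve(t(X) %*% X) %*% t(X)     # hat matrix H = X (X'X)^{-1} X'
h <- diag(H)                              # leverages h_ii
all.equal(unname(h), unname(hatvalues(fit)))

M <- sum(h)                               # equals rank(X) = number of coefficients
which(h > 2 * M / nrow(X))                # candidate high-leverage points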
Residuals
• $r_i = y_i - \hat{y}_i$: ordinary residuals
• Standardized residuals: $d_i = \dfrac{r_i}{\hat{\sigma}}$
• If $d_i$ is not within $-3 \le d_i \le 3$, it may be an outlier.
• For both $r_i$ and $d_i$, the variances vary depending on the location of the point $x$.
• $s_i = \dfrac{r_i}{\hat{\sigma}\sqrt{1 - h_{ii}}}$ is called the studentized residual
• $V(s_i) = 1$, location invariant
• Observations with $|s_i| > 2$ should be scrutinized further
• For large datasets the variance stabilizes, and standardized and studentized residuals show little difference.
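A sketch computing both residual types from these formulas (mtcars model assumed); the studentized form matches R's rstandard(), which divides by sigma_hat * sqrt(1 - h_ii):

fit <- lm(mpg ~ hp + wt, data = mtcars)

r         <- resid(fit)
sigma_hat <- summary(fit)$sigma          # sqrt(MSE)
h         <- hatvalues(fit)

d <- r / sigma_hat                       # standardized residuals d_i
s <- r / (sigma_hat * sqrt(1 - h))       # studentized residuals s_i
all.equal(unname(s), unname(rstandard(fit)))

which(abs(s) > 2)                        # observations to scrutinize further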
Residuals
• Predictive residual (PRESS): $e_{(i)} = y_i - \hat{y}_{(i)}$, where $\hat{y}_{(i)}$ is obtained by omitting the $i$th observation and fitting the model
• PRESS statistic $= \sum_{i=1}^{n} e_{(i)}^2 = \sum_{i=1}^{n} \left( y_i - \hat{y}_{(i)} \right)^2$
• PRESS requires fitting $n$ linear models, one for each observation
• It is possible to calculate PRESS with the help of just one model, using the hat matrix
• PRESS residual: $e_{(i)} = \dfrac{r_i}{1 - h_{ii}}$
• PRESS statistic $= \sum_{i=1}^{n} e_{(i)}^2 = \sum_{i=1}^{n} \left( \dfrac{r_i}{1 - h_{ii}} \right)^2$
• A high PRESS residual indicates a high-influence point
• A large difference between the ordinary residual and the PRESS residual indicates a point where the model fits the data well, but a model built without that point predicts poorly.
Residuals
• PRESS can be used to compute $R^2_{pred}$:
• $R^2_{pred} = 1 - \dfrac{PRESS}{SST}$
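With the hat-matrix shortcut, PRESS and $R^2_{pred}$ are a few lines of R; a sketch with the assumed mtcars model:

fit <- lm(mpg ~ hp + wt, data = mtcars)

r <- resid(fit)
h <- hatvalues(fit)

PRESS <- sum((r / (1 - h))^2)                     # PRESS statistic from one fit
SST   <- sum((mtcars$mpg - mean(mtcars$mpg))^2)
1 - PRESS / SST                                   # R^2_pred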
Cook's Distance to Estimate Actual Influence
• While $h_{ii}$ gives the "potential influence" based on the location of the point $x$,
• it is useful to consider both the location of the point and the response value, and their effect on $\hat{\beta}$
• Cook (1977, 1979) proposed a measure of influence based on the location of the point as well as the response variable
• It indicates the extent to which the parameter estimates would change if the $i$th observation were omitted
• It is given by the standardized difference between $\hat{\beta}_{(i)}$, the estimate obtained by omitting the $i$th observation, and $\hat{\beta}$
Cook's Distance
• Cook's distance can be easily obtained using $h_{ii}$:
  $D_i = \dfrac{\left(\hat{\beta}_{(i)} - \hat{\beta}\right)'\mathbf{X}'\mathbf{X}\left(\hat{\beta}_{(i)} - \hat{\beta}\right)}{p \cdot MSE}$
  $D_i = \dfrac{s_i^2}{p}\cdot\dfrac{h_{ii}}{1 - h_{ii}}$
• $s_i$ is the $i$th studentized residual; it indicates how well the model fits the $i$th observation
• $p$ is (number of predictors) + 1
• The ratio $\dfrac{h_{ii}}{1 - h_{ii}}$ represents the distance of the vector $x_i$ from the remaining data
• If $D_i > F_{0.5}(p, n-p)$, the median of the F distribution, the $i$th observation may be considered an outlier.
• Sometimes $D_i > 1$ is also suggested as a cut-off.
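A sketch comparing R's cooks.distance() with the $h_{ii}$-based formula above and the suggested cut-offs (mtcars model assumed):

fit <- lm(mpg ~ hp + wt, data = mtcars)

h <- hatvalues(fit)
s <- rstandard(fit)                      # studentized residuals
p <- length(coef(fit))                   # (number of predictors) + 1
n <- nrow(mtcars)

D <- (s^2 / p) * (h / (1 - h))
all.equal(unname(D), unname(cooks.distance(fit)))

cutoff <- qf(0.5, p, n - p)              # median of F(p, n - p)
which(D > cutoff)                        # potentially influential observations; D > 1 is another common cut-off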
