Class Material - Multiple Linear Regression
Regression
Multiple Linear Regression
• Multiple regression analysis is a statistical technique that can be used to analyze the relationship between a single dependent variable and several independent (predictor) variables.
• MLR produces a model that identifies the best weighted combination of independent variables to predict the dependent (or criterion) variable.
Design Requirements
$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k$$

$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \beta_4 x_2^2 + \dots + \beta_k x_k$$

An important task in multiple regression is to estimate the beta values ($\beta_1, \beta_2, \beta_3$, etc.).
Functional form of MLR
$$Y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \dots + \beta_k x_{ki} + \varepsilon_i$$

For example, with $k = 3$ explanatory variables and $n = 4$ observations:

$$Y_1 = \beta_0 + \beta_1 x_{11} + \beta_2 x_{21} + \beta_3 x_{31} + \varepsilon_1$$
$$Y_2 = \beta_0 + \beta_1 x_{12} + \beta_2 x_{22} + \beta_3 x_{32} + \varepsilon_2$$
$$Y_3 = \beta_0 + \beta_1 x_{13} + \beta_2 x_{23} + \beta_3 x_{33} + \varepsilon_3$$
$$Y_4 = \beta_0 + \beta_1 x_{14} + \beta_2 x_{24} + \beta_3 x_{34} + \varepsilon_4$$
Functional form of MLR
$$Y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \dots + \beta_k x_{ki} + \varepsilon_i$$

$\beta_0$ is a constant (the intercept).
$\beta_1, \beta_2, \dots, \beta_k$ are called partial regression coefficients corresponding to the explanatory variables.

In matrix form: $Y = X\beta + \varepsilon$
Regression: Matrix Representation
$$\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} =
\begin{pmatrix}
1 & x_{11} & x_{21} & \cdots & x_{k1} \\
1 & x_{12} & x_{22} & \cdots & x_{k2} \\
\vdots & \vdots & \vdots & & \vdots \\
1 & x_{1n} & x_{2n} & \cdots & x_{kn}
\end{pmatrix}
\begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{pmatrix} +
\begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix}$$

$$Y = X\beta + \varepsilon$$
Ordinary Least Squares Estimation for Multiple Linear Regression
The assumptions made in the multiple linear regression model are as follows:
• The regression model is linear in the parameters.
• The explanatory variables, $X_i$, are assumed to be non-stochastic (that is, $X$ is deterministic).
• The conditional expected value of the residuals, $E(\varepsilon_i \mid X_i)$, is zero.
• In time-series data, the residuals are uncorrelated, that is, $\text{Cov}(\varepsilon_i, \varepsilon_j) = 0$ for all $i \neq j$.
• The residuals, $\varepsilon_i$, follow a normal distribution.
• The variance of the residuals is constant for all values of $X_i$ (homoscedasticity).
• There is no high correlation between the independent variables in the model (no multi-collinearity).
Steps in building Multiple Linear Regression model
Pre-process the Data
• Data Quality – check completeness and correctness
• Missing Data – choose a strategy, such as data imputation, and a specific imputation technique
• Handling Qualitative Variables – convert categorical variables into dummy variables
• Derive New Variables – such as ratios and interaction variables (products of variables), which may have a better association with the dependent variable (see the sketch below)
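A minimal pre-processing sketch in pandas; the column names (age, income, city) and the toy data frame are hypothetical placeholders, not a course dataset:

```python
import pandas as pd

# Toy data with a missing value in each numeric column (hypothetical columns)
df = pd.DataFrame({'age': [25, None, 40],
                   'income': [50.0, 60.0, None],
                   'city': ['A', 'B', 'A']})

# Missing data: impute numeric columns with the column median
df['age'] = df['age'].fillna(df['age'].median())
df['income'] = df['income'].fillna(df['income'].median())

# Qualitative variable: convert the categorical column into dummy variables
# (drop_first avoids the dummy-variable trap)
df = pd.get_dummies(df, columns=['city'], drop_first=True)

# Derived variable: an interaction term as the product of two predictors
df['age_x_income'] = df['age'] * df['income']
```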
Regression coefficients
$$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \dots + \beta_k X_{ki} + \varepsilon_i$$

The coefficients are estimated by minimizing the sum of squared errors (SSE); each partial derivative of SSE is set to zero:

$$\frac{\partial SSE}{\partial \beta_0} = -2\sum_{i=1}^{n}\left[Y_i - (\beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \dots + \beta_k X_{ki})\right] = 0$$

$$\frac{\partial SSE}{\partial \beta_1} = -2\sum_{i=1}^{n} X_{1i}\left[Y_i - (\beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \dots + \beta_k X_{ki})\right] = 0$$

$$\vdots$$

$$\frac{\partial SSE}{\partial \beta_k} = -2\sum_{i=1}^{n} X_{ki}\left[Y_i - (\beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \dots + \beta_k X_{ki})\right] = 0$$
The Estimated Coefficients
Setting the derivatives to zero yields the system of equations:

$$-2\sum_{i=1}^{n}\left[Y_i - (\beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \dots + \beta_k X_{ki})\right] = 0$$

$$-2\sum_{i=1}^{n} X_{1i}\left[Y_i - (\beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \dots + \beta_k X_{ki})\right] = 0$$

$$\vdots$$

$$-2\sum_{i=1}^{n} X_{ki}\left[Y_i - (\beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \dots + \beta_k X_{ki})\right] = 0$$
The Estimated Coefficients
Rearranging gives the normal equations:

$$N\beta_0 + \beta_1\sum_{i=1}^{n} X_{1i} + \beta_2\sum_{i=1}^{n} X_{2i} + \dots + \beta_k\sum_{i=1}^{n} X_{ki} = \sum_{i=1}^{n} Y_i$$

$$\beta_0\sum_{i=1}^{n} X_{1i} + \beta_1\sum_{i=1}^{n} X_{1i}^2 + \beta_2\sum_{i=1}^{n} X_{2i}X_{1i} + \dots + \beta_k\sum_{i=1}^{n} X_{ki}X_{1i} = \sum_{i=1}^{n} Y_i X_{1i}$$

$$\vdots$$

$$\beta_0\sum_{i=1}^{n} X_{ki} + \beta_1\sum_{i=1}^{n} X_{1i}X_{ki} + \beta_2\sum_{i=1}^{n} X_{2i}X_{ki} + \dots + \beta_k\sum_{i=1}^{n} X_{ki}^2 = \sum_{i=1}^{n} Y_i X_{ki}$$
The Estimated Coefficients
In matrix form, the normal equations are

$$\begin{pmatrix}
N & \sum X_{1i} & \cdots & \sum X_{ki} \\
\sum X_{1i} & \sum X_{1i}^2 & \cdots & \sum X_{ki}X_{1i} \\
\vdots & \vdots & & \vdots \\
\sum X_{ki} & \sum X_{1i}X_{ki} & \cdots & \sum X_{ki}^2
\end{pmatrix}
\begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{pmatrix} =
\begin{pmatrix} \sum Y_i \\ \sum Y_i X_{1i} \\ \vdots \\ \sum Y_i X_{ki} \end{pmatrix}$$

that is, $X^T X \hat{\beta} = X^T Y$, obtained from $Y = X\hat{\beta}$.
The Estimated Coefficients
$$Y = X\hat{\beta}$$

$$X^T Y = X^T X\hat{\beta}$$

$$\hat{\beta} = \left(X^T X\right)^{-1} X^T Y$$

$$\hat{Y} = X\hat{\beta} = X\left(X^T X\right)^{-1} X^T Y$$

$$\hat{Y} = HY$$
The regression coefficients $\hat{\beta}$ are given by

$$\hat{\beta} = \left(X^T X\right)^{-1} X^T Y$$

The estimated values of the response variable are

$$\hat{Y} = X\hat{\beta} = X\left(X^T X\right)^{-1} X^T Y$$

In the above equation, the predicted value $\hat{Y}_i$ of the dependent variable is a linear function of $Y_i$. The equation can be written as

$$\hat{Y} = HY$$

$H = X\left(X^T X\right)^{-1} X^T$ is called the hat matrix, also known as the influence matrix, since it describes the influence of each observation on the predicted values of the response variable.
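The closed-form estimate and the hat matrix can be verified numerically. A minimal NumPy sketch, assuming synthetic data rather than a course dataset:

```python
import numpy as np

# Synthetic data: n = 20 observations, k = 2 explanatory variables
rng = np.random.default_rng(42)
n, k = 20, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])  # column of 1s first
Y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.3, size=n)

# beta_hat = (X'X)^(-1) X'Y; solving the normal equations is more stable
# than forming the inverse explicitly
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Hat matrix H = X (X'X)^(-1) X' maps observed Y to fitted values
H = X @ np.linalg.inv(X.T @ X) @ X.T
Y_hat = H @ Y
assert np.allclose(Y_hat, X @ beta_hat)  # HY equals X beta_hat
```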
Dataset 9_Sheet
• The table shows the scores in the final examination (F) and the scores in two tests (P1 and P2) for 22 students in a statistics course.
a. Fit each of the following models to the data:
Example
• Root Mean Square Error (RMSE):

$$\text{RMSE} = \sqrt{\frac{1}{K}\sum_{i=1}^{K}\left(Y_i - \hat{Y}_i\right)^2}$$

for the fitted model $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \varepsilon_i$.
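A one-line NumPy illustration of the RMSE formula, on toy arrays:

```python
import numpy as np

y = np.array([3.0, 5.0, 7.5, 9.0])       # observed values (toy)
y_hat = np.array([2.8, 5.4, 7.1, 9.3])   # predicted values (toy)
rmse = np.sqrt(np.mean((y - y_hat) ** 2))
```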
• Partial correlation is the correlation between the response variable Y and the explanatory variable X1 when the influence of X2 is removed from both Y and X1 (in other words, when X2 is kept constant).
Partial Correlation
Let $r_{YX_1,X_2}$ denote the partial correlation between Y and X1 when X2 is kept constant. It is given by

$$r_{YX_1,X_2} = \frac{r_{YX_1} - r_{YX_2}\, r_{X_1X_2}}{\sqrt{\left(1 - r_{YX_2}^2\right)\left(1 - r_{X_1X_2}^2\right)}}$$
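A direct transcription of the formula, with illustrative pairwise correlations:

```python
import numpy as np

def partial_corr(r_yx1, r_yx2, r_x1x2):
    """Partial correlation between Y and X1, holding X2 constant."""
    return (r_yx1 - r_yx2 * r_x1x2) / np.sqrt((1 - r_yx2**2) * (1 - r_x1x2**2))

print(partial_corr(r_yx1=0.6, r_yx2=0.4, r_x1x2=0.5))  # illustrative values
```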
Semi-Partial Correlation (or Part)
• Semi-partial (part) correlation plays an important role in regression model building.
• Semi-partial correlation is obtained by removing the influence of X2 from X1 only (not from Y).
• [Venn diagram of shared-variance regions A–E: region E is removed, and the semi-partial correlation corresponds to A/(A + B + C + D).]
Standardized Regression Coefficient
Standardized regression coefficients are the coefficients obtained when the dependent and independent variables are standardized to zero mean and unit variance; they allow the relative importance of the explanatory variables to be compared directly.
Interaction Variables in Regression Models
• Interaction variables are variables included in the regression model that are products of two independent variables (such as $X_1 X_2$).
• Including interaction variables enables data scientists to check for the existence of a conditional relationship between the dependent variable and the two independent variables (see the sketch below).
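A minimal sketch of fitting an interaction term with the statsmodels formula API, on synthetic data (the variable names X1, X2, Y are placeholders):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
df = pd.DataFrame({'X1': rng.normal(size=50), 'X2': rng.normal(size=50)})
df['Y'] = (1 + 2 * df['X1'] - df['X2']
           + 0.5 * df['X1'] * df['X2'] + rng.normal(size=50))

# 'X1:X2' adds only the product term; 'X1*X2' would add main effects too
model = smf.ols('Y ~ X1 + X2 + X1:X2', data=df).fit()
print(model.params)
```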
Statistical Significance of Individual Variables in MLR – t-test
The null and alternative hypotheses in the case of an individual independent variable $X_i$ and the dependent variable Y are given, respectively, by
• H0: There is no relationship between independent variable $X_i$ and dependent variable Y
• HA: There is a relationship between independent variable $X_i$ and dependent variable Y
Alternatively,
• $H_0: \beta_i = 0$
• $H_A: \beta_i \neq 0$
The corresponding test statistic is given by

$$t = \frac{\hat{\beta}_i - 0}{S_e(\hat{\beta}_i)} = \frac{\hat{\beta}_i}{S_e(\hat{\beta}_i)}$$
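In statsmodels, these t-statistics and p-values are available directly on a fitted model (using the full_model fitted in the Python section later in these slides):

```python
print(full_model.tvalues)  # beta_hat_i / Se(beta_hat_i) for each coefficient
print(full_model.pvalues)  # reject H0: beta_i = 0 when the p-value < 0.05
```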
Residual Analysis in Multiple Linear Regression
Residual analysis is important for checking the assumptions about the normal distribution of residuals, homoscedasticity, and the functional form of the regression model.
Multi-Collinearity and Variance Inflation Factor
• In the presence of multicollinearity, the sign of a regression coefficient may differ from what is expected; that is, instead of a negative regression coefficient we may obtain a positive one, and vice versa.
• To detect it, regress one explanatory variable on the other(s):

$$X_1 = \beta_0 + \beta_1 X_2$$

• Let $R_{12}^2$ be the R-squared value of the above auxiliary regression.
Variance Inflation Factor (VIF)
• The variance inflation factor (VIF) is then given by:

$$\text{VIF} = \frac{1}{1 - R_{12}^2}$$
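A sketch of computing VIF per column with statsmodels, assuming train_X is the pandas DataFrame of explanatory variables used later in these slides:

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

vif = pd.DataFrame({
    'feature': train_X.columns,
    'VIF': [variance_inflation_factor(train_X.values, i)
            for i in range(train_X.shape[1])]
})
print(vif)  # large VIF values (common rules of thumb: > 4 or > 10) signal multicollinearity
```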
The following distance measures are used for diagnosing outliers and influential observations in an MLR model:
• Mahalanobis Distance
• Cook’s Distance
• Leverage Values
Mahalanobis distance:

$$D_M(X_i) = \sqrt{(X_i - \bar{X})^T\, S^{-1}\, (X_i - \bar{X})}$$

where $\bar{X}$ is the vector of sample means and S is the sample covariance matrix.
Cook’s Distance
• A Cook's distance value of more than 1 indicates a highly influential observation.
• Alternatively, the value 4/(N − k − 1), for k predictors and N points, is recommended as a threshold for Cook's distance.
• Any value above this threshold is classified as an influential observation.
The threshold for DFFIT is defined using the standardized DFFIT (SDFFIT). The absolute value of SDFFIT should be less than

$$2\sqrt{(k+1)/N}$$

$$\text{DFBETA}_i(j) = \hat{\beta}_j - \hat{\beta}_j(i)$$

where $\hat{\beta}_j(i)$ is the estimate of $\beta_j$ when observation $i$ is removed.
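All of these diagnostics are exposed by statsmodels on a fitted OLS result (full_model as fitted in the next section):

```python
# Influence diagnostics for a fitted statsmodels OLS result
influence = full_model.get_influence()

cooks_d, _ = influence.cooks_distance  # Cook's distance per observation
leverage = influence.hat_matrix_diag   # leverage values (diagonal of H)
dffits, _ = influence.dffits           # standardized DFFITS per observation
dfbetas = influence.dfbetas            # DFBETA for every coefficient
```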
```python
import statsmodels.api as sm

# Fit the full MLR model on the training data
# (train_X is assumed to already include a constant column)
full_mod = sm.OLS(train_Y, train_X)
full_model = full_mod.fit()
full_model.summary2()
```
MLR Modelling in Python
```python
# Normality of residuals checked using a P-P plot
# line='45': a 45-degree line
# line='s': standardized line - the expected order statistics are scaled by
#           the standard deviation of the sample and have its mean added
# line='r': a regression line is fit
# line='q': a line is fit through the quartiles
# line=None (default): no reference line is added to the plot
import matplotlib.pyplot as plt
import seaborn as sn
%matplotlib inline

full_model_residual = full_model.resid

# ProbPlot() in statsmodels draws the P-P plot
probplot = sm.ProbPlot(full_model_residual)
plt.figure(figsize=(8, 5))
probplot.ppplot(line='r')
plt.show()
```
MLR Modelling in Python
```python
# Normality of residuals can also be checked with scipy's probability plot
from scipy import stats

stats.probplot(full_model.resid, plot=plt)
plt.title("Model1 Residuals Probability Plot")
plt.show()
```
MLR Modelling in Python
```python
# Predict on the test set (test_X_new must include the constant column)
pred_y = full_model.predict(test_X_new)
```
Avoiding Overfitting - Mallows’s Cp
Mallows's Cp (Mallows, 1973) is used to select the best regression model by incorporating the right number of explanatory variables in the model. Mallows's Cp is given by

$$C_p = \frac{SSE_p}{MSE_{full}} - (n - 2p)$$

where
• $SSE_p$ is the sum of squared errors of the model with p parameters (including the constant),
• $MSE_{full}$ is the mean squared error of the model with all variables included,
• n is the number of observations,
• p is the number of parameters in the regression model, including the constant.
Avoiding Overfitting - Mallows’s Cp
The model whose number of parameters p is closest to its Mallows's Cp value is chosen as the best model.
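A direct transcription of the Cp formula, with illustrative inputs:

```python
def mallows_cp(sse_p, mse_full, n, p):
    """Mallows's Cp = SSE_p / MSE_full - (n - 2p)."""
    return sse_p / mse_full - (n - 2 * p)

# Illustrative numbers: a 3-parameter sub-model fitted on n = 22 observations
print(mallows_cp(sse_p=45.0, mse_full=2.1, n=22, p=3))
```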
Problem – Dataset 12
• The accompanying data are on
y = profit margin of savings and loan companies in a given year,
x1 = net revenues in that year, and
x2 = number of savings and loan branch offices.
Questions
a. Determine the multiple regression equation for the data
b. Compute and interpret the coefficient of multiple determination, R².
c. At the 5% significance level, determine if the model is useful for
predicting the response.
d. At the 5% significance level, does it appear that any of the predictor
variables can be removed from the full model as unnecessary?
Problem – Dataset 12 (continued)
e. Obtain and interpret 95% confidence intervals for the slopes, βi, of
the population regression line that relates net revenues and number
of branches to profit margin.
f. Are there any multicollinearity problems (i.e., are net revenues and
number of branches collinear)?
Effects of Multicollinearity
Data A
ID   Y    V1   V2
1    5    6    13
2    3    8    13
3    9    8    11
4    9    10   11
5    13   10   9
6    11   12   9
7    17   12   7
8    15   14   7

Data B
ID   Y     V1    V2
1    3.7   3.2   2.9
2    3.7   3.3   4.2
3    4.2   3.7   4.9
4    4.3   3.3   5.1
5    5.1   4.1   5.5
6    5.2   3.8   6.0
7    5.2   2.8   4.9
8    5.6   2.6   4.3
9    5.6   3.6   5.4
10   6.0   4.1   5.5