Class Material - Multiple Linear Regression

Multiple linear regression allows modeling of the relationship between a single dependent variable and multiple independent variables. It produces a model that identifies the best weighted combination of independent variables to predict the dependent variable. The objective is to use known independent variable values to predict the single dependent variable value. Ordinary least squares estimation is commonly used to estimate the coefficients in the model. Various diagnostics should be performed to validate the model and check assumptions.

Multiple Linear Regression
Multiple Linear Regression
• Multiple regression analysis is a statistical technique used to analyze the relationship between a single dependent variable and several independent (predictor) variables.
• MLR produces a model that identifies the best weighted combination of independent variables to predict the dependent (or criterion) variable.
Design Requirements
• One dependent variable (criterion)
• Two or more independent variables (predictor or explanatory variables)
• Sample size: at least 50 observations (and at least 10 times as many cases as independent variables)
Multiple Linear Regression
• The objective of multiple regression analysis is to use the independent variables whose values are known to predict the single dependent variable value.
• The following are examples of multiple linear regression:

Y = β₀ + β₁x₁ + β₂x₂ + ... + βₖxₖ + ε
Y = β₀ + β₁x₁ + β₂x₂ + β₃x₁x₂ + β₄x₂² + ... + βₖxₖ

An important task in multiple regression is to estimate the beta values (β₁, β₂, β₃, etc.).
Functional form of MLR
Yᵢ = β₀ + β₁x₁ᵢ + β₂x₂ᵢ + ... + βₖxₖᵢ + εᵢ

For example, with k = 3 explanatory variables and n = 4 observations:
Y₁ = β₀ + β₁x₁₁ + β₂x₂₁ + β₃x₃₁ + ε₁
Y₂ = β₀ + β₁x₁₂ + β₂x₂₂ + β₃x₃₂ + ε₂
Y₃ = β₀ + β₁x₁₃ + β₂x₂₃ + β₃x₃₃ + ε₃
Y₄ = β₀ + β₁x₁₄ + β₂x₂₄ + β₃x₃₄ + ε₄
Functional form of MLR
Yᵢ = β₀ + β₁x₁ᵢ + β₂x₂ᵢ + ... + βₖxₖᵢ + εᵢ

β₀ is a constant; β₁, β₂, ..., βₖ are called partial regression coefficients corresponding to the explanatory variables.

The functional form is called a response surface (hyperplane):

Y = Xβ + ε
Regression: Matrix Representation

⎡ y₁ ⎤   ⎡ 1  x₁₁  x₂₁  ⋯  xₖ₁ ⎤ ⎡ β₀ ⎤   ⎡ ε₁ ⎤
⎢ y₂ ⎥ = ⎢ 1  x₁₂  x₂₂  ⋯  xₖ₂ ⎥ ⎢ β₁ ⎥ + ⎢ ε₂ ⎥
⎢ ⋮  ⎥   ⎢ ⋮    ⋮    ⋮       ⋮  ⎥ ⎢ ⋮  ⎥   ⎢ ⋮  ⎥
⎣ yₙ ⎦   ⎣ 1  x₁ₙ  x₂ₙ  ⋯  xₖₙ ⎦ ⎣ βₖ ⎦   ⎣ εₙ ⎦

Y = Xβ + ε
Ordinary Least Squares Estimation for Multiple Linear Regression
The assumptions made in the multiple linear regression model are as follows:
• The regression model is linear in its parameters.
• The explanatory variable, Xᵢ, is assumed to be non-stochastic (that is, X is deterministic).
• The conditional expected value of the residuals, E(εᵢ|Xᵢ), is zero.
• In time-series data, residuals are uncorrelated, that is, Cov(εᵢ, εⱼ) = 0 for all i ≠ j.
• The residuals, εᵢ, follow a normal distribution.
• The variance of the residuals is constant for all values of Xᵢ (homoscedasticity).
• There is no high correlation between independent variables in the model (no multi-collinearity).
Steps in building a Multiple Linear Regression model
Pre-process the data
• Data quality – check completeness and correctness.
• Missing data – choose a strategy such as data imputation and specific techniques for imputation.
• Handling qualitative variables – convert categorical variables into dummy variables.
• Derive new variables – such as ratios and interaction variables (products of variables), which may have a stronger association with the dependent variable.
Regression coefficients
Yᵢ = β₀ + β₁X₁ᵢ + β₂X₂ᵢ + ... + βₖXₖᵢ + εᵢ

Minimizing the sum of squared errors (SSE) with respect to each coefficient and setting the partial derivatives to zero gives:

∂SSE/∂β₀ = -2 Σᵢ₌₁ⁿ [Yᵢ - (β₀ + β₁X₁ᵢ + β₂X₂ᵢ + ... + βₖXₖᵢ)] = 0
∂SSE/∂β₁ = -2 Σᵢ₌₁ⁿ X₁ᵢ[Yᵢ - (β₀ + β₁X₁ᵢ + β₂X₂ᵢ + ... + βₖXₖᵢ)] = 0
⋮
∂SSE/∂βₖ = -2 Σᵢ₌₁ⁿ Xₖᵢ[Yᵢ - (β₀ + β₁X₁ᵢ + β₂X₂ᵢ + ... + βₖXₖᵢ)] = 0
The Estimated Coefficients
-2 Σᵢ₌₁ⁿ [Yᵢ - (β₀ + β₁X₁ᵢ + β₂X₂ᵢ + ... + βₖXₖᵢ)] = 0
-2 Σᵢ₌₁ⁿ X₁ᵢ[Yᵢ - (β₀ + β₁X₁ᵢ + β₂X₂ᵢ + ... + βₖXₖᵢ)] = 0
⋮
-2 Σᵢ₌₁ⁿ Xₖᵢ[Yᵢ - (β₀ + β₁X₁ᵢ + β₂X₂ᵢ + ... + βₖXₖᵢ)] = 0
The Estimated Coefficients
Rearranging gives the normal equations (all sums run from i = 1 to n):

nβ₀ + β₁ΣX₁ᵢ + β₂ΣX₂ᵢ + ... + βₖΣXₖᵢ = ΣYᵢ
β₀ΣX₁ᵢ + β₁ΣX₁ᵢX₁ᵢ + β₂ΣX₂ᵢX₁ᵢ + ... + βₖΣXₖᵢX₁ᵢ = ΣYᵢX₁ᵢ
⋮
β₀ΣXₖᵢ + β₁ΣX₁ᵢXₖᵢ + β₂ΣX₂ᵢXₖᵢ + ... + βₖΣXₖᵢXₖᵢ = ΣYᵢXₖᵢ
The Estimated Coefficients
In matrix form, the normal equations are

⎡ n     ΣX₁ᵢ      ⋯  ΣXₖᵢ    ⎤ ⎡ β₀ ⎤   ⎡ ΣYᵢ    ⎤
⎢ ΣX₁ᵢ  ΣX₁ᵢX₁ᵢ   ⋯  ΣXₖᵢX₁ᵢ ⎥ ⎢ β₁ ⎥ = ⎢ ΣYᵢX₁ᵢ ⎥
⎢ ⋮      ⋮             ⋮     ⎥ ⎢ ⋮  ⎥   ⎢ ⋮      ⎥
⎣ ΣXₖᵢ  ΣX₁ᵢXₖᵢ   ⋯  ΣXₖᵢXₖᵢ ⎦ ⎣ βₖ ⎦   ⎣ ΣYᵢXₖᵢ ⎦

Y = Xβ̂
The Estimated Coefficients
Y = Xβ̂
XᵀY = XᵀXβ̂
β̂ = (XᵀX)⁻¹XᵀY
Ŷ = Xβ̂
Ŷ = X(XᵀX)⁻¹XᵀY
Ŷ = [X(XᵀX)⁻¹Xᵀ]Y
Ŷ = HY

The regression coefficients β̂ are given by

β̂ = (XᵀX)⁻¹XᵀY

The estimated values of the response variable are

Ŷ = Xβ̂ = X(XᵀX)⁻¹XᵀY

In the above equation, each predicted value of the dependent variable, Ŷᵢ, is a linear function of the observed values Yᵢ. The equation can be written as follows:

Ŷ = HY

H = X(XᵀX)⁻¹Xᵀ is called the hat matrix, also known as the influence matrix, since it describes the influence of each observation on the predicted values of the response variable.

The hat matrix plays a crucial role in identifying the outliers and influential observations in the sample.
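As a minimal illustration on made-up data, the closed-form estimate, the hat matrix, and the leverage values can be computed directly with NumPy. This is a sketch only; in practice the statsmodels workflow shown later in these slides is preferred, and np.linalg.lstsq is numerically safer than explicitly inverting XᵀX.

import numpy as np

# Small made-up sample: n = 5 observations, k = 2 explanatory variables
X = np.array([[1, 2.0, 3.0],
              [1, 4.0, 1.0],
              [1, 5.0, 6.0],
              [1, 7.0, 2.0],
              [1, 8.0, 9.0]])   # first column of ones for the constant term
Y = np.array([6.0, 8.0, 15.0, 14.0, 24.0])

beta_hat = np.linalg.inv(X.T @ X) @ X.T @ Y   # beta_hat = (X'X)^-1 X'Y
H = X @ np.linalg.inv(X.T @ X) @ X.T          # hat matrix H = X (X'X)^-1 X'
Y_hat = H @ Y                                 # fitted values Y_hat = H Y
leverage = np.diag(H)                         # diagonal entries h_ii are the leverage values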
Perform regression model diagnostics
• Validating an MLR model involves checking for multi-collinearity, heteroscedasticity, and auto-correlation, and performing outlier analysis.
• The F-test is for the overall significance of the model.
• The t-test is for the significance of the individual variables.
• The presence of multi-collinearity can be checked through the Variance Inflation Factor (VIF).
Dataset 9_Sheet
• The table shows the scores in the final examination (F) and the scores in two tests (P1 and P2) for 22 students in a statistics course.
a. Fit each of the following models to the data:

b. Test whether β₀ = 0 in each of the three models.
c. Which variable individually, P1 or P2, is a better predictor of F?
Example

The MLR model is given by

Final Exam marks = β₀ + β₁·P1 + β₂·P2
Regression Models with Qualitative Variables

• In MLR, many predictor variables are likely to be qualitative or categorical variables.
• Since the scale of a categorical variable is not ratio or interval, we cannot include it directly in the model; including it directly will result in model misspecification.
• Hence, pre-process the categorical variables into dummy variables before building a regression model, as in the sketch below.
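A minimal sketch of dummy-variable encoding with pandas; the column names and data are made up for illustration. drop_first=True keeps m - 1 dummies for an m-level variable, which avoids the dummy-variable trap (perfect multi-collinearity with the constant term).

import pandas as pd

# Hypothetical data with one categorical predictor
df = pd.DataFrame({'City_Tier': ['Tier1', 'Tier2', 'Tier3', 'Tier1'],
                   'Revenue':   [120.0, 80.0, 55.0, 130.0]})

# Convert the categorical column into 0/1 dummy variables
df_encoded = pd.get_dummies(df, columns=['City_Tier'], drop_first=True)
print(df_encoded)   # columns: Revenue, City_Tier_Tier2, City_Tier_Tier3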
Validate the model using validation data
• Final model selection is based on the following performance measures:
• R-squared or adjusted R-squared

• Mean Absolute Percentage Error: MAPE = (1/K) Σᵢ₌₁ᴷ |Yᵢ - Ŷᵢ| / Yᵢ × 100

• Root Mean Square Error: RMSE = √[(1/K) Σᵢ₌₁ᴷ (Yᵢ - Ŷᵢ)²]

where K is the number of cases in the validation data set.

Models that provide consistent performance in both training and validation data will be chosen. A sketch of these measures in code follows.
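A minimal sketch of the two error measures with NumPy; the array names and values are illustrative.

import numpy as np

def mape(y_true, y_pred):
    # Mean Absolute Percentage Error: mean of |Y - Y_hat| / Y, in percent
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    return np.mean(np.abs(y_true - y_pred) / y_true) * 100

def rmse(y_true, y_pred):
    # Root Mean Square Error: square root of the mean squared residual
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# Example on made-up validation data
print(mape([100, 200, 300], [110, 190, 310]))   # ~6.11
print(rmse([100, 200, 300], [110, 190, 310]))   # 10.0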
Semi-Partial Correlation and Partial Correlation

• The increase in the coefficient of determination, R², when a new variable is added is given by the square of the semi-partial (part) correlation of the newly added variable with the dependent variable Y.
• Consider a regression model with two independent variables (say X₁ and X₂). The model can be written as follows:

Y = β₀ + β₁X₁ + β₂X₂ + ε

• Partial correlation is the correlation between the response variable Y and the explanatory variable X₁ when the influence of X₂ is removed from both Y and X₁ (in other words, when X₂ is kept constant).
Partial Correlation
Let r_YX₁,X₂ denote the partial correlation between Y and X₁ when X₂ is kept constant. It is given by

r_YX₁,X₂ = (r_YX₁ - r_YX₂ · r_X₁X₂) / √[(1 - r²_YX₂)(1 - r²_X₁X₂)]

where
r_YX₁ is the correlation coefficient between Y and X₁,
r_YX₂ is the correlation coefficient between Y and X₂,
r_X₁X₂ is the correlation coefficient between X₁ and X₂.
Semi-Partial Correlation (or Part Correlation)
• Consider a regression model between a response variable Y and two independent variables X₁ and X₂. The semi-partial (or part) correlation between a response variable Y and independent variable X₁ measures the relationship between Y and X₁ when the influence of X₂ is removed from only X₁ but not from Y:

sr_YX₁,X₂ = (r_YX₁ - r_YX₂ · r_X₁X₂) / √(1 - r²_X₁X₂)

where
r_YX₁ is the correlation coefficient between Y and X₁,
r_YX₂ is the correlation coefficient between Y and X₂,
r_X₁X₂ is the correlation coefficient between X₁ and X₂.
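A minimal sketch computing both quantities from pairwise Pearson correlations, following the two formulas above; the data are made up for illustration.

import numpy as np

def partial_and_semipartial(y, x1, x2):
    # Pairwise Pearson correlations
    r_yx1 = np.corrcoef(y, x1)[0, 1]
    r_yx2 = np.corrcoef(y, x2)[0, 1]
    r_x1x2 = np.corrcoef(x1, x2)[0, 1]
    # Partial correlation: influence of X2 removed from both Y and X1
    partial = (r_yx1 - r_yx2 * r_x1x2) / np.sqrt((1 - r_yx2**2) * (1 - r_x1x2**2))
    # Semi-partial (part) correlation: influence of X2 removed from X1 only
    semipartial = (r_yx1 - r_yx2 * r_x1x2) / np.sqrt(1 - r_x1x2**2)
    return partial, semipartial

y  = np.array([10.0, 12.0, 15.0, 18.0, 22.0, 25.0])
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
print(partial_and_semipartial(y, x1, x2))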
Semi-Partial Correlation and Partial Correlation
(The regions A–E below refer to a Venn diagram of the shared variance of Y, X₁, and X₂ that accompanied this slide.)
• Partial correlation – removing the influence of X₂ from both Y and X₁.
• Remove B, C and E.
• The influence of X₁ on Y is given by A, i.e., the partial correlation is A/(A+D).

• Semi-partial correlation – removing the influence of X₂ from X₁ only.
• Remove E.
• The semi-partial correlation is A/(A+B+C+D).
Semi-Partial Correlation (or Part Correlation)
• Semi-partial (part) correlation plays an important role in regression model building.
• The increase in R-square (coefficient of determination) when a new variable is added into the model is given by the square of the semi-partial correlation.
Data set 10_MLR Rating Revenue
The data set provides the cumulative television rating points for various programmes (TV rating), the money spent on promotion (Exp_on_Promotion), and the revenue generated (in Indian rupees) over a one-month period.
1. Develop a multiple regression model to understand the relationship between the revenue generated as the response variable and expenses on promotion and rating as predictors.
2. Determine the part and partial correlations.
Standardized Regression Coefficients

• A regression model can be built on a standardized dependent variable and standardized independent variables; the resulting regression coefficients are then known as standardized regression coefficients.
• The standardized regression coefficient can also be calculated using the following formula:

Standardized Beta = β̂ᵢ · (S_Xᵢ / S_Y)

• where S_Xᵢ is the standard deviation of the explanatory variable Xᵢ and S_Y is the standard deviation of the response variable Y.
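A minimal sketch of this rescaling formula applied to an unstandardized coefficient; the variable names and the slope value 2.5 are illustrative.

import numpy as np

def standardized_beta(beta_hat, x, y):
    # Standardized Beta = beta_hat * (S_X / S_Y)
    return beta_hat * (np.std(x, ddof=1) / np.std(y, ddof=1))

# Made-up example: unstandardized slope 2.5 for predictor x
x = np.array([10.0, 12.0, 9.0, 15.0, 11.0])
y = np.array([55.0, 60.0, 50.0, 70.0, 58.0])
print(standardized_beta(2.5, x, y))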
Standardized regression coefficients
• Standardized regression coefficient for Rating = 0.732
• Standardized regression coefficient for Expenses on Promotion = 0.736
• When rating changes by one standard deviation, the dependent variable Y will change by 0.732 standard deviations.
• When expenses on promotion changes by one standard deviation, the dependent variable Y will change by 0.736 standard deviations.
Interaction Variables in Regression Models
• Interaction variables are variables included in the regression model that are a product of two independent variables (such as X₁X₂).
• Usually the interaction is between a continuous variable and a categorical variable.
• The inclusion of interaction variables enables data scientists to check for the existence of a conditional relationship between the dependent variable and two independent variables, as in the sketch below.
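A minimal sketch of adding an interaction column by hand in pandas; the column names (including the 0/1 indicator Prime_Time) and the data are hypothetical. The statsmodels formula syntax, e.g. smf.ols('Revenue ~ Exp_on_Promotion * Prime_Time', data=df), achieves the same thing.

import pandas as pd

# Hypothetical data: a continuous predictor and a 0/1 categorical predictor
df = pd.DataFrame({'Exp_on_Promotion': [10.0, 20.0, 15.0, 30.0],
                   'Prime_Time':       [0, 1, 0, 1],
                   'Revenue':          [100.0, 260.0, 150.0, 380.0]})

# Interaction variable: product of the two predictors
df['Promo_x_PrimeTime'] = df['Exp_on_Promotion'] * df['Prime_Time']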
Statistical Significance of Individual Variables in MLR – t-test

• Checking the statistical significance of individual variables is achieved through a t-test.
• Note that the estimate of the regression coefficients is given by the following equation:

β̂ = (XᵀX)⁻¹XᵀY

• Therefore the estimated value of a regression coefficient is a linear function of the response variable.
• Since we assume that the residuals follow a normal distribution, Y follows a normal distribution and the estimate of each regression coefficient also follows a normal distribution.
• Since the standard deviation of the regression coefficient is estimated from the sample, we use a t-test.
Statistical Significance of Individual Variables in MLR – t-test

The null and alternative hypotheses for an individual independent variable Xᵢ and the dependent variable Y are given, respectively, by
• H₀: There is no relationship between independent variable Xᵢ and dependent variable Y
• Hₐ: There is a relationship between independent variable Xᵢ and dependent variable Y

Alternatively,
• H₀: βᵢ = 0
• Hₐ: βᵢ ≠ 0

The corresponding test statistic is given by

t = (β̂ᵢ - 0) / Se(β̂ᵢ) = β̂ᵢ / Se(β̂ᵢ)
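A minimal sketch of this statistic computed by hand from a fitted statsmodels OLS result; the data are made up, and the values match the tvalues column of the model summary.

import pandas as pd
import statsmodels.api as sm

# Made-up data for illustration
df = pd.DataFrame({'X1': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
                   'Y':  [2.1, 4.3, 5.9, 8.2, 9.8, 12.1]})
model = sm.OLS(df['Y'], sm.add_constant(df[['X1']])).fit()

# t = beta_hat / Se(beta_hat); matches model.tvalues reported in the summary
t_by_hand = model.params / model.bse
print(t_by_hand)
print(model.tvalues)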
Residual Analysis in Multiple Linear Regression
Residual analysis is important for checking the assumptions of normally distributed residuals and homoscedasticity, and for checking the functional form of a regression model.
Multi-Collinearity and Variance Inflation Factor

• When many independent variables are included, these variables may be highly correlated.
• The existence of high correlation between independent variables is called multi-collinearity.
• Multi-collinearity may destabilize the MLR model.
• Multi-collinearity can have the following impact on the model:
• The standard error of the estimate of a regression coefficient may be inflated, which may lead to retaining the null hypothesis in the t-test, i.e., rejecting a statistically significant explanatory variable.
• The t-statistic value is

t = β̂ᵢ / Se(β̂ᵢ)

• If Se(β̂ᵢ) is inflated, then the t-value will be underestimated, resulting in a high p-value that may lead to failing to reject the null hypothesis.
Impact of Multicollinearity
• Thus, it is possible that a statistically significant explanatory variable may be labelled as statistically insignificant due to the presence of multi-collinearity.

• The sign of a regression coefficient may be flipped, that is, instead of a negative value for the regression coefficient, we may have a positive regression coefficient, and vice versa.

• Adding/removing a variable or even an observation may result in large variation in the regression coefficient estimates.
Variance Inflation Factor (VIF)
• The variance inflation factor (VIF) measures the magnitude of multi-collinearity.
• Let us consider a regression model with two explanatory variables defined as follows:

Y = β₀ + β₁X₁ + β₂X₂

• To find whether there is multi-collinearity, we develop a regression model between the two explanatory variables as follows:

X₁ = α₀ + α₁X₂

• Let R²₁₂ be the R-square value for the above model.
Variance Inflation Factor (VIF)
• The variance inflation factor (VIF) is then given by:

VIF = 1 / (1 - R²₁₂)

• The value 1 - R²₁₂ is called the tolerance.
• √VIF is the value by which the standard error estimate is inflated in the presence of multi-collinearity, and hence the value by which the t-statistic is deflated.
• So, the actual t-value is given by

t_actual = [β̂₁ / Se(β̂₁)] × √VIF
Variance Inflation Factor (VIF)
• The threshold value for VIF is 4 (some suggest 10).
• A VIF value greater than 4 needs further investigation to assess the impact of multi-collinearity, as in the sketch below.
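A minimal sketch using statsmodels' variance_inflation_factor on made-up data; the function regresses each column of the design matrix on the remaining columns and returns 1/(1 - R²) for it.

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Made-up predictors; X2 is deliberately close to a multiple of X1
X = pd.DataFrame({'X1': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
                  'X2': [2.1, 3.9, 6.2, 8.1, 9.8, 12.2],
                  'X3': [5.0, 3.0, 6.0, 2.0, 7.0, 4.0]})
X_const = sm.add_constant(X)

# VIF for each explanatory variable (skip the constant)
for i, col in enumerate(X_const.columns):
    if col != 'const':
        print(col, variance_inflation_factor(X_const.values, i))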
DurbinWatson Test for Auto-Correlation
• DurbinWatson is a hypothesis test to check the existence of auto-correlation
(Durbin and Watson, 1950).
• Let  be the correlation between error terms (t, t1).
• The null and alternative hypotheses are stated below:
H0:   0
H1:   0
• The DurbinWatson statistic, D, for correlation between errors of one lag is
given by
 
 ei  ei 1 
n 2 n
  i i 1 
e e
D i 2
 21  i  2n 
n  2 
 i  i 
2
e  e
i 1  i 1 
• The value of D will lie between 0 and 4.
• The Durbin–Watson test has two critical values, D_L and D_U.
• The inference of the test can be made based on the following conditions (see the sketch after this list):
• If D < D_L, then the errors are positively correlated.
• If D > D_U, then there is no evidence for positive auto-correlation.
• If D_L < D < D_U, the Durbin–Watson test is inconclusive.
• If (4 - D) < D_L, then the errors are negatively correlated.
• If (4 - D) > D_U, there is no evidence for negative auto-correlation.
• If D_L < (4 - D) < D_U, the test is inconclusive.
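A minimal sketch computing the statistic with statsmodels; the residual series here is made up, and in practice the residuals would come from a fitted model such as the full_model built in the Python slides below.

import numpy as np
from statsmodels.stats.stattools import durbin_watson

# Made-up residual series; values of D near 2 indicate no first-order auto-correlation
residuals = np.array([0.5, -0.3, 0.8, -0.6, 0.2, -0.4, 0.7, -0.5])
print(durbin_watson(residuals))

# On a fitted statsmodels OLS result:
# print(durbin_watson(full_model.resid))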
Distance Measures and Outlier Diagnostics

The following distance measures are used for diagnosing outliers and influential observations in an MLR model:
• Mahalanobis Distance
• Cook's Distance
• Leverage Values
• DFFIT and DFBETA Values
Mahalanobis Distance

• The Mahalanobis distance should be less than the chi-squared critical value with degrees of freedom equal to the number of independent variables.
• Alternatively, a thumb rule of 10 can be used: any observation with an MD value of more than 10 is treated as an influential observation.

D²_M(Xᵢ) = (Xᵢ - μ)ᵀ S⁻¹ (Xᵢ - μ)

where μ is the mean vector of the explanatory variables and S is their covariance matrix.
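A minimal sketch of the squared distance computed directly with NumPy on made-up data; μ and S are estimated from the sample itself.

import numpy as np

# Made-up sample: rows are observations, columns are explanatory variables
X = np.array([[2.0, 3.0], [4.0, 1.0], [5.0, 6.0], [7.0, 2.0], [8.0, 9.0], [30.0, 2.0]])

mu = X.mean(axis=0)                    # mean vector of the explanatory variables
S_inv = np.linalg.inv(np.cov(X.T))     # inverse of the sample covariance matrix

# Squared Mahalanobis distance for each observation
d = X - mu
md_sq = np.einsum('ij,jk,ik->i', d, S_inv, d)
print(md_sq)   # the last row (an outlier) gets a large value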
Cook's Distance
• A Cook's distance value of more than 1 indicates a highly influential observation.
• Alternatively, 4/(N - k - 1) for k predictors and N observations is recommended as a threshold for Cook's distance; any value above this is classified as an influential observation.

Leverage Value or Hat Value
• A leverage value of more than 2(k+1)/N or 3(k+1)/N indicates a highly influential observation.
DFFIT and SDFFIT
• DFFIT measures the difference in the fitted value of an observation when that particular observation is removed from the model building:

DFFITᵢ = ŷᵢ - ŷᵢ₍ᵢ₎

• The standardized DFFIT is given by

SDFFITᵢ = (ŷᵢ - ŷᵢ₍ᵢ₎) / (S_e(i) √hᵢ)

where ŷᵢ₍ᵢ₎ is the fitted value of observation i from the model built without observation i, S_e(i) is the residual standard error of that model, and hᵢ is the leverage of observation i.

• The threshold for DFFIT is defined using the standardized DFFIT (SDFFIT). The absolute value of SDFFIT should be less than 2√[(k + 1)/N].
DFBETA and SDFBETA
• DFBETA measures the change in a regression coefficient when an observation "i" is excluded from the model building. DFBETA is given by

DFBETAᵢ(j) = β̂ⱼ - β̂ⱼ₍ᵢ₎

• The standardized DFBETA value (SDFBETA) for observation i is given by

SDFBETAᵢ(j) = (β̂ⱼ - β̂ⱼ₍ᵢ₎) / Se(β̂ⱼ₍ᵢ₎)

• The threshold value of DFBETA is defined using SDFBETA. The absolute value of SDFBETA should be less than 2/√N. A sketch of these influence measures in statsmodels follows.
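A minimal sketch pulling these diagnostics from a fitted statsmodels OLS result via get_influence(); the model and data are made up, with a deliberately influential last observation.

import pandas as pd
import statsmodels.api as sm

# Made-up data and a fitted OLS model
df = pd.DataFrame({'X1': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
                   'Y':  [2.0, 4.1, 6.2, 7.9, 10.1, 12.0, 13.8, 30.0]})
model = sm.OLS(df['Y'], sm.add_constant(df[['X1']])).fit()

influence = model.get_influence()
cooks_d, _ = influence.cooks_distance        # Cook's distance per observation
leverage = influence.hat_matrix_diag         # leverage (hat) values h_i
dffits, dffits_threshold = influence.dffits  # standardized DFFITS and its threshold
dfbetas = influence.dfbetas                  # standardized DFBETA per coefficient

print(cooks_d, leverage, dffits, dfbetas, sep='\n')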
MLR Modelling in Python – TV Rating Model
import statsmodels.api as sm
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sn

data2 = pd.read_csv('TVRating1.csv')
columns = ['TV_Rating', 'Exp_on_Promotion']
X = data2[columns]
Y = data2['Revenue']

from sklearn.model_selection import train_test_split
train_X, test_X, train_Y, test_Y = train_test_split(X, Y, train_size=0.8, random_state=67)
MLR Modelling in Python
With intercept:
train_X_new = sm.add_constant(train_X)
test_X_new = sm.add_constant(test_X)
full_mod = sm.OLS(train_Y, train_X_new)
full_model = full_mod.fit()
full_model.summary2()

Without intercept:
full_mod = sm.OLS(train_Y, train_X)
full_model = full_mod.fit()
full_model.summary2()
MLR Modelling in Python
# Normality of residuals checked using a P-P plot
# line='45': a 45-degree line is drawn
# line='s': standardized line; the expected order statistics are scaled by the
#           standard deviation of the given sample and have the mean added to them
# line='r': a regression line is fit
# line='q': a line is fit through the quartiles
# line=None (default): no reference line is added to the plot
import matplotlib.pyplot as plt
import seaborn as sn
%matplotlib inline

full_model_residual = full_model.resid
# The ProbPlot() class in statsmodels draws the P-P plot
probplot = sm.ProbPlot(full_model_residual)
plt.figure(figsize=(8, 5))
probplot.ppplot(line='r')
plt.show()
MLR Modelling in Python
# Normality of residuals checked using a probability plot (scipy)
from scipy import stats
stats.probplot(full_model.resid, plot=plt)
plt.title("Model1 Residuals Probability Plot")

# Normality of residuals checked using the KS test (Kolmogorov-Smirnov test)
stats.kstest(full_model.resid, 'norm')
MLR Modelling in Python
sklearn regression metrics:
metrics.explained_variance_score – Explained variance regression score function
metrics.max_error – Calculates the maximum residual error
metrics.mean_absolute_error – Mean absolute error regression loss
metrics.mean_squared_error – Mean squared error regression loss
metrics.r2_score – R² (coefficient of determination) regression score function
MLR Modelling in Python
pred_y = full_model.predict(test_X_new)

# Note: sklearn metrics take y_true first, then y_pred; the order matters
# for explained_variance_score and r2_score.
from sklearn import metrics
np.sqrt(metrics.mean_squared_error(test_Y, pred_y))

from sklearn.metrics import explained_variance_score
explained_variance_score(test_Y, pred_y)

from sklearn.metrics import mean_absolute_error
mean_absolute_error(test_Y, pred_y)
MLR Modelling in Python
from sklearn.metrics import mean_squared_error
mean_squared_error(test_Y, pred_y)

from sklearn.metrics import r2_score
r2_score(test_Y, pred_y)
Avoiding Overfitting - Mallows's Cp
Mallows's Cp (Mallows, 1973) is used to select the best regression model by incorporating the right number of explanatory variables in the model. Mallows's Cp is given by

Cp = (SSEp / MSE_full) - (n - 2p)

where
SSEp is the sum of squared errors with p parameters in the model (including the constant),
MSE_full is the mean squared error with all variables in the model,
n is the number of observations,
p is the number of parameters in the regression model including the constant.
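A minimal sketch of this formula using two fitted statsmodels OLS models on made-up data: a full model with all three candidate predictors and a candidate subset model with p = 3 parameters.

import numpy as np
import pandas as pd
import statsmodels.api as sm

# Made-up data with three candidate predictors (only X1 and X2 matter)
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(40, 3)), columns=['X1', 'X2', 'X3'])
df['Y'] = 2 + 3 * df['X1'] + 1.5 * df['X2'] + rng.normal(size=40)

full = sm.OLS(df['Y'], sm.add_constant(df[['X1', 'X2', 'X3']])).fit()
sub = sm.OLS(df['Y'], sm.add_constant(df[['X1', 'X2']])).fit()

n = len(df)
p = 3                                   # parameters in the subset model, including the constant
mse_full = full.mse_resid               # MSE of the full model (SSR / residual df)
cp = sub.ssr / mse_full - (n - 2 * p)   # Mallows's Cp for the subset model
print(cp)                               # a good model has Cp close to p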
Avoiding Overfitting - Mallows's Cp
The model whose number of parameters p is close to its Mallows's Cp value is chosen as the best model.
Problem – Dataset 12
• The accompanying data is on
y = profit margin of savings and loan companies in a given year,
x1 = net revenues in that year, and
x2 = number of savings and loan branch offices.
Questions
a. Determine the multiple regression equation for the data.
b. Compute and interpret the coefficient of multiple determination, R².
c. At the 5% significance level, determine if the model is useful for predicting the response.
d. At the 5% significance level, does it appear that any of the predictor variables can be removed from the full model as unnecessary?
Problem – Dataset 12 (continued)
e. Obtain and interpret 95% confidence intervals for the slopes, βᵢ, of the population regression line that relates net revenues and number of branches to profit margin.
f. Are there any multicollinearity problems (i.e., are net revenues and number of branches collinear)?
Effects of Multicollinearity

Data A
ID   Y    V1   V2
1    5    6    13
2    3    8    13
3    9    8    11
4    9    10   11
5    13   10   9
6    11   12   9
7    17   12   7
8    15   14   7

Data B
ID   Y     V1    V2
1    3.7   3.2   2.9
2    3.7   3.3   4.2
3    4.2   3.7   4.9
4    4.3   3.3   5.1
5    5.1   4.1   5.5
6    5.2   3.8   6.0
7    5.2   2.8   4.9
8    5.6   2.6   4.3
9    5.6   3.6   5.4
10   6.0   4.1   5.5
