Day 6 Session 1 MLR

Multiple Linear Regression
Foundation for Predictive Modeling

Day 6
Session 1
1
Predictive Modeling
The process of building a statistical model that best predicts an outcome or the probability of an outcome.
Predictive models are developed from historical data or from purposely collected data.
Predictive analytics is used in financial services, insurance, telecommunications, retail, travel, healthcare, pharmaceuticals, sports and other fields.
2
Predictive Modeling
General Approach
Steps shown in the flow diagram:
Set Business Goal
Understand the Data
Apply Business Expertise
Develop the Statistical Model
Validate the Model
Make Decisions
3
Multiple Linear Regression
Multiple linear regression is an extension of simple linear regression. It is used when we
want to predict the value of a variable based on the values of two or more other
variables.
The variable we want to predict is called the dependent variable (or sometimes the response variable).
The variables used to predict the value of the dependent variable are called independent variables (or sometimes predictor, explanatory or regressor variables).
4
Statistical Model in Multiple Linear Regression
Y = b0 + b1x1 + b2x2 + ... + bpxp + e
Where,
Y : Dependent Variable
x1, x2, …, xp : Independent Variables
b0, b1, …, bp : Parameters of the Model
e : Random Error Component
Parameters of the model are estimated by the Least Squares Method.
5
What is the least squares method?
The least squares (LS) criterion chooses the parameter estimates for which the sum of the squares of the errors (or residuals) is minimum.
Mathematically, the following quantity is minimized to estimate the parameters:
$$\text{Error SS} = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$$
where $\hat{Y}_i$ is the value of $Y_i$ predicted by the model.
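A minimal sketch of the criterion in R, using simulated data (the variables x1, x2 and y and every numeric value below are made up for illustration and are not the case-study data). It checks that the coefficients returned by lm() give a smaller error sum of squares than a slightly perturbed set of coefficients.

```r
# Simulated data: assumed for illustration only
set.seed(1)
n  <- 50
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 2 + 1.5 * x1 - 0.5 * x2 + rnorm(n)

# Error sum of squares for a coefficient vector b = (b0, b1, b2)
error_ss <- function(b) {
  fitted_y <- b[1] + b[2] * x1 + b[3] * x2
  sum((y - fitted_y)^2)
}

fit   <- lm(y ~ x1 + x2)
b_hat <- unname(coef(fit))

ss_ls        <- error_ss(b_hat)                      # at the least squares estimates
ss_perturbed <- error_ss(b_hat + c(0.1, -0.1, 0.1))  # at nearby coefficients

print(c(least_squares = ss_ls, perturbed = ss_perturbed))
```

Any other perturbation of b_hat should also give a larger error sum of squares, since the least squares estimates minimize this quantity.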
6
Case Study
Predicting Job Performance Index
[Diagram: Aptitude, Test of Language, Technical and General Information scores as predictors of the Performance Index]
7
Get Started With Scatter Plot Matrix
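# Pick the data file interactively; it must contain jpi, aptitude, tol, technical and general
per_index <- read.csv(file.choose(), header = TRUE)
# Scatter plot matrix of the dependent variable and the four test scores
pairs(~ jpi + aptitude + tol + technical + general, data = per_index, col = "blue")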
8
Performance Index: Mathematical Model
Objective: To model the job performance index (jpi) after the probationary period.
Data: Scores on various tests conducted before recruitment.
Sample Size: 33
Model:
jpi = b0 + b1(aptitude) + b2(tol) + b3(technical) + b4(general) + e
Parameters of the model are estimated by the Least Squares Method.

9
Snapshot of the Data
[Data table screenshot: 1 dependent variable and 4 independent variables]
10
Least Squares Estimates of Parameters
Term        Coefficient
Intercept   -54.2822
aptitude      0.3236
tol           0.0334
technical     1.0955
general       0.5368
Model Equation
jpi = -54.2822 + 0.3236(aptitude) + 0.0334(tol) + 1.0955(technical) + 0.5368(general)
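As a rough illustration of how the model equation is used, the sketch below plugs one set of test scores into the estimated coefficients. The candidate's scores are hypothetical values chosen for the example; only the coefficients come from the slide.

```r
# Least squares estimates reported on the slide
b <- c(intercept = -54.2822, aptitude = 0.3236, tol = 0.0334,
       technical = 1.0955, general = 0.5368)

# Hypothetical candidate scores (assumed values, not from the data set)
scores <- c(aptitude = 50, tol = 50, technical = 50, general = 50)

# Predicted jpi = b0 + sum over tests of (coefficient * score)
jpi_hat <- unname(b["intercept"] + sum(b[names(scores)] * scores))
print(jpi_hat)
```

Indexing b by names(scores) keeps each coefficient paired with its own test, whatever order the scores are supplied in.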
11
Global Testing: Testing whether at least one variable has a significant impact
The aim of global testing is to test the null hypothesis that all the slope parameters of the model are simultaneously equal to zero.
Hypotheses:
H0: b1 = b2 = … = bp = 0 v/s H1: At least one coefficient is not zero
In other words,
H0: None of the independent variables has a significant impact on the dependent variable
12
Global Testing: Partitioning Total Variation
Total Variation = Explained Variation + Unexplained Variation
$$\sum_{i=1}^{n} (Y_i - \bar{Y})^2 = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2 + \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$$
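The partition can be checked numerically. The sketch below uses simulated data (all variable names and values are assumptions for illustration, not the case-study data) and confirms that the total sum of squares equals the explained plus the unexplained sum of squares for a least squares fit with an intercept.

```r
# Simulated data: assumed for illustration only
set.seed(2)
n  <- 40
x1 <- runif(n, 0, 10)
x2 <- runif(n, 0, 10)
y  <- 5 + 2 * x1 + 0.5 * x2 + rnorm(n, sd = 2)

fit   <- lm(y ~ x1 + x2)
y_hat <- fitted(fit)   # predicted values
y_bar <- mean(y)       # overall mean of the dependent variable

total_ss       <- sum((y - y_bar)^2)
explained_ss   <- sum((y_hat - y_bar)^2)
unexplained_ss <- sum((y - y_hat)^2)

print(c(total = total_ss, explained = explained_ss, unexplained = unexplained_ss))
print(all.equal(total_ss, explained_ss + unexplained_ss))
```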
13
Global Testing: ANOVA and Decision Criterion
Source               DF            SS         MSS        F Value   Pr > F
Model (Explained)    p = 4         2510.007   627.5017   49.8129   <0.0001
Error (Unexplained)  n-p-1 = 28    352.7208   12.5972
Total                n-1 = 32      2862.728
Reject the null hypothesis since the P value < 0.05.
At least one variable has a significant impact on the performance index.
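The F value and its P value can be reproduced from the sums of squares in the ANOVA table. The sketch below is one way to do the arithmetic in R; the variable names are chosen for this example.

```r
# Values taken from the ANOVA table
ss_model <- 2510.007
ss_error <- 352.7208
n <- 33   # sample size
p <- 4    # number of independent variables

ms_model <- ss_model / p              # mean sum of squares, model
ms_error <- ss_error / (n - p - 1)    # mean sum of squares, error
f_value  <- ms_model / ms_error

# Upper-tail probability of the F distribution with (p, n - p - 1) d.f.
p_value <- pf(f_value, df1 = p, df2 = n - p - 1, lower.tail = FALSE)
print(c(F = f_value, p_value = p_value))
```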
14
Individual Testing using t Test
Hypotheses
H0: bi = 0 v/s H1: bi ≠ 0; i = 1, 2, ..., p
Test Statistic
t = (Estimate of bi) / (Standard Error of estimated bi)
Under H0, t follows a t distribution with (n-p-1) d.f.
Reject the null hypothesis if the P value < 0.05 and conclude that the variable has a significant effect on Y.
15
Individual Testing: Case Study
            Coefficients   Standard Error   t Stat    P-value
Intercept   -54.2822       7.3945           -7.3409   0.0000
aptitude      0.3236       0.0678            4.7737   0.0001
tol           0.0334       0.0712            0.4684   0.6431
technical     1.0955       0.1814            6.0395   0.0000
general       0.5368       0.1584            3.3890   0.0021
P values for aptitude, technical and general are less than 0.05.
P value for test of language (tol) is more than 0.05. Therefore, tol is the only insignificant variable.
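The t statistics and P values in the table can be recomputed from the estimates and standard errors. The sketch below uses the five-decimal values that R's summary() reports for this model, with n = 33 and p = 4; the variable names are chosen for this example.

```r
# Estimates and standard errors for the four test scores (case-study values)
estimate  <- c(aptitude = 0.32356, tol = 0.03337, technical = 1.09547, general = 0.53683)
std_error <- c(aptitude = 0.06778, tol = 0.07124, technical = 0.18138, general = 0.15840)
df_error  <- 33 - 4 - 1   # n - p - 1

t_stat  <- estimate / std_error
# Two-sided P value from the t distribution with n - p - 1 d.f.
p_value <- 2 * pt(abs(t_stat), df = df_error, lower.tail = FALSE)

print(round(cbind(t_stat, p_value), 4))

# Variables that are not significant at the 5% level
not_significant <- names(estimate)[p_value >= 0.05]
print(not_significant)
```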
16
Interpretation of Partial Regression Coefficients
For every unit increase in an independent variable, the dependent variable (Y) changes by the corresponding parameter estimate, keeping all the other variables constant.
From the parameter estimates table, we observe that the parameter estimate for the aptitude test is 0.3236.
We can infer that for a one unit increase in aptitude test score, the job performance index will increase by 0.3236 units, provided the other test scores stay the same.
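The "keeping all the other variables constant" reading can be illustrated by comparing two predictions that differ only in the aptitude score. The sketch below assumes a hypothetical helper, predict_jpi, built from the case-study estimates, and hypothetical baseline scores of 50 on every test.

```r
# Case-study least squares estimates
b <- c(intercept = -54.2822, aptitude = 0.3236, tol = 0.0334,
       technical = 1.0955, general = 0.5368)

# Hypothetical helper: predicted jpi for one set of scores
predict_jpi <- function(aptitude, tol, technical, general) {
  unname(b["intercept"] + b["aptitude"] * aptitude + b["tol"] * tol +
           b["technical"] * technical + b["general"] * general)
}

# Two candidates who differ by one point in aptitude only (assumed scores)
base <- predict_jpi(aptitude = 50, tol = 50, technical = 50, general = 50)
plus <- predict_jpi(aptitude = 51, tol = 50, technical = 50, general = 50)

print(plus - base)   # difference equals the aptitude coefficient
```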
17
Measure of Goodness of Fit: R Squared
R2 is the proportion of variation in the dependent variable that is explained by the independent variables. Note that R2 never decreases when a variable is added to the model.
$$R^2 = \frac{\text{Explained variation}}{\text{Total variation}} = \frac{\sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2}{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}$$
The adjusted R-squared is a modified version of R-squared that has been adjusted for the number of predictors in the model.
$$R_a^2 = 1 - \frac{n-1}{n-p-1}\,(1 - R^2)$$
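Both measures can be worked out by hand from the case-study sums of squares (explained 2510.007, total 2862.728, with n = 33 and p = 4). The sketch below does that arithmetic in R; the variable names are chosen for this example.

```r
# Sums of squares from the case-study ANOVA table
ss_explained <- 2510.007
ss_total     <- 2862.728
n <- 33   # sample size
p <- 4    # number of predictors

r2     <- ss_explained / ss_total
adj_r2 <- 1 - (n - 1) / (n - p - 1) * (1 - r2)

print(round(c(R2 = r2, adjusted_R2 = adj_r2), 4))
```

The adjusted value is lower than R2 because the factor (n - 1) / (n - p - 1) exceeds 1 whenever the model has at least one predictor.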
18
Multiple Linear Regression in R
per_index <- read.csv(file.choose(), header = TRUE)
jpimodel <- lm(jpi ~ aptitude + tol + technical + general, data = per_index)
# default output displayed by typing the model object name
jpimodel
Call:
lm(formula = jpi ~ aptitude + tol + technical + general, data = per_index)
Coefficients:
(Intercept)     aptitude          tol    technical      general
  -54.28225      0.32356      0.03337      1.09547      0.53683
19
Multiple Linear Regression in R (continued)
#detailed output displayed using 'summary' function (slides 20 and 21)
summary(jpimodel)
Call:
lm(formula = jpi ~ aptitude + tol + technical + general, data = per_index)
Residuals:
Min 1Q Median 3Q Max
-7.2891 -2.7692 0.4562 2.8508 5.6068
20
Multiple Linear Regression in R (continued)
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -54.28225    7.39453  -7.341 5.41e-08 ***
aptitude      0.32356    0.06778   4.774 5.15e-05 ***
tol           0.03337    0.07124   0.468   0.6431
technical     1.09547    0.18138   6.039 1.65e-06 ***
general       0.53683    0.15840   3.389   0.0021 **
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.549 on 28 degrees of freedom
Multiple R-squared: 0.8768, Adjusted R-squared: 0.8592
F-statistic: 49.81 on 4 and 28 DF, p-value: 2.467e-12
21
Summary of Findings
Performance in the aptitude, technical and general tests during the recruitment phase has a significant influence on job performance.
Test of language is the only insignificant variable.
R squared of the model is 0.88.
88% of the variation in the job performance index is explained by the model and 12% is unexplained.
22
Regression Model in Matrix Form
$$Y_{n \times 1} = X_{n \times (p+1)} \, \beta_{(p+1) \times 1} + e_{n \times 1}$$
where
$$Y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}, \quad
X = \begin{pmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1p} \\ 1 & x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{np} \end{pmatrix}, \quad
\beta = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{pmatrix}, \quad
e = \begin{pmatrix} e_1 \\ e_2 \\ \vdots \\ e_n \end{pmatrix}$$
Here the vector β collects the model parameters b0, b1, …, bp.
23
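Least Squares Estimator and its Variance
$$e = Y - X\beta$$
$$Z = e'e = \sum e_i^2 = (Y - X\beta)'(Y - X\beta)$$
Setting the derivative to zero,
$$\frac{\partial Z}{\partial \beta} = 0 \;\Rightarrow\; \hat{\beta} = (X'X)^{-1} X'Y$$
$$V(\hat{\beta}) = V\left[(X'X)^{-1} X'Y\right]$$
$$V(\hat{\beta}) = (X'X)^{-1} X' \, V(Y) \, X (X'X)^{-1}$$
With $V(Y) = \sigma^2 I$, this simplifies to
$$V(\hat{\beta}) = (X'X)^{-1} \sigma^2$$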

24
THANK YOU!!
25
