
Linear Regression Model

Sasadhar Bera, IIM Ranchi


Correlation Vs. Regression
A scatter plot is a graphical display that suggests the overall relationship (either linear or non-linear) between two variables.
The correlation coefficient only measures the strength of the relationship between two variables; it does not quantify a cause and effect relationship.
Regression analysis investigates the cause and effect relationship between two or more variables. Regression analysis builds a model to predict the output(s) for a given set of inputs.

What is Regression Analysis?
In most studies, the true relationship between the output and input variables is unknown. In such situations, we build an empirical model as an approximating function. This empirical model helps us understand, interpret, and implement the input-output relationship.
Regression analysis is highly useful for developing such an empirical model from historical records or from data on natural phenomena or experiments.
Regression analysis uses sample data to fit a statistical model known as a regression model. The regression model may be linear or non-linear in form.
Dependent and Independent variables
Suppose that an analyst wants to develop a prediction model in which the tensile strength of synthetic fibre is a function of cotton percentage and drying time. He randomly selects 10 test specimens and collects information on strength, cotton percentage, and drying time. The relationship developed using a regression model is given below:

Strength = 69.1 + 2.419 cotton percentage + 0.223 drying time

The above regression model has one dependent variable, 'Strength', and two independent variables, 'cotton percentage' and 'drying time'. The independent variables influence the dependent variable. The dependent variable is also known as the output or response variable. The independent variables are called predictors, regressors, or explanatory variables.
Simple Linear Regression Model
Simple linear regression involves one dependent variable and one independent variable. The equation that describes the simple linear regression model is given below:

y = β0 + β1 x1 + ϵ

y is the dependent variable and x1 is the independent variable. The independent variable is used to predict the dependent variable.
β0 and β1 are unknown regression coefficients, known as the intercept and slope of the regression equation, respectively. These regression coefficients are estimated from observed sample data.
The term ϵ (pronounced epsilon) is the random error.
Multiple Linear Regression Model
Multiple linear regression involves one dependent variable and more than one independent variable. The equation that describes the multiple linear regression model is given below:

y = β0 + β1 x1 + β2 x2 + . . . + βk xk + ϵ

y is the dependent variable and x1, x2, . . ., xk are the independent variables, which are used to predict the dependent variable.
β0, β1, β2, . . ., βk are in total (k+1) unknown regression coefficients. These regression coefficients are estimated from observed sample data.
The term ϵ (pronounced epsilon) is the random error.
Linear Vs. Non-linear Regression Model
In a regression model, the regression coefficients (the β values) are called model parameters.
A regression model is called linear if the model is a linear function of the parameters, regardless of the shape of the surface that it generates.

y = β0 + β1x1 + β2x2 + . . . + βkxk + ϵ

The terms β0 + β1x1 + . . . + βkxk form the linear component; ϵ is the random error component, which is unpredictable.

Linear Vs. Non-linear Regression Model (Contd.)
Linear Regression Model-1: y = 40.44 + 0.775x1 + 0.325x2
The above model's parameters enter linearly, and the equation describes a plane over the two-dimensional x1, x2 space.

Linear Vs. Non-linear Regression Model (Contd.)
Model-2: y = 69.1 − 1.63x1 + 0.96x1² − 1.08x2 − 1.22x2² + 0.225x1x2
The above regression model is a linear model because the parameters (β values) enter linearly. It is a second order response model. Note that the power of x does not determine whether or not a model is linear.

[Figure: response surface generated by the above regression equation]
Linear Vs. Non-linear Regression Model (Contd.)
Any regression model that is non-linear in its parameters is called non-linear regression.

Model-3: y = exp(β1x) / (β2 + β3x)

Model-4: y = (e^(−β1x1) − e^(−β2x2)) / (β1 − β2)

In the above examples, the parameters (the β values) enter in non-linear form.

Linear Vs. Non-linear Regression Model (Contd.)
exp( x )
Scatter plot of non-linear regression model: y
0.01  x

Simple Linear Regression
Suppose that we have collected n observations to fit a regression model with one predictor variable. The fitted equation that describes the simple linear regression prediction model is given below:

yi = β0 + β1xi + ϵi

where i = 1, 2, . . ., n and n is the total number of observations; yi is the ith observation.
β0 and β1 are unknown parameters (intercept and slope).
ϵi is the random error for the ith observation. This error is uncontrollable and unpredictable.
Simple Linear Regression (Contd.)
[Figure: scatter of observed values about the fitted regression line, with intercept β0, slope β1, and the error ei shown as the gap between an observed value and its predicted value on the line]
Observed value = Predicted value + error


Simple Linear Regression (Contd.)
The error is the difference between the actual and predicted values. It may be positive or negative.
The error is also known as the residual. The predicted value is called the fitted value, or fit.
The sum of squared differences between the actual and predicted values is known as the sum of squared errors. The least squares method finds the best fitting line by minimizing the sum of squared errors. This best fitting line is called the regression line.

Parameter Estimates: Simple Linear Regression
The ordinary least squares (OLS) estimates of the parameters of the model yi = β0 + β1xi + ϵi are:

β̂1 = Sxy / Sxx = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²    β̂0 = ȳ − β̂1x̄

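The closed-form OLS estimates above can be sketched in a few lines of code. This is an illustrative sketch only; the tiny dataset below is invented for the example and is not from the slides.

```python
# Closed-form OLS estimates for simple linear regression.

def ols_simple(x, y):
    """Return (b0, b1): OLS intercept and slope for y = b0 + b1*x."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)                        # Sxx
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))   # Sxy
    b1 = sxy / sxx           # slope = Sxy / Sxx
    b0 = ybar - b1 * xbar    # intercept = ybar - b1 * xbar
    return b0, b1

# Made-up data: Sxx = 5, Sxy = 3, so b1 = 0.6 and b0 = 0.5.
x = [1, 2, 3, 4]
y = [1, 2, 2, 3]
b0, b1 = ols_simple(x, y)
```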
Estimated Regression Line
The fitted or estimated regression line: ŷ = β̂0 + β̂1x

Predicted value or fit for the ith observation (xi): ŷi = β̂0 + β̂1xi

The error in the fit is called the residual: ei = yi − ŷi

Standard error of estimate: S = √MSE = √( Σi ei² / (n − 2) )
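The fitted values, residuals, and standard error of estimate S = √(SSE / (n − 2)) can be computed directly. A minimal sketch, again on an invented four-point dataset:

```python
# Fits, residuals, and standard error of estimate for simple linear
# regression (data invented for illustration).
import math

x = [1, 2, 3, 4]
y = [1, 2, 2, 3]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
b1 = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
      / sum((xi - xbar) ** 2 for xi in x))
b0 = ybar - b1 * xbar

fits = [b0 + b1 * xi for xi in x]                   # predicted values
residuals = [yi - fi for yi, fi in zip(y, fits)]    # e_i = y_i - yhat_i
sse = sum(e ** 2 for e in residuals)                # sum of squared errors
S = math.sqrt(sse / (n - 2))                        # standard error of estimate
```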
Test of Significance of Regression
Hypotheses for the significance of regression:
H0: β1 = 0, i.e. no linear relationship between x and y
H1: β1 ≠ 0

Test of Significance of Regression (Contd.)
ANOVA table
Source of Variation DF SS MS FCal
Regression 1 SSR SSR /1 =MSR MSR/MSE
Residual error n –2 SSE SSE / (n-2) = MSE
Total n –1 TSS

SSR = Σi (ŷi − ȳ)²    (sum of squares due to regression)
SSE = Σi (yi − ŷi)²    (sum of squares due to error)
TSS = Σi (yi − ȳ)²    (total sum of squares), and TSS = SSR + SSE

Reject H0 if FCal > Fα, 1, (n−2) at significance level α.
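The ANOVA decomposition TSS = SSR + SSE and the F statistic can be checked numerically. A hedged sketch on an invented dataset (the F value would still have to be compared with the F(α, 1, n−2) critical value from a table):

```python
# ANOVA partition and F statistic for simple linear regression
# (invented data; SSR = 1.8, SSE = 0.2, TSS = 2.0, F = 18 here).

x = [1, 2, 3, 4]
y = [1, 2, 2, 3]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
b1 = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
      / sum((xi - xbar) ** 2 for xi in x))
b0 = ybar - b1 * xbar
fits = [b0 + b1 * xi for xi in x]

ssr = sum((fi - ybar) ** 2 for fi in fits)                 # regression SS
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fits))       # error SS
tss = sum((yi - ybar) ** 2 for yi in y)                    # total SS

msr = ssr / 1          # regression mean square (DF = 1)
mse = sse / (n - 2)    # error mean square (DF = n - 2)
f_cal = msr / mse      # compare with F(alpha, 1, n-2)

assert abs(tss - (ssr + sse)) < 1e-9   # partition identity holds
```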
Sampling Distribution of Regression Parameter
The estimated values of the parameters (regression coefficients) vary from sample to sample. Thus the estimators are random variables and have sampling distributions.

β̂0 follows a normal distribution with mean β0 and variance σ²(1/n + x̄²/SXX), where SXX = Σi (xi − x̄)².

β̂1 follows a normal distribution with mean β1 and variance σ²/SXX, where σ² is the error variance.
Coefficient of Determination
Coefficient of determination: R² = SSR / TSS
SSR: sum of squares due to regression
TSS: total sum of squares

Physical interpretation: the coefficient of determination is the fraction of the variation in the dependent variable explained by the regression. It is often expressed as a percentage.
R² is one measure of goodness of linear fit. The better the linear fit, the closer R² is to 1.
In simple linear regression, R² equals the square of the correlation coefficient between the dependent and independent variables.
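The relation between R² and the correlation coefficient in simple regression can be verified numerically. A sketch on invented data, computing both quantities from their definitions:

```python
# Check that R^2 = SSR/TSS equals the squared sample correlation r^2
# in simple linear regression (data invented for illustration).
import math

x = [1, 2, 3, 4]
y = [1, 2, 2, 3]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
syy = sum((yi - ybar) ** 2 for yi in y)          # this is TSS
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))

b1 = sxy / sxx
b0 = ybar - b1 * xbar
ssr = sum((b0 + b1 * xi - ybar) ** 2 for xi in x)

r2 = ssr / syy                      # coefficient of determination
r = sxy / math.sqrt(sxx * syy)      # sample correlation coefficient
```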
Lack of Fit Test
The lack of fit test identifies curvature in the data and assesses the goodness of fit of a regression model. If a curvilinear pattern exists in the data set, a higher order polynomial model may fit the data more accurately.

H0: The linear regression model correctly fits the data
H1: The linear regression model does not correctly fit the data

An Illustrative Example
The weight and systolic blood pressure of 26 randomly selected
males in the age group 25 to 30 are shown in the following
table. Assume that weight and blood pressure are jointly
normally distributed.
Subject Weight BP Subject Weight BP
1 165 130 14 172 153
2 167 133 15 159 128
3 180 150 16 168 132
4 155 128 17 174 149
5 212 151 18 183 158
6 175 146 19 215 150
7 190 150 20 195 163
8 210 140 21 180 156
9 200 148 22 143 124
10 149 125 23 240 170
11 158 133 24 235 165
12 169 135 25 192 160
13 170 150 26 187 159
An Illustrative Example (Contd.)
i. Construct a scatter diagram of the data.
ii. Find a regression line relating systolic blood pressure to weight.
iii. Test for significance of regression using α = 0.05.
iv. Determine the coefficient of determination.

An Illustrative Example (Contd.)

The scatter plot does not closely resemble a straight line; a curvilinear pattern may exist.
An Illustrative Example (Contd.)

Analysis of Variance

Source DF SS MS F P
Regression 1 2693.6 2693.6 35.74 0.000
Residual Error 24 1808.6 75.4
Total 25 4502.2

Since the P-value < α = 0.05, reject H0. Hence the regression is significant.

An Illustrative Example (Contd.)
Predictor Coef SE Coef T P
Constant 69.10 12.91 5.35 0.000
Weight 0.41942 0.07015 5.98 0.000

S = 8.68085 R-Sq = 59.8% R-Sq(adj) = 58.2%

Weight is a significant variable (p-value = 0.000) for predicting BP.
The fitted regression line: BP = 69.1 + 0.419 Weight
Standard error of estimate: S = 8.68085
R² is 59.8%, i.e. only 59.8% of the variability in BP is explained by weight. This indicates a poor approximation of the true functional relationship and casts doubt on the goodness of fit.
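The example's regression output can be reproduced from the 26 (Weight, BP) pairs in the table using the closed-form OLS formulas. This is a sketch, assuming the data table is transcribed correctly; the values should land near the reported Coef = 69.10 and 0.41942, S = 8.68, and R-Sq = 59.8%.

```python
# Reproduce the slides' fit BP ~ Weight from the 26 data points.
import math

weight = [165, 167, 180, 155, 212, 175, 190, 210, 200, 149, 158, 169, 170,
          172, 159, 168, 174, 183, 215, 195, 180, 143, 240, 235, 192, 187]
bp = [130, 133, 150, 128, 151, 146, 150, 140, 148, 125, 133, 135, 150,
      153, 128, 132, 149, 158, 150, 163, 156, 124, 170, 165, 160, 159]

n = len(weight)
xbar, ybar = sum(weight) / n, sum(bp) / n
sxx = sum((x - xbar) ** 2 for x in weight)
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(weight, bp))
b1 = sxy / sxx                  # slope, ~0.419
b0 = ybar - b1 * xbar           # intercept, ~69.1

fits = [b0 + b1 * x for x in weight]
sse = sum((y - f) ** 2 for y, f in zip(bp, fits))
tss = sum((y - ybar) ** 2 for y in bp)
r2 = 1 - sse / tss              # coefficient of determination, ~0.598
s = math.sqrt(sse / (n - 2))    # standard error of estimate, ~8.68
```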
An Illustrative Example (Contd.)

Lack of fit test results:
Possible curvature in the variable Weight (p-value = 0.014).
The overall lack of fit is significant at p = 0.014.

Multiple Linear Regression Model

Multiple Linear Regression Model
Multiple linear regression involves one dependent variable and more than one independent variable. The equation that describes the multiple linear regression model is given below:

y = β0 + β1 x1 + β2 x2 + . . . + βk xk + ϵ

y is the dependent variable and x1, x2, . . ., xk are the independent variables, which are used to predict the dependent variable.
β0, β1, β2, . . ., βk are in total (k+1) unknown regression coefficients. These regression coefficients are estimated from observed sample data.
The term ϵ (pronounced epsilon) is the random error.
Matrix Notation: Multiple Linear Regression
Suppose that n observations are collected on the response variable (y) and that k independent variables are present in the regression model. In matrix notation,

yn×1 = Xn×(k+1) β(k+1)×1 + ϵn×1

where n is the total number of observations and k the number of independent variables. y = [y1, . . ., yi, . . ., yn]ᵀ is the vector of observations; X is the design matrix whose ith row is [1, xi1, . . ., xij, . . ., xik]; β = [β0, β1, . . ., βk]ᵀ is the vector of model parameters; and ϵ = [ϵ1, . . ., ϵn]ᵀ is the vector of random errors.
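In this matrix form, the OLS estimate solves the normal equations (XᵀX)β̂ = Xᵀy. A minimal pure-Python sketch (in practice one would use a library solver such as numpy.linalg.lstsq); the four data points are invented so that y = 1 + 2x1 + 3x2 holds exactly and the estimate recovers [1, 2, 3]:

```python
# Estimate beta_hat for y = X beta + eps by solving the normal
# equations (X^T X) beta = X^T y with Gauss-Jordan elimination.

def solve(a, b):
    """Solve the square linear system a @ x = b."""
    n = len(a)
    m = [row[:] + [bi] for row, bi in zip(a, b)]      # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]               # partial pivoting
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * w for v, w in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

# Design matrix X: first column of 1s corresponds to the intercept beta_0.
X = [[1, 0, 0], [1, 1, 0], [1, 0, 1], [1, 1, 1]]
y = [1, 3, 4, 6]          # exactly y = 1 + 2*x1 + 3*x2

p = len(X[0])             # p = k + 1 coefficients
xtx = [[sum(X[r][i] * X[r][j] for r in range(len(X))) for j in range(p)]
       for i in range(p)]                              # X^T X
xty = [sum(X[r][i] * y[r] for r in range(len(X))) for i in range(p)]  # X^T y
beta_hat = solve(xtx, xty)
```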
Estimated Residual and Standard Error
For the ith observation (with row vector Xi), the predicted value or fit is ŷi = Xiβ̂.

The error in the fit is called the residual: ei = yi − ŷi

Mean square error: MSE = Σi ei² / (n − k − 1), where n is the total number of observations and k is the number of regressors.

Standard error (SE) of estimate: σ̂ = √MSE
Testing Significance of Regression Model
The test for significance of regression is a test to
determine if there is a linear relationship between the
response variable and regressor variables.

H0 : β1 = β2 = . . . = βk = 0
H1 : At least one βj is not zero

The test procedure involves an analysis of variance (ANOVA) partitioning of the total sum of squares into a sum of squares due to regression and a sum of squares due to error (residual).
Total number of model parameters: p = number of regression coefficients = (k+1)
Testing Significance of Regression Model (Contd.)
ANOVA table
Source of DF SS MS FCal
Variation
Regression k SSR SSR /k =MSR MSR/MSE
Residual n - k -1 SSE SSE / (n-k-1)
error = MSE
Total n –1 TSS

SSR = Σi (ŷi − ȳ)² = β̂ᵀXᵀy − (Σi yi)² / n
SSE = Σi (yi − ŷi)² = yᵀy − β̂ᵀXᵀy
TSS = Σi (yi − ȳ)², and TSS = SSR + SSE
Significance Test of Individual Parameter

Adding an unimportant variable to the model can actually increase the mean square error, thereby decreasing the usefulness of the model.
The hypotheses for testing the significance of an individual regression coefficient, say βj, are:
H0: βj = 0    H1: βj ≠ 0

Test statistic: Tcal = β̂j / √(σ̂² Cjj) ~ t α/2, (n−k−1)

where σ̂² is the mean square error (MSE) and Cjj is the jth diagonal element of (XᵀX)⁻¹. Reject H0 if |Tcal| > t α/2, (n−k−1).
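The statistic Tcal = β̂j / √(σ̂² Cjj) can be illustrated in the simple-regression special case, where σ̂² C11 reduces to MSE / Sxx. A sketch on invented data (|Tcal| would still be compared with the t(α/2, n−k−1) critical value):

```python
# t statistic for the slope coefficient, simple-regression case:
# se(b1) = sqrt(MSE / Sxx), so t = b1 / sqrt(MSE / Sxx).
import math

x = [1, 2, 3, 4]
y = [1, 2, 2, 3]
n, k = len(x), 1
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
b0 = ybar - b1 * xbar

sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
mse = sse / (n - k - 1)               # estimate of the error variance
t_cal = b1 / math.sqrt(mse / sxx)     # compare |t_cal| with t(alpha/2, n-k-1)
```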
Coefficient of Multiple Determination
Coefficient of multiple determination: R² = SSR / TSS = (TSS − SSE) / TSS = 1 − SSE / TSS

SSR: sum of squares due to regression
SSE: sum of squares due to error
TSS: total sum of squares

The coefficient of multiple determination is the fraction of the variation in the dependent variable explained by the regressor variables.
R² measures the goodness of linear fit. The better the linear fit, the closer R² is to 1.
Coefficient of Multiple Determination (Contd.)
The major drawback of the coefficient of multiple determination (R²) is that adding a predictor variable to the model always increases R², regardless of whether the additional variable is significant or not. To avoid this, regression model builders prefer the adjusted R² statistic:

R²adj = 1 − [SSE / (n − p)] / [TSS / (n − 1)] = 1 − ((n − 1) / (n − p)) (1 − R²)

Unlike R², the adjusted R² statistic does not necessarily increase as variables are added to the model.
When R² and adjusted R² differ dramatically, there is a good chance that non-significant terms have been included in the model.
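The adjusted R² formula above can be sketched directly; the numbers below are invented to show the key behaviour: a useless extra regressor can nudge R² up while adjusted R² goes down.

```python
# Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p),
# with n observations and p = k + 1 model parameters.

def adjusted_r2(r2, n, p):
    return 1 - (1 - r2) * (n - 1) / (n - p)

# Invented comparison: adding one regressor raises R^2 from 0.900 to
# 0.905, yet the adjusted statistic decreases.
before = adjusted_r2(0.900, 20, 2)
after = adjusted_r2(0.905, 20, 3)
```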
