Simple Linear Regression

Regression is a statistical analysis used to determine the relationship between variables. It can be used to predict the value of a dependent variable based on the value of one or more independent variables. The relationship can be linear or nonlinear. Simple linear regression involves one independent variable, while multiple linear regression involves more than one. Regression estimates coefficients that can be used to predict outcomes and determine how changes in independent variables affect the dependent variable.


What is Regression?

• Regression is a tool for finding the existence of an association relationship between a dependent variable (Y) and one or more independent variables (X1, X2, …, Xn) in a study.

• The relationship can be linear or non-linear.

• A dependent variable (response variable) "measures an outcome of a study (also called the outcome variable)".

• An independent variable (explanatory variable) "explains changes in a response variable".

• In regression, we often set values of the explanatory variable to see how they affect the response variable (i.e., to predict the response variable).
Regression
• Regression is a supervised learning algorithm in machine-learning terminology.

• It is an important tool in predictive analytics.

Regression - Definition

A statistical technique that attempts to determine the existence of a possible relationship between one dependent variable (usually denoted by Y) and a collection of independent variables.

Regression is used for generating new hypotheses and for validating existing hypotheses.
Dependent and Independent Variables

• The terms dependent and independent do not necessarily imply a causal relationship between the two variables.

• Regression is not designed to capture causality.

• The purpose of regression is to predict the value of the dependent variable given the value(s) of the independent variable(s).
Regression Nomenclature
Dependent Variable       Independent Variable
Explained variable       Explanatory variable
Regressand               Regressor
Predictand               Predictor
Endogenous variable      Exogenous variable
Controlled variable      Control variable
Target variable          Stimulus variable
Response variable        Feature
Outcome variable

Regression Vs Correlation

• Regression is the study of the "existence of a relationship" between two variables. The main objective is to estimate the change in the mean value of the dependent variable for a change in the independent variable.

• Correlation is the study of the "strength of the relationship" between two variables.
Types of Regression

Regression models are classified by the number of independent variables and by the form of the relationship:

• One independent variable → Simple regression (linear or non-linear)
• More than one independent variable → Multiple regression (linear or non-linear)
Types of Regression

• Simple linear regression refers to a regression model between two variables:

$$Y = \beta_0 + \beta_1 X_1 + \varepsilon$$

• Multiple linear regression refers to a regression model with more than one independent variable:

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k + \varepsilon$$

• Non-linear regression, for example:

$$Y = \beta_0 + \frac{\beta_1}{1 + \beta_2 X_1} + \beta_3 X_2 + \varepsilon$$
Linear Regression

• Linear regression stands for a function that is linear in the regression coefficients.

• The following equation will be treated as linear as far as regression is concerned, because it is linear in the coefficients even though it is non-linear in the Xs:

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_1 X_2 + \beta_3 X_2^2$$
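For illustration, a model like this can be fitted with ordinary linear least squares once the transformed columns (X1, X1·X2, X2²) are constructed. Below is a minimal sketch in Python on synthetic data using NumPy's lstsq; the data and coefficient values are illustrative assumptions, not from the source.

```python
import numpy as np

# Synthetic data for Y = b0 + b1*X1 + b2*X1*X2 + b3*X2^2 + noise
rng = np.random.default_rng(0)
X1 = rng.uniform(0, 10, 100)
X2 = rng.uniform(0, 10, 100)
Y = 2.0 + 0.5 * X1 + 0.3 * X1 * X2 - 0.1 * X2**2 + rng.normal(0, 1, 100)

# The model is non-linear in X but linear in the coefficients,
# so ordinary least squares on transformed columns suffices.
A = np.column_stack([np.ones_like(X1), X1, X1 * X2, X2**2])
coef, *_ = np.linalg.lstsq(A, Y, rcond=None)
print(coef)  # estimates of b0, b1, b2, b3
```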
Regression Model Development
Framework for SLR model development
Assumptions
The method of least squares gives the best equation under the assumptions stated below (Harter 1974, 1975):

• The regression model is linear in the regression parameters.
• The explanatory variable, X, is assumed to be non-stochastic (i.e., X is deterministic).
• The conditional expected value of the residuals, E(εᵢ | Xᵢ), is zero.
• In case of time-series data, the residuals are uncorrelated, that is, Cov(εᵢ, εⱼ) = 0 for all i ≠ j.
• The residuals, εᵢ, follow a normal distribution.
• The variance of the residuals, Var(εᵢ | Xᵢ), is constant for all values of Xᵢ. Constant variance of the residuals across different values of Xᵢ is called homoscedasticity; a non-constant variance of residuals is called heteroscedasticity.
OLS Estimation

In ordinary least squares (OLS), the objective is to find the values of β₀ and β₁ that minimize the sum of squared errors (SSE) given in the equation below:

$$\mathrm{SSE} = \sum_{i=1}^{n} \varepsilon_i^2 = \sum_{i=1}^{n} \left( Y_i - \beta_0 - \beta_1 X_i \right)^2$$

To find the optimal values of β₀ and β₁ that minimize SSE, we equate the partial derivatives of SSE with respect to β₀ and β₁ to zero:

$$\frac{\partial\, \mathrm{SSE}}{\partial \beta_0} = \sum_{i=1}^{n} -2\left( Y_i - \beta_0 - \beta_1 X_i \right) = 2\left( n\beta_0 + \beta_1 \sum_{i=1}^{n} X_i - \sum_{i=1}^{n} Y_i \right) = 0$$

$$\frac{\partial\, \mathrm{SSE}}{\partial \beta_1} = \sum_{i=1}^{n} -2 X_i \left( Y_i - \beta_0 - \beta_1 X_i \right) = -2 \sum_{i=1}^{n} \left( X_i Y_i - \beta_0 X_i - \beta_1 X_i^2 \right) = 0$$
OLS Estimation

Solving the system of equations for β₀ and β₁, we get the estimated values as follows:

$$\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}$$

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} \left( X_i Y_i - X_i \bar{Y} \right)}{\sum_{i=1}^{n} \left( X_i^2 - X_i \bar{X} \right)} = \frac{\sum_{i=1}^{n} X_i \left( Y_i - \bar{Y} \right)}{\sum_{i=1}^{n} X_i \left( X_i - \bar{X} \right)}$$
Example: Salary of Graduating MBA Students versus Their Percentage Marks in Grade 10

The table in the next slide provides the salaries of 50 graduating MBA students of a business school in 2016 and their corresponding percentage marks in grade 10. Develop a linear regression model by estimating the model parameters.
Salary of MBA students versus their grade 10 marks

S. No.   Percentage in Grade 10   Salary       S. No.   Percentage in Grade 10   Salary
1 62 270000 26 64.6 250000
2 76.33 200000 27 50 180000
3 72 240000 28 74 218000
4 60 250000 29 58 360000
5 61 180000 30 67 150000
6 55 300000 31 75 250000
7 70 260000 32 60 200000
8 68 235000 33 55 300000
9 82.8 425000 34 78 330000
10 59 240000 35 50.08 265000
11 58 250000 36 56 340000
12 60 180000 37 68 177600
13 66 428000 38 52 236000
14 83 450000 39 54 265000
15 68 300000 40 52 200000
16 37.33 240000 41 76 393000
17 79 252000 42 64.8 360000
18 68.4 280000 43 74.4 300000
19 70 231000 44 74.5 250000
20 59 224000 45 73.5 360000
21 63 120000 46 57.58 180000
22 50 260000 47 68 180000
23 69 300000 48 69 270000
24 52 120000 49 66 240000
25 49 120000 50 60.8 300000
Solution
Using the OLS equations, the estimated values of β₀ and β₁ are given by

$$\hat{\beta}_0 = 61555.3553 \quad \text{and} \quad \hat{\beta}_1 = 3076.1774$$

The corresponding regression equation is given by

$$\hat{Y}_i = 61555.3553 + 3076.1774\, X_i$$

where Ŷᵢ is the predicted value of Y for a given value of Xᵢ.

The equation can be interpreted as follows: for every one percentage point increase in grade 10 marks, the salary of the MBA students increases by 3076.1774 on average.
Solution Continued

The notations β̂₀ and β̂₁ are used to denote that these are the estimated values of the regression coefficients obtained from the sample of 50 students.

The Microsoft Excel output for the SLR model is shown in the table below:

                         Coefficients   Standard Error   t-stat   p-value
Intercept                61555.35534    66701.901        0.9228   0.3607
Percentage in grade 10   3076.177438    1031.5258        2.9821   0.0044
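As a cross-check on the Excel output, the same estimates can be reproduced with any OLS routine. Below is a sketch using Python's statsmodels on the 50 observations from the table; the printed coefficients should match the values above:

```python
import statsmodels.api as sm

grade10 = [62, 76.33, 72, 60, 61, 55, 70, 68, 82.8, 59,
           58, 60, 66, 83, 68, 37.33, 79, 68.4, 70, 59,
           63, 50, 69, 52, 49, 64.6, 50, 74, 58, 67,
           75, 60, 55, 78, 50.08, 56, 68, 52, 54, 52,
           76, 64.8, 74.4, 74.5, 73.5, 57.58, 68, 69, 66, 60.8]
salary = [270000, 200000, 240000, 250000, 180000, 300000, 260000, 235000, 425000, 240000,
          250000, 180000, 428000, 450000, 300000, 240000, 252000, 280000, 231000, 224000,
          120000, 260000, 300000, 120000, 120000, 250000, 180000, 218000, 360000, 150000,
          250000, 200000, 300000, 330000, 265000, 340000, 177600, 236000, 265000, 200000,
          393000, 360000, 300000, 250000, 360000, 180000, 180000, 270000, 240000, 300000]

X = sm.add_constant(grade10)   # adds the intercept column
model = sm.OLS(salary, X).fit()
print(model.params)            # approximately [61555.36, 3076.18]
print(model.summary())         # coefficients, standard errors, t-stats, p-values, R-square
```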


Validation of the Simple Linear Regression Model

It is important to validate the regression model to ensure its validity and goodness of fit before it can be used for practical applications. The following measures are used to validate simple linear regression models:

• Coefficient of determination (R-square).
• Hypothesis test for the regression coefficient.
• Analysis of variance for overall model validity (more relevant for multiple linear regression).
• Residual analysis to validate the regression model assumptions.
• Outlier analysis.

The above measures and tests are essential, but not exhaustive.
Coefficient of Determination (R-Square or R²)

• The coefficient of determination (R-square or R²) measures the percentage of variation in Y explained by the model (β₀ + β₁X).

• The simple linear regression model can be broken into explained variation and unexplained variation:

$$\underbrace{Y_i}_{\text{variation in } Y} = \underbrace{\beta_0 + \beta_1 X_i}_{\substack{\text{variation in } Y \\ \text{explained by the model}}} + \underbrace{\varepsilon_i}_{\substack{\text{variation in } Y \\ \text{not explained by the model}}}$$

In the absence of a predictive model for Yᵢ, users would use the mean value of Y. Thus, the total variation is measured as the difference between Yᵢ and the mean value of Y (i.e., Yᵢ − Ȳ).
Description of total variation, explained variation and unexplained variation

Variation Type                               Measure    Description
Total variation (SST)                        Yᵢ − Ȳ     The difference between the actual value and the mean value.
Variation explained by the model (SSR)       Ŷᵢ − Ȳ     The difference between the estimated value of Yᵢ and the mean value of Y.
Variation not explained by the model (SSE)   Yᵢ − Ŷᵢ    The difference between the actual value and the predicted value of Yᵢ (error in prediction).
The relationship between the total variation, explained variation and the unexplained variation is given as follows:

$$\underbrace{Y_i - \bar{Y}}_{\text{total variation in } Y} = \underbrace{\hat{Y}_i - \bar{Y}}_{\substack{\text{variation in } Y \\ \text{explained by the model}}} + \underbrace{Y_i - \hat{Y}_i}_{\substack{\text{variation in } Y \\ \text{not explained by the model}}}$$

It can be proved mathematically that the sum of squares of total variation is equal to the sum of squares of explained variation plus the sum of squares of unexplained variation:

$$\underbrace{\sum_{i=1}^{n} \left( Y_i - \bar{Y} \right)^2}_{\mathrm{SST}} = \underbrace{\sum_{i=1}^{n} \left( \hat{Y}_i - \bar{Y} \right)^2}_{\mathrm{SSR}} + \underbrace{\sum_{i=1}^{n} \left( Y_i - \hat{Y}_i \right)^2}_{\mathrm{SSE}}$$

where SST is the sum of squares of total variation, SSR is the sum of squares of variation explained by the regression model, and SSE is the sum of squares of errors (unexplained variation).
Coefficient of Determination or R-Square

The coefficient of determination (R²) is given by

$$R^2 = \frac{\text{Explained variation}}{\text{Total variation}} = \frac{\mathrm{SSR}}{\mathrm{SST}} = \frac{\sum \left( \hat{Y}_i - \bar{Y} \right)^2}{\sum \left( Y_i - \bar{Y} \right)^2}$$

Since SSR = SST − SSE, the above equation can be written as

$$R^2 = 1 - \frac{\mathrm{SSE}}{\mathrm{SST}} = 1 - \frac{\sum \left( Y_i - \hat{Y}_i \right)^2}{\sum \left( Y_i - \bar{Y} \right)^2}$$
Coefficient of Determination or R-Square

Thus, R² is the proportion of variation in the response variable Y explained by the regression model. The coefficient of determination (R²) has the following properties:

• The value of R² lies between 0 and 1.
• A higher value of R² implies a better fit, but one should be aware of spurious regression.
• Mathematically, the square of the correlation coefficient is equal to the coefficient of determination (i.e., r² = R²).
• We do not put any minimum threshold on R²; a higher value of R² implies a better fit. However, a minimum value of R² for a given significance level α can be derived using the relationship between the F-statistic and R².
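From the definition, R² can be computed directly from the actual and fitted values. A minimal sketch (the helper name r_squared is my own):

```python
import numpy as np

def r_squared(y, y_hat):
    """Coefficient of determination: R^2 = 1 - SSE/SST."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    sse = np.sum((y - y_hat) ** 2)      # sum of squares of errors (unexplained)
    sst = np.sum((y - y.mean()) ** 2)   # total sum of squares
    return 1.0 - sse / sst
```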
Hypothesis Test for Regression Coefficient (t-Test)

• The regression coefficient (β₁) captures the existence of a linear relationship between the response variable and the explanatory variable.

• If β₁ = 0, we can conclude that there is no statistically significant linear relationship between the two variables.

• The estimate of β₁ using OLS is given by

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2} = \frac{\sum_{i=1}^{n} (X_i - \bar{X}) Y_i - \bar{X} \sum_{i=1}^{n} (Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2} = \frac{\sum_{i=1}^{n} (X_i - \bar{X}) Y_i}{\sum_{i=1}^{n} (X_i - \bar{X})^2}$$

since Σ(Yᵢ − Ȳ) = 0.

The above equation can be written as follows:

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} K_i Y_i}{\sum_{i=1}^{n} K_i^2} \quad \text{where} \quad K_i = X_i - \bar{X}$$

That is, the value of β̂₁ is a function of the Yᵢ (each Kᵢ is a constant, since Xᵢ is assumed to be non-stochastic).
The standard error of β̂₁ is given by

$$S_e(\hat{\beta}_1) = \frac{S_e}{\sqrt{\sum (X_i - \bar{X})^2}}$$

In the above equation, Sₑ is the standard error of the estimate (or standard error of the residuals), which measures the accuracy of prediction and is given by

$$S_e = \sqrt{\frac{\sum_{i=1}^{n} \left( Y_i - \hat{Y}_i \right)^2}{n - 2}} = \sqrt{\frac{\sum_{i=1}^{n} \varepsilon_i^2}{n - 2}}$$

The denominator in the above equation is (n − 2) since β₀ and β₁ are estimated from the sample in estimating Ŷᵢ, and thus two degrees of freedom are lost.

The standard error of β̂₁ can therefore be written as

$$S_e(\hat{\beta}_1) = \frac{S_e}{\sqrt{\sum (X_i - \bar{X})^2}} = \frac{\sqrt{\sum_{i=1}^{n} \left( Y_i - \hat{Y}_i \right)^2 / (n - 2)}}{\sqrt{\sum (X_i - \bar{X})^2}}$$
The null and alternative hypotheses for the SLR model can be stated as follows:

H0: There is no relationship between X and Y
HA: There is a relationship between X and Y

• β₁ = 0 would imply that there is no linear relationship between the response variable Y and the explanatory variable X. Thus, the null and alternative hypotheses can be restated as follows:

H0: β₁ = 0
HA: β₁ ≠ 0

• The corresponding t-statistic is given by

$$t = \frac{\hat{\beta}_1 - \beta_1}{S_e(\hat{\beta}_1)} = \frac{\hat{\beta}_1 - 0}{S_e(\hat{\beta}_1)} = \frac{\hat{\beta}_1}{S_e(\hat{\beta}_1)}$$
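The standard error of the estimate, the standard error of β̂₁, and the t-statistic can be computed directly from these formulas. A minimal sketch (the helper name slope_t_test is my own; scipy is used for the two-sided p-value):

```python
import numpy as np
from scipy import stats

def slope_t_test(x, y):
    """t-test for H0: beta1 = 0 in simple linear regression."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    sxx = np.sum((x - x.mean()) ** 2)
    beta1 = np.sum((x - x.mean()) * y) / sxx
    beta0 = y.mean() - beta1 * x.mean()
    resid = y - (beta0 + beta1 * x)
    se = np.sqrt(np.sum(resid ** 2) / (n - 2))   # standard error of the estimate
    se_beta1 = se / np.sqrt(sxx)                 # standard error of beta1
    t = beta1 / se_beta1
    p = 2 * stats.t.sf(abs(t), df=n - 2)         # two-sided p-value
    return t, p
```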
Residual Analysis

Residual (error) analysis is important to check whether the assumptions of the regression model have been satisfied. It is performed to check the following:

• The residuals (Yᵢ − Ŷᵢ) are normally distributed.
• The variance of the residuals is constant (homoscedasticity).
• The functional form of the regression is correctly specified.
• Whether there are any outliers.
Checking for Normal Distribution of Residuals (Yᵢ − Ŷᵢ)

• The easiest technique to check whether the residuals follow a normal distribution is to use the P-P plot (probability-probability plot).

• The P-P plot compares the cumulative distribution functions of two probability distributions against each other.
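A sketch of a P-P plot for the residuals using statsmodels' ProbPlot; the residuals below are stand-in data, not from the example:

```python
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.graphics.gofplots import ProbPlot

# Stand-in residuals; in practice use y - y_hat from the fitted model.
rng = np.random.default_rng(1)
residuals = rng.normal(0, 50000, 50)

pp = ProbPlot(residuals, fit=True)  # fits the location and scale of the normal
pp.ppplot(line="45")                # points close to the 45-degree line suggest normality
plt.title("P-P plot of residuals")
plt.show()
```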
Test of Homoscedasticity

An important assumption of the regression model is that the residuals have constant variance (homoscedasticity) across different values of the explanatory variable (X).

That is, the variance of the residuals is assumed to be independent of the variable X. Failure to meet this assumption makes the hypothesis tests unreliable.

Testing the Functional Form of the Regression Model

Any pattern in the residual plot would indicate incorrect specification (misspecification) of the model.
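Both homoscedasticity and functional form are commonly examined with a residuals-versus-fitted plot: a funnel shape suggests heteroscedasticity, and curvature suggests misspecification. A minimal sketch with stand-in data (not from the example):

```python
import numpy as np
import matplotlib.pyplot as plt

# Stand-in fitted values and residuals; in practice use the fitted model's output.
rng = np.random.default_rng(2)
y_hat = rng.uniform(150000, 450000, 50)
residuals = rng.normal(0, 50000, 50)

plt.scatter(y_hat, residuals)
plt.axhline(0, linestyle="--")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.title("Residuals vs fitted values")
plt.show()
```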
Confidence Interval for the Expected Value of Y for a Given X

• Since point estimates are subject to error, due to uncertainty in the estimation of the parameters and natural variation of the data around the predicted line, the user would like to know the interval estimate (confidence interval) for the conditional expected value.

• The confidence interval of the expected value of Yᵢ for a given value of Xᵢ is given by

$$\hat{Y}_i \pm t_{\alpha/2,\, n-2} \times S_e \sqrt{\frac{1}{n} + \frac{(X_i - \bar{X})^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2}}$$

• where the term $S_e \sqrt{\frac{1}{n} + \frac{(X_i - \bar{X})^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2}}$ is the standard error of E(Y|X).
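This interval translates directly into code. A minimal sketch (the helper name mean_ci is my own):

```python
import numpy as np
from scipy import stats

def mean_ci(x, y, x0, alpha=0.05):
    """Confidence interval for E(Y | X = x0) in simple linear regression."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    sxx = np.sum((x - x.mean()) ** 2)
    beta1 = np.sum((x - x.mean()) * y) / sxx
    beta0 = y.mean() - beta1 * x.mean()
    y0 = beta0 + beta1 * x0                                        # point estimate
    se = np.sqrt(np.sum((y - beta0 - beta1 * x) ** 2) / (n - 2))   # std. error of estimate
    half = stats.t.ppf(1 - alpha / 2, n - 2) * se * np.sqrt(
        1 / n + (x0 - x.mean()) ** 2 / sxx)
    return y0 - half, y0 + half
```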
Prediction Interval for the Value of Y for a Given X

The prediction interval of Yᵢ for a given value of Xᵢ is given by

$$\hat{Y}_i \pm t_{\alpha/2,\, n-2} \times S_e \sqrt{1 + \frac{1}{n} + \frac{(X_i - \bar{X})^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2}}$$

where the term $S_e \sqrt{1 + \frac{1}{n} + \frac{(X_i - \bar{X})^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2}}$ is the standard error of Yᵢ for a given Xᵢ value.
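The prediction interval differs from the confidence interval only by the extra 1 under the radical, which reflects the variance of an individual observation around the line. A minimal sketch mirroring mean_ci above (the helper name prediction_interval is my own):

```python
import numpy as np
from scipy import stats

def prediction_interval(x, y, x0, alpha=0.05):
    """Prediction interval for a new Y at X = x0 (note the extra '1 +')."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    sxx = np.sum((x - x.mean()) ** 2)
    beta1 = np.sum((x - x.mean()) * y) / sxx
    beta0 = y.mean() - beta1 * x.mean()
    y0 = beta0 + beta1 * x0
    se = np.sqrt(np.sum((y - beta0 - beta1 * x) ** 2) / (n - 2))
    half = stats.t.ppf(1 - alpha / 2, n - 2) * se * np.sqrt(
        1 + 1 / n + (x0 - x.mean()) ** 2 / sxx)
    return y0 - half, y0 + half
```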
For large n, the prediction interval of Yᵢ will converge to

$$\hat{Y}_i \pm t_{\alpha/2,\, n-2} \times S_e$$

This is because, as n → ∞, the term

$$\sqrt{1 + \frac{1}{n} + \frac{(X_i - \bar{X})^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2}}$$

converges to 1.
