Simple Linear Regression
[Scatter plots illustrating possible relationships between X and Y]
What is LR model used for?
• Linear regression models are used to show or predict the
relationship between two variables or factors. The factor
that is being predicted is called the dependent variable; the
factor used to predict it is called the independent variable.
• Regression analysis is used in statistics to find trends in data.
[Diagram: simple regression relates one predictor (Education) to Income;
multiple regression relates several predictors (Education, Sex, Experience, Age) to Income]
Correlation - 1
• Correlation is a statistical technique that can show whether and how
strongly pairs of variables are related.
• The range of possible values is from -1 to +1.
• The correlation is high if observations lie close to a straight line (i.e.,
values close to +1 or -1) and low if observations are widely scattered
(correlation value close to 0).
r = Σ(xᵢ − x̄)(yᵢ − ȳ) / √( Σ(xᵢ − x̄)² · Σ(yᵢ − ȳ)² )
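As a quick illustration, the correlation formula above can be computed directly. This is a minimal sketch using only the standard library; the helper name pearson_r is ours, not from any package.

```python
# Pearson correlation coefficient, computed directly from the formula above.
def pearson_r(xs, ys):
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

# Illustrative data (not from the lecture examples):
print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))   # perfectly linear -> 1.0
print(pearson_r([1, 2, 3], [3, 2, 1]))         # perfectly inverse -> -1.0
```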
Correlation - Cont
Correlation vs Linear Regression
Simple Linear Regression - 1
• Estimate the linear equation for a relationship between continuous
variables so one variable can be predicted or estimated.
• To determine the relationship between one numerical dependent variable
and one numerical or categorical independent variable.
• Measures the strength of association between the variables, as
correlation does, but provides more information than the correlation
method alone.
• Usually used as a preliminary step before multiple linear regression.
Simple Linear Regression - 2
• As in correlation analysis, linear regression also provides a coefficient
of relationship, called the coefficient of determination, which represents
the proportion of the variation in the dependent variable explained by
the independent variable.
Simple Linear Regression - 3
SST = Σᵢ₌₁ⁿ (yᵢ − ȳ)² = Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)² + Σᵢ₌₁ⁿ (ŷᵢ − ȳ)² = SSE + SSR
R² = SSR / SST = 1 − SSE / SST
• R² is the proportion of variability in the observed dependent variable
that is explained by the linear regression model.
• The coefficient of determination measures the strength of the linear
relationship and is denoted by R².
• The greater R² is, the more successful the linear model.
• An R² value close to 1 indicates a good fit; a value close to 0 indicates
a poor fit.
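The decomposition R² = 1 − SSE/SST can be sketched directly from observed values and fitted values. The helper name r_squared is ours; it assumes the fitted values come from a least-squares fit, which is when SST = SSE + SSR holds.

```python
# R^2 = 1 - SSE/SST, computed from observed ys and fitted y_hats.
def r_squared(ys, y_hats):
    y_bar = sum(ys) / len(ys)
    sst = sum((y - y_bar) ** 2 for y in ys)           # total variation
    sse = sum((y - yh) ** 2 for y, yh in zip(ys, y_hats))  # unexplained
    return 1 - sse / sst

ys = [1.0, 2.0, 3.0, 4.0]
print(r_squared(ys, list(ys)))      # perfect fit -> 1.0
print(r_squared(ys, [2.5] * 4))     # fit is just the mean -> 0.0
```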
R-Squared
Model Evaluation – The Hypothesis Test - 1
• Hypothesis testing: Decision-making procedure
about the null hypothesis
• The Null Hypothesis (H0):
• The hypothesis that cannot be viewed as false unless
sufficient evidence to the contrary is obtained.
• The Alternative Hypothesis (H1):
• The hypothesis against which the null hypothesis is
tested, and which is viewed as true when H0 is declared false.
Model Evaluation – The Hypothesis Test - 2
• Hypothesis test
• A process that uses sample statistics to test a claim about the value of a
population parameter.
• Example: An automobile manufacturer advertises that its new hybrid car
has a mean mileage of 50 miles per gallon. To test this claim, a sample
would be taken. If the sample mean differs enough from the advertised
mean, you can decide the advertisement is wrong.
Model Evaluation – The Hypothesis Test - 3
• Two-sided (tailed) test: H0: μ = μ0 versus H1: μ ≠ μ0
Model Evaluation – The Hypothesis Test - 5
• Equivalence of F-test and t-test: for a given α level, the F-test
of β1 = 0 versus β1 ≠ 0 is algebraically equivalent to the
two-sided t-test.
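This equivalence means the slope's t-statistic squared equals the ANOVA F-statistic. A quick numerical check, using the values reported in Example 1 of these notes (t = 6.36655 for the slope, F = 40.53297 from the ANOVA table):

```python
# For simple linear regression, t^2 = F for the slope test.
t_stat = 6.36655      # t statistic for the slope (Example 1)
f_stat = 40.53297     # ANOVA F statistic (Example 1)
print(t_stat ** 2)    # agrees with F up to rounding
```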
Example 1
The manager of a car plant wishes to investigate how the plant's
electricity usage depends upon the plant's production. The data are
given below; estimate the linear regression equation.

Production (x) ($M):          4.51 3.58 4.31 5.06 5.64 4.99 5.29 5.83 4.70 5.61 4.90 4.20
Electricity usage (y) (kWh):  2.48 2.26 2.47 2.77 2.99 3.05 3.18 3.46 3.03 3.26 2.67 2.53
ANOVA
             df   SS           MS       F            Significance F
Regression    1   1.212381668  1.21238  40.53297031  8.1759E-05
Residual     10   0.299109998  0.02991
Total        11   1.511491667

              Coefficients  Standard Error  t Stat   P-value      Lower 95%     Upper 95%
Intercept     0.409048191   0.385990515     1.05974  0.314189743  -0.450992271  1.269088653
X Variable 1  0.498830121   0.078351706     6.36655  8.1759E-05    0.324251642  0.673408601
MS Excel Result – Regression Line
Example 1 – Summary - 1
Estimated regression line: ŷ = 0.4091 + 0.4988x
Electricity usage = 0.4091 + 0.4988 × Production
Standard error of estimate = 0.173
Coefficient of determination R² = 0.802
α = 0.05; tα/2, n−2 = t0.025, 10 = 2.228
T = 6.37; critical region: |T| > tα/2, n−2. Since 6.37 > 2.228, H0: β1 = 0 is
rejected and the regression is significant.
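The Example 1 results can be reproduced from the raw data with the standard least-squares formulas. A self-contained sketch (plain Python, no libraries assumed):

```python
# Reproduce the Example 1 fit: slope, intercept, R^2, and slope t-statistic.
x = [4.51, 3.58, 4.31, 5.06, 5.64, 4.99, 5.29, 5.83, 4.70, 5.61, 4.90, 4.20]
y = [2.48, 2.26, 2.47, 2.77, 2.99, 3.05, 3.18, 3.46, 3.03, 3.26, 2.67, 2.53]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b1 = sxy / sxx                       # slope, ~0.4988
b0 = y_bar - b1 * x_bar              # intercept, ~0.4090
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
sst = sum((yi - y_bar) ** 2 for yi in y)
r2 = 1 - sse / sst                   # ~0.802
se = (sse / (n - 2)) ** 0.5          # standard error of estimate, ~0.173
t = b1 / (se / sxx ** 0.5)           # slope t-statistic, ~6.37
print(b0, b1, r2, t)
```

The printed values match the Excel output above up to rounding.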
ANOVA
             df    SS         MS         F         Significance F
Regression     1   69.19926   69.19926   412.7226  1.22E-66
Residual     479   80.31167   0.167665
Total        480   149.5109
[Scatter plot: RQI and Predicted RQI (Linear (RQI)) versus Permeability (md), 0–2500 md]
Example 2 – Interpretation of the results
• Permeability (md) coefficient (β1 = 0.0017): each unit increase in
Permeability adds 0.0017 to the RQI value.
• β1 > 0 (positive relationship): RQI increases as Permeability
increases.
• Intercept coefficient (β0 = 0.309): the value of RQI when
Permeability equals zero.
• R Square = 0.462837: the model explains about 46% of the total
variability in the RQI values around their mean.
• P-value < 0.05: the regression is significant.
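The fitted model can be used for prediction. A minimal sketch using the (rounded) coefficients from the interpretation above; the permeability values chosen here are illustrative, not from the data set:

```python
# Predict RQI from Permeability using the fitted Example 2 model:
# RQI = 0.309 + 0.0017 * Permeability (coefficients rounded as above).
b0, b1 = 0.309, 0.0017
for perm in (100.0, 500.0, 1000.0):
    print(perm, round(b0 + b1 * perm, 3))   # 0.479, 1.159, 2.009
```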
Example 2 (b) – Application of SLR to
Reservoir Quality Index (RQI)
• Example: Given data on Permeability and Reservoir Quality
Index, RQI, investigate the dependence of RQI (Y) on
Permeability (X). For this example, outliers have been
detected and removed.
• Set up the hypotheses:
• H0: β1 = 0 (there is no relationship between x and y, i.e., no
relationship between RQI and Permeability)
• H1: β1 ≠ 0 (there is a relationship)
41 outliers were detected and treated.
Excel Results – Example 2(b)
RQI vs Permeability (Outliers Removed )
[Scatter plot: observed RQI values (Y-axis) versus Permeability (X-axis), 0–400 md,
with fitted line Linear (RQI)]
Excel Results – Example 2(b)
Regression Statistics                 What can you conclude here?
Multiple R          0.851746
R Square            0.725471
Adjusted R Square   0.724844
Standard Error      0.134745
Observations        440
ANOVA
             df    SS         MS           F            Significance F
Regression     1   21.01512   21.01511817  1157.460884  5.0238E-125
Residual     438   7.952426   0.018156223
Total        439   28.96754