ECONOMETRICS (Econ 3061)

Simple Linear Regression

Dr. Mohammad Nasir Abdullah


PhD (Statistics), MSc(Medical Statistics), BSc(hons)(Statistics),
Diploma in Statistics, Certified Data Science Specialist, Graduate
Statistician from Royal Statistical Society, UK.

Fundamental & Applied Sciences Department,


Universiti Teknologi PETRONAS
nasir.abdullah@utp.edu.my
Overview
Background
Correlation
Regression
Least Squares Method
Simple Linear Regression (SLR)
ANOVA
Model Evaluation
Applications/Examples
Background
• Regression analysis is a statistical methodology that utilizes the
relation between two or more quantitative variables so that a
response or outcome variable can be predicted from the other, or
others.
• This methodology is widely used in business, the social and
behavioural sciences, the biological sciences, and many other
disciplines.
• Linear regression is a linear model: a model that assumes a linear
relationship between the input variables (x) and the single
output variable (y).
• LR describes a relation between variables in which changes in some variables
may “explain” the changes in other variables.
• An LR model estimates the nature of the relationship between
independent and dependent variables.
Example 1
• Drug Dosage Determination - Determining the appropriate drug
dosage for patients based on factors like age, weight, and other
physiological parameters.
• Disease Progression Prediction - To predict the progression of
diseases like diabetes, cancer, or cardiovascular conditions. By
analysing historical patient data (for example: biomarkers, genetic
factors, lifestyle habits).
• Price Optimization - Determining the optimal pricing strategy by
examining the relationship between price changes and customer
demand.
• The performance of an employee on a job can be predicted by
utilizing the relationship between performance and a battery of
aptitude tests.
What is Linear Regression (LR)
• Investigating the dependence of one variable (the dependent
variable) on one or more other variables (the independent variables)
using a straight line.
[Scatter plots: strong relationships show points lying close to a line in the Y-vs-X plane; weak relationships show points widely scattered.]
What is LR model used for?
• Linear regression models are used to show or predict the
relationship between two variables or factors. The factor
that is being predicted is called the dependent variable.
• Regression analysis is used in stats to find trends in data.

• Example: you might guess that there's a connection between how much
you eat and how much you weigh; regression analysis can help you
quantify that.
How does it work?
• Linear Regression is the process of finding a line that best
fits the data points available on the plot, so that we can
use it to predict output values for inputs that are not
present in the data set we have, with the belief that those
outputs would fall on the line.
What Regression is used for? - 1
• Predictive Analytics: forecasting future opportunities and
risks is the most prominent application of regression analysis
in business
• Operation Efficiency: Regression models can also be used to
optimize business processes.
• Supporting Decisions: Businesses today are overloaded with
data on finances, operations and customer purchases.
Executives are now leaning on data analytics to make
informed business decisions that have statistical
significance, reducing reliance on intuition and gut feel.
What Regression is used for? - Cont
• Correcting Errors: Regression is not only great
for lending empirical support to management
decisions but also for identifying errors in
judgment.
• New Insights: Over time businesses have
gathered a large volume of unorganized data that
has the potential to yield valuable insights.
Example of LR in Forecasting
Types of Regression - 1
Types of Regression - Cont
Types of Regression - Cont

[Diagram: simple regression relates one predictor (Education) to y (Income); multiple regression relates several predictors (Education, Sex, Experience, Age) to Y (Income).]
Correlation - 1
• Correlation is a statistical technique that can show whether and how
strongly pairs of variables are related.
• The range of possible values is from -1 to +1
• The correlation is high if observations lie close to a straight line (i.e.
values close to +1 or -1) and low if observations are widely scattered
(correlation value close to 0).

$$r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\left(\sum (x_i - \bar{x})^2\right)\left(\sum (y_i - \bar{y})^2\right)}}$$
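A minimal sketch of computing Pearson's r from this formula (the data values here are made up for illustration):

```python
import math

def pearson_r(x, y):
    """Pearson correlation: sum of cross-deviations over the product of spreads."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    num = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    den = math.sqrt(sum((xi - mx) ** 2 for xi in x) *
                    sum((yi - my) ** 2 for yi in y))
    return num / den

# Hypothetical data: y rises almost linearly with x, so r should be near +1
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.0, 9.8]
print(round(pearson_r(x, y), 3))
```

Observations lying nearly on a line give a value close to +1, matching the interpretation above.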
Correlation - Cont
Correlation vs Linear Regression
Simple Linear Regression - 1
• Estimate the linear equation for a relationship between continuous
variables so one variable can be predicted or estimated.
• To determine the relationship between one numerical dependent variable
and one numerical or categorical independent variable.
• Measures the strength of association between these variables as in
correlation but it provides more information compared to correlation
method.
• Usually used as the preliminary step of multiple linear regression.
Simple Linear Regression - 2
• As in correlation analysis, linear regression also provides a measure of
the relationship, the coefficient of determination, which
represents the proportion of the variation of the dependent variable explained
by the independent variable.
Simple Linear Regression - 3

• Some examples of research questions for simple linear regression are:
• Is calorie intake associated with body weight?
• Would the amount of exercise in hours predict the change in blood
pressure?
• How much does 15 hours of physical activity change weight in
kilograms?
• How much will 1 g of salt change blood pressure in mmHg in the Perak
population?
Remember!
• Linear regression measures the linear association between
continuous variables and it is useful when a dependent variable and
independent variable are clearly defined.
• Most importantly, linear regression is applied when a prediction of
the target variable is among the objectives of the analysis.
Regression – Population & Sample
Regression Model
• General regression model:
$$Y = \beta_0 + \beta_1 X + \varepsilon$$
• β0 and β1 are unknown parameters; X is a known constant.
• The deviations ε are independent, N(0, σ²).
• The values of the regression parameters β0 and β1 are not
known. We estimate them from data.
• β1 indicates the change in the mean response per unit increase
in X.
Regression Line
• If the scatter plot of the sample data suggests a linear relationship
between two variables, i.e.
$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x,$$
the relationship can be summarized by a straight-line plot.
• The least squares method gives the “best” estimated line for our set of sample
data.
• The least squares method is a statistical procedure that finds the best fit for a
set of data points by minimizing the sum of the squared offsets (residuals) of
points from the fitted line. It is used to predict the behavior of the
dependent variable.
Least Squares Method - 1
• The line of ‘best fit’ is the one for which the differences between actual y
values and predicted y values are at a minimum. Since positive differences
offset negative ones, we square the errors!
• The LS method minimizes the Sum of the Squared Errors (SSE).
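The least squares estimates have a closed form: the slope is the sum of cross-deviations divided by the sum of squared x-deviations, and the intercept follows from the means. A sketch with made-up data that lies exactly on y = 2x + 1, so the fit should recover those coefficients:

```python
def least_squares_fit(x, y):
    """Closed-form OLS estimates for simple linear regression."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b1 = sxy / sxx       # estimated slope
    b0 = my - b1 * mx    # estimated intercept
    return b0, b1

# Hypothetical points generated from y = 2x + 1 with no noise
x = [1, 2, 3, 4]
y = [3, 5, 7, 9]
b0, b1 = least_squares_fit(x, y)
print(b0, b1)  # exact data on the line: intercept 1.0, slope 2.0
```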
Least Squares Method - Cont
Assumption in Simple Linear Regression
• Linear relationship: The relationship between X and the
mean of Y is linear
• Same variance: The errors should have the same variance.
• Independent observations: Observations are independent
of each other.
• Normal distribution of error terms: The residuals εᵢ are
normally distributed.
ANOVA
• ANOVA (analysis of variance) is the term for statistical
analyses of the different sources of variation.
• Partitioning of sums of squares and degrees of freedom
associated with the response variable.
ANOVA Table
ANOVA – SST, SSE, & SSR
The measure of total variation, denoted by SST, is the sum of
the squared deviations:
$$SST = \sum (y_i - \bar{y})^2$$
If SST = 0, all observations are the same (no variability).
The greater SST is, the greater the variation among the y
values.
In the regression model, the measure of variation is the variability
of the y observations around the fitted line:
$$y_i - \hat{y}_i$$
ANOVA – SST, SSE, & SSR
 Sum of Squares Total (SST):
- Measures how much variance is in the dependent variable.
- Made up of the SSE and SSR:
$$SST = \sum_{i=1}^{n}(y_i - \bar{y})^2 = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 + \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2$$

SST = SSE + SSR

Degrees of freedom: n − 1 = (n − 2) + 1
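The identity SST = SSE + SSR can be checked numerically. A sketch that fits a line to small made-up data and verifies the decomposition:

```python
def anova_decomposition(x, y):
    """Fit OLS, then split SST into SSE (residual) and SSR (regression)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b1 = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
          / sum((xi - mx) ** 2 for xi in x))
    b0 = my - b1 * mx
    yhat = [b0 + b1 * xi for xi in x]
    sst = sum((yi - my) ** 2 for yi in y)
    sse = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))
    ssr = sum((yh - my) ** 2 for yh in yhat)
    return sst, sse, ssr

# Hypothetical noisy data
x = [1, 2, 3, 4, 5]
y = [2.0, 4.1, 5.9, 8.2, 9.8]
sst, sse, ssr = anova_decomposition(x, y)
assert abs(sst - (sse + ssr)) < 1e-9  # SST = SSE + SSR holds
```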
Model Evaluation - 1

SLR model evaluation uses the software output:
(i) standard error of estimate (s)
(ii) coefficient of determination (R²)
(iii) hypothesis tests:
a) the t-test of the slope
b) the F-test of the slope
Model Evaluation – Standard Error of Estimate
Compute Standard Error of Estimate
$$SSE = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$

The standard error of estimate is $s = \sqrt{SSE/(n-2)}$.
The smaller the SSE, the more successful the linear regression
model is in explaining y.
Model Evaluation – Coefficient of Determination
• Coefficient of determination:
$$R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}$$
• The proportion of variability in the observed dependent variable that is
explained by the linear regression model.
• The coefficient of determination measures the strength of the
linear relationship, denoted by R².
• The greater R² is, the more successful the linear model.
• An R² value close to 1 indicates a good fit; a value close to 0 indicates
a poor fit.
R-Squared
Model Evaluation – The Hypothesis Test - 1
• Hypothesis testing: Decision-making procedure
about the null hypothesis
• The Null Hypothesis (H0):
• The hypothesis that cannot be viewed as false unless
sufficient evidence to the contrary is obtained.
• The Alternative Hypothesis (H1):
• The hypothesis against which the null hypothesis is
tested, and which is viewed as true when H0 is declared false.
Model Evaluation – The Hypothesis Test - 2

• Hypothesis test
• A process that uses sample statistics to test a claim about the value of a
population parameter.
• Example: An automobile manufacturer advertises that its new hybrid car
has a mean mileage of 50 miles per gallon. To test this claim, a sample
would be taken. If the sample mean differs enough from the advertised
mean, you can decide the advertisement is wrong.
Model Evaluation – The Hypothesis Test - 3

• One-sided (tailed) lower-tail test:  H0: μ ≥ μ0 (or μ = μ0)  vs  H1: μ < μ0
• One-sided (tailed) upper-tail test:  H0: μ ≤ μ0 (or μ = μ0)  vs  H1: μ > μ0
• Two-sided (tailed) test:  H0: μ = μ0  vs  H1: μ ≠ μ0

Note: μ0 is the value given/assumed for the parameter μ.


Model Evaluation – The Hypothesis Test - 4

[Figures: rejection regions for the one-sided upper-tail test, the one-sided lower-tail test, and the two-sided test.]
Model Evaluation – The Hypothesis Test - 5
• Equivalence of F-test and t-test: for a given α level, the F-test
of β1 = 0 versus β1 ≠ 0 is algebraically equivalent to
the two-sided t-test.
• Thus, at a given level, we can use either the t-test or the F-test
for testing β1 = 0 versus β1 ≠ 0.
• The t-test is more flexible, since it can also be used for one-sided
tests.
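The equivalence can be seen numerically: in simple linear regression, F = MSR/MSE equals the square of the slope's t statistic. A sketch with made-up noisy data:

```python
import math

def slope_t_and_F(x, y):
    """Return (t, F) for testing beta1 = 0 in simple linear regression."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    b1 = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    b0 = my - b1 * mx
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    ssr = sum(((b0 + b1 * xi) - my) ** 2 for xi in x)
    mse = sse / (n - 2)
    t = b1 / math.sqrt(mse / sxx)  # t statistic for the slope
    F = (ssr / 1) / mse            # F statistic with (1, n-2) df
    return t, F

# Hypothetical noisy data
x = [1, 2, 3, 4, 5, 6]
y = [1.2, 2.9, 4.1, 5.2, 6.8, 7.9]
t, F = slope_t_and_F(x, y)
assert abs(F - t * t) < 1e-9  # F = t^2, so the two tests agree
```

This is why the two-sided t-test and the F-test always give the same decision for the slope.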
Model Evaluation – The Hypothesis Test
- t-test- 6
• The t-test is used to check whether there is an adequate relationship
between x and y.
• Test the hypotheses:
H0: β1 = 0 (no relationship between x and y)
H1: β1 ≠ 0 (there is a relationship between x and y)
• Test statistic (t-distribution):
$$T = \frac{\hat{\beta}_1 - \beta_1}{\sqrt{\hat{\sigma}^2 / s_{xx}}} = \frac{\hat{\beta}_1 - \beta_1}{se(\hat{\beta}_1)}$$
• Critical region: |T| > t_{α/2, n−2}.
Model Evaluation – The Hypothesis Test
- F-test- 7
• In order to construct a statistical
decision rule, we need to know the
distribution of our test statistic F:
$$F = \frac{MSR}{MSE}$$
• When H0 is true, the test statistic F follows
the F-distribution with 1 and n−2 degrees of
freedom: F(α; 1, n−2).
Model Evaluation – The Hypothesis Test
- F-test- 8
• This time we will use the F-test; the null and alternative
hypotheses are:
H0: β1 = 0
Ha: β1 ≠ 0
• Construction of the decision rule: at the α = 5% level, reject H0 if
F > F(α; 1, n−2).
• Large values of F support Ha; values of F near 1 support H0.
Model Evaluation – P-value - 9
• In statistics, the p-value is the probability of obtaining results at
least as extreme as the observed results of a statistical
hypothesis test, assuming that the null hypothesis is correct.
A smaller p-value means that there is
stronger evidence in favor of the
alternative hypothesis. When the p-value
is small, we reject the null
hypothesis H0.
Regression in MS Excel

Excel steps and outputs:
https://colab.research.google.com/drive/1uBhaedip2COZWGkhwqlF1aFqt1NV-7ev?usp=sharing

Example 1
The manager of a car plant wishes to investigate how the plant's
electricity usage depends upon the plant's production. The data are
given below; estimate the linear regression equation.

Production (x) ($M):        4.51  3.58  4.31  5.06  5.64  4.99  5.29  5.83  4.70  5.61  4.90  4.20
Electricity usage (y) (kWh): 2.48  2.26  2.47  2.77  2.99  3.05  3.18  3.46  3.03  3.26  2.67  2.53

i. Estimate the linear regression equation.
ii. Find the standard error of estimate of this regression.
iii. Determine the coefficient of determination of this regression.
iv. Test for significance of regression at the 5% significance level.
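A sketch that works through all four questions from first principles, using the formulas defined earlier in these notes:

```python
import math

# Example 1 data: production ($M) and electricity usage (kWh)
x = [4.51, 3.58, 4.31, 5.06, 5.64, 4.99, 5.29, 5.83, 4.7, 5.61, 4.9, 4.2]
y = [2.48, 2.26, 2.47, 2.77, 2.99, 3.05, 3.18, 3.46, 3.03, 3.26, 2.67, 2.53]

n = len(x)
mx, my = sum(x) / n, sum(y) / n
sxx = sum((xi - mx) ** 2 for xi in x)
sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))

b1 = sxy / sxx                 # slope
b0 = my - b1 * mx              # intercept
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
sst = sum((yi - my) ** 2 for yi in y)
s = math.sqrt(sse / (n - 2))   # standard error of estimate
r2 = 1 - sse / sst             # coefficient of determination
t = b1 / (s / math.sqrt(sxx))  # t statistic for the slope
F = ((sst - sse) / 1) / (sse / (n - 2))  # F statistic, (1, n-2) df

print(round(b0, 4), round(b1, 4))  # ~0.4091, ~0.4988
print(round(s, 4), round(r2, 4))   # ~0.1729, ~0.8021
print(round(t, 2), round(F, 2))    # ~6.37, ~40.53
```

These values match the MS Excel regression output shown on the next slides.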
Set up the hypothesis
• H0: β1 = 0 (there is no relationship between x and y)
• There is no relationship between Production and
Electricity Usage.
• H1: β1 ≠ 0 (the straight-line model is adequate)
• There is a relationship between Production and
Electricity Usage.
MS Excel Results
Regression Statistics
Multiple R 0.895605603
R Square 0.802109396
Adjusted R Square 0.782320336
Standard Error 0.172947969
Observations 12

ANOVA
df SS MS F Significance F
Regression 1 1.212381668 1.21238 40.53297031 8.1759E-05
Residual 10 0.299109998 0.02991
Total 11 1.511491667

Coefficients Standard Error t Stat P-value Lower 95% Upper 95% Lower 95.0% Upper 95.0%
Intercept 0.409048191 0.385990515 1.05974 0.314189743 -0.450992271 1.269088653 -0.45099227 1.269088653
X Variable 1 0.498830121 0.078351706 6.36655 8.1759E-05 0.324251642 0.673408601 0.32425164 0.673408601
MS Excel Result – Regression Line
Example 1 – Summary - 1
Estimated regression line: ŷ = 0.4091 + 0.4988x
Electricity usage = 0.4091 + 0.4988 × Production
Standard error of estimate = 0.173
Coefficient of determination R² = 0.802
α = 0.05; t_{α/2, n−2} = t_{0.025, 10} = 2.228
T = 6.37; critical region: |T| > t_{α/2, n−2}
Since 6.37 > 2.228, reject H0; thus, electricity usage does
depend on the level of production.
Example 1 – Summary - 2
• Using the F-test, the null and alternative hypotheses are:
H0: β1 = 0
Ha: β1 ≠ 0
• α = 0.05. Since n = 12, we require F(0.05; 1, 10).
• From the table, F(0.05; 1, 10) = 4.96.
• The decision rule leads us to reject H0, since:
F = 40.53 > 4.96
In conclusion, there is a linear association between
electricity usage and the level of production.
Example 1 - Interpretation
• Production coefficient (β1 = 0.498): each unit ($1M) increase in
Production adds 0.498 to Electricity usage.
• β1 > 0 (positive relationship): Electricity usage increases
as Production increases.
• Intercept coefficient (β0 = 0.409): the Electricity usage when
Production equals zero.
• R Square = 0.802: the model explains 80% of the
total variability in electricity usage around its mean (a good fit).
• P-value < 0.05: the regression is significant. The change in
production impacts electricity usage.
Example 2 – Application of SLR to Reservoir
Quality Index (RQI)
• Example: Given data on Permeability and
Reservoir Quality Index, RQI, investigate the
dependence of RQI (Y) on Permeability (X).
• Set up the hypothesis:
• H0: β1 = 0 (there is no relationship between x and y)
• There is no relationship between RQI and
Permeability.
• H1: β1 ≠ 0 (the straight-line model is adequate)
• There is a relationship between RQI and Permeability.
Excel Results – Example 2
Regression Statistics
Multiple R 0.680322
R Square 0.462837
Adjusted R
Square 0.461716
Standard Error 0.40947
Observations 481

ANOVA
df SS MS F Significance F
Regression 1 69.19926 69.19926 412.7226 1.22E-66
Residual 479 80.31167 0.167665
Total 480 149.5109

Coefficients Standard Error t Stat P-value


Intercept 0.309739 0.019769 15.66798 5.73E-45
Permeability
(md) 0.00171 8.42E-05 20.31558 1.22E-66
Excel Results – Example 2
[Figure: Permeability (md) line fit plot — observed and predicted RQI vs Permeability, with fitted line y = 0.3097 + 0.0017x, R² = 0.4628.]
Example 2 – Interpretation of the results
• Permeability (md) coefficient (β1 = 0.0017): each unit increase in
Permeability adds 0.0017 to the RQI value.
• β1 > 0 (positive relationship): RQI increases as
Permeability increases.
• Intercept coefficient (β0 = 0.309): the value of RQI when
Permeability equals zero.
• R Square = 0.462837: the model explains 46% of the
total variability in the RQI values around the mean.
• P-value < 0.05: the regression is significant.
Example 2 (b) – Application of SLR to
Reservoir Quality Index (RQI)
• Example: Given data on Permeability and Reservoir Quality
Index, RQI, investigate the dependence of RQI (Y) on
Permeability (X). For this example, outliers have been
detected and removed.
• Set up the hypothesis:
• H0: β1 = 0 (there is no relationship between x and y)
• There is no relationship between RQI and Permeability.
• H1: β1 ≠ 0 (the straight-line model is adequate)
• There is a relationship between RQI and Permeability.
Example 2 (b) - Application of SLR to
Reservoir Quality Index (RQI)
Outlier detection (IQR method):
Q1 = 0.125452
Q3 = 0.536934
IQR = 0.411482
Lower bound = -0.49177
Upper bound = 1.154156

41 outliers were detected and treated.
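The bounds above follow the usual 1.5 × IQR rule. A sketch reproducing them from the reported quartiles:

```python
# Reported quartiles of RQI (from the slide)
q1 = 0.125452
q3 = 0.536934
iqr = q3 - q1  # interquartile range

lower = q1 - 1.5 * iqr  # observations below this are flagged as outliers
upper = q3 + 1.5 * iqr  # observations above this are flagged as outliers
print(round(lower, 5), round(upper, 5))  # ~-0.49177, ~1.15416
```

Any RQI value outside [lower, upper] is treated as an outlier and removed before refitting.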
Excel Results – Example 2(b)
[Figure: RQI vs Permeability scatter with fitted line, outliers removed; observed RQI values on the Y-axis, Permeability on the X-axis.]
Excel Results – Example 2(b)
Regression Statistics
Multiple R 0.851746 What can you conclude here????
R Square 0.725471
Adjusted R
Square 0.724844
Standard Error 0.134745
Observations 440

ANOVA

df SS MS F Significance F
Regression 1 21.01512 21.01511817 1157.460884 5.0238E-125
Residual 438 7.952426 0.018156223
Total 439 28.96754

Coefficients Standard Error t Stat P-value


Intercept 0.186763 0.007286 25.63358468 3.4304E-89
Permeability
(md) 0.003609 0.000106 34.02147681 5.0238E-125
Example 2 (b) – Interpretation of the results
• Permeability (md) coefficient (β1 = 0.003609): each unit increase
in Permeability adds 0.00361 to RQI.
• β1 > 0 (positive relationship): RQI increases as
Permeability increases.
• Intercept coefficient (β0 = 0.186763): the value of RQI when
Permeability equals zero.
• R Square = 0.725471: the model explains 73% of the
total variability in the RQI values around the mean. This value
improved significantly with the removal of outliers.
• P-value < 0.05: the regression is significant.
Thank You
Q&A
