
LINEAR REGRESSION ANALYSIS
The concept of regression analysis
deals with finding the best
relationship between Y and x,
quantifying the strength of that
relationship, and using methods that
allow for prediction of the response
given values of the regressor x.
Thus, we wish to estimate the value of a
variable Y (dependent variable) corresponding
to a given value of a variable X (independent
variable or regressor). This can be
accomplished by estimating the value of Y from
a least-squares curve that fits the sample data.
The resulting curve is called a regression curve
of Y on X, since Y is estimated from X.
For a simple linear regression, where there is
only 1 dependent variable and 1 independent
variable, we have

$$Y = a + bx$$
Where
a is the y-intercept of the line
b is the slope of the regression line
For a quick review,
• The slope represents the amount by which the dependent variable increases or decreases for a unit increase or decrease in the independent variable.
• The intercept indicates the value of the dependent variable when the independent variable takes the value zero.
The preceding equation is also called the least-squares regression equation. It defines the least-squares regression line, the best-fitting line for summarizing the relationship between two variables measured at the interval and/or ratio scale.
To find the y-intercept and the slope, we have

$$a = \bar{Y} - b\bar{x} \quad \text{and} \quad b = \frac{N\sum xY - (\sum x)(\sum Y)}{N\sum x^{2} - (\sum x)^{2}}$$

Note that the numerator of b is the same as that of Pearson's product-moment correlation coefficient between Y and X; in fact, $b = r\,(s_Y/s_x)$, so the slope always carries the same sign as the correlation.
Then, we can create an equation that predicts the value of Y:

$$\hat{Y} = a + bx$$

In linear regression, the smaller the spread of the observations around the best-fitted regression line, the more accurate our prediction of values of Y from values of x will be.
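To make the least-squares formulas above concrete, here is a minimal Python sketch (not part of the original material; the function and variable names are illustrative) that computes the intercept a and slope b from paired observations and then predicts Y for a new x:

```python
def simple_linear_regression(x, y):
    """Return (a, b) for the least-squares line Y-hat = a + b*x."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi ** 2 for xi in x)

    # b = [N*sum(xY) - sum(x)*sum(Y)] / [N*sum(x^2) - (sum(x))^2]
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    # a = Y-bar - b * x-bar
    a = sum_y / n - b * (sum_x / n)
    return a, b


def predict(a, b, x_new):
    """Prediction: Y-hat = a + b*x."""
    return a + b * x_new
```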
Example:
The amount spent on medical expenses (Medcost) per year is correlated with alcohol consumption for 15 adult males. For alcohol consumption, the recorded data are the amounts spent on alcohol per week, shown in Table 1. (a) Find the linear regression equation of the medcost. (b) Estimate the value of the medcost if the money spent on alcohol per week is 35. (Standard deviations: Medcost = 544.40; Alcohol = 9.54)
Solution:

$$b = \frac{N\sum xY - (\sum x)(\sum Y)}{N\sum x^{2} - (\sum x)^{2}} = \frac{(15)(778510) - (295)(36927)}{(15)(7075) - (295)^{2}} = 41.057$$

$$a = \bar{Y} - b\bar{x} = \frac{\sum Y}{n} - b\,\frac{\sum x}{n} = \frac{36927}{15} - 41.057\left(\frac{295}{15}\right) = 1654.346$$

a. So, the simple linear regression equation for the medcost is

$$\hat{Y} = a + bx = 1654.346 + 41.057x$$

b. If the money spent on alcohol per week is 35, we can predict that the medical cost would be

$$\hat{Y} = 1654.346 + 41.057(35) = 3091.341$$
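As a quick arithmetic check of parts (a) and (b), the sums used in the solution (N = 15, Σx = 295, ΣY = 36927, ΣxY = 778510, Σx² = 7075) can be plugged into the same formulas; this Python snippet is only a verification sketch:

```python
# Arithmetic check using the sums shown in the solution above.
N, sum_x, sum_y = 15, 295, 36927
sum_xy, sum_x2 = 778510, 7075

b = (N * sum_xy - sum_x * sum_y) / (N * sum_x2 - sum_x ** 2)  # about 41.057
a = sum_y / N - b * (sum_x / N)                               # about 1654.35
y_hat = a + b * 35                                            # about 3091.34
print(round(b, 3), round(a, 3), round(y_hat, 3))
```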


Example:
A regional development council, composed of members from several companies, gathered the data in the table below from selected companies in a region that produce office and school supplies. Solve for the linear regression equation and predict the price if the cost is 50 (a computational sketch follows the table).
COMPANY PRODUCTION COST FINISHED-PRODUCT PRICE
A 40 150
B 38 140
C 48 160
D 56 170
E 70 180
F 90 165
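One possible way to work this exercise (a sketch, not a worked solution from the slides) is to apply the same sum formulas to the table values in Python:

```python
# Production-cost example: least-squares line of price on cost.
cost = [40, 38, 48, 56, 70, 90]           # production cost (x)
price = [150, 140, 160, 170, 180, 165]    # finished-product price (Y)

n = len(cost)
sum_x, sum_y = sum(cost), sum(price)
sum_xy = sum(x * y for x, y in zip(cost, price))
sum_x2 = sum(x ** 2 for x in cost)

b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = sum_y / n - b * (sum_x / n)
print(f"Y-hat = {a:.2f} + {b:.3f} x")
print("Predicted price at cost 50:", round(a + b * 50, 2))
```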
Multiple linear regression involves two or more independent variables; here we treat the two-predictor case. The least-squares multiple regression equation is

$$\hat{Y} = a + b_{1}x_{1} + b_{2}x_{2}, \qquad a = \bar{Y} - b_{1}\bar{x}_{1} - b_{2}\bar{x}_{2}$$
Where
Ŷ is the estimated value of the dependent variable
a is the y-intercept of the regression line
b1 is the partial slope of the linear relationship between the first independent variable x1 and Y when x2 is held constant
x1 is the value of the first independent variable
b2 is the partial slope of the linear relationship between the second independent variable x2 and Y when x1 is held constant
x2 is the value of the second independent variable
The partial slope denotes the amount of change in Y for a unit change in that independent variable while controlling for the effect of the other independent variable in the equation.
We can solve for the partial slopes using

$$b_{1} = \frac{S_{y}}{S_{1}} \cdot \frac{r_{y1} - r_{y2}\,r_{12}}{1 - r_{12}^{2}} \quad \text{and} \quad b_{2} = \frac{S_{y}}{S_{2}} \cdot \frac{r_{y2} - r_{y1}\,r_{12}}{1 - r_{12}^{2}}$$

The bivariate correlations are solved using the formula for the Pearson r:

$$r = \frac{N\sum xY - (\sum x)(\sum Y)}{\sqrt{[N\sum x^{2} - (\sum x)^{2}]\,[N\sum Y^{2} - (\sum Y)^{2}]}}$$
To find the multiple correlation coefficient with one dependent variable (Y) and two independent variables (X and Z), we have

$$R = R_{Y.XZ} = \sqrt{\frac{r_{YX}^{2} + r_{YZ}^{2} - 2\,r_{YX}\,r_{YZ}\,r_{XZ}}{1 - r_{XZ}^{2}}}$$

When X is the dependent variable and Y and Z are the independent variables, we have

$$R = R_{X.YZ} = \sqrt{\frac{r_{XY}^{2} + r_{XZ}^{2} - 2\,r_{XY}\,r_{XZ}\,r_{YZ}}{1 - r_{YZ}^{2}}}$$

The pattern is the same when Z is the dependent variable and the other two are the independent variables.
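The partial-slope, intercept, and multiple-R formulas above can be collected into one small helper. This is a sketch for the two-predictor case; the function name two_predictor_regression and its argument names are illustrative, not from the document:

```python
from math import sqrt

def two_predictor_regression(y_bar, x1_bar, x2_bar, s_y, s1, s2, r_y1, r_y2, r_12):
    """Partial slopes, intercept, and multiple R for Y-hat = a + b1*x1 + b2*x2,
    computed from means, standard deviations, and pairwise correlations."""
    denom = 1 - r_12 ** 2
    b1 = (s_y / s1) * (r_y1 - r_y2 * r_12) / denom
    b2 = (s_y / s2) * (r_y2 - r_y1 * r_12) / denom
    a = y_bar - b1 * x1_bar - b2 * x2_bar
    R = sqrt((r_y1 ** 2 + r_y2 ** 2 - 2 * r_y1 * r_y2 * r_12) / denom)
    return a, b1, b2, R
```

Given only summary statistics (means, standard deviations, and the three pairwise correlations), this returns the same quantities the worked example below computes by hand.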
Example:
Eleven student teachers took part in an
evaluation program designed to measure
teacher effectiveness and determine what
factors are important. The response measure
was a quantitative evaluation of the teacher. The
regressor variables were scores on two
standardized tests given to each teacher. (sy =
105.69, sx1 = 24.99, sx2 = 13.42)
a. Find the multiple linear regression
equation.
b. Predict the response score if the two
standardized tests are 50 and 120, respectively.
c. Calculate the multiple correlation
coefficient.
Solution:
Solving first for the means, we have

$$\bar{x}_{1} = \frac{\sum x_{1}}{n} = \frac{728}{11} = 66.18, \qquad \bar{x}_{2} = \frac{\sum x_{2}}{n} = \frac{1549}{11} = 140.82, \qquad \bar{y} = \frac{\sum y}{n} = \frac{4731}{11} = 430.09$$
Now, let’s solve for the bivariate correlations for each combination,
(11)(319880) − (728)(4731)
𝑟𝑥1𝑦 =
[(11)(54424) − 728 2 ][(11)(2146461) − 4731 2 ]
𝑟𝑥1𝑦 = 0.256
(11)(670331) − (1549)(4731)
𝑟𝑥2𝑦 =
[(11)(219927) − 1549 2 ][(11)(2146461) − 4731 2 ]
𝑟𝑥2𝑦 = 0.291
(11)(102799) − (728)(1549)
𝑟𝑥1𝑥2 =
[(11)(54424) − 728 2 ][(11)(219927) − 1549 2 ]
𝑟𝑥1𝑥2 = 0.085
Next, solving for the partial slopes,

$$b_{x_{1}} = \frac{105.69}{24.99} \cdot \frac{0.256 - (0.291)(0.085)}{1 - (0.085)^{2}} = 0.988$$

$$b_{x_{2}} = \frac{105.69}{13.42} \cdot \frac{0.291 - (0.256)(0.085)}{1 - (0.085)^{2}} = 2.13$$
Then, solving for the y-intercept of the regression line,

$$a = \bar{Y} - b_{1}\bar{x}_{1} - b_{2}\bar{x}_{2} = 430.09 - 0.988(66.18) - 2.13(140.82) = 64.21$$

a. Thus, the multiple linear regression equation is

$$\hat{Y} = a + b_{x_{1}}x_{1} + b_{x_{2}}x_{2} = 64.21 + 0.988\,x_{1} + 2.13\,x_{2}$$

b. If the scores on the two standardized tests are 50 and 120, respectively, we have

$$\hat{Y} = 64.21 + 0.988(50) + 2.13(120) = 369.21$$
c. Calculating the multiple correlation coefficient, we have

$$R = \sqrt{\frac{r_{x_{1}Y}^{2} + r_{x_{2}Y}^{2} - 2\,r_{x_{1}Y}\,r_{x_{2}Y}\,r_{x_{1}x_{2}}}{1 - r_{x_{1}x_{2}}^{2}}} = \sqrt{\frac{(0.256)^{2} + (0.291)^{2} - 2(0.256)(0.291)(0.085)}{1 - (0.085)^{2}}} = 0.372$$
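As a rough check of this worked example, the rounded summary values from the solution can be substituted into the same formulas in Python. Because the slide works with rounded intermediate values (correlations to three decimals), the printed coefficients may differ slightly from 0.988, 2.13, and 64.21:

```python
from math import sqrt

# Rounded summary values taken from the solution above.
s_y, s1, s2 = 105.69, 24.99, 13.42
y_bar, x1_bar, x2_bar = 430.09, 66.18, 140.82
r_y1, r_y2, r_12 = 0.256, 0.291, 0.085

denom = 1 - r_12 ** 2
b1 = (s_y / s1) * (r_y1 - r_y2 * r_12) / denom
b2 = (s_y / s2) * (r_y2 - r_y1 * r_12) / denom
a = y_bar - b1 * x1_bar - b2 * x2_bar
R = sqrt((r_y1 ** 2 + r_y2 ** 2 - 2 * r_y1 * r_y2 * r_12) / denom)

print(round(b1, 3), round(b2, 3), round(a, 2), round(R, 3))
print("Prediction at (50, 120):", round(a + b1 * 50 + b2 * 120, 2))
```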
SPSS Portion:
Assumptions for Simple Linear Regression:
1. The two variables should be measured at the continuous level.
2. There needs to be a linear relationship between the two variables.
3. There should be no significant outliers.
4. There should be independence of observations, which you can easily check using the Durbin-Watson statistic. The Durbin-Watson statistic is a test for autocorrelation in a regression model's output. The DW statistic ranges from zero to four, with a value of 2.0 indicating zero autocorrelation. Values below 2.0 indicate positive autocorrelation and values above 2.0 indicate negative autocorrelation (a small computational sketch follows this list).
5. The data needs to show homoscedasticity, meaning the variance of the observations around the line of best fit remains similar as you move along the line.
6. Check that the residuals (errors) of the regression line are approximately normally distributed.
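The Durbin-Watson statistic in item 4 is the sum of squared successive differences of the residuals divided by the sum of squared residuals. A minimal sketch, assuming NumPy is available (SPSS reports the same statistic in its regression output):

```python
import numpy as np

def durbin_watson(residuals):
    """Durbin-Watson statistic: sum of squared successive differences of the
    residuals divided by the sum of squared residuals. Values near 2 suggest
    little autocorrelation; below 2 positive, above 2 negative."""
    e = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)
```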
Assumptions for Multiple Linear Regression:
1. The dependent variable should be measured on a continuous scale. If your dependent variable was measured on an ordinal scale, you will need to carry out ordinal regression rather than multiple regression. (Unfortunately, we will not include it in this learning guide. Instead, check this link for further studies on ordinal regression: https://statistics.laerd.com/spss-tutorials/ordinal-regression-using-spss-statistics.php)
2. There need to be two or more independent variables, which can be either continuous or categorical.
3. There should be independence of observations, which you can easily check using the Durbin-Watson statistic.
4. There needs to be a linear relationship between (a) the dependent variable and each of your independent variables, and (b) the dependent variable and the independent variables collectively.
5. The data needs to show homoscedasticity.
6. Your data must not show multicollinearity, which occurs when you have two or more independent variables that are highly correlated with each other (see the sketch after this list).
7. There should be no significant outliers, high-leverage points, or highly influential points.
8. Check that the residuals (errors) are approximately normally distributed.
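For the two-predictor case used in this guide, multicollinearity (assumption 6) can be screened by inspecting the correlation between the predictors; the $1 - r_{12}^{2}$ denominator in the partial-slope formulas shows why a correlation near ±1 causes trouble. A minimal sketch, assuming NumPy and one column per predictor (the data reuse the house-price example below):

```python
import numpy as np

def predictor_correlations(X):
    """Correlation matrix of the predictor columns (rows = observations).
    Large off-diagonal absolute values warn of multicollinearity."""
    return np.corrcoef(np.asarray(X, dtype=float), rowvar=False)

# Illustrative use with the two predictors from the house-price example below.
X = [[3, 2], [3, 3], [4, 3], [2, 3], [4, 2],
     [4, 4], [3, 4], [5, 3], [2, 2], [4, 4]]
print(predictor_correlations(X))
```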
Example:
The data in the table below give us the price of a house based on the number of bedrooms and the number of bathrooms for 10 houses. Find the regression equation of the price. Estimate the value of the price if the house has 6 bedrooms and 2 bathrooms. Find also the multiple correlation coefficient R.

Price (Y)   Bedrooms (X)   Bathrooms (Z)
165,000     3              2
200,000     3              3
225,000     4              3
180,000     2              3
202,000     4              2
250,000     4              4
275,000     3              4
300,000     5              3
155,000     2              2
230,000     4              4
Standard deviations: Price = 47130.55, Bedrooms = 0.966, Bathrooms = 0.8165
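The slides do not include a worked solution for this exercise, so the following is only one possible sketch, assuming NumPy; it follows the same correlation-based formulas as the teacher-evaluation example (ddof=1 reproduces the sample standard deviations quoted under the table):

```python
import numpy as np

price     = np.array([165000, 200000, 225000, 180000, 202000,
                      250000, 275000, 300000, 155000, 230000], dtype=float)
bedrooms  = np.array([3, 3, 4, 2, 4, 4, 3, 5, 2, 4], dtype=float)
bathrooms = np.array([2, 3, 3, 3, 2, 4, 4, 3, 2, 4], dtype=float)

# Bivariate correlations and sample standard deviations.
r_yx = np.corrcoef(price, bedrooms)[0, 1]
r_yz = np.corrcoef(price, bathrooms)[0, 1]
r_xz = np.corrcoef(bedrooms, bathrooms)[0, 1]
s_y, s_x, s_z = price.std(ddof=1), bedrooms.std(ddof=1), bathrooms.std(ddof=1)

# Partial slopes, intercept, and multiple R (same formulas as in the text).
denom = 1 - r_xz ** 2
b1 = (s_y / s_x) * (r_yx - r_yz * r_xz) / denom
b2 = (s_y / s_z) * (r_yz - r_yx * r_xz) / denom
a = price.mean() - b1 * bedrooms.mean() - b2 * bathrooms.mean()
R = np.sqrt((r_yx ** 2 + r_yz ** 2 - 2 * r_yx * r_yz * r_xz) / denom)

print(f"Y-hat = {a:.2f} + {b1:.2f} X + {b2:.2f} Z")
print("Predicted price for 6 bedrooms, 2 bathrooms:", round(a + b1 * 6 + b2 * 2, 2))
print("Multiple R:", round(R, 3))
```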
