
Simple Linear Regression

Simple Linear Regression

• Our objective is to study the relationship between two variables, X and Y.
• One way is by means of regression.
• Regression analysis is the process of estimating a functional relationship between X and Y. A regression equation is often used to predict a value of Y for a given value of X.
• Another way to study the relationship between two variables is correlation, which measures the direction and the strength of the linear relationship.

2
First-Order Linear Model = Simple Linear Regression Model

$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$

where
y = dependent variable
x = independent variable
$\beta_0$ = y-intercept
$\beta_1$ = slope of the line
$\varepsilon$ = error variable

3
Simple Linear Model

$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$

This model is
– Simple: only one X
– Linear in the parameters: no parameter appears as an exponent or is multiplied or divided by another parameter
– Linear in the predictor variable (X): X appears only to the first power.

4
Examples
• Multiple Linear Regression:
  $Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \varepsilon_i$
• Polynomial Linear Regression:
  $Y_i = \beta_0 + \beta_1 X_i + \beta_2 X_i^2 + \varepsilon_i$
• Linear Regression:
  $\log_{10}(Y_i) = \beta_0 + \beta_1 X_i + \beta_2 \exp(X_i) + \varepsilon_i$
• Nonlinear Regression:
  $Y_i = \beta_0 / (1 + \beta_1 \exp(\beta_2 X_i)) + \varepsilon_i$
"Linear" or "nonlinear" refers to the parameters.
5
Deterministic Component of Model

[Figure: graph of the deterministic line $y = \hat\beta_0 + \hat\beta_1 x$, with the y-intercept $\hat\beta_0$ and the slope $\hat\beta_1 = \Delta y / \Delta x$ marked.]

6
Mathematical vs Statistical Relation

[Figure: scatterplot of data with the fitted line $\hat{y} = -5.3562 + 3.3988x$.]

7
Error

• The scatterplot shows that the points are not on a line, so in addition to the relationship we also describe the error:
  $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$,  i = 1, 2, ..., n
• The y's are the response (or dependent) variable, the x's are the predictors (or independent variables), and the $\varepsilon$'s are the errors. We assume that the errors are normal, mutually independent, and have common variance $\sigma^2$.

8
Least Squares:

Minimize $\sum_{i=1}^{n} \varepsilon_i^2 = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2$

The fitted values are $\hat{y}_i = \hat\beta_0 + \hat\beta_1 x_i$.

• The quantities $R_i = y_i - \hat{y}_i$ are called the residuals. If we assume a normal error, they should look normal.
• Error: $Y_i - E(Y_i)$, unknown; Residual: $R_i = y_i - \hat{y}_i$, estimated, i.e., known.

9
Minimizing error

10
• The Simple Linear Regression Model:
  $y = \beta_0 + \beta_1 x + \varepsilon$
• The Least Squares Regression Line:
  $\hat{y} = \hat\beta_0 + \hat\beta_1 x$
where
  $\hat\beta_1 = \dfrac{SS_{xy}}{SS_x}$,   $\hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x}$
  $SS_x = \sum (x_i - \bar{x})^2 = \sum x_i^2 - \dfrac{(\sum x_i)^2}{n}$
  $SS_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y}) = \sum x_i y_i - \dfrac{(\sum x_i)(\sum y_i)}{n}$

11
What form does the error take?
• Each observation may be decomposed into two parts:
  $y = \hat{y} + (y - \hat{y})$
• The first part is used to determine the fit, and the second to estimate the error.
• We estimate the standard deviation of the error starting from the sum of squared errors:
  $SSE = \sum (y_i - \hat{y}_i)^2 = SS_y - \dfrac{SS_{xy}^2}{SS_x}$

12
Estimate of $\sigma^2$

• We estimate $\sigma^2$ by
  $s^2 = \dfrac{SSE}{n-2} = MSE$

13
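The estimation formulas on the last three slides translate directly into code. The sketch below, written in Python with NumPy for these notes (it is not part of the original slides, and the function name fit_simple_ols is just an illustrative choice), computes $SS_x$, $SS_{xy}$, the least squares coefficients, SSE, and MSE for any paired data.

```python
import numpy as np

def fit_simple_ols(x, y):
    """Simple linear regression via the sum-of-squares formulas on slides 11-13.

    Returns the intercept, slope, SSE, and MSE = SSE / (n - 2).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)

    # SSx = sum(x^2) - (sum x)^2 / n,  SSxy = sum(xy) - (sum x)(sum y) / n
    ss_x = np.sum(x**2) - np.sum(x)**2 / n
    ss_xy = np.sum(x * y) - np.sum(x) * np.sum(y) / n
    ss_y = np.sum(y**2) - np.sum(y)**2 / n

    b1 = ss_xy / ss_x                 # slope
    b0 = y.mean() - b1 * x.mean()     # intercept

    sse = ss_y - ss_xy**2 / ss_x      # sum of squared errors
    mse = sse / (n - 2)               # estimate of sigma^2
    return b0, b1, sse, mse
```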
Example

• An educational economist wants to establish the relationship between an individual's income and education. He takes a random sample of 10 individuals and asks for their income (in $1000s) and education (in years). The results are shown below. Find the least squares regression line.

Education 11 12 11 15  8 10 11 12 17 11
Income    25 33 22 41 18 28 32 24 53 26

14
Dependent and Independent Variables

• The dependent variable is the one that we want to forecast or analyze.
• The independent variable is hypothesized to affect the dependent variable.
• In this example, we wish to analyze income, and we choose education as the variable that most affects income. Hence, y is income and x is education.

15
First Step:

$\sum x_i = 118$
$\sum x_i^2 = 1450$
$\sum y_i = 302$
$\sum y_i^2 = 10072$
$\sum x_i y_i = 3779$

16
Sum of Squares:

$SS_{xy} = \sum x_i y_i - \dfrac{(\sum x_i)(\sum y_i)}{n} = 3779 - \dfrac{(118)(302)}{10} = 215.4$

$SS_x = \sum x_i^2 - \dfrac{(\sum x_i)^2}{n} = 1450 - \dfrac{(118)^2}{10} = 57.6$

Therefore,
$\hat\beta_1 = \dfrac{SS_{xy}}{SS_x} = \dfrac{215.4}{57.6} = 3.74$
$\hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x} = \dfrac{302}{10} - 3.74 \cdot \dfrac{118}{10} = -13.93$

17
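As a quick check, the same numbers can be reproduced with a few lines of NumPy. This snippet is an illustration added to these notes, not part of the original slides; it uses only the data given above.

```python
import numpy as np

education = np.array([11, 12, 11, 15, 8, 10, 11, 12, 17, 11], dtype=float)
income = np.array([25, 33, 22, 41, 18, 28, 32, 24, 53, 26], dtype=float)

n = len(education)
ss_x = np.sum(education**2) - np.sum(education)**2 / n                        # 57.6
ss_xy = np.sum(education * income) - np.sum(education) * np.sum(income) / n  # 215.4

b1 = ss_xy / ss_x                              # approximately 3.74
b0 = income.mean() - b1 * education.mean()     # approximately -13.93
print(f"y-hat = {b0:.2f} + {b1:.2f} x")
```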
The Least Squares Regression Line

• The least squares regression line is
  $\hat{y} = -13.93 + 3.74x$
• Interpretation of coefficients:
  – The sample slope $\hat\beta_1 = 3.74$ tells us that, on average, for each additional year of education an individual's income rises by $3.74 thousand.
  – The y-intercept is $\hat\beta_0 = -13.93$. This value would be the expected (or average) income for an individual with 0 years of education, which is meaningless here.

18
Example

• Car dealers across North America use the "red book" to determine a car's selling price on the basis of important features. One of these is the car's current odometer reading.
• To examine this issue, 100 three-year-old cars in mint condition were randomly selected. Their selling price and odometer reading were observed.

19
Portion of the data file

Odometer Price
37388 5318
44758 5061
45833 5008
30862 5795
….. …
34212 5283
33190 5259
39196 5356
36392 5133

20
Example (Minitab Output)

Regression Analysis

The regression equation is


Price = 6533 - 0.0312 Odometer

Predictor Coef StDev T P


Constant 6533.38 84.51 77.31 0.000(SIGNIFICANT)
Odometer -0.031158 0.002309 -13.49 0.000(SIGNIFICANT)

S = 151.6 R-Sq = 65.0% R-Sq(adj) = 64.7%

Analysis of Variance

Source DF SS MS F P
Regression 1 4183528 4183528 182.11 0.000
Error 98 2251362 22973
Total 99 6434890

21
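Output of this form can be reproduced in Python with statsmodels, assuming the Odometer and Price columns have been loaded from the data file (the file name below is hypothetical; the raw data are not included in these slides). This is a sketch of the API, not the method used to produce the Minitab output above.

```python
import pandas as pd
import statsmodels.api as sm

# Hypothetical file holding the 100 observations with Odometer and Price columns.
cars = pd.read_csv("odometer_price.csv")

X = sm.add_constant(cars["Odometer"])      # adds the intercept column
model = sm.OLS(cars["Price"], X).fit()

print(model.summary())     # coefficients, standard errors, t, p, R-squared
print(model.params)        # should be close to 6533.38 and -0.031158
```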
Example
• The least squares regression line is
  $\hat{y} = 6533.38 - 0.031158x$

[Figure: scatterplot of Price (roughly 5000 to 6000) against Odometer (roughly 20000 to 50000) with the fitted line.]

22
Interpretation of the coefficients

• $\hat\beta_1 = -0.031158$ means that for each additional mile on the odometer, the price decreases by an average of 3.1158 cents.
• $\hat\beta_0 = 6533.38$ means that when x = 0 (a new car), the selling price is $6533.38. However, x = 0 is not in the range of the observed x values, so we cannot interpret the value of y at x = 0 for this problem.
• R² = 65.0% means that 65% of the variation in y can be explained by x. The higher the value of R², the better the model fits the data.

23
R² and R² adjusted

• R² measures the degree of linear association between X and Y.
• So, an R² close to 0 does not necessarily indicate that X and Y are unrelated (the relation can be nonlinear).
• Also, a high R² does not necessarily indicate that the estimated regression line is a good fit.
• As more and more X's are added to the model, R² always increases. R²adj accounts for the number of parameters in the model.

24
Scatter Plot

[Figure: "Odometer vs. Price Line Fit Plot" — scatterplot of Price (about 4500 to 6000) against Odometer (about 19000 to 49000) with the fitted line.]

25
Testing the slope
• Are X and Y linearly related?
  $H_0: \beta_1 = 0$
  $H_A: \beta_1 \neq 0$
• Test statistic:
  $t = \dfrac{\hat\beta_1 - \beta_1}{s_{\hat\beta_1}}$  where  $s_{\hat\beta_1} = \dfrac{s}{\sqrt{SS_x}}$

26
Testing the slope (continued)
• The rejection region:
  Reject $H_0$ if $t < -t_{\alpha/2,\,n-2}$ or $t > t_{\alpha/2,\,n-2}$.
• If we are testing whether high x values lead to high y values, $H_A: \beta_1 > 0$, and the rejection region is $t > t_{\alpha,\,n-2}$.
• If we are testing whether high x values lead to low y values (or low x values lead to high y values), $H_A: \beta_1 < 0$, and the rejection region is $t < -t_{\alpha,\,n-2}$.
(A numerical sketch of this test follows below.)

27
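As an illustration of the test statistic, the sketch below applies it to the education/income data from the earlier example (the raw odometer data are not listed in these slides). It is an added example, not part of the original deck, and uses SciPy only for the t distribution.

```python
import numpy as np
from scipy import stats

x = np.array([11, 12, 11, 15, 8, 10, 11, 12, 17, 11], dtype=float)
y = np.array([25, 33, 22, 41, 18, 28, 32, 24, 53, 26], dtype=float)
n = len(x)

ss_x = np.sum(x**2) - np.sum(x)**2 / n
ss_xy = np.sum(x * y) - np.sum(x) * np.sum(y) / n
ss_y = np.sum(y**2) - np.sum(y)**2 / n

b1 = ss_xy / ss_x
sse = ss_y - ss_xy**2 / ss_x
s = np.sqrt(sse / (n - 2))               # estimate of sigma
se_b1 = s / np.sqrt(ss_x)                # standard error of the slope

t_stat = b1 / se_b1                      # test of H0: beta1 = 0 (about 6.6 here)
p_two_sided = 2 * stats.t.sf(abs(t_stat), df=n - 2)
print(t_stat, p_two_sided)
```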
Assessing the model
Example:
• Excel output

Coefficients Standard Error t Stat P-value


Intercept 6533.4 84.512322 77.307 1E-89
Odometer -0.031 0.0023089 -13.49 4E-24

• Minitab output

Predictor Coef StDev T P


Constant 6533.38 84.51 77.31 0.000
Odometer -0.031158 0.002309 -13.49 0.000

28
Coefficient of Determination

$R^2 = 1 - \dfrac{SSE}{SS_y}$

For the data in the odometer example, we obtain:
$R^2 = 1 - \dfrac{SSE}{SS_y} = 1 - \dfrac{2{,}251{,}363}{6{,}434{,}890} = 1 - 0.3499 = 0.6501$

$R^2_{adj} = 1 - \left(\dfrac{n-1}{n-p}\right)\dfrac{SSE}{SS_y}$

where p is the number of estimated parameters in the model (here p = 2: intercept and slope).

29
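A quick check of these numbers, using the sums of squares quoted above (this snippet is added for illustration):

```python
sse = 2_251_363.0      # error sum of squares (SSE) for the odometer example
ss_y = 6_434_890.0     # total sum of squares
n, p = 100, 2          # sample size and number of estimated parameters

r2 = 1 - sse / ss_y                          # about 0.650
r2_adj = 1 - (n - 1) / (n - p) * sse / ss_y  # about 0.647, matching the Minitab output
print(round(r2, 4), round(r2_adj, 4))
```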
Using the Regression Equation

• Suppose we would like to predict the selling price for a car with 40,000 miles on the odometer:
  $\hat{y} = 6{,}533 - 0.0312x = 6{,}533 - 0.0312(40{,}000) = \$5{,}285$

30
Prediction and Confidence Intervals

• Prediction interval of y for x = x_g: the interval for predicting a particular value of y at a given x:
  $\hat{y} \pm t_{\alpha/2,\,n-2}\, s \sqrt{1 + \dfrac{1}{n} + \dfrac{(x_g - \bar{x})^2}{SS_x}}$
• Confidence interval for E(y | x = x_g): the interval for estimating the expected value of y at a given x:
  $\hat{y} \pm t_{\alpha/2,\,n-2}\, s \sqrt{\dfrac{1}{n} + \dfrac{(x_g - \bar{x})^2}{SS_x}}$
  (s is the standard error of estimate.)

31
Solving by Hand
(Prediction Interval)
• From previous calculations we have the following estimates:
  $\hat{y} = 5285$, $s = 151.6$, $SS_x = 4{,}309{,}340{,}160$, $\bar{x} = 36{,}009$
• Thus a 95% prediction interval for x = 40,000 is:
  $5{,}285 \pm 1.984(151.6)\sqrt{1 + \dfrac{1}{100} + \dfrac{(40{,}000 - 36{,}009)^2}{4{,}309{,}340{,}160}} = 5{,}285 \pm 303$
• The prediction is that the selling price of the car will fall between $4982 and $5588.

32
Solving by Hand
(Confidence Interval)

• A 95% confidence interval for E(y | x = 40,000) is:
  $5{,}285 \pm 1.984(151.6)\sqrt{\dfrac{1}{100} + \dfrac{(40{,}000 - 36{,}009)^2}{4{,}309{,}340{,}160}} = 5{,}285 \pm 35$
• The estimate is that the mean selling price will fall between $5250 and $5320.
(Both intervals are verified numerically in the sketch below.)

33
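Both hand calculations can be checked numerically. The snippet below (added for illustration) uses only the summary statistics quoted above and SciPy for the t critical value.

```python
import numpy as np
from scipy import stats

y_hat, s, ss_x, x_bar, n = 5285.0, 151.6, 4_309_340_160.0, 36_009.0, 100
x_g = 40_000.0
t_crit = stats.t.ppf(0.975, df=n - 2)                      # about 1.984

leverage_term = 1 / n + (x_g - x_bar) ** 2 / ss_x
pi_half_width = t_crit * s * np.sqrt(1 + leverage_term)    # about 303
ci_half_width = t_crit * s * np.sqrt(leverage_term)        # about 35

print(f"95% PI: {y_hat - pi_half_width:.0f} to {y_hat + pi_half_width:.0f}")
print(f"95% CI: {y_hat - ci_half_width:.0f} to {y_hat + ci_half_width:.0f}")
```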
Prediction and Confidence Intervals’ Graph

[Figure: predicted line plotted against Odometer (roughly 20000 to 50000) with the prediction-interval and confidence-interval bands.]

34
Notes

• No matter how strong the statistical relation between X and Y is, no cause-and-effect pattern is necessarily implied by the regression model. Example: although a positive and significant relationship is observed between vocabulary (X) and writing speed (Y), this does not imply that an increase in X causes an increase in Y. Other variables, such as age, may affect both X and Y: older children have both a larger vocabulary and a faster writing speed.

35
Regression Diagnostics

Residual Analysis:
Non-normality
Heteroscedasticity (non-constant variance)
Non-independence of the errors
Outlier
Influential observations

36
Standardized Residuals
• The standardized residuals are calculated as
  $\dfrac{r_i}{s_{r_i}}$  where  $r_i = y_i - \hat{y}_i$
• The standard deviation of the i-th residual is
  $s_{r_i} = s\sqrt{1 - h_i}$  where  $h_i = \dfrac{1}{n} + \dfrac{(x_i - \bar{x})^2}{SS_x}$

37
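A short computational sketch of these residual diagnostics, again using the education/income data for concreteness (illustrative code added to these notes, not from the slides):

```python
import numpy as np

x = np.array([11, 12, 11, 15, 8, 10, 11, 12, 17, 11], dtype=float)
y = np.array([25, 33, 22, 41, 18, 28, 32, 24, 53, 26], dtype=float)
n = len(x)

ss_x = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / ss_x
b0 = y.mean() - b1 * x.mean()

resid = y - (b0 + b1 * x)                   # r_i = y_i - y-hat_i
s = np.sqrt(np.sum(resid**2) / (n - 2))     # residual standard deviation
h = 1 / n + (x - x.mean()) ** 2 / ss_x      # leverages h_i
std_resid = resid / (s * np.sqrt(1 - h))    # residuals scaled by s_{r_i}
print(np.round(std_resid, 2))
```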
Non-normality:

• The errors should be normally distributed. To check the normality of the errors, we use a histogram of the residuals, a normal probability plot of the residuals, or tests such as the Shapiro-Wilk test (a one-line check is sketched below).
• Dealing with non-normality:
  – Transformation of Y
  – Other types of regression (e.g., Poisson or logistic regression)
  – Nonparametric methods (e.g., nonparametric regression, i.e., smoothing)

38
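A Shapiro-Wilk check on a residual array takes one call in SciPy. The residual values below are placeholders for illustration; in practice, use the residuals from the fitted model.

```python
import numpy as np
from scipy import stats

# Illustrative residuals only.
residuals = np.array([1.2, -0.8, 0.4, -1.5, 0.9, 0.3, -0.6, 1.1, -0.2, -0.8])
stat, p_value = stats.shapiro(residuals)
print(stat, p_value)   # a small p-value suggests non-normal errors
```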
Non-constant variance:

• The error variance $\sigma^2$ should be constant.
• To diagnose non-constant variance, one method is to plot the residuals against the predicted values of y (or against x). If the points are distributed evenly around the expected value of the errors, which is 0, the error variance is constant. Formal tests such as the Breusch-Pagan test can also be used.

39
Dealing with non-constant variance
• Transform Y
• Re-specify the model (e.g., missing important X's?)
• Use weighted least squares instead of ordinary least squares (see the sketch below):
  $\min \sum_{i=1}^{n} \dfrac{\varepsilon_i^2}{\mathrm{Var}(\varepsilon_i)}$

40
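A minimal sketch of weighted least squares with statsmodels, assuming arrays x and y and some estimate var_i of the error variance of each observation (all data and names below are illustrative):

```python
import numpy as np
import statsmodels.api as sm

# Illustrative data and per-observation error variances (assumed known here).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 4.3, 5.9, 8.4, 9.8, 12.5])
var_i = np.array([0.1, 0.2, 0.4, 0.8, 1.6, 3.2])

X = sm.add_constant(x)
wls_fit = sm.WLS(y, X, weights=1.0 / var_i).fit()   # weight = 1 / Var(eps_i)
print(wls_fit.params)
```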
Non-independence of the error variable:

• The values of the error should be independent. When the data are a time series, the errors are often correlated (i.e., autocorrelated or serially correlated). To detect autocorrelation we plot the residuals against the time periods; if there is no pattern, the errors are independent. More formal tests, such as the Durbin-Watson test, can also be used (see the sketch below).

41
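A sketch of the Durbin-Watson statistic on a residual series, using statsmodels (the residuals below are illustrative; values near 2 suggest little first-order autocorrelation):

```python
import numpy as np
from statsmodels.stats.stattools import durbin_watson

# Illustrative residuals ordered in time.
residuals = np.array([0.5, 0.3, -0.2, -0.4, 0.1, 0.6, -0.3, -0.1, 0.2, -0.5])
dw = durbin_watson(residuals)   # near 2 => little evidence of autocorrelation
print(dw)
```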
Outlier:

• An outlier is an observation that is unusually small or large. Possible causes of an outlier:
  1. An error in recording the data → detect the error and correct it.
  2. The point should not have been included in the data (it belongs to another population) → discard the point from the sample.
  3. The observation is unusually small or large although it belongs to the sample and there is no recording error → do NOT remove it.

42
Influential Observations

[Figure: two scatterplots of y against x — "Scatter Plot of One Influential Observation" and "Scatter Plot Without the Influential Observation".]

43
Influential Observations

• Detection: Cook's Distance, DFFITS, DFBETAS (Neter, J., Kutner, M. H., Nachtsheim, C. J., and Wasserman, W. (1996), Applied Linear Statistical Models, 4th ed., Irwin, pp. 378-384). A statsmodels sketch follows below.

44
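These diagnostics are available in statsmodels through the influence object of a fitted model. The sketch below uses made-up data with one point placed far from the others; it is an illustration, not the reference's own example.

```python
import numpy as np
import statsmodels.api as sm

# Illustrative data with one point far from the rest.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 20.0])
y = np.array([2.0, 4.1, 6.2, 7.9, 10.1, 45.0])

fit = sm.OLS(y, sm.add_constant(x)).fit()
influence = fit.get_influence()

cooks_d, _ = influence.cooks_distance   # Cook's distance for each point
dffits, _ = influence.dffits            # DFFITS for each point
dfbetas = influence.dfbetas             # DFBETAS for each coefficient
print(cooks_d)
```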
Multicollinearity

• A common issue in multiple regression is multicollinearity. It exists when some or all of the predictors in the model are highly correlated. In such cases, the estimated coefficient of any variable depends on which other variables are in the model. Also, the standard errors of the coefficients become very large.

45
Multicollinearity

• Look at the correlation coefficients among the X's: if Cor > 0.8, suspect multicollinearity.
• Look at the variance inflation factors (VIF): VIF > 10 is usually a sign of multicollinearity (see the sketch below).
• If there is multicollinearity:
  – Use a transformation of the X's, e.g. centering or standardization. Ex: Cor(X, X²) = 0.991; after standardization, Cor = 0!
  – Remove the X that causes the multicollinearity
  – Factor analysis
  – Ridge regression
  – ...

46
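A sketch of the VIF check with statsmodels; the two nearly collinear predictors here are simulated purely for illustration.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)   # nearly collinear with x1

X = sm.add_constant(np.column_stack([x1, x2]))
vifs = [variance_inflation_factor(X, i) for i in range(1, X.shape[1])]
print(vifs)   # large values (well above 10) flag multicollinearity
```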
Exercise

• In baseball, fans are always interested in determining which factors lead to successful teams. The table below lists the team batting average and the team winning percentage for the 14 league teams at the end of a recent season.

47
Team Batting Avg   Winning %
0.254 0.414
0.269 0.519
0.255 0.500
0.262 0.537
0.254 0.352
0.247 0.519
0.264 0.506
0.271 0.512
0.280 0.586
0.256 0.438
0.248 0.519
0.255 0.512
0.270 0.525
0.257 0.562

y = winning % and x = team batting average


48
a) LS Regression Line

$\sum x_i = 3.642$,  $\sum x_i^2 = 0.948622$
$\sum y_i = 7.001$,  $\sum y_i^2 = 3.548785$
$\sum x_i y_i = 1.824562$

$SS_{xy} = \sum x_i y_i - \dfrac{(\sum x_i)(\sum y_i)}{n} = 1.824562 - \dfrac{(3.642)(7.001)}{14} = 0.003302$

$SS_x = \sum x_i^2 - \dfrac{(\sum x_i)^2}{n} = 0.948622 - \dfrac{(3.642)^2}{14} = 0.001182$

49
$\hat\beta_1 = \dfrac{SS_{xy}}{SS_x} = \dfrac{0.003302}{0.001182} = 2.7941$

$\hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x} = 0.50007 - (2.7941)(0.26014) = -0.2268$

• The least squares regression line is
  $\hat{y} = -0.2268 + 2.7941x$
• The meaning of $\hat\beta_1 = 2.7941$ is that for each additional point (0.001) of team batting average, the winning percentage increases by an average of about 0.0028, i.e., about 0.28 percentage points.

50
b) Standard Error of Estimate

$SSE = SS_y - \dfrac{SS_{xy}^2}{SS_x} = \left(\sum y_i^2 - \dfrac{(\sum y_i)^2}{n}\right) - \dfrac{SS_{xy}^2}{SS_x}$
$= \left(3.548785 - \dfrac{7.001^2}{14}\right) - \dfrac{0.003302^2}{0.001182} = 0.03856$

So, $s^2 = \dfrac{SSE}{n-2} = \dfrac{0.03856}{14-2} = 0.00321$ and $s = \sqrt{s^2} = 0.0567$.

• Since s = 0.0567 is relatively small, we would conclude that the regression line fits the data quite well.

51
c) Do the data provide sufficient evidence at the 5% significance level to conclude that a higher team batting average leads to a higher winning percentage?

$H_0: \beta_1 = 0$
$H_A: \beta_1 > 0$

Test statistic: $t = \dfrac{\hat\beta_1 - \beta_1}{s_{\hat\beta_1}} = 1.69$  (p-value = 0.058)

Conclusion: Do not reject $H_0$ at $\alpha = 0.05$. There is not sufficient evidence that a higher team batting average leads to a higher winning percentage.

52
d) Coefficient of Determination

$R^2 = \dfrac{SS_{xy}^2}{SS_x \, SS_y} = 1 - \dfrac{SSE}{SS_y} = 1 - \dfrac{0.03856}{0.04778} = 0.193$

• About 19.3% of the variation in the winning percentage can be explained by the batting average.

53
e) Predict with 90% confidence the winning percentage of a team whose batting average is 0.275.

$\hat{y} = -0.2268 + 2.7941(0.275) = 0.5416$

$\hat{y} \pm t_{\alpha/2,\,n-2}\, s \sqrt{1 + \dfrac{1}{n} + \dfrac{(x_g - \bar{x})^2}{SS_x}}$
$= 0.5416 \pm (1.782)(0.0567)\sqrt{1 + \dfrac{1}{14} + \dfrac{(0.275 - 0.2601)^2}{0.001182}} = 0.5416 \pm 0.1134$

90% PI for y: (0.4282, 0.6550)

• The prediction is that the winning percentage of the team will fall between 42.82% and 65.50%.
(A numerical check of parts (a)-(e) follows below.)

54
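The following snippet, added to these notes for verification, recomputes the exercise from the raw table above with NumPy and SciPy; the printed values should be close to those shown in parts (a)-(e).

```python
import numpy as np
from scipy import stats

bat = np.array([0.254, 0.269, 0.255, 0.262, 0.254, 0.247, 0.264,
                0.271, 0.280, 0.256, 0.248, 0.255, 0.270, 0.257])
win = np.array([0.414, 0.519, 0.500, 0.537, 0.352, 0.519, 0.506,
                0.512, 0.586, 0.438, 0.519, 0.512, 0.525, 0.562])
n = len(bat)

ss_x = np.sum((bat - bat.mean()) ** 2)
ss_xy = np.sum((bat - bat.mean()) * (win - win.mean()))
ss_y = np.sum((win - win.mean()) ** 2)

b1 = ss_xy / ss_x                        # slope, about 2.79
b0 = win.mean() - b1 * bat.mean()        # intercept, about -0.23
sse = ss_y - ss_xy**2 / ss_x
s = np.sqrt(sse / (n - 2))               # about 0.057

t_stat = b1 / (s / np.sqrt(ss_x))        # about 1.69
p_one_sided = stats.t.sf(t_stat, df=n - 2)   # about 0.058
r2 = 1 - sse / ss_y                      # about 0.19

x_g = 0.275
y_hat = b0 + b1 * x_g
half = stats.t.ppf(0.95, df=n - 2) * s * np.sqrt(1 + 1/n + (x_g - bat.mean())**2 / ss_x)
print(round(y_hat - half, 4), round(y_hat + half, 4))   # 90% prediction interval
```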