Simple Linear Regression
y = \beta_0 + \beta_1 x + \varepsilon

where
y = dependent variable
x = independent variable
\beta_0 = y-intercept
\beta_1 = slope of the line
\varepsilon = error variable
Simple Linear Model
Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i
This model is
– Simple: only one X
– Linear in the parameters: no parameter appears
as an exponent or is multiplied or divided by
another parameter
– Linear in the predictor variable (X): X appears
only to the first power.
Examples
• Multiple Linear Regression:
  Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \varepsilon_i
• Polynomial Linear Regression:
  Y_i = \beta_0 + \beta_1 X_i + \beta_2 X_i^2 + \varepsilon_i
• Linear Regression:
  \log_{10}(Y_i) = \beta_0 + \beta_1 X_i + \beta_2 \exp(X_i) + \varepsilon_i
• Nonlinear Regression:
  Y_i = \beta_0 / (1 + \beta_1 \exp(\beta_2 X_i)) + \varepsilon_i
Linear or nonlinear refers to the parameters.
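To make the distinction concrete, here is a minimal Python sketch (simulated data and parameter values are illustrative, not from the slides): the polynomial model is linear in the parameters, so ordinary least squares on a design matrix fits it, while the logistic-type model is nonlinear in the parameters and needs an iterative solver such as scipy.optimize.curve_fit.

```python
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)

# Polynomial model: linear in beta, so OLS on [1, x, x^2] works.
y_poly = 2.0 + 1.5 * x + 0.3 * x**2 + rng.normal(0, 1, x.size)
X = np.column_stack([np.ones_like(x), x, x**2])   # design matrix
beta, *_ = np.linalg.lstsq(X, y_poly, rcond=None)
print("OLS estimates:", beta)

# Logistic-type model: nonlinear in beta, needs an iterative solver.
def logistic(x, b0, b1, b2):
    return b0 / (1.0 + b1 * np.exp(b2 * x))

y_nl = logistic(x, 10.0, 5.0, -0.8) + rng.normal(0, 0.2, x.size)
params, _ = curve_fit(logistic, x, y_nl, p0=[8.0, 3.0, -0.5])
print("Nonlinear LS estimates:", params)
```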
Deterministic Component of Model
[Figure: the line \hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x, with y-intercept \hat{\beta}_0 and slope \hat{\beta}_1 = \Delta y / \Delta x]
Mathematical vs Statistical Relation
[Figure: scatter plot of points around the fitted line \hat{y} = -5.3562 + 3.3988x]
Error
Least Squares:
Minimize \sum_{i=1}^{n} \varepsilon_i^2 = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2
Minimizing error
• The Simple Linear Regression Model
  y = \beta_0 + \beta_1 x + \varepsilon
• The Least Squares Regression Line
  \hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x
  where
  \hat{\beta}_1 = \frac{SS_{xy}}{SS_x}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}
  SS_x = \sum (x_i - \bar{x})^2 = \sum x_i^2 - \frac{(\sum x_i)^2}{n}
  SS_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y}) = \sum x_i y_i - \frac{(\sum x_i)(\sum y_i)}{n}
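These shortcut formulas translate directly into code. A small helper sketched in Python (the function name is ours, not from the slides):

```python
import numpy as np

def least_squares_line(x, y):
    """Intercept and slope via the SS_x / SS_xy shortcut formulas."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = x.size
    ss_x  = np.sum(x**2) - np.sum(x)**2 / n
    ss_xy = np.sum(x * y) - np.sum(x) * np.sum(y) / n
    b1 = ss_xy / ss_x                 # slope
    b0 = y.mean() - b1 * x.mean()     # intercept
    return b0, b1
```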
What form does the error take?
• Each observation may be decomposed into two parts:
  y = \hat{y} + (y - \hat{y})
• The first part is used to determine the fit, and the
second to estimate the error.
• We estimate the standard deviation of the error
from the sum of squared errors:
  SSE = \sum (y_i - \hat{y}_i)^2 = SS_y - \frac{SS_{xy}^2}{SS_x}
Estimate of \sigma^2
• We estimate \sigma^2 by
  s^2 = \frac{SSE}{n-2} = MSE
Example
Education (years): 11 12 11 15  8 10 11 12 17 11
Income ($1000s):   25 33 22 41 18 28 32 24 53 26
Dependent and Independent Variables
Income is the dependent variable (y); education is the independent variable (x).
First Step:
\sum x_i = 118, \quad \sum x_i^2 = 1450
\sum y_i = 302, \quad \sum y_i^2 = 10072
\sum x_i y_i = 3779
Sum of Squares:
SS_{xy} = \sum x_i y_i - \frac{(\sum x_i)(\sum y_i)}{n} = 3779 - \frac{(118)(302)}{10} = 215.4
SS_x = \sum x_i^2 - \frac{(\sum x_i)^2}{n} = 1450 - \frac{(118)^2}{10} = 57.6
Therefore,
\hat{\beta}_1 = \frac{SS_{xy}}{SS_x} = \frac{215.4}{57.6} = 3.74
\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = \frac{302}{10} - 3.74 \cdot \frac{118}{10} = -13.93
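Feeding these data into the least_squares_line sketch above reproduces the hand calculation:

```python
educ   = [11, 12, 11, 15, 8, 10, 11, 12, 17, 11]
income = [25, 33, 22, 41, 18, 28, 32, 24, 53, 26]   # $1000s
b0, b1 = least_squares_line(educ, income)
print(round(b0, 2), round(b1, 2))   # -13.93  3.74
```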
The Least Squares Regression Line
\hat{y} = -13.93 + 3.74x
• Interpretation of coefficients:
  The sample slope \hat{\beta}_1 = 3.74 tells us that, on average,
  each additional year of education raises an individual's
  income by $3.74 thousand.
• The y-intercept is \hat{\beta}_0 = -13.93. This would be the
  expected (or average) income for an individual with 0 years of
  education, which is meaningless here.
Example
Portion of the data file
Odometer Price
37388 5318
44758 5061
45833 5008
30862 5795
….. …
34212 5283
33190 5259
39196 5356
36392 5133
Example (Minitab Output)
Regression Analysis
Analysis of Variance
Source      DF        SS       MS       F      P
Regression   1   4183528  4183528  182.11  0.000
Error       98   2251362    22973
Total       99   6434890
Note that F = MS_Regression / MS_Error = 4,183,528 / 22,973 ≈ 182.1, in agreement with the output.
Example
• The least squares regression line is
  \hat{y} = 6533.38 - 0.031158x
[Figure: the fitted line drawn through the price-odometer scatter plot]
R² and R² adjusted
Scatter Plot
[Figure: scatter plot of Price against Odometer]
Testing the slope
• Are X and Y linearly related?
  H_0: \beta_1 = 0
  H_A: \beta_1 \neq 0
• Test Statistic:
  t = \frac{\hat{\beta}_1 - \beta_1}{s_{\hat{\beta}_1}}, \quad \text{where } s_{\hat{\beta}_1} = \frac{s}{\sqrt{SS_x}}
Testing the slope (continued)
• The Rejection Region:
  Reject H_0 if t < -t_{\alpha/2, n-2} or t > t_{\alpha/2, n-2}.
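As a sketch, the test applied to the education-income example, computed from the SS quantities on the earlier slides (scipy supplies the two-sided p-value):

```python
import numpy as np
from scipy import stats

x = np.array([11, 12, 11, 15, 8, 10, 11, 12, 17, 11])   # education
y = np.array([25, 33, 22, 41, 18, 28, 32, 24, 53, 26])  # income, $1000s
n = x.size

ss_x  = np.sum(x**2) - x.sum()**2 / n
ss_y  = np.sum(y**2) - y.sum()**2 / n
ss_xy = np.sum(x * y) - x.sum() * y.sum() / n

b1 = ss_xy / ss_x
s  = np.sqrt((ss_y - ss_xy**2 / ss_x) / (n - 2))  # residual std dev
se_b1 = s / np.sqrt(ss_x)                         # std error of slope

t = (b1 - 0) / se_b1                     # H0: beta1 = 0
p = 2 * stats.t.sf(abs(t), df=n - 2)     # two-sided p-value
print(f"t = {t:.2f}, two-sided p = {p:.4f}")
```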
Assessing the model
Example:
• Excel output
• Minitab output
Coefficient of Determination
R^2 = 1 - \frac{SSE}{SS_y}
For the data in the odometer example, we obtain:
R^2 = 1 - \frac{SSE}{SS_y} = 1 - \frac{2,251,363}{6,434,890} = 1 - 0.3499 = 0.6501
R^2_{adj} = 1 - \left( \frac{n-1}{n-p-1} \right) \frac{SSE}{SS_y}
where p is the number of predictors in the model.
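Both quantities follow directly from the ANOVA output for the odometer example (n = 100 comes from the Total df of 99); a quick sketch:

```python
n, p = 100, 1                  # 100 cars, one predictor (odometer)
sse, ss_y = 2_251_363, 6_434_890

r2 = 1 - sse / ss_y                               # 0.6501
r2_adj = 1 - (n - 1) / (n - p - 1) * (sse / ss_y)
print(f"R^2 = {r2:.4f}, adjusted R^2 = {r2_adj:.4f}")
```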
Using the Regression Equation
For a car with x = 40,000 miles on the odometer, the predicted price is
\hat{y} = 6,533 - 0.0312x = 6,533 - 0.0312(40,000) = $5,285
Prediction and Confidence Intervals
• Prediction Interval of y for a given x = x_g: the interval for
predicting a single value of y
  \hat{y} \pm t_{\alpha/2, n-2}\, s_e \sqrt{1 + \frac{1}{n} + \frac{(x_g - \bar{x})^2}{SS_x}}
• Confidence Interval of E(y|x = x_g): the confidence
interval for estimating the expected value of y for a
given x
  \hat{y} \pm t_{\alpha/2, n-2}\, s_e \sqrt{\frac{1}{n} + \frac{(x_g - \bar{x})^2}{SS_x}}
Solving by Hand
(Confidence Interval for the Mean)
• From previous calculations we have the following
estimates:
  \hat{y} = 5285, \quad s_e = 151.6, \quad SS_x = 4,309,340,160, \quad \bar{x} = 36,009
• Thus a 95% confidence interval for the expected price at x = 40,000 is:
  5,285 \pm 1.984(151.6) \sqrt{\frac{1}{100} + \frac{(40,000 - 36,009)^2}{4,309,340,160}}
  = 5,285 \pm 35
• The mean selling price of such cars will fall between $5,250
and $5,320.
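A sketch reproducing this interval from the summary statistics, alongside the much wider prediction interval for a single car at the same x (scipy supplies the t critical value):

```python
import numpy as np
from scipy import stats

n = 100
y_hat, s_e = 5285.0, 151.6
ss_x, x_bar, x_g = 4_309_340_160, 36_009, 40_000

t_crit = stats.t.ppf(0.975, n - 2)          # ~1.984 for 95%
core = 1 / n + (x_g - x_bar) ** 2 / ss_x

ci = t_crit * s_e * np.sqrt(core)           # mean response: ~±35
pi = t_crit * s_e * np.sqrt(1 + core)       # single car: ~±303
print(f"95% CI for E(y): {y_hat:.0f} ± {ci:.0f}")
print(f"95% PI for y:    {y_hat:.0f} ± {pi:.0f}")
```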
Prediction and Confidence Intervals’ Graph
[Figure: fitted line with the narrower confidence-interval band inside the wider prediction-interval band]
Regression Diagnostics
Residual Analysis:
Non-normality
Heteroscedasticity (non-constant variance)
Non-independence of the errors
Outliers
Influential observations
Standardized Residuals
• The standardized residuals are calculated as
  \text{standardized residual}_i = \frac{r_i}{s_{r_i}}, \quad \text{where } r_i = y_i - \hat{y}_i
• The standard deviation of the i-th residual is
  s_{r_i} = s \sqrt{1 - h_i}, \quad \text{where } h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{SS_x}
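A minimal Python sketch of these formulas for simple regression (the function name is ours):

```python
import numpy as np

def standardized_residuals(x, y):
    """r_i / (s * sqrt(1 - h_i)) for a simple linear regression."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = x.size
    ss_x = np.sum((x - x.mean()) ** 2)
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / ss_x
    b0 = y.mean() - b1 * x.mean()
    r = y - (b0 + b1 * x)                    # raw residuals
    s = np.sqrt(np.sum(r**2) / (n - 2))      # residual std deviation
    h = 1 / n + (x - x.mean()) ** 2 / ss_x   # leverage of each point
    return r / (s * np.sqrt(1 - h))
```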
Non-normality:
Non-constant variance:
Dealing with non-constant variance
• Transform Y
• Re-specify the Model (e.g., Missing important X’s?)
• Use Weighted Least Squares instead of Ordinary
Least Squares:
  \min \sum_{i=1}^{n} \frac{\varepsilon_i^2}{\mathrm{Var}(\varepsilon_i)}
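A brief illustration with simulated heteroscedastic data (the data and weights here are illustrative assumptions): np.polyfit weights the residuals, so w_i = 1/\sigma_i reproduces the weighted objective above.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(1, 10, 200)
sigma = 0.5 * x                          # error spread grows with x
y = 2.0 + 3.0 * x + rng.normal(0, sigma)

# OLS treats every point equally; WLS downweights the noisy ones.
ols = np.polyfit(x, y, 1)
wls = np.polyfit(x, y, 1, w=1 / sigma)   # w_i = 1/sigma_i minimizes
                                         # sum(eps_i^2 / Var(eps_i))
print("OLS slope/intercept:", ols)
print("WLS slope/intercept:", wls)
```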
Non-independence of error variable:
Outliers:
Influential Observations
[Figure: two y-versus-x scatter plots illustrating how an influential observation can change the fitted line]
Influential Observations
• Detection:
Cook’s Distance, DFFITS, DFBETAS (Neter, J., Kutner, M.H., Nachtsheim, C.J., and Wasserman, W. (1996), Applied Linear Statistical Models, 4th ed., Irwin, pp. 378-384)
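Assuming the statsmodels package is available, these influence measures can be read off a fitted model; a sketch with simulated data:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
x = rng.uniform(0, 50, 30)
y = 1.0 + 0.8 * x + rng.normal(0, 5, 30)
y[0] += 60                         # plant one influential point

res = sm.OLS(y, sm.add_constant(x)).fit()
infl = res.get_influence()

cooks_d, _ = infl.cooks_distance   # Cook's distance per observation
dffits, _  = infl.dffits           # DFFITS per observation
dfbetas    = infl.dfbetas          # DFBETAS: one column per coefficient
print("largest Cook's D at index", np.argmax(cooks_d))
```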
Multicollinearity
Exercise
Team Batting Average   Winning %
0.254 0.414
0.269 0.519
0.255 0.500
0.262 0.537
0.254 0.352
0.247 0.519
0.264 0.506
0.271 0.512
0.280 0.586
0.256 0.438
0.248 0.519
0.255 0.512
0.270 0.525
0.257 0.562
\sum x_i = 3.642, \quad \sum x_i^2 = 0.948622
\sum y_i = 7.001, \quad \sum y_i^2 = 3.549
\sum x_i y_i = 1.824562, \quad n = 14

SS_{xy} = \sum x_i y_i - \frac{(\sum x_i)(\sum y_i)}{n} = 1.824562 - \frac{(3.642)(7.001)}{14} = 0.003302

SS_x = \sum x_i^2 - \frac{(\sum x_i)^2}{n} = 0.948622 - \frac{(3.642)^2}{14} = 0.001182
\hat{\beta}_1 = \frac{SS_{xy}}{SS_x} = \frac{0.003302}{0.001182} = 2.794

\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = 0.5001 - (2.794)(0.2601) = -0.2267

• The least squares regression line is
  \hat{y} = -0.2267 + 2.794x
• The meaning of \hat{\beta}_1 = 2.794 is that each additional
0.010 in team batting average raises the winning
percentage by about 0.028 (2.8 percentage points) on average.
b) Standard Error of Estimate
SSE = SS_y - \frac{SS_{xy}^2}{SS_x} = \left( \sum y_i^2 - \frac{(\sum y_i)^2}{n} \right) - \frac{SS_{xy}^2}{SS_x}
    = \left( 3.548785 - \frac{7.001^2}{14} \right) - \frac{0.003302^2}{0.001182} = 0.03856
So, s^2 = \frac{SSE}{n-2} = \frac{0.03856}{14-2} = 0.00321 and s = \sqrt{s^2} = 0.0567
• Since s = 0.0567 is small relative to the winning percentages
(which average about 0.5), the regression line fits the data quite well.
c) Do the data provide sufficient evidence at the 5%
significance level to conclude that a higher team
batting average leads to a higher winning percentage?
H_0: \beta_1 = 0
H_A: \beta_1 > 0
Test statistic: t = \frac{\hat{\beta}_1 - \beta_1}{s_{\hat{\beta}_1}} = 1.69 (p-value = .058)
Since the p-value .058 exceeds .05, the evidence is not quite
sufficient at the 5% level.
e) Predict with 90% confidence the winning
percentage of a team whose batting average is 0.275.
\hat{y} \pm t_{\alpha/2, n-2}\, s \sqrt{1 + \frac{1}{n} + \frac{(x_g - \bar{x})^2}{SS_x}}
= 0.5416 \pm (1.782)(0.0567) \sqrt{1 + \frac{1}{14} + \frac{(0.275 - 0.2601)^2}{0.001182}}
= 0.5416 \pm 0.1134
90% PI for y: (0.4282, 0.6550)
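A sketch verifying parts (a)-(c) and (e) directly from the data (scipy supplies the t distribution):

```python
import numpy as np
from scipy import stats

ba  = np.array([0.254, 0.269, 0.255, 0.262, 0.254, 0.247, 0.264,
                0.271, 0.280, 0.256, 0.248, 0.255, 0.270, 0.257])
win = np.array([0.414, 0.519, 0.500, 0.537, 0.352, 0.519, 0.506,
                0.512, 0.586, 0.438, 0.519, 0.512, 0.525, 0.562])
n = ba.size

ss_x  = np.sum(ba**2)  - ba.sum()**2 / n
ss_y  = np.sum(win**2) - win.sum()**2 / n
ss_xy = np.sum(ba * win) - ba.sum() * win.sum() / n

b1 = ss_xy / ss_x                        # slope, ~2.79
b0 = win.mean() - b1 * ba.mean()         # intercept, ~-0.23
s  = np.sqrt((ss_y - ss_xy**2 / ss_x) / (n - 2))   # ~0.0567

t = b1 / (s / np.sqrt(ss_x))             # ~1.69
p = stats.t.sf(t, df=n - 2)              # one-sided p, ~0.058

x_g = 0.275
y_hat = b0 + b1 * x_g
half = stats.t.ppf(0.95, n - 2) * s * np.sqrt(
    1 + 1 / n + (x_g - ba.mean()) ** 2 / ss_x)
print(f"line: yhat = {b0:.4f} + {b1:.3f} x")
print(f"t = {t:.2f}, p = {p:.3f}")
print(f"90% PI at x = 0.275: ({y_hat - half:.4f}, {y_hat + half:.4f})")
```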