Chapter 11: Simple Linear Regression
McClave: Statistics, 11th ed. Chapter 11: Simple Linear Regression
11.1: Probabilistic Models
But if you consider how many runners are on base when the home run is hit, or even how often the batter misses a base and is called out, the rigid model becomes more variable.
[Figure: scatter plot of Runs (vertical axis) versus Home Runs (horizontal axis)]
[Figure: scatter plot of Attendance (vertical axis) versus Week (horizontal axis)]
Step 1
Hypothesize the deterministic
component of the probabilistic model
$E(y) = \beta_0 + \beta_1 x$
Step 2
Use sample data to estimate the
unknown parameters in the model
11.2: Fitting the Model: The Least Squares Approach
[Figure: scatter plot of Total Offering (vertical axis) versus Average Offering (horizontal axis), with the fitted line]
Values on the line are the predicted values of total offerings given the average offering.
The distances between the scattered dots and the line are the errors of prediction.
The line’s estimated parameters are the values that minimize the sum of the squared errors of prediction, and the method of finding those values is called the method of least squares.
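Model: $y = \beta_0 + \beta_1 x + \varepsilon$
Estimates: $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$
Deviation: $(y_i - \hat{y}_i) = [y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)]$
SSE: $\sum_i [y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)]^2$
Slope: $\hat{\beta}_1 = \dfrac{SS_{xy}}{SS_{xx}}$
$$SS_{xy} = \sum_i (x_i - \bar{x})(y_i - \bar{y}) = \sum_i x_i y_i - \frac{\left(\sum_i x_i\right)\left(\sum_i y_i\right)}{n}$$
$$SS_{xx} = \sum_i (x_i - \bar{x})^2 = \sum_i x_i^2 - \frac{\left(\sum_i x_i\right)^2}{n}$$
y-intercept: $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$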
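The shortcut formulas for the slope and intercept can be sketched in a few lines of Python. This is a minimal illustration, not a replacement for statistical software; the function name `least_squares_fit` and the five data points are assumptions chosen for the example and do not come from this section.

```python
def least_squares_fit(x, y):
    """Return (slope, intercept) from the SSxy / SSxx shortcut formulas."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, with n >= 2")
    sum_x, sum_y = sum(x), sum(y)
    # SSxy = sum(x_i * y_i) - (sum x_i)(sum y_i) / n
    ss_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum_x * sum_y / n
    # SSxx = sum(x_i^2) - (sum x_i)^2 / n
    ss_xx = sum(xi * xi for xi in x) - sum_x ** 2 / n
    slope = ss_xy / ss_xx
    intercept = sum_y / n - slope * sum_x / n  # y-bar minus slope times x-bar
    return slope, intercept


# Illustrative data (an assumption for this sketch, not the slide's data set)
x = [1, 2, 3, 4, 5]
y = [1, 1, 2, 2, 4]
slope, intercept = least_squares_fit(x, y)
print(f"y-hat = {intercept:.2f} + {slope:.2f} x")
```

For these points, $SS_{xy} = 7$ and $SS_{xx} = 10$, so the fitted slope is 0.7 and the intercept is $2 - 0.7(3) = -0.1$.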
Is there a relationship between the number of home runs a team hits and the quality of its fielding?
$n = 5$
Slope: $\hat{\beta}_1 = -.0521$
y-intercept: $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = 98.4 - (-.0521)(153.4) = 107.2$
11.3: Model Assumptions
Assumptions
1. The mean of the probability distribution of $\varepsilon$ is 0.
2. The variance, $\sigma^2$, of the probability distribution of $\varepsilon$ is constant.
3. The probability distribution of $\varepsilon$ is normal.
4. The values of $\varepsilon$ associated with any two values of y are independent.
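One way to make the four assumptions concrete is to simulate data that satisfy them. The sketch below draws independent normal errors with mean 0 and a constant standard deviation, then adds them to a straight line. The function name `simulate_responses` and all parameter values are hypothetical, chosen only for illustration.

```python
import random


def simulate_responses(beta0, beta1, sigma, xs, seed=0):
    """Generate y = beta0 + beta1 * x + eps with eps ~ Normal(0, sigma).

    Each eps is an independent draw with the same mean (0) and the same
    standard deviation (sigma), regardless of x.
    """
    rng = random.Random(seed)  # seeded so the sketch is repeatable
    return [beta0 + beta1 * x + rng.gauss(0.0, sigma) for x in xs]


# Hypothetical parameters: line 2 + 0.5x with error standard deviation 1
xs = [i % 10 for i in range(20000)]
ys = simulate_responses(2.0, 0.5, 1.0, xs)
errors = [y - (2.0 + 0.5 * x) for x, y in zip(xs, ys)]
mean_error = sum(errors) / len(errors)
print(f"average simulated error: {mean_error:.4f}")
```

With a large simulated sample, the average error should sit close to 0, in line with the first assumption.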
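The variance, $\sigma^2$, is used in every test statistic and confidence interval used to evaluate the model.
Invariably, $\sigma^2$ is unknown and must be estimated.
$$s^2 = \frac{SSE}{n-2} = \frac{\sum_i (y_i - \hat{y}_i)^2}{n-2} = \frac{SS_{yy} - \hat{\beta}_1 SS_{xy}}{n-2}$$
where $SS_{yy} = \sum_i (y_i - \bar{y})^2$ and $SS_{xy} = \sum_i (x_i - \bar{x})(y_i - \bar{y})$.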
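The two routes to SSE, summing squared residuals directly or using $SS_{yy} - \hat{\beta}_1 SS_{xy}$, should agree. The sketch below checks this on a small illustrative data set; the function name `estimate_variance` and the data are assumptions, not values from the text.

```python
def estimate_variance(x, y):
    """Return (sse_direct, sse_shortcut, s_squared) for a fitted line."""
    n = len(x)
    if n < 3:
        raise ValueError("need n > 2 so that n - 2 degrees of freedom is positive")
    x_bar, y_bar = sum(x) / n, sum(y) / n
    ss_xx = sum((xi - x_bar) ** 2 for xi in x)
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    ss_yy = sum((yi - y_bar) ** 2 for yi in y)
    b1 = ss_xy / ss_xx
    b0 = y_bar - b1 * x_bar
    # Direct route: sum of squared errors of prediction
    sse_direct = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    # Shortcut route: SSyy - b1 * SSxy
    sse_shortcut = ss_yy - b1 * ss_xy
    return sse_direct, sse_shortcut, sse_shortcut / (n - 2)


# Illustrative data (an assumption for this sketch)
sse_direct, sse_shortcut, s_squared = estimate_variance([1, 2, 3, 4, 5], [1, 1, 2, 2, 4])
print(f"SSE = {sse_shortcut:.2f}, s^2 = {s_squared:.4f}")
```

Here $SS_{yy} = 6$, $\hat{\beta}_1 = 0.7$, and $SS_{xy} = 7$, so SSE is 1.1 and $s^2 = 1.1/3$.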
11.4: Assessing the Utility of the Model: Making Inferences about the Slope $\beta_1$
[Figure: three scatter plots of y versus x. Panels: Positive Relationship ($\beta_1 > 0$), No Relationship ($\beta_1 = 0$), Negative Relationship ($\beta_1 < 0$)]
The standard deviation of the sampling distribution of $\hat{\beta}_1$ is
$$\sigma_{\hat{\beta}_1} = \frac{\sigma}{\sqrt{SS_{xx}}},$$
and its estimate,
$$\hat{\sigma}_{\hat{\beta}_1} = s_{\hat{\beta}_1} = \frac{s}{\sqrt{SS_{xx}}},$$
is called the estimated standard error of the least squares slope estimate.
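A Test of Model Utility: Simple Linear Regression
One-Tailed Test: $H_0: \beta_1 = 0$ versus $H_a: \beta_1 < 0$ (or $H_a: \beta_1 > 0$)
Two-Tailed Test: $H_0: \beta_1 = 0$ versus $H_a: \beta_1 \neq 0$
Test Statistic: $t = \dfrac{\hat{\beta}_1}{s_{\hat{\beta}_1}} = \dfrac{\hat{\beta}_1}{s / \sqrt{SS_{xx}}}$
Rejection Region: $t < -t_\alpha$ (or $t > t_\alpha$) for the one-tailed test; $|t| > t_{\alpha/2}$ for the two-tailed test
Degrees of freedom = $n - 2$
A Confidence Interval on $\beta_1$
$$\hat{\beta}_1 \pm t_{\alpha/2}\, s_{\hat{\beta}_1}$$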
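The test statistic and the confidence interval on the slope can be computed together, as in the sketch below. The function name `slope_inference` and the data are assumptions for illustration. The critical value is passed in by the caller rather than computed; 3.182 is the tabled $t_{.025}$ for 3 degrees of freedom, which matches $n = 5$.

```python
import math


def slope_inference(x, y, t_crit):
    """Return t statistic, two-tailed decision, and CI for the slope.

    t_crit is t_{alpha/2} with n - 2 degrees of freedom, looked up by the caller.
    """
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    ss_xx = sum((xi - x_bar) ** 2 for xi in x)
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    ss_yy = sum((yi - y_bar) ** 2 for yi in y)
    b1 = ss_xy / ss_xx
    s = math.sqrt((ss_yy - b1 * ss_xy) / (n - 2))  # estimate of sigma
    se_b1 = s / math.sqrt(ss_xx)                   # estimated standard error of the slope
    t_stat = b1 / se_b1
    half_width = t_crit * se_b1
    return {
        "t": t_stat,
        "reject_two_tailed": abs(t_stat) > t_crit,
        "ci": (b1 - half_width, b1 + half_width),
    }


# Illustrative data (an assumption); n = 5, so df = 3 and t_.025 = 3.182
result = slope_inference([1, 2, 3, 4, 5], [1, 1, 2, 2, 4], t_crit=3.182)
print(result)
```

A confidence interval that excludes 0 leads to the same conclusion as rejecting $H_0$ in the two-tailed test at the matching $\alpha$.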
11.5: The Coefficients of Correlation and Determination
[Figure: three scatter plots of y versus x. Panels: $r \to +1$, $r \approx 0$, $r \to -1$]
Values of r equal to +1 or -1 require each point in the scatter plot to lie on a single straight line.
$SS_{xy} = -143.8$, so
$$r = \frac{SS_{xy}}{\sqrt{SS_{xx}\, SS_{yy}}} = \frac{-143.8}{\sqrt{(2509)(2443.2)}} = -.058$$
$$r^2 = \frac{SS_{yy} - SSE}{SS_{yy}} = 1 - \frac{SSE}{SS_{yy}}, \qquad 0 \le r^2 \le 1$$
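In simple linear regression the coefficient of determination equals the square of the coefficient of correlation. The sketch below computes both from the sums of squares and checks that they agree; the function name `correlation_and_determination` and the data are assumptions for illustration.

```python
import math


def correlation_and_determination(x, y):
    """Return (r, r_squared), with r_squared computed as 1 - SSE / SSyy."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    ss_xx = sum((xi - x_bar) ** 2 for xi in x)
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    ss_yy = sum((yi - y_bar) ** 2 for yi in y)
    r = ss_xy / math.sqrt(ss_xx * ss_yy)
    sse = ss_yy - (ss_xy / ss_xx) * ss_xy  # SSyy - b1 * SSxy
    return r, 1 - sse / ss_yy


# Illustrative data (an assumption for this sketch)
r, r_squared = correlation_and_determination([1, 2, 3, 4, 5], [1, 1, 2, 2, 4])
print(f"r = {r:.3f}, r^2 = {r_squared:.3f}")
```

The sign of r always matches the sign of the fitted slope, because both share the numerator $SS_{xy}$.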
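11.6: Using the Model for Estimation and Prediction
$$\sigma_{\hat{y}} = \sigma \sqrt{\frac{1}{n} + \frac{(x_p - \bar{x})^2}{SS_{xx}}} \quad \text{(the standard error of } \hat{y}\text{)}$$
where $\sigma$ is the standard deviation of the error term $\varepsilon$.
Using s to estimate $\sigma$, a $100(1-\alpha)\%$ confidence interval on the mean of y is
$$\hat{y} \pm t_{\alpha/2}\, s \sqrt{\frac{1}{n} + \frac{(x_p - \bar{x})^2}{SS_{xx}}},$$
with $n-2$ degrees of freedom.
$$\hat{\sigma}_{\hat{y}} = s \sqrt{\frac{1}{n} + \frac{(x_p - \bar{x})^2}{SS_{xx}}} = (1)\sqrt{\frac{1}{5} + \frac{(140 - 153.4)^2}{2509}} = .5211$$
A 95% confidence interval for $E(y \mid x = 140)$ is
$$\hat{y} \pm t_{.025,\, df=3}\, s \sqrt{\frac{1}{n} + \frac{(x_p - \bar{x})^2}{SS_{xx}}} = 99.9 \pm 3.182(.5211) = 99.9 \pm 1.66$$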
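The arithmetic behind the 95% confidence interval on the mean of y at x = 140 can be checked with a short script. The inputs below are the values shown in the worked example; s = 1 is inferred from that arithmetic rather than stated outright, so treat it as an assumption. The function name `mean_ci` is hypothetical.

```python
import math


def mean_ci(y_hat, s, n, x_p, x_bar, ss_xx, t_crit):
    """Return (standard_error, half_width, interval) for the mean of y at x_p."""
    standard_error = s * math.sqrt(1 / n + (x_p - x_bar) ** 2 / ss_xx)
    half_width = t_crit * standard_error
    return standard_error, half_width, (y_hat - half_width, y_hat + half_width)


# Values from the worked example; s = 1 is inferred, not stated in the text
se_mean, hw_mean, interval = mean_ci(
    y_hat=99.9, s=1.0, n=5, x_p=140, x_bar=153.4, ss_xx=2509, t_crit=3.182
)
print(f"standard error = {se_mean:.4f}, half-width = {hw_mean:.2f}")
```

The standard error grows as $x_p$ moves away from $\bar{x}$, so intervals are narrowest at the mean of the observed x values.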
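Sampling Error for the Predictor of an Individual Value of $y \mid x_p$
$$\sigma_{(y - \hat{y})} = \sigma \sqrt{1 + \frac{1}{n} + \frac{(x_p - \bar{x})^2}{SS_{xx}}}$$
where $\sigma$ is the standard deviation of the error term $\varepsilon$.
Using s to estimate $\sigma$, a $100(1-\alpha)\%$ prediction interval for an individual value of y is
$$\hat{y} \pm t_{\alpha/2}\, s \sqrt{1 + \frac{1}{n} + \frac{(x_p - \bar{x})^2}{SS_{xx}}},$$
with $n-2$ degrees of freedom.
$$\hat{\sigma}_{(y - \hat{y})} = s \sqrt{1 + \frac{1}{n} + \frac{(x_p - \bar{x})^2}{SS_{xx}}} = (1)\sqrt{1 + \frac{1}{5} + \frac{(140 - 153.4)^2}{2509}} = 1.13$$
A 95% prediction interval for $y \mid x = 140$ is
$$\hat{y} \pm t_{.025,\, df=3}\, s \sqrt{1 + \frac{1}{n} + \frac{(x_p - \bar{x})^2}{SS_{xx}}} = 99.9 \pm 3.182(1.13) = 99.9 \pm 3.596$$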
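To see how much the extra "1 +" under the square root widens the interval, the sketch below computes both half-widths from the same inputs. The values come from the worked example at x = 140; s = 1 is inferred from that arithmetic and should be read as an assumption, and the function name `interval_half_widths` is hypothetical.

```python
import math


def interval_half_widths(s, n, x_p, x_bar, ss_xx, t_crit):
    """Return (mean_half_width, prediction_half_width) at x_p."""
    core = 1 / n + (x_p - x_bar) ** 2 / ss_xx
    mean_hw = t_crit * s * math.sqrt(core)      # confidence interval on the mean of y
    pred_hw = t_crit * s * math.sqrt(1 + core)  # prediction interval for one new y
    return mean_hw, pred_hw


# Values from the worked example; s = 1 is inferred, not stated in the text
mean_hw, pred_hw = interval_half_widths(
    s=1.0, n=5, x_p=140, x_bar=153.4, ss_xx=2509, t_crit=3.182
)
print(f"mean: +/- {mean_hw:.2f}, individual: +/- {pred_hw:.2f}")
```

Carrying the unrounded standard error through gives a prediction half-width near 3.59, slightly below the 3.596 obtained when the standard error is first rounded to 1.13.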
Prediction intervals for individual new values of y are wider than confidence intervals on the mean of y because of the extra source of error.
[Figure: error in predicting a mean value of y|xp. Label: Error in E(y|xp)]
[Figure: error in predicting a specific value of y|xp. Labels: Error in E(y|xp); Sampling error from the y population]
[Figure: True relationship, with Xi and Xj marking the range of observed values of x]
11.7: A Complete Example
Step 1
How does the proximity of a fire
house (x) affect the damages (y)
from a fire?
y = f(x)
$y = \beta_0 + \beta_1 x + \varepsilon$
Step 2
The data (found in Table 11.7) produce the
following estimates (in thousands of dollars):
$\hat{\beta}_0 = 10.28$
$\hat{\beta}_1 = 4.92$
The estimated damages equal $10,280 plus $4,920 for each mile from the fire station, or
$\hat{y} = 10.28 + 4.92x$
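The fitted line can be wrapped in a small function to produce predicted damages. This is a sketch only: the function name `predicted_damage` and the 3.5-mile example distance are hypothetical, and the line should not be used for distances outside the range observed in the data.

```python
def predicted_damage(miles_from_station):
    """Predicted fire damage in thousands of dollars: y-hat = 10.28 + 4.92 x."""
    if miles_from_station < 0:
        raise ValueError("distance must be non-negative")
    return 10.28 + 4.92 * miles_from_station


# Hypothetical distance chosen for illustration
distance = 3.5
estimate = predicted_damage(distance)
print(f"{distance} miles: about ${estimate * 1000:,.0f} in damages")
```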
Step 3
The estimate of the standard deviation, $\sigma$, of $\varepsilon$ is
s = 2.31635
Most of the observed fire damages will be within $2s \approx 4.64$ thousand dollars of the predicted value.
Step 4
Test that the true slope is 0
$H_0: \beta_1 = 0$
$H_a: \beta_1 > 0$
SAS automatically performs a two-tailed
test, with a reported p-value < .0001. The
one-tailed p-value is < .00005, which
provides strong evidence to reject the null.
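Converting a reported two-tailed p-value to a one-tailed one relies on the symmetry of the t distribution: halve it when the estimate falls on the side named by the alternative hypothesis, otherwise take one minus half. The sketch below encodes that rule; the function name `one_tailed_p` and its arguments are hypothetical.

```python
def one_tailed_p(two_tailed_p, estimate, direction):
    """Convert a two-tailed p-value for H0: beta1 = 0 to a one-tailed p-value.

    direction is ">" for Ha: beta1 > 0 or "<" for Ha: beta1 < 0.
    Assumes a symmetric sampling distribution, as with the t statistic.
    """
    if direction not in (">", "<"):
        raise ValueError("direction must be '>' or '<'")
    agrees = estimate > 0 if direction == ">" else estimate < 0
    half = two_tailed_p / 2
    return half if agrees else 1 - half


# The reported bound of .0001, with a positive slope estimate and Ha: beta1 > 0
bound = one_tailed_p(0.0001, estimate=4.92, direction=">")
print(f"one-tailed p-value is below {bound}")
```

Applied to the reported bound of .0001 with a positive slope estimate, the rule gives the .00005 bound quoted above.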
A 95% confidence interval on $\beta_1$ from the SAS output is $4.071 \le \beta_1 \le 5.768$.
The coefficient of determination, $r^2$, is .9235.
The coefficient of correlation, r, is
$r = +\sqrt{r^2} = \sqrt{.9235} = .96$
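Recovering r from $r^2$ requires a sign, which comes from the sign of the fitted slope. A minimal sketch, assuming the slope estimate of 4.92 and the reported $r^2$ of .9235; the function name `correlation_from_r_squared` is hypothetical.

```python
import math


def correlation_from_r_squared(r_squared, slope):
    """Return r with magnitude sqrt(r_squared) and the sign of the slope."""
    if not 0 <= r_squared <= 1:
        raise ValueError("r_squared must lie between 0 and 1")
    return math.copysign(math.sqrt(r_squared), slope)


r_fire = correlation_from_r_squared(0.9235, slope=4.92)
print(f"r = {r_fire:.2f}")
```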