Complete Business Statistics: Simple Linear Regression and Correlation
BUSINESS STATISTICS
by Amir D. Aczel & Jayavel Sounderpandian
7th edition
Prepared by Lloyd Jaisingh, Morehead State University

Chapter 10
- Using Statistics
- The Simple Linear Regression Model
- Estimation: The Method of Least Squares
- Error Variance and the Standard Errors of Regression Estimators
- Correlation
- Hypothesis Tests about the Regression Relationship
- How Good is the Regression?
- Analysis of Variance Table and an F Test of the Regression Model
- Residual Analysis and Checking for Model Inadequacies
- Use of the Regression Model for Prediction
- The Solver Method for Regression
Learning Objectives
After studying this chapter, you should be able to:
[Scatter plot: Sales (0 to 100) versus Advertising (0 to 50).]
The scatter of points tends to be distributed around a positively sloped straight line.
The pairs of values of advertising expenditures and sales are not located exactly on a
straight line.
The scatter plot reveals a more or less strong tendency rather than a precise linear
relationship.
The line represents the nature of the relationship on average.
Model Building

The inexact nature of the relationship between advertising and sales suggests that a statistical model might be useful in analyzing the relationship. A statistical model separates the systematic component of a relationship from the random component.

    Data → Statistical model: Systematic component + Random errors
In ANOVA, the systematic component is the variation of means between samples or treatments (SSTR) and the random component is the unexplained variation (SSE). In regression, the systematic component is the overall linear relationship, and the random component is the variation around the line.
The population simple linear regression model:

    Y = β₀ + β₁X + ε
        (nonrandom or systematic component) + (random component)

where
- Y is the dependent variable, the variable we wish to explain or predict
- X is the independent variable, also called the predictor variable
- ε is the error term, the only random component in the model, and thus the only source of randomness in Y
- β₀ is the intercept of the systematic component of the regression relationship
- β₁ is the slope of the systematic component

The conditional mean of Y:

    E[Y | X] = β₀ + β₁X
Regression Plot

[Figure: the line E[Y] = β₀ + β₁X, with intercept β₀ and slope β₁; the observed point Yᵢ at Xᵢ lies off the line by the error εᵢ.]

The simple linear regression model gives an exact linear relationship between the expected or average value of Y, the dependent variable, and X, the independent or predictor variable:

    E[Yᵢ] = β₀ + β₁Xᵢ

Actual observed values of Y differ from the expected value by an unexplained or random error:

    Yᵢ = E[Yᵢ] + εᵢ = β₀ + β₁Xᵢ + εᵢ
Assumptions of the simple linear regression model:
- The relationship between X and Y is a straight-line relationship.
- The values of the independent variable X are assumed fixed (not random); the only randomness in the values of Y comes from the error term εᵢ.
- The errors εᵢ are normally distributed with mean 0 and variance σ². The errors are uncorrelated (not related) in successive observations. That is: ε ~ N(0, σ²).
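The assumptions above can be illustrated by simulating data from the model. A minimal sketch: the parameter values are borrowed from Example 10-1 later in the chapter, and the X grid is made up for illustration.

```python
import random

random.seed(1)
beta0, beta1, sigma = 274.85, 1.2553, 318.16  # illustrative values (Example 10-1)
xs = [1000 + 200 * i for i in range(25)]      # X values are fixed, not random
# The only randomness in Y comes from the error term, eps ~ N(0, sigma^2):
ys = [beta0 + beta1 * x + random.gauss(0, sigma) for x in xs]
```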
[Figure: E[Y] = β₀ + β₁X with identical normal distributions of errors, all centered on the regression line.]
The estimated regression relationship:

    Y = b₀ + b₁X + e

where b₀ estimates the intercept of the population regression line, β₀; b₁ estimates the slope of the population regression line, β₁; and e stands for the observed errors, the residuals from fitting the estimated regression line b₀ + b₁X to a set of n points.

The estimated regression line:

    Ŷ = b₀ + b₁X

where Ŷ (Y-hat) is the value of Y lying on the fitted regression line for a given value of X.
[Figure: data points and three errors from a fitted line.]
Errors in Regression

[Figure: the observed data point Yᵢ, its fitted value Ŷᵢ on the line Ŷ = b₀ + b₁X, and the error eᵢ = Yᵢ − Ŷᵢ.]
The sum of squared errors:

    SSE = Σeᵢ² = Σ(yᵢ − ŷᵢ)²

The least squares regression line is that which minimizes the SSE with respect to the estimates b₀ and b₁.

The normal equations:

    Σyᵢ = nb₀ + b₁Σxᵢ
    Σxᵢyᵢ = b₀Σxᵢ + b₁Σxᵢ²

[Figure: the SSE surface over (b₀, b₁); SSE is minimized at the least squares b₀ and b₁.]
2
SSx
((xxxx))
xx n
SS
x
n 22
y
y
2
2
2
2
SS y
((yyyy))
yy n
SS
y
n
x ((
y))
x
y
SSxy
xy
((xxxx)()(yyyy))
xy
SS
xy
nn
Leastsquares
squaresregression
regressionestimators:
estimators:
Least
SS XY
SS
bb11 SSXY
SS XX
bb00 yybb11xx
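The estimators above translate directly into code. A minimal sketch (the helper name `least_squares` is mine, not from the text):

```python
def least_squares(xs, ys):
    """Least-squares estimates b0, b1 via the sums of squares SS_x and SS_xy."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    ss_x = sum(x * x for x in xs) - sx * sx / n               # SS_x
    ss_xy = sum(x * y for x, y in zip(xs, ys)) - sx * sy / n  # SS_xy
    b1 = ss_xy / ss_x            # slope
    b0 = sy / n - b1 * sx / n    # intercept: y-bar minus b1 times x-bar
    return b0, b1
```

For points lying exactly on y = 2 + 3x, the function recovers b₀ = 2 and b₁ = 3.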
Example 10-1

Miles     Dollars   Miles²        Miles×Dollars
1211      1802      1466521       2182222
1345      2405      1809025       3234725
1422      2005      2022084       2851110
1687      2511      2845969       4236057
1849      2332      3418801       4311868
2026      2305      4104676       4669930
2133      3016      4549689       6433128
2253      3385      5076009       7626405
2400      3090      5760000       7416000
2468      3694      6091024       9116792
2699      3371      7284601       9098329
2806      3998      7873636       11218388
3082      3555      9498724       10956510
3209      4692      10297681      15056628
3466      4244      12013156      14709704
3643      5298      13271449      19300614
3852      4801      14837904      18493452
4033      5147      16265089      20757852
4267      5738      18207288      24484046
4498      6420      20232004      28877160
4533      6059      20548088      27465448
4804      6426      23078416      30870504
5090      6321      25908100      32173890
5233      7026      27384288      36767056
5439      6964      29582720      37877196
Total:
79,448    106,605   293,426,946   390,185,014
    SS_x = Σx² − (Σx)²/n = 293,426,946 − (79,448)²/25 = 40,947,557.84

    SS_xy = Σxy − (Σx)(Σy)/n = 390,185,014 − (79,448)(106,605)/25 = 51,402,852.4

    b₁ = SS_xy / SS_x = 51,402,852.4 / 40,947,557.84 = 1.255333776 ≈ 1.26

    b₀ = ȳ − b₁x̄ = 106,605/25 − (1.255333776)(79,448/25) = 274.85
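The hand computation above can be checked from the column totals alone. A quick sketch:

```python
# Column totals from the Example 10-1 table (n = 25)
n = 25
sum_x, sum_y = 79_448, 106_605            # miles, dollars
sum_x2, sum_xy = 293_426_946, 390_185_014

ss_x = sum_x2 - sum_x**2 / n              # 40,947,557.84
ss_xy = sum_xy - sum_x * sum_y / n        # 51,402,852.4
b1 = ss_xy / ss_x                         # 1.2553...
b0 = sum_y / n - b1 * sum_x / n           # 274.85
```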
An unbiased estimator of σ², denoted by s²:

    MSE = SSE / (n − 2)

Example 10-1:

    SSE = SS_y − b₁SS_xy = 66,855,898 − (1.255333776)(51,402,852.4) = 2,328,161.2

    MSE = SSE / (n − 2) = 2,328,161.2 / 23 = 101,224.4

    s = √MSE = √101,224.4 = 318.158
The standard error of b₀ (intercept):

    s(b₀) = s·√(Σx²) / √(n·SS_x),  where s = √MSE

The standard error of b₁ (slope):

    s(b₁) = s / √SS_x

Example 10-1:

    s(b₀) = 318.158·√293,426,946 / √((25)(40,947,557.84)) = 170.338

    s(b₁) = 318.158 / √40,947,557.84 = 0.04972
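The same standard-error arithmetic as a sketch, using the quantities given above:

```python
import math

n = 25
sse = 2_328_161.2
ss_x = 40_947_557.84
sum_x2 = 293_426_946

s = math.sqrt(sse / (n - 2))               # sqrt(MSE), about 318.158
s_b0 = s * math.sqrt(sum_x2 / (n * ss_x))  # standard error of the intercept, ~170.34
s_b1 = s / math.sqrt(ss_x)                 # standard error of the slope, ~0.04972
```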
10-24
r9
5%
Up
pe
r
we
Lo
%
95
d:
un
o
b
Length = 1
5
1.1
Height = Slope
bo
un
d
on
slo
pe
:1
.3
58
20
6
24
Example 10-1, 95% confidence intervals:

    b₀ ± t(0.025, 23)·s(b₀) = 274.85 ± (2.069)(170.338) = 274.85 ± 352.43 = [−77.58, 627.28]

    b₁ ± t(0.025, 23)·s(b₁) = 1.25533 ± (2.069)(0.04972) = 1.25533 ± 0.10287 = [1.15246, 1.35820]
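The interval arithmetic as a sketch, with the slide's critical value t(0.025, 23) = 2.069:

```python
t = 2.069                       # t(0.025, 23), from the slide
b0, s_b0 = 274.85, 170.338
b1, s_b1 = 1.25533, 0.04972

ci_b0 = (b0 - t * s_b0, b0 + t * s_b0)   # about [-77.58, 627.28]
ci_b1 = (b1 - t * s_b1, b1 + t * s_b1)   # about [1.15246, 1.35820]
```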
10-5 Correlation
The correlation between two random variables, X and Y, is a measure of the degree of linear association between the two variables. The population correlation, denoted by ρ, can take on any value from −1 to 1.

    ρ = −1       indicates a perfect negative linear relationship
    −1 < ρ < 0   indicates a negative linear relationship
    ρ = 0        indicates no linear relationship
    0 < ρ < 1    indicates a positive linear relationship
    ρ = 1        indicates a perfect positive linear relationship

The absolute value of ρ indicates the strength or exactness of the relationship.
Illustrations of Correlation

[Figure: scatter plots of Y against X for ρ = −1, ρ = −.8, ρ = 0 (two panels), ρ = .8, and ρ = 1.]
Example 10-1:

    r = SS_xy / √(SS_x·SS_y) = 51,402,852.4 / √((40,947,557.84)(66,855,898))
      = 51,402,852.4 / 52,321,943.29 = .9824
Test statistic:

    t(n−2) = r / √((1 − r²)/(n − 2))

Example 10-1:

    t(23) = 0.9824 / √((1 − 0.9651)/23) = 0.9824 / 0.0389 = 25.25

    t(0.005, 23) = 2.807 < 25.25, so H₀ is rejected at the 1% level.
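The correlation and its test statistic, as a sketch:

```python
import math

n = 25
ss_xy = 51_402_852.4
ss_x, ss_y = 40_947_557.84, 66_855_898

r = ss_xy / math.sqrt(ss_x * ss_y)       # about 0.9824
t = r / math.sqrt((1 - r**2) / (n - 2))  # about 25.2, far beyond t(0.005, 23) = 2.807
```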
[Figure: scatter patterns with no linear relationship — unsystematic variation, and a nonlinear relationship.]

Test statistic for the regression slope:

    t(n−2) = b₁ / s(b₁)

where b₁ is the least-squares estimate of the regression slope and s(b₁) is the standard error of b₁. When the null hypothesis is true, the statistic has a t distribution with n − 2 degrees of freedom.

Example 10-1:

    t = b₁ / s(b₁) = 1.25533 / 0.04972 = 25.25

    t(0.005, 23) = 2.807 < 25.25, so H₀: β₁ = 0 is rejected at the 1% level.
Example 10-4:

    H₀: β₁ = 1
    H₁: β₁ ≠ 1

    t(n−2) = (b₁ − 1) / s(b₁) = (1.24 − 1) / 0.21 = 1.14

    t(0.05, 58) = 1.671 > 1.14

H₀ is not rejected at the 10% level. We may not conclude that the beta coefficient is different from 1.
[Figure: total, explained, and unexplained deviations of a point from ȳ.]

    (y − ȳ) = (ŷ − ȳ) + (y − ŷ)
    Total deviation = Explained deviation + Unexplained deviation

    SST = SSR + SSE

The coefficient of determination,

    r² = SSR/SST = 1 − SSE/SST,

is the percentage of total variation explained by the regression.
[Figure: partitions of SST into SSR and SSE for r² = 0, r² = 0.50, and r² = 0.90; scatter plot of Dollars against Miles.]

Example 10-1:

    r² = SSR/SST = 64,527,736.8 / 66,855,898 = 0.96518
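A one-line check of the r² figure, as a sketch:

```python
ssr, sst = 64_527_736.8, 66_855_898.0  # from Example 10-1
sse = sst - ssr                        # 2,328,161.2, matching the earlier SSE
r_sq = ssr / sst                       # 0.96518
```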
Analysis of Variance Table

Source of    Sum of    Degrees of   Mean     F Ratio
Variation    Squares   Freedom      Square
Regression   SSR       (1)          MSR      MSR/MSE
Error        SSE       (n−2)        MSE
Total        SST       (n−1)        MST
Example 10-1:

Source of    Sum of         Degrees of   Mean           F Ratio   p Value
Variation    Squares        Freedom      Square
Regression   64,527,736.8   1            64,527,736.8   637.47    0.000
Error        2,328,161.2    23           101,224.4
Total        66,855,898.0   24
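The F ratio in the table is MSR/MSE. A sketch of the arithmetic:

```python
ssr, sse = 64_527_736.8, 2_328_161.2
df_reg, df_err = 1, 23

msr = ssr / df_reg   # 64,527,736.8
mse = sse / df_err   # 101,224.4
f = msr / mse        # about 637.5
```

For simple regression, F equals the square of the slope's t statistic (25.25² ≈ 637.6, matching up to rounding).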
[Figure: residual plots against x or ŷ and against time.]
Point Prediction
- A single-valued estimate of Y for a given value of X obtained by inserting the value of X in the estimated regression equation.

Prediction Interval
- For a value of Y given a value of X:
  - variation in regression line estimate
  - variation of points around regression line
- For an average value of Y given a value of X:
  - variation in regression line estimate
The prediction band for E[Y|X] is narrowest at the mean value of X. The prediction band widens as the distance from the mean of X increases. Predictions become very unreliable when we extrapolate beyond the range of the sample itself.
Prediction interval for Y at a given value of X:

    ŷ ± t(α/2, n−2) · s · √(1 + 1/n + (x − x̄)²/SS_x)

Example 10-1 (X = 4,000):

    {274.85 + (1.2553)(4,000)} ± (2.069)(318.16)·√(1 + 1/25 + (4,000 − 3,177.92)²/40,947,557.84)
    = 5,296.05 ± 676.62 = [4,619.43, 5,972.67]
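The prediction-interval arithmetic at X = 4,000, as a sketch:

```python
import math

t, s = 2.069, 318.16          # t(0.025, 23) and sqrt(MSE), from the slides
n, x_bar, ss_x = 25, 3_177.92, 40_947_557.84
x = 4_000

y_hat = 274.85 + 1.2553 * x                                # 5,296.05
half = t * s * math.sqrt(1 + 1/n + (x - x_bar)**2 / ss_x)  # 676.62
pi = (y_hat - half, y_hat + half)                          # about [4,619.43, 5,972.67]
```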
Confidence interval for the average value of Y, E[Y|X], at a given value of X:

    ŷ ± t(α/2, n−2) · s · √(1/n + (x − x̄)²/SS_x)

Example 10-1 (X = 4,000):

    {274.85 + (1.2553)(4,000)} ± (2.069)(318.16)·√(1/25 + (4,000 − 3,177.92)²/40,947,557.84)
    = 5,296.05 ± 156.48 = [5,139.57, 5,452.53]
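The same computation without the extra "1 +" term gives the narrower interval for the mean response. A sketch:

```python
import math

t, s = 2.069, 318.16
n, x_bar, ss_x = 25, 3_177.92, 40_947_557.84
x = 4_000

y_hat = 274.85 + 1.2553 * x                            # 5,296.05
half = t * s * math.sqrt(1/n + (x - x_bar)**2 / ss_x)  # 156.48 (no "1 +" term)
ci = (y_hat - half, y_hat + half)                      # about [5,139.57, 5,452.53]
```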
[Fitted line plot: Y = −0.8465 + 1.352X, with S = 0.184266, R-Sq = 95.2%, R-Sq(adj) = 94.8%.]