
SBE11 CH 16


Slides by John Loucks, St. Edward's University

© 2011 Cengage Learning. All Rights Reserved. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part.
Chapter 16
Regression Analysis: Model Building
• General Linear Model
• Determining When to Add or Delete Variables
• Variable Selection Procedures
• Multiple Regression Approach to Experimental Design
• Autocorrelation and the Durbin-Watson Test

General Linear Model
• Models in which the parameters (β0, β1, . . . , βp) all have exponents of one are called linear models.
• A general linear model involving p independent variables is

  y = β0 + β1z1 + β2z2 + . . . + βpzp + ε

• Each of the independent variables z is a function of x1, x2, . . . , xk (the variables for which data have been collected).

General Linear Model
• The simplest case is when we have collected data for just one variable x1 and want to estimate y by using a straight-line relationship. In this case z1 = x1.
• This model is called a simple first-order model with one predictor variable.

  y = β0 + β1x1 + ε

Modeling Curvilinear Relationships
• To account for a curvilinear relationship, we might set z1 = x1 and z2 = x1².
• This model is called a second-order model with one predictor variable.

  y = β0 + β1x1 + β2x1² + ε
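A minimal sketch of fitting this second-order model in Python, treating z1 = x1 and z2 = x1² as ordinary regressors; the data arrays are hypothetical placeholders, not from the text.

```python
# Fit y = b0 + b1*x + b2*x^2 by least squares (hypothetical data).
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 7.2, 11.8, 18.3, 26.1])

# polyfit returns the coefficients from the highest power down
b2, b1, b0 = np.polyfit(x, y, deg=2)
print(f"y-hat = {b0:.3f} + {b1:.3f}x + {b2:.3f}x^2")
```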

Interaction
• If the original data set consists of observations for y and two independent variables x1 and x2, we might develop a second-order model with two predictor variables.

  y = β0 + β1x1 + β2x2 + β3x1² + β4x2² + β5x1x2 + ε

• In this model, the variable z5 = x1x2 is added to account for the potential effects of the two variables acting together.
• This type of effect is called interaction.
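A sketch of fitting this model with statsmodels, a common Python choice; the data are simulated for illustration only.

```python
# Second-order model with interaction: the z-variables are
# x1, x2, x1^2, x2^2, and the interaction x1*x2.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x1 = rng.uniform(0, 10, 50)
x2 = rng.uniform(0, 10, 50)
y = 5 + 2*x1 + 3*x2 + 0.5*x1**2 - 0.4*x2**2 + 1.5*x1*x2 + rng.normal(0, 2, 50)

Z = sm.add_constant(np.column_stack([x1, x2, x1**2, x2**2, x1 * x2]))
fit = sm.OLS(y, Z).fit()
print(fit.params)  # estimates of beta0 ... beta5
```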

Transformations Involving the Dependent Variable
• Often the problem of nonconstant variance can be corrected by transforming the dependent variable to a different scale.
• Most statistical packages provide the ability to apply logarithmic transformations using either base 10 (common log) or base e = 2.71828... (natural log).
• Another approach, called a reciprocal transformation, is to use 1/y as the dependent variable instead of y.
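As a sketch (simulated data, statsmodels assumed available), both transformations amount to regressing a transformed response on the same X; note y must be positive for the log.

```python
# Log and reciprocal transformations of the dependent variable.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = np.linspace(1, 20, 40)
y = np.exp(0.2 * x) * rng.lognormal(0.0, 0.3, 40)

X = sm.add_constant(x)
log_fit = sm.OLS(np.log(y), X).fit()    # model log(y) instead of y
recip_fit = sm.OLS(1.0 / y, X).fit()    # or model 1/y instead of y
print(log_fit.params, recip_fit.params)
```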

Nonlinear Models That Are Intrinsically Linear
• Models in which the parameters (β0, β1, . . . , βp) have exponents other than one are called nonlinear models.
• In some cases we can perform a transformation of variables that will enable us to use regression analysis with the general linear model.
• The exponential model involves the regression equation:

  E(y) = β0β1^x

• We can transform this nonlinear model to a linear model by taking the logarithm of both sides:

  log E(y) = log β0 + x log β1

  which is a first-order model in x with parameters log β0 and log β1.
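A sketch of this transformation in practice (simulated data): fit a straight line to log y, then exponentiate the intercept and slope to recover estimates of β0 and β1.

```python
# Estimating E(y) = b0 * b1**x by ordinary least squares on log(y).
# Simulated data; the true values are b0 = 2 and b1 = 1.3.
import numpy as np

rng = np.random.default_rng(2)
x = np.arange(1.0, 11.0)
y = 2.0 * 1.3 ** x * rng.lognormal(0.0, 0.05, 10)

slope, intercept = np.polyfit(x, np.log(y), deg=1)
b0_hat, b1_hat = np.exp(intercept), np.exp(slope)
print(f"E(y) is estimated as {b0_hat:.3f} * {b1_hat:.3f}**x")
```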

Determining When to Add or Delete Variables
• To test whether the addition of x2 to a model involving x1 (or the deletion of x2 from a model involving x1 and x2) is statistically significant, we can perform an F test.
• The F test is based on the reduction in the error sum of squares that results from adding one or more independent variables to the model.

  F = [(SSE(reduced) - SSE(full)) / number of extra terms] / MSE(full)

  F = [(SSE(x1) - SSE(x1, x2)) / 1] / [SSE(x1, x2) / (n - p - 1)]
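A sketch of this F test on simulated data, assuming statsmodels and SciPy are available; SSE is the .ssr attribute of a fitted OLS model.

```python
# Partial F test for adding x2 to a model that already contains x1.
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(3)
n = 30
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 1 + 2 * x1 + 1.5 * x2 + rng.normal(size=n)

reduced = sm.OLS(y, sm.add_constant(x1)).fit()                      # x1 only
full = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()  # x1 and x2

extra_terms = 1
F = ((reduced.ssr - full.ssr) / extra_terms) / full.mse_resid
p_value = stats.f.sf(F, extra_terms, full.df_resid)
print(F, p_value)
```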

Determining When to Add or Delete Variables
• The p-value criterion can also be used to determine whether it is advantageous to add one or more independent variables to a multiple regression model.
• The p-value associated with the computed F statistic can be compared to the level of significance α.
• It is difficult to determine the p-value directly from tables of the F distribution, but computer software packages, such as Minitab or Excel, provide the p-value.

Variable Selection Procedures
• Stepwise Regression, Forward Selection, and Backward Elimination are iterative procedures: one independent variable at a time is added or deleted based on the F statistic (or its p-value).
• Best-Subsets Regression instead evaluates different subsets of the independent variables.
• The first three procedures are heuristics and therefore offer no guarantee that the best model will be found.

Variable Selection: Stepwise Regression
• At each iteration, the first consideration is whether the least significant variable currently in the model can be removed, because its p-value is greater than the user-specified or default Alpha to remove.
• If no variable can be removed, the procedure checks whether the most significant variable not in the model can be added, because its p-value is less than the user-specified or default Alpha to enter.
• If no variable can be removed and no variable can be added, the procedure stops.

Variable Selection: Stepwise Regression
• Flowchart of the stepwise procedure (a code sketch follows):

  1. Start with no independent variables in the model.
  2. Compute the F statistic and p-value for each independent variable in the model. If any p-value > Alpha to remove, the variable with the largest p-value is removed from the model; repeat step 2.
  3. Otherwise, compute the F statistic and p-value for each independent variable not in the model. If any p-value < Alpha to enter, the variable with the smallest p-value is entered into the model; begin the next iteration at step 2.
  4. If no variable can be removed and no variable can be added, stop.
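A compact sketch of this loop with statsmodels; the function name, the use of p-values rather than F-to-enter/F-to-remove values, and the default thresholds are our own choices, following the flowchart above. df is assumed to be a pandas DataFrame holding the response and all candidate predictors.

```python
import statsmodels.api as sm

def stepwise(df, response, alpha_enter=0.05, alpha_remove=0.10):
    """Stepwise selection by p-value; alpha_remove >= alpha_enter avoids cycling."""
    selected = []
    candidates = [c for c in df.columns if c != response]
    while True:
        changed = False
        # Removal step: drop the least significant variable in the model.
        if selected:
            fit = sm.OLS(df[response], sm.add_constant(df[selected])).fit()
            pvals = fit.pvalues.drop("const")
            worst = pvals.idxmax()
            if pvals[worst] > alpha_remove:
                selected.remove(worst)
                changed = True
        # Entry step: add the most significant variable not in the model.
        if not changed:
            best, best_p = None, alpha_enter
            for c in candidates:
                if c in selected:
                    continue
                fit = sm.OLS(df[response],
                             sm.add_constant(df[selected + [c]])).fit()
                if fit.pvalues[c] < best_p:
                    best, best_p = c, fit.pvalues[c]
            if best is not None:
                selected.append(best)
                changed = True
        if not changed:
            return selected
```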
Variable Selection: Forward Selection
• This procedure is similar to stepwise regression, but does not permit a variable to be deleted.
• The forward selection procedure starts with no independent variables.
• It adds variables one at a time as long as a significant reduction in the error sum of squares (SSE) can be achieved.

Variable Selection: Forward Selection
• Flowchart of the forward selection procedure:

  1. Start with no independent variables in the model.
  2. Compute the F statistic and p-value for each independent variable not in the model.
  3. If any p-value < Alpha to enter, the variable with the smallest p-value is entered into the model; return to step 2. Otherwise, stop.
Variable Selection: Backward Elimination
• This procedure begins with a model that includes all the independent variables the modeler wants considered.
• It then attempts to delete one variable at a time by determining whether the least significant variable currently in the model can be removed, because its p-value is greater than the user-specified or default Alpha to remove.
• Once a variable has been removed from the model it cannot reenter at a subsequent step.

Variable Selection: Backward Elimination
• Flowchart of the backward elimination procedure (a code sketch follows):

  1. Start with all the independent variables in the model.
  2. Compute the F statistic and p-value for each independent variable in the model.
  3. If any p-value > Alpha to remove, the variable with the largest p-value is removed from the model; return to step 2. Otherwise, stop.
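In code, the whole procedure is a short loop. A sketch with statsmodels (the function name and DataFrame layout are our own assumptions); the Clarksville example on the following slides traces exactly these steps by hand.

```python
import statsmodels.api as sm

def backward_eliminate(df, response, alpha_remove=0.05):
    """Drop the largest-p-value variable until all p-values <= alpha_remove."""
    selected = [c for c in df.columns if c != response]
    while selected:
        fit = sm.OLS(df[response], sm.add_constant(df[selected])).fit()
        pvals = fit.pvalues.drop("const")
        worst = pvals.idxmax()
        if pvals[worst] <= alpha_remove:
            break                   # every remaining p-value is small enough
        selected.remove(worst)      # once removed, a variable never reenters
    return selected
```

With the 25 Clarksville observations in a pandas DataFrame df whose columns are named, say, 'Selling Price', 'House Size', 'Bedrooms', 'Bathrooms', and 'Cars' (hypothetical names), backward_eliminate(df, 'Selling Price') would retrace the steps shown next.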
Variable Selection: Backward Elimination
• Example: Clarksville Homes

  Tony Zamora, a real estate investor, has just moved to Clarksville and wants to learn about the city's residential real estate market. Tony has randomly selected 25 house-for-sale listings from the Sunday newspaper and collected the data partially listed on the next slide.

  Use the backward elimination procedure to develop a multiple regression model that predicts the selling price of a house in Clarksville.

Variable Selection: Backward Elimination
• Partial Data

  Segment    Selling Price  House Size    Number of  Number of  Garage Size
  of City    ($000)         (00 sq. ft.)  Bedrms.    Bathrms.   (cars)
  Northwest  290            21            4          2          2
  South       95            11            2          1          0
  Northeast  170            19            3          2          2
  Northwest  375            38            5          4          3
  West       350            24            4          3          2
  South      125            10            2          2          0
  West       310            31            4          4          2
  West       275            25            3          2          2

  Note: the remaining 17 observations are not shown.
Variable Selection: Backward Elimination
• Regression Output (all four independent variables)

              Coeffic.   Std. Err.  t Stat   P-value
  Intercept   -59.416    54.6072    -1.0881  0.28951
  House Size    6.50587   3.24687    2.0037  0.05883
  Bedrooms     29.1013   26.2148     1.1101  0.28012
  Bathrooms    26.4004   18.8077     1.4037  0.17574
  Cars        -10.803    27.329     -0.3953  0.6968   <- greatest p-value > .05; variable to be removed
Variable Selection: Backward Elimination
• Cars (garage size) is the independent variable with the greatest p-value (.697), which exceeds .05.
• The Cars variable is removed from the model.
• Multiple regression is performed again on the remaining independent variables.

Variable Selection: Backward Elimination
• Regression Output (after removing Cars)

              Coeffic.   Std. Err.  t Stat   P-value
  Intercept   -47.342    44.3467    -1.0675  0.29785
  House Size    6.02021   2.94446    2.0446  0.05363
  Bedrooms     23.0353   20.8229     1.1062  0.28113  <- greatest p-value > .05; variable to be removed
  Bathrooms    27.0286   18.3601     1.4721  0.15581

Variable Selection: Backward Elimination
• Bedrooms is the independent variable with the greatest p-value (.281), which exceeds .05.
• The Bedrooms variable is removed from the model.
• Multiple regression is performed again on the remaining independent variables.

Variable Selection: Backward Elimination
• Regression Output (after removing Bedrooms)

              Coeffic.   Std. Err.  t Stat   P-value
  Intercept   -12.349    31.2392    -0.3953  0.69642
  House Size    7.94652   2.38644    3.3299  0.00304
  Bathrooms    30.3444   18.2056     1.6668  0.10974  <- greatest p-value > .05; variable to be removed

Variable Selection: Backward Elimination
• Bathrooms is the independent variable with the greatest p-value (.110), which exceeds .05.
• The Bathrooms variable is removed from the model.
• Multiple regression is performed again on the remaining independent variable.

Variable Selection: Backward Elimination
• Regression Output (House Size only)

              Coeffic.   Std. Err.  t Stat   P-value
  Intercept   -9.8669    32.3874    -0.3047  0.76337
  House Size  11.3383     1.29384    8.7633  8.7E-09  <- greatest p-value among the independent variables is < .05; no variable is removed

Variable Selection: Backward Elimination
• House size is the only independent variable remaining in the model.
• The estimated regression equation is:

  ŷ = -9.8669 + 11.3383(House Size)

Variable Selection: Best-Subsets Regression
• The three preceding procedures are one-variable-at-a-time methods offering no guarantee that the best model for a given number of variables will be found.
• Some software packages include best-subsets regression, which enables the user to find, for a specified number of independent variables, the best regression model.
• Minitab output identifies the two best one-variable estimated regression equations, the two best two-variable equations, and so on.
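With only a handful of candidate variables, best-subsets can be brute-forced. A sketch using itertools and statsmodels, ranking subsets by adjusted R² (Minitab also reports Cp and s); the function name and DataFrame layout are our own assumptions.

```python
# Brute-force best-subsets regression: fit every subset of the candidate
# predictors in a pandas DataFrame; report the top models of each size.
from itertools import combinations
import statsmodels.api as sm

def best_subsets(df, response, top=2):
    candidates = [c for c in df.columns if c != response]
    for k in range(1, len(candidates) + 1):
        fits = []
        for subset in combinations(candidates, k):
            model = sm.OLS(df[response],
                           sm.add_constant(df[list(subset)])).fit()
            fits.append((model.rsquared_adj, subset))
        for adj_r2, subset in sorted(fits, reverse=True)[:top]:
            print(f"{k} vars: {', '.join(subset)}  adj R-sq = {adj_r2:.3f}")
```

Applied to the PGA data that follows (columns Drive, Fair, Green, Putt, Sand, Score), best_subsets(df, 'Score') would produce a ranking like the Minitab output shown later.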

Variable Selection: Best-Subsets Regression
• Example: PGA Tour Data

  The Professional Golfers Association keeps a variety of statistics regarding performance measures. Data include the average driving distance, percentage of drives that land in the fairway, percentage of greens hit in regulation, average number of putts, percentage of sand saves, and average score.

Variable-Selection Procedures
• Variable Names and Definitions

  Drive: average length of a drive in yards
  Fair: percentage of drives that land in the fairway
  Green: percentage of greens hit in regulation (a par-3 green is "hit in regulation" if the player's first shot lands on the green)
  Putt: average number of putts for greens that have been hit in regulation
  Sand: percentage of sand saves (landing in a sand trap and still scoring par or better)
  Score: average score for an 18-hole round
Variable-Selection Procedures
• Sample Data (Part 1)

  Drive   Fair   Green   Putt    Sand   Score
  277.6   .681   .667    1.768   .550   69.10
  259.6   .691   .665    1.810   .536   71.09
  269.1   .657   .649    1.747   .472   70.12
  267.0   .689   .673    1.763   .672   69.88
  267.3   .581   .637    1.781   .521   70.71
  255.6   .778   .674    1.791   .455   69.76
  272.9   .615   .667    1.780   .476   70.19
  265.4   .718   .699    1.790   .551   69.73

Variable-Selection Procedures
• Sample Data (Part 2)

  Drive   Fair   Green   Putt    Sand   Score
  272.6   .660   .672    1.803   .431   69.97
  263.9   .668   .669    1.774   .493   70.33
  267.0   .686   .687    1.809   .492   70.32
  266.0   .681   .670    1.765   .599   70.09
  258.1   .695   .641    1.784   .500   70.46
  255.6   .792   .672    1.752   .603   69.49
  261.3   .740   .702    1.813   .529   69.88
  262.2   .721   .662    1.754   .576   70.27

Variable-Selection Procedures
• Sample Data (Part 3)

  Drive   Fair   Green   Putt    Sand   Score
  260.5   .703   .623    1.782   .567   70.72
  271.3   .671   .666    1.783   .492   70.30
  263.3   .714   .687    1.796   .468   69.91
  276.6   .634   .643    1.776   .541   70.69
  252.1   .726   .639    1.788   .493   70.59
  263.0   .687   .675    1.786   .486   70.20
  263.0   .639   .647    1.760   .374   70.81
  253.5   .732   .693    1.797   .518   70.26
  266.2   .681   .657    1.812   .472   70.96

Variable-Selection Procedures
• Sample Correlation Coefficients

          Score   Drive   Fair    Green   Putt
  Drive   -.154
  Fair    -.427   -.679
  Green   -.556   -.045    .421
  Putt     .258   -.139    .101    .354
  Sand    -.278   -.024    .265    .083   -.296

Variable-Selection Procedures
• Best Subsets Regression of SCORE (the X's mark which of the variables D, F, G, P, S are included)

  Vars  R-sq   R-sq(a)  C-p    s        D F G P S
  1     30.9   27.9     26.9   .39685   X
  1     18.2   14.6     35.7   .43183   X
  2     54.7   50.5     12.4   .32872   X X
  2     54.6   50.5     12.5   .32891   X X
  3     60.7   55.1     10.2   .31318   X X X
  3     59.1   53.3     11.4   .31957   X X X
  4     72.2   66.8      4.2   .26913   X X X X
  4     60.9   53.1     12.1   .32011   X X X X
  5     72.6   65.4      6.0   .27499   X X X X X

Variable-Selection Procedures
• Minitab Output

  The regression equation is
  Score = 74.678 - .0398(Drive) - 6.686(Fair) - 10.342(Green) + 9.858(Putt)

  Predictor  Coef      Stdev    t-ratio   p
  Constant   74.678    6.952    10.74     .000
  Drive      -.0398    .01235   -3.22     .004
  Fair       -6.686    1.939    -3.45     .003
  Green      -10.342   3.561    -2.90     .009
  Putt       9.858     3.180    3.10      .006

  s = .2691   R-sq = 72.4%   R-sq(adj) = 66.8%
Variable-Selection Procedures
• Minitab Output

  Analysis of Variance

  SOURCE       DF   SS        MS       F       P
  Regression    4   3.79469   .94867   13.10   .000
  Error        20   1.44865   .07243
  Total        24   5.24334

Multiple Regression Approach to
Experimental Design
• The use of dummy variables in a multiple regression equation can provide another approach to solving analysis of variance and experimental design problems.
• We will use the results of multiple regression to perform the ANOVA test on the difference in the means of three populations.

Multiple Regression Approach to
Experimental Design
• Example: Reed Manufacturing

  Janet Reed would like to know if there is any significant difference in the mean number of hours worked per week by the department managers at her three manufacturing plants (in Buffalo, Pittsburgh, and Detroit).

  A simple random sample of five managers from each of the three plants was taken, and the number of hours worked by each manager in the previous week is shown on the next slide.

Multiple Regression Approach to
Experimental Design
                   Plant 1   Plant 2      Plant 3
  Observation      Buffalo   Pittsburgh   Detroit
  1                48        73           51
  2                54        63           63
  3                57        66           61
  4                54        64           54
  5                62        74           56
  Sample Mean      55        68           57
  Sample Variance  26.0      26.5         24.5

Multiple Regression Approach to
Experimental Design
• We begin by defining two dummy variables, A and B, that will indicate the plant from which each sample observation was selected.
• In general, if there are k populations, we need to define k - 1 dummy variables.

  A = 0, B = 0 if observation is from Buffalo plant
  A = 1, B = 0 if observation is from Pittsburgh plant
  A = 0, B = 1 if observation is from Detroit plant

Multiple Regression Approach to
Experimental Design
• Input Data

  Plant 1 (Buffalo)   Plant 2 (Pittsburgh)   Plant 3 (Detroit)
  A  B  y             A  B  y                A  B  y
  0  0  48            1  0  73               0  1  51
  0  0  54            1  0  63               0  1  63
  0  0  57            1  0  66               0  1  61
  0  0  54            1  0  64               0  1  54
  0  0  62            1  0  74               0  1  56

Multiple Regression Approach to
Experimental Design
  E(y) = expected number of hours worked
       = β0 + β1A + β2B

  For Buffalo:    E(y) = β0 + β1(0) + β2(0) = β0
  For Pittsburgh: E(y) = β0 + β1(1) + β2(0) = β0 + β1
  For Detroit:    E(y) = β0 + β1(0) + β2(1) = β0 + β2

Multiple Regression Approach to
Experimental Design
• Excel produced the estimated regression equation:

  ŷ = 55 + 13A + 2B

  Plant        Estimate of E(y)
  Buffalo      b0 = 55
  Pittsburgh   b0 + b1 = 55 + 13 = 68
  Detroit      b0 + b2 = 55 + 2 = 57
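A sketch reproducing this regression in Python with statsmodels, using the 15 observations from the data slide; the printed values match the slide's estimates.

```python
# Dummy-variable regression for the Reed Manufacturing example.
import numpy as np
import statsmodels.api as sm

hours = np.array([48, 54, 57, 54, 62,    # Buffalo    (A=0, B=0)
                  73, 63, 66, 64, 74,    # Pittsburgh (A=1, B=0)
                  51, 63, 61, 54, 56])   # Detroit    (A=0, B=1)
A = np.array([0]*5 + [1]*5 + [0]*5)
B = np.array([0]*10 + [1]*5)

fit = sm.OLS(hours, sm.add_constant(np.column_stack([A, B]))).fit()
print(fit.params)                # b0 = 55, b1 = 13, b2 = 2
print(fit.fvalue, fit.f_pvalue)  # overall F = 9.55, p-value = .003
```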

Multiple Regression Approach to
Experimental Design
• Next, we observe that if there is no difference in the means:

  E(y) for the Pittsburgh plant - E(y) for the Buffalo plant = 0
  E(y) for the Detroit plant - E(y) for the Buffalo plant = 0

Multiple Regression Approach to
Experimental Design
• Because β0 equals E(y) for the Buffalo plant and β0 + β1 equals E(y) for the Pittsburgh plant, the first difference is equal to (β0 + β1) - β0 = β1.
• Because β0 + β2 equals E(y) for the Detroit plant, the second difference is equal to (β0 + β2) - β0 = β2.
• We would conclude that there is no difference in the three means if β1 = 0 and β2 = 0.

Multiple Regression Approach to
Experimental Design
• The null hypothesis for a test of the difference of means is

  H0: β1 = β2 = 0

• To test this null hypothesis, we must compare the value of MSR/MSE to the critical value from an F distribution with the appropriate numerator and denominator degrees of freedom.

Multiple Regression Approach to
Experimental Design
• ANOVA Table Produced by Excel

  Source of    Sum of    Degrees of   Mean
  Variation    Squares   Freedom      Squares   F      p
  Regression   490        2           245       9.55   .003
  Error        308       12           25.667
  Total        798       14

Multiple Regression Approach to
Experimental Design
• At a .05 level of significance, the critical value of F with k - 1 = 3 - 1 = 2 numerator d.f. and nT - k = 15 - 3 = 12 denominator d.f. is 3.89.
• Because the observed value of F (9.55) is greater than the critical value of 3.89, we reject the null hypothesis.
• Alternatively, we reject the null hypothesis because the p-value of .003 < α = .05.

Autocorrelation and the Durbin-Watson Test
• Often, the data used for regression studies in business and economics are collected over time.
• It is not uncommon for the value of y at one time period to be related to the value of y at previous time periods.
• In this case, we say autocorrelation (or serial correlation) is present in the data.

Autocorrelation and the Durbin-Watson Test
• With positive autocorrelation, we expect a positive residual in one period to be followed by a positive residual in the next period.
• Likewise, with positive autocorrelation we expect a negative residual in one period to be followed by a negative residual in the next period.
• With negative autocorrelation, we expect a positive residual in one period to be followed by a negative residual in the next period, then a positive residual, and so on.

Autocorrelation and the Durbin-Watson Test
• When autocorrelation is present, one of the regression assumptions is violated: the error terms are not independent.
• When autocorrelation is present, serious errors can be made in performing tests of significance based upon the assumed regression model.
• The Durbin-Watson statistic can be used to detect first-order autocorrelation.

Autocorrelation and the Durbin-Watson Test
• Durbin-Watson Test Statistic

  d = [Σ from t=2 to n of (et - et-1)²] / [Σ from t=1 to n of et²]

  The ith residual is denoted ei = yi - ŷi.
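A sketch of the computation on a hypothetical vector of residuals; statsmodels also ships this statistic, and the two values agree.

```python
# Durbin-Watson statistic from a vector of residuals.
import numpy as np
from statsmodels.stats.stattools import durbin_watson

e = np.array([0.5, 0.8, 0.4, -0.2, -0.7, -0.3, 0.1, 0.6, 0.9, 0.2])

d = np.sum(np.diff(e) ** 2) / np.sum(e ** 2)  # the formula above
print(d, durbin_watson(e))
```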

Autocorrelation and the Durbin-Watson Test
• Durbin-Watson Test Statistic
  • The statistic ranges in value from zero to four.
  • If successive values of the residuals are close together (positive autocorrelation is present), the statistic will be small.
  • If successive values are far apart (negative autocorrelation is present), the statistic will be large.
  • A value of two indicates no autocorrelation.

Autocorrelation and the Durbin-Watson Test
• Suppose the values of e (the residuals) are not independent but are related in the following manner:

  et = ρet-1 + zt

  where ρ is a parameter with an absolute value less than one and zt is a normally and independently distributed random variable with a mean of zero and a variance of σ².
• We see that if ρ = 0, the error terms are not related.
• The Durbin-Watson test uses the residuals to determine whether ρ = 0.

Autocorrelation and the Durbin-Watson Test
• The null hypothesis always is:

  H0: ρ = 0 (there is no autocorrelation)

• The alternative hypothesis is:

  Ha: ρ > 0 to test for positive autocorrelation
  Ha: ρ < 0 to test for negative autocorrelation
  Ha: ρ ≠ 0 to test for positive or negative autocorrelation

Autocorrelation and the Durbin-Watson Test
• A Sample of Critical Values for the Durbin-Watson Test for Autocorrelation

  Significance points of dL and dU: α = .05

                       Number of Independent Variables
           1            2            3            4            5
  n     dL    dU     dL    dU     dL    dU     dL    dU     dL    dU
  15   1.08  1.36   0.95  1.54   0.82  1.75   0.69  1.97   0.56  2.21
  16   1.10  1.37   0.98  1.54   0.86  1.73   0.74  1.93   0.62  2.15
  17   1.13  1.38   1.02  1.54   0.90  1.71   0.78  1.90   0.67  2.10
  18   1.16  1.39   1.05  1.53   0.93  1.69   0.82  1.87   0.71  2.06

Autocorrelation and the Durbin-Watson Test
• Decision regions for the Durbin-Watson test (d ranges from 0 to 4; dL and dU come from the table above):

  Test for positive autocorrelation (Ha: ρ > 0):
    d < dL: evidence of positive autocorrelation
    dL ≤ d ≤ dU: inconclusive
    d > dU: no evidence of positive autocorrelation

  Test for negative autocorrelation (Ha: ρ < 0):
    d < 4 - dU: no evidence of negative autocorrelation
    4 - dU ≤ d ≤ 4 - dL: inconclusive
    d > 4 - dL: evidence of negative autocorrelation

  Two-sided test (Ha: ρ ≠ 0):
    d < dL: evidence of positive autocorrelation
    dL ≤ d ≤ dU: inconclusive
    dU < d < 4 - dU: no evidence of autocorrelation
    4 - dU ≤ d ≤ 4 - dL: inconclusive
    d > 4 - dL: evidence of negative autocorrelation
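The three panels collapse into one decision function. A sketch for the two-sided test, using dL and dU from the critical-value table; the function name is our own.

```python
# Two-sided Durbin-Watson decision rule.
def dw_decision(d, dL, dU):
    if d < dL:
        return "evidence of positive autocorrelation"
    if d < dU:
        return "inconclusive"
    if d <= 4 - dU:
        return "no evidence of autocorrelation"
    if d <= 4 - dL:
        return "inconclusive"
    return "evidence of negative autocorrelation"

# Example: n = 15 and one independent variable give dL = 1.08, dU = 1.36.
print(dw_decision(1.2, 1.08, 1.36))  # prints "inconclusive"
```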
End of Chapter 16

