Chapter 10

Simple Linear Regression


1. The following data were collected by a bank wishing to examine the relationship (if
any) between individual income and savings per year (in units of $1,000).
Income 60 40 50 30 70 80 74 54
Savings 6 3 3 2 8 12 11 7
(a) Which of the two variables would you choose to be the response variable in a
simple linear regression analysis?
Solution:
The bank would be most interested in predicting an individual's Savings, given
their individual Income, and in assessing the effect of a change in Income on
Savings. Since Savings is the variable that we are most interested in predicting,
Savings is taken as the response variable. Income is the explanatory variable.
(b) Without using Excel, sketch an approximate scatterplot of the data.
Solution:
Your scatterplot should look similar to this:
A simple linear regression analysis was performed using Excel, yielding the following
output:
(c) Use the Excel output to write down an estimate \(b_1\) for the regression slope
parameter \(\beta_1\). Interpret the meaning of \(b_1\) in terms of family income and
savings.
Solution:
From the output, the estimate of the slope parameter \(\beta_1\) is \(b_1 = 0.205\) (3 d.p.).
The interpretation is that if an individual's Income increases by $1,000, their
expected Savings increases by $205.
(d) Test the hypothesis \(H_0: \beta_1 = 0\) against \(H_1: \beta_1 \neq 0\) at the 1% significance
level.
Solution:
In simple linear regression, the test of \(H_0: \beta_1 = 0\) (there is no relationship
between Income and Savings) against \(H_1: \beta_1 \neq 0\) (there is a significant linear
relationship between Income and Savings) can be carried out in either of
two (equivalent) ways. The first approach is based on the test statistic
\[ F = \frac{MS_{\text{Regression}}}{MS_{\text{Residual}}} \quad (\sim F_{1,n-2} \text{ under } H_0). \]
From the Excel output, the observed value is \(F_{\text{obs}} = 48.42\), and the p-value
associated with this observed value is \(0.000437 < \alpha = 0.01\).
The second approach is based on the test statistic
\[ T = \frac{B_1}{\sqrt{MS_{\text{Residual}}/SS_x}} \quad (\sim t_{n-2} = t_6 \text{ under } H_0), \]
where
\[ B_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \]
is the least squares estimator of the slope \(\beta_1\). From the Excel output, the
observed value of the test statistic is
\[ t_{\text{obs}} = 6.9585, \]
and the critical region is (two-tail test, \(\alpha = 0.01\))
\[ CR = \{ |T| > t_{\text{crit}} = t^{\alpha/2}_{n-2} = t^{0.005}_{6} = 3.7074 \}. \]
Thus, \(t_{\text{obs}} \in CR\). (Alternatively, it can be seen from the Excel output that
the p-value associated with \(t_{\text{obs}} = 6.9585\) is \(0.000437 < \alpha = 0.01\).)
Thus, either approach results in rejection of \(H_0: \beta_1 = 0\) in favour of
\(H_1: \beta_1 \neq 0\) at the 1% significance level. Hence there is sufficient evidence at the
1% significance level to conclude that there is a significant linear relationship
between Income and Savings.
The advantage of the second approach is that it can be used to test \(H_0: \beta_1 = 0\)
against either of the one-tail alternatives \(H_1: \beta_1 > 0\) (significant positive
relationship between Income and Savings) or \(H_1: \beta_1 < 0\) (significant negative
relationship between Income and Savings), by choosing the appropriate form of
the critical region.
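As a cross-check on the Excel figures quoted above, the slope estimate and both test statistics can be recomputed directly from the eight data points. The following is a minimal sketch using only the Python standard library; the function name `simple_linear_regression` is chosen here for illustration and is not part of the original exercise, and the critical value is simply copied from the solution rather than computed.

```python
import math

# Data from Question 1 (both variables in units of $1,000).
income = [60, 40, 50, 30, 70, 80, 74, 54]
savings = [6, 3, 3, 2, 8, 12, 11, 7]


def simple_linear_regression(x, y):
    """Return least squares estimates and the t and F statistics for the slope."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_x = sum((xi - x_bar) ** 2 for xi in x)
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    ss_total = sum((yi - y_bar) ** 2 for yi in y)

    b1 = ss_xy / ss_x                 # slope estimate
    b0 = y_bar - b1 * x_bar           # intercept estimate
    ss_regression = b1 * ss_xy        # regression sum of squares (1 df)
    ms_residual = (ss_total - ss_regression) / (n - 2)

    t_obs = b1 / math.sqrt(ms_residual / ss_x)
    f_obs = ss_regression / ms_residual
    return {"b0": b0, "b1": b1, "t": t_obs, "F": f_obs}


fit = simple_linear_regression(income, savings)

T_CRIT = 3.7074  # t critical value for alpha/2 = 0.005 and 6 df, copied from the solution
reject_h0 = abs(fit["t"]) > T_CRIT

print(fit, reject_h0)
```

In simple linear regression the F statistic equals the square of the t statistic, which is why the two approaches always lead to the same two-sided conclusion.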
(e) Briefly explain what, in practice, is the purpose of examining a plot of Residuals
against the explanatory variable (Income).
Solution:
From diagnostic plots, one can check whether any of the assumptions of simple
linear regression appear to be violated. From residual plots, one can assess
the appropriateness of the linear model, and can recognise if the errors are not
independent or do not have constant variance. The remaining assumption is
that of Normality of the errors, which can be checked by examining a Normality
plot of the residuals.
(f) From this regression, what is the predicted Savings for an individual with an
Income of $20,000 per annum? Comment on the usefulness of this prediction.
Solution:
Predicted Savings \(= b_0 + b_1 \times \text{Income} = -5.246 + 0.205 \times 20 = -1.142\)
(using the unrounded coefficients; the rounded values shown give \(-1.146\)). One
might question how a negative value of Savings should be interpreted. Note
that an Income of $20,000 is outside the range of the data upon which the
model was constructed, hence this prediction is not reliable and should be
taken with a grain of salt. Predictions are only reliable if the values of any
explanatory variables are within the range of the data.
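The extrapolation warning can be made concrete with a small helper that returns the prediction together with a flag saying whether the requested Income lies inside the observed range. This is an illustrative sketch only: `predict_savings` is a hypothetical name, and the coefficients are refitted from the Question 1 data rather than read from the Excel output.

```python
# Data from Question 1 (both variables in units of $1,000).
income = [60, 40, 50, 30, 70, 80, 74, 54]
savings = [6, 3, 3, 2, 8, 12, 11, 7]

n = len(income)
x_bar = sum(income) / n
y_bar = sum(savings) / n
ss_x = sum((x - x_bar) ** 2 for x in income)
ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(income, savings))
b1 = ss_xy / ss_x
b0 = y_bar - b1 * x_bar


def predict_savings(new_income):
    """Return (predicted savings, True if new_income is within the observed range)."""
    inside_range = min(income) <= new_income <= max(income)
    return b0 + b1 * new_income, inside_range


prediction, inside_range = predict_savings(20)
print(round(prediction, 3), inside_range)
```

Here the flag comes back `False` because the smallest observed Income is 30, so the negative predicted value is an extrapolation rather than evidence of negative savings.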
2. The following output comes from a linear regression, modelling the number of
electronic components assembled (within a certain time) by employees of an electronics
company with differing amounts of experience (in years).
(a) Specify the regression model and explain each term in the model.
(b) State the estimated regression equation between Production and Experience.
(c) Is there a significant linear relationship between Production and Experience?
Justify your answer.
(d) Do the residual plots suggest any problems with model assumptions?
(e) Estimate the effect, on average, of
i. a one year increase in experience,
ii. a two year increase in experience.
(f) State a 95% confidence interval for the slope parameter \(\beta_1\).
(g) What is the coefficient of determination for this model? What is its meaning?
Solution:
(a) \(\text{Production} = \beta_0 + \beta_1\,\text{Experience} + \varepsilon\), where Production is the number
of components produced, Experience is the number of years of experience the
employee has, \(\beta_0\) is the intercept, \(\beta_1\) is the slope, and \(\varepsilon\) is the random variation
term or error.
(b) The estimated regression line is
\(\widehat{\text{Production}} = 2.914 + 1.967\,\text{Experience}\).
(c) We are testing the hypotheses \(H_0: \beta_1 = 0\) against \(H_1: \beta_1 \neq 0\). The p-value
for this test is \(7.33 \times 10^{-31} < 0.05\), so the data provides overwhelming
evidence against the null hypothesis. We conclude that there is a significant
linear relationship between Production and Experience.
(d) 1. A linear model is appropriate. There is no evidence of a trend in the
residual plot.
2. The errors are normally distributed. The points in the normal prob-
ability plot lie approximately on a straight line, indicating the assumption
of normality is okay.
3. The errors have constant variance. The spread of residuals about the
horizontal axis does not vary as Experience increases, so this looks okay.
4. The errors are independent (or uncorrelated). The residual plot
doesn't show any clear violation of independence.
There is no evidence of outliers or points of high leverage. Thus there is no reason to
doubt the adequacy of our linear regression model.
(e) i. an extra one year of experience will increase Production on average by
1.967 components.
ii. an extra two years of experience will increase Production on average by
\(1.967 \times 2 = 3.934\) components.
(f) We can read the 95% confidence interval for \(\beta_1\) from the Excel output as
(1.912, 2.021).
(g) \(r^2 = 0.9955\). This means that the variation in Experience explains 99.55% of
the variation in Production.
3. House Data: Regression of Price against Age
Open the House.xlsx file. We will perform a regression analysis of Price against the
Age of the houses sold. The data was collected in 2010.
(a) Create a column called AgeHouse (which is simply 2010 - YrBuilt). To do this,
type AgeHouse in Cell J1, type = 2010 - E2 in Cell J2, and fill down.
(b) Produce a scatterplot of Price against AgeHouse, and describe any general
trend. Aside from this, is there anything else of note?
Solution:
A scatterplot of Price against AgeHouse is shown below:
There does seem to be a trend for Price to decrease with increasing age, but this
is due almost entirely to 5 points (possible outliers?) which correspond to very
new houses.
(c) Go to Data -> Data Analysis -> Regression. The Input Y Range is Price,
the Input X Range is AgeHouse. You should include Labels in these ranges;
check the corresponding box. Select an Output Range and click OK.
Solution:
These steps yield the following output:
(d) Write down the equation of regression.
Solution:
From the Table of Coefficients, the regression equation is
\(\widehat{\text{Price}} = 478.77 - 3.4485\,\text{AgeHouse}\).
(e) Test appropriate hypotheses to determine if there is a significant linear
relationship between Price and AgeHouse. State your conclusion.
Solution:
The hypotheses of interest are
\(H_0: \beta_1 = 0\), \(H_1: \beta_1 \neq 0\),
where \(\beta_1\) is the true regression gradient linking House Prices and the age of
the house.
The p-value for the test is \(3.394 \times 10^{-9} \ll 0.05\), so there is overwhelming
evidence against the null hypothesis. We conclude that there is a significant
linear relationship between Price and the age of the house sold.
(f) Examine the residual plots and Normality plot associated with the regression.
Do the residuals appear Normally distributed?
Solution:
No, there does seem to be some non-Normality in the residuals: the Normality
plot is not entirely linear, due to perhaps 3-5 extreme points. This throws some
doubt on our conclusion above. One option would be to remove one or two of
them and see whether the Normality of the residuals improves, and whether
our conclusions stay the same.
4. Calculate the estimated coefficients \(b_0\) and \(b_1\) in the estimated least squares
regression equation \(\hat{y} = b_0 + b_1 x\) in each of the cases (a) and (b), using the formulae given
in lecture slides, for a set of data \((x_i, y_i)\), \(i = 1, 2, \ldots, 10\), such that
(a) \(\sum_{i=1}^{10} x_i = 15\), \(\sum_{i=1}^{10} y_i = 714\), \(\sum_{i=1}^{10} x_i y_i = 1278\), \(\sum_{i=1}^{10} x_i^2 = 25.8\),
(b) \(\bar{x} = 0\), \(\bar{y} = 12.7\), \(SS_{xy} = -246.56\), \(s_x^2 = 36.67\).
Solution:
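(a) From formulae given in Lecture slides,
\[ b_1 = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2} = \frac{1278 - 10\left(\frac{15}{10}\right)\left(\frac{714}{10}\right)}{25.8 - 10\left(\frac{15}{10}\right)^2} = 62.727, \]
\[ b_0 = \bar{y} - b_1\bar{x} = \frac{714}{10} - 62.727\left(\frac{15}{10}\right) = -22.691, \]
so the equation of the regression line is \(\hat{y} = -22.691 + 62.727x\).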
(b) First we need to compute \(SS_x\). Recognising that \(SS_x = \sum_{i=1}^{n} (x_i - \bar{x})^2 = (n-1)s_x^2\),
we find that \(SS_x = (n-1)s_x^2 = 9 \times 36.67\). Thus
\[ b_1 = \frac{SS_{xy}}{SS_x} = \frac{-246.56}{36.67 \times 9} = -0.7471, \]
\[ b_0 = \bar{y} - b_1\bar{x} = 12.7 - (-0.7471)(0) = 12.7, \]
so the equation of the regression line is \(\hat{y} = 12.7 - 0.7471x\).
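Both cases of Question 4 reduce to a few lines of arithmetic, sketched below as a check. The function names are hypothetical. One assumption is worth flagging: minus signs were lost when this text was extracted, and case (b) is computed here with \(SS_{xy} = -246.56\), on the reading that the fitted line in the solution has a negative slope.

```python
def coefficients_from_sums(n, sum_x, sum_y, sum_xy, sum_x2):
    """Least squares estimates from raw sums, as in case (a)."""
    x_bar = sum_x / n
    y_bar = sum_y / n
    b1 = (sum_xy - n * x_bar * y_bar) / (sum_x2 - n * x_bar ** 2)
    b0 = y_bar - b1 * x_bar
    return b0, b1


def coefficients_from_ss(n, x_bar, y_bar, ss_xy, s2_x):
    """Least squares estimates from SS_xy and the sample variance of x, as in case (b)."""
    ss_x = (n - 1) * s2_x          # SS_x = (n - 1) * s_x^2
    b1 = ss_xy / ss_x
    b0 = y_bar - b1 * x_bar
    return b0, b1


# Case (a): sums given in the question.
b0_a, b1_a = coefficients_from_sums(10, 15, 714, 1278, 25.8)

# Case (b): the negative sign of SS_xy is an assumption (see the note above).
b0_b, b1_b = coefficients_from_ss(10, 0, 12.7, -246.56, 36.67)

print(round(b0_a, 3), round(b1_a, 3))
print(round(b0_b, 3), round(b1_b, 4))
```

Because \(\bar{x} = 0\) in case (b), the intercept equals \(\bar{y}\) whatever the sign of the slope.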
5. Open the Excel file House.xlsx. National Realty wants you to investigate the
relationship between the selling price of a house (in $1,000) and the area of the block of
land on which it is situated (in m²). You decide to perform a simple linear regression
between Price and Area.
(a) First, decide which of the two variables should be chosen as the response vari-
able. Then specify the regression model, and explain each term in the model.
(b) What are the assumptions that must be satisfied to ensure that a simple linear
regression is appropriate?
(c) Using Excel, produce an appropriate Summary Output for the simple linear
regression described by (a). This should include an appropriate set of
diagnostic plots that can be used to assess whether or not the assumptions of the
regression model in (b) are justified.
(d) From your output in (c), write down the estimated regression equation between
Price and Area.
(e) Give an interpretation for the estimate of the slope parameter in the estimated
regression equation in (d).
(f) Do the diagnostic plots suggest any violation of the assumptions in (b)?
Solution:
(a) Price is the appropriate choice for the response variable. The regression model
is \(\text{Price} = \beta_0 + \beta_1\,\text{Area} + \varepsilon\), where Price is the selling price of the house, Area
is the area of the block of the house, \(\beta_0\) is the intercept, \(\beta_1\) is the slope, and
\(\varepsilon\) is the random variation term or error.
(b) The assumptions of the simple linear regression are
i. A linear model is appropriate: \(\text{Price} = \beta_0 + \beta_1\,\text{Area} + \varepsilon\), where \(E[\varepsilon] = 0\);
ii. The error variables are Normally distributed;
iii. The error variables have constant variance;
iv. The error variables are independent (or at least uncorrelated).
(c) An Excel output is shown:
(d) \(\widehat{\text{Price}} = 219.23 + 0.2901\,\text{Area}\).
(e) If the area increases by 1 m² then the selling price will, on average, increase by
$290.
(f) 1. A linear model is appropriate. The scatter plot is slightly suggestive
of a curved relationship, particularly if the one extremely negative point
is seen as an outlier.
2. The Error variables are normally distributed. The normality plot
is approximately linear, except at the tails. The Normality assumption is
called into question by 4-6 extreme points. See below.
3. The Error variables have constant variance. This really depends
on how we see the one negative outlier. Without this point, the constant
variance assumption looks okay. Leaving this point in, constant variance
is more open to question.
4. The Error variables are independent (or uncorrelated). The
residual plot doesn't show any clear violation of independence.
The data contains some very extreme points in the Area variable and all of
these would have high leverage. One of these points is an extreme negative
outlier, all of which cast some doubt on the results above. We would be well
advised to see what effect these points have on our regression model, by refitting
the model with one or more of these points removed.
6. (a) Based on your answers to Question 7e, predict the selling price for a house
with area equal to (i) 900 m²; (ii) 1900 m². Comment on the reliability of
these predictions.
(b) Is there a significant (linear) relationship between Price and Area? State the
hypotheses to be tested, and read off the appropriate p-value for this test from
your output in Question 7(e)iii.
(c) Without any calculation, state a 95% confidence interval for the slope
parameter \(\beta_1\).
(d) Calculate a 98% confidence interval for \(\beta_1\).
Solution:
(a) (i) \(\widehat{\text{Price}} = 219.23 + 0.2901(900) = 480.354\), that is, $480,354.
(ii) \(\widehat{\text{Price}} = 219.23 + 0.2901(1900) = 770.493\), that is, $770,493. The first prediction is
reliable (subject to the comments above about residuals), since 900 is in the
range of Area on which we built the model. The second prediction is unreliable,
as 1900 is well outside the data range of Area upon which we built the model.
(b) We are testing the hypotheses \(H_0: \beta_1 = 0\) versus \(H_1: \beta_1 \neq 0\). The
p-value for this test is \(4.1 \times 10^{-29} < 0.05\), so the data provides overwhelming
evidence against the null hypothesis. We conclude that there is a significant
linear relationship between Price and Area.
(c) We can read the 95% confidence interval from the Excel output as (0.2467, 0.3336).
(d) 98% CL for \(\beta_1 = b_1 \pm t_{0.01,206} \times SE(b_1)\). So 98% CL for
\(\beta_1 = 0.2901 \pm 2.3451 \times 0.022044 = 0.2901 \pm 0.0517\), so a 98% CI for \(\beta_1\) is (0.2384, 0.3418).
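The 98% interval in (d) is an instance of the general recipe estimate ± critical value × standard error. A minimal sketch follows; the helper name `confidence_interval` is hypothetical, and the three inputs are simply the figures quoted in the solution rather than values computed from House.xlsx.

```python
def confidence_interval(estimate, t_critical, standard_error):
    """Return (lower, upper) for estimate +/- t_critical * standard_error."""
    margin = t_critical * standard_error
    return estimate - margin, estimate + margin


# Slope estimate, t critical value (upper 1% point, 206 df) and standard error,
# all as quoted in the solution above.
lower, upper = confidence_interval(0.2901, 2.3451, 0.022044)
print(round(lower, 4), round(upper, 4))
```

Only the critical value changes with the confidence level, which is why the 98% interval is wider than the 95% interval read from the output in (c).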
Chapter 11
Multiple Linear Regression
1. Absenteeism is a major problem for employers in most countries, reducing potential
output by an estimated 10%. Economists M. Chaudhary and I. Ng (Canadian Journal
of Economics, August 1992) conducted a research project to better understand
the causes of this problem. They randomly selected 100 organisations to participate
in a year long study. For each organisation, the average number of days absent per
employee was recorded, along with several other variables described below:
Wage : the average employee wage
Pct PT: percentage of part time employees
Pct U: the percentage of unionised employees
Av Shift: availability of shift work (1 = yes, 0 = no)
U/M Rel: union-management relationship (1 = good, 0 = not good)
A linear regression analysis was conducted with Absent (average number of days
absent per employee) as response, and some of the output is given on the following
page.
(a) Specify the multiple linear regression model between Absent and the explana-
tory variables, and explain each term in the model.
(b) Is there sufficient evidence to conclude that the availability of shift work is
related to absenteeism? Justify your answer.
(c) Can we infer that in organisations where union and management relations are
poor, absenteeism is high? Justify your answer.
(d) Write down the fitted regression model between Absent and the explanatory
variables, using only the significant terms.
(e) State and verify the assumptions of the linear regression model using the output.
(f) Which variable, Av Shift or U/M Rel, has the greatest effect on absenteeism
in the workplace according to this data?
(g) Compute a 95% confidence interval for the coefficient of the percentage of
unionised employees.
(h) How can this model be improved? Justify your answer.
Solution:
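(a) The regression model is
\[ \text{Absent} = \beta_0 + \beta_1\,\text{Wage} + \beta_2\,\text{Pct PT} + \beta_3\,\text{Pct U} + \beta_4\,\text{Av Shift} + \beta_5\,\text{U/M Rel} + \varepsilon, \]
where \(\beta_0\) is the intercept, \(\beta_1\) is the coefficient of Wage, \(\beta_2\) is the coefficient of
Pct PT, \(\beta_3\) is the coefficient of Pct U, \(\beta_4\) is the coefficient of Av Shift, \(\beta_5\) is the
coefficient of U/M Rel, and \(\varepsilon\) is the random variation term.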
(b) The p-value of the coefficient of Av Shift is 0.0025 < 0.05, so there is sufficient
evidence to conclude that the availability of shift work is related to absenteeism.
In fact, since the coefficient of Av Shift is positive, the availability of shift work
increases the mean number of days absent per employee (by 1.56 days per year).
(c) The p-value of the coefficient of U/M Rel is \(5.99 \times 10^{-7} \ll 0.05\), so there is
sufficient evidence to conclude that the status of union-management relations
is related to absenteeism. Since the coefficient of U/M Rel is negative, it
indicates that if union-management relations are good, then the mean number
of days absent per employee decreases (by 2.64 days per year). Equivalently,
bad union-management relations imply that the mean number of days per year
absent per employee will increase by 2.64.
(d) To write down the fitted regression model, we just need to read off the estimated
coefficients from the output:
\[ \widehat{\text{Absent}} = 10.2648 - 0.0002\,\text{Wage} - 0.1069\,\text{Pct PT} + 0.0599\,\text{Pct U} + 1.5619\,\text{Av Shift} - 2.6366\,\text{U/M Rel}. \]
Note that all the estimated coefficients have p-value < 0.05, thus, every one
of the five explanatory variables contributes significantly to Absenteeism (and
should be included in the model).
(e) i. The linear regression model is appropriate. The scatter plot of
Residuals against Fitted Values is slightly suggestive of some degree of
non-linearity (i.e. a curved relationship). Assume for now that the linear
model is appropriate.
ii. The errors are normal. It is not very evident from the histogram
given that the residuals depart from normality. However,
the Normal Probability Plot shows a distinct curvature. Thus the residuals
are probably not normal, and this assumption is not justified.
iii. The errors have constant variance. The scatter plot of Residuals
against Fitted Values shows no clear pattern, so there is no reason to
doubt the equal variance assumption.
iv. The errors are uncorrelated. As observed in i., the scatter plot of
residuals against fitted values shows a slight pattern, but there is possibly
not enough reason to doubt the claim that the residuals are uncorrelated.
(f) In the case of the two factor variables, union/management relations and
availability of shift work, it is clear that union/management relations
have a greater effect than availability of shift work, because the absolute
value of the estimated coefficient is larger.
(g) A 95% CI for the coefficient of Pct U is
\[ b_3 \pm t_{0.025,94} \times SE(b_3) = 0.0599 \pm t_{0.025,94} \times 0.0124 \]
\[ = 0.0599 \pm 1.9855 \times 0.0124 \]
\[ = 0.0599 \pm 0.0246 = (0.0353, 0.0845). \]
One can also read this straight from the Excel output.
Informally, this tells us that if percentage union membership increases by 1%,
then we would expect that mean absenteeism will increase by between 0.0353
and 0.0845 days per year.
(h) The data (Absent) should be transformed and the model re-fitted to see if
there is any improvement in the behaviour of the residuals with respect to the
normality assumption.
2. As a further analysis, the following log-linear model was fitted to the data:
\[ \ln \text{Absent} = \beta_0 + \beta_1\,\text{Wage} + \beta_2\,\text{PctPT} + \beta_3\,\text{PctU} + \beta_4\,\text{AvShift} + \beta_5\,\text{U/MRel} + \varepsilon \]
Some of the output from the analysis is given on the following page.
(a) Using the analysis of the previous question, justify fitting the above model to
the data.
(b) Write down the fitted regression model between ln(Absent) and the explanatory
variables.
(c) Is there sufficient evidence to conclude that the availability of shift work is
related to absenteeism? Justify your answer.
(d) Can we infer that in organisations where union and management relations are
poor, absenteeism is high? Justify your answer.
(e) State and verify the assumptions of the regression model using the output.
(f) Compare the log model to the linear model fitted in the previous question.
Which is better? Justify your answer.
(g) Between U/M Rel and Av Shift, which variable has the greatest effect on
absenteeism in this model? How does this compare with the model in Question
1?
(h) Compute a 95% confidence interval for the coefficient of the percent of unionised
employees, and compare your answer to that in Question 7(a)vii.
(i) Write a statement reporting the results of the analysis, referring to the factors
that affect worker absenteeism.
Solution:
(a) The Normality assumption employed in the previous analysis was perhaps not
justified. Now the response variable is being transformed, in an attempt to
find a more appropriate model. After fitting the new model to the data, we
can see if there is any change in the behaviour of the residuals. Since the
histogram of residuals was right-skewed, an appropriate transformation might
be to take the square root or natural logarithm of the response variable. The
log transformation has been used here.
(b) Reading the estimated coefficients from the output, the fitted model is
\[ \widehat{\ln \text{Absent}} = 2.25 - 3.38\,\text{Wage} - 0.019\,\text{Pct PT} + 0.011\,\text{Pct U} + 0.283\,\text{Av Shift} - 0.371\,\text{U/M Rel}. \]
(c) The p-value of the coefficient of Av Shift is 0.0012 < 0.05, so there is sufficient
evidence to conclude that the availability of shift work is related to absenteeism.
In fact, since the coefficient of Av Shift is positive, the availability of shift work
increases the mean number of days absent per employee. From the fitted model
derived in (b), we obtain
\[ \widehat{\text{Absent}} = \exp(2.25 - 3.38\,\text{Wage} - 0.019\,\text{Pct PT} + 0.011\,\text{Pct U} + 0.283\,\text{Av Shift} - 0.371\,\text{U/M Rel}), \]
so absenteeism increases by a factor of \(e^{0.283} = 1.327\) for companies that have
shift work available.
(d) The p-value of the coefficient of U/M Rel is \(2.11 \times 10^{-5} \ll 0.05\), so there is
sufficient evidence to conclude that the status of union-management relations
is related to absenteeism. Since the coefficient of U/M Rel is negative, it
indicates that if union-management relations are good then the mean number
of days absent per employee decreases. In fact, absenteeism decreases by a
factor of \(e^{-0.371} = 0.690\) if management-union relations are good.
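In the log model a coefficient acts multiplicatively on Absent: switching an indicator variable from 0 to 1 multiplies the fitted value by \(e^{b}\), holding the other variables fixed. The short sketch below recomputes the two factors quoted in (c) and (d) from the rounded coefficients; the dictionary names are illustrative only.

```python
import math

# Rounded coefficients of the two indicator variables in the fitted log model.
coefficients = {"Av Shift": 0.283, "U/M Rel": -0.371}

# Multiplicative effect on Absent when the indicator changes from 0 to 1.
factors = {name: math.exp(b) for name, b in coefficients.items()}

# The same effect expressed as a percentage change.
percent_change = {name: 100 * (factor - 1) for name, factor in factors.items()}

for name in coefficients:
    print(name, round(factors[name], 3), round(percent_change[name], 1))
```

Read this way, available shift work is associated with roughly 33% more absenteeism, and good union-management relations with roughly 31% less.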
(e) i. The fitted regression model is appropriate. The scatter plot of
Residuals against Fitted Values shows no clear pattern, so we conclude that the
model is appropriate.
ii. The errors are normal. The normal probability plot is fairly straight,
and the histogram is similar to that expected from a Normal
distribution. We conclude that there is no reason to question the assumption of
Normality of the residuals.
iii. The errors have constant variance. The scatter plot of residuals
against fitted values shows possibly that the variances decrease slightly
as the fitted values increase, but perhaps not enough to doubt the equal
variance assumption.
iv. The errors are uncorrelated. The scatter plot of residuals against fitted
values shows no clear pattern, so we conclude that the residuals are
uncorrelated.
(f) The log model is a great improvement on the linear model. The correlation
coefficient has not changed much (0.7252 compared to 0.7296). The standard
error for the log model is much smaller than for the linear model. This is
partly due to the fact that the data values have decreased due to the log
transformation, but even taking this into account, the reduction is large (in
fact, as a rough guide, the log of the standard error for the linear model is
ln 2.3559 = 0.8569, and this is more than twice the standard error for the log
model). Furthermore, the diagnostic plots suggest that we may safely assert
that the assumptions of the regression are satisfied by the log model, whereas
some doubt must be cast on the validity of the assumptions of the linear model,
in particular Normality of the residuals.
(g) Again the coefficient of Union/Management Relations has larger absolute value,
so again of the two variables, this variable has the greatest effect.
(h) 95% CL for coefficient of Pct U
\[ = 0.0111 \pm t_{0.025,94} \times 0.0021 \]
\[ = 0.0111 \pm 1.9855 \times 0.0021 \]
\[ = 0.0111 \pm 0.0042, \]
so a 95% CI is (0.0069, 0.0153). This is quite different to the 95% CI for the
same coefficient under the previous model. The difference is due to the choice
of model. Note that this is a CI for the increase in ln(Absent) corresponding
to a 1% increase in union membership.
(i) Worker absenteeism decreases as the average employee wages and percent of
part time employees increase. Further, absenteeism is lower for those companies
for which management has a good relationship with the union and higher for
those companies that have shift work available. Finally, absenteeism increases
as the percentage of unionised employees increases.
3. In this exercise, we will examine how to use Excel to generate output that can
be used to conduct a multiple linear regression analysis on a given data set. The
standard Excel regression output does not include all of the diagnostic plots that
one would usually be interested in; separately, we can obtain plots of residuals
against fitted values, and histograms of the residuals. The Absenteeism data of the
previous two questions is contained in the Excel file Absent.xlsx.
Follow the steps outlined below:
(a) Generate output relevant to a multiple linear regression of Absent on the
five explanatory variables. To do this, select Data -> Data Analysis ->
Regression. Since Absent is the response variable, set Y Range to be all
the data in the column Absent. X Range should be set to be all the data in
remaining columns. Labels should be included. Select also Residuals and
Normal Probability Plot.
(b) Under Residual Output, you will see two columns headed Predicted Absent
and Residuals. Copy the two columns to a separate worksheet, and use the
Scatterplot command to generate a scatterplot of Residuals against Fitted
Values (Predicted Absent). Check that the plot is the same as the one given
in the output in Question 7a.
(c) Generate a histogram of the Residuals.
Solution:
See Question 7a for an example output.
4. In Question 7b, we considered a multiple linear regression of the natural logarithm
of Absent on the five explanatory variables. Here, we will reproduce the relevant
output in Excel. First, we must transform the Absent data.
(a) Create a new column, to the right of the Absent column, headed ln(Absent),
calculate the natural logarithm of the first data point as shown below, and
then fill down the column.
(b) Now generate the standard Excel output for a multiple linear regression of
ln(Absent) on the five explanatory variables (ignoring the original Absent
data!).
(c) Once again, create a plot of Residuals against Fitted Values. Check that it is
the same as the plot given in the Excel output in Question 7b. Comment on
the differences in the diagnostic plots for the two different models, and what
these plots tell us.
Solution:
See Question 7b for an example output. The diagnostics for the two models were
examined respectively in Questions 7(a)v and 7(b)v. We can assert that the
assumptions of the regression are satisfied by the log model. There is no evident
pattern in the residual plot to suggest that this model is not appropriate or that
the errors are not independent. Furthermore, the Normal Probability Plot
resembles a line, indicating that the Normality assumption is OK. However, for the linear
model, the Normal Probability Plot has a definite curve. For this model, some doubt
must be cast on the validity of the assumption of Normality of the errors. So, the
model is inappropriate. Inference based on bad models will usually result in wrong
conclusions.
Chapter 12
Chi-Squared Tests for Categorical Data
12.1–12.2: The Chi-Squared Test for Goodness of Fit
1. A company which manufactures tractors takes daily samples of 4 tractors for careful
inspection as a check on the quality of their product. Over 200 days, the numbers
of tractors needing adjustment on each day were recorded, resulting in the following
frequency table. Test whether a Binomial model with p = 0.1 is appropriate for the
number of tractors needing adjustment on a given day.
[Fill in the rest of the table before your lab, remembering that you might need to
group the categories.]
Number needing adj. per day (x_i)      0      1      2      3      4    Total
Number of days (o_i)                 102     78     19      1      0      200
P(X = x_i) if X ~ Bin(4, 0.1)                                               1
Expected frequency (e_i)                                                  200
(o_i - e_i)^2 / e_i
Solution:
The hypotheses to be tested are
\(H_0\): data consistent with a Bin(4, 0.1) distribution
\(H_1\): data not consistent with a Bin(4, 0.1) distribution
Let \(X\) denote the number of tractors needing adjustment on a given day.
Assume that the numbers of tractors requiring adjustment on each day are iid. Then
the number of days, out of 200, on which \(x_i\) tractors need adjustment (for
\(x_i = 0, 1, \ldots, 4\)) is a Bin(200, \(p_i\)) random variable, where \(p_i = P(X = x_i)\). So the
expected number of days on which \(x_i\) tractors need adjustment can be written as
\(200p_i\). Under the Bin(4, 0.1) model,
\[ p_i = \binom{4}{x_i}\, 0.1^{x_i} (1 - 0.1)^{4 - x_i}. \]
The expected frequencies under this model can now be calculated, and the results are given in the
following table:
Number needing adj. (x_i)             0        1        2        3        4     Total
Number of days (o_i)                102       78       19        1        0       200
P(X = x_i) if X ~ Bin(4, 0.1)    0.6561   0.2916   0.0486   0.0036   0.0001         1
Expected freq. (e_i = 200 p_i)   131.22    58.32     9.72     0.72     0.02       200
The chi-square tests require that all expected frequencies be greater than 5. To
achieve this, we group the last three categories. The revised table is shown below:
Number needing adj. (x_i)             0        1   2, 3 or 4    Total
Number of days (o_i)                102       78          20      200
P(X = x_i) if X ~ Bin(4, 0.1)    0.6561   0.2916      0.0523        1
Expected freq. (e_i = 200 p_i)   131.22    58.32       10.46      200
(o_i - e_i)^2 / e_i                6.51     6.64        8.70    21.85
The test statistic is
\[ X^2 = \sum_i \frac{(o_i - e_i)^2}{e_i}, \]
where the sum is over all (remaining) categories of the variable. Under \(H_0\), \(X^2\)
follows a \(\chi^2\) distribution, with
df = Number of categories − 1 − Number of parameters estimated
= 3 − 1 − 0
= 2,
i.e. \(X^2 \sim \chi^2_2\) under \(H_0\). So the \(\alpha = 0.05\) critical value is
\(\chi^2_{\text{crit}} = \chi^2_{2,0.05} = 5.99\).
The observed value of the test statistic is \(\chi^2_{\text{obs}} = 21.85\).
Since \(\chi^2_{\text{obs}} = 21.85 \in CR\), the data provides sufficient evidence to reject \(H_0\) at the 5%
level of significance. We conclude that the number of tractors needing adjustment
is not distributed as Bin(4, 0.1).
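The whole calculation (binomial probabilities, grouping of the sparse categories, test statistic and critical value) can be reproduced in a few lines. This is a sketch using only the standard library, with variable names chosen for illustration. The critical value uses the closed form \(-2\ln\alpha\), which holds only for a chi-squared distribution with 2 degrees of freedom; other degrees of freedom would need tables or a statistics library.

```python
import math

# Observed number of days with 0, 1, 2, 3, 4 tractors needing adjustment.
observed = [102, 78, 19, 1, 0]
n_trials, p = 4, 0.1
total_days = sum(observed)

# Bin(4, 0.1) probabilities for x = 0, ..., 4.
probs = [
    math.comb(n_trials, k) * p ** k * (1 - p) ** (n_trials - k)
    for k in range(n_trials + 1)
]

# Group the categories 2, 3 and 4 so that every expected frequency exceeds 5.
observed_grouped = observed[:2] + [sum(observed[2:])]
probs_grouped = probs[:2] + [sum(probs[2:])]
expected = [total_days * q for q in probs_grouped]

chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed_grouped, expected))
df = len(observed_grouped) - 1 - 0     # no parameters were estimated from the data

alpha = 0.05
critical = -2 * math.log(alpha)        # exact for df = 2 only

print(round(chi_sq, 2), df, round(critical, 2), chi_sq > critical)
```

Grouping before computing the statistic matters: the ungrouped category with expected frequency 0.02 would otherwise make the chi-squared approximation unreliable.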
2. Political ideology of government has a great impact on business perception and
planning. A market researcher is investigating the support for the various political
parties in Australia at the Federal level. The support at the 2001 Federal election
was Liberal 37%, Labor 38%, National 6%, Democrats 5%, Others 14%
(source: http://www.aec.gov.au/ content/when/past/2001/results/index.html).
Six months after the 2001 election, a survey of 1050 voters was conducted, to de-
termine whether the level of support for each party had changed. The results are
summarised in the table below.
Party (i)               Lib    Lab    Nat    Dem    Oth    Total
No. of voters (o_i)     350    456     50     44    150     1050
Probability (p_i)      0.37   0.38   0.06   0.05               1
Determine at significance level 0.05 whether the level of support for the parties
changed in the six months following the 2001 election. Comment on where the
major discrepancy appears to lie.
Solution:
Party (i)                    Lib       Lab      Nat      Dem      Oth      Total
No. of voters (o_i)          350       456       50       44      150       1050
Probability (p_i)           0.37      0.38     0.06     0.05     0.14          1
Expected frequency (e_i)   388.5       399       63     52.5      147       1050
o_i - e_i                  -38.5        57      -13     -8.5        3          0
(o_i - e_i)^2            1482.25      3249      169    72.25        9   (not reqd)
(o_i - e_i)^2 / e_i       3.8153    8.1429   2.6825   1.3762   0.0612    16.0781
(a) df = number of categories (after grouping to eliminate any expected frequencies
less than 5) - 1 - number of parameters estimated from the data.
(b) [See table above].
df = 5 - 1 - 0 = 4; $\chi^2_{4,0.05} = 9.49$.
$\chi^2_{\text{obs}} = 16.0781 > 9.49$, therefore the data provides sufficient evidence to reject
$H_0$ at the 5% level of significance. We conclude that the level of support for
the political parties has changed since the last election.
The largest contribution to the chi-squared statistic (8.1429) is from the Labor column,
where the observed count (456) exceeds the expected count (399).
Thus the major discrepancy is that the support for Labor increased in the six
months following the 2001 election.
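The table and conclusion above can be reproduced with a minimal Python sketch. The observed counts, the $H_0$ probabilities and the critical value 9.49 are taken from the solution; the variable names are illustrative assumptions.

```python
parties = ["Lib", "Lab", "Nat", "Dem", "Oth"]
observed = [350, 456, 50, 44, 150]          # survey counts o_i
h0_probs = [0.37, 0.38, 0.06, 0.05, 0.14]   # 2001 election shares p_i

n = sum(observed)
expected = [n * p for p in h0_probs]        # e_i = n * p_i
contributions = [(o - e) ** 2 / e for o, e in zip(observed, expected)]

chi2_obs = sum(contributions)
df = len(observed) - 1 - 0                  # no parameters estimated from the data
critical_value = 9.49                       # chi-squared(4) at alpha = 0.05, from tables

reject_h0 = chi2_obs > critical_value
largest = parties[contributions.index(max(contributions))]

for party, o, e, c in zip(parties, observed, expected, contributions):
    print(f"{party}: o = {o}, e = {e:.1f}, contribution = {c:.4f}")
print(f"X^2 = {chi2_obs:.4f}, df = {df}, reject H0: {reject_h0}")
print(f"Largest contribution: {largest}")
```

Listing the per-category contributions, rather than only their sum, is what allows the comment about where the major discrepancy lies.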
3. Black et al. 12.5.
Solution:
$H_0$: The way that men define their personal success does not differ from how women
define theirs.
$H_1$: $H_0$ is false.
The test statistic is given by
$$\chi^2 = \sum \frac{(f_o - f_e)^2}{f_e},$$
where $f_o$ is the observed frequency and $f_e$ is the expected frequency in each category.
The significance level is given as $\alpha = 0.05$.
There are four categories in this question (happiness, sales, helping others, achievements),
so k = 4. The degrees of freedom are k - 1 = 3. For $\alpha = 0.05$ and df = 3, the
critical chi-square value is
$$\chi^2_{0.05,3} = 7.8147.$$
The expected frequencies are computed by multiplying the expected proportions (from the
women's data) by the total sample size of the men's data. For example, the total
sample size for the men's data is 227 (add up all the observed frequencies). The expected
frequency for the happiness category is then 227(0.39) = 88.53 and similarly
for the sales category, the expected frequency is 227(0.12) = 27.24 and so on.
Definition      f_o    f_e      (f_o - f_e)^2 / f_e
Happiness       42     88.53    24.46
Sales           95     27.24    168.55
Helping         27     40.86    4.70
Achievements    63     70.34    0.77
Total           227    227      198.48
Since the observed chi-squared value (198.48) is greater than the critical value (7.8147), we
reject the null hypothesis.
Thus, the data gathered in the sample suggests that the way men define their
personal success differs significantly from how women define theirs.
12.3: Contingency Analysis: The Chi-Squared Test
for Independence
4. In a random sample of 100 people, each person was classified by buying response to
a particular product and also by degree of exposure to marketing pressure (recorded
in four categories I, II, III, IV), with the following results:
[Fill in (say) three of the expected frequencies (in parentheses) before your lab.]
                      Marketing Pressure
                 I        II       III      IV       Totals
Definitely buy   12 ( )   12 ( )   6 ( )    17 ( )   47
Undecided        5 ( )    8 ( )    10 ( )   5 ( )    28
Will not buy     3 ( )    10 ( )   7 ( )    5 ( )    25
Total            20       30       23       27       100
(a) State the hypotheses you would use in testing the advertising agency's claim
that buying response is influenced by the degree of marketing pressure.
(b) Explain why you would calculate the expected frequencies using the rule
$$\text{expected frequency for a cell} = \frac{\text{row total} \times \text{column total}}{\text{grand total}}.$$
(c) Test the advertising agency's claim at the 5% significance level.
Solution:
(a) The hypotheses to be tested are
$H_0$: Marketing pressure and buying response are independent
$H_1$: Marketing pressure and buying response are not independent
(i.e. buying response is influenced by marketing pressure)
(b) Consider the upper-left cell. The probability that a particular customer will
be classified in this cell is P(Marketing Pressure I and Definitely buy). If
$H_0$ is true, then
$$P(\text{Marketing Pressure I and Definitely buy}) = P(\text{Marketing Pressure I}) \times P(\text{Definitely buy}).$$
The two probabilities on the right-hand side can be estimated naturally in
terms of the respective row and column totals:
$$P(\text{Marketing Pressure I}) \approx \frac{\text{Column 1 Total}}{\text{Grand Total}}, \qquad P(\text{Definitely buy}) \approx \frac{\text{Row 1 Total}}{\text{Grand Total}}.$$
So,
$$P(\text{Marketing Pressure I and Definitely buy}) \approx \frac{\text{Row 1 Total} \times \text{Column 1 Total}}{(\text{Grand Total})^2}.$$
The expected frequency for the upper-left cell is
$$\text{Grand Total} \times P(\text{Marketing Pressure I and Definitely buy}) \approx \text{Grand Total} \times \frac{\text{Row 1 Total} \times \text{Column 1 Total}}{(\text{Grand Total})^2} = \frac{\text{Row 1 Total} \times \text{Column 1 Total}}{\text{Grand Total}}.$$
This argument applies in the same way for all other cells in the table.
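The rule derived in part (b) can be sketched in Python as follows. The marginal totals come from the table in the question; the dictionary layout and names are illustrative assumptions, not part of the original solution.

```python
row_totals = {"Definitely buy": 47, "Undecided": 28, "Will not buy": 25}
col_totals = {"I": 20, "II": 30, "III": 23, "IV": 27}
grand_total = sum(row_totals.values())  # 100 people in the sample

# Under independence: e_ij = (row total * column total) / grand total
expected = {
    (row, col): r_tot * c_tot / grand_total
    for row, r_tot in row_totals.items()
    for col, c_tot in col_totals.items()
}

for row in row_totals:
    cells = ", ".join(f"{col}: {expected[(row, col)]:.2f}" for col in col_totals)
    print(f"{row}: {cells}")
```

Because each expected frequency is built from the marginal totals, the expected counts in any row or column add up to the corresponding observed total.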
(c) Expected frequencies (under the model of independence) are given in brackets:
                        Marketing Pressure
                 I          II          III         IV           Totals
Definitely buy   12 (9.4)   12 (14.1)   6 (10.81)   17 (12.69)   47
Undecided        5 (5.6)    8 (8.4)     10 (6.44)   5 (7.56)     28
Will not buy     3 (5)      10 (7.5)    7 (5.75)    5 (6.75)     25
Total            20         30          23          27           100
The test statistic is
$$X^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - e_{ij})^2}{e_{ij}},$$
where r and c are the numbers of rows and columns (not including totals), $O_{ij}$
is the observed count in the cell in row i and column j, and $e_{ij}$ is the expected
count in the same cell (assuming $H_0$, that is, no relationship between the two
variables). This double sum can be thought of simply as a single sum over
all cells in the table.
Under $H_0$, the test statistic follows a $\chi^2$ distribution with
$(r - 1) \times (c - 1) = (3 - 1) \times (4 - 1) = 6$ degrees of freedom, i.e. $X^2 \sim \chi^2_6$
under $H_0$. Thus, the $\alpha = 0.05$ critical value of the test is
$\chi^2_{\text{crit}} = \chi^2_{6,0.05} = 12.59$.
The observed value of the test statistic is
$$\begin{aligned}
\chi^2_{\text{obs}} &= \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(o_{ij} - e_{ij})^2}{e_{ij}} \\
&= 0.719 + 0.312 + 2.140 + 1.464 + 0.064 + 0.019 + 1.968 \\
&\quad + 0.867 + 0.800 + 0.833 + 0.271 + 0.454 \\
&= 9.91.
\end{aligned}$$
Since $\chi^2_{\text{obs}} = 9.91 < 12.59$, the data does not provide sufficient evidence to
reject $H_0$ in favour of $H_1$ at the 5% level of significance. We conclude that
buying response is not influenced by marketing pressure.
5. Four hotels took part in a survey on hotel guest satisfaction. A follow-up question
was asked of all respondents who were dissatisfied with the service. These guests
were asked to indicate the main reason for their dissatisfaction. You are asked
to investigate whether the choice of hotel has any bearing on the main reason for
dissatisfaction.
Do not use Excel in this question! Write your answers on paper, showing
full working.
(a) State appropriate hypotheses that could be tested to answer the question: Do
the results of the survey provide evidence that the nature of dissatisfaction and
the choice of hotel are related?
Solution:
$H_0$: Choice of hotel and reason for dissatisfaction are independent.
$H_1$: $H_0$ is false (i.e. choice of hotel and reason for dissatisfaction are related).
(b) A contingency table, summarising the results of the survey, is given below.
The table shows the observed frequencies for each cell, as well as some of the
expected frequencies under $H_0$ (in parentheses). Copy down this table, and
without using Excel, calculate the remaining expected frequencies under $H_0$.
Show working!
                                        Hotel
                 Fijian         Tradeswest     Sheraton       Coral Reef     Totals
Politeness       23 ( )         7 ( )          37 (33.7410)   67 (62.0192)   134
Knowledge        25 ( )         13 ( )         25 (30.9712)   60 (56.9281)   123
Responsiveness   13 (11.0024)   5 (6.6906)     13 (15.6115)   31 (28.6954)   62
Other            13 (17.3909)   20 (10.5755)   30 (24.6763)   35 (45.3573)   98
Totals           74             45             105            193            417
Solution:
The expected frequency for Fijian and Politeness is
$$\frac{\text{Row Total} \times \text{Column Total}}{\text{Grand Total}} = \frac{134 \times 74}{417} = 23.7794.$$
One can either work out the remaining three expected frequencies as above, or
by using the fact that the expected frequencies in each row/column are required
to sum to the (observed) row/column total. The complete table is below:
                                        Hotel
                 Fijian         Tradeswest     Sheraton       Coral Reef     Totals
Politeness       23 (23.7794)   7 (14.4604)    37 (33.7410)   67 (62.0192)   134
Knowledge        25 (21.8273)   13 (13.2734)   25 (30.9712)   60 (56.9281)   123
Responsiveness   13 (11.0024)   5 (6.6906)     13 (15.6115)   31 (28.6954)   62
Other            13 (17.3909)   20 (10.5755)   30 (24.6763)   35 (45.3573)   98
Totals           74             45             105            193            417
(c) Write down an expression for the relevant test statistic, and state its distribution
under $H_0$ (together with any associated parameters!).
Solution:
The test statistic is
$$X^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - e_{ij})^2}{e_{ij}},$$
where r and c are the numbers of rows and columns (not including totals), $O_{ij}$
is the observed count in the cell in row i and column j, and $e_{ij}$ is the expected
count in the same cell (assuming $H_0$, that is, no relationship between the two
variables).
Under $H_0$, the test statistic follows a $\chi^2$ distribution with
$(r - 1) \times (c - 1) = (4 - 1) \times (4 - 1) = 9$ degrees of freedom, i.e. $X^2 \sim \chi^2_9$
under $H_0$.
(d) Without using Excel, calculate the contribution from the upper-left cell to the
observed value of the test statistic.
Solution:
The contribution to the observed value from the upper-left cell is
$$\frac{(o_{ij} - e_{ij})^2}{e_{ij}} = \frac{(23 - 23.7794)^2}{23.7794} = 0.0256.$$
(e) Given that the observed value of the test statistic is $\chi^2_{\text{obs}} = 20.8059$, carry
out the test (without using Excel) at the 5% significance level, and state your
conclusion. Is there sufficient evidence to conclude that there is a relationship
between the choice of hotel and the nature of dissatisfaction?
Solution:
The critical value for this test is
$$\chi^2_{\text{crit}} = \chi^2_{9,\alpha} = \chi^2_{9,0.05} = 16.92,$$
so the critical region is $\{X^2 > \chi^2_{\text{crit}} = 16.92\}$. Since $\chi^2_{\text{obs}} = 20.8059$ is within
the critical region, we reject $H_0$ in favour of $H_1$. There is sufficient evidence,
at the 5% significance level, to conclude that the nature of dissatisfaction is
related to the choice of hotel.
6. To undertake contingency analysis in Excel, first enter the data, then go to KaddSTAT
-> Hypothesis Testing -> Chi-Square Test. Select the data as Input Range,
tick the Header Row and Column Included box, and choose where you want Excel
to print the output.
Enter the hotel data from Question 5 as shown below:
(a) Use Excel to generate an appropriate output for a test for independence of the
two variables of interest, carry out the test, and check that your conclusions
are the same as in Question 5(e).
Solution:
Excel returns the following output:
(b) If there is evidence that the nature of dissatisfaction is related to the choice
of hotel, where do the discrepancies lie? Which hotel(s) could be advised to
improve their service, and in which area(s)? Do any of the hotels appear to
provide significantly better service than the others in a particular area?
Solution:
Having established that there is indeed a relationship between the choice of
hotel and the nature of dissatisfaction, one can examine the output to determine
which hotels have a greater (or lesser) proportion of complaints of each
type.
It can be seen from the output of chi-square calculations that there are three
cells that have much larger contributions to the observed value of the test
statistic than the others. These cells are Tradeswest and Politeness (3.8490),
Tradeswest and Other (8.3987) and Coral Reef and Other (2.3651). Comparing
the observed frequencies with the expected frequencies for these cells, we see
that of the dissatisfied hotel guests, those who stayed at Tradeswest are less
often dissatisfied with Politeness, and more often their dissatisfaction is classified
as Other. It might also be that those dissatisfied guests who stay at Coral
Reef less often state that their dissatisfaction is due to Other, although there
is probably not enough evidence to confirm this (the chi-square contribution is
not that large).
We conclude that Tradeswest should take steps to improve their service in the
area of Other. Some further analysis might be required to provide more useful
advice. To find out which particular aspects of Tradeswest's service guests are
dissatisfied with, one might choose to replace Other by a collection of more
meaningful categories (e.g. Cleanliness, Food, etc.). One can ensure that the
expected frequencies are all greater than 5 by combining any categories that
have small expected frequencies, or by simply gathering enough data.
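The cell-by-cell comparison described above can also be sketched outside Excel. The following Python sketch recomputes each cell's contribution from the observed hotel table in Question 5 and ranks the cells; it is an illustration under that assumption, not a reproduction of the KaddSTAT output, and all names are hypothetical.

```python
hotels = ["Fijian", "Tradeswest", "Sheraton", "Coral Reef"]
reasons = ["Politeness", "Knowledge", "Responsiveness", "Other"]
observed = [
    [23, 7, 37, 67],    # Politeness
    [25, 13, 25, 60],   # Knowledge
    [13, 5, 13, 31],    # Responsiveness
    [13, 20, 30, 35],   # Other
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

cells = []
for i, reason in enumerate(reasons):
    for j, hotel in enumerate(hotels):
        e = row_totals[i] * col_totals[j] / grand_total  # expected count under H0
        o = observed[i][j]
        cells.append({"hotel": hotel, "reason": reason, "o": o, "e": e,
                      "contribution": (o - e) ** 2 / e})

chi2_obs = sum(cell["contribution"] for cell in cells)
top_three = sorted(cells, key=lambda cell: cell["contribution"], reverse=True)[:3]

print(f"X^2 = {chi2_obs:.4f}")
for cell in top_three:
    direction = "more" if cell["o"] > cell["e"] else "fewer"
    print(f"{cell['hotel']} / {cell['reason']}: contribution "
          f"{cell['contribution']:.4f} ({direction} complaints than expected)")
```

Reporting the direction of each large discrepancy alongside its size matters here, because a large contribution alone does not say whether a hotel is doing better or worse than expected in that area.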
7. A market researcher wished to investigate whether a buyer's age had any bearing
on choice of car colour. A random sample of 200 car buyers resulted in the following
table which shows the observed frequencies and some of the expected frequencies
(in parentheses).
              Chose Red   Chose White   Chose Grey
Age 17-24     20 (16)     15 (16)       5 ( )
Age 25-40     30 (24)     20 (24)       10 ( )
Age over 40   30 ( )      45 ( )        25 ( )
(a) State the hypotheses that the researcher is comparing in this investigation.
(b) Copy the body of the table and complete the entries for expected frequencies.
(c) Give the number of degrees of freedom for a $\chi^2$-test of the hypotheses in part
(a).
(d) Explain fully how the expected frequency of 16 is obtained for the 17-24 age
group with a preference for Red. (Do not merely quote a formula or show one
line of arithmetic.)
(e) Using a 5% level of significance, determine if the buyer's age has any bearing
on the choice of car colour.
Solution:
(a) $H_0$: choice of colour is independent of age; $H_1$: choice of colour is dependent on
age.
(b)
              Chose Red   Chose White   Chose Grey   Total
Age 17-24     20 (16)     15 (16)       5 (8)        40
Age 25-40     30 (24)     20 (24)       10 (12)      60
Age over 40   30 (40)     45 (40)       25 (20)      100
Total         80          80            40           200
(c) No grouping is required, so the number of degrees of freedom for the $\chi^2$-distribution
is (3 - 1)(3 - 1) = 4.
(d) Under $H_0$, P(Age 17-24 and Red) = P(Age 17-24) × P(Red).
Estimating P(Age 17-24) by 40/200 and P(Red) by 80/200 gives expected frequency
for (1, 1)-cell = (40/200) × (80/200) × 200 = 16.
(e) From tables, $\chi^2_{4,0.05} = 9.49$ and the observed value of the test statistic is 9.06, which
is not in the critical region (CR), so the data does not provide sufficient evidence to reject
$H_0$ at the 5% level of significance. We conclude that choice of colour is not dependent on age.
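Part (e) can be checked with the following Python sketch, which recomputes the test statistic from the observed table. The p-value step is an addition not used in the original solution (which relies on tables): it uses the closed-form upper-tail probability of a chi-squared distribution, valid only when the degrees of freedom are even. Function and variable names are illustrative.

```python
from math import exp

observed = [
    [20, 15, 5],    # Age 17-24: Red, White, Grey
    [30, 20, 10],   # Age 25-40
    [30, 45, 25],   # Age over 40
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

chi2_obs = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / grand_total  # expected count under H0
        chi2_obs += (o - e) ** 2 / e

df = (len(observed) - 1) * (len(observed[0]) - 1)


def chi2_sf_even_df(x, df):
    """P(chi-squared(df) > x); this closed form holds only for even df."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("closed form requires a positive even df")
    term, total = 1.0, 1.0
    for k in range(1, df // 2):
        term *= (x / 2) / k      # (x/2)^k / k!
        total += term
    return exp(-x / 2) * total


p_value = chi2_sf_even_df(chi2_obs, df)
print(f"X^2 = {chi2_obs:.4f}, df = {df}, p-value = {p_value:.4f}")
```

The p-value comes out only slightly above 0.05, so the decision not to reject $H_0$ is a fairly close one at the 5% level.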
8. Black et al. Exercises 12.27 and 12.29.
Solution:
(a) Black et al. 12.27
The hypotheses of interest here are
$H_0$: Proportion of households with internet access is not dependent on whether
they have children under the age of 15 for the period 1989 to 2003.
$H_1$: Proportion of households with internet access is dependent on whether
they have children under the age of 15 for the period 1989 to 2003.
Note that all the expected frequencies are greater than 5. df = (2 - 1)(6 - 1) = 5. The
p-value of the test = $P(\chi^2_5 > 0.13) = 0.9997$ (from Excel), so there is insufficient
evidence against the null hypothesis. We conclude that the proportion of
households with internet access is the same for those with children under 15
and those without children under 15 in the period 1989 to 2003.
(b) Black et al. 12.29
$H_0$: Gender and colour preference for cars are independent
$H_1$: Gender and colour preference for cars are not independent
To test this hypothesis, we use the chi-squared test of independence. The
observed chi-squared value (from the test statistic) is 5.366. The p-value is
0.252 > 0.05, so there is insufficient evidence to reject the null hypothesis.
(The critical value at the 5% level of significance with (5 - 1)(2 - 1) = 4 degrees of
freedom is 9.4877. Since the observed value does not lie in the critical region,
we do not reject the null hypothesis.) Therefore, there is not enough evidence
provided by the data to suggest that colour preference is dependent on gender.
Marketing agencies don't have to model colour as a factor when trying to sell
cars to either gender. Also, manufacturers can determine car colour quotes on
another basis, instead of gender preference.
