Simple Linear Regression
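The test statistic is
$$T = \frac{B_1}{\sqrt{MS_{\text{Residual}}/SS_x}} \sim t_{n-2} = t_6 \quad (\text{under } H_0),$$
where
$$B_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$
is the least squares estimator of the slope $\beta_1$. From the Excel output, the
observed value of the test statistic is
$$t_{obs} = 6.9585,$$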
and the critical region is (two-tail test, $\alpha = 0.01$)
$$CR = \{|T| > t_{crit} = t^{\alpha/2}_{n-2} = t^{0.005}_{6} = 3.7074\}.$$
Thus, $t_{obs} \in CR$. (Alternatively, it can be seen from the Excel output that
the p-value associated with $t_{obs} = 6.9585$ is $0.000437 < \alpha = 0.01$.)
Thus, either approach results in rejection of $H_0: \beta_1 = 0$ in favour of
$H_1: \beta_1 \neq 0$ at the 1% significance level. Hence there is sufficient evidence at the
1% significance level to conclude that there is a significant linear relationship
between Income and Savings.
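As a rough illustration of how the slope estimate and its t statistic are obtained from raw data, here is a minimal Python sketch. The function name `slope_t_statistic` and the four data points are made up for illustration; they are not the Income and Savings data, which are not reproduced here.

```python
import math

def slope_t_statistic(x, y):
    """Return (b0, b1, t) for a simple linear regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_x = sum((xi - x_bar) ** 2 for xi in x)
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = ss_xy / ss_x                    # least squares slope
    b0 = y_bar - b1 * x_bar              # least squares intercept
    ss_res = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    ms_res = ss_res / (n - 2)            # residual mean square
    t = b1 / math.sqrt(ms_res / ss_x)    # statistic for H0: slope = 0
    return b0, b1, t

# Hypothetical data, for illustration only.
b0, b1, t = slope_t_statistic([1, 2, 3, 4], [2, 4, 5, 8])
print(round(b1, 3), round(t, 3))
```

The absolute value of `t` would then be compared with the tabulated critical value on n - 2 degrees of freedom, exactly as in the solution above.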
The advantage of the second approach is that it can be used to test $H_0: \beta_1 = 0$
against either of the one-tail alternatives $H_1: \beta_1 > 0$ (significant positive
relationship between Income and Savings) or $H_1: \beta_1 < 0$ (significant negative
relationship between Income and Savings), by choosing the appropriate form of
the critical region.
(e) Briefly explain what, in practice, is the purpose of examining a plot of Residuals
against the explanatory variable (Income).
Solution:
From diagnostic plots, one can check whether any of the assumptions of simple
linear regression appear to be violated. From residual plots, one can assess
the appropriateness of the linear model, and can recognise if the errors are not
independent or do not have constant variance. The remaining assumption is
that of Normality of the errors, which can be checked by examining a Normality
plot of the residuals.
(f) From this regression, what is the predicted Savings for an individual with an
Income of $20,000 per annum? Comment on the usefulness of this prediction.
Solution:
Predicted Savings = b_0 + b_1 × Income = -5.246 + 0.205 × 20 ≈ -1.142. One
might question how a negative value of Savings should be interpreted. Note
that an Income of $20,000 is outside the range of the data upon which the
model was constructed, hence this prediction is not reliable and should be
taken with a grain of salt. Predictions are only reliable if the values of any
explanatory variables are within the range of the data.
144 CHAPTER 10. SIMPLE LINEAR REGRESSION
2. The following output comes from a linear regression, modelling the number of
electronic components assembled (within a certain time) by employees of an
electronics company with differing amounts of experience (in years).
(a) Specify the regression model and explain each term in the model.
(b) State the estimated regression equation between Production and Experience.
(c) Is there a significant linear relationship between Production and Experience?
Justify your answer.
(d) Do the residual plots suggest any problems with model assumptions?
(e) Estimate the effect, on average, of
i. a one year increase in experience,
ii. a two year increase in experience.
(f) State a 95% confidence interval for the slope parameter $\beta_1$.
(g) What is the coefficient of determination for this model? What is its meaning?
Solution:
(a) $\text{Production} = \beta_0 + \beta_1\,\text{Experience} + \varepsilon$, where Production is the number
of components produced, Experience is the number of years of experience the
employee has, $\beta_0$ is the intercept, $\beta_1$ is the slope, and $\varepsilon$ is the random variation
term or error.
(b) The estimated regression line is
Production = 2.914 + 1.967 Experience.
(c) We are testing the hypotheses $H_0: \beta_1 = 0$ against $H_1: \beta_1 \neq 0$. The
p-value for this test is $7.33 \times 10^{-31} < 0.05$, so the data provides overwhelming
evidence against the null hypothesis. We conclude that there is a significant
linear relationship between Production and Experience.
(d) 1. A linear model is appropriate. There is no evidence of a trend in the
residual plot.
2. The errors are normally distributed. The points in the normal probability
plot lie approximately on a straight line, indicating that the assumption
of normality is reasonable.
3. The errors have constant variance. The spread of residuals about the
horizontal axis does not vary as Experience increases, so this looks okay.
4. The errors are independent (or uncorrelated). The residual plot
doesn't show any clear violation of independence.
There is no evidence of outliers or points of high leverage. Thus there is no reason to
doubt the adequacy of our linear regression model.
(e) i. An extra year of experience will increase Production on average by
1.967 components.
ii. An extra two years of experience will increase Production on average by
$1.967 \times 2 = 3.934$ components.
(f) We can read the 95% confidence interval for $\beta_1$ from the Excel output as
(1.912, 2.021).
(g) $r^2 = 0.9955$. This means that the variation in Experience explains 99.55% of
the variation in Production.
3. House Data: Regression of Price against Age
Open the House.xlsx file. We will perform a regression analysis of Price against the
Age of the houses sold. The data was collected in 2010.
(a) Create a column called AgeHouse (which is simply 2010 - YrBuilt). To do this,
type AgeHouse in Cell J1, type = 2010 - E2 in Cell J2, and fill down.
(b) Produce a scatterplot of Price against AgeHouse, and describe any general
trend. Aside from this, is there anything else of note?
Solution:
A scatterplot of Price against AgeHouse is shown below:
There does seem to be a trend for Price to decrease with increasing age, but this
is due almost entirely to 5 points (possible outliers?) which correspond to very
new houses.
(c) Go to Data -> Data Analysis -> Regression. The Input Y Range is Price,
the Input X Range is AgeHouse. You should include Labels in these ranges;
check the corresponding box. Select an Output Range and click OK.
Solution:
These steps yield the following output:
(d) Write down the equation of regression.
Solution:
From the Table of Coefficients, the regression equation is
$\sum_{i=1}^{10} x_i = 15$, $\sum_{i=1}^{10} y_i = 714$, $\sum_{i=1}^{10} x_i y_i = 1278$, $\sum_{i=1}^{10} x_i^2 = 25.8$,
(b) $\bar{x} = 0$, $\bar{y} = 12.7$, $SS_{xy} = -246.56$, $s_x^2 = 36.67$.
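Solution:
(a) From formulae given in Lecture slides,
$$b_1 = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{\sum_{i=1}^{n} x_i^2 - n(\bar{x})^2} = \frac{1278 - 10\left(\frac{15}{10}\right)\left(\frac{714}{10}\right)}{25.8 - 10\left(\frac{15}{10}\right)^2} = 62.727,$$
$$b_0 = \bar{y} - b_1\bar{x} = \frac{714}{10} - 62.727\left(\frac{15}{10}\right) = -22.691,$$
so the equation of the regression line is $\hat{y} = -22.691 + 62.727x$.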
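The arithmetic in part (a) can be checked with a short Python sketch. The function name `fit_from_sums` is an illustrative choice; the inputs are the summary sums given in the question. Note that the intercept comes out negative.

```python
def fit_from_sums(n, sum_x, sum_y, sum_xy, sum_x2):
    """Least squares intercept and slope from summary sums."""
    x_bar = sum_x / n
    y_bar = sum_y / n
    # Slope: (sum of xy - n * x_bar * y_bar) / (sum of x^2 - n * x_bar^2)
    b1 = (sum_xy - n * x_bar * y_bar) / (sum_x2 - n * x_bar ** 2)
    # Intercept: the fitted line passes through (x_bar, y_bar)
    b0 = y_bar - b1 * x_bar
    return b0, b1

b0, b1 = fit_from_sums(n=10, sum_x=15, sum_y=714, sum_xy=1278, sum_x2=25.8)
print(round(b0, 3), round(b1, 3))
```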
(b) First we need to compute $SS_x$. Recognising that $SS_x = \sum_{i=1}^{n} (x_i - \bar{x})^2 = (n-1)s_x^2$,
we find that $SS_x = (n-1)s_x^2 = 9 \times 36.67$. Thus
$$b_1 = \frac{SS_{xy}}{SS_x} = \frac{-246.56}{36.67 \times 9} = -0.7471,$$
$$b_0 = \bar{y} - b_1\bar{x} = 12.7 - (-0.7471)(0) = 12.7,$$
so the equation of the regression line is $\hat{y} = 12.7 - 0.7471x$.
5. Open the Excel file House.xlsx. National Realty wants you to investigate the
relationship between the selling price of a house (in $1,000) and the area of the block of
land on which it is situated (in m²). You decide to perform a simple linear regression
between Price and Area.
(a) First, decide which of the two variables should be chosen as the response
variable. Then specify the regression model, and explain each term in the model.
(b) What are the assumptions that must be satisfied to ensure that a simple linear
regression is appropriate?
(c) Using Excel, produce an appropriate Summary Output for the simple linear
regression described by (a). This should include an appropriate set of diagnostic
plots that can be used to assess whether or not the assumptions of the
regression model in (b) are justified.
(d) From your output in (c), write down the estimated regression equation between
Price and Area.
(e) Give an interpretation for the estimate of the slope parameter in the estimated
regression equation in (d).
(f) Do the diagnostic plots suggest any violation of the assumptions in (b)?
Solution:
(a) Price is the appropriate choice for the response variable. The regression model
is $\text{Price} = \beta_0 + \beta_1\,\text{Area} + \varepsilon$, where Price is the selling price of the house, Area
is the area of the block of land on which the house is situated, $\beta_0$ is the intercept,
$\beta_1$ is the slope, and $\varepsilon$ is the random variation term or error.
(b) The assumptions of the simple linear regression are
i. A linear model is appropriate: $\text{Price} = \beta_0 + \beta_1\,\text{Area} + \varepsilon$, where $E[\varepsilon] = 0$;
ii. The error variables are Normally distributed;
iii. The error variables have constant variance;
iv. The error variables are independent (or at least uncorrelated).
(c) An Excel output is shown:
(d)
CHAPTER 12. CHI-SQUARED TESTS
$P(X = x_i) = \binom{4}{x_i}\, 0.1^{x_i} (1 - 0.1)^{4 - x_i}$. The expected frequencies
under this model can now be calculated, and the results are given in the
following table:
| Number needing adj. ($x_i$) | 0 | 1 | 2 | 3 | 4 | Total |
|---|---|---|---|---|---|---|
| Number of days ($o_i$) | 102 | 78 | 19 | 1 | 0 | 200 |
| $P(X = x_i)$ if $X \sim \text{Bin}(4, 0.1)$ | 0.6561 | 0.2916 | 0.0486 | 0.0036 | 0.0001 | 1 |
| Expected freq. ($e_i = 200p_i$) | 131.22 | 58.32 | 9.72 | 0.72 | 0.02 | 200 |
The chi-square tests require that all expected frequencies be greater than 5. To
achieve this, we group the last three categories. The revised table is shown below:
| Number needing adj. ($x_i$) | 0 | 1 | 2, 3 or 4 | Total |
|---|---|---|---|---|
| Number of days ($o_i$) | 102 | 78 | 20 | 200 |
| $P(X = x_i)$ if $X \sim \text{Bin}(4, 0.1)$ | 0.6561 | 0.2916 | 0.0523 | 1 |
| Expected freq. ($e_i = 200p_i$) | 131.22 | 58.32 | 10.46 | 200 |
| $(o_i - e_i)^2/e_i$ | 6.51 | 6.64 | 8.70 | 21.85 |
The test statistic is
$$X^2 = \sum_i \frac{(o_i - e_i)^2}{e_i},$$
where the sum is over all (remaining) categories of the variable. Under $H_0$, $X^2$
follows a $\chi^2$ distribution, with
df = Number of categories - 1 - Number of parameters estimated
= 3 - 1 - 0
= 2,
i.e. $X^2 \sim \chi^2_2$ under $H_0$. So the $\alpha = 0.05$ critical value is
$$\chi^2_{crit} = \chi^2_{2,0.05} = 5.99.$$
The observed value of the test statistic is
$$\chi^2_{obs} = 21.85.$$
Since $\chi^2_{obs} = 21.85 \in CR$, the data provides sufficient evidence to reject $H_0$ at the 5%
level of significance. We conclude that the number of tractors needing adjustment
is not distributed as Bin(4, 0.1).
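The whole calculation (binomial probabilities, expected frequencies, pooling of small categories, and the test statistic) can be sketched in Python as follows. The function name `binomial_gof` and the rule of pooling only trailing categories are illustrative assumptions; the observed counts are those from the table above.

```python
from math import comb

def binomial_gof(observed, n_trials, p, min_expected=5.0):
    """Chi-squared goodness-of-fit statistic against Bin(n_trials, p).

    Trailing categories are pooled until the last expected
    frequency is at least min_expected.
    """
    total = sum(observed)
    probs = [comb(n_trials, k) * p ** k * (1 - p) ** (n_trials - k)
             for k in range(n_trials + 1)]
    obs = list(observed)
    exp = [total * q for q in probs]
    # Pool the last category into its neighbour while it is too small.
    while len(exp) > 1 and exp[-1] < min_expected:
        last_e = exp.pop()
        last_o = obs.pop()
        exp[-1] += last_e
        obs[-1] += last_o
    stat = sum((o - e) ** 2 / e for o, e in zip(obs, exp))
    df = len(obs) - 1  # no parameters were estimated from the data
    return stat, df, obs, exp

stat, df, obs, exp = binomial_gof([102, 78, 19, 1, 0], n_trials=4, p=0.1)
print(round(stat, 2), df, obs)
```

The statistic would then be compared with the tabulated critical value for the returned degrees of freedom.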
2. Political ideology of government has a great impact on business perception and
planning. A market researcher is investigating the support for the various political
parties in Australia at the Federal level. The support at the 2001 Federal election
was Liberal 37%, Labor 38%, National 6%, Democrats 5%, Others 14%
(source: http://www.aec.gov.au/ content/when/past/2001/results/index.html).
Six months after the 2001 election, a survey of 1050 voters was conducted, to
determine whether the level of support for each party had changed. The results are
summarised in the table below.
| Party ($i$) | Lib | Lab | Nat | Dem | Oth | Total |
|---|---|---|---|---|---|---|
| No. of voters ($o_i$) | 350 | 456 | 50 | 44 | 150 | 1050 |
| Probability ($p_i$) | 0.37 | 0.38 | 0.06 | 0.05 | | 1 |
Determine at significance level 0.05 whether the level of support for the parties
changed in the six months following the 2001 election. Comment on where the
major discrepancy appears to lie.
Solution:
| Party ($i$) | Lib | Lab | Nat | Dem | Oth | Total |
|---|---|---|---|---|---|---|
| No. of voters ($o_i$) | 350 | 456 | 50 | 44 | 150 | 1050 |
| Probability ($p_i$) | 0.37 | 0.38 | 0.06 | 0.05 | 0.14 | 1 |
| Expected frequency ($e_i$) | 388.5 | 399 | 63 | 52.5 | 147 | 1050 |
| $o_i - e_i$ | -38.5 | 57 | -13 | -8.5 | 3 | 0 |
| $(o_i - e_i)^2$ | 1482.25 | 3249 | 169 | 72.25 | 9 | (not reqd) |
| $(o_i - e_i)^2/e_i$ | 3.8153 | 8.1429 | 2.6825 | 1.3762 | 0.0612 | 16.0781 |
(a) df = number of categories (after grouping to eliminate any expected frequencies
less than 5) - 1 - number of parameters estimated from the data.
(b) [See table above].
df $= 5 - 1 - 0 = 4$; $\chi^2_{4,0.05} = 9.49$.
$\chi^2_{o} = 16.0781 > 9.49$, therefore the data provides sufficient evidence to reject
$H_0$ at the 5% level of significance. We conclude that the level of support for
the political parties has changed since the last election.
The largest contribution to the Chi-squared statistic is from the Labor column.
Thus the major discrepancy is that the support for Labor increased in the six
months following the 2001 election.
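To see where the discrepancy lies, the per-party contributions can be computed directly. The sketch below uses the observed counts and election proportions from the table; the function name `gof_contributions` is an illustrative choice.

```python
def gof_contributions(observed, probs):
    """Per-category contributions (o - e)^2 / e to the chi-squared statistic."""
    total = sum(observed.values())
    contrib = {}
    for name, o in observed.items():
        e = total * probs[name]          # expected frequency under H0
        contrib[name] = (o - e) ** 2 / e
    return contrib

observed = {"Lib": 350, "Lab": 456, "Nat": 50, "Dem": 44, "Oth": 150}
probs = {"Lib": 0.37, "Lab": 0.38, "Nat": 0.06, "Dem": 0.05, "Oth": 0.14}

contrib = gof_contributions(observed, probs)
statistic = sum(contrib.values())
largest = max(contrib, key=contrib.get)  # category with the largest contribution
print(round(statistic, 4), largest)
```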
3. Black et al. 12.5.
Solution:
$H_0$: The way that men define their personal success does not differ from how women
define theirs.
$H_1$: $H_0$ is false.
The test statistic is given by
$$\chi^2 = \sum \frac{(f_o - f_e)^2}{f_e},$$
where $f_o$ is the observed frequency and $f_e$ is the expected frequency.
The significance level is given as $\alpha = 0.05$.
There are four categories in this question (happiness, sales, helping others,
achievements), so $k = 4$. The degrees of freedom are $k - 1$. For $\alpha = 0.05$ and df = 3, the
critical chi-square value is
$$\chi^2_{0.05,3} = 7.8147.$$
The expected frequencies are computed by multiplying the expected proportions (from
the women's data) by the total sample size of the men's data. The total
sample size for the men's data is 227 (add up all the observed frequencies). The
expected frequency for the happiness category is then 227(0.39) = 88.53 and similarly
for the sales category, the expected frequency is 227(0.12) = 27.24, and so on.
| Definition | $f_o$ | $f_e$ | $(f_o - f_e)^2/f_e$ |
|---|---|---|---|
| Happiness | 42 | 88.53 | 24.46 |
| Sales | 95 | 27.24 | 168.55 |
| Helping | 27 | 40.86 | 4.70 |
| Achievements | 63 | 70.34 | 0.77 |
| Total | | | 198.48 |
Since the chi-squared observed value (198.48) is greater than the critical value (7.8147), we
reject the null hypothesis.
Thus, the data gathered in the sample suggests that the way men define their
personal success differs significantly from how women define theirs.
12.3: Contingency Analysis: The Chi-Squared Test for Independence
4. In a random sample of 100 people, each person was classified by buying response to
a particular product and also by degree of exposure to marketing pressure (recorded
in four categories I, II, III, IV), with the following results:
[Fill in (say) three of the expected frequencies (in parentheses) before your lab.]
Marketing Pressure
| | I | II | III | IV | Totals |
|---|---|---|---|---|---|
| Definitely buy | 12 ( ) | 12 ( ) | 6 ( ) | 17 ( ) | 47 |
| Undecided | 5 ( ) | 8 ( ) | 10 ( ) | 5 ( ) | 28 |
| Will not buy | 3 ( ) | 10 ( ) | 7 ( ) | 5 ( ) | 25 |
| Total | 20 | 30 | 23 | 27 | 100 |
(a) State the hypotheses you would use in testing the advertising agency's claim
that buying response is influenced by the degree of marketing pressure.
(b) Explain why you would calculate the expected frequencies using the rule
$$\text{expected frequency for a cell} = \frac{\text{row total} \times \text{column total}}{\text{grand total}}.$$
(c) Test the advertising agency's claim at the 5% significance level.
Solution:
(a) The hypotheses to be tested are
$H_0$: Marketing pressure and buying response are independent
$H_1$: Marketing pressure and buying response are not independent
(i.e. buying response is influenced by marketing pressure)
(b) Consider the upper-left cell. The probability that a particular customer will
be classified in this cell is P(Marketing Pressure I and Definitely buy). If
$H_0$ is true, then
$$P(\text{Marketing Pressure I and Definitely buy}) = P(\text{Marketing Pressure I}) \times P(\text{Definitely buy}).$$
The two probabilities on the right-hand side can be estimated naturally in
terms of the respective row and column totals:
$$P(\text{Marketing Pressure I}) \approx \frac{\text{Column 1 Total}}{\text{Grand Total}}, \qquad P(\text{Definitely buy}) \approx \frac{\text{Row 1 Total}}{\text{Grand Total}}.$$
So,
$$P(\text{Marketing Pressure I and Definitely buy}) \approx \frac{\text{Row 1 Total} \times \text{Column 1 Total}}{(\text{Grand Total})^2}.$$
The expected frequency for the upper-left cell is
$$\text{Grand Total} \times P(\text{Marketing Pressure I and Definitely buy}) \approx \text{Grand Total} \times \frac{\text{Row 1 Total} \times \text{Column 1 Total}}{(\text{Grand Total})^2} = \frac{\text{Row 1 Total} \times \text{Column 1 Total}}{\text{Grand Total}}.$$
This argument applies in the same way for all other cells in the table.
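The rule derived above translates directly into code. The following Python sketch applies it to every cell of the marketing pressure table; the function name `expected_frequencies` is an illustrative choice.

```python
def expected_frequencies(table):
    """Expected counts under independence: row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    return [[r * c / grand for c in col_totals] for r in row_totals]

# Observed counts: rows are Definitely buy, Undecided, Will not buy;
# columns are marketing pressure categories I to IV.
observed = [
    [12, 12, 6, 17],
    [5, 8, 10, 5],
    [3, 10, 7, 5],
]
expected = expected_frequencies(observed)
for row in expected:
    print([round(e, 2) for e in row])
```

Each row and column of the expected table sums to the corresponding observed total, which gives a quick check on hand calculations.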
(c) Expected frequencies (under the model of independence) are given in brackets:
Marketing Pressure
| | I | II | III | IV | Totals |
|---|---|---|---|---|---|
| Definitely buy | 12 (9.4) | 12 (14.1) | 6 (10.81) | 17 (12.69) | 47 |
| Undecided | 5 (5.6) | 8 (8.4) | 10 (6.44) | 5 (7.56) | 28 |
| Will not buy | 3 (5) | 10 (7.5) | 7 (5.75) | 5 (6.75) | 25 |
| Total | 20 | 30 | 23 | 27 | 100 |
The test statistic is
$$X^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - e_{ij})^2}{e_{ij}},$$
where $r$ and $c$ are the numbers of rows and columns (not including totals), $O_{ij}$
is the observed count in the cell in row $i$ and column $j$, and $e_{ij}$ is the expected
count in the same cell (assuming $H_0$, that is, no relationship between the two
variables). This double sum can be thought of simply as a single sum over
all cells in the table.
Under $H_0$, the test statistic follows a $\chi^2$ distribution, with degrees of freedom
$(r - 1) \times (c - 1) = (3 - 1) \times (4 - 1) = 6$, i.e. $X^2 \sim \chi^2_6$ under $H_0$. Thus, the
$\alpha = 0.05$ critical value of the test is
$$\chi^2_{crit} = \chi^2_{6,0.05} = 12.59.$$
The observed value of the test statistic is
$$\begin{aligned}
\chi^2_{obs} &= \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(o_{ij} - e_{ij})^2}{e_{ij}} \\
&= 0.719 + 0.312 + 2.140 + 1.464 + 0.064 + 0.019 + 1.968 \\
&\quad + 0.867 + 0.800 + 0.833 + 0.271 + 0.454 \\
&= 9.91.
\end{aligned}$$
Since $\chi^2_{obs} = 9.91 < 12.59$, the data does not provide sufficient evidence to
reject $H_0$ in favour of $H_1$ at the 5% level of significance. We conclude that
buying response is not influenced by marketing pressure.
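The test for independence can be reproduced with a few lines of Python. This is a sketch only: the function name `chi2_independence` is an illustrative choice, and the critical value 12.59 is taken from tables (as quoted in the solution) rather than computed.

```python
def chi2_independence(table):
    """Return (statistic, df) for a chi-squared test of independence."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / grand  # expected count
            stat += (o - e) ** 2 / e
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, df

observed = [
    [12, 12, 6, 17],   # Definitely buy
    [5, 8, 10, 5],     # Undecided
    [3, 10, 7, 5],     # Will not buy
]
stat, df = chi2_independence(observed)
critical_value = 12.59            # tabulated 5% critical value for df = 6
reject_h0 = stat > critical_value
print(round(stat, 2), df, reject_h0)
```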
5. Four hotels took part in a survey on hotel guest satisfaction. A follow-up question
was asked of all respondents who were dissatisfied with the service. These guests
were asked to indicate the main reason for their dissatisfaction. You are asked
to investigate whether the choice of hotel has any bearing on the main reason for
dissatisfaction.
Do not use Excel in this question! Write your answers on paper, showing
full working.
(a) State appropriate hypotheses that could be tested to answer the question: Do
the results of the survey provide evidence that the nature of dissatisfaction and
the choice of hotel are related?
Solution:
$H_0$: Choice of hotel and reason for dissatisfaction are independent.
$H_1$: $H_0$ is false (i.e. choice of hotel and reason for dissatisfaction are related).
(b) A contingency table, summarising the results of the survey, is given below.
The table shows the observed frequencies for each cell, as well as some of the
expected frequencies under $H_0$ (in parentheses). Copy down this table, and
without using Excel, calculate the remaining expected frequencies under $H_0$.
Show working!
Hotel
| | Fijian | Tradeswest | Sheraton | Coral Reef | Totals |
|---|---|---|---|---|---|
| Politeness | 23 ( ) | 7 ( ) | 37 (33.7410) | 67 (62.0192) | 134 |
| Knowledge | 25 ( ) | 13 ( ) | 25 (30.9712) | 60 (56.9281) | 123 |
| Responsiveness | 13 (11.0024) | 5 (6.6906) | 13 (15.6115) | 31 (28.6954) | 62 |
| Other | 13 (17.3909) | 20 (10.5755) | 30 (24.6763) | 35 (45.3573) | 98 |
| Totals | 74 | 45 | 105 | 193 | 417 |
Solution:
The expected frequency for Fijian and Politeness is
$$\frac{\text{Row Total} \times \text{Column Total}}{\text{Grand Total}} = (134 \times 74)/417 = 23.7794.$$
One can either work out the remaining three expected frequencies as above, or
by using the fact that the expected frequencies in each row/column are required
to sum to the (observed) row/column total. The complete table is below:
Hotel
| | Fijian | Tradeswest | Sheraton | Coral Reef | Totals |
|---|---|---|---|---|---|
| Politeness | 23 (23.7794) | 7 (14.4604) | 37 (33.7410) | 67 (62.0192) | 134 |
| Knowledge | 25 (21.8273) | 13 (13.2734) | 25 (30.9712) | 60 (56.9281) | 123 |
| Responsiveness | 13 (11.0024) | 5 (6.6906) | 13 (15.6115) | 31 (28.6954) | 62 |
| Other | 13 (17.3909) | 20 (10.5755) | 30 (24.6763) | 35 (45.3573) | 98 |
| Totals | 74 | 45 | 105 | 193 | 417 |
(c) Write down an expression for the relevant test statistic, and state its
distribution under $H_0$ (together with any associated parameters!).
Solution:
The test statistic is
$$X^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - e_{ij})^2}{e_{ij}},$$
where $r$ and $c$ are the numbers of rows and columns (not including totals), $O_{ij}$
is the observed count in the cell in row $i$ and column $j$, and $e_{ij}$ is the expected
count in the same cell (assuming $H_0$, that is, no relationship between the two
variables).
Under $H_0$, the test statistic follows a $\chi^2$ distribution, with degrees of freedom
$(r - 1) \times (c - 1) = (4 - 1) \times (4 - 1) = 9$, i.e. $X^2 \sim \chi^2_9$ under $H_0$.
(d) Without using Excel, calculate the contribution from the upper-left cell to the
observed value of the test statistic.
Solution:
The contribution to the observed value from the upper-left cell is
$$\frac{(o_{ij} - e_{ij})^2}{e_{ij}} = \frac{(23 - 23.7794)^2}{23.7794} = 0.0256.$$
(e) Given that the observed value of the test statistic is $\chi^2_{obs} = 20.8059$, carry
out the test (without using Excel) at the 5% significance level, and state your
conclusion. Is there sufficient evidence to conclude that there is a relationship
between the choice of hotel and the nature of dissatisfaction?
Solution:
The critical value for this test is
$$\chi^2_{crit} = \chi^2_{9,\alpha} = \chi^2_{9,0.05} = 16.92,$$
so the critical region is $\{X^2 > \chi^2_{crit} = 16.92\}$. Since $\chi^2_{obs} = 20.8059$ is within
the critical region, we reject $H_0$ in favour of $H_1$. There is sufficient evidence,
at the 5% significance level, to conclude that the nature of dissatisfaction is
related to the choice of hotel.
6. To undertake contingency analysis in Excel, first enter the data, then go to KaddSTAT
-> Hypothesis Testing -> Chi-Square Test. Select the data as Input Range,
tick the Header Row and Column Included box, and choose where you want Excel
to print the output.
Enter the data from Question 5 as shown below:
(a) Use Excel to generate an appropriate output for a test for independence of the
two variables of interest, carry out the test, and check that your conclusions
are the same as in Question 5(e).
Solution:
Excel returns the following output:
(b) If there is evidence that the nature of dissatisfaction is related to the choice
of hotel, where do the discrepancies lie? Which hotel(s) could be advised to
improve their service, and in which area(s)? Do any of the hotels appear to
provide significantly better service than the others in a particular area?
Solution:
Having established that there is indeed a relationship between the choice of
hotel and the nature of dissatisfaction, one can examine the output to
determine which hotels have a greater (or lesser) proportion of complaints of each
type.
It can be seen from the output of chi-square calculations that there are three
cells that have much larger contributions to the observed value of the test
statistic than the others. These cells are Tradeswest and Politeness (3.8490),
Tradeswest and Other (8.3987) and Coral Reef and Other (2.3651). Comparing
the observed frequencies with the expected frequencies for these cells, we see
that of the dissatisfied hotel guests, those who stayed at Tradeswest are less
often dissatisfied with Politeness, and more often their dissatisfaction is
classified as Other. It might also be that those dissatisfied guests who stay at Coral
Reef less often state that their dissatisfaction is due to Other, although there
is probably not enough evidence to confirm this (the chi-square contribution is
not that large).
We conclude that Tradeswest should take steps to improve their service in the
area of Other. Some further analysis might be required to provide more useful
advice. To find out which particular aspects of Tradeswest's service guests are
dissatisfied with, one might choose to replace Other by a collection of more
meaningful categories (e.g. Cleanliness, Food, etc.). One can ensure that the
expected frequencies are all greater than 5 by combining any categories that
have small expected frequencies, or by simply gathering enough data.
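The cell-by-cell contributions discussed in this solution can also be computed without Excel. The Python sketch below ranks the cells of the hotel table by their contribution; the function name `cell_contributions` is an illustrative choice, and the expected counts are recomputed from the row and column totals rather than read from the Excel output.

```python
def cell_contributions(table, row_names, col_names):
    """Return {(row, col): (o - e)^2 / e} under the independence model."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    contrib = {}
    for i, row in enumerate(table):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / grand
            contrib[(row_names[i], col_names[j])] = (o - e) ** 2 / e
    return contrib

hotels = ["Fijian", "Tradeswest", "Sheraton", "Coral Reef"]
reasons = ["Politeness", "Knowledge", "Responsiveness", "Other"]
observed = [
    [23, 7, 37, 67],
    [25, 13, 25, 60],
    [13, 5, 13, 31],
    [13, 20, 30, 35],
]
contrib = cell_contributions(observed, reasons, hotels)
# Cells sorted from largest to smallest contribution; keep the top three.
top_three = sorted(contrib, key=contrib.get, reverse=True)[:3]
for cell in top_three:
    print(cell, round(contrib[cell], 4))
```

The ranking only shows which cells drive the statistic; the direction of each discrepancy still has to be read off by comparing observed with expected counts.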
7. A market researcher wished to investigate whether a buyer's age had any bearing
on choice of car colour. A random sample of 200 car buyers resulted in the following
table which shows the observed frequencies and some of the expected frequencies
(in parentheses).
| | Chose Red | Chose White | Chose Grey |
|---|---|---|---|
| Age 17-24 | 20 (16) | 15 (16) | 5 ( ) |
| Age 25-40 | 30 (24) | 20 (24) | 10 ( ) |
| Age over 40 | 30 ( ) | 45 ( ) | 25 ( ) |
(a) State the hypotheses that the researcher is comparing in this investigation.
(b) Copy the body of the table and complete the entries for expected frequencies.
(c) Give the number of degrees of freedom for a $\chi^2$-test of the hypotheses in part
(a).
(d) Explain fully how the expected frequency of 16 is obtained for the 17-24 age
group with a preference for Red. (Do not merely quote a formula or show one
line of arithmetic.)
(e) Using a 5% level of significance, determine if the buyer's age has any bearing
on the choice of car colour.
Solution:
(a) $H_0$: choice of colour independent of age; $H_1$: choice of colour dependent on
age.
(b)
| | Chose Red | Chose White | Chose Grey | Total |
|---|---|---|---|---|
| Age 17-24 | 20 (16) | 15 (16) | 5 (8) | 40 |
| Age 25-40 | 30 (24) | 20 (24) | 10 (12) | 60 |
| Age over 40 | 30 (40) | 45 (40) | 25 (20) | 100 |
| Total | 80 | 80 | 40 | 200 |
(c) No grouping required, so the number of degrees of freedom for the $\chi^2$-distribution
is $(3 - 1)(3 - 1) = 4$.
(d) Under $H_0$, P(Age 17-24 and Red) = P(Age 17-24) × P(Red).
Estimating P(Age 17-24) by $\frac{40}{200}$ and P(Red) by $\frac{80}{200}$ gives expected frequency
for the (1, 1)-cell $= \frac{40}{200} \times \frac{80}{200} \times 200 = 16$.
(e) From tables, $\chi^2_{4,0.05} = 9.49$ and the observed value of the test statistic $= 9.06 \notin CR$,
so the data does not provide sufficient evidence to reject $H_0$ at the 5% level of
significance. We conclude that choice of colour is not dependent on age.
8. Black et al. Exercises 12.27 and 12.29.
Solution:
(a) Black et al. 12.27
The hypotheses of interest here are
$H_0$: Proportion of households with internet access is not dependent on whether
they have children under the age of 15 for the period 1989 to 2003.
$H_1$: Proportion of households with internet access is dependent on whether
they have children under the age of 15 for the period 1989 to 2003.
Note that all the expected frequencies are greater than 5. df $= (2 - 1)(6 - 1) = 5$. The
p-value of the test $= P(\chi^2_5 > 0.13) = 0.9997$ (from Excel), so there is insufficient
evidence against the null hypothesis. We conclude that the proportion of
households with internet access is the same for those with children under 15
and those without children under 15 in the period 1989 to 2003.
(b) Black et al. 12.29
$H_0$: Gender and colour preference for cars are independent
$H_1$: Gender and colour preference for cars are not independent
To test these hypotheses, we use the chi-squared test of independence. The
observed chi-squared value (from the test statistic) is 5.366. The p-value is
0.252 > 0.05, so there is insufficient evidence to reject the null hypothesis.
(The critical value at the 5% level of significance with $(5 - 1)(2 - 1) = 4$ degrees of
freedom is 9.4877. Since the observed value does not lie in the critical region,
we do not reject the null hypothesis.) Therefore, there is not enough evidence
provided by the data to suggest that colour preference is dependent on gender.
Marketing agencies don't have to model colour as a factor when trying to sell
cars to either gender. Also, manufacturers can determine car colour quotas on
another basis, instead of gender preference.
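The p-value quoted for Exercise 12.29 can be checked without Excel, because the upper tail of a chi-squared distribution has a simple closed form when the degrees of freedom are even. The sketch below relies on that closed form, so it does not apply to odd degrees of freedom (such as df = 5 in Exercise 12.27); the function name `chi2_sf_even_df` is an illustrative choice.

```python
import math

def chi2_sf_even_df(x, df):
    """P(chi-squared with df degrees of freedom > x), for even df only.

    For df = 2k the upper tail equals exp(-x/2) * sum_{j<k} (x/2)^j / j!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive even df")
    half = x / 2.0
    term = 1.0     # current term (x/2)^j / j!, starting at j = 0
    total = 1.0
    for j in range(1, df // 2):
        term *= half / j
        total += term
    return math.exp(-half) * total

p_value = chi2_sf_even_df(5.366, 4)   # observed statistic from Exercise 12.29
print(round(p_value, 3))
```

Evaluating the same function at the tabulated critical value 9.4877 with df = 4 should return a tail area of about 0.05, which is a useful sanity check.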