Chapter 9 S 1
Chapter No.: 9
Class: BSc
Subject: Statistics-1
E-Mail: waseem.mustafa.ibd@gmail.com
Important Points:
9.1) Introduction:
In Chapter 8 we focused on testing the value of a particular population parameter of interest,
such as a mean, μ, or proportion, 𝝅. Here we shall look at testing for association between
two categorical variables.
9.2) Association:
We often wish to answer questions such as ‘Do vegetarians tend to support Party X?’ or ‘Do
Party X supporters tend to be vegetarians?’ If either question has the affirmative answer, we say
that there is an association between the two categories (or ‘attributes’ or ‘factors’). We will need
frequency data in order to answer such questions. In this course we will not be concerned
with the strength of any such association, but rather just with testing for its existence and
commenting on its nature.
The cell frequencies are known as observed frequencies and show how the data are spread
across the different combinations of factors.
Step 3: Test statistic: The test statistic used for tests for association is as follows:
$$X_c^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$
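The test statistic in Step 3 can be sketched in a few lines of Python. This is only an illustrative sketch: the 2 x 2 table is invented, and the function names are hypothetical. It assumes the usual expected frequency under no association, E_ij = (row i total × column j total) / grand total, a formula this section does not state.

```python
def expected_frequencies(observed):
    """Return E_ij = (row i total * column j total) / grand total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def chi_squared_statistic(observed):
    """Sum (O_ij - E_ij)^2 / E_ij over every cell of the table."""
    expected = expected_frequencies(observed)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )


# Hypothetical 2 x 2 table of observed frequencies (illustrative only).
observed = [[30, 20],
            [20, 30]]

print(expected_frequencies(observed))   # every expected frequency is 25.0
print(chi_squared_statistic(observed))  # four cells, each contributing 25/25 = 1
```

Each cell adds a non-negative amount, so the statistic grows as the observed frequencies move away from the expected ones.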
Step 4: Decide significance level and find the value of $X_v^2$: A chi-square variable takes
only positive values, and its distribution depends on the degrees of freedom
$v = (r - 1)(c - 1)$, where r is the number of rows and c is the number of columns. For the
significance level we use the same decision tree that we studied in Chapter 8. It is also presented here.
Note: In this case, we read the value of $X_v^2$ from Table 8 (it will be given in past papers).
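As a rough sketch of Step 4, the degrees of freedom and the table lookup can be written as follows. The dictionary holds a few upper-tail critical values copied from standard chi-squared tables and rounded to two decimal places. It is a stand-in for the statistical table, not a replacement for it, and the function names are hypothetical.

```python
# Upper-tail critical values of the chi-squared distribution (2 d.p.),
# taken from standard statistical tables. Only degrees of freedom 1 to 4 are shown.
CRITICAL_VALUES = {
    # degrees of freedom: {significance level: critical value}
    1: {0.10: 2.71, 0.05: 3.84, 0.01: 6.63},
    2: {0.10: 4.61, 0.05: 5.99, 0.01: 9.21},
    3: {0.10: 6.25, 0.05: 7.81, 0.01: 11.34},
    4: {0.10: 7.78, 0.05: 9.49, 0.01: 13.28},
}


def degrees_of_freedom(rows, cols):
    """v = (r - 1)(c - 1) for a table with r rows and c columns."""
    return (rows - 1) * (cols - 1)


def critical_value(rows, cols, alpha):
    """Look up the critical value for an r x c table at level alpha."""
    return CRITICAL_VALUES[degrees_of_freedom(rows, cols)][alpha]


print(degrees_of_freedom(2, 3))    # (2 - 1)(3 - 1) = 2
print(critical_value(2, 3, 0.05))  # 5.99
```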
Step 5: Determine the critical region: $X^2$ tests of association are always upper-tailed tests.
This means we have to check whether $X_c^2 > X_v^2$.
Step 6: Choose hypothesis: Now we decide whether or not to reject 𝐇𝟎 (our working
hypothesis which, up to now, is assumed to be true). If the test statistic value lies in the critical
region then we will reject 𝐇𝟎 , if not we do not reject 𝐇𝟎 . A rejected 𝐇𝟎 means the test is
statistically significant at the specific significance level used.
Step 7: Draw Conclusions: It is always important to draw conclusions in the context of the
variables of the original problem. There are two possible outcomes:
a) If the value of the test statistic falls in the critical region, reject 𝐇𝟎 and conclude that there is evidence of an association (𝐇𝟏).
b) Otherwise, do not reject 𝐇𝟎.
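The decision rule in Steps 5 to 7 amounts to a single comparison. The sketch below uses a made-up test statistic of 7.20 and a hypothetical function name. The critical values 5.99 and 9.21 are the standard 5% and 1% values for 2 degrees of freedom.

```python
def decide(test_statistic, critical_value):
    """Upper-tailed rule: reject H0 only when the statistic exceeds the critical value."""
    if test_statistic > critical_value:
        return "reject H0"
    return "do not reject H0"


# Hypothetical test statistic on 2 degrees of freedom (illustrative only).
x_c = 7.20

print(decide(x_c, 5.99))  # 5% level: 7.20 > 5.99, so reject H0
print(decide(x_c, 9.21))  # 1% level: 7.20 < 9.21, so do not reject H0
```

A result like this one, significant at 5% but not at 1%, is why the exercises ask for two significance levels: the pair of decisions indicates how strong the evidence is.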
Returning to the crime example, we now test for an association between area and crime at the 0.01
significance level.
Solution: 𝐇𝟎: There is no association between area and crime.
𝐇𝟏: There is an association between area and crime.
The observed frequencies, the expected frequencies and the test statistic contributor values are
given in the following table:
At the 5% significance level, the upper-tail critical value is 5.99, so since 2.364 < 5.99 we do
not reject 𝐇𝟎 .
If we now consider the 10% level the critical value is 4.61, so again we do not reject 𝐇𝟎 . So
it looks as if there are no preferences for a particular wrapper type on the choices so far.
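The two comparisons above can be checked numerically. The sketch below reuses the quoted statistic 2.364 and the critical values 5.99 and 4.61. Those critical values belong to 2 degrees of freedom, which is inferred here rather than stated in the text. For 2 degrees of freedom only, the upper-tail probability has the closed form exp(-x/2), so a p-value can be shown as well.

```python
import math

x_c = 2.364  # test statistic quoted in the text

# Critical values quoted in the text (they match 2 degrees of freedom).
critical = {0.05: 5.99, 0.10: 4.61}

for alpha, x_v in critical.items():
    decision = "reject H0" if x_c > x_v else "do not reject H0"
    print(f"alpha = {alpha:.2f}: {x_c} vs {x_v} -> {decision}")

# Valid for 2 degrees of freedom only: P(X^2 > x) = exp(-x / 2).
p_value = math.exp(-x_c / 2)
print(round(p_value, 3))  # roughly 0.307, well above 0.10
```

The p-value is larger than both significance levels, which agrees with not rejecting 𝐇𝟎 at either level.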
Exercise:
Q 1) A survey has been made of levels of satisfaction with housing by people living in different
types of accommodation. Levels of satisfaction are:
high, medium, low, very dissatisfied
and housing types are:
public housing apartment, public housing house, private apartment, private detached house,
private semi-detached house, miscellaneous (includes boat, caravan etc!)
Give:
a) the null and alternative hypotheses
b) the degrees of freedom
c) the 5% and 1% critical values for $X^2$.
Q 2) In a survey made in order to decide where to locate a factory, samples from five towns
were examined to see the numbers of skilled and unskilled workers. The data were as follows:
Does the population proportion of skilled workers vary with the area?
Q 3) Look at the following table taken from a study of gender differences in perception. One
of the tests was of scores in verbal reasoning.
Carry out a suitable test at two or more levels of significance and report on your findings.
Calculate the value of chi-squared for the table and say what you would conclude.
ii) If you look separately at the relationship for poor districts and rich districts you get
the following two tables:
For the poor districts the value of chi-squared is 1.76. For the rich districts the value of chi-
squared is 0.023. What would you conclude for each table, and overall?
Test for association between gender and party affiliation at two appropriate levels and comment
on your results.
Conduct a suitable test for association between Ticket classification and Footfall level, and
report on your findings.
i) Use an appropriate test to see whether there is an association between Education and Religious
Belief according to these data. Give the null and alternative hypotheses and test at two
appropriate levels.
ii) What are your conclusions?
iii) Compare the proportion of those with less than high school education who have liberal
religious beliefs with that for those whose highest degree was of bachelor or graduate level.
Does this agree with your earlier findings?
i) Carry out an overall test for association between gender and method of calculation at two
levels. Give the null and alternative hypotheses and comment on your results.
ii) The researcher is interested in whether there are any gender differences in preferred method
of computation. Discuss any potential gender differences which appeared in the test for
association.
i) Test for association between performance in algebra and pre-school attendance at two
appropriate significance levels. State the null and alternative hypotheses clearly.
ii) Comment on your results describing potential associations in detail. Discuss any potential
differences in the algebra marks between students who did and students who did not attend pre-
school.
i) Test for an association between a person’s opinion on the new taxation policy and the city of
residence at two appropriate significance levels. State the null and alternative hypotheses
clearly.
ii) Comment on your results describing potential associations in detail. Discuss the potential
differences in rates in favour of the new taxation policy across different cities.
i) Test for an association between the method of contact prior to the survey and response at two
appropriate significance levels. State the null and alternative hypotheses clearly.
ii) Comment on your results describing potential associations in detail. Discuss the potential
differences in response rates for different methods of contact.
i) Based on the data in the table, and without doing a significance test, how would you describe
the relationship between education and opinion on whether or not homeopathy is scientific?
ii) Calculate the $X^2$ statistic and use it to test for independence, using a 1% significance level.
What do you conclude?
i) Based on the data in the table, and without doing a significance test, how would you describe
the relationship between education and opinion on whether or not astrology is scientific?
ii) Calculate the $X^2$ statistic and use it to test for independence, using a 1% significance level.
What do you conclude?
i) Based on the data in the table, and without conducting a significance test, would you say
there is an association between the frequency of newspaper readership and reader's
educational background?
ii) Calculate the $X^2$ statistic and use it to test for independence, using two appropriate
significance levels. What do you conclude?
i) Based on the data in the table, and without conducting a significance test, would you say
there is an association between areas and level of local sporting facilities?
ii) Calculate the $X^2$ statistic and use it to test for independence, using two appropriate
significance levels. What do you conclude?
i) Based on the data in the table, and without conducting any significance test, would you
say there is an association between the machine number and the component being faulty?
ii) Calculate the $X^2$ statistic and use it to test for independence, using a 5% significance
level. What do you conclude?
i) Based on the data in the table, and without conducting any significance test, would you
say there is an association between the student's type of personality and colour
preference?
ii) Calculate the $X^2$ statistic and use it to test for independence, using a 5% significance
level. What do you conclude?
i) Based on the data in the table, and without conducting any significance test, would you
say there is an association between the political affiliation and opinion on the tax reform
bill?
ii) Calculate the $X^2$ statistic and use it to test for independence of political affiliation and
opinion on the tax reform bill. What do you conclude?
i) Based on the data in the table, and without conducting any significance test, would you
say there is an association between the student's origin and satisfaction with university
life?
ii) Calculate the $X^2$ statistic and use it to test for independence of student's origin and
satisfaction with university life. What do you conclude?
i) Based on the data in the table, and without conducting any significance test, would you
say there is an association between age and watch preference? Provide a brief justification
for your answer.
ii) Calculate the $X^2$ statistic for the hypothesis of independence between age and watch
preference, and test that hypothesis. What do you conclude?
i) Based on the data in the table, and without conducting any significance test, would you
say there is an association between final grade and attending revision? Provide a brief
justification for your answer.
ii) Calculate the $X^2$ statistic for the hypothesis of independence between final grade and
attending revision, and test that hypothesis. What do you conclude?
Answers:
Q 1) (a)
H0: There is no association between the kind of accommodation people have and their level of
satisfaction with it.
H1: There is such an association.
(b) Degrees of freedom = 15
(c) 5% value = 25.00 and the 1% value = 30.58
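The degrees of freedom in (b) follow from the category counts in Q 1: four satisfaction levels and six housing types. A short check, with hypothetical variable names:

```python
# Categories exactly as listed in Q 1.
satisfaction_levels = ["high", "medium", "low", "very dissatisfied"]
housing_types = [
    "public housing apartment",
    "public housing house",
    "private apartment",
    "private detached house",
    "private semi-detached house",
    "miscellaneous",
]

# v = (r - 1)(c - 1); the result is the same whichever factor forms the rows.
df = (len(satisfaction_levels) - 1) * (len(housing_types) - 1)
print(df)  # (4 - 1)(6 - 1) = 15
```

The critical values in (c) are then read from the chi-squared table in the row for 15 degrees of freedom.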
Q 2) The proportion of skilled workers does not appear to vary with area.
Q 4) There is very strong evidence that births vary over the year.
Q 5) Do it by yourself.
Q 6) Do it by yourself.
Q 7) Do it by yourself.
Q 8) Do it by yourself.
Q 9) Do it by yourself.
Q 10) The results demonstrate a strong association between preference and ‘subject specialism’.
Q 11) The results demonstrate an association, though not a strong one, between attitude and
type of household.
Q 13)
Q 14) (i) There is a strong connection between crime rate and the existence of the death penalty.
(ii) There is no connection between crime rate and the existence of the death penalty in either
the poor or the rich districts.
Q 15) There is some association between gender and party identification but it is not very
marked.
Q 16) There is some association between footfall level and ticket price, but it is not a strong
one.
Q 17) This is clearly not possible as the formula for chi-squared is, in fact, a square (and hence
has a positive value).
Q 19) This is clearly not possible as the formula for chi-squared is, in fact, a square (and hence
has a positive value).
Q 20) (i) There is some evidence of association between gender and method of computation,
but the evidence is not terribly strong.
(ii) Looking at individual ‘observed’ and ‘expected’ values, we can see that there is no
difference between men and women in their using no aids.
Slightly fewer women than might have been expected use a computer. But the big
difference is that men are much less likely than expected to use a statistical function on a
calculator, while women are less likely than expected to use a basic calculator.
Q 22) (i) We conclude that there is some evidence of an association between pre-school
attendance and algebra marks.
(ii) There are a number of statements that can be drawn from the previous results. By checking
differences between expected and observed numbers we can extract various arguments that aid
in the interpretation of the results. For example, we may say things like:
Q 23) (i) We conclude that the association between opinion on the new taxation policy and city
is weakly significant.
(ii) A number of statements can be drawn from the previous results. By checking differences
between expected and observed numbers we can extract various arguments that aid in the
interpretation of the results. For example, we may say things like:
Main sources of association: London v. Other cities.
People in London appear to be neutral regarding the new taxation policy whereas people
in Birmingham and Manchester appear to be slightly against it.
There does not seem to be a difference of opinion among the people of Birmingham and
Manchester.
Q 24) (i) We conclude that the association between response and method of contact is highly
significant.
(ii) A number of statements can be drawn from the previous results. By checking differences
between expected and observed numbers we can extract various arguments that aid in the
interpretation of the results. For example, we may say things like:
Main sources of association: no contact v. any type of contact.
Contact prior to the survey increases the response rate.
Contact by phone results in higher response rates than contact by letter.
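Checking differences between observed and expected numbers, as described in these answers, can be sketched as follows. The table is invented purely for illustration and is not the data from the question. The function name is hypothetical, and the expected frequencies use the usual formula: row total × column total / grand total.

```python
def cell_breakdown(observed):
    """Return (row, column, O - E, (O - E)^2 / E) for every cell."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    breakdown = []
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / grand_total
            breakdown.append((i, j, o - e, (o - e) ** 2 / e))
    return breakdown


# Hypothetical table: rows are three groups, columns are two outcomes.
table = [[20, 80],
         [40, 60],
         [60, 40]]

for i, j, difference, contribution in cell_breakdown(table):
    print(f"cell ({i}, {j}): O - E = {difference:+.1f}, "
          f"contribution = {contribution:.2f}")
```

Cells with the largest contributions are the main sources of association, and the sign of O - E shows whether a cell has more or fewer observations than independence would suggest.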
Q 25) (i) Using the percentages we see that the higher someone's education, the smaller the
belief that homeopathy is very scientific and the higher the belief that it is not at all scientific.
For example, 79% of those who attended college or higher education responded that
homeopathy is not at all scientific, whereas the corresponding proportion for those with less
than high school education is 48%.
(ii) We conclude that the association between views on homeopathy and educational level is
highly significant.
Q 26) (i) Using the percentages we see that the higher someone's education, the smaller the
belief that astrology is very scientific and the higher the belief that it is not at all scientific.
For example, 79% of those who attended college or higher education responded that astrology
is not at all scientific, whereas the corresponding proportion for those with less than high school
education is 48%.
(ii) We conclude that the association between views on astrology and educational level is highly
significant.
Q 27) (i) Looking at the percentages, we see some differences between males and females. More
specifically, 39% of females shop online frequently versus 26% of males. Moreover, the
percentage of males who never shop online is 27% versus 18% for females. Hence, there may
be an association between gender and tendency to shop online, although this needs to be
investigated further.
(ii) There is some evidence of an association between gender and tendency to shop online.
Q 28) (i) Looking at the percentages, we see that distributions for the answers regarding buying
organic products are similar in rural and urban areas. More specifically, the percentage of people
who answered ‘Yes’ is quite close in these two cases (17% and 21%, respectively). The
percentages of those who replied ‘No’ were not too far either (38% vs. 33%). Hence, there does
not seem to be a strong association between place of residence and buying organic products,
although this needs to be investigated further.
(ii) There is no evidence to support an association between place of residence and buying
organic products.
Q 29) False. The larger the test statistic value, the smaller the p-value.
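The relationship in Q 29 can be illustrated numerically. As a sketch, take 1 degree of freedom, where a chi-squared variable is the square of a standard normal variable, so the upper-tail probability is P(X² > x) = erfc(√(x/2)). The function name is hypothetical, and this closed form applies to 1 degree of freedom only.

```python
import math


def p_value_one_df(x):
    """Upper-tail probability P(X^2 > x) for 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2))


# Larger test statistic values give smaller p-values.
for x in (1.0, 3.84, 6.63, 10.0):
    print(x, round(p_value_one_df(x), 4))
```

The values 3.84 and 6.63 are the usual 5% and 1% critical values for 1 degree of freedom, so their p-values come out close to 0.05 and 0.01.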
Q 30) i) There are some differences in the distributions within readership levels. More
specifically, graduates are more often frequent readers than low-readership readers, whereas the opposite holds for those with
educational attainment below A-level (46% vs. 19% and 14% vs. 49%, respectively). For
those with A-levels, most are of a moderate readership type (52%). Hence, there seems to be an
association between readership levels and reader's educational background, although this needs
to be investigated further.
ii) There is strong evidence of an association between readership level and educational
background.
Q 31) i) There are some differences in the distributions within areas. More specifically, very
good sporting facilities appear more frequently than poor sporting facilities in Areas 1 and 3
(44% vs. 30% and 45% vs. 28%, respectively). In Area 2, however, poor sporting facilities are
more common than very good ones (45% vs. 29%). There seems to be an association between
area and level of sporting facilities, although this needs to be investigated further.
ii) There is some evidence to support an association between area and level of sporting facilities.
Q 32) i) There are some differences in the proportions of faulty components for each machine.
More specifically, 2% of the components from Machine 2 are faulty, whereas the corresponding
proportion for Machine 3 is 11%, and for Machine 4 is 14%. Hence, there seems to be an
association between machine number and the component being faulty, although this needs to be
investigated further.
ii) There is evidence of an association between machine number and the component being
faulty.
Q 33) i) There are some differences in rates of introvert students for each colour preference.
More specifically, 21% of the students who prefer the green colour are introvert, whereas the
corresponding proportion for students who prefer red is 32%, and for students preferring blue
is 46%. Hence, there seems to be an association between personality type and colour preference,
although this needs to be investigated further.
ii) There is evidence of an association between personality type and colour preference.
Q 34) i) There are some differences in the opinion on the tax reform bill between Democrats
and Republicans. More specifically, only 40% of those in favour are Democrats, whereas more
than 50% of those opposed are Democrats. Hence there seems to be an association between
political affiliation and opinion on the tax reform bill, although this needs to be investigated
further.
ii) There is moderate evidence of an association between political affiliation and opinion on the
tax reform bill.
Q 35) False. A contingency table is used to display two categorical variables. An alternative
justification is that a scatter diagram is used to display two measurable variables.
Q 36) i) There are some differences in the rates of satisfaction between UK/EU and overseas
students. More specifically, two thirds of the satisfied students were overseas students, whereas
only half of the dissatisfied students were overseas students. Hence there seems to be an
association between a student's origin and satisfaction with university life, although this needs
to be investigated further.
ii) There is moderate evidence of an association between a student's origin and satisfaction with
university life.
Q 37) i) There are some differences between younger and older people regarding watch
preference. More specifically, 16% of younger people prefer an analogue watch compared to
48% for people over 30. Hence there seems to be an association between age and watch
preference, although this needs to be investigated further.
ii) There is strong evidence of an association between age and watch preference.
Q 38) i) There are some differences in the final grades between students who did and did not
attend the revision session. More specifically, 56% of those who got a final grade A attended
the revision session, whereas only 40% of those who got a final grade C attended it. Hence there
seems to be an association between final grade and attending revision although this needs to be
investigated further.
ii) There is weak evidence of an association between attending revision and final grade.