Data Analysis Text Book


Faculty of Science and Technology

SBST3103
Introductory Data Analysis

Copyright © Open University Malaysia (OUM)


SBST3103
INTRODUCTORY
DATA ANALYSIS
Prof Dr Mohd Kidin Shahran
Dr Wan Rosmanira Ismail
Dr Nur Riza Mohd Suradi
Zalina Mohd Ali
Wan Zawiah Wan Zin
Marina Zahari
Prof Dr Najib Mahmood Rafee



Project Directors: Prof Dato’ Dr Mansor Fadzil
Assoc Prof Dr Norlia T. Goolamally
Open University Malaysia

Module Writers: Prof Dr Mohd Kidin Shahran
Open University Malaysia

Dr Wan Rosmanira Ismail
Dr Nur Riza Mohd Suradi
Zalina Mohd Ali
Wan Zawiah Wan Zin
Marina Zahari
Prof Dr Najib Mahmood Rafee
Universiti Kebangsaan Malaysia

Moderators: Dr Norazan Mohamed Ramli
Universiti Teknologi MARA

Assoc Prof Dr Norlia T. Goolamally
Open University Malaysia

Developed by: Centre for Instructional Design and Technology
Open University Malaysia

First Edition, December 2007


Copyright © Open University Malaysia (OUM), December 2012, SBST3103
All rights reserved. No part of this work may be reproduced in any form or by any means
without the written permission of the President, Open University Malaysia (OUM).

Table of Contents
Course Guide

Topic 1 Chi-Square and F Distribution, and Their Applications
1.1 Chi-Square Distribution
1.1.1 Properties of Chi-Square Distribution
1.1.2 Chi-Square Distribution Table
1.1.3 Hypothesis Testing on Population Variance
1.2 F Distribution
1.2.1 Properties of F Distribution
1.2.2 F Distribution Table
1.3 Interval Estimation of Variance for Two Populations
1.4 Comparing Variances of Two Populations
Summary

Topic 2 One-Way Analysis of Variance (ANOVA)
2.1 Basic Concepts of ANOVA
2.1.1 ANOVA and Experiment
2.1.2 Within and between Group Variation
2.2 Single-Factor Experiment
2.2.1 One-Way ANOVA
2.2.2 Model for a Single-Factor Test
Summary

Topic 3 Categorical Data Analysis
3.1 Goodness-of-Fit Test
3.1.1 Fitting to a Given Probability
3.1.2 Fitting to a Given Distribution
3.2 Tabulating Qualitative Data (Contingency Table)
3.2.1 Test of Independence
3.2.2 Test of Homogeneity
3.2.3 Yates’s Continuity Correction
Summary


Topic 4 Correlation
4.1 Two-Way Scatter Plot
4.2 Pearson Correlation Coefficient
4.2.1 Pearson Correlation Coefficient Significance Test
4.3 Spearman Rank Correlation Coefficient
4.3.1 Spearman Rank Correlation Coefficient Significance Test
Summary

Topic 5 Simple Linear Regression Analysis
5.1 Introduction to Regression Concepts
5.2 Simple Linear Regression Model and its Assumptions
5.3 The Least Squares Method
5.4 Inferences on Regression Coefficients
5.5 Model Adequacy Check
5.5.1 Coefficient of Determination, R2
5.5.2 Residual Plot
5.5.3 Some Transformations
5.6 Prediction and Estimation Using Regression Model
5.6.1 Prediction Interval for an Individual Value of y
5.6.2 Confidence Interval for a Mean Value of y
Summary

Topic 6 Multiple Regression
6.1 Multiple Regression Model and Assumptions
6.1.1 Assumptions for Multiple Regression Model
6.1.2 Multiple Regression Model with TWO Independent Variables
6.1.3 Calculation Using Microsoft Excel Package
Summary

Topic 7 Introduction to Non-Parametric Concepts
7.1 Application of Non-Parametric Statistics
7.1.1 Data Not Meeting Assumptions
7.1.2 Qualitative Data
7.2 Limitations of Non-Parametric Statistics
Summary


Topic 8 Non-Parametric Test for Randomness
8.1 Runs Test
8.2 Runs Test for Large Sample Size
8.3 Tests For Randomness of Quantitative Data
Summary

Topic 9 Non-Parametric Hypothesis Test for Single Population
9.1 Hypothesis Statement for Single Population Testing
9.2 Sign Test
9.2.1 Sign Test for Large n
9.3 Wilcoxon Signed-Rank Test
9.3.1 Wilcoxon Signed-Rank Test for Large n
Summary

Topic 10 Non-Parametric Hypothesis Test for Two Populations
10.1 Dependent and Independent Populations
10.2 Hypothesis Statement for Two-Population Testing
10.3 Comparing Two Dependent Populations
10.3.1 The Sign Test for Two Dependent Populations
10.3.2 Wilcoxon Signed-Rank Test for Two Dependent Populations
10.4 Comparing Two Independent Populations
Summary

Answers
Glossary



COURSE GUIDE


COURSE GUIDE DESCRIPTION


You must read this Course Guide carefully from beginning to end. It tells
you briefly what the course is about and how you can work your way through the
course material. It also suggests the amount of time you are likely to need in
order to complete the course successfully. Please keep referring to the Course
Guide as you go through the course material, as it will help you to clarify
important study components or points that you might otherwise miss or overlook.

INTRODUCTION
SBST3103 Introductory Data Analysis is one of the courses offered by the
Faculty of Science and Technology, Open University Malaysia (OUM).

Like other courses offered by the Faculty of Science and Technology, this
three-credit-hour course is conducted over 15 weeks and is offered in the
January, May and September semesters.

COURSE AUDIENCE
This is a core course for students undergoing Bachelor of Education (Mathematics)
(Honours) at OUM.

As an open and distance learner, you should be able to learn independently
and to make the best use of the learning modes and environment
available to you. Before you begin this course, please confirm the course material,
the course requirements and how the course is conducted.

STUDY SCHEDULE
It is a standard OUM practice that learners accumulate 40 study hours for every
credit hour. As such, for a three-credit hour course, you are expected to spend
120 study hours. Table 1 gives an estimation of how the 120 study hours could be
accumulated.


Table 1: Estimation of Time Accumulation of Study Hours

Study Activities                                             Study Hours
Briefly go through the course content and participate
in initial discussions                                                 2
Study the module                                                      60
Attend 4 tutorial sessions                                             8
Online participation                                                  15
Revision                                                              15
Assignment(s) and Examination(s)                                      20
Total Study Hours                                                    120

COURSE OUTCOMES
By the end of this module, you should be able to:
1. Explain the One-Way Analysis of Variance concepts;
2. Explain the regression and correlation concepts;
3. Describe the simple and multiple linear regression concepts; and
4. Describe the concepts of non-parametric methods.

COURSE SYNOPSIS
Topic 1 introduces you to the Chi-Square and F distributions, where a good
understanding in sampling distribution and hypothesis testing is necessary. In this
topic, we will deal a lot with Chi-Square and F distributions. Hence, you must
master these two distributions including their standard tables.

Topic 2 looks at comparing the means of more than two populations using
ANOVA, which follows on from mean comparison tests involving one or
two populations. The concept of variance partitioning is introduced in this topic.

Topic 3 discusses the Goodness-of-Fit test using Chi-Square distribution.


Topic 4 begins by introducing methods for identifying the
relationship between two variables. This topic discusses a graphical method, the
two-way scatter plot, for displaying the relationship between two variables. You
are also introduced to the correlation coefficient as a measure of the strength of
the relationship between two variables.

Topic 5 introduces a prediction technique that is based on the regression method.
You will also be taught how to obtain the simple linear regression model used in
prediction. Inferences on regression coefficients are made to obtain information on
the population regression coefficients using the regression coefficients from samples.
This topic introduces several methods of model evaluation for checking whether the
model satisfies the underlying assumptions. Once the model is obtained, you can
use it for prediction or estimation.

Topic 6 introduces you to multiple regression for cases involving more than one
independent variable.

Topic 7 introduces various techniques applicable to data which do not belong to
any particular distribution. This type of statistical analysis is called
non-parametric statistics. In addition, you will also be able to identify the differences
between parametric and non-parametric statistics.

Topic 8 discusses in detail non-parametric statistical methods for testing the randomness
of data based on the data sequence. The Runs Test is discussed as a test of the
randomness of data, and its use for quantitative data is also
introduced.

Topic 9 shows you how to write hypothesis statements for single-population
testing, including the null and alternative hypotheses. This is done before the
statistical analysis. Two non-parametric statistical tests, the sign test and the
signed-rank test, are discussed in detail in this topic.

Topic 10 discusses an alternative approach to inference and data analysis, namely the
non-parametric statistical approach. In non-parametric statistics, knowledge of the
data distribution and the population is not required; instead, the ranks assigned to
the data are analysed. Hence, non-parametric statistics are also known as distribution-free
statistics and rank tests. Non-parametric statistics have wide application,
similar to the parametric tests that use the normal distribution or related
distributions such as the t, F and Chi-Square distributions. However, this module only
discusses their application to tests for randomness and to one- and two-population
tests.


TEXT ARRANGEMENT GUIDE


Before you go through this module, it is important that you note the text
arrangement. Understanding the text arrangement should help you to organise
your study of this course to be more objective and more effective. Generally, the
text arrangement for each topic is as follows:

Learning Outcomes: This section refers to what you should achieve after you
have completely gone through a topic. As you go through each topic, you should
frequently refer back to these learning outcomes. By doing
this, you can continuously gauge your progress in digesting the topic.

Self-Check: This component of the module is inserted at strategic locations
throughout the module. It is inserted after you have gone through one sub-section
or sometimes a few sub-sections. It usually comes in the form of a question that
may require you to stop reading and start thinking. When you come across
this component, try to reflect on what you have already gone through. When you
attempt to answer the question, you should be able to gauge whether
you have understood what you read clearly or only vaguely, or whether
you have not comprehended or retained the sub-section(s) that you have just
gone through. Most of the time, the answer to the question can be found directly
in the module itself.

Activity: Like Self-Check, activities are also placed at various locations or
junctures throughout the module. Compared to Self-Check, an Activity can appear in
various forms such as questions or short case studies, or it may even ask you to
conduct an observation or research. An Activity may also ask for your opinion and
evaluation of a given scenario. When you come across an Activity, you should try
to extend what you have gathered from the module and apply it to real
situations. You should engage in higher order thinking, where you might
be required to analyse, synthesise and evaluate instead of just having to recall and
define.

Summary: You can find this component at the end of each topic. This component
helps you to recap the whole topic. By going through the summary, you should be
able to gauge your knowledge retention level. Should you find points in the
summary that you do not fully understand, it would be a good idea to
revisit the details in the module.


Key Terms: This component can be found at the end of each topic. You should
go through this component to remind yourself of the important terms or jargon
used throughout the module. Should you find terms here that you are not able to
explain, you should look them up in the module.

References: This is where a list of relevant and useful textbooks,
journals, articles, electronic content or sources can be found. This list can appear
in a few locations, such as in the Course Guide (in the References section), at the end
of every topic or at the back of the module. You are encouraged to read and refer
to the suggested sources to obtain the additional information needed as well as to
enhance your overall understanding of the course.

PRIOR KNOWLEDGE
Students taking this course are required to have prior knowledge in courses
SBST1103 and SBST2103.

ASSESSMENT METHOD
Please refer to myINSPIRE.

REFERENCES
Dielman, T. E. (2004). Applied regression analysis: A second course in business
and economic statistics (4th ed.). Texas: Thomson Brooks/Cole.

Freund, J. E. (2003). Mathematical statistics (7th ed.). Prentice-Hall.

Mendenhall, W., Beaver, R. J., & Beaver, B. M. (2006). Introduction to
probability and statistics (12th ed.). California: Thomson Brooks/Cole.

Mann, P. S. (2005). Introductory statistics using technology (5th ed.).
Connecticut: John Wiley & Sons.

Walpole, R. E., Myers, R. H., Myers, S. L., & Ye, K. (2006). Probability and
statistics for engineers and scientists (8th ed.). Prentice-Hall.

Mendenhall, W., & Sincich, T. (2003). A second course in statistics: Regression
analysis (6th ed.). Prentice-Hall.


TAN SRI DR ABDULLAH SANUSI (TSDAS) DIGITAL LIBRARY
The TSDAS Digital Library has a wide range of print and online resources for the
use of its learners. This comprehensive digital library, which is accessible through
the OUM portal, provides access to more than 30 online databases comprising
e-journals, e-theses, e-books and more. Examples of databases available are
EBSCOhost, ProQuest, SpringerLink, Books24x7, InfoSci Books, Emerald
Management Plus and Ebrary Electronic Books. As an OUM learner, you are
encouraged to make full use of the resources available through this library.



Topic 1 Chi-Square and F Distribution, and Their Applications
LEARNING OUTCOMES
By the end of this topic, you should be able to:
1. Explain the properties of Chi-Square and F distributions;
2. Use Chi-Square and F distribution tables;
3. Perform hypothesis tests on the variance of a single population and on the
variances of two populations; and
4. Construct the confidence interval for comparing variances of two
populations.

INTRODUCTION
Previously, we have been exposed to several methods of testing and estimating a
population mean. Similarly, we may be interested in making inferences about the
variability of a population; the appropriate parameter in this case is the
population variance, $\sigma^2$. There are several reasons why it is important to test
hypotheses concerning the variances of populations, and inferences on variance are
applied in daily life. For example, a quality control engineer needs to monitor the
consistency of the products manufactured by a factory to ensure that the
products meet the required specifications. One method of checking consistency
is to calculate the variance of the size, weight or volume of
the product. If the variation in these measurements is large, more
products will fall outside the specification limits. In finance, investors
use the variation in the returns from their portfolios of stocks, bonds or any other type of
investment as a measure of uncertainty and risk. With this measure, investors
are able to reduce the risks of investment.

Two types of distributions will be used: the Chi-Square distribution (for the
variance of a single population) and the F distribution (for comparing the variances
of two populations). Both distributions are non-symmetrical and skewed to the right.

1.1 CHI-SQUARE DISTRIBUTION

The Chi-Square distribution is used for testing hypotheses on a single
population variance $\sigma^2$ or population standard deviation $\sigma$, and for constructing their
confidence intervals. Figure 1.1 displays the graph of the Chi-Square distribution.

Figure 1.1: Chi-Square distribution

1.1.1 Properties of Chi-Square Distribution

SELF-CHECK 1.1

Prior to this, the Central Limit Theorem and various theorems explaining
the properties of the sample mean $\bar{x}$ have been discussed. Can you recall
what the properties are?

The values of the sample mean $\bar{x}$ and sample variance $s^2$ differ from one sample to
another. It is also important to know the properties of the sample variance. The
following are several theorems explaining the properties of the sample variance $s^2$.


Theorem 1
Suppose a random sample $X_1, X_2, \ldots, X_n$ of size n is chosen from a normal
distribution with mean $\mu$ and variance $\sigma^2$. The sample mean and variance can be
computed as follows:

Sample mean: $\bar{X} = \dfrac{\sum_{i=1}^{n} X_i}{n}$

Sample variance: $s^2 = \dfrac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$

Both $\bar{X}$ and $s^2$ are random variables, so $\dfrac{(n-1)s^2}{\sigma^2}$
is also a random variable, and it follows a Chi-Square distribution with v = n - 1
degrees of freedom. Let the Greek symbol $\chi^2$ (pronounced Chi-Square)
represent this random variable; thus we have

$\chi^2 = \dfrac{(n-1)s^2}{\sigma^2} \sim \chi^2(n-1)$

Based on Figure 1.1, $\chi^2$ takes only non-negative values, starting from zero on the horizontal
axis.
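Theorem 1 can be checked informally by simulation. The sketch below is an illustration only, not part of the module: the sample size, mean, standard deviation, seed and number of repetitions are arbitrary assumptions, and the helper name `scaled_sample_variance` is hypothetical. It draws many normal samples, computes $(n-1)s^2/\sigma^2$ for each, and compares the simulated mean and variance with the values v and 2v expected for a Chi-Square distribution with v = n - 1 degrees of freedom.

```python
import random
import statistics


def scaled_sample_variance(n, mu, sigma, rng):
    """Draw one normal sample of size n and return (n - 1) * s^2 / sigma^2."""
    sample = [rng.gauss(mu, sigma) for _ in range(n)]
    s2 = statistics.variance(sample)  # sample variance with divisor n - 1
    return (n - 1) * s2 / sigma ** 2


# Illustrative settings (assumptions, not taken from the module).
rng = random.Random(2024)
n, mu, sigma = 10, 50.0, 4.0

values = [scaled_sample_variance(n, mu, sigma, rng) for _ in range(20000)]
sim_mean = statistics.fmean(values)
sim_var = statistics.variance(values)

# For v = n - 1 = 9 degrees of freedom, theory gives mean 9 and variance 18.
print(f"simulated mean     = {sim_mean:.3f} (theory: {n - 1})")
print(f"simulated variance = {sim_var:.3f} (theory: {2 * (n - 1)})")
```

The simulated values should be close to, but not exactly equal to, the theoretical ones, because they are based on a finite number of random samples.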

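Theorem 2
A random variable X is said to follow a Chi-Square distribution if and only if its
probability density function is given by

$f(x) = \begin{cases} \dfrac{1}{2^{v/2}\,\Gamma(v/2)}\, x^{v/2 - 1} e^{-x/2}, & x > 0 \\ 0, & \text{otherwise} \end{cases}$

where v is the number of degrees of freedom and $\Gamma$ is the gamma function; when v is
even, $\Gamma(v/2) = \left(\dfrac{v}{2} - 1\right)!$.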
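As a quick numerical illustration of Theorem 2, the sketch below evaluates the density directly from the formula and integrates it with a simple midpoint rule. The function names `chi2_pdf` and `integrate`, the choice v = 4 and the integration limits are assumptions made for this example. The total area should be close to 1 and the mean close to v.

```python
import math


def chi2_pdf(x, v):
    """Chi-Square density with v degrees of freedom, as written in Theorem 2."""
    if x <= 0:
        return 0.0
    return x ** (v / 2 - 1) * math.exp(-x / 2) / (2 ** (v / 2) * math.gamma(v / 2))


def integrate(f, a, b, steps=100000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))


v = 4
# The upper limit 100 is an arbitrary cut-off; the tail beyond it is negligible.
area = integrate(lambda x: chi2_pdf(x, v), 0.0, 100.0)
mean = integrate(lambda x: x * chi2_pdf(x, v), 0.0, 100.0)

print(f"area under f(x) = {area:.5f}")
print(f"mean of X       = {mean:.5f}")
```

`math.gamma` is used instead of a factorial so that the same function also works for odd v, where v/2 is not an integer.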


The graph of f(x) for various values of v is shown in Figure 1.2 below:

Figure 1.2

The larger the value of v, the flatter the density curve, which remains skewed to the right.
Using v = 1, 2, 3 and 4, the graph of f(x) versus x is shown in Figure 1.2. What
can be observed from this figure? Give your comment.
(a) A few specific properties are:
(i) The distribution is continuous;
(ii) There is only one parameter, v; and
(iii) A random variable X that follows a $\chi^2$ distribution with parameter v
can be written as $X \sim \chi^2(v)$.
(b) Other properties of a Chi-Square distribution:
(i) if X is distributed as $\chi^2(v)$, then its mean is E[X] = v and its variance is
Var[X] = 2v;

(ii) if $X_1$ and $X_2$ are independent random variables with
$X_1 \sim \chi^2(v_1)$ and $X_2 \sim \chi^2(v_2)$, then $Y = X_1 + X_2$ is distributed as $\chi^2(v_1 + v_2)$,
i.e. $Y = X_1 + X_2 \sim \chi^2(v_1 + v_2)$.

(iii) if $Z_1, Z_2, \ldots, Z_n$ are n independent random variables from the standard
normal distribution N(0,1), then
- $Z_i^2 \sim \chi^2(1)$, i = 1, 2, …, n
- $Y_2 = Z_i^2 + Z_j^2 \sim \chi^2(1 + 1)$, i, j = 1, 2, …, n with i ≠ j
- $Y_n = Z_1^2 + Z_2^2 + \cdots + Z_n^2 \sim \chi^2(1 + 1 + \cdots + 1 = n)$


Worked Example 1.1

Given independent random variables $X_1 \sim \chi^2(4)$ and $X_2 \sim \chi^2(6)$, determine
the distribution of the random variable $Y = X_1 + X_2$. Find the mean and variance of
the distribution of Y.

Answer:
Using property (b)(ii) of the Chi-Square distribution, $Y = X_1 + X_2$ is distributed
as $\chi^2(4 + 6) = \chi^2(10)$. Hence, E[Y] = 10 and Var[Y] = 2(10) = 20.
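The result of Worked Example 1.1 can be illustrated with a short simulation. This is a sketch under stated assumptions: each Chi-Square draw is generated as a sum of squared standard normal values (property (b)(iii)), and the seed, the number of repetitions and the helper name `chi2_draw` are choices made for the example only.

```python
import random
import statistics


def chi2_draw(v, rng):
    """One draw from a Chi-Square(v) distribution, built as a sum of v squared N(0,1) values."""
    return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(v))


rng = random.Random(11)

# Y = X1 + X2 with X1 ~ Chi-Square(4) and X2 ~ Chi-Square(6), drawn independently.
y_values = [chi2_draw(4, rng) + chi2_draw(6, rng) for _ in range(20000)]

mean_y = statistics.fmean(y_values)
var_y = statistics.variance(y_values)

# Theory for Chi-Square(10): mean 10 and variance 20.
print(f"simulated E[Y]   = {mean_y:.3f}")
print(f"simulated Var[Y] = {var_y:.3f}")
```

The simulated mean and variance should lie near 10 and 20, which is what a Chi-Square distribution with 4 + 6 = 10 degrees of freedom predicts.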

SELF-CHECK 1.2
Given independent random variables $X \sim \chi^2(2)$ and $Y \sim \chi^2(3)$,
determine the distribution of the random variable T = X + Y.

Worked Example 1.2

If a random variable X is distributed as $N(\mu, \sigma^2)$, identify the distribution of the
random variable $Y = \dfrac{(X - \mu)^2}{\sigma^2}$ and show that the mean and variance of Y are
$\mu_Y = 1$ and $\sigma_Y^2 = 2$, respectively.

Answer:
Since X is distributed as $N(\mu, \sigma^2)$, $Z = \dfrac{X - \mu}{\sigma}$ is distributed as
standard normal, N(0,1), and $Y = Z^2 = \dfrac{(X - \mu)^2}{\sigma^2}$ follows $\chi^2(1)$ (i.e. the
Chi-Square distribution with 1 degree of freedom), as in property (b)(iii). Hence,
from property (b)(i), E[Y] = v = 1 and Var[Y] = 2v = 2(1) = 2.


EXERCISE 1.1
The following are 100 values of $Y = \sum_{i=1}^{4} Z_i^2$, each calculated from a set of
$Z_1, Z_2, Z_3, Z_4$ drawn independently from the standard normal population N(0,1):

(a) Build a histogram for Y.
(b) Plot a scatter diagram for f(X), where $X \sim \chi^2(v = 4)$, versus Y. Give
your comment.
(Write down your answer on a separate sheet)

3.472 6.472 8.347 13.025 1.483 7.832 0.772 5.449 1.037 4.744
4.940 2.186 2.920 1.083 3.047 5.627 4.091 1.031 4.532 1.033
3.146 4.004 2.685 4.379 1.510 0.964 1.519 4.668 12.723 2.018
6.018 3.820 4.900 3.300 3.147 5.741 6.613 9.386 4.874 9.775
5.290 6.854 8.992 3.330 2.574 0.611 0.870 1.152 0.738 8.630
5.233 0.579 1.653 1.237 8.484 3.643 2.118 5.813 5.168 4.255
1.079 3.145 4.541 2.052 6.846 0.570 0.476 2.151 0.391 0.758
3.700 2.476 2.680 0.756 3.549 2.694 10.884 1.630 2.392 3.084
2.577 4.354 3.785 2.232 1.348 1.840 6.208 10.938 2.217 1.264
1.330 1.808 1.642 3.434 3.596 4.687 2.650 6.203 1.830 4.865

1.1.2 Chi-Square Distribution Table


The Chi-Square distribution $\chi^2(v)$ table is built from the degrees of
freedom v = n - 1 and the upper-tail probability $\alpha$. The tabulated values $\chi_c^2$ satisfy the equation:

$\Pr(\chi^2 \ge \chi_c^2) = \alpha, \quad 0 \le \alpha \le 1$

To facilitate the use of this table, we call $\alpha$ "small" if its value is in
the range $0 \le \alpha \le 0.1$ and "large" if its value is in the complementary
interval $0.9 \le \alpha \le 1$. Usually, the $\chi_c^2$ values are big for small $\alpha$, and
small for large $\alpha$. For example:


Figure 1.3 (a) Figure 1.3 (b)

where $\Pr(\chi^2(v) \ge \chi_c^2) = \alpha$ and
$\chi_c^2$ = critical value.

Table 1.1: The Relationship between Degrees of Freedom and $\chi_c^2$

v     Large $\alpha$   $\chi_c^2$    Small $\alpha$   $\chi_c^2$
5     0.95             1.145         0.05             11.07
5     0.975            0.831         0.025            12.833
5     0.995            0.412         0.005            16.750
20    0.95             10.851        0.05             31.410
25    0.975            13.120        0.025            40.646
30    0.995            13.787        0.005            53.672

Based on Table 1.1 above, it can be seen that for the same degrees of freedom
(v = 5), the $\chi_c^2$ value approaches zero as the value of $\alpha$ increases in the "large $\alpha$"
case. On the other hand, the value of $\chi_c^2$ increases and approaches $\infty$ as the value of
$\alpha$ decreases in the "small $\alpha$" case. This is an important property, especially when
$\alpha$ is used as a significance level in hypothesis testing or confidence interval
construction for the population variance, $\sigma^2$.

The following Table 1.2 shows some of the $\chi_c^2$ values for several $\alpha$ values and the
corresponding degrees of freedom v, where v = n - 1. This table is constructed in
a similar manner to the t-table, which contains the t values. The top row of the table
gives the area to the right of the critical point, as in Figure 1.4, and each row
corresponds to a number of degrees of freedom. Let us check how Table 1.2 is used.


Table 1.2: Relationship between v, $\alpha$ and $\chi^2$

Degrees of    Probability $\alpha$ (area under the $\chi^2$ distribution curve to the right of the critical point)
freedom, v    0.975    0.95     0.90     0.10      0.05      0.025
1
2
.
.
.
11                     4.575    5.578    17.275    19.675
12                     5.226    6.304    18.549    21.026
13                     5.892    7.042    19.812    22.362
14                     6.571    7.790    21.064    23.685
.
.
.

If v = 12 and $\alpha$ = 0.05, then $\chi_{0.05}^2$ = 21.026, and if v = 12 and $\alpha$ = 0.95,
then $\chi_{0.95}^2$ = 5.226. Since the Chi-Square distribution is non-symmetrical, the values
on the left-hand side of the distribution cannot be obtained by symmetry; they are read from
the table in the same way as the values on the right-hand side of the distribution.
Based on the two values $\chi_{0.05}^2$ = 21.026 and $\chi_{0.95}^2$ = 5.226, it is known that

$\Pr(\chi^2 > 21.026) = 0.05$
$\Pr(\chi^2 > 5.226) = 0.95$

with 12 degrees of freedom. Based on these two statements,

$\Pr(5.226 < \chi^2 < 21.026) = 0.95 - 0.05 = 0.90$

The values and areas for the Chi-Square distribution with v = 12 degrees of freedom
are shown in Figure 1.4(a). Since $\Pr(\chi^2 > 5.226) = 0.95$, it follows that $\Pr(\chi^2 < 5.226) = 0.05$.
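The table look-ups above can be cross-checked numerically. The following sketch is one possible approach and not the way the printed table was produced: it computes the upper-tail probability for integer degrees of freedom with a standard recurrence for the regularised incomplete gamma function, then finds the critical value by bisection. The function names `chi2_upper_tail` and `chi2_critical` are hypothetical.

```python
import math


def chi2_upper_tail(x, v):
    """Pr(X > x) for X ~ Chi-Square(v), with v a positive integer."""
    if x <= 0:
        return 1.0
    y = x / 2.0
    # Starting values: Q(1/2, y) = erfc(sqrt(y)) for odd v, Q(1, y) = exp(-y) for even v.
    if v % 2:
        a, q = 0.5, math.erfc(math.sqrt(y))
    else:
        a, q = 1.0, math.exp(-y)
    # Recurrence Q(a + 1, y) = Q(a, y) + y**a * exp(-y) / Gamma(a + 1), up to a = v / 2.
    while a < v / 2.0 - 1e-9:
        q += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return q


def chi2_critical(alpha, v):
    """Point c with Pr(X > c) = alpha, found by bisection (0 < alpha < 1)."""
    lo, hi = 0.0, 1.0
    while chi2_upper_tail(hi, v) > alpha:  # widen the bracket until it contains c
        hi *= 2.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if chi2_upper_tail(mid, v) > alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


upper = chi2_critical(0.05, 12)
lower = chi2_critical(0.95, 12)
between = chi2_upper_tail(lower, 12) - chi2_upper_tail(upper, 12)
print(f"critical values for v = 12: {lower:.3f} and {upper:.3f}")
print(f"Pr(lower < X < upper) = {between:.3f}")
```

For v = 12 the two printed critical values should agree with the table entries 5.226 and 21.026 to about three decimal places, and the probability between them should be 0.90.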


Figure 1.4 (a)

It is important to note that a $\chi^2$ point represents a position on the horizontal
axis. As such, all points and probabilities (areas) in the table satisfy

$\Pr(\chi^2(v) \ge \chi_c^2) = \alpha$

Because $\alpha$ is the area to the right of the point, the points
decrease as the value of $\alpha$ increases and increase as $\alpha$ decreases. As such, a
general statement can be written,

$\Pr(\chi_{1-\alpha/2}^2 \le X \le \chi_{\alpha/2}^2) = 1 - \alpha$

For example, when $\alpha$ = 0.05,

$\Pr(\chi_{0.975}^2 \le X \le \chi_{0.025}^2) = 0.95$

Refer to the properties in Figure 1.4 (b).

Figure 1.4 (b)


Worked Example 1.3

Suppose that n = 20. Determine $\chi_{0.975}^2$ and $\chi_{0.025}^2$.

Answer:
When n = 20, v = n - 1 = 20 - 1 = 19 degrees of freedom. From the distribution
table,
$\chi_{0.975}^2$ = 8.907
$\chi_{0.025}^2$ = 32.852

Figure 1.5 clearly depicts that the area on the right-hand side of the point 8.907 is
0.975 and the area on the right-hand side of the point 32.852 is 0.025.

Figure 1.5


Worked Example 1.4

Suppose that X and Y are independent random variables distributed as $\chi^2(2)$ and
$\chi^2(6)$ respectively. Determine the probability of

(a) X > 7.38; (b) X < 0.103; (c) Y < 22.46; and (d) X + Y > 2.18

Answer:
(a) It is known that X is distributed as $\chi^2(2)$. From the table, an area of 0.025 of the
distribution lies to the right of the point 7.38 (column
$\alpha$ = 0.025, row v = 2).
Therefore, Pr(X > 7.38) = 0.025

(b) Pr(X > 0.103) = 0.95 (based on column $\alpha$ = 0.95, row v = 2)
Therefore, Pr(X < 0.103) = 1 - 0.95 = 0.05

(c) $Y \sim \chi^2(6)$. Hence, Pr(Y > 22.46) = 0.001 (from column $\alpha$ = 0.001, row v = 6)
Therefore, Pr(Y < 22.46) = 1 - 0.001 = 0.999

(d) $X + Y \sim \chi^2(2 + 6) = \chi^2(8)$. Based on the table, with column $\alpha$ = 0.975 and row v = 8,
Therefore, Pr(X + Y > 2.18) = 0.975
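All the degrees of freedom in Worked Example 1.4 are even (2, 6 and 8), and in that case the upper-tail probability has a simple closed form, $\Pr(X > x) = e^{-x/2} \sum_{j=0}^{v/2-1} \dfrac{(x/2)^j}{j!}$. The sketch below uses this formula as an independent check of the table look-ups; the function name `chi2_upper_tail_even` is hypothetical, and the formula does not apply to odd v.

```python
import math


def chi2_upper_tail_even(x, v):
    """Pr(X > x) for X ~ Chi-Square(v) when v is a positive even integer."""
    if v <= 0 or v % 2:
        raise ValueError("this closed form needs a positive even v")
    if x <= 0:
        return 1.0
    y = x / 2.0
    term, total = 1.0, 1.0
    for j in range(1, v // 2):
        term *= y / j          # builds (x/2)^j / j! step by step
        total += term
    return math.exp(-y) * total


p_a = chi2_upper_tail_even(7.38, 2)          # (a) Pr(X > 7.38)
p_b = 1 - chi2_upper_tail_even(0.103, 2)     # (b) Pr(X < 0.103)
p_c = 1 - chi2_upper_tail_even(22.46, 6)     # (c) Pr(Y < 22.46)
p_d = chi2_upper_tail_even(2.18, 8)          # (d) Pr(X + Y > 2.18), using 2 + 6 = 8

for label, p in (("a", p_a), ("b", p_b), ("c", p_c), ("d", p_d)):
    print(f"({label}) {p:.4f}")
```

The four results should round to 0.025, 0.05, 0.999 and 0.975, matching the answers read from the table.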

EXERCISE 1.2

Explain the meaning of Figure 1.6(a) and Figure 1.6(b):

Figure 1.6(a) Figure 1.6(b)


1.1.3 Hypothesis Testing on Population Variance


In most cases, the population variance is a parameter whose value is rarely
known. We are interested in testing whether a sample taken from a normal (parent)
population is consistent with a population variance $\sigma_0^2$, where $\sigma_0^2$ is a given or
assumed value. If a random sample of size n is drawn from a normal population with
variance $\sigma^2$, it can be shown that

$\dfrac{(n-1)S^2}{\sigma^2} \sim \chi^2(n-1)$

This means that $\dfrac{(n-1)S^2}{\sigma_0^2}$ follows a $\chi^2(n-1)$ distribution when the null
hypothesis $\sigma^2 = \sigma_0^2$ is true, so it can be used as the test statistic.

Statement 1
To test whether a random sample of size n with sample variance $S^2$ was drawn
from a normal population with variance $\sigma_0^2$, we use the statistic
$\chi^2 = \dfrac{(n-1)S^2}{\sigma_0^2}$,
which is distributed as $\chi^2(n-1)$ when the null hypothesis is true.

The following Figure 1.7 shows the critical regions for testing $H_0: \sigma^2 = \sigma_0^2$ versus

(a) $H_1: \sigma^2 > \sigma_0^2$; (b) $H_1: \sigma^2 < \sigma_0^2$; and (c) $H_1: \sigma^2 \ne \sigma_0^2$.

Figure 1.7(a) Figure 1.7(b) Figure 1.7(c)


Worked Example 1.5 (Rejection Region)

(a) Suppose that we want the rejection region for a one-sided right-hand
(small $\alpha$) hypothesis test at the 5% significance level ($\alpha = 0.05$)
with sample size n = 10.

It is known that $\Pr(\chi^2 \ge \chi_{5\%}^2(9)) = 0.05$; hence, from the table, the critical
value is $\chi_{5\%}^2(9) = 16.92$. Refer to Figure 1.8 below:

Figure 1.8

(b) Find the critical region for a one-sided left-hand (large $\alpha$) test at the 1%
level ($\alpha = 0.01$) with sample size n = 22.

We know that $\Pr(\chi^2 \ge \chi_{99\%}^2(21)) = 1 - \alpha = 0.99$; hence, the critical value is
$\chi_{99\%}^2(21) = 8.897$. Refer to Figure 1.9 below:

Figure 1.9


(c) Determine the critical values at the $\alpha$ = 0.05 level and sample size n = 21 for a
two-sided test.

(i) Step I: for a two-sided test, divide $\alpha$ in half, i.e. $\alpha/2 = 0.025$. Refer
to Figure 1.10:

Figure 1.10

(ii) Step II; refer to Figure 1.11 (a) and (b):

Figure 1.11 (a)

Figure 1.11 (b)


The following Table 1.3 gives the rejection region for various combinations of
null hypothesis, H0 and alternative hypothesis, H1.

Table 1.3: Rejection Region for Various Combinations of H0 and H1

Test statistic (all cases): $\chi^2 = \dfrac{(n-1)s^2}{\sigma_0^2}$

Null Hypothesis                  Alternative Hypothesis                                 Rejection Region
$H_0: \sigma^2 = \sigma_0^2$     (a) One-sided (right): $H_1: \sigma^2 > \sigma_0^2$    (a) $\chi^2 > \chi_{\alpha, n-1}^2$
$H_0: \sigma^2 = \sigma_0^2$     (b) One-sided (left): $H_1: \sigma^2 < \sigma_0^2$     (b) $\chi^2 < \chi_{1-\alpha, n-1}^2$
$H_0: \sigma^2 = \sigma_0^2$     (c) Two-sided: $H_1: \sigma^2 \ne \sigma_0^2$          (c) $\chi^2 > \chi_{\alpha/2, n-1}^2$ or $\chi^2 < \chi_{1-\alpha/2, n-1}^2$
$H_0: \sigma^2 \ge \sigma_0^2$   $H_1: \sigma^2 < \sigma_0^2$                           $\chi^2 < \chi_{1-\alpha, n-1}^2$
$H_0: \sigma^2 \le \sigma_0^2$   $H_1: \sigma^2 > \sigma_0^2$                           $\chi^2 > \chi_{\alpha, n-1}^2$

Note: $\chi_{\alpha,(n-1)}^2 = \chi_{\alpha}^2(n-1)$
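When a table is not at hand, the critical values in Table 1.3 can be approximated. The sketch below uses the Wilson-Hilferty approximation, $\chi_{\alpha}^2(v) \approx v\left(1 - \dfrac{2}{9v} + z_{\alpha}\sqrt{\dfrac{2}{9v}}\right)^3$, where $z_{\alpha}$ is the standard normal point with upper-tail area $\alpha$. This approximation is not part of the module and its results differ slightly from the table values; the function names `chi2_critical_wh` and `rejection_region` are hypothetical.

```python
from statistics import NormalDist


def chi2_critical_wh(alpha, v):
    """Approximate point with upper-tail area alpha for Chi-Square(v) (Wilson-Hilferty)."""
    z = NormalDist().inv_cdf(1 - alpha)  # standard normal point with upper-tail area alpha
    c = 2.0 / (9.0 * v)
    return v * (1.0 - c + z * c ** 0.5) ** 3


def rejection_region(kind, alpha, n):
    """Return (lower, upper) critical values; None marks a side that is not used."""
    v = n - 1
    if kind == "right":
        return (None, chi2_critical_wh(alpha, v))
    if kind == "left":
        return (chi2_critical_wh(1 - alpha, v), None)
    if kind == "two-sided":
        return (chi2_critical_wh(1 - alpha / 2, v), chi2_critical_wh(alpha / 2, v))
    raise ValueError("kind must be 'right', 'left' or 'two-sided'")


# Cases (a), (b) and (c) of Worked Example 1.5.
print(rejection_region("right", 0.05, 10))      # reject H0 if the statistic exceeds the upper value
print(rejection_region("left", 0.01, 22))       # reject H0 if the statistic is below the lower value
print(rejection_region("two-sided", 0.05, 21))  # reject H0 outside the (lower, upper) interval
```

For the first two cases the approximate values come out within a few hundredths of the table values 16.92 and 8.897 quoted in Worked Example 1.5; exact work should still use the table.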

SELF-CHECK 1.3

Obtain the rejection region for these tests:

(a) One-sided (left) with $\alpha$ = 0.05 and sample size, n = 31.

(b) One-sided (right) with $\alpha$ = 0.1 and sample size, n = 40.

(c) Two-sided with $\alpha$ = 0.05 and sample size, n = 15.

Discuss answers with your classmates.


Worked Example 1.6


A normal population is claimed to have a variance of 9.0. A random sample of size 9
drawn from this population gives a sample variance of 8.01. Determine
whether the population variance is 9.0. Test at the 5% level.

Answer:
There are a few steps to be taken to answer this question:

Step 1: Determine the parameter of interest


The population parameter is the population variance, $\sigma^2$. Our result will
depend on the value of the sample variance, that is, 8.01.

Step 2: Gather all available information

The population is normally distributed with mean $\mu$ (not included in the test)
and assumed variance for testing, $\sigma_0^2 = 9.0$.

Step 3: Construct hypothesis statements

The population variance is given as 9.0. Since the statement does not ask us to
test for larger or smaller, a two-sided test will be carried out, that is:
$H_0: \sigma^2 = 9.0$
$H_1: \sigma^2 \neq 9.0$

Step 4: Determine the significance level and rejection region

The two-sided test will be performed at the 5% level. The Chi-Square
distribution has n – 1 = 9 – 1 = 8 degrees of freedom. From the table, the
critical values are $\chi^2_{2.5\%}(8) = 17.53$ and $\chi^2_{97.5\%}(8) = 2.180$. We will reject the
null hypothesis when
$\chi^2 < \chi^2_{97.5\%}(8)$ or $\chi^2 > \chi^2_{2.5\%}(8)$

Step 5: Calculate the test statistic

Test statistic: $\chi^2 = \dfrac{(n-1)S^2}{\sigma_0^2} = \dfrac{8 \times 8.01}{9} = 7.12$

Step 6: Determine the result


Since the value of the test statistic, $\chi^2 = 7.12$, is within the acceptance region, that is,
2.180 < 7.12 < 17.53, H0 is not rejected.


Step 7: State the conclusion


As such, at the 5% level there is insufficient evidence to reject the claim that the sample is from a normal
population with population variance $\sigma^2 = 9.0$. Take a look at the figure below:

Figure 1.12
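
The arithmetic in Steps 4 to 6 of Worked Example 1.6 can be reproduced with a short script. This is only an illustrative sketch: the function name `chi2_variance_test` is an assumption, and the critical values 2.180 and 17.53 are simply the table values quoted in Step 4, passed in by hand.

```python
def chi2_variance_test(n, sample_var, sigma0_sq, lower_crit, upper_crit):
    """Two-sided Chi-Square test of H0: sigma^2 = sigma0_sq.

    Returns the test statistic (n - 1) * s^2 / sigma0^2 and a flag that is
    True when the statistic falls in the rejection region.
    """
    stat = (n - 1) * sample_var / sigma0_sq
    reject = stat < lower_crit or stat > upper_crit
    return stat, reject


# Worked Example 1.6: n = 9, s^2 = 8.01, sigma0^2 = 9.0,
# table critical values 2.180 (lower) and 17.53 (upper).
stat, reject = chi2_variance_test(9, 8.01, 9.0, 2.180, 17.53)
print(round(stat, 2), reject)   # prints: 7.12 False
```

A value of `False` means the statistic lies inside the acceptance region, so H0 is not rejected.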

SELF-CHECK 1.4

A random sample of exam scores of size 26 of Year 6 students at a
primary school has an average total score of 70.57 with standard
deviation 16.1. If the standard deviation for the total exam score of all
Year 6 students is 15 and based on information from the random sample:

(a) Give your opinion on the possible value change for the standard
deviation of the population total exam score; and

(b) Write down the appropriate H0 and H1.


Let us try this exercise.

EXERCISE 1.3

1. Past experience has shown that a certain population is normally
distributed with variance 42. A random sample of size 16 was drawn,
giving a sample variance of 18.4. Is there significant evidence
to suggest that the current population variance is less than 42?
(Please answer on separate sheets.)

2. Fill in the properties of a Chi-Square distribution.


| Characteristics | Chi-Square Distribution |
|---|---|
| (a) Value | |
| (b) Distribution | |
| (c) Distribution and relationship with sample | |

1.2 F DISTRIBUTION
The F distribution is required for studies involving two or more independent
populations. The sample variance $s_1^2$ will be calculated and used as a point
estimate for the first population variance $\sigma_1^2$, and the second sample variance $s_2^2$
will be calculated and used as a point estimate for the second population variance $\sigma_2^2$.

1.2.1 Properties of F Distribution


The F distribution is an important distribution in statistics. This distribution has two
different degrees of freedom, given as $v_1 = n_1 - 1$ and $v_2 = n_2 - 1$. Although F
distributions share several common properties, each pair of v1 and v2 produces a
different curve. The common type of curve is displayed in Figure 1.13. Based on
the figure, the F distribution is non-symmetrical and skewed to
the right (similar to the Chi-Square distribution). The f value in the figure is
non-negative since the quantities $s_1^2, s_2^2, \sigma_1^2, \sigma_2^2$ required in the f statistic are always
positive. Hypothesis testing on $\sigma_1^2$ and $\sigma_2^2$ is done based on the F distribution. This
distribution is also important for Analysis of Variance (ANOVA), which will be
covered in Topic 2.

Figure 1.13

Whereas the Chi-Square distribution is used for testing a single population variance, the F
distribution is used for cases involving the variances of two
populations. It should be noted that the two populations in this case must be
independent and normally distributed. The F distribution is related to the
sampling distribution of normal populations. This distribution was introduced by Sir Ronald
A. Fisher, hence the name. In general, this distribution can be viewed as the
sampling distribution for the ratio of two independent Chi-Square random
variables, each divided by its respective degrees of freedom. This can be
shown using Theorem 3.

Theorem 3
Let U and V be two independent random variables having Chi-Square
distributions with v1 and v2 degrees of freedom, respectively. Then,
$$F = \frac{U/v_1}{V/v_2}$$
is a random variable that follows the F distribution with v1 and v2 degrees of
freedom, and its probability density function g(f) is
$$g(f) = \frac{\Gamma\left(\frac{v_1+v_2}{2}\right)}{\Gamma\left(\frac{v_1}{2}\right)\Gamma\left(\frac{v_2}{2}\right)}\left(\frac{v_1}{v_2}\right)^{v_1/2} f^{\frac{v_1}{2}-1}\left(1+\frac{v_1}{v_2}f\right)^{-\frac{v_1+v_2}{2}}, \quad f > 0$$
and g(f) = 0 otherwise.
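
Theorem 3 can be illustrated by simulation. The sketch below is an assumption-laden illustration rather than part of the module: it draws Chi-Square variables as Gamma variables with shape v/2 and scale 2 (a standard identity), forms the ratio in Theorem 3, and estimates a right-tail probability. The function names, the seed and the number of draws are arbitrary choices.

```python
import random


def simulate_f(v1, v2, rng):
    """One draw of F = (U / v1) / (V / v2) with U ~ chi-square(v1) and
    V ~ chi-square(v2), each generated as Gamma(shape = v / 2, scale = 2)."""
    u = rng.gammavariate(v1 / 2.0, 2.0)
    v = rng.gammavariate(v2 / 2.0, 2.0)
    return (u / v1) / (v / v2)


def tail_probability(v1, v2, threshold, draws=100_000, seed=12345):
    """Monte Carlo estimate of Pr(F > threshold) for F ~ F(v1, v2)."""
    rng = random.Random(seed)
    exceed = sum(simulate_f(v1, v2, rng) > threshold for _ in range(draws))
    return exceed / draws


# With v1 = v2 = 6, the tabulated 5% critical value is 4.284, so the
# estimated tail probability should come out close to 0.05.
print(tail_probability(6, 6, 4.284))
```

Because the estimate is based on random draws, it will differ slightly from 0.05; increasing `draws` narrows the gap.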


Several special properties:


(a) The distribution is continuous;
(b) The distribution has two parameters, v1 and v2;
(c) If a random variable X follows F distribution with v1 and v2 parameters, it
can be written as X ~ F(v1, v2); and
(d) The distribution is non-symmetrical and the amount of skewness (area) on
the right-hand side depends on v1 and v2.

EXERCISE 1.4
Sketch separately three graphs of function f (x) versus x using pairs of
(29,28), (19,6) and (6,6). Give your comment.
(Please answer on separate sheets.)

1.2.2 F Distribution Table


The table for the F distribution is prepared based on two degrees of freedom, which
are v1 = n1 – 1 and v2 = n2 – 1. The contents of the table are values of the F variable
satisfying the equation $\Pr(F > F_{v_1,v_2;\alpha}) = \alpha$.

The critical value of the F distribution lies on the right-hand side of the function
graph (refer to Figure 1.13). For each pair of v1 and v2, the first, second and third
rows in the F distribution table are critical values at the 0.05, 0.025 and 0.01
significance levels. To determine the critical values on the left-hand side, this
relationship is used:
$$F_{v_1,v_2;\alpha} = \frac{1}{F_{v_2,v_1;1-\alpha}}$$


To facilitate the usage of the table, we will observe several critical values of F
when:

(a) v1 is fixed and v2 varies, significance level is fixed:

| $v_1$ | $v_2$ | $F_{v_1,v_2;0.05}$ |
|---|---|---|
| 6 | 5 | 4.950 |
| 6 | 6 | 4.284 |
| 6 | 7 | 3.866 |
| 6 | 8 | 3.581 |

Comment: The F critical value decreases as v2 increases while v1 is fixed.

(b) v1 varies and v2 is fixed, significance level is fixed:

| $v_1$ | $v_2$ | $F_{v_1,v_2;0.05}$ |
|---|---|---|
| 5 | 6 | 4.387 |
| 6 | 6 | 4.284 |
| 7 | 6 | 4.207 |
| 8 | 6 | 4.147 |

Comment: The F critical value decreases as v1 increases while v2 is fixed.

(c) v1 and v2 are fixed, significance level varies:

| $v_1$ | $v_2$ | $F_{v_1,v_2;0.05}$ | $F_{v_1,v_2;0.025}$ | $F_{v_1,v_2;0.01}$ |
|---|---|---|---|---|
| 6 | 6 | 4.284 | 5.820 | 8.466 |
| 7 | 7 | 3.787 | 4.995 | 6.993 |
| 8 | 8 | 3.438 | 4.433 | 6.029 |
| 9 | 9 | 3.179 | 4.026 | 5.351 |

Comment: For any fixed pair v1 and v2, the F critical value increases when the
significance level decreases.


(d) v1 is varied and v2 is fixed, significance level varies:

| $v_1$ | $v_2$ | $F_{v_1,v_2;0.05}$ | $F_{v_1,v_2;0.025}$ | $F_{v_1,v_2;0.01}$ |
|---|---|---|---|---|
| 6 | 6 | 4.284 | 5.820 | 8.466 |
| 7 | 6 | 4.207 | 5.695 | 8.260 |
| 8 | 6 | 4.147 | 5.600 | 8.102 |
| 9 | 6 | 4.099 | 5.523 | 7.976 |

Comment: The F critical value also increases when the significance level
decreases, v1 varies and v2 is fixed.

(e) v1 is fixed and v2 changes, significance level changes:

| $v_1$ | $v_2$ | $F_{v_1,v_2;0.05}$ | $F_{v_1,v_2;0.025}$ | $F_{v_1,v_2;0.01}$ |
|---|---|---|---|---|
| 6 | 6 | 4.284 | 5.820 | 8.466 |
| 6 | 7 | 3.866 | 5.119 | 7.191 |
| 6 | 8 | 3.581 | 4.652 | 6.371 |
| 6 | 9 | 3.374 | 4.320 | 5.802 |

Comment: F values increase when the significance level decreases, v1 is
fixed and v2 changes. Using the special property of the F distribution, that is,
$F_{v_1,v_2;\alpha} = \dfrac{1}{F_{v_2,v_1;1-\alpha}}$, we can find the values of the F distribution at $\alpha$ = 0.95, 0.975
and 0.99.

Worked Example 1.7


Determine the value of $F_{10,11;\,\alpha = 0.95}$.

Answer:
Using the property,
$$F_{10,11;0.95} = \frac{1}{F_{11,10;1-0.95}} = \frac{1}{F_{11,10;0.05}} = \frac{1}{2.943} = 0.3398$$

Note: $F_{v_1,v_2,\alpha} = F_{\alpha,v_1,v_2} = F_{\alpha}(v_1, v_2)$.

Thus, for example, $F_{3,16,0.05} = F_{0.05,3,16} = F_{0.05}(3,16)$
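
As a quick check of Worked Example 1.7, the reciprocal property can be written as a one-line helper. This is a minimal sketch; the function name `left_tail_f` is an assumption, and the right-tail value 2.943 is the table entry used in the worked example.

```python
def left_tail_f(right_tail_value_swapped_df):
    """Left-hand critical value F(v1, v2; 1 - alpha), given the tabulated
    right-hand value F(v2, v1; alpha) with the degrees of freedom swapped."""
    return 1.0 / right_tail_value_swapped_df


# Worked Example 1.7: F(10, 11; 0.95) = 1 / F(11, 10; 0.05) = 1 / 2.943
print(round(left_tail_f(2.943), 4))   # prints: 0.3398
```

Note that the degrees of freedom must be swapped before the table is consulted; the helper only performs the final division.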

EXERCISE 1.5
1. Determine the following values based on the F distribution table.
(a) $F_{0.05}(3,16)$ (b) $F_{0.05}(12,25)$
(c) $F_{0.01}(4,15)$ (d) $F_{0.05}(7,4)$
2. Determine the values of $\alpha$ (using the F distribution table) that satisfy the
equations below:
(a) $F_{\alpha}(6,14) = 3.50$ (b) $F_{\alpha}(10,32) = 2.93$
(c) $F_{\alpha}(24,38) = 1.81$ (d) $F_{\alpha}(2,24) = 5.61$
(Please answer on separate sheets.)

1.3 INTERVAL ESTIMATION OF VARIANCE FOR TWO POPULATIONS
If $S_1^2$ and $S_2^2$ are sample variances for two independent random samples of sizes n1
and n2 taken from normal populations with variances $\sigma_1^2$ and $\sigma_2^2$ respectively, then

$$F = \frac{\sigma_2^2 S_1^2}{\sigma_1^2 S_2^2}$$

is a random variable following an F distribution with n1 – 1 and n2 – 1 degrees of
freedom. Hence, we write:

$$\Pr\left(F_{1-\alpha/2,\,n_1-1,\,n_2-1} < \frac{\sigma_2^2 S_1^2}{\sigma_1^2 S_2^2} < F_{\alpha/2,\,n_1-1,\,n_2-1}\right) = 1 - \alpha$$


Figure 1.14

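The illustration of this equation is displayed in Figure 1.14 above.

Since $F_{1-\alpha/2,\,n_1-1,\,n_2-1} = \dfrac{1}{F_{\alpha/2,\,n_2-1,\,n_1-1}}$, then

$$\Pr\left(F_{1-\alpha/2}(v_1,v_2) < \frac{\sigma_2^2 S_1^2}{\sigma_1^2 S_2^2} < F_{\alpha/2}(v_1,v_2)\right) = 1 - \alpha$$

$$\Rightarrow \Pr\left(\frac{S_2^2}{S_1^2}F_{1-\alpha/2}(v_1,v_2) < \frac{\sigma_2^2}{\sigma_1^2} < \frac{S_2^2}{S_1^2}F_{\alpha/2}(v_1,v_2)\right) = 1 - \alpha$$

Taking reciprocals of each term, we obtain the following theorem:

Theorem 4
If $S_1^2$ and $S_2^2$ are the variances of independent random samples of sizes n1 and n2 from
normal populations, then

$$\frac{s_1^2}{s_2^2}\cdot\frac{1}{F_{\alpha/2,\,n_1-1,\,n_2-1}} < \frac{\sigma_1^2}{\sigma_2^2} < \frac{s_1^2}{s_2^2}\,F_{\alpha/2,\,n_2-1,\,n_1-1}$$

is defined as a (1 – α)100% confidence interval for $\dfrac{\sigma_1^2}{\sigma_2^2}$.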


Worked Example 1.8


In an experiment, the first sample with n1 = 10 has a standard deviation of s1 =
0.5. The second sample with n2 = 8 has a standard deviation of s2 = 0.7. Determine
the 98% confidence interval for $\dfrac{\sigma_1^2}{\sigma_2^2}$, which is the ratio of the population
variances.

Answer:
From the F distribution table, we obtain
$f_{0.01,9,7} = 6.72$ and $f_{0.01,7,9} = 5.61$ (for $\alpha = 0.02$).

As such,
$$\frac{(0.5)^2}{(0.7)^2}\cdot\frac{1}{6.72} < \frac{\sigma_1^2}{\sigma_2^2} < \frac{(0.5)^2}{(0.7)^2}\,(5.61),$$
that is, $0.076 < \dfrac{\sigma_1^2}{\sigma_2^2} < 2.862$.

Since the confidence interval contains the value 1, the data are consistent with a ratio
$\dfrac{\sigma_1^2}{\sigma_2^2} = 1$; that is, the claim $\sigma_1^2 = \sigma_2^2$ cannot be rejected at the 2% level.
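
The interval in Worked Example 1.8 follows directly from Theorem 4, and the calculation can be sketched in a few lines. The function name `variance_ratio_ci` and its parameter names are illustrative assumptions; the two F values are the table entries quoted in the answer.

```python
def variance_ratio_ci(s1, s2, f_upper_v1_v2, f_upper_v2_v1):
    """Confidence interval for sigma1^2 / sigma2^2 following Theorem 4.

    s1, s2           sample standard deviations
    f_upper_v1_v2    table value F(alpha/2; n1 - 1, n2 - 1)
    f_upper_v2_v1    table value F(alpha/2; n2 - 1, n1 - 1)
    """
    ratio = s1 ** 2 / s2 ** 2
    return ratio / f_upper_v1_v2, ratio * f_upper_v2_v1


# Worked Example 1.8: s1 = 0.5, s2 = 0.7, f(0.01; 9, 7) = 6.72, f(0.01; 7, 9) = 5.61
low, high = variance_ratio_ci(0.5, 0.7, 6.72, 5.61)
print(round(low, 3), round(high, 3))   # prints: 0.076 2.862
print(low < 1 < high)                  # prints: True
```

The final line checks whether the interval contains 1, which is the basis for the conclusion about equal variances.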

EXERCISE 1.6

Let $n_1 = 15$, $n_2 = 12$, $s_1 = 3.07$ and $s_2 = 0.80$. Construct a 98% confidence
interval for the ratio $\dfrac{\sigma_1^2}{\sigma_2^2}$.

(Please answer on separate sheets.)

1.4 COMPARING VARIANCES OF TWO POPULATIONS
Consider a test to check whether two samples taken from normal populations have
equal variance. Suppose that these two random samples are written as X1,....., Xn
and Y1,.....,Ym with sample sizes n and m respectively (where n does not
necessarily equal m) and with sample variances $S_X^2$ and $S_Y^2$.


Now, if both samples come from normal populations with equal variance $\sigma^2$, then

$$\frac{(n-1)S_X^2}{\sigma^2} \sim \chi^2(n-1) \quad \text{and} \quad \frac{(m-1)S_Y^2}{\sigma^2} \sim \chi^2(m-1)$$

Hence,

$$\frac{(n-1)S_X^2}{(n-1)\sigma^2} \Bigg/ \frac{(m-1)S_Y^2}{(m-1)\sigma^2}$$

is an F(n – 1, m – 1) variable.

To test whether two random samples of sizes n and m with sample variances $S_X^2$
and $S_Y^2$ respectively are taken from normal populations with equal variance, we
use the statistic $F = \dfrac{\hat{\sigma}_X^2}{\hat{\sigma}_Y^2}$, which is distributed as F(n – 1, m – 1) when the null hypothesis is
true (that is, the populations have equal variance), with unbiased estimated variances:

$$\hat{\sigma}_X^2 = \frac{(n-1)S_X^2}{n-1} \quad \text{and} \quad \hat{\sigma}_Y^2 = \frac{(m-1)S_Y^2}{m-1}$$

respectively.

SELF-CHECK 1.5

Fill in the properties of F distribution in the table below:

| Properties | Chi-Square Distribution |
|---|---|
| 1. Value | |
| 2. Distribution | |
| 3. Distribution and relationship with sample | |
| 4. Application | |


Table 1.4 summarises the test on variances comparison.

Table 1.4: Variance Comparison Test


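| Null Hypothesis | Alternative Hypothesis | Critical Region | Test Statistic |
|---|---|---|---|
| $H_0: \sigma_1^2 = \sigma_2^2 \Rightarrow H_0: \dfrac{\sigma_1^2}{\sigma_2^2} = 1$ | (a) One-sided test (right): $H_1: \sigma_1^2 > \sigma_2^2 \Rightarrow H_1: \dfrac{\sigma_1^2}{\sigma_2^2} > 1$ | $\dfrac{S_1^2}{S_2^2} > f_{\alpha}(v_1, v_2)$ | $F = \dfrac{S_1^2}{S_2^2}$ |
| as above | (b) One-sided test (left): $H_1: \sigma_1^2 < \sigma_2^2 \Rightarrow H_1: \dfrac{\sigma_1^2}{\sigma_2^2} < 1$ | $\dfrac{S_1^2}{S_2^2} < f_{1-\alpha}(v_1, v_2)$ | as above |
| as above | (c) Two-sided test: $H_1: \sigma_1^2 \neq \sigma_2^2 \Rightarrow H_1: \dfrac{\sigma_1^2}{\sigma_2^2} \neq 1$ | $\dfrac{S_1^2}{S_2^2} > f_{\alpha/2}(v_1, v_2)$ or $\dfrac{S_1^2}{S_2^2} < f_{1-\alpha/2}(v_1, v_2)$ | as above |
| $H_0: \sigma_1^2 \geq \sigma_2^2 \Rightarrow H_0: \dfrac{\sigma_1^2}{\sigma_2^2} \geq 1$ | $H_1: \sigma_1^2 < \sigma_2^2 \Rightarrow H_1: \dfrac{\sigma_1^2}{\sigma_2^2} < 1$ | $\dfrac{S_1^2}{S_2^2} < f_{1-\alpha}(v_1, v_2)$ | as above |
| $H_0: \sigma_1^2 \leq \sigma_2^2 \Rightarrow H_0: \dfrac{\sigma_1^2}{\sigma_2^2} \leq 1$ | $H_1: \sigma_1^2 > \sigma_2^2 \Rightarrow H_1: \dfrac{\sigma_1^2}{\sigma_2^2} > 1$ | $\dfrac{S_1^2}{S_2^2} > f_{\alpha}(v_1, v_2)$ | as above |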

Let us look at the following example on comparing two different variances.


Worked Example 1.9


Random samples of sizes $n_1 = 16$ and $n_2 = 25$ respectively were taken from
two normal populations. The variances for the samples were $S_1^2$ = 48 and $S_2^2$ =
26 respectively. Using an appropriate hypothesis, carry out a test to determine
whether the variance of the first population is bigger than the second population
variance. Use α = 0.05.

Answer:
A few steps need to be followed to solve this problem.

Step 1: Determine the parameter of interest


Population parameters of interest are the variances of the first and second populations,
$\sigma_1^2$ and $\sigma_2^2$, respectively.

Step 2: Gather all available information

| | Population 1 | Population 2 |
|---|---|---|
| (a) shape | normal | normal |
| (b) mean | $\mu_1$ (unknown) | $\mu_2$ (unknown) |
| (c) standard deviation | $\sigma_1$ (unknown) | $\sigma_2$ (unknown) |

Step 3: Construct hypothesis statement

The statement "$\sigma_1^2$ is greater than $\sigma_2^2$" means H1 is chosen as
$$H_1: \sigma_1^2 > \sigma_2^2 \Rightarrow H_1: \frac{\sigma_1^2}{\sigma_2^2} > 1$$
and
$$H_0: \sigma_1^2 = \sigma_2^2 \Rightarrow H_0: \frac{\sigma_1^2}{\sigma_2^2} = 1$$
(Refer to Table 1.4)

Step 4: Determine the significance level and rejection region


A one-sided test will be carried out at a significance level of $\alpha$ = 0.05. The
rejection region is on the right side of the F distribution. From the table, we obtain
$F_{0.05,15,24}$ = 2.11. Hence, we reject H0 when the value of the F statistic is greater
than 2.11.


Step 5: Calculate the test statistic

Test statistic: $F = \dfrac{S_1^2}{S_2^2} = \dfrac{48}{26} = 1.85$

Step 6: Determine the result


Since the value of the F statistic is inside the acceptance region,
i.e. 1.85 < 2.11, we do not reject the null hypothesis.

Step 7: State a conclusion


Hence, at the 5% level there is insufficient evidence to conclude that $\sigma_1^2 > \sigma_2^2$; the data are
consistent with the two populations having equal variances.
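
Steps 4 to 6 of Worked Example 1.9 reduce to one division and one comparison. The sketch below is illustrative only: the function name `f_test_right_tailed` is an assumption, and the critical value 2.11 is the table entry from Step 4, supplied by hand.

```python
def f_test_right_tailed(s1_sq, s2_sq, critical_value):
    """Right-tailed F test of H1: sigma1^2 > sigma2^2.

    Returns the statistic S1^2 / S2^2 and a flag that is True when the
    statistic exceeds the critical value (H0 rejected).
    """
    f_stat = s1_sq / s2_sq
    return f_stat, f_stat > critical_value


# Worked Example 1.9: S1^2 = 48, S2^2 = 26, F(0.05; 15, 24) = 2.11
f_stat, reject = f_test_right_tailed(48, 26, 2.11)
print(round(f_stat, 2), reject)   # prints: 1.85 False
```

A value of `False` means H0 is not rejected. It does not prove that the variances are equal; it only indicates that the sample gives no significant evidence that the first variance is larger.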

EXERCISE 1.7

Two independent random samples were chosen from two normal
populations. Both samples gave the following information:

| Sample 1 | Sample 2 |
|---|---|
| $n_1 = 16$ | $n_2 = 25$ |
| $\bar{x}_1 = 48.7$ | $\bar{x}_2 = 39.2$ |
| $S_1^2 = 10.6$ | $S_2^2 = 7.3$ |

Carry out the appropriate hypothesis testing (at $\alpha$ = 0.02) to determine
whether both populations have equal variance.
(Please answer on separate sheets.)


EXERCISE 1.8

1. If independent random variables X1, X2 and X3 are distributed as
$\chi^2(1)$, $\chi^2(5)$ and $\chi^2(10)$ respectively, determine the distribution
(with mean and variance) for
(a) X1 + X2
(b) X1 + X3

2. If a random variable X1 is distributed as $N(\mu_1, \sigma_1^2)$ and a
random variable X2 is distributed as $N(\mu_2, \sigma_2^2)$, show that
$$E\left[\frac{(X_1-\mu_1)^2}{\sigma_1^2} + \frac{(X_2-\mu_2)^2}{\sigma_2^2}\right] = 2.$$

3. Using a distribution table,
(a) Determine the value for $\alpha$ if
(i) $\chi^2_{\alpha}$ = 19.02 (v = 9)
(ii) $\chi^2_{\alpha}$ = 24.43 (v = 40)
(b) Determine the value for x if
(i) $\chi^2_{0.005}$ = x (v = 29)
(ii) $\chi^2_{0.99}$ = x (v = 4)

4. Suppose that X1 and X2 are random variables with X1 distributed
as $\chi^2(3)$ and X2 distributed as $\chi^2(4)$. Determine the probability
value for
(a) X1 > 6.25
(b) X1 < 0.115
(c) X2 > 14.86
(d) X1 + X2 > 14.07

5. A test has been suggested at the 5% level to check whether a sample
containing 9 items comes from a normal population with variance
11. If the calculated sample variance is 12.1, determine the result of
that test.

6. A sample of size 15 drawn from a normal population gave the
following results:
9.1, 14.3, 11.2, 8.4, 8.5, 14.0, 9.9, 8.9, 11.0, 10.2, 10.8, 11.4, 13.0,
9.9 and 10.4.
(a) Find the sample mean and sample variance. Test at the 5% level
whether the population variance is 1.9.
(b) If the test is to be conducted to check whether the population
variance is greater than 1.9, will the result differ from the one
obtained in (a)?

7. A firm would like to conduct a study on the sequence of its two
production channels. Since both sequences result in almost similar
daily output, it has been suggested that the production channel of
which the sequence results in less variation be selected. Two
random samples were obtained as shown in the data below.
Construct a 95% confidence interval for the variance ratio of daily
output. Which channel sequence would you suggest?
- Production Channel 1: $n_1$ = 21 days, $S_1^2$ = 1432
- Production Channel 2: $n_2$ = 25 days, $S_2^2$ = 3761

8. Two random samples of sizes 8 and 11, from populations
assumed to be normally distributed, have variances 12.4 and 19.3
respectively. Test at the 5% level to determine whether both populations
have equal variance.

9. The following data were chosen from two normal populations.
Using $\alpha$ = 0.01, test the appropriate hypothesis to determine
whether the first population variance is larger than the
second population variance.

| Sample 1 | 24.3 | 46.0 | 56.6 | 40.3 | 64.1 | 69.5 | 48.1 | 37.1 | 56.5 | 50.6 |
|---|---|---|---|---|---|---|---|---|---|---|
| Sample 2 | 31.9 | 42.8 | 55.4 | 52.3 | 46.5 | 42.0 | 45.5 | 42.4 | 32.0 | 51.5 |


Try the activities on this website:


http://home.xnet.com/~fidler/triton/math/review/mat170/fdist1.htm

• Chi-Square Distribution
The Chi-Square distribution is a sampling distribution for the variable
$\chi^2 = \dfrac{(n-1)s^2}{\sigma^2}$, possessing the following properties:
- The Chi-Square value is always greater than or equal to 0, a property
which is not applicable for z and t distributions;
- The distribution is non-symmetrical;
- The distribution will vary according to the sample size. This means the
distribution shape is very dependent on the value v, which is the degrees of
freedom that exist in a sampling situation;
- The mean of any Chi-Square distribution is equal to its degrees of freedom; and
- It is used to test the variance of a single population.

• F Distribution
The F distribution is a sampling distribution for variables possessing the
following properties:
- There are no negative values in the F distribution (similar to the Chi-Square
distribution). As such, the scale for the F value starts with 0 and extends
towards the positive side on the right;
- The F distribution is also non-symmetrical;
- There are various shapes of F distribution depending on the sample sizes,
that is, on the respective sample degrees of freedom; and
- It is used to compare two independent population variances.

• Steps to carry out hypothesis testing on variance:
- Determine the parameter of interest.
- Gather all available information.
- Construct the hypothesis statement.
- Determine the significance level and rejection region.
- Calculate the test statistic.
- Determine the result.
- State a conclusion.

Topic 2 One-Way Analysis of Variance (ANOVA)
LEARNING OUTCOMES
By the end of this topic, you should be able to:
1. Explain the importance of variance analysis;
2. Identify the assumptions required in carrying out the variance analysis
technique;
3. Explain the procedure for ANOVA hypothesis testing; and
4. Apply the ANOVA testing procedure and F distribution table to obtain
statistical results based on the means for three or more populations.

INTRODUCTION
There are situations that may interest us to broaden our test scope to more than
two populations. The following are two situations, which compare three or more
populations. A teacher may be interested in comparing the mean of Mathematics
marks obtained from three groups of students following different teaching
methods. Similarly, a scientist may be interested to compare the strength of
certain pulps produced using different techniques. Each production method may
result in a different mean strength and the scientist may want to test the equality
of several means. This can be performed through a procedure called variance
analysis. Analysis of Variance (ANOVA) is a basic method used in experimental
design. This technique has wide applications and is a useful technique in
inferential statistics.

In general, we use this method to estimate and test hypotheses regarding
population means and variances. Although subsequent topics emphasise hypothesis
testing on population means, any conclusion made is very dependent on the
observed magnitude of the variance. The application of variance analysis also
depends on several required assumptions. We will be discussing assumptions
required for the test, the hypothesis which will be constructed, the calculation
involved, (the indicators for the ANOVA table) and the type of results that can be
expected from this analysis.

2.1 BASIC CONCEPTS OF ANOVA

2.1.1 ANOVA and Experiment


Experiment: A study or research designed with the purpose of checking the
effect of one variable on the value of another variable. For example, the effect of
tuition classes on students’ performance.

Independent variable: An observed or controlled variable for the purpose of
determining its effect on the dependent variable/s. In ANOVA, the independent
variable can be qualitative (such as gender) or quantitative (such as student’s age).

Relevant terms:

(a) Independent variables are called factors. A study may involve 1, 2 or more
factors.
Example: type of tuition classes, time of classes and students’ commitment.

(b) Each factor may involve different factor levels (categories).
Example:

| Factor | Level |
|---|---|
| Type of Tuition Classes | Method 1: tuition |
| | Method 2: intensive class |
| Tuition Class Time | Time 1: afternoon |
| | Time 2: night |
| | Time 3: weekends |


(c) The combination of one level of a factor to another factor’s level is termed
as treatment or run.

Note:
(i) For a single factor experiment, the factor level and treatment carry the same
meaning.

(ii) In this module, only the single-factor experiment is considered. For multiple
factors, please refer to appropriate reference books.

In general, the analysis of variance is an approach for investigating, using sample
data, whether the unknown means of three or more populations are equal or not. If
the comparison is made on two population means, the result from the ANOVA
procedure discussed here is equivalent to the means testing procedure for two
small samples (using the t distribution). As such, the focus of this topic is to
compare the means for three or more unknown populations, restricted to one
variable or one factor. Examples of experiments involving comparison of
means for three or more populations are:
- A college principal who is interested in comparing the marks obtained by
students from Years 1, 2 and 3;
- An engineer comparing the current flow for 5 different conductors; or
- A pharmacist who is interested in comparing the effectiveness of 3 types of
medicines used to treat patients at a hospital.

SELF-CHECK 2.1

Can you think of several other examples that may use ANOVA?

You have learnt about display plots in the Statistics I module. The focus is on the
usage of the box-plot and dot-plot to provide a visual display before calculating
the mean comparison. Figure 2.1 displays the box-plot graph for comparison of
mean speeds according to types of cars.


Figure 2.1

Figure 2.1 gives information on five cars which have different means or medians
and variances for their speed at different levels. However, are the observed
differences statistically significant? We need the analysis of
variance to carry out a numerical significance test on the equality of the means.
The null hypothesis states that there is no difference in the mean speed
of all cars. If the null hypothesis is rejected, this means that there exist
differences in mean speed and, if we are interested, we can determine which
car's mean speed caused the difference.

There are several assumptions we must make when carrying out ANOVA
procedures. The assumptions are that:
(a) Each population under study must be normally distributed;
(b) Samples taken are random and independent; and
(c) Each population that produces a sample value has unknown and equal
population variance, that is
$\sigma_1^2 = \sigma_2^2 = \ldots = \sigma_k^2$

Figure 2.2 below displays the graph shape for populations that satisfy the
assumptions above.

Figure 2.2

Under the assumption that the population distributions are normal with a
homogeneous population variance, the conditions for using the F distribution are satisfied. If
the assumptions of the F procedure are not satisfied for a population distribution, a
non-parametric statistical test must be used (this will be discussed in Topic 5).

Observe Figure 2.3 below. What can you comment on the means µ1, µ2, and µ3?

Figure 2.3

Worked Example 2.1


Below is a one-way experiment/test. The measurement in the table is the final
examination score for 12 students according to type/category of tuition class,
considered as a factor.

Type of Tuition Class (Factor)

| | Type 1 | Type 2 | Type 3 | Row Total |
|---|---|---|---|---|
| | 65 | 66 | 70 | 201 |
| | 64 | 64 | 68 | 196 |
| | 67 | 63 | 65 | 195 |
| | 66 | 67 | 69 | 202 |
| Mean, $\bar{x}$ | $\bar{x}_1 = 65.5$ | $\bar{x}_2 = 65$ | $\bar{x}_3 = 68$ | $\bar{x}_{..} = \dfrac{794}{12} = 66.17$ |

Note:
(a) The factor is the tuition class; the factor level is type/category.
(b) Four observations for factor level 1 is called group 1.
(c) There are three groups in this experiment with the dot plot shown in Figure
2.4:


Figure 2.4

(d) There exists variations within each respective group (as observed from the
variation in marks value).
(e) There exists variations between groups (as observed from the varying
position of the group centre).
(f) Based on these, we can raise the question: Do the three types of extra classes
result in similar effects on students’ performance? In other words, is the
population mean for group 1 ($\mu_1$) = population mean for group 2 ($\mu_2$) = population
mean for group 3 ($\mu_3$)?

SELF-CHECK 2.2

1. Provide an example of a one-factor experiment with 4 factor
levels. Name the independent and dependent variables involved.

2. Give an example of a two-factor experiment with 2 factor levels.
Name the independent and dependent variables involved.


EXERCISE 2.1

Using the two data sets below, construct a dot-plot graph. Give your
comment.

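| | Data 1: A | Data 1: B | Data 1: C | Data 2: A | Data 2: B | Data 2: C |
|---|---|---|---|---|---|---|
| | 14.9 | 14.4 | 14.5 | 16.6 | 13.2 | 13.5 |
| | 14.9 | 14.4 | 14.5 | 16.8 | 14.7 | 16.9 |
| | 14.9 | 14.4 | 14.5 | 13.2 | 15.7 | 15.4 |
| | 14.9 | 14.4 | 14.5 | 15.3 | 12.3 | 12.8 |
| | 14.9 | 14.4 | 14.5 | 12.6 | 16.1 | 13.9 |
| Total | 74.5 | 72.0 | 72.5 | 74.5 | 72.0 | 72.5 |
| Mean | 14.9 | 14.4 | 14.5 | 14.9 | 14.4 | 14.5 |
| Variance | 0 | 0 | 0 | 3.71 | 2.63 | 2.71 |

Overall Total: 219.0 (Data 1), 219.0 (Data 2)
Overall Mean: 14.6 (Data 1), 14.6 (Data 2)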

2.1.2 Within and between Group Variation


Changes within one data set containing n measurements are proportional to the
total sum of squares, $\sum_{i=1}^{n}(x_i - \bar{x})^2$, and this quantity is used to calculate the sample
variance for a population. The term analysis of variance comes from the
decomposition of the total variability into its components.

The total sum of squares, that is, $\sum_{i=1}^{k}\sum_{j=1}^{n}(x_{ij} - \bar{x}_{..})^2$, is a measurement of the total
variability in the data. Observe Table 2.1:


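Table 2.1

| Treatment | Observations |
|---|---|
| 1 | $x_{11},\ x_{12},\ \ldots,\ x_{1,n_1}$ |
| 2 | $x_{21},\ x_{22},\ \ldots,\ x_{2,n_2}$ |
| . | . |
| . | . |
| k | $x_{k1},\ x_{k2},\ \ldots,\ x_{k,n_k}$ |

The table shows the measurements $x_{ij}$ collected from the observations under study
(subjects), where i = 1, 2, …, k indexes the treatment and j = 1, 2, …, $n_i$ indexes the
observation within the treatment. The sample sizes are not
necessarily equal. The total sample size is $N = \sum_{i=1}^{k} n_i$. The overall mean is obtained by
dividing the overall total of the observations by the total sample size, that
is, $\bar{x}_{..} = \sum_{i=1}^{k} T_i / N$, where $T_i$ denotes the total of the observations in treatment i.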

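Note that the total sum of squares SST can be written as

$$\sum_{i=1}^{k}\sum_{j=1}^{n}(x_{ij} - \bar{x}_{..})^2 = \sum_{i=1}^{k}\sum_{j=1}^{n}\left[(\bar{x}_{i.} - \bar{x}_{..}) + (x_{ij} - \bar{x}_{i.})\right]^2 \qquad (2.1)$$

or

$$\sum_{i=1}^{k}\sum_{j=1}^{n}(x_{ij} - \bar{x}_{..})^2 = n\sum_{i=1}^{k}(\bar{x}_{i.} - \bar{x}_{..})^2 + \sum_{i=1}^{k}\sum_{j=1}^{n}(x_{ij} - \bar{x}_{i.})^2 + 2\sum_{i=1}^{k}\sum_{j=1}^{n}(\bar{x}_{i.} - \bar{x}_{..})(x_{ij} - \bar{x}_{i.}) \qquad (2.2)$$

However, the cross-product term in Equation (2.2) above is zero since, writing $x_{i.}$ for the
total of the observations in treatment i,

$$\sum_{j=1}^{n}(x_{ij} - \bar{x}_{i.}) = x_{i.} - n\bar{x}_{i.} = x_{i.} - n(x_{i.}/n) = 0$$

Hence, we have

$$\sum_{i=1}^{k}\sum_{j=1}^{n}(x_{ij} - \bar{x}_{..})^2 = n\sum_{i=1}^{k}(\bar{x}_{i.} - \bar{x}_{..})^2 + \sum_{i=1}^{k}\sum_{j=1}^{n}(x_{ij} - \bar{x}_{i.})^2 \qquad (2.3)$$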

Equation (2.3) states that the total variability in the data, as measured by the total
sum of squares, can be partitioned into the sum of squared deviations between the
treatment means and the overall mean, and the sum of squared deviations of the
observations within each treatment from their treatment mean. The
differences between the observed treatment means and the overall mean are a
measure of the differences between treatment means, while the differences
between the observations within a treatment and the treatment mean can only be caused by random error.
As such, we can write Equation (2.3) in symbols, with the degrees of freedom shown beneath each term, as:

$$\underbrace{SST}_{N-1} = \underbrace{SS(Tr)}_{k-1} + \underbrace{SSE}_{N-k}$$

where

SS(Tr) = the sum of squares due to treatments (that is, between treatments)

SSE = the sum of squares due to error (that is, within treatments)

Since there are kn = N total observations, SST has N – 1 degrees of freedom.
SS(Tr) has k – 1 degrees of freedom since there are k factor levels (and k treatment
means). Finally, in every treatment there are n replicates that provide n – 1
degrees of freedom used in estimating the experimental error. Since there are k
treatments, we have k(n – 1) = kn – k = N – k degrees of freedom for error.

It is important for us to examine closely both terms on the right side of the basic
identity for analysis of variance (Equation (2.3)). Consider the following sum of
squares for error:

$$SSE = \sum_{i=1}^{k}\sum_{j=1}^{n}(x_{ij} - \bar{x}_{i.})^2 = \sum_{i=1}^{k}\left[\sum_{j=1}^{n}(x_{ij} - \bar{x}_{i.})^2\right]$$

It is easy to see that the term inside the square bracket, divided by n – 1, is the
sample variance of the i-th treatment, that is

$$S_i^2 = \frac{\sum_{j=1}^{n}(x_{ij} - \bar{x}_{i.})^2}{n-1}, \quad i = 1, 2, \ldots, k$$

Now, the sample variances can be combined to get a pooled estimate of the common
population variance, that is,

$$\frac{(n-1)S_1^2 + (n-1)S_2^2 + \ldots + (n-1)S_k^2}{(n-1) + (n-1) + \ldots + (n-1)} = \frac{\sum_{i=1}^{k}\left[\sum_{j=1}^{n}(x_{ij} - \bar{x}_{i.})^2\right]}{\sum_{i=1}^{k}(n-1)} = \frac{SSE}{N-k}$$

As such, SSE/(N – k) is an unbiased estimator of the common variance within each treatment.


In the same way, if there is no difference between the k treatment means, we can use the variation of the treatment means about the overall mean to estimate $\sigma^2$.

When n is equal for each of the k treatments,
$$\frac{SS(Tr)}{k-1} = \frac{n\sum_{i=1}^{k}(\bar{x}_{i.}-\bar{x}_{..})^2}{k-1}$$
is an estimate of $\sigma^2$ when the treatment means are the same.

If n is unequal, we have
$$\frac{SS(Tr)}{k-1} = \frac{\sum_{i=1}^{k}n_i(\bar{x}_{i.}-\bar{x}_{..})^2}{k-1}$$

Note that the analysis of variance identity (Equation (2.3)) provides two estimates of $\sigma^2$: one based on the variation within treatments and one based on the variation between treatments. If there are no differences in treatment means, both estimates should be almost equal. If the two estimates are not the same, it is suspected that the observed difference is due to differences between the treatment means. The quantities
$$\frac{SS(Tr)}{k-1} = MS(Tr)$$
and
$$\frac{SSE}{N-k} = MSE$$
are called mean squares (MS).


Worked Example 2.2


Using data on students’ test marks according to classes (full marks = 20),
calculate the mean squares between classes, mean squares between students
(errors) and total deviation of overall marks.
Class A    Class B    Class C
15.2       14.8       15.1
15.4       14.4       14.3
14.8       14.3       14.6
14.4       14.1       13.9
14.7       14.4       14.6
Answer:

The results are as below:

                          Class A    Class B    Class C
                          15.2       14.8       15.1
                          15.4       14.4       14.3
                          14.8       14.3       14.6
                          14.4       14.1       13.9
                          14.7       14.4       14.6
Total                     74.5       72.0       72.5
Mean, $\bar{x}_j$         14.9       14.4       14.5
Variance, $S_j^2$         0.16       0.065      0.195

Overall total = 74.5 + 72.0 + 72.5 = 219.0

Overall mean, $\bar{x}$ = 219.0/15 = 14.6


We obtained:

(a) Variation of marks between classes (treatments)

$$\text{Sum of squares treatments} = \sum_{j} n_j(\bar{x}_j-\bar{x})^2 = 5(14.9-14.6)^2 + 5(14.4-14.6)^2 + 5(14.5-14.6)^2 = 5(0.09+0.04+0.01) = 0.7$$

Degrees of freedom (df) = k - 1 = 3 - 1 = 2

As such,
Mean square of treatments = (Sum of squares treatments / df)
= 0.7/2
= 0.35
This means that the variation between the three samples, that is, the variation of students' marks between classes, is 0.35.

(b) Variation of marks between students (error)

$$\text{Sum of squares error} = (n_1-1)s_1^2 + (n_2-1)s_2^2 + (n_3-1)s_3^2$$
= 4(0.16) + 4(0.065) + 4(0.195)
= 4(0.42)
= 1.68
Degrees of freedom (df) = n1 + n2 + ... + nk – k = 5 + 5 + 5 - 3 = 12
Hence, the mean square error = (sum of squares error / df)
= 1.68/12 = 0.14
This means that the variation within the three samples, which is the variation of marks between students in their respective classes, is 0.14.

(c) The total deviation in overall marks is based on SST

$$SST = \sum_{i}\sum_{j}(x_{ij}-\bar{x}_{..})^2 = (15.2-14.6)^2 + (15.4-14.6)^2 + \cdots + (14.6-14.6)^2 = 2.38$$

Hence, the total deviation in overall marks is SST/(N - 1) = 2.38/14 = 0.17.

In this topic, the term mean square is synonymous with variance. As such, when discussing the variation in a set of data, the term mean square is used in place of variance. When the mean square of treatments is greater than the mean square of error, we have an early indication that there may be a difference between the treatments that affects the experimental results. The following section will demonstrate how such differences can be verified using appropriate statistical tests.

SELF-CHECK 2.3

Using the answer in Worked Example 2.2,


1. Determine
(a) Sum of squares of treatment + Sum of squares of error
(b) df[SS(Tr)] + df[SSE]

2. Compare the values of mean square of treatment (MSTr) and mean


square of error (MSE).
Comment on 1 and 2.

EXERCISE 2.2

Four groups of students were chosen and exposed to different methods of


teaching. At the end of the period, they were given a test to evaluate the
effectiveness of the teaching method. The number of students in each
group differed according to group. Using the following marks data,
calculate the Mean Squares Between Methods, Mean Squares Between
Students and the total deviation on overall marks.
G1 G2 G3 G4
65 75 59 94
87 69 78 89
73 83 67 80
79 81 62 88
81 72 83
69 79 76
90

(Use separate answer sheets.)


2.2 SINGLE-FACTOR EXPERIMENT

2.2.1 One-Way ANOVA


For a single factor experiment, ANOVA is called One-Way ANOVA. The
purpose of this analysis is to check the equality of population means for three or
more random samples. The following are a few steps in carrying out ANOVA
procedures.

Step 1: Determine the null and alternative hypotheses


The null hypothesis in ANOVA is that the independent samples were taken from populations with equal means, while the alternative hypothesis is the reverse. In short, they can be written as
$H_0: \mu_1 = \mu_2 = \cdots = \mu_k$
(no mean difference for k factor levels, k = number of populations)
$H_1$: not all populations have equal means.
Note that ANOVA does not tell us by how much the population means differ, nor which population means differ. If $H_0$ is true and the three assumptions stated earlier are satisfied, we can say that the samples were taken from the same population, as shown in Figure 2.5(a), and the treatments contribute nothing beyond random error to the total variability. On the other hand, if $H_0$ is false, the variation in the observed dependent variable is mostly due to differences in treatment, and its contribution can be viewed as SS(Tr) as a percentage of the total sum of squares. The greater this percentage, the stronger the support for $H_1$, that is, that the population means are not all equal. This implies that the samples were taken from different populations, as shown in Figure 2.5(b).

Figure 2.5(a) Figure 2.5(b)


Step 2: Choose a significance level


A criterion for rejecting $H_0$ is required, and the test is performed at a specific significance level, usually taken as α = 0.01 or α = 0.05.

Step 3: Determine the rejection region


To determine the F value that separates the acceptance and rejection regions, we need two pieces of information:
(a) The degrees of freedom for the treatment, v1 = k – 1, where k is the number of samples (groups/treatments); and
(b) The degrees of freedom for the error, v2 = N – k, where N is the total number of observations in all samples, that is n1 + n2 + ... + nk = N (for unequal sample sizes) or kn = N (for equal sample sizes).

Determine the critical value, $F_{v_1, v_2, \alpha}$ (obtained from the F distribution table). Reject $H_0$ when $F > F_{v_1, v_2, \alpha}$.

Step 4: Calculate the test statistic

$$F = \frac{MS(Tr)}{MSE} = \frac{SS(Tr)/(k-1)}{SSE/(N-k)}$$

Step 5: Determine test result


The result depends on the rejection region in Step 3.

Step 6: State test conclusion


The conclusion of the test depends on the statistical test result obtained in Step 5.

Let us try the following exercise to test your understanding.

EXERCISE 2.3

How can a critical region be determined in an ANOVA test? What


should you know prior to obtaining the critical value?


2.2.2 Model for a Single-Factor Test


The outcome of a single-factor test is:
Observed data = Mean + Unexplained Variation
where the unexplained variation is sampling variation.

The mean model is the model usually used for single-factor testing, that is, the model used to compare the means ($\mu_i$) of the factor levels. This model is

$$X_{ij} = \mu_i + \varepsilon_{ij}, \quad i = 1, 2, \ldots, k; \; j = 1, 2, \ldots, n$$

where $X_{ij}$ = the j-th observation at the i-th factor level

k = number of factor levels
n = number of sample observations at each factor level
$\varepsilon_{ij}$ = the independent error, normally distributed as $N(0, \sigma^2)$

For hypothesis testing, the model error is assumed to be an independent random variable, which is normally distributed with zero mean and variance $\sigma^2$. The variance $\sigma^2$ is assumed constant for all factor levels. This model is also termed a one-way analysis of variance since only a single factor is tested. The test method is summarised in the following ANOVA table:

Source        Sum of Squares   Degrees of Freedom   Mean Square   F-Value
Treatments    SS(Tr)           k - 1                MS(Tr)        MS(Tr)/MSE
Error         SSE              N - k                MSE
Total         SST              N - 1

As we know, MS(Tr) and MSE measure the variation between treatments and the variation within treatments respectively. The corresponding sums of squares add up to the total variability in the sample data. This means:

The total sum of squares = Sum of squares treatment + Sum of squares error


Worked Example 2.3


A private accounting firm carried out a study to investigate whether the efficiency of its employees was related to their former schools. Accountants from 4 schools were selected at random from the company. The number of mistakes made by each accountant over a 2-week period was recorded as below:
School A School B School C School D
14 17 19 23
16 16 20 12
17 18 22 21
13 15 21 10
22 16 18 9
9 12 19 15

Carry out an ANOVA test at the 0.01 significance level. Is there any significant difference in the efficiency of employees based on their former schools?
Answer:
The factor involved here is the former school; the factor levels are the 4 schools selected.

Step 1: Determine the null and alternative hypotheses


$H_0: \mu_1 = \mu_2 = \mu_3 = \mu_4$ (no mean difference for the 4 factor levels)
$H_1$: not all populations have an equal mean.

Step 2: Choose a significance level


The significance level is fixed at α = 0.01.

Step 3: Determine the rejection region


(a) The degrees of freedom for treatments, v1 = k – 1 = 4 – 1 = 3.
(b) The degrees of freedom for error, v2 = N – k = 24 – 4 = 20.
Determine the critical value, that is F3, 20, 0.01 = 4.938 (obtained from the F
distribution table). Reject H0 when F > 4.938.


Step 4: Calculate the test statistic

$$F = \frac{MS(Tr)}{MSE} = \frac{SS(Tr)/(k-1)}{SSE/(N-k)} = \frac{94.8333/3}{297/20} = 2.129$$

Step 5: Test result


Since the calculated test statistic value, F = 2.129 < 4.938, we fail to reject H0.

Step 6: Conclusion
Hence, at the 1% significance level we do not have sufficient evidence to state that the mean numbers of mistakes made by the accountants differ. This means that no significant difference in employees’ efficiency based on schools was found.
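The six steps of this worked example can be traced with a short script. This is a minimal sketch in plain Python: the function name `one_way_anova` is an illustrative assumption, and the critical value 4.938 is simply copied from the F table reading in Step 3 rather than computed.

```python
# Number of mistakes per accountant, grouped by former school (Worked Example 2.3).
schools = {
    "A": [14, 16, 17, 13, 22, 9],
    "B": [17, 16, 18, 15, 16, 12],
    "C": [19, 20, 22, 21, 18, 19],
    "D": [23, 12, 21, 10, 9, 15],
}

# Critical value F(3, 20) at alpha = 0.01, taken from the F table (Step 3).
F_CRITICAL = 4.938


def one_way_anova(groups):
    """Return (SS(Tr), SSE, df treatments, df error, F) for a list of samples."""
    all_values = [x for g in groups for x in g]
    N, k = len(all_values), len(groups)
    grand_mean = sum(all_values) / N
    means = [sum(g) / len(g) for g in groups]
    ss_tr = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_e = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_tr, df_e = k - 1, N - k
    f_stat = (ss_tr / df_tr) / (ss_e / df_e)
    return ss_tr, ss_e, df_tr, df_e, f_stat


ss_tr, ss_e, df_tr, df_e, f_stat = one_way_anova(list(schools.values()))
reject_h0 = f_stat > F_CRITICAL   # Step 5: compare with the rejection region

print("SS(Tr) =", round(ss_tr, 4), " SSE =", round(ss_e, 4))
print("df =", (df_tr, df_e), " F =", round(f_stat, 3))
print("Reject H0" if reject_h0 else "Fail to reject H0")
```

Passing the critical value in as a constant keeps the sketch faithful to the table-based procedure used in this topic.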

You can now tackle these exercises and activity.

EXERCISE 2.4
A teacher claimed that the frequencies of watching television are equal
for all students in primary 6, Form 1 and 2. He conducted a survey on a
random sample of selected students and their total time (in minutes)
spent on watching television after school time until just before bed
time were recorded as below:
Primary 6 Form 1 Form 2
459 115 272
311 153 88
152 201 374
293 30 178

Carry out an ANOVA test at 5% level. What is your conclusion?


SELF-CHECK 2.4

State:
1. Three assumptions made in ANOVA hypothesis testing.
2. The steps involved to perform ANOVA hypothesis testing.

EXERCISE 2.5

1. What are the assumptions needed in carrying out the analysis of


variance procedures?
2. What is the null and alternative hypotheses used in any analysis of
variance test?
3. Determine:
(a) The critical value for ANOVA test at α = 0.01 if the test
consists of 6 samples with 34 items in each sample.
(b) The critical value for ANOVA test at α= 0.05 if the test
consists of 4 samples with 44 items in each sample.
4. Calculate:
(a) Test statistics F when MSE =14.6 and MS(Tr) = 35.7 (use the
information in question 3(a)).
(b) Test statistics F when MSE =73.81 and MS(Tr) = 215.23 (use
the information in question 3(b)).
5. A study claimed that there is no difference in the decision-making process between student populations from various socio-economic family backgrounds. The scores for this behavioural test, which was conducted on a random sample, are as below. Test the research claim at the 1% level. Comment on the result.


Lower Middle Higher
32 45 38
36 42 38
40 34 38
32 42 31
33 29 41
37 33
34

Please visit the following websites to find out more on:


 Chi-Square and F distribution
http://mathforum.org/library/drmath/view/52808.html
 F Distribution and ANOVA
http://www.sytsma.com/pjad530/anovaworks.html

• The main purpose of the analysis of variance technique discussed here is to enable a decision-maker to compare the means for three or more independent samples and check if there exists any statistically significant difference between the means of the populations from which the samples were taken.

• There are three assumptions that need to be verified before the ANOVA technique can be carried out, which are:
  • The population distributions are approximately normal;
  • The samples are chosen at random and are independent; and
  • The population variances are equal.


• There are 6 steps to carry out in hypothesis testing procedures.
Step 1: Determine the null and alternative hypotheses;
Step 2: Choose a significance level;
Step 3: Determine the rejection region;
Step 4: Calculate the test statistic;
Step 5: Determine test result; and
Step 6: State the conclusion.

• The hypothesis testing process using ANOVA is summarised through the construction of a One-Way ANOVA table, which consists of the sources of variation (treatment, error and total), degrees of freedom, mean squares (treatment and error), and F ratio.



Topic 3 Categorical Data Analysis
LEARNING OUTCOMES
By the end of this topic, you should be able to:
1. Explain the importance of goodness-of-fit and contingency table tests;
2. Determine appropriate null and alternative hypotheses for goodness-
of-fit and contingency table tests;
3. Calculate the expected frequencies for goodness-of-fit and
contingency table tests;
4. Determine the degrees of freedom for all tests performed; and
5. Apply step-by-step procedures in goodness-of-fit and contingency
table tests to deduce statistical decisions.

 INTRODUCTION
In the previous topics, analysis of variance and hypothesis testing were conducted
on quantitative and continuous data. The statistical techniques discussed in those
topics required measurement values such as weight, height, diameter, distance,
total money or total score/marks for a test. On the other hand, many types of
surveys and experiments result in qualitative rather than quantitative response
variables. As a result, the responses can be classified but not quantified. Data from
these experiments consist of the count or number of observations that fall into
each of the response categories included in the experiment. In this topic, we are
concerned with methods for analysing categorical data.

A categorical variable is a variable that classifies or categorises each individual into exactly one of several cells or classes. Categorical data analysis is a technique used to analyse such qualitative data with specific categories that are of interest. For example, in a public poll, respondents’ feedback on certain issues is recorded and the data are in categorical form, that is, whether the respondent chooses “Agree”, “Disagree” or “No Opinion”. Other examples are an experimenter carrying out a study on leukaemia patients who records the number of patients according to the patient’s family category, or an administrative executive at a university who records the number of candidates according to gender and number of courses taken. Each of these examples involves a categorical variable, and the data taken are the frequencies that fall into each category of the variable. The chi-square test will be used to carry out categorical data analysis by comparing the observed frequencies with the expected frequencies under a pre-specified null hypothesis.

SELF-CHECK 3.1

What are examples of qualitative data?

3.1 GOODNESS-OF-FIT TEST

3.1.1 Fitting to a Given Probability


The goodness-of-fit test is a hypothesis test used to determine whether a population of interest follows a specified probability distribution of a random variable. In general, we need to understand the concept of the multinomial experiment, which is an extension of the binomial experiment, in order to understand the principles of the goodness-of-fit test.

Multinomial Experiment
(a) This experiment consists of n identical trials.
(b) The outcome of each trial falls into one of k categories or cells.
(c) The probability that the outcome of a single trial will fall in a particular cell, say, cell i, is $p_i$, where i = 1, 2, …, k; it remains the same from trial to trial, and $p_1 + p_2 + \cdots + p_k = 1$.
(d) The experimenter counts the observed number of outcomes in each category, written as $O_1, O_2, \ldots, O_k$, with $n = O_1 + O_2 + \cdots + O_k$.
(e) When performing a goodness-of-fit test, these two assumptions are required:
(i) The experiment satisfies the properties of a multinomial experiment; and
(ii) All expected frequencies are at least 1, and, at most, 20% of the expected frequencies are less than 5.
(f) The required alternative hypothesis is H1: at least one of the multinomial probabilities differs from the value specified under the null hypothesis.
(g) In n trials, the expected number that falls into the j-th category under the null hypothesis is $E_j = np_j$.

Worked Example 3.1


Suppose that a dice is tossed 120 times and each outcome is recorded in the following frequency table.

Face                   1    2    3    4    5    6    Total
Frequency (observed)   20   10   10   20   20   40   120

Test whether the dice is biased or not.

Answer:
Step 1: Construct the appropriate hypothesis statement
H0: p1 = p2 = ... = p6 = 1/6, that is, the dice is fair
H1: pj ≠ 1/6 for at least one j = 1, 2, …, 6
Theoretically, if the dice is balanced, we would expect each face to occur 20 times, that is, the outcomes follow a uniform distribution. This means the expected frequency of each face in 120 throws is as in the following table:

Face        1    2    3    4    5    6    Total
Frequency   20   20   20   20   20   20   120

Hence, the test of H0 above is a goodness-of-fit test for a uniform distribution.


Step 2: Determine the significance level and the rejection region


The test is performed at the 5% level, hence reject the null hypothesis when the test statistic
$$\chi^2 = \sum \frac{(O-E)^2}{E} > \chi^2_{5\%}(5) = 11.070$$
with degrees of freedom, v = (number of columns) – 1 = 6 – 1 = 5.

Step 3: Calculate the value of test statistic

The test statistic is calculated as
$$\chi^2 = \sum_{j=1}^{6}\frac{(O_j-E_j)^2}{E_j}$$
which is distributed as a $\chi^2$ distribution with v = 5 degrees of freedom. The following table gives the information needed.

Face                                  1     2     3     4     5     6     Total
The observed frequency, O             20    10    10    20    20    40    120
The expected frequency, E             20    20    20    20    20    20    120
(when the null hypothesis is true)
O - E                                 0     -10   -10   0     0     20    0
(O - E)^2 / E                         0     5     5     0     0     20    30

Hence, we obtain $\chi^2 = \sum_{j=1}^{6}\frac{(O_j-E_j)^2}{E_j} = 30$. Under H0, $\sum \frac{(O-E)^2}{E}$ is distributed as a $\chi^2$ distribution. To calculate the test value in this case, the 6 pairs of observed and expected frequencies are needed. The pairs are subject to the restriction that their total frequency must be 120, which also means that the differences O – E must sum to zero. As a result, only 6 – 1 pairs are free to vary, and this number (the degrees of freedom) is used for the $\chi^2$ test. In other words, there are 6 cells to fill subject to 1 restriction, namely that the total frequency for the 6 cells must be 120. For this reason, 6 cells – 1 restriction = 5 degrees of freedom.

Step 4: Determine the results


Since the test statistic value, $\chi^2 = 30 > \chi^2_{5\%}(5) = 11.070$, reject H0.


Observe the rejection region on the right-hand side in Figure 3.1:

Figure 3.1
Step 5: State the conclusion
Hence, we have strong evidence to state that the dice is not balanced, that is, the outcomes of the throws are not uniformly distributed.
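The calculation in Steps 2 to 4 can be reproduced with a few lines of Python. This is a minimal sketch; the variable names are illustrative assumptions, and the critical value 11.070 is copied from the chi-square table rather than computed.

```python
# Observed frequencies for faces 1..6 over 120 tosses (Worked Example 3.1).
observed = [20, 10, 10, 20, 20, 40]

# Under H0 every face has probability 1/6.
probabilities = [1 / 6] * 6
n = sum(observed)
expected = [n * p for p in probabilities]   # E_j = n * p_j

# Chi-square statistic: sum of (O - E)^2 / E over all cells.
chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# One restriction (the total), so df = number of cells - 1.
df = len(observed) - 1

# Critical value for df = 5 at the 5% level, taken from the chi-square table.
CHI_SQUARE_CRITICAL = 11.070
reject_h0 = chi_square > CHI_SQUARE_CRITICAL

print("chi-square =", round(chi_square, 3), " df =", df)
print("Reject H0" if reject_h0 else "Fail to reject H0")
```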

SELF-CHECK 3.2

State the importance of the Goodness-of-Fit test.

EXERCISE 3.1

A personnel manager at a government agency is very concerned with


the employees’ attendance record. He takes a sample from their records
to investigate whether employees’ absenteeism is uniformly distributed
according to the 6 working days. The table below displays the data
obtained:
Day Number of Absentees
Monday 12
Tuesday 9
Wednesday 11
Thursday 10
Friday 9
Saturday 9

Test whether the employees’ absenteeism is uniformly distributed at 1%


level of significance.


3.1.2 Fitting to a Given Distribution


The goodness-of-fit test also applies to situations in which we want to determine whether a set of data may be looked upon as a random sample from a population having a given distribution. We would like to check whether a given frequency distribution could be considered as a Normal, Binomial or Poisson distribution, based on the following properties:
• The chi-square test statistic approaches the chi-square distribution with k – 1 degrees of freedom in the goodness-of-fit test and (r – 1)(k – 1) in the Test of Independence and Test of Homogeneity.
• In each case, the approximation to the chi-square distribution is accurate if all expected frequencies are at least 1, and, at most, 20% of the expected frequencies are less than 5.
• Yates’ continuity correction statistic, that is:
$$\chi^2_{cc} = \sum \frac{(|O-E|-0.5)^2}{E}$$
is required as a result of approximating a discrete distribution by a continuous probability distribution.


(a) Binomial Distribution

Worked Example 3.2


A lecturer would like to investigate the frequency of students taking
elective courses at a local university. Consider whether the following
number of students is distributed as binomial distribution. Use 5%
significance level.

Number of elective courses 0 1 2 3 4 5 or 6 Total


Frequency 12 16 8 3 1 0 40

Answer:
Let X be a random variable that represents the number of elective courses taken by a student.
Step 1: Construct the appropriate hypothesis statement
H0: The number of elective courses taken by a student is distributed as binomial,
that is, H0: X ~ Bin(6, p), where $\hat{p} = \bar{x}/n = (45/40)/6 = 0.1875$
H1: Otherwise.

In this case,
$$\bar{x} = \frac{\sum fx}{\sum f} = \frac{0(12)+1(16)+2(8)+3(3)+4(1)+5(0)}{12+16+8+3+1+0} = \frac{45}{40} = 1.125$$

Step 2: Calculate the test statistic


Using n = 6 and p = 0.1875, we can generate a theoretical binomial distribution, X ~ Bin(6, 0.1875), where X represents the number of events in the experiment, from
$$P(X=x) = \binom{n}{x}p^x(1-p)^{n-x}$$
and the following table is obtained:

X                  0       1       2      3      4      5 or 6
Expected values    11.51   15.93   9.19   2.83   0.49   0.05

Note: Expected frequency for the j-th cell = Ej = (total frequency) × P(X = x)

Thus, for X = 0, $P(X=0) = \binom{6}{0}(0.1875)^0(1-0.1875)^{6-0} = 0.2877$.


Therefore, the expected value for the first cell is E = (40)(0.2877) = 11.51, and so on.

Since three cells have expected frequencies less than 5 (X = 3, 4, and 5 or 6), they are combined with the cell X = 2, giving an expected frequency of 12.56 (that is, 9.19 + 2.83 + 0.49 + 0.05) and an observed frequency of 12 (that is, 8 + 3 + 1 + 0). The combined table of observed and expected frequencies used for the subsequent analysis is:

X 0 1 2 or More Total
Observed 12 16 12 40
Expected 11.51 15.93 12.56 40

and the test statistic is

$$\chi^2 = \sum\frac{(O-E)^2}{E} = \frac{(12-11.51)^2}{11.51} + \frac{(16-15.93)^2}{15.93} + \frac{(12-12.56)^2}{12.56} = 0.05$$

Step 3: Determine the significance level and rejection region


In the table above, there are 3 frequency cells. Two restrictions are placed on the expected values:
(i) the overall total = 40; and
(ii) the mean value $\bar{x}$ = 1.125, which was used to estimate p.
The degrees of freedom (df) are therefore
df = number of frequency cells – number of restrictions

So, df = 3 - 2 = 1
Hence, the critical value for this test is $\chi^2_{5\%}(1) = 3.84$ (refer to table).

Step 4: Determine test results


Since $\chi^2 = 0.05 < \chi^2_{5\%}(1) = 3.84$, we fail to reject H0.

Step 5: State the conclusion


Hence, we do not have strong evidence to reject the null hypothesis. In conclusion, the data are consistent with a binomial distribution for the number of elective courses taken.
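The binomial fit above can be retraced in Python using only the standard library. This is a minimal sketch under the same choices as the worked example (p estimated from the sample mean, cells from X = 2 upwards pooled); the variable names are illustrative assumptions and the critical value 3.84 is copied from the chi-square table.

```python
from math import comb

# Observed frequencies for X = 0, 1, 2, 3, 4 and "5 or 6" (Worked Example 3.2).
observed = [12, 16, 8, 3, 1, 0]
n_trials = 6                          # n in Bin(n, p)
total = sum(observed)                 # total frequency, 40

# Estimate p from the sample mean: p = mean / n.
mean = sum(x * f for x, f in enumerate(observed)) / total
p_hat = mean / n_trials


def binomial_pmf(x, n, p):
    """P(X = x) for X ~ Bin(n, p)."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)


# Expected frequencies for X = 0 and X = 1; everything from X = 2 upwards
# is pooled into one cell because the later expected frequencies are below 5.
expected_0 = total * binomial_pmf(0, n_trials, p_hat)
expected_1 = total * binomial_pmf(1, n_trials, p_hat)
expected = [expected_0, expected_1, total - expected_0 - expected_1]
pooled_observed = [observed[0], observed[1], sum(observed[2:])]

chi_square = sum((o - e) ** 2 / e for o, e in zip(pooled_observed, expected))

# Two restrictions (total and mean), so df = 3 cells - 2 = 1.
df = len(expected) - 2
CHI_SQUARE_CRITICAL = 3.84            # 5% level, df = 1, from the table
reject_h0 = chi_square > CHI_SQUARE_CRITICAL

print("p =", p_hat, " expected =", [round(e, 2) for e in expected])
print("chi-square =", round(chi_square, 3), " df =", df)
print("Reject H0" if reject_h0 else "Fail to reject H0")
```

Because the script keeps full precision instead of two-decimal expected values, the statistic may differ in the third decimal place from a hand calculation.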

Note: How to determine the number of restrictions

(i) When n is known and p is unknown:
The sample mean needs to be determined to obtain p, and the total frequency, $\sum f$, needs to be multiplied by the calculated probabilities to obtain the expected frequencies. In this case, there are two restrictions, namely the mean and the total, $\sum f$.
(ii) When n and p are known:
The sample mean will not be used. It is sufficient to use $\sum f$ to get the expected frequencies. There is only one restriction, which is the total, $\sum f$.
(b) Poisson Distribution
The process of testing for a Poisson distribution is similar to the one for the binomial distribution. We use the given frequency distribution to test whether the data follow a Poisson distribution with the same mean. The theoretical Poisson distribution is generated using the mean of the data given, and the two distributions are then compared using the chi-square statistic $\chi^2 = \sum \frac{(O-E)^2}{E}$. Whereas the binomial test takes two forms, depending on whether (a) p is unknown or (b) p is known, here we consider the Poisson distribution for one case only. As in the previous case, to generate the expected distribution (when H0 is true), it is necessary to generate a theoretical distribution, and this requires:
(i) the mean parameter, µ; and
(ii) the total frequency, $\sum f$.
Let us see the following example.

Worked Example 3.3


Test whether Poisson distribution can be fitted to the frequency
distribution below:
X 0 1 2 3 4 5 6 or more
Frequency 19 26 27 13 11 2 0

Answer:
Step 1: Construct the appropriate hypothesis test
H0: The frequency follows a Poisson distribution,
that is, H0: X ~ Po(µ), where
$$\hat{\mu} = \bar{x} = \frac{173}{98} = 1.765$$
H1: otherwise

Step 2: Calculate the test statistic


Using this mean value, we can calculate
$$f(x) = \Pr(X=x) = \frac{e^{-\mu}\mu^{x}}{x!}$$
where $\hat{\mu} = \bar{x}$. We can also use the recurrence formula
$$f(x+1) = \frac{\mu}{x+1}\,f(x)$$
to calculate Pr(X = x).

Note: Expected frequency for the j-th cell = Ej = (total frequency) × P(X = x)

Thus, for X = 0, $P(X=0) = \frac{e^{-1.765}(1.765)^0}{0!} = 0.1712$
Therefore, the expected value for the first cell is E = (98)(0.1712) = 16.8, and so on.

The following table is generated based on this information on mean and


observed frequencies:

X 0 1 2 3 4 or More Total
Observed 19 26 27 13 13 98
Expected 16.8 29.6 26.1 15.4 10.1 98

Hence,
$$\chi^2 = \sum\frac{(O-E)^2}{E} = \frac{(19-16.8)^2}{16.8} + \cdots + \frac{(13-10.1)^2}{10.1} = 1.964$$

Step 3: Determine the significance level and rejection region


Since there are 5 pairs of frequencies with 2 restrictions, there will be 5 - 2 = 3 degrees of freedom. Hence, at the 5% level, the critical value of the test is $\chi^2_{5\%}(3) = 7.81$.

Step 4: Determine statistical test result


Since the test statistic, $\chi^2 = 1.964 < \chi^2_{5\%}(3) = 7.81$, we fail to reject H0.

Step 5: State the test conclusion


This means there is no evidence to reject H0, and the data are consistent with a Poisson distribution.
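The Poisson fit can be checked in the same way. The sketch below is illustrative only: it uses the recurrence f(x+1) = µ f(x)/(x+1) mentioned in Step 2, pools the cells from X = 4 upwards as in the worked example, and takes the critical value 7.81 from the chi-square table. The variable names are assumptions.

```python
from math import exp

# Observed frequencies for X = 0, 1, 2, 3, 4, 5 and "6 or more" (Worked Example 3.3).
observed = [19, 26, 27, 13, 11, 2, 0]
total = sum(observed)                                   # 98

# Estimate the Poisson mean from the data.
mu_hat = sum(x * f for x, f in enumerate(observed)) / total

# Probabilities for X = 0..3 via the recurrence f(x+1) = mu * f(x) / (x + 1).
probs = [exp(-mu_hat)]
for x in range(3):
    probs.append(probs[-1] * mu_hat / (x + 1))

# Expected frequencies; the last cell pools "4 or more".
expected = [total * p for p in probs]
expected.append(total - sum(expected))
pooled_observed = observed[:4] + [sum(observed[4:])]

chi_square = sum((o - e) ** 2 / e for o, e in zip(pooled_observed, expected))

# Two restrictions (total and mean), so df = 5 cells - 2 = 3.
df = len(expected) - 2
CHI_SQUARE_CRITICAL = 7.81                              # 5% level, df = 3, from the table
reject_h0 = chi_square > CHI_SQUARE_CRITICAL

print("mean =", round(mu_hat, 3), " expected =", [round(e, 1) for e in expected])
print("chi-square =", round(chi_square, 3), " df =", df)
print("Reject H0" if reject_h0 else "Fail to reject H0")
```

Since the expected frequencies are not rounded to one decimal place here, the statistic may differ slightly from the hand-calculated 1.964.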


(c) Normal Distribution


The goodness-of-fit test for a normal distribution uses a similar method to that for the previous two distributions. However, in general, to test normality for a given frequency distribution, we need the sample mean, $\bar{x}$, the sample standard deviation, s, and the total frequency, $\sum f$. This means there will be three restrictions in getting the expected values. These procedures can be followed:
(i) Calculate the $\bar{x}$ and s values for the given frequency distribution.
(ii) Use the $\bar{x}$ and s values as estimates of µ and $\sigma$, along with the total frequency given, to construct a theoretical normal distribution.
(iii) Compare the observed frequencies with the expected frequencies using the chi-square test with 3 restrictions.

If the values of µ and $\sigma$ for the theoretical distribution are known, it is unnecessary to use the given frequency distribution to estimate both parameters.

Worked Example 3.4


The following table shows information on height (measured to the nearest
centimetre) for 694 nine year-old girls.

Height 117 – 120 121 – 124 125 – 128 129 – 132 133 – 136
Frequency 8 28 82 140 188

Height 137 – 140 141 – 144 145 – 148 149 – 152


Frequency 148 69 15 16

Test this sample to check normality at 5% level.

Answer:

Step 1: Construct the appropriate hypothesis test


H0: The frequency follows a normal distribution,
that is, H0: X ~ N(µ, $\sigma^2$), where $\hat{\mu} = \bar{x}$ = 134.356 and $\hat{\sigma}$ = s = 6.195
H1: otherwise


Step 2: Calculate the test statistic


The expected frequencies below are calculated under the assumption of a normal distribution with the values obtained above, which results in the following table:

Upper class     z = (x - 134.356)/6.195   Pr(Z < z)   p       Expected     Observed
boundary, x                                                   (p × 694)
120.5           -2.24                     0.013       0.013   9.0          8
124.5           -1.59                     0.056       0.043   29.9         28
128.5           -0.95                     0.171       0.115   79.8         82
132.5           -0.30                     0.382       0.211   146.4        140
136.5           0.35                      0.637       0.255   177.0        188
140.5           0.99                      0.839       0.202   140.2        148
144.5           1.64                      0.950       0.111   77.0         69
148.5           2.28                      0.989       0.039   27.1         15
-               -                         1.000       0.011   7.6          16

Hence,
$$\chi^2 = \sum\frac{(O-E)^2}{E} = \frac{(8-9)^2}{9} + \frac{(28-29.9)^2}{29.9} + \cdots + \frac{(16-7.6)^2}{7.6} = 17.21$$

Step 3: Determine the significance level and rejection region


The data provide 9 cells with 3 restrictions, namely the mean, standard deviation and total, giving v = 9 - 3 = 6 degrees of freedom. Hence, we obtain the critical value $\chi^2_{5\%}(6) = 12.59$.

Step 4: Test results


Since $\chi^2 = 17.21 > \chi^2_{5\%}(6) = 12.59$, reject H0.

Step 5: Conclusion
We have evidence to state that the normality assumption on the observed
data is not satisfied.
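The normal fit can also be sketched in Python with the standard library class `statistics.NormalDist`. This is an illustrative sketch only: it takes the mean 134.356 and standard deviation 6.195 from Step 1 as given, and the critical value 12.59 from the chi-square table. Because it uses unrounded normal probabilities instead of three-decimal table values, the statistic may differ somewhat from the hand-calculated 17.21, although it should lead to the same decision.

```python
from statistics import NormalDist

# Upper class boundaries and observed frequencies (Worked Example 3.4).
upper_bounds = [120.5, 124.5, 128.5, 132.5, 136.5, 140.5, 144.5, 148.5]
observed = [8, 28, 82, 140, 188, 148, 69, 15, 16]
total = sum(observed)                               # 694

# Fitted normal distribution, using the estimates stated in Step 1.
fitted = NormalDist(mu=134.356, sigma=6.195)

# Cumulative probabilities at each boundary; the last (open) class ends at 1.
cumulative = [fitted.cdf(b) for b in upper_bounds] + [1.0]

# Class probabilities are differences of successive cumulative values.
probs = [cumulative[0]] + [
    cumulative[i] - cumulative[i - 1] for i in range(1, len(cumulative))
]
expected = [total * p for p in probs]

chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Three restrictions (mean, standard deviation, total): df = 9 - 3 = 6.
df = len(observed) - 3
CHI_SQUARE_CRITICAL = 12.59                         # 5% level, df = 6, from the table
reject_h0 = chi_square > CHI_SQUARE_CRITICAL

print("expected =", [round(e, 1) for e in expected])
print("chi-square =", round(chi_square, 2), " df =", df)
print("Reject H0" if reject_h0 else "Fail to reject H0")
```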


EXERCISE 3.2

1. Test whether the sampling distribution below can be considered as a


binomial distribution (using 5% significance level).
X 0 1 2 3 4 5
Frequency 1 6 14 33 31 15

2. The following table is the distribution of the number of storms


reported by 330 weather stations in the United States of America in
a year.
Number of storms                      0     1     2    3    4    5   >5
Number of stations that report (f)    102   114   74   28   10   2   0

(a) Determine the expected frequency.

(b) Perform a goodness-of-fit test to check whether the above data


can be modelled with Poisson distribution.

3. Use chi-square goodness-of-fit test to determine whether the


following distribution can be considered as normal.
Weight (g) 50-59.9 60-69.9 70-79.9 80-89.9
Number of Specimens 8 10 16 14

Weight (g) 90-99.9 100-109.9 110-119.9


Number of Specimens 10 5 2

3.2 TABULATING QUALITATIVE DATA (CONTINGENCY TABLE)
In some situations, the researcher classifies an experimental unit according to two
qualitative variables. In this case, when two categorical variables are recorded, the
data can be presented in a two-way cross-classification table known as a
contingency table. The contingency table test should not be used when n < 20, or
when 20 < n < 40 and any expected frequency is less than 5. If n > 40, each expected
frequency in the r × c table should have a value of more than 1 (c is the number of
columns for the levels of the first factor, r is the number of rows for the levels of
the second factor).

For example, a political analyst may be interested in investigating whether the
opinions of voting residents concerning a new tax reform are independent of their
political affiliation. A random sample of 1000 registered voters is classified
according to whether they support Party A, B or C and whether or not they favour
a new tax reform. The observed frequencies are presented in the contingency table
shown in Table 3.1.

Table 3.1: Contingency Table

Political Affiliation
Tax Reform Party A Party B Party C Total
For 308 190 102 600
Against 92 160 148 400
Total 400 350 250 1000

Table 3.1 is also called a 2 × 3 (or 3 × 2) table since it consists of two rows and three
columns. The two categorical variables involved are Political Affiliation (at three
levels: Party A, Party B and Party C) and views on tax reform (at two levels:
“For” or “Against”). Each value inside the table (308, 190 and the rest) is the number
of individuals at the intersection of a given political affiliation and a given view on
tax reform. These values are observed frequencies, as they represent the results
obtained in the study, namely the number of individuals in each of the six categories
or cells. For example, the number of people who are members of Party A and agree on
tax reform is 308 (row 1, column 1) while the number of people who are members of
Party C and disagree on tax reform is 148 (row 2, column 3).

Any form of relationship between party membership and views on tax reform can
be investigated by converting the information in Table 3.1 to percentage form
based on:
(a) Total number of individuals involved (1000 people);
(b) Total rows (“For” and “Against”); and
(c) Total columns (Political Affiliation).

The results are displayed in Tables 3.2, 3.3 and 3.4:


Table 3.2: The Cross-Classification between Political Affiliation and Views on Tax
Reform (Percentage Values in the Table are Based on Overall Total)

Political Affiliation
Tax Reform Party A Party B Party C Total
For 30.8 19 10.2 60
Against 9.2 16 14.8 40
Total 40 35 25 100

Table 3.3: The Cross-Classification between Political Affiliation and Views on Tax Reform
(Percentage Values in the Table are Based on Total Rows)

Political Affiliation
Tax Reform Party A Party B Party C Total
For 51.33 31.67 17 100
Against 23 40 37 100
Total 40 35 25 100

Table 3.4: The Cross-Classification between Political Affiliation and Views on Tax
Reform (Percentage Values in the Table are Based on Total Columns)

Political Affiliation
Tax Reform Party A Party B Party C Total
For 77 54.29 40.8 60
Against 23 45.71 59.2 40
Total 100 100 100 100

Various observations can be made from Tables 3.2, 3.3 and 3.4. Some of them are:
(a) 60% of the chosen individuals support tax reform (Table 3.2);
(b) 51.33% of those who support tax reform are from Party A (Table 3.3); and
(c) 77% of individuals from Party A support tax reform (Table 3.4).

The aim of this study is to investigate whether there exists any relationship
between individuals’ opinions on tax reform and their political affiliation. This
can be determined only if we know the expected frequency in each cell under the
assumption that the two variables are independent. The expected frequency can be
written as:


$E_{ij} = N \left( \frac{C_j}{N} \right) \left( \frac{R_i}{N} \right)$

where $E_{ij}$ = expected frequency for the ij-cell
N = overall total number of observations
$C_j$ = total of the jth column, where j = 1, …, $n_j$
$R_i$ = total of the ith row, where i = 1, …, $n_i$

Table 3.5

Political Affiliation
Tax Reform   Party A      Party B      Party C      Total
For          308          190          102          600 (R1)
Against      92           160          148          400 (R2)
Total        400 (C1)     350 (C2)     250 (C3)     1000 (N)

Take a look at Table 3.5. Let us say that we are interested in determining the
expected frequency for individuals from Party A who agree on tax reform, that is
(A and For). Assuming the variables are independent, the expected frequency
(A and For) is

$E_{11} = N \left( \frac{C_1}{N} \right) \left( \frac{R_1}{N} \right) = 1000 \left( \frac{400}{1000} \right) \left( \frac{600}{1000} \right) = 240$

This means it is expected that 24% of the individuals involved in this survey
(240 out of 1000) are from Party A and agree on tax reform. Repeating the process
for all cells gives the results summarised in the following cross-classification table:

Table 3.6
            Party A      Party B      Party C      Total
For         308 (240)    190 (210)    102 (150)    600
Against     92 (160)     160 (140)    148 (100)    400
Total       400          350          250          1000
Note: Figures in brackets are the expected frequencies.
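
The expected frequencies in Table 3.6 can be reproduced with a few lines of code. The following is a minimal Python sketch, assuming the observed counts are entered row by row as in Table 3.1. The function name `expected_frequencies` is illustrative and not part of the module.

```python
def expected_frequencies(table):
    """Expected counts under independence: E_ij = N * (C_j / N) * (R_i / N)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # N * (C_j / N) * (R_i / N) simplifies to R_i * C_j / N.
    return [[r * c / n for c in col_totals] for r in row_totals]


# Rows: For, Against. Columns: Party A, Party B, Party C.
observed = [[308, 190, 102],
            [92, 160, 148]]

expected = expected_frequencies(observed)
for row in expected:
    print(row)
# [240.0, 210.0, 150.0]
# [160.0, 140.0, 100.0]
```

The simplification to (row total × column total) / N is algebraically the same as the formula above and avoids dividing by N twice.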


SELF-CHECK 3.3
Construct a scatter plot on the percentage of those who agree and
disagree on the tax reform according to political affiliation. What can
you conclude?

EXERCISE 3.3

200 people have been randomly selected and classified according to the age of
the respondents (less than 30, and 30 or more) and their preference on type
of cars (locally-produced or imported). The result of this study can be
seen in the table below:

              Types of Cars Preferred
Age           Locally-produced    Imported
<30           68                  42
30 or more    31                  59

Construct an expected frequency table based on the information.

3.2.1 Test of Independence


The question of independence of the two methods of classifications (variables)
can be investigated using a test of hypothesis based on the chi-square statistic. The
steps for the test are:

Step 1: State the null and alternative hypotheses


H0: The two methods of classifications/both variables are independent
H1: Both variables are dependent

Step 2: Determine the significance level and rejection region


The significance level is usually set at α = 0.01 or α = 0.05.

Check the number of rows (r) and columns (c) in the related table. Calculate the
degrees of freedom for the test, $v = (r - 1)(c - 1)$. Next, determine the critical
value (based on the Chi-Square Table), that is $\chi^2_{\alpha,v}$. Reject H0 when the test
statistic value $\chi^2 > \chi^2_{\alpha,v}$.


Step 3: Calculate the test statistic


(a) Calculate the expected frequency values for each cell.

(b) Next, calculate the value of the chi-square statistic, that is $\chi^2 = \sum \frac{(O - E)^2}{E}$.

Step 4: Determine test result


Results of the test depend on information in Steps 2 and 3.

Step 5: State the conclusion


Based on results in Step 4, conclude whether the two variables are independent or
not.

The following is an example of an independence test.

Worked Example 3.5


We will use the example given in Section 3.2. Based on the data and result
obtained, construct and test the appropriate hypothesis (use α = 0.05).
Answer:
Step 1: State the null and alternative hypotheses

H0: The two variables are independent, that is, the level of agreement on tax
reform does not have any relationship with political affiliation.
H1: The two variables are dependent.

Step 2: Determine the significance level and rejection region


The significance level used is α = 0.05.
We have r: number of levels for opinions (2)
c: number of levels for political affiliation (3)
Hence,
$v = (2 - 1)(3 - 1) = 2$
The critical value (based on the Chi-Square table) is $\chi^2_{0.05,2} = 5.991$. Reject H0
when the test statistic value $\chi^2 > \chi^2_{0.05,2} = 5.991$.

Step 3: Calculate test statistic


(a) The data on expected and observed frequencies is summarised in the
following table (the values inside the brackets represent the expected
frequencies calculated using the expected frequency formula in Section 3.2):

            Party A      Party B      Party C      Total
For         308 (240)    190 (210)    102 (150)    600
Against     92 (160)     160 (140)    148 (100)    400
Total       400          350          250          1000

(b) Next, calculate the chi-square statistic value. From the table,

$\chi^2 = \sum \frac{(O - E)^2}{E} = \frac{(308 - 240)^2}{240} + \dots + \frac{(148 - 100)^2}{100} = 91.329$.

Step 4: Determine test result


Since $\chi^2 = 91.329 > \chi^2_{0.05,2} = 5.991$, reject H0.

Step 5: State the conclusion


We have strong evidence to reject H0 and conclude that the two variables are
dependent. This means that individual opinions on tax reform depend on
political affiliation.
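
The five steps of Worked Example 3.5 can be sketched in code as a check on the hand calculation. This is a minimal Python sketch under stated assumptions: the function name `chi_square_independence` is illustrative, and the critical value 5.991 is read from the chi-square table rather than computed.

```python
def chi_square_independence(table):
    """Return (chi-square statistic, degrees of freedom) for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed_count in enumerate(row):
            expected_count = row_totals[i] * col_totals[j] / n
            chi2 += (observed_count - expected_count) ** 2 / expected_count

    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df


# Table 3.1. Rows: For, Against. Columns: Party A, Party B, Party C.
observed = [[308, 190, 102],
            [92, 160, 148]]

chi2, df = chi_square_independence(observed)
critical_value = 5.991  # chi-square table value for alpha = 0.05, v = 2
reject_h0 = chi2 > critical_value

print(f"chi-square = {chi2:.3f}, df = {df}, reject H0: {reject_h0}")
# chi-square = 91.329, df = 2, reject H0: True
```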

EXERCISE 3.4

Refer to Exercise 3.3. Use the results of the exercise and carry out the
appropriate hypothesis test (use α = 0.05).

3.2.2 Test of Homogeneity


The main objective of the Test of Independence is to determine whether two criteria
(variables) that are associated with the subjects in a population are independent of
each other. For example, subjects chosen at random from a population may
be classified according to their views on certain issues and their political
affiliation. In the Test of Homogeneity, we test the hypothesis that the population
proportions within each category are the same (homogeneous). This applies when
either the row or column totals are predetermined. Data is given in the form of a
two-way contingency table, in which one classification is a variable and the
other is the population. It is important to stress that the assumptions and the
statements of the null and alternative hypotheses are different, but the analysis
techniques are the same. Refer to Worked Example 3.6 below.

Worked Example 3.6


A two-year study has been carried out on 120 heart patients who were given
two types of drugs, A and B. After a certain period, the condition of the patients
was classified as “no change”, “shows improvement” and “recovering”. The
following table illustrates the distribution of patients. Determine whether
patients’ conditions are the same even though they were each given a different
type of drug (test at α = 5%).

                      Patients’ Condition
Drug Type    No Change    Shows Improvement    Recovering    Total
A            15           22                   33            70
B            20           18                   12            50
Total        35           40                   45            120

Answer:

Step 1: State the null and alternative hypotheses


H0: For each type of drug (A or B), the condition of patients is the
same, that is, the proportions of the three patients’ conditions are
the same for each type of drug.
H1: Otherwise

Step 2: Determine the significance level and rejection region


Since there are two rows and three columns, the degrees of freedom are
v = (2 – 1)(3 – 1) = 2. Using this value, based on the table, the critical
value (at α = 5%) is $\chi^2_{0.05,2} = 5.991$. Reject H0 if the test statistic $\chi^2 > \chi^2_{0.05,2}$.


Step 3: Calculate test statistic


(a) The following table contains data on expected frequencies (in brackets) as
well as observed frequencies:

                      Patients’ Condition
Drug Type    No Change     Shows Improvement    Recovering    Total
A            15 (20.42)    22 (23.33)           33 (26.25)    70
B            20 (14.58)    18 (16.67)           12 (18.75)    50
Total        35            40                   45            120

(b) With this, we can calculate the test statistic:

$\chi^2 = \sum \frac{(O - E)^2}{E} = \frac{(15 - 20.42)^2}{20.42} + \frac{(22 - 23.33)^2}{23.33} + \dots + \frac{(12 - 18.75)^2}{18.75} = 7.801$

Step 4: Test Result


Since $\chi^2 = 7.801 > 5.991$, we have strong evidence to reject H0.

Step 5: Conclusion
Hence, we can conclude that the condition of the patients depends on the type
of drug that they received.
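
In a Test of Homogeneity it helps to look at the sample proportions within each population before computing the statistic. The following is a minimal Python sketch, assuming the counts of Worked Example 3.6 are stored in a dictionary keyed by drug type (this layout is an illustrative choice). Because the expected frequencies are not rounded here, the statistic comes out at about 7.80, marginally different from the 7.801 obtained above with expected values rounded to two decimal places.

```python
# Observed counts from Worked Example 3.6.
# Columns: no change, shows improvement, recovering.
observed = {"A": [15, 22, 33], "B": [20, 18, 12]}

# Sample proportions of each condition within each drug group.
proportions = {
    drug: [count / sum(counts) for count in counts]
    for drug, counts in observed.items()
}

rows = list(observed.values())
row_totals = [sum(row) for row in rows]
col_totals = [sum(col) for col in zip(*rows)]
n = sum(row_totals)

# Same chi-square statistic as in the Test of Independence.
chi2 = 0.0
for i, row in enumerate(rows):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / n
        chi2 += (o - e) ** 2 / e

critical_value = 5.991  # chi-square table value for alpha = 0.05, v = 2

for drug, p in proportions.items():
    print(drug, [round(value, 3) for value in p])
print(round(chi2, 2), chi2 > critical_value)
```

The proportions show where the two drug groups differ most, while the statistic and the decision rule are computed exactly as in the Test of Independence.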

EXERCISE 3.5

A random sample of 100 female students and 100 male students at a
local university was taken for an interview on their favourite sports. It
was found that 33% of male students preferred football, 38% favoured
basketball, 24% favoured baseball and the rest liked tennis. For female
students, the preferences were quite balanced, with 38% into football,
21% liking basketball, 15% into baseball and the rest favouring tennis.
Determine the classification variables and population involved. How
would you carry out the test? Explain (without calculation).


SELF-CHECK 3.4

State the differences between the Test of Independence and the Test of
Homogeneity.

3.2.3 Yates’s Continuity Correction


Yates’s continuity correction statistic is used as a correction to the chi-square
test, that is:

$\chi^2_{cc} = \sum \frac{(|O - E| - 0.5)^2}{E}$

This correction statistic is used not only for the 2 × 2 table but also for any data
employing the $\chi^2$ test statistic with only one degree of freedom. There are times
when this correction statistic results in a value which is very different from the
value of the ordinary $\chi^2$ test statistic, and this will result in the acceptance of H0.
In this case, a bigger sample size is needed and you would need to repeat the test or
use other appropriate methods.

It is important to know that the continuity correction causes a reduction in the
$\chi^2$ value whenever |O − E| exceeds 0.25 in every cell, a fact that can be seen
through careful analysis of the form of the statistic. If the uncorrected test value
is already in favour of H0 acceptance, we do not have to calculate the $\chi^2_{cc}$ value,
as the correction would not change the test result.

Let us see an example of the application of Yates’s continuity correction.

Worked Example 3.7


Two Form 1 classes (A and B) conducted an examination for their students and
the following results were obtained:

Results Form 1
Class A Class B
Passed 72 64
Failed 17 23


Carry out a hypothesis test to determine whether there exists any difference
between examination results for the two classes using Yates’s continuity
correction. Use 5% significance level.

Answer:
Step 1: State the null and alternative hypotheses
H0: There is no difference in results for students in Class A and B.
H1: Otherwise.

Step 2: Determine the significance level and rejection region

Since there are two rows and two columns, the degrees of freedom are
v = (2 − 1)(2 − 1) = 1. Using this value, we obtain from the chi-square table
(at α = 5%) $\chi^2_{5\%}(1) = 3.84$. Reject H0 if the test statistic $\chi^2_{cc} > \chi^2_{5\%}(1)$.

Step 3: Calculate the test statistic


(a) The following table contains data on expected frequencies (in brackets) as
well as observed frequencies:

Results    Class A      Class B      Total
Pass       72 (68.8)    64 (67.2)    136
Fail       17 (20.2)    23 (19.8)    40
Total      89           87           176

(b) Hence,
$\chi^2_{cc} = \frac{(|72 - 68.8| - 0.5)^2}{68.8} + \dots + \frac{(|23 - 19.8| - 0.5)^2}{19.8} = 0.94$.

Step 4: Determine test result


Since $\chi^2_{cc} = 0.94 < \chi^2_{5\%}(1) = 3.84$, we do not have sufficient evidence to reject H0.

Step 5: State the conclusion


Hence, there is no evidence of a difference in students’ results between
Class A and Class B.
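
The corrected statistic of Worked Example 3.7 can be checked in code. The following is a minimal Python sketch, assuming the hypothetical helper name `yates_chi_square`. It uses unrounded expected frequencies, so it returns roughly 0.96 instead of the 0.94 obtained above with expected values rounded to one decimal place; the test decision is the same.

```python
def yates_chi_square(table):
    """Chi-square statistic with Yates's continuity correction."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    chi2_cc = 0.0
    for i, row in enumerate(table):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n
            # Subtract 0.5 from the absolute difference before squaring.
            chi2_cc += (abs(o - e) - 0.5) ** 2 / e
    return chi2_cc


# Rows: Pass, Fail. Columns: Class A, Class B.
observed = [[72, 64],
            [17, 23]]

chi2_cc = yates_chi_square(observed)
critical_value = 3.84  # chi-square table value for alpha = 5%, v = 1

print(round(chi2_cc, 2), chi2_cc > critical_value)
```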


EXERCISE 3.6

In a study involving 5000 individuals, 2400 of them are males. The
relationship between the level of colour-blindness and sex is of interest.
92% of the males and 98% of the females do not experience colour-
blindness. Write down the information given in a tabular form as below:

               Factor I
Factor II      1      2      Total
A              a      b      nA
B              c      d      nB
Total          n1     n2     n

Test whether there exists any difference in the level of colour-blindness
according to gender using the $\chi^2_{cc}$ statistic. Use 5% significance level.

SELF-CHECK 3.5

When do we use Yates’s continuity correction?


EXERCISE 3.7

1. The following table shows the number of books on loan from a
public library in one particular week.

Day          Number of Books
Monday       204
Tuesday      292
Wednesday    242
Thursday     283
Friday       252
Saturday     275

Is there sufficient evidence to say that the number of books on loan
is the same for all days? Test at 1% significance level.

2. The results of 20 football matches between schools at national level
in 2001 gave the following frequency distribution for the number of
goals scored by the 40 teams:

Number of Goals    0    1    2     3    4    5    6    7    8
Number of Teams    3    5    14    9    3    4    1    1    0

Test whether the frequency distribution for the number of goals
follows a Poisson distribution at 5% significance level.

3. The data below shows the number of defective items produced by a
local company.

Number of Defective Items, X    0     1     2     3     4    5
Frequency                       22    37    20    13    6    2

Test whether the number of defective items follows a binomial
model with n = 5 and p = 0.3. Use 5% significance level.


4. The frequency distribution for the X variable is shown as below:

X    6.000–7.999   8.000–9.999   10.000–11.999   12.000–13.999   14.000–15.999   16.000–17.999   18.000–19.999
f    16            21            38              50              48              36              10

The distribution of X is believed to be $N(13, 3.4^2)$. Test whether this
claim is true or not at 5% significance level.

5. An academic faculty at a university selected 150 students according to
their year of study to investigate the relationship between year of study
and their grade point average value. The following results were
obtained.
Year of Study
Grade Average
Year 1 Year 2 Year 3
<2.0 14 16 15
2.0-3.0 10 11 11
>3.0 26 23 24

Test at 5% significance level.

6. Using the following data, determine whether the three population
proportions are equal in all four categories. Test at 1% significance
level.

                  Category
Population    1     2     3     4     Total
1             16    38    5     41    100
2             24    41    12    23    100
3             19    36    15    30    100


7. 200 students were asked about their opinion on the implementation of
a rule requiring motorcyclists to wear helmets inside the campus area.
Using the $\chi^2$ test with Yates’s continuity correction and the information
given below, test whether there exists any difference in opinion
between male and female students. Test at 1% significance level.

            Opinion
Gender      Agree    Does not Agree
Male        32       11
Female      68       89

Do visit these websites to find out further information on the application of:
• contingency table test
http://www.graphpad.com/quickcalcs/contingency1.cfm
• goodness-of-fit test
http://www.sportsci.org/resource/stats/modelsdetail.html

• Analysis of frequency data is performed to determine:
   • whether the experiment is a multinomial experiment, which involves
     identifying the probabilities and the type of distribution; and
   • dependency and homogeneity between two factors.

• The chi-square statistic is used, that is,

$\chi^2 = \sum \frac{(O - E)^2}{E}$

where O is the observed frequency at each cell/category and E is the
expected frequency calculated using the formula

$E_{ij} = N \left( \frac{C_j}{N} \right) \left( \frac{R_i}{N} \right)$

where $E_{ij}$ = expected frequency for the ij-cell
N = overall total number of observations
$C_j$ = total of the jth column, where j = 1, …, $n_j$
$R_i$ = total of the ith row, where i = 1, …, $n_i$

• The chi-square test statistic approximately follows the chi-square
distribution with k – 1 degrees of freedom in the goodness-of-fit test (less one
further degree of freedom for each parameter estimated from the data) and
(r – 1)(c – 1) degrees of freedom in the Test of Independence and the Test of
Homogeneity.

• In each case, the approximation to the chi-square distribution is accurate if all
expected frequencies are at least 1 and, at most, 20% of the expected
frequencies are less than 5.

• Yates’s continuity correction statistic:

$\chi^2_{cc} = \sum \frac{(|O - E| - 0.5)^2}{E}$

is used when the $\chi^2$ test has only one degree of freedom, as in the 2 × 2
contingency table.



Topic 4 Correlation
LEARNING OUTCOMES
By the end of this topic, you should be able to:
1. Explain the correlation concepts between two variables;
2. Use the two-way scatter plot to show the relationship between two
variables;
3. Compute correlation coefficient to show the relationship between two
variables;
4. State the applications of Pearson and Spearman correlation coefficients;
and
5. Perform correlation coefficient significance test.

INTRODUCTION
Correlation is a measurement of the strength of a linear relationship between two
variables. Both variables are usually denoted as X and Y and their distributions
approximate to normal distribution. There are three types of relationships between
X and Y: positive linear correlation, negative linear correlation and no correlation.
Positive linear correlation means that as one variable increases, the other
variable tends to increase linearly. On the other hand, negative linear correlation
means that one variable tends to decrease linearly as the other variable increases.
No correlation indicates that there is no linear relationship between the variables.


Try to identify the type of correlation (positive, negative and none) that we can
expect from the following:
(a) Students’ grade and their height;
(b) An individual’s weight and the cholesterol level in the blood;
(c) The amount of ice-cream sold and ambient temperature; and
(d) Price of rubber and amount of rainfall.

We can categorise the relationship between two variables into four conditions:
(a) Perfect
(b) Strong
(c) Weak
(d) No linear relationship

The linear relationship for these four conditions can be clearly visualised by using
a graphical display, usually the Two-Way Scatter Plot. However, judgment
based on the graph is very subjective and at times, not accurate. As such, to
accurately determine the condition of this relationship, a quantitative
measurement known as correlation coefficient is deployed. It is usually denoted
by ρ for population, which is usually unknown. This population parameter is
estimated by the sample correlation coefficient, r. The value of r always lies
between –1 and 1, i.e.
–1.00 ≤ r ≤ +1.00

The following Table 4.1 gives an explanation on the r values for each of the four
conditions on the relationship between two variables.


Table 4.1: r Values and Relationship between Two Variables

No.   r Value              Relationship between Two Variables
1     r = –1.00            There exists a perfect negative linear relationship
      r = +1.00            There exists a perfect positive linear relationship
2     –1.00 < r < –0.50    There exists a strong negative linear relationship
      +0.50 < r < +1.00    There exists a strong positive linear relationship
3     –0.50 < r < 0        There exists a weak negative linear relationship
      0 < r < +0.50        There exists a weak positive linear relationship
4     r = 0                No linear relationship exists between the two variables

There are two types of sample correlation coefficients, which are, the Pearson
correlation coefficient and Spearman correlation coefficient. Their application
depends on the types of data. The Pearson correlation coefficient is used for
quantitative data; in discrete and continuous form. On the other hand, the
Spearman correlation coefficient is used for ranking data, hence the name
Spearman Rank correlation coefficient is sometimes used.

SELF-CHECK 4.1

What is the meaning of “correlation”? Explain the differences in the


correlation situations between two variables.

4.1 TWO-WAY SCATTER PLOT


A two-way scatter plot is a two-dimensional graph that shows the relationship
between two variables X and Y. The horizontal axis represents the X variable and
the vertical axis represents the Y variable. Below are some of the plots that explain
the relationship between the two variables.
Figure 4.1(a) shows the X and Y variables having a perfect positive linear
relationship, as the value of Y increases as the value of X increases and all
points fall on one straight line. On the other hand, Figure 4.1(b) shows a
perfect negative linear relationship, as the value of Y decreases as the value
of X increases and all points fall on one straight line.


Figure 4.1(a): Perfect positive linear relationship
Figure 4.1(b): Perfect negative relationship

Figures 4.2 (a) and 4.2(b) display strong positive and negative linear relationship
respectively. It is said to be strong as most points fall near the straight line.

Figure 4.2(a): Strong positive linear correlation
Figure 4.2(b): Strong negative linear correlation

Figures 4.3(a) and 4.3(b) show weak positive and negative linear relationships
respectively. They are said to be weak as most points fall far from the straight line.


Figure 4.3(a): Weak positive linear correlation
Figure 4.3(b): Weak negative linear correlation

Figure 4.4 shows no linear correlation between the X and Y variables.

Figure 4.4: No linear relationship

Let us try the following exercise on the application of the two-way scatter plot.


EXERCISE 4.1

For each pair of the following two-way scatter plots, identify which one
(between 1 and 2) has a higher value of the correlation coefficient r and
state its direction:

(a)

(b)

(c)


4.2 PEARSON CORRELATION COEFFICIENT


The Pearson correlation coefficient ($r_p$) is used for quantitative data, in both
discrete and continuous forms. It is generated from the Pearson product moment
for n pairs of values (X, Y). The formula is:

$r_p = \frac{s_{xy}}{s_x s_y}$    (4.1)

where
$s_{xy}$ = covariance(X, Y) = $\frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$
$s_x$ = standard deviation of X = $\sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$
$s_y$ = standard deviation of Y = $\sqrt{\frac{\sum (y_i - \bar{y})^2}{n - 1}}$

Equation (4.1) can be simplified to:

$r_p = \frac{n \sum x_i y_i - (\sum x_i)(\sum y_i)}{\sqrt{\left[ n \sum x_i^2 - (\sum x_i)^2 \right] \left[ n \sum y_i^2 - (\sum y_i)^2 \right]}}$    (4.2)

In calculating the Pearson correlation coefficient using Equation (4.2), it is easier
if the values are arranged in a table as below:

xi      yi      xi yi      xi²      yi²
x1      y1      x1 y1      x1²      y1²
x2      y2      x2 y2      x2²      y2²
x3      y3      x3 y3      x3²      y3²
.       .       .          .        .
.       .       .          .        .
xn      yn      xn yn      xn²      yn²
Σxi     Σyi     Σxi yi     Σxi²     Σyi²


Worked Example 4.1


A teacher would like to prove to her students the negative effect of playing
computer games on their studies. She claims that the longer the time (in hours
per week) spent on playing computer games, the lower their examination marks
will be. A random sample of 10 students was taken and their exam marks were
recorded as below:

Time (hours per week) 4 10 14 12 4 5 8 11 13 15


Examination marks 26 17 7 12 30 40 20 15 10 5

Answer:
To prove that there exists a negative linear relationship between time spent (in
hours per week) and students’ examination marks, we are going to calculate the
Pearson correlation coefficient, rp . In this problem, we can define time spent on
computer games as X variable (independent variable) and students examination
marks as Y variable (dependent variable). Next, we can construct the following
table:

xi yi xi yi xi2 yi2
4 26 104 16 676
10 17 170 100 289
14 7 98 196 49
12 12 144 144 144
4 30 120 16 900
5 40 200 25 1600
8 20 160 64 400
11 15 165 121 225
13 10 130 169 100
15 5 75 225 25
96 182 1366 1076 4408


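$r_p = \frac{n \sum x_i y_i - (\sum x_i)(\sum y_i)}{\sqrt{\left[ n \sum x_i^2 - (\sum x_i)^2 \right] \left[ n \sum y_i^2 - (\sum y_i)^2 \right]}} = \frac{10(1366) - 96(182)}{\sqrt{\left[ 10(1076) - (96)^2 \right] \left[ 10(4408) - (182)^2 \right]}} = -0.927$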
The Pearson correlation coefficient value of –0.927 shows that there exists a strong
negative linear relationship between time spent on computer games and
students’ examination marks. Hence, we can conclude that students who spend a
large amount of time playing computer games tend to have a poorer
performance in studies.
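
The calculation with Equation (4.2) can be verified with a short script. The following is a minimal Python sketch, assuming the ten data pairs of Worked Example 4.1; the function name `pearson_r` is illustrative and not part of the module.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient using the computational form (4.2)."""
    n = len(x)
    sum_x = sum(x)
    sum_y = sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)

    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


hours = [4, 10, 14, 12, 4, 5, 8, 11, 13, 15]   # time spent, hours per week
marks = [26, 17, 7, 12, 30, 40, 20, 15, 10, 5]  # examination marks

r_p = pearson_r(hours, marks)
print(round(r_p, 3))  # -0.927
```

The intermediate sums in the function correspond to the column totals of the table (96, 182, 1366, 1076 and 4408).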

EXERCISE 4.2
A farmer would like to know whether a new fertiliser that he uses is
effective in increasing his crop production. He recorded the frequency of
using the fertiliser at seven areas in his farm and the production results of
his crop in each of those areas. The following data is obtained:

Area of the Farm A1 A2 A3 A4 A5 A6 A7


Frequency of fertiliser used 1 2 4 5 6 8 10
Crop production (in kg) 2 3 4 7 12 10 7

Obtain the correlation between fertiliser and crop production. Give your
opinion on the value of the correlation coefficient obtained.

4.2.1 Pearson Correlation Coefficient Significance Test


The calculated Pearson correlation coefficient ( rp ) value is a sample statistic;
hence it is only an estimation of the actual population parameter for correlation, ρ.
In this case, we need to test its significance so that we can make a conclusion on
the population parameter ρ. This conclusion is based on the information from
sample correlation coefficient rp .


When there exists a non-linear relationship or no linear relationship between two
variables, then $\rho_p = 0$. To determine whether we can conclude that the population
parameter $\rho_p$ is not zero, we will test the following hypothesis:

H0 : $\rho_p = 0$
H1 : 1. $\rho_p > 0$
     2. $\rho_p < 0$
     3. $\rho_p \neq 0$

Test Statistic : $T = r_p \sqrt{\frac{n - 2}{1 - r_p^2}}$

Test Result : T follows a t distribution with v = n – 2 degrees of freedom
and α significance level.

Reject H0 when : 1. $T > t_{\alpha,v}$
                 2. $T < -t_{\alpha,v}$
                 3. $|T| > t_{\alpha/2,v}$

Worked Example 4.2


Refer to Worked Example 4.1. Perform the Pearson correlation coefficient $r_p$
significance test at 0.05 significance level.

Answer:
We will use one-tailed hypothesis testing since we know that the correlation
coefficient $r_p$ value is negative. Hence,

H0 : $\rho_p = 0$
H1 : $\rho_p < 0$

Test Statistic: $T = r_p \sqrt{\frac{n - 2}{1 - r_p^2}} = -0.927 \sqrt{\frac{10 - 2}{1 - (-0.927)^2}} = -6.99$


Test Result: T follows a t distribution with v = 10 – 2 = 8 degrees of
freedom and 0.05 significance level.

Reject H0 when:
$T < -t_{0.05,8} = -1.86$

Since $T = -6.99 < -t_{0.05,8} = -1.86$, we reject the null hypothesis and we have
strong evidence to conclude that $\rho_p < 0$. This shows that there exists a
significant negative linear relationship at 5% significance level. If a student
spends most of his time on playing computer games, this results in less time spent
on revision, hence the poor academic performance.
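
The test statistic and decision in Worked Example 4.2 can be reproduced as follows. This is a minimal Python sketch, assuming the rounded value of –0.927 for the sample correlation and the tabulated critical value of 1.86; the function name `correlation_t_statistic` is illustrative only.

```python
from math import sqrt


def correlation_t_statistic(r, n):
    """T = r * sqrt((n - 2) / (1 - r^2)), with n - 2 degrees of freedom."""
    return r * sqrt((n - 2) / (1 - r ** 2))


r_p = -0.927   # sample correlation from Worked Example 4.1
n = 10         # number of (X, Y) pairs

t_stat = correlation_t_statistic(r_p, n)
critical_value = 1.86  # t table value for alpha = 0.05, v = 8 (one-tailed)

# Left-tailed test: reject H0 when T < -t(alpha, v).
reject_h0 = t_stat < -critical_value

print(round(t_stat, 2), reject_h0)  # -6.99 True
```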

EXERCISE 4.3

Refer to Exercise 4.2. Test the significance of the Pearson correlation
coefficient $\rho_p$ at 1% significance level.

4.3 SPEARMAN RANK CORRELATION COEFFICIENT
Another type of correlation coefficient is called the Spearman rank correlation
coefficient ($r_s$), which is also a sample correlation coefficient. Unlike the Pearson
correlation coefficient, the Spearman correlation coefficient is used when the data
is qualitative in nature and can be ranked. The rank for the X variable value is
denoted by the symbol U, while the rank for the Y variable value is denoted by the
symbol V. The Spearman rank correlation coefficient for n pairs of ranks (U, V)
follows this equation:

$r_s = \frac{s_{uv}}{s_u s_v}$    (4.3)

where
$s_{uv}$ = covariance(U, V)
$s_u$ = standard deviation of U
$s_v$ = standard deviation of V


However, to facilitate the calculation, Equation (4.3) can be simplified to:

$r_s = 1 - \frac{6 \sum D^2}{n(n^2 - 1)}$    (4.4)

with D = U – V, that is, the difference between rank U and rank V. The calculation
process using Equation (4.4) can be further simplified if the values are placed in a
table such as the following:

xi      ui      yi      vi      Di      Di²
x1      u1      y1      v1      D1      D1²
x2      u2      y2      v2      D2      D2²
x3      u3      y3      v3      D3      D3²
.       .       .       .       .       .
.       .       .       .       .       .
xn      un      yn      vn      Dn      Dn²
                                        ΣD²

Worked Example 4.3


A teacher would like to know whether there exists any difference between male
and female students on the level of difficulty in Form 5 subjects taken by social
science students. The teacher asked 10 male students and 10 female students to
provide their score on the level of difficulty in 10 subjects. These 10 subjects are
Additional Mathematics (AM), History (HIS), Modern Mathematics (MM),
Geography (GEO), English Language (EL), Accounting Principles (AP), Islamic
Studies (IS), Science (SC), Finance (FIN) and Malay Language (ML). Each
student is requested to give a score of 1-5 to each subject. Score 1 refers to the
easiest while Score 5 refers to the most difficult. Is there a relationship between
male and female students on the level of difficulty for the subjects that they took?
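
| Subjects | AM | HIS | MM | GEO | EL | AP | IS | SC | FIN | ML |
|---|---|---|---|---|---|---|---|---|---|---|
| Male Students' Score | 45 | 33 | 20 | 24 | 43 | 39 | 13 | 36 | 28 | 15 |
| Female Students' Score | 34 | 15 | 30 | 38 | 26 | 45 | 17 | 20 | 48 | 40 |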


Answer:
Firstly, we need to determine the X and Y variables. Define the total subject scores
given by male students as the X variable and the total subject scores given by female
students as the Y variable. Prior to obtaining the Spearman rank correlation
coefficient $r_s$, we need to convert the data into rank form. The scores can be
ranked in descending order, that is, the highest score is given rank 1, the second
highest rank 2 and the lowest score rank 10. The subjects' ranks for males (U)
and females (V), along with their differences (D), are displayed in the following
table:
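
| $x_i$ | $u_i$ | $y_i$ | $v_i$ | $D_i$ | $D_i^2$ |
|---|---|---|---|---|---|
| 45 | 1 | 34 | 5 | -4 | 16 |
| 33 | 5 | 15 | 10 | -5 | 25 |
| 20 | 8 | 30 | 6 | 2 | 4 |
| 24 | 7 | 38 | 4 | 3 | 9 |
| 43 | 2 | 26 | 7 | -5 | 25 |
| 39 | 3 | 45 | 2 | 1 | 1 |
| 13 | 10 | 17 | 9 | 1 | 1 |
| 36 | 4 | 20 | 8 | -4 | 16 |
| 28 | 6 | 48 | 1 | 5 | 25 |
| 15 | 9 | 40 | 3 | 6 | 36 |
| | | | | | $\sum D^2 = 158$ |

$$r_s = 1 - \frac{6\sum D^2}{n(n^2 - 1)} = 1 - \frac{6(158)}{10(10^2 - 1)} = 0.0424$$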

The Spearman correlation coefficient value is 0.0424. This means that there is
almost no relationship between the rankings given by male and female students for
the level of difficulty of the subjects that they took.
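
The ranking and Equation (4.4) in this worked example can be reproduced with a short script. This is only a sketch for checking the arithmetic: the helper names `descending_ranks` and `spearman_rs` are made up for illustration, and the ranking helper assumes there are no tied scores (which is true for this data set).

```python
# Scores from Worked Example 4.3 (order: AM, HIS, MM, GEO, EL, AP, IS, SC, FIN, ML)
male = [45, 33, 20, 24, 43, 39, 13, 36, 28, 15]
female = [34, 15, 30, 38, 26, 45, 17, 20, 48, 40]


def descending_ranks(values):
    """Give rank 1 to the largest value. Assumes no tied values (illustrative helper)."""
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 for v in values]


def spearman_rs(x, y):
    """Return (r_s, sum of D^2) using Equation (4.4)."""
    u = descending_ranks(x)
    v = descending_ranks(y)
    n = len(x)
    sum_d2 = sum((ui - vi) ** 2 for ui, vi in zip(u, v))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1)), sum_d2


rs, sum_d2 = spearman_rs(male, female)
print(sum_d2, round(rs, 4))  # 158 0.0424
```

The ranks produced by `descending_ranks(male)` match the $u_i$ column of the table above.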

SELF-CHECK 4.2

What is the difference between the Spearman and Pearson correlation
coefficients? You may discuss with your tutor.


EXERCISE 4.4
Ten athletes were each given a ranking at the beginning of a sports match that
they took part in. After the match, their positions in the match were
recorded as in the table below:

| Athlete | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| Ranking | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
| Position in the match | 3 | 5 | 2 | 1 | 10 | 4 | 9 | 7 | 8 | 6 |

Obtain the correlation between the athletes' rankings and their positions in the match.
Comment on the value of the correlation coefficient.

4.3.1 Spearman Rank Correlation Coefficient Significance Test
The Spearman rank correlation coefficient ($r_s$) is an estimate of the population
parameter $\rho_s$. When there is no relationship between the ranks of two
variables, $\rho_s = 0$. To determine whether we can conclude that the
population parameter $\rho_s$ is not zero, we carry out the following
hypothesis test:

H0 : $\rho_s = 0$
H1 : 1. $\rho_s > 0$
     2. $\rho_s < 0$
     3. $\rho_s \neq 0$

Test Statistic :
$$T = r_s\sqrt{\frac{n-2}{1 - r_s^2}}$$

Test result : T follows a t distribution with v = n – 2 degrees of freedom at
the $\alpha$ significance level.
Reject H0 when: 1. $T > t_{\alpha,v}$
                2. $T < -t_{\alpha,v}$
                3. $|T| > t_{\alpha/2,v}$


Worked Example 4.4


Refer to Example 4.3. Test the significance of the Spearman Rank coefficient rs
at 0.05 level.

Answer:
We will use a one-sided hypothesis test as we know that the correlation
coefficient value $r_s$ is positive. As such,

H0 : $\rho_s = 0$
H1 : $\rho_s > 0$

Test Statistic:
$$T = r_s\sqrt{\frac{n-2}{1 - r_s^2}} = 0.0424\sqrt{\frac{10-2}{1 - (0.0424)^2}} = 0.12$$

Test result: T follows a t distribution with v = 10 – 2 = 8 degrees of freedom at
the 0.05 significance level.
Reject H0 when $T > t_{0.05,8} = 1.86$

Since $T = 0.12 < t_{0.05,8} = 1.86$, we do not reject the null hypothesis ($\rho_s = 0$).
Hence, there is not enough evidence to conclude that the population parameter
$\rho_s$ differs from zero. In other words, the data show no significant relationship
between the opinions of male and female students.
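
The test statistic in this example is easy to check numerically. The sketch below uses only the Python standard library; the function name `correlation_t_statistic` is illustrative, and the critical value 1.86 is simply copied from the t table value quoted in the text rather than computed.

```python
import math


def correlation_t_statistic(r, n):
    """T = r * sqrt((n - 2) / (1 - r^2)), with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))


t_stat = correlation_t_statistic(0.0424, 10)
t_critical = 1.86  # t_{0.05, 8}, taken from the t table as quoted in the text

reject_h0 = t_stat > t_critical  # one-sided test, H1: rho_s > 0
print(round(t_stat, 2), reject_h0)  # 0.12 False
```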

EXERCISE 4.5

Refer to Exercise 4.4. Test the significance of the correlation coefficient
at the 1% significance level.


After studying this topic, do you know when the Pearson and Spearman
coefficients are used? If you are still unclear, please reread Topic 4 carefully.
When you have understood, let us try the exercises below.

EXERCISE 4.6
1. State the importance of the two-way scatter plot.
2. State the importance of the sign of the correlation coefficient r.
3. For each of the statements below, state whether the Pearson or
Spearman correlation coefficient should be employed to find the
relationship between two variables.
(a) A school principal would like to know whether the quality of
teaching received by students will affect their grades.
(b) The relationship between time spent on study revision and
grades obtained by students.
(c) A landlord’s claim that the rate of house rental depends on the
number of rooms in a house.
(d) Will getting a good CGPA guarantee a high starting salary?
(e) A manager at a firm would like to find out the relationship
between his employees’ aptitude test scores taken prior to
joining the firm and their work performance three months after
they joined.
4. A bank would like to reduce the waiting time of its customers at the
counter. For this purpose, the bank would like to know the relationship
between average waiting time (Y) and number of tellers at the counter
(X). Several customers were chosen at random and their data are
tabulated as below:

| X | 4 | 1 | 5 | 3 | 4 | 3 | 3 | 2 | 2 | 6 | 3 | 2 | 4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Y | 6.4 | 8.7 | 3.2 | 10.5 | 8.2 | 11.3 | 11.3 | 12.8 | 11.6 | 3.2 | 9.4 | 12.8 | 8.2 |

(a) Determine whether the relationship between number of tellers
and customers' waiting time is positive or negative.
(b) Measure the strength of relationship obtained in (a).
(c) Test the significance of this relationship at 1% level.
(d) State your conclusion based on the answers in (b) and (c).


5. A personnel manager would like to check the effectiveness of a
procedure on salesman selection. In this procedure, candidates are
required to sit for an aptitude test prior to being interviewed by the
manager. For this purpose, the manager has selected the sales records
of 10 new salesmen along with their aptitude test scores and their
performance rankings during their interviews. The data obtained are as
below:

| Salesman | Ali | Jay | Mus | Tan | Boi | Mat | Lia | Goh | Wan | Zek |
|---|---|---|---|---|---|---|---|---|---|---|
| Interview rank | 5 | 3 | 1 | 9 | 6 | 4 | 10 | 2 | 7 | 8 |
| Test score | 50 | 68 | 45 | 68 | 78 | 68 | 60 | 56 | 76 | 72 |
| Sales ('000) | 17 | 32 | 27 | 46 | 55 | 45 | 36 | 28 | 18 | 66 |

(a) Find the correlation coefficient value for both relationships and
comment on the correlation coefficient value obtained.

(b) Test whether the coefficient value is significant at 0.05
significance level.

(c) Based on your answer in (a) and (b), what are the criteria needed
to get a salesman who will bring profit to the company?

Please visit the following website to find more information about correlation and
regression:
http://www.pinkmonkey.com/studyguides/subjects/stats/chap6/s0606101.asp

• This topic has explained how the relationship between two variables can be
  derived.

• A two-way scatter plot can be used to roughly show the relationship between
  two variables, whether it is positive linear, negative linear or there is no
  relationship.

• The correlation coefficient measures the strength of the above linear relationship.

• A significance test is performed on the correlation coefficient value to
  draw a conclusion about the population correlation coefficient parameter $\rho$
  based on the information obtained from the sample correlation coefficient value r.



Topic 5 Simple Linear Regression Analysis
LEARNING OUTCOMES
By the end of this topic, you should be able to:
1. Explain regression concepts;
2. Construct a simple linear regression model and identify the assumptions
made;
3. Apply the least squares method to estimate the parameters in a simple
linear regression model;
4. Identify inferential concepts for the regression parameters;
5. Use appropriate methods to evaluate data suitability in fitting a
regression model; and
6. Use regression analysis for prediction.

INTRODUCTION
In Topic 4, we learned how to visually check for the relationship between two
variables using the two-way scatter plot as well as how to measure the strength of
this relationship using correlation. If a relationship exists, we would like to know
the form of the relationship. Once we have expressed the relationship as an
equation, we will be able to predict the value of one variable given the
value of the other variable. In this topic, we use a statistical method called
Simple Linear Regression to examine a linear relationship between two
variables. Only quantitative variables are considered in this case.


5.1 INTRODUCTION TO REGRESSION CONCEPTS
Regression analysis concepts deal with finding the best relationship between
dependent variable Y and independent variable X, quantifying the strength of that
relationship, and the use of methods that allow for prediction of the response
values (Y) given values of the regressor X. The y variable value can only be
determined if the independent variables values (denoted by x1 , x2 ,..., xk where k is
the number of independent variables) are known.

Examples of dependent variables are the amount of electrical consumption in a
house, profit made by a company, students' final examination grades, selling price
of a house, etc. These are considered dependent variables as their values
depend on other variables. For example, the amount of electrical consumption in a
house depends on the day's outside temperature. If the temperature is high, then
the occupants of the house would most probably turn on their air-conditioner or
fan to cool themselves. Hence, we can say that temperature is an independent
variable since it is a factor that influences the amount of electrical consumption in
a house. Another possible independent variable is the number of electrical appliances
in a house: the more it has, the greater the amount of electrical consumption.

Regression analysis is used to determine the mathematical relationship between
these variables through a linear equation termed the regression model. From the
model, we can predict the y value for a given value of x.

SELF-CHECK 5.1

Try to think of the independent variables for the following dependent
variables:

(a) Profit made by a firm;
(b) Students' final examination grade; and
(c) Selling price of a house.


5.2 SIMPLE LINEAR REGRESSION MODEL AND ITS ASSUMPTIONS
A simple linear regression model involves only one independent variable, that is,
the k = 1 case. Multiple linear regression is employed for cases involving more than
one independent variable (k > 1). A simple linear regression model is written as

$$y = \beta_0 + \beta_1 x + \varepsilon$$

where $\varepsilon$ refers to the random variable for errors/residuals. Errors/residuals
exist because relationships between variables are imperfect and measurements are
rarely made without error. To further understand errors, let us look at the following
example:

A property development manager would like to know the estimated selling price
for each house that will be built. He knows that the cost of building a house is
RM90 for each square foot and the land price is RM25,000 for an area of 4,500
square feet. Hence, the manager can estimate the selling price using the equation
below:

$$y = 25{,}000 + 90x \tag{5.1}$$

where y = selling price and x = house size in square feet. If the house is 2,000
square feet, the price would be RM205,000, that is

$$y = 25{,}000 + 90(2{,}000) = 205{,}000$$

However, this is only an estimated price. The actual price (based on observation)
would be between RM180,000 and RM250,000. For this reason, to reflect the
actual situation, another simple linear regression model replaces the previous
model, that is:

$$y = 25{,}000 + 90x + \varepsilon \tag{5.2}$$

where $\varepsilon$ is a random variable for errors representing all other variables which
are not considered in Equation (5.1). In other words, the selling price for the same
size will also differ due to other factors such as location, number of bedrooms,
toilets and other unknown factors.

The simple linear regression model $y = \beta_0 + \beta_1 x + \varepsilon$ is a population
model and the regression coefficients $\beta_0$ and $\beta_1$ are population parameters.
It is difficult to obtain the values of these population parameters, so sample data
are collected to estimate them. The estimation model is shown in
Equation (5.3).

$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x \tag{5.3}$$

Here, $\hat{y}$ is the predicted/fitted value for y, $\hat{\beta}_0$ is the estimate of the
population parameter $\beta_0$ and $\hat{\beta}_1$ is the estimate of the population
parameter $\beta_1$. The estimation model (5.3) is a linear equation with $\hat{\beta}_1$
as the regression slope and $\hat{\beta}_0$ as the y-intercept, which is the y value when
x is zero (refer to Figure 5.1). However, in most cases, when x = 0, the y value does
not carry any significant meaning and at times x = 0 is not possible. The slope of a
straight line is a fixed value that explains the change (increase or decrease) in the
y value given a one-unit change in the x value.

Figure 5.1: Estimation model (fitted line $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$)

Errors (refer to Figure 5.1) are the differences between the observed y values
and the fitted $\hat{y}$ values. They are denoted by $\varepsilon_i$ for i = 1, 2, …, n and
the formula is:

$$\varepsilon_i = y_i - \hat{y}_i \tag{5.4}$$

The residual $\varepsilon_i$ is a random variable. To determine whether a calculated simple
linear regression is a good estimate for the population, we need to ensure that the
$\varepsilon_i$ random variable satisfies certain conditions. The assumptions made on the
$\varepsilon_i$ random variable are:

(a) $\varepsilon_i$ is normally distributed; that is, $\varepsilon_i \sim N(0, \sigma^2)$, i = 1, 2, …, n.

(b) The mean of $\varepsilon_i$ is zero; that is, $E(\varepsilon_i) = 0$, i = 1, 2, …, n.

(c) The standard deviation of $\varepsilon_i$ is $\sigma$, which is fixed; that is,
    $\sigma(\varepsilon_i) = \sigma$, i = 1, 2, …, n.

(d) $\varepsilon_i$ for any y value is independent of $\varepsilon_i$ for other values of y.

Assumption (a) is made to facilitate the inferential processes (hypothesis test and
confidence interval) on the significance of the relationship between x and y, as
displayed by the fitted line. Assumptions (b) and (c) refer to the linearity of a
regression model. Suppose we have the population regression model as below:

$$y = \beta_0 + \beta_1 x + \varepsilon \tag{5.5}$$

For each x value, y is normally distributed with mean

$$E(y \mid x) = \beta_0 + \beta_1 x \tag{5.6}$$

and standard deviation

$$\sigma(y) = \sigma_\varepsilon \tag{5.7}$$

Observe from Equation (5.6) that the mean E(y) depends on x, but the standard
deviation does not depend on x. This is because $\sigma_\varepsilon$ is fixed for all x values.
The visual display of a simple linear regression is shown in Figure 5.2 below.

Figure 5.2: Simple linear regression model


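EXERCISE 5.1

Given the regression equation $\hat{y} = -12.84 + 36.18x$, state the values of $\hat{\beta}_0$
and $\hat{\beta}_1$ and explain both values. Next, calculate the residuals using the
following data:

| x | 8.3 | 8.3 | 12.1 | 12.1 | 17.0 | 17.0 | 17.0 | 24.3 | 24.3 | 24.3 | 33.6 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| y | 227 | 312 | 362 | 521 | 640 | 539 | 728 | 945 | 738 | 759 | 1263 |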

5.3 THE LEAST SQUARES METHOD


Using the available data, how can we derive a simple linear regression model
$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$? In Topic 4, we learned about the two-way scatter plot
to visualise the relationship between two variables, or in this case between the
independent and dependent variables. If there exists a perfect linear relationship,
we would be able to draw a straight line through all the available data points.
However, this situation is rare due to errors/residuals.

You can try some online activities on these websites:

• http://www.emathzone.com/tutorial/basic-statistics/example-method-of-least-squares.html
• http://www.texasoft.com/winkslr.html

When a straight line cannot pass through all the data (points (x, y) on the graph),
what must we do to obtain the best straight line? The best straight line refers to the
fitted straight line in the two-way scatter plot that best represents the
relationship between the two variables. This fitted line is the straight line
that is closest to the points (x, y), that is, the line for which the errors between
the (estimated) points on the line and the actual observed points are minimised.
However, the total error $\sum \varepsilon_i$ does not represent the distance between the
fitted and observed points.

Let us look at an example to show why $\sum (y_i - \hat{y}_i)$ is not suitable to represent
the distance between the fitted and observed points.


Figure 5.3(a): Data (a)

Figure 5.3(b): Data (b)

With reference to Figures 5.3(a) and 5.3(b), we can see that the positions of the
two data sets [data (a) and (b)] are different. The total errors for data (a) and data
(b) are calculated as:


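| Data (a): $\varepsilon_i = y_i - \hat{y}_i$ | Data (b): $\varepsilon_i = y_i - \hat{y}_i$ |
|---|---|
| 8 – 6 = 2 | 7 – 6 = 1 |
| 1 – 5 = –4 | 6 – 5 = 1 |
| 6 – 4 = 2 | 2 – 4 = –2 |
| $\sum \varepsilon_i = 0$ | $\sum \varepsilon_i = 0$ |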

Total errors are zero for both data (a) and (b), and this always holds for a least
squares line. Taken at face value, a total of zero would suggest that the points in
data (a) and data (b) are equally distant from the regression line. However, from
both graphs in Figure 5.3, we can see that this is not true. The positions of the data
points relative to the regression line differ: data points (b) are closer to the
regression line than data points (a). Hence, $\sum \varepsilon_i$ is not suitable to be used
as a selection criterion.

So, how can we solve this problem? It can be solved if we square each error
before summing them up. The following table gives the values of
$\sum (y_i - \hat{y}_i)^2$ for data (a) and (b).

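| Data (a): $\varepsilon_i^2 = (y_i - \hat{y}_i)^2$ | Data (b): $\varepsilon_i^2 = (y_i - \hat{y}_i)^2$ |
|---|---|
| (8 – 6)² = 4 | (7 – 6)² = 1 |
| (1 – 5)² = 16 | (6 – 5)² = 1 |
| (6 – 4)² = 4 | (2 – 4)² = 4 |
| $\sum \varepsilon_i^2 = 24$ | $\sum \varepsilon_i^2 = 6$ |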

Based on the $\sum (y_i - \hat{y}_i)^2$ values for both data (a) and (b), the sum of
squared errors for data (b) is smaller than that for data (a). This shows that the
points in data (b) are nearer to their regression line. The method of obtaining the
best fitted line by choosing the line with the smallest sum of squared errors is
known as the least squares method.
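
The comparison between the total error and the sum of squared errors can be checked with a few lines of code. This is a minimal sketch: the (observed, fitted) pairs are read off the two tables above, and the function names are illustrative only.

```python
# (observed y, fitted y-hat) pairs taken from the tables for data (a) and data (b)
data_a = [(8, 6), (1, 5), (6, 4)]
data_b = [(7, 6), (6, 5), (2, 4)]


def total_error(pairs):
    """Sum of the raw errors; positive and negative errors cancel out."""
    return sum(y - y_hat for y, y_hat in pairs)


def sum_squared_error(pairs):
    """Sum of the squared errors; every term is non-negative, so nothing cancels."""
    return sum((y - y_hat) ** 2 for y, y_hat in pairs)


print(total_error(data_a), total_error(data_b))              # 0 0
print(sum_squared_error(data_a), sum_squared_error(data_b))  # 24 6
```

Both data sets have a total error of zero, yet their sums of squared errors differ, which is why the squared criterion is the one that distinguishes them.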


To fit the regression line, we need to get the estimates for regression coefficients
$\beta_0$ and $\beta_1$. Using the least squares method, the formula for the estimate of
regression coefficient $\beta_1$ is:

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2} \tag{5.8}$$

where
$\hat{\beta}_1$ = Estimated value of regression coefficient $\beta_1$
$x_i$ = Value of independent variable
$y_i$ = Value of dependent variable
$\bar{x}$ = Mean value of independent variable
$\bar{y}$ = Mean value of dependent variable
n = Number of (x, y) pairs

After getting the estimate for $\beta_1$, we can derive the value for $\beta_0$. The formula
to get $\hat{\beta}_0$ is:

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \tag{5.9}$$

where
$\hat{\beta}_0$ = Estimated value of regression coefficient $\beta_0$
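
For readers who wish to see where Equations (5.8) and (5.9) come from, the following is a sketch of the standard derivation. It assumes the least squares criterion is to minimise the sum of squared errors; the symbol $S$ is introduced here only for illustration and is not used elsewhere in this module.

```latex
\begin{align*}
S(\beta_0, \beta_1) &= \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2 \\
\frac{\partial S}{\partial \beta_0} = 0
  &\;\Rightarrow\; \sum y_i = n\hat{\beta}_0 + \hat{\beta}_1 \sum x_i
  \;\Rightarrow\; \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \\
\frac{\partial S}{\partial \beta_1} = 0
  &\;\Rightarrow\; \sum x_i y_i = \hat{\beta}_0 \sum x_i + \hat{\beta}_1 \sum x_i^2 \\
\text{Substituting } \hat{\beta}_0 \text{ and } \sum x_i = n\bar{x}:\quad
  \sum x_i y_i &= n\bar{x}\bar{y} - \hat{\beta}_1 n\bar{x}^2 + \hat{\beta}_1 \sum x_i^2 \\
\hat{\beta}_1 &= \frac{\sum x_i y_i - n\bar{x}\bar{y}}{\sum x_i^2 - n\bar{x}^2}
\end{align*}
```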

Worked Example 5.1


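For the following data, find the value of regression coefficients $\hat{\beta}_0$ and
$\hat{\beta}_1$, and write down the fitted regression model:

| x | 3 | 7 | 6 | 6 | 10 | 12 | 12 | 12 | 13 | 13 | 14 | 15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| y | 33 | 38 | 24 | 61 | 52 | 45 | 29 | 65 | 82 | 63 | 50 | 79 |

Answer:
To facilitate the calculation of parameter values $\hat{\beta}_0$ and $\hat{\beta}_1$, we can form
the following table.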


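| $x_i$ | $y_i$ | $x_i y_i$ | $x_i^2$ | $y_i^2$ |
|---|---|---|---|---|
| 3 | 33 | 99 | 9 | 1089 |
| 7 | 38 | 266 | 49 | 1444 |
| 6 | 24 | 144 | 36 | 576 |
| 6 | 61 | 366 | 36 | 3721 |
| 10 | 52 | 520 | 100 | 2704 |
| 12 | 45 | 540 | 144 | 2025 |
| 12 | 29 | 348 | 144 | 841 |
| 12 | 65 | 780 | 144 | 4225 |
| 13 | 82 | 1066 | 169 | 6724 |
| 13 | 63 | 819 | 169 | 3969 |
| 14 | 50 | 700 | 196 | 2500 |
| 15 | 79 | 1185 | 225 | 6241 |
| $\sum x_i = 123$ | $\sum y_i = 621$ | $\sum x_i y_i = 6833$ | $\sum x_i^2 = 1421$ | $\sum y_i^2 = 36059$ |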

From the table, we first calculate $\bar{x}$ and $\bar{y}$:

$$\bar{x} = \frac{\sum x_i}{n} = \frac{123}{12} = 10.25 \quad \text{and} \quad \bar{y} = \frac{\sum y_i}{n} = \frac{621}{12} = 51.75$$

Now, we can get the $\hat{\beta}_1$ regression coefficient using this formula:

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2} = \frac{6833 - 12(10.25)(51.75)}{1421 - 12(10.25)^2} = 2.92$$

and for the $\hat{\beta}_0$ regression coefficient,

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = 51.75 - 2.92(10.25) = 21.82$$

Hence, the simple linear regression model is $\hat{y} = 21.82 + 2.92x$.
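
Equations (5.8) and (5.9) translate directly into code. The sketch below recomputes the coefficients for the data of Worked Example 5.1 using plain Python; the function name `least_squares_fit` is illustrative. Note that the code keeps full precision for $\hat{\beta}_1$, so its intercept comes out as about 21.83, whereas the worked example obtains 21.82 because it rounds $\hat{\beta}_1$ to 2.92 before substituting.

```python
x = [3, 7, 6, 6, 10, 12, 12, 12, 13, 13, 14, 15]
y = [33, 38, 24, 61, 52, 45, 29, 65, 82, 63, 50, 79]


def least_squares_fit(x, y):
    """Return (b0, b1) for the fitted line y-hat = b0 + b1 * x, per Equations (5.8), (5.9)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi ** 2 for xi in x)
    b1 = (sum_xy - n * x_bar * y_bar) / (sum_x2 - n * x_bar ** 2)  # Equation (5.8)
    b0 = y_bar - b1 * x_bar                                        # Equation (5.9)
    return b0, b1


b0, b1 = least_squares_fit(x, y)
print(round(b0, 2), round(b1, 2))  # 21.83 2.92
```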


EXERCISE 5.2

Fit a simple linear regression model to the data below:

| x | 60 | 62 | 64 | 65 | 66 | 67 | 68 | 70 | 72 | 74 |
|---|---|---|---|---|---|---|---|---|---|---|
| y | 63.6 | 65.2 | 66.0 | 65.5 | 66.9 | 67.1 | 67.4 | 68.3 | 70.1 | 70 |

Interpret the regression coefficients obtained in this model.

SELF-CHECK 5.2

State the application of the least squares estimation method.

5.4 INFERENCES ON REGRESSION COEFFICIENTS
Inferential statistics is a branch of statistics concerned with drawing
conclusions about a population based on information from samples. The regression
coefficients $\hat{\beta}_0$ and $\hat{\beta}_1$ calculated from sample data are only estimates
of the population parameters. In other words, the regression coefficient values are
subject to sampling error. A positive or negative sample coefficient does not
necessarily mean that the corresponding population parameter has the same value,
or even that it differs from zero. Hence, a test is needed to verify whether the
population regression coefficient is really positive or negative, that is, not zero.

We are going to test the population regression slope $\beta_1$ using the regression
coefficient $\hat{\beta}_1$. The hypothesis testing process for the population
parameter $\beta_1$ is similar to that for testing a mean or a variance. We begin with a
hypothesis statement. The null hypothesis claims that there is no linear
relationship, which means the slope of the regression line is zero. If we do not
reject the null hypothesis, there is not enough evidence that the y value changes
with the x value; the population regression line may be a horizontal straight line.
In this case, information on x does not help in predicting the y value. On the other
hand, if the null hypothesis is rejected, there is enough evidence to say $\beta_1$ is not
zero, that is, either $\beta_1 > 0$ or $\beta_1 < 0$. This shows that the regression line has a
tendency to increase or decrease, and this helps in predicting the y value using the
x value.

We can perform either a one-sided ($\beta_1 > 0$ or $\beta_1 < 0$) or a two-sided
($\beta_1 \neq 0$) test to determine if there is enough evidence to conclude the existence
of a linear relationship (that the population $\beta_1$ is not zero). Hence, test the
hypothesis:

H0 : $\beta_1 = 0$
H1 : 1. $\beta_1 > 0$
     2. $\beta_1 < 0$
     3. $\beta_1 \neq 0$

Test Statistic :
$$T = \frac{\hat{\beta}_1 - \beta_1}{s(\hat{\beta}_1)} = \frac{\hat{\beta}_1}{s(\hat{\beta}_1)}$$

Test Result : T follows a t distribution with v = n – 2 degrees of freedom at the
$\alpha$ significance level.
Reject H0 when : 1. $T > t_{\alpha,v}$
                 2. $T < -t_{\alpha,v}$
                 3. $|T| > t_{\alpha/2,v}$

$s(\hat{\beta}_1)$ is the standard deviation for $\hat{\beta}_1$. The formula to get the standard
deviation for $\hat{\beta}_1$ is:

$$s(\hat{\beta}_1) = \sqrt{\frac{\left(\sum y_i^2 - \hat{\beta}_0 \sum y_i - \hat{\beta}_1 \sum x_i y_i\right) / (n - 2)}{\sum x_i^2 - n\bar{x}^2}}$$

Apart from hypothesis testing, we can also construct a confidence interval for $\beta_1$.
A confidence interval provides a range that contains the value of the
population parameter at a certain confidence level. Based on the (two-sided) T test
statistic that follows the $t_{n-2}$ distribution, we can construct a $(1 - \alpha)100\%$
confidence interval as below:

$$\hat{\beta}_1 - t_{\alpha/2,\,n-2}\, s(\hat{\beta}_1) \le \beta_1 \le \hat{\beta}_1 + t_{\alpha/2,\,n-2}\, s(\hat{\beta}_1)$$


Worked Example 5.2


Based on the data in Example 5.1, test at the $\alpha = 0.05$ significance level whether
there is enough evidence to say that there is a linear relationship between x and y
(that is, $\beta_1 \neq 0$). Construct a 95% confidence interval for $\beta_1$.
Answer:
The hypothesis statement:
H0 : $\beta_1 = 0$
H1 : $\beta_1 \neq 0$ (two-tailed test, so $\alpha/2$)

Prior to obtaining the test statistic value, we need to calculate the value of
$s(\hat{\beta}_1)$:

$$s(\hat{\beta}_1) = \sqrt{\frac{\left(36059 - 21.82(621) - 2.92(6833)\right)/10}{1421 - 12(10.25)^2}} = 1.26$$

Test Statistic :
$$T = \frac{\hat{\beta}_1}{s(\hat{\beta}_1)} = \frac{2.92}{1.26} = 2.317$$

Test Result : T follows a t distribution with v = 12 – 2 = 10 degrees of freedom at
the 0.05 significance level.

Reject H0 when
$|T| > t_{0.025,10} = 2.228$

Since the test statistic (T = 2.317) > 2.228 ($t_{0.025,10}$), we reject the null
hypothesis. Hence, we can conclude that $\beta_1$ is not zero, that is, there is enough
evidence of the existence of a linear relationship between x and y.

The 95% confidence interval (hence $\alpha = 0.05$) for $\beta_1$ is

$$\hat{\beta}_1 - t_{0.025,10}\, s(\hat{\beta}_1) \le \beta_1 \le \hat{\beta}_1 + t_{0.025,10}\, s(\hat{\beta}_1)$$
$$2.92 - 2.228(1.26) \le \beta_1 \le 2.92 + 2.228(1.26)$$
$$0.113 \le \beta_1 \le 5.727$$


This confidence interval shows that, with 95% confidence, the y value increases by
between 0.113 and 5.727 for each one-unit increase in x. The wide range for $\beta_1$ is
due to the small sample size.
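
The standard deviation of $\hat{\beta}_1$, the test statistic and the confidence interval can all be recomputed from the column totals of Worked Example 5.1. The sketch below assumes the same formulas as in this section and takes the critical value 2.228 from the t table as quoted in the text; the variable names are illustrative.

```python
import math

x = [3, 7, 6, 6, 10, 12, 12, 12, 13, 13, 14, 15]
y = [33, 38, 24, 61, 52, 45, 29, 65, 82, 63, 50, 79]

n = len(x)
sum_y = sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_x2 = sum(xi ** 2 for xi in x)
sum_y2 = sum(yi ** 2 for yi in y)
x_bar, y_bar = sum(x) / n, sum_y / n

sxx = sum_x2 - n * x_bar ** 2
b1 = (sum_xy - n * x_bar * y_bar) / sxx
b0 = y_bar - b1 * x_bar

# Standard deviation of the slope estimate, following the formula in this section
residual_variance = (sum_y2 - b0 * sum_y - b1 * sum_xy) / (n - 2)
se_b1 = math.sqrt(residual_variance / sxx)

t_stat = b1 / se_b1
t_critical = 2.228  # t_{0.025, 10}, taken from the t table as quoted in the text

lower = b1 - t_critical * se_b1
upper = b1 + t_critical * se_b1
print(round(se_b1, 2), round(t_stat, 2), abs(t_stat) > t_critical)
print(round(lower, 2), round(upper, 2))
```

Because the code does not round $\hat{\beta}_1$ and $s(\hat{\beta}_1)$ before dividing, its results differ slightly from the hand calculation (T is about 2.31 rather than 2.317, and the interval is about 0.10 to 5.73), but the conclusion is the same.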

A test can also be performed on the y-intercept $\beta_0$ using regression coefficient
$\hat{\beta}_0$. However, as discussed in Section 5.2, in most cases, the y-intercept does
not carry any meaning; hence a test on its value can be ignored.

EXERCISE 5.3

Answer the following questions based on the data in Exercise 5.2:

(a) Test the significance of the $\beta_1$ parameter at 0.05 level; and
(b) Construct a 99% confidence interval for $\beta_1$. Interpret this
confidence interval.

5.5 MODEL ADEQUACY CHECK


A confidence interval constructed using the available data must be able to represent
the population involved. The fitted regression model should satisfy all assumptions
in the underlying model. The inferences are not valid if these assumptions are
not satisfied. There are three methods to check for model adequacy:
• Coefficient of Determination
• Residual Plot
• Transformation

5.5.1 Coefficient of Determination, $R^2$
The Coefficient of Determination, $R^2$, is a measurement of the proportion of
variation in the dependent variable that can be explained by the fitted regression
model. To further understand the coefficient of determination, refer to Figure 5.4.


Figure 5.4

This is similar to saying that we are quantifying the contribution of x in predicting
y. Refer to Figure 5.4. For each x value, for example $x_0$, we can separate the
deviation of $y_0$ from the mean $\bar{y}$ into two parts; one part is “explained variation”
and the other part “unexplained variation”. Total variation is the total sum of squares
of deviations of the y points from their mean, that is $\sum (y_i - \bar{y})^2$. This can be
derived from the variation term in y, that is,

$$y_i - \bar{y} = (y_i - \hat{y}_i) + (\hat{y}_i - \bar{y})$$
$$\sum (y_i - \bar{y})^2 = \sum (y_i - \hat{y}_i)^2 + \sum (\hat{y}_i - \bar{y})^2$$

The unexplained variation is the sum of squares of deviations of observed y from the
$\hat{y}$ estimate, that is $\sum (y_i - \hat{y}_i)^2$, and the explained variation is the sum of
squares of deviations of fitted values from the mean, that is $\sum (\hat{y}_i - \bar{y})^2$.

Hence, if we want to express the proportion of explained variation, the simplified
formula for the coefficient of determination $R^2$ is,

$$R^2 = \frac{\hat{\beta}_0 \sum y_i + \hat{\beta}_1 \sum x_i y_i - n\bar{y}^2}{\sum y_i^2 - n\bar{y}^2} \tag{5.10}$$


The coefficient of determination in Equation (5.10) always lies between 0 and 1, that
is $0 \le R^2 \le 1$, and it is usually expressed as a percentage by multiplying by 100%.
For example, if $R^2 = 0.57$, we say that 57% of the variation in y can be explained
by the fitted regression. The remaining 43% cannot be explained. The bigger the
value of $R^2$ (approaching 1), the better the data fit the simple linear regression
model.

Refer to the following example:

Worked Example 5.3


Based on the data in Example 5.1, calculate the coefficient of determination and
interpret its meaning if y = sales and x = number of radio advertisements.
Answer:
The coefficient of determination is

$$R^2 = \frac{\hat{\beta}_0 \sum y_i + \hat{\beta}_1 \sum x_i y_i - n\bar{y}^2}{\sum y_i^2 - n\bar{y}^2} = \frac{21.82(621) + 2.92(6833) - 12(51.75)^2}{36059 - 12(51.75)^2} = 0.3481$$

This means the fitted regression model can explain only 34.81% of the variation in
sales; the remaining 65.19% of the variation in sales is due to other factors.
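
Equation (5.10) can be verified on the same data with a few lines of Python. This is a sketch that recomputes the coefficients at full precision rather than using the rounded values 21.82 and 2.92; the variable names are illustrative.

```python
x = [3, 7, 6, 6, 10, 12, 12, 12, 13, 13, 14, 15]
y = [33, 38, 24, 61, 52, 45, 29, 65, 82, 63, 50, 79]

n = len(x)
sum_y = sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_x2 = sum(xi ** 2 for xi in x)
sum_y2 = sum(yi ** 2 for yi in y)
x_bar, y_bar = sum(x) / n, sum_y / n

# Least squares coefficients, Equations (5.8) and (5.9)
b1 = (sum_xy - n * x_bar * y_bar) / (sum_x2 - n * x_bar ** 2)
b0 = y_bar - b1 * x_bar

# Coefficient of determination, Equation (5.10)
explained = b0 * sum_y + b1 * sum_xy - n * y_bar ** 2
total = sum_y2 - n * y_bar ** 2
r_squared = explained / total
print(round(r_squared, 4))  # 0.3481
```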

SELF-CHECK 5.3

What is the difference between the correlation coefficient and the
coefficient of determination in a regression model? Discuss in class.

EXERCISE 5.4

Based on the data in Exercise 5.2, calculate the coefficient of
determination $R^2$ and interpret the value obtained.


5.5.2 Residual Plot


The validity of many of the inferences associated with a regression analysis
depends on the error term, $\varepsilon$, satisfying certain assumptions. Hence, it is highly
recommended that some sort of analysis be conducted to assess these
assumptions. No regression analysis can be considered complete without such an
examination. This can be done through a graphical technique called the residual plot.
From the plot, we will be able to check whether:
(a) The fitted model is linear or not;
(b) Variance for error $\varepsilon_i$ is constant or proportional to $x_i$; and
(c) The residuals are distributed as normal or not.
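
The quantities shown in a residual plot are simply the fitted values and the residuals. As a sketch, the code below computes them for the data and fitted line of Worked Example 5.1 (coefficients 21.82 and 2.92 as given there) and prints them as a table; plotting the second column against the first, using any plotting tool, would give the residual plot.

```python
x = [3, 7, 6, 6, 10, 12, 12, 12, 13, 13, 14, 15]
y = [33, 38, 24, 61, 52, 45, 29, 65, 82, 63, 50, 79]

b0, b1 = 21.82, 2.92  # fitted coefficients from Worked Example 5.1

fitted = [b0 + b1 * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

# Each row holds one point of the residual plot: (fitted value, residual)
print(" fitted  residual")
for fi, ei in zip(fitted, residuals):
    print(f"{fi:7.2f}  {ei:8.2f}")

# For a least squares line the residuals sum to (approximately) zero
print(abs(sum(residuals)) < 0.01)  # True
```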

The following are a few graphs that show deviations from the assumptions made.
Figure 5.5 is a plot of $\varepsilon_i$ versus the fitted values $\hat{y}_i$ or $x_i$ to determine
whether the linearity assumption is met or not. The graph shows that the plotted data
form a curve and hence, we can conclude that the fitted model is non-linear.

Figure 5.5: Model is non-linear, showing curvature (horizontal axis: $\hat{y}_i$ or $x_i$)

Figure 5.6 (a plot of $\varepsilon_i$ versus the fitted values $\hat{y}_i$ or $x_i$) shows a deviation
of the model from the assumption that the random errors have constant variance. The
plot of the data shows a bell-shaped pattern. This means that, instead of having a
constant variance, the random errors have a spread that is related to the $\hat{y}$
values. The random errors have constant variance if the graph shows a random pattern
or no trend.

Figure 5.6: Model with non-constant error variance (horizontal axis: $\hat{y}_i$ or $x_i$)

The graph in Figure 5.7 is a histogram of the errors. This is to determine whether the
random errors or residuals are distributed as normal. If they are distributed as
normal, the histogram will form a bell-shaped curve. The histogram in Figure 5.7 shows
that the assumption of normality of the random errors is not met, as the histogram's
shape is not normal.

Figure 5.7: Histogram of residuals

SELF-CHECK 5.4

What can you do to a model if there exists a violation of assumptions?


EXERCISE 5.5

Explain the meaning of each of these plots:

(a) (b) (c)

5.5.3 Some Transformations


Transformation is important if the regression model is in non-linear shape. The
linearity of any model can be verified by drawing a two-way scatter plot. A linear
regression model will display a linear function, which is in straight-line form. A
common transformation function is logarithm or inverse, either on x or y. The
following are a few transformation examples to change some non-linear functions
to their linear form.

Table 5.1: Some Transformations

Functional Form Linear Regression


Transformation
that Relates y to x Model Form

Exponent: y = 0 e1x y* = ln y y* = ln 0 + 1 x

Power: y = 0 x1 y* = log y; x* = log x y* = log 0 + 1 x*

Inverse: x* = 1–x ; y = 0 + 1 x*

Hyperbolic: y* = 1–y ; x* = 1–x y* = 0 x*+ 1

A two-way scatter plot is very useful for ascertaining whether a model has a linear or
non-linear form. Hence, it is good to know the shapes of the Exponential, Power,
Inverse and Hyperbolic functions (refer to Figure 5.8). Compare Figures 5.1 and 5.8
when choosing a transformation.
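
To make the idea of linearising a model concrete, the following is a minimal sketch of the exponential case. The data are hypothetical and noise-free (generated from y = 2e^{0.5x}), and the helper name `fit_line` is an assumption introduced only for this illustration. Taking y* = ln y turns the model into a straight line, so ordinary least squares on (x, y*) should recover ln β0 as the intercept and β1 as the slope.

```python
import math

def fit_line(xs, ys):
    """Least-squares intercept and slope for the straight line y = b0 + b1*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Hypothetical, noise-free data generated from y = 2 * exp(0.5 * x)
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0 * math.exp(0.5 * x) for x in xs]

# Transformation for the exponential form: y* = ln y,
# so that y* = ln(beta0) + beta1 * x is linear in x
y_star = [math.log(y) for y in ys]
intercept, slope = fit_line(xs, y_star)

beta0_hat = math.exp(intercept)  # back-transform the intercept
beta1_hat = slope
print(beta0_hat, beta1_hat)
```

With real (noisy) data the recovered values would only approximate the underlying parameters, and the fit is a least-squares fit on the transformed scale rather than on the original one.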


EXERCISE 5.6
Draw a two-way plot of each of the following regression models. Then apply a
suitable transformation to obtain the linear regression model.
(a) $y = 2.67 + 0.68\left(\frac{1}{x}\right)$
(b) $y = 2e^{3.1x}$
(c) $y = 1.5x^{0.85}$
(d) $y = \frac{x}{0.4 + 2x}$

Figure 5.8: Functional forms: (a) Exponential function, (b) Power function, (c) Inverse function, (d) Hyperbolic function

5.6 PREDICTION AND ESTIMATION USING REGRESSION MODEL

One of the reasons for building a linear regression model is to predict the value of
the dependent variable at future x values. For example, refer to the property
development manager's problem of estimating the selling price (in RM) of each house
built (refer to Section 5.2), using the regression model

$$\hat{y} = 25{,}000 + 90x, \qquad x_b \le x \le x_a \tag{5.11}$$

where y = selling price and x = house size (in square feet). The observed x values lie
between $x_b$ and $x_a$. If we would like to predict the selling price of a house whose
built-up area is 2,000 square feet, where 2,000 > $x_a$, we can substitute x = 2,000
into the regression model: ŷ = 25,000 + 90(2,000) = 205,000. Based on the regression
equation, the manager can predict that the selling price of a house with 2,000
square feet is RM205,000.


However, this selling price is only a point estimate and it says nothing about how
close it is to the actual selling price. In other words, is the estimated value close
to the actual value or very different? This relates to the reliability of the
prediction. To get information on how far the estimated value may be from the actual
value, we need to use intervals. There are two types of intervals: a prediction
interval for an individual value of the dependent variable y, and an estimation
(confidence) interval for the mean value of y.

5.6.1 Prediction Interval for an Individual Value of y


The prediction interval is used to predict a certain value of the dependent variable y,
given a specific value of the independent variable x, when this x value is outside the
range of observed x values, that is $x > x_a$ or $x < x_b$. The term “prediction interval”
is used rather than “confidence interval” because a population parameter is not being
estimated in this case; instead, the response or performance of a single individual
in the population is being predicted. The formula for a $(1 - \alpha)100\%$ prediction
interval is:

$$\hat{y} \pm t_{\alpha/2}\, s \sqrt{1 + \frac{1}{n} + \frac{(x_g - \bar{x})^2}{\sum (x_i - \bar{x})^2}} \tag{5.12}$$

where

$\hat{y}$ = Future estimated value of the dependent variable, calculated from $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_g$

$t_{\alpha/2}$ = The critical value of the t distribution with n – 2 degrees of freedom at significance level $\alpha$

$s$ = The standard error of the estimate

$x_g$ = The specific value of the independent variable

$\bar{x}$ = Mean value of the independent variable

$n$ = Sample size


The standard error of the estimate, s, is a measure of the reliability of an estimated
equation. It is an estimate of the standard deviation of the observations around the
regression line. It is often referred to as the standard error of the regression, that
is, the degree of deviation of the observed values from the estimated values based on
the regression line. The formula for s is:

$$s = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n - 2}} \tag{5.13}$$

Worked Example 5.4


Refer to the data in Example 5.1. Calculate the 95% prediction interval for x = 20 and
explain its meaning if y = sales and x = number of radio advertisements.
Answer:
From Example 5.1, the simple linear regression model is
ŷ = 21.82 + 2.92 x
When xg = 20, ŷ = 21.82 + 2.92 (20) = 80.22.

To get the standard error of the estimate, we need the ŷ values. These can be generated
using the regression model ŷ = 21.82 + 2.92 x. Hence, the data are as in the table:

x    3      7      6      6      10     12
y    33     38     24     61     52     45
ŷ    30.58  42.26  39.34  39.34  51.02  56.86

x    12     12     13     13     14     15
y    29     65     82     63     50     79
ŷ    56.86  56.86  59.78  59.78  62.70  65.62

Hence,

$$s = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n - 2}} = \sqrt{\frac{2556.946}{10}} = 15.99$$


Since $\alpha = 0.05$ and there are n – 2 = 10 degrees of freedom, $t_{\alpha/2} = t_{0.025} = 2.228$.
Thus, the 95% prediction interval for $x_g = 20$ is

$$\hat{y} \pm t_{\alpha/2}\, s \sqrt{1 + \frac{1}{n} + \frac{(x_g - \bar{x})^2}{\sum (x_i - \bar{x})^2}}$$

$$80.22 \pm (2.228)(15.99) \sqrt{1 + \frac{1}{12} + \frac{(20 - 10.25)^2}{160.25}}$$

$$80.22 \pm 46.13$$

The lower and upper limits of the prediction interval are 34.09 and 126.35
respectively. This means that, with 95% confidence, sales are predicted to lie between
about 34 and 126 units when 20 advertisements are broadcast on radio.
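
The calculation in Worked Example 5.4 can be sketched in a few lines of Python. This is only an illustration: it refits the line from the tabulated data instead of using the rounded coefficients 21.82 and 2.92, so its results differ from the hand calculation in the second decimal place. The critical value 2.228 is taken from the text rather than computed, and all variable names are assumptions made for this sketch.

```python
import math

# Data of Example 5.1 as tabulated in Worked Example 5.4
x = [3, 7, 6, 6, 10, 12, 12, 12, 13, 13, 14, 15]
y = [33, 38, 24, 61, 52, 45, 29, 65, 82, 63, 50, 79]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# Least squares estimates of slope and intercept
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

# Standard error of the estimate, formula (5.13)
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
s = math.sqrt(sse / (n - 2))

# 95% prediction interval at x_g = 20, formula (5.12)
t_crit = 2.228  # t_{0.025} with n - 2 = 10 degrees of freedom (from the text)
x_g = 20
y_hat = b0 + b1 * x_g
margin = t_crit * s * math.sqrt(1 + 1 / n + (x_g - x_bar) ** 2 / sxx)
lower, upper = y_hat - margin, y_hat + margin

print(round(s, 2), round(lower, 2), round(upper, 2))
```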

EXERCISE 5.7

Refer to data in Exercise 5.2. Calculate the 99% prediction interval for x =
86 and provide an explanation for it.

5.6.2 Confidence Interval for a Mean Value of y


In the population regression model $y = \beta_0 + \beta_1 x + \varepsilon$, for each x value, y is
normally distributed with mean

$$E(y) = \beta_0 + \beta_1 x$$

and the corresponding point on the least squares line estimates this mean.

Hence, to estimate the mean value of y, given any $x_g$ value, we can use the
following interval:

$$\hat{y} \pm t_{\alpha/2}\, s \sqrt{\frac{1}{n} + \frac{(x_g - \bar{x})^2}{\sum (x_i - \bar{x})^2}} \tag{5.14}$$

This interval applies when the specific value of x lies within the range of the
independent variable values, i.e. $x_b \le x_g \le x_a$.


Worked Example 5.5


Refer to the data in Example 5.1. Calculate the 95% confidence interval for the mean
value of y when $x_g = 11$ and explain its meaning if y = sales and x = number of
radio advertisements.
Answer:
The values of $t_{\alpha/2}$ and s can be obtained from Example 5.4, and
ŷ = 21.82 + 2.92 (11) = 53.94. Hence, the 95% confidence interval for the mean value
of y when $x_g = 11$ is:

$$\hat{y} \pm t_{\alpha/2}\, s \sqrt{\frac{1}{n} + \frac{(x_g - \bar{x})^2}{\sum (x_i - \bar{x})^2}}$$

$$53.94 \pm (2.228)(15.99) \sqrt{\frac{1}{12} + \frac{(11 - 10.25)^2}{160.25}}$$

$$53.94 \pm 10.50$$

Note that the lower and upper confidence limits for the mean value of y are 43.44
and 64.44 respectively. This means that, with 95% confidence, mean sales lie between
about 43 and 64 units when 11 radio advertisements are broadcast.
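
As a rough check of Worked Example 5.5, the sketch below evaluates formula (5.14) from the summary values quoted in the text (n = 12, mean of x = 10.25, sum of squared deviations of x = 160.25, s = 15.99, t = 2.228). For comparison it also evaluates the prediction-interval margin of formula (5.12) at the same x, purely to illustrate that the interval for an individual value is wider than the interval for the mean. Variable names are assumptions made for this sketch.

```python
import math

# Summary values quoted in Worked Examples 5.4 and 5.5
n = 12
x_bar = 10.25
sxx = 160.25          # sum of squared deviations of x
s = 15.99             # standard error of the estimate
t_crit = 2.228        # t_{0.025} with 10 degrees of freedom (from the text)
b0, b1 = 21.82, 2.92  # fitted model from Example 5.1

x_g = 11
y_hat = b0 + b1 * x_g

# Formula (5.14): confidence interval for the mean value of y
mean_margin = t_crit * s * math.sqrt(1 / n + (x_g - x_bar) ** 2 / sxx)
mean_lower, mean_upper = y_hat - mean_margin, y_hat + mean_margin

# Formula (5.12) at the same x, for comparison only
pred_margin = t_crit * s * math.sqrt(1 + 1 / n + (x_g - x_bar) ** 2 / sxx)

print(round(y_hat, 2), round(mean_margin, 2), round(pred_margin, 2))
```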

EXERCISE 5.8

Refer to Exercise 5.2. Calculate the 99% confidence interval for mean y
when x = 69 and provide an explanation on it.


EXERCISE 5.9

1. Determine whether the following statements are true or false. If
false, write down the correct statement.
(a) Regression analysis is used to display the validity of an
estimated equation that explains the relationship of subjects
under study.
(b) Given a straight-line equation y = a – bx, we can say the
relationship between y and x is positive linear.
(c) An R² value approaching zero indicates a strong correlation
between x and y.
(d) The regression line is derived from a sample, not from the
population under study.

2. Fill in the blanks to complete each of the following statements:


(a) If the value of a dependent variable decreases when the value
of an independent variable increases, their relationship is
_____________________.
(b) Each straight line has ___________________, which explains
how much the dependent variable changes for a unit
increase in the independent variable.
(c) The least squares method is used to get _______________ and
__________________ of regression line.
(d) If the coefficient of determination is 0.80, this means 80% of
variation in the dependent variable ______________ by
variation in the independent variable.

3. An economist would like to know the effect of interest rates on the
total investments made by a company. He has collected data over a
period of eight months, displayed in the table below:

Investment (RM Million)   1.8   1.8    2.1   2.2   2.8    3.1   3.6   4.1
Interest Rate (%)         9.9   10.5   9.6   9.8   12.1   9.2   9.5   7.7


(a) State the dependent and independent variables.


(b) Without using a two-way scatter plot, explain if the regression
slope is positive or negative. Give reasons for your answer.
(c) By how much will investment value change, for a unit change
in the interest rate?
(d) To what extent is the variation in investment attributed to the
variation in interest rate?

4. Many people assume that the total amount of money saved depends
on their total income. The following data shows the average saving
per month (RM’00) and average income per month (RM’00) for
various groups of employees.
Income 19 22 27 30 36 43 47 51 61 64
Saving 1.0 1.4 1.8 2.4 3.0 3.8 4.3 4.5 5.8 6.3

(a) Fit a simple linear regression model for this data.


(b) How much money will a person save if his monthly income is
RM4,500?
(c) Test the significance of regression slope at 5% significance
level. Use a one-sided test and explain the reason for using this
test.
(d) Obtain a 95% confidence interval for $\beta_1$.

5. For the following data:


x 1 2 3 7 10 11 14 14 16 17
y 20 30 50 100 150 200 260 400 400 700

(a) Calculate the residuals.


(b) Plot residuals versus x and ŷ .
(c) What can you conclude from the plot in (b)?
(d) Is the normality assumption for random errors satisfied?
(e) Is there any sign of violation from model assumptions? If yes,
what needs to be done?


Please visit the following website to read more on simple linear regression:
http://www.pinkmonkey.com/studyguides/subjects/stats/chap8/s0808n01.asp

• Simple linear regression is a technique used to analyse the relationship
between two variables. Regression analysis assumes that the relationship is
linear.

• The least squares method is used to obtain parameter estimates for the slope of
the regression line and its intercept on the y-axis.

• A hypothesis test is performed on the slope of the regression line to determine
whether there is enough evidence to support the existence of a linear
relationship. The strength of the relationship between the two variables can be
measured using the coefficient of determination.

• A model adequacy check, that is, checking for violation of assumptions, can be
performed using a residual plot and a histogram.

• If the simple linear regression model obtained is adequate for the data, the
model can be used to predict a dependent variable value for any specific
independent variable value. The model can also be used to estimate the
mean value of the dependent variable.



Topic 6 Multiple Regression
LEARNING OUTCOMES
By the end of this topic, you should be able to:
1. Identify the multiple regression concept, that is the relationship
between one dependent variable with two or more independent
variables;
2. Identify the multiple regression model and assumptions made;
3. Explain how a multiple regression model is constructed using the least
squares method;
4. Identify the inferential concept for regression parameters;
5. Identify available methods used to evaluate the suitability of data on
regression model; and
6. Use the regression equation for the prediction and estimation of
parameter values.

INTRODUCTION
Most practical applications of regression analysis utilise models that are more
complex than the Simple Linear Regression discussed in Topic 5. In the previous
topic, only two variables, a dependent variable (Y) and an independent variable
(X), are considered. In Simple Linear Regression, it is important to fully
understand the following items:


(a) Linear regression equation: $Y' = \beta_0 + \beta_1 X$ (6.1)

(b) Linear regression model: $Y = \beta_0 + \beta_1 X + \varepsilon$ (6.2)

Estimated model: $\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X$ (6.2a)

The following points should be noted:

(a) It can be understood from model (6.2) that Y is a random variable which
follows a certain population distribution. The errors ε are assumed to be
independent of one another for any Y value, with zero mean. From the above
equations, for a given X value, the expectation is

$$E(Y \mid X) = \beta_0 + \beta_1 X \tag{6.3}$$

Therefore, Y' in (6.1) is actually the conditional expectation line E(Y|X).
Refer to Figure 5.2 in Topic 5. The line depends very much on the
combination of $\beta_0$ and $\beta_1$ values that characterise it. In this case, $\beta_0$
and $\beta_1$ are population parameters that need to be estimated.

(b) In regression analysis, the first thing to do is to estimate the Y' line using
model (6.2), that is, to obtain estimates of $\beta_0$ and $\beta_1$ using methods such as
ordinary least squares, which uses n pairs of observed values (y, x).

(c) Once $\beta_0$ and $\beta_1$ in (6.2) have been estimated, how far Ŷ is able to explain
the variation in Y values depends on how accurately both parameters are
estimated. This is usually measured with the R² coefficient: the closer its
value is to 1, the better Ŷ is at explaining the variation in Y, and vice versa.

In another situation, a manager may want to find out the effect of advertising cost
and the total space allocated on advertising boards (as two independent variables,
$X_1$ and $X_2$ respectively) on the amount of monthly product sales (as the
dependent variable, Y). Thus, equations (6.1) and (6.2) would become:


$$Y' = \beta_0 + \beta_1 X_1 + \beta_2 X_2 \tag{6.4}$$

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \varepsilon \tag{6.5}$$

$$Y = Y' + \varepsilon \tag{6.5a}$$

Equation (6.4), or its equivalent involving two or more independent variables, is
called a Multiple Regression equation. Thus, model (6.5) or (6.5a) is termed a
Multiple Regression Model. In the multiple regression method, there are usually n
observations of (y, x1, x2). These observations are used to estimate
$\beta_0$, $\beta_1$ and $\beta_2$ using a method such as ordinary least squares. As in the simple
linear regression case, the goodness-of-fit of Ŷ in explaining the variation in Y
values depends on the precision of the $\beta_0$, $\beta_1$, $\beta_2$ estimates. The R² coefficient of
determination is also used to measure this. However, the formula used is different
in this case. This matter will be discussed further in subsequent subtopics.

6.1 MULTIPLE REGRESSION MODEL AND ASSUMPTIONS

In general, a multiple regression model with k independent variables can be
written as below:

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k + \varepsilon \tag{6.6}$$

$$Y = Y' + \varepsilon \tag{6.6a}$$

where

$$Y' = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k$$

However, in this module, we will only focus on the Multiple Regression Model with
two independent variables. For cases involving more than two independent
variables, the calculation is usually done with the help of a statistical package such
as SPSS (Statistical Package for the Social Sciences). You can refer to any related book.


6.1.1 Assumptions for Multiple Regression Model


The assumptions are similar to those for the Simple Regression Model. Refer to
Table 6.1.

Table 6.1: Assumptions for Multiple Regression Model

Assumption                        Explanation

Normality Assumption              For any specific values of the independent variables, the
                                  values of the random variable Y are normally distributed,
                                  with mean E(Y|x) and variance $\sigma_\varepsilon^2$.

Constant Variance Assumption      For each value of the independent variables, the random
                                  variable Y has constant variance.

Linearity Assumption              There exists a linear relationship between the dependent
                                  and independent variables.

Non-multicollinearity Assumption  The independent variables are not correlated with each
                                  other.

Independence Assumption           The values of the dependent variable Y are independent of
                                  each other.

Error/Residual Assumption         Residuals or errors are random and independent of each
                                  other, and are assumed to be normally distributed with
                                  zero mean and constant variance $\sigma_\varepsilon^2$.

6.1.2 Multiple Regression Model with TWO Independent Variables

SELF-CHECK 6.1
Using your current understanding based on what you have learnt, try to
think of two or three independent variables for the following dependent
variables.
(a) Profit made by a company
(b) Students’ final examination grades
(c) Selling price of a house


(a) Model Statement

The regression model for two independent variables is as follows:

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \varepsilon \tag{6.7}$$

$$Y = Y' + \varepsilon \tag{6.7a}$$

where $Y' = \beta_0 + \beta_1 X_1 + \beta_2 X_2$.

For the i-th observation, i = 1, 2, . . . , n, the model is written in terms of
observed values (numbers, written in small letters) as

$$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \varepsilon_i \tag{6.8}$$

$$y_i = y'_i + \varepsilon_i \tag{6.8a}$$

where $y'_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i}$.

In model (6.8), the population parameters $\beta_0$, $\beta_1$, $\beta_2$ are unknown. Their
estimators are written as $\hat{\beta}_0$, $\hat{\beta}_1$, $\hat{\beta}_2$, and thus the estimated model equation
is written as:

$$\hat{y}'_i = \hat{\beta}_0 + \hat{\beta}_1 x_{1i} + \hat{\beta}_2 x_{2i} \tag{6.9}$$

Hence, using (6.8a) and (6.9), the error between each observation and its
estimate is given by $\varepsilon_i = y_i - \hat{y}'_i$. The error value can be negative.

(b) Estimation of Model Parameters and Their Meaning

The model parameters can be estimated using the ordinary least squares
method, by minimising the Error Sum of Squares and solving for the parameters,
that is

$$\sum_{i=1}^{n} \varepsilon_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}'_i)^2 = \sum_{i=1}^{n} \left( y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{1i} - \hat{\beta}_2 x_{2i} \right)^2 \tag{6.10}$$

Differentiating (6.10) with respect to each parameter and equating each
derivative to zero, we obtain the following three equations:

$$\sum_{i=1}^{n} y_i = n\hat{\beta}_0 + \hat{\beta}_1 \sum_{i=1}^{n} x_{1i} + \hat{\beta}_2 \sum_{i=1}^{n} x_{2i} \tag{6.11}$$

$$\sum_{i=1}^{n} y_i x_{1i} = \hat{\beta}_0 \sum_{i=1}^{n} x_{1i} + \hat{\beta}_1 \sum_{i=1}^{n} x_{1i}^2 + \hat{\beta}_2 \sum_{i=1}^{n} x_{1i} x_{2i} \tag{6.12}$$

$$\sum_{i=1}^{n} y_i x_{2i} = \hat{\beta}_0 \sum_{i=1}^{n} x_{2i} + \hat{\beta}_1 \sum_{i=1}^{n} x_{2i} x_{1i} + \hat{\beta}_2 \sum_{i=1}^{n} x_{2i}^2 \tag{6.13}$$

Solving these equations for the model parameters, we obtain the following
estimates:

$$\hat{\beta}_1 = \frac{\left(\sum d_1 d\right)\left(\sum d_2^2\right) - \left(\sum d_2 d\right)\left(\sum d_1 d_2\right)}{\left(\sum d_1^2\right)\left(\sum d_2^2\right) - \left(\sum d_1 d_2\right)^2} \tag{6.14}$$

$$\hat{\beta}_2 = \frac{\left(\sum d_2 d\right)\left(\sum d_1^2\right) - \left(\sum d_1 d\right)\left(\sum d_1 d_2\right)}{\left(\sum d_1^2\right)\left(\sum d_2^2\right) - \left(\sum d_1 d_2\right)^2} \tag{6.15}$$

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}_1 - \hat{\beta}_2 \bar{x}_2 \tag{6.16}$$

where $\bar{y}$, $\bar{x}_1$ and $\bar{x}_2$ are the mean values of the observations y, $x_1$ and $x_2$
respectively, and $d = y - \bar{y}$, $d_1 = x_1 - \bar{x}_1$, $d_2 = x_2 - \bar{x}_2$.


Worked Example 6.1


Given a set of n = 10 observations (y, x1, x2), the following table is usually
constructed for manual calculation of the parameter estimates (this is practical
when n is small enough for such calculations to be carried out by hand). For
large n, the calculation can be done using a statistical package such as
EXCEL, MINITAB or SPSS. Please take note that different packages may generate
slightly different answers due to rounding error.

Table 6.2: Calculation Steps (Using a Hand-held Calculator)

y      x1     x2    d      d1      d2    d²      d·d1     d·d2    d1²      d2²   d1·d2

30     3.2    2     -0.2   0.43    -1    0.04    -0.086   0.2     0.1849   1     -0.43
36     3.4    4     5.8    0.63    1     33.64   3.654    5.8     0.3969   1     0.63
28     2.8    3     -2.2   0.03    0     4.84    -0.066   0       0.0009   0     0
29     2.4    4     -1.2   -0.37   1     1.44    0.444    -1.2    0.1369   1     -0.37
27     2.5    2     -3.2   -0.27   -1    10.24   0.864    3.2     0.0729   1     0.27
28     2.2    3     -2.2   -0.57   0     4.84    1.254    0       0.3249   0     0
32     2.7    3     1.8    -0.07   0     3.24    -0.126   0       0.0049   0     0
27     2.6    2     -3.2   -0.17   -1    10.24   0.544    3.2     0.0289   1     0.17
34     3      4     3.8    0.23    1     14.44   0.874    3.8     0.0529   1     0.23
31     2.9    3     0.8    0.13    0     0.64    0.104    0       0.0169   0     0
Total:
302    27.7   30    0      0       0     83.6    7.46     15      1.221    6     0.5

$\bar{y}$ = 30.2, $\bar{x}_1$ = 2.77, and $\bar{x}_2$ = 3.

From equations (6.14), (6.15) and (6.16), we obtain the following estimates:

$$\hat{\beta}_1 = \frac{(7.46)(6) - (15)(0.5)}{(1.221)(6) - (0.5)^2} = 5.265687 \approx 5.266$$

$$\hat{\beta}_2 = \frac{(15)(1.221) - (7.46)(0.5)}{(1.221)(6) - (0.5)^2} = 2.061193 \approx 2.0612$$

$$\hat{\beta}_0 = 30.2 - 5.265687(2.77) - 2.061193(3) = 9.430469 \approx 9.430$$

Model Estimation:

$$\hat{y}' = 9.430 + 5.266 x_1 + 2.0612 x_2 \tag{6.17}$$
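
The hand calculation in Worked Example 6.1 can be reproduced with a short script. The sketch below applies formulae (6.14) to (6.16) directly to the ten observations; the helper names `mean` and `dev_sum` are assumptions introduced for this illustration and are not part of the module.

```python
# Observations (y, x1, x2) from Worked Example 6.1
y = [30, 36, 28, 29, 27, 28, 32, 27, 34, 31]
x1 = [3.2, 3.4, 2.8, 2.4, 2.5, 2.2, 2.7, 2.6, 3.0, 2.9]
x2 = [2, 4, 3, 4, 2, 3, 3, 2, 4, 3]

def mean(values):
    return sum(values) / len(values)

def dev_sum(u, v):
    """Sum of products of deviations from the means, e.g. sum(d1 * d2)."""
    u_bar, v_bar = mean(u), mean(v)
    return sum((a - u_bar) * (b - v_bar) for a, b in zip(u, v))

s_d1d = dev_sum(x1, y)    # sum of d * d1
s_d2d = dev_sum(x2, y)    # sum of d * d2
s_d1d1 = dev_sum(x1, x1)  # sum of d1 squared
s_d2d2 = dev_sum(x2, x2)  # sum of d2 squared
s_d1d2 = dev_sum(x1, x2)  # sum of d1 * d2

denominator = s_d1d1 * s_d2d2 - s_d1d2 ** 2
beta1 = (s_d1d * s_d2d2 - s_d2d * s_d1d2) / denominator  # formula (6.14)
beta2 = (s_d2d * s_d1d1 - s_d1d * s_d1d2) / denominator  # formula (6.15)
beta0 = mean(y) - beta1 * mean(x1) - beta2 * mean(x2)    # formula (6.16)

print(round(beta0, 4), round(beta1, 4), round(beta2, 4))
```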


(c) Estimated Value of y and its Residual


Model (6.17) can be used to obtain an estimate for each y value. For
example, when i = 3, $y_3 = 28$; $\hat{y}'_3$ = 9.430 + 5.266 (2.8) + 2.0612 (3) =
30.3584 ≈ 30.36, and its residual is

$$\varepsilon_3 = y_3 - \hat{y}'_3 = 28.0 - 30.36 = -2.36 \tag{6.18}$$

The negative residual indicates that $\hat{y}'_3$ over-estimates $y_3$, since
$\hat{y}'_3 > y_3$.

On the other hand, for i = 7, $y_7 = 32$; $\hat{y}'_7$ = 9.430 + 5.266 (2.7) +
2.0612 (3) = 29.8318 ≈ 29.83, and its residual is

$$\varepsilon_7 = y_7 - \hat{y}'_7 = 32.0 - 29.83 = 2.17 \tag{6.19}$$

The positive residual indicates that $\hat{y}'_7$ under-estimates $y_7$, since
$\hat{y}'_7 < y_7$.

The predicted value of y: from this model, if $x_1 = 3.0$ and $x_2 = 2$, then
$\hat{y}'$ = 9.430 + 5.266 (3.0) + 2.0612 (2) ≈ 29.35 is the predicted value for
the actual (observed) value of y.

(d) Meaning of Parameter Estimation

The contribution of each independent variable to the variation in y values can
be explained one at a time.
For example, when $x_2$ is held fixed, y is estimated to increase by 5.266 for
each unit increase in $x_1$. Similarly, when $x_1$ is held fixed, y is estimated to
increase by 2.0612 for each unit increase in $x_2$.
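
The fitted values, residuals and prediction above follow directly from model (6.17). As a small illustration, the sketch below uses the rounded coefficients of (6.17); the function name `predict` is an assumption made for this example.

```python
# Rounded coefficients of the estimated model (6.17)
b0, b1, b2 = 9.430, 5.266, 2.0612

def predict(x1, x2):
    """Estimated value of y from model (6.17)."""
    return b0 + b1 * x1 + b2 * x2

# Observation 3: y = 28, x1 = 2.8, x2 = 3
y3_hat = predict(2.8, 3)
residual3 = 28 - y3_hat  # negative: the model over-estimates y3

# Observation 7: y = 32, x1 = 2.7, x2 = 3
y7_hat = predict(2.7, 3)
residual7 = 32 - y7_hat  # positive: the model under-estimates y7

# Prediction for a new point x1 = 3.0, x2 = 2
y_new = predict(3.0, 2)

print(round(residual3, 2), round(residual7, 2), round(y_new, 2))
```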

EXERCISE 6.1
Referring to data in Example 6.1, and estimated model (6.17), calculate
(a) An estimation and residual values of yi , i = 4, 9; and interpret the
value of the residuals; and
(b) The predicted observed value of y, if x1 = 4.0 , and x2 = 5.0.


TESTING THE SIGNIFICANCE OF MODEL PARAMETERS


As mentioned at the beginning of this topic, $\hat{\beta}_0$, $\hat{\beta}_1$, $\hat{\beta}_2$ are estimates of the
population parameters $\beta_0$, $\beta_1$, $\beta_2$ respectively. The significance of $\hat{\beta}_0$ is of less
importance; hence, it can be ignored.

It is important to test the significance of $\hat{\beta}_1$ and $\hat{\beta}_2$ to determine whether $x_1$
and/or $x_2$ respectively contribute significantly to the variation in y values.

The estimators $\hat{\beta}_1$, $\hat{\beta}_2$ follow sampling distributions with means and variances
as below:

Mean of $\hat{\beta}_1$ = $\beta_1$ and Variance of $\hat{\beta}_1$ = Var($\hat{\beta}_1$) = $s^2(\hat{\beta}_1)$, where

$$s^2(\hat{\beta}_1) = \frac{\sum \varepsilon_i^2}{n - k - 1} \cdot \frac{\sum d_2^2}{\left(\sum d_1^2\right)\left(\sum d_2^2\right) - \left(\sum d_1 d_2\right)^2} \tag{6.20}$$

and

Mean of $\hat{\beta}_2$ = $\beta_2$ and Variance of $\hat{\beta}_2$ = Var($\hat{\beta}_2$) = $s^2(\hat{\beta}_2)$, where

$$s^2(\hat{\beta}_2) = \frac{\sum \varepsilon_i^2}{n - k - 1} \cdot \frac{\sum d_1^2}{\left(\sum d_1^2\right)\left(\sum d_2^2\right) - \left(\sum d_1 d_2\right)^2} \tag{6.21}$$

and k = the number of parameters in the model EXCLUDING the constant.


The hypothesis statements are:

$H_0: \beta_1 = 0$ against $H_1: \beta_1 \ne 0$; and $H_0: \beta_2 = 0$ against $H_1: \beta_2 \ne 0$.

Test statistic: $T = \dfrac{\hat{\beta}_i - \beta_i}{s(\hat{\beta}_i)} = \dfrac{\hat{\beta}_i}{s(\hat{\beta}_i)}$ under $H_0$, for i = 1, 2.

Test result: T follows a t distribution with ν = n – k – 1 degrees of freedom.
At significance level α, reject $H_0$ if $|T| > t_{\alpha/2,\,\nu}$.

This test is a two-sided test. The t distribution is used since the population
variance is unknown and the sample size n is small. $s(\hat{\beta}_i)$ is the standard
deviation of $\hat{\beta}_i$, where i = 1, 2. Both standard deviations are calculated prior to
making inferences.

CONFIDENCE INTERVAL FOR MODEL PARAMETERS


Apart from hypothesis testing, we can also construct a confidence interval for $\beta_i$.
This confidence interval provides a range of values that contains the population
parameter value with a (1 – α)100% confidence level. Based on the test statistic T
(two-sided test), which follows a $t_{n-k-1}$ distribution, we can construct a
(1 – α)100% confidence interval as follows:

$$\hat{\beta}_i - t_{\alpha/2,\,n-k-1}\, s(\hat{\beta}_i) \le \beta_i \le \hat{\beta}_i + t_{\alpha/2,\,n-k-1}\, s(\hat{\beta}_i) \tag{6.22}$$


Worked Example 6.2: Testing Significance of Model Parameters (6.17).


This model has k = 2 independent variables and n = 10. From the data in
Example 6.1, the following values are obtained:

Table 6.3

y      ŷ'          ε = y – ŷ'   ε²         d²
30     30.40305    -0.40305     0.16245    0.04
36     35.57858    0.42142      0.17759    33.64
28     30.35797    -2.35797     5.56002    4.84
29     30.31289    -1.31289     1.72368    1.44
27     26.71707    0.28293      0.08005    10.24
28     27.19856    0.80144      0.64231    4.84
32     29.8314     2.1686       4.70283    3.24
27     27.24364    -0.24364     0.05936    10.24
34     33.4723     0.5277       0.27847    14.44
31     30.88454    0.11546      0.01333    0.64
Total                           13.40008   83.6

$$\frac{\sum \varepsilon_i^2}{n - k - 1} = \frac{13.40008}{10 - 3} = \frac{13.40008}{7} = 1.914297 \approx 1.9143$$

$$\text{Var}(\hat{\beta}_1) = s^2(\hat{\beta}_1) = 1.914297 \times \frac{6}{(1.221)(6) - (0.5)^2} = 1.6232$$

Standard deviation, $s(\hat{\beta}_1) \approx 1.274$

and

$$\text{Var}(\hat{\beta}_2) = s^2(\hat{\beta}_2) = 1.914297 \times \frac{1.221}{(1.221)(6) - (0.5)^2} = 0.330322$$

Standard deviation, $s(\hat{\beta}_2) \approx 0.5747$


(i) Test for $\beta_1$:

The test statistic value,

$$t_1 = \frac{\hat{\beta}_1}{s(\hat{\beta}_1)} = \frac{5.265687}{1.274} \approx 4.1332 > t_{\alpha/2}(7) = 2.3646,$$

∴ $H_0$ is rejected at the 5% level; hence, $x_1$ contributes significantly to the
variation in y values.

(ii) Test for $\beta_2$:

The test statistic value,

$$t_2 = \frac{\hat{\beta}_2}{s(\hat{\beta}_2)} = \frac{2.061193}{0.5747} \approx 3.5866 > t_{\alpha/2}(7) = 2.3646,$$

∴ $H_0$ is rejected at the 5% level. There is enough evidence at the 5%
significance level that $x_2$ contributes to the variation in y values.
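
The tests in Worked Example 6.2, together with the confidence interval (6.22), can be sketched from the column totals of Table 6.2 alone. In the code below the critical value 2.3646 is taken from the text rather than computed, and every variable name is an assumption made for this illustration.

```python
import math

# Column totals from Table 6.2 (Worked Example 6.1)
s_dd, s_dd1, s_dd2 = 83.6, 7.46, 15.0
s_d1d1, s_d2d2, s_d1d2 = 1.221, 6.0, 0.5
n, k = 10, 2

denominator = s_d1d1 * s_d2d2 - s_d1d2 ** 2
beta1 = (s_dd1 * s_d2d2 - s_dd2 * s_d1d2) / denominator
beta2 = (s_dd2 * s_d1d1 - s_dd1 * s_d1d2) / denominator

# Residual sum of squares and error variance estimate
sse = s_dd - beta1 * s_dd1 - beta2 * s_dd2
error_variance = sse / (n - k - 1)

# Standard deviations of the estimators, formulae (6.20) and (6.21)
sd_beta1 = math.sqrt(error_variance * s_d2d2 / denominator)
sd_beta2 = math.sqrt(error_variance * s_d1d1 / denominator)

# Test statistics for H0: beta_i = 0
t1 = beta1 / sd_beta1
t2 = beta2 / sd_beta2

# 95% confidence intervals, formula (6.22)
t_crit = 2.3646  # t_{0.025} with n - k - 1 = 7 degrees of freedom (from the text)
ci_beta1 = (beta1 - t_crit * sd_beta1, beta1 + t_crit * sd_beta1)
ci_beta2 = (beta2 - t_crit * sd_beta2, beta2 + t_crit * sd_beta2)

print(round(t1, 3), round(t2, 3))
print(ci_beta1, ci_beta2)
```

Neither interval contains zero, which agrees with the rejection of both null hypotheses at the 5% level.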

SIMPLE METHOD TO CALCULATE TOTAL RESIDUALS


In Table 6.3, the residual sum of squares is obtained from the individual residuals
between each observation y and its estimate ŷ', calculated through model (6.17).
However, the residual sum of squares can be obtained without having to calculate
the estimated values ŷ', using the following formula:

$$\sum \varepsilon^2 = \sum d^2 - \hat{\beta}_1 \sum d\,d_1 - \hat{\beta}_2 \sum d\,d_2 \tag{6.23}$$

As an example, for Example 6.1,

$$\sum \varepsilon^2 = 83.6 - (5.265687)(7.46) - (2.061193)(15) = 13.40$$
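
To see that formula (6.23) gives the same answer as summing the squared residuals one by one, the following sketch computes the residual sum of squares both ways for the data of Example 6.1. The helper names are assumptions made for this illustration.

```python
# Observations (y, x1, x2) from Worked Example 6.1
y = [30, 36, 28, 29, 27, 28, 32, 27, 34, 31]
x1 = [3.2, 3.4, 2.8, 2.4, 2.5, 2.2, 2.7, 2.6, 3.0, 2.9]
x2 = [2, 4, 3, 4, 2, 3, 3, 2, 4, 3]

def mean(values):
    return sum(values) / len(values)

def dev_sum(u, v):
    """Sum of products of deviations from the means."""
    u_bar, v_bar = mean(u), mean(v)
    return sum((a - u_bar) * (b - v_bar) for a, b in zip(u, v))

s_dd, s_dd1, s_dd2 = dev_sum(y, y), dev_sum(y, x1), dev_sum(y, x2)
s_d1d1, s_d2d2, s_d1d2 = dev_sum(x1, x1), dev_sum(x2, x2), dev_sum(x1, x2)

denominator = s_d1d1 * s_d2d2 - s_d1d2 ** 2
beta1 = (s_dd1 * s_d2d2 - s_dd2 * s_d1d2) / denominator
beta2 = (s_dd2 * s_d1d1 - s_dd1 * s_d1d2) / denominator
beta0 = mean(y) - beta1 * mean(x1) - beta2 * mean(x2)

# Direct method: sum the squared residuals y - y_hat
sse_direct = sum(
    (yi - (beta0 + beta1 * a + beta2 * b)) ** 2 for yi, a, b in zip(y, x1, x2)
)

# Shortcut, formula (6.23): no fitted values needed
sse_shortcut = s_dd - beta1 * s_dd1 - beta2 * s_dd2

print(round(sse_direct, 5), round(sse_shortcut, 5))
```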


THE MULTIPLE COEFFICIENT OF DETERMINATION, R²

(a) Firstly, obtain the ŷ' values using model (6.17), as in Table 6.3. The R²
coefficient, which measures the goodness-of-fit of the estimated model
(6.17), gives the proportion of the total variation in y that can be explained
by the multiple regression model (6.17). This value can be calculated using
the following formula:

$$R^2 = \frac{\sum (\hat{y}' - \bar{y})^2}{\sum (y - \bar{y})^2} = 1 - \frac{\sum \varepsilon^2}{\sum d^2} \tag{6.24}$$

(b) Without calculating ŷ', the coefficient is given by the following formula:

$$R^2 = \frac{\hat{\beta}_1 \sum d\,d_1 + \hat{\beta}_2 \sum d\,d_2}{\sum d^2} \tag{6.24a}$$

R² for Example 6.1

Based on formula (6.24),

$$R^2 = 1 - \frac{\sum \varepsilon^2}{\sum d^2} = 1 - \frac{13.40}{83.6} = 0.839713 \approx 0.8397$$

This means the estimated model (6.17) explains 83.97% of the variation in y
values; the rest cannot be explained by the model and is contained in
the error ε.
From formula (6.24a),

$$R^2 = \frac{(5.265687)(7.46) + (2.061193)(15)}{83.6} = 0.839712 \approx 0.8397$$

$\bar{R}^2$ (R²-adjusted)

The R²-adjusted value is given by the formula below:

$$\bar{R}^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - k - 1} \tag{6.24b}$$


The quantity $\bar{R}^2$ is introduced to take into consideration the reduction in
degrees of freedom (df) caused by the addition of independent variables. Such an
addition usually increases the regression sum of squares SSR, where

$$SSR = \sum (\hat{y}'_i - \bar{y})^2$$

A few results can be derived from this:

When $k \ge 1$, $\bar{R}^2 \le R^2$;
When n is large, for any value of k, $\bar{R}^2 \approx R^2$;
When n is small (relative to k), $\bar{R}^2 \ll R^2$; hence, the $\bar{R}^2$ value can be negative
although the $R^2$ value is in the interval $0 \le R^2 \le 1$.

$\bar{R}^2$ for Example 6.1

$$\bar{R}^2 = 1 - (1 - 0.839712)\,\frac{10 - 1}{10 - 2 - 1} = 0.793915.$$
TESTING THE SIGNIFICANCE OF OVERALL REGRESSION MODEL
The F test can be used to test the significance of the overall regression model. It is
based on the ratio of the variance explained by the model to the remaining
unexplained variance. The F distribution is used, with k and n – k – 1 degrees of
freedom, where k is the number of parameters estimated EXCLUDING the
constant $\beta_0$. (A few books count $\beta_0$ as a parameter, hence the F degrees of
freedom become k – 1 and n – k.) The test statistic is given by:

$$F_{k,\,n-k-1} = \frac{\sum (\hat{y}'_i - \bar{y})^2 / k}{\sum \varepsilon_i^2 / (n - k - 1)} = \frac{R^2 / k}{(1 - R^2) / (n - k - 1)} \tag{6.25}$$

The test hypotheses are:

$H_0$: There is no linear association between the dependent variable, y, and the
independent variables, $x_1$ and $x_2$
$H_1$: There is a linear association
Or

$H_0$: $\beta_1 = \beta_2 = 0$;
$H_1$: at least one $\beta_i$ is not zero.


Test result: If the F-probability (p-value) is < 0.05, reject $H_0$ at the 5% significance
level. This means there is significant evidence that the regression parameters $\beta_1$
and $\beta_2$ are not both zero. Consequently, the coefficient of determination R² is
significantly different from zero.
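
The module does not work out the F statistic by hand for Example 6.1, so the sketch below does so under stated assumptions: it uses the sums of squares from Worked Example 6.2 and the two equivalent forms of formula (6.25). The 5% critical value 4.74 for (2, 7) degrees of freedom is taken from a standard F table and is not derived here; the variable names are likewise assumptions.

```python
# Values from Worked Examples 6.1 and 6.2
total_ss = 83.6        # sum of d squared
residual_ss = 13.40008  # sum of squared residuals
n, k = 10, 2

regression_ss = total_ss - residual_ss
r_squared = regression_ss / total_ss

# Formula (6.25), first form: ratio of mean squares
f_from_ss = (regression_ss / k) / (residual_ss / (n - k - 1))

# Formula (6.25), second form: expressed through R-squared
f_from_r2 = (r_squared / k) / ((1 - r_squared) / (n - k - 1))

f_crit = 4.74  # F_{0.05} with (2, 7) degrees of freedom, from a standard F table
reject_h0 = f_from_ss > f_crit

print(round(f_from_ss, 3), round(f_from_r2, 3), reject_h0)
```

Since the statistic is well above the critical value, the hypothesis that both slope parameters are zero would be rejected at the 5% level, in line with the individual t tests.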

6.1.3 Calculation Using Microsoft Excel Package

Steps:
(i) Open an Excel spreadsheet.
(ii) Enter the data (or import them from a data source), as in Table 6.4. (You can
enter the data in any columns.)
(iii) Click on Tools.
(iv) Click on Data Analysis.
(v) Double-click on Regression.
(vi) Enter the y range as: A1:A11 (row 1 holds the column label).
(vii) Enter the x range as: B1:C11 (row 1 holds the column labels).
(viii) Choose: Labels, Confidence Level, Residuals, Residual Plots, Normal
Probability Plots.
(ix) Click on OK.

Table 6.4: Example of Microsoft Excel Application


A B C D E F G H ….
y x1 x2
30 3.2 2
36 3.4 4
28 2.8 3
29 2.4 4
27 2.5 2
28 2.2 3
32 2.7 3
27 2.6 2
34 3 4
31 2.9 3


Example of Output:

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.918389714
R Square             0.843439667
Adjusted R Square    0.79125289
Standard Error       1.476566151
Observations         9

ANOVA
              df   SS             MS           F            Significance F
Regression    2    70.47406997    35.237035    16.1619419   0.003837472
Residual      6    13.08148559    2.1802476
Total         8    83.55555556

              Coefficients   Standard Error   t Stat       P-value      Lower 95%      Upper 95%
Intercept     9.037694013    4.035365839      2.23962198   0.06638156   -0.83649046    18.9118785
3.2           5.609756098    1.630594534      3.44031332   0.01379565   1.619835016    9.59967718
2             1.900776053    0.743177823      2.5576329    0.04304568   0.082285433    3.7192666

RESIDUAL OUTPUT                                                        PROBABILITY OUTPUT

Observation   Predicted 30   Residuals      Standard Residuals         Percentile     30
1             35.71396896    0.286031042    0.22368127                 5.555555556    27
2             30.44733925    -2.447339246   -1.91386207                16.66666667    27
3             30.10421286    -1.10421286    -0.86351376                27.77777778    28
4             26.86363636    0.136363636    0.10663875                 38.88888889    28
5             27.08148559    0.918514412    0.71829432                 50             29
6             29.88636364    2.113636364    1.65290058                 61.11111111    31
7             27.42461197    -0.424611973   -0.33205398                72.22222222    32
8             33.47006652    0.529933481    0.41441724                 83.33333333    34
9             31.00831486    -0.008314856   -0.00650236                94.44444444    36


Figure 6.1

Figure 6.2

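There are differences between the values obtained manually and those obtained using
Excel. The main reason is that the Excel output above is based on only 9 observations:
the first data row (30, 3.2, 2) was read as column labels, as shown by the coefficient
rows named "3.2" and "2" and the heading "Predicted 30". When the Labels box is
ticked, make sure the input ranges start with the row of column labels. With all 10
observations included, any remaining differences are due only to rounding.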
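
One way to investigate the differences between the manual and Excel results is to refit the model twice: once with all ten observations and once with the first observation left out, which is what happens if the first data row is read as column labels. The sketch below does this with formulae (6.14) to (6.16); the function name `fit_two_variables` is an assumption made for this illustration.

```python
def fit_two_variables(y, x1, x2):
    """Least squares estimates (b0, b1, b2) using formulae (6.14) to (6.16)."""
    n = len(y)
    y_bar, x1_bar, x2_bar = sum(y) / n, sum(x1) / n, sum(x2) / n
    d = [v - y_bar for v in y]
    d1 = [v - x1_bar for v in x1]
    d2 = [v - x2_bar for v in x2]
    s_dd1 = sum(a * b for a, b in zip(d, d1))
    s_dd2 = sum(a * b for a, b in zip(d, d2))
    s_d1d1 = sum(a * a for a in d1)
    s_d2d2 = sum(a * a for a in d2)
    s_d1d2 = sum(a * b for a, b in zip(d1, d2))
    denominator = s_d1d1 * s_d2d2 - s_d1d2 ** 2
    b1 = (s_dd1 * s_d2d2 - s_dd2 * s_d1d2) / denominator
    b2 = (s_dd2 * s_d1d1 - s_dd1 * s_d1d2) / denominator
    b0 = y_bar - b1 * x1_bar - b2 * x2_bar
    return b0, b1, b2

# Data from Table 6.4
y = [30, 36, 28, 29, 27, 28, 32, 27, 34, 31]
x1 = [3.2, 3.4, 2.8, 2.4, 2.5, 2.2, 2.7, 2.6, 3.0, 2.9]
x2 = [2, 4, 3, 4, 2, 3, 3, 2, 4, 3]

all_ten = fit_two_variables(y, x1, x2)
# First observation dropped, as when it is treated as a label row
last_nine = fit_two_variables(y[1:], x1[1:], x2[1:])

print([round(b, 4) for b in all_ten])
print([round(b, 4) for b in last_nine])
```

The nine-observation fit should match the coefficients in the Excel output (about 9.0377, 5.6098 and 1.9008), while the ten-observation fit should match the manual estimates of Worked Example 6.1.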

Interpretation of Output:

(a) Multiple R
This quantity is often referred to as the multiple correlation between y and all
independent variables, without any condition imposed on any independent
variable. Its square gives the multiple R².

(b) R Square
This measures the goodness-of-fit of the ŷ model to the observed y. A high value
means that the regression model explains much of the variation in Y, namely
(R² × 100)% of it. On the other hand, a small value indicates a poor fit
of the regression model.

(c) ANOVA (Analysis of Variance)


The total variation in Y is split into the variation explained by the regression model
(measured by the Regression Sum of Squares) and the unexplained variation
(measured by the Residual Sum of Squares), both in the SS column. The MS (mean
square) column is the SS value divided by the respective degrees of freedom (df).
Subsequently, the F value is the ratio of the MS values. The F probability
(Significance F) is used to test the hypothesis $H_0: \beta_1 = \beta_2 = 0$. If the
probability value is less than 0.05, the following results are obtained:

(i) Reject $H_0$ at the 5% significance level.

(ii) This means there exists a linear relationship between Y and X1 and X2
simultaneously.

(iii) Hence, $\beta_1$ and $\beta_2$ are not both zero.

(iv) The R² value is significantly different from zero, which supports the
existence of a linear relationship between Y and X1 and X2.

(d) t Test:
This test is used to evaluate whether an individual regression coefficient
(β1, β2) differs significantly from zero at level α, by comparing the p
probability value with the α value. For example, assume that α = 0.05; if p < α,
reject H0: β1 = 0 at 5% level. This means that β1 ≠ 0, and X1 contributes
significantly to the variation in Y. In the above example, we found that
β1 and β2 are both significantly different from zero at 5% level.

(e) Residual Output:
One way to assess the adequacy of a model is by checking whether the
assumptions made in constructing the model are fulfilled. This can be done by
plotting the residuals against the predicted values. If the assumptions are met,
the residuals plot will usually not show any particular pattern.

This means that the residuals are scattered at random around the horizontal line, as in
Figure 6.1. The normality assumption is satisfied if the normal probability plot follows a
straight line. Referring to Figure 6.2, not all points fall on the straight line; hence, the
regression model satisfies the normality assumption only approximately (about 94%).


EXERCISE 6.3

1. (i) Perform a manual analysis and use formulae (6.14), (6.15) and
(6.16) to obtain the parameter estimates for β0, β1 and β2 of the
multiple regression model for the following data:
Y: 10, 24, 40, 20, 15
X1: 2, 3, 7, 3, 4
X2: 5, 6, 6, 5, 3
(ii) Use your model to obtain the estimated values of y, and
estimate the total residuals.
(iii) Perform a significance test on the parameters.
(iv) Obtain the R² coefficient and interpret its value.
(v) Perform the F test and draw a conclusion.

2. Use Microsoft Excel to perform a multiple regression analysis on the
following sets of data:
(a) Y: 10, 24, 40, 20, 15
X1: 2, 3, 7, 3, 4; X2: 5, 6, 6, 5, 3
(b) Y: 10, 25, 30, 20, 15
X1: 2, 4, 4, 3, 4; X2: 2, 6, 8, 6, 3

3. Referring to question 2, explain the results of the analysis (as in
Example 6.1).

• Multiple linear regression is a technique to analyse the relationship between a
dependent variable and two or more independent variables.

• The regression analysis assumes that the relationship is in a linear form.

• The least squares method is used to obtain the estimates of the regression
model parameters and y-intercept.

• Next, hypothesis testing is carried out on the regression coefficients to
determine whether there is enough evidence to state the existence of a
linear relationship.

• The goodness-of-fit of the model can be measured using R², the
coefficient of determination.

• Any deviation of the model from the assumptions made can be identified using
the residuals plot, which should not display any distinct pattern (that is, it
should look random) if the model assumptions are met.

• If the multiple linear regression model obtained is suitable and fits the data,
it can be used to estimate or predict the value of the dependent variable
for any given values of the independent variables.

Topic 7 Introduction to Non-Parametric Concepts
LEARNING OUTCOMES
By the end of this topic, you should be able to:
1. Identify data condition for the application of non-parametric statistics;
2. Apply alternative tests other than parametric statistics in inferential
statistics;
3. Compare parametric statistics and non-parametric statistics; and
4. Explain the advantages and disadvantages of non-parametric statistics.

INTRODUCTION
The methodologies in inferential statistics discussed so far, estimation and
hypothesis testing, are under the assumption that the random samples are selected
from normal populations. Since the population under study is normal, inferential
statistics will use normal distribution or other distributions related to normal
distribution such as t, chi-square and F distributions. Fortunately, most of these
tests are still reliable when we experience slight departures from normality,
particularly when the sample size is large.

However, there are some circumstances where normality assumptions cannot be
met, such as when the sample size is small. Figure 7.1 below shows several
examples of distributions that cannot be assumed to approach normality.


Figure 7.1: A few examples of non-normal distributions

There are cases where we would like to do inferential studies on the measure of
location for a population distribution without using the population parameters that
are frequently used, such as the mean, μ, and the variance, σ².

In this case, we need a flexible inferential statistical method that does not depend
on the requirement of normality for population data.

The methodologies and tests discussed in this unit satisfy the requirement
mentioned above. They are called either Distribution-Free Tests, meaning that the
tests do not require prior knowledge of the distributions of the underlying
populations, or Non-Parametric Tests, referring to inferences that do not involve
population parameters.

Apart from that, some non-parametric tests are also known as Rank
Tests because rank scores replace the actual data values in the
analysis.

7.1 APPLICATION OF NON-PARAMETRIC STATISTICS
Data analysts use non-parametric or distribution-free procedures in many
applications. In short, they are used when the data does not meet the underlying
assumptions or when the data is qualitative. Let us study each of these
situations in detail.


7.1.1 Data Not Meeting Assumptions

The application of parametric statistics such as the t test, analysis of variance and
correlation analysis depends on several assumptions. One such assumption is that the
data is normally distributed, or at least approximately so. There are some
situations when this condition cannot be met, for example:
• The distribution of the data is not normal; it could be flat, very peaked or skewed,
as in Figure 7.1; or
• The sample size is small, that is, n < 30.

When the normality assumption required by, for example, the t test is not
fulfilled, the t test statistic obtained from the sample will not follow
a Student’s t distribution. In this case, the significance level for the test cannot
be determined with certainty. Hence, non-parametric methods are the
best alternative for data analysis.

7.1.2 Qualitative Data

Often in surveys, some of the important data are demographic data of respondents
such as gender, birthplace, etc. These are qualitative data and need to be assigned
a number or code for analysis. The assigned numbers or codes do not carry any
numerical meaning and do not have any order. For example, one characteristic may be
assigned number 1 and another number 2, but this does not mean that characteristic 1
is superior to characteristic 2. This type of data is called nominal data.

There are situations when observed values are assigned ranks according to the
magnitude of the response. For example, a comparison is to be made between new
software and the existing software used in an operation. It is difficult to give an
exact value for the qualitative variable “software ease of use in handling
operation”. However, we can still make a concrete and clear decision based on this
comparison by assigning ranks according to which software is more efficient or
better. This type of data is called ordinal data.

Data in both situations above is neither normally distributed nor suitable for
parametric analysis. In this case, the mean is not suitable as a measure of location.
Thus, a non-parametric test is an alternative for analysing nominal or ordinal
data.


(a) Nominal Data

The nominal scale is the lowest or weakest of the four types of
measurement scales, which are nominal, ordinal, interval and ratio. We will
not discuss interval and ratio scales in this topic. Measurement of a
characteristic is said to follow a nominal scale if the measurements are
assigned to classes or categories based on what is being measured. For
example, items produced from a production line can be classified or
categorised as defective or non-defective. A newborn baby can be classified
as male or female. In a survey, the biodata of a respondent
can include the category of birth state such as Johor, Malacca, Negeri
Sembilan, etc.

Next, for analysis purposes, numbers or codes are assigned to nominal
data arbitrarily. For example, items produced from a production line will be
assigned digit 1 if defective and digit 0 if non-defective. A baby’s gender will
be assigned number 1 if male and 2 if female. The same goes for the birth
state; for example, Johor – 1, Malacca – 2, Negeri Sembilan – 3, etc.

The data collected is the frequency or number of objects in each
classification or nominal category. This data is sometimes referred to as
count data or frequency data.

SELF-CHECK 7.1

1. What would be the consequences of applying inferential statistical
methods learnt in Topics 1 and 2 on non-normal data?
2. Give examples of data assigned on a nominal scale.

(b) Ordinal Data

A measurement is said to follow the ordinal scale if the scale
provides information on the rank of a unit compared to the other units in
terms of the measured characteristic. Values are assigned according to the
relative degree to which an object possesses a certain characteristic. The
scale enables objects to be assigned ranks or codes in sequence,
whether in increasing (indicates better) or decreasing (indicates worse) order.
For example, the effectiveness of a medicine in curing a disease can be
ranked as 1, 2, 3, 4 and 5. Rank 1 indicates that the medicine has no
effect at all and rank 2 that it has little effect; 2 is higher
than 1 in terms of effectiveness. Consequently, rank 5 shows the medicine as
most effective.

Another example of ordinal data is data on employees’ performance. The
lowest rank represents the least satisfactory work performance while the
highest rank means the most satisfactory work performance. A scale of 1 – 5,
1 – 7 or their equivalent is known as the Likert scale.

One thing to note is that ordinal scales cannot determine the magnitude of the
difference, that is, by how much one object is better than another in terms of the
measured characteristic. In the example of the effectiveness of a medicine, the
difference in magnitude between 2 and 3 cannot be determined. What we can conclude
is that 3 is more effective than 2.

SELF-CHECK 7.2

Give examples of data assigned on an ordinal scale.

Studies have shown that, in general, non-parametric tests possess higher
power than parametric methods in detecting any deviation or difference in
populations when the underlying assumptions are not met.

Based on the discussion above, in summary, non-parametric statistical methods
have some advantages over their parametric counterparts. They:

• Can be used for a small sample size without having to satisfy the
assumption of normality.
• Can be used for data classified in “weak” types of measurement such as
nominal and ordinal data.
• Do not require any prior knowledge of the sampled population
distribution.
• Are easy to understand.


SELF-CHECK 7.3

Mark (✓) the following types of data as either nominal or ordinal.

      Data                                                   Nominal Scale   Ordinal Scale
(a)   Data on car models: Proton; Toyota; Nissan;
      Honda; Datsun; Mazda
(b)   Data on efficiency of alarm systems: Very
      Effective; Quite Effective; Neutral; Less
      Effective; Not Effective
(c)   Customers’ evaluation of Company A's
      product: Satisfactory; Unsatisfactory
(d)   Bus service companies: Transnasional;
      Sutera; Hasry Ekspres; Labu Sendayan

7.2 LIMITATIONS OF NON-PARAMETRIC STATISTICS
There are a few limitations of non-parametric applications despite their
advantages.

Firstly, these methods do not make full use of the information available in the sample.
For example, in a non-parametric analysis, ranks or codes 1, 2, 3, etc. are used to
replace the original data. As discussed earlier, rank data does not take into
consideration the differences in magnitude between observations. For example,
suppose that the original data are 13.3, 18.8, 22.1, etc. Say, another
researcher took the samples 14.3, 14.5 and 14.9.

If parametric statistics were used, statistical values such as the mean, standard
deviation, etc. obtained by researcher 1 would definitely be different from those of
researcher 2. However, these differences would not be detected if non-parametric
statistics were applied, since rank assignment is used in the analysis.
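
The loss of information can be illustrated with a short Python sketch. It is only an illustration: it uses the three values quoted for each researcher, and the helper name `ranks` is a hypothetical one chosen here, not part of any library.

```python
from statistics import mean, stdev


def ranks(values):
    """Rank each value from 1 (smallest) upwards; this sketch assumes no ties."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]


# Only the three values quoted in the text are used for each researcher.
researcher_1 = [13.3, 18.8, 22.1]
researcher_2 = [14.3, 14.5, 14.9]

for name, sample in (("Researcher 1", researcher_1),
                     ("Researcher 2", researcher_2)):
    print(name,
          "mean:", round(mean(sample), 2),
          "sd:", round(stdev(sample), 2),
          "ranks:", ranks(sample))
```

The two samples have different means and standard deviations, yet both reduce to the same ranks, so a rank-based analysis cannot tell them apart.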


Secondly, non-parametric statistics are less efficient than the standard
techniques that they replace when the assumptions of those techniques are met.
This is because information on the population may not be available or may not be
fully used in the analysis.

Non-parametric tests produce a wider confidence interval for any given confidence level.
This results in lower power of test and hence, the results obtained are less
precise.

To test your understanding of Topic 7, try out the exercises below.

EXERCISE 7.1

1. Explain the differences between nominal and ordinal measurements.

   Nominal Measurement          Ordinal Measurement

2. Suppose that in a survey on customers’ perception of the quality of a
type of bath soap, respondents were asked to assign rank 1 – very
unsatisfactory, 2 – unsatisfactory, 3 – unsure, 4 – satisfactory and 5
– very satisfactory. The following data was collected:
1 1 5 2 1 5 2 2 2 5
5 4 2 2 5 5 2 1 5 5
(a) Calculate and interpret the mean, median and mode.
(b) In your opinion, is a parametric method suitable for
this data analysis? Explain.

For further information on the following, please visit the websites below:
• Non-parametric concepts:
http://www.statsofttinc.com/textbook/stnonpar.html
• Test of Sample Randomness:
http://geography.asu.edu/fall2002/gcu495/ttest/ttest


• Non-parametric methods differ from parametric statistical methodologies in
that the former have wider applications. The differences lie in the
measurement scales of the data used for analysis, the types of inferences made, and the
assumptions required about the population distribution.

Topic 8 Non-Parametric Test for Randomness
LEARNING OUTCOMES
By the end of this topic, you should be able to:
1. Explain the concepts of runs test for randomness;
2. Construct the null and alternative hypotheses for testing data
randomness;
3. Test sample randomness using runs test;
4. Test randomness of large samples; and
5. Test randomness of quantitative data.

INTRODUCTION
Most of the time, in analysis, we assume that the samples are chosen at random
from the population under study. The assumption of randomness, or random
sampling, means that samples are chosen without any preference; each unit in
the population has an equal chance of being selected.

There are several non-parametric statistical methods to test the randomness of a set
of observed data based on the order in which the sample
observations are obtained. The method discussed here is based on the theory of runs.
A run is defined as a succession of one or more identical symbols representing a
common property of the data.

A number of runs that is too small or too large indicates a departure from
randomness in an observed sample. The runs test for randomness tests
the null hypothesis that the sequence of events is random against the
alternative hypothesis that the sequence is not random.


8.1 RUNS TEST

Let us consider a production process that produces a sequence of items that are
well-functioning (G) or defective (D). Say we obtained the following sequence:

GGGGGDDGGGDDDGGGGGGG

In the example above, there is first a run of 5 ‘G’s, then
a run of 2 ‘D’s. Next, there is a run of 3 ‘G’s, a run of 3 ‘D’s and
finally, a run of 7 ‘G’s. In all, there are 5 runs of varying lengths:

GGGGG  DD  GGG  DDD  GGGGGGG
  1     2    3    4       5

Regardless of whether our sample measurements represent quantitative or
qualitative data, the runs test divides the data into two mutually exclusive
categories. Consequently, a sequence will always be limited to two distinct letters or
symbols.

Let n1 represent the number of occurrences of the first letter or symbol, while n2
represents the number of occurrences of the second letter or symbol. Hence, the
sample size is n = n1 + n2. For the production data above, n1 = number of letter ‘D’ = 5
while n2 = number of letter ‘G’ = 15, which gives in total a sequence of n = 20 letters.
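
Counting runs by eye becomes error-prone for long sequences. The following Python sketch shows one possible way to count them for the production sequence above; the function name `find_runs` is a hypothetical helper introduced here for illustration.

```python
from itertools import groupby


def find_runs(sequence):
    """Split a sequence of symbols into runs, returned as (symbol, length) pairs."""
    return [(symbol, len(list(group))) for symbol, group in groupby(sequence)]


production = "GGGGGDDGGGDDDGGGGGGG"
runs = find_runs(production)
n1 = production.count("D")   # number of defective items
n2 = production.count("G")   # number of well-functioning items

print("runs:", runs)
print("R =", len(runs), " n1 =", n1, " n2 =", n2, " n =", n1 + n2)
```

`itertools.groupby` groups consecutive identical symbols, which matches the definition of a run used in this topic.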

The total number of runs in a sample of size n is an indication of data
randomness. If the number of runs is too big or too small, the randomness
of the data obtained is doubtful. This may be due to clusters or trends in
the sample data. Referring to the above example, there are n = 20 letters in the sequence.
Suppose a sample resulted in only 2 runs such as the following:

DDDDDGGGGGGGGGGGGGGG

This, or any other sequence with only 2 runs, is most unlikely to occur from a random
selection process. Such a result indicates that the production process generates the
first 5 products as defective, followed by 15 well-functioning products.


Likewise, suppose the production sample resulted in an alternating letter sequence such as
the following:

GDGDGDGDGDGGGGGGGGGG

where n1 = 5 and n2 = 15. Here the number of runs reaches its maximum of
11. Hence, we would again be suspicious of the order in which the
samples were selected.

In another example, if the number of both letters is the same, that is n1 = n2 = m,
there are at most 2m runs, obtained when the letters alternate. If n = 20, the following
sequence is obtained:

GDGDGDGDGDGDGDGDGDGD

and the maximum number of runs is 20. This number of runs is large for a sample of size 20.
Thus, this sample cannot be regarded as random.

How can we verify the above statement? Suppose that R denotes the number of
runs. We would like to find the probability that n1 letters of the first kind and n2
letters of the second kind form R runs.

Firstly, we need to count the ways in which the letters of one kind can form runs. Let us
say we would like the n1 letters of the first kind to form k runs, with R = 2k and k a
positive integer. For instance, if n1 = 5, there are six ways to form k = 3 runs of ‘G’:

G|G|GGG G|GG|GG G|GGG|G GG|G|GG GG|GG|G GGG|G|G

The vertical bars ‘|’ separate the five letters into three different runs. Observe that
the number of possibilities in this example can be written as

$$\binom{n_1 - 1}{k - 1} = \binom{5 - 1}{3 - 1} = 6.$$

By the same token, there are $\binom{n_1 - 1}{k - 1}$ ways in which the n1 letters of the
first kind can form k runs and $\binom{n_2 - 1}{k - 1}$ ways in which the n2 letters of the
second kind can form k runs. It follows that there are altogether
$2\binom{n_1 - 1}{k - 1}\binom{n_2 - 1}{k - 1}$ ways in which
these n1 + n2 letters can form 2k runs. The factor 2 is accounted for by the fact
that when we combine the two kinds of runs so that they alternate, we can begin
either with a run of the first kind of letter or with a run of the second kind. Thus,
when R = 2k, the probability of getting that many runs is:

$$f(R) = f(2k) = \frac{2\binom{n_1 - 1}{k - 1}\binom{n_2 - 1}{k - 1}}{\binom{n_1 + n_2}{n_1}}$$

When the number of runs is odd, R = 2k + 1, one kind of letter forms one more run than
the other. There are $\binom{n_1 - 1}{k}\binom{n_2 - 1}{k - 1}$ ways to get k + 1 runs of
the first letter and k runs of the second letter, and
$\binom{n_1 - 1}{k - 1}\binom{n_2 - 1}{k}$ ways to get k runs of the first letter and
k + 1 runs of the second letter. Hence, the probability is:

$$f(R) = f(2k + 1) = \frac{\binom{n_1 - 1}{k}\binom{n_2 - 1}{k - 1} + \binom{n_1 - 1}{k - 1}\binom{n_2 - 1}{k}}{\binom{n_1 + n_2}{n_1}}$$

The runs test for testing sample randomness is based on the random variable R, that is,
the total number of runs. It is a two-sided test as shown in Figure 8.1.
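
The two probability formulas for an even and an odd number of runs can be evaluated with a short Python sketch such as the one below. The function name `runs_pmf` is a hypothetical choice made here, and the values n1 = 5 and n2 = 15 are taken from the production example purely for illustration.

```python
from math import comb


def runs_pmf(r, n1, n2):
    """Probability of exactly r runs when n1 and n2 letters are arranged at random."""
    if r < 2:
        return 0.0
    total = comb(n1 + n2, n1)          # number of distinct arrangements
    k, odd = divmod(r, 2)
    if not odd:                        # R = 2k
        ways = 2 * comb(n1 - 1, k - 1) * comb(n2 - 1, k - 1)
    else:                              # R = 2k + 1
        ways = (comb(n1 - 1, k) * comb(n2 - 1, k - 1)
                + comb(n1 - 1, k - 1) * comb(n2 - 1, k))
    return ways / total


n1, n2 = 5, 15
for r in range(2, 2 * min(n1, n2) + 2):
    print(r, round(runs_pmf(r, n1, n2), 6))
```

`math.comb` returns 0 when the lower index exceeds the upper one, so impossible numbers of runs automatically receive probability zero.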

The null and alternative hypotheses are constructed as follows:

H0: data sequence is random
H1: data sequence is not random

There are two methods to decide whether a given set of data is random or not.
Firstly, we can calculate the probability of getting a number of runs as extreme as
that shown in the data. If this probability is smaller than α/2 (since this is a
two-sided test), then H0: data sequence is random, is rejected.

The second method is by using the runs test table as in Table 5.1 in the
attachment. This table provides critical values of R used in the Runs Test for α = 0.05
with n1 and n2 up to 20. H0 will be rejected at α = 0.05 if the number of runs
R ≤ r(a) or R ≥ r(b), with r(a) and r(b) as the critical values at both sides at
α = 0.05 (see Figure 8.1). This table cannot be used if α is not 0.05.

Figure 8.1: R Distribution and critical values of r(a) and r(b)

Worked Example 8.1

Suppose that there are 8 male and 8 female patients on appointment with a
doctor. How many runs should there be to reject the null hypothesis on
randomness of patients’ sequence to meet the doctor at 0.01 significance level?

Answer:
From the question, n1 = 8 and n2 = 8. The null hypothesis which states that the
sample sequence is random will be rejected if the probability of getting a total
number of runs as extreme as r is less than α/2, that is, Pr(R ≤ r) < 0.01/2 = 0.005
in the lower tail. Since α ≠ 0.05, Table 5.1 cannot be used and the probabilities are
calculated directly:

$$\Pr(\text{least runs}) = f(2) = \frac{2\binom{7}{0}\binom{7}{0}}{\binom{16}{8}} = \frac{2}{12870} = 0.000155$$

$$f(3) = \frac{\binom{7}{1}\binom{7}{0} + \binom{7}{0}\binom{7}{1}}{12870} = \frac{14}{12870} = 0.001088$$

Pr(R ≤ 3) = f (2) + f (3) = 0.001243

$$f(4) = \frac{2\binom{7}{1}\binom{7}{1}}{12870} = \frac{98}{12870} = 0.007615$$

Pr(R ≤ 4) = f (2) + f (3) + f (4) = 0.008858

Based on the above calculation, we obtain Pr(R ≤ 3) = 0.001243 < 0.005, while
Pr(R ≤ 4) = 0.008858 > 0.005. Hence, H0 will be rejected if the total number of runs R ≤ 3.
When R = 4, the null hypothesis cannot be rejected since the probability of 4 runs alone,
Pr(R = 4) = 0.007615, already exceeds 0.005.

Pr(total runs is maximum) = Pr(alternating letters)

$$f(16) = \frac{2\binom{7}{7}\binom{7}{7}}{12870} = \frac{2}{12870} = 0.000155$$

$$f(15) = \frac{\binom{7}{7}\binom{7}{6} + \binom{7}{6}\binom{7}{7}}{12870} = \frac{14}{12870} = 0.001088$$

$$f(14) = \frac{2\binom{7}{6}\binom{7}{6}}{12870} = \frac{98}{12870} = 0.007615$$

Observe that the probability values are symmetrical in the two tails, since n1 = n2. Hence,
reject H0: the sequence of samples is random at α = 0.01 when the total number of runs R = 2,
3, 15 or 16. For each of these numbers of runs, the tail probability is
smaller than 0.005.

If the significance level is changed to α = 0.05, we would reject the null
hypothesis when the total number of runs R = 2, 3, 4, 14, 15 or 16. The tail probability
for these numbers of runs, Pr(R ≤ 4) = Pr(R ≥ 14) = 0.008858, is smaller than
0.05/2 = 0.025, while Pr(R ≤ 5) = 0.031702 exceeds it.

If we use the Runs Test Table method, we check the critical values, that is,
the hypothesis rejection values, at n1 = n2 = 8. From Table 5.1 in the attachment,
the rejection region is R ≤ 4 or R ≥ 14. Notice that the same answer is
obtained using both methods.
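
The rejection regions in this worked example can be checked by accumulating the exact probabilities from each tail. The Python sketch below is one way to do so; the function names `runs_pmf` and `rejection_region` are hypothetical helpers written for this illustration, and the code simply applies the probability formulas for f(2k) and f(2k + 1) given in this section.

```python
from math import comb


def runs_pmf(r, n1, n2):
    """Probability of exactly r runs for n1 and n2 letters arranged at random."""
    total = comb(n1 + n2, n1)
    k, odd = divmod(r, 2)
    if not odd:                        # R = 2k
        ways = 2 * comb(n1 - 1, k - 1) * comb(n2 - 1, k - 1)
    else:                              # R = 2k + 1
        ways = (comb(n1 - 1, k) * comb(n2 - 1, k - 1)
                + comb(n1 - 1, k - 1) * comb(n2 - 1, k))
    return ways / total


def rejection_region(n1, n2, alpha):
    """Return (lower, upper): reject H0 when R <= lower or R >= upper."""
    support = range(2, n1 + n2 + 1)
    lower, cumulative = None, 0.0
    for r in support:                  # accumulate the lower tail
        cumulative += runs_pmf(r, n1, n2)
        if cumulative >= alpha / 2:
            break
        lower = r
    upper, cumulative = None, 0.0
    for r in reversed(support):        # accumulate the upper tail
        cumulative += runs_pmf(r, n1, n2)
        if cumulative >= alpha / 2:
            break
        upper = r
    return lower, upper


print(rejection_region(8, 8, 0.01))
print(rejection_region(8, 8, 0.05))
```

A value of `None` for either bound would mean that no number of runs in that tail is extreme enough to reject H0 at the chosen level.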


SELF-CHECK 8.1

1. What is meant by runs? How can the number of runs indicate the
randomness of an event or data?

2. In a Runs Test for randomness, will the null hypothesis be
rejected for the following data?
(a) n1 = 7 n2 = 7 R = 3
(b) n1 = 5 n2 = 11 R = 5
Next, compare your answers with Table 5.1.

Now, let us try the exercise below.

EXERCISE 8.1

1. Find the probability that n1 = 6 letters of one kind and n2 = 5 letters
of another kind will form at least 8 runs.

2. An obstetrician would like to study the randomness of newborns’
gender. In the most recent 18 deliveries, she recorded the babies’
gender and obtained the following data:

BBGBGGGGGGBBGBBBBB

Test at 0.05 significance level whether this arrangement of babies’
gender may be regarded as random.

8.2 RUNS TEST FOR LARGE SAMPLE SIZE

When the sample size n is large, it is considered
reasonable to assume that the distribution of the random variable R
can be approximated closely by a normal distribution. Hence, we can compare
a standardised value of R with the standard normal distribution Z. The sample size n is said
to be large when both n1 and n2 ≥ 10, or when one of them is at least 20.


Based on this assumption, when the sample size increases, the sampling distribution of R
approaches a normal distribution. If we arrange at random n1 letters of one kind
and n2 letters of another kind, the mean of the number of runs, μR, and its
variance, σ²R, are:

$$\mu_R = \frac{2n_1n_2}{n_1 + n_2} + 1, \qquad \sigma_R^2 = \frac{2n_1n_2\,(2n_1n_2 - n_1 - n_2)}{(n_1 + n_2)^2\,(n_1 + n_2 - 1)}$$

Hence, we can use the test statistic

$$Z = \frac{R - \mu_R}{\sigma_R}$$

which is approximately distributed as standard normal. If the number of observed runs,
R, is near the mean value, the hypothesis on data randomness is supported. If R differs
greatly from the mean, there is evidence that the sample is not random.

Worked Example 8.2

A safety consultant would like to monitor the performance of a radar set
placed at the back of an advertisement board beside the North-South Highway.
This radar will give the letter ‘P’ if the speed of a vehicle passing through the
highway is within the speed limit and the letter ‘L’ if its speed is above the
speed limit. The following data has been recorded for 40 vehicles passing
through the area:

PPLLLLPPPPPPPPPLPPLLLLPPPPPPPLLLLPPPPLPP

The safety consultant would like to know if the sequence of drivers driving
within and above the speed limit is random or not. Alternatively, he would like
to find out if those drivers who are speeding are driving in a group.


Answer:
Let n1 = number of letter P and n2 = number of letter L. From the data
recorded, n1 = 26, n2 = 14 and R = 11.

The test for randomness is based on the following hypotheses:

H0: Speeding and slow vehicles arrive at random, or µR = 19.2
versus
H1: Speeding and slow vehicles do not arrive at random, or µR ≠ 19.2

Since both n1 and n2 ≥ 10, the sample is large and the mean and variance can be
calculated as below:

$$\mu_R = \frac{2(26)(14)}{26 + 14} + 1 = 19.2$$

$$\sigma_R^2 = \frac{2(26)(14)\,[2(26)(14) - 26 - 14]}{(26 + 14)^2\,(26 + 14 - 1)} = 8.0267$$

Hence, σR = 2.833 and the test statistic is Z = (11 – 19.2)/2.833 = –2.894.

For a two-sided test at 0.05 significance level, the rejection regions are
z ≤ –1.96 or z ≥ 1.96. Since z = –2.894 < –1.96, H0 is rejected. At 5%
significance level, the sequence is not random: speeding and slow drivers
tend to drive in groups.
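
The large-sample calculation in this worked example can be reproduced with a few lines of Python. This is a minimal sketch, assuming n1 = 26, n2 = 14 and R = 11 as counted in the example; the function name `runs_z_test` is hypothetical. Because the code carries full precision, its results may differ in the third decimal place from figures rounded by hand.

```python
from math import sqrt
from statistics import NormalDist


def runs_z_test(r, n1, n2):
    """Normal approximation to the runs test: returns (mean, variance, z, p-value)."""
    mu = 2 * n1 * n2 / (n1 + n2) + 1
    var = (2 * n1 * n2 * (2 * n1 * n2 - n1 - n2)
           / ((n1 + n2) ** 2 * (n1 + n2 - 1)))
    z = (r - mu) / sqrt(var)
    p_value = 2 * NormalDist().cdf(-abs(z))   # two-sided p-value
    return mu, var, z, p_value


mu, var, z, p_value = runs_z_test(r=11, n1=26, n2=14)
print("mean =", round(mu, 4))
print("variance =", round(var, 4))
print("z =", round(z, 3))
print("two-sided p-value =", round(p_value, 4))
```

Comparing the two-sided p-value with α gives the same decision as comparing z with the critical values ±1.96.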


EXERCISE 8.2

1. Calculate n1, n2, R, μR and σR based on the following outcomes:
(a) MMMFFMFMMFM
(b) NONONOONNONNNNOONON

2. The following is an arrangement of males and females lined up to
purchase bus tickets for the Chinese New Year holiday. Test for
randomness of this sequence at 0.01 level.

M F M F M M M F M F M M M F F M M M M F F M F
M M M F M M M F F F M F M M M M F M F M M M
M F F M

8.3 TESTS FOR RANDOMNESS OF QUANTITATIVE DATA
The runs test can also be used to detect departure from randomness in a sequence
of quantitative measurements over time, caused by trends or periodicities. To
test the randomness of quantitative data, a sequence of ‘+’ and ‘–’ symbols is
generated by replacing each measurement or observation, in the order in
which they are collected, by:

(a) a ‘+’ sign if it falls above the median value; and

(b) a ‘–’ sign if it falls below the median value. All measurements
that are exactly equal to the median value are omitted.


Worked Example 8.3

A machine is adjusted to dispense acrylic paint thinner into a container. Would
you say that the amount of paint thinner being dispensed by this machine varies
randomly if the contents of the next 15 containers are measured and found to be
3.6, 3.9, 4.1, 3.6, 3.8, 3.7, 3.4, 4.0, 4.8, 4.1, 3.9, 4.0, 3.8, 4.2 and 4.1 litres?

Answer:
From the given sample, we find the median x̃ = 3.9. Replacing each measurement
by the ‘+’ symbol if it falls above 3.9, by the ‘–’ symbol if it falls below 3.9
and by an ‘X’ sign if it equals 3.9 and will therefore be taken out, we obtain the
following sequence:
– X + – – – – + + + X + – + +
for which n1 = 6 (number of ‘–’ signs), n2 = 7 (number of ‘+’ signs) and R = 6.

The hypotheses for the randomness test are:

H0: sequence varies randomly
versus
H1: sequence does not vary randomly

Referring to Table 5.1, we get critical values for runs of 3 and 12, giving the (3, 12)
interval as the acceptance region. Since R = 6 is in this interval, we do not reject
the null hypothesis at 0.05 significance level. The sequence of measurements
varies randomly.
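
The conversion of the measurements into signs and the counting of runs in this worked example can be sketched in Python as follows. The function name `signs_about_median` is a hypothetical helper; values equal to the median are simply dropped, as described in the text.

```python
from itertools import groupby
from statistics import median


def signs_about_median(data):
    """Replace each value by '+' (above median) or '-' (below); drop ties with the median."""
    m = median(data)
    return "".join("+" if x > m else "-" for x in data if x != m)


volumes = [3.6, 3.9, 4.1, 3.6, 3.8, 3.7, 3.4, 4.0,
           4.8, 4.1, 3.9, 4.0, 3.8, 4.2, 4.1]

signs = signs_about_median(volumes)
n_minus = signs.count("-")                  # n1 in the worked example
n_plus = signs.count("+")                   # n2 in the worked example
runs = sum(1 for _ in groupby(signs))       # each group of equal signs is one run

print("median =", median(volumes))
print("signs  =", signs)
print("n1 =", n_minus, " n2 =", n_plus, " R =", runs)
```

The resulting n1, n2 and R are then compared with the critical values in the runs test table, exactly as in the answer above.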

SELF-CHECK 8.2

What is meant by the ‘+’, ‘–’ and ‘X’ signs in the test for randomness for
quantitative data? How is the sample size obtained?


EXERCISE 8.3

1. The following are the numbers of students absent from school on 22
consecutive school days: 29, 25, 31, 28, 33, 31, 35, 29, 31, 33, 35, 28,
36, 30, 33, 26, 30, 28, 32, 31, 38 and 27. Did the absenteeism of
students occur at random? Test at 0.01 significance level.

2. A silver-plating process is being used to coat a certain type of serving
tray. When the plating process is in statistical control, the thickness of
the silver on the tray will vary randomly following a normal
distribution with mean 0.02 ml and a standard deviation of 0.005 ml.
Suppose that the next 12 trays examined show the following thickness
of silver:

0.019, 0.021, 0.020, 0.019, 0.020, 0.018, 0.023, 0.021, 0.024, 0.022,
0.023, 0.022.

Use the runs test to determine if the fluctuations in thickness from one
tray to another are random. Test at 0.05 significance level.

• A runs test can be used to detect certain trends in a sample that show the non-
randomness of data.

• A runs test is a two-sided test since the question to be answered is whether
there are too many or too few runs in the sample.

• This test can be used for any sample size, and to test for randomness of
quantitative data.

Topic 9 Non-Parametric Hypothesis Test for Single Population
LEARNING OUTCOMES
By the end of this topic, you should be able to:
1. Construct appropriate null and alternative hypotheses for a non-
parametric single population test;
2. Test the location of a population using sign test;
3. Test the location of a population using signed-rank test;
4. Test the location of a large-size population; and
5. Differentiate between sign test and signed-rank test.

 INTRODUCTION
In testing a single population using parametric methods, the Student’s t-test
and the z-test (for large sample sizes) are used to determine whether the
population mean μ is equal to or different from a specified value μ0.
Non-parametric methods for testing a single population likewise enable us to
verify whether the location or position of a population differs significantly
from a given value of a measure of location.

As discussed earlier, non-parametric methods do not assume that samples are
taken from a normal distribution. For this analysis, the mean is replaced by
the median as the pertinent location parameter under test. This is because the
median value is not influenced by outlier values or a skewed distribution
shape. The median is the middle value of the data: it separates a population
into two equal halves.


To test a population based on the median, we test the hypothesis of whether the
median of the population under study, denoted by τ (read as ‘tau’), is
significantly different from a specific median, τ0 (read as ‘tau nought’).

Two types of non-parametric statistical tests will be discussed in Topic 9 to
test the position or location of a population using the median value: the sign
test and the signed-rank test.

SELF-CHECK 9.1

What are examples of qualitative data?

9.1 HYPOTHESIS STATEMENT FOR SINGLE POPULATION TESTING
In constructing hypotheses for a single-population test, there are three
possible alternative hypotheses, H1. In general, an alternative hypothesis is a
statement or claim that we wish to support or prove through a sample collected
and a test performed. On the other hand, the null hypothesis refers to the
statement that we wish to reject (note that ‘null’ is related to the English
word ‘nullify’, meaning to reject or invalidate).

For example, in a study to test consumers’ preferences on a certain product,
the test will decide whether more than half of the consumers sampled chose this
product or, equivalently, less than half chose the other products. If x
measures consumers’ preference on the product, then the probability that x is
larger than the median value, and the probability that x is smaller than it,
must both be equal to ½ (refer to Figure 9.1). If the median is denoted by τ0,
this can be written as Pr(x < τ0) = Pr(x > τ0) = ½ = θ.


Figure 9.1: Area under the graph from median value τ0

Hence, we will test H0: θ ≤ 0.5 (consumers do not have any preference for the
product; that is, the number of consumers whose preference exceeds the median
value is equal to the number whose preference is less than the median value)
versus H1: θ > 0.5 (consumers prefer this product). We can state the hypothesis
of interest in terms of the median, that is, H0: τ ≤ τ0 against H1: τ > τ0.

For the above example, the expression term used in constructing the alternative
hypothesis statement H1: θ > 0.5 or H1: τ > τ0 is “larger than”. In summary,
the following expression terms (see Table 9.1) can guide the choice of null
and alternative hypotheses for a single-population test:

Table 9.1: Expression Terms with Related Null and Alternative Hypotheses

Examples of Expression Terms           Alternative Hypothesis         Null Hypothesis
“more than”, “exceeds”, “increase”     H1: θ > 0.5 or (H1: τ > τ0)    H0: θ ≤ 0.5 or (H0: τ ≤ τ0)
“less than”, “decrease”                H1: θ < 0.5 or (H1: τ < τ0)    H0: θ ≥ 0.5 or (H0: τ ≥ τ0)
“differs from”, “not equal to”         H1: θ ≠ 0.5 or (H1: τ ≠ τ0)    H0: θ = 0.5 or (H0: τ = τ0)

Often, the null hypothesis statement is simplified to H0: θ = 0.5 or
H0: τ = τ0 in all cases; that is, the population parameter does not differ
from, or is equal to, ½.


SELF-CHECK 9.2

State the expression terms and define the appropriate null and alternative
hypothesis for the following cases:

Statement with Expressions                      Alternative Hypothesis   Null Hypothesis
(a) A supervisor recorded 9 observations of
    battery lifetime before a recharge is
    required. Determine whether this battery
    operates with a median of 1.8 hours before
    requiring a recharge.
(b) The following data is obtained from a
    non-normal population. Determine whether
    the median is less than 5.2 cm.

9.2 SIGN TEST


The sign test is often used as a non-parametric alternative to the one-sample
t-test, where we test the null hypothesis that a measure of central tendency
equals a specified value, μ0, against a suitable alternative. For the sign
test, the only assumption needed is that the population sampled is continuous
and symmetrical. We assume that the population is continuous so that there is
zero probability of getting a value equal to μ0. We do not even need the
assumption of a symmetrical distribution if the median replaces the mean in
this test.

The median is the observed value located in the middle of the data when all
observations are ranked in sequence, in either increasing or decreasing order.
If the sample size is even, the median is the mean of the two observations at
the centre. Like the mean, the median is a measure of location for a
distribution. Hence, the sign test is sometimes called a test of distribution
location.

Suppose that x1, x2, . . . , xn is a random sample from a population with
unknown median, τ. Let us say we would like to determine whether the population
median is larger than 100. Hence, τ0 = 100. Since the expression used in this
example is “larger than 100”, the hypothesis to be tested is H0: τ = 100
against the one-sided alternative hypothesis
H1: τ > 100.

In the sign test, we replace each sample value exceeding the median value τ0
with a ‘+’ sign and each value less than τ0 with a ‘–’ sign, that is,

If xi > τ0, then xi → ‘+’; if xi < τ0, then xi → ‘–’

A sample value which is equal to τ0 is replaced with a ‘0’ sign. This situation
can happen if we deal with rounded data even though the population is
continuous. Observations replaced with a ‘0’ sign are not used in subsequent
analysis. When this occurs, the sample size for analysis decreases by the
number of ‘0’ signs. Table 9.2 summarises the information above.

Table 9.2: Summary of Sign Test

Sample Value     Sign
xi > τ0          xi → ‘+’
xi < τ0          xi → ‘–’
xi = τ0          xi → ‘0’

In the sign test, the test statistic S is the random variable representing the
number of ‘+’ signs in the random sample. If the null hypothesis τ = τ0 is
true, the probability that a sample value results in a ‘+’ sign, and likewise a
‘–’ sign, is equal to ½. Therefore, we are actually testing H0 that the number
of ‘+’ signs, S, is a value of a random variable having the binomial
distribution with the parameter θ = ½, that is,

θ = Pr(xi – τ0 > 0) = Pr(xi > τ0) = ½.

For the example above, we shall reject the null hypothesis H0: τ = 100 or
H0: θ = 0.5 only if the proportion of ‘+’ signs is sufficiently greater than ½,
that is, when S is large.


SELF-CHECK 9.3
The following data are test marks data of 10 returning students: 65 85 43
38 90 73 65 59 88 74. A test has been performed to determine whether
the performance of these students in a test exceeds the population median
value of 65.
(a) Define the ‘+’ and ‘-’ signs in this example.
(b) How many ‘+’ and ‘-’ signs are there in this study?

The Procedure for Test of Location for Single Population:

(a) State the Null and Alternative Hypotheses to test a population location
parameter.
One-sided test: H0: θ = ½ versus H1: θ > ½
[or H0: θ = ½ against H1: θ < ½]
Two-sided test: H0: θ = ½ versus H1: θ ≠ ½
or
One-sided test: H0: τ = τ0 versus H1: τ > τ0
[or H0: τ = τ0 against H1: τ < τ0]
Two-sided test: H0: τ = τ0 versus H1: τ ≠ τ0.

(b) Calculate the Test Statistic.
One-sided test: S = the number of xi > τ0 [or S = the number of xi < τ0]
Two-sided test: S = maximum(S1, S2) where S1 = the number of xi > τ0,
S2 = the number of xi < τ0 and S2 = n – S1

(c) Determine the Significance Level.
One-sided test: p-value = Pr(X ≥ S)
Two-sided test: p-value = 2 Pr(X ≥ S)
X ~ Binomial with sample size n and Pr(‘success’) = θ = 0.5.
Refer to Table 2 on Binomial Distribution to get the p-value.

(d) Find the Rejection Region.
One-sided and two-sided tests: Reject H0 at level α if p-value < α.
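
Steps (b) to (d) above can be sketched in code as follows. This is a minimal illustration, not a reference implementation: the names `upper_tail` and `sign_test` and the sample values are assumptions made for the example, and the exact binomial probability is computed directly instead of being read from Table 2.

```python
from math import comb


def upper_tail(s, n, theta=0.5):
    """Pr(X >= s) for X ~ Binomial(n, theta)."""
    return sum(comb(n, k) * theta**k * (1 - theta) ** (n - k) for k in range(s, n + 1))


def sign_test(data, tau0, alternative="greater"):
    """Return (S, n, p_value); alternative is 'greater', 'less' or 'two-sided'."""
    s_plus = sum(x > tau0 for x in data)
    s_minus = sum(x < tau0 for x in data)
    n = s_plus + s_minus  # values equal to tau0 ('0' signs) are dropped
    if alternative == "greater":
        s, factor = s_plus, 1
    elif alternative == "less":
        s, factor = s_minus, 1
    else:
        s, factor = max(s_plus, s_minus), 2
    return s, n, min(1.0, factor * upper_tail(s, n))


sample = [12, 15, 9, 11, 14, 10, 13, 16]  # hypothetical values, not from the text
s, n, p = sign_test(sample, tau0=10, alternative="greater")
print(s, n, p)  # reject H0 at level alpha only if p < alpha
```

The two-sided p-value is capped at 1 because doubling the tail probability can exceed 1 when S is close to n/2.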


Worked Example 9.1


Suppose that we would like to test whether the median of a population is less
than 51. From the following observations, calculate the test statistic and
significance level.
36 43 52 51 51 48 57 50
Answer:
To decide whether τ is smaller than 51, test
H0: τ = 51 versus H1: τ < 51
Comparison between the observations and the τ0 value results in the following
signs:
36 43 52 51 51 48 57 50
 –  –  +  0  0  –  +  –
Therefore, S = number of ‘–’ signs = 4 and n = 8 – 2 = 6, or, in summary,
S ~ Bin(6, 0.5).
From Table 2 (binomial distribution), p-value = Pr(S ≥ 4) = 1 – Pr(S ≤ 3)
= 1 – 0.6562 = 0.3438.
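
The table lookup in this example can be cross-checked numerically. The sketch below only redoes the arithmetic for S ~ Bin(6, 0.5); the variable names are illustrative.

```python
from math import comb

n, s = 6, 4
# Pr(S <= 3) under Bin(6, 0.5): sum of C(6, k) / 2^6 for k = 0..3.
lower = sum(comb(n, k) for k in range(s)) / 2**n
p_value = 1 - lower  # Pr(S >= 4)
print(lower, p_value)
```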

Worked Example 9.2


It is suspected that the percentage of active bacteria obtained from a sewage
specimen at an area has a median of 40. The active bacteria percentages in a
random sample of 9 specimens are given below:
41 33 43 52 37 44 49 53 40
Is there enough evidence based on the data provided to say that the median
active bacteria percentage exceeds 40? Carry out a sign test using α = 0.05.

Answer:
To determine whether the median percentage of active bacteria, τ, exceeds 40,
test: H0: τ = 40 versus H1: τ > 40 at α = 0.05.

Assigning a ‘+’ sign to values exceeding 40 and a ‘–’ sign to values less
than 40:
41 33 43 52 37 44 49 53 40
 +  –  +  +  –  +  +  +  0


Using the sign test, the test statistic is S = number of observed ‘+’ signs = 6.
Hence, S is distributed as binomial with n = 9 – 1 = 8 and θ = 0.5. If the
variable x is distributed as binomial with n = 8 and θ = 0.5, then

p-value = Pr(x ≥ S) = Pr(x ≥ 6) = 1 – Pr(x ≤ 5)
        = 1 – 0.8555 = 0.1445

Since α = 0.05 < p-value = 0.1445, H0 is not rejected. There is not enough
evidence to say that the median for active bacteria percentage exceeds 40 at 5%
significance level.
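
As a cross-check of Worked Example 9.2, the following sketch recomputes the signs, the test statistic and the exact p-value from the nine observations. Variable names are illustrative; the binomial tail is computed directly instead of being read from a table.

```python
from math import comb

bacteria = [41, 33, 43, 52, 37, 44, 49, 53, 40]
tau0, alpha = 40, 0.05

s_plus = sum(x > tau0 for x in bacteria)   # number of '+' signs
s_minus = sum(x < tau0 for x in bacteria)  # number of '-' signs
n = s_plus + s_minus                       # the observation equal to 40 is dropped

# Pr(X >= S) for X ~ Bin(n, 0.5)
p_value = sum(comb(n, k) for k in range(s_plus, n + 1)) / 2**n
reject = p_value < alpha
print(s_plus, n, round(p_value, 4), reject)
```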

9.2.1 Sign Test for Large n


Recall that the normal distribution can be used as an approximation to the
binomial distribution when the sample size n is large. Since θ = ½, the
binomial distribution is symmetrical, so the normal approximation can be used
for sample sizes n as small as 10. For a more precise analysis, the normal
approximation is used when the sample size n ≥ 15.

Procedure for single population test of location for large sample:

(a) State the Null and Alternative Hypotheses for testing a population location
(Refer to hypothesis for small n)

(b) Calculate the Test Statistic

\[ Z = \frac{S - \mathrm{mean}(S)}{\sqrt{\mathrm{var}(S)}} = \frac{S - n\theta}{\sqrt{n\theta(1-\theta)}} = \frac{S - 0.5n}{0.5\sqrt{n}} \]

(c) Find the Rejection Region

One-sided test: Z ≥ z_α
Two-sided test: Z ≥ z_{α/2}
z_α and z_{α/2} values can be obtained from Table 5.3 (standard normal
distribution).
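
The large-sample statistic can be sketched as below. The function name `sign_test_z` and the values of S and n are assumptions for illustration only; the critical value is taken from Python's standard-library `statistics.NormalDist` instead of a printed table.

```python
from math import sqrt
from statistics import NormalDist


def sign_test_z(s, n, theta=0.5):
    """Normal-approximation statistic Z = (S - n*theta) / sqrt(n*theta*(1 - theta))."""
    return (s - n * theta) / sqrt(n * theta * (1 - theta))


s, n, alpha = 14, 20, 0.05                 # hypothetical S and n, not from the text
z = sign_test_z(s, n)
z_alpha = NormalDist().inv_cdf(1 - alpha)  # one-sided critical value z_alpha
print(round(z, 3), round(z_alpha, 3), z >= z_alpha)
```

With θ = ½ the general form reduces to (S – 0.5n)/(0.5√n), which is the last expression in step (b).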


Worked Example 9.3


The following data show the amount of sulfur oxides (in tonnes) emitted by an
industrial plant in 40 days. Perform a sign test to determine whether τ < 21.5
at 0.01 significance level.

17 15 20 29 19 18 22 25 27 9 24 20 17 6 24
14 15 23 24 26 19 23 28 19 16 22 24 17 20 14
13 19 10 23 18 31 13 20 17 24

Answer:
Test to determine whether τ < 21.5.
H0: τ = 21.5    H1: τ < 21.5
For a one-sided test, reject H0 if the test statistic z > z0.01 = 2.33, where

\[ z = \frac{S - n\theta}{\sqrt{n\theta(1-\theta)}} \]

at θ = ½ and S = number of observed ‘–’ signs (sample values that are less than
21.5).
Since n = 40 and S = 24, mean(S) = nθ = 40 × (½) = 20, while

\[ \sqrt{\mathrm{Var}(S)} = \sqrt{n\theta(1-\theta)} = \sqrt{40(0.5)(0.5)} = 3.16 \]

Hence, z = (24 – 20)/3.16 = 1.26

Since z = 1.26 < 2.33, we do not reject H0. There is not enough evidence to
prove that the median sulfur oxides emission is less than 21.5 at 0.01
significance level.
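
The figures in Worked Example 9.3 can be reproduced from the 40 observations with the sketch below. Variable names are illustrative, and the critical value comes from `statistics.NormalDist` instead of the printed table, so it is 2.326 to three decimals where the text rounds to 2.33.

```python
from math import sqrt
from statistics import NormalDist

emissions = [
    17, 15, 20, 29, 19, 18, 22, 25, 27, 9, 24, 20, 17, 6, 24,
    14, 15, 23, 24, 26, 19, 23, 28, 19, 16, 22, 24, 17, 20, 14,
    13, 19, 10, 23, 18, 31, 13, 20, 17, 24,
]
tau0, alpha = 21.5, 0.01

s = sum(x < tau0 for x in emissions)   # number of '-' signs
n = sum(x != tau0 for x in emissions)  # no observation equals 21.5 here
z = (s - 0.5 * n) / (0.5 * sqrt(n))
z_crit = NormalDist().inv_cdf(1 - alpha)
print(s, n, round(z, 2), z > z_crit)
```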


EXERCISE 9.1

1. The following data are measures of elasticity strength of a type of
fabric produced by a textile company:
163 165 160 189 161 171 158 151 169
162 163 139 172 165 148 166 172 163
187 173
Given that τ0 = 160, use the sign test to test the null hypothesis τ = τ0
against the alternative hypothesis τ > τ0 at 0.05 significance level.

2. Use the sign test to test H0: τ = τ0 versus H1: τ > τ0, where
S1 = number of observations > τ0 and S2 = number of observations
< τ0, and show that Pr(S1 ≥ c) = Pr(S2 ≤ n – c) for 0 ≤ c ≤ n.

9.3 WILCOXON SIGNED-RANK TEST


Note that the sign test utilises only the plus and minus signs of the
differences between the observations and the τ0 value; it does not take into
consideration the magnitudes of these differences. A test procedure for a
single population called the Wilcoxon signed-rank test takes these magnitudes
into consideration: ranks are assigned to the observations based on the
magnitudes of the differences.

Under the null hypothesis that there is no difference between the x values and
τ0, we would expect that, on average, half of the differences would be negative
and the other half positive. In other words, there will be about n/2 negative
differences and about n/2 positive differences. Next, we rank these positive
and negative differences by absolute value and assign ranks according to that
sequence. It is expected that the total rank corresponding to the positive
differences should be equal or nearly equal to the total rank corresponding to
the negative differences. An obvious difference between the total ranks
assigned to positive and negative differences is an indication of a difference
between the x values and τ0.


Procedures in applying the Wilcoxon Signed-Rank Test:

(a) Get the differences between the observations and the τ0 value.
(b) Get the absolute differences.
(c) Assign ranks to the absolute differences (rank 1 for the smallest absolute
difference; rank n for the largest).
(d) When the absolute values of two or more differences are the same, assign to
each the average of the ranks that would have been assigned if the
differences were distinguishable.
(e) Calculate the total ranks T– (negative differences) and T+ (positive
differences).
(f) Differences equal to 0 are discarded, reducing the sample size by the
number of such differences.

The smaller the total rank value, the bigger the possibility that there exist
differences between the sample values and τ0. Hence, we can reject H0 when the
test statistic, that is the total rank, say T, is less than or equal to a
critical value T0.

The single population test procedure that takes into consideration both the
magnitude and the sign of the differences is as follows:
(a) State the null and alternative hypotheses for a single population test.

(b) Calculate the Test Statistic.
One-sided test: T– = total rank of ‘–’ differences
[or T+ = total rank of ‘+’ differences]
Two-sided test: T = minimum(T+, T–)

(c) Find the Rejection Region.
One-sided test: Reject H0 if T– ≤ T0 [or Reject H0 if T+ ≤ T0]
Two-sided test: Reject H0 if T ≤ T0
Refer to Table 5.2 in the attachment to get the critical value T0.


Worked Example 9.4 (Please refer to Example 9.2)


Test to determine whether the median percentage of active bacteria exceeds
40 at α = 5%. Use the Wilcoxon signed-rank test and Table 5.2 (in the
attachment) to make the decision.

Answer:
To determine whether the population median, τ, exceeds 40, test:
H0: τ = 40 versus H1: τ > 40, α = 0.05
Following the procedures in the Wilcoxon signed-rank test, we obtain the
following results.
(To facilitate your understanding, the circled numbers indicate the procedures
in the test.)

From the above table, the total ranks of the differences with ‘+’ sign and with
‘–’ sign are T+ = 28.5 and T– = 7.5 respectively (refer to 3). Since there is
only one observation whose value is equal to the median value (refer to 5),
n = 9 – 1 = 8 (refer to 6).

For a one-sided test with H1: τ > 40, the test statistic is T– = 7.5. From the
table of critical values for the Wilcoxon signed-rank test (Table 5.2) in the
attachment, with n = 8, the critical value is T0.05 = 4. Since
T– = 7.5 > T0.05 = 4, we do not reject H0. There is not enough evidence to
prove that the median percentage of active bacteria is more than 40. A similar
conclusion is obtained through a sign test.
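
The ranking steps and the rank totals for this example can be checked with a short sketch. The function name `signed_rank_totals` is an assumption for illustration; tied absolute differences receive the average of the ranks they span, and differences equal to 0 are dropped before ranking. The sketch compares values with `==`, so it is intended for integer or exactly representable data.

```python
def signed_rank_totals(data, tau0):
    """Return (T_plus, T_minus, n) using average ranks for tied |differences|."""
    diffs = [x - tau0 for x in data if x != tau0]  # zero differences are dropped
    ordered = sorted(abs(d) for d in diffs)
    # A value occupying ranks first..first+count-1 gets their average.
    avg_rank = {
        a: ordered.index(a) + 1 + (ordered.count(a) - 1) / 2 for a in set(ordered)
    }
    t_plus = sum(avg_rank[abs(d)] for d in diffs if d > 0)
    t_minus = sum(avg_rank[abs(d)] for d in diffs if d < 0)
    return t_plus, t_minus, len(diffs)


bacteria = [41, 33, 43, 52, 37, 44, 49, 53, 40]
t_plus, t_minus, n = signed_rank_totals(bacteria, tau0=40)
print(t_plus, t_minus, n)
```

The two totals always add up to n(n + 1)/2, which gives a quick check on the ranking.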

9.3.1 Wilcoxon Signed-rank Test for Large n


For large sample size, say n ≥ 15, from the Central Limit Theorem, the
distribution of the test statistic T (either T+ or T–) will approach the
normal distribution. Hence, T is a random variable with mean and variance

\[ \mu_T = E(T) = \frac{n(n+1)}{4}, \qquad \sigma_T^2 = \mathrm{Var}(T) = \frac{n(n+1)(2n+1)}{24} \]

The signed-rank test statistic for n ≥ 15 is

\[ Z = \frac{T - \mu_T}{\sigma_T} \]

distributed as standard normal. For a one-sided test, the test statistic is

\[ Z = \frac{T^{-} - \mu_T}{\sigma_T} \text{ (right side)} \quad \text{or} \quad Z = \frac{T^{+} - \mu_T}{\sigma_T} \text{ (left side).} \]

The procedure for a single-population location test with large n using the
Wilcoxon signed-rank test can be summarised as below:

(a) State the null and alternative hypotheses for a single population test.

(b) Calculate the Test Statistic.
Using the same definitions for T, T+ and T– as previously, the test statistic
is

\[ z = \frac{T - \frac{n(n+1)}{4}}{\sqrt{\frac{n(n+1)(2n+1)}{24}}} \]

with T = T– (one-sided right test), T = T+ (one-sided left test) and
T = minimum(T+, T–) for a two-sided test.

(c) Find the Rejection Region.
One-sided test: Z ≥ z_α [or Z ≤ –z_α]
Two-sided test: |Z| ≥ z_{α/2}
z_α and z_{α/2} values can be obtained from Table 3 (standard normal).


Worked Example 9.5


It is claimed that a type of detergent is the choice of many consumers. To test
this claim, the detergent’s producer has recorded its sales for a month in a
hypermarket as shown below. Test whether the median sale of this detergent
differs from 120 units at 5% level.
85 100 121 73 150 119
99 129 94 123 124 127
120 115 78 119 100 96
116 141 152 85 101 109
138 142 97 128 130 83

Answer:
To test whether the median sale differs from 120 units in a month duration, test:
H0 :  = 120 against H1 : 120 at  = 0.05. All observations are subtracted
from the median value,  0 = 120. The magnitude and difference signs are
recorded and next, ranks are assigned to each difference. The results are:
yi y i - t0 Rank yi y i - t0 Rank
85 -35 (-) 25.5 73 -47 (-) 29
99 -21 (-) 17.5 123 +3 (+) 4
120 0 0 119 -1 (-) 2
116 -4 (-) 5.5 85 -35 (-) 25.5
138 +8 (+) 13 128 +8 (+) 9
100 -20 (-) 15.5 150 + 30 (+) 23
129 +9 (+) 10 124 +4 (+) 5.5
115 -5 (-) 7 100 -20 (-) 15.5
141 +21 (+) 17.5 101 -19 (-) 14
142 +22 (+) 19 130 +10 (+) 11
121 +1 (+) 2 119 -1 (-) 2
94 -26 (-) 22 127 +7 (+) 8
78 -42 (-) 28 96 -24 (-) 21
152 +32 (+)24 109 -11 (-) 12
97 -23 (-) 20 83 -37 (-) 27


One observation has been discarded since its difference = 0, hence n = 29.
The total ranks are T+ = 146 and T– = 289, so T = minimum(T+, T–) = 146. Since
n is large, the normal approximation is used. The test statistic calculated is

\[ z = \frac{T - \frac{n(n+1)}{4}}{\sqrt{\frac{n(n+1)(2n+1)}{24}}} = \frac{146 - \frac{29(30)}{4}}{\sqrt{\frac{29(30)(59)}{24}}} = -1.546 \]

Since |Z| = 1.546 < Z0.025 = 1.96, we do not reject H0. In conclusion, the
median of the detergent’s sales does not differ significantly from 120 at the
5% level.
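
The rank totals and the z value in Worked Example 9.5 can be recomputed from the 30 sales figures with the sketch below. Variable names are illustrative; tied absolute differences receive average ranks and the observation equal to 120 is dropped.

```python
from math import sqrt

sales = [
    85, 100, 121, 73, 150, 119,
    99, 129, 94, 123, 124, 127,
    120, 115, 78, 119, 100, 96,
    116, 141, 152, 85, 101, 109,
    138, 142, 97, 128, 130, 83,
]
tau0 = 120

diffs = [y - tau0 for y in sales if y != tau0]  # zero difference is dropped
ordered = sorted(abs(d) for d in diffs)
# Average rank for each distinct absolute difference.
avg_rank = {a: ordered.index(a) + 1 + (ordered.count(a) - 1) / 2 for a in set(ordered)}

t_plus = sum(avg_rank[abs(d)] for d in diffs if d > 0)
t_minus = sum(avg_rank[abs(d)] for d in diffs if d < 0)
n = len(diffs)

t = min(t_plus, t_minus)                        # two-sided test statistic
mu_t = n * (n + 1) / 4
sigma_t = sqrt(n * (n + 1) * (2 * n + 1) / 24)
z = (t - mu_t) / sigma_t
print(t_plus, t_minus, n, round(z, 3))
```

Because T is the smaller rank total, z comes out negative; the two-sided decision compares |z| with 1.96.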

SELF-CHECK 9.4
What are the differences between Sign Test and Wilcoxon Signed-Rank
Test?

EXERCISE 9.2

1. For the Wilcoxon signed-rank test, show that the sum of the total ranks of
the positive and negative differences is T+ + T– = n(n + 1)/2, where n is
the number of non-zero differences assigned ranks.

2. The following are 15 measures of hydrocarbon gas content:

97.5 95.2 97.3 96.0 96.8 100.3 97.4 95.3
93.2 99.1 96.1 97.6 98.2 98.5 94.9

Use the signed-rank test at α = 0.05 to test whether the median gas
content is 98.5.

3. A sewerage pipe at a housing area is subjected to having a mean
strength of less than 2,500 kg. An evaluation contractor randomly
selected seven pipes and the measurements on those pipes are
2,610 2,750 2,420 2,510 2,540 2,490 2,680
Use the sign test at α = 0.10 to determine whether these pipes
follow the specification requirement. Compare the results with the
Wilcoxon signed-rank test.


4. A quick solution algorithm to the polynomial problem of the 0–1
Boolean concept has been discovered. Twenty-five random problems
were solved by this algorithm and the time taken for each (in CPU
seconds) was recorded:

0.045 1.055 0.136 1.894 0.379 0.136 0.336
0.258 1.070 0.506 0.088 0.242 1.639 0.912
0.412 0.361 8.788 0.579 1.267 0.567 0.182
0.036 0.394 0.209 0.445

Carry out both the sign and signed-rank tests to determine whether
more than half of the random polynomial 0–1 problems require at most
1 CPU second. Use α = 0.01.

Do visit the following website:

http://software.biostat.washington.edu/~rossini/courses/intrononpar/text/Wilcoxon_Signed_Rank_Test_Statistic.html

• A single population test compares the position or location of the population
under study, measured by its median, with a specified median value.

• Two non-parametric methods can be used to test the location of a single
population: the sign test and the signed-rank test.

• The sign test is an easier approach using a simpler calculation, while the
signed-rank test is more precise as it takes into account the magnitude of
the differences between the sample values and the specific value of interest
in the test, in addition to the sign or direction of each difference.



Topic Non-Parametric
10 Hypothesis Test
for Two
Populations
LEARNING OUTCOMES
By the end of this topic, you should be able to:
1. Construct two-population hypothesis using non-parametric methods;
2. Compare two dependent populations using the sign test method;
3. Compare two dependent populations using the Wilcoxon signed-rank
method;
4. Differentiate between the sign test and the signed-rank test in
comparing two dependent populations; and
5. Compare two independent populations using the Mann-Whitney total
rank method.

 INTRODUCTION
A researcher often takes observations from two populations with the purpose of
comparing them, such as whether both populations come from the same
distribution or not. For example, in a parametric test, with two random samples
X1, …, Xn1 and Y1, …, Yn2 obtained from two normal populations with means µx
and µy respectively and constant variance, the researcher may be interested to
test H0: µx = µy versus H1: µx < µy. If the null hypothesis holds, we can
conclude that both populations are normally distributed with the same mean and
variance. In other words, both samples were taken from the same population. On
the other hand, if the alternative hypothesis is true, then µx < µy, that is,
the location parameter of X (selected as the mean) has a smaller value compared
to the location parameter of Y. Hence, the X population distribution is located
on the left side of the Y distribution. The dispersion of the X and Y
distributions is still the same as both variances are assumed constant.

In a non-parametric test, we compare two populations using their distributions
as a whole, and not using any parameter in particular. In general, the
comparison is made to determine whether the two population distributions are
different or, more specifically, whether the distribution of population 1 is
located on the left or right side of the distribution of population 2. Note
that non-parametric tests do not require the assumption of population
normality.

10.1 DEPENDENT AND INDEPENDENT POPULATIONS
When comparing two populations, it is important to check whether both
populations are independent or not. Two populations under comparison are said to
be independent when samples taken from one population are not dependent on or
influenced by the samples chosen from the other population.

For example, if we are comparing students’ marks in a basic statistics course
between male and female students, we are sure that the selection of the first
sample from the male student population neither influences nor is influenced by
the second sample, or any sample, from the female student population, and
vice-versa. Hence, both populations and samples can be categorised as two
independent samples. Another example is when population 1, which consists of
families in Town A, is considered independent of the families staying in Town B
that form population 2.

When comparing two independent populations, since the selection of samples
from population 1 does not influence the selection of samples from population 2,
the sample size n1 from population 1 and n2 from population 2 may be equal or
different.

On the other hand, if we can associate or match two populations where the
selection of samples from the second population depends on the selection of
samples from the first population, both are said to be dependent. An example of
two dependent populations is a study on the effectiveness of vehicle safety
tools where the measure of injuries faced by a driver when putting on safety
tools (sample 1) is compared with the measure of injuries on the driver when
he/she is not putting on any safety tool (sample 2). In this case, both samples
are related and are obtained from the same subject, that is, the same driver is
used to obtain measures of injuries when putting on the safety tools and when
not putting on the tools.

In testing two dependent populations, the sample size n1 and n2 must be equal, that
is, n1 = n2 = n due to a relationship or similarity in the data source/measurement
obtained, as in the case of similar subjects, and comparisons made are based on
paired comparison.

To test your understanding so far, try out the self-check below.

SELF-CHECK 10.1

Determine whether the following data represents two dependent populations
or not:
(a) The average accidents that occurred during work at a factory before
and after the implementation of a safety programme.
(b) The nicotine content in cigarette brands X and Y.
Give other examples for both types of populations.

10.2 HYPOTHESIS STATEMENT FOR TWO-POPULATION TESTING
In two-population hypothesis testing using non-parametric statistics, a comparison is
made between the populations, with reference to the location of population distributions
in general. We will be comparing population 1 distribution, denoted by D1 with X1, X2,
…, Xn1 and population 2 distribution, say D2, with Y1,Y2, . . .,Yn2.

In general, the null hypothesis H0: D1 = D2 means that there is no difference
between the two population distributions; they are equal. In other words, on
average, there are no significant differences in X1, X2, …, Xn1 values versus
Y1, Y2, …, Yn2 values. There are 3 possible alternatives for the hypothesis
statements:

(a) H1: D1 < D2 means population 1 distribution (which contains X1, X2, …, Xn1)
is located on the left side of population 2 (which contains Y1, Y2, …, Yn2).
This also means that on average, X1, X2, …, Xn1 values are smaller than Y1,
Y2, …, Yn2 values;

(b) H1: D1 > D2 means population 1 distribution is located on the right side of
population 2; and

(c) H1: D1 ≠ D2 means the location of population 1 distribution is not the same
as the location of population 2.

The construction of an alternative hypothesis depends on the claim in the study
or question that needs to be answered.

Worked Example 10.1


A company manager claims that the night-shift workers tend to apply for more
sick leaves compared to the day-shift workers. Construct a hypothesis to test
whether the number of sick leaves taken by night-shift workers is higher than
the day shift workers.
Answer:
Define D1 as the distribution of sick leaves applied for by the night-shift
workers and D2 as the distribution of sick leaves applied by the day shift
workers. From the expression ‘higher’ we can conclude that H1: D1 > D2 while
H0: D1 = D2.


Worked Example 10.2


It is claimed that students who have not been provided with sample examination
questions in advance will obtain lower marks than those who have been. To
test this claim, 20 students are selected and matched in pairs so that each
pair has almost the same overall quality point average in other examinations.
Construct a hypothesis to test this claim.

Answer:
Define D1 as the distribution of marks for students who did not have the sample
problems and D2 as the distribution of marks for students who were provided
with sample problems. From the expression ‘lower’ we can test whether X1, X2,
…, Xn1 values are smaller than Y1,Y2, …,Yn2 values or D1 is located on the left
side of D2. Hence, the null and alternative hypotheses can be described as H0:
D1 = D2 versus H1: D1 < D2.

Worked Example 10.3


Twelve students identified as obese were put on a special diet which is believed
to be able to reduce body weight. The students’ weights before and after this
diet were monitored for a month. Construct a hypothesis to test whether this
diet programme is effective or not.

Answer:
Define D1 as students’ weight distribution before starting the diet programme
and D2 as the distribution of weight after the programme. If the programme is
effective, hence, weight reduction, this means the observation values in D1 must
be larger than the observation values in D2. In other words, D1 must be located
on the right side of D2. Hence, the alternative hypothesis is H1: D1 > D2.

The weight distribution before and after can also be viewed as the same
distribution as the weight difference between the weight before and the weight
after, that is Di = Xi – Yi. Next, the alternative hypothesis can be described as
H1 : Di = Xi – Yi > 0 and H0 : Di = 0. We will discuss this further in section
10.3.


Some examples of expressions that provide clues in constructing suitable null and
alternative hypotheses for two independent populations are as shown below (refer
to Table 10.1).

Table 10.1: Summary of Expression Terms with the Null and Alternative Hypotheses

Expression Terms                                  Alternative Hypothesis   Null Hypothesis

• ‘larger than’, ‘increase’, ‘bigger’, ‘more’     H1 : D1 > D2             H0 : D1 ≤ D2
• ‘less than’, ‘decrease’, ‘reduce’               H1 : D1 < D2             H0 : D1 ≥ D2
• ‘different than’, ‘not equal to’                H1 : D1 ≠ D2             H0 : D1 = D2

10.3 COMPARING TWO DEPENDENT POPULATIONS

Suppose that there are n pairs of observations in form (Xi, Yi). Xi is the evaluation
or observation in a specific situation (say, before treatment) and Yi is the
evaluation or observation in another situation (say, after treatment). To test the
hypothesis that both distributions are equal versus the alternative hypothesis that
both distributions are different in terms of their location, a non-parametric
statistical test can be used. There are two methods for comparing two dependent
populations that will be discussed here, which are the sign test and the Wilcoxon
signed-rank test.

10.3.1 The Sign Test for Two Dependent Populations


The sign test, which has been used to test the location or position of a single
population, can be extended to compare the locations of two populations.

In general, we will be testing whether the distribution of population 1 is
different from the distribution of population 2, as discussed in section 10.2.
For a single population, suppose that $D_i = X_i - \theta_0$, where $\theta_0$
is the median under H0. Hence,

$$p = \Pr(X_i > \theta_0) = \Pr(X_i - \theta_0 > 0) = \Pr(D_i > 0) = \tfrac{1}{2} = \Pr(D_i < 0).$$

For two dependent or paired populations, $D_i = X_i - Y_i - \theta_0$. Under H0,
$p = \Pr(D_i > 0) = \tfrac{1}{2} = \Pr(D_i < 0)$. If Xi and Yi, i = 1, ..., n (for two
paired samples, n = n1 = n2) come from the same population, then
$\Pr(D_i > 0) = \Pr(D_i < 0) = \tfrac{1}{2}$. Hence, replacing $X_i - \theta_0$ with
$X_i - Y_i - \theta_0$, we can use the results of the sign test for a single
population to test two dependent or paired populations. Therefore, the null
hypothesis can be written as:

H0 : Pr(Di > 0) = Pr(Di < 0) = p = 1/2 or H0: $\theta_1 - \theta_2 = 0$

Suppose that S represents the number of differences between Xi and Yi marked as ‘+’.
Under H0, S follows a binomial distribution with p = 1/2. Thus, the null hypothesis
statement for comparing two paired populations is H0 : p = 1/2.

Steps to perform sign test for two dependent populations:

(i) Obtain the differences between the first sample and its pair, that is, the
second sample.

(ii) Assign the ‘+’ or ‘-’ sign according to the result of the differences. The
paired sample with zero difference is discarded.

(iii) Count the number of ‘+’ signs for a one-sided (right) test, or the number
of ‘-’ signs for a one-sided (left) test.
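
The three steps above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `sign_counts` and the small paired data set are hypothetical and do not come from the module.

```python
def sign_counts(first, second):
    """Return (n, plus, minus) for paired samples.

    Zero differences are discarded, so n = plus + minus.
    """
    diffs = [a - b for a, b in zip(first, second)]   # step (i)
    plus = sum(1 for d in diffs if d > 0)            # step (ii): '+' signs
    minus = sum(1 for d in diffs if d < 0)           # step (ii): '-' signs
    return plus + minus, plus, minus                 # step (iii)


# Hypothetical paired observations, for illustration only
first_sample = [5, 7, 6, 8, 4]
second_sample = [3, 7, 8, 5, 2]

n, plus, minus = sign_counts(first_sample, second_sample)
print(n, plus, minus)
```

The pair with equal values (7 and 7) is dropped, which is why n is smaller than the number of pairs.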


The testing procedures for two dependent or paired populations are similar to
the testing procedures for a single population, and the rejection region can be
determined using the binomial probability distribution.
Let us study the following two examples.

Worked Example 10.4


A manager would like to study whether a raise in employees’ salaries will
reduce the number of defective products. For this purpose, data on defective
products before and after the salary increment was recorded. Construct an
appropriate null and alternative hypotheses and state the test statistic.

Answer:
To determine whether the salary increment results in fewer defective products, test:

H0 : p = 1/2 (the distribution of defective products is the same before and
after the salary increment) or H0 : $\theta_{after} = \theta_{before}$,
versus

H1 : p < 1/2 (there are fewer defective products after the salary increment than
before the increment) or H1 : $\theta_{after} < \theta_{before}$

This reading takes each difference as the number of defective products after the
increment minus the number before, with p = Pr(Di > 0). From the alternative
hypothesis, the above test is a one-sided (left) test. Hence, the
test statistic = S2 = the number of differences between X and Y with ‘–’ sign.


Worked Example 10.5


The marketing department of a fast-food restaurant would like to know whether a
new ingredient used resulted in tastier fried chicken compared to the one using a
traditional ingredient. 10 culinary experts were chosen at random to evaluate both
types of fried chicken and were asked to rate the taste at a scale of 1 to 10
(1 represents least delicious and 10 very delicious). The results are as follows:

Culinary Expert Original Ingredient New Ingredient


A 3 9
B 5 5
C 3 6
D 1 3
E 5 10
F 8 4
G 2 2
H 8 5
I 4 6
J 6 7

Answer:
There are two dependent samples since both types of fried chicken were
evaluated by the same subject (culinary expert). The hypotheses to test
whether the new ingredient gives tastier fried chicken are

H0 : There is no change in the fried chicken taste, or p = 1/2, versus
H1 : The new ingredient resulted in tastier fried chicken, or p > 1/2

Next, using the sign test on the table of differences,

n = the total number of observations (exclusive of those with zero difference)
  = the total number of ‘+’ signs and ‘-’ signs = 6 + 2 = 8
S = the total number of ‘+’ signs = 6.


Culinary   Original     New          Difference, Di =              Sign
Expert     Ingredient   Ingredient   New Ingredient – Original
A          3            9             6                            +
B          5            5             0                            0
C          3            6             3                            +
D          1            3             2                            +
E          5            10            5                            +
F          8            4            –4                            –
G          2            2             0                            0
H          8            5            –3                            –
I          4            6             2                            +
J          6            7             1                            +

From the table above, we can see that six out of 10 culinary experts found that
the chicken tasted better using the new ingredient. Two said that the original
ingredient tasted better and the two other experts could not detect any
difference.

In a one-sided (right) test, a large number of ‘+’ signs (equivalently, a small
number of ‘–’ signs) will result in H0 rejection.

From the binomial table with n = 8 and p = 0.5, Pr(S ≥ 6) = 0.1445. Since the
p-value = 0.1445 is greater than the significance level, α = 0.05, we do not
reject the null hypothesis. In conclusion, at the 0.05 significance level there
is not enough evidence that the new ingredient makes the fried chicken tastier
than the traditional ingredient.
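
The tail probability quoted from the binomial table can be checked directly. The sketch below uses the ratings of Worked Example 10.5; the helper name `upper_tail` is an assumption introduced for illustration, and only the Python standard library is used.

```python
from math import comb

# Ratings from Worked Example 10.5 (experts A to J)
original = [3, 5, 3, 1, 5, 8, 2, 8, 4, 6]
new = [9, 5, 6, 3, 10, 4, 2, 5, 6, 7]

diffs = [b - a for a, b in zip(original, new)]   # New - Original
s_plus = sum(1 for d in diffs if d > 0)          # number of '+' signs
n = sum(1 for d in diffs if d != 0)              # zero differences discarded


def upper_tail(n, s, p=0.5):
    """Pr(S >= s) for S ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(s, n + 1))


p_value = upper_tail(n, s_plus)
print(n, s_plus, round(p_value, 4))  # 8 6 0.1445
```

Because 0.1445 exceeds 0.05, the calculation agrees with the decision not to reject H0.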

If the sample size is large, the normal distribution can be used as an
approximation to the binomial distribution. The normal approximation rules are
np ≥ 5 and n(1 – p) ≥ 5. With p = 0.5 the normal approximation can still be used
for n as small as 10, but for a more precise result, the normal approximation is
used for n ≥ 15.


Worked Example 10.6 (Refer to Example 10.5)


At a restaurant, 35 customers were chosen at random and asked about the
difference in taste between the two ingredients for fried chicken. The summary
of the results are as follows:

‘+’ Difference = 19 ‘–’ Difference = 13


‘0’ Difference (no difference in evaluation) = 3

Carry out a test at 5% level to determine whether the customers prefer the new
ingredient to the traditional one.

Answer:
To determine whether the new ingredient is preferable, test H0: p = 0.5 versus
H1: p > 0.5.
S = the number of ‘+’ signs = 19; sample size, n = 32 (the 3 zero differences are discarded).

$$\text{Test statistic } Z = \frac{S - 0.5n}{0.5\sqrt{n}} = \frac{2S - n}{\sqrt{n}} = \frac{2(19) - 32}{\sqrt{32}} = 1.061$$

From the standard normal table, the critical value z0.05 = 1.645 for a one-sided
test. Since 1.061 < 1.645, the null hypothesis cannot be rejected.
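
The large-sample statistic in this example is easy to reproduce. A minimal sketch, in which the function name `sign_test_z` is hypothetical and the critical value 1.645 is taken from the standard normal table as in the text:

```python
from math import sqrt


def sign_test_z(s, n):
    """Normal approximation to the sign test: (S - 0.5n) / (0.5 * sqrt(n))."""
    return (2 * s - n) / sqrt(n)


z = sign_test_z(s=19, n=32)      # 19 '+' signs among 32 non-zero differences
print(round(z, 3))               # 1.061
print(z > 1.645)                 # False: H0 is not rejected at the 5% level
```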

10.3.2 Wilcoxon Signed-Rank Test for Two Dependent Populations

Another method to compare two paired samples is by using the Wilcoxon Signed-Rank
Test which uses information not only from the sign but also on the magnitudes of the
differences. The test procedures have been discussed in test of location for single
population using the Wilcoxon Signed-Rank Test.

Remember that the signed-rank test is based on either:

• T+, that is the sum of the ranks assigned to the positive differences;
• T–, that is the sum of the ranks assigned to the negative differences; or
• T, where T = minimum (T+, T–).


Under the null hypothesis that the two population distributions do not differ,
it is expected that about half of the differences will have a ‘+’ sign and the
other half a ‘–’ sign. The sum of T+ and T– is always n(n + 1)/2. Both are values
of random variables that take on values on the interval from 0 to n(n + 1)/2
inclusive, with distributions that are symmetrical about n(n + 1)/4.

Summary of steps in the Wilcoxon Signed-Rank Test for two dependent populations:
(i) Calculate the difference for each sample pair.
(ii) Discard zero differences; the sample size n is reduced by the number of
differences with 0 values.
(iii) Get the absolute values of the remaining differences.
(iv) Assign ranks to the absolute values of the differences (1 for the smallest
absolute difference; n for the largest).
(v) Differences with equal absolute value are assigned the mean of the ranks
that they jointly occupy.
(vi) Calculate the sums of ranks T– (negative differences) and T+ (positive
differences).

The test procedure for paired two populations with signed-rank method is similar
to the test procedure for a single population.

Worked Example 10.7


In evaluating paper quality, the paper smoothness is very important to assure
customers’ acceptance. Suppose that 10 judges were each given two samples of
paper produced by a factory. The evaluations of the judges were assigned rank
1 – 10, with rank 10 representing the highest quality of smoothness. From the
evaluation results below, test for differences in quality for both paper products
at 0.05 significance level.

Judge 1 2 3 4 5 6 7 8 9 10
Product A 6 8 4 9 4 7 6 5 6 8
Product B 4 5 5 8 1 9 2 3 7 2


Answer:
The hypotheses to test for a difference in quality between the two paper products are

H0: the distributions of evaluation for products A and B are the same, or
H0: $\theta_A = \theta_B$, versus
H1: the distributions of evaluation for products A and B are different, or
H1: $\theta_A \neq \theta_B$

Applying the signed-rank method, the test procedure generated the following
results:

Judge   A   B   Difference (A–B)   |Difference (A–B)|   Signed rank of |Difference|
1       6   4   +2                 2                    (+)5
2       8   5   +3                 3                    (+)7.5
3       4   5   –1                 1                    (–)2
4       9   8   +1                 1                    (+)2
5       4   1   +3                 3                    (+)7.5
6       7   9   –2                 2                    (–)5
7       6   2   +4                 4                    (+)9
8       5   3   +2                 2                    (+)5
9       6   7   –1                 1                    (–)2
10      8   2   +6                 6                    (+)10

From the table, the sum of the ranks of the ‘+’ differences is T+ = 46, while the
sum of the ranks of the ‘–’ differences is T– = 9. Since the test is two-sided,
the test statistic is given by

T = minimum (T+, T–) = T– = 9.

With n = 10 and α = 0.05, the critical value is T0 = 8. H0 will be rejected if
T ≤ T0. Since T = 9 is not ≤ T0, H0 is not rejected. In conclusion, there is not
enough evidence to say that there is a shift between the two distributions of
evaluation for paper quality at α = 5%.
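
The rank sums in Worked Example 10.7 can be reproduced with a short script. This is a sketch only: the function name `signed_rank_sums` is hypothetical, and tied absolute differences receive the mean of the ranks they jointly occupy, as described in the summary of steps.

```python
def signed_rank_sums(x, y):
    """Return (T_plus, T_minus) for paired samples x and y."""
    diffs = [a - b for a, b in zip(x, y) if a != b]   # discard zero differences
    ordered = sorted(abs(d) for d in diffs)

    # Mean rank for each distinct absolute difference (handles ties)
    rank_of = {}
    for value in set(ordered):
        first = ordered.index(value) + 1              # lowest rank it occupies
        count = ordered.count(value)                  # number of tied values
        rank_of[value] = first + (count - 1) / 2

    t_plus = sum(rank_of[abs(d)] for d in diffs if d > 0)
    t_minus = sum(rank_of[abs(d)] for d in diffs if d < 0)
    return t_plus, t_minus


# Evaluations from Worked Example 10.7
product_a = [6, 8, 4, 9, 4, 7, 6, 5, 6, 8]
product_b = [4, 5, 5, 8, 1, 9, 2, 3, 7, 2]

t_plus, t_minus = signed_rank_sums(product_a, product_b)
print(t_plus, t_minus, min(t_plus, t_minus))  # 46.0 9.0 9.0
```

The two sums add up to n(n + 1)/2 = 55, which is a quick check that the ranking was done correctly.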


When the sample size is large, say n ≥ 15, the distribution of the test statistic T
(either T+ or T–) will approach the normal distribution. To perform the signed-rank
test for large n, T is a random variable with mean and variance,

$$\mu = E(T) = \frac{n(n+1)}{4}, \qquad \sigma^2 = \operatorname{Var}(T) = \frac{n(n+1)(2n+1)}{24}$$

Hence, the signed-rank test statistic for n ≥ 15 is

$$Z = \frac{T - \mu}{\sigma}$$

which is approximately standard normal. For a one-sided test, the test statistic is
$Z = (T^{+} - \mu)/\sigma$ or $Z = (T^{-} - \mu)/\sigma$.

Worked Example 10.8


A company producing energy drinks claimed that the drink is effective in
reducing body weight. The following are the weights of 16 random samples
before and after 4 weeks of taking the drink.

Weight before 147.0 183.5 232.1 161.6 197.5 206.3 177.0 215.4
Weight after 137.9 176.2 219.0 163.8 193.5 201.4 180.6 203.2
Weight before 147.7 208.1 166.8 131.9 150.3 197.2 159.8 171.6
Weight after 149.0 195.4 158.5 134.4 149.3 189.1 159.1 173.2

Use the signed-rank test to test at 0.05 level of significance whether the
company’s claim is true or not.

Answer:
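The energy drink is said to be effective if the weight distribution before is
located to the right of the weight distribution after. With differences
Di = weight before – weight after, an effective drink gives mostly ‘+’ signs, so
the rank sum of the ‘–’ differences, T–, is small. To determine whether the
energy drink is effective or not in reducing body weight, test

H0 : p = 1/2 versus H1 : p < 1/2

where p here denotes the probability of a ‘–’ difference.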


The Wilcoxon signed-rank procedure for testing two dependent populations
gave the following results:

The test statistic T– = 5 + 7 + 3 + 6 + 4 = 25, with mean and variance,

$$\mu = \frac{n(n+1)}{4} = \frac{16(17)}{4} = 68, \qquad \sigma^2 = \frac{n(n+1)(2n+1)}{24} = \frac{(16)(17)(33)}{24} = 374$$

Because a small T– supports H1, this is a one-sided (left) test, and the null
hypothesis will be rejected if the test statistic

$$Z = \frac{T^{-} - \mu}{\sigma} = \frac{25 - 68}{\sqrt{374}} = -2.22$$

is less than the critical value –z0.05 = –1.645.

Since z = –2.22 < –z0.05 = –1.645, the null hypothesis must be rejected. We
conclude that the energy drink is effective in reducing the body weight at 5%
level.
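
The figures in Worked Example 10.8 can be recomputed from the raw weights. The sketch below assumes differences are taken as weight before minus weight after, which is the reading that reproduces T– = 25; variable names are illustrative only.

```python
from math import sqrt

# Weights from Worked Example 10.8
before = [147.0, 183.5, 232.1, 161.6, 197.5, 206.3, 177.0, 215.4,
          147.7, 208.1, 166.8, 131.9, 150.3, 197.2, 159.8, 171.6]
after = [137.9, 176.2, 219.0, 163.8, 193.5, 201.4, 180.6, 203.2,
         149.0, 195.4, 158.5, 134.4, 149.3, 189.1, 159.1, 173.2]

# Differences (before - after), rounded to remove floating-point noise
diffs = [round(b - a, 1) for b, a in zip(before, after)]
diffs = [d for d in diffs if d != 0]
n = len(diffs)

# There are no tied absolute differences in this data set,
# so the rank is simply the position in the sorted list.
order = sorted(abs(d) for d in diffs)
t_minus = sum(order.index(abs(d)) + 1 for d in diffs if d < 0)

mu = n * (n + 1) / 4
var = n * (n + 1) * (2 * n + 1) / 24
z = (t_minus - mu) / sqrt(var)
print(t_minus, mu, var, round(z, 2))  # 25 68.0 374.0 -2.22
```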

After studying these examples, you can proceed to Exercise 10.1.

EXERCISE 10.1

1. Suppose that the differences in paired data are 15 ‘+’ signs, 5 ‘-’
signs and 9 ‘0’. Use the sign test for a one-sided (right) test.
(a) What are the values for n and S?
(b) At 0.05 level, will H0 be rejected?

2. To determine the effectiveness of a new traffic-control system, the
number of accidents that occurred at 12 dangerous intersections
during four weeks before and four weeks after the installation of the
new system were observed. The following data were obtained
through the observation:

Before 3 3 1 5 3 6 2 0 4 3 4 1
After 1 2 3 2 0 4 3 2 1 2 3 0

Use the sign test to test whether the new traffic-control system is
more effective than the old system at 0.05 level of significance.


10.4 COMPARING TWO INDEPENDENT POPULATIONS

The Wilcoxon Rank-Sum Test is a better alternative to the two-sample t test in
testing the equality of two populations which are non-normal and samples chosen
are independent (i.e. there is no pairing of observations).

Suppose that there are n1 and n2 independent samples from population 1 and
population 2 respectively. The procedures for the rank-sum test suggest that we
combine n1 + n2 = n observations and assign rank according to the observed
magnitude. Rank 1 will be assigned to observation with the smallest value and
rank n to the observation with the highest value. In the case of ties (identical
observations), we would replace the observations by the mean of the ranks that the
observations would be entitled to if they were distinguishable. If both samples
actually come from the same distribution, the sum of the ranks corresponding to
the first sample, W1, and the sum of the ranks corresponding to the second sample,
W2, will be proportional to the respective sample sizes. If n1 and n2 are equal,
W1 and W2 should be almost identical. If one of the rank sums is sufficiently
large while the other one is very small, this shows there is a possibility that
a significant difference exists between the two sample distributions.

Mann and Whitney have suggested a test statistic that also uses the sum of the
ranks for both samples and it can be shown that this is equivalent to the Wilcoxon
test. This test, called the Mann-Whitney U test, has been used extensively since
the availability of table for U critical values.

When comparing two populations using the Mann-Whitney test, the following
statistic will be used as the test statistic:

$$U_1 = n_1 n_2 + \frac{n_1(n_1+1)}{2} - W_1$$
$$U_2 = n_1 n_2 + \frac{n_2(n_2+1)}{2} - W_2 \quad \text{or}$$
$$U = \text{the minimum of } U_1 \text{ and } U_2$$

where U1 + U2 = n1 n2, while W1 and W2 are the sums of the ranks of the values of
the first and second sample, respectively.
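
The identity U1 + U2 = n1 n2 follows from the fact that the ranks 1, ..., n (with n = n1 + n2) always sum to n(n + 1)/2, so that W1 + W2 = n(n + 1)/2. A short derivation, written out here as a worked check rather than taken from the module:

```latex
\begin{align*}
U_1 + U_2
  &= 2 n_1 n_2 + \frac{n_1(n_1+1) + n_2(n_2+1)}{2} - (W_1 + W_2) \\
  &= 2 n_1 n_2 + \frac{n_1^2 + n_1 + n_2^2 + n_2}{2}
     - \frac{(n_1+n_2)(n_1+n_2+1)}{2} \\
  &= 2 n_1 n_2 + \frac{n_1^2 + n_1 + n_2^2 + n_2}{2}
     - \frac{n_1^2 + 2 n_1 n_2 + n_2^2 + n_1 + n_2}{2} \\
  &= 2 n_1 n_2 - n_1 n_2 = n_1 n_2 .
\end{align*}
```

In practice this means only one of U1 and U2 has to be computed from the ranks; the other is n1 n2 minus the first.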


From the formulas for U1 and U2, U1 will be small when W1 is large. This can
only happen if the population 1 distribution is shifted to the right of the population
2 distribution. Hence, the test statistic U1 will be used when the alternative
hypothesis is D1 > D2.

Steps in Mann-Whitney Rank-Sum Test:

(i) Combine all n = n1 + n2 observations from both populations, where n1 is the
sample size for population 1 and n2 is the sample size for population 2.
(ii) Arrange the combined data in ascending order regardless of the population.
(iii) Assign ranks to the arranged data: rank 1 for the smallest value and
rank n for the largest.
(iv) In the case of ties (identical observations), we replace the observations
with the mean of the ranks that the observations would have if they were
distinguishable (tie-rank).
(v) Sum up the ranks for each sample.

If the samples chosen were from two identical populations, it is expected that the
sum of the ranks of both samples would not differ too much. If there is an
appreciable difference between the means of the two populations, most of the
lower ranks are likely to go to the values of one sample, while most of the higher
ranks are likely to go to the values of the other sample.

Let D1 and D2 be the distribution for population 1 and population 2 respectively.


The procedure for the Mann-Whitney test comparing two independent populations:

(a) State the Null and Alternative Hypotheses for two independent populations
test.
One-sided test: H0: D1 and D2 are equal versus H1: D1 has shifted to the
right of D2.
[or H0: D1 and D2 are equal versus H1: D1 has shifted to the
left of D2].
Two-sided test: H0: D1 and D2 are equal versus H1 : D1 has shifted either to
the left or the right of D2.


(b) Calculate the Test Statistic.

One-sided test: $U_1 = n_1 n_2 + \frac{n_1(n_1+1)}{2} - W_1$
[or $U_2 = n_1 n_2 + \frac{n_2(n_2+1)}{2} - W_2$]

Two-sided test: U = minimum (U1, U2)

(c) Find the Rejection Region:

One-sided test: $U_1 \le U_{2\alpha}$ [or $U_2 \le U_{2\alpha}$]
Two-sided test: $U \le U_{\alpha}$
The $U_{2\alpha}$ and $U_{\alpha}$ values are given in Table 5.3 (as in the Attachment).

Worked Example 10.9


To find out whether a new serum will arrest the seriousness of leukaemia, 9
laboratory rats, which have all reached an advanced stage of the disease, are
selected. The survival time (in years), from the time the experiment commenced
is as follows:
Treatment 2.1 5.3 1.4 4.6 0.9
No treatment 1.9 0.5 2.8 3.1

Determine if the serum is effective or not at 0.05 level of significance.

Answer:
Let D1 be the distribution of survival times of rats receiving treatment and D2 as
the distribution of survival times for rats not receiving treatment. To determine
the effectiveness of the serum treatment, test:
H0: D1 = D2 against H1: D1 > D2
(D1 and D2 are equal, against D1 has shifted to the right of D2)

The data give n1 = 5 and n2 = 4. The observations were arranged in ascending
order and assigned ranks 1 to 9 as below (T marks a rat that received treatment,
N a rat that did not):

Original Data   0.5   0.9   1.4   1.9   2.1   2.8   3.1   4.6   5.3
Group           N     T     T     N     T     N     N     T     T
Rank            1     2     3     4     5     6     7     8     9

The ranks of the observations from sample 1 (treatment) give
w1 = 2 + 3 + 5 + 8 + 9 = 27 and w2 = [(9)(10)/2] – 27 = 18. Summing the ranks of
sample 2 directly gives the same answer, w2 = 1 + 4 + 6 + 7 = 18.
For a one-sided (right) test, the test statistic is

$$U_1 = n_1 n_2 + \frac{n_1(n_1+1)}{2} - W_1 = (5)(4) + \frac{(5)(6)}{2} - 27 = 8.$$

From Table 5 (Mann-Whitney U Test critical values), the critical value of U with
n1 = 5, n2 = 4 and α = 0.05 for a one-sided test is 2. Since the test statistic
value is not ≤ 2, U1 falls in the acceptance region. In conclusion, do not reject
H0. There is not enough evidence at the 0.05 level that the serum treatment
prolongs the survival time.
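
The rank sums and U statistics of Worked Example 10.9 can be verified with the sketch below. The helper `rank_sums` is a hypothetical name; it assigns mean ranks to ties, although this data set has none. The critical value 2 is taken from the table quoted in the text.

```python
def rank_sums(sample1, sample2):
    """Return (W1, W2): rank sums of each sample in the pooled ordering."""
    pooled = sorted(sample1 + sample2)

    def mean_rank(value):
        first = pooled.index(value) + 1               # lowest rank occupied
        return first + (pooled.count(value) - 1) / 2  # average over ties

    w1 = sum(mean_rank(v) for v in sample1)
    w2 = sum(mean_rank(v) for v in sample2)
    return w1, w2


# Survival times from Worked Example 10.9
treatment = [2.1, 5.3, 1.4, 4.6, 0.9]
no_treatment = [1.9, 0.5, 2.8, 3.1]

n1, n2 = len(treatment), len(no_treatment)
w1, w2 = rank_sums(treatment, no_treatment)
u1 = n1 * n2 + n1 * (n1 + 1) / 2 - w1
u2 = n1 * n2 + n2 * (n2 + 1) / 2 - w2
print(w1, w2, u1, u2)   # 27.0 18.0 8.0 12.0
print(u1 <= 2)          # False: U1 is not in the rejection region
```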
The application of the Rank-Sum test is not limited to non-normal populations
only. It can be used to replace the t test when the probability of a Type II
error is large. Recall that a Type II error is failing to reject H0 when H0 is
actually false. When n1 ≥ 15 and n2 ≥ 15, the sampling distribution of U will
approach the normal distribution with mean and variance,

$$\mu_U = \frac{n_1(n_1+n_2+1)}{2} - \frac{n_1(n_1+1)}{2} = \frac{n_1 n_2}{2}, \qquad \sigma_U^2 = \frac{n_1 n_2 (n_1+n_2+1)}{12}$$

Next, the statistic $Z = (U - \mu_U)/\sigma_U$, which approaches the standard
normal distribution, can be used to make a decision about the test.

Worked Example 10.10


Two random samples of 16 young turkeys each were fed two different diets. The weight
gains (in pounds) for these samples were recorded and they were kept under
identical conditions for 3 months. The following are the ranks assigned to these
young turkeys from the diets:

Perform the Mann-Whitney U test at 0.01 level of significance to test the null
hypothesis that the two populations sampled are identical against the alternative
hypothesis that the second diet produced a greater weight gain.


Answer:
Let D1 be the weight distribution of young turkeys with diet 1 and D2 be the
weight distribution of young turkeys with diet 2. To determine whether the
second diet results in greater weight compared to the first, test
H0 : D1 = D2 (or D1 and D2 are equal) against
H1 : D1 < D2 (or D1 has shifted to the left of D2)
From the data, n1 = n2 = 16; hence, the normal approximation will be used. Prior to
that, obtain the sum of ranks w2 and the test statistic U2.
w2 = 21 + 1 + 3 + 8 + 15 + 4 + 11 + 2 + 5.5 + 13 + 31 + 16 + 12 + 22 + 7 + 10
= 181.5
U = U2 = (16)(16) + (16)(17)/2 – 181.5 = 210.5

The mean and variance of U are given as:

$$\mu_U = \frac{n_1 n_2}{2} = \frac{(16)(16)}{2} = 128$$
$$\sigma_U^2 = \frac{n_1 n_2 (n_1 + n_2 + 1)}{12} = \frac{(16)(16)(33)}{12} = 704$$

Hence, $Z = \dfrac{210.5 - 128}{\sqrt{704}} = 3.11$

Since Z = 3.11 > 2.33 = Z0.01, the null hypothesis must be rejected. We conclude
that the second diet produces a greater gain in weight compared to the first diet
at 0.01 significance level. In other words, D1 distribution is situated on the left
side of D2 distribution.
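
The arithmetic of Worked Example 10.10 can be checked as follows. The list of ranks is copied from the sum for w2 in the answer; the variable names are illustrative, and the critical value 2.33 is the one quoted from the standard normal table.

```python
from math import sqrt

# Ranks summed for w2 in Worked Example 10.10
ranks_2 = [21, 1, 3, 8, 15, 4, 11, 2, 5.5, 13, 31, 16, 12, 22, 7, 10]
n1 = n2 = 16

w2 = sum(ranks_2)
u2 = n1 * n2 + n2 * (n2 + 1) / 2 - w2
mu_u = n1 * n2 / 2
var_u = n1 * n2 * (n1 + n2 + 1) / 12
z = (u2 - mu_u) / sqrt(var_u)
print(w2, u2, mu_u, var_u, round(z, 2))  # 181.5 210.5 128.0 704.0 3.11
print(z > 2.33)                          # True
```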

You have reached the end of Topic 10. Test your understanding with this next
self-check.


SELF-CHECK 10.2
What are the differences between the Rank-Sum, the Sign and the Signed-
Rank tests for comparing two populations?
Rank-Sum Test Sign Test Signed-Rank Test

Let us now try the following exercises:

EXERCISE 10.2

1. State the test results for the following data.


(a) n1 = 3 n2 = 5 W1 = 8
H0: D1 and D2 are equal H1: D1 is on the right of D2

(b) n1 = 6 n2 =4 W2 = 17
H0: D1 and D2 are equal H1: D1 and D2 are unequal

2. In a study on the methods of teaching mathematics to
schoolchildren, the weak students were divided into two groups.
Seven students were selected at random for group 1 where they
were taught problem-solving using a conventional method. Ten
other students were put into group 2 and were taught to solve real
problems using a video programme. At the end of the sessions, the
students were given a test and the scores are as listed below:

Group 1 15 21 15 23 17 14 16
Group 2 18 22 24 25 19 24 17 19 23 16

Test at 0.01 level that group 2 obtained higher scores in the test.
Compare both methods of teaching.


3. The following data is the number of accidents that occurred at 12
manufacturing factories in one week before and after a workplace
safety campaign was conducted.

Factory Before After Factory Before After


1 3 2 7 5 3
2 4 1 8 3 3
3 6 3 9 2 0
4 3 5 10 4 3
5 4 4 11 4 1
6 5 2 12 5 2

(a) Does the data support the claim that the campaign was
successful?
Test the effectiveness of the campaign at 0.05 significance level.
(b) Repeat the test above using the signed-rank test.

4. In a study, a group (experimental group) was given 2 bottles of
alcoholic drinks and then was asked to solve a problem. Another group
(control group) was given non-alcoholic drinks and was also asked to
solve the same problem. The time taken (in minutes) by each
individual to solve the problem was recorded as below. At α = 0.05,
can you show that alcohol intake resulted in a longer time to solve the
problem?

Control Group Experimental Group


63 70 64 78 74 62 43
57 50 56 77 80 72 57
44 42 41 75 55 66 44

ACTIVITY 10.2


Do visit the following websites for in-depth information about single-population
hypothesis testing and the Mann-Whitney test:
• http://software.biostat.washington.edu/~rossini/courses/intro-nonpar/ text/one-
sample problems.html
• http://software.biostat.washington.edu/~rossini/courses/intro-nonpar/ text/one
sample location Problems.html
• http://www2.latech.edu/~drug/mann Whitney. PDF

• In starting the non-parametric analysis to compare two populations, we need
to determine whether both populations are dependent or not.

• When comparing two dependent or paired populations, their differences can
be considered as a single population, so they will be analysed using tests for a
single population such as the Sign and the Signed-Rank tests.

• The Wilcoxon Rank-Sum test and the Mann-Whitney test can be used to compare
two independent populations.


Answers
TOPIC 1: CHI-SQUARE DISTRIBUTION, F DISTRIBUTION
AND THEIR APPLICATIONS

Exercise 1.1
(a) Histogram
(b) Scatter plot of probability function versus Y

Comment: The histogram in (a) and the scatter plot in (b) are right-skewed. Hence,
it is true that
$$Y = \sum_{i=1}^{4} Z_i^2 \sim \chi^2(4)$$

Exercise 1.2
The value 0.831 in Figure 1.6(b) lies in column α = 0.975 and row v = 5, that is
$\chi^2_{0.975}(5) = 0.831$.

From Figure 1.6(a), it is clear that the value 23.68 given in Table 1.1 lies at
column α = 0.05 and row v = 14, which means ‘the 0.05 point of the $\chi^2(14)$
distribution is 23.68’, i.e. $\chi^2_{0.05}(14) = 23.68$.

Exercise 1.3
Step 1: Determine the test parameter
The population parameter to be tested is the population variance, that is $\sigma^2$.

Step 2: Gather all information given
Our results will depend on the sample variance value, that is 18.4. We have a
normal population with mean $\mu$ (not given in the test) and assumed variance to be
tested, $\sigma_0^2 = 42$.


Step 3: Construct hypothesis statement
Population variance = 42. The one-sided (left) test to be performed is:

$H_0 : \sigma^2 \ge 42$
$H_1 : \sigma^2 < 42$

Step 4: Determine the significance level and rejection region
A one-sided test is performed at the 5% level (you can choose any significance
level), using the chi-square distribution with n – 1 = 16 – 1 = 15 degrees of
freedom. From the table, the critical value is $\chi^2_{95\%}(15) = 7.26$. We will
reject the null hypothesis when $\chi^2 < \chi^2_{95\%}(15)$.

Step 5: Calculate the test statistic
$$\chi^2 = \frac{(n-1)S^2}{\sigma_0^2} = \frac{15(18.4)}{42} = 6.57$$

Step 6: Test result
Since the $\chi^2$ value is inside the rejection region, i.e. 6.57 < 7.26, reject H0.

Step 7: Test conclusion
Hence, we have strong evidence to say that the population variance $\sigma^2$ is less than 42.
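
The test statistic in Step 5 can be checked with a couple of lines. This sketch takes the critical value 7.26 from the chi-square table as quoted in Step 4 rather than computing it; the variable names are illustrative.

```python
# Exercise 1.3: chi-square statistic for a test on the population variance
n = 16                 # sample size
s_squared = 18.4       # sample variance
sigma0_squared = 42    # variance under H0
critical = 7.26        # chi-square table value quoted in Step 4

chi_square = (n - 1) * s_squared / sigma0_squared
print(round(chi_square, 2))      # 6.57
print(chi_square < critical)     # True: the statistic is in the rejection region
```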

Exercise 1.4

Figure 1.1


Comment: The three distribution graphs show skewness. The curve shapes
change with the change in the degrees of freedom pairs.

Exercise 1.5
1. For  = 5% = 0.05, 1% = 0.01 and 0.1% = 0.001 refer to the first, third and
fourth row in the Table for every pair of v1 and v2 . Thus,

(a) F0.05 (3,16) = 3.24 (b) F0.05 (12,25) = 2.16

(c) F0.01 (4,15) = 4.89 (d) F0.001 (7,4) = 49.66

2. Values on the right of the equation represent values in the distribution table.
Thus, determine their values based on suitability/accuracy value in table by
referring to the intersection of the column and row according to the degrees
of freedom. Hence, we obtained:
(a) Fα (6,14) = 3.50
Since the value 3.50 (at v1 = 6 and v2 = 14) is on the second row, α = 0.025.
(b) Fα (10,32) = 2.93
Since the value 2.93 (at v1 = 10 and v2 = 32) is on the third row, α = 0.01.
(c) Fα (24,38) = 1.81
Since the value 1.81 (at v1 = 24 and v2 = 38) is on the first row, α = 0.05.
(d) Fα (2,24) = 5.61
Since the value 5.61 (at v1 = 2 and v2 = 24) is on the third row, α = 0.01.

Exercise 1.6
From the table, we obtain $f_{0.01,14,11} = 4.30$ and $f_{0.01,11,14} = 3.87$
(since α = 0.02, so α/2 = 0.01). Hence:

$$\frac{(3.07)^2}{(0.8)^2}\cdot\frac{1}{4.30} < \frac{\sigma_1^2}{\sigma_2^2} < \frac{(3.07)^2}{(0.8)^2}\cdot 3.87$$

that is $3.425 < \sigma_1^2/\sigma_2^2 < 56.991$.

This means the assumption that the ratio $\sigma_1^2/\sigma_2^2 = 1$ is not true,
because the interval estimate of the ratio does not contain the value 1.

Exercise 1.7
Step 1: Determine the test parameter
The population parameters are the first and second population variances,
$\sigma_1^2$ and $\sigma_2^2$ respectively.

Step 2: Gather all information given

                          Population 1             Population 2
(a) shape                 normal                   normal
(b) mean                  $\mu_1$ (unknown)        $\mu_2$ (unknown)
(c) standard deviation    $\sigma_1$ (unknown)     $\sigma_2$ (unknown)

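Step 3: Construct hypothesis statement
Choose the appropriate H1 based on the expressions as in the table below:

Terms of Expression                                 H1                               H0

1. $\sigma_2^2$ is larger than $\sigma_1^2$         $\sigma_2^2 > \sigma_1^2$        $\sigma_2^2 \le \sigma_1^2$
2. $\sigma_2^2$ is smaller than $\sigma_1^2$        $\sigma_2^2 < \sigma_1^2$        $\sigma_2^2 \ge \sigma_1^2$
3. $\sigma_2^2$ is different from $\sigma_1^2$      $\sigma_2^2 \ne \sigma_1^2$      $\sigma_2^2 = \sigma_1^2$

Is the population variance $\sigma_1^2$ equal to $\sigma_2^2$ (i.e. $\sigma_1^2/\sigma_2^2 = 1$)
or is $\sigma_1^2 \ne \sigma_2^2$ (i.e. $\sigma_1^2/\sigma_2^2 \ne 1$)? (Refer Table 1.4, Chapter 1)

$$H_0 : \sigma_1^2 = \sigma_2^2 \iff H_0 : \frac{\sigma_1^2}{\sigma_2^2} = 1$$
$$H_1 : \sigma_1^2 \ne \sigma_2^2 \iff H_1 : \frac{\sigma_1^2}{\sigma_2^2} \ne 1$$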


Step 4: Determine the significance level and rejection region
Perform a two-sided test at the α = 0.02 significance level. The rejection region is
either on the right side or on the left side of the F distribution (Figure 1.19).
From the table, we obtained $f_{0.01,15,24} = 2.89$ and

$$F_{0.99,15,24} = \frac{1}{F_{0.01,24,15}} = \frac{1}{3.29} = 0.304.$$

Hence, we will reject H0 when the statistic F > 2.89 or F < 0.304.

Step 5: Calculate the test statistic
$$F = \frac{s_1^2}{s_2^2} = \frac{10.6}{7.3} = 1.45$$

Step 6: Test Result
Since the F value fell in the acceptance region, that is 0.304 < 1.45 < 2.89,
do not reject the null hypothesis.

Step 7: Test Conclusion
Hence, there is not enough evidence to say that $\sigma_1^2 \ne \sigma_2^2$; the data
are consistent with both population variances being equal.
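
The decision in Steps 4 to 6 can be checked numerically. The sketch below takes the table values 2.89 and 3.29 as quoted in Step 4; note that the reciprocal 1/3.29 evaluates to about 0.304. Variable names are illustrative.

```python
# Exercise 1.7: two-sided F test for equality of two variances
s1_squared, s2_squared = 10.6, 7.3
upper = 2.89            # F(0.01; 15, 24) from the table
lower = 1 / 3.29        # F(0.99; 15, 24) = 1 / F(0.01; 24, 15)

f_stat = s1_squared / s2_squared
reject = f_stat > upper or f_stat < lower
print(round(f_stat, 2), round(lower, 3))  # 1.45 0.304
print(reject)                             # False: H0 is not rejected
```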

Exercise 1.8
1. Using properties c(i) and c(ii) of the chi-square distribution
(a) $Y = X_1 + X_2$ is distributed as $\chi^2(1+5) = \chi^2(6)$. Hence, E[Y] = 6 and
Var[Y] = 2(6) = 12.
(b) $Y = X_1 + X_3$ is distributed as $\chi^2(1+10) = \chi^2(11)$. Hence, E[Y] = 11 and
Var[Y] = 2(11) = 22.

2. Use properties c(i) and c(ii). Since $X_1$ is distributed as $N(\mu_1, \sigma_1^2)$
and $X_2$ as $N(\mu_2, \sigma_2^2)$,

$$Z_1 = \frac{X_1 - \mu_1}{\sigma_1} \quad \text{and} \quad Z_2 = \frac{X_2 - \mu_2}{\sigma_2}$$

are each distributed as standard normal, N(0,1). Subsequently, from property c(i),

$$Z_1^2 = \frac{(X_1 - \mu_1)^2}{\sigma_1^2} \sim \chi^2(1), \qquad Z_2^2 = \frac{(X_2 - \mu_2)^2}{\sigma_2^2} \sim \chi^2(1) \qquad \text{and} \qquad Z_1^2 + Z_2^2 \sim \chi^2(2).$$

Hence, it is proven that

$$E\left[\frac{(X_1 - \mu_1)^2}{\sigma_1^2} + \frac{(X_2 - \mu_2)^2}{\sigma_2^2}\right] = E\left[Z_1^2 + Z_2^2\right] = E\left[\chi^2(2)\right] = 2.$$

3. Using the distribution table, we obtained

(a) the α value when
(i) $\chi^2_\alpha = 19.02$ (v = 9)
Column α = 0.025 with row v = 9 in the table gives the value 19.02. Thus,
$\chi^2_\alpha = 19.02$ (v = 9) is true at α = 0.025.

(ii) $\chi^2_\alpha = 24.43$ (v = 40)
Column α = 0.975 with row v = 40 in the table gives the value 24.43. Thus,
$\chi^2_\alpha = 24.43$ (v = 40) is true at α = 0.975.

(b) the x value if
(i) $\chi^2_{0.005} = x$ (v = 29)
Let α = 0.005 with v = 29 degrees of freedom. Find the value in the table at
the intersection of column α = 0.005 and row v = 29. This gives 52.336.
Hence, x = 52.336.
(ii) $\chi^2_{0.99} = x$ (v = 4)
Let α = 0.99 with v = 4 degrees of freedom. Find the value in the table at
the intersection of column α = 0.99 and row v = 4. This gives 0.297.
Hence, x = 0.297.

4. If $X_1$ and $X_2$ are random variables with $X_1$ distributed as $\chi^2(3)$ and $X_2$ distributed as $\chi^2(4)$, determine the probability of

(a) $X_1 > 6.25$
$X_1 \sim \chi^2(3)$. From the table, 0.10 of the distribution is on the right side of the point 6.25 (see column 0.10, row 3). Hence,
Pr($X_1$ > 6.25) = 0.10

(b) $X_1 < 0.115$
$X_1 \sim \chi^2(3)$. From the table, 0.99 of the distribution is on the right side of the point 0.115 (see column 0.99, row 3). Hence,
Pr($X_1$ < 0.115) = 1 – 0.99 = 0.01

(c) $X_2 > 14.86$
$X_2 \sim \chi^2(4)$. From the table, 0.005 of the distribution is on the right side of the point 14.86 (see column 0.005, row 4). Hence,
Pr($X_2$ > 14.86) = 0.005

(d) $X_1 + X_2 > 14.07$
$X_1 + X_2 \sim \chi^2(3+4) = \chi^2(7)$ (the sum of two independent chi-square variables is also chi-square). From the table, 0.05 of the distribution is on the right side of the point 14.07 (see column 0.05, row 7). Hence,
Pr($X_1 + X_2$ > 14.07) = 0.05

5. Use the following steps to solve this question:

Step 1: Determine the test parameter


The test parameter is the population variance.

Step 2: Gather all information given


Our results depend on the sample variance, $s^2 = 12.1$. We
have a normal population with mean μ (not given in the test) and hypothesised
variance to be tested, $\sigma_0^2 = 11$.

Step 3: Construct the hypothesis statement


One-sided (right) test will be performed that is (with population variance = 11)
H 0 :  2 = 11
H 1 :  2 > 11

Step 4: Determine the significance level and rejection region


A one-sided test is performed at the 5% level, using the chi-square distribution with
n – 1 = 9 – 1 = 8 degrees of freedom. From the table, the critical value is
$\chi^2_{5\%}(8) = 15.507$. We will reject the null hypothesis if $\chi^2 > \chi^2_{5\%}(8)$.

Step 5: Calculate the test statistic

Test statistic, $\chi^2 = \dfrac{(n-1)s^2}{\sigma_0^2} = \dfrac{8(12.1)}{11} = 8.8$

Step 6: Test result

Since χ² falls in the acceptance region, that is 8.8 < 15.507, we fail to reject $H_0$.

Step 7: Test conclusion

Hence, there is not enough evidence to conclude that the population variance exceeds 11.
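As a quick check of Steps 4 to 6, the statistic $(n-1)s^2/\sigma_0^2$ can be computed directly. This is a minimal sketch; the critical value 15.507 is taken from the table as quoted above, and the function name is illustrative.

```python
def chi_square_variance_stat(n, sample_var, sigma0_sq):
    """Test statistic (n - 1) * s^2 / sigma_0^2 for a test on one variance."""
    return (n - 1) * sample_var / sigma0_sq


stat = chi_square_variance_stat(n=9, sample_var=12.1, sigma0_sq=11)
critical = 15.507            # chi-square table value, 5% upper tail, 8 d.f.
reject = stat > critical     # right-sided test
print(round(stat, 2), reject)
```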

6. From the data, we obtained

(a) Sample mean, $\bar{x} = \dfrac{\sum_{i=1}^{n} x_i}{n} = \dfrac{9.1 + 14.3 + \dots + 10.4}{15} = 10.73$

Sample variance, $s^2 = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1} = \dfrac{(9.1 - 10.73)^2 + \dots + (10.4 - 10.73)^2}{14} = 3.39$

Step 1: Determine test parameter


The population parameter under test is the population variance, $\sigma^2$.

Step 2: Gather all information given

Our results depend on the sample variance, $s^2 = 3.39$.
We have a normal population with mean μ (not given in the test) and
hypothesised variance to be tested, $\sigma_0^2 = 1.9$.

Step 3: Construct hypothesis statement

Perform a two-sided test, that is:
$H_0: \sigma^2 = 1.9$
$H_1: \sigma^2 \neq 1.9$

Step 4: Determine the significance level and rejection region


A two-sided test is performed at the 5% level; the chi-square distribution
has n – 1 = 15 – 1 = 14 degrees of freedom. From the table, we
obtain the critical values $\chi^2_{2.5\%}(14) = 26.119$ and $\chi^2_{97.5\%}(14) = 5.629$.
We will reject the null hypothesis when
$\chi^2 < \chi^2_{97.5\%}(14)$ or $\chi^2 > \chi^2_{2.5\%}(14)$.

Step 5: Calculate the test statistic

Test statistic, $\chi^2 = \dfrac{(n-1)s^2}{\sigma_0^2} = \dfrac{14(3.39)}{1.9} = 24.98$

Step 6: Test result


Since the test statistic value χ² lies in the acceptance region, that is 5.629 <
24.98 < 26.119, we fail to reject $H_0$.

Step 7: Test conclusion

Thus, there is not enough evidence to conclude that the population variance
differs from 1.9; the data are consistent with a normal population with variance $\sigma^2 = 1.9$.

(b) Step 1 and Step 2 are as in (a)

Step 3: Construct hypothesis statement


Conduct a one-sided test, that is
$H_0: \sigma^2 = 1.9$
$H_1: \sigma^2 > 1.9$

Step 4: Determine the significance level and rejection region

A one-sided test is performed at the 5% level; the chi-square distribution has n – 1 =
15 – 1 = 14 degrees of freedom. From the table, we obtain the critical value
$\chi^2_{5\%}(14) = 23.685$. We will reject the null hypothesis when $\chi^2 > \chi^2_{5\%}(14)$.

Step 5: Calculate the test statistic

Test statistic, $\chi^2 = \dfrac{(n-1)s^2}{\sigma_0^2} = \dfrac{14(3.39)}{1.9} = 24.98$

Step 6: Test result

Since the test statistic value χ² falls in the rejection region, that is 24.98 >
23.685, reject $H_0$.

Step 7: Test conclusion

Thus, we have evidence at the 5% level that the population variance is greater
than 1.9, that is $\sigma^2 > 1.9$ (a different result from (a)).
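Parts (a) and (b) use the same statistic but different rejection regions, which is why they reach different decisions. The sketch below makes that contrast explicit; the critical values are the table values quoted in the solution and the variable names are illustrative.

```python
n, sample_var, sigma0_sq = 15, 3.39, 1.9
stat = (n - 1) * sample_var / sigma0_sq

# (a) two-sided test at 5%: 2.5% in each tail (table values for 14 d.f.)
lower_crit, upper_crit = 5.629, 26.119
reject_two_sided = stat < lower_crit or stat > upper_crit

# (b) one-sided (right) test at 5%: all 5% in the upper tail
one_sided_crit = 23.685
reject_one_sided = stat > one_sided_crit

print(round(stat, 2), reject_two_sided, reject_one_sided)
```

Because the one-sided test places the whole 5% in the upper tail, its critical value (23.685) is smaller than the two-sided upper value (26.119), and the statistic 24.98 falls between the two.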


7. To determine which of the two production channels has the smaller product variation,
use a confidence interval for the ratio of the two variances, $\sigma_1^2/\sigma_2^2$. For a 95% confidence interval,
the table gives $f_{\alpha/2,\,21-1,\,25-1} = f_{0.025,20,24} = 2.327$ and $f_{0.025,24,20} = 2.408$. Thus, a 95% confidence
interval for the ratio is

$$\frac{1432}{3761}\cdot\frac{1}{2.327} < \frac{\sigma_1^2}{\sigma_2^2} < \frac{1432}{3761}\cdot 2.408$$

that is

$$0.1636 < \frac{\sigma_1^2}{\sigma_2^2} < 0.9168$$

Since the whole interval lies below 1, we are 95% confident that channel 1
has a smaller variance than channel 2, and channel 1 should therefore
be chosen by the firm.
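The interval limits can be reproduced with a few lines of arithmetic. This sketch assumes that 1432 and 3761 are the two sample variances and that 2.327 and 2.408 are the F-table values used in the solution; the function name is illustrative.

```python
def variance_ratio_ci(s1_sq, s2_sq, f_upper_v1_v2, f_upper_v2_v1):
    """Confidence interval for sigma1^2 / sigma2^2 from F-table values.

    f_upper_v1_v2 -- F_{alpha/2, v1, v2}
    f_upper_v2_v1 -- F_{alpha/2, v2, v1}
    """
    ratio = s1_sq / s2_sq
    return ratio / f_upper_v1_v2, ratio * f_upper_v2_v1


lower, upper = variance_ratio_ci(1432, 3761, 2.327, 2.408)
print(round(lower, 4), round(upper, 4))
```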

8. Follow these steps:

Step 1: Determine the test parameter


The population parameters are the first and second population variances, $\sigma_1^2$ and $\sigma_2^2$
respectively.

Step 2: Gather all given information

                           Population 1            Population 2
(a) shape                  normal                  normal
(b) mean                   $\mu_1$ (unknown)       $\mu_2$ (unknown)
(c) standard deviation     $\sigma_1$ (unknown)    $\sigma_2$ (unknown)

Step 3: Construct the hypothesis statement

The appropriate hypothesis statement for the two-sided test to be performed:
$H_0: \sigma_1^2 = \sigma_2^2 \iff H_0: \sigma_1^2/\sigma_2^2 = 1$
$H_1: \sigma_1^2 \neq \sigma_2^2 \iff H_1: \sigma_1^2/\sigma_2^2 \neq 1$


Step 4: Determine the significance level and rejection region


A two-sided test is performed at significance level α = 0.02. The rejection
region lies in both the right tail and the left tail of the F distribution. From the
table, we obtain $F_{0.01,7,10} = 4.85$ and $F_{0.99,7,10} = \dfrac{1}{F_{0.01,10,7}} = \dfrac{1}{6.62} = 0.151$.

As such, we will reject $H_0$ when the test statistic value F > 4.85 or F < 0.151.

Step 5: Calculate the test statistic

Test statistic, $F = \dfrac{s_1^2}{s_2^2} = \dfrac{12.4}{19.3} = 0.64$

Step 6: Test result

Since the test statistic value F falls in the acceptance region, that is 0.151 < 0.64
< 4.85, we fail to reject the null hypothesis.

Step 7: Test conclusion

Thus, there is not enough evidence to conclude that the two population variances
differ; the data are consistent with two normal populations with $\sigma_1^2 = \sigma_2^2$.

9. Calculate the variance values for both samples.


Sample 1: mean $\bar{x}_1 = 49.31$, variance $s_1^2 = 177.9$
Sample 2: mean $\bar{x}_2 = 44.32$, variance $s_2^2 = 62.39$

Step 1: Determine test parameter

The population parameters are the first population variance, $\sigma_1^2$, and the second population
variance, $\sigma_2^2$.

Step 2: Gather all given information

                       Population 1            Population 2
shape                  normal                  normal
mean                   $\mu_1$ (unknown)       $\mu_2$ (unknown)
standard deviation     $\sigma_1$ (unknown)    $\sigma_2$ (unknown)


Step 3: Construct the hypothesis statement


Conduct a one-sided (right) test, that is
$H_0: \sigma_1^2 = \sigma_2^2 \iff H_0: \sigma_1^2/\sigma_2^2 = 1$
$H_1: \sigma_1^2 > \sigma_2^2 \iff H_1: \sigma_1^2/\sigma_2^2 > 1$

Step 4: Determine the significance level and rejection region

A one-sided (right) test is performed at the α = 0.01 significance level. The
rejection region is in the right tail of the F distribution only.
From the table, we obtain $F_{0.01,9,9} = 5.35$. Hence, we will
reject $H_0$ when the test statistic value F > 5.35.

Step 5: Calculate the test statistic

Test statistic, $F = \dfrac{s_1^2}{s_2^2} = \dfrac{177.9}{62.39} = 2.85$

Step 6: Test Results

Since the test statistic value F falls in the acceptance region, that is 2.85 < 5.35,
we fail to reject the null hypothesis.

Step 7: Test conclusion

Hence, there is not enough evidence to conclude that $\sigma_1^2 > \sigma_2^2$; the data are
consistent with two normal populations with equal variances.


TOPIC 2: ONE-WAY ANALYSIS OF VARIANCE (ANOVA)

Exercise 2.1

Figure 2.4(a) Figure 2.4(b)

Comment: Both figures illustrate the variation between and
within samples for the variables. In Figure 2.4(a), the variation between samples
is large compared with the variation within samples. However, Figure 2.4(b) shows that
the between-sample variation is not much different from the within-sample variation.
How about Figure 2.1 (from the text book)? Figure 2.1 does not clearly show whether the
variation between samples is statistically larger than the within-sample variation.
Thus, a significance test must be performed. A statistical test used to examine the
equality of population means should be able to differentiate between-sample from within-sample
variation. Thus, we need to calculate the between-sample and within-sample
variations.


Exercise 2.2
The following result is obtained:

Group                        1        2        3        4
                            65       75       59       94
                            87       69       78       89
                            73       83       67       80
                            79       81       62       88
                            81       72       83
                            69       79       76
                                     90
Total                      454      549      425      351
The number of students       6        7        6        4
Average                  75.67    78.43    70.83    87.75

Grand mean,

$$\bar{x}_{..} = \frac{\sum_{i=1}^{4}\sum_{j=1}^{n_i} x_{ij}}{n} = \frac{65 + 75 + \dots + 88}{23} = \frac{1779}{23} = 77.35$$

Sum of squares of deviation,

$$SS_T = \sum_{i=1}^{4}\sum_{j=1}^{n_i} (x_{ij} - \bar{x}_{..})^2 = (65 - 77.35)^2 + (75 - 77.35)^2 + \dots + (88 - 77.35)^2 = 1909.2$$

Sum of squares due to treatments (between student groups)

$$SS(Tr) = \sum_{i=1}^{4} n_i(\bar{x}_i - \bar{x}_{..})^2 = 6(75.67 - 77.35)^2 + 7(78.43 - 77.35)^2 + 6(70.83 - 77.35)^2 + 4(87.75 - 77.35)^2 = 712.6$$

Sum of squares due to errors (within groups)

$$SS_E = (n_1 - 1)s_1^2 + (n_2 - 1)s_2^2 + (n_3 - 1)s_3^2 + (n_4 - 1)s_4^2 = SS_T - SS(Tr) = 1909.2 - 712.6 = 1196.6$$


Thus, the mean squares treatments and errors are:


MS(Tr) = SS(Tr)/(k – 1) = 712.6/3 = 237.5
MSE = SSE/(n – k) = 1196.6/19 = 63.0
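The sums of squares and mean squares above can be reproduced from the raw scores. The following is a minimal sketch in plain Python; the function name `one_way_anova` is illustrative, and the final F ratio is included only for completeness since the exercise stops at the mean squares.

```python
groups = [
    [65, 87, 73, 79, 81, 69],        # group 1
    [75, 69, 83, 81, 72, 79, 90],    # group 2
    [59, 78, 67, 62, 83, 76],        # group 3
    [94, 89, 80, 88],                # group 4
]


def one_way_anova(groups):
    """Return SST, SS(Tr), SSE, MS(Tr), MSE and F for a one-way layout."""
    n = sum(len(g) for g in groups)
    k = len(groups)
    grand_mean = sum(sum(g) for g in groups) / n
    ss_total = sum((x - grand_mean) ** 2 for g in groups for x in g)
    ss_treat = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    ss_error = ss_total - ss_treat
    ms_treat = ss_treat / (k - 1)
    ms_error = ss_error / (n - k)
    return ss_total, ss_treat, ss_error, ms_treat, ms_error, ms_treat / ms_error


ss_total, ss_treat, ss_error, ms_treat, ms_error, f_ratio = one_way_anova(groups)
print(round(ss_total, 1), round(ss_treat, 1), round(ss_error, 1))
print(round(ms_treat, 1), round(ms_error, 1))
```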

Exercise 2.3
In ANOVA tests, the critical region is determined by α, the degrees of freedom for
treatments and the degrees of freedom for errors. Step 2 and Step 3 (in the text) need to
be understood before obtaining the critical value.

Exercise 2.4
The relevant factor is the class, and the factor levels are the types of class that the
students are in.

Step 1: Determine null and alternative hypotheses

$H_0: \mu_1 = \mu_2 = \mu_3$ (no difference in means for the 3 factor levels)
$H_1$: not all populations have equal means.

Step 2: Choose the significance level

Significance level, α = 0.05.

Step 3: Determine the rejection region

(a) the degrees of freedom for treatment, $v_1$ = k – 1 = 3 – 1 = 2.
(b) the degrees of freedom for errors, $v_2$ = N – k = 12 – 3 = 9.

Determine the critical value, that is $F_{2,9,0.05} = 4.256$ (obtained from the F-distribution
table). Reject $H_0$ if F > 4.256.

Step 4: Calculate test statistic

$$F = \frac{MS(Tr)}{MS_E} = \frac{SS(Tr)/(k-1)}{SS_E/(N-k)} = \frac{64586.17/2}{108333.5/9} = 2.683$$


Step 5: Test Results


Since F = 2.683 < 4.256, we fail to reject $H_0$.

Step 6: Test conclusion

There is not enough evidence to conclude that the mean time spent watching television
differs among the classes. The data are consistent with the teacher's claim.

Exercise 2.5
1. The populations under consideration must be normally distributed with equal
variances, and the samples must be independent and randomly selected.

2. The null hypothesis: no difference in means for all populations;

$\mu_1 = \mu_2 = \dots = \mu_k$, with k = number of groups.

Alternative hypothesis: at least one group has a different mean.

3. We obtained:
(a) the critical value for the ANOVA test at α = 0.01 when there are 6
samples with 34 observations in total is $F_{5,28,0.01} = 3.75$. This comes
from α = 0.01, the degrees of freedom for numerator = k – 1 = 6 – 1 =
5 and the degrees of freedom for denominator = N – k = 34 – 6 = 28.

(b) the critical value for the ANOVA test at α = 0.05 when there are 4
samples with 44 observations in total is $F_{3,40,0.05} = 2.84$. This comes from α = 0.05, the
degrees of freedom for numerator = k – 1 = 4 – 1 = 3 and the degrees
of freedom for denominator = N – k = 44 – 4 = 40.

4. We obtained:
(a) When MSE = 14.6 and MS(Tr) = 35.7,
$F = \dfrac{MS(Tr)}{MS_E} = \dfrac{35.7}{14.6} = 2.45$

(b) When MSE = 73.81 and MS(Tr) = 215.23,
$F = \dfrac{MS(Tr)}{MS_E} = \dfrac{215.23}{73.81} = 2.92$


5. The relevant factor is the socioeconomic status;


The factor level is the socioeconomic level.

Step 1: Determine null and alternative hypotheses

$H_0: \mu_1 = \mu_2 = \mu_3$ (no difference in means for the factor levels)
$H_1$: not all populations have equal means.

Step 2: Choose significance level

The chosen significance level is α = 0.01.

Step 3: Determine the rejection region

(a) The degrees of freedom for treatment, $v_1$ = k – 1 = 3 – 1 = 2.
(b) The degrees of freedom for errors, $v_2$ = N – k = 17 – 3 = 14.

Determine the critical value, that is $F_{2,14,0.01} = 6.515$ (obtained from the F
distribution table). Reject $H_0$ when F > 6.515.

Step 4: Calculate test statistic

$$F = \frac{MS(Tr)}{MS_E} = \frac{SS(Tr)/(k-1)}{SS_E/(N-k)} = \frac{15.529/2}{318/14} = 0.3418$$

Step 5: Test Results

Since the test statistic value, F = 0.3418 < 6.515, we fail to reject $H_0$.

Step 6: Test conclusion

There is not enough evidence to conclude that the mean aptitude test score differs
among students according to their family socioeconomic status. The data are consistent
with the study's claim.
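When only the sums of squares are given, as in this question, the F ratio follows from the two mean squares. A minimal sketch, with the critical value 6.515 taken from the F table as quoted above and an illustrative function name:

```python
def anova_f_from_ss(ss_treat, ss_error, k, n_total):
    """F ratio MS(Tr) / MSE computed from the two sums of squares."""
    ms_treat = ss_treat / (k - 1)
    ms_error = ss_error / (n_total - k)
    return ms_treat / ms_error


f_stat = anova_f_from_ss(ss_treat=15.529, ss_error=318, k=3, n_total=17)
reject = f_stat > 6.515      # table value F_{2,14,0.01}
print(round(f_stat, 4), reject)
```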


TOPIC 3: CATEGORICAL DATA ANALYSIS

Exercise 3.1
Step 1: Construct appropriate hypothesis statement
$H_0$: the attendance record is uniform every day
⇔ $H_0$: the absenteeism rate is the same every day, that is, 10 people daily.

$H_1$: otherwise

Step 2: Determine the significance level and rejection region

The test is performed at the 1% level; hence reject $H_0$ if the test statistic
$X = \sum \dfrac{(O - E)^2}{E} > \chi^2_{1\%}(5) = 15.086$, with v = number of rows – 1 = 6 – 1 = 5
degrees of freedom.

Step 3: Calculate the test statistic

The test statistic is calculated as $X = \sum_{j=1}^{6} \dfrac{(O_j - E_j)^2}{E_j}$, which is distributed as $\chi^2$ with
v = 5 degrees of freedom. The following table contains the required information.

Day          Observed Frequency, O    Expected Frequency, E    O – E    (O – E)²/E
Monday                12                       10                 2         0.4
Tuesday                9                       10                -1         0.1
Wednesday             11                       10                 1         0.1
Thursday              10                       10                 0         0
Friday                 9                       10                -1         0.1
Saturday               9                       10                -1         0.1
Total                 60                       60                           0.8

Thus we obtained $X = \sum \dfrac{(O - E)^2}{E} = 0.8$.


Step 4: Test Results


Since the test statistic value, X = 0.8 < 15.086, we fail to reject $H_0$.

Step 5: Test conclusion

Hence, there is not enough evidence to conclude that staff absenteeism is anything other than uniformly distributed across the days of the week.
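The goodness-of-fit statistic for a uniform distribution is simple enough to verify directly. A minimal sketch using the observed counts from the table in Step 3 (variable names are illustrative):

```python
observed = [12, 9, 11, 10, 9, 9]           # Monday to Saturday
expected_each = sum(observed) / len(observed)   # uniform: 60 / 6 = 10 per day

# Chi-square goodness-of-fit statistic: sum of (O - E)^2 / E
stat = sum((o - expected_each) ** 2 / expected_each for o in observed)
df = len(observed) - 1
print(round(stat, 2), df)
```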

Exercise 3.2
1. Solve the question using the following steps:

Step 1: Construct appropriate hypothesis statement

$H_0$: frequencies follow a Binomial distribution.
⇔ $H_0: X \sim b(5, p)$ where $p = \dfrac{\bar{x}}{n} = \dfrac{3.32}{5} = 0.664$
$H_1$: otherwise.

Step 2: Calculate the test statistic

Using n = 5 and p = 0.664, we can generate the theoretical Binomial distribution,
that is X (which represents the number of events in an
experiment) ~ Bin(5, 0.664) with

$$\Pr(X = x) = \binom{n}{x} p^x (1 - p)^{n - x}$$

and we obtained the following table (Expectation =
Pr(X = x) multiplied by the total frequency, Σf = 100):

X              0      1      2      3      4      5
Expectation    0.4    4.2    16.7   33.1   32.7   12.9

Since the first two expected frequencies are less than 5 (X = 0 and 1), both are
combined with X = 2, giving an expected frequency of 21.3 (that is 0.4 + 4.2 +
16.7). The observed and expected frequencies used in the
subsequent analysis are given in the following table:

X           ≤2      3       4       5       Total
Observed    21      33      31      15      100
Expected    21.3    33.1    32.7    12.9    100


and

$$X = \frac{(21 - 21.3)^2}{21.3} + \frac{(33 - 33.1)^2}{33.1} + \frac{(31 - 32.7)^2}{32.7} + \frac{(15 - 12.9)^2}{12.9} = 0.43$$

Step 3: Determine the significance level and rejection region

There are two restrictions, namely the estimation of p from the sample mean and the fixed total frequency. Thus, there are
4 – 2 = 2 degrees of freedom. Hence, the critical value for this test is
$\chi^2_{5\%}(2) = 5.99$ (refer to the table). Reject $H_0$ when the test statistic $X > \chi^2_{5\%}(2) = 5.99$.

Step 4: Test result

Since the test statistic X = 0.43 < $\chi^2_{5\%}(2) = 5.99$, we fail to reject $H_0$.

Step 5: Test conclusion

There is no strong evidence to reject the null hypothesis. Hence, the data are
consistent with the frequencies following a Binomial distribution.
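The expected frequencies and the pooled statistic can be regenerated from n = 5, p = 0.664 and a total frequency of 100. This is a sketch under those stated values; `binomial_expected` is an illustrative helper, and because it works with unrounded expected values the statistic agrees with the hand calculation only to two decimal places.

```python
from math import comb


def binomial_expected(n_trials, p, total):
    """Expected frequencies total * Pr(X = x) for x = 0..n_trials."""
    return [total * comb(n_trials, x) * p ** x * (1 - p) ** (n_trials - x)
            for x in range(n_trials + 1)]


expected = binomial_expected(5, 0.664, 100)

# Pool X = 0, 1 and 2 so that every expected frequency is at least 5.
expected_pooled = [sum(expected[:3])] + expected[3:]
observed_pooled = [21, 33, 31, 15]

stat = sum((o - e) ** 2 / e for o, e in zip(observed_pooled, expected_pooled))
print([round(e, 1) for e in expected], round(stat, 2))
```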

2. Solve the question using the following steps:

Step 1: Construct appropriate hypothesis statement

$H_0$: frequencies follow a Poisson distribution
⇔ $H_0: X \sim Po(\lambda)$ where $\hat{\lambda} = \bar{x} = \dfrac{396}{330} = 1.2$
$H_1$: otherwise

Step 2: Calculate the test statistic

Using this mean value, we are able to calculate
$f(x) = \Pr(X = x) = \dfrac{e^{-\lambda}\lambda^x}{x!}$ where $\hat{\lambda} = \bar{x}$. Alternatively, we may also use the
formula $f(r + 1) = \dfrac{\lambda}{r + 1} \cdot f(r)$ to calculate Pr(X = x). Using the formula, the
mean result and combining several observed frequencies, we obtained
the following table (the expected frequency for "4 or more" is the remainder, 330 – 318.85 = 11.15):

X           0        1         2        3        4 or more    Total
Observed    102      114       74       28       12           330
Expected    99.39    119.27    71.56    28.63    11.15        330


Hence, we obtained:

$$X = \sum \frac{(O - E)^2}{E} = \frac{(102 - 99.39)^2}{99.39} + \dots + \frac{(12 - 11.15)^2}{11.15} = 0.46$$

Step 3: Determine the significance level and rejection region

Since there are 5 pairs of frequencies with 2 restrictions, we have 5 – 2 = 3
degrees of freedom. Thus, the critical value for the test is $\chi^2_{5\%}(3) = 7.81$.
Reject $H_0$ when the test statistic $X > \chi^2_{5\%}(3) = 7.81$.

Step 4: Test result

Since X = 0.46 < $\chi^2_{5\%}(3) = 7.81$, we fail to reject $H_0$.

Step 5: Test conclusion

This means that there is not enough evidence to reject $H_0$, and we may conclude
that the Poisson distribution gives an adequate fit.
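The Poisson fit can be sketched as follows, treating the last cell as "4 or more" so that the expected frequencies sum to the total of 330. The helper name `poisson_expected` is illustrative, not taken from the module.

```python
from math import exp, factorial


def poisson_expected(lam, total, max_x):
    """Expected frequencies for X = 0..max_x-1 plus a pooled 'max_x or more' cell."""
    cells = [total * exp(-lam) * lam ** x / factorial(x) for x in range(max_x)]
    cells.append(total - sum(cells))   # remaining probability mass
    return cells


observed = [102, 114, 74, 28, 12]      # X = 0, 1, 2, 3, 4 or more
lam = 396 / 330                        # sample mean used as the estimate of lambda
expected = poisson_expected(lam, total=330, max_x=4)

stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 2                 # two restrictions: total and estimated mean
print([round(e, 2) for e in expected], round(stat, 2), df)
```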

3. Step 1: Construct appropriate hypothesis statement

$H_0$: frequencies follow a normal distribution.
⇔ $H_0: X \sim N(\mu, \sigma^2)$ where $\hat{\mu} = \bar{x} = 79.72$ and $\hat{\sigma} = s = 15.60$
$H_1$: otherwise

Step 2: Calculate the test statistic

Calculate $\bar{x} = \hat{\mu} = 79.72$ and $s = \hat{\sigma} = 15.60$. Calculating the expected
frequencies under the assumption of a normal distribution and setting them against the
observed values, the following table is obtained:

Upper Class    Z = (x – 79.72)/15.60    Pr(Z < z)    p         Expected    Observed
Limit                                                          Values      Values
59.95          -1.2673                  0.102        0.102     6.63        8
69.95          -0.6263                  0.2643       0.1623    10.5495     10
79.95           0.0147                  0.504        0.2397    15.5805     16
89.95           0.6558                  0.745        0.241     15.665      14
99.95           1.2968                  0.9032       0.1582    10.283      10
109.95          1.9378                  0.9738       0.0706    4.589       5
119.95          2.5788                  0.995        0.0212    1.378       2


Thus,

$$X = \sum \frac{(O - E)^2}{E} = \frac{(8 - 6.63)^2}{6.63} + \frac{(10 - 10.55)^2}{10.55} + \dots + \frac{(2 - 1.38)^2}{1.38} = 0.83$$

Step 3: Determine the significance level and rejection region

There are 7 cells with 3 restrictions, namely the mean, the standard deviation
and the total, giving v = 7 – 3 = 4 degrees of freedom. Thus, $\chi^2_{5\%}(4) = 9.488$.
Reject $H_0$ when the test statistic $X > \chi^2_{5\%}(4) = 9.488$.

Step 4: Test result

Since X = 0.83 < $\chi^2_{5\%}(4) = 9.488$, we fail to reject $H_0$.

Step 5: Test conclusion

Thus, there is not enough evidence to reject $H_0$; the data are consistent with the
assumption that the observations follow a normal distribution.


Exercise 3.3
Using the formula for expected frequencies, $E_{ij} = N\left(\dfrac{L_j}{N}\right)\left(\dfrac{B_i}{N}\right)$, where $B_i$ is the total of row i and $L_j$ is the total of column j,
we obtained the following results:

                           Type of Car
Age             Local-made                 Import                      Total
<30             (110)(99)/200 = 54.45      (110)(101)/200 = 55.55      110
30 and above    (90)(99)/200 = 44.55       (90)(101)/200 = 45.45       90
Total           99                         101                         200

Exercise 3.4
Step 1: Construct appropriate hypothesis statement
H 0 : both variables are independent
 H 0 : there is no relationship between the interest on type of car and age level.
H1 : both variables are dependent.

Step 2: Determine the significance level and rejection region


The significance level usually used is α = 0.05.

We have r: number of levels of the age variable (2)
c: number of levels of the car-type preference variable (2).

Thus, v = (2 – 1) × (2 – 1) = 1

Since v = 1 degree of freedom, $\chi^2_{0.05}(1) = 3.841$.


Step 3: Calculate the test statistic


(a) The following table gives the observed frequencies, with the expected frequencies in brackets:

                        Type of Car
Age             Local-made      Import          Total
<30             68 (54.45)      42 (55.55)      110
30 and above    31 (44.55)      59 (45.45)      90
Total           99              101             200

(b) Next, calculate the chi-square test statistic, that is $X = \sum \dfrac{(O - E)^2}{E}$.
From the table, we obtained the following result:

$$X = \frac{(68 - 54.45)^2}{54.45} + \frac{(42 - 55.55)^2}{55.55} + \dots + \frac{(59 - 45.45)^2}{45.45} = 14.84$$

Step 4: Test result

Since X = 14.84 > 3.841, reject $H_0$.

Step 5: Test conclusion

There is strong evidence to reject $H_0$, that is, the two
variables are dependent. This means that the preference for a type of car depends on one's
age.
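The expected frequencies and the test statistic for a contingency table follow mechanically from the row and column totals. The sketch below uses the observed counts from the table above; `chi_square_independence` is an illustrative name rather than a function from any particular library.

```python
def chi_square_independence(table):
    """Expected counts, chi-square statistic and d.f. for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # E_ij = (row total i) * (column total j) / N
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    stat = sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(table, expected)
               for o, e in zip(obs_row, exp_row))
    df = (len(table) - 1) * (len(table[0]) - 1)
    return expected, stat, df


observed = [[68, 42],    # first age group: local-made, import
            [31, 59]]    # 30 and above
expected, stat, df = chi_square_independence(observed)
print(expected, round(stat, 2), df)
```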

Exercise 3.5
The information can be summarized in the table below:

                     Type of Favourite Sport
Students    Football    Basketball    Baseball    Tennis
Male        33          38            24          5
Female      38          21            15          26

The variable classified is the preference for a type of sport. The populations are
male and female students. The testing method follows several steps:


Step 1: Determine the hypothesis test


H 0 : distribution of sport favoured is similar among male and female students
H1 : otherwise

Step 2: Determine significance level and rejection region


Choose the significance level, α = 0.05 or α = 0.01. Since there are two rows and
four columns, the degrees of freedom obtained is v = (2 – 1)(4 – 1) = 3. Using this
degrees of freedom value, we will get the chi-square value from the distribution
table at the α = 5% level as $\chi^2_{0.05,v}$. Reject $H_0$ when the test statistic value
$X > \chi^2_{0.05,v}$.

Step 3: Calculate the test statistic


(a) Construct the table of expected frequency
(b) Calculate the test statistic value

Step 4: Test Result


Test result depends on results in 2 and 3.

Step 5: Test conclusion


Test conclusion depends on results in 4.

Exercise 3.6
Step 1: Construct the appropriate hypothesis statement
H 0 : there is no difference in colour blindness level according to gender.
H1 : otherwise.

Step 2: Determine significance level and rejection region


Since there are two rows and two columns, the degrees of freedom obtained is
v = (2 – 1)(2 – 1) = 1. Using this degrees of freedom value, we will get the chi-square
value from the distribution table (at the α = 5% level) as $\chi^2_{5\%}(1) = 3.84$. Reject $H_0$
when the test statistic value $X > \chi^2_{5\%}(1)$.


Step 3: Calculate the test statistic


(a) The following table gives the observed frequencies, with the expected frequencies in brackets:

                      Colour Blindness
Gender     Normal          Colour Blind     Total
Male       2210 (2280)     190 (120)        2400
Female     2540 (2470)     60 (130)         2600
Total      4750            250              5000

(Numbers in brackets are the expected frequencies)

(b) Thus, applying the continuity correction,

$$X_{cc} = \sum \frac{(|O - E| - 0.5)^2}{E} = \frac{(|2210 - 2280| - 0.5)^2}{2280} + \dots + \frac{(|60 - 130| - 0.5)^2}{130} = 81.48$$

Step 4: Test Result

Since $X_{cc} = 81.48 > \chi^2_{5\%}(1) = 3.84$, we have evidence to reject $H_0$.

Step 5: Test conclusion

There is strong evidence that the level of colour blindness
differs according to gender.

Exercise 3.7
1. Step 1: Construct the appropriate hypothesis statement
$H_0$: the distribution of book loans is uniform
⇔ $H_0$: the number of books borrowed is the same every day, that is 258 books.
$H_1$: otherwise.


Step 2: Determine the significance level and rejection region


Perform the test at the 1% level; hence reject $H_0$ if the test statistic
$X = \sum \dfrac{(O - E)^2}{E} > \chi^2_{1\%}(5) = 15.086$, with v = number of days – 1 = 6 – 1 = 5
degrees of freedom.

Step 3: Calculate the test statistic

The test statistic is calculated as $X = \sum_{j=1}^{6} \dfrac{(O_j - E_j)^2}{E_j}$, which is distributed as $\chi^2$
with v = 5 degrees of freedom. The following table contains the information
needed.

Day          Observed Frequency, O    Expected Frequency, E    O – E    (O – E)²/E
Monday               204                      258               -54      11.3023
Tuesday              292                      258                34       4.4806
Wednesday            242                      258               -16       0.9922
Thursday             283                      258                25       2.4225
Friday               252                      258                -6       0.1395
Saturday             275                      258                17       1.1202
Total               1548                     1548                        20.457

Thus $X = \sum \dfrac{(O - E)^2}{E} = 20.457$

Step 4: Test Results

Since the test statistic value X = 20.457 > 15.086, reject $H_0$.

Step 5: Test conclusion

There is strong evidence that the number of books borrowed from the library is not
the same every day. Thus, we have enough evidence to say that the
distribution of book loans is not uniform.


2. It is known that,

X : the number of goals obtained in a match by a team.

Thus,
{X = 0}: no goal obtained in a match by a team.
λ: mean/average goals per team per match
Thus, $\hat{\lambda}$ = Total number of goals made/(2 × number of matches)

$$\hat{\lambda} = \frac{3(0) + 5(1) + \dots + 1(7) + 0(8)}{2 \times 20} = \frac{105}{40} = 2.625$$

Step 1: Construct appropriate test hypothesis

$H_0$: frequencies follow a Poisson distribution
⇔ $H_0: X \sim Po(\lambda)$ where $\hat{\lambda} = \bar{x} = \dfrac{105}{40} = 2.625$
$H_1$: otherwise

Step 2: Calculate the test statistic

Using this mean value, we can calculate $f(x) = \Pr(X = x) = \dfrac{e^{-\lambda}\lambda^x}{x!}$ where $\hat{\lambda} = \bar{x}$.
However, we can also use the formula $f(x + 1) = \dfrac{\lambda}{x + 1} \cdot f(x)$ to calculate
Pr(X = x). Using the formula and the mean result, we will get the following table:

Number of Goals, X          0       1       2       3       4       5       6       7       8
Expected Number of Teams    2.883   7.583   9.971   8.741   5.747   3.023   1.325   0.498   0.164


Since the expected numbers of teams scoring X = 0, 1 and X = 5, 6, 7 and 8 goals
are less than 5, we need to combine these cells according to the
number of goals. The result is displayed in the following table:

X           ≤1        2       3       4       5 or More    Total
Observed    8         14      9       3       6            40
Expected    10.466    9.97    8.74    5.75    5.01         40

Thus, we get:

$$X = \sum \frac{(O - E)^2}{E} = \frac{(8 - 10.466)^2}{10.466} + \dots + \frac{(6 - 5.01)^2}{5.01} = 3.73$$

Step 3: Determine significance level and rejection region

Since there are 5 pairs of frequencies with 2 restrictions, there are 5 – 2 = 3
degrees of freedom. Thus, the critical value for the test is $\chi^2_{5\%}(3) = 7.81$.
Reject $H_0$ when the test statistic $X > \chi^2_{5\%}(3) = 7.81$.

Step 4: Test Result

Since X = 3.73 < $\chi^2_{5\%}(3) = 7.81$, we fail to reject $H_0$.

Step 5: Test conclusion

This means that there is not enough evidence to reject $H_0$, and we may state that the
Poisson distribution gives an adequate fit.

3. Step 1: Construct appropriate test hypothesis


H 0 : frequencies follow a Binomial distribution
 H 0 : X ~ b (5,0.3)
H1 : otherwise


Step 2: Calculate the test statistic


Using n = 5 and p = 0.3, the Binomial distribution can be generated using the
theory that X (which represents the number of events in the experiment)
~ Bin(5, 0.3), that is $\Pr(X = x) = \binom{n}{x} p^x (1 - p)^{n - x}$, and we obtain the
following table (expected = Pr(X = x) multiplied by the total frequency, Σf = 100):

X           0         1         2        3 or more    Total
Observed    22        37        20       21           100
Expected    16.807    36.015    30.87    16.31        100

and

$$X = \frac{(22 - 16.807)^2}{16.807} + \frac{(37 - 36.015)^2}{36.015} + \frac{(20 - 30.87)^2}{30.87} + \frac{(21 - 16.31)^2}{16.31} = 6.81$$

Step 3: Determine significance level and rejection region

There is no restriction, thus 4 – 0 = 4 degrees of freedom. Hence, the critical
value for this test is $\chi^2_{5\%}(4) = 9.49$ (see table). Reject $H_0$ when the test statistic
$X > \chi^2_{5\%}(4) = 9.49$.

Step 4: Test Result

Since X = 6.81 < $\chi^2_{5\%}(4) = 9.49$, we fail to reject $H_0$.

Step 5: Test conclusion

Thus, there is no strong evidence to reject the null hypothesis. The data are consistent with
the frequencies following a Binomial distribution.

4. Step 1: Construct appropriate test hypothesis

$H_0$: frequencies follow a Normal distribution
⇔ $H_0: X \sim N(13, 3.4^2)$

$H_1$: otherwise


Step 2: Calculate the test statistic

To calculate the expected frequencies, we use the assumption that the
observations are normally distributed; hence the table below is obtained:

Upper Class Limit    Z = (x – 13)/3.4    Pr(Z < z)    p         Expected    Observed
7.9995               -1.47               0.0708       0.0708    15.505      16
9.9995               -0.88               0.189        0.1182    25.886      21
11.9995              -0.29               0.3859       0.1969    43.121      38
13.9995               0.29               0.6141       0.2282    49.976      50
15.9995               0.88               0.8106       0.1965    43.034      48
17.9995               1.47               0.9292       0.1186    25.973      36
-                     -                  1            0.0708    15.505      10

Thus,

$$X = \sum \frac{(O - E)^2}{E} = \frac{(16 - 15.505)^2}{15.505} + \frac{(21 - 25.886)^2}{25.886} + \dots + \frac{(10 - 15.505)^2}{15.505} = 7.945$$

Step 3: Determine the significance level and rejection region

There are 7 cells with 0 restrictions. Thus, $\chi^2_{5\%}(7) = 14.067$. Reject $H_0$
when the test statistic $X > \chi^2_{5\%}(7) = 14.067$.

Step 4: Statistical test result

Since X = 7.945 < $\chi^2_{5\%}(7) = 14.067$, we fail to reject $H_0$.

Step 5: Test conclusion

Thus, there is not enough evidence to reject $H_0$; the data are consistent with the
normal assumption on the observed data.

5. Step 1: Determine the null and alternative hypotheses


H 0 : both variables are independent
 H 0 : the average grade level does not have any association with year of study
H1 : both variables are dependent


Step 2: Determine significance level and rejection region


The significance level used is α = 0.05.

From the information given:

r: number of levels for the grade point average factor (there are 3)
c: number of levels for the year of study factor (there are 3)

Thus, v = (3 – 1) × (3 – 1) = 4

Since there are v = 4 degrees of freedom, we have $\chi^2_{0.05}(4) = 9.488$

Step 3: Calculate test statistic

(a) The observed frequencies, with the expected frequencies in brackets, are as follows:

Table of observed and expected data:

Average Grade               Year of Study
Value            Year 1        Year 2        Year 3        Total
<2.0             14 (15)       16 (15)       15 (15)       45
2.0-3.0          10 (10.67)    11 (10.67)    11 (10.67)    32
>3.0             26 (24.33)    23 (24.33)    24 (24.33)    73
Total            50            50            50            150
(Numbers in brackets are expected frequencies.)

(b) Then, calculate the chi-square value, which is $X = \sum \dfrac{(O - E)^2}{E}$.
From the table, we obtained the following result:

$$X = \frac{(14 - 15)^2}{15} + \frac{(16 - 15)^2}{15} + \dots + \frac{(24 - 24.33)^2}{24.33} = 0.39$$

Step 4: Test Result
Since X = 0.39 < 9.488, we fail to reject $H_0$.


Step 5: Test conclusion


We can state that there is no strong evidence to reject H 0 and may conclude
that both variables are independent. This means, the average of grade value
results obtained by students does not depend on the year of study.

6. Solve the problem using the following steps:

Step 1: Determine the null and alternative hypotheses


H 0 : the proportion of three populations is equal for each category.
H1 : otherwise.

Step 2: Determine the significance level and rejection region


Since there are three rows and four columns, the degrees of freedom
obtained is v = (3 – 1)(4 – 1) = 6. Using this degrees of freedom, we will get
the value from the chi-square distribution table (at the α = 1% level) as
$\chi^2_{0.01,6} = 16.812$. Reject $H_0$ when the test statistic value $X > \chi^2_{0.01,6}$.

Step 3: Calculate test statistic

(a) The observed frequencies, with the expected frequencies in brackets, are as follows:

                               Category
Population    1             2             3             4             Total
1             16 (19.67)    38 (38.33)    5 (10.67)     41 (31.33)    100
2             24 (19.67)    41 (38.33)    12 (10.67)    23 (31.33)    100
3             19 (19.67)    36 (38.33)    15 (10.67)    30 (31.33)    100

(b) Hence, we can calculate the value of the test statistic and obtain:

$$X = \frac{(16 - 19.67)^2}{19.67} + \frac{(38 - 38.33)^2}{38.33} + \dots + \frac{(30 - 31.33)^2}{31.33} = 12.184$$

Step 4: Test Result

Since X = 12.184 < 16.812, we fail to reject $H_0$.


Step 5: Test conclusion


Thus, it can be concluded that the proportion of all three populations is
equal for each category.

7. Solve the problem using the following steps:

Step 1: Determine the null and alternative hypotheses


H 0 : there is no difference between male and female opinions
H1 : otherwise

Step 2: Determine the significance level and rejection region


Since there are two rows and two columns, the degrees of freedom are v = (2 – 1)(2 – 1) = 1. Using this, the chi-square distribution table (at level α = 1%) gives $\chi^2_{1\%}(1) = 6.635$. Reject $H_0$ when the test statistic value $X^2 > \chi^2_{1\%}(1)$.

Step 3: Calculate test statistic


(a) The observed frequencies, with the expected frequencies in parentheses, are as follows:

Gender     Opinion on Helmet Usage
           Agree         Disagree
Male       32 (21.5)     11 (21.5)
Female     68 (78.5)     89 (78.5)

(b) Thus, with the continuity correction,
$$X^2_{cc} = \frac{(|32-21.5|-0.5)^2}{21.5} + \dots + \frac{(|89-78.5|-0.5)^2}{78.5} = 11.85$$

Step 4: Test Result


Since $X^2_{cc} = 11.85 > \chi^2_{1\%}(1) = 6.635$, we have evidence to reject $H_0$.

Step 5: Test conclusion


This shows that male and female students differ in their opinions on the usage of helmets on campus.
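The continuity-corrected statistic for this 2 × 2 table can be reproduced as follows. The sketch assumes the usual correction, which subtracts 0.5 from each absolute difference before squaring; `yates_chi_square` is a hypothetical name chosen for illustration.

```python
# Observed counts: rows are Male, Female; columns are Agree, Disagree.
helmet = [
    [32, 11],
    [68, 89],
]


def yates_chi_square(table):
    """Chi-square statistic with continuity correction for a 2 x 2 table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            # Subtract 0.5 from |observed - expected| before squaring.
            statistic += (abs(obs - expected) - 0.5) ** 2 / expected
    return statistic


x2_cc = yates_chi_square(helmet)
print(round(x2_cc, 2))
```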


TOPIC 4: CORRELATION

Exercise 4.1
(a) 1, +
(b) 2, +
(c) 2, -

Exercise 4.2

          x_i    y_i    x_i y_i    x_i^2    y_i^2
            1      2       2          1        4
            2      3       6          4        9
            4      4      16         16       16
            5      7      35         25       49
            6     12      72         36      144
            8     10      80         64      100
           10      7      70        100       49
Total      36     45     281        246      371

$$r_p = \frac{n\sum x_i y_i - (\sum x_i)(\sum y_i)}{\sqrt{\left(n\sum x_i^2 - (\sum x_i)^2\right)\left(n\sum y_i^2 - (\sum y_i)^2\right)}}$$

$$= \frac{7(281) - (36)(45)}{\sqrt{\left(7(246) - (36)^2\right)\left(7(371) - (45)^2\right)}} = 0.703$$

The Pearson correlation coefficient value of 0.703 shows that there is a strong positive linear relationship between the frequency of fertilizer usage and crop yield. This means that the more frequently the farmer applies fertilizer, the higher the crop yield tends to be.
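The computation above can be verified directly from the raw data. The following is a minimal sketch using the same sums-of-products formula; `pearson_r` is a hypothetical helper name.

```python
from math import sqrt

# Data from the table in Exercise 4.2.
x = [1, 2, 4, 5, 6, 8, 10]
y = [2, 3, 4, 7, 12, 10, 7]


def pearson_r(xs, ys):
    """Pearson correlation coefficient from the raw sums."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(a * b for a, b in zip(xs, ys))
    sum_x2 = sum(a * a for a in xs)
    sum_y2 = sum(b * b for b in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


r_p = pearson_r(x, y)
print(round(r_p, 3))
```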


Exercise 4.3
A one-sided hypothesis test (since the Pearson correlation coefficient value is
positive) is as follows:

$H_0: \rho_p = 0$
$H_1: \rho_p > 0$

Test Statistic: $T = r_p\sqrt{\dfrac{n-2}{1-r_p^2}} = 0.703\sqrt{\dfrac{7-2}{1-(0.703)^2}} = 2.21$

Test Results: T follows a t distribution with v = 7 – 2 = 5 degrees of freedom, and the significance level is 0.01.

Reject $H_0$ when $T > t_{0.01,5} = 3.365$.

Since the test statistic T = 2.21 < 3.365, we cannot reject the null hypothesis. This means that we do not have enough evidence to say that the Pearson correlation coefficient differs from zero; that is, there is no significant relationship between the frequency of fertilizer distribution and crop yield at the 1% significance level.

Exercise 4.4
$$r_s = 1 - \frac{6\sum D_i^2}{n(n^2-1)} = 1 - \frac{6(74)}{10\left((10)^2-1\right)} = 0.5515$$

The Spearman correlation coefficient value of 0.5515 shows that there exists a moderate positive relationship between athletes’ ranking and their position in a match.


Exercise 4.5
A one-sided hypothesis test (since the Spearman correlation coefficient value is
positive) is as follows:

$H_0: \rho_s = 0$
$H_1: \rho_s > 0$

Test Statistic: $T = r_s\sqrt{\dfrac{n-2}{1-r_s^2}} = 0.5515\sqrt{\dfrac{10-2}{1-(0.5515)^2}} = 1.87$

Test Results: T follows a t distribution with v = 10 – 2 = 8 degrees of freedom, and the significance level is 0.01.

Reject $H_0$ when $T > t_{0.01,8} = 2.896$.

Since the test statistic T < 2.896, we cannot reject the null hypothesis. This means
that there is not enough evidence to say that there exists a significant relationship
between athlete ranking and their position in a match at 1% significance level.

Exercise 4.6
1. The importance of the two-way scatter plot:
– the two-way scatter plot can be used to display or determine the
relationship between two quantitative variables X and Y.
– the two-way scatter plot can also be used to analyze patterns in
bivariate data.

2. The importance of the correlation coefficient r:

– The sign of r shows the direction of the relationship between two variables (positive or negative), while its magnitude shows the strength of the relationship: the larger the absolute value of the correlation, the stronger the relationship.


3. (a) Spearman correlation coefficient because the data are qualitative in nature and can be ranked.

(b) Spearman correlation coefficient because the data are qualitative in nature and can be ranked. In this case, X = “time spent on study revision” can be categorised, which makes it qualitative in nature, and thus it can be ranked.

(c) Pearson correlation coefficient because the data are quantitative.

(d) Pearson correlation coefficient because the data are quantitative.

(e) Spearman correlation coefficient because the data are qualitative in nature and can be ranked.

4. Let x = number of tellers and y = customers’ waiting time

(a) From the scatter plot we can suggest that the number of tellers and
customers’ waiting time have a negative relationship.

(b) To measure the strength of the relationship obtained in (a), we compute the Pearson correlation coefficient, $r_p$. The estimated value of $r_p$ is computed as follows:

$$r_p = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{\sqrt{\left[n\sum x_i^2 - (\sum x_i)^2\right]\left[n\sum y_i^2 - (\sum y_i)^2\right]}}$$

i x y xy x2 y2
1 4.00 6.40 25.60 16.00 40.96
2 1.00 8.70 8.70 1.00 75.69
3 5.00 3.20 16.00 25.00 10.24
4 3.00 10.50 31.50 9.00 110.25
5 4.00 8.20 32.80 16.00 67.24
6 3.00 11.30 33.90 9.00 127.69
7 3.00 11.30 33.90 9.00 127.69
8 2.00 12.80 25.60 4.00 163.84
9 2.00 11.60 23.20 4.00 134.56
10 6.00 3.20 19.20 36.00 10.24
11 3.00 9.40 28.20 9.00 88.36
12 2.00 12.80 25.60 4.00 163.84
13 4.00 8.20 32.80 16.00 67.24
Total 42.00 117.60 337.00 158.00 1187.84

$$r_p = \frac{13(337) - (42)(117.6)}{\sqrt{\left[13(158) - (42)^2\right]\left[13(1187.84) - (117.6)^2\right]}}$$

$$r_p = \frac{4381 - 4939.2}{\sqrt{2054 - 1764}\,\sqrt{15441.92 - 13829.76}}$$

$$r_p = \frac{-558.2}{(17.03)(40.15)} = \frac{-558.2}{683.75} = -0.8163$$
The Pearson correlation coefficient value of -0.8163 indicates that there exists a strong negative linear relationship between the number of tellers and customers’ waiting time. Thus, we can suggest that if the number of tellers is increased, customers’ waiting time will tend to be reduced.

(c) Since the Pearson correlation coefficient value is negative, we will use the following null and alternative hypotheses:

$H_0: \rho_p = 0$
$H_1: \rho_p < 0$

Test statistic, $T = r_p\sqrt{\dfrac{n-2}{1-r_p^2}} = -0.8163\sqrt{\dfrac{13-2}{1-(-0.8163)^2}} = -4.687$

Critical value, $-t_{0.01,11} = -2.718$

Since the test statistic $T = -4.687 < -t_{0.01,11} = -2.718$, we reject the null hypothesis and conclude that $\rho_p < 0$. In other words, there exists a significant negative linear relationship between the number of tellers and customers’ waiting time. We may conclude that customers’ waiting time can be reduced further if the number of tellers is increased.
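The test statistic for a correlation coefficient is easy to mis-key because of the square root, so a quick check may help. The sketch below evaluates T = r·sqrt((n − 2)/(1 − r²)); the function name `correlation_t_statistic` is hypothetical, and the critical values still come from the t table.

```python
from math import sqrt


def correlation_t_statistic(r, n):
    """t statistic for testing that a correlation coefficient is zero."""
    return r * sqrt((n - 2) / (1 - r ** 2))


# Tellers and waiting time: r = -0.8163 with n = 13 observations.
t_tellers = correlation_t_statistic(-0.8163, 13)

# Exercise 4.3 (fertilizer and crop yield): r = 0.703 with n = 7.
t_fertilizer = correlation_t_statistic(0.703, 7)

print(round(t_tellers, 3), round(t_fertilizer, 2))
```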

5. (a) In this problem, we will investigate the relationship between the


following variables:
(i) The sales (‘000) and interview rank.
(ii) The test score and interview rank

Table 1 gives the computations required to investigate the relationship between the sales (‘000) and interview rank using the Spearman correlation coefficient. Sales are ranked from the highest (rank 1) to the lowest (rank 10).

Table 1
Interview Rank   Sales (‘000)   Interview Rank, u   Sales Rank, v   D = u - v   D^2
       5              17                5                10            -5        25
       3              32                3                 6            -3         9
       1              27                1                 8            -7        49
       9              46                9                 3             6        36
       6              55                6                 2             4        16
       4              45                4                 4             0         0
      10              36               10                 5             5        25
       2              28                2                 7            -5        25
       7              18                7                 9            -2         4
       8              66                8                 1             7        49
                                                                  ΣD^2 = 238


The Spearman correlation coefficient value, $r_s$, is computed as follows:

$$r_s = 1 - \frac{6\sum D^2}{n(n^2-1)} = 1 - \frac{6(238)}{10(10^2-1)} = 1 - \frac{1428}{990} = -0.442$$

The Spearman correlation coefficient value is -0.442. Thus we can say that there exists a weak negative relationship between the sales (‘000) and the interview rank.

Table 2 gives the computations required to investigate the relationship between the test score and interview rank using the Spearman correlation coefficient.

Table 2
Test Score   Interview Rank   Test Score Rank, u   Interview Rank, v   D = u - v   D^2
    50             5                  9                    5               4        16
    68             3                  5                    3               2         4
    45             1                 10                    1               9        81
    68             9                  5                    9              -4        16
    78             6                  1                    6              -5        25
    68             4                  5                    4               1         1
    60            10                  7                   10              -3         9
    56             2                  8                    2               6        36
    76             7                  2                    7              -5        25
    72             8                  3                    8              -5        25
                                                                     ΣD^2 = 238

The Spearman correlation coefficient value, $r_s$, is computed as follows:

$$r_s = 1 - \frac{6(238)}{10(10^2-1)} = 1 - \frac{1428}{990} = -0.442$$

Similarly, the Spearman correlation coefficient value is -0.442. Thus, we may conclude that there exists a weak negative relationship between the test score and the interview rank.
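The ranking and the coefficient for the test score data can be recomputed from the raw values. This is a sketch under two assumptions taken from the table: the highest score receives rank 1, and tied scores share the average of the ranks they occupy. The helper name `average_ranks_descending` is hypothetical.

```python
def average_ranks_descending(values):
    """Rank values so the largest gets rank 1; ties share the average rank."""
    ordered = sorted(values, reverse=True)
    rank_of = {}
    for v in set(values):
        first = ordered.index(v) + 1      # first 1-based position of v
        count = ordered.count(v)          # number of tied observations
        rank_of[v] = first + (count - 1) / 2
    return [rank_of[v] for v in values]


test_scores = [50, 68, 45, 68, 78, 68, 60, 56, 76, 72]
interview_rank = [5, 3, 1, 9, 6, 4, 10, 2, 7, 8]

score_rank = average_ranks_descending(test_scores)
sum_d2 = sum((u - v) ** 2 for u, v in zip(score_rank, interview_rank))
n = len(test_scores)
r_s = 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

print(score_rank)
print(sum_d2, round(r_s, 3))
```

Because 6ΣD² = 1428 exceeds n(n² − 1) = 990, the coefficient comes out negative.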


(b) Substituting $\sum D^2 = 238$ into the Spearman formula gives $r_s = 1 - 1428/990 = -0.442$ for both pairs of variables in (a). Since both values are negative, we will use the following null and alternative hypotheses to test for their significance:

$H_0: \rho_s = 0$
$H_1: \rho_s < 0$

Test statistic, $T = r_s\sqrt{\dfrac{n-2}{1-r_s^2}} = -0.442\sqrt{\dfrac{10-2}{1-(-0.442)^2}} = -1.3937$

Critical value, $-t_{0.05,8} = -1.860$

As the test statistic $T = -1.3937 > -t_{0.05,8} = -1.860$, we do not reject the null hypothesis $\rho_s = 0$. Hence,

(i) there is not enough evidence to suggest that there exists a significant relationship between the interview rank and the sales record of the salesman;

(ii) there is not enough evidence to suggest that there exists a significant relationship between the interview rank and the test score of the salesman.

(c) Our results in (a) and (b) show that both criteria, namely the test score and the sales record, have only a weak and statistically non-significant relationship with the salesman’s interview rank. Therefore, in this particular problem, there is no guarantee that salesman selection based on interview rank will bring higher profit to the company.


TOPIC 5: SIMPLE LINEAR REGRESSION ANALYSIS

Exercise 5.1
   x        y      ŷ = -12.84 + 36.18x     ε = y - ŷ
  8.3      227          287.454             -60.454
  8.3      312          287.454              24.546
 12.1      362          424.938             -62.938
 12.1      521          424.938              96.062
 17.0      640          602.22               37.78
 47.0      539         1687.62            -1148.62
 17.0      728          602.22              125.78
 24.3      945          866.334              78.666
 24.3      738          866.334            -128.334
 24.3      759          866.334            -107.334
 33.6     1263         1202.81               60.192

Exercise 5.2

x y xy x2 y2
60 63.6 3816.0 3600 4044.96
62 65.2 4042.4 3844 4251.04
64 66.0 4224.0 4096 4356.00
65 65.5 4257.5 4225 4290.25
66 66.9 4415.4 4356 4475.61
67 67.1 4495.7 4489 4502.41
68 67.4 4583.2 4624 4542.76
70 68.3 4781.0 4900 4664.89
72 70.1 5047.2 5184 4914.01
74 70.0 5180.0 5476 4900.00
Total 668.0 670.1 44842.4 44794 44941.90

From the table above, we need to calculate the $\bar{x}$ and $\bar{y}$ values first, that is

$$\bar{x} = \frac{\sum x_i}{n} = \frac{668.0}{10} = 66.8 \quad \text{and} \quad \bar{y} = \frac{\sum y_i}{n} = \frac{670.1}{10} = 67.01$$

Now, we can get the $\hat\beta_1$ regression coefficient using the following formula:

$$\hat\beta_1 = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2} = \frac{44842.4 - 10(66.8)(67.01)}{44794 - 10(66.8)^2} = 0.465$$

and next, we can get the $\hat\beta_0$ regression coefficient,

$$\hat\beta_0 = \bar{y} - \hat\beta_1\bar{x} = 67.01 - 0.465(66.8) = 35.95$$

Hence, the simple linear regression model is $\hat{y} = 35.95 + 0.465x$.

$\hat\beta_1 = 0.465$ shows that the y value will increase by 0.465 for each one unit increase in x. $\hat\beta_0 = 35.95$ refers to the y value when the x value is zero.
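The least squares coefficients can be recomputed from the raw data as a check. The sketch below uses the same formulas; `fit_line` is a hypothetical helper name. Note that it carries the unrounded slope into the intercept, so the intercept comes out near 35.98 rather than 35.95, which is obtained when the slope is first rounded to 0.465.

```python
# Data from the table in Exercise 5.2.
x = [60, 62, 64, 65, 66, 67, 68, 70, 72, 74]
y = [63.6, 65.2, 66.0, 65.5, 66.9, 67.1, 67.4, 68.3, 70.1, 70.0]


def fit_line(xs, ys):
    """Return (intercept, slope) of the least squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum(a * b for a, b in zip(xs, ys)) - n * x_bar * y_bar
    sxx = sum(a * a for a in xs) - n * x_bar ** 2
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


b0, b1 = fit_line(x, y)
print(round(b0, 2), round(b1, 3))
```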

Exercise 5.3
(a) The hypothesis test (one-sided test since the $\hat\beta_1$ value is positive):
$H_0: \beta_1 = 0$
$H_1: \beta_1 > 0$
Test Statistic: $T = \dfrac{\hat\beta_1}{s(\hat\beta_1)} = \dfrac{0.465}{0.033} = 14.085$
Test Results: T follows a t distribution with v = 10 – 2 = 8 degrees of freedom, and the significance level is 0.05.
Reject $H_0$ when $T > t_{0.05,8} = 1.86$.

$s(\hat\beta_1)$ is the standard deviation of the sampling distribution of $\hat\beta_1$. The formula to get the standard deviation for $\hat\beta_1$ is

$$s(\hat\beta_1) = \sqrt{\frac{\left(\sum y_i^2 - \hat\beta_0\sum y_i - \hat\beta_1\sum x_i y_i\right)/(n-2)}{\sum x_i^2 - n\bar{x}^2}}$$

$$= \sqrt{\frac{\left(44941.9 - 35.95(670.1) - 0.465(44842.4)\right)/(10-2)}{44794 - 10(66.8)^2}} = 0.033$$

Since the test statistic $T = 14.085 > t_{0.05,8} = 1.86$, we reject the null hypothesis. We have enough evidence to say that the $\beta_1$ value is not zero but positive.

(b) The 99% confidence interval for $\beta_1$ is as follows (with α = 0.01 and $t_{0.005,8} = 3.355$):

$$\hat\beta_1 - t_{0.005,8}\,s(\hat\beta_1) < \beta_1 < \hat\beta_1 + t_{0.005,8}\,s(\hat\beta_1)$$
$$0.465 - 3.355(0.033) < \beta_1 < 0.465 + 3.355(0.033)$$
$$0.354 < \beta_1 < 0.576$$

Exercise 5.4
The coefficient of determination is

$$R^2 = \frac{\hat\beta_0\sum y_i + \hat\beta_1\sum x_i y_i - n\bar{y}^2}{\sum y_i^2 - n\bar{y}^2} = \frac{35.95(670.1) + 0.465(44842.4) - 10(67.01)^2}{44941.9 - 10(67.01)^2} = 0.961$$

This means that 96.1% of the variation in y can be explained by variation in x and only 3.9% of the variation in y is explained by other factors.

Exercise 5.5
(a) There is no particular pattern in this plot. We found that the model has
random error with constant variance. Hence, there is no violation from the
linear model assumptions.

(b) The plot shows a curved pattern, so the relationship is non-linear and the linear model is not appropriate.

(c) The histogram shows a normal shape. Hence random error is distributed as
normal. Thus, there is no violation from the linear model assumptions.

Exercise 5.6
(a)

The plot shows the regression model is in reciprocal function form. Hence,
transformation is x* =1/x and the linear regression model is y = 2.67 –
0.68x*.

(b)

Plot shows the regression model is in exponential form. Hence,


transformation is y* = ln y and the linear regression model is y*= ln 2 +
3.1x.


(c)

Plot shows the regression model is in power function form. Hence,


transformation is y* = log y and x* = log x and the linear regression model is
y* = log 1.5 + 0.85x*.

(d)

Plot shows the regression model is in hyperbolic function form. Hence,


transformation is y* = 1/y and x* = 1/x and the linear regression model is
y* = 0.4x* + 2.

Exercise 5.7
Refer to Exercise 5.2, where the simple linear regression model is
$\hat{y} = 35.95 + 0.465x$

When $x_g = 86$, $\hat{y} = 35.95 + 0.465(86) = 75.94$.

To get the standard error of the estimate, we need the fitted value for each observation. Using the regression model $\hat{y} = 35.95 + 0.465x$, the fitted value $\hat{y}$ for each x value is shown in the table below, together with the observed y value:

x     60       62       64       65        66       67        68       70      72       74
y     63.6     65.2     66.0     65.5      66.9     67.1      67.4     68.3    70.1     70.0
ŷ     63.85    64.78    65.71    66.175    66.64    67.105    67.57    68.5    69.43    70.36

Hence,

$$s_\varepsilon = \sqrt{\frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{n-2}} = \sqrt{\frac{1.494}{8}} = 0.432$$

The 99% level gives α = 0.01 and $t_{\alpha/2} = t_{0.005} = 3.355$. Hence, for $x_g = 86$, the prediction interval is

$$\hat{y} \pm t_{\alpha/2}\, s_\varepsilon \sqrt{1 + \frac{1}{n} + \frac{(x_g - \bar{x})^2}{\sum (x_i - \bar{x})^2}}$$

$$75.94 \pm (3.355)(0.432)\sqrt{1 + \frac{1}{10} + \frac{(86 - 66.8)^2}{171.6}}$$

$$75.94 \pm 2.61$$

It is found that the lower and upper limits of the 99% prediction interval are 73.33 and 78.55 respectively. This means that the predicted y value is 73.33 units at the minimum and 78.55 units at the maximum for x = 86.


Exercise 5.8
The values of $t_{\alpha/2}$ and $s_\varepsilon$ can be obtained from Exercise 5.7. When $x_g = 69$, $\hat{y} = 35.95 + 0.465(69) = 68.04$. Hence, the interval for the expected value of y for $x_g = 69$ is:

$$\hat{y} \pm t_{\alpha/2}\, s_\varepsilon \sqrt{\frac{1}{n} + \frac{(x_g - \bar{x})^2}{\sum (x_i - \bar{x})^2}}$$

$$68.04 \pm (3.355)(0.432)\sqrt{\frac{1}{10} + \frac{(69 - 66.8)^2}{171.6}}$$

$$68.04 \pm 0.52$$

It is found that the lower and upper limits of the confidence interval are 67.52 and 68.56 respectively. This means that the expected y value is 67.52 units at the minimum and 68.56 units at the maximum for x = 69.
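Exercises 5.7 and 5.8 differ only in the leading 1 under the square root: it is present for a single new observation and absent for the expected value of y. The sketch below makes that explicit. It assumes the rounded coefficients 35.95 and 0.465 and the tabulated value t = 3.355 quoted in these answers; the function name `interval` and its `new_observation` flag are hypothetical.

```python
from math import sqrt

x = [60, 62, 64, 65, 66, 67, 68, 70, 72, 74]
y = [63.6, 65.2, 66.0, 65.5, 66.9, 67.1, 67.4, 68.3, 70.1, 70.0]
B0, B1 = 35.95, 0.465   # rounded coefficients from Exercise 5.2
T_CRIT = 3.355          # t value for 8 degrees of freedom, read from a table


def interval(x_new, new_observation):
    """Return (lower, upper) limits for y at x_new.

    new_observation=True gives a prediction interval for one new y value;
    new_observation=False gives a confidence interval for the mean of y.
    """
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sse = sum((yi - (B0 + B1 * xi)) ** 2 for xi, yi in zip(x, y))
    s = sqrt(sse / (n - 2))
    extra = 1.0 if new_observation else 0.0
    half_width = T_CRIT * s * sqrt(extra + 1 / n + (x_new - x_bar) ** 2 / sxx)
    centre = B0 + B1 * x_new
    return centre - half_width, centre + half_width


pred_low, pred_high = interval(86, new_observation=True)
mean_low, mean_high = interval(69, new_observation=False)
print(round(pred_low, 2), round(pred_high, 2))
print(round(mean_low, 2), round(mean_high, 2))
```

The interval for the mean is much narrower because it does not include the variability of an individual observation, and because x = 69 lies close to the sample mean of x.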


TOPIC 6: MULTIPLE REGRESSION

Exercise 6.1

(a) For i = 4: $y_4 = 29$; $\hat{y}_4 = 9.430 + 5.266(2.4) + 2.0612(4) = 30.3132 \approx 30.31$, and the error $\varepsilon_4 = y_4 - \hat{y}_4 = 29.0 - 30.31 = -1.31$, an "over-estimate".

For i = 9: $y_9 = 34$; $\hat{y}_9 = 9.430 + 5.266(3) + 2.0612(4) = 33.4728 \approx 33.47$, and the error $\varepsilon_9 = y_9 - \hat{y}_9 = 34 - 33.47 = 0.53$, an "under-estimate".

(b) $\hat{y} = 9.430 + 5.266(4) + 2.0612(5) = 40.8$

Exercise 6.2
1.


2. Data a)

y     x1    x2
10    2     5
24    3     6
40    7     6
20    3     5
15    4     3

SUMMARY OUTPUT

Regression Statistics
Multiple R            0.999906
R Square              0.999812
Adjusted R Square     0.999435
Standard Error        0.25713
Observations          4

ANOVA
              Df    SS          MS          F           Significance F
Regression    2     350.6839    175.3419    2652.047    0.013729
Residual      1     0.066116    0.066116
Total         3     350.75

             Coefficients   Standard Error   t Stat      P-value     Lower 95%   Upper 95%   Lower 95.0%   Upper 95.0%
Intercept    -13.8099       0.579456         -23.8325    0.026697    -21.1726    -6.44723    -21.1726      -6.44723
2            3.958678       0.080975         48.88773    0.01302     2.929794    4.987561    2.929794      4.987561
5            4.347107       0.108387         40.10712    0.01587     2.969915    5.7243      2.969915      5.7243

RESIDUAL OUTPUT                                 PROBABILITY OUTPUT
Observation    Predicted 10    Residuals        Percentile    10
1              24.14876        -0.14876         12.5          15
2              39.98347        0.016529         37.5          20
3              19.80165        0.198347         62.5          24
4              15.06612        -0.06612         87.5          40

NOTE: the differences in values for answers in Q.1 & 2(a)

Data b)

y     x1    x2
10    2     2
25    4     6
30    4     8
20    3     6
15    4     3

SUMMARY OUTPUT

Regression Statistics
Multiple R            0.997365
R Square              0.994737
Adjusted R Square     0.984211
Standard Error        0.811107
Observations          4

ANOVA
              Df    SS          MS          F       Significance F
Regression    2     124.3421    62.17105    94.5    0.072548
Residual      1     0.657895    0.657895
Total         3     125

             Coefficients   Standard Error   t Stat      P-value     Lower 95%   Upper 95%   Lower 95.0%   Upper 95.0%
Intercept    -11.1842       3.87879          -2.88343    0.212523    -60.4689    38.10049    -60.4689      38.10049
2            4.342105       0.939662         4.620924    0.135677    -7.59743    16.28164    -7.59743      16.28164
2            3.026316       0.227901         13.27906    0.047851    0.130554    5.922078    0.130554      5.922078

RESIDUAL OUTPUT                                 PROBABILITY OUTPUT
Observation    Predicted 10    Residuals        Percentile    10
1              24.34211        0.657895         12.5          15
2              30.39474        -0.39474         37.5          20
3              20              0                62.5          25
4              15.26316        -0.26316         87.5          30

3. You should be able to answer this question by referring to the module.


TOPIC 7: INTRODUCTION TO NON-PARAMETRIC CONCEPTS

Exercise 7.1
1.
Nominal Measurement: Data which uses numbers, symbols or codes to classify objects, individuals or characteristics is known as nominal scale.

Ordinal Measurement: If an object in a category or class has a relationship or connection such as ‘more than’ or ‘less than’ with other classes, then the data is ordinal.

2. From the data there are 4 respondents who chose rank 1, 7 respondents who chose rank 2, none who chose rank 3, 1 who chose rank 4 and 8 who chose rank 5. Hence,

(a) Mean = Data Average = [4(1) + 7(2) + 0(3) + 1(4) + 8(5)]/20 = 62/20 = 3.1
Median = middle of 20 data = mean of the 10th and 11th data.
The 10th data = 2 and the 11th data = 2. Hence, median = 2.
Mode = data with the highest frequency = 5

The three measures of location gave different values. Judging by the mean, the 20 users were on average unsure of the soap quality; judging by the median, half of them (50%) were unsatisfied with the soap quality; while the mode shows that the largest group of users felt very satisfied with the quality of the bath soap.

(b) The parametric statistical method is not suitable for use since it uses
mean value to represent the data location, and the mean quantity does
not reflect the actual data in this data analysis.
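The three measures of location in (a) can be checked with Python's standard `statistics` module. The list below is rebuilt from the frequencies implied by the mean calculation (4, 7, 0, 1 and 8 respondents for ranks 1 to 5); the variable names are illustrative only.

```python
from statistics import mean, median, mode

# 4 respondents chose 1, 7 chose 2, none chose 3, 1 chose 4, 8 chose 5.
ratings = [1] * 4 + [2] * 7 + [4] * 1 + [5] * 8

rating_mean = mean(ratings)
rating_median = median(ratings)
rating_mode = mode(ratings)

print(len(ratings), rating_mean, rating_median, rating_mode)
```

The spread between the three values illustrates why the mean alone is a poor summary of ordinal data such as these.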


TOPIC 8: NON-PARAMETRIC TEST FOR RANDOMNESS

Exercise 8.1
1. $n_1 = 6$, $n_2 = 5$. To get $\Pr(R \ge 8)$, calculate

$$f(8) = \frac{2\binom{5}{3}\binom{4}{3}}{\binom{11}{6}} = \frac{2 \cdot 10 \cdot 4}{462} = \frac{80}{462}$$

$$f(9) = \frac{\binom{5}{4}\binom{4}{3} + \binom{5}{3}\binom{4}{4}}{\binom{11}{6}} = \frac{(5)(4) + (10)(1)}{462} = \frac{30}{462}$$

$$f(10) = \frac{2\binom{5}{4}\binom{4}{4}}{\binom{11}{6}} = \frac{2 \cdot 5 \cdot 1}{462} = \frac{10}{462}, \qquad f(11) = \frac{\binom{5}{5}\binom{4}{4}}{\binom{11}{6}} = \frac{1}{462}$$

Hence
$$f(8) + f(9) + f(10) + f(11) = \frac{80 + 30 + 10 + 1}{462} = \frac{121}{462} = 0.2619$$
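The run probabilities above follow a fixed pattern for even and odd numbers of runs, which the sketch below encodes with `math.comb` and exact fractions. The function name `runs_pmf` is hypothetical; the formula is the standard exact distribution of the number of runs for two kinds of symbols.

```python
from fractions import Fraction
from math import comb


def runs_pmf(r, n1, n2):
    """Exact probability of observing r runs with n1 and n2 symbols."""
    if r < 2:
        return Fraction(0)
    total = comb(n1 + n2, n1)
    k, odd = divmod(r, 2)
    if not odd:
        # r = 2k: each symbol type forms exactly k runs.
        ways = 2 * comb(n1 - 1, k - 1) * comb(n2 - 1, k - 1)
    else:
        # r = 2k + 1: one symbol type forms k + 1 runs, the other k runs.
        ways = (comb(n1 - 1, k) * comb(n2 - 1, k - 1)
                + comb(n1 - 1, k - 1) * comb(n2 - 1, k))
    return Fraction(ways, total)


tail = sum(runs_pmf(r, 6, 5) for r in range(8, 12))
print(tail, float(tail))
```

Using `Fraction` keeps the arithmetic exact, so the result can be compared directly with 121/462.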

2. To determine whether babies’ gender occurs at random or not, test

H0 : The birth of male and female babies is random versus
H1 : Babies’ birth according to gender is not random
α = 0.05

Data obtained is:   BB   G   B   GGGGGG   BB   G   BBBBB
Hence, the run:      1   2   3      4      5   6     7
n1 = number of ‘B’ = 10, n2 = number of ‘G’ = 8, R = 7

From Table 1, the rejection region is R ≥ 15 or R ≤ 5. Since the total number of runs, R = 7, falls between the critical values 5 and 15, the null hypothesis cannot be rejected. In conclusion, there is no evidence at the 0.05 significance level that the births of male and female babies do not occur at random.


Exercise 8.2
1. (a) n1 = 7, n2 = 4, R = 7

$$\mu_R = \frac{2(7)(4)}{7+4} + 1 = 6.09, \qquad \sigma_R = \sqrt{\frac{2(7)(4)[2(7)(4) - 7 - 4]}{(7+4)^2(7+4-1)}} = 1.44$$

(b) n1 = 11, n2 = 8, R = 13

$$\mu_R = \frac{2(11)(8)}{11+8} + 1 = 10.26, \qquad \sigma_R = \sqrt{\frac{2(11)(8)[2(11)(8) - 11 - 8]}{(11+8)^2(11+8-1)}} = 2.06$$

2. To determine whether the sequence of passengers queuing to purchase bus tickets is random, test
H0 : Sequence is random versus H1 : Sequence is not random

From the data, the number of M = number of male passengers = n1 = 30, while the number of F = number of female passengers = n2 = 18. The data arrangement gives a run total of 27. Next, using the test for large samples, the mean and standard deviation values are:

$$\mu_R = \frac{2(30)(18)}{30+18} + 1 = 23.5, \qquad \sigma_R = \sqrt{\frac{2(30)(18)[2(30)(18) - 30 - 18]}{(30+18)^2(30+18-1)}} = 3.21$$

Hence, the test statistic z = (27 – 23.5)/3.21 = 1.09. Since z = 1.09 falls in the interval between –1.96 and +1.96, the null hypothesis cannot be rejected. There is not enough evidence to say that the sequence of passengers’ queuing is not random at the α = 5% level.
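The large-sample runs test uses the same mean and standard deviation formulas each time, so a small helper avoids arithmetic slips. This is a sketch only; `runs_test_z` is a hypothetical name, and the critical values ±1.96 still come from the standard normal table.

```python
from math import sqrt


def runs_test_z(runs, n1, n2):
    """Return (mean, standard deviation, z) for the large-sample runs test."""
    n = n1 + n2
    mean = 2 * n1 * n2 / n + 1
    sd = sqrt(2 * n1 * n2 * (2 * n1 * n2 - n) / (n ** 2 * (n - 1)))
    return mean, sd, (runs - mean) / sd


# Bus ticket queue: 30 male and 18 female passengers, 27 runs.
mu, sigma, z = runs_test_z(27, 30, 18)
print(round(mu, 2), round(sigma, 2), round(z, 2))
```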


Exercise 8.3
1. To determine whether students’ absenteeism occurs at random or not, test
H0 : The number of students who are absent is random versus
H1 : The number of students who are absent is not random
α = 0.01
Median = 31.5
Let n1 = the number of absenteeism > 30.5 = 11
And n2 = the number of absenteeism < 30.5 = 11

By assigning ‘-’ and ‘+’ signs, the following sequence can be obtained:
- - - - + - + - - + + - + - + - - - + - + -

Hence, the total number of runs, R = 15

$$\mu_R = \frac{2(11)(11)}{11+11} + 1 = 12, \qquad \sigma_R = \sqrt{\frac{2(11)(11)[2(11)(11) - 11 - 11]}{(11+11)^2(11+11-1)}} = 2.2887$$

Thus, $z = \dfrac{15 - 12}{2.2887} = 1.311$

Since z = 1.311 falls between –2.575 and +2.575, we do not reject the null hypothesis. There is not enough evidence to say that the number of students absent from school over the 22 consecutive days is not random at α = 0.01.

2. To determine whether the fluctuations in thickness of silver plating from one tray to another are random, test
H0 : The fluctuations in thickness are random versus
H1 : The fluctuations in thickness are non-random
α = 0.05

Median = 0.021, and the ‘+’, ‘–’ and ‘0’ arrangement generates the following sequence: – 0 – – – – + 0 + + + +, with n1 = 5, n2 = 5, R = 2.

From Table 1 Appendix (b), the critical region for the run with n1 = n2 = 5 is R ≤ 2 or R ≥ 10. Since R = 2 is located inside the rejection region, we reject the null hypothesis. The sample is not random at the 0.05 significance level.


TOPIC 9: NON-PARAMETRIC HYPOTHESIS TEST FOR SINGLE POPULATION

Exercise 9.1
1. To test whether the population median exceeds 160, test:
H0 : median = 160   H1 : median > 160   α = 0.05

A one-sided (right) test provides the test statistic value, S = number of observed ‘+’ signs.

Replacing values greater than 160 with a ‘+’ sign and values less than 160 with a ‘–’ sign, we will get,
+ + + + + – – + + + – + + – + + + + +

Observe that n = 20 – 1 = 19 since there is one observation with zero difference, that is, 160. The S obtained is 15.

Hence, $Z = \dfrac{S - E(S)}{\sqrt{Var(S)}} = \dfrac{S - 0.5n}{0.5\sqrt{n}} = \dfrac{15 - (0.5)(19)}{0.5\sqrt{19}} = 2.52$

From the standard normal table,

Pr(S ≥ 15) = Pr(Z ≥ 2.52) = 1 – 0.99413 = 0.00587

Since p-value = 0.00587 < α = 0.05, the null hypothesis is rejected. In conclusion, the median strength of fabric is > 160 kg.

2. It is known that both S1 and S2 are distributed as binomial with sample size n and π = 0.5. For a binomial variable X with n and π = 0.5, Pr(X ≥ a) = Pr(X ≤ n – a) since the distribution is symmetrical. Hence, Pr(S1 ≥ c) = Pr(S2 ≤ n – c).


Exercise 9.2
1. It is known that $T^+$ and $T^-$ are the sums of the ranks of the differences with positive and negative signs respectively. Hence, the sum of the ranks without taking the ‘+’ or ‘–’ signs into consideration is the sum of all possible ranks, that is
(rank) 1 + (rank) 2 + . . . + largest rank = $T^+ + T^-$
1 + 2 + 3 + . . . + n = $T^+ + T^-$. Thus, n(n + 1)/2 = $T^+ + T^-$.

2. To determine whether the median content is 98.5 or not, test

H0 : median = 98.5 against H1 : median ≠ 98.5, with α = 0.05.

Using the signed-rank test, we obtained the following table:

yi       yi – 98.5    Rank        yi      yi – 98.5    Rank
97.5     -1.0         (-)4        93.2    -5.3         (-)14
95.2     -3.3         (-)12       99.1    +0.6         (+)2
97.3     -1.2         (-)6        96.1    -2.4         (-)9
96.0     -2.5         (-)10       97.6    -0.9         (-)3
96.8     -1.7         (-)7        98.2    -0.3         (-)1
100.3    +1.8         (+)8        98.5    0            0
97.4     -1.1         (-)5        94.9    -3.6         (-)13
95.3     -3.2         (-)11

The signed-rank test for small sample size is performed since n = 14. From the calculation above, the sum of the ranks of the positive differences is $T^+$ = 8 + 2 = 10, which is the smaller of the two rank sums. From Table 4, Appendix B, with n = 14, the critical value $T_{0.01}$ = 13. Since $T^+$ = 10 < $T_{0.01}$ = 13, we reject H0 and conclude that the median hydrocarbon content is not 98.5.

3. Test to determine whether pipes produced by the supplier’s company satisfy the specifications given

H0 : median = 2500 against H1 : median > 2500

Using the Sign Test:

Test Statistic, S = number of observations greater than 2500 = 5. Hence, S is distributed as binomial (n = 7, π = 0.5). From the binomial table,
Pr(S ≥ 5) = 1 – Pr(S ≤ 4) = 1 – 0.7734 = 0.2266

We reject H0 if α > p-value. Since α = 0.10 is not > 0.2266, H0 cannot be rejected. There is not enough evidence at α = 0.10 that the pipes produced by the supplier’s company satisfy the specification.

yi      yi – 2500    Rank (yi – 2500)
2610    +110         (+) 5
2750    +250         (+) 7
2420    –80          (–) 4
2510    +10          (+) 1.5
2540    +40          (+) 3
2490    –10          (–) 1.5

4.
Rank Rank
yi yi - 1 yi yi - 1
( yi – 1) ( yi – 1)
0.045 -0.955 (-) 23 1.894 +0.894 (+) 20
0.258 -0.742 (-) 14 0.088 -0.912 (-) 22
0.412 -0.588 (-) 10 0.579 -0.421 (-) 4
0.036 -0.964 (-) 24 0.445 -0.555 (-) 9
1.055 +0.55 (+) 8 0.379 -0.621 (-) 12
1.070 +0.070 (+) 1 0.242 -0.758 (-) 15
0.361 -0.906 (-) 21 1.267 +0.267 (+) 3
0.394 -0.606 (-) 11 0.136 -0.864 (-) 18.5
0.136 -0.864 (-) 18.5 1.639 +0.639 (+) 13
0.506 -0.494 (-) 6 0.567 -0.433 (-) 5
0.209 -0.791 (-) 16 0.336 -0.664 (-) 14
8.788 +7.788 (+) 25 0.912 -0.088 (-) 2
0.182 -0.818 (-) 17

The sum of the ranks of the differences with ‘–’ sign is $T^-$ = 262. Hence, the test statistic:

$$Z = \frac{T^- - \frac{n(n+1)}{4}}{\sqrt{\frac{n(n+1)(2n+1)}{24}}} = \frac{262 - \frac{25(26)}{4}}{\sqrt{\frac{25(26)(51)}{24}}} = \frac{262 - 162.5}{37.1652} = 2.6772$$

Since Z = 2.6772 is not < $Z_\alpha$ = 2.33, we cannot reject H0. The same decision can be obtained from the sign test.


TOPIC 10: NON-PARAMETRIC HYPOTHESIS TEST FOR TWO POPULATIONS

Exercise 10.1
1. (a) n = total positive and negative signs = 15 + 5 = 20 (0 or tie is not counted). For a one-sided (right) test, S = number of positive signs = 15.

(b) p-value = Pr(S ≥ 15) = 1 – Pr(S ≤ 14) = 1 – 0.9793 = 0.0207. At the 0.05 level, reject H0 since 0.0207 < 0.05.

2. To compare the effectiveness of the traffic control system, test

H0 : the number of accidents before and after installation is the same
H1 : the number of accidents before the installation of the traffic control system is higher compared to after installation of the system.
α = 0.05

Sign test:

The test statistic, S = number of ‘+’ signs. By replacing positive differences with a ‘+’ sign and negative differences with a ‘–’ sign, you will get: + + + + + + – + – + + +. With this, n = 12 and S = 10. Using the binomial distribution table with π = ½,
Pr(S ≥ 10) = 1 – Pr(S ≤ 9) = 1 – 0.9807 = 0.0193

Since p-value = 0.0193 < 0.05 = α, reject H0. In conclusion, the new traffic control system is more effective in reducing the number of accidents at dangerous junctions at the 0.05 significance level.
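The binomial tail probabilities used in the sign tests can be computed exactly instead of being read from a table. The sketch below assumes, as in the answers above, that under H0 the number of ‘+’ signs is binomial with success probability ½; `sign_test_upper_p` is a hypothetical helper name.

```python
from math import comb


def sign_test_upper_p(successes, n):
    """P(S >= successes) when S follows a binomial(n, 1/2) distribution."""
    return sum(comb(n, k) for k in range(successes, n + 1)) / 2 ** n


# Traffic control system: 10 '+' signs out of 12 non-zero differences.
p_traffic = sign_test_upper_p(10, 12)
print(round(p_traffic, 4))
```

The same helper reproduces the other one-sided sign-test p-values in this section, for example `sign_test_upper_p(15, 20)` for Exercise 10.1, question 1.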

Exercise 10.2
1. (a) w2 = [(8)(9)/2] – 8 = 28    Y2 = 15 + [(3)(4)/2] – 8 = 13
Y1 = 15 + [(5)(6)/2] – 28 = 2
For a one-sided (right) test, the test statistic is Y1 = 2. With n1 = 3 and n2 = 5, Y0.05 = 1 (refer to Table 5). Since the test statistic Y1 = 2 > critical value Y0.05 = 1, do not reject H0 at the 5% significance level.

(b) w1 = [(10)(11)/2] – 17 = 38    Y1 = 24 + [(6)(7)/2] – 38 = 7
Y2 = 24 + [(4)(5)/2] – 17 = 17

For a two-sided test, the test statistic is minimum(Y1, Y2) = 7. When n1 = 6 and n2 = 4, Y0.10 = 3. Since the test statistic 7 > critical value Y0.10 = 3, do not reject H0 at the 10% significance level. With Y0.02 = 1 and Y0.05 = 2, the decision is also do not reject H0 (as expected, since even at the 10% level H0 cannot be rejected).

2. To determine whether both groups of students differ in terms of score, test

H0 : Students from both groups obtain similar scores
H1 : Students from the video programme and solving real problems group obtain higher scores
α = 0.01

From the data, n1 = 7, n2 = 10. Rank assignment resulted in,

Group 1:   2.5   11   2.5   13.5   6.5   1   4.5
Group 2:   8   12   15.5   17   9.5   15.5   6.5   9.5   13.5   4.5

w1 = 2.5 + 11 + 2.5 + 13.5 + 6.5 + 1 + 4.5 = 41.5, hence w2 = (17)(18)/2 – 41.5 = 111.5

Test Statistic, Y2 = (7)(10) + (10)(11)/2 – 111.5 = 13.5

From the table, the critical value at α = 0.01 for a one-sided test with n1 = 7 and n2 = 10 is 11. Since 13.5 is not < 11, we do not reject the null hypothesis. In conclusion, there is no significant difference in the scores of the two groups at the 0.01 significance level.


3. To test the claim that the campaign is successful, test:

H0 : The probability distribution of the number of accidents before and after the campaign is equal, D1 = D2 versus

H1 : The probability distribution of the number of accidents after the campaign has shifted to the left of the probability distribution of the number of accidents before the campaign, D1 > D2

(a) Using sign test:


Before After Difference
Factory
Campaign Campaign Sign
1 3 2 +
2 4 1 +
3 6 3 +
4 3 5 -
5 4 4 0
6 5 2 +
7 5 3 +
8 3 3 0
9 2 0 +
10 4 3 +
11 4 1 +
12 5 2 +

Test statistic S = number of ‘+’ signs = 9. S ~ binomial (10, ½). The p-value is Pr(S ≥ 9) = 1 – Pr(S ≤ 8) = 0.0107. Since 0.0107 < 0.05 we reject H0. In conclusion, the claim that the campaign is effective in reducing the number of accidents is supported at the 5% level.


(b) Using the signed-rank test:


Factory Before After Difference Rank |Difference |
1 3 2 +1 +1.5
2 4 1 +3 +8
3 6 3 +3 +8
4 3 5 -2 -4
5 4 4 0 Discard
6 5 2 +3 +8
7 5 3 +2 +4
8 3 3 0 Discard
9 2 0 +2 +4
10 4 3 +1 +1.5
11 4 1 +3 +8
12 5 2 +3 +8

From the table, we get $T^+$ = 51 while $T^-$ = 4.

For a one-sided (right) test, the test statistic is $T^-$ = 4. From Table 4 Appendix B, the critical value = 14 for n = 12. Since $T^-$ = 4 ≤ 14, the decision is to reject H0. The claim that the campaign is effective in reducing road accidents is supported at the 0.05 level.
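The rank sums in (b) can be reproduced from the before and after counts. The sketch below follows the steps used in the table: zero differences are discarded, absolute differences are ranked from smallest to largest, and tied values share the average rank. The function name `signed_rank_sums` is hypothetical.

```python
def signed_rank_sums(before, after):
    """Return (T_plus, T_minus) for paired data, discarding zero differences."""
    diffs = [b - a for b, a in zip(before, after) if b != a]
    ordered = sorted(abs(d) for d in diffs)

    def average_rank(value):
        first = ordered.index(value) + 1          # first 1-based position
        return first + (ordered.count(value) - 1) / 2

    t_plus = sum(average_rank(abs(d)) for d in diffs if d > 0)
    t_minus = sum(average_rank(abs(d)) for d in diffs if d < 0)
    return t_plus, t_minus


# Number of accidents in the 12 factories before and after the campaign.
before = [3, 4, 6, 3, 4, 5, 5, 3, 2, 4, 4, 5]
after = [2, 1, 3, 5, 4, 2, 3, 3, 0, 3, 1, 2]

t_plus, t_minus = signed_rank_sums(before, after)
print(t_plus, t_minus)
```

As a consistency check, the two sums add up to m(m + 1)/2, where m is the number of non-zero differences.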

4. Let D1 be the distribution of time duration for problem-solving taken by the control group, while D2 is the distribution of time duration taken by the experimental group.

To determine whether alcohol influences individuals’ thinking ability, test:

H0 : Time taken to solve the problem is the same for both groups or D1 = D2 versus
H1 : Subjects taking alcohol take a longer time to solve the problem or D1 < D2
α = 0.05


By combining all observations, assign ranks according to the data sequence in ascending order. Then, calculate the sum of ranks obtained for the respective groups. The following table is obtained.

Control Group    Rank      Experiment Group    Rank
63               10        78                  18
57               8         77                  17
44               4         75                  16
70               13        74                  15
50               5         80                  19
42               2         55                  6
64               11        62                  9
56               7         72                  14
41               1         66                  12
                           43                  3
Total            61        Total               129

From the table, n1 = 9 and n2 = 10, w1 = 61 and w2 = 129.

Thus the test statistic is Y1 = (9)(10) + (10)(11)/2 – 129 = 16.

From Table 5 with n1 = 9 and n2 = 10, the critical value for α = 0.05 is 24. Since 16 < 24, reject the null hypothesis. Alcohol does have an effect on individuals’ thinking ability at the 0.05 level.


Glossary
 – The significance level of a hypothesis test that denotes
the probability of rejecting a null hypothesis when it
actually is true

 – The probability of not rejecting a null hypothesis


when it is actually false

Alternative – A claim about a population parameter that will be true


Hypothesis if the null hypothesis is false

Analysis of – a statistical technique used to test whether the means


Variance (ANOVA) of three or more populations are equal

Chi-Square – A distribution, with degrees of freedom (dof) as the


distribution only parameter, that is skewed to the right for small
dof and looks like a normal curve for large dof

Coefficient of – A measure that gives the proportion (or percentage) of


Determination the total variation in a dependent variable that is
explained by a given independent variable

Critical Value – One or two values that divide the whole region under
the sampling distribution of a sample statistic into
rejection and non-rejection region

Dependent variable – The variable to be predicted or explained

Estimated or – The value of the dependent variable that is calculated


predicted value of y for a given value of x using the estimated regression
model

Expected – The frequencies for different categories of a


frequencies multinomial experiment or for different cells of a
contingency table that are expected to occur when a
given null hypothesis is true

F distribution – A continuous distribution that has two parameters -


dof for the numerator and dof for the denominator


Goodness-of-fit test – A test of the null hypothesis that the observed
frequencies for an experiment follow a certain pattern or theoretical
distribution

Independent variable – The variable included in a model to explain the
variation in the dependent variable

Least squares method – The method used to fit a regression line through a
scatter diagram such that the error sum of squares is minimum

Linear correlation coefficient – A measure of the strength of the linear
relationship between two variables

Mean Square within samples, MSE – A measure of the variation within the data
of all samples taken from different populations

Mean Square between samples, MS(Tr) – A measure of the variation among the
means of samples taken from different populations

Multinomial Experiment – An experiment with n trials for which (1) the trials
are identical, (2) there are more than two possible outcomes per trial,
(3) the trials are independent, and (4) the probabilities of the various
outcomes remain constant for each trial

Multiple linear regression – A regression model with one dependent and two or
more independent variables that assumes a straight-line relationship

Non-parametric test – A test that makes minimal assumptions about the
distribution of the data or about certain parameters of a statistical model.
Non-parametric tests for ordinal or continuous variables are typically based
on the ranks of the data values.

Null hypothesis – A claim about a population parameter that is assumed to be
true until proven otherwise

Observed frequencies – The frequencies actually obtained from the performance
of an experiment


One-tailed test – A test in which there is only one rejection region, either
in the left or right tail of the distribution curve

One-Way ANOVA – The analysis of variance technique that analyses one variable
(factor) only

Prediction interval – The confidence interval for a particular value of y for
a given value of x

Runs Test – A test used in studies where measurements are made according to
some well-defined ordering, either in time or space. A frequent question is
whether the selected measure of location (usually the median of the
measurements) differs at different points in the sequence.

Scatter diagram – A plot of the paired observations of x and y

Sign Test – A test of a hypothesis about the location of a population
distribution. It is most often used to test a hypothesis about a population
median, and often involves the use of matched pairs (for example, before and
after data), in which case it tests for a median difference of zero

Simple linear regression – A regression model with one dependent variable and
one independent variable that assumes a straight-line relationship

SS(Tr) – The sum of squares of the factor or treatment. Also called the sum
of squares between samples

SSE – The sum of squares of errors. Also called the sum of squares within
samples

SST – The total sum of squares given by the sum of SS(Tr) and SSE

Test of homogeneity – A test of the null hypothesis that the proportions of
elements that belong to different groups in two (or more) populations are
similar


Test of Independence – A test of the null hypothesis that two attributes of a
population are not related

Test Statistic – The value of z or t calculated for a sample statistic

Wilcoxon Mann-Whitney Test – One of the most powerful of the non-parametric
tests for comparing two populations. It is used to test the null hypothesis
that two populations have identical distribution functions against the
alternative hypothesis that the two distribution functions differ only with
respect to location (median), if at all

Wilcoxon Signed Ranks Test – Designed to test a hypothesis about the location
(median) of a population distribution. It often involves the use of matched
pairs, for example, before and after data, in which case it tests for a
median difference of zero



MODULE FEEDBACK
MAKLUM BALAS MODUL

If you have any comment or feedback, you are welcome to:

1. E-mail your comment or feedback to modulefeedback@oum.edu.my

OR

2. Fill in the Print Module online evaluation form available on myINSPIRE.

Thank you.

Centre for Instructional Design and Technology
(Pusat Reka Bentuk Pengajaran dan Teknologi)
Tel No.: 03-27732578
Fax No.: 03-26978702
