Data Analysis Text Book
SBST3103
Introductory Data Analysis
Copyright © Open University Malaysia (OUM)
Table of Contents
Course Guide
Topic 4 Correlation
4.1 Two-Way Scatter Plot
4.2 Pearson Correlation Coefficient
4.2.1 Pearson Correlation Coefficient Significance Test
4.3 Spearman Rank Correlation Coefficient
4.3.1 Spearman Rank Correlation Coefficient Significance Test
Summary
Answers
Glossary
INTRODUCTION
SBST3103 Introductory Data Analysis is one of the courses offered by the
Faculty of Science and Technology, Open University Malaysia (OUM).
Similar to other courses offered by the Faculty of Science and Technology, this
three-credit-hour course is conducted over 15 weeks and is offered in the
January, May and September semesters.
COURSE AUDIENCE
This is a core course for students pursuing the Bachelor of Education (Mathematics)
(Honours) at OUM.
STUDY SCHEDULE
It is a standard OUM practice that learners accumulate 40 study hours for every
credit hour. As such, for a three-credit hour course, you are expected to spend
120 study hours. Table 1 gives an estimation of how the 120 study hours could be
accumulated.
COURSE OUTCOMES
By the end of this module, you should be able to:
1. Explain the One-Way Analysis of Variance concepts;
2. Explain the regression and correlation concepts;
3. Describe the simple and multiple linear regression concepts; and
4. Describe the non-parametric methodologies concepts.
COURSE SYNOPSIS
Topic 1 introduces you to the Chi-Square and F distributions, for which a good
understanding of sampling distributions and hypothesis testing is necessary. This
topic relies heavily on both distributions, so you must
master them, including their standard tables.
Topic 2 takes a look at mean comparisons for more than two populations using
ANOVA, which follows on from mean comparison testing involving one or
two populations. Variance partitioning concepts are introduced in this topic.
Topic 6 introduces you to multiple regression for cases involving more than one
independent variable.
Learning Outcomes: This section refers to what you should achieve after you
have completely gone through a topic. As you go through each topic, you should
frequently refer back to these learning outcomes. By doing
this, you can continuously gauge your understanding of the topic.
Summary: You can find this component at the end of each topic. This component
helps you to recap the whole topic. By going through the summary, you should be
able to gauge your knowledge retention level. Should you find points in the
summary that you do not fully understand, it would be a good idea for you to
revisit the details in the module.
Key Terms: This component can be found at the end of each topic. You should
go through this component to remind yourself of important terms or jargon
used throughout the module. Should you find terms here that you are not able to
explain, you should look for the terms in the module.
PRIOR KNOWLEDGE
Students taking this course are required to have prior knowledge in courses
SBST1103 and SBST2103.
ASSESSMENT METHOD
Please refer to myINSPIRE.
REFERENCES
Dielman, T. E. (2004). Applied regression analysis: A second course in business
and economic statistics (4th ed.). Texas: Thomson Brooks/Cole.
Walpole, R. E., Myers, R. H., Myers, S. L., & Ye, K. (2006). Probability and
statistics for engineers and scientists (8th ed.). Prentice-Hall.
INTRODUCTION
Previously, we have been exposed to several methods of testing and estimating a
population mean. Similarly, we may be interested in making inferences on
changes in a population; hence, the correct parameter to use in this case is the
population variance, $\sigma^2$. There are several reasons why it is important to test
hypotheses concerning the variances of populations. Inferences on variance can be
applied in daily life. For example, a quality control engineer needs to monitor the
consistency of products manufactured by the factory as this ensures that the
products are meeting the required specifications. One of the methods for
consistency checking is by calculating the variance of size, weight or volume of
the product. If the variation in these measurements is large, this means that more
products will be outside the specification limits. In the financial area, investors
use the variations in the returns from their portfolios stocks, bonds or any type of
SELF-CHECK 1.1
Prior to this, the Central Limit Theorem and various theories explaining
the properties of the sample mean $\bar{x}$ have been discussed. Can you recall
what the properties are?
The values of the sample mean $\bar{x}$ and sample variance $s^2$ differ from one sample to
the other. It is also important to know the properties of the sample variance. The
following are several theorems explaining the properties of the sample variance $s^2$.
Theorem 1
Suppose a random sample $X_1, X_2, \ldots, X_n$ of size $n$ is chosen from a normal
distribution with mean $\mu$ and variance $\sigma^2$. The sample mean and variance can be
computed as follows:
Sample mean, $\bar{X} = \dfrac{\sum_{i=1}^{n} X_i}{n}$
Sample variance, $s^2 = \dfrac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$
As both $\bar{X}$ and $s^2$ are random variables, $\dfrac{(n-1)s^2}{\sigma^2}$
is also a random variable, which follows a Chi-Square distribution with $v = n - 1$
degrees of freedom. Let the Greek symbol $\chi^2$ (pronounced as Chi-Square)
represent the random variable; thus we have
$\chi^2 = \dfrac{(n-1)s^2}{\sigma^2} \sim \chi^2(n-1)$
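Theorem 1 can be checked informally by simulation. The sketch below is only an illustration: the population values (mean 50, standard deviation 4), the sample size n = 10 and the seed are hypothetical choices, not values from this module. If the theorem holds, the simulated values of $(n-1)s^2/\sigma^2$ should have a mean close to v = n – 1 and a variance close to 2v.

```python
import random
import statistics

def scaled_sample_variance(n, mu, sigma, rng):
    """Draw n normal values and return (n - 1) * s^2 / sigma^2."""
    sample = [rng.gauss(mu, sigma) for _ in range(n)]
    s2 = statistics.variance(sample)  # sample variance, divides by n - 1
    return (n - 1) * s2 / sigma ** 2

rng = random.Random(2024)            # fixed seed so the run is repeatable
n, mu, sigma = 10, 50.0, 4.0         # hypothetical population and sample size
draws = [scaled_sample_variance(n, mu, sigma, rng) for _ in range(20000)]

mean_draws = statistics.mean(draws)      # expected to be near v = 9
var_draws = statistics.variance(draws)   # expected to be near 2v = 18
print(round(mean_draws, 2), round(var_draws, 2))
```

The printed values vary slightly with the seed, so they are best read as approximate agreement rather than exact results.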
Based on Figure 1.1, $\chi^2$ only takes non-negative values, starting from zero on the horizontal
axis.
Theorem 2
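A random variable $X$ is said to follow a Chi-Square distribution if and only if its
probability density function is given by
$f(x) = \begin{cases} \dfrac{1}{2^{v/2}\,\Gamma(v/2)}\, x^{v/2 - 1} e^{-x/2}, & x > 0 \\ 0, & \text{otherwise} \end{cases}$
where $v$ is the degrees of freedom and $\Gamma(v/2) = \left(\dfrac{v}{2} - 1\right)!$.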
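The density in Theorem 2 can be evaluated directly. The following is a minimal sketch using Python's built-in gamma function; the function name `chi2_pdf` and the evaluation points are illustrative choices only.

```python
import math

def chi2_pdf(x, v):
    """Chi-Square density with v degrees of freedom, as given in Theorem 2."""
    if x <= 0:
        return 0.0
    return x ** (v / 2 - 1) * math.exp(-x / 2) / (2 ** (v / 2) * math.gamma(v / 2))

# Tabulate the density at a few points for v = 1, 2, 3 and 4
for v in (1, 2, 3, 4):
    print(v, [round(chi2_pdf(x, v), 4) for x in (0.5, 1.0, 2.0, 4.0)])
```

Plotting these values against x gives curves of the kind described for Figure 1.2.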
Figure 1.2
The bigger the value of v, the flatter the density curve is, skewing to the right.
Using v = 1, 2, 3 and 4, the graph of f(x) versus x is shown in Figure 1.2. What
can be observed from this figure? Give your comment.
(a) A few specific properties are:
(i) The distribution is continuous;
(ii) There is only one parameter, v; and
(iii) The random variable $X$ that follows a $\chi^2$ distribution with parameter $v$
can be written as $X \sim \chi^2(v)$.
(b) Other properties of a Chi-Square distribution:
(i) if $X$ is distributed as $\chi^2(v)$, then its mean, $E[X] = v$ and variance,
$\mathrm{Var}[X] = 2v$;
$Y_2 = Z_i^2 + Z_j^2 \sim \chi^2(1 + 1)$, $i = 1, 2, \ldots, n$ and $j = 1, 2, \ldots, n$
$Y_n = Z_1^2 + Z_2^2 + \cdots + Z_n^2 \sim \chi^2(1 + 1 + \cdots + 1 = n)$
Answer:
Using property (b) of the Chi-Square distribution, $Y = X_1 + X_2$ will be distributed
as $\chi^2(4 + 6) = \chi^2(10)$. Hence, $E[Y] = 10$ and $\mathrm{Var}[Y] = 2(10) = 20$.
SELF-CHECK 1.2
Given independent random variables $X \sim \chi^2(2)$ and $Y \sim \chi^2(3)$,
determine the distribution of the random variable T = X + Y.
random variable $Y = \left(\dfrac{X - \mu}{\sigma}\right)^2$ and show that the mean and variance of $Y$ are
$\mu_Y = 1$ and $\sigma_Y^2 = 2$, respectively.
Answer:
Since $X$ is distributed as $N(\mu, \sigma^2)$, $Z = \dfrac{X - \mu}{\sigma}$ will be distributed as
$N(0, 1)$, and $Y = \left(\dfrac{X - \mu}{\sigma}\right)^2 = Z^2$.
EXERCISE 1.1
The following are 100 data, $Y = \sum_{i=1}^{4} z_i^2$, calculated for 4 sets of Z1, Z2, Z3,
3.472 6.472 8.347 13.025 1.483 7.832 0.772 5.449 1.037 4.744
4.940 2.186 2.920 1.083 3.047 5.627 4.091 1.031 4.532 1.033
3.146 4.004 2.685 4.379 1.510 0.964 1.519 4.668 12.723 2.018
6.018 3.820 4.900 3.300 3.147 5.741 6.613 9.386 4.874 9.775
5.290 6.854 8.992 3.330 2.574 0.611 0.870 1.152 0.738 8.630
5.233 0.579 1.653 1.237 8.484 3.643 2.118 5.813 5.168 4.255
1.079 3.145 4.541 2.052 6.846 0.570 0.476 2.151 0.391 0.758
3.700 2.476 2.680 0.756 3.549 2.694 10.884 1.630 2.392 3.084
2.577 4.354 3.785 2.232 1.348 1.840 6.208 10.938 2.217 1.264
1.330 1.808 1.642 3.434 3.596 4.687 2.650 6.203 1.830 4.865
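$\Pr(\chi^2 > \chi_c^2) = \alpha, \quad 0 \le \alpha \le 1$
To facilitate the usage of this table, we use the term “small $\alpha$” if its values are in
the range of $0 < \alpha \le 0.1$ and “large $\alpha$” if its values are in the complementary
interval of $0.9 \le \alpha < 1$. Usually, the $\chi_c^2$ values are big for small $\alpha$, and the $\chi_c^2$ values are
small when $\alpha$ is large. For example:
where $\Pr(\chi^2(v) > \chi_c^2) = \alpha$
$\chi_c^2$ = critical value.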
Based on Table 1.1 above, it can be seen that for the same v degrees of freedom
(v = 5), the $\chi_c^2$ value approaches zero when the value of $\alpha$ increases for the “large $\alpha$”
case. On the other hand, the value of $\chi_c^2$ increases and approaches $\infty$ when the $\alpha$
value decreases for the “small $\alpha$” case. This is an important property, especially when
$\alpha$ is used as a significance level in hypothesis testing or confidence interval
construction for the population variance, $\sigma^2$.
The following Table 1.2 shows part of the $\chi_c^2$ values for some $\alpha$ values and the
corresponding degrees of freedom v, where v = n – 1. This table was constructed in
a similar manner to the t-table that contains the t values. The top row of the table
represents the area to the right of the point, as in Figure 1.4, in reference to the
rows with the appropriate degrees of freedom. Let us check how Table 1.2 is used.
1
2
.
.
.
.
11 4.575 5.578 17.275 19.675
12 5.226 6.304 18.549 21.026
13 5.892 7.042 19.812 22.362
14 6.571 7.790 21.064 23.685
.
.
.
The values and area for Chi-Square distribution with v = 12 degrees of freedom
are shown in Figure 1.4(a). Since
It is important to note that the $\chi^2$ point represents its position on the horizontal
axis. As such, all points and probabilities/areas in the table satisfy
$\Pr(\chi^2(v) > \chi_c^2) = \alpha$
Answer:
When n = 20, v = n – 1 = 20 – 1 = 19 degrees of freedom. From the distribution
table,
$\chi^2_{0.975} = 8.907$
$\chi^2_{0.025} = 32.852$
Figure 1.5 clearly depicts that the area on the right-hand side of point 8.907 is
0.975 and the area on the right-hand side of point 32.852 is 0.025.
Figure 1.5
(a) X > 7.38; (b) X < 0.103; (c) Y < 22.46; and (d) X+Y > 2.18
Answer:
(a) It is known that X is distributed as $\chi^2(2)$. From the table, 0.025 of the
distribution is situated on the right-hand side of point 7.38 (column
$\alpha$ = 0.025, row v = 2)
Pr(X > 7.38) = 0.025
(d) $X + Y \sim \chi^2(2 + 6)$. Based on the table, with column $\alpha$ = 0.975 and row v = 8,
Pr(X + Y > 2.18) = 0.975
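The two table look-ups above can be cross-checked numerically. The sketch below is not the method used in this module (which reads the table); it assumes the standard closed form for the upper-tail probability of a Chi-Square variable with an even number of degrees of freedom, $\Pr(\chi^2(v) > c) = e^{-c/2}\sum_{k=0}^{v/2-1} (c/2)^k / k!$. The function name is hypothetical.

```python
import math

def chi2_upper_tail_even(c, v):
    """Pr(Chi-Square(v) > c) for even v, using the closed-form finite sum."""
    if v <= 0 or v % 2 != 0:
        raise ValueError("this sketch only handles even degrees of freedom")
    if c <= 0:
        return 1.0
    half = c / 2
    return math.exp(-half) * sum(half ** k / math.factorial(k) for k in range(v // 2))

tail_a = chi2_upper_tail_even(7.38, 2)   # part (a): X ~ Chi-Square(2)
tail_d = chi2_upper_tail_even(2.18, 8)   # part (d): X + Y ~ Chi-Square(8)
print(round(tail_a, 3), round(tail_d, 3))
```

For odd degrees of freedom this shortcut does not apply, and the table (or statistical software) should be used instead.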
EXERCISE 1.2
Statement 1
To test whether a random sample of size n with sample variance $S^2$ was drawn
The following Figure 1.7 shows the critical region for testing $H_0: \sigma^2 = \sigma_0^2$ versus
(a) $H_1: \sigma^2 > \sigma_0^2$; (b) $H_1: \sigma^2 < \sigma_0^2$; and (c) $H_1: \sigma^2 \ne \sigma_0^2$.
Figure 1.8
(b) Find the critical region for a one-sided left-hand side test (large $\alpha$) at the 1%
level, $\alpha$ = 0.01, with sample size n = 22.
Figure 1.9
(c) Determine the critical value at α = 0.05 level and sample size n = 21 for a
two-sided test.
(i) Step I: for a two-sided test, divide $\alpha$ in half, i.e. $\alpha/2 = 0.025$. Refer
to Figure 1.10:
Figure 1.10
The following Table 1.3 gives the rejection region for various combinations of
null hypothesis, H0 and alternative hypothesis, H1.
SELF-CHECK 1.3
Answer:
There are a few steps to be taken to answer this question:
Test Statistic: $\chi^2 = \dfrac{(n-1)S^2}{\sigma_0^2} = \dfrac{8 \times 8.01}{9} = 7.12$
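The arithmetic of the test statistic can be reproduced in a couple of lines. In this sketch the function name is hypothetical, and n = 9 is inferred from the factor n – 1 = 8 shown in the calculation above.

```python
def chi2_variance_statistic(n, sample_variance, sigma0_squared):
    """Test statistic (n - 1) * S^2 / sigma_0^2 for a test on one variance."""
    return (n - 1) * sample_variance / sigma0_squared

# n = 9 is inferred from (n - 1) = 8; S^2 = 8.01 and sigma_0^2 = 9 as above
stat = chi2_variance_statistic(n=9, sample_variance=8.01, sigma0_squared=9)
print(round(stat, 2))
```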
Figure 1.12
SELF-CHECK 1.4
(a) Give your opinion on the possible value change for the standard
deviation of the population total exam score; and
EXERCISE 1.3
1.2 F DISTRIBUTION
The F distribution is required for studies involving two or more independent
populations. The sample variance $s_1^2$ will be calculated and used as a point
estimate for the first population variance $\sigma_1^2$, and the second sample variance $s_2^2$
will be calculated and used as a point estimate for the second population variance.
Figure 1.13
Theorem 3
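Let U and V be two independent random variables having Chi-Square
distributions with $v_1$ and $v_2$ degrees of freedom, respectively. Then,
$F = \dfrac{U / v_1}{V / v_2}$
is the random variable that follows the F distribution with $v_1$ and $v_2$ degrees of
freedom, and the probability density function g(f) is
$g(f) = \dfrac{\Gamma\left(\frac{v_1 + v_2}{2}\right)}{\Gamma\left(\frac{v_1}{2}\right)\Gamma\left(\frac{v_2}{2}\right)} \left(\dfrac{v_1}{v_2}\right)^{v_1/2} f^{\,v_1/2 - 1} \left(1 + \dfrac{v_1}{v_2} f\right)^{-(v_1 + v_2)/2}, \quad f > 0$
and g(f) = 0 otherwise.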
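The density in Theorem 3 can be evaluated numerically, which may help when sketching its graph for different pairs of degrees of freedom. This is a minimal sketch; the function name `f_pdf` and the evaluation points are illustrative only.

```python
import math

def f_pdf(f, v1, v2):
    """F density with (v1, v2) degrees of freedom, following Theorem 3."""
    if f <= 0:
        return 0.0
    const = math.gamma((v1 + v2) / 2) / (math.gamma(v1 / 2) * math.gamma(v2 / 2))
    return (const * (v1 / v2) ** (v1 / 2) * f ** (v1 / 2 - 1)
            * (1 + v1 * f / v2) ** (-(v1 + v2) / 2))

# Evaluate the density at a few points for one pair of degrees of freedom
print([round(f_pdf(x, 6, 6), 4) for x in (0.5, 1.0, 2.0, 4.0)])
```

For $v_1 = v_2 = 2$ the formula reduces to $1/(1+f)^2$, which gives a quick sanity check of the implementation.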
EXERCISE 1.4
Sketch separately three graphs of function f (x) versus x using pairs of
(29,28), (19,6) and (6,6). Give your comment.
(Please answer on separate sheets.)
The critical value of F distribution lies on the right-hand side of the function
graph (refer to Figure 1.13). For each pair of v1 and v2; the first, second and third
rows in the F distribution table are critical values at 0.05, 0.025 and 0.01
significance levels. To determine the critical values on the left-hand side, this
relationship is used:
$F_{v_1, v_2;\, \alpha} = \dfrac{1}{F_{v_2, v_1;\, 1-\alpha}}$
To facilitate the usage of the table, we will observe several critical values of F
when:
$v_1$ $v_2$ $F_{v_1, v_2;\, 0.05}$
6 5 4.950
6 6 4.284
6 7 3.866
6 8 3.581
Comment: The F critical value decreases as $v_2$ increases while $v_1$ is fixed.
$v_1$ $v_2$ $F_{v_1, v_2;\, 0.05}$
5 6 4.387
6 6 4.284
7 6 4.207
8 6 4.147
Comment: The F critical value decreases as $v_1$ increases while $v_2$ is fixed.
Answer:
Using the property,
$F_{10,11,0.95} = \dfrac{1}{F_{11,10,1-0.95}} = \dfrac{1}{F_{11,10,0.05}} = \dfrac{1}{2.943} = 0.3398$
EXERCISE 1.5
1. Determine the following values based on the F distribution table.
(a) F0.05 (3,16) (b) F0.05 (12,25)
(c) F0.01 (4,15) (d) F0.05 (7,4)
2. Determine the values (using the F distribution table) that satisfy the
equation below:
(a) $F_\alpha(6,14) = 3.50$ (b) $F_\alpha(10,32) = 2.93$
(c) $F_\alpha(24,38) = 1.81$ (d) $F_\alpha(2,24) = 5.61$
(Please answer on separate sheets.)
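$F = \dfrac{\sigma_2^2 S_1^2}{\sigma_1^2 S_2^2}$
is a random variable following an F distribution with n1 – 1 and n2 – 1 degrees of
freedom. Hence, we write:
$\Pr\left(F_{1-\alpha/2,\, n_1-1,\, n_2-1} < \dfrac{\sigma_2^2 S_1^2}{\sigma_1^2 S_2^2} < F_{\alpha/2,\, n_1-1,\, n_2-1}\right) = 1 - \alpha$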
Figure 1.14
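$\Pr\left(F_{1-\alpha/2}(v_1, v_2) < \dfrac{\sigma_2^2 S_1^2}{\sigma_1^2 S_2^2} < F_{\alpha/2}(v_1, v_2)\right) = 1 - \alpha$
$\Pr\left(\dfrac{S_2^2}{S_1^2} F_{1-\alpha/2}(v_1, v_2) < \dfrac{\sigma_2^2}{\sigma_1^2} < \dfrac{S_2^2}{S_1^2} F_{\alpha/2}(v_1, v_2)\right) = 1 - \alpha$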
Theorem 4
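If $S_1^2$ and $S_2^2$ are the variances of independent random samples of sizes $n_1$ and $n_2$ from
normal populations, then
$\dfrac{s_1^2}{s_2^2} \cdot \dfrac{1}{F_{\alpha/2,\, n_1-1,\, n_2-1}} < \dfrac{\sigma_1^2}{\sigma_2^2} < \dfrac{s_1^2}{s_2^2}\, F_{\alpha/2,\, n_2-1,\, n_1-1}$
is defined as the (1 – α)100% confidence interval for $\dfrac{\sigma_1^2}{\sigma_2^2}$.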
variances.
Answer:
From the F distribution, we obtained
$f_{0.01,9,7} = 6.72$ and $f_{0.01,7,9} = 5.61$ for $\alpha = 0.02$.
$0.5 \times \dfrac{1}{6.72} < \dfrac{\sigma_1^2}{\sigma_2^2} < 0.5 \times 5.61$
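The interval limits can be evaluated numerically. In the sketch below, the value 0.5 is read from the worked answer above and is assumed to be the ratio $s_1^2 / s_2^2$; the function name is hypothetical, and the two F critical values are the ones quoted from the table.

```python
def variance_ratio_interval(ratio, f_upper_v1_v2, f_upper_v2_v1):
    """Confidence limits for sigma_1^2 / sigma_2^2 following Theorem 4.

    ratio          -- s_1^2 / s_2^2 (assumed here to be the 0.5 in the answer)
    f_upper_v1_v2  -- F critical value at alpha/2 with (n1 - 1, n2 - 1) df
    f_upper_v2_v1  -- F critical value at alpha/2 with (n2 - 1, n1 - 1) df
    """
    return ratio / f_upper_v1_v2, ratio * f_upper_v2_v1

lower, upper = variance_ratio_interval(0.5, 6.72, 5.61)
print(round(lower, 4), round(upper, 3))
```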
EXERCISE 1.6
Now, if both samples come from a normal population with equal variance, $\sigma^2$, then
$\dfrac{(n-1)S_X^2}{\sigma^2} \sim \chi^2(n-1)$ and $\dfrac{(m-1)S_Y^2}{\sigma^2} \sim \chi^2(m-1)$
(n 1) (n 1) 2
To test whether two random samples of sizes n and m with sample variances $S_X^2$
and $S_Y^2$ respectively are taken from a normal population with equal variance, we
use the F statistic $\dfrac{\hat{\sigma}_X^2}{\hat{\sigma}_Y^2}$, distributed as F(n – 1, m – 1) when the null hypothesis is
true (that is, the samples have equal variance), with unbiased estimated variances
$\hat{\sigma}_X^2 = \dfrac{(n-1)S_X^2}{n-1}$ and $\hat{\sigma}_Y^2 = \dfrac{(m-1)S_Y^2}{m-1}$ respectively.
SELF-CHECK 1.5
Answer:
A few steps need to be followed to solve this problem.
EXERCISE 1.7
Sample 1 Sample 2
n1 =16 n2 =25
$\bar{x}_1$ = 48.7 $\bar{x}_2$ = 39.2
EXERCISE 1.8
(ii) $\chi^2_{0.99} = x$ (v = 4)
5. A test has been suggested at the 5% level to check whether a sample
containing 9 items comes from a normal population with variance
11. If the calculated sample variance is 12.1, determine the result of
that test.
Chi-Square Distribution
The Chi-Square distribution is a sampling distribution for the variable
$\chi^2 = \dfrac{(n-1)s^2}{\sigma^2}$, possessing the following properties:
The Chi-Square value is always greater than or equal to 0, a property
which is not applicable for z and t distributions;
The distribution is non-symmetrical;
The distribution will vary according to the sample size. This means the
distribution shape is very dependent on the value $v$, which is the degrees of
freedom value that exists in a sampling situation;
The mean of any Chi-Square distribution is equal to its degrees of freedom; and
It is used to test hypotheses about the variance of a population.
F Distribution
The F distribution is a sampling distribution for variables possessing the
following properties:
There are no negative values in the F distribution (similar to Chi-Square
distribution). As such, the scale for the F value starts with 0 and extends
towards the positive side on the right;
The F distribution is also non-symmetrical;
There are various shapes of F distribution depending on sample size,
which is the respective sample degrees of freedom; and
It is used to compare two independent population variances.
INTRODUCTION
There are situations that may interest us to broaden our test scope to more than
two populations. The following are two situations, which compare three or more
populations. A teacher may be interested in comparing the mean of Mathematics
marks obtained from three groups of students following different teaching
methods. Similarly, a scientist may be interested to compare the strength of
certain pulps produced using different techniques. Each production method may
result in a different mean strength and the scientist may want to test the equality
of several means. This can be performed through a procedure called variance
analysis. Analysis of Variance (ANOVA) is a basic method used in experimental
design. This technique has wide applications and is a useful technique in
inferential statistics.
Relevant terms:
(a) Independent variables are called factors. A study may involve 1, 2 or more
factors.
Example: type of tuition classes, time of classes and students’ commitment.
(c) The combination of one level of a factor to another factor’s level is termed
as treatment or run.
Note:
(i) For a single factor experiment, the factor level and treatment carry the same
meaning.
(ii) In this module, only the single-factor experiment is considered. For multiple
factors, please refer to appropriate reference books.
SELF-CHECK 2.1
Can you think of several other examples that may use ANOVA?
You have learnt about display plots in the Statistics I module. The focus is on the
usage of the box-plot and dot-plot to provide a visual display before calculating
the mean comparison. Figure 2.1 displays the box-plot graph for comparison of
mean speeds according to types of cars.
Figure 2.1
Figure 2.1 gives information on five cars whose speeds have different means or medians
and different variances. However, are the observed
differences statistically significant? We need the analysis of
variance to carry out a numerical significance test on the equality of the means.
The null hypothesis states that there is no significant difference in the mean speed
of all cars. If the null hypothesis is rejected, this means that there exist
differences in mean speed and, if we are interested, we can determine which car's
mean speed caused the difference.
Several assumptions must hold when carrying out ANOVA
procedures. The assumptions are that:
(a) Each population under study must be normally distributed;
(b) Samples taken are random and independent; and
(c) The populations from which the samples are drawn have unknown but equal
population variances, that is
$\sigma_1^2 = \sigma_2^2 = \cdots = \sigma_k^2$
Figure 2.2 below displays the graph shape for populations that satisfy the
assumptions above.
Figure 2.2
Observe Figure 2.3 below. What can you comment on the means µ1, µ2, and µ3?
Figure 2.3
Note:
(a) The factor is the tuition class; the factor level is type/category.
(b) The four observations for factor level 1 are called group 1.
(c) There are three groups in this experiment with the dot plot shown in Figure
2.4:
Figure 2.4
(d) There exist variations within each respective group (as observed from the
variation in marks value).
(e) There exist variations between groups (as observed from the varying
position of the group centre).
(f) Based on these, we can raise the question: Do the three types of extra classes
result in similar effects on students’ performance? In other words, is the
population mean for group 1 ($\mu_1$) = population mean for group 2 ($\mu_2$) = population
mean for group 3 ($\mu_3$)?
SELF-CHECK 2.2
EXERCISE 2.1
Using the two data sets below, construct a dot-plot graph. Give your
comment.
Data 1 Data 2
A B C A B C
14.9 14.4 14.5 16.6 13.2 13.5
14.9 14.4 14.5 16.8 14.7 16.9
14.9 14.4 14.5 13.2 15.7 15.4
14.9 14.4 14.5 15.3 12.3 12.8
14.9 14.4 14.5 12.6 16.1 13.9
Total 74.5 72.0 72.5 74.5 72.0 72.5
Mean 14.9 14.4 14.5 14.9 14.4 14.5
Variance 0 0 0 3.71 2.63 2.71
Overall Total 219.0 219.0
Mean Total 14.6 14.6
k n _
i 1 j 1
Table 2.1
Treatment   Observation
1           $x_{11}$   $x_{12}$   …   $x_{1,n_1}$
2           $x_{21}$   $x_{22}$   …   $x_{2,n_2}$
.           .          .          …   .
.           .          .          …   .
k           $x_{k1}$   $x_{k2}$   …   $x_{k,n_k}$
The table shows the measurement collected from observations under study
(subject) i and j where i = 1, 2, …k, and j = 1, 2, …,n. The sample sizes are not
dividing the total overall observation from samples with the total sample size, that
is $\bar{x}_{..} = \sum_{i=1}^{k} T_i / N$.
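$\sum_{i=1}^{k}\sum_{j=1}^{n} (x_{ij} - \bar{x}_{..})^2 = n\sum_{i=1}^{k} (\bar{x}_{i.} - \bar{x}_{..})^2 + \sum_{i=1}^{k}\sum_{j=1}^{n} (x_{ij} - \bar{x}_{i.})^2 + 2\sum_{i=1}^{k}\sum_{j=1}^{n} (\bar{x}_{i.} - \bar{x}_{..})(x_{ij} - \bar{x}_{i.})$ (2.2)
$\sum_{j=1}^{n} (x_{ij} - \bar{x}_{i.}) = x_{i.} - n\bar{x}_{i.} = x_{i.} - n(x_{i.}/n) = 0$
Hence, we have
$\sum_{i=1}^{k}\sum_{j=1}^{n} (x_{ij} - \bar{x}_{..})^2 = n\sum_{i=1}^{k} (\bar{x}_{i.} - \bar{x}_{..})^2 + \sum_{i=1}^{k}\sum_{j=1}^{n} (x_{ij} - \bar{x}_{i.})^2$ (2.3)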
Equation (2.3) states that the total variability in the data, as measured by the total
sum of squares, can be partitioned into the sum of squares of deviations between
the treatment means and the overall mean, and the sum of squares of deviations of
observations within treatments from their treatment mean. This means the
differences between the observed treatment means and the overall mean are a
measure of differences between treatment means, while the observed differences
between the observations within a treatment and the treatment mean can only be caused by random error.
As such, we can write Equation (2.3) (in symbols) as:
$SST = SS(Tr) + SSE$
with degrees of freedom
$N - 1 = (k - 1) + (N - k)$
where
where
It is important for us to examine closely both terms on the right side of the basic
identity for analysis of variance (Equation (2.3)). Consider the following sum of
squares for error:
$SSE = \sum_{i=1}^{k}\sum_{j=1}^{n} (x_{ij} - \bar{x}_{i.})^2 = \sum_{i=1}^{k}\left[\sum_{j=1}^{n} (x_{ij} - \bar{x}_{i.})^2\right]$
It is easy to see that the term inside the square bracket is equivalent to the
sample variance of the i-th treatment multiplied by n – 1, where
$S_i^2 = \dfrac{\sum_{j=1}^{n} (x_{ij} - \bar{x}_{i.})^2}{n_i - 1}, \quad i = 1, 2, \ldots, k$
Now, the sample variances can be combined to get a single (pooled) estimate of the
population variance, that is,
$\dfrac{(n-1)S_1^2 + (n-1)S_2^2 + \cdots + (n-1)S_k^2}{(n-1) + (n-1) + \cdots + (n-1)} = \dfrac{\sum_{i=1}^{k}\sum_{j=1}^{n} (x_{ij} - \bar{x}_{i.})^2}{\sum_{i=1}^{k} (n-1)} = \dfrac{SSE}{N-k}$
In the same way, if there is no difference between the k treatment means, we can use
the variation of the treatment means about the overall mean to estimate $\sigma^2$.
Note that the analysis of variance identity (Equation 2.3) provides two
estimates, which are based on the variation within treatments and
between treatments. If there are no differences in treatment means, both
estimates should be almost equal. If the estimates are not the same, it is
suspected that the observed difference must be due to differences between
treatment means. The quantities
$MS(Tr) = \dfrac{SS(Tr)}{k-1}$
and
$MSE = \dfrac{SSE}{N-k}$
are the mean squares for treatments and for error, respectively.
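The partition of the total sum of squares and the two mean squares can be computed directly from raw data. The sketch below is an illustration only: the three groups of values are hypothetical and do not come from any example in this module, and the function name is an arbitrary choice.

```python
def one_way_anova(groups):
    """Return the sums of squares, mean squares and F ratio for a one-way layout."""
    k = len(groups)
    N = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / N
    means = [sum(g) / len(g) for g in groups]

    # Between-treatment and within-treatment sums of squares
    ss_tr = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_e = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    ss_t = sum((x - grand_mean) ** 2 for g in groups for x in g)

    ms_tr = ss_tr / (k - 1)
    ms_e = ss_e / (N - k)
    return {"SST": ss_t, "SS(Tr)": ss_tr, "SSE": ss_e,
            "MS(Tr)": ms_tr, "MSE": ms_e, "F": ms_tr / ms_e}

# Hypothetical data: three treatments with three observations each
result = one_way_anova([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(result)
```

For these made-up values SS(Tr) = 26 and SSE = 6, so SST = 32, which illustrates the identity SST = SS(Tr) + SSE.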
We obtained:
$\dfrac{2.38}{14} = 0.17$
In this topic, the term mean squares is synonymous with the term variance.
As such, when discussing the variation in measurements for a set of data, the
term mean squares is used in place of the word variance. When the value of the
mean squares of treatments is greater than the mean squares of errors, we can
make an early conclusion that there may exist a significant difference between the
treatments that is affecting the experimental results. The following section
SELF-CHECK 2.3
EXERCISE 2.2
Determine the critical value, that is $F_{v_1, v_2, \alpha}$ (obtained from the F distribution
table). Reject $H_0$ when $F > F_{v_1, v_2, \alpha}$.
EXERCISE 2.3
The mean model is the model usually used for single-factor testing, that is, the
model used to compare the means ($\mu_i$) of factor levels. This model is
$X_{ij} = \mu_i + \varepsilon_{ij}$, i = 1, 2, ..., k; j = 1, 2, ..., n
As we know, MS(Tr) and MSE are measures of changes between treatments and
changes within treatments. The combination of both terms will result in the total
variability in the sample data. This means:
The total sum of squares = Sum of squares treatment + Sum of squares Error
Carry out an ANOVA test at the 0.01 significance level. Is there any significant
difference in the efficiency of employees based on their former schools?
Answer:
The factor involved here is the former school; the factor level is the 4 original
schools selected by the accountants.
Step 6: Conclusion
Hence, there is insufficient evidence to conclude that the mean numbers of mistakes
made by the accountants differ. This means that there is no significant
difference in employees’ efficiency based on schools.
EXERCISE 2.4
A teacher claimed that the frequencies of watching television are equal
for all students in primary 6, Form 1 and 2. He conducted a survey on a
random sample of selected students and their total time (in minutes)
spent on watching television after school time until just before bed
time were recorded as below:
Primary 6 Form 1 Form 2
459 115 272
311 153 88
152 201 374
293 30 178
SELF-CHECK 2.4
State:
1. Three assumptions made in ANOVA hypothesis testing.
2. The steps involved to perform ANOVA hypothesis testing.
EXERCISE 2.5
There are three assumptions that need to be verified before ANOVA technique
can be carried out which are:
The population distribution approaches normality;
The samples are chosen at random and independent; and
The population variances are equal.
INTRODUCTION
In the previous topics, analysis of variance and hypothesis testing were conducted
on quantitative and continuous data. The statistical techniques discussed in those
topics required measurement values such as weight, height, diameter, distance,
total money or total score/marks for a test. On the other hand, many types of
surveys and experiments result in qualitative rather than quantitative response
variables. As a result, the responses can be classified but not quantified. Data from
these experiments consist of the count or number of observations that fall into
each of the response categories included in the experiment. In this topic, we are
concerned with methods for analysing categorical data.
SELF-CHECK 3.1
Multinomial Experiment
(a) This experiment consists of n identical trials.
(b) The outcome of each trial falls into one of k categories or cells.
(c) The probability that the outcome of a single trial will fall in a particular
cell, say, cell i, is $p_i$, where i = 1, 2, …, k, and remains the same from trial to
trial, with $p_1 + p_2 + \cdots + p_k = 1$.
(d) The experimenter counts the observed number of outcomes in each
category, written as $O_1, O_2, \ldots, O_k$, where the $O_j$ (j = 1, 2, ..., k) satisfy
$n = O_1 + O_2 + \cdots + O_k$.
(e) When performing a goodness-of-fit test, these two assumptions are required:
(i) The experiment satisfies the properties of a multinomial experiment;
and
(ii) All expected frequencies are at least 1, and, at most, 20% of the
expected frequencies are less than 5.
(f) The required alternative hypothesis is
H1: at least one of the multinomial probabilities differs from the value specified under H0.
(g) In n trials, the expected number that falls into the j-th category under the
null hypothesis is $E_j = np_j$.
Face 1 2 3 4 5 6 Total
Frequency (observed) 20 10 10 20 20 40 120
Answer:
Step 1: Construct the appropriate hypothesis statement
H0 : $p_1 = p_2 = \cdots = p_6 = 1/6$, that is, the dice is fair
H1 : at least one $p_j \ne 1/6$, for j = 1, 2, …, 6
Theoretically, if the dice is balanced, we would expect each face to occur 20
times, that is, the outcomes follow a uniform distribution. This means the expected
frequency of each face in 120 throws is as in the following table:
Face 1 2 3 4 5 6 Total
Frequency 20 20 20 20 20 20 120
test statistic is $\chi^2 = \sum \dfrac{(O - E)^2}{E}$, and the critical value is
$\chi^2_{5\%}(5) = 11.070$ with degrees of freedom,
v = (number of columns) – 1 = 6 – 1 = 5.
Face                                                           1    2    3    4    5    6    Total
The observed frequency, O                                      20   10   10   20   20   40   120
The expected frequency, E (when the null hypothesis is true)   20   20   20   20   20   20   120
$(O - E)^2 / E$                                                0    5    5    0    0    20   30
Hence, we obtain $\chi^2 = \sum_{j=1}^{6} \dfrac{(O_j - E_j)^2}{E_j} = 30$. Under H0, $\sum \dfrac{(O - E)^2}{E}$ is
distributed as a $\chi^2$ distribution. To calculate the test value in this case, the
values of the 6 pairs (O, E) are needed. The 6 pairs are not all independent:
their total frequency must be 120, so the differences O – E summed over all
pairs must be zero. This means that only 6 – 1 independent pairs
(degrees of freedom) are used to calculate the $\chi^2$ test value. In other
words, there are 6 cells to fill subject to 1 restriction, namely that the total frequency
for the 6 cells must be 120. For this reason, 6 choices – 1 restriction = 5 degrees
of freedom.
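The calculation for the dice example can be reproduced in a few lines. The sketch below uses the observed frequencies and the critical value 11.070 quoted above; the variable names are illustrative.

```python
observed = [20, 10, 10, 20, 20, 40]       # observed frequencies for faces 1 to 6
n = sum(observed)                         # 120 throws
expected = [n / 6] * 6                    # fair dice: 20 per face under H0

# Chi-Square goodness-of-fit statistic: sum of (O - E)^2 / E
chi2_stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1                    # 6 cells - 1 restriction
critical = 11.070                         # table value at the 5% level, v = 5

reject_h0 = chi2_stat > critical
print(chi2_stat, df, reject_h0)
```

Since the computed statistic exceeds the critical value, the sketch reaches the same decision as the worked answer: H0 is rejected.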
Figure 3.1
Step 5: State the conclusion
Hence, we have strong evidence to state that the dice is unbalanced
and that its outcomes are not uniformly distributed.
SELF-CHECK 3.2
EXERCISE 3.1
Answer:
Let X be a random variable that represents the number of students taking
elective courses.
Step 1: Construct the appropriate hypothesis statement
H0 : The number of students taking elective courses is distributed as
binomial.
In this case, $\bar{x} = \dfrac{\sum fx}{\sum f} = \dfrac{0(12) + 1(16) + 2(8) + 3(3) + 4(1) + 5(0)}{12 + 16 + 8 + 3 + 1 + 0} = \dfrac{45}{40}$
X 0 1 2 3 4 5 or 6
Expected values 11.51 15.93 9.19 2.83 0.49 0.05
And therefore the first cell expected values are E = (40)(0.2877) = 11.51, etc.
Since there are three cells with frequency less than 5 (X = 3,4 and 5 or 6),
all of them will be combined together with frequency at X=2 resulting in
the value of 12.56 (that is 9.19 + 2.83 + 0.49 + 0.05). Hence, combining
the observed and expected frequencies table to perform subsequent
analysis:
X 0 1 2 or More Total
Observed 12 16 12 40
Expected 11.51 15.93 12.56 40
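The expected frequencies in the table can be reproduced as follows. This is a sketch under two assumptions that are inferred rather than stated in the text: the number of binomial trials is taken to be 6 (because the last category is "5 or 6"), and the probability is estimated as $\hat{p} = \bar{x}/6$. The observed frequency for X = 4 is taken as 1, as in the numerator of the sample-mean calculation.

```python
import math

# Observed frequencies for X = 0, 1, 2, 3, 4 and "5 or 6"
observed = [12, 16, 8, 3, 1, 0]
total = sum(observed)                                        # 40 students
x_bar = sum(x * f for x, f in enumerate(observed)) / total   # 45 / 40

n_trials = 6                 # assumption: X can take the values 0 to 6
p_hat = x_bar / n_trials     # assumption: p estimated from the sample mean

def binomial_pmf(x, n, p):
    """Binomial probability Pr(X = x) for n trials with success probability p."""
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)

expected = [total * binomial_pmf(x, n_trials, p_hat) for x in range(n_trials + 1)]

# Pool X >= 2 into one cell so that no expected frequency is below 5
obs_pooled = [observed[0], observed[1], sum(observed[2:])]
exp_pooled = [expected[0], expected[1], sum(expected[2:])]

chi2_stat = sum((o - e) ** 2 / e for o, e in zip(obs_pooled, exp_pooled))
print([round(e, 2) for e in exp_pooled], round(chi2_stat, 3))
```

Under these assumptions the pooled expected frequencies agree with the table to two decimal places.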
So, df = 3 - 2 = 1
Hence, the critical value for this test is $\chi^2_{5\%}(1) = 3.84$ (refer to table).
E
forms depending on whether (a) p is unknown or (b) p is known, and here,
we will only consider the Poisson distribution for one case. Similar to a
previous case, to generate the expected distribution (when H0 is true), it is
necessary to generate a theoretical distribution and this requires:
(i) mean parameter, µ; and
(ii) total frequencies, $\sum f$.
Let us see the following example.
Answer:
Step 1: Construct the appropriate hypothesis test
H0: The frequency follows a Poisson distribution
H0 : $X \sim P(\lambda)$
where
$\hat{\lambda} = \bar{x} = \dfrac{173}{98} = 1.765$
H1: otherwise
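$\Pr(X = x) = \dfrac{e^{-\hat{\lambda}} \hat{\lambda}^x}{x!}$, with $\hat{\lambda} = \bar{x}$. We can also use the formula $f(x + 1) = \dfrac{\hat{\lambda}}{x + 1} f(x)$ to calculate
Pr(X = x).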
X 0 1 2 3 4 or More Total
Observed 19 26 27 13 13 98
Expected 16.8 29.6 26.1 15.4 10.1 98
$\chi^2 = \sum \dfrac{(O - E)^2}{E} = \dfrac{(19 - 16.8)^2}{16.8} + \cdots + \dfrac{(13 - 10.1)^2}{10.1}$
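The expected Poisson frequencies can be generated with the recurrence formula for successive probabilities. The sketch below uses the observed frequencies and the estimate 173/98 from the answer above; the variable names are illustrative, and the last cell is assumed to collect all remaining probability ("4 or more").

```python
import math

observed = [19, 26, 27, 13, 13]    # X = 0, 1, 2, 3 and "4 or more"
total = sum(observed)              # 98
lam = 173 / total                  # estimated mean, about 1.765

# Start from f(0) = exp(-lam), then apply f(x + 1) = lam / (x + 1) * f(x)
probs = [math.exp(-lam)]
for x in range(3):
    probs.append(probs[-1] * lam / (x + 1))
probs.append(1 - sum(probs))       # remaining probability for "4 or more"

expected = [total * p for p in probs]
chi2_stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print([round(e, 1) for e in expected], round(chi2_stat, 2))
```

The statistic computed from unrounded expected frequencies may differ slightly from a hand calculation that uses the rounded table values.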
Height 117 – 120 121 – 124 125 – 128 129 – 132 133 – 136
Frequency 8 28 82 140 188
Answer:
Upper class boundary (x)   z = (x – 134.256)/6.195   Pr(Z < z)   p   Expected (p × total observed)   Observed
120.5 -2.24 0.013 0.013 9.0 8
124.5 -1.59 0.056 0.043 29.9 28
128.5 -0.95 0.171 0.115 79.8 82
132.5 -0.3 0.382 0.211 146.4 140
136.5 0.35 0.637 0.255 177.0 188
140.5 0.99 0.839 0.202 140.2 148
144.5 1.64 0.950 0.111 77.0 69
148.5 2.28 0.989 0.039 27.1 15
- - 1.000 0.011 7.6 16
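$\chi^2 = \sum \dfrac{(O - E)^2}{E} = \dfrac{(8 - 9)^2}{9} + \dfrac{(28 - 29.9)^2}{29.9} + \cdots + \dfrac{(16 - 7.6)^2}{7.6}$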
Step 5: Conclusion
We have evidence to state that the normality assumption on the observed
data is not satisfied.
EXERCISE 3.2
n < 40 and when any expected frequency is less than 5. If n > 40, each expected
frequency in table r * c has a value more than 1 (c is the number of columns for
the level of the first factor, r is the number of rows for the level of the second
factor).
Political Affiliation
Tax Reform Party A Party B Party C Total
For 308 190 102 600
Against 92 160 148 400
Total 400 350 250 1000
Table 3.1 is also called a 2 * 3 (or 3 * 2) table since it consists of two rows and three
columns. The two categorical variables involved are Political Affiliation (at three
levels that are Party A, Party B, and Party C) and their views on tax reform (at two
levels, “For” or “Against”). The values inside the table (308, 190 and the rest) are the
frequencies at the intersection of a given political affiliation and a given view on tax reform. These values
are observed frequencies, as they represent the results obtained in the study, and
are identified as the number of individuals in each of the six categories or cells. For
example, the number of people who are members of Party A and agree on tax reform
is 308 (row 1, column 1) while the number of people who are members of Party C
and disagree on tax reform is 148 (row 2, column 3).
Table 3.2: The Cross-Classification between Political Affiliation and Views on Tax
Reform (Percentage Values in the Table are Based on Overall Total)
Political Affiliation
Tax Reform Party A Party B Party C Total
For 30.8 19 10.2 60
Against 9.2 16 14.8 40
Total 40 35 25 100
Table 3.3: The Cross-Classification between Political Affiliation and Views on Tax Reform
(Percentage Values in the Table are Based on Total Rows)
Political Affiliation
Tax Reform Party A Party B Party C Total
For 51.33 31.67 17 100
Against 23 40 37 100
Total 40 35 25 100
Table 3.4: The Cross-Classification between Political Affiliation and Views on Tax
Reform (Percentage Values in the Table are Based on Total Columns)
Political Affiliation
Tax Reform Party A Party B Party C Total
For 77 54.29 40.8 60
Against 23 45.71 59.2 40
Total 100 100 100 100
Various conclusions can be drawn from Tables 3.2, 3.3 and 3.4. Some of them are:
(a) 60% of chosen individuals support tax reform (Table 3.2);
(b) 51.33% of those who agree on tax reform are from Party A (Table 3.3); and
(c) 77% of individuals from Party A support tax reform (Table 3.4).
The aim of this study can be summarised as investigating whether there exists any
relationship between individuals’ opinions on tax reform and their political affiliation.
This can be determined only if we know the expected frequency in each
cell. The expected frequency can be written as:
\[ E_{ij} = N \times \frac{C_j}{N} \times \frac{R_i}{N} = \frac{R_i C_j}{N} \]
where $R_i$ is the total of row i, $C_j$ is the total of column j and N is the overall total.
Table 3.5
Political Affiliation
Tax Reform Party A Party B Party C Total
For 308 190 102 600 ( R1 )
Take a look at Table 3.5. Let us say that we are interested in determining the
expected frequency for individual from Party A who agrees on tax reform, that is
(A and For). Assuming the variables are independent,
expected frequency (A and For):
\[ E_{ij} = N \times \frac{C_j}{N} \times \frac{R_i}{N} = 1000 \times \frac{400}{1000} \times \frac{600}{1000} = 240 \]
This means it is expected that 24% of individuals involved in this survey are from
Party A and agree on tax reforms. Repeat the process for all cells and the result
can be summarised in the following cross-classification table:
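The repetition over all cells can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `expected_frequencies` is an assumption made for this example, and the observed counts are those of Table 3.1.

```python
# Observed frequencies from Table 3.1
# (rows: For, Against; columns: Party A, Party B, Party C).
observed = [
    [308, 190, 102],
    [92, 160, 148],
]


def expected_frequencies(table):
    """Return E_ij = R_i * C_j / N for every cell of a two-way table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


expected = expected_frequencies(observed)
for label, row in zip(["For", "Against"], expected):
    print(label, row)
```

The printed rows should agree with the bracketed figures in the cross-classification table that follows.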
Table 3.6
Party A Party B Party C Total
For 308 190 102
(240) (210) (150) 600
Against 92 160 148
(160) (140) (100) 400
Total 400 350 250 1000
Note: (Figures in brackets are the expected frequencies)
SELF-CHECK 3.3
Construct a scatter plot on the percentage of those who agree and
disagree on the tax reform according to political affiliation. What can
you conclude?
EXERCISE 3.3
200 people have been randomly selected and classified according to the age of
the respondents (less than 30 and 30 or more) and their preference on type
of cars (locally-produced or imported). The result of this study can be
seen in the table below:
Types of Cars Preferred
Age Local-produced Imported
<30 68 42
30 or more 31 59
Check the number of rows (r) and columns (c) in the related table. Calculate the
degrees of freedom for the test, $v = (r - 1)(c - 1)$. Next, determine the critical
value (based on the Chi-Square Table), that is $\chi^2_{\alpha,v}$. Reject $H_0$ when the test
statistic value $X^2 > \chi^2_{\alpha,v}$.
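As a hedged illustration of this decision rule, the sketch below applies it to the observed and expected frequencies of Table 3.6. The test statistic is assumed to be the usual sum of (observed − expected)² / expected over all cells, the function name is made up for this example, and the critical value 5.991 for α = 0.05 with 2 degrees of freedom is typed in from a standard Chi-Square table rather than computed.

```python
# Observed and expected frequencies from Table 3.6.
observed = [[308, 190, 102], [92, 160, 148]]
expected = [[240, 210, 150], [160, 140, 100]]


def chi_square_statistic(obs, exp):
    """Sum of (O - E)^2 / E over all cells of the table."""
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(obs, exp)
        for o, e in zip(obs_row, exp_row)
    )


r, c = len(observed), len(observed[0])
df = (r - 1) * (c - 1)            # degrees of freedom, v = (r - 1)(c - 1)
x2 = chi_square_statistic(observed, expected)

CRITICAL_005_DF2 = 5.991          # chi-square table value, alpha = 0.05, v = 2
reject_h0 = x2 > CRITICAL_005_DF2

print(df, round(x2, 2), reject_h0)
```

For these data the statistic is far above the critical value, so the hypothesis of independence would be rejected at the 5% level.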
92 160 148
Against 400
(160) (140) (100)
Total 400 350 250 1000
EXERCISE 3.4
Refer to Exercise 3.2. Use the results of the exercise and carry out the
appropriate hypothesis test (use α = 0.05).
affiliation. In the Test of Homogeneity, we test the hypothesis that the population
proportions within each category are the same/homogenous. This applies when
either the row or column totals are predetermined. The data are given in the form of a
two-way contingency table, in which one classification is a variable and the
other is the population. It is important to stress that the assumptions and
statements under the null and alternative hypotheses are different but the analysis
techniques are the same. Refer to Example 3.6 below.
                     Patients’ Condition
Drug Type   No Change   Shows Improvement   Recovering   Total
A           15          22                  33           70
B           20          18                  12           50
Total       35          40                  45           120
Answer:
                     Patients’ Condition
Drug Type   No Change    Shows Improvement   Recovering    Total
A           15 (20.42)   22 (23.33)          33 (26.25)    70
B           20 (14.58)   18 (16.67)          12 (18.75)    50
Total       35           40                  45            120
Step 5: Conclusion
Hence, we can conclude that the condition of the patients depends on the type
of drug that they received.
EXERCISE 3.5
SELF-CHECK 3.4
It is important to know that the continuity correction will always cause a reduction
in the $\chi^2$ value, a fact that can be verified by inspecting the form of the formula. If the
uncorrected test value already leads to $H_0$ not being rejected, we do not have to calculate
the $\chi^2_{cc}$ value, as the correction would not change the test result.
Results Form 1
Class A Class B
Passed 72 64
Failed 17 23
Carry out a hypothesis test to determine whether there exists any difference
between examination results for the two classes using Yates’ continuity
correction. Use 5% significance level.
Answer:
Step 1: State the null and alternative hypotheses
H0: There is no difference in results for students in Class A and B.
H1: Otherwise.
(b) Hence,
\[ \chi^2_{cc} = \frac{(|72 - 68.8| - 0.5)^2}{68.8} + \dots + \frac{(|23 - 19.8| - 0.5)^2}{19.8} = 0.94. \]
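The calculation can be checked with a short Python sketch. The function name `yates_chi_square` is an assumption made for this illustration. Because the code keeps the expected frequencies unrounded, it gives roughly 0.96 rather than the 0.94 obtained above from expected frequencies rounded to one decimal place; the conclusion of the test is unaffected.

```python
def yates_chi_square(table):
    """Chi-square statistic with Yates' continuity correction for a two-way table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    total = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            # Shrink each absolute difference by 0.5 before squaring.
            total += (abs(observed - expected) - 0.5) ** 2 / expected
    return total


# Rows: Passed, Failed; columns: Class A, Class B.
results = [[72, 64], [17, 23]]
x2_cc = yates_chi_square(results)
print(round(x2_cc, 3))
```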
EXERCISE 3.6
                  Factor I
                  1      2      Total
Factor II   A     a      b      nA
            B     c      d      nB
Total             n1     n2     n
SELF-CHECK 3.5
EXERCISE 3.7
Category
1 2 3 4 Total
Population 1 16 38 5 41 100
2 24 41 12 23 100
3 19 36 15 30 100
Do visit these websites to find out further information on the application of:
contingency table test
http://www.graphpad.com/quickcalcs/contingency1.cfm
goodness-of-fit test
http://www.sportsci.org/resource/stats/modelsdetail.html
4
LEARNING OUTCOMES
By the end of this topic, you should be able to:
1. Explain the correlation concepts between two variables;
2. Use the two-way scatter plot to show the relationship between two
variables;
3. Compute correlation coefficient to show the relationship between two
variables;
4. State the applications of Pearson and Spearman correlation coefficients;
and
5. Perform correlation coefficient significance test.
INTRODUCTION
Correlation is a measurement of the strength of a linear relationship between two
variables. Both variables are usually denoted as X and Y and their distributions
approximate to normal distribution. There are three types of relationships between
X and Y: positive linear correlation, negative linear correlation and no correlation.
Positive linear correlation means that as one variable increases, the other
variable tends to increase linearly. On the other hand, negative linear correlation
means that one variable tends to decrease linearly as the other variable increases.
No correlation indicates that there is no linear relationship between the variables.
Try to identify the type of correlation (positive, negative and none) that we can
expect from the following:
(a) Students’ grade and their height;
(b) An individual’s weight and the cholesterol level in the blood;
(c) The amount of ice-cream sold and ambient temperature; and
(d) Price of rubber and amount of rainfall.
We can categorise the relationship between two variables into four conditions:
(a) Perfect
(b) Strong
(c) Weak
(d) No linear relationship
The linear relationship for these four conditions can be clearly visualised by using
a graphical display, usually the Two-Way Scatter Plot. However, judgment
based on the graph is very subjective and at times, not accurate. As such, to
accurately determine the condition of this relationship, a quantitative
measurement known as correlation coefficient is deployed. It is usually denoted
by ρ for population, which is usually unknown. This population parameter is
estimated by the sample correlation coefficient, r. The value of r always lies
between –1 and 1, i.e.
–1.00 ≤ r ≤ +1.00
The following Table 4.1 gives an explanation on the r values for each of the four
conditions on the relationship between two variables.
There are two types of sample correlation coefficients, which are, the Pearson
correlation coefficient and Spearman correlation coefficient. Their application
depends on the types of data. The Pearson correlation coefficient is used for
quantitative data; in discrete and continuous form. On the other hand, the
Spearman correlation coefficient is used for ranking data, hence the name
Spearman Rank correlation coefficient is sometimes used.
SELF-CHECK 4.1
Figures 4.2(a) and 4.2(b) display strong positive and negative linear relationships
respectively. The relationship is said to be strong as most points fall near the straight line.
Figure 4.2(a): Strong positive linear correlation
Figure 4.2(b): Strong negative linear correlation
Figures 4.3(a) and 4.3(b) show weak positive and negative linear relationships
respectively. The relationship is said to be weak as most points fall far from the straight line.
Figure 4.3(a): Weak positive linear correlation
Figure 4.3(b): Weak negative linear correlation
Let us try the following exercise on the application of the two-way scatter plot.
EXERCISE 4.1
For each pair of the following two-way scatter plots, identify which one
(between 1 and 2) has a higher value of the correlation coefficient r and
state its direction:
(a)
(b)
(c)
$s_y$ = standard deviation of Y = $\sqrt{\dfrac{\sum (y_i - \bar{y})^2}{n - 1}}$
xi      yi      xi·yi      xi²      yi²
x1      y1      x1·y1      x1²      y1²
x2      y2      x2·y2      x2²      y2²
x3      y3      x3·y3      x3²      y3²
.       .       .          .        .
.       .       .          .        .
.       .       .          .        .
xn      yn      xn·yn      xn²      yn²
Σxi     Σyi     Σxi·yi     Σxi²     Σyi²
Answer:
To assess whether there exists a negative linear relationship between time spent (in
hours per week) and students’ examination marks, we are going to calculate the
Pearson correlation coefficient, $r_p$. In this problem, we can define time spent on
computer games as the X variable (independent variable) and students’ examination
marks as the Y variable (dependent variable). Next, we can construct the following
table:
xi yi xi yi xi2 yi2
4 26 104 16 676
10 17 170 100 289
14 7 98 196 49
12 12 144 144 144
4 30 120 16 900
5 40 200 25 1600
8 20 160 64 400
11 15 165 121 225
13 10 130 169 100
15 5 75 225 25
96 182 1366 1076 4408
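\[ r_p = \frac{n\sum x_i y_i - (\sum x_i)(\sum y_i)}{\sqrt{\left[n\sum x_i^2 - (\sum x_i)^2\right]\left[n\sum y_i^2 - (\sum y_i)^2\right]}} \]
\[ r_p = \frac{10(1366) - 96(182)}{\sqrt{\left[10(1076) - (96)^2\right]\left[10(4408) - (182)^2\right]}} = -0.927 \]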
The Pearson correlation coefficient value –0.927 shows that there exists a strong
negative linear relationship between time spent on computer games and
students’ examination marks. Hence, we can conclude that students who spend a
large amount of time playing computer games tend to obtain lower examination
marks.
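The hand calculation can be reproduced with the following Python sketch, which uses the same computational formula for the Pearson coefficient. The variable and function names (`hours`, `marks`, `pearson_r`) are chosen for this illustration only.

```python
from math import sqrt

# Hours per week on computer games (x) and examination marks (y).
hours = [4, 10, 14, 12, 4, 5, 8, 11, 13, 15]
marks = [26, 17, 7, 12, 30, 40, 20, 15, 10, 5]


def pearson_r(x, y):
    """Pearson correlation coefficient from the raw sums."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


r_p = pearson_r(hours, marks)
print(round(r_p, 3))  # -0.927
```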
EXERCISE 4.2
A farmer would like to know whether a new fertiliser that he uses is
effective in increasing his crop production. He recorded the frequency of
using the fertiliser at seven areas in his farm and the production results of
his crop in each of those areas. The following data is obtained:
Obtain the correlation between fertiliser and crop production. Give your
opinion on the value of the correlation coefficient obtained.
H0 : ρ = 0
H1 : 1. ρ > 0
     2. ρ < 0
     3. ρ ≠ 0
Test Statistic : \[ T = r_p \sqrt{\frac{n-2}{1 - r_p^2}} \]
Test Result : T follows a t distribution with v = n – 2 degrees of freedom
at significance level α.
Reject H0 when : 1. $T > t_{\alpha,v}$
                 2. $T < -t_{\alpha,v}$
                 3. $|T| > t_{\alpha/2,v}$
Answer:
We will use one-tailed hypothesis testing since the sample correlation
coefficient $r_p$ is negative. Hence,
H0 : ρ = 0
H1 : ρ < 0
Test Statistic:
\[ T = r_p \sqrt{\frac{n-2}{1 - r_p^2}} = -0.927\sqrt{\frac{10-2}{1 - (-0.927)^2}} = -6.99 \]
Reject H0 when:
$T < -t_{0.05,8} = -1.86$
Since $T = -6.99 < -t_{0.05,8} = -1.86$, we reject the null hypothesis and we have
strong evidence to conclude that ρ < 0. This shows that there exists a
significant negative linear relationship at the 5% significance level. If a student
spends most of his time playing computer games, less time is left for revision,
which is consistent with the poorer academic performance.
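A minimal Python sketch of this significance test is given here. The function name `correlation_t_statistic` is an assumption for illustration, and the critical value 1.86 is the one read from the t table in the worked answer (α = 0.05, v = 8), not computed by the code.

```python
from math import sqrt


def correlation_t_statistic(r, n):
    """T = r * sqrt((n - 2) / (1 - r^2)), with n - 2 degrees of freedom."""
    return r * sqrt((n - 2) / (1 - r ** 2))


t_stat = correlation_t_statistic(-0.927, 10)

T_CRITICAL = 1.86                 # t table value for alpha = 0.05, v = 8
reject_h0 = t_stat < -T_CRITICAL  # left-tailed test, H1: rho < 0

print(round(t_stat, 2), reject_h0)
```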
EXERCISE 4.3
\[ r_s = 1 - \frac{6\sum D^2}{n(n^2 - 1)} \tag{4.4} \]
with D = U – V that is the difference between rank U and rank V. The calculation
process using Equation (4.4) can be further simplified if the values are placed in a
table such as the following:
xi      ui      yi      vi      Di      Di²
x1      u1      y1      v1      D1      D1²
x2      u2      y2      v2      D2      D2²
x3      u3      y3      v3      D3      D3²
.       .       .       .       .       .
.       .       .       .       .       .
.       .       .       .       .       .
xn      un      yn      vn      Dn      Dn²
                                        ΣD²
Answer:
Firstly, we need to determine the X and Y variables. Define total subject scores by
male students as X variable and Y variable as total subject scores by female
students. Prior to obtaining the Spearman rank correlation coefficient rs , we need
to convert the data into rank form. In deciding on the ranks, the scores can be
arranged in descending order, that is the highest score is given rank 1, the second
highest is rank 2 and the lowest score is rank 10. The subjects’ rank for males (U)
and females (V) along with their differences (D) are displayed in the following
table:
xi      ui      yi      vi      Di      Di²
45      1       34      5       -4      16
33      5       15      10      -5      25
20      8       30      6       2       4
24      7       38      4       3       9
43      2       26      7       -5      25
39      3       45      2       1       1
13      10      17      9       1       1
36      4       20      8       -4      16
28      6       48      1       5       25
15      9       40      3       6       36
                                ΣD² =   158
\[ r_s = 1 - \frac{6\sum D^2}{n(n^2 - 1)} = 1 - \frac{6(158)}{10(10^2 - 1)} = 0.0424 \]
The Spearman correlation coefficient value is 0.0424. This means that there is
almost no linear relationship between the opinion of male and female students on
the level of difficulty of the subjects that they took.
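The ranking and the coefficient can be reproduced with the Python sketch below. It assumes, as in this example, that there are no tied scores; the helper names `descending_ranks` and `spearman_rs` are made up for illustration.

```python
def descending_ranks(values):
    """Rank 1 for the highest value; assumes there are no ties."""
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 for v in values]


def spearman_rs(x, y):
    """Spearman rank correlation, r_s = 1 - 6*sum(D^2) / (n(n^2 - 1))."""
    u, v = descending_ranks(x), descending_ranks(y)
    n = len(x)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(u, v))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


# Total subject scores given by male (x) and female (y) students.
male = [45, 33, 20, 24, 43, 39, 13, 36, 28, 15]
female = [34, 15, 30, 38, 26, 45, 17, 20, 48, 40]

r_s = spearman_rs(male, female)
print(round(r_s, 4))  # 0.0424
```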
SELF-CHECK 4.2
EXERCISE 4.4
Ten athletes were given ranking at the beginning of any sports match that
they took part in. After the match, their position in the match was
recorded as in the table below:
Athlete                 1   2   3   4   5    6   7   8   9   10
Ranking                 1   2   3   4   5    6   7   8   9   10
Position in the match   3   5   2   1   10   4   9   7   8   6
Obtain the correlation between ranking and their position in the match.
Comment on the value of the correlation coefficient.
H0 : ρs = 0
H1 : 1. ρs > 0
     2. ρs < 0
     3. ρs ≠ 0
Test Statistic : \[ T = r_s \sqrt{\frac{n-2}{1 - r_s^2}} \]
Test Result : T follows a t distribution with v = n – 2 degrees of freedom at
significance level α.
Reject H0 when : 1. $T > t_{\alpha,v}$
                 2. $T < -t_{\alpha,v}$
                 3. $|T| > t_{\alpha/2,v}$
Answer:
We will use a one-sided hypothesis test as the sample correlation
coefficient $r_s$ is positive. As such,
H0 : ρs = 0
H1 : ρs > 0
Test Statistic:
\[ T = r_s \sqrt{\frac{n-2}{1 - r_s^2}} = 0.0424\sqrt{\frac{10-2}{1 - (0.0424)^2}} = 0.12 \]
Since $T = 0.12 < t_{0.05,8} = 1.86$, we do not reject the null hypothesis (ρs = 0).
Hence, there is not enough evidence to conclude that the population parameter is
greater than zero. In other words, the data show no significant relationship between
the opinions of male and female students.
EXERCISE 4.5
After studying this topic, do you know when the Pearson and Spearman
coefficients are used? If you are still unclear, please reread Topic 4 carefully.
When you have understood, let us try the exercises below.
EXERCISE 4.6
1. State the importance of the two-way scatter plot.
2. State the importance of the correlation coefficient sign r.
3. For each of the statements below, state whether the Pearson or
Spearman correlation coefficient should be employed to find the
relationship between two variables.
(a) A school principal would like to know whether the quality of
teaching received by students will affect their grades.
(b) The relationship between time spent on study revision and
grades obtained by students.
(c) A landlord’s claim that the rate of house rental depends on the
number of rooms in a house.
(d) Will getting a good CGPA guarantee a high starting salary?
(e) A manager at a firm would like to find out the relationship
between his employees’ aptitude test scores taken prior to
joining the firm and their work performance three months after
they joined.
4. A bank would like to reduce the waiting time of its customers at the
counter. For this purpose, the bank would like to know the relationship
between average waiting time (Y) and number of tellers at the counter
(X). Several customers were chosen at random and their data is
tabulated as below:
X 4 1 5 3 4 3 3 2 2 6 3 2 4
Y 6.4 8.7 3.2 10.5 8.2 11.3 11.3 12.8 11.6 3.2 9.4 12.8 8.2
Salesman         Ali   Jay   Mus   Tan   Boi   Mat   Lia   Goh   Wan   Zek
Interview rank   5     3     1     9     6     4     10    2     7     8
Test score       50    68    45    68    78    68    60    56    76    72
Sales (‘000)     17    32    27    46    55    45    36    28    18    66
(a) Find the correlation coefficient value for both relationships and
comment on the correlation coefficient value obtained.
(c) Based on your answer in (a) and (b), what are the criteria needed
to get a salesman who will bring profit to the company?
Please visit the following website to find more information about correlation and
regression:
http://www.pinkmonkey.com/studyguides/subjects/stats/chap6/s0606101.asp
This topic has explained how the relationship between two variables can be
derived.
A two-way scatter plot can be used to roughly show the relationship between
two variables, whether it is positive linear, negative linear or there is no
relationship.
INTRODUCTION
In Topic 4, we learned how to visually check for the relationship between two
variables using the two-way scatter plot as well as how to measure the strength of
this relationship using correlation. If a relationship exists, we would like to know
the meaning of the relationship. Once we have determined the relationship in
terms of equation, we will be able to predict the value of a variable given the
value of the other variable. In this topic, we use a statistical method called
Simple Linear Regression to examine a linear relationship between two
variables. Only quantitative variables are considered in this case.
SELF-CHECK 5.1
y = β 0 + β1 x + ε
A property development manager would like to know the estimated selling price
for each house that will be built. He knows that the cost of building a house is
RM90 for each square foot and the land price is RM25,000 for an area of 4,500
square feet. Hence, the manager can estimate the selling price using the equation
below:
y = 25,000 + 90x (5.1)
where y = selling price and x = house size in square feet. If the house is 2,000
square feet, the price would be RM205,000, that is
y = 25,000 + 90(2,000) = 205,000
However, this is only an estimated price. The actual price (based on observation)
would be between RM180,000 and RM250,000. For this reason, to reflect the
actual situation, another simple linear regression model replaces the previous
model, that is:
y = 25,000 + 90x + ε (5.2)
where ε is a random variable for errors representing all other variables which are
not considered in equation (5.1). In other words, the selling price for the same size
will also differ due to other factors such as location, number of bedrooms, toilets
and other unknown factors.
\[ \hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x \tag{5.3} \]
Here, $\hat{y}$ is the predicted/fitted value for y, $\hat{\beta}_0$ is the estimate of the population
parameter $\beta_0$ and $\hat{\beta}_1$ is the estimate of the population parameter $\beta_1$. The
estimation model (5.3) is a linear equation with $\hat{\beta}_1$ as the regression
slope and $\hat{\beta}_0$ as the y-intercept, which is the y value when x is zero
(refer to Figure 5.1). However, in most cases, when x = 0, the y value does not carry
any significant meaning and at times x = 0 is not possible. The slope of a straight
line is a fixed value that describes the change (increase or decrease) in the y value
for a one-unit change in the x value.
Errors (refer to Figure 5.1) are obtained from the difference between the observed
y values and the fitted $\hat{y}$ values. The error is denoted by $\varepsilon_i$ for i = 1, 2, …, n and the formula
is:
\[ \varepsilon_i = y_i - \hat{y}_i \tag{5.4} \]
The residual $\varepsilon_i$ is a random variable. To determine whether a calculated simple
linear regression is a good estimate for the population, we need to ensure that the
random variable $\varepsilon_i$ satisfies certain conditions. The assumptions made on the
random variable $\varepsilon_i$ are:
(a) $\varepsilon_i$ is normally distributed; that is, $\varepsilon_i \sim N(0, \sigma^2)$, i = 1, 2, …, n.
EXERCISE 5.1
Given regression equation ŷ = –12.84 + 36.18x, state the values of $\hat{\beta}_0$ and
$\hat{\beta}_1$ and explain both values. Next, calculate the residuals using the
following data:
x 8.3 8.3 12.1 12.1 17.0 17.0 17.0 24.3 24.3 24.3 33.6
y 227 312 362 521 640 539 728 945 738 759 1263
When the straight line fails to capture all the data (point (x,y) on the graph), what
must we do to obtain the best straight line? This best straight line refers to the
fitted straight line that we build in the two-way scatter plot that best represents the
relationship between the two variables. This fitted line would be a straight line
that is close to points (x,y) and when the errors between the points on the straight
line (estimated) and actual observed points are minimised. However, the total
errors $\sum_i \varepsilon_i$ do not represent the distance between the estimated and observed points.
With reference to Figures 5.3(a) and 5.3(b), we can see that the positions of the
two data sets [data (a) and (b)] are different. The total errors for data (a) and data
(b) are calculated as:
Data (a)         Data (b)
8 – 6 = 2        7 – 6 = 1
1 – 5 = –4       6 – 5 = 1
6 – 4 = 2        2 – 4 = –2
Σεi = 0          Σεi = 0
Total errors are zero for both data (a) and (b), and this always holds. Taken at face
value, this would suggest that the distance of data points (a) and (b) from the
regression line is the same. However, from both graphs in Figure 5.3, we can see that
this is not true. The positions of data points (a) and (b) relative to the regression
line differ: data points (b) are closer to the regression line than data points
(a). Hence, $\sum \varepsilon_i$ is not suitable to be used as a selection criterion.
So, how can we solve this problem? It can be solved if we square each error
before summing them up. The following table shows the values of $(y_i - \hat{y}_i)^2$ for
data (a) and (b):
Data (a)           Data (b)
(8 – 6)² = 4       (7 – 6)² = 1
(1 – 5)² = 16      (6 – 5)² = 1
(6 – 4)² = 4       (2 – 4)² = 4
Σ(εi)² = 24        Σ(εi)² = 6
Based on the $\sum (y_i - \hat{y}_i)^2$ values for both data (a) and (b), the sum
of squared errors for data (b) is smaller than that for data (a). This shows that the
points for data (b) are nearer to the regression line, so this line is the better fit.
The method of obtaining the best fitted line by minimising the sum of squared errors is
known as the least squares method.
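The contrast between the two criteria can be illustrated with a short Python sketch. The observed and fitted values are the ones used in the comparison of data (a) and data (b); the variable names are assumptions made for this example.

```python
# Observed values and the corresponding fitted values on the regression line.
fitted = [6, 5, 4]
data_a = [8, 1, 6]
data_b = [7, 6, 2]


def errors(observed, fitted_values):
    """Residuals e_i = y_i - y_hat_i."""
    return [y - y_hat for y, y_hat in zip(observed, fitted_values)]


sum_errors_a = sum(errors(data_a, fitted))
sum_errors_b = sum(errors(data_b, fitted))
sse_a = sum(e ** 2 for e in errors(data_a, fitted))
sse_b = sum(e ** 2 for e in errors(data_b, fitted))

print(sum_errors_a, sum_errors_b)  # 0 0   -> cannot separate the two data sets
print(sse_a, sse_b)                # 24 6  -> data (b) lies closer to the line
```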
To fit the regression line, we need to get the estimates for regression coefficients
$\beta_0$ and $\beta_1$. Using the least squares method, the formula for regression coefficient
$\hat{\beta}_1$ is:
\[ \hat{\beta}_1 = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2} \tag{5.8} \]
where
$\hat{\beta}_1$ = Estimated value of regression coefficient $\beta_1$
$x_i$ = Value of independent variable
$y_i$ = Value of dependent variable
$\bar{x}$ = Mean value of independent variable
$\bar{y}$ = Mean value of dependent variable
n = Number of (x,y) pairs
After getting the estimate for $\beta_1$, we can derive the value for $\beta_0$. The formula to
get $\hat{\beta}_0$ is:
\[ \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \tag{5.9} \]
where
$\hat{\beta}_0$ = Estimated value of regression coefficient $\beta_0$
Answer:
To facilitate the calculation of parameter values $\hat{\beta}_0$ and $\hat{\beta}_1$, we can form the
following table.
xi yi xi yi xi2 yi2
3 33 99 9 1089
7 38 266 49 1444
6 24 144 36 576
6 61 366 36 3721
10 52 520 100 2704
12 45 540 144 2025
12 29 348 144 841
12 65 780 144 4225
13 82 1066 169 6724
13 63 819 169 3969
14 50 700 196 2500
15 79 1185 225 6241
Σxi = 123    Σyi = 621    Σxi·yi = 6833    Σxi² = 1421    Σyi² = 36059
\[ \hat{\beta}_1 = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2} = \frac{6833 - 12(10.25)(51.75)}{1421 - 12(10.25)^2} = 2.92 \]
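The table-based calculation can be cross-checked with the Python sketch below, which applies equations (5.8) and (5.9) to the twelve (x, y) pairs in the table. The function name `least_squares` is an assumption for illustration.

```python
# The twelve (x, y) pairs from the worked table.
x = [3, 7, 6, 6, 10, 12, 12, 12, 13, 13, 14, 15]
y = [33, 38, 24, 61, 52, 45, 29, 65, 82, 63, 50, 79]


def least_squares(x, y):
    """Return (b0, b1) from equations (5.9) and (5.8)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    b1 = (sum_xy - n * x_bar * y_bar) / (sum_x2 - n * x_bar ** 2)
    b0 = y_bar - b1 * x_bar
    return b0, b1


b0, b1 = least_squares(x, y)
print(round(b0, 2), round(b1, 2))
```

The slope rounds to 2.92. The unrounded intercept is about 21.83; using the rounded slope 2.92 in equation (5.9) gives 21.82.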
EXERCISE 5.2
x 60 62 64 65 66 67 68 70 72 74
y 63.6 65.2 66.0 65.5 66.9 67.1 67.4 68.3 70.1 70
SELF-CHECK 5.2
We are going to test the parameter for the population regression slope $\beta_1$ using the
$\hat{\beta}_1$ regression coefficient. The hypothesis testing process for testing population
parameter $\beta_1$ is similar to that of testing mean and variance. We will begin with a
hypothesis statement. The null hypothesis claims that there is no linear
relationship, which means the slope of the regression line is zero. If we do not reject the
null hypothesis, this means the population regression line may be a horizontal straight line,
so that the y value does not change with changes in the x value. In this case,
information on x does not help in predicting the y value. On the other hand,
if the null hypothesis is rejected, there is enough evidence to say $\beta_1$ is not zero,
that is either $\beta_1 > 0$ or $\beta_1 < 0$. This shows that the regression line has a tendency to
increase or decrease and this helps in predicting the y value using the x value.
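H0 : $\beta_1 = 0$
H1 : 1. $\beta_1 > 0$
     2. $\beta_1 < 0$
     3. $\beta_1 \neq 0$
Test Statistic : \[ T = \frac{\hat{\beta}_1 - \beta_1}{s(\hat{\beta}_1)} = \frac{\hat{\beta}_1}{s(\hat{\beta}_1)} \]
Test Result : T follows a t distribution with v = n – 2 degrees of freedom at
significance level α.
Reject H0 when : 1. $T > t_{\alpha,v}$
                 2. $T < -t_{\alpha,v}$
                 3. $|T| > t_{\alpha/2,v}$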
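$s(\hat{\beta}_1)$ is the standard deviation for $\hat{\beta}_1$. The formula to get the standard deviation
for $\hat{\beta}_1$ is:
\[ s(\hat{\beta}_1) = \sqrt{\frac{\left(\sum y_i^2 - \hat{\beta}_0 \sum y_i - \hat{\beta}_1 \sum x_i y_i\right) / (n - 2)}{\sum x_i^2 - n\bar{x}^2}} \]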
Apart from hypothesis testing, we can also construct a confidence interval for $\beta_1$.
A confidence interval provides a range that contains the value of the
population parameter at a certain confidence level. Based on the T test statistic (two-sided) that
follows the $t_{n-2}$ distribution, we can construct a (1 – α) 100% confidence interval as
below:
\[ \hat{\beta}_1 - t_{\alpha/2,\,n-2}\, s(\hat{\beta}_1) \le \beta_1 \le \hat{\beta}_1 + t_{\alpha/2,\,n-2}\, s(\hat{\beta}_1) \]
Reject H0 when
|T| > t0.025,10 = 2.228
Prior to obtaining the test statistic value, we need to calculate the value of
$s(\hat{\beta}_1)$.
\[ s(\hat{\beta}_1) = \sqrt{\frac{2556.946 / 10}{1421 - 12(10.25)^2}} = 1.26 \]
Since the test statistic (T = 2.317) > 2.228 ($t_{0.025,10}$), we reject the null
hypothesis. Hence, we can conclude that $\beta_1$ is not zero; that is, there is enough
evidence of the existence of a linear relationship between x and y.
This confidence interval shows that, with 95% confidence, the y value increases by
between 0.113 and 5.727 for each one-unit increase in x. The wide range for $\beta_1$ is
due to the small sample size.
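The slope test and the confidence interval can be sketched in Python as follows. All names are chosen for this illustration, and the critical value 2.228 is the t table value used in the text. Because the code carries unrounded intermediate values, it gives T ≈ 2.31 and an interval of roughly (0.10, 5.73), slightly different from the figures obtained by hand with the rounded values 2.92 and 1.26; the conclusion is the same.

```python
from math import sqrt

x = [3, 7, 6, 6, 10, 12, 12, 12, 13, 13, 14, 15]
y = [33, 38, 24, 61, 52, 45, 29, 65, 82, 63, 50, 79]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
sxx = sum((a - x_bar) ** 2 for a in x)                      # sum x^2 - n*x_bar^2
sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))

b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

# Sum of squared residuals and the standard deviation of the slope estimate.
sse = sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x, y))
s_b1 = sqrt(sse / (n - 2) / sxx)

t_stat = b1 / s_b1
T_CRITICAL = 2.228                       # t table value, alpha/2 = 0.025, v = 10
reject_h0 = abs(t_stat) > T_CRITICAL

lower = b1 - T_CRITICAL * s_b1
upper = b1 + T_CRITICAL * s_b1
print(round(s_b1, 2), round(t_stat, 2), reject_h0)
print(round(lower, 2), round(upper, 2))
```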
EXERCISE 5.3
ŷ
Figure 5.4
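\[ R^2 = \frac{\hat{\beta}_0 \sum y_i + \hat{\beta}_1 \sum x_i y_i - n\bar{y}^2}{\sum y_i^2 - n\bar{y}^2} \tag{5.10} \]
\[ R^2 = \frac{\hat{\beta}_0 \sum y_i + \hat{\beta}_1 \sum x_i y_i - n\bar{y}^2}{\sum y_i^2 - n\bar{y}^2} \]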
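As an illustration only, the sketch below evaluates formula (5.10) in Python. It assumes that the formula is applied to the column totals of the twelve-observation least squares example in this topic (Σx = 123, Σy = 621, Σxy = 6833, Σx² = 1421, Σy² = 36059); the variable names are made up for this example.

```python
# Column totals taken from the twelve-observation least squares table.
n = 12
sum_x, sum_y = 123, 621
sum_xy, sum_x2, sum_y2 = 6833, 1421, 36059

x_bar, y_bar = sum_x / n, sum_y / n

# Least squares estimates, equations (5.8) and (5.9).
b1 = (sum_xy - n * x_bar * y_bar) / (sum_x2 - n * x_bar ** 2)
b0 = y_bar - b1 * x_bar

# Coefficient of determination, equation (5.10).
r_squared = (b0 * sum_y + b1 * sum_xy - n * y_bar ** 2) / (sum_y2 - n * y_bar ** 2)
print(round(r_squared, 3))
```

Under these assumptions the fitted line explains roughly 35% of the variation in y.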
SELF-CHECK 5.3
EXERCISE 5.4
The following are a few graphs that show deviations from assumptions made.
Figure 5.5 is a plot of $\varepsilon_i$ versus the fitted values $\hat{y}_i$ or $x_i$ to determine whether
the linearity assumption is met or not. The plotted data form a curve and hence, we can
conclude that a linear model is not adequate: the underlying relationship is non-linear.
Figure 5.5: Model is non-linear (the residuals show curvature)
Figure 5.6 (plot of $\varepsilon_i$ versus the fitted values $\hat{y}_i$ or $x_i$) shows deviation of the model
from the assumption that the random errors have constant variance. The plot of the data shows a
bell-shaped pattern. This means that, instead of having constant variance, the random
errors have a spread that changes with the $\hat{y}$ values. The random errors have
constant variance if the graph shows a random pattern or no trend.
SELF-CHECK 5.4
EXERCISE 5.5
Exponential: $y = \beta_0 e^{\beta_1 x}$; $y^* = \ln y$; $y^* = \ln \beta_0 + \beta_1 x$
Inverse: $x^* = 1/x$; $y = \beta_0 + \beta_1 x^*$
A two-way scatter plot is very useful to ascertain whether a model has a linear or
non-linear form. Hence, it is good to know the shape of Exponential, Power,
Inverse and Hyperbolic functions (refer Figure 5.8). Observe Figures 5.1 and 5.8
on the chosen transformations.
EXERCISE 5.6
Draw a two-way plot on the following regression models. Then perform
transformation on the models before obtaining the linear regression
models.
1
(a) y 2.67 0.68
x
(b) $y = 2e^{3.1x}$
(c) $y = 1.5x^{0.85}$
x
(d) y
0.4 2 x
where y = selling price and x = house size (in square foot). The x values are in
between xa and xb . If we would like to predict the selling price of a house where
the built-up area is 2,000 square feet, where the value 2,000 > xa , we can use the
regression model with x value = 2,000. Based on the regression equation, the
manager can predict that the selling price for each house with 2,000 square feet is
RM205,000.
However, this selling price is a forward estimation and it does not explain the
position of that value with respect to the actual selling price. In other words, is the
estimation value close to the actual value or very different? This relates to the
reliability aspect of certain predictions. To get information on the position of
estimation values versus actual values, we need to use intervals. There are two
types of intervals used – prediction interval for any dependent variable y and
estimation interval for an estimated value of y.
\[ \hat{y} \pm t_{\alpha/2}\, s \sqrt{1 + \frac{1}{n} + \frac{(x_g - \bar{x})^2}{\sum (x_i - \bar{x})^2}} \tag{5.12} \]
significance level
\[ s = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n - 2}} \tag{5.13} \]
To get the standard error of the estimate, we need the $\hat{y}$ values. These can be generated
using the regression model ŷ = 21.82 + 2.92x. The data are as in the table:
x 3 7 6 6 10 12
y 33 38 24 61 52 45
x 12 12 13 13 14 15
y 29 65 82 63 50 79
Hence,
\[ s = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n - 2}} = \sqrt{\frac{2556.946}{10}} = 15.99 \]
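Since α = 0.05, $t_{\alpha/2} = t_{0.025} = 2.228$. Thus, the 95% prediction interval for $x_g = 20$
is
\[ \hat{y} \pm t_{\alpha/2}\, s \sqrt{1 + \frac{1}{n} + \frac{(x_g - \bar{x})^2}{\sum (x_i - \bar{x})^2}} = 80.22 \pm (2.228)(15.99)\sqrt{1 + \frac{1}{12} + \frac{(20 - 10.25)^2}{160.25}} = 80.22 \pm 46.13 \]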
The lower and upper limits of the prediction interval are 34.09 and 126.35
respectively. This shows that the minimum predicted sales is 34 units and the
maximum is 126 units when 20 advertisements are broadcast on radio.
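The prediction interval can be recomputed from the raw data with the Python sketch below. The names are illustrative, and the critical value 2.228 is taken from the t table as in the text. Unrounded coefficients are used, so the centre of the interval comes out as about 80.21 rather than 80.22.

```python
from math import sqrt

x = [3, 7, 6, 6, 10, 12, 12, 12, 13, 13, 14, 15]
y = [33, 38, 24, 61, 52, 45, 29, 65, 82, 63, 50, 79]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
sxx = sum((a - x_bar) ** 2 for a in x)
b1 = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / sxx
b0 = y_bar - b1 * x_bar

# Standard error of the estimate, equation (5.13).
s = sqrt(sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x, y)) / (n - 2))

x_g = 20
T_CRITICAL = 2.228                     # t table value, alpha/2 = 0.025, v = 10
y_hat = b0 + b1 * x_g
# Half-width of the prediction interval, equation (5.12).
margin = T_CRITICAL * s * sqrt(1 + 1 / n + (x_g - x_bar) ** 2 / sxx)

print(round(y_hat - margin, 2), round(y_hat + margin, 2))
```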
EXERCISE 5.7
Refer to data in Exercise 5.2. Calculate the 99% prediction interval for x =
86 and provide an explanation for it.
Hence, to estimate a mean value of y, given any xg value, we can use the
following interval:
\[ \hat{y} \pm t_{\alpha/2}\, s \sqrt{\frac{1}{n} + \frac{(x_g - \bar{x})^2}{\sum (x_i - \bar{x})^2}} \tag{5.14} \]
This interval applies when the specific value of x lies within the range of the observed
independent variable x values, i.e. between $x_a$ and $x_b$.
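\[ \hat{y} \pm t_{\alpha/2}\, s \sqrt{\frac{1}{n} + \frac{(x_g - \bar{x})^2}{\sum (x_i - \bar{x})^2}} = 53.94 \pm (2.228)(15.99)\sqrt{\frac{1}{12} + \frac{(11 - 10.25)^2}{160.25}} = 53.94 \pm 10.50 \]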
Note that the lower and upper confidence limits for the mean value of y are 43.44
and 64.44 respectively. This shows that the minimum mean sales is 43 units
while the maximum is 64 units when 11 radio advertisements are broadcasted.
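The same arithmetic can be checked with a small Python sketch that plugs the summary values quoted in the text (ŷ = 21.82 + 2.92x, s = 15.99, n = 12, x̄ = 10.25, Σ(xi − x̄)² = 160.25, t = 2.228) into equation (5.14). The function name `mean_confidence_interval` is an assumption for illustration.

```python
from math import sqrt

# Summary values quoted in the worked example.
B0, B1 = 21.82, 2.92      # fitted regression coefficients
S = 15.99                 # standard error of the estimate
N = 12                    # number of observations
X_BAR = 10.25             # mean of x
SXX = 160.25              # sum of (x_i - x_bar)^2
T_CRITICAL = 2.228        # t table value, alpha/2 = 0.025, v = 10


def mean_confidence_interval(x_g):
    """Confidence interval for the mean of y at x_g, equation (5.14)."""
    y_hat = B0 + B1 * x_g
    margin = T_CRITICAL * S * sqrt(1 / N + (x_g - X_BAR) ** 2 / SXX)
    return y_hat - margin, y_hat + margin


lower, upper = mean_confidence_interval(11)
print(round(lower, 2), round(upper, 2))  # 43.44 64.44
```

Note that this interval is much narrower than the prediction interval for a single observation, because the extra "1 +" term under the square root is absent.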
EXERCISE 5.8
Refer to Exercise 5.2. Calculate the 99% confidence interval for mean y
when x = 69 and provide an explanation on it.
EXERCISE 5.9
Investment (RM Million)   1.8   1.8    2.1   2.2   2.8    3.1   3.6   4.1
Interest Rate (%)         9.9   10.5   9.6   9.8   12.1   9.2   9.5   7.7
4. Many people assume that the total amount of money saved depends
on their total income. The following data shows the average saving
per month (RM’00) and average income per month (RM’00) for
various groups of employees.
Income 19 22 27 30 36 43 47 51 61 64
Saving 1.0 1.4 1.8 2.4 3.0 3.8 4.3 4.5 5.8 6.3
Please visit the following websites to read more on simple linear regression:
http://www.pinkmonkey.com/studyguides/subjects/stats/chap8/s0808n01.asp
The least squares method is used to get parameter estimates for slope of
regression line and intercept on y-axis.
If the simple linear regression model obtained is adequate for the data, the
model can be used to estimate a dependent variable value for any specific
independent variable value. This model can also be used for prediction of a
mean value of dependent variables.
INTRODUCTION
Most practical applications of regression analysis utilise models that are more
complex than the Simple Linear Regression discussed in Topic 5. In the previous
topic, only two variables, a dependent variable (Y) and an independent variable
(X), are considered. In Simple Linear Regression, it is important to fully
understand the following items:
Yˆ = βˆ 0 + βˆ1 X (6.2a)
(a) It can be understood from model (6.2) that Y is a random variable which
follows a certain population distribution. The errors ε are assumed to be independent of
each other for any Y value, with zero mean. From the above equations, for a
given X value, the expectation is
\[ E(Y \mid X) = \beta_0 + \beta_1 X \tag{6.3} \]
(b) In regression analysis, the first thing to do is to estimate the Y line using
model (6.2), that is, to get estimates for $\beta_0$ and $\beta_1$ using methods such as
ordinary least squares, which uses n pairs of observation values (y, x).
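\[ Y' = \beta_0 + \beta_1 X_1 + \beta_2 X_2 \tag{6.4} \]
\[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \varepsilon \tag{6.5} \]
\[ Y = Y' + \varepsilon \tag{6.5a} \]
\[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k + \varepsilon \tag{6.6} \]
\[ Y = Y' + \varepsilon \tag{6.6a} \]
where
\[ Y' = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k \]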
However, in this module, we will only focus on Multiple Regression Model with
two independent variables. For cases involving more than two independent
variables, calculation is usually done with the help of a statistical package such as
SPSS (Statistical Package for Social Science). You can refer to any related books.
Assumption Explanation
Normality Assumption For any specific value of independent variable, the Y
random variable values are distributed as normal, with
mean E(Y|x) = $\mu_y$, and variance = $\sigma^2_{\varepsilon}$.
SELF-CHECK 6.1
Using your current understanding based on what you have learnt, try to
think of two or three independent variables for the following dependent
variables.
(a) Profit made by a company
(b) Students’ final examination grades
(c) Selling price of a house
Y = β0 + β1 X 1 + β2 X 2 + ε (6.7)
Y =Y' + ε (6.7a)
where $Y' = \beta_0 + \beta_1 X_1 + \beta_2 X_2$.
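Hence, using (6.8a) and (6.9), errors between observations and their
estimations are given as $\varepsilon = y - \hat{y}'$. The error value can take a negative sign.
\[ \sum_{i=1}^{n} \varepsilon_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}'_i)^2 = \sum_{i=1}^{n} \left( y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_{1i} + \hat{\beta}_2 x_{2i}) \right)^2 \tag{6.10} \]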
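Taking the partial derivatives of equation (6.10) with respect to each coefficient and
equating them to zero, we obtain the following three equations:
\[ \sum_{i=1}^{n} y_i = n\hat{\beta}_0 + \hat{\beta}_1 \sum_{i=1}^{n} x_{1i} + \hat{\beta}_2 \sum_{i=1}^{n} x_{2i} \]
\[ \sum_{i=1}^{n} y_i x_{1i} = \hat{\beta}_0 \sum_{i=1}^{n} x_{1i} + \hat{\beta}_1 \sum_{i=1}^{n} x_{1i}^2 + \hat{\beta}_2 \sum_{i=1}^{n} x_{1i} x_{2i} \]
\[ \sum_{i=1}^{n} y_i x_{2i} = \hat{\beta}_0 \sum_{i=1}^{n} x_{2i} + \hat{\beta}_1 \sum_{i=1}^{n} x_{2i} x_{1i} + \hat{\beta}_2 \sum_{i=1}^{n} x_{2i}^2 \]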
\[ \hat{\beta}_1 = \frac{\sum d_1 d \sum d_2^2 - \sum d_2 d \sum d_1 d_2}{\sum d_1^2 \sum d_2^2 - \left(\sum d_1 d_2\right)^2} \tag{6.14} \]
\[ \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}_1 - \hat{\beta}_2 \bar{x}_2 \tag{6.16} \]
\[ \hat{\beta}_1 = \frac{(7.46)(6) - (15)(0.5)}{(1.221)(6) - (0.5)^2} = 5.265687 \approx 5.266, \]
\[ \hat{\beta}_2 = \frac{(15)(1.221) - (7.46)(0.5)}{(1.221)(6) - (0.5)^2} = 2.061193 \approx 2.0612, \]
Model Estimation:
ŷ' = 9.430 + 5.266 x1 + 2.0612 x2 (6.17)
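The slope calculation above can be checked with a short Python sketch. It assumes, as the worked numbers suggest, that d, d1 and d2 denote deviations of y, x1 and x2 from their means; the function name `slope_estimates` and the variable names are illustrative only, and the sums are the values quoted in the example.

```python
# Deviation sums quoted in the worked example (assumed: d = y - mean(y),
# d1 = x1 - mean(x1), d2 = x2 - mean(x2)).
sum_d1d = 7.46     # sum of d1 * d
sum_d2d = 15.0     # sum of d2 * d
sum_d1_sq = 1.221  # sum of d1 squared
sum_d2_sq = 6.0    # sum of d2 squared
sum_d1d2 = 0.5     # sum of d1 * d2


def slope_estimates(s_d1d, s_d2d, s_d1_sq, s_d2_sq, s_d1d2):
    """Least-squares slopes for a model with two independent variables."""
    denom = s_d1_sq * s_d2_sq - s_d1d2 ** 2
    b1 = (s_d1d * s_d2_sq - s_d2d * s_d1d2) / denom
    b2 = (s_d2d * s_d1_sq - s_d1d * s_d1d2) / denom
    return b1, b2


b1, b2 = slope_estimates(sum_d1d, sum_d2d, sum_d1_sq, sum_d2_sq, sum_d1d2)
print(round(b1, 4), round(b2, 4))
```

The intercept is not computed here because the sample means of y, x1 and x2 are not quoted in this part of the text.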
The negative residual value indicates an over-estimation by ŷ3, since
ŷ3 > y3.
On the other hand, for
i = 7, y7 = 32; ŷ7 = 9.430 + 5.266 (2.7) + 2.061 (3) = 29.832 ≈ 29.83, and its
residual is ε7 = y7 − ŷ7 = 32 − 29.83 = 2.17.
The positive residual value indicates that ŷ7 under-estimates y7, since
ŷ7 < y7.
The predicted value of y.
From this model, if x1 = 3.0 and x2 = 2, then ŷ' = 29.35 is the predicted
value for the observation/actual value of y.
EXERCISE 6.1
Referring to data in Example 6.1, and estimated model (6.17), calculate
(a) An estimation and residual values of yi , i = 4, 9; and interpret the
value of the residuals; and
(b) The predicted observed value of y, if x1 = 4.0 , and x2 = 5.0.
EXERCISE 6.2
Referring to data in Example 6.1, and estimated model (6.17), calculate
(a) An estimation and residual values of yi , i = 4, 9; and interpret the
value of the residuals; and
(b) The predicted observed value of y, if x1 = 4.0 , and x2 = 5.0.
where
s²(β̂1) = [∑εi² / (n − k − 1)] × ∑d2² / [∑d1² ∑d2² − (∑d1d2)²] (6.20)
and
s²(β̂2) = [∑εi² / (n − k − 1)] × ∑d1² / [∑d1² ∑d2² − (∑d1d2)²] (6.21)
H0: β1 = 0 against H1: β1 ≠ 0; and H0: β2 = 0 against H1: β2 ≠ 0.
Hypothesis Test: T = (β̂i − βi) / s(β̂i) = β̂i / s(β̂i) under H0,
i = 1, 2.
This test is a two-sided test. A t distribution is commonly used since the population
parameter is unknown and the sample size n is small. s(β̂i) is the standard deviation
of β̂i, where i = 1, 2. Both standard deviations are calculated prior to making
inferences.
β̂i − t(α/2, n − k − 1) s(β̂i) ≤ βi ≤ β̂i + t(α/2, n − k − 1) s(β̂i) (6.22)
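Table 6.3
y      ŷ'         ε²         d²
30     30.40305   0.16245    0.04
36     35.57858   0.17759    33.64
28     30.35797   5.56002    4.84
29     30.31289   1.72368    1.44
27     26.71707   0.08005    10.24
28     27.19856   0.64231    4.84
32     29.8314    4.70283    3.24
27     27.24364   0.05936    10.24
34     33.4723    0.27847    14.44
31     30.88454   0.01333    0.64
Total             13.40008   83.6
∑εi² / (n − k − 1) = 13.40008 / (10 − 3) = 13.40008 / 7 = 1.914297 ≈ 1.9143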
∑ε² = ∑d² − β̂1 ∑d1d − β̂2 ∑d2d (6.23)
∑ε² = 83.6 − (5.265687)(7.46) − (2.061193)(15) = 13.40
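As a rough check of formulae (6.20), (6.21) and (6.23), the sketch below recomputes the residual sum of squares, the residual variance and the standard deviations of the two slope estimates, then forms the t ratios for H0: βi = 0. All variable names are illustrative; the inputs are the sums and estimates quoted in the example, with n = 10 and k = 2.

```python
import math

# Values quoted in the worked example.
n, k = 10, 2
b1, b2 = 5.265687, 2.061193
sum_d_sq, sum_d1d, sum_d2d = 83.6, 7.46, 15.0
sum_d1_sq, sum_d2_sq, sum_d1d2 = 1.221, 6.0, 0.5

# (6.23): residual sum of squares from the deviation sums.
sse = sum_d_sq - b1 * sum_d1d - b2 * sum_d2d

# Residual variance with n - k - 1 degrees of freedom.
s2 = sse / (n - k - 1)

# (6.20) and (6.21): variances of the slope estimates.
denom = sum_d1_sq * sum_d2_sq - sum_d1d2 ** 2
se_b1 = math.sqrt(s2 * sum_d2_sq / denom)
se_b2 = math.sqrt(s2 * sum_d1_sq / denom)

# t ratios for testing each slope against zero.
t1 = b1 / se_b1
t2 = b2 / se_b2
print(round(sse, 4), round(s2, 4), round(t1, 3), round(t2, 3))
```

Each t ratio would then be compared with the critical value of the t distribution with n − k − 1 = 7 degrees of freedom taken from a table.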
THE MULTIPLE COEFFICIENT OF DETERMINATION, R²
(a) Firstly, get the ŷ' values using model (6.17) as in Table 6.2. The R²
coefficient, which also measures the goodness-of-fit of the estimated model
(6.17), gives the proportion of total variation in y which can be explained
by the multiple regression model (6.17). This value can be estimated using
the following formula:
R² = ∑(ŷ'i − ȳ)² / ∑(yi − ȳ)² = 1 − ∑εi² / ∑(yi − ȳ)² (6.24)
R² = 1 − 13.40008 / 83.6 = 0.8397
This means the estimated model (6.17) can only explain 83.97% of variation in y
values; and the rest cannot be explained by the model and is usually contained in
the error ε .
From formula (6.24a),
R² = [(5.265687)(7.46) + (2.061193)(15)] / 83.6 = 0.839712 ≈ 0.8397
R̄² (R²-adjusted)
R̄² = 1 − (1 − R²) (n − 1) / (n − k − 1) (6.24b)
R̄² = 1 − (1 − 0.839712) (10 − 1) / (10 − 2 − 1) = 0.793915.
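The two coefficients can be reproduced with a few lines of Python. This is a minimal sketch using the totals from Table 6.3 (∑ε² = 13.40008 and ∑d² = 83.6); the function names are illustrative.

```python
def r_squared(sse, sst):
    """Proportion of total variation explained by the model."""
    return 1 - sse / sst


def adjusted_r_squared(r2, n, k):
    """R-squared adjusted for the number of independent variables k."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


r2 = r_squared(13.40008, 83.6)
r2_adj = adjusted_r_squared(r2, n=10, k=2)
print(round(r2, 6), round(r2_adj, 6))
```

The adjusted value is always below R² when k ≥ 1, because the factor (n − 1)/(n − k − 1) penalises each additional independent variable.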
TESTING THE SIGNIFICANCE OF OVERALL REGRESSION MODEL
The F test can be used to test the significance of overall regression model. It is
tested based on the ratio of explained variance in the model on the remaining
unexplained variance. The F distribution is used, with k and n-k-1 degrees of
freedom where k is the number of parameters estimated EXCLUDING the
constant β0. (A few books count β0 as a parameter, in which case the F degrees of
freedom become k − 1 and n − k.) The test statistic is given by:
F(k, n − k − 1) = [∑(ŷ'i − ȳ)² / k] / [∑εi² / (n − k − 1)]
                = [R² / k] / [(1 − R²) / (n − k − 1)] (6.25)
H0: β1 = β2 = 0;
H1: at least one βi is not zero.
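The R² form of the F statistic is easy to evaluate directly. The sketch below applies it to the manually calculated R² = 0.839712 with n = 10 and k = 2; the function name is illustrative, and the resulting value would still need to be compared with an F table at k and n − k − 1 degrees of freedom.

```python
def f_statistic(r2, n, k):
    """Overall F statistic computed from R-squared."""
    explained = r2 / k
    unexplained = (1 - r2) / (n - k - 1)
    return explained / unexplained


f_value = f_statistic(0.839712, n=10, k=2)
print(round(f_value, 2))
```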
Test result: If the F-probability (Significance F) is < 0.05, reject H0 at the 5% significance level. This
means the regression coefficients β1 and β2 are not both zero. Consequently, the coefficient of
determination R² is significantly different from zero.
Example of Output:
SUMMARY OUTPUT
Regression Statistics
Multiple R 0.918389714
R Square 0.843439667
Adjusted R Square 0.79125289
Standard Error 1.476566151
Observations 9
ANOVA
df SS MS F Significance F
Regression 2 70.47406997 35.237035 16.1619419 0.003837472
Residual 6 13.08148559 2.1802476
Total 8 83.55555556
Figure 6.1
Figure 6.2
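There are differences in values obtained manually and those obtained using Excel.
The Excel output above is based on 9 observations (total df = 8), whereas the manual
calculation uses n = 10, so the two sets of values are not expected to agree exactly;
rounding also contributes to the differences.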
Interpretation of Output:
(a) Multiple R
This quantity is often referred to as the multiple correlation between y and all
independent variables without any condition imposed on any independent
variable. Its value is the square root of R².
(b) R Square
This measures the goodness-of-fit of the estimated model to the observed y. A high value
means that the regression model explains more of the variation in Y,
as much as (R² × 100)%. On the other hand, a small value indicates a poor fit
of the regression model.
(ii) This means there exists a linear relationship between Y and both X1 and X2
simultaneously.
(d) t Test:
This test is used to evaluate whether an individual regression coefficient
(β1 or β2) is significantly different from zero at the α level, by comparing the p-value
with α. For example, assume that α = 0.05; if p < α, reject
H0: β1 = 0 at the 5% level. This means that β1 ≠ 0, and x1 contributes
significantly to the variation in Y. In the above example, we found that
β1 and β2 are both significantly different from zero at the 5% level.
This means that residuals are random around the horizontal line, as in Figure 6.1.
The normality assumption is satisfied if the normal plot follows a straight line.
Referring to Figure 6.2, not all points fall on the straight line; hence, the
regression model can satisfy only about 94% of the normality assumption.
EXERCISE 6.3
1. (i) Perform a manual analysis and use formulae (6.14), (6.15) and
(6.16) to obtain the parameter estimates for β0, β1 and β2 for the
multiple regression model for the following data:
Y: 10, 24, 40, 20, 15
X1: 2, 3, 7, 3, 4;
X2: 5, 6, 6, 5, 3
(ii) Use your model to obtain the estimated values of y, and
estimate the total residuals.
(iii) Perform a significant test on the parameters.
(iv) Obtain the R² coefficient and interpret its value.
(v) Perform the F test and make the conclusion.
The least squares method is used to obtain the estimates of the regression
model parameters and y-intercept.
The goodness-of-fit between two variables can be obtained using the R²
coefficient of determination.
If the multiple linear regression model obtained is suitable and fits the data,
this model can be used to obtain the estimated values of dependent variable
for any given independent variable values. This model can later be used for
making prediction for any independent variable values as well as estimating
the values of dependent variables.
INTRODUCTION
The methodologies in inferential statistics discussed so far, estimation and
hypothesis testing, are under the assumption that the random samples are selected
from normal populations. Since the population under study is normal, inferential
statistics will use normal distribution or other distributions related to normal
distribution such as t, chi-square and F distributions. Fortunately, most of these
tests are still reliable when we experience slight departures from normality,
particularly when the sample size is large.
There are cases where we would like to do inferential studies on the measure of
location for a population distribution without using the population parameters that
are frequently used such as mean, x̄ and variance, s².
In this case, we need a flexible inferential statistical method that does not depend
on the requirement of normality for population data.
The methodologies and tests discussed in this unit satisfy the requirement
mentioned above. It is called either Distribution-Free Test that means that the
test does not require prior knowledge of the distributions of the underlying
populations; or Non-Parametric Test that refers to inferences not involving
population parameters.
Apart from that, some tests in non-parametric method are also known as Rank
Test due to the usage of rank score replacing data value or actual value in the
analysis.
There are situations when observed values are assigned rank according to
response magnitude. For example, a comparison is to be made between new
software versus the existing software used in an operation. It is difficult to give an
exact value for the qualitative variable “software ease of use in handling
operation”. Instead, we can still make a concrete and clear decision based on this
comparison by assigning ranks according to which software is more efficient or
better. This type of data is called ordinal data.
Data in both situations above is neither distributed as normal nor suitable for
parametric analysis. In this case, mean is not suitable as a measure of location.
Thus, the non-parametric test is an alternative for analysing nominal or ordinal
data.
SELF-CHECK 7.1
One thing to note is that, ordinal scales cannot determine the magnitude or
by how much an object is better than the other in terms of the measured
characteristic. In the example of effectiveness of a medicine, the difference
in magnitude between 2 and 3 cannot be determined. What we can conclude
is that 3 is more effective than 2.
SELF-CHECK 7.2
Can be used for a small sample size without having to satisfy the
assumption of normality.
Can be used for data classified in “weak” types of measurement such as
nominal and ordinal data.
Does not require any prior knowledge of the sampled population
distribution.
Easy to understand.
SELF-CHECK 7.3
Data                                              Nominal Scale    Ordinal Scale
(a) Data on car models: Proton; Toyota; Nissan;
    Honda; Datsun; Mazda
Firstly, this method does not make full use of information available in the sample.
For example, in a non-parametric analysis, rank or code 1, 2, 3 etc is used to
replace original data. As discussed earlier, rank data does not take into
consideration the differences in magnitude between observations. For example,
suppose that the original given data are 13.3, 18.8, 22.1, etc. Say, another
researcher also took samples 14.3, 14.5 and 14.9.
Non-parametric tests produce a wider confidence interval for any significance level.
This results in lower power of test and hence, the results obtained are less
accurate.
EXERCISE 7.1
For further information on the following, please visit the websites below:
• Non-parametric concepts:
http://www.statsofttinc.com/textbook/stnonpar.html
• Test of Sample Randomness:
http://geography.asu.edu/fall2002/gcu495/ttest/ttest
INTRODUCTION
Most of the time, in analysis, we assume that the samples are chosen at random
from the population under study. Assumption on randomness or random
samples means samples are chosen without any preferences; in fact, each data in
the population has an equal chance of being selected.
There are several non-parametric statistical methods to test randomness of any set
of observed data based on arrangement or data sequence in which the sample
observations are obtained. The method discussed here is based on Runs theory.
A run is defined as a subsequence of one or more identical symbols representing a
common property of the data.
The number of runs that is too small or too large indicates departures in
randomness in an observed sample. The runs test for randomness of data will test
the hypothesis that the sequence of an event that occurs is random versus an
alternative hypothesis that the sequence produced is not random.
GGGGGDDGGGDDDGGGGGGG
From the example above, there is first a run of 5 ‘G’s (which is the first run), then
a run of 2 ‘D’s (second run). Next, there is a run of 3 ‘G’s, a run of 3 ‘D’s and
finally, a run of 7 ‘G’s. In all, there are 5 different runs of varying lengths:
GGGGGDDGGGDDDGGGGGGG
1 2 3 4 5
Let n1 represent the sample size for first letter or symbol, while n2 represents the
sample size for the second letter or symbol. Hence, the sample size n = n1 + n2 .
For production data above, there are n1 = number of letter ‘D’ = 5 while
n2 = number of letter ‘G’ = 15, which gives in total n = 20 letter sequences.
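Counting runs by hand is error-prone for long sequences. The following Python sketch splits the production sequence above into its runs and counts n1, n2 and the number of runs R; the function name `find_runs` is illustrative.

```python
from itertools import groupby


def find_runs(sequence):
    """Return each run as a (symbol, length) pair, in order of appearance."""
    return [(symbol, len(list(group))) for symbol, group in groupby(sequence)]


sequence = "GGGGGDDGGGDDDGGGGGGG"
run_list = find_runs(sequence)
R = len(run_list)          # total number of runs
n1 = sequence.count("D")   # number of 'D' letters
n2 = sequence.count("G")   # number of 'G' letters
print(run_list, R, n1, n2)
```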
DDDDDGGGGGGGGGGGGGGG
or any other forms of 2 runs, this is most unlikely to occur from a random
selection process. Such a result indicates that the production process generates the
first 5 products as defective, followed by 15 well-functioning products.
GDGDGDGDGDGGGGGGGGGG
where n1 = 5 and n2 = 15, the maximum runs with alternating letters is as many
as 11 runs. Hence, we would again be suspicious of the order in which the
samples were selected.
GDGDGDGDGDGDGDGDGDGD
and maximum runs is 20. The number of runs is large for a sample of size 20.
Thus, this sample can be said as not random.
How can we verify the above statement? Suppose that R denotes the number of
runs. We would like to find the probability that n1 letters of the first kind and n2
letters of the second kind form R runs.
Firstly, we need to count the runs for a single letter or symbol. Let us say we would like to
count the ways in which n1 letters can form k runs, with R = 2k and k a positive integer. Hence,
if n1 = 5, to form k = 3 runs for ‘G’, there are six possibilities of sequences:
The vertical bars ‘|’ separate the five letters into three different runs. Observe that
there are 6 possible arrangements for this example, which can be written as
C(n1 − 1, k − 1) = C(5 − 1, 3 − 1) = 6, where C(a, b) denotes the binomial
coefficient “a choose b”. By the same token, there are C(n1 − 1, k − 1) ways in
which n1 letters of the first kind can form k runs and C(n2 − 1, k − 1) ways in
which the n2 letters of the second kind can form k runs. It follows that there are
altogether 2 C(n1 − 1, k − 1) C(n2 − 1, k − 1) ways in which
these n1 + n2 letters can form 2k runs. The factor 2 is accounted for by the fact
that when we combine the two kinds of runs so that they alternate, we can begin
either with a run of the first kind of letter or with a run of the second kind. Thus,
when R = 2k, the probability of getting that many runs is:
f(R) = f(2k) = 2 C(n1 − 1, k − 1) C(n2 − 1, k − 1) / C(n1 + n2, n1)
When the two letters do not form the same number of runs, there are
C(n1 − 1, k) C(n2 − 1, k − 1) ways to get k + 1 runs of the first letter and k runs
of the second letter, and C(n1 − 1, k − 1) C(n2 − 1, k) ways to get k runs of the
first letter and k + 1 runs of the second letter. Hence, the probability is:
f(R) = f(2k + 1) = [C(n1 − 1, k) C(n2 − 1, k − 1) + C(n1 − 1, k − 1) C(n2 − 1, k)] / C(n1 + n2, n1)
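The two probability formulas can be combined into one function. The sketch below is an illustration only (the name `runs_pmf` is not from the text); it uses `math.comb`, which returns 0 when the lower index exceeds the upper one, so impossible run counts automatically get probability zero.

```python
from math import comb


def runs_pmf(r, n1, n2):
    """Probability of exactly r runs for n1 and n2 letters of two kinds."""
    if r < 2 or r > n1 + n2:
        return 0.0
    total = comb(n1 + n2, n1)
    k, odd = divmod(r, 2)
    if not odd:
        # r = 2k: both letters form k runs; factor 2 for the starting letter.
        ways = 2 * comb(n1 - 1, k - 1) * comb(n2 - 1, k - 1)
    else:
        # r = 2k + 1: one letter forms k + 1 runs, the other forms k runs.
        ways = (comb(n1 - 1, k) * comb(n2 - 1, k - 1)
                + comb(n1 - 1, k - 1) * comb(n2 - 1, k))
    return ways / total


# Production example: n1 = 5 defective and n2 = 15 good items.
probabilities = {r: runs_pmf(r, 5, 15) for r in range(2, 21)}
largest_possible = max(r for r, p in probabilities.items() if p > 0)
print(largest_possible, sum(probabilities.values()))
```

For n1 = 5 and n2 = 15 the largest run count with non-zero probability is 11, in line with the alternating sequence discussed earlier.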
The runs test for testing sample randomness is based on R random variable, that is
total number of runs. It is a two-sided test as shown in Figure 8.1.
There are two methods to decide whether a given set of data is random or not.
Firstly, we can calculate the probability of obtaining the number of runs shown in the data. If
the probability is smaller than α/2, where α is the significance level (since this is a
two-sided test), then H0: the data sequence is random, is rejected.
The second method is by using the runs test table as in Table 5.1 in the
attachment. This table provides critical values of R used in the Runs Test for α = 0.05
with n1 ≤ 20 and n2 ≤ 20. H0 will be rejected at α = 0.05 if the number of runs
R ≤ r(a) or R ≥ r(b), where r(a) and r(b) are the critical values at both sides at
α = 0.05 (see Figure 8.1). This table cannot be used if α is not 0.05.
TOPIC 8 NON-PARAMETRIC TEST FOR RANDOMNESS 161
Answer:
From the question, n1 = 8 and n2 = 8. The null hypothesis which states that the
sample sequence is random will be rejected if the probability of getting a total
number of runs is less than α/2, that is Pr(R = r) < 0.01/2 = 0.005. Since
α ≠ 0.05, the runs test table cannot be used and the probabilities are calculated directly:
Pr(least runs) = f(2) = 2 C(7, 0) C(7, 0) / C(16, 8) = 2/12870 = 0.000155
f(3) = [C(7, 1) C(7, 0) + C(7, 0) C(7, 1)] / 12870 = 14/12870 = 0.001088
f(4) = 2 C(7, 1) C(7, 1) / 12870 = 98/12870 = 0.007615
Based on the above calculation, we obtain Pr(R ≤ 3) = 0.001243 < 0.005, while
Pr(R ≤4) = 0.008858 > 0.005. Hence, if total runs, R ≤ 3, H0 will be rejected.
When R = 4, the null hypothesis cannot be rejected since the probability value
of 4 runs is relatively large, that is Pr(R = 4) = 0.007615 (> 0.005).
f(16) = 2 C(7, 7) C(7, 7) / 12870 = 2/12870 = 0.000155
f(15) = [C(7, 7) C(7, 6) + C(7, 6) C(7, 7)] / 12870 = 14/12870 = 0.001088
f(14) = 2 C(7, 6) C(7, 6) / 12870 = 98/12870 = 0.007615
Observe that probability values are symmetrical for a two-sided test. Hence,
reject H0: the sequence of samples is random at α = 0.01 when total runs, R = 2,
3, 15 or 16. If any one of these number of runs occurs, the probability value will
be smaller than 0.005.
If we use the Runs Test Table method, we will be checking the critical value or
the hypothesis rejection value at n1 = n2 = 8. From Table 5.1 in the attachment,
the rejection region is in the area R ≤ 4 or R ≥14. Notice that the same answer is
obtained using both methods.
SELF-CHECK 8.1
1. What is meant by runs? How can the number of runs indicate the
randomness of an event or data?
EXERCISE 8.1
BBGBGGGGGGBBGBBBBB
Based on this assumption, when the sample size increases, the sampling distribution of R
will approach the normal distribution. If we arrange at random n1 letters of one kind
and n2 letters of another kind, the mean number of runs, μR, and the
variance, σ²R, are:
μR = 2 n1 n2 / (n1 + n2) + 1
σ²R = 2 n1 n2 (2 n1 n2 − n1 − n2) / [(n1 + n2)² (n1 + n2 − 1)]
Hence, we can use the test statistic Z, which is distributed as standard normal,
with Z = (R − μR) / σR. If the number of observed runs, R, is near the mean value, the
hypothesis on data randomness is supported. If R differs substantially from the mean, there is
evidence that the sample is not random.
PPLLLLPPPPPPPPPLPPLLLLPPPPPPPLLLLPPPPLPP
This safety consultant would like to know if the sequence of drivers driving
within and above the speed limit is random or not. Alternatively, he would like
to find out if those drivers who are speeding are driving in a group.
Answer:
Let us say, n1 = number of letter P while n2 = number of letter L. From the data
recorded, n1 = 26, n2 = 14, R = 11.
Since both n1 and n2 ≥ 10, for large sample size, the mean and variance can be
calculated as below:
μR = 2(26)(14) / (26 + 14) + 1 = 19.2
σ²R = 2(26)(14)[2(26)(14) − 26 − 14] / [(26 + 14)² (26 + 14 − 1)] = 8.0267
z = (11 − 19.2) / √8.0267 = –2.894
For a two-sided test at 0.05 significance level, the rejection regions are at
z ≤ –1.96 or z ≥ 1.96. Since z = –2.894 < –z0.025 = –1.96, H0 is rejected. At the 5%
significance level, the sequence of speeding and slow drivers is not random; the
drivers tend to drive in groups.
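The large-sample version of the runs test can be sketched as follows. The function name `runs_test_z` is illustrative; it takes the observed number of runs and the two letter counts and returns the mean, the variance and the z statistic, which is then compared with ±1.96 for a two-sided test at the 0.05 level.

```python
import math


def runs_test_z(r, n1, n2):
    """Normal approximation for the number of runs R."""
    n = n1 + n2
    mean = 2 * n1 * n2 / n + 1
    variance = 2 * n1 * n2 * (2 * n1 * n2 - n1 - n2) / (n ** 2 * (n - 1))
    z = (r - mean) / math.sqrt(variance)
    return mean, variance, z


sequence = "PPLLLLPPPPPPPPPLPPLLLLPPPPPPPLLLLPPPPLPP"
n1 = sequence.count("P")
n2 = sequence.count("L")
# A new run starts wherever a symbol differs from the one before it.
r = 1 + sum(a != b for a, b in zip(sequence, sequence[1:]))

mean, variance, z = runs_test_z(r, n1, n2)
reject_h0 = abs(z) >= 1.96
print(n1, n2, r, round(mean, 2), round(variance, 4), round(z, 3), reject_h0)
```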
EXERCISE 8.2
MF M F M M M F M F M M MF F M M M M F F M F
M M M F M M M F F F M F M M M M F M F M M M
M F F M
(b) a ‘-’ sign if it falls below the median value and omitting all measurements
that are exactly equal to the median value.
Answer:
From the given sample, we find median x̃ = 3.9. Replacing each measurement
by the ‘+’ symbol if it falls above 3.9, by the ‘–’ symbol if it falls below 3.9
and ‘X’ sign for data that will be taken out if the value is 3.9, we obtained the
following sequence,
– X+ – – – – + + + X + – + +
for which n1 = 6, n2 = 7 and R = 6.
Referring to Table 5.1, we get critical values for runs as 3 and 12 giving (3,12)
interval as the acceptance region. Since R = 6 is in this interval, we do not reject
the null hypothesis at 0.05 significance level. The sequence of measurements
varies randomly.
SELF-CHECK 8.2
What is meant by the ‘+’, ‘–’ and ‘0’ signs in the test for randomness for
quantitative data? How is sample size obtained?
EXERCISE 8.3
1. The following are the numbers of students absent from school on 22
consecutive school days: 29, 25, 31, 28, 33, 31, 35, 29, 31, 33, 35, 28,
36, 30, 33, 26, 30, 28, 32, 31, 38 and 27. Did the absenteeism of
students occur at random? Test at 0.01 significance level.
0.019, 0.021, 0.020, 0.019, 0.020, 0.018, 0.023, 0.021, 0.024, 0.022,
0.023, 0.022.
Use the runs test to determine if the fluctuations in thickness from one
tray to another are random. Test at 0.05 significance level.
• A runs test can be used to detect certain trends in a sample that show the non-randomness of data.
• This test can be used for any sample size, and to test for randomness of
quantitative data.
INTRODUCTION
In testing a single population using parametric methods, the Student's t-test and z-test
(for large sample size) are used to determine whether the population mean is
equivalent to or different from a certain mean value μ0. Non-parametric methods
for testing a single population also enable us to verify whether there exists a
significant difference in terms of the location or position of a population with a
given value of a measure of location.
To test a population based on the median, we will test the hypothesis whether the
median of a population under study, say denoted by τ (read as ‘tau’), is
significantly different from a specific median, say τ0 (read as ‘tau nought’).
SELF-CHECK 9.1
Hence, we will test whether H0: θ ≤ 0.5 (consumers do not have any preference on the
product; that is the number of consumers preferring the product exceeding median
value is equal to the number of consumers preferring the product less than median
value) versus H1: θ > 0.5 (consumers prefer this product). We can state the hypothesis
of interest in terms of the median, that is, H0: τ ≤ τ0 against H1: τ > τ0.
For the above example, the expression term in constructing the alternative
hypothesis statement H1: θ > 0.5 or H1: τ > τ0 is “larger than”. In summary, these
expression terms (please check Figure 9.1) can lead us to decide the set of null
and alternative hypotheses for a single-population test:
Table 9.1: Expression Terms with Related Null and Alternative Hypotheses
Examples of Expression Terms    Alternative Hypothesis    Null Hypothesis
“more than”, “exceeds”,         H1: θ > 0.5               H0: θ ≤ 0.5
“increase”                      or (H1: τ > τ0)           or (H0: τ ≤ τ0)
SELF-CHECK 9.2
State the expression terms and define the appropriate null and alternative
hypothesis for the following cases:
Statement with Expressions        Alternative Hypothesis        Null Hypothesis
(a) A supervisor recorded 9 observations of
battery lifetime before a recharge is
required. Determine whether this battery
operates with a median of 1.8 hours before
requiring a recharge.
(b) The following data is obtained from a non-
normal population. Determine whether the
median is distributed less than 5.2 cm.
Median is the observed value that is located in the middle of the data when all
other observations are ranked in sequence regardless of the order, increasing or
decreasing. If the sample size is even, the median would be the mean value of the
two observations at the centre. Similar to the mean value, median is also a
measure of location for a distribution. Hence, the sign test is sometimes called
test of distribution location.
“larger than 100”, then the hypothesis to be tested is: H0: τ = 100 against a one-sided
alternative hypothesis,
H1: τ > 100.
In the sign test, we replace each sample value exceeding the median value τ0 with a
‘+’ sign and each value less than τ0 with a ‘–’ sign.
A sample value which is equal to τ0 will be replaced with a ‘0’ sign. This
situation can happen if we deal with rounded data even though the population is
continuous. Observations replaced with a ‘0’ sign will not be used in subsequent
analysis. When this occurs, the sample size for analysis decreases by
the number of ‘0’ signs (zero signs). Figure 9.2 summarises the information above.
xi > τ0 → xi is replaced with ‘+’
xi < τ0 → xi is replaced with ‘–’
xi = τ0 → xi is replaced with ‘0’
In the sign test, the test statistic S is the random variable representing the number of
‘+’ signs in the random sample. If the null hypothesis τ = τ0 is true, the probability that a
sample value results in either a ‘+’ or a ‘–’ sign is equal to ½. Therefore, we are actually
testing H0 that the number of ‘+’ signs, S, is a value of a random variable having the
binomial distribution with the parameter θ = ½, that is, S ~ Binomial(n, ½).
For the example above, we shall reject the null hypothesis H0: τ = 100 or
H0: θ = 0.5 only if the proportion of ‘+’ signs is sufficiently greater than ½, that is
when S is large.
SELF-CHECK 9.3
The following data are test marks data of 10 returning students: 65 85 43
38 90 73 65 59 88 74. A test has been performed to determine whether
the performance of these students in a test exceeds the population median
value of 65.
(a) Define the ‘+’ and ‘-’ signs in this example.
(b) How many ‘+’ and ‘-’ signs are there in this study?
Answer:
To determine whether the median of the percentage of active bacteria, τ, exceeded
40, test: H0: τ = 40 versus H1: τ > 40 at α = 0.05.
Assigning values exceeding 40 with a ‘+’ sign and values less than 40 with a
‘–’ sign:
41 33 43 52 37 44 49 53 40
+ – + + – + + + 0
Using the sign test, the test statistic S = number of observed ‘+’ signs = 6. Hence, S
is distributed as binomial with n = 9 – 1 = 8 and θ = 0.5. If the variable X is
distributed as binomial with n = 8 and θ = 0.5, then
p-value = Pr(X ≥ 6) = [C(8, 6) + C(8, 7) + C(8, 8)] (0.5)^8 = 37/256 = 0.1445
Since α = 0.05 < p-value = 0.1445, H0 is not rejected. There is not enough
evidence to say that the median for active bacteria percentage exceeds 40 at 5%
significance level.
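The exact sign test for a small sample can be carried out with the binomial distribution directly. The sketch below is illustrative (the function name `sign_test_upper_p` is not from the text); it drops observations equal to the hypothesised median, counts the ‘+’ signs and computes the upper-tail probability for H1: median > 40.

```python
from math import comb


def sign_test_upper_p(s, n, theta=0.5):
    """Pr(X >= s) for X distributed as binomial(n, theta)."""
    return sum(comb(n, x) * theta ** x * (1 - theta) ** (n - x)
               for x in range(s, n + 1))


values = [41, 33, 43, 52, 37, 44, 49, 53, 40]
tau0 = 40

plus = sum(v > tau0 for v in values)    # number of '+' signs
minus = sum(v < tau0 for v in values)   # number of '-' signs
n = plus + minus                        # zeros are discarded

p_value = sign_test_upper_p(plus, n)
print(plus, minus, n, round(p_value, 4))
```

H0 is rejected only when the p-value falls below the chosen significance level α.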
(a) State the Null and Alternative Hypotheses for testing a population location
(Refer to hypothesis for small n)
17 15 20 29 19 18 22 25 27 9 24 20 17 6 24
14 15 23 24 26 19 23 28 19 16 22 24 17 20 14
13 19 10 23 18 31 13 20 17 24
Answer:
Test to determine whether τ < 21.5.
H0: τ = 21.5 H1: τ < 21.5
For a one-sided test, reject H0 if test statistic z > z0.01 = 2.33 where
z = (S − nθ) / √(nθ(1 − θ))
at θ = ½ and S = number of observed ‘–’ signs (sample values that are less than
21.5).
Since n = 40 and S = 24, then mean(S) = nθ = 40(½) = 20 while
√Var(S) = √(nθ(1 − θ)) = √(40(0.5)(0.5)) = 3.16
z = (24 – 20)/3.16 = 1.26
Since z = 1.26 < 2.33, we do not reject H0. There is not enough evidence to
prove that the sulfur oxides content is less than 21.5 at 0.01 significance level.
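For a large sample the same test can be run with the normal approximation. This is a minimal sketch using the 40 observations listed above; the function name `sign_test_z` is illustrative, and S counts the ‘–’ signs because the alternative hypothesis is that the median is below 21.5.

```python
import math


def sign_test_z(s, n, theta=0.5):
    """Normal approximation to the binomial count S under H0."""
    return (s - n * theta) / math.sqrt(n * theta * (1 - theta))


data = [17, 15, 20, 29, 19, 18, 22, 25, 27, 9, 24, 20, 17, 6, 24,
        14, 15, 23, 24, 26, 19, 23, 28, 19, 16, 22, 24, 17, 20, 14,
        13, 19, 10, 23, 18, 31, 13, 20, 17, 24]
tau0 = 21.5

s = sum(x < tau0 for x in data)    # number of '-' signs
n = sum(x != tau0 for x in data)   # observations equal to tau0 are dropped
z = sign_test_z(s, n)
reject_h0 = z > 2.33               # one-sided critical value at the 0.01 level
print(s, n, round(z, 2), reject_h0)
```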
EXERCISE 9.1
2. Use the sign test to test H0: τ = τ0 versus H1: τ > τ0, where
S1 = number of observations > τ0 and S2 = number of observations
< τ0, and show that Pr(S1 ≥ c) = Pr(S2 ≤ n − c) for 0 ≤ c ≤ n.
Under the null hypothesis that there is no difference between x values and τ0, we
would expect that on average, half of the differences would be negative and the
other half would be positive. In other words, there will be about n/2 negative differences
and n/2 positive differences. Next, we would rank these positive and negative differences in
absolute value, and assign ranks according to sequence. It is expected that the
total rank corresponding to the positive differences should be equal/nearly equal
to the total rank corresponding to the negative differences. An obvious
difference in total rank assigned to positive and negative differences is an
indication of differences between x values and τ0.
The smaller the total rank value, the bigger the possibility that there exist
differences between sample values and τ0. Hence, we can reject H0 when the test
statistic, that is the total rank, say T, is less than or equal to a critical value T0.
The single population test procedure that takes into consideration the magnitude
and difference sign is as follows:
(a) State the null and alternative hypotheses statement for a single
population test.
Answer:
To determine whether the population median, τ, exceeds 40, test:
H0: τ = 40 versus H1: τ > 40, α = 0.05
Following the procedures in Wilcoxon Signed-Rank test, we obtain the
following results.
(To facilitate your understanding, the circled numbers indicate the procedures in
the test).
From the above table, the rank sum for differences with a ‘+’ sign and the rank sum for differences
with a ‘–’ sign are T+ = 28.5 and T– = 7.5 respectively (refer to ③). Since there
is only one observation whose value is equal to the median value (refer to
⑤), n = 9 – 1 = 8 (refer to ⑥).
For a one-sided test, the test statistic is given as T+ = 28.5. From the Table of
critical value for Wilcoxon signed-rank test (Table 5.2) in attachment, with
n = 8, the critical value is T0.05 = 4. Since T+ = 28.5 > T0.05 = 4, we do not reject
H0. There is not enough evidence to prove that the median percentage of active
bacteria is more than 40. A similar conclusion is obtained through a sign test.
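The rank sums T+ and T– can be computed with a short Python sketch. It is an illustration under the usual convention that tied absolute differences share the average of the ranks they occupy; the function name `signed_rank_sums` is not from the text, and the data are the nine bacteria percentages used in the sign test example.

```python
def signed_rank_sums(values, tau0):
    """Return (T+, T-, n) for the Wilcoxon signed-rank test."""
    # Differences equal to zero are discarded.
    diffs = [v - tau0 for v in values if v != tau0]
    ordered = sorted(abs(d) for d in diffs)

    # Average rank for each distinct absolute difference (handles ties).
    rank_of = {}
    for a in set(ordered):
        first_rank = ordered.index(a) + 1
        count = ordered.count(a)
        rank_of[a] = first_rank + (count - 1) / 2

    t_plus = sum(rank_of[abs(d)] for d in diffs if d > 0)
    t_minus = sum(rank_of[abs(d)] for d in diffs if d < 0)
    return t_plus, t_minus, len(diffs)


values = [41, 33, 43, 52, 37, 44, 49, 53, 40]
t_plus, t_minus, n = signed_rank_sums(values, 40)
print(t_plus, t_minus, n)
```

As a quick consistency check, T+ + T– should always equal n(n + 1)/2.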
TOPIC 9 NON-PARAMETRIC HYPOTHESIS TEST FOR SINGLE POPULATION 181
The procedure for a one location test with large n using Wilcoxon signed-rank
test can be summarised as below:
(a) State the null and alternative hypotheses statement for a single
population test.
Answer:
To test whether the median sale differs from 120 units in a month duration, test:
H0: τ = 120 against H1: τ ≠ 120 at α = 0.05. The median value τ0 = 120 is subtracted
from each observation. The magnitude and sign of each difference are
recorded and next, ranks are assigned to each difference. The results are:
yi     yi − τ0   Sign   Rank      yi     yi − τ0   Sign   Rank
85     -35       (-)    25.5      73     -47       (-)    29
99     -21       (-)    17.5      123    +3        (+)    4
120    0         0                119    -1        (-)    2
116    -4        (-)    5.5       85     -35       (-)    25.5
138    +18       (+)    13        128    +8        (+)    9
100    -20       (-)    15.5      150    +30       (+)    23
129    +9        (+)    10        124    +4        (+)    5.5
115    -5        (-)    7         100    -20       (-)    15.5
141    +21       (+)    17.5      101    -19       (-)    14
142    +22       (+)    19        130    +10       (+)    11
121    +1        (+)    2         119    -1        (-)    2
94     -26       (-)    22        127    +7        (+)    8
78     -42       (-)    28        96     -24       (-)    21
152    +32       (+)    24        109    -11       (-)    12
97     -23       (-)    20        83     -37       (-)    27
One observation has been discarded since its difference is ‘0’; hence n = 29.
The rank sums are T+ = 146 and T– = 289. Since n is large, the normal
approximation is used. The test statistic calculated is
z = [T+ − n(n + 1)/4] / √[n(n + 1)(2n + 1)/24]
  = [146 − 29(30)/4] / √[29(30)(59)/24] = –1.546
Since |z| = 1.546 < z0.025 = 1.96, we do not reject H0. In conclusion, the median of the
detergent’s purchases does not differ significantly from 120 at 5% level.
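The normal approximation for the signed-rank statistic takes only a few lines. The sketch below uses the rank sum T+ = 146 and n = 29 from this example; the function name `signed_rank_z` is illustrative.

```python
import math


def signed_rank_z(t, n):
    """z statistic for a Wilcoxon signed-rank sum t with n non-zero differences."""
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    return (t - mean) / sd


z = signed_rank_z(146, 29)
reject_h0 = abs(z) >= 1.96   # two-sided test at the 0.05 level
print(round(z, 3), reject_h0)
```

Using T– = 289 instead of T+ gives the same value with the opposite sign, so the two-sided conclusion does not depend on which rank sum is used.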
SELF-CHECK 9.4
What are the differences between Sign Test and Wilcoxon Signed-Rank
Test?
EXERCISE 9.2
1. For Wilcoxon signed-rank test, show that the sum of positive and
negative differences T+ + T– = n(n+ 1)/2 with n is the number of
non-zero differences assigned rank.
Carry out both the sign and signed-rank tests to determine whether
more than half of the random polynomial 0-1 problems require less
than or equal to 1 CPU second. Use α = 0.01.
INTRODUCTION
A researcher often takes observations from two populations with the purpose of
comparing them, such as whether both populations come from the same
distribution or not. For example, in a parametric test, with two random samples
X1, …, Xn1 and Y1, …, Yn2 obtained from two normal populations with means, µx
and µy respectively and constant variance, the researcher may be interested to test
H0: µx = µy versus H1: µx < µy . If the null hypothesis holds, we can conclude that
both distributions are distributed as normal with similar mean and variance. In
other words, both samples were taken from the same population. On the other
hand, if the alternative hypothesis is true, then µx < µy, that is, the location
parameter of X (selected as the mean) has a smaller value compared to the
location parameter of Y. Hence, X population distribution is located on the left
side of the Y distribution. The dispersion of X and Y distributions is still the same
as both variances are assumed constant.
On the other hand, if we can associate or match two populations where the
selection of samples from the second population depends on the selection of
samples from the first population, both are said to be dependent. An example of
two dependent populations is a study on the effectiveness of vehicle safety tools
where the measure of injuries faced by a driver when putting on safety tools
(sample 1) is compared with the measure of injuries on the driver when he/she is
not putting on any safety tool (which is sample 2). In this case, both samples are
related and are experimented on the same subject, that is, the same driver is used
to obtain measures of injuries when putting on the safety tools and when not
putting on the tools.
In testing two dependent populations, the sample size n1 and n2 must be equal, that
is, n1 = n2 = n due to a relationship or similarity in the data source/measurement
obtained, as in the case of similar subjects, and comparisons made are based on
paired comparison.
SELF-CHECK 10.1
(a) H1: D1 < D2 means population 1 distribution (which contains X1, X2, …, Xn1)
is located on the left side of population 2 (which contains Y1, Y2, …, Yn2).
This also means that on average, X1, X2, …, Xn1 values are smaller than Y1,
Y2, …, Yn2 values;
(b) H1: D1 > D2 means population 1 distribution is located on the right side of
population 2; and
(c) H1: D1 ≠ D2 means the location of population 1 distribution is not the same
as the location of population 2.
Answer:
Define D1 as the distribution of marks for students who did not have the sample
problems and D2 as the distribution of marks for students who were provided
with sample problems. From the expression ‘lower’ we can test whether X1, X2,
…, Xn1 values are smaller than Y1,Y2, …,Yn2 values or D1 is located on the left
side of D2. Hence, the null and alternative hypotheses can be described as H0:
D1 = D2 versus H1: D1 < D2.
Answer:
Define D1 as students’ weight distribution before starting the diet programme
and D2 as the distribution of weight after the programme. If the programme is
effective, that is, it reduces weight, the observation values in D1 must
be larger than the observation values in D2. In other words, D1 must be located
on the right side of D2. Hence, the alternative hypothesis is H1: D1 > D2.
The comparison of the weights before and after can also be expressed through
the weight difference between the weight before and the weight
after, that is Di = Xi – Yi. The alternative hypothesis can then be described as
H1 : Di = Xi – Yi > 0 and the null hypothesis as H0 : Di = 0. We will discuss
this further in section 10.3.
Some examples of expressions that provide clues in constructing suitable null and
alternative hypotheses for two independent populations are as shown below (refer
to Table 10.1).
Table 10.1: Summary of Expression Terms with the Null and Alternative Hypotheses
π = Pr(Xi > 0) = Pr(Xi – 0 > 0) = 1/2
π = Pr(Di > 0) = 1/2 = Pr(Di < 0)
for two dependent or paired populations, Di = Xi – Yi – 0. Under H0, π = Pr(Di > 0)
= 1/2 = Pr(Di < 0). If Xi and Yi, i = 1, ..., n (for two paired samples, n = n1 =
n2) come from the same population, then Pr(Di > 0) = Pr(Di < 0) = 1/2. Hence,
replacing Xi – 0 with Xi – Yi – 0, we can use the results of the sign test for a
single population to test two dependent or paired populations. Therefore, the null
hypothesis can be written as:
Suppose that S represents the number of differences between Xi and Yi marked as ‘+’,
hence S follows a binomial distribution with π = 1/2. Thus, the null hypothesis
statement for comparing two paired populations is H0 : π = 1/2.
(i) Obtain the differences between the first sample and its pair, that is, the
second sample.
(ii) Assign the ‘+’ or ‘-’ sign according to the result of the differences. The
paired sample with zero difference is discarded.
(iii) Count the number of ‘+’ signs for one-sided (right) test and the number of ‘-’
signs for one-sided (left) test.
The testing procedures for two dependent or paired populations are similar to
testing procedures for single population.
Let us study the following two examples.
Next, the rejection region can be determined using binomial probability
distribution.
Answer:
To determine whether the salary increment results in fewer defective products, test:
H0 : π = 1/2 versus
H1 : π < 1/2 (there are fewer defective products after the salary increment than
before the increment), that is, H1 : after < before.
From the alternative hypothesis, the above test is a one-sided (left) test. Hence,
test statistic = S2 = the number of differences between X and Y with ‘–’ sign.
Answer:
There are two dependent samples since the evaluations of both types of fried
chicken were made by the same subject (culinary expert). The hypotheses to test
whether the two ingredients are different are
H0 : There is no change in the fried chicken taste, or π = 1/2, versus
H1 : The new ingredient resulted in tastier fried chicken, or π > 1/2.
Culinary   Original      New           Difference, Di =     Sign
Expert     Ingredient    Ingredient    New – Original
A          3             9              6                   +
B          5             5              0                   0
C          3             6              3                   +
D          1             3              2                   +
E          5             10             5                   +
F          8             4             –4                   –
G          2             2              0                   0
H          8             5             –3                   –
I          4             6              2                   +
J          6             7              1                   +
From the table above, we can see that six out of 10 culinary experts found that
the chicken tasted better using the new ingredient. Two said that the original
ingredient tasted better and the two other experts could not detect any
difference.
From the binomial table with n = 8 and π = 0.5, Pr(S ≥ 6) = 0.1445. Since the
p-value = 0.1445 is greater than the significance level, α = 0.05, we do not
reject the null hypothesis. In conclusion, at the 0.05 significance level the new
ingredient did not result in significantly tastier fried chicken compared with the
traditional ingredient.
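The binomial calculation above can be checked with a short script. This is only a sketch: the function name `sign_test_upper_p` is our own, and the scores are those of the ten culinary experts in the table.

```python
from math import comb

def sign_test_upper_p(differences):
    """One-sided (right) sign test: returns n, S and Pr(S or more '+' signs)."""
    signs = [d for d in differences if d != 0]   # zero differences are discarded
    n = len(signs)
    s = sum(1 for d in signs if d > 0)           # number of '+' signs
    # Under H0, S is binomial with n trials and success probability 1/2
    p_value = sum(comb(n, k) for k in range(s, n + 1)) / 2 ** n
    return n, s, p_value

original = [3, 5, 3, 1, 5, 8, 2, 8, 4, 6]
new = [9, 5, 6, 3, 10, 4, 2, 5, 6, 7]
diffs = [b - a for a, b in zip(original, new)]   # new minus original

n, s, p_value = sign_test_upper_p(diffs)
print(n, s, round(p_value, 4))  # 8 6 0.1445
```

Since 0.1445 is larger than 0.05, the script agrees with the decision not to reject the null hypothesis.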
Carry out a test at 5% level to determine whether the customers prefer the new
ingredient to the traditional one.
Answer:
To determine whether the new ingredient is preferable, test H0: π = 0.5 versus
H1: π > 0.5.
S = the number of ‘+’ sign = 19 Sample size, n = 32
Test statistic:
\[ Z = \frac{S - 0.5n}{0.5\sqrt{n}} = \frac{2S - n}{\sqrt{n}} = \frac{2(19) - 32}{\sqrt{32}} = 1.061 \]
From the standard normal table, the critical value z0.05 = 1.645 for one-sided
test. Since 1.061 < 1.645, the null hypothesis cannot be rejected.
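As a quick check of the large-sample sign test, the statistic can be computed directly. The sketch below uses a helper name of our own (`sign_test_z`) and takes the critical value 1.645 from the standard normal table quoted in the answer.

```python
from math import sqrt

def sign_test_z(s, n):
    """Normal approximation to the sign test: Z = (S - 0.5n) / (0.5 * sqrt(n))."""
    return (2 * s - n) / sqrt(n)

z = sign_test_z(19, 32)
reject = z > 1.645          # one-sided (right) test at the 5% level
print(round(z, 3), reject)  # 1.061 False
```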
The test procedure for paired two populations with signed-rank method is similar
to the test procedure for a single population.
Judge 1 2 3 4 5 6 7 8 9 10
Product A 6 8 4 9 4 7 6 5 6 8
Product B 4 5 5 8 1 9 2 3 7 2
Answer:
The hypotheses for testing differences in the quality of both paper products are
H0: the distributions of evaluation for both products 1 and 2 are the same, or
H0: D1 = D2, versus
H1: the distributions of evaluation for both products 1 and 2 are different, or
H1: D1 ≠ D2
Applying the signed-rank method, the test procedure generated the following
results:
From the table, the sum of the ranks of the ‘+’ differences is T+ = 46, while the
sum of the ranks of the ‘–’ differences is T– = 9. Since the test is two-sided, the
test statistic is given by
T = minimum (T+, T–) = T– = 9.
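The rank sums quoted above can be reproduced from the judges' scores. The following sketch is our own illustration of the signed-rank procedure (the function name `signed_rank_sums` is hypothetical): zero differences are dropped and tied absolute differences receive the mean of the ranks they occupy.

```python
def signed_rank_sums(first, second):
    """Return (T+, T-) for paired samples using mean ranks for ties."""
    diffs = [a - b for a, b in zip(first, second) if a != b]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the absolute differences are tied
        while (end + 1 < len(order)
               and abs(diffs[order[end + 1]]) == abs(diffs[order[pos]])):
            end += 1
        mean_rank = (pos + end) / 2 + 1          # ranks are 1-based
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    t_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    t_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return t_plus, t_minus

product_a = [6, 8, 4, 9, 4, 7, 6, 5, 6, 8]
product_b = [4, 5, 5, 8, 1, 9, 2, 3, 7, 2]
t_plus, t_minus = signed_rank_sums(product_a, product_b)
print(t_plus, t_minus, min(t_plus, t_minus))  # 46.0 9.0 9.0
```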
When the sample size is large, say n ≥ 15, the distribution of the test statistic T
(either T+ or T–) will approach the normal distribution. To perform the signed-rank
test for large n, T is treated as a random variable with mean and variance
\[ \mu_T = E(T) = \frac{n(n+1)}{4}, \qquad \sigma_T^2 = \mathrm{Var}(T) = \frac{n(n+1)(2n+1)}{24} \]
Hence, the signed-rank test statistic for n ≥ 15 is
\[ Z = \frac{T - \mu_T}{\sigma_T} \]
which is a standard normal. For a single-sided test, the test statistic is
\[ Z = \frac{T^{+} - \mu_T}{\sigma_T} \quad \text{or} \quad Z = \frac{T^{-} - \mu_T}{\sigma_T} \]
Weight before 147.0 183.5 232.1 161.6 197.5 206.3 177.0 215.4
Weight after 137.9 176.2 219.0 163.8 193.5 201.4 180.6 203.2
Weight before 147.7 208.1 166.8 131.9 150.3 197.2 159.8 171.6
Weight after 149.0 195.4 158.5 134.4 149.3 189.1 159.1 173.2
Use the signed-rank test to test at 0.05 level of significance whether the
company’s claim is true or not.
Answer:
The energy drink is said to be effective if the weight distribution before >
weight distribution after, that is, taking each difference as weight after minus
weight before, the number of ‘+’ signs is less than the number
of ‘-’ signs. To determine whether the energy drink is effective or not in
reducing body weight, test
H0 : π = 1/2 versus H1 : π < 1/2
For a one-sided (left) test, the null hypothesis will be rejected if the test
statistic
\[ Z = \frac{T - \mu_T}{\sigma_T} = \frac{25 - 68}{\sqrt{374}} = -2.22 \]
is less than the critical value –z0.05 = –1.645.
Since z = –2.22 < –z0.05 = –1.645, the null hypothesis must be rejected. We
conclude that the energy drink is effective in reducing the body weight at 5%
level.
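The figures T = 25, mean 68 and variance 374 can be recovered from the 16 pairs of weights. In the sketch below the differences are taken as weight after minus weight before, which is our assumption about the sign convention; they are converted to whole tenths so that the ranking is not affected by floating-point error.

```python
from math import sqrt

before = [147.0, 183.5, 232.1, 161.6, 197.5, 206.3, 177.0, 215.4,
          147.7, 208.1, 166.8, 131.9, 150.3, 197.2, 159.8, 171.6]
after = [137.9, 176.2, 219.0, 163.8, 193.5, 201.4, 180.6, 203.2,
         149.0, 195.4, 158.5, 134.4, 149.3, 189.1, 159.1, 173.2]

# Differences (after - before) in tenths of a unit, kept as integers
diffs = [round(10 * (a - b)) for b, a in zip(before, after)]
assert 0 not in diffs                               # no zero differences here
assert len({abs(d) for d in diffs}) == len(diffs)   # no tied absolute values

# Rank the absolute differences from smallest (rank 1) to largest (rank n)
rank_of = {d: r for r, d in enumerate(sorted(diffs, key=abs), start=1)}
t = sum(rank_of[d] for d in diffs if d > 0)         # rank sum of '+' differences

n = len(diffs)
mu_t = n * (n + 1) / 4
sigma_t = sqrt(n * (n + 1) * (2 * n + 1) / 24)
z = (t - mu_t) / sigma_t
print(t, mu_t, round(sigma_t ** 2), round(z, 2))  # 25 68.0 374 -2.22
```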
EXERCISE 10.1
1. Suppose that the differences in paired data are 15 ‘+’ signs, 5 ‘-’
signs and 9 ‘0’. Use the sign test for right-end test.
(a) What are the values for n and S?
(b) At 0.05 level, will H0 be rejected?
Before 3 3 1 5 3 6 2 0 4 3 4 1
After 1 2 3 2 0 4 3 2 1 2 3 0
Use the sign test to test whether the new traffic-control system is
more effective than the old system at 0.05 level of significance.
Suppose that there are n1 and n2 independent samples from population 1 and
population 2 respectively. The procedures for the rank-sum test suggest that we
combine n1 + n2 = n observations and assign rank according to the observed
magnitude. Rank 1 will be assigned to observation with the smallest value and
rank n to the observation with the highest value. In the case of ties (identical
observations), we would replace the observations by the mean of the ranks that the
observations would be entitled to if they were distinguishable. If both samples
actually come from the same distribution, the sum of the ranks corresponding to
the first sample, W1, and the sum of the ranks corresponding to the second sample,
W2, will be proportional to the respective sample sizes. If n1 and n2 are equal, W1 and W2
should be almost identical. If one of the rank-sums is sufficiently large while the
other one is very small, this shows there is a possibility that a significant
difference exists in both sample distributions.
Mann and Whitney have suggested a test statistic that also uses the sum of the
ranks for both samples and it can be shown that this is equivalent to the Wilcoxon
test. This test, called the Mann-Whitney U test, has been used extensively since
the availability of table for U critical values.
When comparing two populations using the Mann-Whitney test, the following
statistics will be used as the test statistic:
\[ U_1 = n_1 n_2 + \frac{n_1(n_1 + 1)}{2} - W_1 \]
\[ U_2 = n_1 n_2 + \frac{n_2(n_2 + 1)}{2} - W_2 \]
or
\[ U = \text{the minimum of } U_1 \text{ and } U_2 \]
where U1 + U2 = n1n2, while W1 and W2 are the sums of the ranks of the values of
the first and second sample, respectively.
From the formulas for U1 and U2, U1 will be small when W1 is large. This can
only happen if the population 1 distribution is shifted to the right of the population
2 distribution. Hence, the test statistic U1 will be used when the alternative
hypothesis is D1 > D2.
If the samples chosen were from two identical populations, it is expected that the
sum of the ranks of both samples would not differ too much. If there is an
appreciable difference between the means of the two populations, most of the
lower ranks are likely to go to the values of one sample, while most of the higher
ranks are likely to go to the values of the other sample.
(a) State the Null and Alternative Hypotheses for two independent populations
test.
One-sided test: H0: D1 and D2 are equal versus H1: D1 has shifted to the
right of D2.
[or H0: D1 and D2 are equal versus H1: D1 has shifted to the
left of D2].
Two-sided test: H0: D1 and D2 are equal versus H1 : D1 has shifted either to
the left or the right of D2.
Answer:
Let D1 be the distribution of survival times of rats receiving treatment and D2 as
the distribution of survival times for rats not receiving treatment. To determine
the effectiveness of the serum treatment, test:
H0: D1 = D2 against H1: D1 > D2
(that is, D1 and D2 are equal against D1 has shifted to the right of D2)
Original Data 0.5 0.9 1.4 1.9 2.1 2.8 3.1 4.6 5.3
Rank 1 2 3 4 5 6 7 8 9
Perform the Mann-Whitney U test at 0.01 level of significance to test the null
hypothesis that the two populations sampled are identical against the alternative
hypothesis that the second diet produced a greater weight gain.
Answer:
Let D1 be the weight distribution of young turkeys with diet 1 and D2 be the
weight distribution of young turkeys with diet 2. To determine whether the
second diet results in greater weight compared to the first, test
H0 : D1 = D2 (or D1 and D2 are equal) against
H1 : D1 < D2 (or D1 has shifted to the left of D2)
From data, n1 = n2 = 16, hence, the normal approximation will be used. Prior to
that, obtain the sum of ranks w2 and test statistic U2.
w2 = 21 + 1 + 3 + 8 + 15 + 4 + 11 + 2 + 5.5 + 13 + 31 + 16 + 12 + 22 + 7 + 10
= 181.5
U = U2 = (16)(16) + (16)(17)/2 – 181.5 = 210.5
Since Z = 3.11 > 2.33 = Z0.01, the null hypothesis must be rejected. We conclude
that the second diet produces a greater gain in weight compared to the first diet
at 0.01 significance level. In other words, D1 distribution is situated on the left
side of D2 distribution.
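The step from U2 = 210.5 to Z = 3.11 is not shown in the answer. The sketch below fills it in under the usual large-sample assumption that, when both populations are identical, U has mean n1n2/2 and variance n1n2(n1 + n2 + 1)/12; these two formulas are standard results that we supply ourselves, not ones quoted in the text.

```python
from math import sqrt

# Ranks of the second sample, as listed in the answer
ranks_sample2 = [21, 1, 3, 8, 15, 4, 11, 2, 5.5, 13, 31, 16, 12, 22, 7, 10]
n1 = n2 = 16

w2 = sum(ranks_sample2)
u2 = n1 * n2 + n2 * (n2 + 1) / 2 - w2

# Assumed large-sample mean and standard deviation of U under H0
mu_u = n1 * n2 / 2
sigma_u = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
z = (u2 - mu_u) / sigma_u
print(w2, u2, round(z, 2))  # 181.5 210.5 3.11
```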
You have reached the end of Topic 10. Test your understanding with this next
self-check.
SELF-CHECK 10.2
What are the differences between the Rank-Sum, the Sign and the Signed-
Rank tests for comparing two populations?
Rank-Sum Test Sign Test Signed-Rank Test
EXERCISE 10.2
(b) n1 = 6 n2 =4 W2 = 17
H0: D1 and D2 are equal H1: D1 and D2 are unequal
Group 1 15 21 15 23 17 14 16
Group 2 18 22 24 25 19 24 17 19 23 16
Test at 0.01 level that group 2 obtained higher scores in the test.
Compare both methods of teaching.
(a) Does the data support the claim that the campaign was
successful?
Test the effectiveness of the campaign at 0.05 significance level.
(b) Repeat the test above using the signed-rank test.
ACTIVITY 10.2
• The Wilcoxon Rank test and the Mann-Whitney test can be used to compare
two independent populations.
Answers
TOPIC 1: CHI-SQUARE DISTRIBUTION, F DISTRIBUTION
AND THEIR APPLICATIONS
Exercise 1.1
(a) Histogram
(b) Scatter plot of probability function versus Y
Comment: The histogram at (a) and the scatter plot at (b) are right-skewed. Hence, it is true
that
\[ Y = \sum_{i=1}^{4} Z_i^2 \sim \chi^2(4) \]
Exercise 1.2
The value 0.831 in Figure 1.6(b) is found at column α = 0.975 and row v = 5, that is,
\( \chi^2_{0.975}(5) = 0.831 \).
From Figure 1.6(a), it is clear that the value 23.68 given in Table 1.1 is at column
α = 0.05 and row v = 14, which means ‘point 0.05 of the \( \chi^2(14) \) distribution is
23.68’, i.e. \( \chi^2_{0.05}(14) = 23.68 \).
Exercise 1.3
Step 1: Determine the test parameter
The population parameter to be tested is the population variance, σ².
H 0 : 2 42
H1 : 2 42
2 95%
2
= (15)
Exercise 1.4
Figure 1.1
Comment: The three distribution graphs show skewness. The curve shapes
change with the change in the degrees of freedom pairs.
Exercise 1.5
1. For α = 5% = 0.05, 1% = 0.01 and 0.1% = 0.001, refer to the first, third and
fourth rows in the Table for every pair of v1 and v2. Thus,
2. Values on the right of the equation represent values in the distribution table.
Thus, determine their values based on suitability/accuracy value in table by
referring to the intersection of the column and row according to the degrees
of freedom. Hence, we obtained:
(a) Fα(6,14) = 3.50
Since the value 3.50 (at v1 = 6 and v2 = 14) is on the second row, α = 0.025.
(b) Fα(10,32) = 2.93
Since the value 2.93 (at v1 = 10 and v2 = 32) is on the second row, α = 0.01.
(c) Fα(24,38) = 1.81
Since the value 1.81 (at v1 = 24 and v2 = 38) is on the second row, α = 0.05.
(d) Fα(2,24) = 5.61
Since the value 5.61 (at v1 = 2 and v2 = 24) is on the second row, α = 0.01.
Exercise 1.6
From the table, we obtain f0.01,14,11 = 4.30 and f0.01,11,14 = 3.87 (since α = 0.02). Hence:
\[ \frac{(3.07)^2}{(0.8)^2} \cdot \frac{1}{4.30} < \frac{\sigma_1^2}{\sigma_2^2} < \frac{(3.07)^2}{(0.8)^2} \cdot 3.87 \]
that is
\[ 3.425 < \frac{\sigma_1^2}{\sigma_2^2} < 56.991 \]
This means the assumption that the ratio \( \sigma_1^2 / \sigma_2^2 = 1 \) is not true because the
interval estimate of the ratio does not contain the value 1.
Exercise 1.7
Step 1: Determine the test parameter
The population parameters are the first and second population variances, σ1² and σ2²
respectively.
Terms of Expression H0 H1
12
H 0 : 12 22 H 0 : 1
22
12
H1 : H1 : 2 1
2 2
1 2
2
Exercise 1.8
1. Using properties c(i) and c(ii) of the chi square distribution
(a) X1 + X2 is distributed as χ²(1 + 5) = χ²(6). Hence, E[Y] = 6 and Var[Y] =
2(6) = 12.
(b) X1 + X3 is distributed as χ²(1 + 10) = χ²(11). Hence, E[Y] = 11 and
Var[Y] = 2(11) = 22.
X 2 2
2
X 2 X 2 2 2
Hence, proven that E 1 2 1 E Z12 Z 22 E 2 2 2
1 2
2
table, 0.05 of the distribution is on the right side of point 14.07. (See
column 0.05, row 7).Hence,
Pr( X 1 + X 2 > 14.07) = 0.05
(a) Sample mean,
\[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{9.1 + 14.3 + \dots + 10.4}{15} = 10.73 \]
Sample variance,
\[ s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = \frac{(9.1 - 10.73)^2 + \dots + (10.4 - 10.73)^2}{14} = 3.39 \]
Test statistic,
\[ \chi^2 = \frac{(n - 1)S^2}{\sigma^2} = \frac{14(3.39)}{1.9} = 24.98 \]
that is
\[ 0.1636 < \frac{\sigma_1^2}{\sigma_2^2} < 0.9168 \]
Since the whole confidence interval lies below 1, we are 95% confident that
the channel 1 sequence has a smaller variance than the channel 2
sequence, and therefore channel 1 should be chosen by the firm.
As such, we will reject H 0 when test statistic value, F > 4.85 or F < 0.151.
Exercise 2.1
Comment: From the two figures, both data sets illustrate the changes between and
within samples for the variables. In Figure 2.4(a), the changes between samples
are large compared to within-sample changes. However, Figure 2.4(b) shows that
the between-sample changes are not much different from the within-sample changes.
How about Figure 2.1 (from text book)? Figure 2.1 does not clearly show if the
variation between samples is statistically larger than within sample variation.
Thus, significance tests must be performed. Statistical test used to examine the
equality in population mean should be able to differentiate the between and within
sample variations. Thus, we need to calculate the between and within sample
variations.
Exercise 2.2
The following result is obtained:
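Sample                    1       2       3       4
                          65      75      59      94
                          87      69      78      89
                          73      83      67      80
                          79      81      62      88
                          81      72      83
                          69      79      76
                                  90
Total                     454     549     425     351
The number of students    6       7       6       4
Average                   75.67   78.43   70.83   87.75
The overall mean is
\[ \bar{x}_{..} = \frac{\sum_{j=1}^{4} \sum_{i=1}^{n_j} x_{ij}}{n} = \frac{65 + 75 + \dots + 88}{23} = \frac{1779}{23} = 77.35 \]
The total sum of squares is
\[ \sum_{j=1}^{4} \sum_{i=1}^{n_j} (x_{ij} - \bar{x}_{..})^2 = 1909.2 \]
and the between-sample (treatment) sum of squares is
\[ \sum_{j=1}^{4} n_j (\bar{x}_{.j} - \bar{x}_{..})^2 = 712.6 \]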
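The overall mean and the two sums of squares can be verified from the four samples. This is a minimal sketch with variable names of our own choosing; the within-sample sum of squares is obtained as the difference between the total and the between-sample sums of squares.

```python
samples = {
    1: [65, 87, 73, 79, 81, 69],
    2: [75, 69, 83, 81, 72, 79, 90],
    3: [59, 78, 67, 62, 83, 76],
    4: [94, 89, 80, 88],
}

all_values = [x for values in samples.values() for x in values]
n = len(all_values)
grand_mean = sum(all_values) / n

# Total variation: every observation against the overall mean
ss_total = sum((x - grand_mean) ** 2 for x in all_values)
# Between-sample variation: each sample mean against the overall mean
ss_between = sum(len(v) * (sum(v) / len(v) - grand_mean) ** 2
                 for v in samples.values())
ss_within = ss_total - ss_between

print(n, round(grand_mean, 2), round(ss_total, 1), round(ss_between, 1))
# 23 77.35 1909.2 712.6
```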
Exercise 2.3
In ANOVA tests, the critical area is determined by α, the degrees of freedom for
treatments and degrees of freedom for errors. Step 2 and Step 3 (in text) need to
be understood prior to getting the critical value.
Exercise 2.4
The relevant factors are classes with factor levels and type of class that the
students are in.
Determine the critical value, that is, F2,9,0.05 = 4.256 (obtained from the F-distribution
table). Reject H0 if F > 4.256.
Exercise 2.5
1. Population under consideration must be distributed as normal with equal
variance, and every sample chosen are independent and randomly selected.
3. We obtained:
(a) the critical value for the ANOVA test at α = 0.01 when there are 6
samples with 34 items in total is F5,28,0.01 = 3.75. This comes
from α = 0.01, the degrees of freedom for numerator = k – 1 = 6 – 1 =
5 and the degrees of freedom for denominator = N – k = 34 – 6 = 28.
(b) the critical value for the ANOVA test at α = 0.05 when there are 4
samples with 44 observations is F3,40,0.05 = 2.84. This comes from α = 0.05, the
degrees of freedom for numerator = k – 1 = 4 – 1 = 3 and the degrees
of freedom for denominator = N – k = 44 – 4 = 40.
4. We obtained:
(a) When MSE = 14.6 and MS(Tr) = 35.7,
\[ F = \frac{MS(Tr)}{MSE} = \frac{35.7}{14.6} = 2.45 \]
Determine the critical value, that is, F2,14,0.01 = 6.515 (obtained from the F
distribution table). Reject H0 when F > 6.515.
Exercise 3.1
Step 1: Construct appropriate hypothesis statement
H0 : attendance record is uniform every day, or
H0 : absenteeism rate is the same every day, that is, 10 people daily.
H1 : otherwise
Day          Observed        Expected        O – E    (O – E)²/E
             Frequency, O    Frequency, E
Monday       12              10               2       0.4
Tuesday      9               10              –1       0.1
Wednesday    11              10               1       0.1
Thursday     10              10               0       0
Friday       9               10              –1       0.1
Saturday     9               10              –1       0.1
Total        60              60                       0.8
Thus we obtained
\[ X^2 = \sum \frac{(O - E)^2}{E} = 0.8 \]
Exercise 3.2
1. Solve the question using the following steps:
\[ \Pr(X = x) = \binom{n}{x} p^x (1 - p)^{n - x} \]
and we obtained the following table (Expectation = Pr(X = x) multiplied with f):
X              0      1      2       3       4       5 or 6
Expectation    0.4    4.2    16.7    33.1    32.7    12.9
Since the first two expected frequencies are less than 5 (X = 0 and 1), both are
combined with X = 2, resulting in the expected frequency 21.3 (that is 0.4 + 4.2 +
16.7). The observed and expected frequencies are then combined for
subsequent analysis in the following table:
X           ≤2      3       4       5 or 6    Total
Observed    21      33      31      15        100
Expected    21.3    33.1    32.7    12.9      100
and
\[ X^2 = \frac{(21 - 21.3)^2}{21.3} + \frac{(33 - 33.1)^2}{33.1} + \frac{(31 - 32.7)^2}{32.7} + \frac{(15 - 12.9)^2}{12.9} = 0.43 \]
X           0        1         2        3        4 or more    Total
Observed    102      114       74       28       12           330
Expected    99.39    119.27    71.56    28.63    8.59         330
Hence, we obtained:
\[ X^2 = \sum \frac{(O - E)^2}{E} = \frac{(102 - 99.39)^2}{99.39} + \dots + \frac{(12 - 8.59)^2}{8.59} = 1.75 \]
Thus,
\[ X^2 = \sum \frac{(O - E)^2}{E} = \frac{(8 - 6.63)^2}{6.63} + \frac{(10 - 10.55)^2}{10.55} + \dots + \frac{(2 - 1.38)^2}{1.38} = 0.83 \]
Exercise 3.3
Using the formula for expected frequencies,
\[ E_{ij} = N \cdot \frac{L_j}{N} \cdot \frac{B_i}{N} \]
we obtained the following results:
                 Type of Car
Age              Local-made                 Import                      Total
<30              (110 × 99)/200 = 54.45     (110 × 101)/200 = 55.55     110
30 and above     (90 × 99)/200 = 44.55      (90 × 101)/200 = 45.45      90
Total            99                         101                         200
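The four expected frequencies follow from the row and column totals alone. In the sketch below we read B as the row (age group) totals and L as the column (type of car) totals; that reading is our assumption, although it reproduces the values in the table.

```python
row_totals = [110, 90]    # the two age groups
col_totals = [99, 101]    # local-made, import
N = sum(row_totals)

# E_ij = N * (column total / N) * (row total / N) = row total * column total / N
expected = [[r * c / N for c in col_totals] for r in row_totals]

for row in expected:
    print([round(e, 2) for e in row])
# [54.45, 55.55]
# [44.55, 45.45]
```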
Exercise 3.4
Step 1: Construct appropriate hypothesis statement
H 0 : both variables are independent
H 0 : there is no relationship between the interest on type of car and age level.
H1 : both variables are dependent.
Thus, v = (2 – 1) (2 – 1) = 1
E .
From the table, we obtained the following result:
Exercise 3.5
The information can be summarized in the table below:
Type of Favourite Sport
Students
Football Basketball Baseball Tennis
Male 33 38 24 5
Female 38 21 15 26
The variables classified are the tendency/interest on type of sport. Populations are
male and female students. Testing method follows several steps:
X > 0.05,
2
.
Exercise 3.6
Step 1: Construct the appropriate hypothesis statement
H 0 : there is no difference in colour blindness level according to gender.
H1 : otherwise.
          Colour Blindness
          Normal          Colour Blind    Total
Male      2210 (2280)     190 (120)       2400
Female    2540 (2470)     60 (130)        2600
Total     4750            250             5000
(b) Thus,
\[ X^2_{cc} = \frac{(|2210 - 2280| - 0.5)^2}{2280} + \dots + \frac{(|60 - 130| - 0.5)^2}{130} = 81.48 \]
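The effect of the 0.5 continuity correction can be seen by computing the statistic both ways from the observed table. This is a sketch with names of our own: without the correction the sum comes to about 82.66, and with the correction it comes to about 81.48, so the decision is the same in either case.

```python
observed = [[2210, 190],   # male: normal, colour blind
            [2540, 60]]    # female: normal, colour blind

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
N = sum(row_totals)

def chi_square(correction):
    """Sum of (|O - E| - correction)^2 / E over all four cells."""
    total = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / N
            total += (abs(o - e) - correction) ** 2 / e
    return total

x2_plain = chi_square(0.0)
x2_cc = chi_square(0.5)
print(round(x2_plain, 2), round(x2_cc, 2))  # 82.66 81.48
```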
Exercise 3.7
1. Step 1: Construct the appropriate hypothesis statement
H0 : the distribution of book-loan is uniform, or
H0 : the number of books borrowed is the same every day, that is 258 books.
H1 : otherwise.
The test statistic is
\[ X^2 = \sum \frac{(O - E)^2}{E} \]
and the critical value is \( \chi^2_{1\%}(5) = 15.086 \) with v = number of days – 1 = 6 – 1 = 5
degrees of freedom.
Day          Observed        Expected        O – E    (O – E)²/E
             Frequency, O    Frequency, E
Monday       204             258             –54      11.3023
Tuesday      292             258              34      4.4806
Wednesday    242             258             –16      0.9923
Thursday     283             258              25      2.4225
Friday       252             258              –6      0.1395
Saturday     275             258              17      1.1202
Total        1548            1548                     20.457
Thus
\[ X^2 = \sum \frac{(O - E)^2}{E} = 20.457 \]
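The goodness-of-fit statistic for the book-loan data can be checked as follows. The sketch assumes a uniform expected frequency (total loans divided by the number of days) and uses the tabulated critical value 15.086 quoted in the answer.

```python
observed = {"Monday": 204, "Tuesday": 292, "Wednesday": 242,
            "Thursday": 283, "Friday": 252, "Saturday": 275}

total = sum(observed.values())
expected = total / len(observed)      # uniform loans per day under H0

x2 = sum((o - expected) ** 2 / expected for o in observed.values())
critical_value = 15.086               # chi-square, 1% level, 5 degrees of freedom
reject = x2 > critical_value

print(total, expected, round(x2, 3), reject)  # 1548 258.0 20.457 True
```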
2. It is known that,
Thus,
{X = 0} : no goal obtained in a match by a team.
λ : mean/average goals per team per match
Thus, λ̂ = Total number of goals made/(2 × number of matches)
Number of Goals, X          0       1       2       3       4       5       6       7       8
Expected Number of Teams    2.883   7.583   9.971   8.741   5.747   3.023   1.325   0.498   0.164
Since the number of expected team scoring X=0,1 and X=5,6,7 and 8 goals
are less than 5, we need to combine the total expected teams according to the
number of goals. The result is displayed in the following table:
Thus, we get:
\[ X^2 = \sum \frac{(O - E)^2}{E} = \frac{(8 - 10.466)^2}{10.466} + \dots + \frac{(6 - 5.01)^2}{5.01} = 3.73 \]
X 0 1 2 3 Total
Observed 22 37 20 21 100
Expected 16.807 36.015 30.87 16.31 100
and
\[ X^2 = \frac{(22 - 16.807)^2}{16.807} + \frac{(37 - 36.015)^2}{36.015} + \frac{(20 - 30.87)^2}{30.87} + \frac{(21 - 16.31)^2}{16.31} = 6.81 \]
H1 : otherwise
Thus,
\[ X^2 = \sum \frac{(O - E)^2}{E} = \frac{(16 - 15.505)^2}{15.505} + \frac{(21 - 25.886)^2}{25.886} + \dots + \frac{(10 - 15.505)^2}{15.505} = 7.945 \]
Thus, v = (3 – 1) x (3 – 1) = 4
E
From the table, we obtained the following result:
              Category
Population    1             2             3             4             Total
1             16 (19.67)    38 (38.33)    5 (10.67)     41 (31.33)    100
2             24 (19.67)    41 (38.33)    12 (10.67)    23 (31.33)    100
3             19 (19.67)    36 (38.33)    15 (10.67)    30 (31.33)    100
(b) Hence, we can calculate the value of the test statistic and obtain:
\[ X^2 = \frac{(16 - 19.67)^2}{19.67} + \frac{(38 - 38.33)^2}{38.33} + \dots + \frac{(30 - 31.33)^2}{31.33} = 12.184 \]
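The expected frequencies in brackets and the test statistic can both be recomputed from the observed counts. This is a sketch using our own variable names; the expected value for each cell is the row total times the column total divided by the grand total.

```python
observed = [
    [16, 38, 5, 41],    # population 1
    [24, 41, 12, 23],   # population 2
    [19, 36, 15, 30],   # population 3
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
N = sum(row_totals)

expected = [[r * c / N for c in col_totals] for r in row_totals]
x2 = sum((o - e) ** 2 / e
         for obs_row, exp_row in zip(observed, expected)
         for o, e in zip(obs_row, exp_row))
df = (len(observed) - 1) * (len(observed[0]) - 1)

print([round(e, 2) for e in expected[0]], round(x2, 2), df)
# [19.67, 38.33, 10.67, 31.33] 12.18 6
```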
(b) Thus,
\[ X^2_{cc} = \frac{(|32 - 21.5| - 0.5)^2}{21.5} + \dots + \frac{(|89 - 78.5| - 0.5)^2}{78.5} = 11.85 \]
TOPIC 4: CORRELATION
Exercise 4.1
(a) 1, +
(b) 2, +
(c) 2, -
Exercise 4.2
         xi    yi    xi yi    xi²    yi²
         1     2     2        1      4
         2     3     6        4      9
         4     4     16       16     16
         5     7     35       25     49
         6     12    72       36     144
         8     10    80       64     100
         10    7     70       100    49
Total    36    45    281      246    371
\[ r_p = \frac{n\sum x_i y_i - (\sum x_i)(\sum y_i)}{\sqrt{\left(n\sum x_i^2 - (\sum x_i)^2\right)\left(n\sum y_i^2 - (\sum y_i)^2\right)}} = \frac{7(281) - (36)(45)}{\sqrt{\left(7(246) - (36)^2\right)\left(7(371) - (45)^2\right)}} = 0.703 \]
The Pearson correlation coefficient value 0.703 shows that there is a strong
positive linear relationship between the frequency of fertilizer usage and crop
yields. This means that the more frequently the farmer distributes the fertilizer, the
higher the amount of crop yield produced.
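The coefficient can be recomputed from the seven (x, y) pairs in the table. The function name `pearson_r` below is our own; it follows the computational formula used in this answer.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient from the raw sums."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

x = [1, 2, 4, 5, 6, 8, 10]
y = [2, 3, 4, 7, 12, 10, 7]
r = pearson_r(x, y)
print(round(r, 3))  # 0.703
```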
Exercise 4.3
A one-sided hypothesis test (since the Pearson correlation coefficient value is
positive) is as follows:
H0 : ρp = 0
H1 : ρp > 0
Test Statistic :
\[ T = r_p \sqrt{\frac{n - 2}{1 - r_p^2}} = 0.703 \sqrt{\frac{7 - 2}{1 - (0.703)^2}} = 2.21 \]
Test Results : T follows a t distribution with v = 7 – 2 = 5 degrees of
freedom and 0.01 significance level.
Reject H0 when
T > t0.01,5 = 3.365
Since the test statistic T < 3.365, we cannot reject the null hypothesis. This means
that we do not have enough evidence to say that the population correlation
coefficient is not zero; that is, there is no significant relationship
between the frequency of fertilizer distribution and crop yield at 1% significance
level.
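The test statistic for a correlation coefficient depends only on r and n, so it is easy to check. The helper name `correlation_t` is our own, and the critical value 3.365 is the tabulated t value quoted in the answer.

```python
from math import sqrt

def correlation_t(r, n):
    """T = r * sqrt((n - 2) / (1 - r^2)), with n - 2 degrees of freedom."""
    return r * sqrt((n - 2) / (1 - r ** 2))

t_stat = correlation_t(0.703, 7)
reject = t_stat > 3.365             # one-sided test at the 1% level, v = 5
print(round(t_stat, 2), reject)     # 2.21 False
```

The same helper gives 1.87 for the Spearman coefficient 0.5515 with n = 10, since the statistic has the same form in Exercise 4.5.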
Exercise 4.4
\[ r_s = 1 - \frac{6\sum D_i^2}{n(n^2 - 1)} = 1 - \frac{6(74)}{10\left((10)^2 - 1\right)} = 0.5515 \]
The Spearman correlation coefficient value 0.5515 shows that there exists a strong
positive linear relationship between athletes’ ranking and their position in a
match.
Exercise 4.5
A one-sided hypothesis test (since the Spearman correlation coefficient value is
positive) is as follows:
H0 : ρs = 0
H1 : ρs > 0
Test Statistic :
\[ T = r_s \sqrt{\frac{n - 2}{1 - r_s^2}} = 0.5515 \sqrt{\frac{10 - 2}{1 - (0.5515)^2}} = 1.87 \]
Test Results : T follows a t distribution with v = 10 – 2 = 8 degrees of freedom
and 0.01 significance level.
Reject H0 when
T > t0.01,8 = 2.896
Since the test statistic T < 2.896, we cannot reject the null hypothesis. This means
that there is not enough evidence to say that there exists a significant relationship
between athlete ranking and their position in a match at 1% significance level.
Exercise 4.6
1. The importance of the two-way scatter plot:
– the two-way scatter plot can be used to display or determine the
relationship between two quantitative variables X and Y.
– the two-way scatter plot can also be used to analyze patterns in
bivariate data.
(a) From the scatter plot we can suggest that the number of tellers and
customers’ waiting time have a negative relationship.
\[ r_p = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{\sqrt{\left(n\sum x_i^2 - (\sum x_i)^2\right)\left(n\sum y_i^2 - (\sum y_i)^2\right)}} \]
i x y xy x2 y2
1 4.00 6.40 25.60 16.00 40.96
2 1.00 8.70 8.70 1.00 75.69
3 5.00 3.20 16.00 25.00 10.24
4 3.00 10.50 31.50 9.00 110.25
5 4.00 8.20 32.80 16.00 67.24
6 3.00 11.30 33.90 9.00 127.69
7 3.00 11.30 33.90 9.00 127.69
8 2.00 12.80 25.60 4.00 163.84
9 2.00 11.60 23.20 4.00 134.56
10 6.00 3.20 19.20 36.00 10.24
11 3.00 9.40 28.20 9.00 88.36
12 2.00 12.80 25.60 4.00 163.84
13 4.00 8.20 32.80 16.00 67.24
Total 42.00 117.60 337.00 158.00 1187.84
\[ r_p = \frac{4381 - 4939.2}{\sqrt{(2054 - 1764)(15441.92 - 13829.76)}} = \frac{-558.2}{(17.03)(40.15)} = \frac{-558.2}{683.75} = -0.8163 \]
The Pearson correlation coefficient value -0.8163 indicates that there
exists a strong negative linear relationship between the number of
tellers and customers’ waiting time. Thus, we can suggest that if the
number of tellers is increased, customers’ waiting time
will be reduced.
(c) Since the Pearson correlation coefficient value is negative, we will use
the following null and alternative hypotheses;
H0 : ρp = 0
H1 : ρp < 0
Test statistic,
\[ T = r_p \sqrt{\frac{n - 2}{1 - r_p^2}} = -0.8163 \sqrt{\frac{13 - 2}{1 - (0.8163)^2}} = -4.687 \]
Since the test statistic T < –t0.01,11 = –2.718, we reject the null
hypothesis and conclude that ρp < 0. In other words, there exists a
significant negative linear relationship between the number of tellers
and customers’ waiting time. We may conclude that
customers’ waiting time can be reduced further if the
number of tellers is increased.
Table 1
Interview Sales
Interview Sales
Rank, rank, D D2
Rank (‘000)
u v
5 17 5 9 -4 16
3 32 3 5 -2 4
1 27 1 10 -9 81
9 46 9 5 4 16
6 55 6 1 5 25
4 45 4 5 -1 1
10 36 10 7 -3 9
2 28 2 8 -6 36
7 18 7 2 5 25
8 66 8 3 5 25
D 2 238
rs 1
D
6 2
nn 1 2
6238
rs 1 =0.442
10 10 2 1
The Spearman correlation coefficient value is 0.442. Thus we can say
that there exists a weak positive linear relationship between the sales
(‘000) and the interview rank.
6238
rs 1 =0.442
10 10 2 1
Similarly, we notice that the Spearman correlation coefficient value is
0.442. Thus, we may conclude that there exists a weak positive linear
relationship between the test score and the interview rank.
(b) Since the correlation coefficient values obtained in (a) above are both
positive, we will use the following null and alternative hypotheses to
test for their significance:
H0 : ρs = 0
H1 : ρs > 0
Test statistic,
\[ T = r_s \sqrt{\frac{n - 2}{1 - r_s^2}} = 0.442 \sqrt{\frac{10 - 2}{1 - (0.442)^2}} = 1.3937 \]
Critical value, t0.05,8 = 1.860
As the test statistic T < t0.05,8 = 1.860, we do not reject the null
hypothesis and conclude that ρs = 0. Hence,
(c) Our results in (a) and (b) do show that both criteria namely the test
score and sales record have very weak positive relationship with the
salesman’s interview rank. Therefore in this particular problem, there
is no guarantee that salesman selection based on interview rank will
bring higher profit to the company.
Exercise 5.1
x      y      ŷ = –12.84 + 36.18x      y – ŷ
8.3 227 287.454 -60.454
8.3 312 287.454 24.546
12.1 362 424.938 -62.938
12.1 521 424.938 96.062
17.0 640 602.22 37.78
47.0 539 1687.62 -1148.62
17.0 728 602.22 125.78
24.3 945 866.334 78.666
24.3 738 866.334 -128.334
24.3 759 866.334 -107.334
33.6 1263 1202.81 60.192
Exercise 5.2
x y xy x2 y2
60 63.6 3816.0 3600 4044.96
62 65.2 4042.4 3844 4251.04
64 66.0 4224.0 4096 4356.00
65 65.5 4257.5 4225 4290.25
66 66.9 4415.4 4356 4475.61
67 67.1 4495.7 4489 4502.41
68 67.4 4583.2 4624 4542.76
70 68.3 4781.0 4900 4664.89
72 70.1 5047.2 5184 4914.01
74 70.0 5180.0 5476 4900.00
Total 668.0 670.1 44842.4 44794 44941.90
From the table above, we need to calculate the x̄ and ȳ values first, that is
\[ \bar{x} = \frac{\sum x_i}{n} = \frac{668.0}{10} = 66.8 \quad \text{and} \quad \bar{y} = \frac{\sum y_i}{n} = \frac{670.1}{10} = 67.01 \]
Now, we can get the β̂1 regression coefficient using the following formula:
\[ \hat{\beta}_1 = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2} = \frac{44842.4 - 10(66.8)(67.01)}{44794 - 10(66.8)^2} = 0.465 \]
β̂1 = 0.465 shows that the y value will increase by 0.465 for each one unit increase
in x. β̂0 = 35.95 refers to the y value when the x value is zero.
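The slope and intercept can be recomputed from the ten (x, y) pairs. In the sketch below the intercept is obtained as ȳ minus the slope times x̄, which is the usual least-squares relation and is our assumption about how 35.95 was reached. With the unrounded slope the intercept comes to about 35.98; the value 35.95 follows when the slope is first rounded to 0.465.

```python
x = [60, 62, 64, 65, 66, 67, 68, 70, 72, 74]
y = [63.6, 65.2, 66.0, 65.5, 66.9, 67.1, 67.4, 68.3, 70.1, 70.0]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
sum_xy = sum(a * b for a, b in zip(x, y))
sum_x2 = sum(a * a for a in x)

slope = (sum_xy - n * x_bar * y_bar) / (sum_x2 - n * x_bar ** 2)
intercept = y_bar - slope * x_bar                    # unrounded slope
intercept_rounded = y_bar - round(slope, 3) * x_bar  # slope rounded to 0.465

print(round(slope, 3), round(intercept, 2), round(intercept_rounded, 2))
# 0.465 35.98 35.95
```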
Exercise 5.3
(a) The hypothesis test (one-sided test since the β̂1 value is positive):
H0 : β1 = 0
H1 : β1 > 0
Test Statistic :
\[ T = \frac{\hat{\beta}_1}{s(\hat{\beta}_1)} = \frac{0.465}{0.033} = 14.085 \]
Test Results : T follows a t distribution with v = 10 – 2 = 8 degrees
of freedom and 0.05 significance level.
Reject H0 when
T > t0.05,8 = 1.86
s(β̂1) is the standard deviation of the β̂1 sampling distribution. The formula to
get the standard deviation of β̂1 is
\[ s(\hat{\beta}_1) = \sqrt{\frac{\left(\sum y_i^2 - \hat{\beta}_0 \sum y_i - \hat{\beta}_1 \sum x_i y_i\right)/(n - 2)}{\sum x_i^2 - n\bar{x}^2}} = 0.033 \]
Since the test statistic T = 14.085 > t0.05,8 = 1.86, we reject the null hypothesis. We have
enough evidence to say that the β1 value is not zero but positive.
(b) The 99% confidence interval for β1 is as follows (with α = 0.01 and
t0.005,8 = 3.355):
\[ \hat{\beta}_1 - t_{0.005,8}\, s(\hat{\beta}_1) < \beta_1 < \hat{\beta}_1 + t_{0.005,8}\, s(\hat{\beta}_1) \]
\[ 0.465 - 3.355(0.033) < \beta_1 < 0.465 + 3.355(0.033) \]
\[ 0.354 < \beta_1 < 0.576 \]
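The limits of the 99% interval can be confirmed with the rounded values used in the answer. This is only an arithmetic sketch: the estimate 0.465, its standard deviation 0.033 and the table value 3.355 are taken as given rather than recomputed.

```python
slope = 0.465        # estimated regression coefficient
std_dev = 0.033      # standard deviation of the estimate
t_value = 3.355      # t table value for 8 degrees of freedom, 0.005 in each tail

margin = t_value * std_dev
lower, upper = slope - margin, slope + margin
print(round(lower, 3), round(upper, 3))  # 0.354 0.576
```

Since the whole interval lies above zero, it is consistent with the rejection of the null hypothesis in part (a).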
Exercise 5.4
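The coefficient of determination is
\[ R^2 = \frac{\hat{\beta}_0 \sum y_i + \hat{\beta}_1 \sum x_i y_i - n\bar{y}^2}{\sum y_i^2 - n\bar{y}^2} = 0.961 \]
where the denominator is 44941.9 – 10(67.01)².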
Exercise 5.5
(a) There is no particular pattern in this plot. We found that the model has
random error with constant variance. Hence, there is no violation from the
linear model assumptions.
(c) The histogram shows a normal shape. Hence random error is distributed as
normal. Thus, there is no violation from the linear model assumptions.
Exercise 5.6
(a)
The plot shows the regression model is in reciprocal function form. Hence,
transformation is x* =1/x and the linear regression model is y = 2.67 –
0.68x*.
(b)
(c)
(d)
Exercise 5.7
Referring to Exercise 5.2, the simple linear regression model is
ŷ = 35.95 + 0.465x
To get the standard error of the estimate, we need the ŷ values. Using the regression
model ŷ = 35.95 + 0.465x, the ŷ value for each x value is shown in the table
below:
x    60       62       64       65        66       67
y    63.6     65.2     66.0     65.5      66.9     67.1
ŷ    63.85    64.78    65.71    66.175    66.64    67.105
x    68       70       72       74
y    67.4     68.3     70.1     70.0
ŷ    67.57    68.5     69.43    70.36
Hence,
\[ s_\varepsilon = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n - 2}} = \sqrt{\frac{1.494}{8}} = 0.432 \]
The 99% confidence level gives α = 0.01 and \( t_{\alpha/2} = t_{0.005} = 3.355 \). Hence, for
xg = 86, the prediction interval is
\[ \hat{y} \pm t_{\alpha/2}\, s_\varepsilon \sqrt{1 + \frac{1}{n} + \frac{(x_g - \bar{x})^2}{\sum (x_i - \bar{x})^2}} = 75.94 \pm 2.61 \]
It is found that the lower and upper limits of the 99% prediction interval are 73.33
and 78.55 respectively. This means that the predicted y value is 73.33 units at the
minimum and 78.55 units at the maximum for x = 86.
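The interval for x = 86 can be checked with the rounded quantities from this answer. The sketch assumes the usual form of the prediction interval, in which the term under the square root is 1 + 1/n + (xg – x̄)²/Σ(xi – x̄)², and takes Σ(xi – x̄)² as Σxi² – n x̄² from Exercise 5.2.

```python
from math import sqrt

n = 10
x_bar = 66.8
sxx = 44794 - n * x_bar ** 2     # sum of squared deviations of x
s_eps = 0.432                    # standard error of the estimate
t_value = 3.355                  # t table value, 8 degrees of freedom
x_g = 86

y_hat = 35.95 + 0.465 * x_g
half_width = t_value * s_eps * sqrt(1 + 1 / n + (x_g - x_bar) ** 2 / sxx)
lower, upper = y_hat - half_width, y_hat + half_width

print(round(y_hat, 2), round(half_width, 2), round(lower, 2), round(upper, 2))
# 75.94 2.61 73.33 78.55
```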
Exercise 5.8
The \( t_{\alpha/2} \) and \( s_\varepsilon \) values can be obtained from Exercise 5.7, and
ŷ = 35.95 + 0.465(69) = 68.04. Hence, the interval
for the expected value of y for xg = 69 is:
\[ \hat{y} \pm t_{\alpha/2}\, s_\varepsilon \sqrt{\frac{1}{n} + \frac{(x_g - \bar{x})^2}{\sum (x_i - \bar{x})^2}} = 68.04 \pm 0.52 \]
It is found that the lower and upper limits of the confidence interval are 67.52 and
68.56 respectively. This means that the expected y value is 67.52 units at the
minimum and 68.56 units at the maximum for x = 69.
Exercise 6.1
(b) ŷ = 9.430 + 5.266(4) + 2.0612(5) = 40.8
Exercise 6.2
1.
2. Data a)
y x1 x2
10 2 5
24 3 6
40 7 6
20 3 5
15 4 3
SUMMARY OUTPUT
Regression Statistics
Multiple R 0.999906
R Square 0.999812
Adjusted R Square 0.999435
Standard Error 0.25713
Observations 4
ANOVA
Df SS MS F Significance F
Regression 2 350.6839 175.3419 2652.047 0.013729
Residual 1 0.066116 0.066116
Total 3 350.75
Data b)
y x1 x2
10 2 2
25 4 6
30 4 8
20 3 6
15 4 3
SUMMARY OUTPUT
Regression Statistics
Multiple R 0.997365
R Square 0.994737
Adjusted R Square 0.984211
Standard Error 0.811107
Observations 4
ANOVA
Df SS MS F Significance F
Regression 2 124.3421 62.17105 94.5 0.072548
Residual 1 0.657895 0.657895
Total 3 125
Exercise 7.1
1.
Nominal Measurement: Data which uses numbers, symbols or codes to classify
objects, individuals or characteristics is known as nominal scale.
Ordinal Measurement: If an object in a category or class has a relationship or
connection such as ‘more than’ or ‘less than’ with other classes, then the data
is ordinal.
2. From the data there are 4 respondents who chose rank 1, 7 respondents
chose rank 2, none chose rank 3, 1 chose rank 4 and 8 chose rank 5. Hence,
(a) Mean = Data Average = [4(1) + 7(2) + 0(3) + 1(4) + 8(5)] / 20 = 62/20
= 3.1
Median = middle of 20 data = mean of the 10th and 11th data.
The 10th data = 2 and the 11th data = 2. Hence, median = 2.
Mode = data with the highest frequency = 5
(b) The parametric statistical method is not suitable for use since it uses
mean value to represent the data location, and the mean quantity does
not reflect the actual data in this data analysis.
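The three location measures can be verified with Python's standard `statistics` module. The sketch assumes the frequencies used in the mean calculation above, namely 4, 7, 0, 1 and 8 respondents for ranks 1 to 5.

```python
from statistics import mean, median, mode

# Frequencies per rank, as used in the mean calculation (assumed reading of the data)
freq = {1: 4, 2: 7, 3: 0, 4: 1, 5: 8}

# Expand the frequency table into the 20 individual responses
data = [rank for rank, count in freq.items() for _ in range(count)]

print(len(data), mean(data), median(data), mode(data))
```

The mean of 3.1 sits between the two clusters at ranks 2 and 5, which is the reason given in part (b) for not using it to represent this ordinal data.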
Exercise 8.1
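1. n1 = 6, n2 = 5. To get Pr(R ≥ 8), calculate
$$f(8) = \frac{2\binom{5}{3}\binom{4}{3}}{\binom{11}{6}} = \frac{2 \cdot 10 \cdot 4}{462} = \frac{80}{462}$$
$$f(9) = \frac{\binom{5}{4}\binom{4}{3} + \binom{5}{3}\binom{4}{4}}{\binom{11}{6}} = \frac{(5)(4) + (10)(1)}{462} = \frac{30}{462}$$
$$f(10) = \frac{2\binom{5}{4}\binom{4}{4}}{\binom{11}{6}} = \frac{2 \cdot 5 \cdot 1}{462} = \frac{10}{462}$$
$$f(11) = \frac{\binom{5}{5}\binom{4}{4} + \binom{5}{4}\binom{4}{5}}{\binom{11}{6}} = \frac{1}{462}$$
Hence
$$\Pr(R \ge 8) = f(8) + f(9) + f(10) + f(11) = \frac{80 + 30 + 10 + 1}{462} = \frac{121}{462} = 0.2619$$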
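The exact distribution of the number of runs R used in Exercise 8.1 can be tabulated with `math.comb`. This is a minimal sketch of the standard even and odd cases of the run-count formula; the function name `run_pmf` is hypothetical.

```python
from math import comb

def run_pmf(r, n1, n2):
    """Exact probability of observing r runs with n1 and n2 symbols of each kind."""
    if r < 2:
        return 0.0                       # at least 2 runs when both symbols occur
    total = comb(n1 + n2, n1)            # number of distinct arrangements
    if r % 2 == 0:                       # even number of runs: r = 2k
        k = r // 2
        ways = 2 * comb(n1 - 1, k - 1) * comb(n2 - 1, k - 1)
    else:                                # odd number of runs: r = 2k + 1
        k = (r - 1) // 2
        ways = (comb(n1 - 1, k) * comb(n2 - 1, k - 1)
                + comb(n1 - 1, k - 1) * comb(n2 - 1, k))
    return ways / total

# Upper-tail probability Pr(R >= 8) for n1 = 6, n2 = 5 (R cannot exceed 11 here)
p_upper = sum(run_pmf(r, 6, 5) for r in range(8, 12))
print(round(p_upper, 4))
```

`math.comb(n, k)` returns 0 when k exceeds n, so impossible terms drop out without special handling.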
Exercise 8.2
1. (a) n1 = 7, n2 = 4, R = 7
$$\mu_R = \frac{2(7)(4)}{7 + 4} + 1 = 6.09, \qquad \sigma_R = \sqrt{\frac{2(7)(4)[2(7)(4) - 7 - 4]}{(7 + 4)^2 (7 + 4 - 1)}} = 1.44$$
Hence, the test statistic z = (27 – 23.5)/3.21 = 1.09. Since z = 1.09 falls
between –1.96 and +1.96, the null hypothesis cannot be rejected. There is
not enough evidence to say that the sequence of passengers’ queuing is not
random at the α = 5% level.
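For larger samples the runs test relies on the normal approximation, with the mean and standard deviation of R computed from n1 and n2. The sketch below evaluates those two formulas for part (a) and the z statistic quoted above; the helper names are hypothetical.

```python
import math

def runs_mean_sd(n1, n2):
    """Mean and standard deviation of the number of runs under randomness."""
    n = n1 + n2
    mu = 2 * n1 * n2 / n + 1
    var = 2 * n1 * n2 * (2 * n1 * n2 - n) / (n ** 2 * (n - 1))
    return mu, math.sqrt(var)

def z_statistic(r, mu, sd):
    """Standardised number of runs (no continuity correction)."""
    return (r - mu) / sd

mu_a, sd_a = runs_mean_sd(7, 4)       # part (a): n1 = 7, n2 = 4
z = z_statistic(27, 23.5, 3.21)       # values quoted in the text above
print(round(mu_a, 2), round(sd_a, 2), round(z, 2))
```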
Exercise 8.3
1. To determine whether students’ absenteeism occurs at random or not, test
H0 : The number of students who are absent is random versus
H1 : The number of students who are absent is not random
α = 0.01
Median = 31.5
Let n1 = the number of absenteeism > 30.5 = 11
And n2 = the number of absenteeism < 30.5 = 11
By assigning ‘-’ and ‘+’ signs, the following sequence can be obtained:
- - - - +- + - - + + -+ - + - - - + - +-
Since z = 0.655 falls between –2.575 and +2.575, we do not reject the null
hypothesis. The number of students absent from school for 22 consecutive
days is random at α = 0.01.
Median = 0.021, and the ‘+’, ‘–’ and ‘0’ arrangement generates the following
sequence: – 0 – – – – + 0 + + + +, so that n1 = 5, n2 = 5 and R = 2.
From Table 1 Appendix (b), the critical region for the runs test with n1 = n2 = 5
is R ≤ 2 and R ≥ 10. Since R = 2 is located inside the rejection region, we
reject the null hypothesis. The sample is not random at the 0.05 significance level.
Exercise 9.1
1. To test whether the population median exceeds 160, test:
H0 : median = 160, H1 : median > 160, α = 0.05
Replacing values greater than 160 with a ‘+’ sign and values less than 160
with a ‘–’ sign, we will get,
+ + + + + – – + + + – + + – + + + + +
Hence,
$$Z = \frac{S - E(S)}{\sqrt{Var(S)}} = \frac{S - 0.5n}{0.5\sqrt{n}} = \frac{15 - (0.5)(19)}{0.5\sqrt{19}} = 2.52$$
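The sign-test statistic in Exercise 9.1 can be reproduced from the sign sequence itself. This is a small sketch of the normal approximation used above, with the sequence typed in as a plain string.

```python
import math

# Sign sequence from the answer above: '+' for values above 160, '-' for values below
signs = "+++++--+++-++-+++++"

n = len(signs)             # number of non-tied observations
s = signs.count("+")       # S = number of '+' signs

# Under H0, S has mean 0.5n and standard deviation 0.5 * sqrt(n)
z = (s - 0.5 * n) / (0.5 * math.sqrt(n))
print(n, s, round(z, 2))
```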
Exercise 9.2
1. It is known that T+ and T– are the sums of the ranks of the differences with
positive and negative signs respectively. Hence, the sum of the ranks of
all differences, without taking the ‘+’ or ‘–’ signs into consideration, is
the sum of all possible ranks, that is
(rank) 1 + (rank) 2 + . . . + largest rank = T+ + T–
1 + 2 + 3 + . . . + n = T+ + T–. Thus, n(n + 1)/2 = T+ + T–
The signed-rank test for small sample size is performed since n = 14. From
the calculation above, the total number of negative differences, T = 10.
From Table 4, Appendix B, with n = 14, the critical value T0.01 = 13. Since
T is not < T0.01, we do not reject H0 that the median hydrocarbon
content is 98.5.
4.
yi      yi – 1   Sign   Rank |yi – 1|     yi      yi – 1   Sign   Rank |yi – 1|
0.045   -0.955   (-)    23                1.894   +0.894   (+)    20
0.258   -0.742   (-)    14                0.088   -0.912   (-)    22
0.412   -0.588   (-)    10                0.579   -0.421   (-)    4
0.036   -0.964   (-)    24                0.445   -0.555   (-)    9
1.055   +0.55    (+)    8                 0.379   -0.621   (-)    12
1.070   +0.070   (+)    1                 0.242   -0.758   (-)    15
0.361   -0.906   (-)    21                1.267   +0.267   (+)    3
0.394   -0.606   (-)    11                0.136   -0.864   (-)    18.5
0.136   -0.864   (-)    18.5              1.639   +0.639   (+)    13
0.506   -0.494   (-)    6                 0.567   -0.433   (-)    5
0.209   -0.791   (-)    16                0.336   -0.664   (-)    14
8.788   +7.788   (+)    25                0.912   -0.088   (-)    2
0.182   -0.818   (-)    17
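The sum of the ranks of the differences with a ‘–’ sign is T– = 262. Hence, the test statistic is:
$$Z = \frac{T^{-} - \frac{n(n+1)}{4}}{\sqrt{\frac{n(n+1)(2n+1)}{24}}} = \frac{262 - \frac{25(26)}{4}}{\sqrt{\frac{25(26)(51)}{24}}} = \frac{262 - 162.5}{37.1652} = 2.6772$$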
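The large-sample signed-rank statistic above can be checked numerically. The sketch takes T– = 262 and n = 25 from the text; `signed_rank_z` is a hypothetical helper name.

```python
import math

def signed_rank_z(t, n):
    """Normal approximation for the Wilcoxon signed-rank statistic t with n differences."""
    mean_t = n * (n + 1) / 4
    sd_t = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    return (t - mean_t) / sd_t

z = signed_rank_z(262, 25)
print(round(z, 4))
```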
Exercise 10.1
1. (a) n = total positive and negative signs = 15 + 5 = 20 (0 or tie is not
counted). For one-sided (right) test, S = number of positive sign = 15.
Sign test:
Since p-value = 0.0193 < 0.05 = α, reject H0. In conclusion, the new
traffic control system is more effective in reducing the number of accidents
at dangerous junctions at the 0.05 significance level.
Exercise 10.2
1. (a) w2 = [(8)(9)/2] – 8 = 28 Y2 = 15 +[(3)(4)/2] – 8 = 13
Y1 = 15 + [(5)(6)/2] – 28 = 2
For a one-sided (right) test, the test statistic is Y1 = 2. With n1 = 3 and n2 =
5, Y0.05 = 1 (refer to Table 5). Since the test statistic > critical value Y0.05,
do not reject H0 at the 5% significance level.
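The arithmetic in part (a) follows one pattern: the statistic for a group equals the product of the sample sizes, plus that group's maximum possible adjustment n(n + 1)/2, minus its rank sum. The sketch below reproduces the two values, assuming the group of 3 observations has rank sum 8 and the group of 5 has rank sum 28, as in the expressions above; `mann_whitney_y` is a hypothetical helper name.

```python
def mann_whitney_y(n_group, n_other, rank_sum):
    """Y = n_group * n_other + n_group * (n_group + 1) / 2 - rank sum of the group."""
    return n_group * n_other + n_group * (n_group + 1) // 2 - rank_sum

n_total = 3 + 5
w_small = 8                                      # rank sum of the group with 3 observations
w_large = n_total * (n_total + 1) // 2 - w_small # remaining rank sum: 36 - 8

y_small = mann_whitney_y(3, 5, w_small)
y_large = mann_whitney_y(5, 3, w_large)
print(w_large, y_small, y_large)
```

The two statistics always add up to the product of the sample sizes, which gives a quick consistency check on the rank sums.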
From the table, the critical value at α = 0.01 for a one-sided test with n1 = 7
and n2 = 10 is 11. Since 13.5 is not < 11, we do not reject the null
hypothesis. In conclusion, there is no significant difference in the score for
both groups at the 0.01 significance level.
From Table 5 with n1 = 9 and n2 = 10, the critical value for α = 0.05 is 24.
Since 16 < 24, reject the null hypothesis. Alcohol does have an effect on
individuals’ thinking ability at the 0.05 level.
Glossary
α – The significance level of a hypothesis test that denotes
the probability of rejecting a null hypothesis when it
actually is true
Critical Value – One or two values that divide the whole region under
the sampling distribution of a sample statistic into
rejection and non-rejection region
Mean Square Error, MSE – A measure of the variation within the data of all
samples taken from different populations
Multinomial Experiment – An experiment with n trials for which (1) the trials are
identical, (2) there are more than two possible
outcomes per trial, (3) the trials are independent, and
(4) the probabilities of the various outcomes remain
constant for each trial