
Stat 362 Study Guide


KWAME NKRUMAH UNIVERSITY OF SCIENCE AND TECHNOLOGY,

KUMASI.

INSTITUTE OF DISTANCE LEARNING

BSc Statistics 2

STAT 362
Statistical Computing & Data Analysis II
2 Credits

STUDY GUIDE

Emmanuel Harris
Department of Statistics and Actuarial Science

[STAT 362 Statistical Computing & Data Analysis II]

Publisher’s Information

© IDL, 2017

All rights reserved. No part of this study guide may be reproduced or utilized in any
form or by any means, electronic or mechanical, including photocopying, recording or
by any information storage and retrieval system, without the permission from the
copyright holders.

For any information contact:

Director
Institute of Distance Learning
New Library Building
Kwame Nkrumah University of Science and Technology
Kumasi, Ghana

Phone: +233-32-2060013
+233-32-2061287
+233-32-2060023

Fax: +233-32-2060014

E-mail: emmaharris2002@yahoo.com

Web: www.idl.knust.edu.gh
www.knust.edu.gh

INTRODUCTION

Welcome to STAT 362 Statistical Computing & Data Analysis II. My name is Emmanuel Harris,
and I am your facilitator for this course. In addition to welcoming you to the course, I
would like to give you some useful information about statistical computing and data
analysis and offer a few hints for completing the course successfully.

Statistical modeling and data analysis techniques can be difficult to grasp and apply, and
computer software is often needed to analyze large data sets and obtain useful results. R
is recognized as one of the most powerful and flexible free statistical software packages.
It enables the user to apply a wide range of statistical methods, from simple regression
to time series and multivariate analysis.

This course shows students how to analyze large data sets in R to obtain useful results.

To complete this course successfully, you need a computer with R installed.


COURSE OVERVIEW

The course is organized into six units:

Unit 1: Hypothesis testing (one-sample independent t-test, two-sample independent t-test,
paired-sample t-test)

Unit 2: Regression

Unit 3: Non-Parametric Tests

Unit 4: Chi-Square

Unit 5: ANOVA

Unit 6: Logistic Regression

COURSE OBJECTIVE(S)
On completion of the course, you should be able to:
1. Perform hypothesis testing (t-tests) in R.
2. Perform non-parametric tests using R.
3. Perform the chi-square goodness-of-fit test and test of homogeneity using R.
4. Perform one-way, two-way, and multiple ANOVA in R.
5. Perform logistic regression analysis in R.

COURSE OUTLINE
• Unit 1: Hypothesis testing
• Unit 2: Regression
• Unit 3: Non-parametric Test
• Unit 4: Chi-Square
• Unit 5: ANOVA
• Unit 6: Logistic Regression

REQUIRED TEXTBOOKS/READINGS
1. Bluman, A. G. (2009). Elementary statistics: A step-by-step approach. New York,
NY: McGraw-Hill Higher Education.
2. Ugarte, M. D., Militino, A. F., & Arnholt, A. T. (2008). Probability and Statistics
with R. CRC Press.
3. Kerns, G. J. (2010). Introduction to Probability and Statistics Using R. Lulu.com.
4. Schmuller, J. (2017). Statistical Analysis with R for Dummies. John Wiley & Sons.
5. Crawley, M. J. (2005). Statistics: An Introduction Using R. Wiley.

GRADING
Continuous assessment: 30%

End of semester Examinations: 70%

Total: 100%


ASSIGNMENT SCHEDULE

All assignments are due before the end of the specified day of delivery (GMT 23:59).
All assignments are to be uploaded to the hand-in folder for this course unless other
instructions are given. If you are unable to hand in your assignment on the LMS of IDL
KNUST (vclass), you may email it to the course facilitator (on the said day of delivery).
Failure to deliver assignments on the specified date will attract penalties in the form of a
reduced grade.

Assignment        Description                     Type                 Deadline     Value
(Each unit has    (Activities title)              (Individual/Group)   (Duration)   (of final grade)
activities)

1                 Hypothesis testing activities   Individual           1 week       10%
2                 Regression                      Individual           1 week       10%
3                 Non-Parametric Tests            Individual           1 week       10%
4                 Chi-Square                      Individual           1 week       10%
5                 ANOVA                           Individual           1 week       10%
6                 Logistic Regression             Individual           1 week       10%

* Participation in online discussions may account for 15% of the final grade
* Deadlines may be set on a weekly basis

UNIT ONE
HYPOTHESIS TESTING (t-test)
OVERVIEW
One of the most common tests in statistics is the t-test. It is used to determine whether
the means of two groups are equal (two-sample t-test) or whether a hypothesized value for
the mean of a population is plausible (one-sample t-test). The test assumes that the data
are sampled from normal distributions; the two-sample test with pooled variance
additionally assumes that both groups have equal variances.

CONTENT
Session 1.1 One-sample independent t-test
Session 1.2 Two-sample independent t-test
Session 1.3 Dependent/Paired t-test

REQUIRED READINGS
1. Bluman, A. G. (2009). Elementary statistics: A step-by-step approach. New York,
NY: McGraw-Hill Higher Education.
2. Ugarte, M. D., Militino, A. F., & Arnholt, A. T. (2008). Probability and Statistics
with R. CRC Press.
3. Kerns, G. J. (2010). Introduction to Probability and Statistics Using R. Lulu.com.
4. Schmuller, J. (2017). Statistical Analysis with R for Dummies. John Wiley & Sons.
5. Crawley, M. J. (2005). Statistics: An Introduction Using R. Wiley.

LEARNING OUTCOMES
By the end of this unit, students should be able to:
1. Test the difference between two means for independent samples, using the t-test in
R.
2. Test the difference between two means for dependent samples, using the t-test in R.
3. Test a claim about a hypothesized mean of one sample, using the t-test in R.


SESSION/ EXAMPLES/ ACTIVITIES

This session presents video activities, session examples, and other activities. The video
activities are YouTube videos on performing t-tests in R. The session examples are solved
questions on each type of t-test and illustrate the R code used. Each example is followed
by one or more activities. These ACTIVITIES are to be solved and submitted for grading.
The deadline for submission is one week after each lecture.

Video Activity
1. https://youtu.be/kvmSAXhX9Hs (One-sample t-Test)
2. https://youtu.be/RlhnNbPZC0A ( Two-sample independent t-Test)
3. https://youtu.be/yD6aU0fY2lo (Two-sample dependent t-Test)

The format for a t.test with R is given as


t.test(x, y = NULL, alternative = c("two.sided", "less", "greater"),
mu = 0, paired = FALSE, var.equal = FALSE,
conf.level = 0.95, ...)

SESSION 1.1
ONE SAMPLE INDEPENDENT T-TEST

Example 1.1.1
A researcher estimated that the average height of story buildings on the KNUST campus
is 700 feet. A random sample of 9 story buildings is selected and the heights in feet are
shown below:
485, 511, 841, 725, 615, 520, 535, 635, 616
At α = 0.05, is there enough evidence to reject this claim?

Solution
Hypothesis:
H0: μ = 700
H1: μ ≠ 700

R codes
> x=c(485,511,841,725,615,520,535,635,616)
> t.test(x,mu=700,conf.level = 0.95, alternative = "two.sided")


OUTPUT

One Sample t-test

data: x
t = -2.3612, df = 8, p-value = 0.04587
alternative hypothesis: true mean is not equal to 700
95 percent confidence interval:
520.5678 697.8767
sample estimates:
mean of x
609.2222

Conclusion: Since the p-value (0.04587) is less than α = 0.05, we reject H0 and conclude
that there is enough evidence to reject the researcher’s claim that the average height of
story buildings on the KNUST campus is 700 feet.
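
The decision in Example 1.1.1 can also be read off programmatically. The following is a
minimal sketch, assuming the result of t.test() is stored in an object (the names res,
alpha, and decision are illustrative and not part of the original example):

```r
# Heights (in feet) from Example 1.1.1
x <- c(485, 511, 841, 725, 615, 520, 535, 635, 616)
alpha <- 0.05  # significance level used in the example

# t.test() returns a list of class "htest"; store it instead of only printing it
res <- t.test(x, mu = 700, alternative = "two.sided", conf.level = 1 - alpha)

# Pull out the pieces that drive the decision
t_stat  <- unname(res$statistic)  # observed t value
p_value <- res$p.value            # two-sided p-value

# Decision rule: reject H0 when the p-value is below alpha
decision <- if (p_value < alpha) "reject H0" else "fail to reject H0"
cat("t =", round(t_stat, 4), " p-value =", round(p_value, 5), " ->", decision, "\n")
```

The same pattern applies to the other t-tests in this unit: only the arguments passed to
t.test() change.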

Example 1.1.2
A state executive claims that the average number of acres in Western Region parks is
less than 2000 acres. A random sample of five parks is selected, and the number of acres
is shown. At α = 0.05, is there enough evidence to support the claim?
959 1187 493 6249 541
Solution
Hypothesis:
H0: μ = 2000
H1: μ < 2000 (claim)

R codes
> x=c(959, 1187, 493, 6249, 541)
> t.test(x,mu=2000, conf.level = 0.95, alternative = "less")

OUTPUT
One Sample t-test

data: x
t = -0.10396, df = 4, p-value = 0.4611
alternative hypothesis: true mean is less than 2000
95 percent confidence interval:
-Inf 4227.591
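sample estimates:
mean of x
1885.8

Conclusion: Since the p-value (0.4611) is greater than α = 0.05, we fail to reject H0 and
conclude that there is not enough evidence to support the claim that the average number of
acres in Western Region parks is less than 2000.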

Activity 1.1
The average 1-ounce chocolate chip cookie contains 110 calories. A random sample of
15 different brands of 1-ounce chocolate chip cookies resulted in the following calorie
amounts. At the α = 0.01 level, is there sufficient evidence that the average calorie
content is greater than 110 calories?
100, 125, 150, 160, 185, 125, 155, 145, 160, 100, 150, 140, 135, 120, 110


SESSION 1.2
TWO SAMPLE INDEPENDENT T-TEST

Example 1.2.1 (Assuming Equal Variances)


The number of grams of carbohydrates contained in 1-ounce servings of randomly selected
chocolate and nonchocolate candy is listed here. Is there sufficient evidence to conclude
that the difference in the means is statistically significant? Use α = 0.01.
Chocolate:    29 25 17 36 41 25 32 29 38 34 24 27 29
Nonchocolate: 41 41 37 29 30 38 39 10 29 55 29

Solution
Hypothesis:
H0: There is no significant difference in the mean carbohydrate content of chocolate and
nonchocolate candy.
H1: There is a significant difference in the mean carbohydrate content of chocolate and
nonchocolate candy.

R codes
> x = c(29,25,17,36,41,25,32,29,38,34,24,27,29)
> y = c(41,41,37,29,30,38,39,10,29,55,29)
> t.test(x,y, conf.level = 0.99, var.equal = TRUE)

OUTPUT

Two Sample t-test

data: x and y

t = -1.2744, df = 22, p-value = 0.2158

alternative hypothesis: true difference in means is not equal to 0

99 percent confidence interval:


-15.003753 5.661096

sample estimates:

mean of x mean of y

29.69231 34.36364

Conclusion: Since the p-value (0.2158) is greater than α = 0.01, we fail to reject H0 and
conclude that there is no significant difference in the carbohydrate content of chocolate
and nonchocolate candy.
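
Example 1.2.1 assumes equal variances without checking the assumption. One possible check,
offered here only as a sketch, is the F test provided by var.test() in base R. Using its
result at the 5% level to choose between the pooled and Welch tests is an assumption made
for illustration, not a rule stated in this guide:

```r
# Carbohydrate data from Example 1.2.1
x <- c(29, 25, 17, 36, 41, 25, 32, 29, 38, 34, 24, 27, 29)  # chocolate
y <- c(41, 41, 37, 29, 30, 38, 39, 10, 29, 55, 29)          # nonchocolate

# F test of H0: the two population variances are equal (assumes normal data)
vt <- var.test(x, y)
vt$statistic  # ratio of the sample variances, var(x) / var(y)
vt$p.value

# Illustrative choice: pool the variances only if the F test does not reject at 5%
equal_var <- vt$p.value >= 0.05
res <- t.test(x, y, conf.level = 0.99, var.equal = equal_var)
res$method    # reports which version of the test was run
res$p.value
```

When in doubt, leaving var.equal = FALSE (the default, Welch test) is the more cautious
choice, as in Example 1.2.2.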

Example 1.2.2 (Assuming Unequal Variances)


The number of grams of carbohydrates contained in 1-ounce servings of randomly selected
chocolate and nonchocolate candy is listed here. Is there sufficient evidence to conclude
that the difference in the means is statistically significant? Use α = 0.01.
Chocolate:    29 25 17 36 41 25 32 29 38 34 24 27 29
Nonchocolate: 41 41 37 29 30 38 39 10 29 55 29

Solution
Hypothesis:
H0: There is no significant difference in the mean carbohydrate content of chocolate and
nonchocolate candy.
H1: There is a significant difference in the mean carbohydrate content of chocolate and
nonchocolate candy.

R codes
> x = c(29,25,17,36,41,25,32,29,38,34,24,27,29)
> y = c(41,41,37,29,30,38,39,10,29,55,29)
> t.test(x,y, conf.level = 0.99, var.equal = FALSE)

OUTPUT

Welch Two Sample t-test

data: x and y

t = -1.2203, df = 15.463, p-value = 0.2406

alternative hypothesis: true difference in means is not equal to 0

99 percent confidence interval:

-15.903601 6.560943

sample estimates:

mean of x mean of y

29.69231 34.36364

Conclusion: Since the p-value (0.2406) is greater than α = 0.01, we fail to reject H0 and
conclude that there is no significant difference in the carbohydrate content of chocolate
and nonchocolate candy.

Activity 1.2
Upright vacuum cleaners have either a hard body type or a soft body type. Shown in the
table below are the weights in pounds of a random sample of each type. At α = 0.05, can
it be concluded that the weights are different?


Hard body 21 17 17 20 16 17 15 20 23

Soft body 24 13 11 13 12 15 12 16

SESSION 1.3
DEPENDENT/PAIRED T-TEST

Example 1.3.1

A dietician wishes to see if a person’s cholesterol level will change if the diet is
supplemented by a certain mineral. Six subjects were pretested, and then they took the
mineral supplement for a 6-week period. The results are shown in the table below:

Subject  1    2    3    4    5    6
Before   210  235  208  190  172  244
After    190  170  210  188  173  228

Can it be concluded that the cholesterol level has been changed at α = 0.10? Assume the
variable is approximately normal.

Solution
Hypothesis:
H0: There is no difference in the mean cholesterol level.
H1: There is a significant difference in the mean cholesterol level.

R codes:
> b = c(210,235,208,190,172,244)
> a = c(190,170,210,188,173,228)
> t.test(b,a, conf.level = 0.90, alternative = "two.sided", paired = TRUE)


OUTPUT

Paired t-test

data: b and a

t = 1.6079, df = 5, p-value = 0.1688

alternative hypothesis: true difference in means is not equal to 0


90 percent confidence interval:

-4.22040 37.55373

sample estimates:

mean of the differences

16.66667

Conclusion: Since the p-value (0.1688) is greater than α = 0.10, we fail to reject H0 and
conclude that there is not enough evidence of a difference in the mean cholesterol level.
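
A paired t-test is equivalent to a one-sample t-test on the differences. The sketch below
uses the data of Example 1.3.1 to show this; the object names (d, paired_res, diff_res,
t_manual) are illustrative:

```r
# Cholesterol data from Example 1.3.1
b <- c(210, 235, 208, 190, 172, 244)  # before
a <- c(190, 170, 210, 188, 173, 228)  # after

# Paired test as in the example
paired_res <- t.test(b, a, paired = TRUE, conf.level = 0.90)

# Same test expressed as a one-sample t-test on the differences
d <- b - a
diff_res <- t.test(d, mu = 0, conf.level = 0.90)

# Both calls give the same statistic and p-value
all.equal(unname(paired_res$statistic), unname(diff_res$statistic))
all.equal(paired_res$p.value, diff_res$p.value)

# The statistic computed by hand: mean difference over its standard error
t_manual <- mean(d) / (sd(d) / sqrt(length(d)))
t_manual
```

This is why the output reports df = 5: there are six differences, so the test has
6 − 1 degrees of freedom.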

Example 1.3.2
A reporter hypothesizes that the average assessed values of land in a large city have
changed during a 5-year period. A random sample of wards is selected, and the data (in
millions of Ghana cedis) are shown. At α = 0.01, can it be concluded that the average
taxable assessed values have changed? Use the P-value method.

        Kumasi   Accra   Tema    Koforidua   Cape Coast
2007    344.4    207.0   169.0   1711.5      861.8
2006    1262.0   960.0   529.0   1969.0      1405.0

Solution
Hypothesis:
H0: There is no difference between the average assessed values of land in 2007 and 2006.
H1: There is a difference between the average assessed values of land in 2007 and 2006.

R codes:
> x = c(344.4,207.0,169.0,1711.5,861.8)

> y = c(1262.0,960.0,529.0,1969.0,1405.0)

> t.test(x,y,conf.level=0.99,alternative="two.sided",paired=TRUE)


OUTPUT

Paired t-test

data: x and y

t = -4.649, df = 4, p-value = 0.009669

alternative hypothesis: true difference in means is not equal to 0

99 percent confidence interval:

-1127.052469 -5.467531

sample estimates:

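mean of the differences

-566.26

Conclusion: Since the p-value (0.009669) is less than α = 0.01, we reject H0 and conclude
that the average assessed values have changed.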

Activity 1.3
A medical researcher wishes to see if he can lower the cholesterol levels through diet in
six people by showing a film about the effects of high cholesterol levels. The data is
shown below.
Patient 1 2 3 4 5 6
Before 243 216 214 222 206 219
After 215 202 198 195 204 213

At α = 0.05, did the cholesterol level decrease on average?

UNIT TWO
REGRESSION
OVERVIEW
This unit is divided into three sessions. Session 2.1 considers simple linear regression
models in R, using R's built-in lm() function to fit the model. Session 2.2 deals with
multiple linear regression in R, and Session 2.3 introduces exponential regression models
with R code.

CONTENT
Session 2.1 Linear regression model
Session 2.2 Multiple regression model
Session 2.3 Exponential regression

REQUIRED READINGS
1. Bluman, A. G. (2009). Elementary statistics: A step-by-step approach. New York,
NY: McGraw-Hill Higher Education.
2. Ugarte, M. D., Militino, A. F., & Arnholt, A. T. (2008). Probability and Statistics
with R. CRC Press.
3. Kerns, G. J. (2010). Introduction to Probability and Statistics Using R. Lulu.com.
4. Schmuller, J. (2017). Statistical Analysis with R for Dummies. John Wiley & Sons.
5. Crawley, M. J. (2005). Statistics: An Introduction Using R. Wiley.

LEARNING OUTCOMES
By the end of this unit, students should be able to:
1. Estimate the relationship between dependent and independent variables using linear
regression in R

2. Use the lm() command function in R to perform least-squares regressions

3. Perform quadratic, cubic, and quartic regression analysis in R


SESSION/ EXAMPLES/ACTIVITIES

This session presents video activities, session examples, and other activities. The video
activities are YouTube videos on regression in R. The session examples are worked examples
on sample data and illustrate the R code for each type of regression considered. The
examples are followed by activities. These ACTIVITIES are to be solved and submitted for
grading. The deadline for submission is one week after each lecture.

Video Activity
1. https://www.youtube.com/watch?v=66z_MRwtFJM (linear regression)
2. https://www.youtube.com/watch?v=u1cc1r_Y7M0 (multiple regression)
3. https://www.youtube.com/watch?v=hokALdIst8k (exponential regression)

SESSION 2.1
LINEAR REGRESSION MODEL
‘lm’ is used to fit linear models. It can be used to carry out regression, single stratum
analysis of variance, and analysis of covariance (although ‘aov’ may provide a more
convenient interface for these).
lm(formula, data, subset, weights, na.action, method = "qr", model = TRUE, x =
FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE, contrasts = NULL,
offset, ...)

Linear Regression
R-code for linear regression
>x=c(numeric values)
>y=c(numeric values)

>a=lm(y~x)
>summary(a)

Example 2.1.1: Linear Model R


>x=c(3.9,2.1,6.4,5.7,4.7,2.8,3.4,7.5,3.0,4.5)
>y=c(6.5,7.8,5.2,8.2,9.2,8.9,7.3,9.8,5.6,4.6)
>a=lm(y~x)
>summary(a)

Results


OUTPUT

Call:
lm(formula = y ~ x)

Residuals:
Min 1Q Median 3Q Max
-2.7255 -1.3034 0.4168 1.5894 2.0108

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.6299 1.7085 3.881 0.00467 **
x 0.1546 0.3642 0.424 0.68244

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.873 on 8 degrees of freedom
Multiple R-squared: 0.02202, Adjusted R-squared: -0.1002
F-statistic: 0.1801 on 1 and 8 DF, p-value: 0.6824

MODEL: y = 6.6299 + 0.1546x
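
The coefficients of the fitted model do not have to be copied by hand from the summary.
The sketch below uses coef() and predict() from base R on the data of Example 2.1.1; the
new value x = 5 is a hypothetical input chosen only for illustration:

```r
# Data from Example 2.1.1
x <- c(3.9, 2.1, 6.4, 5.7, 4.7, 2.8, 3.4, 7.5, 3.0, 4.5)
y <- c(6.5, 7.8, 5.2, 8.2, 9.2, 8.9, 7.3, 9.8, 5.6, 4.6)
fit <- lm(y ~ x)

# Named vector of estimates: "(Intercept)" and "x"
b <- coef(fit)
b

# Predicted y at a hypothetical new value x = 5
new_x <- data.frame(x = 5)
pred <- predict(fit, newdata = new_x)

# The same prediction computed directly from the fitted equation
by_hand <- b[["(Intercept)"]] + b[["x"]] * 5
c(predict = unname(pred), by_hand = by_hand)
```

Note that newdata must be a data frame whose column name matches the predictor name used
in the formula.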

Example 2.1.2: Quadratic Linear Regression

>x=c(3.9,2.1,6.4,5.7,4.7,2.8,3.4,7.5,3.0,4.5)
>y=c(6.5,7.8,5.2,8.2,9.2,8.9,7.3,9.8,5.6,4.6)
>b=lm(y~x+I(x^2))
>summary(b)

OUTPUT

Call:
lm(formula = y ~ x + I(x^2))

Residuals:
Min 1Q Median 3Q Max
-2.38024 -1.26603 0.08659 1.09552 2.57011

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 11.8590 4.8338 2.453 0.0439 *
x -2.3402 2.1926 -1.067 0.3213
I(x^2) 0.2612 0.2265 1.153 0.2867

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.836 on 7 degrees of freedom
Multiple R-squared: 0.1781, Adjusted R-squared: -0.05667
F-statistic: 0.7587 on 2 and 7 DF, p-value: 0.5032

MODEL: y = 11.8590 − 2.3402x + 0.2612x^2


Example 2.1.3: Cubic Regression

>x=c(3.9,2.1,6.4,5.7,4.7,2.8,3.4,7.5,3.0,4.5)
>y=c(6.5,7.8,5.2,8.2,9.2,8.9,7.3,9.8,5.6,4.6)
>c=lm(y~x+I(x^2)+I(x^3))
>summary(c)

OUTPUT

Call:
lm(formula = y ~ x + I(x^2) + I(x^3))
Residuals:

Min 1Q Median 3Q Max


-2.0692 -1.4149 0.1101 1.2082 2.5999

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.78399 14.68099 0.462 0.660
x 1.44018 10.50169 0.137 0.895
I(x^2) -0.59553 2.33259 -0.255 0.807
I(x^3) 0.05974 0.16178 0.369 0.725

Residual standard error: 1.961 on 6 degrees of freedom

Multiple R-squared: 0.1964,Adjusted R-squared: -0.2054

F-statistic: 0.4888 on 3 and 6 DF, p-value: 0.7026

MODEL: y = 6.78399 + 1.44018x − 0.59553x^2 + 0.05974x^3

Example 2.1.4: Quartic Regression

>x=c(3.9,2.1,6.4,5.7,4.7,2.8,3.4,7.5,3.0,4.5)
>y=c(6.5,7.8,5.2,8.2,9.2,8.9,7.3,9.8,5.6,4.6)
>d=lm(y~x+I(x^2)+I(x^3)+I(x^4))
>summary(d)

OUTPUT

Call:
lm(formula = y ~ x + I(x^2) + I(x^3) + I(x^4))

Residuals:
1 2 3 4 5 6
-0.5671 -0.4279 -1.3427 1.5557 2.0758 1.9301
7 8 9 10
0.3901 0.2268 -1.2881 -2.5526


Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 36.57466 46.70585 0.783 0.469
x -28.93120 46.28444 -0.625 0.559
I(x^2) 10.22251 16.19846 0.631 0.556
I(x^3) -1.54560 2.38226 -0.649 0.545
I(x^4) 0.08439 0.12491 0.676 0.529

Residual standard error: 2.056 on 5 degrees of freedom


Multiple R-squared: 0.2636,Adjusted R-squared: -0.3255
F-statistic: 0.4475 on 4 and 5 DF, p-value: 0.772

MODEL: y = 36.57466 − 28.93120x + 10.22251x^2 − 1.54560x^3 + 0.08439x^4
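
The four polynomial fits of Session 2.1 can be compared side by side. The following sketch
collects the adjusted R-squared of each model and runs a nested-model F test with anova();
the list name fits and the vector adj_r2 are illustrative:

```r
# Data shared by Examples 2.1.1 to 2.1.4
x <- c(3.9, 2.1, 6.4, 5.7, 4.7, 2.8, 3.4, 7.5, 3.0, 4.5)
y <- c(6.5, 7.8, 5.2, 8.2, 9.2, 8.9, 7.3, 9.8, 5.6, 4.6)

# Polynomial models of degree 1 to 4, written with the same I() terms as above
fits <- list(
  linear    = lm(y ~ x),
  quadratic = lm(y ~ x + I(x^2)),
  cubic     = lm(y ~ x + I(x^2) + I(x^3)),
  quartic   = lm(y ~ x + I(x^2) + I(x^3) + I(x^4))
)

# Adjusted R-squared penalizes extra terms, so it is comparable across degrees
adj_r2 <- sapply(fits, function(m) summary(m)$adj.r.squared)
round(adj_r2, 4)

# F test of whether the quadratic term improves on the straight line
anova(fits$linear, fits$quadratic)
```

In the outputs above every adjusted R-squared is negative, which suggests that none of
these polynomials describes this particular data set well.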

SESSION 2.2
MULTIPLE REGRESSION MODEL
A regression with two or more explanatory variables is called a multiple regression.
Rather than modeling the mean as a straight line in linear regression, it is now modeled
as a function of several explanatory variables.

Multiple Regression R code format


>explanatory1=c(numeric values)
>explanatory2=c( numeric values)
>response=c(numeric values)
>s=lm(response~explanatory1+explanatory2)
>summary(s)

Example 2.2: R Code for Multiple Linear Regression

> x1=c(91,90,88,87,91,94,87,86)
> x2=c(25,21,24,25,25,26,25,25)
> y=c(240,236,270,274,301,316,300,296)
> a=lm(y~x1+x2)
> summary(a)


OUTPUT
Call:
lm(formula = y ~ x1 + x2)
Residuals:
      1       2       3       4       5       6       7       8
-45.718   4.871  -2.461 -12.285  15.282  17.024  13.715   9.573

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -43.4558 331.1189 -0.131 0.9007
x1 -0.1417 3.4741 -0.041 0.9690
x2 13.6828 6.2329 2.195 0.0796 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 24.8 on 5 degrees of freedom
Multiple R-squared: 0.4927, Adjusted R-squared: 0.2897
F-statistic: 2.428 on 2 and 5 DF, p-value: 0.1833
MODEL: y = −43.4558 − 0.1417x1 + 13.6828x2

SESSION 2.2.1
R CODE SYNTAX WHEN THERE IS INTERACTION AMONG THE INDEPENDENT VARIABLES

> x1=c(numeric values)


> x2=c(numeric values)
>y=c(numeric values)
>b=lm(y~x1+x2+x1*x2)
> summary(b)

Example 2.2.1
> x1=c(91,90,88,87,91,94,87,86)
> x2=c(25,21,24,25,25,26,25,25)
>y=c(240,236,270,274,301,316,300,296)
>b=lm(y~x1+x2+x1*x2)
> summary(b)


OUTPUT
Call:
lm(formula = y ~ x1 + x2 + x1 * x2)
Residuals:
      1       2       3       4       5       6       7       8
-36.003   5.462 -14.565 -10.977  24.997   7.283  15.023   8.780

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 15661.115 19353.552 0.809 0.464
x1 -174.234 214.539 -0.812 0.462
x2 -607.238 765.100 -0.794 0.472
x1:x2 6.880 8.477 0.812 0.463

Residual standard error: 25.69 on 4 degrees of freedom


Multiple R-squared: 0.5644,Adjusted R-squared: 0.2377
F-statistic: 1.727 on 3 and 4 DF, p-value: 0.299
MODEL: y = 15661.115 − 174.234x1 − 607.238x2 + 6.880x1x2
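
In an R formula, x1 * x2 already expands to x1 + x2 + x1:x2, where x1:x2 is the interaction
term shown in the output. The sketch below checks that the three ways of writing the model
give the same fit; the object names full, short, and explicit are illustrative:

```r
# Data from Example 2.2.1
x1 <- c(91, 90, 88, 87, 91, 94, 87, 86)
x2 <- c(25, 21, 24, 25, 25, 26, 25, 25)
y  <- c(240, 236, 270, 274, 301, 316, 300, 296)

full     <- lm(y ~ x1 + x2 + x1 * x2)  # form used in the example
short    <- lm(y ~ x1 * x2)            # shorthand: main effects plus interaction
explicit <- lm(y ~ x1 + x2 + x1:x2)    # interaction written with ":"

# All three formulas produce the same coefficients
all.equal(coef(full), coef(short))
all.equal(coef(full), coef(explicit))
names(coef(full))
```

Writing x1:x2 alone, without the main effects, would fit only the interaction term, which
is a different model.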

SESSION 2.3
EXPONENTIAL REGRESSION
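Given a table of data, you may want to find the law that governs the data, for example in
the form T = a b^t. Taking logarithms gives log(T) = log(a) + t log(b), which is linear in
t and can therefore be fitted with lm().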

Example 2.3.1

The data below shows the cooling temperatures of a freshly brewed cup of coffee after it
is poured from the brewing pot into a serving cup. The brewing pot temperature is
approximately 180°F. Find the law in the form T = a b^t.

Time (t)  0      5      8      11     15     18     22     25     30     34     38     42     45     50
Temp      179.5  168.7  158.1  149.2  141.7  134.6  125.4  123.5  116.3  113.2  109.1  105.7  102.2  100.5

R-Code
>time=c(0,5,8,11,15,18,22,25,30,34,38,42,45,50)
>temp=c(179.5,168.7,158.1,149.2,141.7,134.6,125.4,123.5,116.3,
113.2,109.1,105.7,102.2,100.5)
>a=lm(log(temp)~time)
>summary(a)


OUTPUT
Call:
lm(formula = log(temp) ~ time)
Residuals:

Min 1Q Median 3Q Max


-0.052753 -0.025261 -0.005929 0.014306 0.056930

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.1443601 0.0172782 297.74 < 2e-16 ***
time -0.0118227 0.0005988 -19.74 1.62e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.03415 on 12 degrees of freedom

Multiple R-squared: 0.9701,Adjusted R-squared: 0.9676

F-statistic: 389.8 on 1 and 12 DF, p-value: 1.621e-10

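Model: log(temp) = 5.144 − 0.0118 time. Taking exponentials of both sides gives the law
T = a b^t with a = e^5.144 and b = e^(−0.0118).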
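
The example asks for the law in the form T = a b^t, while lm() reports coefficients on the
log scale. A minimal sketch of the back-transformation follows; the helper function law()
and the names a and b are illustrative:

```r
# Data from Example 2.3.1
time <- c(0, 5, 8, 11, 15, 18, 22, 25, 30, 34, 38, 42, 45, 50)
temp <- c(179.5, 168.7, 158.1, 149.2, 141.7, 134.6, 125.4, 123.5,
          116.3, 113.2, 109.1, 105.7, 102.2, 100.5)

fit <- lm(log(temp) ~ time)

# log(T) = log(a) + t * log(b), so exponentiate both coefficients
a <- exp(coef(fit)[["(Intercept)"]])
b <- exp(coef(fit)[["time"]])
c(a = a, b = b)

# Fitted law on the original temperature scale
law <- function(t) a * b^t
round(law(time), 1)
```

Because b is below 1, the fitted temperature decreases by a constant percentage per unit
of time.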

Example 2.3.2
Given the periods and mean distances of some of the planets, find a law in the form
P = k s^n.

Period, P (days)                    87.97  224.7  365.3  687.0  4333.0  10760.0
Mean distance, s (millions of km)   58     108    150    228    778     1426

R-Code
>p=c(87.97,224.7,365.3,687.0,4333,10760)
>s=c(58,108,150,228,778,1426)
>a=lm(log(p)~log(s))
>summary(a)

OUTPUT
Call:
lm(formula = log(p) ~ log(s))
Residuals:
         1          2          3          4          5          6
-9.492e-04  3.871e-03 -3.153e-03  1.154e-04 -9.961e-05  2.155e-04
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.6154480 0.0052541 -307.5 6.71e-10 ***
log(s) 1.5006720 0.0009335 1607.5 8.99e-13 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.002544 on 4 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: 1
F-statistic: 2.584e+06 on 1 and 4 DF, p-value: 8.986e-13

Model: log(p) = −1.615448 + 1.500672 log(s), so that n = 1.500672 and k = e^(−1.615448).
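
For a power law P = k s^n, both variables are logged, so the slope is n itself and only the
intercept needs to be exponentiated. The sketch below recovers k and n and checks the
fitted periods against the data; the names k, n, p_hat, and rel_err are illustrative:

```r
# Data from Example 2.3.2
p <- c(87.97, 224.7, 365.3, 687.0, 4333, 10760)  # period in days
s <- c(58, 108, 150, 228, 778, 1426)             # mean distance, millions of km

fit <- lm(log(p) ~ log(s))

# log(P) = log(k) + n * log(s)
k <- exp(coef(fit)[["(Intercept)"]])
n <- coef(fit)[["log(s)"]]
c(k = k, n = n)

# Fitted periods and their relative errors on the original scale
p_hat <- k * s^n
rel_err <- (p_hat - p) / p
round(cbind(p, p_hat, rel_err), 4)
```

The estimated exponent is very close to 3/2, which is the relationship between period and
distance described by Kepler's third law.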


Example 2.3.3
The width of successive whorls of a shell of Turbo duplicatus has been measured.
Find the law in the form w = a b^n.
Position of whorl (n)     1     2     3     4     5     6     7     8
Width of whorl (w, cm)    3.33  2.84  2.39  2.03  1.70  1.45  1.22  1.04

R-Code
>n=c(1,2,3,4,5,6,7,8)
>w=c(3.33,2.84,2.39,2.03,1.70,1.45,1.22,1.04)
>r=lm(log(w)~n)
>summary(r)

OUTPUT

Call:
lm(formula = log(w) ~ n)

Residuals:
Min 1Q Median 3Q Max
-0.0065511 -0.0033214 0.0006323 0.0036527 0.0049239

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.3733474 0.0035109 391.2 1.88e-14 ***
n -0.1672336 0.0006953 -240.5 3.48e-13 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.004506 on 6 degrees of freedom


Multiple R-squared: 0.9999, Adjusted R-squared: 0.9999

F-statistic: 5.786e+04 on 1 and 6 DF, p-value: 3.484e-13


Model: log(w) = 1.3733474 − 0.1672336 n
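
To judge how well the law w = a b^n reproduces the measurements, the fitted widths can be
placed next to the observed ones. This is a sketch using the data of Example 2.3.3; the
names a, b, and w_hat are illustrative:

```r
# Data from Example 2.3.3
n <- 1:8
w <- c(3.33, 2.84, 2.39, 2.03, 1.70, 1.45, 1.22, 1.04)

fit <- lm(log(w) ~ n)

# Back-transform the coefficients: log(w) = log(a) + n * log(b)
a <- exp(coef(fit)[["(Intercept)"]])
b <- exp(coef(fit)[["n"]])

# Observed and fitted widths on the original (cm) scale
w_hat <- a * b^n
data.frame(n = n, observed = w, fitted = round(w_hat, 3))
```

Here b is the constant ratio between the widths of successive whorls implied by the fitted
law.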

Activity 2.3.1
In a study on speed and braking distance, researchers looked for a method to estimate
how fast a person was traveling before an accident by measuring the length of the skid
marks. An area that was focused on in the study was the distance required to completely
stop a vehicle at various speeds. Use the following table to find the linear regression
equation.
MPH  Braking distance (feet)
20 20
30 45
40 81
50 133
60 205
80 411

Activity 2.3.2


A nursing instructor wishes to see whether a student’s grade point average and age are
related to the student’s score on the state board nursing examination. She selects five
students and obtains the following data. Obtain the multiple linear regression equation
from the data.

Student GPA Age Score


A 3.2 22 550
B 2.7 27 570
C 2.5 24 525
D 3.4 28 670
E 2.2 23 490

Activity 2.3.3
If V = k D^r, find the values of k and r using the table below.

Diameter 4.4 4.6 5 5.1 5.1 5.2 5.2 5.5 5.5 5.6
Volume 2 2.2 3 4.3 3 2.9 3.5 3.4 5 7.2

UNIT THREE
NON-PARAMETRIC TEST
OVERVIEW
This unit introduces several non-parametric hypothesis tests, namely the Spearman and
Kendall rank correlation tests, the sign test, the Wilcoxon signed-rank and rank-sum
tests, the randomness test, and the Kruskal-Wallis test. We will discuss the application
of these tests using the R software.

CONTENT
Session 3.1 Spearman and Kendall Rank Correlation Test
Session 3.2 Sign Test
Session 3.3 Wilcoxon Sign and Sum Test
Session 3.4 Randomness Test (Categorical and Continuous)
Session 3.5 Kruskal-Wallis Test

REQUIRED READINGS
1. Bluman, A. G. (2009). Elementary statistics: A step-by-step approach. New York,
NY: McGraw-Hill Higher Education.
2. Ugarte, M. D., Militino, A. F., & Arnholt, A. T. (2008). Probability and Statistics
with R. CRC Press.
3. Kerns, G. J. (2010). Introduction to Probability and Statistics Using R. Lulu.com.
4. Schmuller, J. (2017). Statistical Analysis with R for Dummies. John Wiley & Sons.
5. Crawley, M. J. (2005). Statistics: An Introduction Using R. Wiley.

LEARNING OUTCOMES
By the end of this unit, students should be able to:
1. Perform Spearman and Kendall Rank Correlation Tests using R software.

2. Perform the Sign test using R software.

3. Perform the Wilcoxon Sign and Sum Test using R software.

4. Perform Randomness tests for categorical and continuous cases using R software.

5. Perform the Kruskal-Wallis test using R software.


SESSION/EXAMPLES/ACTIVITIES

This session presents video activities, session examples, and other activities. The video
activities are YouTube videos on computing Spearman and Kendall rank correlation
coefficients, the sign test, the Wilcoxon signed-rank and rank-sum tests, tests of
randomness, and the Kruskal-Wallis test in R. The session examples are solved questions on
each topic and illustrate the R code for the various techniques. Each example is followed
by one or more activities. These ACTIVITIES are to be solved and submitted for grading.
The deadline for submission is one week after each lecture.

Video Activity
1. https://www.youtube.com/watch?v=F0lvYZmxib8 (Spearman and Kendall
Rank Correlation)

2. https://www.youtube.com/watch?v=fH4S4aqfs9k (Sign Test)

3. https://www.youtube.com/watch?v=zM8OZUM5I4Y (Wilcoxon Signed-Rank Test)

4. https://www.youtube.com/watch?v=KroKhtCD9eE (Wilcoxon Rank Sum)

5. https://www.youtube.com/watch?v=Y1qeAFAV5yQ (Kruskal Wallis Test)

SESSION 3.1
SPEARMAN AND KENDALL RANK CORRELATION TEST
Test for association between paired samples, using one of Pearson's product moment
correlation coefficient, Kendall's tau or Spearman's rho.
cor.test(x, y,
alternative = c("two.sided", "less", "greater"),
method = c("pearson", "kendall", "spearman"),
exact = NULL, conf.level = 0.95, continuity = FALSE, ...)

Spearman and Kendall Rank Correlation Tests:


• cor.test(x, y, method = "spearman") for Spearman's rho correlation
• cor.test(x, y, method = "kendall") for Kendall's tau correlation

Example 3.1.1: Spearman Correlation Coefficient


Two students were asked to rate eight different textbooks for a specific course on an
ascending scale from 0 to 20 points. Points were assigned for each of several categories,
such as reading level, use of illustrations, and use of color. At α = 0.05, calculate the
Spearman and Kendall rank correlation coefficients between the two students’ ratings.
The data are shown in the following table.

Textbook           A   B   C   D   E   F   G   H
Student 1 rating   4   10  18  20  12  2   5   9
Student 2 rating   4   6   20  14  16  8   11  7

Solution

H0: There is no correlation between the two students’ ratings.

HA: There is a correlation between the two students’ ratings.

Let a = student 1's ratings and b = student 2's ratings.


R-Code
>a=c(4,10,18,20,12,2,5,9)
>b=c(4,6,20,14,16,8,11,7)
>cor.test(a,b,method="spearman")

OUTPUT
Spearman's rank correlation rho
data: a and b
S = 30, p-value = 0.09618
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
0.6428571
Conclusion: Since the p-value (0.09618) is greater than α = 0.05, we fail to reject H0 and
conclude that there is no significant correlation between the two ratings.

Example 3.1.2: Kendall Rank Correlation


H0: There is no correlation between the two students’ ratings.

HA: There is a correlation between the two students’ ratings.

R-Code
>a=c(4,10,18,20,12,2,5,9)
>b=c(4,6,20,14,16,8,11,7)
>cor.test(a,b,method="kendall")

OUTPUT

Kendall's rank correlation tau


data: a and b
T = 20, p-value = 0.1789
alternative hypothesis: true tau is not equal to 0
sample estimates:
tau
0.4285714

Conclusion: Since the p-value (0.1789) is greater than α = 0.05, we fail to reject H0 and
conclude that there is no significant correlation between the two ratings.
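
When only the coefficients are needed, without the test, cor() accepts the same method
argument. The sketch below also shows that Spearman's rho is simply Pearson's correlation
computed on the ranks; the names rho, tau, and rho_from_ranks are illustrative:

```r
# Ratings from Example 3.1.1
a <- c(4, 10, 18, 20, 12, 2, 5, 9)
b <- c(4, 6, 20, 14, 16, 8, 11, 7)

# Rank correlation coefficients without the hypothesis test
rho <- cor(a, b, method = "spearman")
tau <- cor(a, b, method = "kendall")
c(rho = rho, tau = tau)

# Spearman's rho equals Pearson's correlation of the ranks
rho_from_ranks <- cor(rank(a), rank(b))
all.equal(rho, rho_from_ranks)
```

The two coefficients are on different scales, so a Kendall tau that is smaller than the
Spearman rho for the same data is expected and does not indicate a disagreement.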


Activity 3.1.1
As a biologist, you wish to see if there is a relationship between the heights of tall trees
and their diameters. You find the following data for the diameter (in inches) of the tree
at 4.5 feet from the ground and the corresponding heights (in feet).
Diameter (in.) Height (ft)
1024 261
950 321
451 219
505 281
761 159
644 83
707 191
586 141
442 232
546 108

Perform both tests and write a short statement comparing the results.

Activity 3.1.2
The data below show the number of books published in six different subject areas for the
years 1980 and 2004. Use α= 0.05 to see if there is a relationship between the two data
sets.
Year Agriculture Home economics Literature Music Science Sports and recreation
1980 461 879 1686 357 3109 971
2004 1065 3639 4671 2764 8509 4806

SESSION 3.2.1

SIGN TEST
The simplest nonparametric test, the sign test for single samples, is used to test the value
of a median for a specific sample.
• SIGN.test(x, md) requires the “BSDA” package in R.

Example 3.2.1
A convenience store owner hypothesizes that the median number of snow cones she sells
per day is 40. A random sample of 20 days yields the following data for the number of
snow cones sold each day.
18, 43, 40, 16, 22, 30, 29, 32, 37, 36, 39, 34, 39, 45, 28, 36, 40, 34, 39, 52

At α = 0.05, test the owner’s hypothesis.

Solution
Ho=The median number of snow cones she sells per day is 40.
Ha=The median number of snow cones she sells per day is not 40.

R-Code

>x=c(18, 43, 40, 16, 22, 30, 29, 32, 37, 36,39, 34, 39, 45,
28, 36, 40, 34, 39, 52)
>library(BSDA)
>SIGN.test(x, md=40)


OUTPUT

One-sample Sign-Test
data: x
s = 3, p-value = 0.007538
alternative hypothesis: true median is not equal to 40

95 percent confidence interval:


30.23294 39.00000

sample estimates:
median of x
36

Conclusion: There is enough evidence to reject the claim that the median number of
snow cones sold per day is 40.
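
The sign test is a binomial test on the number of observations above the hypothesized median, after discarding values equal to it. As a hedged cross-check that needs no extra package, the sketch below uses base R's binom.test(); the names above, below, n_eff and res are illustrative.

```r
# Snow cone sales from Example 3.2.1
x <- c(18, 43, 40, 16, 22, 30, 29, 32, 37, 36, 39, 34, 39, 45,
       28, 36, 40, 34, 39, 52)
md <- 40

# Count the days above and below the hypothesized median;
# the two days equal to 40 are not counted
above <- sum(x > md)
below <- sum(x < md)
n_eff <- above + below

# Under Ho, "above" follows a Binomial(n_eff, 0.5) distribution
res <- binom.test(above, n_eff, p = 0.5)

above        # 3
res$p.value  # about 0.007538
```

The count above the median equals the statistic s, and the p-value agrees with the SIGN.test output.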

SESSION 3.2.2
SIGN TEST (PAIRED SAMPLE)
The sign test can also be used to compare two dependent samples, such as in a
before-and-after test, by testing whether the median of the paired differences is zero.

Example 3.2.2
A medical researcher believed the number of ear infections in swimmers can be reduced
if the swimmers use earplugs. A sample of 10 people was selected, and the number of
infections for four months was recorded. During the first two months, the swimmers did
not use the earplugs; during the second two months, they did. At the beginning of the
second two-month period, each swimmer was examined to make sure that no infections
were present. The data are shown below. At α = 0.05, can the researcher conclude that
using earplugs reduced the median number of ear infections?

Swimmers Before After


A 3 2
B 0 1
C 5 4
D 4 0
E 2 1
F 4 3
G 3 1
H 5 3
I 2 2
J 1 3

Solution
Ho=The median number of ear infections will not be reduced.
Ha= The median number of ear infections will be reduced.
Let x=before and y=after

R-Code

>x=c(3,0,5,4,2,4,3,5,2,1)
>y=c(2,1,4,0,1,3,1,3,2,3)
>w=x-y
>library(BSDA)
>SIGN.test(w)


OUTPUT

One-sample Sign-Test
data: w
s = 7, p-value = 0.1797
alternative hypothesis: true median is not equal to 0
95 percent confidence interval:
-0.6755556 2.0000000
sample estimates:
median of x
1
Conclusion: There is not enough evidence to support the claim that the use of earplugs
reduced the median number of ear infections.
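
The hypotheses in Example 3.2.2 are one-sided (the claim is a reduction), whereas the output above reports a two-sided p-value. As a hedged cross-check in base R, the sketch below counts the positive and negative differences and applies binom.test() in both forms; using binom.test() in place of SIGN.test(), and the names pos, neg, p_two and p_one, are choices made for this illustration.

```r
# Ear infections before and after using earplugs (Example 3.2.2)
x <- c(3, 0, 5, 4, 2, 4, 3, 5, 2, 1)
y <- c(2, 1, 4, 0, 1, 3, 1, 3, 2, 3)
w <- x - y

# Positive differences mean fewer infections after; zero differences are dropped
pos <- sum(w > 0)
neg <- sum(w < 0)

# Two-sided test (matches the output above)
p_two <- binom.test(pos, pos + neg, p = 0.5)$p.value

# One-sided test of "before is larger than after"
p_one <- binom.test(pos, pos + neg, p = 0.5, alternative = "greater")$p.value

p_two  # about 0.1797
p_one  # about 0.0898
```

Either way the p-value exceeds 0.05, so the conclusion is unchanged.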

Activity 3.2
The median age for the total population of the state of Maine is 41.2, the highest in the
nation. The mayor of a particular city believes that his population is considerably
“younger” and that the median age there is 36 years. At α = 0.05, is there sufficient
evidence to reject his claim? The data here represent a random selection of persons from
the household population of the city.
40, 56, 42, 72, 12, 22, 25, 43, 39, 48, 50, 37, 18, 35, 15, 30, 52, 45

SESSION 3.3.1
WILCOXON SIGNED-RANK TEST
When the samples are dependent, as they would be in a before-and-after test using the
same subjects, the Wilcoxon signed-rank test can be used in place of the t-test for
dependent samples.

Example 3.3.1
In a large department store, the owner wishes to see whether the number of shoplifting
incidents per day will change if the number of uniformed security officers is doubled. A
sample of 7 days before security is increased and 7 days after the increase shows the
number of shoplifting incidents.

Days Monday Tuesday Wednesday Thursday Friday Saturday Sunday
Before 7 2 3 6 5 8 12
After 5 3 4 3 1 6 4

Is there enough evidence at α = 0.05 to support the claim that there is a difference in the
number of shoplifting incidents before and after the increase in security?

Solution
𝐻o = There is no difference in the number of shoplifting incidents before and after the
increase in security.
𝐻A = There is a difference in the number of shoplifting incidents before and after the
increase in security.

R-Code

>before=c(7,2,3,6,5,8,12)
>after=c(5,3,4,3,1,6,4)
>wilcox.test(before,after,paired=TRUE)


OUTPUT

Wilcoxon signed rank test with continuity correction


data: before and after
V = 25, p-value = 0.07488
alternative hypothesis: true location shift is not equal to 0

Conclusion: We fail to reject 𝐻o and conclude that there is not enough evidence to
support the claim that there is a difference in the number of shoplifting incidents.
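
The statistic V is the sum of the ranks of the positive differences. A minimal sketch of that calculation follows; the names d, r and V are illustrative, and tied absolute differences receive average ranks (the default behaviour of rank()).

```r
# Shoplifting incidents from Example 3.3.1
before <- c(7, 2, 3, 6, 5, 8, 12)
after  <- c(5, 3, 4, 3, 1, 6, 4)

# Paired differences; zero differences would be dropped (there are none here)
d <- before - after
d <- d[d != 0]

# Rank the absolute differences, then add up the ranks of the positive ones
r <- rank(abs(d))
V <- sum(r[d > 0])

V  # 25
```

Because some absolute differences are tied, R cannot compute an exact p-value and uses the normal approximation with continuity correction, as the title of the output indicates.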

SESSION 3.3.2
WILCOXON RANK-SUM TEST
The Wilcoxon rank-sum test is used for independent samples.

Example 3.3.2

Two independent samples of the army and marine recruits are selected, and the time in
minutes it takes each recruit to complete an obstacle course is recorded, as shown in the
table below.
Army 15 18 16 17 13 22 24 17 19 21 26 28
Marines 14 9 16 19 10 12 11 8 15 18 25

At α = 0.05, is there a difference in the times it takes the recruits to complete the
course?

Solution
𝐻o = There is no difference in the times it takes the recruits to complete the obstacle
course.
𝐻A = There is a difference in the time it takes the recruits to complete the obstacle
course.

R-Code

>x=c(15,18,16,17,13,22,24,17,19,21,26,28)
>y=c(14,9,16,19,10,12,11,8,15,18,25)
>wilcox.test(x,y)


OUTPUT

Wilcoxon rank sum test with continuity correction


data: x and y
W = 105, p-value = 0.01767
alternative hypothesis: true location shift is not equal to 0

Conclusion: Reject Ho and conclude that there is enough evidence to support the claim
that there is a difference in the times it takes the recruits to complete the course.
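
The statistic W reported by wilcox.test() for two independent samples is the rank sum of the first sample minus its smallest possible value, n1(n1 + 1)/2. The sketch below reproduces it; the names r, n1, rank_sum_x and W are illustrative.

```r
# Obstacle course times from Example 3.3.2
x <- c(15, 18, 16, 17, 13, 22, 24, 17, 19, 21, 26, 28)  # Army
y <- c(14, 9, 16, 19, 10, 12, 11, 8, 15, 18, 25)        # Marines

# Rank all 23 times together (ties get average ranks)
r  <- rank(c(x, y))
n1 <- length(x)

# Rank sum of the Army sample, then subtract its minimum possible value
rank_sum_x <- sum(r[seq_len(n1)])
W <- rank_sum_x - n1 * (n1 + 1) / 2

rank_sum_x  # 183
W           # 105
```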

Activity 3.3

Two groups of alcoholics, one group male, and the other female were asked at what age
they first drank alcohol. The data are shown here. Using the Wilcoxon rank-sum test at α
= 0.05, is there a difference in the ages of the females and males?
Males 6 12 14 16 17 17 13 12 10 11

Females 8 9 9 12 14 15 12 16 17 19

SESSION 3.4
RUNS TEST FOR RANDOMNESS
The runs test for randomness (Mendenhall and Reinmuth 1982) can be performed on
continuous data and on a dichotomous (binary) data series ‘x’. Users can choose whether
to plot the correlation graph or not, and whether to test against a two-sided, negative,
or positive correlation. ‘NA’s in the data are omitted.

require(lawstat) for continuous data


runs.test(y,
plot.it = FALSE,
alternative = c("two.sided", "positive.correlated", "negative.correlated")
)

require(tseries) for discrete and categorical data


runs.test(x, alternative = c("two.sided", "less", "greater"))

SESSION 3.4.1
RANDOMNESS TESTS FOR CATEGORICAL AND CONTINUOUS DATA
When samples are selected, you assume that they are selected at random. How do you
know if the data obtained from a sample are truly random?
R-code for Runs Test
• runs.test(a) requires the “tseries” package for the categorical case
• runs.test(a) requires the “lawstat” package for the continuous case

Example 3.4.1.1: (Discrete)


On a commuter train, the conductor wishes to see whether the passengers enter the train
at random. He observes the first 25 people, with the following sequence of males (M)
and females (F).
FFFMMFFFFMFMMMFFFFMMFFFMM
Test for randomness at α = 0.05.
Solution
Ho=The passengers board the train at random, according to gender.
Ha=The passengers do not board the train at random, according to gender.
Note: Load the ‘tseries’ package before you run the code.


R-Code

>a=factor(c("F","F","F","M","M","F","F","F","F","M","F","M"
,"M","M","F","F","F","F","M","M","F","F","F","M","M"))
> library(tseries)
> runs.test(a)

Alternative code
>a=scan(what="")
1: F
2: F
..
..
..
..
25: M
>library(tseries)
>runs.test(factor(a))

OUTPUT

Runs Test
data: a
Standard Normal = -1.2792, p-value = 0.2008
alternative hypothesis: two.sided


Conclusion: There is not enough evidence to reject the hypothesis that the passengers
board the train at random according to gender.
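
The statistic in the output can be reproduced in base R by counting the runs and standardizing with the usual large-sample mean and variance of the number of runs. This is a sketch for illustration; the names runs, n1, n2, mu, sigma2, z and p_value are not taken from any package.

```r
# Boarding sequence from Example 3.4.1.1, split into single letters
a <- strsplit("FFFMMFFFFMFMMMFFFFMMFFFMM", "")[[1]]

# A run is a maximal block of identical letters; rle() finds the blocks
runs <- length(rle(a)$lengths)
n1 <- sum(a == "F")
n2 <- sum(a == "M")
n  <- n1 + n2

# Mean and variance of the number of runs under randomness
mu     <- 2 * n1 * n2 / n + 1
sigma2 <- 2 * n1 * n2 * (2 * n1 * n2 - n) / (n^2 * (n - 1))

# Standardized statistic and two-sided p-value
z <- (runs - mu) / sqrt(sigma2)
p_value <- 2 * pnorm(-abs(z))

runs     # 10
z        # about -1.2792
p_value  # about 0.2008
```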

Example 3.4.1.2 (Continuous)

Twenty people enrolled in a drug abuse program. Test the claim that the ages of the
people, according to the order in which they enroll, occur at random, at α = 0.05.
The data are:
18, 36, 19, 22, 25, 44, 23, 27, 27, 35, 19, 43, 37, 32, 28, 43, 46, 19, 20, 22.

Solution
Ho = The ages of the people, according to the order in which they enroll in a drug
program, occur at random.
Ha = The ages of the people, according to the order in which they enroll in a drug
program, do not occur at random.

R-Code

>a=c(18, 36, 19, 22, 25, 44, 23, 27, 27, 35, 19, 43, 37,
32, 28, 43, 46, 19, 20, 22)
>library(lawstat)
>runs.test(a)

OUTPUT

Runs Test -Two-sided


data: a
Standardized Runs Statistic = -0.8823, p-value = 0.3776

Conclusion: There is not enough evidence to reject the hypothesis that the ages of the
people who enroll occur at random.
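
For continuous data the runs are formed by classifying each value as above or below the median. The base R sketch below follows that idea. How values exactly equal to the median are handled is an assumption of this sketch: each such value is grouped with the observation before it, a rule chosen here because it reproduces the statistic in the output above.

```r
# Ages in enrollment order (Example 3.4.1.2)
a <- c(18, 36, 19, 22, 25, 44, 23, 27, 27, 35, 19, 43, 37,
       32, 28, 43, 46, 19, 20, 22)
med <- median(a)

# +1 above the median, -1 below, 0 when equal to the median
s <- sign(a - med)

# Assumption: a value equal to the median joins the run of the previous value
for (k in 2:length(s)) {
  if (s[k] == 0) s[k] <- s[k - 1]
}

runs <- length(rle(s)$lengths)
m <- sum(s > 0)   # number of values counted as above the median
n <- sum(s < 0)   # number of values counted as below the median

# Mean and variance of the number of runs under randomness
mu     <- 1 + 2 * m * n / (m + n)
sigma2 <- 2 * m * n * (2 * m * n - m - n) / ((m + n)^2 * (m + n - 1))
z <- (runs - mu) / sqrt(sigma2)

runs  # 9
z     # about -0.8823
```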

Activity 3.4
1. As students, faculty, friends, and family arrived for the Spring Wind Ensemble
Concert at Shafer Auditorium, they were asked whether they were going to sit on
the balcony (B) or the ground floor (G). Use the responses listed below and test
for randomness at α = 0.05.
BBGGBBGBBBBBBGBBGGBBBBGGGGBGBBBGG

2. A school dentist wanted to test the claim, at α = 0.05, that the number of cavities
in fourth-grade students is random. Forty students were checked, and the number
of cavities each had is shown here. Test for the randomness of the values above
or below the median.
0 4 6 0 6 2 5 3 1 5 1 2 2 1 3 7 3 6 0 2 6 0 2 3 1 5 2 1 3 0 2 3 7 3 1 5 1 1 2 2


SESSION 3.5
KRUSKAL-WALLIS TEST
The analysis of variance uses the F test to compare the means of three or more
populations. The assumptions for the ANOVA test are that the populations are normally
distributed and that the population variances are equal. When these assumptions cannot be
met, the nonparametric Kruskal-Wallis test, sometimes called the H test, can be used
to compare three or more means.

R-code for Kruskal Wallis Test


• kruskal.test(list(a,b,c)), where a, b, and c are the three samples (groups) to be
compared.

Example 3.5.1

A researcher tests three different brands of breakfast drinks to see how many
milliequivalents of potassium per quart each contains. These data are obtained.

Brand A Brand B Brand C


4.7 5.3 6.3
3.2 6.4 8.2
5.1 7.3 6.2
5.2 6.8 7.1
5.0 7.2 6.6

a. At α = 0.05, is there enough evidence to reject the hypothesis that all brands
contain the same amount of potassium?

Solution
Ho = There is no difference in the amount of potassium contained in the brands.
Ha = There is a difference in the amount of potassium contained in the brands.

R-Code

>a=c(4.7,3.2,5.1,5.2,5.0)
>b=c(5.3,6.4,7.3,6.8,7.2)
>c=c(6.3,8.2,6.2,7.1,6.6)
>kruskal.test(list(a,b,c))

OUTPUT
Kruskal-Wallis rank sum test
data: list(a, b, c)
Kruskal-Wallis chi-squared = 9.38, df = 2, p-value = 0.009187
Conclusion: Since the p-value (0.009187) is less than 0.05, we reject Ho. There is enough
evidence to conclude that there is a difference in the amount of potassium contained in
the three brands.
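
The H statistic can be checked by hand: rank all 15 measurements together, sum the ranks within each brand, and apply H = 12/(N(N + 1)) · Σ(Rᵢ²/nᵢ) − 3(N + 1). A minimal sketch follows, assuming no ties (true for these data); the third sample is called c3 here, rather than c, only to avoid masking R's c() function.

```r
# Potassium content from Example 3.5.1
a  <- c(4.7, 3.2, 5.1, 5.2, 5.0)   # Brand A
b  <- c(5.3, 6.4, 7.3, 6.8, 7.2)   # Brand B
c3 <- c(6.3, 8.2, 6.2, 7.1, 6.6)   # Brand C

values <- c(a, b, c3)
group  <- rep(c("A", "B", "C"), each = 5)

# Joint ranks, then rank sums and sample sizes per brand
r   <- rank(values)
R   <- tapply(r, group, sum)
n_i <- tapply(r, group, length)
N   <- length(values)

# Kruskal-Wallis H and its chi-squared p-value with k - 1 = 2 degrees of freedom
H <- 12 / (N * (N + 1)) * sum(R^2 / n_i) - 3 * (N + 1)
p_value <- pchisq(H, df = 2, lower.tail = FALSE)

R        # A: 15, B: 53, C: 52
H        # 9.38
p_value  # about 0.009187
```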

Activity 3.5.1
You are researching an article on the waterfalls on our planet. You want to make a state-
ment about the heights of waterfalls on three continents. Three samples of waterfall
heights (in feet) are shown.
North America Africa Asia
600 406 330
1200 508 830
182 630 614
620 726 1100
1170 480 885
442 2014 330

1 What are the hypotheses?


2 Select a significance level and run the test. What is the H value?
3 What is your conclusion?


Activity 3.5.2

Samples of three different types of wrapping tape are tested for breaking strength, in
pounds. The data are shown here. At α = 0.05, is there a difference in the breaking
strength of the tapes? Use the Kruskal-Wallis test.
Type A 225 332 404 387 351 280 362 431 266
Type B 256 203 261 305 232 278 261 299 272
Type C 406 427 481 397 351 409 462 471 399

UNIT FOUR
CHI-SQUARE TEST
OVERVIEW
In this unit, chi-square goodness of fit and homogeneity tests will be considered. These
techniques are under the nonparametric methods of hypothesis testing. We will discuss
the applications of the chi-square tests and also compute test statistics and hypothesis
tests using R software.

CONTENT
Session 4.1 Goodness of Fit: A chi-square test used to see whether a frequency
distribution fits a specific pattern
Session 4.2 Test of Homogeneity

REQUIRED READINGS

1. Bluman, A. G. (2009). Elementary statistics: A step-by-step approach. New York,
NY: McGraw-Hill Higher Education.
2. Ugarte, M. D., Militino, A. F., & Arnholt, A. T. (2008). Probability and Statistics
with R. CRC Press.
3. Kerns, G. J. (2010). Introduction to probability and statistics using R. Lulu.com.
4. Schmuller, J. (2017). Statistical Analysis with R for Dummies. John Wiley & Sons.
5. Crawley, M. J. (2005). Statistics: An introduction using R. Wiley.

LEARNING OUTCOMES
By the completion of this unit, the students should be able to;
1. Perform the goodness of fit test with chi-square using R

2. Perform the test of homogeneity using R


SESSION/EXAMPLES/ACTIVITIES

This session presents video activities, session examples, and other activities. Video
Activities include YouTube videos for the goodness of fit test with chi-square and test of
homogeneity in R software. Session Examples include solved questions on each
technique in R software. Each example is followed by one or more activities. Please note
that these ACTIVITIES are to be solved and submitted for grading. The deadline for
submission is one week after each lecture.

Video Activity
1. https://www.youtube.com/watch?v=n5c11B5FJ24 (Goodness of Fit Test)

2. https://www.youtube.com/watch?v=oAs15X_hsJ4 (Test of Homogeneity)

SESSION 4.1
GOODNESS OF FIT
‘chisq.test’ performs chi-squared contingency table tests and
goodness-of-fit tests.

chisq.test(x, y = NULL, correct = TRUE, p = rep(1/length(x), length(x)), rescale.p =
FALSE, simulate.p.value = FALSE, B = 2000)

Goodness of Fit Test – R Syntax:

• chisq.test(x, p = p)

Example 4.1.1
The data below show the preferences in the selection of fruit soda flavors. Sellers claim
that there is no preference in the selection of fruit soda flavors.

Frequency               Cherry  Strawberry  Orange  Lime  Grape
Observed                32      28          16      14    10
Expected probabilities  0.2     0.2         0.2     0.2   0.2

Is there enough evidence to reject the claim that there is no preference in the selection of
fruit soda flavors, using the data above? Let α = 0.05.

Solution:
𝐻o: Consumers show no preference for flavors of the fruit soda.

𝐻A: Consumers show a preference for flavors of the fruit soda

R-Code

>a=c(32,28,16,14,10)
>p=c(0.2,0.2,0.2,0.2,0.2)
>chisq.test(a,p=p)


OUTPUT

Chi-squared test for given probabilities


data: a
X-squared = 18, df= 4, p-value = 0.001234

Conclusion: We reject the null hypothesis and conclude that consumers show a
preference for flavors.
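
The test statistic is the sum of (observed − expected)²/expected over the five flavors, with k − 1 = 4 degrees of freedom. A minimal sketch of the hand calculation follows; the names observed, expected, X2 and p_value are illustrative.

```r
# Observed flavor choices from Example 4.1.1
observed <- c(32, 28, 16, 14, 10)
p <- rep(0.2, 5)

# Expected counts under "no preference": 100 * 0.2 = 20 for every flavor
expected <- sum(observed) * p

# Chi-square statistic and its p-value with k - 1 degrees of freedom
X2 <- sum((observed - expected)^2 / expected)
p_value <- pchisq(X2, df = length(observed) - 1, lower.tail = FALSE)

X2       # 18
p_value  # about 0.001234
```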

Activity 4.1
Home-Schooled Student Activities: Students who are home-schooled often attend their
local schools to participate in various types of activities such as sports or musical
ensembles. According to the government, 82% of home-schoolers receive their
education entirely at home, while 12% attend school up to 9 hours per week and 6%
spend from 9 to 25 hours per week at school. A survey of 85 students who are home-
schooled revealed the following information about where they receive their education.

Entirely at home Up to 9 hours 9 to 25 hours


50 25 10

At α = 0.05, is there sufficient evidence to conclude that the proportions differ from
those stated by the government?

SESSION 4.2
TEST OF HOMOGENEITY
R-Syntax for test of homogeneity:
• chisq.test(x), where x is a cross-tabulation of the data

Example 4.2.1
A researcher selected 100 passengers from each of the 3 airlines and asked them if the
airline had lost their luggage on their last flight. At a 0.05 level of significance, test the
claim that the proportion of passengers from each airline who lost luggage on the flight
is the same for each airline. The data are shown in the table.
Airline 1 Airline 2 Airline 3 Total
Yes 10 7 4 21
No 90 93 96 279

Solution
𝐻o=The proportion of passengers from each airline who lost luggage on the flight is the
same for each airline.
𝐻A=The proportion of passengers from each airline who lost luggage on the flight is not
the same for each airline.

R-Code

>e=rbind(c(10,90),c(7,93),c(4,96))
>chisq.test(e)


OUTPUT

Pearson's Chi-squared test


data: e
X-squared = 2.765, df= 2, p-value = 0.251

Conclusion: There is not enough evidence to reject the null hypothesis; hence we cannot
conclude that the proportion of passengers who lost luggage on the flight differs among
the airlines.
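
For a contingency table, each expected count is (row total × column total) / grand total, and the degrees of freedom are (rows − 1)(columns − 1). The sketch below shows this for the luggage data; the dimnames and the names expected, X2, df and p_value are added here for readability only.

```r
# Lost-luggage counts from Example 4.2.1 (rows: airlines, columns: Yes / No)
e <- rbind(c(10, 90), c(7, 93), c(4, 96))
dimnames(e) <- list(Airline = c("Airline 1", "Airline 2", "Airline 3"),
                    Lost = c("Yes", "No"))

# Expected counts if the proportion is the same for every airline
expected <- outer(rowSums(e), colSums(e)) / sum(e)

# Pearson chi-square statistic, degrees of freedom and p-value
X2 <- sum((e - expected)^2 / expected)
df <- (nrow(e) - 1) * (ncol(e) - 1)
p_value <- pchisq(X2, df, lower.tail = FALSE)

expected  # 7 "Yes" and 93 "No" expected for each airline
X2        # about 2.765
p_value   # about 0.251
```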

Activity 4.2.1
Endangered or Threatened Species: Can you conclude a relationship between the class of
vertebrate and whether it is endangered or threatened? Use the 0.05 level of significance.
Is there a different result for the 0.01 level of significance?

Mammal Bird Reptile Amphibian Fish


Endangered 68 76 14 13 76
Threatened 13 15 23 10 61

Activity 4.2.2
Is there sufficient evidence at the 5% level of significance to conclude that a relationship
exists between the city and the number of television and radio stations that it has?
TV Radio
Albuquerque N. Mex. 13 32
Boston Mass. 12 21
St. Petersburg Fla. 17 41
Minneapolis Minn. 7 30
Toledo Ohio 6 22

UNIT FIVE
ANALYSIS OF VARIANCE
OVERVIEW
In this unit, the F test, which compares two variances, is used to test claims involving
three or more means. We will discuss the applications of the various analysis of
variance tests using R software. The two-way ANOVA is an extension of the one-way
analysis of variance; it involves two independent variables. The independent variables
are also called factors.

CONTENT
Session 5.1 One Way ANOVA
Session 5.2 Two Way ANOVA
Session 5.3 Multivariate Analysis of Variance

REQUIRED READINGS

1. Bluman, A. G. (2009). Elementary statistics: A step-by-step approach. New
York, NY: McGraw-Hill Higher Education.
2. Ugarte, M. D., Militino, A. F., & Arnholt, A. T. (2008). Probability and
Statistics with R. CRC Press.
3. Kerns, G. J. (2010). Introduction to probability and statistics using R. Lulu.com.
4. Schmuller, J. (2017). Statistical Analysis with R for Dummies. John Wiley &
Sons.
5. Crawley, M. J. (2005). Statistics: An introduction using R. Wiley.

LEARNING OUTCOMES
By the completion of this unit, the students should be able to;
1. Perform the one-way ANOVA test using R

2. Perform the two-way ANOVA test using R

3. Perform the multivariate analysis of variance (MANOVA) using R


SESSION/EXAMPLES/ACTIVITIES

This session presents video activities, session examples, and other activities. Video
Activities include YouTube videos for performing one-way ANOVA, two-way ANOVA
and multivariate ANOVA (MANOVA) in R software. Session Examples include solved
questions on each test in R software; they illustrate the R code for carrying out the
ANOVA tests. Each example is followed by one or more activities. Please note that these
ACTIVITIES are to be solved and submitted for grading. The deadline for submission is
one week after each lecture.

Video Activity
1. https://www.youtube.com/watch?v=4DeCaCaC2JQ (one-way ANOVA)

2. https://www.youtube.com/watch?v=oEaS_yKJ8lM (two-way ANOVA)

SESSION 5.1
ONE WAY ANOVA
Fit an analysis of variance model by a call to ‘lm’ for each
stratum.
aov(formula, data = NULL, projections = FALSE, qr = TRUE,
contrasts = NULL, ...)

One-way ANOVA – R Syntax:


• aov(y~x, data).

Example 5.1.1:
A researcher wishes to try three different techniques to lower the blood pressure of
individuals diagnosed with high blood pressure. The subjects are randomly assigned to
three groups; the first group takes medication, the second group exercises, and the third
group follows a special diet. After four weeks, the reduction in each person’s blood
pressure is recorded. At 𝛼=0.05, test the claim that there is no difference among the
means. The data are shown below
Techniques
Medication(M) 10 12 9 15 13
Exercise(E) 6 8 3 0 2
Diet(D) 5 9 12 8 4

Solution:
𝐻o: The mean of the three techniques is the same.
𝐻A: At least one mean is different from the others.

Note
•Enter the data in Excel.
•Save the data. (With the blood pressure example, the data is saved as ‘mydata’).
•Save the data as a comma-delimited file(csv) in your document.
•Import data into ‘R’ by using ‘data2=read.csv("mydata.csv")’.


R-Code

>data2=read.csv("mydata.csv")
>data2
>results=aov(pressure~techniques, data=data2)
>summary(results)

OUTPUT

            Df Sum Sq Mean Sq F value  Pr(>F)
techniques   2  160.1   80.07   9.168 0.00383 **
Residuals   12  104.8    8.73
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Conclusion: The decision is to reject 𝐻o, since the F value (9.17) is greater than the
critical value (3.89). We therefore conclude that at least one mean is different from the
others.
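
If you prefer not to go through Excel, the same analysis can be sketched by typing the data straight into a data frame. The column names pressure and techniques are assumed from the formula used in the R-Code above; the object names tab, f_value and f_crit are illustrative.

```r
# Blood pressure reductions from Example 5.1.1, in long format
data2 <- data.frame(
  pressure   = c(10, 12, 9, 15, 13,   # Medication
                 6, 8, 3, 0, 2,       # Exercise
                 5, 9, 12, 8, 4),     # Diet
  techniques = factor(rep(c("Medication", "Exercise", "Diet"), each = 5))
)

# One-way ANOVA
results <- aov(pressure ~ techniques, data = data2)
tab <- summary(results)[[1]]

# F value from the table and the 5% critical value with 2 and 12 df
f_value <- tab[["F value"]][1]
f_crit  <- qf(0.95, df1 = 2, df2 = 12)

f_value  # about 9.168
f_crit   # about 3.89
```

Comparing f_value with f_crit leads to the same decision as comparing Pr(>F) with 0.05.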

Activity 5.1
The following set of data values was obtained from a study of people’s perceptions on
whether the color of a person’s clothing is related to how intelligent the person looks.
The subjects rated the person’s intelligence on a scale of 1 to 10. Group 1 subjects were
randomly shown people with clothing in shades of blue and gray. Group 2 subjects were
randomly shown people with clothing in shades of brown and yellow. Group 3 subjects
were randomly shown people with clothing in shades of pink and orange. The results
follow.


Group 1 Group 2 Group 3


8 7 4
7 8 9
7 7 6
7 7 7
8 5 9
8 8 8
6 5 5
8 8 8
8 7 7
7 6 5
7 6 4
8 6 5
8 6 4

Use ANOVA to test for any significant differences between the means.

SESSION 5.2
TWO-WAY ANOVA TEST
R Syntax for two-way ANOVA:
• aov(y~x + z, data)

Example 5.2.1:
A researcher wishes to see whether the type of gasoline used and the type of automobile
driven have any effect on gasoline consumption. Two types of gasoline, regular and
high-octane, will be used, and two types of automobiles, two-wheel- and four-wheel-
drive, will be used in each group. There will be two automobiles in each group, for a
total of eight automobiles used. Take 𝛼=0.05.
The data (in miles per gallon) are shown and summarized in the table below.

The hypotheses for the gasoline types are


𝐻o: There is no difference between the means of gasoline consumption for the two types
of gasoline.
HA: There is a difference between the means of gasoline consumption for the two types
of gasoline.

The hypotheses for the types of automobile driven are


𝐻o: There is no difference between the means of gasoline consumption for two-wheel-
drive and four-wheel-drive automobiles.
𝐻A: There is a difference between the means of gasoline consumption for two-wheel-
drive and four-wheel-drive automobiles.

Solution


Note
•Enter the data in Excel.

•Save the data. (With the gasoline example, the data is saved as ‘mydata2’).

•Save the data as a comma-delimited file(csv) in your document.


•Import data into ‘R’ by using ‘data3=read.csv("mydata2.csv")’.

R-Code

>data3=read.csv("mydata2.csv")
>data3
>results=aov(Gasoline~Gas+Automobile,data=data3)
>summary(results)

OUTPUT

            Df Sum Sq Mean Sq F value Pr(>F)
Gas          1   3.92    3.92   0.342  0.584
Automobile   1   9.68    9.68   0.843  0.401
Residuals    5  57.38   11.48

Conclusion for the gasoline types :


We fail to reject 𝐻o for the gasoline types since the F value (0.342) is less than the
critical value of 6.61 (1 and 5 degrees of freedom); we conclude that there is no difference
between the means of gasoline consumption for the two types of gasoline.

Conclusion for the types of Automobiles driven:


We fail to reject 𝐻o for the type of automobile driven since the F value (0.843) is less
than the critical value of 6.61 (1 and 5 degrees of freedom); we conclude that there is no
difference between the means of gasoline consumption for two-wheel-drive and
four-wheel-drive automobiles.

When there is an interaction between the variables


The hypotheses for the interactions are:
𝐻o: There is no interaction effect between the type of gasoline used and the type of
automobile a person drives on gasoline consumption.


𝐻A: There is an interaction effect between the type of gasoline used and the type of
automobile a person drives on gasoline consumption.
R-Code

>data3=read.csv("mydata2.csv")
>data3
>results=aov(Gasoline~Gas+Automobile+Gas:Automobile,data=data3)
>summary(results)

OUTPUT

               Df Sum Sq Mean Sq F value  Pr(>F)
Gas             1   3.92    3.92   4.752 0.09477 .
Automobile      1   9.68    9.68  11.733 0.02665 *
Gas:Automobile  1  54.08   54.08  65.552 0.00126 **
Residuals       4   3.30    0.82
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Conclusion for the interaction:
We reject 𝐻o for the interaction effect since 65.552 is greater than the critical value
of 7.71. Since the null hypothesis for the interaction effect was rejected, it can be
concluded that there is an interaction effect between the type of gasoline used and the
type of automobile a person drives on gasoline consumption.
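
The two models of Example 5.2.1 can also be sketched without a CSV file. Note that the data table is not reproduced in this guide, so the eight mileage values below are an assumption made for illustration; they were chosen because they give the same sums of squares as the outputs above. The column names Gasoline, Gas and Automobile are taken from the R-Code. In an R formula, Gas*Automobile is shorthand for Gas + Automobile + Gas:Automobile.

```r
# ASSUMED illustrative data (not taken from the guide); two automobiles per group
data3 <- data.frame(
  Gasoline   = c(26.7, 25.2, 28.6, 29.3, 32.3, 32.8, 26.1, 24.2),
  Gas        = factor(rep(c("Regular", "High-octane"), each = 4)),
  Automobile = factor(rep(rep(c("Two-wheel", "Four-wheel"), each = 2), times = 2))
)

# Model without interaction, and model with interaction
additive <- aov(Gasoline ~ Gas + Automobile, data = data3)
full     <- aov(Gasoline ~ Gas * Automobile, data = data3)

ss_additive <- summary(additive)[[1]][["Sum Sq"]]  # Gas, Automobile, Residuals
ss_full     <- summary(full)[[1]][["Sum Sq"]]      # Gas, Automobile, Gas:Automobile, Residuals

# 5% critical values: the residual df differ between the two models
crit_additive <- qf(0.95, df1 = 1, df2 = 5)  # about 6.61
crit_full     <- qf(0.95, df1 = 1, df2 = 4)  # about 7.71

summary(additive)
summary(full)
```

Adding the interaction term moves most of the residual sum of squares into Gas:Automobile, which is why the F values for Gas and Automobile change between the two tables.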

SESSION 5.3
MULTIVARIATE ANALYSIS OF VARIANCE
The R syntax for the multivariate analysis of variance is given by
manova(...)

Example 5.3.1
A researcher would like to know if there is a significant difference in sepal and petal
length between the different species of flowers. Take α = 0.05. The data are given in the
table below.
Species Sepal.length Petal.length
Versicolor 5.0 3.3
Versicolor 5.5 4.4
Versicolor 5.8 4.0
Virginia 6.2 4.8
Virginia 5.9 5.1
Setosa 4.9 1.4
Setosa 4.5 1.4
Versicolor 5.7 4.2
Versicolor 6.1 4.7
Versicolor 6.2 4.3

Solution
𝐻o: There is no significant difference in petal and sepal length between the different
species.
𝐻A: There is a significant difference in petal and sepal length between the different
species.

Let Vi represent Virginia.
Let Ve represent Versicolor.
Let Se represent Setosa.

Note
•Enter the data in Excel.

•Save the data. (With the species example, the data is saved as ‘mydata3’).

•Save the data as a comma-delimited file(csv) in your document.

•Import data into ‘R’ by using ‘flower=read.csv("mydata3.csv")’.

R-Code

>flower=read.csv("mydata3.csv")
>flower
>results=manova(cbind(Sepal.length,Petal.length)~Species, data=flower)
>summary(results)

OUTPUT
          Df  Pillai approx F num Df den Df  Pr(>F)
Species    2 0.93928   3.0993      4     14 0.05061 .
Residuals  7
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Conclusion: We fail to reject the null hypothesis and conclude that there is no
significant difference, in petal and sepal length between the different species.
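
As an alternative to importing a CSV file, the table of Example 5.3.1 can be typed directly into a data frame. This is a sketch under one assumption: the second measurement column is named Petal.length here so that it matches the response named in the manova() call. The object names s and stats are illustrative.

```r
# Data from the table in Example 5.3.1 (second column named Petal.length by assumption)
flower <- data.frame(
  Species = factor(c("Versicolor", "Versicolor", "Versicolor", "Virginia", "Virginia",
                     "Setosa", "Setosa", "Versicolor", "Versicolor", "Versicolor")),
  Sepal.length = c(5.0, 5.5, 5.8, 6.2, 5.9, 4.9, 4.5, 5.7, 6.1, 6.2),
  Petal.length = c(3.3, 4.4, 4.0, 4.8, 5.1, 1.4, 1.4, 4.2, 4.7, 4.3)
)

# cbind() stacks the two responses into one multivariate response
results <- manova(cbind(Sepal.length, Petal.length) ~ Species, data = flower)

# summary() reports Pillai's trace by default
s <- summary(results)
stats <- s$stats
stats
```

Other test statistics can be requested through the test argument of summary(), for example summary(results, test = "Wilks").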


Activity 5.3.1
A state employee wishes to see if there is a significant difference in the number of
employees at the interchanges of three regional toll roads. The data are shown. At α =
0.05, can it be concluded that there is a significant difference in the average number of
employees at each interchange?
Accra-Tema Motorway Kumasi-Accra road Kumasi-Tamale Road
7 10 1
14 1 12
32 1 1
19 0 9
10 11 1
11 1 11

Activity 5.3.2
The following set of data values was obtained from a study of people’s perceptions on
whether the color of a person’s clothing is related to how intelligent the person looks.
The subjects rated the person’s intelligence on a scale of 1 to 10. Group 1 subjects were
randomly shown people with clothing in shades of blue and gray. Group 2 subjects were
randomly shown people with clothing in shades of brown and yellow. Group 3 subjects
were randomly shown people with clothing in shades of pink and orange. The results
follow.
Group 1 Group 2 Group 3
8 7 4
7 8 9
7 7 6
7 7 7
8 5 9
8 8 8
6 5 5
8 8 8
8 7 7
7 6 5
7 6 4
8 6 5
8 6 4

UNIT SIX
LOGISTIC REGRESSION
OVERVIEW
In this unit, logistic regression will be considered. Logistic regression is a statistical
method for analyzing a dataset in which there are one or more independent variables that
determine an outcome. The outcome is measured with a dichotomous variable (in which
there are only two possible outcomes). We will discuss the applications of logistic
regression using R software.

CONTENT
Session 6-1. Logistic regression

REQUIRED READINGS

1. Bluman, A. G. (2009). Elementary statistics: A step-by-step approach. New
York, NY: McGraw-Hill Higher Education.
2. Ugarte, M. D., Militino, A. F., & Arnholt, A. T. (2008). Probability and
Statistics with R. CRC Press.
3. Kerns, G. J. (2010). Introduction to probability and statistics using R. Lulu.com.
4. Schmuller, J. (2017). Statistical Analysis with R for Dummies. John Wiley &
Sons.
5. Crawley, M. J. (2005). Statistics: An introduction using R. Wiley.

LEARNING OUTCOMES
By the completion of this unit, the students should be able to;
1. Perform the logistic regression using R


SESSION/EXAMPLES/ACTIVITIES

This session presents video activities, session examples, and other activities. Video
Activities include YouTube videos for performing logistic regression models in R
software. Session Examples include solved questions on logistic regression with R
software; they illustrate the R code for finding the estimates of the coefficients of the
independent variables and their hypothesis tests. Each example is followed by one or
more activities. Please note that these ACTIVITIES are to be solved and submitted for
grading. The deadline for submission is one week after each lecture.
Description
‘glm’ is used to fit generalized linear models, specified by
giving a symbolic description of the linear predictor and a
description of the error distribution.

Rcode
glm(formula, family = gaussian, data, weights, subset, na.action, start = NULL, etastart,
mustart, offset, control = list(...), model = TRUE, method = "glm.fit", x = FALSE, y =
TRUE, singular.ok = TRUE, contrasts = NULL, ...)

Video Activity
1. https://www.youtube.com/watch?v=C4N3_XJJ-jU

SESSION 6.1
LOGISTIC REGRESSION
• glm(y~x+z, data, family = binomial("logit")).
Note.
• The dependent variable must be categorical.
• The independent variable can either be categorical or continuous.

Example 6.1.1
Example of a logistic regression table, where the dependent variable is a discrete
dichotomous variable with 1s and 0s. The independent variable can be discrete or continuous.

Y x1 x2 x3
1 1 0 15
0 0 0 14
0 0 1 18
1 0 1 9
0 1 1 10
0 1 1 11

Solution

R-Code for the logistic regression table

>y=c(1,0,0,1,0,0)
>x1=c(1,0,0,0,1,1)
>x2=c(0,0,1,1,1,1)
>x3=c(15,14,18,9,10,11)
>a=glm(y~x1+x2+x3, family=binomial())
>summary(a)


Alternative way of entering the data.


•Enter the data in Excel.
•Save the data. (Example, ‘data5’).
•Save the data as a comma-delimited file(csv) in your document.

R-Code

>logis=read.csv("data5.csv")
>a=glm(y~x1+x2+x3, family=binomial(), data=logis)
>summary(a)


OUTPUT

Call:
glm(formula = y ~ x1 + x2 + x3, family = binomial())
Deviance Residuals:
       1        2        3        4        5        6
 1.17741 -1.17741  0.00000  1.17741 -1.17741 -0.00016

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 254.31 74727.90 0.003 0.997
x1 18.17 5337.71 0.003 0.997
x2 -90.83 26688.54 -0.003 0.997
x3 -18.17 5337.71 -0.003 0.997

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 7.6382 on 5 degrees of freedom


Residual deviance: 5.5452 on 2 degrees of freedom
AIC: 13.545

Number of Fisher Scoring iterations: 20


The fitted model, on the log-odds scale, is
log(p/(1 − p)) = 254.31 + 18.17x1 − 90.83x2 − 18.17x3, where p = P(y = 1).
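
The coefficients of a logistic regression describe the log-odds of y = 1, so a fitted value has to be passed through the inverse logit, p = 1/(1 + e^(−η)), to become a probability. The sketch below shows both scales for the data of Example 6.1.1. With only six observations R may warn that some fitted probabilities are numerically 0 or 1; the call is wrapped in suppressWarnings() for that reason, which is a choice made for this illustration. The names eta and p_hat are illustrative.

```r
# Data from Example 6.1.1
y  <- c(1, 0, 0, 1, 0, 0)
x1 <- c(1, 0, 0, 0, 1, 1)
x2 <- c(0, 0, 1, 1, 1, 1)
x3 <- c(15, 14, 18, 9, 10, 11)

# Fit the logistic model (warnings about extreme fitted probabilities are silenced)
a <- suppressWarnings(glm(y ~ x1 + x2 + x3, family = binomial()))

# Linear predictor (log-odds) and fitted probabilities for the six observations
eta   <- predict(a, type = "link")
p_hat <- predict(a, type = "response")

# plogis() is the inverse logit, so it maps the log-odds to the same probabilities
all.equal(unname(p_hat), unname(plogis(eta)))

round(p_hat, 3)
```

The very large standard errors in the coefficient table above suggest that, with so few observations, the individual estimates are not reliable; the example is best read as a demonstration of the syntax.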

Activity 6.1
The discrete dichotomous dependent variable in a study takes the values 1 and 0. The
independent variables x1 and x2 are discrete and a third, x3, is continuous. Use the logistic
model to find the effect of the independent variables.
Y x1 x2 x3
1 1 0 152
0 0 1 142
0 0 1 183
0 0 1 95
0 1 1 108
1 0 0 157
1 1 0 150
0 0 0 14
0 0 1 158
1 0 1 99
1 0 0 108
0 1 1 111

END-OF-COURSE EVALUATION
Please visit the End of Course evaluation folder to pick the evaluation questionnaire.
Answer the questions and submit your response to the facilitator. This evaluation is very
critical to the betterment of the course in subsequent sessions.

The final examination, which counts towards 70% of your final grade, will include all
topical issues discussed during this course. Please review all examples, activities, and
individual assignments, as preparation for your final exam. The programme’s
examination officer will communicate the final examination dates and venue to students
sometime later.
Thanks for your
