Stat 362 Study Guide

KWAME NKRUMAH UNIVERSITY OF SCIENCE AND TECHNOLOGY,
KUMASI.
INSTITUTE OF DISTANCE LEARNING
BSc Statistics 2
STAT 362
Statistical Computing & Data Analysis II
2 Credits
STUDY GUIDE
Emmanuel Harris
Department of Statistics and Actuarial Science
[STAT 362 Statistical Computing & Data Analysis II]
Publisher’s Information
© IDL, 2017
All rights reserved. No part of this study guide may be reproduced or utilized in any
form or by any means, electronic or mechanical, including photocopying, recording or
by any information storage and retrieval system, without the permission from the
copyright holders.
For any information contact:
Director
Institute of Distance Learning
New Library Building
Kwame Nkrumah University of Science and Technology
Kumasi, Ghana
Phone: +233-32-2060013
+233-32-2061287
+233-32-2060023
Fax: +233-32-2060014
E-mail: emmaharris2002@yahoo.com
Web: www.idl.knust.edu.gh
www.knust.edu.gh
INTRODUCTION
Welcome to STAT362 Statistical Computing & Data Analysis II. My name is
Emmanuel Harris. I am your facilitator in this course. In addition to welcoming you to
the course, I would like to give you some useful information about Statistical Computing
& Data Analysis and offer you a few hints for successful completion of this course.
Statistical modeling and data analysis techniques can be difficult to grasp and apply,
and computer software is often needed to analyze large data sets and obtain useful
results. R is recognized as one of the most powerful and flexible statistical software
packages. It is free, and it enables the user to apply a wide range of statistical
methods, from simple regression to time series and multivariate analysis.
This course shows students how to analyze large data sets in R and obtain useful
results.
To complete this course successfully, you need a computer with R installed.
COURSE OVERVIEW
The course is organized into six units:

Unit 1: Hypothesis testing (one-sample t-test, two-sample independent t-test,
paired-sample t-test)
Unit 2: Regression
Unit 3: Non-Parametric Tests
Unit 4: Chi-Square
Unit 5: ANOVA
Unit 6: Logistic Regression
COURSE OBJECTIVE(S)
On completion of the course, you should be able to:
1. Perform hypothesis tests (t-tests) in R.
2. Perform nonparametric tests in R.
3. Perform the chi-square goodness-of-fit test and test of homogeneity in R.
4. Perform one-way, two-way, and multiple ANOVA in R.
5. Perform logistic regression analysis in R.
COURSE OUTLINE
• Unit 1: Hypothesis testing
• Unit 2: Regression
• Unit 3: Non-parametric Test
• Unit 4: Chi-Square
• Unit 5: ANOVA
• Unit 6: Logistic Regression
REQUIRED TEXTBOOKS/READINGS
1. Bluman, A. G. (2009). Elementary statistics: A step-by-step approach. New York,
NY: McGraw-Hill Higher Education.
2. Ugarte, M. D., Militino, A. F., & Arnholt, A. T. (2008). Probability and Statistics
with R. CRC Press.
3. Kerns, G. J. (2010). Introduction to Probability and Statistics Using R. Lulu.com.
4. Schmuller, J. (2017). Statistical Analysis with R for Dummies. John Wiley & Sons.
5. Crawley, M. J. (2005). Statistics: An Introduction Using R. Wiley.
GRADING
Continuous assessment: 30%
End of semester Examinations: 70%
Total: 100%
ASSIGNMENT SCHEDULE
All assignments are due before the end of the specified day of delivery (GMT 23:59).
All assignments are to be uploaded to the hand-in folder for this course unless other
instructions are given. If you are unable to hand in your assignment on the LMS of IDL
KNUST (vclass), you may email it to the course facilitator (on the said day of delivery).
Failure to deliver assignments on the specified date will attract penalties in the form of a
reduced grade.
Assignment  Description                    Type                Deadline    Value
(each unit  (activities title)             (individual/group)  (duration)  (of final grade)
has activities)
1           Hypothesis testing activities  Individual          1 week      10%
2           Regression                     Individual          1 week      10%
3           Non-Parametric Tests
4           Chi-Square                     Individual          1 week      10%
5           ANOVA                          Individual          1 week      10%
6           Logistic Regression
* Participation in online discussions may account for 15% of the final grade.
* Deadlines may be set on a weekly basis.
UNIT ONE
HYPOTHESIS TESTING (t-test)
OVERVIEW
The t-test is one of the most common tests in statistics. It is used to determine whether
the means of two groups are equal (two-sample t-test) or whether the mean of a population
equals a hypothesized value (one-sample t-test). The test assumes that the data are
sampled from normal distributions; the pooled two-sample version also assumes that the
two groups have equal variances.
CONTENT
Session 1.1 One-sample independent t-test
Session 1.2 Two-sample independent t-test
Session 1.3 Dependent/Paired t-test
REQUIRED READINGS
2. Ugarte, M. D., Militino, A. F., & Arnholt, A. T. (2008). Probability and Statistics
with R. CRC Press.
3. Kerns, G. J. (2010). Introduction to Probability and Statistics Using R. Lulu.com.
LEARNING OUTCOMES
By the end of this unit, you should be able to:
1. Test the difference between two means for independent samples, using the t-test in
R.
2. Test the difference between two means for dependent samples, using the t-test in R.
3. Test a claim about a hypothesized mean of one sample, using the t-test in R.
SESSION/ EXAMPLES/ ACTIVITIES
This session presents video activities, session examples, and other activities. The video
activities are YouTube videos on performing t-tests in R. The session examples are solved
questions on each type of t-test, with the R code used. Each example is followed by one
or more activities. These ACTIVITIES are to be solved and submitted for grading. The
deadline for submission is one week after each lecture.
Video Activity
1. https://youtu.be/kvmSAXhX9Hs (One-sample t-Test)
2. https://youtu.be/RlhnNbPZC0A ( Two-sample independent t-Test)
3. https://youtu.be/yD6aU0fY2lo (Two-sample dependent t-Test)
The format for a t.test with R is given as

t.test(x, y = NULL, alternative = c("two.sided", "less", "greater"),
mu = 0, paired = FALSE, var.equal = FALSE,
conf.level = 0.95, ...)
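The value returned by t.test() is a list, so individual results can be extracted by name
instead of being read off the printed output. The sketch below uses made-up data (the
vector x and the value mu = 12 are purely illustrative) and the components statistic,
parameter, p.value, and conf.int of the returned object.

```r
# Illustrative data only; not taken from the examples in this guide
x <- c(12.1, 11.8, 12.6, 12.3, 11.9, 12.4)

# One-sample, two-sided test of H0: mu = 12 at the 95% confidence level
res <- t.test(x, mu = 12, alternative = "two.sided", conf.level = 0.95)

res$statistic   # t statistic
res$parameter   # degrees of freedom
res$p.value     # p-value
res$conf.int    # confidence interval for the mean

# Decision rule: reject H0 when the p-value is below alpha
alpha <- 0.05
reject_h0 <- res$p.value < alpha
reject_h0
```

The same components are available for the two-sample and paired tests in the sessions
that follow.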
SESSION 1.1
ONE SAMPLE INDEPENDENT T-TEST
Example 1.1.1
A researcher estimated that the average height of story buildings on the KNUST campus
is 700 feet. A random sample of 9 story buildings is selected and the heights in feet are
shown below:
485, 511, 841, 725, 615, 520, 535, 635, 616
At α = 0.05, is there enough evidence to reject this claim?
Solution
Hypothesis:
H0: μ = 700
H1: μ ≠ 700
R codes
> x=c(485,511,841,725,615,520,535,635,616)
> t.test(x,mu=700,conf.level = 0.95, alternative = "two.sided")
OUTPUT
One Sample t-test
data: x
t = -2.3612, df = 8, p-value = 0.04587
alternative hypothesis: true mean is not equal to 700
95 percent confidence interval:
520.5678 697.8767
sample estimates:
mean of x
609.2222
Conclusion: Since the p-value (0.04587) is less than α = 0.05, we reject H0 and conclude
that there is enough evidence to reject the researcher’s claim that the average height of
story buildings on the KNUST campus is 700 feet.
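To see where the numbers in the output come from, the t statistic and p-value of Example
1.1.1 can be computed directly from the formula t = (x̄ − μ0) / (s / √n). This is only a
sketch for checking the output; the names mu0, t_stat, and p_value are illustrative.

```r
x <- c(485, 511, 841, 725, 615, 520, 535, 635, 616)
mu0 <- 700                      # hypothesized mean
n <- length(x)

# t statistic: distance of the sample mean from mu0 in standard-error units
t_stat <- (mean(x) - mu0) / (sd(x) / sqrt(n))

# Two-sided p-value from the t distribution with n - 1 degrees of freedom
p_value <- 2 * pt(-abs(t_stat), df = n - 1)

t_stat
p_value
```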
Example 1.1.2
A state executive claims that the average number of acres in Western Region parks is
less than 2000 acres. A random sample of five parks is selected, and the number of acres
is shown. At α = 0.05, is there enough evidence to support the claim?
959 1187 493 6249 541
Solution
Hypothesis:
H0: μ = 2000
H1: μ < 2000 (claim)
R codes
> x=c(959, 1187, 493, 6249, 541)
> t.test(x,mu=2000, conf.level = 0.95, alternative = "less")
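OUTPUT
One Sample t-test
data: x
t = -0.10396, df = 4, p-value = 0.4611
alternative hypothesis: true mean is less than 2000
95 percent confidence interval:
-Inf 4227.591
sample estimates:
mean of x
1885.8
Conclusion: Since the p-value (0.4611) is greater than α = 0.05, we fail to reject H0.
There is not enough evidence to support the claim that the average number of acres is
less than 2000.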
Activity 1.1
The average 1-ounce chocolate chip cookie contains 110 calories. A random sample of
15 different brands of 1-ounce chocolate chip cookies resulted in the following calorie
amounts. At the α = 0.01 level, is there sufficient evidence that the average calorie
content is greater than 110 calories?
100, 125, 150, 160, 185, 125, 155, 145, 160, 100, 150, 140, 135, 120, 110
SESSION 1.2
TWO SAMPLE INDEPENDENT T-TEST
Example 1.2.1 (Assuming Equal Variances)

The number of grams of carbohydrates contained in 1-ounce servings of randomly selected
chocolate and nonchocolate candy is listed here. Is there sufficient evidence to conclude
that the difference in the means is statistically significant? Use α = 0.01.
Chocolate: 29 25 17 36 41 25 32 29 38 34 24 27 29
Nonchocolate: 41 41 37 29 30 38 39 10 29 55 29
Solution
Hypothesis:
H0: there is no significant difference in the mean carbohydrate content of chocolate and
nonchocolate candy.
H1: there is a significant difference in the mean carbohydrate content of chocolate and
nonchocolate candy.
R codes
> x = c(29,25,17,36,41,25,32,29,38,34,24,27,29)
> y = c(41,41,37,29,30,38,39,10,29,55,29)
> t.test(x,y, conf.level = 0.99, var.equal = TRUE)
OUTPUT
Two Sample t-test
data: x and y
t = -1.2744, df = 22, p-value = 0.2158
alternative hypothesis: true difference in means is not equal to 0
99 percent confidence interval:
-15.003753 5.661096
sample estimates:
mean of x mean of y
29.69231 34.36364
Conclusion: Since the p-value (0.2158) is greater than α = 0.01, we fail to reject H0 and
conclude that there is no significant difference in the mean carbohydrate content of
chocolate and nonchocolate candy.
Example 1.2.2 (Assuming Unequal Variances)
The number of grams of carbohydrates contained in 1-ounce servings of randomly selected
chocolate and nonchocolate candy is listed here. Is there sufficient evidence to conclude
that the difference in the means is statistically significant? Use α = 0.01.
Chocolate: 29 25 17 36 41 25 32 29 38 34 24 27 29
Nonchocolate: 41 41 37 29 30 38 39 10 29 55 29
Solution
Hypothesis:
H0: there is no significant difference in the mean carbohydrate content of chocolate and
nonchocolate candy.
H1: there is a significant difference in the mean carbohydrate content of chocolate and
nonchocolate candy.
R codes
> x = c(29,25,17,36,41,25,32,29,38,34,24,27,29)
> y = c(41,41,37,29,30,38,39,10,29,55,29)
> t.test(x,y, conf.level = 0.99, var.equal = FALSE)
OUTPUT
Welch Two Sample t-test
data: x and y
t = -1.2203, df = 15.463, p-value = 0.2406
alternative hypothesis: true difference in means is not equal to 0
99 percent confidence interval:
-15.903601 6.560943
sample estimates:
mean of x mean of y
29.69231 34.36364
Conclusion: Since the p-value (0.2406) is greater than α = 0.01, we fail to reject H0 and
conclude that there is no significant difference in the mean carbohydrate content of
chocolate and nonchocolate candy.
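Examples 1.2.1 and 1.2.2 differ only in the var.equal argument. The sketch below runs
both versions on the same data and places the key results side by side, which makes it
easy to see that the Welch test uses fewer (fractional) degrees of freedom. The object
names pooled, welch, and comparison are illustrative.

```r
x <- c(29, 25, 17, 36, 41, 25, 32, 29, 38, 34, 24, 27, 29)  # chocolate
y <- c(41, 41, 37, 29, 30, 38, 39, 10, 29, 55, 29)          # nonchocolate

pooled <- t.test(x, y, conf.level = 0.99, var.equal = TRUE)   # equal variances
welch  <- t.test(x, y, conf.level = 0.99, var.equal = FALSE)  # Welch correction

# Collect the key numbers from both tests in one small table
comparison <- data.frame(
  test    = c("pooled", "Welch"),
  t       = unname(c(pooled$statistic, welch$statistic)),
  df      = unname(c(pooled$parameter, welch$parameter)),
  p_value = c(pooled$p.value, welch$p.value)
)
comparison
```

Here both tests lead to the same decision, but that need not be the case when the sample
variances differ greatly.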
Activity 1.2
Upright vacuum cleaners have either a hard body type or a soft body type. Shown in the
table below are the weights in pounds of a random sample of each type. At α = 0.05, can
it be concluded that the weights are different?
Hard body 21 17 17 20 16 17 15 20 23
Soft body 24 13 11 13 12 15 12 16
SESSION 1.3
DEPENDENT/PAIRED T-TEST
Example 1.3.1
A dietician wishes to see if a person’s cholesterol level will change if the diet is
supplemented by a certain mineral. Six subjects were pretested, and then they took the
mineral supplement for a 6-week period. The results are shown in the table below:
Subject 1 2 3 4 5 6
Before 210 235 208 190 172 244
After 190 170 210 188 173 228
Can it be concluded that the cholesterol level has been changed at α = 0.10? Assume the
variable is approximately normal.
Solution
Hypothesis:
H0: There is no difference in the mean cholesterol level.
H1: There is a significant difference in the mean cholesterol level.
R codes:
> b = c(210,235,208,190,172,244)
> a = c(190,170,210,188,173,228)
> t.test(b,a, conf.level = 0.90, alternative = "two.sided", paired = TRUE)
OUTPUT
Paired t-test
data: b and a
t = 1.6079, df = 5, p-value = 0.1688
alternative hypothesis: true difference in means is not equal to 0
90 percent confidence interval:
-4.22040 37.55373
sample estimates:
mean of the differences
16.66667
Conclusion: Since the p-value (0.1688) is greater than α = 0.10, we fail to reject H0 and
conclude that there is no significant difference in the mean cholesterol level.
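A paired t-test is the same as a one-sample t-test applied to the differences between the
paired values. The sketch below checks this on the data of Example 1.3.1; the names
paired_res and diff_res are illustrative.

```r
b <- c(210, 235, 208, 190, 172, 244)  # before
a <- c(190, 170, 210, 188, 173, 228)  # after

# Paired test on the two vectors
paired_res <- t.test(b, a, paired = TRUE, conf.level = 0.90)

# One-sample test on the differences, with H0: mean difference = 0
diff_res <- t.test(b - a, mu = 0, conf.level = 0.90)

paired_res$statistic
diff_res$statistic
all.equal(unname(paired_res$statistic), unname(diff_res$statistic))
```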
Example 1.3.2
A reporter hypothesizes that the average assessed values of land in a large city have
changed during a 5-year period. A random sample of wards is selected, and the data (in
millions of Ghana cedis) are shown. At α = 0.01, can it be concluded that the average
taxable assessed values have changed? Use the P-value method.
Year  Kumasi  Accra  Tema   Koforidua  Cape Coast
2007  344.4   207.0  169.0  1711.5     861.8
2006  1262.0  960.0  529.0  1969.0     1405.0
Solution
Hypothesis:
H0: There is no difference between the average assessed values of land in 2007 and 2006.
H1: There is a difference between the average assessed values of land in 2007 and 2006.
R codes:
> x = c(344.4,207.0,169.0,1711.5,861.8)
> y = c(1262.0,960.0,529.0,1969.0,1405.0)
> t.test(x,y,conf.level=0.99,alternative="two.sided",paired=TRUE)
OUTPUT
Paired t-test
data: x and y
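t = -4.649, df = 4, p-value = 0.009669
alternative hypothesis: true difference in means is not equal to 0
99 percent confidence interval:
-1127.052469 -5.467531
sample estimates:
mean of the differences
-566.26
Conclusion: Since the p-value (0.009669) is less than α = 0.01, we reject H0 and conclude
that the average assessed values of land have changed.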
Activity 1.3
A medical researcher wishes to see if he can lower the cholesterol levels through diet in
six people by showing a film about the effects of high cholesterol levels. The data is
shown below.
Patient 1 2 3 4 5 6
Before 243 216 214 222 206 219
After 215 202 198 195 204 213
At α = 0.05, did the cholesterol level decrease on average?
UNIT TWO
REGRESSION
OVERVIEW
This unit is divided into three sessions. Session 2.1 considers simple linear regression
models in R. R has a built-in function, lm(), for fitting linear regression models.
Session 2.2 deals with multiple linear regression in R, and Session 2.3 introduces
exponential regression models with R code.
CONTENT
Session 2.1 Linear regression model
Session 2.2 Multiple regression model
Session 2.3 Exponential regression
REQUIRED READINGS
2. Ugarte, M. D., Militino, A. F., & Arnholt, A. T. (2008). Probability and Statistics
with R. CRC Press.
LEARNING OUTCOMES
1. Estimate the relationship between dependent and independent variables using linear
regression in R
2. Use the lm() command function in R to perform least-squares regressions
3. Perform quadratic, cubic, and quartic regression analysis in R
SESSION/ EXAMPLES/ACTIVITIES
The video activities are YouTube videos on simple linear regression in R. The session
examples are worked examples that illustrate the R code for each type of regression
considered. The examples are followed by activities. These ACTIVITIES are to be solved
and submitted for grading. The deadline for submission is one week after each lecture.
Video Activity
1. https://www.youtube.com/watch?v=66z_MRwtFJM (linear regression)
2. https://www.youtube.com/watch?v=u1cc1r_Y7M0 (multiple regression)
3. https://www.youtube.com/watch?v=hokALdIst8k (exponential regression)
SESSION 2.1
LINEAR REGRESSION MODEL
‘lm’ is used to fit linear models. It can be used to carry out regression, single stratum
analysis of variance, and analysis of covariance (although ‘aov’ may provide a more
convenient interface for these).
lm(formula, data, subset, weights, na.action, method = "qr", model = TRUE, x =
FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE, contrasts = NULL,
offset, ...)
Linear Regression
R-code for linear regression
>x=c(numeric values)
>y=c(numeric values)
>a=lm(y~x)
>summary(a)
Example 2.1.1: Linear Model R

>x=c(3.9,2.1,6.4,5.7,4.7,2.8,3.4,7.5,3.0,4.5)
>y=c(6.5,7.8,5.2,8.2,9.2,8.9,7.3,9.8,5.6,4.6)
>a=lm(y~x)
>summary(a)
Results
OUTPUT
Call:
lm(formula = y ~ x)
Residuals:
Min 1Q Median 3Q Max
-2.7255 -1.3034 0.4168 1.5894 2.0108
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.6299 1.7085 3.881 0.00467 **
x 0.1546 0.3642 0.424 0.68244
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.873 on 8 degrees of freedom
Multiple R-squared: 0.02202, Adjusted R-squared: -0.1002
F-statistic: 0.1801 on 1 and 8 DF, p-value: 0.6824
MODEL: y = 6.6299 + 0.1546x
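The fitted model object holds more than the printed summary. As a minimal sketch, the
code below extracts the coefficients and R-squared from the model of Example 2.1.1 and
uses predict() to estimate y at a new x value. The value x = 5 is an arbitrary choice for
illustration.

```r
x <- c(3.9, 2.1, 6.4, 5.7, 4.7, 2.8, 3.4, 7.5, 3.0, 4.5)
y <- c(6.5, 7.8, 5.2, 8.2, 9.2, 8.9, 7.3, 9.8, 5.6, 4.6)
a <- lm(y ~ x)

coef(a)                 # intercept and slope of the fitted line
summary(a)$r.squared    # proportion of the variation in y explained by x

# Predicted y at x = 5; the column name must match the predictor name
new_x <- data.frame(x = 5)
predict(a, newdata = new_x)
```

The very small R-squared and the large p-value for x indicate that a straight line
explains little of the variation in this particular data set.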
Example 2.1.2: Quadratic Linear Regression
>x=c(3.9,2.1,6.4,5.7,4.7,2.8,3.4,7.5,3.0,4.5)
>y=c(6.5,7.8,5.2,8.2,9.2,8.9,7.3,9.8,5.6,4.6)
>b=lm(y~x+I(x^2))
>summary(b)
OUTPUT
Call:
lm(formula = y ~ x + I(x^2))
Residuals:
Min 1Q Median 3Q Max
-2.38024 -1.26603 0.08659 1.09552 2.57011
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 11.8590 4.8338 2.453 0.0439 *
x -2.3402 2.1926 -1.067 0.3213
I(x^2) 0.2612 0.2265 1.153 0.2867
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
MODEL: y = 11.859 − 2.3402x + 0.2612x^2
Example 2.1.3: Cubic Regression
>x=c(3.9,2.1,6.4,5.7,4.7,2.8,3.4,7.5,3.0,4.5)
>y=c(6.5,7.8,5.2,8.2,9.2,8.9,7.3,9.8,5.6,4.6)
>c=lm(y~x+I(x^2)+I(x^3))
>summary(c)
OUTPUT
Call:
lm(formula = y ~ x + I(x^2) + I(x^3))
Residuals:
Min 1Q Median 3Q Max
-2.0692 -1.4149 0.1101 1.2082 2.5999
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.78399 14.68099 0.462 0.660
x 1.44018 10.50169 0.137 0.895
I(x^2) -0.59553 2.33259 -0.255 0.807
I(x^3) 0.05974 0.16178 0.369 0.725
MODEL: y = 6.78399 + 1.44018x − 0.59553x^2 + 0.05974x^3
Example 2.1.4: Quartic Regression
>x=c(3.9,2.1,6.4,5.7,4.7,2.8,3.4,7.5,3.0,4.5)
>y=c(6.5,7.8,5.2,8.2,9.2,8.9,7.3,9.8,5.6,4.6)
>d=lm(y~x+I(x^2)+I(x^3)+I(x^4))
>summary(d)
OUTPUT
Call:
lm(formula = y ~ x + I(x^2) + I(x^3) + I(x^4))
Residuals:
1 2 3 4 5 6
-0.5671 -0.4279 -1.3427 1.5557 2.0758 1.9301
7 8 9 10
0.3901 0.2268 -1.2881 -2.5526
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 36.57466 46.70585 0.783 0.469
x -28.93120 46.28444 -0.625 0.559
I(x^2) 10.22251 16.19846 0.631 0.556
I(x^3) -1.54560 2.38226 -0.649 0.545
I(x^4) 0.08439 0.12491 0.676 0.529
MODEL: y = 36.57466 − 28.93120x + 10.22251x^2 − 1.54560x^3 + 0.08439x^4
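The polynomial fits in this session write out each power with I(). A more compact way to
specify the same models, sketched below, is poly(x, degree, raw = TRUE). The code also
collects the adjusted R-squared for degrees 1 to 4 so the fits can be compared. The names
quad_I, quad_poly, and adj_r2 are illustrative.

```r
x <- c(3.9, 2.1, 6.4, 5.7, 4.7, 2.8, 3.4, 7.5, 3.0, 4.5)
y <- c(6.5, 7.8, 5.2, 8.2, 9.2, 8.9, 7.3, 9.8, 5.6, 4.6)

# Two ways of writing the quadratic model; the coefficients agree
quad_I    <- lm(y ~ x + I(x^2))
quad_poly <- lm(y ~ poly(x, 2, raw = TRUE))
all.equal(unname(coef(quad_I)), unname(coef(quad_poly)))

# Adjusted R-squared for polynomial degrees 1 to 4
adj_r2 <- sapply(1:4, function(d) {
  summary(lm(y ~ poly(x, d, raw = TRUE)))$adj.r.squared
})
names(adj_r2) <- paste0("degree_", 1:4)
adj_r2
```

Adjusted R-squared penalizes extra terms, so it is a fairer basis for comparing models of
different degree than the ordinary R-squared.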
SESSION 2.2
MULTIPLE REGRESSION MODEL
A regression with two or more explanatory variables is called a multiple regression.
Rather than modeling the mean as a straight line in linear regression, it is now modeled
as a function of several explanatory variables.
Multiple Regression R code format

>explanatory1=c(numeric values)
>explanatory2=c( numeric values)
>response=c(numeric values)
>s=lm(response~explanatory1+explanatory2)
>summary(s)
Example 2.2: R Code for Multiple Linear Regression
> x1=c(91,90,88,87,91,94,87,86)
> x2=c(25,21,24,25,25,26,25,25)
> y=c(240,236,270,274,301,316,300,296)
> a=lm(y~x1+x2)
> summary(a)
OUTPUT
Call:
lm(formula = y ~ x1 + x2)
Residuals:
1 2 3 4 5 6 7 8
-45.718 4.871 -2.461 -12.285 15.282 17.024 13.715 9.573
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -43.4558 331.1189 -0.131 0.9007
x1 -0.1417 3.4741 -0.041 0.9690
x2 13.6828 6.2329 2.195 0.0796 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Multiple R-squared: 0.4927, Adjusted R-squared: 0.2897
MODEL: y = −43.4558 − 0.1417x1 + 13.6828x2
SESSION 2.2.1
R CODE SYNTAX WHEN THERE IS INTERACTION AMONG THE INDEPENDENT VARIABLES
> x1=c(numeric values)
> x2=c(numeric values)
> y=c(numeric values)
> b=lm(y~x1+x2+x1*x2)
> summary(b)
Example 2.2.1
> x1=c(91,90,88,87,91,94,87,86)
> x2=c(25,21,24,25,25,26,25,25)
>y=c(240,236,270,274,301,316,300,296)
>b=lm(y~x1+x2+x1*x2)
> summary(b)
OUTPUT
Call:
lm(formula = y ~ x1 + x2 + x1 * x2)
Residuals:
1 2 3 4 5 6 7 8
-36.003 5.462 -14.565 -10.977 24.997 7.283 15.023 8.780
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 15661.115 19353.552 0.809 0.464
x1 -174.234 214.539 -0.812 0.462
x2 -607.238 765.100 -0.794 0.472
x1:x2 6.880 8.477 0.812 0.463
MODEL: y = 15661.115 − 174.234x1 − 607.238x2 + 6.880x1x2
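In an R formula, x1*x2 already stands for both main effects plus their interaction, which
is why the output labels the interaction term x1:x2. The sketch below shows that the
formula used in Example 2.2.1 and two shorter spellings give the same coefficients. The
object names full_form, short_form, and explicit are illustrative.

```r
x1 <- c(91, 90, 88, 87, 91, 94, 87, 86)
x2 <- c(25, 21, 24, 25, 25, 26, 25, 25)
y  <- c(240, 236, 270, 274, 301, 316, 300, 296)

full_form  <- lm(y ~ x1 + x2 + x1 * x2)   # as written in the example
short_form <- lm(y ~ x1 * x2)             # shorthand for the same terms
explicit   <- lm(y ~ x1 + x2 + x1:x2)     # interaction term named directly

all.equal(coef(full_form), coef(short_form))
all.equal(coef(full_form), coef(explicit))
```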
SESSION 2.3
EXPONENTIAL REGRESSION
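Given a table of data, you may want to find the law that governs it, for example a law of
the form T = ab^t. Taking logarithms of both sides gives log(T) = log(a) + t·log(b), which
is linear in t and can therefore be fitted with lm().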
Example 2.3.1
The data below show the cooling temperatures of a freshly brewed cup of coffee after it
is poured from the brewing pot into a serving cup. The brewing pot temperature is
approximately 180°F. Find the law in the form T = ab^t.
Time (t) 0 5 8 11 15 18 22 25 30 34 38 42 45 50
Temp 179.5 168.7 158.1 149.2 141.7 134.6 125.4 123.5 116.3 113.2 109.1 105.7 102.2 100.5
R-Code
>time=c(0,5,8,11,15,18,22,25,30,34,38,42,45,50)
>temp=c(179.5,168.7,158.1,149.2,141.7,134.6,125.4,123.5,116.3,
113.2,109.1,105.7,102.2,100.5)
>a=lm(log(temp)~time)
>summary(a)
OUTPUT
Call:
lm(formula = log(temp) ~ time)
Residuals:
Min 1Q Median 3Q Max
-0.052753 -0.025261 -0.005929 0.014306 0.056930
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.1443601 0.0172782 297.74 < 2e-16 ***
time -0.0118227 0.0005988 -19.74 1.62e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
F-statistic: 389.8 on 1 and 12 DF, p-value: 1.621e-10
Model: log(temp) = 5.144 − 0.0118time
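The fitted model is on the log scale, while the question asks for a law of the form
T = ab^t. Since log(T) = log(a) + t·log(b), exponentiating the intercept gives a and
exponentiating the slope gives b. The sketch below does this for Example 2.3.1; with the
coefficients in the output it gives roughly a ≈ 171.5 and b ≈ 0.988. The names a_hat,
b_hat, and fitted_temp are illustrative.

```r
time <- c(0, 5, 8, 11, 15, 18, 22, 25, 30, 34, 38, 42, 45, 50)
temp <- c(179.5, 168.7, 158.1, 149.2, 141.7, 134.6, 125.4,
          123.5, 116.3, 113.2, 109.1, 105.7, 102.2, 100.5)
fit <- lm(log(temp) ~ time)

# log(T) = log(a) + t * log(b), so exponentiate both coefficients
a_hat <- exp(coef(fit)[["(Intercept)"]])
b_hat <- exp(coef(fit)[["time"]])
c(a = a_hat, b = b_hat)

# Fitted temperatures on the original scale, for comparison with temp
fitted_temp <- a_hat * b_hat^time
round(fitted_temp, 1)
```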
Example 2.3.2
Given the periods and mean distances of some of the planets, find a law in the form
P = ks^n.
Period, P (days) 87.97 224.7 365.3 687.0 4333.0 10760.0
Mean distance, s (millions of km) 58 108 150 228 778 1426
R-Code
>p=c(87.97,224.7,365.3,687.0,4333,10760)
>s=c(58,108,150,228,778,1426)
>a=lm(log(p)~log(s))
>summary(a)
OUTPUT
Call:
lm(formula = log(p) ~ log(s))
Residuals:
1 2 3 4 5 6
-9.492e-04 3.871e-03 -3.153e-03 1.154e-04 -9.961e-05 2.155e-04
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.6154480 0.0052541 -307.5 6.71e-10 ***
log(s) 1.5006720 0.0009335 1607.5 8.99e-13 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Multiple R-squared: 1, Adjusted R-squared: 1
F-statistic: 2.584e+06 on 1 and 4 DF, p-value: 8.986e-13
Model: log(p) = −1.615448 + 1.500672log(s)
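For a power law P = ks^n, taking logs gives log(P) = log(k) + n·log(s), so the slope of
the fitted line is n itself and the intercept must be exponentiated to obtain k. A minimal
sketch for Example 2.3.2 follows; the names k_hat and n_hat are illustrative.

```r
p <- c(87.97, 224.7, 365.3, 687.0, 4333, 10760)
s <- c(58, 108, 150, 228, 778, 1426)
fit <- lm(log(p) ~ log(s))

# log(P) = log(k) + n * log(s)
k_hat <- exp(coef(fit)[["(Intercept)"]])   # multiplicative constant k
n_hat <- coef(fit)[["log(s)"]]             # exponent n
c(k = k_hat, n = n_hat)

# Predicted period (days) at s = 150 million km, the third column of the table
k_hat * 150^n_hat
```

The exponent is very close to 1.5, that is, the period is proportional to the distance
raised to the power 3/2.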
Example 2.3.3
The widths of successive whorls of a shell of Turbo duplicatus have been measured.
Find the law in the form w = ab^n.
Position of whorl (n) 1 2 3 4 5 6 7 8
Width of whorl (w, cm) 3.33 2.84 2.39 2.03 1.70 1.45 1.22 1.04
R-Code
>n=c(1,2,3,4,5,6,7,8)
>w=c(3.33,2.84,2.39,2.03,1.70,1.45,1.22,1.04)
>r=lm(log(w)~n)
>summary(r)
OUTPUT
Call:
lm(formula = log(w) ~ n)
Residuals:
Min 1Q Median 3Q Max
-0.0065511 -0.0033214 0.0006323 0.0036527 0.0049239
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.3733474 0.0035109 391.2 1.88e-14 ***
n -0.1672336 0.0006953 -240.5 3.48e-13 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Multiple R-squared: 0.9999, Adjusted R-squared: 0.9999
F-statistic: 5.786e+04 on 1 and 6 DF, p-value: 3.484e-13
Model: log(w) = 1.3733474 − 0.1672336n
Activity 2.3.1
In a study on speed and braking distance, researchers looked for a method to estimate
how fast a person was traveling before an accident by measuring the length of the skid
marks. An area that was focused on in the study was the distance required to completely
stop a vehicle at various speeds. Use the following table to find the linear regression
equation.
MPH Braking distance (feet)
20 20
30 45
40 81
50 133
60 205
80 411
Activity 2.3.2
The nursing instructor wishes to see whether a student’s grade point average and age are
related to the student’s score on the state board nursing examination. She selects five
students and obtains the following data. Obtain the multiple linear regression equation
from the data.
Student GPA Age Score
A 3.2 22 550
B 2.7 27 570
C 2.5 24 525
D 3.4 28 670
E 2.2 23 490
Activity 2.3.3
If V = kD^r, find the values of k and r using the table below.
Diameter 4.4 4.6 5 5.1 5.1 5.2 5.2 5.5 5.5 5.6
Volume 2 2.2 3 4.3 3 2.9 3.5 3.4 5 7.2
UNIT THREE
NON-PARAMETRIC TEST
OVERVIEW
This unit introduces several nonparametric hypothesis tests: the Spearman and Kendall
rank correlation tests, the sign test, the Wilcoxon signed-rank and rank-sum tests, the
runs test for randomness, and the Kruskal-Wallis test. We will discuss how to apply these
tests using R.
CONTENT
Session 3.1 Spearman and Kendall Rank Correlation Test
Session 3.2 Sign Test
Session 3.3 Wilcoxon Sign and Sum Test
Session 3.4 Randomness Test (Categorical and Continuous)
Session 3.5 Kruskal-Wallis Test
REQUIRED READINGS
2. Ugarte, M. D., Militino, A. F., & Arnholt, A. T. (2008). Probability and Statistics
with R. CRC Press.
LEARNING OUTCOMES
1. Perform Spearman and Kendall Rank Correlation Tests using R software.
2. Perform the Sign test using R software.
3. Perform the Wilcoxon Sign and Sum Test using R software.
4. Perform Randomness tests for categorical and continuous cases using R software.
5. Perform the Kruskal-Wallis test using R software.
SESSION/EXAMPLES/ACTIVITIES
The video activities are YouTube videos on computing the Spearman and Kendall rank
correlation coefficients, the sign test, the Wilcoxon signed-rank and rank-sum tests,
tests of randomness, and the Kruskal-Wallis test in R. The session examples are solved
questions on each topic, with the R code for each technique. Each example is followed by
one or more activities. These ACTIVITIES are to be solved and submitted for grading. The
deadline for submission is one week after each lecture.
Video Activity
1. https://www.youtube.com/watch?v=F0lvYZmxib8 (Spearman and Kendall Rank Correlation)
2. https://www.youtube.com/watch?v=fH4S4aqfs9k (Sign Test)
3. https://www.youtube.com/watch?v=zM8OZUM5I4Y (Wilcoxon Signed-Rank Test)
4. https://www.youtube.com/watch?v=KroKhtCD9eE (Wilcoxon Rank-Sum Test)
5. https://www.youtube.com/watch?v=Y1qeAFAV5yQ (Kruskal-Wallis Test)
SESSION 3.1
SPEARMAN AND KENDALL RANK CORRELATION TEST
Test for association between paired samples, using one of Pearson's product moment
correlation coefficient, Kendall's tau or Spearman's rho.
cor.test(x, y,
alternative = c("two.sided", "less", "greater"),
method = c("pearson", "kendall", "spearman"),
exact = NULL, conf.level = 0.95, continuity = FALSE, ...)
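Spearman and Kendall Rank Correlation Tests:
• cor.test(x, y, method = "spearman") for the Spearman rho correlation
• cor.test(x, y, method = "kendall") for the Kendall tau correlation
R is case-sensitive: the function name is cor.test and the method names are written in
lowercase.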
Example 3.1.1: Spearman Correlation Coefficient

Two students were asked to rate eight different textbooks for a specific course on an
ascending scale from 0 to 20 points. Points were assigned for each of several categories,
such as reading level, use of illustrations, and use of color. At α = 0.05, calculate the
Spearman and Kendall rank correlation coefficients between the two students’ ratings.
The data are shown in the following table.
Textbook A B C D E F G H
Student 1 rating 4 10 18 20 12 2 5 9
Student 2 rating 4 6 20 14 16 8 11 7
Solution
H0: There is no correlation between the two students’ ratings.
HA: There is a correlation between the two students’ ratings.
Let a = student 1’s ratings and b = student 2’s ratings.
R-Code
>a=c(4,10,18,20,12,2,5,9)
>b=c(4,6,20,14,16,8,11,7)
>cor.test(a,b,method="spearman")
OUTPUT
Spearman's rank correlation rho
data: a and b
S = 30, p-value = 0.09618
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
0.6428571
Conclusion: Since the p-value (0.09618) is greater than α = 0.05, we fail to reject H0 and
conclude that there is no significant correlation between the two ratings.
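Spearman's rho is Pearson's correlation computed on the ranks of the data. When there are
no ties it can also be obtained from the rank differences d as
rho = 1 − 6·Σd² / (n(n² − 1)), and Σd² is the S statistic shown in the output. The sketch
below checks both routes for Example 3.1.1; the variable names are illustrative.

```r
a <- c(4, 10, 18, 20, 12, 2, 5, 9)
b <- c(4, 6, 20, 14, 16, 8, 11, 7)

# Route 1: Pearson correlation of the ranks
rho_from_ranks <- cor(rank(a), rank(b))

# Route 2: formula based on squared rank differences (valid without ties)
d <- rank(a) - rank(b)
n <- length(a)
S <- sum(d^2)                                  # S statistic in the output
rho_from_formula <- 1 - 6 * S / (n * (n^2 - 1))

c(S = S, rho_ranks = rho_from_ranks, rho_formula = rho_from_formula)
```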
Example 3.1.2: Kendall Rank Correlation

H0: There is no correlation between the two students’ ratings.
HA: There is a correlation between the two students’ ratings.
R-Code
>a=c(4,10,18,20,12,2,5,9)
>b=c(4,6,20,14,16,8,11,7)
>cor.test(a,b,method="kendall")
OUTPUT
Kendall's rank correlation tau

data: a and b
T = 20, p-value = 0.1789
alternative hypothesis: true tau is not equal to 0
sample estimates:
tau
0.4285714
Conclusion: Since the p-value (0.1789) is greater than α = 0.05, we fail to reject H0 and
conclude that there is no significant correlation between the two ratings.
Activity 3.1.1
As a biologist, you wish to see if there is a relationship between the heights of tall trees
and their diameters. You find the following data for the diameter (in inches) of the tree
at 4.5 feet from the ground and the corresponding heights (in feet).
Diameter (in.) Height (ft)
1024 261
950 321
451 219
505 281
761 159
644 83
707 191
586 141
442 232
546 108
Perform both tests and write a short statement comparing the results.
Activity 3.1.2
The data below show the number of books published in six different subject areas for the
years 1980 and 2004. Use α= 0.05 to see if there is a relationship between the two data
sets.
Year  Agriculture  Home economics  Literature  Music  Science  Sports and recreation
1980  461          879             1686        357    3109     971
2004  1065         3639            4671        2764   8509     4806
SESSION 3.2.1
SIGN TEST
The simplest nonparametric test, the sign test for single samples, is used to test the value
of a median for a specific sample.
• SIGN.test(x, md) requires the “BSDA” package in R. Install it once with
install.packages("BSDA") and load it with library(BSDA).
Example 3.2.1
A convenience store owner hypothesizes that the median number of snow cones she sells
per day is 40. A random sample of 20 days yields the following data for the number of
snow cones sold each day.
18, 43, 40, 16, 22, 30, 29, 32, 37, 36, 39, 34, 39, 45, 28, 36, 40, 34, 39, 52
At α = 0.05, test the owner’s hypothesis.
Solution
H0: The median number of snow cones she sells per day is 40.
Ha: The median number of snow cones she sells per day is not 40.
R-Code
>x=c(18, 43, 40, 16, 22, 30, 29, 32, 37, 36,39, 34, 39, 45,
28, 36, 40, 34, 39, 52)
>library(BSDA)
>SIGN.test(x, md=40)
OUTPUT
One-sample Sign-Test
data: x
s = 3, p-value = 0.007538
alternative hypothesis: true median is not equal to 40
95 percent confidence interval:
30.23294 39.00000
sample estimates:
median of x
36
Conclusion: There is enough evidence to reject the claim that the median number of
snow cones sold per day is 40.
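The sign test is a binomial test on the number of observations above the hypothesized
median, with observations equal to the median left out. As a sketch, the result of
Example 3.2.1 can therefore be checked with base R's binom.test(), without the BSDA
package. The names md, n_above, and n_used are illustrative.

```r
x <- c(18, 43, 40, 16, 22, 30, 29, 32, 37, 36, 39, 34, 39, 45,
       28, 36, 40, 34, 39, 52)
md <- 40                        # hypothesized median

n_above <- sum(x > md)          # number of plus signs
n_used  <- sum(x != md)         # values equal to the median are dropped

# Under H0 the number of plus signs is Binomial(n_used, 0.5)
res <- binom.test(n_above, n_used, p = 0.5, alternative = "two.sided")
c(s = n_above, n = n_used, p_value = res$p.value)
```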
SESSION 3.2.2
SIGN TEST (PAIRED SAMPLE)
The sign test can also be used to compare two dependent samples, such as a
before-and-after test, by testing whether the median of the paired differences is zero.
Example 3.2.2
A medical researcher believed the number of ear infections in swimmers can be reduced
if the swimmers use earplugs. A sample of 10 people was selected, and the number of
infections for four months was recorded. During the first two months, the swimmers did
not use the earplugs; during the second two months, they did. At the beginning of the
second two-month period, each swimmer was examined to make sure that no infections
were present. The data are shown below. At α = 0.05, can the researcher conclude that
using earplugs reduced the median number of ear infections?
Swimmer Before After
A 3 2
B 0 1
C 5 4
D 4 0
E 2 1
F 4 3
G 3 1
H 5 3
I 2 2
J 1 3
Solution
H0: The median number of ear infections will not be reduced.
Ha: The median number of ear infections will be reduced.
Let x = before and y = after.
R-Code
>x=c(3,0,5,4,2,4,3,5,2,1)
>y=c(2,1,4,0,1,3,1,3,2,3)
>w=x-y
>library(BSDA)
>SIGN.test(w)
OUTPUT
One-sample Sign-Test
data: w
s = 7, p-value = 0.1797
alternative hypothesis: true median is not equal to 0
95 percent confidence interval:
-0.6755556 2.0000000
sample estimates:
median of x
1
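Conclusion: The output reports the two-sided p-value, 0.1797. Because the claim is
one-sided (a reduction), the relevant p-value is half of this, about 0.0898. Either way
the p-value is greater than α = 0.05, so there is not enough evidence to support the
claim that the use of earplugs reduced the median number of ear infections.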
Activity 3.2
The median age for the total population of the state of Maine is 41.2, the highest in the
nation. The mayor of a particular city believes that his population is considerably
“younger” and that the median age there is 36 years. At α = 0.05, is there sufficient
evidence to reject his claim? The data here represent a random selection of persons from
the household population of the city.
40, 56, 42, 72, 12, 22, 25, 43, 39, 48, 50, 37, 18, 35, 15, 30, 52, 45
SESSION 3.3.1
WILCOXON SIGNED-RANK TEST
When the samples are dependent, as they would be in a before-and-after test using the
same subjects, the Wilcoxon signed-rank test can be used in place of the t-test for
dependent samples.
Example 3.3.1
In a large department store, the owner wishes to see whether the number of shoplifting
incidents per day will change if the number of uniformed security officers is doubled. A
sample of 7 days before security is increased and 7 days after the increase shows the
number of shoplifting incidents.
Day Monday Tuesday Wednesday Thursday Friday Saturday Sunday
Before 7 2 3 6 5 8 12
After 5 3 4 3 1 6 4
Is there enough evidence at α = 0.05 to support the claim that there is a difference in
the number of shoplifting incidents before and after the increase in security?
Solution
H0: There is no difference in the number of shoplifting incidents before and after the
increase in security.
Ha: There is a difference in the number of shoplifting incidents before and after the
increase in security.
R-Code
>before=c(7,2,3,6,5,8,12)
>after=c(5,3,4,3,1,6,4)
>wilcox.test(before,after,paired=TRUE)
OUTPUT
Wilcoxon signed rank test with continuity correction
data: before and after
V = 25, p-value = 0.07488
alternative hypothesis: true location shift is not equal to 0
Conclusion: Since the p-value (0.07488) is greater than α = 0.05, we fail to reject H0 and
conclude that there is not enough evidence to support the claim that there is a
difference in the number of shoplifting incidents.
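The statistic V in the output is the sum of the ranks of the positive differences, where
the ranks are taken on the absolute differences. The sketch below reproduces V for
Example 3.3.1 step by step; the names d and r are illustrative.

```r
before <- c(7, 2, 3, 6, 5, 8, 12)
after  <- c(5, 3, 4, 3, 1, 6, 4)

d <- before - after
d <- d[d != 0]                 # zero differences are discarded

# Rank the absolute differences (tied values share the average rank),
# then add up the ranks that belong to positive differences
r <- rank(abs(d))
V <- sum(r[d > 0])
V
```

Because some absolute differences are tied, R cannot compute an exact p-value here and
uses a normal approximation with continuity correction, as the output title indicates.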
SESSION 3.3.2
WILCOXON RANK-SUM TEST
The Wilcoxon rank-sum test is used for independent samples.
Example 3.3.2
Two independent samples of the army and marine recruits are selected, and the time in
minutes it takes each recruit to complete an obstacle course is recorded, as shown in the
table below.
Army 15 18 16 17 13 22 24 17 19 21 26 28
Marines 14 9 16 19 10 12 11 8 15 18 25
At α = 0.05, is there a difference in the times it takes the recruits to complete the
course?
Solution
𝐻o = There is no difference in the times it takes the recruits to complete the obstacle
course.
𝐻A = There is a difference in the times it takes the recruits to complete the obstacle
course.
R-Code
>x=c(15,18,16,17,13,22,24,17,19,21,26,28)
>y=c(14,9,16,19,10,12,11,8,15,18,25)
>wilcox.test(x,y)
OUTPUT
Wilcoxon rank sum test with continuity correction

data: x and y
W = 105, p-value = 0.01767
alternative hypothesis: true location shift is not equal to 0
Conclusion: Reject Ho and conclude that there is enough evidence to support the claim
that there is a difference in the times it takes the recruits to complete the course.
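The value W = 105 can be checked by ranking the two samples together. The sketch below uses base R only; the names r, R_x, n_x and W are illustrative.

```r
x <- c(15, 18, 16, 17, 13, 22, 24, 17, 19, 21, 26, 28)  # Army
y <- c(14, 9, 16, 19, 10, 12, 11, 8, 15, 18, 25)        # Marines

# Rank all 23 times together, then add up the ranks of the Army sample
r   <- rank(c(x, y))
R_x <- sum(r[seq_along(x)])
n_x <- length(x)

# R reports W as the rank sum of x minus its smallest possible value,
# n_x * (n_x + 1) / 2
W <- R_x - n_x * (n_x + 1) / 2
W

# Same statistic from the built-in test (the tie warning is silenced)
res <- suppressWarnings(wilcox.test(x, y))
res$statistic
```

Note that textbooks often quote the rank sum itself as the test statistic, so a hand calculation may differ from R's W by the constant n_x(n_x + 1)/2.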
Activity 3.3
Two groups of alcoholics, one male and the other female, were asked at what age
they first drank alcohol. The data are shown here. Using the Wilcoxon rank-sum test at α
= 0.05, is there a difference in the ages of the females and males?
Males 6 12 14 16 17 17 13 12 10 11
Females 8 9 9 12 14 15 12 16 17 19
SESSION 3.4
RUNS TEST FOR RANDOMNESS
The runs test checks whether the order of a sequence of observations is random. The
‘lawstat’ package performs the runs test for randomness (Mendenhall and Reinmuth 1982)
for continuous data, and the ‘tseries’ package computes the runs test for randomness of
a dichotomous (binary) data series ‘x’. With ‘lawstat’, users can choose whether to plot
the correlation graph or not, and whether to test against a two-sided, positive, or
negative correlation. ‘NA’s in the data are omitted.
library(lawstat) for continuous data
runs.test(y, plot.it = FALSE,
alternative = c("two.sided", "positive.correlated", "negative.correlated"))
library(tseries) for discrete and categorical data
runs.test(x, alternative = c("two.sided", "less", "greater"))
SESSION 3.4.1
RANDOMNESS TESTS FOR CATEGORICAL AND CONTINUOUS DATA
When samples are selected, you assume that they are selected at random. How do you
know if the data obtained from a sample are truly random?
R-code for Runs Test
•runs.test(a) requires the ‘tseries’ package for the categorical case
•runs.test(a) requires the ‘lawstat’ package for the continuous case
Example 3.4.1.1: (Discrete)

On a commuter train, the conductor wishes to see whether the passengers enter the train
at random. He observes the first 25 people, with the following sequence of males (M)
and females (F).
FFFMMFFFFMFMMMFFFFMMFFFMM
Test for randomness at α = 0.05.
Solution
Ho=The passengers board the train at random, according to gender.
Ha=The passengers do not board the train at random, according to gender.
Note: Load the ‘tseries’ package before you run the code.
R-Code
>a=factor(c("F","F","F","M","M","F","F","F","F","M","F","M"
,"M","M","F","F","F","F","M","M","F","F","F","M","M"))
> library(tseries)
> runs.test(a)
Alternative code
>a=scan(what="")
1: F
2: F
..
..
25: M
>library(tseries)
>runs.test(factor(a))
OUTPUT
Runs Test
data: a
Standard Normal = -1.2792, p-value = 0.2008
alternative hypothesis: two.sided
Conclusion: There is not enough evidence to reject the hypothesis that the passengers
board the train at random according to gender.
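The statistic in this output can be reproduced without any package by counting the runs and applying the usual large-sample formulas for the mean and standard deviation of the number of runs. This is a sketch under that assumption; the variable names are illustrative.

```r
# Boarding sequence from Example 3.4.1.1, split into single letters
a <- strsplit("FFFMMFFFFMFMMMFFFFMMFFFMM", "")[[1]]

runs <- length(rle(a)$lengths)   # a run is a block of identical neighbours
n1 <- sum(a == "F")
n2 <- sum(a == "M")
n  <- n1 + n2

# Expected number of runs and its standard deviation under randomness
mu    <- 2 * n1 * n2 / n + 1
sigma <- sqrt(2 * n1 * n2 * (2 * n1 * n2 - n) / (n^2 * (n - 1)))

z <- (runs - mu) / sigma         # standardized number of runs
p <- 2 * pnorm(-abs(z))          # two-sided p-value
c(runs = runs, expected = mu, z = z, p.value = p)
```

Fewer runs than expected (a negative z) would point towards clustering of the same gender, while more runs than expected would point towards alternation.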
Example 3.4.1.2 (Continuous)
Twenty people enrolled in a drug abuse program. Test the claim that the ages of the
people, according to the order in which they enroll, occur at random, at α = 0.05.
The data are:
18, 36, 19, 22, 25, 44, 23, 27, 27, 35, 19, 43, 37, 32, 28, 43, 46, 19, 20, 22.
Solution
Ho = The ages of the people, according to the order in which they enroll in a drug
program, occur at random.
Ha = The ages of the people, according to the order in which they enroll in a drug
program, do not occur at random.
R-Code
>a=c(18, 36, 19, 22, 25, 44, 23, 27, 27, 35, 19, 43, 37,
32, 28, 43, 46, 19, 20, 22)
>library(lawstat)
>runs.test(a)
OUTPUT
Runs Test -Two-sided

data: a
Standardized Runs Statistic = -0.8823, p-value = 0.3776
Conclusion: There is not enough evidence to reject the hypothesis that the ages of the
people who enroll occur at random.
Activity 3.4
1. As students, faculty, friends, and family arrived for the Spring Wind Ensemble
Concert at Shafer Auditorium, they were asked whether they were going to sit on
the balcony (B) or the ground floor (G). Use the responses listed below and test
for randomness at α = 0.05.
BBGGBBGBBBBBBGBBGGBBBBGGGGBGBBBGG
2. A school dentist wanted to test the claim, at α = 0.05, that the number of cavities
in fourth-grade students is random. Forty students were checked, and the number
of cavities each had is shown here. Test for the randomness of the values above
or below the median.
0460625315122137360260231521302373151122
SESSION 3.5
KRUSKAL-WALLIS TEST
The analysis of variance uses the F test to compare the means of three or more
populations. The assumptions for the ANOVA test are that the populations are normally
distributed and that the population variances are equal. When these assumptions cannot be
met, the nonparametric Kruskal-Wallis test, sometimes called the H test, can be used
to compare three or more populations.
R-code for Kruskal-Wallis Test
•kruskal.test(list(a,b,c)), where a, b, and c are the three samples (categories) to be
compared.
Example 3.5.1
A researcher tests three different brands of breakfast drinks to see how many
milliequivalents of potassium per quart each contains. These data are obtained.
Brand A Brand B Brand C

4.7 5.3 6.3
3.2 6.4 8.2
5.1 7.3 6.2
5.2 6.8 7.1
5.0 7.2 6.6
At α = 0.05, is there enough evidence to reject the hypothesis that all brands
contain the same amount of potassium?
Solution
Ho = There is no difference in the amount of potassium contained in the brands.
Ha = There is a difference in the amount of potassium contained in the brands.
R-Code
>a=c(4.7,3.2,5.1,5.2,5.0)
>b=c(5.3,6.4,7.3,6.8,7.2)
>c=c(6.3,8.2,6.2,7.1,6.6)
>kruskal.test(list(a,b,c))
OUTPUT
Kruskal-Wallis rank sum test

data: list(a, b, c)
Kruskal-Wallis chi-squared = 9.38, df = 2, p-value = 0.009187
Conclusion: Since the p-value (0.009187) is less than 0.05, we reject Ho and conclude
that there is a difference in the amount of potassium contained in the three brands.
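The H value reported by R can be verified from the rank sums of the three brands. The following is a sketch in base R; the third sample is named d here (an assumption made for this illustration) so that it does not mask the function c().

```r
a <- c(4.7, 3.2, 5.1, 5.2, 5.0)   # Brand A
b <- c(5.3, 6.4, 7.3, 6.8, 7.2)   # Brand B
d <- c(6.3, 8.2, 6.2, 7.1, 6.6)   # Brand C

values <- c(a, b, d)
brand  <- factor(rep(c("A", "B", "C"), each = 5))

r  <- rank(values)                 # rank all 15 values together
Rj <- tapply(r, brand, sum)        # rank sum per brand
nj <- tapply(r, brand, length)     # sample size per brand
N  <- length(values)

# Kruskal-Wallis H statistic (no tie correction is needed for these data)
H <- 12 / (N * (N + 1)) * sum(Rj^2 / nj) - 3 * (N + 1)
p <- pchisq(H, df = nlevels(brand) - 1, lower.tail = FALSE)
c(H = H, p.value = p)

# The formula interface gives the same statistic from data in long format
kw <- kruskal.test(values ~ brand)
kw$statistic
```

The long format (one column of values, one column of group labels) is the same layout that is used for the ANOVA examples in Unit Five.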
Activity 3.5.1
You are researching an article on the waterfalls on our planet. You want to make a
statement about the heights of waterfalls on three continents. Three samples of waterfall
heights (in feet) are shown.
North America Africa Asia
600 406 330
1200 508 830
182 630 614
620 726 1100
1170 480 885
442 2014 330
1 What are the hypotheses?

2 Select a significance level and run the test. What is the H value?
3 What is your conclusion?
Activity 3.5.2
Samples of three different types of wrapping tape are tested for breaking strength, in
pounds. The data are shown here. At α = 0.05, is there a difference in the breaking
strength of the tapes? Use the Kruskal-Wallis test.
Type A 225 332 404 387 351 280 362 431 266
Type B 256 203 261 305 232 278 261 299 272
Type C 406 427 481 397 351 409 462 471 399
UNIT FOUR
CHI-SQUARE TEST
OVERVIEW
In this unit, the chi-square goodness of fit and homogeneity tests will be considered.
These techniques fall under the nonparametric methods of hypothesis testing. We will
discuss the applications of the chi-square tests and also compute test statistics and
carry out hypothesis tests using R software.
CONTENT
Session 4.1 Goodness of Fit: A chi-square test used to see whether a frequency
distribution fits a specific pattern
Session 4.2 Test of Homogeneity
REQUIRED READINGS
Ugarte, M. D., Militino, A. F., & Arnholt, A. T. (2008). Probability and Statistics
with R. CRC Press.
LEARNING OUTCOMES
1. Perform the goodness of fit test with chi-square using R
2. Perform the test of homogeneity using R
Activities include YouTube videos on the chi-square goodness of fit test and the test of
homogeneity in R software. Session examples include solved questions on each
technique in R software. Under each example are activities. Please note that these
ACTIVITIES are to be solved and submitted for grading. The deadline for submission
is one week after each lecture.
Video Activity
1. https://www.youtube.com/watch?v=n5c11B5FJ24 (Goodness of Fit Test)
2. https://www.youtube.com/watch?v=oAs15X_hsJ4 (Test of Homogeneity)
SESSION 4.1
GOODNESS OF FIT
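‘chisq.test’ performs chi-squared contingency table tests and
goodness-of-fit tests.
chisq.test(x, y = NULL, correct = TRUE, p = rep(1/length(x), length(x)),
rescale.p = FALSE, simulate.p.value = FALSE, B = 2000)
Goodness of Fit Test – R Syntax:
•chisq.test(x, p = p), where x holds the observed frequencies and p the expected
probabilities. R is case-sensitive, so the function name must be typed in lower case.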
Example 4.1.1
The data below show the preferences in the selection of fruit soda flavors. Sellers claim
that there is no preference in the selection of fruit soda flavors.
Frequency                Cherry  Strawberry  Orange  Lime  Grape
Observed                 32      28          16      14    10
Expected probabilities   0.2     0.2         0.2     0.2   0.2
Is there enough evidence to reject the claim that there is no preference in the selection of
fruit soda flavors, using the data above? Let α = 0.05.
Solution:
𝐻o: Consumers show no preference for flavors of the fruit soda.
𝐻A: Consumers show a preference for flavors of the fruit soda
R-Code
>a=c(32,28,16,14,10)
>p=c(0.2,0.2,0.2,0.2,0.2)
>chisq.test(a,p=p)
OUTPUT
Chi-squared test for given probabilities

data: a
X-squared = 18, df= 4, p-value = 0.001234
Conclusion: We reject the null hypothesis and conclude that consumers show a prefer-
ence for flavors.
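The X-squared value can be confirmed by computing the expected frequencies and the chi-square sum directly. This is a sketch; the names observed, p0, expected, x2 and crit are illustrative.

```r
observed <- c(Cherry = 32, Strawberry = 28, Orange = 16, Lime = 14, Grape = 10)
p0       <- rep(0.2, 5)                      # no preference: equal probabilities

expected <- sum(observed) * p0               # 100 * 0.2 = 20 per flavor
x2       <- sum((observed - expected)^2 / expected)
df       <- length(observed) - 1

p_value <- pchisq(x2, df, lower.tail = FALSE)
crit    <- qchisq(0.95, df)                  # critical value at alpha = 0.05
c(X.squared = x2, critical = crit, p.value = p_value)

# The built-in test returns the same statistic
res <- chisq.test(observed, p = p0)
res$statistic
```

Rejecting Ho because x2 exceeds the critical value and rejecting because the p-value is below 0.05 are two ways of stating the same decision.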
Activity 4.1
Home-Schooled Student Activities: Students who are home-schooled often attend their
local schools to participate in various types of activities such as sports or musical
ensembles. According to the government, 82% of home-schoolers receive their
education entirely at home, while 12% attend school up to 9 hours per week and 6%
spend from 9 to 25 hours per week at school. A survey of 85 students who are home-
schooled revealed the following information about where they receive their education.
Entirely at home Up to 9 hours 9 to 25 hours

50 25 10
At α = 0.05, is there sufficient evidence to conclude that the proportions differ from
those stated by the government?
SESSION 4.2
TEST OF HOMOGENEITY
R-Syntax for test of homogeneity:
•chisq.test(x), where x is a cross-tabulation of the data
Example 4.2.1
A researcher selected 100 passengers from each of the 3 airlines and asked them if the
airline had lost their luggage on their last flight. At a 0.05 level of significance, test the
claim that the proportion of passengers from each airline who lost luggage on the flight
is the same for each airline. The data are shown in the table.
       Airline 1  Airline 2  Airline 3  Total
Yes    10         7          4          21
No     90         93         96         279
Solution
𝐻o=The proportion of passengers from each airline who lost luggage on the flight is the
same for each airline.
𝐻A=The proportion of passengers from each airline who lost luggage on the flight is not
the same for each airline.
R-Code
>e=rbind(c(10,90),c(7,93),c(4,96))
>chisq.test(e)
OUTPUT
Pearson's Chi-squared test

data: e
X-squared = 2.765, df= 2, p-value = 0.251
Conclusion: Since the p-value (0.251) is greater than 0.05, there is not enough evidence
to reject the null hypothesis that the proportion of passengers who lost luggage on the
flight is the same for each airline.
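The expected counts behind this test come from the row and column totals of the table. The sketch below labels the matrix (the dimnames are added here for readability and are not in the original code) and rebuilds the statistic by hand.

```r
e <- rbind(c(10, 90), c(7, 93), c(4, 96))
dimnames(e) <- list(Airline = c("Airline 1", "Airline 2", "Airline 3"),
                    Lost    = c("Yes", "No"))

# Expected count = (row total * column total) / grand total
expected <- outer(rowSums(e), colSums(e)) / sum(e)
expected

x2 <- sum((e - expected)^2 / expected)
df <- (nrow(e) - 1) * (ncol(e) - 1)
p_value <- pchisq(x2, df, lower.tail = FALSE)
c(X.squared = x2, df = df, p.value = p_value)

# chisq.test() applies no continuity correction to a 3 x 2 table,
# so it returns the same value
res <- chisq.test(e)
res$statistic
```

Each airline contributes 100 passengers, so under Ho every airline is expected to have 7 passengers with lost luggage and 93 without.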
Activity 4.2.1
Endangered or Threatened Species: Can you conclude that there is a relationship between the class of
vertebrate and whether it is endangered or threatened? Use the 0.05 level of significance.
Is there a different result for the 0.01 level of significance?
Mammal Bird Reptile Amphibian Fish

Endangered 68 76 14 13 76
Threatened 13 15 23 10 61
Activity 4.2.2
Is there sufficient evidence at the 5% level of significance to conclude that a relationship
exists between the city and the number of television and radio stations that it has?
TV Radio
Albuquerque N. Mex. 13 32
Boston Mass. 12 21
St. Petersburg Fla. 17 41
Minneapolis Minn. 7 30
Toledo Ohio 6 22
UNIT FIVE
ANALYSIS OF VARIANCE
OVERVIEW
In this unit, the F test, which compares two variances, is used to test claims
involving three or more means. We will discuss the applications of the various analysis of
variance tests using R software. The two-way ANOVA is an extension of the one-way
analysis of variance; it involves two independent variables. The independent variables
are also called factors.
CONTENT
Session 5.1 One Way ANOVA
Session 5.2 Two Way ANOVA
Session 5.3 Multivariate Analysis of Variance
REQUIRED READINGS
1. Bluman, A. G. (2009). Elementary statistics: A step-by-step approach. New York, NY:
McGraw-Hill Higher Education.
2. Ugarte, M. D., Militino, A. F., & Arnholt, A. T. (2008). Probability and Statistics
with R. CRC Press.
4. Schmuller, J. (2017). Statistical Analysis with R for Dummies. John Wiley & Sons.
LEARNING OUTCOMES
1. Perform the one-way ANOVA test using R
2. Perform the two-way ANOVA test using R
3. Perform the multivariate analysis of variance (MANOVA) using R
Activities include YouTube videos on performing one-way ANOVA, two-way ANOVA,
and multivariate ANOVA in R software. Session examples include solved questions on
each test in R software; they illustrate the R code for carrying out each ANOVA test.
Under each example are activities. Please note that these ACTIVITIES are to be solved
and submitted for grading. The deadline for submission is one week after each lecture.
Video Activity
1. https://www.youtube.com/watch?v=4DeCaCaC2JQ (one-way ANOVA)
2. https://www.youtube.com/watch?v=oEaS_yKJ8lM (two-way ANOVA)
SESSION 5.1
ONE WAY ANOVA
Fit an analysis of variance model by a call to ‘lm’ for each
stratum.
aov(formula, data = NULL, projections = FALSE, qr = TRUE,
contrasts = NULL, ...)
One-way ANOVA – R Syntax:
•aov(y~x, data), where y is the response and x is the grouping factor.
Example 5.1.1:
A researcher wishes to try three different techniques to lower the blood pressure of
individuals diagnosed with high blood pressure. The subjects are randomly assigned to
three groups; the first group takes medication, the second group exercises, and the third
group follows a special diet. After four weeks, the reduction in each person’s blood
pressure is recorded. At 𝛼=0.05, test the claim that there is no difference among the
means. The data are shown below
Techniques
Medication(M) 10 12 9 15 13
Exercise(E) 6 8 3 0 2
Diet(D) 5 9 12 8 4
Solution:
𝐻o: The means of the three techniques are the same.
𝐻A: At least one mean is different from the others.
Note
•Enter the data in Excel, with one column for the response (pressure) and one column
for the group (techniques).
•Save the data. (With the blood pressure example, the data is saved as ‘mydata’).
•Save the data as a comma-delimited (CSV) file in your documents folder.
•Import the data into R by using ‘data2=read.csv("mydata.csv")’.
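The layout that aov() expects is one row per subject, with the response in one column and the group label in another. The sketch below builds that layout directly in R and passes it through a CSV file. The column names pressure and techniques follow the aov() call in this example; the temporary file is an assumption that stands in for ‘mydata.csv’.

```r
# One row per subject: reduction in blood pressure and the technique used
data2 <- data.frame(
  pressure   = c(10, 12, 9, 15, 13,   6, 8, 3, 0, 2,   5, 9, 12, 8, 4),
  techniques = rep(c("M", "E", "D"), each = 5)
)

# Round trip through a CSV file, as described in the note
csv_file <- tempfile(fileext = ".csv")
write.csv(data2, csv_file, row.names = FALSE)
data2 <- read.csv(csv_file)

# Store the group labels as a factor so they are treated as categories
data2$techniques <- factor(data2$techniques)
str(data2)

# Group means give a first impression before the ANOVA is run
group_means <- tapply(data2$pressure, data2$techniques, mean)
group_means
```

If the file is not found when you use your own CSV, check the working directory with getwd() or give read.csv() the full path to the file.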
R-Code
>data2=read.csv("mydata.csv")
>data2
>results=aov(pressure~techniques, data=data2)
>summary(results)
OUTPUT
            Df Sum Sq Mean Sq F value  Pr(>F)
techniques   2  160.1   80.07   9.168 0.00383 **
Residuals   12  104.8    8.73
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Conclusion: The decision is to reject the 𝐻0, since 9.17 is greater than the critical value
(3.89), therefore we conclude that at least one mean is different from the others.
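The critical value and the p-value quoted for this example can both be obtained from R's F distribution functions. This is a short sketch that takes the F value and the degrees of freedom from the ANOVA table; the variable names are illustrative.

```r
f_value <- 9.168   # F value from the summary() output
df1 <- 2           # degrees of freedom for techniques
df2 <- 12          # degrees of freedom for residuals

crit <- qf(0.95, df1, df2)                      # critical value at alpha = 0.05
p    <- pf(f_value, df1, df2, lower.tail = FALSE)  # area to the right of F

c(critical = crit, p.value = p)
f_value > crit     # TRUE means Ho is rejected
```

Comparing F with the critical value and comparing Pr(>F) with α always lead to the same decision, so the table lookup can be skipped when R output is available.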
Activity 5.1
The following set of data values was obtained from a study of people’s perceptions on
whether the color of a person’s clothing is related to how intelligent the person looks.
The subjects rated the person’s intelligence on a scale of 1 to 10. Group 1 subjects were
randomly shown people with clothing in shades of blue and gray. Group 2 subjects were
randomly shown people with clothing in shades of brown and yellow. Group 3 subjects
were randomly shown people with clothing in shades of pink and orange. The results
follow.
Group 1 Group 2 Group 3

8 7 4
7 8 9
7 7 6
7 7 7
8 5 9
8 8 8
6 5 5
8 8 8
8 7 7
7 6 5
7 6 4
8 6 5
8 6 4
Use ANOVA to test for any significant differences between the means.
SESSION 5.2
TWO-WAY ANOVA TEST
R Syntax for two-way ANOVA:
•aov(y~x + z, data), where x and z are the two factors.
Example 5.2.1:
A researcher wishes to see whether the type of gasoline used and the type of automobile
driven have any effect on gasoline consumption. Two types of gasoline, regular and
high-octane, will be used, and two types of automobiles, two-wheel-drive and
four-wheel-drive, will be used in each group. There will be two automobiles in each
group, for a total of eight automobiles used. Take 𝛼=0.05.
The data (in miles per gallon) are shown and summarized in the table below.
The hypotheses for the gasoline types are

𝐻o: There is no difference between the means of gasoline consumption for the two types
of gasoline.
HA: There is a difference between the means of gasoline consumption for the two types
of gasoline.
The hypotheses for the types of automobile driven are

𝐻o: There is no difference between the means of gasoline consumption for two-wheel-
drive and four-wheel-drive automobiles.
𝐻A: There is a difference between the means of gasoline consumption for two-wheel-
drive and four-wheel-drive automobiles.
Solution
Note
•Save the data. (With the gasoline example, the data is saved as ‘mydata2’).

•Import data into ‘R’ by using ‘data3=read.csv("mydata2.csv")’.
R-Code
>data3=read.csv("mydata2.csv")
>data3
>results=aov(Gasoline~Gas+Automobile,data=data3)
>summary(results)
OUTPUT
            Df Sum Sq Mean Sq F value Pr(>F)
Gas          1   3.92    3.92   0.342  0.584
Automobile   1   9.68    9.68   0.843  0.401
Residuals    5  57.38   11.48
Conclusion for the gasoline types:
We fail to reject 𝐻0 for the gasoline types since 0.342 is less than the critical value of
6.61 (F with 1 and 5 degrees of freedom). We conclude that there is no difference between
the means of gasoline consumption for the two types of gasoline.
Conclusion for the types of automobiles driven:
We fail to reject 𝐻0 for the type of automobile driven since 0.843 is less than the
critical value of 6.61. We conclude that there is no difference between the means of
gasoline consumption for two-wheel-drive and four-wheel-drive automobiles.
When there is an interaction between the variables

The hypotheses for the interaction are:
𝐻o: There is no interaction effect between the type of gasoline used and the type of
automobile a person drives on gasoline consumption.
𝐻A: There is an interaction effect between the type of gasoline used and the type of
automobile a person drives on gasoline consumption.
R-Code
>data3=read.csv("mydata2.csv")
>data3
>results=aov(Gasoline~Gas+Automobile+Gas:Automobile,data=data3)
>summary(results)
OUTPUT
               Df Sum Sq Mean Sq F value  Pr(>F)
Gas             1   3.92    3.92   4.752 0.09477 .
Automobile      1   9.68    9.68  11.733 0.02665 *
Gas:Automobile  1  54.08   54.08  65.552 0.00126 **
Residuals       4   3.30    0.82
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Conclusion for the interaction:
We reject 𝐻o for the interaction effect since 65.552 is greater than the critical value
of 7.71. Since the null hypothesis for the interaction effect was rejected, it can be
concluded that there is an interaction effect between the type of gasoline used and the
type of automobile a person drives on gasoline consumption.
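Two details of the two-way analysis can be checked in R without the data file: how the model formula expands, and how the critical value depends on the residual degrees of freedom. The sketch below uses only the variable names from the formula in this example; no data are needed because terms() works on the formula alone.

```r
# Gas * Automobile is shorthand for Gas + Automobile + Gas:Automobile
labels_long  <- attr(terms(Gasoline ~ Gas + Automobile + Gas:Automobile),
                     "term.labels")
labels_short <- attr(terms(Gasoline ~ Gas * Automobile), "term.labels")
labels_long
identical(labels_long, labels_short)

# Critical F values at alpha = 0.05 for a factor with 1 degree of freedom
crit_additive    <- qf(0.95, df1 = 1, df2 = 5)  # model without interaction
crit_interaction <- qf(0.95, df1 = 1, df2 = 4)  # model with interaction
round(c(additive = crit_additive, interaction = crit_interaction), 2)
```

Adding the interaction term uses one residual degree of freedom, so the two models are judged against different critical values. In the model with interaction, each F value in the table is compared with the second value.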
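SESSION 5.3
MULTIVARIATE ANALYSIS OF VARIANCE (MANOVA)
The R syntax for the multivariate analysis of variance is given by
manova(cbind(y1,y2)~x, data), where y1 and y2 are the response variables and x is the
grouping factor.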
Example 5.3.1
A researcher would like to know if there is a significant difference in sepal and petal
length between the different species of flowers. Take α = 0.05. The data are given in the
table below.
Species Sepal.length Petal.length
Versicolor 5.0 3.3
Versicolor 5.5 4.4
Versicolor 5.8 4.0
Virginia 6.2 4.8
Virginia 5.9 5.1
Setosa 4.9 1.4
Setosa 4.5 1.4
Versicolor 5.7 4.2
Versicolor 6.1 4.7
Versicolor 6.2 4.3
Solution
𝐻o: There is no significant difference in petal and sepal length between the different
species.
𝐻A: There is a significant difference in petal and sepal length between the different
species.
Let Vi represent Virginia.
Let Ve represent Versicolor.
Let Se represent Setosa.
Note
•Save the data. (With the species example, the data is saved as ‘mydata3’).
•Import the data into R by using ‘flower=read.csv("mydata3.csv")’.
R-Code
>flower=read.csv("mydata3.csv")
>flower
>results=manova(cbind(Sepal.length,Petal.length)~Species, data=flower)
>summary(results)
OUTPUT
          Df  Pillai approx F num Df den Df  Pr(>F)
Species    2 0.93928   3.0993      4     14 0.05061 .
Residuals  7
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Conclusion: Since the p-value (0.05061) is greater than 0.05, we fail to reject the null
hypothesis and conclude that there is no significant difference in petal and sepal
length between the different species.
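The example can also be run without a CSV file by typing the table into a data frame. This is a sketch: the column names Sepal.length and Petal.length are assumptions chosen to match the wording of the hypotheses, and the species labels are copied from the table as printed.

```r
flower <- data.frame(
  Species = c("Versicolor", "Versicolor", "Versicolor", "Virginia", "Virginia",
              "Setosa", "Setosa", "Versicolor", "Versicolor", "Versicolor"),
  Sepal.length = c(5.0, 5.5, 5.8, 6.2, 5.9, 4.9, 4.5, 5.7, 6.1, 6.2),
  Petal.length = c(3.3, 4.4, 4.0, 4.8, 5.1, 1.4, 1.4, 4.2, 4.7, 4.3)
)
flower$Species <- factor(flower$Species)

# cbind() joins the two responses so they are tested together
results <- manova(cbind(Sepal.length, Petal.length) ~ Species, data = flower)

# summary() uses Pillai's trace by default; $stats holds the printed table
stats <- summary(results)$stats
stats

# Follow-up: a separate one-way ANOVA for each response variable
summary.aov(results)
```

Other multivariate statistics can be requested with the test argument, for example summary(results, test = "Wilks").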
Activity 5.3.1
A state employee wishes to see if there is a significant difference in the number of
employees at the interchanges of three regional toll roads. The data are shown. At α =
0.05, can it be concluded that there is a significant difference in the average number of
employees at each interchange?
Accra-Tema Motorway Kumasi-Accra road Kumasi-Tamale Road
7 10 1
14 1 12
32 1 1
19 0 9
10 11 1
11 1 11
Activity 5.3.2
The following set of data values was obtained from a study of people’s perceptions on
whether the color of a person’s clothing is related to how intelligent the person looks.
The subjects rated the person’s intelligence on a scale of 1 to 10. Group 1 subjects were
randomly shown people with clothing in shades of blue and gray. Group 2 subjects were
randomly shown people with clothing in shades of brown and yellow. Group 3 subjects
were randomly shown people with clothing in shades of pink and orange. The results
follow.
Group 1 Group 2 Group 3
8 7 4
7 8 9
7 7 6
7 7 7
8 5 9
8 8 8
6 5 5
8 8 8
8 7 7
7 6 5
7 6 4
8 6 5
8 6 4
UNIT SIX
LOGISTIC REGRESSION
OVERVIEW
In this unit, logistic regression will be considered. Logistic regression is a statistical
method for analyzing a dataset in which there are one or more independent variables that
determine an outcome. The outcome is measured with a dichotomous variable (in which
there are only two possible outcomes). We will discuss the applications of logistic
regression using R software.
CONTENT
Session 6.1 Logistic Regression
REQUIRED READINGS
1. Bluman, A. G. (2009). Elementary statistics: A step-by-step approach. New York, NY:
McGraw-Hill Higher Education.
2. Ugarte, M. D., Militino, A. F., & Arnholt, A. T. (2008). Probability and Statistics
with R. CRC Press.
4. Schmuller, J. (2017). Statistical Analysis with R for Dummies. John Wiley & Sons.
LEARNING OUTCOMES
1. Perform the logistic regression using R
Activities include YouTube videos on fitting logistic regression models in R
software. Session examples include solved questions on logistic regression with R
software; they illustrate the R code for finding the estimates of the coefficients of the
independent variables and their hypothesis tests. Under each example are
activities. Please note that these ACTIVITIES are to be solved and submitted for grading.
The deadline for submission is one week after each lecture.
Description
‘glm’ is used to fit generalized linear models, specified by
giving a symbolic description of the linear predictor and a
description of the error distribution.
R-Code
glm(formula, family = gaussian, data, weights, subset, na.action,
start = NULL, etastart, mustart, offset, control = list(...),
model = TRUE, method = "glm.fit", x = FALSE, y = TRUE,
singular.ok = TRUE, contrasts = NULL, ...)
Video Activity
1. https://www.youtube.com/watch?v=C4N3_XJJ-jU
SESSION 6.1
LOGISTIC REGRESSION
•glm(y~x+z, data, family = binomial("logit")).
Note.
•The dependent variable must be dichotomous (two categories, coded 1 and 0).
•The independent variables can be either categorical or continuous.
Example 6.1.1
Example of a logistic regression table, where the dependent variable is a discrete
dichotomous variable with 1s and 0s. The independent variables can be discrete or
continuous.
Y x1 x2 x3
1 1 0 15
0 0 0 14
0 0 1 18
1 0 1 9
0 1 1 10
0 1 1 11
Solution
R-Code for the logistic regression table
>y=c(1,0,0,1,0,0)
>x1=c(1,0,0,0,1,1)
>x2=c(0,0,1,1,1,1)
>x3=c(15,14,18,9,10,11)
>a=glm(y~x1+x2+x3, family=binomial())
>summary(a)
Alternative way of entering the data.

•Save the data as a comma-delimited (CSV) file (for example, ‘data5.csv’).
R-Code
>logis=read.csv("data5.csv")
>a=glm(y~x1+x2+x3, family=binomial(), data=logis)
>summary(a)
OUTPUT
Call:
glm(formula = y ~ x1 + x2 + x3, family = binomial())

Deviance Residuals:
       1         2         3         4         5         6
 1.17741  -1.17741   0.00000   1.17741  -1.17741  -0.00016

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)    254.31   74727.90   0.003    0.997
x1              18.17    5337.71   0.003    0.997
x2             -90.83   26688.54  -0.003    0.997
x3             -18.17    5337.71  -0.003    0.997

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 7.6382 on 5 degrees of freedom
Residual deviance: 5.5452 on 2 degrees of freedom
AIC: 13.545

Number of Fisher Scoring iterations: 20

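MODEL: log(p/(1 − p)) = 254.31 + 18.17x1 − 90.83x2 − 18.17x3, where p = P(y = 1).
The very large standard errors show that these estimates are unstable for such a small
data set, and none of the coefficients is significant at α = 0.05.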
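The fitted coefficients are on the log-odds scale, so they must be converted before they can be read as probabilities. The sketch below refits the example and compares the two scales; the names log_odds and p_hat are illustrative, and the warning that R may give for such a small data set is silenced on purpose.

```r
y  <- c(1, 0, 0, 1, 0, 0)
x1 <- c(1, 0, 0, 0, 1, 1)
x2 <- c(0, 0, 1, 1, 1, 1)
x3 <- c(15, 14, 18, 9, 10, 11)

# With only six rows the fit is unstable, so R may warn about fitted
# probabilities of 0 or 1; the warning is suppressed for readability
a <- suppressWarnings(glm(y ~ x1 + x2 + x3, family = binomial()))

log_odds <- predict(a, type = "link")       # b0 + b1*x1 + b2*x2 + b3*x3
p_hat    <- predict(a, type = "response")   # estimated P(y = 1)

# The two scales are linked by the logistic function:
# p = exp(log_odds) / (1 + exp(log_odds)), available in R as plogis()
same_scale <- all.equal(as.numeric(plogis(log_odds)), as.numeric(p_hat))
same_scale
round(p_hat, 3)
```

A common follow-up is to classify an observation as 1 when its estimated probability exceeds 0.5, although with a data set this small the classification should not be trusted.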
Activity 6.1
The discrete dichotomous dependent variable in a study takes the values 1 and 0. The
independent variables x1 and x2 are discrete and the third, x3, is continuous. Use the
logistic model to find the effect of the independent variables.
Y x1 x2 x3
1 1 0 152
0 0 1 142
0 0 1 183
0 0 1 95
0 1 1 108
1 0 0 157
1 1 0 150
0 0 0 14
0 0 1 158
1 0 1 99
1 0 0 108
0 1 1 111
END-OF-COURSE EVALUATION
Please visit the End of Course evaluation folder to pick the evaluation questionnaire.
Answer the questions and submit your response to the facilitator. This evaluation is very
critical to the betterment of the course in subsequent sessions.
The final examination, which counts towards 70% of your final grade, will include all
topical issues discussed during this course. Please review all examples, activities, and
individual assignments, as preparation for your final exam. The programme’s
examination officer will communicate the final examination dates and venue to students
sometime later.
Thanks for your