Stat 362 Study Guide
KUMASI.
BSc Statistics 2
STAT 362
Statistical Computing & Data Analysis II
2 Credits
STUDY GUIDE
Emmanuel Harris
Department of Statistics and Actuarial Science
[STAT 362 Statistical Computing & Data Analysis II]
Publisher’s Information
© IDL, 2017
All rights reserved. No part of this study guide may be reproduced or utilized in any
form or by any means, electronic or mechanical, including photocopying, recording or
by any information storage and retrieval system, without the permission from the
copyright holders.
Director
Institute of Distance Learning
New Library Building
Kwame Nkrumah University of Science and Technology
Kumasi, Ghana
Phone: +233-32-2060013
+233-32-2061287
+233-32-2060023
Fax: +233-32-2060014
E-mail: emmaharris2002@yahoo.com
Web: www.idl.knust.edu.gh
www.knust.edu.gh
INTRODUCTION
Before you begin the course, I would like to give you some useful information about
Statistical Computing & Data Analysis and offer you a few hints for successful completion
of this course.
Statistical modeling and data analysis techniques are difficult subjects to grasp and
apply, and it is often necessary to use computer software to handle large data sets and
to obtain useful results. R is recognized as one of the most powerful and flexible free
statistical software packages, and it enables the user to carry out a wide range of
analyses.
This course shows students how to analyze large data sets in R to obtain useful results.
The requirement for successful completion of this course is a computer with R software
successfully installed.
COURSE OVERVIEW
Unit 2: Regression
Unit 4: Chi-Square
Unit 5: ANOVA
COURSE OBJECTIVE(S)
On completion of the course, you should be able to:
1. Perform hypothesis tests (t-tests) in R.
2. Perform nonparametric tests using R.
3. Perform the chi-square goodness-of-fit test and test of homogeneity using R.
4. Perform one-way, two-way, and multiple ANOVA using R.
5. Perform logistic regression analysis in R.
COURSE OUTLINE
Unit 1: Hypothesis testing
Unit 2: Regression
Unit 3: Non-parametric Test
Unit 4: Chi-Square
Unit 5: ANOVA
Unit 6: Logistic Regression
REQUIRED TEXTBOOKS/READINGS
1. Bluman, A. G. (2009). Elementary statistics: A step-by-step approach. New York,
NY: McGraw-Hill Higher Education.
2. Ugarte, M. D., Militino, A. F., & Arnholt, A. T. (2008). Probability and Statistics
with R. CRC Press.
3. Kerns, G. J. (2010). Introduction to probability and statistics using R. Lulu.com.
4. Schmuller, J. (2017). Statistical Analysis with R for Dummies. John Wiley & Sons.
5. Crawley, M. J. (2005). Statistics: An introduction using R. Wiley.
GRADING
Continuous assessment: 30%
Total: 100%
ASSIGNMENT SCHEDULE
All assignments are due before the end of the specified day of delivery (GMT 23:59).
All assignments are to be uploaded to the hand-in folder for this course unless other
instructions are given. If you are unable to hand in your assignment on the LMS of IDL
KNUST (vclass), you may email it to the course facilitator (on the said day of delivery).
Failure to deliver assignments on the specified date will attract penalties in the form of a
reduced grade.
3   Non Parametric Tests   Individual   1 week   10%
6   Logistic Regression    Individual   1 week   10%
* Participation in Online discussions may account for 15% of the final grade
* Deadline could be weekly based
UNIT ONE
HYPOTHESIS TESTING (t-test)
OVERVIEW
One of the most common tests in statistics is the t-test, used to determine whether the
means of two groups are equal (i.e. Two-Sample t-test) and/or determine whether the
hypothesized mean of a certain population is true (One-Sample t-test). The assumption
for the test is that both groups are sampled from normal distributions with equal
variances.
CONTENT
Session 1.1 One-sample independent t-test
Session 1.2 Two-sample independent t-test
Session 1.3 Dependent/Paired t-test
REQUIRED READINGS
1. Bluman, A. G. (2009). Elementary statistics: A step-by-step approach. New York,
NY: McGraw-Hill Higher Education.
2. Ugarte, M. D., Militino, A. F., & Arnholt, A. T. (2008). Probability and Statistics
with R. CRC Press.
3. Kerns, G. J. (2010). Introduction to probability and statistics using R. Lulu.com.
4. Schmuller, J. (2017). Statistical Analysis with R for Dummies. John Wiley & Sons.
5. Crawley, M. J. (2005). Statistics: An introduction using R. Wiley.
LEARNING OUTCOMES
By the completion of this unit, the students should be able to;
1. Test the difference between two means for independent samples, using the t-test in
R.
2. Test the difference between two means for dependent samples, using the t-test in R.
3. Test a claim about a hypothesized mean of one sample, using the t-test in R.
This session presents video activities, session examples, and other activities. The video
activities are YouTube videos on performing t-tests in R. The session examples are solved
questions on each type of t-test, illustrating the R code for each. Each example is
followed by one or more activities. These ACTIVITIES are to be solved and submitted for
grading. The deadline for submission is one week after each lecture.
Video Activity
1. https://youtu.be/kvmSAXhX9Hs (One-sample t-Test)
2. https://youtu.be/RlhnNbPZC0A ( Two-sample independent t-Test)
3. https://youtu.be/yD6aU0fY2lo (Two-sample dependent t-Test)
SESSION 1.1
ONE SAMPLE INDEPENDENT T-TEST
Example 1.1.1
A researcher estimated that the average height of story buildings on the KNUST campus
is 700 feet. A random sample of 9 story buildings is selected and the heights in feet are
shown below:
485, 511, 841, 725, 615, 520, 535, 635, 616
At α =0.05 , is there enough evidence to reject this claim?
Solution
Hypothesis:
Ho: μ = 700
H1: μ ≠ 700
R codes
> x=c(485,511,841,725,615,520,535,635,616)
> t.test(x,mu=700,conf.level = 0.95, alternative = "two.sided")
OUTPUT
data: x
t = -2.3612, df = 8, p-value = 0.04587
alternative hypothesis: true mean is not equal to 700
95 percent confidence interval:
520.5678 697.8767
sample estimates:
mean of x
609.2222
Conclusion: Since the p-value (0.04587) is less than α = 0.05, we reject Ho and conclude
that there is enough evidence to reject the researcher’s claim that the average height of
story buildings on the KNUST campus is 700 feet.
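The statistic reported by t.test() can be reproduced by hand from the formula
t = (sample mean - hypothesized mean) / (s / sqrt(n)). The sketch below does this for the
heights in Example 1.1.1; the names mu0, t_manual, and p_manual are illustrative and are
not part of the original example.

```r
# Heights (in feet) from Example 1.1.1
x <- c(485, 511, 841, 725, 615, 520, 535, 635, 616)
mu0 <- 700                                   # hypothesized mean

n <- length(x)
t_manual <- (mean(x) - mu0) / (sd(x) / sqrt(n))   # one-sample t statistic
p_manual <- 2 * pt(-abs(t_manual), df = n - 1)    # two-sided p-value

# Compare the hand computation with the built-in test
res <- t.test(x, mu = mu0)
print(c(manual = t_manual, builtin = unname(res$statistic)))
print(c(manual = p_manual, builtin = res$p.value))
```

Both pairs of numbers should agree, which shows that t.test() is applying the same formula
with n - 1 degrees of freedom.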
Example 1.1.2
A state executive claims that the average number of acres in Western Region parks is
less than 2000 acres. A random sample of five parks is selected, and the number of acres
is shown. At α = 0.05, is there enough evidence to support the claim?
959 1187 493 6249 541
Solution
Hypothesis:
Ho: μ = 2000
H1: μ < 2000
R codes
> x=c(959, 1187, 493, 6249, 541)
> t.test(x,mu=2000, conf.level = 0.95, alternative = "less")
OUTPUT
One Sample t-test
data: x
t = -0.10396, df = 4, p-value = 0.4611
alternative hypothesis: true mean is less than 2000
95 percent confidence interval:
-Inf 4227.591
sample estimates:
mean of x
1885.8
Conclusion: Since the p-value (0.4611) is greater than α = 0.05, we fail to reject Ho and
conclude that there is not enough evidence to support the claim that the average number
of acres is less than 2000.
Activity 1.1
The average 1-ounce chocolate chip cookie contains 110 calories. A random sample of
15 different brands of 1-ounce chocolate chip cookies resulted in the following calorie
amounts. At the α = 0.01 level, is there sufficient evidence that the average calorie
content is greater than 110 calories?
100, 125, 150, 160, 185, 125, 155, 145, 160, 100, 150, 140, 135, 120, 110
SESSION 1.2
TWO SAMPLE INDEPENDENT T -TEST
Solution
Hypothesis:
Ho: there is no significant difference in the carbohydrate content of chocolate and
non-chocolate candy.
H1: there is a significant difference in the carbohydrate content of chocolate and
non-chocolate candy.
R codes
> x = c(29,25,17,36,41,25,32,29,38,34,24,27,29)
> y = c(41,41,37,29,30,38,39,10,29,55,29)
> t.test(x,y, conf.level = 0.99, var.equal = TRUE)
OUTPUT
data: x and y
sample estimates:
mean of x mean of y
29.69231 34.36364
Solution
Hypothesis:
Ho: there is no significant difference in the carbohydrate content of chocolate and
non-chocolate candy.
H1: there is a significant difference in the carbohydrate content of chocolate and
non-chocolate candy.
R codes
> x = c(29,25,17,36,41,25,32,29,38,34,24,27,29)
> y = c(41,41,37,29,30,38,39,10,29,55,29)
> t.test(x,y, conf.level = 0.99, var.equal = FALSE)
OUTPUT
data: x and y
-15.903601 6.560943
sample estimates:
mean of x mean of y
29.69231 34.36364
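The two runs above differ only in var.equal. One common way to choose between them is an
F test for equality of variances with var.test(). The sketch below applies it to the same
carbohydrate data; the 5% decision rule and the name equal_var are assumptions made for
illustration, not part of the original example.

```r
# Carbohydrate data from the Session 1.2 example
x <- c(29, 25, 17, 36, 41, 25, 32, 29, 38, 34, 24, 27, 29)
y <- c(41, 41, 37, 29, 30, 38, 39, 10, 29, 55, 29)

# F test of Ho: the two population variances are equal
vt <- var.test(x, y)
print(vt)

# Illustrative rule (an assumption): pool the variances only if the
# F test does not reject equality at the 5% level
equal_var <- vt$p.value > 0.05
print(t.test(x, y, conf.level = 0.99, var.equal = equal_var))
```

Like the t-test, the F test assumes normal populations, so the result is best read as a
guide rather than a strict rule.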
Hard body 21 17 17 20 16 17 15 20 23
Soft body 24 13 11 13 12 15 12 16
SESSION 1.3
DEPENDENT/PAIRED T -TEST
Example 1.3.1
A dietician wishes to see if a person’s cholesterol level will change if the diet is
supplemented by a certain mineral. Six subjects were pretested, and then they took the
mineral supplement for a 6-week period. The results are shown in the table below:
Subject 1 2 3 4 5 6
Before 210 235 208 190 172 244
After 190 170 210 188 173 228
Can it be concluded that the cholesterol level has been changed at α = 0.10? Assume the
variable is approximately normal.
Solution
Hypothesis:
H o : There is no difference in the mean cholesterol level.
H 1 : There is a significant difference in the mean cholesterol level.
R codes:
> b = c(210,235,208,190,172,244)
> a = c(190,170,210,188,173,228)
> t.test(b,a,conf.level=0.90,paired=TRUE)
OUTPUT
Paired t-test
data: b and a
-4.22040 37.55373
sample estimates:
16.66667
Conclusion: Since the 90 percent confidence interval (-4.22040, 37.55373) contains 0, we
fail to reject Ho and conclude that there is not enough evidence of a difference in the
mean cholesterol level.
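A paired t-test is the same as a one-sample t-test on the within-subject differences. The
sketch below checks this with the cholesterol data from Example 1.3.1, using the 90%
confidence level that matches α = 0.10; the names d, paired_res, and diff_res are
illustrative.

```r
b <- c(210, 235, 208, 190, 172, 244)   # before the supplement
a <- c(190, 170, 210, 188, 173, 228)   # after the supplement

d <- b - a                             # per-subject differences

paired_res <- t.test(b, a, paired = TRUE, conf.level = 0.90)
diff_res   <- t.test(d, mu = 0, conf.level = 0.90)

print(mean(d))                         # mean of the differences
print(c(paired = paired_res$p.value, differences = diff_res$p.value))
print(diff_res$conf.int)               # 90% interval for the mean difference
```

Working with the differences directly also makes it easy to inspect them, for example
with a histogram, before relying on the normality assumption.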
Example 1.3.2
A reporter hypothesizes that the average assessed values of land in a large city have
changed during a 5-year period. A random sample of wards is selected, and the data (in
millions of Ghana cedis) are shown. At α = 0.01, can it be concluded that the average
taxable assessed values have changed? Use the P-value method.
Solution
Hypothesis:
Ho: There is no difference between the average assessed values of land in 2007 and 2006
H1: There is a difference between the average assessed values of land in 2007 and 2006
R codes:
> x = c(344.4,207.0,169.0,1711.5,861.8)
> y = c(1262.0,960.0,529.0,1969.0,1405.0)
> t.test(x,y,conf.level=0.99,alternative="two.sided",paired=TRUE)
OUTPUT
Paired t-test
data: x and y
-1127.052469 -5.467531
sample estimates:
-566.26
Activity 1.3
A medical researcher wishes to see if he can lower the cholesterol levels through diet in
six people by showing a film about the effects of high cholesterol levels. The data is
shown below.
Patient 1 2 3 4 5 6
Before 243 216 214 222 206 219
After 215 202 198 195 204 213
UNIT TWO
REGRESSION
OVERVIEW
This unit is divided into three sessions. Session 2.1 considers the simple case of linear
regression models in R. R has a built-in function, lm, for fitting linear regression
models.
Session 2.2 deals with multiple linear regression in R, and Session 2.3 introduces
exponential regression models with R code.
CONTENT
Session 2.1 Linear regression model
Session 2.2 Multiple regression model
Session 2.3 Exponential regression
REQUIRED READINGS
1. Bluman, A. G. (2009). Elementary statistics: A step-by-step approach. New York,
NY: McGraw-Hill Higher Education.
2. Ugarte, M. D., Militino, A. F., & Arnholt, A. T. (2008). Probability and Statistics
with R. CRC Press.
3. Kerns, G. J. (2010). Introduction to probability and statistics using R. Lulu.com.
4. Schmuller, J. (2017). Statistical Analysis with R for Dummies. John Wiley & Sons.
5. Crawley, M. J. (2005). Statistics: An introduction using R. Wiley.
LEARNING OUTCOMES
By the completion of this unit, the students should be able to;
1. Estimate the relationship between dependent and independent variables using linear
regression in R
SESSION/ EXAMPLES/ACTIVITIES
This session presents video activities, session examples, and other activities. The video
activities are YouTube videos on regression in R. The session examples are worked
examples that illustrate the R code for each type of regression considered. The examples
are followed by activities. These ACTIVITIES are to be solved and submitted for grading.
The deadline for submission is one week after each lecture.
Video Activity
1. https://www.youtube.com/watch?v=66z_MRwtFJM (linear regression)
2. https://www.youtube.com/watch?v=u1cc1r_Y7M0 (multiple regression)
3. https://www.youtube.com/watch?v=hokALdIst8k (exponential regression)
SESSION 2.1
LINEAR REGRESSION MODEL
‘lm’ is used to fit linear models. It can be used to carry out regression, single stratum
analysis of variance, and analysis of covariance (although ‘aov’ may provide a more
convenient interface for these).
lm(formula, data, subset, weights, na.action, method = "qr", model = TRUE, x =
FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE, contrasts = NULL,
offset, ...)
Linear Regression
R-code for linear regression
>x=c(numeric values)
>y=c(numeric values)
>a=lm(y~x)
>summary(a)
Results
OUTPUT
Call:
lm(formula = y ~ x)
Residuals:
Min 1Q Median 3Q Max
-2.7255 -1.3034 0.4168 1.5894 2.0108
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.6299 1.7085 3.881 0.00467 **
x 0.1546 0.3642 0.424 0.68244
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
MODEL: Y= 6.6299+0.1546x
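The fitted equation can be read off programmatically with coef() and used for prediction
with predict(). The sketch below assumes the model above was fitted to the same x and y
values that this session lists for the polynomial fits (those values reproduce the
coefficients shown); the new x values 4 and 5 are hypothetical.

```r
# Data assumed to be those behind the output above
x <- c(3.9, 2.1, 6.4, 5.7, 4.7, 2.8, 3.4, 7.5, 3.0, 4.5)
y <- c(6.5, 7.8, 5.2, 8.2, 9.2, 8.9, 7.3, 9.8, 5.6, 4.6)

fit <- lm(y ~ x)

# Intercept and slope of the fitted line y = b0 + b1 * x
print(coef(fit))

# Predicted y at two hypothetical x values
new_x <- data.frame(x = c(4, 5))
pred <- predict(fit, newdata = new_x)
print(pred)
```

The column name in new_x must match the predictor name used in the formula, otherwise
predict() will not use the new values.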
>x=c(3.9,2.1,6.4,5.7,4.7,2.8,3.4,7.5,3.0,4.5)
>y=c(6.5,7.8,5.2,8.2,9.2,8.9,7.3,9.8,5.6,4.6)
>b=lm(y~x+I(x^2))
>summary(b)
OUTPUT
Call:
lm(formula = y ~ x + I(x^2))
Residuals:
Min 1Q Median 3Q Max
-2.38024 -1.26603 0.08659 1.09552 2.57011
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 11.8590 4.8338 2.453 0.0439 *
x -2.3402 2.1926 -1.067 0.3213
I(x^2) 0.2612 0.2265 1.153 0.2867
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.836 on 7 degrees of freedom
MODEL: y = 11.8590 - 2.3402x + 0.2612x^2
>x=c(3.9,2.1,6.4,5.7,4.7,2.8,3.4,7.5,3.0,4.5)
>y=c(6.5,7.8,5.2,8.2,9.2,8.9,7.3,9.8,5.6,4.6)
>c=lm(y~x+I(x^2)+I(x^3))
>summary(c)
OUTPUT
Call:
lm(formula = y ~ x + I(x^2) + I(x^3))
Residuals:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.78399 14.68099 0.462 0.660
x 1.44018 10.50169 0.137 0.895
I(x^2) -0.59553 2.33259 -0.255 0.807
I(x^3) 0.05974 0.16178 0.369 0.725
Residual standard error: 1.961 on 6 degrees of freedom
>x=c(3.9,2.1,6.4,5.7,4.7,2.8,3.4,7.5,3.0,4.5)
>y=c(6.5,7.8,5.2,8.2,9.2,8.9,7.3,9.8,5.6,4.6)
>d=lm(y~x+I(x^2)+I(x^3)+I(x^4))
>summary(d)
OUTPUT
Call:
lm(formula = y ~ x + I(x^2) + I(x^3) + I(x^4))
Residuals:
1 2 3 4 5 6
-0.5671 -0.4279 -1.3427 1.5557 2.0758 1.9301
7 8 9 10
0.3901 0.2268 -1.2881 -2.5526
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 36.57466 46.70585 0.783 0.469
x -28.93120 46.28444 -0.625 0.559
I(x^2) 10.22251 16.19846 0.631 0.556
I(x^3) -1.54560 2.38226 -0.649 0.545
I(x^4) 0.08439 0.12491 0.676 0.529
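With four nested polynomial fits, a natural follow-up is to ask whether the extra terms
help. One possible sketch is to compare the models with anova() and with adjusted
R-squared. The model names m1 to m4 are illustrative, and this comparison is an addition
rather than part of the original examples.

```r
x <- c(3.9, 2.1, 6.4, 5.7, 4.7, 2.8, 3.4, 7.5, 3.0, 4.5)
y <- c(6.5, 7.8, 5.2, 8.2, 9.2, 8.9, 7.3, 9.8, 5.6, 4.6)

m1 <- lm(y ~ x)                              # linear
m2 <- lm(y ~ x + I(x^2))                     # quadratic
m3 <- lm(y ~ x + I(x^2) + I(x^3))            # cubic
m4 <- lm(y ~ x + I(x^2) + I(x^3) + I(x^4))   # quartic

# Sequential F tests: does each added power reduce the residual
# sum of squares by more than chance would explain?
cmp <- anova(m1, m2, m3, m4)
print(cmp)

# Adjusted R-squared penalizes the additional terms
adj_r2 <- sapply(list(m1, m2, m3, m4),
                 function(m) summary(m)$adj.r.squared)
print(adj_r2)
```

Large p-values in the anova table would suggest that the higher powers add little, which
is in line with the non-significant coefficients in the outputs above.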
SESSION 2.2
MULTIPLE REGRESSION MODEL
A regression with two or more explanatory variables is called a multiple regression.
Rather than modeling the mean as a straight line in linear regression, it is now modeled
as a function of several explanatory variables.
> x1=c(91,90,88,87,91,94,87,86)
> x2=c(25,21,24,25,25,26,25,25)
> y=c(240,236,270,274,301,316,300,296)
> a=lm(y~x1+x2)
> summary(a)
OUTPUT
Call:
lm(formula = y ~ x1 + x2)
Residuals:
1 2 3 4 5 6 7 8
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -43.4558 331.1189 -0.131 0.9007
x1 -0.1417 3.4741 -0.041 0.9690
x2 13.6828 6.2329 2.195 0.0796 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 24.8 on 5 degrees of freedom
Multiple R-squared: 0.4927,Adjusted R-squared: 0.2897
F-statistic: 2.428 on 2 and 5 DF, p-value: 0.1833
MODEL: y = -43.4558 - 0.1417x1 + 13.6828x2
SESSION 2.2.1
R CODE SYNTAX WHEN THERE IS INTERACTION AMONG THE INDEPENDENT VARIABLES
Example 2.2.1
> x1=c(91,90,88,87,91,94,87,86)
> x2=c(25,21,24,25,25,26,25,25)
>y=c(240,236,270,274,301,316,300,296)
>b=lm(y~x1+x2+x1*x2)
> summary(b)
OUTPUT
Call:
lm(formula = y ~ x1 + x2 + x1 * x2)
Residuals:
1 2 3 4 5 6 7 8
-36.003 5.462 -14.565 -10.977 24.997 7.283 15.023 8.780
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 15661.115 19353.552 0.809 0.464
x1 -174.234 214.539 -0.812 0.462
x2 -607.238 765.100 -0.794 0.472
x1:x2 6.880 8.477 0.812 0.463
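In an R formula, x1*x2 already expands to x1 + x2 + x1:x2, so the formula used in Example
2.2.1 can be written more briefly. The sketch below fits the three equivalent forms to the
same data and compares their coefficients; the object names b_long, b_short, and b_colon
are illustrative.

```r
x1 <- c(91, 90, 88, 87, 91, 94, 87, 86)
x2 <- c(25, 21, 24, 25, 25, 26, 25, 25)
y  <- c(240, 236, 270, 274, 301, 316, 300, 296)

b_long  <- lm(y ~ x1 + x2 + x1 * x2)   # as written in Example 2.2.1
b_short <- lm(y ~ x1 * x2)             # shorthand for main effects + interaction
b_colon <- lm(y ~ x1 + x2 + x1:x2)     # interaction term written explicitly

# All three formulas give the same four coefficients
print(rbind(long  = coef(b_long),
            short = coef(b_short),
            colon = coef(b_colon)))
```

The x1:x2 row in the summary output is the interaction coefficient in every case.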
SESSION 2.3
EXPONENTIAL REGRESSION
Given a table of data, you may want to find the law that governs it, say in the form
T = a b^t. Taking logarithms gives log(T) = log(a) + t log(b), which is linear in t and
can therefore be fitted with lm.
Example 2.3.1
The data below shows the cooling temperatures of a freshly brewed cup of coffee after it
is poured from the brewing pot into a serving cup. The brewing pot temperature is
approximately 180°F. Find the law in the form T = a b^t.
Time (t) 0 5 8 11 15 18 22 25 30 34 38 42 45 50
Temp 179.5 168.7 158.1 149.2 141.7 134.6 125.4 123.5 116.3 113.2 109.1 105.7 102.2 100.5
R-Code
>time=c(0,5,8,11,15,18,22,25,30,34,38,42,45,50)
>temp=c(179.5,168.7,158.1,149.2,141.7,134.6,125.4,123.5,
116.3,113.2,109.1,105.7,102.2,100.5)
>a=lm(log(temp)~time)
>summary(a)
OUTPUT
Call:
lm(formula = log(temp) ~ time)
Residuals:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.1443601 0.0172782 297.74 < 2e-16 ***
time -0.0118227 0.0005988 -19.74 1.62e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.03415 on 12 degrees of freedom
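The output above is on the log scale, so the constants of the law T = a b^t still have to
be recovered: the intercept estimates log(a) and the slope estimates log(b). A minimal
sketch of the back-transformation, with illustrative names a_hat and b_hat:

```r
time <- c(0, 5, 8, 11, 15, 18, 22, 25, 30, 34, 38, 42, 45, 50)
temp <- c(179.5, 168.7, 158.1, 149.2, 141.7, 134.6, 125.4,
          123.5, 116.3, 113.2, 109.1, 105.7, 102.2, 100.5)

fit <- lm(log(temp) ~ time)

# log(T) = log(a) + t * log(b), so exponentiate both coefficients
a_hat <- exp(coef(fit)[["(Intercept)"]])
b_hat <- exp(coef(fit)[["time"]])
cat("T =", round(a_hat, 2), "*", round(b_hat, 5), "^ t\n")

# Fitted temperatures on the original scale
fitted_temp <- a_hat * b_hat^time
print(round(fitted_temp, 1))
```

Because the slope is negative, b_hat is below 1, which corresponds to a temperature that
decays over time.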
Example 2.3.2
Given the period and mean distances of some of the planets, you are to find a law in the
form P = k s^n.
R-Code
>p=c(87.97,224.7,365.3,687.0,4333,10760)
>s=c(58,108,150,228,778,1426)
>a=lm(log(p)~log(s))
>summary(a)
OUTPUT
Call:
lm(formula = log(p) ~ log(s))
Residuals:
1 2 3 4 5 6
-9.492e-04 3.871e-03 -3.153e-03 1.154e-04 -9.961e-05 2.155e-04
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.6154480 0.0052541 -307.5 6.71e-10 ***
log(s) 1.5006720 0.0009335 1607.5 8.99e-13 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.002544 on 4 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: 1
F-statistic: 2.584e+06 on 1 and 4 DF, p-value: 8.986e-13
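For a power law P = k s^n, both variables are logged: log(P) = log(k) + n log(s). The
slope is then the exponent n itself, and only the intercept needs to be exponentiated.
The sketch below recovers both constants from the fit in Example 2.3.2; the names k_hat
and n_hat are illustrative.

```r
p <- c(87.97, 224.7, 365.3, 687.0, 4333, 10760)   # period
s <- c(58, 108, 150, 228, 778, 1426)              # mean distance

fit <- lm(log(p) ~ log(s))

k_hat <- exp(coef(fit)[["(Intercept)"]])   # intercept estimates log(k)
n_hat <- coef(fit)[["log(s)"]]             # slope estimates n directly

cat("P =", signif(k_hat, 4), "* s ^", round(n_hat, 4), "\n")
```

An exponent close to 1.5 agrees with Kepler's third law, under which the square of the
period is proportional to the cube of the mean distance.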
Example 2.3.3
The width of successive whorls of a shell of Turbo duplicatus has been measured.
Find the law in the form w = a b^n.
Position of whorl (n) 1 2 3 4 5 6 7 8
Width of whorl (w cm) 3.33 2.84 2.39 2.03 1.70 1.45 1.22 1.04
R-Code
>n=c(1,2,3,4,5,6,7,8)
>w=c(3.33,2.84,2.39,2.03,1.70,1.45,1.22,1.04)
>r=lm(log(w)~n)
>summary(r)
OUTPUT
Call:
lm(formula = log(w) ~ n)
Residuals:
Min 1Q Median 3Q Max
-0.0065511 -0.0033214 0.0006323 0.0036527 0.0049239
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.3733474 0.0035109 391.2 1.88e-14 ***
n -0.1672336 0.0006953 -240.5 3.48e-13 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Activity 2.3.1
In a study on speed and braking distance, researchers looked for a method to estimate
how fast a person was traveling before an accident by measuring the length of the skid
marks. An area that was focused on in the study was the distance required to completely
stop a vehicle at various speeds. Use the following table to find the linear regression
equation.
MPH Braking distance (feet)
20 20
30 45
40 81
50 133
60 205
80 411
Activity 2.3.2
The nursing instructor wishes to see whether a student’s grade point average and age are
related to the student’s score on the state board nursing examination. She selects five
students and obtains the following data. Obtain the multiple linear regression equation
from the data.
Activity 2.3.3
If V = k D^r, find the values of k and r using the table below.
Diameter 4.4 4.6 5 5.1 5.1 5.2 5.2 5.5 5.5 5.6
Volume 2 2.2 3 4.3 3 2.9 3.5 3.4 5 7.2
UNIT THREE
NON-PARAMETRIC TEST
OVERVIEW
This unit will introduce you to several non-parametric hypothesis tests, namely the
Spearman and Kendall rank correlation tests, the sign test, the Wilcoxon signed-rank and
rank-sum tests, the runs test for randomness, and the Kruskal-Wallis test. We will discuss
the applications of these tests using the R software.
CONTENT
Session 3.1 Spearman and Kendall Rank Correlation Test
Session 3.2 Sign Test
Session 3.3 Wilcoxon Sign and Sum Test
Session 3.4 Randomness Test (Categorical and Continuous)
Session 3.5 Kruskal-Wallis Test
REQUIRED READINGS
1. Bluman, A. G. (2009). Elementary statistics: A step-by-step approach. New York,
NY: McGraw-Hill Higher Education.
2. Ugarte, M. D., Militino, A. F., & Arnholt, A. T. (2008). Probability and Statistics
with R. CRC Press.
3. Kerns, G. J. (2010). Introduction to probability and statistics using R. Lulu.com.
4. Schmuller, J. (2017). Statistical Analysis with R for Dummies. John Wiley & Sons.
5. Crawley, M. J. (2005). Statistics: An introduction using R. Wiley.
LEARNING OUTCOMES
By the completion of this unit, the students should be able to;
1. Perform Spearman and Kendall Rank Correlation Tests using R software.
4. Perform Randomness tests for categorical and continuous cases using R software.
SESSION/EXAMPLES/ACTIVITIES
This session presents video activities, session examples, and other activities. The video
activities are YouTube videos on computing Spearman and Kendall rank correlation
coefficients, the sign test, the Wilcoxon signed-rank and rank-sum tests, tests of
randomness, and the Kruskal-Wallis test in R. The session examples are solved questions on
each topic, illustrating the R code for the various techniques. Each example is followed
by one or more activities. These ACTIVITIES are to be solved and submitted for grading.
The deadline for submission is one week after each lecture.
Video Activity
1. https://www.youtube.com/watch?v=F0lvYZmxib8 (Spearman and Kendall
Rank Correlation)
SESSION 3.1
SPEARMAN AND KENDALL RANK CORRELATION TEST
Test for association between paired samples, using one of Pearson's product moment
correlation coefficient, Kendall's tau or Spearman's rho.
cor.test(x, y,
alternative = c("two.sided", "less", "greater"),
method = c("pearson", "kendall", "spearman"),
exact = NULL, conf.level = 0.95, continuity = FALSE, ...)
Solution
R-Code
>a=c(4,10,18,20,12,2,5,9)
>b=c(4,6,20,14,16,8,11,7)
>cor.test(a,b,method="spearman")
OUTPUT
Spearman's rank correlation rho
data: a and b
S = 30, p-value = 0.09618
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
0.6428571
Conclusion: We fail to reject Ho and conclude that there is not enough evidence of a
significant rank correlation between the two ratings.
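When there are no tied values, Spearman's rho can be computed by hand from the rank
differences as rho = 1 - 6 * sum(d^2) / (n(n^2 - 1)), and sum(d^2) is the statistic S that
cor.test() reports. The sketch below reproduces both for the two ratings; the names d, S,
and rho_manual are illustrative.

```r
a <- c(4, 10, 18, 20, 12, 2, 5, 9)
b <- c(4, 6, 20, 14, 16, 8, 11, 7)

d <- rank(a) - rank(b)          # difference in ranks for each pair
n <- length(a)
S <- sum(d^2)                   # the S statistic in the cor.test output

rho_manual  <- 1 - 6 * S / (n * (n^2 - 1))
rho_builtin <- cor(a, b, method = "spearman")

print(c(S = S, manual = rho_manual, builtin = rho_builtin))
```

The shortcut formula is exact only without ties; with ties, cor() computes the Pearson
correlation of the average ranks instead.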
R-Code
>a=c(4,10,18,20,12,2,5,9)
>b=c(4,6,20,14,16,8,11,7)
>cor.test(a,b,method="kendall")
OUTPUT
Conclusion: We fail to reject Ho and conclude that there is not enough evidence of a
significant rank correlation between the two ratings.
Activity 3.1.1
As a biologist, you wish to see if there is a relationship between the heights of tall trees
and their diameters. You find the following data for the diameter (in inches) of the tree
at 4.5 feet from the ground and the corresponding heights (in feet).
Diameter (in.) Height (ft)
1024 261
950 321
451 219
505 281
761 159
644 83
707 191
586 141
442 232
546 108
Perform both tests and write a short statement comparing the results.
Activity 3.1.2
The data below show the number of books published in six different subject areas for the
years 1980 and 2004. Use α= 0.05 to see if there is a relationship between the two data
sets.
agriculture   home economics   literature   music   science   sports and recreation
SESSION 3.2.1
SIGN TEST
The simplest nonparametric test, the sign test for single samples, is used to test the value
of a median for a specific sample.
SIGN.test(x, md) requires the “BSDA” package in R.
Example 3.2.1
A convenience store owner hypothesizes that the median number of snow cones she sells
per day is 40. A random sample of 20 days yields the following data for the number of
snow cones sold each day.
18, 43, 40, 16, 22, 30, 29, 32, 37, 36, 39, 34, 39, 45, 28, 36, 40, 34, 39, 52
Solution
Ho=The median number of snow cones she sells per day is 40.
Ha=The median number of snow cones she sells per day is not 40.
R-Code
>x=c(18, 43, 40, 16, 22, 30, 29, 32, 37, 36,39, 34, 39, 45,
28, 36, 40, 34, 39, 52)
>library(BSDA)
>SIGN.test(x, md=40)
OUTPUT
One-sample Sign-Test
data: x
s = 3, p-value = 0.007538
alternative hypothesis: true median is not equal to 40
sample estimates:
median of x
36
Conclusion: There is enough evidence to reject the claim that the median number of
snow cones sold per day is 40.
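The sign test is a binomial test on the number of observations above the hypothesized
median, with values equal to the median dropped. It can therefore be reproduced in base R
with binom.test(), without the BSDA package. A minimal sketch for Example 3.2.1, with
illustrative names above, below, and n_eff:

```r
x <- c(18, 43, 40, 16, 22, 30, 29, 32, 37, 36, 39, 34, 39, 45,
       28, 36, 40, 34, 39, 52)
md <- 40                         # hypothesized median

above <- sum(x > md)             # plus signs (the s in the output above)
below <- sum(x < md)             # minus signs
n_eff <- above + below           # values equal to md are dropped

# Under Ho each remaining value is above md with probability 0.5
sign_p <- binom.test(above, n_eff, p = 0.5)$p.value
print(c(above = above, below = below, p.value = sign_p))
```

The p-value agrees with the 0.007538 reported by SIGN.test above.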
SESSION 3.2.2
SIGN TEST (PAIRED SAMPLE)
The sign test can also be used to test sample means in a comparison of two dependent
samples, such as a before-and-after test.
Example 3.2.2
A medical researcher believed the number of ear infections in swimmers can be reduced
if the swimmers use earplugs. A sample of 10 people was selected, and the number of
infections for four months was recorded. During the first two months, the swimmers did
not use the earplugs; during the second two months, they did. At the beginning of the
second two-month period, each swimmer was examined to make sure that no infections
were present. The data are shown below. At α = 0.05, can the researcher conclude that
using earplugs reduced the median number of ear infections?
Solution
Ho=The median number of ear infections will not be reduced.
Ha= The median number of ear infections will be reduced.
Let x=before and y=after
R-Code
>x=c(3,0,5,4,2,4,3,5,2,1)
>y=c(2,1,4,0,1,3,1,3,2,3)
>w=x-y
>library(BSDA)
>SIGN.test(w)
OUTPUT
One-sample Sign-Test
data: w
s = 7, p-value = 0.1797
alternative hypothesis: true median is not equal to 0
95 percent confidence interval:
-0.6755556 2.0000000
sample estimates:
median of x
1
Conclusion: There is not enough evidence to support the claim that the use of earplugs
reduced the median number of ear infections.
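The output above reports a two-sided test, while Ha in this example is one-sided (the
number of infections is reduced, so the before-minus-after differences tend to be
positive). The sketch below uses base R's binom.test() to show both p-values for the same
differences; treating the one-sided version as the better match for Ha is an
interpretation added here, not part of the original solution.

```r
x <- c(3, 0, 5, 4, 2, 4, 3, 5, 2, 1)   # before (no earplugs)
y <- c(2, 1, 4, 0, 1, 3, 1, 3, 2, 3)   # after (earplugs)
w <- x - y

pos <- sum(w > 0)                # swimmers with fewer infections afterwards
neg <- sum(w < 0)                # swimmers with more infections afterwards

p_two <- binom.test(pos, pos + neg)$p.value
p_one <- binom.test(pos, pos + neg, alternative = "greater")$p.value
print(c(two.sided = p_two, one.sided = p_one))
```

Both p-values exceed 0.05, so the decision not to reject Ho is the same either way.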
Activity 3.2
The median age for the total population of the state of Maine is 41.2, the highest in the
nation. The mayor of a particular city believes that his population is considerably
“younger” and that the median age there is 36 years. At α = 0.05, is there sufficient
evidence to reject his claim? The data here represent a random selection of persons from
the household population of the city.
40, 56, 42, 72, 12, 22, 25, 43, 39, 48, 50, 37, 18, 35, 15, 30, 52, 45
SESSION 3.3.1
WILCOXON SIGNED-RANK TEST
When the samples are dependent, as they would be in a before-and-after test using the
same subjects, the Wilcoxon signed-rank test can be used in place of the t-test for
dependent samples.
Example 3.3.1
In a large department store, the owner wishes to see whether the number of shoplifting
incidents per day will change if the number of uniformed security officers is doubled. A
sample of 7 days before security is increased and 7 days after the increase shows the
number of shoplifting incidents.
Is there enough evidence to support the claim, at α = 0.05, that there is a difference in
the number of shoplifting incidents before and after the increase in security?
Solution
Ho = There is no difference in the number of shoplifting incidents before and after the
increase in security.
Ha = There is a difference in the number of shoplifting incidents before and after the
increase in security.
R-Code
>before=c(7,2,3,6,5,8,12)
>after=c(5,3,4,3,1,6,4)
>wilcox.test(before,after,paired=TRUE)
OUTPUT
Conclusion: We fail to reject Ho and conclude that there is not enough evidence to
support the claim that there is a difference in the number of shoplifting incidents.
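The statistic V reported by wilcox.test() for paired data is the sum of the ranks of the
positive differences. The sketch below computes it by hand for the shoplifting data.
Because the differences contain ties, exact = FALSE is passed so that R uses the normal
approximation without warning about exact p-values; the names d, r, V_pos, and V_neg are
illustrative.

```r
before <- c(7, 2, 3, 6, 5, 8, 12)
after  <- c(5, 3, 4, 3, 1, 6, 4)

d <- before - after
d <- d[d != 0]                   # zero differences would be dropped
r <- rank(abs(d))                # tied absolute differences get average ranks

V_pos <- sum(r[d > 0])           # sum of ranks of positive differences
V_neg <- sum(r[d < 0])           # sum of ranks of negative differences

res <- wilcox.test(before, after, paired = TRUE, exact = FALSE)
print(c(V_pos = V_pos, V_neg = V_neg, builtin_V = unname(res$statistic)))
print(res$p.value)
```

Textbook tables usually work with the smaller of the two rank sums, whereas R reports the
positive one, so the two statistics can differ even when the conclusions agree.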
SESSION 3.3.2
WILCOXON RANK-SUM TEST
The Wilcoxon rank-sum test is used for independent samples.
Example 3.3.2
Two independent samples of the army and marine recruits are selected, and the time in
minutes it takes each recruit to complete an obstacle course is recorded, as shown in the
table below.
Army 15 18 16 17 13 22 24 17 19 21 26 28
Marines 14 9 16 19 10 12 11 8 15 18 25
At α = 0.05, is there a difference in the times it takes the recruits to complete the
course?
Solution
𝐻o = There is no difference in the times it takes the recruits to complete the obstacle
course.
𝐻A = There is a difference in the time it takes the recruits to complete the obstacle
course.
R-Code
>x=c(15,18,16,17,13,22,24,17,19,21,26,28)
>y=c(14,9,16,19,10,12,11,8,15,18,25)
>wilcox.test(x,y)
OUTPUT
Conclusion: Reject Ho and conclude that there is enough evidence to support the claim
that there is a difference in the times it takes the recruits to complete the course.
Activity 3.3
Two groups of alcoholics, one group male, and the other female were asked at what age
they first drank alcohol. The data are shown here. Using the Wilcoxon rank-sum test at α
= 0.05, is there a difference in the ages of the females and males?
Males 6 12 14 16 17 17 13 12 10 11
Females 8 9 9 12 14 15 12 16 17 19
SESSION 3.4
RUNS TEST FOR RANDOMNESS
The runs.test function performs the runs test for randomness (Mendenhall and Reinmuth
1982) for continuous data. It also computes the runs test for randomness of a dichotomous
(binary) data series ‘x’. Users can choose whether to plot the correlation graph, and
whether to test against a two-sided, negative, or positive correlation. ‘NA’s in the data
are omitted.
SESSION 3.4.1
RANDOMNESS TESTS FOR CATEGORICAL AND CONTINUOUS DATA
When samples are selected, you assume that they are selected at random. How do you
know if the data obtained from a sample are truly random?
R-code for Runs Test
runs.test(a) requires “tseries” package for categorical case
runs.test(a) requires “lawstat” package for continuous case
R-Code
>a=factor(c("F","F","F","M","M","F","F","F","F","M","F","M",
"M","M","F","F","F","F","M","M","F","F","F","M","M"))
> library(tseries)
> runs.test(a)
Alternative code
>a=scan(what="")
1: F
2: F
..
..
..
..
25: M
>library(tseries)
>runs.test(factor(a))
OUTPUT
Runs Test
data: f
Standard Normal = -1.2792, p-value = 0.2008
alternative hypothesis: two.sided
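The statistic in this output can be reproduced without any package by counting runs with
rle() and applying the normal approximation, where the expected number of runs is
2 n1 n2 / n + 1 and its variance is 2 n1 n2 (2 n1 n2 - n) / (n^2 (n - 1)). The following
is a sketch under that approximation; the names runs, mu_r, var_r, and z are illustrative.

```r
a <- c("F","F","F","M","M","F","F","F","F","M","F","M",
       "M","M","F","F","F","F","M","M","F","F","F","M","M")

runs <- length(rle(a)$lengths)   # number of runs of identical symbols
n1 <- sum(a == "F")
n2 <- sum(a == "M")
n  <- n1 + n2

mu_r  <- 2 * n1 * n2 / n + 1                                  # expected runs
var_r <- 2 * n1 * n2 * (2 * n1 * n2 - n) / (n^2 * (n - 1))    # variance of runs

z <- (runs - mu_r) / sqrt(var_r)
p_value <- 2 * pnorm(-abs(z))    # two-sided p-value
print(c(runs = runs, z = z, p.value = p_value))
```

Fewer runs than expected gives a negative z, which points towards clustering of the same
symbol rather than alternation.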
Conclusion: There is not enough evidence to reject the hypothesis that the passengers
board the train at random according to gender.
Twenty people enrolled in a drug abuse program. Test the claim that the ages of the
people, according to the order in which they enroll, occur at random, at α = 0.05.
The data are:
18, 36, 19, 22, 25, 44, 23, 27, 27, 35, 19, 43, 37, 32, 28, 43, 46, 19, 20, 22.
Solution
Ho = The ages of the people, according to the order in which they enroll in a drug
program, occur at random.
Ha = The ages of the people, according to the order in which they enroll in a drug
program, do not occur at random.
R-Code
>a=c(18, 36, 19, 22, 25, 44, 23, 27, 27, 35, 19, 43, 37,
32, 28, 43, 46, 19, 20, 22)
>library(lawstat)
>runs.test(a)
OUTPUT
Conclusion: There is not enough evidence to reject the hypothesis that the ages of the
people who enroll occur at random.
Activity 3.4
1. As students, faculty, friends, and family arrived for the Spring Wind Ensemble
Concert at Shafer Auditorium, they were asked whether they were going to sit on
the balcony (B) or the ground floor (G). Use the responses listed below and test
for randomness at α = 0.05.
BBGGBBGBBBBBBGBBGGBBBBGGGGBGBBBGG
2. A school dentist wanted to test the claim, at α = 0.05, that the number of cavities
in fourth-grade students is random. Forty students were checked, and the number
of cavities each had is shown here. Test for the randomness of the values above
or below the median.
0 4 6 0 6 2 5 3 1 5 1 2 2 1 3 7 3 6 0 2 6 0 2 3 1 5 2 1 3 0 2 3 7 3 1 5 1 1 2 2
SESSION 3.5
KRUSKAL-WALLIS TEST
The analysis of variance uses the F test to compare the means of three or more
populations. The assumptions for the ANOVA test are that the populations are normally
distributed and that the population variances are equal. When these assumptions cannot be
met, the nonparametric Kruskal-Wallis test, sometimes called the H test, can be used
to compare three or more means.
Example 3.5.1
A researcher tests three different brands of breakfast drinks to see how many
milliequivalents of potassium per quart each contains. The data obtained are entered in
the R code below.
a. At α = 0.05, is there enough evidence to reject the hypothesis that all brands
contain the same amount of potassium?
Solution
Ho: There is no difference in the amount of potassium contained in the brands.
Ha: There is a difference in the amount of potassium contained in the brands.
R-Code
>a=c(4.7,3.2,5.1,5.2,5.0)
>b=c(5.3,6.4,7.3,6.8,7.2)
>c=c(6.3,8.2,6.2,7.1,6.6)
>kruskal.test(list(a,b,c))
OUTPUT
Kruskal-Wallis rank sum test
data: list(a, b, c)
Kruskal-Wallis chi-squared = 9.38, df= 2, p-value =
0.009187
Conclusion: Since the p-value (0.009187) is less than 0.05, we reject the null
hypothesis. There is enough evidence to conclude that there is a difference in the
amount of potassium contained in the three brands.
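The Kruskal-Wallis statistic is built from the ranks of the pooled data, and it can be computed directly to see where the reported value comes from. The sketch below is a base-R illustration that assumes there are no tied values (true for these data); the variable names values, brand, R and n are chosen here for the example.

```r
a  <- c(4.7, 3.2, 5.1, 5.2, 5.0)
b  <- c(5.3, 6.4, 7.3, 6.8, 7.2)
c3 <- c(6.3, 8.2, 6.2, 7.1, 6.6)   # named c3 to avoid masking the function c()

values <- c(a, b, c3)
brand  <- factor(rep(c("A", "B", "C"), each = 5))

r <- rank(values)                  # ranks over the pooled sample
N <- length(values)
R <- tapply(r, brand, sum)         # rank sum for each brand
n <- tapply(r, brand, length)      # sample size for each brand

# H statistic (no tie correction needed because all values are distinct)
H  <- 12 / (N * (N + 1)) * sum(R^2 / n) - 3 * (N + 1)
df <- nlevels(brand) - 1
p  <- pchisq(H, df, lower.tail = FALSE)
c(H = H, df = df, p = p)
```

This reproduces H = 9.38 on 2 degrees of freedom with a p-value of about 0.0092, the same values printed by kruskal.test above.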
Activity 3.5.1
You are researching an article on the waterfalls on our planet. You want to make a state-
ment about the heights of waterfalls on three continents. Three samples of waterfall
heights (in feet) are shown.
North America   Africa   Asia
600             406      330
1200            508      830
182             630      614
620             726      1100
1170            480      885
442             2014     330
Activity 3.5.2
Samples of three different types of wrapping tape are tested for breaking strength, in
pounds. The data are shown here. At α = 0.05, is there a difference in the breaking
strength of the tapes? Use the Kruskal-Wallis test.
Type A 225 332 404 387 351 280 362 431 266
Type B 256 203 261 305 232 278 261 299 272
Type C 406 427 481 397 351 409 462 471 399
UNIT FOUR
CHI-SQUARE TEST
OVERVIEW
In this unit, chi-square goodness of fit and homogeneity tests will be considered. These
techniques are under the nonparametric methods of hypothesis testing. We will discuss
the applications of the chi-square tests and also compute test statistics and hypothesis
tests using R software.
CONTENT
Session 4.1 Goodness of Fit: A chi-square test used to see whether a frequency
distribution fits a specific pattern
Session 4.2 Test of Homogeneity
REQUIRED READINGS
LEARNING OUTCOMES
By the completion of this unit, the students should be able to:
1. Perform the goodness of fit test with chi-square using R
SESSION/EXAMPLES/ACTIVITIES
This session presents video activities, session examples, and other activities. Video
Activities include YouTube videos for the goodness of fit test with chi-square and the test of
homogeneity in R software. Session Examples include solved questions on each
technique in R software. Each example is followed by one or more activities. Please note
that these ACTIVITIES are to be solved and submitted for grading. The deadline for
submission is one week after each lecture.
Video Activity
1. https://www.youtube.com/watch?v=n5c11B5FJ24 (Goodness of Fit Test)
SESSION 4.1
GOODNESS OF FIT
‘chisq.test’ performs chi-squared contingency table tests and
goodness-of-fit tests.
Example 4.1.1
The data below show the preference in the selection of fruit soda flavors. Sellers claim
that there is no preference in the selection of fruit soda flavors.
Is there enough evidence to reject the claim that there is no preference in the selection of
fruit soda flavors, using the data above? Let α = 0.05.
Solution:
𝐻o: Consumers show no preference for flavors of the fruit soda.
𝐻A: Consumers show a preference for flavors of the fruit soda.
R-Code
>a=c(32,28,16,14,10)
>p=c(0.2,0.2,0.2,0.2,0.2)
>chisq.test(a,p=p)
OUTPUT
Conclusion: We reject the null hypothesis and conclude that consumers show a
preference for flavors.
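The goodness-of-fit statistic compares each observed count with the count expected under the null hypothesis. The sketch below works through that calculation in base R for the soda data, as a check on what chisq.test reports; the variable names observed, p0, expected and crit are chosen here for the example.

```r
observed <- c(32, 28, 16, 14, 10)   # counts from the example
p0 <- rep(0.2, 5)                   # equal preference under Ho

expected <- sum(observed) * p0      # expected count per flavor under Ho
x2 <- sum((observed - expected)^2 / expected)   # chi-square statistic
df <- length(observed) - 1
pval <- pchisq(x2, df, lower.tail = FALSE)
crit <- qchisq(0.95, df)            # critical value at alpha = 0.05
c(x2 = x2, df = df, crit = crit, p = pval)
```

The statistic is 18 on 4 degrees of freedom, which exceeds the critical value of about 9.488, so the null hypothesis of no preference is rejected at the 0.05 level.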
Activity 4.1
Home-Schooled Student Activities: Students who are home-schooled often attend their
local schools to participate in various types of activities such as sports or musical
ensembles. According to the government, 82% of home-schoolers receive their
education entirely at home, while 12% attend school up to 9 hours per week and 6%
spend from 9 to 25 hours per week at school. A survey of 85 students who are home-
schooled revealed the following information about where they receive their education.
At α = 0.05, is there sufficient evidence to conclude that the proportions differ from
those stated by the government?
SESSION 4.2
TEST OF HOMOGENEITY
R-Syntax for test of homogeneity:
chisq.test(x); where x is a cross-tabulation of the data
Example 4.2.1
A researcher selected 100 passengers from each of the 3 airlines and asked them if the
airline had lost their luggage on their last flight. At a 0.05 level of significance, test the
claim that the proportion of passengers from each airline who lost luggage on the flight
is the same for each airline. The data are shown in the table.
      Airline 1   Airline 2   Airline 3   Total
Yes       10           7           4        21
No        90          93          96       279
Solution
𝐻o: The proportion of passengers from each airline who lost luggage on the flight is the
same for each airline.
𝐻A: The proportion of passengers from each airline who lost luggage on the flight is not
the same for each airline.
R-Code
>e=rbind(c(10,90),c(7,93),c(4,96))
>chisq.test(e)
OUTPUT
Conclusion: There is not enough evidence to reject the null hypothesis that the
proportion of passengers who lost luggage on the flight is the same for
each airline.
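The object returned by chisq.test also holds the expected counts and the individual pieces of the test, which helps when the printed output is not at hand. The sketch below rebuilds the airline table with labels and extracts those pieces; the row and column labels are added here only for readability and are not part of the original code.

```r
# One row per airline, columns = lost luggage (Yes / No)
e <- rbind(c(10, 90), c(7, 93), c(4, 96))
dimnames(e) <- list(Airline = c("Airline 1", "Airline 2", "Airline 3"),
                    Lost = c("Yes", "No"))

res <- chisq.test(e)
res$expected      # expected counts if all airlines share one loss rate
res$statistic     # chi-squared statistic
res$parameter     # degrees of freedom: (3 - 1) * (2 - 1) = 2
res$p.value       # compare with alpha = 0.05
```

Each airline has an expected count of 7 lost and 93 not lost, and the p-value is greater than 0.05, which matches the decision not to reject the null hypothesis.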
Activity 4.2.1
Endangered or Threatened Species: Can you conclude a relationship between the class of
vertebrate and whether it is endangered or threatened? Use the 0.05 level of significance.
Is there a different result for the 0.01 level of significance?
Activity 4.2.2
Is there sufficient evidence at the 5% level of significance to conclude that a relationship
exists between the city and the number of television and radio stations that it has?
                         TV   Radio
Albuquerque, N. Mex.     13    32
Boston, Mass.            12    21
St. Petersburg, Fla.     17    41
Minneapolis, Minn.        7    30
Toledo, Ohio              6    22
UNIT FIVE
ANALYSIS OF VARIANCE
OVERVIEW
In this unit, the F test, which is also used to compare two variances, is used to test
claims involving three or more means. We will discuss the applications of the various analysis of
variance tests using R software. The two-way ANOVA is an extension of the one-way
analysis of variance; it involves two independent variables. The independent variables
are also called factors.
CONTENT
Session 5.1 One Way ANOVA
Session 5.2 Two Way ANOVA
Session 5.3 Multiple Analysis of Variance
REQUIRED READINGS
LEARNING OUTCOMES
By the completion of this unit, the students should be able to:
1. Perform the one-way ANOVA test using R
SESSION/EXAMPLES/ACTIVITIES
This session presents video activities, session examples, and other activities. Video
Activities include YouTube videos for performing one-way ANOVA, two-way ANOVA
and Multiple ANOVA in R software. Session Examples include solved questions on
each test in R software; they illustrate the R code for carrying out the ANOVA tests.
Each example is followed by one or more activities. Please note that these ACTIVITIES
are to be solved and submitted for grading. The deadline for submission is one week
after each lecture.
Video Activity
1. https://www.youtube.com/watch?v=4DeCaCaC2JQ (one-way ANOVA)
SESSION 5.1
ONE WAY ANOVA
Fit an analysis of variance model by a call to ‘lm’ for each
stratum.
aov(formula, data = NULL, projections = FALSE, qr = TRUE,
contrasts = NULL, ...)
Example 5.1.1:
A researcher wishes to try three different techniques to lower the blood pressure of
individuals diagnosed with high blood pressure. The subjects are randomly assigned to
three groups; the first group takes medication, the second group exercises, and the third
group follows a special diet. After four weeks, the reduction in each person’s blood
pressure is recorded. At 𝛼=0.05, test the claim that there is no difference among the
means. The data are shown below
Techniques
Medication (M)   10   12    9   15   13
Exercise (E)      6    8    3    0    2
Diet (D)          5    9   12    8    4
Solution:
𝐻o: The means of the three techniques are the same.
𝐻A: At least one mean is different from the others.
Note
• Enter the data in Excel.
• Save the data. (With the blood pressure example, the data is saved as ‘mydata’).
• Save the data as a comma-delimited (csv) file in your documents folder.
• Import the data into ‘R’ by using ‘data2=read.csv("mydata.csv")’.
R-Code
>data2=read.csv("mydata.csv")
>data2
>results=aov(pressure~techniques, data=data2)
>summary(results)
OUTPUT
Conclusion: The decision is to reject 𝐻o, since the F test value (9.17) is greater than the
critical value (3.89). We therefore conclude that at least one mean is different from the others.
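The same analysis can be run without Excel by typing the data straight into a data frame. The sketch below does this for the blood pressure example and extracts the F value and the critical value used in the conclusion. The column names pressure and techniques follow the R code above; the level labels M, E and D and the names tab, f_stat and f_crit are chosen here for the example.

```r
# Blood pressure reductions from the example, one row per subject
data2 <- data.frame(
  pressure   = c(10, 12, 9, 15, 13,   # Medication
                 6, 8, 3, 0, 2,       # Exercise
                 5, 9, 12, 8, 4),     # Diet
  techniques = factor(rep(c("M", "E", "D"), each = 5))
)

results <- aov(pressure ~ techniques, data = data2)
tab <- summary(results)[[1]]            # ANOVA table as a data frame
f_stat <- tab[["F value"]][1]           # F test value for techniques
f_crit <- qf(0.95, df1 = 2, df2 = 12)   # critical value at alpha = 0.05
round(c(F = f_stat, critical = f_crit), 2)
```

The degrees of freedom are 2 between groups (3 techniques minus 1) and 12 within groups (15 subjects minus 3 techniques), and the F value of 9.17 exceeds the critical value of 3.89.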
Activity 5.1
The following set of data values was obtained from a study of people’s perceptions on
whether the color of a person’s clothing is related to how intelligent the person looks.
The subjects rated the person’s intelligence on a scale of 1 to 10. Group 1 subjects were
randomly shown people with clothing in shades of blue and gray. Group 2 subjects were
randomly shown people with clothing in shades of brown and yellow. Group 3 subjects
were randomly shown people with clothing in shades of pink and orange. The results
follow.
Use ANOVA to test for any significant differences between the means.
SESSION 5.2
TWO-WAY ANOVA TEST
R Syntax for two-way anova:
aov(y~x + z, data)
Example 5.2.1:
A researcher wishes to see whether the type of gasoline used and the type of automobile
driven have any effect on gasoline consumption. Two types of gasoline, regular and
high-octane, will be used, and two types of automobiles, two-wheel-drive and
four-wheel-drive, will be used in each group. There will be two automobiles in each group,
for a total of eight automobiles used. Take 𝛼=0.05.
The data (in miles per gallon) are shown and summarized in the table below.
Solution
Note
• Enter the data in Excel.
• Save the data. (With the gasoline example, the data is saved as ‘mydata2’).
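For aov to work, the file must be laid out with one row per automobile and one column per variable. The sketch below builds that layout in R for the design described in the example (2 gasolines, 2 automobile types, 2 automobiles per group). The column names Gasoline, Gas and Automobile follow the R code in this session; the level labels and the Replicate column are assumptions made here for illustration, and the response is left as NA to be filled in with the observed miles per gallon.

```r
# One row per automobile: 2 gasolines x 2 drive types x 2 replicates
data3 <- expand.grid(
  Gas        = c("Regular", "High-octane"),
  Automobile = c("Two-wheel", "Four-wheel"),
  Replicate  = 1:2
)
data3$Gasoline <- NA_real_   # placeholder: enter the observed miles per gallon

# Both factors should be stored as factors, not numbers, for aov()
data3$Gas        <- factor(data3$Gas)
data3$Automobile <- factor(data3$Automobile)

data3
table(data3$Gas, data3$Automobile)   # 2 automobiles in every cell
```

A spreadsheet with these four columns, saved as a csv file, can then be read with read.csv as shown below.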
R-Code
>data3=read.csv("mydata2.csv")
>data3
>results=aov(Gasoline~Gas+Automobile,data=data3)
>summary(results)
OUTPUT
𝐻A: There is an interaction effect between the type of gasoline used and the type of
automobile a person drives on gasoline consumption.
R-Code
>data3=read.csv("mydata2.csv")
>data3
>results=aov(Gasoline~Gas+Automobile+Gas:Automobile,data=data3)
>summary(results)
OUTPUT
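The interaction model can also be written with the * operator in an R formula, which expands to both main effects plus their interaction. The short check below compares the terms produced by the two ways of writing the formula; it needs no data and uses only the variable names from the R code above.

```r
# Term labels produced by the explicit formula and by the * shorthand
explicit  <- attr(terms(Gasoline ~ Gas + Automobile + Gas:Automobile),
                  "term.labels")
shorthand <- attr(terms(Gasoline ~ Gas * Automobile), "term.labels")

explicit
shorthand
identical(explicit, shorthand)   # TRUE when both formulas give the same terms
```

Either form can therefore be passed to aov; the ANOVA table then has one row for each of the three terms and one for the residuals.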
SESSION 5.3
MULTIPLE ANALYSIS OF VARIANCE
The R syntax for the multivariate analysis of variance is given by
manova(...)
Example 5.3.1
A researcher would like to know if there is a significant difference in sepal and petal
length between the different species of flowers. Use α = 0.05. The data are given in the table
below.
Species      Sepal.length   Petal.length
Versicolor   5.0            3.3
Versicolor   5.5            4.4
Versicolor   5.8            4.0
Virginia     6.2            4.8
Virginia     5.9            5.1
Setosa       4.9            1.4
Setosa       4.5            1.4
Versicolor   5.7            4.2
Versicolor   6.1            4.7
Versicolor   6.2            4.3
Solution
𝐻o: There is no significant difference in petal and sepal length between the different
species.
𝐻A: There is a significant difference in petal and sepal length between the different
species.
Note
• Enter the data in Excel.
• Save the data. (With the species example, the data is saved as ‘mydata3’).
R-Code
>flower=read.csv("mydata3.csv")
>flower
>results=manova(cbind(Sepal.length,Petal.length)~Species,
data=flower)
>summary(results)
OUTPUT
          Df  Pillai approx F num Df den Df  Pr(>F)
Species    2 0.93928   3.0993      4     14 0.05061 .
Residuals  7
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Conclusion: Since the p-value (0.05061) is greater than 0.05, we fail to reject the null
hypothesis and conclude that there is no significant difference in petal and sepal length
between the different species.
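The example can also be run without a csv file by entering the table directly. The sketch below does so and fits the same model. It assumes that the second measurement column holds petal length, as the manova call above suggests, and it uses the column names Sepal.length and Petal.length for the data frame; the Wilks line is added here only to show that other multivariate statistics can be requested.

```r
# Flower measurements from the table in the example
flower <- data.frame(
  Species = factor(c("Versicolor", "Versicolor", "Versicolor", "Virginia",
                     "Virginia", "Setosa", "Setosa", "Versicolor",
                     "Versicolor", "Versicolor")),
  Sepal.length = c(5.0, 5.5, 5.8, 6.2, 5.9, 4.9, 4.5, 5.7, 6.1, 6.2),
  Petal.length = c(3.3, 4.4, 4.0, 4.8, 5.1, 1.4, 1.4, 4.2, 4.7, 4.3)
)

# cbind() joins the two response variables into one multivariate response
results <- manova(cbind(Sepal.length, Petal.length) ~ Species, data = flower)
summary(results)                   # Pillai's trace is the default statistic
summary(results, test = "Wilks")   # Wilks' lambda as an alternative
```

With 10 flowers and 3 species, the residual degrees of freedom are 10 - 3 = 7, as in the output above.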
Activity 5.3.1
A state employee wishes to see if there is a significant difference in the number of
employees at the interchanges of three Regional toll roads. The data are shown. At α =
0.05, can it be concluded that there is a significant difference in the average number of
employees at each interchange?
Accra-Tema Motorway   Kumasi-Accra Road   Kumasi-Tamale Road
7                     10                  1
14                    1                   12
32                    1                   1
19                    0                   9
10                    11                  1
11                    1                   11
Activity 5.3.2
The following set of data values was obtained from a study of people’s perceptions on
whether the color of a person’s clothing is related to how intelligent the person looks.
The subjects rated the person’s intelligence on a scale of 1 to 10. Group 1 subjects were
randomly shown people with clothing in shades of blue and gray. Group 2 subjects were
randomly shown people with clothing in shades of brown and yellow. Group 3 subjects
were randomly shown people with clothing in shades of pink and orange. The results
follow.
Group 1 Group 2 Group 3
8 7 4
7 8 9
7 7 6
7 7 7
8 5 9
8 8 8
6 5 5
8 8 8
8 7 7
7 6 5
7 6 4
8 6 5
8 6 4
UNIT SIX
LOGISTIC REGRESSION
OVERVIEW
In this unit, logistic regression will be considered. Logistic regression is a statistical
method for analyzing a dataset in which there are one or more independent variables that
determine an outcome. The outcome is measured with a dichotomous variable (in which
there are only two possible outcomes). We will discuss the applications of logistic
regression using R software.
CONTENT
Session 6.1 Logistic Regression
REQUIRED READINGS
LEARNING OUTCOMES
By the completion of this unit, the students should be able to:
1. Perform the logistic regression using R
SESSION/EXAMPLES/ACTIVITIES
This session presents video activities, session examples, and other activities. Video
Activities include YouTube videos for performing logistic regression models in R
software. Session Examples include solved questions on logistic regression with R
software; they illustrate the R code for finding the estimates of the coefficients of the
independent variables and their hypothesis tests. Each example is followed by one or
more activities. Please note that these ACTIVITIES are to be solved and submitted for
grading. The deadline for submission is one week after each lecture.
Description
‘glm’ is used to fit generalized linear models, specified by
giving a symbolic description of the linear predictor and a
description of the error distribution.
R-Code
glm(formula, family = gaussian, data, weights, subset, na.action, start = NULL, etastart,
mustart, offset, control = list(...), model = TRUE, method = "glm.fit", x = FALSE, y =
TRUE, singular.ok = TRUE, contrasts = NULL, ...)
Video Activity
1. https://www.youtube.com/watch?v=C4N3_XJJ-jU
SESSION 6.1
LOGISTIC REGRESSION
glm(y~x+z, data, family = binomial("logit"))
Note.
The dependent variable must be categorical (here, dichotomous).
The independent variables can be either categorical or continuous.
Example 6.1.1
Example of a logistic regression table, where the dependent variable is a discrete
dichotomous variable with 1s and 0s. The independent variables can be discrete or continuous.
Y x1 x2 x3
1 1 0 15
0 0 0 14
0 0 1 18
1 0 1 9
0 1 1 10
0 1 1 11
Solution
>y=c(1,0,0,1,0,0)
>x1=c(1,0,0,0,1,1)
>x2=c(0,0,1,1,1,1)
>x3=c(15,14,18,9,10,11)
>a=glm(y~x1+x2+x3, family=binomial())
>summary(a)
R-Code
>logis=read.csv("data5.csv")
>a=glm(y~x1+x2+x3, family=binomial(), data=logis)
>summary(a)
OUTPUT
Call:
glm(formula = y ~ x1 + x2 + x3, family = binomial())
Deviance Residuals:
       1         2         3         4         5         6
 1.17741  -1.17741   0.00000   1.17741  -1.17741  -0.00016
Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)    254.31   74727.90   0.003    0.997
x1              18.17    5337.71   0.003    0.997
x2             -90.83   26688.54  -0.003    0.997
x3             -18.17    5337.71  -0.003    0.997
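A fitted logistic model is easier to read through its estimated probabilities than through the coefficients alone. The sketch below refits the example and extracts the estimated probability that y = 1 for each row. The very large standard errors in the output suggest that the estimates are not well determined by only six observations, so glm is expected to warn about fitted probabilities of 0 or 1; the warning is suppressed here for readability, and the names p_hat and eta are chosen for the example.

```r
y  <- c(1, 0, 0, 1, 0, 0)
x1 <- c(1, 0, 0, 0, 1, 1)
x2 <- c(0, 0, 1, 1, 1, 1)
x3 <- c(15, 14, 18, 9, 10, 11)

# glm() is expected to warn that fitted probabilities of 0 or 1 occurred
a <- suppressWarnings(glm(y ~ x1 + x2 + x3, family = binomial()))

p_hat <- fitted(a)               # estimated P(y = 1) for each observation
round(p_hat, 3)

# The same probabilities obtained by hand from the linear predictor
eta <- predict(a, type = "link") # b0 + b1*x1 + b2*x2 + b3*x3
all.equal(unname(p_hat), unname(plogis(eta)))
```

The function plogis applies the inverse logit, exp(eta) / (1 + exp(eta)), which is how the logistic model turns the linear predictor into a probability between 0 and 1.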
Activity 6.1
The discrete dichotomous dependent variable in a study takes the values 1 and 0. The
independent variables x1 and x2 are discrete and a third, x3, is continuous. Use the logistic
model to find the effect of the independent variables.
Y x1 x2 x3
1 1 0 152
0 0 1 142
0 0 1 183
0 0 1 95
0 1 1 108
1 0 0 157
1 1 0 150
0 0 0 14
0 0 1 158
1 0 1 99
1 0 0 108
0 1 1 111
END-OF-COURSE EVALUATION
Please visit the End of Course evaluation folder to pick the evaluation questionnaire.
Answer the questions and submit your response to the facilitator. This evaluation is very
critical to the betterment of the course in subsequent sessions.
The final examination, which counts towards 70% of your final grade, will include all
topical issues discussed during this course. Please review all examples, activities, and
individual assignments, as preparation for your final exam. The programme’s
examination officer will communicate the final examination dates and venue to students
sometime later.
Thanks for your