Class Notes
Demetris Athienitis
Department of Statistics,
University of Florida
Contents
Contents 1
I Modules 1-2 4
1 Descriptive Statistics 5
1.1 Concept . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2 Summary Statistics . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2.1 Location . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2.2 Spread . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.2.3 Effect of shifting and scaling measurements . . . . . . . 7
1.3 Graphical Summaries . . . . . . . . . . . . . . . . . . . . . . . 8
1.3.1 Dot Plot . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.3.2 Histogram . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.3.3 Box-Plot . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.3.4 Pie chart . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.3.5 Scatterplot . . . . . . . . . . . . . . . . . . . . . . . . 13
1
2.5.5 Conditional distributions . . . . . . . . . . . . . . . . . 38
2.5.6 Independent random variables . . . . . . . . . . . . . . 39
2.5.7 Covariance . . . . . . . . . . . . . . . . . . . . . . . . . 40
2.5.8 Mean and variance of linear combinations . . . . . . . 43
2.5.9 Common Discrete Distributions . . . . . . . . . . . . . 44
2.5.10 Common Continuous Distributions . . . . . . . . . . . 48
2.6 Central Limit Theorem . . . . . . . . . . . . . . . . . . . . . . 52
2.7 Normal Probability Plot . . . . . . . . . . . . . . . . . . . . . 54
II Modules 3-4 56
3 Inference for One Population 57
3.1 Inference for Population Mean . . . . . . . . . . . . . . . . . . 57
3.1.1 Confidence intervals . . . . . . . . . . . . . . . . . . . 57
3.1.2 Hypothesis tests . . . . . . . . . . . . . . . . . . . . . . 62
3.2 Inference for Population Proportion . . . . . . . . . . . . . . . 68
3.2.1 Large sample confidence interval . . . . . . . . . . . . . 68
3.2.2 Large sample hypothesis test . . . . . . . . . . . . . . . 69
3.3 Inference for Population Variance . . . . . . . . . . . . . . . . 70
3.3.1 Confidence interval . . . . . . . . . . . . . . . . . . . . 71
3.3.2 Hypothesis test . . . . . . . . . . . . . . . . . . . . . . 72
3.4 Distribution Free Inference . . . . . . . . . . . . . . . . . . . . 74
3.4.1 Sign test . . . . . . . . . . . . . . . . . . . . . . . . . . 74
3.4.2 Wilcoxon signed-rank test . . . . . . . . . . . . . . . . 77
2
5.1.3 Inference on slope coefficient . . . . . . . . . . . . . . . 106
5.1.4 Confidence interval on the mean response . . . . . . . . 107
5.1.5 Prediction interval . . . . . . . . . . . . . . . . . . . . 108
5.2 Checking Assumptions and Transforming Data . . . . . . . . . 109
5.2.1 Normality . . . . . . . . . . . . . . . . . . . . . . . . . 110
5.2.2 Independence . . . . . . . . . . . . . . . . . . . . . . . 111
5.2.3 Homogeneity of variance/Fit of model . . . . . . . . . 112
5.2.4 Box-Cox (Power) transformation . . . . . . . . . . . . 113
5.3 Multiple Regression . . . . . . . . . . . . . . . . . . . . . . . . 116
5.3.1 Model . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
5.3.2 Goodness of fit . . . . . . . . . . . . . . . . . . . . . . 117
5.3.3 Inference . . . . . . . . . . . . . . . . . . . . . . . . . . 118
5.4 Qualitative Predictors . . . . . . . . . . . . . . . . . . . . . . 125
Bibliography 146
3
Part I
Modules 1-2
4
Module 1
Descriptive Statistics
1.1 Concept
Definition 1.1. Population parameters are numerical summaries concerning
the complete collection of subjects, i.e. the population.
Sample statistics are denoted by placing the “hat” symbol over the population
parameter, such as the sample mean µ̂, or sometimes for convenience by a symbol
from the English alphabet. For the sample mean, µ̂ ≡ x̄.
1.2.1 Location
• The mode is the most frequently encountered observation.
• The mean is the arithmetic average of the observations,
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i .
• The pth percentile value divides the ordered data such that p% of
the data are less than that value and (100 − p)% are greater than it. It is
located at the (p/100)(n + 1) position of the ordered data. If the position
value is not an integer then take a weighted average of the values at
positions ⌊(p/100)(n + 1)⌋ and ⌈(p/100)(n + 1)⌉. The median is
the 50th percentile.
5
• The α% trimmed mean is the mean of the data with the smallest
α% × n observations and the largest α% × n observations truncated
from the data.
Example 1.1. The following values of fracture stress (in megapascals) were
measured for a sample of 24 mixtures of hot mixed asphalt (HMA).
30 75 79 80 80 105 126 138 149 179 179 191
223 232 232 236 240 242 245 247 254 274 384 470
http://www.stat.ufl.edu/~athienit/IntroStat/loc_stats.R
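A minimal R sketch of these location summaries for the fracture stress data (the type = 6 quantile option uses the (p/100)(n + 1) positioning described above; the linked loc_stats.R script presumably does something similar):
stress <- c(30, 75, 79, 80, 80, 105, 126, 138, 149, 179, 179, 191,
            223, 232, 232, 236, 240, 242, 245, 247, 254, 274, 384, 470)
mean(stress)                               # sample mean
median(stress)                             # 50th percentile
quantile(stress, c(0.25, 0.75), type = 6)  # quartiles via (p/100)(n+1) positioning
mean(stress, trim = 0.05)                  # 5% trimmed mean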
Remark 1.1. To calculate a weighted average when a percentile is located at
a position P that is between two observed positions, the Closest Lower Position to
P (CLP) with value a and the Closest Upper Position to P (CUP) with value b,
the weighted average gives less weight to CUP (with corresponding value b), as it is
further away, than to CLP (with value a):

? = b\,\frac{P - CLP}{CUP - CLP} + a\,\frac{CUP - P}{CUP - CLP}   (1.1)

The first weight goes to the second value (the largest) and the second weight
goes to the first value (the smallest).
6
Remark 1.2. Note that the mean is more sensitive to outliers (observations
that do not fall in the general pattern of the rest of the data) than the median.
Assume we have values
2, 3, 5.
The mean is 3.33 and the median is 3. Now assume we add a value and now
have
2, 3, 5, 112.
The mean is 30.5 but the median is now 4.
1.2.2 Spread
• The variance is a measure of spread of the individual observations
from their center (as indicated by the mean).
\hat{\sigma}^2 = s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\left[\left(\sum_{i=1}^{n} x_i^2\right) - n\bar{x}^2\right]
7
1.3 Graphical Summaries
1.3.1 Dot Plot
Stack each observation on a horizontal line to create a dot plot that gives an
idea of the “shape” of the data. Some rounding of data values is allowed in
order to stack.
1.3.2 Histogram
1. Create class intervals (by choosing boundary points) in which to place
the data.
8
[Figure: density-scale histogram of the fracture stress data, with values ranging from 0 to 500.]
Remark 1.3. Frequency, Relative Frequency, or Density may be used as the vertical
axis when class widths are equal. However, class widths are not necessarily equal;
unequal widths are usually chosen to create smoother graphics, if not mandated by
the situation at hand. In that case we must use Density, which accounts
for the class width, because wide classes may have unrepresentatively large
frequencies.
http://www.stat.ufl.edu/~athienit/IntroStat/hist1_boxplot1.R
1.3.3 Box-Plot
A box-plot is a graphic that only uses quartiles. A box is created with Q1 , Q2 ,
and Q3 . A lower whisker is drawn from Q1 down to the smallest data point
that is within 1.5 IQR of Q1 . Hence from Q1 = 110.25 down to Q1 −
1.5IQR = 110.25 − 1.5(134) = −90.75, but we stop at the smallest point
within that range, which is 30. Similarly, the upper whisker is drawn from Q3 =
244.25 to Q3 + 1.5IQR = 445.25, but we stop at the largest point within
that range, which is 384.
9
[Figure: box-plot of the fracture stress data.]
Remark 1.4. Any point beyond the whiskers is classified as an outlier and
any point beyond 3IQR from either Q1 or Q3 is classified as an extreme
outlier.
http://www.stat.ufl.edu/~athienit/IntroStat/hist1_boxplot1.R
10
These densities have shapes that can be described as:
• Symmetric
[Figures: symmetric bell-shaped densities with σ = 1, σ = 1.5, and σ = 0.8, together with two further density shapes.]
11
1.3.4 Pie chart
A pie or circle has 360 degrees. For each category of a variable, the size of the
slice is determined by the fraction of 360 that corresponds to that category.
[Figure: pie chart with slices USA 67%, UK 17%, Canada 6%, Other 6%, Australia 5%.]
http://www.stat.ufl.edu/~athienit/IntroStat/pie.R
12
1.3.5 Scatterplot
It is used to plot the raw 2-D points of two variables in an attempt to discern
a relationship.
[Figure: scatterplot with Math score on the vertical axis (roughly 30 to 80) against a predictor taking values 1 to 6.]
http://www.stat.ufl.edu/~athienit/IntroStat/scatterplot.R
13
Module 2
The study of probability began in the 17th century when gamblers started
hiring mathematicians to calculate the odds of winning for different types of
games.
• A machine cuts rods of a certain length (in cm). S = {x | 5.6 < x < 6.4}
For instance the empty set ∅ = {} and the entire sample space S are also
events.
Example 2.2. Let A be the event of an even outcome when rolling a die.
Then, A = {2, 4, 6} ⊂ S.
14
2.1.2 Relating events
When we are concerned with multiple events within the sample space, Venn
Diagrams are useful to help explain some of the relationships. Let’s illustrate
this via an example.
Example 2.3. Let,
S ={1, 2, 3, 4, 5, 6, 7, 8, 9, 10}
A ={1, 3, 5, 7, 9}
B ={6, 7, 8, 9, 10}
[Figure: Venn diagram of S with circles A and B; 7 and 9 in the overlap, 1, 3, 5 only in A, 6, 8, 10 only in B, and 2, 4 outside both.]
Combining events implies combining the elements of the events. For ex-
ample,
A ∪ B = {1, 3, 5, 6, 7, 8, 9, 10}.
Intersecting events implies only listing the elements that the events have
in common. For example,
A ∩ B = {7, 9}.
The complement of an event implies listing all the elements in the sample
space that are not in that event. For example,
15
• Distributive law: For any sets A, B, and C we have
– A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C)
– A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C)
Definition 2.3. A collection of events A1 , A2 , . . . is mutually exclusive if no
two of them have any outcomes in common. That is, Ai ∩ Aj = ∅ for all i ≠ j.
In terms of the Venn Diagram, there is no overlapping between them.
Example 2.4. Let,
S = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10} A = {1, 5} B = {6, 8, 10}
[Figure: Venn diagram showing disjoint circles A = {1, 5} and B = {6, 8, 10} inside S.]
Example 2.5. A set is always mutually exclusive with its complement. Let,
S = {1, 2} A = {1}
[Figure: Venn diagram of S partitioned into A and Ac.]
2.2 Probability
Notation: Let P (A) denote the probability that the event A occurs. It is the
proportion of times that the event A would occur in the long run.
Axioms of Probability:
• P (S) = 1
• 0 ≤ P (A) ≤ 1, since A ⊆ S
• If A1 , A2 , . . . are mutually exclusive, then
P (A1 ∪ A2 ∪ · · · ) = P (A1 ) + P (A2 ) + · · ·
Crashes Probability
0 0.60
1 0.30
2 0.05
3 0.04
4 0.01
Let A be the event that at least one crash occurs on a given day.
P (A) = 0.30 + 0.05 + 0.04 + 0.01
= 0.4
or
= 1 − P (Ac )
= 1 − 0.60
= 0.4
If S contains N equally likely outcomes/elements and the event A contains
k(≤ N ) outcomes then,
P(A) = \frac{k}{N}
Example 2.7. The experiment consists of rolling a die. There are 6 outcomes
in the sample space, all of which are equally likely (assuming a fair die).
Then, if A is the event of an outcome of a roll being even, A = {2, 4, 6} with
3 elements so, P (A) = 3/6 = 0.5
17
The axioms provide a way of finding the probability of a union of two
events but only if they are mutually exclusive. In Example 2.3 we have seen
that
A ∪ B = {1, 3, 5, 6, 7, 8, 9, 10} and A ∩ B = {7, 9}.
So, to find P (A ∪ B) we can either
• partition A ∪ B into the mutually exclusive pieces
1. A ∩ B c = {1, 3, 5}
2. A ∩ B = {7, 9}
3. Ac ∩ B = {6, 8, 10}
so that P (A ∪ B) = P (A ∩ B c ) + P (A ∩ B) + P (Ac ∩ B), or
• add P (A) + P (B) but subtract the probability of the intersection, as that probability was double counted since it is included
within A and within B, leading to
P (A ∪ B) = P (A) + P (B) − P (A ∩ B).
Example 2.8. A sample of 1000 aluminum rods is taken and a quality check
was performed on each rod’s diameter and length.
Diameter
Length Too Thin OK Too Thick Sum
Too Short 10 3 5 18
OK 38 900 4 942
Too Long 2 25 13 40
Sum 50 928 22 1000
18
2.3 Counting Methods
Proposition 2.1. Product Rule: If the first task of an experiment can result
in n1 possible outcomes and for each such outcome, the second task can result
in n2 possible outcomes, and so forth up to k tasks, then the total number
of ways to perform the sequence of k tasks is \prod_{i=1}^{k} n_i .
Example 2.9. When buying a certain type and brand of car a buyer has
the following number of choices:
(2) Engine
(5) Color
(4) Interior
Then, the total number of car choices, i.e. number of unique cars, is
2 × 5 × 4 = 40.
Notation: In mathematics, the factorial of a non-negative integer n, denoted
by n!, is the product of all positive integers less than or equal to n. For
example,
5! = 5 × 4 × 3 × 2 × 1 = 120
with 0! = 1.
2.3.1 Permutations
Definition 2.4. Permutation is the number of ordered arrangements, or
permutations, of r objects selected from n distinct objects (r ≤ n) without
replacement. It is given by,
P_r^n = \frac{n!}{(n-r)!}.
The number of permutations of n objects is n!.
Example 2.10. Take the letters A, B, C. Then there are 3! = 6 possible
permutations. These are:
ABC, ACB, BAC, BCA, CAB, CBA
In effect we have three “slots” in which to place the letters. There are 3
options for the first slot, 2 for the second and 1 for the third. Hence we have
3 × 2 × 1 = 6 = P33 .
Example 2.11. Let n = 10 with A, B, C, D, E, F, G, H, I, J and we wish to
select 5 letters at random, where order is important. Then there are 5 slots
with 10 choices for the first slot, 9 for the second, 8 for the third, 7 for the
fourth and 6 for the fifth. That is,
10 \times 9 \times 8 \times 7 \times 6 = \frac{10!}{5!} = P_5^{10} = 30240.
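These counts can be checked quickly in R (a small sketch using the base functions factorial and prod):
factorial(3)                        # 3! = 6 permutations of A, B, C
factorial(10) / factorial(10 - 5)   # P_5^10 = 30240
prod(10:6)                          # same product 10*9*8*7*6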
19
Example 2.12. An election is held for a board consisting of 3 individuals
out of a pool of 40 candidates. There will be 3 positions of:
1. President
2. Vice-President
3. Secretary
How many possible boards are there?
Since order is important there are
P_3^{40} = \frac{40!}{37!} = 59280
That is 40 choices for president, then, 39 choices for vice-president, and finally
38 choices for secretary, i.e. (40)(39)(38) = 59280
2.3.2 Combinations
Definition 2.5. Combination is the number of unordered arrangements, or
combinations, of r objects selected from n distinct objects (r ≤ n) without
replacement. It is given by,
C_r^n = \binom{n}{r} = \frac{n!}{r!(n-r)!} = C_{n-r}^n .
The way to think about combinations is that they are a special case of
permutations. When selecting r objects from n we know that there are P_r^n
permutations, but there are also r! different orderings, which for combinations
cannot be considered different. Hence,
\frac{P_r^n}{r!} = C_r^n .
Example 2.13. Referring back to Example 2.10 where there were P33 = 6
permutations of the letters. However, there are also 3! = 6 different orderings.
Consequently, there is only 6/6=1 combination.
Example 2.14. Continuing from Example 2.12, assume instead of a board
we are interested in the number of possible committees where each member
has equal power.
Then, there are
C_3^{40} = \binom{40}{3} = 9880
With combinations we have only seen two groups: those items chosen
and those that were not. A generalization of combinations to more than two
groups states that the number of ways of partitioning n distinct objects into
k categories containing n1 , . . . , nk objects respectively, with n1 + · · · + nk = n,
is
\frac{n!}{n_1! \cdots n_k!}
20
Application of Combinations to Probability Problems. Knowing the
total number of combinations of a certain set and the number of
combinations for a certain subset, we are able to calculate the probability of
an event assuming each possible outcome is equally likely. This is done by
dividing the number of ways a certain outcome occurs by the total number
of ways.
• \binom{10}{3} is the number of ways of selecting 3 defective out of 10.
• \binom{90}{2} is the number of ways of selecting 2 non-defective out of 90.
• \binom{100}{5} is the total number of ways of selecting 5 articles out of 100.
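Assuming the setup implied by the bullets (5 articles sampled from 100, of which 10 are defective, and we want the probability of exactly 3 defectives), a quick R check:
choose(10, 3) * choose(90, 2) / choose(100, 5)   # ~ 0.00638
dhyper(3, 10, 90, 5)                             # same value from the hypergeometric p.m.f.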
21
2.4 Conditional Probability and Independence
Definition 2.6. A probability that is based upon the entire sample space is
called an unconditional probability, but when it is based upon a subset of the
sample space it is a conditional (on the subset) probability.
Definition 2.7. Let A and B be two events with P (B) ≠ 0. Then the
conditional probability of A given B (has occurred) is
P(A|B) = \frac{P(A \cap B)}{P(B)}.
The reason we divide by the probability of the given occurrence, i.e.
P (B), is to re-standardize the sample space. We update the sample space to
be just B, i.e. S = B, and hence P (B|B) = 1. The only part of event A that
occurs within this new S = B is P (A ∩ B).
Proposition 2.2. Rule of Multiplication:
• If P (A) ≠ 0, then P (A ∩ B) = P (B|A)P (A).
• If P (B) ≠ 0, then P (A ∩ B) = P (A|B)P (B).
What is the probability that the server loses a point, i.e. P (Fault 1 and Fault 2)?
P (Fault 1 and Fault 2) = P (Fault 2|Fault 1)P (Fault 1) = (0.02)(0.44) = 0.009
Example 2.17. Referring to Example 2.8. Given that the length of a rod is
too long, what is the probability that the diameter is okay, i.e.
P(\text{Diam. OK} \mid \text{Length too long}) = \frac{P(\text{Diam. OK} \cap \text{Length too long})}{P(\text{Length too long})} = \frac{25/1000}{40/1000} = \frac{25}{40} = 0.625.
22
2.4.1 Independent Events
When the given occurrence of one event does not influence the probability
of a potential outcome of another event, then the two events are said to be
independent.
23
Example 2.19. A system consists of four components, connected as shown.
Suppose that the components function independently, and that the proba-
bilities of failure are 0.05 for A, 0.03 for B, 0.07 for C, and 0.014 for D. Find
the probability that the system functions.
[Figure: diagram of the four-component system A, B, C, D.]
24
To better illustrate this proposition let n = 4 and look at Figure 2.4.
[Figure 2.4: event B partitioned by A1, A2, A3, A4 into A1 ∩ B, A2 ∩ B, A3 ∩ B, A4 ∩ B.]
Example 2.20. Customers can purchase a car with three options for engine
sizes
Of the cars with the small engine 10% fail an emissions test within 10 years
of purchase, while 12% fail of the medium and 15% of the large.
What is the probability that a randomly chosen car will fail the emissions
test within 10 years?
What we have is:
Therefore
25
2.4.3 Bayes’ Rule
In most cases P (B|A) ≠ P (A|B). Bayes' rule provides a method to calculate
one conditional probability if we know the other one. It uses the rule of
multiplication in conjunction with the law of total probability.
P(A|B) = \frac{P(A \cap B)}{P(B)} = \frac{P(B|A)P(A)}{P(B|A)P(A) + P(B|A^c)P(A^c)}.
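As a small R sketch of this formula (the numerical inputs below are hypothetical, not from the notes):
p_A    <- 0.01   # hypothetical P(A)
p_B_A  <- 0.95   # hypothetical P(B|A)
p_B_Ac <- 0.10   # hypothetical P(B|A^c)
(p_B_A * p_A) / (p_B_A * p_A + p_B_Ac * (1 - p_A))   # P(A|B) ~ 0.088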
26
2.5 Random Variables and Probability Dis-
tributions
Definition 2.10. A random variable is a function that assigns a numerical
value to each outcome of an experiment. It is a measurable function from a
probability space into a measurable space known as the state space.
It is an outcome characteristic that is unknown prior to the experiment.
For example, an experiment may consist of tossing two dice. One poten-
tial random variable could be the sum of the outcome of the two dice, i.e.
X= sum of two dice. Then, X is a random variable that maps an outcome
of the experiment (36 of them) into a numerical value (11 of them), the sum
in this case.
(1, 1) \overset{X}{\mapsto} 2
(1, 2) \overset{X}{\mapsto} 3
(1, 3) \overset{X}{\mapsto} 4
\vdots
(6, 6) \overset{X}{\mapsto} 12
Sometimes the function is just the identity function, that is, the numerical
value assigned is the outcome value. For example, if you are measuring
the height of trees (in an agricultural experiment) then the random variable
might simply be the height of the tree.
Quantitative random variables can either be discrete, in which case
they have a countable set of possible values, or continuous, in which case they
have uncountably infinitely many possible values.
27
Example 2.22. (Discrete) Suppose a storage tray contains 10 circuit boards,
of which 6 are type A and 4 are type B, but they both appear similar. An
inspector selects 2 boards for inspection. He is interested in X = number of
type A boards. What is the probability distribution of X?
Since,
(A, A) \overset{X}{\mapsto} 2
(A, B) \overset{X}{\mapsto} 1
(B, A) \overset{X}{\mapsto} 1
(B, B) \overset{X}{\mapsto} 0
Consequently,
X=x p(x)
0 0.1334
1 0.5333
2 0.3333
Total 1.0
28
[Figures: a continuous density plotted over 0 to 8, and the p.m.f. of X from Example 2.22 with P(X ≤ 1) highlighted over the possible values 0, 1, 2.]
F (1) = P (X ≤ 1)
= P (X = 0) + P (X = 1)
= 0.1334 + 0.5333 = 0.6667
29
Example 2.25. Example 2.23 continued.
Find F (1). That is,
[Figure: battery distribution p.d.f. with the area P(X ≤ 1) shaded.]
F(1) = \int_{-\infty}^{1} f(x)\,dx = \int_{-\infty}^{0} 0\,dx + \int_{0}^{1} 0.5e^{-0.5x}\,dx = 0 + \left(-e^{-0.5x}\right)\Big|_{0}^{1} = 0.3935
30
2.5.1 Expected Value And Variance
The expected value of a r.v. is thought of as the long term average for that
variable. Similarly, the variance is thought of as the long term average squared
distance of the values of the r.v. from the expected value.
Definition 2.13. The expected value (or mean) of a function h(·) of a r.v. X
is
E(h(X)) = \int_{-\infty}^{\infty} h(x)f(x)\,dx.
Example 2.26. Referring back to Example 2.22, the expected value of the
number of type A boards (X) is
E(X) = \sum_{\forall x} x\,p(x) = 0(0.1334) + 1(0.5333) + 2(0.3333) = 1.1999.
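In R the same calculation (together with E(X²) and the variance used in Example 2.27) is just a weighted sum over the p.m.f. table:
x <- 0:2
p <- c(0.1334, 0.5333, 0.3333)
EX  <- sum(x * p)      # 1.1999
EX2 <- sum(x^2 * p)    # 1.8665
EX2 - EX^2             # V(X) ~ 0.4267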
31
Example 2.27. This refers to Example 2.22. We know that E(X) = 1.1999
and E(X 2 ) = 02 (0.1334) + 12 (0.5333) + 22 (0.3333) = 1.8665. Thus,
So E(X) = 2.
So, E(X 2 ) = 8.
V(aX + b) = a^2 E\{X - E(X)\}^2 = a^2 V(X)
32
2.5.2 Population Percentiles
Let X be a continuous r.v. with p.d.f. f and c.d.f. F. The population pth
percentile, x_p, is found by solving the following equation for x_p:
F(x_p) = \int_{-\infty}^{x_p} f(t)\,dt = \frac{p}{100}.
Example 2.29. Let r.v. X have p.d.f. f(x) = 0.5e^{-0.5x}, x > 0. The 60th
percentile of X is found by solving for x_{0.6} in
F(x_{0.6}) = \int_{0}^{x_{0.6}} 0.5e^{-0.5t}\,dt = 0.6.
[Figure: battery distribution p.d.f. with the area P(X ≤ x_{0.6}) = 0.6 shaded; the unknown percentile x_{0.6} is marked on the horizontal axis.]
F(x) = \int_{0}^{x} 0.5e^{-0.5t}\,dt = \frac{0.5}{-0.5}\,e^{-0.5t}\Big|_{0}^{x} = -e^{-0.5x} + 1.
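Since this p.d.f. is the exponential with rate 0.5, the percentile can be checked in R either with the built-in quantile function or by solving F(x) = 0.6 numerically (a sketch):
qexp(0.6, rate = 0.5)                                        # x_0.6 = -2*log(0.4) ~ 1.833
uniroot(function(x) 1 - exp(-0.5 * x) - 0.6, c(0, 20))$root  # numeric check of F(x) = 0.6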
33
Example 2.30. Refer back to Example 2.22.
34
2.5.3 Chebyshev’s inequality
This is a useful concept when the distribution of a r.v. is unknown.
Proposition 2.5. Let X be a random variable with E(X) = µ and V (X) =
σ². Then,
P(|X - \mu| < k\sigma) \geq 1 - \frac{1}{k^2} \;\Leftrightarrow\; P(|X - \mu| \geq k\sigma) \leq \frac{1}{k^2}
This proposition implies (through the second form) that the probability
that a random variable differs from its mean by k standard deviations or more
is never greater than 1/k².
Example 2.31. A widget manufacturer is interested in the largest propor-
tion of days when production is not in the optimal range of 100 to 140 widgets.
The manufacturing process is known to produce on average 120 widgets with
a standard deviation of 10.
P(X \leq 99 \text{ or } X \geq 141) = P(|X - 120| \geq 21) = P(|X - 120| \geq 2.1(10)) \leq \frac{1}{2.1^2} = 0.2268
For example, the probability that a box has length 15mm and width
129mm is
P (X = 129 ∩ Y = 15) = 0.12.
We also implement the law of total probability to find marginal probabilities
such as
P(X = 129) = \underbrace{P(X = 129 \cap Y = 15)}_{0.12} + \underbrace{P(X = 129 \cap Y = 16)}_{0.08} = 0.20
35
Finding marginal probabilities of one r.v. involves summing out or integrating out the other(s):
f_X(x) = \int f(x, y)\,dy, \qquad f_Y(y) = \int f(x, y)\,dx
Example 2.33. For a certain type of washer, both the thickness and the
hole diameter vary from item to item. Let X denote the thickness and Y the
diameter, both in mm. Assume that the joint p.d.f. is given by
f(x, y) = \frac{1}{6}(x + y), \quad x \in [1, 2],\; y \in [4, 5]
Find the probability that a randomly chosen washer has a thickness between
1.0 and 1.5mm and a hole diameter between 4.5 and 5mm.
What we need to find then is
P(1 \leq X \leq 1.5 \text{ and } 4.5 \leq Y \leq 5) = \int_{1}^{1.5}\int_{4.5}^{5} \frac{1}{6}(x + y)\,dy\,dx = \int_{1}^{1.5}\left(\frac{x}{12} + \frac{19}{48}\right)dx = 1/4.
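A numerical check of this double integral in R (a sketch using nested calls to integrate):
inner <- function(xv) sapply(xv, function(x)
  integrate(function(y) (x + y) / 6, 4.5, 5)$value)   # inner integral over y for each x
integrate(inner, 1, 1.5)$value                        # 0.25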
36
Example 2.34. Let X and Y be random variables with p.d.f.
f (x, y) = 8xy x ∈ [0, 1], y ∈ [0, x]
We wish to find
• P (X > 0.5 and Y < X). All we need to do is integrate the p.d.f over
the joint support
[Figure: the triangular support of X and Y, 0 ≤ y ≤ x ≤ 1.]
P(X > 0.5 \text{ and } Y < X) = \int_{0.5}^{1}\int_{0}^{x} 8xy\,dy\,dx = \int_{0.5}^{1} 4x^3\,dx = 0.9375.
• The marginals
– of X
f_X(x) = \int_{0}^{x} 8xy\,dy = 4x^3, \quad 0 \leq x \leq 1
– of Y
f_Y(y) = \int_{y}^{1} 8xy\,dx = 4y(1 - y^2), \quad 0 \leq y \leq 1
Expanding definition 2.13 we have
Definition 2.16. The expected value of a function h(·) of r.vs X and Y is
E[h(X, Y)] = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} h(x, y)f(x, y)\,dx\,dy
37
2.5.5 Conditional distributions
Definition 2.17. Let X and Y be jointly continuous r.vs with joint p.d.f.
f (x, y). Let fX (x) denote the marginal p.d.f. of X and let x be any number
for which fX (x) > 0. The conditional p.d.f. of Y |X = x is
f_{Y|X}(y|x) = \frac{f(x, y)}{f_X(x)}
Example 2.35. Continuing from Example 2.33, it is easy to show that the
marginal distribution of X is (1/6)(x + 4.5) for x ∈ [1, 2]. To find the p.d.f.
of Y given that X = 1.2
f_{Y|X}(y|1.2) = \frac{(1/6)(1.2 + y)}{(1/6)(1.2 + 4.5)} = \frac{1.2 + y}{5.7}, \quad y \in [4, 5]
We can use this distribution to find probabilities such as the probability
P (Y ≤ 4.8|X = 1.2). Do as an exercise.
38
2.5.6 Independent random variables
Definition 2.18. Let X and Y be two continuous r.vs. They are independent
if
f (x, y) = fX (x)fY (y)
This provides us with a method to determine from a joint p.d.f. if two
r.vs are independent. We need to be able to partition the joint p.d.f. into a
product of the two marginal p.d.fs. This may not be a trivial task because
we have to make sure that the two partitioned functions are actually p.d.fs,
that is that they are greater than or equal to 0 and that they integrate to
1 when integrated over the whole support. Fortunately the following lemma
makes matters easier.
Lemma 2.6. Two r.vs X and Y are independent if and only if there exist
functions g(x) and h(y) strictly either nonnegative or nonpositive such that
for every x, y ∈ R
f (x, y) = g(x)h(y)
Proof. (⇒) If X and Y are independent then by definition f(x, y) = f_X(x)f_Y(y).
Let g(x) = f_X(x) and h(y) = f_Y(y).
(⇐) Define c := \int_{-\infty}^{\infty} g(x)\,dx and d := \int_{-\infty}^{\infty} h(y)\,dy. Note that
cd = \int\!\!\int g(x)h(y)\,dx\,dy = \int\!\!\int f(x, y)\,dx\,dy = 1
The marginal p.d.f. of X is given by
f_X(x) = \int f(x, y)\,dy = \int g(x)h(y)\,dy = g(x)\int h(y)\,dy = g(x)\,d.
where I(·) denotes the indicator (0,1) function. It is then clear that X and Y
are independent.
39
Example 2.38. Back in Example 2.34 we had p.d.f. 8xy for x ∈ [0, 1] and
y ∈ [0, x], which can be expressed as
f(x, y) = 8xy\, I(x \in [0, 1])\, I(y \in [0, x]).
However, I(y ∈ [0, x]) cannot be further decomposed into a function simply
of x and one simply of y. Hence, X and Y are not independent.
2.5.7 Covariance
The population covariance is a measure of the strength of a linear relationship
between two variables. It is not a measure of the slope of the linear relationship
but of how close the points lie to a straight line. See Example 2.41.
Cov(X, Y) = E[\{X - E(X)\}\{Y - E(Y)\}] = E(XY) - E(X)E(Y)
The correlation,
\rho_{XY} = \frac{Cov(X, Y)}{\sqrt{V(X)}\,\sqrt{V(Y)}},
is
• unitless
• ranges from −1 to 1
A negative relationship implies a negative covariance and consequently a
negative correlation.
40
Example 2.39. Recall that in Example 2.34 we had two r.vs X and Y with
joint p.d.f.
f (x, y) = 8xy x ∈ [0, 1], y ∈ [0, x].
To find Cov(X, Y ) we will have to find the three terms E(XY ), E(X) and
E(Y ).
E(XY) = \int_{0}^{1}\int_{0}^{x} xy(8xy)\,dy\,dx = \int_{0}^{1} \frac{8x^5}{3}\,dx = 4/9
We had also shown that the marginal p.d.fs were
• f_X(x) = 4x^3, 0 ≤ x ≤ 1
• f_Y(y) = 4y − 4y^3, 0 ≤ y ≤ 1
and therefore, E(X) = 4/5 and E(Y ) = 8/15. Thus,
Cov(X, Y) = \frac{4}{9} - \frac{4}{5}\cdot\frac{8}{15} = 0.01778.
To find the correlation, it can be shown that V (X) = 0.02667 and V (Y ) =
0.04889. Therefore,
\rho_{XY} = \frac{0.01778}{\sqrt{0.02667}\,\sqrt{0.04889}} = 0.49239
Moving away from the population parameters, to estimate the sample
statistic of the covariance and the correlation we need
\hat{\sigma}_{XY} := \widehat{Cov}(X, Y) = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}) = \frac{1}{n-1}\left[\left(\sum_{i=1}^{n} x_i y_i\right) - n\bar{x}\bar{y}\right]
Therefore,
r_{XY} := \hat{\rho}_{XY} = \frac{\left(\sum_{i=1}^{n} x_i y_i\right) - n\bar{x}\bar{y}}{(n-1)s_X s_Y}.
Example 2.40. Let’s assume that we want to look at the relationship be-
tween two variables, height (in inches) and self esteem for 20 individuals.
Height 68 71 62 75 58 60 67 68 71 69
Esteem 4.1 4.6 3.8 4.4 3.2 3.1 3.8 4.1 4.3 3.7
Height 68 67 63 62 60 63 65 67 63 61
Esteem 3.5 3.2 3.7 3.3 3.4 4.0 4.1 3.8 3.4 3.6
41
Hence,
r_{XY} = \frac{4937.6 - 20(65.4)(3.755)}{19(4.406)(0.426)} = 0.731,
so there is a moderate to strong positive linear relationship.
[Figure: scatterplot of Height against Esteem.]
http://www.stat.ufl.edu/~athienit/IntroStat/esteem.R
Example 2.41. Here are some examples that illustrate different sample cor-
relations. Again we note that correlation measures the strength of the linear
relationship and not the slope.
[Figure: four scatterplots of y against x with sample correlations r = 1, r = 0.8775, r = −0.9307, and r = 0.4968.]
42
2.5.8 Mean and variance of linear combinations
Let X and Y be two r.vs. Then for (aX + b) + (cY + d), for constants a, b, c and d,
E(aX + b + cY + d) = aE(X) + cE(Y) + b + d
V(aX + b + cY + d) = \underbrace{Cov(aX, aX)}_{a^2 V(X)} + \underbrace{Cov(cY, cY)}_{c^2 V(Y)} + \underbrace{Cov(aX, cY) + Cov(cY, aX)}_{2ac\,Cov(X, Y)}
E\left(\sum_{i=1}^{n} a_i X_i\right) = \sum_{i=1}^{n} a_i E(X_i)
and
V\left(\sum_{i=1}^{n} a_i X_i\right) = \sum_{i=1}^{n}\sum_{j=1}^{n} a_i a_j Cov(X_i, X_j)   (2.1)
= \sum_{i=1}^{n} a_i^2 V(X_i) + 2\sum\!\!\sum_{i<j} a_i a_j Cov(X_i, X_j)   (2.2)
and
V\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right) \overset{ind.}{=} \frac{1}{n^2}\sum_{i=1}^{n} V(X_i) = \frac{1}{n^2}\,n\sigma^2 = \frac{\sigma^2}{n}
Remark 2.4. As the sample size increases, the variance of the sample mean
decreases with limn→∞ V (X̄) = 0. That is, there is no uncertainty in the
sample mean, which is the estimate of the population mean. With n → ∞
the sample mean is the population mean.
43
2.5.9 Common Discrete Distributions
In the following sections we will be reviewing some of the most frequently used
discrete distributions. Probability calculations can be done with software once a p.m.f.
is specified for any distribution; however, for these common ones software
has built-in p.m.f., c.d.f., quantile (inverse of the c.d.f.) functions and
so on.
Bernoulli
Imagine an experiment where the r.v. X can take only two possible outcomes,
1 (success) with probability p and 0 (failure) with probability 1 − p. The p.m.f. of X is
p(x) = p^x (1 - p)^{1-x}, \quad x = 0, 1, \quad 0 \leq p \leq 1
Example 2.44. A die is rolled and we are interested in whether the outcome
is a 6 or not. Let, (
1 if outcome is 6
X=
0 otherwise
Then, X ∼ Bernoulli(1/6) with mean 1/6 and variance 5/36.
Binomial
If X_1, . . . , X_n correspond to n Bernoulli trials conducted independently, each with the
same success probability p, then X = \sum_{i=1}^{n} X_i \sim Bin(n, p). The intuition behind the form of the
p.m.f. can be motivated by the following example.
44
Example 2.45. A fair coin is tossed 10 times and X = the number of heads
is recorded. What is the probability that X = 3?
One possible outcome is
(H) (H) (H) (T) (T) (T) (T) (T) (T) (T)
1-pbinom(0,4,1/6)
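For instance, the binomial p.m.f. and c.d.f. for Example 2.45 are available directly in R:
dbinom(3, 10, 0.5)   # P(X = 3) for X ~ Bin(10, 0.5), ~ 0.117
pbinom(3, 10, 0.5)   # P(X <= 3), ~ 0.172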
45
Geometric
Assume that a sequence of i.i.d. Bernoulli trials is conducted and of interest
is the number of trials necessary to achieve the first success. Let, X denote
the total number of trials up to and including the first success. Then, X ∼
Geom(p) with p.m.f.
p(x) = p(1 - p)^{x-1}, \quad x = 1, 2, \ldots
and E(X) = 1/p and V(X) = (1 − p)/p².
Example 2.47. In an experiment, such as tossing a fair coin, what is the
probability that it will take the experimenter exactly 5 attempts to land the
coin heads up, i.e. P (X = 5)?
This implies that the outcome Head (H) is preceded by 4 Tails.
(T) (T) (T) (T) (H)
The probability of this outcome is (1 − p)4 p, where p denotes the probability
of head (success) on each try.
Remark 2.5. In R, we would use the function dgeom(4,p) since the function
in R counts the number of failures before the success. In this case there are
4 failures before the first success. This is simply an alternate form. Please
look up “geometric” on wikipedia or in R help files.
Negative Binomial
A r.v. with a negative binomial distribution is simply an extension of the
geometric. It is the number of trials up to and including the rth success. So
if X ∼ NB(r, p) then it has p.m.f.
p(x) = \binom{x-1}{r-1} p^r (1 - p)^{x-r}, \quad x = r, r + 1, \ldots
with E(X) = r/p and V(X) = r(1 − p)/p².
Example 2.48. Continuing the coin toss scenario assume we have 8 tosses
until the 3rd Head with the following outcome:
(T) (T) (H) (T) (H) (T) (T) (H)
The probability of this occurring is p^3(1 − p)^5. This is one possibility that
3 Heads occur in 8 tosses with the restriction that the last Head occurs on
the 8th trial. Consequently, there are 7 prior positions in which to place 2
Heads, so there are \binom{7}{2} such ways. Hence the probability that 8 trials are necessary
to achieve the 3rd Head is \binom{7}{2} p^3 (1 - p)^5.
46
Remark 2.7. A negative binomial is a sum of geometric r.vs, in the same way
a binomial is a sum of Bernoullis. In our example
Poisson
The Poisson distribution occurs when we count the number of occurrences of
an event over a given interval of time and/or space. These occurrences are
assumed to occur with a fixed rate, for example the number of particles that
decay in a radioactive process.
If X ∼ Poisson(λ) then it has p.m.f.
p(x) = \frac{\lambda^x e^{-\lambda}}{x!}, \quad x = 0, 1, \ldots, \quad \lambda > 0
with E(X) = V(X) = λ.
Example 2.49. Assume that the number of hits a website receives during
regular business hours occurs with a mean rate of 5 hits per minute. Find
the probability that there will be exactly 17 hits in the next 3 minutes.
Denote X to be the number of hits in the next 3 minutes. Hence, X ∼
Poisson(5 × 3) and
P(X = 17) = \frac{15^{17} e^{-15}}{17!} = 0.0847
and in R, dpois(17,15)
• estimate λ̂ = X/t
47
2.5.10 Common Continuous Distributions
Uniform
A continuous r.v. that places equal weight to all values within its support,
[a, b], a ≤ b, is said to be a uniform r.v. It has p.d.f.
f(x) = \frac{1}{b-a}, \quad a \leq x \leq b
[Figure: p.d.f. of a uniform distribution.]
Hence if X ∼ Uniform[a, b] then E(X) = \frac{a+b}{2} and V(X) = \frac{(b-a)^2}{12}.
Example 2.50. Waiting time for the delivery of a part from the warehouse
to certain destination is said to have a uniform distribution from 1 to 5 days.
What is the probability that the delivery time is two or more days?
Let X ∼ Uniform[1, 5]. Then, f (x) = 0.25 for 1 ≤ x ≤ 5 and hence
P(X \geq 2) = \int_{2}^{5} 0.25\,dt = 0.75.
1-punif(2,1,5)
48
Normal
The normal distribution (Gaussian distribution) is by far the most important
distribution in statistics. The normal distribution is identified by a location
parameter µ and a scale parameter σ 2 (> 0). A normal r.v. X is denoted as
X ∼ N (µ, σ 2 ) with p.d.f.
f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2\sigma^2}(x-\mu)^2}, \quad -\infty < x < \infty
[Figure: p.d.f. of the standard normal distribution.]
Example 2.51. Find P (−2.34 < Z < −1). From the relevant remark,
In R: pnorm(-1)-pnorm(-2.34)
49
If Z is standard normal then it has mean 0 and variance 1. Now if we
take a linear transformation of Z, say X = aZ + b, for constants a ≠ 0 and
b, then
E(X) = E(aZ + b) = a\underbrace{E(Z)}_{0} + b = b
and
V(X) = V(aZ + b) = a^2\underbrace{V(Z)}_{1} = a^2.
This fact together with the following proposition allows us to express any
normal r.v. as a linear transformation of the standard normal r.v. Z by
setting a = σ and b = µ.
Proof. In statistics it is known that two r.vs with the same c.d.f. implies
that they have the same p.d.f. Our goal is to show that X has the p.d.f. as
in equation (2.5.10)
P(X \leq x) = P(\sigma Z + \mu \leq x) = P\left(Z \leq \frac{x-\mu}{\sigma}\right) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\frac{x-\mu}{\sigma}} e^{-(1/2)z^2}\,dz = \frac{1}{\sigma\sqrt{2\pi}}\int_{-\infty}^{x} e^{-\frac{1}{2}\left(\frac{t-\mu}{\sigma}\right)^2}\,dt \quad \text{by substitution } t = \sigma z + \mu,
which by definition is the c.d.f. of a normal r.v. with mean µ and variance
σ 2 . Hence, X ∼ N (µ, σ 2 ).
Linear transformations are completely reversible, so given a normal r.v.
X with mean µ and variance σ 2 we can revert back to a standard normal by
Z = \frac{X - \mu}{\sigma}.
As a consequence any probability statements made about an arbitrary normal
r.v. can be reverted to statements about a standard normal r.v.
50
Example 2.52. Let X ∼ N (15, 7). Find P (13.4 < X < 19.0).
We begin by noting
P(13.4 < X < 19.0) = P\left(\frac{13.4-15}{\sqrt{7}} < \frac{X-15}{\sqrt{7}} < \frac{19.0-15}{\sqrt{7}}\right) = P(-0.6047 < Z < 1.5119) = F_Z(1.5119) - F_Z(-0.6047) = P(Z < 1.5119) - P(Z \leq -0.6047) = 0.6620312
If one is using a computer there is no need to revert back and forth from a
standard normal, but it is always useful to standardize concepts. You could
find the answer by using
pnorm(1.5119)-pnorm(-0.6047), or
pnorm(19,15,sqrt(7))-pnorm(13.4,15,sqrt(7))
which standardizes for you.
Example 2.53. The height of males in inches is assumed to be normally dis-
tributed with mean of 69.1 and standard deviation 2.6. Let X ∼ N (69.1, 2.62 ).
Find the 90th percentile for the height of males.
[Figure: normal density centered at 69.1 with 90% of the area shaded to the left of the 90th percentile.]
First we find the 90th percentile of the standard normal which is qnorm(0.9)=
1.281552. Then we transform to
2.6(1.281552) + 69.1 = 72.43204.
Or, just input into R: qnorm(0.9,69.1,2.6).
A very useful theorem (whose proof is beyond the scope of this class) is
the following.
Proposition 2.8. A linear combination of (independent) normal random
variables is a normal random variable.
51
2.6 Central Limit Theorem
The Central Limit Theorem (C.L.T.) is a powerful statement concerning
the mean of a random sample. There are three versions, the classical, the
Lyapunov and the Lindeberg, but in effect they all make the same statement:
the asymptotic distribution of the sample mean X̄ is normal, irrespective
of the distribution of the individual r.vs X1 , . . . , Xn .
Proposition 2.9. (Central Limit Theorem)
Let X_1, . . . , X_n be a random sample, i.e. i.i.d., with E(X_i) = \mu < \infty and
V(X_i) = \sigma^2 < \infty. Then, for \bar{X} = (1/n)\sum_{i=1}^{n} X_i,
\frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \xrightarrow[n\to\infty]{d} N(0, 1)
• Poisson: λ > 10
Example 2.54. At a university the mean age of students is 22.3 and the
standard deviation is 4. A random sample of 64 students is to be drawn.
What is the probability that the average age of the sample will be greater
than 23?
By the C.L.T.
\bar{X} \overset{approx.}{\sim} N\left(22.3, \frac{4^2}{64}\right).
So we need to find
P(\bar{X} > 23) = P\left(\frac{\bar{X} - 22.3}{4/\sqrt{64}} > \frac{23 - 22.3}{4/\sqrt{64}}\right) = P(Z > 1.4) = 1 - P(Z \leq 1.4) = 0.0808
or in R: 1-pnorm(23,22.3,4/sqrt(64))
52
Example 2.55. At a university assume it is known that 25% of students are
over 21. In a sample of 400 what is the probability that more than 110 of
them are over 21?
Exact solution: Let X be the number of students over 21. Then X ∼
Bin(400, 0.25).
in R:1-pbinom(110,400,0.25)
Approximation via C.L.T.: Let p̂ = X/n, then we have shown that E(p̂) =
p = 0.25 and V (p̂) = p(1 − p)/n = 0.0004688. Since, p̂ is in fact an average
we can use the C.L.T. to find P (p̂ > 110/400). To reiterate
\hat{p} \overset{approx.}{\sim} N(0.25, 0.0004688)
Hence,
in R: 1-pnorm(110/400,0.25,sqrt(0.0004688))
Remark 2.9. We could have actually solved this using X = n\hat{p}, since if \hat{p} is
normal via the C.L.T. then n\hat{p} is simply a linear transformation of a normal.
In general the same applies for \sum_{i=1}^{n} X_i = n\bar{X}.
So recall that if we can assume X̄ is normal (or at least approximately
normal) then so is any linear transformation of it, such as the sum.
53
2.7 Normal Probability Plot
The C.L.T. provides us with the tools to understand and make inference
on the sampling distribution of the sample mean X̄. When the sample size
is small we cannot implement the C.L.T. However, by proposition 2.8, if
the data are normally distributed then X̄ is guaranteed to have a normal
distribution. This is where normal probability plots are useful.
A probability plot is a graphical technique for comparing two data sets,
either two sets of empirical observations, or one empirical set against a theoretical set.
54
The data are plotted against a theoretical normal distribution in such a way
that the points should form an approximate straight line. Departures from
this straight line indicate departures from normality.
There are two types of plots commonly used to compare the empirical c.d.f.
to the theoretical normal one (G(·)).
• P-P plot that plots (F̂n (x), G(x)) (with scaled changed to look linear),
• Q-Q plot which plots the quantile functions (F̂n−1 (x), G−1 (x)).
[Figure: density of the data overlaid with a normal curve, and the corresponding Q-Q plot of Theoretical Quantiles against Sample Quantiles.]
Note that the data appears to be skewed right, with a lighter tail on the
left and a heavier tail on the right (as compared to the normal).
http://www.stat.ufl.edu/~athienit/IntroStat/QQ.R
For interpretation of Q-Q plots, please watch the relevant podcasts. With
the vertical axis being the theoretical quantiles and the horizontal axis being
the sample quantiles, the interpretation of P-P plots and Q-Q plots is
equivalent. Compared to the straight line that corresponds to the distribution
to which you wish to compare your data, here is a quick guideline of how the tails behave:
55
Part II
Modules 3-4
56
Module 3
At the end of this module we will address the case where neither of these
points can be used and we will also take a look at the sample variance too.
57
Figure 3.1: Multiple confidence intervals from different samples
[Figure: standard normal density with central area 1 − α between z_{α/2} and z_{1−α/2} and area α/2 in each tail.]
58
the area to the right.
1 - \alpha = P\left(-z_{1-\alpha/2} < \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} < z_{1-\alpha/2}\right)   (3.1)
= P\left(\bar{X} - z_{1-\alpha/2}\frac{\sigma}{\sqrt{n}} < \mu < \bar{X} + z_{1-\alpha/2}\frac{\sigma}{\sqrt{n}}\right)
and the probability that (in the long run) the random C.I.,
\bar{X} \mp z_{1-\alpha/2}\frac{\sigma}{\sqrt{n}},
contains µ is 1 − α.
59
[Figure: density functions of N(0, 1) and t_4.]
Example 3.2. In a packaging plant, the sample mean and standard deviation
for the fill weight of 100 boxes are x̄ = 12.05 and s = 0.1. The 95% C.I. for
the mean fill weight of the boxes is (using qt(0.975,99) for t(1−0.025,99) )
12.05 \mp \underbrace{t_{(1-0.025,\,99)}}_{1.984}\frac{0.1}{\sqrt{100}} \to (12.03016, 12.06984),   (3.4)
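The same interval can be computed from the summary statistics in R (a sketch):
12.05 + c(-1, 1) * qt(0.975, 99) * 0.1 / sqrt(100)   # (12.03016, 12.06984)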
60
Remark 3.2. So far we have only discussed two-sided confidence intervals, as in
equation (3.1). However, one-sided confidence intervals might be more
appropriate in certain circumstances. For example, when one is interested
in the minimum breaking strength, or the maximum current in a circuit. In
these instances we are not interested in both an upper and a lower limit,
but only in a lower or only in an upper limit, respectively. Then we simply
replace z_{1-\alpha/2} or t_{(1-\alpha/2, n-1)} by z_{1-\alpha} or t_{(1-\alpha, n-1)}, e.g. a 100(1 − α)% C.I. for
µ is
\left(\bar{x} - t_{(1-\alpha, n-1)}\frac{s}{\sqrt{n}},\; \infty\right) \quad \text{or} \quad \left(-\infty,\; \bar{x} + t_{(1-\alpha, n-1)}\frac{s}{\sqrt{n}}\right)
If we assume σ is known then the width of the interval is twice the margin
of error:
width = 2z_{1-\alpha/2}\frac{\sigma}{\sqrt{n}}.
Thus,
\sqrt{n} = 2z_{1-\alpha/2}\frac{\sigma}{width} \;\Rightarrow\; n \geq \left(2z_{1-\alpha/2}\frac{\sigma}{width}\right)^2.
However, an estimate of σ is required, which can come from similar studies,
pilot studies or sometimes just rough guesses such as (range)/4.
Example 3.4. In Example 3.2 we had that x̄ = 12.05 and s = 0.1 for the
100 boxes, leading to a 95% C.I. for the true mean of width 0.0392, i.e. ±0.0196
(see (3.4)). The boss requires a narrower 95% C.I. of ±0.0120.
So,
n \geq \left(\frac{2(1.96)(0.1)}{2(0.0120)}\right)^2 = 266.7778
and we round up to n ≥ 267. In practice we should try and round up as
far as our resources allow.
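In R this sample size calculation is one line (a sketch):
ceiling((2 * qnorm(0.975) * 0.1 / (2 * 0.0120))^2)   # 267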
61
3.1.2 Hypothesis tests
A statistical hypothesis is a claim about a population characteristic (and on
occasion more than one). An example of a hypothesis is the claim that the
population mean is some value, e.g. µ = 75.
• Rejection region/criteria, the set of all test statistic values for which
H0 will be rejected.
The type I error is generally considered to be the most serious one, and
due to limitations, we can only control for one, so the rejection region is
chosen based upon the maximum P (type I error) = α that a researcher is
willing to accept.
Since we wish to control for the type I error, we set P (type I error) = α.
The default value of α is usually taken to be 5%.
62
An obvious candidate for a test statistic, that is an unbiased estimator
of the population mean, is X̄ which is normally distributed. If the data
were not known to be normally distributed the normality of X̄ can also be
confirmed by the C.L.T. Thus, under the null assumption H0
\bar{X} \overset{H_0}{\sim} N\left(75, \frac{9^2}{35}\right),
or equivalently
\frac{\bar{X} - 75}{9/\sqrt{35}} \overset{H_0}{\sim} N(0, 1).
and assuming that x̄ = 70.8 from the 35 samples, then, T.S. = −2.76. This
implies that 70.8 is 2.76 standard deviations below 75. Although this appears
to be far, we need to use the p-value to reach a formal conclusion.
{x|x ≤ −2.76},
The criterion for rejecting the null is p-value < α: the null hypothesis is
rejected in favor of the alternative hypothesis when the probability of observing
the test statistic value of −2.76 or more extreme (as indicated by Ha) is smaller
than the probability of the type I error we are willing to undertake.
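Assuming the one-sided alternative Ha: µ < 75 suggested by the rejection region above, the p-value can be computed in R as (a sketch):
pnorm(-2.76)                   # ~ 0.0029, from the standardized test statistic
pnorm(70.8, 75, 9/sqrt(35))    # same value without standardizing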
63
[Figure: standard normal density showing the p-value area to the left of −2.76 and the α = 0.05 rejection area to the left of −1.645.]
(i) H0 : µ ≤ µ0 vs Ha : µ > µ0
(ii) H0 : µ ≥ µ0 vs Ha : µ < µ0
(iii) H0 : µ = µ0 vs Ha : µ ≠ µ0
(iii) P (|Z| ≥ |T.S.|) < α (area to the right of |T.S.| plus area to the left of
−|T.S.| < α)
64
Hence, the p-value is 0.02013675 and we reject the null hypothesis and con-
clude that the true mean is not 1000.
[Figure: standard normal density with the two-sided p-value area beyond ±2.32379 shaded.]
H0 : µ ≤ 15 vs Ha : µ > 15
65
Unknown population variance
If σ is unknown, which is usually the case, we replace it by its sample
estimate s. Consequently,
\frac{\bar{X} - \mu_0}{S/\sqrt{n}} \overset{H_0}{\sim} t_{n-1},
and for an observed value \bar{X} = \bar{x}, the test statistic becomes
T.S. = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}.
At the α significance level, for the same hypothesis tests as before, we reject
H0 if
H0 : µ ≤ 25 vs Ha : µ > 25
T.S. = \frac{27.54 - 25}{5.47/\sqrt{5}} = 1.03832
The p-value is the area to the right of 1.03832 under the t4 distribution,
which is 0.1788813. Hence, we fail to reject the null hypothesis. In R input:
66
Remark 3.3. The values contained within a two-sided 100(1 − α)% C.I. are
precisely those values (that when used in the null hypothesis) will result in
the p-value of a two sided hypothesis test to be greater than α.
For the one sided case, an interval that only uses the
• upper limit, contains precisely those values for which the p-value of
a one-sided hypothesis test, with alternative less than, will be greater
than α.
• lower limit, contains precisely those values for which the p-value of a
one-sided hypothesis test, with alternative greater than, will be greater
than α. (As in example 3.7)
The p-value is P (t4 < −0.293) + P (t4 > 0.293) = 0.7839. Hence, since the
p-value is large (> 0.05) we fail to reject H0 and conclude that population
mean is not statistically different from 257.
Instead of a hypothesis test, if a two-sided 95% C.I. was constructed,
256.6 \mp \underbrace{t_{(1-0.025,\,4)}}_{2.776}\frac{3.05}{\sqrt{5}} \to (252.81, 260.39),
it is clear that the null hypothesis value of µ = 257 is a plausible value and
consequently H0 is plausible, so it is not rejected.
67
3.2 Inference for Population Proportion
3.2.1 Large sample confidence interval
In the binomial setting, experiments had binary outcomes and of interest was
the number of successes out of the total number of trials. Let X be the total
number of successes, then X ∼ Bin(n, p). Once an experiment is conducted
and data obtained an estimate for p can be obtained,
x
p̂ =
n
which is an average. It is the total number of successes over the total number
of trials. As such, if the number of successes and number of failures are
greater than 5, the C.L.T. tells us that
\hat{p} \sim N\left(p, \frac{p(1-p)}{n}\right).
A 100(1 − α)% C.I. can be created as before,
\hat{p} \mp z_{1-\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}.
This is the classical approach for when the sample size is large. There does
exist an interval similar to classical version that works relatively well for
small sample sizes (not too small) and is equivalent for large sample sizes. It
is called the Agresti-Coull 100(1 − α)% C.I.,
\tilde{p} \mp z_{1-\alpha/2}\sqrt{\frac{\tilde{p}(1-\tilde{p})}{\tilde{n}}},
where ñ := n + 4, and p̃ := (x + 2)/ñ.
Example 3.9. A map and GPS application for a smartphone was tested for
accuracy. The experiment yielded 26 errors out of the 74 trials. Find the 90%
C.I. for the proportion of errors.
Since n = 74 and x = 26, then ñ = 74 + 4 and p̃ = (26 + 2)/78 = 0.359.
Hence the 90% C.I. for p is
0.359 \mp \underbrace{z_{1-0.05}}_{1.645}\sqrt{\frac{0.359(1-0.359)}{78}} \to (0.2696337, 0.4483151)
or in R:
> library(binom)
> binom.agresti.coull(26,74,conf.level=0.90)
method x n mean lower upper
1 agresti-coull 26 74 0.3513514 0.2666357 0.4465532
Note that the answers are slightly different because the R function states that
“this method does not use the concept of adding 2 successes and 2 failures,
but rather uses the formulas explicitly described” in the referenced paper. Hence
we recommend and encourage the use of software.
68
3.2.2 Large sample hypothesis test
Let X be the number of successes in n Bernoulli trials with probability of
success p, then X ∼ Bin(n, p). We know by the C.L.T. that under certain
regularity conditions,
\hat{p} \sim N\left(p, \frac{p(1-p)}{n}\right).
To test
(i) H0 : p ≤ p0 vs Ha : p > p0
(ii) H0 : p ≥ p0 vs Ha : p < p0
(iii) H0 : p = p0 vs Ha : p ≠ p0
T.S. = \frac{28/78 - 0.5}{\sqrt{\dfrac{(28/78)(1 - 28/78)}{78}}} = -2.596426
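A quick R check of this test statistic, with a two-sided p-value shown for illustration (the intended alternative is not stated here):
p_t <- 28/78
ts  <- (p_t - 0.5) / sqrt(p_t * (1 - p_t) / 78)   # -2.596426
2 * pnorm(-abs(ts))                               # two-sided p-value ~ 0.0094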
69
3.3 Inference for Population Variance
The sample statistic s2 is widely used as the point estimate for the population
variance σ 2 , and similar to the sample mean it varies from sample to sample
and has a sampling distribution.
Let X_1, . . . , X_n be i.i.d. r.v.'s. We already have some tools that help us
determine the distribution of \bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i, a function of the r.v.'s, and
\frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1},
where χ2 denotes a chi-square distribution with (n − 1) degrees of freedom.
[Figure: χ² density with central area 1 − α between χ²_{α/2} and χ²_{1−α/2} and area α/2 in each tail.]
It is worth mentioning that a χ2n−1 has mean n − 1 and variance 2(n − 1).
70
3.3.1 Confidence interval
Consequently,
1 - \alpha = P\left(\chi^2_{(\alpha/2, n-1)} < \frac{(n-1)S^2}{\sigma^2} < \chi^2_{(1-\alpha/2, n-1)}\right) = P\left(\frac{(n-1)S^2}{\chi^2_{(1-\alpha/2, n-1)}} < \sigma^2 < \frac{(n-1)S^2}{\chi^2_{(\alpha/2, n-1)}}\right)
which implies that on the long run this interval will contain the true population
variance parameter 100(1 − α)% of the time. Thus, the 100(1 − α)% C.I. for σ² is
\left(\frac{(n-1)s^2}{\chi^2_{(1-\alpha/2, n-1)}},\; \frac{(n-1)s^2}{\chi^2_{(\alpha/2, n-1)}}\right).
Example 3.11. At a coffee plant a machine fills 500g coffee containers. Ideally,
the amount of coffee in a container should vary only slightly about the
500g nominal value. The machine is designed to dispense coffee amounts that
have a normal distribution with mean 506.6g and standard deviation 4g. This
implies that only 5% of containers weigh less than 500g.
[Figure: N(506.6, 16) density with the 5% area below 500 shaded.]
\frac{29(3.9302^2)}{45.722} < \sigma^2 < \frac{29(3.9302^2)}{16.047}
9.7971 < \sigma^2 < 27.9144
71
as a 95% C.I. for σ 2 , or equivalently, by taking the square root, (3.1300, 5.2834)
as a 95% C.I. for σ.
In R we can find the critical values using the qchisq function as in
qchisq(0.025,29). If we have the raw data we use the varTest function to
calculate the CI. See
http://www.stat.ufl.edu/~athienit/IntroStat/vartest.R
Remark 3.4. In this example, of more interest is the upper limit of the vari-
ance. A smaller variance for a N (506.6, 16) means that the area to the left of
500 will be smaller and a larger variance that the area to the left of 500 will
be larger. To construct a one-sided C.I. simply replace α/2 in the formula
by α.
where (iii) simply stated is twice the smallest of the two probabilities/areas.
T.S. = \frac{29(3.9302^2)}{16} = 27.99649
72
[Figure: χ²_29 density with the test statistic marked in the left tail and an area of 0.4819 shaded on each side of it.]
To find the p-value we need to find the area in the two tails. The first order
of business is to determine whether 27.99649 is in the right or left tail. Since the
area to the left of 27.99649, pchisq(27.99649,29)=0.4818993, is smaller
than the area to the right, our T.S. is in the left tail, as can be
seen from the figure.
When we worked with the normal or Student-t distributions it was always
easy to find the other tail due to symmetry. For example, if the T.S. = -2.5
then it means we are in the left tail and the right tail cutoff is +2.5. In any case,
the two-sided p-value was just twice the probability of the tail, which is what
we will do here as well. Multiply by 2 to get the two-tail equivalent p-value of
0.9637986. Note this is greater than 0.05. Or use software, see
http://www.stat.ufl.edu/~athienit/IntroStat/vartest.R
Where we get
> varTest(x,alternative="two.sided",sigma.squared=16,conf.level=0.95)
73
3.4 Distribution Free Inference
When the sample size is small and we cannot assume that the data are
normally distributed we must use exact nonparametric procedures to
perform inference on population central values. Instead of means we will be
referring to medians (µ̃) and other location concepts, as they are less influenced
by outliers, which can have a drastic impact (especially) on estimates
from small samples.
[Figure: Bin(15, 0.5) p.m.f. over the possible values 0 to 15 with the p-value area highlighted.]
74
p\text{-value} = P(B \leq 5 \mid B \sim Bin(15, 0.5)) = P(B = 0) + \ldots + P(B = 5) = \sum_{i=0}^{5}\binom{15}{i} 0.5^i\, 0.5^{15-i} = 0.1509.
which yields the one-sided 98.24219% C.I. (−∞, 68). Since 65 is in the
interval, we fail to reject the null.
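In R the sign-test p-value above is just the binomial c.d.f. (a sketch):
pbinom(5, 15, 0.5)   # P(B <= 5) = 0.1509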
Remark 3.5. Because we are working with a discrete distribution it is not
always possible to find the α (one sided) or α/2 and 1 − α/2 percentiles so
we find the next ones that will give us a true α less than what was specified.
Assume we wished to perform a two-sided test; then the p-value would
correspond to the sum of the probabilities in the two tails as long as the
probabilities in the other tail are less than or equal to the probability of the
test statistic, i.e. ≤ P (B = 5). If the left tail is 0 to 5 then the right tail is 10
to 15.
[Figure: Bin(15, 0.5) p.m.f. with P(B = 5) marked and the two-sided p-value area (values 0 to 5 and 10 to 15) highlighted.]
75
eqnpar(x,p=0.5,type=6,ci=T,ci.method="exact",ci.type="two-sided",
+ approx.conf.level = 0.95)
Example 3.14. With the same data, to test Ha: the 40th percentile ≠ 65, under the
null B ∼ Bin(15, 0.6), and using a two-sided test
[Figure: Bin(15, 0.6) p.m.f. with P(B = 5) marked and the two-sided p-value area highlighted.]
eqnpar(x,p=0.4,type=6,ci=T,ci.method="exact",ci.type="two-sided",
+ approx.conf.level = 0.95)
to get the 96.09947% C.I. for the 40th percentile (60,64). Note a slight
discrepancy with the test, but that is because the α is different in the two
cases.
76
3.4.2 Wilcoxon signed-rank test
This procedure does not require the distributional assumption of normality.
However, it requires the assumption of a continuous symmetric p.d.f. Assume
that X1 , . . . , Xn are i.i.d. from some c.d.f. F (x) that meets these assumptions.
The null hypothesis, H0 , is that the distribution is centrally located
around some value µ0 , which is tested against
2. Discard any di = 0 (as long as they are not more than 10% of the data,
otherwise research “adjusted” methods).
The sampling distribution of the test statistic, denoted here as W, has been
determined and is available in textbooks and software for calculation of p-values:
S_+ \overset{H_0}{\sim} W_n
The p-value calculation for the three alternatives, which will be done by software
(as we have not discussed the W_n distribution), is
(i) P (Wn ≥ S+ )
(ii) P (Wn ≤ S+ )
(iii) Find the two tails and add probabilities in similar fashion as the sign
test.
77
Example 3.15. Take for example a dataset with the following values, where we wish to
test Ha : center is greater than 16.
x d = x − 16 Rank
13.9 -2.1 1 (-)
11.0 -5.0 2 (-)
21.7 5.7 3 (+)
9.3 -6.7 4 (-)
5.0 -11.0 5 (-)
0.9 -15.1 6 (-)
[Figure: null distribution of the signed-rank statistic with the p-value = 0.9531 area shaded.]
data: x
V = 3, p-value = 0.9531
alternative hypothesis: true location is greater than 16
95 percent confidence interval:
5 Inf
sample estimates:
(pseudo)median
10.15
Notice, the built in R function can also provide us with a C.I. for the (pseudo)median;
and that 16 is in the interval.
78
If we wished to perform a two-sided test, then the p-value = 0.1563.
[Figure: null distribution of the signed-rank statistic with the two-sided p-value = 0.1563 area shaded.]
http://www.stat.ufl.edu/~athienit/IntroStat/wilcox_1sample.R
Remark 3.6. If there are ties in the data then the values that are tied get the
average of the ranks that they would have gotten if not tied. For example,
the rank of the data
The values 0.5 should have gotten ranks 2 and 3 if they were slightly
different. Now consider a potential three-way tie.
79
Module 4
80
and hence a 100(1 − α)% C.I. for the difference µ_X − µ_Y is
\bar{x} - \bar{y} \mp z_{1-\alpha/2}\sqrt{\frac{\sigma_X^2}{n_X} + \frac{\sigma_Y^2}{n_Y}}.
Example 4.2. Two methods are considered standard practice for surface
hardening. For Method A there were 15 specimens with a mean of 400.9
(N/mm²) and standard deviation 10.6. For Method B there were also 15
specimens with a mean of 367.2 and standard deviation 6.1. Assuming the
samples are independent and from a normal distribution, the 98% C.I. for
µ_A − µ_B is
400.9 - 367.2 \mp t_{0.99,\,\nu}\sqrt{\frac{10.6^2}{15} + \frac{6.1^2}{15}}
where
\nu = \frac{\left(\frac{10.6^2}{15} + \frac{6.1^2}{15}\right)^2}{\frac{(10.6^2/15)^2}{14} + \frac{(6.1^2/15)^2}{14}} = 22.36
and hence t_{0.99,\,22.36} = 2.5052, giving a 98% C.I. for the difference µ_A − µ_B of
(25.7892, 41.6108).
81
Notice that 0 is not in the interval so we can conclude that the two means
are different. In fact the interval is purely positive so we can conclude that
µA is at least 25.7892 N/mm2 larger than µB and at most 41.6108 N/mm2 .
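A sketch of this Welch-type interval in R from the summary statistics:
se <- sqrt(10.6^2/15 + 6.1^2/15)
nu <- se^4 / ((10.6^2/15)^2/14 + (6.1^2/15)^2/14)   # ~ 22.36
(400.9 - 367.2) + c(-1, 1) * qt(0.99, nu) * se      # (25.79, 41.61)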
Remark 4.1. When population variances are believed to be equal, i.e. σ_X² ≡ σ_Y²,
we can improve on the estimate of variance by using a pooled or weighted
average estimate. If, in addition to the regular assumptions, we can assume
equality of variances, then the 100(1 − α)% C.I. for µ_X − µ_Y is
\bar{x} - \bar{y} \mp t_{1-\alpha/2,\, n_X + n_Y - 2}\sqrt{\frac{s_p^2}{n_X} + \frac{s_p^2}{n_Y}},
with
s_p^2 = \frac{n_X - 1}{n_X + n_Y - 2}\, s_X^2 + \frac{n_Y - 1}{n_X + n_Y - 2}\, s_Y^2.
The assumption that the variances are equal must be made a priori and not
used simply because the two variances may be close in magnitude.
Example 4.3. Consider Example 4.2 but now assume that σ_X² ≡ σ_Y². A
98% C.I. for the difference µ_X − µ_Y is constructed with
s_p^2 = \frac{14(10.6^2) + 14(6.1^2)}{28} = 8.648^2
giving
400.9 - 367.2 \mp \underbrace{t_{0.99,\,28}}_{2.467}\sqrt{\frac{2}{15}(8.648^2)} \to (25.9097, 41.4903)
Intuitively, since proportions are between 0 and 1, the difference of two pro-
portions must lie between -1 and 1. Hence if the bounds of a C.I. are outside
the intuitive ones, they should be replaced by the intuitive bounds.
82
Example 4.4. In a clinical trial for a pain medication, 394 subjects were
blindly administered the drug, while an independent group of 380 were given
a placebo. From the drug group, 360 showed an improvement. From the
placebo group 304 showed improvement. Construct a 95% C.I. for the dif-
ference and interpret.
Let D stand for drug and P for placebo, then \tilde{p}_D = 361/396 and \tilde{p}_P = 305/382, and
\tilde{p}_D - \tilde{p}_P \mp 1.96\sqrt{\frac{\tilde{p}_D(1-\tilde{p}_D)}{396} + \frac{\tilde{p}_P(1-\tilde{p}_P)}{382}} \to (0.0642, 0.1622)
Hence the proportion of subjects that showed substantial improvement under
the drug treatment was at least 6.42% and at most 16.22% greater than under
the placebo.
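A sketch of this computation in R, using the same "add one success and one failure per group" adjustment (the object names are ours):

# Adjusted 95% C.I. for the difference of two proportions (drug minus placebo)
xD <- 360; nD <- 394    # drug: successes and sample size
xP <- 304; nP <- 380    # placebo
pD <- (xD + 1)/(nD + 2); pP <- (xP + 1)/(nP + 2)
se <- sqrt(pD*(1 - pD)/(nD + 2) + pP*(1 - pP)/(nP + 2))
(pD - pP) + c(-1, 1) * qnorm(0.975) * se    # approximately (0.064, 0.162)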
Paired data
There are instances when two samples are not independent, i.e. when a relationship exists between them. For example, before-treatment and after-treatment measurements made on the same experimental subject are dependent on each other through that subject. This is common in clinical studies, where the effectiveness of a treatment, which may be quantified by the difference between the before and after measurements, depends on the individual undergoing the treatment. Such data are said to be paired.
Consider the data in the form of the pairs (X1 , Y1 ), (X2 , Y2 ), . . . , (Xn , Yn ).
We note that the pairs, i.e. two dimensional vectors, are independent as the
experimental subjects are assumed to be independent with marginal expec-
tations E(Xi ) = µX and E(Yi ) = µY for all i = 1, . . . , n. By defining,
D1 = X1 − Y1
D2 = X2 − Y2
..
.
Dn = Xn − Yn
a two sample problem has been reduced to a one sample problem. Inference
for µX − µY is equivalent to one sample inference on µD as was done in
Module 3. This holds since
$$\mu_D := E(\bar{D}) = E\left(\frac{1}{n}\sum_{i=1}^{n} D_i\right) = E\left(\frac{1}{n}\sum_{i=1}^{n}(X_i - Y_i)\right) = E(\bar{X} - \bar{Y}) = \mu_X - \mu_Y.$$
83
Example 4.5. A new and old type of rubber compound can be used in
tires. A researcher is interested in a compound/type that does not wear
easily. Ten cars were chosen at random and driven around a track a predetermined number of times. Each car did this twice, once with each tire type, and the depth of the tread was then measured.
Car
1 2 3 4 5 6 7 8 9 10
New 4.35 5.00 4.21 5.03 5.71 4.61 4.70 6.03 3.80 4.70
Old 4.19 4.62 4.04 4.72 5.52 4.26 4.27 6.24 3.46 4.50
d 0.16 0.38 0.17 0.31 0.19 0.35 0.43 -0.21 0.34 0.20
With d̄ = 0.232 and s_D = 0.183, and assuming that the differences are normally distributed, a 95% C.I. for µnew − µold = µD is
$$0.232 \mp \underbrace{t_{0.975,9}}_{2.262}\,\frac{0.183}{\sqrt{10}} \;\to\; (0.101,\ 0.363)$$
and we note that the interval is strictly greater than 0, implying that the difference is positive, i.e. that µnew > µold. In fact we can conclude that µnew is larger than µold by at least 0.101 units and at most 0.363 units.
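Since the differences reduce this to a one-sample problem, R's t.test with paired = TRUE reproduces the interval; a quick sketch with the data above:

# Paired 95% C.I. for mu_new - mu_old (Example 4.5)
new <- c(4.35, 5.00, 4.21, 5.03, 5.71, 4.61, 4.70, 6.03, 3.80, 4.70)
old <- c(4.19, 4.62, 4.04, 4.72, 5.52, 4.26, 4.27, 6.24, 3.46, 4.50)
t.test(new, old, paired = TRUE, conf.level = 0.95)$conf.int    # approximately (0.101, 0.363)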
84
4.1.2 Hypothesis tests
Known variance
Let X₁, . . . , X_{n_X} and Y₁, . . . , Y_{n_Y} represent two independent random large samples with n_X > 40, n_Y > 40, with means µX, µY and variances $\sigma_X^2$, $\sigma_Y^2$ respectively. We have seen in Section 4.1.1 that if we can assume that X̄ and Ȳ are normally distributed, then
$$\bar{X} - \bar{Y} \sim N\Bigl(\underbrace{\mu_X - \mu_Y}_{\overset{H_0}{=}\;\Delta_0},\ \frac{\sigma_X^2}{n_X} + \frac{\sigma_Y^2}{n_Y}\Bigr).$$
To test
(i) H0 : µX − µY ≤ ∆0 vs Ha : µX − µY > ∆0
(ii) H0 : µX − µY ≥ ∆0 vs Ha : µX − µY < ∆0
(iii) H0 : µX − µY = ∆0 vs Ha : µX − µY 6= ∆0
we assume that the variances are known and the test statistic is
$$T.S. = \frac{\bar{x} - \bar{y} - \Delta_0}{\sqrt{\sigma_X^2/n_X + \sigma_Y^2/n_Y}}.$$
The r.v. corresponding to the test statistic has a standard normal distri-
bution under the null hypothesis H0 , that µX − µY = ∆0 . Reject the null
if
85
addition to the regular assumptions, we can assume equality of variances, then both sample variances are replaced by the pooled estimate $s_p^2$ defined earlier.
To test hypotheses about two population proportions,
(i) H0 : pX − pY ≤ ∆0 vs Ha : pX − pY > ∆0
(ii) H0 : pX − pY ≥ ∆0 vs Ha : pX − pY < ∆0
(iii) H0 : pX − pY = ∆0 vs Ha : pX − pY 6= ∆0
where ∆0 ∈ [−1, 1], we must assume that the number of successes and failures
is greater than 10 for both samples. As the null hypotheses values for pX and
pY are not available we simply check that the sample successes and failures
are greater than 10. By virtue of the C.L.T.
$$\hat{p}_X - \hat{p}_Y \overset{H_0}{\sim} N\Bigl(\underbrace{p_X - p_Y}_{\Delta_0},\ \frac{p_X(1-p_X)}{n_X} + \frac{p_Y(1-p_Y)}{n_Y}\Bigr),$$
and, when testing ∆0 = 0, the common proportion is estimated by the pooled adjusted value
$$\tilde{p} = \frac{(x+1) + (y+1)}{\tilde{n}_X + \tilde{n}_Y}.$$
The test statistic is then
$$T.S. = \frac{\tilde{p}_X - \tilde{p}_Y - 0}{\sqrt{\tilde{p}(1-\tilde{p})\,(1/\tilde{n}_X + 1/\tilde{n}_Y)}} \overset{H_0}{\sim} N(0,1),$$
and the r.v. corresponding to the test statistic has a standard normal distri-
bution under the null hypothesis.
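In practice R's prop.test carries out the analogous large-sample test (it does not use the +1 adjustment shown above, so its numbers will differ slightly); a sketch using the counts of Example 4.4:

# Large-sample test of H0: pX - pY = 0 against the two-sided alternative
successes <- c(360, 304)    # drug, placebo
trials    <- c(394, 380)
prop.test(successes, trials, alternative = "two.sided", correct = FALSE)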
86
Paired data
In the event that two samples are dependent, i.e. paired, such as when two
different measurements are made on the same experimental unit, the infer-
ence methodology must be adapted to account for the dependence/covariance
between the two samples.
Refer to Section 4.1.1, where we consider the data in the form of the
pairs (X1 , Y1 ), (X2 , Y2 ), . . . , (Xn , Yn ) and construct the one-dimensional, i.e.
one-sample D₁, D₂, . . . , D_n where D_i = X_i − Y_i for all i = 1, . . . , n. As shown earlier, µ_D = µ_X − µ_Y and the variance term $\sigma_D^2$ incorporates the covariance between X and Y.
To test
(i) H0 : µX − µY = µD ≤ ∆0 vs Ha : µX − µY = µD > ∆0
(ii) H0 : µX − µY = µD ≥ ∆0 vs Ha : µX − µY = µD < ∆0
(iii) H0 : µX − µY = µD = ∆0 vs Ha : µX − µY = µD 6= ∆0
87
4.2 Inference for Population Variances
Now we extend the set up to two independent i.i.d. normal distribution
2
samples X1 , . . . , XnX and Y1 , . . . , YnY with variances σX and σY2 respectively.
It is known that
$$\frac{(n_X - 1)S_X^2}{\sigma_X^2} \sim \chi^2_{n_X-1} \qquad\text{and}\qquad \frac{(n_Y - 1)S_Y^2}{\sigma_Y^2} \sim \chi^2_{n_Y-1}$$
but it is also known that a standardized (by dividing by the degrees of free-
dom) ratio of two χ2 ’s is an F-distribution. Therefore,
$$\frac{\;\frac{(n_X-1)S_X^2/\sigma_X^2}{n_X-1}\;}{\;\frac{(n_Y-1)S_Y^2/\sigma_Y^2}{n_Y-1}\;} = \frac{S_X^2/S_Y^2}{\sigma_X^2/\sigma_Y^2} \sim F_{n_X-1,\,n_Y-1}.$$
• $\sigma_X^2/\sigma_Y^2 > 1 \Rightarrow \sigma_X^2 > \sigma_Y^2$
• $\sigma_X^2/\sigma_Y^2 < 1 \Rightarrow \sigma_X^2 < \sigma_Y^2$
[Figure: F density with lower and upper tail areas of α/2 marked at $F_{\alpha/2}$ and $F_{1-\alpha/2}$.]
88
A 100(1 − α)% C.I. for $\sigma_X^2/\sigma_Y^2$ is constructed from
$$1 - \alpha = P\left(F_{(\alpha/2;\,n_X-1,\,n_Y-1)} < \frac{S_X^2/S_Y^2}{\sigma_X^2/\sigma_Y^2} < F_{(1-\alpha/2;\,n_X-1,\,n_Y-1)}\right) = P\left(\frac{S_X^2}{S_Y^2}\,\frac{1}{F_{(1-\alpha/2;\,n_X-1,\,n_Y-1)}} < \frac{\sigma_X^2}{\sigma_Y^2} < \frac{S_X^2}{S_Y^2}\,\frac{1}{F_{(\alpha/2;\,n_X-1,\,n_Y-1)}}\right).$$
Thus, the 100(1 − α)% C.I. for $\sigma_X^2/\sigma_Y^2$ is
$$\left(\frac{s_X^2}{s_Y^2}\,\frac{1}{F_{(1-\alpha/2;\,n_X-1,\,n_Y-1)}},\ \frac{s_X^2}{s_Y^2}\,\frac{1}{F_{(\alpha/2;\,n_X-1,\,n_Y-1)}}\right)$$
Example 4.6. The life length of an electrical component was studied under
two operating voltages, 110 and 220. Ten different components were assigned
to be tested under 110V and 16 under 220V. The times to failure (in 100’s
hrs) were then recorded. Assuming that the two samples are independent and normal we construct a 95% C.I. for $\sigma_{110}^2/\sigma_{220}^2$.
V n Mean St.Dev.
110 10 20.1932 0.5688
220 16 9.9222 0.2408
Hence,
$$\left(\frac{0.5688^2}{0.2408^2}\,\frac{1}{\underbrace{F_{0.975;9,15}}_{3.1226}},\ \frac{0.5688^2}{0.2408^2}\,\frac{1}{\underbrace{F_{0.025;9,15}}_{0.2653}}\right) \;\to\; (1.786875,\ 21.032615)$$
We note that the value 1 is not in the interval. Therefore, we conclude that the variances are not equal and that in fact the variance of 110V is at least 78.69% larger than the variance of 220V and at most 2003.26% larger. In terms of the ratio of standard deviations, the 95% C.I. for $\sigma_{110}/\sigma_{220}$ is
$$\left(\sqrt{1.786875},\ \sqrt{21.032615}\right) \;\to\; (1.336740,\ 4.586133)$$
Critical values can be obtained using the quantile function qf, for example
89
qf(0.975,9,15). In R, once we create two vectors for the two datasets,
> var.test(V110,V220,ratio=1,alternative="two.sided",conf.level=0.95)
http://www.stat.ufl.edu/~athienit/IntroStat/var_ratio.R
Rejection region criteria can be found in the relevant textbook section but
are omitted here.
90
4.3 Distribution Free Inference
4.3.1 Wilcoxon rank-sum test
This is one of the most widely used two-sample tests for location differences between two populations. Assume that the two independent samples X₁, . . . , X_{n_X} are i.i.d. with c.d.f. F₁(·) and Y₁, . . . , Y_{n_Y} are i.i.d. with c.d.f. F₂(·). The null hypothesis H₀: F₁(x) = F₂(x) ∀x is tested against
(i) X’s tend to be larger than the Y ’s by ∆0 units, i.e. X’s> Y ’s+∆0 .
(ii) X’s tend to be smaller than the Y ’s by ∆0 units, i.e. X’s< Y ’s+∆0 .
(iii) One of the two populations is shifted from the other by ∆0 units.
1. rank all N = n_X + n_Y observations (pooled across the two samples) from smallest to largest, and
2. calculate the sum of the ranks associated with the first sample (assuming X is first),
$$T.S. = S_X \overset{H_0}{\sim} W_N$$
Under the null the test statistic has a Wilcoxon sampling distribution. It really does not matter which rank sum we calculate since
$$S_X + S_Y = \sum_{i=1}^{N} i = \frac{N(N+1)}{2}$$
To find the p-value we will use software (rather than working with limited tables), where we can even obtain confidence intervals for the difference in the location of the center of the first population relative to the second.
Example 4.7. Two groups of 10 subjects did not know whether they were receiving alcohol or the placebo, and their reaction times (in seconds) were recorded.
(x) Placebo 0.90 0.37 1.63 0.83 0.95 0.78 0.86 0.61 0.38 1.97
(y) Alcohol 1.46 1.45 1.76 1.44 1.11 3.07 0.98 1.27 2.56 1.32
Test whether the distribution of reaction times for the placebo are shifted
to the “left” of that for alcohol (case (ii)). The ranks are:
Placebo 7 1 16 5 8 4 6 3 2 18 70
Alcohol 15 14 17 13 10 20 9 11 19 12 140
91
> wilcox.test(placebo,alcohol,alternative="less",mu=0,conf.int=TRUE)
Wilcoxon rank sum test
data: placebo and alcohol
W = 15, p-value = 0.003421
alternative hypothesis: true location shift is less than 0
95 percent confidence interval:
-Inf -0.37
sample estimates:
difference in location
-0.61
(iii) D’s tend to be consistently larger or smaller than ∆0 , i.e. X’s tend to
be consistently different than Y ’s by an amount of ∆0 or greater.
92
Field A B d Rank(|d|) Field A B d Rank(|d|)
1 211.4 186.3 25.1 15 11 208.9 183.6 25.3 17.5
2 204.4 205.7 -1.3 1 12 208.7 188.7 20.0 8
3 202.0 184.4 17.6 7 13 213.8 188.6 25.2 16
4 201.9 203.6 -1.7 2 14 201.6 204.2 -2.6 4
5 202.4 180.4 22.0 14 15 201.8 181.6 20.1 9
6 202.0 202.0 0 0 16 200.3 208.7 -8.4 6
7 202.4 181.5 20.9 13 17 201.8 181.5 20.3 10
8 207.1 186.7 20.4 11 18 201.5 208.7 -7.2 5
9 203.6 205.7 -2.1 3 19 212.1 186.8 25.3 17.5
10 216.0 189.1 26.9 19 20 203.4 182.9 20.5 12
The test statistic is S+ = 169 (although it would have been easier to calcu-
late S− and then deduce S+). Since we are in the two-sided case (case (iii)), to find the p-value we need P(W₁₉ ≤ 21) + P(W₁₉ ≥ 169), or we can use R (which will give a slightly different answer as it does not ignore the zero; run the code and see the warnings), as done in
http://www.stat.ufl.edu/~athienit/IntroStat/wilcox_2.R
> wilcox.test(A,B,paired=TRUE,conf.int=TRUE)
Wilcoxon signed rank test with continuity correction
data: A and B
V = 169, p-value = 0.003098
alternative hypothesis: true location shift is not equal to 0
95 percent confidence interval:
8.700026 22.650044
sample estimates:
(pseudo)median
12.12435
where the p-value is tiny and we reject the null. Also, we note that the 95%
C.I. is strictly positive implying that A tends to be 8.7 to 22.65 units larger
than B.
93
Example 4.9. Continuing from the previous example, suppose that type B
was the old fertilizer and that a sales agent approached the city council with
a claim that their new fertilizer (type A) was better in that it would produce
5 or more pounds of grass clippings compared to B.
The alternative hypothesis is case (i) with ∆₀ = 5. Recomputing the signed ranks for the differences d − 5 gives a test statistic of S₊ = 181, and since the p-value is small and the C.I. is strictly greater than 5, we reject the null.
> wilcox.test(A,B,alternative="greater",mu=5,paired=TRUE,conf.int=TRUE)
Wilcoxon signed rank test
data: A and B
V = 181, p-value = 0.001576
alternative hypothesis: true location shift is greater than 5
95 percent confidence interval:
9 Inf
sample estimates:
(pseudo)median
11.8
94
versus the alternative that not all $\sigma_i^2$'s are equal.
Many different methods exist, but we will focus on Levene's test, which is the least restrictive in its assumptions, as there are no assumptions regarding the sample sizes/distributions. We still require the assumption of independent populations/groups. However, the calculation is tedious, so we will once again rely on software.
For Levene's test, the sampling distribution of the test statistic is
$$T.S. \overset{H_0}{\sim} F_{t-1,\,N-t}$$
where $N = \sum_{i=1}^{t} n_i$, i.e. the grand total number of observations. Reject H₀ if the p-value $P(F_{t-1,N-t} \geq T.S.) < \alpha$ (the area to the right of the test statistic is less than α).
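The leveneTest function in the car package is one way to carry this out; a small simulated sketch (the data frame and variable names below are ours):

# Levene's test for equal variances across t groups
library(car)                              # provides leveneTest()
set.seed(1)
fuel <- data.frame(additive = gl(3, 10, labels = c("A", "B", "C")),
                   increase = c(rnorm(10, 8, 1), rnorm(10, 8, 2), rnorm(10, 8, 1.5)))
leveneTest(increase ~ additive, data = fuel, center = median)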
Example 4.10. Three different additives that are marketed for increasing
fuel efficiency in miles per gallon (mpg) were evaluated by a testing agency.
Studies have shown an average increase of 8% in mpg after using the products
for 250 miles. The testing agency wants to evaluate the variability in the
increase. (We will see in later sections how to compare the means).
[Figure: F(2,27) density with the observed test statistic 1.8268 marked and the right-tail area (p-value) of 0.1803 shaded.]
95
4.4 Contingency Tables: Tests for Independence
Contingency tables are cross-tabulations of frequency counts where the rows
(typically) represent the levels of the explanatory variable and the columns
represent the levels of the response variable.
We motivate the methodology through an example. A personnel man-
ager wants to assess the popularity of 3 alternative flexible time-scheduling
plans among workers. A random sample of 216 workers yields the following
frequencies.
Office
Favored Plan 1 2 3 4 Total
1 15 32 18 5 70
2 8 29 23 18 78
3 1 20 25 22 68
Total 24 81 66 45 216
• Row and column totals are called the marginal distributions for the two variables. Denote by $n_{i+}$ the i-th row total and by $n_{+j}$ the j-th column total. Let $p_{i+}$ denote the proportion for row i and $p_{+j}$ the proportion for column j.
In Section 2.4.1 we have seen that two events are independent if the
joint probability can be written as a product of the marginal proportions (or
estimated probabilities). Hence, under independence we expect
$$p_{ij} \overset{ind}{=} p_{i+}\,p_{+j}, \qquad i = 1, 2, 3, \quad j = 1, 2, 3, 4.$$
96
As a result, $E_{11} = (70)(24)/216 = 7.7778$. Continuing in the same way,
Office
Favored Plan 1 2 3 4 Total
1 15(7.7778) 32(26.2500) 18(21.3889) 5(14.5833) 70
2 8(8.6667) 29(29.2500) 23(23.8333) 18(16.2500) 78
3 1(7.5556) 20(25.5000) 25(20.7778) 22(14.1667) 68
Total 24 81 66 45 216
To test H₀: the row and column variables are independent versus Hₐ: they are not independent, the test statistic is
$$T.S. = \sum_{i=1}^{r}\sum_{j=1}^{c}\frac{(n_{ij} - E_{ij})^2}{E_{ij}},$$
where r is the number of rows and c is the number of columns. For a specified α, H₀ is rejected if
$$T.S. > \chi^2_{1-\alpha,\,(r-1)(c-1)},$$
or if the p-value $P(\chi^2_{(r-1)(c-1)} \geq T.S.) < \alpha$ (the area to the right of the test statistic is less than α). For the example at hand, the T.S. is
$$T.S. = \frac{(15 - 7.7778)^2}{7.7778} + \cdots + \frac{(22 - 14.1667)^2}{14.1667} = 27.135,$$
the degrees of freedom are 2(3) = 6 and the p-value is 0.0001366. Therefore,
we reject H0 and conclude that Favored Plan and Office are not independent.
Once dependence is established, of interest is to determine which cells in
the contingency table have higher or lower frequencies than expected (under
independence). This is usually determined by observing the standardized
residuals (deviations) of the observed counts, nij , to the expected counts
Eij , i.e.
$$r^{\star}_{ij} = \frac{n_{ij} - E_{ij}}{\sqrt{E_{ij}(1 - p_{i+})(1 - p_{+j})}}$$
Office
Favored Plan 1 2 3 4
1 3.3409 1.7267 -1.0695 -3.4306
2 -0.3005 -0.0732 -0.2563 0.6104
3 -3.0560 -1.6644 1.3428 2.8258
97
All this can be done in R by utilizing the chisq.test function. See
http://www.stat.ufl.edu/~athienit/IntroStat/contingency.R
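As a minimal sketch of what such an analysis looks like (the matrix below is the observed table from this section):

# Chi-square test of independence for the scheduling-plan example
tab <- matrix(c(15, 32, 18,  5,
                 8, 29, 23, 18,
                 1, 20, 25, 22),
              nrow = 3, byrow = TRUE,
              dimnames = list(Plan = c("1", "2", "3"), Office = c("1", "2", "3", "4")))
out <- chisq.test(tab)
out            # X-squared = 27.135 on 6 df
out$expected   # expected counts under independence
out$stdres     # standardized (adjusted) residuals r*_ij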
Rating
School Outstanding Average Poor
Most desirable 21 25 2
Good 20 36 10
Adequate 4 14 7
Undesirable 3 8 6
98
Part III
Modules 5-6
99
Module 5
Regression
y = β0 + β1 x,
$$Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad i = 1, \ldots, n \qquad (5.1)$$
where Yi represents the r.v. corresponding to the response, i.e. the variable
we wish to model and xi stands for the observed value of the predictor.
Therefore we have that
$$Y_i \overset{ind.}{\sim} N(\beta_0 + \beta_1 x_i,\ \sigma^2). \qquad (5.2)$$
Notice that the Y s are no longer identical since their mean depends on the
value of xi .
100
[Figure: scatterplot of the data points with the fitted regression line overlaid.]
In order to fit a regression line one needs to find estimates for the coefficients β₀ and β₁, yielding the prediction line ŷ = b₀ + b₁x.
The goal is to have this line as “close” to the data points as possible. The
concept, is to minimize the error from the actual data points to the predicted
points (in the direction of Y , i.e. vertical)
$$\min \sum_{i=1}^{n}\bigl(Y_i - E(Y_i)\bigr)^2 \;\Rightarrow\; \min_{\beta_0,\beta_1} \sum_{i=1}^{n}\bigl(y_i - (\beta_0 + \beta_1 x_i)\bigr)^2.$$
Hence, the goal is to find the values of β₀ and β₁ that minimize the sum of squared (vertical) distances between the points and their expected values under the model.
This is done by the following steps:
101
Therefore,
$$b_1 := \hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = \frac{\left(\sum_{i=1}^{n} x_i y_i\right) - n\bar{x}\bar{y}}{\left(\sum_{i=1}^{n} x_i^2\right) - n\bar{x}^2} = r\,\frac{s_Y}{s_X} \qquad (5.3)$$
and
$$b_0 := \hat{\beta}_0 = \bar{y} - b_1\bar{x}.$$
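These estimates are easy to verify numerically; a small R sketch with made-up data (not the copier data of Example 5.1) comparing the hand formulas to lm():

# Least-squares estimates by hand (equation (5.3)) versus R's lm()
x <- c(2, 4, 3, 5, 7, 6, 8, 9)
y <- c(20, 60, 46, 75, 109, 95, 121, 135)
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)
c(b0 = b0, b1 = b1)
coef(lm(y ~ x))    # identical estimates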
Remark 5.1. Do not extrapolate the model to values of the predictor x outside the range of the data, as it is not clear how the model behaves for such values. Also, do not fit a linear regression to data that do not appear to be linear.
Next we introduce some notation that will be useful in conducting in-
ference of the model. In order to determine whether a regression model is
adequate we must compare it to the most naive model which uses the sample
mean Ȳ as its prediction, i.e. Ŷ = Ȳ . This model does not take into account
any predictors as the prediction is the same for all values of x. Then, the
total distance of a point yi to the sample mean ȳ can be broken down into two
components, one measuring the error of the model for that point, and one
measuring the “improvement” distance accounted by the regression model.
102
Summing over all observations we have that
$$\underbrace{\sum_{i=1}^{n}(y_i - \bar{y})^2}_{SST} = \underbrace{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}_{SSE} + \underbrace{\sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2}_{SSR}, \qquad (5.4)$$
since it can easily be shown that the cross-product term $\sum_{i=1}^{n}(y_i - \hat{y}_i)(\hat{y}_i - \bar{y})$ is equal to 0.
Each sum of squares term has an associated degrees of freedom value.
Example 5.1. Let x be the number of copiers serviced and Y be the time
spent (in minutes) by the technician for a known manufacturer.
1 2 ··· 44 45
Time (y) 20 60 · · · 61 77
Copiers (x) 2 4 ··· 4 5
103
Figure 5.3: Scatterplot of Time vs Copiers.
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.5802 2.8039 -0.207 0.837
Copiers 15.0352 0.4831 31.123 <2e-16 ***
---
http://www.stat.ufl.edu/~athienit/STA4210/Examples/copier.R
ŷ = −0.5802 + 15.0352x
We note that the slope b1 = 15.0352 implies that for each unit increase in
copier quantity, the service time increases by 15.0352 minutes (for quantity
values between 1 and 10). The coefficient of determination is R2 = 0.9575
implying that 95.75% of the variability in time (to its mean as conveyed by
SS Total) is explained by the model.
If we wish to estimate the time needed for a service call for 5 copiers that
would be
−0.5802 + 15.0352(5) = 74.5958 minutes
We can obtain SSR, SSE and SST from R, however everything we need
was already provided in the output earlier from the summary function
104
• SSE = (Residual standard error)² × df_Error, since the residual standard error reported is s, with s² = SSE/df_Error (from (5.6))
• SSR = SST − SSE
and
$$V(B_1) = \frac{1}{\left[\sum_{j=1}^{n}(x_j - \bar{x})^2\right]^2}\sum_{i=1}^{n}(x_i - \bar{x})^2\,\underbrace{V(Y_i)}_{\sigma^2} = \cdots = \frac{\sigma^2}{\sum_{j=1}^{n}(x_j - \bar{x})^2}.$$
Thus,
$$B_1 \sim N\left(\beta_1,\ \frac{\sigma^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}\right). \qquad (5.7)$$
Remark 5.3. The intercept term is not of much practical importance, as it is the value of the response when the predictor value is 0, and inference on it is omitted. Also, whether statistically significant or not, it is always kept in the model to create a parsimonious and better fitting model. It can be shown, in similar fashion to B₁, that
$$B_0 \sim N\left(\beta_0,\ \left[\frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}\right]\sigma^2\right).$$
Remark 5.4. The larger the spread in the values of the predictor, the larger $\sum_{i=1}^{n}(x_i - \bar{x})^2$ will be, and hence the smaller the variances of B₀ and B₁. Also, since the $(x_i - \bar{x})^2$ are nonnegative terms, with more data points, i.e. larger n, we are summing more nonnegative terms and $\sum_{i=1}^{n}(x_i - \bar{x})^2$ grows larger.
$$\frac{B_1 - \beta_1}{s\Big/\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}} \sim t_{n-2},$$
where s stands for the conditional (upon the model) standard deviation of the response. The true standard deviation σ is never known in practice, and hence the Student's-t distribution is used, instead of the standard normal, irrespective of the sample size. It is important to note that the degrees of freedom are n − 2, as 2 were lost due to the estimation of β₀ and β₁.
Therefore, a 100(1 − α)% C.I. for β₁ is
$$b_1 \mp t_{1-\alpha/2,\,n-2}\,\frac{s}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}}.$$
In R,
> confint(reg,level=0.95,type="Wald")
2.5 % 97.5 %
(Intercept) -6.234843 5.074529
Copiers 14.061010 16.009486
106
5.1.4 Confidence interval on the mean response
The mean is no longer a constant but a mean line:
µY |X=xobs := E(Y |X = xobs ) = β0 + β1 xobs
Hence, we can create an interval for the mean at a specific value of the
predictor. We simply need to find a statistic to estimate the mean and find
its distribution. The sample statistic is
ŷ = b0 + b1 xobs
and the corresponding r.v. is
$$\hat{Y} = B_0 + B_1 x_{obs} = \sum_{i=1}^{n}\left[\frac{1}{n} + (x_{obs} - \bar{x})\,\frac{x_i - \bar{x}}{\sum_{j=1}^{n}(x_j - \bar{x})^2}\right]Y_i. \qquad (5.8)$$
The corresponding 100(1 − α)% C.I. for the mean response at x_obs is $\hat{y} \mp t_{1-\alpha/2,\,n-2}\,s\sqrt{\tfrac{1}{n} + \tfrac{(x_{obs}-\bar{x})^2}{\sum_j (x_j-\bar{x})^2}}$. In R,
> predict.lm(reg,se.fit=TRUE,newdata=data.frame(Copiers=5),interval="confidence",level=0.95)
$fit
fit lwr upr
1 74.59608 71.91422 77.27794
$se.fit
[1] 1.329831
$df
[1] 43
107
5.1.5 Prediction interval
Once a regression model is fitted, after obtaining data (x1 , y1 ), . . . , (xn , yn ),
it may be of interest to predict a future value of the response. From equation
(5.1), we have some idea where this new prediction value will lie, somewhere
around the mean response
β0 + β1 xnew
However, according to the model, equation (5.1), we do not expect new
predictions to fall exactly on the mean response, but close to them. Hence,
the r.v. corresponding to the statistic we plan to use is the same as equation
(5.8) with the addition of the error term ε ∼ N(0, σ²):
$$\hat{Y}_{pred} = B_0 + B_1 x_{new} + \varepsilon$$
Therefore,
$$\hat{Y}_{pred} \sim N\left(\beta_0 + \beta_1 x_{new},\ \left[1 + \frac{1}{n} + \frac{(x_{new} - \bar{x})^2}{\sum_{j=1}^{n}(x_j - \bar{x})^2}\right]\sigma^2\right),$$
and a 100(1 − α)% prediction interval (P.I.) for a new response, at a value of the predictor that is unobserved, i.e. not in the data, is
$$\hat{y}_{pred} \mp t_{1-\alpha/2,\,n-2}\,\underbrace{s\sqrt{1 + \frac{1}{n} + \frac{(x_{new} - \bar{x})^2}{\sum_{j=1}^{n}(x_j - \bar{x})^2}}}_{s_{pred}}.$$
Example 5.4. Refer back to Example 5.1. Let us estimate the future service
time value when copier quantity is 7 and create an interval around it. The
predicted value is
−0.5802 + 15.0352(7) = 104.6666 minutes
A 95% P.I. around the predicted value is
$$104.6666 \mp \underbrace{t_{0.975,43}}_{2.016692}\,(9.058051) \;\to\; (86.3992,\ 122.9339)$$
> newdata=data.frame(Copiers=7)
> predict.lm(reg,se.fit=TRUE,newdata,interval="prediction",level=0.95)
$fit
fit lwr upr
1 104.6666 86.39922 122.9339
$se.fit
[1] 1.6119
$df
[1] 43
Note that the se.fit value provided is the standard error for the C.I., not the P.I. However, in the calculation of the P.I. the correct standard error term is used.
http://www.stat.ufl.edu/~athienit/STA4210/Examples/copier.R
108
5.2 Checking Assumptions and Transforming
Data
Recall that for the simple linear regression model
$$Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad i = 1, \ldots, n$$
we assume that $\varepsilon_i \overset{i.i.d.}{\sim} N(0, \sigma^2)$ for i = 1, . . . , n. However, once a model is
fit, before any inference or conclusions are made based upon a fitted model,
the assumptions of the model need to be checked.
These are:
1. Normality
2. Independence
3. Homogeneity of variance
4. Model fit
Formal tests also exist for some of these, e.g. the Shapiro-Wilk test for normality.
109
5.2.1 Normality
The simplest way to check for normality is with two graphical procedures:
• Histogram
• P-P or Q-Q plot
A histogram of the residuals is plotted and we try to determine if the
histogram is symmetric and bell shaped like a normal distribution is. In
addition, to check the model fit, we assume the observed response values
yi are centered around the regression line ŷ. Hence, the histogram of the
residuals should be centered at 0. Referring to Example 5.1, we obtain the
following histogram.
[Figure: histogram of the standardized residuals, roughly symmetric, bell shaped and centered at 0.]
We have referenced P-P and Q-Q plots in Section 2.7. Referring to Ex-
ample 5.1, we obtain the following P-P plot of the residuals.
[Figure: normal probability (P-P/Q-Q) plot of the residuals; the points fall roughly along a straight line.]
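A sketch of how these checks can be produced in R; the data frame below is a simulated stand-in for the Example 5.1 data (the names copier and reg are our assumptions, not the course script's):

# Graphical and formal normality checks on the standardized residuals
set.seed(4)
copier <- data.frame(Copiers = rep(1:9, each = 5))                # stand-in for Example 5.1
copier$Time <- -0.58 + 15.04 * copier$Copiers + rnorm(45, sd = 4)
reg <- lm(Time ~ Copiers, data = copier)
r <- rstandard(reg)
hist(r, freq = FALSE, xlab = "std. residuals")   # histogram of standardized residuals
qqnorm(r); qqline(r)                             # normal Q-Q plot
shapiro.test(r)                                  # Shapiro-Wilk test mentioned above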
110
5.2.2 Independence
To check for independence a time series plot of the residuals/standardized
residuals is used, i.e. a plot of the value of the residual versus the value of
its position in the data set (usually ordered by date and time). For example,
the first data point (x1 , y1 ) will yield the residual e1 = y1 − ŷ1 . Hence, the
order of e1 is 1, and so forth. Independence is graphically checked if there
is no discernible pattern in the plot. That is, one cannot predict the next
ordered residual by knowing a few previous ordered residuals. Referring
to Example 5.1, we obtain the following plot where there does not appear to
be any discernible pattern.
[Figure: standardized residuals plotted against observation order; no discernible pattern is present.]
When creating this plot the order in which the data was obtained must
be the same as the way they are in the datasheet.
111
5.2.3 Homogeneity of variance/Fit of model
Recall that the regression model assumes that the errors i have constant
variance σ 2 . In order to check this assumption a plot of the residuals (ei )
versus the fitted values (ŷi ) is used. If the variance is constant, one expects
to see a constant spread/distance of the residuals to the 0 line across all the
ŷi values of the horizontal axis. Referring to Example 5.1, we see that this
assumption does not appear to be violated.
[Figure: standardized residuals versus fitted values ŷ; the spread about the 0 line is roughly constant.]
In addition, the same plot can be used to check the fit of the model. If the model is a good fit, one expects to see the residuals evenly spread on either side of the 0 line. For example, if we observe residuals that are more heavily sided above the 0 line for some interval of ŷᵢ, then this is an indication that the regression line is not "moving" through the center of the data points for that section. By construction, the regression line does "move" through the center of the data overall, i.e. for the whole big picture. So if it is underestimating (or overestimating) for some portion, then it will overestimate (or underestimate) for some other. This is an indication that there is some curvature and that perhaps some polynomial terms should be added (to be discussed in the next chapter).
http://www.stat.ufl.edu/~athienit/STA4210/Examples/copier.R
112
5.2.4 Box-Cox (Power) transformation
In the event that the model assumptions appear to be violated to a signifi-
cant degree, then a linear regression model on the available data is not valid.
However, have no fear, your friendly statistician is here. The data can be
transformed, in an attempt to fit a valid regression model to the new trans-
formed data set. Both the response and the predictor can be transformed
but there is usually more emphasis on the response.
Remark 5.5. However, when we apply such a transformation, call it g(·), we
are in fact fitting the mean line
E(g(Y )) = β0 + β1 x1 + . . .
Example 5.5. Referring to example 5.1, and figure 5.8 we see that λ̂ ≈ 0.75.
113
[Figure: Box-Cox profile log-likelihood as a function of λ with its 95% confidence region; the maximum occurs near λ̂ ≈ 0.75.]
http://www.stat.ufl.edu/~athienit/STA4210/Examples/boxcox.R
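A sketch of how the profile plot can be produced with MASS::boxcox, reusing the simulated stand-in for the Example 5.1 data from the earlier sketch:

# Box-Cox profile likelihood for the power transformation of the response
library(MASS)                                   # provides boxcox()
bc <- boxcox(lm(Time ~ Copiers, data = copier), lambda = seq(-2, 2, by = 0.05))
bc$x[which.max(bc$y)]                           # lambda maximizing the log-likelihood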
However, one could argue that the value is close to 1 and that a transfor-
mation may not necessarily improve the overall validity of the assumptions,
so no transformation is necessary. In addition, we know that linear regres-
sion is somewhat robust to deviations from the assumptions, and it is more
practical to work with the untransformed data that are in the original units
of measurements. For example, if the data is in miles and a transformation
is used on the response, inference will be on log(miles).
114
[Figure: diagnostic plots for the untransformed response (time): histogram of standardized residuals, normal Q-Q plot, and standardized residuals plotted against time, order and ŷ.]
> powerTransform(dat$time)
bcPower Transformation to Normality
It seems that a decent choice for λ is 0, i.e. a log transformation for time.
[Figure: the same diagnostic plots after the log transformation (response l.time); the assumptions now appear reasonable.]
http://www.stat.ufl.edu/~athienit/IntroStat/reg_transpred.R
Remark 5.6. Software has a tendency to "zoom" in to where the data are, and you may see patterns that would not be apparent if you were to "zoom" out. Is glass smooth? If you are viewing it by eye then yes; if you are viewing it through an electron microscope then no. It is suggested that the axis on which the standardized residuals are plotted span at least −3 to 3. The check function provided in the course scripts adjusts for this automatically.
115
5.3 Multiple Regression
5.3.1 Model
The multiple regression model is an extension of the simple regression model
whereby instead of only one predictor, there are multiple predictors to better
aid in the estimation and prediction of the response. The goal is to determine
the effects (if any) of each predictor, controlling for the others.
Let p denote the number of predictors and (yi , x1,i , x2,i , . . . xp,i ) denote the
p + 1 dimensional data points for i = 1, . . . , n. The statistical model is
$$Y_i = \beta_0 + \beta_1 x_{1,i} + \cdots + \beta_p x_{p,i} + \varepsilon_i \;\Leftrightarrow\; Y_i = \sum_{k=0}^{p}\beta_k x_{k,i} + \varepsilon_i, \qquad x_{0,i} \equiv 1,$$
for i = 1, . . . , n where $\varepsilon_i \overset{i.i.d.}{\sim} N(0, \sigma^2)$.
Multiple regression models can also include polynomial terms (powers of predictors), such as $Y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \varepsilon_i$.
116
The interpretation of the slope coefficients now requires an additional statement: a 1-unit increase in predictor x_k changes the mean response by β_k, assuming all other predictors are held constant. In a model with interaction terms special care needs to be taken, since an increase in a predictor also changes any interaction term involving that predictor. Take for example
$$E(Y\,|\,x_1, x_2) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3\underbrace{x_1 x_2}_{x_3},$$
where a 1-unit increase in x₁ changes the mean response by β₁ + β₃x₂, an amount that depends on the value of x₂.
117
is usually not of any practical value and the model may be overcomplicated with redundant predictors. This has led to the introduction of the adjusted R², defined as
$$R^2_{adj} := R^2 - \underbrace{(1 - R^2)\,\frac{p}{n - p - 1}}_{\text{penalizing fcn.}} = 1 - \frac{SSE/(n - p - 1)}{SST/(n - 1)}.$$
$$R^2_{(1)adj} = 0.677 - (1 - 0.677)\frac{3}{46} = 0.6559, \qquad R^2_{(2)adj} = 0.679 - (1 - 0.679)\frac{5}{44} = 0.6425,$$
so that $R^2_{adj}$ has decreased from model (1) to model (2).
5.3.3 Inference
The sum of squares calculation remains as in equation (5.4). However, the
degrees of freedom associated with SSE is now n − (p + 1). Therefore,
118
The Mean Squared Regression (MSR) and the Mean Squared Error (MSE)
are defined as
$$MSR = \frac{SSR}{p}, \qquad MSE = \frac{SSE}{n - p - 1}$$
Before we continue, it is important to note that there are (mathematical)
limitations to how many predictors can be added to a model. As a guideline
we usually have one predictor per 10 observations. For example, a
dataset with sample size 60 should have at most 6 predictors.
Individual tests
Estimating the vector of coefficients β = (β0 , β1 , . . . , βp ) now falls in the field
of matrix algebra and will not be covered in this class. We will simply rely
on statistical software.
Inference on the slope parameters βj for j = 1, . . . , p is done as in Section
5.1.3 but under the assumption that
$$\frac{B_j - \beta_j}{s_{\beta_j}} \sim t_{n-p-1}.$$
• Now assume that a multiple regression model is fitted with both pre-
dictors, x1 = length and x2 = height. Now, for the test H0 : β1 = 0, do
you think that we would reject H0 , i.e. is length a significant predictor
of area given that height is already included in the model?
119
(1) ŷ = 6.33 + 1.29 x₁
(2) ŷ = 54.0 − 0.919 x₂
has decreased from 80.2% in (1) to 80.0% in (3). This is because x1 is acting
as a confounding variable on x2 . The relationship of x2 with the response
y is mainly accounted for by the relationship of x1 on y. The correlation
coefficient of
rx1 ,x2 = −0.573
which indicates a moderate negative relationship.
However, since x1 is a better predictor, the multiple regression model is
still able to determine that x1 is significant given x2 , but not vice versa.
120
Remark 5.7. In the event that the correlation between x1 and x2 is strong, e.g.
|rx1 ,x2 | > 0.7, both p-values for the individual tests in the multiple regression
model would be large. The model would not be able to distinguish a better
predictor from the two since they are nearly identical. Hence, x1 given x2 ,
and x2 given x1 would not be significant.
Simultaneous tests
Thus far we have only seen hypothesis tests about individual β's. In an experiment with multiple predictors, using only individual tests, the researcher can test and potentially drop just one predictor at a time, refitting the model at each step. However, a method exists for testing the statistical significance of multiple predictors simultaneously.
Let p denote the total number of predictors. Then, we can simultaneously
test for the significance of k(≤ p) predictors. For example, let p = 5 and the
full model is
$$Y_i = \beta_0 + \beta_{x_1}x_{1,i} + \beta_{x_2}x_{2,i} + \beta_{x_3}x_{3,i} + \beta_{x_4}x_{4,i} + \beta_{x_5}x_{5,i} + \varepsilon_i \qquad (5.9)$$
Now, assume that after fitting this model and looking at some preliminary
results, including the individual tests, we wish to test whether we can re-
move simultaneously the first, third and fourth predictor, i.e x1 , x3 and x4 .
Consequently, we wish to test the hypotheses
H₀: β₁ = β₃ = β₄ = 0 vs Hₐ: at least one of them ≠ 0
In effect we wish to test the full model in equation (5.9) to the reduced model
$$Y_i = \beta_0 + \beta_{x_2}x_{2,i} + \beta_{x_5}x_{5,i} + \varepsilon_i \qquad (5.10)$$
Remark 5.8. A full model does not necessarily imply a model with all the
predictors. It simply means a model that has more predictors than the
reduced model, i.e. a “fuller” model. For example, one may do a simultaneous
test to determine if they can drop 2 predictors and hence compare a full versus
reduced model. Assume that they do decide to go with the reduced model
but then wish to perform an additional simultaneous test on the reduced
model. In this second step, the reduced model becomes the new full model
that will be compared to a further reduced model.
The SSE of the reduced model will be larger than the SSE of the full
model, as it only has two of the predictors of the full model and can never
fit the data better. The test statistic is based on comparing the difference in
SSE of the reduced model to the full model.
$$T.S. = \frac{\dfrac{SSE_{red} - SSE_{full}}{df_{E_{red}} - df_{E_{full}}}}{\dfrac{SSE_{full}}{df_{E_{full}}}} \overset{H_0}{\sim} F_{\nu_1,\nu_2}$$
where
121
• ν1 = dfEred − dfEf ull
• ν2 = dfEf ull
The p-value for this test is always the area to the right of the F-distribution,
i.e. P (Fν1 ,ν2 ≥ T.S.).
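In R this full-versus-reduced comparison is carried out by anova() on two nested fits; a sketch with simulated data (the data frame dat and its variables are made up for illustration):

# Simultaneous (partial F) test comparing a reduced model to a full model
set.seed(5)
dat <- data.frame(matrix(rnorm(45 * 5), ncol = 5, dimnames = list(NULL, paste0("x", 1:5))))
dat$y <- 2 + 1.5 * dat$x2 - 2 * dat$x5 + rnorm(45)
full    <- lm(y ~ x1 + x2 + x3 + x4 + x5, data = dat)   # as in equation (5.9)
reduced <- lm(y ~ x2 + x5, data = dat)                  # as in equation (5.10)
anova(reduced, full)   # F = [(SSE_red - SSE_full)/3] / [SSE_full/df_full], with p-value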
Remark 5.9. Note that ν1 = dfEred − dfEf ull always equals the number of
coefficients being restricted under the null hypothesis in a simultaneous test.
If n denotes the sample size then for our example with p = 5 and testing
3 predictors,
ν1 = (n − 2 − 1) − (n − 5 − 1) = 3
Remark 5.10. In computer output a simultaneous test for testing the significance of all the predictors, H₀: β₁ = · · · = β_p = 0, is automatically given. This is called the overall test of the model. In this case, the reduced model has no predictors, hence
$$Y_i = \beta_0 + \varepsilon_i \;\Leftrightarrow\; Y_i = \mu + \varepsilon_i,$$
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 171.06949 1481.15956 0.115 0.90864
salinity -9.11037 28.82709 -0.316 0.75366
pH 311.58775 105.41592 2.956 0.00527
K -0.08950 0.41797 -0.214 0.83155
Na -0.01336 0.01911 -0.699 0.48877
Zn -4.47097 18.05892 -0.248 0.80576
AIC 690.4836
122
Assuming all the model assumptions are met, we first take a look at the
overall fit of the model.
The test statistic value is T.S. = 7.395 with an associated p-value of approx-
imately 0 (found using an F5,39 distribution). Hence, at least one predictor
appears to be significant. In addition, the coefficient of determination, R2 , is
48.67%, indicating that a large proportion of the variability in the response
can be accounted for by the regression model.
Looking at the individual tests, pH is significant given all the other predic-
tors with a p-value of 0.00527, but salinity, K, Na and Zn have large p-values
(from the individual tests). Table 5.2 provides the pairwise correlations of
the quantitative predictor variables.
biomass salinity pH K Na Zn
biomass . -0.084 0.669 -0.150 -0.219 -0.503
salinity . . -0.051 -0.021 0.162 -0.421
pH . . . 0.019 -0.038 -0.722
K . . . . 0.792 0.074
Na . . . . . 0.117
Zn . . . . . .
AIC 684.6179
123
The test statistic is
$$T.S. = \frac{(8928321 - 8901715)/3}{8901715/39} = 0.0389,$$
with p-value P (F3,39 ≥ 0.0389) = 0.9896, and therefore fail to reject the null
which implies that salinity, K and Zn are not statistically significant. Now
we proceed with the reduced model as our current model.
At this point we see that Na is marginally significant with a p-value of
0.0871. Some may argue to remove it and some may not (due to its p-value
being on the cusp). As the model is “simple” enough it is suggested to keep
it. Arguments for keeping Na are that the model without it yields
• a higher conditional standard deviation, s = √MSE = 472 (compared to 461.1), and
• a smaller R²_adj = 0.4347 (compared to 0.4606).
One can still create C.I.'s or P.I.'s for multiple regression, but this is done via statistical software. For example, fit biomass for pH = 4.15 and Na = 10000 and create a 95% P.I.
> newdata=data.frame(pH=4.15,Na=10000)
> predict(linthurst.model.r, newdata, interval="prediction",level=0.95)
fit lwr upr
1 922.4975 -29.45348 1874.448
http://www.stat.ufl.edu/~athienit/IntroStat/linthurst.R
Remark 5.11. We could have reached the same final model choice by simply
performing individual t-tests on the coefficients and refitting the model each
time, i.e. find the coefficient with the highest p-value and if it is above some
cutoff point such as 0.20 then remove it, refit and repeat. However, there are
computer algorithms for this.
124
5.4 Qualitative Predictors
Interpreting a regression model with qualitative predictors is slightly differ-
ent. A qualitative predictor is a variable with groups or classification. The
simple case with only two groups will be illustrated by the following example.
$$Y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + \varepsilon_i$$
implies that
$$Y_i = \begin{cases} (\beta_0 + \beta_2) + \beta_1 x_{1,i} + \varepsilon_i & \text{if } x_2 = 1 \\ \beta_0 + \beta_1 x_{1,i} + \varepsilon_i & \text{if } x_2 = 0 \end{cases}$$
When a safety program is used, i.e. x2 = 1, the intercept is β0 + β2 , but the
slope (for x1 ) remains the same in both cases. Fitting this model yields
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 31.39945 9.90247 3.171 0.00305 **
x1 0.01421 0.00140 10.148 3.07e-12 ***
x2 -54.21033 7.24299 -7.485 6.47e-09 ***
125
[Figure 5.9: scatterplot of y versus x₁ with separate symbols for x₂ = 0 and x₂ = 1 and the two parallel fitted lines.]
Although the overall fit of the model seems adequate, from Figure 5.9 we see that the regression line for x₂ = 1 (red) does not fit the data well, a fact that can also be seen by plotting the residuals in the assumption checking procedure. The model is too restrictive in forcing parallel lines. Adding an interaction term makes the model less restrictive.
The model with the interaction term,
$$Y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + \beta_3 x_{1,i}x_{2,i} + \varepsilon_i,$$
implies
$$Y_i = \begin{cases} (\beta_0 + \beta_2) + (\beta_1 + \beta_3)x_{1,i} + \varepsilon_i & \text{if } x_2 = 1 \\ \beta_0 + \beta_1 x_{1,i} + \varepsilon_i & \text{if } x_2 = 0 \end{cases}$$
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.844082 10.127410 -0.182 0.857
x1 0.019749 0.001546 12.777 6.11e-15 ***
x2 10.725385 14.054508 0.763 0.450
x1:x2 -0.010957 0.002174 -5.041 1.32e-05 ***
126
The overall fit of the new model is adequate with a T.S. = 98.90, but more importantly R²_adj has increased and s has decreased. Figure 5.10 also shows the better fit.
[Figure 5.10: scatterplot of y versus x₁ with the two fitted lines from the interaction model; the line for x₂ = 1 now follows its data.]
Remark 5.12. Since the interaction term x1 x2 is deemed significant, then for
model parsimony, all lower order terms of the interaction, i.e. x1 and x2
should be kept in the model, irrespective of their statistical significance. If
x1 x2 is significant then intuitively x1 and x2 are of importance (maybe not
in the statistical sense).
Now let's try to perform inference on the slope coefficient for x₁. From the previous equation we saw that the slope takes on two values depending on the value of x₂.
The sample statistics for all the covariances among the coefficients can easily be obtained in R using the vcov function (although we readily have available the variances, i.e. squared standard errors, for β₁ and β₃). We can then create a 100(1 − α)% C.I. for β₁ + β₃:
$$b_1 + b_3 \mp t_{1-\alpha/2,\,n-p-1}\sqrt{s^2_{\beta_1} + s^2_{\beta_3} + 2s_{\beta_1\beta_3}}$$
127
Remark 5.13. The sample covariance is not part of the standard output, and in R we use the vcov() function. Also, this concept can easily be extended to any linear combination of more than two coefficients.
> vc=vcov(reg2);vc
(Intercept) x1 x2 x1:x2
(Intercept) 102.564428 -1.433300e-02 -102.56442795 1.433300e-02
x1 -0.014333 2.389211e-06 0.01433300 -2.389211e-06
x2 -102.564428 1.433300e-02 197.52920433 -2.799714e-02
x1:x2 0.014333 -2.389211e-06 -0.02799714 4.724125e-06
> sum(reg2$coefficients[c(2,4)])+c(1,-1)*
+ qt(0.025,reg2$df.residual)*sqrt(vc[2,2]+vc[4,4]+2*vc[2,4])
[1] 0.005693056 0.011891084
http://www.stat.ufl.edu/~athienit/IntroStat/safe_reg.R
In the previous example the qualitative predictor only had two levels, the
use or the the lack of use of a safety program. To fully state all levels only
one dummy/indicator predictor was necessary. In general, if a qualitative
predictor has k levels, then k − 1 dummy/indicator predictor variables are
necessary. For example, a qualitative predictor for a traffic light has three
levels:
• red,
• yellow,
• green.
Therefore, only two binary predictors are necessary to fully model this sce-
nario.
$$x_{red} = \begin{cases} 1 & \text{if red} \\ 0 & \text{otherwise} \end{cases} \qquad x_{yellow} = \begin{cases} 1 & \text{if yellow} \\ 0 & \text{otherwise} \end{cases}$$
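In R a factor is turned into such indicator columns automatically; a small sketch of the design matrix (the variable name light is ours):

# Dummy (indicator) coding of a 3-level qualitative predictor
light <- factor(c("red", "yellow", "green", "green", "red"),
                levels = c("green", "red", "yellow"))   # green as the baseline level
model.matrix(~ light)   # columns: (Intercept), lightred, lightyellow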
Breaking it down by case, with model E(Y) = β₀ + β₁x_red + β₂x_yellow, we have E(Y) = β₀ + β₁ for red, β₀ + β₂ for yellow, and β₀ for green (the baseline).
Testing
128
• β1 = 0, is testing whether the mean for red is the same as for green
• β2 = 0, is testing whether the mean for yellow is the same as for green
The color variable has three categories. One may argue that color (in some contexts) is an ordinal qualitative predictor, and that scores can therefore be assigned, making it quantitative. For example, you can order a drink in 3 sizes: small, medium and large, and there is an inherent order of 1, 2 and 3.
Size Score
Small 1
Medium 2
Large 3
Now assume we knew that the medium size is 50% larger than the small,
and that the large drink was 350% larger than the small. More representative
scores might be
Size Score
Small 1
Medium 1.5
Large 3.5
E(Y ) = β0 + β1 score
129
Module 6
Analysis of Variance
The completely randomized design (CRD) model is
$$Y_{ij} = \mu + \alpha_i + \varepsilon_{ij}, \qquad i = 1, \ldots, t, \; j = 1, \ldots, n_i, \qquad (6.1)$$
where $\varepsilon_{ij} \overset{i.i.d.}{\sim} N(0, \sigma^2)$, with the restriction (to make the model identifiable) that some αᵢ = 0 or $\sum_{i=1}^{t}\alpha_i = 0$. The goal is to test the statistical significance of the treatment effects, the α's. If all α's are 0 then the response can be modeled by a single mean µ rather than individual µᵢ's for each treatment/sample.
To see the equivalence to the regression model, assume t = 2. The CRD
model under the restriction α1 = 0 dictates that
$$E(Y_{ij}) = \begin{cases} \mu & \text{level 1} \\ \mu + \alpha_2 & \text{level 2} \end{cases}$$
130
However, if we treat this as regression, we have a qualitative predictor with
two levels. Let
$$x = \begin{cases} 0 & \text{level 1} \\ 1 & \text{level 2} \end{cases}$$
and the regression model $Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$ dictates that
$$E(Y_i) = \begin{cases} \beta_0 \equiv \mu & \text{level 1} \\ \beta_0 + \beta_1 \equiv \mu + \alpha_2 & \text{level 2} \end{cases}$$
Example 6.1. Company officials were concerned about the length of time
a particular drug retained its potency. A random sample of n1 = 10 fresh
bottles was retained and a second sample of n2 = 10 samples were stored for
a period of 1 year and the following potency readings were obtained.
Fresh 10.2 10.5 10.3 10.8 9.8 10.6 10.7 10.2 10.0 10.6
Stored 9.8 9.6 10.1 10.2 10.1 9.7 9.5 9.6 9.8 9.9
[Figure: dotplot of the potency readings with the grand mean marked.]
131
[Figure: dotplot of potency by method (Fresh vs Stored) with the treatment means and the grand mean marked.]
The model in equation (6.1) is a linear model so we have the same identity
for the sum of squares
$$\underbrace{\sum_{i=1}^{t}\sum_{j=1}^{n_i}(y_{ij} - \bar{y}_{++})^2}_{SST} = \underbrace{\sum_{i=1}^{t}\sum_{j=1}^{n_i}(y_{ij} - \bar{y}_{i+})^2}_{SSE} + \underbrace{\sum_{i=1}^{t}\sum_{j=1}^{n_i}(\bar{y}_{i+} - \bar{y}_{++})^2}_{SSTrt}$$
where $\bar{y}_{i+}$ denotes the sample mean of group i, and $\bar{y}_{++} = \bar{y}$ denotes the sample grand mean. We can simplify each SS term, e.g.
• SST = (N − 1)s²_y
This is once again the same as equation (5.4) in regression with ŷᵢ = ȳᵢ₊. In addition, we have a similar identity for the degrees of freedom associated with each SS:
$$\underbrace{N - 1}_{df_{Total}} = \underbrace{N - t}_{df_{Error}} + \underbrace{t - 1}_{df_{Trt}}, \qquad N = \sum_{i=1}^{t} n_i$$
Once again s² = MSE, and it can be shown (in more advanced courses) that the SS have χ² distributions, from which
$$E(MSE) = \sigma^2, \qquad E(MSTrt) = \sigma^2 + \frac{\sum_{i=1}^{t} n_i\alpha_i^2}{t-1}.$$
As a consequence, under H₀: α₁ = · · · = αₜ = 0, E(MSTrt)/E(MSE) = 1. The test statistic for this hypothesis, with its sampling distribution, is
$$T.S. = \frac{MSTrt}{MSE} \overset{H_0}{\sim} F_{t-1,\,N-t}$$
Reject H₀ if the p-value = $P(F_{t-1,N-t} \geq T.S.) < \alpha$. This is equivalent to equation (5.11).
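A sketch of the corresponding analysis in R using aov() and the potency data of Example 6.1:

# One-way ANOVA (CRD) for Example 6.1
fresh  <- c(10.2, 10.5, 10.3, 10.8, 9.8, 10.6, 10.7, 10.2, 10.0, 10.6)
stored <- c(9.8, 9.6, 10.1, 10.2, 10.1, 9.7, 9.5, 9.6, 9.8, 9.9)
potency <- c(fresh, stored)
method  <- factor(rep(c("Fresh", "Stored"), each = 10))
summary(aov(potency ~ method))   # F = MSTrt/MSE on (t-1, N-t) = (1, 18) df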
132
Remark 6.1. Checking the assumptions for the CRD model is exactly the same as for regression, since both models belong to the same family of linear models. In addition, the Box-Cox transformation can also be used, just as previously done for regression.
Example 6.2. A metal alloy that undergoes one of four possible strength-
ening procedures is tested for strength.
[Figure: pre-analysis dotplots of strength by treatment (A, B, C, D) with the treatment means and the grand mean marked.]
http://www.stat.ufl.edu/~athienit/IntroStat/anova1.R
133
6.1.1 Post-hoc comparisons
If differences in group means are determined from the F-test, researchers want
to compare pairs of groups. Recall that each pairwise confidence interval, i.e.
a C.I. for the difference of two means is equivalent to a hypothesis test.
Hence, if each individual inference is done with P(Type I Error) = α, and we wish to perform joint (or simultaneous) inference by combining a certain number of confidence intervals into one conclusion, then the overall Type I error rate can no longer be α.
Let αI denote the individual comparison Type I error rate. Thus,
P (Type I error) = αI on each of the g tests. Now assume we wish to combine
all the individual tests into an overall/combined/simultaneous test
H0 is rejected if any, i.e. at least one, of the null hypotheses H0i is rejected.
The experimentwise error rate αE , is the probability of falsely rejecting
at least one of the g null hypotheses. If each of the g tests is done with αI ,
then assuming each test is independent and denoting the probability of not
falsely rejecting H0i by Ei
$$\alpha_E = 1 - P\bigl(\cap_{i=1}^{g} E_i\bigr) \overset{\text{indep.}}{=} 1 - \prod_{i=1}^{g} P(E_i) = 1 - (1 - \alpha_I)^g$$
More generally, even without independence,
$$\alpha_E = 1 - P\bigl(\cap_{i=1}^{g} E_i\bigr) \;\leq\; g - \sum_{i=1}^{g} P(E_i) = \sum_{i=1}^{g}\bigl[1 - P(E_i)\bigr] = \sum_{i=1}^{g}\alpha_I = g\,\alpha_I
$$
134
Hence, αE ≤ gαI. So what we will do is choose an α to serve as an upper bound for αE. That is, we won't know the true value of αE, but we will know it is bounded above by α, i.e. αE ≤ α. For example, if we set α = 0.05 then αE ≤ 0.05, or the simultaneous C.I. formed from the g individual C.I.'s will have a confidence of at least 95% (if not more). Set
$$\alpha_I = \frac{\alpha}{g}.$$
For example, if we have 5 multiple comparisons and wish the overall error rate to be at most 0.05, or simultaneous confidence of at least 95%, then each one (of the five) C.I.'s must be done at the
$$100\left(1 - \frac{0.05}{5}\right)\% = 99\%$$
confidence level.
For additional details the reader can read the multiple comparisons prob-
lem and the familywise error rate.
In order to control the experiment-wise error rate, we will have to adjust the individual error rate of each test. Popular methods include (from most conservative to least):
1. Bonferroni's Method: adjusts individual comparison error rates so that all conclusions will be correct at the desired confidence/significance level. Any number of comparisons can be made. It is a very general approach that can be applied to any inferential problem.
2. Tukey's Method: specifically compares all t(t − 1)/2 pairs of groups. Utilizes the special q (studentized range) distribution.
Bonferroni procedure
This is the most general procedure when we wish to test a priori g pairwise
comparisons. When all pairs of treatments are to be compared g = t(t−1)/2.
However, we shall see that the larger the g is the wider the intervals will be.
The steps are:
1. Choose an overall upper bound α(≥ αE ) so that the overall confidence
level is
100(1 − α) ≤ 100(1 − αE )%.
2. Decide how many and which pairwise comparisons are to be made, g.
3. Construct each pairwise C.I. (or test) with αI = α/g, i.e. confidence level 100(1 − α/g)%. The C.I. comparing treatment means µᵢ and µⱼ is
$$\bar{y}_{i+} - \bar{y}_{j+} \mp t_{1-\alpha/(2g),\,N-t}\sqrt{MSE\left(\frac{1}{n_i} + \frac{1}{n_j}\right)}$$
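A simulated sketch of Bonferroni-adjusted pairwise comparisons for a four-treatment CRD (the data below are made up, not the alloy data of Example 6.2):

# Bonferroni-adjusted pairwise comparisons after a one-way ANOVA
set.seed(2)
alloy <- data.frame(trt = gl(4, 5, labels = c("A", "B", "C", "D")),
                    strength = rnorm(20, mean = rep(c(265, 245, 250, 270), each = 5), sd = 6))
fit <- aov(strength ~ trt, data = alloy)
pairwise.t.test(alloy$strength, alloy$trt, p.adjust.method = "bonferroni")
# Each C.I. can be built by hand with qt(1 - 0.05/(2*6), df = 16) and MSE from summary(fit)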
135
Example 6.3. In our example we do not know before hand which compar-
isons we wish to make so let us perform all 4(3)/2 = 6 pairwise comparisons
with an overall confidence level of at least 95% (since αE ≤ 0.05). This
implies that each pairwise comparison must be made at the
$$100\left(1 - \frac{0.05}{6}\right)\% = 99.1667\%$$
confidence level. Since all our factors have the same sample size of 5, we are lucky in that the margin of error is the same for all pairwise comparisons, 15.21813. The sample means were already calculated and hence the ordered treatments are:
A D B C
Tukey’s procedure
This procedure is derived so that the probability that at least one false dif-
ference is detected is α (experimentwise error rate). The C.I. is
$$\bar{y}_{i+} - \bar{y}_{j+} \mp q_{1-\alpha;\,t,\,N-t}\sqrt{\frac{MSE}{n}}$$
where n is the common sample size for each treatment (which was 5 in the
example). If the sample sizes are unequal use (harmonic mean)
$$n = \frac{t}{\frac{1}{n_1} + \cdots + \frac{1}{n_t}}$$
136
Example 6.4. Continuing with our example, R has a built-in function that can create the C.I.'s and also create a plot for us.
[Figure: Tukey 95% family-wise confidence intervals for all pairwise differences of treatment means.]
http://www.stat.ufl.edu/~athienit/IntroStat/anova1.R
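The built-in function is TukeyHSD; a sketch using the aov fit from the simulated data in the earlier Bonferroni sketch:

# Tukey's HSD intervals and plot
tk <- TukeyHSD(fit, conf.level = 0.95)   # 'fit' is the aov object from the sketch above
tk        # all pairwise differences with family-wise 95% intervals
plot(tk)  # the interval plot referred to above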
137
6.1.2 Distribution free procedure
The Kruskal-Wallis test is an extension of the Wilcoxon rank-sum test to ≥ 2
groups. It is a distribution free procedure (but still requires the assumptions
of independence and constant variance). To test
H0 : The t distributions (corresponding to the t treatments) are identical
the steps are:
1. Rank the observations across groups from smallest to largest, adjusting
for ties.
2. Compute the sums of ranks for each group: T1 , . . . , Tt
and then compute
$$T.S. = \frac{12}{N(N+1)}\sum_{i=1}^{t}\frac{T_i^2}{n_i} - 3(N+1) \overset{H_0}{\sim} \chi^2_{t-1}$$
where N is the grand sample size. Reject H₀ if the p-value = $P(\chi^2_{t-1} \geq T.S.) < \alpha$.
Remark 6.2. Alternate versions of this test statistic exist when ties are present, but they are unnecessarily complicated. Usually software will use the "best" adjusted procedure. By hand we suggest adjusting for ties as we have always done.
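In R the built-in kruskal.test accepts a formula; for instance, with the simulated alloy data from the earlier sketch:

# Kruskal-Wallis test comparing the t treatment distributions
kruskal.test(strength ~ trt, data = alloy)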
138
providing
$$T.S. = \frac{12}{12(13)}\left(\frac{13^2}{4} + \frac{25^2}{4} + \frac{40^2}{4}\right) - 3(13) = 7.0385$$
L M H
http://www.stat.ufl.edu/~athienit/IntroStat/kruskal_wallis.R
Remark 6.3. Note that the Kruskal-Wallis test and the rank analogue to Tukey's may not always be in agreement, as these are two different procedures and not equivalent. This is true for any two different methodologies, as they "look" at the data in slightly different ways.
139
6.2 Randomized Block Design
Blocking (where applicable) is used to reduce variability so that treatment
differences can be identified. Usually, experimental units constitute the
blocks. In effect this is a 2-way ANOVA with treatment being a fixed factor,
and the experimental unit being the random factor. Each subject receives each treatment, and the order in which treatments are assigned to subjects must be random/arbitrary.
For example, consider a temperature predictor with levels: 20◦ F, 30◦ F, 40◦ F.
Is this fixed or random? It depends!
$$Y_{ij} = \mu + \alpha_i + \beta_j + \varepsilon_{ij}, \qquad i = 1, \ldots, t, \; j = 1, \ldots, b$$
with $\varepsilon_{ij} \overset{i.i.d.}{\sim} N(0, \sigma^2)$ and, independently, $\beta_j \sim N(0, \sigma_\beta^2)$. Notice that the random factor has similar notation to the error. If we were performing a 1-way ANOVA it would have been hidden inside the error term, but now we try to account for it and remove some "noise" from the model. We still have the same restriction on the α's: for some i, αᵢ = 0 (or $\sum\alpha_i = 0$).
Block
1 2 ··· b
1 y11 y12 ··· y1b
2 y21 y22 ··· y2b
Factor .. .. .. .. ..
. . . . .
t yt1 yt2 ··· ytb
140
The sum of squares identity is
$$SST = \underbrace{SSTrt + SSBlock}_{SSModel} + SSE$$
The SSBlock is pulled out of the SSE term of a CRD model.
$$SST = \sum_{i=1}^{t}\sum_{j=1}^{b}(y_{ij} - \bar{y}_{++})^2, \qquad SSTrt = \sum_{i=1}^{t} b(\bar{y}_{i+} - \bar{y}_{++})^2,$$
$$SSBlock = \sum_{j=1}^{b} t(\bar{y}_{+j} - \bar{y}_{++})^2, \qquad SSE = \sum_{i=1}^{t}\sum_{j=1}^{b}(y_{ij} - \bar{y}_{i+} - \bar{y}_{+j} + \bar{y}_{++})^2$$
where ȳi+ is the mean of treatment i, ȳ+j is the mean of block j, and ȳ is
the grand mean.
The ANOVA table is then
Source    SS         df               MS                       E(MS)                         F
Trt       SSTrt      t − 1            SSTrt/(t − 1)            σ² + b·Σᵢαᵢ²/(t − 1)          MSTrt/MSE
Block     SSBlock    b − 1            SSBlock/(b − 1)          σ² + tσ²_β
Error     SSE        (b − 1)(t − 1)   SSE/[(b − 1)(t − 1)]     σ²
Total     SST        bt − 1
141
Multiple comparison procedures are the same as in Section 6.1.1 with the
exception that ni = b and degrees of freedom error are (b − 1)(t − 1).
Example 6.6. Consider data from a study quantifying the interaction between theophylline and two drugs (famotidine and cimetidine) in a three-period crossover study that included receiving theophylline with a placebo control (Bachman et al., 1995). We would like to compare the mean theophylline clearances when it is taken with each of the three drugs.
In the RBD, we control for the variation within subjects when comparing
the three treatments, i.e. we account for the sum of squares of subjects.
In this example there are three treatments (t = 3) and fourteen subjects
(b = 14).
Response: thcl
Df Sum Sq Mean Sq F value Pr(>F)
fintagnt 2 7.005 3.5026 10.591 0.0004321
subj 13 71.811 5.5240 16.703 2.082e-09
Residuals 26 8.599 0.3307
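A sketch of how such a table is produced in R; the data frame and its columns (thcl, fintagnt, subj) follow the course script's names, but the values are simulated here only so the code runs:

# Randomized block analysis: treatment = drug (fintagnt), block = subject (subj)
set.seed(3)
theoph <- expand.grid(fintagnt = gl(3, 1, labels = c("placebo", "famotidine", "cimetidine")),
                      subj = factor(1:14))
theoph$thcl <- 3 + rnorm(14, sd = 1.2)[theoph$subj] +
               c(0, -0.3, -0.9)[theoph$fintagnt] + rnorm(42, sd = 0.5)
fit.rbd <- aov(thcl ~ fintagnt + subj, data = theoph)
anova(fit.rbd)                           # treatment and block (subject) lines as above
TukeyHSD(fit.rbd, which = "fintagnt")    # Tukey pairwise drug comparisons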
142
We note that drug treatment, fintagnt, is highly significant with a p-value
of 0.0004 and hence that not all means are equal. Next we perform Tukey’s
multiple comparison to determine which differ.
143
6.2.1 Distribution free procedure
Friedman’s test works by ranking the measurements corresponding to the t
treatments within each block.
4. Under the null the sampling distribution of the test statistic is χ2t−1 .
As usual, we reject the null if the p-value= P (χ2t−1 ≥ T.S.) < α.
144
Formulation
Subject cap f cap nf entct f
1 3.5(2) 4.5(3) 2.5(1)
2 4.0(2) 4.5(3) 3.0(1)
3 3.5(2) 4.5(3) 3.0(1)
4 3.0(1.5) 4.5(3) 3.0(1.5)
5 3.5(1.5) 5.0(3) 3.5(1.5)
6 3.0(1) 5.5(3) 3.5(2)
7 4.0(2.5) 4.0(2.5) 2.5(1)
8 3.5(2) 4.5(3) 3.0(1)
9 3.5(1.5) 5.0(3) 3.5(1.5)
10 3.0(1) 4.5(3) 3.5(2)
11 4.5(2) 6.0(3) 3.0(1)
Ri+ 19.0 32.5 14.5
with
$$T.S. = \frac{12}{11(3)(4)}\left(19^2 + 32.5^2 + 14.5^2\right) - 3(11)(4) = 15.95455$$
and hence the p-value = P(χ²₂ ≥ 15.955) = 0.00034, so we reject the null and conclude that not all treatments have the same center of location. In the R script provided for this example a function, friedman.test2, was created that provides the test statistic and all the pairwise comparisons. A built-in function friedman.test does exist (it adjusts for ties differently), but it does not compute the confidence intervals.
> friedman.test2(tmax~fformu|subj,data=cap,mc=TRUE)
[1] "95 % Pairwise CIs on rank sums"
Difference Lower Upper Differ?
cap_f - cap_nf -13.5 -18.1967 -8.8033 1
cap_f - entct_f 4.5 -0.1967 9.1967 0
cap_nf - entct_f 18.0 13.3033 22.6967 1
Ef Cf Cnf
http://www.stat.ufl.edu/~athienit/IntroStat/friedman.R
145
Bibliography
[3] Kutner, M., Nachtsheim, C., Neter, J., Li, W. Applied Linear Statistical
Models. McGraw-Hill, 2004
146