Ed Inference1
TESTING
To draw certain inferences or make a decision about some hypotheses concerning the
In any decision, because only a small amount of data is collected, there is a risk of taking the
wrong decision (uncertainty).
Definitions:
variable).
Random samples: each member of the population has an equal chance of being included in the sample.
Sample mean:
$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$$
Sample variance:
$$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1}$$
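As a minimal sketch of the sample mean and the sample variance (with the n - 1 divisor), the following Python snippet computes both for a small list of numbers. The data values and the function names are hypothetical and serve only as an illustration.

```python
import statistics

def sample_mean(xs):
    """Arithmetic mean of the observations."""
    return sum(xs) / len(xs)

def sample_variance(xs):
    """Sample variance with the n - 1 divisor."""
    n = len(xs)
    if n < 2:
        raise ValueError("at least two observations are required")
    m = sample_mean(xs)
    return sum((x - m) ** 2 for x in xs) / (n - 1)

# Hypothetical data, not taken from the notes.
data = [2.0, 4.0, 4.0, 5.0, 7.0]

mean = sample_mean(data)
variance = sample_variance(data)

# The standard library uses the same n - 1 convention in statistics.variance.
print(mean, variance, statistics.variance(data))
```

Dividing by n - 1 rather than n is what makes the sample variance an unbiased estimator of the population variance.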
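Estimator: given $f(x, \theta)$, the density function of X, and $\theta$, the unknown parameter, an
estimator is a function of the sample; its value
$$\hat{\theta} = g(x_1, \ldots, x_n)$$
is the single numerical value that is called the estimate of the parameter $\theta$. When is an
estimator good for estimating a given parameter? One criterion is unbiasedness: U is an unbiased
estimator of $\theta$ if $E(U) = \theta$.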
Estimation:
Two types of estimate:
1) Point estimate: a single numerical value to estimate univocally a parameter. However,
two different samples may produce differences between the estimated values.
2) Confidence interval: two numerical values defining a range of values within which the
parameter falls with a specified degree of confidence (confidence coefficient or
confidence level = $1 - \alpha$, most frequently 0.90, 0.95 and 0.99).
Confidence interval for the difference between two population means $\mu_1 - \mu_2$.
Confidence interval for the variance $\sigma^2$ of a normally distributed population ($s^2$ = sample
variance).
Confidence interval for the ratio of the variances $\sigma_2^2 / \sigma_1^2$ of two normally distributed
populations.
Confidence interval for the difference between two population proportions $p_1 - p_2$.
distributions.
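ASSUMPTIONS and CONFIDENCE INTERVAL for each case:

1) Normal population, or n > 30; $\sigma$ known
$$\bar{X} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{X} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$$
Sampling distribution: normal

2) Arbitrary population with unknown $\sigma$, n > 30
$$\bar{X} - z_{\alpha/2}\frac{S}{\sqrt{n}} \le \mu \le \bar{X} + z_{\alpha/2}\frac{S}{\sqrt{n}}$$
Sampling distribution: normal

3) Normal population with unknown $\sigma$
$$\bar{X} - t_{\alpha/2}\frac{S}{\sqrt{n}} \le \mu \le \bar{X} + t_{\alpha/2}\frac{S}{\sqrt{n}}$$
Sampling distribution: t with n - 1 degrees of freedom

4) Two populations, $\sigma_1$ and $\sigma_2$ known
$$\bar{X}_1 - \bar{X}_2 - z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} \le \mu_1 - \mu_2 \le \bar{X}_1 - \bar{X}_2 + z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$$
Sampling distribution: normal

5) Two populations, $\sigma_1$ and $\sigma_2$ unknown, large samples
$$\bar{X}_1 - \bar{X}_2 - z_{\alpha/2}\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}} \le \mu_1 - \mu_2 \le \bar{X}_1 - \bar{X}_2 + z_{\alpha/2}\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}$$
Sampling distribution: normal

6) Two normal populations, variances unknown but equal
$$\bar{X}_1 - \bar{X}_2 - t_{\alpha/2}\sqrt{\left(\frac{1}{n_1} + \frac{1}{n_2}\right)\frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2}} \le \mu_1 - \mu_2 \le \bar{X}_1 - \bar{X}_2 + t_{\alpha/2}\sqrt{\left(\frac{1}{n_1} + \frac{1}{n_2}\right)\frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2}}$$
Sampling distribution: t with $n_1 + n_2 - 2$ degrees of freedom

7) Normal population
$$\frac{(n - 1)S^2}{\chi^2_{\alpha/2}} \le \sigma^2 \le \frac{(n - 1)S^2}{\chi^2_{1-\alpha/2}}$$
Sampling distribution: $\chi^2$ with n - 1 degrees of freedom

8) Two normal populations
$$F_{1-\alpha/2}\frac{S_2^2}{S_1^2} \le \frac{\sigma_2^2}{\sigma_1^2} \le F_{\alpha/2}\frac{S_2^2}{S_1^2}$$
Sampling distribution: F with $n_1 - 1$ numerator degrees of freedom and $n_2 - 1$ denominator degrees of freedom

9) Binomial population, n > 30
$$\hat{p} - z_{\alpha/2}\sqrt{\frac{\hat{p}(1 - \hat{p})}{n}} \le p \le \hat{p} + z_{\alpha/2}\sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$$
Sampling distribution: normal

10) Two binomial populations, large samples
$$\hat{p}_1 - \hat{p}_2 - z_{\alpha/2}\sqrt{\frac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2(1 - \hat{p}_2)}{n_2}} \le p_1 - p_2 \le \hat{p}_1 - \hat{p}_2 + z_{\alpha/2}\sqrt{\frac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2(1 - \hat{p}_2)}{n_2}}$$
Sampling distribution: normal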
Hypothesis Testing:
$\sigma = 0.008$ °C. Added water raises the freezing temperature. XXL's laboratory manager
measures the freezing temperature of five consecutive lots of milk from one producer. The mean
measurement is $\bar{x} = -0.538$ °C. Is this good evidence that this producer is adding water to the
milk? (Moore-McCabe, p. 463)
1. Suppose that no water has been added, so that the mean freezing point of the
population of all milk from this producer is $\mu = -0.545$ °C. Then what is the
probability that five measurements would give a sample mean as high as
-0.538 °C or higher?
2. An outcome this high or higher has probability 0.025 if natural milk is measured.
3. Hence, since a mean freezing temperature as high as that observed would occur only
2.5 times per 100 samples of natural milk, there is evidence that the producer is
watering the milk.
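The probability quoted in step 2 can be checked with a short calculation. The sketch below assumes that freezing temperatures are normally distributed with mean -0.545 °C and standard deviation 0.008 °C when no water is added, and that the sample mean of five lots is -0.538 °C; the variable names are illustrative only.

```python
from math import sqrt
from statistics import NormalDist

mu0 = -0.545    # mean freezing point of natural milk (degrees C)
sigma = 0.008   # assumed known standard deviation (degrees C)
n = 5           # number of lots measured
xbar = -0.538   # observed sample mean (degrees C)

# Standardize the sample mean under the assumption that no water was added.
z = (xbar - mu0) / (sigma / sqrt(n))

# Probability of a sample mean this high or higher (upper tail).
p_value = 1 - NormalDist().cdf(z)

print(f"z = {z:.3f}, P(sample mean >= {xbar}) = {p_value:.3f}")
```

The upper tail is used because added water raises the freezing temperature, so only unusually high sample means count as evidence of watering.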
1. Hypotheses:
The null hypothesis $H_0$ is the statement being tested; a test of significance is designed to
assess the strength of the evidence against the null hypothesis. It contains a statement of
equality (either $=$, $\le$, or $\ge$).
Since only the parameter value in $H_0$ that is closest to $H_1$ influences the form of the
test in all common significance testing situations, in $H_0$ the parameter has a specific value
($\theta = \theta_0$). Meanwhile, $H_1$ may be one-sided ($\theta > \theta_0$ or $\theta < \theta_0$) or two-sided ($\theta \ne \theta_0$).
Null hypothesis versus alternative hypothesis. If the null hypothesis is not rejected, the
data on which the test is based do not provide sufficient evidence to cause rejection. If the
testing procedure leads to rejection, the data are not compatible with the null hypothesis and
support the alternative hypothesis.
2. Test statistic:
A statistic computed from the data of the sample which serves as a decision maker. It shows
whether or not the data give evidence against the null hypothesis. One has to:
a) Choose a test statistic to test the null hypothesis, taking into account any assumptions
about the normality of the population distribution, equality of variances, independence of
samples.
b) Determine the sampling distribution of the test statistic when the null hypothesis is
true.
c) Compute a value of the test statistic from the data contained in the sample (calculation
of test statistic).
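3. Types of errors
                        Condition of Null Hypothesis
Possible action         True                False
Do not reject H0        Correct decision    Type II error
Reject H0               Type I error        Correct decision
Significance level ($\alpha$) = the probability of rejecting a true null hypothesis, that is, the probability
of committing a type I error.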
5. Decision rule
The null hypothesis is rejected if the value of the test statistic computed from the sample falls in
the rejection region. Otherwise, the null hypothesis is not rejected.
4. p-value
The probability, assuming that H0 is true, that the test statistic will take a value at least as extreme
in the direction of H1 as that actually computed is the p-value. The smaller the p-value is, the
stronger is the evidence against H0 provided by the data.
5. Decision rule
p-value $\le \alpha$ $\Rightarrow$ we reject $H_0$ (the data are statistically significant at level $\alpha$)
whereas
p-value $> \alpha$ $\Rightarrow$ we do not reject $H_0$
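For a test statistic that follows the standard normal distribution under H0, the p-value and the decision rule can be sketched as follows. The function names and the labels "greater", "less" and "two-sided" are hypothetical choices made for this illustration.

```python
from statistics import NormalDist

def z_p_value(z, alternative):
    """p-value of a standard normal test statistic for a given form of H1."""
    std_normal = NormalDist()
    if alternative == "greater":      # H1: parameter > hypothesized value
        return 1 - std_normal.cdf(z)
    if alternative == "less":         # H1: parameter < hypothesized value
        return std_normal.cdf(z)
    if alternative == "two-sided":    # H1: parameter != hypothesized value
        return 2 * (1 - std_normal.cdf(abs(z)))
    raise ValueError("alternative must be 'greater', 'less' or 'two-sided'")

def decide(p_value, alpha):
    """Reject H0 when the p-value does not exceed the significance level."""
    return "reject H0" if p_value <= alpha else "do not reject H0"

# Illustrative values: z = 1.96 in a two-sided test at the 5% level.
p = z_p_value(1.96, "two-sided")
print(round(p, 4), decide(p, 0.05))
```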
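ASSUMPTIONS, H0, TEST STATISTIC, H1 and REJECTION REGION OF H0 for each case:

1) Normal population, or n > 30; $\sigma$ known
H0: $\mu = \mu_0$
Test statistic: $Z = \dfrac{\bar{X} - \mu_0}{\sigma/\sqrt{n}} \sim N(0,1)$
H1: $\mu \ne \mu_0$, reject H0 if $|Z| \ge z_{\alpha/2}$
H1: $\mu > \mu_0$, reject H0 if $Z \ge z_{\alpha}$
H1: $\mu < \mu_0$, reject H0 if $Z \le -z_{\alpha}$

2) Arbitrary population with unknown $\sigma$, n > 30
H0: $\mu = \mu_0$
Test statistic: $Z = \dfrac{\bar{X} - \mu_0}{S/\sqrt{n}} \sim N(0,1)$
H1: $\mu \ne \mu_0$, reject H0 if $|Z| \ge z_{\alpha/2}$
H1: $\mu > \mu_0$, reject H0 if $Z \ge z_{\alpha}$
H1: $\mu < \mu_0$, reject H0 if $Z \le -z_{\alpha}$

3) Normal population with unknown $\sigma$
H0: $\mu = \mu_0$
Test statistic: $T = \dfrac{\bar{X} - \mu_0}{S/\sqrt{n}} \sim t$ with n - 1 d.f.
H1: $\mu \ne \mu_0$, reject H0 if $|T| \ge t_{\alpha/2}$
H1: $\mu > \mu_0$, reject H0 if $T \ge t_{\alpha}$
H1: $\mu < \mu_0$, reject H0 if $T \le -t_{\alpha}$

4) Two populations, $\sigma_1$ and $\sigma_2$ known
H0: $d = d_0$ (where $d = \mu_1 - \mu_2$)
Test statistic: $Z = \dfrac{(\bar{X}_1 - \bar{X}_2) - d_0}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}} \sim N(0,1)$
H1: $d \ne d_0$, reject H0 if $|Z| \ge z_{\alpha/2}$
H1: $d > d_0$, reject H0 if $Z \ge z_{\alpha}$
H1: $d < d_0$, reject H0 if $Z \le -z_{\alpha}$

5) Two populations, $\sigma_1$ and $\sigma_2$ unknown, large samples
H0: $d = d_0$ (where $d = \mu_1 - \mu_2$)
Test statistic: $Z = \dfrac{(\bar{X}_1 - \bar{X}_2) - d_0}{\sqrt{\dfrac{S_1^2}{n_1} + \dfrac{S_2^2}{n_2}}} \sim N(0,1)$
H1: $d \ne d_0$, reject H0 if $|Z| \ge z_{\alpha/2}$
H1: $d > d_0$, reject H0 if $Z \ge z_{\alpha}$
H1: $d < d_0$, reject H0 if $Z \le -z_{\alpha}$

6) Two normal populations, variances unknown but equal
H0: $d = d_0$ (where $d = \mu_1 - \mu_2$)
Test statistic: $T = \dfrac{(\bar{X}_1 - \bar{X}_2) - d_0}{\sqrt{\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)\dfrac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2}}} \sim t$ with $n_1 + n_2 - 2$ d.f.
H1: $d \ne d_0$, reject H0 if $|T| \ge t_{\alpha/2}$
H1: $d > d_0$, reject H0 if $T \ge t_{\alpha}$
H1: $d < d_0$, reject H0 if $T \le -t_{\alpha}$

7) Normal population
H0: $\sigma^2 = \sigma_0^2$
Test statistic: $\chi^2 = \dfrac{(n - 1)S^2}{\sigma_0^2} \sim \chi^2$ with n - 1 d.f.
H1: $\sigma^2 \ne \sigma_0^2$, reject H0 if $\chi^2 \le \chi^2_{1-\alpha/2}$ or $\chi^2 \ge \chi^2_{\alpha/2}$
H1: $\sigma^2 > \sigma_0^2$, reject H0 if $\chi^2 \ge \chi^2_{\alpha}$
H1: $\sigma^2 < \sigma_0^2$, reject H0 if $\chi^2 \le \chi^2_{1-\alpha}$

8) Two normal populations
H0: $\sigma_1^2 = \sigma_2^2$
Test statistic: $F = \dfrac{S_1^2}{S_2^2} \sim F$ with $(n_1 - 1, n_2 - 1)$ d.f.
H1: $\sigma_1^2 \ne \sigma_2^2$, reject H0 if $F \ge F_{\alpha/2}$ or $F \le F_{1-\alpha/2}$
H1: $\sigma_1^2 > \sigma_2^2$, reject H0 if $F \ge F_{\alpha}$
H1: $\sigma_1^2 < \sigma_2^2$, reject H0 if $F \le F_{1-\alpha}$

9) Binomial population, n > 30
H0: $p = p_0$
Test statistic: $Z = \dfrac{\hat{p} - p_0}{\sqrt{\dfrac{p_0(1 - p_0)}{n}}} \sim N(0,1)$
H1: $p \ne p_0$, reject H0 if $|Z| \ge z_{\alpha/2}$
H1: $p > p_0$, reject H0 if $Z \ge z_{\alpha}$
H1: $p < p_0$, reject H0 if $Z \le -z_{\alpha}$

10) Two binomial populations, large samples
H0: $p_1 = p_2$
Test statistic: $Z = \dfrac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1 - \hat{p})\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}} \sim N(0,1)$, where $\hat{p}$ is the pooled sample proportion
H1: $p_1 \ne p_2$, reject H0 if $|Z| \ge z_{\alpha/2}$
H1: $p_1 > p_2$, reject H0 if $Z \ge z_{\alpha}$
H1: $p_1 < p_2$, reject H0 if $Z \le -z_{\alpha}$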
Steps in the hypothesis testing procedure. [Adapted from Daniel (1999), p. 211]
1) Evaluate data
2) Review assumptions
3) State hypotheses
4) Determine distribution of test statistic
5) State decision rule
6) Make statistical decision
a) Do not reject H0: conclude H0 may be true
b) Reject H0: conclude H1 is true
Examples:
Example 1. Confidence interval for a population mean sampling from a normal population
A physical therapist wished to estimate, with 99% confidence, the mean maximal strength of a
particular muscle in a certain group of individuals. He is willing to assume that strength scores are
approximately normally distributed with a variance of 144. A sample of 15 subjects who
participated in the experiment yielded a mean of 84.3. (Daniel, p. 157)
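Solution: The z value corresponding to a confidence coefficient of 0.99 is found in the table of
standard normal distribution to be 2.58. The standard error is
$$\frac{\sigma}{\sqrt{n}} = \frac{12}{\sqrt{15}} = 3.0984$$
The 99% confidence interval for $\mu$ is then $84.3 \pm 2.58(3.0984)$, that is $84.3 \pm 8.0$, or $76.3 \le \mu \le 92.3$.
We are 99% confident that the population mean is between 76.3 and 92.3.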
Probabilistic interpretation: In repeated sampling 99% of all intervals that could be
constructed in the manner described would include the population mean.
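The interval of Example 1 can be reproduced with a few lines of Python. This is only a sketch: the function name is hypothetical, and the z value is taken from the standard library rather than from a printed table, so it is 2.5758 instead of the rounded 2.58.

```python
from math import sqrt
from statistics import NormalDist

def z_interval(xbar, sigma, n, confidence):
    """Confidence interval for a mean when the population sigma is known."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)   # upper alpha/2 point of N(0,1)
    half_width = z * sigma / sqrt(n)
    return xbar - half_width, xbar + half_width

# Example 1: variance 144 (sigma = 12), n = 15, sample mean 84.3, 99% confidence.
low, high = z_interval(84.3, 12.0, 15, 0.99)
print(f"{low:.1f} <= mu <= {high:.1f}")
```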
Example 2. Confidence interval for a population mean sampling from a non-normal population
Punctuality of patients in keeping appointments is of interest to a research team. In a study of
patient flow through the offices of general practitioners, it was found that a sample of 35 were
17.2 minutes late for appointments, on the average, with a standard deviation of 8 minutes. The
population distribution was felt to be non-normal. What is the 90% confidence interval for $\mu$, the
true mean amount of time late for appointments? (Daniel, p. 158)
Solution: The sample size is fairly large (greater than 30). Although the population standard
deviation is unknown, we can use the sample variance as a replacement for the unknown
population variance. We therefore assume the sampling distribution of the sample mean
to be approximately normally distributed (application of the central limit theorem). From
the table of standard normal distribution we find the z value corresponding to a
confidence coefficient of 0.90 to be about 1.645. The standard error is
$$\frac{s}{\sqrt{n}} = \frac{8}{\sqrt{35}} = 1.3522$$
The 90% confidence interval for $\mu$ is therefore $17.2 \pm 1.645(1.3522)$, that is $17.2 \pm 2.2$, or $15.0 \le \mu \le 19.4$.
Example 3. Confidence interval for the difference between two population means sampling from
a normal population with equal variances
The purpose of a study by Stone et al. was to determine the effects of long-term exercise
intervention on corporate executives enrolled in a supervised fitness program. Data were
collected on 13 subjects (the exercise group) who voluntarily entered a supervised exercise
program and remained active for an average of 13 years and 17 subjects (the sedentary group)
who elected not to join the fitness program. Among the data collected on the subjects was
maximum number of sit-ups completed in 30 seconds. The exercise group had a mean and
standard deviation for this variable of 21.0 and 4.9, respectively. The mean and standard
deviation for the sedentary group were 12.1 and 5.6, respectively. We assume that the two
populations of overall muscle condition measures are approximately normally distributed and that
the two population variances are equal. We wish to construct a 95% confidence interval for the
difference between the means of the populations represented by these two samples. (Daniel, pp.
170-171)
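Solution: First of all we compute the pooled estimate of the common population variance:
$$s_p^2 = \frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2} = \frac{12(4.9)^2 + 16(5.6)^2}{13 + 17 - 2} = 28.21$$
With $n_1 + n_2 - 2 = 28$ degrees of freedom, $t_{\alpha/2} = 2.048$, so the interval is:
$$(21.0 - 12.1) \pm 2.048\sqrt{28.21\left(\frac{1}{13} + \frac{1}{17}\right)}$$
$$8.9 \pm 4.008$$
$$4.9 \le \mu_1 - \mu_2 \le 12.9$$
We are 95% confident that the difference between the population means is between 4.9 and 12.9.
Since the interval does not include zero, we conclude that the population means are not equal.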
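The pooled-variance calculation of Example 3 can be sketched as below. The Python standard library has no t distribution, so the critical value 2.048 (28 degrees of freedom, 95% confidence) is passed in as a table value; the function names are hypothetical.

```python
from math import sqrt

def pooled_variance(n1, s1, n2, s2):
    """Pooled estimate of a common variance from two sample standard deviations."""
    return ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2)

def pooled_t_interval(x1, s1, n1, x2, s2, n2, t_crit):
    """Interval for mu1 - mu2 assuming equal variances; t_crit comes from a t table."""
    sp2 = pooled_variance(n1, s1, n2, s2)
    half_width = t_crit * sqrt(sp2 * (1 / n1 + 1 / n2))
    diff = x1 - x2
    return diff - half_width, diff + half_width

# Example 3: exercise group (n = 13) versus sedentary group (n = 17).
sp2 = pooled_variance(13, 4.9, 17, 5.6)
low, high = pooled_t_interval(21.0, 4.9, 13, 12.1, 5.6, 17, t_crit=2.048)
print(f"pooled variance = {sp2:.2f}, interval = ({low:.1f}, {high:.1f})")
```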
Example 4. Confidence interval for the ratio of the variances of two normally distributed
populations
A study was conducted to determine if an acute dose of dextroamphetamine might have positive
effects on affect and cognition in schizophrenic patients maintained on a regimen of haloperidol.
Among the variables measured was the change in patients' tension-anxiety states. For $n_2 = 4$
patients who responded to amphetamine, the standard deviation for this measurement was 3.4.
For n1 = 11 patients who did not respond, the standard deviation was 5.8. Let us assume that
these patients constitute independent simple random samples from populations of similar
patients. Let us also assume that change scores in tension-anxiety state is a normally distributed
variable in both populations. We wish to construct a 95% confidence interval for the ratio of the
variances of these two populations. (Daniel, p. 192)
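Solution: The information we have:
$n_1 = 11$, $n_2 = 4$
$s_1^2 = 5.8^2 = 33.64$, $s_2^2 = 3.4^2 = 11.56$
$df_1 = 10$, $df_2 = 3$
Since the sampling distribution follows an F distribution with 10 and 3 degrees of freedom,
when $\alpha = 0.1$:
$F_{0.05} = 8.79$, $F_{0.95} = 0.27$
and the confidence interval for $\sigma_2^2 / \sigma_1^2$ is:
$$0.27\,\frac{11.56}{33.64} \le \frac{\sigma_2^2}{\sigma_1^2} \le 8.79\,\frac{11.56}{33.64}$$
$$0.093 \le \frac{\sigma_2^2}{\sigma_1^2} \le 3.02$$
Note that $\alpha = 0.1$ corresponds to a confidence coefficient of 0.90; the 95% interval asked for in the
problem would use $F_{0.025}$ and $F_{0.975}$ in place of $F_{0.05}$ and $F_{0.95}$.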
Since the interval includes 1, we are able to conclude that the two population variances
may be equal.
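The arithmetic of Example 4 is short enough to check directly. In this sketch the two F percentiles (0.27 and 8.79, for 10 and 3 degrees of freedom) are the table values quoted in the solution and are passed in as arguments, since the Python standard library has no F distribution; the function name is hypothetical.

```python
def variance_ratio_interval(s1_sq, s2_sq, f_lower, f_upper):
    """Interval for sigma2^2 / sigma1^2 from two sample variances.

    f_lower and f_upper are the lower and upper percentiles of the
    F distribution with (n1 - 1, n2 - 1) degrees of freedom, read from a table.
    """
    ratio = s2_sq / s1_sq
    return f_lower * ratio, f_upper * ratio

# Example 4: s1 = 5.8 (n1 = 11), s2 = 3.4 (n2 = 4).
low, high = variance_ratio_interval(5.8 ** 2, 3.4 ** 2, f_lower=0.27, f_upper=8.79)
print(f"{low:.3f} <= ratio <= {high:.2f}")

# The two variances may be equal if the value 1 lies inside the interval.
contains_one = low <= 1 <= high
print(contains_one)
```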
$$\hat{p} = \frac{421}{500} = 0.842$$
From the table of standard normal distribution we find the z value corresponding
to a confidence coefficient of 0.95 to be 1.96. The interval is:
$$0.842 \pm 1.96\sqrt{\frac{0.842 \times 0.158}{500}}$$
$$0.842 \pm 0.032$$
$$0.81 \le p \le 0.874$$
The association was 95% confident that between 81% and 87% of Indiana homes
displayed Christmas trees.
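A large-sample interval for a proportion, as used for the 421 out of 500 households above, might be computed like this. The function name is hypothetical and the z value comes from the standard library instead of a table.

```python
from math import sqrt
from statistics import NormalDist

def proportion_interval(successes, n, confidence):
    """Large-sample (normal approximation) interval for a population proportion."""
    p_hat = successes / n
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    half_width = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, p_hat - half_width, p_hat + half_width

# 421 of the 500 sampled households displayed a tree; 95% confidence.
p_hat, low, high = proportion_interval(421, 500, 0.95)
print(f"p_hat = {p_hat:.3f}, interval = ({low:.2f}, {high:.3f})")
```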
$H_1 : p \ne 0.36$
where p represents the proportion of rural households that would be obtained by the telephone
sampling procedure if it were repeated over and over again. (Moore-McCabe, p. 584)
Solution: The test statistic is:
$$z_{calc} = \frac{0.38 - 0.36}{\sqrt{\dfrac{0.36 \times 0.64}{500}}} = 0.93$$
From the table of standard normal distribution, we find that the probability that Z is less
than or equal to 0.93 is 0.8238. The probability in each tail is $1 - 0.8238 = 0.1762$.
Therefore the p-value is $2 \times 0.1762 = 0.35$. There is a 35% chance of getting a value of
Z larger than 0.93 or smaller than -0.93 if H0 is true. We therefore have no reason to
reject the hypothesis that the sampling procedure is unbiased with respect to rural versus
urban residence.
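The two-sided test above can be sketched in Python as follows, using the sample proportion 0.38, the hypothesized proportion 0.36 and n = 500 from the solution. The variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

p_hat = 0.38   # observed proportion of rural households in the sample
p0 = 0.36      # proportion stated in the null hypothesis
n = 500        # sample size

# The standard error uses p0 because it is computed under H0.
z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)

# Two-sided p-value: both tails beyond |z|.
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

print(f"z = {z:.2f}, p-value = {p_value:.2f}")
```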
The survey responses show that 89 of the urban households and 64 of the rural
households who displayed a tree chose a natural tree. So:
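$$\hat{p}_1 = \frac{89}{261} = 0.341, \qquad \hat{p}_2 = \frac{64}{160} = 0.4$$
The pooled estimate of the common value of the proportion of respondents who chose a
natural tree is:
$$\hat{p} = \frac{89 + 64}{261 + 160} = 0.363$$
The standard error of the difference is:
$$\sqrt{0.363 \times 0.637\left(\frac{1}{261} + \frac{1}{160}\right)} = 0.04828$$
and the test statistic is:
$$z_{calc} = \frac{0.341 - 0.4}{0.04828} = -1.22$$
From the table of standard normal distribution, and since we are doing a one-sided test,
the p-value is:
$$P(Z \le -1.22) = 0.1112$$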
Even though rural households in the survey chose natural Christmas trees more often
than the urban households, our calculations indicate that there is not sufficient evidence
in the data to conclude that this difference in preferences is true in the population of
Indiana tree users. If the preferences of rural and urban households were identical, rural
usage would exceed urban usage by an amount leading to a z statistic at least as large
as the one observed in 11% of all samples of this size.
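A sketch of the pooled two-proportion test for the urban (89 of 261) and rural (64 of 160) households follows. The function name is hypothetical. Because the code carries full precision instead of rounded intermediate values, the p-value comes out at about 0.111 rather than exactly 0.1112.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z(x1, n1, x2, n2):
    """z statistic for H0: p1 = p2 using the pooled sample proportion."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# Urban households: 89 of 261 chose a natural tree; rural: 64 of 160.
z = two_proportion_z(89, 261, 64, 160)

# One-sided alternative p1 < p2, so the p-value is the lower tail.
p_value = NormalDist().cdf(z)

print(f"z = {z:.2f}, p-value = {p_value:.3f}")
```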
of the population of all measurements is the true concentration in the specimen. The standard
deviation of this distribution is a property of the analytical procedure and is known to be
$\sigma = 0.0068$ gram per litre. The laboratory analyzes each specimen three times and reports the
mean result. The laboratory is asked to evaluate the claim that the concentration of the active
ingredient in a specimen is 0.86%. The mean of three repeated analyses of the specimen is
$\bar{x} = 0.8404$. The true concentration is the mean $\mu$ of the population of repeated analyses. Can
the laboratory conclude that $\mu$ is different from 0.86? (Moore-McCabe, pp. 474-475)
Solution: The hypotheses are:
$H_0 : \mu = 0.86$
$H_1 : \mu \ne 0.86$
The lab chooses the 1% level of significance ($\alpha = 0.01$). The computed value of the test
statistic is:
$$z_{calc} = \frac{0.8404 - 0.86}{0.0068/\sqrt{3}} = -4.99$$
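To finish the comparison with the 1% level, the computed statistic can be set against the two-sided rejection region $|Z| \ge z_{\alpha/2}$. The sketch below uses the numbers of this example; the variable names are illustrative and the critical value is computed with the standard library rather than read from a table.

```python
from math import sqrt
from statistics import NormalDist

mu0 = 0.86       # concentration claimed under H0
sigma = 0.0068   # known standard deviation of the analytical procedure
n = 3            # number of repeated analyses
xbar = 0.8404    # mean of the three analyses
alpha = 0.01     # significance level chosen by the lab

z = (xbar - mu0) / (sigma / sqrt(n))

# Two-sided rejection region: |Z| >= z_{alpha/2}.
z_crit = NormalDist().inv_cdf(1 - alpha / 2)
decision = "reject H0" if abs(z) >= z_crit else "do not reject H0"

print(f"z = {z:.2f}, critical value = {z_crit:.3f}, decision: {decision}")
```

Since the statistic lies far beyond the critical value of about 2.576, the data are significant at the 1% level under these assumptions.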
References
Daniel (1999), pp. 150-271.
Moore and McCabe (1989), pp. 443-605.