Review of Sessions 1-7 PUBH 614 Spring 2019
Types of variables
Random Assignment vs. Random Sampling
Shape of a Distribution: Modality
Does the histogram have a single prominent peak (unimodal), several prominent
peaks (bimodal/multimodal), or no apparent peaks (uniform)?
Note: To determine modality, step back and imagine a smooth curve over
the histogram. Imagine that the bars are wooden blocks and you drop a limp
strand of spaghetti over them; the shape the spaghetti takes can be viewed as a
smooth curve.
Shape of a Distribution: Skewness
Is the histogram right skewed, left skewed, or symmetric?
Note: Histograms are said to be skewed to the side of the long tail.
Shape of a Distribution: Box Plot
Normal Distribution
● Unimodal and symmetric, bell-shaped curve.
● Notation: X ~ N(µ, σ) means the random variable, X, is normally
distributed with mean µ and standard deviation σ
The mean, µ, and standard deviation, σ, are examples of population parameters.
● Note that µ=0 and σ=1 for the standard normal distribution
Normal distribution
[Figure: the original distribution and the same distribution now standardized]
• 1. What percentage of adults have SBP less than 100?
P(X<100) = P(Z < (100 - 120)/20) = P(Z < -1) = .1587
Using NORM.DIST directly:
=NORM.DIST(x,mean,sd,TRUE)
=NORM.DIST(100,120,20,TRUE)
Pr(x < 100)= 0.1587
or standardizing first:
=STANDARDIZE(x,mean,sd)
=STANDARDIZE(100,120,20)
z= -1
=NORM.S.DIST(z, TRUE)
=NORM.S.DIST(-1, TRUE)
Pr(x < 100)= 0.1587
15.9% of adults have systolic blood pressure less than 100
2. What percentage of adults have SBP greater than 100?
• P(X>100) = 1 – P(X<100)
• We know that P(X<100) = .1587 from before
• P(X>100)= 1-.1587=.8413
=1-norm.s.dist(z, TRUE)
=1-norm.s.dist(-1, TRUE)
Pr(x > 100)= 0.8413
or
=1-norm.dist(x,mean,sd,true)
=1-norm.dist(100,120,20,TRUE)
Pr(x > 100)= 0.8413
• 3. What percentage of adults have SBP greater than 133?
P(X>133) = 1 - P(X<133)
P(X<133) = P(Z < (133 - 120)/20) = P(Z < .65) = .7422
P(X>133) = 1 - .7422 = .2578
=standardize(133,120,20)
z= 0.65
=1-norm.s.dist(z, TRUE)
=1-norm.s.dist(.65, TRUE)
Pr(x > 133)= 0.2578
or
=1-norm.dist(x,mean,sd,true)
=1-norm.dist(133,120,20,TRUE)
Pr(x > 133)= 0.2578
• 4. What percentage of adults have SBP between 100 and 133?
P(100<X<133) = P(X<133) - P(X<100)
Using STANDARDIZE and NORM.S.DIST:
X      Z      P(Z<z)
100    -1     0.1587
133    0.65   0.7422
Pr(-1 < Z < .65) = 0.5835
or using NORM.DIST:
X      P(X<x)
100    0.1587
133    0.7422
P(100 < X < 133) = 0.5835
58% of adults have systolic blood pressure between 100 and 133
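The four SBP questions can also be checked outside Excel. The sketch below is one way to do it in Python, assuming the mean of 120 and SD of 20 that appear in the Excel formulas; `statistics.NormalDist` stands in for NORM.DIST and NORM.S.DIST, and the variable names are illustrative only.

```python
from statistics import NormalDist

# Assumed parameters, taken from the Excel formulas: mean 120, SD 20.
sbp = NormalDist(mu=120, sigma=20)

p_below_100 = sbp.cdf(100)                 # Question 1: P(X < 100)
p_above_100 = 1 - sbp.cdf(100)             # Question 2: P(X > 100)
p_above_133 = 1 - sbp.cdf(133)             # Question 3: P(X > 133)
p_between = sbp.cdf(133) - sbp.cdf(100)    # Question 4: P(100 < X < 133)

# Same answer for Question 4 by standardizing first
# (the STANDARDIZE + NORM.S.DIST route).
z_100 = (100 - sbp.mean) / sbp.stdev
z_133 = (133 - sbp.mean) / sbp.stdev
std_normal = NormalDist()                  # standard normal: mean 0, SD 1
p_between_z = std_normal.cdf(z_133) - std_normal.cdf(z_100)

for label, value in [
    ("P(X < 100)", p_below_100),
    ("P(X > 100)", p_above_100),
    ("P(X > 133)", p_above_133),
    ("P(100 < X < 133)", p_between),
]:
    print(f"{label} = {value:.4f}")
```

Both routes give the same probability because standardizing only rescales the axis; it does not change the area under the curve.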
Finding cutoff points
• Since z = (x - µ)/σ, then x = µ + zσ.
Finding cutoff points
Body temperatures of healthy humans are distributed nearly normally with mean
98.2°F and standard deviation 0.73°F. What is the cutoff for the lowest 3% of human
body temperatures?
Mackowiak, Wasserman, and Levine (1992), A Critical Appraisal of 98.6 Degrees F, the Upper Limit of the
Normal Body Temperature, and Other Legacies of Carl Reinhold August Wunderlich.
Practice
Body temperatures of healthy humans are distributed nearly normally with mean
98.2°F and standard deviation 0.73°F. What is the cutoff for the highest 10% of human
body temperatures?
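Cutoff questions run the normal calculation in reverse: start from a percentile, find z, then convert back with x = µ + zσ. A minimal Python sketch of both body-temperature cutoffs follows, using the stated mean of 98.2°F and SD of 0.73°F; `inv_cdf` plays the role that NORM.INV would play in Excel, and the variable names are illustrative.

```python
from statistics import NormalDist

body_temp = NormalDist(mu=98.2, sigma=0.73)

# Lowest 3%: the temperature with 3% of the distribution below it.
low_cutoff = body_temp.inv_cdf(0.03)

# Highest 10%: the temperature with 90% of the distribution below it.
high_cutoff = body_temp.inv_cdf(0.90)

# The same lowest-3% cutoff done by hand: find z, then x = mu + z * sigma.
z_low = NormalDist().inv_cdf(0.03)
low_cutoff_manual = 98.2 + z_low * 0.73

print(f"z for lowest 3%: {z_low:.2f}")
print(f"Lowest 3% cutoff: {low_cutoff:.2f} F")
print(f"Highest 10% cutoff: {high_cutoff:.2f} F")
```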
68-95-99.7 Rule
For nearly normally distributed data,
● about 68% falls within 1 SD of the mean,
● about 95% falls within 2 SD of the mean,
● about 99.7% falls within 3 SD of the mean.
It is possible for observations to fall 4, 5, or more standard deviations away from the mean,
but these occurrences are very rare if the data are nearly normal.
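The three percentages in the rule are areas under the standard normal curve, so they can be confirmed directly. This short Python sketch computes them; the dictionary name `within` is illustrative.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, SD 1

# Area within k standard deviations of the mean, for k = 1, 2, 3.
within = {k: z.cdf(k) - z.cdf(-k) for k in (1, 2, 3)}

for k, area in within.items():
    print(f"Within {k} SD: {area:.4f}")
```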
Contingency Tables
A table that summarizes data for two categorical variables is called a contingency
table.
The contingency table below shows the distribution of students' genders and
whether or not they are looking for a spouse while in college.
Binomial distribution
The formula for the binomial probability distribution is
$P(K = k) = \binom{n}{k} p^k (1 - p)^{n - k}$
The normal approximation to the binomial applies when np ≥ 10 and n(1 - p) ≥ 10
Practice
Below are four pairs of Binomial distribution parameters. Which
distribution can be approximated by the normal distribution?
A. n = 100, p = 0.95
B. n = 25, p = 0.45
C. n = 150, p = 0.05
D. n = 500, p = 0.015
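The success-failure condition (np ≥ 10 and n(1 - p) ≥ 10) can be checked mechanically for each option. The sketch below does this in Python; the helper name `normal_approx_ok` is hypothetical and used only for illustration.

```python
def normal_approx_ok(n, p, threshold=10):
    """Return True when both np and n(1 - p) are at least the threshold."""
    return n * p >= threshold and n * (1 - p) >= threshold

options = {
    "A": (100, 0.95),
    "B": (25, 0.45),
    "C": (150, 0.05),
    "D": (500, 0.015),
}

results = {label: normal_approx_ok(n, p) for label, (n, p) in options.items()}

for label, (n, p) in options.items():
    print(f"{label}: np = {n * p:.2f}, n(1-p) = {n * (1 - p):.2f}, "
          f"normal approximation OK: {results[label]}")
```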
Practice
A study found that approximately 25% of Facebook users are considered
power users. The same study found that the average Facebook user has 245
friends. What is the probability that the average Facebook user with 245
friends has 70 or more friends who would be considered power users?
Note:
We are given that n = 245, p = 0.25, and we are asked for the
probability P(K ≥70). Assuming independence, we can proceed
as follows …
P(K ≥ 70)
= P(K = 70 or K = 71 or K = 72 or … or K = 245)
= P(K = 70) + P(K = 71) + P(K = 72) + … + P(K = 245)
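Adding 176 binomial terms by hand is impractical, which is why the normal approximation is attractive here. The Python sketch below computes the exact sum and compares it with a plain normal approximation (no continuity correction, which is an assumption of this sketch rather than something the slides specify). The helper `binom_pmf` is illustrative.

```python
from math import comb, sqrt
from statistics import NormalDist

n, p = 245, 0.25

def binom_pmf(k, n, p):
    """Binomial probability of exactly k successes in n trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Exact: P(K = 70) + P(K = 71) + ... + P(K = 245)
p_exact = sum(binom_pmf(k, n, p) for k in range(70, n + 1))

# Normal approximation: K is roughly N(np, sqrt(np(1 - p))).
mean = n * p
sd = sqrt(n * p * (1 - p))
p_normal = 1 - NormalDist(mean, sd).cdf(70)

print(f"mean = {mean:.2f}, sd = {sd:.2f}")
print(f"Exact binomial P(K >= 70): {p_exact:.4f}")
print(f"Normal approximation:      {p_normal:.4f}")
```

The two answers differ somewhat because the normal curve is continuous while the count is discrete; with np = 61.25 and n(1 - p) = 183.75, the success-failure condition is comfortably met.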
• Population: N = 10,000, μ = 173, σ = 30 (plot: positive skew)
Sampling Behavior of Counts and Proportions
• Recall binomials: the count of successes (X) follows a
binomial distribution, X ~ b(n, p)
• X ~ N(np, √(npq)) when normal approximation applies, where q = 1 - p.
Normal Approximation for a Binomial Sample Proportion
µ_p̂ = p and σ_p̂ = √(pq/n)
p̂ ~ N(p, √(pq/n)) when normal approximation applies.
For our example of X~b(100,0.2)
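For the X ~ b(100, 0.2) example, the mean and standard deviation of both the count and the sample proportion follow from the formulas np, √(npq), p, and √(pq/n). A small Python sketch, with illustrative variable names:

```python
from math import sqrt

n, p = 100, 0.2
q = 1 - p

# Success-failure check before using the normal approximation.
approx_ok = n * p >= 10 and n * q >= 10

# Sampling distribution of the count X.
count_mean = n * p
count_sd = sqrt(n * p * q)

# Sampling distribution of the sample proportion p-hat = X / n.
prop_mean = p
prop_sd = sqrt(p * q / n)

print(f"Normal approximation applies: {approx_ok}")
print(f"Count:      mean = {count_mean:.1f}, SD = {count_sd:.1f}")
print(f"Proportion: mean = {prop_mean:.2f}, SD = {prop_sd:.2f}")
```

Dividing the count's mean and SD by n gives the proportion's mean and SD, which is why the two descriptions are interchangeable.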
• The objective of estimation is to predict the value of a
parameter
• There are two forms of estimation:
• Point estimation: e.g., xbar is an unbiased estimator of μ
• Interval estimation: surround the point estimate with a
margin of error to create a confidence interval:
point estimate ± (critical value) × SE
Reasoning Behind the Confidence Interval
Hypothesis Testing
• Also called significance testing
• Power = (1 - β) = probability that a statistical test will be able to
detect a true difference
• Commonly set to 80% for designing a study
One-Sample t Test
A. Hypotheses:
H0: µ = µ0 vs. Ha: µ ≠ µ0 (two-sided)
or Ha: µ < µ0 (left-sided) or Ha: µ > µ0 (right-sided)
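B. Test statistic: $t_{stat} = \dfrac{\bar{x} - \mu_0}{SE_{\bar{x}}}$ where $SE_{\bar{x}} = \dfrac{s}{\sqrt{n}}$, with $df = n - 1$

Paired-sample example:
$t_{stat} = \dfrac{\bar{x}_d - \mu_0}{s_d / \sqrt{n}} = \dfrac{0.38083 - 0}{.4335 / \sqrt{12}} = 3.043$
$df = n - 1 = 12 - 1 = 11$
For 95% confidence use $t_{12-1,\,1-.05/2} = t_{11,\,.975} = 2.201$ (from Table C)
95% CI for $\mu_d$ $= 0.3808 \pm 2.201 \cdot \dfrac{.4335}{\sqrt{12}}$
$= 0.3808 \pm 0.2754$
= (0.105 to 0.656)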
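The arithmetic in this example can be reproduced step by step. The Python sketch below uses only the numbers shown above (mean difference 0.38083, SD .4335, n = 12) and takes the critical value 2.201 as given from Table C, since the standard library has no t distribution; variable names are illustrative.

```python
from math import sqrt

mean_diff = 0.38083   # sample mean of the paired differences
sd_diff = 0.4335      # sample SD of the paired differences
n = 12
mu0 = 0               # null value
t_crit = 2.201        # t(11, .975), taken from Table C as stated above

se = sd_diff / sqrt(n)
t_stat = (mean_diff - mu0) / se
df = n - 1

margin = t_crit * se
ci = (mean_diff - margin, mean_diff + margin)

print(f"SE = {se:.4f}, t_stat = {t_stat:.3f}, df = {df}")
print(f"95% CI: {mean_diff:.4f} +/- {margin:.4f} = ({ci[0]:.3f} to {ci[1]:.3f})")
```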
Independent Samples
• no matching or pairing
Hypothesis Test for Two Independent Samples
Hypotheses. H0: μ1 = μ2
against Ha: μ1 ≠ μ2 (two-sided)
or Ha: μ1 > μ2 (right-sided) Ha: μ1 < μ2 (left-sided)
Test statistic. $t_{stat} = \dfrac{\bar{x}_1 - \bar{x}_2}{SE_{\bar{x}_1 - \bar{x}_2}}$ where $SE_{\bar{x}_1 - \bar{x}_2} = \sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}$
Type of sample    Point estimate    df for t*                       SE
single            x̄                 n – 1                           s/√n
paired            x̄d                nd – 1                          sd/√n
independent       x̄1 – x̄2           smaller of n1 – 1 or n2 – 1     √(s1²/n1 + s2²/n2)
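To see the independent-samples row in action, the sketch below computes the standard error, the conservative df, and the test statistic in Python. The summary statistics are hypothetical values chosen for easy arithmetic; they do not come from the course materials.

```python
from math import sqrt

# Hypothetical summary statistics (illustration only).
xbar1, s1, n1 = 52.0, 8.0, 16
xbar2, s2, n2 = 47.0, 6.0, 9

# SE of the difference in means: sqrt(s1^2/n1 + s2^2/n2)
se_diff = sqrt(s1**2 / n1 + s2**2 / n2)

# Conservative df: the smaller of n1 - 1 and n2 - 1
df = min(n1 - 1, n2 - 1)

# Test statistic for H0: mu1 = mu2
t_stat = (xbar1 - xbar2) / se_diff

print(f"SE = {se_diff:.4f}, df = {df}, t_stat = {t_stat:.3f}")
```

The resulting t statistic would then be compared with the t distribution on `df` degrees of freedom, using a table or software.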
Recap
● Calculate required sample size for a desired level of power
● Calculate power for a range of sample sizes, then choose the sample
size that yields the target power (usually 80% or 90%)
Achieving desired power
There are several ways to increase power (and hence decrease type 2 error rate):
1. Increase the sample size
2. Decrease the standard deviation of the sample, which essentially has the
same effect as increasing the sample size (it will decrease the standard
error). With a smaller s we have a better chance of distinguishing the null
value from the observed point estimate. This is difficult to ensure, but a
careful measurement process and limiting the population so that it is more
homogeneous may help.
3. Increase α, which will make it more likely to reject H0 (but note that this has
the side effect of increasing the Type 1 error rate).
4. Consider a larger effect size. If the true mean of the population is in the
alternative hypothesis but close to the null value, it will be harder to detect
a difference.
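The four levers above can be explored numerically. The sketch below is a simplified illustration, not the course's method: it assumes a two-sided one-sample z test with known SD (a normal approximation rather than a t-based calculation), and the function name `z_test_power` and all input values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def z_test_power(effect, sd, n, alpha=0.05):
    """Approximate power of a two-sided one-sample z test (known SD assumed)."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = effect * sqrt(n) / sd   # true effect measured in standard errors
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)

# Hypothetical baseline: effect of 5 units, SD of 10, n = 32, alpha = 0.05.
baseline = z_test_power(effect=5, sd=10, n=32)

bigger_n = z_test_power(effect=5, sd=10, n=64)                   # lever 1
smaller_sd = z_test_power(effect=5, sd=8, n=32)                  # lever 2
bigger_alpha = z_test_power(effect=5, sd=10, n=32, alpha=0.10)   # lever 3
bigger_effect = z_test_power(effect=7, sd=10, n=32)              # lever 4

# Sample-size search: smallest n that reaches 80% power at the baseline settings.
n_needed = next(n for n in range(2, 1000) if z_test_power(5, 10, n) >= 0.80)

print(f"baseline power: {baseline:.3f}")
print(f"larger n: {bigger_n:.3f}, smaller SD: {smaller_sd:.3f}")
print(f"larger alpha: {bigger_alpha:.3f}, larger effect: {bigger_effect:.3f}")
print(f"smallest n for 80% power: {n_needed}")
```

Each lever raises power relative to the baseline, and the search at the end mirrors the recap's approach of computing power over a range of sample sizes and picking the first that reaches the target.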
2-by-2 Cross-Tabulation
p̂1 = a1/n1 and p̂2 = a2/n2
Note: If p is unknown (most cases), we use p̂ for calculating the standard error.
Two-sample test
• Hypotheses. H0: p1 = p2 against Ha:p1 ≠ p2
[Or one-sided alternatives: Ha: p1 > p2 or Ha: p1 < p2]
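• Test statistic.
$z_{stat} = \dfrac{\hat{p}_1 - \hat{p}_2}{\sqrt{\bar{p}\,\bar{q}\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$ where
$\bar{p} = \dfrac{\text{no. of successes, both groups combined}}{\text{total observations, both groups combined}}$ and $\bar{q} = 1 - \bar{p}$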
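The pooled two-proportion z test can be sketched in a few lines of Python. The counts below are hypothetical (30 of 100 successes versus 20 of 100) and serve only to show how the pooled proportion feeds the standard error; variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical counts (illustration only): successes and group sizes.
a1, n1 = 30, 100
a2, n2 = 20, 100

p_hat1 = a1 / n1
p_hat2 = a2 / n2

# Pooled proportion: successes in both groups over total observations.
p_bar = (a1 + a2) / (n1 + n2)
q_bar = 1 - p_bar

se = sqrt(p_bar * q_bar * (1 / n1 + 1 / n2))
z_stat = (p_hat1 - p_hat2) / se

# Two-sided P-value from the standard normal distribution.
p_value = 2 * (1 - NormalDist().cdf(abs(z_stat)))

print(f"p_bar = {p_bar:.3f}, SE = {se:.4f}")
print(f"z_stat = {z_stat:.3f}, two-sided P = {p_value:.4f}")
```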
B. Test statistic: $\chi^2_{stat} = \sum_{\text{all cells}} \dfrac{(O_i - E_i)^2}{E_i}$
with $df = (R - 1)(C - 1)$
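The chi-square statistic needs an expected count for every cell, computed as (row total × column total) / grand total. The Python sketch below works through a hypothetical 2-by-2 table; the counts are made up for illustration.

```python
# Hypothetical observed counts (rows = groups, columns = outcome yes / no).
observed = [
    [30, 70],
    [20, 80],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count for each cell under H0 (no association).
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Sum (O - E)^2 / E over all cells.
chi2_stat = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

df = (len(observed) - 1) * (len(observed[0]) - 1)

print(f"expected counts: {expected}")
print(f"chi-square = {chi2_stat:.4f}, df = {df}")
```

For a 2-by-2 table, the chi-square statistic equals the square of the pooled two-proportion z statistic computed from the same counts.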
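Correlation coefficient: $r = \dfrac{1}{n - 1} \sum \left(\dfrac{x - \bar{x}}{s_x}\right)\left(\dfrac{y - \bar{y}}{s_y}\right)$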
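The correlation coefficient r is the average product of paired z-scores, with n - 1 in the denominator. The Python sketch below applies that definition to a small hypothetical data set; the data and the function name `correlation` are illustrative only.

```python
from statistics import mean, stdev

# Hypothetical paired data (illustration only).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

def correlation(x, y):
    """Pearson r as the sum of z-score products divided by n - 1."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    s_x, s_y = stdev(x), stdev(y)   # sample SDs (n - 1 denominator)
    products = (
        ((xi - x_bar) / s_x) * ((yi - y_bar) / s_y)
        for xi, yi in zip(x, y)
    )
    return sum(products) / (n - 1)

r = correlation(x, y)
print(f"r = {r:.4f}")
```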
Simple linear regression model
• The population model: Y = α + βx + ε
• Y - dependent variable
• α - population y intercept
• β - population slope coefficient
• ε - random error, or residual
• The sample regression line provides an estimate of the
population regression line: ŷ = a + bx
• ŷ = estimated (or predicted) y value (dependent variable)
• a = estimated intercept = average y when x value is zero
• b = estimated slope = change in the average y value for a
one-unit difference in x
Estimating the slope and intercept
slope $b = r\,\dfrac{s_y}{s_x}$ and intercept $a = \bar{y} - b\,\bar{x}$
H0: β = 0 vs. Ha: β ≠ 0
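The slope and intercept of the least-squares line can be estimated from summary statistics alone. The Python sketch below assumes the standard formulas (slope = r·s_y/s_x, intercept = ȳ - slope·x̄) and applies them to a small hypothetical data set; the data and variable names are illustrative.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical paired data (illustration only).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

x_bar, y_bar = mean(x), mean(y)
s_x, s_y = stdev(x), stdev(y)

# Pearson correlation from sums of squares and cross-products.
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)
r = s_xy / sqrt(s_xx * s_yy)

slope = r * s_y / s_x                # estimated slope
intercept = y_bar - slope * x_bar    # estimated intercept

def predict(x_value):
    """Predicted y on the fitted line for a given x."""
    return intercept + slope * x_value

print(f"r = {r:.4f}, slope = {slope:.2f}, intercept = {intercept:.2f}")
print(f"predicted y at x = 3: {predict(3):.2f}")
```

The fitted line always passes through (x̄, ȳ), so predicting at x = x̄ returns ȳ. Testing H0: β = 0 would additionally require the standard error of the slope, which this sketch does not compute.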
Confounding overview