Lecture 03 Probability and Statistics Review Part2
Data Analytics
Lecture 4
Probability & Statistics Review
Part 2
Sampling Methods
Introduction
Sampling
Probabilistic Sampling
Probabilistic Sampling Methods
1. Simple random sampling
   • with replacement
   • without replacement
2. Systematic Sampling
3. Stratified Sampling
   • Proportionate
   • Disproportionate
4. Cluster Sampling
   • One stage
5. Multistage Sampling
   • Two stage
Systematic Sampling
Measures of Location and Variability
Location    Variability
Mean        Range
Mode        Variance
Variability Measures
The Range
• The difference between the largest and the smallest data values.
The Interquartile Range
• The difference between the third quartile and the first quartile.
• It is the range for the middle 50% of the data.
• It overcomes the sensitivity to extreme data values.
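The quartile-based measure in the bullets above (the interquartile range) can be compared with the plain range in a short sketch. The data set below is hypothetical and not from the lecture, and quartile conventions differ between textbooks and software, so the exact IQR value depends on the method used (here, the default of Python's `statistics.quantiles`).

```python
import statistics

# Hypothetical data with one extreme value (illustration only, not lecture data).
data = [2, 4, 4, 5, 6, 7, 8, 9, 10, 95]

# Range: largest value minus smallest value; dominated by the extreme value 95.
data_range = max(data) - min(data)

# quantiles(..., n=4) returns the three quartile cut points Q1, Q2 (median), Q3.
# The library's default "exclusive" method is used; other conventions give
# slightly different quartiles.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

print("range:", data_range)
print("Q1, Q3:", q1, q3)
print("IQR:", iqr)
```

The single extreme value inflates the range, while the IQR only reflects the spread of the middle half of the data.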
Coefficient of variation
The Empirical Rule
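For approximately bell-shaped (normal) data:
• About 68% of the values lie within 1 SD of the mean
• About 95% of the values lie within 2 SD of the mean
• About 99.7% of the values lie within 3 SD of the mean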
Sampling Distribution & Estimation
Sampling Distribution Definition
• A sampling distribution is the distribution of all possible values of a sample statistic for
samples of a given size selected from a population.
• For example, suppose we sample 50 students from a university and compute their mean GPA.
• If we drew many different samples of 50, we would compute a different mean for each
sample.
• We are interested in the distribution of all the mean GPAs we might calculate for any
given sample of 50 students.
Sampling Distribution: Example
❖ Assume there is a population …
➢ Population size N = 4
➢ Random variable X is the age of individuals
➢ Values of X: 18, 20, 22, 24 (years)
Summary Measures for the Population Distribution:
$\mu = \frac{\sum_i X_i}{N} = \frac{18 + 20 + 22 + 24}{4} = 21$
$\sigma = \sqrt{\frac{\sum_i (X_i - \mu)^2}{N}} = 2.236$
Sampling Distribution
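❖ Sampling Distribution of All Sample Means (here N = 16, the number of sample means)
$\mu_{\bar{X}} = \frac{\sum_i \bar{X}_i}{N} = \frac{18 + 19 + 19 + \cdots + 24}{16} = 21$
$\sigma_{\bar{X}} = \sqrt{\frac{\sum_i (\bar{X}_i - \mu_{\bar{X}})^2}{N}}$
$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$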
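The example can be checked by brute force. The sketch below assumes the 16 sample means come from all ordered samples of size n = 2 drawn with replacement from the four ages (4² = 16 matches the divisor 16 above; the sample size itself is an assumption, since it is not stated in the surviving slide text).

```python
import itertools
import math
import statistics

population = [18, 20, 22, 24]  # ages from the example
n = 2  # ASSUMED sample size: 4**2 = 16 samples matches the divisor 16

# Population mean and (population) standard deviation.
mu = statistics.fmean(population)
sigma = statistics.pstdev(population)

# Every ordered sample of size n drawn with replacement, and its mean.
sample_means = [
    statistics.fmean(sample)
    for sample in itertools.product(population, repeat=n)
]

# Mean and standard deviation of the sampling distribution of the mean.
mu_xbar = statistics.fmean(sample_means)
sigma_xbar = statistics.pstdev(sample_means)

print("number of samples:", len(sample_means))
print("mu, sigma:", mu, round(sigma, 3))
print("mu_xbar, sigma_xbar:", mu_xbar, round(sigma_xbar, 3))
print("sigma / sqrt(n):", round(sigma / math.sqrt(n), 3))
```

Under that assumption, the mean of the sample means equals the population mean, and their standard deviation matches σ/√n.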
Sample Mean Sampling Distribution:
If the Population is Normal
❖ If a population is normal with mean μ and standard deviation σ, the sampling
distribution of $\bar{X}$ is also normally distributed with
$\mu_{\bar{X}} = \mu$
$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$
Sampling Distribution Properties
❖ $\mu_{\bar{X}} = \mu$
[Figure: normal sampling distribution of $\bar{x}$, centered at $\mu_{\bar{x}} = \mu$]
Sampling Distribution Properties
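❖ As n increases, $\sigma_{\bar{x}}$ decreases
[Figure: sampling distributions of $\bar{x}$ centered at μ for a smaller and a larger sample size; the larger sample size gives the narrower curve]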
Sample Mean Sampling Distribution:
If the Population is not Normal
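❖ Population distribution:
[Figure: non-normal population distribution with mean μ]
❖ Sampling distribution (becomes normal as n increases):
[Figure: sampling distributions of $\bar{x}$ centered at μ for a smaller and a larger sample size]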
Central limit theorem
As the sample size gets large enough,
the sampling distribution of the sample mean becomes almost normal,
regardless of the shape of the population.
Central limit theorem
If the population follows a normal probability distribution, then for any sample size
the sampling distribution of the sample mean will also be normal.
If the population distribution is symmetrical (but not normal), the normal shape of
the distribution of the sample mean emerges with samples as small as 10.
The mean of the sampling distribution is equal to μ and the variance is equal to σ²/n.
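A small simulation can illustrate these statements. The sketch below is only an illustration under stated assumptions: the population is taken to be exponential with rate 1 (mean 1, variance 1, strongly right-skewed), and the sample sizes, number of repetitions, and seed are arbitrary choices, none of which come from the lecture.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable

def sample_mean(n):
    # ASSUMED population: exponential with rate 1 (mean 1, variance 1).
    return statistics.fmean(rng.expovariate(1.0) for _ in range(n))

results = {}
for n in (2, 10, 30):
    # Draw 5000 independent samples of size n and record each sample mean.
    means = [sample_mean(n) for _ in range(5000)]
    results[n] = (statistics.fmean(means), statistics.pvariance(means))
    print(
        f"n={n:2d}  mean of sample means={results[n][0]:.3f}  "
        f"variance={results[n][1]:.4f}  theory sigma^2/n={1 / n:.4f}"
    )
```

For each n the simulated mean of the sample means should sit near μ = 1 and the simulated variance near σ²/n; plotting a histogram of `means` for each n would also show the shape approaching the normal curve.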
Estimation
Distinctions Between Parameters and Statistics
             Parameters   Statistics
Vary         No           Yes
Calculated   No           Yes
Estimation
❖ Estimate
• Point Estimator
• Interval Estimator
Point Estimator Vs. Interval Estimator
[Figure: a point estimate compared with an interval estimate]
• A good estimator is one that is close to the true value of the parameter.
• If there are two unbiased estimators of a parameter, the one whose variance is
smaller is said to be relatively efficient.
• An estimator is sufficient if no other statistic that can be calculated from the
same sample provides any additional information as to the value of the parameter.
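Relative efficiency can be illustrated with a simulation. The sketch below compares two estimators of the center of a normal population, the sample mean and the sample median; the choice of estimators, the standard normal population, the sample size, and the seed are all assumptions made for illustration and are not taken from the lecture.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed so the run is repeatable
mu, sigma = 0.0, 1.0    # ASSUMED normal population
n, reps = 25, 4000      # ASSUMED sample size and number of repetitions

means, medians = [], []
for _ in range(reps):
    sample = [rng.gauss(mu, sigma) for _ in range(n)]
    means.append(statistics.fmean(sample))
    medians.append(statistics.median(sample))

# Both estimators are centered on mu for a symmetric population,
# so the comparison comes down to their variances.
var_mean = statistics.pvariance(means)
var_median = statistics.pvariance(medians)

print("average of sample means:  ", round(statistics.fmean(means), 3))
print("average of sample medians:", round(statistics.fmean(medians), 3))
print("variance of sample means:  ", round(var_mean, 4))
print("variance of sample medians:", round(var_median, 4))
```

For normal data the sample mean shows the smaller variance, so in the sense of the bullet above it is the relatively efficient estimator of the two.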
Confidence Interval Estimator for 𝜇: The population mean
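❖ Since $\bar{X} \sim N\left(\mu, \frac{\sigma}{\sqrt{n}}\right)$ (mean μ, standard deviation σ/√n)
▪ → $P\left(-Z_{\alpha/2} < \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} < Z_{\alpha/2}\right) = 1 - \alpha$
Where: $P(Z < -Z_{\alpha/2}) = \frac{\alpha}{2}$ and $P(Z > Z_{\alpha/2}) = \frac{\alpha}{2}$
▪ → $P\left(\bar{X} - Z_{\alpha/2}\frac{\sigma}{\sqrt{n}} < \mu < \bar{X} + Z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right) = 1 - \alpha$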
• Contrast this with: a 95% confidence interval estimate of starting salaries between $42,000
and $45,000.
• The second estimate is much narrower, providing accounting students more precise
information about starting salaries.
Interval Width…
• A computer company samples demand during lead time over 25 time periods:
• It is known that the standard deviation of demand over lead time is 75 computers.
• We want to estimate the mean demand over lead time with 95% confidence in order
to set inventory levels…
Example
❖ “We want to estimate the mean demand over lead time with 95% confidence in
order to set inventory levels…”
Given: σ = 75, n = 25
• The lower and upper confidence limits are
• Lower Bound: $370.16 - 1.96 \cdot \frac{75}{\sqrt{25}} = 340.76$
• Upper Bound: $370.16 + 1.96 \cdot \frac{75}{\sqrt{25}} = 399.56$
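The two bounds can be reproduced with a few lines of code. This is a minimal sketch of a z-based interval for a known σ; the function name `z_confidence_interval` is hypothetical, and 370.16 is taken to be the sample mean, which is how it is used in the bounds above.

```python
import math
from statistics import NormalDist

def z_confidence_interval(x_bar, sigma, n, confidence=0.95):
    """Two-sided confidence interval for the mean when sigma is known."""
    alpha = 1 - confidence
    # Critical value z_{alpha/2}; about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1 - alpha / 2)
    margin = z * sigma / math.sqrt(n)
    return x_bar - margin, x_bar + margin

# Values from the example: sigma = 75, n = 25, and 370.16 used as the sample mean.
lower, upper = z_confidence_interval(370.16, 75, 25)
print(round(lower, 2), round(upper, 2))
```

Calling the same function with a larger n or a lower confidence level gives a narrower interval, which ties in with the interval-width discussion.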
Hypothesis Testing
What is a Hypothesis?
❖Accept H0 if the test statistic value falls within the area of acceptance.
Reject otherwise.
Errors in Hypothesis Testing
[Figure: area of acceptance]
Case 1: Illustrative Example: “Body Weight”
• The problem: In the 1970s, 20–29-year-old men in the U.S. had a mean body
weight μ of 170 pounds. The standard deviation σ was 40 pounds.
• We want to test whether the mean body weight in the population nowadays
(i.e., in 2016) differs.
• We take a sample of 64 individuals and compute the mean body weight as 173.
What is your conclusion regarding the test?
“Body Weight” Hypothesis Testing
❖ The value 170 (i.e., the value of µ under H0) lies inside the confidence
interval, so it is a possible value of µ and we therefore accept H0.
Case 2: Illustrative Example: “Body Weight”
• The problem: In the 1970s, 20–29-year-old men in the U.S. had a mean body
weight μ of 170 pounds. The standard deviation σ was 40 pounds.
• We want to test whether the mean body weight in the population now differs.
• We take another sample of 64 individuals and compute the mean body weight
as 185. What is your conclusion regarding the test?
Case2: z statistic Method
$z_{\text{stat}} = \frac{\bar{x} - \mu_0}{SE_{\bar{x}}} = \frac{185 - 170}{5} = 3.00$
• The value 170 (i.e., the value of µ under H0) is not a possible value of µ at the 95% confidence
level (it lies outside the CI), and we therefore reject H0.
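Both body-weight cases can be worked through with the same few steps. The sketch below is an illustration rather than the lecture's own code: the function name `z_test` is hypothetical, and a 5% significance level is assumed for both cases (the slides state 95% explicitly only for Case 2).

```python
import math
from statistics import NormalDist

def z_test(x_bar, mu0, sigma, n, alpha=0.05):
    """Two-sided z test of H0: mu = mu0 with known sigma."""
    se = sigma / math.sqrt(n)              # standard error of the mean
    z_stat = (x_bar - mu0) / se
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    # Confidence interval around the sample mean (the CI method).
    ci = (x_bar - z_crit * se, x_bar + z_crit * se)
    reject = abs(z_stat) > z_crit          # the z statistic method
    return z_stat, ci, reject

# mu0 = 170, sigma = 40, n = 64 come from the example; alpha = 0.05 is ASSUMED.
case1 = z_test(173, 170, 40, 64)
case2 = z_test(185, 170, 40, 64)

for name, (z_stat, ci, reject) in (("Case 1", case1), ("Case 2", case2)):
    decision = "reject H0" if reject else "accept H0"
    print(name, "z =", round(z_stat, 2), "CI =", tuple(round(b, 1) for b in ci), decision)
```

The two methods agree: 170 falls inside the interval exactly when the absolute z statistic is below the critical value.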
p-value approach to testing
• p-value = P(Z ≥ 2.75) + P(Z ≤ −2.75) = 2 × 0.003 = 0.006
p-value approach to testing
❖The p-value is also called the observed level of significance
▪H0 can be rejected if the p-value is less than 𝛼 (with 𝛼 being your level of
significance)
▪ No preset value of 𝛼 is required to compute the p-value
▪ The p-value is the smallest value of 𝛼 for which the null hypothesis can be rejected
• E.g., p-value = 0.038 is the smallest value of 𝛼 for which H0 can be rejected.
• H0 is rejected with 𝛼 = 0.05 but not with 𝛼 = 0.01.
o 0.05 > 0.038 → Reject H0
o 0.01 < 0.038 → Accept H0
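The two-sided p-value and the decision rule can be sketched as follows. The function names are hypothetical, and the wording "accept H0" simply mirrors the slides.

```python
from statistics import NormalDist

def two_sided_p_value(z_stat):
    """P(Z >= |z|) + P(Z <= -|z|) for a standard normal Z."""
    return 2 * (1 - NormalDist().cdf(abs(z_stat)))

def decide(p_value, alpha):
    """Reject H0 when the p-value is below the chosen significance level."""
    return "reject H0" if p_value < alpha else "accept H0"

# z = 2.75 as in the p-value calculation above.
p = two_sided_p_value(2.75)
print("p-value for z = 2.75:", round(p, 3))

# The p-value 0.038 from the example, checked against two significance levels.
print("alpha = 0.05:", decide(0.038, 0.05))
print("alpha = 0.01:", decide(0.038, 0.01))
```

The same p-value leads to different decisions depending on the significance level chosen, which is why reporting the p-value itself is more informative than reporting only the decision.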
[Figures: t-distributions for increasing numbers of observations n, with df = n − 1]
• n = 2: t-distribution with 1 degree of freedom
• n = 5: t-distribution with 4 degrees of freedom
• n = 10: t-distribution with 9 degrees of freedom
• n = 30: t-distribution with 29 degrees of freedom
• n = 100: t-distribution with 99 degrees of freedom. Looks a lot like Z!
Student’s t Distribution
Note: t → Z as n increases
t-distributions are bell-shaped and symmetric, but have ‘fatter’ tails than the normal
[Figure: standard normal (t with df = ∞) compared with t (df = 13) and t (df = 5), all centered at 0]
Student’s t Table
Let: n = 3, so df = n − 1 = 2
Df in the rows
The body of the table contains t values, not probabilities
[Figure: t distribution with df = 2 and the upper-tail area to the right of t = 2.920] P(t > 2.920) = 0.05
t distribution values
Confidence Level   t (10 d.f.)   t (20 d.f.)   t (30 d.f.)   Z
Note: t → Z as n increases
The T probability density function
What does t look like mathematically? (You may at least recognize some
resemblance to the normal distribution function…)
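The density formula itself did not survive on this slide. The sketch below uses the standard Student's t density, f(t) = Γ((ν+1)/2) / (√(νπ) Γ(ν/2)) · (1 + t²/ν)^(−(ν+1)/2) with ν degrees of freedom, to compare t with the standard normal and to check one table value numerically. The function names and the simple trapezoid integration are illustrative choices, not the lecture's method.

```python
import math
from statistics import NormalDist

def t_pdf(t, df):
    """Standard Student's t density with df degrees of freedom."""
    # lgamma avoids overflow of the gamma function for large df.
    log_coef = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    coef = math.exp(log_coef) / math.sqrt(df * math.pi)
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)

def upper_tail(c, df, steps=20000):
    """P(T > c) for c > 0: by symmetry, 0.5 minus the area from 0 to c."""
    h = c / steps
    area = 0.5 * (t_pdf(0.0, df) + t_pdf(c, df))
    area += sum(t_pdf(i * h, df) for i in range(1, steps))
    return 0.5 - area * h  # trapezoid rule approximation

z = NormalDist()
for df in (2, 5, 13, 30, 100):
    print(f"df={df:3d}  f(0)={t_pdf(0.0, df):.4f}  f(3)={t_pdf(3.0, df):.4f}")
print(f"normal  f(0)={z.pdf(0.0):.4f}  f(3)={z.pdf(3.0):.4f}")

# Table check for df = 2.
print("P(T > 2.920), df = 2:", round(upper_tail(2.920, 2), 3))
```

As df grows, the peak at 0 rises toward the normal value and the density at t = 3 shrinks toward it, which is the "fatter tails" picture expressed in numbers.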