Data Analytics - Notes
Sampling error :
The difference between a point estimate and the true population parameter, which
arises because the estimate is computed from a random sample rather than from the
whole population
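To make the definition concrete, here is a minimal sketch that measures the sampling error of a sample mean. The population (normal with mean 50 and standard deviation 10), the sample size, and the seed are all made-up values chosen for illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical population: 100,000 values drawn from N(50, 10^2).
population = [random.gauss(50, 10) for _ in range(100_000)]
true_mean = statistics.fmean(population)  # the population parameter

# Point estimate computed from a random sample of size n.
n = 100
sample = random.sample(population, n)
point_estimate = statistics.fmean(sample)

# Sampling error = point estimate - true population parameter.
sampling_error = point_estimate - true_mean

print(f"true mean      = {true_mean:.3f}")
print(f"point estimate = {point_estimate:.3f}")
print(f"sampling error = {sampling_error:+.3f}")
```

Drawing a different sample gives a different sampling error, and larger samples tend to give smaller errors.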
Confidence interval :
We want to construct a 95% interval for µ
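From the CLT :
$$\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right) \iff \bar{X} - \mu \sim N\left(0, \frac{\sigma^2}{n}\right) \iff \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1)$$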
Then, we find z such that :
$$P\left(-z < \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} < z\right) = 0.95$$
Since the normal distribution is symmetric, this is the same as
$$P\left(\frac{\bar{X} - \mu}{\sigma/\sqrt{n}} > z\right) = 0.025$$
We find that z=1.96
So :
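$$P\left(-1.96 < \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} < 1.96\right) = 0.95 \iff P\left(-1.96\frac{\sigma}{\sqrt{n}} < \bar{X} - \mu < 1.96\frac{\sigma}{\sqrt{n}}\right) = 0.95$$
$$\iff P\left(-\bar{X} - 1.96\frac{\sigma}{\sqrt{n}} < -\mu < -\bar{X} + 1.96\frac{\sigma}{\sqrt{n}}\right) = 0.95$$
$$\iff P\left(\bar{X} - 1.96\frac{\sigma}{\sqrt{n}} < \mu < \bar{X} + 1.96\frac{\sigma}{\sqrt{n}}\right) = 0.95$$
So the confidence interval is $\left(\bar{x} - 1.96\frac{\sigma}{\sqrt{n}},\ \bar{x} + 1.96\frac{\sigma}{\sqrt{n}}\right)$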
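The 95% refers to the procedure: over many repeated samples, about 95% of the intervals built this way contain µ. The simulation below is a rough sketch of that idea. The values of µ, σ, n and the number of trials are arbitrary assumptions, and the data are generated as normal so that the result holds for any n.

```python
import math
import random
import statistics

def coverage(mu, sigma, n, trials, seed=0):
    """Fraction of simulated 95% intervals that contain the true mean mu."""
    rng = random.Random(seed)
    half_width = 1.96 * sigma / math.sqrt(n)  # sigma is assumed known
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        xbar = statistics.fmean(sample)
        # Does this interval capture the true mean?
        if xbar - half_width < mu < xbar + half_width:
            hits += 1
    return hits / trials

# Hypothetical setting: mu = 10, sigma = 2, n = 30.
print(coverage(mu=10.0, sigma=2.0, n=30, trials=10_000))
```

The printed fraction should land close to 0.95, with some run-to-run variation if the seed is changed.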
For α ∈ (0, 1), 1-α is the confidence level and the confidence interval will be :
$$\left(\bar{x} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}},\ \bar{x} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right)$$
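As a small sketch of the general formula, the function below computes the critical value with `statistics.NormalDist().inv_cdf` from the Python standard library and returns the interval. The function name `z_interval` and the numbers in the usage lines are illustrative assumptions, not taken from the notes.

```python
import math
from statistics import NormalDist

def z_interval(xbar, sigma, n, alpha=0.05):
    """(1 - alpha) confidence interval for mu when sigma is known."""
    # z_{alpha/2}: the point with upper-tail probability alpha/2 under N(0, 1).
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z * sigma / math.sqrt(n)
    return xbar - half_width, xbar + half_width

# Hypothetical numbers: sample mean 52, known sigma 10, n = 100.
low, high = z_interval(xbar=52.0, sigma=10.0, n=100, alpha=0.05)
print(f"95% CI: ({low:.2f}, {high:.2f})")

low99, high99 = z_interval(xbar=52.0, sigma=10.0, n=100, alpha=0.01)
print(f"99% CI: ({low99:.2f}, {high99:.2f})")
```

A smaller α (higher confidence level) gives a larger critical value and therefore a wider interval.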
What if we don’t know the population variance $\sigma^2$ ?
We use the t-statistic :
$$I = \left(\bar{x} - t_{\alpha/2,\,n-1}\frac{s}{\sqrt{n}},\ \bar{x} + t_{\alpha/2,\,n-1}\frac{s}{\sqrt{n}}\right)$$
Here, the t-distribution has n-1 degrees of freedom, and $s^2$ is the sample variance,
used as the estimate of the population variance
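A minimal sketch of the t-based interval follows. The Python standard library has no t-distribution quantile function, so the critical value is passed in as a number read from a t-table. The sample data are made up, and 2.262 is the table value for a 95% interval with 9 degrees of freedom.

```python
import math
import statistics

def t_interval(sample, t_crit):
    """Confidence interval for mu when sigma is unknown.

    t_crit is t_{alpha/2, n-1}, supplied by the caller (e.g. from a t-table).
    """
    n = len(sample)
    xbar = statistics.fmean(sample)
    s = statistics.stdev(sample)  # sample SD, uses the n-1 denominator
    half_width = t_crit * s / math.sqrt(n)
    return xbar - half_width, xbar + half_width

# Hypothetical sample of n = 10 measurements.
sample = [4.8, 5.1, 5.0, 4.7, 5.3, 5.2, 4.9, 5.0, 5.4, 4.6]
T_CRIT_95_DF9 = 2.262  # t_{0.025, 9} from a standard t-table

low, high = t_interval(sample, T_CRIT_95_DF9)
print(f"95% CI: ({low:.3f}, {high:.3f})")
```

For the same data, the t critical value is larger than 1.96, so the interval is wider than the z-based one. This reflects the extra uncertainty from estimating σ.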
In terms of math :
𝐻0 : µ𝐴 = µ𝐵
𝐻1 : µ𝐴 ≠ µ𝐵
The Null Hypothesis always represents the default position, or the assumption of
“no effect”, “no difference”
Possible conclusions :
- We reject 𝐻0
- We fail to reject 𝐻0
Netflix example :
- Under 𝐻0, we have µ𝐴 = µ𝐵
- If 𝐻0 is true, we’re likely to observe a difference in sample means close to 0
- If 𝐻0 is true, we’re unlikely to observe a difference in sample means far from 0
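The simulation below sketches this reasoning: when both groups share the same mean, the difference in sample means concentrates near 0. The group mean (30), standard deviation (5), group size (200) and number of repetitions are invented for illustration and are not the figures of the Netflix example.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed for reproducibility

def mean_difference(mu_a, mu_b, sigma, n):
    """Difference in sample means for one simulated experiment."""
    a = [rng.gauss(mu_a, sigma) for _ in range(n)]
    b = [rng.gauss(mu_b, sigma) for _ in range(n)]
    return statistics.fmean(a) - statistics.fmean(b)

# Simulate 2,000 experiments in which H0 is true (mu_A = mu_B = 30).
diffs = [mean_difference(30.0, 30.0, 5.0, 200) for _ in range(2000)]

# The SD of the difference is sqrt(2 * 5^2 / 200) = 0.5,
# so |difference| < 1.0 means "within two standard deviations of 0".
share_near_zero = sum(abs(d) < 1.0 for d in diffs) / len(diffs)

print(f"average difference under H0: {statistics.fmean(diffs):+.3f}")
print(f"share of differences within 1.0 of zero: {share_near_zero:.3f}")
```

An observed difference far outside this range would be surprising if 𝐻0 were true, which is the basis for rejecting it.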
Defs :
Power :
1-β, with β = P(Type II error) = P(fail to reject 𝐻0 | 𝐻0 is false).
Power is the probability of correctly rejecting 𝐻0 when it is false.
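As an illustration, power can be computed in closed form for a two-sided one-sample z-test with known σ. This is a sketch under those specific assumptions. The function name `z_test_power` and the parameter `delta` (the true mean minus the hypothesised mean) are introduced here and do not come from the notes.

```python
import math
from statistics import NormalDist

def z_test_power(delta, sigma, n, alpha=0.05):
    """Power of a two-sided one-sample z-test when the true mean is mu0 + delta."""
    std_normal = NormalDist()
    z = std_normal.inv_cdf(1 - alpha / 2)
    # Under the alternative, the test statistic is N(shift, 1).
    shift = delta * math.sqrt(n) / sigma
    # beta = P(fail to reject H0) = P(-z < statistic < z).
    beta = std_normal.cdf(z - shift) - std_normal.cdf(-z - shift)
    return 1 - beta

# Hypothetical effect of half a standard deviation, for a few sample sizes.
for n in (10, 25, 50):
    print(n, round(z_test_power(delta=0.5, sigma=1.0, n=n), 3))
```

Power grows with the sample size and with the size of the true effect. When `delta` is 0, the function returns α, the probability of a Type I error.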
Types of test :
Test statistics :
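Test statistic = $\frac{\hat{\theta}(X) - \theta_0}{SD(\hat{\theta}(X))}$, where $\hat{\theta}(X)$ is the estimator of the
parameter θ, $\theta_0$ is the value of θ under 𝐻0, and $SD(\hat{\theta}(X))$ is the standard
deviation of the estimator
Observed test statistic = $\frac{\hat{\theta}(x) - \theta_0}{SD(\hat{\theta}(X))}$, where $\hat{\theta}(x)$ is the estimate
computed from the observed sample x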
Test statistic depends on whether we have one or two parameters and on whether
we know the population variance
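For the one-parameter case where θ is a mean, the observed test statistic can be sketched as below. The function name `observed_statistic` and the data are hypothetical. When `sigma` is given, the result is a z-statistic. When it is omitted, the sample standard deviation is used and the result is a t-statistic with n-1 degrees of freedom.

```python
import math
import statistics

def observed_statistic(sample, theta0, sigma=None):
    """(estimate - theta0) / SD(estimator) for a single mean."""
    n = len(sample)
    estimate = statistics.fmean(sample)
    # Known population SD -> z-statistic; otherwise estimate it -> t-statistic.
    sd = sigma if sigma is not None else statistics.stdev(sample)
    return (estimate - theta0) / (sd / math.sqrt(n))

# Hypothetical sample, testing H0: mu = 4.8.
sample = [4.8, 5.1, 5.0, 4.7, 5.3, 5.2, 4.9, 5.0, 5.4, 4.6]

z_obs = observed_statistic(sample, theta0=4.8, sigma=0.3)  # variance known
t_obs = observed_statistic(sample, theta0=4.8)             # variance unknown
print(f"z = {z_obs:.3f}, t = {t_obs:.3f}")
```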
5 steps of hypothesis testing :
- State the null and alternative hypotheses
- Choose a test and a significance level
- Compute the observed test statistic
- Calculate the p-value
- Make a statistical decision - interpret the results
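The five steps can be sketched end to end for 𝐻0 : µ𝐴 = µ𝐵 against 𝐻1 : µ𝐴 ≠ µ𝐵. The sketch assumes a large-sample two-sided z-test in which the sample variances stand in for the population variances. The data are simulated and the significance level of 0.05 is an arbitrary choice.

```python
import math
import random
import statistics
from statistics import NormalDist

# Step 1: H0: mu_A = mu_B   vs   H1: mu_A != mu_B
# Step 2: two-sided two-sample z-test (large samples), alpha = 0.05
ALPHA = 0.05

def two_sample_z_test(a, b):
    """Return (observed z, two-sided p-value) for the difference in means."""
    se = math.sqrt(statistics.variance(a) / len(a)
                   + statistics.variance(b) / len(b))
    # Step 3: observed test statistic (hypothesised difference is 0).
    z_obs = (statistics.fmean(a) - statistics.fmean(b)) / se
    # Step 4: p-value = P(|Z| >= |z_obs|) under H0.
    p_value = 2 * (1 - NormalDist().cdf(abs(z_obs)))
    return z_obs, p_value

# Hypothetical data for two groups.
rng = random.Random(42)
group_a = [rng.gauss(30.0, 5.0) for _ in range(500)]
group_b = [rng.gauss(31.0, 5.0) for _ in range(500)]

z_obs, p_value = two_sample_z_test(group_a, group_b)

# Step 5: statistical decision.
decision = "reject H0" if p_value < ALPHA else "fail to reject H0"
print(f"z = {z_obs:.3f}, p-value = {p_value:.4f} -> {decision}")
```

With small samples or unknown variances, a t-test would usually replace the z-test in step 2. The other steps stay the same.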