Advanced Statistics
Hypothesis Testing
Hypothesis testing is a fundamental statistical method used to assess claims or hypotheses about populations based
on sample data. It involves two key hypotheses:
Null Hypothesis (H0): This is the default assumption or status quo. It suggests that there is no significant
effect, difference, or relationship in the population.
Alternative Hypothesis (Ha): This contradicts the null hypothesis and suggests that there is a significant
effect, difference, or relationship in the population.
Example
Scenario: Imagine you are a teacher, and you believe that giving students a snack before a test will improve their
performance. However, your friend suggests that snacks might not make a difference.
Hypotheses:
Null Hypothesis (H0): Giving students a snack has no effect on test performance.
Alternative Hypothesis (Ha): Giving students a snack improves test performance.
Testing: To test this, you decide to conduct a small experiment. You randomly select two groups of students: one
group receives a snack before the test, and the other does not. After the test, you compare the average scores
between the two groups.
Results: If the group that received snacks shows a significantly higher average score, you might reject the null
hypothesis in favor of the alternative hypothesis, concluding that snacks do have an impact on test performance.
Conclusion: If there’s no significant difference in the average scores between the two groups, you would fail to reject
the null hypothesis, suggesting that snacks may not have a noticeable effect on test performance.
Example:
Suppose the coffee shop owner collects data on the wait times for a random sample of 30 customers, and the wait
times (in minutes) are as follows:
4.2, 3.8, 3.5, 4.0, 4.5, 3.3, 4.2, 4.1, 3.9, 4.3, 3.7, 4.0, 4.2, 3.8, 3.6, 4.1, 3.9, 4.4, 4.0, 4.2, 3.7, 3.9, 4.0, 4.1, 4.3, 3.6, 3.9, 4.0, 4.2
Hypotheses:
Null Hypothesis (H0): The average wait time is 3 minutes (μ = 3).
Alternative Hypothesis (Ha): The average wait time is not 3 minutes (μ ≠ 3).
(We’re using a two-tailed test because we want to test if the average wait time is different from 3 minutes.)
Steps:
Collect Data: We have the wait times for the 30 customers in the sample.
Calculate Standard Error: This involves calculating the standard deviation of the sample and dividing it by the
square root of the sample size. Let’s assume a standard error of 0.2 minutes.
Calculate the t-Statistic: t = (sample mean − hypothesized mean) / standard error = (4.0 − 3) / 0.2 = 5
Find Critical Values: For a two-tailed test with α = 0.05 and 29 degrees of freedom (df = n – 1 = 30 – 1 = 29),
the critical values are approximately ±2.045.
Calculate P-Value: The p-value is calculated as the probability of observing a t-statistic as extreme as 5 in a t-
distribution with 29 degrees of freedom. The p-value is extremely small, indicating strong evidence against
the null hypothesis.
Make a Decision: Since the p-value (very small) is less than α (0.05), we reject the null hypothesis. This
suggests that there is enough evidence to conclude that the average wait time for customers to receive their
coffee is not 3 minutes.
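To make these steps concrete, here is a minimal Python sketch. It reuses the assumed values from the example above (sample mean 4.0, hypothesized mean 3, standard error 0.2) rather than recomputing them from the raw data, and uses scipy.stats for the critical value and p-value.
from scipy import stats

sample_mean = 4.0        # assumed sample mean from the example
mu_0 = 3.0               # hypothesized population mean under H0
standard_error = 0.2     # assumed standard error from the example
n = 30

# t-statistic: t = (sample mean - hypothesized mean) / standard error
t_stat = (sample_mean - mu_0) / standard_error            # = 5.0

# Two-tailed critical value and p-value with n - 1 = 29 degrees of freedom
df = n - 1
critical_value = stats.t.ppf(1 - 0.05 / 2, df)            # about 2.045
p_value = 2 * stats.t.sf(abs(t_stat), df)                 # very small

print(f"t = {t_stat:.2f}, critical value = ±{critical_value:.3f}, p = {p_value:.2e}")
print("Reject H0" if p_value < 0.05 else "Fail to reject H0")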
Significance Level (α)
The significance level, denoted as α (alpha), is a predetermined threshold used in hypothesis testing to
determine the level of evidence required to reject the null hypothesis. It represents the maximum acceptable
probability of making a Type I error, which is the error of rejecting a true null hypothesis.
Commonly used significance levels include α = 0.05 (5%), α = 0.01 (1%), and α = 0.10 (10%).
P-Value
The p-value is a statistical measure that quantifies the strength of evidence against the null hypothesis. It represents
the probability of observing sample data as extreme as, or more extreme than, the data observed, assuming the null
hypothesis is true. A smaller p-value indicates stronger evidence against the null hypothesis.
Definition: A p-value measures the strength of evidence against the null hypothesis in hypothesis testing.
Interpretation: Smaller p-values indicate stronger evidence against the null hypothesis.
Decision: If p ≤ α (chosen significance level, e.g., 0.05), you reject the null hypothesis; if p > α, you fail to reject it.
Example
Suppose you conduct a hypothesis test with a significance level (α) of 0.05. After performing the test, you obtain a p-
value of 0.03.
P-Value Interpretation:
Indicates a 3% probability of observing data as extreme as, or more extreme than, the observed data if the null hypothesis were true.
Decision:
Since p = 0.03 ≤ α = 0.05, you reject the null hypothesis; the observed data suggest that the null hypothesis is unlikely to be true.
In this example, by setting α at 0.05 and obtaining a p-value of 0.03, you have a strong statistical basis to reject the
null hypothesis and make a conclusion based on the evidence provided by the data.
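As a quick illustration, this decision rule is a one-line comparison in Python (a minimal sketch using the numbers from this example):
alpha = 0.05
p_value = 0.03

# Decision rule: reject H0 when p <= alpha
if p_value <= alpha:
    print("Reject the null hypothesis")
else:
    print("Fail to reject the null hypothesis")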
_________________________________________________________________________
T-Tests
A t-test is a statistical test used to determine if there is a significant difference between the means of two groups or
to test if the mean of a single sample is significantly different from a known or hypothesized population mean.
One-Sample T-Test
Explanation: A one-sample t-test is used to determine if the mean of a single sample is significantly different from a known or hypothesized population mean.
Assumptions: It assumes that the sample data is approximately normally distributed and that the
observations are independent.
When to Use: Use a one-sample t-test when you have one sample and want to test if its mean is different
from a specified value (population mean).
Example
Suppose the coffee shop owner collects data on the wait times for a random sample of 30 customers, and the wait
times (in minutes) are as follows:
4.2, 3.8, 3.5, 4.0, 4.5, 3.3, 4.2, 4.1, 3.9, 4.3, 3.7, 4.0, 4.2, 3.8, 3.6, 4.1, 3.9, 4.4, 4.0, 4.2, 3.7, 3.9, 4.0, 4.1, 4.3, 3.6, 3.9, 4.0, 4.2
Hypotheses
Null Hypothesis (H0): The average wait time is 3 minutes (μ = 3).
Alternative Hypothesis (Ha): The average wait time is not 3 minutes (μ ≠ 3). (We're using a two-tailed test because we want to test if the average wait time is different from 3 minutes.)
Collect Data: We have the wait times for the 30 customers in the sample.
Find Critical Values: For a two-tailed test with α = 0.05 and 29 degrees of freedom (df = n – 1 = 30 – 1 = 29),
the critical values are approximately ± 2.045.
Calculate the t-Statistic and P-Value: The t-statistic is computed from the sample mean, the hypothesized mean of 3 minutes, and the standard error. Since the calculated t-statistic (19.43) lies far beyond the critical values, the corresponding p-value is extremely small.
Make a Decision: Since the t-statistic provides overwhelming evidence against the null hypothesis (p-value <
0.0001), we reject the null hypothesis. This suggests that there is strong statistical evidence to conclude that
the average wait time for customers is significantly different from 3 minutes.
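The same test can be run directly with scipy.stats.ttest_1samp. The sketch below stores the wait times listed above in a list; the exact t-statistic it prints depends on that data, so it may differ slightly from the 19.43 quoted above.
import numpy as np
from scipy import stats

# Wait times (in minutes) from the sample above
wait_times = [4.2, 3.8, 3.5, 4.0, 4.5, 3.3, 4.2, 4.1, 3.9, 4.3,
              3.7, 4.0, 4.2, 3.8, 3.6, 4.1, 3.9, 4.4, 4.0, 4.2,
              3.7, 3.9, 4.0, 4.1, 4.3, 3.6, 3.9, 4.0, 4.2]

# One-sample, two-tailed t-test against the hypothesized mean of 3 minutes
t_stat, p_value = stats.ttest_1samp(wait_times, popmean=3)

print(f"sample mean = {np.mean(wait_times):.2f}")
print(f"t = {t_stat:.2f}, p = {p_value:.2e}")
print("Reject H0" if p_value < 0.05 else "Fail to reject H0")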
Two-Sample T-Test
Explanation: A two-sample t-test is used to compare the means of two independent samples to determine if
they are significantly different from each other.
Assumptions: It assumes that the data in both samples are approximately normally distributed, and the
observations in each sample are independent.
When to Use: Use a two-sample t-test when you have two separate groups or samples, and you want to test
if their means are significantly different from each other.
Imagine you are a teacher, and you want to determine if two different teaching methods, Method A and Method B,
have a significant impact on students’ test scores. You have two groups of students, one taught using Method A and
the other using Method B, and you want to compare their test scores to see if there’s a significant difference.
Hypotheses:
Null Hypothesis (H0): There is no significant difference in the mean test scores between the two teaching
methods (μA = μB).
Alternative Hypothesis (Ha): There is a significant difference in the mean test scores between the two
teaching methods (μA ≠ μB).
Collect Data: Collect test scores from the two groups of students; in this example, assume five students per group.
Set Significance Level (α): Choose a significance level, such as α = 0.05, to determine the threshold for
statistical significance.
Calculate Sample Means: Calculate the sample means for both groups (the mean score under Method A and under Method B).
Calculate Standard Deviations: Calculate the sample standard deviations for both groups, which measure the
variability within each group.
Calculate the t-Statistic: t ≈ −0.279, computed from the two sample means, sample standard deviations, and sample sizes.
Determine Degrees of Freedom (df): df = n1 + n2 − 2 = 5 + 5 − 2 = 8
Calculate the p-Value: Convert the t-statistic into a two-tailed p-value using the t-distribution with 8 degrees of freedom.
Make a Decision: Compare the p-value to α. If p ≤ α, reject the null hypothesis, indicating a significant
difference in test scores between the two teaching methods. If p > α, fail to reject the null hypothesis.
Here, the decision rule is: if |t-statistic| > critical value, reject the null hypothesis. Since |−0.279| is well below the critical t-value for 8 degrees of freedom (about 2.306 at α = 0.05), we fail to reject the null hypothesis: the data do not show a significant difference between the two teaching methods.
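Because the original score tables are not reproduced here, the sketch below uses hypothetical scores for five students per method, just to show the mechanics with scipy.stats.ttest_ind; its t-statistic and p-value will therefore differ from the values quoted above.
from scipy import stats

# Hypothetical test scores (n = 5 per group), not the original course data
method_a_scores = [85, 78, 90, 88, 76]
method_b_scores = [82, 80, 88, 91, 79]

# Independent two-sample t-test (two-tailed), assuming equal variances
t_stat, p_value = stats.ttest_ind(method_a_scores, method_b_scores)

alpha = 0.05
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
print("Reject H0" if p_value <= alpha else "Fail to reject H0")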
Data Visualization
1. Histogram:
Purpose:
Shows the frequency distribution of a single numerical variable.
Usage:
Useful for examining the shape, spread, and skewness of a distribution.
Implementation:
The data is divided into bins; the x-axis represents the variable values, and the y-axis represents the frequency of occurrences.
Example Code:
import matplotlib.pyplot as plt
data = [4.2, 3.8, 3.5, 4.0, 4.5, 3.3, 4.2, 4.1, 3.9, 4.3]  # hypothetical sample data
plt.hist(data, bins=5, color='skyblue', edgecolor='black')
plt.title('Histogram of Data')
plt.xlabel('Variable')
plt.ylabel('Frequency')
plt.show()
2. Box Plot:
Purpose:
Displays the summary of the distribution, including median, quartiles, and potential outliers.
Usage:
Useful for comparing distributions between groups and spotting outliers at a glance.
Implementation:
Utilizes a rectangular box to represent the interquartile range (IQR) and “whiskers” to show variability.
Example Code:
plt.boxplot(data)  # reuses the hypothetical data from the histogram example
plt.show()
3. Scatter Plot:
Purpose:
Shows the relationship between two numerical variables.
Usage:
Useful for spotting correlation, clusters, and outliers between two variables.
Implementation:
Each observation is plotted as a point at its (x, y) coordinates.
Example Code:
x = [1, 2, 3, 4, 5]  # hypothetical values
y = [2, 4, 5, 4, 6]  # hypothetical values
plt.scatter(x, y, color='green')
plt.xlabel('X')
plt.ylabel('Y')
plt.show()
4. Pair Plot:
Purpose:
Shows pairwise relationships between several numerical variables in a grid of plots.
Usage:
Useful for a quick overview of relationships and distributions during exploratory analysis.
Implementation:
Diagonal shows the distribution of each variable, and off-diagonal plots show scatter plots.
Example Code:
import seaborn as sns
sns.pairplot(df)  # assumes a pandas DataFrame df with numeric columns (hypothetical)
plt.show()
5. Heatmap:
Purpose:
Displays the values of a matrix, such as a correlation matrix, as a grid of colors.
Usage:
Useful for spotting strongly related pairs of variables at a glance.
Implementation:
Each cell is colored according to its value; annotations can print the value inside each cell.
Example Code:
correlation_matrix = df.corr()  # assumes a pandas DataFrame df with numeric columns
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()
6. Violin Plot:
Purpose:
Combines aspects of box plots and kernel density plots to show the distribution of a numerical variable across different categories.
Usage:
Useful for comparing the shape of a distribution between groups.
Implementation:
Draws a mirrored kernel density estimate around a box-plot-like summary for each category.
Example Code:
sns.violinplot(x='Category', y='Value', data=df)  # assumes df has 'Category' and 'Value' columns (hypothetical)
plt.show()
7. Bar Chart:
Purpose:
Shows the count (or another aggregate) for each level of a categorical variable.
Usage:
Useful for comparing frequencies across categories.
Implementation:
Each category is drawn as a bar whose height equals its count.
Example Code:
df['Category'].value_counts().plot(kind='bar', color='orange')  # assumes df has a 'Category' column
plt.xlabel('Category')
plt.ylabel('Count')
plt.show()
Interpretation of Visualization
Example:
Scatter Plot Analysis – Relationship between Study Hours and Exam Scores
Data: Suppose you have collected data on the study hours and corresponding exam scores for a group of students.
Interpretation:
Pattern: The scatter plot shows a clear upward trend, indicating a positive correlation between study hours
and exam scores. More study hours generally lead to better exam performance.
Outliers: No significant outliers are observed; all data points align with the trend.
This interpretation highlights the positive relationship between study hours and exam scores, emphasizing the
benefits of increased study time on academic performance.
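Here is a short matplotlib sketch of how such a scatter plot could be produced; the study-hour and score values below are hypothetical placeholders, since the original table is not reproduced here.
import matplotlib.pyplot as plt

# Hypothetical study hours and exam scores for 8 students
study_hours = [1, 2, 3, 4, 5, 6, 7, 8]
exam_scores = [52, 58, 61, 67, 70, 76, 80, 85]

plt.scatter(study_hours, exam_scores, color='blue')
plt.title('Study Hours vs. Exam Scores')
plt.xlabel('Study Hours')
plt.ylabel('Exam Score')
plt.show()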
Correlation and Regression
Correlation – Measure of Association:
Correlation is a statistical measure that quantifies the degree and direction of a relationship between two variables. It
assesses how changes in one variable are associated with changes in another.
Linear Regression – Modeling Relationships:
Linear regression is a statistical technique used to model the relationship between a dependent variable (response) and one or more independent variables (predictors). It assumes a linear relationship and helps predict the dependent variable based on the independent variables.
Suppose you are analyzing the relationship between study hours and exam scores for a group of students. You want
to calculate and interpret the correlation coefficient between these two variables:
Data: You have data on study hours (X) and corresponding exam scores (Y) for several students.
Calculate Correlation: You calculate the correlation coefficient, which can be Pearson’s correlation coefficient
(r), Spearman’s rank correlation coefficient, or others, depending on the data characteristics.
Interpretation:
If r is close to 1: Strong positive correlation; as study hours increase, exam scores tend to increase
significantly.
If r is close to -1: Strong negative correlation; as study hours increase, exam scores tend to decrease
significantly.
If r is close to 0: Weak or no linear correlation; study hours and exam scores are not strongly associated.
Correlation helps quantify the strength and direction of the relationship between study hours and exam scores.
Linear regression, on the other hand, can further model and predict exam scores based on study hours and may
provide insights into the extent of predictability.
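Here is a minimal sketch of both steps in Python, again using hypothetical study-hour and score data: numpy gives Pearson's r, and scipy.stats.linregress fits the simple linear regression.
import numpy as np
from scipy import stats

# Hypothetical data: study hours (X) and exam scores (Y)
study_hours = np.array([1, 2, 3, 4, 5, 6, 7, 8])
exam_scores = np.array([52, 58, 61, 67, 70, 76, 80, 85])

# Correlation: Pearson's r
r = np.corrcoef(study_hours, exam_scores)[0, 1]
print(f"Pearson's r = {r:.3f}")

# Simple linear regression: exam score = intercept + slope * study hours
result = stats.linregress(study_hours, exam_scores)
print(f"slope = {result.slope:.2f}, intercept = {result.intercept:.2f}")

# Predicted exam score for a student who studies 5.5 hours
print(f"predicted score for 5.5 hours: {result.intercept + result.slope * 5.5:.1f}")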
__________________________________________________________________________
Confidence Interval
Confidence, in statistics, is another way to describe probability. For example, suppose you construct a confidence
interval with a 95% confidence level. In that case, you are confident that 95 out of 100 times the estimate will fall
between the upper and lower values specified by the confidence interval.
Your desired confidence level is usually one minus the alpha (α) value you used in your statistical test:
Confidence level = 1 − α
So, if you use an alpha value of p < 0.05 for statistical significance, then your confidence level would be 1 − 0.05 =
0.95, or 95%.
Calculation: Confidence intervals are typically calculated using a formula that incorporates the sample
statistic, the standard error (a measure of variability), and a critical value from a statistical distribution (e.g.,
the normal distribution or t-distribution). The formula is:
Confidence interval = point estimate ± margin of error, where the margin of error = critical value × standard error. It combines two components:
The point estimate (sample statistic), which is the best guess for the population parameter.
The margin of error, which represents the range around the point estimate. It reflects the estimate’s
precision and is determined by the chosen confidence level.
For a z statistic, some of the most common values are shown in this table:
Confidence level    90%      95%      99%
z value             1.645    1.96     2.576
Example: Determining the Confidence Interval for the Mean Height of a Population
Scenario: Imagine you want to estimate the average height of all adults in a country. You collect a sample of
heights from 100 individuals and calculate the sample mean height to be 170 cm with a standard deviation of
5 cm. You want to determine a 95% confidence interval for the mean height.
Data: You collect a sample of heights from 100 individuals and calculate the sample mean height to be 170
cm with a standard deviation of 5 cm.
Step 1: Calculation of Margin of Error To calculate the margin of error for the confidence interval, you need two
components:
Critical Value: Using the t-distribution (since it’s a sample) and a 95% confidence level, you find the critical
value. For this example, let’s assume it’s 1.96. This critical value represents how many standard errors the
margin of error should cover to achieve a 95% confidence level.
Standard Error: The standard error measures the variability of the sample mean. It’s calculated as the
standard deviation divided by the square root of the sample size.
Margin of Error Calculation: margin of error = critical value × standard error = 1.96 × (5 / √100) = 1.96 × 0.5 = 0.98 cm
Step 2: Constructing the Confidence Interval Now that you have the margin of error, you can construct the
confidence interval. The confidence interval formula is as follows:
CI = sample mean ± margin of error
CI = 170 cm ± 0.98 cm
Interpretation: This means you are 95% confident that the true average height of all adults in the country falls within
the range of 169.02 cm to 170.98 cm.
Explanation: By calculating a confidence interval, you provide a range of heights within which you believe the true
population average height is likely to be. In this case, with 95% confidence, you estimate that the average height of all
adults in the country lies between 169.02 cm and 170.98 cm. The margin of error (0.98 cm) accounts for the
uncertainty associated with estimating the population mean from a sample.
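The same interval can be computed in a few lines of Python; this sketch follows the example's use of 1.96 as the critical value.
import math

sample_mean = 170      # cm
sample_sd = 5          # cm
n = 100
critical_value = 1.96  # 95% confidence level, as assumed in the example

standard_error = sample_sd / math.sqrt(n)           # 5 / 10 = 0.5
margin_of_error = critical_value * standard_error   # 1.96 * 0.5 = 0.98

lower = sample_mean - margin_of_error
upper = sample_mean + margin_of_error
print(f"95% CI: ({lower:.2f} cm, {upper:.2f} cm)")  # (169.02 cm, 170.98 cm)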
Z-Test: The Z-test is a statistical hypothesis test used to assess whether a sample mean is significantly different from a
known population mean when the sample size is sufficiently large. It relies on the standard normal distribution and is
suitable when the population standard deviation is known.
Comparison: The Z-test and t-test are both used for hypothesis testing, but they differ in their assumptions about the
population standard deviation and sample size:
Z-Test: Assumes a known population standard deviation and is appropriate for large sample sizes (typically n
> 30).
t-Test: Assumes an unknown population standard deviation and is suitable for smaller sample sizes.
Aspect             t-Test                                         Z-Test
Purpose            Compare means of two groups (small samples)    Compare a sample mean to a known population mean (large samples)
Sample Size        Small sample sizes (typically < 30)            Large sample sizes (typically ≥ 30)
Population SD      Unknown and estimated from the sample          Known or estimated from a large sample
Distribution       Follows the t-distribution                     Follows the standard normal (z) distribution
Critical Values    Use t-distribution tables                      Use standard normal (z) distribution tables
Common Use Case    Comparing means when SD is unknown             Comparing a sample mean to a known population mean
Steps in Z-test
Example
Imagine you’re evaluating the effectiveness of a weight loss program. You collect data on the weight loss of 100
participants who completed the program and want to test if, on average, participants lost a significant amount of
weight.
import random

# Hypothetical weight-loss data (in pounds) for the 100 participants, generated
# here only as a stand-in because the original measurements are not shown
random.seed(42)
weight_loss_data = [random.gauss(1.5, 2) for _ in range(100)]
1. Formulate Hypotheses
Null Hypothesis (H0): The average weight loss in the program is equal to or less than zero (μ ≤ 0).
Alternative Hypothesis (Ha): The average weight loss in the program is greater than zero (μ > 0).
Sample Mean (X̄ ): Calculate the sample mean of weight loss data.
Population Standard Deviation (σ): Known as 2 pounds.
import numpy as np
from scipy import stats

sample_mean = np.mean(weight_loss_data)
population_sd = 2                  # known population standard deviation (pounds)
n = len(weight_loss_data)

# Calculate Z-score: z = (sample mean - hypothesized mean) / (sigma / sqrt(n))
standard_error = population_sd / np.sqrt(n)
z_score = (sample_mean - 0) / standard_error

# Critical value for a right-tailed test at the chosen significance level
alpha = 0.05
critical_value = stats.norm.ppf(1 - alpha)   # about 1.645

4. Make a Decision
# Make a decision
reject_null = z_score > critical_value
print(f"z = {z_score:.2f}, critical value = {critical_value:.3f}, reject H0: {reject_null}")
Interpretation: There is sufficient statistical evidence to conclude that, on average, participants in the weight
loss program experienced significant weight loss.
Chi-Square Test: The chi-square test is a statistical test used to assess the independence (or association)
between two categorical variables. It helps determine whether there is a significant relationship between the
variables or if they are independent.
Independence: In the context of the chi-square test, independence means that the two categorical variables
are not related, and changes in one variable do not affect the distribution of the other variable.
Data: You have surveyed 200 voters in a local election to understand the association between their gender and voting
preferences. Here’s the dataset:
import pandas as pd
import random

# Hypothetical survey responses for the 200 voters, generated as a stand-in
# because the original survey data is not shown
random.seed(0)
gender = random.choices(['Male', 'Female'], k=200)
voting_preferences = random.choices(['Candidate A', 'Candidate B'], k=200)

data = {
    'Gender': gender,
    'Voting_Preferences': voting_preferences
}
df = pd.DataFrame(data)
1. Hypotheses
You want to test whether there is a significant association between the gender of voters and their voting preferences.
Null Hypothesis (H0): Gender and voting preferences are independent; there is no association.
Alternative Hypothesis (Ha): Gender and voting preferences are not independent; there is an association.
2. Create a Contingency Table
contingency_table = pd.crosstab(df['Gender'], df['Voting_Preferences'])
print(contingency_table)
3. Calculate the Chi-Square Statistic
Calculate the chi-square statistic, which measures the discrepancy between observed and expected frequencies.
from scipy.stats import chi2_contingency, chi2
chi2_stat, p_value, dof, expected = chi2_contingency(contingency_table)
4. Set the Significance Level
Specify the desired level of significance (α) to determine the critical chi-square value. Let's assume α = 0.05.
alpha = 0.05
critical_value = chi2.ppf(1 - alpha, dof)
5. Make a Decision
# Make a decision
reject_null = chi2_stat > critical_value
print(f"chi-square = {chi2_stat:.2f}, critical value = {critical_value:.2f}, reject H0: {reject_null}")
There is insufficient evidence to conclude that there is a significant association between the gender of voters and
their voting preferences in this local election. Gender and voting preferences appear to be independent.
One-Way and Two-Way ANOVA
Introduction to Analysis of Variance (ANOVA):
Analysis of Variance (ANOVA): ANOVA is a statistical technique used to compare means among multiple groups or
treatments. It assesses whether there are any statistically significant differences between the means of the groups.
One-Way ANOVA: One-way ANOVA is used when there is one categorical independent variable (with two or
more levels or groups) and one continuous dependent variable. It tests whether there are any significant
differences between the means of the groups.
Two-Way ANOVA: Two-way ANOVA is used when there are two categorical independent variables (factors)
and one continuous dependent variable. It assesses the interaction effect between the two factors and
whether they have a significant influence on the dependent variable.
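The worked example below covers only the one-way case, so here is a brief sketch of how a two-way ANOVA could be run with statsmodels; the DataFrame columns 'score', 'method', and 'school' and all of the values are hypothetical, chosen only to illustrate the layout.
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Hypothetical exam scores classified by teaching method and school
df = pd.DataFrame({
    'score':  [85, 88, 78, 84, 80, 82, 85, 81, 92, 95, 94, 90],
    'method': ['A', 'A', 'B', 'B', 'A', 'A', 'B', 'B', 'A', 'A', 'B', 'B'],
    'school': ['X', 'X', 'X', 'X', 'Y', 'Y', 'Y', 'Y', 'Z', 'Z', 'Z', 'Z'],
})

# Two-way ANOVA: main effects of method and school plus their interaction
model = ols('score ~ C(method) + C(school) + C(method):C(school)', data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)
print(anova_table)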
Example
Formulate Hypotheses
Null Hypothesis (H0): There are no significant differences in test scores among the three schools.
Alternative Hypothesis (Ha): There are significant differences in test scores among the three schools.
import numpy as np
from scipy import stats

# Given data: hypothetical test scores for the three schools, used as a stand-in
# because the original figures are not reproduced here
school_1 = [85, 88, 90, 78, 84]
school_2 = [80, 82, 79, 85, 81]
school_3 = [92, 95, 89, 94, 90]
all_scores = school_1 + school_2 + school_3

grand_mean = np.mean(all_scores)

# Between-group and within-group sums of squares
ssb = np.sum([len(scores) * (np.mean(scores) - grand_mean)**2
              for scores in [school_1, school_2, school_3]])
ssw = np.sum([(score - np.mean(scores))**2
              for scores in [school_1, school_2, school_3] for score in scores])

# Degrees of freedom and mean squares
df_between = 3 - 1
df_within = len(all_scores) - 3
ms_between = ssb / df_between
ms_within = ssw / df_within

# F-statistic and p-value
f_statistic = ms_between / ms_within
p_value = stats.f.sf(f_statistic, df_between, df_within)

# Print results
print(f"F-statistic: {f_statistic}")
print(f"P-value: {p_value}")

alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis: \nThere are significant differences in test scores among the three schools.")
else:
    print("Fail to reject the null hypothesis: \nThere are no significant differences in test scores among the three schools.")
Make a Decision
Compare the calculated F-statistic with the critical F-value.
# Make a decision by comparing F with the critical value at alpha = 0.05
critical_f = stats.f.ppf(1 - alpha, df_between, df_within)
reject_null = f_statistic > critical_f
print(f"Critical F-value: {critical_f:.2f}, reject H0: {reject_null}")
Interpretation: Based on the analysis, we reject the null hypothesis (H0). There are significant
differences in test scores among the three schools.
Complete Python Code for the above Process
(Here the same one-way ANOVA is run with scipy.stats.f_oneway, which performs the calculation above in a single call; the school scores are the same hypothetical stand-in values.)
from scipy import stats

# Hypothetical test scores for the three schools (stand-in data, as above)
school_1 = [85, 88, 90, 78, 84]
school_2 = [80, 82, 79, 85, 81]
school_3 = [92, 95, 89, 94, 90]

f_statistic, p_value = stats.f_oneway(school_1, school_2, school_3)

# Print results
print(f"F-statistic: {f_statistic}")
print(f"P-value: {p_value}")

alpha = 0.05
if p_value < alpha:
    print("\nReject the null hypothesis: There are significant differences in test scores among the three schools.")
else:
    print("\nFail to reject the null hypothesis: There are no significant differences in test scores among the three schools.")
Testing
1. Question 1: You are conducting a one-sample t-test with the following information:
Population mean = 70
Sample mean = 75
Sample size = 30
2.7
3.0
2.0
1.5
2. Question 2: In a chi-square test of independence, you have a contingency table with 3 rows and 4 columns.
How many degrees of freedom are there?
3
4
7
9
3. Question 3: You perform a two-sample t-test to compare the means of two groups. The p-value obtained is
0.034. If you set a significance level (alpha) of 0.05, what is your decision?
Perform a z-test.
4. Question 4: You are analyzing the correlation between two variables. The correlation coefficient (r) you
calculated is -0.85. What does this indicate about the relationship between the variables?
No correlation.
2.42
3.88
3.24
5.62
6. Question 6: In a hypothesis test, you calculate a z-statistic of -1.98. What is the corresponding p-value if you
are conducting a two-tailed test?
0.0244
0.4761
0.9522
0.9768
7. Question 7: You are constructing a 95% confidence interval for a population mean. If the sample size is 50
and the standard error is 4, what is the margin of error?
1.96
0.08
0.98
7.84
8. Question 8: In a hypothesis test, you calculate a t-statistic of -2.15 with 15 degrees of freedom. What is the
corresponding p-value for a two-tailed test?
0.0246
0.0492
0.1048
0.2096
9. Question 9: You are conducting a chi-square test of independence with 4 rows and 3 columns in the
contingency table. What is the total degree of freedom for this test?
7
9
12
15
10. Question 10: You perform a paired-sample t-test and obtain a t-statistic of 3.42. If you have 29 pairs of data,
what is the degrees of freedom for this test?
28
29
30
58
11. You are conducting a hypothesis test to compare the means of two independent groups. The test statistic you
calculate is -2.87. If you set a significance level (alpha) of 0.05, what is your decision?
12. In a one-sample t-test, you are testing whether the population mean is greater than 50. If the calculated t-
statistic is 1.96 and the sample size is 30, what is the p-value for a one-tailed test?
0.0274
0.0548
0.0456
0.1096