Advanced Statistic

The document provides an overview of hypothesis testing, including the null and alternative hypotheses, significance levels, and the use of p-values in decision-making. It explains one-sample and two-sample t-tests with examples, detailing the steps involved in hypothesis testing and the interpretation of results. Additionally, it discusses confidence intervals and visualizations such as histograms, box plots, and scatter plots for data exploration.

Advanced Statistic & Hypothesis Testing

Hypothesis Testing
Hypothesis testing is a fundamental statistical method used to assess claims or hypotheses about populations based
on sample data. It involves two key hypotheses:

 Null Hypothesis (H0): This is the default assumption or status quo. It suggests that there is no significant
effect, difference, or relationship in the population.

 Alternative Hypothesis (Ha): This contradicts the null hypothesis and suggests that there is a significant
effect, difference, or relationship in the population.

Example

Scenario: Imagine you are a teacher, and you believe that giving students a snack before a test will improve their
performance. However, your friend suggests that snacks might not make a difference.

Hypotheses:

 Null Hypothesis (H0): Giving students a snack has no effect on test performance.

 Alternative Hypothesis (H1): Giving students a snack improves test performance.

Testing: To test this, you decide to conduct a small experiment. You randomly select two groups of students: one
group receives a snack before the test, and the other does not. After the test, you compare the average scores
between the two groups.

Results: If the group that received snacks shows a significantly higher average score, you might reject the null
hypothesis in favor of the alternative hypothesis, concluding that snacks do have an impact on test performance.

Conclusion: If there’s no significant difference in the average scores between the two groups, you would fail to reject
the null hypothesis, suggesting that snacks may not have a noticeable effect on test performance.
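The snack experiment is exactly a two-sample t-test. As a minimal sketch, with hypothetical scores invented for illustration:

```python
from scipy import stats

# Hypothetical scores, invented for illustration: one group got a snack, one did not
snack_scores = [78, 85, 82, 88, 90, 84, 79, 86]
no_snack_scores = [75, 80, 77, 83, 79, 81, 74, 78]

# Two-sample t-test of H0: the two group means are equal
t_stat, p_value = stats.ttest_ind(snack_scores, no_snack_scores)

alpha = 0.05
if p_value < alpha:
    print(f"p = {p_value:.4f} < {alpha}: reject H0; snacks appear to affect scores")
else:
    print(f"p = {p_value:.4f} >= {alpha}: fail to reject H0")
```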

Hypothesis Testing Steps

Example:

Suppose the coffee shop owner collects data on the wait times for a random sample of 30 customers, and the wait
times (in minutes) are as follows:
4.2, 3.8, 3.5, 4.0, 4.5, 3.3, 4.2, 4.1, 3.9, 4.3, 3.7, 4.0, 4.2, 3.8, 3.6, 4.1, 3.9, 4.4, 4.0, 4.2, 3.7, 3.9, 4.0, 4.1, 4.3, 3.6, 3.9, 4.0, 4.2, 4.2

Now, let’s perform the hypothesis test based on this data:

Hypotheses:

 Null Hypothesis (H0): The average wait time is 3 minutes (μ = 3).

 Alternative Hypothesis (Ha): The average wait time is not 3 minutes (μ ≠ 3).

(We’re using a two-tailed test because we want to test if the average wait time is different from 3 minutes.)

Steps:

 Collect Data: We have the wait times for the 30 customers in the sample.

 Set Significance Level: Let’s choose a significance level (α) of 0.05.

 Calculate Sample Mean: The sample mean ( x̄ ) is calculated as:

x̄ = (4.2 + 3.8 + 3.5 + … + 4.2) / 30 ≈ 4.0 minutes

 Calculate Standard Error: This involves calculating the standard deviation of the sample and dividing it by the
square root of the sample size. Let’s assume a standard error of 0.2 minutes.

 Calculate Test Statistic: Using the formula:

t = (4.0 − 3) / 0.2 = 5


 Find Critical Values: For a two-tailed test with α = 0.05 and 29 degrees of freedom (df = n – 1 = 30 – 1 = 29),
the critical values are approximately ±2.045.

 Calculate P-Value: The p-value is calculated as the probability of observing a t-statistic as extreme as 5 in a t-
distribution with 29 degrees of freedom. The p-value is extremely small, indicating strong evidence against
the null hypothesis.

 Make a Decision: Since the p-value (very small) is less than α (0.05), we reject the null hypothesis. This
suggests that there is enough evidence to conclude that the average wait time for customers to receive their
coffee is not 3 minutes.
Significance Level (α)
 The significance level, denoted as α (alpha), is a predetermined threshold used in hypothesis testing to
determine the level of evidence required to reject the null hypothesis. It represents the maximum acceptable
probability of making a Type I error, which is the error of rejecting a true null hypothesis.

 Commonly used significance levels include α = 0.05 (5%), α = 0.01 (1%), and α = 0.10 (10%).


P-Value
The p-value is a statistical measure that quantifies the strength of evidence against the null hypothesis. It represents
the probability of observing sample data as extreme as, or more extreme than, the data observed, assuming the null
hypothesis is true. A smaller p-value indicates stronger evidence against the null hypothesis.

Definition: A p-value measures the strength of evidence against the null hypothesis in hypothesis testing.

Interpretation: Smaller p-values indicate stronger evidence against the null hypothesis.

Decision: If p ≤ α (chosen significance level, e.g., 0.05), you reject the null hypothesis; if p > α, you fail to reject it.

Example

Setting Alpha at 0.05 and Interpreting a P-Value of 0.03:

Suppose you conduct a hypothesis test with a significance level (α) of 0.05. After performing the test, you obtain a p-
value of 0.03.

Significance Level (α):

 Choose a significance level of α = 0.05.

 This means accepting a 5% chance of a Type I error (falsely rejecting a true null hypothesis).

P-Value Interpretation:

 Obtained a p-value of 0.03.

 P-value is less than α (0.05).

 Indicates a 3% probability of observing data this extreme, or more extreme, if the null hypothesis were true.

Decision:

 Given the p-value is less than α.

 Typically, reject the null hypothesis in favor of the alternative hypothesis.

 Strong evidence against the null hypothesis.


Conclusion:

 Statistical basis to reject the null hypothesis.

 The observed data supports the conclusion that the null hypothesis is unlikely to be true.

In this example, by setting α at 0.05 and obtaining a p-value of 0.03, you have a strong statistical basis to reject the
null hypothesis and make a conclusion based on the evidence provided by the data.

_________________________________________________________________________
T-Tests
A t-test is a statistical test used to determine if there is a significant difference between the means of two groups or
to test if the mean of a single sample is significantly different from a known or hypothesized population mean.

One Sample T-Test

 Explanation: A one-sample t-test is used to determine if the mean of a single sample is significantly different
from a known or hypothesized population mean.

 Assumptions: It assumes that the sample data is approximately normally distributed and that the
observations are independent.

 When to Use: Use a one-sample t-test when you have one sample and want to test if its mean is different
from a specified value (population mean).

Hypothesis Testing – One Sample T Test

Example

Suppose the coffee shop owner collects data on the wait times for a random sample of 30 customers, and the wait
times (in minutes) are as follows:

4.2, 3.8, 3.5, 4.0, 4.5, 3.3, 4.2, 4.1, 3.9, 4.3, 3.7, 4.0, 4.2, 3.8, 3.6, 4.1, 3.9, 4.4, 4.0, 4.2, 3.7, 3.9, 4.0, 4.1, 4.3, 3.6, 3.9, 4.0, 4.2, 4.2

Now, let’s perform the hypothesis test based on this data:

Hypotheses

 Null Hypothesis (H0): The average wait time is 3 minutes (μ = 3).

 Alternative Hypothesis (Ha): The average wait time is not 3 minutes (μ ≠ 3). (We’re using a two-tailed test
because we want to test if the average wait time is different from 3 minutes.)

Steps of Hypothesis Testing

 Collect Data: We have the wait times for the 30 customers in the sample.

 Set Significance Level: Let’s choose a significance level (α) of 0.05.

 Calculate Sample Mean & Standard Deviation:


Sample Mean (x̄) = (4.2 + 3.8 + 3.5 + … + 4.2) / 30 ≈ 3.98 minutes

Sample Standard Deviation (s) ≈ 0.276

 Calculate Test Statistic: Using the formula:

t = (3.979 – 3) / (0.276 / √30)


t = 19.43

 Find Critical Values: For a two-tailed test with α = 0.05 and 29 degrees of freedom (df = n – 1 = 30 – 1 = 29),
the critical values are approximately ± 2.045.

 Calculate P-Value: The p-value is the probability of observing a t-statistic at least as extreme as the one calculated; since the calculated t-statistic (19.43) lies far beyond the critical values, the p-value is extremely small.

 Make a Decision: Since the t-statistic provides overwhelming evidence against the null hypothesis (p-value <
0.0001), we reject the null hypothesis. This suggests that there is strong statistical evidence to conclude that
the average wait time for customers is significantly different from 3 minutes.
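The steps above can be checked with `scipy.stats.ttest_1samp`. Because the hand calculation rounds the mean and standard deviation, the exact t differs slightly from 19.43, but the conclusion is identical:

```python
import numpy as np
from scipy import stats

wait_times = np.array([
    4.2, 3.8, 3.5, 4.0, 4.5, 3.3, 4.2, 4.1, 3.9, 4.3,
    3.7, 4.0, 4.2, 3.8, 3.6, 4.1, 3.9, 4.4, 4.0, 4.2,
    3.7, 3.9, 4.0, 4.1, 4.3, 3.6, 3.9, 4.0, 4.2, 4.2,
])

# One-sample t-test of H0: mu = 3 against the two-sided alternative
t_stat, p_value = stats.ttest_1samp(wait_times, popmean=3)

print(f"mean = {wait_times.mean():.3f}, t = {t_stat:.2f}, p = {p_value:.2e}")
```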

Two-Sample T-Test

 Explanation: A two-sample t-test is used to compare the means of two independent samples to determine if
they are significantly different from each other.

 Assumptions: It assumes that the data in both samples are approximately normally distributed, and the
observations in each sample are independent.

 When to Use: Use a two-sample t-test when you have two separate groups or samples, and you want to test
if their means are significantly different from each other.

Hypothesis Testing – Two Sample T Test


Example

Imagine you are a teacher, and you want to determine if two different teaching methods, Method A and Method B,
have a significant impact on students’ test scores. You have two groups of students, one taught using Method A and
the other using Method B, and you want to compare their test scores to see if there’s a significant difference.

Hypotheses:

 Null Hypothesis (H0): There is no significant difference in the mean test scores between the two teaching
methods (μA = μB).

 Alternative Hypothesis (Ha): There is a significant difference in the mean test scores between the two
teaching methods (μA ≠ μB).

Steps of the Two-Sample T-Test

 Collect Data: Collect test scores from two groups of students. Let’s assume:

 Group A (Method A): [85, 88, 92, 78, 90]

 Group B (Method B): [91, 89, 82, 87, 88]

 Set Significance Level (α): Choose a significance level, such as α = 0.05, to determine the threshold for
statistical significance.

 Calculate Sample Means: Calculate the sample means for both groups:

 Sample Mean of Group A (x̄ A) = (85 + 88 + 92 + 78 + 90) / 5 = 86.6

 Sample Mean of Group B (x̄ B) = (91 + 89 + 82 + 87 + 88) / 5 = 87.4

 Calculate Standard Deviations: Calculate the sample standard deviations for both groups, which measure the variability within each group.

 Standard Deviation for Group A (sA): ≈ 5.459

 Standard Deviation for Group B (sB): ≈ 3.362

 Pooled Standard Deviation (s_pooled): ≈ 4.533

 Calculate the t-Statistic:

t = (x̄A − x̄B) / (s_pooled × √(1/n₁ + 1/n₂)) = (86.6 − 87.4) / (4.533 × √(2/5)) ≈ −0.279

 Determine Degrees of Freedom (df):

df = n₁ + n₂ − 2 = 5 + 5 − 2 = 8

 Make a Decision: Compare the p-value to α (or, equivalently, |t| to the critical value, which is approximately 2.306 for a two-tailed test with df = 8 and α = 0.05). If p ≤ α, reject the null hypothesis, indicating a significant difference in test scores between the two teaching methods. If p > α, fail to reject the null hypothesis.

Here, |t| ≈ 0.279 < 2.306 (and p ≈ 0.79 > 0.05), so we fail to reject the null hypothesis: the data do not show a significant difference in test scores between the two teaching methods.
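The hand calculation can be verified with `scipy.stats.ttest_ind`; its default pooled (equal-variance) test matches the pooled-standard-deviation formula used above:

```python
from scipy import stats

group_a = [85, 88, 92, 78, 90]  # Method A scores
group_b = [91, 89, 82, 87, 88]  # Method B scores

# Pooled two-sample t-test; equal_var=True matches the pooled-SD hand calculation
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=True)

print(f"t = {t_stat:.4f}, p = {p_value:.4f}")
```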

Visualization Plots for Data Exploration


1. Histogram:

 Purpose:

 Illustrates the distribution of a single numerical variable.

 Usage:

 Identifies patterns, central tendency, and spread.

 Helps detect skewness, outliers, and potential data issues.

 Implementation:

 The x-axis represents the variable values, and the y-axis represents the frequency of occurrences.

 Example Code:

import matplotlib.pyplot as plt
import numpy as np

data = np.random.randn(200)  # example data; replace with your own variable

plt.hist(data, bins=10, color='skyblue', edgecolor='black')
plt.title('Histogram of Data')
plt.xlabel('Variable')
plt.ylabel('Frequency')
plt.show()
2. Box Plot:

 Purpose:

 Displays the summary of the distribution, including median, quartiles, and potential outliers.

 Usage:

 Facilitates comparisons of distributions between different groups.

 Implementation:

 Utilizes a rectangular box to represent the interquartile range (IQR) and “whiskers” to show
variability.

 Example Code:

import seaborn as sns
import matplotlib.pyplot as plt

# df is assumed to be a DataFrame with 'Group' and 'Variable' columns
sns.boxplot(x='Group', y='Variable', data=df)
plt.title('Box Plot of Variable by Group')
plt.show()
3. Scatter Plot:

 Purpose:

 Reveals the relationship between two numerical variables.

 Usage:

 Identifies patterns, trends, and correlations between variables.

 Implementation:

 Each point represents a data observation with x and y coordinates.

 Example Code:

import matplotlib.pyplot as plt

# df is assumed to be a DataFrame with numerical columns 'X' and 'Y'
plt.scatter(df['X'], df['Y'], color='green', alpha=0.7)
plt.title('Scatter Plot of X vs Y')
plt.xlabel('X')
plt.ylabel('Y')
plt.show()
4. Pair Plot:

 Purpose:

 Displays scatter plots for multiple pairs of variables in a dataset.

 Usage:

 Identifies relationships and distributions across multiple variables simultaneously.

 Implementation:

 Diagonal shows the distribution of each variable, and off-diagonal plots show scatter plots.

 Example Code:

import seaborn as sns
import matplotlib.pyplot as plt

# df is assumed to contain several numerical columns plus a 'Category' column
sns.pairplot(df, hue='Category', diag_kind='kde')
plt.suptitle('Pair Plot of Variables', y=1.02)  # pairplot creates its own figure
plt.show()
5. Heatmap:

 Purpose:

 Visualizes the correlation matrix between numerical variables.

 Usage:

 Identifies relationships and multicollinearity between variables.

 Implementation:

 Cells are colored based on the strength and direction of correlation.

 Example Code:

import seaborn as sns
import matplotlib.pyplot as plt

# Correlation matrix of the numerical columns of df
correlation_matrix = df.corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()

6. Violin Plot:

 Purpose:

 Combines aspects of box plots and kernel density plots to show the distribution of a numerical
variable across different categories.

 Usage:

 Provides a compact way to compare distributions.

 Implementation:

 Consists of a series of vertical violin-shaped plots.

 Example Code:

import seaborn as sns
import matplotlib.pyplot as plt

# df is assumed to have a categorical 'Category' and a numerical 'Variable' column
sns.violinplot(x='Category', y='Variable', data=df, inner='quartile')
plt.title('Violin Plot of Variable by Category')
plt.show()
7. Bar Chart:

 Purpose:

 Displays the frequency or count of categorical variables.

 Usage:

 Provides a visual representation of category counts.

 Implementation:

 Bars represent the count or frequency of each category.

 Example Code:

import matplotlib.pyplot as plt

df['Category'].value_counts().plot(kind='bar', color='orange')

plt.title('Bar Chart of Category Counts')

plt.xlabel('Category')

plt.ylabel('Count')

plt.show()
Interpretation of Visualization

 Guidelines: Understand plot-specific conventions.

 Patterns: Identify trends, relationships, or groupings.

 Outliers: Spot data points deviating from the pattern.

Example: Scatter Plot Analysis:

 Pattern: Look for upward or downward trends.

 Outliers: Identify exceptional data points.

Interpreting visualizations helps extract insights and make informed decisions.


Scatter Plot Analysis

Example:

Scatter Plot Analysis – Relationship between Study Hours and Exam Scores

Data: Suppose you have collected data on the study hours and corresponding exam scores for a group of students:

Interpretation:

 Pattern: The scatter plot shows a clear upward trend, indicating a positive correlation between study hours
and exam scores. More study hours generally lead to better exam performance.

 Outliers: No significant outliers are observed; all data points align with the trend.

This interpretation highlights the positive relationship between study hours and exam scores, emphasizing the
benefits of increased study time on academic performance.
Correlation and Regression
Correlation – Measure of Association:

Correlation is a statistical measure that quantifies the degree and direction of a relationship between two variables. It
assesses how changes in one variable are associated with changes in another.

Linear Regression – Modeling Relationships:

Linear regression is a statistical technique used to model the relationship between a dependent variable (response)
and one or more independent variables (predictors). It assumes a linear relationship and helps predict the dependent
variable based on the independent variables.

Example: Calculating and Interpreting Correlation Coefficients:

Suppose you are analyzing the relationship between study hours and exam scores for a group of students. You want
to calculate and interpret the correlation coefficient between these two variables:

 Data: You have data on study hours (X) and corresponding exam scores (Y) for several students.

 Calculate Correlation: You calculate the correlation coefficient, which can be Pearson’s correlation coefficient
(r), Spearman’s rank correlation coefficient, or others, depending on the data characteristics.

Interpretation:

 If r is close to 1: Strong positive correlation; as study hours increase, exam scores tend to increase
significantly.

 If r is close to -1: Strong negative correlation; as study hours increase, exam scores tend to decrease
significantly.

 If r is close to 0: Weak or no linear correlation; study hours and exam scores are not strongly associated.

Correlation helps quantify the strength and direction of the relationship between study hours and exam scores.
Linear regression, on the other hand, can further model and predict exam scores based on study hours and may
provide insights into the extent of predictability.
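A minimal sketch of both measures with scipy; the study-hours data below are hypothetical, invented for illustration:

```python
from scipy import stats

# Hypothetical data: study hours (X) and exam scores (Y) for eight students
hours = [1, 2, 3, 4, 5, 6, 7, 8]
scores = [52, 55, 61, 64, 70, 72, 78, 85]

# Pearson correlation: strength and direction of the linear association
r, p_value = stats.pearsonr(hours, scores)

# Simple linear regression: model exam score as a linear function of study hours
fit = stats.linregress(hours, scores)

print(f"r = {r:.3f}, slope = {fit.slope:.2f}, intercept = {fit.intercept:.2f}")
```

Here the slope estimates how many additional points each extra hour of study predicts, which is the "extent of predictability" correlation alone does not give.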

__________________________________________________________________________
Confidence Interval
Confidence, in statistics, is another way to describe probability. For example, suppose you construct a confidence
interval with a 95% confidence level. In that case, you are confident that 95 out of 100 times the estimate will fall
between the upper and lower values specified by the confidence interval.
Your desired confidence level is usually one minus the alpha (α) value you used in your statistical test:

Confidence level = 1 − α

So, if you use an alpha value of 0.05 for statistical significance, then your confidence level would be 1 − 0.05 = 0.95, or 95%.

Calculation and Interpretation:

 Calculation: Confidence intervals are typically calculated using a formula that incorporates the sample
statistic, the standard error (a measure of variability), and a critical value from a statistical distribution (e.g.,
the normal distribution or t-distribution). The formula is:

CI = Sample Statistic ± Margin of Error

 Interpretation: A confidence interval consists of two parts:

 The point estimate (sample statistic), which is the best guess for the population parameter.

 The margin of error, which represents the range around the point estimate. It reflects the estimate’s
precision and is determined by the chosen confidence level.

For a z statistic, some of the most common values are shown in this table:

Confidence level          90%      95%      99%
Alpha for one-tailed CI   0.1      0.05     0.01
Alpha for two-tailed CI   0.05     0.025    0.005
z statistic               1.64     1.96     2.57
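The z values in the table can be reproduced with `scipy.stats.norm.ppf` (note the table's 2.57 is a truncation of 2.576):

```python
from scipy.stats import norm

# Critical z for a two-tailed confidence interval: z = ppf(1 - alpha/2)
for level in (0.90, 0.95, 0.99):
    alpha = 1 - level
    z = norm.ppf(1 - alpha / 2)
    print(f"{level:.0%} confidence: z = {z:.3f}")
```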

Example: Determining the Confidence Interval for the Mean Height of a Population

 Scenario: Imagine you want to estimate the average height of all adults in a country. You collect a sample of
heights from 100 individuals and calculate the sample mean height to be 170 cm with a standard deviation of
5 cm. You want to determine a 95% confidence interval for the mean height.

 Data: You collect a sample of heights from 100 individuals and calculate the sample mean height to be 170
cm with a standard deviation of 5 cm.

 Goal: Determine a 95% confidence interval for the mean height.

Step 1: Calculate the Margin of Error. To compute the margin of error for the confidence interval, you need two components:

 Critical Value: Using the t-distribution (since the population standard deviation is estimated from the sample) with a 95% confidence level and 99 degrees of freedom, the critical value is approximately 1.98; with n = 100 this is very close to the z value of 1.96, which we use here for simplicity. This critical value represents how many standard errors the margin of error should cover to achieve a 95% confidence level.

 Standard Error: The standard error measures the variability of the sample mean. It is the standard deviation divided by the square root of the sample size: 5 / √100 = 0.5 cm.

Margin of Error = 1.96 × 0.5 = 0.98 cm

Step 2: Constructing the Confidence Interval Now that you have the margin of error, you can construct the
confidence interval. The confidence interval formula is as follows:

CI = Sample Mean ± Margin of Error

Substituting the values:

CI = 170 cm ± 0.98 cm

Interpretation: This means you are 95% confident that the true average height of all adults in the country falls within
the range of 169.02 cm to 170.98 cm.

Explanation: By calculating a confidence interval, you provide a range of heights within which you believe the true
population average height is likely to be. In this case, with 95% confidence, you estimate that the average height of all
adults in the country lies between 169.02 cm and 170.98 cm. The margin of error (0.98 cm) accounts for the
uncertainty associated with estimating the population mean from a sample.
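The calculation can be sketched in a few lines using the figures from this example:

```python
import math

# Figures from the example: n = 100 heights, sample mean 170 cm, SD 5 cm
sample_mean, sd, n = 170.0, 5.0, 100
z = 1.96  # critical value for a 95% confidence level

standard_error = sd / math.sqrt(n)      # 5 / 10 = 0.5
margin_of_error = z * standard_error    # 1.96 * 0.5 = 0.98
ci = (sample_mean - margin_of_error, sample_mean + margin_of_error)

print(f"95% CI: {ci[0]:.2f} cm to {ci[1]:.2f} cm")
```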

Hypothesis Testing with Z-Test


Introduction to the Z-Test for Large Sample Sizes:

Z-Test: The Z-test is a statistical hypothesis test used to assess whether a sample mean is significantly different from a
known population mean when the sample size is sufficiently large. It relies on the standard normal distribution and is
suitable when the population standard deviation is known.

Comparison with the t-Test:

Comparison: The Z-test and t-test are both used for hypothesis testing, but they differ in their assumptions about the
population standard deviation and sample size:

 Z-Test: Assumes a known population standard deviation and is appropriate for large sample sizes (typically n
> 30).

 t-Test: Assumes an unknown population standard deviation and is suitable for smaller sample sizes.

Aspect            T-Test                                     Z-Test
Purpose           Compare means of two groups                Compare a sample mean to a known
                  (small samples)                            population mean (large samples)
Sample Size       Small (typically < 30)                     Large (typically ≥ 30)
Population SD     Unknown; estimated from the sample         Known, or estimated from a large sample
Distribution      Follows the t-distribution                 Follows the standard normal (z) distribution
Formula           t = (x̄ − μ) / (s / √n)                     z = (x̄ − μ) / (σ / √n)
Critical Values   Use t-distribution tables                  Use standard normal (z) distribution tables
Common Use Case   Comparing means when SD is unknown         Comparing a sample mean to a known
                                                             population mean
Example           Compare test scores of two groups          Compare the average height of a sample
                                                             to a known population mean

Steps in Z-test

 Step 1: Formulate Hypotheses

 Step 2: Set Significance Level

 Step 3: Calculate the Sample Mean and Standard Deviation

 Step 4: Set Up the Z-Test Statistic

 Step 5: Find Critical Value or P-Value

 Step 6: Make a Decision

Example

Using a Z-Test to Assess Average Weight Loss:

Imagine you’re evaluating the effectiveness of a weight loss program. You collect data on the weight loss of 100
participants who completed the program and want to test if, on average, participants lost a significant amount of
weight.

Here’s a sample dataset generated randomly:

import random

# Generate sample weight loss data (in pounds)

random.seed(42) # For reproducibility

weight_loss_data = [random.uniform(0.5, 10) for _ in range(100)]

1. Formulate Hypotheses

 Null Hypothesis (H0): The average weight loss in the program is equal to or less than zero (μ ≤ 0).

 Alternative Hypothesis (Ha): The average weight loss in the program is greater than zero (μ > 0).

2. Calculate the Test Statistic (Z-Score)

 Sample Mean (X̄): Calculate the sample mean of the weight loss data.

 Sample Standard Deviation (s): The population standard deviation is unknown here, so it is estimated from the sample; with n = 100, the z-test remains appropriate.

 Sample Size (n): 100 participants.

import numpy as np

# Calculate sample mean and sample standard deviation

sample_mean = np.mean(weight_loss_data)

sample_std = np.std(weight_loss_data, ddof=1) # Use Bessel's correction for sample std

# Calculate Z-score

population_mean = 0 # Hypothesized population mean

z_score = (sample_mean - population_mean) / (sample_std / np.sqrt(len(weight_loss_data)))

sample_mean, sample_std, z_score

 Sample Mean (X̄ ): Approximately 5.44 pounds

 Sample Standard Deviation (s): Approximately 2.51 pounds

 Z-Score (Z): Approximately 8.62


3. Determine the Critical Value

 Significance Level (α): Typically 0.05 (corresponding to a 95% confidence level).

 Note: The z-test uses the standard normal distribution, so, unlike the t-test, no degrees of freedom are involved.

from scipy.stats import norm

alpha = 0.05

# Critical z-value for a one-tailed test at significance level alpha
critical_value = norm.ppf(1 - alpha)

critical_value

Critical Value (Z_critical): Approximately 1.645 (for α = 0.05, one-tailed test)

4. Make a Decision

Compare the calculated Z-score with the critical value.

 If Z-Score > Critical Value, reject the null hypothesis (Ha).

 If Z-Score ≤ Critical Value, fail to reject the null hypothesis (H0).

# Make a decision

reject_null = z_score > critical_value

reject_null

Decision: Reject the null hypothesis (H0).

5. Interpret the Result

Interpret the decision in the context of the weight loss program.

 Decision: Reject H0.

 Interpretation: There is sufficient statistical evidence to conclude that, on average, participants in the weight
loss program experienced significant weight loss.
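As a follow-up sketch, the one-tailed p-value for the z-score above can be computed directly with `scipy.stats.norm.sf`:

```python
from scipy.stats import norm

# One-tailed p-value for the z-score computed above (z ≈ 8.62):
# the probability of a standard normal value at least this large under H0
z_score = 8.62
p_value = norm.sf(z_score)

print(f"p = {p_value:.2e}")
```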

Chi-Square Test for Categorical Data


Chi-Square Test for Independence:

 Chi-Square Test: The chi-square test is a statistical test used to assess the independence (or association)
between two categorical variables. It helps determine whether there is a significant relationship between the
variables or if they are independent.

 Independence: In the context of the chi-square test, independence means that the two categorical variables
are not related, and changes in one variable do not affect the distribution of the other variable.

Testing the Association Between Gender and Voting Preferences

Data: You have surveyed 200 voters in a local election to understand the association between their gender and voting
preferences. Here’s the dataset:


Using Python

import pandas as pd

import random

from scipy.stats import chi2_contingency

# Generate random data for 200 voters

random.seed(42) # For reproducibility

# Generate random gender data

gender = random.choices(['Male', 'Female'], k=200)

# Generate random voting preferences data

voting_preferences = random.choices(['Candidate A', 'Candidate B', 'Undecided'], k=200)

# Create a dataframe from the data

data = {
    'Gender': gender,
    'Voting_Preferences': voting_preferences
}

df = pd.DataFrame(data)

1. Hypotheses

You want to test whether there is a significant association between the gender of voters and their voting preferences.

 Null Hypothesis (H0): Gender and voting preferences are independent.

 Alternative Hypothesis (Ha): Gender and voting preferences are not independent; there is an association.

2. Create a Contingency Table


Create a contingency table that cross-tabulates the two categorical variables (gender and voting preferences):

contingency_table = pd.crosstab(df['Gender'], df['Voting_Preferences'])

contingency_table

3. Calculate the Chi-Square Statistic

Calculate the chi-square statistic, which measures the discrepancy between observed and expected frequencies.

from scipy.stats import chi2_contingency

# Calculate chi-square statistic, p-value, degrees of freedom, and expected frequencies

chi2, p, dof, expected = chi2_contingency(contingency_table)

chi2, p, dof, expected

 Chi-Square Statistic (χ²): Approximately 0.146

 P-Value (p): Approximately 0.93, well above any common significance level

 Degrees of Freedom (df): 2, from (rows − 1) × (columns − 1) = (2 − 1) × (3 − 1)

 Expected Frequencies: Computed from the row and column totals of the contingency table

4. Determine the Critical Value

Specify the desired level of significance (α) to determine the critical chi-square value. Let’s assume α = 0.05.

from scipy.stats import chi2

alpha = 0.05

critical_value = chi2.ppf(1 - alpha, dof)

critical_value

Critical Value (χ²_critical): Approximately 5.991 (for α = 0.05 and df = 2)

5. Make a Decision

Compare the calculated chi-square statistic with the critical value.

 If χ² > χ²_critical, reject the null hypothesis.

 If χ² ≤ χ²_critical, fail to reject the null hypothesis.

# Make a decision

reject_null = chi2 > critical_value

reject_null

Decision: Fail to reject the null hypothesis (H0).

There is insufficient evidence to conclude that there is a significant association between the gender of voters and
their voting preferences in this local election. Gender and voting preferences appear to be independent.
One-Way and Two-Way ANOVA
Introduction to Analysis of Variance (ANOVA):

Analysis of Variance (ANOVA): ANOVA is a statistical technique used to compare means among multiple groups or
treatments. It assesses whether there are any statistically significant differences between the means of the groups.

One-Way and Two-Way ANOVA:

 One-Way ANOVA: One-way ANOVA is used when there is one categorical independent variable (with two or
more levels or groups) and one continuous dependent variable. It tests whether there are any significant
differences between the means of the groups.

 Two-Way ANOVA: Two-way ANOVA is used when there are two categorical independent variables (factors)
and one continuous dependent variable. It assesses the interaction effect between the two factors and
whether they have a significant influence on the dependent variable.
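scipy has no built-in two-way ANOVA, so below is a minimal numpy sketch for a balanced two-factor design; the data (two teaching methods crossed with two class sizes, three scores per cell) are hypothetical:

```python
import numpy as np

# Hypothetical balanced design: 2 teaching methods x 2 class sizes, 3 scores per cell.
# data[i, j] holds the replicate scores for method i and class size j.
data = np.array([
    [[78, 82, 80], [85, 88, 84]],   # Method A: small class, large class
    [[74, 77, 75], [80, 79, 83]],   # Method B: small class, large class
], dtype=float)

a, b, r = data.shape                  # levels of factor A, factor B, replicates
grand = data.mean()
mean_a = data.mean(axis=(1, 2))       # level means of factor A
mean_b = data.mean(axis=(0, 2))       # level means of factor B
mean_cell = data.mean(axis=2)         # cell means

# Sums of squares for a balanced two-way layout
ss_a = b * r * np.sum((mean_a - grand) ** 2)
ss_b = a * r * np.sum((mean_b - grand) ** 2)
ss_ab = r * np.sum((mean_cell - mean_a[:, None] - mean_b[None, :] + grand) ** 2)
ss_w = np.sum((data - mean_cell[:, :, None]) ** 2)
ss_t = np.sum((data - grand) ** 2)    # equals ss_a + ss_b + ss_ab + ss_w

# F statistics: each effect's mean square divided by the within-cell mean square
ms_w = ss_w / (a * b * (r - 1))
f_a = (ss_a / (a - 1)) / ms_w                  # main effect of teaching method
f_b = (ss_b / (b - 1)) / ms_w                  # main effect of class size
f_ab = (ss_ab / ((a - 1) * (b - 1))) / ms_w    # interaction effect

print(f"F_A = {f_a:.2f}, F_B = {f_b:.2f}, F_AB = {f_ab:.2f}")
```

Each F statistic is then compared against the F distribution with its effect's degrees of freedom and the within-cell degrees of freedom, exactly as in the one-way case.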

Example

Performing a One-Way ANOVA to Compare Test Scores Across Different Schools

Formulate Hypotheses

 Null Hypothesis (H0): There are no significant differences in test scores among the three schools.

 Alternative Hypothesis (Ha): There are significant differences in test scores among the three schools.

Using Statistics (Mathematically)

Compute the grand mean x̄ of all 15 scores. Then SST = Σ (xᵢ − x̄)² over all scores; SSB = Σ nₖ (x̄ₖ − x̄)² over the k groups; SSW = Σₖ Σᵢ (xᵢₖ − x̄ₖ)². With k = 3 groups and N = 15 scores, df_between = k − 1 = 2 and df_within = N − k = 12, so MSB = SSB / 2, MSW = SSW / 12, and F = MSB / MSW.

Python Code for Above Calculations

import numpy as np

from scipy.stats import f

# Given data

school_1 = np.array([85, 88, 90, 92, 86])

school_2 = np.array([78, 82, 80, 85, 88])


school_3 = np.array([92, 94, 89, 88, 90])

# Step 1: Calculate Grand Mean

all_scores = np.concatenate([school_1, school_2, school_3])

grand_mean = np.mean(all_scores)

# Step 2: Calculate SST

sst = np.sum((all_scores - grand_mean)**2)

# Step 3: Calculate SSB

ssb = np.sum([len(school) * (np.mean(school) - grand_mean)**2 for school in [school_1, school_2, school_3]])

# Step 4: Calculate SSW

ssw = np.sum([(score - np.mean(school))**2 for school in [school_1, school_2, school_3] for score in school])

# Step 5: Degrees of Freedom

df_between = 3 - 1

df_within = len(all_scores) - 3

# Step 6: Calculate Mean Squares

msb = ssb / df_between

msw = ssw / df_within

# Step 7: Calculate F-statistic

f_statistic = msb / msw

# Step 8: Compare with Critical Value or P-value

p_value = 1 - f.cdf(f_statistic, df_between, df_within)

# Print results

print(f"F-statistic: {f_statistic}")

print(f"P-value: {p_value}")

alpha = 0.05

if p_value < alpha:

    print("Reject the null hypothesis: \nThere are significant differences in test scores among the three schools.")

else:

    print("Fail to reject the null hypothesis: \nThere are no significant differences in test scores among the three schools.")

Using Python Libraries

 Perform the One-Way ANOVA

We’ll use the scipy.stats library to perform the one-way ANOVA.

from scipy.stats import f_oneway

import pandas as pd

# Build a DataFrame with one row per test score and its school label

df = pd.DataFrame({
    'School': ['School_1'] * 5 + ['School_2'] * 5 + ['School_3'] * 5,
    'Test_Score': [85, 88, 90, 92, 86, 78, 82, 80, 85, 88, 92, 94, 89, 88, 90]
})

# Group the data by school

groups = [df['Test_Score'][df['School'] == school] for school in df['School'].unique()]

# Perform one-way ANOVA

f_statistic, p_value = f_oneway(*groups)

f_statistic, p_value

 F-Statistic: Approximately 8.4832

 P-Value: Approximately 0.0051

 Determine the Critical Value


Specify the desired level of significance (α) to determine the critical F-value. Let’s assume α = 0.05.
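
Rather than looking it up in a table, the critical F-value can be computed directly from the F distribution's quantile function:

```python
from scipy.stats import f

alpha = 0.05
df_between, df_within = 2, 12  # 3 groups, 15 observations in total

# Critical F-value: the point with upper-tail probability alpha
critical_value = f.ppf(1 - alpha, df_between, df_within)
print(critical_value)  # approximately 3.885
```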

 Make a Decision
Compare the calculated F-statistic with the critical F-value.

alpha = 0.05 # Significance level

critical_value = 3.8853 # Critical F-value for α = 0.05 and df = (2, 12)

# Make a decision

reject_null = f_statistic > critical_value

reject_null

 Interpret the Result


Interpret the decision in the context of test scores among the schools.

 Decision: Reject or fail to reject H0.

 Interpretation: Based on the analysis, we reject the null hypothesis (H0). There are significant
differences in test scores among the three schools.
Complete Python Code for the above Process

import scipy.stats as stats

import numpy as np

# Data: Test scores for three schools

school_1 = np.array([85, 88, 90, 92, 86])

school_2 = np.array([78, 82, 80, 85, 88])

school_3 = np.array([92, 94, 89, 88, 90])

# Perform one-way ANOVA

statistic, p_value = stats.f_oneway(school_1, school_2, school_3)

# Print results

print(f"One-Way ANOVA Statistic: {statistic}")

print(f"P-value: {p_value}")

# Check significance at a 0.05 significance level

alpha = 0.05

if p_value < alpha:

    print("\nReject the null hypothesis: There are significant differences in test scores among the three schools.")

else:

    print("\nFail to reject the null hypothesis: There are no significant differences in test scores among the three schools.")

Testing

1. Question 1: You are conducting a one-sample t-test with the following information:

 Population mean = 70

 Sample mean = 75

 Sample standard deviation = 10

 Sample size = 30

What is the t-statistic?

 2.7

 3.0

 2.0

 1.5

2. Question 2: In a chi-square test of independence, you have a contingency table with 3 rows and 4 columns.
How many degrees of freedom are there?

 3

 4

 7

 9

3. Question 3: You perform a two-sample t-test to compare the means of two groups. The p-value obtained is
0.034. If you set a significance level (alpha) of 0.05, what is your decision?

 Reject the null hypothesis.

 Fail to reject the null hypothesis.

 Accept the null hypothesis.

 Perform a z-test.

4. Question 4: You are analyzing the correlation between two variables. The correlation coefficient (r) you
calculated is -0.85. What does this indicate about the relationship between the variables?

 Strong positive correlation.

 No correlation.

 Strong negative correlation.

 Weak positive correlation.


5. Question 5: You conduct a one-way ANOVA test with three groups. The calculated F-statistic is 4.12. What is
the critical F-value at a 5% significance level (alpha = 0.05)?

 2.42

 3.88

 3.24

 5.62

6. Question 6: In a hypothesis test, you calculate a z-statistic of -1.98. What is the corresponding p-value if you
are conducting a two-tailed test?

 0.0244

 0.4761

 0.9522

 0.9768

7. Question 7: You are constructing a 95% confidence interval for a population mean. If the sample size is 50
and the standard error is 4, what is the margin of error?

 1.96

 0.08

 0.98

 7.84

8. Question 8: In a hypothesis test, you calculate a t-statistic of -2.15 with 15 degrees of freedom. What is the
corresponding p-value for a two-tailed test?

 0.0246

 0.0492

 0.1048

 0.2096

9. Question 9: You are conducting a chi-square test of independence with 4 rows and 3 columns in the
contingency table. What is the total degree of freedom for this test?

 7

 9

 12

 15

10. Question 10: You perform a paired-sample t-test and obtain a t-statistic of 3.42. If you have 29 pairs of data,
what is the degrees of freedom for this test?

 28

 29

 30

 58
11. You are conducting a hypothesis test to compare the means of two independent groups. The test statistic you
calculate is -2.87. If you set a significance level (alpha) of 0.05, what is your decision?

 Reject the null hypothesis.

 Fail to reject the null hypothesis.

 Accept the alternative hypothesis.

 Conduct a two-tailed test.

12. In a one-sample t-test, you are testing whether the population mean is greater than 50. If the calculated t-
statistic is 1.96 and the sample size is 30, what is the p-value for a one-tailed test?

 0.0274

 0.0548

 0.0456

 0.1096
