02 Exploratory Data Analytics
Exploratory Data Analysis (EDA) is a critical step in the data science process.
It is the foundation for understanding and interpreting complex data sets.
EDA helps data scientists identify patterns, spot anomalies, test
hypotheses, and check assumptions through various statistical and
graphical techniques.
What is Statistics?
Definition:
Statistics is a branch of mathematics that focuses on
collecting, organizing, analyzing, interpreting, and presenting
data. It helps researchers, analysts, and decision-makers
make informed decisions based on data.
Importance of Statistics:
● Nominal Data:
Cannot calculate mean or standard deviation.
Suitable for mode and frequency counts.
● Ordinal Data:
Median and range can be calculated.
Can be treated as metric data for some analyses,
but with caution.
● Metric Data:
Mean, standard deviation, and advanced hypothesis
tests are suitable.
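The pairing of data type and suitable statistic listed above can be sketched in Python. This is a minimal illustration only: the three small datasets and their variable names are invented for the example and do not come from the course notebooks.

```python
import statistics

# Hypothetical survey data, invented for illustration only.
blood_type = ["A", "O", "B", "O", "AB", "O", "A"]   # nominal: labels without order
satisfaction = [1, 3, 2, 5, 4, 3, 3]                # ordinal: 1 = low ... 5 = high
height_cm = [162.0, 175.5, 168.2, 181.0, 170.3]     # metric: measured values

# Nominal data: mode and frequency counts are meaningful, a mean is not.
nominal_mode = statistics.mode(blood_type)
nominal_counts = {label: blood_type.count(label) for label in set(blood_type)}

# Ordinal data: median and range respect the ordering.
ordinal_median = statistics.median(satisfaction)
ordinal_range = (min(satisfaction), max(satisfaction))

# Metric data: mean and standard deviation are suitable.
metric_mean = statistics.mean(height_cm)
metric_sd = statistics.stdev(height_cm)  # sample standard deviation

print("mode:", nominal_mode, "counts:", nominal_counts)
print("median:", ordinal_median, "range:", ordinal_range)
print(f"mean: {metric_mean:.1f} cm, sd: {metric_sd:.2f} cm")
```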
notebook2_visualization.ipynb
Distribution Analysis
Part of Descriptive Analysis
Understanding Population and Sample
Population:
A population includes all elements (individuals, items,
or data points) that meet a specific criterion for a study.
It represents the entire group you want to draw
conclusions about.
Example: For a study on the average height of adult
men in a city, the population is all adult men in that city.
Characteristics:
Populations can be finite (e.g., all employees in a
company) or infinite (e.g., all possible rolls of a die).
Often impractical to collect data from the entire
population.
Sample:
A sample is a subset of the population selected for
analysis to draw conclusions about the population
without analyzing every member.
Example: Measuring the height of 200 randomly
selected adult men from the city as a
representative sample.
Characteristics:
A sample should be representative of the
population to ensure accurate conclusions.
Sampling methods (random, stratified, etc.) impact
the reliability of results.
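The height example can be sketched as a simple random sample drawn from a simulated population. The population here is generated artificially (a normal distribution with an assumed mean of 175 cm and standard deviation of 7 cm), so the numbers are illustrative, not real measurements.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: simulated heights (cm) of 50,000 adult men.
population = [random.gauss(175.0, 7.0) for _ in range(50_000)]

# Simple random sample of 200 men, drawn without replacement.
sample = random.sample(population, k=200)

population_mean = statistics.fmean(population)
sample_mean = statistics.fmean(sample)

print(f"population mean: {population_mean:.2f} cm")
print(f"sample mean:     {sample_mean:.2f} cm")
```

With a representative sample, the sample mean should land close to the population mean even though only a small fraction of the population was measured.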
Population vs. Sample: Key Differences
Population:
Sample:
Purpose of Sampling:
Measures of Central Tendency
Definition:
Measures of central tendency summarize a dataset by
identifying the center or typical value of the data points.
Median:
The median is the middle value of a dataset when it’s
ordered from lowest to highest.
Calculation Steps:
Order the data.
If odd number of data points, median is the middle value.
If even number of data points, median is the average of the two middle values.
Characteristics:
Range:
The range is the difference between the highest and
lowest values.
Standard Deviation:
Standard deviation is the square root of the variance,
providing a measure of spread in the same units as the data.
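A short sketch of both measures, using a small invented dataset. The slide does not say whether the sample or the population variance is meant, so the sample version (dividing by n - 1) is assumed here.

```python
import math
import statistics

# Hypothetical data points, invented for illustration.
values = [4, 8, 15, 16, 23, 42]

# Range: highest value minus lowest value.
data_range = max(values) - min(values)

# Sample variance (divides by n - 1) and its square root.
sample_variance = statistics.variance(values)
sample_sd = statistics.stdev(values)

print("range:", data_range)
print("variance:", sample_variance)
print(f"standard deviation: {sample_sd:.3f}")
```

The variance is in squared units, while the standard deviation is back in the units of the data, which is why it is usually the one reported.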
Requirement: Data must be ordered from smallest to largest for
quartile calculation.
Second Quartile (Q2)
Definition: The median, or midpoint of the dataset.
Insight: Divides data such that 50% of values are smaller, 50%
are larger.
Formula:
IQR = Q3 − Q1
Significance:
Unlike the full range, it’s less affected by outliers or extreme values.
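To see why the IQR is more robust than the range, the sketch below computes both for an invented dataset, with and without one extreme value. Quartile conventions differ between tools; the "inclusive" method of Python's statistics.quantiles is used here as one reasonable choice, not the only one.

```python
import statistics

# Hypothetical data, invented for illustration; the second list swaps
# the largest value for an extreme outlier.
scores = [1, 3, 5, 7, 9, 11, 13, 15, 17]
scores_with_outlier = [1, 3, 5, 7, 9, 11, 13, 15, 170]

def spread_summary(values):
    """Return quartiles, IQR = Q3 - Q1, and the full range."""
    ordered = sorted(values)  # quartiles require ordered data
    q1, q2, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return {
        "q1": q1,
        "q2": q2,
        "q3": q3,
        "iqr": q3 - q1,
        "range": ordered[-1] - ordered[0],
    }

clean = spread_summary(scores)
noisy = spread_summary(scores_with_outlier)

print("without outlier:", clean)
print("with outlier:   ", noisy)
```

The outlier inflates the range from 16 to 169, while the quartiles and the IQR stay the same.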
Definition :
Purpose:
Importance:
Shape:
Characteristics:
Example:
Right-Skewed (Positive)
Left-Skewed (Negative)
Definition:
Characteristics:
Example:
Visual:
Definition:
Characteristics:
Example:
Definition:
Characteristics:
Example:
Probability of 5 customer call arrivals in an hour,
given a historical average of 3 calls per hour.
For More Types and Details Check
notebook3_Distribution_Analysis.ipynb
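The call-arrival example can be worked out numerically. The sketch below assumes the calls follow a Poisson distribution with rate λ = 3 per hour; the distribution name is not preserved in the text above, so treat that as an assumption. The helper poisson_pmf is written for this illustration.

```python
import math

def poisson_pmf(k: int, lam: float) -> float:
    """P(X = k) for a Poisson variable with mean lam: e^(-lam) * lam^k / k!"""
    return math.exp(-lam) * lam**k / math.factorial(k)

average_calls_per_hour = 3.0  # historical average from the example
p_five_calls = poisson_pmf(5, average_calls_per_hour)

print(f"P(exactly 5 calls in an hour) = {p_five_calls:.4f}")
```

Under that assumption the probability comes out at roughly 0.10, so about one hour in ten would see exactly 5 calls.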
Hypothesis Testing
Part of Inferential Statistics
Inferential statistics is a branch of statistics that
uses sample data and various analytical tools to make
predictions or inferences about a larger population.
Purpose:
Allows us to generalize findings from a small
group to the whole population without needing
data from everyone.
Example: Men vs. women in salary
comparison.
Validate Claims: e.g., Effectiveness of a new drug.
Reduce Uncertainty: Provides data-driven
conclusions.
What is a Hypothesis
Hypothesis: An initial assumption or prediction
about a population parameter.
Objective: Use data to either reject or support
the hypothesis.
Example: Testing the hypothesis that "men
earn more than women" by analyzing a sample.
Types of Hypotheses
Example: Men earn more than women.
Alternate Hypothesis (H₁ or Hₐ):
Contradicts the null; represents the claim being
tested.
Example:
Scenario: A factory claims the average length of a bolt it produces is 5 cm. We collect a sample of 50 bolts and find an average length of 5.1
cm, with a population standard deviation of 0.2 cm.
Hypotheses:
● Null Hypothesis (H₀): The average length of bolts is 5 cm (μ = 5).
● Alternate Hypothesis (H₁): The average length of bolts is not 5 cm (μ ≠ 5).
Interpretation: If the absolute value of the Z-score exceeds the critical value, we reject the null hypothesis, suggesting the average bolt length
is different from 5 cm.
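The bolt example can be computed directly from the numbers in the scenario. The significance level is not stated in the text, so α = 0.05 (two-tailed) is assumed below.

```python
import math
from statistics import NormalDist

claimed_mean = 5.0    # H0: average bolt length in cm
sample_mean = 5.1     # observed average of the sample
population_sd = 0.2   # known population standard deviation
n = 50                # sample size
alpha = 0.05          # assumed significance level (not given in the scenario)

# Z = (sample mean - claimed mean) / (sigma / sqrt(n))
standard_error = population_sd / math.sqrt(n)
z_score = (sample_mean - claimed_mean) / standard_error

# Two-tailed critical value and p-value from the standard normal distribution.
z_critical = NormalDist().inv_cdf(1 - alpha / 2)
p_value = 2 * (1 - NormalDist().cdf(abs(z_score)))
reject_null = abs(z_score) > z_critical

print(f"z = {z_score:.3f}, critical = {z_critical:.3f}, p = {p_value:.5f}")
print("reject H0" if reject_null else "fail to reject H0")
```

Here the Z-score is about 3.54, well beyond the critical value of about 1.96, so under the assumed α the null hypothesis is rejected.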
…
One-Sample T-Test
When to Use:
● Small sample size (typically n < 30).
● Unknown population variance.
● Comparing the sample mean to a known population mean.
Example:
Scenario: A company claims the average weight of a product is 50 grams. We test a sample of 10 products and find an
average weight of 49 grams.
Hypotheses:
● Null Hypothesis (H₀): The average product weight is 50 grams.
● Alternate Hypothesis (H₁): The average product weight is different from 50 grams.
Interpretation: If the absolute value of the T-score exceeds the critical value, we reject the null hypothesis, indicating a significant
difference in the average weight.
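The scenario gives the sample mean but not the individual weights or their spread, so the sketch below uses ten hypothetical weights chosen to average 49 grams. It also assumes α = 0.05 (two-tailed), for which the t-table critical value with 9 degrees of freedom is 2.262.

```python
import math
import statistics

claimed_mean = 50.0  # H0: average product weight in grams

# Hypothetical weights (grams), invented so that the sample mean is 49.
weights = [48.5, 49.2, 50.1, 48.0, 49.5, 48.8, 49.9, 48.3, 49.0, 48.7]

n = len(weights)
sample_mean = statistics.fmean(weights)
sample_sd = statistics.stdev(weights)  # sample SD, since sigma is unknown

# T = (sample mean - claimed mean) / (s / sqrt(n)), with n - 1 degrees of freedom.
t_score = (sample_mean - claimed_mean) / (sample_sd / math.sqrt(n))

t_critical = 2.262  # t-table value: two-tailed, alpha = 0.05, df = 9 (assumed alpha)
reject_null = abs(t_score) > t_critical

print(f"mean = {sample_mean:.2f}, sd = {sample_sd:.3f}, t = {t_score:.3f}")
print("reject H0" if reject_null else "fail to reject H0")
```

With real data, the conclusion depends on the actual spread of the weights; a larger sample standard deviation would shrink the T-score.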
…
Two-Sample T-Test (Independent Samples)
When to Use:
Example:
Scenario: A study compares the test scores of students taught with two different methods. Sample 1 has an average score
of 80, and Sample 2 has an average score of 85.
Hypotheses:
● Null Hypothesis (H₀): The mean scores of both groups are the same.
● Alternate Hypothesis (H₁): The mean scores of both groups are different.
Interpretation: If the absolute value of the T-score exceeds the critical value, we reject the null hypothesis, suggesting a significant
difference between the teaching methods.
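The scenario reports only the two averages, so the sketch below invents eight scores per group that average 80 and 85. It uses the pooled-variance (equal variances assumed) form of the test and α = 0.05, for which the two-tailed t-table value with 14 degrees of freedom is 2.145. Both choices are assumptions, not something the text specifies.

```python
import math
import statistics

# Hypothetical test scores, invented so the group means are 80 and 85.
method_a = [78, 82, 75, 85, 80, 79, 83, 78]
method_b = [88, 84, 81, 90, 86, 83, 87, 81]

n_a, n_b = len(method_a), len(method_b)
mean_a, mean_b = statistics.fmean(method_a), statistics.fmean(method_b)
var_a, var_b = statistics.variance(method_a), statistics.variance(method_b)

# Pooled variance: weighted average of the two sample variances.
df = n_a + n_b - 2
pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / df

# T = difference in means / standard error of the difference.
standard_error = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
t_stat = (mean_a - mean_b) / standard_error

t_critical = 2.145  # t-table value: two-tailed, alpha = 0.05, df = 14 (assumed alpha)
reject_null = abs(t_stat) > t_critical

print(f"t = {t_stat:.3f} with {df} degrees of freedom")
print("reject H0" if reject_null else "fail to reject H0")
```

If the two groups have clearly different variances, Welch's version of the test is the usual alternative.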
…
Paired T-Test (Dependent Samples)
When to Use:
Example:
Hypotheses:
● Null Hypothesis (H₀): The mean blood pressure before treatment equals the mean after treatment.
● Alternate Hypothesis (H₁): The mean blood pressure differs before and after treatment.
Interpretation: If the T-score is significant, we reject the null hypothesis, indicating the treatment had a significant effect on
blood pressure.
…
Chi-Square Test
When to Use:
● Categorical data.
● Testing association or independence between two variables.
Example:
Scenario: A survey examines whether people’s choice of drink (tea or coffee) is related to their age group (young or old).
Hypotheses:
● Null Hypothesis (H₀): There is no association between age group and drink preference.
● Alternate Hypothesis (H₁): There is an association between age group and drink preference.
Interpretation: If the Chi-square statistic is significant, we reject the null hypothesis, suggesting an association between
age and drink preference.
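The survey example can be sketched with a 2×2 table of counts. The counts below are invented for illustration, α = 0.05 is assumed (critical value 3.841 for 1 degree of freedom), and no continuity correction is applied.

```python
# Hypothetical survey counts, invented for illustration.
# Rows: age group (young, old); columns: drink (tea, coffee).
observed = [
    [30, 20],  # young
    [15, 35],  # old
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count under independence: row total * column total / grand total.
chi_square = 0.0
for i, row in enumerate(observed):
    for j, count in enumerate(row):
        expected = row_totals[i] * col_totals[j] / grand_total
        chi_square += (count - expected) ** 2 / expected

df = (len(observed) - 1) * (len(observed[0]) - 1)
chi_critical = 3.841  # chi-square table value: alpha = 0.05, df = 1 (assumed alpha)
reject_null = chi_square > chi_critical

print(f"chi-square = {chi_square:.3f} with {df} degree(s) of freedom")
print("reject H0" if reject_null else "fail to reject H0")
```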
…
One-Way ANOVA
When to Use:
Example:
Scenario: Comparing average test scores among students taught with three different methods.
Hypotheses:
Interpretation: If the ANOVA F-statistic is significant, we reject the null hypothesis, indicating a difference in mean test
scores among the methods.
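A minimal sketch of the F-statistic for the three-method example, computed from the between-group and within-group sums of squares. The scores are hypothetical, and α = 0.05 is assumed, for which the F-table critical value with (2, 12) degrees of freedom is about 3.89.

```python
import statistics

# Hypothetical test scores for three teaching methods, invented for illustration.
groups = {
    "method_1": [70, 72, 68, 75, 70],
    "method_2": [80, 78, 82, 79, 81],
    "method_3": [74, 76, 73, 77, 75],
}

all_scores = [score for scores in groups.values() for score in scores]
grand_mean = statistics.fmean(all_scores)

# Between-group sum of squares: how far each group mean is from the grand mean.
ss_between = sum(
    len(scores) * (statistics.fmean(scores) - grand_mean) ** 2
    for scores in groups.values()
)

# Within-group sum of squares: spread of scores around their own group mean.
ss_within = sum(
    (score - statistics.fmean(scores)) ** 2
    for scores in groups.values()
    for score in scores
)

df_between = len(groups) - 1
df_within = len(all_scores) - len(groups)
f_stat = (ss_between / df_between) / (ss_within / df_within)

f_critical = 3.89  # F-table value: alpha = 0.05, df = (2, 12) (assumed alpha)
reject_null = f_stat > f_critical

print(f"F = {f_stat:.3f} with ({df_between}, {df_within}) degrees of freedom")
print("reject H0" if reject_null else "fail to reject H0")
```

A significant F only says that at least one group mean differs; a post-hoc comparison would be needed to say which one.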
…
Two-Way ANOVA
When to Use:
Example:
Scenario: Examining the effect of teaching method and study hours on test scores.
Hypotheses:
● Null Hypothesis (H₀): No difference in means due to teaching method, study hours, or their interaction.
● Alternate Hypothesis (H₁): There is a difference in means due to one or more of these factors.
Interpretation: If the F-statistic is significant, we reject the null hypothesis, indicating effects due to the factors or their
interaction.
…
Repeated Measures ANOVA
When to Use:
Example:
Scenario: Measuring student performance on a test given at the beginning, middle, and end of a course.
Hypotheses:
● Null Hypothesis (H₀): No difference in mean test scores across time points.
● Alternate Hypothesis (H₁): There is a difference in mean test scores over time.
Interpretation: If the F-statistic is significant, we reject the null hypothesis, suggesting a change in scores over time.
…
MANOVA (Multivariate ANOVA)
When to Use:
Example:
Scenario: Testing the effect of a new teaching strategy on both math and science scores.
Hypotheses:
● Null Hypothesis (H₀): No difference in the mean of both math and science scores due to the strategy.
● Alternate Hypothesis (H₁): At least one of the means differs.
Interpretation: If the MANOVA test is significant, we reject the null hypothesis, indicating an effect on one or both
outcomes.