02 Exploratory Data Analytics

Exploratory Data Analysis (EDA) is essential for understanding complex data sets, identifying patterns, and testing hypotheses using statistical techniques. Statistics, a branch of mathematics, aids in data collection, analysis, interpretation, and presentation, supporting informed decision-making across various fields. The document also covers descriptive statistics, data visualization methods, hypothesis testing, and the importance of understanding populations and samples in data analysis.


Exploratory Data Analytics
Exploratory Data Analysis is a critical step in the data science process.
It is the foundation for understanding and interpreting complex data sets.
EDA helps data scientists identify patterns, spot anomalies, test
hypotheses, and check assumptions through various statistical and
graphical techniques.
What is Statistics?

Definition:
Statistics is a branch of mathematics that focuses on collecting, analyzing, interpreting, presenting, and organizing data. It helps researchers, analysts, and decision-makers make informed decisions based on data.

Key Components of Statistics:
● Data Collection: Gathering data through surveys, experiments, or observations.
● Data Analysis: Using mathematical methods like descriptive and inferential statistics to analyze data.
● Data Interpretation: Drawing conclusions and making predictions based on data analysis.
● Data Presentation: Communicating the findings effectively through charts, graphs, and reports.

Importance of Statistics:
● Helps in Understanding Trends and Patterns: Identifying patterns in data to predict future events or behaviors.
● Supports Decision-Making: In fields like business, healthcare, social sciences, and more, statistics provide the tools necessary for data-driven decisions.
● Provides Tools for Testing Hypotheses: Statistical tests allow for validating scientific research and drawing accurate conclusions.
Descriptive Statistics

Definition:
"Descriptive statistics covers statistical methods for describing data using statistical characteristics, charts, graphics, or tables."

Purpose:
● Graph the data.
● Calculate the mean and other measures.
● Get an overview of the distribution of the data.

Different Key Figures, Tables, and Graphics:
Depending on the question and available measurement scale, various statistical tools are used for evaluation.

The level of measurement of a variable can be nominal, ordinal, or metric. It influences which statistical analyses and hypothesis tests are appropriate.

Types of Measurement Levels:
● Nominal Variables: Values can be differentiated (e.g., categories like gender, color).
● Ordinal Variables: Values can be sorted (e.g., rankings, satisfaction levels).
● Metric Scale (Interval or Ratio): Distances between values can be calculated (e.g., age, height, temperature).
● Categorical Variables: Nominal and ordinal variables are often referred to as categorical variables.
Choosing the Right Statistical Analysis

● Nominal Data: Cannot calculate mean or standard deviation. Suitable for mode and frequency counts.
● Ordinal Data: Median and range can be calculated. Can be treated as metric data for some analyses, but with caution.
● Metric Data: Mean, standard deviation, and advanced hypothesis tests are suitable.

Effect on Data Visualization:
Different measurement levels suggest different types of data visualization.
● Nominal: Pie charts, bar charts.
● Ordinal: Box plots, ordered bar charts.
● Metric: Histograms, scatter plots.

For More Details Check notebook1_level_of_measurement.ipynb
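
As an illustration of how the measurement level drives the choice of summary statistic, here is a minimal Python sketch (the small DataFrame is hypothetical and not taken from the notebook; pandas is assumed to be installed):

import pandas as pd

# Hypothetical survey data with one variable per measurement level
df = pd.DataFrame({
    "color": ["red", "blue", "red", "green", "red"],          # nominal
    "satisfaction": [1, 3, 2, 3, 2],                          # ordinal (1 = low, 3 = high)
    "height_cm": [172.0, 181.5, 165.2, 190.1, 175.4],         # metric
})

# Nominal: only mode and frequency counts are meaningful
print(df["color"].mode().iloc[0])          # most frequent category
print(df["color"].value_counts())          # frequency table

# Ordinal: median and range are appropriate
print(df["satisfaction"].median())
print(df["satisfaction"].max() - df["satisfaction"].min())

# Metric: mean and standard deviation are appropriate
print(df["height_cm"].mean())
print(df["height_cm"].std())               # sample standard deviation (ddof=1)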
Visualization
Part of Descriptive Statistics
Descriptive statistics refers to a branch of
statistics that involves summarizing, organizing,
and presenting data meaningfully and concisely.
Bar Chart:
Input Data: Categorical and numerical data.
Purpose: Visualize and compare the values of different categories.
Library: Matplotlib, Seaborn, Plotly.

Pie Chart:
Input Data: Parts of a whole (percentages or proportions).
Purpose: Show the composition of a whole and the relationship between parts.
Library: Matplotlib, Seaborn, Plotly.

Box Plot:
Input Data: Numerical data, typically grouped by categories.
Purpose: Display the distribution, variability, and outliers of a dataset across different categories.
Library: Matplotlib, Seaborn.

Histogram with Density Lines:
Input Data: Numerical data.
Purpose: Display the distribution of data by grouping it into bins and showing the frequency of occurrences within each bin.
Library: Matplotlib, Seaborn.

Heatmap:
Input Data: Two-dimensional numerical data.
Purpose: Visualize the intensity of data values using colors, typically used for correlation matrices or displaying relationships in matrices.
Library: Matplotlib, Seaborn.

Scatterplot:
Input Data: Paired numerical data.
Purpose: Show the relationship or correlation between two variables by plotting points on a two-dimensional plane.
Library: Matplotlib, Seaborn, Plotly.

Lineplot:
Input Data: Numerical data over time or continuous variables.
Purpose: Show trends, patterns, or relationships between variables over a continuous interval.
Library: Matplotlib, Seaborn, Plotly.

For More Details on Visualization Check notebook2_visualization.ipynb
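
Below is a small sketch of a few of these plots with Matplotlib and Seaborn (the data is randomly generated for illustration and is not from the notebook):

import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

rng = np.random.default_rng(42)

# Hypothetical data: a category column and a numeric column
categories = rng.choice(["A", "B", "C"], size=200)
values = rng.normal(loc=50, scale=10, size=200)

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Bar chart: mean value per category
means = [values[categories == c].mean() for c in ["A", "B", "C"]]
axes[0].bar(["A", "B", "C"], means)
axes[0].set_title("Bar chart")

# Histogram with a density line
sns.histplot(values, bins=20, kde=True, ax=axes[1])
axes[1].set_title("Histogram + density")

# Box plot grouped by category
sns.boxplot(x=categories, y=values, ax=axes[2])
axes[2].set_title("Box plot")

plt.tight_layout()
plt.show()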
Distribution Analysis
Part of Descriptive Analysis
Understanding Population and Sample

Population:
A population includes all elements (individuals, items, or data points) that meet a specific criterion for a study. It represents the entire group you want to draw conclusions about.
Example: For a study on the average height of adult men in a city, the population is all adult men in that city.
Characteristics:
● Populations can be finite (e.g., all employees in a company) or infinite (e.g., all possible rolls of a die).
● It is often impractical to collect data from the entire population.

Sample:
A sample is a subset of the population selected for analysis to draw conclusions about the population without analyzing every member.
Example: Measuring the height of 200 randomly selected adult men from the city as a representative sample.
Characteristics:
● A sample should be representative of the population to ensure accurate conclusions.
● Sampling methods (random, stratified, etc.) impact the reliability of results.
Population vs. Sample: Key Differences

Population:
Complete set, often large or infinite.
Example: All adult men in a city.

Sample:
A subset of the population.
Example: 200 randomly selected adult men.

Purpose of Sampling:
Sampling allows for efficient data collection and generalization to the population without needing data from every individual.
Measures of Central Tendency

Definition:
Measures of central tendency summarize a dataset by identifying the center or typical value of the data points.

Importance:
Provides insight into the general trend or average behavior of the dataset.

Mean (Average):
The mean is the sum of all data points divided by the number of points.
Example: The mean of [10, 20, 30, 40] is 25.

Median:
The median is the middle value of a dataset when it is ordered from lowest to highest.
Calculation Steps:
● Order the data.
● If there is an odd number of data points, the median is the middle value.
● If even, the median is the average of the two middle values.
Example: The median of [10, 20, 30, 40, 50] is 30. The median of [10, 20, 30, 40] is 25.

Mode:
The mode is the value(s) that appear most frequently in a dataset.
Characteristics:
● A dataset can be unimodal, bimodal, or multimodal.
● Useful for categorical data.
Example: The mode of [5, 7, 8, 8, 10, 12] is 8.
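
A minimal sketch of these measures in Python, reusing the example lists above (NumPy and the standard library only):

import numpy as np
from statistics import multimode

data = [10, 20, 30, 40]

print(np.mean(data))                     # 25.0
print(np.median(data))                   # 25.0 (average of the two middle values)
print(np.median([10, 20, 30, 40, 50]))   # 30.0
print(multimode([5, 7, 8, 8, 10, 12]))   # [8]; multimode returns all modes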


Measures of Dispersion

Definition:
Measures of dispersion indicate the spread or variability of data around the central tendency.

Importance:
Helps understand the distribution and consistency of data points.

Range:
The range is the difference between the highest and lowest values.
Example: The range of [5, 10, 15, 20] is 20 − 5 = 15.

Variance:
Variance measures the average squared deviation of each number from the mean.
For a sample: s² = Σ(xᵢ − x̄)² / (n − 1)
For a population: σ² = Σ(xᵢ − μ)² / N
Note: Dividing by n − 1 for a sample (Bessel's correction) reduces bias in the estimate, since a sample carries less information than the full population.
Example: The (population) variance of [10, 20, 30, 40] is 125.

Standard Deviation:
Standard deviation is the square root of the variance, providing a measure of spread in the same units as the data.
Example: [10, 20, 30, 40] has a (population) standard deviation of approximately 11.18.
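
The same numbers can be checked with NumPy; the ddof argument switches between the population and sample formulas (a small sketch):

import numpy as np

data = [10, 20, 30, 40]

print(np.var(data))            # 125.0   -> population variance (divide by N)
print(np.var(data, ddof=1))    # ~166.67 -> sample variance (divide by n - 1)
print(np.std(data))            # ~11.18  -> population standard deviation
print(np.std(data, ddof=1))    # ~12.91  -> sample standard deviation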
Quartiles

What are Quartiles?
Definition: Quartiles divide a dataset into four equal parts, each containing 25% of the data.
Purpose: Provides a way to summarize the data distribution and gain insight into variability.
Requirement: Data must be ordered from smallest to largest for quartile calculation.

Different Quartiles:

First Quartile (Q1):
Definition: Middle value between the minimum and the median.
Percentile: Represents the 25th percentile.
Insight: Shows the lower range of the dataset; 25% of values fall below Q1.

Second Quartile (Q2):
Definition: The median, or midpoint of the dataset.
Percentile: Represents the 50th percentile.
Insight: Divides the data so that 50% of values are smaller and 50% are larger.

Third Quartile (Q3):
Definition: Middle value between the median and the maximum.
Percentile: Represents the 75th percentile.
Insight: Shows the upper range of the dataset; 75% of values fall below Q3.
Interquartile Range (IQR)

The IQR measures the spread of the middle 50% of the data, from Q1 to Q3.

Formula:
IQR = Q3 − Q1

Significance:
● Shows variability within the central portion of the dataset.
● Unlike the full range, it is less affected by outliers or extreme values.

Importance of the Interquartile Range (IQR):
● Robustness Against Outliers: The IQR provides a reliable measure of variability because it excludes the lower 25% and upper 25% of values, where outliers typically sit.
● Better Insight into the Distribution: It focuses on the main portion of the data, giving a clearer view of data concentration and spread without distortion from extreme values.
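
A quick sketch of quartiles and the IQR with NumPy, using a hypothetical data list:

import numpy as np

data = [5, 7, 8, 12, 13, 14, 18, 21, 23, 27]   # hypothetical, already ordered

q1, q2, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1

print(q1, q2, q3)   # 25th, 50th (median), and 75th percentiles
print(iqr)          # spread of the middle 50% of the data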
What is Data Distribution?

Definition:
Data distribution describes how data points are spread across a range of values.

Purpose:
Helps in understanding the overall pattern and shape of the data.

Importance:
Essential for analyzing data patterns, detecting outliers, and selecting appropriate statistical tests.

Let's Learn a Few Types of Distribution
Normal Distribution:

Shape:
Symmetrical, bell-shaped curve.

Characteristics:
● Mean, median, and mode are all located at the center.
● Data is symmetrically distributed around the mean.

Example:
Heights, test scores, measurement errors.
Right-Skewed and Left-Skewed Distributions

Right-Skewed (Positive):
● Tail extends to the right.
● Mean > Median.
● Example: Income distribution.

Left-Skewed (Negative):
● Tail extends to the left.
● Mean < Median.
● Example: Age at retirement.
Uniform Distribution:

Definition:
All outcomes are equally likely, resulting in a flat, rectangular shape.

Characteristics:
● Mean and median are equal.
● There is no central peak.

Example:
Rolling a fair die, random number generation between set limits.

Visual:
Rectangular-shaped distribution graph.

Binomial Distribution:

Definition:
Shows the probability of a fixed number of successes in a set number of independent trials.

Characteristics:
● Outcomes are binary (e.g., success/failure).
● Often used in discrete probability.

Example:
Flipping a coin 10 times, where each flip results in heads or tails.
Poisson Distribution:

Definition:
Models the probability of a given number of events occurring within a fixed interval.

Characteristics:
● Describes rare events.
● Right-skewed; often used for count data.

Example:
The probability of 5 customer calls arriving in an hour, given a historical average of 3 calls per hour.

For More Types and Details Check notebook3_Distribution_Analysis.ipynb
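
As a sketch of how these distributions can be explored with SciPy (the normal-distribution parameters are hypothetical; the binomial and Poisson parameters follow the examples above):

import numpy as np
from scipy import stats

# Normal distribution: hypothetical heights with mean 170 cm, std 10 cm
heights = stats.norm.rvs(loc=170, scale=10, size=1_000, random_state=42)
print(heights.mean(), np.median(heights))   # mean and median roughly coincide

# Binomial distribution: probability of exactly 6 heads in 10 fair coin flips
print(stats.binom.pmf(k=6, n=10, p=0.5))

# Poisson distribution: probability of 5 calls in an hour, given an average of 3
print(stats.poisson.pmf(k=5, mu=3))         # ~0.10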
Hypothesis Testing
Part of Inferential Statistics
Inferential statistics is a branch of statistics that uses various analytical tools to draw conclusions about the population from sample data.

What is Inferential Statistics?
A branch of statistics that uses sample data to make predictions or inferences about a larger population.
Purpose:
Allows us to generalize findings from a small group to the whole population without needing data from everyone.

What is a Hypothesis?
Hypotheses in Research:
● Hypothesis: An initial assumption or prediction about a population parameter.
● Objective: Use data to either reject or support the hypothesis.
● Example: Testing the hypothesis that "men earn more than women" by analyzing a sample.
Types of Hypotheses

● Differential Hypothesis: Tests differences between groups.
  Example: Men vs. women in a salary comparison.

● Correlation Hypothesis: Tests relationships between variables.
  Example: The relationship between age and height.

● Directional Hypothesis: Specifies the direction of an effect.
  Example: Men earn more than women.

● Non-Directional Hypothesis: Only states that there is a difference or relationship, not the direction.
  Example: There is a difference in salaries between genders.
What is Hypothesis Testing?

Hypothesis testing is a statistical method to make decisions or inferences about a population using sample data.

Why Do We Need Hypothesis Testing?
● Test Assumptions: Validates assumptions about a population.
● Validate Claims: e.g., the effectiveness of a new drug.
● Reduce Uncertainty: Provides data-driven conclusions.
● Support Evidence-Based Decisions: Used in healthcare, business, and the social sciences.

Key Terms in Hypothesis Testing:
● Null Hypothesis (H₀): Represents "no effect" or "no difference."
  Example: Gender has no effect on salary.
● Alternate Hypothesis (H₁ or Hₐ): Contradicts the null; represents the claim being tested.
  Example: Gender affects salary.
This figure helps us visualize various key terms in hypothesis testing, which we will discuss next:

● P-value: The observed probability/result obtained from the sample.

● Green Region (Unlikely Region): If the p-value lies in this region, we reject the null hypothesis. This region generally lies where the p-value is below 0.05 (or 0.01). It is also called the Significant Region.

● White Region (Likely Region): If the value lies in this region, we accept (fail to reject) the null hypothesis. This region generally lies where the p-value is above 0.05 (or 0.01). It is also called the Confidence Region.

● Significance Level (Dotted Line): The border between the green region and the white region, also called the Threshold.
Key Terms in Hypothesis Testing

Understanding the P-Value
Definition:
The p-value indicates the probability that the observed results occurred by chance under the null hypothesis.
Interpretation:
● Small p-value (e.g., 0.03): Unlikely to be due to chance; reject H₀.
● Large p-value (e.g., 0.50): Likely due to chance; don't reject H₀.

Significance Level (α)
Role of α:
● Defines how much uncertainty is acceptable (common levels: 0.01, 0.05).
● Works as a threshold for the p-value.
Decision Making (comparing the p-value to α):
● p < 0.01: Very significant; reject H₀ confidently.
● p < 0.05: Significant; likely not due to chance.
● p > 0.05: Not significant; likely due to chance.
Key Terms in Hypothesis Testing

Confidence Interval:
A range that likely includes the true population mean.
If the observed result lies in this region, we generally do not reject the null hypothesis.
Purpose:
Estimates where the true population value lies based on sample data.
Interpretation:
This region generally corresponds to p-values greater than the significance level (e.g., α = 0.05).

Degrees of Freedom:
The number of independent values that can vary in a statistical calculation.
Degrees of freedom are used when calculating the p-value; they account for the loss of information that occurs when we take a sample from a population.

How Degrees of Freedom Affect the P-Value Calculation:
● Calculate the Test Statistic (e.g., t or chi-square).
● Use the Degrees of Freedom: Determines the reference distribution.
● Find the P-Value: The probability of observing the test statistic, assuming H₀ is true.
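
A small illustration of how the degrees of freedom change the p-value for the same (hypothetical) test statistic, using SciPy's t-distribution:

from scipy import stats

t_stat = 2.1   # hypothetical test statistic

# Two-sided p-value = 2 * P(T > |t|); the reference t-distribution depends on the degrees of freedom
for df in (5, 10, 30, 100):
    p_value = 2 * stats.t.sf(t_stat, df)
    print(df, round(p_value, 4))
# With more degrees of freedom the t-distribution approaches the normal
# distribution, so the p-value for the same statistic shrinks slightly.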
Types of Hypothesis Testing

Z-Test:
When to Use:
● Large sample size (typically n > 30).
● Known population variance (standard deviation).
● Data is normally distributed.

Example:
Scenario: A factory claims the average length of a bolt it produces is 5 cm. We collect a sample of 50 bolts and find an average length of 5.1 cm, with a population standard deviation of 0.2 cm.

Hypotheses:
● Null Hypothesis (H₀): The average length of bolts is 5 cm (sample mean = population mean).
● Alternate Hypothesis (H₁): The average length of bolts is not 5 cm (sample mean ≠ population mean).

Interpretation: If the Z-score exceeds the critical value, we reject the null hypothesis, suggesting the average bolt length is different from 5 cm.
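
A minimal sketch of this z-test in Python, using the numbers from the scenario above (SciPy assumed):

import math
from scipy import stats

mu_0, x_bar, sigma, n = 5.0, 5.1, 0.2, 50

# z = (sample mean - hypothesized mean) / (sigma / sqrt(n))
z = (x_bar - mu_0) / (sigma / math.sqrt(n))
p_value = 2 * stats.norm.sf(abs(z))        # two-sided test

print(round(z, 2), round(p_value, 4))      # z ≈ 3.54, p well below 0.05 -> reject H₀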

One-Sample T-Test:
When to Use:
● Small sample size (typically n < 30).
● Unknown population variance.
● Comparing the sample mean to a known population mean.
Example:
Scenario: A company claims the average weight of a product is 50 grams. We test a sample of 10 products and find an
average weight of 49 grams.
Hypotheses:
● Null Hypothesis (H₀): The average product weight is 50 grams.
● Alternate Hypothesis (H₁): The average product weight is different from 50 grams.
Interpretation: If the T-score exceeds the critical value, we reject the null hypothesis, indicating a significant difference in
the average weight.
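
A sketch of this one-sample t-test with SciPy; the individual weights below are hypothetical (the scenario only states a sample mean of 49 grams):

from scipy import stats

# Hypothetical weights (grams) for the 10 sampled products, averaging 49 g
weights = [48.5, 49.2, 48.8, 49.5, 48.9, 49.1, 48.7, 49.3, 49.0, 49.0]

t_stat, p_value = stats.ttest_1samp(weights, popmean=50)

print(round(t_stat, 2), round(p_value, 4))
# If p_value < 0.05 we reject H₀: the average weight differs from 50 grams.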

Two-Sample T-Test (Independent Samples):
When to Use:
● Comparing the means of two independent groups.
● Population variances are unknown.

Example:
Scenario: A study compares the test scores of students taught with two different methods. Sample 1 has an average score of 80, and Sample 2 has an average score of 85.

Hypotheses:
● Null Hypothesis (H₀): The mean scores of both groups are the same.
● Alternate Hypothesis (H₁): The mean scores of both groups are different.

Interpretation: If the T-score exceeds the critical value, we reject the null hypothesis, suggesting a significant difference between the teaching methods.
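
A sketch of this two-sample t-test with SciPy, using hypothetical score lists with means of roughly 80 and 85:

from scipy import stats

# Hypothetical scores for the two teaching methods
method_1 = [78, 82, 79, 81, 80, 77, 83, 80]
method_2 = [86, 84, 85, 87, 83, 88, 84, 85]

t_stat, p_value = stats.ttest_ind(method_1, method_2)   # assumes equal variances by default

print(round(t_stat, 2), round(p_value, 4))
# p_value < 0.05 -> reject H₀: the two teaching methods lead to different mean scores.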

Paired T-Test (Dependent Samples):
When to Use:
● Same group measured twice (before-and-after tests).
● Comparison of means under two conditions for the same participants.

Example:
Scenario: Testing the blood pressure of patients before and after treatment.

Hypotheses:
● Null Hypothesis (H₀): The mean blood pressure before treatment equals the mean after treatment.
● Alternate Hypothesis (H₁): The mean blood pressure differs before and after treatment.

Interpretation: If the T-score is significant, we reject the null hypothesis, indicating the treatment had a significant effect on blood pressure.
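
A sketch of this paired t-test with SciPy, using hypothetical before/after blood-pressure readings:

from scipy import stats

# Hypothetical systolic blood pressure (mmHg) for the same 6 patients
before = [150, 142, 160, 155, 148, 152]
after  = [141, 137, 150, 147, 140, 143]

t_stat, p_value = stats.ttest_rel(before, after)

print(round(t_stat, 2), round(p_value, 4))
# A small p_value -> reject H₀: the treatment changed mean blood pressure.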

Chi-Square Test
When to Use:

● Categorical data.
● Testing association or independence between two variables.

Example:

Scenario: A survey examines whether people’s choice of drink (tea or coffee) is related to their age group (young or old).

Hypotheses:

● Null Hypothesis (H₀): There is no association between age group and drink preference.
● Alternate Hypothesis (H₁): There is an association between age group and drink preference.

Interpretation: If the Chi-square statistic is significant, we reject the null hypothesis, suggesting an association between
age and drink preference.
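
A sketch of this chi-square test of independence with SciPy, using a hypothetical contingency table:

from scipy import stats

# Hypothetical contingency table of drink preference by age group
#               tea  coffee
# young          30      50
# old            45      25
observed = [[30, 50],
            [45, 25]]

chi2, p_value, dof, expected = stats.chi2_contingency(observed)

print(round(chi2, 2), round(p_value, 4), dof)
# p_value < 0.05 -> reject H₀: drink preference is associated with age group.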

One-Way ANOVA:
When to Use:
● Compare means across three or more groups.
● One independent variable with multiple levels.

Example:
Scenario: Comparing average test scores among students taught with three different methods.

Hypotheses:
● Null Hypothesis (H₀): All group means are equal.
● Alternate Hypothesis (H₁): At least one group mean is different.

Interpretation: If the ANOVA F-statistic is significant, we reject the null hypothesis, indicating a difference in mean test scores among the methods.
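
A sketch of a one-way ANOVA with SciPy, using hypothetical scores for the three methods:

from scipy import stats

# Hypothetical test scores for three teaching methods
method_a = [80, 82, 78, 81, 79]
method_b = [85, 88, 84, 86, 87]
method_c = [75, 74, 77, 76, 73]

f_stat, p_value = stats.f_oneway(method_a, method_b, method_c)

print(round(f_stat, 2), round(p_value, 6))
# p_value < 0.05 -> reject H₀: at least one teaching method has a different mean score.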

Two-Way ANOVA:
When to Use:
● Testing two independent variables and their interaction effect.
● Useful for understanding the effect of each factor and their interaction.

Example:
Scenario: Examining the effect of teaching method and study hours on test scores.

Hypotheses:
● Null Hypothesis (H₀): No difference in means due to teaching method, study hours, or their interaction.
● Alternate Hypothesis (H₁): There is a difference in means due to one or more of these factors.

Interpretation: If the F-statistic is significant, we reject the null hypothesis, indicating effects due to the factors or their interaction.
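
A sketch of a two-way ANOVA using statsmodels (the data and column names are hypothetical):

import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Hypothetical data: test score by teaching method and study-hours group
df = pd.DataFrame({
    "method": ["A", "A", "A", "A", "B", "B", "B", "B"] * 2,
    "hours":  ["low", "low", "high", "high"] * 4,
    "score":  [70, 72, 80, 83, 75, 77, 88, 90,
               71, 69, 82, 81, 76, 78, 87, 91],
})

# 'C(method) * C(hours)' fits both main effects and their interaction
model = ols("score ~ C(method) * C(hours)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))   # F-statistics and p-values per factor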

Repeated Measures ANOVA:
When to Use:
● Same subjects measured under multiple conditions.
● Useful for time-based studies or repeated measurements.

Example:
Scenario: Measuring student performance on a test given at the beginning, middle, and end of a course.

Hypotheses:
● Null Hypothesis (H₀): No difference in mean test scores across time points.
● Alternate Hypothesis (H₁): There is a difference in mean test scores over time.

Interpretation: If the F-statistic is significant, we reject the null hypothesis, suggesting a change in scores over time.
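
A sketch of a repeated measures ANOVA using statsmodels' AnovaRM (the long-format data and column names are hypothetical):

import pandas as pd
from statsmodels.stats.anova import AnovaRM

# Hypothetical long-format data: each student tested at three time points
df = pd.DataFrame({
    "student": [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    "time":    ["start", "middle", "end"] * 4,
    "score":   [60, 68, 75, 55, 63, 70, 72, 78, 85, 65, 70, 76],
})

# One within-subject factor ("time"); each student appears once per level
res = AnovaRM(data=df, depvar="score", subject="student", within=["time"]).fit()
print(res)   # F-statistic and p-value for the effect of time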

MANOVA (Multivariate ANOVA):
When to Use:
● Analyzes multiple dependent variables at once.
● Tests the effect of independent variables on multiple outcomes.

Example:
Scenario: Testing the effect of a new teaching strategy on both math and science scores.

Hypotheses:
● Null Hypothesis (H₀): No difference in the means of both math and science scores due to the strategy.
● Alternate Hypothesis (H₁): At least one of the means differs.

Interpretation: If the MANOVA test is significant, we reject the null hypothesis, indicating an effect on one or both outcomes.
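
A sketch of a MANOVA using statsmodels (the data and column names are hypothetical):

import pandas as pd
from statsmodels.multivariate.manova import MANOVA

# Hypothetical math and science scores under two teaching strategies
df = pd.DataFrame({
    "strategy": ["old"] * 5 + ["new"] * 5,
    "math":     [70, 72, 68, 71, 69, 78, 80, 79, 82, 81],
    "science":  [65, 66, 64, 67, 63, 72, 74, 73, 75, 71],
})

# Two dependent variables (math, science) modelled against one factor (strategy)
manova = MANOVA.from_formula("math + science ~ strategy", data=df)
print(manova.mv_test())   # Wilks' lambda, Pillai's trace, etc., with p-values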
