Exploratory Data Analysis

Exploratory Data Analysis (EDA) is a technique for analyzing and visualizing datasets to summarize their main characteristics, developed by John Tukey in the 1970s. Key components of EDA include visualization techniques, summary statistics, pairwise correlation, and class breakdowns to understand data distributions and relationships. Common visualization methods include scatter plots, histograms, and box plots, which help in interpreting data and identifying patterns.

Uploaded by

chandramaryt
Copyright
© All Rights Reserved

EXPLORATORY DATA ANALYSIS

Exploratory data analysis (EDA) is a technique for investigating
a data set and summarizing its main characteristics,
often in visual form.

Exploratory data analysis is a set of techniques
principally developed by the American mathematician
John Tukey in the 1970s.

It analyzes data sets using a variety of
numerical methods and graphical tools.
VISUALIZATION TECHNIQUES
Visualization techniques are essential in
Exploratory Data Analysis (EDA): they help the analyst
understand the data and communicate insights. Here are
some common techniques:
1. Scatter Plots
2. Pie Charts
3. Bar Charts
4. Histograms
5. Line Plots
6. Heatmaps
7. Violin Plots
8. Distribution Plots
CONFRONTING A NEW DATA SET
It refers to the initial encounter with, and examination of,
a previously unseen data collection.
Answer the basic questions:
Learn about the data's origin, purpose, and
any relevant background information.
• Who constructed this data set, when, and why?
E.g.: National Health and Nutrition Examination Survey
2009-2010
• How big is it?
E.g.: The data set has 4978 records, each with seven data
fields.
• What do the fields mean?
E.g.: The lengths and weights were measured using the
metric system, i.e., in centimetres and kilograms
respectively.
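
The "how big is it?" and "what do the fields mean?" questions can be given a first, mechanical answer in code. The following is a minimal sketch using only the Python standard library; the sample text and its column names (`height_cm`, `weight_kg`, `age`) are hypothetical stand-ins for a real file and are not taken from the survey mentioned above.

```python
import csv
import io

# Hypothetical sample standing in for a real CSV file.
# The column names and values are assumptions made for illustration.
RAW = """height_cm,weight_kg,age
170.2,65.4,34
158.9,,29
181.0,82.3,51
"""

def first_look(text):
    """Return record count, field names, and missing-value counts per field."""
    rows = list(csv.DictReader(io.StringIO(text)))
    fields = list(rows[0].keys()) if rows else []
    # An empty string is treated as a missing value here.
    missing = {f: sum(1 for r in rows if r[f] == "") for f in fields}
    return {"n_records": len(rows), "fields": fields, "missing": missing}

overview = first_look(RAW)
print(overview)
```

The questions about who built the data set, when, and why still have to be answered from its documentation; code can only report size, field names, and gaps.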
LOOK FOR FAMILIAR OR INTERPRETABLE RECORDS:
It refers to the process of identifying data points,
patterns, or features that are:

1. Recognizable: Easily identifiable due to prior
knowledge or experience.
2. Understandable: Can be comprehended
without extensive additional explanation.
3. Relatable: Connected to existing knowledge or
real-world concepts.
4. Meaningful: Possess a clear interpretation
or significance.
SUMMARY STATISTICS:
Summary statistics are numerical measures that
summarize and describe the basic features of
a data set.

Common summary statistics used in EDA include:

1. Measures of central tendency:
• Mean, Median, Mode
2. Measures of variability:
• Range, Variance, Standard deviation
3. Measures of distribution shape and position:
• Skewness, Kurtosis, Quantiles
4. Counts and frequencies:
• Number of observations (n), Missing values,
Frequency distributions
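
Most of the measures listed above can be computed with Python's standard `statistics` module. This is a minimal sketch on made-up values (they do not come from any data set in this document); skewness and kurtosis are not provided by that module and are left out here.

```python
import statistics

# Illustrative values only; not taken from any data set in this document.
values = [2, 4, 4, 5, 7, 9, 11]

summary = {
    "n": len(values),
    "mean": statistics.mean(values),
    "median": statistics.median(values),
    "mode": statistics.mode(values),
    "range": max(values) - min(values),
    "variance": statistics.variance(values),  # sample variance (divides by n - 1)
    "std_dev": statistics.stdev(values),      # sample standard deviation
    "quartiles": statistics.quantiles(values, n=4),  # three cut points
}

for name, value in summary.items():
    print(f"{name:>10}: {value}")
```

`statistics.pvariance` and `statistics.pstdev` give the population versions if the data are the whole population rather than a sample.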
PAIRWISE CORRELATION
It refers to the process of calculating and
analyzing the correlation coefficients between all
possible pairs of variables in a dataset. It helps to:

1. Identify relationships: Discover how
variables are related to each other.
2. Understand interactions: Reveal how
variables change together (correlation alone
does not show that one causes the other).
3. Detect patterns: Uncover hidden patterns
and correlations.
4. Guide feature selection: Inform the
selection of relevant variables for
further analysis.
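
The idea of "all possible pairs" can be sketched as follows, assuming the Pearson coefficient is the correlation measure wanted (the slides do not name one). The column names and values are invented for illustration.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences.

    Assumes neither sequence is constant (otherwise the denominator is zero).
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def pairwise_correlations(columns):
    """Map each unordered pair of column names to its correlation."""
    return {
        (a, b): pearson(columns[a], columns[b])
        for a, b in combinations(columns, 2)
    }

# Made-up columns for illustration only.
data = {
    "x": [1, 2, 3, 4, 5],
    "double_x": [2, 4, 6, 8, 10],
    "reversed_x": [5, 4, 3, 2, 1],
}
corr = pairwise_correlations(data)
for pair, r in corr.items():
    print(pair, round(r, 3))
```

With k variables this produces k(k - 1)/2 coefficients, which is why the results are often shown as a heatmap.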
CLASS BREAKDOWNS:
Class breakdowns in Exploratory Data Analysis (EDA)
refer to the process of analyzing and summarizing the
distribution of categorical variables, also known as class
variables or target variables.
It involves:

1. Identifying unique classes: Determining the
distinct categories or groups within a categorical
variable.
2. Counting observations: Calculating the number of
observations (records) in each class.
3. Calculating frequencies: Determining the
proportion or percentage of observations in each class.
4. Visualizing distributions: Using plots like bar
charts, pie charts, or histograms to illustrate the class
breakdowns.
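
The four steps above can be sketched with `collections.Counter`. The labels are made up, and the text bar is only a stand-in for a real bar chart.

```python
from collections import Counter

def class_breakdown(labels):
    """Return {class: (count, proportion)}, most frequent class first."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cls: (n, n / total) for cls, n in counts.most_common()}

# Made-up labels for illustration only.
labels = ["Male", "Female", "Male", "Male", "Female"]
breakdown = class_breakdown(labels)

for cls, (n, share) in breakdown.items():
    # One '#' per observation gives a rough text bar chart.
    print(f"{cls:<8}{n:>3}  {share:6.1%}  {'#' * n}")
```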
PLOT OF DISTRIBUTIONS

Plotting distributions is a crucial step in EDA to
understand the shape, central tendency, and
variability of your data.

Common distribution plots in EDA include:

1. Histograms: Visualize frequency distributions.
2. Box Plots: Show median, quartiles, and outliers.
3. Density Plots: Smooth, continuous
representations.
4. Q-Q Plots (Quantile-Quantile): Compare
distributions.
5. Violin Plots: Combine box plots and density
plots.
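
The numbers behind a histogram and a box plot can be computed without a plotting library. This is a rough sketch on made-up values: equal-width bins for the histogram, and a five-number summary for the box plot using the "inclusive" quantile method (one of several quartile conventions).

```python
import statistics

def histogram_counts(values, bins):
    """Count values in equal-width bins; assumes max(values) > min(values)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # The maximum value is placed in the last bin.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return counts

def five_number_summary(values):
    """Minimum, first quartile, median, third quartile, maximum."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (min(values), q1, q2, q3, max(values))

# Made-up values for illustration only.
values = [1, 2, 2, 3, 3, 3, 4, 4, 9]
counts = histogram_counts(values, bins=4)
five = five_number_summary(values)

for i, c in enumerate(counts):
    print(f"bin {i}: {'#' * c}")
print("five-number summary:", five)
```

A long tail toward the larger values, as in this sample, is what "skewed to the right" looks like in a histogram.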
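Example (sample records; the figures in the steps below presumably describe a larger customer data set than the three rows shown):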
| Customer ID | Age | Gender | Income | Purchase Amount |
| 1 | 25 | Male | 50000 | 100 |
| 2 | 31 | Female | 60000 | 200 |
| 3 | 42 | Male | 70000 | 50 |

EDA Steps:
1. Summary Statistics:
- Mean Age: 35
- Median Income: 55000
- Average Purchase Amount: 150

2. Data Visualization:
- Histogram of Age: skewed to the right
- Scatter plot of Income vs. Purchase Amount: positive correlation
- Bar chart of Gender vs. Purchase Amount: males spend more
3. Familiar or Interpretable Records:
- Customers with high income (>75000) tend to make
larger purchases
- Males aged 25-40 have higher purchase amounts

4. Pairwise Correlation:
- Strong correlation between Income and Purchase Amount (0.8)
- Moderate correlation between Age and Income (0.5)

5. Class Breakdown:
- 60% of customers are males
- 40% of customers have income above 60000
