Module 2
DATA SCIENCE
Dr. Shital Bhatt
Associate Professor
School of Computational and Data Sciences
www.vidyashilpuniversity.com
Introduction to Exploratory
Data Analysis (EDA)
Exploratory Data Analysis (EDA)
Exploratory Data Analysis (EDA) is an approach/philosophy for data analysis that employs a variety of techniques (mostly graphical) to:
• maximize insight into a data set;
• uncover underlying structure;
• extract important variables;
• detect outliers and anomalies;
• test underlying assumptions;
• develop parsimonious models; and
• determine optimal factor settings.
The EDA approach is precisely that -- an approach -- not a set of techniques, but an attitude/philosophy about how a data analysis should be carried out.
28-11-2019
Key Data Assumptions in EDA
• Data Quality: Data should be accurate, complete, and free from errors.
• Representativeness: Data should be representative of the population being studied.
• Independence: Observations should be independent unless stated otherwise.
• Distribution: Understanding the distribution (normality, skewness) is key.
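The distribution check mentioned above can be sketched in a few lines of Python; the sample data here is synthetic and purely illustrative:

```python
# Sketch: checking distribution shape (skewness, normality) before modeling.
# The sample values are randomly generated for illustration only.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=50, scale=5, size=500)  # hypothetical measurements

skewness = stats.skew(sample)        # near 0 for symmetric data
_, p_value = stats.shapiro(sample)   # Shapiro-Wilk test for normality

print(f"skewness = {skewness:.3f}")
print(f"Shapiro-Wilk p-value = {p_value:.3f}")
```

A large Shapiro-Wilk p-value gives no evidence against normality; a strongly nonzero skewness flags an asymmetric distribution.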
Goals of EDA
The primary goal of EDA is to maximize the analyst's insight into a data set and into the underlying structure of a data set, while providing all of the specific items that an analyst would want to extract from a data set, such as:
• a good-fitting, parsimonious model
• a list of outliers
• a sense of robustness of conclusions
• estimates for parameters
• uncertainties for those estimates
• a ranked list of important factors
• conclusions as to whether individual factors are statistically significant
• optimal settings
Introduction to Goals of EDA
Exploratory Data Analysis (EDA) is a crucial step in data analysis. It serves multiple
purposes that are essential for understanding the data and ensuring robust statistical
modeling.
Identify Patterns
• Find outliers and irregularities that might indicate data quality issues or interesting
phenomena.
• Important for cleaning data and for identifying rare events.
• Example: Detecting fraudulent transactions in financial data.
Hypothesis Testing
• Validate assumptions that underlie statistical models, such as normality and independence
of observations.
• Ensures the validity of subsequent statistical analyses.
• Example: Checking the normality of residuals in regression analysis.
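The residual-normality check in the example above can be sketched as follows; the data and the linear model are made up for illustration:

```python
# Sketch: validating the normality assumption on regression residuals
# (toy data: a linear signal plus Gaussian noise, not from the lecture).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 200)
y = 3.0 * x + 2.0 + rng.normal(scale=1.0, size=x.size)

slope, intercept = np.polyfit(x, y, 1)      # fit a simple linear model
residuals = y - (slope * x + intercept)

# Shapiro-Wilk: a large p-value means no evidence against normality
_, p_value = stats.shapiro(residuals)
print(f"p-value = {p_value:.3f}")
```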
Underlying Assumptions of
Exploratory Data Analysis (EDA)
Underlying Assumptions
There are four assumptions that typically underlie all measurement processes; namely, that the data from the process at hand "behave like":
• random drawings;
• from a fixed distribution;
• with the distribution having fixed location; and
• with the distribution having fixed variation.
Introduction to Underlying Assumptions
Exploratory Data Analysis (EDA) relies on several key assumptions to ensure that the
analysis is valid and meaningful. These assumptions are fundamental to the correct
interpretation and subsequent modeling of data.
Techniques and Tools Used in EDA
Clustering and dimension reduction techniques, which help create graphical displays of
high-dimensional data containing many variables.
Univariate visualization of each field in the raw dataset, with summary statistics.
Bivariate visualizations and summary statistics that allow you to assess the relationship
between each variable in the dataset and the target variable you’re looking at.
Multivariate visualizations, for mapping and understanding interactions between
different fields in the data.
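The univariate and bivariate steps above can be sketched with pandas; the toy data set below is entirely hypothetical:

```python
# Sketch: univariate summary statistics and a bivariate check against a
# target variable, using a small made-up data set.
import pandas as pd

df = pd.DataFrame({
    "age":    [23, 35, 45, 29, 41, 52, 38, 27],
    "income": [30, 48, 61, 39, 55, 70, 52, 35],   # in thousands
})

# Univariate: summary statistics for each field
print(df.describe())

# Bivariate: correlation of each variable with the target ("income")
print(df.corr(numeric_only=True)["income"])
```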
TYPES OF EXPLORATORY DATA ANALYSIS:
Univariate Non-graphical
Multivariate Non-graphical
Univariate graphical
Multivariate graphical
1. Univariate Non-graphical:
This is the simplest form of data analysis: we examine just one variable at a time.
The standard goal of univariate non-graphical EDA is to understand the underlying sample distribution of the data and to make observations about the population.
Outlier detection is also part of the analysis. The characteristics of a population distribution include:
Types
Central tendency: The central tendency or location of a distribution concerns its typical or middle values. The commonly useful measures of central tendency are the mean, the median, and sometimes the mode.
Spread: Spread is an indicator of how far from the center the data values tend to fall. The standard deviation and variance are two useful measures of spread.
Skewness and kurtosis: Two more useful univariate descriptors are the skewness and kurtosis of the distribution.
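These univariate descriptors can all be computed directly; the sample values below are made up, with one deliberate outlier:

```python
# Sketch: central tendency, spread, skewness, and kurtosis for a toy sample.
import numpy as np
from scipy import stats

data = np.array([2, 3, 3, 4, 4, 4, 5, 5, 6, 14])  # note the outlier at 14

print("mean:    ", np.mean(data))         # pulled upward by the outlier
print("median:  ", np.median(data))       # robust to the outlier
print("std dev: ", np.std(data, ddof=1))
print("skewness:", stats.skew(data))      # positive -> right-skewed
print("kurtosis:", stats.kurtosis(data))  # excess kurtosis
```

Comparing the mean (5.0) with the median (4.0) already hints at the right skew caused by the outlier.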
2. Multivariate Non-graphical:
Multivariate non-graphical techniques generally show the relationship between two or more variables, for example through cross-tabulations or summary statistics.
3. Univariate graphical:
Non-graphical methods are quantitative and objective, but they cannot give a complete picture of the data; graphical methods, which involve a degree of subjective analysis, are therefore also required. Common types of univariate graphics are:
Histogram
Stem-and-leaf plots
Boxplots
Quantile-normal plots
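Two of the univariate graphics listed above can be drawn with matplotlib; the data values are randomly generated for illustration:

```python
# Sketch: histogram and boxplot for a single variable (synthetic data).
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(2)
values = rng.normal(loc=100, scale=15, size=300)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.hist(values, bins=20)
ax1.set_title("Histogram")
ax2.boxplot(values)
ax2.set_title("Boxplot")
fig.savefig("univariate_graphics.png")
```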
4. Multivariate graphical:
Multivariate graphical EDA uses graphics to display relationships between two or more sets of data. The one used most commonly is a grouped barplot, with each group representing one level of one of the variables.
Scatterplot: For two quantitative variables, the basic graphical EDA technique is the scatterplot, with one variable on the x-axis and the other on the y-axis.
Run chart: It’s a line graph of data plotted over time.
Heat map: It’s a graphical representation of data where values are depicted
by color.
Multivariate chart: It’s a graphical representation of the relationships
between factors and response.
Bubble chart: It’s a data visualization that displays multiple circles
(bubbles) in a two-dimensional plot.
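A bubble chart combines a scatterplot's two positional axes with a third variable encoded as circle size; a minimal sketch with made-up data:

```python
# Sketch: bubble chart -- x/y position plus a third variable as bubble area.
# All values are randomly generated for illustration.
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, 50)
y = 2 * x + rng.normal(scale=2, size=50)
size = rng.uniform(20, 200, 50)  # third variable, shown as bubble area

fig, ax = plt.subplots()
ax.scatter(x, y, s=size, alpha=0.5)
ax.set_xlabel("x")
ax.set_ylabel("y")
fig.savefig("bubble_chart.png")
```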
CLUSTERING METHODS FOR EXPLORATORY DATA
ANALYSIS:
Partitioning Method
Hierarchical Method
Density-based Method
Grid-Based Method
Overview of Basic Clustering Methods
Partitioning Methods
K-Means Algorithm
It is a centroid-based technique.
Suppose a data set, D, contains n objects in Euclidean space. Partitioning methods distribute the objects in D into k clusters, C1, ..., Ck, such that Ci ⊂ D and Ci ∩ Cj = ∅ for 1 ≤ i, j ≤ k, i ≠ j.
An objective function is used to assess the partitioning
quality so that objects within a cluster are similar to one another
but dissimilar to objects in other clusters.
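A common choice of objective function for partitioning is the within-cluster sum of squared errors (SSE); a minimal sketch with made-up points and an assumed cluster assignment:

```python
# Sketch: within-cluster sum of squared errors (SSE) for a partitioning.
# The four points and their cluster labels are illustrative only.
import numpy as np

points = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 9.5]])
labels = np.array([0, 0, 1, 1])  # assumed assignment into k = 2 clusters

# Each cluster's center is the mean of its member points
centers = np.array([points[labels == j].mean(axis=0) for j in (0, 1)])

# Sum of squared distances from each point to its cluster center
sse = sum(np.sum((points[labels == j] - centers[j]) ** 2) for j in (0, 1))
print(f"SSE = {sse:.3f}")
```

A lower SSE means the objects within each cluster lie closer to their cluster center, i.e. a better partitioning under this objective.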
The k-means algorithm for partitioning, where each
cluster’s center is represented by the mean value of the
objects in the cluster.
Input:
k: the number of clusters,
D: a data set containing n objects.
Method
1. arbitrarily choose k objects from D as the initial cluster centers;
2. repeat
3. (re)assign each object to the cluster to which the object is the most
similar, based on the mean value of the objects in the cluster;
4. update the cluster means, that is, calculate the mean value of the
objects for each cluster;
5. until no change;
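The five-step method above can be sketched directly in NumPy; the toy 2-D data set is illustrative, and ties/empty clusters are not handled:

```python
# Minimal k-means sketch following the steps above (pure NumPy).
import numpy as np

def k_means(D, k, seed=0, max_iter=100):
    rng = np.random.default_rng(seed)
    # step 1: arbitrarily choose k objects from D as initial centers
    centers = D[rng.choice(len(D), size=k, replace=False)]
    for _ in range(max_iter):                    # step 2: repeat
        # step 3: (re)assign each object to its nearest cluster center
        dists = np.linalg.norm(D[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 4: update each cluster mean
        new_centers = np.array([D[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):    # step 5: until no change
            break
        centers = new_centers
    return labels, centers

D = np.array([[1.0, 1.0], [1.2, 0.8], [8.0, 8.0], [8.2, 8.1]])
labels, centers = k_means(D, k=2)
print(labels)  # the two left points and the two right points form clusters
```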
Hierarchical Clustering
Agglomerative vs Divisive
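Agglomerative ("bottom-up") clustering starts with every object in its own cluster and repeatedly merges the two closest clusters; divisive ("top-down") starts with one cluster and splits it. A minimal agglomerative sketch with SciPy, on a made-up set of four 2-D points:

```python
# Sketch: agglomerative hierarchical clustering with SciPy
# (the four points are illustrative only).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

points = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])

# Bottom-up merging; "average" linkage uses mean inter-cluster distance
Z = linkage(points, method="average")

# Cut the resulting dendrogram into two flat clusters
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```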
Visualizing Data
Guidelines, Integrity, and Grammar of Graphics
Guidelines for Good Plots
1. Simplify complex data: Use charts that make data easier to understand.
- Example: Use line charts for trends over time.
2. Use appropriate chart types: Choose charts that best represent your data.
- Example: Use bar charts for categorical data.
Guidelines for Good Plots - Part 2
5. Label axes and legends clearly: Provide clear labels for better understanding.
- Example: Use descriptive titles and labels for axes.
6. Choose colors wisely: Use colors that are distinguishable and meaningful.
- Example: Avoid using red and green together for colorblind audiences.
7. Provide context with titles and captions: Add titles and captions to explain the data.
- Example: Include a brief description of what the chart represents.
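The guidelines above can be illustrated with a small matplotlib example; the categories, counts, and colour choices are all made up:

```python
# Sketch: a bar chart for categorical data that follows the guidelines above:
# appropriate chart type, labelled axes, a descriptive title, and a
# colour palette chosen to be distinguishable (hex values are illustrative).
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

categories = ["A", "B", "C"]
counts = [12, 7, 19]

fig, ax = plt.subplots()
ax.bar(categories, counts, color=["#4477AA", "#EE6677", "#228833"])
ax.set_xlabel("Category")
ax.set_ylabel("Count")
ax.set_title("Observations per category (toy data)")
fig.savefig("bar_chart.png")
```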
Maintain Integrity When Plotting Data - Part 2
5. Use appropriate scales to avoid distortion: Ensure scales accurately reflect differences in
data.
- Example: Avoid using logarithmic scales unless necessary.
6. Avoid cherry-picking data points: Show the full picture without selecting only favorable
data.
- Example: Present all relevant data points, not just those that support your argument.
7. Be transparent about sources and methods: Clearly state data sources and methods used.
- Example: Include a note on where the data was obtained and any transformations
applied.
Grammar of Graphics - Part 1