Exploratory Data Analysis in ML
If you are a data scientist or a machine learning enthusiast, you probably know that EDA stands for
Exploratory Data Analysis.
But do you know why EDA is so important in the ML workflow?
In this blog post, I will try to explain why EDA is not just a preparatory step, but a critical one that can
make or break your ML project.
What EDA is
EDA is the process of exploring and understanding your data before applying any ML algorithms or
models.
It involves visualizing, summarizing, and finding patterns, outliers, and anomalies in your data.
EDA helps you to gain insights and intuition about your data, which can guide your ML choices and
improve your results.
EDA often involves creating visualizations, such as charts, graphs, plots, or maps, that can help you to
convey complex information in a simple and intuitive way.
Visualizations can also help you to tell a story with your data, and highlight the key points and
takeaways for your audience.
In conclusion
Building your ML models and selecting features based on intuition only (without painstakingly
carrying out EDA) is bad practice and will undermine the abilities of your model.
Exploratory Data Analysis in [ML]
1. Univariate Non-graphical: This is the most basic type of data analysis, because only one
variable is examined at a time. The basic objective of univariate non-graphical EDA is to
understand the sample distribution and the underlying data well enough to draw conclusions
about the population. The analysis also includes outlier detection. The characteristics of
the population distribution include:
Central tendency: The central tendency, or location, of a distribution concerns its
typical or middle values. The mean, the median, and occasionally the mode are the
usual measures of central tendency, with the mean being the most common. The median
may be preferred when the distribution is skewed or when outliers are a concern.
Spread: Spread indicates how far from the centre we should expect to find the data
values. The variance and the standard deviation are two useful measures of spread.
The variance is the mean of the squared deviations from the mean, and the standard
deviation is the square root of the variance.
Skewness and kurtosis: Two more useful univariate descriptors are the skewness
and kurtosis of the distribution. Skewness is a measure of asymmetry, and kurtosis
is a more subtle measure of peakedness relative to a normal distribution.
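The descriptors above can be computed in a few lines. The following is a minimal sketch, assuming pandas is available; the sample values and the variable name `values` are hypothetical and chosen only to show how a single large value pulls the mean away from the median.

```python
import pandas as pd

# Hypothetical sample: one quantitative variable with a single large value.
values = pd.Series([2.0, 3.0, 3.0, 4.0, 5.0, 5.0, 6.0, 40.0], name="x")

summary = {
    # Central tendency
    "mean": values.mean(),
    "median": values.median(),
    "mode": values.mode().tolist(),  # every value tied for most frequent
    # Spread (pandas uses the sample versions, ddof=1, by default)
    "variance": values.var(),
    "std": values.std(),
    # Shape
    "skewness": values.skew(),
    "excess_kurtosis": values.kurt(),  # reported relative to a normal distribution
}

for name, stat in summary.items():
    print(f"{name}: {stat}")
```

Here the mean (8.5) sits well above the median (4.5), which is the kind of gap that suggests skew or outliers and makes the median the safer summary.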
2. Multivariate Non-graphical: Multivariate non-graphical EDA techniques typically use
cross-tabulation or statistics to show the relationship between two or more variables.
For categorical data, an extension of tabulation called cross-tabulation is extremely
useful. For two variables, cross-tabulation is done by making a two-way table with
column headings that match the levels of one variable and row headings that match
the levels of the other variable, then filling in the counts of all subjects that
share the same pair of levels.
For one categorical variable and one quantitative variable, we compute statistics for
the quantitative variable separately for each level of the categorical variable, then
compare the statistics across the levels of the categorical variable.
Comparing the means is an informal version of one-way ANOVA, and comparing the medians
is a robust version of one-way ANOVA.
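Both techniques map directly onto pandas calls. This is a small sketch under stated assumptions: the DataFrame and its column names (`plan`, `churned`, `monthly_spend`) are made up for illustration, and `pd.crosstab` and `groupby` are used as one reasonable way to build the tables described above.

```python
import pandas as pd

# Hypothetical data: two categorical variables and one quantitative variable.
df = pd.DataFrame({
    "plan": ["basic", "basic", "premium", "premium", "basic", "premium"],
    "churned": ["yes", "no", "no", "no", "yes", "yes"],
    "monthly_spend": [20.0, 25.0, 60.0, 55.0, 22.0, 70.0],
})

# Two-way table: rows are levels of `plan`, columns are levels of `churned`,
# and each cell counts the subjects sharing that pair of levels.
counts = pd.crosstab(df["plan"], df["churned"])
print(counts)

# Statistics of the quantitative variable, computed separately for each
# level of the categorical variable, ready to compare across levels.
by_plan = df.groupby("plan")["monthly_spend"].agg(["mean", "median", "std"])
print(by_plan)
```

Reading down the `mean` and `median` columns of `by_plan` is the informal comparison across levels that the text describes; it is a first look, not a substitute for a formal test.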
3. Univariate graphical: Non-graphical methods are quantitative and objective, but they do
not give the complete picture of the data; graphical methods, which involve a degree of
subjective analysis, are therefore also required. Common types of univariate graphics are:
Histogram: The most basic graph is the histogram, a barplot in which each bar
represents the frequency (count) or proportion (count/total count) of cases for a
range of values. Histograms are one of the simplest ways to quickly learn a lot
about your data, including central tendency, spread, modality, shape and outliers.
Stem-and-leaf plots: A simple substitute for a histogram is the stem-and-leaf plot. It
shows all data values and the shape of the distribution.
Boxplots: Another very useful univariate graphical technique is the boxplot.
Boxplots are excellent at presenting information about central tendency and show
robust measures of location and spread, as well as providing information about symmetry
and outliers, although they can be misleading about aspects like multimodality. One
of the best uses of boxplots is in the form of side-by-side boxplots.
Quantile-normal plots: The final univariate graphical EDA technique is the most
intricate. It is called the quantile-normal or QN plot or, more generally, the
quantile-quantile or QQ plot. It is used to see how well a particular sample follows a
particular theoretical distribution. It allows detection of non-normality and diagnosis
of skewness and kurtosis.
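As an illustration of three of these plots, here is a minimal sketch assuming matplotlib and NumPy are installed. The sample is synthetic (a right-skewed exponential draw), and the quantile-normal plot is built by hand using the plotting positions (i - 0.5)/n, which is one common convention rather than the only one.

```python
from statistics import NormalDist

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical right-skewed sample.
rng = np.random.default_rng(0)
sample = rng.exponential(scale=2.0, size=200)

fig, (ax_hist, ax_box, ax_qq) = plt.subplots(1, 3, figsize=(12, 4))

# Histogram: bar height is the count of cases in each range of values.
ax_hist.hist(sample, bins=20)
ax_hist.set_title("Histogram")

# Boxplot: median, quartiles, and points flagged as potential outliers.
ax_box.boxplot(sample)
ax_box.set_title("Boxplot")

# Quantile-normal plot: sorted sample values against standard normal quantiles.
n = len(sample)
probs = (np.arange(1, n + 1) - 0.5) / n
theoretical = np.array([NormalDist().inv_cdf(p) for p in probs])
ordered = np.sort(sample)
ax_qq.scatter(theoretical, ordered, s=8)
ax_qq.set_xlabel("Normal quantiles")
ax_qq.set_ylabel("Sample quantiles")
ax_qq.set_title("Quantile-normal plot")

fig.tight_layout()
```

If the sample were close to normal, the points in the third panel would fall near a straight line; curvature like the one a skewed sample produces is the visual cue for non-normality. In a notebook, drop the `matplotlib.use("Agg")` line and call `plt.show()` to display the figure.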
4. Multivariate graphical: Multivariate graphical EDA uses graphics to display relationships
between two or more sets of data. The only one used commonly is a grouped barplot, with
each group representing one level of one of the variables and each bar within a group
representing the levels of the other variable.
Other common types of multivariate graphics are:
Scatterplot: For two quantitative variables, the basic graphical EDA technique is the
scatterplot, which has one variable on the x-axis, one on the y-axis, and a point for
each case in your dataset.
Run chart: It’s a line graph of data plotted over time.
Heat map: It’s a graphical representation of data where values are depicted by color.
Multivariate chart: It’s a graphical representation of the relationships between
factors and response.
Bubble chart: It’s a data visualization that displays multiple circles (bubbles) in a
two-dimensional plot.
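A few of these charts can be sketched together. The example below assumes matplotlib, NumPy and pandas; the columns `x`, `y` and `z` are synthetic, and using a correlation matrix as the input to the heat map is my choice for illustration, not something the list above prescribes.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical data: y depends on x, z is unrelated noise.
rng = np.random.default_rng(1)
n = 100
df = pd.DataFrame({"x": rng.normal(size=n)})
df["y"] = 2.0 * df["x"] + rng.normal(scale=0.5, size=n)
df["z"] = rng.normal(size=n)

corr = df.corr()  # pairwise Pearson correlations

fig, (ax_scatter, ax_heat, ax_run) = plt.subplots(1, 3, figsize=(13, 4))

# Scatterplot: one point per case, x on the horizontal axis, y on the vertical.
ax_scatter.scatter(df["x"], df["y"], s=10)
ax_scatter.set_xlabel("x")
ax_scatter.set_ylabel("y")
ax_scatter.set_title("Scatterplot")

# Heat map: each cell's color encodes a correlation value.
im = ax_heat.imshow(corr.to_numpy(), vmin=-1, vmax=1, cmap="coolwarm")
ax_heat.set_xticks(range(len(corr.columns)))
ax_heat.set_xticklabels(corr.columns)
ax_heat.set_yticks(range(len(corr.index)))
ax_heat.set_yticklabels(corr.index)
ax_heat.set_title("Heat map of correlations")
fig.colorbar(im, ax=ax_heat)

# Run chart: values plotted in observation order (the row index stands in for time).
ax_run.plot(df.index, df["y"])
ax_run.set_xlabel("observation order")
ax_run.set_title("Run chart")

fig.tight_layout()
```

The scatterplot and the `x`/`y` cell of the heat map tell the same story in two forms: a tight upward cloud of points and a strongly positive correlation.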
Exploratory Data Analysis (EDA) and Customer Segmentation of Credit Score
Classification Dataset
Introduction
Exploratory Data Analysis (EDA) is a crucial phase in any data science project, enabling data
scientists to gain insights, identify patterns, and prepare data for further analysis. This article
serves as a comprehensive guide to conducting EDA effectively. We will break down the
process into key steps and provide detailed insights at each stage.
1. Project Overview 📝
Project Title: Create a concise, descriptive title that encapsulates the main theme of your
analysis.
Goal of the project: Break down the project goals into specific objectives or research
questions. For instance, if your goal is to understand customer churn, you might have sub-
goals like “Identify key factors affecting churn rates” or “Segment customers based on
churn behavior.”
Dataset(s) used: List the dataset names, sources, formats, and any data preprocessing steps
you applied (e.g., cleaning, merging datasets).
Team Members: Specify the roles and responsibilities of each team member. Who was
responsible for data cleaning, analysis, visualization, and reporting?
2. Data Overview 📁
Source(s) of the Data: Provide detailed information about the data sources, including URLs,
databases, and any data retrieval methods.
Data Size: Include information on the number of records, the number of features (columns),
and the memory usage (e.g., “10,000 records with 20 features, consuming 5 MB of
memory”).
Brief Description of the Data: Elaborate on the context and significance of the data, including
how it relates to the project’s goals. Mention any data collection issues or peculiarities.
Outlier Detection & Treatment: Detail the process of identifying outliers and your approach
to handling them, such as using visualization, statistical tests, or filtering.
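To make the data size and outlier items concrete, here is a minimal sketch assuming pandas. The DataFrame, its columns, and the helper `iqr_outliers` are hypothetical; the 1.5 × IQR rule it applies is one common convention for flagging outliers, and other tests or thresholds may suit your data better.

```python
import pandas as pd

# Hypothetical dataset standing in for the project data.
df = pd.DataFrame({
    "income": [32.0, 35.0, 36.0, 38.0, 40.0, 41.0, 43.0, 250.0],
    "age": [25, 31, 29, 44, 38, 52, 47, 35],
})

# Data size: records, features, and memory usage in bytes.
n_records, n_features = df.shape
memory_bytes = int(df.memory_usage(deep=True).sum())
print(f"{n_records} records with {n_features} features, {memory_bytes} bytes")


def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


# Outlier detection: flag, inspect, then decide on a treatment.
mask = iqr_outliers(df["income"])
print(df.loc[mask])

# One possible treatment (filtering); document whichever approach you choose.
df_filtered = df.loc[~mask]
```

Filtering is only one option. Whether to drop, cap, or keep flagged rows depends on whether they are errors or genuine extreme cases, which is exactly what this part of the report should explain.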
5. Visualization 📊
Graphs and Plots: Provide a detailed inventory of the types of graphs and plots used for
visualization, with explanations of how each visualization was chosen for specific insights.
Insights Derived from the Visuals: Elaborate on the insights gained from the visualizations,
including patterns, anomalies, and trends. Relate these findings to the project’s goals.
9. Project Files 📂
EDA Code Files: List and organize the code files, scripts, or notebooks used for different
stages of the analysis. Include comments and explanations in the code.
Data Files: Specify the data files used, including their names, formats, and descriptions.
Presentations or Reports: Include the final presentations, reports, or documents created for
communication and sharing of your EDA project results.
Conclusion
Exploratory Data Analysis is a fundamental step in the data science workflow. This guide
provides a comprehensive framework to help you navigate the process effectively, from
project initiation to data exploration, hypothesis testing, and beyond. By following these
steps, you’ll be better equipped to uncover valuable insights and make informed decisions
based on your data.