
Module 2


Exploratory Data Analysis: Introduction to Data Science

Presenter:
Dr. Shital Bhatt
Associate Professor
School of Computational and Data Sciences

www.vidyashilpuniversity.com
Introduction to Exploratory Data Analysis (EDA)
Exploratory Data Analysis (EDA)
 Exploratory Data Analysis (EDA) is an approach/philosophy for data analysis that employs a variety of techniques (mostly graphical) to maximize insight into a data set; uncover underlying structure; extract important variables; detect outliers and anomalies; test underlying assumptions; develop parsimonious models; and determine optimal factor settings.
 The EDA approach is precisely that--an approach--not a set of techniques, but an attitude/philosophy about how a data analysis should be carried out.

What is Exploratory Data Analysis (EDA)?

 • EDA is an approach to analyzing data sets to summarize their main characteristics.


 • Often involves visual methods.
 • Aimed at discovering patterns, spotting anomalies, testing hypotheses, and checking
assumptions.
Goals of EDA

 • Identify Patterns: Detect underlying trends and relationships.


 • Spot Anomalies: Find outliers and irregularities.
 • Hypothesis Testing: Formulate and test hypotheses.
 • Check Assumptions: Validate assumptions that underlie statistical models.
Underlying Assumptions

 • Data Quality: Data should be accurate, complete, and free from errors.
 • Representativeness: Data should be representative of the population being studied.
 • Independence: Observations should be independent unless stated otherwise.
 • Distribution: Understanding the distribution (normality, skewness) is key.
Techniques and Tools Used in EDA

 • Descriptive Statistics: Mean, median, mode, range, etc.


 • Visualization: Histograms, box plots, scatter plots, etc.
 • Data Cleaning: Handling missing values, correcting errors.
 • Transformation: Scaling, normalizing data.
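As a minimal sketch, the pandas/scikit-learn snippet below walks through these steps on a small invented DataFrame (the column names and values are assumptions for illustration):

```python
# Descriptive statistics, cleaning, and transformation with pandas/scikit-learn.
# The DataFrame and its columns are invented for illustration.
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "age":    [23, 31, 27, None, 45, 31],
    "income": [40000, 52000, 48000, 61000, None, 52000],
})

# Descriptive statistics: count, mean, std, min, quartiles, max
print(df.describe())
print("median age:", df["age"].median(), "| modal age:", df["age"].mode()[0])

# Data cleaning: impute missing values with the column medians
df = df.fillna(df.median(numeric_only=True))

# Transformation: scale each column to zero mean and unit variance
scaled = StandardScaler().fit_transform(df)
```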
Visualization in EDA

 • Quick Insights: Visual representations provide quick understanding.


 • Pattern Recognition: Easier to spot trends and outliers.
 • Communication: Effective way to communicate findings.
Goals of Exploratory Data Analysis (EDA)
Goal of EDA

 The primary goal of EDA is to maximize the analyst's insight into a data set and into the underlying structure of a data set, while providing all of the specific items that an analyst would want to extract from a data set, such as:
 • a good-fitting, parsimonious model
 • a list of outliers
 • a sense of robustness of conclusions
 • estimates for parameters
 • uncertainties for those estimates
 • a ranked list of important factors
 • conclusions as to whether individual factors are statistically significant
 • optimal settings
Introduction to Goals of EDA

 Exploratory Data Analysis (EDA) is a crucial step in data analysis. It serves multiple
purposes that are essential for understanding the data and ensuring robust statistical
modeling.
Identify Patterns

 • Detect underlying trends and relationships within the data.


 • Understand how different variables interact with each other.
 • Example: Identifying seasonal trends in sales data.
Spot Anomalies

 • Find outliers and irregularities that might indicate data quality issues or interesting
phenomena.
 • Important for cleaning data and for identifying rare events.
 • Example: Detecting fraudulent transactions in financial data.
Hypothesis Testing

 • Formulate and test hypotheses based on observed data patterns.


 • Helps in verifying assumptions before formal modeling.
 • Example: Testing if there is a significant difference in average spending between two
customer groups.
Check Assumptions

 • Validate assumptions that underlie statistical models, such as normality and independence
of observations.
 • Ensures the validity of subsequent statistical analyses.
 • Example: Checking the normality of residuals in regression analysis.
Underlying Assumptions of Exploratory Data Analysis (EDA)
Underlying Assumptions

 There are four assumptions that typically underlie all measurement processes; namely, that the data from the process at hand "behave like":
 • random drawings;
 • from a fixed distribution;
 • with the distribution having fixed location; and
 • with the distribution having fixed variation.
Introduction to Underlying Assumptions

 Exploratory Data Analysis (EDA) relies on several key assumptions to ensure that the
analysis is valid and meaningful. These assumptions are fundamental to the correct
interpretation and subsequent modeling of data.
Data Quality

 • Assumption that data is accurate, complete, and free from errors.


 • Essential for reliable analysis.
 • Poor data quality can lead to misleading conclusions.
Representativeness

 • Data should be representative of the population being studied.


 • Ensures that findings and insights are generalizable.
 • Example: Survey data should accurately reflect the demographics of the target
population.
Independence

 • Observations should be independent unless stated otherwise.


 • Many statistical methods assume independence of observations.
 • Violations of this assumption can bias results.
Distribution

 • Understanding the distribution of the data is crucial.


 • Assumptions about normality, skewness, and kurtosis affect the choice of analysis
methods.
 • Example: Many statistical tests assume normally distributed data.
Classical vs. Exploratory Analysis Methods

 Data Analysis Approaches


 EDA is a data analysis approach. What other data analysis approaches exist and how does
EDA differ from these other approaches? Three popular data analysis approaches are:
Classical, Exploratory (EDA), Bayesian
 Paradigms for Analysis Techniques
 These three approaches are similar in that they all start with a general science/engineering
problem and all yield science/engineering conclusions. The difference is the sequence and
focus of the intermediate steps.
 For classical analysis, the sequence is Problem => Data => Model => Analysis =>
Conclusions
 For EDA, the sequence is Problem => Data => Analysis => Model => Conclusions
 For Bayesian, the sequence is Problem => Data => Model => Prior Distribution => Analysis
=> Conclusions
Method of dealing with underlying model for
the data distinguishes the 3 approaches
 Thus for classical analysis, the data collection is followed by the imposition of a model
(normality, linearity, etc.) and the analysis, estimation, and testing that follows are focused
on the parameters of that model. For EDA, the data collection is not followed by a model
imposition; rather it is followed immediately by analysis with a goal of inferring what
model would be appropriate. Finally, for a Bayesian analysis, the analyst attempts to
incorporate scientific/engineering knowledge/expertise into the analysis by imposing a
data-independent distribution on the parameters of the selected model; the analysis thus
consists of formally combining both the prior distribution on the parameters and the
collected data to jointly make inferences and/or test assumptions about the model
parameters. In the real world, data analysts freely mix elements of all of the above three
approaches (and other approaches). The above distinctions were made to emphasize the
major differences among the three approaches.
 The classical approach imposes models (both deterministic and probabilistic) on the data. Deterministic models include, for example, regression models and analysis of variance (ANOVA) models. The most common probabilistic model assumes that the errors about the deterministic model are normally distributed; this assumption affects the validity of the ANOVA F tests.
 The Exploratory Data Analysis approach does not impose deterministic or probabilistic models on the data. On the contrary, the EDA approach allows the data to suggest admissible models that best fit the data.
 The two approaches also differ substantially in focus. For classical analysis, the focus is on the model: estimating parameters of the model and generating predicted values from the model.
 For exploratory data analysis, the focus is on the data: its structure, outliers, and models suggested by the data.
 Classical techniques are generally quantitative in nature. They include ANOVA, t tests, chi-squared tests, and F tests.
 EDA techniques are generally graphical. They include scatter plots, character plots, box plots, histograms, bihistograms, probability plots, residual plots, and mean plots.
Functions and Techniques

 Clustering and dimension reduction techniques, which help create graphical displays of
high-dimensional data containing many variables.
 Univariate visualization of each field in the raw dataset, with summary statistics.
 Bivariate visualizations and summary statistics that allow you to assess the relationship
between each variable in the dataset and the target variable you’re looking at.
 Multivariate visualizations, for mapping and understanding interactions between
different fields in the data.
TYPES OF EXPLORATORY DATA ANALYSIS:

 Univariate Non-graphical

 Multivariate Non-graphical

 Univariate graphical

 Multivariate graphical
1. Univariate Non-graphical:

 This is the simplest form of data analysis, as it uses just one variable to explore the data.
 The standard goal of univariate non-graphical EDA is to understand the underlying distribution of the sample data and make observations about the population.
 Outlier detection is also part of the analysis. The characteristics of a population distribution include:
Types

 Central tendency: The central tendency or location of a distribution has to do with typical or middle values. The commonly useful measures of central tendency are the mean, median, and sometimes the mode.
 Spread: Spread is an indicator of how far from the center we are likely to find the data values. The standard deviation and variance are two useful measures of spread.
 Skewness and kurtosis: Two more useful univariate descriptors are the skewness and kurtosis of the distribution.
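A small pandas sketch of these univariate summaries, on invented data (the outlier flag uses the common 1.5×IQR rule):

```python
# Univariate non-graphical EDA on a single variable (invented data).
import pandas as pd

x = pd.Series([2.1, 2.4, 2.4, 2.9, 3.0, 3.2, 3.8, 9.5])  # 9.5 is a likely outlier

print("mean:", x.mean(), "median:", x.median(), "mode:", x.mode()[0])  # central tendency
print("variance:", x.var(), "std dev:", x.std())                       # spread
print("skewness:", x.skew(), "kurtosis:", x.kurt())                    # shape

# Outlier detection: flag points more than 1.5 IQRs beyond the quartiles
q1, q3 = x.quantile([0.25, 0.75])
iqr = q3 - q1
print(x[(x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)])
```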
2. Multivariate Non-graphical:

 Multivariate non-graphical EDA techniques are generally used to show the relationship between two or more variables, in the form of either cross-tabulation or statistics.
 For categorical data, an extension of tabulation called cross-tabulation is extremely useful. For two variables, cross-tabulation is performed by making a two-way table with column headings that match the levels of one variable and row headings that match the levels of the other variable, then filling in the counts of all subjects that share the same pair of levels.
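For instance, such a two-way table can be built directly with pandas (the survey columns below are invented):

```python
# Cross-tabulation of two categorical variables (invented survey data).
import pandas as pd

df = pd.DataFrame({
    "gender":  ["F", "M", "F", "M", "F", "F", "M"],
    "product": ["A", "A", "B", "B", "A", "B", "A"],
})

# Rows are levels of one variable, columns are levels of the other;
# each cell counts the subjects sharing that pair of levels.
print(pd.crosstab(df["gender"], df["product"], margins=True))
```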
3. Univariate graphical:

 Non-graphical methods are quantitative and objective, but they cannot give a complete picture of the data; therefore, graphical methods, which involve a degree of subjective analysis, are also required. Common types of univariate graphics are:

 Histogram
 Stem-and-leaf plots
 Boxplots
 Quantile-normal plots
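Three of these graphics can be produced in a few lines of matplotlib/SciPy; the data below is randomly generated for illustration:

```python
# Histogram, boxplot, and quantile-normal (Q-Q) plot of one variable.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

x = np.random.default_rng(0).normal(loc=50, scale=10, size=200)  # invented data

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].hist(x, bins=20)            # histogram: overall shape of the distribution
axes[0].set_title("Histogram")
axes[1].boxplot(x)                  # boxplot: median, quartiles, outliers
axes[1].set_title("Boxplot")
stats.probplot(x, dist="norm", plot=axes[2])  # quantile-normal plot
plt.tight_layout()
plt.show()
```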
4. Multivariate graphical:
 Multivariate graphical EDA uses graphics to display relationships between two or more sets of data. The one used most commonly is a grouped barplot, with each group representing one level of one of the variables. Other common types are:
 Scatterplot: A plot of two numeric variables as points, one variable on each axis.
 Run chart: A line graph of data plotted over time.
 Heat map: A graphical representation of data where values are depicted by color.
 Multivariate chart: A graphical representation of the relationships between factors and a response.
 Bubble chart: A data visualization that displays multiple circles (bubbles) in a two-dimensional plot.
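As a sketch, a scatterplot can carry a third variable as bubble size, turning it into a bubble chart (all values invented):

```python
# Scatterplot doubling as a bubble chart: x and y positions plus a third
# variable encoded as bubble area (invented data).
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 30)
y = 2 * x + rng.normal(0, 2, 30)
size = rng.uniform(20, 300, 30)    # third variable mapped to bubble area

plt.scatter(x, y, s=size, alpha=0.5)
plt.xlabel("x")
plt.ylabel("y")
plt.title("Bubble chart: size encodes a third variable")
plt.show()
```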
TOOLS REQUIRED FOR EXPLORATORY DATA
ANALYSIS:

 1. R: An open-source programming language and free software environment for statistical computing and graphics, supported by the R Foundation for Statistical Computing. The R language is widely used among statisticians for developing statistical observations and data analysis.
 2. Python: An interpreted, object-oriented programming language with dynamic semantics. Its high-level, built-in data structures, combined with dynamic binding, make it very attractive for rapid application development, as well as for use as a scripting or glue language to connect existing components. Python is often used in EDA to spot missing values in a data set, which is vital for deciding how to handle missing values in machine learning.
 Apart from the functions described above, EDA can also:
 Perform k-means clustering: an unsupervised learning algorithm in which data points are assigned to k clusters, also referred to as k-groups. K-means clustering is commonly used in market segmentation, image compression, and pattern recognition; a minimal example follows.
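A minimal scikit-learn run on synthetic blobs (k = 3 is an assumption chosen to match the generated data):

```python
# K-means clustering with scikit-learn on synthetic data.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)  # synthetic data

km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print(km.cluster_centers_)  # one centroid per cluster
print(km.labels_[:10])      # cluster assignments of the first 10 points
```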
CLUSTER ANALYSIS
Introduction

Cluster analysis is a class of techniques used to classify objects or cases into relatively homogeneous groups called clusters.
Clustering as a data mining tool has its roots in many application areas, such as biology, security, business intelligence, and Web search.
 Why Clustering?
Requirements for Cluster Analysis

Scalability − We need highly scalable clustering


algorithms to deal with large databases.

Ability to deal with different kinds of attributes −
Algorithms should be capable of being applied to any
kind of data, such as interval-based (numerical),
categorical, and binary data.
Requirements for Cluster Analysis(2)

Discovery of clusters with arbitrary shape − The clustering
algorithm should be capable of detecting clusters of arbitrary
shape. It should not be limited to distance measures that
tend to find spherical clusters of small size.

High dimensionality − The clustering algorithm should be able
to handle not only low-dimensional data but also
high-dimensional data.
Requirements for Cluster Analysis(3)

Ability to deal with noisy data − Databases contain


noisy, missing or erroneous data. Some algorithms
are sensitive to such data and may lead to poor
quality clusters.

Interpretability − The clustering results should be


interpretable, comprehensible, and usable.
Applications

 Clustering analysis is broadly used in many


applications such as market research, pattern
recognition, data analysis, and image processing.

 Clustering can also help marketers discover distinct groups in their customer base and characterize those groups based on their purchasing patterns.
Applications(2)

 In the field of biology, it can be used to derive plant and


animal taxonomies, categorize genes with similar
functionalities and gain insight into structures inherent to
populations.

 Clustering also helps in identification of areas of similar land


use in an earth observation database. It also helps in the
identification of groups of houses in a city according to house
type, value, and geographic location.
Applications(3)

Clustering also helps in classifying documents on


the web for information discovery.

Clustering is also used in outlier detection


applications such as detection of credit card fraud.
Applications (4)

As a data mining function, cluster analysis serves


as a tool to gain insight into the distribution of data
to observe characteristics of each cluster.
Overview of Basic Clustering Methods

Clustering methods can be classified into the following


categories −

Partitioning Method
Hierarchical Method
Density-based Method
Grid-Based Method
Partitioning Methods

 K-Means Algorithm
 It is a centroid-based technique.
Suppose a data set, D, contains n objects in Euclidean space. Partitioning methods distribute the objects in D into k clusters, C1, ..., Ck, such that Ci ⊂ D and Ci ∩ Cj = ∅ for 1 ≤ i, j ≤ k with i ≠ j.
An objective function is used to assess the partitioning quality so that objects within a cluster are similar to one another but dissimilar to objects in other clusters.
Partitioning Methods

 K-Means Algorithm
 The k-means algorithm partitions the objects so that each cluster's center is represented by the mean value of the objects in the cluster.

Input:
k: the number of clusters,
D: a data set containing n objects.
Partitioning Methods

 K-Means Algorithm
Method
1. arbitrarily choose k objects from D as the initial cluster centers;
2. repeat
3. (re)assign each object to the cluster to which the object is the most
similar, based on the mean value of the objects in the cluster;
4. update the cluster means, that is, calculate the mean value of the
objects for each cluster;
5. until no change;
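The steps above translate almost line for line into NumPy; this is a teaching sketch (it assumes no cluster ever becomes empty), not a production implementation:

```python
# Lloyd's algorithm, following the five steps above.
import numpy as np

def k_means(D, k, seed=0):
    rng = np.random.default_rng(seed)
    # 1. arbitrarily choose k objects from D as the initial cluster centers
    centers = D[rng.choice(len(D), size=k, replace=False)]
    while True:                                            # 2. repeat
        # 3. (re)assign each object to the cluster with the nearest mean
        dists = np.linalg.norm(D[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 4. update the cluster means (assumes every cluster stays non-empty)
        new_centers = np.array([D[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):              # 5. until no change
            return labels, centers
        centers = new_centers

labels, centers = k_means(np.random.default_rng(1).normal(size=(100, 2)), k=3)
```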
Hierarchical Clustering

 Agglomerative vs. Divisive: agglomerative (bottom-up) clustering starts with each object in its own cluster and repeatedly merges the closest pair of clusters; divisive (top-down) clustering starts with all objects in one cluster and recursively splits it.
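A short SciPy sketch of the agglomerative variant, drawn as a dendrogram (the data and the Ward linkage method are illustrative choices):

```python
# Agglomerative (bottom-up) hierarchical clustering and its dendrogram.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (10, 2)), rng.normal(5, 1, (10, 2))])  # two blobs

Z = linkage(X, method="ward")  # merge closest clusters, starting from singletons
dendrogram(Z)                  # the tree records the order of merges
plt.title("Agglomerative clustering dendrogram")
plt.show()
```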
Visualizing Data
Guidelines, Integrity, and Grammar of Graphics
Guidelines for Good Plots

 1. Simplify complex data


 2. Use appropriate chart types
 3. Avoid clutter
 4. Use consistent scales
 5. Label axes and legends clearly
 6. Choose colors wisely
 7. Provide context with titles and captions
Maintain Integrity When Plotting Data

 1. Avoid misleading representations


 2. Use complete data sets
 3. Represent data accurately
 4. Clearly distinguish data from projections
 5. Use appropriate scales to avoid distortion
 6. Avoid cherry-picking data points
 7. Be transparent about sources and methods
Grammar of Graphics

 1. Data: What data to visualize


 2. Aesthetics: How data is mapped to visual properties
 3. Geometries: Shapes used to represent data (e.g., points, lines, bars)
 4. Scales: How data values map to aesthetic values
 5. Statistics: Summarize data before plotting
 6. Coordinates: Spatial properties of the plot
 7. Facets: Multi-plot layouts to show subsets of data
Visualizing Data
Guidelines, Integrity, and Grammar of Graphics
Guidelines for Good Plots - Part 1

 1. Simplify complex data: Use charts that make data easier to understand.
 - Example: Use line charts for trends over time.
 2. Use appropriate chart types: Choose charts that best represent your data.
 - Example: Use bar charts for categorical data.
Guidelines for Good Plots - Part 2

 3. Avoid clutter: Keep charts clean and simple.


 - Example: Remove unnecessary gridlines and background colors.
 4. Use consistent scales: Ensure that scales are consistent across charts.
 - Example: Use the same axis limits for comparing similar data sets.
Guidelines for Good Plots - Part 3

 5. Label axes and legends clearly: Provide clear labels for better understanding.
 - Example: Use descriptive titles and labels for axes.
 6. Choose colors wisely: Use colors that are distinguishable and meaningful.
 - Example: Avoid using red and green together for colorblind audiences.
 7. Provide context with titles and captions: Add titles and captions to explain the data.
 - Example: Include a brief description of what the chart represents.
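Put together, a minimal matplotlib example that follows these guidelines (the sales figures are invented):

```python
# A line chart for a trend over time, with clear labels, a descriptive
# title, and minimal clutter (invented data).
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
sales = [120, 135, 128, 150, 162, 158]

fig, ax = plt.subplots()
ax.plot(months, sales, marker="o")            # appropriate chart type for a trend
ax.set_xlabel("Month")                        # clearly labeled axes
ax.set_ylabel("Sales (units)")
ax.set_title("Monthly sales, Jan-Jun")        # context via the title
ax.spines["top"].set_visible(False)           # remove clutter
ax.spines["right"].set_visible(False)
plt.show()
```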
Maintain Integrity When Plotting Data - Part 1

 1. Avoid misleading representations: Ensure charts accurately represent data.


 - Example: Avoid using 3D charts that distort data perception.
 2. Use complete data sets: Present the entire data set to avoid bias.
 - Example: Do not omit data points that do not support your hypothesis.
Maintain Integrity When Plotting Data - Part 2

 3. Represent data accurately: Use appropriate scales and avoid exaggeration.


 - Example: Start the y-axis at zero to provide a true sense of proportion.
 4. Clearly distinguish data from projections: Separate actual data from forecasts.
 - Example: Use different line styles for actual data and projections.
Maintain Integrity When Plotting Data - Part 3

 5. Use appropriate scales to avoid distortion: Ensure scales accurately reflect differences in
data.
 - Example: Avoid using logarithmic scales unless necessary.
 6. Avoid cherry-picking data points: Show the full picture without selecting only favorable
data.
 - Example: Present all relevant data points, not just those that support your argument.
 7. Be transparent about sources and methods: Clearly state data sources and methods used.
 - Example: Include a note on where the data was obtained and any transformations
applied.
Grammar of Graphics - Part 1

 1. Data: What data to visualize.


 - Example: Choose relevant data sets for the analysis.
 2. Aesthetics: How data is mapped to visual properties.
 - Example: Map data values to colors, shapes, and sizes.
Grammar of Graphics - Part 2

 3. Geometries: Shapes used to represent data (e.g., points, lines, bars).


 - Example: Use bars for categorical data and lines for trends.
 4. Scales: How data values map to aesthetic values.
 - Example: Use linear or logarithmic scales depending on the data.
Grammar of Graphics - Part 3

 5. Statistics: Summarize data before plotting.


 - Example: Use mean, median, or other statistics to summarize data.
 6. Coordinates: Spatial properties of the plot.
 - Example: Choose between Cartesian and polar coordinates.
 7. Facets: Multi-plot layouts to show subsets of data.
 - Example: Use facets to compare multiple categories or time periods.
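These components map directly onto ggplot-style code; here is a sketch using the third-party plotnine package (a Python port of ggplot2) with an invented DataFrame:

```python
# Grammar of graphics in plotnine: data, aesthetics, geometry, scale, facets.
import pandas as pd
from plotnine import ggplot, aes, geom_point, scale_y_log10, facet_wrap

df = pd.DataFrame({                               # data: what to visualize
    "x": [1, 2, 3, 4, 1, 2, 3, 4],
    "y": [10, 40, 90, 160, 5, 20, 45, 80],
    "group": ["a"] * 4 + ["b"] * 4,
})

plot = (
    ggplot(df, aes(x="x", y="y", color="group"))  # aesthetics: map columns to visuals
    + geom_point()                                # geometry: points
    + scale_y_log10()                             # scale: log mapping for y values
    + facet_wrap("~group")                        # facets: one panel per subset
)
print(plot)  # render the figure
```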
Visualizing Non-Numeric Data
Techniques, Tools, and Best Practices
Introduction

 Overview: Importance of visualizing non-numeric data


 Types of non-numeric data: Text, images, categories, etc.
 Objective: Enhance understanding and communication of qualitative information
Understanding Non-Numeric Data

 Definition: What is non-numeric data?


 Examples: Textual data, categorical data, ordinal data
 Challenges: Complexity, subjectivity, context-dependence
Techniques for Visualizing Non-Numeric Data

 Textual Data: Word clouds, text networks


 Categorical Data: Bar charts, pie charts, heatmaps
 Ordinal Data: Ordered bar charts, color-coded tables
 Hierarchical Data: Tree diagrams, sunburst charts
Word Clouds

 Description: Visual representation of text data frequency


 Use Cases: Analyzing survey responses, feedback, reviews
 Example: [Insert example of a word cloud]
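A minimal sketch with the third-party wordcloud package (the input text is a placeholder):

```python
# Word cloud: word size reflects frequency in the text.
import matplotlib.pyplot as plt
from wordcloud import WordCloud

text = "data science data analysis visualization data cluster analysis insight"

wc = WordCloud(width=600, height=300, background_color="white").generate(text)
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()
```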
Text Networks

 Description: Graph-based representation of relationships between words


 Use Cases: Social network analysis, topic modeling
 Example: [Insert example of a text network]
Bar Charts for Categorical Data

 Description: Comparing categories based on frequency or other metrics


 Use Cases: Survey results, categorical comparisons
 Example: [Insert example of a bar chart]
Pie Charts

 Description: Showing proportions of a whole


 Use Cases: Market share, demographic distributions
 Example: [Insert example of a pie chart]
Heatmaps

 Description: Visual representation of data in matrix form using color coding


 Use Cases: Correlation matrices, categorical data comparisons
 Example: [Insert example of a heatmap]
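For example, a correlation-matrix heatmap with seaborn (the DataFrame is randomly generated):

```python
# Heatmap of a correlation matrix; color encodes the correlation value.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 4)), columns=["a", "b", "c", "d"])

sns.heatmap(df.corr(), annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Correlation matrix heatmap")
plt.show()
```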
Tree Diagrams

 Description: Representing hierarchical data with branches


 Use Cases: Organizational charts, decision trees
 Example: [Insert example of a tree diagram]
Sunburst Charts

 Description: Radial visualization of hierarchical data


 Use Cases: Showing part-to-whole relationships in hierarchical data
 Example: [Insert example of a sunburst chart]
Tools for Visualizing Non-Numeric Data

 Overview: Various tools available for creating visualizations


 Examples:
 - Word Clouds: Wordle, Tagxedo
 - Text Networks: Gephi, NodeXL
 - General Tools: Tableau, Power BI, R, Python libraries (matplotlib, seaborn)
Best Practices

 Clarity: Ensure visualizations are easy to understand


 Relevance: Choose the right type of visualization for the data
 Simplicity: Avoid overcomplicating visuals
 Context: Provide necessary context and explanations
