Exploratory Data Analysis
Prasad Deshmukh
Exploratory Data Analysis
EDA is a crucial step in data analysis,
involving data exploration,
visualization, and summarization to
uncover patterns and gain insights.
EDA helps to understand the structure
and characteristics of the dataset, detect
outliers, and identify relationships
between variables through statistical
analysis and visualizations.
Data Collection
Data Exploration
Explore the dataset to gain an initial understanding.
This can involve examining the structure of the data, checking the number of
rows and columns, and previewing the first few rows to get a sense of the
variables and their values.
# Removing duplicates (drop_duplicates returns a new DataFrame, so reassign the result)
data = data.drop_duplicates()
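The initial checks described above can be sketched as follows. The small DataFrame and its column names are made up for illustration; in practice `data` would be loaded from your own source.

```python
import pandas as pd

# Hypothetical sample data; replace with your own dataset.
data = pd.DataFrame({
    "age": [25, 32, 47, 32, None],
    "city": ["Pune", "Mumbai", "Delhi", "Mumbai", "Pune"],
})

n_rows, n_cols = data.shape                   # number of rows and columns
preview = data.head()                         # first few rows (up to 5 by default)
missing_per_column = data.isna().sum()        # missing values per column
n_duplicates = int(data.duplicated().sum())   # rows that repeat an earlier row

print(n_rows, n_cols)
print(data.dtypes)
print(preview)
print(missing_per_column)
print(n_duplicates)
```

Checking the duplicate count before calling `drop_duplicates()` shows how many rows the cleanup would remove.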
Data Visualization
Create visual representations of the data using graphs, charts, and plots.
This helps to identify patterns, trends, and outliers.
Common visualizations include histograms, box plots, scatter plots, bar charts, and heatmaps.
import matplotlib.pyplot as plt
import seaborn as sns
# Histogram
plt.hist(data['column_name'])
plt.show()
# Box plot
sns.boxplot(x=data['column_name'])
plt.show()
# Scatter plot
plt.scatter(data['x_column'], data['y_column'])
plt.show()
# Bar chart (pass the column by keyword)
sns.countplot(x=data['category_column'])
plt.show()
# Heatmap of correlations between numeric columns
sns.heatmap(data.corr(numeric_only=True))
plt.show()
Correlation Analysis
Examine the relationships between variables by calculating correlation
coefficients.
This helps to identify variables that are highly correlated, positively or
negatively, and can provide insights into potential predictors or
multicollinearity.
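A minimal sketch of this step, assuming a small made-up DataFrame with numeric columns only. The column names and the 0.9 cutoff are illustrative assumptions, not recommendations.

```python
import pandas as pd

# Hypothetical numeric data for illustration.
data = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5],
    "exam_score": [52, 58, 65, 70, 78],
    "hours_of_tv": [5, 4, 3, 2, 1],
})

# Pearson correlation matrix (the pandas default method).
corr_matrix = data.corr()

# Collect column pairs whose absolute correlation exceeds a chosen threshold.
threshold = 0.9  # assumed cutoff, adjust to the problem at hand
cols = corr_matrix.columns
high_pairs = [
    (cols[i], cols[j], corr_matrix.iloc[i, j])
    for i in range(len(cols))
    for j in range(i + 1, len(cols))
    if abs(corr_matrix.iloc[i, j]) > threshold
]

print(corr_matrix)
print(high_pairs)
```

Pairs that appear in `high_pairs` are candidates for a closer look, since strongly correlated predictors may indicate multicollinearity.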
Outlier Detection
Identify and handle outliers in the data.
Outliers can significantly impact analysis results, so it's important to detect and understand
their presence.
Common techniques for outlier detection include box plots, z-scores, and clustering
methods.
# Box plot
sns.boxplot(x=data['column_name'])
# Z-score method
from scipy.stats import zscore
data['z_score'] = zscore(data['column_name'])
outliers = data[(data['z_score'] > 3) | (data['z_score'] < -3)]
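The box plot rule can also be applied numerically. The sketch below uses the conventional 1.5 * IQR fences on a made-up series; the values and the 1.5 multiplier are assumptions for illustration.

```python
import pandas as pd

# Hypothetical values; 95 is deliberately extreme.
values = pd.Series([10, 12, 11, 13, 12, 11, 95], name="column_name")

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1  # interquartile range

# Conventional box plot fences: 1.5 * IQR beyond the quartiles.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = values[(values < lower_fence) | (values > upper_fence)]
print(lower_fence, upper_fence)
print(outliers)
```

In very small samples a |z| > 3 cutoff may flag nothing, because a single extreme value also inflates the standard deviation; the IQR fences are less sensitive to this.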
Data Transformation
Perform transformations on variables to make the data more suitable for
analysis or modeling.
Examples include log transformations, square roots, normalization, or
standardization.
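# Log transformation (requires strictly positive values; consider np.log1p if zeros are present)
import numpy as np
data['log_transformed'] = np.log(data['column_name'])
# Standardization (zero mean, unit variance)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# fit_transform expects 2-D input and returns a 2-D array, so flatten the result
data['standardized_column'] = scaler.fit_transform(data[['column_name']]).ravel()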
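The square-root and normalization options mentioned above can be sketched in the same way. The data below is made up, and min-max scaling to [0, 1] is only one common reading of "normalization".

```python
import numpy as np
import pandas as pd

# Hypothetical non-negative values for illustration.
data = pd.DataFrame({"column_name": [0.0, 4.0, 9.0, 16.0, 25.0]})

# Square-root transformation (defined for non-negative values).
data["sqrt_transformed"] = np.sqrt(data["column_name"])

# log1p computes log(1 + x), so it tolerates zeros.
data["log1p_transformed"] = np.log1p(data["column_name"])

# Min-max normalization rescales to [0, 1]; assumes the column is not constant.
col = data["column_name"]
data["normalized"] = (col - col.min()) / (col.max() - col.min())

print(data)
```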
Hypothesis Testing
If applicable, conduct statistical tests to validate hypotheses or
assumptions about the data.
This can involve t-tests, chi-square tests, ANOVA, or other
appropriate tests based on the nature of the data and the research
questions.
from scipy.stats import ttest_ind
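One way the imported `ttest_ind` might be used is shown below. The two groups are made-up measurements, and both the 0.05 significance level and the choice of Welch's variant (`equal_var=False`) are assumptions for this sketch.

```python
from scipy.stats import ttest_ind

# Hypothetical measurements for two independent groups.
group_a = [5.1, 4.9, 5.6, 5.8, 5.2, 5.4]
group_b = [6.3, 6.1, 5.9, 6.8, 6.4, 6.0]

# Welch's t-test (equal_var=False) does not assume equal variances.
t_stat, p_value = ttest_ind(group_a, group_b, equal_var=False)

alpha = 0.05  # assumed significance level
reject_null = p_value < alpha  # True if the group means differ significantly

print(t_stat, p_value, reject_null)
```

A t-test suits a comparison of two group means; for categorical counts or more than two groups, chi-square tests or ANOVA would be the corresponding choices.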
In conclusion, Exploratory Data Analysis (EDA) is a
crucial step in the data analysis process that helps to
understand the dataset, identify patterns, relationships,
and outliers, and inform subsequent analysis and
modeling decisions. It provides valuable insights and
serves as a foundation for data-driven decision-making.