Exploratory Data Analysis

Prasad Deshmukh
Exploratory Data Analysis
• EDA is a crucial step in data analysis, involving data exploration, visualization, and summarization to uncover patterns and gain insights.
• EDA helps to understand the structure and characteristics of the dataset, detect outliers, and identify relationships between variables through statistical analysis and visualizations.

Data Collection

• Obtain the dataset you want to analyze.
• This may involve downloading data from a database, gathering data from surveys, or accessing publicly available datasets.

import pandas as pd

# Read data from a CSV file
data = pd.read_csv('data.csv')
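
Data does not always arrive as a CSV file. As a minimal sketch of the database case mentioned above, the example below loads a query result into a DataFrame with pd.read_sql_query. The in-memory SQLite database, the survey table, and its columns are invented for illustration only.

```python
import sqlite3

import pandas as pd

# Hypothetical in-memory database standing in for a real data source
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE survey (respondent_id INTEGER, age INTEGER, city TEXT)")
conn.executemany(
    "INSERT INTO survey VALUES (?, ?, ?)",
    [(1, 34, "Pune"), (2, 28, "Mumbai"), (3, 45, "Nagpur")],
)
conn.commit()

# Load the result of a SQL query into a DataFrame
survey_data = pd.read_sql_query("SELECT * FROM survey", conn)
conn.close()

print(survey_data)
```

With a real database, only the connection and the query would change; the resulting DataFrame is explored the same way as one read from a CSV file.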

Data Exploration
• Explore the dataset to gain an initial understanding.
• This can involve examining the structure of the data, checking the number of rows and columns, and previewing the first few rows to get a sense of the variables and their values.

# Check the number of rows and columns
data.shape

# Preview first few rows
data.head()

# View column names
data.columns
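
Examining the structure of the data usually also means checking column types and missing values. The sketch below shows one way to do that; the small DataFrame and its column names are hypothetical and stand in for a real dataset.

```python
import pandas as pd

# Hypothetical dataset used only for illustration
data = pd.DataFrame({
    "age": [25, 32, None, 41],
    "city": ["Pune", "Mumbai", "Pune", None],
    "income": [50000.0, 64000.0, 58000.0, 72000.0],
})

# Data type of each column
print(data.dtypes)

# Number of missing values per column
missing_counts = data.isna().sum()
print(missing_counts)

# Column types and non-null counts in a single report
data.info()
```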
Data Cleaning
• Clean the data to ensure it is in a usable format.
• This includes handling missing values, removing duplicates, correcting inconsistent data, and transforming data types if necessary.

# Handling missing values (choose one; both return a new DataFrame, so assign the result)
data = data.dropna()       # Drop rows with missing values
data = data.fillna(value)  # Or fill missing values; value is a placeholder

# Removing duplicates
data = data.drop_duplicates()

# Correcting inconsistent data (old_value and new_value are placeholders)
data['column_name'] = data['column_name'].replace(old_value, new_value)
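
Transforming data types is listed above but not shown. A minimal sketch, assuming a hypothetical DataFrame in which numbers and dates were read in as text:

```python
import pandas as pd

# Hypothetical raw data where every column was loaded as text
data = pd.DataFrame({
    "price": ["10.5", "20", "n/a"],
    "signup_date": ["2023-01-15", "2023-02-01", "2023-03-20"],
    "segment": ["a", "b", "a"],
})

# Convert text to numbers; values that cannot be parsed become NaN
data["price"] = pd.to_numeric(data["price"], errors="coerce")

# Parse date strings into datetime values
data["signup_date"] = pd.to_datetime(data["signup_date"])

# Store a low-cardinality text column as a categorical type
data["segment"] = data["segment"].astype("category")

print(data.dtypes)
```

Using errors="coerce" turns unparseable entries into missing values, which can then be handled with the missing value techniques in this deck.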
Missing Value Treatment
• Address missing values in the dataset.
• This can involve imputing missing values using techniques like mean, median, mode, or advanced imputation methods like regression or machine learning algorithms.

# Option 1: drop rows with missing values
data = data.dropna()

# Option 2: fill missing values in numeric columns with the column mean
data = data.fillna(data.mean(numeric_only=True))

# Option 3: forward fill (carry the last valid value forward)
data = data.ffill()
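
Median and mode imputation are mentioned above but not shown. A minimal sketch, assuming a hypothetical DataFrame with one numeric and one categorical column:

```python
import pandas as pd

# Hypothetical data with one missing value in each column
data = pd.DataFrame({
    "income": [40.0, 50.0, None, 1000.0],
    "city": ["Pune", None, "Pune", "Mumbai"],
})

# Median imputation: less sensitive to extreme values than the mean
data["income"] = data["income"].fillna(data["income"].median())

# Mode imputation: fill a categorical column with its most frequent value
data["city"] = data["city"].fillna(data["city"].mode()[0])

print(data)
```

Here the median of the observed incomes is 50, while the mean would be pulled far upward by the single value of 1000.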
Summary Statistics
• Compute basic summary statistics such as mean, median, mode, standard deviation, and quartiles for numerical variables.
• For categorical variables, you can calculate frequency counts or proportions for each category.

# Compute basic summary statistics
data.describe()

# Calculate mean, median, mode (numeric_only skips non-numeric columns)
data.mean(numeric_only=True)
data.median(numeric_only=True)
data.mode()
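
For the categorical case, frequency counts and proportions can be sketched with value_counts. The DataFrame and the category_column values below are hypothetical.

```python
import pandas as pd

# Hypothetical categorical data
data = pd.DataFrame({"category_column": ["a", "b", "a", "c", "a", "b"]})

# Frequency count of each category
counts = data["category_column"].value_counts()
print(counts)

# Proportion of each category (the values sum to 1)
proportions = data["category_column"].value_counts(normalize=True)
print(proportions)
```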

Data Visualization
• Create visual representations of the data using graphs, charts, and plots.
• This helps to identify patterns, trends, and outliers.
• Common visualizations include histograms, box plots, scatter plots, bar charts, and heatmaps.

import matplotlib.pyplot as plt
import seaborn as sns

# When running as a script, call plt.show() after each plot

# Histogram
plt.hist(data['column_name'])

# Box plot
sns.boxplot(x=data['column_name'])

# Scatter plot
plt.scatter(data['x_column'], data['y_column'])

# Bar chart of category counts
sns.countplot(x=data['category_column'])

# Heatmap of correlations between numeric columns
sns.heatmap(data.corr(numeric_only=True))
Correlation Analysis
• Examine the relationships between variables by calculating correlation coefficients.
• This helps to identify variables that are highly correlated, positively or negatively, and can provide insights into potential predictors or multicollinearity.

# Calculate correlation matrix (numeric columns only)
correlation_matrix = data.corr(numeric_only=True)

# Heatmap of correlation matrix
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
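
To go from a correlation matrix to a list of highly correlated pairs, one possible sketch is shown below. The synthetic columns a, b, and c and the 0.8 cutoff are assumptions made for illustration, not recommended values.

```python
import numpy as np
import pandas as pd

# Hypothetical data: b is built to track a closely, c is unrelated noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
df = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})

corr = df.corr()

# Keep only the upper triangle so each pair appears once and the diagonal is excluded
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack()

# Assumed cutoff for calling a pair highly correlated
threshold = 0.8
high_pairs = pairs[pairs.abs() > threshold]
print(high_pairs)
```

Taking the absolute value catches strong negative correlations as well as strong positive ones.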

Outlier Detection
• Identify and handle outliers in the data.
• Outliers can significantly impact analysis results, so it's important to detect and understand their presence.
• Common techniques for outlier detection include box plots, z-scores, and clustering methods.
# Box plot
sns.boxplot(x=data['column_name'])

# Z-score method
from scipy.stats import zscore

data['z_score'] = zscore(data['column_name'])
outliers = data[(data['z_score'] > 3) | (data['z_score'] < -3)]
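
Box plots conventionally flag points that lie more than 1.5 times the interquartile range (IQR) beyond the quartiles. The same rule can be applied numerically, as in this sketch; the sample values are hypothetical.

```python
import pandas as pd

# Hypothetical values with one obvious outlier
values = pd.Series([10, 12, 11, 13, 12, 11, 10, 95])

# Quartiles and interquartile range
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the quartiles
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Points outside the fences are flagged as outliers
iqr_outliers = values[(values < lower_bound) | (values > upper_bound)]
print(iqr_outliers)
```

Unlike the z-score, this rule does not rely on the mean and standard deviation, which are themselves affected by extreme values.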
Data Transformation
• Perform transformations on variables to make the data more suitable for analysis or modeling.
• Examples include log transformations, square roots, normalization, or standardization.

import numpy as np
from sklearn.preprocessing import StandardScaler

# Log transformation (requires strictly positive values; np.log1p is an option when zeros occur)
data['log_transformed'] = np.log(data['column_name'])

# Standardization (zero mean, unit variance)
scaler = StandardScaler()
data['standardized_column'] = scaler.fit_transform(data[['column_name']]).ravel()
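
Square root transformation and normalization are listed above but not shown. A minimal sketch using only pandas and NumPy, with a hypothetical value column and normalization taken to mean min-max scaling to the range 0 to 1:

```python
import numpy as np
import pandas as pd

# Hypothetical non-negative data
df = pd.DataFrame({"value": [0.0, 4.0, 9.0, 16.0, 25.0]})

# Square root transformation (defined for non-negative values)
df["sqrt_value"] = np.sqrt(df["value"])

# Min-max normalization: rescale to the range [0, 1]
col = df["value"]
df["minmax_value"] = (col - col.min()) / (col.max() - col.min())

print(df)
```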
Hypothesis Testing
• If applicable, conduct statistical tests to validate hypotheses or assumptions about the data.
• This can involve t-tests, chi-square tests, ANOVA, or other appropriate tests based on the nature of the data and the research questions.
from scipy.stats import ttest_ind

# Perform t-test between two groups

group1 = data[data['group'] == 1]['column_name']
group2 = data[data['group'] == 2]['column_name']
statistic, p_value = ttest_ind(group1, group2)
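
For two categorical variables, a chi-square test of independence plays the role that the t-test plays for group means. A minimal sketch with scipy.stats.chi2_contingency follows; the group and outcome columns and their counts are invented for illustration.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical data: outcome by group
data = pd.DataFrame({
    "group": ["A"] * 40 + ["B"] * 40,
    "outcome": ["yes"] * 30 + ["no"] * 10 + ["yes"] * 10 + ["no"] * 30,
})

# Cross-tabulate the two categorical variables
contingency_table = pd.crosstab(data["group"], data["outcome"])
print(contingency_table)

# Chi-square test of independence
chi2_stat, chi2_p_value, dof, expected = chi2_contingency(contingency_table)
print(f"chi-square = {chi2_stat:.2f}, p-value = {chi2_p_value:.4f}, dof = {dof}")
```

A small p-value suggests that the outcome distribution differs between the groups.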
Iterative Analysis

• EDA is often an iterative process.
• As you uncover insights, you may go back and refine your analysis, perform additional transformations, or explore specific aspects in more detail.

In conclusion, Exploratory Data Analysis (EDA) is a
crucial step in the data analysis process that helps to
understand the dataset, identify patterns, relationships,
and outliers, and inform subsequent analysis and
modeling decisions. It provides valuable insights and
serves as a foundation for data-driven decision-making.

THANK YOU
