Phython Example

Uploaded by

reshmibiotech


Exploratory Data Analysis (EDA) Using Python

Libraries
To keep the article simple, we will use a single dataset of employee data. It
contains 8 columns: First Name, Gender, Start Date, Last Login Time, Salary,
Bonus %, Senior Management, and Team. The dataset is available as Employees.csv.
Let’s read the dataset using the Pandas read_csv() function and print the first
five rows using the head() function.
 Python3

import pandas as pd
import numpy as np
# read dataset using pandas
df = pd.read_csv('employees.csv')
df.head()

Output:

First five rows of the dataframe

Getting Insights About The Dataset

Let’s see the shape of the data using the shape attribute.

 Python3

df.shape

Output:
(1000, 8)
This means that this dataset has 1000 rows and 8 columns.
Let’s get a quick summary of the dataset using the pandas describe() method. The
describe() function applies basic statistical computations to the dataset, such as
extreme values, count of data points, standard deviation, etc. Any missing value or
NaN value is automatically skipped. The describe() function gives a good picture of
the distribution of the data.
Example:
 Python3

df.describe()

Output:

description of the dataframe

Note that we can also get a description of the categorical columns of the dataset
if we specify include='all' in the describe() function.
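As a minimal sketch of the include='all' option, the snippet below uses a small made-up frame (the values are hypothetical and not taken from Employees.csv) to compare the default summary with the extended one.

```python
import pandas as pd

# Hypothetical miniature of the employee data (values invented for illustration)
sample = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", None],
    "Salary": [50000, 62000, 58000, 71000],
})

# Default: only numeric columns are summarized
numeric_summary = sample.describe()

# include="all" also covers categorical columns and adds
# the rows count / unique / top / freq for them
full_summary = sample.describe(include="all")
print(full_summary)
```

For the categorical column, unique is the number of distinct values, top is the most frequent value, and freq is how often it occurs; the numeric statistics are left as NaN there.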
Now, let’s also see the columns and their data types. For this, we will use
the info() method.
 Python3

# information about the dataset

df.info()

Output:

Information about the dataset

Changing Dtype from Object to Datetime
Start Date is an important column for employees. However, it is of little use
while it is stored as a generic object (string) column. To handle this type of data,
pandas provides the to_datetime() function, which converts an object column to the
datetime format.
 Python3

# convert "Start Date" column to datetime data type

df['Start Date'] = pd.to_datetime(df['Start Date'])
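The conversion can be made more explicit. The sketch below assumes month/day/year strings (an assumption about the file, not something the article states) and uses made-up values, one of which is deliberately invalid.

```python
import pandas as pd

# Hypothetical sample dates; the third entry is deliberately invalid
dates = pd.Series(["8/6/1993", "3/31/1996", "not a date"])

# An explicit format avoids day/month ambiguity; errors="coerce"
# turns unparseable entries into NaT instead of raising an error
parsed = pd.to_datetime(dates, format="%m/%d/%Y", errors="coerce")

print(parsed.dtype)
print(parsed.isna().tolist())
```

Entries that become NaT then show up in isnull() like any other missing value.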

We can see the number of unique elements in our dataset. This will help us in
deciding which type of encoding to choose for converting categorical columns into
numerical columns.

 Python3

df.nunique()

Output:
First Name           200
Gender                 2
Start Date           972
Last Login Time      720
Salary               995
Bonus %              971
Senior Management      2
Team                  10
dtype: int64
So far we have got an idea of the dataset. Now let’s see whether it contains any
missing values.

Handling Missing Values

You may be wondering why a dataset would contain missing values. They occur when
no information is provided for one or more items or for a whole unit. For example,
some users being surveyed may choose not to share their income, and others may
choose not to share their address; in this way, values go missing from many
datasets. Missing data is a very big problem in real-life scenarios. Missing data is
also referred to as NA (Not Available) values in pandas. There are several useful
functions for detecting, removing, and replacing null values in a Pandas
DataFrame:
 isnull()
 notnull()
 dropna()
 fillna()
 replace()
 interpolate()
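A minimal sketch of what each of the functions listed above does, applied to a small made-up series (the values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical series with two gaps
s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])

print(s.isnull().tolist())          # True where a value is missing
print(s.notnull().sum())            # number of non-missing values

dropped = s.dropna()                # remove the missing entries
filled = s.fillna(0)                # replace missing entries with a constant
replaced = s.replace(np.nan, -1)    # replace() can also target NaN
interpolated = s.interpolate()      # linear interpolation between neighbours
```

None of these calls modifies s itself; each returns a new object.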
Now let’s check if there are any missing values in our dataset or not.
Example:
 Python3

df.isnull().sum()

Output:

Null values in dataframe

We can see that the columns have different numbers of missing values: for example,
Gender has 145 missing values and Salary has 0. There are several ways to handle
these missing values, such as dropping the rows containing NaN or replacing NaN
with the mean, median, mode, or some other value.
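As a sketch of replacing NaN with a statistic, the snippet below fills a numeric column with its median and a categorical column with its mode. The frame and its values are hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one numeric and one categorical column with gaps
toy = pd.DataFrame({
    "Salary": [40000.0, np.nan, 60000.0, 110000.0],
    "Team": ["Sales", "Sales", None, "Legal"],
})

# Numeric column: the median is less sensitive to extreme values than the mean
toy["Salary"] = toy["Salary"].fillna(toy["Salary"].median())

# Categorical column: use the most frequent value (the mode)
toy["Team"] = toy["Team"].fillna(toy["Team"].mode()[0])

print(toy)
```

Which statistic is appropriate depends on the column; the choice here is only one reasonable option.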
Now, let’s try to fill in the missing values of gender with the string “No Gender”.
Example:
 Python3

# assign the result back; inplace=True on a selected column is deprecated in recent pandas
df["Gender"] = df["Gender"].fillna("No Gender")

df.isnull().sum()

Output:
Null values in dataframe after filling Gender column

We can see that now there is no null value for the gender column. Now, Let’s fill the
senior management with the mode value.
Example:
 Python3

mode = df['Senior Management'].mode().values[0]

df['Senior Management']= df['Senior Management'].replace(np.nan, mode)

df.isnull().sum()

Output:

Null values in dataframe after filling the Senior Management column

Now for the first name and team, we cannot fill the missing values with arbitrary
data, so, let’s drop all the rows containing these missing values.
Example:
 Python3

df = df.dropna(axis = 0, how ='any')

print(df.isnull().sum())
df.shape

Output:
Null values in dataframe after dropping all null values

We can see that our dataset is now free of missing values, and that dropping
these rows reduced the number of rows from 1000 to 899.
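When only some columns should trigger a drop, dropna() accepts a subset argument. The sketch below uses a small made-up frame (names and values are hypothetical) to show that gaps outside the subset do not remove a row.

```python
import numpy as np
import pandas as pd

# Hypothetical rows with gaps in different columns
people = pd.DataFrame({
    "First Name": ["Ann", None, "Raj", "Lee"],
    "Team": ["Sales", "Legal", None, "Sales"],
    "Bonus %": [5.0, 7.5, 3.2, np.nan],
})

# Drop a row only when "First Name" or "Team" is missing;
# the gap in "Bonus %" alone does not remove a row
trimmed = people.dropna(subset=["First Name", "Team"])

print(people.shape, "->", trimmed.shape)
```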
Note: For more information, refer to Working with Missing Data in Pandas.
With the missing data removed, let’s encode the categorical data and then
visualize our data.

Data Encoding

Some models, such as Linear Regression, do not work with categorical data. In
that case we should encode the categorical columns as numerical columns. We can
use different methods for encoding, such as label encoding or one-hot encoding.
Both pandas and sklearn provide functions for encoding; in our case we will use
the LabelEncoder class from sklearn to encode the Gender column.
 Python3

from sklearn.preprocessing import LabelEncoder

# create an instance of LabelEncoder
le = LabelEncoder()

# fit and transform the "Gender" column with LabelEncoder
df['Gender'] = le.fit_transform(df['Gender'])
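One-hot encoding, the other method mentioned above, can be sketched with pandas get_dummies(). The frame below is a hypothetical stand-in for the Team column, not the actual data.

```python
import pandas as pd

# Hypothetical stand-in for the Team column
teams = pd.DataFrame({"Team": ["Sales", "Legal", "Sales", "Finance"]})

# One indicator column per category; dtype=int yields 0/1 instead of booleans
encoded = pd.get_dummies(teams, columns=["Team"], dtype=int)

print(encoded)
```

Unlike label encoding, this does not impose an order on the categories, at the cost of one extra column per category.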


Data visualization
Data Visualization is the process of analyzing data in the form of graphs or maps,
making it a lot easier to understand the trends or patterns in the data.
Let’s see some commonly used graphs –
Note: We will use the Matplotlib and Seaborn libraries for data visualization. To
learn more about these modules, refer to the articles:
 Matplotlib Tutorial
 Python Seaborn Tutorial

Histogram

It can be used for both univariate and bivariate analysis.

Example:
 Python3
# importing packages
import seaborn as sns
import matplotlib.pyplot as plt

sns.histplot(x='Salary', data=df)
plt.show()

Output:

Histogram plot of salary column

Boxplot

It can also be used for univariate and bivariate analyses.

Example:
 Python3

# importing packages
import seaborn as sns
import matplotlib.pyplot as plt

sns.boxplot(x='Salary', y='Team', data=df)

plt.show()

Output:
Boxplot of Salary and team column

Scatter Plot For Data Visualization

It can be used for bivariate analyses.

Example:
 Python3

# importing packages
import seaborn as sns
import matplotlib.pyplot as plt

sns.scatterplot(x='Salary', y='Team', data=df,
                hue='Gender', size='Bonus %')

# Placing Legend outside the Figure

plt.legend(bbox_to_anchor=(1, 1), loc=2)

plt.show()

Output:
Scatter plot of salary and Team column

For multivariate analysis, we can use the pairplot() method of the seaborn module.
We can also use it to plot multiple pairwise bivariate distributions in a dataset.
Example:
 Python3

# importing packages
import seaborn as sns
import matplotlib.pyplot as plt

sns.pairplot(df, hue='Gender', height=2)

plt.show()

Output:
Pairplot of columns of dataframe

Handling Outliers
An outlier is a data item/object that deviates significantly from the rest of the
(so-called normal) objects. Outliers can be caused by measurement or execution
errors. The analysis for outlier detection is referred to as outlier mining. There
are many ways to detect outliers, and removing them from the dataframe works the
same way as removing any data item from a pandas DataFrame.
Let’s consider the Iris dataset and plot the boxplot for the SepalWidthCm
column.
Example:
 Python3

# importing packages
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the dataset

df = pd.read_csv('Iris.csv')

sns.boxplot(x='SepalWidthCm', data=df)

plt.show()

Output:
Boxplot of sepal width column before outliers removal

In the above graph, the values above 4 and below 2 are acting as outliers.

Removing Outliers

To remove outliers, we follow the same process as removing any entry from the
dataset by its position, because every detection method ultimately produces a
list of the data items that satisfy that method's definition of an outlier.
Example: We will detect the outliers using the IQR (interquartile range) and then
remove them. We will also draw the boxplot to check whether the outliers have been
removed.
 Python3

# Importing
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the dataset

df = pd.read_csv('Iris.csv')

# IQR (the method argument requires NumPy 1.22 or later;
# older versions call it interpolation)
Q1 = np.percentile(df['SepalWidthCm'], 25, method='midpoint')
Q3 = np.percentile(df['SepalWidthCm'], 75, method='midpoint')
IQR = Q3 - Q1
print("Old Shape: ", df.shape)

# Row positions at or above the upper bound
upper = np.where(df['SepalWidthCm'] >= (Q3 + 1.5 * IQR))

# Row positions at or below the lower bound
lower = np.where(df['SepalWidthCm'] <= (Q1 - 1.5 * IQR))

# Removing the Outliers (positions match the row labels here
# because read_csv creates a default 0..n-1 index)

df.drop(upper[0], inplace=True)
df.drop(lower[0], inplace=True)

print("New Shape: ", df.shape)

sns.boxplot(x='SepalWidthCm', data=df)
plt.show()

Output:

Boxplot of sepal width after outlier removal
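The same IQR rule can also be sketched with a boolean mask, which does not depend on the row labels matching the row positions. The widths below are hypothetical values, not the Iris data. The sketch uses the default linear quantile interpolation of pandas rather than 'midpoint', and it keeps values that lie exactly on a bound, so it is a variation on the example above rather than an exact equivalent.

```python
import pandas as pd

# Hypothetical widths with one obviously extreme value
widths = pd.DataFrame({"SepalWidthCm": [2.8, 3.0, 3.1, 3.2, 3.4, 9.0]})

q1 = widths["SepalWidthCm"].quantile(0.25)
q3 = widths["SepalWidthCm"].quantile(0.75)
iqr = q3 - q1

# Keep rows inside the bounds [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
inside = widths["SepalWidthCm"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
cleaned = widths[inside]

print("Old Shape: ", widths.shape)
print("New Shape: ", cleaned.shape)
```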

Note: For more information, refer to Detect and Remove the Outliers using Python.
These are some of the EDA steps we perform during a data science project; how
much analysis is needed depends on your requirements.
