100% found this document useful (1 vote)

111 views

Comprehensive Guide Data Exploration Sas Using Python Numpy Scipy Matplotlib Pandas

This document provides an overview of data exploration techniques in Python using NumPy, Matplotlib, and Pandas. It discusses how to load data, convert data types, transpose dataframes, sort data, create plots, generate frequency tables, sample data, remove duplicates, group variables to calculate metrics, recognize and treat missing values, and merge/join datasets. Examples are provided using sample data to demonstrate each technique.

Uploaded by

Ahsan Ahmad Beg

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

100% found this document useful (1 vote)

111 views

Comprehensive Guide Data Exploration Sas Using Python Numpy Scipy Matplotlib Pandas

Uploaded by

Ahsan Ahmad Beg

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 12

Ultimate guide for Data Exploration in Python using NumPy,

Matplotlib and Pandas

D AT A E XPLO RAT I O N D AT A VI S UA LI Z AT I O N I NT E RM E D I AT E LI BRA RI E S PRO G RA M M I NG PYT HO N

Introduction

Exploring data sets and developing deep understanding about the data is one of the most important skills
every data scientist should possess. People estimate that the time spent on these activities can go as high
as 80% of the project time in some cases.

Python has been gaining a lot of ground as preferred tool for data scientists lately, and for the right
reasons. Ease of learning, powerful libraries with integration of C/C++, production readiness and
integration with web stack are some of the main reasons for this move lately.

In this guide, I will use NumPy, Matplotlib, Seaborn, and Pandas to perform data exploration. These are
powerful libraries to perform data exploration in Python. The idea is to create a ready reference for some of
the regular operations required frequently. I am using an iPython Notebook to perform data exploration and
would recommend the same for its natural fit for exploratory analysis.

In case you missed it, I would suggest you to refer to the baby steps series of Python to understand the
basics of python programming.

Learning Python for data analysis – with instructions on installation and creating the environment
Libraries and data structures
Exploratory analysis in Python (using Pandas)
Data Munging in Python (using Pandas)

Contents – Data Exploration

Here are the operations I’ll cover in this article (Refer to this article for similar operations in SAS):

1. How to load data file(s) using Pandas?

2. How to convert a variable to different data type?

3. How to transpose a table/Pandas dataframe?

4. How to sort Data?

5. How to create plots (Histogram, Scatter, Box Plot)?

6. How to generate frequency tables?

7. How to do sampling of Data set?

1. How to remove duplicate values of a variable?

2. How to group variables to calculate count, average, sum?

3. How to recognize and treat missing values and outliers?

4. How to merge / join data set or dataframes effectively in Pandas?

Part 1: How to load data file(s) using Pandas?

Input data sets can be in various formats (.XLS, .TXT, .CSV, JSON ). In Python, it is easy to load data from
any source, due to its simple syntax and availability of predefined libraries, such as Pandas. Here I will
make use of Pandas itself.

Pandas features a number of functions for reading tabular data as a Pandas DataFrame object. Below are
the common functions that can be used to read data (including read_csv in Pandas):

Loading data from a CSV file(s):

Code

import pandas as pd #Import Library Pandas df = pd.read_csv("E:/train.csv") #I am working in Windows

environment #Reading the dataset in a dataframe using Pandas print df.head(3) #Print first three observations

Output

Loading data from excel file(s):

Code df=pd.read_excel("E:/EMP.xlsx", "Data") # Load Data sheet of excel file EMP Output print df

Loading data from a txt file(s):

Code:
df=pd.read_csv("E:/Test.txt",sep='\t') # Load Data from text file having tab '\t' delimeter print df

Output

Part 2: How to convert a variable to a different data type?

Converting a variable data type to others is an important and common procedure we perform after loading
data. Let’s look at some of the commands to perform these conversions:

Convert numeric variables to string variables and vice versa

srting_outcome = str(numeric_input) #Converts numeric_input to string_outcome integer_outcome =

int(string_input) #Converts string_input to integer_outcome float_outcome = float(string_input) #Converts

string_input to integer_outcome

The later operations are especially useful when you input value from user using raw_input(). By default, the
values are read at string.

Convert character date to Date:

There are multiple ways to do this. The simplest would be to use the datetime library and strptime
function. Here is the code:

from datetime import datetime char_date = 'Apr 1 2015 1:20 PM' #creating example character date date_obj =

datetime.strptime(char_date, '%b %d %Y %I:%M%p') print date_obj

Part 3: How to transpose a Data set or dataframe using Pandas?

Here, I want to transpose Table A into Table B on the variable Product. This task can be accomplished by
using Pandas dataframe.pivot:

Code

#Transposing Pandas dataframe by a variable

df=pd.read_excel("E:/transpose.xlsx", "Sheet1") # Load Data sheet of excel file EMP print df result=

df.pivot(index= 'ID', columns='Product', values='Sales') result

Output

Part 4: How to sort a Pandas DataFrame?

Sorting of data can be done using dataframe.sort(). It can be based on multiple variables and ascending or
descending both orders.

Code

#Sorting Pandas Dataframe df=pd.read_excel("E:/transpose.xlsx", "Sheet1") #Add by variable name(s) to sort

print df.sort(['Product','Sales'], ascending=[True, False])

Above, we have a table with variables ID, Product and Sales. Now, we want to sort it by Product and Sales
(in descending order) as shown in table 2.

Part 5: How to create plots (Histogram, Scatter, Box Plot)?

Data visualization always helps to understand the data easily. Python has libraries like matplotlib and
seaborn to create multiple graphs effectively. Let’s look at the some of the visualizations to understand
below behavior of variable(s) .

The distribution of age

Relation between age and sales; and
If sales are normally distributed or not?

Histogram:

Code

#Plot Histogram

import matplotlib.pyplot as plt import pandas as pd

df=pd.read_excel("E:/First.xlsx", "Sheet1")

#Plots in matplotlib reside within a figure object, use plt.figure to create new figure fig=plt.figure()
#Create one or more subplots using add_subplot, because you can't create blank figure ax =
fig.add_subplot(1,1,1)

#Variable ax.hist(df['Age'],bins = 5)

#Labels and Tit plt.title('Age distribution') plt.xlabel('Age') plt.ylabel('#Employee') plt.show()

Output

Scatter plot:

Code

#Plots in matplotlib reside within a figure object, use plt.figure to create new figure fig=plt.figure()

#Create one or more subplots using add_subplot, because you can't create blank figure ax =

fig.add_subplot(1,1,1)

#Variable ax.scatter(df['Age'],df['Sales'])

#Labels and Tit plt.title('Sales and Age distribution') plt.xlabel('Age') plt.ylabel('Sales') plt.show()

Output
Box-plot:

Code

import seaborn as sns sns.boxplot(df['Age']) sns.despine()

Output

Part 6: How to generate frequency tables with Pandas?

Frequency Tables can be used to understand the distribution of a categorical variable or n categorical
variables using frequency tables.

Code

import pandas as pd

df=pd.read_excel("E:/First.xlsx", "Sheet1") print df

test= df.groupby(['Gender','BMI']) test.size()

Output
Part 7: How to do sample Data set in Python?

To select sample of a data set, we will use library numpy and random. Sampling of data set always helps to
understand data quickly.

Let’s say, from EMP table, I want to select random sample of 5 employees.

Code

#Create Sample dataframe

import numpy as np import pandas as pd from random import sample

# create random index rindex = np.array(sample(xrange(len(df)), 5))

# get 5 random rows from the dataframe df dfr = df.ix[rindex] print dfr

Output

Part 8: How to remove duplicate values of a variable in a Pandas Dataframe?

Often, we encounter duplicate observations. To tackle this in Python, we can use
dataframe.drop_duplicates().

Code

#Remove Duplicate Values based on values of variables "Gender" and "BMI"

rem_dup=df.drop_duplicates(['Gender', 'BMI']) print rem_dup

Output

Part 9: How to group variables in Pandas to calculate count, average, sum?

To understand the count, average and sum of variable, I would suggest you use dataframe.describe() with
Pandas groupby().

Let’s look at the code:

Code

test= df.groupby(['Gender']) test.describe()

Output
Part 10: How to recognize and Treat missing values and outliers in Pandas?

To identify missing values , we can use dataframe.isnull(). You can also refer article “Data Munging in
Python (using Pandas)“, here we have done a case study to recognize and treat missing and outlier values.

Code

# Identify missing values of dataframe df.isnull()

Output

To treat missing values, there are various imputation methods available. You can refer these articles for
methods to detect Outlier and Missing values. Imputation methods for both missing and outlier values are
almost similar. Here we will discuss general case imputation methods to replace missing values. Let’s do it
using an example:

Code:

#Example to impute missing values in Age by the mean import numpy as np meanAge = np.mean(df.Age) #Using

numpy mean function to calculate the mean value df.Age = df.Age.fillna(meanAge) #replacing missing values in
the DataFrame

Part 11: How to merge / join data sets and Pandas dataframes?

Joining / merging is one of the common operation required to integrate datasets from different sources.
They can be handled effectively in Pandas using merge function:

Code:

df_new = pd.merge(df1, df2, how = 'inner', left_index = True, right_index = True) # merges df1 and df2 on

index # By changing how = 'outer', you can do outer join. # Similarly how = 'left' will do a left join # You
can also specify the columns to join instead of indexes, which are used by default.

End Notes:

In this comprehensive guide, we looked at the Python codes for various steps in data exploration and
munging. We also looked at the python libraries like Pandas, Numpy, Matplotlib and Seaborn to perform
these steps. In next article, I will reveal the codes to perform these steps in R.

Also See: If you have any doubts pertaining to Python, feel free to discuss with us.

Did you find the article useful? Do let us know your thoughts about this guide in the comments section
below.

If you like what you just read & want to continue your analytics
learning, subscribe to our emails, follow us on twitter or like
our facebook page.

Article Url - https://www.analyticsvidhya.com/blog/2015/04/comprehensive-guide-data-exploration-sas-

using-python-numpy-scipy-matplotlib-pandas/

Sunil Ray
I am a Business Analytics and Intelligence professional with deep experience in the Indian Insurance
industry. I have worked for various multi-national Insurance companies in last 7 years.

Pandas Basics
No ratings yet
Pandas Basics
84 pages
IManager U2000 V200R017C60 Administrator Guide 02 (PDF) - C
80% (5)
IManager U2000 V200R017C60 Administrator Guide 02 (PDF) - C
1,571 pages
Alteryx - Tool - Flashcards - 201910
No ratings yet
Alteryx - Tool - Flashcards - 201910
16 pages
Data Exploration in Python PDF
No ratings yet
Data Exploration in Python PDF
1 page
Python Data Analyst Handbook Guide_byom_cybertechie
No ratings yet
Python Data Analyst Handbook Guide_byom_cybertechie
57 pages
data analysis
No ratings yet
data analysis
42 pages
Course_ Introduction to Data Science (SD211105)
No ratings yet
Course_ Introduction to Data Science (SD211105)
10 pages
Stats Unit1
No ratings yet
Stats Unit1
27 pages
DevOps Session 3 Pandas.pptx
No ratings yet
DevOps Session 3 Pandas.pptx
33 pages
Exploratory Data Analysis: by Neha Mathur
No ratings yet
Exploratory Data Analysis: by Neha Mathur
14 pages
Data Analytics With PowerBI
No ratings yet
Data Analytics With PowerBI
27 pages
L6 and 7-Data Preprocessing-coding
No ratings yet
L6 and 7-Data Preprocessing-coding
34 pages
2,3. Introduction Pandas & Matplotlib - Copy
No ratings yet
2,3. Introduction Pandas & Matplotlib - Copy
32 pages
DOC-20250315-WA0005.
No ratings yet
DOC-20250315-WA0005.
29 pages
1st Class-Introduction and Python Package (1)
No ratings yet
1st Class-Introduction and Python Package (1)
93 pages
Chapter 2. Data Analysis and Processing - Full
No ratings yet
Chapter 2. Data Analysis and Processing - Full
49 pages
CSE445 NSU Week_3
No ratings yet
CSE445 NSU Week_3
48 pages
Data Analysis With Python
No ratings yet
Data Analysis With Python
29 pages
Unit-2 Bda
No ratings yet
Unit-2 Bda
11 pages
dav 2 unit
No ratings yet
dav 2 unit
55 pages
Utf-8''libraries Data Management
No ratings yet
Utf-8''libraries Data Management
9 pages
Python For Statistics
No ratings yet
Python For Statistics
40 pages
Q-Step WS 06112019 Data Analysis and Visualisation With Python
No ratings yet
Q-Step WS 06112019 Data Analysis and Visualisation With Python
76 pages
Lesson 2 - Data Preprocessing
100% (1)
Lesson 2 - Data Preprocessing
72 pages
Exploratory Data Analysis: by Neha Mathur
No ratings yet
Exploratory Data Analysis: by Neha Mathur
14 pages
rajni_ip_file_final
No ratings yet
rajni_ip_file_final
42 pages
Phython Example
No ratings yet
Phython Example
12 pages
jenisha INTERNSHIP REPORT-2.docx (1)
No ratings yet
jenisha INTERNSHIP REPORT-2.docx (1)
19 pages
Pandas
No ratings yet
Pandas
12 pages
Data Mining Vs Data Exploration UNIT-II
No ratings yet
Data Mining Vs Data Exploration UNIT-II
11 pages
DVP First Module
No ratings yet
DVP First Module
88 pages
Python & MySQL for Data Analysis
No ratings yet
Python & MySQL for Data Analysis
45 pages
CH 4
No ratings yet
CH 4
17 pages
NumPy and Pandas Tutorial
No ratings yet
NumPy and Pandas Tutorial
8 pages
Pandas PDF(2)
No ratings yet
Pandas PDF(2)
25 pages
Cheat Sheet: The Pandas Dataframe Object: Preliminaries Get Your Data Into A Dataframe
100% (1)
Cheat Sheet: The Pandas Dataframe Object: Preliminaries Get Your Data Into A Dataframe
12 pages
Data Minds - Data Science Curriculum 2023 V2
No ratings yet
Data Minds - Data Science Curriculum 2023 V2
15 pages
NumPy and Pandas
No ratings yet
NumPy and Pandas
72 pages
Data Analysis and Visualisation With Python
No ratings yet
Data Analysis and Visualisation With Python
75 pages
Universal Data Analytics Algorithm
No ratings yet
Universal Data Analytics Algorithm
51 pages
20CA2204 DATA SCIENCE QB WITH ANSWERS
No ratings yet
20CA2204 DATA SCIENCE QB WITH ANSWERS
48 pages
Cheat Sheet: The Pandas Dataframe Object I: Preliminaries Get Your Data Into A Dataframe
No ratings yet
Cheat Sheet: The Pandas Dataframe Object I: Preliminaries Get Your Data Into A Dataframe
12 pages
Advance Data Analysis and Visualisation - With - Python For Executives and Business Management
No ratings yet
Advance Data Analysis and Visualisation - With - Python For Executives and Business Management
76 pages
Usage of NumPy for Numerical Data in Detail
No ratings yet
Usage of NumPy for Numerical Data in Detail
52 pages
MLS+1+-+Python+for+Data+Science
No ratings yet
MLS+1+-+Python+for+Data+Science
33 pages
Pandas
No ratings yet
Pandas
41 pages
Experiment No: 1 Introduction To Data Analytics and Python Fundamentals Page-1/11
No ratings yet
Experiment No: 1 Introduction To Data Analytics and Python Fundamentals Page-1/11
8 pages
Unit - Iii - Eda
No ratings yet
Unit - Iii - Eda
25 pages
Python Data Science 101
100% (1)
Python Data Science 101
41 pages
Python Libraries 2
No ratings yet
Python Libraries 2
80 pages
Pandas
No ratings yet
Pandas
42 pages
24
No ratings yet
24
7 pages
Getting Started With Python Data Analysis - Sample Chapter
0% (1)
Getting Started With Python Data Analysis - Sample Chapter
17 pages
Pandas DataFrame Notes
100% (1)
Pandas DataFrame Notes
10 pages
Chapter 4 - Python For Data Analysis
No ratings yet
Chapter 4 - Python For Data Analysis
47 pages
Pandas Complete + Visualisation Summary of IBM Visualization
No ratings yet
Pandas Complete + Visualisation Summary of IBM Visualization
21 pages
Summary: Introduction To Data Visualization Tools
No ratings yet
Summary: Introduction To Data Visualization Tools
13 pages
Python for Data Science: Data Science Mastery by Nikhil Khan, #1
From Everand
Python for Data Science: Data Science Mastery by Nikhil Khan, #1
Nikhil Khan
No ratings yet
Python: Advanced Guide to Programming Code with Python: Python Computer Programming, #4
From Everand
Python: Advanced Guide to Programming Code with Python: Python Computer Programming, #4
Charlie Masterson
No ratings yet
Python: Advanced Guide to Programming Code with Python
From Everand
Python: Advanced Guide to Programming Code with Python
Charlie Masterson
No ratings yet
Python For Beginners
From Everand
Python For Beginners
Célio Azevedo
No ratings yet
Easy Programming for Everyone
From Everand
Easy Programming for Everyone
Umar Asghar
No ratings yet
Minor Project Report
No ratings yet
Minor Project Report
6 pages
FAQ On Analytical Roles
No ratings yet
FAQ On Analytical Roles
5 pages
Disease Prediction Synopsis
No ratings yet
Disease Prediction Synopsis
3 pages
Data Analytics Interview QnAs
No ratings yet
Data Analytics Interview QnAs
21 pages
Ders1 - Basic Concepts For Data Structures
No ratings yet
Ders1 - Basic Concepts For Data Structures
31 pages
ZyWALL 2 Plus DataSheet
No ratings yet
ZyWALL 2 Plus DataSheet
2 pages
Hashtable #Include #Include #Include #Include
No ratings yet
Hashtable #Include #Include #Include #Include
4 pages
Veeam v10/11 Integration
No ratings yet
Veeam v10/11 Integration
99 pages
PSI Report 2023
No ratings yet
PSI Report 2023
7 pages
Lecture 04 PDF
No ratings yet
Lecture 04 PDF
9 pages
Sap PP Tcode 3
No ratings yet
Sap PP Tcode 3
6 pages
DNP3 Interface Reference
No ratings yet
DNP3 Interface Reference
52 pages
Isilon Best Practices Guide For iSCSI With OneFS
No ratings yet
Isilon Best Practices Guide For iSCSI With OneFS
10 pages
Bca Bca 401 Database Management System 2012
No ratings yet
Bca Bca 401 Database Management System 2012
7 pages
Memory Systems: Computer Organization and Architecture (18EC35)
No ratings yet
Memory Systems: Computer Organization and Architecture (18EC35)
5 pages
Sram 6 Transistors Read Operation 6T SRAM Memory Cell Diagram
No ratings yet
Sram 6 Transistors Read Operation 6T SRAM Memory Cell Diagram
1 page
A Case For Redundant Arrays of Inexpensive Disks (RAID)
No ratings yet
A Case For Redundant Arrays of Inexpensive Disks (RAID)
19 pages
NFS Configuration Express Guide: Ontap 9
No ratings yet
NFS Configuration Express Guide: Ontap 9
35 pages
Sap Ddic
No ratings yet
Sap Ddic
8 pages
PYQs
No ratings yet
PYQs
265 pages
2.2 Data Types in Java
No ratings yet
2.2 Data Types in Java
3 pages
Secure File-1
No ratings yet
Secure File-1
12 pages
What Is XML
No ratings yet
What Is XML
7 pages
Glpi Plugins Readthedocs Io PT Latest
No ratings yet
Glpi Plugins Readthedocs Io PT Latest
125 pages
Siemens s7-300
100% (1)
Siemens s7-300
95 pages
Ble-Sniffer - Win - 1.2 - User Guide
No ratings yet
Ble-Sniffer - Win - 1.2 - User Guide
17 pages
SQL PDF
No ratings yet
SQL PDF
173 pages
UNIT-3 Fundamentals of Information Technology (Question and Answers)
No ratings yet
UNIT-3 Fundamentals of Information Technology (Question and Answers)
11 pages
Comcast Ims Network v6 PDF
100% (2)
Comcast Ims Network v6 PDF
19 pages
Database System Development Lifecycle
No ratings yet
Database System Development Lifecycle
22 pages
Space Manual English
No ratings yet
Space Manual English
27 pages
Topics: IOE 373 - Data Processing
No ratings yet
Topics: IOE 373 - Data Processing
23 pages

Comprehensive Guide Data Exploration Sas Using Python Numpy Scipy Matplotlib Pandas

Uploaded by

Comprehensive Guide Data Exploration Sas Using Python Numpy Scipy Matplotlib Pandas

Uploaded by

Ultimate guide for Data Exploration in Python using NumPy,

Matplotlib and Pandas

Contents – Data Exploration

1. How to load data file(s) using Pandas?

2. How to convert a variable to different data type?

4. How to sort Data?

5. How to create plots (Histogram, Scatter, Box Plot)?

6. How to generate frequency tables?

7. How to do sampling of Data set?

1. How to remove duplicate values of a variable?

2. How to group variables to calculate count, average, sum?

3. How to recognize and treat missing values and outliers?

4. How to merge / join data set or dataframes effectively in Pandas?

Part 1: How to load data file(s) using Pandas?

Loading data from a CSV file(s):

import pandas as pd #Import Library Pandas df = pd.read_csv("E:/train.csv") #I am working in Windows

Loading data from excel file(s):

Loading data from a txt file(s):

Part 2: How to convert a variable to a different data type?

Convert numeric variables to string variables and vice versa

srting_outcome = str(numeric_input) #Converts numeric_input to string_outcome integer_outcome =

int(string_input) #Converts string_input to integer_outcome float_outcome = float(string_input) #Converts

Convert character date to Date:

datetime.strptime(char_date, '%b %d %Y %I:%M%p') print date_obj

Part 3: How to transpose a Data set or dataframe using Pandas?

#Transposing Pandas dataframe by a variable

df.pivot(index= 'ID', columns='Product', values='Sales') result

Part 4: How to sort a Pandas DataFrame?

#Sorting Pandas Dataframe df=pd.read_excel("E:/transpose.xlsx", "Sheet1") #Add by variable name(s) to sort

print df.sort(['Product','Sales'], ascending=[True, False])

Part 5: How to create plots (Histogram, Scatter, Box Plot)?

The distribution of age

import matplotlib.pyplot as plt import pandas as pd

#Labels and Tit plt.title('Age distribution') plt.xlabel('Age') plt.ylabel('#Employee') plt.show()

import seaborn as sns sns.boxplot(df['Age']) sns.despine()

Part 6: How to generate frequency tables with Pandas?

df=pd.read_excel("E:/First.xlsx", "Sheet1") print df

test= df.groupby(['Gender','BMI']) test.size()

#Create Sample dataframe

import numpy as np import pandas as pd from random import sample

# create random index rindex = np.array(sample(xrange(len(df)), 5))

Part 8: How to remove duplicate values of a variable in a Pandas Dataframe?

#Remove Duplicate Values based on values of variables "Gender" and "BMI"

rem_dup=df.drop_duplicates(['Gender', 'BMI']) print rem_dup

Part 9: How to group variables in Pandas to calculate count, average, sum?

Let’s look at the code:

test= df.groupby(['Gender']) test.describe()

# Identify missing values of dataframe df.isnull()

Article Url - https://www.analyticsvidhya.com/blog/2015/04/comprehensive-guide-data-exploration-sas-

You might also like