Data Science With Python - Lesson 07 - Data Manipulation With Python - Pandas

The document discusses Pandas, a Python library used for data manipulation and analysis. It outlines Pandas' main data structures like Series and DataFrame, how to create and access elements in these structures, and how to perform vectorized operations and handle missing values. The document also explains Pandas' advantages over NumPy like built-in data alignment and tools for reading, writing, and transforming data.

Uploaded by

Samir Awol

Big Data Hadoop and Spark Developer

Data Manipulation with Pandas

Learning Objectives

By the end of this lesson, you will be able to:

Explain Pandas and its features

List different data structures of Pandas

Outline the process to create series and DataFrame with data inputs

Explain how to view, select, and access elements in a data structure

Describe the procedure to handle vectorized operations

Illustrate how to handle missing values

Analyze data with different data operation methods

Introduction to Pandas
Why Pandas

NumPy is great for mathematical computing, but why do we need Pandas?

Pandas builds on NumPy with several functionalities:

• Intrinsic data alignment
• Data operation functions
• Data structures that handle major use cases
• Data standardization functions
• Data handling functions
Features of Pandas

The various features of Pandas make it an efficient library for Data Scientists.

• Powerful data structure
• Fast and efficient data wrangling
• High performance merging and joining of data sets
• Intelligent and automated data alignment
• Easy data aggregation and transformation
• Tools for reading and writing data
Data Structures

The four main Pandas data structures are:

Series
• One-dimensional labeled array
• Supports multiple data types

DataFrame
• Two-dimensional labeled array
• Supports multiple data types
• Input can be a Series
• Input can be another DataFrame

Panel
• Three-dimensional labeled array
• Supports multiple data types
• Items → axis 0
• Major axis → rows
• Minor axis → columns

Panel 4D (Experimental)
• Four-dimensional labeled array
• Supports multiple data types
• Labels → axis 0
• Items → axis 1
• Major axis → rows
• Minor axis → columns

Note: Panel and Panel 4D were later deprecated and removed from pandas. Current releases provide Series and DataFrame only.
Understanding Series

Series is a one-dimensional array-like object containing data and labels (or index).

Data:           4   11   21   36
Label (index):  0    1    2    3

Data alignment is intrinsic: each value stays bound to its label unless the program changes it explicitly.
Series

Series can be created with different data inputs:

Data Input
• ndarray
• dict
• scalar
• list

Data Types
• Integer
• String
• Python Object
• Floating Point

Data:           2   3   8   4
Label (index):  0   1   2   3
How to Create Series?

Key points to note while creating a series are:

• Import Pandas, as it is the main library (import pandas as pd)
• Import NumPy while working with ndarrays (import numpy as np)
• Apply the syntax and pass the data elements as arguments

Basic Method

S = pd.Series(data, index = [index])
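The basic method can be sketched as follows. The data values come from the slide; the string labels are hypothetical and chosen only to show an explicit index.

```python
import pandas as pd

# Data values from the slide; the labels "a".."d" are illustrative assumptions.
data = [4, 11, 21, 36]
S = pd.Series(data, index=["a", "b", "c", "d"])

print(S)
```

Each value is now addressable by its label, for example S["c"].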
Creating Series from a List

Import the libraries and pass the list as an argument. The output shows the index, the data values, and the data type.

We did not create an index for the data, but a default integer index (0, 1, 2, ...) is assigned automatically.
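A minimal sketch of this case, assuming a small list of integers (the original screenshot is not available, so the values are illustrative):

```python
import pandas as pd

# No index is passed, so pandas assigns the default 0..n-1 index.
first_series = pd.Series([2, 3, 8, 4])

print(first_series)        # index, values, and dtype are all shown
print(first_series.index)  # the automatically created index
```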
Creating Series from an ndarray

Create an ndarray for countries and pass the ndarray as an argument. The output shows the countries and the data type.
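A short sketch of the ndarray case. The country names are placeholders, since the original array is not recoverable from the slide.

```python
import numpy as np
import pandas as pd

# Illustrative country names (assumed, not taken from the lesson).
countries = np.array(["India", "Japan", "Brazil", "Canada"])

# Pass the ndarray as the data argument; a default integer index is used.
country_series = pd.Series(countries)

print(country_series)
```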
Creating Series from dict

A series can also be created with dict data input for faster operations.

Create a dict for countries and their GDP. The countries are passed as the index and the GDP as the actual data values. The output shows the country (index), the GDP (data), and the data type.
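As a sketch, assuming a dict that maps country names to GDP values (both the names and the numbers below are placeholders, not real statistics):

```python
import pandas as pd

# Placeholder keys and values; the dict keys become the index labels.
gdp_by_country = {"CountryA": 100.0, "CountryB": 250.5, "CountryC": 75.25}
gdp_series = pd.Series(gdp_by_country)

print(gdp_series)
print(gdp_series["CountryB"])  # look up a value by its country label
```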
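Creating Series from Scalar

Pass a scalar input together with an index. The scalar is repeated for every index label. The output shows the index, the data, and the data type.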
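A minimal sketch of the scalar case; the value 5.0 and the labels are assumptions for illustration.

```python
import pandas as pd

# The scalar is broadcast to every label in the index.
scalar_series = pd.Series(5.0, index=["a", "b", "c", "d"])

print(scalar_series)
```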
Accessing Elements in Series
Data can be accessed through indexers such as loc (label-based) and iloc (position-based) by passing a label, a position, or a range.
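The difference between the two indexers can be sketched like this, using a small series with hypothetical string labels:

```python
import pandas as pd

s = pd.Series([4, 11, 21, 36], index=["a", "b", "c", "d"])

by_label = s.loc["b"]          # label-based lookup
by_position = s.iloc[2]        # position-based lookup (third element)
label_range = s.loc["b":"c"]   # label slices include the end label
position_range = s.iloc[1:3]   # position slices exclude the end position

print(by_label, by_position)
print(label_range)
print(position_range)
```

Both slices return the same two elements here, but for different reasons: the label slice includes its end label, while the position slice stops before its end position.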
Vectorizing Operations in Series

Vectorized operations are applied element-wise, with elements matched by index label rather than by position.

Add the series
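A sketch of adding two series whose labels only partly overlap; the values and labels are illustrative assumptions.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"])
s2 = pd.Series([10, 20, 30, 40], index=["a", "b", "c", "e"])

# Values are paired by label; labels present in only one series give NaN.
total = s1 + s2

print(total)
```

Labels "d" and "e" each appear in only one input, so their results are NaN rather than an error.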
DataFrame

DataFrame is a two-dimensional labeled data structure with columns of potentially different types.

Data Input
• ndarray
• dict
• List
• Series
• DataFrame

Data Types
• Integer
• String
• Python Object
• Floating Point

Data:           2   3   8   4
                5   8  10   1
Label (index):  0   1   2   3
Creating DataFrame from Lists

Pass the list to the DataFrame
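A minimal sketch, assuming a list of two row lists with the values shown in the DataFrame overview:

```python
import pandas as pd

# Each inner list becomes one row; columns get default integer labels.
rows = [[2, 3, 8, 4], [5, 8, 10, 1]]
df_from_lists = pd.DataFrame(rows)

print(df_from_lists)
```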

Creating DataFrame from dict

This example shows you how to create a DataFrame from a series of dicts.

The example defines dict one and dict two, then passes the entire dict collection to the DataFrame.
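One way to read "a series of dicts" is a list of dicts, where each dict describes one row. The sketch below follows that assumption; the keys and values are hypothetical.

```python
import pandas as pd

# Hypothetical records: each dict is one row, keys become column names.
dict_one = {"city": "A", "year": 2008}
dict_two = {"city": "B", "year": 2012}

df_from_dicts = pd.DataFrame([dict_one, dict_two])

print(df_from_dicts)
```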
Viewing DataFrame

You can view a DataFrame by referring to the column name or with the describe function.
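Both ways of viewing can be sketched on a small DataFrame; the column names and values are assumptions.

```python
import pandas as pd

df_view = pd.DataFrame({"city": ["A", "B", "C"], "year": [2004, 2008, 2012]})

year_column = df_view["year"]   # refer to a column by its name
summary = df_view.describe()    # summary statistics for numeric columns

print(year_column)
print(summary)
```

By default, describe summarizes only the numeric columns, so "city" is left out of the summary.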
Creating DataFrame from dict of Series
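The slide content for this case is not available, so the following is only a sketch under assumed names. Each Series becomes a column, and rows are aligned on the union of the index labels.

```python
import pandas as pd

s_one = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
s_two = pd.Series([10.0, 20.0], index=["a", "b"])

# Dict keys become column names; the missing label "c" in s_two yields NaN.
df_from_series = pd.DataFrame({"one": s_one, "two": s_two})

print(df_from_series)
```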
Creating DataFrame from ndarray

Create an ndarray with years

Create a dict with the ndarray

Pass this dict to a new DataFrame
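The three steps above can be sketched as follows; the year values and the column name are illustrative.

```python
import numpy as np
import pandas as pd

# Step 1: create an ndarray with years (assumed values).
years = np.array([2004, 2008, 2012, 2016])

# Step 2: create a dict with the ndarray.
year_dict = {"year": years}

# Step 3: pass this dict to a new DataFrame.
df_years = pd.DataFrame(year_dict)

print(df_years)
```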

Creating DataFrame from DataFrame Object

Create a DataFrame from a DataFrame object
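A sketch of building a DataFrame from an existing one, here from a column subset. The names are assumptions, and copy=True is used so that later changes to the new object do not affect the original.

```python
import pandas as pd

original = pd.DataFrame({"city": ["A", "B", "C"], "year": [2004, 2008, 2012]})

# Build a new, independent DataFrame from part of the existing one.
derived = pd.DataFrame(original[["year"]], copy=True)
derived.loc[0, "year"] = 1900

print(original)
print(derived)
```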
View and Select Data

Problem Statement: Demonstrate how to view and select data in a DataFrame

Access: Click on the Practice Labs tab on the left side panel of the LMS. Copy or note the
username and password that is generated. Click on the Launch Lab button. On the page that
appears, enter the username and password in the respective fields, and click Login.
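Outside the lab environment, viewing and selecting can be sketched on a small DataFrame. The data below is a stand-in and not the lab dataset.

```python
import pandas as pd

df_lab = pd.DataFrame({"value": range(10)})

first_rows = df_lab.head()              # first five rows by default
last_rows = df_lab.tail(3)              # last three rows
sliced = df_lab[3:9]                    # rows at positions 3 through 8
filtered = df_lab[df_lab["value"] > 6]  # rows matching a condition

print(first_rows)
print(sliced)
print(filtered)
```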
Missing Values

Various factors may lead to missing data values:

• Data not provided by the source
• Software issue
• Data integration issue
• Network issue
Handling Missing Values
It is difficult to operate on a dataset when it has missing values or indices that are not common to all of its inputs.
Handling Missing Values with Functions
The dropna function drops every entry whose value is missing (NaN), such as the entries produced by uncommon indices.
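A minimal sketch with two series whose indices only partly overlap (illustrative values):

```python
import pandas as pd

s1 = pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"])
s2 = pd.Series([10, 20, 30, 40], index=["a", "b", "c", "e"])

total = s1 + s2           # "d" and "e" become NaN
cleaned = total.dropna()  # keep only the entries that have a value

print(cleaned)
```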
Handling Missing Values with Functions

The fillna function replaces missing values (NaN), such as those produced by uncommon indices, with a specified value instead of dropping them.

Fill the missing values with zero

Handling Missing Values with Functions: Example
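The original example is not recoverable from the slide. A comparable sketch, using assumed values, fills the NaN entries with zero:

```python
import pandas as pd

s1 = pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"])
s2 = pd.Series([10, 20, 30, 40], index=["a", "b", "c", "e"])

total = s1 + s2           # "d" and "e" become NaN
filled = total.fillna(0)  # replace every NaN with 0

print(filled)
```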
Data Operation

Data operations can be performed through various built-in methods for faster data processing.
Data Operation with Functions
While performing data operations, custom functions can be applied element-wise using the applymap method (renamed DataFrame.map in pandas 2.1).

Declare a custom function

Test the function

Apply the function to the DataFrame
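The three steps can be sketched as below. The function add_ten and the score values are hypothetical. Because applymap was renamed to DataFrame.map in pandas 2.1, the sketch picks whichever method the installed version provides.

```python
import pandas as pd

# Step 1: declare a custom function (hypothetical example).
def add_ten(x):
    return x + 10

# Step 2: test the function on a single value.
print(add_ten(5))

# Step 3: apply the function to every element of the DataFrame.
scores = pd.DataFrame({"Test1": [90, 80, 70], "Test2": [60, 80, 100]})
elementwise = scores.map if hasattr(scores, "map") else scores.applymap
adjusted = elementwise(add_ten)

print(adjusted)
```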

Data Operation with Statistical Functions

Create a DataFrame with two tests

Apply the max function to find the maximum score

Apply the mean function to find the average score

Apply the std function to find the standard deviation for both the tests
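A sketch of these steps with made-up scores for two tests:

```python
import pandas as pd

# Hypothetical scores for two tests.
scores = pd.DataFrame({"Test1": [90, 80, 70], "Test2": [60, 80, 100]})

max_scores = scores.max()    # maximum score per test
mean_scores = scores.mean()  # average score per test
std_scores = scores.std()    # sample standard deviation per test

print(max_scores)
print(mean_scores)
print(std_scores)
```

Note that std uses the sample standard deviation (ddof=1) by default.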
Data Operation Using Groupby

Create a DataFrame with the first and last names of former presidents

Group the DataFrame with the first name
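A sketch of the grouping step. The rows are an assumed sample, since the original DataFrame is not shown, and the column names "first" and "last" are hypothetical.

```python
import pandas as pd

presidents = pd.DataFrame({
    "first": ["George", "Bill", "Ronald", "Jimmy", "George"],
    "last": ["Bush", "Clinton", "Reagan", "Carter", "Washington"],
})

# Group the rows by first name, then inspect one group and the group sizes.
grouped = presidents.groupby("first")
george_rows = grouped.get_group("George")
counts = grouped.size()

print(george_rows)
print(counts)
```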

Data Operation Using Sorting

Sort values by first name
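Sorting can be sketched with sort_values on the same kind of data (sample rows and column names are assumptions):

```python
import pandas as pd

presidents = pd.DataFrame({
    "first": ["George", "Bill", "Ronald", "Jimmy", "George"],
    "last": ["Bush", "Clinton", "Reagan", "Carter", "Washington"],
})

# Sort the rows alphabetically by the "first" column.
sorted_presidents = presidents.sort_values("first")

print(sorted_presidents)
```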

Data Operations

Problem Statement: Demonstrate how to perform data operations

Create a function to return the standardized value

Apply the function to the entire dataset

The standardized test data is produced for the entire DataFrame
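One common definition of a standardized value is the z-score, (value - mean) / standard deviation. The sketch below assumes that definition and uses made-up scores; the lab's own function may differ.

```python
import pandas as pd

scores = pd.DataFrame({"Test1": [90, 80, 70], "Test2": [60, 80, 100]})

# Assumed definition: z-score computed per column.
def standardize(column):
    return (column - column.mean()) / column.std()

# apply passes each column to the function in turn.
standardized = scores.apply(standardize)

print(standardized)
```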
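File Read and Write Support

Format      Read              Write
CSV         read_csv          to_csv
Excel       read_excel        to_excel
HDF         read_hdf          to_hdf
Clipboard   read_clipboard    to_clipboard
HTML        read_html         to_html
JSON        read_json         to_json
Pickle      read_pickle       to_pickle
SQL         read_sql          to_sql
Stata       read_stata        to_stata
SAS         read_sas          (no writer; pandas does not provide to_sas)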
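A round trip through CSV can be sketched with an in-memory buffer, so no file path needs to be assumed. The column names and values are illustrative.

```python
import io

import pandas as pd

df_out = pd.DataFrame({"city": ["A", "B"], "year": [2008, 2012]})

# Write the DataFrame as CSV text, without the index column.
buffer = io.StringIO()
df_out.to_csv(buffer, index=False)

# Rewind the buffer and read the CSV text back into a DataFrame.
buffer.seek(0)
df_in = pd.read_csv(buffer)

print(df_in)
```

With a real file, the buffer would be replaced by a path passed to to_csv and read_csv.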
Activity: Sequence it Right!
The code here is buggy. You have to correct its sequence to debug it. To do that, click any two code snippets,
which you feel are out of place, to swap their places.

Click any two code snippets to swap them.

Pandas SQL Operations
Analyze the Federal Aviation Administration (FAA) Dataset using Pandas

Problem Statement:
Analyze the Federal Aviation Administration (FAA) dataset using Pandas to do the following:
1. View
a. Aircraft manufacturer name
b. State name
c. Aircraft model name
d. Text information
e. Flight phase
f. Event description type
g. Fatal flag
2. Clean the dataset and replace the fatal flag NaN with “No”
3. Find the aircraft types and their occurrences in the dataset
4. Remove all the observations where aircraft names are not available
5. Display the observations where fatal flag is “Yes”
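Steps 2 to 5 can be sketched on a tiny stand-in table. The column names aircraft_make and fatal_flag and all of the values are hypothetical. The real FAA dataset would be loaded with read_csv and will have its own column names.

```python
import numpy as np
import pandas as pd

# Stand-in data with hypothetical column names.
faa = pd.DataFrame({
    "aircraft_make": ["MakeA", None, "MakeB", "MakeA"],
    "fatal_flag": ["Yes", np.nan, np.nan, "Yes"],
})

# Step 2: replace NaN in the fatal flag with "No".
faa["fatal_flag"] = faa["fatal_flag"].fillna("No")

# Step 3: count the occurrences of each aircraft type.
make_counts = faa["aircraft_make"].value_counts()

# Step 4: remove observations without an aircraft name.
faa_named = faa.dropna(subset=["aircraft_make"])

# Step 5: keep the observations where the fatal flag is "Yes".
fatal_only = faa_named[faa_named["fatal_flag"] == "Yes"]

print(make_counts)
print(fatal_only)
```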

Instructions to perform the assignment:

• Download the FAA dataset from the “Resource” tab. Upload the dataset to your Jupyter notebook to view and evaluate it.

Common instructions:
• If you are new to Python, download the “Anaconda Installation Instructions” document from the “Resources” tab to view the steps for installing Anaconda and the Jupyter notebook.
• Download the “Assignment 01” notebook and upload it to the Jupyter notebook to access it.
• Follow the cues provided to complete the assignment.
Analyzing the Dataset

Problem Statement:
A dataset in CSV format is given for the Fire Department of New York City. Analyze the
dataset to determine:
1. The total number of fire department facilities in New York City
2. The number of fire department facilities in each borough
3. The facility names in Manhattan
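The three questions can be sketched on a small stand-in table. The column names facility_name and borough and all rows are hypothetical. The real FDNY file would be loaded with read_csv and may use different column names.

```python
import pandas as pd

# Stand-in data with hypothetical column names.
fdny = pd.DataFrame({
    "facility_name": ["Engine 1", "Engine 2", "Ladder 3", "Engine 4"],
    "borough": ["Manhattan", "Brooklyn", "Manhattan", "Queens"],
})

# 1. Total number of facilities.
total_facilities = len(fdny)

# 2. Number of facilities in each borough.
per_borough = fdny.groupby("borough").size()

# 3. Facility names in Manhattan.
manhattan_names = fdny.loc[fdny["borough"] == "Manhattan", "facility_name"].tolist()

print(total_facilities)
print(per_borough)
print(manhattan_names)
```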

Instructions to perform the assignment:

• Download the FDNY dataset from the “Resource” tab. You can upload the dataset to your Jupyter notebook to use it.

Common instructions:
• If you are new to Python, download the “Anaconda Installation Instructions” document from the “Resources” tab to view the steps for installing Anaconda and the Jupyter notebook.
• Download the “Assignment 02” notebook and upload it to the Jupyter notebook to access it.
• Follow the cues provided to complete the assignment.
Key Takeaways

You are now able to:

Explain Pandas and its features

List different data structures of Pandas

Outline the process to create series and DataFrame with data inputs

Explain how to view, select, and access elements in a data structure

Describe the procedure to handle vectorized operations

Illustrate how to handle missing values

Analyze data with different data operation methods

Knowledge Check
Knowledge Check 1

How is an index for data elements assigned while creating a Pandas series? Select all that apply.

a. Created automatically

b. Needs to be assigned

c. Once created, cannot be changed or altered

d. Index is not applicable as a series is one-dimensional

The correct answer is a, b

Data alignment is intrinsic to Pandas data structures, and an index is created automatically. One can also assign an index to the data elements explicitly.
Knowledge Check 2

What will the result be in vector addition if a label is not found in a series?

a. Marked as zeros for missing labels

b. Labels will be skipped

c. Marked as NaN for missing labels

d. Will prompt an exception, index not found

The correct answer is c

The result will be marked as NaN (Not a Number) for missing labels.
Knowledge Check 3

What is the result of DataFrame[3:9]?

a. Series with sliced index from 3 to 9

b. dict of index positions 3 and 9

c. DataFrame of sliced rows index from 3 to 9

d. DataFrame with data elements at index 3 to 9

The correct answer is c

This is the DataFrame row slicing technique. When a user passes the range 3:9, the rows at positions 3 through 8 are sliced and displayed as output; the end position 9 is excluded.
Knowledge Check 4

What does the fillna() method do?

a. Fills all NaN values with zeros

b. Fills all NaN values with one

c. Fills all NaN values with values mentioned in the parentheses

d. Drops NaN values from the dataset

The correct answer is c

fillna is one of the basic methods to fill NaN values in a dataset with a desired value by passing that value in the parentheses.
Knowledge Check 5

Which of the following data structures is used to store three-dimensional data?

a. Series

b. DataFrame

c. Panel

d. PanelND

The correct answer is c

Panel is the data structure used to store three-dimensional data in the pandas versions this lesson covers.

Knowledge Check 6

Which method is used for label-location indexing by label?

a. iat

b. iloc

c. loc

d. std

The correct answer is c

The loc method is used for label-location indexing by label; iat is strictly integer location and iloc is integer-location-based indexing by position.
Knowledge Check 7

While viewing a DataFrame, the head() method will _____.

a. return only the first row

b. return only headers or column name of the DataFrame

c. return the first five rows of the DataFrame

d. throw an exception as it expects parameter(number) in parenthesis

The correct answer is c

The default value is 5 if nothing is passed to the head method. So, it will return the first five rows of the DataFrame.
Thank You
