Lesson 07 Data Manipulation With Pandas
Outline the process to create Series and DataFrame with data inputs
Why Pandas
Built on top of NumPy, Pandas provides:
• Intrinsic data alignment
• Powerful data structures
• Data operation functions that handle major use cases
• Intelligent, easy-to-use, and automated data aggregation, alignment, and transformation
Series is a one-dimensional array-like object containing data and labels (or index).
[Figure: a Series with data values 4, 11, 21, 36 and labels (index) 0, 1, 2, 3]
Data alignment is intrinsic and will not be broken unless changed explicitly by the program.
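As a minimal sketch of intrinsic alignment (the values and labels below are illustrative), adding two Series aligns them on their labels, and any label present in only one of them produces NaN:
Code
import pandas as pd

s1 = pd.Series([10, 20, 30], index=["a", "b", "c"])
s2 = pd.Series([1, 2, 3], index=["b", "c", "d"])

# Addition aligns on index labels, not positions; labels found in
# only one Series ("a" and "d") are marked as NaN.
print(s1 + s2)
# a     NaN
# b    21.0
# c    32.0
# d     NaN
# dtype: float64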
Series
Data Input:
• ndarray
• dict
• scalar
• list
Data Types:
• Integer
• String
• Python Object
• Floating Point
[Figure: a Series with data values 2, 3, 8, 4 and labels (index) 0, 1, 2, 3]
Series
How to Create a Series?
Basic method:
s = pd.Series(data, index=index)
[Figure: a Series with data values 4, 11, 21, 36]
Series: Example
Code
import pandas as pd
data = [11, 12, 13]
s = pd.Series(data)
s
0    11
1    12
2    13
dtype: int64
s[1]
12
Series: Example
Code
data = [11, 12, 13]
ind = ["a", "b", "c"]
s = pd.Series(data, index=ind)
s
a    11
b    12
c    13
dtype: int64
s["a"]
11
Series: Example
Code
import pandas as pd
data = [11, 12, 13]
ind = ["a", "b", "c"]
s = pd.Series(data, index=ind)
s[["a", "b"]]
a    11
b    12
dtype: int64
Creating Series from a List
[Screenshot: importing the libraries, then creating a Series from a list with its data values, index, and data type shown in the output; the examples above create a Series from a list in the same way]
Creating Series from an ndarray
Pass an ndarray as the data argument.
[Screenshot: a Series of countries created from an ndarray; the output shows the data type]
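A minimal sketch of this pattern, assuming illustrative country names in place of the original screenshot:
Code
import numpy as np
import pandas as pd

# An ndarray of country names (placeholder values).
countries = np.array(["India", "USA", "Japan", "Germany"])

s = pd.Series(countries)
print(s)
# 0      India
# 1        USA
# 2      Japan
# 3    Germany
# dtype: object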
Creating Series from a dict
A Series can also be created from dict data input; the dict keys become the index labels.
[Screenshot: a Series of GDP values indexed by country; the output shows the data type]
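A minimal sketch of this pattern, assuming illustrative countries and GDP figures:
Code
import pandas as pd

# The dict keys become the index labels; the values become the data.
gdp = {"India": 2.9, "USA": 21.4, "Japan": 5.1}  # illustrative GDP in trillion USD

s = pd.Series(gdp)
print(s)
# India     2.9
# USA      21.4
# Japan     5.1
# dtype: float64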
Creating Series from a Scalar
A scalar data value is repeated to match the length of the supplied index.
[Screenshot: a Series created from a scalar input and an index; the output shows the data, index, and data type]
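A minimal sketch of this pattern, with an illustrative scalar and index:
Code
import pandas as pd

# The scalar value is broadcast to every label in the supplied index.
s = pd.Series(5, index=["a", "b", "c", "d"])
print(s)
# a    5
# b    5
# c    5
# d    5
# dtype: int64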
Accessing Elements in Series
Data can be accessed with functions such as loc and iloc, by passing an index label, an element position, or a range.
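A minimal sketch of both access styles, with illustrative values:
Code
import pandas as pd

s = pd.Series([11, 12, 13, 14], index=["a", "b", "c", "d"])

print(s.loc["b"])      # label-based access -> 12
print(s.iloc[1])       # position-based access -> 12
print(s.loc["b":"d"])  # label slice; the end label is included
print(s.iloc[1:3])     # position slice; the end position is excluded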
Objective: Create a Pandas Series of employee names and assign an employee ID to each employee. Then retrieve the first five employees from the Series using their IDs.
Access: To execute the practice, follow these steps:
• Go to the PRACTICE LABS tab on your LMS
• Click the START LAB button
• Click the LAUNCH LAB button to start the lab
DataFrames
DataFrame
DataFrame is a two-dimensional labeled data structure with columns of potentially different types.
Data Input:
• ndarray
• dict
• list
• Series
• DataFrame
Data Types:
• Integer
• String
• Python Object
• Floating Point
[Figure: a DataFrame with rows of values 2, 3, 8, 4 and 5, 8, 10, 1, and labels (index) 0, 1, 2, 3]
DataFrame
Creating DataFrame from Lists
This example shows how to create a DataFrame from a list of dicts:
[Screenshot: the entire list of dicts passed to pd.DataFrame]
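A minimal sketch of this pattern, with illustrative column names and values:
Code
import pandas as pd

# A list of dicts: each dict becomes a row, and the keys become columns.
data = [
    {"name": "Asha", "age": 28},
    {"name": "Ravi", "age": 34},
    {"name": "Meera", "age": 25},
]

df = pd.DataFrame(data)
print(df)
#     name  age
# 0   Asha   28
# 1   Ravi   34
# 2  Meera   25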
Viewing a DataFrame
You can view a DataFrame by referring to a column name or by using the describe function.
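A minimal sketch of both viewing styles, with an illustrative DataFrame:
Code
import pandas as pd

df = pd.DataFrame({"name": ["Asha", "Ravi", "Meera"], "age": [28, 34, 25]})

# View a single column by referring to its name.
print(df["age"])

# describe() summarizes the numeric columns (count, mean, std, min, quartiles, max).
print(df.describe())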
1. Create a DataFrame of weather data from a dictionary with keys such as day, temperature, and weather condition
2. Create a DataFrame from a list of dictionaries, with row and column labels
Access: To execute the practice, follow these steps:
• Go to the PRACTICE LABS tab on your LMS
• Click the START LAB button
• Click the LAUNCH LAB button to start the lab
Unassisted Practice: Create Pandas DataFrame
[Screenshots: a DataFrame created from a dictionary of lists of weather data, and a DataFrame created from a list of dictionaries, each followed by its output]
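A minimal sketch of the first step, assuming illustrative keys and values in place of the original screenshots:
Code
import pandas as pd

# A dictionary of lists of weather data: each key becomes a column.
weather = {
    "day": ["2019-07-10", "2019-07-11", "2019-07-12"],
    "temperature": [32, 30, 28],
    "condition": ["Sunny", "Rain", "Cloudy"],
}

df = pd.DataFrame(weather)
print(df)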
Missing Values
Code
import pandas as pd
import numpy as np

# Illustrative DataFrame containing NaN values.
df = pd.DataFrame({"a": [1, np.nan, 3], "b": [4, 5, np.nan]})

# dropna removes any row that contains a NaN value.
df.dropna()
Handling Missing Values with Functions
The fillna function fills the missing (NaN) values, such as those created at uncommon indices during alignment, with a given number instead of dropping them.
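A minimal sketch of fillna, reusing an illustrative DataFrame with missing values:
Code
import pandas as pd
import numpy as np

# Illustrative DataFrame containing NaN values.
df = pd.DataFrame({"a": [1, np.nan, 3], "b": [4, 5, np.nan]})

# Replace every NaN with 0 instead of dropping the rows.
print(df.fillna(0))
#      a    b
# 0  1.0  4.0
# 1  0.0  5.0
# 2  3.0  0.0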
Objective: Create a DataFrame from a list of dictionaries, with row indices and column indices where one index has a different name. Handle the missing values by:
1. Removing the NaN values
2. Filling all the uncommon or NaN values with a number instead of dropping them
Access: To execute the practice, follow these steps:
• Go to the PRACTICE LABS tab on your LMS
• Click the START LAB button
• Click the LAUNCH LAB button to start the lab
Data Operation
Data operations can be performed through various built-in methods for faster data processing.
Data Operation with Functions
While performing data operations, custom functions can be applied element-wise using the applymap method.
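A minimal sketch of applying a custom function element-wise with applymap, using illustrative data (newer pandas versions expose the same element-wise operation as DataFrame.map):
Code
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# Apply a custom function to every element of the DataFrame.
print(df.applymap(lambda x: x * 10))
#     a   b
# 0  10  40
# 1  20  50
# 2  30  60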
Objective: Consider the SalaryGender dataset to perform the following operations on a Pandas DataFrame:
1. Replace the values 0 and 1 of the gender column with female and male, respectively
2. Find the maximum salary and the minimum salary
3. Find the number of men and women with a PhD
4. Store the age and PhD columns in the DataFrame and remove all the people without a PhD
5. Calculate the total number of PhD holders
6. Sort the DataFrame on the basis of salary
Objective: Consider the Pandas-results.csv dataset to perform the following operations on a Pandas DataFrame:
1. Create a DataFrame of home_team and home_score
2. Find the home team with the maximum home score
3. Find the teams with a home score greater than zero and get the basic statistical details of the DataFrame
4. Create a DataFrame for the date range 2019-07-10 to 2019-07-20
5. Compare the home score and away score of the teams and add the winning team to a column named winner
Access: To execute the practice, follow these steps:
• Go to the PRACTICE LABS tab on your LMS
• Click the START LAB button
• Click the LAUNCH LAB button to start the lab
Unassisted Practice: Data Operations in Pandas DataFrame
[Screenshots: the practice steps and their outputs, including adding a column named winner to the filtered DataFrame]
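A minimal sketch of the winner step, assuming the dataset provides home_team, away_team, home_score, and away_score columns (the sample rows are illustrative; the real data comes from Pandas-results.csv):
Code
import numpy as np
import pandas as pd

# Illustrative rows standing in for the filtered DataFrame.
df = pd.DataFrame({
    "home_team": ["Brazil", "France", "Japan"],
    "away_team": ["Peru", "Croatia", "Chile"],
    "home_score": [3, 4, 0],
    "away_score": [1, 2, 4],
})

# Compare the scores and record the winning team in a new 'winner' column.
df["winner"] = np.where(
    df["home_score"] > df["away_score"], df["home_team"],
    np.where(df["home_score"] < df["away_score"], df["away_team"], "Draw"),
)
print(df)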
Data Standardization
File Read and Write Support
• read_csv / to_csv
• read_excel / to_excel
• read_hdf / to_hdf
• read_json / to_json
• read_html / to_html
• read_clipboard / to_clipboard
• read_pickle / to_pickle
• read_sql / to_sql
• read_stata / to_stata
• read_sas
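A minimal sketch of the most commonly used pair, to_csv and read_csv (the file name is an illustrative placeholder):
Code
import pandas as pd

df = pd.DataFrame({"name": ["Asha", "Ravi"], "salary": [50000, 62000]})

# Write the DataFrame to a CSV file, then read it back.
df.to_csv("employees.csv", index=False)
df2 = pd.read_csv("employees.csv")
print(df2)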
Pandas SQL Operation
Pandas can read from and write to SQL databases using read_sql and to_sql.
Unassisted Practice: Pandas SQL Operations
[Screenshots: creating a database and the output of the SQL operations]
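A minimal sketch of to_sql and read_sql, assuming an in-memory SQLite database created with Python's built-in sqlite3 module (the table name and rows are illustrative):
Code
import sqlite3
import pandas as pd

# Create a database (in-memory for this sketch).
conn = sqlite3.connect(":memory:")

df = pd.DataFrame({"name": ["Asha", "Ravi"], "salary": [50000, 62000]})

# Write the DataFrame to a SQL table, then query it back.
df.to_sql("employees", conn, index=False)
result = pd.read_sql("SELECT * FROM employees WHERE salary > 55000", conn)
print(result)

conn.close()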
Key Takeaways
Outline the process to create Series and DataFrame with data inputs
Knowledge Check
How is the index of a Pandas Series created?
1
a. Created automatically
b. Needs to be assigned
Data alignment is intrinsic in the Pandas data structure and happens automatically. One can also assign an index to the data elements.
Knowledge Check
What will the result be in vector addition if a label is not found in a Series?
2
The result will be marked as NaN (Not a Number) for missing labels.
Knowledge Check
What is the result of DataFrame[3:9]?
3
This is the DataFrame slicing technique with indexing or selection on data elements. When a user passes the range 3:9, the rows from position 3 up to, but not including, position 9 are sliced and displayed as output.
Knowledge Check
What does the fillna() method do?
4
fillna is one of the basic methods to fill NaN values in a dataset with a desired value by passing that value in parentheses.
Knowledge Check
Which of the following data structures is used to store three-dimensional data?
5
a. Series
b. DataFrame
c. Panel
d. PanelND
The answer is c. Panel is the Pandas data structure used to store three-dimensional data (it has since been deprecated).
Analyze the Federal Aviation Authority (FAA) Dataset Using Pandas
Problem Statement:
Analyze the Federal Aviation Authority (FAA) dataset using Pandas to do the following:
1. View
a. Aircraft manufacturer name
b. State name
c. Aircraft model name
d. Text information
e. Flight phase
f. Event description type
g. Fatal flag
2. Clean the dataset and replace the fatal flag NaN with “No”
3. Find the aircraft types and their occurrences in the dataset
4. Remove all the observations where aircraft names are not available
5. Display the observations where the fatal flag is “Yes”
Instructions to perform:
Download the FAA dataset from the “Resource” tab. Upload the dataset to
your Jupyter notebook to view and evaluate it.
Analyzing the Dataset
Problem Statement:
A dataset in CSV format is given for the Fire Department of New York City (FDNY).
Analyze the dataset to determine:
1. The total number of fire department facilities in New York City
2. The number of fire department facilities in each borough
3. The facility names in Manhattan
Instructions to perform:
Download the FDNY dataset from the “Resource” tab. You can upload the
dataset to your Jupyter notebook to use it.
Thank You