Lesson 07 Data Manipulation With Pandas

The document discusses Pandas, a Python library used for data manipulation and analysis. It introduces Pandas data structures like Series and DataFrame. A Series is a one-dimensional labeled array, while a DataFrame is a two-dimensional labeled array. The document outlines how to create Series and DataFrame from different data types, access elements within them, and perform vectorized operations. It also describes handling missing data and analyzing data using various operation methods in Pandas.

Uploaded by

Sumanta Sinhatal
Copyright
© All Rights Reserved

Data Analytics with Python

Data Manipulation with Pandas


Learning Objectives

By the end of this lesson, you will be able to:

Explain pandas and its features

List different data structures of pandas

Outline the process to create Series and DataFrame with data inputs

Explain how to view, select, and access elements in a data structure

Describe the procedure to handle vectorized operations

Illustrate how to handle missing values

Analyze data with different data operation methods


Introduction to Pandas
Why Pandas

NumPy is great for mathematical computing, but why do we need pandas? Pandas builds on NumPy and adds several functionalities:

• Intrinsic data alignment
• Data structures handling major use cases
• Data operation functions
• Data standardization functions
• Data-handling functions
Features of Pandas

Various features of Pandas make it an efficient library for Data Scientists.

• Powerful data structure
• Fast and efficient data wrangling
• High performance merging and joining of data sets
• Intelligent and automated data alignment
• Easy data aggregation and transformation
• Tools for reading and writing data
Data Structures

Data structures in Pandas library:

Series
• One-dimensional labeled array
• Supports multiple data types

DataFrame
• Two-dimensional labeled array
• Supports multiple data types
• Input can be a series
• Input can be another DataFrame

Panel
• Three-dimensional labeled array
• Supports multiple data types
• Items axis 0
• Major axis rows
• Minor axis columns

Panel 4D (Experimental)
• Four-dimensional labeled array
• Supports multiple data types
• Labels axis 0
• Items axis 1
• Major axis rows
• Minor axis columns

Note: Panel and Panel 4D belong to older pandas releases and are not available in pandas 1.0 and later; current code typically uses a DataFrame with a MultiIndex for higher-dimensional data.
Understanding Series

Series is a one-dimensional array-like object containing data and labels (or index).

Data:          4   11   21   36
Label (index): 0    1    2    3

Data alignment is intrinsic and will not be broken until changed explicitly by program.
Series

Series can be created with different data inputs:

Data Input
• ndarray
• dict
• scalar
• list

Data Types
• Integer
• String
• Python Object
• Floating Point

Data:          2   3   8   4
Label (index): 0   1   2   3
How to Create Series?

Key points to note while creating a series are:

• Import pandas, the main library (import pandas as pd)
• Import NumPy while working with ndarrays (import numpy as np)
• Apply the syntax and pass the data elements as arguments

Basic Method

s = pd.Series(data, index=index)
Series: Example

Consider this example of series in pandas:

Code

import pandas as pd

data = [11, 12, 13]
s = pd.Series(data)  # no index passed, so the default index 0, 1, 2 is used
s
0    11
1    12
2    13
dtype: int64
s[1]
12
Series: Example

Code

import pandas as pd

data = [11, 12, 13]
ind = ["a", "b", "c"]
s = pd.Series(data, index=ind)  # custom labels replace the default index
s
a    11
b    12
c    13
dtype: int64
s["a"]
11
Series: Example

Code

import pandas as pd

data = [11, 12, 13]
ind = ["a", "b", "c"]
s = pd.Series(data, index=ind)
s[["a", "b"]]  # a list of labels selects several elements at once
a    11
b    12
dtype: int64
Creating Series from a List

Import libraries

Pass list as an argument

Data value

Index

Data type
Creating Series from an ndarray

Create an ndarray of countries

Pass the ndarray as an argument

The output shows the countries and the data type
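
The slide's screenshot is not available, so the following is only a minimal sketch of creating a Series from an ndarray. The country names are hypothetical placeholders, not the values used on the slide.

```python
import numpy as np
import pandas as pd

# Hypothetical country names; the slide's actual values are not recoverable
countries = np.array(["Luxembourg", "Norway", "Japan", "Switzerland"])

# Passing the ndarray creates a Series with a default integer index
s_countries = pd.Series(countries)
print(s_countries)
```

Printing the Series lists each index label next to its value, followed by the data type.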
Creating Series from dict

A series can also be created with dict data input for faster operations.

dict for countries and their GDP

Countries have been passed as the index and GDP as the actual data values

The output shows the country (index), the GDP (data), and the data type
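
As a hedged illustration of the dict input, the sketch below uses placeholder country names and GDP figures (they are not real statistics and not the slide's data). The dict keys become the index and the dict values become the data.

```python
import pandas as pd

# Placeholder names and figures for illustration only, not real statistics
gdp_by_country = {"Country A": 52056, "Country B": 40258, "Country C": 38550}

# Keys become the index labels, values become the data
s_gdp = pd.Series(gdp_by_country)
print(s_gdp)

# Look up one value by its label
print(s_gdp["Country B"])
```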
Creating Series from Scalar

Scalar input with an index

The output shows the index, the data, and the data type
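
A minimal sketch of the scalar case, assuming an arbitrary value of 7 and five hypothetical labels: when a scalar is passed together with an index, the value is repeated for every label.

```python
import pandas as pd

# The scalar 7 is repeated once per index label (labels are hypothetical)
s_scalar = pd.Series(7, index=["a", "b", "c", "d", "e"])
print(s_scalar)
```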
Accessing Elements in Series
Data can be accessed through indexers such as loc (label-based) and iloc (position-based), by passing a data element's label, its position, or an index range.

Data element position

First five data elements and their indices

Look up method to access data

Data elements by looking up the index position
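
The access patterns listed above can be sketched as follows. The Series values and labels are made up for illustration; `iloc` selects by position and `loc` selects by label.

```python
import pandas as pd

# Hypothetical data with string labels
s = pd.Series([10, 20, 30, 40, 50, 60], index=list("abcdef"))

first = s.iloc[0]             # element at position 0
first_five = s.iloc[:5]       # first five elements with their labels
by_label = s.loc["c"]         # element whose label is "c"
label_range = s.loc["b":"d"]  # label slices include both endpoints

print(first_five)
```

Note the difference between the two slice styles: a positional slice excludes its end position, while a label slice includes its end label.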


Vectorized Operations in Series

Vectorized operations between series are aligned on index labels, not on the elements' positions.

Add the series

Addition at index level

Addition after shuffling the indices
Vectorized Operations in Series

Adding two series with a few common and a few different indices
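
The following sketch, using made-up values, shows both cases: adding two series whose labels are the same but shuffled, and adding two series that share only some labels. Labels present in only one series produce NaN in the result.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"])

# Same labels in a different order: values are matched by label, not position
s2 = pd.Series([10, 20, 30, 40], index=["d", "c", "b", "a"])
total = s1 + s2  # a: 1+40, b: 2+30, c: 3+20, d: 4+10

# Only "c" and "d" are common; "a", "b", and "e" have no partner
s3 = pd.Series([100, 200, 300], index=["c", "d", "e"])
partial = s1 + s3

print(total)
print(partial)
```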
Create Pandas Series

Objective: Create Pandas Series to input employee names and assign employee IDs to each
employee. Also, retrieve first five employees from the Pandas Series using their IDs.
Access: To execute the practice, follow these steps:
• Go to the PRACTICE LABS tab on your LMS
• Click the START LAB button
• Click the LAUNCH LAB button to start the lab
DataFrames
DataFrame

DataFrame is a two-dimensional labeled data structure with columns of potentially different types.

Data Input
• ndarray
• dict
• List
• Series
• DataFrame

Data Types
• Integer
• String
• Python Object
• Floating Point

Data:          2   3   8   4
               5   8  10   1
Label (index): 0   1   2   3
Creating DataFrame from Lists

Pass the list to the DataFrame
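
A minimal sketch of this step, assuming a hypothetical list of rows and column names (the slide's own data is not recoverable): each inner list becomes one row.

```python
import pandas as pd

# Hypothetical rows: each inner list is one record
rows = [["Ann", 81], ["Ben", 67], ["Cara", 92]]

# Column names are an assumption added for readability
df_from_lists = pd.DataFrame(rows, columns=["name", "score"])
print(df_from_lists)
```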


Creating DataFrame from dict

This example shows you how to create a DataFrame from a series of dicts:

dict one dict two

Entire dict
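
One possible reading of this slide is a list of two dicts passed to the DataFrame constructor. The sketch below assumes that reading and uses hypothetical keys and values; each dict becomes a row, and a key missing from a dict yields NaN.

```python
import pandas as pd

# Hypothetical dicts; dict_one has no "c" key
dict_one = {"a": 1, "b": 2}
dict_two = {"a": 5, "b": 10, "c": 20}

# Each dict becomes one row; keys become the column labels
df_from_dicts = pd.DataFrame([dict_one, dict_two])
print(df_from_dicts)
```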
Viewing DataFrame

You can view a DataFrame by referring to the column name or the describe function.

Viewing a DataFrame

Viewing the entire contents of the dataset
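
The viewing options mentioned above can be sketched as follows, with a small hypothetical dataset: selecting a column by name, summarizing with `describe`, and printing the whole DataFrame.

```python
import pandas as pd

# Hypothetical data for illustration
df = pd.DataFrame({"year": [2014, 2015, 2016, 2017],
                   "sales": [120, 135, 150, 170]})

sales_column = df["sales"]  # a single column, returned as a Series
summary = df.describe()     # count, mean, std, min, quartiles, max per column

print(df)       # the entire contents
print(summary)
```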


Creating DataFrame from dict of Series
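
The content of this slide did not survive extraction. As a hedged sketch with hypothetical marks, a dict of Series can be passed to the constructor; the Series are aligned on their index labels, and a label missing from one Series gives NaN in that column.

```python
import pandas as pd

# Hypothetical marks for two tests; "s2" did not take the second test
test_one = pd.Series([70, 85, 90], index=["s1", "s2", "s3"])
test_two = pd.Series([65, 95], index=["s1", "s3"])

# Dict keys become column names; rows are the union of the index labels
df_marks = pd.DataFrame({"test_one": test_one, "test_two": test_two})
print(df_marks)
```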
Creating DataFrame from ndarray

Create an ndarray with years


Create a dict with the ndarray

Pass this dict to a new DataFrame
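
The three steps above can be sketched as follows. The specific years and the column name are assumptions, since the slide's screenshot is not available.

```python
import numpy as np
import pandas as pd

# Step 1: an ndarray with years (values are hypothetical)
years = np.array([2014, 2015, 2016, 2017])

# Step 2: a dict that maps a column name to the ndarray
year_dict = {"year": years}

# Step 3: pass the dict to a new DataFrame
df_years = pd.DataFrame(year_dict)
print(df_years)
```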


Creating DataFrame from DataFrame Object

Create a DataFrame from a DataFrame object
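
A minimal sketch, assuming hypothetical data: an existing DataFrame can be passed to the constructor, and the `columns` argument then selects which of its columns to keep.

```python
import pandas as pd

# Hypothetical source DataFrame
df_original = pd.DataFrame({"year": [2014, 2015, 2016],
                            "sales": [120, 135, 150]})

# Build a new DataFrame from the existing one, keeping only the "year" column
df_subset = pd.DataFrame(df_original, columns=["year"])
print(df_subset)
```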
Create Pandas DataFrames

Objective: Create DataFrames in the following conditions:


1. Input the marks of two subjects for five students and create a DataFrame using the two series
2. Extract data from the given SalaryGender CSV file and store the data from each column in a
separate NumPy array
3. Create a DataFrame using dictionary of names and age of five people as input
4. Create a DataFrame from the dictionary of series

Access: To execute the practice, follow these steps:


• Go to the PRACTICE LABS tab on your LMS
• Click the START LAB button
• Click the LAUNCH LAB button to start the lab
Create Pandas DataFrames

Objective: Create the following dataframes:

1. A DataFrame of weather data using dictionary with the keys like day, temperature, and
weather conditions
2. Create a DataFrame with a list of dictionaries, rows, and columns
Access: To execute the practice, follow these steps:
• Go to the PRACTICE LABS tab on your LMS
• Click the START LAB button
• Click the LAUNCH LAB button to start the lab
Unassisted Practices: Create Pandas DataFrame

Dictionary of lists of weather data

DataFrame of dictionary

Output
Unassisted Practices: Create Pandas DataFrame

Initializing the list data

Two column indices of values same as the dictionary keys

Print the first dataframe

Print the second dataframe

Output
Missing Values
Missing Values

Various factors may lead to missing data values:

• Data not provided by the source
• Software issue
• Data integration issue
• Network issue
Handling Missing Values
It’s difficult to operate on a dataset when it has missing values or uncommon indices.
Handling Missing Values with Functions
The dropna function drops the rows that contain missing (NaN) values, such as those produced by uncommon indices.
Handling Missing Values: Example

Code

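import pandas as pd
import numpy as np

# Surnames after 'Miller' are placeholders; the original values were lost in extraction
raw_data = {'name': ['Joe', np.nan, 'Tina', 'Mike', 'Amy'],
            'last_name': ['Miller', np.nan, 'Surname3', 'Surname4', 'Surname5'],
            'AGE': [42, np.nan, 36, 24, 73]}
df = pd.DataFrame(raw_data)
df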
Handling Missing Values: Example

Code

df.dropna()
Handling Missing Values with Functions

The fillna function fills the missing (NaN) values, such as those produced by uncommon indices, with a specified value instead of dropping them.

Fill the missing values with zero


Handle Missing Values with Functions: Example
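
The slide's example is not recoverable, so this is only a sketch with hypothetical scores. It contrasts `dropna`, which removes rows containing NaN, with `fillna(0)`, which keeps every row and replaces NaN with zero.

```python
import numpy as np
import pandas as pd

# Hypothetical scores with one missing value in each column
df = pd.DataFrame({"test_one": [70, np.nan, 90],
                   "test_two": [65.0, 95.0, np.nan]})

dropped = df.dropna()  # keeps only rows without any NaN
filled = df.fillna(0)  # replaces every NaN with 0

print(dropped)
print(filled)
```

Both methods return a new DataFrame and leave `df` unchanged.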
Handle Missing Values

Objective: Create a dataframe with a list of dictionaries, rows indices, and column indices with one index having
a different name. Handle the missing values by:
1. Removing the NaN values
2. Filling all the uncommon or NaN values with a number, instead of dropping them
Access: To execute the practice, follow these steps:
• Go to the PRACTICE LABS tab on your LMS
• Click the START LAB button
• Click the LAUNCH LAB button to start the lab
Data Operation
Data Operation

Data operation can be performed through various built-in methods for faster data processing.
Data Operation with Functions
While performing data operation, custom functions can be applied using the applymap method.

Declare a custom function

Test the function

Apply the function to the DataFrame
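
The three steps above can be sketched as follows, with a hypothetical function and data. One caveat: recent pandas releases (2.1 and later) provide the same element-wise behavior under the name `DataFrame.map` and deprecate `applymap`, so this sketch uses `map` when it exists and falls back to `applymap` otherwise.

```python
import pandas as pd

# Step 1: declare a custom function (hypothetical example)
def double(value):
    """Return twice the given value."""
    return value * 2

# Step 2: test the function on a single value
print(double(4))

# Step 3: apply the function to every element of the DataFrame
df = pd.DataFrame({"test_one": [70, 85, 90], "test_two": [65, 95, 80]})

# Prefer DataFrame.map where available; otherwise use applymap
elementwise = getattr(df, "map", None) or df.applymap
doubled = elementwise(double)
print(doubled)
```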


Data Operation with Statistical Functions

Create a DataFrame with two tests

Apply the max function to find the maximum score

Apply the mean function to find the average score

Apply the std function to find the standard deviation for both the tests
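
A minimal sketch of these steps, using hypothetical scores. Each method works column by column and returns a Series with one entry per test; `std` computes the sample standard deviation by default.

```python
import pandas as pd

# Hypothetical scores for two tests
scores = pd.DataFrame({"test_one": [70, 85, 90, 55],
                       "test_two": [65, 95, 80, 60]})

max_scores = scores.max()    # highest score per test
mean_scores = scores.mean()  # average score per test
std_scores = scores.std()    # sample standard deviation per test (ddof=1)

print(max_scores)
print(mean_scores)
print(std_scores)
```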
Data Operation Using Groupby

Create a DataFrame with the first and last names of former presidents

Group the DataFrame by the first name
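
The grouping step can be sketched as follows. The names are chosen for illustration and may differ from those on the slide; the column names `first` and `last` are also assumptions.

```python
import pandas as pd

# Illustrative names; the slide's actual rows are not recoverable
presidents = pd.DataFrame({
    "first": ["George", "Bill", "Ronald", "Jimmy", "George"],
    "last": ["Bush", "Clinton", "Reagan", "Carter", "Washington"],
})

# Group rows that share the same first name
grouped = presidents.groupby("first")

george = grouped.get_group("George")  # all rows whose first name is George
counts = grouped.size()               # number of rows per first name

print(george)
print(counts)
```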


Data Operation: Sorting

Sort values by first name
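
A short sketch of sorting, again with illustrative names and assumed column names. `sort_values` returns a new DataFrame ordered by the given column and leaves the original unchanged.

```python
import pandas as pd

# Illustrative names; column names are assumptions
presidents = pd.DataFrame({
    "first": ["George", "Bill", "Ronald", "Jimmy", "George"],
    "last": ["Bush", "Clinton", "Reagan", "Carter", "Washington"],
})

# Sort rows alphabetically by first name
sorted_presidents = presidents.sort_values("first")
print(sorted_presidents)
```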


Data Operations in Pandas DataFrame

Objective: Consider the SalaryGender dataset to perform the following operations on pandas dataframe:
1. Replace the values 0 and 1 of the gender column with female and male respectively
2. Find the maximum salary and the minimum salary
3. Find the number of men and women with PhD
4. Store age and Phd columns in the dataframe and remove all the people without PhD
5. Calculate the total number of PhD holders
6. Sort the dataframe on the basis of salary

Access: To execute the practice, follow these steps:


• Go to the PRACTICE LABS tab on your LMS
• Click the START LAB button
• Click the LAUNCH LAB button to start the lab
Data Operations in Pandas DataFrame

Objective: Consider the Pandas-results.csv dataset to perform the following operations on pandas dataframe:
1. Create a dataframe of home_team and home_score
2. Find the home team with maximum home scores
3. Find the teams with home score greater than zero and get the basic statistical details of the
dataframe
4. Create a dataframe in the date range 2019-07-10 to 2019-07-20
5. Compare the home score and away score of the teams and add the winning team to a column named
winner
Access: To execute the practice, follow these steps:
• Go to the PRACTICE LABS tab on your LMS
• Click the START LAB button
• Click the LAUNCH LAB button to start the lab
Unassisted Practice: Data Operations in Pandas DataFrame

Reading the Pandas-results.csv file


Creating a dataframe of home team and home score

Output
Unassisted Practice: Data Operations in Pandas DataFrame

Using the max() function to find the maximum goals

Extracting the team details with maximum home score

Output
Unassisted Practice: Data Operations in Pandas DataFrame

Output

Unassisted Practice: Data Operations in Pandas DataFrame

Creating a dataframe in the date range 2019-07-10 to 2019-07-20

Output
Unassisted Practice: Data Operations in Pandas DataFrame
Filtering the data with home score greater than zero

Adding a column named winner to the filtered dataframe

Output
Data Standardization
File Read and Write Support

Format      Read            Write
CSV         read_csv        to_csv
Excel       read_excel      to_excel
HDF         read_hdf        to_hdf
Clipboard   read_clipboard  to_clipboard
HTML        read_html       to_html
JSON        read_json       to_json
Pickle      read_pickle     to_pickle
SQL         read_sql        to_sql
Stata       read_stata      to_stata
SAS         read_sas        (read only)
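
As a small illustration of the read and write pairs, the sketch below writes a hypothetical DataFrame to CSV text and reads it back. An in-memory buffer stands in for a file so the example needs no file on disk; with a real file, a path would be passed instead.

```python
import io

import pandas as pd

# Hypothetical data
df = pd.DataFrame({"team": ["A", "B"], "score": [2, 0]})

# Write the DataFrame as CSV into an in-memory text buffer
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# Rewind the buffer and read the CSV back into a new DataFrame
buffer.seek(0)
restored = pd.read_csv(buffer)
print(restored)
```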
Pandas SQL Operation
Pandas SQL Operations

Objective: Perform the following SQL operations with pandas:


1. Read the Pandas-results.csv dataset and create a dataframe
2. Write the SQL query to retrieve the first 10 items of the dataframe
3. Create a SQL table named movie rating with attributes like movie name, genre, and rating
4. Insert records into SQL table through SQL statement
5. Fetch and view the result
6. Create a dataframe of the result

Access: To execute the practice, follow these steps:


• Go to the PRACTICE LABS tab on your LMS
• Click the START LAB button
• Click the LAUNCH LAB button to start the lab
Pandas SQL Operations

Objective: Perform the following SQL operations:


1. Create a dataframe of customer details with columns like ID, customer, billing address,
and shipping address
2. Write a SQL query to retrieve customers based on a particular ID
Example: For ID=1233, output is customer A
3. Create a small dataframe and write the records in dataframe to a SQL database
Access: To execute the practice, follow these steps:
• Go to the PRACTICE LABS tab on your LMS
• Click the START LAB button
• Click the LAUNCH LAB button to start the lab
Unassisted Practice: Pandas SQL Operations

Import pandasql to query pandas dataframes using SQL syntax

Create a dataframe of customer details

Select the customer based on the ID number

Output
Unassisted Practice: Pandas SQL Operations

Use sqlalchemy to create engine

Create an engine

Create a simple dataframe

Create a database

Write the records in the dataframe to an SQL database

Output
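
The slides use pandasql and sqlalchemy, whose screenshots are not available. The sketch below swaps in Python's built-in `sqlite3` module with an in-memory database, which pandas also accepts as a connection, to show the same idea: write a DataFrame to a SQL table, then query it by ID. The table name, column names, and customer records are hypothetical.

```python
import sqlite3

import pandas as pd

# Hypothetical customer records
customers = pd.DataFrame({
    "ID": [1231, 1232, 1233],
    "customer": ["customer C", "customer B", "customer A"],
})

# In-memory SQLite database (an assumption; the slides use sqlalchemy)
conn = sqlite3.connect(":memory:")

# Write the DataFrame to a SQL table named "customers"
customers.to_sql("customers", conn, index=False)

# Retrieve the customer with a particular ID using a parameterized query
result = pd.read_sql(
    "SELECT customer FROM customers WHERE ID = ?", conn, params=(1233,)
)
conn.close()

print(result)
```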
Key Takeaways

You are now able to:

Explain Pandas and its features

List different data structures of Pandas

Outline the process to create series and DataFrame with data inputs

Explain how to view, select, and access elements in a data structure

Describe the procedure to handle vectorized operations

Illustrate how to handle missing values

Analyze data with different data operation methods


Knowledge Check
Knowledge Check 1
How is an index for data elements assigned while creating a Pandas series? Select all that apply.

a. Created automatically

b. Needs to be assigned

c. Once created can not be changed or altered

d. Index is not applicable as series is one-dimensional

The correct answer is a, b

Data alignment is intrinsic in Pandas data structure and happens automatically. One can also assign index to data
elements.
Knowledge Check 2
What will the result be in vector addition if a label is not found in a series?

a. Marked as zeros for missing labels

b. Labels will be skipped

c. Marked as NaN for missing labels

d. Will prompt an exception, index not found

The correct answer is c

The result will be marked as NaN (Not a Number) for missing labels.
Knowledge Check 3
What is the result of DataFrame[3:9]?

a. Series with sliced index from 3 to 9

b. dict of index positions 3 and 9

c. DataFrame of sliced rows index from 3 to 9

d. DataFrame with data elements at index 3 to 9

The correct answer is c

This is the DataFrame slicing technique with indexing or selection on data elements. When a user passes the range 3:9, the rows at positions 3 through 8 are sliced and displayed as output; the end position 9 is excluded.
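
A quick sketch with a hypothetical 12-row DataFrame makes the slice boundaries concrete. An integer slice in square brackets selects rows by position and excludes the end position.

```python
import pandas as pd

# Hypothetical DataFrame with 12 rows and the default index 0..11
df = pd.DataFrame({"value": range(10, 22)})

# Rows at positions 3, 4, 5, 6, 7, and 8
sliced = df[3:9]
print(sliced)
```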
Knowledge Check 4
What does the fillna() method do?

a. Fills all NaN values with zeros

b. Fills all NaN values with one

c. Fills all NaN values with values mentioned in the parenthesis

d. Drops NaN values from the dataset

The correct answer is c

fillna is one of the basic methods to fill NaN values in a dataset with a desired value by passing that in parenthesis.
Knowledge Check 5
Which of the following data structures is used to store three-dimensional data?

a. Series

b. DataFrame

c. Panel

d. PanelND

The correct answer is c

Panel is a data structure used to store three-dimensional data.


Analyze the Federal Aviation Administration (FAA) Dataset using Pandas

Problem Statement:
Analyze the Federal Aviation Administration (FAA) dataset using Pandas to do the
following:
1. View
a. Aircraft manufacturer name
b. State name
c. Aircraft model name
d. Text information
e. Flight phase
f. Event description type
g. Fatal flag
2. Clean the dataset and replace the fatal flag NaN with “No”
3. Find the aircraft types and their occurrences in the dataset
4. Remove all the observations where aircraft names are not available
5. Display the observations where fatal flag is “Yes”
Analyze the Federal Aviation Administration (FAA) Dataset using Pandas

Instructions to perform:
Download the FAA dataset from the “Resource” tab. Upload the dataset to
your Jupyter notebook to view and evaluate it.
Analyzing the Dataset

Problem Statement:
A dataset in CSV format is given for the Fire Department of New York City (FDNY).
Analyze the dataset to determine:
1. The total number of fire department facilities in New York City
2. The number of fire department facilities in each borough
3. The facility names in Manhattan

Instructions to perform:
Download the FDNY dataset from the “Resource” tab. You can upload the
dataset to your Jupyter notebook to use it.
Thank You
