Reshaping Data With Python

This document discusses various techniques for reshaping data in pandas including moving columns to/from indexes, expanding and normalizing list columns, stacking and unstacking indexes, joining and splitting columns, and converting data to and from JSON. It uses a movies dataset as an example throughout with columns like title, release year/month/day, directors, and box office revenue.

Uploaded by

Darlyn LC

Reshaping Data with pandas in Python

Learn Python online at www.DataCamp.com

> Working with indexes

# Move columns to the index with .set_index()
movies_indexed = movies.set_index("title")

# Move index to columns with .reset_index()
movies_indexed.reset_index()

# Replace index, left joining new index to existing data with .reindex()
avengers_index = ["The Avengers", "Avengers: Age of Ultron", "Avengers: Infinity War",
                  "Avengers: Endgame"]
movies_indexed.reindex(avengers_index)
# Equivalent to pd.DataFrame(index=avengers_index) \
#     .merge(movies_indexed, how="left", left_index=True, right_index=True)

> Exploding and normalizing

# Expand list columns with .explode()
# Each element of a list is given its own row
# The number of columns remains unchanged
music_exploded = music.explode("singles")

# For dictionary columns, move items to their own columns with json_normalize()
# By default, each top-level key becomes a new column
pd.json_normalize(music_exploded["singles"])
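As a rough illustration of .explode() followed by json_normalize(), the sketch below rebuilds the Bad Bunny row of the music dataset by hand. The DataFrame construction is an assumption made for this example (the cheat sheet loads the data with read_json()), and the name singles is hypothetical. It passes a plain list of dictionaries to json_normalize(), which is one conservative way to call it.

```python
import pandas as pd

# Hand-built stand-in for one row of the music dataset (an assumption for
# this example; the original data is read from JSON).
music = pd.DataFrame({
    "artist": ["Bad Bunny"],
    "singles": [[
        {"title": "Gato de Noche",
         "tracks": [{"title": "Gato de Noche", "collaborator": "Ñengo Flow"}]},
        {"title": "La Jumpa",
         "tracks": [{"title": "La Jumpa", "collaborator": "Arcángel"}]},
    ]],
})

# One row per single; the number of columns is unchanged and the
# original index label (0) is repeated on both rows.
music_exploded = music.explode("singles")

# Each top-level key of the dictionaries ("title", "tracks") becomes a column.
# "singles" is a hypothetical name for the normalized result.
singles = pd.json_normalize(music_exploded["singles"].tolist())

print(music_exploded.shape)     # (2, 2)
print(list(singles.columns))    # ['title', 'tracks']
```

The tracks column still holds lists of dictionaries, so a second round of .explode() and json_normalize() would be needed to flatten it completely.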
> Content

Definitions

The majority of data analysis in Python is performed in pandas DataFrames. These are rectangular datasets consisting of rows and columns.

An observation contains all the values of the variables related to a single instance of the objects being analyzed. For example, in a dataset of movies, each movie would be an observation.

A variable is an attribute of the objects, recorded across all the observations. For example, the release dates of all the movies.

Tidy data provides a standard way to organize data. Having a consistent shape for datasets enables you to worry less about data structures and focus more on getting useful results. The principles of tidy data are:

Every column is a variable.
Every row is an observation.
Every cell is a single value.

> Stacking and unstacking

# Move (multi-)indexes from a column index to a row index with .stack()
# level argument starts with 0 for the outer index
pig_feed_stacked = pig_feed.stack(level=0)

# Move (multi-)indexes from a row index to a column index with .unstack()
pig_feed_stacked.unstack(level=1)

> Joining and splitting columns

# Concatenate several columns into a single string column with .str.cat()
# Each column must be converted to string type before joining
movies["release_year"].astype(str) \
    .str.cat(movies[["release_month", "release_day"]].astype(str), sep="-")
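To make the .str.cat() recipe for joining the release columns concrete, here is a minimal sketch on two rows copied from the movies table. The variable name release_date is hypothetical, and the day values are assumed to be stored as integers.

```python
import pandas as pd

# Two rows taken from the movies table in this cheat sheet.
movies = pd.DataFrame({
    "title": ["Avatar", "Titanic"],
    "release_year": [2009, 1997],
    "release_month": [12, 11],
    "release_day": [18, 1],
})

# Integer columns have no .str accessor, so every column is cast to
# string before joining.
release_date = movies["release_year"].astype(str).str.cat(
    movies[["release_month", "release_day"]].astype(str), sep="-"
)

print(release_date.tolist())  # ['2009-12-18', '1997-11-1']
```

Integers are not zero-padded by .astype(str), so the second value is 1997-11-1 rather than 1997-11-01.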
# Split a column on a delimiter into several columns with .str.split(expand=True)
movies["directors"].str.split(",", expand=True)

# Combine several columns into a list column with .values.tolist()
movies["release_list"] = movies[["release_year", "release_month", "release_day"]] \
    .values.tolist()

# Split a list column into separate columns with .to_list()
movies[["release_year2", "release_month2", "release_day2"]] = \
    movies["release_list"].to_list()

> Converting to and from JSON

import json

# Convert series containing nested elements to JSON string with json.dumps()
json_singles = json.dumps(music["singles"].to_list())

# Add column from JSON string with json.loads()
music["singles2"] = json.loads(json_singles)

> Dealing with missing data

# Drop rows containing any missing values in the specified columns with .dropna()
people.dropna(subset="weight_kg")

# Fill missing values with a default value with .fillna()
people.fillna({"weight_kg": 100})

> Datasets used throughout this cheat sheet

Throughout this cheat sheet we will use a dataset of the top grossing movies of all time, stored as movies.

title                                 release_year  release_month  release_day  directors                 box_office_busd
Avatar                                2009          12             18           James Cameron             2.922
Avengers: Endgame                     2019          4              22           Anthony Russo, Joe Russo  2.798
Titanic                               1997          11             01           James Cameron             2.202
Star Wars Ep. VII: The Force Awakens  2015          12             14           J.J Abrams                2.068
Avengers: Infinity War                2018          4              23           Anthony Russo, Joe Russo  2.048

The second dataset involves an experiment with the number of unpopped kernels in bags of popcorn, adapted from the Popcorn dataset in R's Stat2Data package.

brand    trial_1  trial_2  trial_3  trial_4  trial_5  trial_6
Orville  26       35       18       14       8        6
Seaway   47       47       14       34       21       37

The third dataset is JSON data about music containing nested elements. The JSON is parsed into nested lists using read_json() from the pandas package. Notice that each element in the singles column is a list of dictionaries.

artist     singles
Bad Bunny  [{'title': 'Gato de Noche',
             'tracks': [{'title': 'Gato de Noche', 'collaborator': 'Ñengo Flow'}]},
            {'title': 'La Jumpa',
             'tracks': [{'title': 'La Jumpa', 'collaborator': 'Arcángel'}]}]
Drake      [{'title': 'Scary Hours 2',
             'tracks': [{'title': "What's Next"},
                        {'title': 'Wants and Needs', 'collaborator': 'Lil Baby'},
                        {'title': 'Lemon Pepper Freestyle', 'collaborator': 'Rick Ross'}]}]

The fourth dataset is a synthetic dataset containing attributes of people. sex is a string column, and hair_color is a categorical column. Two of the weight_kg values are missing.

sex     hair_color  height_cm  weight_kg
Female  brown       166        72
Male    blonde      184        NaN
Female  black       153        NaN
Male    black       192        93

The fifth dataset, pig_feed, shows weight gain in pigs from additives to their feed. There is a multi-index on the columns.

Antibiotic  No        Yes
B12         No   Yes  No   Yes
            19   22   3    54

> Melting and pivoting

# Move side-by-side columns to consecutive rows with .melt()
popcorn_long = popcorn.melt(id_vars="brand", var_name="trial", value_name="n_unpopped")

# Melt using row index as id_variable with .melt(ignore_index=False)
popcorn_indexed = popcorn.set_index("brand")
popcorn_indexed.melt(var_name="trial", value_name="n_unpopped", ignore_index=False)

# Where there is a column multi-index, specify id_vars with a list of tuples
pig_feed.melt(id_vars=[("No", "No")])

# Same as .melt(), plus cleanup of var_name with wide_to_long()
pd.wide_to_long(popcorn, stubnames="trial", i="brand", j="trial_no", sep="_")

# Move values from rows to columns with .pivot()
# Reset the index to completely reverse a melting operation
popcorn_long \
    .pivot(values="n_unpopped", index="brand", columns="trial") \
    .reset_index()

# Move values from rows to columns and aggregate with .pivot_table()
# df.pivot_table(values, index, columns, aggfunc) is equivalent to
# df.groupby([index, columns])[values].agg(aggfunc).reset_index().pivot(index, columns)
popcorn_long \
    .pivot_table(values="n_unpopped", index="brand", columns="trial") \
    .reset_index()
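The melt and pivot recipes in this cheat sheet can be checked end to end on the popcorn table. The sketch below is one possible way to do that: it rebuilds popcorn from the values listed in the dataset description, melts it, pivots it back, and compares .pivot_table() with the groupby chain mentioned in the comments. The names popcorn_wide, via_pivot_table, via_groupby, round_trip_ok and same_result are hypothetical, and the default aggfunc of "mean" is written out for the groupby version.

```python
import pandas as pd

# The popcorn table as listed in this cheat sheet.
popcorn = pd.DataFrame({
    "brand": ["Orville", "Seaway"],
    "trial_1": [26, 47], "trial_2": [35, 47], "trial_3": [18, 14],
    "trial_4": [14, 34], "trial_5": [8, 21], "trial_6": [6, 37],
})

# Wide to long: one row per (brand, trial) pair.
popcorn_long = popcorn.melt(id_vars="brand", var_name="trial", value_name="n_unpopped")

# Long back to wide; reset_index() turns brand back into a regular column.
popcorn_wide = (
    popcorn_long
    .pivot(values="n_unpopped", index="brand", columns="trial")
    .reset_index()
)

# Compare column by column to confirm the round trip restored the values.
round_trip_ok = popcorn_wide.to_dict("list") == popcorn.to_dict("list")

# pivot_table() aggregates with the mean by default. Each (brand, trial)
# cell holds a single value here, so the mean equals that value.
via_pivot_table = popcorn_long.pivot_table(
    values="n_unpopped", index="brand", columns="trial"
)
via_groupby = (
    popcorn_long
    .groupby(["brand", "trial"])["n_unpopped"].agg("mean")
    .reset_index()
    .pivot(index="brand", columns="trial", values="n_unpopped")
)
same_result = via_pivot_table.to_dict() == via_groupby.to_dict()

print(popcorn_long.shape)  # (12, 3)
print(round_trip_ok)       # True
print(same_result)         # True
```

Unlike .pivot(), .pivot_table() also works when a (brand, trial) pair occurs more than once, because the duplicates are aggregated instead of raising an error.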
