Pandas in Action
About this ebook
In Pandas in Action you will learn how to:
Import datasets, identify issues with their data structures, and optimize them for efficiency
Sort, filter, pivot, and draw conclusions from a dataset and its subsets
Identify trends from text-based and time-based data
Organize, group, merge, and join separate datasets
Use a GroupBy object to store multiple DataFrames
Pandas has rapidly become one of Python's most popular data analysis libraries. In Pandas in Action, a friendly and example-rich introduction, author Boris Paskhaver shows you how to master this versatile tool and take the next steps in your data science career. You’ll learn how easy Pandas makes it to efficiently sort, analyze, filter and munge almost any type of data.
Purchase of the print book includes a free eBook in PDF, Kindle, and ePub formats from Manning Publications.
About the technology
Data analysis with Python doesn’t have to be hard. If you can use a spreadsheet, you can learn pandas! While its grid-style layouts may remind you of Excel, pandas is far more flexible and powerful. This Python library quickly performs operations on millions of rows, and it interfaces easily with other tools in the Python data ecosystem. It’s a perfect way to up your data game.
About the book
Pandas in Action introduces Python-based data analysis using the amazing pandas library. You’ll learn to automate repetitive operations and gain deeper insights into your data that would be impractical—or impossible—in Excel. Each chapter is a self-contained tutorial. Realistic downloadable datasets help you learn from the kind of messy data you’ll find in the real world.
What's inside
Organize, group, merge, split, and join datasets
Find trends in text-based and time-based data
Sort, filter, pivot, optimize, and draw conclusions
Apply aggregate operations
About the reader
For readers experienced with spreadsheets and basic Python programming.
About the author
Boris Paskhaver is a software engineer, Agile consultant, and online educator. His programming courses have been taken by 300,000 students across 190 countries.
Table of Contents
PART 1 CORE PANDAS
1 Introducing pandas
2 The Series object
3 Series methods
4 The DataFrame object
5 Filtering a DataFrame
PART 2 APPLIED PANDAS
6 Working with text data
7 MultiIndex DataFrames
8 Reshaping and pivoting
9 The GroupBy object
10 Merging, joining, and concatenating
11 Working with dates and times
12 Imports and exports
13 Configuring pandas
14 Visualization
Pandas in Action
Boris Paskhaver
Manning
Shelter Island
For more information on this and other Manning titles go to
www.manning.com
Dedication
For Meredith Edwards, my ray of sunshine
Copyright
For online information and ordering of these and other Manning books, please visit www.manning.com. The publisher offers discounts on these books when ordered in quantity.
For more information, please contact
Special Sales Department
Manning Publications Co.
20 Baldwin Road
PO Box 761
Shelter Island, NY 11964
Email: orders@manning.com
©2021 by Manning Publications Co. All rights reserved.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by means electronic, mechanical, photocopying, or otherwise, without prior written permission of the publisher.
Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in the book, and Manning Publications was aware of a trademark claim, the designations have been printed in initial caps or all caps.
♾ Recognizing the importance of preserving what has been written, it is Manning’s policy to have the books we publish printed on acid-free paper, and we exert our best efforts to that end. Recognizing also our responsibility to conserve the resources of our planet, Manning books are printed on paper that is at least 15 percent recycled and processed without the use of elemental chlorine.
ISBN: 9781617297434
contents
front matter
preface
acknowledgments
about this book
about the author
about the cover illustration
Part 1. Core pandas
1 Introducing pandas
1.1 Data in the 21st century
1.2 Introducing pandas
Pandas vs. graphical spreadsheet applications
Pandas vs. its competitors
1.3 A tour of pandas
Importing a data set
Manipulating a DataFrame
Counting values in a Series
Filtering a column by one or more criteria
Grouping data
2 The Series object
2.1 Overview of a Series
Classes and instances
Populating the Series with values
Customizing the Series index
Creating a Series with missing values
2.2 Creating a Series from Python objects
2.3 Series attributes
2.4 Retrieving the first and last rows
2.5 Mathematical operations
Statistical operations
Arithmetic operations
Broadcasting
2.6 Passing the Series to Python’s built-in functions
2.7 Coding challenge
Problems
Solutions
3 Series methods
3.1 Importing a data set with the read_csv function
3.2 Sorting a Series
Sorting by values with the sort_values method
Sorting by index with the sort_index method
Retrieving the smallest and largest values with the nsmallest and nlargest methods
3.3 Overwriting a Series with the inplace parameter
3.4 Counting values with the value_counts method
3.5 Invoking a function on every Series value with the apply method
3.6 Coding challenge
Problems
Solutions
4 The DataFrame object
4.1 Overview of a DataFrame
Creating a DataFrame from a dictionary
Creating a DataFrame from a NumPy ndarray
4.2 Similarities between Series and DataFrames
Importing a DataFrame with the read_csv function
Shared and exclusive attributes of Series and DataFrames
Shared methods of Series and DataFrames
4.3 Sorting a DataFrame
Sorting by a single column
Sorting by multiple columns
4.4 Sorting by index
Sorting by row index
Sorting by column index
4.5 Setting a new index
4.6 Selecting columns and rows from a DataFrame
Selecting a single column from a DataFrame
Selecting multiple columns from a DataFrame
4.7 Selecting rows from a DataFrame
Extracting rows by index label
Extracting rows by index position
Extracting values from specific columns
4.8 Extracting values from Series
4.9 Renaming columns or rows
4.10 Resetting an index
4.11 Coding challenge
Problems
Solutions
5 Filtering a DataFrame
5.1 Optimizing a data set for memory use
Converting data types with the astype method
5.2 Filtering by a single condition
5.3 Filtering by multiple conditions
The AND condition
The OR condition
Inversion with ~
Methods for Booleans
5.4 Filtering by condition
The isin method
The between method
The isnull and notnull methods
Dealing with null values
5.5 Dealing with duplicates
The duplicated method
The drop_duplicates method
5.6 Coding challenge
Problems
Solutions
Part 2. Applied pandas
6 Working with text data
6.1 Letter casing and whitespace
6.2 String slicing
6.3 String slicing and character replacement
6.4 Boolean methods
6.5 Splitting strings
6.6 Coding challenge
Problems
Solutions
6.7 A note on regular expressions
7 MultiIndex DataFrames
7.1 The MultiIndex object
7.2 MultiIndex DataFrames
7.3 Sorting a MultiIndex
7.4 Selecting with a MultiIndex
Extracting one or more columns
Extracting one or more rows with loc
Extracting one or more rows with iloc
7.5 Cross-sections
7.6 Manipulating the Index
Resetting the index
Setting the index
7.7 Coding challenge
Problems
Solutions
8 Reshaping and pivoting
8.1 Wide vs. narrow data
8.2 Creating a pivot table from a DataFrame
The pivot_table method
Additional options for pivot tables
8.3 Stacking and unstacking index levels
8.4 Melting a data set
8.5 Exploding a list of values
8.6 Coding challenge
Problems
Solutions
9 The GroupBy object
9.1 Creating a GroupBy object from scratch
9.2 Creating a GroupBy object from a data set
9.3 Attributes and methods of a GroupBy object
9.4 Aggregate operations
9.5 Applying a custom operation to all groups
9.6 Grouping by multiple columns
9.7 Coding challenge
Problems
Solutions
10 Merging, joining, and concatenating
10.1 Introducing the data sets
10.2 Concatenating data sets
10.3 Missing values in concatenated DataFrames
10.4 Left joins
10.5 Inner joins
10.6 Outer joins
10.7 Merging on index labels
10.8 Coding challenge
Problems
Solutions
11 Working with dates and times
11.1 Introducing the Timestamp object
How Python works with datetimes
How pandas works with datetimes
11.2 Storing multiple timestamps in a DatetimeIndex
11.3 Converting column or index values to datetimes
11.4 Using the DatetimeProperties object
11.5 Adding and subtracting durations of time
11.6 Date offsets
11.7 The Timedelta object
11.8 Coding challenge
Problems
Solutions
12 Imports and exports
12.1 Reading from and writing to JSON files
Loading a JSON file into a DataFrame
Exporting a DataFrame to a JSON file
12.2 Reading from and writing to CSV files
12.3 Reading from and writing to Excel workbooks
Installing the xlrd and openpyxl libraries in an Anaconda environment
Importing Excel workbooks
Exporting Excel workbooks
12.4 Coding challenge
Problems
Solutions
13 Configuring pandas
13.1 Getting and setting pandas options
13.2 Precision
13.3 Maximum column width
13.4 Chop threshold
13.5 Option context
14 Visualization
14.1 Installing matplotlib
14.2 Line charts
14.3 Bar graphs
14.4 Pie charts
Appendix A. Installation and setup
Appendix B. Python crash course
Appendix C. NumPy crash course
Appendix D. Generating fake data with Faker
Appendix E. Regular expressions
index
front matter
preface
Truth be told, I discovered pandas entirely by luck.
In 2015, I interviewed for a data operations analyst position at Indeed.com, the world’s largest jobs site. For my final technical challenge, I was asked to derive insights from an internal data set, using the Microsoft Excel spreadsheet software. Eager to impress, I pulled out as many tricks as I could from my data analysis toolbox: column sorts, text manipulations, pivot tables, and of course the iconic VLOOKUP function. (OK, maybe iconic is a bit of an exaggeration.)
Strange as it may sound, at the time I didn’t realize that there were any tools for data analysis besides Excel. Excel was ubiquitous: my parents used it, my teachers used it, and my colleagues used it. It felt like an established standard. So when I received a job offer, I immediately bought about $100 worth of Excel books and started studying. It was time to become a spreadsheet specialist!
I showed up for my first day of work with a printout of the 50 most-used Excel functions. I had barely finished logging into my work computer when my manager pulled me into a conference room and informed me that priorities had shifted. The team’s data sets had ballooned to a size that Excel could no longer support. My teammates were also looking for ways to automate the redundant steps in their daily and weekly reports. Luckily, my manager had figured out a solution to both problems. He asked me whether I’d heard of pandas.
“The furry animal?” I asked, perplexed.
“No,” he said. “The Python data analysis library.”
After all my prep, it was time to learn a new technology from scratch. I was a little nervous; I’d never written a line of code before. I was an Excel guy, wasn’t I? Was I capable of doing this? There was only one way to find out. I started diving into the official pandas documentation, into YouTube videos, books, workshops, Stack Overflow questions, and whatever data sets I could get my hands on. I was relieved to discover how easy and joyful it was to get started with pandas. The code felt intuitive and straightforward. The library was fast. The features were well-developed and expansive. With pandas, I could accomplish a lot of data manipulation with a little code.
Stories like mine are common in the Python community. The language’s astronomical growth over the past decade is often attributed to the ease with which new developers can pick it up. I am confident that if you’re in a position similar to mine, you can learn pandas just as well. If you’re looking to expand your data analysis skills beyond Excel spreadsheets, this book is your invitation.
When I felt comfortable with pandas, I continued to explore Python and then other programming languages. In many ways, pandas spearheaded my transition into full-time software engineering. I owe a lot to this powerful library, and I’m excited to pass on the torch of knowledge to you. I hope that you discover the magic of what code can do for you.
acknowledgments
It took a lot to get Pandas in Action to the finish line, and I want to express my utmost gratitude to the people who supported me in its two-year writing process.
First and foremost, a warm thank you to my wonderful girlfriend, Meredith. From the first sentence, she was steadfast in her support. She’s a vivacious, funny, and kind soul who always picked me up when the going got tough. This book is better because of her. Thank you, Merbear.
Thank you to my parents, Irina and Dmitriy, for providing a welcoming home where I can always find respite.
Thank you to my twin sisters, Mary and Alexandra. They’re remarkably clever, inquisitive, and hard-working for their age, and I couldn’t be prouder of them. Good luck at college!
Thanks to Watson, our golden retriever. He’s not much of a Python expert, but he makes up for it with his entertaining and friendly demeanor.
A big thank you to my editor, Sarah Miller, who was an absolute joy to work with. I am grateful for her patience and insights throughout the process. She was the true captain of the ship, and she kept everything sailing smoothly.
I would not be a software engineer without the opportunities I was given at Indeed. I want to offer my former manager, Srdjan Bodruzic, a hearty thank you for his generosity and mentorship (and for hiring me!). Thanks to my CX teammates—Tommy Winschel, Danny Moncada, JP Schultz, and Travis Wright—for their wisdom and humor. Thanks to other Indeedians who offered a helping hand during my tenure: Matthew Morin, Chris Hatton, Chip Borsi, Nicole Saglimbene, Danielle Scoli, Blairr Swayne, and George Improglou. Thanks to anybody I’ve shared a dinner with at Sophie’s Cuban Cuisine!
I started writing this book as a software engineer at Stride Consulting. I want to thank many Striders for their support throughout the process: David “The Dominator” DiPanfilo, Min Kwak, Ben Blair, Kirsten Nordine, Michael “Bobby” Nunez, Jay Lee, James Yoo, Ray Veliz, Nathan Riemer, Julia Berchem, Dan Plain, Nick Char, Grant Ziolkowski, Melissa Wahnish, Dave Anderson, Chris Aporta, Michael Carlson, John Galioto, Sean Marzug-McCarthy, Travis Vander Hoop, Steve Solomon, and Jan Mlčoch.
Thank you to the friendly faces I’ve had the opportunity to work with as a software engineer and consultant: Francis Hwang, Inhak Kim, Liana Lim, Matt Bambach, Brenton Morris, Ian McNally, Josh Philips, Artem Kochnev, Andrew Kang, Andrew Fader, Karl Smith, Bradley Whitwell, Brad Popiolek, Eddie Wharton, Jen Kwok, and my favorite coffee crew: Adam McAmis and Andy Fritz.
Thank you to the following people for all they add to my life: Nick Bianco, Cam Stier, Keith David, Michael Cheung, Thomas Philippeau, Nicole DiAndrea, and James Rokeach.
Thanks to my favorite band, New Found Glory, for providing the soundtrack to many writing sessions. Pop punk’s not dead!
Thank you to the Manning staff who shepherded the project to completion and helped with marketing efforts: Jennifer Houle, Aleksandar Dragosavljević, Radmila Ercegovac, Candace Gillhoolley, Stjepan Jureković, and Lucas Weber. Thanks also to the Manning staff who oversaw the content: Sarah Miller, my developmental editor; Deirdre Hiam, my production editor; Keir Simpson, my copyeditor; and Jason Everett, my proofreader.
Thanks to the technical reviewers who helped me iron out the kinks: Al Pezewski, Alberto Ciarlanti, Ben McNamara, Björn Neuhaus, Christopher Kottmyer, Dan Sheikh, Dragos Manailoiu, Erico Lendzian, Jeff Smith, Jérôme Bâton, Joaquin Beltran, Jonathan Sharley, Jose Apablaza, Ken W. Alger, Martin Czygan, Mathijs Affourtit, Matthias Busch, Mike Cuddy, Monica E. Guimaraes, Ninoslav Cerkez, Rick Prins, Syed Hasany, Viton Vitanis, and Vybhavreddy Kammireddy Changalreddy. I am a better writer and educator thanks to your efforts.
Finally, to the city of Hoboken, my home for the past six years. I wrote many parts of this manuscript in its public library, local cafes, and bubble tea shops. I made many forward strides in my life in this town, and it is forever etched into my history. Thank you, Hoboken!
about this book
Who should read this book
Pandas in Action is a comprehensive introduction to the pandas library for data analysis. Pandas enables you to perform a multitude of data manipulations with ease: sorting, joining, pivoting, cleaning, deduping, aggregating, and more. The book approaches the subject matter incrementally. It introduces pandas one piece at a time, starting with its smaller building blocks and proceeding to its larger data structures.
Pandas in Action is written for data analysts who have intermediate experience with spreadsheet software (such as Microsoft Excel, Google Sheets, and Apple Numbers) and/or alternative data analysis tools (such as R and SAS). It is also a fitting title for Python developers who are curious to learn more about data analysis.
How this book is organized: A road map
Pandas in Action consists of 14 chapters spread across two parts.
Part 1, “Core pandas,” introduces the base mechanics of the pandas library in an incremental manner:
Chapter 1 analyzes a sample dataset with pandas to present a big-picture overview of what the library is capable of.
Chapter 2 introduces the Series object, a core pandas data structure that stores a collection of ordered data.
Chapter 3 dives into the Series object in greater depth. We explore various Series operations, including sorting values, dropping duplicates, extracting minimums and maximums, and more.
Chapter 4 introduces the DataFrame, a two-dimensional table of data. We apply concepts from the previous chapters to the new data structure and introduce additional manipulations.
Chapter 5 shows you how to filter subsets of rows from a DataFrame by using various logical conditions: equality, inequality, comparison, inclusion, exclusion, and more.
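As a rough preview of the Part 1 topics above, here is a minimal sketch that builds a Series, builds a DataFrame, and filters its rows. The data and variable names are invented for illustration and do not come from the book's data sets.

```python
import pandas as pd

# Hypothetical sample data, invented for illustration only.
ratings = pd.Series([4.5, 3.0, 5.0], index=["tea", "coffee", "juice"])

# Series operations: sort the values and find the label of the maximum.
sorted_ratings = ratings.sort_values(ascending=False)
top_drink = ratings.idxmax()

# A DataFrame is a two-dimensional table; each column is a Series.
drinks = pd.DataFrame({
    "name": ["tea", "coffee", "juice"],
    "price": [2.50, 3.25, 4.00],
    "caffeinated": [True, True, False],
})

# Filter rows by combining two logical conditions with & (AND).
cheap_caffeinated = drinks[(drinks["price"] < 3.00) & drinks["caffeinated"]]

print(sorted_ratings)
print(top_drink)
print(cheap_caffeinated)
```

Each condition inside the square brackets produces a Series of Booleans, and pandas keeps only the rows where the combined result is True.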
Part 2, “Applied pandas,” focuses on more-advanced pandas features and the problems they solve in real-world datasets:
Chapter 6 teaches you how to work with imperfect text data in pandas. We discuss how to solve issues such as removing whitespace, fixing character casing, and extracting multiple values from a single column.
Chapter 7 discusses the MultiIndex, which allows us to combine multiple column values into a single identifier for a row of data.
Chapter 8 describes how to aggregate our data in a pivot table, shift headers from the row axis to the column axis, and convert our data from wide format to narrow format.
Chapter 9 explores how to group rows into buckets and aggregate the resulting collections via the GroupBy object.
Chapter 10 walks you through combining multiple data sets into a single one by using various joins.
Chapter 11 demonstrates how to work with dates and times in pandas. It covers topics such as sorting dates, calculating durations, and determining whether a date falls at the start of a month or quarter.
Chapter 12 shows you how to import additional file types into pandas, including Excel and JSON. We also learn how to export data from pandas.
Chapter 13 focuses on configuring the library’s settings. We dive into how to modify the number of displayed rows, alter the precision of floating-point numbers, round values below a threshold, and more.
Chapter 14 explores data visualization using the matplotlib library. We see how to use pandas data to create line charts, bar graphs, pie charts, and more.
Each chapter builds upon the preceding one. For those who are learning pandas from scratch, I recommend proceeding through the chapters in linear order. Simultaneously, to ensure that the book is helpful as a reference guide, I’ve written each chapter as an independent tutorial with its own data sets. We start writing our code from scratch at the beginning of each chapter, so you can start with any chapter you like.
Most chapters conclude with a coding challenge that allows you to practice the chapter’s concepts. I strongly recommend taking a shot at these exercises.
Pandas is built on the Python programming language, and basic knowledge of the language’s mechanics is recommended before you get started. For those who have limited experience in Python, appendix B offers a hearty introduction to the language.
About the code
This book contains many examples of source code, which is formatted in a fixed-width font like this to separate it from ordinary text.
The source code for the book’s examples is available at the following GitHub repository: https://github.com/paskhaver/pandas-in-action. For those who are new to Git and GitHub, look for a Download Zip button on the repository page. Those who are experienced with Git and GitHub are welcome to clone the repo from the command line.
The repository also includes the complete data sets for the text. When I was learning pandas, one of my biggest frustrations was that tutorials loved to rely on randomly generated data. There was no consistency, no context, no story, no fun. In this book, we’ll work with many real-world data sets that cover everything from basketball players’ salaries to Pokémon types to restaurant health inspections. Data is everywhere around us, and pandas is one of the best tools available today to make sense of it. I hope that you enjoy the casual focus of the data sets.
liveBook discussion forum
Purchase of Pandas in Action includes free access to a private web forum run by Manning Publications where you can make comments about the book, ask technical questions, and receive help from the author and from other users. To access the forum, go to https://livebook.manning.com/#!/book/pandas-in-action/discussion. You can also learn more about Manning’s forums and the rules of conduct at https://livebook.manning.com/#!/discussion.
Manning’s commitment to our readers is to provide a venue where meaningful dialogue between individual readers and between readers and the author can take place. It is not a commitment to any specific amount of participation on the part of the author, whose contribution to the forum remains voluntary (and unpaid). We suggest that you try asking the author some challenging questions lest their interest stray! The forum and the archives of previous discussions will be accessible from the publisher’s website as long as the book is in print.
Other online resources
The official pandas documentation is available at https://pandas.pydata.org/docs.
In my spare time, I create technical video courses on Udemy. You can find the courses at https://www.udemy.com/user/borispaskhaver; they include a 20-hour pandas course and a 60-hour Python course.
Feel free to reach out to me via Twitter (https://twitter.com/borispaskhaver) or LinkedIn (https://www.linkedin.com/in/boris-paskhaver).
about the author
Boris Paskhaver is a full-stack software engineer, consultant, and online educator based in New York City. He has six courses on the e-learning platform Udemy with over 140 hours of videos, 300,000 students, 20,000 reviews, and 1 million minutes of content consumed monthly. Before becoming a software engineer, Boris worked as a data analyst and systems administrator. He graduated from New York University in 2013 with a double major in business economics and marketing.
about the cover illustration
The figure on the cover of Pandas in Action is captioned “Dame de Calais,” or Lady from Calais. The illustration is taken from a collection of dress costumes from various countries by Jacques Grasset de Saint-Sauveur (1757–1810), titled Costumes de Différents Pays, published in France in 1797. Each illustration is finely drawn and colored by hand. The rich variety of Grasset de Saint-Sauveur’s collection reminds us vividly of how culturally apart the world’s towns and regions were only 200 years ago. Isolated from one another, people spoke different dialects and languages. In the streets or in the countryside, it was easy to identify by their dress alone where they lived and what their trade or station in life was.
The way we dress has changed since then, and diversity by region, so abundant at the time, has faded away. Now it is hard to tell apart the inhabitants of different continents, let alone different towns, regions, or countries. Perhaps we have traded cultural diversity for a more varied personal life—certainly for a more varied and fast-paced technological life.
At a time when it is hard to tell one computer book from another, Manning celebrates the inventiveness and initiative of the computer business with book covers based on the deep diversity of regional life of two centuries ago, brought back to life by Grasset de Saint-Sauveur’s pictures.
Part 1. Core pandas
Welcome! In this section, we’ll familiarize ourselves with the core mechanics of pandas and its two primary data structures: the one-dimensional Series and the two-dimensional DataFrame. Chapter 1 begins with an analysis of a data set with pandas so you can immediately get a sense of what is possible with the library. From there, we proceed to an in-depth exploration of the Series in chapters 2 and 3. We learn how to create a Series from scratch; import it from an external data set; and apply a slew of mathematical, statistical, and logical operations to it. In chapter 4, we introduce the tabular DataFrame and various ways to extract rows, columns, and values from its data. Finally, chapter 5 focuses on extracting subsets of DataFrame rows by applying logical criteria. Along the way, we’ll work through eight datasets that cover everything from box-office grosses to NBA players to Pokémon.
This part covers the essentials of pandas, the fundamentals you need to know to work effectively with the library. I’ve made every effort to start from square one, from the smallest building blocks possible, and proceed to the larger and more complex elements. The following five chapters build the foundation for your mastery of pandas. Good luck!
1 Introducing pandas
This chapter covers
The growth of data science in the 21st century
The history of the pandas library for data analysis
The pros and cons of pandas and its competitors
Data analysis in Excel versus data analysis with a programming language
A tour of the library’s features through a working example
Welcome to Pandas in Action! Pandas is a library for data analysis built on top of the Python programming language. A library (also called a package) is a collection of code for solving problems in a specific field of endeavor. Pandas is a toolbox for data manipulation operations: sorting, filtering, cleaning, deduping, aggregating, pivoting, and more. The epicenter of Python’s vast data science ecosystem, pandas pairs well with other libraries for statistics, natural language processing, machine learning, data visualization, and more.
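To make a few of those operations concrete, here is a minimal sketch of deduping, sorting, aggregating, and pivoting. The sales table is hypothetical, invented purely for illustration, and the variable names are assumptions rather than anything from the book.

```python
import pandas as pd

# Hypothetical sales records, invented for illustration.
# The last two rows repeat earlier rows exactly.
sales = pd.DataFrame({
    "region": ["East", "West", "East", "West", "East"],
    "product": ["apple", "apple", "pear", "apple", "pear"],
    "units": [10, 7, 3, 7, 3],
})

# Deduping: drop rows that exactly repeat an earlier row.
deduped = sales.drop_duplicates()

# Sorting: order the rows from most units to fewest.
by_units = deduped.sort_values("units", ascending=False)

# Aggregating: total units sold in each region.
units_per_region = deduped.groupby("region")["units"].sum()

# Pivoting: regions as rows, products as columns, summed units as values.
pivoted = deduped.pivot_table(
    index="region", columns="product", values="units", aggfunc="sum"
)

print(by_units)
print(units_per_region)
print(pivoted)
```

Each step returns a new object rather than altering the original table, so the intermediate results can be inspected one at a time.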
In this introductory chapter, we’ll explore the history and evolution of modern data analytics tools. We’ll see how pandas grew from one financial analyst’s pet project to an industry standard used by companies such as Stripe, Google, and J.P. Morgan. We’ll compare the library with its competitors, including Excel and R. We’ll discuss the differences between working with a programming language and working with a graphical spreadsheet application. Finally, we’ll use pandas to analyze a real-world data set. Consider this chapter to be a sneak preview of the concepts you’ll master throughout the book. Let’s dive in!
1.1 Data in the 21st century
“It is a capital mistake to theorize before one has data,” Sherlock Holmes advises his assistant John Watson in “A Scandal in Bohemia,” the first of Sir Arthur Conan Doyle’s classic short stories pairing the duo. “Insensibly one begins to twist facts to suit theories, instead of theories to suit facts.”
The wise detective’s words continue to ring true more than a century after the publication of Doyle’s work, in a world in which data is becoming increasingly prevalent in every facet of our lives. “The world’s most valuable resource is no longer oil, but data,” declared The Economist in a 2017 opinion piece. Data is evidence, and evidence is critical to businesses, governments, institutions, and individuals solving increasingly complex problems in our interconnected world. Across a breadth of industries, the world’s most successful companies, from Facebook to Amazon to Netflix, cite data as the most prized asset in their portfolios. United Nations Secretary-General António Guterres called accurate data “the lifeblood of good policy and decision-making.”
Data powers everything from movie recommendations to medical treatments, from supply chain logistics to poverty-reduction initiatives. The success of communities, companies, and even countries in the 21st century will depend on their ability to acquire, aggregate, and analyze data.
1.2 Introducing pandas
The technological ecosystem of tools for working with data has grown tremendously over the past decade. Today, the open source pandas library is one of the most popular solutions available for data analysis and manipulation. Open source means that the library’s source code is publicly available to download, use, modify, and distribute. Its license grants users more permissions than proprietary software such as Excel. Pandas is free to use. A global team of volunteer software developers maintains the library, and you can find its complete source code on GitHub (https://github.com/pandas-dev/pandas).
Pandas is comparable to Microsoft’s Excel spreadsheet software and Google’s in-browser Sheets application. In all three technologies, a user interacts with tables consisting of rows and columns of data. A row represents a record or, equivalently, one collection of values for the columns. Transformations are applied to coax the data into the desired state.
Figure 1.1 displays a sample transformation of a data set. The analyst applies an operation to the four-row data set on the left to arrive at the two-row data set on the right. They may select rows that fit a criterion, for example, or remove duplicate rows from the original data set.
Figure 1.1 A sample transformation of a tabular data set
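The kind of transformation described above can be sketched in a few lines. This is only an illustration under stated assumptions: the four-row table, its column names, and its values are invented here and are not the data shown in figure 1.1.

```python
import pandas as pd

# Hypothetical four-row data set; the names and values are made up.
sales = pd.DataFrame({
    "Product": ["Apple", "Banana", "Apple", "Banana"],
    "Units": [10, 3, 10, 3],
})

# Option 1: keep only the rows that fit a criterion (more than 5 units).
large_orders = sales[sales["Units"] > 5]

# Option 2: remove rows that exactly duplicate an earlier row.
unique_rows = sales.drop_duplicates()

print(len(sales), len(large_orders), len(unique_rows))  # 4 2 2
```

Either operation turns the four-row table into a two-row one, and neither changes the original sales table.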
What makes pandas unique is the balance it strikes between processing power and user productivity. By relying on lower-level languages such as C for many of its calculations, the library can efficiently transform million-row data sets in milliseconds. At the same time, it maintains a simple and intuitive set of commands. It is easy to accomplish a lot with a little code in pandas.
Figure 1.2 shows some sample pandas code that imports and sorts a CSV data set. Don’t worry about the code yet, but take a second to notice that the entire operation takes only two lines of code.
Figure 1.2 A sample of code that imports and sorts a data set in pandas
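A two-step import-and-sort might look like the sketch below. It is not a reproduction of figure 1.2: the CSV text, the column names, and the choice to sort by Score are assumptions made for illustration, and an in-memory string stands in for a file on disk so the snippet runs anywhere.

```python
import io
import pandas as pd

# Stand-in for a CSV file on disk; the contents are invented for this sketch.
csv_text = io.StringIO("Name,Score\nCarol,72\nAlice,95\nBob,88\n")

# The two steps: import the data, then sort it.
scores = pd.read_csv(csv_text)
sorted_scores = scores.sort_values(by="Score", ascending=False)

print(list(sorted_scores["Name"]))  # ['Alice', 'Bob', 'Carol']
```

With a real file, the first argument to read_csv would be the file's path instead of the StringIO object.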
Pandas works seamlessly with numbers, text, dates, times, missing data, and more. We’ll explore its incredible versatility as we proceed through the more than 30 data sets included with this book.
The first version of pandas was developed in 2008 by software developer Wes McKinney, who was working at New York’s AQR Capital Management investment firm. Dissatisfied with both Excel and the statistical programming language R, McKinney searched for a tool that would make it easy to solve common data problems in the financial industry, particularly cleanup and aggregation. Unable to find an ideal product, he decided to build one himself. At the time, Python was far from the powerhouse it is today, but the beauty of the language inspired McKinney to build his library on top of its foundation. “I loved [Python] for its economy of expressions,” he stated in Quartz (http://mng.bz/w0Na). “You can express complicated ideas in Python with very little code, and it is very easy to read.”
Pandas has seen continual, extensive growth since its release to the public in December 2009. User counts are estimated to be between five and ten million¹. As of June 2021, pandas has been downloaded more than 750 million times from PyPI, the centralized online repository of Python packages (https://pepy.tech/project/pandas). Its GitHub code repository has more than 30,000 stars (a star is equivalent to a “like” on the platform). Pandas questions make up a growing percentage of questions on the question-answer aggregator Stack Overflow, suggesting increased user interest.
I would argue that we can even credit pandas for the astronomical growth of Python itself. The language has exploded in popularity because of its prevalence in data science, a field to which pandas contributes greatly. Python is now the most common first language taught at colleges and universities. The TIOBE index, a ranking of programming language popularity by search engine traffic, declared Python to be the fastest-growing language of 2018². “If Python can keep this pace, it will probably replace C and Java in 3 to 4 years’ time, thus becoming the most popular programming language of the world,” wrote TIOBE in a press release. As you learn pandas, you’ll also be learning Python, which is another perk of the library.
1.2.1 Pandas vs. graphical spreadsheet applications
Pandas requires a different mindset from a graphical spreadsheet app such as Excel. Programming is inherently more verbal than it is visual. We communicate with the computer through commands, not clicks. Because it makes fewer assumptions about what you’re trying to accomplish, a programming language tends to be more unforgiving. It needs to be told what to do with no uncertainty. We need to issue the correct instructions with the correct inputs in the correct order; otherwise, the program will not work.
Due to these stricter requirements, pandas has a steeper learning curve than Excel or Sheets. But if you have limited experience in Python or programming in general, there’s no need to worry! When you’re fiddling with functions such as SUMIF and VLOOKUP in Excel, you’re already thinking like a programmer. The process is the same: identify the correct function to use and then supply the right inputs in the proper order. Pandas requires an identical set of skills; the difference is that we’re communicating with the computer in a more verbose language.
When you become familiar with its complexities, pandas grants you greater power and flexibility in your data manipulations. In addition to extending the range of your available procedures, programming allows you to automate them. You can write a piece of code once and reuse it across multiple files—perfect for those pesky daily and weekly reports. It’s important to note that Excel comes bundled with Visual Basic for Applications (VBA), a programming language that also enables you to automate spreadsheet procedures. I would argue, however, that Python is easier to pick up than VBA and has uses beyond data analysis, making it a better investment of your time.
There are additional benefits to making the jump from Excel to Python. Jupyter Notebook, the coding environment often paired with pandas, allows for more dynamic, interactive, and comprehensive reports. A Jupyter Notebook consists of cells, each of which contains a chunk of executable code. An analyst can integrate these cells with headers, charts, descriptions, annotations, images, videos, diagrams, and more. Readers can follow the analyst’s step-by-step logic to see how they reached their conclusion, not only their final result.
Another advantage of pandas is Python’s large data science ecosystem. Pandas integrates easily with libraries for statistics, natural language processing, machine learning, web scraping, data visualization, and more. New libraries appear yearly. Experimentation is welcomed. Innovation is constant. These robust tools sometimes remain underdeveloped in corporate competitors, which lack the support of a large, global community of contributors.
Graphical spreadsheet applications also begin to struggle as data sets grow; pandas is significantly more powerful than Excel in this respect. The capacity of the library is limited only by the computer’s memory and processing power. On most modern machines, pandas plays well with multigigabyte data sets with millions of rows, especially when a developer knows how to exploit all its performance optimizations. In a blog post describing the limitations of the library, creator Wes McKinney wrote, “Nowadays, my rule of thumb for pandas is that you should have 5 to 10 times more RAM as the size of your data set” (http://mng.bz/qeK6).
Part of the challenge in choosing the best tool for the job is defining what terms such as data analysis and big data mean to your organization and your project. Excel, which is used by approximately 750 million working professionals globally, limits its spreadsheets to 1,048,576 rows of data³. For some analysts, 1 million rows of data are more than any report requires; for others, 1 million rows only scratch the surface.
I would advise you to look at pandas as being not the best data analysis solution but a powerful option to use alongside other modern technologies. Excel is still an excellent choice for quick, easy data manipulations. A spreadsheet application usually makes assumptions about your intent, which is why it takes only a few clicks to import a CSV file or sort a column of 100 values. There’s no real advantage to using pandas for simple tasks like these (although it’s more than capable of doing them). But what do you use when you need to clean text values in two data sets of ten million rows each, remove their duplicate records, join them, and replicate that logic for 100 batches of files? For those scenarios, it’s easier and less time-consuming to do the work with Python and pandas.
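To make the clean, deduplicate, and join scenario concrete, here is a minimal sketch. The function name clean_and_join, the Name key column, and the tiny customers and orders tables are all hypothetical; a real job would load each batch from files and would likely need more careful cleaning rules.

```python
import pandas as pd

def clean_and_join(left, right):
    """Normalize the key text, drop duplicate records, and join two tables."""
    cleaned = []
    for frame in (left, right):
        frame = frame.copy()
        # Strip stray whitespace and standardize capitalization in the key column.
        frame["Name"] = frame["Name"].str.strip().str.title()
        cleaned.append(frame.drop_duplicates())
    # Keep only the records whose Name appears in both tables.
    return cleaned[0].merge(cleaned[1], on="Name", how="inner")

# Invented sample data with messy text and a duplicate record.
customers = pd.DataFrame({
    "Name": [" alice ", "BOB", "alice"],
    "City": ["Oslo", "Lima", "Oslo"],
})
orders = pd.DataFrame({"Name": ["Alice", "bob "], "Total": [30, 45]})

report = clean_and_join(customers, orders)
print(report)
```

Because the logic lives in one function, the same call could be repeated in a loop over any number of batches, which is the part that becomes tedious in a point-and-click tool.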
1.2.2 Pandas vs. its competitors
Data science enthusiasts frequently compare pandas with the open source programming language R and the proprietary software suite SAS. Each solution has its own community of advocates.
R is a specialized language with a foundation in statistics, whereas Python is a generalist language used in multiple technical domains. Predictably, the two languages tend to attract users with expertise in specific fields. Hadley Wickham, a prominent developer in the R community who built a collection of data science packages called tidyverse, advises users to see the two languages as collaborators rather than rivals. “These things exist independently and are both awesome in different ways,” he said in Quartz (http://mng.bz/Jv9V). “A pattern that I see is that the data science team in a company uses R and the data engineering team uses Python. The Python people tend to have a background in software engineering and are very confident about their programming skills. . . . [The R users] really like R, but can’t argue with the engineering team because they don’t have the language to make that argument.”
One language may have an advanced feature that the other does not, but the two have achieved near parity when it comes to common tasks in data analysis. Developers and data scientists simply gravitate to what they know best.
A suite of complementary software tools that supports statistics, data mining, econometrics, and more, SAS is a commercial product developed by the North Carolina-based SAS Institute. It charges an annual user subscription fee that varies based on the bundle of selected software. The advantages conferred by a corporate-backed product include technical and visual consistency across tools, robust documentation, and a product road map geared towards enterprise clients’ needs. Open source technology like pandas enjoys a more free-for-all approach; developers work for their needs and for other developers’ needs, which sometimes miss market trends.
Certain technologies share features with pandas but serve intrinsically different purposes. SQL is one example. SQL (Structured Query Language) is a language for communicating with relational databases. A relational database consists of tables of data linked by common keys. We can use SQL for basic data manipulations such as extracting columns from tables and filtering rows by a criterion, but its functionalities are greater in scope and fundamentally revolve around data management. Databases are built to store data; data analysis is a secondary use case. SQL can create new tables, update existing records with new values, delete existing records, and so on. By comparison, pandas is built entirely for data analysis: statistical calculations, data wrangling, data merges, and more. In a typical work environment, the two tools often serve as complements. An analyst might use SQL to extract an initial cluster of data and then use pandas to manipulate it.
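The SQL-then-pandas workflow can be sketched with Python's built-in sqlite3 module. The employees table, its columns, and its values are invented for this example, and a throwaway in-memory database stands in for a real relational database.

```python
import sqlite3
import pandas as pd

# Build a throwaway in-memory database; the table and values are invented.
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE employees (name TEXT, department TEXT, salary REAL)"
)
connection.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [("Ana", "Sales", 50000), ("Ben", "Sales", 60000), ("Cy", "IT", 70000)],
)

# SQL extracts the initial cluster of data from the database...
employees = pd.read_sql_query(
    "SELECT name, department, salary FROM employees", connection
)
connection.close()

# ...and pandas takes over for the analysis (average salary per department).
average_salary = employees.groupby("department")["salary"].mean()
print(average_salary)
```

The database remains the system of record for storing and updating the data, while the DataFrame is a working copy for analysis.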
In summary, pandas is not the only tool in town, but it is a powerful, popular, and valuable solution for solving most data analysis problems. Again, Python truly shines in its focus on brevity and productivity. As its creator, Guido van Rossum, remarked, “The joy of coding Python should be in seeing short, concise, readable [data structures] that express a lot of action in a small amount of clear code” (http://mng.bz/7jo7). Pandas lives up to that standard and is an excellent next step for spreadsheet analysts who are eager to grow their programming skills with a powerful, modern data analysis toolkit.
1.3 A tour of pandas
The best way to grasp the power of pandas is to see it in action. Let’s take a quick tour of the library by analyzing a data set of more than 750 of the highest-grossing movies of all time. I hope you are pleasantly surprised by how intuitive the syntax of pandas can be, even if you are new to programming.
As you proceed through the rest of the chapter, try not to overanalyze the code samples; you don’t even need to copy them. Our goal right now is to get a bird’s-eye view of the features and functionalities of pandas. Think about what the library can do; we’ll worry about how in greater detail later.
We’ll be using the Jupyter Notebook development environment to write our code throughout the book. If you need help setting up pandas and Jupyter Notebook on your computer, see appendix A. You can download all data sets and completed Jupyter Notebooks at https://www.github.com/paskhaver/pandas-in-action.
1.3.1 Importing a data set
Let’s get started! First, we’ll create a new Jupyter Notebook inside the same directory as the movies.csv file; then we’ll import the pandas library to gain access to its features:
In [1] import pandas as pd
The box to the left of the code (displaying the number 1 in the previous example) marks the cell’s execution order relative to the launch or restart of the Jupyter Notebook. You can execute the cells in any order, and you can execute the same cell multiple times.
As you read through the book, you are encouraged to experiment by executing different snippets of code in your Jupyter cells. Thus, it is OK if your execution numbers do not match those in the text.
Our data is stored in a single movies.csv file. A CSV (comma-separated values) file is a plain-text file that separates each row of data with a line break and each row value with a comma. The first row in the file holds the column headers for the data. Here’s a preview of the first three rows of movies.csv:
Rank,Title,Studio,Gross,Year
1,Avengers: Endgame,Buena Vista,"$2,796.30",2019
2,Avatar,Fox,"$2,789.70",2009
The first row lists the five columns in the data set: Rank, Title, Studio, Gross, and Year. The second row holds the first record or, equivalently, the data for the first movie. The film has a Rank of 1, a Title of “Avengers: Endgame”, a Studio of “Buena Vista”, a Gross of “$2,796.30”, and a Year of 2019. The next line holds the values for the next movie, and the pattern repeats for the remaining 750-plus rows in the data set.
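The way a CSV row splits into values can be checked with Python's built-in csv module. This sketch assumes that the Gross values are wrapped in double quotes in the file; without them, the comma inside $2,796.30 would be read as a column separator.

```python
import csv
import io

# In-memory stand-in for the first three lines of movies.csv.
# The double quotes around the Gross values are an assumption.
csv_text = io.StringIO(
    'Rank,Title,Studio,Gross,Year\n'
    '1,Avengers: Endgame,Buena Vista,"$2,796.30",2019\n'
    '2,Avatar,Fox,"$2,789.70",2009\n'
)

rows = list(csv.reader(csv_text))
header = rows[0]       # the column names
first_movie = rows[1]  # the first record

print(header)
print(first_movie)
```

Each record yields exactly five values, one per column header, and every value arrives as plain text at this stage.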
Pandas can import various file types, each of which has an associated import function at the top level of the library. A function in pandas is equivalent to a function in Excel. It’s a command that we issue, either to the library or an entity within it. In this scenario, we’ll use the read_csv function to import the movies.csv file:
In [2] pd.read_csv("movies.csv")
Out [2]
     Rank  Title                         Studio           Gross      Year
0    1     Avengers: Endgame             Buena Vista      $2,796.30  2019
1    2     Avatar                        Fox              $2,789.70  2009
2    3     Titanic                       Paramount        $2,187.50  1997
3    4     Star Wars: The Force Awakens  Buena Vista      $2,068.20  2015
4    5     Avengers: Infinity War        Buena Vista      $2,048.40  2018
...  ...   ...                           ...              ...        ...
777  778   Yogi Bear                     Warner Brothers  $201.60    2010
778  779   Garfield: The Movie           Fox              $200.80    2004
779  780   Cats & Dogs                   Warner Brothers  $200.70    2001
780  781   The Hunt for Red October      Paramount        $200.50    1990
781  782   Valkyrie                      MGM              $200.30    2008
782 rows × 5 columns
Pandas imports the CSV file’s contents into an object called a DataFrame. Think of an object as a container for storing data. Different objects are optimized for different types of data, and we interact with them in different ways. Pandas uses one type of object (the DataFrame) to store multicolumn data sets and another type of object (the Series) to store single-column data sets. A DataFrame is comparable to a multicolumn table in Excel.
To avoid cluttering the screen, pandas displays only the first five and last five rows of the DataFrame. A row of ellipses ( . . . ) marks where the data gap occurs.
This DataFrame consists of five columns (Rank, Title, Studio, Gross, Year) and an index. The index is the range of ascending numbers on the left side of the DataFrame. Index labels serve as identifiers for rows of data. We can set any column as the index of the DataFrame. When we do not explicitly tell pandas which column to use, the library generates a numeric index starting from 0.
What column is a good candidate for the index? It’s one whose values can act as a primary identifier or point of reference for each row. Among our five columns, Rank and Title are the two best options. Let’s swap the autogenerated numeric index with the values from the Title column. We can do so directly during the CSV import:
In [3] pd.read_csv("movies.csv", index_col = "Title")
Out [3]
                              Rank  Studio           Gross      Year
Title
Avengers: Endgame             1     Buena Vista      $2,796.30  2019
Avatar                        2     Fox              $2,789.70  2009
Titanic                       3     Paramount        $2,187.50  1997
Star Wars: The Force Awakens  4     Buena Vista      $2,068.20  2015
Avengers: Infinity War        5     Buena Vista      $2,048.40  2018
...                           ...   ...              ...        ...
Yogi Bear                     778   Warner Brothers  $201.60    2010
Garfield: The Movie           779   Fox              $200.80    2004
Cats & Dogs                   780   Warner Brothers  $200.70    2001
The Hunt for Red October      781   Paramount        $200.50    1990
Valkyrie                      782   MGM              $200.30    2008
782 rows × 4 columns
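The effect of the index_col parameter can be tried on a small scale without the full file. The sketch below uses a three-movie in-memory sample in place of movies.csv, and the quoting of the Gross values is an assumption. The lookup with .loc goes beyond this chapter's walkthrough; it appears only to show why a meaningful index is useful.

```python
import io
import pandas as pd

# Three-movie stand-in for movies.csv (quoting of Gross is assumed).
csv_text = (
    'Rank,Title,Studio,Gross,Year\n'
    '1,Avengers: Endgame,Buena Vista,"$2,796.30",2019\n'
    '2,Avatar,Fox,"$2,789.70",2009\n'
    '3,Titanic,Paramount,"$2,187.50",1997\n'
)

# Without index_col, pandas generates a numeric index starting from 0.
default_index = pd.read_csv(io.StringIO(csv_text))

# With index_col, the Title values become the row labels instead.
title_index = pd.read_csv(io.StringIO(csv_text), index_col="Title")

# A row can now be looked up by its label.
avatar = title_index.loc["Avatar"]

print(list(default_index.index))  # [0, 1, 2]
print(title_index.shape)          # (3, 4)
print(avatar["Studio"])           # Fox
```

Title moves out of the regular columns when it becomes the index, which is why the column count drops from five to four.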
Next, we’ll assign the DataFrame to a movies variable so that we can reference it elsewhere in our program. A variable is a user-assigned name for an object in the program:
In [4] movies = pd.read_csv("movies.csv", index_col = "Title")
For more on variables, check out appendix B.
1.3.2 Manipulating a DataFrame
We can look at the DataFrame from a variety of angles. We can extract a few rows from the beginning:
In [5] movies.head(4)
Out [5]
                              Rank  Studio       Gross      Year
Title
Avengers: Endgame             1     Buena Vista  $2,796.30  2019
Avatar                        2     Fox          $2,789.70  2009
Titanic                       3     Paramount    $2,187.50  1997
Star Wars: The Force Awakens  4     Buena Vista  $2,068.20  2015
Or we can peek at the end of the data set instead:
In [6] movies.tail(6)
Out [6]
                          Rank  Studio           Gross    Year
Title
21 Jump Street            777   Sony             $201.60  2012
Yogi Bear                 778   Warner Brothers  $201.60  2010
Garfield: The Movie       779   Fox              $200.80  2004
Cats & Dogs               780   Warner Brothers  $200.70  2001
The Hunt for Red October  781   Paramount        $200.50  1990
Valkyrie                  782   MGM              $200.30  2008
We can find out how many rows the DataFrame has:
In [7] len(movies)
Out [7] 782
We can ask pandas for the number of rows and columns in the DataFrame. This data set has 782 rows and 4 columns:
In [8] movies.shape
Out [8] (782, 4)
We can inquire about the total number of cells:
In [9] movies.size
Out [9] 3128
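The three measurements above are related: the cell count is the row count multiplied by the column count (782 × 4 = 3,128 for the movies data set). The sketch below checks that relationship on a hypothetical three-row, two-column stand-in rather than on the real file.

```python
import pandas as pd

# Hypothetical three-row, two-column stand-in for the movies DataFrame.
sample = pd.DataFrame(
    {"Studio": ["Buena Vista", "Fox", "Paramount"], "Year": [2019, 2009, 1997]},
    index=pd.Index(["Avengers: Endgame", "Avatar", "Titanic"], name="Title"),
)

row_count, column_count = sample.shape  # shape is a (rows, columns) pair
cell_count = sample.size                # total number of cells

print(len(sample) == row_count)                # True
print(cell_count == row_count * column_count)  # True
```

The index is not counted as a column, so the Title labels do not contribute to shape or size.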
We can ask for the data types