Ebook, 875 pages, 8 hours

Pandas in Action

Name: Pandas in Action
Author: Boris Paskhaver
ISBN: 9781638351047

By Boris Paskhaver

About this ebook

Take the next steps in your data science career! This friendly and hands-on guide shows you how to start mastering Pandas with skills you already know from spreadsheet software.

In Pandas in Action you will learn how to:

    Import datasets, identify issues with their data structures, and optimize them for efficiency
    Sort, filter, pivot, and draw conclusions from a dataset and its subsets
    Identify trends from text-based and time-based data
    Organize, group, merge, and join separate datasets
    Use a GroupBy object to store multiple DataFrames
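
As a rough illustration of the skills listed above (this is not an excerpt from the book), the sketch below sorts, filters, and groups a tiny dataset. The `sales` table, its column names, and its values are all made up for the example.

```python
import pandas as pd

# Hypothetical data, invented purely for illustration.
sales = pd.DataFrame(
    {
        "region": ["East", "West", "East", "West", "East"],
        "product": ["A", "A", "B", "B", "C"],
        "units": [10, 4, 7, 12, 3],
    }
)

# Sort rows by a column, largest values first.
sorted_sales = sales.sort_values("units", ascending=False)

# Filter rows with a Boolean condition.
big_orders = sales[sales["units"] >= 7]

# A GroupBy object bundles one sub-DataFrame per group.
groups = sales.groupby("region")
east = groups.get_group("East")

# Aggregate each group down to a single value.
units_by_region = groups["units"].sum()
```

Here `east` holds only the rows whose region is "East", and `units_by_region` is a Series indexed by region.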

Pandas has rapidly become one of Python's most popular data analysis libraries. In Pandas in Action, a friendly and example-rich introduction, author Boris Paskhaver shows you how to master this versatile tool and take the next steps in your data science career. You’ll learn how easy Pandas makes it to efficiently sort, analyze, filter and munge almost any type of data.

Purchase of the print book includes a free eBook in PDF, Kindle, and ePub formats from Manning Publications.

About the technology
Data analysis with Python doesn’t have to be hard. If you can use a spreadsheet, you can learn pandas! While its grid-style layouts may remind you of Excel, pandas is far more flexible and powerful. This Python library quickly performs operations on millions of rows, and it interfaces easily with other tools in the Python data ecosystem. It’s a perfect way to up your data game.

About the book
Pandas in Action introduces Python-based data analysis using the amazing pandas library. You’ll learn to automate repetitive operations and gain deeper insights into your data that would be impractical—or impossible—in Excel. Each chapter is a self-contained tutorial. Realistic downloadable datasets help you learn from the kind of messy data you’ll find in the real world.

What's inside

    Organize, group, merge, split, and join datasets
    Find trends in text-based and time-based data
    Sort, filter, pivot, optimize, and draw conclusions
    Apply aggregate operations
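
To give a feel for merging and pivoting, here is a minimal sketch under stated assumptions: the `orders` and `customers` tables, their columns, and their values are hypothetical and do not come from the book's datasets.

```python
import pandas as pd

# Hypothetical tables, invented for illustration.
orders = pd.DataFrame(
    {
        "customer_id": [1, 2, 1, 3],
        "month": ["Jan", "Jan", "Feb", "Feb"],
        "amount": [50.0, 20.0, 30.0, 40.0],
    }
)
customers = pd.DataFrame(
    {"customer_id": [1, 2, 3], "city": ["Boston", "Austin", "Boston"]}
)

# Join the two datasets on their shared key column.
merged = orders.merge(customers, on="customer_id", how="left")

# Pivot: one row per city, one column per month, summing the amounts.
pivot = merged.pivot_table(
    index="city", columns="month", values="amount", aggfunc="sum", fill_value=0
)
```

The left join keeps every order and attaches the matching city; the pivot table then aggregates amounts per city and month, filling combinations that have no orders with 0.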

About the reader
For readers experienced with spreadsheets and basic Python programming.

About the author
Boris Paskhaver is a software engineer, Agile consultant, and online educator. His programming courses have been taken by 300,000 students across 190 countries.

Table of Contents
PART 1 CORE PANDAS
1 Introducing pandas
2 The Series object
3 Series methods
4 The DataFrame object
5 Filtering a DataFrame
PART 2 APPLIED PANDAS
6 Working with text data
7 MultiIndex DataFrames
8 Reshaping and pivoting
9 The GroupBy object
10 Merging, joining, and concatenating
11 Working with dates and times
12 Imports and exports
13 Configuring pandas
14 Visualization

Language: English

Publisher: Manning

Release date: Oct 12, 2021

ISBN: 9781638351047

Pandas in Action

Boris Paskhaver

Manning

Shelter Island

For more information on this and other Manning titles go to www.manning.com

Dedication

For Meredith Edwards, my ray of sunshine

Copyright

For online information and ordering of these and other Manning books, please visit www.manning.com. The publisher offers discounts on these books when ordered in quantity.

For more information, please contact

Special Sales Department

Manning Publications Co.

20 Baldwin Road

PO Box 761

Shelter Island, NY 11964

Email: orders@manning.com

No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by means electronic, mechanical, photocopying, or otherwise, without prior written permission of the publisher.

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in the book, and Manning Publications was aware of a trademark claim, the designations have been printed in initial caps or all caps.

♾ Recognizing the importance of preserving what has been written, it is Manning’s policy to have the books we publish printed on acid-free paper, and we exert our best efforts to that end. Recognizing also our responsibility to conserve the resources of our planet, Manning books are printed on paper that is at least 15 percent recycled and processed without the use of elemental chlorine.

ISBN: 9781617297434

front matter

preface

acknowledgments

about this book

about the author

about the cover illustration

Part 1. Core pandas

1 Introducing pandas

1.1 Data in the 21st century

1.2 Introducing pandas

Pandas vs. graphical spreadsheet applications

Pandas vs. its competitors

1.3 A tour of pandas

Importing a data set

Manipulating a DataFrame

Counting values in a Series

Filtering a column by one or more criteria

Grouping data

2 The Series object

2.1 Overview of a Series

Classes and instances

Populating the Series with values

Customizing the Series index

Creating a Series with missing values

2.2 Creating a Series from Python objects

2.3 Series attributes

2.4 Retrieving the first and last rows

2.5 Mathematical operations

Statistical operations

Arithmetic operations

Broadcasting

2.6 Passing the Series to Python’s built-in functions

2.7 Coding challenge

Problems

Solutions

3 Series methods

3.1 Importing a data set with the read_csv function

3.2 Sorting a Series

Sorting by values with the sort_values method

Sorting by index with the sort_index method

Retrieving the smallest and largest values with the nsmallest and nlargest methods

3.3 Overwriting a Series with the inplace parameter

3.4 Counting values with the value_counts method

3.5 Invoking a function on every Series value with the apply method

3.6 Coding challenge

Problems

Solutions

4 The DataFrame object

4.1 Overview of a DataFrame

Creating a DataFrame from a dictionary

Creating a DataFrame from a NumPy ndarray

4.2 Similarities between Series and DataFrames

Importing a DataFrame with the read_csv function

Shared and exclusive attributes of Series and DataFrames

Shared methods of Series and DataFrames

4.3 Sorting a DataFrame

Sorting by a single column

Sorting by multiple columns

4.4 Sorting by index

Sorting by row index

Sorting by column index

4.5 Setting a new index

4.6 Selecting columns and rows from a DataFrame

Selecting a single column from a DataFrame

Selecting multiple columns from a DataFrame

4.7 Selecting rows from a DataFrame

Extracting rows by index label

Extracting rows by index position

Extracting values from specific columns

4.8 Extracting values from Series

4.9 Renaming columns or rows

4.10 Resetting an index

4.11 Coding challenge

Problems

Solutions

5 Filtering a DataFrame

5.1 Optimizing a data set for memory use

Converting data types with the astype method

5.2 Filtering by a single condition

5.3 Filtering by multiple conditions

The AND condition

The OR condition

Inversion with ~

Methods for Booleans

5.4 Filtering by condition

The isin method

The between method

The isnull and notnull methods

Dealing with null values

5.5 Dealing with duplicates

The duplicated method

The drop_duplicates method

5.6 Coding challenge

Problems

Solutions

Part 2. Applied pandas

6 Working with text data

6.1 Letter casing and whitespace

6.2 String slicing

6.3 String slicing and character replacement

6.4 Boolean methods

6.5 Splitting strings

6.6 Coding challenge

Problems

Solutions

6.7 A note on regular expressions

7 MultiIndex DataFrames

7.1 The MultiIndex object

7.2 MultiIndex DataFrames

7.3 Sorting a MultiIndex

7.4 Selecting with a MultiIndex

Extracting one or more columns

Extracting one or more rows with loc

Extracting one or more rows with iloc

7.5 Cross-sections

7.6 Manipulating the Index

Resetting the index

Setting the index

7.7 Coding challenge

Problems

Solutions

8 Reshaping and pivoting

8.1 Wide vs. narrow data

8.2 Creating a pivot table from a DataFrame

The pivot_table method

Additional options for pivot tables

8.3 Stacking and unstacking index levels

8.4 Melting a data set

8.5 Exploding a list of values

8.6 Coding challenge

Problems

Solutions

9 The GroupBy object

9.1 Creating a GroupBy object from scratch

9.2 Creating a GroupBy object from a data set

9.3 Attributes and methods of a GroupBy object

9.4 Aggregate operations

9.5 Applying a custom operation to all groups

9.6 Grouping by multiple columns

9.7 Coding challenge

Problems

Solutions

10 Merging, joining, and concatenating

10.1 Introducing the data sets

10.2 Concatenating data sets

10.3 Missing values in concatenated DataFrames

10.4 Left joins

10.5 Inner joins

10.6 Outer joins

10.7 Merging on index labels

10.8 Coding challenge

Problems

Solutions

11 Working with dates and times

11.1 Introducing the Timestamp object

How Python works with datetimes

How pandas works with datetimes

11.2 Storing multiple timestamps in a DatetimeIndex

11.3 Converting column or index values to datetimes

11.4 Using the DatetimeProperties object

11.5 Adding and subtracting durations of time

11.6 Date offsets

11.7 The Timedelta object

11.8 Coding challenge

Problems

Solutions

12 Imports and exports

12.1 Reading from and writing to JSON files

Loading a JSON file into a DataFrame

Exporting a DataFrame to a JSON file

12.2 Reading from and writing to CSV files

12.3 Reading from and writing to Excel workbooks

Installing the xlrd and openpyxl libraries in an Anaconda environment

Importing Excel workbooks

Exporting Excel workbooks

12.4 Coding challenge

Problems

Solutions

13 Configuring pandas

13.1 Getting and setting pandas options

13.2 Precision

13.3 Maximum column width

13.4 Chop threshold

13.5 Option context

14 Visualization

14.1 Installing matplotlib

14.2 Line charts

14.3 Bar graphs

14.4 Pie charts

Appendix A. Installation and setup

Appendix B. Python crash course

Appendix C. NumPy crash course

Appendix D. Generating fake data with Faker

Appendix E. Regular expressions

index

front matter

preface

Truth be told, I discovered pandas entirely by luck.

In 2015, I interviewed for a data operations analyst position at Indeed.com, the world’s largest jobs site. For my final technical challenge, I was asked to derive insights from an internal data set, using the Microsoft Excel spreadsheet software. Eager to impress, I pulled out as many tricks as I could from my data analysis toolbox: column sorts, text manipulations, pivot tables, and of course the iconic VLOOKUP function. (OK, maybe iconic is a bit of an exaggeration.)

Strange as it may sound, at the time I didn’t realize that there were any tools for data analysis besides Excel. Excel was ubiquitous: my parents used it, my teachers used it, and my colleagues used it. It felt like an established standard. So when I received a job offer, I immediately bought about $100 worth of Excel books and started studying. It was time to become a spreadsheet specialist!

I showed up for my first day of work with a printout of the 50 most-used Excel functions. I had barely finished logging into my work computer when my manager pulled me into a conference room and informed me that priorities had shifted. The team’s data sets had ballooned to a size that Excel could no longer support. My teammates were also looking for ways to automate the redundant steps in their daily and weekly reports. Luckily, my manager had figured out a solution to both problems. He asked me whether I’d heard of pandas.

“The furry animal?” I asked, perplexed.

“No,” he said. “The Python data analysis library.”

After all my prep, it was time to learn a new technology from scratch. I was a little nervous; I’d never written a line of code before. I was an Excel guy, wasn’t I? Was I capable of doing this? There was only one way to find out. I started diving into the official pandas documentation, into YouTube videos, books, workshops, Stack Overflow questions, and whatever data sets I could get my hands on. I was relieved to discover how easy and joyful it was to get started with pandas. The code felt intuitive and straightforward. The library was fast. The features were well-developed and expansive. With pandas, I could accomplish a lot of data manipulation with a little code.

Stories like mine are common in the Python community. The language’s astronomical growth over the past decade is often attributed to the ease with which new developers can pick it up. I am confident that if you’re in a position similar to mine, you can learn pandas just as well. If you’re looking to expand your data analysis skills beyond Excel spreadsheets, this book is your invitation.

When I felt comfortable with pandas, I continued to explore Python and then other programming languages. In many ways, pandas spearheaded my transition into full-time software engineering. I owe a lot to this powerful library, and I’m excited to pass on the torch of knowledge to you. I hope that you discover the magic of what code can do for you.

acknowledgments

It took a lot to get Pandas in Action to the finish line, and I want to express my utmost gratitude to the people who supported me in its two-year writing process.

First and foremost, a warm thank you to my wonderful girlfriend, Meredith. From the first sentence, she was steadfast in her support. She’s a vivacious, funny, and kind soul who always picked me up when the going got tough. This book is better because of her. Thank you, Merbear.

Thank you to my parents, Irina and Dmitriy, for providing a welcoming home where I can always find respite.

Thank you to my twin sisters, Mary and Alexandra. They’re remarkably clever, inquisitive, and hard-working for their age, and I couldn’t be prouder of them. Good luck at college!

Thanks to Watson, our golden retriever. He’s not much of a Python expert, but he makes up for it with his entertaining and friendly demeanor.

A big thank you to my editor, Sarah Miller, who was an absolute joy to work with. I am grateful for her patience and insights throughout the process. She was the true captain of the ship, and she kept everything sailing smoothly.

I would not be a software engineer without the opportunities I was given at Indeed. I want to offer my former manager, Srdjan Bodruzic, a hearty thank you for his generosity and mentorship (and for hiring me!). Thanks to my CX teammates—Tommy Winschel, Danny Moncada, JP Schultz, and Travis Wright—for their wisdom and humor. Thanks to other Indeedians who offered a helping hand during my tenure: Matthew Morin, Chris Hatton, Chip Borsi, Nicole Saglimbene, Danielle Scoli, Blairr Swayne, and George Improglou. Thanks to anybody I’ve shared a dinner with at Sophie’s Cuban Cuisine!

I started writing this book as a software engineer at Stride Consulting. I want to thank many Striders for their support throughout the process: David “The Dominator” DiPanfilo, Min Kwak, Ben Blair, Kirsten Nordine, Michael “Bobby” Nunez, Jay Lee, James Yoo, Ray Veliz, Nathan Riemer, Julia Berchem, Dan Plain, Nick Char, Grant Ziolkowski, Melissa Wahnish, Dave Anderson, Chris Aporta, Michael Carlson, John Galioto, Sean Marzug-McCarthy, Travis Vander Hoop, Steve Solomon, and Jan Mlčoch.

Thank you to the friendly faces I’ve had the opportunity to work with as a software engineer and consultant: Francis Hwang, Inhak Kim, Liana Lim, Matt Bambach, Brenton Morris, Ian McNally, Josh Philips, Artem Kochnev, Andrew Kang, Andrew Fader, Karl Smith, Bradley Whitwell, Brad Popiolek, Eddie Wharton, Jen Kwok, and my favorite coffee crew: Adam McAmis and Andy Fritz.

Thank you to the following people for all they add to my life: Nick Bianco, Cam Stier, Keith David, Michael Cheung, Thomas Philippeau, Nicole DiAndrea, and James Rokeach.

Thanks to my favorite band, New Found Glory, for providing the soundtrack to many writing sessions. Pop punk’s not dead!

Thank you to the Manning staff who shepherded the project to completion and helped with marketing efforts: Jennifer Houle, Aleksandar Dragosavljević, Radmila Ercegovac, Candace Gillhoolley, Stjepan Jureković, and Lucas Weber. Thanks also to the Manning staff who oversaw the content: Sarah Miller, my developmental editor; Deirdre Hiam, my production editor; Keir Simpson, my copyeditor; and Jason Everett, my proofreader.

Thanks to the technical reviewers who helped me iron out the kinks: Al Pezewski, Alberto Ciarlanti, Ben McNamara, Björn Neuhaus, Christopher Kottmyer, Dan Sheikh, Dragos Manailoiu, Erico Lendzian, Jeff Smith, Jérôme Bâton, Joaquin Beltran, Jonathan Sharley, Jose Apablaza, Ken W. Alger, Martin Czygan, Mathijs Affourtit, Matthias Busch, Mike Cuddy, Monica E. Guimaraes, Ninoslav Cerkez, Rick Prins, Syed Hasany, Viton Vitanis, and Vybhavreddy Kammireddy Changalreddy. I am a better writer and educator thanks to your efforts.

Finally, to the city of Hoboken, my home for the past six years. I wrote many parts of this manuscript in its public library, local cafes, and bubble tea shops. I made many forward strides in my life in this town, and it is forever etched into my history. Thank you, Hoboken!

about this book

Who should read this book

Pandas in Action is a comprehensive introduction to the pandas library for data analysis. Pandas enables you to perform a multitude of data manipulations with ease: sorting, joining, pivoting, cleaning, deduping, aggregating, and more. The book approaches the subject matter incrementally. It introduces pandas one piece at a time, starting with its smaller building blocks and proceeding to its larger data structures.

Pandas in Action is written for data analysts who have intermediate experience with spreadsheet software (such as Microsoft Excel, Google Sheets, and Apple Numbers) and/or alternative data analysis tools (such as R and SAS). It is also a fitting title for Python developers who are curious to learn more about data analysis.

How this book is organized: A road map

Pandas in Action consists of 14 chapters spread across two parts.

Part 1, Core pandas, introduces the base mechanics of the pandas library in an incremental manner:

Chapter 1 analyzes a sample dataset with pandas to present a big-picture overview of what the library is capable of.

Chapter 2 introduces the Series object, a core pandas data structure that stores a collection of ordered data.

Chapter 3 dives into the Series object in greater depth. We explore various Series operations, including sorting values, dropping duplicates, extracting minimums and maximums, and more.

Chapter 4 introduces the DataFrame, a two-dimensional table of data. We apply concepts from the previous chapters to the new data structure and introduce additional manipulations.

Chapter 5 shows you how to filter subsets of rows from a DataFrame by using various logical conditions: equality, inequality, comparison, inclusion, exclusion, and more.

Part 2, Applied pandas, focuses on more-advanced pandas features and the problems they solve in real-world datasets:

Chapter 6 teaches you how to work with imperfect text data in pandas. We discuss how to solve issues such as removing whitespace, fixing character casing, and extracting multiple values from a single column.

Chapter 7 discusses the MultiIndex, which allows us to combine multiple column values into a single identifier for a row of data.

Chapter 8 describes how to aggregate our data in a pivot table, shift headers from the row axis to the column axis, and convert our data from wide format to narrow format.

Chapter 9 explores how to group rows into buckets and aggregate the resulting collections via the GroupBy object.

Chapter 10 walks you through combining multiple data sets into a single one by using various joins.

Chapter 11 demonstrates how to work with dates and times in pandas. It covers topics such as sorting dates, calculating durations, and determining whether a date falls at the start of a month or quarter.

Chapter 12 shows you how to import additional file types into pandas, including Excel and JSON. We also learn how to export data from pandas.

Chapter 13 focuses on configuring the library’s settings. We dive into how to modify the number of displayed rows, alter the precision of floating-point numbers, round values below a threshold, and more.

Chapter 14 explores data visualization using the matplotlib library. We see how to use pandas data to create line charts, bar graphs, pie charts, and more.

Each chapter builds upon the preceding one. For those who are learning pandas from scratch, I recommend proceeding through the chapters in linear order. At the same time, to ensure that the book is helpful as a reference guide, I’ve written each chapter as an independent tutorial with its own data sets. We start writing our code from scratch at the beginning of each chapter, so you can start with any chapter you like.

Most chapters conclude with a coding challenge that allows you to practice the chapter’s concepts. I strongly recommend taking a shot at these exercises.

Pandas is built on the Python programming language, and basic knowledge of the language’s mechanics is recommended before you get started. For those who have limited experience in Python, appendix B offers a hearty introduction to the language.

About the code

This book contains many examples of source code, which is formatted in a fixed-width font like this to separate it from ordinary text.

The source code for the book’s examples is available at the following GitHub repository: https://github.com/paskhaver/pandas-in-action. For those who are new to Git and GitHub, look for a Download Zip button on the repository page. Those who are experienced with Git and GitHub are welcome to clone the repo from the command line.

The repository also includes the complete data sets for the text. When I was learning pandas, one of my biggest frustrations was that tutorials loved to rely on randomly generated data. There was no consistency, no context, no story, no fun. In this book, we’ll work with many real-world data sets that cover everything from basketball players’ salaries to Pokémon types to restaurant health inspections. Data is everywhere around us, and pandas is one of the best tools available today to make sense of it. I hope that you enjoy the casual focus of the data sets.

liveBook discussion forum

Purchase of Pandas in Action includes free access to a private web forum run by Manning Publications where you can make comments about the book, ask technical questions, and receive help from the author and from other users. To access the forum, go to https://livebook.manning.com/#!/book/pandas-in-action/discussion. You can also learn more about Manning’s forums and the rules of conduct at https://livebook.manning.com/#!/discussion.

Manning’s commitment to our readers is to provide a venue where meaningful dialogue between individual readers and between readers and the author can take place. It is not a commitment to any specific amount of participation on the part of the author, whose contribution to the forum remains voluntary (and unpaid). We suggest that you try asking the author some challenging questions lest their interest stray! The forum and the archives of previous discussions will be accessible from the publisher’s website as long as the book is in print.

Other online resources

The official pandas documentation is available at https://pandas.pydata.org/docs.

In my spare time, I create technical video courses on Udemy. You can find the courses at https://www.udemy.com/user/borispaskhaver; they include a 20-hour pandas course and a 60-hour Python course.

Feel free to reach out to me via Twitter (https://twitter.com/borispaskhaver) or LinkedIn (https://www.linkedin.com/in/boris-paskhaver).

about the author

Boris Paskhaver is a full-stack software engineer, consultant, and online educator based in New York City. He has six courses on the e-learning platform Udemy with over 140 hours of videos, 300,000 students, 20,000 reviews, and 1 million minutes of content consumed monthly. Before becoming a software engineer, Boris worked as a data analyst and systems administrator. He graduated from New York University in 2013 with a double major in business economics and marketing.

about the cover illustration

The figure on the cover of Pandas in Action is captioned Dame de Calais, or Lady from Calais. The illustration is taken from a collection of dress costumes from various countries by Jacques Grasset de Saint-Sauveur (1757–1810), titled Costumes de Différents Pays, published in France in 1797. Each illustration is finely drawn and colored by hand. The rich variety of Grasset de Saint-Sauveur’s collection reminds us vividly of how culturally apart the world’s towns and regions were only 200 years ago. Isolated from one another, people spoke different dialects and languages. In the streets or in the countryside, it was easy to identify by their dress alone where they lived and what their trade or station in life was.

The way we dress has changed since then, and diversity by region, so abundant at the time, has faded away. Now it is hard to tell apart the inhabitants of different continents, let alone different towns, regions, or countries. Perhaps we have traded cultural diversity for a more varied personal life—certainly for a more varied and fast-paced technological life.

At a time when it is hard to tell one computer book from another, Manning celebrates the inventiveness and initiative of the computer business with book covers based on the deep diversity of regional life of two centuries ago, brought back to life by Grasset de Saint-Sauveur’s pictures.

Part 1. Core pandas

Welcome! In this section, we’ll familiarize ourselves with the core mechanics of pandas and its two primary data structures: the one-dimensional Series and the two-dimensional DataFrame. Chapter 1 begins with an analysis of a data set with pandas so you can immediately get a sense of what is possible with the library. From there, we proceed to an in-depth exploration of the Series in chapters 2 and 3. We learn how to create a Series from scratch; import it from an external data set; and apply a slew of mathematical, statistical, and logical operations to it. In chapter 4, we introduce the tabular DataFrame and various ways to extract rows, columns, and values from its data. Finally, chapter 5 focuses on extracting subsets of DataFrame rows by applying logical criteria. Along the way, we’ll work through eight datasets that cover everything from box-office grosses to NBA players to Pokémon.

This part covers the essentials of pandas, the fundamentals you need to know to work effectively with the library. I’ve made every effort to start from square one, from the smallest building blocks possible, and proceed to the larger and more complex elements. The following five chapters build the foundation for your mastery of pandas. Good luck!

1 Introducing pandas

This chapter covers

The growth of data science in the 21st century

The history of the pandas library for data analysis

The pros and cons of pandas and its competitors

Data analysis in Excel versus data analysis with a programming language

A tour of the library’s features through a working example

Welcome to Pandas in Action! Pandas is a library for data analysis built on top of the Python programming language. A library (also called a package) is a collection of code for solving problems in a specific field of endeavor. Pandas is a toolbox for data manipulation operations: sorting, filtering, cleaning, deduping, aggregating, pivoting, and more. The epicenter of Python’s vast data science ecosystem, pandas pairs well with other libraries for statistics, natural language processing, machine learning, data visualization, and more.

In this introductory chapter, we’ll explore the history and evolution of modern data analytics tools. We’ll see how pandas grew from one financial analyst’s pet project to an industry standard used by companies such as Stripe, Google, and J.P. Morgan. We’ll compare the library with its competitors, including Excel and R. We’ll discuss the differences between working with a programming language and working with a graphical spreadsheet application. Finally, we’ll use pandas to analyze a real-world data set. Consider this chapter to be a sneak preview of the concepts you’ll master throughout the book. Let’s dive in!

1.1 Data in the 21st century

“It is a capital mistake to theorize before one has data,” Sherlock Holmes advises his assistant John Watson in “A Scandal in Bohemia,” the first of Sir Arthur Conan Doyle’s classic short stories pairing the duo. “Insensibly one begins to twist facts to suit theories, instead of theories to suit facts.”

The wise detective’s words continue to ring true more than a century after the publication of Doyle’s work, in a world in which data is becoming increasingly prevalent in every facet of our lives. “The world’s most valuable resource is no longer oil, but data,” declared The Economist in a 2017 opinion piece. Data is evidence, and evidence is critical to businesses, governments, institutions, and individuals solving increasingly complex problems in our interconnected world. Across a breadth of industries, the world’s most successful companies, from Facebook to Amazon to Netflix, cite data as the most prized asset in their portfolios. United Nations Secretary-General António Guterres called accurate data “the lifeblood of good policy and decision-making.” Data powers everything from movie recommendations to medical treatments, from supply chain logistics to poverty-reduction initiatives. The success of communities, companies, and even countries in the 21st century will depend on their ability to acquire, aggregate, and analyze data.

1.2 Introducing pandas

The technological ecosystem of tools for working with data has grown tremendously over the past decade. Today, the open source pandas library is one of the most popular solutions available for data analysis and manipulation. Open source means that the library’s source code is publicly available to download, use, modify, and distribute. Its license grants users more permissions than proprietary software such as Excel. Pandas is free to use. A global team of volunteer software developers maintains the library, and you can find its complete source code on GitHub (https://github.com/pandas-dev/pandas).

Pandas is comparable to Microsoft’s Excel spreadsheet software and Google’s in-browser Sheets application. In all three technologies, a user interacts with tables consisting of rows and columns of data. A row represents a record or, equivalently, one collection of values for the columns. Transformations are applied to coax the data into the desired state.

Figure 1.1 displays a sample transformation of a data set. The analyst applies an operation to the four-row data set on the left to arrive at the two-row data set on the right. They may select rows that fit a criterion, for example, or remove duplicate rows from the original data set.

Figure 1.1 A sample transformation of a tabular data set
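To make the idea of a transformation concrete, here is a minimal sketch of the two operations just mentioned: selecting rows that fit a criterion and removing duplicate rows. The data set, its column names, and its values are made up for illustration; they are not taken from the figure.

```python
import pandas as pd

# Hypothetical four-row data set; the names and values are invented.
employees = pd.DataFrame({
    "name": ["Ana", "Ben", "Ana", "Cy"],
    "department": ["Sales", "IT", "Sales", "IT"],
})

# Transformation 1: keep only the rows that satisfy a criterion.
it_staff = employees[employees["department"] == "IT"]

# Transformation 2: remove rows that repeat an earlier row.
unique_rows = employees.drop_duplicates()

print(it_staff)
print(unique_rows)
```

Both operations return a new table and leave the original employees data untouched.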

What makes pandas unique is the balance it strikes between processing power and user productivity. By relying on lower-level languages such as C for many of its calculations, the library can efficiently transform million-row data sets in milliseconds. At the same time, it maintains a simple and intuitive set of commands. It is easy to accomplish a lot with a little code in pandas.

Figure 1.2 shows some sample pandas code that imports and sorts a CSV data set. Don’t worry about the code yet, but take a second to notice that the entire operation takes only two lines of code.

Figure 1.2 A sample of code that imports and sorts a data set in pandas
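For readers who cannot see the figure, the following sketch shows what an import-and-sort operation can look like. It is not a reproduction of figure 1.2: the column names and values are assumptions, and an in-memory string stands in for a CSV file so that the example runs on its own.

```python
import io
import pandas as pd

# Stand-in for a CSV file on disk. In practice you would pass a file path
# (for example a hypothetical "movies.csv") to read_csv instead.
csv_text = io.StringIO(
    "title,gross\n"
    "Movie A,300\n"
    "Movie B,900\n"
    "Movie C,600\n"
)

# The two lines that do the work: import the data, then sort it.
movies = pd.read_csv(csv_text)
sorted_movies = movies.sort_values(by="gross", ascending=False)

print(sorted_movies)
```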

Pandas works seamlessly with numbers, text, dates, times, missing data, and more. We’ll explore its incredible versatility as we proceed through the more than 30 data sets included with this book.

The first version of pandas was developed in 2008 by software developer Wes McKinney, who was working at New York’s AQR Capital Management investment firm. Dissatisfied with both Excel and the statistical programming language R, McKinney searched for a tool that would make it easy to solve common data problems in the financial industry, particularly cleanup and aggregation. Unable to find an ideal product, he decided to build one himself. At the time, Python was far from the powerhouse it is today, but the beauty of the language inspired McKinney to build his library on top of its foundation. “I loved [Python] for its economy of expressions,” he stated in Quartz (http://mng.bz/w0Na). “You can express complicated ideas in Python with very little code, and it is very easy to read.”

Pandas has seen continual, extensive growth since its release to the public in December 2009. User counts are estimated to be between five and ten million¹. As of June 2021, pandas has been downloaded more than 750 million times from PyPI, the centralized online repository of Python packages (https://pepy.tech/project/pandas). Its GitHub code repository has more than 30,000 stars (a star is equivalent to a like on the platform). Pandas questions make up a growing percentage of questions on the question-answer aggregator Stack Overflow, suggesting increased user interest.

I would argue that we can even credit pandas for the astronomical growth of Python itself. The language has exploded in popularity because of its prevalence in data science, a field to which pandas contributes greatly. Python is now the most common first language taught at colleges and universities. The TIOBE index, a ranking of programming language popularity by search engine traffic, declared Python to be the fastest-growing language of 2018². “If Python can keep this pace, it will probably replace C and Java in 3 to 4 years’ time, thus becoming the most popular programming language of the world,” wrote TIOBE in a press release. As you learn pandas, you’ll also be learning Python, which is another perk of the library.

1.2.1 Pandas vs. graphical spreadsheet applications

Pandas requires a different mindset from a graphical spreadsheet app such as Excel. Programming is inherently more verbal than it is visual. We communicate with the computer through commands, not clicks. Because it makes fewer assumptions about what you’re trying to accomplish, a programming language tends to be more unforgiving. It needs to be told what to do with no uncertainty. We need to issue the correct instructions with the correct inputs in the correct order; otherwise, the program will not work.

Due to these stricter requirements, pandas has a steeper learning curve than Excel or Sheets. But if you have limited experience in Python or programming in general, there’s no need to worry! When you’re fiddling with functions such as SUMIF and VLOOKUP in Excel, you’re already thinking like a programmer. The process is the same: identify the correct function to use and then supply the right inputs in the proper order. Pandas requires an identical set of skills; the difference is that we’re communicating with the computer in a more verbose language.

When you become familiar with its complexities, pandas grants you greater power and flexibility in your data manipulations. In addition to extending the range of your available procedures, programming allows you to automate them. You can write a piece of code once and reuse it across multiple files—perfect for those pesky daily and weekly reports. It’s important to note that Excel comes bundled with Visual Basic for Applications (VBA), a programming language that also enables you to automate spreadsheet procedures. I would argue, however, that Python is easier to pick up than VBA and has uses beyond data analysis, making it a better investment of your time.
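As a rough illustration of writing a piece of code once and reusing it across files, the sketch below defines a single function and applies it to two reports. The function name, the column names, and the in-memory stand-ins for weekly files are all hypothetical.

```python
import io
import pandas as pd

def total_by_region(csv_source):
    """Load one report and return the total sales per region."""
    report = pd.read_csv(csv_source)
    return report.groupby("region")["sales"].sum()

# Stand-ins for separate weekly files (invented data).
weekly_files = {
    "week_1": io.StringIO("region,sales\nEast,10\nWest,20\nEast,5\n"),
    "week_2": io.StringIO("region,sales\nEast,7\nWest,3\n"),
}

# Write the logic once, then reuse it for every file.
summaries = {
    name: total_by_region(source) for name, source in weekly_files.items()
}

for name, summary in summaries.items():
    print(name)
    print(summary)
```

Adding a third week means adding one more entry to weekly_files; the reporting logic itself does not change.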

There are additional benefits to making the jump from Excel to Python. Jupyter Notebook, the coding environment often paired with pandas, allows for more dynamic, interactive, and comprehensive reports. A Jupyter Notebook consists of cells, each of which contains a chunk of executable code. An analyst can integrate these cells with headers, charts, descriptions, annotations, images, videos, diagrams, and more. Readers can follow the analyst’s step-by-step logic to see how they reached their conclusion, not only their final result.

Another advantage of pandas is Python’s large data science ecosystem. Pandas integrates easily with libraries for statistics, natural language processing, machine learning, web scraping, data visualization, and more. New libraries appear yearly. Experimentation is welcomed. Innovation is constant. These robust tools sometimes remain underdeveloped in corporate competitors, which lack the support of a large, global community of contributors.

Graphical spreadsheet applications also begin to struggle as data sets grow; pandas is significantly more powerful than Excel in this respect. The capacity of the library is limited only by the computer’s memory and processing power. On most modern machines, pandas plays well with multigigabyte data sets with millions of rows, especially when a developer knows how to exploit all its performance optimizations. In a blog post describing the limitations of the library, creator Wes McKinney wrote, “Nowadays, my rule of thumb for pandas is that you should have 5 to 10 times more RAM as the size of your data set” (http://mng.bz/qeK6).
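One way to put that rule of thumb into practice is to ask pandas how much memory a data set occupies and multiply the result. The sketch below does this for a made-up one-million-row data set; treat the numbers it prints as a rough estimate rather than a guarantee.

```python
import pandas as pd

# Hypothetical data set with one million rows of integers.
numbers = pd.DataFrame({"value": range(1_000_000)})

# memory_usage reports bytes per column (plus the index);
# deep=True also measures the contents of text columns.
bytes_used = numbers.memory_usage(deep=True).sum()
megabytes_used = bytes_used / 1_000_000

# Apply the 5x to 10x rule of thumb quoted above.
print(f"Data set size: {megabytes_used:.1f} MB")
print(f"Suggested RAM: {5 * megabytes_used:.0f} to {10 * megabytes_used:.0f} MB")
```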

Part of the challenge in choosing the best tool for the job is defining what terms such as data analysis and big data mean to your organization and your project. Excel, which is used by approximately 750 million working professionals globally, limits its spreadsheets to 1,048,576 rows of data³. For some analysts, 1 million rows of data are more than any report requires; for others, 1 million rows only scratch the surface.

I would advise you to look at pandas as being not the best data analysis solution but a powerful option to use alongside other modern technologies. Excel is still an excellent choice for quick, easy data manipulations. A spreadsheet application usually makes assumptions about your intent, which is why it takes only a few clicks to import a CSV file or sort a column of 100 values. There’s no real advantage to using pandas for simple tasks like these (although it’s more than capable of doing them). But what do you use when you need to clean text values in two data sets of ten million rows each, remove their duplicate records, join them, and replicate that logic for 100 batches of files? For those scenarios, it’s easier and less time-consuming to do the work with Python and pandas.
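The scenario above (clean text, remove duplicates, join) can be sketched on a tiny scale. The two data sets, their column names, and the clean helper are hypothetical; the same function would run unchanged on far larger tables or inside a loop over many batches of files.

```python
import pandas as pd

# Two small invented data sets with messy text in the shared column.
customers = pd.DataFrame({
    "email": [" ANA@example.com", "ben@example.com ", "ana@example.com"],
    "name": ["Ana", "Ben", "Ana"],
})
orders = pd.DataFrame({
    "email": ["ana@example.com", "BEN@example.com"],
    "total": [25, 40],
})

def clean(frame):
    """Normalize the email column, then drop duplicate rows."""
    cleaned = frame.copy()
    cleaned["email"] = cleaned["email"].str.strip().str.lower()
    return cleaned.drop_duplicates()

# Clean both data sets with the same logic, then join them on the shared column.
combined = clean(customers).merge(clean(orders), on="email", how="inner")
print(combined)
```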

1.2.2 Pandas vs. its competitors

Data science enthusiasts frequently compare pandas with the open source programming language R and the proprietary software suite SAS. Each solution has its own community of advocates.

R is a specialized language with a foundation in statistics, whereas Python is a generalist language used in multiple technical domains. Predictably, the two languages tend to attract users with expertise in specific fields. Hadley Wickham, a prominent developer in the R community who built a collection of data science packages called tidyverse, advises users to see the two languages as collaborators rather than rivals. “These things exist independently and are both awesome in different ways,” he said in Quartz (http://mng.bz/Jv9V). “A pattern that I see is that the data science team in a company uses R and the data engineering team uses Python. The Python people tend to have a background in software engineering and are very confident about their programming skills. . . . [The R users] really like R, but can’t argue with the engineering team because they don’t have the language to make that argument.” One language may have an advanced feature that the other does not, but the two have achieved near parity when it comes to common tasks in data analysis. Developers and data scientists simply gravitate to what they know best.

A suite of complementary software tools that supports statistics, data mining, econometrics, and more, SAS is a commercial product developed by the North Carolina-based SAS Institute. It charges an annual user subscription fee that varies based on the bundle of selected software. The advantages conferred by a corporate-backed product include technical and visual consistency across tools, robust documentation, and a product road map geared towards enterprise clients’ needs. Open source technology like pandas enjoys a more free-for-all approach; developers work for their needs and for other developers’ needs, which sometimes miss market trends.

Certain technologies share features with pandas but serve intrinsically different purposes. SQL is one example. SQL (Structured Query Language) is a language for communicating with relational databases. A relational database consists of tables of data linked by common keys. We can use SQL for basic data manipulations such as extracting columns from tables and filtering rows by a criterion, but its functionalities are greater in scope and fundamentally revolve around data management. Databases are built to store data; data analysis is a secondary use case. SQL can create new tables, update existing records with new values, delete existing records, and so on. By comparison, pandas is built entirely for data analysis: statistical calculations, data wrangling, data merges, and more. In a typical work environment, the two tools often serve as complements. An analyst might use SQL to extract an initial cluster of data and then use pandas to manipulate it.
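The division of labor described above can be sketched in a few lines. This is only an illustration under stated assumptions: the in-memory SQLite database, the sales table, and its values are all hypothetical and invented for this example; they are not part of the book's data sets.

```python
import sqlite3

import pandas as pd

# Hypothetical table and values, invented for illustration only.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE sales (region TEXT, amount REAL)")
connection.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("East", 100.0), ("West", 250.0), ("East", 50.0), ("North", 75.0)],
)
connection.commit()

# Step 1: SQL extracts an initial subset of rows from the database.
query = "SELECT region, amount FROM sales WHERE amount >= 75"
extracted = pd.read_sql_query(query, connection)
connection.close()

# Step 2: pandas takes over for the analysis (here, a total per region).
totals = extracted.groupby("region")["amount"].sum()
print(totals)
```

The SQL query decides which rows leave the database; everything after that happens in memory with pandas.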

In summary, pandas is not the only tool in town, but it is a powerful, popular, and valuable solution for solving most data analysis problems. Again, Python truly shines in its focus on brevity and productivity. As its creator, Guido van Rossum, remarked, “The joy of coding Python should be in seeing short, concise, readable [data structures] that express a lot of action in a small amount of clear code” (http://mng.bz/7jo7). Pandas lives up to that standard and is an excellent next step for spreadsheet analysts who are eager to grow their programming skills with a powerful, modern data analysis toolkit.

1.3 A tour of pandas

The best way to grasp the power of pandas is to see it in action. Let’s take a quick tour of the library by analyzing a data set of the 782 highest-grossing movies of all time. I hope you are pleasantly surprised by how intuitive the syntax of pandas can be, even if you are new to programming.

As you proceed through the rest of the chapter, try not to overanalyze the code samples; you don’t even need to copy them. Our goal right now is to get a bird’s-eye view of the features and functionalities of pandas. Think about what the library can do; we’ll worry about how in greater detail later.

We’ll be using the Jupyter Notebook development environment to write our code throughout the book. If you need help setting up pandas and Jupyter Notebook on your computer, see appendix A. You can download all data sets and completed Jupyter Notebooks at https://www.github.com/paskhaver/pandas-in-action.

1.3.1 Importing a data set

Let’s get started! First, we’ll create a new Jupyter Notebook inside the same directory as the movies.csv file; then we’ll import the pandas library to gain access to its features:

In [1] import pandas as pd

The box to the left of the code (displaying the number 1 in the previous example) marks the cell’s execution order relative to the launch or restart of the Jupyter Notebook. You can execute the cells in any order, and you can execute the same cell multiple times.

As you read through the book, you are encouraged to experiment by executing different snippets of code in your Jupyter cells. Thus, it is OK if your execution numbers do not match those in the text.

Our data is stored in a single movies.csv file. A CSV (comma-separated values) file is a plain-text file that separates each row of data with a line break and each row value with a comma. The first row in the file holds the column headers for the data. Here’s a preview of the first three rows of movies.csv:

Rank,Title,Studio,Gross,Year
1,Avengers: Endgame,Buena Vista,"$2,796.30",2019
2,Avatar,Fox,"$2,789.70",2009

The first row lists the five columns in the data set: Rank, Title, Studio, Gross, and Year. The second row holds the first record or, equivalently, the data for the first movie. The film has a Rank of 1, a Title of Avengers: Endgame, a Studio of Buena Vista, a Gross of $2,796.30, and a Year of 2019. The next line holds the values for the next movie, and the pattern repeats for the remaining 750-plus rows in the data set.
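To make the row-and-comma structure concrete, here is a minimal sketch that parses those three lines with Python's built-in csv module rather than pandas. It assumes the Gross values are wrapped in double quotes in the file, which is the usual way for a CSV to hold a value that itself contains a comma; the variable names are ours.

```python
import csv
import io

# The first three lines of the file as described in the text.
# Assumption: Gross is quoted because the value contains a comma.
raw_text = (
    'Rank,Title,Studio,Gross,Year\n'
    '1,Avengers: Endgame,Buena Vista,"$2,796.30",2019\n'
    '2,Avatar,Fox,"$2,789.70",2009\n'
)

# Each line break starts a new row; each unquoted comma starts a new value.
rows = list(csv.reader(io.StringIO(raw_text)))
header, records = rows[0], rows[1:]

print(header)      # ['Rank', 'Title', 'Studio', 'Gross', 'Year']
print(records[0])  # ['1', 'Avengers: Endgame', 'Buena Vista', '$2,796.30', '2019']
```

Every value arrives as plain text, including the rank and the year. Deciding what type each column should hold is part of the work an import function takes on.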

Pandas can import various file types, each of which has an associated import function at the top level of the library. A function in pandas is equivalent to a function in Excel. It’s a command that we issue, either to the library or an entity within it. In this scenario, we’ll use the read_csv function to import the movies.csv file:

In [2] pd.read_csv("movies.csv")

Out [2]

     Rank                         Title           Studio      Gross  Year
  0     1             Avengers: Endgame      Buena Vista  $2,796.30  2019
  1     2                        Avatar              Fox  $2,789.70  2009
  2     3                       Titanic        Paramount  $2,187.50  1997
  3     4  Star Wars: The Force Awakens      Buena Vista  $2,068.20  2015
  4     5        Avengers: Infinity War      Buena Vista  $2,048.40  2018
...   ...                           ...              ...        ...   ...
777   778                     Yogi Bear  Warner Brothers    $201.60  2010
778   779           Garfield: The Movie              Fox    $200.80  2004
779   780                   Cats & Dogs  Warner Brothers    $200.70  2001
780   781      The Hunt for Red October        Paramount    $200.50  1990
781   782                      Valkyrie              MGM    $200.30  2008

782 rows × 5 columns

Pandas imports the CSV file’s contents into an object called a DataFrame. Think of an object as a container for storing data. Different objects are optimized for different types of data, and we interact with them in different ways. Pandas uses one type of object (the DataFrame) to store multicolumn data sets and another type of object (the Series) to store single-column data sets. A DataFrame is comparable to a multicolumn table in Excel.
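The difference between the two objects is easy to see on a tiny sample. The sketch below reads three of the rows from an in-memory string instead of the movies.csv file so that it runs anywhere; the csv_text, sample, and studio names are ours, and the quoting of the Gross values is an assumption about the file.

```python
import io

import pandas as pd

# Three rows from the data set, held in a string instead of a file.
csv_text = '''Rank,Title,Studio,Gross,Year
1,Avengers: Endgame,Buena Vista,"$2,796.30",2019
2,Avatar,Fox,"$2,789.70",2009
3,Titanic,Paramount,"$2,187.50",1997
'''

# read_csv accepts a file path or a file-like object such as StringIO.
sample = pd.read_csv(io.StringIO(csv_text))

# Pulling out one column by name yields a Series.
studio = sample["Studio"]

print(type(sample).__name__)  # DataFrame
print(type(studio).__name__)  # Series
```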

To avoid cluttering the screen, pandas displays only the first five and last five rows of the DataFrame. A row of ellipses (...) marks where the data gap occurs.

This DataFrame consists of five columns (Rank, Title, Studio, Gross, Year) and an index. The index is the range of ascending numbers on the left side of the DataFrame. Index labels serve as identifiers for rows of data. We can set any column as the index of the DataFrame. When we do not explicitly tell pandas which column to use, the library generates a numeric index starting from 0.

What column is a good candidate for the index? It’s one whose values can act as a primary identifier or point of reference for each row. Among our five columns, Rank and Title are the two best options. Let’s swap the autogenerated numeric index with the values from the Title column. We can do so directly during the CSV import:

In [3] pd.read_csv("movies.csv", index_col = "Title")

Out [3]

                              Rank           Studio      Gross  Year
Title
Avengers: Endgame                1      Buena Vista  $2,796.30  2019
Avatar                           2              Fox  $2,789.70  2009
Titanic                          3        Paramount  $2,187.50  1997
Star Wars: The Force Awakens     4      Buena Vista  $2,068.20  2015
Avengers: Infinity War           5      Buena Vista  $2,048.40  2018
...                            ...              ...        ...   ...
Yogi Bear                      778  Warner Brothers    $201.60  2010
Garfield: The Movie            779              Fox    $200.80  2004
Cats & Dogs                    780  Warner Brothers    $200.70  2001
The Hunt for Red October       781        Paramount    $200.50  1990
Valkyrie                       782              MGM    $200.30  2008

782 rows × 4 columns
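As a small sketch of what the index swap changes, the example below imports the same three-row sample twice: once with the autogenerated numeric index and once with Title as the index. It also uses the set_index method, which the text has not introduced, to show that the swap can happen after the import as well. The in-memory sample and variable names are ours.

```python
import io

import pandas as pd

csv_text = '''Rank,Title,Studio,Gross,Year
1,Avengers: Endgame,Buena Vista,"$2,796.30",2019
2,Avatar,Fox,"$2,789.70",2009
3,Titanic,Paramount,"$2,187.50",1997
'''

# No index column specified: pandas numbers the rows starting from 0.
default_index = pd.read_csv(io.StringIO(csv_text))

# Title becomes the index during the import and stops being a column.
by_title = pd.read_csv(io.StringIO(csv_text), index_col="Title")

# The same swap, performed after the import.
after_import = default_index.set_index("Title")

print(list(default_index.index))  # [0, 1, 2]
print(list(by_title.index))       # ['Avengers: Endgame', 'Avatar', 'Titanic']
print(list(by_title.columns))     # ['Rank', 'Studio', 'Gross', 'Year']
print(by_title.equals(after_import))
```

Moving Title into the index is why the column count drops from five to four.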

Next, we’ll assign the DataFrame to a movies variable so that we can reference it elsewhere in our program. A variable is a user-assigned name for an object in the program:

In [4] movies = pd.read_csv("movies.csv", index_col = "Title")

For more on variables, check out appendix B.

1.3.2 Manipulating a DataFrame

We can look at the DataFrame from a variety of angles. We can extract a few rows from the beginning:

In [5] movies.head(4)

Out [5]

                              Rank           Studio      Gross  Year
Title
Avengers: Endgame                1      Buena Vista  $2,796.30  2019
Avatar                           2              Fox  $2,789.70  2009
Titanic                          3        Paramount  $2,187.50  1997
Star Wars: The Force Awakens     4      Buena Vista  $2,068.20  2015

Or we can peek at the end of the data set instead:

In [6] movies.tail(6)

Out [6]

                              Rank           Studio      Gross  Year
Title
21 Jump Street                 777             Sony    $201.60  2012
Yogi Bear                      778  Warner Brothers    $201.60  2010
Garfield: The Movie            779              Fox    $200.80  2004
Cats & Dogs                    780  Warner Brothers    $200.70  2001
The Hunt for Red October       781        Paramount    $200.50  1990
Valkyrie                       782              MGM    $200.30  2008
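The two methods can be tried on a six-row sample without the full file. This sketch assumes nothing beyond pandas itself; the sample rows come from the outputs shown earlier, and the variable names are ours. When called without an argument, both head and tail return five rows.

```python
import io

import pandas as pd

# Six rows from the data set: the top five films and the last one.
csv_text = '''Rank,Title,Studio,Gross,Year
1,Avengers: Endgame,Buena Vista,"$2,796.30",2019
2,Avatar,Fox,"$2,789.70",2009
3,Titanic,Paramount,"$2,187.50",1997
4,Star Wars: The Force Awakens,Buena Vista,"$2,068.20",2015
5,Avengers: Infinity War,Buena Vista,"$2,048.40",2018
782,Valkyrie,MGM,$200.30,2008
'''

sample = pd.read_csv(io.StringIO(csv_text), index_col="Title")

first_two = sample.head(2)    # rows from the beginning
last_two = sample.tail(2)     # rows from the end
default_head = sample.head()  # no argument: five rows

print(list(first_two.index))  # ['Avengers: Endgame', 'Avatar']
print(list(last_two.index))   # ['Avengers: Infinity War', 'Valkyrie']
print(len(default_head))      # 5
```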

We can find out how many rows the DataFrame has:

In [7] len(movies)

Out [7] 782

We can ask pandas for the number of rows and columns in the DataFrame. This data set has 782 rows and 4 columns:

In [8] movies.shape

Out [8] (782, 4)

We can inquire about the total number of cells:

In [9] movies.size

Out [9] 3128
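These three measurements are related: the cell count is the number of rows multiplied by the number of columns, which for the full data set is 782 × 4 = 3,128. The sketch below checks the same relationship on a three-row sample; the in-memory data and variable names are ours.

```python
import io

import pandas as pd

# Three rows from the data set; the full file has 782.
csv_text = '''Rank,Title,Studio,Gross,Year
1,Avengers: Endgame,Buena Vista,"$2,796.30",2019
2,Avatar,Fox,"$2,789.70",2009
3,Titanic,Paramount,"$2,187.50",1997
'''

sample = pd.read_csv(io.StringIO(csv_text), index_col="Title")

row_count = len(sample)        # number of rows
n_rows, n_cols = sample.shape  # (rows, columns); the index is not a column
cell_count = sample.size       # rows multiplied by columns

print(row_count)         # 3
print((n_rows, n_cols))  # (3, 4)
print(cell_count)        # 12
```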

We can ask for the data types
