Python Data Cleaning Cookbook: Prepare your data for analysis with pandas, NumPy, Matplotlib, scikit-learn, and OpenAI
Michael Walker
Python Data Cleaning Cookbook
Second Edition
Prepare your data for analysis with pandas, NumPy, Matplotlib, scikit-learn, and OpenAI
Michael Walker
Python Data Cleaning Cookbook
Second Edition
Copyright © 2024 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Senior Publishing Product Manager: Gebin George
Acquisition Editor – Peer Reviews: Gaurav Gavas
Project Editor: Namrata Katare
Content Development Editor: Deepayan Bhattacharjee
Copy Editor: Safis Editing
Technical Editor: Karan Sonawane
Proofreader: Safis Editing
Indexer: Subalakshmi Govindhan
Presentation Designer: Ajay Patule
Developer Relations Marketing Executive: Vignesh Raju
First published: December 2020
Second edition: May 2024
Production reference: 1230524
Published by Packt Publishing Ltd.
Grosvenor House
11 St Paul’s Square
Birmingham
B3 1RB, UK.
ISBN: 978-1-80323-987-3
www.packt.com
Contributors
About the author
Michael Walker has worked as a data analyst for 37 years at various educational institutions. He has also taught data science, research methods, statistics, and computer programming to undergraduates since 2006. He is currently the Chief Information Officer at College Unbound in Providence, Rhode Island. Michael is also the author of Data Cleaning and Exploration with Machine Learning.
I deeply appreciate the editors of this text, particularly Namrata Katare and Deepayan Bhattacharjee, and the technical reviewer, Kalyana Bedhu. They made this text so much better than it would have been otherwise. Any remaining issues are completely the fault of the author. I would also like to thank my partner over the last several decades, Karen Walker. She has had to listen to me go on and on about statistics and data, and the teaching of both topics, for all those years. Along the way, she patiently and lovingly taught me most of what I know about how to teach and be in community with other humans, even when that means putting aside some modeling or programming task for a few hours to focus on what matters most. For the sake of the reader, I hope this book reflects some fraction of her thoughtfulness and patience.
About the reviewer
Kalyana Bedhu, a recipient of patents and an author of award-winning papers and data science courses, is an Engineering Leader and Architect of Enterprise AI/ML at Fannie Mae. He has over 20 years of experience in applied AI/ML across various companies like Microsoft, Ericsson, Sony, Bosch, Fidelity, and Oracle. As a practitioner of Data Science at Ericsson, he helped set up a data science lab. He played a pivotal role in transforming a central IT organization into an AI and data science engine.
Join our community on Discord
Join our community’s Discord space for discussions with the author and other readers:
https://discord.gg/p8uSgEAETX
Preface
This book is a practical guide to data cleaning, broadly defined as all tasks necessary to prepare data for analysis. It is organized by the tasks usually completed during the data-cleaning process: importing data, viewing data diagnostically, identifying outliers and unexpected values, imputing values, tidying data, and so on. Each recipe walks the reader from raw data through the completion of a specific data-cleaning task.
There are already a number of very good pandas books. Unsurprisingly, there is some overlap between those texts and this one. However, the emphasis here is different. I focus as much on the why as on the how in this book.
Since pandas is still relatively new, the lessons I have learned about cleaning data have been shaped by my experiences with other tools. Before settling into my current work routine with Python and R about 10 years ago, I relied mostly on C# and T-SQL in the early 2000s, SAS and Stata in the 90s, and FORTRAN and Pascal in the 80s. Most readers of this text probably have experience with a variety of data-cleaning and analysis tools. In many ways the specific tool is less significant than the data preparation task and the attributes of the data. I would have covered pretty much the same topics if I had been asked to write The SAS Data Cleaning Cookbook or The R Data Cleaning Cookbook. I just take a Python/pandas-specific approach to the same data-cleaning challenges that analysts have faced for decades.
I start each chapter with how to think about the particular data-cleaning task at hand before discussing how to approach it with a tool from the Python ecosystem—pandas, NumPy, Matplotlib, and so on. This is reinforced in each recipe by a discussion of the implications of what we are uncovering in the data. I try to connect tool to purpose. For example, concepts like skewness and kurtosis matter as much for handling outliers as does knowing how to update pandas Series values.
New in the Second Edition
Readers of the first edition will recognize that this book is substantially longer than that one. That is partly because there are two new chapters: one devoted to treating missing values and another on pre-processing data for predictive analysis. The first edition's thin treatment of missing values and its lack of coverage of pre-processing for machine learning applications were important omissions. The pre-processing coverage is further improved by new recipes on data pipelines in the final chapter that take the reader from raw data to model evaluation.
The recipes in all chapters have been revised. This is to make sure that they all work well with the most recent versions of pandas. pandas went from version 1.5.3 to 2.2.1 during the writing of this book. I have tried to make sure that all code works fine on all versions of pandas released from January 2023 through February 2024. Since AI tools are becoming increasingly common in our work, I have included discussion of OpenAI tools in four of the chapters. Altogether, 22 of the 82 recipes are new. All of the datasets used have also been updated.
Who this book is for
I had multiple audiences in mind as I wrote this book, but I most consistently thought about a dear friend of mine who bought a Transact-SQL book 30 years ago and quickly developed great confidence in her database work, ultimately building a career around those skills. I would love it if someone just starting their career as a data scientist or analyst worked through this book and had a similar experience as my friend. More than anything else, I want you to feel good and excited about what you can do as a result of reading this book.
I also hope this book will be a useful reference for folks who have been doing this kind of work for a while. Here, I imagine someone opening the book and wondering to themself, What’s an approach to handling missing data that maintains the variance of my variable?
In keeping with the hands-on nature of this text, every bit of output is reproducible with code in this book. I also stuck to a rule throughout, even when it was challenging. Every recipe starts with raw data largely unchanged from the original downloaded file. You go from data file to better prepared data in each recipe. If you have forgotten how a particular object was created, all you will ever need to do is turn back a page or two to see.
Readers who have some knowledge of pandas and NumPy will have an easier time with some code blocks, as will folks with some knowledge of Python and introductory statistics. None of that is essential though. There are just some recipes you might want to pause over longer.
What this book covers
Chapter 1, Anticipating Data Cleaning Issues When Importing Tabular Data with pandas, explores tools for loading CSV files, Excel files, relational database tables, SAS, SPSS, Stata, and R files into pandas DataFrames.
Chapter 2, Anticipating Data Cleaning Issues When Working with HTML, JSON, and Spark Data, discusses techniques for reading and normalizing JSON data, web scraping, and working with big data using Spark. It also explores techniques for persisting data, including with versioning.
Chapter 3, Taking the Measure of Your Data, introduces common techniques for navigating around a DataFrame, selecting columns and rows, and generating summary statistics. The use of OpenAI tools for examining dataset structure and generating statistics is introduced.
Chapter 4, Identifying Outliers in Subsets of Data, explores a wide range of strategies to identify outliers across a whole DataFrame and by selected groups.
Chapter 5, Using Visualizations for the Identification of Unexpected Values, demonstrates the use of the Matplotlib and Seaborn tools to visualize how key variables are distributed, including with histograms, boxplots, scatter plots, line plots, and violin plots.
Chapter 6, Cleaning and Exploring Data with Series Operations, discusses updating pandas Series with scalars, arithmetic operations, and conditional statements based on the values of one or more Series.
Chapter 7, Identifying and Fixing Missing Values, goes over strategies for identifying missing values across rows and columns, and over subsets of data. It explores strategies for imputing values, such as setting values to the overall mean or the mean for a given category and forward filling. It also examines multivariate techniques for imputing values for missing values and discusses when they are appropriate.
Chapter 8, Encoding, Transforming, and Scaling Features, covers a range of variable transformation techniques to prepare features and targets for predictive analysis. This includes the most common kinds of encoding—one-hot, ordinal, and hashing encoding; transformations to improve the distribution of variables; and binning and scaling approaches to address skewness, kurtosis, and outliers and to adjust for features with widely different ranges.
Chapter 9, Fixing Messy Data When Aggregating, demonstrates multiple approaches to aggregating data by group, including looping through data with itertuples or NumPy arrays, dropping duplicate rows, and using pandas' groupby and pivot tables. It also discusses when to choose one approach over the others.
Chapter 10, Addressing Data Issues When Combining DataFrames, examines different strategies for concatenating and merging data, and how to anticipate common data challenges when combining data.
Chapter 11, Tidying and Reshaping Data, introduces several strategies for de-duplicating, stacking, melting, and pivoting data.
Chapter 12, Automate Data Cleaning with User-Defined Functions and Classes and Pipelines, examines how to turn many of the techniques from the first 11 chapters into reusable code.
Download the example code files
The code bundle for the book is hosted on GitHub at https://github.com/PacktPublishing/Python-Data-Cleaning-Cookbook-Second-Edition. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
Download the color images
We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: https://packt.link/gbp/9781803239873.
Conventions used
There are a number of text conventions used throughout this book.
Code in text: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: "Mount the downloaded WebStorm-10*.dmg disk image file as another disk in your system."
A block of code is set as follows:
import pandas as pd
import os
import sys
nls97 = pd.read_csv('data/nls97g.csv', low_memory=False)
nls97.set_index('personid', inplace=True)
Any output from the code will appear like this:
       satverbal  satmath
min           14        7
per15        390      390
qr1          430      430
med          500      500
qr3          570      580
per85        620      621
max          800      800
count      1,406    1,407
mean         500      501
iqr          140      150
Bold: Indicates a new term, an important word, or words that you see onscreen. For example, words in menus or dialog boxes appear in the text like this. Here is an example: "Select System info from the Administration panel."
Warnings or important notes appear like this.
Tips and tricks appear like this.
Get in touch
Feedback from our readers is always welcome.
General feedback: Email feedback@packtpub.com and mention the book's title in the subject of your message. If you have questions about any aspect of this book, please email us at questions@packtpub.com.
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you reported this to us. Please visit http://www.packtpub.com/submit-errata, click Submit Errata, and fill in the form.
Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at copyright@packtpub.com with a link to the material.
If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit http://authors.packtpub.com.
Share your thoughts
Once you’ve read Python Data Cleaning Cookbook, Second Edition, we’d love to hear your thoughts! Please click here to go straight to the Amazon review page for this book and share your feedback.
Your review is important to us and the tech community and will help us make sure we’re delivering excellent quality content.
Download a free PDF copy of this book
Thanks for purchasing this book!
Do you like to read on the go but are unable to carry your print books everywhere?
Is your eBook purchase not compatible with the device of your choice?
Don’t worry, now with every Packt book you get a DRM-free PDF version of that book at no cost.
Read anywhere, any place, on any device. Search, copy, and paste code from your favorite technical books directly into your application.
The perks don’t stop there. You can get exclusive access to discounts, newsletters, and great free content in your inbox daily.
Follow these simple steps to get the benefits:
Scan the QR code or visit the link below:
https://packt.link/free-ebook/9781803239873
Submit your proof of purchase.
That’s it! We’ll send your free PDF and other benefits to your email directly.
1
Anticipating Data Cleaning Issues When Importing Tabular Data with pandas
Scientific distributions of Python (Anaconda, WinPython, Canopy, and so on) provide analysts with an impressive range of data manipulation, exploration, and visualization tools. One important tool is pandas. Developed by Wes McKinney in 2008, but really gaining in popularity after 2012, pandas is now an essential library for data analysis in Python. The recipes in this book demonstrate how many common data preparation tasks can be done more easily with pandas than with other tools. While we work with pandas extensively in this book, we also use other popular packages such as NumPy, Matplotlib, and SciPy.
A key pandas object is the DataFrame, which represents data as a tabular structure, with rows and columns. In this way, it is similar to the other data stores we discuss in this chapter. However, a pandas DataFrame also has indexing functionality that makes selecting, combining, and transforming data relatively straightforward, as the recipes in this book will demonstrate.
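If the DataFrame is new to you, the following minimal sketch may help fix the idea of labeled rows and columns. The values here are invented purely for illustration and are not taken from any dataset used in this book:
import pandas as pd
# a tiny, made-up DataFrame: two rows, two columns, labeled by a station index
df = pd.DataFrame({'avgtemp': [12.2, 19.5],
                   'country': ['Chile', 'Australia']},
                  index=['CI000085406', 'ASN00024002'])
df.loc['CI000085406']   # select a row by its index label
df[['country']]         # select a column, returned as a DataFrame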
Before we can make use of this great functionality, we have to import our data into pandas. Data comes to us in a wide variety of formats: as CSV or Excel files, as tables from SQL databases, from statistical analysis packages such as SPSS, Stata, SAS, or R, from non-tabular sources such as JSON, and from web pages.
We examine tools for importing tabular data in this chapter. Specifically, we cover the following recipes:
Importing CSV files
Importing Excel files
Importing data from SQL databases
Importing SPSS, Stata, and SAS data
Importing R data
Persisting tabular data
Technical requirements
The code and notebooks for this chapter are available on GitHub at https://github.com/michaelbwalker/Python-Data-Cleaning-Cookbook-Second-Edition. You can use any IDE (Integrated Development Environment) of your choice – IDLE, Visual Studio, Sublime, Spyder, and so on – or Jupyter Notebook to work with any of the code in this chapter, or any chapter in this book. A good guide to get started with Jupyter Notebook can be found here: https://www.dataquest.io/blog/jupyter-notebook-tutorial/. I used the Spyder IDE to write the code in this chapter.
I used pandas 2.2.1 and NumPy version 1.24.3 for all of the code in this chapter and subsequent chapters. I have also tested all code with pandas 1.5.3.
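If you want to confirm which versions you are running before working through the recipes, a quick check such as the following should do it (the versions shown in the comments are simply the ones I used, not strict requirements):
import pandas as pd
import numpy as np
print(pd.__version__)   # for example, 2.2.1
print(np.__version__)   # for example, 1.24.3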
Importing CSV files
The read_csv method of the pandas library can be used to read a file with comma-separated values (CSV) and load it into memory as a pandas DataFrame. In this recipe, we import a CSV file and address some common issues: creating column names that make sense to us, parsing dates, and dropping rows with critical missing data.
Raw data is often stored as CSV files. These files have a carriage return at the end of each line of data to demarcate a row, and a comma between each data value to delineate columns. Something other than a comma can be used as the delimiter, such as a tab. Quotation marks may be placed around values, which can be helpful when the delimiter occurs naturally within certain values, which sometimes happens with commas.
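To see why quoting matters, consider this small, hypothetical example in which a comma appears inside a value; as long as the value is quoted, read_csv keeps it in a single column (the data here is made up for illustration):
from io import StringIO
import pandas as pd
text = 'station,place\nUSW00024151,"Portland, Oregon"\n'
pd.read_csv(StringIO(text))   # the quoted comma stays inside the place value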
All data in a CSV file are characters, regardless of the logical data type. This is why it is easy to view a CSV file, presuming it is not too large, in a text editor. The pandas read_csv method will make an educated guess about the data type of each column, but you will need to help it along to ensure that these guesses are on the mark.
Getting ready
Create a folder for this chapter, and then create a new Python script or Jupyter Notebook file in that folder. Create a data subfolder, and then place the landtempssample.csv file in that subfolder. Alternatively, you could retrieve all of the files from the GitHub repository, including the data files. Here is a screenshot of the beginning of the CSV file:
Figure 1.1: Land Temperatures Data
Data note
This dataset, taken from the Global Historical Climatology Network integrated database, is made available for public use by the United States National Oceanic and Atmospheric Administration at https://www.ncei.noaa.gov/products/land-based-station/global-historical-climatology-network-monthly. I used the data from version 4. The data in this recipe uses a 100,000-row sample of the full dataset, which is also available in the repository.
How to do it…
We will import a CSV file into pandas, taking advantage of some very useful read_csv options:
Import the pandas library, and set up the environment to make viewing the output easier:
import pandas as pd
pd.options.display.float_format = '{:,.2f}'.format
pd.set_option('display.width', 85)
pd.set_option('display.max_columns', 8)
Read the data file, set new names for the headings, and parse the date column.
Pass an argument of 1 to the skiprows parameter to skip the first row, pass a list of columns to parse_dates to create a pandas datetime column from those columns, and set low_memory to False. This will cause pandas to load all of the data into memory at once, rather than in chunks. We do this so that pandas can identify the data type of each column automatically. In the There’s more… section, we see how to set the data type for each column manually:
landtemps = pd.read_csv('data/landtempssample.csv',
...     names=['stationid','year','month','avgtemp','latitude',
...       'longitude','elevation','station','countryid','country'],
...     skiprows=1,
...     parse_dates=[['month','year']],
...     low_memory=False)
type(landtemps)
Note
We have to use skiprows because we are passing a list of column names to read_csv. If we use the column names in the CSV file, we do not need to specify values for either names or skiprows.
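To make that trade-off concrete, here is a sketch of the alternative: letting read_csv take the headings from the file itself. The column names you get back are whatever the CSV file uses, which may differ from the names we chose above, so treat this only as an illustration:
landtemps2 = pd.read_csv('data/landtempssample.csv', low_memory=False)
landtemps2.columns   # the headings as they appear in the file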
Get a quick glimpse of the data.
View the first few rows. Show the data type for all columns, as well as the number of rows and columns:
landtemps.head(7)
   month_year    stationid  ... countryid        country
0  2000-04-01  USS0010K01S  ...        US  United States
1  1940-05-01  CI000085406  ...        CI          Chile
2  2013-12-01  USC00036376  ...        US  United States
3  1963-02-01  ASN00024002  ...        AS      Australia
4  2001-11-01  ASN00028007  ...        AS      Australia
5  1991-04-01  USW00024151  ...        US  United States
6  1993-12-01  RSM00022641  ...        RS         Russia
[7 rows x 9 columns]
landtemps.dtypes
month_year    datetime64[ns]
stationid             object
avgtemp              float64
latitude             float64
longitude            float64
elevation            float64
station               object
countryid             object
country               object
dtype: object
landtemps.shape
(100000, 9)
Give the date column a more appropriate name and view the summary statistics for average monthly temperature:
landtemps.rename(columns={'month_year':'measuredate'}, inplace=True)
landtemps.dtypes
measuredate   datetime64[ns]
stationid             object
avgtemp              float64
latitude             float64
longitude            float64
elevation            float64
station               object
countryid             object
country               object
dtype: object
landtemps.avgtemp.describe()
count   85,554.00
mean        10.92
std         11.52
min        -70.70
25%          3.46
50%         12.22
75%         19.57
max         39.95
Name: avgtemp, dtype: float64
Look for missing values for each column.
Use isnull, which returns True for each value that is missing for each column, and False when not missing. Chain this with sum to count the missing values for each column. (When working with Boolean values, sum treats True as 1 and False as 0. I will discuss method chaining in the There’s more... section of this recipe):
landtemps.isnull().sum()
measuredate        0
stationid          0
avgtemp        14446
latitude           0
longitude          0
elevation          0
station            0
countryid          0
country            5
dtype: int64
Remove rows with missing data for avgtemp.
Use the subset parameter to tell dropna to drop rows when avgtemp is missing. Set inplace to True. Leaving inplace at its default value of False would display the DataFrame, but the changes we have made would not be retained. Use the shape attribute of the DataFrame to get the number of rows and columns:
landtemps.dropna(subset=['avgtemp'], inplace=True)
landtemps.shape
(85554, 9)
That’s it! Importing CSV files into pandas is as simple as that.
How it works...
Almost all of the recipes in this book use the pandas library. We refer to it as pd to make it easier to reference later. This is customary. We also use float_format to display float values in a readable way and set_option to make the Terminal output wide enough to accommodate the number of variables.
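If you would rather not change display settings globally, pandas also offers option_context, which applies settings only within a with block and restores the previous values afterward. This is simply an alternative style, not something the recipe requires:
with pd.option_context('display.width', 85, 'display.max_columns', 8):
    # the wider display applies only inside this block
    print(landtemps.head())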
Much of the work is done by the first line in Step 2. We use read_csv to load a pandas DataFrame in memory and call it landtemps. In addition to passing a filename, we set the names parameter to a list of our preferred column headings. We also tell read_csv to skip the first row, by setting skiprows to 1, since the original column headings are in the first row of the CSV file. If we do not tell it to skip the first row, read_csv will treat the header row in the file as actual data. read_csv also solves a date conversion issue for us. We use the parse_dates parameter to ask it to convert the month and year columns to a date value.
Step 3 runs through a few standard data checks. We use head(7) to print out all columns for the first seven rows. We use the dtypes attribute of the DataFrame to show the data type of all columns. Each column has the expected data type. In pandas, character data has the object data type, a data type that allows for mixed values. shape returns a tuple, whose first element is the number of rows in the DataFrame (100,000 in this case) and whose second element is the number of columns (9).
When we used read_csv to parse the month and year columns, it gave the resulting column the name month_year. We used the rename method in Step 4 to give that column a more appropriate name. We need to specify inplace=True to replace the old column name with the new column name in memory. The describe method provides summary statistics on the avgtemp column.
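If you prefer to avoid inplace=True, which some pandas style guides discourage, you can assign the result of rename back to the DataFrame instead. The effect is the same; this is just an alternative to the approach used in the recipe:
landtemps = landtemps.rename(columns={'month_year':'measuredate'})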
Notice that the count for avgtemp indicates that there are 85,554 rows that have valid values for avgtemp. This is out of 100,000 rows for the whole DataFrame, as provided by the shape attribute. The listing of missing values for each column in Step 5 (landtemps.isnull().sum()) confirms this: 100,000 – 85,554 = 14,446.
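You can verify that arithmetic directly by subtracting the number of valid avgtemp values from the number of rows. Run this before the dropna in Step 6, since afterward the difference will be 0:
landtemps.shape[0] - landtemps.avgtemp.count()   # 100000 - 85554 = 14446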
Step 6 drops all rows where avgtemp is NaN. (The NaN value, not a number, is the pandas representation of missing values.) subset is used to indicate which column to check for missing values. The shape attribute for landtemps now indicates that there are 85,554 rows, which is what we would expect, given the previous count from describe.
There’s more...
If the file you are reading uses a delimiter other than a comma, such as a tab, this can be specified in the sep parameter of read_csv. When creating the pandas DataFrame, an index was also created. The numbers to the far left of the output when head was run are index values. Any number of rows can be specified for head. The default value is 5.
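For example, a tab-delimited version of the same kind of file could be read as follows. The filename here is hypothetical; the point is just the sep parameter:
tabtemps = pd.read_csv('data/landtempssample.tsv', sep='\t', low_memory=False)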
Instead of setting low_memory to False, to get pandas to make good guesses regarding data types, we could have set data types manually:
landtemps = pd.read_csv('data/landtempssample.csv',
    names=['stationid','year','month','avgtemp','latitude',
      'longitude','elevation','station','countryid','country'],
    skiprows=1,
    parse_dates=[['month','year']],
    dtype={'stationid':'object', 'avgtemp':'float64',
      'latitude':'float64', 'longitude':'float64',
      'elevation':'float64', 'station':'object',
      'countryid':'object', 'country':'object'},
    )
landtemps.info()
The landtemps.isnull().sum() statement is an example of chaining methods. First, isnull returns a DataFrame of True and False values, resulting from testing whether each column value is null. The sum function takes that DataFrame and sums the True values for each column, interpreting the True values as 1 and the False values as 0. We would have obtained the same result if we had used the following two steps:
checknull = landtemps.isnull()
checknull.sum()
There is no hard and fast rule for when to chain methods and when not to do so. I find chaining helpful when the overall operation feels like a single step, even if it’s two or more steps mechanically. Chaining also has the side benefit of not creating extra objects that I might not need.
The dataset used in this recipe is just a sample from the full land temperatures database, with almost 17 million records. You can run the larger file if your machine can handle it, with the following code:
landtemps = pd.read_csv('data/landtemps.zip',
...   compression='zip', names=['stationid','year',
...     'month','avgtemp','latitude','longitude',
...     'elevation','station','countryid','country'],
...   skiprows=1,
...   parse_dates=[['month','year']],
...   low_memory=False)
read_csv can read a compressed ZIP file. We get it to do this by passing the name of the ZIP file and the type of compression.
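In recent pandas versions, the compression parameter defaults to 'infer', which works out the compression method from the file extension. So a sketch like the following, which leaves out the explicit compression argument (and, for brevity, the date parsing), would typically load the same ZIP file just as well:
landtemps = pd.read_csv('data/landtemps.zip',
  names=['stationid','year','month','avgtemp','latitude',
    'longitude','elevation','station','countryid','country'],
  skiprows=1, low_memory=False)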
See also
Subsequent recipes in this chapter, and in other chapters, set indexes to improve navigation over rows and merging.
A significant amount of reshaping of the Global Historical Climatology Network raw data was done before using it in this recipe. We demonstrate this in Chapter 11, Tidying and Reshaping Data.
Importing Excel files
The read_excel method of the pandas library can be used to import data from an Excel file and load it into memory as a pandas DataFrame. In this recipe, we import an Excel file and handle some common issues when working with Excel files: extraneous header and footer information, selecting specific columns, removing rows with no data, and connecting to particular sheets.
Despite the tabular structure of Excel, which invites the organization of data into rows and columns, spreadsheets are not datasets and do not require people to store data in that way. Even when some data conforms with those expectations, there is often additional information in rows or columns before or after the data to be imported. Data types are not always as clear as they are to the person who created the spreadsheet. This will be all too familiar to anyone who has ever battled with importing leading zeros. Moreover, Excel does not insist that all data in a column be of the same type, or that column headings be appropriate for use with a programming language such as Python.
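Leading zeros are a good example of what can go wrong when types are guessed. If a column of identifier codes should stay as text, you can tell read_excel not to convert it. The filename and column name below are hypothetical; the point is the dtype argument:
codes = pd.read_excel('data/idcodes.xlsx', dtype={'zipcode': str})
codes.zipcode.head()   # leading zeros are preserved because the column stays as text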
Fortunately, read_excel has a number of options for handling messiness in Excel data. These options make it relatively easy to skip rows, select particular columns, and pull data from a particular sheet or sheets.
Getting ready
You can download the GDPpercapita22b.xlsx file, as well as the code for this recipe, from the GitHub repository for this book. The code assumes that the Excel file is in a data subfolder. Here is a view of the beginning of the file (some columns were hidden for display purposes):
Figure 1.2: View of the dataset
And here is a view of the end of the file:
Figure 1.3: View of the dataset
Data note
This dataset, from the Organisation for Economic Co-operation and Development, is available for public use at https://stats.oecd.org/.
How to do it…
We import an Excel file into pandas and do some initial data cleaning:
Import the pandas library:
import pandas as pd
Read the Excel per capita GDP data.
Select the sheet with the data we need, but skip the columns and rows that we do not want. Use the sheet_name parameter to specify the sheet. Set skiprows to 4 and skipfooter to 1 to skip the first four rows (the first row is hidden) and the last row. We provide values for usecols to get data from column A and columns C through W (column B is blank). Use head to view the first few rows and shape to get the number of rows and columns:
percapitaGDP = pd.read_excel('data/GDPpercapita22b.xlsx',
...    sheet_name='OECD.Stat export',
...    skiprows=4,
...    skipfooter=1,
...    usecols='A,C:W')
percapitaGDP.head()
                       Year 2000  ...   2019   2020
0        Metropolitan areas  NaN  ...    NaN    NaN
1            AUS: Australia   ..  ...     ..     ..
2     AUS01: Greater Sydney   ..  ...  45576  45152
3  AUS02: Greater Melbourne   ..  ...  42299  40848
4   AUS03: Greater Brisbane   ..  ...  42145  40741
[5 rows x 22 columns]
percapitaGDP.shape
(731, 22)
Note
You may encounter a problem with read_excel if the Excel file does not use utf-8 encoding. One way to resolve this is to save the Excel file as a CSV file, reopen it, and then save it with utf-8 encoding.
Use the info method of the DataFrame to view data types and the non-null count. Notice that all columns have the object data type:
percapitaGDP.info()
Rename the Year column to metro, and remove the leading spaces.
Give an appropriate name to the metropolitan area column. There are extra spaces before the metro values in some cases. We can test for leading spaces with startswith(' ') and then use any to establish whether there are one or more occasions when the first character is blank. We can use endswith(' ') to examine trailing spaces. We use strip to remove both leading and trailing spaces. When we test for leading spaces again, we see that there are none:
percapitaGDP.rename(columns={'Year':'metro'}, inplace=True)
percapitaGDP.metro.str.startswith(' ').any()
True
percapitaGDP.metro.str.endswith(' ').any()
False
percapitaGDP.metro = percapitaGDP.metro.str.strip()
percapitaGDP.metro.str.startswith(' ').any()
False
Convert the data columns to numeric.
Iterate over all of the GDP year columns (2000–2020) and convert the data type from object to float. Coerce the conversion even when there is character data – the .. in this example. We want character values in those columns to become missing, which is what happens. Rename the year columns to better reflect the data in those columns:
for col in percapitaGDP.columns[1:]:
...   percapitaGDP[col] = pd.to_numeric(percapitaGDP[col],
...     errors='coerce')
...   percapitaGDP.rename(columns={col:'pcGDP'+col},
...     inplace=True)
...
percapitaGDP.head()
                      metro  pcGDP2000  pcGDP2001  ...  \
0        Metropolitan areas        NaN        NaN  ...
1            AUS: Australia        NaN        NaN  ...
2     AUS01: Greater Sydney        NaN      41091  ...
3  AUS02: Greater Melbourne        NaN      40488  ...
4   AUS03: Greater Brisbane        NaN      35276  ...
   pcGDP2018  pcGDP2019  pcGDP2020
0        NaN        NaN        NaN
1        NaN        NaN        NaN
2      47171      45576      45152
3      43237      42299      40848
4      44328      42145      40741
[5 rows x 22 columns]
percapitaGDP.dtypes
metro         object
pcGDP2000    float64
pcGDP2001    float64
...abbreviated to save space...
pcGDP2019    float64
pcGDP2020    float64
dtype: object
Use the describe method to generate summary statistics for all numeric data in the DataFrame:
percapitaGDP.describe()
       pcGDP2000  pcGDP2001  pcGDP2002  ...  pcGDP2018  \
count        158        450        479  ...        692
mean       33961      38874      39621  ...      41667
std        15155      13194      13061  ...      17440
min         2686       7805       7065  ...       5530
25%        21523      30790      31064  ...      31322
50%        35836      38078      39246  ...      41428
75%        42804      46576      47874  ...      51130
max        95221      96941      98929  ...     147760
       pcGDP2019  pcGDP2020
count        596        425
mean       42709      39792
std        18893      19230
min         5698       5508
25%        29760      24142
50%        43505      41047
75%        53647      51130
max       146094     131082
[8 rows x 21 columns]
Remove rows where all of the per capita GDP values are missing.
Use the subset parameter of dropna to inspect all columns, starting with the second column (it is zero-based) and going through to the last column. Use how to specify that we want to drop rows only if all of the columns specified in subset are missing. Use shape to show the number of rows and columns in the resulting DataFrame:
percapitaGDP.dropna(subset=percapitaGDP.columns[1:], how='all',
  inplace=True)
percapitaGDP.shape
(692, 22)
Set the index for the DataFrame using the metropolitan area column.
Confirm that there are 692 valid values for metro and that there are 692 unique values, before setting the index:
percapitaGDP.metro.count()
692
percapitaGDP.metro.nunique()
692
percapitaGDP.set_index('metro', inplace=True)
percapitaGDP.head()
                       pcGDP2000  pcGDP2001  ...  \
metro                                        ...