
    Python Data Cleaning Cookbook

    Second Edition

    Prepare your data for analysis with pandas, NumPy, Matplotlib, scikit-learn, and OpenAI

    Michael Walker

    Python Data Cleaning Cookbook

    Second Edition

    Copyright © 2024 Packt Publishing

    All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

    Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

    Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

    Senior Publishing Product Manager: Gebin George

    Acquisition Editor – Peer Reviews: Gaurav Gavas

    Project Editor: Namrata Katare

    Content Development Editor: Deepayan Bhattacharjee

    Copy Editor: Safis Editing

    Technical Editor: Karan Sonawane

    Proofreader: Safis Editing

    Indexer: Subalakshmi Govindhan

    Presentation Designer: Ajay Patule

    Developer Relations Marketing Executive: Vignesh Raju

    First published: December 2020

    Second edition: May 2024

    Production reference: 1230524

    Published by Packt Publishing Ltd.

    Grosvenor House

    11 St Paul’s Square

    Birmingham

    B3 1RB, UK.

    ISBN: 978-1-80323-987-3

    www.packt.com

    Contributors

    About the author

    Michael Walker has worked as a data analyst for 37 years at various educational institutions. He has also taught data science, research methods, statistics, and computer programming to undergraduates since 2006. He is currently the Chief Information Officer at College Unbound in Providence, Rhode Island. Michael is also the author of Data Cleaning and Exploration with Machine Learning.

    I deeply appreciate the editors of this text, particularly Namrata Katare and Deepayan Bhattacharjee, and the technical reviewer, Kalyana Bedhu. They made this text so much better than it would have been otherwise. Any remaining issues are completely the fault of the author. I would also like to thank my partner over the last several decades, Karen Walker. She has had to listen to me go on and on about statistics and data, and the teaching of both topics, for all those years. Along the way, she patiently and lovingly taught me most of what I know about how to teach and be in community with other humans, even when that means putting aside some modeling or programming task for a few hours to focus on what matters most. For the sake of the reader, I hope this book reflects some fraction of her thoughtfulness and patience.

    About the reviewer

    Kalyana Bedhu, a recipient of patents and an author of award-winning papers and data science courses, is an Engineering Leader and Architect of Enterprise AI/ML at Fannie Mae. He has over 20 years of experience in applied AI/ML across various companies like Microsoft, Ericsson, Sony, Bosch, Fidelity, and Oracle. As a practitioner of Data Science at Ericsson, he helped set up a data science lab. He played a pivotal role in transforming a central IT organization into an AI and data science engine.

    Join our community on Discord

    Join our community’s Discord space for discussions with the author and other readers:

    https://discord.gg/p8uSgEAETX

    Preface

    This book is a practical guide to data cleaning, broadly defined as all tasks necessary to prepare data for analysis. It is organized by the tasks usually completed during the data-cleaning process: importing data, viewing data diagnostically, identifying outliers and unexpected values, imputing values, tidying data, and so on. Each recipe walks the reader from raw data through the completion of a specific data-cleaning task.

    There are already a number of very good pandas books. Unsurprisingly, there is some overlap between those texts and this one. However, the emphasis here is different. I focus as much on the why as on the how in this book.

    Since pandas is still relatively new, the lessons I have learned about cleaning data have been shaped by my experiences with other tools. Before settling into my current work routine with Python and R about 10 years ago, I relied mostly on C# and T-SQL in the early 2000s, SAS and Stata in the 90s, and FORTRAN and Pascal in the 80s. Most readers of this text probably have experience with a variety of data-cleaning and analysis tools. In many ways the specific tool is less significant than the data preparation task and the attributes of the data. I would have covered pretty much the same topics if I had been asked to write The SAS Data Cleaning Cookbook or The R Data Cleaning Cookbook. I just take a Python/pandas-specific approach to the same data-cleaning challenges that analysts have faced for decades.

    I start each chapter with how to think about the particular data-cleaning task at hand before discussing how to approach it with a tool from the Python ecosystem—pandas, NumPy, Matplotlib, and so on. This is reinforced in each recipe by a discussion of the implications of what we are uncovering in the data. I try to connect tool to purpose. For example, concepts like skewness and kurtosis matter as much for handling outliers as does knowing how to update pandas Series values.

    New in the Second Edition

    Readers of the first edition will recognize that this book is substantially longer than that one. That is partly because there are two new chapters: one devoted to treating missing values and another on pre-processing data for predictive analysis. The insufficient coverage of missing values and the absence of coverage of pre-processing data for machine learning applications in the first edition were important omissions. The pre-processing coverage is further improved by new recipes on data pipelines in the final chapter that take the reader from raw data to model evaluation.

    The recipes in all chapters have been revised to make sure that they work well with the most recent versions of pandas, which went from version 1.5.3 to 2.2.1 during the writing of this book. I have tried to make sure that all code works fine on all versions of pandas released from January 2023 through February 2024. Since AI tools are becoming increasingly common in our work, I have included discussion of OpenAI tools in four of the chapters. Altogether, 22 of the 82 recipes are new. All of the datasets used have also been updated.

    Who this book is for

    I had multiple audiences in mind as I wrote this book, but I most consistently thought about a dear friend of mine who bought a Transact-SQL book 30 years ago and quickly developed great confidence in her database work, ultimately building a career around those skills. I would love it if someone just starting their career as a data scientist or analyst worked through this book and had a similar experience as my friend. More than anything else, I want you to feel good and excited about what you can do as a result of reading this book.

    I also hope this book will be a useful reference for folks who have been doing this kind of work for a while. Here, I imagine someone opening the book and wondering to themself, What’s an approach to handling missing data that maintains the variance of my variable?

    In keeping with the hands-on nature of this text, every bit of output is reproducible with code in this book. I also stuck to a rule throughout, even when it was challenging. Every recipe starts with raw data largely unchanged from the original downloaded file. You go from data file to better prepared data in each recipe. If you have forgotten how a particular object was created, all you will ever need to do is turn back a page or two to see.

    Readers who have some knowledge of pandas and NumPy will have an easier time with some code blocks, as will folks with some knowledge of Python and introductory statistics. None of that is essential though. There are just some recipes you might want to pause over longer.

    What this book covers

    Chapter 1, Anticipating Data Cleaning Issues When Importing Tabular Data with pandas, explores tools for loading CSV files, Excel files, relational database tables, SAS, SPSS, Stata, and R files into pandas DataFrames.

    Chapter 2, Anticipating Data Cleaning Issues When Working with HTML, JSON, and Spark Data, discusses techniques for reading and normalizing JSON data, web scraping, and working with big data using Spark. It also explores techniques for persisting data, including with versioning.

    Chapter 3, Taking the Measure of Your Data, introduces common techniques for navigating around a DataFrame, selecting columns and rows, and generating summary statistics. The use of OpenAI tools for examining dataset structure and generating statistics is introduced.

    Chapter 4, Identifying Outliers in Subsets of Data, explores a wide range of strategies to identify outliers across a whole DataFrame and by selected groups.

    Chapter 5, Using Visualizations for the Identification of Unexpected Values, demonstrates the use of the Matplotlib and Seaborn tools to visualize how key variables are distributed, including with histograms, boxplots, scatter plots, line plots, and violin plots.

    Chapter 6, Cleaning and Exploring Data with Series Operations, discusses updating pandas Series with scalars, arithmetic operations, and conditional statements based on the values of one or more Series.

    Chapter 7, Identifying and Fixing Missing Values, goes over strategies for identifying missing values across rows and columns, and over subsets of data. It explores strategies for imputing values, such as setting values to the overall mean or the mean for a given category, and forward filling. It also examines multivariate techniques for imputing missing values and discusses when they are appropriate.

    Chapter 8, Encoding, Transforming, and Scaling Features, covers a range of variable transformation techniques to prepare features and targets for predictive analysis. This includes the most common kinds of encoding—one-hot, ordinal, and hashing encoding; transformations to improve the distribution of variables; and binning and scaling approaches to address skewness, kurtosis, and outliers and to adjust for features with widely different ranges.

    Chapter 9, Fixing Messy Data When Aggregating, demonstrates multiple approaches to aggregating data by group, including looping through data with itertuples or NumPy arrays, dropping duplicate rows, and using pandas' groupby and pivot tables. It also discusses when to choose one approach over the others.

    Chapter 10, Addressing Data Issues When Combining DataFrames, examines different strategies for concatenating and merging data, and how to anticipate common data challenges when combining data.

    Chapter 11, Tidying and Reshaping Data, introduces several strategies for de-duplicating, stacking, melting, and pivoting data.

    Chapter 12, Automate Data Cleaning with User-Defined Functions and Classes and Pipelines, examines how to turn many of the techniques from the first 11 chapters into reusable code.

    Download the example code files

    The code bundle for the book is hosted on GitHub at https://github.com/PacktPublishing/Python-Data-Cleaning-Cookbook-Second-Edition. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

    Download the color images

    We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: https://packt.link/gbp/9781803239873.

    Conventions used

    There are a number of text conventions used throughout this book.

    Code in text: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: "Mount the downloaded WebStorm-10*.dmg disk image file as another disk in your system."

    A block of code is set as follows:

    import pandas as pd
    import os
    import sys
    nls97 = pd.read_csv("data/nls97g.csv", low_memory=False)
    nls97.set_index('personid', inplace=True)

    Any output from the code will appear like this:

           satverbal  satmath
    min           14        7
    per15        390      390
    qr1          430      430
    med          500      500
    qr3          570      580
    per85        620      621
    max          800      800
    count      1,406    1,407
    mean         500      501
    iqr          140      150

    Bold: Indicates a new term, an important word, or words that you see onscreen. For example, words in menus or dialog boxes appear in the text like this. Here is an example: "Select System info from the Administration panel."

    Warnings or important notes appear like this.

    Tips and tricks appear like this.

    Get in touch

    Feedback from our readers is always welcome.

    General feedback: Email feedback@packtpub.com and mention the book’s title in the subject of your message. If you have questions about any aspect of this book, please email us at questions@packtpub.com.

    Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you reported this to us. Please visit http://www.packtpub.com/submit-errata, click Submit Errata, and fill in the form.

    Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at copyright@packtpub.com with a link to the material.

    If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit http://authors.packtpub.com.

    Share your thoughts

    Once you’ve read Python Data Cleaning Cookbook, Second Edition, we’d love to hear your thoughts! Please click here to go straight to the Amazon review page for this book and share your feedback.

    Your review is important to us and the tech community and will help us make sure we’re delivering excellent quality content.

    Download a free PDF copy of this book

    Thanks for purchasing this book!

    Do you like to read on the go but are unable to carry your print books everywhere?

    Is your eBook purchase not compatible with the device of your choice?

    Don’t worry, now with every Packt book you get a DRM-free PDF version of that book at no cost.

    Read anywhere, any place, on any device. Search, copy, and paste code from your favorite technical books directly into your application.

    The perks don’t stop there, you can get exclusive access to discounts, newsletters, and great free content in your inbox daily.

    Follow these simple steps to get the benefits:

    Scan the QR code or visit the link below:

    https://packt.link/free-ebook/9781803239873

    Submit your proof of purchase.

    That’s it! We’ll send your free PDF and other benefits to your email directly.

    1

    Anticipating Data Cleaning Issues When Importing Tabular Data with pandas

    Scientific distributions of Python (Anaconda, WinPython, Canopy, and so on) provide analysts with an impressive range of data manipulation, exploration, and visualization tools. One important tool is pandas. Developed by Wes McKinney in 2008, but really gaining in popularity after 2012, pandas is now an essential library for data analysis in Python. The recipes in this book demonstrate how many common data preparation tasks can be done more easily with pandas than with other tools. While we work with pandas extensively in this book, we also use other popular packages such as NumPy, Matplotlib, and SciPy.

    A key pandas object is the DataFrame, which represents data as a tabular structure, with rows and columns. In this way, it is similar to the other data stores we discuss in this chapter. However, a pandas DataFrame also has indexing functionality that makes selecting, combining, and transforming data relatively straightforward, as the recipes in this book will demonstrate.
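
    As a quick sketch of that indexing functionality (the values below are invented for illustration and are not from the book's datasets), label-based selection with loc might look like this:

    import pandas as pd

    # a small, made-up DataFrame with a station identifier as the index
    temps = pd.DataFrame(
        {'avgtemp': [12.2, 3.4, 21.0],
         'country': ['United States', 'Chile', 'Australia']},
        index=['USS0010K01S', 'CI000085406', 'ASN00024002'])

    # select one value by row label and column name
    temps.loc['CI000085406', 'avgtemp']          # 3.4

    # select rows that meet a condition on a column
    temps.loc[temps['country'] == 'Australia']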

    Before we can make use of this great functionality, we have to import our data into pandas. Data comes to us in a wide variety of formats: as CSV or Excel files, as tables from SQL databases, from statistical analysis packages such as SPSS, Stata, SAS, or R, from non-tabular sources such as JSON, and from web pages.

    We examine tools for importing tabular data in this chapter. Specifically, we cover the following topics:

    Importing CSV files

    Importing Excel files

    Importing data from SQL databases

    Importing SPSS, Stata, and SAS data

    Importing R data

    Persisting tabular data

    Technical requirements

    The code and notebooks for this chapter are available on GitHub at https://github.com/michaelbwalker/Python-Data-Cleaning-Cookbook-Second-Edition. You can use any IDE (Integrated Development Environment) of your choice – IDLE, Visual Studio, Sublime, Spyder, and so on – or Jupyter Notebook to work with any of the code in this chapter, or any chapter in this book. A good guide to get started with Jupyter Notebook can be found here: https://www.dataquest.io/blog/jupyter-notebook-tutorial/. I used the Spyder IDE to write the code in this chapter.

    I used pandas 2.2.1 and NumPy version 1.24.3 for all of the code in this chapter and subsequent chapters. I have also tested all code with pandas 1.5.3.

    Importing CSV files

    The read_csv method of the pandas library can be used to read a file with comma-separated values (CSV) and load it into memory as a pandas DataFrame. In this recipe, we import a CSV file and address some common issues: creating column names that make sense to us, parsing dates, and dropping rows with critical missing data.

    Raw data is often stored as CSV files. These files have a carriage return at the end of each line of data to demarcate a row, and a comma between each data value to delineate columns. Something other than a comma can be used as the delimiter, such as a tab. Quotation marks may be placed around values, which can be helpful when the delimiter occurs naturally within certain values, which sometimes happens with commas.
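
    As a small sketch of that quoting behavior (the in-memory CSV string and its values below are invented for illustration, not taken from this recipe's data), a comma inside a quoted value is not treated as a delimiter:

    import io
    import pandas as pd

    # a made-up two-row CSV string; the station name contains a comma,
    # so it is wrapped in quotation marks and read as a single value
    rawcsv = 'stationid,station,avgtemp\nUSW00024151,"Portland, OR",11.5\n'
    pd.read_csv(io.StringIO(rawcsv))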

    All data in a CSV file are characters, regardless of the logical data type. This is why it is easy to view a CSV file, presuming it is not too large, in a text editor. The pandas read_csv method will make an educated guess about the data type of each column, but you will need to help it along to ensure that these guesses are on the mark.

    Getting ready

    Create a folder for this chapter, and then create a new Python script or Jupyter Notebook file in that folder. Create a data subfolder, and then place the landtempssample.csv file in that subfolder. Alternatively, you could retrieve all of the files from the GitHub repository, including the data files. Here is a screenshot of the beginning of the CSV file:


    Figure 1.1: Land Temperatures Data

    Data note

    This dataset, taken from the Global Historical Climatology Network integrated database, is made available for public use by the United States National Oceanic and Atmospheric Administration at https://www.ncei.noaa.gov/products/land-based-station/global-historical-climatology-network-monthly. I used the data from version 4. The data in this recipe uses a 100,000-row sample of the full dataset, which is also available in the repository.

    How to do it…

    We will import a CSV file into pandas, taking advantage of some very useful read_csv options:

    Import the pandas library, and set up the environment to make viewing the output easier:

    import pandas as pd
    pd.options.display.float_format = '{:,.2f}'.format
    pd.set_option('display.width', 85)
    pd.set_option('display.max_columns', 8)

    Read the data file, set new names for the headings, and parse the date column.

    Pass an argument of 1 to the skiprows parameter to skip the first row, pass a list of columns to parse_dates to create a pandas datetime column from those columns, and set low_memory to False. This will cause pandas to load all of the data into memory at once, rather than in chunks. We do this so that pandas can identify the data type of each column automatically. In the There’s more… section, we see how to set the data type for each column manually:

    landtemps = pd.read_csv('data/landtempssample.csv',
        names=['stationid', 'year', 'month', 'avgtemp', 'latitude',
               'longitude', 'elevation', 'station', 'countryid', 'country'],
        skiprows=1,
        parse_dates=[['month', 'year']],
        low_memory=False)

    type(landtemps)

    Note

    We have to use skiprows because we are passing a list of column names to read_csv. If we use the column names in the CSV file, we do not need to specify values for either names or skiprows.
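
    For example, a minimal sketch of reading the same file while keeping whatever headings are already in its first row (so neither names nor skiprows is passed; header=0 is the read_csv default):

    # keep the column headings from the first row of the CSV file
    landtemps_origcols = pd.read_csv('data/landtempssample.csv',
        low_memory=False)
    landtemps_origcols.columns    # shows the headings the file itself uses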

    Get a quick glimpse of the data.

    View the first few rows. Show the data type for all columns, as well as the number of rows and columns:

    landtemps.head(7)

      month_year    stationid  ...  countryid        country
    0 2000-04-01  USS0010K01S  ...         US  United States
    1 1940-05-01  CI000085406  ...         CI          Chile
    2 2013-12-01  USC00036376  ...         US  United States
    3 1963-02-01  ASN00024002  ...         AS      Australia
    4 2001-11-01  ASN00028007  ...         AS      Australia
    5 1991-04-01  USW00024151  ...         US  United States
    6 1993-12-01  RSM00022641  ...         RS         Russia

    [7 rows x 9 columns]

    landtemps.dtypes

    month_year    datetime64[ns]
    stationid             object
    avgtemp              float64
    latitude             float64
    longitude            float64
    elevation            float64
    station               object
    countryid             object
    country               object
    dtype: object

    landtemps.shape

    (100000, 9)

    Give the date column a more appropriate name and view the summary statistics for average monthly temperature:

    landtemps.rename(columns={'month_year':'measuredate'}, inplace=True)
    landtemps.dtypes

    measuredate    datetime64[ns]
    stationid              object
    avgtemp               float64
    latitude              float64
    longitude             float64
    elevation             float64
    station                object
    countryid              object
    country                object
    dtype: object

    landtemps.avgtemp.describe()

    count   85,554.00
    mean        10.92
    std         11.52
    min        -70.70
    25%          3.46
    50%         12.22
    75%         19.57
    max         39.95
    Name: avgtemp, dtype: float64

    Look for missing values for each column.

    Use isnull, which returns True for each value that is missing for each column, and False when not missing. Chain this with sum to count the missing values for each column. (When working with Boolean values, sum treats True as 1 and False as 0. I will discuss method chaining in the There’s more... section of this recipe):

    landtemps.isnull().sum()

    measuredate        0
    stationid          0
    avgtemp        14446
    latitude           0
    longitude          0
    elevation          0
    station            0
    countryid          0
    country            5
    dtype: int64

    Remove rows with missing data for avgtemp.

    Use the subset parameter to tell dropna to drop rows when avgtemp is missing. Set inplace to True. Leaving inplace at its default value of False would display the DataFrame, but the changes we have made would not be retained. Use the shape attribute of the DataFrame to get the number of rows and columns:

    landtemps.dropna(subset=['avgtemp'], inplace=True)
    landtemps.shape

    (85554, 9)

    That’s it! Importing CSV files into pandas is as simple as that.

    How it works...

    Almost all of the recipes in this book use the pandas library. We refer to it as pd to make it easier to reference later. This is customary. We also use float_format to display float values in a readable way and set_option to make the Terminal output wide enough to accommodate the number of variables.

    Much of the work is done by the first line in Step 2. We use read_csv to load a pandas DataFrame in memory and call it landtemps. In addition to passing a filename, we set the names parameter to a list of our preferred column headings. We also tell read_csv to skip the first row, by setting skiprows to 1, since the original column headings are in the first row of the CSV file. If we do not tell it to skip the first row, read_csv will treat the header row in the file as actual data.

    read_csv also solves a date conversion issue for us. We use the parse_dates parameter to ask it to convert the month and year columns to a date value.

    Step 3 runs through a few standard data checks. We use head(7) to print out all columns for the first seven rows. We use the dtypes attribute of the DataFrame to show the data type of all columns. Each column has the expected data type. In pandas, character data has the object data type, a data type that allows for mixed values. shape returns a tuple, whose first element is the number of rows in the DataFrame (100,000 in this case) and whose second element is the number of columns (9).

    When we used read_csv to parse the month and year columns, it gave the resulting column the name month_year. We used the rename method in Step 4 to give that column a more appropriate name. We need to specify inplace=True to replace the old column name with the new column name in memory. The describe method provides summary statistics on the avgtemp column.

    Notice that the count for avgtemp indicates that there are 85,554 rows that have valid values for avgtemp. This is out of 100,000 rows for the whole DataFrame, as provided by the shape attribute. The listing of missing values for each column in Step 5 (landtemps.isnull().sum()) confirms this: 100,000 – 85,554 = 14,446.

    Step 6 drops all rows where avgtemp is NaN. (The NaN value, not a number, is the pandas representation of missing values.) subset is used to indicate which column to check for missing values. The shape attribute for landtemps now indicates that there are 85,554 rows, which is what we would expect, given the previous count from describe.

    There’s more...

    If the file you are reading uses a delimiter other than a comma, such as a tab, this can be specified in the sep parameter of read_csv. When creating the pandas DataFrame, an index was also created. The numbers to the far left of the output when head was run are index values. Any number of rows can be specified for head. The default value is 5.
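
    As a minimal sketch (the tab-delimited file name below is hypothetical, used only to illustrate the sep parameter):

    # read a hypothetical tab-delimited version of the sample file
    landtemps_tab = pd.read_csv('data/landtempssample.tsv', sep='\t')

    # head accepts any number of rows; the default is 5
    landtemps_tab.head(3)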

    Instead of setting low_memory to False, to get pandas to make good guesses regarding data types, we could have set data types manually:

    landtemps = pd.read_csv('data/landtempssample.csv',
        names=['stationid', 'year', 'month', 'avgtemp', 'latitude',
               'longitude', 'elevation', 'station', 'countryid', 'country'],
        skiprows=1,
        parse_dates=[['month', 'year']],
        dtype={'stationid':'object', 'avgtemp':'float64',
               'latitude':'float64', 'longitude':'float64',
               'elevation':'float64', 'station':'object',
               'countryid':'object', 'country':'object'},
        )

    landtemps.info()

    RangeIndex: 100000 entries, 0 to 99999
    Data columns (total 9 columns):
     #   Column      Non-Null Count   Dtype
    ---  ------      --------------   -----
     0   month_year  100000 non-null  datetime64[ns]
     1   stationid   100000 non-null  object
     2   avgtemp     85554 non-null   float64
     3   latitude    100000 non-null  float64
     4   longitude   100000 non-null  float64
     5   elevation   100000 non-null  float64
     6   station     100000 non-null  object
     7   countryid   100000 non-null  object
     8   country     99995 non-null   object
    dtypes: datetime64[ns](1), float64(4), object(4)
    memory usage: 6.9+ MB

    The landtemps.isnull().sum() statement is an example of chaining methods. First, isnull returns a DataFrame of True and False values, resulting from testing whether each column value is null. The sum function takes that DataFrame and sums the True values for each column, interpreting the True values as 1 and the False values as 0. We would have obtained the same result if we had used the following two steps:

    checknull = landtemps.isnull()
    checknull.sum()

    There is no hard and fast rule for when to chain methods and when not to do so. I find chaining helpful when the overall operation feels like a single step, even if it’s two or more steps mechanically. Chaining also has the side benefit of not creating extra objects that I might not need.
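
    For instance, one more link could be added to the chain above to show the columns with the most missing values first; this sort step is my own addition, not part of the recipe:

    # chain sort_values onto the missing-value counts
    landtemps.isnull().sum().sort_values(ascending=False)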

    The dataset used in this recipe is just a sample from the full land temperatures database, with almost 17 million records. You can run the larger file if your machine can handle it, with the following code:

    landtemps = pd.read_csv('data/landtemps.zip',
        compression='zip',
        names=['stationid', 'year', 'month', 'avgtemp', 'latitude',
               'longitude', 'elevation', 'station', 'countryid', 'country'],
        skiprows=1,
        parse_dates=[['month', 'year']],
        low_memory=False)

    read_csv can read a compressed ZIP file. We get it to do this by passing the name of the ZIP file and the type of compression.
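
    As a small variation on the code above: the compression parameter of read_csv defaults to 'infer', so pandas can usually detect the compression from the file extension, assuming the ZIP archive contains a single file. Passing compression='zip' just makes the choice explicit:

    # compression='infer' is the default, so this also works for a single-file ZIP archive
    landtemps = pd.read_csv('data/landtemps.zip',
        names=['stationid', 'year', 'month', 'avgtemp', 'latitude',
               'longitude', 'elevation', 'station', 'countryid', 'country'],
        skiprows=1,
        parse_dates=[['month', 'year']],
        low_memory=False)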

    See also

    Subsequent recipes in this chapter, and in other chapters, set indexes to improve navigation over rows and merging.

    A significant amount of reshaping of the Global Historical Climatology Network raw data was done before using it in this recipe. We demonstrate this in Chapter 11, Tidying and Reshaping Data.

    Importing Excel files

    The read_excel method of the pandas library can be used to import data from an Excel file and load it into memory as a pandas DataFrame. In this recipe, we import an Excel file and handle some common issues when working with Excel files: extraneous header and footer information, selecting specific columns, removing rows with no data, and connecting to particular sheets.

    Despite the tabular structure of Excel, which invites the organization of data into rows and columns, spreadsheets are not datasets and do not require people to store data in that way. Even when some data conforms with those expectations, there is often additional information in rows or columns before or after the data to be imported. Data types are not always as clear as they are to the person who created the spreadsheet. This will be all too familiar to anyone who has ever battled with importing leading zeros. Moreover, Excel does not insist that all data in a column be of the same type, or that column headings be appropriate for use with a programming language such as Python.
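
    As one small, hedged illustration of the leading-zeros problem (the file and column names below are hypothetical, not part of this recipe), forcing a column to be read as a string preserves zeros that would otherwise be dropped when the values are interpreted as numbers:

    import pandas as pd

    # read a hypothetical spreadsheet of postal codes; without dtype, a value
    # such as 02906 would be imported as the integer 2906
    zips = pd.read_excel('data/postalcodes.xlsx', dtype={'zipcode': str})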

    Fortunately, read_excel has a number of options for handling messiness in Excel data. These options make it relatively easy to skip rows, select particular columns, and pull data from a particular sheet or sheets.
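
    For example, a minimal sketch of pulling data from one sheet or from every sheet at once (the sheet name matches the one used later in this recipe; reading all sheets with sheet_name=None is standard read_excel behavior):

    # read one named sheet into a DataFrame
    onesheet = pd.read_excel('data/GDPpercapita22b.xlsx',
        sheet_name='OECD.Stat export')

    # read every sheet; sheet_name=None returns a dict of DataFrames keyed by sheet name
    allsheets = pd.read_excel('data/GDPpercapita22b.xlsx', sheet_name=None)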

    Getting ready

    You can download the GDPpercapita22b.xlsx file, as well as the code for this recipe, from the GitHub repository for this book. The code assumes that the Excel file is in a data subfolder. Here is a view of the beginning of the file (some columns were hidden for display purposes):

    Figure 1.2: View of the dataset

    And here is a view of the end of the file:

    Figure 1.3: View of the dataset

    Data note

    This dataset, from the Organisation for Economic Co-operation and Development, is available for public use at https://stats.oecd.org/.

    How to do it…

    We import an Excel file into pandas and do some initial data cleaning:

    Import the pandas library:

    import pandas as pd

    Read the Excel per capita GDP data.

    Select the sheet with the data we need, but skip the columns and rows that we do not want. Use the sheet_name parameter to specify the sheet. Set skiprows to 4 and skipfooter to 1 to skip the first four rows (the first row is hidden) and the last row. We provide values for usecols to get data from column A and columns C through W (column B is blank). Use head to view the first few rows and shape to get the number of rows and columns:

    percapitaGDP = pd.read_excel('data/GDPpercapita22b.xlsx',
        sheet_name='OECD.Stat export',
        skiprows=4,
        skipfooter=1,
        usecols='A,C:W')

    percapitaGDP.head()

                           Year  2000  ...   2019   2020
    0        Metropolitan areas   NaN  ...    NaN    NaN
    1            AUS: Australia    ..  ...     ..     ..
    2     AUS01: Greater Sydney    ..  ...  45576  45152
    3  AUS02: Greater Melbourne    ..  ...  42299  40848
    4   AUS03: Greater Brisbane    ..  ...  42145  40741

    [5 rows x 22 columns]

    percapitaGDP.shape

    (731, 22)

    Note

    You may encounter a problem with read_excel if the Excel file does not use utf-8 encoding. One way to resolve this is to save the Excel file as a CSV file, reopen it, and then save it with utf-8 encoding.

    Use the info method of the DataFrame to view data types and the non-null count. Notice that all columns have the object data type:

    percapitaGDP.info()

    RangeIndex: 731 entries, 0 to 730
    Data columns (total 22 columns):
     #   Column  Non-Null Count  Dtype
    ---  ------  --------------  -----
     0   Year    731 non-null    object
     1   2000    730 non-null    object
     2   2001    730 non-null    object
     3   2002    730 non-null    object
     4   2003    730 non-null    object
     5   2004    730 non-null    object
     6   2005    730 non-null    object
     7   2006    730 non-null    object
     8   2007    730 non-null    object
     9   2008    730 non-null    object
     10  2009    730 non-null    object
     11  2010    730 non-null    object
     12  2011    730 non-null    object
     13  2012    730 non-null    object
     14  2013    730 non-null    object
     15  2014    730 non-null    object
     16  2015    730 non-null    object
     17  2016    730 non-null    object
     18  2017    730 non-null    object
     19  2018    730 non-null    object
     20  2019    730 non-null    object
     21  2020    730 non-null    object
    dtypes: object(22)
    memory usage: 125.8+ KB

    Rename the Year column to metro, and remove the leading spaces.

    Give an appropriate name to the metropolitan area column. There are extra spaces before the metro values in some cases. We can test for leading spaces with startswith(' ') and then use any to establish whether there are one or more occasions when the first character is blank. We can use endswith(' ') to examine trailing spaces. We use strip to remove both leading and trailing spaces. When we test for leading spaces again, we see that there are none:

    percapitaGDP.rename(columns={'Year':'metro'}, inplace=True)
    percapitaGDP.metro.str.startswith(' ').any()

    True

    percapitaGDP.metro.str.endswith(' ').any()

    False

    percapitaGDP.metro = percapitaGDP.metro.str.strip()
    percapitaGDP.metro.str.startswith(' ').any()

    False

    Convert the data columns to numeric.

    Iterate over all of the GDP year columns (2000–2020) and convert the data type from object to float. Coerce the conversion even when there is character data – the .. in this example. We want character values in those columns to become missing, which is what happens. Rename the year columns to better reflect the data in those columns:

    for col in percapitaGDP.columns[1:]:
        percapitaGDP[col] = pd.to_numeric(percapitaGDP[col],
            errors='coerce')
        percapitaGDP.rename(columns={col:'pcGDP'+col},
            inplace=True)

    percapitaGDP.head()

                          metro  pcGDP2000  pcGDP2001  ...  \
    0        Metropolitan areas        NaN        NaN  ...
    1            AUS: Australia        NaN        NaN  ...
    2     AUS01: Greater Sydney        NaN      41091  ...
    3  AUS02: Greater Melbourne        NaN      40488  ...
    4   AUS03: Greater Brisbane        NaN      35276  ...

       pcGDP2018  pcGDP2019  pcGDP2020
    0        NaN        NaN        NaN
    1        NaN        NaN        NaN
    2      47171      45576      45152
    3      43237      42299      40848
    4      44328      42145      40741

    [5 rows x 22 columns]

    percapitaGDP.dtypes

    metro         object
    pcGDP2000    float64
    pcGDP2001    float64
    ...abbreviated to save space...
    pcGDP2019    float64
    pcGDP2020    float64
    dtype: object

    Use the describe method to generate summary statistics for all numeric data in the DataFrame:

    percapitaGDP.describe()

           pcGDP2000  pcGDP2001  pcGDP2002  ...  pcGDP2018  pcGDP2019  pcGDP2020
    count        158        450        479  ...        692        596        425
    mean       33961      38874      39621  ...      41667      42709      39792
    std        15155      13194      13061  ...      17440      18893      19230
    min         2686       7805       7065  ...       5530       5698       5508
    25%        21523      30790      31064  ...      31322      29760      24142
    50%        35836      38078      39246  ...      41428      43505      41047
    75%        42804      46576      47874  ...      51130      53647      51130
    max        95221      96941      98929  ...     147760     146094     131082

    [8 rows x 21 columns]

    Remove rows where all of the per capita GDP values are missing.

    Use the subset parameter of dropna to inspect all columns, starting with the second column (it is zero-based) and going through to the last column. Use how to specify that we want to drop rows only if all of the columns specified in subset are missing. Use shape to show the number of rows and columns in the resulting DataFrame:

    percapitaGDP.dropna(subset=percapitaGDP.columns[1:], how='all', inplace=True)
    percapitaGDP.shape

    (692, 22)

    Set the index for the DataFrame using the metropolitan area column.

    Confirm that there are 692 valid values for metro and that there are 692 unique values, before setting the index:

    percapitaGDP.metro.count()

    692

    percapitaGDP.metro.nunique()

    692

    percapitaGDP.set_index('metro', inplace=True)
    percapitaGDP.head()

                           pcGDP2000  pcGDP2001  ...  \
    metro                                        ...
