
    Python Data Cleaning Cookbook

    Second Edition

    Prepare your data for analysis with pandas, NumPy, Matplotlib, scikit-learn, and OpenAI

    Michael Walker

    Python Data Cleaning Cookbook

    Second Edition

    Copyright © 2024 Packt Publishing

    All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

    Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

    Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

    Senior Publishing Product Manager: Gebin George

    Acquisition Editor – Peer Reviews: Gaurav Gavas

    Project Editor: Namrata Katare

    Content Development Editor: Deepayan Bhattacharjee

    Copy Editor: Safis Editing

    Technical Editor: Karan Sonawane

    Proofreader: Safis Editing

    Indexer: Subalakshmi Govindhan

    Presentation Designer: Ajay Patule

    Developer Relations Marketing Executive: Vignesh Raju

    First published: December 2020

    Second edition: May 2024

    Production reference: 1230524

    Published by Packt Publishing Ltd.

    Grosvenor House

    11 St Paul’s Square

    Birmingham

    B3 1RB, UK.

    ISBN: 978-1-80323-987-3

    www.packt.com

    Contributors

    About the author

    Michael Walker has worked as a data analyst for 37 years at various educational institutions. He has also taught data science, research methods, statistics, and computer programming to undergraduates since 2006. He is currently the Chief Information Officer at College Unbound in Providence, Rhode Island. Michael is also the author of Data Cleaning and Exploration with Machine Learning.

    I deeply appreciate the editors of this text, particularly Namrata Katare and Deepayan Bhattacharjee, and the technical reviewer, Kalyana Bedhu. They made this text so much better than it would have been otherwise. Any remaining issues are completely the fault of the author. I would also like to thank my partner over the last several decades, Karen Walker. She has had to listen to me go on and on about statistics and data, and the teaching of both topics, for all those years. Along the way, she patiently and lovingly taught me most of what I know about how to teach and be in community with other humans, even when that means putting aside some modeling or programming task for a few hours to focus on what matters most. For the sake of the reader, I hope this book reflects some fraction of her thoughtfulness and patience.

    About the reviewer

    Kalyana Bedhu, a recipient of patents and an author of award-winning papers and data science courses, is an Engineering Leader and Architect of Enterprise AI/ML at Fannie Mae. He has over 20 years of experience in applied AI/ML across various companies like Microsoft, Ericsson, Sony, Bosch, Fidelity, and Oracle. As a practitioner of Data Science at Ericsson, he helped set up a data science lab. He played a pivotal role in transforming a central IT organization into an AI and data science engine.

    Join our community on Discord

    Join our community’s Discord space for discussions with the author and other readers:

    https://discord.gg/p8uSgEAETX

    Preface

    This book is a practical guide to data cleaning, broadly defined as all tasks necessary to prepare data for analysis. It is organized by the tasks usually completed during the data-cleaning process: importing data, viewing data diagnostically, identifying outliers and unexpected values, imputing values, tidying data, and so on. Each recipe walks the reader from raw data through the completion of a specific data-cleaning task.

    There are already a number of very good pandas books. Unsurprisingly, there is some overlap between those texts and this one. However, the emphasis here is different. I focus as much on the why as on the how in this book.

    Since pandas is still relatively new, the lessons I have learned about cleaning data have been shaped by my experiences with other tools. Before settling into my current work routine with Python and R about 10 years ago, I relied mostly on C# and T-SQL in the early 2000s, SAS and Stata in the 90s, and FORTRAN and Pascal in the 80s. Most readers of this text probably have experience with a variety of data-cleaning and analysis tools. In many ways the specific tool is less significant than the data preparation task and the attributes of the data. I would have covered pretty much the same topics if I had been asked to write The SAS Data Cleaning Cookbook or The R Data Cleaning Cookbook. I just take a Python/pandas-specific approach to the same data-cleaning challenges that analysts have faced for decades.

    I start each chapter with how to think about the particular data-cleaning task at hand before discussing how to approach it with a tool from the Python ecosystem—pandas, NumPy, Matplotlib, and so on. This is reinforced in each recipe by a discussion of the implications of what we are uncovering in the data. I try to connect tool to purpose. For example, concepts like skewness and kurtosis matter as much for handling outliers as does knowing how to update pandas Series values.

    New in the Second Edition

    Readers of the first edition will recognize that this book is substantially longer than that one. That is partly because there are two new chapters: one devoted to treating missing values and another on pre-processing data for predictive analysis. The insufficient coverage of missing values and the absence of coverage of pre-processing data for machine learning applications in the first edition were important omissions. The pre-processing coverage is further improved by new recipes on data pipelines in the final chapter that take the reader from raw data to model evaluation.

    The recipes in all chapters have been revised to make sure that they work well with the most recent versions of pandas, which went from version 1.5.3 to 2.2.1 during the writing of this book. I have tried to make sure that all code works fine on all versions of pandas released from January 2023 through February 2024. Since AI tools are becoming increasingly common in our work, I have included discussion of OpenAI tools in four of the chapters. Altogether, 22 of the 82 recipes are new. All of the datasets used have also been updated.

    Who this book is for

    I had multiple audiences in mind as I wrote this book, but I most consistently thought about a dear friend of mine who bought a Transact-SQL book 30 years ago and quickly developed great confidence in her database work, ultimately building a career around those skills. I would love it if someone just starting their career as a data scientist or analyst worked through this book and had a similar experience as my friend. More than anything else, I want you to feel good and excited about what you can do as a result of reading this book.

    I also hope this book will be a useful reference for folks who have been doing this kind of work for a while. Here, I imagine someone opening the book and wondering to themself, What’s an approach to handling missing data that maintains the variance of my variable?

    In keeping with the hands-on nature of this text, every bit of output is reproducible with code in this book. I also stuck to a rule throughout, even when it was challenging. Every recipe starts with raw data largely unchanged from the original downloaded file. You go from data file to better prepared data in each recipe. If you have forgotten how a particular object was created, all you will ever need to do is turn back a page or two to see.

    Readers who have some knowledge of pandas and NumPy will have an easier time with some code blocks, as will folks with some knowledge of Python and introductory statistics. None of that is essential though. There are just some recipes you might want to pause over longer.

    What this book covers

    Chapter 1, Anticipating Data Cleaning Issues When Importing Tabular Data with pandas, explores tools for loading CSV files, Excel files, relational database tables, SAS, SPSS, Stata, and R files into pandas DataFrames.

    Chapter 2, Anticipating Data Cleaning Issues When Working with HTML, JSON, and Spark Data, discusses techniques for reading and normalizing JSON data, web scraping, and working with big data using Spark. It also explores techniques for persisting data, including with versioning.

    Chapter 3, Taking the Measure of Your Data, introduces common techniques for navigating around a DataFrame, selecting columns and rows, and generating summary statistics. The use of OpenAI tools for examining dataset structure and generating statistics is introduced.

    Chapter 4, Identifying Outliers in Subsets of Data, explores a wide range of strategies to identify outliers across a whole DataFrame and by selected groups.

    Chapter 5, Using Visualizations for the Identification of Unexpected Values, demonstrates the use of the Matplotlib and Seaborn tools to visualize how key variables are distributed, including with histograms, boxplots, scatter plots, line plots, and violin plots.

    Chapter 6, Cleaning and Exploring Data with Series Operations, discusses updating pandas Series with scalars, arithmetic operations, and conditional statements based on the values of one or more Series.

    Chapter 7, Identifying and Fixing Missing Values, goes over strategies for identifying missing values across rows and columns, and over subsets of data. It explores strategies for imputing values, such as setting values to the overall mean or the mean for a given category, and forward filling. It also examines multivariate techniques for imputing missing values and discusses when they are appropriate.

    Chapter 8, Encoding, Transforming, and Scaling Features, covers a range of variable transformation techniques to prepare features and targets for predictive analysis. This includes the most common kinds of encoding—one-hot, ordinal, and hashing encoding; transformations to improve the distribution of variables; and binning and scaling approaches to address skewness, kurtosis, and outliers and to adjust for features with widely different ranges.

    Chapter 9, Fixing Messy Data When Aggregating, demonstrates multiple approaches to aggregating data by group, including looping through data with itertuples or NumPy arrays, dropping duplicate rows, and using pandas' groupby and pivot tables. It also discusses when to choose one approach over the others.

    Chapter 10, Addressing Data Issues When Combining DataFrames, examines different strategies for concatenating and merging data, and how to anticipate common data challenges when combining data.

    Chapter 11, Tidying and Reshaping Data, introduces several strategies for de-duplicating, stacking, melting, and pivoting data.

    Chapter 12, Automate Data Cleaning with User-Defined Functions and Classes and Pipelines, examines how to turn many of the techniques from the first 11 chapters into reusable code.

    Download the example code files

    The code bundle for the book is hosted on GitHub at https://github.com/PacktPublishing/Python-Data-Cleaning-Cookbook-Second-Edition. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

    Download the color images

    We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: https://packt.link/gbp/9781803239873.

    Conventions used

    There are a number of text conventions used throughout this book.

    Code in text: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: "Mount the downloaded WebStorm-10*.dmg disk image file as another disk in your system."

    A block of code is set as follows:

    import pandas as pd
    import os
    import sys
    nls97 = pd.read_csv("data/nls97g.csv", low_memory=False)
    nls97.set_index('personid', inplace=True)

    Any output from the code will appear like this:

           satverbal  satmath
    min           14        7
    per15        390      390
    qr1          430      430
    med          500      500
    qr3          570      580
    per85        620      621
    max          800      800
    count      1,406    1,407
    mean         500      501
    iqr          140      150

    Bold: Indicates a new term, an important word, or words that you see onscreen. For example, words in menus or dialog boxes appear in the text like this. Here is an example: "Select System info from the Administration panel."

    Warnings or important notes appear like this.

    Tips and tricks appear like this.

    Get in touch

    Feedback from our readers is always welcome.

    General feedback: Email feedback@packtpub.com and mention the book’s title in the subject of your message. If you have questions about any aspect of this book, please email us at questions@packtpub.com.

    Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you reported this to us. Please visit http://www.packtpub.com/submit-errata, click Submit Errata, and fill in the form.

    Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at copyright@packtpub.com with a link to the material.

    If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit http://authors.packtpub.com.

    Share your thoughts

    Once you’ve read Python Data Cleaning Cookbook, Second Edition, we’d love to hear your thoughts! Please click here to go straight to the Amazon review page for this book and share your feedback.

    Your review is important to us and the tech community and will help us make sure we’re delivering excellent quality content.

    Download a free PDF copy of this book

    Thanks for purchasing this book!

    Do you like to read on the go but are unable to carry your print books everywhere?

    Is your eBook purchase not compatible with the device of your choice?

    Don’t worry, now with every Packt book you get a DRM-free PDF version of that book at no cost.

    Read anywhere, any place, on any device. Search, copy, and paste code from your favorite technical books directly into your application.

    The perks don’t stop there, you can get exclusive access to discounts, newsletters, and great free content in your inbox daily.

    Follow these simple steps to get the benefits:

    Scan the QR code or visit the link below:

    https://packt.link/free-ebook/9781803239873

    Submit your proof of purchase.

    That’s it! We’ll send your free PDF and other benefits to your email directly.

    1

    Anticipating Data Cleaning Issues When Importing Tabular Data with pandas

    Scientific distributions of Python (Anaconda, WinPython, Canopy, and so on) provide analysts with an impressive range of data manipulation, exploration, and visualization tools. One important tool is pandas. Developed by Wes McKinney in 2008, but really gaining in popularity after 2012, pandas is now an essential library for data analysis in Python. The recipes in this book demonstrate how many common data preparation tasks can be done more easily with pandas than with other tools. While we work with pandas extensively in this book, we also use other popular packages such as NumPy, Matplotlib, and SciPy.

    A key pandas object is the DataFrame, which represents data as a tabular structure, with rows and columns. In this way, it is similar to the other data stores we discuss in this chapter. However, a pandas DataFrame also has indexing functionality that makes selecting, combining, and transforming data relatively straightforward, as the recipes in this book will demonstrate.
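
    As a quick sketch of that indexing functionality (the values below are invented for illustration and are not from the book's datasets), label-based selection with loc might look like this:

    import pandas as pd

    # a small, made-up DataFrame with a station identifier as the index
    temps = pd.DataFrame(
        {'avgtemp': [12.2, 3.4, 21.0],
         'country': ['United States', 'Chile', 'Australia']},
        index=['USS0010K01S', 'CI000085406', 'ASN00024002'])

    # select one value by row label and column name
    temps.loc['CI000085406', 'avgtemp']          # 3.4

    # select rows that meet a condition on a column
    temps.loc[temps['country'] == 'Australia']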

    Before we can make use of this great functionality, we have to import our data into pandas. Data comes to us in a wide variety of formats: as CSV or Excel files, as tables from SQL databases, from statistical analysis packages such as SPSS, Stata, SAS, or R, from non-tabular sources such as JSON, and from web pages.

    We examine tools for importing tabular data in this chapter. Specifically, we cover the following topics:

    Importing CSV files

    Importing Excel files

    Importing data from SQL databases

    Importing SPSS, Stata, and SAS data

    Importing R data

    Persisting tabular data

    Technical requirements

    The code and notebooks for this chapter are available on GitHub at https://github.com/michaelbwalker/Python-Data-Cleaning-Cookbook-Second-Edition. You can use any IDE (Integrated Development Environment) of your choice – IDLE, Visual Studio, Sublime, Spyder, and so on – or Jupyter Notebook to work with any of the code in this chapter, or any chapter in this book. A good guide to get started with Jupyter Notebook can be found here: https://www.dataquest.io/blog/jupyter-notebook-tutorial/. I used the Spyder IDE to write the code in this chapter.

    I used pandas 2.2.1 and NumPy version 1.24.3 for all of the code in this chapter and subsequent chapters. I have also tested all code with pandas 1.5.3.

    Importing CSV files

    The read_csv method of the pandas library can be used to read a file with comma-separated values (CSV) and load it into memory as a pandas DataFrame. In this recipe, we import a CSV file and address some common issues: creating column names that make sense to us, parsing dates, and dropping rows with critical missing data.

    Raw data is often stored as CSV files. These files have a carriage return at the end of each line of data to demarcate a row, and a comma between each data value to delineate columns. Something other than a comma can be used as the delimiter, such as a tab. Quotation marks may be placed around values, which can be helpful when the delimiter occurs naturally within certain values, which sometimes happens with commas.
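
    As a small sketch of that quoting behavior (the in-memory CSV string and its values below are invented for illustration, not taken from this recipe's data), a comma inside a quoted value is not treated as a delimiter:

    import io
    import pandas as pd

    # a made-up two-row CSV string; the station name contains a comma,
    # so it is wrapped in quotation marks and read as a single value
    rawcsv = 'stationid,station,avgtemp\nUSW00024151,"Portland, OR",11.5\n'
    pd.read_csv(io.StringIO(rawcsv))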

    All data in a CSV file are characters, regardless of the logical data type. This is why it is easy to view a CSV file, presuming it is not too large, in a text editor. The pandas read_csv method will make an educated guess about the data type of each column, but you will need to help it along to ensure that these guesses are on the mark.

    Getting ready

    Create a folder for this chapter, and then create a new Python script or Jupyter Notebook file in that folder. Create a data subfolder, and then place the landtempssample.csv file in that subfolder. Alternatively, you could retrieve all of the files from the GitHub repository, including the data files. Here is a screenshot of the beginning of the CSV file:


    Figure 1.1: Land Temperatures Data

    Data note

    This dataset, taken from the Global Historical Climatology Network integrated database, is made available for public use by the United States National Oceanic and Atmospheric Administration at https://www.ncei.noaa.gov/products/land-based-station/global-historical-climatology-network-monthly. I used the data from version 4. The data in this recipe uses a 100,000-row sample of the full dataset, which is also available in the repository.

    How to do it…

    We will import a CSV file into pandas, taking advantage of some very useful read_csv options:

    Import the pandas library, and set up the environment to make viewing the output easier:

    import pandas as pd
    pd.options.display.float_format = '{:,.2f}'.format
    pd.set_option('display.width', 85)
    pd.set_option('display.max_columns', 8)

    Read the data file, set new names for the headings, and parse the date column.

    Pass an argument of 1 to the skiprows parameter to skip the first row, pass a list of columns to parse_dates to create a pandas datetime column from those columns, and set low_memory to False. This will cause pandas to load all of the data into memory at once, rather than in chunks. We do this so that pandas can identify the data type of each column automatically. In the There’s more… section, we see how to set the data type for each column manually:

    landtemps = pd.read_csv('data/landtempssample.csv',
        names=['stationid', 'year', 'month', 'avgtemp', 'latitude',
               'longitude', 'elevation', 'station', 'countryid', 'country'],
        skiprows=1,
        parse_dates=[['month', 'year']],
        low_memory=False)

    type(landtemps)

    Note

    We have to use skiprows because we are passing a list of column names to read_csv. If we use the column names in the CSV file, we do not need to specify values for either names or skiprows.
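
    For example, a minimal sketch of reading the same file while keeping whatever headings are already in its first row (so neither names nor skiprows is passed; header=0 is the read_csv default):

    # keep the column headings from the first row of the CSV file
    landtemps_origcols = pd.read_csv('data/landtempssample.csv',
        low_memory=False)
    landtemps_origcols.columns    # shows the headings the file itself uses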

    Get a quick glimpse of the data.

    View the first few rows. Show the data type for all columns, as well as the number of rows and columns:

    landtemps.head(7)

      month_year    stationid  ...  countryid        country
    0 2000-04-01  USS0010K01S  ...         US  United States
    1 1940-05-01  CI000085406  ...         CI          Chile
    2 2013-12-01  USC00036376  ...         US  United States
    3 1963-02-01  ASN00024002  ...         AS      Australia
    4 2001-11-01  ASN00028007  ...         AS      Australia
    5 1991-04-01  USW00024151  ...         US  United States
    6 1993-12-01  RSM00022641  ...         RS         Russia

    [7 rows x 9 columns]

    landtemps.dtypes

    month_year    datetime64[ns]
    stationid             object
    avgtemp              float64
    latitude             float64
    longitude            float64
    elevation            float64
    station               object
    countryid             object
    country               object
    dtype: object

    landtemps.shape

    (100000, 9)

    Give the date column a more appropriate name and view the summary statistics for average monthly temperature:

    landtemps.rename(columns={'month_year':'measuredate'}, inplace=True)
    landtemps.dtypes

    measuredate    datetime64[ns]
    stationid              object
    avgtemp               float64
    latitude              float64
    longitude             float64
    elevation             float64
    station                object
    countryid              object
    country                object
    dtype: object

    landtemps.avgtemp.describe()

    count   85,554.00
    mean        10.92
    std         11.52
    min        -70.70
    25%          3.46
    50%         12.22
    75%         19.57
    max         39.95
    Name: avgtemp, dtype: float64

    Look for missing values for each column.

    Use isnull, which returns True for each value that is missing for each column, and False when not missing. Chain this with sum to count the missing values for each column. (When working with Boolean values, sum treats True as 1 and False as 0. I will discuss method chaining in the There’s more... section of this recipe):

    landtemps.isnull().sum()

    measuredate        0
    stationid          0
    avgtemp        14446
    latitude           0
    longitude          0
    elevation          0
    station            0
    countryid          0
    country            5
    dtype: int64

    Remove rows with missing data for avgtemp.

    Use the subset parameter to tell dropna to drop rows when avgtemp is missing. Set inplace to True. Leaving inplace at its default value of False would display the DataFrame, but the changes we have made would not be retained. Use the shape attribute of the DataFrame to get the number of rows and columns:

    landtemps.dropna(subset=['avgtemp'], inplace=True)
    landtemps.shape

    (85554, 9)

    That’s it! Importing CSV files into pandas is as simple as that.

    How it works...

    Almost all of the recipes in this book use the pandas library. We refer to it as pd to make it easier to reference later. This is customary. We also use float_format to display float values in a readable way and set_option to make the Terminal output wide enough to accommodate the number of variables.

    Much of the work is done by the first line in Step 2. We use read_csv to load a pandas DataFrame in memory and call it landtemps. In addition to passing a filename, we set the names parameter to a list of our preferred column headings. We also tell read_csv to skip the first row, by setting skiprows to 1, since the original column headings are in the first row of the CSV file. If we do not tell it to skip the first row, read_csv will treat the header row in the file as actual data.

    read_csv also solves a date conversion issue for us. We use the parse_dates parameter to ask it to convert the month and year columns to a date value.

    Step 3 runs through a few standard data checks. We use head(7) to print out all columns for the first seven rows. We use the dtypes attribute of the DataFrame to show the data type of all columns. Each column has the expected data type. In pandas, character data has the object data type, a data type that allows for mixed values. shape returns a tuple, whose first element is the number of rows in the DataFrame (100,000 in this case) and whose second element is the number of columns (9).

    When we used read_csv to parse the month and year columns, it gave the resulting column the name month_year. We used the rename method in Step 4 to give that column a more appropriate name. We need to specify inplace=True to replace the old column name with the new column name in memory. The describe method provides summary statistics on the avgtemp column.

    Notice that the count for avgtemp indicates that there are 85,554 rows that have valid values for avgtemp. This is out of 100,000 rows for the whole DataFrame, as provided by the shape attribute. The listing of missing values for each column in Step 5 (landtemps.isnull().sum()) confirms this: 100,000 – 85,554 = 14,446.

    Step 6 drops all rows where avgtemp is NaN. (The NaN value, not a number, is the pandas representation of missing values.) subset is used to indicate which column to check for missing values. The shape attribute for landtemps now indicates that there are 85,554 rows, which is what we would expect, given the previous count from describe.

    There’s more...

    If the file you are reading uses a delimiter other than a comma, such as a tab, this can be specified in the sep parameter of read_csv. When creating the pandas DataFrame, an index was also created. The numbers to the far left of the output when head was run are index values. Any number of rows can be specified for head. The default value is 5.
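
    As a minimal sketch (the tab-delimited file name below is hypothetical, used only to illustrate the sep parameter):

    # read a hypothetical tab-delimited version of the sample file
    landtemps_tab = pd.read_csv('data/landtempssample.tsv', sep='\t')

    # head accepts any number of rows; the default is 5
    landtemps_tab.head(3)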

    Instead of setting low_memory to False, to get pandas to make good guesses regarding data types, we could have set data types manually:

    landtemps = pd.read_csv('data/landtempssample.csv',
        names=['stationid', 'year', 'month', 'avgtemp', 'latitude',
               'longitude', 'elevation', 'station', 'countryid', 'country'],
        skiprows=1,
        parse_dates=[['month', 'year']],
        dtype={'stationid':'object', 'avgtemp':'float64',
               'latitude':'float64', 'longitude':'float64',
               'elevation':'float64', 'station':'object',
               'countryid':'object', 'country':'object'},
        )

    landtemps.info()

    RangeIndex: 100000 entries, 0 to 99999
    Data columns (total 9 columns):
     #   Column      Non-Null Count   Dtype
    ---  ------      --------------   -----
     0   month_year  100000 non-null  datetime64[ns]
     1   stationid   100000 non-null  object
     2   avgtemp     85554 non-null   float64
     3   latitude    100000 non-null  float64
     4   longitude   100000 non-null  float64
     5   elevation   100000 non-null  float64
     6   station     100000 non-null  object
     7   countryid   100000 non-null  object
     8   country     99995 non-null   object
    dtypes: datetime64[ns](1), float64(4), object(4)
    memory usage: 6.9+ MB

    The landtemps.isnull().sum() statement is an example of chaining methods. First, isnull returns a DataFrame of True and False values, resulting from testing whether each column value is null. The sum function takes that DataFrame and sums the True values for each column, interpreting the True values as 1 and the False values as 0. We would have obtained the same result if we had used the following two steps:

    checknull = landtemps.isnull()
    checknull.sum()

    There is no hard and fast rule for when to chain methods and when not to do so. I find chaining helpful when the overall operation feels like a single step, even if it’s two or more steps mechanically. Chaining also has the side benefit of not creating extra objects that I might not need.
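
    For instance, one more link could be added to the chain above to show the columns with the most missing values first; this sort step is my own addition, not part of the recipe:

    # chain sort_values onto the missing-value counts
    landtemps.isnull().sum().sort_values(ascending=False)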

    The dataset used in this recipe is just a sample from the full land temperatures database, with almost 17 million records. You can run the larger file if your machine can handle it, with the following code:

    landtemps = pd.read_csv('data/landtemps.zip',
        compression='zip',
        names=['stationid', 'year', 'month', 'avgtemp', 'latitude',
               'longitude', 'elevation', 'station', 'countryid', 'country'],
        skiprows=1,
        parse_dates=[['month', 'year']],
        low_memory=False)

    read_csv can read a compressed ZIP file. We get it to do this by passing the name of the ZIP file and the type of compression.
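
    As a small variation on the code above: the compression parameter of read_csv defaults to 'infer', so pandas can usually detect the compression from the file extension, assuming the ZIP archive contains a single file. Passing compression='zip' just makes the choice explicit:

    # compression='infer' is the default, so this also works for a single-file ZIP archive
    landtemps = pd.read_csv('data/landtemps.zip',
        names=['stationid', 'year', 'month', 'avgtemp', 'latitude',
               'longitude', 'elevation', 'station', 'countryid', 'country'],
        skiprows=1,
        parse_dates=[['month', 'year']],
        low_memory=False)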

    See also

    Subsequent recipes in this chapter, and in other chapters, set indexes to improve navigation over rows and merging.

    A significant amount of reshaping of the Global Historical Climatology Network raw data was done before using it in this recipe. We demonstrate this in Chapter 11, Tidying and Reshaping Data.

    Importing Excel files

    The read_excel method of the pandas library can be used to import data from an Excel file and load it into memory as a pandas DataFrame. In this recipe, we import an Excel file and handle some common issues when working with Excel files: extraneous header and footer information, selecting specific columns, removing rows with no data, and connecting to particular sheets.

    Despite the tabular structure of Excel, which invites the organization of data into rows and columns, spreadsheets are not datasets and do not require people to store data in that way. Even when some data conforms with those expectations, there is often additional information in rows or columns before or after the data to be imported. Data types are not always as clear as they are to the person who created the spreadsheet. This will be all too familiar to anyone who has ever battled with importing leading zeros. Moreover, Excel does not insist that all data in a column be of the same type, or that column headings be appropriate for use with a programming language such as Python.
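
    As one small, hedged illustration of the leading-zeros problem (the file and column names below are hypothetical, not part of this recipe), forcing a column to be read as a string preserves zeros that would otherwise be dropped when the values are interpreted as numbers:

    import pandas as pd

    # read a hypothetical spreadsheet of postal codes; without dtype, a value
    # such as 02906 would be imported as the integer 2906
    zips = pd.read_excel('data/postalcodes.xlsx', dtype={'zipcode': str})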

    Fortunately, read_excel has a number of options for handling messiness in Excel data. These options make it relatively easy to skip rows, select particular columns, and pull data from a particular sheet or sheets.
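
    For example, a minimal sketch of pulling data from one sheet or from every sheet at once (the sheet name matches the one used later in this recipe; reading all sheets with sheet_name=None is standard read_excel behavior):

    # read one named sheet into a DataFrame
    onesheet = pd.read_excel('data/GDPpercapita22b.xlsx',
        sheet_name='OECD.Stat export')

    # read every sheet; sheet_name=None returns a dict of DataFrames keyed by sheet name
    allsheets = pd.read_excel('data/GDPpercapita22b.xlsx', sheet_name=None)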

    Getting ready

    You can download the GDPpercapita22b.xlsx file, as well as the code for this recipe, from the GitHub repository for this book. The code assumes that the Excel file is in a data subfolder. Here is a view of the beginning of the file (some columns were hidden for display purposes):

    Figure 1.2: View of the dataset

    And here is a view of the end of the file:

    Figure 1.3: View of the dataset

    Data note

    This dataset, from the Organisation for Economic Co-operation and Development, is available for public use at https://stats.oecd.org/.

    How to do it…

    We import an Excel file into pandas and do some initial data cleaning:

    Import the pandas library:

    import pandas as pd

    Read the Excel per capita GDP data.

    Select the sheet with the data we need, but skip the columns and rows that we do not want. Use the sheet_name parameter to specify the sheet. Set skiprows to 4 and skipfooter to 1 to skip the first four rows (the first row is hidden) and the last row. We provide values for usecols to get data from column A and columns C through W (column B is blank). Use head to view the first few rows and shape to get the number of rows and columns:

    percapitaGDP = pd.read_excel('data/GDPpercapita22b.xlsx',
        sheet_name='OECD.Stat export',
        skiprows=4,
        skipfooter=1,
        usecols='A,C:W')

    percapitaGDP.head()

                           Year  2000  ...   2019   2020
    0        Metropolitan areas   NaN  ...    NaN    NaN
    1            AUS: Australia    ..  ...     ..     ..
    2     AUS01: Greater Sydney    ..  ...  45576  45152
    3  AUS02: Greater Melbourne    ..  ...  42299  40848
    4   AUS03: Greater Brisbane    ..  ...  42145  40741

    [5 rows x 22 columns]

    percapitaGDP.shape

    (731, 22)

    Note

    You may encounter a problem with read_excel if the Excel file does not use utf-8 encoding. One way to resolve this is to save the Excel file as a CSV file, reopen it, and then save it with utf-8 encoding.

    Use the info method of the DataFrame to view data types and the non-null count. Notice that all columns have the object data type:

    percapitaGDP.info()

    RangeIndex: 731 entries, 0 to 730
    Data columns (total 22 columns):
     #   Column  Non-Null Count  Dtype
    ---  ------  --------------  -----
     0   Year    731 non-null    object
     1   2000    730 non-null    object
     2   2001    730 non-null    object
     3   2002    730 non-null    object
     4   2003    730 non-null    object
     5   2004    730 non-null    object
     6   2005    730 non-null    object
     7   2006    730 non-null    object
     8   2007    730 non-null    object
     9   2008    730 non-null    object
     10  2009    730 non-null    object
     11  2010    730 non-null    object
     12  2011    730 non-null    object
     13  2012    730 non-null    object
     14  2013    730 non-null    object
     15  2014    730 non-null    object
     16  2015    730 non-null    object
     17  2016    730 non-null    object
     18  2017    730 non-null    object
     19  2018    730 non-null    object
     20  2019    730 non-null    object
     21  2020    730 non-null    object
    dtypes: object(22)
    memory usage: 125.8+ KB

    Rename the Year column to metro, and remove the leading spaces.

    Give an appropriate name to the metropolitan area column. There are extra spaces before the metro values in some cases. We can test for leading spaces with startswith(' ') and then use any to establish whether there are one or more occasions when the first character is blank. We can use endswith(' ') to examine trailing spaces. We use strip to remove both leading and trailing spaces. When we test for leading spaces again, we see that there are none:

    percapitaGDP.rename(columns={'Year':'metro'}, inplace=True)
    percapitaGDP.metro.str.startswith(' ').any()

    True

    percapitaGDP.metro.str.endswith(' ').any()

    False

    percapitaGDP.metro = percapitaGDP.metro.str.strip()
    percapitaGDP.metro.str.startswith(' ').any()

    False

    Convert the data columns to numeric.

    Iterate over all of the GDP year columns (2000–2020) and convert the data type from object to float. Coerce the conversion even when there is character data – the .. in this example. We want character values in those columns to become missing, which is what happens. Rename the year columns to better reflect the data in those columns:

    for col in percapitaGDP.columns[1:]:
        percapitaGDP[col] = pd.to_numeric(percapitaGDP[col],
            errors='coerce')
        percapitaGDP.rename(columns={col:'pcGDP'+col},
            inplace=True)

    percapitaGDP.head()

                          metro  pcGDP2000  pcGDP2001  ...  \
    0        Metropolitan areas        NaN        NaN  ...
    1            AUS: Australia        NaN        NaN  ...
    2     AUS01: Greater Sydney        NaN      41091  ...
    3  AUS02: Greater Melbourne        NaN      40488  ...
    4   AUS03: Greater Brisbane        NaN      35276  ...

       pcGDP2018  pcGDP2019  pcGDP2020
    0        NaN        NaN        NaN
    1        NaN        NaN        NaN
    2      47171      45576      45152
    3      43237      42299      40848
    4      44328      42145      40741

    [5 rows x 22 columns]

    percapitaGDP.dtypes

    metro         object
    pcGDP2000    float64
    pcGDP2001    float64
    ...abbreviated to save space...
    pcGDP2019    float64
    pcGDP2020    float64
    dtype: object

    Use the describe method to generate summary statistics for all numeric data in the DataFrame:

    percapitaGDP.describe()

           pcGDP2000  pcGDP2001  pcGDP2002  ...  pcGDP2018  pcGDP2019  pcGDP2020
    count        158        450        479  ...        692        596        425
    mean       33961      38874      39621  ...      41667      42709      39792
    std        15155      13194      13061  ...      17440      18893      19230
    min         2686       7805       7065  ...       5530       5698       5508
    25%        21523      30790      31064  ...      31322      29760      24142
    50%        35836      38078      39246  ...      41428      43505      41047
    75%        42804      46576      47874  ...      51130      53647      51130
    max        95221      96941      98929  ...     147760     146094     131082

    [8 rows x 21 columns]

    Remove rows where all of the per capita GDP values are missing.

    Use the subset parameter of dropna to inspect all columns, starting with the second column (it is zero-based) and going through to the last column. Use how to specify that we want to drop rows only if all of the columns specified in subset are missing. Use shape to show the number of rows and columns in the resulting DataFrame:

    percapitaGDP.dropna(subset=percapitaGDP.columns[1:], how='all', inplace=True)
    percapitaGDP.shape

    (692, 22)

    Set the index for the DataFrame using the metropolitan area column.

    Confirm that there are 692 valid values for metro and that there are 692 unique values, before setting the index:

    percapitaGDP.metro.count()

    692

    percapitaGDP.metro.nunique()

    692

    percapitaGDP.set_index('metro', inplace=True)
    percapitaGDP.head()

                           pcGDP2000  pcGDP2001  ...  \
    metro                                        ...
