Learning pandas
About this ebook
- Use pandas for data analysis so that you can focus more on analysis and less on programming
- Get comfortable performing data exploration and analysis in Python using pandas
- Step-by-step demonstration of using Python and pandas with interactive and incremental examples to facilitate learning
If you are a Python programmer who wants to get started with performing data analysis using pandas and Python, this is the book for you. Some experience with statistical analysis would be helpful but is not mandatory.
Table of Contents
Learning pandas
Credits
About the Author
About the Reviewers
www.PacktPub.com
Support files, eBooks, discount offers, and more
Why subscribe?
Free access for Packt account holders
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Downloading the color images of this book
Errata
Piracy
Questions
1. A Tour of pandas
pandas and why it is important
pandas and IPython Notebooks
Referencing pandas in the application
Primary pandas objects
The pandas Series object
The pandas DataFrame object
Loading data from files and the Web
Loading CSV data from files
Loading data from the Web
Simplicity of visualization of pandas data
Summary
2. Installing pandas
Getting Anaconda
Installing Anaconda
Installing Anaconda on Linux
Installing Anaconda on Mac OS X
Installing Anaconda on Windows
Ensuring pandas is up to date
Running a small pandas sample in IPython
Starting the IPython Notebook server
Installing and running IPython Notebooks
Using Wakari for pandas
Summary
3. NumPy for pandas
Installing and importing NumPy
Benefits and characteristics of NumPy arrays
Creating NumPy arrays and performing basic array operations
Selecting array elements
Logical operations on arrays
Slicing arrays
Reshaping arrays
Combining arrays
Splitting arrays
Useful numerical methods of NumPy arrays
Summary
4. The pandas Series Object
The Series object
Importing pandas
Creating Series
Size, shape, uniqueness, and counts of values
Peeking at data with heads, tails, and take
Looking up values in Series
Alignment via index labels
Arithmetic operations
The special case of Not-A-Number (NaN)
Boolean selection
Reindexing a Series
Modifying a Series in-place
Slicing a Series
Summary
5. The pandas DataFrame Object
Creating DataFrame from scratch
Example data
S&P 500
Monthly stock historical prices
Selecting columns of a DataFrame
Selecting rows and values of a DataFrame using the index
Slicing using the [] operator
Selecting rows by index label and location: .loc[] and .iloc[]
Selecting rows by index label and/or location: .ix[]
Scalar lookup by label or location using .at[] and .iat[]
Selecting rows of a DataFrame by Boolean selection
Modifying the structure and content of DataFrame
Renaming columns
Adding and inserting columns
Replacing the contents of a column
Deleting columns in a DataFrame
Adding rows to a DataFrame
Appending rows with .append()
Concatenating DataFrame objects with pd.concat()
Adding rows (and columns) via setting with enlargement
Removing rows from a DataFrame
Removing rows using .drop()
Removing rows using Boolean selection
Removing rows using a slice
Changing scalar values in a DataFrame
Arithmetic on a DataFrame
Resetting and reindexing
Hierarchical indexing
Summarized data and descriptive statistics
Summary
6. Accessing Data
Setting up the IPython notebook
CSV and Text/Tabular format
The sample CSV data set
Reading a CSV file into a DataFrame
Specifying the index column when reading a CSV file
Data type inference and specification
Specifying column names
Specifying specific columns to load
Saving DataFrame to a CSV file
General field-delimited data
Handling noise rows in field-delimited data
Reading and writing data in an Excel format
Reading and writing JSON files
Reading HTML data from the Web
Reading and writing HDF5 format files
Accessing data on the web and in the cloud
Reading and writing from/to SQL databases
Reading data from remote data services
Reading stock data from Yahoo! and Google Finance
Retrieving data from Yahoo! Finance Options
Reading economic data from the Federal Reserve Bank of St. Louis
Accessing Kenneth French's data
Reading from the World Bank
Summary
7. Tidying Up Your Data
What is tidying your data?
Setting up the IPython notebook
Working with missing data
Determining NaN values in Series and DataFrame objects
Selecting out or dropping missing data
How pandas handles NaN values in mathematical operations
Filling in missing data
Forward and backward filling of missing values
Filling using index labels
Interpolation of missing values
Handling duplicate data
Transforming Data
Mapping
Replacing values
Applying functions to transform data
Summary
8. Combining and Reshaping Data
Setting up the IPython notebook
Concatenating data
Merging and joining data
An overview of merges
Specifying the join semantics of a merge operation
Pivoting
Stacking and unstacking
Stacking using nonhierarchical indexes
Unstacking using hierarchical indexes
Melting
Performance benefits of stacked data
Summary
9. Grouping and Aggregating Data
Setting up the IPython notebook
The split, apply, and combine (SAC) pattern
Split
Data for the examples
Grouping by a single column's values
Accessing the results of grouping
Grouping using index levels
Apply
Applying aggregation functions to groups
The transformation of group data
An overview of transformation
Practical examples of transformation
Filtering groups
Discretization and Binning
Summary
10. Time-series Data
Setting up the IPython notebook
Representation of dates, time, and intervals
The datetime, day, and time objects
Timestamp objects
Timedelta
Introducing time-series data
DatetimeIndex
Creating time-series data with specific frequencies
Calculating new dates using offsets
Date offsets
Anchored offsets
Representing durations of time using Period objects
The Period object
PeriodIndex
Handling holidays using calendars
Normalizing timestamps using time zones
Manipulating time-series data
Shifting and lagging
Frequency conversion
Up and down resampling
Time-series moving-window operations
Summary
11. Visualization
Setting up the IPython notebook
Plotting basics with pandas
Creating time-series charts with .plot()
Adorning and styling your time-series plot
Adding a title and changing axes labels
Specifying the legend content and position
Specifying line colors, styles, thickness, and markers
Specifying tick mark locations and tick labels
Formatting axes tick date labels using formatters
Common plots used in statistical analyses
Bar plots
Histograms
Box and whisker charts
Area plots
Scatter plots
Density plot
The scatter plot matrix
Heatmaps
Multiple plots in a single chart
Summary
12. Applications to Finance
Setting up the IPython notebook
Obtaining and organizing stock data from Yahoo!
Plotting time-series prices
Plotting volume-series data
Calculating the simple daily percentage change
Calculating simple daily cumulative returns
Resampling data from daily to monthly returns
Analyzing distribution of returns
Performing a moving-average calculation
The comparison of average daily returns across stocks
The correlation of stocks based on the daily percentage change of the closing price
Volatility calculation
Determining risk relative to expected returns
Summary
Index
Learning pandas
Copyright © 2015 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
First published: April 2015
Production reference: 1090415
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.
ISBN 978-1-78398-512-8
www.packtpub.com
Credits
Author
Michael Heydt
Reviewers
Bill Chambers
S. Shelly Jang
Arun Karunagath Rajeevan
Daniel Velkov
Adrian Wan
Commissioning Editor
Kartikey Pandey
Acquisition Editor
Neha Nagwekar
Content Development Editor
Akshay Nair
Technical Editors
Shashank Desai
Chinmay Puranik
Copy Editors
Roshni Banerjee
Pranjali Chury
Stuti Srivastava
Project Coordinator
Mary Alex
Proofreaders
Simran Bhogal
Paul Hindle
Linda Morris
Christopher Smith
Indexer
Monica Ajmera Mehta
Graphics
Sheetal Aute
Production Coordinator
Arvindkumar Gupta
Cover Work
Arvindkumar Gupta
About the Author
Michael Heydt is an independent consultant, educator, and trainer with nearly 30 years of professional software development experience, during which he focused on agile software design and implementation using advanced technologies in multiple verticals, including media, finance, energy, and healthcare. He holds an MS degree in mathematics and computer science from Drexel University and an executive master's of technology management degree from the University of Pennsylvania's School of Engineering and Wharton Business School. His studies and research have focused on technology management, software engineering, entrepreneurship, information retrieval, data sciences, and computational finance. Since 2005, he has been specializing in building energy and financial trading systems for major investment banks on Wall Street and for several global energy trading companies, utilizing .NET, C#, WPF, TPL, DataFlow, Python, R, Mono, iOS, and Android. His current interests include creating seamless applications using desktop, mobile, and wearable technologies, which utilize high concurrency, high availability, real-time data analytics, augmented and virtual reality, cloud services, messaging, computer vision, natural user interfaces, and software-defined networks. He is the author of numerous technology articles, papers, and books (Instant Lucene.NET, Learning pandas). He is a frequent speaker at .NET users' groups and various mobile and cloud conferences, and he regularly delivers webinars on advanced technologies.
About the Reviewers
Bill Chambers is a Python developer and data scientist currently pursuing a master of information management and systems degree at the UC Berkeley School of Information. Previously, he focused on data architecture and systems using marketing, sales, and customer analytics data. Bill is passionate about delivering actionable insights and innovative solutions using data.
You can find more information about him at http://www.billchambers.me.
S. Shelly Jang received her PhD degree in electrical engineering from the University of Washington and a master's degree in chemical and biological engineering from the University of British Columbia in 2014 and 2009, respectively. She was an Insight Data Science fellow in 2014. During her tenure, she built a web app that recommends crowd-verified treatment options for various medical conditions. She is currently a senior data scientist at AT&T Big Data. Exploring complex, large-scale data sets to build models and derive insights is just a part of her job.
In her free time, she participates in the Quantified Self community, sharing her insights on personal analytics and self-hacking.
Arun Karunagath Rajeevan is a senior consultant (products) in an exciting start-up, working as an architect and coder, and is a polyglot. He is currently involved in developing the best quality management suite in the supply chain management category.
Apart from this, he has experience in healthcare and multimedia (embedded) domains. When he is not working, he loves to travel and listen to music.
Daniel Velkov is a software engineer based in San Francisco, who has more than 10 years of programming experience. His biggest professional accomplishment was designing and implementing the search stack for MyLife.com—one of the major social websites in the US. Nowadays, he works on making Google search better. Besides Python and search, he has worked on several machine learning and data analysis-oriented projects. When he is not coding, he enjoys skiing, riding motorcycles, and exploring the Californian outdoors.
Adrian Wan is a physics and computer science major at Swarthmore College. After he graduates, he will be working at Nest, a Google company, as a software engineer and data scientist. His passion lies at the intersection of his two disciplines, where elegant mathematical models and explanations of real-life phenomena are brought to life and probed deeply with efficient, clean, and powerful code. He greatly enjoyed contributing to this book and hopes that you will be able to appreciate the power that pandas brings to Python.
You can find out more about him at http://awan1.github.io.
www.PacktPub.com
Support files, eBooks, discount offers, and more
For support files and downloads related to your book, please visit www.PacktPub.com.
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at
At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.
https://www2.packtpub.com/books/subscription/packtlib
Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can search, access, and read Packt's entire library of books.
Why subscribe?
Fully searchable across every book published by Packt
Copy and paste, print, and bookmark content
On demand and accessible via a web browser
Free access for Packt account holders
If you have an account with Packt at www.PacktPub.com, you can use this to access PacktLib today and view 9 entirely free books. Simply use your login credentials for immediate access.
Preface
This book is about learning to use pandas, an open source library for Python, which was created to enable Python to easily manipulate and perform powerful statistical and mathematical analyses on tabular and multidimensional datasets. The design of pandas and its power combined with the familiarity of Python have created explosive growth in its usage over the last several years, particularly among financial firms as well as those simply looking for practical tools for statistical and data analysis.
While there exist many excellent examples of using pandas to solve many domain-specific problems, it can be difficult to find a cohesive set of examples in a form that allows one to effectively learn and apply the features of pandas. The information required to learn practical skills in using pandas is distributed across many websites, slide shares, and videos, and is generally not in a form that gives an integrated guide to all of the features with practical examples in an easy-to-understand and applicable fashion.
This book is therefore intended to be a go-to reference for learning pandas. It will take you all the way from installation, through to creating one- and two-dimensional indexed data structures, to grouping data and slicing-and-dicing them, with common analyses used to demonstrate derivation of useful results. This will include loading and saving data from local and Internet-based resources, and creating effective data visualizations that let you quickly see the meaning previously hidden within complex data.
What this book covers
Chapter 1, A Tour of pandas, is a hands-on introduction to the key features of pandas. It will give you a broad overview of the types of data tasks that can be performed with pandas. This chapter will set the groundwork for learning as all concepts introduced in this chapter will be expanded upon in subsequent chapters.
Chapter 2, Installing pandas, will show you how to install Anaconda Python and pandas on Windows, OS X, and Linux. This chapter also covers using the conda package manager to upgrade pandas and its dependent libraries to the most recent version.
Chapter 3, NumPy for pandas, will introduce you to concepts in NumPy, particularly NumPy arrays, which are core for understanding the pandas Series and DataFrame objects.
Chapter 4, The pandas Series Object, covers the pandas Series object and how it expands upon the functionality of the NumPy array to provide richer representation and manipulation of sequences of data through the use of high-performance indexes.
Chapter 5, The pandas DataFrame Object, introduces the primary data structure of pandas, the DataFrame object, and how it forms a two-dimensional representation of tabular data by aligning multiple Series objects along a common index to provide seamless access and manipulation across elements in multiple series that are related by a common index label.
Chapter 6, Accessing Data, shows how data can be loaded and saved from external sources into both Series and DataFrame objects. You will learn how to access data from multiple sources such as files, HTTP servers, database systems, and web services, as well as how to process data in CSV, HTML, and JSON formats.
Chapter 7, Tidying Up Your Data, instructs you on how to use the various tools provided by pandas for managing dirty and missing data.
Chapter 8, Combining and Reshaping Data, covers various techniques for combining, splitting, joining, and merging data located in multiple pandas objects, and then demonstrates how to reshape data using concepts such as pivots, stacking, and melting.
Chapter 9, Grouping and Aggregating Data, focuses on how to use pandas to group data to enable you to perform aggregate operations on grouped data to assist in deriving analytic results.
Chapter 10, Time-series Data, will instruct you on how to use pandas to represent sequences of information that are indexed by the progression of time. This chapter will first cover how pandas represents dates and time, as well as concepts such as periods, frequencies, time zones, and calendars. The focus then shifts to time-series data and various operations such as shifting, lagging, resampling, and moving window operations.
Chapter 11, Visualization, dives into the integration of pandas with matplotlib to visualize pandas data. This chapter will demonstrate how to represent and present many common statistical and financial data visualizations, including bar charts, histograms, scatter plots, area plots, density plots, and heat maps.
Chapter 12, Applications to Finance, brings together everything learned through the previous chapters with practical examples of using pandas to obtain, manipulate, analyze, and visualize stock data.
What you need for this book
This book assumes some familiarity with programming concepts, but those without programming experience, or specifically Python programming experience, will be comfortable with the examples as they focus on pandas constructs more than Python or programming. The examples are based on Anaconda Python 2.7 and pandas 0.15.1. If you do not have either installed, guidance will be given in Chapter 2, Installing pandas, on installing both on Windows, OS X, and Ubuntu systems. For those not interested in installing any software, instructions are also given on using the Wakari.io online Python data analysis service.
Who this book is for
If you are looking to get into data science and want to learn how to use the Python programming language for data analysis instead of other domain-specific data science tools such as R, then this book is for you. If you have used other data science packages and want to learn how to apply that knowledge to Python, then this book is also for you. Alternately, if you want to learn an additional tool or start with data science to enhance your career, then this book is for you.
Conventions
In this book, you will find a number of styles of text that distinguish between different kinds of information. Here are some examples of these styles, and an explanation of their meaning.
Code words in text are shown as follows: This information can be easily imported into DataFrame using the pd.read_csv() function as follows.
Any command-line / IPython input or output is written as follows:
In [1]: # import numpy and pandas, and DataFrame / Series
        import numpy as np
        import pandas as pd
        from pandas import DataFrame, Series
New terms and important words are shown in bold. Words that you see on the screen, in menus or dialog boxes for example, appear in the text like this: Clicking on the New Notebook button will present you with a notebook where you can start entering your pandas code.
Note
Warnings or important notes appear in a box like this.
Tip
Tips and tricks appear like this.
Reader feedback
Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or may have disliked. Reader feedback is important for us to develop titles that you really get the most out of.
To send us general feedback, simply send an e-mail to <feedback@packtpub.com>, and mention the book title in the subject of your message.
If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide on www.packtpub.com/authors.
Customer support
Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.
Downloading the example code
You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you. The code examples in the book are also publicly available on Wakari.io at https://wakari.io/sharing/bundle/LearningPandas/LearningPandas_Index.
Tip
Although great efforts are taken to use data that will reproduce the same output when you execute the samples, there is a small set of code that uses current data and hence the result of running those samples may vary from what is published in this book. These include In [39]: and In [40]: in Chapter 1, A Tour of pandas, which uses the data of the last three months of Google stock, as well as a small number of samples used in the later chapters that demonstrate the usage of date offsets centered on the current date.
Downloading the color images of this book
We also provide you with a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from https://www.packtpub.com/sites/default/files/downloads/5128OS_ColoredImages.pdf.
Errata
Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books — maybe a mistake in the text or the code — we would be grateful if you would report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/support, selecting your book, clicking on the errata submission form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website, or added to any list of existing errata, under the Errata section of that title.
Piracy
Piracy of copyright material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works, in any form, on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.
Please contact us at <copyright@packtpub.com> with a link to the suspected pirated material.
We appreciate your help in protecting our authors, and our ability to bring you valuable content.
Questions
You can contact us at <questions@packtpub.com> if you are having a problem with any aspect of the book, and we will do our best to address it.
Chapter 1. A Tour of pandas
In this chapter, we will take a look at pandas, which is an open source Python-based data analysis library. It provides high-performance and easy-to-use data structures and data analysis tools built with the Python programming language. The pandas library brings many of the good things from R, specifically the DataFrame objects and R packages such as plyr and reshape2, and places them in a single library that you can use in your Python applications.
The development of pandas was begun in 2008 by Wes McKinney when he worked at AQR Capital Management. It was open sourced in 2009 and is currently supported and actively developed by various organizations and contributors. It was initially designed with finance in mind, particularly for time series data manipulation, but it emphasizes the data manipulation part of the equation, leaving statistical, financial, and other types of analyses to other Python libraries.
In this chapter, we will take a brief tour of pandas and some of the associated tools such as IPython notebooks. You will be introduced to a variety of concepts in pandas for data organization and manipulation in an effort to form both a base understanding and a frame of reference for deeper coverage in later sections of this book. By the end of this chapter, you will have a good understanding of the fundamentals of pandas and even be able to perform basic data manipulations. Also, you will be ready to continue with later portions of this book for more detailed understanding.
This chapter will introduce you to:
pandas and why it is important
IPython and IPython Notebooks
Referencing pandas in your application
The Series and DataFrame objects of pandas
How to load data from files and the Web
The simplicity of visualizing pandas data
Note
pandas is always lowercase by convention in pandas documentation, and this will be a convention followed by this book.
pandas and why it is important
pandas is a library containing high-level data structures and tools that have been created to help a Python programmer perform powerful data manipulations and discover information in that data in a simple and fast way.
Simple and effective data analysis requires the ability to index, retrieve, tidy, reshape, combine, slice, and perform various analyses on both single and multidimensional data, including heterogeneous typed data that is automatically aligned along index labels. To enable these capabilities, pandas provides the following features (and many more not explicitly mentioned here):
High-performance array and table structures for representation of homogeneous and heterogeneous data sets: the Series and DataFrame objects
Flexible reshaping of data structure, allowing the ability to insert and delete both rows and columns of tabular data
Hierarchical indexing of data along multiple axes (both rows and columns), allowing multiple labels per data item
Labeling of series and tabular data to facilitate indexing and automatic alignment of data
Ability to easily identify and fix missing data, both in floating point and as non-floating point formats
Powerful grouping capabilities and a functionality to perform split-apply-combine operations on series and tabular data
Simple conversion from ragged and differently indexed data of both NumPy and Python data structures to pandas objects
Smart label-based slicing and subsetting of data sets, including intuitive and flexible merging, and joining of data with SQL-like constructs
Extensive I/O facilities to load and save data from multiple formats including CSV, Excel, relational and non-relational databases, HDF5 format, and JSON
Explicit support for time series-specific functionality, providing functionality for date range generation, moving window statistics, time shifting, lagging, and so on
Built-in support to retrieve and automatically parse data from various web-based data sources such as Yahoo!, Google Finance, the World Bank, and several others
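A few of the features listed above can be illustrated with a minimal sketch. The data, labels, and variable names here are made up purely for illustration and are not taken from the book's examples; the sketch only shows label alignment, missing-data handling, and a split-apply-combine step.

```python
import pandas as pd

# Two Series with partly overlapping index labels (illustrative data).
a = pd.Series([1.0, 2.0, 3.0], index=["x", "y", "z"])
b = pd.Series([10.0, 20.0], index=["y", "z"])

# Arithmetic aligns on index labels; "x" has no partner in b,
# so the result for "x" is missing (NaN).
total = a + b

# Missing values can be identified and replaced.
missing_mask = total.isna()
filled = total.fillna(0.0)

# Split-apply-combine: split rows by the key column, sum each group,
# and combine the per-group results into a new Series.
df = pd.DataFrame({"key": ["a", "b", "a", "b"], "value": [1, 2, 3, 4]})
sums = df.groupby("key")["value"].sum()

print(total)
print(filled)
print(sums)
```

Note that the alignment happens without any explicit join: the labels alone determine which values are combined.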
For those desiring to get into data analysis and the emerging field of data science, pandas offers an excellent means for a Python programmer (or just an enthusiast) to learn data manipulation. For those just learning or coming from a statistical language like R, pandas can offer an excellent introduction to Python as a programming language.
pandas itself is not a data science toolkit. It does provide some statistical methods as a matter of convenience, but to draw conclusions from data, it leans upon other packages in the Python ecosystem, such as SciPy, NumPy, scikit-learn, and upon graphics libraries such as matplotlib and ggvis for data visualization. This is actually a strength of pandas compared with languages such as R, as pandas applications are able to leverage an extensive network of robust Python frameworks already built and tested elsewhere.
In this book, we will look at how to use pandas for data manipulation, with a specific focus on gathering, cleaning, and manipulation of various forms of data using pandas. Detailed specifics of data science, finance, econometrics, social network analysis, Python, and IPython are left as reference. You can refer to some other excellent books on these topics already available at https://www.packtpub.com/.
pandas and IPython Notebooks
A popular way to use pandas is through IPython Notebooks. IPython Notebooks provide a web-based interactive computational environment, allowing the combination of code, text, mathematics, plots, and rich media into a web-based document. IPython Notebooks run in a browser and contain Python code that is run in a local or server-side Python session that the notebooks communicate with using WebSockets. Notebooks can also contain markup code and rich media content, and can be converted to other formats such as PDF, HTML, and slide shows.
The following is an example of an IPython Notebook from the IPython website (http://ipython.org/notebook.html) that demonstrates the rich capabilities of notebooks:
IPython Notebooks are not strictly required for using pandas and can be installed into your development environment independently or alongside pandas. During the course of this book, we will install pandas and an IPython Notebook server. You will be able to perform code examples in the text directly in an IPython console interpreter, and the examples will be packaged as notebooks that can be run with a local notebook server. Additionally, the workbooks will be available online for easy and immediate access at https://wakari.io/sharing/bundle/LearningPandas/LearningPandas_Index.
Note
To learn more about IPython Notebooks, visit the notebooks site at http://ipython.org/ipython-doc/dev/notebook/, and for more in-depth coverage, refer to another book, Learning IPython for Interactive Computing and Data Visualization, Cyrille Rossant, Packt Publishing.
Referencing pandas in the application
All pandas programs and examples in this book will always start by importing pandas (and NumPy) into the Python environment. There is a common convention used in many publications (web and print) of importing pandas and NumPy, which will also be used throughout this book. All workbooks and examples for chapters will start with code similar to the following to initialize the pandas library within Python.
In [1]:
   # import numpy and pandas, and DataFrame / Series
   import numpy as np
   import pandas as pd
   from pandas import DataFrame, Series

   # Set some pandas options
   pd.set_option('display.notebook_repr_html', False)
   pd.set_option('display.max_columns', 10)
   pd.set_option('display.max_rows', 10)

   # And some items for matplotlib
   %matplotlib inline
   import matplotlib.pyplot as plt
   # (the mpl_style option exists only in older pandas releases)
   pd.options.display.mpl_style = 'default'
NumPy and pandas go hand-in-hand, as much of pandas is built on NumPy. It is, therefore, very convenient to import NumPy and put it in a np. namespace. Likewise, pandas is imported and referenced with a pd. prefix. Since DataFrame and Series objects of pandas are used so frequently, the third line then imports the Series and DataFrame objects into the global namespace so that we can use them without a pd. prefix.
The three pd.set_option() calls set up some defaults for IPython Notebooks and console output from pandas. The first makes pandas objects render as plain text rather than HTML in notebooks, and the other two limit the number of columns and rows displayed to 10 each. They can be used to modify the output of IPython and pandas to fit your personal needs to display results. The options set here are convenient for formatting the output of the examples to the constraints of the text.
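As a small sketch of how display options behave, the following reads an option, overrides it temporarily, and resets it. The functions pd.get_option(), pd.option_context(), and pd.reset_option() are not used in the text above; they are shown here only as one possible way to inspect and scope the same settings.

```python
import pandas as pd

# Read the current value of a display option.
default_rows = pd.get_option("display.max_rows")

# Temporarily override the option; the previous value is
# restored automatically when the with-block exits.
with pd.option_context("display.max_rows", 5):
    inside = pd.get_option("display.max_rows")

after = pd.get_option("display.max_rows")

# Restore a single option to the library default.
pd.reset_option("display.max_rows")

print(inside, after == default_rows)
```

Scoping a change with a context manager avoids leaving global display settings altered after a one-off inspection.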
Primary pandas objects
A pandas programmer will spend most of their time using two primary objects provided by the pandas framework: Series and DataFrame. The DataFrame object is the overall workhorse of pandas and the most frequently used, as it provides the means to manipulate tabular and heterogeneous data.
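To make the relationship between the two objects concrete, here is a minimal sketch with made-up column names and values (not from the book). It assumes only that a DataFrame can be built from a dictionary of lists and that selecting one column yields a Series.

```python
import pandas as pd

# A small table whose columns hold different types of data.
df = pd.DataFrame({
    "name": ["a", "b", "c"],
    "count": [1, 2, 3],
    "ratio": [0.5, 1.5, 2.5],
})

# Selecting a single column returns a Series that shares the row index.
column = df["count"]

print(df.dtypes)      # one data type per column
print(type(column))
```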
The pandas Series object
The base data structure of pandas is the Series object, which is designed to operate similarly to a NumPy array but also adds index capabilities. A simple way to create a Series object is to initialize it with a Python array or list.
In [2]:
   # create a four item Series
   s = Series([1, 2, 3, 4])
   s

Out [2]:
   0    1
   1    2
   2    3
   3    4
   dtype: int64
This has created a pandas Series from the list. Notice that printing the series resulted in what appears to be two columns of data. The first column in the output is not a column of the Series object, but the index labels. The second column is the values of the Series object. Each row represents the index label and the value for that label. This Series was created without specifying an index, so pandas automatically creates indexes starting at zero and increasing by one.
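The separation between index labels and values can also be inspected directly. The following sketch assumes the same four-item Series and uses the index and values attributes, which are not shown in the text above.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4])

# The automatically generated index labels, starting at zero.
labels = list(s.index)

# The underlying data, converted from a NumPy array to a plain list.
values = s.values.tolist()

print(labels)
print(values)
```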
Elements of a Series object can be accessed through the index using []. This informs the Series which value to return given one or more index values (referred to in pandas as labels). The following code retrieves the items in the series with labels 1 and 3.
In [3]:
   # return a Series with the rows with labels 1 and 3
   s[[1, 3]]

Out [3]:
   1    2
   3    4
   dtype: int64
Note
It is important to note that the lookup here is not by zero-based positions 1 and 3 like an array, but by the values in the index.
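The difference between labels and positions is easiest to see once they no longer coincide. The sketch below is an illustration under one assumption: it uses the explicit .loc (label-based) and .iloc (position-based) accessors, which the text has not introduced at this point.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4])

# Filtering keeps the original labels, so the remaining items
# carry labels 1, 2, 3 but sit at positions 0, 1, 2.
filtered = s[s > 1]

by_label = filtered.loc[1]       # the value stored under label 1
by_position = filtered.iloc[1]   # the value at the second position

print(by_label, by_position)
```

Here the same number, 1, selects different items depending on whether it is treated as a label or as a position.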
A Series object can be created with a user-defined index by specifying the labels for the index using the index parameter.
In [4]: # create a series using an explicit