Learning pandas
Ebook, 731 pages, 5 hours

About this ebook

About This Book
  • Use pandas for data analysis so that you can focus more on analysis and less on programming
  • Get comfortable performing data exploration and analysis in Python using pandas
  • Step-by-step demonstration of using Python and pandas with interactive and incremental examples to facilitate learning
Who This Book Is For

If you are a Python programmer who wants to get started with performing data analysis using pandas and Python, this is the book for you. Some experience with statistical analysis would be helpful but is not mandatory.

Language: English
Release date: Apr 16, 2015
ISBN: 9781783985135

    Book preview

    Learning pandas - Heydt Michael

    Table of Contents

    Learning pandas

    Credits

    About the Author

    About the Reviewers

    www.PacktPub.com

    Support files, eBooks, discount offers, and more

    Why subscribe?

    Free access for Packt account holders

    Preface

    What this book covers

    What you need for this book

    Who this book is for

    Conventions

    Reader feedback

    Customer support

    Downloading the example code

    Downloading the color images of this book

    Errata

    Piracy

    Questions

    1. A Tour of pandas

    pandas and why it is important

    pandas and IPython Notebooks

    Referencing pandas in the application

    Primary pandas objects

    The pandas Series object

    The pandas DataFrame object

    Loading data from files and the Web

    Loading CSV data from files

    Loading data from the Web

    Simplicity of visualization of pandas data

    Summary

    2. Installing pandas

    Getting Anaconda

    Installing Anaconda

    Installing Anaconda on Linux

    Installing Anaconda on Mac OS X

    Installing Anaconda on Windows

    Ensuring pandas is up to date

    Running a small pandas sample in IPython

    Starting the IPython Notebook server

    Installing and running IPython Notebooks

    Using Wakari for pandas

    Summary

    3. NumPy for pandas

    Installing and importing NumPy

    Benefits and characteristics of NumPy arrays

    Creating NumPy arrays and performing basic array operations

    Selecting array elements

    Logical operations on arrays

    Slicing arrays

    Reshaping arrays

    Combining arrays

    Splitting arrays

    Useful numerical methods of NumPy arrays

    Summary

    4. The pandas Series Object

    The Series object

    Importing pandas

    Creating Series

    Size, shape, uniqueness, and counts of values

    Peeking at data with heads, tails, and take

    Looking up values in Series

    Alignment via index labels

    Arithmetic operations

    The special case of Not-A-Number (NaN)

    Boolean selection

    Reindexing a Series

    Modifying a Series in-place

    Slicing a Series

    Summary

    5. The pandas DataFrame Object

    Creating DataFrame from scratch

    Example data

    S&P 500

    Monthly stock historical prices

    Selecting columns of a DataFrame

    Selecting rows and values of a DataFrame using the index

    Slicing using the [] operator

    Selecting rows by index label and location: .loc[] and .iloc[]

    Selecting rows by index label and/or location: .ix[]

    Scalar lookup by label or location using .at[] and .iat[]

    Selecting rows of a DataFrame by Boolean selection

    Modifying the structure and content of DataFrame

    Renaming columns

    Adding and inserting columns

    Replacing the contents of a column

    Deleting columns in a DataFrame

    Adding rows to a DataFrame

    Appending rows with .append()

    Concatenating DataFrame objects with pd.concat()

    Adding rows (and columns) via setting with enlargement

    Removing rows from a DataFrame

    Removing rows using .drop()

    Removing rows using Boolean selection

    Removing rows using a slice

    Changing scalar values in a DataFrame

    Arithmetic on a DataFrame

    Resetting and reindexing

    Hierarchical indexing

    Summarized data and descriptive statistics

    Summary

    6. Accessing Data

    Setting up the IPython notebook

    CSV and Text/Tabular format

    The sample CSV data set

    Reading a CSV file into a DataFrame

    Specifying the index column when reading a CSV file

    Data type inference and specification

    Specifying column names

    Specifying specific columns to load

    Saving DataFrame to a CSV file

    General field-delimited data

    Handling noise rows in field-delimited data

    Reading and writing data in an Excel format

    Reading and writing JSON files

    Reading HTML data from the Web

    Reading and writing HDF5 format files

    Accessing data on the web and in the cloud

    Reading and writing from/to SQL databases

    Reading data from remote data services

    Reading stock data from Yahoo! and Google Finance

    Retrieving data from Yahoo! Finance Options

    Reading economic data from the Federal Reserve Bank of St. Louis

    Accessing Kenneth French's data

    Reading from the World Bank

    Summary

    7. Tidying Up Your Data

    What is tidying your data?

    Setting up the IPython notebook

    Working with missing data

    Determining NaN values in Series and DataFrame objects

    Selecting out or dropping missing data

    How pandas handles NaN values in mathematical operations

    Filling in missing data

    Forward and backward filling of missing values

    Filling using index labels

    Interpolation of missing values

    Handling duplicate data

    Transforming Data

    Mapping

    Replacing values

    Applying functions to transform data

    Summary

    8. Combining and Reshaping Data

    Setting up the IPython notebook

    Concatenating data

    Merging and joining data

    An overview of merges

    Specifying the join semantics of a merge operation

    Pivoting

    Stacking and unstacking

    Stacking using nonhierarchical indexes

    Unstacking using hierarchical indexes

    Melting

    Performance benefits of stacked data

    Summary

    9. Grouping and Aggregating Data

    Setting up the IPython notebook

    The split, apply, and combine (SAC) pattern

    Split

    Data for the examples

    Grouping by a single column's values

    Accessing the results of grouping

    Grouping using index levels

    Apply

    Applying aggregation functions to groups

    The transformation of group data

    An overview of transformation

    Practical examples of transformation

    Filtering groups

    Discretization and Binning

    Summary

    10. Time-series Data

    Setting up the IPython notebook

    Representation of dates, time, and intervals

    The datetime, day, and time objects

    Timestamp objects

    Timedelta

    Introducing time-series data

    DatetimeIndex

    Creating time-series data with specific frequencies

    Calculating new dates using offsets

    Date offsets

    Anchored offsets

    Representing durations of time using Period objects

    The Period object

    PeriodIndex

    Handling holidays using calendars

    Normalizing timestamps using time zones

    Manipulating time-series data

    Shifting and lagging

    Frequency conversion

    Up and down resampling

    Time-series moving-window operations

    Summary

    11. Visualization

    Setting up the IPython notebook

    Plotting basics with pandas

    Creating time-series charts with .plot()

    Adorning and styling your time-series plot

    Adding a title and changing axes labels

    Specifying the legend content and position

    Specifying line colors, styles, thickness, and markers

    Specifying tick mark locations and tick labels

    Formatting axes tick date labels using formatters

    Common plots used in statistical analyses

    Bar plots

    Histograms

    Box and whisker charts

    Area plots

    Scatter plots

    Density plot

    The scatter plot matrix

    Heatmaps

    Multiple plots in a single chart

    Summary

    12. Applications to Finance

    Setting up the IPython notebook

    Obtaining and organizing stock data from Yahoo!

    Plotting time-series prices

    Plotting volume-series data

    Calculating the simple daily percentage change

    Calculating simple daily cumulative returns

    Resampling data from daily to monthly returns

    Analyzing distribution of returns

    Performing a moving-average calculation

    The comparison of average daily returns across stocks

    The correlation of stocks based on the daily percentage change of the closing price

    Volatility calculation

    Determining risk relative to expected returns

    Summary

    Index

    Learning pandas

    Copyright © 2015 Packt Publishing

    All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

    Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

    Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

    First published: April 2015

    Production reference: 1090415

    Published by Packt Publishing Ltd.

    Livery Place

    35 Livery Street

    Birmingham B3 2PB, UK.

    ISBN 978-1-78398-512-8

    www.packtpub.com

    Credits

    Author

    Michael Heydt

    Reviewers

    Bill Chambers

    S. Shelly Jang

    Arun Karunagath Rajeevan

    Daniel Velkov

    Adrian Wan

    Commissioning Editor

    Kartikey Pandey

    Acquisition Editor

    Neha Nagwekar

    Content Development Editor

    Akshay Nair

    Technical Editors

    Shashank Desai

    Chinmay Puranik

    Copy Editors

    Roshni Banerjee

    Pranjali Chury

    Stuti Srivastava

    Project Coordinator

    Mary Alex

    Proofreaders

    Simran Bhogal

    Paul Hindle

    Linda Morris

    Christopher Smith

    Indexer

    Monica Ajmera Mehta

    Graphics

    Sheetal Aute

    Production Coordinator

    Arvindkumar Gupta

    Cover Work

    Arvindkumar Gupta

    About the Author

    Michael Heydt is an independent consultant, educator, and trainer with nearly 30 years of professional software development experience, during which he focused on agile software design and implementation using advanced technologies in multiple verticals, including media, finance, energy, and healthcare. He holds an MS degree in mathematics and computer science from Drexel University and an executive master's of technology management degree from the University of Pennsylvania's School of Engineering and Wharton Business School. His studies and research have focused on technology management, software engineering, entrepreneurship, information retrieval, data sciences, and computational finance. Since 2005, he has been specializing in building energy and financial trading systems for major investment banks on Wall Street and for several global energy trading companies, utilizing .NET, C#, WPF, TPL, DataFlow, Python, R, Mono, iOS, and Android. His current interests include creating seamless applications using desktop, mobile, and wearable technologies, which utilize high concurrency, high availability, real-time data analytics, augmented and virtual reality, cloud services, messaging, computer vision, natural user interfaces, and software-defined networks. He is the author of numerous technology articles, papers, and books (Instant Lucene.NET, Learning pandas). He is a frequent speaker at .NET users' groups and various mobile and cloud conferences, and he regularly delivers webinars on advanced technologies.

    About the Reviewers

    Bill Chambers is a Python developer and data scientist currently pursuing a master of information management and systems degree at the UC Berkeley School of Information. Previously, he focused on data architecture and systems using marketing, sales, and customer analytics data. Bill is passionate about delivering actionable insights and innovative solutions using data.

    You can find more information about him at http://www.billchambers.me.

    S. Shelly Jang received her PhD degree in electrical engineering from the University of Washington and a master's degree in chemical and biological engineering from the University of British Columbia in 2014 and 2009, respectively. She was an Insight Data Science fellow in 2014. During her tenure, she built a web app that recommends crowd-verified treatment options for various medical conditions. She is currently a senior data scientist at AT&T Big Data. Exploring complex, large-scale data sets to build models and derive insights is just a part of her job.

    In her free time, she participates in the Quantified Self community, sharing her insights on personal analytics and self-hacking.

    Arun Karunagath Rajeevan is a senior consultant (products) in an exciting start-up, working as an architect and coder, and is a polyglot. He is currently involved in developing the best quality management suite in the supply chain management category.

    Apart from this, he has experience in healthcare and multimedia (embedded) domains. When he is not working, he loves to travel and listen to music.

    Daniel Velkov is a software engineer based in San Francisco, who has more than 10 years of programming experience. His biggest professional accomplishment was designing and implementing the search stack for MyLife.com—one of the major social websites in the US. Nowadays, he works on making Google search better. Besides Python and search, he has worked on several machine learning and data analysis-oriented projects. When he is not coding, he enjoys skiing, riding motorcycles, and exploring the Californian outdoors.

    Adrian Wan is a physics and computer science major at Swarthmore College. After he graduates, he will be working at Nest, a Google company, as a software engineer and data scientist. His passion lies at the intersection of his two disciplines, where elegant mathematical models and explanations of real-life phenomena are brought to life and probed deeply with efficient, clean, and powerful code. He greatly enjoyed contributing to this book and hopes that you will be able to appreciate the power that pandas brings to Python.

    You can find out more about him at http://awan1.github.io.

    www.PacktPub.com

    Support files, eBooks, discount offers, and more

    For support files and downloads related to your book, please visit www.PacktPub.com.

    Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at for more details.

    At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

    https://www2.packtpub.com/books/subscription/packtlib

    Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can search, access, and read Packt's entire library of books.

    Why subscribe?

    • Fully searchable across every book published by Packt
    • Copy and paste, print, and bookmark content
    • On demand and accessible via a web browser

    Free access for Packt account holders

    If you have an account with Packt at www.PacktPub.com, you can use this to access PacktLib today and view 9 entirely free books. Simply use your login credentials for immediate access.

    Preface

    This book is about learning to use pandas, an open source library for Python, which was created to enable Python to easily manipulate and perform powerful statistical and mathematical analyses on tabular and multidimensional datasets. The design of pandas and its power combined with the familiarity of Python have created explosive growth in its usage over the last several years, particularly among financial firms as well as those simply looking for practical tools for statistical and data analysis.

    While there exist many excellent examples of using pandas to solve many domain-specific problems, it can be difficult to find a cohesive set of examples in a form that allows one to effectively learn and apply the features of pandas. The information required to learn practical skills in using pandas is distributed across many websites, slide shares, and videos, and is generally not in a form that gives an integrated guide to all of the features with practical examples in an easy-to-understand and applicable fashion.

    This book is therefore intended to be a go-to reference for learning pandas. It will take you all the way from installation, through creating one- and two-dimensional indexed data structures, to grouping data and slicing and dicing it, with common analyses used to demonstrate the derivation of useful results. This will include loading and saving data from both local and Internet-based resources, and creating effective data visualizations that quickly reveal insights previously hidden within complex data.

    What this book covers

    Chapter 1, A Tour of pandas, is a hands-on introduction to the key features of pandas. It will give you a broad overview of the types of data tasks that can be performed with pandas. This chapter will set the groundwork for learning as all concepts introduced in this chapter will be expanded upon in subsequent chapters.

    Chapter 2, Installing pandas, will show you how to install Anaconda Python and pandas on Windows, OS X, and Linux. This chapter also covers using the conda package manager to upgrade pandas and its dependent libraries to the most recent version.

    Chapter 3, NumPy for pandas, will introduce you to concepts in NumPy, particularly NumPy arrays, which are core for understanding the pandas Series and DataFrame objects.

    Chapter 4, The pandas Series Object, covers the pandas Series object and how it expands upon the functionality of the NumPy array to provide richer representation and manipulation of sequences of data through the use of high-performance indexes.

    Chapter 5, The pandas DataFrame Object, introduces the primary data structure of pandas, the DataFrame object, and how it forms a two-dimensional representation of tabular data by aligning multiple Series objects along a common index to provide seamless access and manipulation across elements in multiple series that are related by a common index label.

    Chapter 6, Accessing Data, shows how data can be loaded and saved from external sources into both Series and DataFrame objects. You will learn how to access data from multiple sources such as files, HTTP servers, database systems, and web services, as well as how to process data in CSV, HTML, and JSON formats.

    Chapter 7, Tidying Up Your Data, instructs you on how to use the various tools provided by pandas for managing dirty and missing data.

    Chapter 8, Combining and Reshaping Data, covers various techniques for combining, splitting, joining, and merging data located in multiple pandas objects, and then demonstrates how to reshape data using concepts such as pivots, stacking, and melting.

    Chapter 9, Grouping and Aggregating Data, focuses on how to use pandas to group data to enable you to perform aggregate operations on grouped data to assist in deriving analytic results.

    Chapter 10, Time-series Data, will instruct you on how to use pandas to represent sequences of information that is indexed by the progression of time. This chapter will first cover how pandas represents dates and time, as well as concepts such as periods, frequencies, time zones, and calendars. The focus then shifts to time-series data and various operations such as shifting, lagging, resampling, and moving window operations.

    Chapter 11, Visualization, dives into the integration of pandas with matplotlib to visualize pandas data. This chapter will demonstrate how to represent and present many common statistical and financial data visualizations, including bar charts, histograms, scatter plots, area plots, density plots, and heat maps.

    Chapter 12, Applications to Finance, brings together everything learned through the previous chapters with practical examples of using pandas to obtain, manipulate, analyze, and visualize stock data.
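
    As a rough preview of the objects these chapters build on, the following sketch touches the Series, DataFrame, missing-data, and time-series ideas named above. It is only an illustration and is not code from the book: the city names, temperatures, and prices are made-up values.

```python
import pandas as pd

# Chapter 4: a Series is a one-dimensional array of values with an index of labels.
city_a = pd.Series([70.0, 72.0, 68.0], index=["Mon", "Tue", "Wed"])
city_b = pd.Series([65.0, 71.0], index=["Tue", "Wed"])

# Chapter 5: a DataFrame aligns several Series along a common index.
# city_b has no "Mon" label, so that cell is filled with NaN.
temps = pd.DataFrame({"city_a": city_a, "city_b": city_b})

# Chapter 7: count the missing values, then fill them with each column's mean.
n_missing = int(temps["city_b"].isnull().sum())
filled = temps.fillna(temps.mean())

# Chapter 10: index values by date, then lag them by one day to compute
# a simple daily percentage change.
dates = pd.date_range("2015-01-01", periods=4, freq="D")
prices = pd.Series([100.0, 110.0, 99.0, 99.0], index=dates)
daily_change = prices / prices.shift(1) - 1

print(temps)
print(filled)
print(daily_change)
```

    The first entry of daily_change is NaN because there is no prior day to compare against; later chapters cover how such values can be handled.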

    What you need for this book

    This book assumes some familiarity with programming concepts, but those without programming experience, or specifically Python programming experience, will be comfortable with the examples as they focus on pandas constructs more than Python or programming. The examples are based on Anaconda Python 2.7 and pandas 0.15.1. If you do not have either installed, guidance will be given in Chapter 2, Installing pandas, on installing both on Windows, OS X, and Ubuntu systems. For those not interested in installing any software, instructions are also given on using the Wakari.io online Python data analysis service.

    Who this book is for

    If you are looking to get into data science and want to learn how to use the Python programming language for data analysis instead of other domain-specific data science tools such as R, then this book is for you. If you have used other data science packages and want to learn how to apply that knowledge to Python, then this book is also for you. Alternately, if you want to learn an additional tool or start with data science to enhance your career, then this book is for you.

    Conventions

    In this book, you will find a number of styles of text that distinguish between different kinds of information. Here are some examples of these styles, and an explanation of their meaning.

    Code words in text are shown as follows: This information can be easily imported into DataFrame using the pd.read_csv() function as follows.

    Any command-line / IPython input or output is written as follows:

    In [1]:
       # import numpy and pandas, and DataFrame / Series
       import numpy as np
       import pandas as pd
       from pandas import DataFrame, Series

    New terms and important words are shown in bold. Words that you see on the screen, in menus or dialog boxes for example, appear in the text like this: Clicking on the New Notebook button will present you with a notebook where you can start entering your pandas code.

    Note

    Warnings or important notes appear in a box like this.

    Tip

    Tips and tricks appear like this.

    Reader feedback

    Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or may have disliked. Reader feedback is important for us to develop titles that you really get the most out of.

    To send us general feedback, simply send an e-mail to <feedback@packtpub.com>, and mention the book title in the subject of your message.

    If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide on www.packtpub.com/authors.

    Customer support

    Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

    Downloading the example code

    You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you. The code examples in the book are also publicly available on Wakari.io at https://wakari.io/sharing/bundle/LearningPandas/LearningPandas_Index.

    Tip

    Although great efforts are taken to use data that will reproduce the same output when you execute the samples, there is a small set of code that uses current data and hence the result of running those samples may vary from what is published in this book. These include In [39]: and In [40]: in Chapter 1, A Tour of pandas, which uses the data of the last three months of Google stock, as well as a small number of samples used in the later chapters that demonstrate the usage of date offsets centered on the current date.
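
    To illustrate why such samples vary, the sketch below contrasts a date offset anchored on the current date with one anchored on a fixed date. It is not taken from the book; the fixed date is simply this ebook's release date, used as an arbitrary example.

```python
import pandas as pd

# Anchored on "now": the result changes every day the code is run.
today = pd.Timestamp.now().normalize()
three_months_before_today = today - pd.DateOffset(months=3)

# Anchored on a fixed date: the result is the same on every run.
fixed_end = pd.Timestamp("2015-04-16")
fixed_start = fixed_end - pd.DateOffset(months=3)

print(three_months_before_today)
print(fixed_start)
```

    Replacing a current-date anchor with a fixed one in this way is one option for getting repeatable output while experimenting with the samples.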

    Downloading the color images of this book

    We also provide you with a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from https://www.packtpub.com/sites/default/files/downloads/5128OS_ColoredImages.pdf.

    Errata

    Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books — maybe a mistake in the text or the code — we would be grateful if you would report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/support, selecting your book, clicking on the errata submission form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website, or added to any list of existing errata, under the Errata section of that title.

    Piracy

    Piracy of copyright material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works, in any form, on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

    Please contact us at <copyright@packtpub.com> with a link to the suspected pirated material.

    We appreciate your help in protecting our authors, and our ability to bring you valuable content.

    Questions

    You can contact us at <questions@packtpub.com> if you are having a problem with any aspect of the book, and we will do our best to address it.

    Chapter 1. A Tour of pandas

    In this chapter, we will take a look at pandas, which is an open source Python-based data analysis library. It provides high-performance and easy-to-use data structures and data analysis tools built with the Python programming language. The pandas library brings many of the good things from R, specifically the DataFrame objects and R packages such as plyr and reshape2, and places them in a single library that you can use in your Python applications.

    Wes McKinney began developing pandas in 2008 while working at AQR Capital Management. It was open sourced in 2009 and is currently supported and actively developed by various organizations and contributors. It was initially designed with finance in mind, particularly for time series data manipulation, but it emphasizes the data manipulation part of the equation, leaving statistical, financial, and other types of analyses to other Python libraries.

    In this chapter, we will take a brief tour of pandas and some of the associated tools such as IPython notebooks. You will be introduced to a variety of concepts in pandas for data organization and manipulation in an effort to form both a base understanding and a frame of reference for deeper coverage in later sections of this book. By the end of this chapter, you will have a good understanding of the fundamentals of pandas and even be able to perform basic data manipulations. Also, you will be ready to continue with later portions of this book for more detailed understanding.

    This chapter will introduce you to:

    • pandas and why it is important
    • IPython and IPython Notebooks
    • Referencing pandas in your application
    • The Series and DataFrame objects of pandas
    • How to load data from files and the Web
    • The simplicity of visualizing pandas data

    Note

    pandas is always lowercase by convention in pandas documentation, and this will be a convention followed by this book.

    pandas and why it is important

    pandas is a library containing high-level data structures and tools that have been created to assist a Python programmer to perform powerful data manipulations, and discover information in that data in a simple and fast way.

    Simple and effective data analysis requires the ability to index, retrieve, tidy, reshape, combine, slice, and perform various analyses on both single and multidimensional data, including heterogeneously typed data that is automatically aligned along index labels. To enable these capabilities, pandas provides the following features (and many more not explicitly mentioned here):

    • High-performance array and table structures for representing homogeneous and heterogeneous data sets: the Series and DataFrame objects
    • Flexible reshaping of data structures, including the ability to insert and delete both rows and columns of tabular data
    • Hierarchical indexing of data along multiple axes (both rows and columns), allowing multiple labels per data item
    • Labeling of series and tabular data to facilitate indexing and automatic alignment of data
    • Ability to easily identify and fix missing data in both floating-point and non-floating-point formats
    • Powerful grouping capabilities and functionality to perform split-apply-combine operations on series and tabular data
    • Simple conversion of ragged and differently indexed data in both NumPy and Python data structures to pandas objects
    • Smart label-based slicing and subsetting of data sets, along with intuitive and flexible merging and joining of data using SQL-like constructs
    • Extensive I/O facilities to load and save data in multiple formats, including CSV, Excel, relational and non-relational databases, HDF5, and JSON
    • Explicit support for time series, including date range generation, moving window statistics, time shifting, lagging, and so on
    • Built-in support to retrieve and automatically parse data from various web-based data sources such as Yahoo!, Google Finance, the World Bank, and several others
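
    A few of the features listed above (automatic alignment on labels, label-based slicing, split-apply-combine, and SQL-like joins) can be sketched in a handful of lines. This is a minimal illustration rather than an example from the book; the labels, regions, amounts, and manager names are invented for the purpose.

```python
import pandas as pd

# Two Series whose index labels only partially overlap.
q1 = pd.Series([10, 20, 30], index=["a", "b", "c"])
q2 = pd.Series([1, 2, 3], index=["b", "c", "d"])

# Arithmetic aligns on labels, not positions; a label present in only
# one operand produces NaN in the result.
total = q1 + q2

# Label-based slicing with .loc includes both endpoints.
middle = q1.loc["b":"c"]

# Split-apply-combine: split rows by region, sum each group, combine.
sales = pd.DataFrame({
    "region": ["east", "west", "east", "west"],
    "amount": [100, 150, 200, 50],
})
by_region = sales.groupby("region")["amount"].sum()

# SQL-like join: a left join of sales onto a lookup table of managers.
managers = pd.DataFrame({
    "region": ["east", "west"],
    "manager": ["Ann", "Bob"],
})
joined = sales.merge(managers, on="region", how="left")

print(total)
print(middle)
print(by_region)
print(joined)
```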

    For those desiring to get into data analysis and the emerging field of data science, pandas offers an excellent means for a Python programmer (or just an enthusiast) to learn data manipulation. For those just learning or coming from a statistical language like R, pandas can offer an excellent introduction to Python as a programming language.

    pandas itself is not a data science toolkit. It does provide some statistical methods as a matter of convenience, but to draw conclusions from data, it leans upon other packages in the Python ecosystem, such as SciPy, NumPy, and scikit-learn, and upon graphics libraries such as matplotlib and ggvis for data visualization. This is actually a strength of pandas relative to languages such as R, as pandas applications can leverage an extensive network of robust Python frameworks already built and tested elsewhere.
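
    As a small illustration of how pandas objects interoperate with the rest of the ecosystem, the sketch below passes a Series to NumPy. It is an assumption-laden example rather than one from the book: the values and labels are arbitrary, and NumPy stands in for the other libraries mentioned.

```python
import numpy as np
import pandas as pd

readings = pd.Series([1.0, 4.0, 9.0], index=["x", "y", "z"])

# A NumPy universal function applied to a Series returns a Series
# that keeps the original index labels.
roots = np.sqrt(readings)

# np.asarray() exposes the underlying values as a plain NumPy array,
# which other numerical libraries can consume directly.
raw = np.asarray(readings)

print(roots)
print(raw.mean())
```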

    In this book, we will look at how to use pandas for data manipulation, with a specific focus on gathering, cleaning, and manipulation of various forms of data using pandas. Detailed specifics of data science, finance, econometrics, social network analysis, Python, and IPython are left as reference. You can refer to some other excellent books on these topics already available at https://www.packtpub.com/.

    pandas and IPython Notebooks

    A popular means of using pandas is through IPython Notebooks. IPython Notebooks provide a web-based interactive computational environment, allowing the combination of code, text, mathematics, plots, and rich media into a web-based document. IPython Notebooks run in a browser and contain Python code that is run in a local or server-side Python session that the notebooks communicate with using WebSockets. Notebooks can also contain markup code and rich media content, and can be converted to other formats such as PDF, HTML, and slide shows.

    The following is an example of an IPython Notebook from the IPython website (http://ipython.org/notebook.html) that demonstrates the rich capabilities of notebooks:

    IPython Notebooks are not strictly required for using pandas and can be installed into your development environment independently or alongside pandas. During the course of this book, we will install pandas and an IPython Notebook server. You will be able to run the code examples in the text directly in an IPython console interpreter, and the examples will be packaged as notebooks that can be run with a local notebook server. Additionally, the workbooks will be available online for easy and immediate access at https://wakari.io/sharing/bundle/LearningPandas/LearningPandas_Index.

    Note

    To learn more about IPython Notebooks, visit the notebooks site at http://ipython.org/ipython-doc/dev/notebook/, and for more in-depth coverage, refer to another book, Learning IPython for Interactive Computing and Data Visualization, Cyrille Rossant, Packt Publishing.

    Referencing pandas in the application

    All pandas programs and examples in this book will always start by importing pandas (and NumPy) into the Python environment. There is a common convention used in many publications (web and print) of importing pandas and NumPy, which will also be used throughout this book. All workbooks and examples for chapters will start with code similar to the following to initialize the pandas library within Python.

    In [1]:
      # import numpy and pandas, and DataFrame / Series
      import numpy as np
      import pandas as pd
      from pandas import DataFrame, Series

      # Set some pandas options
      pd.set_option('display.notebook_repr_html', False)
      pd.set_option('display.max_columns', 10)
      pd.set_option('display.max_rows', 10)

      # And some items for matplotlib
      # (%matplotlib is an IPython magic and only works in IPython or a notebook)
      %matplotlib inline
      import matplotlib.pyplot as plt
      # display.mpl_style exists only in older pandas releases;
      # omit this line on versions that no longer provide the option
      pd.options.display.mpl_style = 'default'

    NumPy and pandas go hand-in-hand, as much of pandas is built on NumPy. It is, therefore, very convenient to import NumPy and put it in a np. namespace. Likewise, pandas is imported and referenced with a pd. prefix. Since DataFrame and Series objects of pandas are used so frequently, the third line then imports the Series and DataFrame objects into the global namespace so that we can use them without a pd. prefix.
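
    As a small sketch of what these imports provide (the variable names are hypothetical and not from the book), the bare names refer to exactly the same classes as their pd.-prefixed forms, and the values of a Series are held in a NumPy array:

```python
import numpy as np
import pandas as pd
from pandas import DataFrame, Series

# The bare names are the same objects as the pd.-prefixed ones
same_series = Series is pd.Series
same_frame = DataFrame is pd.DataFrame

# A Series built from a NumPy array exposes its data as a NumPy array
s = Series(np.arange(3))
values = s.values
```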

    The three pd.set_option() function calls set up some defaults for IPython Notebooks and console output from pandas. The first turns off HTML rendering of pandas objects in notebooks so that they are shown as plain text, and the other two limit the displayed output to at most 10 columns and 10 rows. They can be used to modify the output of IPython and pandas to fit your personal needs to display results. The options set here are convenient for formatting the output of the examples to the constraints of the text.
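
    Display options can also be read back, changed temporarily, and reset. The following is a minimal sketch using pd.get_option(), pd.option_context(), and pd.reset_option(); the specific values chosen are only for illustration:

```python
import pandas as pd

# Limit the number of displayed rows, then read the setting back
pd.set_option("display.max_rows", 10)
current = pd.get_option("display.max_rows")

# option_context changes a setting only inside the with-block
with pd.option_context("display.max_rows", 5):
    inside = pd.get_option("display.max_rows")
after = pd.get_option("display.max_rows")

# reset_option restores the library default for this option
pd.reset_option("display.max_rows")
```

    Using option_context is convenient when one particular result needs different formatting without disturbing the settings used by the rest of a notebook.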

    Primary pandas objects

    A programmer of pandas will spend most of their time using two primary objects provided by the pandas framework: Series and DataFrame. The DataFrame objects will be the overall workhorse of pandas and the most frequently used as they provide the means to manipulate tabular and heterogeneous data.

    The pandas Series object

    The base data structure of pandas is the Series object, which is designed to operate similarly to a NumPy array but also adds index capabilities. A simple way to create a Series object is by initializing a Series object with a Python array or Python list.

    In [2]:
      # create a four item Series
      s = Series([1, 2, 3, 4])
      s

    Out [2]:
      0    1
      1    2
      2    3
      3    4
      dtype: int64

    This has created a pandas Series from the list. Notice that printing the series resulted in what appears to be two columns of data. The first column in the output is not a column of the Series object, but the index labels. The second column is the values of the Series object. Each row represents the index label and the value for that label. This Series was created without specifying an index, so pandas automatically creates integer index labels starting at zero and increasing by one.
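
    The two parts of the printed output can be inspected separately. As a brief sketch (the variable names labels and values are hypothetical), the index property holds the labels and the values themselves can be pulled out as a plain list:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4])

labels = s.index.tolist()  # the automatically created index labels
values = s.tolist()        # the data stored in the Series
```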

    Elements of a Series object can be accessed through the index using []. This informs the Series which value to return given one or more index values (referred to in pandas as labels). The following code retrieves the items in the series with labels 1 and 3.

    In [3]:
      # return a Series with the rows with labels 1 and 3
      s[[1, 3]]

    Out [3]:
      1    2
      3    4
      dtype: int64

    Note

    It is important to note that the lookup here is not by zero-based positions 1 and 3 like an array, but by the values in the index.
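
    The difference between labels and positions is easier to see when the two do not coincide. The following sketch is not from the book and uses a hypothetical Series whose labels run in the reverse order of its positions. It uses the explicit .loc (label-based) and .iloc (position-based) accessors to make the intent unambiguous:

```python
import pandas as pd

# Labels deliberately differ from positions (hypothetical data)
s = pd.Series([1, 2, 3, 4], index=[3, 2, 1, 0])

by_label = s.loc[[1, 3]]      # rows whose labels are 1 and 3
by_position = s.iloc[[1, 3]]  # rows at zero-based positions 1 and 3
```

    In this sketch, by_label contains the values 3 and 1, while by_position contains the values 2 and 4.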

    A Series object can be created with a user-defined index by specifying the labels for the index using the index parameter.
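
    As a minimal sketch of the index parameter, the labels 'a' through 'd' below are hypothetical and chosen only for illustration:

```python
import pandas as pd

# Supply one label per value through the index parameter
s = pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"])

# Lookups now use the user-defined labels
picked = s[["a", "d"]]
```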

    In [4]:   # create a series using an explicit
