Lab2 Pandas1
Table of Contents
Learning Goals
About 6,000-odd "best books" were fetched and parsed from Goodreads. The "bestness" of
these books came from a proprietary formula used by Goodreads and published as a list on their
web site.
We parsed the page for each book and saved data from all these pages in a tabular format as a
CSV file. In this lab we'll clean and further parse the data. We'll then do some exploratory data
analysis to answer questions about these best books and popular genres.
• Load and systematically address missing values, encoded as NaN values in our data set,
for example, by removing observations associated with these values.
• Parse columns in the dataframe to create new dataframe columns.
• Use groupby to aggregate data on a particular feature column, such as author.
This lab corresponds to lectures 2, 3, and 4, and maps onto homework 1 and beyond.
1. Build a DataFrame from the data (ideally, put all data in this object)
2. Clean the DataFrame. It should have the following properties:
– Each row describes a single object
– Each column describes a property of that object
– Columns are numeric whenever appropriate
– Columns contain atomic properties that cannot be further decomposed
3. Explore global properties. Use histograms, scatter plots, and aggregation functions to
summarize the data.
4. Explore group properties. Use groupby and small multiples to compare subsets of the
data.
This process transforms your data into a format which is easier to work with, gives you a basic
overview of the data's properties, and likely generates several questions for you to follow up on
in subsequent analysis.
Part 1: Loading and Cleaning with Pandas
Read in the goodreads.csv file, examine the data, and do any necessary data cleaning.
Here is a description of the columns (in order) present in this csv file:
Let us see what issues we find with the data and resolve them.
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
Oh dear. That does not quite seem to be right. We are missing the column names. We need to
add these in! But what are they?
Use these to load the dataframe properly! And then "head" the dataframe... (you will need to
look at the read_csv docs)
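Judging from the `df.dtypes` listing further down, the ten columns are rating, review_count, isbn, booktype, author_url, year, genre_urls, dir, rating_count, and name. A sketch of the load, using a made-up one-row stand-in for the real file (in the lab you would pass "goodreads.csv" directly):

```python
import io

import pandas as pd

# Column names inferred from the dtypes listing below; the CSV has no
# header row, so we pass header=None and supply names explicitly.
names = ["rating", "review_count", "isbn", "booktype", "author_url",
         "year", "genre_urls", "dir", "rating_count", "name"]

# In the lab: df = pd.read_csv("goodreads.csv", header=None, names=names)
# A made-up one-row stand-in shows the call shape:
sample = io.StringIO(
    "4.5,100,0000000000,good_reads:book,"
    "https://www.goodreads.com/author/show/1.Jane_Doe,"
    "1999,/genres/fiction,dir01/1.html,2500,Example Title\n")
df = pd.read_csv(sample, header=None, names=names)
print(df.head())
```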
df.dtypes
rating float64
review_count float64
isbn object
booktype object
author_url object
year float64
genre_urls object
dir object
rating_count float64
name object
dtype: object
#Get a sense of how many missing values there are in the dataframe.
df.rating.isnull().sum()
How does pandas or numpy handle missing values when we try to compute with data sets that
include them?
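As a quick illustration (a minimal sketch, not from the lab itself): plain NumPy arithmetic propagates NaN, while pandas reductions skip missing values by default.

```python
import numpy as np
import pandas as pd

s = pd.Series([4.0, np.nan, 5.0])

# Plain NumPy arithmetic propagates NaN...
print(np.mean(s.values))     # nan
# ...while pandas reductions skip missing values by default (skipna=True).
print(s.mean())              # 4.5
# NumPy also offers explicit NaN-aware versions:
print(np.nanmean(s.values))  # 4.5
```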
We'll now check if any of the other suspicious columns have missing values. Let's look at year
and review_count first.
One thing you can do is to try to convert the column to the type you expect it to be. If something
goes wrong, it likely means your data are bad.
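For instance, casting a float column that still contains NaN to int raises an error, which is exactly the signal that the column needs cleaning first (the values below are made up):

```python
import numpy as np
import pandas as pd

year = pd.Series([2008.0, np.nan, 1999.0])
try:
    year.astype(int)
except (ValueError, TypeError) as err:
    # NaN cannot be represented as an int, so the cast fails,
    # flagging that the column needs cleaning before conversion.
    print("conversion failed:", err)
```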
df[df.year.isnull()]
df = df[df.year.notnull()]
Ok, so we have done some cleaning. What do things look like now? Notice the float dtypes have
not yet changed.
df.dtypes
print(np.sum(df.year.isnull()))
print(np.sum(df.rating_count.isnull()))
print(np.sum(df.review_count.isnull()))
# We removed seven rows
df.shape
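The conversion itself might look like the sketch below (a stand-in frame with made-up values; in the lab the same casts run on `df` once the rows with missing values are gone):

```python
import pandas as pd

# Stand-in frame with made-up values, mirroring the three float columns:
demo = pd.DataFrame({"year": [2008.0, 1999.0],
                     "rating_count": [100.0, 250.0],
                     "review_count": [10.0, 25.0]})

# With no missing values left, each cast to int succeeds:
for col in ["year", "rating_count", "review_count"]:
    demo[col] = demo[col].astype(int)

print(demo.dtypes)
```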
Once you do this, we seem to be good on these columns (no errors in conversion). Let's look:
df.dtypes
Sweet!
df.loc[df.genre_urls.isnull(), 'genre_urls']=""
df.loc[df.isbn.isnull(), 'isbn']=""
Examine an example author_url and reason about which sequence of string operations must
be performed in order to isolate the author's name.
test_string.split('/')[-1].split('.')[1:][0]
Let's wrap the above code into a function, which we will then use.
# Write a function that accepts an author url and returns the author's
# name based on your experimentation above
def get_author(url):
    # your code here
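One possible implementation, matching the string operations experimented with above. The example URL is made up, but follows the `.../author/show/<id>.<Author_Name>` pattern the lab's split logic assumes:

```python
def get_author(url):
    # The last path segment looks like "<id>.<Author_Name>";
    # drop the numeric id and keep the name part.
    return url.split('/')[-1].split('.')[1:][0]

# Made-up URL following the same pattern:
print(get_author("https://www.goodreads.com/author/show/123.Jane_Doe"))

# In the lab, the parsed names become a new column:
# df['author'] = df.author_url.map(get_author)
```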
This is a little more complicated because there may be more than one genre.
df.genre_urls.head()
Write a function that accepts a genre url and returns the genre name based on your
experimentation above
def split_and_join_genres(url):
    # your code here
split_and_join_genres("/genres/young-adult|/genres/science-fiction")
split_and_join_genres("")
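A possible implementation matching the two calls above: split the field on `|`, keep the last path segment of each genre URL, and rejoin with `|` (an empty input yields an empty string):

```python
def split_and_join_genres(url):
    # Each genre looks like "/genres/<name>"; several are joined by "|".
    genres = url.strip().split('|')
    genres = [g.split('/')[-1] for g in genres]
    return "|".join(genres)

print(split_and_join_genres("/genres/young-adult|/genres/science-fiction"))
print(split_and_join_genres(""))
```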
df['genres']=df.genre_urls.map(split_and_join_genres)
df.head()
Finally, let's pick an author at random so we can see the results of the transformations. Scroll to
see the author and genre columns that we added to the dataframe.
df[df.author == "Marguerite_Yourcenar"]
del df['genre_urls']
We can determine the "best book" by year! For this we use Pandas groupby. Groupby allows
grouping a dataframe by any (usually categorical) variable.
dfgb_author = df.groupby('author')
type(dfgb_author)
dfgb_author.count()
dfgb_author['author'].count()
ratingdict = {}
for author, subset in dfgb_author:
    ratingdict[author] = (subset['rating'].mean(),
                          subset['rating'].std())
ratingdict
Let's get the best-rated book(s) for every year in our dataframe.
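One common idiom for this: group by year and keep the rows whose rating equals their year's maximum, which also retains ties (the stand-in data below is made up):

```python
import pandas as pd

df = pd.DataFrame({"name": ["A", "B", "C", "D"],
                   "year": [1999, 1999, 2008, 2008],
                   "rating": [4.5, 4.1, 4.7, 4.7]})

# transform('max') broadcasts each year's best rating back onto every
# row, so the boolean mask keeps all books tied for best in their year.
best = df[df.rating == df.groupby("year").rating.transform("max")]
print(best)
```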