
Lab2 Pandas1


This Lab is based on the Data Preprocessing Course

Lab Data analysis with pandas

Learning Goals
About 6000-odd "best books" were fetched and parsed from Goodreads. The "bestness" of
these books came from a proprietary formula used by Goodreads and published as a list on their
web site.

We parsed the page for each book and saved data from all these pages in a tabular format as a
CSV file. In this lab we'll clean and further parse the data. We'll then do some exploratory data
analysis to answer questions about these best books and popular genres.

By the end of this lab, you should be able to:

• Load and systematically address missing values, encoded as NaN values in our data set,
for example, by removing observations associated with these values.
• Parse columns in the dataframe to create new dataframe columns.
• Use groupby to aggregate data on a particular feature column, such as author.

This lab corresponds to lectures 2, 3 and 4 and maps onto homework 1 and beyond.

Basic EDA workflow


The basic workflow is as follows:

1. Build a DataFrame from the data (ideally, put all data in this object)
2. Clean the DataFrame. It should have the following properties:
– Each row describes a single object
– Each column describes a property of that object
– Columns are numeric whenever appropriate
– Columns contain atomic properties that cannot be further decomposed
3. Explore global properties. Use histograms, scatter plots, and aggregation functions to
summarize the data.
4. Explore group properties. Use groupby and small multiples to compare subsets of the
data.

This process transforms your data into a format which is easier to work with, gives you a basic
overview of the data's properties, and likely generates several questions for you to follow up on
in subsequent analysis.
Part 1: Loading and Cleaning with Pandas
Read in the goodreads.csv file, examine the data, and do any necessary data cleaning.

Here is a description of the columns (in order) present in this csv file:

rating: the average rating on a 1-5 scale achieved by the book


review_count: the number of Goodreads users who reviewed this book
isbn: the ISBN code for the book
booktype: an internal Goodreads identifier for the book
author_url: the Goodreads (relative) URL for the author of the book
year: the year the book was published
genre_urls: a string with '|' separated relative URLS of Goodreads
genre pages
dir: a directory identifier internal to the scraping code
rating_count: the number of ratings for this book (this is different
from the number of reviews)
name: the name of the book

Let us see what issues we find with the data and resolve them.

After loading appropriate libraries

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)

Cleaning: Reading in the data


We read in and clean the data from goodreads.csv.

#Read the data into a dataframe


df = pd.read_csv("C:/Users/hp/Desktop/PD/goodreads.csv")

#Examine the first few rows of the dataframe


df.head()

[output: the first rows of the dataframe, printed without column names — the values of the
first book, The Hunger Games, have been consumed as the header row, and rows 0-4 show
Harry Potter and the Order of the Phoenix, Twilight, To Kill a Mockingbird, Pride and
Prejudice, and Gone with the Wind]
Oh dear. That does not quite seem to be right. We are missing the column names. We need to
add these in! But what are they?

Here is a list of them in order:


["rating", 'review_count', 'isbn', 'booktype','author_url', 'year',
'genre_urls', 'dir','rating_count', 'name']

Use these to load the dataframe properly! And then "head" the dataframe... (you will need to
look at the read_csv docs)

# your code here


df = pd.read_csv("C:/Users/hp/Desktop/PD/goodreads.csv", header=None,
                 names=["rating", 'review_count', 'isbn', 'booktype',
                        'author_url', 'year', 'genre_urls', 'dir',
                        'rating_count', 'name'])
df.head()

[output: the first five rows of the dataframe, now under the correct column names —
rating, review_count, isbn, booktype, author_url, year, genre_urls, dir, rating_count,
name — with The Hunger Games restored as row 0]

Cleaning: Examining the dataframe - quick checks


We should examine the dataframe to get an overall sense of the content.

Let's check the types of the columns. What do you find?

df.dtypes

rating float64
review_count float64
isbn object
booktype object
author_url object
year float64
genre_urls object
dir object
rating_count float64
name object
dtype: object

your answer here: review_count, rating_count, and year are float64 rather than integer
types — a hint that these columns contain NaN values, since NaN forces a float dtype.


There are a couple more quick sanity checks to perform on the dataframe.
print(df.shape)
df.columns

Cleaning: Examining the dataframe - a deeper look


Beyond checking some quick general properties of the data frame and looking at the
first n rows, we can dig a bit deeper into the values being stored. If you haven't already, check to
see if there are any missing values in the data frame.

Let's look at a column that seemed OK to us.

#Get a sense of how many missing values there are in the dataframe.
df.rating.isnull().sum()

#Try to locate where the missing values occur


df[df.rating.isnull()]

How does pandas or numpy handle missing values when we try to compute with data sets that
include them?
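As a quick illustration (a toy Series, not the lab data): pandas aggregations skip NaN by default, while plain NumPy operations propagate it.

```python
import numpy as np
import pandas as pd

s = pd.Series([4.0, np.nan, 3.5])

print(s.mean())              # pandas skips NaN by default -> 3.75
print(s.mean(skipna=False))  # force NaN to propagate -> nan
print(np.mean(s.values))     # plain NumPy propagates NaN -> nan
print(np.nanmean(s.values))  # NaN-aware NumPy variant -> 3.75
```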

We'll now check if any of the other suspicious columns have missing values. Let's look at year
and review_count first.

One thing you can do is to try and convert to the type you expect the column to be. If something
goes wrong, it likely means your data are bad.
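For instance (a sketch on a toy Series; the exact exception type and message vary by pandas version), casting a column that still contains NaN to int fails loudly:

```python
import numpy as np
import pandas as pd

s = pd.Series([2008.0, np.nan, 1960.0])
try:
    s.astype(int)  # NaN cannot be represented as an int
except (ValueError, TypeError) as e:
    print("conversion failed:", e)
```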

Let's test for missing data:

df[df.year.isnull()]

Cleaning: Dealing with Missing Values


How should we interpret 'missing' or 'invalid' values in the data (hint: look at where these values
occur)? One approach is to simply exclude them from the dataframe. Is this appropriate for all
'missing' or 'invalid' values?

#Treat the missing or invalid values in your dataframe


#######

df = df[df.year.notnull()]

OK, so we have done some cleaning. What do things look like now? Notice the float dtypes have
not yet changed.

df.dtypes

print(np.sum(df.year.isnull()))
print(np.sum(df.rating_count.isnull()))
print(np.sum(df.review_count.isnull()))
# We removed seven rows
df.shape

Suspect observations for rating and rating_count were removed as well!


OK, so let's fix those types by converting them to ints. If the type conversion fails, we now know
we have further problems.

# your code here
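One possible solution (sketched on a small stand-in dataframe; in the lab, these are columns of the cleaned `df` loaded from goodreads.csv):

```python
import pandas as pd

# stand-in for the cleaned dataframe; the lab's df comes from goodreads.csv
df = pd.DataFrame({'year': [2008.0, 1960.0],
                   'rating_count': [2958974.0, 2078123.0],
                   'review_count': [136455.0, 47906.0]})

# after dropping rows with missing values, these casts should succeed
for col in ['year', 'rating_count', 'review_count']:
    df[col] = df[col].astype(int)

print(df.dtypes)
```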

Once you do this, we seem to be good on these columns (no errors in conversion). Let's look:

df.dtypes

Sweet!

Some of the other columns that should be strings have NaNs; replace those with empty strings:

df.loc[df.genre_urls.isnull(), 'genre_urls']=""
df.loc[df.isbn.isnull(), 'isbn']=""

Part 2: Parsing and Completing the Data Frame


We will parse the author column from the author_url and genres column from the genre_urls.
Keep the genres column as a string separated by '|'.

We will use pandas' map to assign new columns to the dataframe.

Examine an example author_url and reason about which sequence of string operations must
be performed in order to isolate the author's name.

#Get the first author_url


test_string = df.author_url[0]
test_string

#Test out some string operations to isolate the author name

test_string.split('/')[-1].split('.')[1:][0]

Let's wrap the above code into a function, which we will then use.

# Write a function that accepts an author url and returns the
# author's name, based on your experimentation above
def get_author(url):
    # your code here
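One possible implementation, wrapping the string operations tried above (a sketch; it assumes every author_url ends in `<id>.<Author_Name>`, as in the example rows):

```python
def get_author(url):
    # the last path segment looks like '153394.Suzanne_Collins';
    # keep the part after the first '.'
    return url.split('/')[-1].split('.')[1:][0]

print(get_author("https://www.goodreads.com/author/show/153394.Suzanne_Collins"))
# Suzanne_Collins
```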

#Apply the get_author function to the 'author_url' column using '.map'


#and add a new column 'author' to store the names
df['author'] = df.author_url.map(get_author)
df.author[0:5]

Now parse out the genres from genre_url.

This is a little more complicated because there may be more than one genre.

df.genre_urls.head()

# your code here

Write a function that accepts a genre url and returns the genre name based on your
experimentation above

def split_and_join_genres(url):
    # your code here
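One possible implementation (a sketch; it assumes each entry looks like `/genres/<name>` and that an empty input should map to an empty string):

```python
def split_and_join_genres(url):
    # '/genres/young-adult|/genres/science-fiction' -> 'young-adult|science-fiction'
    genres = url.strip().split('|')
    # keep the last path piece of each non-empty entry
    return "|".join(g.split('/')[-1] for g in genres if g)

print(split_and_join_genres("/genres/young-adult|/genres/science-fiction"))
# young-adult|science-fiction
print(split_and_join_genres(""))
# (empty string)
```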

Test your function

split_and_join_genres("/genres/young-adult|/genres/science-fiction")

split_and_join_genres("")

Use map again to create a new "genres" column

df['genres']=df.genre_urls.map(split_and_join_genres)
df.head()

Finally, let's pick an author at random so we can see the results of the transformations. Scroll to
see the author and genre columns that we added to the dataframe.

df[df.author == "Marguerite_Yourcenar"]

Let us delete the genre_urls column.

del df['genre_urls']

And then save the dataframe out!

df.to_csv("data/cleaned-goodreads.csv", index=False, header=True)


Part 3: Grouping
It appears that some books were written in negative years! Print out the observations that
correspond to negative years. What do you notice about these books?

# your code here
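A minimal sketch of the filter, on hypothetical stand-in rows (the lab's `df` holds the real data; negative years typically mark ancient works dated BCE):

```python
import pandas as pd

# hypothetical rows standing in for the cleaned dataframe
df = pd.DataFrame({'name': ['The Odyssey', 'Twilight'],
                   'year': [-800, 2005]})

# select books with negative publication years
print(df[df.year < 0])
```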

We can determine the "best book" by year! For this we use Pandas groupby. Groupby allows
grouping a dataframe by any (usually categorical) variable.

dfgb_author = df.groupby('author')
type(dfgb_author)

Perhaps we want the number of books each author wrote

dfgb_author.count()

Lots of useless info there. One column should suffice:

dfgb_author['author'].count()

Perhaps you want more detailed info...

dfgb_author[['rating', 'rating_count', 'review_count', 'year']].describe()

You can also access a groupby dictionary style.

ratingdict = {}
for author, subset in dfgb_author:
    ratingdict[author] = (subset['rating'].mean(), subset['rating'].std())
ratingdict

Let's get the best-rated book(s) for every year in our dataframe.

#Using .groupby, we can divide the dataframe into subsets by the values of 'year'.
#We can then iterate over these subsets
# your code here
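One way to sketch this (on hypothetical stand-in rows; the real columns come from the cleaned dataframe) is to compare each rating against its year's maximum via `transform`, which keeps ties:

```python
import pandas as pd

# hypothetical rows standing in for the cleaned dataframe
df = pd.DataFrame({'name': ['A', 'B', 'C', 'D'],
                   'year': [2008, 2008, 2005, 2005],
                   'rating': [4.4, 4.1, 3.6, 4.2]})

# keep every book whose rating equals the max rating for its year
best = df[df.rating == df.groupby('year')['rating'].transform('max')]
print(best)
```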
