Scraping the Web for Arts and Humanities
2 Introducing HTML
3 Introducing Python
5 Downloading files
6 Extracting links
7 Extracting tables
8 Final notes
In this introduction, I’m going to talk about the purpose of the booklet, some prerequisites that you’ll need, and provide an outline of the things we’ll cover along the way.
What does this booklet cover? By the time you reach the end of
this booklet, you should be able to
There’s little in this book that you couldn’t do if you had enough time and the patience to do it by hand. Indeed, many of the examples would go quicker by hand. But they cover principles that can be extended quite easily to really tough projects.
Usage scenarios
Web scraping will help you in any situation where you find yourself
copying and pasting information from your web browser. Here are
some times when I’ve used web scraping:
Alternatives
I scrape the web because I want to save time when I have to collect
a lot of information. But there’s no sense writing a program to scrape
the web when you could save time some other way. Here are some
alternatives to screen scraping:
There are a number of ethical and legal issues to bear in mind when
scraping. These can be summarized as follows:
Respect the hosting site’s wishes Many large web sites have a file called robots.txt. It’s a list of instructions to bots, or scrapers. It tells you which parts of the site you shouldn’t include when scraping.
Here is (part of) the listing from the BBC’s robots.txt:
That means that the BBC doesn’t want you scraping anything
from iPlayer, or from their Bitesize GCSE revision micro-site. So
don’t.
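If you want to check a particular address from Python itself, the standard library’s robotparser module (that is its Python 2 name) can read robots.txt for you. The address in the last line below is made up for illustration:

import robotparser

rp = robotparser.RobotFileParser()
rp.set_url('http://www.bbc.co.uk/robots.txt')
rp.read()

# can_fetch() returns False for addresses the site asks bots to stay away from
print rp.can_fetch('*', 'http://www.bbc.co.uk/iplayer/some-programme')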
Respect the law Just because content is online doesn’t mean it’s yours to use as you see fit. Many sites which use paywalls will require you to sign up to Terms and Agreements. This means, for example, that you can’t write a web scraper to download all the articles from your favourite journal for all time.6 The legal requirements will differ from site to site. If in doubt, consult a lawyer.7
6. In fact, this could land your university in real trouble – probably more than torrenting films or music.
7. In fact, if in doubt, it’s probably a sign you shouldn’t scrape.
2
Introducing HTML
Basics
3. What happens if you change the h1 to h2? And h6? And h7?
5. If you can make text bold with <b>, how might you italicise text?
Try it!
• Line 3 starts the body of the web page. Web pages have a
<body> and a <head>. The <head> contains information like the
title of the page.
• Line 5 starts a new heading, the biggest heading size, and closes it.
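A minimal page of the kind these bullets describe looks something like the following. It is not necessarily the exact listing the exercises refer to, so treat it as an illustration, and try pasting it into the TryIt editor:

<!DOCTYPE html>
<html>
<body>

<h1>My First Heading</h1>

<p>My first paragraph.</p>

</body>
</html>

And here’s a code snippet which would insert a link to the University of East Anglia’s home page: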
<a href="http://www.uea.ac.uk/">University of East Anglia</a>
and here’s a code snippet which would insert the UEA logo.
<img src="http://www.uea.ac.uk/polopoly_fs/1.166636!ueastandardrgb.png">
The attribute for the link tag is href (short for hyper-reference), and it takes a particular value – in this case, the address of the web page we’re linking to. The attribute for the image tag is src (short for source), and it too takes a particular value – the address of a PNG image file.4 Try copying these in to the TryIt editor and see what happens. You can see that the tag for links has an opening and closing tag, and that the text of the link goes between these two. The image tag doesn’t need a closing tag.
4. PNG stands for Portable Network Graphics. It’s a popular image format – not as popular as .gif or .jpeg, but technically superior.
Tables
You can find a full list of HTML tags by looking at the official HTML
specification – which, as I’ve already suggested, is a very dull docu-
ment. A list of common HTML tags can be found in Table 2.1.
Two tags in particular are worth pointing out: SPAN and DIV. These are two tags with opening and closing pairs, which mark out sequences of text. They don’t do anything to them – well, DIV starts a new line – but they are used very often in modern HTML pages to add formatting information, or to add interactive elements. Most web pages are now a mix of HTML – which we know about – and two other technologies, CSS (short for Cascading Style Sheets) and Javascript, a programming language. We don’t need to know about them, but they’ll crop up in most web pages we look at. You’ll often see them used with attributes id or class. These provide hooks for the CSS or Javascript to latch on to.
1. What happens to the image if you change the last three letters from .png to .PNG?
3. What happens if you change the first two <td> tags to <th>? What might <th> stand for?
Most of the HTML files we’ve been looking at so far have been minimal examples – or, put less politely, toy examples. It’s time to look at some HTML in the wild.
You can see the HTML used to write a page from within your browser. In most browsers, you need to right-click and select ‘View Source’.6
Let’s pick a good example page to start with: Google’s homepage circa 1998. The Internet Archive’s Wayback Machine can show us what Google looked like on 2nd December 1998 at the following address: http://web.archive.org/web/19981202230410/http://www.google.com/.
6. Apple, in its infinite wisdom, requires you first to change your preferences. Go to Preferences -> Advanced, and check ‘Show Develop menu in menu bar’.
If you go to that address, you can see the original HTML source
code used by Google, starting at line 227. It’s fairly simple. Even
someone who had never seen HTML before could work out the
function of many of the tags by comparing the source with the page
as rendered in the browser.
Now look at some of the stuff before line 227. Urgh. Lines 12-111
have a whole load of Javascript – you can completely ignore that.
Line 114 has a chunk of style information (CSS) – you can ignore
that too. But even the rest of it is peppered with style information,
and looks really ugly. It’s a headache. But that’s the kind of stuff we’ll
have to work with.
Before you close the source view, check you can do Exercise 3.
In the next chapter, which deals with Python, we’ll need to know how
to save plaintext files. So it’s useful to check now that you can do
this.
Go back to the TryIt examples used in Exercises 1 and 2. Paste these examples in to your text-editor. Try saving this document as a plain-text file with the extension .html.7 It doesn’t matter where you save it, but you might want to take this moment to create a new folder to save all the work you’ll be doing in the next chapter. It also doesn’t matter what you call it – but test.html would be a good suggestion.
7. This extension isn’t strictly necessary, but many operating systems and browsers live and die on the basis of correctly-assigned extensions.
Once you’ve saved your file, it’s time to open it in your browser.
You should be able to open a local file in your browser. Some
browsers will, as a default, only show you files ending in .htm or
.html.
If your browser offers to save the file you’ve just tried to open,
you’ve done something wrong. If you get a whole load of gibberish,
you’ve done something wrong. Try googling for the name of your text
editor and ‘save plain text’.
• We’ve seen how these tags are used (and abused) in practice.
We’ll use all of these in future chapters – but for the next chapter,
we can put away the browser and fire up our plain-text editor...
3
Introducing Python
• It’s tidy. Code written in Python is very easy to read, even for people who have no understanding of computer programming. This compares favourably with other languages.2
• It’s popular. Python is in active development and there is a large installed user base. That means that there are lots of people learning Python, that there are lots of people helping others learn Python, and that it’s very easy to Google your way to an answer.
• It’s used for web scraping. The site ScraperWiki (which I mentioned in the introduction) hosts web scrapers written in three languages: Python, Ruby, and PHP. That means that you can look at Python scrapers that other people have written, and use them as templates or recipes for your own scrapers.
2. The language I most often use, Perl, makes it quite easy to produce horrendous code. Some have joked that Perl stands for Pathologically Eclectic Rubbish Lister.
Before we begin
Installing Python
Installing BeautifulSoup
First steps
1+1
22.0/7.0
pow(2, 16)
pi = 22.0 / 7.0
pi
a = 20
b = 10
a = b
print a
print b
Numbers aren’t the only types of values that variables can hold.
They can also hold strings.
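The snippets below slice up a string stored in a variable called myfirstname. Any string will do; for example:

myfirstname = 'Christopher'   # put your own first name here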
myfirstname[0:5]
myfirstname[1:5]
myfirstname[:5]
myfirstname[5:]
Looper
for i in myfirstname:
    print i
Regular expressions
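Regular expressions let us search for patterns of text rather than fixed strings, and we’ll lean on them in later chapters through the re package. Here is a small taste, with made-up example strings:

import re

# re.search() looks for a pattern anywhere in a string;
# it returns None if the pattern isn't found
print re.search('wai$', 'Mogwai') is not None    # True: the string ends in 'wai'

# re.sub() replaces whatever matches the pattern
print re.sub('/.*', '', 'digitrans.crowdvine.com/pages/watch-live')
# prints: digitrans.crowdvine.com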
Conclusion
4
Extracting some text
WEB PAGES INCLUDE A LOT OF IRRELEVANT FORMATTING. Very often, we’re not interested in the images contained in a page, the mouse-overs that give us definitions of terms, or the buttons that allow us to post an article to Facebook or retweet it. Some browsers now allow you to read pages without all of this irrelevant information.1 In this first applied chapter, we’re going to write some Python scrapers to extract the text from a page, and print it either to the screen or to a file.
1. Safari’s ‘Reader’ feature; some Instapaper features.
The example
The example I’m going to use for this chapter is a recent review from the music website, Pitchfork.com. In particular, it’s a review of the latest (at the time of writing) Mogwai album, A Wrenched and Virile Lore.2 You can find it at http://pitchfork.com/reviews/albums/17374-a-wrenched-virile-lore/. When you open it in your browser, you should see something like Figure 4.1.
2. No, I don’t know what that means either.
You can take a look at the source of the web page by right clicking and selecting View Source (or the equivalent in your browser). The source is not that readable – it’s 138 lines long, but many of those lines are very long indeed. We’re interested in the start of the review text itself, beginning ‘Mogwai made their name’. Try searching for ‘made their name’ in the source. You should find it on line 55 (a horrifically long line). See Listing 4.2.
How are we supposed to make any sense of that? Well, let’s look at where the review begins. We’ve seen that ‘Mogwai’ is between opening and closing link tags (<a> and </a>). Those take us to a round-up of all Pitchfork articles on Mogwai, which we don’t want. If we go back before that, we see that the first line is wrapped in opening and closing paragraph tags. That’s helpful: we’ll definitely be interested in text contained in paragraph tags (as opposed to free-floating text). But the tag that’s really helpful is the one before the opening paragraph tag:
I’m going to provide the source for our very first Python scraper. I’ll list the code, and then explain it line by line. The code is in Listing 4.3.
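A minimal version of such a scraper looks something like the sketch below – a sketch rather than the exact listing, so the details may differ:

from urllib2 import urlopen
from bs4 import BeautifulSoup

# the address of the review we want to scrape
start = 'http://pitchfork.com/reviews/albums/17374-a-wrenched-virile-lore/'

# fetch the page and parse it
soup = BeautifulSoup(urlopen(start))

# print the text of every paragraph tag in the page
for para in soup.find_all('p'):
    print para.get_text()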
Here’s a line-by-line run-down:
What kind of output does this give us? Listing 4.4 shows the first
four lines.
Hmm.... not so good. We’ve still got a lot of crap at the top. We’re
going to have to work on that. For the moment, though, check you
can do Exercise 5 before continuing.
This time, the output doesn’t have a whole load of crap at the start, and it starts with the text of the review.
Unfortunately, it’s still not perfect. You’ll see at the bottom of the output that there are a number of sentence fragments ending in ellipses (...). If you go back to the page in your web browser, as
shown in Figure 4.1, you’ll see that these sentence fragments are
actually parts of boxes linking to other Mogwai reviews. We don’t
want to include those in our output. We need to find some way of
becoming more precise.
That’s where the div we saw earlier comes in to play. Here’s the
listing; explanation follows.
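A sketch of the more precise version follows. The class name editorial is only a stand-in – check the page source for the class that actually wraps the review text:

from urllib2 import urlopen
from bs4 import BeautifulSoup

start = 'http://pitchfork.com/reviews/albums/17374-a-wrenched-virile-lore/'
soup = BeautifulSoup(urlopen(start))

# only look inside the div that wraps the review itself
for div in soup.find_all('div', {'class': 'editorial'}):   # placeholder class name
    for para in div.find_all('p'):
        print para.get_text()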
Recap
First, we identified the portion of the web page that we wanted, and
found the corresponding location in the HTML source code. This is
usually a trivial step, but can become more complicated if you want
to find multiple, non-contiguous parts of the page.
So far, we’ve been happy just printing the results of our program to
the screen. But very often, we want to save them somewhere for
later use. It’s simple to do that in Python.
Here’s a listing which shows how to write our output to a file called review.txt.
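In outline, it looks something like this sketch, built on the paragraph-printing scraper from earlier (the original listing may differ in its details):

import codecs
from urllib2 import urlopen
from bs4 import BeautifulSoup

start = 'http://pitchfork.com/reviews/albums/17374-a-wrenched-virile-lore/'
outfile = codecs.open('review.txt', 'w', 'utf-8')   # codecs copes with accented characters

soup = BeautifulSoup(urlopen(start))
for para in soup.find_all('p'):
    outfile.write(para.get_text() + '\n')

outfile.close()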
There are two changes you need to take note of. First, we import a new package, called codecs. That’s to take care of things like accented characters, which are represented in different ways on different web-pages. Second, instead of calling the print function, we call the write method on our output file.
A taster
2. Is this on its own enough to extract the artist? If not, what div
must you combine it with?
5
Downloading files
In the previous section, Python helped us clear a lot of the junk from the text of web pages. Sometimes, however, the information we want isn’t plain text, but is a file – perhaps an image file (.jpg, .png, .gif), or a sound file (.mp3, .ogg), or a document file (.pdf, .doc). Python can help us by making the automated downloading of these files easier – particularly when they’re spread over multiple pages.
In this chapter, we’re going to learn the basics of how to identify links and save them. We’re going to use some of the regular expression skills we learned back in Chapter 3. And we’re going to make some steps in identifying sequences of pages we want to scrape.
The example
and be a little smart, and create a filename for the saved file. We
have to give a filename – Python’s not going to invent one for us. So
we’ll use the last part of the address itself – the part after the last
forward slash.
In order to get that, we take the url, and call the split function
on it. We give the split function the character we want to split on, the
forward slash. That split function would normally return us a whole
list of parts. But we only want the last part. So we use the minus
notation to count backwards (as we saw before).
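In code, that looks something like this (the address is made up for illustration):

url = 'http://www.example.org/podcast/episode1.mp3'   # a made-up address
filename = url.split('/')[-1]    # split on '/', keep only the last part
print filename                   # prints: episode1.mp3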
In Line 8, we create another variable to tell Python which directory
we want to save it in. Remember to create this directory, or Python
will fail. Remember also to include the trailing slash.
Finally, line 10 does the hard work of downloading stuff for us.
We call the urlretrieve function, and pass it two arguments – the
address, and a path to where we want to save the file, which has the
directory plus the filename.
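Putting those pieces together, with made-up names for the address and the directory:

from urllib import urlretrieve   # in Python 2, urlretrieve lives in urllib

url = 'http://www.example.org/podcast/episode1.mp3'   # made-up address
filename = url.split('/')[-1]
directory = 'downloads/'         # create this folder first, and keep the trailing slash

# download the file and save it as downloads/episode1.mp3
urlretrieve(url, directory + filename)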
One thing you’ll notice when you try to run this program – it will
seem as if it’s not doing anything for a long time. It takes time to
download things, especially from sites which aren’t used to heavy
traffic. That’s why it’s important to be polite when scraping.
This is all fine if we’re just interested in a single page. But frankly, there are add-ons for your browser that will help you download all files of a particular type on a single page.3 Where Python really comes in to its own is when you use it to download multiple files from multiple pages all spinning off a single index page. Got that? Good. We’re going to use a different example to talk about it.
3. DownloadThemAll! for Firefox is very good.
The Leveson Inquiry into Culture, Practice and Ethics of the Press delivered its report on 29th November 2012. It was based on several thousand pages of written evidence and transcripts, all of which are available online at the Inquiry website.4 Downloading all of those submissions – perhaps you’re going away to read them all on a desert island with no internet – would take a very, very long time. So let’s write a scraper to do so.
4. http://www.levesoninquiry.org.uk/
If you go to http://www.levesoninquiry.org.uk/evidence/ and
click on ‘View all’, you’ll see the full list of individuals who submitted
evidence. See Figure 5.3 if you’re in any doubt.
We’re interested in the links to the pages holding evidence of specified individuals: pages featuring Rupert Murdoch’s evidence5, or Tony Blair’s6. The common pattern is in the link – all the pages we’re interested in have witness= in them. So we’ll build our parser on that basis.
5. http://www.levesoninquiry.org.uk/evidence/?witness=rupert-murdoch
6. http://www.levesoninquiry.org.uk/evidence/?witness=tony-blair
When we get to one of those pages, we’ll be looking to download all of the PDF transcripts. So we’ll amend the same regular expression code we used to download MP3s above. Listing 5.4 shows the necessary steps.
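In outline, the code looks something like the sketch below. The ‘View all’ address and the link text ‘Transcript’ are assumptions, so check them against the real pages:

import re
import time
from urllib2 import urlopen
from bs4 import BeautifulSoup

base = 'http://www.levesoninquiry.org.uk'
start = base + '/evidence/?view=all'    # assumed form of the 'View all' address

soup = BeautifulSoup(urlopen(start))

# witness pages all have 'witness=' somewhere in their address
for witness_link in soup.find_all('a', href=re.compile('witness=')):
    witness_page = base + witness_link['href']
    time.sleep(1)                                  # pause to give the servers a rest
    soup = BeautifulSoup(urlopen(witness_page))    # re-use the same soup variable
    # we want PDFs, but only transcripts, so check the link text as well
    for pdf_link in soup.find_all('a', href=re.compile(r'\.pdf$'), text=re.compile('Transcript')):
        print base + pdf_link['href']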
believe).
We then pause a little to give the servers a rest, with time.sleep,
from the time package. We then open a new page, with the full
address we just created! (We store it in the same soup, which might
get confusing).
Now we’re on the witness page, we need to find more links. Just
searching for stuff that ends in .pdf isn’t enough; we need just PDF
transcripts. So we also add a regular expression to search on the
text of the link.
To save bandwidth (and time!) we close by printing off the base
URL together with the relative link from the href attribute of the <a>
tag. If that leaves you unsatisfied, try Exercise 7.
2. Turn your wireless connection off and try running the program
again. What happens?
6
Extracting links
The idea of the link is the fundamental building block not only of the web but of many applications built on top of the web. Links – whether they’re links between normal web pages, between followers on Twitter, or between friends on Facebook – are often based on latent structures that not even those doing the linking are aware of. We’re going to write a Python scraper to extract links from a particular web site, the AHRC website. We’re going to write our results to a plain text spreadsheet file, and we’re going to try and get that in to a spreadsheet program so we can analyze it later.
AHRC news
We’re going to pivot off the div with class of item, and identify
the links in those divs. Once we get those links, we’ll go to those
news items. Those items (and you’ll have to trust me on this) have
divs with class of pageContent. We’ll use that in the same way.
8 start = 'http://www.ahrc.ac.uk/News-and-Events/News/Pages/News-Listing.aspx'
9 outfile = codecs.open('ahrc_links.csv', 'w', 'utf-8')
10
11 soup = BeautifulSoup(urlopen(start))
12
Listing 6.2 shows the eventual link scraper. You should notice
two things. First, we’re starting to use if-tests, like we discussed
back in Chapter 3. Second, we’ve got quite a lot of loops – we loop
over news items, we have a (redundant) loop over content divs, and
we loop over all links. The combination of these two things means
there’s quite a lot of indentation.
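Since only the opening lines of the listing appear above, here is a sketch of how the rest might run. The line numbers discussed below refer to the original Listing 6.2, so they won’t match this sketch exactly:

import re
import codecs
from urllib2 import urlopen
from bs4 import BeautifulSoup

start = 'http://www.ahrc.ac.uk/News-and-Events/News/Pages/News-Listing.aspx'
outfile = codecs.open('ahrc_links.csv', 'w', 'utf-8')

soup = BeautifulSoup(urlopen(start))

for item in soup.find_all('div', {'class': 'item'}):          # loop over news items
    for newslink in item.find_all('a'):
        if newslink.has_attr('href'):
            # open the news item itself (assuming the href is a full address)
            newspage = BeautifulSoup(urlopen(newslink['href']))
            for content in newspage.find_all('div', {'class': 'pageContent'}):
                for link in content.find_all('a'):             # loop over all links
                    if link.has_attr('href'):                  # some <a> tags have no href
                        linkurl = link['href']
                        if linkurl[0:4] == 'http':             # keep only external links
                            # keep the web site address, not any folders below it
                            linkurl = re.sub('/.*', '', linkurl[7:])
                            outfile.write(start + '\t' + linkurl + '\n')

outfile.close()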
Let me explain three lines in particular. Line 15 tests whether or not the <a> tag that we find in the loop beginning Line 14 has an href attribute. It’s good to test for things like that. There are some <a> tags which don’t have href attributes.1 If you get to line 16 with just such a tag, Python will choke.
Line 23 takes a particular slice out of our link text. It goes from the beginning to the fourth character. We could have made that clearer by writing linkurl[0:4] – remember, lists in Python start from zero, not one. We’re relying on external links beginning with http.
1. Instead they have a name attribute, and act as anchor points. You use them whenever you go to a link with a hash symbol (#) after the .html.
Line 24 uses a regular expression. Specifically, it says, take any
kind of character that follows a forward slash, and replace it with
nothing – and do that to the variable linkurl, from the seventh character onwards. That’s going to mean that we get only the website
address, not any folders below that. (So, digitrans.crowdvine.com/pages/watch-live
becomes digitrans.crowdvine.com/).
Finally, Line 25 gives us our output. We want to produce a spreadsheet table with two columns. The first column is going to be the AHRC page that we scraped. The second column is going to give us the address of the site that page links to.
I told Python to write this to the file ahrc_links.csv. Files that
end in CSV are normally comma-separated values files. That is,
they use a comma where I used a tab. I still told Python to write
to a file ending in .csv, because my computer recognises .csv as
an extension for comma-separated values files, and tries to open
the thing in a spreadsheet. There is an extension for tab separated
values files, .tsv, but my computer doesn’t recognise that. So I cheat,
and use .csv.
This is going to form the route for getting our information into a
form in which we can analyze things. We’re going to get a whole
load of information, smash it together with tabs separating it, and
open it in a spreadsheet.
That strategy pays off most obviously when looking at tables –
and that’s what we’re going to look at next.
7
Extracting tables
Our example
Try copying and pasting the table from the ATP tour page into
your spreadsheet. You should find that the final result is not very
useful. Instead of giving you separate columns for ranking, player
name, and nationality, they’re all collapsed into one column, making
it difficult to search or sort.
You can see why that is by looking at the source of the web page.
You should be able to see that there are only four cells in this
table row, whereas we want to extract six pieces of information (rank,
name, nationality, points, week change, and tournaments played).
What we’re going to do is produce an initial version of the scraper
which extracts the table as it stands, and then improve things by
separating out the first column.
22 outfile.close()
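(Only the last line of the original listing, line 22, is shown above.) In outline, that first version looks something like this sketch, and the original may differ in its details:

import codecs
from urllib2 import urlopen
from bs4 import BeautifulSoup

start = 'http://www.atpworldtour.com/Rankings/Singles.aspx'
outfile = codecs.open('atp_ranks.csv', 'w', 'utf-8')

soup = BeautifulSoup(urlopen(start))

for row in soup.find_all('tr'):                 # loop over the rows of the table
    for cell in row.find_all('td'):             # loop over the cells in each row
        outfile.write(cell.get_text() + '\t')   # tab as the cell separator
    outfile.write('\n')                         # new line after each row

outfile.close()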
Each cell’s contents are followed by my cell separator, which is a tab.1 Finally, after the end of the for loop, I add a new line so that my spreadsheet isn’t just one long row.
How are we going to improve on that? We’re going to use some if and else statements in our code. Essentially, we’re going to process the cell contents one way if it has class of first, but process it in a quite different way if it doesn’t. Listing 7.4 shows the listing.
1. You could use a comma, but then you would have to deal with commas separating players’ first and last names. That’s why although we talk about comma separated values files, we mostly used a tab.
The major differences with respect to the previous listing are as follows. There’s a little bit of a trick in Line 14. Because we’re going to parse table cells with class first on the assumption that they contain spans with the rank, and links, and so on, we’re going to ignore the first row of the table, because it has a table cell with class first which doesn’t contain a span with the rank, etc., and because if we ask Python to get spans from a table cell which doesn’t contain them, it’s going to choke.2 So we take a slice of the results returned by BeautifulSoup, omitting the first element.3
2. We could write code to consider this possibility, but that would make the program more complicated.
3. Remember Python counts from zero.
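In code, the trick looks something like this small sketch (the tags searched for in the real listing may differ):

rows = soup.find_all('tr')     # soup here is the parsed rankings page
for row in rows[1:]:           # [1:] omits the first element, i.e. the first row of the table
    cells = row.find_all('td')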
Listing 7.4: Improved ATP code
python code/atp2.py
1 import re
2 import urlparse
3 import codecs
4 from urllib2 import urlopen
5 from urllib import urlretrieve
6 from bs4 import BeautifulSoup
7 from bs4 import SoupStrainer
8
9 start = 'http://www.atpworldtour.com/Rankings/Singles.aspx'
10 outfile = codecs.open('atp_ranks2.csv', 'w', 'utf-8')
11
28 outfile.write('\n')
29
30 outfile.close()
As with many of these examples, what we’ve just been able to accomplish is at the outer limits of what could be done manually. But
now that we’ve been able to parse this particular ranking table, we
are able to extend our work, and accomplish much more with just a
few additional lines of code.
Looking at the ATP web page, you should notice that there are two drop-down menus, which offer us the chance to look at rankings for different weeks, and rankings for different strata (1-100, 101-200, 201-300, and so on...). We could adapt the code we’ve just written to scrape this information as well. In this instance, the key would come about not through replicating the actions we would go through in the browser (selecting each item, hitting ‘Go’, copying the results), but through examining what happens to the address when we try an example change of week, or change of ranking strata. For example: if we select 101-200, you should see that the URL in your browser’s address bar changes from http://www.atpworldtour.com/Rankings/Singles.aspx to http://www.atpworldtour.com/Rankings/Singles.aspx?d=26.11.2012&r=101&c=#. In fact, if we play around a bit, we can get to arbitrary start points by just adding something after r – we don’t even need to include the d= and c= parameters. Try http://www.atpworldtour.com/Rankings/Singles.aspx?r=314# as an example.
We might therefore wrap most of the code from Listing 7.4 in a for
loop of the form:
weekends = []
soup = BeautifulSoup(urlopen(start), parse_only=SoupStrainer('select', {'id': 'singlesDates'}))
for option in soup.find_all('option'):
    weekends.append(option.get_text())
We could then create two nested for loops with strata and weekends, paste these on to the address we started with, and extract our table information.
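Something like the following sketch would do it. The parameter format is inferred from the addresses above, so treat it as an assumption:

strata = [1, 101, 201, 301]          # start points for each block of the rankings
for weekend in weekends:             # weekends was built in the snippet above
    for stratum in strata:
        address = start + '?d=' + weekend + '&r=' + str(stratum) + '&c=#'
        soup = BeautifulSoup(urlopen(address))
        # ... then run the table-scraping code from Listing 7.4 on this soup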
8
Final notes
CONGRATULATIONS, YOU’VE GOT THIS FAR. You’ve learned
how to understand the language that web pages are written in, and
to take your first steps in a programming language called Python.
You’ve written some simple programs that extract information from
web pages, and turn them in to spreadsheets. I would estimate that
those achievements place you in the top 0.5% of the population
when it comes to digital literacy.
Whilst you’ve learned a great deal, the knowledge you have is
quite shallow and brittle. From this booklet you will have learned a
number of recipes that you can customize to fit your own needs. But
pretty soon, you’ll need to do some research on your own. You’ll
need independent knowledge to troubleshoot problems you have
customizing these recipes – and for writing entirely new recipes.
If scraping the web is likely to be useful to you, you need to do the
following.
Third, you need to look at what other people are doing. Checking
out some of the scrapers on ScraperWiki is invaluable. Look for
some scrapers that use BeautifulSoup. Hack them. Break them.
Then look for other scrapers that are written in plain Python, or using other libraries2 to parse the web. Hack them. Break them. Lather. Rinse. Repeat.
2. Like lxml, one popular BeautifulSoup alternative.
Difficulties
I won’t pretend that scraping the web is always easy. I have almost never written a program that worked the way I wanted it to the first time I ran it. And there are some web pages that are difficult or impossible to scrape. Here’s a list of some things you won’t be able to do.
Twitter You can get Twitter data, but not over the web. You’ll need to use a different package,3 and you’ll need to get a key from Twitter. It’s non-trivial.
3. Say, python-twitter at https://github.com/bear/python-twitter