BeautifulSoup: Web Scraping with Python
Andrew Peterson
Apr 9, 2013
BeautifulSoup
Roadmap
Uses: data types, examples...
Getting started: downloading files with wget
BeautifulSoup: in-depth example - election results table
Additional commands, approaches
PDFMiner (time permitting), additional examples
Etiquette/ Ethics
Similar rules of etiquette apply as Pablo mentioned: limit requests, protect privacy, play nice...
HTML, HTML5 (<!DOCTYPE html>)
data formats: XML, JSON, PDF
APIs
other languages of the web: CSS, Java, PHP, ASP.NET...
(don't forget existing datasets)
General purpose, robust, works with broken tags. Parses HTML and XML, including fixing asymmetric tags, etc. Returns Unicode text strings. Alternatives: lxml (also parses HTML), Scrapy. Faster alternatives: ElementTree, SGMLParser (custom).
Installation
pip install beautifulsoup4 or easy_install beautifulsoup4
See: http://www.crummy.com/software/BeautifulSoup/
On installing libraries: http://docs.python.org/2/install/
<table> defines a table
<th> defines a header cell in a table
<tr> defines a row in a table
<td> defines a cell in a table
HTML Tables
<h4>Simple table:</h4>
<table>
  <tr>
    <td>[r1, c1]</td>
    <td>[r1, c2]</td>
  </tr>
  <tr>
    <td>[r2, c1]</td>
    <td>[r2, c2]</td>
  </tr>
</table>
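A minimal sketch of pulling the cells out of a table like the one above, assuming beautifulsoup4 is installed; the stdlib html.parser backend is used so nothing else is required:

```python
from bs4 import BeautifulSoup

# The simple table from the slide, inlined as a string; in practice you
# would read it from a downloaded .html file.
html = """
<table>
  <tr><td>[r1, c1]</td><td>[r1, c2]</td></tr>
  <tr><td>[r2, c1]</td><td>[r2, c2]</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
# One list per <tr>, holding the text of each <td> in that row.
rows = [[td.get_text() for td in tr.find_all("td")]
        for tr in soup.find_all("tr")]
```

This yields a list of rows ready to be written out as CSV.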
Election results are spread across hundreds of pages; we want to quickly put them in a usable format (e.g. CSV).
Why download the pages first? The website might change at any moment; it preserves the ability to replicate research; and it limits page requests.
I use wget (GNU), which can be called from within Python; alternatively, cURL may be better for Macs, or scrapy.
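One way to call wget from Python is via subprocess; a sketch, where build_wget_cmd is a helper name of my own (not from the slides), and the wget flags used are -P (save into a directory) and -w (wait between requests):

```python
import subprocess

def build_wget_cmd(url, out_dir="pages", wait_seconds=2):
    # -P: directory to save into; -w: seconds to wait between requests
    # (helps limit the load you place on the server).
    return ["wget", "-P", out_dir, "-w", str(wait_seconds), url]

cmd = build_wget_cmd("http://example.com/results1.html")
# To actually download (requires wget on your PATH):
# subprocess.run(cmd, check=True)
```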
Open a page
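A sketch of opening a page with BeautifulSoup; with a downloaded file you would pass the open file object, but here a small inline page (my own example, not from the slides) keeps it self-contained:

```python
from bs4 import BeautifulSoup

# With a downloaded file you would write:
#   soup = BeautifulSoup(open("results.html"), "html.parser")
page = "<html><head><title>Election results</title></head><body><p>Precinct 1</p></body></html>"

soup = BeautifulSoup(page, "html.parser")
title = soup.title.string        # text of the <title> tag
first_para = soup.p.get_text()   # text of the first <p> tag
```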
find_all
find_all finds all the Tag and NavigableString objects that match the criteria you give. Find table rows: find_all("tr"). E.g.:

for link in soup.find_all("a"):
    print(link.get("href"))
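The link loop above, made runnable on a small hypothetical index page (the URLs are illustrative only):

```python
from bs4 import BeautifulSoup

# Hypothetical index page linking to per-county result pages.
html = """
<body>
  <a href="/county/1.html">County 1</a>
  <a href="/county/2.html">County 2</a>
</body>
"""

soup = BeautifulSoup(html, "html.parser")
# find_all("a") returns every <a> Tag; .get("href") reads its attribute.
links = [a.get("href") for a in soup.find_all("a")]
```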
Regular Expressions
Allow precise and flexible matching of strings. Precise: i.e. character-by-character (including spaces, etc.). Flexible: specify a set of allowable characters, unknown quantities. import re
[comic from xkcd]
Brackets [ ] allow matching of any element they contain: [A-Z] matches a capital letter, [0-9] matches a digit, [a-z][0-9] matches a lowercase letter followed by a digit.
Star * matches the previous item 0 or more times; plus + matches the previous item 1 or more times. [A-Za-z]* would match only the first 3 chars of Xpr8r.
Dot . will match anything but the line-break characters \r \n; combined with * or + it is very hungry (greedy)!
Pipe | is for "or": abc|123 matches abc or 123 but not ab3. Question mark ? makes the preceding item optional: c3?[a-z]+ would match c3po and also cpu.
The parser starts from the beginning of the string; you can anchor a match to the end of the string with $.
Regular Expressions
\d, \w and \s match a digit, a word character, and whitespace; \D, \W and \S match the opposite (NOT digit, NOT word character, NOT whitespace; use outside a character class).
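A few of these claims, checked in Python (the test strings are my own):

```python
import re

# \d matches a digit, \D anything that is not a digit.
is_number = bool(re.fullmatch(r"\d+", "2013"))
no_digits = bool(re.fullmatch(r"\D+", "April"))
# A character class with * stops at the first non-matching character:
letters = re.match(r"[A-Za-z]*", "Xpr8r").group()  # stops before the 8
```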
Regular Expressions
Now let's see some examples and put this to use to get the party.
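A sketch of extracting the party from a results line; the line format here is purely illustrative, not the actual layout of the election pages:

```python
import re

# Hypothetical results line; the layout is illustrative only.
row = "Smith, John (DEM) . . . . . 1,234"

party = re.search(r"\(([A-Z]+)\)", row).group(1)  # capture inside parens
votes = re.search(r"[\d,]+$", row).group()        # digits/commas at the end
```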
Basic functions
With parent you move up the parse tree; with contents you move down. contents is an ordered list of the Tag and NavigableString objects contained within a page element. next_sibling and previous_sibling (older spelling: nextSibling, previousSibling) skip to the next or previous thing on the same level of the parse tree.
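The three navigation directions on a tiny fragment (my own example):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<tr><td>a</td><td>b</td></tr>", "html.parser")

first_td = soup.find("td")
up = first_td.parent.name                  # "tr": parent moves up the tree
over = first_td.next_sibling.get_text()    # "b": next thing on the same level
down = soup.tr.contents                    # list of the two <td> Tag objects
```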
Data output
Create simple CSV files: import csv. Many other possible methods: e.g. use within a pandas DataFrame (cf. Wes McKinney).
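A minimal csv-module sketch; the rows and filename are illustrative, not the actual election data:

```python
import csv

# Hypothetical scraped rows; header first.
rows = [["county", "precinct", "party", "votes"],
        ["Ada", "12", "DEM", "1234"]]

# newline="" avoids blank lines on Windows.
with open("results.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)
```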
Loop over files; for vote-total rows, make the party empty; print each row with the county and precinct number as columns.
PDFs
Can extract text, looping over 100s or 1,000s of PDFs; not based on optical character recognition (OCR).
pdfminer
There are other packages, but pdfminer is focused more directly on scraping (rather than creating) PDFs. It can be executed in a single command, or step by step.
PDFs
We'll look at just using it within Python in a single command, outputting to a .txt file. Sample PDFs from the National Security Archive, Iraq War: http://www.gwu.edu/~nsarchiv/NSAEBB/NSAEBB418/
Often useful to do something over all files in a folder. One way to do this is with glob:

import glob
for filename in glob.glob("/filepath/*.pdf"):
    print(filename)

See also an example file with pdfminer.
Additional Examples