Lee Access2019 Workshop

Overview: Hack Open Web Data
Access 2019 Workshops and Hackfest on Oct. 2

Slides: https://bit.ly/2ncfiHM
Yoo Young Lee
@yooylee | yooyoung.lee@uOttawa.ca
Web and Digital Initiatives Librarian
University of Ottawa Library
Schedule
10:30 - 10:45 Overview of the workshop
10:45 - 11:00 [Hands-on] Basic rules of HTML and CSS
11:00 - 11:20 [Hands-on] Web scraping approach with Google Sheets
11:20 - 11:50 [Hands-on] Web scraping approach with Python
11:50 - 12:20 [Hands-on] Web scraping approach with R
12:20 - 12:30 Wrap-up

About this workshop
● This workshop IS about scraping certain elements of data from a website or web pages.
● This workshop IS NOT about crawling an entire website.
● This workshop IS about using tools already developed for us.
● This workshop IS NOT about building our own web scraper or web crawler.
● This workshop IS designed for beginners.
● This workshop IS NOT designed for advanced web scraping techniques (e.g., logins and sessions).
● This workshop IS NOT about Python or R programming.
Learning outcomes
1. Basic understanding of web scraping and its use cases
2. Ethical and legal aspects of web scraping
3. Basic concept of web scraping
a. Web syntax (HTML and CSS) and extraction strategies
4. Three web scraping approaches and tools

a. Google Sheets
b. Python with Google Colab (or Jupyter Notebook)
c. R with RStudio
About me & you
Why web scraping?
● The Web has become a source of data for daily activities and scientific
research.
○ Everybody lies: Big data, new data, and what the internet can tell us about who we really are
○ Assessing Google Flu Trends performance in the United States during the 2009 influenza virus A (H1N1) pandemic
● Not all websites provide APIs or web services to facilitate data exchange.
● Extract a large amount of data.
● Structure web data in a way to answer your questions.
Use cases
● Data Migration to Open Journal Systems (OJS) using R
● Using web data for evidence-informed event management with R
What is web scraping?
Data on the Web (HTML, CSS) → Scrape (XPath) → Storage (CSV)
● An automated process that extracts and parses data from websites and stores it in a structured way.
● A way to get data other than through an API (Application Programming Interface).
Limitations & challenges
● The tools covered here are designed for static scraping of simple websites (e.g., numeric pagination, traditional HTML coding practices).
● They are not suited for dynamic web scraping (e.g., websites that load content with JavaScript/AJAX code).
● It is possible to extract ~100 datasets, but difficult to handle large-scale jobs such as millions of records at one time.
● Websites are updated frequently, so your code can become outdated frequently too.
● Scraping may be prohibited by a site's robots.txt file.
Ethical and legal concerns
● In Canada, web scraping for an internal system can be illegal, especially when the data is reused for financial benefit [The Toronto Real Estate Board v. Mongohouse.com et al].
● What if you collect data for research purposes? This is still a grey area in Canada.
● Rule of thumb:
○ Check the Terms of Service (ToS).
○ Check the rules of robots.txt.
○ Be gentle with the website when sending requests.
○ Ask when in doubt.
Robots.txt
● How to access it? https://website_url/robots.txt
● Example: Canadian National Digital Heritage Index robots.txt file
○ User-agent: *
■ Applies to all robots.
○ Disallow: /
■ Scraping from the entire website is not allowed.
○ Disallow:
■ All pages can be scraped.
○ Disallow: /infoserv/
■ Only some parts of the website are excluded.
○ Crawl-delay: 60
■ Wait 60 seconds before accessing the website again.
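The rules above can also be checked from code before any scraping starts. The following is a minimal sketch using Python's standard urllib.robotparser module. The robots.txt text and the example.org URLs are made up for illustration and are not taken from any real site; in practice the parser would read the site's own file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, combining directives from the slide.
ROBOTS_TXT = """\
User-agent: *
Disallow: /infoserv/
Crawl-delay: 60
"""

parser = RobotFileParser()
# parse() takes the file as a list of lines, so no network access is needed.
parser.parse(ROBOTS_TXT.splitlines())


def may_scrape(url, agent="*"):
    """Return True if the rules allow this user agent to fetch the URL."""
    return parser.can_fetch(agent, url)


# example.org is a placeholder domain.
allowed = may_scrape("https://example.org/collections/")
blocked = may_scrape("https://example.org/infoserv/page.html")
delay = parser.crawl_delay("*")  # seconds to wait between requests

print(allowed, blocked, delay)
```

For a live site, set_url() followed by read() would load the real file instead of the hard-coded string.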
Basic Rules of HTML and CSS
● HTML: Content and structure of a page (header, paragraph, footer, etc.)
● CSS: Look and feel (color, font type, border, etc.)
● JavaScript: Advanced behaviors (pop-up, animation, etc.)
Tags, elements, and attributes
Image credit: https://mdn.mozillademos.org/files/7659/anatomy-of-an-html-element.png

Tree structure
HTML:
<!DOCTYPE html>
<html>
  <head>
    <title>Page Title</title>
  </head>
  <body>
    <h1 id="myHeading">This is a Heading</h1>
    <div class="myDiv">
      <p class="myPara">This is a paragraph.</p>
    </div>
    <a href="http://www.google.com">This is a link</a>
  </body>
</html>
CSS:
<style>
  #myHeading {
    ...
  }
  .myDiv {
    ...
  }
  .myPara {
    ...
  }
</style>
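One way to see the tree is to print each tag indented by how deeply it is nested. The sketch below uses Python's standard html.parser module on the page from this slide. The OutlineParser class is a hypothetical helper written for this illustration, and it assumes every tag is explicitly closed, as in the sample.

```python
from html.parser import HTMLParser

PAGE = """<!DOCTYPE html>
<html>
<head><title>Page Title</title></head>
<body>
<h1 id="myHeading">This is a Heading</h1>
<div class="myDiv"><p class="myPara">This is a paragraph.</p></div>
<a href="http://www.google.com">This is a link</a>
</body>
</html>"""


class OutlineParser(HTMLParser):
    """Record each opening tag, indented by its nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []

    def handle_starttag(self, tag, attrs):
        self.outline.append("  " * self.depth + tag)
        self.depth += 1

    def handle_endtag(self, tag):
        # Assumes well-nested markup where every tag is closed.
        self.depth -= 1


parser = OutlineParser()
parser.feed(PAGE)
outline = parser.outline
print("\n".join(outline))
```

The printed outline shows html at the root, head and body as its children, and p nested one level deeper inside div.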
Extraction strategies
● XPath
● CSS Selector
● Regular Expressions
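Of the three strategies, regular expressions are the only one that treats the page as plain text rather than as a tree. The sketch below is a rough illustration with Python's standard re module on a small made-up snippet. The patterns are assumptions that fit this snippet only, and they would break easily on markup that is formatted differently.

```python
import re

# Small sample markup, written for this illustration.
HTML = """<h1 id="myHeading">This is a first heading</h1>
<div class="myDiv"><p class="myPara">This is a paragraph.</p></div>
<h1 id="myHeading">This is a second heading</h1>
<a href="http://www.google.com">This is a link</a>"""

# Text between each <h1 ...> and </h1> pair (non-greedy match).
headings = re.findall(r"<h1[^>]*>(.*?)</h1>", HTML)

# Value of the href attribute on each <a> tag.
hrefs = re.findall(r'<a\s[^>]*href="([^"]*)"', HTML)

print(headings)
print(hrefs)
```

Because a pattern like this depends on exact spacing and quoting, XPath or CSS selectors are usually the more robust choice when the page structure is available.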
XPath
Expression | Description | Example
nodename | Selects all nodes with the name "nodename", which might be div, table, span, tr, td, h1, etc. | div, table, span, tr, td, h1
/ | Selects from the root node. | /html/body/div
// | Selects nodes in the document from the current node that match the selection no matter where they are. | //div
@ | Selects attributes. | //div[@class="class_name"]
[1] | Selects the first li child of a ul anywhere in the document. | //ul/li[1]
text() | Selects the text inside the <a></a> element. | //a/text()
XPath cheatsheet: https://devhints.io/xpath#axes

Extraction
<h1 id="myHeading">This is a first heading</h1>
<div class="myDiv">
  <p class="myPara">This is a paragraph.</p>
</div>
<h1 id="myHeading">This is a second heading</h1>
<a href="http://www.google.com">This is a link</a>
Content | XPath
This is a first heading; This is a second heading | //h1[@id="myHeading"]
This is a second heading | //h1[2]
This is a paragraph. | //div[@class="myDiv"]/p[@class="myPara"]
This is a link | //a or //a/text()
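The queries in this table can be tried out in Python with the standard xml.etree.ElementTree module, as a rough sketch. Two things here are assumptions and not part of the slide: the snippet is wrapped in a body element so that it has a single root, and each path starts with a dot because ElementTree searches relative to an element. ElementTree supports only a subset of XPath and has no text() step, so the text is read from each matched element instead.

```python
import xml.etree.ElementTree as ET

# The snippet from the slide, wrapped in <body> so it has one root element.
SNIPPET = """<body>
  <h1 id="myHeading">This is a first heading</h1>
  <div class="myDiv">
    <p class="myPara">This is a paragraph.</p>
  </div>
  <h1 id="myHeading">This is a second heading</h1>
  <a href="http://www.google.com">This is a link</a>
</body>"""

root = ET.fromstring(SNIPPET)


def texts(path):
    """Return the text of every element matched by the path."""
    return [element.text for element in root.findall(path)]


headings = texts(".//h1[@id='myHeading']")  # matches both h1 elements
second = texts(".//h1[2]")                  # second h1 under its parent
para = texts(".//div[@class='myDiv']/p[@class='myPara']")
links = texts(".//a")

print(headings, second, para, links)
```

For full XPath support, including text(), a third-party library such as lxml is the usual choice.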

Developer tools in the browsers
Hands on
● Example website: Explore Edmonton Events
1. Find XPath for the event title.
● /html/body/main/section[2]/section/div[2]/div[2]/div/div/div/div/div[2]/a/div/div/div/h4/span
● //h4[@class="flex"]/span
Hands on
2. Find XPath for event dates.
● /html/body/main/section[2]/section/div[2]/div[2]/div/div/div/div/div[1]/div/span/text()
● //div[@class="bg-gray ratio ratio--square"]//span
Hands on
1. Find XPath for the title of the first event only.
● //div[@class="wrap wrap--gutter"][1]//h4[@class="flex"]/span
2. Find XPath for the date of the first event only.
● //div[@class="wrap wrap--gutter"][1]//div[@class="bg-gray ratio ratio--square"]//span
Web Scraping with Google Sheets
Syntax and concepts
IMPORTXML(url, xpath_query)
● url: The URL of the page to examine, including protocol (e.g. http://)
○ Either enclosed in quotation marks,
○ or a reference to a cell that contains the URL
● xpath_query: XPath query to run on the structured data
Data on the Web (HTML, CSS) → Scrape (IMPORTXML with XPath) → Storage (Google Sheets)

Demonstration and Exercise 1 (Wikipedia)
Exercise Google Sheet
Exercise 2: Canadian Archive of Women in STEM
Exercise 3: Travel Advice and Advisories
Web Scraping with Python
Python and Google Colaboratory
● Python: An interpreted, high-level, general-purpose programming language and one of the most popular languages.
● Google Colaboratory: Like Google Docs, Sheets, or Slides, Google Colaboratory is a free Jupyter notebook environment that requires no setup and runs entirely in the cloud. It was developed for deep learning, but it can be used for any project written in Python (2.7 or 3.6).
Python libraries for web scraping
Library | Learning curve | Can fetch | Can process | Can run JS
requests | easy | yes | no | no
BeautifulSoup | easy | no | yes | no
lxml | medium | no | yes | no
Selenium | medium | yes | yes | yes
Scrapy | hard | yes | yes | no
Credit: https://kite.com/blog/python/python-beautifulsoup-html-parser-web-scraping/

Data on the Web (HTML, CSS) → Scrape (requests, Beautiful Soup) → Storage (pandas, csv)
BeautifulSoup functions
● find_all('a') == return a list of all links (<a> elements) in the document
● find('title') == return the first <title> element in the document
● get('href') == get the value of the href attribute from an element on the page
● (element).text == retrieve the text associated with that element (i.e., the text contents of the tag)
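To make concrete what these four calls return, here is a rough stand-in built only on Python's standard html.parser module, so that it runs without installing anything. It is not Beautiful Soup and does not show how Beautiful Soup works internally. LinkCollector and the sample page are hypothetical, written for this illustration.

```python
from html.parser import HTMLParser

# Hypothetical sample page; the names and paths are made up.
PAGE = ('<html><head><title>Faculty</title></head><body>'
        '<a href="/a">Alice</a> <a href="/b">Bob</a></body></html>')


class LinkCollector(HTMLParser):
    """Collect the page title and every (href, text) pair for <a> tags."""

    def __init__(self):
        super().__init__()
        self.title = None   # roughly find('title').text
        self.links = []     # roughly find_all('a'), as (href, text)
        self._current = None
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in ("a", "title"):
            self._current = tag
            self._buffer = []
            if tag == "a":
                # roughly element.get('href')
                self._href = dict(attrs).get("href")

    def handle_data(self, data):
        if self._current:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._current:
            return
        text = "".join(self._buffer).strip()  # roughly element.text
        if tag == "a":
            self.links.append((self._href, text))
        elif self.title is None:
            self.title = text
        self._current = None


collector = LinkCollector()
collector.feed(PAGE)
title, links = collector.title, collector.links
print(title, links)
```

With Beautiful Soup installed, the same result takes only a few lines using the four calls listed above, which is why the workshop uses it.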
Demonstration
Colab Jupyter Notebook
Exercise 1: uOttawa Faculty
Answer: Colab Jupyter Notebook
Exercise 2: Vegan Restaurants in Edmonton
Exercise 3: List of IUPUI Professors
Exercise 4: CBC Edmonton News
Web Scraping with R
R and RStudio
● R is a language and environment for statistical computing and graphics.
● RStudio is an integrated development environment (IDE) for R.
R Packages for Web Scraping
Library | Can retrieve | Can parse
Rcrawler | yes | yes
rvest | yes | yes
scrapeR | yes | yes
RSelenium | no | yes
httr, RCurl | yes | no
Credit: https://github.com/salimk/Rcrawler
Rcrawler function
● ContentScraper(url, XpathPatterns, CssPatterns, PatternsName, asDataFrame, ManyPerPattern) == Scrape one or many elements from a web page by XPath or CSS pattern
Demonstration
Demo.Rmd
Exercise 1: Canadian National Digital Heritage Index
Exercise1.Rmd
Answer: Exercise1.Rmd
Exercise2.Rmd
Exercise 3: Board Games
Exercise3.Rmd
Exercise 4: Code4Lib Jobs
Exercise4.Rmd
Summary
● No matter what tools you’re going to use:
○ Identify information that is nested in an XML/HTML document
○ Download documents
○ Parse documents
○ Develop queries
○ Extract information
○ Save information.
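The steps above can be sketched as one small pipeline. This is a minimal illustration using only Python's standard library, not the workshop's own notebook code. The function names, the sample markup, and the event titles are all made up, and download() is defined but not called so that the sketch runs offline. ElementTree needs well-formed markup, which real web pages often are not.

```python
import csv
import io
import urllib.request
import xml.etree.ElementTree as ET


def download(url):
    """Download a document as text (not called below, to stay offline)."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")


def extract(markup, row_path, field_paths):
    """Parse the document, then run one query per field inside each row."""
    root = ET.fromstring(markup)
    rows = []
    for node in root.findall(row_path):
        rows.append({name: node.findtext(path, default="").strip()
                     for name, path in field_paths.items()})
    return rows


def save(rows, fieldnames):
    """Write the rows as CSV text; a real script would write to a file."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical sample document standing in for a downloaded page.
SAMPLE = """<ul>
  <li><h4>Film Night</h4><span>Oct 2</span></li>
  <li><h4>Jazz Brunch</h4><span>Oct 6</span></li>
</ul>"""

rows = extract(SAMPLE, ".//li", {"title": "h4", "date": "span"})
csv_text = save(rows, ["title", "date"])
print(csv_text)
```

Keeping download, extract, and save as separate functions means that when a website changes its layout, only the queries passed to extract() need updating.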
Google Sheets
● Easy to use
● Simple process
● Difficult to handle complex webpages
Python
● Easy programming language
● More complicated than Google Sheets
R
● Strong community and packages
● More complicated than Google Sheets
Thank You
Slides: https://bit.ly/2ncfiHM
Yoo Young Lee
