Lee Access2019 Workshop

Overview: Hack Open Web Data

Access 2019 Workshops and Hackfest on Oct. 2


Slides: https://bit.ly/2ncfiHM
Yoo Young Lee
@yooylee | yooyoung.lee@uOttawa.ca
Web and Digital Initiatives Librarian
University of Ottawa Library
Schedule

10:30 - 10:45 Overview of the workshop

10:45 - 11:00 [Hands-on] Basic rules of HTML and CSS

11:00 - 11:20 [Hands-on] Web scraping approach with Google Sheets

11:20 - 11:50 [Hands-on] Web scraping approach with Python

11:50 - 12:20 [Hands-on] Web scraping approach with R

12:20 - 12:30 Wrap-up


About this workshop
● This workshop IS about scraping certain elements of data from a website or web pages.
● This workshop IS NOT about crawling an entire website.
● This workshop IS about using tools already developed for us.
● This workshop IS NOT about building our own web scraper or web crawler.
● This workshop IS designed for beginners.
● This workshop IS NOT designed for advanced web scraping techniques (e.g., logins and sessions).
● This workshop IS NOT about Python or R programming.
Learning outcomes
1. Basic understanding of web scraping and its use cases
2. Ethical and legal aspects of web scraping
3. Basic concept of web scraping
   a. Web syntax (HTML and CSS) and extraction strategies
4. Three web scraping approaches and tools
   a. Google Sheets
   b. Python with Google Colab (or Jupyter Notebook)
   c. R with RStudio
About me & you
Why web scraping?
● The Web has become a source of data for daily activities and scientific research.
○ Everybody Lies: Big Data, New Data, and What the Internet Can Tell Us About Who We Really Are
○ Assessing Google Flu Trends Performance in the United States during the 2009 Influenza Virus A (H1N1) Pandemic
● Not all websites provide APIs or web services to facilitate data exchange.
● Web scraping lets you extract a large amount of data.
● It lets you structure web data in a way that answers your questions.
Use cases
● Data Migration to Open Journal Systems (OJS) using R
● Using web data for evidence-informed event management with R
What is web scraping?

Data on the Web (HTML, CSS) → Scrape (XPath) → Storage (CSV)

● An automated process to extract and parse "data" from websites and store it in a structured way.
● It is a way of getting data other than through an API (Application Programming Interface).
Limitations & challenges
● Designed to handle static scraping of simple websites (e.g., numeric pagination, traditional HTML coding practices).
● Not suited for dynamic web scraping (e.g., websites that load content with JavaScript/AJAX code).
● Possible to extract ~100 datasets, but difficult to handle large-scale jobs, such as millions of records at one time.
● Websites are updated frequently, so your code can become outdated just as frequently.
● Scraping may be prohibited by the website's robots.txt file, so check it first.
Ethical and legal concerns
● In Canada, web scraping for an internal system is illegal, especially when the data is reused for financial benefit [The Toronto Real Estate Board v. Mongohouse.com et al.].
● What if you collect data for research purposes? This is still a grey area in Canada.
● Rule of thumb:
○ Check the Terms of Service (ToS).
○ Check the rules of robots.txt.
○ Be gentle with the website when sending requests.
○ Ask when in doubt.
Robots.txt
● How to access it? https://website_url/robots.txt
● Example: Canadian National Digital Heritage Index robots.txt file
○ User-agent: *
■ The rules that follow apply to all robots.
○ Disallow: /
■ Scraping from the entire website is not allowed.
○ Disallow:
■ All pages can be scraped.
○ Disallow: /infoserv/
■ Only some parts of the website are excluded.
○ Crawl-delay: 60
■ Wait 60 seconds before accessing the website again.
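The rules above can also be checked from code before scraping. The following is a minimal sketch using Python's standard-library `urllib.robotparser`. The robots.txt text and the `example.org` URLs are made up for illustration and are not the actual Canadian National Digital Heritage Index file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, combining rules like those listed above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /infoserv/
Crawl-delay: 60
"""

# Parse the rules from text; no network request is made here.
rules = RobotFileParser()
rules.parse(ROBOTS_TXT.splitlines())

# Pages under /infoserv/ are excluded; everything else is allowed.
blocked = rules.can_fetch("*", "https://example.org/infoserv/page.html")
allowed = rules.can_fetch("*", "https://example.org/collections/")

# Number of seconds to wait between requests, or None if not specified.
delay = rules.crawl_delay("*")

print(blocked, allowed, delay)
```

To check a live site instead, `set_url()` followed by `read()` downloads the file; pausing for the reported delay between requests (for example with `time.sleep`) is one way to be gentle with the website.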
Basic Rules of HTML and CSS
Data on the Web
● HTML: Content and structure of a page (header, paragraph, footer, etc.)
● CSS: Look and feel (color, font type, border, etc.)
● JavaScript: Advanced behaviors (pop-up, animation, etc.)
Tags, elements, and attributes

Image credit: https://mdn.mozillademos.org/files/7659/anatomy-of-an-html-element.png


Tree structure
<!DOCTYPE html>
<html>
  <head>
    <title>Page Title</title>
  </head>
  <body>
    <h1 id="myHeading">This is a Heading</h1>
    <div class="myDiv">
      <p class="myPara">This is a paragraph.</p>
    </div>
    <a href="http://www.google.com">This is a link</a>
  </body>
</html>

<style>
  #myHeading {
    ...
  }
  .myDiv {
    ...
  }
  .myPara {
    ...
  }
</style>
Extraction strategies
● XPath
● CSS Selector
● Regular Expressions
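To make the three strategies concrete, here is a small sketch that pulls the same link URL out of one snippet in three ways. It assumes the third-party `beautifulsoup4` package is installed for the CSS selector. The XPath step uses the standard-library `xml.etree.ElementTree`, which supports only a subset of XPath and needs well-formed markup. The snippet itself is a simplified, made-up page.

```python
import re
import xml.etree.ElementTree as ET

from bs4 import BeautifulSoup

# A tiny, well-formed page used only for illustration.
SNIPPET = (
    '<body><h1 id="myHeading">This is a Heading</h1>'
    '<a href="http://www.google.com">This is a link</a></body>'
)

# 1. XPath: find the first <a> element anywhere, then read its href attribute.
xpath_href = ET.fromstring(SNIPPET).find(".//a").get("href")

# 2. CSS selector: match the first <a> element that has an href attribute.
css_href = BeautifulSoup(SNIPPET, "html.parser").select_one("a[href]").get("href")

# 3. Regular expression: capture whatever sits between href=" and the next quote.
regex_href = re.search(r'href="([^"]+)"', SNIPPET).group(1)

print(xpath_href, css_href, regex_href)
```

XPath and CSS selectors follow the document's tree structure, while a regular expression only matches text patterns, so it tends to break more easily when the markup changes.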
XPath
Expression | Description | Example
nodename | Selects all nodes with the name "nodename"; it might be div, table, span, tr, td, h1, etc. | div, table, span, tr, td, h1
/ | Selects from the root node. | /html/body/div
// | Selects nodes in the document from the current node that match the selection, no matter where they are. | //div
@ | Selects attributes. | //div[@class="class_name"]
[1] | Selects the first li child of a ul anywhere in the document. | //ul/li[1]
text() | Selects the text inside an <a></a> element. | //a/text()

XPath cheatsheet: https://devhints.io/xpath#axes


Extraction
<h1 id="myHeading">This is a first heading</h1>
<div class="myDiv">
  <p class="myPara">This is a paragraph.</p>
</div>
<h1 id="myHeading">This is a second heading</h1>
<a href="http://www.google.com">This is a link</a>

Content | XPath
This is a first heading, This is a second heading | //h1[@id="myHeading"]
This is a second heading | //h1[2]
This is a paragraph. | //div[@class="myDiv"]/p[@class="myPara"]
This is a link | //a or //a/text()
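The mappings above can be tried out in code. This is a sketch using Python's standard-library `xml.etree.ElementTree`, chosen here only because it needs no installation. It supports a limited subset of XPath (no `text()`, and paths must be relative, so `//h1` becomes `.//h1`). The `<body>` wrapper and the `texts` helper are additions for illustration.

```python
import xml.etree.ElementTree as ET

# The snippet from the slide, wrapped in a <body> root so it parses as XML.
SNIPPET = """
<body>
  <h1 id="myHeading">This is a first heading</h1>
  <div class="myDiv">
    <p class="myPara">This is a paragraph.</p>
  </div>
  <h1 id="myHeading">This is a second heading</h1>
  <a href="http://www.google.com">This is a link</a>
</body>
"""

root = ET.fromstring(SNIPPET)


def texts(xpath):
    """Return the text of every element matched by the given path."""
    return [element.text for element in root.findall(xpath)]


by_id = texts(".//h1[@id='myHeading']")       # both headings share the id
second_h1 = texts(".//h1[2]")                 # second h1 under its parent
paragraph = texts(".//div[@class='myDiv']/p[@class='myPara']")
link = texts(".//a")                          # .text stands in for text()

print(by_id, second_h1, paragraph, link)
```

A full XPath engine, such as the one in the `lxml` library listed later in this workshop, accepts the expressions exactly as written in the table, including `//a/text()`.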


Developer tools in the browsers
Hands on
● Example website: Explore Edmonton Events

1. Find the XPath for the event title.

● /html/body/main/section[2]/section/div[2]/div[2]/div/div/div/div/div[2]/a/div/div/div/h4/span
● //h4[@class="flex"]/span
Hands on
● Example website: Explore Edmonton Events

2. Find the XPath for the event dates.

● /html/body/main/section[2]/section/div[2]/div[2]/div/div/div/div/div[1]/div/span/text()
● //div[@class="bg-gray ratio ratio--square"]//span
Hands on
● Example website: Explore Edmonton Events

1. Find the XPath for the title of the first event only.

● //div[@class="wrap wrap--gutter"][1]//h4[@class="flex"]/span

2. Find the XPath for the date of the first event only.

● //div[@class="wrap wrap--gutter"][1]//div[@class="bg-gray ratio ratio--square"]//span
Web Scraping with Google Sheets
Syntax and concepts
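IMPORTXML(url, xpath_query)

● url: The URL of the page to examine, including the protocol (e.g. http://). It can be either:
○ a string enclosed in quotation marks, or
○ a reference to a cell that contains the URL.
● xpath_query: The XPath query to run on the structured data.
● Example (the URL is a placeholder): =IMPORTXML("https://example.com", "//h1")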
What is web scraping?

Data on the Web (HTML, CSS) → Scrape (XPath): IMPORTXML with XPath → Storage (CSV): Google Sheets


Demonstration and Exercise 1 (Wikipedia)
Exercise Google Sheet
Exercise 2: Canadian Archive of Women in STEM
Exercise Google Sheet
Exercise 3: Travel Advice and Advisories
Exercise Google Sheet
Web Scraping with Python
Python and Google Colaboratory
● Python: Python is an interpreted, high-level, general-purpose programming language and one of the most popular languages.
● Google Colaboratory: Like Google Docs, Sheets, or Slides, Google Colaboratory is a free Jupyter notebook environment that requires no setup and runs entirely in the cloud. It was developed for deep learning, but it can be used for any project written in Python (2.7 or 3.6).
Python libraries for web scraping

Library | Learning curve | Can fetch | Can process | Can run JS
requests | easy | yes | no | no
BeautifulSoup | easy | no | yes | no
lxml | medium | no | yes | no
Selenium | medium | yes | yes | yes
Scrapy | hard | yes | yes | no

Credit: https://kite.com/blog/python/python-beautifulsoup-html-parser-web-scraping/
What is web scraping?

Data on the Web (HTML, CSS) → Scrape (XPath): requests, Beautiful Soup → Storage (CSV): pandas, csv
BeautifulSoup functions
● find_all('a'): returns a list of all the links (<a> elements) in the document
● find('title'): returns the first <title> element found in the document
● get('href'): returns the value of the href attribute of an element on the page
● (element).text: returns the text associated with that element (i.e. the text contents of the tag)
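The four functions above can be seen together in a short sketch. It assumes the `beautifulsoup4` package is installed and parses an inline, made-up HTML string rather than a live page, so no request is sent. The second link and its `example.org` URL are hypothetical.

```python
from bs4 import BeautifulSoup

# A small made-up page; in the exercises the HTML would come from requests.
HTML = """
<html>
  <head><title>Page Title</title></head>
  <body>
    <h1 id="myHeading">This is a Heading</h1>
    <a href="http://www.google.com">This is a link</a>
    <a href="https://example.org/about">About</a>
  </body>
</html>
"""

soup = BeautifulSoup(HTML, "html.parser")

# find() returns the first matching element; .text gives its text content.
title = soup.find("title")

# find_all() returns every matching element as a list.
links = soup.find_all("a")

# get() reads an attribute value from each element.
hrefs = [link.get("href") for link in links]
texts = [link.text for link in links]

print(title.text)
print(hrefs)
print(texts)
```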
Demonstration
Colab Jupyter Notebook
Exercise 1: uOttawa Faculty
Colab Jupyter Notebook
Exercise 1: uOttawa Faculty
Answer: Colab Jupyter Notebook
Exercise 2: Vegan Restaurants in Edmonton
Colab Jupyter Notebook
Exercise 2: Vegan Restaurants in Edmonton
Answer: Colab Jupyter Notebook
Exercise 3: List of IUPUI Professors
Colab Jupyter Notebook
Exercise 3: List of IUPUI Professors
Answer: Colab Jupyter Notebook
Exercise 4: CBC Edmonton News
Colab Jupyter Notebook
Exercise 4: CBC Edmonton News
Answer: Colab Jupyter Notebook
Web Scraping with R
R and RStudio
● R is a language and environment for statistical computing and graphics.
● RStudio is an integrated development environment (IDE) for R.
R Packages for Web Scraping
Library | Can retrieve | Can parse
Rcrawler | yes | yes
rvest | yes | yes
scrapeR | yes | yes
RSelenium | no | yes
httr, RCurl | yes | no

Credit: https://github.com/salimk/Rcrawler
Rcrawler function
● ContentScraper(url, XpathPatterns, CssPatterns, PatternsName, asDataFrame, ManyPerPattern): scrapes one or many elements from a web page by XPath or CSS pattern
Demonstration
Demo.Rmd
Exercise 1: Canadian National Digital Heritage Index
Exercise1.Rmd
Exercise 1: Canadian National Digital Heritage Index
Answer: Exercise1.Rmd
Exercise 2: Canadian National Digital Heritage Index
Exercise2.Rmd
Exercise 2: Canadian National Digital Heritage Index
Answer: Exercise2.Rmd
Exercise 3: Board Games
Exercise3.Rmd
Exercise 3: Board Games
Answer: Exercise3.Rmd
Exercise 4: Code4Lib Jobs
Exercise4.Rmd
Exercise 4: Code4Lib Jobs
Answer: Exercise4.Rmd
Summary
● No matter what tools you’re going to use:
○ Identify information that is nested in an XML/HTML document
○ Download documents
○ Parse documents
○ Develop queries
○ Extract information
○ Save information.
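The steps above can be sketched end to end in a few lines. This is an illustrative outline only: it assumes the `beautifulsoup4` package is installed, the function names (`download`, `extract_links`, `save_rows`) and the sample HTML are made up, and the download step is defined but not called so that the example runs without network access.

```python
import csv
import io
from urllib.request import urlopen

from bs4 import BeautifulSoup


def download(url):
    """Download a page and return its HTML as text (not called below)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def extract_links(html):
    """Parse the HTML and return one (text, href) row per link."""
    soup = BeautifulSoup(html, "html.parser")
    return [(a.text.strip(), a.get("href")) for a in soup.find_all("a")]


def save_rows(rows, handle):
    """Write a header and the extracted rows to a file-like object as CSV."""
    writer = csv.writer(handle)
    writer.writerow(["text", "href"])
    writer.writerows(rows)


# Made-up HTML standing in for a downloaded document.
SAMPLE_HTML = (
    '<body><a href="/events/1">Event one</a> '
    '<a href="/events/2">Event two</a></body>'
)

rows = extract_links(SAMPLE_HTML)
buffer = io.StringIO()  # stands in for open("events.csv", "w", newline="")
save_rows(rows, buffer)
print(buffer.getvalue())
```

Keeping download, extraction, and saving in separate functions means only the query inside `extract_links` needs to change when a website's markup is updated.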

Google Sheets | Python | R
● Easy to use; simple process | ● Easy programming language | ● Strong community and packages
● Difficult to handle complex webpages | ● More complicated than Google Sheets | ● More complicated than Google Sheets
Thank You
