Lee Access2019 Workshop
● Not all websites provide APIs or web services to facilitate data exchange.
● Extract a large amount of data.
● Structure web data in a way that answers your questions.
Use cases
● Data Migration to Open Journal Systems (OJS) using R
● Using web data for evidence-informed event management with R
What is web scraping?
● XPath
● CSS Selector
● Regular Expressions
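As a rough illustration of how these approaches differ, the sketch below pulls two values out of the same made-up snippet: one by XPath (using the limited XPath subset in Python's standard-library ElementTree, which only accepts well-formed markup) and one by regular expression. The snippet, event name, and date are invented for the example. A CSS selector for the same heading would be `h4.flex > span`, but evaluating it needs a third-party library, so it is not run here.

```python
import re
import xml.etree.ElementTree as ET

# Made-up, well-formed snippet standing in for a downloaded page (assumption).
SNIPPET = """
<html>
  <body>
    <div class="event">
      <h4 class="flex"><span>Winter Market</span></h4>
      <p>Starts 2019-12-01</p>
    </div>
  </body>
</html>
"""

root = ET.fromstring(SNIPPET)

# XPath (ElementTree subset): navigate the element tree by tag and attribute.
title = root.find(".//h4[@class='flex']/span").text

# Regular expression: match a pattern in the raw text, ignoring the tree.
date = re.search(r"\d{4}-\d{2}-\d{2}", SNIPPET).group(0)

print(title)
print(date)
```

XPath and CSS selectors follow the document structure, so they suit values that live in a known element; a regular expression suits values with a recognizable shape, such as a date, wherever they appear.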
XPath
Expression | Description | Example
nodename | Selects all nodes with the name "nodename" | div, table, span, tr, td, h1, etc.
// | Selects nodes in the document from the current node that match the selection no matter where they are | //div
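The difference between a bare node name and `//` can be sketched with Python's standard-library ElementTree. This is only an approximation: ElementTree implements a subset of XPath and writes `//div` as `.//div`, and the tiny document and its `id` values are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: one nested div and one sibling div under <body>.
DOC = ET.fromstring(
    "<body>"
    "<div id='outer'><div id='inner'><span>a</span></div></div>"
    "<div id='sibling'/>"
    "</body>"
)

# A bare node name is evaluated relative to the current node, so
# ElementTree returns only the div elements directly under <body>.
children = [d.get("id") for d in DOC.findall("div")]

# ".//div" is ElementTree's spelling of //div from the current node:
# it matches div elements at any depth, in document order.
anywhere = [d.get("id") for d in DOC.findall(".//div")]

print(children)
print(anywhere)
```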
● /html/body/main/section[2]/section/div[2]/div[2]/div/div/div/div/div[2]/a/div/div/div/h4/span
● //h4[@class="flex"]/span
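The two styles of path above can be compared in a small sketch. It assumes a much shorter, made-up page than the real one and uses ElementTree's XPath subset, where paths start from the root element (`./body/...`) rather than `/html/body/...`. The point it tries to show is that a full step-by-step path breaks when the layout changes, while an attribute-based path keeps working.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed page (assumption); the real page is far more nested.
PAGE = ET.fromstring("""
<html>
  <body>
    <main>
      <section><p>intro</p></section>
      <section>
        <div><h4 class="flex"><span>Ice Castles</span></h4></div>
      </section>
    </main>
  </body>
</html>
""")

FULL_PATH = "./body/main/section[2]/div/h4/span"   # every step spelled out
SHORT_PATH = ".//h4[@class='flex']/span"           # anchored on an attribute

absolute = PAGE.find(FULL_PATH)
relative = PAGE.find(SHORT_PATH)

# Simulate a redesign: wrap the heading's container in one extra <div>.
section = PAGE.find("./body/main/section[2]")
old_div = section[0]
section.remove(old_div)
wrapper = ET.SubElement(section, "div")
wrapper.append(old_div)

broken = PAGE.find(FULL_PATH)        # no longer matches anything
still_found = PAGE.find(SHORT_PATH)  # unaffected by the extra wrapper

print(absolute.text, relative.text, broken, still_found.text)
```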
Hands on
● Example website: Explore Edmonton Events
● /html/body/main/section[2]/section/div[2]/div[2]/div/div/div/div/div[1]/div/span/text()
● //div[@class="bg-gray ratio ratio--square"]//span
Hands on
● Example website: Explore Edmonton Events
● //div[@class="wrap wrap--gutter"][1]//h4[@class="flex"]/span
● url: The URL of the page to examine, including protocol (e.g. http://)
○ Enclosed in quotation marks
○ Reference to a cell
● xpath_query: XPath query to run on the structured data
What is web scraping?
Credit: https://kite.com/blog/python/python-beautifulsoup-html-parser-web-scraping/
What is web scraping?
● Python libraries: requests, Beautiful Soup, pandas, csv
BeautifulSoup functions
● find_all('a') == find all links in the document, returned as a list
● find('title') == find the first title element in the document
● get('href') == get the href attribute value from an element on the page
● (element).text == retrieve the text associated with that element (i.e. the text contents of the tag)
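The four calls above can be tried on a small inline document. The HTML, names, and paths below are invented for illustration, and the sketch assumes the `beautifulsoup4` package is installed; it uses Python's built-in `html.parser` so no extra parser is needed.

```python
from bs4 import BeautifulSoup

# Invented sample page (assumption), not one of the workshop sites.
HTML = """
<html>
  <head><title>Faculty Directory</title></head>
  <body>
    <a href="/people/ada">Ada Lovelace</a>
    <a href="/people/alan">Alan Turing</a>
  </body>
</html>
"""

soup = BeautifulSoup(HTML, "html.parser")

links = soup.find_all("a")          # list-like ResultSet of every <a> tag
title = soup.find("title")          # first <title> tag, or None if absent
first_href = links[0].get("href")   # attribute value, or None if missing
names = [a.text for a in links]     # text contents of each tag

print(title.text)
print(first_href)
print(names)
```

Because `find` and `get` return `None` when nothing matches, real scraping code usually checks the result before reading `.text` from it.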
Demonstration
Colab Jupyter Notebook
Exercise 1: uOttawa Faculty
Colab Jupyter Notebook
Exercise 1: uOttawa Faculty
Answer: Colab Jupyter Notebook
Exercise 2: Vegan Restaurants in Edmonton
Colab Jupyter Notebook
Exercise 2: Vegan Restaurants in Edmonton
Answer: Colab Jupyter Notebook
Exercise 3: List of IUPUI Professors
Colab Jupyter Notebook
Exercise 3: List of IUPUI Professors
Answer: Colab Jupyter Notebook
Exercise 4: CBC Edmonton News
Colab Jupyter Notebook
Exercise 4: CBC Edmonton News
Answer: Colab Jupyter Notebook
Web Scraping with R
Access 2019 Workshops and Hackfest on Oct. 2
● RStudio is an integrated development environment (IDE) for R.
R Packages for Web Scraping
Library | Can retrieve | Can parse
RSelenium | no | yes
Credit: https://github.com/salimk/Rcrawler
Rcrawler function
● ContentScraper(url, XpathPatterns, CssPatterns, PatternsName,
asDataFrame, ManyPerPattern) == Scrape one or many elements from a
web page by XPath or CSS pattern
Demonstration
Demo.Rmd
Exercise 1: Canadian National Digital Heritage Index
Exercise1.Rmd
Exercise 1: Canadian National Digital Heritage Index
Answer: Exercise1.Rmd
Exercise 2: Canadian National Digital Heritage Index
Exercise2.Rmd
Exercise 2: Canadian National Digital Heritage Index
Answer: Exercise2.Rmd
Exercise 3: Board Games
Exercise3.Rmd
Exercise 3: Board Games
Answer: Exercise3.Rmd
Exercise 4: Code4Lib Jobs
Exercise4.Rmd
Exercise 4: Code4Lib Jobs
Answer: Exercise4.Rmd
Summary
● Whichever tool you use, the workflow is the same:
○ Identify information that is nested in an XML/HTML document
○ Download documents
○ Parse documents
○ Develop queries
○ Extract information
○ Save information
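The steps above can be sketched end to end in Python. This is a minimal outline under stated assumptions, not the workshop's notebook code: the function names, the `job` class, and the sample markup are hypothetical, `beautifulsoup4` is assumed to be installed, and the download step is defined but not called so the example runs offline.

```python
import csv
import io
from urllib.request import urlopen

from bs4 import BeautifulSoup


def download(url):
    """Download a page and return its markup as text (not called below)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8")


def extract_rows(markup):
    """Parse the markup, run the query, and build one row per match."""
    soup = BeautifulSoup(markup, "html.parser")        # parse
    rows = []
    for link in soup.find_all("a", class_="job"):      # query
        rows.append({"title": link.text.strip(),       # extract
                     "url": link.get("href")})
    return rows


def save_csv(rows, handle):
    """Write the extracted rows to any open text handle as CSV."""
    writer = csv.DictWriter(handle, fieldnames=["title", "url"])
    writer.writeheader()
    writer.writerows(rows)


# Invented sample markup standing in for a downloaded page (assumption).
SAMPLE = """
<ul>
  <li><a class="job" href="/jobs/1">Systems Librarian</a></li>
  <li><a class="job" href="/jobs/2">Metadata Specialist</a></li>
  <li><a href="/about">About</a></li>
</ul>
"""

rows = extract_rows(SAMPLE)
buffer = io.StringIO()   # swap for open("jobs.csv", "w", newline="") to save a file
save_csv(rows, buffer)
print(buffer.getvalue())
```

Keeping download, extraction, and saving in separate functions makes it easy to develop the query against a saved copy of the page before running it on the live site.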