Data Science With Python - Lesson 11 - Web Scraping
What Is Web Scraping?
Web scraping is a computer software technique for extracting information from websites in an automated fashion.
Why Web Scraping
Every day, you find yourself in a situation where you need to extract data from the web.
Web Scraping Process
Web Scraping Process: Basic Preparation
There are two basic things to consider before setting up the web scraping process: understanding the target data and finalizing the list of websites to scrape.
Web Scraping Process
Once you have understood the target data and finalized the list of websites, you need to design the web scraping process.
Step 1: A web request is sent to the targeted website to collect the required data.
Step 2: The information is retrieved from the targeted website in HTML or XML format.
Step 3: The retrieved information is passed to a parser chosen according to the data format. Parsing is a technique to read data and extract information from the available document.
Step 4: The parsed data is stored in the desired format. You can follow the same process to scrape another targeted website.
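The four steps above can be sketched roughly as follows. This is only an illustration: the function names, the sample HTML, and the output file name are assumptions made for the example, and the page is parsed from an inline string so the sketch runs without network access.

```python
import json
import os
import tempfile
from urllib.request import urlopen

from bs4 import BeautifulSoup


def fetch(url):
    """Steps 1 and 2: send a web request and retrieve the raw HTML."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def parse_links(html):
    """Step 3: parse the document and extract the text and href of each link."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        {"text": a.get_text(strip=True), "href": a["href"]}
        for a in soup.find_all("a", href=True)
    ]


def store(records, path):
    """Step 4: store the parsed data in the desired format (JSON here)."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(records, handle, indent=2)


# Hypothetical sample page, used instead of fetch() so the sketch runs offline.
sample_html = """
<html><body>
  <a href="/courses">Courses</a>
  <a href="/articles">Articles</a>
  <a>No link here</a>
</body></html>
"""

records = parse_links(sample_html)
output_path = os.path.join(tempfile.gettempdir(), "scraped_links.json")
store(records, output_path)
print(records)
```

For a live page, you would call fetch() with the target URL and pass its result to parse_links(), after checking that the website permits scraping.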
Web Scraping Software
Web scraping software interacts with websites in the same way as your web browser.
A web scraper is used to extract information from the web in a routine and automated manner.
Web browser: Displays the data
Web scraper: Saves data from the web page to a local file or database
Web Scraping Considerations
It is important to read and understand the legal information and the terms and conditions stated on the website.
Legal Constraints
Notice
Copyright
Trademark Material
Patented Information
Web Scraping Tool: BeautifulSoup
BeautifulSoup is an easy, intuitive, and robust Python library designed for web scraping.
Efficient tool for dissecting documents and extracting information from the web pages
Has powerful sets of built-in methods for navigating, searching, and modifying a parse tree
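As a minimal sketch of what this looks like in practice, the example below parses a small document and pulls out a few values. The sample HTML and variable names are made up for illustration, and the built-in "html.parser" is assumed so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical document used only for this illustration.
html_doc = """
<html>
  <head><title>Sample Page</title></head>
  <body>
    <h1 id="main">Web Scraping</h1>
    <p class="intro">BeautifulSoup builds a parse tree.</p>
  </body>
</html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

page_title = soup.title.string          # navigate by tag name
heading = soup.find("h1", id="main")    # search by tag name and attribute
intro = soup.find("p", class_="intro")  # class_ avoids the Python keyword "class"

print(page_title)
print(heading.get_text())
print(intro["class"])
```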
Common Data/Page Formats on the Web
Application Programming Interfaces, or APIs, have now become a common way to extract information from the web.
JavaScript Object Notation, or JSON, is a lightweight and popular format used for information exchange on the web.
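To illustrate how JSON is typically handled, here is a small sketch using Python's standard json module. The response body is a made-up example, not the output of any real API.

```python
import json

# Hypothetical API response body; a real one would arrive over HTTP.
response_body = (
    '{"course": "Data Science With Python", "lesson": 11, '
    '"topics": ["parsers", "filters"]}'
)

data = json.loads(response_body)   # JSON text -> Python dictionary
topics = data["topics"]

# Python object -> JSON text, for example before saving it to a file.
round_trip = json.dumps(data, sort_keys=True)

print(data["lesson"], topics)
```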
Parser
What is a parser?
A parser is also used to validate the input information before processing it.
Input (commands) -> Parser -> Output (methods)
Parsing data is one of the most important steps in the web scraping process.
Failing to parse the data would eventually lead to a failure of the entire process.
Various Parsers
lxml XML parser ("lxml-xml" or "xml"): the only XML parser currently supported by BeautifulSoup. It depends on an external C library.
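The parser is selected by the second argument to the BeautifulSoup constructor. The sketch below tries a few parser names in order and falls back to the built-in "html.parser" when an optional parser is not installed. The helper name make_soup and the preference order are assumptions made for this example.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Return (soup, parser_name) using the first parser that is installed."""
    for name in preferred:
        try:
            return BeautifulSoup(markup, name), name
        except FeatureNotFound:
            continue  # this parser is not installed; try the next one
    raise RuntimeError("no usable parser found")


soup, parser_used = make_soup("<p>Hello, <b>parser</b>!</p>")
print(parser_used, soup.b.string)
```

Different parsers can build different trees from invalid markup, so it is usually safer to name the parser explicitly than to rely on whichever one happens to be installed.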
HTML Parser Tree
Demonstrate how to scrape a web document, parse it, and use objects to extract information.
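A short sketch of the kinds of objects you meet in a parsed document is shown below: BeautifulSoup, Tag, NavigableString, and Comment. The markup is a made-up example.

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

# Hypothetical markup containing a tag, some text, and an HTML comment.
markup = "<p class='note'>Plain text<!-- a hidden remark --></p>"
soup = BeautifulSoup(markup, "html.parser")

tag = soup.p              # Tag: an element such as <p>
text = tag.contents[0]    # NavigableString: the text inside a tag
remark = tag.contents[1]  # Comment: a special kind of NavigableString

print(type(soup).__name__, type(tag).__name__)
print(type(text).__name__, type(remark).__name__)
```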
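Understanding Tree
[Figure: a parse tree with html at the root, head and body as its children, and a and li tags below them]
[Figure: labels identifying the html tag, the body tag, a division or section (div), and cascading style sheets (CSS)]
[Figure: a BeautifulSoup tree in which the parent div "oraganizationlist" contains sibling li elements labeled "HRmanager"]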
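To make the tree structure concrete, the sketch below parses a small document and prints one indented line per tag. The document is loosely modeled on the tree figures, and its id and text values, along with the outline() helper, are hypothetical.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical document: a div that is the parent of two sibling li tags.
html_doc = (
    "<html><head><title>Org</title></head>"
    "<body><div id='organizationlist'>"
    "<li>HR Manager</li><li>CEO</li>"
    "</div></body></html>"
)
soup = BeautifulSoup(html_doc, "html.parser")


def outline(node, depth=0):
    """Return one indented line per tag, showing how the tags nest."""
    lines = []
    for child in node.children:
        if isinstance(child, Tag):
            lines.append("  " * depth + child.name)
            lines.extend(outline(child, depth + 1))
    return lines


tree_lines = outline(soup)
print("\n".join(tree_lines))
```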
Searching Tree: Filters
With the help of search filters, you can extract specific information from the parsed document. The filters can be treated as search criteria for extracting information based on the elements present in the document.
There are various kinds of filters used for searching information from a tree.
List: A list filter matches any string that matches an item in the list.
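The sketch below shows a list filter next to the other filter kinds that BeautifulSoup accepts (a string, a regular expression, True, and a function). Only the list filter is described above, so the others are included here as background. The sample markup is made up.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup for the illustration.
html_doc = "<div><a href='/home'>Home</a><b>Bold</b><p id='intro'>Intro</p></div>"
soup = BeautifulSoup(html_doc, "html.parser")

# String filter: match tags whose name is exactly "a".
by_string = [t.name for t in soup.find_all("a")]

# List filter: match tags whose name equals any item in the list.
by_list = [t.name for t in soup.find_all(["a", "b"])]

# Regular expression filter: match tag names starting with "b" or "p".
by_regex = [t.name for t in soup.find_all(re.compile(r"^[bp]"))]

# True filter: match every tag.
by_true = [t.name for t in soup.find_all(True)]

# Function filter: match tags that carry an id attribute.
by_function = [t.name for t in soup.find_all(lambda tag: tag.has_attr("id"))]

print(by_string, by_list, by_regex, by_true, by_function)
```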
Methods and Attributes
Searching methods: find_all() and find()
Searching the tree with find_all()
The find_all() method searches a tag's descendants and retrieves all of those that match your filters.
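A minimal sketch of find_all() with a few common arguments follows. The list markup and class names are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for the illustration.
html_doc = """
<ul>
  <li class="role">HR Manager</li>
  <li class="role">CEO</li>
  <li class="team">Finance</li>
</ul>
"""
soup = BeautifulSoup(html_doc, "html.parser")

all_items = soup.find_all("li")              # every <li> descendant
roles = soup.find_all("li", class_="role")   # filter by CSS class
first_two = soup.find_all("li", limit=2)     # stop after two matches
role_names = [li.get_text() for li in roles]

print(len(all_items), role_names, len(first_two))
```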
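The find() method has a syntax similar to that of the find_all() method; however, there are some key differences.
find_all(): Scans the entire document. Returns a list with values when matches are found. Returns an empty list when nothing matches.
find(): Stops at the first match. Returns that single result. Returns None when nothing matches.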
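The difference in return values can be checked with a short sketch like this one, which uses a made-up two-paragraph document.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>one</p><p>two</p>", "html.parser")

first_p = soup.find("p")              # the first match only
all_p = soup.find_all("p")            # a list of every match
missing_one = soup.find("table")      # None when nothing matches
missing_all = soup.find_all("table")  # an empty list when nothing matches

print(first_p.string, [p.string for p in all_p], missing_one, missing_all)
```

Because find() can return None, it is worth checking the result before accessing attributes on it.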
Searching the parse tree can also be performed by various other methods such as:
Tree Search
Methods that scan for all the matches: find_parents(), find_previous_siblings(), find_all_next(), find_all_previous(), find_next_siblings()
Methods that scan only for the first match: find_parent(), find_previous_sibling(), find_next(), find_previous(), find_next_sibling()
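A few of these methods are sketched below, starting from the middle item of a short list. The markup and the id values are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a div containing a list of three items.
html_doc = "<div id='org'><ul><li>A</li><li id='mid'>B</li><li>C</li></ul></div>"
soup = BeautifulSoup(html_doc, "html.parser")
mid = soup.find("li", id="mid")

# All matching ancestors, nearest first.
parent_names = [t.name for t in mid.find_parents(["ul", "div"])]
# Only the first matching ancestor.
nearest_div = mid.find_parent("div")
# All matching siblings that come after this tag.
later = [li.string for li in mid.find_next_siblings("li")]
# Only the nearest matching sibling before this tag.
earlier = mid.find_previous_sibling("li").string
# The next matching tag in document order.
next_li = mid.find_next("li").string

print(parent_names, nearest_div["id"], later, earlier, next_li)
```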
Searching in a Tree with Filters
With the help of BeautifulSoup, it is easy to navigate the parse tree based on the need.
Navigating Down
This technique shows you how to extract information from children tags. The attributes used to navigate down include .contents, .children, and .descendants.
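As a sketch of navigating down, the example below uses .contents, .children, and .descendants, which BeautifulSoup provides for reaching child nodes. The markup is made up for the illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical markup for the illustration.
html_doc = "<div><ul><li>A</li><li>B</li></ul><p>Note</p></div>"
soup = BeautifulSoup(html_doc, "html.parser")
div = soup.div

# .contents is a list of the direct children.
direct = [child.name for child in div.contents]
# .children iterates over the same direct children.
iterated = [child.name for child in div.children]
# .descendants walks every nested node; keep only the tags here.
nested = [node.name for node in div.descendants if isinstance(node, Tag)]

print(direct, iterated, nested)
```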
Navigating Up
Every tag has a parent, and two attributes, .parent and .parents, help you navigate up the family tree.
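A minimal sketch of navigating up with .parent and .parents, using a made-up document:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<html><body><div><p>text</p></div></body></html>", "html.parser"
)
p = soup.p

direct_parent = p.parent.name              # the immediate parent
ancestors = [a.name for a in p.parents]    # every ancestor, nearest first

print(direct_parent, ancestors)
```

The last ancestor is the BeautifulSoup object itself, whose name is "[document]".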
Navigating Sideways
This technique shows you how to extract information from the same level in the tree. The attributes used to navigate sideways are .next_sibling and .previous_sibling.
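Navigating sideways can be sketched as follows. The markup is hypothetical and is written without whitespace between tags, because in indented HTML the nearest sibling is often a whitespace string rather than a tag.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<ul><li>A</li><li>B</li><li>C</li></ul>", "html.parser")
middle = soup.find_all("li")[1]

after = middle.next_sibling.string        # the sibling that follows
before = middle.previous_sibling.string   # the sibling that precedes
first_has_no_previous = soup.li.previous_sibling is None
following = [s.string for s in soup.li.next_siblings]  # all later siblings

print(before, after, first_has_no_previous, following)
```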
Navigating Back and Forth
This technique shows you how to parse the tree back and forth. The attributes used to navigate back and forth are .next_element and .previous_element, and .next_elements and .previous_elements.
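The sketch below contrasts .next_sibling with .next_element on a made-up snippet. The element attributes follow the order in which nodes were parsed, so the next element of a tag is usually its own first child.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

soup = BeautifulSoup("<p><b>bold</b>plain</p><a>link</a>", "html.parser")
b = soup.b

sibling = b.next_sibling              # the next node on the same level
element = b.next_element              # the next node parsed: the string inside <b>
before_a = soup.a.previous_element    # the node parsed just before <a>

# Everything parsed after <p>, in order; strings are shown as text, tags by name.
order = [
    str(el) if isinstance(el, NavigableString) else el.name
    for el in soup.p.next_elements
]

print(sibling, element, before_a, order)
```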
Modifying the Tree
With BeautifulSoup, you can also modify the tree and write your changes as a new HTML or XML document.
.string
append()
NavigableString()
new_tag()
insert()
insert_before() and insert_after()
clear()
extract()
decompose()
replace_with()
wrap()
unwrap()
Demonstrate how to modify a web tree to get the desired result with the help of an example.
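A few of the modification methods can be sketched like this. The markup, tag names, and href value are made up for the example, and only some of the listed methods are shown.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<div><p>Old text</p><span>remove me</span></div>", "html.parser"
)

soup.p.string = "New text"                  # .string replaces a tag's contents

link = soup.new_tag("a", href="/courses")   # new_tag() creates a new tag
link.string = "Courses"
soup.div.append(link)                       # append() adds it as the last child

soup.span.decompose()                       # decompose() removes and destroys a tag

soup.p.wrap(soup.new_tag("section"))        # wrap() encloses a tag in another tag

result = str(soup)
print(result)
```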
Parsing Only Part of the Document
This feature of parsing a part of the document will not work with the html5lib parser.
Parsing Part of the Document
Demonstrate how to parse only a part of a document with the help of an example.
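In BeautifulSoup, partial parsing is done with the SoupStrainer class, passed through the parse_only argument. The sketch below keeps only the a tags of a made-up document and assumes the built-in "html.parser".

```python
from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical document for the illustration.
html_doc = (
    "<html><body><a href='/one'>One</a><p>Skip me</p>"
    "<a href='/two'>Two</a></body></html>"
)

only_links = SoupStrainer("a")   # keep <a> tags and their contents
soup = BeautifulSoup(html_doc, "html.parser", parse_only=only_links)

hrefs = [a["href"] for a in soup.find_all("a")]
skipped = soup.find("p")         # the <p> tag never entered the tree

print(hrefs, skipped)
```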
Output: Printing and Formatting
Printing
prettify(): turns a parse tree into a nicely formatted string, with each tag and string on its own line.
str() (unicode() in Python 2): turns a parse tree into a string without decorative formatting.
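The two printing options can be compared with a short sketch such as the following, which uses Python 3 and a made-up snippet.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><p>Hello</p></div>", "html.parser")

compact = str(soup)        # no added whitespace
pretty = soup.prettify()   # one tag or string per line, indented by depth

print(compact)
print(pretty)
```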
Formatting
The formatters are used to generate different types of output with the desired formatting.
Minimal: Minimal formatting processes strings just enough to ensure that the output is valid HTML/XML.
HTML and XML: HTML and XML formatting will convert Unicode characters into HTML and XML entities, respectively.
None: None formatting will not modify the content or string on output.
Uppercase and lowercase: Uppercase and lowercase formatting will convert string values to uppercase and lowercase, respectively.
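The effect of each formatter can be sketched with the formatter argument, which output methods such as decode() and prettify() accept. The sample text and the uppercase() helper are made up for the example, and uppercase formatting is assumed to be done by passing a function.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Caf\u00e9 &amp; bar</p>", "html.parser")


def uppercase(text):
    """Hypothetical custom formatter: convert string values to uppercase."""
    return text.upper()


minimal = soup.p.decode(formatter="minimal")  # escapes only what validity requires
as_html = soup.p.decode(formatter="html")     # uses HTML entities where possible
untouched = soup.p.decode(formatter=None)     # leaves strings unmodified
upper = soup.p.decode(formatter=uppercase)    # applies the custom function

for line in (minimal, as_html, untouched, upper):
    print(line)
```

Note that the None and custom function formatters do not escape the ampersand, so their output may not be valid HTML.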
Formatting and Printing
Scrape the Simplilearn website page and perform the following tasks:
• View and print the Simplilearn web page content in a proper format
• View the head and title
• Print all the href links present in the Simplilearn web page
Scrape the Simplilearn website resource page and perform the following tasks:
• View and print the Simplilearn web page content in a proper format
• View the head and title
• Print all the href links present in the Simplilearn web page
• Search and print the resource headers of the Simplilearn web page
• Search resource topics
• View the article names and navigate through them
Knowledge Check 1
Which of the following is the only XML parser?
a. html.parser
b. lxml
c. lxml.xml
d. html5lib
Knowledge Check 2
In which of the following formats is the BeautifulSoup output encoded?
a. ASCII
b. Unicode
c. latin-1
d. UTF-8
Knowledge Check 3
Which of the following libraries is used to extract a web page?
a. Beautiful Soup
b. Pandas
c. Requests
d. Numpy
Knowledge Check 4
Which of the following is NOT an object in BeautifulSoup?
a. Tag
b. NextSibling
c. NavigableString
d. Comment