Web Scraping Using Python: A Step by Step Guide: September 2019
Jiahao Wu
Fordham University
The need to extract data from websites keeps growing. When we conduct data-related projects such as price monitoring, business analytics or news aggregation, we often need to record data from websites. However, copying and pasting data line by line is outdated. In this article, we will show you how to become an “insider” at extracting data from websites, that is, how to do web scraping with Python.
Step 0: Introduction
Web scraping is a technique that helps us transform unstructured HTML data into structured data in a spreadsheet or database. Besides writing Python code, there are other ways to do web scraping, such as accessing website data through an API or using a data extraction tool like Octoparse.
Some big websites, such as Airbnb or Twitter, provide APIs for developers to access their data. API stands for Application Programming Interface, an access point through which two applications can communicate with each other. For most people, an API is the best way to obtain data offered by the website itself.
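To make this concrete, here is a minimal, hedged sketch of what calling a JSON API with the requests library could look like. The endpoint, token and parameters below are placeholders for illustration, not a real Airbnb or Twitter API:

import requests

# Hypothetical endpoint and token; a real API's documentation defines these.
API_URL = "https://api.example.com/v1/listings"
headers = {"Authorization": "Bearer YOUR_ACCESS_TOKEN"}
params = {"city": "New York", "limit": 20}

response = requests.get(API_URL, headers=headers, params=params, timeout=10)
response.raise_for_status()   # stop early on HTTP errors
data = response.json()        # the API hands back structured JSON directly

for item in data.get("results", []):
    print(item.get("name"), item.get("price"))

Because the data already arrives in a structured form, there is no HTML parsing step at all, which is why an official API is usually preferable when one exists.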
However, most websites don’t offer an API. And even when they do, the data you can get is not always what you want. Therefore, writing a Python script to build a web crawler becomes another powerful and flexible solution.
Flexibility: As we know, websites update quickly; not only the content but also the page structure changes frequently. Python is an easy-to-use language because it is dynamically typed and highly productive, so people can change their code easily and keep up with the pace of web updates.
Powerful: Python has a large collection of mature libraries. For example, requests and beautifulsoup4 help us fetch URLs and pull information out of web pages. Selenium helps us get around some anti-scraping techniques by giving web crawlers the ability to mimic human browsing behavior. In addition, re, numpy and pandas help us clean and process the data.
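As a small illustration of that last point, a cleaning pass over scraped text with re and pandas might look like the sketch below; the sample strings are made up for demonstration:

import re
import pandas as pd

# Made-up snippets standing in for text pulled out of HTML
raw_text = [
    "  Great food!\n\n  ",
    "Service was slow...   but friendly.",
]

# Collapse runs of whitespace and trim the ends
clean = [re.sub(r"\s+", " ", t).strip() for t in raw_text]

# Load into a DataFrame so the results can be inspected or exported
df = pd.DataFrame({"text": clean})
df.to_csv("clean_text.csv", index=False)
print(df)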
In this tutorial, we will show you how to scrape reviews from Yelp. We will use two libraries: BeautifulSoup from bs4 and urllib.request from the standard library. These two libraries are commonly used in building a web crawler with Python.
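A minimal sketch of this first step could look like the following; the URL is a placeholder for the Yelp business page you want to scrape, and the browser-like User-Agent header is a common precaution rather than a required part of the method:

from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

# Placeholder URL: substitute the Yelp business page you want to scrape
url = "https://www.yelp.com/biz/some-restaurant-new-york"

# Some sites reject requests that do not look like they come from a browser
req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
html = urlopen(req).read()

# Parse the raw HTML into a navigable "soup" object
soup = BeautifulSoup(html, "html.parser")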
Now we have the “soup”, which is the raw HTML for this website. We can use prettify() to format the raw data and print it to see the nested structure of the HTML in the “soup”.
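For example, continuing from the soup above (the tag and class name used to locate the review text are assumptions about Yelp’s markup and may need adjusting after inspecting the page in your browser’s developer tools):

# Print an indented view of the nested HTML structure
print(soup.prettify())

# Pull out the review text. "comment" is an assumed class name, not
# necessarily the one Yelp uses today; check the page source first.
reviews = [p.get_text(strip=True) for p in soup.find_all("p", class_="comment")]

for review in reviews:
    print(review)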
Now we have successfully obtained all the clean reviews in fewer than 20 lines of code.
This is just a demo that scrapes 20 reviews from Yelp. In a real project, we may face many other situations. For example, we will need steps such as pagination (sketched below) to go to other pages and extract the remaining reviews for this shop, or we may also want to scrape other information such as the reviewer name, reviewer location, review time, rating, check-ins......
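For instance, Yelp has typically paged its reviews with a start query parameter (an assumption worth verifying against the current site), so a pagination sketch building on the code above might be:

from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

base_url = "https://www.yelp.com/biz/some-restaurant-new-york"  # placeholder
all_reviews = []

# Assume 20 reviews per page and 5 pages; both numbers are placeholders
for start in range(0, 100, 20):
    page_url = f"{base_url}?start={start}"
    req = Request(page_url, headers={"User-Agent": "Mozilla/5.0"})
    soup = BeautifulSoup(urlopen(req).read(), "html.parser")
    # "comment" is the same assumed class name as before
    for p in soup.find_all("p", class_="comment"):
        all_reviews.append(p.get_text(strip=True))

print(len(all_reviews), "reviews collected")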
To get the above information, we would need to learn more functions and libraries, such as Selenium or regular expressions. It would be worthwhile to spend more time digging into the challenges of web scraping.
However, if you are looking for a simpler way to do web scraping, Octoparse could be another solution. Octoparse is a powerful web scraping tool that can help you easily obtain information from websites. Check out this tutorial about how to scrape reviews from Yelp with Octoparse. Feel free to contact us when you need a powerful web-scraping tool for your business or project!