Web Scraping Using Python: A Step by Step Guide: September 2019
Jiahao Wu
Fordham University
The need to extract data from websites keeps growing. When we conduct data-related projects such as price monitoring, business analytics or news aggregation, we often need to record data from websites. However, copying and pasting data line by line is outdated. In this article, we will show you how to become an “insider” at extracting data from websites, that is, how to do web scraping with Python.
Step 0: Introduction
Web scraping is a technique that helps us transform unstructured HTML data into structured data in a spreadsheet or database. Besides writing Python code, there are other ways to do web scraping, such as accessing website data through an API or using a data extraction tool like Octoparse.
Some big websites, such as Airbnb or Twitter, provide APIs for developers to access their data. API stands for Application Programming Interface, an access point through which two applications can communicate with each other. For most people, an API is the best way to obtain data offered by the website itself.
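To make this concrete, here is a minimal, hedged sketch of what calling a JSON API with the requests library could look like. The endpoint, token and parameters below are placeholders for illustration, not a real Airbnb or Twitter API:

import requests

# Hypothetical endpoint and token; a real API's documentation defines these.
API_URL = "https://api.example.com/v1/listings"
headers = {"Authorization": "Bearer YOUR_ACCESS_TOKEN"}
params = {"city": "New York", "limit": 20}

response = requests.get(API_URL, headers=headers, params=params, timeout=10)
response.raise_for_status()   # stop early on HTTP errors
data = response.json()        # the API hands back structured JSON directly

for item in data.get("results", []):
    print(item.get("name"), item.get("price"))

Because the data already arrives in a structured form, there is no HTML parsing step at all, which is why an official API is usually preferable when one exists.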
However, most websites don’t offer an API. And even when they do, the data you can get is not always what you want. Therefore, writing a Python script to build a web crawler becomes another powerful and flexible solution.
Flexibility: As we know, websites update quickly; not only the content but also the page structure changes frequently. Python is an easy-to-use language because it is dynamically typed and highly productive, so people can change their code easily and keep up with the pace of web updates.
Powerful: Python has a large collection of mature libraries. For example, requests and beautifulsoup4 help us fetch URLs and pull information out of web pages. Selenium helps us get around some anti-scraping techniques by giving web crawlers the ability to mimic human browsing behavior. In addition, re, numpy and pandas help us clean and process the data.
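As a small illustration of that last point, a cleaning pass over scraped text with re and pandas might look like the sketch below; the sample strings are made up for demonstration:

import re
import pandas as pd

# Made-up snippets standing in for text pulled out of HTML
raw_text = [
    "  Great food!\n\n  ",
    "Service was slow...   but friendly.",
]

# Collapse runs of whitespace and trim the ends
clean = [re.sub(r"\s+", " ", t).strip() for t in raw_text]

# Load into a DataFrame so the results can be inspected or exported
df = pd.DataFrame({"text": clean})
df.to_csv("clean_text.csv", index=False)
print(df)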
In this tutorial, we will show you how to scrape reviews from Yelp. We will use two libraries: BeautifulSoup from bs4 and urllib.request from the standard library. These two libraries are commonly used in building a web crawler with Python.
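A minimal sketch of this first step could look like the following; the URL is a placeholder for the Yelp business page you want to scrape, and the browser-like User-Agent header is a common precaution rather than a required part of the method:

from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

# Placeholder URL: substitute the Yelp business page you want to scrape
url = "https://www.yelp.com/biz/some-restaurant-new-york"

# Some sites reject requests that do not look like they come from a browser
req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
html = urlopen(req).read()

# Parse the raw HTML into a navigable "soup" object
soup = BeautifulSoup(html, "html.parser")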
Now we have the “soup”, which is the raw HTML for this website. We can use prettify() to format the raw data and print it to see the nested structure of the HTML in the “soup”.
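For example, continuing from the soup above (the tag and class name used to locate the review text are assumptions about Yelp’s markup and may need adjusting after inspecting the page in your browser’s developer tools):

# Print an indented view of the nested HTML structure
print(soup.prettify())

# Pull out the review text. "comment" is an assumed class name, not
# necessarily the one Yelp uses today; check the page source first.
reviews = [p.get_text(strip=True) for p in soup.find_all("p", class_="comment")]

for review in reviews:
    print(review)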
Now we have successfully obtained all the clean reviews in fewer than 20 lines of code.
This is just a demo that scrapes 20 reviews from Yelp. In a real project, we may face many other situations. For example, we will need steps such as pagination (sketched below) to go to other pages and extract the remaining reviews for this shop, or we may also want to scrape other information such as the reviewer name, reviewer location, review time, rating, check-ins......
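For instance, Yelp has typically paged its reviews with a start query parameter (an assumption worth verifying against the current site), so a pagination sketch building on the code above might be:

from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

base_url = "https://www.yelp.com/biz/some-restaurant-new-york"  # placeholder
all_reviews = []

# Assume 20 reviews per page and 5 pages; both numbers are placeholders
for start in range(0, 100, 20):
    page_url = f"{base_url}?start={start}"
    req = Request(page_url, headers={"User-Agent": "Mozilla/5.0"})
    soup = BeautifulSoup(urlopen(req).read(), "html.parser")
    # "comment" is the same assumed class name as before
    for p in soup.find_all("p", class_="comment"):
        all_reviews.append(p.get_text(strip=True))

print(len(all_reviews), "reviews collected")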
To get the above information, we would need to learn more functions and libraries, such as Selenium or regular expressions. It would be worthwhile to spend more time digging into the challenges of web scraping.
However, if you are looking for a simpler way to do web scraping, Octoparse could be another solution. Octoparse is a powerful web scraping tool that can help you easily obtain information from websites. Check out this tutorial about how to scrape reviews from Yelp with Octoparse. Feel free to contact us when you need a powerful web-scraping tool for your business or project!