
Christos Chen


May 12, 2020



Data Scraping: A Quick, Basic Tutorial in Python

What is Data Scraping?

If you're like me, you might have heard this term briefly thrown around in computer science or business contexts. So… what exactly IS data scraping, and why has everybody been talking about it?

As every curious individual living in 2020 might do, I googled what it meant. Dictionary.com defines it as:

[Image: the Dictionary.com definition of data scraping]

In short, data scraping is used to easily make vast amounts of data:

• Accessible

• Usable

• Aggregable

Why does this matter to Data Scientists?

A data scientist's role is to tell stories through data. Much as words give writers the ability to articulate, data equips data scientists to tell relevant, meaningful, yet holistic narratives about the past and the present, and to leverage them for a better future.

As a building block of data science, data extracted through web scraping can be utilized in, to name a few areas:

• Natural Language Processing

• Machine Learning

• Data Visualization & Analysis

Industry Prominence & Usage

In a wider context, data scraping is not a new practice. However, recent automation has enabled companies to truly harness its power. Data scraping can provide valuable insight into the customer experience, better inform business decisions and performance, and drive innovation at previously unattainable rates. It has found use in data analysis and visualization, research and development, and market analysis, among other contexts.

The Basics

To perform data scraping, you must have a basic understanding of HTML structure. For those of you who aren't very experienced with computer science — don't fret! You just need enough to identify some simple HTML structures. Here are a few of the most commonly seen ones:

• Headings: Defined with the <h1> to <h6> tags, from most important to least important respectively.
Example: <h1> This is the Heading shown! </h1>

• Paragraphs: Defined with the <p> tag.
Example: <p> This is the Paragraph shown! </p>

• Divisions: Defined with the <div> tag.
Example: <div> This is the container shown! </div>
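To see these structures in action before touching a live site, here is a minimal sketch that parses a tiny, made-up HTML snippet with BeautifulSoup, the same library we will use below:

from bs4 import BeautifulSoup

# A made-up snippet combining the three structures above
html = """
<div>
  <h1>This is the Heading shown!</h1>
  <p>This is the Paragraph shown!</p>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')
print(soup.h1.text)   # This is the Heading shown!
print(soup.p.text)    # This is the Paragraph shown!
print(soup.div.name)  # div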

The Process
Web scraping can be broken down into 4 general steps:

1. Finding the Desired URL to be Scraped

2. Inspecting the Page

3. Identifying Elements for Extraction

4. Extracting & Storing the Data

Getting Started!

First, make sure that you have Python and BeautifulSoup installed on your computer! Both can be downloaded directly online.

We will import BeautifulSoup for navigating the data and urlopen to extract the HTML.

from urllib.request import urlopen
from bs4 import BeautifulSoup

1. Finding the Desired URL to be Scraped

First, identify the URL of the page you want to scrape. Here, we will be scraping a recently posted BBC article:

Global coronavirus cases rise above four million (www.bbc.com)
"More than four million confirmed cases of coronavirus have been reported around the world, according to data collated…"
2. Inspecting the Page

Now, we'll find the desired data by inspecting the page. Do this by right-clicking anywhere on the page and selecting "Inspect Element".

You should see a bunch of HTML code flood your screen under the Elements tab.

Overwhelmed? DON'T PANIC!

3. Identifying Elements for Extraction

Now, we want to identify and navigate to the smallest division, shown as <div>, that contains all of the desired data to be scraped. Recall the basic HTML structures mentioned above! Since we want to scrape the body of the article, we are looking for the <p> tags, which represent paragraphs.

We've found the article body, the <p>s!

After finding all of the desired data, we look for the most specific division, or <div>, that contains it. Because the elements are indented based on a hierarchy, we can see that all the desired data falls under the <div> with the class story-body__inner. This division contains the entire article we want!

4. Extracting & Storing the Data!

First, we want to connect to the website and retrieve the HTML data.
link = "https://www.bbc.com/news/world-52603017"try:page =
urlopen(link)except:print("Error connecting to the URL")
A try/except is utilized here to catch an error thrown if the URL was
not valid.

Now, in order to extract the desired data, we will create a BeautifulSoup object to parse the HTML data we just retrieved.

soup = BeautifulSoup(page, 'html.parser')

Earlier, we identified the division, shown by the <div> tag with the class story-body__inner, that contains our data.

We will now parse and identify that specific division using the
BeautifulSoup object’s find() function:
content = soup.find('div', {"class": "story-body__inner"})

Now we will iterate through the specific division we just isolated above, identifying all of the paragraph text, indicated by a <p> within that division, using BeautifulSoup's find_all() function:

news = ''
for x in content.find_all('p'):
    news += ' ' + x.text

The above code stores the entire body of the article in the news
variable, which can later be placed into a data frame alongside other
extracted data!
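For instance, here is a minimal sketch of that last step, assuming pandas is installed (pip install pandas); the column names are illustrative:

import pandas as pd

# Place the scraped text into a one-row data frame
df = pd.DataFrame({"source": [link], "article_text": [news]})
print(df.head())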

That scraped data can be stored in a CSV. In this example, we will write the contents of the news variable to a CSV called "article.csv".

with open('article.csv', 'w') as file:
    file.write(news)
And … just like that, you’ve performed some basic data scraping!
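Putting the four steps together, here is the complete sketch as one runnable script. Note that the BBC page layout may have changed since this article was written, so the story-body__inner class may need updating:

from urllib.request import urlopen
from bs4 import BeautifulSoup

# Step 1: the URL to be scraped
link = "https://www.bbc.com/news/world-52603017"

# Connect and retrieve the HTML
try:
    page = urlopen(link)
except:
    print("Error connecting to the URL")

# Parse the HTML
soup = BeautifulSoup(page, 'html.parser')

# Isolate the division identified during inspection
content = soup.find('div', {"class": "story-body__inner"})

# Collect all paragraph text within that division
news = ''
for x in content.find_all('p'):
    news += ' ' + x.text

# Store the result in a CSV
with open('article.csv', 'w') as file:
    file.write(news)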

Ethics of Data Scraping

As with all powerful tools, data scraping can also be utilized unethically, maliciously, and illegally. It has been used to plagiarize, spam, and even commit identity theft and fraud.

We are working with some powerful stuff here!

While data scraping is in itself ethically neutral, novice and expert data scientists alike need to be aware of the implications of our actions, especially regarding the safety and privacy of personal data:

• Use public APIs, if available, instead of scraping

• Do not request large amounts of data that may overload servers or be perceived as a DDoS attack (see the politeness sketch at the end of this section)

• Respect other people's work and do not steal content

With that in mind, good luck and happy scraping!
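Before you go, here is a minimal politeness sketch for the second guideline above. It assumes the target site publishes a robots.txt and uses only the standard library:

import time
from urllib.robotparser import RobotFileParser

# Check the site's robots.txt before fetching
rp = RobotFileParser("https://www.bbc.com/robots.txt")
rp.read()

url = "https://www.bbc.com/news/world-52603017"
if rp.can_fetch("*", url):
    # ... fetch and parse the page here ...
    time.sleep(1)  # pause between requests so you don't overload the server
else:
    print("robots.txt disallows scraping this URL")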


Suppose you want to get some information from a website, say an article from the GeeksforGeeks website or some news article. What will you do? The first thing that may come to mind is to copy and paste the information into your local media. But what if you want a large amount of data on a daily basis, and as quickly as possible? In such situations, copy and paste will not work, and that's where you'll need web scraping. In this article, we will discuss how to perform web scraping using the requests library and the BeautifulSoup library in Python.

Requests Module
The requests library is used for making HTTP requests to a specific URL and returns the response. Python requests provides inbuilt functionality for managing both the request and the response.

Installation

Installation of requests depends on the type of operating system; the basic command anywhere is to open a command terminal and run:
pip install requests

Making a Request

The Python requests module has several built-in methods to make HTTP requests to a specified URI using GET, POST, PUT, PATCH, or HEAD requests. An HTTP request is meant to either retrieve data from a specified URI or to push data to a server. It works as a request-response protocol between a client and a server. Here we will be using the GET request. The GET method is used to retrieve information from the given server using a given URI and sends the encoded user information appended to the page request.

Example: Python requests making GET request

Python3

import requests

# Making a GET request
r = requests.get('https://www.geeksforgeeks.org/python-programming-language/')

# check status code for response received
# success code - 200
print(r)

# print content of request
print(r.content)

 
 
Output:

Response object
When one makes a request to a URI, it returns a response. This Response object, in Python terms, is returned by requests.method(), the method being get, post, put, etc. Response is a powerful object with lots of functions and attributes that assist in normalizing data or creating ideal portions of code. For example, response.status_code returns the status code from the headers itself, and one can check whether the request was processed successfully or not.

Response objects expose many other features, methods, and functionalities; a few more are sketched after the next example.

Example: Python requests Response Object

Python3

import requests

# Making a GET request
r = requests.get('https://www.geeksforgeeks.org/python-programming-language/')

# print the URL of the request
print(r.url)

# print status code
print(r.status_code)
Output:
 
https://www.geeksforgeeks.org/python-programming-language/
200
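A few more attributes and methods exist on every Response object; here is a quick sketch (the printed values will vary by server and over time):

import requests

r = requests.get('https://www.geeksforgeeks.org/python-programming-language/')

print(r.ok)                       # True if the status code is below 400
print(r.encoding)                 # encoding inferred from the headers
print(r.headers['Content-Type'])  # headers behave like a dictionary
print(r.elapsed)                  # time taken to receive the response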
 
For more information, refer to our Python Requests Tutorial.
 

BeautifulSoup Library
BeautifulSoup is used to extract information from HTML and XML files. It provides a parse tree and functions to navigate, search, or modify this parse tree.
• Beautiful Soup is a Python library used to pull data out of HTML and XML files for web scraping purposes. It produces a parse tree from the page source code that can be used to extract data hierarchically and more legibly.
• It was created by Leonard Richardson, who still contributes to the project, and the project is also supported by Tidelift (a paid subscription tool for open-source maintenance).
• Beautiful Soup 3 was officially released in May 2006. The latest version released by Beautiful Soup is 4.9.2, and it supports Python 3 as well as Python 2.4.

Features of Beautiful Soup


Beautiful Soup is a Python library designed for quick turnaround projects like screen-scraping. Three features make it powerful:
1. Beautiful Soup provides a few simple methods and Pythonic idioms for navigating, searching, and modifying a parse tree: a toolkit for dissecting a document and extracting what you need. It doesn't take much code to write an application.
2. Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You don't have to think about encodings unless the document doesn't specify an encoding and Beautiful Soup can't detect one; then you just have to specify the original encoding.
3. Beautiful Soup sits on top of popular Python parsers like lxml and html5lib, allowing you to try out different parsing strategies or trade speed for flexibility.
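As a small sketch of the third point, the same (deliberately malformed) snippet can be fed to different parsers; html.parser ships with Python, while lxml and html5lib must be installed separately (pip install lxml html5lib):

from bs4 import BeautifulSoup

html = "<p>Hello, <b>world"   # unclosed tags on purpose

print(BeautifulSoup(html, 'html.parser').prettify())
print(BeautifulSoup(html, 'lxml').prettify())       # fast and lenient
print(BeautifulSoup(html, 'html5lib').prettify())   # slowest, most browser-like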
 

Installation

 
To install BeautifulSoup on Windows, Linux, or any operating system, you will need the pip package. To check how to install pip on your operating system, check out: PIP Installation – Windows || Linux. Then run the below command in the terminal:
 
pip install beautifulsoup4
Inspecting Website

 
Before extracting any information from the HTML of a page, we must understand the structure of the page. This needs to be done in order to select the desired data from the entire page. We can do this by right-clicking on the page we want to scrape and selecting Inspect Element.

After clicking Inspect Element, the browser's Developer Tools open. Almost all browsers now ship with Developer Tools, and we will be using Chrome for this tutorial.

The Developer Tools let you see the site's Document Object Model (DOM). If you don't know about the DOM, don't worry: just consider the text displayed as the HTML structure of the page.
 

Parsing the HTML

 
After getting the HTML of the page, let's see how to parse this raw HTML code into some useful information. First of all, we will create a BeautifulSoup object by specifying the parser we want to use.

Note: The BeautifulSoup library is built on top of HTML parsing libraries like html5lib, lxml, html.parser, etc., so the BeautifulSoup object is created and the parser library is specified at the same time.
 

Example: Python BeautifulSoup Parsing HTML

Python3

import requests
from bs4 import BeautifulSoup

# Making a GET request
r = requests.get('https://www.geeksforgeeks.org/python-programming-language/')

# check status code for response received
# success code - 200
print(r)

# Parsing the HTML
soup = BeautifulSoup(r.content, 'html.parser')

print(soup.prettify())

Output:

This information is still not very useful to us; let's look at another example to get a clearer picture. Let's try to extract the title of the page.

Python3

import requests
from bs4 import BeautifulSoup

# Making a GET request
r = requests.get('https://www.geeksforgeeks.org/python-programming-language/')

# Parsing the HTML
soup = BeautifulSoup(r.content, 'html.parser')

# Getting the title tag
print(soup.title)

# Getting the name of the tag
print(soup.title.name)

# Getting the name of the parent tag
print(soup.title.parent.name)

# Use the children attribute to get the tag's child
# (here, the title text as a one-element list)
print(list(soup.title.children))

Output:
<title>Python Programming Language - GeeksforGeeks</title>
title
html
['Python Programming Language - GeeksforGeeks']

Finding Elements

Now we would like to extract some useful data from the HTML content. The soup object contains all the data in a nested structure that can be programmatically extracted. The website we want to scrape contains a lot of text, so let's scrape all of that content. First, let's inspect the webpage we want to scrape.
Finding Elements by class

In the above image, we can see that all the content of the page is under the div with class entry-content. We will use the find() method, which returns the first tag matching the given attributes; in our case, the div having class entry-content. That gets us all the content from the site, but you can see that the images and links are also scraped. So our next task is to find only the text content from the parsed HTML. Inspecting the HTML of our website again, we can see that the text of the page is under the <p> tag. Now we have to find all the p tags inside this division, which we can do with BeautifulSoup's find_all() method.

Python3

import requests
from bs4 import BeautifulSoup

# Making a GET request
r = requests.get('https://www.geeksforgeeks.org/python-programming-language/')

# Parsing the HTML
soup = BeautifulSoup(r.content, 'html.parser')

s = soup.find('div', class_='entry-content')

content = s.find_all('p')

print(content)

 
 
Output:
 

Finding Elements by ID

 
In the above example, we found elements by class name; now let's see how to find elements by id. For this task, let's scrape the content of the leftbar of the page. The first step is to inspect the page and see under which tag the leftbar falls.

The above image shows that the leftbar falls under the <div> tag with the id main. Now let's get the HTML content under this tag. Inspecting more of the page, we can see that the list in the leftbar is under the <ul> tag with the class leftBarList, and our task is to find all the li under this ul.

Python3

import requests
from bs4 import BeautifulSoup

# Making a GET request
r = requests.get('https://www.geeksforgeeks.org/python-programming-language/')

# Parsing the HTML
soup = BeautifulSoup(r.content, 'html.parser')

# Finding by id
s = soup.find('div', id='main')

# Getting the leftbar
leftbar = s.find('ul', class_='leftBarList')

# All the li under the above ul
content = leftbar.find_all('li')

print(content)

 
 
Output:
 

Extracting Text from the tags

 
In the above examples, you must have seen that while scraping the data, the tags also get scraped. But what if we want only the text, without any tags? Don't worry, we will discuss exactly that in this section, using the text property, which returns only the text from a tag. We will reuse the above examples and remove all the tags from them.
 

Example 1: Removing the tags from the content of the page

Python3

import requests
from bs4 import BeautifulSoup

# Making a GET request
r = requests.get('https://www.geeksforgeeks.org/python-programming-language/')

# Parsing the HTML
soup = BeautifulSoup(r.content, 'html.parser')

s = soup.find('div', class_='entry-content')

lines = s.find_all('p')

for line in lines:
    print(line.text)

 
 
Output:
 
Example 2: Removing the tags from the content of the leftbar

Python3

import requests
from bs4 import BeautifulSoup

# Making a GET request
r = requests.get('https://www.geeksforgeeks.org/python-programming-language/')

# Parsing the HTML
soup = BeautifulSoup(r.content, 'html.parser')

# Finding by id
s = soup.find('div', id='main')

# Getting the leftbar
leftbar = s.find('ul', class_='leftBarList')

# All the li under the above ul
lines = leftbar.find_all('li')

for line in lines:
    print(line.text)

 
 
Output:
 
Extracting Links

 
So far we have seen how to extract text; let's now see how to extract the links from the page.
 

Example: Python BeautifulSoup Extracting Links

Python3

import requests
from bs4 import BeautifulSoup

# Making a GET request
r = requests.get('https://www.geeksforgeeks.org/python-programming-language/')

# Parsing the HTML
soup = BeautifulSoup(r.content, 'html.parser')

# find all the anchor tags and print their "href"
for link in soup.find_all('a'):
    print(link.get('href'))

 
 
Output:
 

Extracting Image Information

 
Inspecting the page once again, we can see that images lie inside the img tag and the link to each image is in its src attribute. See the image below:
 

Example: Python BeautifulSoup Extract Image

Python3

import requests
from bs4 import BeautifulSoup

# Making a GET request
r = requests.get('https://www.geeksforgeeks.org/python-programming-language/')

# Parsing the HTML
soup = BeautifulSoup(r.content, 'html.parser')

images_list = []

images = soup.select('img')
for image in images:
    src = image.get('src')
    alt = image.get('alt')
    images_list.append({"src": src, "alt": alt})

for image in images_list:
    print(image)

 
 
Output:
 

Scraping multiple Pages


 
Now, there may arise various instances where you want to get data from multiple pages of the same website, or from multiple different URLs, and manually writing code for each webpage is a time-consuming and tedious task. Plus, it defies all basic principles of automation. Duh!

To solve this exact problem, we will see two main techniques that will help us extract data from multiple webpages:

• Pages of the same website

• Different website URLs

Example 1: Looping through the page numbers 

Most websites have pages labeled from 1 to N. This makes it really simple for us to loop through those pages and extract data from them, as these pages have similar structures. For example:

[Image: page numbers at the bottom of the GeeksforGeeks website]

Here, we can see the page number at the end of the URL. Using this information, we can easily create a for loop iterating over as many pages as we want (by putting page/(i)/ in the URL string and iterating i up to N) and scrape all the useful data from them. The following code will give you more clarity on how to scrape data using a for loop in Python.

Python3

import requests
from bs4 import BeautifulSoup as bs

URL = 'https://www.geeksforgeeks.org/page/1/'

req = requests.get(URL)
soup = bs(req.text, 'html.parser')

titles = soup.find_all('div', attrs={'class': 'head'})

print(titles[4].text)

Output:
7 Most Common Time Wastes During Software Development
Now, using the above code, we can get the titles of all the articles by just sandwiching those lines with a loop.

Python3

import requests
from bs4 import BeautifulSoup as bs

URL = 'https://www.geeksforgeeks.org/page/'

for page in range(1, 10):
    req = requests.get(URL + str(page) + '/')
    soup = bs(req.text, 'html.parser')
    titles = soup.find_all('div', attrs={'class': 'head'})

    for i in range(4, 19):
        if page > 1:
            # continue the numbering across pages (15 titles per page)
            print(f"{(i - 3) + (page - 1) * 15}" + titles[i].text)
        else:
            print(f"{i - 3}" + titles[i].text)

Output:

Example 2: Looping through a list of different URLs

The above technique is absolutely wonderful, but what if you need to scrape different pages and you don't know their page numbers? You would have to scrape those different URLs one by one and manually code a script for every such webpage.

Instead, you could just make a list of these URLs and loop through them. By simply iterating over the items in the list, i.e. the URLs, we will be able to extract the titles of those pages without having to write code for each page. Here's an example of how you can do it.

Python3

import requests
from bs4 import BeautifulSoup as bs

URL = ['https://www.geeksforgeeks.org',
       'https://www.geeksforgeeks.org/page/10/']

for url in range(0, 2):
    req = requests.get(URL[url])
    soup = bs(req.text, 'html.parser')
    titles = soup.find_all('div', attrs={'class': 'head'})

    for i in range(4, 19):
        if url + 1 > 1:
            print(f"{(i - 3) + url * 15}" + titles[i].text)
        else:
            print(f"{i - 3}" + titles[i].text)

Output:
For more information, refer to our Python BeautifulSoup Tutorial.
Saving Data to CSV
First, we will create a list of dictionaries with the key-value pairs that we want to add to the CSV file. Then we will use the csv module to write the output to a CSV file. See the below example for a better understanding.

Example: Python BeautifulSoup saving to CSV

Python3

import requests
from bs4 import BeautifulSoup as bs
import csv

URL = 'https://www.geeksforgeeks.org/page/'

# Making a GET request
req = requests.get(URL)

soup = bs(req.text, 'html.parser')

titles = soup.find_all('div', attrs={'class': 'head'})

titles_list = []
count = 1

for title in titles:
    d = {}
    d['Title Number'] = f'Title {count}'
    d['Title Name'] = title.text
    count += 1
    titles_list.append(d)

filename = 'titles.csv'
with open(filename, 'w', newline='') as f:
    w = csv.DictWriter(f, ['Title Number', 'Title Name'])
    w.writeheader()
    w.writerows(titles_list)

 
 
Output:
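As an alternative sketch, the same titles_list built above can be written out with pandas instead of the csv module, assuming pandas is installed:

import pandas as pd

# Each dictionary in titles_list becomes one row
df = pd.DataFrame(titles_list)
df.to_csv('titles.csv', index=False)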
 
 