
Web Scraping Using Python [Step by Step Tutorial]
Written by Ashwin Joy ● in Python

In this tutorial, we are going to do web scraping using Python’s Beautiful Soup library, step by step. Python 3 is ridiculously fast at web scraping, and it provides a beautiful framework for it called Beautiful Soup (the beauty is in the name itself).

Table of Contents 
What is Web Scraping?
Is Web Scraping Legally Allowed?
Why Python?
Web Scraping using Python’s Beautiful Soup
Finding all the Links from a Website

What is Web Scraping?


When we want to extract some important data from a website, we use web scraping.

According to Wikipedia’s definition, web scraping, web harvesting, or web data extraction is
data scraping used for extracting data from websites.

Usually, the ideal and recommended way of picking up data from websites is through their APIs.
But sometimes, when APIs are not available, we go for web scraping.

Is Web Scraping Legally Allowed?


Web scraping is a bit of a grey area. It is not allowed on most websites, so you
have to check with the website owner or read the website’s policies first.

So, make sure you are completely aware of what you are doing, and do web scraping only on
legally allowed websites.

You can scrape your own website for sure. But you can’t scrape or crawl someone else’s website
without obtaining their permission.

Why Python?
Python 3 is the best programming language for web scraping. Python makes web scraping fast
and easy. Also, most of the web scraping tools that come with Kali Linux are written in Python.

Enough of the theories, let’s start scraping the web using the beautiful soup library.

Web Scraping using Python’s Beautiful Soup
The first thing you want to do when you are going to do web scraping is to go to the website that
you want to scrape and analyze it. Web scraping is all about how well you understand the website:
its data structure, how things are laid out, and so on.

The next thing you need to do is to get all the necessary tools and packages. I’m using Python
IDLE to do the scraping. So you should have that ready in your system.

You can also write the code in your shell if needed. After that, we need to install the necessary
packages. We need packages like ‘bs4’, which is Beautiful Soup, ‘requests’, and ‘lxml’ to
proceed.

So go to your command line (CMD) and install them one by one, if you don’t have them already. If
you are on Mac/Linux, use pip3 instead of pip in the following commands.

pip install bs4

pip install lxml

Generally, ‘requests’ comes preinstalled with many Python distributions. If you don’t have it on
your system, install that too.

pip install requests

Now, all your packages are ready. Go to your Python IDLE or Python Shell and let’s write some
code.

First of all, we need to import all three packages. So, let’s do that.

import requests
import bs4
import lxml
Next, you have to make a request to the website that you want to scrape. Let’s create a variable
‘res’ to hold the response.

res = requests.get('https://mywebsite.com')

You can type in your own URL instead of mywebsite.com, which I typed in randomly as an example.

This ‘res’ variable is now storing the entire web page data. If you just type in ‘res.text’ and hit
enter, you can see all the details that this variable is storing.
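
If you want to make sure the request actually succeeded before moving on, you can check the status code of the response. Here is a minimal sketch (the URL is just a placeholder, like before):

import requests

res = requests.get('https://mywebsite.com')  # placeholder URL
print(res.status_code)  # 200 means the request went through fine
print(res.text[:200])   # peek at the first 200 characters of the raw HTML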

We need to extract information from this variable. This is where the Beautiful Soup library
comes in.

We are going to create an object called ‘soup’. For that, we use bs4 and its method called
‘BeautifulSoup’.
This method takes in two parameters: the first is ‘res.text’, and the second one is the parser you
want to use to structure the data. In this case, we are using lxml.

soup = bs4.BeautifulSoup(res.text, 'lxml')

For example, let’s say we want to extract the information about the title tag of that website. So,
let’s create a new variable.

title = soup.select('title')

You can pass any HTML tag you want instead of ‘title’. Now, let’s check what is inside this ‘title’
variable.

print(title)

Then, you will see a list containing the title tag of the website as the output. You have just
scraped the title of that website using Python.
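
Since ‘select’ returns a list of every matching element rather than a single tag, you can index the first match and call ‘getText’ on it to get just the text. A minimal sketch:

title = soup.select('title')
print(title[0].getText())  # prints only the text inside the <title> tag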

You can also scrape data based on a certain CSS class or id, using ‘.classname’ or ‘#idname’
respectively. Let’s see an example.

title = soup.select('.classname')
#or
title = soup.select('#idname')

Enter the name of the class or id you want to scrape in place of ‘classname’ and ‘idname’.
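
Since ‘select’ accepts any CSS selector, you can also combine a tag name with a class or id in one selector. For example (the class name here is a placeholder):

divs = soup.select('div.classname')  # only <div> tags with class 'classname'
for div in divs:
    print(div.getText())  # print the text inside each matching div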
Finding all the Links from a Website
If you want to find all the links on a website, you can do that too. For that, we use a ‘for’
loop and a method called ‘find_all’.

for link in soup.find_all('a', href=True):
    print(link['href'])

Then, you can see all the links listed on your IDLE or shell as output.
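
Putting all the pieces together, here is a minimal end-to-end sketch of the whole process. The URL is a placeholder, and I’ve added ‘urljoin’ (from Python’s standard library) as an extra step to turn relative links into absolute ones:

import requests
import bs4
from urllib.parse import urljoin

url = 'https://mywebsite.com'  # placeholder URL
res = requests.get(url)
soup = bs4.BeautifulSoup(res.text, 'lxml')

# print every link on the page as an absolute URL
for link in soup.find_all('a', href=True):
    print(urljoin(url, link['href']))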

That’s it about the basics of web scraping using Python. If you have any doubts or queries, feel
free to let me know in the comments section down below.

If you enjoyed this article, share it with your friends.


Happy learning!


Ashwin Joy
I'm the face behind Pythonista Planet. I learned my first programming language back in 2015. Ever
since then, I've been learning programming and immersing myself in technology. On this site, I
share everything that I've learned about computer programming.

3 thoughts on “Web Scraping Using Python [Step by Step Tutorial]”
Pachu says:
December 13, 2019 at 8:32 PM
Request.get not working in kali linux


Ashwin Joy says:
December 13, 2019 at 9:55 PM

You might not have the requests library on your system. Download it using pip and try
again. Hopefully, it’ll work.


zaid kamil says:
January 13, 2020 at 11:13 PM

its requests.get(…)


