Web Scraping Using Python [Step by Step Tutorial] – Pythonista Planet
Written by Ashwin Joy ● in Python
In this tutorial, we are going to do web scraping step by step using Python’s Beautiful Soup library. Python makes web scraping quick to get started with, and Beautiful Soup, a third-party library, gives you a clean way to pull data out of HTML (beauty is in the name itself).
Table of Contents
What is Web Scraping?
Is Web Scraping Legally Allowed?
Why Python?
Web Scraping using Python’s Beautiful Soup
Finding all the Links from a Website
What is Web Scraping?
According to Wikipedia’s definition, web scraping, web harvesting, or web data extraction is
data scraping used for extracting data from websites.
Usually, the recommended way of picking up data from a website is through its API.
But sometimes, when no API is available, we go for web scraping.
Is Web Scraping Legally Allowed?
Make sure you are completely aware of what you are doing, and scrape only websites where it is
legally allowed.
You can scrape your own website for sure. But you shouldn’t scrape or crawl someone else’s website
without obtaining their permission.
Why Python?
Python 3 is a very popular language for web scraping. The code is quick to write and easy to
read, and libraries such as requests and Beautiful Soup do most of the heavy lifting. Also, many of
the web scraping tools that ship with Kali Linux are written in Python.
Enough theory. Let’s start scraping the web using the Beautiful Soup library.
Web Scraping using Python’s Beautiful Soup
The first thing to do when you are going to scrape a website is to open it and analyze it. Web
scraping is all about understanding the website: its data structures, how the page is laid out,
and so on.
The next thing you need to do is to get all the necessary tools and packages. I’m using Python
IDLE to do the scraping, so you should have that ready on your system.
You can also write the code in your shell if you prefer. After that, we need to install the necessary
packages. We need ‘bs4’ (Beautiful Soup), ‘requests’, and ‘lxml’ to
proceed.
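So go to your command line (CMD) and install them one by one, if you don’t have them already. If
you are on a Mac or Linux, use pip3 instead of pip in the following commands. The pip package for
‘bs4’ is named beautifulsoup4.
pip install requests
pip install beautifulsoup4
pip install lxml
Note that ‘requests’ is a third-party package and does not ship with Python itself, although some
systems have it preinstalled. If you don’t have it on your system, install it too.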
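To confirm the installs worked before moving on, a quick check like the one below can help. It is only a sketch: the helper name `missing_packages` is made up for this example, and it relies on the standard library’s `importlib.util.find_spec`, which looks a module up without importing it.

```python
import importlib.util


def missing_packages(names):
    """Return the module names from `names` that Python cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# 'bs4' is the import name of Beautiful Soup.
missing = missing_packages(["requests", "bs4", "lxml"])
if missing:
    print("Still need to install:", ", ".join(missing))
else:
    print("All three packages are available.")
```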
Now, all your packages are ready. Go to your Python IDLE or Python Shell and let’s write some
code.
First of all, we need to import all three packages. So, let’s do that.
import requests
import bs4
import lxml
Next, you have to make a request to the website that you want to scrape. Let’s create a variable
‘res’ to store the response.
res = requests.get('https://mywebsite.com')
You can type in your own URL instead of mywebsite.com, which I typed as a random example.
The ‘res’ variable now holds the server’s response, including the entire page source. If you type
‘res.text’ and hit enter, you can see the HTML that was returned.
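In a script, it is usually worth checking that the request actually succeeded before parsing anything. Here is a minimal sketch of that idea. The function name `fetch_html` and the 10-second timeout are my own choices for illustration, not something the steps above require.

```python
import requests


def fetch_html(url, timeout=10):
    """Download a page and return its HTML as text (hypothetical helper)."""
    # Without a timeout, a request to an unresponsive server can hang forever.
    res = requests.get(url, timeout=timeout)
    # Raise requests.exceptions.HTTPError for 4xx/5xx responses instead of
    # quietly handing an error page to the parser.
    res.raise_for_status()
    return res.text
```

Calling `fetch_html('https://mywebsite.com')` would then return the same string you get from ‘res.text’, or raise an exception if the page could not be fetched.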
We need to extract information from this response. This is where the Beautiful Soup library
comes in.
We are going to create an object called ‘soup’. For that, we use bs4 and its class called
‘BeautifulSoup’.
Its constructor takes two arguments: the first is ‘res.text’, and the second is the parser that turns
the HTML into a searchable structure. In this case, we are using lxml.
soup = bs4.BeautifulSoup(res.text,'lxml')
For example, let’s say we want to extract the information about the title tag of that website. So,
let’s create a new variable.
title = soup.select('title')
You can pass any HTML tag you want instead of ‘title’. Now, let’s check what is inside this ‘title’
variable.
print(title)
Then, you will see a list containing the title tag of the website as the output, because ‘select’
always returns a list of matching tags. You have just scraped the title of that website using Python.
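If you want to try this without hitting a real website, the sketch below runs the same steps on a small HTML string that I made up for the example. It uses ‘html.parser’, which ships with Python, as a stand-in for lxml so that it runs even if lxml is not installed. It also shows `select_one` and `get_text`, which are handy when you want just the text instead of a list of tags.

```python
import bs4

# A tiny stand-in page (made up for this example) so no request is needed.
html = "<html><head><title>My Website</title></head><body><p>Hello</p></body></html>"

# 'html.parser' is built into Python; 'lxml' would work the same way here.
soup = bs4.BeautifulSoup(html, "html.parser")

titles = soup.select("title")            # always a list of matching tags
first_title = soup.select_one("title")   # the first match, or None if no match
title_text = first_title.get_text()      # only the text inside the tag

print(titles)      # [<title>My Website</title>]
print(title_text)  # My Website
```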
You can also scrape data based on a CSS class or id, using ‘.classname’ or ‘#idname’
respectively. Let’s see an example.
title = soup.select('.classname')
#or
title = soup.select('#idname')
Enter the name of the class or id you want to scrape in place of ‘classname’ and ‘idname’.
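Here is a small, self-contained sketch of both selectors in action. The HTML, the class name ‘intro’, and the id ‘main’ are all invented for the example, and ‘html.parser’ is used again so that nothing beyond bs4 is needed.

```python
import bs4

# Made-up HTML with one id and one repeated class.
html = """
<div id="main">
  <p class="intro">First paragraph</p>
  <p class="intro">Second paragraph</p>
  <p>Third paragraph</p>
</div>
"""
soup = bs4.BeautifulSoup(html, "html.parser")

# '.intro' matches every tag with class="intro".
intro_texts = [p.get_text() for p in soup.select(".intro")]

# '#main' matches the tag with id="main"; select_one returns that single tag.
main_div = soup.select_one("#main")
paragraph_count = len(main_div.select("p"))  # all <p> tags inside the div

print(intro_texts)      # ['First paragraph', 'Second paragraph']
print(paragraph_count)  # 3
```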
Finding all the Links from a Website
If you want to find all the links that are on a web page, you can do that too. For that, we
use a ‘for’ loop and a method called ‘find_all’.
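A minimal sketch of that loop is shown here. It runs on a made-up HTML string so it works offline; with a real page you would build ‘soup’ from ‘res.text’ as before. Skipping anchors that have no href is my own addition, not a required step.

```python
import bs4

# Made-up HTML: two real links and one anchor without an href.
html = """
<a href="https://example.com/about">About</a>
<a href="/contact">Contact</a>
<a name="top">No href here</a>
"""
soup = bs4.BeautifulSoup(html, "html.parser")

links = []
for a_tag in soup.find_all("a"):   # every <a> tag on the page
    href = a_tag.get("href")       # None when the tag has no href attribute
    if href is not None:
        links.append(href)
        print(href)
```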
Then, you can see all the links listed in IDLE or your shell as output.
That’s it for the basics of web scraping using Python. If you have any questions, feel
free to let me know in the comments section down below.
Ashwin Joy
I'm the face behind Pythonista Planet. I learned my first programming language back in 2015. Ever
since then, I've been learning programming and immersing myself in technology. On this site, I
share everything that I've learned about computer programming.
3 thoughts on “Web Scraping Using Python [Step by Step Tutorial]”
Pachu says:
December 13, 2019 at 8:32 PM
Request.get not working in kali linux
You might not be having the requests library in your system. Download it using pip and try
again. Hopefully, it’ll work.
its requests.get(…)