
PYTHON WEB SCRAPING - FORM BASED WEBSITES

https://www.tutorialspoint.com/python_web_scraping/python_web_scraping_form_based_websites.htm
Copyright © tutorialspoint.com

In the previous chapter, we saw how to scrape dynamic websites. In this chapter, let us understand the scraping of
websites that work on user-based inputs, that is, form-based websites.

Introduction
These days the WWW (World Wide Web) is moving towards social media as well as user-generated content. So the
question arises: how can we access the kind of information that lies beyond the login screen? For this we need to deal
with forms and logins.

In previous chapters, we worked with the HTTP GET method to request information, but in this chapter we will work
with the HTTP POST method, which pushes information to a web server for storage and analysis.
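To make the difference concrete, here is a minimal sketch using the requests library against httpbin.org, a public echo service used purely for illustration −

import requests

# GET retrieves information; the parameters travel in the URL query string
r_get = requests.get("https://httpbin.org/get", params={'q': 'python'})
print(r_get.json()['args'])    # the server echoes the query parameters back

# POST pushes information; the parameters travel in the request body
r_post = requests.post("https://httpbin.org/post", data={'q': 'python'})
print(r_post.json()['form'])   # the server echoes the submitted form data back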

Interacting with Login forms


While working on the Internet, you must have interacted with login forms many times. They may be very simple,
including only a few HTML fields, a submit button and an action page, or they may be complicated and have
additional fields like email and a message area, along with a captcha for security reasons.

In this section, we are going to deal with a simple submit form with the help of the Python requests library.

First, we need to import the requests library as follows −

import requests

Now, we need to provide the information for the fields of the login form.

parameters = {'Name': 'Enter your name', 'Email-id': 'Your Email-id', 'Message': 'Type your message here'}

In the next line of code, we need to provide the URL on which the action of the form will happen.

r = requests.post("enter the URL", data=parameters)
print(r.text)

After running the script, it will return the content of the page where the action has happened.
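One practical question the snippet leaves open is where the action URL and field names come from. They can be read from the form's own HTML, for example with BeautifulSoup, used in earlier chapters of this series; the page URL below is a placeholder −

import requests
from bs4 import BeautifulSoup

page = requests.get("enter the URL of the page containing the form")
soup = BeautifulSoup(page.text, 'html.parser')

form = soup.find('form')
print(form.get('action'))        # the URL the form posts to
for field in form.find_all('input'):
    print(field.get('name'))     # the field names to use as dictionary keys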

Suppose you want to submit an image with the form; that is also very easy with requests.post. You can
understand it with the help of the following Python script −

import requests

# open the image in binary mode; note the raw string for the Windows path
file = {'Uploadfile': open(r'C:\Users\desktop\123.png', 'rb')}
r = requests.post("enter the URL", files=file)
print(r.text)
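A small refinement, offered here as a suggestion beyond the original script: opening the file with a context manager guarantees the handle is closed, and requests also accepts a (filename, file object, content type) tuple when the server needs an explicit MIME type −

import requests

with open(r'C:\Users\desktop\123.png', 'rb') as f:
    # the tuple form names the file and declares its content type explicitly
    files = {'Uploadfile': ('123.png', f, 'image/png')}
    r = requests.post("enter the URL", files=files)
print(r.text)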

Loading Cookies from the Web Server


A cookie, sometimes called a web cookie or internet cookie, is a small piece of data sent from a website and stored
by our web browser in a file on our computer.

In the context of dealing with login forms, cookies can be of two types. One, which we dealt with in the previous section,
allows us to submit information to a website, and the second lets us remain in a permanent "logged-in" state
throughout our visit to the website. For the second kind of form, websites use cookies to keep track of who is
logged in and who is not.

What do cookies do?

These days most websites use cookies for tracking. We can understand the working of cookies with the
help of the following steps −

Step 1 − First, the site will authenticate our login credentials and store them in our browser's cookie. This cookie
generally contains a server-generated token, a time-out and tracking information.

Step 2 − Next, the website will use the cookie as proof of authentication. This proof is presented on every
subsequent visit to the website.

Cookies are very problematic for web scrapers, because if a web scraper does not keep track of the cookies, the
submitted form is sent back and on the next page it appears that it never logged in. It is very easy to track
cookies with the help of the Python requests library, as shown below −

import requests
parameters = {'Name': 'Enter your name', 'Email-id': 'Your Email-id', 'Message': 'Type your message here'}
r = requests.post("enter the URL", data=parameters)

In the above line of code, the URL is the page that will act as the processor for the login form.

print('The cookie is:')
print(r.cookies.get_dict())
print(r.text)

After running the above script, we will retrieve the cookies from the result of the last request.
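To actually remain logged in, those cookies have to travel with every subsequent request. Continuing the script above, a minimal sketch; the protected-page URL is a hypothetical placeholder −

# send the cookies from the login response along with the next request
r2 = requests.get("enter the URL of a protected page", cookies=r.cookies)
print(r2.text)   # should now show the logged-in version of the page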

There is another issue with cookies: sometimes websites modify cookies frequently and without warning. Such a
situation can be dealt with using requests.Session, as follows −

import requests
session = requests.Session()
parameters = {'Name': 'Enter your name', 'Email-id': 'Your Email-id', 'Message': 'Type your message here'}
r = session.post("enter the URL", data=parameters)

In the above line of code, the URL is the page that will act as the processor for the login form.

print('The cookie is:')
print(r.cookies.get_dict())
print(r.text)

Observe the difference between the script with a session and the one without: the session object keeps the cookies up to date for every request automatically.
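The practical benefit shows up on follow-up requests. Here is a minimal sketch, assuming a hypothetical login URL and protected page; the field names are illustrative only −

import requests

session = requests.Session()

# log in once; the session records whatever cookies the server sets,
# including any it modifies later
session.post("enter the login URL", data={'email': 'Enter email', 'password': 'Enter password'})

# every request made through the same session sends the current cookies automatically
r = session.get("enter the URL of a protected page")
print(session.cookies.get_dict())
print(r.text)
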
Automating forms with Python

In this section, we are going to deal with a Python module named Mechanize, which will reduce our work and
automate the process of filling in forms.

Mechanize module

The Mechanize module provides us with a high-level interface to interact with forms. Before we start using it, we need to
install it with the following command −

pip install mechanize

Note that the module as described here works only in Python 2.x (more recent releases of mechanize have since added Python 3 support).

Example

In this example, we are going to automate the process of filling in a login form having two fields, namely email and
password −

import mechanize

brwsr = mechanize.Browser()
brwsr.open("Enter the URL of login")   # navigate to the login page
brwsr.select_form(nr=0)                # select the first form on the page
brwsr['email'] = 'Enter email'
brwsr['password'] = 'Enter password'
response = brwsr.submit()              # submit the form

The above code is very easy to understand. First, we imported the mechanize module. Then a Mechanize browser
object was created. Then, we navigated to the login URL and selected the form. After that, field names and values
were passed directly to the browser object, and the form was submitted.
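Since Mechanize as described above is limited to Python 2, readers on Python 3 may want a similar interface. One option, offered here as a suggestion outside the original tutorial, is the MechanicalSoup library (pip install MechanicalSoup), whose StatefulBrowser exposes a very similar workflow −

import mechanicalsoup

browser = mechanicalsoup.StatefulBrowser()
browser.open("Enter the URL of login")   # navigate to the login page
browser.select_form(nr=0)                # select the first form on the page
browser["email"] = "Enter email"
browser["password"] = "Enter password"
response = browser.submit_selected()     # submit the form
print(response.text)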
