Python Urllib3 - Accessing Web Resources Via HTTP
Python urllib3
last modified July 6, 2020
This Python urllib3 tutorial introduces the urllib3 module. We show how to grab data, post data,
stream data, work with JSON, and use redirects.
The Hypertext Transfer Protocol (HTTP) is an application protocol for distributed, collaborative,
hypermedia information systems. HTTP is the foundation of data communication for the World
Wide Web.
Python urllib3
The urllib3 module is a powerful, sanity-friendly HTTP client for Python. It supports thread
safety, connection pooling, client-side SSL/TLS verification, file uploads with multipart encoding,
helpers for retrying requests and dealing with HTTP redirects, gzip and deflate encoding, and
proxy support for HTTP and SOCKS.
version.py
#!/usr/bin/env python3
import urllib3
print(urllib3.__version__)
$ ./version.py
1.24.1
status.py
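#!/usr/bin/env python3
import urllib3
http = urllib3.PoolManager()
url = 'http://webcode.me'
# send the GET request and print the numeric status code
resp = http.request('GET', url)
print(resp.status)
The example creates a GET request to webcode.me and prints the status code of the response.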
http = urllib3.PoolManager()
We create a PoolManager to generate a request. It handles all of the details of connection pooling
and thread safety.
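To make the idea of connection pooling more concrete, here is a toy sketch. It does not reflect urllib3's real internals; the names TinyPool and FakeConnection are hypothetical and exist only for illustration. The pool hands out idle objects from a thread-safe queue, so repeated requests can reuse a connection instead of creating a new one.

```python
import queue


class FakeConnection:
    """Stand-in for an expensive-to-create network connection (hypothetical)."""

    created = 0  # counts how many connections were ever created

    def __init__(self):
        FakeConnection.created += 1
        self.id = FakeConnection.created


class TinyPool:
    """Toy pool: reuse idle connections instead of creating new ones."""

    def __init__(self, maxsize=2):
        # queue.LifoQueue is thread-safe, so several threads can share the pool
        self._idle = queue.LifoQueue(maxsize)

    def acquire(self):
        try:
            return self._idle.get_nowait()   # reuse an idle connection
        except queue.Empty:
            return FakeConnection()          # none idle: create a new one

    def release(self, conn):
        try:
            self._idle.put_nowait(conn)      # keep it for the next caller
        except queue.Full:
            pass                             # pool is full: drop the connection


pool = TinyPool()
first = pool.acquire()
pool.release(first)
second = pool.acquire()   # the released connection is handed out again
pool.release(second)

print(first is second)
print(FakeConnection.created)
```

In this sketch the second acquire() returns the very object that was released, so only one connection is ever created.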
url = 'http://webcode.me'
resp = http.request('GET', url)
With the request() method, we make a GET request to the specified URL.
print(resp.status)
$ ./status.py
200
The 200 status code means that the request has succeeded.
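A numeric code such as the one in resp.status can be turned into a readable description with the standard library's http.HTTPStatus enumeration. The following is a minimal sketch that does not depend on urllib3; the helper names describe_status and is_success are hypothetical.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short description such as '200 OK' (hypothetical helper)."""
    try:
        status = HTTPStatus(code)
    except ValueError:
        return f'{code} (unknown status code)'
    return f'{status.value} {status.phrase}'


def is_success(code):
    """Codes in the 2xx range indicate a successful request."""
    return 200 <= code < 300


print(describe_status(200))
print(describe_status(404))
print(is_success(200))
```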
Python urllib3 GET request
The HTTP GET method requests a representation of the specified resource.
get_request.py
#!/usr/bin/env python3
import urllib3
http = urllib3.PoolManager()
url = 'http://webcode.me'
resp = http.request('GET', url)
# resp.data holds raw bytes; decode them to get the HTML text
print(resp.data.decode('utf-8'))
The example sends a GET request to the webcode.me web page and prints the HTML code of the
home page.
$ ./get_request.py
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>My html page</title>
</head>
<body>
<p>
Today is a beautiful day. We go swimming and fishing.
</p>
<p>
Hello there. How are you?
</p>
</body>
</html>
head_request.py
#!/usr/bin/env python3
import urllib3
http = urllib3.PoolManager()
url = 'http://webcode.me'
resp = http.request('HEAD', url)
print(resp.headers['Server'])
print(resp.headers['Date'])
print(resp.headers['Content-Type'])
print(resp.headers['Last-Modified'])
The response object contains the headers dictionary, which has the various header fields, such as
server and date.
$ ./head_request.py
nginx/1.6.2
Thu, 20 Feb 2020 14:35:14 GMT
text/html
Sat, 20 Jul 2019 11:49:25 GMT
From the output we can see that the web server of the website is nginx and the content type is
HTML (text/html).
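Header values arrive as plain strings. As a small sketch, the standard library's email.utils.parsedate_to_datetime can turn HTTP date headers into datetime objects. The header values below are copied from the sample output, and the dictionary stands in for resp.headers.

```python
from email.utils import parsedate_to_datetime

# Header values copied from the sample output above
headers = {
    'Date': 'Thu, 20 Feb 2020 14:35:14 GMT',
    'Last-Modified': 'Sat, 20 Jul 2019 11:49:25 GMT',
}

served_at = parsedate_to_datetime(headers['Date'])
modified_at = parsedate_to_datetime(headers['Last-Modified'])

# Both values are timezone-aware, so they can be subtracted directly
age = served_at - modified_at

print(served_at.isoformat())
print(age.days)
```

For these sample values the page was last modified 215 days before it was served.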
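$ pip install certifi
We install certifi, which provides a curated bundle of root CA certificates for verifying HTTPS connections.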
import certifi
print(certifi.where())
To locate the installed certificate authority (CA) bundle, we call certifi's where() function, which returns the path to the bundle file.
status2.py
#!/usr/bin/env python3
import urllib3
import certifi
url = 'https://httpbin.org/anything'
http = urllib3.PoolManager(ca_certs=certifi.where())
resp = http.request('GET', url)
print(resp.status)
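We pass the root CA bundle to the PoolManager so that the server certificate can be verified. With
urllib3 versions that do not verify certificates by default, such as the 1.24.1 shown earlier, a request
made without the CA bundle issues the following warning: InsecureRequestWarning: Unverified HTTPS
request is being made. Adding certificate verification is strongly advised.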
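As a rough illustration of what certificate verification involves, independent of urllib3, the standard library's ssl module can build a verifying client context. This sketch relies on the system's default CA certificates rather than the certifi bundle, which is an assumption made to keep it free of third-party dependencies.

```python
import ssl

# Build a client-side context with the secure defaults recommended by the
# standard library; system CA certificates are loaded when available.
context = ssl.create_default_context()

# The server certificate must be signed by a trusted CA ...
print(context.verify_mode == ssl.CERT_REQUIRED)
# ... and must match the host name we connect to.
print(context.check_hostname)
```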
Query parameters are the part of a URL that assigns values to named parameters, as in the following example:
http://example.com/api/users?name=John%20Doe&occupation=gardener
The query parameters are specified after the ? character. Multiple fields are separated with the &
character. Special characters, such as spaces, are percent-encoded; in the above string, the space is
encoded as %20.
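The encoding rules above can be tried out with the standard library's urllib.parse module. This is a small sketch that rebuilds the example URL and decodes it again; it is independent of urllib3, which performs the encoding itself when query parameters are passed to a request.

```python
from urllib.parse import parse_qs, quote, urlencode, urlsplit

params = {'name': 'John Doe', 'occupation': 'gardener'}

# quote_via=quote encodes the space as %20; the default (quote_plus) uses '+'
query = urlencode(params, quote_via=quote)
url = 'http://example.com/api/users?' + query
print(url)

# Decoding the query string gives the original values back
decoded = parse_qs(urlsplit(url).query)
print(decoded)
```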
query_params.py
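#!/usr/bin/env python3
import urllib3
import certifi
http = urllib3.PoolManager(ca_certs=certifi.where())
# for GET requests, the fields are appended to the URL as query parameters
payload = {'name': 'Peter', 'age': 23}
url = 'https://httpbin.org/get'
req = http.request('GET', url, fields=payload)
print(req.data.decode('utf-8'))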
In the example, we send a GET request with some query parameters to
https://httpbin.org/get. The endpoint returns some data back to the client, including the
query parameters. The site is used for testing HTTP requests.
$ ./query_params.py
{
"args": {
"age": "23",
"name": "Peter"
},
"headers": {
"Accept-Encoding": "identity",
"Host": "httpbin.org",
"X-Amzn-Trace-Id": "Root=1-5e4ea45f-c3c9c721c848f8f81a3129d8"
},
"origin": "188.167.251.9",
"url": "https://httpbin.org/get?name=Peter&age=23"
}
httpbin.org responded with a JSON string, which includes our payload as well.
Python urllib3 POST request
The HTTP POST method sends data to the server. It is often used when uploading a file or when
submitting a completed web form.
post_request.py
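#!/usr/bin/env python3
import urllib3
import certifi
http = urllib3.PoolManager(ca_certs=certifi.where())
url = 'https://httpbin.org/post'
# for POST requests, the fields are sent as form data in the request body
req = http.request('POST', url, fields={'name': 'John Doe'})
print(req.data.decode('utf-8'))
The example sends a POST request. The form data is specified with the fields option.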
$ ./post_request.py
{
"args": {},
"data": "",
"files": {},
"form": {
"name": "John Doe"
},
...
"url": "https://httpbin.org/post"
}
send_json.py
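#!/usr/bin/env python3
import urllib3
import certifi
import json
http = urllib3.PoolManager(ca_certs=certifi.where())
# sample payload (assumed here); serialize it to JSON and encode it to bytes
payload = {'name': 'John Doe'}
encoded_data = json.dumps(payload).encode('utf-8')
resp = http.request(
    'POST',
    'https://httpbin.org/post',
    body=encoded_data,
    headers={'Content-Type': 'application/json'})
data = json.loads(resp.data.decode('utf-8'))['json']
print(data)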
The example sends JSON data.
resp = http.request(
    'POST',
    'https://httpbin.org/post',
    body=encoded_data,
    headers={'Content-Type': 'application/json'})
We pass the encoded JSON in the request body and declare its type with the Content-Type header.
data = json.loads(resp.data.decode('utf-8'))['json']
print(data)
We decode the returned data back to text, parse it, and print the json field to the console.
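The encoding and decoding steps can be tried without a network connection. The following sketch shows the round trip from a Python dictionary to the bytes sent in a request body and back; the payload value is an assumption chosen for illustration.

```python
import json

payload = {'name': 'John Doe'}  # sample data, assumed for illustration

# Client side: Python dict -> JSON text -> UTF-8 bytes for the request body
encoded_data = json.dumps(payload).encode('utf-8')
print(encoded_data)

# Receiving side: bytes -> text -> Python dict
decoded = json.loads(encoded_data.decode('utf-8'))
print(decoded == payload)
```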
get_binary.py
#!/usr/bin/env python3
import urllib3
http = urllib3.PoolManager()
url = 'http://webcode.me/favicon.ico'
req = http.request('GET', url)
with open('favicon.ico', 'wb') as f:
    f.write(req.data)
req.data holds the raw bytes of the response body, which we can write directly to disk.
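Chunked transfer encoding is a streaming data transfer mechanism available in HTTP/1.1, in which the
message body is divided into a series of chunks. The chunks are sent out and received independently
of one another. Each chunk is preceded by its size in bytes, written as a hexadecimal number.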
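To illustrate the framing described above, here is a simplified sketch of a chunked encoder and decoder. The function names are hypothetical, and chunk extensions and trailers are deliberately ignored. In practice an HTTP client library handles this framing on our behalf.

```python
def encode_chunked(parts):
    """Encode an iterable of byte strings using HTTP/1.1 chunked framing."""
    out = b''
    for part in parts:
        if part:  # a zero-length chunk would end the body prematurely
            out += f'{len(part):x}'.encode('ascii') + b'\r\n' + part + b'\r\n'
    return out + b'0\r\n\r\n'  # the zero-length chunk terminates the body


def decode_chunked(data):
    """Decode a chunked body back into the original bytes (no trailers)."""
    body = b''
    pos = 0
    while True:
        line_end = data.index(b'\r\n', pos)
        size = int(data[pos:line_end], 16)  # sizes are hexadecimal
        if size == 0:
            return body
        start = line_end + 2
        body += data[start:start + size]
        pos = start + size + 2  # skip the CRLF that follows the chunk data


wire = encode_chunked([b'Hello, ', b'chunked ', b'world!'])
print(wire)
print(decode_chunked(wire))
```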
Setting preload_content to False means that urllib3 will stream the response content instead of
reading it into memory all at once. The stream() method iterates over chunks of the response
content. When streaming, we should call release_conn() to release the HTTP connection back to the
connection pool so that it can be reused.
streaming.py
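#!/usr/bin/env python3
import urllib3
import certifi
url = "https://docs.oracle.com/javase/specs/jls/se8/jls8.pdf"
local_filename = url.split('/')[-1]
http = urllib3.PoolManager(ca_certs=certifi.where())
# preload_content=False defers reading the body so that it can be streamed
resp = http.request(
    'GET',
    url,
    preload_content=False)
# write the response to disk in 1024-byte chunks
with open(local_filename, 'wb') as f:
    for chunk in resp.stream(1024):
        f.write(chunk)
resp.release_conn()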
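resp = http.request(
    'GET',
    url,
    preload_content=False)
We enable streaming by setting preload_content to False.
resp.release_conn()
Once we are done with the response, we release the connection back to the pool.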
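The chunk-by-chunk pattern used when streaming can be sketched without a network connection. In this sketch io.BytesIO objects stand in for the streamed response body and for the output file, and iter_chunks is a hypothetical helper, not part of urllib3.

```python
import io


def iter_chunks(fileobj, chunk_size=1024):
    """Yield successive blocks of at most chunk_size bytes (hypothetical helper)."""
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        yield chunk


# io.BytesIO stands in for a streamed response body and for the output file
source = io.BytesIO(b'x' * 2500)
target = io.BytesIO()

sizes = []
for chunk in iter_chunks(source, 1024):
    target.write(chunk)      # only one chunk is held in memory at a time
    sizes.append(len(chunk))

print(sizes)
print(target.getvalue() == b'x' * 2500)
```

The benefit of this pattern is that memory use depends on the chunk size rather than on the size of the whole download.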
redirect.py
#!/usr/bin/env python3
import urllib3
import certifi
http = urllib3.PoolManager(ca_certs=certifi.where())
url = 'https://httpbin.org/redirect-to?url=/'
resp = http.request('GET', url, redirect=True)
print(resp.status)
print(resp.geturl())
print(resp.info())
$ ./redirect.py
200
/
HTTPHeaderDict({'Date': 'Fri, 21 Feb 2020 12:49:29 GMT', 'Content-Type': 'text/html;
charset=utf-8', 'Content-Length': '9593', 'Connection': 'keep-alive',
'Server': 'gunicorn/19.9.0', 'Access-Control-Allow-Origin': '*',
'Access-Control-Allow-Credentials': 'true'})
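In the output above, the redirect target is printed as the relative path /. As a small sketch, such a relative target can be resolved against the originally requested URL with the standard library's urllib.parse.urljoin; the variable names are chosen for illustration.

```python
from urllib.parse import urljoin

request_url = 'https://httpbin.org/redirect-to?url=/'
location = '/'  # relative redirect target, as printed by the example

# Resolve the relative target against the URL that was requested
final_url = urljoin(request_url, location)
print(final_url)
```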
app.py
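#!/usr/bin/env python3
from flask import Flask, request
app = Flask(__name__)
@app.route('/headers')
def hello():
    # read two header fields from the incoming request
    ua = request.headers.get('user-agent')
    ka = request.headers.get('connection')
    return f'User agent: {ua}; Connection: {ka}'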
The application has one route. It sends the user agent and connection header fields of a request to
the client.
send_req.py
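#!/usr/bin/env python3
import urllib3
http = urllib3.PoolManager()
# build the Connection and User-Agent request headers
headers = urllib3.make_headers(keep_alive=True, user_agent='Python program')
url = 'http://localhost:5000/headers'
resp = http.request('GET', url, headers=headers)
print(resp.data.decode('utf-8'))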
$ export FLASK_APP=app.py
$ flask run
We run the Flask application.
$ ./send_req.py
User agent: Python program; Connection: keep-alive