Pythonlearn-12-HTTP Python
Chapter 12
Terminology
• The World Wide Web (abbreviated WWW or the Web) is an
information space where documents and other web resources are
identified by Uniform Resource Locators (URLs), interlinked by
hypertext links, and accessible via the Internet.
• Domain names are formed by the rules and procedures of the Domain
Name System (DNS).
• In general, a domain name represents an Internet Protocol (IP)
resource, such as a personal computer used to access the Internet, a
server computer hosting a web site, or the web site itself or any other
service communicated via the Internet.
• The primary function of a web server is to store, process, and deliver web
pages to clients.
• Hypertext is text displayed on a computer display or other electronic
device with references (hyperlinks) to other text that the reader can
immediately access.
• HTTP is the underlying protocol used by the World Wide Web and this
protocol defines how messages are formatted and transmitted, and
what actions Web servers and browsers should take in response to
various commands.
• For example, when you enter a URL in your browser, this actually
sends an HTTP command to the Web server directing it to fetch and
transmit the requested Web page.
• The other main standard that controls how the World Wide Web works
is HTML, which covers how Web pages are formatted and displayed.
HTTP Status codes
• 400 Bad Request
• 401 Unauthorized
• The HTTP/1.0 specification defined the GET, POST and HEAD methods
• Any client can use any method and the server can be configured to support
any combination of methods.
• Two commonly used methods for a request-response between a client
and server are: GET and POST.
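To make the two methods concrete, here is a small, hedged sketch of what the raw request text might look like on the wire; the host name www.example.com, the /form path, and the name=Chuck form data are placeholder examples, not values from this chapter:

# A GET request asks the server to send back the resource at the given path
get_request = (
    b'GET /page1.htm HTTP/1.0\r\n'
    b'Host: www.example.com\r\n'
    b'\r\n'                      # a blank line ends the headers
)

# A POST request also carries a body (here, URL-encoded form data)
post_request = (
    b'POST /form HTTP/1.0\r\n'
    b'Host: www.example.com\r\n'
    b'Content-Type: application/x-www-form-urlencoded\r\n'
    b'Content-Length: 10\r\n'
    b'\r\n'
    b'name=Chuck'
)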
• The user enters the URL of the webpage in the address bar of the web browser.
• The browser contacts the HTTP web server and requests the file at the specified URL.
• The web server looks for that file. If it finds the file, it sends it back to the browser;
otherwise it sends back an error message indicating that you have requested a file
that does not exist.
• The web browser takes the response from the web server and displays either the
received file or the error message.
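As a rough Python illustration of these four steps, the hedged sketch below uses urllib (covered in detail later in this chapter); the romeo.txt URL is the one used elsewhere in this chapter, and the error branch shows what happens when the requested file does not exist:

import urllib.request, urllib.error

try:
    # Steps 1-2: contact the server and request the file at the URL
    fhand = urllib.request.urlopen('http://data.pr4e.org/romeo.txt')
    # Steps 3-4: the server found the file; display its contents
    print(fhand.read().decode())
except urllib.error.HTTPError as e:
    # The server sent back an error instead of the file
    print('Error:', e.code, e.reason)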
• In this chapter we will pretend to be a web browser and
retrieve web pages using the HyperText Transfer
Protocol (HTTP).
IP traffic
• TCP (Transmission Control Protocol) is a connection-oriented protocol.
• For the Transmission Control Protocol and the User Datagram Protocol,
a port number is a 16-bit integer that is put in the header appended to
a message unit.
Common TCP Ports
http://en.wikipedia.org/wiki/List_of_TCP_and_UDP_port_numbers
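As a small illustration, Python's standard socket module can look up the well-known port for a service name; this assumes the usual services database is present on your system:

import socket

# Look up well-known TCP ports for common application protocols
print(socket.getservbyname('http', 'tcp'))   # 80
print(socket.getservbyname('https', 'tcp'))  # 443
print(socket.getservbyname('smtp', 'tcp'))   # 25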
Application Protocols
http://www.dr-chuck.com/page1.htm
Robert Cailliau, CERN
Getting Data From The Server
• Each time the user clicks on an anchor tag with an href= value to
switch to a new page, the browser makes a connection to the web
server and issues a “GET” request - to GET the content of the page
at the specified URL
Browser (Click) --- Request ---> Web Server (port 80)
    GET http://www.dr-chuck.com/page2.htm

Web Server --- Response ---> Browser
    <h1>The Second Page</h1><p>If you like, you can
    switch back to the <a href="page1.htm">First Page</a>.</p>

Browser: Parse / Render
Making an HTTP request
• Connect to a server such as www.dr-chuck.com
• A socket is much like a file, except that a single socket provides a two-
way connection between two programs. You can both read from and
write to the same socket.
http://en.wikipedia.org/wiki/Internet_socket
• If you write something to a socket, it is sent to the application at the
other end of the socket.
• If you read from the socket, you are given the data which the other
application has sent.
• But if you try to read a socket when the program on the other end of the
socket has not sent any data, you just sit and wait.
• If the programs on both ends of the socket simply wait for some data
without sending anything, they will wait for a very long time.
• The server creates the listening socket while the client reaches out and connects to it.
• Sockets are the real backbone behind web browsing; in simpler terms,
there is a server and a client.
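A minimal sketch of that client/server relationship with Python's socket module is shown below; the port number 9000 and the one-line greeting are made-up values for illustration.

Server side:

import socket

# Create a listening socket and wait for one client to connect
serv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
serv.bind(('localhost', 9000))
serv.listen(1)
conn, addr = serv.accept()        # blocks until a client reaches out
conn.send(b'Hello from the server\n')
conn.close()
serv.close()

Client side:

import socket

# Reach out to the server and read what it sends back
cli = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
cli.connect(('localhost', 9000))
print(cli.recv(1024).decode())
cli.close()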
Socket Programming
• socket A network connection between two applications where the
applications can send and receive data in either direction.
As an example, web traffic usually uses port 80 while email traffic uses
port 25.
• Socket programming is started by importing the socket library and
making a simple socket.
import socket
• Sockets may be implemented over a number of different channel types:
Unix domain sockets, TCP, UDP, and so on.
The socket Module
• To create a socket, you use the socket.socket() function available in the
socket module, which has the general syntax socket.socket(family, type, proto).
• socket.AF_INET
• socket.AF_INET6
• These constants represent the address (and protocol) families, used for
the first argument to socket(). If the AF_UNIX constant is not defined
then this protocol is unsupported.
• When you create a socket, you have to specify its address family, and
then you can only use addresses of that type with the socket.
• The Linux kernel, for example, supports 29 other address families such
as UNIX (AF_UNIX) sockets and IPX (AF_IPX), and also
communications with IRDA and Bluetooth
(AF_IRDA and AF_BLUETOOTH)
• Sockets are characterized by their domain (address family), type, and transport
protocol. Common domains are AF_INET and AF_INET6, listed above; common socket types are:
• socket.SOCK_STREAM
• socket.SOCK_DGRAM
• socket.SOCK_RAW
• socket.SOCK_RDM
• socket.SOCK_SEQPACKET
• These constants represent the socket types, used for the second
argument to socket(). (Only SOCK_STREAM and SOCK_DGRAM
appear to be generally useful.)
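For instance, assuming you only need the two generally useful types, creating a TCP socket and a UDP socket looks like this:

import socket

tcp_sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)  # TCP (stream)
udp_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)   # UDP (datagram)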
import socket
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
• Once you have a socket object, you can use the required functions to
create your client or server program.
import socket

mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect(('data.pr4e.org', 80))   # ('host', port)
http://docs.python.org/library/socket.html
• This example shows how to make a low-level network connection with
sockets.
while True:
    data = mysock.recv(512)
    if len(data) < 1:
        break
    print(data.decode(), end=' ')
mysock.close()
• To request a document from a web server, we make a connection to the data.pr4e.org server
on port 80, and then send a line of the form

GET http://data.pr4e.org/romeo.txt HTTP/1.0

where the second parameter is the web page we are requesting, and then we also send a blank
line.
• Since our program is playing the role of the “web browser”, the HTTP protocol says we must
send the GET command followed by a blank line.
• Once we send that blank line, we write a loop that receives data in 512-byte chunks from
the socket and prints the data out until there is no more data to read (i.e., recv() returns an
empty byte string).
• The web server will respond with some header information about the document and a blank
line followed by the document content.
HTTP Header (example response):

HTTP/1.1 200 OK
Date: Sun, 14 Mar 2010 23:52:41 GMT
Server: Apache
Last-Modified: Tue, 29 Dec 2009 01:31:22 GMT
ETag: "143c1b33-a7-4b395bea"
Accept-Ranges: bytes
Content-Length: 167
Connection: close
Content-Type: text/plain

Receiving loop:

while True:
    data = mysock.recv(512)
    if len(data) < 1:
        break
    print(data.decode())
https://en.wikipedia.org/wiki/ASCII
http://www.catonmat.net/download/ascii-cheat-sheet.png
Representing Simple Strings
• Each character is represented by a number between 0 and 255 stored in
8 bits of memory.
• We refer to 8 bits of memory as a "byte" of memory (i.e., my disk drive
contains 3 Terabytes of memory).
• The ord() function tells us the numeric value of a simple ASCII character.
ASCII
>>> print(ord('H'))
72
>>> print(ord('e'))
101
>>> print(ord('\n'))
10
>>>
while True:
    data = mysock.recv(512)
    if len(data) < 1:
        break
    mystring = data.decode()
    print(mystring)
Program1:
An HTTP Request in Python
import socket

mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect(('data.pr4e.org', 80))
cmd = 'GET http://data.pr4e.org/romeo.txt HTTP/1.0\r\n\r\n'.encode()
mysock.send(cmd)

while True:
    data = mysock.recv(512)
    if len(data) < 1:
        break
    print(data.decode())
mysock.close()
https://docs.python.org/3/library/stdtypes.html#bytes.decode
https://docs.python.org/3/library/stdtypes.html#str.encode
Network -> Socket -> recv() -> bytes (UTF-8) -> decode() -> string (Unicode)
string (Unicode) -> encode() -> bytes (UTF-8) -> send() -> Socket -> Network
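A quick interactive illustration of that round trip (the string 'Hello world' is just an example):

>>> data = 'Hello world'.encode()   # str -> bytes (UTF-8)
>>> data
b'Hello world'
>>> data.decode()                   # bytes -> str (Unicode)
'Hello world'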
Program 2:
Retrieving an image over HTTP
• In the above example, we retrieved a plain text file which had newlines in the file
and we simply copied the data to the screen as the program ran.
• We can use a similar program to retrieve an image across the network using HTTP.
• Instead of copying the data to the screen as the program runs, we accumulate the
data in a string, trim off the headers, and then save the image data to a file as
follows:
import socket
import time

HOST = 'data.pr4e.org'
PORT = 80
mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect((HOST, PORT))
mysock.sendall(b'GET http://data.pr4e.org/cover3.jpg HTTP/1.0\r\n\r\n')
count = 0
picture = b""

while True:
    data = mysock.recv(5120)
    if len(data) < 1:
        break
    time.sleep(0.25)
    count = count + len(data)
    print(len(data), count)
    picture = picture + data

mysock.close()

# Look for the end of the header (2 CRLF)
pos = picture.find(b"\r\n\r\n")
print('Header length', pos)
print(picture[:pos].decode())

# Skip past the header and save the picture data to a file
picture = picture[pos+4:]
fhand = open('cover3.jpg', 'wb')
fhand.write(picture)
fhand.close()
• Since HTTP is so common, we have a library that does all the socket work for us
and makes web pages look like a file.
• Using urllib, you can treat a web page much like a file.
• You simply indicate which web page you would like to retrieve and urllib handles
all of the HTTP protocol and header details.
import urllib.request, urllib.parse, urllib.error
fhand = urllib.request.urlopen('http://data.pr4e.org/romeo.txt')
for line in fhand:
    print(line.decode().strip())
urllib1.py
Like a File...
import urllib.request, urllib.parse, urllib.error
fhand = urllib.request.urlopen('http://data.pr4e.org/romeo.txt')
counts = dict()
for line in fhand:
    words = line.decode().split()
    for word in words:
        counts[word] = counts.get(word, 0) + 1
print(counts)
urlwords.py
Reading Web Pages
import urllib.request, urllib.parse, urllib.error
fhand = urllib.request.urlopen('http://www.dr-chuck.com/page1.htm')
for line in fhand:
    print(line.decode().strip())
http://en.wikipedia.org/wiki/Web_scraping
http://en.wikipedia.org/wiki/Web_crawler
Why Scrape?
• spider: The act of a web search engine retrieving a page, then all the pages
linked from that page, and so on, until it has nearly all of the pages on the
Internet, which it uses to build its search index. (A toy sketch appears after
the regex link-extraction example below.)
Parsing HTML using regular
expressions
import urllib.request, urllib.parse, urllib.error
import re
url = input('Enter - ')
html = urllib.request.urlopen(url).read()
links = re.findall(b'href="(http://.*?)"', html)
for link in links:
    print(link.decode())
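Building on the regex link extraction above, here is the toy "spider" sketch mentioned earlier: fetch a page, collect its links, then fetch each linked page in turn. The starting URL and the single level of depth are arbitrary choices for illustration.

import urllib.request, urllib.parse, urllib.error
import re

def get_links(url):
    # Fetch a page and return the absolute http:// links found in it
    html = urllib.request.urlopen(url).read()
    return [link.decode() for link in re.findall(b'href="(http://.*?)"', html)]

start = 'http://www.dr-chuck.com/page1.htm'
seen = set([start])
for link in get_links(start):   # one level of "spidering"
    if link not in seen:
        seen.add(link)
        print('Retrieving', link)
        get_links(link)         # retrieve the linked page as well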
Reading binary files using
urllib
import urllib.request, urllib.parse, urllib.error

img = urllib.request.urlopen('http://data.pr4e.org/cover3.jpg').read()
fhand = open('cover3.jpg', 'wb')
fhand.write(img)
fhand.close()
import urllib.request, urllib.parse, urllib.error

img = urllib.request.urlopen('http://data.pr4e.org/cover3.jpg')
fhand = open('cover3.jpg', 'wb')
size = 0
while True:
    info = img.read(100000)
    if len(info) < 1:
        break
    size = size + len(info)
    fhand.write(info)

print(size, 'characters copied.')
fhand.close()
The Easy Way - Beautiful Soup
• You could do string searches the hard way
• Or use the free software library called BeautifulSoup from
www.crummy.com
https://www.crummy.com/software/BeautifulSoup/
BeautifulSoup Installation
# To run this, you can install BeautifulSoup
# https://pypi.python.org/pypi/beautifulsoup4
...
urllinks.py
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
python urllinks.py
Enter - http://www.dr-chuck.com/page1.htm
http://www.dr-chuck.com/page2.htm
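The urllinks.py code itself is elided above; as a hedged sketch, assuming BeautifulSoup 4 is installed (for example with pip install beautifulsoup4), a minimal link extractor along these lines might look like this:

import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup

url = input('Enter - ')
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')

# Retrieve all of the anchor tags and print their href attributes
tags = soup('a')
for tag in tags:
    print(tag.get('href', None))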