Pythonlearn-12-HTTP Python

Networked Programs

Chapter 12
Terminologies
• The World Wide Web (abbreviated WWW or the Web) is an
information space where documents and other web resources are
identified by Uniform Resource Locators (URLs), interlinked by
hypertext links, and accessible via the Internet.

• A web browser (commonly referred to as a browser) is a software application for accessing information on the World Wide Web. Each individual web page, image, and video is identified by a distinct URL, enabling browsers to retrieve and display them on the user's device.
• Embedded hyperlinks permit users to navigate between web pages.

• Multiple web pages with a common theme, a common domain name, or both, make up a website.

• A domain name is an identification string that defines a realm of administrative autonomy, authority, or control within the Internet.

• Domain names are formed by the rules and procedures of the Domain
Name System (DNS).
• In general, a domain name represents an Internet Protocol (IP) resource, such as a personal computer used to access the Internet, a server computer hosting a web site, the web site itself, or any other service communicated via the Internet.

• Domain names serve to identify Internet resources, such as computers, networks, and services, with a text-based label that is easier to memorize than the numerical addresses used in the Internet protocols.
• Web server refers to server software, or hardware dedicated to running that software, that can serve content to the World Wide Web.

• A web server processes incoming network requests over the HTTP protocol (and several other related protocols).

• The primary function of a web server is to store, process, and deliver web pages to clients.
• Hypertext is text displayed on a computer display or other electronic device with references (hyperlinks) to other text that the reader can immediately access.

• Hypertext Markup Language (HTML) is the standard markup language for creating web pages and web applications.

• HTTP is the protocol to exchange or transfer hypertext.

• HTTP is the underlying protocol used by the World Wide Web and this
protocol defines how messages are formatted and transmitted, and
what actions Web servers and browsers should take in response to
various commands.
• For example, when you enter a URL in your browser, this actually
sends an HTTP command to the Web server directing it to fetch and
transmit the requested Web page.

• The other main standard that controls how the World Wide Web works
is HTML, which covers how Web pages are formatted and displayed.
HTTP Status codes
• 400 Bad Request

• 401 Unauthorized

• 403 Forbidden

• 404 Not Found

• 408 Request Timeout

• 500 Internal Server Error

• 501 Not Implemented

• 502 Bad Gateway
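One way to see the official reason phrases for these codes is Python's http.client module, which ships a mapping of status codes to their standard names:

```python
from http.client import responses

# Print the standard reason phrase for each status code above
for code in (400, 401, 403, 404, 408, 500, 501, 502):
    print(code, responses[code])
```

Running this prints, for example, `404 Not Found` and `502 Bad Gateway`.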


• Server-side scripting is a technique used in web development which involves employing scripts on a web server that produce a response customized for each user's (client's) request to the website.

• There are a number of server-side scripting languages available, including: ASP, ASP.NET, Google Apps Script, Haskell (*.hs), Java via JavaServer Pages (*.jsp), server-side JavaScript (*.ssjs, *.js) (example: Node.js), Parser (*.p), Perl via the CGI.pm module (*.cgi, *.ipl, *.pl), PHP, Python (*.py), R (*.rhtml) (example: rApache), Ruby (*.rb, *.rbw) (example: Ruby on Rails), SMX (*.smx), and Tcl (*.tcl).
HTTP Request methods
• HTTP defines methods to indicate the desired action to be performed on the identified resource.

• Often, the resource corresponds to a file or the output of an executable residing on the server.

• The HTTP/1.0 specification defined the GET, POST, and HEAD methods.

• The HTTP/1.1 specification added 5 new methods: OPTIONS, PUT, DELETE, TRACE, and CONNECT.

• Any client can use any method, and the server can be configured to support any combination of methods.
• Two commonly used methods for a request-response between a client
and server are: GET and POST.

• GET - Requests data from a specified resource

• POST - Submits data to be processed to a specified resource
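As a sketch of the difference using the standard urllib modules: with GET the parameters are encoded into the URL itself, while passing a body via data= makes urllib switch the request method to POST. (httpbin.org stands in here as a hypothetical test host, and the q parameter is made up.)

```python
import urllib.parse
import urllib.request

# GET: the parameters ride along in the URL itself
params = urllib.parse.urlencode({'q': 'romeo'})
get_url = 'http://httpbin.org/get?' + params

# POST: the same parameters travel in the request body instead;
# supplying data= makes urllib use the POST method
data = urllib.parse.urlencode({'q': 'romeo'}).encode()
post_req = urllib.request.Request('http://httpbin.org/post', data=data)

print(get_url)                 # http://httpbin.org/get?q=romeo
print(post_req.get_method())   # POST
```

Nothing is sent over the network here; the sketch only shows how the two request shapes are built.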


Web Browsing
Let's see what happens when we click a hyperlink to browse a particular web page or URL.

• The user enters the URL of the webpage in the address bar of the web browser.

• The browser contacts the HTTP web server and requests the file at the specified URL.

• The web server looks for the file. If it finds the file, it sends it back to the browser; otherwise it sends back an error message indicating that the requested file was not found.

• The web browser takes the response from the web server and displays either the received file or the error message.
• In this chapter we will pretend to be a web browser and
retrieve web pages using the HyperText Transfer
Protocol (HTTP).
IP traffic
• TCP is a connection-oriented protocol.

• UDP is a connectionless protocol.


TCP ports
• A port number is a way to identify a specific process to which an
Internet or other network message is to be forwarded when it arrives at
a server.

• For the Transmission Control Protocol and the User Datagram Protocol,
a port number is a 16-bit integer that is put in the header appended to
a message unit.
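Python's socket module can look these name-to-port mappings up through the system services database. A small sketch (it assumes the http and smtp entries exist in that database, as they do on most platforms):

```python
import socket

# Well-known service names map to 16-bit port numbers via the
# system services database
print(socket.getservbyname('http', 'tcp'))   # 80
print(socket.getservbyname('smtp', 'tcp'))   # 25

# A port number always fits in 16 bits
assert 0 <= socket.getservbyname('http', 'tcp') < 2**16
```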
Common TCP Ports

http://en.wikipedia.org/wiki/List_of_TCP_and_UDP_port_numbers
Application Protocols

http://www.dr-chuck.com/page1.htm
(protocol: http, host: www.dr-chuck.com, document: page1.htm)

[Photo: Robert Cailliau, CERN]
Getting Data From The Server
• Each time the user clicks on an anchor tag with an href= value to switch to a new page, the browser makes a connection to the web server and issues a "GET" request - to GET the content of the page at the specified URL.

• The server returns the HTML document to the browser, which formats and displays the document to the user.
[Diagram, built up across several slides: the user clicks a link in the Browser, which opens a connection to the Web Server on port 80 and sends the request

GET http://www.dr-chuck.com/page2.htm

The Web Server sends back the response

<h1>The Second Page</h1><p>If you like, you can switch back to the <a href="page1.htm">First Page</a>.</p>

and the Browser then parses and renders the page.]
Making an HTTP request
• Connect to the server, like www.dr-chuck.com

• Request a document (or the default document)

• GET http://www.dr-chuck.com/page1.htm HTTP/1.0

• GET http://www.mlive.com/ann-arbor/ HTTP/1.0

• GET http://www.facebook.com HTTP/1.0


Sockets
• The network protocol that powers the web is actually quite simple, and Python has built-in support called sockets which makes it very easy to make network connections and retrieve data over those sockets in a Python program.
Socket Programming in Python
• Socket programming is a way of connecting two nodes on a network to
communicate with each other.

• A socket is much like a file, except that a single socket provides a two-way connection between two programs. You can both read from and write to the same socket.

• Sockets are the endpoints of a bidirectional communications channel.

• Sockets may communicate within a process, between processes on the same machine, or between processes on different continents.
TCP Connections / Sockets
“In computer networking, an Internet socket or network socket is
an endpoint of a bidirectional inter-process communication flow
across an Internet Protocol-based computer network, such as the
Internet.”

[Diagram: Process <-> Internet <-> Process]

http://en.wikipedia.org/wiki/Internet_socket
• If you write something to a socket, it is sent to the application at the
other end of the socket.

• If you read from the socket, you are given the data which the other
application has sent.
• But if you try to read a socket when the program on the other end of the
socket has not sent any data, you just sit and wait.

• If the programs on both ends of the socket simply wait for some data
without sending anything, they will wait for a very long time.

• So an important part of programs that communicate over the Internet is to have some sort of protocol.

• A protocol is a set of precise rules that determine who is to go first, what they are to do, what the responses to that message are, who sends next, and so on.
• One socket (node) listens on a particular port at an IP address, while the other socket reaches out to it to form a connection.

• The server forms the listener socket, while the client reaches out to the server.

• They are the real backbones behind web browsing. In simpler terms, there is a server and a client.
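A minimal sketch of this listener/client pattern in one process, using a helper thread so both ends can run together (the port is chosen by the OS, and the echoed message is an arbitrary example):

```python
import socket
import threading

# Server side: create the listener socket and wait for one connection
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(('127.0.0.1', 0))      # port 0: let the OS pick a free port
srv.listen(1)
port = srv.getsockname()[1]

def handle_one_client():
    conn, addr = srv.accept()    # blocks until a client reaches out
    conn.sendall(conn.recv(64))  # echo the bytes straight back
    conn.close()

t = threading.Thread(target=handle_one_client)
t.start()

# Client side: reach out to the listener to form the connection
cli = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
cli.connect(('127.0.0.1', port))
cli.sendall(b'hello')
reply = cli.recv(64)
cli.close()
t.join()
srv.close()
print(reply.decode())   # hello
```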
Socket Programming
• socket A network connection between two applications where the applications can send and receive data in either direction.

• port A number that generally indicates which application you are contacting when you make a socket connection to a server. As an example, web traffic usually uses port 80 while email traffic uses port 25.
• Socket programming is started by importing the socket library and
making a simple socket.

import socket
• Sockets may be implemented over a number of different channel types:
Unix domain sockets, TCP, UDP, and so on.
The socket Module
• To create a socket, you must use the socket.socket() function available in the socket module, which has the general syntax

s = socket.socket(socket_family, socket_type, protocol=0)

• socket_family − This is either AF_UNIX or AF_INET.

• socket_type − This is either SOCK_STREAM or SOCK_DGRAM.

• protocol − This is usually left out, defaulting to 0.
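For example, the two generally useful combinations can be created like this (a minimal sketch; the sockets are closed again without being used):

```python
import socket

# An Internet stream (TCP) socket and an Internet datagram (UDP) socket
tcp_sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
udp_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

print(tcp_sock.family == socket.AF_INET)    # True
print(tcp_sock.type == socket.SOCK_STREAM)  # True
print(udp_sock.type == socket.SOCK_DGRAM)   # True

tcp_sock.close()
udp_sock.close()
```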


Socket Family
• socket.AF_UNIX

• socket.AF_INET

• socket.AF_INET6

• These constants represent the address (and protocol) families, used for
the first argument to socket(). If the AF_UNIX constant is not defined
then this protocol is unsupported.
• When you create a socket, you have to specify its address family, and
then you can only use addresses of that type with the socket.

• AF_INET is an address family that is used to designate the type of addresses that your socket can communicate with (in this case, Internet Protocol v4 addresses).

• The Linux kernel, for example, supports 29 other address families, such as UNIX sockets (AF_UNIX) and IPX (AF_IPX), and also communications with IrDA and Bluetooth (AF_IRDA and AF_BLUETOOTH).
• Sockets are characterized by their domain, type, and transport protocol. Common domains are:

• AF_UNIX: address format is UNIX pathname

• AF_INET: address format is host and port number


Socket Types
• socket.SOCK_STREAM

• socket.SOCK_DGRAM

• socket.SOCK_RAW

• socket.SOCK_RDM

• socket.SOCK_SEQPACKET

• These constants represent the socket types, used for the second
argument to socket(). (Only SOCK_STREAM and SOCK_DGRAM
appear to be generally useful.)
import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
• Once you have a socket object, you can use the required functions to create your client or server program.

• Two commonly used functions for sending data are socket.send and socket.sendall:

• socket.send is a low-level method. It can send fewer bytes than you requested, but returns the number of bytes sent.

• socket.sendall is a high-level Python-only method that sends the entire buffer you pass or throws an exception.
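The difference can be made concrete by rebuilding sendall's retry loop on top of the low-level send(). This sketch uses socket.socketpair() to get two already-connected sockets for local testing:

```python
import socket

def sendall_by_hand(sock, data):
    # What socket.sendall does conceptually: keep calling the low-level
    # send() until every byte has been handed to the kernel
    total = 0
    while total < len(data):
        sent = sock.send(data[total:])
        if sent == 0:
            raise RuntimeError('socket connection broken')
        total += sent
    return total

# socket.socketpair() returns two connected sockets in one process
a, b = socket.socketpair()
sendall_by_hand(a, b'hello world')
received = b.recv(64)
print(received)   # b'hello world'
a.close()
b.close()
```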
Let’s Write a Web Browser!
• Perhaps the easiest way to show how the HTTP protocol works is to
write a very simple Python program that makes a connection to a web
server and follows the rules of the HTTP protocol to request a
document and display what the server sends back.
Sockets in Python
Python has built-in support for TCP Sockets

import socket
mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect(('data.pr4e.org', 80))  # ('host', port)

http://docs.python.org/library/socket.html
• This example shows how to make a low-level network connection with
sockets.

• Sockets can be used to communicate with a web server, a mail server, or many other kinds of servers.
Program 1:
An HTTP Request in Python

import socket

mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect(('data.pr4e.org', 80))
cmd = 'GET http://data.pr4e.org/romeo.txt HTTP/1.0\r\n\r\n'.encode()
mysock.send(cmd)

while True:
    data = mysock.recv(512)
    if (len(data) < 1):
        break
    print(data.decode(), end=' ')
mysock.close()
• To request a document from a web server, we make a connection to the data.pr4e.org server on port 80, and then send a line of the form

GET http://data.pr4e.org/romeo.txt HTTP/1.0\r\n\r\n

where the second parameter is the web page we are requesting, and then we also send a blank line.

• Since our program is playing the role of the "web browser", the HTTP protocol says we must send the GET command followed by a blank line.

• Once we send that blank line, we write a loop that receives data in 512-character chunks from the socket and prints the data out until there is no more data to read (i.e., recv() returns an empty byte string).

• The web server will respond with some header information about the document and a blank line followed by the document content.
HTTP Header:

HTTP/1.1 200 OK
Date: Sun, 14 Mar 2010 23:52:41 GMT
Server: Apache
Last-Modified: Tue, 29 Dec 2009 01:31:22 GMT
ETag: "143c1b33-a7-4b395bea"
Accept-Ranges: bytes
Content-Length: 167
Connection: close
Content-Type: text/plain

HTTP Body:

But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fair sun and kill the envious moon
Who is already sick and pale with grief

(Received by the loop:)

while True:
    data = mysock.recv(512)
    if (len(data) < 1):
        break
    print(data.decode())
About Characters and Strings…
ASCII (American Standard Code for Information Interchange)

https://en.wikipedia.org/wiki/ASCII
http://www.catonmat.net/download/ascii-cheat-sheet.png

Representing Simple Strings
• Each character is represented by a number between 0 and 255, stored in 8 bits of memory.

• We refer to 8 bits of memory as a "byte" of memory (i.e., my disk drive contains 3 terabytes of memory).

• The ord() function tells us the numeric value of a simple ASCII character:

>>> print(ord('H'))
72
>>> print(ord('e'))
101
>>> print(ord('\n'))
10
>>>
ASCII
>>> print(ord('H'))
72
>>> print(ord('e'))
101
>>> print(ord('\n'))
10
>>>

In the 1960s and 1970s, we just assumed that one byte was one character.
http://unicode.org/charts/
Multi-Byte Characters
To represent the wide range of characters computers must handle, we represent characters with more than one byte.

• UTF-16 – Two bytes per code unit
• UTF-32 – Fixed length, four bytes
• UTF-8 – 1-4 bytes
  - Upwards compatible with ASCII
  - Automatic detection between ASCII and UTF-8
  - UTF-8 is the recommended practice for encoding data to be exchanged between systems

https://en.wikipedia.org/wiki/UTF-8
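The variable width of UTF-8 is easy to see by encoding sample characters from different scripts (the particular characters are arbitrary choices):

```python
# UTF-8 uses 1 byte for ASCII characters and more for other scripts
for ch in ('H', 'é', '이'):
    encoded = ch.encode('utf-8')
    print(repr(ch), len(encoded), encoded)

# ASCII characters keep their one-byte ASCII values, which is why
# UTF-8 is upwards compatible with ASCII
print('H'.encode('utf-8'))   # b'H'
```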
Two Kinds of Strings in Python

Python 2.7.10:
>>> x = '이광춘'
>>> type(x)
<type 'str'>
>>> x = u'이광춘'
>>> type(x)
<type 'unicode'>

Python 3.5.1:
>>> x = '이광춘'
>>> type(x)
<class 'str'>
>>> x = u'이광춘'
>>> type(x)
<class 'str'>

In Python 3, all strings are Unicode.

Python 2 versus Python 3

Python 2.7.10:
>>> x = b'abc'
>>> type(x)
<type 'str'>
>>> x = '이광춘'
>>> type(x)
<type 'str'>
>>> x = u'이광춘'
>>> type(x)
<type 'unicode'>

Python 3.5.1:
>>> x = b'abc'
>>> type(x)
<class 'bytes'>
>>> x = '이광춘'
>>> type(x)
<class 'str'>
>>> x = u'이광춘'
>>> type(x)
<class 'str'>
Python 3 and Unicode
• In Python 3, all strings internally are Unicode.

• Working with string variables in Python programs and reading data from files usually "just works".

• When we talk to a network resource using sockets or talk to a database, we have to encode and decode data (usually to UTF-8).

Python 3.5.1:
>>> x = b'abc'
>>> type(x)
<class 'bytes'>
>>> x = '이광춘'
>>> type(x)
<class 'str'>
>>> x = u'이광춘'
>>> type(x)
<class 'str'>
Python Strings to Bytes
• When we talk to an external resource like a network socket we send bytes, so we need to encode Python 3 strings into a given character encoding.

• When we read data from an external resource, we must decode it based on the character set so it is properly represented in Python 3 as a string.

while True:
    data = mysock.recv(512)
    if (len(data) < 1):
        break
    mystring = data.decode()
    print(mystring)
Program 1:
An HTTP Request in Python

import socket

mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect(('data.pr4e.org', 80))
cmd = 'GET http://data.pr4e.org/romeo.txt HTTP/1.0\n\n'.encode()
mysock.send(cmd)

while True:
    data = mysock.recv(512)
    if (len(data) < 1):
        break
    print(data.decode())
mysock.close()
https://docs.python.org/3/library/stdtypes.html#bytes.decode
https://docs.python.org/3/library/stdtypes.html#str.encode
[Diagram: a Unicode String is turned into UTF-8 Bytes with encode() before send() puts it on the Socket/Network; Bytes returned by recv() are turned back into a String with decode().]

Program 2:
Retrieving an image over HTTP
• In the above example, we retrieved a plain text file which had newlines in the file, and we simply copied the data to the screen as the program ran.
• We can use a similar program to retrieve an image across the network using HTTP.
• Instead of copying the data to the screen as the program runs, we accumulate the data in a string, trim off the headers, and then save the image data to a file as follows:

import socket
import time

HOST = 'data.pr4e.org'
PORT = 80
mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect((HOST, PORT))
mysock.sendall(b'GET http://data.pr4e.org/cover3.jpg HTTP/1.0\r\n\r\n')
count = 0
picture = b""
while True:
    data = mysock.recv(5120)
    if (len(data) < 1):
        break
    time.sleep(0.25)
    count = count + len(data)
    print(len(data), count)
    picture = picture + data
mysock.close()

# Look for the end of the header (2 CRLF)
pos = picture.find(b"\r\n\r\n")
print('Header length', pos)
print(picture[:pos].decode())

# Skip past the header and save the picture data
picture = picture[pos+4:]
fhand = open("new.jpg", "wb")
fhand.write(picture)
fhand.close()
Making HTTP Easier With urllib
Program 3: retrieving web pages
• While we can manually send and receive data over HTTP using the socket
library, there is a much simpler way to perform this common task in Python by
using the urllib library.

• Since HTTP is so common, we have a library that does all the socket work for us
and makes web pages look like a file.

• Using urllib, you can treat a web page much like a file.

• You simply indicate which web page you would like to retrieve and urllib handles
all of the HTTP protocol and header details.
import urllib.request, urllib.parse, urllib.error

fhand = urllib.request.urlopen('http://data.pr4e.org/romeo.txt')
for line in fhand:
    print(line.decode().strip())

urllib1.py
import urllib.request, urllib.parse, urllib.error

fhand = urllib.request.urlopen('http://data.pr4e.org/romeo.txt')
for line in fhand:
    print(line.decode().strip())

But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fair sun and kill the envious moon
Who is already sick and pale with grief

urllib1.py
Like a File...
import urllib.request, urllib.parse, urllib.error

fhand = urllib.request.urlopen('http://data.pr4e.org/romeo.txt')

counts = dict()
for line in fhand:
    words = line.decode().split()
    for word in words:
        counts[word] = counts.get(word, 0) + 1
print(counts)

urlwords.py
Reading Web Pages

import urllib.request, urllib.parse, urllib.error

fhand = urllib.request.urlopen('http://www.dr-chuck.com/page1.htm')
for line in fhand:
    print(line.decode().strip())

<h1>The First Page</h1>
<p>If you like, you can switch to the <a href="http://www.dr-chuck.com/page2.htm">Second Page</a>.
</p>

urllib2.py
Following Links

import urllib.request, urllib.parse, urllib.error

fhand = urllib.request.urlopen('http://www.dr-chuck.com/page1.htm')
for line in fhand:
    print(line.decode().strip())

<h1>The First Page</h1>
<p>If you like, you can switch to the <a href="http://www.dr-chuck.com/page2.htm">Second Page</a>.
</p>

urllib2.py
Parsing HTML
(a.k.a. Web Scraping)
What is Web Scraping?
• When a program or script pretends to be a browser and retrieves web pages, looks at those web pages, extracts information, and then looks at more web pages.

• Search engines scrape web pages - we call this "spidering the web" or "web crawling".

http://en.wikipedia.org/wiki/Web_scraping
http://en.wikipedia.org/wiki/Web_crawler
Why Scrape?

• Pull data - particularly social data - who links to whom?

• Monitor a site for new information

• Spider the web to make a database for a search engine


• scrape When a program pretends to be a web browser and retrieves a web page, then looks at the web page content. Often programs are following the links in one page to find the next page so they can traverse a network of pages or a social network.

• spider The act of a web search engine retrieving a page and then all the pages linked from that page, and so on, until they have nearly all of the pages on the Internet, which they use to build their search index.
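The scrape-then-follow-links loop described above can be sketched with the standard library alone. The fetch function is injected so the sketch runs against a tiny in-memory "web" rather than the real network; all URLs and page contents here are made up:

```python
import re
from collections import deque

def extract_links(html):
    # Pull absolute http:// links out of a page with a regular expression
    return re.findall(r'href="(http://.*?)"', html)

def crawl(start_url, fetch, max_pages=10):
    # Breadth-first spidering: visit a page, queue every link found on it
    seen, queue = set(), deque([start_url])
    while queue and len(seen) < max_pages:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        for link in extract_links(fetch(url)):
            queue.append(link)
    return seen

# Tiny in-memory web so the sketch runs without a network connection
pages = {
    'http://a': '<a href="http://b">B</a>',
    'http://b': '<a href="http://a">A</a>',
}
visited = crawl('http://a', lambda u: pages[u])
print(sorted(visited))   # ['http://a', 'http://b']
```

In a real spider, fetch would call urllib.request.urlopen(url).read().decode(), and the max_pages cap keeps the traversal from running forever.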
Parsing HTML using regular expressions

import urllib.request, urllib.parse, urllib.error
import re

url = input('Enter - ')
html = urllib.request.urlopen(url).read()
links = re.findall(b'href="(http://.*?)"', html)
for link in links:
    print(link.decode())
Reading binary files using urllib

import urllib.request, urllib.parse, urllib.error

img = urllib.request.urlopen('http://data.pr4e.org/cover3.jpg').read()
fhand = open('cover3.jpg', 'wb')
fhand.write(img)
fhand.close()

import urllib.request, urllib.parse, urllib.error

img = urllib.request.urlopen('http://data.pr4e.org/cover3.jpg')
fhand = open('cover3.jpg', 'wb')
size = 0
while True:
    info = img.read(100000)
    if len(info) < 1:
        break
    size = size + len(info)
    fhand.write(info)
print(size, 'characters copied.')
fhand.close()
The Easy Way - Beautiful Soup
• You could do string searches the hard way
• Or use the free software library called BeautifulSoup from
www.crummy.com

https://www.crummy.com/software/BeautifulSoup/
BeautifulSoup Installation
# To run this, you can install BeautifulSoup
# https://pypi.python.org/pypi/beautifulsoup4
# Or download the file
# http://www.py4e.com/code3/bs4.zip
# and unzip it in the same directory as this file

import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup

...

urllinks.py
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup

url = input('Enter - ')
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')

# Retrieve all of the anchor tags
tags = soup('a')
for tag in tags:
    print(tag.get('href', None))

python urllinks.py
Enter - http://www.dr-chuck.com/page1.htm
http://www.dr-chuck.com/page2.htm
