Course Notes - Web Scraping and API Fundamentals in Python
On the web, servers and clients usually communicate through HTTP requests.
HTTP stands for ‘HyperText Transfer Protocol’ and specifies how requests and responses are formatted and transmitted. These requests
underpin most everyday web browsing: when you open a page, the browser sends a request to that page's server, and the server
responds with the relevant resources (HTML, images, etc.).
The two most popular request types are GET and POST.
GET
• Obtains data from the server
• Can be bookmarked
• Parameters are added directly into the URL
• Not used to send sensitive info (such as passwords)

POST
• Usually used when a state needs to be altered (such as adding items to your shopping cart) or when sending passwords
• Parameters are added in a separate body, thus it is more secure
• Cannot be bookmarked
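For example, with Python's requests library (using the free test service httpbin.org as a stand-in target), the difference looks like this:

import requests

# GET: parameters are appended to the URL as a query string
get_response = requests.get(
    "https://httpbin.org/get",
    params={"search": "laptops", "page": 1}  # becomes ?search=laptops&page=1
)
print(get_response.url)   # the parameters are visible in the URL

# POST: parameters travel in the body of the request, not in the URL
post_response = requests.post(
    "https://httpbin.org/post",
    data={"username": "john", "password": "secret"}
)
print(post_response.url)  # the URL stays clean; the data is in the body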
Request headers are pieces of information about the request itself, such as the encoding and language of the expected response,
the length and type of data provided, who is making the request, cookies, and so on. These headers are intended to make
communication on the web easier and more reliable, as they give the server a better idea of how to respond.
Two of the most common header fields are the User-Agent (identification string for the software making the request) and cookies (special type of
header that has a variety of uses).
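For example, a request that announces the expected language and identifies the software making it (the header values below are just illustrative):

import requests

headers = {
    # identification string for the software making the request
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    # language of the expected response
    "Accept-Language": "en-US,en;q=0.9",
}

response = requests.get("https://httpbin.org/headers", headers=headers)

print(response.request.headers)  # the headers that were actually sent
print(response.cookies)          # any cookies the server set in its response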
Response
The response contains 2 main pieces of information – the status code and the body of the response.
The status code indicates whether the request succeeded and, if not, what kind of error occurred. It is represented by a 3-digit number.
Codes in the following ranges indicate:
▪ 2xx – Success
▪ 3xx – Redirects
▪ 4xx – Client errors
▪ 5xx – Server errors

The two most frequently encountered status codes are:
▪ 200 OK – The request has succeeded
▪ 404 Not Found – The server cannot find the requested resource
The body of the response contains the requested information. Usually, it is either an HTML or a JSON file.
(Diagram: the client sends a Request to the server, and the server sends back a Response.)
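In Python's requests library, both pieces are exposed directly on the response object - a minimal sketch:

import requests

response = requests.get("https://httpbin.org/json")

print(response.status_code)      # e.g. 200

if response.status_code == 200:
    data = response.json()       # parse a JSON body into Python objects
    print(data)
else:
    print("Request failed with code:", response.status_code)

# for an HTML body, use the raw text instead:
# html = response.text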
JSON
JSON stands for ‘JavaScript Object Notation’ as it was derived from the JavaScript programming language. It is a standard for
data exchange on the web.
The JSON format relies on 3 key concepts: it should be easy for humans to read and write; easy for programs to process and
generate, regardless of the programming language; and written in plain text.
It achieves that by using conventions familiar to almost all programmers and by building upon 2 structures: dictionaries and lists.
Dictionaries
A dictionary is a data structure that contains key-value pairs, surrounded by curly brackets:
{
    "key 1": "value 1",
    "key 2": "value 2"
}
Lists
A list is an ordered sequence of values, surrounded by square brackets:
[
    "value 1",
    "value 2"
]
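In Python, the built-in json module maps JSON objects to dictionaries and JSON arrays to lists:

import json

raw = '{"name": "Maria", "grades": [6.0, 5.5], "active": true}'

parsed = json.loads(raw)             # JSON text -> Python objects
print(parsed["grades"][0])           # 6.0
print(type(parsed))                  # <class 'dict'>

text = json.dumps(parsed, indent=4)  # Python objects -> JSON text
print(text)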
APIs
An API (Application Programming Interface) specifies how software components should interact. You may think of it as a contract
between a client and a server: if the client makes a request in a specific format, the server will always respond in a documented
format or initiate a defined action. Web-based APIs usually provide information.
Some APIs are free; most are either paid or require registration. In the latter case, you are usually given a key and an ID that must
be incorporated into every request to that API.
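A sketch of such an authenticated request - note that the URL and the parameter names (api_key, app_id) are placeholders here, since every API documents its own:

import requests

API_KEY = "your-key-here"   # credentials obtained upon registration
APP_ID = "your-id-here"

# many APIs expect the key and ID as query parameters...
response = requests.get(
    "https://api.example.com/v1/data",              # placeholder URL
    params={"api_key": API_KEY, "app_id": APP_ID},  # placeholder names
)

# ...others expect them in a header instead:
# headers = {"Authorization": "Bearer " + API_KEY}

if response.status_code == 200:
    print(response.json())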
HTML
HTML is the underlying source code of every webpage, working alongside CSS and JavaScript.
It consists of nested elements (tags).
Beautiful Soup
Beautiful Soup is a Python library for extracting data from an HTML document.
It achieves that by analyzing the HTML with a parser.
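A minimal example with the built-in html.parser:

from bs4 import BeautifulSoup

html = """
<html>
  <body>
    <h1>Products</h1>
    <p class="price">19.99</p>
    <p class="price">34.50</p>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

print(soup.h1.text)                             # Products
for tag in soup.find_all("p", class_="price"):  # all matching tags
    print(tag.text)                             # 19.99, then 34.50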
Identification:
Some websites may require us to identify ourselves - the solution is to set the “User-Agent” request header to the
identification string of a common browser (as in the headers example above).
Cookies:
Other websites may require us to set cookies - solution: use the Session class of the requests library.
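The Session object stores the cookies from each response and sends them back automatically on subsequent requests:

import requests

session = requests.Session()

# the first request receives a cookie from the server
session.get("https://httpbin.org/cookies/set/sessionid/abc123")

# later requests through the same session send that cookie back
response = session.get("https://httpbin.org/cookies")
print(response.json())   # {'cookies': {'sessionid': 'abc123'}}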
Login:
Occasionally, the data we want to scrape may be locked behind a login. In that case, we need to simulate a login
attempt by inspecting the POST request the browser sends and replicating its parameters, as sketched below.
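A sketch of such a simulated login - the URL and the field names (username, password) are placeholders; the real ones are found by inspecting the form's POST request in the browser's developer tools:

import requests

session = requests.Session()

payload = {
    "username": "my_user",       # placeholder credentials; the field names
    "password": "my_password",   # must match those of the real form
}

login = session.post("https://example.com/login", data=payload)  # placeholder URL

if login.ok:
    # the session now carries the authentication cookies,
    # so pages behind the login can be requested normally
    protected = session.get("https://example.com/account")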
Excessive requests:
Another roadblock may occur when we send too many requests to a server in a short amount of time, which may get
us blocked. From an ethical standpoint as well, it is a good idea to limit our rate of requests. We can do that
simply with the sleep function of the time package -> time.sleep(2).
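For example, pausing two seconds between consecutive requests (the URLs below are placeholders):

import time
import requests

urls = [
    "https://example.com/page/1",
    "https://example.com/page/2",
    "https://example.com/page/3",
]

for url in urls:
    response = requests.get(url)
    print(url, response.status_code)
    time.sleep(2)   # wait 2 seconds before sending the next request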
Requests-html – dealing with JavaScript
The requests-html package was intended as a replacement for the requests + Beautiful Soup combo. However, its
strongest point is its full JavaScript support, meaning it can execute JavaScript. This allows us to scrape
dynamically generated content.
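A minimal sketch (the URL and the CSS selector are placeholders; note that the first call to render() downloads a headless Chromium browser):

from requests_html import HTMLSession

session = HTMLSession()
response = session.get("https://example.com")   # placeholder URL

# execute the page's JavaScript (downloads Chromium on first use)
response.html.render()

# after rendering, dynamically generated elements can be searched
for element in response.html.find(".product-title"):   # placeholder selector
    print(element.text)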