Course Notes - Web Scraping and API Fundamentals in Python
On the web, servers and clients usually communicate through HTTP requests.
HTTP stands for ‘HyperText Transfer Protocol’ and specifies how requests and responses are formatted and transmitted. These requests
underpin most everyday web browsing: when you open a page, the browser sends a request to that page's server, and the server
responds with the relevant resources (HTML, images, etc.).
The two most popular request types are GET and POST.
GET
• Obtains data from the server
• Can be bookmarked
• Parameters are added directly into the URL
• Not used to send sensitive info (such as passwords)

POST
• Usually used when a state needs to be altered (such as adding items to your shopping cart) or when sending passwords
• Parameters are added in a separate body, thus it is more secure
• Cannot be bookmarked
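For example, with Python's requests library (using the free test service httpbin.org as a stand-in target), the difference looks like this:

import requests

# GET: parameters are appended to the URL as a query string
get_response = requests.get(
    "https://httpbin.org/get",
    params={"search": "laptops", "page": 1}  # becomes ?search=laptops&page=1
)
print(get_response.url)   # the parameters are visible in the URL

# POST: parameters travel in the body of the request, not in the URL
post_response = requests.post(
    "https://httpbin.org/post",
    data={"username": "john", "password": "secret"}
)
print(post_response.url)  # the URL stays clean; the data is in the body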
Request headers are pieces of information about the request itself, such as the encoding and language of the expected response,
the length and type of data provided, who is making the request, cookies, and so on. These headers are intended to make
communication on the web easier and more reliable, as they give the server a better idea of how to respond.
Two of the most common header fields are the User-Agent (identification string for the software making the request) and cookies (special type of
header that has a variety of uses).
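For example, a request that announces the expected language and identifies the software making it (the header values below are just illustrative):

import requests

headers = {
    # identification string for the software making the request
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    # language of the expected response
    "Accept-Language": "en-US,en;q=0.9",
}

response = requests.get("https://httpbin.org/headers", headers=headers)

print(response.request.headers)  # the headers that were actually sent
print(response.cookies)          # any cookies the server set in its response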
Response
The response contains 2 main pieces of information – the status code and the body of the response.
The status code indicates whether the request succeeded and, if not, what kind of error occurred. It is represented by a 3-digit number.
Codes in the following ranges indicate:
▪ 2xx – Success
▪ 3xx – Redirects
▪ 4xx – Client errors
▪ 5xx – Server errors

The two most frequently encountered status codes are:
▪ 200 OK – The request has succeeded
▪ 404 Not Found – The server cannot find the requested resource
The body of the response contains the requested information. Usually, it is either an HTML or a JSON file.
(Diagram: the client sends a Request to the server, and the server sends back a Response.)
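In Python's requests library, both pieces are exposed directly on the response object - a minimal sketch:

import requests

response = requests.get("https://httpbin.org/json")

print(response.status_code)      # e.g. 200

if response.status_code == 200:
    data = response.json()       # parse a JSON body into Python objects
    print(data)
else:
    print("Request failed with code:", response.status_code)

# for an HTML body, use the raw text instead:
# html = response.text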
JSON
JSON stands for ‘JavaScript Object Notation’ as it was derived from the JavaScript programming language. It is a standard for
data exchange on the web.
The JSON format relies on 3 key concepts: it should be easy for humans to read and write; easy for programs to process and
generate, regardless of the programming language; and written in plain text.
It achieves that by using conventions familiar to almost all programmers and by building upon 2 structures: dictionaries and lists.
Dictionaries
A dictionary is a data structure that contains key-value pairs, surrounded by curly brackets:
{
    "key 1": "value 1",
    "key 2": "value 2"
}
Lists
A list is an ordered sequence of values, surrounded by square brackets:
[
    "value 1",
    "value 2"
]
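In Python, the built-in json module maps JSON objects to dictionaries and JSON arrays to lists:

import json

raw = '{"name": "Maria", "grades": [6.0, 5.5], "active": true}'

parsed = json.loads(raw)             # JSON text -> Python objects
print(parsed["grades"][0])           # 6.0
print(type(parsed))                  # <class 'dict'>

text = json.dumps(parsed, indent=4)  # Python objects -> JSON text
print(text)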
APIs
An API (Application Programming Interface) specifies how software components should interact. You may think of it as a contract
between a client and a server: if the client makes a request in a specific format, the server will always respond in a documented
format or initiate a defined action. Web-based APIs usually provide information.
Some APIs are free; most are either paid or require registration. In the latter case, you are usually given a key and an ID that must
be incorporated into every request to that API.
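A sketch of such an authenticated request - note that the URL and the parameter names (api_key, app_id) are placeholders here, since every API documents its own:

import requests

API_KEY = "your-key-here"   # credentials obtained upon registration
APP_ID = "your-id-here"

# many APIs expect the key and ID as query parameters...
response = requests.get(
    "https://api.example.com/v1/data",              # placeholder URL
    params={"api_key": API_KEY, "app_id": APP_ID},  # placeholder names
)

# ...others expect them in a header instead:
# headers = {"Authorization": "Bearer " + API_KEY}

if response.status_code == 200:
    print(response.json())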
HTML
HTML is the underlying source code of every webpage, working alongside CSS and JavaScript.
It consists of nested elements (tags).
Beautiful Soup
Beautiful Soup is a Python library for extracting data from an HTML document.
It achieves that by analyzing the HTML with a parser.
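A minimal example with the built-in html.parser:

from bs4 import BeautifulSoup

html = """
<html>
  <body>
    <h1>Products</h1>
    <p class="price">19.99</p>
    <p class="price">34.50</p>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

print(soup.h1.text)                             # Products
for tag in soup.find_all("p", class_="price"):  # all matching tags
    print(tag.text)                             # 19.99, then 34.50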
Identification:
Some websites may require us to identify ourselves - the solution is to set the “User-Agent” request header to the
identification string of a common browser (as in the headers example above).
Cookies:
Other websites may require us to set cookies - solution: use the Session class of the requests library.
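The Session object stores the cookies from each response and sends them back automatically on subsequent requests:

import requests

session = requests.Session()

# the first request receives a cookie from the server
session.get("https://httpbin.org/cookies/set/sessionid/abc123")

# later requests through the same session send that cookie back
response = session.get("https://httpbin.org/cookies")
print(response.json())   # {'cookies': {'sessionid': 'abc123'}}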
Login:
Occasionally, the data we want to scrape may be locked behind a login. In that case, we need to simulate a login
attempt by inspecting the POST request the browser sends and replicating its parameters, as sketched below.
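A sketch of such a simulated login - the URL and the field names (username, password) are placeholders; the real ones are found by inspecting the form's POST request in the browser's developer tools:

import requests

session = requests.Session()

payload = {
    "username": "my_user",       # placeholder credentials; the field names
    "password": "my_password",   # must match those of the real form
}

login = session.post("https://example.com/login", data=payload)  # placeholder URL

if login.ok:
    # the session now carries the authentication cookies,
    # so pages behind the login can be requested normally
    protected = session.get("https://example.com/account")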
Excessive requests:
Another roadblock may occur when we send too many requests to a server in a short amount of time, which may get
us blocked. From an ethical standpoint as well, it is a good idea to limit our rate of requests. We can do that
simply with the sleep function of the time package -> time.sleep(2).
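For example, pausing two seconds between consecutive requests (the URLs below are placeholders):

import time
import requests

urls = [
    "https://example.com/page/1",
    "https://example.com/page/2",
    "https://example.com/page/3",
]

for url in urls:
    response = requests.get(url)
    print(url, response.status_code)
    time.sleep(2)   # wait 2 seconds before sending the next request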
Requests-html – dealing with JavaScript
The requests-html package was intended as a replacement for the requests + Beautiful Soup combo. However, its
strongest point is its full JavaScript support, meaning it can execute JavaScript. This allows us to scrape
dynamically generated content.
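A minimal sketch (the URL and the CSS selector are placeholders; note that the first call to render() downloads a headless Chromium browser):

from requests_html import HTMLSession

session = HTMLSession()
response = session.get("https://example.com")   # placeholder URL

# execute the page's JavaScript (downloads Chromium on first use)
response.html.render()

# after rendering, dynamically generated elements can be searched
for element in response.html.find(".product-title"):   # placeholder selector
    print(element.text)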