
Data Collection

Meraryslan Meraliyev
October 9, 2024

1
Table of Contents

Introduction to Data Collection


Working with Files and Directories
APIs and Data Retrieval
Data Collection from Databases
Web Scraping
BeautifulSoup for Web Scraping
Web Scraping with Selenium
Real-time Data Collection
Automation and Scheduling
Conclusion and Best Practices

2
Introduction to Data Collection
Introduction to Data Collection

• Data is critical in modern software applications.
• Data collection paradigms:
  • Manual vs. automated data collection.
  • Online and offline data sources.
• Efficiently collect, store, and use data in applications.

3
Working with Files and Directories
Overview of File Operations in Python

• Python provides built-in functions for file input/output (I/O).
• Common file operations:
  • Reading from a file.
  • Writing to a file.
  • Appending data to a file.

4
Reading a File in Python

• Files are opened using the open() function.


• Example: Reading a file line by line.
with open('data.txt', 'r') as file:
    for line in file:
        print(line.strip())

5
Writing to a File

• Writing data to a file involves using 'w' (write) or 'a' (append) mode.
• Example: Writing a list of items to a file.
items = ["apple", "banana", "cherry"]

with open('fruits.txt', 'w') as file:
    for item in items:
        file.write(item + '\n')

6
Appending Data to a File

• Appending data adds content to the end of an existing file without overwriting.
• Example: Adding more items to an existing file.
new_items = ["orange", "grape"]

with open('fruits.txt', 'a') as file:
    for item in new_items:
        file.write(item + '\n')

7
Short Task: File Operations

• Task: Create a program that reads a list of names from a file, converts them to uppercase, and writes the result to a new file (a possible solution is sketched below).
• Use the provided names.txt for this task.
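A minimal sketch, assuming names.txt contains one name per line and that the output file name names_upper.txt is acceptable:

# Read names, convert them to uppercase, write them to a new file
with open('names.txt', 'r') as infile:
    names = [line.strip() for line in infile]

with open('names_upper.txt', 'w') as outfile:
    for name in names:
        outfile.write(name.upper() + '\n')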

8
Working with Directories in Python

• The os module allows you to interact with the file system.


• Example: Listing all files in a directory.
import os

directory = '/path/to/folder'
for filename in os.listdir(directory):
    print(filename)

9
Short Task: Directory Operations

• Task: Write a Python script that lists all files in a specified directory and counts how many text files (.txt) are present (see the sketch below).
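One possible sketch; the directory path is a placeholder to adjust:

# List all files and count how many end in .txt
import os

directory = '/path/to/folder'
txt_count = 0

for filename in os.listdir(directory):
    print(filename)
    if filename.endswith('.txt'):
        txt_count += 1

print(f"Text files found: {txt_count}")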

10
APIs and Data Retrieval
What are APIs?

• APIs (Application Programming Interfaces) enable programs to communicate with each other.
• APIs provide structured data from external services.
• They typically use HTTP and return data in formats such as
JSON or XML.

11
Making API Requests with Python

• The requests library is used to send HTTP requests to an API.
• Example: Fetching data from an open weather API.
import requests

url = 'https://api.openweathermap.org/data/2.5/weather?q=London&appid=your_api_key'
response = requests.get(url)

# Parse JSON data
data = response.json()
print(f"Temperature: {data['main']['temp']} K")

12
Short Task: Working with APIs

• Task: Use a public API to fetch cryptocurrency data (e.g., Bitcoin prices) and display the current price in USD (a sketch is shown below).
• Dataset: Use the CoinGecko API.
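A rough sketch using CoinGecko's public simple-price endpoint; check the CoinGecko documentation, since the endpoint and response format may change:

# Fetch the current Bitcoin price in USD from CoinGecko
import requests

url = 'https://api.coingecko.com/api/v3/simple/price'
params = {'ids': 'bitcoin', 'vs_currencies': 'usd'}

response = requests.get(url, params=params)
data = response.json()

print(f"Bitcoin price (USD): {data['bitcoin']['usd']}")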

13
Handling API Authentication

• Some APIs require authentication using API keys or tokens.


• Example: Adding an API key in the request headers.
headers = {
    'Authorization': 'Bearer your_access_token'
}

url = 'https://api.example.com/protected'
response = requests.get(url, headers=headers)

print(response.json())

14
Working with API Pagination

• Many APIs return large datasets in pages.
• You can navigate through pages using parameters like page and per_page.
• Example: Handling pagination.
url = 'https://api.example.com/data?page=1&per_page=50'

while url:
    response = requests.get(url)
    data = response.json()

    # Process the data
    print(data)

    # Move to the next page if provided
    url = data.get('next_page_url')

15
Task: Working with API Pagination

• Task: Use an API that supports pagination (e.g., the GitHub API) and retrieve all repositories for a given user across multiple pages (a sketch is shown below).
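A possible sketch against the GitHub REST API, paging until an empty page is returned; the username is only an example, and unauthenticated requests are rate-limited:

# Retrieve all public repositories for a user, page by page
import requests

username = 'octocat'  # example username
url = f'https://api.github.com/users/{username}/repos'
params = {'per_page': 100, 'page': 1}
repos = []

while True:
    response = requests.get(url, params=params)
    page_data = response.json()
    if not page_data:  # an empty page means there are no more results
        break
    repos.extend(page_data)
    params['page'] += 1

print(f"Total repositories: {len(repos)}")
for repo in repos:
    print(repo['name'])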

16
Data Collection from Databases
Introduction to SQL Databases

• SQL (Structured Query Language) is used to interact with relational databases.
• SQL databases store structured data using tables, rows, and columns.
• Common databases include MySQL, PostgreSQL, and SQLite.

17
Basic SQL Commands

• Common SQL commands include:
  • SELECT: Retrieve data from a table.
  • INSERT: Add new data to a table.
  • UPDATE: Modify existing data in a table.
  • DELETE: Remove data from a table.
• SELECT and INSERT appear on the next slides; UPDATE and DELETE are sketched below.
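A minimal sketch of UPDATE and DELETE run through Python's sqlite3 module (introduced on the next slide); the users table and its columns are assumptions for illustration:

# UPDATE and DELETE with sqlite3 (assumes a users table already exists)
import sqlite3

connection = sqlite3.connect('example.db')
cursor = connection.cursor()

# UPDATE: modify existing data
cursor.execute("UPDATE users SET age = ? WHERE name = ?", (31, "Alice"))

# DELETE: remove data
cursor.execute("DELETE FROM users WHERE name = ?", ("Alice",))

connection.commit()
connection.close()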

18
Querying a Database with SQLite

• SQLite is a lightweight SQL database that stores data in a file.


• Example: Inserting and retrieving data from a database.
import sqlite3

connection = sqlite3.connect('example.db')
cursor = connection.cursor()

# Create the table if it does not already exist
cursor.execute("CREATE TABLE IF NOT EXISTS users (name TEXT, age INTEGER)")

# Insert data into the database
cursor.execute("INSERT INTO users (name, age) VALUES (?, ?)", ("Alice", 30))
connection.commit()

# Query the data
cursor.execute("SELECT * FROM users")
rows = cursor.fetchall()

for row in rows:
    print(row)

19
Short Task: Database Operations

• Task: Create an SQLite database, insert data into a table, and query the data. Use SQL commands such as SELECT and INSERT (one possible sketch is shown below).
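A minimal sketch; the database file name and the students schema are assumptions:

# Create a table, insert rows, and query them back with sqlite3
import sqlite3

connection = sqlite3.connect('students.db')
cursor = connection.cursor()

# Create the table
cursor.execute("CREATE TABLE IF NOT EXISTS students (name TEXT, grade INTEGER)")

# INSERT: add some rows
cursor.executemany("INSERT INTO students (name, grade) VALUES (?, ?)",
                   [("Aisha", 90), ("Bekzat", 85)])
connection.commit()

# SELECT: read the rows back
cursor.execute("SELECT * FROM students")
for row in cursor.fetchall():
    print(row)

connection.close()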

20
Web Scraping
Introduction to Web Scraping

• Web scraping allows automatic extraction of data from websites.
• Scraping is often done for research, data analysis, or monitoring prices or trends.
• Ethical scraping (a small robots.txt check is sketched below):
  • Always check the website's robots.txt.
  • Avoid overwhelming the server with too many requests.
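A brief sketch of both points using the standard-library robotparser; example.com is a placeholder:

# Check robots.txt before scraping and pause between requests
import time
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url('https://example.com/robots.txt')
rp.read()

url = 'https://example.com/some-page'
if rp.can_fetch('*', url):
    # ... fetch and parse the page here ...
    time.sleep(1)  # be polite: wait between requests
else:
    print("Scraping this URL is disallowed by robots.txt")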

21
Introduction to BeautifulSoup

• BeautifulSoup is a Python library used to parse HTML and XML.
• It creates a parse tree from HTML, allowing easy navigation and data extraction.
• Works in combination with requests to fetch and parse static webpages.
• Ideal for scraping static content where the HTML is not dynamically generated.

22
Basic Scraping with BeautifulSoup

• Example: Scraping all links from a webpage.


import requests
from bs4 import BeautifulSoup

url = "https://example.com"
response = requests.get(url)

# Parse the HTML
soup = BeautifulSoup(response.content, 'html.parser')

# Find all links in the page
for link in soup.find_all('a'):
    print(link.get('href'))

23
Navigating HTML Structure with BeautifulSoup

• BeautifulSoup allows you to navigate through the HTML structure to extract specific data.
• Example: Extracting a webpage's title and the first paragraph.
# Get the title of the page
print(soup.title.string)

# Get the first paragraph text
first_paragraph = soup.find('p')
print(first_paragraph.text)

24
Searching for Elements by Attribute

• Elements in HTML often have attributes like id, class, etc.


• Example: Searching for a div with a specific class.
# Find a div with class 'content'
content_div = soup.find('div', class_='content')
print(content_div.text)

25
Extracting Data from Tables

• HTML tables are commonly used to display structured data.


• BeautifulSoup can easily extract table rows and cells.
• Example: Extracting all rows from a table.
# Find the table
table = soup.find('table')

# Extract all rows from the table
for row in table.find_all('tr'):
    cells = row.find_all('td')
    for cell in cells:
        print(cell.text)

26
Handling Forms with BeautifulSoup

• Forms are used to submit user data to a server.
• BeautifulSoup allows you to extract form inputs and simulate form submission using requests.
• Example: Extracting form fields and submitting data.
form = soup.find('form')

# Find all input fields
inputs = form.find_all('input')

# Print input field names
for input_field in inputs:
    print(input_field.get('name'))

# Submit the form data with requests (here we POST back to the same URL)
data = {
    'name': 'John',
    'email': 'john@example.com'
}
response = requests.post(url, data=data)
print(response.status_code)

27
Using Headers for Scraping

• Many websites check for a valid User-Agent header to prevent scraping.
• You can send custom headers with your request to avoid getting blocked.
• Example: Using custom headers with requests.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}

response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, 'html.parser')

28
Handling Pagination in Scraping

• Some websites display data over multiple pages.
• You need to extract the "Next" page URL and continue scraping.
• Example: Handling pagination in a loop.
while url:
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')

    # Scrape data from the current page
    # ...

    # Find the 'Next' page link
    next_page = soup.find('a', {'class': 'next'})

    # If there's a next page, update the URL; otherwise stop
    if next_page:
        url = next_page['href']
    else:
        url = None

29
Task: BeautifulSoup Practice

• Task: Write a script using BeautifulSoup that scrapes the titles and publication dates of all articles on a news website (a rough sketch is shown below).
• Dataset: Use any public news website or provided sample.
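A rough sketch only: the URL, the article/h2/time tags, and the page layout are hypothetical placeholders; inspect the real site and adjust the selectors.

# Scrape article titles and publication dates (selectors are placeholders)
import requests
from bs4 import BeautifulSoup

url = 'https://example-news-site.com'  # placeholder URL
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

for article in soup.find_all('article'):
    title = article.find('h2')
    date = article.find('time')
    if title:
        print(title.get_text(strip=True),
              '-', date.get_text(strip=True) if date else 'no date')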

30
Introduction to Selenium

• Selenium is a browser automation tool that can be used to scrape dynamic websites.
• It allows interaction with elements on the page (e.g., filling forms, clicking buttons).
• Selenium requires a WebDriver such as ChromeDriver or GeckoDriver.

31
Setting Up Selenium

• Download and install the correct version of ChromeDriver for your browser.
• Install Selenium using pip: pip install selenium.
• Basic setup of Selenium for scraping.
from selenium import webdriver

# Set up ChromeDriver
driver = webdriver.Chrome(executable_path='/path/to/chromedriver')

# Open a webpage
driver.get("https://example.com")

# Print page title
print(driver.title)

# Close the browser
driver.quit()

32
Interacting with Web Elements using Selenium

• Selenium allows you to interact with elements on a webpage like a real user.
• Example: Filling a form and clicking a button.
# Find input fields by name
username = driver.find_element_by_name("username")
password = driver.find_element_by_name("password")

# Fill out the form
username.send_keys("my_username")
password.send_keys("my_password")

# Click the login button
login_button = driver.find_element_by_id("login_button")
login_button.click()

33
Handling Dynamic Content with Selenium

• Some pages load content dynamically using JavaScript.


• You can wait for elements to load using WebDriverWait.
• Example: Waiting for a button to become clickable.
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait for the login button to be clickable
login_button = WebDriverWait(driver, 10).until(
    EC.element_to_be_clickable((By.ID, 'login_button'))
)
login_button.click()

34
Running Selenium in Headless Mode

• Headless mode allows you to run Selenium without opening a browser window.
• Useful for automated tasks on servers.
• Example: Running ChromeDriver in headless mode.
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_argument("--headless")

driver = webdriver.Chrome(executable_path='/path/to/chromedriver',
                          options=chrome_options)
driver.get("https://example.com")

print(driver.title)
driver.quit()
35
Task: Scraping with Selenium

• Task: Use Selenium to scrape a dynamic website where content is loaded via JavaScript (see the sketch below).
• Example: Scrape product prices from an e-commerce website after performing a search.
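A sketch under heavy assumptions: the site URL, the search box name 'q', and the 'product-price' class are hypothetical placeholders to replace with the real site's selectors.

# Search an e-commerce site and print product prices (selectors are placeholders)
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

driver = webdriver.Chrome(executable_path='/path/to/chromedriver')
driver.get('https://example-shop.com')  # placeholder URL

# Perform a search
search_box = driver.find_element_by_name('q')
search_box.send_keys('laptop')
search_box.send_keys(Keys.RETURN)

# Collect the prices rendered after the search
for price in driver.find_elements_by_class_name('product-price'):
    print(price.text)

driver.quit()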

36
Conclusion on Web Scraping

• Web scraping is a powerful tool for extracting data from websites.
• Use BeautifulSoup for static HTML content and Selenium for dynamic content.
• Always respect website terms of service and throttle your requests so you do not get blocked.

37
Real-time Data Collection
Introduction to WebSockets

• WebSockets provide a two-way communication channel between a client and a server.
• They allow real-time data streams (e.g., stock prices, chat applications).
• Python's websocket-client library is commonly used for WebSocket communication.

38
Receiving Real-time Data via WebSockets

• Example: Receiving real-time cryptocurrency prices from a WebSocket.
import websocket

def on_message(ws, message):
    print(message)

# Note: many feeds, including Coinbase's, also expect a subscribe message
# (sent from an on_open callback) before they start streaming data.
ws = websocket.WebSocketApp("wss://ws-feed.pro.coinbase.com",
                            on_message=on_message)
ws.run_forever()

39
Short Task: Real-time Data Collection

• Task: Connect to a WebSocket providing live stock data and print real-time stock prices.
• Dataset: Use a stock market WebSocket.

40
Automation and Scheduling
Automating Tasks with Python

• Automate tasks using the schedule library to run functions periodically.
• Example: Scheduling a task to run every 10 minutes.
import schedule
import time

def collect_data():
    print("Collecting data...")

schedule.every(10).minutes.do(collect_data)

while True:
    schedule.run_pending()
    time.sleep(1)

41
Short Task: Scheduling Tasks

• Task: Write a Python script that collects data from an API every hour and writes the results to a file using the schedule library (a sketch is shown below).
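A minimal sketch; the API URL and output file name are placeholders:

# Collect API data every hour and append it to a file
import schedule
import time
import requests

def collect_data():
    response = requests.get('https://api.example.com/data')  # placeholder URL
    with open('collected_data.txt', 'a') as file:
        file.write(response.text + '\n')

schedule.every(1).hours.do(collect_data)

while True:
    schedule.run_pending()
    time.sleep(1)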

42
Conclusion and Best Practices
Conclusion: Best Practices for Data Collection

• Respect the terms of service and privacy policies of data sources.
• Use caching to avoid redundant data collection.
• Store data in structured formats (CSV, JSON, databases) for easy retrieval and analysis.
• Optimize your code for performance, especially when working with large datasets or real-time data streams.

43
Next Steps in Data Collection

• Explore advanced APIs that require authentication and rate limiting.
• Set up automated data pipelines for continuous data collection and processing.
• Integrate your data with machine learning models to build predictive systems.

44
