
Data Collection

Meraryslan Meraliyev
October 9, 2024

1
Table of Contents

Introduction to Data Collection


Working with Files and Directories
APIs and Data Retrieval
Data Collection from Databases
Web Scraping
BeautifulSoup for Web Scraping
Web Scraping with Selenium
Real-time Data Collection
Automation and Scheduling
Conclusion and Best Practices

2
Introduction to Data Collection
Introduction to Data Collection

• Data is critical in modern software applications.
• Data collection paradigms:
  • Manual vs. automated data collection.
  • Online and offline data sources.
• Efficiently collect, store, and use data in applications.

3
Working with Files and Directories
Overview of File Operations in Python

• Python provides built-in functions for file input/output (I/O).
• Common file operations:
  • Reading from a file.
  • Writing to a file.
  • Appending data to a file.

4
Reading a File in Python

• Files are opened using the open() function.


• Example: Reading a file line by line.
with open('data.txt', 'r') as file:
    for line in file:
        print(line.strip())

5
Writing to a File

• Writing data to a file involves using 'w' (write) or 'a' (append) mode.
• Example: Writing a list of items to a file.
items = ["apple", "banana", "cherry"]

with open('fruits.txt', 'w') as file:
    for item in items:
        file.write(item + '\n')

6
Appending Data to a File

• Appending data adds content to the end of an existing file without overwriting.
• Example: Adding more items to an existing file.
new_items = ["orange", "grape"]

with open('fruits.txt', 'a') as file:
    for item in new_items:
        file.write(item + '\n')

7
Short Task: File Operations

• Task: Create a program that reads a list of names from a file, converts them to uppercase, and writes the result to a new file (a possible solution is sketched below).
• Use the provided names.txt for this task.
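A minimal sketch, assuming names.txt contains one name per line and that the output file name names_upper.txt is acceptable:

# Read names, convert them to uppercase, write them to a new file
with open('names.txt', 'r') as infile:
    names = [line.strip() for line in infile]

with open('names_upper.txt', 'w') as outfile:
    for name in names:
        outfile.write(name.upper() + '\n')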

8
Working with Directories in Python

• The os module allows you to interact with the file system.


• Example: Listing all files in a directory.
import os

directory = '/path/to/folder'
for filename in os.listdir(directory):
    print(filename)

9
Short Task: Directory Operations

• Task: Write a Python script that lists all files in a specified directory and counts how many text files (.txt) are present (see the sketch below).
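One possible sketch; the directory path is a placeholder to adjust:

# List all files and count how many end in .txt
import os

directory = '/path/to/folder'
txt_count = 0

for filename in os.listdir(directory):
    print(filename)
    if filename.endswith('.txt'):
        txt_count += 1

print(f"Text files found: {txt_count}")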

10
APIs and Data Retrieval
What are APIs?

• APIs (Application Programming Interfaces) enable programs to communicate with each other.
• APIs provide structured data from external services.
• They typically use HTTP and return data in formats such as
JSON or XML.

11
Making API Requests with Python

• The requests library is used to send HTTP requests to an API.
• Example: Fetching data from an open weather API.
import requests

url = 'https://api.openweathermap.org/data/2.5/weather?q=London&appid=your_api_key'
response = requests.get(url)

# Parse JSON data
data = response.json()
print(f"Temperature: {data['main']['temp']} K")

12
Short Task: Working with APIs

• Task: Use a public API to fetch cryptocurrency data (e.g., Bitcoin prices) and display the current price in USD (a sketch is shown below).
• Dataset: Use the CoinGecko API.
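A rough sketch using CoinGecko's public simple-price endpoint; check the CoinGecko documentation, since the endpoint and response format may change:

# Fetch the current Bitcoin price in USD from CoinGecko
import requests

url = 'https://api.coingecko.com/api/v3/simple/price'
params = {'ids': 'bitcoin', 'vs_currencies': 'usd'}

response = requests.get(url, params=params)
data = response.json()

print(f"Bitcoin price (USD): {data['bitcoin']['usd']}")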

13
Handling API Authentication

• Some APIs require authentication using API keys or tokens.


• Example: Adding an API key in the request headers.
headers = {
    'Authorization': 'Bearer your_access_token'
}

url = 'https://api.example.com/protected'
response = requests.get(url, headers=headers)

print(response.json())

14
Working with API Pagination

• Many APIs return large datasets in pages.
• You can navigate through pages using parameters like page and per_page.
• Example: Handling pagination.
url = 'https://api.example.com/data?page=1&per_page=50'

while url:
    response = requests.get(url)
    data = response.json()

    # Process the data
    print(data)

    # Move to the next page if provided
    url = data.get('next_page_url')

15
Task: Working with API Pagination

• Task: Use an API that supports pagination (e.g., the GitHub API) and retrieve all repositories for a given user across multiple pages (a sketch is shown below).
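A possible sketch against the GitHub REST API, paging until an empty page is returned; the username is only an example, and unauthenticated requests are rate-limited:

# Retrieve all public repositories for a user, page by page
import requests

username = 'octocat'  # example username
url = f'https://api.github.com/users/{username}/repos'
params = {'per_page': 100, 'page': 1}
repos = []

while True:
    response = requests.get(url, params=params)
    page_data = response.json()
    if not page_data:  # an empty page means there are no more results
        break
    repos.extend(page_data)
    params['page'] += 1

print(f"Total repositories: {len(repos)}")
for repo in repos:
    print(repo['name'])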

16
Data Collection from Databases
Introduction to SQL Databases

• SQL (Structured Query Language) is used to interact with relational databases.
• SQL databases store structured data using tables, rows, and columns.
• Common databases include MySQL, PostgreSQL, and SQLite.

17
Basic SQL Commands

• Common SQL commands include:
  • SELECT: Retrieve data from a table.
  • INSERT: Add new data to a table.
  • UPDATE: Modify existing data in a table.
  • DELETE: Remove data from a table.
• SELECT and INSERT appear on the next slides; UPDATE and DELETE are sketched below.
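A minimal sketch of UPDATE and DELETE run through Python's sqlite3 module (introduced on the next slide); the users table and its columns are assumptions for illustration:

# UPDATE and DELETE with sqlite3 (assumes a users table already exists)
import sqlite3

connection = sqlite3.connect('example.db')
cursor = connection.cursor()

# UPDATE: modify existing data
cursor.execute("UPDATE users SET age = ? WHERE name = ?", (31, "Alice"))

# DELETE: remove data
cursor.execute("DELETE FROM users WHERE name = ?", ("Alice",))

connection.commit()
connection.close()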

18
Querying a Database with SQLite

• SQLite is a lightweight SQL database that stores data in a file.


• Example: Inserting and retrieving data from a database.
import sqlite3

connection = sqlite3.connect('example.db')
cursor = connection.cursor()

# Create the table if it does not already exist
cursor.execute("CREATE TABLE IF NOT EXISTS users (name TEXT, age INTEGER)")

# Insert data into the database
cursor.execute("INSERT INTO users (name, age) VALUES (?, ?)", ("Alice", 30))
connection.commit()

# Query the data
cursor.execute("SELECT * FROM users")
rows = cursor.fetchall()

for row in rows:
    print(row)

19
Short Task: Database Operations

• Task: Create an SQLite database, insert data into a table, and query the data. Use SQL commands such as SELECT and INSERT (one possible sketch is shown below).
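A minimal sketch; the database file name and the students schema are assumptions:

# Create a table, insert rows, and query them back with sqlite3
import sqlite3

connection = sqlite3.connect('students.db')
cursor = connection.cursor()

# Create the table
cursor.execute("CREATE TABLE IF NOT EXISTS students (name TEXT, grade INTEGER)")

# INSERT: add some rows
cursor.executemany("INSERT INTO students (name, grade) VALUES (?, ?)",
                   [("Aisha", 90), ("Bekzat", 85)])
connection.commit()

# SELECT: read the rows back
cursor.execute("SELECT * FROM students")
for row in cursor.fetchall():
    print(row)

connection.close()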

20
Web Scraping
Introduction to Web Scraping

• Web scraping allows automatic extraction of data from websites.
• Scraping is often done for research, data analysis, or monitoring prices or trends.
• Ethical scraping (a small robots.txt check is sketched below):
  • Always check the website's robots.txt.
  • Avoid overwhelming the server with too many requests.
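A brief sketch of both points using the standard-library robotparser; example.com is a placeholder:

# Check robots.txt before scraping and pause between requests
import time
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url('https://example.com/robots.txt')
rp.read()

url = 'https://example.com/some-page'
if rp.can_fetch('*', url):
    # ... fetch and parse the page here ...
    time.sleep(1)  # be polite: wait between requests
else:
    print("Scraping this URL is disallowed by robots.txt")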

21
Introduction to BeautifulSoup

• BeautifulSoup is a Python library used to parse HTML and XML.
• It creates a parse tree from HTML, allowing easy navigation and data extraction.
• Works in combination with requests to fetch and parse static webpages.
• Ideal for scraping static content where the HTML is not dynamically generated.

22
Basic Scraping with BeautifulSoup

• Example: Scraping all links from a webpage.


import requests
from bs4 import BeautifulSoup

url = "https://example.com"
response = requests.get(url)

# Parse the HTML
soup = BeautifulSoup(response.content, 'html.parser')

# Find all links in the page
for link in soup.find_all('a'):
    print(link.get('href'))

23
Navigating HTML Structure with BeautifulSoup

• BeautifulSoup allows you to navigate through the HTML structure to extract specific data.
• Example: Extracting a webpage's title and the first paragraph.
# Get the title of the page
print(soup.title.string)

# Get the first paragraph text
first_paragraph = soup.find('p')
print(first_paragraph.text)

24
Searching for Elements by Attribute

• Elements in HTML often have attributes like id, class, etc.


• Example: Searching for a div with a specific class.
# Find a div with class 'content'
content_div = soup.find('div', class_='content')
print(content_div.text)

25
Extracting Data from Tables

• HTML tables are commonly used to display structured data.


• BeautifulSoup can easily extract table rows and cells.
• Example: Extracting all rows from a table.
# Find the table
table = soup.find('table')

# Extract all rows from the table
for row in table.find_all('tr'):
    cells = row.find_all('td')
    for cell in cells:
        print(cell.text)

26
Handling Forms with BeautifulSoup

• Forms are used to submit user data to a server.
• BeautifulSoup allows you to extract form inputs and simulate form submission using requests.
• Example: Extracting form fields and submitting data.
form = soup.find('form')

# Find all input fields
inputs = form.find_all('input')

# Print input field names
for input_field in inputs:
    print(input_field.get('name'))

# Submit the form data with requests (here we POST back to the same URL)
data = {
    'name': 'John',
    'email': 'john@example.com'
}
response = requests.post(url, data=data)
print(response.status_code)

27
Using Headers for Scraping

• Many websites check for a valid User-Agent header to prevent scraping.
• You can send custom headers with your request to avoid getting blocked.
• Example: Using custom headers with requests.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}

response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, 'html.parser')

28
Handling Pagination in Scraping

• Some websites display data over multiple pages.
• You need to extract the "Next" page URL and continue scraping.
• Example: Handling pagination in a loop.
while url:
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')

    # Scrape data from the current page
    # ...

    # Find the 'Next' page link
    next_page = soup.find('a', {'class': 'next'})

    # If there's a next page, update the URL; otherwise stop
    if next_page:
        url = next_page['href']
    else:
        url = None

29
Task: BeautifulSoup Practice

• Task: Write a script using BeautifulSoup that scrapes the titles and publication dates of all articles on a news website (a rough sketch is shown below).
• Dataset: Use any public news website or provided sample.
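A rough sketch only: the URL, the article/h2/time tags, and the page layout are hypothetical placeholders; inspect the real site and adjust the selectors.

# Scrape article titles and publication dates (selectors are placeholders)
import requests
from bs4 import BeautifulSoup

url = 'https://example-news-site.com'  # placeholder URL
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

for article in soup.find_all('article'):
    title = article.find('h2')
    date = article.find('time')
    if title:
        print(title.get_text(strip=True),
              '-', date.get_text(strip=True) if date else 'no date')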

30
Introduction to Selenium

• Selenium is a browser automation tool that can be used to scrape dynamic websites.
• It allows interaction with elements on the page (e.g., filling forms, clicking buttons).
• Selenium requires a WebDriver such as ChromeDriver or GeckoDriver.

31
Setting Up Selenium

• Download and install the correct version of ChromeDriver for your browser.
• Install Selenium using pip: pip install selenium.
• Basic setup of Selenium for scraping.
from selenium import webdriver

# Set up ChromeDriver
driver = webdriver.Chrome(executable_path='/path/to/chromedriver')

# Open a webpage
driver.get("https://example.com")

# Print page title
print(driver.title)

# Close the browser
driver.quit()

32
Interacting with Web Elements using Selenium

• Selenium allows you to interact with elements on a webpage like a real user.
• Example: Filling a form and clicking a button.
# Find input fields by name
username = driver.find_element_by_name("username")
password = driver.find_element_by_name("password")

# Fill out the form
username.send_keys("my_username")
password.send_keys("my_password")

# Click the login button
login_button = driver.find_element_by_id("login_button")
login_button.click()

33
Handling Dynamic Content with Selenium

• Some pages load content dynamically using JavaScript.


• You can wait for elements to load using WebDriverWait.
• Example: Waiting for a button to become clickable.
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait for the login button to be clickable
login_button = WebDriverWait(driver, 10).until(
    EC.element_to_be_clickable((By.ID, 'login_button'))
)
login_button.click()

34
Running Selenium in Headless Mode

• Headless mode allows you to run Selenium without opening a browser window.
• Useful for automated tasks on servers.
• Example: Running ChromeDriver in headless mode.
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_argument("--headless")

driver = webdriver.Chrome(executable_path='/path/to/chromedriver',
                          options=chrome_options)
driver.get("https://example.com")

print(driver.title)
driver.quit()
35
Task: Scraping with Selenium

• Task: Use Selenium to scrape a dynamic website where content is loaded via JavaScript (see the sketch below).
• Example: Scrape product prices from an e-commerce website after performing a search.
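A sketch under heavy assumptions: the site URL, the search box name 'q', and the 'product-price' class are hypothetical placeholders to replace with the real site's selectors.

# Search an e-commerce site and print product prices (selectors are placeholders)
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

driver = webdriver.Chrome(executable_path='/path/to/chromedriver')
driver.get('https://example-shop.com')  # placeholder URL

# Perform a search
search_box = driver.find_element_by_name('q')
search_box.send_keys('laptop')
search_box.send_keys(Keys.RETURN)

# Collect the prices rendered after the search
for price in driver.find_elements_by_class_name('product-price'):
    print(price.text)

driver.quit()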

36
Conclusion on Web Scraping

• Web scraping is a powerful tool for extracting data from websites.
• Use BeautifulSoup for static HTML content and Selenium for dynamic content.
• Always respect website terms of service and throttle your requests so you do not get blocked.

37
Real-time Data Collection
Introduction to WebSockets

• WebSockets provide a two-way communication channel between a client and a server.
• They allow real-time data streams (e.g., stock prices, chat applications).
• Python's websocket-client library is commonly used for WebSocket communication.

38
Receiving Real-time Data via WebSockets

• Example: Receiving real-time cryptocurrency prices from a WebSocket.
import websocket

def on_message(ws, message):
    print(message)

# Note: many feeds, including Coinbase's, also expect a subscribe message
# (sent from an on_open callback) before they start streaming data.
ws = websocket.WebSocketApp("wss://ws-feed.pro.coinbase.com",
                            on_message=on_message)
ws.run_forever()

39
Short Task: Real-time Data Collection

• Task: Connect to a WebSocket providing live stock data and print real-time stock prices.
• Dataset: Use a stock market WebSocket.

40
Automation and Scheduling
Automating Tasks with Python

• Automate tasks using the schedule library to run functions periodically.
• Example: Scheduling a task to run every 10 minutes.
import schedule
import time

def collect_data():
    print("Collecting data...")

schedule.every(10).minutes.do(collect_data)

while True:
    schedule.run_pending()
    time.sleep(1)

41
Short Task: Scheduling Tasks

• Task: Write a Python script that collects data from an API every hour and writes the results to a file using the schedule library (a sketch is shown below).
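A minimal sketch; the API URL and output file name are placeholders:

# Collect API data every hour and append it to a file
import schedule
import time
import requests

def collect_data():
    response = requests.get('https://api.example.com/data')  # placeholder URL
    with open('collected_data.txt', 'a') as file:
        file.write(response.text + '\n')

schedule.every(1).hours.do(collect_data)

while True:
    schedule.run_pending()
    time.sleep(1)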

42
Conclusion and Best Practices
Conclusion: Best Practices for Data Collection

• Respect the terms of service and privacy policies of data sources.
• Use caching to avoid redundant data collection.
• Store data in structured formats (CSV, JSON, databases) for easy retrieval and analysis.
• Optimize your code for performance, especially when working with large datasets or real-time data streams.

43
Next Steps in Data Collection

• Explore advanced APIs that require authentication and rate limiting.
• Set up automated data pipelines for continuous data collection and processing.
• Integrate your data with machine learning models to build predictive systems.

44
