Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
0% found this document useful (0 votes)
24 views

Lecture 5 - SQL Part III

This document discusses how to access SQL databases from Python programs. It covers using database drivers to connect to SQL databases from Python and execute queries. Specific topics covered include using the MySQL connector to connect to MySQL from Python, executing queries using a cursor object, handling results, and parameterizing queries to prevent SQL injection attacks. The document also discusses loading query results into Pandas DataFrames.

Uploaded by

aadarsh gaikwad
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
24 views

Lecture 5 - SQL Part III

This document discusses how to access SQL databases from Python programs. It covers using database drivers to connect to SQL databases from Python and execute queries. Specific topics covered include using the MySQL connector to connect to MySQL from Python, executing queries using a cursor object, handling results, and parameterizing queries to prevent SQL injection attacks. The document also discusses loading query results into Pandas DataFrames.

Uploaded by

aadarsh gaikwad
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 60

DS 5110 – Lecture 5

SQL Part III

Roi Yehoshua
Agenda
 Accessing SQL from Python
 Data warehouses and lakes
 OLAP
 Window functions

2 Roi Yehoshua, 2024


Accessing SQL from a Programming Languages
 To access SQL from a programming language you need a database driver
 A driver converts application commands into a format that the DBMS can understand
 Typical commands include
 Connection commands (e.g., open or close a connection to the database)
 SQL commands
 Handling cursors, which allow you to traverse through result sets
 Transaction management commands
 Types of drivers
 Native drivers – provided by specific database vendors
 ODBC drivers – provide a standard interface to communicate with databases
 JDBC drivers – specific to Java-based applications

3 Roi Yehoshua, 2024


ODBC
 Open Database Connectivity (ODBC) is a standard API for accessing DBMS
 Independent of any specific DBMS or operating system
 ODBC drivers exist for most of the commercial databases

4 Roi Yehoshua, 2024


MySQL Connector/Python
 Python Database API (DB-API) defines a standard interface for accessing relational
databases from Python programs
 Different packages in Python implement this interface for different databases
 e.g., mysql.connector for MySQL, sqlite3 for SQLite, pyodbc for ODBC
 mysql.connector is the recommended driver for interacting with MySQL
 Developed by the MySQL group at Oracle
 Install it via pip
pip install mysql-connector-python

 If you’ve installed Python with the Anaconda distribution, you should already have it

5 Roi Yehoshua, 2024


MySQL Connector/Python
 The general workflow of a Python program that interacts with a MySQL database:
 Connect to the MySQL server
 Execute a SQL query
 Fetch the results
 Inform the database if any table changes were made (by committing the changes)
 Close the connection to the server

6 Roi Yehoshua, 2024


Establishing a Connection with MySQL Server
 To connect to a MySQL server, call the connect() function in mysql.connector module
 This function gets 4 parameters: host, user, password and database name
 It returns a MySQLConnection object

 You should always close the connection in the end by calling conn.close()
 You should never hard-code your login credentials directly in a Python script

7 Roi Yehoshua, 2024


The Cursor Object
 In order to execute SQL queries in Python, you need a cursor object
 A cursor object allows you to traverse over database records
 To create a cursor, use the cursor() method of your connection object
 Then you execute a SQL query by calling cursor.execute(query)
 If the query returns rows, you can retrieve them using one of cursor’s fetch methods:
 fetchall() - retrieves all the rows from the result as a list of tuples
 fetchone() - retrieves the next row of the result as a tuple
 Returns None if no more rows are available
 fetchmany(n) - retrieves the next n rows from the result as a list of tuples (n defaults to 1)
 Returns an empty list if no more rows are available

8 Roi Yehoshua, 2024


Reading Records from a Table
 The following example selects all the records from the instructor table:

• The result variable holds the


records returned from.fetchall()
• It’s a list of tuples representing
individual records from the table

9 Roi Yehoshua, 2024


Reading Records from a Table
 To get specific attributes in each row, specify their indexes in the returned tuples
 For example, to print only the instructor names:

10 Roi Yehoshua, 2024


Cursor as an Iterator
 To process the rows in the result one at a time, you can use the cursor as an iterator:

11 Roi Yehoshua, 2024


More Complex Queries
 You can make your select queries as complex as you want using the same methods
 Example: Find the courses that were taken by the highest number of students

12 Roi Yehoshua, 2024


Cursor Properties
 The cursor has a few useful properties that provide information about the result set
 column_names – returns the list containing the column names of the result set
 rowcount – the number of rows in the result set

13 Roi Yehoshua, 2024


SQL Injection Attack
 A common attack that enables execution of malicious SQL statements in the DB
 By insertion (“injection”) of a SQL query via the input data from the client
 Suppose you write a script that checks if a given username exists in the users table
 You construct the following SQL query:
username = # read from the input
query = "SELECT * FROM users WHERE username = '" + username + "'"

 If the user, instead of entering their username, enters:

 Then the query becomes:


query = "SELECT * FROM users WHERE username = '' or 1 = 1 --'"

 The where clause is now always true and the entire users table is returned
14 Roi Yehoshua, 2024
SQL Injection Attack
 Most databases allow execution of multiple SQL statements separated by semicolon
 This allows the hacker to inject whole SQL statements into the query
 For example, the hacker could enter the following string:

 The resulting query would be:


query = "SELECT * FROM users WHERE username = ''; drop table users; --"

 This query would result in deleting the entire users table!

 Solution: use parameterized queries whenever user input is involved in the query

15 Roi Yehoshua, 2024


SQL Injection Attack
 Example:

16 Roi Yehoshua, 2024


Parameterized Query
 A parameterized query is a query that uses placeholders (%) for attribute values
 Strings passed to the placeholders are correctly escaped by the library at runtime
 e.g., each quotation mark is doubled
 You can pass a parameterized query to cursor.execute() as follows:

17 Roi Yehoshua, 2024


Parameterized Query
 If a user tries to sneak in some problematic characters, the resulting statement will
cause no harm since each quotation mark will be doubled:
select * from instructor where name = 'X'' or ''Y'' = ''Y'
 This query will return an empty set

18 Roi Yehoshua, 2024


Inserting New Records
 To insert data, pass the insert into command to the cursor’s execute() method
 For example, to add a new instructor:

 You must call conn.commit() at the end, otherwise your changes will be lost!
 Unless you turn on automatic commits by setting conn.autocommit = True

19 Roi Yehoshua, 2024


Inserting New Records
 Verifying that the new record was added to the table:

20 Roi Yehoshua, 2024


Inserting New Records
 If the values to be inserted come from an external source (e.g., the user), again you
should use parameters in the SQL statement:

21 Roi Yehoshua, 2024


Inserting a Bulk of Records
 You can insert multiple records using the executemany() method
 It accepts two parameters:
 A query that contains placeholders for the records that need to be inserted
 A list that contains all records that you wish to insert

22 Roi Yehoshua, 2024


Update and Delete
 Updating and deleting work the same way, just pass the SQL to cursor.execute()
 You can use the cursor rowcount attribute to check how many records were affected:

23 Roi Yehoshua, 2024


Calling Stored Procedures from Python
 To call a stored procedure use the callproc() method of the Cursor object
cursor.callproc(procedure_name, args=())
 Then, you can call the stored_results() method to get an iterator with the result set
 The rows in the result can be read by calling the fetchall() method
 Example:

24 Roi Yehoshua, 2024


Loading Data into a DataFrame
 Pandas provides a read_sql() method that reads an SQL query into a DataFrame

 For example, loading the instructor table into a DataFrame:

25 Roi Yehoshua, 2024


Loading Data into a DataFrame
 You can use the params argument to pass a list of parameters to the query
 For example, the following displays all the students that took the course CS-319

26 Roi Yehoshua, 2024


Data Warehouse
 A specialized type of database designed for analysis of large volumes of data
 Integrates data from various sources and arranges them in order to optimize queries
 Stores both current and historical data
 Data is usually added or deleted, but not updated
 OLAP (Online Analytical Processing) tools allow analysis of multi-dimensional data
 Popular data warehouses: Snowflake, Amazon Redshift, Google BigQuery, Teradata

27 Roi Yehoshua, 2024


OLTP vs. OLAP

28 Roi Yehoshua, 2024


Multidimensional Data
 A data warehouse contains two primary types of tables:
 Fact tables store quantitative data (facts) about a business process
 For example, a table that stores daily sales:
sales(date, product_id, store_id, total_sales_amount, total_units_sold)
 Each row in the table represents a combination of dimensions (date, product and store) and
the corresponding measures (total sales amount and units sold)
 Dimension tables store descriptive attributes (dimensions) related to the data
 For example, a product dimension table would include details about each product
 e.g., id, name, category, price
 Dimension tables are typically denormalized to make the design simpler
 Data modeled as dimension + measures attributes is called multidimensional data

29 Roi Yehoshua, 2024


Star Schema
 The resulting schema is a star schema
 The fact table is connected to multiple peripheral tables (dimension tables)
 Advantages
 Queries are simpler and run faster compared to normalized structures (less joins)
 Flexible design that allows to add new dimensions without making extensive changes

30 Roi Yehoshua, 2024


Snowflake Schema
 A variant of star schema with multiple levels of dimension tables
 The dimension tables are usually normalized
 Reduces data redundancy but increases the number of tables and complexity of joins

31 Roi Yehoshua, 2024


Data Lakes
 Storage systems that can store vast amounts of data in its native format
 Including structured data (e.g., databases) and unstructured data (e.g., images, text)
 Require less upfront effort, but more effort during querying
 Use “schema-on-read” approach instead of “schema-on-write”
 You store the data as-is and define the schema only when you need to read/process it
 Usually built on top of distributed storage systems such as Hadoop
 which can scale out by adding more nodes to the system
 Integrate well with big data tools such as Apache Spark
 Popular data lakes:
 Amazon S3
 Azure Data Lake Storage
 Google Cloud Storage
32 Roi Yehoshua, 2024
Data Lakes

Source: https://www.grazitti.com/blog/data-lake-vs-data-warehouse-which-one-should-you-go-for

33 Roi Yehoshua, 2024


OLAP (Online Analytical Processing)
 Set of tools that provide analysis of data from multiple dimensions
 Allows users to interactively analyze and explore data from different points of view
 Used in BI (Business Intelligence) to uncover insights and patterns that are not
immediately apparent with traditional, two-dimensional databases
 Types of OLAP
 MOLAP (Multidimensional OLAP): Uses multidimensional databases
 ROLAP (Relational OLAP): Uses relational databases to store data and a multidimensional
model at the front end
 HOLAP (Hybrid OLAP): Combines the features of both MOLAP and ROLAP

34 Roi Yehoshua, 2024


Data Cube
 The primary unit of storage and analysis in OLAP
 A multi-dimensional grid of data
 Each dimension represents a different aspect of the data
 Actual data points (measures) reside within the cube
 Example: sales_data (date, item_type, location, sales_amount)

35 Roi Yehoshua, 2024


OLAP Operations
 The main OLAP operations
 Roll-up (aggregation)
 Drill-down (de-aggregation)
 Slice
 Dice
 Pivot (rotating)

36 Roi Yehoshua, 2024


Roll-Up
 Aggregate the data by ascending the level in the dimension hierarchy
 Example: aggregate sales by countries instead of cities

37 Roi Yehoshua, 2024


Roll-Up in SQL
 Can be achieved by using GROUP BY
 For example, aggregating daily sales to monthly sales:
SELECT YEAR(date) AS year, MONTH(date) AS month, SUM(sale_amount) AS monthly_sales
FROM sales_data
GROUP BY YEAR(date), MONTH(date);

date item_type location sale_amount year month total_sales


2022-01-01 Electronics New York 500 2022 1 800
2022-01-02 Clothing New York 300 2022 2 870
2022-02-01 Electronics New York 550 2023 1 770
2022-02-03 Clothing New York 320
2023-01-05 Electronics Los Angeles 480
2023-01-10 Clothing Los Angeles 290

38 Roi Yehoshua, 2024


The ROLLUP Operator
 Enhances the grouping capability of your queries by providing subtotals for
hierarchical combinations of columns in the GROUP BY
 For example, using ROLLUP to get yearly and monthly sales:
SELECT YEAR(date) AS year, MONTH(date) AS month, SUM(sale_amount) AS total_sales
FROM sales_data
GROUP BY YEAR(date), MONTH(date) WITH ROLLUP;

date item_type location sale_amount year month total_sales


2022-01-01 Electronics New York 500 2022 1 800
2022-01-02 Clothing New York 300 2022 2 870
2022-02-01 Electronics New York 550 2022 NULL 1670
2022-02-03 Clothing New York 320 2023 1 770
2023-01-05 Electronics Los Angeles 480 2023 NULL 770
2023-01-10 Clothing Los Angeles 290 NULL NULL 2440

39 Roi Yehoshua, 2024


Drill-Down
 Break down the data by descending the level in the dimension hierarchy
 Example: aggregate the sales by months instead of quarters

40 Roi Yehoshua, 2024


Drill-Down in SQL
 Just add another level to the GROUP BY
 For example, breaking down monthly sales by location:
SELECT YEAR(date) AS year, MONTH(date) AS month, location, SUM(sale_amount) AS total_sales
FROM sales_data
GROUP BY YEAR(date), MONTH(date), location;

date item_type location sale_amount year month location total_sales


2022-01-01 Electronics New York 500 2022 1 New York 800
2022-01-02 Clothing New York 300 2022 2 New York 870
2022-02-01 Electronics New York 550 2023 1 Los Angeles 770
2022-02-03 Clothing New York 320
2023-01-05 Electronics Los Angeles 480
2023-01-10 Clothing Los Angeles 290

41 Roi Yehoshua, 2024


Slice
 View a slice of the cube by fixing some of the dimensions
 Example: viewing sales for a specific quarter

42 Roi Yehoshua, 2024


Slice in SQL
 Use WHERE to fix one of the dimensions and group by the other dimensions
 Example: viewing sales for the year 2022
SELECT item_type, location, SUM(sale_amount) AS total_sales
FROM sales_data
WHERE YEAR(date) = 2022
GROUP BY item_type, location;

43 Roi Yehoshua, 2024


Dice
 Select two or more dimensions to get a sub-cube
 Example: viewing sales for a specific location, quarter, and item type

44 Roi Yehoshua, 2024


Dice in SQL
 Use WHERE to fix two or more dimensions and group by the other dimensions
 Example: viewing sales for the year 2022 in Chicago
SELECT item_type, SUM(sale_amount) AS total_sales
FROM sales_data
WHERE YEAR(date) = 2022 AND location = 'Chicago'
GROUP BY item_type;

45 Roi Yehoshua, 2024


Pivot
 Rotate the data axes to view the data from a different perspective
 Example: show the item types in the columns instead of the rows

46 Roi Yehoshua, 2024


Pivot in SQL
 SQL doesn’t directly support the pivot operation like some specialized OLAP systems
 You can achieve a pivot using conditional aggregation
 For example, pivoting monthly sales for different item types:
SELECT MONTH(date) AS Month,
SUM(CASE WHEN item_type = 'Electronics' THEN sale_amount ELSE 0 END) AS electronics_sales,
SUM(CASE WHEN item_type = 'Clothing' THEN sale_amount ELSE 0 END) AS clothing_sales
FROM sales_data
WHERE YEAR(date) = 2022
GROUP BY MONTH(date);

47 Roi Yehoshua, 2024


Other OLAP Operations
 Split
 Break out one dimension into two separate dimensions for finer granularity
 Drill across
 Navigate from one cube to another cube that shares one or more dimensions
 Drill-through
 Access the detailed data that constitutes a certain higher-level aggregate value

48 Roi Yehoshua, 2024


The CUBE Operator
 An extension to the GROUP BY clause
 Generates a result set that represents all combinations of all aggregations
 In contrast to ROLLUP that represents aggregations from the most detailed level to the total
 For each grouping, the result contains NULL for attributes not present in the
grouping
 Supported by many relational databases (but not MySQL )

49 Roi Yehoshua, 2024


The CUBE Operator
date location sale_amount year month location total_sales
2022 1 New York 800
2022-01-01 New York 500
2022 2 Los Angeles 870
2022-01-02 New York 300
2022 1 NULL 800
2022-02-01 Los Angeles 550 2022 2 NULL 870
2022-02-03 Los Angeles 320 NULL 1 New York 800
NULL 2 Los Angeles 870
SELECT YEAR(date) AS year, MONTH(date) AS month, 2022 NULL New York 800
location, SUM(sale_amount) AS total_sales 2022 NULL Los Angeles 870
FROM sales_data
GROUP BY YEAR(date), MONTH(date), location WITH CUBE; NULL NULL New York 800
NULL NULL Los Angeles 870
2022 NULL NULL 1670
NULL 1 NULL 800
NULL 2 NULL 870
NULL NULL NULL 2440

50 Roi Yehoshua, 2024


Class Exercise
 Compute the total number of student enrollments in each department’s courses
over the years, with subtotals by year and a grand total across all years

51 Roi Yehoshua, 2024


Solution

52 Roi Yehoshua, 2024


Window Functions in SQL
 Perform calculations across a set of table rows that are related to the current row
 e.g., calculating running totals / moving average
 Return a single value for each row from the query
 Unlike aggregate functions which return a single value for each group
 General syntax:
<window_function>(<arguments>) OVER (
[PARTITION BY <partition_expression(s)>]
[ORDER BY <order_expression(s)>]
[frame_specification]
)

 The OVER clause defines the rows on which the window function operates
 An optional PARTITION BY clause divides the rows into groups
 The ORDER BY clause determines the order of the rows within each partition
 Frame specification defines which rows are included in the frame (subset of the current row)
53 Roi Yehoshua, 2024
Example: Moving Average
 The AVG() function can be used as a window function to get a moving average
 For example, assume that we have a table with daily sales figures
 We can calculate a 3-day moving average for sales as follows:

54 Roi Yehoshua, 2024


Frame Specifications
 Define the set of rows included in the window relative to the current row
 Has two components:
 Frame start: where the frame starts relative to the current row
 Frame end: where the frame ends relative to the current row
 ROWS, RANGE, or GROUPS: determine how the frame boundaries are defined
 ROWS: Defines the frame by a specific number of rows
 RANGE: Defines the frame by value range
 GROUPS: Used with ordered set aggregate functions
 Frame boundary:
 UNBOUNDED PRECEDING/FOLLOWING: The frame starts/ends at the first row of the partition
 N PRECEDING/FOLLOWING: The frame starts/ends N rows before the current row
 CURRENT ROW: The frame starts or ends at the current row

55 Roi Yehoshua, 2024


Common Window Functions
 Ranking functions
 RANK() assigns a rank to each row based on a specified column’s values
 DENSE_RANK() is similar to RANK() but without gaps between rank values for rows with the
same rank
 ROW_NUMBER() assigns a unique sequential integer to rows
 Aggregate functions
 Almost all aggregate functions (e.g., SUM() and AVG()) can be used as window functions
 Navigation functions
 LEAD() returns a value from a subsequent row
 LAG() returns a value from a preceding row
 FIRST_VALUE() returns a value from the first row of the window frame
 LAST_VALUE() returns a value from the last row of the window frame

56 Roi Yehoshua, 2024


Rank Function
 The RANK() function assigns a rank to each row of a partition or the entire result set
 It should be used with ORDER BY to sort the rows into the desired order
 For example, we can rank the students based on their total credits:

57 Roi Yehoshua, 2024


PARTITION BY
 PARTITION BY divides the result set into partitions and the window function is
applied to each partition independently
 For example, we can rank students by the total credits within each department:

58 Roi Yehoshua, 2024


Class Exercise
 For each course in the Fall 2017 semester rank the students based on their grades
 If two students have the same grade, they should get the same rank
 Expected columns in the result: ‘course_id’, ‘student_id’, ‘grade’, ‘student_rank’

59 Roi Yehoshua, 2024


Solution

60 Roi Yehoshua, 2024

You might also like