Lecture5 -Query_Processing 1
Lecture5 -Query_Processing 1
Systems
M. Tamer Özsu
Patrick Valduriez
Query
Processor
◼ Query language
❑ SQL: “intergalactic dataspeak”
◼ Query execution
❑ The steps that one goes through in executing high-level
(declarative) user queries.
◼ Query optimization
❑ How do we determine the “best” execution plan?
Strategy 1
SELECT ENAME
FROM EMP , ASG
WHERE emp.no=asg.no and RESP = "Manager"
Strategy 1
ENAME(RESP=“Manager”EMP.ENO=ASG.ENO(EMP×ASG))
SELECT ENAME
FROM EMP NATURAL JOIN ASG
WHERE RESP = "Manager"
Strategy 2
ENAME(EMP ⋈ENO (RESP=“Manager” (ASG))
Strategy 2 avoids Cartesian product, and consumes less
computing resources, so may be “better”
Strategy A Strategy B
© 2020, M.T. Özsu & P. Valduriez 12
Cost of Alternatives
Assume
◼ size(EMP) = 400 row,
size(ASG) = 1000
◼ tuple access cost = 1
unit (1 operation or 1s);
◼ tuple transfer cost = 10
units
◼ There are 20 managers
in relation ASG
◼ Assume that the data is
uniformly distributed
among sites
◼ Strategy A
❑ produce ASG': (10+10) tuple access cost 20
❑ transfer ASG' to the sites of EMP: (10+10)
tuple transfer cost 200
❑ produce EMP': (10+10) tuple access cost
2 40
❑ transfer EMP' to result site: (10+10) tuple
transfer cost 200
Total Cost 460
◼ Strategy B
❑ transfer EMP to site 5: 400 tuple transfer
cost 4,000
❑ transfer ASG to site 5: 1000 tuple transfer
cost 10,000
❑ produce ASG': 1000 tuple access (apply
condition) 1,000
❑ join EMP and ASG': 400 20(manager) tuple
access cost 8,000
Total Cost 23,000
Operation Complexity
Select
Project O(n)
◼ Assume (without duplicate elimination)
Join
Semi-join O(n log n)
Division
Set Operators
◼ Exhaustive search
❑ Cost-based
❑ Optimal
❑ Combinatorial complexity in the number of relations
◼ Heuristics
❑ Not optimal
❑ Regroup common sub-expressions
❑ Perform selection, projection first
❑ Replace a join by a series of semijoins
❑ Reorder operations to reduce intermediate relation size
❑ Optimize individual operations
◼ Relation
❑ Cardinality
❑ Size of a tuple
❑ Fraction of tuples participating in a join with another relation
◼ Attribute
❑ Cardinality of domain
❑ Actual number of distinct values
◼ Simplifying assumptions
❑ Independence between different attribute values
❑ Uniform distribution of attribute values within their domain
◼ Centralized
❑ Single site determines the “best” schedule
❑ Simple
❑ Need knowledge about the entire distributed database
◼ Distributed
❑ Cooperation among sites to determine the schedule
❑ Need only local information
❑ Cost of cooperation
◼ Hybrid
❑ One site determines the global schedule
❑ Each site optimizes the local subqueries