Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
0% found this document useful (0 votes)
25 views

Lecture5 -Query_Processing 1

Query Processing

Uploaded by

amirosama2121
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
25 views

Lecture5 -Query_Processing 1

Query Processing

Uploaded by

amirosama2121
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 23

Principles of Distributed Database

Systems
M. Tamer Özsu
Patrick Valduriez

© 2020, M.T. Özsu & P. Valduriez 1


Outline
◼ Introduction
◼ Distributed and parallel database design
◼ Distributed data control
◼ Distributed Query Processing
◼ Distributed Transaction Processing
◼ Data Replication
◼ Database Integration – Multidatabase Systems
◼ Parallel Database Systems
◼ Peer-to-Peer Data Management
◼ Big Data Processing
◼ NoSQL, NewSQL and Polystores
◼ Web Data Management
© 2020, M.T. Özsu & P. Valduriez 2
Outline
◼ Distributed Query Processing
❑ Query Decomposition and Localization
❑ Join Ordering
❑ Distributed Query Optimization
❑ Adaptive Query Processing

© 2020, M.T. Özsu & P. Valduriez 3


Query Processing in a DDBMS

◼ Generally, a query in distributed DBMS require data from


multiple sites, and this is called transmission of data that
causes communication costs.

◼ Query processing in DBMS is different from centralized


DBMS due to the communication cost of data transfer
over the network.

◼ The transmission cost is low when the sites are


connected through high-speed network and is quite
significant in another network.

© 2020, M.T. Özsu & P. Valduriez 4


Query Processing in a DDBMS

◼ In distributed query processing, the data transfer


cost of distributed query processing means:

❑ Cost of transferring intermediate files to other sites for


processing and
❑ Cost of transferring the ultimate result file to the
location where the results required

© 2020, M.T. Özsu & P. Valduriez 5


Distributed DBMS Environment

• If s1 request a query and


needs data from s2 and
s3.
• It is decided to execute
the query at s3.
• 1st communication cost is
transferring the data from
s2 to s3 → then s3 will
execute the query and
get the result.
• 2nd communication cost is
transferring the result
from s3 to s1

© 2020, M.T. Özsu & P. Valduriez 6


Query Processing in a DDBMS

High level user query

Query
Processor

Low-level data manipulation


commands for D-DBMS

© 2020, M.T. Özsu & P. Valduriez 7


Query Processing Components

◼ Query language
❑ SQL: “intergalactic dataspeak”

◼ Query execution
❑ The steps that one goes through in executing high-level
(declarative) user queries.

◼ Query optimization
❑ How do we determine the “best” execution plan?

◼ We assume a homogeneous D-DBMS

© 2020, M.T. Özsu & P. Valduriez 8


Selecting Alternatives
Find the names of employee who are
managing a project

Strategy 1

SELECT ENAME
FROM EMP , ASG
WHERE emp.no=asg.no and RESP = "Manager"

Strategy 1
 ENAME(RESP=“Manager”EMP.ENO=ASG.ENO(EMP×ASG))

© 2020, M.T. Özsu & P. Valduriez 9


Selecting Alternatives
Strategy 2

SELECT ENAME
FROM EMP NATURAL JOIN ASG
WHERE RESP = "Manager"

Strategy 2
 ENAME(EMP ⋈ENO (RESP=“Manager” (ASG))
Strategy 2 avoids Cartesian product, and consumes less
computing resources, so may be “better”

© 2020, M.T. Özsu & P. Valduriez 10


Selecting Alternatives
In a distributed system,

◼ Relational algebra is not enough to express execution


strategies. It must be supplemented with operators for
exchanging data between sites.

◼ The distributed query processor must also select the


best sites to process data , and possibly the way data
should be transformed.

© 2020, M.T. Özsu & P. Valduriez 11


What is the Problem?

Site 1 Site 2 Site 3 Site 4 Site 5


ASG1=ENO≤“E3”(ASG) ASG2= ENO>“E3”(ASG) EMP1= ENO≤“E3”(EMP) EMP2= ENO>“E3”(EMP) Result

Strategy A Strategy B
© 2020, M.T. Özsu & P. Valduriez 12
Cost of Alternatives

Assume
◼ size(EMP) = 400 row,
size(ASG) = 1000
◼ tuple access cost = 1
unit (1 operation or 1s);
◼ tuple transfer cost = 10
units
◼ There are 20 managers
in relation ASG
◼ Assume that the data is
uniformly distributed
among sites

© 2020, M.T. Özsu & P. Valduriez 13


Cost of Alternatives

◼ Strategy A
❑ produce ASG': (10+10) tuple access cost 20
❑ transfer ASG' to the sites of EMP: (10+10)
tuple transfer cost 200
❑ produce EMP': (10+10) tuple access cost
2 40
❑ transfer EMP' to result site: (10+10) tuple
transfer cost 200
Total Cost 460

◼ Strategy B
❑ transfer EMP to site 5: 400 tuple transfer
cost 4,000
❑ transfer ASG to site 5: 1000 tuple transfer
cost 10,000
❑ produce ASG': 1000 tuple access (apply
condition) 1,000
❑ join EMP and ASG': 400 20(manager) tuple
access cost 8,000
Total Cost 23,000

© 2020, M.T. Özsu & P. Valduriez 14


Query Optimization Objectives
◼ Minimize a cost function
❑ I/O cost + CPU cost + communication cost
❑ These might have different weights in different distributed
environments
◼ Wide area networks
❑ Communication cost may dominate or vary much
◼ Bandwidth
◼ Speed
◼ Protocol overhead
◼ Local area networks
❑ Communication cost not that dominant, so total cost function
should be considered
◼ Can also maximize throughput
© 2020, M.T. Özsu & P. Valduriez 15
Complexity of Relational Operations

Operation Complexity

Select
Project O(n)
◼ Assume (without duplicate elimination)

❑ Relations of cardinality n Project


(with duplicate elimination) O(n  log n)
❑ Sequential scan
Group

Join
Semi-join O(n  log n)
Division
Set Operators

Cartesian Product O(n2)

© 2020, M.T. Özsu & P. Valduriez 16


Types Of Optimizers

◼ Exhaustive search
❑ Cost-based
❑ Optimal
❑ Combinatorial complexity in the number of relations
◼ Heuristics
❑ Not optimal
❑ Regroup common sub-expressions
❑ Perform selection, projection first
❑ Replace a join by a series of semijoins
❑ Reorder operations to reduce intermediate relation size
❑ Optimize individual operations

© 2020, M.T. Özsu & P. Valduriez 17


Optimization Granularity

◼ Single query at a time


❑ Cannot use common intermediate results

◼ Multiple queries at a time


❑ Efficient if many similar queries

❑ Decision space is much larger

© 2020, M.T. Özsu & P. Valduriez 18


Optimization Timing

◼ Static : optimization is done at query compilation time


❑ Compilation ➔ optimize prior to the execution
❑ Difficult to estimate the size of the intermediate resultserror
propagation
❑ Can amortize over many executions
◼ Dynamic: proceeds at query execution time
❑ Run time optimization
❑ Exact information on the intermediate relation sizes
❑ Have to re-optimize for multiple executions
◼ Hybrid: tradeoff between both
❑ Compile using a static algorithm
❑ If the error in estimate sizes > threshold, re-optimize at run time

© 2020, M.T. Özsu & P. Valduriez 19


Statistics

◼ Relation
❑ Cardinality
❑ Size of a tuple
❑ Fraction of tuples participating in a join with another relation
◼ Attribute
❑ Cardinality of domain
❑ Actual number of distinct values
◼ Simplifying assumptions
❑ Independence between different attribute values
❑ Uniform distribution of attribute values within their domain

© 2020, M.T. Özsu & P. Valduriez 20


Optimization Decision Sites

◼ Centralized
❑ Single site determines the “best” schedule
❑ Simple
❑ Need knowledge about the entire distributed database
◼ Distributed
❑ Cooperation among sites to determine the schedule
❑ Need only local information
❑ Cost of cooperation
◼ Hybrid
❑ One site determines the global schedule
❑ Each site optimizes the local subqueries

© 2020, M.T. Özsu & P. Valduriez 21


Network Topology

◼ Wide area networks (WAN) – point-to-point


❑ Characteristics
◼ Relatively low bandwidth (compared to local CPU/IO)
◼ High protocol overhead
❑ Communication cost may dominate; ignore all other cost factors
❑ Global schedule to minimize communication cost
❑ Local schedules according to centralized query optimization
◼ Local area networks (LAN)
❑ Communication cost not that dominant
❑ Total cost function should be considered
❑ Broadcasting can be exploited (joins)
❑ Special algorithms exist for star networks

© 2020, M.T. Özsu & P. Valduriez 22


Questions?

© 2020, M.T. Özsu & P. Valduriez 23

You might also like