Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
0% found this document useful (0 votes)
43 views

Distributed Query Processing

The document discusses distributed query processing. It describes how distributed queries extend centralized query processing to account for factors like network communication costs and data partitioning across multiple sites. Some key techniques for distributed query execution include row blocking to batch data transfers, multi-cast optimization to reduce communication costs, and semi-joins to minimize data shipped between sites during joins.

Uploaded by

Afaq Ahmad
Copyright
© © All Rights Reserved
Available Formats
Download as PPTX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
43 views

Distributed Query Processing

The document discusses distributed query processing. It describes how distributed queries extend centralized query processing to account for factors like network communication costs and data partitioning across multiple sites. Some key techniques for distributed query execution include row blocking to batch data transfers, multi-cast optimization to reduce communication costs, and semi-joins to minimize data shipped between sites during joins.

Uploaded by

Afaq Ahmad
Copyright
© © All Rights Reserved
Available Formats
Download as PPTX, PDF, TXT or read online on Scribd
You are on page 1/ 17

Distributed Query Processing

Group Members
M. Arslan BCS07183033
Adnan Ali BCS07183037
Amina Sajid BCS07183040
Motivation
• Cost and scalability: network of off-shelf
machines
• Integration of different software vendors
(with own DBMS)
• Integration of legacy systems
• Applications inherently distributed, such as
workflow or collaborative-design
• State-of-the-art distributed information
technologies (e-businesses)
Distribute Query Processing
• Query Processing Basics
– centralized query processing
– distributed query processing
Problem Statement
• Input: Query such as „Biological objects in
study A referenced in a literature in journal Y“.
• Output: Answer
• Objectives:
– response time, throughput, first answers, little IO, ...
• Centralized vs. Distributed Query Processing
– same basic problem
– but, more and different parameters, such(data sites
or available machine power) and objectives
Steps in Query Processing
• Input: Declarative Query
– SQL, XQuery, ...
• Step 1: Translate Query into Algebra
– Tree of operators (query plan generation)
• Step 2: Optimize Query
– Tree of operators (logical) - also select partitions of table
– Tree of operators (physical) – also site annotations
– (Compilation)
• Step 3: Execution
– Interpretation; Query result generation
Algebra
A.d
A.a =
SELECT A.d
B.b,
FROM A, B
A.c =
WHERE A.a = B.b
35
AND A.c = 35 X

A B

– relational algebra for SQL very well understood


– algebra for XQuery mostly understood
Query Optimization
A.d A.d
A.a =
B.b,
hashjoin
A.c =
35
X B.b

index
A B B
A.c

– logical, e.g., push down cheap predicates


– enumerate alternative plans, apply cost model
– use search heuristics to find cheapest plan
Basic Query Optimization
• Classical Dynamic Programming algorithm
– Performs join order optimization
– Input : Join query on n relations
– Output : Best join order
Query Execution
John
A.d
(John, 35, CS)
hashjoin (CS)
(AS)
(John, 35, CS) B.b
(Mary, 35, EE) (Edinburgh, CS,5.0)
(Edinburgh, AS, 6.0)
index
B
A.c
– library of operators (hash join, merge join, ...)
– exploit indexes and clustering in database
– pipelining (iterator model)
Summary : Centralized Queries
• Basic SQL (SPJG, nesting) well understood
• Very good extensibility
– spatial joins, time series, UDF, xquery, etc.
• Current problems
– Better statistics : cost model for optimization
– Physical database design expensive & complex
• Some Trends
– interactiveness during execution
– approximate answers, top-k
– self-tuning capabilities (adaptive; robust; etc.)
Distributed Query Processing: Basics
• Idea:
Extension of centralized query processing. (System
R* et al. in 80s)
• What is different?
– extend physical algebra: send&receive operators
– other metrics : optimize for response time
– resource vectors, network interconnect matrix
– caching and replication
– less predictability in cost model (adaptive algos)
– heterogeneity in data formats and data models
Issues in Distributed Databases
• Plan enumeration
– The time and space complexity of traditional dynamic
programming algorithm is very large
– Iterative Dynamic Programming (heuristic for large
queries)
• Cost Models
– Classic Cost Model
– Response Time Model
– Economic Models
Query Execution Techniques for
Distributed Databases
• Row Blocking
• Multi-cast optimization
• Multi-threaded execution
• Joins with horizontal partitioning
• Semi joins
• Top n queries
Query Execution Techniques for DD

• Row Blocking –
– SEND and RECEIVE operators in query plan
to model communication
– Implemented by TCP/IP, UDP, etc.
– Ship tuples in block-wise fashion (batch);
smooth burstiness
Query Execution Techniques for DD
• Multi-cast Optimization
– Location of sending/receiving may affect
communication costs; forwarding versus multi-casting
• Multi-threaded execution
– Several threads for operators at the same site (intra-
query parallelism)
– May be useful to enable concurrent reads for diverse
machines (while continuing query processing)
– Must consider if resources warrant concurrent operator
execution (say two sorts each needing all memory)
Query Execution Techniques for DD
• Joins with Data (horizontal) partitioning:
– Hash-based partitioning to conduct joins on independent partitions
• Semi Joins :
– Reduce communication costs; Send only “join keys” instead of
complete tuples to the site to extract relevant join partners
• Double-pipelined hash joins :
– Non-blocking join operators to deliver first results quickly; fully
exploit pipelined parallelism, and reduce overall response time
• Top n queries :
– Isloate top n tuples quickly and only perform other expensive
operations (like sort, join, etc) on those few (use “stop” operators)

You might also like