CSE 444 Practice Problems
CSE 444 Practice Problems
DBMS Architecture
1. Data Independence
(b) What properties of the relational model facilitate physical data independence?
Solution:
Declarative query language or set-at-a-time query language.
1
(c) What is logical data independence?
Solution:
Logical data independence is a property of a DBMS that ensures decoupling between the logical
structure of data and applications that operate on it. With this property, changes in the logical
data layout like tables, rows, and columns do not require application to be changed.
(d) How can one provide a high level of logical data independence with the relational model?
Solution:
By defining views.
2
2. High-Level DBMS Architecture
You should know the key components of a relational DBMS. Please see the lecture notes for an overview.
You should also be able to discuss what you implemented in the labs. For example, we could ask you
for a high-level description of your buffer manager.
3
3. Data Storage and Indexing
You should be able to show what happens when one adds/removes data to/from a B+ Tree. Please
see the web quizzes for some good examples.
4
4. Data Storage and Indexing
Suppose we have a relation R(a,b,c,d,e) and there are at least 1000 distinct values for each of the
attributes.
Consider each of the following query workloads, independently of each other. If it is possible to speed it
up significantly by adding up to two additional indexes to relation R, specify for each index (1) which
attribute or set of attributes form the search key of the index, (2) if the index should be clustered or
unclustered, (3) if the index should be a hash-based index or a B+-tree. You may add at most two
new indexes. If adding a new index would not make a significant difference, you should say so. Give a
brief justification for your answers.
(a) 100,000 queries have the form: select * from R where b <?
10,000 queries have the form: select * from R where c =?
Solution:
Since we need efficient range-queries on R(b), we definitely want a clustered, B+-tree index on
R(b).
For the second query, an index on R(c) will help. It can be either a B+-tree or a hash-based index
since queries look-up specific key values. The index must be unclustered since the index on R(b)
is clustered. That is fine, however, since queries will look-up specific key values and we know that
there are many distinct values in the relation.
(b) 100,000 queries have the form: select * from R where b <? and c =?
10,000 queries have the form: select * from R where d =?
1,000 queries have the form: select * from R where a =?
Solution:
For the first query, a clustered B+-tree index on R(c,b) would be most helpful since we could use
it to look-up all data items that match both the given value on c and the range on b.
Since we can only add a second index, we will favor the most frequent query and add an index on
R(d). As in the question above, this index must be unclustered and can be either a B+-tree or a
hash-based index.
5
(c) 100,000 queries have the form: select a, c from R where b <?
10,000 queries have the form: select * from R where d <?
Solution:
Since both queries are range-selection queries, we need clustered indexes for both of them, but we
cannot have more than one such index. However, we can have a covering index.
We thus recommend:
• A clustered, B+-tree index on R(d).
• An unclustered, B+-tree index on R(b,a,c). This is also a covering index in that we only need
to use the index to answer the query. We don’t need to touch the data.
6
5. Relational Algebra and Query Processing
Consider three tables R(a,b,c), S(d,e,f), and T(g,h,i).
CORRECT INCORRECT
Solution:
CORRECT.
ii. πR.b (σTOTAL>2 (γR.b,count(∗)→TOTAL (R 1R.a=S.d ((σS.f <50 (S)) 1S.e=T.g (σT.h>21 (T ))))))
CORRECT INCORRECT
Solution:
CORRECT.
iii. πR.b (σTOTAL>2 (σT.h>21 AND S.f <50 (γR.b,count(∗)→TOTAL (R 1R.a=S.d (S 1S.e=T.g T )))))
CORRECT INCORRECT
Solution:
INCORRECT.
7
(b) For the following SQL query, show an equivalent relational algebra expression. You can give the
expression in the same format as we used above or you can draw it in the form of an expression
tree or logical query plan.
SELECT R.b
FROM R, S
WHERE R.a = S.d
AND R.b NOT IN (SELECT R2.b FROM R as R2, T WHERE R2.b = T.g)
Solution:
πR.b (R 1R.a=S.d S) − πR.b (R 1R.b=T.g T )
8
(c) A user just connected to a database server and submitted a SQL query in the form of a string.
Give four important steps involved in evaluating that SQL query and the order in which they
are performed. You only need to name the steps. No need to explain them.
Solution:
(d) Query parsing: the DBMS parses the SQL string into an internal tree representation of the
query. Typically, a variety of checks are also performed at that time including syntax checks,
authorization checks, simple integrity constraint checks, etc.
(e) Query rewrite: the parse-tree is converted into an initial logical query plan. During that step,
views are also rewritten and queries are flattened if possible.
(f) Query optimization: the query optimizer searches for an efficient physical query plan to execute
the query.
(g) Query execution: the DBMS actually executes the query and returns the results to the user.
9
(h) What is the difference between a logical and a physical query plan?
Solution:
A logical query plan is an extended relational algebra tree. A physical query plan is a logical query
plan with extra annotations that specify the (1) access path to use for each relation (whether to use
an index or a file scan and which index to use), the (2) implementation to use for each relational
operator, and (3) how the operators should be executed (pipelined execution, intermediate result
materialization, etc.).
10
6. Relational Algebra and Query Processing
Consider four tables R(a,b,c), S(d,e,f), T(g,h), U(i,j,k).
Solution:
Many solutions were possible including:
π R.b,avg
T.h = U.i
S.e = T.g
U
R.a = S.d
T
R S
11
(b) Consider the following two physical query plans. Give two reasons why plan B may be faster
than plan A. Explain each reason.
A B
(B+ tree index on A.a) (B+ tree index on B.b)
Plan A Plan B
Solution:
The following are three possible reasons why plan B could be faster than plan A:
• The low selectivity of the selection predicate (a > 10) and an unclustered index on A.a may
make a file scan of A faster than using the index.
• If there are lots of matches, hash joins may be faster than index nested loops. The hash join
reads its input only once (unless the input is too large). The index-nested loop may end-up
reading the same pages of B multiple times.
• Pushing the selection (B.c < 2000) down can get rid of lots of B tuples before the join,
reducing the cost of that operation.
12
7. Operator Algorithms
Relation R has 90 pages. Relation S has 80 pages. Explain how a DBMS could efficiently join these
two relations given that only 11 pages can fit in main memory at a time. Your explanation should be
detailed: specify how many pages are allocated in memory and what they are used for; specify what
exactly is written to disk and when.
13