Lecture 5 Trees
Indexes
CS 186, Spring 2006, Lectures 5 & 6
R & G Chapters 9 & 10
Abraham Lincoln
Review: Files, Pages, Records
• Abstraction of stored data is “files” of “records”.
– Records live on pages
– Physical Record ID (RID) = <page#, slot#>
• Variable length data requires more sophisticated
structures for records and pages. (why?)
– Records: offset array in header
– Pages: Slotted pages w/internal offsets & free
space area
• Often best to be “lazy” about issues such as free
space management, exact ordering, etc. (why?)
• Files can be unordered (heap), sorted, or kinda
sorted (i.e., “clustered”) on a search key.
– Tradeoffs are update/maintenance cost vs. speed
of accesses via the search key.
– Files can be clustered (sorted) at most one
way.
• Indexes can be used to speed up many kinds of
accesses. (i.e., “access paths”)
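To make the <page#, slot#> idea concrete, here is a minimal sketch of a slotted page in Python. The class name `SlottedPage`, the 4096-byte page size, and the lazy-delete policy are assumptions chosen for illustration, not a description of any particular system.

```python
class SlottedPage:
    """Toy slotted page: records are byte strings, each slot holds (offset, length)."""

    def __init__(self, page_no, size=4096):
        self.page_no = page_no
        self.data = bytearray(size)
        self.slots = []        # slot directory; None marks a deleted slot
        self.free_start = 0    # start of the free-space area

    def insert(self, record: bytes):
        if self.free_start + len(record) > len(self.data):
            raise ValueError("page full")
        offset = self.free_start
        self.data[offset:offset + len(record)] = record
        self.free_start += len(record)
        # Reuse a freed slot if one exists, so the slot directory does not grow.
        for slot_no, slot in enumerate(self.slots):
            if slot is None:
                self.slots[slot_no] = (offset, len(record))
                return (self.page_no, slot_no)
        self.slots.append((offset, len(record)))
        return (self.page_no, len(self.slots) - 1)

    def get(self, rid):
        page_no, slot_no = rid
        assert page_no == self.page_no
        slot = self.slots[slot_no]
        if slot is None:
            raise KeyError(rid)
        offset, length = slot
        return bytes(self.data[offset:offset + length])

    def delete(self, rid):
        # Lazy delete: free the slot only; the bytes stay until a later compaction.
        self.slots[rid[1]] = None


page = SlottedPage(page_no=7)
rid_a = page.insert(b"alice")   # RID (7, 0)
rid_b = page.insert(b"bob")     # RID (7, 1)
```

Because other records are never renumbered, a RID handed out earlier stays valid even after neighbouring records are deleted or the page is compacted.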
Indexes: Introduction
• Sometimes, we want to retrieve records by specifying the
values in one or more fields, e.g.,
– Find all students in the “CS” department
– Find all students with a gpa > 3
• An index on a file is a disk-based data structure that
speeds up selections on the search key fields for the
index.
– Any subset of the fields of a relation can be the search
key for an index on the relation.
– Search key is not the same as key (e.g. doesn’t have to
be unique ID).
• An index contains a collection of data entries, and
supports efficient retrieval of all records with a given
search key value k.
– Typically, index also contains auxiliary information
that directs searches to the desired data entries
Indexes: Overview
• Many indexing techniques exist:
– B+ trees, hash-based structures, R trees,
…
• Can have multiple (different) indexes per
file.
– E.g. file sorted by age, with a hash index
on salary and a B+tree index on name.
• Index Classification
– What selections does it support
– Representation of data entries in index
• i.e., what kind of info is the index actually
storing?
• 3 alternatives here
– Clustered vs. Unclustered Indexes
– Single Key vs. Composite Indexes
– Tree-based, hash-based, other
Indexes: What Selections do they support?
• Selections of form field <op> constant
• Equality selections (op is =)
– Either “tree” or “hash” indexes help here.
• Range selections (op is one of <, >, <=, >=, BETWEEN)
– “Hash” indexes don’t work for these.
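The difference between the two kinds of selection can be sketched in Python. A `dict` stands in for a hash index and a sorted list searched with `bisect` stands in for a tree index; both stand-ins, the RID labels, and the function names are assumptions for illustration only.

```python
import bisect
from collections import defaultdict

# Hypothetical data entries: (search key, rid label).
entries = [(10, "r1"), (15, "r2"), (20, "r3"), (27, "r4"), (33, "r5"), (37, "r6")]

# "Hash" index: key -> list of rids. No ordering is kept.
hash_index = defaultdict(list)
for key, rid in entries:
    hash_index[key].append(rid)

# "Tree" index stand-in: keys kept in sorted order.
sorted_keys = [k for k, _ in entries]
sorted_rids = [r for _, r in entries]


def hash_equal(key):
    """Equality selection via the hash index."""
    return hash_index.get(key, [])


def tree_equal(key):
    """Equality selection via the ordered index."""
    lo = bisect.bisect_left(sorted_keys, key)
    hi = bisect.bisect_right(sorted_keys, key)
    return sorted_rids[lo:hi]


def tree_range(low, high):
    """Range selection low <= key <= high (BETWEEN) via the ordered index."""
    lo = bisect.bisect_left(sorted_keys, low)
    hi = bisect.bisect_right(sorted_keys, high)
    return sorted_rids[lo:hi]
```

The hash index has no counterpart to `tree_range`: hashing scatters neighbouring keys, so the only way to answer a range query with it would be to probe every possible key value or scan everything.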
20 33 51 63
10* 15* 20* 27* 33* 37* 40* 46* 51* 55* 63* 97*
Alternatives for Data Entry k* in Index
• Question: What is actually stored in the
leaves of the index for key value “k”?
(i.e., what are the “data entries”?)
• Three alternatives:
1. Actual data record(s) with key value k
2. <k, rid of matching data record>
3. <k, list of rids of matching data
records>
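As a rough illustration, the three alternatives can be built over the same tiny file in Python. The `heap_file` contents and the choice of department as the search key are made up for this sketch.

```python
from collections import defaultdict

# Hypothetical heap file: rid (page#, slot#) -> record (name, dept).
heap_file = {
    (1, 0): ("Ann", "CS"),
    (1, 1): ("Bob", "EE"),
    (2, 0): ("Cy", "CS"),
}

# Alternative 1: the data entries ARE the records, stored in search-key order.
alt1 = sorted(heap_file.values(), key=lambda rec: rec[1])

# Alternative 2: one <k, rid> pair per matching record.
alt2 = sorted((rec[1], rid) for rid, rec in heap_file.items())

# Alternative 3: one <k, list of rids> entry per distinct key value.
groups = defaultdict(list)
for key, rid in alt2:
    groups[key].append(rid)
alt3 = sorted(groups.items())
```

With Alternative 1 the index is itself the file; with Alternatives 2 and 3 the records live elsewhere and each lookup follows a rid. Alternative 3 stores each key once, at the cost of variable-length entries.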
• Upshot
– Don’t brag about being an ISAM expert on
your resume
– Do understand how they work, and
tradeoffs with B+-trees
Range Searches
• “Find all students with gpa > 3.0”
– If data is in sorted file, do binary
search to find first such student,
then scan to find others.
– Cost of binary search in a database
can be quite high. Q: Why???
• Simple idea: Create an ‘index’ file.
[Figure: index file containing keys k1, k2, …, kN]
ISAM
Index entry layout: P0 K1 P1 K2 P2 … Km Pm
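An index page with m keys and m+1 pointers (P0, K1, P1, …, Km, Pm) can be searched with one binary search per page. The sketch below assumes the common convention that pointer Pi leads to keys k with Ki <= k < Ki+1; the function name `choose_child` and the pointer labels are illustrative.

```python
import bisect


def choose_child(keys, pointers, k):
    """Pick the pointer to follow for search key k on one index page.

    keys     = [K1, ..., Km] in sorted order
    pointers = [P0, P1, ..., Pm]
    Assumed convention: Pi covers Ki <= k < K(i+1); P0 covers k < K1.
    """
    assert len(pointers) == len(keys) + 1
    return pointers[bisect.bisect_right(keys, k)]


keys = [20, 33, 51, 63]
pointers = ["P0", "P1", "P2", "P3", "P4"]
```

For example, `choose_child(keys, pointers, 27)` follows P1, and any key of 63 or more follows P4. A search repeats this step once per level, so its cost is one page read per level rather than one per step of a binary search over the whole data file.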
[Figure: ISAM structure. Labels: non-leaf pages, leaf pages, overflow page, primary pages.]
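[Figure: example ISAM tree]
Index pages: 20 33 51 63
Data pages: 10* 15* 20* 27* 33* 37* 40* 46* 51* 55* 63* 97*
[Figure: the tree with an overflow page]
Index pages: 20 33 51 63
Primary leaf pages: 10* 15* 20* 27* 33* 37* 40* 46* 51* 55* 63* 97*
Overflow pages: 42*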
... then Deleting 42*, 51*, 97*
[Figure: ISAM tree for the deletion example]
Root: 40
Index pages: 20 33 51 63
Primary leaf pages: 10* 15* 20* 27* 33* 37* 40* 46* 51* 55* 63* 97*
Overflow pages: 42*
• Pros
– ????
• Cons
– ????
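A small Python sketch may help when weighing the pros and cons: in ISAM the index pages are static, inserts that do not fit go to overflow pages, and deletes never change the index. The class names, the two-entries-per-page capacity, and the flattening of the two index levels into a single separator list are simplifying assumptions.

```python
import bisect


class IsamLeaf:
    def __init__(self, entries, capacity=2):
        self.primary = list(entries)   # primary page, allocated at build time
        self.overflow = []             # overflow chain, grows on demand
        self.capacity = capacity


class Isam:
    def __init__(self, separators, leaves):
        self.separators = separators   # static: never modified after the build
        self.leaves = leaves

    def _leaf(self, key):
        return self.leaves[bisect.bisect_right(self.separators, key)]

    def insert(self, key):
        leaf = self._leaf(key)
        if len(leaf.primary) < leaf.capacity:
            leaf.primary.append(key)
        else:
            leaf.overflow.append(key)  # no split; the chain just gets longer

    def delete(self, key):
        leaf = self._leaf(key)
        if key in leaf.overflow:
            leaf.overflow.remove(key)
        elif key in leaf.primary:
            leaf.primary.remove(key)   # page stays allocated even if it empties

    def search(self, key):
        leaf = self._leaf(key)
        return key in leaf.primary or key in leaf.overflow


isam = Isam(
    separators=[20, 33, 40, 51, 63],
    leaves=[IsamLeaf(p) for p in
            ([10, 15], [20, 27], [33, 37], [40, 46], [51, 55], [63, 97])],
)
isam.insert(42)               # leaf [40, 46] is full, so 42 goes to overflow
for k in (42, 51, 97):
    isam.delete(k)
```

After the deletes, 51 is gone from the data but still appears in `isam.separators`; that is harmless for searching, since index keys only direct traffic. The flip side is that long overflow chains are never cleaned up by the structure itself.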
[Figure: B+ tree structure. Index entries (direct search) on top; data entries ("sequence set") at the leaf level.]
Example B+ Tree
2* 3* 5* 7* 14* 16* 19* 20* 22* 24* 27* 29* 33* 34* 38* 39*
Root
13 17 24 30
2* 3* 5* 7* 14* 16* 19* 20* 22* 23* 24* 27* 29* 33* 34* 38* 39*
Example B+ Tree - Inserting 8*
Root
5 13 17 24 30
2* 3* 5* 7* 14* 16* 19* 20* 22* 24* 27* 29* 33* 34* 38* 39*
2* 3* 5* 7* 8*
Root
17
5 13 24 30
2* 3* 5* 7* 8* 14* 16* 19* 20* 22* 24* 27* 29* 33* 34* 38* 39*
• Observe how minimum occupancy is guaranteed in
both leaf and index pg splits.
• Note difference between copy-up and push-up; be
sure you understand the reasons for this.
[Figure: leaf page split]
2* 3* 5* 7* 8* …
Entry to be inserted in parent node: 5
(Note that 5 is copied up and continues to appear in the leaf.)
[Figure: index page split]
5 13 17 24 30
Entry to be inserted in parent node: 17
(Note that 17 is pushed up and only appears once in the index.
Contrast this with a leaf split.)
5 13 24 30
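The copy-up versus push-up distinction is easy to see in code. The following is a minimal sketch on plain Python lists, not a full B+ tree; the function names and the child labels `c0` … `c5` are hypothetical.

```python
def split_leaf(entries):
    """Split an overfull leaf holding 2d+1 entries.

    The first key of the new right leaf is COPIED up: it goes to the
    parent and also stays in the leaf, because leaves hold the data entries.
    """
    mid = len(entries) // 2
    left, right = entries[:mid], entries[mid:]
    return left, right[0], right


def split_index(keys, children):
    """Split an overfull index node holding 2d+1 keys and 2d+2 children.

    The middle key is PUSHED up: it moves to the parent and appears in
    neither half, because index keys only direct searches.
    """
    mid = len(keys) // 2
    up = keys[mid]
    left = (keys[:mid], children[:mid + 1])
    right = (keys[mid + 1:], children[mid + 1:])
    return left, up, right


# Inserting 8* overflows the leaf 2* 3* 5* 7* (d = 2):
print(split_leaf([2, 3, 5, 7, 8]))
# ([2, 3], 5, [5, 7, 8])

# Adding 5 to the index page 13 17 24 30 overflows it in turn:
print(split_index([5, 13, 17, 24, 30], ["c0", "c1", "c2", "c3", "c4", "c5"]))
# (([5, 13], ['c0', 'c1', 'c2']), 17, ([24, 30], ['c3', 'c4', 'c5']))
```

Both halves of each split hold at least d entries, which is the minimum-occupancy guarantee mentioned above.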
Deleting a Data Entry from a B+ Tree
• Start at root, find leaf L where entry belongs.
• Remove the entry.
– If L is at least half-full, done!
– If L has only d-1 entries,
• Try to re-distribute, borrowing from sibling
(adjacent node with same parent as L).
• If re-distribution fails, merge L and
sibling.
• If merge occurred, must delete entry (pointing to
L or sibling) from parent of L.
• Merge could propagate to root, decreasing height.
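The leaf-level part of this procedure can be sketched as follows. This is a simplified illustration under stated assumptions: leaves are plain lists, only the right sibling is considered, redistribution borrows a single entry, and the parent update is reported as a return value rather than performed. The function name `delete_from_leaf` is hypothetical.

```python
def delete_from_leaf(leaf, right_sibling, key, d):
    """Delete key from a leaf of a B+ tree of order d (d..2d entries per leaf).

    Returns (action, leaf, right_sibling, separator), where separator is the
    new parent key between the two leaves after a redistribution, else None.
    """
    leaf = [k for k in leaf if k != key]
    right = list(right_sibling)
    if len(leaf) >= d:
        return "done", leaf, right, None           # still at least half full
    if len(right) > d:
        leaf.append(right.pop(0))                  # borrow smallest entry
        return "redistribute", leaf, right, right[0]
    # Sibling has exactly d entries: merge. The caller must then delete the
    # parent entry that pointed to the sibling, which may cascade upward.
    return "merge", leaf + right, [], None


d = 2
step1 = delete_from_leaf([19, 20, 22], [24, 27, 29], 19, d)
step2 = delete_from_leaf(step1[1], step1[2], 20, d)
step3 = delete_from_leaf(step2[1], step2[2], 24, d)
```

Here `step1` needs no repair, `step2` borrows 24* from the sibling and reports 27 as the new separator, and `step3` has to merge because the sibling is down to exactly d entries.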
Example Tree (including 8*)
5 13 24 30
2* 3* 5* 7* 14* 16* 19* 20* 22* 24* 27* 29* 33* 34* 38* 39*
2* 3* 5* 7* 8* 14* 16* 19* 20* 22* 24* 27* 29* 33* 34* 38* 39*
Example Tree (including 8*)
5 13 24 30
5 13 27 30
2* 3* 5* 7* 8* 14* 16* 19* 20* 22* 24* 27* 29* 33* 34* 38* 39*
2* 3* 5* 7* 8* 14* 16* 22* 24* 27* 29* 33* 34* 38* 39*
down’ of index
entry (below).
Root
5 13 17 30
22
5 13 17 20 30
2* 3* 5* 7* 8* 14* 16* 17* 18* 20* 21* 22* 27* 29* 33* 34* 38* 39*
After Re-distribution
• Intuitively, entries are re-distributed by
‘pushing through’ the splitting entry in
the parent node.
• It suffices to re-distribute index entry
with key 20; we’ve re-distributed 17 as
well for illustration.
Root
17
5 13 20 22 30
2* 3* 5* 7* 8* 14* 16* 17* 18* 20* 21* 22* 27* 29* 33* 34* 38* 39*
Prefix Key Compression
• Important to increase fan-out. (Why?)
• Key values in index entries only ‘direct traffic’; can
often compress them.
– E.g., If we have adjacent index entries with search
key values Dannon Yogurt, David Smith and
Devarakonda Murthy, we can abbreviate David Smith to
Dav. (The other keys can be compressed too ...)
• Is this correct? Not quite! What if there is a data entry
Davey Jones? (Can only compress David Smith to Davi)
• In general, while compressing, must leave each index entry
greater than every key value (in any subtree) to its left.
• Insert/delete must be suitably modified.
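The rule in the last bullets can be turned into a short Python sketch: given the largest key anywhere in the subtree to the left, take the shortest prefix of the index key that still compares strictly greater. The function name `shortest_separator` is hypothetical, and plain Python string comparison is assumed as the key ordering.

```python
def shortest_separator(left_max, key):
    """Shortest prefix of key that is strictly greater than left_max.

    left_max is the largest key value in the subtree to the left of this
    index entry; key is the uncompressed index key (left_max < key).
    """
    assert left_max < key
    for n in range(1, len(key) + 1):
        if key[:n] > left_max:
            return key[:n]
    return key  # not reached when left_max < key; kept as a safe fallback


# Looking only at the neighbouring index entry suggests "Dav" ...
print(shortest_separator("Dannon Yogurt", "David Smith"))   # Dav
# ... but a data entry "Davey Jones" in the left subtree forces "Davi".
print(shortest_separator("Davey Jones", "David Smith"))     # Davi
```

The sketch makes the catch visible: the correct `left_max` comes from the data entries below, not from the adjacent index entry, which is why insert and delete have to maintain the compressed keys.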
Bulk Loading of a B+ Tree
• If we have a large collection of records,
and we want to create a B+ tree on some
field, doing so by repeatedly inserting
records is very slow.
– Also leads to minimal leaf utilization
--- why?
• Bulk Loading can be done much more
efficiently.
• Initialization: Sort all data entries, …
[Figure: root page above the sorted pages of data entries, which are not yet in the B+ tree.]
… one considers locking!
3* 4* 6* 9* 10* 11* 12* 13* 20* 22* 23* 31* 35* 36* 38* 41* 44*
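The idea of building from sorted data can be sketched in Python. Note that this is a simplified level-by-level build, not the rightmost-path insertion procedure usually described for bulk loading, and it may leave the last node on a level underfull. The dict-based node layout and the parameters `leaf_capacity` and `fanout` are assumptions for illustration.

```python
import bisect


def bulk_load(sorted_entries, leaf_capacity=2, fanout=3):
    """Build a tree bottom-up from already-sorted data entries; returns the root."""
    assert sorted_entries, "sketch assumes at least one entry"
    # Pack the sorted entries into leaves, left to right.
    level = [
        {"keys": sorted_entries[i:i + leaf_capacity],
         "min": sorted_entries[i],
         "children": None}
        for i in range(0, len(sorted_entries), leaf_capacity)
    ]
    # Build index levels until a single root remains.
    while len(level) > 1:
        parents = []
        for i in range(0, len(level), fanout):
            group = level[i:i + fanout]
            parents.append({
                # Separator i is the smallest key under child i+1.
                "keys": [child["min"] for child in group[1:]],
                "min": group[0]["min"],
                "children": group,
            })
        level = parents
    return level[0]


def search(node, key):
    """Descend from the root to a leaf and test membership."""
    while node["children"] is not None:
        node = node["children"][bisect.bisect_right(node["keys"], key)]
    return key in node["keys"]


data = [3, 4, 6, 9, 10, 11, 12, 13, 20, 22, 23, 31, 35, 36, 38, 41, 44]
root = bulk_load(data)
```

Each page is written once and the leaves come out in key order, whereas repeated inserts would walk from the root for every record and split pages along the way.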
Summary of Bulk Loading