
Commit 9e85183
1 parent c9537ca

Major overhaul of btree index code.  Eliminate special BTP_CHAIN logic for
duplicate keys by letting search go to the left rather than right when an
equal key is seen at an upper tree level.  Fix poor choice of page split
point (leading to insertion failures) that was forced by chaining logic.
Don't store leftmost key in non-leaf pages, since it's not necessary.
Don't create root page until something is first stored in the index, so an
unused index is now 8K not 16K.  (Doesn't seem to be as easy to get rid of
the metadata page, unfortunately.)  Massive cleanup of unreadable code,
fix poor, obsolete, and just plain wrong documentation and comments.  See
src/backend/access/nbtree/README for the gory details.

File tree: 11 files changed, +1613 -2843 lines

src/backend/access/nbtree/README

Lines changed: 164 additions & 57 deletions
@@ -1,68 +1,175 @@
-$Header: /cvsroot/pgsql/src/backend/access/nbtree/README,v 1.1.1.1 1996/07/09 06:21:12 scrappy Exp $
+$Header: /cvsroot/pgsql/src/backend/access/nbtree/README,v 1.2 2000/07/21 06:42:32 tgl Exp $
 
 This directory contains a correct implementation of Lehman and Yao's
-btree management algorithm that supports concurrent access for Postgres.
+high-concurrency B-tree management algorithm (P. Lehman and S. Yao,
+Efficient Locking for Concurrent Operations on B-Trees, ACM Transactions
+on Database Systems, Vol 6, No. 4, December 1981, pp 650-670).
+
 We have made the following changes in order to incorporate their algorithm
 into Postgres:
 
-+ The requirement that all btree keys be unique is too onerous,
-  but the algorithm won't work correctly without it.  As a result,
-  this implementation adds an OID (guaranteed to be unique) to
-  every key in the index.  This guarantees uniqueness within a set
-  of duplicates.  Space overhead is four bytes.
-
-  For this reason, when we're passed an index tuple to store by the
-  common access method code, we allocate a larger one and copy the
-  supplied tuple into it.  No Postgres code outside of the btree
-  access method knows about this xid or sequence number.
-
-+ Lehman and Yao don't require read locks, but assume that in-memory
-  copies of tree nodes are unshared.  Postgres shares in-memory
-  buffers among backends.  As a result, we do page-level read
-  locking on btree nodes in order to guarantee that no record is
-  modified while we are examining it.  This reduces concurrency but
-  guarantees correct behavior.
-
-+ Read locks on a page are held for as long as a scan has a pointer
-  to the page.  However, locks are always surrendered before the
-  sibling page lock is acquired (for readers), so we remain
-  deadlock-free.  I will do a formal proof if I get bored anytime soon.
++ The requirement that all btree keys be unique is too onerous,
+  but the algorithm won't work correctly without it.  Fortunately, it is
+  only necessary that keys be unique on a single tree level, because L&Y
+  only use the assumption of key uniqueness when re-finding a key in a
+  parent node (to determine where to insert the key for a split page).
+  Therefore, we can use the link field to disambiguate multiple
+  occurrences of the same user key: only one entry in the parent level
+  will be pointing at the page we had split.  (Indeed we need not look at
+  the real "key" at all, just at the link field.)  We can distinguish
+  items at the leaf level in the same way, by examining their links to
+  heap tuples; we'd never have two items for the same heap tuple.
+
++ Lehman and Yao assume that the key range for a subtree S is described
+  by Ki < v <= Ki+1 where Ki and Ki+1 are the adjacent keys in the parent
+  node.  This does not work for nonunique keys (for example, if we have
+  enough equal keys to spread across several leaf pages, there *must* be
+  some equal bounding keys in the first level up).  Therefore we assume
+  Ki <= v <= Ki+1 instead.  A search that finds exact equality to a
+  bounding key in an upper tree level must descend to the left of that
+  key to ensure it finds any equal keys in the preceding page.  An
+  insertion that sees the high key of its target page is equal to the key
+  to be inserted has a choice whether or not to move right, since the new
+  key could go on either page.  (Currently, we try to find a page where
+  there is room for the new key without a split.)
+
++ Lehman and Yao don't require read locks, but assume that in-memory
+  copies of tree nodes are unshared.  Postgres shares in-memory buffers
+  among backends.  As a result, we do page-level read locking on btree
+  nodes in order to guarantee that no record is modified while we are
+  examining it.  This reduces concurrency but guarantees correct
+  behavior.  An advantage is that when trading in a read lock for a
+  write lock, we need not re-read the page after getting the write lock.
+  Since we're also holding a pin on the shared buffer containing the
+  page, we know that buffer still contains the page and is up-to-date.
+
++ We support the notion of an ordered "scan" of an index as well as
+  insertions, deletions, and simple lookups.  A scan in the forward
+  direction is no problem, we just use the right-sibling pointers that
+  L&Y require anyway.  (Thus, once we have descended the tree to the
+  correct start point for the scan, the scan looks only at leaf pages
+  and never at higher tree levels.)  To support scans in the backward
+  direction, we also store a "left sibling" link much like the "right
+  sibling".  (This adds an extra step to the L&Y split algorithm: while
+  holding the write lock on the page being split, we also lock its former
+  right sibling to update that page's left-link.  This is safe since no
+  writer of that page can be interested in acquiring a write lock on our
+  page.)  A backwards scan has one additional bit of complexity: after
+  following the left-link we must account for the possibility that the
+  left sibling page got split before we could read it.  So, we have to
+  move right until we find a page whose right-link matches the page we
+  came from.
+
++ Read locks on a page are held for as long as a scan has a pointer
+  to the page.  However, locks are always surrendered before the
+  sibling page lock is acquired (for readers), so we remain
+  deadlock-free.  I will do a formal proof if I get bored anytime soon.
+  NOTE: nbtree.c arranges to drop the read lock, but not the buffer pin,
+  on the current page of a scan before control leaves nbtree.  When we
+  come back to resume the scan, we have to re-grab the read lock and
+  then move right if the current item moved (see _bt_restscan()).
+
++ Lehman and Yao fail to discuss what must happen when the root page
+  becomes full and must be split.  Our implementation is to split the
+  root in the same way that any other page would be split, then construct
+  a new root page holding pointers to both of the resulting pages (which
+  now become siblings on level 2 of the tree).  The new root page is then
+  installed by altering the root pointer in the meta-data page (see
+  below).  This works because the root is not treated specially in any
+  other way --- in particular, searches will move right using its link
+  pointer if the link is set.  Therefore, searches will find the data
+  that's been moved into the right sibling even if they read the metadata
+  page before it got updated.  This is the same reasoning that makes a
+  split of a non-root page safe.  The locking considerations are similar too.
+
++ Lehman and Yao assume fixed-size keys, but we must deal with
+  variable-size keys.  Therefore there is not a fixed maximum number of
+  keys per page; we just stuff in as many as will fit.  When we split a
+  page, we try to equalize the number of bytes, not items, assigned to
+  each of the resulting pages.  Note we must include the incoming item in
+  this calculation, otherwise it is possible to find that the incoming
+  item doesn't fit on the split page where it needs to go!
 
 In addition, the following things are handy to know:
 
-+ Page zero of every btree is a meta-data page.  This page stores
-  the location of the root page, a pointer to a list of free
-  pages, and other stuff that's handy to know.
-
-+ This algorithm doesn't really work, since it requires ordered
-  writes, and UNIX doesn't support ordered writes.
-
-+ There's one other case where we may screw up in this
-  implementation.  When we start a scan, we descend the tree
-  to the key nearest the one in the qual, and once we get there,
-  position ourselves correctly for the qual type (eg, <, >=, etc).
-  If we happen to step off a page, decide we want to get back to
-  it, and fetch the page again, and if some bad person has split
-  the page and moved the last tuple we saw off of it, then the
-  code complains about botched concurrency in an elog(WARN, ...)
-  and gives up the ghost.  This is the ONLY violation of Lehman
-  and Yao's guarantee of correct behavior that I am aware of in
-  this code.
++ Page zero of every btree is a meta-data page.  This page stores
+  the location of the root page, a pointer to a list of free
+  pages, and other stuff that's handy to know.  (Currently, we
+  never shrink btree indexes so there are never any free pages.)
+
++ The algorithm assumes we can fit at least three items per page
+  (a "high key" and two real data items).  Therefore it's unsafe
+  to accept items larger than 1/3rd page size.  Larger items would
+  work sometimes, but could cause failures later on depending on
+  what else gets put on their page.
+
++ This algorithm doesn't guarantee btree consistency after a kernel crash
+  or hardware failure.  To do that, we'd need ordered writes, and UNIX
+  doesn't support ordered writes (short of fsync'ing every update, which
+  is too high a price).  Rebuilding corrupted indexes during restart
+  seems more attractive.
+
++ On deletions, we need to adjust the position of active scans on
+  the index.  The code in nbtscan.c handles this.  We don't need to
+  do this for insertions or splits because _bt_restscan can find the
+  new position of the previously-found item.  NOTE that nbtscan.c
+  only copes with deletions issued by the current backend.  This
+  essentially means that concurrent deletions are not supported, but
+  that's true already in the Lehman and Yao algorithm.  nbtscan.c
+  exists only to support VACUUM and allow it to delete items while
+  it's scanning the index.
+
+Notes about data representation:
+
++ The right-sibling link required by L&Y is kept in the page "opaque
+  data" area, as is the left-sibling link and some flags.
+
++ We also keep a parent link in the opaque data, but this link is not
+  very trustworthy because it is not updated when the parent page splits.
+  Thus, it points to some page on the parent level, but possibly a page
+  well to the left of the page's actual current parent.  In most cases
+  we do not need this link at all.  Normally we return to a parent page
+  using a stack of entries that are made as we descend the tree, as in L&Y.
+  There is exactly one case where the stack will not help: concurrent
+  root splits.  If an inserter process needs to split what had been the
+  root when it started its descent, but finds that that page is no longer
+  the root (because someone else split it meanwhile), then it uses the
+  parent link to move up to the next level.  This is OK because we do fix
+  the parent link in a former root page when splitting it.  This logic
+  will work even if the root is split multiple times (even up to creation
+  of multiple new levels) before an inserter returns to it.  The same
+  could not be said of finding the new root via the metapage, since that
+  would work only for a single level of added root.
+
++ The Postgres disk block data format (an array of items) doesn't fit
+  Lehman and Yao's alternating-keys-and-pointers notion of a disk page,
+  so we have to play some games.
+
++ On a page that is not rightmost in its tree level, the "high key" is
+  kept in the page's first item, and real data items start at item 2.
+  The link portion of the "high key" item goes unused.  A page that is
+  rightmost has no "high key", so data items start with the first item.
+  Putting the high key at the left, rather than the right, may seem odd,
+  but it avoids moving the high key as we add data items.
+
++ On a leaf page, the data items are simply links to (TIDs of) tuples
+  in the relation being indexed, with the associated key values.
+
++ On a non-leaf page, the data items are down-links to child pages with
+  bounding keys.  The key in each data item is the *lower* bound for
+  keys on that child page, so logically the key is to the left of that
+  downlink.  The high key (if present) is the upper bound for the last
+  downlink.  The first data item on each such page has no lower bound
+  --- or lower bound of minus infinity, if you prefer.  The comparison
+  routines must treat it accordingly.  The actual key stored in the
+  item is irrelevant, and need not be stored at all.  This arrangement
+  corresponds to the fact that an L&Y non-leaf page has one more pointer
+  than key.
 
 Notes to operator class implementors:
 
-  With this implementation, we require the user to supply us with
-  a procedure for pg_amproc.  This procedure should take two keys
-  A and B and return < 0, 0, or > 0 if A < B, A = B, or A > B,
-  respectively.  See the contents of that relation for the btree
-  access method for some samples.
-
-Notes to mao for implementation document:
-
-On deletions, we need to adjust the position of active scans on
-the index.  The code in nbtscan.c handles this.  We don't need to
-do this for splits because of the way splits are handled; if they
-happen behind us, we'll automatically go to the next page, and if
-they happen in front of us, we're not affected by them.  For
-insertions, if we inserted a tuple behind the current scan location
-on the current scan page, we move one space ahead.
++ With this implementation, we require the user to supply us with
+  a procedure for pg_amproc.  This procedure should take two keys
+  A and B and return < 0, 0, or > 0 if A < B, A = B, or A > B,
+  respectively.  See the contents of that relation for the btree
+  access method for some samples.
