$Header: /cvsroot/pgsql/src/backend/access/nbtree/README,v 1.2 2000/07/21 06:42:32 tgl Exp $

This directory contains a correct implementation of Lehman and Yao's
high-concurrency B-tree management algorithm (P. Lehman and S. Yao,
Efficient Locking for Concurrent Operations on B-Trees, ACM Transactions
on Database Systems, Vol 6, No. 4, December 1981, pp 650-670).

We have made the following changes in order to incorporate their algorithm
into Postgres:

+ The requirement that all btree keys be unique is too onerous,
  but the algorithm won't work correctly without it. Fortunately, it is
  only necessary that keys be unique on a single tree level, because L&Y
  only use the assumption of key uniqueness when re-finding a key in a
  parent node (to determine where to insert the key for a split page).
  Therefore, we can use the link field to disambiguate multiple
  occurrences of the same user key: only one entry in the parent level
  will be pointing at the page we had split. (Indeed we need not look at
  the real "key" at all, just at the link field.) We can distinguish
  items at the leaf level in the same way, by examining their links to
  heap tuples; we'd never have two items for the same heap tuple.

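  To illustrate re-finding by link rather than by key, here is a hedged
  sketch of locating a split page's parent entry by child block number.
  The types and the flat item array are simplified stand-ins for the
  real page layout, not actual backend code:

      typedef unsigned int BlockNumber;
      typedef unsigned short OffsetNumber;

      typedef struct
      {
          BlockNumber child;  /* downlink: block number of a child page */
          /* ... the user-visible key bytes would follow here ... */
      } ParentItem;

      /*
       * Return the 1-based offset of the entry whose downlink points at
       * 'childblk', or 0 if it is not on this page (caller moves right).
       * Links are unique on a level, so at most one entry can match.
       */
      static OffsetNumber
      find_downlink(ParentItem *items, OffsetNumber nitems,
                    BlockNumber childblk)
      {
          OffsetNumber off;

          for (off = 1; off <= nitems; off++)
              if (items[off - 1].child == childblk)
                  return off;
          return 0;
      }
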
+ Lehman and Yao assume that the key range for a subtree S is described
  by Ki < v <= Ki+1 where Ki and Ki+1 are the adjacent keys in the parent
  node. This does not work for nonunique keys (for example, if we have
  enough equal keys to spread across several leaf pages, there *must* be
  some equal bounding keys in the first level up). Therefore we assume
  Ki <= v <= Ki+1 instead. A search that finds exact equality to a
  bounding key in an upper tree level must descend to the left of that
  key to ensure it finds any equal keys in the preceding page. An
  insertion that sees that the high key of its target page is equal to
  the key to be inserted has a choice whether or not to move right, since
  the new key could go on either page. (Currently, we try to find a page
  where there is room for the new key without a split.)

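  A hedged sketch of the descent rule this implies, using plain integer
  keys as a stand-in for real scankeys (illustration only):

      /*
       * Pick the child to descend into on an internal page.  key[0..n-1]
       * are the bounding keys; child i lies to the left of key[i], and
       * child n is rightmost.  Exact equality descends left, so equal
       * keys on earlier pages are not missed.
       */
      static int
      choose_downlink(const int *key, int nkeys, int searchkey)
      {
          int i;

          for (i = 0; i < nkeys; i++)
              if (key[i] >= searchkey)    /* equality goes left too */
                  return i;
          return nkeys;                   /* rightmost child */
      }
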
+ Lehman and Yao don't require read locks, but assume that in-memory
  copies of tree nodes are unshared. Postgres shares in-memory buffers
  among backends. As a result, we do page-level read locking on btree
  nodes in order to guarantee that no record is modified while we are
  examining it. This reduces concurrency but guarantees correct
  behavior. An advantage is that when trading in a read lock for a
  write lock, we need not re-read the page after getting the write lock.
  Since we're also holding a pin on the shared buffer containing the
  page, we know that buffer still contains the page and is up-to-date.

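  The trade-in itself is just two buffer-manager calls, roughly as below
  (a sketch; check bufmgr.h and nbtree.h for the real lock-mode names):

      #include "postgres.h"
      #include "access/nbtree.h"
      #include "storage/bufmgr.h"

      /*
       * Sketch: upgrade our hold on a btree page from read to write
       * lock.  Our pin on the buffer prevents the buffer manager from
       * recycling it, so the page need not be re-read afterward (though
       * its contents may have changed while we held no lock).
       */
      static void
      trade_in_for_write_lock(Buffer buf)
      {
          LockBuffer(buf, BUFFER_LOCK_UNLOCK);  /* drop shared lock */
          LockBuffer(buf, BT_WRITE);            /* get exclusive lock */
      }
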
+ We support the notion of an ordered "scan" of an index as well as
  insertions, deletions, and simple lookups. A scan in the forward
  direction is no problem: we just use the right-sibling pointers that
  L&Y require anyway. (Thus, once we have descended the tree to the
  correct start point for the scan, the scan looks only at leaf pages
  and never at higher tree levels.) To support scans in the backward
  direction, we also store a "left sibling" link much like the "right
  sibling". (This adds an extra step to the L&Y split algorithm: while
  holding the write lock on the page being split, we also lock its former
  right sibling to update that page's left-link. This is safe since no
  writer of that page can be interested in acquiring a write lock on our
  page.) A backwards scan has one additional bit of complexity: after
  following the left-link we must account for the possibility that the
  left sibling page got split before we could read it. So, we have to
  move right until we find a page whose right-link matches the page we
  came from, as in the sketch below.

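  A hedged sketch of that move-right step (read_page(), left_link(), and
  right_link() are hypothetical helpers standing in for the real buffer
  and page-opaque accessors):

      typedef unsigned int BlockNumber;

      extern void *read_page(BlockNumber blkno);      /* hypothetical */
      extern BlockNumber left_link(void *page);       /* hypothetical */
      extern BlockNumber right_link(void *page);      /* hypothetical */

      /*
       * Step a backwards scan from the page at 'curblk' to its true
       * left sibling, allowing for splits of that sibling that happened
       * after our left-link was last updated.
       */
      static void *
      step_left(void *curpage, BlockNumber curblk)
      {
          BlockNumber blk = left_link(curpage);
          void       *page = read_page(blk);

          while (right_link(page) != curblk)  /* sibling was split */
              page = read_page(right_link(page));
          return page;
      }
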
+ Read locks on a page are held for as long as a scan has a pointer
  to the page. However, locks are always surrendered before the
  sibling page lock is acquired (for readers), so we remain deadlock-
  free. I will do a formal proof if I get bored anytime soon.
  NOTE: nbtree.c arranges to drop the read lock, but not the buffer pin,
  on the current page of a scan before control leaves nbtree. When we
  come back to resume the scan, we have to re-grab the read lock and
  then move right if the current item moved (see _bt_restscan()).

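  In outline, resuming looks like the following hedged sketch (heap TIDs
  are modeled as plain ints and the page accessors are hypothetical):

      extern int  num_items(void *page);              /* hypothetical */
      extern int  item_tid(void *page, int offset);   /* hypothetical */
      extern void *right_sibling(void *page);         /* hypothetical */

      /*
       * After re-acquiring the read lock, find where the item we last
       * returned (identified by its heap TID) lives now: either still
       * on this page, or on some page to the right if a split moved it.
       * The item must exist somewhere, so the loop terminates.
       */
      static int
      restore_scan_position(void **page, int offset, int target_tid)
      {
          for (;;)
          {
              int max = num_items(*page);
              int off;

              for (off = offset; off <= max; off++)
                  if (item_tid(*page, off) == target_tid)
                      return off;
              *page = right_sibling(*page);   /* item moved right */
              offset = 1;
          }
      }
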
+ Lehman and Yao fail to discuss what must happen when the root page
  becomes full and must be split. Our implementation is to split the
  root in the same way that any other page would be split, then construct
  a new root page holding pointers to both of the resulting pages (which
  now become siblings on the level below the new root). The new root
  page is then installed by altering the root pointer in the meta-data
  page (see below). This works because the root is not treated specially
  in any other way --- in particular, searches will move right using its
  link pointer if the link is set. Therefore, searches will find the
  data that's been moved into the right sibling even if they read the
  meta-data page before it got updated. This is the same reasoning that
  makes a split of a non-root page safe. The locking considerations are
  similar too.

+ Lehman and Yao assume fixed-size keys, but we must deal with
  variable-size keys. Therefore there is not a fixed maximum number of
  keys per page; we just stuff in as many as will fit. When we split a
  page, we try to equalize the number of bytes, not items, assigned to
  each of the resulting pages. Note we must include the incoming item in
  this calculation, otherwise it is possible to find that the incoming
  item doesn't fit on the split page where it needs to go!

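  A hedged sketch of byte-balanced split-point selection (sizes are
  plain ints, and the incoming item is assumed to be already counted in
  item_size[], which simplifies what the real code must track):

      /*
       * Choose the first item to send to the right page so that the two
       * halves get roughly equal numbers of bytes.  item_size[0..n-1]
       * includes the incoming item at its would-be position.
       */
      static int
      choose_split_point(const int *item_size, int nitems)
      {
          int total = 0;
          int leftbytes = 0;
          int i;

          for (i = 0; i < nitems; i++)
              total += item_size[i];

          for (i = 0; i < nitems - 1; i++)
          {
              if (leftbytes + item_size[i] > total / 2)
                  break;                  /* left half is full enough */
              leftbytes += item_size[i];
          }
          return i;   /* items [0, i) stay left; [i, nitems) go right */
      }
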
In addition, the following things are handy to know:

+ Page zero of every btree is a meta-data page. This page stores
  the location of the root page, a pointer to a list of free
  pages, and other stuff that's handy to know. (Currently, we
  never shrink btree indexes so there are never any free pages.)

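  The meta-data page contents amount to a small struct, roughly as
  below. This is a sketch reconstructed from the description above;
  field names are illustrative, not copied from the real headers:

      typedef unsigned int uint32;
      typedef unsigned int BlockNumber;

      typedef struct
      {
          uint32      magic;      /* marks the page as a btree metapage */
          uint32      version;    /* btree layout version */
          BlockNumber root;       /* block number of current root page */
          uint32      level;      /* tree level of the root */
          BlockNumber freelist;   /* head of free-page list (unused now) */
      } BTreeMetaData;
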
+ The algorithm assumes we can fit at least three items per page
  (a "high key" and two real data items). Therefore it's unsafe
  to accept items larger than 1/3rd page size. Larger items would
  work sometimes, but could cause failures later on depending on
  what else gets put on their page.

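  For instance, with the usual 8K pages the ceiling works out to a bit
  under 3K per item; the overhead figure below is a rough stand-in for
  the real page-header and item-pointer costs:

      #include <stdio.h>

      int
      main(void)
      {
          int page_size = 8192;   /* typical Postgres block size */
          int overhead = 40;      /* header + special space, roughly */
          int usable = page_size - overhead;

          /* every page must hold a high key plus two real items */
          printf("max safe item size ~ %d bytes\n", usable / 3);
          return 0;
      }
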
+ This algorithm doesn't guarantee btree consistency after a kernel crash
  or hardware failure. To do that, we'd need ordered writes, and UNIX
  doesn't support ordered writes (short of fsync'ing every update, which
  is too high a price). Rebuilding corrupted indexes during restart
  seems more attractive.

+ On deletions, we need to adjust the position of active scans on
  the index. The code in nbtscan.c handles this. We don't need to
  do this for insertions or splits because _bt_restscan can find the
  new position of the previously-found item. NOTE that nbtscan.c
  only copes with deletions issued by the current backend. This
  essentially means that concurrent deletions are not supported, but
  that's true already in the Lehman and Yao algorithm. nbtscan.c
  exists only to support VACUUM and allow it to delete items while
  it's scanning the index.

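  The adjustment itself is tiny; a hedged sketch with simplified types
  (the real code must also deal with pins and locks, omitted here):

      typedef struct
      {
          unsigned int blkno;     /* page the scan is stopped on */
          int          offset;    /* item the scan last returned */
      } ScanPos;

      /*
       * If an item at or before our current position on our current
       * page is being deleted, step the position back one slot so the
       * scan's next-item arithmetic still lands on the right tuple.
       */
      static void
      adjust_scan_for_delete(ScanPos *scan,
                             unsigned int delblk, int deloff)
      {
          if (scan->blkno == delblk && deloff <= scan->offset)
              scan->offset--;
      }
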
Notes about data representation:

+ The right-sibling link required by L&Y is kept in the page "opaque
  data" area, as is the left-sibling link and some flags.

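  Concretely, the opaque-data area holds something close to the struct
  below (a sketch of BTPageOpaqueData; take exact field names and flag
  bits from access/nbtree.h rather than from here):

      typedef unsigned int   BlockNumber;
      typedef unsigned short uint16;

      typedef struct BTPageOpaqueData
      {
          BlockNumber btpo_prev;      /* left sibling (backward scans) */
          BlockNumber btpo_next;      /* right sibling, per L&Y */
          BlockNumber btpo_parent;    /* parent hint; see next item */
          uint16      btpo_flags;     /* leaf/root/meta status bits */
      } BTPageOpaqueData;
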
+ We also keep a parent link in the opaque data, but this link is not
  very trustworthy because it is not updated when the parent page splits.
  Thus, it points to some page on the parent level, but possibly a page
  well to the left of the page's actual current parent. In most cases
  we do not need this link at all. Normally we return to a parent page
  using a stack of entries that are made as we descend the tree, as in
  L&Y. There is exactly one case where the stack will not help:
  concurrent root splits. If an inserter process needs to split what had
  been the root when it started its descent, but finds that that page is
  no longer the root (because someone else split it meanwhile), then it
  uses the parent link to move up to the next level. This is OK because
  we do fix the parent link in a former root page when splitting it.
  This logic will work even if the root is split multiple times (even up
  to creation of multiple new levels) before an inserter returns to it.
  The same could not be said of finding the new root via the metapage,
  since that would work only for a single level of added root.

+ The Postgres disk block data format (an array of items) doesn't fit
  Lehman and Yao's alternating-keys-and-pointers notion of a disk page,
  so we have to play some games.

+ On a page that is not rightmost in its tree level, the "high key" is
  kept in the page's first item, and real data items start at item 2.
  The link portion of the "high key" item goes unused. A page that is
  rightmost has no "high key", so data items start with the first item.
  Putting the high key at the left, rather than the right, may seem odd,
  but it avoids moving the high key as we add data items.

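  This numbering convention is captured by a couple of macros along the
  lines below (a sketch mirroring the conventions in access/nbtree.h;
  verify the exact definitions there):

      #define P_NONE          0   /* block number meaning "no sibling" */
      #define P_RIGHTMOST(opaque) ((opaque)->btpo_next == P_NONE)

      #define P_HIKEY         1   /* item number of the high key */
      #define P_FIRSTKEY      2   /* first data item, when high key present */
      #define P_FIRSTDATAKEY(opaque) \
              (P_RIGHTMOST(opaque) ? P_HIKEY : P_FIRSTKEY)
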
+ On a leaf page, the data items are simply links to (TIDs of) tuples
  in the relation being indexed, with the associated key values.

+ On a non-leaf page, the data items are down-links to child pages with
  bounding keys. The key in each data item is the *lower* bound for
  keys on that child page, so logically the key is to the left of that
  downlink. The high key (if present) is the upper bound for the last
  downlink. The first data item on each such page has no lower bound
  --- or lower bound of minus infinity, if you prefer. The comparison
  routines must treat it accordingly. The actual key stored in the
  item is irrelevant, and need not be stored at all. This arrangement
  corresponds to the fact that an L&Y non-leaf page has one more pointer
  than key.

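  A hedged sketch of that special case in a comparison routine (integer
  keys stand in for real scankeys; the offset test is the point):

      /*
       * Compare a search key against the item at 'offset'.  On a
       * non-leaf page the first data item is minus infinity: every
       * search key compares greater, and its stored key is ignored.
       */
      static int
      compare_to_item(int is_leaf, int offset, int first_data_offset,
                      int searchkey, int itemkey)
      {
          if (!is_leaf && offset == first_data_offset)
              return 1;           /* searchkey > minus infinity, always */
          return (searchkey > itemkey) - (searchkey < itemkey);
      }
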
Notes to operator class implementors:

+ With this implementation, we require the user to supply us with
  a procedure for pg_amproc. This procedure should take two keys
  A and B and return < 0, 0, or > 0 if A < B, A = B, or A > B,
  respectively. See the contents of that relation for the btree
  access method for some samples.

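  For example, a support procedure for four-byte integer keys reduces
  to a three-way comparison like this sketch (the catalog plumbing that
  registers it in pg_amproc is omitted):

      /*
       * Return <0, 0, or >0 according as a sorts before, equal to, or
       * after b, as the btree access method requires.
       */
      static int
      int4_key_cmp(int a, int b)
      {
          if (a < b)
              return -1;
          if (a > b)
              return 1;
          return 0;
      }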