@@ -1,4 +1,4 @@
-$PostgreSQL: pgsql/src/backend/access/nbtree/README,v 1.10 2006/04/25 22:46:05 tgl Exp $
+$PostgreSQL: pgsql/src/backend/access/nbtree/README,v 1.11 2006/05/07 01:21:30 tgl Exp $
 
 This directory contains a correct implementation of Lehman and Yao's
 high-concurrency B-tree management algorithm (P. Lehman and S. Yao,
@@ -67,13 +67,22 @@ move right until we find a page whose right-link matches the page we
 came from. (Actually, it's even harder than that; see deletion discussion
 below.)
 
-Read locks on a page are held for as long as a scan is examining a page.
-But nbtree.c arranges to drop the read lock, but not the buffer pin,
-on the current page of a scan before control leaves nbtree. When we
-come back to resume the scan, we have to re-grab the read lock and
-then move right if the current item moved (see _bt_restscan()). Keeping
-the pin ensures that the current item cannot move left or be deleted
-(see btbulkdelete).
+Page read locks are held only for as long as a scan is examining a page.
+To minimize lock/unlock traffic, an index scan always searches a leaf page
+to identify all the matching items at once, copying their heap tuple IDs
+into backend-local storage. The heap tuple IDs are then processed while
+not holding any page lock within the index. We do continue to hold a pin
+on the leaf page, to protect against concurrent deletions (see below).
+In this state the scan is effectively stopped "between" pages, either
+before or after the page it has pinned. This is safe in the presence of
+concurrent insertions and even page splits, because items are never moved
+across pre-existing page boundaries --- so the scan cannot miss any items
+it should have seen, nor accidentally return the same item twice. The scan
+must remember the page's right-link at the time it was scanned, since that
+is the page to move right to; if we move right to the current right-link
+then we'd re-scan any items moved by a page split. We don't similarly
+remember the left-link, since it's best to use the most up-to-date
+left-link when trying to move left (see detailed move-left algorithm below).
 
 In most cases we release our lock and pin on a page before attempting
 to acquire pin and lock on the page we are moving to. In a few places
@@ -119,14 +128,33 @@ item doesn't fit on the split page where it needs to go!
 The deletion algorithm
 ----------------------
 
-Deletions of leaf items are handled by getting a super-exclusive lock on
-the target page, so that no other backend has a pin on the page when the
-deletion starts. This means no scan is pointing at the page, so no other
-backend can lose its place due to the item deletion.
-
-The above does not work for deletion of items in internal pages, since
-other backends keep no lock nor pin on a page they have descended past.
-Instead, when a backend is ascending the tree using its stack, it must
+Before deleting a leaf item, we get a super-exclusive lock on the target
+page, so that no other backend has a pin on the page when the deletion
+starts. This is not necessary for correctness in terms of the btree index
+operations themselves; as explained above, index scans logically stop
+"between" pages and so can't lose their place. The reason we do it is to
+provide an interlock between non-full VACUUM and indexscans. Since VACUUM
+deletes index entries before deleting tuples, the super-exclusive lock
+guarantees that VACUUM can't delete any heap tuple that an indexscanning
+process might be about to visit. (This guarantee works only for simple
+indexscans that visit the heap in sync with the index scan, not for bitmap
+scans. We only need the guarantee when using non-MVCC snapshot rules such
+as SnapshotNow, so in practice this is only important for system catalog
+accesses.)
+
+Because a page can be split even while someone holds a pin on it, it is
+possible that an indexscan will return items that are no longer stored on
+the page it has a pin on, but rather somewhere to the right of that page.
+To ensure that VACUUM can't prematurely remove such heap tuples, we require
+btbulkdelete to obtain super-exclusive lock on every leaf page in the index
+(even pages that don't contain any deletable tuples). This guarantees that
+the btbulkdelete call cannot return while any indexscan is still holding
+a copy of a deleted index tuple. Note that this requirement does not say
+that btbulkdelete must visit the pages in any particular order.
+
+There is no such interlocking for deletion of items in internal pages,
+since backends keep no lock nor pin on a page they have descended past.
+Hence, when a backend is ascending the tree using its stack, it must
 be prepared for the possibility that the item it wants is to the left of
 the recorded position (but it can't have moved left out of the recorded
 page). Since we hold a lock on the lower page (per L&Y) until we have
@@ -201,7 +229,7 @@ accordingly. Searches and forward scans simply follow the right-link
 until they find a non-dead page --- this will be where the deleted page's
 key-space moved to.
 
-Stepping left in a backward scan is complicated because we must consider
+Moving left in a backward scan is complicated because we must consider
 the possibility that the left sibling was just split (meaning we must find
 the rightmost page derived from the left sibling), plus the possibility
 that the page we were just on has now been deleted and hence isn't in the
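
Illustration of the scan rule in the added README text (new lines 70-85).
This is only a toy model in Python, not the nbtree code: ToyIndex,
LeafPage, read_page, forward_scan, build and split_once are hypothetical
names invented for the sketch, and locking is not modelled at all. It tries
to show why a forward scan that follows the right-link it remembered at scan
time sees each item once, while a scan that follows the current right-link
after a split would re-read the items that moved.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class LeafPage:
    """Toy leaf page: sorted (key, heap_tid) items plus a right-link."""
    items: List[tuple] = field(default_factory=list)
    right: Optional[int] = None  # page number of right sibling, or None


class ToyIndex:
    def __init__(self) -> None:
        self.pages: Dict[int, LeafPage] = {}
        self.next_pageno = 0

    def new_page(self, items, right=None) -> int:
        pageno = self.next_pageno
        self.next_pageno += 1
        self.pages[pageno] = LeafPage(sorted(items), right)
        return pageno

    def split(self, pageno: int) -> int:
        """Move the upper half of a page to a NEW right sibling."""
        page = self.pages[pageno]
        mid = len(page.items) // 2
        new_no = self.new_page(page.items[mid:], page.right)
        page.items = page.items[:mid]
        page.right = new_no
        return new_no


def read_page(index, pageno, lo, hi):
    """Copy all matching heap TIDs at once and note the right-link."""
    page = index.pages[pageno]
    tids = [tid for key, tid in page.items if lo <= key <= hi]
    return tids, page.right  # right-link as of the time of the scan


def forward_scan(index, start, lo, hi, between_pages=None, remember=True):
    """Scan rightwards; between_pages simulates concurrent activity."""
    out = []
    pageno = start
    while pageno is not None:
        tids, saved_right = read_page(index, pageno, lo, hi)
        out.extend(tids)
        if between_pages is not None:
            between_pages(pageno)  # scan is stopped "between" pages here
        # remember=False models the mistake of using the current right-link
        pageno = saved_right if remember else index.pages[pageno].right
    return out


def build():
    """Two leaf pages: keys 1..4 on the left, keys 5..8 on the right."""
    index = ToyIndex()
    b = index.new_page([(k, f"t{k}") for k in range(5, 9)])
    a = index.new_page([(k, f"t{k}") for k in range(1, 5)], right=b)
    return index, a


def split_once(index, target):
    """Hook that splits the target page the first time the scan leaves it."""
    done = []

    def hook(pageno):
        if pageno == target and not done:
            index.split(target)
            done.append(True)

    return hook
```

In this model, scanning with remember=True after a concurrent split of the
first page yields each of t1..t8 exactly once, whereas remember=False yields
t3 and t4 a second time, because the scan steps onto the new right sibling
that the split created.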
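
Illustration of the VACUUM interlock in the added README text (new lines
131-153). Again this is a toy Python model under stated assumptions, not
the buffer manager: ToyBuffer and toy_bulkdelete are hypothetical names, and
"super-exclusive lock" is reduced to "the caller holds the only pin". Real
code would wait for the other pins to go away; the sketch merely reports the
pages on which it would have to wait.

```python
class ToyBuffer:
    """Toy buffer that only tracks which backends hold a pin."""

    def __init__(self):
        self.pins = set()

    def pin(self, backend):
        self.pins.add(backend)

    def unpin(self, backend):
        self.pins.discard(backend)

    def super_exclusive_available(self, backend):
        # Available only when the caller is the sole pin holder.
        return self.pins == {backend}


def toy_bulkdelete(leaf_pages, vacuum="vacuum"):
    """Visit EVERY leaf page; return those where VACUUM would have to wait."""
    blocked = []
    for pageno, buf in sorted(leaf_pages.items()):
        buf.pin(vacuum)
        if not buf.super_exclusive_available(vacuum):
            blocked.append(pageno)  # some scan still has this page pinned
        buf.unpin(vacuum)
    return blocked


# A scan stopped "between" pages keeps its pin on leaf page 1.
leaf_pages = {0: ToyBuffer(), 1: ToyBuffer()}
leaf_pages[1].pin("scan")
blocked_while_scanning = toy_bulkdelete(leaf_pages)

# Once the scan moves on and drops the pin, nothing blocks VACUUM.
leaf_pages[1].unpin("scan")
blocked_after_scan = toy_bulkdelete(leaf_pages)
```

The loop visits every leaf page, including pages with nothing to delete,
which mirrors the README's requirement. The visiting order here is sorted
only for determinism; the README notes that no particular order is required.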