
Commit 09cb5c0

Rewrite btree index scans to work a page at a time in all cases (both btgettuple and btgetmulti). This eliminates the problem of "re-finding" the exact stopping point, since the stopping point is effectively always a page boundary, and index items are never moved across pre-existing page boundaries.

A small penalty is that the keys_are_unique optimization is effectively disabled (and, therefore, is removed in this patch), causing us to apply _bt_checkkeys() to at least one more tuple than necessary when looking up a unique key. However, the advantages for non-unique cases seem great enough to accept this tradeoff.

Aside from simplifying and (sometimes) speeding up the indexscan code, this will allow us to reimplement btbulkdelete as a largely sequential scan instead of index-order traversal, thereby significantly reducing the cost of VACUUM. Those changes will come in a separate patch.

Original patch by Heikki Linnakangas, rework by Tom Lane.
1 parent 88d94a1 commit 09cb5c0

9 files changed: +624 −629 lines

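The commit message above describes the page-at-a-time approach only in prose. The following standalone C sketch illustrates the core idea: while the leaf page is still locked, copy every matching heap TID into backend-local storage and remember the page's right-link, so the lock can be dropped before the TIDs are processed. All names here (HeapTid, LeafPage, ScanBatch, collect_matching_items) are hypothetical illustrations, not the data structures this commit actually adds to nbtree.

#include <stdbool.h>

#define MAX_ITEMS_PER_PAGE 256

typedef struct HeapTid
{
    unsigned        block;          /* heap block number */
    unsigned short  offset;         /* line pointer within the block */
} HeapTid;

typedef struct LeafPage
{
    int         nitems;
    HeapTid     items[MAX_ITEMS_PER_PAGE];
    unsigned    right_link;         /* block number of the right sibling */
} LeafPage;

/* Backend-local state: everything the scan needs once the page lock is gone */
typedef struct ScanBatch
{
    int         nmatches;
    int         next;               /* next entry to hand back to the caller */
    HeapTid     matches[MAX_ITEMS_PER_PAGE];
    unsigned    next_page;          /* right-link remembered at scan time */
} ScanBatch;

/*
 * Called while (conceptually) holding the read lock on 'page': copy every
 * matching heap TID into backend-local storage and remember the right-link.
 * Afterwards the lock can be dropped (keeping only a pin on the page) and
 * the TIDs are processed without touching the index page again.
 */
static void
collect_matching_items(const LeafPage *page,
                       bool (*matches_keys) (const HeapTid *),
                       ScanBatch *batch)
{
    int         i;

    batch->nmatches = 0;
    batch->next = 0;
    for (i = 0; i < page->nitems; i++)
    {
        if (matches_keys(&page->items[i]))
            batch->matches[batch->nmatches++] = page->items[i];
    }

    /*
     * Remember where to step right *now*: a later split of this page could
     * change its right-link, and following the updated link would revisit
     * items that the split moved to the new right sibling.
     */
    batch->next_page = page->right_link;
}

Because the scan's position after this step is effectively "between" pages rather than on a particular index item, there is no exact stopping point to re-find when the scan resumes, which is precisely what lets the patch delete the _bt_restscan machinery described in the README diff below.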

src/backend/access/index/genam.c

Lines changed: 1 addition & 6 deletions
@@ -8,7 +8,7 @@
  *
  *
  * IDENTIFICATION
- *    $PostgreSQL: pgsql/src/backend/access/index/genam.c,v 1.54 2006/03/05 15:58:21 momjian Exp $
+ *    $PostgreSQL: pgsql/src/backend/access/index/genam.c,v 1.55 2006/05/07 01:21:30 tgl Exp $
  *
  * NOTES
  *    many of the old access method routines have been turned into
@@ -90,8 +90,6 @@ RelationGetIndexScan(Relation indexRelation,
     scan->have_lock = false;            /* ditto */
     scan->kill_prior_tuple = false;
     scan->ignore_killed_tuples = true;  /* default setting */
-    scan->keys_are_unique = false;      /* may be set by index AM */
-    scan->got_tuple = false;
 
     scan->opaque = NULL;
 
@@ -102,9 +100,6 @@ RelationGetIndexScan(Relation indexRelation,
     scan->xs_ctup.t_data = NULL;
     scan->xs_cbuf = InvalidBuffer;
 
-    scan->unique_tuple_pos = 0;
-    scan->unique_tuple_mark = 0;
-
     pgstat_initstats(&scan->xs_pgstat_info, indexRelation);
 
     /*
src/backend/access/index/indexam.c

Lines changed: 4 additions & 85 deletions
@@ -8,7 +8,7 @@
  *
  *
  * IDENTIFICATION
- *    $PostgreSQL: pgsql/src/backend/access/index/indexam.c,v 1.92 2006/05/02 22:25:10 tgl Exp $
+ *    $PostgreSQL: pgsql/src/backend/access/index/indexam.c,v 1.93 2006/05/07 01:21:30 tgl Exp $
  *
  * INTERFACE ROUTINES
  *    index_open - open an index relation by relation OID
@@ -362,10 +362,6 @@ index_rescan(IndexScanDesc scan, ScanKey key)
     }
 
     scan->kill_prior_tuple = false;     /* for safety */
-    scan->keys_are_unique = false;      /* may be set by index AM */
-    scan->got_tuple = false;
-    scan->unique_tuple_pos = 0;
-    scan->unique_tuple_mark = 0;
 
     FunctionCall2(procedure,
                   PointerGetDatum(scan),
@@ -417,8 +413,6 @@ index_markpos(IndexScanDesc scan)
     SCAN_CHECKS;
     GET_SCAN_PROCEDURE(ammarkpos);
 
-    scan->unique_tuple_mark = scan->unique_tuple_pos;
-
     FunctionCall1(procedure, PointerGetDatum(scan));
 }
 
@@ -440,13 +434,6 @@ index_restrpos(IndexScanDesc scan)
 
     scan->kill_prior_tuple = false;     /* for safety */
 
-    /*
-     * We do not reset got_tuple; so if the scan is actually being
-     * short-circuited by index_getnext, the effective position restoration is
-     * done by restoring unique_tuple_pos.
-     */
-    scan->unique_tuple_pos = scan->unique_tuple_mark;
-
     FunctionCall1(procedure, PointerGetDatum(scan));
 }
 
@@ -456,8 +443,7 @@
  * The result is the next heap tuple satisfying the scan keys and the
  * snapshot, or NULL if no more matching tuples exist.  On success,
  * the buffer containing the heap tuple is pinned (the pin will be dropped
- * at the next index_getnext or index_endscan).  The index TID corresponding
- * to the heap tuple can be obtained if needed from scan->currentItemData.
+ * at the next index_getnext or index_endscan).
  * ----------------
  */
 HeapTuple
@@ -469,65 +455,6 @@ index_getnext(IndexScanDesc scan, ScanDirection direction)
     SCAN_CHECKS;
     GET_SCAN_PROCEDURE(amgettuple);
 
-    /*
-     * If we already got a tuple and it must be unique, there's no need to
-     * make the index AM look through any additional tuples.  (This can save a
-     * useful amount of work in scenarios where there are many dead tuples due
-     * to heavy update activity.)
-     *
-     * To do this we must keep track of the logical scan position
-     * (before/on/after tuple).  Also, we have to be sure to release scan
-     * resources before returning NULL; if we fail to do so then a multi-index
-     * scan can easily run the system out of free buffers.  We can release
-     * index-level resources fairly cheaply by calling index_rescan.  This
-     * means there are two persistent states as far as the index AM is
-     * concerned: on-tuple and rescanned.  If we are actually asked to
-     * re-fetch the single tuple, we have to go through a fresh indexscan
-     * startup, which penalizes that (infrequent) case.
-     */
-    if (scan->keys_are_unique && scan->got_tuple)
-    {
-        int         new_tuple_pos = scan->unique_tuple_pos;
-
-        if (ScanDirectionIsForward(direction))
-        {
-            if (new_tuple_pos <= 0)
-                new_tuple_pos++;
-        }
-        else
-        {
-            if (new_tuple_pos >= 0)
-                new_tuple_pos--;
-        }
-        if (new_tuple_pos == 0)
-        {
-            /*
-             * We are moving onto the unique tuple from having been off it. We
-             * just fall through and let the index AM do the work.  Note we
-             * should get the right answer regardless of scan direction.
-             */
-            scan->unique_tuple_pos = 0; /* need to update position */
-        }
-        else
-        {
-            /*
-             * Moving off the tuple; must do amrescan to release index-level
-             * pins before we return NULL.  Since index_rescan will reset my
-             * state, must save and restore...
-             */
-            int         unique_tuple_mark = scan->unique_tuple_mark;
-
-            index_rescan(scan, NULL /* no change to key */ );
-
-            scan->keys_are_unique = true;
-            scan->got_tuple = true;
-            scan->unique_tuple_pos = new_tuple_pos;
-            scan->unique_tuple_mark = unique_tuple_mark;
-
-            return NULL;
-        }
-    }
-
     /* just make sure this is false... */
     scan->kill_prior_tuple = false;
 
@@ -588,14 +515,6 @@ index_getnext(IndexScanDesc scan, ScanDirection direction)
     }
 
     /* Success exit */
-    scan->got_tuple = true;
-
-    /*
-     * If we just fetched a known-unique tuple, then subsequent calls will go
-     * through the short-circuit code above.  unique_tuple_pos has been
-     * initialized to 0, which is the correct state ("on row").
-     */
-
     return heapTuple;
 }
 
@@ -608,8 +527,8 @@ index_getnext(IndexScanDesc scan, ScanDirection direction)
  * (which most callers of this routine will probably want to suppress by
  * setting scan->ignore_killed_tuples = false).
  *
- * On success (TRUE return), the found index TID is in scan->currentItemData,
- * and its heap TID is in scan->xs_ctup.t_self.  scan->xs_cbuf is untouched.
+ * On success (TRUE return), the heap TID of the found index entry is in
+ * scan->xs_ctup.t_self.  scan->xs_cbuf is untouched.
  * ----------------
  */
 bool
src/backend/access/nbtree/README

Lines changed: 45 additions & 17 deletions
@@ -1,4 +1,4 @@
-$PostgreSQL: pgsql/src/backend/access/nbtree/README,v 1.10 2006/04/25 22:46:05 tgl Exp $
+$PostgreSQL: pgsql/src/backend/access/nbtree/README,v 1.11 2006/05/07 01:21:30 tgl Exp $
 
 This directory contains a correct implementation of Lehman and Yao's
 high-concurrency B-tree management algorithm (P. Lehman and S. Yao,
@@ -67,13 +67,22 @@ move right until we find a page whose right-link matches the page we
 came from.  (Actually, it's even harder than that; see deletion discussion
 below.)
 
-Read locks on a page are held for as long as a scan is examining a page.
-But nbtree.c arranges to drop the read lock, but not the buffer pin,
-on the current page of a scan before control leaves nbtree.  When we
-come back to resume the scan, we have to re-grab the read lock and
-then move right if the current item moved (see _bt_restscan()).  Keeping
-the pin ensures that the current item cannot move left or be deleted
-(see btbulkdelete).
+Page read locks are held only for as long as a scan is examining a page.
+To minimize lock/unlock traffic, an index scan always searches a leaf page
+to identify all the matching items at once, copying their heap tuple IDs
+into backend-local storage.  The heap tuple IDs are then processed while
+not holding any page lock within the index.  We do continue to hold a pin
+on the leaf page, to protect against concurrent deletions (see below).
+In this state the scan is effectively stopped "between" pages, either
+before or after the page it has pinned.  This is safe in the presence of
+concurrent insertions and even page splits, because items are never moved
+across pre-existing page boundaries --- so the scan cannot miss any items
+it should have seen, nor accidentally return the same item twice.  The scan
+must remember the page's right-link at the time it was scanned, since that
+is the page to move right to; if we move right to the current right-link
+then we'd re-scan any items moved by a page split.  We don't similarly
+remember the left-link, since it's best to use the most up-to-date
+left-link when trying to move left (see detailed move-left algorithm below).
 
 In most cases we release our lock and pin on a page before attempting
 to acquire pin and lock on the page we are moving to.  In a few places
@@ -119,14 +128,33 @@ item doesn't fit on the split page where it needs to go!
 The deletion algorithm
 ----------------------
 
-Deletions of leaf items are handled by getting a super-exclusive lock on
-the target page, so that no other backend has a pin on the page when the
-deletion starts.  This means no scan is pointing at the page, so no other
-backend can lose its place due to the item deletion.
-
-The above does not work for deletion of items in internal pages, since
-other backends keep no lock nor pin on a page they have descended past.
-Instead, when a backend is ascending the tree using its stack, it must
+Before deleting a leaf item, we get a super-exclusive lock on the target
+page, so that no other backend has a pin on the page when the deletion
+starts.  This is not necessary for correctness in terms of the btree index
+operations themselves; as explained above, index scans logically stop
+"between" pages and so can't lose their place.  The reason we do it is to
+provide an interlock between non-full VACUUM and indexscans.  Since VACUUM
+deletes index entries before deleting tuples, the super-exclusive lock
+guarantees that VACUUM can't delete any heap tuple that an indexscanning
+process might be about to visit.  (This guarantee works only for simple
+indexscans that visit the heap in sync with the index scan, not for bitmap
+scans.  We only need the guarantee when using non-MVCC snapshot rules such
+as SnapshotNow, so in practice this is only important for system catalog
+accesses.)
+
+Because a page can be split even while someone holds a pin on it, it is
+possible that an indexscan will return items that are no longer stored on
+the page it has a pin on, but rather somewhere to the right of that page.
+To ensure that VACUUM can't prematurely remove such heap tuples, we require
+btbulkdelete to obtain super-exclusive lock on every leaf page in the index
+(even pages that don't contain any deletable tuples).  This guarantees that
+the btbulkdelete call cannot return while any indexscan is still holding
+a copy of a deleted index tuple.  Note that this requirement does not say
+that btbulkdelete must visit the pages in any particular order.
+
+There is no such interlocking for deletion of items in internal pages,
+since backends keep no lock nor pin on a page they have descended past.
+Hence, when a backend is ascending the tree using its stack, it must
 be prepared for the possibility that the item it wants is to the left of
 the recorded position (but it can't have moved left out of the recorded
 page).  Since we hold a lock on the lower page (per L&Y) until we have
@@ -201,7 +229,7 @@ accordingly.  Searches and forward scans simply follow the right-link
 until they find a non-dead page --- this will be where the deleted page's
 key-space moved to.
 
-Stepping left in a backward scan is complicated because we must consider
+Moving left in a backward scan is complicated because we must consider
 the possibility that the left sibling was just split (meaning we must find
 the rightmost page derived from the left sibling), plus the possibility
 that the page we were just on has now been deleted and hence isn't in the