
Commit 2ed5b87

Reduce pinning and buffer content locking for btree scans.
Even though the main benefit of the Lehman and Yao algorithm for btrees is that no locks need be held between page reads in an index search, we were holding a buffer pin on each leaf page after it was read until we were ready to read the next one. The reason was so that we could treat this as a weak lock to create an "interlock" with vacuum's deletion of heap line pointers, even though our README file pointed out that this was not necessary for a scan using an MVCC snapshot.

The main goal of this patch is to reduce the blocking of vacuum processes by in-progress btree index scans (including a cursor which is idle), but the code rearrangement also allows for one less buffer content lock to be taken when a forward scan steps from one page to the next, which results in a small but consistent performance improvement in many workloads.

This patch leaves behavior unchanged for some cases, which can be addressed separately so that each case can be evaluated on its own merits. These unchanged cases are when a scan uses a non-MVCC snapshot, an index-only scan, and a scan of a btree index for which modifications are not WAL-logged. If later patches allow all of these cases to drop the buffer pin after reading a leaf page, then the btree vacuum process can be simplified; it will no longer need the "super-exclusive" lock to delete tuples from a page.

Reviewed by Heikki Linnakangas and Kyotaro Horiguchi
1 parent 8217fb1 commit 2ed5b87
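To illustrate the mechanism described in the commit message, the decision of when a scan may drop its leaf-page pin can be sketched roughly as follows. This is a minimal sketch rather than an excerpt from the patch: the helper name is hypothetical, the usual nbtree/bufmgr headers are assumed, and the three conditions simply mirror the unchanged cases listed above (non-MVCC snapshot, index-only scan, non-WAL-logged index).

/*
 * Sketch only: after copying a leaf page's matching items into
 * backend-local storage, always release the buffer content lock; also
 * release the pin when no interlock with VACUUM is needed, i.e. the
 * scan uses an MVCC snapshot, the index is WAL-logged, and this is not
 * an index-only scan (which sets xs_want_itup).
 */
static void
drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp)
{
    LockBuffer(sp->buf, BUFFER_LOCK_UNLOCK);

    if (IsMVCCSnapshot(scan->xs_snapshot) &&
        RelationNeedsWAL(scan->indexRelation) &&
        !scan->xs_want_itup)
    {
        /* Safe to drop the pin: a tuple that replaces one of the copied
         * TIDs cannot be visible to our MVCC snapshot anyway. */
        ReleaseBuffer(sp->buf);
        sp->buf = InvalidBuffer;
    }
}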

File tree

8 files changed: +327 -133 lines changed

src/backend/access/nbtree/README (+50 -33)
@@ -92,17 +92,18 @@ To minimize lock/unlock traffic, an index scan always searches a leaf page
 to identify all the matching items at once, copying their heap tuple IDs
 into backend-local storage. The heap tuple IDs are then processed while
 not holding any page lock within the index. We do continue to hold a pin
-on the leaf page, to protect against concurrent deletions (see below).
-In this state the scan is effectively stopped "between" pages, either
-before or after the page it has pinned. This is safe in the presence of
-concurrent insertions and even page splits, because items are never moved
-across pre-existing page boundaries --- so the scan cannot miss any items
-it should have seen, nor accidentally return the same item twice. The scan
-must remember the page's right-link at the time it was scanned, since that
-is the page to move right to; if we move right to the current right-link
-then we'd re-scan any items moved by a page split. We don't similarly
-remember the left-link, since it's best to use the most up-to-date
-left-link when trying to move left (see detailed move-left algorithm below).
+on the leaf page in some circumstances, to protect against concurrent
+deletions (see below). In this state the scan is effectively stopped
+"between" pages, either before or after the page it has pinned. This is
+safe in the presence of concurrent insertions and even page splits, because
+items are never moved across pre-existing page boundaries --- so the scan
+cannot miss any items it should have seen, nor accidentally return the same
+item twice. The scan must remember the page's right-link at the time it
+was scanned, since that is the page to move right to; if we move right to
+the current right-link then we'd re-scan any items moved by a page split.
+We don't similarly remember the left-link, since it's best to use the most
+up-to-date left-link when trying to move left (see detailed move-left
+algorithm below).
 
 In most cases we release our lock and pin on a page before attempting
 to acquire pin and lock on the page we are moving to. In a few places

@@ -154,25 +155,37 @@ starts. This is not necessary for correctness in terms of the btree index
 operations themselves; as explained above, index scans logically stop
 "between" pages and so can't lose their place. The reason we do it is to
 provide an interlock between non-full VACUUM and indexscans. Since VACUUM
-deletes index entries before deleting tuples, the super-exclusive lock
-guarantees that VACUUM can't delete any heap tuple that an indexscanning
-process might be about to visit. (This guarantee works only for simple
-indexscans that visit the heap in sync with the index scan, not for bitmap
-scans. We only need the guarantee when using non-MVCC snapshot rules; in
-an MVCC snapshot, it wouldn't matter if the heap tuple were replaced with
-an unrelated tuple at the same TID, because the new tuple wouldn't be
-visible to our scan anyway.)
-
-Because a page can be split even while someone holds a pin on it, it is
-possible that an indexscan will return items that are no longer stored on
-the page it has a pin on, but rather somewhere to the right of that page.
-To ensure that VACUUM can't prematurely remove such heap tuples, we require
-btbulkdelete to obtain super-exclusive lock on every leaf page in the index,
-even pages that don't contain any deletable tuples. This guarantees that
-the btbulkdelete call cannot return while any indexscan is still holding
-a copy of a deleted index tuple. Note that this requirement does not say
-that btbulkdelete must visit the pages in any particular order. (See also
-on-the-fly deletion, below.)
+deletes index entries before reclaiming heap tuple line pointers, the
+super-exclusive lock guarantees that VACUUM can't reclaim for re-use a
+line pointer that an indexscanning process might be about to visit. This
+guarantee works only for simple indexscans that visit the heap in sync
+with the index scan, not for bitmap scans. We only need the guarantee
+when using non-MVCC snapshot rules; when using an MVCC snapshot, it
+doesn't matter if the heap tuple is replaced with an unrelated tuple at
+the same TID, because the new tuple won't be visible to our scan anyway.
+Therefore, a scan using an MVCC snapshot which has no other confounding
+factors will not hold the pin after the page contents are read. The
+current reasons for exceptions, where a pin is still needed, are if the
+index is not WAL-logged or if the scan is an index-only scan. If later
+work allows the pin to be dropped for all cases we will be able to
+simplify the vacuum code, since the concept of a super-exclusive lock
+for btree indexes will no longer be needed.
+
+Because a pin is not always held, and a page can be split even while
+someone does hold a pin on it, it is possible that an indexscan will
+return items that are no longer stored on the page it has a pin on, but
+rather somewhere to the right of that page. To ensure that VACUUM can't
+prematurely remove such heap tuples, we require btbulkdelete to obtain a
+super-exclusive lock on every leaf page in the index, even pages that
+don't contain any deletable tuples. Any scan which could yield incorrect
+results if the tuple at a TID matching the scan's range and filter
+conditions were replaced by a different tuple while the scan is in
+progress must hold the pin on each index page until all index entries read
+from the page have been processed. This guarantees that the btbulkdelete
+call cannot return while any indexscan is still holding a copy of a
+deleted index tuple if the scan could be confused by that. Note that this
+requirement does not say that btbulkdelete must visit the pages in any
+particular order. (See also on-the-fly deletion, below.)
 
 There is no such interlocking for deletion of items in internal pages,
 since backends keep no lock nor pin on a page they have descended past.

@@ -396,8 +409,12 @@ that this breaks the interlock between VACUUM and indexscans, but that is
 not so: as long as an indexscanning process has a pin on the page where
 the index item used to be, VACUUM cannot complete its btbulkdelete scan
 and so cannot remove the heap tuple. This is another reason why
-btbulkdelete has to get super-exclusive lock on every leaf page, not only
-the ones where it actually sees items to delete.
+btbulkdelete has to get a super-exclusive lock on every leaf page, not
+only the ones where it actually sees items to delete. So that we can
+handle the cases where we attempt LP_DEAD flagging for a page after we
+have released its pin, we remember the LSN of the index page when we read
+the index tuples from it; we do not attempt to flag index tuples as dead
+if we didn't hold the pin the entire time and the LSN has changed.
 
 WAL Considerations
 ------------------

@@ -462,7 +479,7 @@ metapage update (of the "fast root" link) is performed and logged as part
 of the insertion into the parent level. When splitting the root page, the
 metapage update is handled as part of the "new root" action.
 
-Each step in page deletion are logged as separate WAL entries: marking the
+Each step in page deletion is logged as a separate WAL entry: marking the
 leaf as half-dead and removing the downlink is one record, and unlinking a
 page is a second record. If vacuum is interrupted for some reason, or the
 system crashes, the tree is consistent for searches and insertions. The
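The LSN rule added to the README above (remember the leaf page's LSN when its items are read, and refuse to set LP_DEAD hints if the pin was dropped and the LSN has since changed) can be sketched roughly like this. It is an illustration, not an excerpt from the patch; it assumes scan-position fields named buf, currPage and lsn (matching the usage in the nbtree.c diff below) and the existing _bt_getbuf/_bt_relbuf helpers.

/*
 * Sketch only: before flagging index tuples LP_DEAD, make sure the page
 * cannot have had its line pointers reclaimed while we were not holding
 * the pin.
 */
if (BTScanPosIsPinned(so->currPos))
{
    /* Pin was held the whole time: just re-acquire the read lock. */
    LockBuffer(so->currPos.buf, BT_READ);
}
else
{
    /* Pin was dropped: re-read the page and compare LSNs. */
    Buffer      buf = _bt_getbuf(scan->indexRelation,
                                 so->currPos.currPage, BT_READ);

    if (PageGetLSN(BufferGetPage(buf)) != so->currPos.lsn)
    {
        /* Page changed since we read it; do not set any LP_DEAD bits. */
        _bt_relbuf(scan->indexRelation, buf);
        return;
    }
    so->currPos.buf = buf;
}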

src/backend/access/nbtree/nbtinsert.c (+2 -2)
@@ -498,9 +498,9 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
  * If there's not enough room in the space, we try to make room by
  * removing any LP_DEAD tuples.
  *
- * On entry, *buf and *offsetptr point to the first legal position
+ * On entry, *bufptr and *offsetptr point to the first legal position
  * where the new tuple could be inserted. The caller should hold an
- * exclusive lock on *buf. *offsetptr can also be set to
+ * exclusive lock on *bufptr. *offsetptr can also be set to
  * InvalidOffsetNumber, in which case the function will search for the
  * right location within the page if needed. On exit, they point to the
  * chosen insert location. If _bt_findinsertloc decides to move right,

src/backend/access/nbtree/nbtree.c (+53 -33)
@@ -404,7 +404,8 @@ btbeginscan(PG_FUNCTION_ARGS)
 
     /* allocate private workspace */
     so = (BTScanOpaque) palloc(sizeof(BTScanOpaqueData));
-    so->currPos.buf = so->markPos.buf = InvalidBuffer;
+    BTScanPosInvalidate(so->currPos);
+    BTScanPosInvalidate(so->markPos);
     if (scan->numberOfKeys > 0)
         so->keyData = (ScanKey) palloc(scan->numberOfKeys * sizeof(ScanKeyData));
     else

@@ -424,8 +425,6 @@
     * scan->xs_itupdesc whether we'll need it or not, since that's so cheap.
     */
    so->currTuples = so->markTuples = NULL;
-   so->currPos.nextTupleOffset = 0;
-   so->markPos.nextTupleOffset = 0;
 
    scan->xs_itupdesc = RelationGetDescr(rel);
 

@@ -451,17 +450,14 @@ btrescan(PG_FUNCTION_ARGS)
    {
        /* Before leaving current page, deal with any killed items */
        if (so->numKilled > 0)
-           _bt_killitems(scan, false);
-       ReleaseBuffer(so->currPos.buf);
-       so->currPos.buf = InvalidBuffer;
+           _bt_killitems(scan);
+       BTScanPosUnpinIfPinned(so->currPos);
+       BTScanPosInvalidate(so->currPos);
    }
 
-   if (BTScanPosIsValid(so->markPos))
-   {
-       ReleaseBuffer(so->markPos.buf);
-       so->markPos.buf = InvalidBuffer;
-   }
    so->markItemIndex = -1;
+   BTScanPosUnpinIfPinned(so->markPos);
+   BTScanPosInvalidate(so->markPos);
 
    /*
     * Allocate tuple workspace arrays, if needed for an index-only scan and

@@ -515,17 +511,14 @@ btendscan(PG_FUNCTION_ARGS)
    {
        /* Before leaving current page, deal with any killed items */
        if (so->numKilled > 0)
-           _bt_killitems(scan, false);
-       ReleaseBuffer(so->currPos.buf);
-       so->currPos.buf = InvalidBuffer;
+           _bt_killitems(scan);
+       BTScanPosUnpinIfPinned(so->currPos);
    }
 
-   if (BTScanPosIsValid(so->markPos))
-   {
-       ReleaseBuffer(so->markPos.buf);
-       so->markPos.buf = InvalidBuffer;
-   }
    so->markItemIndex = -1;
+   BTScanPosUnpinIfPinned(so->markPos);
+
+   /* No need to invalidate positions, the RAM is about to be freed. */
 
    /* Release storage */
    if (so->keyData != NULL)

@@ -552,12 +545,8 @@ btmarkpos(PG_FUNCTION_ARGS)
    IndexScanDesc scan = (IndexScanDesc) PG_GETARG_POINTER(0);
    BTScanOpaque so = (BTScanOpaque) scan->opaque;
 
-   /* we aren't holding any read locks, but gotta drop the pin */
-   if (BTScanPosIsValid(so->markPos))
-   {
-       ReleaseBuffer(so->markPos.buf);
-       so->markPos.buf = InvalidBuffer;
-   }
+   /* There may be an old mark with a pin (but no lock). */
+   BTScanPosUnpinIfPinned(so->markPos);
 
    /*
     * Just record the current itemIndex. If we later step to next page

@@ -568,7 +557,10 @@
    if (BTScanPosIsValid(so->currPos))
        so->markItemIndex = so->currPos.itemIndex;
    else
+   {
+       BTScanPosInvalidate(so->markPos);
        so->markItemIndex = -1;
+   }
 
    /* Also record the current positions of any array keys */
    if (so->numArrayKeys)

@@ -593,35 +585,63 @@ btrestrpos(PG_FUNCTION_ARGS)
    if (so->markItemIndex >= 0)
    {
        /*
-        * The mark position is on the same page we are currently on. Just
+        * The scan has never moved to a new page since the last mark. Just
         * restore the itemIndex.
+        *
+        * NB: In this case we can't count on anything in so->markPos to be
+        * accurate.
         */
        so->currPos.itemIndex = so->markItemIndex;
    }
+   else if (so->currPos.currPage == so->markPos.currPage)
+   {
+       /*
+        * so->markItemIndex < 0 but mark and current positions are on the
+        * same page. This would be an unusual case, where the scan moved to
+        * a new index page after the mark, restored, and later restored again
+        * without moving off the marked page. It is not clear that this code
+        * can currently be reached, but it seems better to make this function
+        * robust for this case than to Assert() or elog() that it can't
+        * happen.
+        *
+        * We neither want to set so->markItemIndex >= 0 (because that could
+        * cause a later move to a new page to redo the memcpy() executions)
+        * nor re-execute the memcpy() functions for a restore within the same
+        * page. The previous restore to this page already set everything
+        * except markPos as it should be.
+        */
+       so->currPos.itemIndex = so->markPos.itemIndex;
+   }
    else
    {
-       /* we aren't holding any read locks, but gotta drop the pin */
+       /*
+        * The scan moved to a new page after last mark or restore, and we are
+        * now restoring to the marked page. We aren't holding any read
+        * locks, but if we're still holding the pin for the current position,
+        * we must drop it.
+        */
        if (BTScanPosIsValid(so->currPos))
        {
            /* Before leaving current page, deal with any killed items */
-           if (so->numKilled > 0 &&
-               so->currPos.buf != so->markPos.buf)
-               _bt_killitems(scan, false);
-           ReleaseBuffer(so->currPos.buf);
-           so->currPos.buf = InvalidBuffer;
+           if (so->numKilled > 0)
+               _bt_killitems(scan);
+           BTScanPosUnpinIfPinned(so->currPos);
        }
 
        if (BTScanPosIsValid(so->markPos))
        {
            /* bump pin on mark buffer for assignment to current buffer */
-           IncrBufferRefCount(so->markPos.buf);
+           if (BTScanPosIsPinned(so->markPos))
+               IncrBufferRefCount(so->markPos.buf);
            memcpy(&so->currPos, &so->markPos,
                   offsetof(BTScanPosData, items[1]) +
                   so->markPos.lastItem * sizeof(BTScanPosItem));
            if (so->currTuples)
                memcpy(so->currTuples, so->markTuples,
                       so->markPos.nextTupleOffset);
        }
+       else
+           BTScanPosInvalidate(so->currPos);
    }
 
    PG_RETURN_VOID();
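The nbtree.c hunks above rely on a set of scan-position helpers (BTScanPosIsPinned, BTScanPosUnpinIfPinned, BTScanPosInvalidate) whose definitions live in nbtree.h and are not part of this excerpt. A plausible sketch of what they do, assuming the BTScanPosData fields used elsewhere in this commit (buf, currPage, lsn, nextTupleOffset); the exact header text may differ:

/* Sketch only: approximate definitions, not the exact nbtree.h text. */
#define BTScanPosIsPinned(scanpos) \
    BufferIsValid((scanpos).buf)

#define BTScanPosUnpin(scanpos) \
    do { \
        ReleaseBuffer((scanpos).buf); \
        (scanpos).buf = InvalidBuffer; \
    } while (0)

#define BTScanPosUnpinIfPinned(scanpos) \
    do { \
        if (BTScanPosIsPinned(scanpos)) \
            BTScanPosUnpin(scanpos); \
    } while (0)

#define BTScanPosInvalidate(scanpos) \
    do { \
        (scanpos).currPage = InvalidBlockNumber; \
        (scanpos).buf = InvalidBuffer; \
        (scanpos).lsn = InvalidXLogRecPtr; \
        (scanpos).nextTupleOffset = 0; \
    } while (0)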
