diff options
author | Peter Geoghegan | 2025-06-11 13:17:27 +0000 |
---|---|---|
committer | Peter Geoghegan | 2025-06-11 13:17:27 +0000 |
commit | 7c7c0a77dc8a48ba4f88863d8138abd5fbac6cec (patch) | |
tree | 8e066e4d15e6e1ad84030619e4df311e6fc065d4 /src | |
parent | 0703c9385ba0ff435efec0673ca9f15bda941317 (diff) |
Make _bt_killitems drop pins it acquired itself.REL_14_STABLE
Teach nbtree's _bt_killitems to leave the so->currPos page that it sets
LP_DEAD items on in whatever state it was in when _bt_killitems was
called. In particular, make sure that so->dropPin scans don't acquire a
pin whose reference is saved in so->currPos.buf.
Allowing _bt_killitems to change so->currPos.buf like this is wrong.
The immediate consequence of allowing it is that code in _bt_steppage
(that copies so->currPos into so->markPos) will behave as if the scan is
a !so->dropPin scan. so->markPos will therefore retain the buffer pin
indefinitely, even though _bt_killitems only needs to acquire a pin
(along with a lock) for long enough to mark known-dead items LP_DEAD.
This issue came to light following a report of a failure of an assertion
from recent commit e6eed40e. The test case in question involves the use
of mark and restore. An initial call to _bt_killitems takes place that
leaves so->currPos.buf in a state that is inconsistent with the scan
being so->dropPin. A subsequent call to _bt_killitems for the same
position (following so->currPos being saved in so->markPos, and then
restored as so->currPos) resulted in the failure of an assertion that
tests that so->currPos.buf is InvalidBuffer when the scan is so->dropPin
(non-assert builds got a "resource was not closed" WARNING instead).
The same problem exists on earlier releases, though the issue is far
more subtle there. Recent commit e6eed40e introduced the so->dropPin
field as a partial replacement for testing so->currPos.buf directly.
Earlier releases won't get an assertion failure (or buffer pin leak),
but they will allow the second _bt_killitems call from the test case to
behave as if a buffer pin was consistently held since the original call
to _bt_readpage. This is wrong; there will have been an initial window
during which no pin was held on the so->currPos page, and yet the second
_bt_killitems call will neglect to check if so->currPos.lsn continues to
match the page's now-current LSN.
As a result of all this, it's just about possible that _bt_killitems
will set the wrong items LP_DEAD (on release branches). This could only
happen with merge joins (the sole user of nbtree mark/restore support),
when a concurrently inserted index tuple used a recently-recycled TID
(and only when the new tuple was inserted onto the same page as a
distinct concurrently-removed tuple with the same TID). This is exactly
the scenario that _bt_killitems' check of the page's now-current LSN
against the LSN stashed in currPos was supposed to prevent.
A follow-up commit will make nbtree completely stop conditioning whether
or not a position's pin needs to be dropped on whether the 'buf' field
is set. All call sites that might need to drop a still-held pin will be
taught to rely on the scan-level so->dropPin field recently introduced
by commit e6eed40e. That will make bugs of the same general nature as
this one impossible (or make them much easier to detect, at least).
Author: Peter Geoghegan <pg@bowt.ie>
Reported-By: Alexander Lakhin <exclusion@gmail.com>
Discussion: https://postgr.es/m/545be1e5-3786-439a-9257-a90d30f8b849@gmail.com
Backpatch-through: 13
Diffstat (limited to 'src')
-rw-r--r-- | src/backend/access/nbtree/nbtutils.c | 34 |
1 files changed, 20 insertions, 14 deletions
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c index 73705df4fb1..211a68e41ab 100644 --- a/src/backend/access/nbtree/nbtutils.c +++ b/src/backend/access/nbtree/nbtutils.c @@ -1706,9 +1706,9 @@ _bt_check_rowcompare(ScanKey skey, IndexTuple tuple, int tupnatts, * current page and killed tuples thereon (generally, this should only be * called if so->numKilled > 0). * - * The caller does not have a lock on the page and may or may not have the - * page pinned in a buffer. Note that read-lock is sufficient for setting - * LP_DEAD status (which is only a hint). + * Caller should not have a lock on the so->currPos page, but may hold a + * buffer pin. When we return, it still won't be locked. It'll continue to + * hold whatever pins were held before calling here. * * We match items by heap TID before assuming they are the right ones to * delete. We cope with cases where items have moved right due to insertions. @@ -1740,7 +1740,8 @@ _bt_killitems(IndexScanDesc scan) int i; int numKilled = so->numKilled; bool killedsomething = false; - bool droppedpin PG_USED_FOR_ASSERTS_ONLY; + bool droppedpin; + Buffer buf; Assert(BTScanPosIsValid(so->currPos)); @@ -1759,29 +1760,31 @@ _bt_killitems(IndexScanDesc scan) * LSN. */ droppedpin = false; - _bt_lockbuf(scan->indexRelation, so->currPos.buf, BT_READ); - - page = BufferGetPage(so->currPos.buf); + buf = so->currPos.buf; + _bt_lockbuf(scan->indexRelation, buf, BT_READ); } else { - Buffer buf; + XLogRecPtr latestlsn; droppedpin = true; /* Attempt to re-read the buffer, getting pin and lock. */ buf = _bt_getbuf(scan->indexRelation, so->currPos.currPage, BT_READ); - page = BufferGetPage(buf); - if (BufferGetLSNAtomic(buf) == so->currPos.lsn) - so->currPos.buf = buf; - else + latestlsn = BufferGetLSNAtomic(buf); + Assert(!XLogRecPtrIsInvalid(so->currPos.lsn)); + Assert(so->currPos.lsn <= latestlsn); + if (so->currPos.lsn != latestlsn) { /* Modified while not pinned means hinting is not safe. */ _bt_relbuf(scan->indexRelation, buf); return; } + + /* Unmodified, hinting is safe */ } + page = BufferGetPage(buf); opaque = (BTPageOpaque) PageGetSpecialPointer(page); minoff = P_FIRSTDATAKEY(opaque); maxoff = PageGetMaxOffsetNumber(page); @@ -1898,10 +1901,13 @@ _bt_killitems(IndexScanDesc scan) if (killedsomething) { opaque->btpo_flags |= BTP_HAS_GARBAGE; - MarkBufferDirtyHint(so->currPos.buf, true); + MarkBufferDirtyHint(buf, true); } - _bt_unlockbuf(scan->indexRelation, so->currPos.buf); + if (!droppedpin) + _bt_unlockbuf(scan->indexRelation, buf); + else + _bt_relbuf(scan->indexRelation, buf); } |