Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
Skip to content

Commit d3abbbe

Browse files
Avoid early reuse of btree pages, causing incorrect query results.
When we allowed read-only transactions to skip assigning XIDs we introduced the possibility that a fully deleted btree page could be reused. This broke the index link sequence which could then lead to indexscans silently returning fewer rows than would have been correct. The actual incidence of silent errors from this is thought to be very low because of the exact workload required and locking pre-conditions. Fix is to remove pages only if index page opaque->btpo.xact precedes RecentGlobalXmin. Noah Misch, reviewed by Simon Riggs
1 parent 3e4d3a3 commit d3abbbe

File tree

3 files changed

+20
-26
lines changed

3 files changed

+20
-26
lines changed

src/backend/access/nbtree/README

+5-3
Original file line numberDiff line numberDiff line change
@@ -261,12 +261,14 @@ we need to be sure we don't miss or re-scan any items.
261261

262262
A deleted page can only be reclaimed once there is no scan or search that
263263
has a reference to it; until then, it must stay in place with its
264-
right-link undisturbed. We implement this by waiting until all
265-
transactions that were running at the time of deletion are dead; which is
264+
right-link undisturbed. We implement this by waiting until all active
265+
snapshots and registered snapshots as of the deletion are gone; which is
266266
overly strong, but is simple to implement within Postgres. When marked
267267
dead, a deleted page is labeled with the next-transaction counter value.
268268
VACUUM can reclaim the page for re-use when this transaction number is
269-
older than the oldest open transaction.
269+
older than RecentGlobalXmin. As collateral damage, this implementation
270+
also waits for running XIDs with no snapshots and for snapshots taken
271+
until the next transaction to allocate an XID commits.
270272

271273
Reclaiming a page doesn't actually change its state on disk --- we simply
272274
record it in the shared-memory free space map, from which it will be

src/backend/access/nbtree/nbtpage.c

+10-22
Original file line numberDiff line numberDiff line change
@@ -558,19 +558,9 @@ _bt_getbuf(Relation rel, BlockNumber blkno, int access)
558558
*/
559559
if (XLogStandbyInfoActive())
560560
{
561-
TransactionId latestRemovedXid;
562-
563561
BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
564562

565-
/*
566-
* opaque->btpo.xact is the threshold value not the
567-
* value to measure conflicts against. We must retreat
568-
* by one from it to get the correct conflict xid.
569-
*/
570-
latestRemovedXid = opaque->btpo.xact;
571-
TransactionIdRetreat(latestRemovedXid);
572-
573-
_bt_log_reuse_page(rel, blkno, latestRemovedXid);
563+
_bt_log_reuse_page(rel, blkno, opaque->btpo.xact);
574564
}
575565

576566
/* Okay to use page. Re-initialize and return it */
@@ -685,7 +675,6 @@ bool
685675
_bt_page_recyclable(Page page)
686676
{
687677
BTPageOpaque opaque;
688-
TransactionId cutoff;
689678

690679
/*
691680
* It's possible to find an all-zeroes page in an index --- for example, a
@@ -698,18 +687,11 @@ _bt_page_recyclable(Page page)
698687

699688
/*
700689
* Otherwise, recycle if deleted and too old to have any processes
701-
* interested in it. If we are generating records for Hot Standby
702-
* defer page recycling until RecentGlobalXmin to respect user
703-
* controls specified by vacuum_defer_cleanup_age or hot_standby_feedback.
690+
* interested in it.
704691
*/
705-
if (XLogStandbyInfoActive())
706-
cutoff = RecentGlobalXmin;
707-
else
708-
cutoff = RecentXmin;
709-
710692
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
711693
if (P_ISDELETED(opaque) &&
712-
TransactionIdPrecedesOrEquals(opaque->btpo.xact, cutoff))
694+
TransactionIdPrecedes(opaque->btpo.xact, RecentGlobalXmin))
713695
return true;
714696
return false;
715697
}
@@ -1376,7 +1358,13 @@ _bt_pagedel(Relation rel, Buffer buf, BTStack stack)
13761358

13771359
/*
13781360
* Mark the page itself deleted. It can be recycled when all current
1379-
* transactions are gone.
1361+
* transactions are gone. Storing GetTopTransactionId() would work, but
1362+
* we're in VACUUM and would not otherwise have an XID. Having already
1363+
* updated links to the target, ReadNewTransactionId() suffices as an
1364+
* upper bound. Any scan having retained a now-stale link is advertising
1365+
* in its PGXACT an xmin less than or equal to the value we read here. It
1366+
* will continue to do so, holding back RecentGlobalXmin, for the duration
1367+
* of that scan.
13801368
*/
13811369
page = BufferGetPage(buf);
13821370
opaque = (BTPageOpaque) PageGetSpecialPointer(page);

src/backend/access/nbtree/nbtxlog.c

+5-1
Original file line numberDiff line numberDiff line change
@@ -968,7 +968,11 @@ btree_redo(XLogRecPtr lsn, XLogRecord *record)
968968
/*
969969
* Btree reuse page records exist to provide a conflict point
970970
* when we reuse pages in the index via the FSM. That's all it
971-
* does though.
971+
* does though. latestRemovedXid was the page's btpo.xact. The
972+
* btpo.xact < RecentGlobalXmin test in _bt_page_recyclable()
973+
* conceptually mirrors the pgxact->xmin > limitXmin test in
974+
* GetConflictingVirtualXIDs(). Consequently, one XID value
975+
* achieves the same exclusion effect on master and standby.
972976
*/
973977
{
974978
xl_btree_reuse_page *xlrec = (xl_btree_reuse_page *) XLogRecGetData(record);

0 commit comments

Comments
 (0)