
Commit cf2acaf

Deprecate nbtree's BTP_HAS_GARBAGE flag.
Streamline handling of the various strategies that we have to avoid a page split in nbtinsert.c. When it looks like a leaf page is about to overflow, we now perform LP_DEAD item deletion and deduplication in one central place. This greatly simplifies _bt_findinsertloc().

This has an independently useful consequence: nbtree no longer relies on the BTP_HAS_GARBAGE page level flag/hint for anything important. We still set and unset the flag in the same way as before, but it's no longer treated as a gating condition when considering if we should check for already-set LP_DEAD bits. This happens at the point where the page looks like it might have to be split anyway, so simply checking the LP_DEAD bits in passing is practically free. This avoids missing LP_DEAD bits just because the page-level hint is unset, which is probably reasonably common (e.g. it happens when VACUUM unsets the page-level flag without actually removing index tuples whose LP_DEAD bit was set recently, after the VACUUM operation began but before it reached the leaf page in question).

Note that this isn't a big behavioral change compared to PostgreSQL 13. We were already checking for set LP_DEAD bits regardless of whether the BTP_HAS_GARBAGE page level flag was set before we considered doing a deduplication pass. This commit only goes slightly further by doing the same check for all indexes, even indexes where deduplication won't be performed.

We don't completely remove the BTP_HAS_GARBAGE flag. We still rely on it as a gating condition with pg_upgrade'd indexes from before B-tree version 4/PostgreSQL 12. That makes sense because we sometimes have to make a choice among pages full of duplicates when inserting a tuple with pre-version 4 indexes. It probably still pays to avoid accessing the line pointer array of a page there, since it won't yet be clear whether we'll insert onto the page in question at all, let alone split it as a result.

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Victor Yegorov <vyegorov@gmail.com>
Discussion: https://postgr.es/m/CAH2-Wz%3DYpc1PDdk8OVJDChGJBjT06%3DA0Mbv9HyTLCsOknGcUFg%40mail.gmail.com
1 parent 7684b6f commit cf2acaf
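In outline, the new arrangement funnels both split-avoidance strategies through one helper on the insertion path. The sketch below is condensed from the _bt_delete_or_dedup_one_page() code in the nbtinsert.c diff further down; local declarations, the LP_DEAD line pointer scan, and locking are elided here, so read it as a summary of the control flow rather than the complete function:

static void
_bt_delete_or_dedup_one_page(Relation rel, Relation heapRel,
                             BTInsertState insertstate,
                             bool lpdeadonly, bool checkingunique,
                             bool uniquedup)
{
    /* Strategy 1: delete items whose LP_DEAD bits are already set */
    if (ndeletable > 0)         /* collected by scanning the line pointer array */
    {
        _bt_delitems_delete(rel, insertstate->buf, deletable, ndeletable,
                            heapRel);
        insertstate->bounds_valid = false;

        if (PageGetFreeSpace(page) >= insertstate->itemsz)
            return;             /* page split avoided */

        uniquedup = true;       /* might as well assume duplicates */
    }

    if (lpdeadonly)
        return;                 /* caller asked for LP_DEAD deletion only */

    if (checkingunique && !uniquedup)
        return;                 /* no reason to expect duplicates */

    /* Binary-search bounds are almost certainly about to be invalidated */
    insertstate->bounds_valid = false;

    /* Strategy 2: deduplication, when enabled and safe for this index */
    if (BTGetDeduplicateItems(rel) && insertstate->itup_key->allequalimage)
        _bt_dedup_pass(rel, insertstate->buf, heapRel, insertstate->itup,
                       insertstate->itemsz, checkingunique);
}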

File tree

6 files changed: +135 −130 lines

src/backend/access/nbtree/nbtdedup.c

+16 −60

@@ -28,9 +28,7 @@ static bool _bt_posting_valid(IndexTuple posting);
 #endif
 
 /*
- * Deduplicate items on a leaf page.  The page will have to be split by caller
- * if we cannot successfully free at least newitemsz (we also need space for
- * newitem's line pointer, which isn't included in caller's newitemsz).
+ * Perform a deduplication pass.
  *
  * The general approach taken here is to perform as much deduplication as
  * possible to free as much space as possible.  Note, however, that "single
@@ -43,76 +41,32 @@ static bool _bt_posting_valid(IndexTuple posting);
  * handle those if and when the anticipated right half page gets its own
  * deduplication pass, following further inserts of duplicates.)
  *
- * This function should be called during insertion, when the page doesn't have
- * enough space to fit an incoming newitem.  If the BTP_HAS_GARBAGE page flag
- * was set, caller should have removed any LP_DEAD items by calling
- * _bt_vacuum_one_page() before calling here.  We may still have to kill
- * LP_DEAD items here when the page's BTP_HAS_GARBAGE hint is falsely unset,
- * but that should be rare.  Also, _bt_vacuum_one_page() won't unset the
- * BTP_HAS_GARBAGE flag when it finds no LP_DEAD items, so a successful
- * deduplication pass will always clear it, just to keep things tidy.
+ * The page will have to be split if we cannot successfully free at least
+ * newitemsz (we also need space for newitem's line pointer, which isn't
+ * included in caller's newitemsz).
+ *
+ * Note: Caller should have already deleted all existing items with their
+ * LP_DEAD bits set.
  */
 void
-_bt_dedup_one_page(Relation rel, Buffer buf, Relation heapRel,
-                   IndexTuple newitem, Size newitemsz, bool checkingunique)
+_bt_dedup_pass(Relation rel, Buffer buf, Relation heapRel, IndexTuple newitem,
+               Size newitemsz, bool checkingunique)
 {
     OffsetNumber offnum,
                 minoff,
                 maxoff;
     Page        page = BufferGetPage(buf);
-    BTPageOpaque opaque;
+    BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
     Page        newpage;
-    OffsetNumber deletable[MaxIndexTuplesPerPage];
     BTDedupState state;
-    int         ndeletable = 0;
     Size        pagesaving = 0;
     bool        singlevalstrat = false;
     int         nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
 
-    /*
-     * We can't assume that there are no LP_DEAD items.  For one thing, VACUUM
-     * will clear the BTP_HAS_GARBAGE hint without reliably removing items
-     * that are marked LP_DEAD.  We don't want to unnecessarily unset LP_DEAD
-     * bits when deduplicating items.  Allowing it would be correct, though
-     * wasteful.
-     */
-    opaque = (BTPageOpaque) PageGetSpecialPointer(page);
-    minoff = P_FIRSTDATAKEY(opaque);
-    maxoff = PageGetMaxOffsetNumber(page);
-    for (offnum = minoff;
-         offnum <= maxoff;
-         offnum = OffsetNumberNext(offnum))
-    {
-        ItemId      itemid = PageGetItemId(page, offnum);
-
-        if (ItemIdIsDead(itemid))
-            deletable[ndeletable++] = offnum;
-    }
-
-    if (ndeletable > 0)
-    {
-        _bt_delitems_delete(rel, buf, deletable, ndeletable, heapRel);
-
-        /*
-         * Return when a split will be avoided.  This is equivalent to
-         * avoiding a split using the usual _bt_vacuum_one_page() path.
-         */
-        if (PageGetFreeSpace(page) >= newitemsz)
-            return;
-
-        /*
-         * Reconsider number of items on page, in case _bt_delitems_delete()
-         * managed to delete an item or two
-         */
-        minoff = P_FIRSTDATAKEY(opaque);
-        maxoff = PageGetMaxOffsetNumber(page);
-    }
-
     /* Passed-in newitemsz is MAXALIGNED but does not include line pointer */
     newitemsz += sizeof(ItemIdData);
 
     /*
-     * By here, it's clear that deduplication will definitely be attempted.
      * Initialize deduplication state.
      *
      * It would be possible for maxpostingsize (limit on posting list tuple
@@ -138,6 +92,9 @@ _bt_dedup_one_page(Relation rel, Buffer buf, Relation heapRel,
     /* nintervals should be initialized to zero */
     state->nintervals = 0;
 
+    minoff = P_FIRSTDATAKEY(opaque);
+    maxoff = PageGetMaxOffsetNumber(page);
+
     /* Determine if "single value" strategy should be used */
     if (!checkingunique)
         singlevalstrat = _bt_do_singleval(rel, page, state, minoff, newitem);
@@ -259,10 +216,9 @@ _bt_dedup_one_page(Relation rel, Buffer buf, Relation heapRel,
     /*
     * By here, it's clear that deduplication will definitely go ahead.
     *
-    * Clear the BTP_HAS_GARBAGE page flag in the unlikely event that it is
-    * still falsely set, just to keep things tidy.  (We can't rely on
-    * _bt_vacuum_one_page() having done this already, and we can't rely on a
-    * page split or VACUUM getting to it in the near future.)
+    * Clear the BTP_HAS_GARBAGE page flag.  The index must be a heapkeyspace
+    * index, and as such we'll never pay attention to BTP_HAS_GARBAGE anyway.
+    * But keep things tidy.
     */
     if (P_HAS_GARBAGE(opaque))
     {
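Since _bt_dedup_pass() now assumes the caller already deleted any LP_DEAD items, the scan that used to open this function moves to nbtinsert.c. For reference, this is the standard nbtree pattern for collecting dead line pointers, as it appears in the new helper (a fragment; surrounding declarations of rel, buffer, page, opaque, and heapRel are abridged):

OffsetNumber deletable[MaxIndexTuplesPerPage];
int          ndeletable = 0;
OffsetNumber offnum;
OffsetNumber maxoff = PageGetMaxOffsetNumber(page);

/* Collect every item on the leaf page whose LP_DEAD hint bit is set */
for (offnum = P_FIRSTDATAKEY(opaque);
     offnum <= maxoff;
     offnum = OffsetNumberNext(offnum))
{
    ItemId      itemid = PageGetItemId(page, offnum);

    if (ItemIdIsDead(itemid))
        deletable[ndeletable++] = offnum;
}

if (ndeletable > 0)
    _bt_delitems_delete(rel, buffer, deletable, ndeletable, heapRel);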

src/backend/access/nbtree/nbtinsert.c

+95 −47

@@ -58,7 +58,10 @@ static void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
 static Buffer _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf);
 static inline bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
                                 OffsetNumber itup_off, bool newfirstdataitem);
-static void _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel);
+static void _bt_delete_or_dedup_one_page(Relation rel, Relation heapRel,
+                                         BTInsertState insertstate,
+                                         bool lpdeadonly, bool checkingunique,
+                                         bool uniquedup);
 
 /*
  * _bt_doinsert() -- Handle insertion of a single index tuple in the tree.
@@ -871,38 +874,14 @@ _bt_findinsertloc(Relation rel,
         }
 
         /*
-         * If the target page is full, see if we can obtain enough space by
-         * erasing LP_DEAD items.  If that fails to free enough space, see if
-         * we can avoid a page split by performing a deduplication pass over
-         * the page.
-         *
-         * We only perform a deduplication pass for a checkingunique caller
-         * when the incoming item is a duplicate of an existing item on the
-         * leaf page.  This heuristic avoids wasting cycles -- we only expect
-         * to benefit from deduplicating a unique index page when most or all
-         * recently added items are duplicates.  See nbtree/README.
+         * If the target page is full, see if we can obtain enough space using
+         * one or more strategies (e.g. erasing LP_DEAD items, deduplication).
+         * Page splits are expensive, and should only go ahead when truly
+         * necessary.
         */
        if (PageGetFreeSpace(page) < insertstate->itemsz)
-        {
-            if (P_HAS_GARBAGE(opaque))
-            {
-                _bt_vacuum_one_page(rel, insertstate->buf, heapRel);
-                insertstate->bounds_valid = false;
-
-                /* Might as well assume duplicates (if checkingunique) */
-                uniquedup = true;
-            }
-
-            if (itup_key->allequalimage && BTGetDeduplicateItems(rel) &&
-                (!checkingunique || uniquedup) &&
-                PageGetFreeSpace(page) < insertstate->itemsz)
-            {
-                _bt_dedup_one_page(rel, insertstate->buf, heapRel,
-                                   insertstate->itup, insertstate->itemsz,
-                                   checkingunique);
-                insertstate->bounds_valid = false;
-            }
-        }
+            _bt_delete_or_dedup_one_page(rel, heapRel, insertstate, false,
+                                         checkingunique, uniquedup);
    }
    else
    {
@@ -942,8 +921,9 @@ _bt_findinsertloc(Relation rel,
             */
            if (P_HAS_GARBAGE(opaque))
            {
-                _bt_vacuum_one_page(rel, insertstate->buf, heapRel);
-                insertstate->bounds_valid = false;
+                /* Erase LP_DEAD items (won't deduplicate) */
+                _bt_delete_or_dedup_one_page(rel, heapRel, insertstate, true,
+                                             checkingunique, false);
 
                if (PageGetFreeSpace(page) >= insertstate->itemsz)
                    break;      /* OK, now we have enough space */
@@ -993,14 +973,17 @@ _bt_findinsertloc(Relation rel,
         * performing a posting list split, so delete all LP_DEAD items early.
         * This is the only case where LP_DEAD deletes happen even though
         * there is space for newitem on the page.
+         *
+         * This can only erase LP_DEAD items (it won't deduplicate).
         */
-        _bt_vacuum_one_page(rel, insertstate->buf, heapRel);
+        _bt_delete_or_dedup_one_page(rel, heapRel, insertstate, true,
+                                     checkingunique, false);
 
        /*
         * Do new binary search.  New insert location cannot overlap with any
         * posting list now.
         */
-        insertstate->bounds_valid = false;
+        Assert(!insertstate->bounds_valid);
        insertstate->postingoff = 0;
        newitemoff = _bt_binsrch_insert(rel, insertstate);
        Assert(insertstate->postingoff == 0);
@@ -2623,32 +2606,59 @@ _bt_pgaddtup(Page page,
 }
 
 /*
- * _bt_vacuum_one_page - vacuum just one index page.
+ * _bt_delete_or_dedup_one_page - Try to avoid a leaf page split by attempting
+ * a variety of operations.
+ *
+ * There are two operations performed here: deleting items already marked
+ * LP_DEAD, and deduplication.  If both operations fail to free enough space
+ * for the incoming item then caller will go on to split the page.  We always
+ * attempt our preferred strategy (which is to delete items whose LP_DEAD bit
+ * are set) first.  If that doesn't work out we move on to deduplication.
+ *
+ * Caller's checkingunique and uniquedup arguments help us decide if we should
+ * perform deduplication, which is primarily useful with low cardinality data,
+ * but can sometimes absorb version churn.
+ *
+ * Callers that only want us to look for/delete LP_DEAD items can ask for that
+ * directly by passing true 'lpdeadonly' argument.
  *
- * Try to remove LP_DEAD items from the given page.  The passed buffer
- * must be exclusive-locked, but unlike a real VACUUM, we don't need a
- * super-exclusive "cleanup" lock (see nbtree/README).
+ * Note: We used to only delete LP_DEAD items when the BTP_HAS_GARBAGE page
+ * level flag was found set.  The flag was useful back when there wasn't
+ * necessarily one single page for a duplicate tuple to go on (before heap TID
+ * became a part of the key space in version 4 indexes).  But we don't
+ * actually look at the flag anymore (it's not a gating condition for our
+ * caller).  That would cause us to miss tuples that are safe to delete,
+ * without getting any benefit in return.  We know that the alternative is to
+ * split the page; scanning the line pointer array in passing won't have
+ * noticeable overhead.  (We still maintain the BTP_HAS_GARBAGE flag despite
+ * all this because !heapkeyspace indexes must still do a "getting tired"
+ * linear search, and so are likely to get some benefit from using it as a
+ * gating condition.)
 */
 static void
-_bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel)
+_bt_delete_or_dedup_one_page(Relation rel, Relation heapRel,
+                             BTInsertState insertstate,
+                             bool lpdeadonly, bool checkingunique,
+                             bool uniquedup)
 {
    OffsetNumber deletable[MaxIndexTuplesPerPage];
    int         ndeletable = 0;
    OffsetNumber offnum,
-                minoff,
                maxoff;
+    Buffer      buffer = insertstate->buf;
+    BTScanInsert itup_key = insertstate->itup_key;
    Page        page = BufferGetPage(buffer);
    BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
 
    Assert(P_ISLEAF(opaque));
+    Assert(lpdeadonly || itup_key->heapkeyspace);
 
    /*
     * Scan over all items to see which ones need to be deleted according to
     * LP_DEAD flags.
     */
-    minoff = P_FIRSTDATAKEY(opaque);
    maxoff = PageGetMaxOffsetNumber(page);
-    for (offnum = minoff;
+    for (offnum = P_FIRSTDATAKEY(opaque);
         offnum <= maxoff;
         offnum = OffsetNumberNext(offnum))
    {
@@ -2659,12 +2669,50 @@ _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel)
    }
 
    if (ndeletable > 0)
+    {
        _bt_delitems_delete(rel, buffer, deletable, ndeletable, heapRel);
+        insertstate->bounds_valid = false;
+
+        /* Return when a page split has already been avoided */
+        if (PageGetFreeSpace(page) >= insertstate->itemsz)
+            return;
+
+        /* Might as well assume duplicates (if checkingunique) */
+        uniquedup = true;
+    }
+
+    /*
+     * Some callers only want to delete LP_DEAD items.  Return early for these
+     * callers.
+     *
+     * Note: The page's BTP_HAS_GARBAGE hint flag may still be set when we
+     * return at this point (or when we go on the try either or both of our
+     * other strategies and they also fail).  We do not bother expending a
+     * separate write to clear it, however.  Caller will definitely clear it
+     * when it goes on to split the page (plus deduplication knows to clear
+     * the flag when it actually modifies the page).
+     */
+    if (lpdeadonly)
+        return;
+
+    /*
+     * We can get called in the checkingunique case when there is no reason to
+     * believe that there are any duplicates on the page; we should at least
+     * still check for LP_DEAD items.  If that didn't work out, give up and
+     * let caller split the page.  Deduplication cannot be justified given
+     * there is no reason to think that there are duplicates.
+     */
+    if (checkingunique && !uniquedup)
+        return;
+
+    /* Assume bounds about to be invalidated (this is almost certain now) */
+    insertstate->bounds_valid = false;
 
    /*
-     * Note: if we didn't find any LP_DEAD items, then the page's
-     * BTP_HAS_GARBAGE hint bit is falsely set.  We do not bother expending a
-     * separate write to clear it, however.  We will clear it when we split
-     * the page, or when deduplication runs.
+     * Perform deduplication pass, though only when it is enabled for the
+     * index and known to be safe (it must be an allequalimage index).
     */
+    if (BTGetDeduplicateItems(rel) && itup_key->allequalimage)
+        _bt_dedup_pass(rel, buffer, heapRel, insertstate->itup,
+                       insertstate->itemsz, checkingunique);
 }
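Taken together, the diff leaves the helper with two calling modes across its three call sites. The following summary restates the call shapes from the hunks above; the conditions guarding each call are abbreviated to comments:

/* 1. Target page is full: try LP_DEAD deletion, then maybe deduplication */
_bt_delete_or_dedup_one_page(rel, heapRel, insertstate, false,
                             checkingunique, uniquedup);

/* 2. "Getting tired" path, still gated on P_HAS_GARBAGE(opaque) for
 * !heapkeyspace indexes: LP_DEAD deletion only */
_bt_delete_or_dedup_one_page(rel, heapRel, insertstate, true,
                             checkingunique, false);

/* 3. About to split a posting list: LP_DEAD deletion only, even though
 * the page still has room for newitem */
_bt_delete_or_dedup_one_page(rel, heapRel, insertstate, true,
                             checkingunique, false);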

src/backend/access/nbtree/nbtpage.c

+16 −17

@@ -1187,8 +1187,8 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
     * array of offset numbers.
     *
     * PageIndexTupleOverwrite() won't unset each item's LP_DEAD bit when it
-     * happens to already be set.  Although we unset the BTP_HAS_GARBAGE page
-     * level flag, unsetting individual LP_DEAD bits should still be avoided.
+     * happens to already be set.  It's important that we not interfere with
+     * _bt_delitems_delete().
     */
    for (int i = 0; i < nupdatable; i++)
    {
@@ -1215,20 +1215,12 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
    opaque->btpo_cycleid = 0;
 
    /*
-     * Mark the page as not containing any LP_DEAD items.  This is not
-     * certainly true (there might be some that have recently been marked, but
-     * weren't targeted by VACUUM's heap scan), but it will be true often
-     * enough.  VACUUM does not delete items purely because they have their
-     * LP_DEAD bit set, since doing so would necessitate explicitly logging a
-     * latestRemovedXid cutoff (this is how _bt_delitems_delete works).
+     * Clear the BTP_HAS_GARBAGE page flag.
     *
-     * The consequences of falsely unsetting BTP_HAS_GARBAGE should be fairly
-     * limited, since we never falsely unset an LP_DEAD bit.  Workloads that
-     * are particularly dependent on LP_DEAD bits being set quickly will
-     * usually manage to set the BTP_HAS_GARBAGE flag before the page fills up
-     * again anyway.  Furthermore, attempting a deduplication pass will remove
-     * all LP_DEAD items, regardless of whether the BTP_HAS_GARBAGE hint bit
-     * is set or not.
+     * This flag indicates the presence of LP_DEAD items on the page (though
+     * not reliably).  Note that we only trust it with pg_upgrade'd
+     * !heapkeyspace indexes.  That's why clearing it here won't usually
+     * interfere with _bt_delitems_delete().
     */
    opaque->btpo_flags &= ~BTP_HAS_GARBAGE;
 
@@ -1310,10 +1302,17 @@ _bt_delitems_delete(Relation rel, Buffer buf,
 
    /*
     * Unlike _bt_delitems_vacuum, we *must not* clear the vacuum cycle ID,
-     * because this is not called by VACUUM.  Just clear the BTP_HAS_GARBAGE
-     * page flag, since we deleted all items with their LP_DEAD bit set.
+     * because this is not called by VACUUM
     */
    opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+
+    /*
+     * Clear the BTP_HAS_GARBAGE page flag.
+     *
+     * This flag indicates the presence of LP_DEAD items on the page (though
+     * not reliably).  Note that we only trust it with pg_upgrade'd
+     * !heapkeyspace indexes.
+     */
    opaque->btpo_flags &= ~BTP_HAS_GARBAGE;
 
    MarkBufferDirty(buf);
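Both functions clear the hint with the same bit operation on the page's special-space flags. As a minimal sketch of the flag's lifecycle on one page (using the P_HAS_GARBAGE() test macro from nbtree.h; the set side appears in the nbtutils.c hunk below):

BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);

/* Set: a scan marked at least one item LP_DEAD (see _bt_killitems) */
opaque->btpo_flags |= BTP_HAS_GARBAGE;

/* Test: after this commit, only trusted as a gating condition in
 * pg_upgrade'd !heapkeyspace indexes */
if (P_HAS_GARBAGE(opaque))
{
    /* page may (or may not) still hold LP_DEAD items */
}

/* Clear: done by VACUUM, _bt_delitems_delete(), page splits, and
 * deduplication passes */
opaque->btpo_flags &= ~BTP_HAS_GARBAGE;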

src/backend/access/nbtree/nbtutils.c

+2 −1

@@ -1877,7 +1877,8 @@ _bt_killitems(IndexScanDesc scan)
         * Since this can be redone later if needed, mark as dirty hint.
         *
         * Whenever we mark anything LP_DEAD, we also set the page's
-         * BTP_HAS_GARBAGE flag, which is likewise just a hint.
+         * BTP_HAS_GARBAGE flag, which is likewise just a hint.  (Note that we
+         * only rely on the page-level flag in !heapkeyspace indexes.)
         */
        if (killedsomething)
        {
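The code under this comment pairs the per-item hint with the page-level one. A short sketch of that pairing, assuming the locals that _bt_killitems() has in the PostgreSQL source (iid, opaque, killedsomething, and the scan's current buffer in so->currPos.buf):

/* Per-item hint: mark this index tuple's line pointer dead */
ItemIdMarkDead(iid);
killedsomething = true;

/* ... after examining all matching items on the page ... */
if (killedsomething)
{
    /* Page-level hint; after this commit only !heapkeyspace indexes
     * treat it as a gating condition */
    opaque->btpo_flags |= BTP_HAS_GARBAGE;

    /* Hint-bit-only change: dirty the buffer without a full WAL record */
    MarkBufferDirtyHint(so->currPos.buf, true);
}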
