
Commit e628464

Modify btree to delete known-dead index entries without an actual VACUUM.
When we are about to split an index page to do an insertion, first look to see if any entries marked LP_DELETE exist on the page, and if so remove them to try to make enough space for the desired insert. This should reduce index bloat in heavily-updated tables, although of course you still need VACUUM eventually to clean up the heap.

Junji Teramoto
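
For orientation, here is a self-contained toy model of the idea (an illustrative sketch only, not PostgreSQL code; ToyPage, ToyItem and the toy_* names are invented for this example). Items are flagged dead cheaply when their heap tuples are found dead, and flagged items are physically reclaimed only when an insert would otherwise force a page split. In the real patch the flagging happens in _bt_killitems() and _bt_check_unique(), and the reclamation in the new _bt_vacuum_one_page(), as the diffs below show.

#include <stdbool.h>
#include <stdio.h>

#define PAGE_CAPACITY 4

typedef struct
{
    int     key;
    bool    dead;           /* analogue of the LP_DELETE hint bit on an ItemId */
} ToyItem;

typedef struct
{
    ToyItem items[PAGE_CAPACITY];
    int     nitems;
    bool    has_garbage;    /* analogue of the BTP_HAS_GARBAGE page hint */
} ToyPage;

/* Physically remove items flagged dead (rough analogue of _bt_vacuum_one_page). */
static void
toy_vacuum_one_page(ToyPage *page)
{
    int     src,
            dst = 0;

    for (src = 0; src < page->nitems; src++)
    {
        if (!page->items[src].dead)
            page->items[dst++] = page->items[src];
    }
    page->nitems = dst;
    page->has_garbage = false;
}

/* Insert a key; try reclaiming flagged items before declaring a split necessary. */
static bool
toy_insert(ToyPage *page, int key)
{
    if (page->nitems >= PAGE_CAPACITY && page->has_garbage)
        toy_vacuum_one_page(page);
    if (page->nitems >= PAGE_CAPACITY)
        return false;       /* caller would now have to split the page */
    page->items[page->nitems].key = key;
    page->items[page->nitems].dead = false;
    page->nitems++;
    return true;
}

int
main(void)
{
    ToyPage page = {0};
    int     i;

    /* Fill the page completely. */
    for (i = 1; i <= PAGE_CAPACITY; i++)
        toy_insert(&page, i);

    /* A later scan discovers that key 2 points to a dead heap tuple. */
    page.items[1].dead = true;
    page.has_garbage = true;

    /* The next insert reclaims the dead slot instead of splitting. */
    printf("insert without split: %s\n", toy_insert(&page, 5) ? "yes" : "no");
    return 0;
}

Compiled with any C99 compiler, this prints that the fifth insert succeeds without a split, because the slot freed by the flagged-dead item is reused.
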
1 parent edd49fc commit e628464

6 files changed, +144 −16 lines

src/backend/access/nbtree/README

Lines changed: 42 additions & 4 deletions
@@ -1,4 +1,4 @@
-$PostgreSQL: pgsql/src/backend/access/nbtree/README,v 1.12 2006/05/08 00:00:09 tgl Exp $
+$PostgreSQL: pgsql/src/backend/access/nbtree/README,v 1.13 2006/07/25 19:13:00 tgl Exp $
 
 This directory contains a correct implementation of Lehman and Yao's
 high-concurrency B-tree management algorithm (P. Lehman and S. Yao,

@@ -146,11 +146,12 @@ Because a page can be split even while someone holds a pin on it, it is
 possible that an indexscan will return items that are no longer stored on
 the page it has a pin on, but rather somewhere to the right of that page.
 To ensure that VACUUM can't prematurely remove such heap tuples, we require
-btbulkdelete to obtain super-exclusive lock on every leaf page in the index
-(even pages that don't contain any deletable tuples).  This guarantees that
+btbulkdelete to obtain super-exclusive lock on every leaf page in the index,
+even pages that don't contain any deletable tuples.  This guarantees that
 the btbulkdelete call cannot return while any indexscan is still holding
 a copy of a deleted index tuple.  Note that this requirement does not say
-that btbulkdelete must visit the pages in any particular order.
+that btbulkdelete must visit the pages in any particular order.  (See also
+on-the-fly deletion, below.)
 
 There is no such interlocking for deletion of items in internal pages,
 since backends keep no lock nor pin on a page they have descended past.

@@ -320,6 +321,43 @@ positives, so long as it never gives a false negative.  This makes it
 possible to implement the test with a small counter value stored on each
 index page.
 
+On-the-fly deletion of index tuples
+-----------------------------------
+
+If a process visits a heap tuple and finds that it's dead and removable
+(ie, dead to all open transactions, not only that process), then we can
+return to the index and mark the corresponding index entry "known dead",
+allowing subsequent index scans to skip visiting the heap tuple.  The
+"known dead" marking uses the LP_DELETE bit in ItemIds.  This is currently
+only done in plain indexscans, not bitmap scans, because only plain scans
+visit the heap and index "in sync" and so there's not a convenient way
+to do it for bitmap scans.
+
+Once an index tuple has been marked LP_DELETE it can actually be removed
+from the index immediately; since index scans only stop "between" pages,
+no scan can lose its place from such a deletion.  We separate the steps
+because we allow LP_DELETE to be set with only a share lock (it's exactly
+like a hint bit for a heap tuple), but physically removing tuples requires
+exclusive lock.  In the current code we try to remove LP_DELETE tuples when
+we are otherwise faced with having to split a page to do an insertion (and
+hence have exclusive lock on it already).
+
+This leaves the index in a state where it has no entry for a dead tuple
+that still exists in the heap.  This is not a problem for the current
+implementation of VACUUM, but it could be a problem for anything that
+explicitly tries to find index entries for dead tuples.  (However, the
+same situation is created by REINDEX, since it doesn't enter dead
+tuples into the index.)
+
+It's sufficient to have an exclusive lock on the index page, not a
+super-exclusive lock, to do deletion of LP_DELETE items.  It might seem
+that this breaks the interlock between VACUUM and indexscans, but that is
+not so: as long as an indexscanning process has a pin on the page where
+the index item used to be, VACUUM cannot complete its btbulkdelete scan
+and so cannot remove the heap tuple.  This is another reason why
+btbulkdelete has to get super-exclusive lock on every leaf page, not only
+the ones where it actually sees items to delete.
+
 WAL considerations
 ------------------

src/backend/access/nbtree/nbtinsert.c

Lines changed: 73 additions & 8 deletions
@@ -8,7 +8,7 @@
  *
  *
  * IDENTIFICATION
- *    $PostgreSQL: pgsql/src/backend/access/nbtree/nbtinsert.c,v 1.141 2006/07/13 16:49:12 momjian Exp $
+ *    $PostgreSQL: pgsql/src/backend/access/nbtree/nbtinsert.c,v 1.142 2006/07/25 19:13:00 tgl Exp $
  *
  *-------------------------------------------------------------------------
  */

@@ -65,6 +65,7 @@ static void _bt_pgaddtup(Relation rel, Page page,
              OffsetNumber itup_off, const char *where);
 static bool _bt_isequal(TupleDesc itupdesc, Page page, OffsetNumber offnum,
             int keysz, ScanKey scankey);
+static void _bt_vacuum_one_page(Relation rel, Buffer buffer);
 
 
 /*

@@ -262,6 +263,8 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
                              hbuffer) == HEAPTUPLE_DEAD)
                 {
                     curitemid->lp_flags |= LP_DELETE;
+                    opaque->btpo_flags |= BTP_HAS_GARBAGE;
+                    /* be sure to mark the proper buffer dirty... */
                     if (nbuf != InvalidBuffer)
                         SetBufferCommitInfoNeedsSave(nbuf);
                     else

@@ -424,11 +427,29 @@ _bt_insertonpg(Relation rel,
          */
         bool        movedright = false;
 
-        while (PageGetFreeSpace(page) < itemsz &&
-               !P_RIGHTMOST(lpageop) &&
-               _bt_compare(rel, keysz, scankey, page, P_HIKEY) == 0 &&
-               random() > (MAX_RANDOM_VALUE / 100))
+        while (PageGetFreeSpace(page) < itemsz)
         {
+            Buffer      rbuf;
+
+            /*
+             * before considering moving right, see if we can obtain enough
+             * space by erasing LP_DELETE items
+             */
+            if (P_ISLEAF(lpageop) && P_HAS_GARBAGE(lpageop))
+            {
+                _bt_vacuum_one_page(rel, buf);
+                if (PageGetFreeSpace(page) >= itemsz)
+                    break;      /* OK, now we have enough space */
+            }
+
+            /*
+             * nope, so check conditions (b) and (c) enumerated above
+             */
+            if (P_RIGHTMOST(lpageop) ||
+                _bt_compare(rel, keysz, scankey, page, P_HIKEY) != 0 ||
+                random() <= (MAX_RANDOM_VALUE / 100))
+                break;
+
             /*
              * step right to next non-dead page
              *

@@ -438,7 +459,7 @@ _bt_insertonpg(Relation rel,
              * pages won't do because we don't know when they will get
              * de-linked from the tree.
              */
-            Buffer      rbuf = InvalidBuffer;
+            rbuf = InvalidBuffer;
 
             for (;;)
             {

@@ -702,9 +723,9 @@ _bt_split(Relation rel, Buffer buf, OffsetNumber firstright,
     ropaque = (BTPageOpaque) PageGetSpecialPointer(rightpage);
 
     /* if we're splitting this page, it won't be the root when we're done */
-    /* also, clear the SPLIT_END flag in both pages */
+    /* also, clear the SPLIT_END and HAS_GARBAGE flags in both pages */
     lopaque->btpo_flags = oopaque->btpo_flags;
-    lopaque->btpo_flags &= ~(BTP_ROOT | BTP_SPLIT_END);
+    lopaque->btpo_flags &= ~(BTP_ROOT | BTP_SPLIT_END | BTP_HAS_GARBAGE);
     ropaque->btpo_flags = lopaque->btpo_flags;
     lopaque->btpo_prev = oopaque->btpo_prev;
     lopaque->btpo_next = BufferGetBlockNumber(rbuf);

@@ -1651,3 +1672,47 @@ _bt_isequal(TupleDesc itupdesc, Page page, OffsetNumber offnum,
     /* if we get here, the keys are equal */
     return true;
 }
+
+/*
+ * _bt_vacuum_one_page - vacuum just one index page.
+ *
+ * Try to remove LP_DELETE items from the given page.  The passed buffer
+ * must be exclusive-locked, but unlike a real VACUUM, we don't need a
+ * super-exclusive "cleanup" lock (see nbtree/README).
+ */
+static void
+_bt_vacuum_one_page(Relation rel, Buffer buffer)
+{
+    OffsetNumber deletable[MaxOffsetNumber];
+    int          ndeletable = 0;
+    OffsetNumber offnum,
+                 minoff,
+                 maxoff;
+    Page         page = BufferGetPage(buffer);
+    BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+
+    /*
+     * Scan over all items to see which ones need deleted
+     * according to LP_DELETE flags.
+     */
+    minoff = P_FIRSTDATAKEY(opaque);
+    maxoff = PageGetMaxOffsetNumber(page);
+    for (offnum = minoff;
+         offnum <= maxoff;
+         offnum = OffsetNumberNext(offnum))
+    {
+        ItemId      itemId = PageGetItemId(page, offnum);
+
+        if (ItemIdDeleted(itemId))
+            deletable[ndeletable++] = offnum;
+    }
+
+    if (ndeletable > 0)
+        _bt_delitems(rel, buffer, deletable, ndeletable);
+    /*
+     * Note: if we didn't find any LP_DELETE items, then the page's
+     * BTP_HAS_GARBAGE hint bit is falsely set.  We do not bother
+     * expending a separate write to clear it, however.  We will clear
+     * it when we split the page.
+     */
+}

src/backend/access/nbtree/nbtpage.c

Lines changed: 10 additions & 1 deletion
@@ -9,7 +9,7 @@
  *
  *
  * IDENTIFICATION
- *    $PostgreSQL: pgsql/src/backend/access/nbtree/nbtpage.c,v 1.98 2006/07/13 16:49:12 momjian Exp $
+ *    $PostgreSQL: pgsql/src/backend/access/nbtree/nbtpage.c,v 1.99 2006/07/25 19:13:00 tgl Exp $
  *
  * NOTES
  *    Postgres btree pages look like ordinary relation pages.  The opaque

@@ -668,6 +668,15 @@ _bt_delitems(Relation rel, Buffer buf,
     opaque = (BTPageOpaque) PageGetSpecialPointer(page);
     opaque->btpo_cycleid = 0;
 
+    /*
+     * Mark the page as not containing any LP_DELETE items.  This is not
+     * certainly true (there might be some that have recently been marked,
+     * but weren't included in our target-item list), but it will almost
+     * always be true and it doesn't seem worth an additional page scan
+     * to check it.  Remember that BTP_HAS_GARBAGE is only a hint anyway.
+     */
+    opaque->btpo_flags &= ~BTP_HAS_GARBAGE;
+
     MarkBufferDirty(buf);
 
     /* XLOG stuff */

src/backend/access/nbtree/nbtutils.c

Lines changed: 7 additions & 1 deletion
@@ -8,7 +8,7 @@
  *
  *
  * IDENTIFICATION
- *    $PostgreSQL: pgsql/src/backend/access/nbtree/nbtutils.c,v 1.77 2006/07/13 16:49:12 momjian Exp $
+ *    $PostgreSQL: pgsql/src/backend/access/nbtree/nbtutils.c,v 1.78 2006/07/25 19:13:00 tgl Exp $
  *
  *-------------------------------------------------------------------------
  */

@@ -884,9 +884,15 @@ _bt_killitems(IndexScanDesc scan, bool haveLock)
      * Since this can be redone later if needed, it's treated the same
      * as a commit-hint-bit status update for heap tuples: we mark the
      * buffer dirty but don't make a WAL log entry.
+     *
+     * Whenever we mark anything LP_DELETEd, we also set the page's
+     * BTP_HAS_GARBAGE flag, which is likewise just a hint.
      */
     if (killedsomething)
+    {
+        opaque->btpo_flags |= BTP_HAS_GARBAGE;
         SetBufferCommitInfoNeedsSave(so->currPos.buf);
+    }
 
     if (!haveLock)
         LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
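
Note how the hunk above sets the page-level BTP_HAS_GARBAGE hint at the same moment the item-level LP_DELETE bits are set, without WAL-logging either change. A minimal standalone illustration of that flag plumbing follows (toy code, not from the PostgreSQL tree; the bit value and macro merely mimic the style of the nbtree.h definitions further down):

#include <stdint.h>
#include <stdio.h>

/* Toy page-flag word, in the spirit of btpo_flags (value is illustrative). */
#define TOY_HAS_GARBAGE (1 << 6)    /* page may contain flagged-dead items */

#define TOY_PAGE_HAS_GARBAGE(flags) ((flags) & TOY_HAS_GARBAGE)

int
main(void)
{
    uint16_t    btpo_flags = 0;

    /* Scan side: an item was just flagged dead, so advertise it on the page. */
    btpo_flags |= TOY_HAS_GARBAGE;

    /* Insert side: only bother scanning for dead items if the hint is set. */
    if (TOY_PAGE_HAS_GARBAGE(btpo_flags))
        printf("worth scanning this page for dead items\n");

    /* Cleanup side: after removing flagged items, drop the hint. */
    btpo_flags &= (uint16_t) ~TOY_HAS_GARBAGE;

    printf("hint now %s\n", TOY_PAGE_HAS_GARBAGE(btpo_flags) ? "set" : "clear");
    return 0;
}
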

src/backend/access/nbtree/nbtxlog.c

Lines changed: 9 additions & 1 deletion
@@ -8,7 +8,7 @@
  * Portions Copyright (c) 1994, Regents of the University of California
  *
  * IDENTIFICATION
- *    $PostgreSQL: pgsql/src/backend/access/nbtree/nbtxlog.c,v 1.35 2006/07/14 14:52:17 momjian Exp $
+ *    $PostgreSQL: pgsql/src/backend/access/nbtree/nbtxlog.c,v 1.36 2006/07/25 19:13:00 tgl Exp $
  *
  *-------------------------------------------------------------------------
  */

@@ -346,6 +346,7 @@ btree_xlog_delete(XLogRecPtr lsn, XLogRecord *record)
     Relation    reln;
     Buffer      buffer;
     Page        page;
+    BTPageOpaque opaque;
 
     if (record->xl_info & XLR_BKP_BLOCK_1)
         return;

@@ -374,6 +375,13 @@ btree_xlog_delete(XLogRecPtr lsn, XLogRecord *record)
         PageIndexMultiDelete(page, unused, unend - unused);
     }
 
+    /*
+     * Mark the page as not containing any LP_DELETE items --- see comments
+     * in _bt_delitems().
+     */
+    opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+    opaque->btpo_flags &= ~BTP_HAS_GARBAGE;
+
     PageSetLSN(page, lsn);
     PageSetTLI(page, ThisTimeLineID);
     MarkBufferDirty(buffer);

src/include/access/nbtree.h

Lines changed: 3 additions & 1 deletion
@@ -7,7 +7,7 @@
  * Portions Copyright (c) 1996-2006, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
  *
- * $PostgreSQL: pgsql/src/include/access/nbtree.h,v 1.101 2006/07/11 21:05:57 tgl Exp $
+ * $PostgreSQL: pgsql/src/include/access/nbtree.h,v 1.102 2006/07/25 19:13:00 tgl Exp $
  *
  *-------------------------------------------------------------------------
  */

@@ -71,6 +71,7 @@ typedef BTPageOpaqueData *BTPageOpaque;
 #define BTP_META        (1 << 3)    /* meta-page */
 #define BTP_HALF_DEAD   (1 << 4)    /* empty, but still in tree */
 #define BTP_SPLIT_END   (1 << 5)    /* rightmost page of split group */
+#define BTP_HAS_GARBAGE (1 << 6)    /* page has LP_DELETEd tuples */
 
 
 /*

@@ -163,6 +164,7 @@ typedef struct BTMetaPageData
 #define P_ISROOT(opaque)        ((opaque)->btpo_flags & BTP_ROOT)
 #define P_ISDELETED(opaque)     ((opaque)->btpo_flags & BTP_DELETED)
 #define P_IGNORE(opaque)        ((opaque)->btpo_flags & (BTP_DELETED|BTP_HALF_DEAD))
+#define P_HAS_GARBAGE(opaque)   ((opaque)->btpo_flags & BTP_HAS_GARBAGE)
 
 /*
  * Lehman and Yao's algorithm requires a ``high key'' on every non-rightmost
