
Commit 5749f6e

Rewrite btree vacuuming to fold the former bulkdelete and cleanup operations into a single mostly-physical-order scan of the index. This requires some ticklish interlocking considerations, but should create no material performance impact on normal index operations (at least given the already-committed changes to make scans work a page at a time). VACUUM itself should get significantly faster in any index that's degenerated to a very nonlinear page order. Also, we save one pass over the index entirely, except in the case where there were no deletions to do and so only one pass happened anyway.

Original patch by Heikki Linnakangas, rework by Tom Lane.
1 parent 09cb5c0 commit 5749f6e

File tree

10 files changed: +682 −243 lines changed


src/backend/access/nbtree/README

Lines changed: 27 additions & 5 deletions

@@ -1,4 +1,4 @@
-$PostgreSQL: pgsql/src/backend/access/nbtree/README,v 1.11 2006/05/07 01:21:30 tgl Exp $
+$PostgreSQL: pgsql/src/backend/access/nbtree/README,v 1.12 2006/05/08 00:00:09 tgl Exp $
 
 This directory contains a correct implementation of Lehman and Yao's
 high-concurrency B-tree management algorithm (P. Lehman and S. Yao,
@@ -293,10 +293,32 @@ as part of the atomic update for the delete (either way, the metapage has
 to be the last page locked in the update to avoid deadlock risks).  This
 avoids race conditions if two such operations are executing concurrently.
 
-VACUUM needs to do a linear scan of an index to search for empty leaf
-pages and half-dead parent pages that can be deleted, as well as deleted
-pages that can be reclaimed because they are older than all open
-transactions.
+VACUUM needs to do a linear scan of an index to search for deleted pages
+that can be reclaimed because they are older than all open transactions.
+For efficiency's sake, we'd like to use the same linear scan to search for
+deletable tuples.  Before Postgres 8.2, btbulkdelete scanned the leaf pages
+in index order, but it is possible to visit them in physical order instead.
+The tricky part of this is to avoid missing any deletable tuples in the
+presence of concurrent page splits: a page split could easily move some
+tuples from a page not yet passed over by the sequential scan to a
+lower-numbered page already passed over.  (This wasn't a concern for the
+index-order scan, because splits always split right.)  To implement this,
+we provide a "vacuum cycle ID" mechanism that makes it possible to
+determine whether a page has been split since the current btbulkdelete
+cycle started.  If btbulkdelete finds a page that has been split since
+it started, and has a right-link pointing to a lower page number, then
+it temporarily suspends its sequential scan and visits that page instead.
+It must continue to follow right-links and vacuum dead tuples until
+reaching a page that either hasn't been split since btbulkdelete started,
+or is above the location of the outer sequential scan.  Then it can resume
+the sequential scan.  This ensures that all tuples are visited.  It may be
+that some tuples are visited twice, but that has no worse effect than an
+inaccurate index tuple count (and we can't guarantee an accurate count
+anyway in the face of concurrent activity).  Note that this still works
+if the has-been-recently-split test has a small probability of false
+positives, so long as it never gives a false negative.  This makes it
+possible to implement the test with a small counter value stored on each
+index page.
 
 WAL considerations
 ------------------

src/backend/access/nbtree/nbtinsert.c

Lines changed: 22 additions & 2 deletions

@@ -8,7 +8,7 @@
  *
  *
  * IDENTIFICATION
- *	  $PostgreSQL: pgsql/src/backend/access/nbtree/nbtinsert.c,v 1.136 2006/04/25 22:46:05 tgl Exp $
+ *	  $PostgreSQL: pgsql/src/backend/access/nbtree/nbtinsert.c,v 1.137 2006/05/08 00:00:09 tgl Exp $
  *
  *-------------------------------------------------------------------------
  */
@@ -700,14 +700,18 @@ _bt_split(Relation rel, Buffer buf, OffsetNumber firstright,
 	ropaque = (BTPageOpaque) PageGetSpecialPointer(rightpage);
 
 	/* if we're splitting this page, it won't be the root when we're done */
+	/* also, clear the SPLIT_END flag in both pages */
 	lopaque->btpo_flags = oopaque->btpo_flags;
-	lopaque->btpo_flags &= ~BTP_ROOT;
+	lopaque->btpo_flags &= ~(BTP_ROOT | BTP_SPLIT_END);
 	ropaque->btpo_flags = lopaque->btpo_flags;
 	lopaque->btpo_prev = oopaque->btpo_prev;
 	lopaque->btpo_next = BufferGetBlockNumber(rbuf);
 	ropaque->btpo_prev = BufferGetBlockNumber(buf);
 	ropaque->btpo_next = oopaque->btpo_next;
 	lopaque->btpo.level = ropaque->btpo.level = oopaque->btpo.level;
+	/* Since we already have write-lock on both pages, ok to read cycleid */
+	lopaque->btpo_cycleid = _bt_vacuum_cycleid(rel);
+	ropaque->btpo_cycleid = lopaque->btpo_cycleid;
 
 	/*
 	 * If the page we're splitting is not the rightmost page at its level in
@@ -836,6 +840,21 @@ _bt_split(Relation rel, Buffer buf, OffsetNumber firstright,
 		sopaque = (BTPageOpaque) PageGetSpecialPointer(spage);
 		if (sopaque->btpo_prev != ropaque->btpo_prev)
 			elog(PANIC, "right sibling's left-link doesn't match");
+		/*
+		 * Check to see if we can set the SPLIT_END flag in the right-hand
+		 * split page; this can save some I/O for vacuum since it need not
+		 * proceed to the right sibling.  We can set the flag if the right
+		 * sibling has a different cycleid: that means it could not be part
+		 * of a group of pages that were all split off from the same ancestor
+		 * page.  If you're confused, imagine that page A splits to A B and
+		 * then again, yielding A C B, while vacuum is in progress.  Tuples
+		 * originally in A could now be in either B or C, hence vacuum must
+		 * examine both pages.  But if D, our right sibling, has a different
+		 * cycleid then it could not contain any tuples that were in A when
+		 * the vacuum started.
+		 */
+		if (sopaque->btpo_cycleid != ropaque->btpo_cycleid)
+			ropaque->btpo_flags |= BTP_SPLIT_END;
 	}
 
 	/*
@@ -1445,6 +1464,7 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 	rootopaque->btpo_flags = BTP_ROOT;
 	rootopaque->btpo.level =
 		((BTPageOpaque) PageGetSpecialPointer(lpage))->btpo.level + 1;
+	rootopaque->btpo_cycleid = 0;
 
 	/* update metapage data */
 	metad->btm_root = rootblknum;

src/backend/access/nbtree/nbtpage.c

Lines changed: 12 additions & 3 deletions

@@ -9,7 +9,7 @@
  *
  *
  * IDENTIFICATION
- *	  $PostgreSQL: pgsql/src/backend/access/nbtree/nbtpage.c,v 1.96 2006/04/25 22:46:05 tgl Exp $
+ *	  $PostgreSQL: pgsql/src/backend/access/nbtree/nbtpage.c,v 1.97 2006/05/08 00:00:10 tgl Exp $
  *
  * NOTES
  *	  Postgres btree pages look like ordinary relation pages.  The opaque
@@ -206,6 +206,7 @@ _bt_getroot(Relation rel, int access)
 		rootopaque->btpo_prev = rootopaque->btpo_next = P_NONE;
 		rootopaque->btpo_flags = (BTP_LEAF | BTP_ROOT);
 		rootopaque->btpo.level = 0;
+		rootopaque->btpo_cycleid = 0;
 
 		/* NO ELOG(ERROR) till meta is updated */
 		START_CRIT_SECTION();
@@ -544,7 +545,7 @@ _bt_getbuf(Relation rel, BlockNumber blkno, int access)
 			 * Release the file-extension lock; it's now OK for someone else to
 			 * extend the relation some more.  Note that we cannot release this
 			 * lock before we have buffer lock on the new page, or we risk a race
-			 * condition against btvacuumcleanup --- see comments therein.
+			 * condition against btvacuumscan --- see comments therein.
 			 */
 			if (needLock)
 				UnlockRelationForExtension(rel, ExclusiveLock);
@@ -608,7 +609,7 @@ _bt_pageinit(Page page, Size size)
 /*
  *	_bt_page_recyclable() -- Is an existing page recyclable?
  *
- * This exists to make sure _bt_getbuf and btvacuumcleanup have the same
+ * This exists to make sure _bt_getbuf and btvacuumscan have the same
  * policy about whether a page is safe to re-use.
  */
 bool
@@ -651,13 +652,21 @@ _bt_delitems(Relation rel, Buffer buf,
 			 OffsetNumber *itemnos, int nitems)
 {
 	Page		page = BufferGetPage(buf);
+	BTPageOpaque opaque;
 
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
 	/* Fix the page */
 	PageIndexMultiDelete(page, itemnos, nitems);
 
+	/*
+	 * We can clear the vacuum cycle ID since this page has certainly
+	 * been processed by the current vacuum scan.
+	 */
+	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+	opaque->btpo_cycleid = 0;
+
 	MarkBufferDirty(buf);
 
 	/* XLOG stuff */
