
Commit 9c02cf5

Remove block number field from nbtree stack.

The initial value of the nbtree stack downlink block number field recorded
during an initial descent of the tree wasn't actually used. Both
_bt_getstackbuf() callers overwrote the value with their own value.

Remove the block number field from the stack struct, and add a child block
number argument to _bt_getstackbuf() in its place. This makes the overall
design of _bt_getstackbuf() clearer.

Author: Peter Geoghegan
Reviewed-By: Anastasia Lubennikova
Discussion: https://postgr.es/m/CAH2-Wzmx+UbXt2YNOUCZ-a04VdXU=S=OHuAuD7Z8uQq-PXTYUg@mail.gmail.com

1 parent fded477 commit 9c02cf5

5 files changed: +50 -42 lines changed

src/backend/access/nbtree/README

Lines changed: 7 additions & 1 deletion
@@ -224,7 +224,13 @@ it, but it's still linked to its siblings.
 
 (Note: Lanin and Shasha prefer to make the key space move left, but their
 argument for doing so hinges on not having left-links, which we have
-anyway. So we simplify the algorithm by moving key space right.)
+anyway. So we simplify the algorithm by moving the key space right. Note
+also that Lanin and Shasha optimistically avoid holding multiple locks as
+the tree is ascended. They're willing to release all locks and retry in
+"rare" cases where the correct location for a new downlink cannot be found
+immediately. We prefer to stick with Lehman and Yao's approach of
+pessimistically coupling buffer locks when ascending the tree, since it's
+far simpler.)
 
 To preserve consistency on the parent level, we cannot merge the key space
 of a page into its right sibling unless the right sibling is a child of
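
The lock coupling that the README note refers to can be illustrated with a toy model. This is only a sketch under stated assumptions, not nbtree code: `CoupledPage`, `toy_lock`, `toy_unlock`, and `toy_ascend` are hypothetical names, the "locks" are plain flags rather than buffer locks, and the real unlock order during a page split is more involved than shown here. The point is only that the parent is locked before the child is released, so at most two locks are held at once and never zero.

```c
#include <assert.h>

/* Hypothetical stand-in for a page with a buffer lock; not PostgreSQL code. */
typedef struct CoupledPage
{
    int locked;             /* 1 while "locked", 0 otherwise */
} CoupledPage;

static int held_now = 0;    /* locks currently held by this toy backend */
static int held_max = 0;    /* high-water mark of simultaneously held locks */

static void
toy_lock(CoupledPage *page)
{
    assert(!page->locked);
    page->locked = 1;
    if (++held_now > held_max)
        held_max = held_now;
}

static void
toy_unlock(CoupledPage *page)
{
    assert(page->locked);
    page->locked = 0;
    held_now--;
}

/*
 * Ascend from path[0] (the child, already locked by the caller) to
 * path[n - 1].  Each parent is locked before its child is released.
 * Returns the page that is left locked at the top of the path.
 */
static CoupledPage *
toy_ascend(CoupledPage **path, int n)
{
    int i;

    assert(n > 0 && path[0]->locked);
    for (i = 1; i < n; i++)
    {
        toy_lock(path[i]);          /* take the parent's lock first ... */
        toy_unlock(path[i - 1]);    /* ... then let go of the child */
    }
    return path[n - 1];
}
```

An optimistic scheme in the style the note attributes to Lanin and Shasha would instead drop the child's lock before taking the parent's, and retry when the downlink location cannot be found; the sketch above does not model that alternative.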

src/backend/access/nbtree/nbtinsert.c

Lines changed: 30 additions & 16 deletions
@@ -1797,7 +1797,6 @@ _bt_insert_parent(Relation rel,
             stack = &fakestack;
             stack->bts_blkno = BufferGetBlockNumber(pbuf);
             stack->bts_offset = InvalidOffsetNumber;
-            stack->bts_btentry = InvalidBlockNumber;
             stack->bts_parent = NULL;
             _bt_relbuf(rel, pbuf);
         }
@@ -1819,8 +1818,7 @@ _bt_insert_parent(Relation rel,
          * new downlink will be inserted at the correct offset. Even buf's
          * parent may have changed.
          */
-        stack->bts_btentry = bknum;
-        pbuf = _bt_getstackbuf(rel, stack);
+        pbuf = _bt_getstackbuf(rel, stack, bknum);
 
         /*
          * Now we can unlock the right child. The left child will be unlocked
@@ -1834,7 +1832,7 @@ _bt_insert_parent(Relation rel,
                      errmsg_internal("failed to re-find parent key in index \"%s\" for split pages %u/%u",
                                      RelationGetRelationName(rel), bknum, rbknum)));
 
-        /* Recursively update the parent */
+        /* Recursively insert into the parent */
         _bt_insertonpg(rel, NULL, pbuf, buf, stack->bts_parent,
                        new_item, stack->bts_offset + 1,
                        is_only);
@@ -1901,21 +1899,37 @@ _bt_finish_split(Relation rel, Buffer lbuf, BTStack stack)
 }
 
 /*
- *  _bt_getstackbuf() -- Walk back up the tree one step, and find the item
- *                       we last looked at in the parent.
+ *  _bt_getstackbuf() -- Walk back up the tree one step, and find the pivot
+ *                       tuple whose downlink points to child page.
  *
- *      This is possible because we save the downlink from the parent item,
- *      which is enough to uniquely identify it. Insertions into the parent
- *      level could cause the item to move right; deletions could cause it
- *      to move left, but not left of the page we previously found it in.
+ *      Caller passes child's block number, which is used to identify
+ *      associated pivot tuple in parent page using a linear search that
+ *      matches on pivot's downlink/block number. The expected location of
+ *      the pivot tuple is taken from the stack one level above the child
+ *      page. This is used as a starting point. Insertions into the
+ *      parent level could cause the pivot tuple to move right; deletions
+ *      could cause it to move left, but not left of the page we previously
+ *      found it on.
  *
- *      Adjusts bts_blkno & bts_offset if changed.
+ *      Caller can use its stack to relocate the pivot tuple/downlink for
+ *      any same-level page to the right of the page found by its initial
+ *      descent. This is necessary because of the possibility that caller
+ *      moved right to recover from a concurrent page split. It's also
+ *      convenient for certain callers to be able to step right when there
+ *      wasn't a concurrent page split, while still using their original
+ *      stack. For example, the checkingunique _bt_doinsert() case may
+ *      have to step right when there are many physical duplicates, and its
+ *      scantid forces an insertion to the right of the "first page the
+ *      value could be on".
  *
- *      Returns write-locked buffer, or InvalidBuffer if item not found
- *      (should not happen).
+ *      Returns write-locked parent page buffer, or InvalidBuffer if pivot
+ *      tuple not found (should not happen). Adjusts bts_blkno &
+ *      bts_offset if changed. Page split caller should insert its new
+ *      pivot tuple for its new right sibling page on parent page, at the
+ *      offset number bts_offset + 1.
  */
 Buffer
-_bt_getstackbuf(Relation rel, BTStack stack)
+_bt_getstackbuf(Relation rel, BTStack stack, BlockNumber child)
 {
     BlockNumber blkno;
     OffsetNumber start;
@@ -1977,7 +1991,7 @@ _bt_getstackbuf(Relation rel, BTStack stack)
                 itemid = PageGetItemId(page, offnum);
                 item = (IndexTuple) PageGetItem(page, itemid);
 
-                if (BTreeInnerTupleGetDownLink(item) == stack->bts_btentry)
+                if (BTreeInnerTupleGetDownLink(item) == child)
                 {
                     /* Return accurate pointer to where link is now */
                     stack->bts_blkno = blkno;
@@ -1993,7 +2007,7 @@ _bt_getstackbuf(Relation rel, BTStack stack)
                 itemid = PageGetItemId(page, offnum);
                 item = (IndexTuple) PageGetItem(page, itemid);
 
-                if (BTreeInnerTupleGetDownLink(item) == stack->bts_btentry)
+                if (BTreeInnerTupleGetDownLink(item) == child)
                 {
                     /* Return accurate pointer to where link is now */
                     stack->bts_blkno = blkno;
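
The relocation that the _bt_getstackbuf() header comment describes can be sketched with a toy model. The following is a simplified illustration, not the PostgreSQL implementation: `ToyPage`, `ToyStack`, and `toy_getstackbuf` are hypothetical names, pages are array entries with 0-based slots, and buffer locking, incomplete-split handling, and ignorable pages are all left out. It assumes the search starts at the slot remembered on the stack, looks right first, then left on the same page, and then moves to the right sibling.

```c
#include <assert.h>

/* Toy stand-ins for PostgreSQL types; every name here is hypothetical. */
#define TOY_INVALID_BLOCK (-1)
#define TOY_MAX_ITEMS 8

typedef struct ToyPage
{
    int nitems;                     /* number of pivot tuples on the page */
    int downlink[TOY_MAX_ITEMS];    /* child block number of each pivot */
    int right;                      /* right sibling, or TOY_INVALID_BLOCK */
} ToyPage;

typedef struct ToyStack
{
    int blkno;      /* parent block remembered from the descent */
    int offset;     /* 0-based slot remembered from the descent */
} ToyStack;

/*
 * Find the pivot whose downlink equals child.  On success, update the
 * stack to the pivot's current location and return 1.  Return 0 if the
 * rightmost page is reached without a match; the stack is then unchanged.
 */
static int
toy_getstackbuf(const ToyPage *pages, ToyStack *stack, int child)
{
    int blkno = stack->blkno;
    int start = stack->offset;

    assert(blkno != TOY_INVALID_BLOCK);
    while (blkno != TOY_INVALID_BLOCK)
    {
        const ToyPage *page = &pages[blkno];
        int off;

        /* Clamp the remembered slot to what exists on the page now. */
        if (start < 0)
            start = 0;
        if (start > page->nitems)
            start = page->nitems;

        /* Insertions push the pivot right, so scan rightwards first. */
        for (off = start; off < page->nitems; off++)
        {
            if (page->downlink[off] == child)
            {
                stack->blkno = blkno;
                stack->offset = off;
                return 1;
            }
        }

        /* Deletions may have pulled it left, but not off this page. */
        for (off = start - 1; off >= 0; off--)
        {
            if (page->downlink[off] == child)
            {
                stack->blkno = blkno;
                stack->offset = off;
                return 1;
            }
        }

        /* Not here: a page split may have moved it to a right sibling. */
        blkno = page->right;
        start = 0;
    }
    return 0;
}
```

Passing `child` as an argument mirrors the signature change in this commit: the stack only has to carry a starting location, while the caller states which downlink it wants found.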

src/backend/access/nbtree/nbtpage.c

Lines changed: 1 addition & 2 deletions
@@ -1189,8 +1189,7 @@ _bt_lock_branch_parent(Relation rel, BlockNumber child, BTStack stack,
      * non-unique high keys in leaf level pages. Even heapkeyspace indexes
      * can have a stale stack due to insertions into the parent.
      */
-    stack->bts_btentry = child;
-    pbuf = _bt_getstackbuf(rel, stack);
+    pbuf = _bt_getstackbuf(rel, stack, child);
     if (pbuf == InvalidBuffer)
         ereport(ERROR,
                 (errcode(ERRCODE_INDEX_CORRUPTED),

src/backend/access/nbtree/nbtsearch.c

Lines changed: 6 additions & 13 deletions
@@ -146,23 +146,16 @@ _bt_search(Relation rel, BTScanInsert key, Buffer *bufP, int access,
         par_blkno = BufferGetBlockNumber(*bufP);
 
         /*
-         * We need to save the location of the index entry we chose in the
-         * parent page on a stack. In case we split the tree, we'll use the
-         * stack to work back up to the parent page. We also save the actual
-         * downlink (block) to uniquely identify the index entry, in case it
-         * moves right while we're working lower in the tree. See the paper
-         * by Lehman and Yao for how this is detected and handled. (We use the
-         * child link during the second half of a page split -- if caller ends
-         * up splitting the child it usually ends up inserting a new pivot
-         * tuple for child's new right sibling immediately after the original
-         * bts_offset offset recorded here. The downlink block will be needed
-         * to check if bts_offset remains the position of this same pivot
-         * tuple.)
+         * We need to save the location of the pivot tuple we chose in the
+         * parent page on a stack. If we need to split a page, we'll use
+         * the stack to work back up to its parent page. If caller ends up
+         * splitting a page one level down, it usually ends up inserting a
+         * new pivot tuple/downlink immediately after the location recorded
+         * here.
          */
         new_stack = (BTStack) palloc(sizeof(BTStackData));
         new_stack->bts_blkno = par_blkno;
         new_stack->bts_offset = offnum;
-        new_stack->bts_btentry = blkno;
         new_stack->bts_parent = stack_in;
 
         /*
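
The stack that this hunk builds during descent is a singly linked list with one entry per internal level. A minimal sketch of that shape follows, assuming plain `malloc` in place of palloc and using hypothetical names (`ToyStackData`, `toy_push`, `toy_depth`, `toy_free`); it is not the PostgreSQL code, and it stores only the (block, offset) location, as the struct does after this commit.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical analogue of the slimmed-down stack entry. */
typedef struct ToyStackData
{
    int blkno;                      /* internal page we descended through */
    int offset;                     /* slot of the pivot tuple we followed */
    struct ToyStackData *parent;    /* entry for the level above, or NULL */
} ToyStackData;

/* Push one level onto the stack; called once per internal page visited. */
static ToyStackData *
toy_push(ToyStackData *stack_in, int blkno, int offset)
{
    ToyStackData *new_stack = malloc(sizeof(ToyStackData));

    if (new_stack == NULL)
        abort();                    /* toy error handling only */
    assert(blkno >= 0 && offset >= 0);
    new_stack->blkno = blkno;
    new_stack->offset = offset;
    new_stack->parent = stack_in;
    return new_stack;
}

/* Number of internal levels recorded on the stack. */
static int
toy_depth(const ToyStackData *stack)
{
    int depth = 0;

    for (; stack != NULL; stack = stack->parent)
        depth++;
    return depth;
}

/* Release every entry, lowest level first. */
static void
toy_free(ToyStackData *stack)
{
    while (stack != NULL)
    {
        ToyStackData *up = stack->parent;

        free(stack);
        stack = up;
    }
}
```

The top of the list is the lowest internal level visited, so walking the `parent` pointers retraces the descent in reverse, which is the order needed when a split has to propagate upward.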

src/include/access/nbtree.h

Lines changed: 6 additions & 10 deletions
@@ -403,20 +403,16 @@ typedef struct BTMetaPageData
 #define BT_WRITE BUFFER_LOCK_EXCLUSIVE
 
 /*
- * BTStackData -- As we descend a tree, we push the (location, downlink)
- * pairs from internal pages onto a private stack. If we split a
- * leaf, we use this stack to walk back up the tree and insert data
- * into parent pages (and possibly to split them, too). Lehman and
- * Yao's update algorithm guarantees that under no circumstances can
- * our private stack give us an irredeemably bad picture up the tree.
- * Again, see the paper for details.
+ * BTStackData -- As we descend a tree, we push the location of pivot
+ * tuples whose downlink we are about to follow onto a private stack. If
+ * we split a leaf, we use this stack to walk back up the tree and insert
+ * data into its parent page at the correct location. We may also have to
+ * recursively split a grandparent of the leaf page (and so on).
  */
-
 typedef struct BTStackData
 {
     BlockNumber bts_blkno;
     OffsetNumber bts_offset;
-    BlockNumber bts_btentry;
     struct BTStackData *bts_parent;
 } BTStackData;
 
@@ -731,7 +727,7 @@ extern void _bt_parallel_advance_array_keys(IndexScanDesc scan);
  */
 extern bool _bt_doinsert(Relation rel, IndexTuple itup,
              IndexUniqueCheck checkUnique, Relation heapRel);
-extern Buffer _bt_getstackbuf(Relation rel, BTStack stack);
+extern Buffer _bt_getstackbuf(Relation rel, BTStack stack, BlockNumber child);
 extern void _bt_finish_split(Relation rel, Buffer bbuf, BTStack stack);
 
 /*
