Commit d2086b0

Reduce path length for locking leaf B-tree pages during insertion
In our B-tree implementation, the appropriate leaf page for a new tuple insertion is acquired using the _bt_search() function. This function always returns the leaf page locked in shared mode. In order to obtain an exclusive lock, the caller has to relock the page.

This commit makes _bt_search() lock the leaf page immediately in exclusive mode when needed. That removes the unnecessary relock and, in turn, reduces lock contention for B-tree leaf pages. Our experiments on multi-core systems showed acceleration of up to 4.5 times in a corner case.

Discussion: https://postgr.es/m/CAPpHfduAMDFMNYTCN7VMBsFg_hsf0GqiqXnt%2BbSeaJworwFoig%40mail.gmail.com
Author: Alexander Korotkov
Reviewed-by: Yoshikazu Imai, Simon Riggs, Peter Geoghegan
1 parent 8a9b72c commit d2086b0
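As a rough illustration of the saving described in the commit message, the following toy model counts the lock and unlock calls made on the leaf page during a descent for insertion. It is a sketch only: `leaf_lock_calls` and its parameters are hypothetical names, not PostgreSQL functions, and the model ignores pins, splits, and contention.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy model, not PostgreSQL code: count LockBuffer-style calls made on the
 * leaf page when descending for an insertion.  root_level is the level of
 * the root page, with leaves at level 0.
 */
static int
leaf_lock_calls(int root_level, bool lock_leaf_directly)
{
    assert(root_level >= 0);

    /*
     * The leaf can be write-locked on arrival only when it is reached from
     * a level-1 parent.  If the root is itself the leaf, the relock path
     * is still taken.
     */
    if (lock_leaf_directly && root_level >= 1)
        return 1;       /* lock(write) */

    return 3;           /* lock(read), unlock, lock(write) */
}
```

Under this model a two-level tree costs three calls on the leaf before the change and one after it, while a single-page index costs three calls either way.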

2 files changed, +41 -22 lines changed

src/backend/access/nbtree/nbtinsert.c

+4 -15
@@ -216,23 +216,12 @@ _bt_doinsert(Relation rel, IndexTuple itup,

     if (!fastpath)
     {
-        /* find the first page containing this key */
-        stack = _bt_search(rel, indnkeyatts, itup_scankey, false, &buf, BT_WRITE,
-                           NULL);
-
-        /* trade in our read lock for a write lock */
-        LockBuffer(buf, BUFFER_LOCK_UNLOCK);
-        LockBuffer(buf, BT_WRITE);
-
         /*
-         * If the page was split between the time that we surrendered our read
-         * lock and acquired our write lock, then this page may no longer be
-         * the right place for the key we want to insert. In this case, we
-         * need to move right in the tree. See Lehman and Yao for an
-         * excruciatingly precise description.
+         * Find the first page containing this key. Buffer returned by
+         * _bt_search() is locked in exclusive mode.
          */
-        buf = _bt_moveright(rel, buf, indnkeyatts, itup_scankey, false,
-                            true, stack, BT_WRITE, NULL);
+        stack = _bt_search(rel, indnkeyatts, itup_scankey, false, &buf, BT_WRITE,
+                           NULL);
     }

     /*
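The comment removed from _bt_doinsert() explains why a relock must be followed by a move to the right: the page may have split while no lock was held. A minimal sketch of that step is shown here, assuming a hypothetical `toy_page` structure with a high key and a right-sibling link. It leaves out the locking and the completion of incomplete splits that the real _bt_moveright() performs.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified page: real nbtree pages are far richer. */
struct toy_page
{
    int              high_key;  /* upper bound of keys on this page */
    struct toy_page *right;     /* right sibling, or NULL if rightmost */
};

/*
 * Follow right links until reaching the page whose key range covers key.
 * The rightmost page is treated as having no upper bound.
 */
static struct toy_page *
toy_move_right(struct toy_page *page, int key)
{
    assert(page != NULL);

    while (page->right != NULL && key > page->high_key)
        page = page->right;

    return page;
}
```

If a page covering keys up to 100 splits into a left half with high key 50 and a new right sibling, a search for key 70 that still holds a reference to the left half ends up on the sibling after one step.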

src/backend/access/nbtree/nbtsearch.c

+37 -7
@@ -87,17 +87,18 @@ _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp)
  * place during the descent through the tree. This is not needed when
  * positioning for an insert or delete, so NULL is used for those cases.
  *
- * NOTE that the returned buffer is read-locked regardless of the access
- * parameter. However, access = BT_WRITE will allow an empty root page
- * to be created and returned. When access = BT_READ, an empty index
- * will result in *bufP being set to InvalidBuffer. Also, in BT_WRITE mode,
- * any incomplete splits encountered during the search will be finished.
+ * The returned buffer is locked according to access parameter. Additionally,
+ * access = BT_WRITE will allow an empty root page to be created and returned.
+ * When access = BT_READ, an empty index will result in *bufP being set to
+ * InvalidBuffer. Also, in BT_WRITE mode, any incomplete splits encountered
+ * during the search will be finished.
  */
 BTStack
 _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
            Buffer *bufP, int access, Snapshot snapshot)
 {
     BTStack stack_in = NULL;
+    int page_access = BT_READ;

     /* Get the root page to start with */
     *bufP = _bt_getroot(rel, access);
@@ -132,7 +133,7 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
          */
         *bufP = _bt_moveright(rel, *bufP, keysz, scankey, nextkey,
                               (access == BT_WRITE), stack_in,
-                              BT_READ, snapshot);
+                              page_access, snapshot);

         /* if this is a leaf page, we're done */
         page = BufferGetPage(*bufP);
@@ -166,13 +167,42 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
         new_stack->bts_btentry = blkno;
         new_stack->bts_parent = stack_in;

+        /*
+         * Page level 1 is lowest non-leaf page level prior to leaves. So,
+         * if we're on the level 1 and asked to lock leaf page in write mode,
+         * then lock next page in write mode, because it must be a leaf.
+         */
+        if (opaque->btpo.level == 1 && access == BT_WRITE)
+            page_access = BT_WRITE;
+
         /* drop the read lock on the parent page, acquire one on the child */
-        *bufP = _bt_relandgetbuf(rel, *bufP, blkno, BT_READ);
+        *bufP = _bt_relandgetbuf(rel, *bufP, blkno, page_access);

         /* okay, all set to move down a level */
         stack_in = new_stack;
     }

+    /*
+     * If we're asked to lock leaf in write mode, but didn't manage to, then
+     * relock. That may happen when root page appears to be leaf.
+     */
+    if (access == BT_WRITE && page_access == BT_READ)
+    {
+        /* trade in our read lock for a write lock */
+        LockBuffer(*bufP, BUFFER_LOCK_UNLOCK);
+        LockBuffer(*bufP, BT_WRITE);
+
+        /*
+         * If the page was split between the time that we surrendered our read
+         * lock and acquired our write lock, then this page may no longer be
+         * the right place for the key we want to insert. In this case, we
+         * need to move right in the tree. See Lehman and Yao for an
+         * excruciatingly precise description.
+         */
+        *bufP = _bt_moveright(rel, *bufP, keysz, scankey, nextkey,
+                              true, stack_in, BT_WRITE, snapshot);
+    }
+
     return stack_in;
 }
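The lock-mode choice in this hunk can be traced with a small standalone model. This is a sketch under stated assumptions, not the actual implementation: `toy_access`, `child_lock_mode`, and `toy_descend` are hypothetical names, levels are plain integers with leaves at level 0, and the root is assumed to arrive read-locked.

```c
#include <assert.h>

enum toy_access { TOY_READ, TOY_WRITE };

/* Lock mode for the child of a page at parent_level (leaves are level 0). */
static enum toy_access
child_lock_mode(int parent_level, enum toy_access access)
{
    assert(parent_level >= 1);

    /* Only the child of a level-1 page is guaranteed to be a leaf. */
    return (parent_level == 1 && access == TOY_WRITE) ? TOY_WRITE : TOY_READ;
}

/*
 * Walk from root_level down to the leaf, recording in modes[level] the mode
 * in which the page at that level is locked.  modes must have room for
 * root_level + 1 entries.  Returns 1 if the leaf still needs the
 * read-to-write relock, 0 otherwise.
 */
static int
toy_descend(int root_level, enum toy_access access, enum toy_access *modes)
{
    enum toy_access page_access = TOY_READ;
    int level;

    assert(root_level >= 0);
    modes[root_level] = TOY_READ;   /* assumption: root arrives read-locked */

    for (level = root_level; level > 0; level--)
    {
        page_access = child_lock_mode(level, access);
        modes[level - 1] = page_access;
    }

    return access == TOY_WRITE && page_access == TOY_READ;
}
```

For a root at level 2 and write access, the trace is read, read, write and no relock is reported. For a root that is itself the leaf, the loop never runs and the function reports that the relock is still required, which corresponds to the fallback branch added at the end of _bt_search().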
