Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
Skip to content

Commit 0bef1c0

Browse files
committed
Re-think predicate locking on GIN indexes.
The principle behind the locking was not very well thought-out, and not documented. Add a section in the README to explain how it's supposed to work, and change the code so that it actually works that way. This fixes two bugs: 1. If fast update was turned on concurrently, subsequent inserts to the pending list would not conflict with predicate locks that were acquired earlier, on entry pages. The included 'predicate-gin-fastupdate' test demonstrates that. To fix, make all scans acquire a predicate lock on the metapage. That lock represents a scan of the pending list, whether or not there is a pending list at the moment. Forget about the optimization to skip locking/checking for locks, when fastupdate=off. 2. If a scan finds no match, it still needs to lock the entry page. The point of predicate locks is to lock the gabs between values, whether or not there is a match. The included 'predicate-gin-nomatch' test tests that case. In addition to those two bug fixes, this removes some unnecessary locking, following the principle laid out in the README. Because all items in a posting tree have the same key value, a lock on the posting tree root is enough to cover all the items. (With a very large posting tree, it would possibly be better to lock the posting tree leaf pages instead, so that a "skip scan" with a query like "A & B", you could avoid unnecessary conflict if a new tuple is inserted with A but !B. But let's keep this simple.) Also, some spelling fixes. Author: Heikki Linnakangas with some editorization by me Review: Andrey Borodin, Alexander Korotkov Discussion: https://www.postgresql.org/message-id/0b3ad2c2-2692-62a9-3a04-5724f2af9114@iki.fi
1 parent 7d86799 commit 0bef1c0

18 files changed

+251
-117
lines changed

src/backend/access/gin/README

Lines changed: 34 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -331,6 +331,40 @@ page-deletions safe; it stamps the deleted pages with an XID and keeps the
331331
deleted pages around with the right-link intact until all concurrent scans
332332
have finished.)
333333

334+
Predicate Locking
335+
-----------------
336+
337+
GIN supports predicate locking, for serializable snapshot isolation.
338+
A predicate locks represent that a scan has scanned a range of values. They
339+
are not concerned with physical pages as such, but the logical key values.
340+
A predicate lock on a page covers the key range that would belong on that
341+
page, whether or not there are any matching tuples there currently. In other
342+
words, a predicate lock on an index page covers the "gaps" between the index
343+
tuples. To minimize false positives, predicate locks are acquired at the
344+
finest level possible.
345+
346+
* Like in the B-tree index, it is enough to lock only leaf pages, because all
347+
insertions happen at the leaf level.
348+
349+
* In an equality search (i.e. not a partial match search), if a key entry has
350+
a posting tree, we lock the posting tree root page, to represent a lock on
351+
just that key entry. Otherwise, we lock the entry tree page. We also lock
352+
the entry tree page if no match is found, to lock the "gap" where the entry
353+
would've been, had there been one.
354+
355+
* In a partial match search, we lock all the entry leaf pages that we scan,
356+
in addition to locks on posting tree roots, to represent the "gaps" between
357+
values.
358+
359+
* In addition to the locks on entry leaf pages and posting tree roots, all
360+
scans grab a lock the metapage. This is to interlock with insertions to
361+
the fast update pending list. An insertion to the pending list can really
362+
belong anywhere in the tree, and the lock on the metapage represents that.
363+
364+
The interlock for fastupdate pending lists means that with fastupdate=on,
365+
we effectively always grab a full-index lock, so you could get a lot of false
366+
positives.
367+
334368
Compatibility
335369
-------------
336370

src/backend/access/gin/ginbtree.c

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -84,6 +84,9 @@ ginFindLeafPage(GinBtree btree, bool searchMode, Snapshot snapshot)
8484
stack->parent = NULL;
8585
stack->predictNumber = 1;
8686

87+
if (!searchMode)
88+
CheckForSerializableConflictIn(btree->index, NULL, stack->buffer);
89+
8790
for (;;)
8891
{
8992
Page page;

src/backend/access/gin/gindatapage.c

Lines changed: 3 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -1812,8 +1812,8 @@ createPostingTree(Relation index, ItemPointerData *items, uint32 nitems,
18121812
blkno = BufferGetBlockNumber(buffer);
18131813

18141814
/*
1815-
* Copy a predicate lock from entry tree leaf (containing posting list) to
1816-
* posting tree.
1815+
* Copy any predicate locks from the entry tree leaf (containing posting
1816+
* list) to the posting tree.
18171817
*/
18181818
PredicateLockPageSplit(index, BufferGetBlockNumber(entrybuffer), blkno);
18191819

@@ -1864,7 +1864,7 @@ createPostingTree(Relation index, ItemPointerData *items, uint32 nitems,
18641864
return blkno;
18651865
}
18661866

1867-
void
1867+
static void
18681868
ginPrepareDataScan(GinBtree btree, Relation index, BlockNumber rootBlkno)
18691869
{
18701870
memset(btree, 0, sizeof(GinBtreeData));
@@ -1911,7 +1911,6 @@ ginInsertItemPointers(Relation index, BlockNumber rootBlkno,
19111911
btree.itemptr = insertdata.items[insertdata.curitem];
19121912
stack = ginFindLeafPage(&btree, false, NULL);
19131913

1914-
GinCheckForSerializableConflictIn(btree.index, NULL, stack->buffer);
19151914
ginInsertValue(&btree, stack, &insertdata, buildStats);
19161915
}
19171916
}

src/backend/access/gin/ginfast.c

Lines changed: 8 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -31,6 +31,7 @@
3131
#include "postmaster/autovacuum.h"
3232
#include "storage/indexfsm.h"
3333
#include "storage/lmgr.h"
34+
#include "storage/predicate.h"
3435
#include "utils/builtins.h"
3536

3637
/* GUC parameter */
@@ -245,6 +246,13 @@ ginHeapTupleFastInsert(GinState *ginstate, GinTupleCollector *collector)
245246
metabuffer = ReadBuffer(index, GIN_METAPAGE_BLKNO);
246247
metapage = BufferGetPage(metabuffer);
247248

249+
/*
250+
* An insertion to the pending list could logically belong anywhere in
251+
* the tree, so it conflicts with all serializable scans. All scans
252+
* acquire a predicate lock on the metabuffer to represent that.
253+
*/
254+
CheckForSerializableConflictIn(index, NULL, metabuffer);
255+
248256
if (collector->sumsize + collector->ntuples * sizeof(ItemIdData) > GinListPageSize)
249257
{
250258
/*

src/backend/access/gin/ginget.c

Lines changed: 54 additions & 62 deletions
Original file line numberDiff line numberDiff line change
@@ -35,20 +35,6 @@ typedef struct pendingPosition
3535
} pendingPosition;
3636

3737

38-
/*
39-
* Place predicate lock on GIN page if needed.
40-
*/
41-
static void
42-
GinPredicateLockPage(Relation index, BlockNumber blkno, Snapshot snapshot)
43-
{
44-
/*
45-
* When fast update is on then no need in locking pages, because we anyway
46-
* need to lock the whole index.
47-
*/
48-
if (!GinGetUseFastUpdate(index))
49-
PredicateLockPage(index, blkno, snapshot);
50-
}
51-
5238
/*
5339
* Goes to the next page if current offset is outside of bounds
5440
*/
@@ -68,7 +54,7 @@ moveRightIfItNeeded(GinBtreeData *btree, GinBtreeStack *stack, Snapshot snapshot
6854
stack->buffer = ginStepRight(stack->buffer, btree->index, GIN_SHARE);
6955
stack->blkno = BufferGetBlockNumber(stack->buffer);
7056
stack->off = FirstOffsetNumber;
71-
GinPredicateLockPage(btree->index, stack->blkno, snapshot);
57+
PredicateLockPage(btree->index, stack->blkno, snapshot);
7258
}
7359

7460
return true;
@@ -100,11 +86,6 @@ scanPostingTree(Relation index, GinScanEntry scanEntry,
10086
*/
10187
for (;;)
10288
{
103-
/*
104-
* Predicate lock each leaf page in posting tree
105-
*/
106-
GinPredicateLockPage(index, BufferGetBlockNumber(buffer), snapshot);
107-
10889
page = BufferGetPage(buffer);
10990
if ((GinPageGetOpaque(page)->flags & GIN_DELETED) == 0)
11091
{
@@ -158,7 +139,7 @@ collectMatchBitmap(GinBtreeData *btree, GinBtreeStack *stack,
158139
* Predicate lock entry leaf page, following pages will be locked by
159140
* moveRightIfItNeeded()
160141
*/
161-
GinPredicateLockPage(btree->index, stack->buffer, snapshot);
142+
PredicateLockPage(btree->index, stack->buffer, snapshot);
162143

163144
for (;;)
164145
{
@@ -253,6 +234,13 @@ collectMatchBitmap(GinBtreeData *btree, GinBtreeStack *stack,
253234

254235
LockBuffer(stack->buffer, GIN_UNLOCK);
255236

237+
/*
238+
* Acquire predicate lock on the posting tree. We already hold
239+
* a lock on the entry page, but insertions to the posting tree
240+
* don't check for conflicts on that level.
241+
*/
242+
PredicateLockPage(btree->index, rootPostingTree, snapshot);
243+
256244
/* Collect all the TIDs in this entry's posting tree */
257245
scanPostingTree(btree->index, scanEntry, rootPostingTree,
258246
snapshot);
@@ -400,17 +388,20 @@ startScanEntry(GinState *ginstate, GinScanEntry entry, Snapshot snapshot)
400388
{
401389
IndexTuple itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, stackEntry->off));
402390

403-
/* Predicate lock visited entry leaf page */
404-
GinPredicateLockPage(ginstate->index,
405-
BufferGetBlockNumber(stackEntry->buffer), snapshot);
406-
407391
if (GinIsPostingTree(itup))
408392
{
409393
BlockNumber rootPostingTree = GinGetPostingTree(itup);
410394
GinBtreeStack *stack;
411395
Page page;
412396
ItemPointerData minItem;
413397

398+
/*
399+
* This is an equality scan, so lock the root of the posting tree.
400+
* It represents a lock on the exact key value, and covers all the
401+
* items in the posting tree.
402+
*/
403+
PredicateLockPage(ginstate->index, rootPostingTree, snapshot);
404+
414405
/*
415406
* We should unlock entry page before touching posting tree to
416407
* prevent deadlocks with vacuum processes. Because entry is never
@@ -425,12 +416,6 @@ startScanEntry(GinState *ginstate, GinScanEntry entry, Snapshot snapshot)
425416
rootPostingTree, snapshot);
426417
entry->buffer = stack->buffer;
427418

428-
/*
429-
* Predicate lock visited posting tree page, following pages will
430-
* be locked by moveRightIfItNeeded or entryLoadMoreItems
431-
*/
432-
GinPredicateLockPage(ginstate->index, BufferGetBlockNumber(entry->buffer), snapshot);
433-
434419
/*
435420
* We keep buffer pinned because we need to prevent deletion of
436421
* page during scan. See GIN's vacuum implementation. RefCount is
@@ -452,15 +437,38 @@ startScanEntry(GinState *ginstate, GinScanEntry entry, Snapshot snapshot)
452437
freeGinBtreeStack(stack);
453438
entry->isFinished = false;
454439
}
455-
else if (GinGetNPosting(itup) > 0)
440+
else
456441
{
457-
entry->list = ginReadTuple(ginstate, entry->attnum, itup,
458-
&entry->nlist);
459-
entry->predictNumberResult = entry->nlist;
442+
/*
443+
* Lock the entry leaf page. This is more coarse-grained than
444+
* necessary, because it will conflict with any insertions that
445+
* land on the same leaf page, not only the exacty key we searched
446+
* for. But locking an individual tuple would require updating
447+
* that lock whenever it moves because of insertions or vacuums,
448+
* which seems too complicated.
449+
*/
450+
PredicateLockPage(ginstate->index,
451+
BufferGetBlockNumber(stackEntry->buffer),
452+
snapshot);
453+
if (GinGetNPosting(itup) > 0)
454+
{
455+
entry->list = ginReadTuple(ginstate, entry->attnum, itup,
456+
&entry->nlist);
457+
entry->predictNumberResult = entry->nlist;
460458

461-
entry->isFinished = false;
459+
entry->isFinished = false;
460+
}
462461
}
463462
}
463+
else
464+
{
465+
/*
466+
* No entry found. Predicate lock the leaf page, to lock the place
467+
* where the entry would've been, had there been one.
468+
*/
469+
PredicateLockPage(ginstate->index,
470+
BufferGetBlockNumber(stackEntry->buffer), snapshot);
471+
}
464472

465473
if (needUnlock)
466474
LockBuffer(stackEntry->buffer, GIN_UNLOCK);
@@ -533,7 +541,7 @@ startScanKey(GinState *ginstate, GinScanOpaque so, GinScanKey key)
533541

534542
for (i = 0; i < key->nentries - 1; i++)
535543
{
536-
/* Pass all entries <= i as false, and the rest as MAYBE */
544+
/* Pass all entries <= i as FALSE, and the rest as MAYBE */
537545
for (j = 0; j <= i; j++)
538546
key->entryRes[entryIndexes[j]] = GIN_FALSE;
539547
for (j = i + 1; j < key->nentries; j++)
@@ -673,8 +681,6 @@ entryLoadMoreItems(GinState *ginstate, GinScanEntry entry,
673681
entry->btree.fullScan = false;
674682
stack = ginFindLeafPage(&entry->btree, true, snapshot);
675683

676-
GinPredicateLockPage(ginstate->index, BufferGetBlockNumber(stack->buffer), snapshot);
677-
678684
/* we don't need the stack, just the buffer. */
679685
entry->buffer = stack->buffer;
680686
IncrBufferRefCount(entry->buffer);
@@ -719,10 +725,6 @@ entryLoadMoreItems(GinState *ginstate, GinScanEntry entry,
719725
entry->buffer = ginStepRight(entry->buffer,
720726
ginstate->index,
721727
GIN_SHARE);
722-
723-
GinPredicateLockPage(ginstate->index, BufferGetBlockNumber(entry->buffer), snapshot);
724-
725-
726728
page = BufferGetPage(entry->buffer);
727729
}
728730
stepright = true;
@@ -1084,8 +1086,8 @@ keyGetItem(GinState *ginstate, MemoryContext tempCtx, GinScanKey key,
10841086
* lossy page even when none of the other entries match.
10851087
*
10861088
* Our strategy is to call the tri-state consistent function, with the
1087-
* lossy-page entries set to MAYBE, and all the other entries false. If it
1088-
* returns false, none of the lossy items alone are enough for a match, so
1089+
* lossy-page entries set to MAYBE, and all the other entries FALSE. If it
1090+
* returns FALSE, none of the lossy items alone are enough for a match, so
10891091
* we don't need to return a lossy-page pointer. Otherwise, return a
10901092
* lossy-page pointer to indicate that the whole heap page must be
10911093
* checked. (On subsequent calls, we'll do nothing until minItem is past
@@ -1746,8 +1748,7 @@ collectMatchesForHeapRow(IndexScanDesc scan, pendingPosition *pos)
17461748
}
17471749

17481750
/*
1749-
* Collect all matched rows from pending list into bitmap. Also function
1750-
* takes PendingLockRelation if it's needed.
1751+
* Collect all matched rows from pending list into bitmap.
17511752
*/
17521753
static void
17531754
scanPendingInsert(IndexScanDesc scan, TIDBitmap *tbm, int64 *ntids)
@@ -1764,6 +1765,12 @@ scanPendingInsert(IndexScanDesc scan, TIDBitmap *tbm, int64 *ntids)
17641765

17651766
*ntids = 0;
17661767

1768+
/*
1769+
* Acquire predicate lock on the metapage, to conflict with any
1770+
* fastupdate insertions.
1771+
*/
1772+
PredicateLockPage(scan->indexRelation, GIN_METAPAGE_BLKNO, scan->xs_snapshot);
1773+
17671774
LockBuffer(metabuffer, GIN_SHARE);
17681775
page = BufferGetPage(metabuffer);
17691776
TestForOldSnapshot(scan->xs_snapshot, scan->indexRelation, page);
@@ -1777,24 +1784,9 @@ scanPendingInsert(IndexScanDesc scan, TIDBitmap *tbm, int64 *ntids)
17771784
{
17781785
/* No pending list, so proceed with normal scan */
17791786
UnlockReleaseBuffer(metabuffer);
1780-
1781-
/*
1782-
* If fast update is enabled, we acquire a predicate lock on the
1783-
* entire relation as fast update postpones the insertion of tuples
1784-
* into index structure due to which we can't detect rw conflicts.
1785-
*/
1786-
if (GinGetUseFastUpdate(scan->indexRelation))
1787-
PredicateLockRelation(scan->indexRelation, scan->xs_snapshot);
1788-
17891787
return;
17901788
}
17911789

1792-
/*
1793-
* Pending list is not empty, we need to lock the index doesn't despite on
1794-
* fastupdate state
1795-
*/
1796-
PredicateLockRelation(scan->indexRelation, scan->xs_snapshot);
1797-
17981790
pos.pendingBuffer = ReadBuffer(scan->indexRelation, blkno);
17991791
LockBuffer(pos.pendingBuffer, GIN_SHARE);
18001792
pos.firstOffset = FirstOffsetNumber;

src/backend/access/gin/gininsert.c

Lines changed: 2 additions & 24 deletions
Original file line numberDiff line numberDiff line change
@@ -219,7 +219,7 @@ ginEntryInsert(GinState *ginstate,
219219
return;
220220
}
221221

222-
GinCheckForSerializableConflictIn(btree.index, NULL, stack->buffer);
222+
CheckForSerializableConflictIn(ginstate->index, NULL, stack->buffer);
223223
/* modify an existing leaf entry */
224224
itup = addItemPointersToLeafTuple(ginstate, itup,
225225
items, nitem, buildStats, stack->buffer);
@@ -228,7 +228,7 @@ ginEntryInsert(GinState *ginstate,
228228
}
229229
else
230230
{
231-
GinCheckForSerializableConflictIn(btree.index, NULL, stack->buffer);
231+
CheckForSerializableConflictIn(ginstate->index, NULL, stack->buffer);
232232
/* no match, so construct a new leaf entry */
233233
itup = buildFreshLeafTuple(ginstate, attnum, key, category,
234234
items, nitem, buildStats, stack->buffer);
@@ -517,18 +517,6 @@ gininsert(Relation index, Datum *values, bool *isnull,
517517

518518
memset(&collector, 0, sizeof(GinTupleCollector));
519519

520-
/*
521-
* With fastupdate on each scan and each insert begin with access to
522-
* pending list, so it effectively lock entire index. In this case we
523-
* aquire predicate lock and check for conflicts over index relation,
524-
* and hope that it will reduce locking overhead.
525-
*
526-
* Do not use GinCheckForSerializableConflictIn() here, because it
527-
* will do nothing (it does actual work only with fastupdate off).
528-
* Check for conflicts for entire index.
529-
*/
530-
CheckForSerializableConflictIn(index, NULL, InvalidBuffer);
531-
532520
for (i = 0; i < ginstate->origTupdesc->natts; i++)
533521
ginHeapTupleFastCollect(ginstate, &collector,
534522
(OffsetNumber) (i + 1),
@@ -539,16 +527,6 @@ gininsert(Relation index, Datum *values, bool *isnull,
539527
}
540528
else
541529
{
542-
GinStatsData stats;
543-
544-
/*
545-
* Fastupdate is off but if pending list isn't empty then we need to
546-
* check conflicts with PredicateLockRelation in scanPendingInsert().
547-
*/
548-
ginGetStats(index, &stats);
549-
if (stats.nPendingPages > 0)
550-
CheckForSerializableConflictIn(index, NULL, InvalidBuffer);
551-
552530
for (i = 0; i < ginstate->origTupdesc->natts; i++)
553531
ginHeapTupleInsert(ginstate, (OffsetNumber) (i + 1),
554532
values[i], isnull[i],

0 commit comments

Comments
 (0)