Commit 76837c1

Reduce use of heavyweight locking inside hash AM.
Avoid using LockPage(rel, 0, lockmode) to protect against changes to the bucket mapping. Instead, an exclusive buffer content lock is now viewed as sufficient permission to modify the metapage, and a shared buffer content lock is used when such modifications need to be prevented. This more relaxed locking regimen makes it possible that, when we're busy getting a heavyweight lock on the bucket we intend to search or insert into, a bucket split might occur underneath us. To compensate for that possibility, we use a loop-and-retry system: release the metapage content lock, acquire the heavyweight lock on the target bucket, and then reacquire the metapage content lock and check that the bucket mapping has not changed. Normally it hasn't, and we're done. But if by chance it has, we simply unlock the metapage, release the heavyweight lock we acquired previously, lock the new bucket, and loop around again. Even in the worst case we cannot loop very many times here, since we don't split the same bucket again until we've split all the other buckets, and 2^N gets big pretty fast.

This results in greatly improved concurrency, because we're effectively replacing two lwlock acquire-and-release cycles in exclusive mode (on one of the lock manager locks) with a single acquire-and-release cycle in shared mode (on the metapage buffer content lock). Testing shows that it's still not quite as good as btree; for that, we'd probably have to find some way of getting rid of the heavyweight bucket locks as well, which does not appear straightforward.

Patch by me, review by Jeff Janes.
1 parent 038f3a0 commit 76837c1
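
To make the retry protocol concrete, here is a condensed sketch in C of the loop described above, using the helpers that appear in the hashinsert.c hunk below (_hash_hashkey2bucket, BUCKET_TO_BLKNO, _hash_chgbufaccess, _hash_getlock, _hash_droplock). It is an abbreviated illustration of the patched code path, not a drop-in replacement; surrounding declarations and error handling are omitted.

    /*
     * Sketch: compute the target bucket while holding a shared metapage
     * content lock, drop that lock, take the heavyweight per-bucket lock,
     * then relock the metapage and re-check the bucket mapping.
     */
    BlockNumber oldblkno = InvalidBlockNumber;
    bool        retry = false;

    for (;;)
    {
        bucket = _hash_hashkey2bucket(hashkey,
                                      metap->hashm_maxbucket,
                                      metap->hashm_highmask,
                                      metap->hashm_lowmask);
        blkno = BUCKET_TO_BLKNO(metap, bucket);

        /* release metapage content lock, but keep the pin */
        _hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);

        /* if the bucket locked on the previous pass is still correct, done */
        if (retry)
        {
            if (oldblkno == blkno)
                break;
            _hash_droplock(rel, oldblkno, HASH_SHARE);
        }
        _hash_getlock(rel, blkno, HASH_SHARE);

        /* relock the metapage and loop to re-check the bucket mapping */
        _hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_READ);
        oldblkno = blkno;
        retry = true;
    }

Each additional retry requires the target bucket to have been split again, which cannot happen until every other bucket has also been split, so the number of iterations is bounded by the number of times the table doubles while we wait.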

File tree

4 files changed, 146 insertions(+), 130 deletions(-)


src/backend/access/hash/README

Lines changed: 66 additions & 67 deletions
@@ -132,15 +132,6 @@ long-term locking since there is a (small) risk of deadlock, which we must
 be able to detect. Buffer context locks are used for short-term access
 control to individual pages of the index.

-We define the following lmgr locks for a hash index:
-
-LockPage(rel, 0) represents the right to modify the hash-code-to-bucket
-mapping. A process attempting to enlarge the hash table by splitting a
-bucket must exclusive-lock this lock before modifying the metapage data
-representing the mapping. Processes intending to access a particular
-bucket must share-lock this lock until they have acquired lock on the
-correct target bucket.
-
 LockPage(rel, page), where page is the page number of a hash bucket page,
 represents the right to split or compact an individual bucket. A process
 splitting a bucket must exclusive-lock both old and new halves of the
@@ -150,7 +141,10 @@ insertions must share-lock the bucket they are scanning or inserting into.
 (It is okay to allow concurrent scans and insertions.)

 The lmgr lock IDs corresponding to overflow pages are currently unused.
-These are available for possible future refinements.
+These are available for possible future refinements. LockPage(rel, 0)
+is also currently undefined (it was previously used to represent the right
+to modify the hash-code-to-bucket mapping, but it is no longer needed for
+that purpose).

 Note that these lock definitions are conceptually distinct from any sort
 of lock on the pages whose numbers they share. A process must also obtain
@@ -165,9 +159,7 @@ hash index code, since a process holding one of these locks could block
 waiting for an unrelated lock held by another process. If that process
 then does something that requires exclusive lock on the bucket, we have
 deadlock. Therefore the bucket locks must be lmgr locks so that deadlock
-can be detected and recovered from. This also forces the page-zero lock
-to be an lmgr lock, because as we'll see below it is held while attempting
-to acquire a bucket lock, and so it could also participate in a deadlock.
+can be detected and recovered from.

 Processes must obtain read (share) buffer context lock on any hash index
 page while reading it, and write (exclusive) lock while modifying it.
@@ -195,24 +187,30 @@ track of available overflow pages.

 The reader algorithm is:

-    share-lock page 0 (to prevent active split)
-    read/sharelock meta page
-    compute bucket number for target hash key
-    release meta page
-    share-lock bucket page (to prevent split/compact of this bucket)
-    release page 0 share-lock
+    pin meta page and take buffer content lock in shared mode
+    loop:
+        compute bucket number for target hash key
+        release meta page buffer content lock
+        if (correct bucket page is already locked)
+            break
+        release any existing bucket page lock (if a concurrent split happened)
+        take heavyweight bucket lock
+        retake meta page buffer content lock in shared mode
     -- then, per read request:
-    read/sharelock current page of bucket
+    release pin on metapage
+    read current page of bucket and take shared buffer content lock
     step to next page if necessary (no chaining of locks)
     get tuple
-    release current page
+    release buffer content lock and pin on current page
     -- at scan shutdown:
     release bucket share-lock

-By holding the page-zero lock until lock on the target bucket is obtained,
-the reader ensures that the target bucket calculation is valid (otherwise
-the bucket might be split before the reader arrives at it, and the target
-entries might go into the new bucket). Holding the bucket sharelock for
+We can't hold the metapage lock while acquiring a lock on the target bucket,
+because that might result in an undetected deadlock (lwlocks do not participate
+in deadlock detection). Instead, we relock the metapage after acquiring the
+bucket page lock and check whether the bucket has been split. If not, we're
+done. If so, we release our previously-acquired lock and repeat the process
+using the new bucket number. Holding the bucket sharelock for
 the remainder of the scan prevents the reader's current-tuple pointer from
 being invalidated by splits or compactions. Notice that the reader's lock
 does not prevent other buckets from being split or compacted.
@@ -229,22 +227,26 @@ as it was before.

 The insertion algorithm is rather similar:

-    share-lock page 0 (to prevent active split)
-    read/sharelock meta page
-    compute bucket number for target hash key
-    release meta page
-    share-lock bucket page (to prevent split/compact of this bucket)
-    release page 0 share-lock
+    pin meta page and take buffer content lock in shared mode
+    loop:
+        compute bucket number for target hash key
+        release meta page buffer content lock
+        if (correct bucket page is already locked)
+            break
+        release any existing bucket page lock (if a concurrent split happened)
+        take heavyweight bucket lock in shared mode
+        retake meta page buffer content lock in shared mode
     -- (so far same as reader)
-    read/exclusive-lock current page of bucket
+    release pin on metapage
+    pin current page of bucket and take exclusive buffer content lock
     if full, release, read/exclusive-lock next page; repeat as needed
     >> see below if no space in any page of bucket
     insert tuple at appropriate place in page
-    write/release current page
-    release bucket share-lock
-    read/exclusive-lock meta page
+    mark current page dirty and release buffer content lock and pin
+    release heavyweight share-lock
+    pin meta page and take buffer content lock in shared mode
     increment tuple count, decide if split needed
-    write/release meta page
+    mark meta page dirty and release buffer content lock and pin
     done if no split needed, else enter Split algorithm below

 To speed searches, the index entries within any individual index page are
@@ -269,26 +271,23 @@ index is overfull (has a higher-than-wanted ratio of tuples to buckets).
 The algorithm attempts, but does not necessarily succeed, to split one
 existing bucket in two, thereby lowering the fill ratio:

-    exclusive-lock page 0 (assert the right to begin a split)
-    read/exclusive-lock meta page
+    pin meta page and take buffer content lock in exclusive mode
     check split still needed
-    if split not needed anymore, drop locks and exit
+    if split not needed anymore, drop buffer content lock and pin and exit
     decide which bucket to split
     Attempt to X-lock old bucket number (definitely could fail)
     Attempt to X-lock new bucket number (shouldn't fail, but...)
-    if above fail, drop locks and exit
+    if above fail, drop locks and pin and exit
     update meta page to reflect new number of buckets
-    write/release meta page
-    release X-lock on page 0
+    mark meta page dirty and release buffer content lock and pin
     -- now, accesses to all other buckets can proceed.
     Perform actual split of bucket, moving tuples as needed
     >> see below about acquiring needed extra space
     Release X-locks of old and new buckets

-Note the page zero and metapage locks are not held while the actual tuple
-rearrangement is performed, so accesses to other buckets can proceed in
-parallel; in fact, it's possible for multiple bucket splits to proceed
-in parallel.
+Note the metapage lock is not held while the actual tuple rearrangement is
+performed, so accesses to other buckets can proceed in parallel; in fact,
+it's possible for multiple bucket splits to proceed in parallel.

 Split's attempt to X-lock the old bucket number could fail if another
 process holds S-lock on it. We do not want to wait if that happens, first
@@ -316,20 +315,20 @@ go-round.
 The fourth operation is garbage collection (bulk deletion):

     next bucket := 0
-    read/sharelock meta page
+    pin metapage and take buffer content lock in exclusive mode
     fetch current max bucket number
-    release meta page
+    release meta page buffer content lock and pin
     while next bucket <= max bucket do
         Acquire X lock on target bucket
         Scan and remove tuples, compact free space as needed
         Release X lock
         next bucket ++
     end loop
-    exclusive-lock meta page
+    pin metapage and take buffer content lock in exclusive mode
     check if number of buckets changed
-    if so, release lock and return to for-each-bucket loop
+    if so, release content lock and pin and return to for-each-bucket loop
     else update metapage tuple count
-    write/release meta page
+    mark meta page dirty and release buffer content lock and pin

 Note that this is designed to allow concurrent splits. If a split occurs,
 tuples relocated into the new bucket will be visited twice by the scan,
@@ -360,25 +359,25 @@ overflow page to the free pool.

 Obtaining an overflow page:

-    read/exclusive-lock meta page
+    take metapage content lock in exclusive mode
     determine next bitmap page number; if none, exit loop
-    release meta page lock
-    read/exclusive-lock bitmap page
+    release meta page content lock
+    pin bitmap page and take content lock in exclusive mode
     search for a free page (zero bit in bitmap)
     if found:
         set bit in bitmap
-        write/release bitmap page
-        read/exclusive-lock meta page
+        mark bitmap page dirty and release content lock
+        take metapage buffer content lock in exclusive mode
         if first-free-bit value did not change,
-            update it and write meta page
-        release meta page
+            update it and mark meta page dirty
+        release meta page buffer content lock
         return page number
     else (not found):
-        release bitmap page
+        release bitmap page buffer content lock
        loop back to try next bitmap page, if any
     -- here when we have checked all bitmap pages; we hold meta excl. lock
     extend index to add another overflow page; update meta information
-    write/release meta page
+    mark meta page dirty and release buffer content lock
     return page number

 It is slightly annoying to release and reacquire the metapage lock
@@ -428,17 +427,17 @@ algorithm is:

     delink overflow page from bucket chain
     (this requires read/update/write/release of fore and aft siblings)
-    read/share-lock meta page
+    pin meta page and take buffer content lock in shared mode
     determine which bitmap page contains the free space bit for page
-    release meta page
-    read/exclusive-lock bitmap page
+    release meta page buffer content lock
+    pin bitmap page and take buffer content lock in exclusive mode
     update bitmap bit
-    write/release bitmap page
+    mark bitmap page dirty and release buffer content lock and pin
     if page number is less than what we saw as first-free-bit in meta:
-        read/exclusive-lock meta page
+        retake meta page buffer content lock in exclusive mode
         if page number is still less than first-free-bit,
-            update first-free-bit field and write meta page
-        release meta page
+            update first-free-bit field and mark meta page dirty
+        release meta page buffer content lock and pin

 We have to do it this way because we must clear the bitmap bit before
 changing the first-free-bit field (hashm_firstfree). It is possible that

src/backend/access/hash/hashinsert.c

Lines changed: 35 additions & 19 deletions
@@ -32,6 +32,8 @@ _hash_doinsert(Relation rel, IndexTuple itup)
     Buffer metabuf;
     HashMetaPage metap;
     BlockNumber blkno;
+    BlockNumber oldblkno = InvalidBlockNumber;
+    bool retry = false;
     Page page;
     HashPageOpaque pageopaque;
     Size itemsz;
@@ -49,12 +51,6 @@ _hash_doinsert(Relation rel, IndexTuple itup)
     itemsz = MAXALIGN(itemsz);  /* be safe, PageAddItem will do this but we
                                  * need to be consistent */

-    /*
-     * Acquire shared split lock so we can compute the target bucket safely
-     * (see README).
-     */
-    _hash_getlock(rel, 0, HASH_SHARE);
-
     /* Read the metapage */
     metabuf = _hash_getbuf(rel, HASH_METAPAGE, HASH_READ, LH_META_PAGE);
     metap = HashPageGetMeta(BufferGetPage(metabuf));
@@ -75,24 +71,44 @@ _hash_doinsert(Relation rel, IndexTuple itup)
              errhint("Values larger than a buffer page cannot be indexed.")));

     /*
-     * Compute the target bucket number, and convert to block number.
+     * Loop until we get a lock on the correct target bucket.
     */
-    bucket = _hash_hashkey2bucket(hashkey,
-                                  metap->hashm_maxbucket,
-                                  metap->hashm_highmask,
-                                  metap->hashm_lowmask);
+    for (;;)
+    {
+        /*
+         * Compute the target bucket number, and convert to block number.
+         */
+        bucket = _hash_hashkey2bucket(hashkey,
+                                      metap->hashm_maxbucket,
+                                      metap->hashm_highmask,
+                                      metap->hashm_lowmask);

-    blkno = BUCKET_TO_BLKNO(metap, bucket);
+        blkno = BUCKET_TO_BLKNO(metap, bucket);

-    /* release lock on metapage, but keep pin since we'll need it again */
-    _hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+        /* Release metapage lock, but keep pin. */
+        _hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);

-    /*
-     * Acquire share lock on target bucket; then we can release split lock.
-     */
-    _hash_getlock(rel, blkno, HASH_SHARE);
+        /*
+         * If the previous iteration of this loop locked what is still the
+         * correct target bucket, we are done. Otherwise, drop any old lock
+         * and lock what now appears to be the correct bucket.
+         */
+        if (retry)
+        {
+            if (oldblkno == blkno)
+                break;
+            _hash_droplock(rel, oldblkno, HASH_SHARE);
+        }
+        _hash_getlock(rel, blkno, HASH_SHARE);

-    _hash_droplock(rel, 0, HASH_SHARE);
+        /*
+         * Reacquire metapage lock and check that no bucket split has taken
+         * place while we were awaiting the bucket lock.
+         */
+        _hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_READ);
+        oldblkno = blkno;
+        retry = true;
+    }

     /* Fetch the primary bucket page for the bucket */
     buf = _hash_getbuf(rel, blkno, HASH_WRITE, LH_BUCKET_PAGE);

src/backend/access/hash/hashpage.c

Lines changed: 5 additions & 23 deletions
@@ -57,9 +57,9 @@ static void _hash_splitbucket(Relation rel, Buffer metabuf,
 /*
  * _hash_getlock() -- Acquire an lmgr lock.
  *
- * 'whichlock' should be zero to acquire the split-control lock, or the
- * block number of a bucket's primary bucket page to acquire the per-bucket
- * lock. (See README for details of the use of these locks.)
+ * 'whichlock' should be the block number of a bucket's primary bucket page to
+ * acquire the per-bucket lock. (See README for details of the use of these
+ * locks.)
  *
  * 'access' must be HASH_SHARE or HASH_EXCLUSIVE.
  */
@@ -507,21 +507,9 @@ _hash_expandtable(Relation rel, Buffer metabuf)
     uint32 lowmask;

     /*
-     * Obtain the page-zero lock to assert the right to begin a split (see
-     * README).
-     *
-     * Note: deadlock should be impossible here. Our own backend could only be
-     * holding bucket sharelocks due to stopped indexscans; those will not
-     * block other holders of the page-zero lock, who are only interested in
-     * acquiring bucket sharelocks themselves. Exclusive bucket locks are
-     * only taken here and in hashbulkdelete, and neither of these operations
-     * needs any additional locks to complete. (If, due to some flaw in this
-     * reasoning, we manage to deadlock anyway, it's okay to error out; the
-     * index will be left in a consistent state.)
+     * Write-lock the meta page. It used to be necessary to acquire a
+     * heavyweight lock to begin a split, but that is no longer required.
      */
-    _hash_getlock(rel, 0, HASH_EXCLUSIVE);
-
-    /* Write-lock the meta page */
     _hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_WRITE);

     _hash_checkpage(rel, metabuf, LH_META_PAGE);
@@ -663,9 +651,6 @@ _hash_expandtable(Relation rel, Buffer metabuf)
     /* Write out the metapage and drop lock, but keep pin */
     _hash_chgbufaccess(rel, metabuf, HASH_WRITE, HASH_NOLOCK);

-    /* Release split lock; okay for other splits to occur now */
-    _hash_droplock(rel, 0, HASH_EXCLUSIVE);
-
     /* Relocate records to the new bucket */
     _hash_splitbucket(rel, metabuf, old_bucket, new_bucket,
                       start_oblkno, start_nblkno,
@@ -682,9 +667,6 @@ _hash_expandtable(Relation rel, Buffer metabuf)

     /* We didn't write the metapage, so just drop lock */
     _hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
-
-    /* Release split lock */
-    _hash_droplock(rel, 0, HASH_EXCLUSIVE);
 }

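
As a usage note on the comment change to _hash_getlock() above: after this patch, 'whichlock' is always the block number of a bucket's primary page, never 0. A minimal, hypothetical caller sketch (not taken from the patch) of the per-bucket heavyweight lock around a bucket scan or insert:

    /* Hypothetical caller sketch: the page-zero split-control lock is gone,
     * so the only lmgr locks taken on a hash index are per-bucket locks,
     * keyed by the block number of the bucket's primary page. */
    blkno = BUCKET_TO_BLKNO(metap, bucket);
    _hash_getlock(rel, blkno, HASH_SHARE);   /* never _hash_getlock(rel, 0, ...) */
    /* ... scan or insert into the bucket ... */
    _hash_droplock(rel, blkno, HASH_SHARE);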
