@@ -132,15 +132,6 @@ long-term locking since there is a (small) risk of deadlock, which we must
be able to detect. Buffer context locks are used for short-term access
control to individual pages of the index.

- We define the following lmgr locks for a hash index:
-
- LockPage(rel, 0) represents the right to modify the hash-code-to-bucket
- mapping. A process attempting to enlarge the hash table by splitting a
- bucket must exclusive-lock this lock before modifying the metapage data
- representing the mapping. Processes intending to access a particular
- bucket must share-lock this lock until they have acquired lock on the
- correct target bucket.
-
LockPage(rel, page), where page is the page number of a hash bucket page,
represents the right to split or compact an individual bucket. A process
splitting a bucket must exclusive-lock both old and new halves of the
@@ -150,7 +141,10 @@ insertions must share-lock the bucket they are scanning or inserting into.
(It is okay to allow concurrent scans and insertions.)

The lmgr lock IDs corresponding to overflow pages are currently unused.
- These are available for possible future refinements.
+ These are available for possible future refinements. LockPage(rel, 0)
+ is also currently undefined (it was previously used to represent the right
+ to modify the hash-code-to-bucket mapping, but it is no longer needed for
+ that purpose).

Note that these lock definitions are conceptually distinct from any sort
of lock on the pages whose numbers they share. A process must also obtain
@@ -165,9 +159,7 @@ hash index code, since a process holding one of these locks could block
waiting for an unrelated lock held by another process. If that process
then does something that requires exclusive lock on the bucket, we have
deadlock. Therefore the bucket locks must be lmgr locks so that deadlock
- can be detected and recovered from. This also forces the page-zero lock
- to be an lmgr lock, because as we'll see below it is held while attempting
- to acquire a bucket lock, and so it could also participate in a deadlock.
+ can be detected and recovered from.

Processes must obtain read (share) buffer context lock on any hash index
page while reading it, and write (exclusive) lock while modifying it.
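
To make that discipline concrete, here is a minimal C sketch using the standard
buffer-manager calls (ReadBuffer, LockBuffer, MarkBufferDirty). It is
illustrative only: the function names read_hash_page/modify_hash_page are
hypothetical, it assumes the usual backend environment, and it does not quote
the hash AM source.

    #include "postgres.h"
    #include "storage/bufmgr.h"
    #include "utils/rel.h"

    /* Reading a hash index page: share content lock while inspecting it. */
    static void
    read_hash_page(Relation rel, BlockNumber blkno)
    {
        Buffer      buf = ReadBuffer(rel, blkno);

        LockBuffer(buf, BUFFER_LOCK_SHARE);
        /* ... examine BufferGetPage(buf) here ... */
        UnlockReleaseBuffer(buf);       /* drop content lock and pin */
    }

    /* Modifying a hash index page: exclusive content lock while changing it. */
    static void
    modify_hash_page(Relation rel, BlockNumber blkno)
    {
        Buffer      buf = ReadBuffer(rel, blkno);

        LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
        /* ... modify BufferGetPage(buf) here ... */
        MarkBufferDirty(buf);
        UnlockReleaseBuffer(buf);
    }

The same pattern, with the heavier lmgr bucket locks layered on top, is what
the algorithms below spell out step by step.
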
@@ -195,24 +187,30 @@ track of available overflow pages.

The reader algorithm is:

-     share-lock page 0 (to prevent active split)
-     read/sharelock meta page
-     compute bucket number for target hash key
-     release meta page
-     share-lock bucket page (to prevent split/compact of this bucket)
-     release page 0 share-lock
+     pin meta page and take buffer content lock in shared mode
+     loop:
+         compute bucket number for target hash key
+         release meta page buffer content lock
+         if (correct bucket page is already locked)
+             break
+         release any existing bucket page lock (if a concurrent split happened)
+         take heavyweight bucket lock
+         retake meta page buffer content lock in shared mode
-- then, per read request:
-     read/sharelock current page of bucket
+     release pin on metapage
+     read current page of bucket and take shared buffer content lock
      step to next page if necessary (no chaining of locks)
      get tuple
-     release current page
+     release buffer content lock and pin on current page
-- at scan shutdown:
      release bucket share-lock

- By holding the page-zero lock until lock on the target bucket is obtained,
- the reader ensures that the target bucket calculation is valid (otherwise
- the bucket might be split before the reader arrives at it, and the target
- entries might go into the new bucket). Holding the bucket sharelock for
+ We can't hold the metapage lock while acquiring a lock on the target bucket,
+ because that might result in an undetected deadlock (lwlocks do not participate
+ in deadlock detection). Instead, we relock the metapage after acquiring the
+ bucket page lock and check whether the bucket has been split. If not, we're
+ done. If so, we release our previously-acquired lock and repeat the process
+ using the new bucket number. Holding the bucket sharelock for
the remainder of the scan prevents the reader's current-tuple pointer from
being invalidated by splits or compactions. Notice that the reader's lock
does not prevent other buckets from being split or compacted.
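
To make the retry protocol concrete, the loop above might look roughly like the
following C sketch. The helper functions are hypothetical stand-ins for the
metapage buffer-content-lock and heavyweight bucket-lock operations; they are
not the names used in the actual hash AM code.

    #include <stdint.h>

    typedef uint32_t BlockNumber;
    #define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)

    /* Hypothetical stand-ins for buffer-manager and lock-manager calls. */
    extern void meta_content_lock_shared(void);   /* metapage already pinned */
    extern void meta_content_unlock(void);
    extern void meta_unpin(void);
    extern BlockNumber hashkey_to_bucket_blkno(uint32_t hashkey); /* reads metapage */
    extern void bucket_lock_shared(BlockNumber blkno);     /* heavyweight lmgr lock */
    extern void bucket_unlock_shared(BlockNumber blkno);

    /* Returns the bucket's block number with its heavyweight share-lock held. */
    static BlockNumber
    lock_target_bucket(uint32_t hashkey)
    {
        BlockNumber blkno;
        BlockNumber oldblkno = InvalidBlockNumber;

        meta_content_lock_shared();
        for (;;)
        {
            blkno = hashkey_to_bucket_blkno(hashkey);
            meta_content_unlock();

            if (blkno == oldblkno)
                break;              /* mapping unchanged; our lock is valid */

            if (oldblkno != InvalidBlockNumber)
                bucket_unlock_shared(oldblkno);   /* concurrent split moved us */

            bucket_lock_shared(blkno);            /* may wait; deadlock-checked */
            oldblkno = blkno;

            meta_content_lock_shared();           /* recheck the mapping */
        }
        meta_unpin();
        return blkno;
    }

The key point is that the heavyweight bucket lock is only ever requested while
no buffer content lock is held, so the regular deadlock detector can see and
resolve any lock cycle.
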
@@ -229,22 +227,26 @@ as it was before.

The insertion algorithm is rather similar:

-     share-lock page 0 (to prevent active split)
-     read/sharelock meta page
-     compute bucket number for target hash key
-     release meta page
-     share-lock bucket page (to prevent split/compact of this bucket)
-     release page 0 share-lock
+     pin meta page and take buffer content lock in shared mode
+     loop:
+         compute bucket number for target hash key
+         release meta page buffer content lock
+         if (correct bucket page is already locked)
+             break
+         release any existing bucket page lock (if a concurrent split happened)
+         take heavyweight bucket lock in shared mode
+         retake meta page buffer content lock in shared mode
-- (so far same as reader)
-     read/exclusive-lock current page of bucket
+     release pin on metapage
+     pin current page of bucket and take exclusive buffer content lock
      if full, release, read/exclusive-lock next page; repeat as needed
      >> see below if no space in any page of bucket
      insert tuple at appropriate place in page
-     write/release current page
-     release bucket share-lock
-     read/exclusive-lock meta page
+     mark current page dirty and release buffer content lock and pin
+     release heavyweight share-lock
+     pin meta page and take buffer content lock in shared mode
      increment tuple count, decide if split needed
-     write/release meta page
+     mark meta page dirty and release buffer content lock and pin
      done if no split needed, else enter Split algorithm below

To speed searches, the index entries within any individual index page are
@@ -269,26 +271,23 @@ index is overfull (has a higher-than-wanted ratio of tuples to buckets).
The algorithm attempts, but does not necessarily succeed, to split one
existing bucket in two, thereby lowering the fill ratio:

-     exclusive-lock page 0 (assert the right to begin a split)
-     read/exclusive-lock meta page
+     pin meta page and take buffer content lock in exclusive mode
      check split still needed
-     if split not needed anymore, drop locks and exit
+     if split not needed anymore, drop buffer content lock and pin and exit
      decide which bucket to split
      Attempt to X-lock old bucket number (definitely could fail)
      Attempt to X-lock new bucket number (shouldn't fail, but...)
-     if above fail, drop locks and exit
+     if above fail, drop locks and pin and exit
      update meta page to reflect new number of buckets
-     write/release meta page
-     release X-lock on page 0
+     mark meta page dirty and release buffer content lock and pin
-- now, accesses to all other buckets can proceed.
      Perform actual split of bucket, moving tuples as needed
      >> see below about acquiring needed extra space
      Release X-locks of old and new buckets

- Note the page zero and metapage locks are not held while the actual tuple
- rearrangement is performed, so accesses to other buckets can proceed in
- parallel; in fact, it's possible for multiple bucket splits to proceed
- in parallel.
+ Note the metapage lock is not held while the actual tuple rearrangement is
+ performed, so accesses to other buckets can proceed in parallel; in fact,
+ it's possible for multiple bucket splits to proceed in parallel.
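
The same startup protocol, sketched in C with hypothetical helpers (the
conditional lock attempts stand in for non-waiting heavyweight lock requests;
none of these names come from the actual source):

    #include <stdbool.h>

    typedef unsigned int Bucket;

    /* Hypothetical stand-ins for metapage and lock-manager operations. */
    extern void meta_lock_exclusive(void);       /* pin + exclusive content lock */
    extern void meta_unlock_and_unpin(void);
    extern bool split_still_needed(void);
    extern Bucket choose_bucket_to_split(void);
    extern Bucket next_new_bucket(void);
    extern bool try_xlock_bucket(Bucket b);      /* conditional: never waits */
    extern void unlock_bucket(Bucket b);
    extern void record_new_bucket_in_metapage(Bucket new_bucket);
    extern void move_tuples(Bucket old_bucket, Bucket new_bucket);

    static void
    maybe_split_bucket(void)
    {
        Bucket      old_bucket,
                    new_bucket;

        meta_lock_exclusive();
        if (!split_still_needed())
        {
            meta_unlock_and_unpin();
            return;
        }
        old_bucket = choose_bucket_to_split();
        new_bucket = next_new_bucket();

        /* Never wait for the bucket locks here; just give up if they are taken. */
        if (!try_xlock_bucket(old_bucket))
        {
            meta_unlock_and_unpin();
            return;
        }
        if (!try_xlock_bucket(new_bucket))
        {
            unlock_bucket(old_bucket);
            meta_unlock_and_unpin();
            return;
        }

        record_new_bucket_in_metapage(new_bucket);
        meta_unlock_and_unpin();    /* other buckets are accessible from here on */

        /* Tuple movement runs without the metapage lock held. */
        move_tuples(old_bucket, new_bucket);

        unlock_bucket(old_bucket);
        unlock_bucket(new_bucket);
    }

Because the conditional lock attempts never wait, a long-running scan can only
postpone a split, not deadlock with it; the split can simply be attempted again
on a later insertion.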

Split's attempt to X-lock the old bucket number could fail if another
process holds S-lock on it. We do not want to wait if that happens, first
@@ -316,20 +315,20 @@ go-round.
The fourth operation is garbage collection (bulk deletion):

      next bucket := 0
-     read/sharelock meta page
+     pin metapage and take buffer content lock in exclusive mode
      fetch current max bucket number
-     release meta page
+     release meta page buffer content lock and pin
      while next bucket <= max bucket do
          Acquire X lock on target bucket
          Scan and remove tuples, compact free space as needed
          Release X lock
          next bucket ++
      end loop
-     exclusive-lock meta page
+     pin metapage and take buffer content lock in exclusive mode
      check if number of buckets changed
-     if so, release lock and return to for-each-bucket loop
+     if so, release content lock and pin and return to for-each-bucket loop
      else update metapage tuple count
-     write/release meta page
+     mark meta page dirty and release buffer content lock and pin
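
A hedged C sketch of that loop, again with hypothetical helper names; the point
is that the bucket count is rechecked at the end, and the scan continues if a
concurrent split added buckets:

    typedef unsigned int Bucket;

    /* Hypothetical stand-ins for metapage, bucket-lock, and page operations. */
    extern void meta_lock_exclusive(void);     /* pin + exclusive content lock */
    extern void meta_unlock_and_unpin(void);
    extern Bucket meta_get_max_bucket(void);
    extern void meta_update_tuple_count(double tuples_removed); /* marks dirty */
    extern void bucket_lock_exclusive(Bucket b);   /* heavyweight lmgr lock */
    extern void bucket_unlock_exclusive(Bucket b);
    extern double vacuum_one_bucket(Bucket b);     /* remove tuples, compact */

    static void
    hash_bulk_delete(void)
    {
        double      tuples_removed = 0;
        Bucket      cur_bucket = 0;
        Bucket      max_bucket;

        meta_lock_exclusive();
        max_bucket = meta_get_max_bucket();
        meta_unlock_and_unpin();

        for (;;)
        {
            while (cur_bucket <= max_bucket)
            {
                bucket_lock_exclusive(cur_bucket);
                tuples_removed += vacuum_one_bucket(cur_bucket);
                bucket_unlock_exclusive(cur_bucket);
                cur_bucket++;
            }

            meta_lock_exclusive();
            if (meta_get_max_bucket() == max_bucket)
                break;                        /* no concurrent split: done */
            max_bucket = meta_get_max_bucket();
            meta_unlock_and_unpin();          /* splits happened: keep scanning */
        }
        meta_update_tuple_count(tuples_removed);
        meta_unlock_and_unpin();
    }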

Note that this is designed to allow concurrent splits. If a split occurs,
tuples relocated into the new bucket will be visited twice by the scan,
@@ -360,25 +359,25 @@ overflow page to the free pool.

Obtaining an overflow page:

-     read/exclusive-lock meta page
+     take metapage content lock in exclusive mode
      determine next bitmap page number; if none, exit loop
-     release meta page lock
-     read/exclusive-lock bitmap page
+     release meta page content lock
+     pin bitmap page and take content lock in exclusive mode
      search for a free page (zero bit in bitmap)
      if found:
          set bit in bitmap
-         write/release bitmap page
-         read/exclusive-lock meta page
+         mark bitmap page dirty and release content lock
+         take metapage buffer content lock in exclusive mode
          if first-free-bit value did not change,
-             update it and write meta page
-         release meta page
+             update it and mark meta page dirty
+         release meta page buffer content lock
          return page number
      else (not found):
-         release bitmap page
+         release bitmap page buffer content lock
          loop back to try next bitmap page, if any
-- here when we have checked all bitmap pages; we hold meta excl. lock
      extend index to add another overflow page; update meta information
-     write/release meta page
+     mark meta page dirty and release buffer content lock
      return page number
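
Sketched in C with hypothetical helpers (none of these names come from the
actual source), the lock dance looks like this; note that the metapage lock is
dropped while a bitmap page is searched, and the first-free-bit hint is only
advanced if nobody else moved it in the meantime:

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint32_t BlockNumber;
    #define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)

    /* Hypothetical stand-ins for metapage and bitmap-page operations. */
    extern void meta_lock_exclusive(void);
    extern void meta_unlock(void);
    extern BlockNumber meta_next_bitmap_page(int i);   /* Invalid when exhausted */
    extern uint32_t meta_get_firstfree(void);
    extern void meta_set_firstfree(uint32_t bit);      /* marks metapage dirty */
    extern bool bitmap_claim_free_bit(BlockNumber bmpage, uint32_t *bit);
    extern BlockNumber overflow_blkno_for_bit(uint32_t bit);
    extern BlockNumber extend_with_new_overflow_page(void);  /* meta lock held */

    static BlockNumber
    get_overflow_page(void)
    {
        uint32_t    orig_firstfree;
        BlockNumber newblk;
        int         i;

        meta_lock_exclusive();
        orig_firstfree = meta_get_firstfree();

        for (i = 0;; i++)
        {
            BlockNumber bmpage = meta_next_bitmap_page(i);
            uint32_t    bit;

            if (bmpage == InvalidBlockNumber)
                break;                     /* all bitmap pages are full */

            meta_unlock();                 /* search the bitmap without it */
            if (bitmap_claim_free_bit(bmpage, &bit))
            {
                meta_lock_exclusive();
                if (meta_get_firstfree() == orig_firstfree)
                    meta_set_firstfree(bit);   /* hint still ours to advance */
                meta_unlock();
                return overflow_blkno_for_bit(bit);
            }
            meta_lock_exclusive();         /* reacquire to try the next page */
        }

        /* Still holding the metapage lock here: extend the index instead. */
        newblk = extend_with_new_overflow_page();
        meta_unlock();
        return newblk;
    }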

It is slightly annoying to release and reacquire the metapage lock
@@ -428,17 +427,17 @@ algorithm is:

      delink overflow page from bucket chain
      (this requires read/update/write/release of fore and aft siblings)
-     read/share-lock meta page
+     pin meta page and take buffer content lock in shared mode
      determine which bitmap page contains the free space bit for page
-     release meta page
-     read/exclusive-lock bitmap page
+     release meta page buffer content lock
+     pin bitmap page and take buffer content lock in exclusive mode
      update bitmap bit
-     write/release bitmap page
+     mark bitmap page dirty and release buffer content lock and pin
      if page number is less than what we saw as first-free-bit in meta:
-         read/exclusive-lock meta page
+         retake meta page buffer content lock in exclusive mode
          if page number is still less than first-free-bit,
-             update first-free-bit field and write meta page
-         release meta page
+             update first-free-bit field and mark meta page dirty
+         release meta page buffer content lock and pin

We have to do it this way because we must clear the bitmap bit before
changing the first-free-bit field (hashm_firstfree). It is possible that