@@ -213,7 +213,7 @@ this flag must be clear before splitting a bucket; thus, a bucket can't be
split again until the previous split is totally complete.

The moved-by-split flag on a tuple indicates that tuple is moved from old to
- new bucket. Concurrent scans can skip such tuples till the split operation
+ new bucket. Concurrent scans will skip such tuples until the split operation
is finished. Once the tuple is marked as moved-by-split, it will remain so
forever but that does no harm. We have intentionally not cleared it as that
can generate an additional I/O which is not necessary.
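
For illustration, the sketch below shows the kind of check a concurrent scan
can apply while a split is still in progress. The flag itself
(INDEX_MOVED_BY_SPLIT_MASK, a bit in the index tuple's t_info word) is real;
the helper function, its name, and the two boolean scan-state arguments are
hypothetical simplifications of the scan's actual bookkeeping, so treat this
as a sketch rather than the code in hashsearch.c:

    #include "postgres.h"
    #include "access/hash.h"
    #include "access/itup.h"

    /*
     * Hypothetical helper: while a split is in progress, a scan reading the
     * new (being-populated) bucket ignores tuples flagged as moved-by-split;
     * it will see the original copies when it later visits the old bucket.
     */
    static bool
    skip_moved_by_split(IndexTuple itup, bool scanning_new_bucket,
                        bool split_in_progress)
    {
        return split_in_progress && scanning_new_bucket &&
            (itup->t_info & INDEX_MOVED_BY_SPLIT_MASK) != 0;
    }
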
@@ -287,13 +287,17 @@ The insertion algorithm is rather similar:
if current page is full, release lock but not pin, read/exclusive-lock
next page; repeat as needed
>> see below if no space in any page of bucket
+ take buffer content lock in exclusive mode on metapage
insert tuple at appropriate place in page
- mark current page dirty and release buffer content lock and pin
- if the current page is not a bucket page, release the pin on bucket page
- pin meta page and take buffer content lock in exclusive mode
+ mark current page dirty
increment tuple count, decide if split needed
- mark meta page dirty and release buffer content lock and pin
- done if no split needed, else enter Split algorithm below
+ mark meta page dirty
+ write WAL for insertion of tuple
+ release the buffer content lock on metapage
+ release buffer content lock on current page
+ if current page is not a bucket page, release the pin on bucket page
+ if split is needed, enter Split algorithm below
+ release the pin on metapage
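
The last few steps above follow the standard PostgreSQL pattern for WAL-logged
page updates: modify the pages and emit the WAL record inside one critical
section, then stamp the returned LSN on every page the record covers. A
condensed sketch of that part of the insert path is given below; buffer
lookup, space checks and the metapage tuple-count update are omitted, and the
function name and argument list are simplified for illustration rather than
copied from the real _hash_doinsert:

    #include "postgres.h"
    #include "access/hash.h"
    #include "access/hash_xlog.h"
    #include "access/itup.h"
    #include "access/xloginsert.h"
    #include "miscadmin.h"
    #include "storage/bufmgr.h"
    #include "utils/rel.h"

    /* Sketch: log the insertion of one tuple into a hash index page. */
    static void
    hash_insert_sketch(Relation rel, Buffer buf, Buffer metabuf,
                       IndexTuple itup, OffsetNumber offnum)
    {
        Page    page = BufferGetPage(buf);

        START_CRIT_SECTION();

        if (PageAddItem(page, (Item) itup, IndexTupleSize(itup), offnum,
                        false, false) == InvalidOffsetNumber)
            elog(ERROR, "failed to add index item");
        /* ... the metapage tuple count is bumped here as well ... */
        MarkBufferDirty(buf);
        MarkBufferDirty(metabuf);

        if (RelationNeedsWAL(rel))
        {
            xl_hash_insert xlrec;
            XLogRecPtr  recptr;

            xlrec.offnum = offnum;

            XLogBeginInsert();
            XLogRegisterData((char *) &xlrec, SizeOfHashInsert);
            XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
            XLogRegisterBufData(0, (char *) itup, IndexTupleSize(itup));
            XLogRegisterBuffer(1, metabuf, REGBUF_STANDARD);

            recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_INSERT);

            PageSetLSN(page, recptr);
            PageSetLSN(BufferGetPage(metabuf), recptr);
        }

        END_CRIT_SECTION();
    }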

To speed searches, the index entries within any individual index page are
kept sorted by hash code; the insertion code must take care to insert new
@@ -328,12 +332,17 @@ existing bucket in two, thereby lowering the fill ratio:
try to finish the split and the cleanup work
if that succeeds, start over; if it fails, give up
mark the old and new buckets indicating split is in progress
+ mark both old and new buckets as dirty
+ write WAL for allocation of new page for split
copy the tuples that belong to new bucket from old bucket, marking
them as moved-by-split
+ write WAL record for moving tuples to new page once the new page is full
+ or all the pages of old bucket are finished
release lock but not pin for primary bucket page of old bucket,
read/shared-lock next page; repeat as needed
clear the bucket-being-split and bucket-being-populated flags
mark the old bucket indicating split-cleanup
+ write WAL for changing the flags on both old and new buckets
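
The "marking them as moved-by-split" step just sets the same t_info bit
discussed earlier on each copied tuple before it is placed in the new bucket.
A minimal sketch, where the helper name is invented for illustration (the real
split code batches the copies page by page):

    #include "postgres.h"
    #include "access/hash.h"
    #include "access/itup.h"

    /*
     * Illustrative only: build the copy of an old-bucket tuple that the split
     * will insert into the new bucket, flagged so that concurrent scans of
     * the still-being-populated bucket can ignore it.
     */
    static IndexTuple
    make_moved_by_split_copy(IndexTuple itup)
    {
        IndexTuple  new_itup = CopyIndexTuple(itup);

        new_itup->t_info |= INDEX_MOVED_BY_SPLIT_MASK;
        return new_itup;
    }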

The split operation's attempt to acquire cleanup-lock on the old bucket number
could fail if another process holds any lock or pin on it. We do not want to
@@ -369,6 +378,8 @@ The fourth operation is garbage collection (bulk deletion):
acquire cleanup lock on primary bucket page
loop:
scan and remove tuples
+ mark the target page dirty
+ write WAL for deleting tuples from target page
if this is the last bucket page, break out of loop
pin and x-lock next page
release prior lock and pin (except keep pin on primary bucket page)
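
The two added lines correspond to the usual pattern of deleting a batch of
offsets from one page and logging just that batch. A condensed sketch follows;
the real XLOG_HASH_DELETE record also carries a couple of flags and may
register the primary bucket page so that replay can take a cleanup lock, all
of which is omitted here, and the function name and arguments are illustrative:

    #include "postgres.h"
    #include "access/hash.h"
    #include "access/hash_xlog.h"
    #include "access/xloginsert.h"
    #include "miscadmin.h"
    #include "storage/bufmgr.h"
    #include "utils/rel.h"

    /* Sketch: remove a batch of dead/moved tuples from one bucket page. */
    static void
    vacuum_one_hash_page(Relation rel, Buffer buf,
                         OffsetNumber *deletable, int ndeletable)
    {
        Page    page = BufferGetPage(buf);

        START_CRIT_SECTION();

        PageIndexMultiDelete(page, deletable, ndeletable);
        MarkBufferDirty(buf);

        if (RelationNeedsWAL(rel))
        {
            XLogRecPtr  recptr;

            XLogBeginInsert();
            XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
            /* the offset array is enough for redo to repeat the deletion */
            XLogRegisterBufData(0, (char *) deletable,
                                ndeletable * sizeof(OffsetNumber));
            recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_DELETE);
            PageSetLSN(page, recptr);
        }

        END_CRIT_SECTION();
    }
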
@@ -383,7 +394,8 @@ The fourth operation is garbage collection (bulk deletion):
check if number of buckets changed
if so, release content lock and pin and return to for-each-bucket loop
else update metapage tuple count
- mark meta page dirty and release buffer content lock and pin
+ mark meta page dirty and write WAL for update of metapage
+ release buffer content lock and pin

Note that this is designed to allow concurrent splits and scans. If a split
occurs, tuples relocated into the new bucket will be visited twice by the
@@ -425,18 +437,16 @@ Obtaining an overflow page:
search for a free page (zero bit in bitmap)
if found:
set bit in bitmap
- mark bitmap page dirty and release content lock
+ mark bitmap page dirty
take metapage buffer content lock in exclusive mode
if first-free-bit value did not change,
update it and mark meta page dirty
- release meta page buffer content lock
- return page number
else (not found):
release bitmap page buffer content lock
loop back to try next bitmap page, if any
-- here when we have checked all bitmap pages; we hold meta excl. lock
extend index to add another overflow page; update meta information
- mark meta page dirty and release buffer content lock
+ mark meta page dirty
return page number
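
The "search for a free page (zero bit in bitmap)" step is simply a scan of the
bitmap page's word array for the first clear bit, starting from the cached
first-free-bit hint. The fragment below is a self-contained illustration of
that search in plain C; it does not use the actual bitmap macros from
hashovfl.c, and the 32-bit word size and the names are assumptions made for
the example:

    #include <stdint.h>

    /*
     * Return the index of the first zero bit in a bitmap of 'nbits' bits
     * packed into 32-bit words, starting the search at 'start_bit'; return -1
     * if every bit is set.  (Illustrative only; the real code works directly
     * on a bitmap page.)
     */
    static int
    first_free_bit(const uint32_t *words, int nbits, int start_bit)
    {
        int     bit;

        for (bit = start_bit; bit < nbits; bit++)
        {
            if ((words[bit / 32] & ((uint32_t) 1 << (bit % 32))) == 0)
                return bit;
        }
        return -1;
    }
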
It is slightly annoying to release and reacquire the metapage lock
@@ -456,12 +466,17 @@ like this:

-- having determined that no space is free in the target bucket:
remember last page of bucket, drop write lock on it
- call free-page-acquire routine
re-write-lock last page of bucket
if it is not last anymore, step to the last page
- update (former) last page to point to new page
+ execute free-page-acquire (obtaining an overflow page) mechanism
+ described above
+ update (former) last page to point to the new page and mark buffer dirty
write-lock and initialize new page, with back link to former last page
- write and release former last page
+ write WAL for addition of overflow page
+ release the locks on meta page and bitmap page acquired in
+ free-page-acquire algorithm
+ release the lock on former last page
+ release the lock on new overflow page
insert tuple into new page
-- etc.
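
The single "write WAL for addition of overflow page" step covers every page
touched while linking the new page into the bucket chain, which is why all of
those buffers stay locked until the record has been inserted. A schematic
sketch of that one-record-many-buffers pattern is below; the exact set of
registered buffers, the block IDs, and the record payload used by the real
code are simplified, so read it as an illustration of the idea only:

    #include "postgres.h"
    #include "access/hash.h"
    #include "access/hash_xlog.h"
    #include "access/xloginsert.h"
    #include "storage/bufmgr.h"
    #include "utils/rel.h"

    /*
     * Schematic sketch: one WAL record covering the pages touched when an
     * overflow page is added to a bucket chain.  Record payload data and
     * conditional registrations are omitted.
     */
    static void
    log_add_overflow_page(Relation rel, Buffer ovflbuf, Buffer prevbuf,
                          Buffer mapbuf, Buffer metabuf)
    {
        XLogRecPtr  recptr;

        if (!RelationNeedsWAL(rel))
            return;             /* caller is already in a critical section */

        XLogBeginInsert();
        /* the brand-new page is re-initialized from scratch during replay */
        XLogRegisterBuffer(0, ovflbuf, REGBUF_WILL_INIT);
        /* former last page of the bucket, now pointing to the new page */
        XLogRegisterBuffer(1, prevbuf, REGBUF_STANDARD);
        /* bitmap page whose free bit was consumed */
        XLogRegisterBuffer(2, mapbuf, REGBUF_STANDARD);
        /* metapage, for first-free-bit and overflow-point bookkeeping */
        XLogRegisterBuffer(3, metabuf, REGBUF_STANDARD);

        recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_ADD_OVFL_PAGE);

        PageSetLSN(BufferGetPage(ovflbuf), recptr);
        PageSetLSN(BufferGetPage(prevbuf), recptr);
        PageSetLSN(BufferGetPage(mapbuf), recptr);
        PageSetLSN(BufferGetPage(metabuf), recptr);
    }
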
@@ -488,12 +503,14 @@ accessors of pages in the bucket. The algorithm is:
determine which bitmap page contains the free space bit for page
release meta page buffer content lock
pin bitmap page and take buffer content lock in exclusive mode
- update bitmap bit
- mark bitmap page dirty and release buffer content lock and pin
- if page number is less than what we saw as first-free-bit in meta:
retake meta page buffer content lock in exclusive mode
+ move (insert) tuples that belong to the overflow page being freed
+ update bitmap bit
+ mark bitmap page dirty
if page number is still less than first-free-bit,
update first-free-bit field and mark meta page dirty
+ write WAL for delinking overflow page operation
+ release buffer content lock and pin
release meta page buffer content lock and pin

We have to do it this way because we must clear the bitmap bit before
@@ -504,8 +521,91 @@ page acquirer will scan more bitmap bits than he needs to. What must be
avoided is having first-free-bit greater than the actual first free bit,
because then that free page would never be found by searchers.

- All the freespace operations should be called while holding no buffer
- locks. Since they need no lmgr locks, deadlock is not possible.
+ The reason for moving tuples from the overflow page while delinking the
+ latter is to make that a single atomic operation. Not doing so could lead to
+ spurious reads on standby; basically, the user might see the same tuple twice.
+
+
+ WAL Considerations
+ ------------------
+
+ The hash index operations like create index, insert, delete, bucket split,
+ allocate overflow page, and squeeze in themselves don't guarantee hash index
+ consistency after a crash. To provide robustness, we write WAL for each of
+ these operations.
+
+ CREATE INDEX writes multiple WAL records. First, we write a record to cover
+ the initialization of the metapage, followed by one for each new bucket
+ created, followed by one for the initial bitmap page. It's not important for
+ index creation to appear atomic, because the index isn't yet visible to any
+ other transaction, and the creating transaction will roll back in the event of
+ a crash. It would be difficult to cover the whole operation with a single
+ write-ahead log record anyway, because we can log only a fixed number of
+ pages, as given by XLR_MAX_BLOCK_ID (32), with the current XLog machinery.
+
+ Ordinary item insertions (that don't force a page split or need a new overflow
+ page) are single WAL entries. They touch a single bucket page and the
+ metapage. The metapage is updated during replay just as it is updated during
+ the original operation.
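
On the standby (or during crash recovery), the same record drives both
updates: the bucket page gets the tuple re-inserted and the metapage gets its
tuple count bumped, each guarded by XLogReadBufferForRedo so that pages
already newer than the record are left alone. A trimmed sketch of such a redo
routine is shown here; it mirrors the shape of the real hash_xlog_insert but
omits the metapage half and most error handling:

    #include "postgres.h"
    #include "access/hash_xlog.h"
    #include "access/xlogutils.h"
    #include "storage/bufmgr.h"

    /* Trimmed sketch of replaying an ordinary hash-index insertion. */
    static void
    hash_redo_insert_sketch(XLogReaderState *record)
    {
        XLogRecPtr  lsn = record->EndRecPtr;
        xl_hash_insert *xlrec = (xl_hash_insert *) XLogRecGetData(record);
        Buffer      buffer;

        /* block 0: the bucket (or overflow) page that received the tuple */
        if (XLogReadBufferForRedo(record, 0, &buffer) == BLK_NEEDS_REDO)
        {
            Size    datalen;
            char   *datapos = XLogRecGetBlockData(record, 0, &datalen);
            Page    page = BufferGetPage(buffer);

            if (PageAddItem(page, (Item) datapos, datalen, xlrec->offnum,
                            false, false) == InvalidOffsetNumber)
                elog(PANIC, "failed to re-add index item");

            PageSetLSN(page, lsn);
            MarkBufferDirty(buffer);
        }
        if (BufferIsValid(buffer))
            UnlockReleaseBuffer(buffer);

        /* block 1 (the metapage) is handled the same way: redo the tuple
         * count update, set its LSN, mark it dirty, and release it. */
    }
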
+
+ If an insertion causes the addition of an overflow page, there will be one
+ WAL entry for the new overflow page and a second entry for the insert itself.
+
+ If an insertion causes a bucket split, there will be one WAL entry for the
+ insert itself, followed by a WAL entry for allocating a new bucket, followed
+ by a WAL entry for each overflow bucket page in the new bucket to which the
+ tuples are moved from the old bucket, followed by a WAL entry to indicate
+ that the split is complete for both old and new buckets. A split operation
+ which requires overflow pages to complete the operation will need to write a
+ WAL record for each new allocation of an overflow page.
+
+ As splitting involves multiple atomic actions, it's possible that the system
+ crashes while tuples are still being moved from the old bucket's pages to the
+ new bucket. In such a case, after recovery, the old and new buckets will be
+ marked with the bucket-being-split and bucket-being-populated flags
+ respectively, which indicate that a split is in progress for those buckets.
+ The reader algorithm works correctly, as it will scan both the old and new
+ buckets when the split is in progress, as explained in the reader algorithm
+ section above.
+
+ We finish the split at the next insert or split operation on the old bucket,
+ as explained in the insert and split algorithms above. It could be done
+ during searches, too, but it seems best not to put any extra updates in what
+ would otherwise be a read-only operation (updating is not possible in hot
+ standby mode anyway). It would seem natural to complete the split in VACUUM,
+ but since splitting a bucket might require allocating a new page, it might
+ fail if you run out of disk space. That would be bad during VACUUM - the
+ reason for running VACUUM in the first place might be that you have run out
+ of disk space, and now VACUUM won't finish because you're out of disk space.
+ In contrast, an insertion can require enlarging the physical file anyway.
+
+ Deletion of tuples from a bucket is performed for two reasons: to remove dead
+ tuples, and to remove tuples that were moved by a bucket split. A WAL entry
+ is made for each bucket page from which tuples are removed, and then another
+ WAL entry is made when we clear the needs-split-cleanup flag. If dead tuples
+ are removed, a separate WAL entry is made to update the metapage.
+
+ As deletion involves multiple atomic operations, it is quite possible that
+ the system crashes (a) after removing tuples from only some of the bucket
+ pages, (b) before clearing the garbage flag, or (c) before updating the
+ metapage. If the system crashes before completing (b), it will again try to
+ clean the bucket during the next vacuum or insert after recovery, which can
+ have some performance impact, but it will work fine. If the system crashes
+ before completing (c), after recovery there could be some additional splits
+ until the next vacuum updates the metapage, but the other operations like
+ insert, delete and scan will work correctly. We can fix this problem by
+ actually updating the metapage based on the delete operation during replay,
+ but it's not clear whether it's worth the complication.
+
+ A squeeze operation moves tuples from bucket pages later in the overflow
+ chain to pages earlier in the chain, and writes a WAL record when either the
+ page to which it is writing tuples becomes full or the page from which it is
+ removing tuples becomes empty.
+
+ As a squeeze operation involves multiple atomic operations, it is quite
+ possible that the system crashes before completing the operation on the
+ entire bucket. After recovery, the operations will work correctly, but the
+ index will remain bloated and this can impact the performance of read and
+ insert operations until the next vacuum squeezes the bucket completely.

Other Notes