Commit 40dae7e

Make the handling of interrupted B-tree page splits more robust.
Splitting a page consists of two separate steps: splitting the child page, and inserting the downlink for the new right page to the parent. Previously, we handled the case that you crash in between those steps with a cleanup routine after the WAL recovery had finished, which finished the incomplete split. However, that doesn't help if the page split is interrupted but the database doesn't crash, so that you don't perform WAL recovery. That could happen for example if you run out of disk space.

Remove the end-of-recovery cleanup step. Instead, when a page is split, the left page is marked with a new INCOMPLETE_SPLIT flag, and when the downlink is inserted to the parent, the flag is cleared again. If an insertion sees a page with the flag set, it knows that the split was interrupted for some reason, and inserts the missing downlink before proceeding.

I used the same approach to fix GIN and GiST split algorithms earlier. This was the last WAL cleanup routine, so we could get rid of that whole machinery now, but I'll leave that for a separate patch.

Reviewed by Peter Geoghegan.
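The two-step protocol described in the commit message can be illustrated with a toy model. This is only a sketch in Python, not PostgreSQL's C implementation: the `Page` and `Parent` classes, the function names, and the `incomplete_split` field (standing in for the INCOMPLETE_SPLIT flag) are all hypothetical, and locking and WAL logging are left out entirely.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Page:
    """Toy B-tree page; not PostgreSQL's on-disk layout."""
    name: str
    keys: List[int] = field(default_factory=list)
    right: Optional["Page"] = None     # right-link to the sibling page
    incomplete_split: bool = False     # stands in for INCOMPLETE_SPLIT


@dataclass
class Parent:
    """Toy parent page that only records which children have downlinks."""
    downlinks: List[str] = field(default_factory=list)


def split_child(left: Page) -> Page:
    """Step 1: move the upper half of the keys to a new right page
    and flag the left page as incompletely split."""
    mid = len(left.keys) // 2
    right = Page(left.name + "'", left.keys[mid:], left.right)
    left.keys = left.keys[:mid]
    left.right = right
    left.incomplete_split = True
    return right


def insert_downlink(parent: Parent, left: Page) -> None:
    """Step 2: add the downlink for the right page and clear the flag."""
    parent.downlinks.append(left.right.name)
    left.incomplete_split = False


def finish_split_if_needed(parent: Parent, page: Page) -> bool:
    """What a later insertion does when it sees the flag: run step 2 first."""
    if page.incomplete_split:
        insert_downlink(parent, page)
        return True
    return False


parent = Parent(downlinks=["A"])
a = Page("A", [1, 2, 3, 4])
split_child(a)  # step 1 succeeds
# Step 2 never runs here (say, the disk filled up), so the flag stays set.
repaired = finish_split_if_needed(parent, a)  # a later insertion repairs it
```

In this sketch a second call to `finish_split_if_needed` is a no-op, which mirrors the idea that the flag is set only between the two steps.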
1 parent b6ec7c9 commit 40dae7e

8 files changed: +561 -290 lines changed

src/backend/access/nbtree/README (+45 -8)
@@ -404,12 +404,41 @@ an additional insertion above that, etc).
 For a root split, the followon WAL entry is a "new root" entry rather than
 an "insertion" entry, but details are otherwise much the same.
 
-Because insertion involves multiple atomic actions, the WAL replay logic
-has to detect the case where a page split isn't followed by a matching
-insertion on the parent level, and then do that insertion on its own (and
-recursively for any subsequent parent insertion, of course). This is
-feasible because the WAL entry for the split contains enough info to know
-what must be inserted in the parent level.
+Because splitting involves multiple atomic actions, it's possible that the
+system crashes between splitting a page and inserting the downlink for the
+new half to the parent. After recovery, the downlink for the new page will
+be missing. The search algorithm works correctly, as the page will be found
+by following the right-link from its left sibling, although if a lot of
+downlinks in the tree are missing, performance will suffer. A more serious
+consequence is that if the page without a downlink gets split again, the
+insertion algorithm will fail to find the location in the parent level to
+insert the downlink.
+
+Our approach is to create any missing downlinks on-the-fly, when searching
+the tree for a new insertion. It could be done during searches, too, but
+it seems best not to put any extra updates in what would otherwise be a
+read-only operation (updating is not possible in hot standby mode anyway).
+It would seem natural to add the missing downlinks in VACUUM, but since
+inserting a downlink might require splitting a page, it might fail if you
+run out of disk space. That would be bad during VACUUM - the reason for
+running VACUUM in the first place might be that you run out of disk space,
+and now VACUUM won't finish because you're out of disk space. In contrast,
+an insertion can require enlarging the physical file anyway.
+
+To identify missing downlinks, when a page is split, the left page is
+flagged to indicate that the split is not yet complete (INCOMPLETE_SPLIT).
+When the downlink is inserted to the parent, the flag is cleared atomically
+with the insertion. The child page is kept locked until the insertion in
+the parent is finished and the flag in the child cleared, but can be
+released immediately after that, before recursing up the tree if the parent
+also needs to be split. This ensures that incompletely split pages should
+not be seen under normal circumstances; only if insertion to the parent
+has failed for some reason.
+
+We flag the left page, even though it's the right page that's missing the
+downlink, because it's more convenient to know already when following the
+right-link from the left page to the right page that it will need to have
+its downlink inserted to the parent.
 
 When splitting a non-root page that is alone on its level, the required
 metapage update (of the "fast root" link) is performed and logged as part
@@ -419,8 +448,16 @@ metapage update is handled as part of the "new root" action.
 Each step in page deletion are logged as separate WAL entries: marking the
 leaf as half-dead and removing the downlink is one record, and unlinking a
 page is a second record. If vacuum is interrupted for some reason, or the
-system crashes, the tree is consistent for searches and insertions. The next
-VACUUM will find the half-dead leaf page and continue the deletion.
+system crashes, the tree is consistent for searches and insertions. The
+next VACUUM will find the half-dead leaf page and continue the deletion.
+
+Before 9.4, we used to keep track of incomplete splits and page deletions
+during recovery and finish them immediately at end of recovery, instead of
+doing it lazily at the next insertion or vacuum. However, that made the
+recovery much more complicated, and only fixed the problem when crash
+recovery was performed. An incomplete split can also occur if an otherwise
+recoverable error, like out-of-memory or out-of-disk-space, happens while
+inserting the downlink to the parent.
 
 Scans during Recovery
 ---------------------

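The README text added in the first hunk observes that searches still work when a downlink is missing, because the page can be reached through the right-link of its left sibling, at the cost of extra hops. A minimal sketch of that behaviour follows. It is a toy model in Python, not PostgreSQL code: the `leaves` and `parent` structures and the `search` function are hypothetical, and the high key is treated as an exclusive upper bound purely to keep the example small.

```python
from typing import Dict, List, Optional, Tuple

# Toy leaf level: each page has its keys, a high key (exclusive upper bound
# in this sketch, None for the rightmost page), and a right-link.
leaves: Dict[str, dict] = {
    "A": {"keys": [1, 2], "high": 3, "right": "B"},
    "B": {"keys": [3, 4], "high": 5, "right": "C"},
    "C": {"keys": [5, 6], "high": None, "right": None},
}

# Toy parent: (lowest key covered, child name). The downlink for B is
# missing, as it would be after an interrupted split of A.
parent: List[Tuple[int, str]] = [(1, "A"), (5, "C")]


def search(key: int) -> Tuple[str, int, bool]:
    """Return (leaf reached, right-link hops taken, whether key was found)."""
    # Descend: follow the last downlink whose low key is <= the search key.
    child = parent[0][1]
    for low, name in parent:
        if low <= key:
            child = name

    # Move right while the key lies at or beyond this page's high key.
    hops = 0
    high: Optional[int] = leaves[child]["high"]
    while high is not None and key >= high:
        child = leaves[child]["right"]
        high = leaves[child]["high"]
        hops += 1

    return child, hops, key in leaves[child]["keys"]


result_missing = search(4)   # lands on A, then hops right to B
result_direct = search(6)    # C has a downlink, so no hop is needed
```

Here `search(4)` still finds the key on page B, but only after one extra hop from A. Each missing downlink adds a hop of this kind, which is the performance cost the README mentions.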