Commit efada2b

Fix race condition in B-tree page deletion.

In short, we don't allow a page to be deleted if it's the rightmost child
of its parent, but that situation can change after we check for it.

Problem
-------

We check that the page to be deleted is not the rightmost child of its
parent, and then lock its left sibling, the page itself, its right sibling,
and the parent, in that order. However, if the parent page is split after
the check but before acquiring the locks, the target page might become the
rightmost child, if the split happens at the right place. That leads to an
error in vacuum (I reproduced this by setting a breakpoint in debugger):

ERROR: failed to delete rightmost child 41 of block 3 in index "foo_pkey"

We currently re-check that the page is still not the rightmost child, and
throw the above error if it is. We could easily just give up rather than
throw an error, but that approach doesn't scale to half-dead pages. To
recap, although we don't normally allow deleting the rightmost child, if the
page is the *only* child of its parent, we delete the child page and mark
the parent page as half-dead in one atomic operation. But before we do that,
we check that the parent can later be deleted, by checking that it in turn
is not the rightmost child of the grandparent (potentially recursing all the
way up to the root). But the same situation can arise there - the
grandparent can be split while we're not holding the locks. We end up with a
half-dead page that we cannot delete.

To make things worse, the keyspace of the deleted page has already been
transferred to its right sibling. As the README points out, the keyspace at
the grandparent level is "out-of-whack" until the half-dead page is deleted,
and if enough tuples with keys in the transferred keyspace are inserted, the
page might get split and a downlink might be inserted into the grandparent
that is out-of-order. That might not cause any serious problem if it's
transient (as the README ponders), but is surely bad if it stays that way.

Solution
--------

This patch changes the page deletion algorithm to avoid that problem. After
checking that the topmost page in the chain of to-be-deleted pages is not
the rightmost child of its parent, and then deleting the pages from bottom
up, unlink the pages from top to bottom. This way, the intermediate stages
are similar to the intermediate stages in page splitting, and there is no
transient stage where the keyspace is "out-of-whack". The topmost page in
the to-be-deleted chain doesn't have a downlink pointing to it, like a page
split before the downlink has been inserted.

This also allows us to get rid of the cleanup step after WAL recovery, if we
crash during page deletion. The deletion will be continued at next VACUUM,
but the tree is consistent for searches and insertions at every step.

This bug is old, all supported versions are affected, but this patch is too
big to back-patch (and changes the WAL record formats of related records).
We have not heard any reports of the bug from users, so clearly it's not
easy to bump into. Maybe backpatch later, after this has had some field
testing.

Reviewed by Kevin Grittner and Peter Geoghegan.
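
The race described in the commit message is a check-then-lock problem. The toy model below is only an illustration, not PostgreSQL code: a parent page is reduced to a Python list of child block numbers, and the names `is_rightmost` and `split` and the block numbers are made up for this sketch. It shows how an unlocked "not the rightmost child" check can be invalidated by a parent split before the locks are taken.

```python
def is_rightmost(parent, child):
    """True if child is the last downlink in the (toy) parent page."""
    return parent[-1] == child


def split(parent, at):
    """Split a toy parent page; the left half keeps the first `at` children."""
    return parent[:at], parent[at:]


parent = [40, 41, 42, 43]   # hypothetical child block numbers
target = 41

# Unlocked check: the target is not the rightmost child, so deletion proceeds.
ok_before = not is_rightmost(parent, target)

# Before the locks are acquired, a concurrent insertion splits the parent
# immediately to the right of the target.
left, right = split(parent, parent.index(target) + 1)

# Re-checking under lock now gives the opposite answer.
ok_after = not is_rightmost(left, target)
```

In this model `ok_before` is true and `ok_after` is false, which corresponds to the situation where the old code raised the "failed to delete rightmost child" error.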
1 parent 6c461cb commit efada2b

7 files changed, +793 -537 lines changed


src/backend/access/nbtree/README

Lines changed: 63 additions & 44 deletions
@@ -168,26 +168,33 @@ parent item does still exist and can't have been deleted. Also, because
 we are matching downlink page numbers and not data keys, we don't have any
 problem with possibly misidentifying the parent item.
 
+Page Deletion
+-------------
+
 We consider deleting an entire page from the btree only when it's become
 completely empty of items. (Merging partly-full pages would allow better
 space reuse, but it seems impractical to move existing data items left or
 right to make this happen --- a scan moving in the opposite direction
 might miss the items if so.) Also, we *never* delete the rightmost page
 on a tree level (this restriction simplifies the traversal algorithms, as
-explained below).
-
-To delete an empty page, we acquire write lock on its left sibling (if
-any), the target page itself, the right sibling (there must be one), and
-the parent page, in that order. The parent page must be found using the
-same type of search as used to find the parent during an insertion split.
-Then we update the side-links in the siblings, mark the target page
-deleted, and remove the downlink from the parent, as well as the parent's
-upper bounding key for the target (the one separating it from its right
-sibling). This causes the target page's key space to effectively belong
-to its right sibling. (Neither the left nor right sibling pages need to
+explained below). Page deletion always begins from an empty leaf page. An
+internal page can only be deleted as part of a branch leading to a leaf
+page, where each internal page has only one child and that child is also to
+be deleted.
+
+Deleting a leaf page is a two-stage process. In the first stage, the page
+is unlinked from its parent, and marked as half-dead. The parent page must
+be found using the same type of search as used to find the parent during an
+insertion split. We lock the target and the parent pages, change the
+target's downlink to point to the right sibling, and remove its old
+downlink. This causes the target page's key space to effectively belong to
+its right sibling. (Neither the left nor right sibling pages need to
 change their "high key" if any; so there is no problem with possibly not
-having enough space to replace a high key.) The side-links in the target
-page are not changed.
+having enough space to replace a high key.) At the same time, we mark the
+target page as half-dead, which causes any subsequent searches to ignore it
+and move right (or left, in a backwards scan). This leaves the tree in a
+similar state as during a page split: the page has no downlink pointing to
+it, but it's still linked to its siblings.
 
 (Note: Lanin and Shasha prefer to make the key space move left, but their
 argument for doing so hinges on not having left-links, which we have
@@ -199,31 +206,44 @@ the same parent --- otherwise, the parent's key space assignment changes
 too, meaning we'd have to make bounding-key updates in its parent, and
 perhaps all the way up the tree. Since we can't possibly do that
 atomically, we forbid this case. That means that the rightmost child of a
-parent node can't be deleted unless it's the only remaining child.
-
-When we delete the last remaining child of a parent page, we mark the
-parent page "half-dead" as part of the atomic update that deletes the
-child page. This implicitly transfers the parent's key space to its right
-sibling (which it must have, since we never delete the overall-rightmost
-page of a level). Searches ignore the half-dead page and immediately move
-right. We need not worry about insertions into a half-dead page --- insertions
-into upper tree levels happen only as a result of splits of child pages, and
-the half-dead page no longer has any children that could split. Therefore
-the page stays empty even when we don't have lock on it, and we can complete
-its deletion in a second atomic action.
-
-The notion of a half-dead page means that the key space relationship between
-the half-dead page's level and its parent's level may be a little out of
-whack: key space that appears to belong to the half-dead page's parent on the
-parent level may really belong to its right sibling. To prevent any possible
-problems, we hold lock on the deleted child page until we have finished
-deleting any now-half-dead parent page(s). This prevents any insertions into
-the transferred keyspace until the operation is complete. The reason for
-doing this is that a sufficiently large number of insertions into the
-transferred keyspace, resulting in multiple page splits, could propagate keys
-from that keyspace into the parent level, resulting in transiently
-out-of-order keys in that level. It is thought that that wouldn't cause any
-serious problem, but it seems too risky to allow.
+parent node can't be deleted unless it's the only remaining child, in which
+case we will delete the parent too (see below).
+
+In the second-stage, the half-dead leaf page is unlinked from its siblings.
+We first lock the left sibling (if any) of the target, the target page
+itself, and its right sibling (there must be one) in that order. Then we
+update the side-links in the siblings, and mark the target page deleted.
+
+When we're about to delete the last remaining child of a parent page, things
+are slightly more complicated. In the first stage, we leave the immediate
+parent of the leaf page alone, and remove the downlink to the parent page
+instead, from the grandparent. If it's the last child of the grandparent
+too, we recurse up until we find a parent with more than one child, and
+remove the downlink of that page. The leaf page is marked as half-dead, and
+the block number of the page whose downlink was removed is stashed in the
+half-dead leaf page. This leaves us with a chain of internal pages, with
+one downlink each, leading to the half-dead leaf page, and no downlink
+pointing to the topmost page in the chain.
+
+While we recurse up to find the topmost parent in the chain, we keep the
+leaf page locked, but don't need to hold locks on the intermediate pages
+between the leaf and the topmost parent -- insertions into upper tree levels
+happen only as a result of splits of child pages, and that can't happen as
+long as we're keeping the leaf locked. The internal pages in the chain
+cannot acquire new children afterwards either, because the leaf page is
+marked as half-dead and won't be split.
+
+Removing the downlink to the top of the to-be-deleted chain effectively
+transfers the key space to the right sibling for all the intermediate levels
+too, in one atomic operation. A concurrent search might still visit the
+intermediate pages, but it will move right when it reaches the half-dead page
+at the leaf level.
+
+In the second stage, the topmost page in the chain is unlinked from its
+siblings, and the half-dead leaf page is updated to point to the next page
+down in the chain. This is repeated until there are no internal pages left
+in the chain. Finally, the half-dead leaf page itself is unlinked from its
+siblings.
 
 A deleted page cannot be reclaimed immediately, since there may be other
 processes waiting to reference it (ie, search processes that just left the
@@ -396,12 +416,11 @@ metapage update (of the "fast root" link) is performed and logged as part
 of the insertion into the parent level. When splitting the root page, the
 metapage update is handled as part of the "new root" action.
 
-A page deletion is logged as a single WAL entry covering all four
-required page updates (target page, left and right siblings, and parent)
-as an atomic event. (Any required fast-root link update is also part
-of the WAL entry.) If the parent page becomes half-dead but is not
-immediately deleted due to a subsequent crash, there is no loss of
-consistency, and the empty page will be picked up by the next VACUUM.
+Each step in page deletion are logged as separate WAL entries: marking the
+leaf as half-dead and removing the downlink is one record, and unlinking a
+page is a second record. If vacuum is interrupted for some reason, or the
+system crashes, the tree is consistent for searches and insertions. The next
+VACUUM will find the half-dead leaf page and continue the deletion.
 
 Scans during Recovery
 ---------------------
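
The two-stage deletion that the README changes above describe can be sketched with a small in-memory model. This is a rough illustration under simplifying assumptions, not the nbtree implementation: there is no locking, no WAL, and no key space. The `Page` class and the helpers `link`, `chain`, `find_topmost`, `mark_half_dead`, `unlink_page` and `unlink_half_dead` are hypothetical names invented for this example, as are the block numbers.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)  # identity comparison; pages reference each other cyclically
class Page:
    blkno: int
    children: list = field(default_factory=list)  # downlinks (internal pages only)
    parent: Optional["Page"] = None
    left: Optional["Page"] = None
    right: Optional["Page"] = None
    half_dead: bool = False
    deleted: bool = False
    top_parent: Optional["Page"] = None  # stashed in a half-dead leaf


def link(parent, *kids):
    """Give parent downlinks to kids."""
    parent.children = list(kids)
    for kid in kids:
        kid.parent = parent


def chain(*pages):
    """Set up sibling links for one tree level, left to right."""
    for a, b in zip(pages, pages[1:]):
        a.right, b.left = b, a


def find_topmost(leaf):
    """Walk up while each parent has this page as its only child."""
    top = leaf
    while top.parent is not None and len(top.parent.children) == 1:
        top = top.parent
    return top


def mark_half_dead(leaf):
    """Stage 1: remove the downlink to the top of the chain, mark the leaf."""
    top = find_topmost(leaf)
    parent = top.parent
    if parent is None or parent.children[-1] is top:
        return False  # would remove a rightmost child (or the root): give up
    parent.children.remove(top)
    leaf.half_dead = True
    leaf.top_parent = top if top is not leaf else None
    return True


def unlink_page(page):
    """Unlink a page from its siblings; its own side-links stay untouched."""
    if page.left is not None:
        page.left.right = page.right
    if page.right is not None:
        page.right.left = page.left
    page.deleted = True


def unlink_half_dead(leaf):
    """Stage 2: unlink the chain from top to bottom, then the leaf itself."""
    order = []
    while leaf.top_parent is not None:
        top = leaf.top_parent
        unlink_page(top)
        order.append(top.blkno)
        below = top.children[0]
        leaf.top_parent = below if below is not leaf else None
    unlink_page(leaf)
    order.append(leaf.blkno)
    return order


# root(1) -> a(2), b(3);  a -> c(4) -> l1(5);  b -> d(6) -> l2(7), l3(8)
root, a, b, c, l1, d, l2, l3 = (Page(n) for n in range(1, 9))
link(root, a, b); link(a, c); link(c, l1); link(b, d); link(d, l2, l3)
chain(a, b); chain(c, d); chain(l1, l2, l3)

stage1_ok = mark_half_dead(l1)      # removes a's downlink from root
unlink_order = unlink_half_dead(l1)  # unlinks a, then c, then l1
```

Because the two stages are separate calls, the model can stop after `mark_half_dead` and resume later from the block stashed in the half-dead leaf, which mirrors the README's point that the next VACUUM continues an interrupted deletion.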
