Commit 32ca32d

Revise GIN README
We find GIN concurrency bugs from time to time. One of the problems here is that the concurrency of GIN isn't well documented in the README, so it can be hard even to distinguish design bugs from implementation bugs.

This commit revises the concurrency section of the GIN README, providing more details. Some examples are illustrated in ASCII art.

This commit also adds an explanation of how the tuple layout in internal GIN B-tree pages differs from the one used by nbtree.

Discussion: https://postgr.es/m/CAPpHfduXR_ywyaVN4%2BOYEGaw%3DcPLzWX6RxYLBncKw8de9vOkqw%40mail.gmail.com
Author: Alexander Korotkov
Reviewed-by: Peter Geoghegan
Backpatch-through: 9.4
1 parent d5ad7a0 commit 32ca32d

1 file changed

+176
-38
lines changed

src/backend/access/gin/README

Lines changed: 176 additions & 38 deletions
@@ -215,6 +215,35 @@ fit on one pending-list page must have those pages to itself, even if this
 results in wasting much of the space on the preceding page and the last
 page for the tuple.)

+GIN packs downlinks and pivot keys into internal page tuples in a different way
+than nbtree does. Lehman & Yao define the layout as follows.
+
+    P_0, K_1, P_1, K_2, P_2, ... , K_n, P_n, K_{n+1}
+
+Here P_i is a downlink and K_i is a key. K_i splits the key space between
+P_{i-1} and P_i (1 <= i <= n). K_{n+1} is the high key.
+
+In an internal page tuple, a key and a downlink are grouped together. nbtree
+packs keys and downlinks into tuples as follows.
+
+    (K_{n+1}, None), (-Inf, P_0), (K_1, P_1), ... , (K_n, P_n)
+
+Here tuples are shown in parentheses. So, the high key is stored separately.
+P_i is grouped with K_i. P_0 is grouped with the -Inf key.
+
+GIN packs keys and downlinks into tuples in a different way.
+
+    (P_0, K_1), (P_1, K_2), ... , (P_n, K_{n+1})
+
+P_i is grouped with K_{i+1}. The -Inf key is not needed.
+
+There are a couple of additional notes regarding the K_{n+1} key.
+1) In the entry tree's rightmost page, the key coupled with P_n doesn't really
+   matter. The high key is assumed to be infinity.
+2) In a posting tree, the key coupled with P_n never matters. The high key for
+   non-rightmost pages is stored separately and accessed via
+   GinDataPageGetRightBound().
+
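The two packings described above can be illustrated with a minimal Python sketch. It is only a model of the pairing rules: the names `pack_nbtree`, `pack_gin` and `NEG_INF`, and the use of strings for keys and downlinks, are assumptions made for illustration and do not correspond to PostgreSQL functions or on-disk formats.

```python
NEG_INF = "-Inf"  # illustrative stand-in for nbtree's minus-infinity key


def pack_nbtree(downlinks, keys):
    """Pair P_0..P_n with K_1..K_{n+1} the way the text describes for nbtree."""
    if len(downlinks) != len(keys):
        raise ValueError("expected n+1 downlinks and n+1 keys")
    *pivots, high_key = keys                    # K_1..K_n, then K_{n+1}
    tuples = [(high_key, None),                 # high key is stored separately
              (NEG_INF, downlinks[0])]          # P_0 is grouped with -Inf
    tuples.extend(zip(pivots, downlinks[1:]))   # (K_i, P_i) for i = 1..n
    return tuples


def pack_gin(downlinks, keys):
    """Pair P_0..P_n with K_1..K_{n+1} the way the text describes for GIN."""
    if len(downlinks) != len(keys):
        raise ValueError("expected n+1 downlinks and n+1 keys")
    return list(zip(downlinks, keys))           # (P_i, K_{i+1}), no -Inf key


downlinks = ["P_0", "P_1", "P_2"]   # n = 2
keys = ["K_1", "K_2", "K_3"]        # K_3 plays the role of K_{n+1}
print(pack_nbtree(downlinks, keys))
print(pack_gin(downlinks, keys))
```

Both functions take the same Lehman & Yao sequence, which makes the difference visible: nbtree needs one extra tuple for the high key plus a -Inf key, while GIN needs exactly one tuple per downlink.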
 Posting tree
 ------------

@@ -278,21 +307,86 @@ Concurrency
 -----------

 The entry tree and each posting tree are B-trees, with right-links connecting
-sibling pages at the same level. This is the same structure that is used in
+sibling pages at the same level. This is the same structure that is used in
 the regular B-tree indexam (invented by Lehman & Yao), but we don't support
-scanning a GIN trees backwards, so we don't need left-links.
-
-To avoid deadlocks, B-tree pages must always be locked in the same order:
-left to right, and bottom to top. The exception is page deletion during vacuum,
-which would be considered separately. When searching, the tree is traversed
-from top to bottom, so the lock on the parent page must be released before
-descending to the next level. Concurrent page splits move the keyspace to
-right, so after following a downlink, the page actually containing the key
-we're looking for might be somewhere to the right of the page we landed on.
-In that case, we follow the right-links until we find the page we're looking
-for. In spite of searches, insertions keeps pins on all pages in path from
-root to the current page. These pages potentially requies downlink insertion,
-while pins prevent them from being deleted.
+scanning GIN trees backwards, so we don't need left-links. The entry tree
+leaves don't have dedicated high keys; instead, the greatest leaf tuple serves
+as the high key. That works because tuples are never deleted from the entry
+tree.
+
+The algorithms used to operate on entry and posting trees are described below.
+
+### Locating the leaf page
+
+When we search for a leaf page in a GIN btree to perform a read, we descend
+from the root page to the leaf by following downlinks, holding a pin and a
+shared lock on only one page at a time. So, we release the pin and shared lock
+on the previous page before acquiring them on the next page.
+
+The picture below shows the tree state after finding the leaf page. Lower case
+letters depict tree pages. 'S' depicts a shared lock on the page.
+
+               a
+             / | \
+         b     c     d
+       / | \   | \   | \
+     eS  f  g  h  i  j  k
+
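The read descent described above can be modelled with a small Python sketch. It is a toy model under stated assumptions: `Page`, `find_leaf`, `max_locks_held` and the integer keys are hypothetical, locks are only recorded as events rather than taken, and pins are omitted because in this path they are acquired and released together with the lock.

```python
from bisect import bisect_left


class Page:
    def __init__(self, name, keys=(), children=()):
        self.name = name
        self.keys = list(keys)          # GIN-style: keys[i] bounds children[i]
        self.children = list(children)  # empty for a leaf page


def find_leaf(root, key, events):
    """Descend to a leaf, recording lock events; one page locked at a time."""
    page = root
    events.append(("lock_shared", page.name))
    while page.children:
        # First child whose bound is >= key; the last bound counts as infinity.
        idx = min(bisect_left(page.keys, key), len(page.children) - 1)
        child = page.children[idx]
        events.append(("unlock", page.name))        # release the parent first
        events.append(("lock_shared", child.name))  # then lock the child
        page = child
    return page


def max_locks_held(events):
    """Largest number of locks held simultaneously in an event trace."""
    held, peak = 0, 0
    for action, _ in events:
        held += 1 if action.startswith("lock") else -1
        peak = max(peak, held)
    return peak


# The tree from the picture, with made-up integer key bounds.
leaves = {name: Page(name) for name in "efghijk"}
b = Page("b", [10, 20, 30], [leaves["e"], leaves["f"], leaves["g"]])
c = Page("c", [40, 50], [leaves["h"], leaves["i"]])
d = Page("d", [60, 70], [leaves["j"], leaves["k"]])
a = Page("a", [30, 50, 70], [b, c, d])

events = []
leaf = find_leaf(a, 5, events)
print(leaf.name, events, max_locks_held(events))
```

The trace ends with only the leaf locked, which corresponds to the single 'S' in the picture.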
+### Stepping right
+
+Concurrent page splits move the keyspace to the right, so after following a
+downlink, the page actually containing the key we're looking for might be
+somewhere to the right of the page we landed on. In that case, we follow the
+right-links until we find the page we're looking for.
+
+While stepping right, we take a pin and shared lock on the right sibling before
+releasing them on the current page. This mechanism is designed to protect
+against stepping onto a deleted page. We step to the right sibling while
+holding a lock on the page whose rightlink points there. So, it's guaranteed
+that nobody can update the rightlink concurrently, and hence nobody can delete
+the right sibling.
+
+The picture below shows two pages locked at once while stepping right.
+
+               a
+             / | \
+         b     c     d
+       / | \   | \   | \
+     eS  fS g  h  i  j  k
+
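The lock ordering used while stepping right can be sketched in Python as follows. This is an illustrative model only: `Leaf`, `step_right` and the integer high keys are assumptions, and locks are recorded as events instead of being acquired.

```python
class Leaf:
    def __init__(self, name, high_key, right=None):
        self.name = name
        self.high_key = high_key  # None stands for "infinity" (rightmost page)
        self.right = right        # right-link to the sibling, if any


def step_right(page, key, events):
    """Follow right-links until `key` fits on the page.

    Assumes the caller already holds a shared lock on `page`. The sibling is
    locked before the current page is released, so the right-link we follow
    cannot be changed underneath us.
    """
    while page.high_key is not None and key > page.high_key:
        sibling = page.right
        events.append(("lock_shared", sibling.name))  # lock the sibling first
        events.append(("unlock", page.name))          # then release current
        page = sibling
    return page


g = Leaf("g", None)
f = Leaf("f", 20, right=g)
e = Leaf("e", 10, right=f)

events = [("lock_shared", "e")]     # we landed on 'e' after the descent
found = step_right(e, 15, events)
print(found.name, events)
```

Between the second and third event both 'e' and 'f' are locked, which is the state shown in the picture.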
+### Insert
+
+While finding the appropriate leaf for insertion, we also descend from the root
+to the leaf, shared-locking one page at a time. But during insertion we don't
+release the pins on the root and internal pages. That can save us some lookups
+in the buffer hash table when inserting downlinks, assuming the parents have
+not changed due to concurrent splits. Once we reach the leaf, we re-lock the
+page in exclusive mode.
+
+The picture below shows the leaf page locked in exclusive mode and ready for
+insertion. 'P' and 'E' depict a pin and an exclusive lock, respectively.
+
+
+               aP
+             / | \
+         b     cP    d
+       / | \   | \   | \
+     e   f  g  hE i  j  k
+
+
+If the insert causes a page split, the parent is locked in exclusive mode
+before unlocking the left child. So, the insertion algorithm can exclusively
+lock both parent and child pages at once, starting from the child.
+
+The picture below shows the tree state after the leaf page split. 'q' is the
+new page produced by the split. Parent 'c' is about to have a downlink
+inserted.
+
+               aP
+             / | \
+         b     cE    d
+       / | \ / | \   | \
+     e   f g hE q i  j  k
+
+
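The insertion descent described above differs from the read descent mainly in what it keeps. The Python sketch below models that under loud assumptions: `Page`, `descend_for_insert` and `lock_parent_for_split` are hypothetical names, locks and pins are recorded as events only, and the re-checks a real implementation needs after re-locking the leaf (for example, stepping right after a concurrent split) are left out.

```python
from bisect import bisect_left


class Page:
    def __init__(self, name, keys=(), children=()):
        self.name = name
        self.keys = list(keys)          # GIN-style: keys[i] bounds children[i]
        self.children = list(children)  # empty for a leaf page


def descend_for_insert(root, key, events):
    """Descend to the leaf, keeping a pin on every page of the path."""
    path = []
    page = root
    while True:
        events.append(("pin", page.name))
        events.append(("lock_shared", page.name))
        path.append(page)
        if not page.children:
            break
        idx = min(bisect_left(page.keys, key), len(page.children) - 1)
        events.append(("unlock", page.name))    # lock released, pin kept
        page = page.children[idx]
    events.append(("unlock", page.name))        # drop the shared lock ...
    events.append(("lock_exclusive", page.name))  # ... and re-lock exclusively
    return path


def lock_parent_for_split(path, events):
    """After a leaf split: lock the parent before unlocking the left child."""
    child, parent = path[-1], path[-2]
    events.append(("lock_exclusive", parent.name))
    events.append(("unlock", child.name))


leaves = {name: Page(name) for name in "efghijk"}
b = Page("b", [10, 20, 30], [leaves["e"], leaves["f"], leaves["g"]])
c = Page("c", [40, 50], [leaves["h"], leaves["i"]])
d = Page("d", [60, 70], [leaves["j"], leaves["k"]])
a = Page("a", [30, 50, 70], [b, c, d])

events = []
path = descend_for_insert(a, 35, events)
print([p.name for p in path], events)
```

The pinned path gives the inserter direct access to the parents when a downlink has to be inserted, without looking the pages up again.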
+### Page deletion

 Vacuum never deletes tuples or pages from the entry tree. It traverses entry
 tree leafs in logical order by rightlinks and removes deletable TIDs from
@@ -301,39 +395,83 @@ are vacuumed in two stages. At first stage, deletable TIDs are removed from
 leafs. If first stage detects at least one empty page, then at the second stage
 ginScanToDelete() deletes empty pages.

-ginScanToDelete() scans posting tree to delete empty pages, while vacuum holds
-cleanup lock on the posting tree root. This lock prevent concurrent inserts,
-because inserters hold a pin on the root page. In spite of inserters searches
-don't hold pin on root page. So, while new searches cannot begin while root page
-is locked, any already-in-progress scans can continue concurrently with vacuum.
+ginScanToDelete() traverses the whole tree in a depth-first manner. It starts
+by taking a super-exclusive lock on the tree root. This lock prevents all
+concurrent insertions into this tree while we're deleting pages. However,
+there still might be some in-progress readers, who traversed the root before
+we locked it.
+
+The picture below shows the tree state after the page deletion algorithm has
+traversed to the leftmost leaf of the tree.

-ginScanToDelete() does depth-first tree scanning while keeping each page in path
-from root to current page exclusively locked. It also keeps left sibling of
-each page in the path locked. Thus, if current page is to be removed, all
-required pages to remove both downlink and rightlink are already locked.
-Therefore, page deletion locks pages top to bottom and left to right breaking
-our general rule. But assuming there is no concurrent insertions, this can't
-cause a deadlock.
+               aE
+             / | \
+         bE    c     d
+       / | \   | \   | \
+     eE  f  g  h  i  j  k

-During replay od page deletion at standby, the page's left sibling, the target
-page, and its parent, are locked in that order. So it follows bottom to top and
-left to right rule.
+The deletion algorithm keeps exclusive locks on the left siblings of the pages
+comprising the currently investigated path. Thus, if the current page is to be
+removed, all pages required to remove both the downlink and the rightlink are
+already locked. That avoids a potential right-to-left page locking order,
+which could deadlock with concurrent stepping right.

 A search concurrent to page deletion might already have read a pointer to the
-page to be deleted, and might be just about to follow it. A page can be reached
+page to be deleted, and might be just about to follow it. A page can be reached
 via the right-link of its left sibling, or via its downlink in the parent.

-To prevent a backend from reaching a deleted page via a right-link, when
-following a right-link the lock on the previous page is not released until the
-lock on next page has been acquired.
+To prevent a backend from reaching a deleted page via a right-link, the
+stepping right algorithm doesn't release the lock on the current page until
+the lock on the right page is acquired.

-The downlink is more tricky. A search descending the tree must release the lock
+The downlink is more tricky. A search descending the tree must release the lock
 on the parent page before locking the child, or it could deadlock with a
 concurrent split of the child page; a page split locks the parent, while already
-holding a lock on the child page. So, deleted page cannot be reclaimed
-immediately. Instead, we have to wait for every transaction, which might wait to
-reference this page, to finish. Corresponding processes must observe that the
-page is marked deleted and recover accordingly.
+holding a lock on the child page. So, a deleted page cannot be reclaimed
+immediately. Instead, we have to wait for every transaction that might still
+reference this page to finish. Such processes must observe that the page is
+marked deleted and recover accordingly.
+
+The picture below shows the tree state after the page deletion algorithm has
+traversed further into the tree. The currently investigated path is 'a-c-h'.
+The left siblings 'b' and 'g' of 'c' and 'h', respectively, are also
+exclusively locked.
+
+               aE
+             / | \
+         bE    cE    d
+       / | \   | \   | \
+     e   f  gE hE i  j  k
+
+The next picture shows the tree state after page 'h' was deleted. It's marked
+with the 'deleted' flag and the newest xid that might visit it. The downlink
+from 'c' to 'h' is also deleted.
+
+               aE
+             / | \
+         bE    cE    d
+       / | \     \   | \
+     e   f  gE hD iE j  k
+
+However, it's still possible that a concurrent reader has seen the downlink
+from 'c' to 'h' before we deleted it. In that case, this reader will step right
+from 'h' until it finds a non-deleted page. The xid-marking of page 'h'
+guarantees that this page won't be reused until all such readers are gone. The
+next leaf page under investigation is 'i'. 'g' remains locked as it becomes the
+left sibling of 'i'.
+
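The reader-side recovery described above can be sketched in Python. This is a simplified model, not the actual implementation: `Leaf`, `skip_deleted` and the event trace are hypothetical, and the sketch assumes that a deleted page always still has a right sibling to step to.

```python
class Leaf:
    def __init__(self, name, right=None, deleted=False):
        self.name = name
        self.right = right      # right-link, still valid on a deleted page
        self.deleted = deleted  # models the 'deleted' flag


def skip_deleted(page, events):
    """Step right from a deleted page until a live page is found.

    Assumes the caller holds a shared lock on `page`, which it reached through
    a downlink read before that downlink was removed.
    """
    while page.deleted:
        sibling = page.right
        if sibling is None:
            # Assumption of this sketch: this should not happen in practice.
            raise RuntimeError("deleted page has no right sibling")
        events.append(("lock_shared", sibling.name))  # lock the sibling first
        events.append(("unlock", page.name))          # then release current
        page = sibling
    return page


i = Leaf("i")
h = Leaf("h", right=i, deleted=True)

events = [("lock_shared", "h")]   # reader followed the stale downlink to 'h'
live = skip_deleted(h, events)
print(live.name, events)
```

This only works because 'h' is kept around, with its right-link intact, until no such reader can exist, which is what the xid-marking is for.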
+The next picture shows the tree state after 'i' and 'c' were deleted. Internal
+page 'c' was deleted because it was left with no downlinks. The path under
+investigation is 'a-d-j'. Pages 'b' and 'g' are locked as the left siblings of
+'d' and 'j'.
+
+               aE
+             /   \
+         bE    cD    dE
+       / | \         | \
+     e   f  gE hD iD jE k
+
+During the replay of page deletion at standby, the page's left sibling, the
+target page, and its parent are locked in that order. This order guarantees
+no deadlock with concurrent reads.

 Predicate Locking
 -----------------
