Commit 9be4ce4

Make deadlock-parallel isolation test more robust.
This test failed fairly reproducibly on some CLOBBER_CACHE_ALWAYS buildfarm animals. The cause seems to be that if a parallel worker is slow enough to reach its lock wait, it may not be released by the first deadlock check run, and then later deadlock checks might decide to unblock the d2 session instead of the d1 session, leaving us in an undetected deadlock state (since the isolationtester client is waiting for d1 to complete first).

Fix by introducing an additional lock wait at the end of the d2a1 step, ensuring that the deadlock checker will recognize that d1 has to be unblocked before d2a1 completes.

Also reduce max_parallel_workers_per_gather to 3 in this test. With the default max_worker_processes value, we were only getting one parallel worker for the d2a1 step, which is not the case I hoped to test. We should get 3 for d1a2 and 2 for d2a1, as the code stands; and maybe 3 for d2a1 if somebody figures out why the last parallel worker slot isn't free already.

Discussion: https://postgr.es/m/22195.1566077308@sss.pgh.pa.us
1 parent d78d452 commit 9be4ce4
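For context, the lock_share() and lock_excl() calls used throughout this test are thin wrappers around PostgreSQL's transaction-scoped advisory-lock functions. Only the first line of the spec's setup block is visible in the diff below, so the following is a hedged sketch of what those helpers presumably look like, for orientation rather than the verbatim definitions:

    -- Hypothetical reconstruction of the helpers created in the spec's setup block.
    -- The first argument is the advisory-lock key; the second is unused by the lock
    -- call and exists only so the function can be applied to every row of bigt,
    -- forcing each parallel worker to acquire the lock.  PARALLEL SAFE is what
    -- allows parallel workers to run them at all.
    create function lock_share(int, int) returns int language sql as
      'select pg_advisory_xact_lock_shared($1); select 1;' parallel safe;
    create function lock_excl(int, int) returns int language sql as
      'select pg_advisory_xact_lock($1); select 1;' parallel safe;

Read this way, d1a1's new lock_excl(3,x) call means session d1 holds advisory lock 3 exclusively for the life of its transaction, which is exactly what the added final step of d2a1 waits on.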

File tree

2 files changed, +43 -12 lines

src/test/isolation/expected/deadlock-parallel.out

+13 -6

@@ -1,10 +1,10 @@
 Parsed test spec with 4 sessions
 
 starting permutation: d1a1 d2a2 e1l e2l d1a2 d2a1 d1c e1c d2c e2c
-step d1a1: SELECT lock_share(1,x) FROM bigt LIMIT 1;
-lock_share
+step d1a1: SELECT lock_share(1,x), lock_excl(3,x) FROM bigt LIMIT 1;
+lock_share lock_excl
 
-1
+1 1
 step d2a2: select lock_share(2,x) FROM bigt LIMIT 1;
 lock_share
 
@@ -16,15 +16,19 @@ step d1a2: SET force_parallel_mode = on;
 SET parallel_tuple_cost = 0;
 SET min_parallel_table_scan_size = 0;
 SET parallel_leader_participation = off;
-SET max_parallel_workers_per_gather = 4;
+SET max_parallel_workers_per_gather = 3;
 SELECT sum(lock_share(2,x)) FROM bigt; <waiting ...>
 step d2a1: SET force_parallel_mode = on;
 SET parallel_setup_cost = 0;
 SET parallel_tuple_cost = 0;
 SET min_parallel_table_scan_size = 0;
 SET parallel_leader_participation = off;
-SET max_parallel_workers_per_gather = 4;
-SELECT sum(lock_share(1,x)) FROM bigt; <waiting ...>
+SET max_parallel_workers_per_gather = 3;
+SELECT sum(lock_share(1,x)) FROM bigt;
+SET force_parallel_mode = off;
+RESET parallel_setup_cost;
+RESET parallel_tuple_cost;
+SELECT lock_share(3,x) FROM bigt LIMIT 1; <waiting ...>
 step d1a2: <... completed>
 sum
 
@@ -38,6 +42,9 @@ step d2a1: <... completed>
 sum
 
 10000
+lock_share
+
+1
 step e1c: COMMIT;
 step d2c: COMMIT;
 step e2l: <... completed>

src/test/isolation/specs/deadlock-parallel.spec

+30 -6

@@ -15,6 +15,25 @@
 # The deadlock detector resolves the deadlock by reversing the d1-e2 edge,
 # unblocking d1.
 
+# However ... it's not actually that well-defined whether the deadlock
+# detector will prefer to unblock d1 or d2. It depends on which backend
+# is first to run DeadLockCheck after the deadlock condition is created:
+# that backend will search outwards from its own wait condition, and will
+# first find a loop involving the *other* lock. We encourage that to be
+# one of the d2a1 parallel workers, which will therefore unblock d1a2
+# workers, by setting a shorter deadlock_timeout in session d2. But on
+# slow machines, one or more d1a2 workers may not yet have reached their
+# lock waits, so that they're not unblocked by the first DeadLockCheck.
+# The next DeadLockCheck may choose to unblock the d2a1 workers instead,
+# which would allow d2a1 to complete before d1a2, causing the test to
+# freeze up because isolationtester isn't expecting that completion order.
+# (In effect, we have an undetectable deadlock because d2 is waiting for
+# d1's completion, but on the client side.) To fix this, introduce an
+# additional lock (advisory lock 3), which is initially taken by d1 and
+# then d2a1 will wait for it after completing the main part of the test.
+# In this way, the deadlock detector can see that d1 must be completed
+# first, regardless of timing.
+
 setup
 {
 create function lock_share(int,int) returns int language sql as
@@ -39,15 +58,15 @@ setup { BEGIN isolation level repeatable read;
 SET force_parallel_mode = off;
 SET deadlock_timeout = '10s';
 }
-# this lock will be taken in the leader, so it will persist:
-step "d1a1" { SELECT lock_share(1,x) FROM bigt LIMIT 1; }
+# these locks will be taken in the leader, so they will persist:
+step "d1a1" { SELECT lock_share(1,x), lock_excl(3,x) FROM bigt LIMIT 1; }
 # this causes all the parallel workers to take locks:
 step "d1a2" { SET force_parallel_mode = on;
 SET parallel_setup_cost = 0;
 SET parallel_tuple_cost = 0;
 SET min_parallel_table_scan_size = 0;
 SET parallel_leader_participation = off;
-SET max_parallel_workers_per_gather = 4;
+SET max_parallel_workers_per_gather = 3;
 SELECT sum(lock_share(2,x)) FROM bigt; }
 step "d1c" { COMMIT; }
 
@@ -58,14 +77,19 @@ setup { BEGIN isolation level repeatable read;
 }
 # this lock will be taken in the leader, so it will persist:
 step "d2a2" { select lock_share(2,x) FROM bigt LIMIT 1; }
-# this causes all the parallel workers to take locks:
+# this causes all the parallel workers to take locks;
+# after which, make the leader take lock 3 to prevent client-driven deadlock
 step "d2a1" { SET force_parallel_mode = on;
 SET parallel_setup_cost = 0;
 SET parallel_tuple_cost = 0;
 SET min_parallel_table_scan_size = 0;
 SET parallel_leader_participation = off;
-SET max_parallel_workers_per_gather = 4;
-SELECT sum(lock_share(1,x)) FROM bigt; }
+SET max_parallel_workers_per_gather = 3;
+SELECT sum(lock_share(1,x)) FROM bigt;
+SET force_parallel_mode = off;
+RESET parallel_setup_cost;
+RESET parallel_tuple_cost;
+SELECT lock_share(3,x) FROM bigt LIMIT 1; }
 step "d2c" { COMMIT; }
 
 session "e1"
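The ordering guarantee added by advisory lock 3 can be reproduced outside the isolation framework with two plain psql sessions. This is a minimal illustrative sketch using the core advisory-lock functions directly rather than the test's wrappers:

    -- Session A stands in for d1: it holds advisory lock 3 exclusively for the
    -- duration of its transaction, as d1a1 now does via lock_excl(3,x).
    BEGIN;
    SELECT pg_advisory_xact_lock(3);

    -- Session B stands in for the tail of d2a1: a shared request on the same key
    -- blocks until session A's transaction ends, so B cannot complete first.
    BEGIN;
    SELECT pg_advisory_xact_lock_shared(3);  -- blocks here

    -- Back in session A: committing releases the lock and lets session B continue.
    COMMIT;

Because d2a1 now ends with exactly this kind of wait, the wait-for graph always shows d2 blocked behind d1, so any resolution the deadlock checker picks must let d1 complete first, which is the completion order isolationtester expects.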
