Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
Skip to content

Commit 59cea09

Browse files
committed
Fix race leading to incorrect conflict cause in InvalidatePossiblyObsoleteSlot()
The invalidation of an active slot is done in two steps: - Termination of the backend holding it, if any. - Report that the slot is obsolete, with a conflict cause depending on the slot's data. This can be racy because between these two steps the slot mutex would be released while doing system calls, which means that the effective_xmin and effective_catalog_xmin could advance during that time, detecting a conflict cause different than the one originally wanted before the process owning a slot is terminated. Holding the mutex longer is not an option, so this commit changes the code to record the LSNs stored in the slot during the termination of the process owning the slot. Bonus thanks to Alexander Lakhin for the various tests and the analysis. Author: Bertrand Drouvot Reviewed-by: Michael Paquier, Bharath Rupireddy Discussion: https://postgr.es/m/ZaTjW2Xh+TQUCOH0@ip-10-97-1-34.eu-west-3.compute.internal Backpatch-through: 16
1 parent ec4fcf4 commit 59cea09

File tree

1 file changed

+33
-6
lines changed

1 file changed

+33
-6
lines changed

src/backend/replication/slot.c

Lines changed: 33 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -1321,6 +1321,11 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlotInvalidationCause cause,
13211321
{
13221322
int last_signaled_pid = 0;
13231323
bool released_lock = false;
1324+
bool terminated = false;
1325+
XLogRecPtr initial_effective_xmin = InvalidXLogRecPtr;
1326+
XLogRecPtr initial_catalog_effective_xmin = InvalidXLogRecPtr;
1327+
XLogRecPtr initial_restart_lsn = InvalidXLogRecPtr;
1328+
ReplicationSlotInvalidationCause conflict_prev PG_USED_FOR_ASSERTS_ONLY = RS_INVAL_NONE;
13241329

13251330
for (;;)
13261331
{
@@ -1355,11 +1360,24 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlotInvalidationCause cause,
13551360
*/
13561361
if (s->data.invalidated == RS_INVAL_NONE)
13571362
{
1363+
/*
1364+
* The slot's mutex will be released soon, and it is possible that
1365+
* those values change since the process holding the slot has been
1366+
* terminated (if any), so record them here to ensure that we
1367+
* would report the correct conflict cause.
1368+
*/
1369+
if (!terminated)
1370+
{
1371+
initial_restart_lsn = s->data.restart_lsn;
1372+
initial_effective_xmin = s->effective_xmin;
1373+
initial_catalog_effective_xmin = s->effective_catalog_xmin;
1374+
}
1375+
13581376
switch (cause)
13591377
{
13601378
case RS_INVAL_WAL_REMOVED:
1361-
if (s->data.restart_lsn != InvalidXLogRecPtr &&
1362-
s->data.restart_lsn < oldestLSN)
1379+
if (initial_restart_lsn != InvalidXLogRecPtr &&
1380+
initial_restart_lsn < oldestLSN)
13631381
conflict = cause;
13641382
break;
13651383
case RS_INVAL_HORIZON:
@@ -1368,12 +1386,12 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlotInvalidationCause cause,
13681386
/* invalid DB oid signals a shared relation */
13691387
if (dboid != InvalidOid && dboid != s->data.database)
13701388
break;
1371-
if (TransactionIdIsValid(s->effective_xmin) &&
1372-
TransactionIdPrecedesOrEquals(s->effective_xmin,
1389+
if (TransactionIdIsValid(initial_effective_xmin) &&
1390+
TransactionIdPrecedesOrEquals(initial_effective_xmin,
13731391
snapshotConflictHorizon))
13741392
conflict = cause;
1375-
else if (TransactionIdIsValid(s->effective_catalog_xmin) &&
1376-
TransactionIdPrecedesOrEquals(s->effective_catalog_xmin,
1393+
else if (TransactionIdIsValid(initial_catalog_effective_xmin) &&
1394+
TransactionIdPrecedesOrEquals(initial_catalog_effective_xmin,
13771395
snapshotConflictHorizon))
13781396
conflict = cause;
13791397
break;
@@ -1386,6 +1404,13 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlotInvalidationCause cause,
13861404
}
13871405
}
13881406

1407+
/*
1408+
* The conflict cause recorded previously should not change while the
1409+
* process owning the slot (if any) has been terminated.
1410+
*/
1411+
Assert(!(conflict_prev != RS_INVAL_NONE && terminated &&
1412+
conflict_prev != conflict));
1413+
13891414
/* if there's no conflict, we're done */
13901415
if (conflict == RS_INVAL_NONE)
13911416
{
@@ -1460,6 +1485,8 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlotInvalidationCause cause,
14601485
(void) kill(active_pid, SIGTERM);
14611486

14621487
last_signaled_pid = active_pid;
1488+
terminated = true;
1489+
conflict_prev = conflict;
14631490
}
14641491

14651492
/* Wait until the slot is released. */

0 commit comments

Comments
 (0)