
Commit a4d1ce0

Don't lose walreceiver start requests due to race condition in postmaster.
When a walreceiver dies, the startup process will notice that and send a PMSIGNAL_START_WALRECEIVER signal to the postmaster, asking for a new walreceiver to be launched. There's a race condition, which at least in HEAD is very easy to hit, whereby the postmaster might see that signal before it processes the SIGCHLD from the walreceiver process. In that situation, sigusr1_handler() just dropped the start request on the floor, reasoning that it must be redundant. Eventually, after 10 seconds (WALRCV_STARTUP_TIMEOUT), the startup process would make a fresh request --- but that's a long time if the connection could have been re-established almost immediately.

Fix it by setting a state flag inside the postmaster that we won't clear until we do launch a walreceiver. In cases where that results in an extra walreceiver launch, it's up to the walreceiver to realize it's unwanted and go away --- but we have, and need, that logic anyway for the opposite race case.

I came across this through investigating unexpected delays in the src/test/recovery TAP tests: it manifests there in test cases where a master server is stopped and restarted while leaving streaming slaves active.

This logic has been broken all along, so back-patch to all supported branches.

Discussion: https://postgr.es/m/21344.1498494720@sss.pgh.pa.us
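The ordering problem described above can be replayed in a small, self-contained model. This is only an illustrative sketch, not PostgreSQL code: `ToyPostmaster` and every `toy_*` function are hypothetical names invented here, and the model ignores `pmState` and `Shutdown` entirely. It replays the sequence "start request arrives, then the child exit is processed, then the main loop runs once" and counts how many walreceivers get launched with and without a remembered request.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified model of the postmaster state involved. */
typedef struct
{
    int  walreceiver_pid;   /* 0 means no walreceiver is running */
    bool requested;         /* plays the role of WalReceiverRequested */
    int  next_pid;          /* fake PID source for the simulation */
    int  launches;          /* walreceivers started during the replay */
} ToyPostmaster;

/* Launch only if none is running; clear the request once launched. */
static void
toy_maybe_start(ToyPostmaster *pm)
{
    if (pm->walreceiver_pid == 0)
    {
        pm->walreceiver_pid = pm->next_pid++;
        pm->launches++;
        pm->requested = false;
    }
}

/* Old behavior: a request that arrives while the PID is set is dropped. */
static void
toy_on_start_request_old(ToyPostmaster *pm)
{
    if (pm->walreceiver_pid == 0)
    {
        pm->walreceiver_pid = pm->next_pid++;
        pm->launches++;
    }
}

/* New behavior: remember the request first, then try to act on it. */
static void
toy_on_start_request_new(ToyPostmaster *pm)
{
    pm->requested = true;
    toy_maybe_start(pm);
}

/* The dead child is finally noticed (the SIGCHLD side of the race). */
static void
toy_on_child_exit(ToyPostmaster *pm)
{
    pm->walreceiver_pid = 0;
}

/* One pass of the main loop: retry any request still pending. */
static void
toy_server_loop_iteration(ToyPostmaster *pm)
{
    if (pm->requested)
        toy_maybe_start(pm);
}

/*
 * Replay the problematic ordering and return the number of launches.
 * PID 100 stands for the walreceiver that has died but is not yet reaped.
 */
static int
toy_replay_race(bool use_fix)
{
    ToyPostmaster pm = {.walreceiver_pid = 100, .requested = false,
                        .next_pid = 101, .launches = 0};

    if (use_fix)
        toy_on_start_request_new(&pm);
    else
        toy_on_start_request_old(&pm);
    toy_on_child_exit(&pm);
    toy_server_loop_iteration(&pm);

    assert(pm.launches <= 1);   /* this replay can never launch twice */
    return pm.launches;
}
```

Calling `toy_replay_race(false)` models the dropped request (no launch until some later retry), while `toy_replay_race(true)` launches on the next loop pass because the request was remembered.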
1 parent f6af9c7 commit a4d1ce0

1 file changed: +32 -7 lines changed
src/backend/postmaster/postmaster.c

@@ -354,6 +354,9 @@ static volatile sig_atomic_t start_autovac_launcher = false;
 /* the launcher needs to be signalled to communicate some condition */
 static volatile bool avlauncher_needs_signal = false;
 
+/* received START_WALRECEIVER signal */
+static volatile sig_atomic_t WalReceiverRequested = false;
+
 /* set when there's a worker that needs to be started up */
 static volatile bool StartWorkerNeeded = true;
 static volatile bool HaveCrashedWorker = false;
@@ -417,6 +420,7 @@ static void maybe_start_bgworkers(void);
 static bool CreateOptsFile(int argc, char *argv[], char *fullprogname);
 static pid_t StartChildProcess(AuxProcType type);
 static void StartAutovacuumWorker(void);
+static void MaybeStartWalReceiver(void);
 static void InitPostmasterDeathWatchHandle(void);
 
 /*
@@ -1783,6 +1787,10 @@ ServerLoop(void)
             kill(AutoVacPID, SIGUSR2);
         }
 
+        /* If we need to start a WAL receiver, try to do that now */
+        if (WalReceiverRequested)
+            MaybeStartWalReceiver();
+
         /* Get other worker processes running, if needed */
         if (StartWorkerNeeded || HaveCrashedWorker)
             maybe_start_bgworkers();
@@ -2923,7 +2931,8 @@ reaper(SIGNAL_ARGS)
         /*
          * Was it the wal receiver? If exit status is zero (normal) or one
          * (FATAL exit), we assume everything is all right just like normal
-         * backends.
+         * backends. (If we need a new wal receiver, we'll start one at the
+         * next iteration of the postmaster's main loop.)
          */
         if (pid == WalReceiverPID)
         {
@@ -5011,14 +5020,12 @@ sigusr1_handler(SIGNAL_ARGS)
         StartAutovacuumWorker();
     }
 
-    if (CheckPostmasterSignal(PMSIGNAL_START_WALRECEIVER) &&
-        WalReceiverPID == 0 &&
-        (pmState == PM_STARTUP || pmState == PM_RECOVERY ||
-         pmState == PM_HOT_STANDBY || pmState == PM_WAIT_READONLY) &&
-        Shutdown == NoShutdown)
+    if (CheckPostmasterSignal(PMSIGNAL_START_WALRECEIVER))
     {
         /* Startup Process wants us to start the walreceiver process. */
-        WalReceiverPID = StartWalReceiver();
+        /* Start immediately if possible, else remember request for later. */
+        WalReceiverRequested = true;
+        MaybeStartWalReceiver();
     }
 
     if (CheckPostmasterSignal(PMSIGNAL_ADVANCE_STATE_MACHINE) &&
@@ -5369,6 +5376,24 @@ StartAutovacuumWorker(void)
     }
 }
 
+/*
+ * MaybeStartWalReceiver
+ *      Start the WAL receiver process, if not running and our state allows.
+ */
+static void
+MaybeStartWalReceiver(void)
+{
+    if (WalReceiverPID == 0 &&
+        (pmState == PM_STARTUP || pmState == PM_RECOVERY ||
+         pmState == PM_HOT_STANDBY || pmState == PM_WAIT_READONLY) &&
+        Shutdown == NoShutdown)
+    {
+        WalReceiverPID = StartWalReceiver();
+        WalReceiverRequested = false;
+    }
+}
+
+
 /*
  * Create the opts file
  */
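
The diff declares the new flag as `volatile sig_atomic_t` and only clears it once a launch actually happens. That pattern can be exercised in isolation with a real signal. The sketch below is an assumption-laden illustration rather than postmaster code: all `demo_*` names are hypothetical, the "launch" is just a counter, and unlike the patched `sigusr1_handler()` the handler here only records the request and leaves all work to the loop.

```c
#define _POSIX_C_SOURCE 200809L   /* for SIGUSR1 under strict ISO C modes */

#include <assert.h>
#include <signal.h>
#include <stdbool.h>

/* Hypothetical stand-in for WalReceiverRequested. */
static volatile sig_atomic_t demo_request_pending = 0;

/* Counts simulated launches; touched only outside the handler. */
static int demo_launch_count = 0;

/* The handler only records that a request arrived. */
static void
demo_sigusr1_handler(int signo)
{
    (void) signo;
    demo_request_pending = 1;
}

/*
 * One main-loop pass. The request is consumed only when the (simulated)
 * state allows a launch; otherwise it stays pending for a later pass.
 */
static bool
demo_loop_iteration(bool can_start_now)
{
    if (demo_request_pending && can_start_now)
    {
        demo_launch_count++;
        demo_request_pending = 0;
        return true;
    }
    return false;
}

/* Deliver one request, then run three loop passes; return launch count. */
static int
demo_run(void)
{
    demo_request_pending = 0;
    demo_launch_count = 0;

    if (signal(SIGUSR1, demo_sigusr1_handler) == SIG_ERR)
        return -1;

    raise(SIGUSR1);                 /* handler has run once raise() returns */

    demo_loop_iteration(false);     /* launch not allowed: request is kept */
    assert(demo_request_pending == 1);

    demo_loop_iteration(true);      /* launch allowed: act and clear flag */
    demo_loop_iteration(true);      /* nothing pending: no second launch */

    return demo_launch_count;
}
```

In this sketch `demo_run()` yields exactly one launch: the first pass cannot act but does not forget the request, which is the property the commit adds.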
