Commit 48913db

In immediate shutdown, postmaster should not exit till children are gone.

This adjusts commit 82233ce so that the postmaster does not exit until all its child processes have exited, even if the 5-second timeout elapses and we have to send SIGKILL. There is no great value in having the postmaster process quit sooner, and doing so can mislead onlookers into thinking that the cluster is fully terminated when actually some child processes still survive. This effect might explain recent test failures on buildfarm member hamster, wherein we failed to restart a cluster just after shutting it down with "pg_ctl stop -m immediate".

I also did a bit of code review/beautification, including fixing a faulty use of the Max() macro on a volatile expression.

Back-patch to 9.4. In older branches, the postmaster never waited for children to exit during immediate shutdowns, and changing that would be too much of a behavioral change.

1 parent da1a9d0 commit 48913db

2 files changed, +17 -19 lines changed

doc/src/sgml/runtime.sgml

+5 -4
@@ -1441,10 +1441,11 @@ $ <userinput>sysctl -w vm.nr_hugepages=3170</userinput>
      <para>
       This is the <firstterm>Immediate Shutdown</firstterm> mode.
       The server will send <systemitem>SIGQUIT</systemitem> to all child
-      processes and wait for them to terminate. Those that don't terminate
-      within 5 seconds, will be sent <systemitem>SIGKILL</systemitem> by the
-      master <command>postgres</command> process, which will then terminate
-      without further waiting. This will lead to recovery (by
+      processes and wait for them to terminate. If any do not terminate
+      within 5 seconds, they will be sent <systemitem>SIGKILL</systemitem>.
+      The master server process exits as soon as all child processes have
+      exited, without doing normal database shutdown processing.
+      This will lead to recovery (by
       replaying the WAL log) upon next start-up. This is recommended
       only in emergencies.
      </para>
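
The revised paragraph describes a sequence: SIGQUIT to every child, SIGKILL to stragglers after the timeout, and no exit until every child is gone. A minimal POSIX sketch of that sequence follows. It is not the postmaster's code: `spawn_child`, `immediate_shutdown`, and the `grace_secs` parameter are hypothetical names made up for illustration, and the polling loop is an assumption chosen for brevity.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stdbool.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: fork a child that sleeps until signalled.
 * With ignore_sigquit set, only SIGKILL can stop the child. */
static pid_t spawn_child(bool ignore_sigquit)
{
    /* Set the disposition before fork() so the child inherits it race-free. */
    void (*old)(int) = signal(SIGQUIT, ignore_sigquit ? SIG_IGN : SIG_DFL);
    pid_t pid = fork();

    if (pid == 0)
    {
        for (;;)
            pause();
    }
    signal(SIGQUIT, old);       /* restore the parent's disposition */
    return pid;
}

/* Hypothetical helper: returns how many children needed SIGKILL.
 * Reaped entries in pids[] are set to 0. Returns only when all are gone. */
static int immediate_shutdown(pid_t *pids, int n, int grace_secs)
{
    struct timespec nap = {0, 10 * 1000 * 1000};   /* 10 ms poll interval */
    time_t start = time(NULL);
    bool   kill_sent = false;
    int    remaining = 0;
    int    sigkilled = 0;

    assert(pids != NULL || n == 0);

    for (int i = 0; i < n; i++)
    {
        if (pids[i] > 0)        /* never pass 0 or -1 to kill() */
        {
            kill(pids[i], SIGQUIT);
            remaining++;
        }
    }

    while (remaining > 0)
    {
        for (int i = 0; i < n; i++)
        {
            /* nonzero result: child reaped, or no longer ours to wait for */
            if (pids[i] > 0 && waitpid(pids[i], NULL, WNOHANG) != 0)
            {
                pids[i] = 0;
                remaining--;
            }
        }
        if (remaining > 0 && !kill_sent && time(NULL) - start >= grace_secs)
        {
            for (int i = 0; i < n; i++)
            {
                if (pids[i] > 0)
                {
                    kill(pids[i], SIGKILL);
                    sigkilled++;
                }
            }
            kill_sent = true;   /* keep waiting; do not exit yet */
        }
        if (remaining > 0)
            nanosleep(&nap, NULL);
    }
    return sigkilled;
}
```

Calling `immediate_shutdown(pids, n, 5)` on children created with `spawn_child` mirrors the documented 5-second timeout. The point of the sketch is that the function returns only after the last child has been reaped, even when SIGKILL was needed.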

src/backend/postmaster/postmaster.c

+12 -15
@@ -324,8 +324,10 @@ typedef enum
 
 static PMState pmState = PM_INIT;
 
-/* Start time of abort processing at immediate shutdown or child crash */
-static time_t AbortStartTime;
+/* Start time of SIGKILL timeout during immediate shutdown or child crash */
+/* Zero means timeout is not running */
+static time_t AbortStartTime = 0;
+/* Length of said timeout */
 #define SIGKILL_CHILDREN_AFTER_SECS 5
 
 static bool ReachedNormalRunning = false; /* T if we've reached PM_RUN */
@@ -1419,7 +1421,8 @@ checkDataDir(void)
  * In normal conditions we wait at most one minute, to ensure that the other
  * background tasks handled by ServerLoop get done even when no requests are
  * arriving. However, if there are background workers waiting to be started,
- * we don't actually sleep so that they are quickly serviced.
+ * we don't actually sleep so that they are quickly serviced. Other exception
+ * cases are as shown in the code.
  */
 static void
 DetermineSleepTime(struct timeval * timeout)
@@ -1433,11 +1436,12 @@ DetermineSleepTime(struct timeval * timeout)
     if (Shutdown > NoShutdown ||
         (!StartWorkerNeeded && !HaveCrashedWorker))
     {
-        if (AbortStartTime > 0)
+        if (AbortStartTime != 0)
         {
             /* time left to abort; clamp to 0 in case it already expired */
-            timeout->tv_sec = Max(SIGKILL_CHILDREN_AFTER_SECS -
-                                  (time(NULL) - AbortStartTime), 0);
+            timeout->tv_sec = SIGKILL_CHILDREN_AFTER_SECS -
+                (time(NULL) - AbortStartTime);
+            timeout->tv_sec = Max(timeout->tv_sec, 0);
             timeout->tv_usec = 0;
         }
         else
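
The DetermineSleepTime hunk above is the "faulty use of the Max() macro on a volatile expression" mentioned in the commit message. A two-argument macro of this shape can expand its argument twice, so a clock read inside it may produce different values for the comparison and for the result. The sketch below reproduces that effect under stated assumptions: `fake_time`, `fake_now`, `fake_step`, `remaining_old`, and `remaining_new` are hypothetical stand-ins, and the fake clock advances by a configurable step on each read so the outcome is deterministic.

```c
#include <assert.h>

/* Classic two-argument macro shape: the chosen argument is evaluated twice. */
#define Max(x, y) ((x) > (y) ? (x) : (y))
#define SIGKILL_CHILDREN_AFTER_SECS 5

/* Hypothetical stand-in for time(NULL): advances by fake_step on every read. */
static long fake_now;
static long fake_step;
static int  clock_reads;

static long fake_time(void)
{
    long t = fake_now;

    fake_now += fake_step;
    clock_reads++;
    return t;
}

/* Pre-patch shape: the clock read sits inside the macro argument. */
static long remaining_old(long start)
{
    return Max(SIGKILL_CHILDREN_AFTER_SECS - (fake_time() - start), 0);
}

/* Post-patch shape: read the clock once, then clamp the stored value. */
static long remaining_new(long start)
{
    long secs = SIGKILL_CHILDREN_AFTER_SECS - (fake_time() - start);

    secs = Max(secs, 0);
    assert(secs >= 0);          /* the clamp now holds unconditionally */
    return secs;
}
```

With the fake clock at 4, a step of 2, and a start time of 0, `remaining_old` compares using the first read (1 second left) but returns the value from the second read (-1), so the clamp to zero is bypassed. `remaining_new` reads the clock once and returns 1.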
@@ -1707,20 +1711,13 @@ ServerLoop(void)
          * Note we also do this during recovery from a process crash.
          */
         if ((Shutdown >= ImmediateShutdown || (FatalError && !SendStop)) &&
-            AbortStartTime > 0 &&
-            now - AbortStartTime >= SIGKILL_CHILDREN_AFTER_SECS)
+            AbortStartTime != 0 &&
+            (now - AbortStartTime) >= SIGKILL_CHILDREN_AFTER_SECS)
         {
             /* We were gentle with them before. Not anymore */
             TerminateChildren(SIGKILL);
             /* reset flag so we don't SIGKILL again */
             AbortStartTime = 0;
-
-            /*
-             * Additionally, unless we're recovering from a process crash,
-             * it's now the time for postmaster to abandon ship.
-             */
-            if (!FatalError)
-                ExitPostmaster(1);
         }
     }
 }
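
The net effect of the ServerLoop change can be modeled as a small decision function: once the timeout expires, SIGKILL is sent a single time, the timer is disarmed by resetting it to zero, and exit is tied only to the child count reaching zero. The sketch below is an illustrative model, not the postmaster's structure. `ShutdownState`, `Action`, and `shutdown_step` are hypothetical names, and folding the exit decision into the same function is a simplification made for this example.

```c
#include <assert.h>
#include <stddef.h>

#define SIGKILL_CHILDREN_AFTER_SECS 5

/* Hypothetical action codes for one pass of the loop. */
typedef enum { ACT_WAIT, ACT_SIGKILL, ACT_EXIT } Action;

/* Hypothetical state; abort_start mirrors AbortStartTime (0 = timer off). */
typedef struct
{
    long abort_start;
    int  live_children;
} ShutdownState;

static Action shutdown_step(ShutdownState *st, long now)
{
    assert(st != NULL);

    /* Exit depends only on the children being gone, never on the timer. */
    if (st->live_children == 0)
        return ACT_EXIT;

    if (st->abort_start != 0 &&
        (now - st->abort_start) >= SIGKILL_CHILDREN_AFTER_SECS)
    {
        st->abort_start = 0;    /* disarm so SIGKILL is sent only once */
        return ACT_SIGKILL;
    }

    return ACT_WAIT;
}
```

Starting from `{100, 2}`, a step at time 102 waits, a step at 105 requests SIGKILL and clears the timer, and a step at 106 still waits because two children remain. Only after `live_children` drops to 0 does the function report that it is safe to exit, which is the behavior the removed `ExitPostmaster(1)` call used to short-circuit.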
