
Commit 04e9dc6

Fix bug where walsender goes into a busy loop if connection is terminated.

The problem was that ResetLatch was not being called in the walsender loop if the connection was terminated, so WaitLatch never sleeps until the terminated connection is detected. In the master branch, this was already fixed as a side effect of some refactoring of the loop. This commit backports that refactoring to 9.1. 9.0 does not have this bug, because we didn't use latches back then.

Fujii Masao
1 parent 52b03fb commit 04e9dc6
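
The failure mode described in the commit message can be illustrated with a toy model. This is only a sketch under stated assumptions: `ToyLatch`, `toy_set`, `toy_reset`, `toy_wait`, and `count_sleeps` are hypothetical names invented for illustration and are not part of PostgreSQL's latch API, and the toy wait does not actually block. It only shows why a latch that is set once and then reset on just one code path makes every later wait return immediately, while an unconditional reset at the top of the loop lets the wait sleep.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a latch: a flag that a wakeup would set. */
typedef struct
{
    bool is_set;
} ToyLatch;

static void toy_set(ToyLatch *l)   { l->is_set = true; }
static void toy_reset(ToyLatch *l) { l->is_set = false; }

/*
 * Toy wait: reports whether the caller would really sleep. If the latch
 * is already set, a wait returns immediately, so this returns false.
 */
static bool toy_wait(const ToyLatch *l)
{
    assert(l != NULL);
    return !l->is_set;
}

/*
 * Run `iterations` passes of a loop and count how many waits slept.
 *
 * reset_every_pass = true  models the refactored loop: the latch is
 *                          cleared unconditionally at the top.
 * reset_every_pass = false models the older shape: the latch is cleared
 *                          only when reset_path_taken is true (an assumed
 *                          stand-in for "caught up and nothing pending").
 */
int count_sleeps(bool reset_every_pass, bool reset_path_taken, int iterations)
{
    ToyLatch latch = { false };
    int sleeps = 0;

    toy_set(&latch);            /* one wakeup arrives before the loop */

    for (int i = 0; i < iterations; i++)
    {
        if (reset_every_pass || reset_path_taken)
            toy_reset(&latch);

        if (toy_wait(&latch))
            sleeps++;           /* would have slept */
        /* otherwise the wait returned at once: one busy-loop pass */
    }
    return sleeps;
}
```

In this model, `count_sleeps(false, false, n)` never sleeps, which corresponds to the busy loop, while `count_sleeps(true, false, n)` sleeps on every pass.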

src/backend/replication/walsender.c

Lines changed: 60 additions & 50 deletions
@@ -476,7 +476,7 @@ ProcessRepliesIfAny(void)
 {
     unsigned char firstchar;
     int r;
-    int received = false;
+    bool received = false;
 
     for (;;)
     {
@@ -700,6 +700,9 @@ WalSndLoop(void)
     /* Loop forever, unless we get an error */
     for (;;)
     {
+        /* Clear any already-pending wakeups */
+        ResetLatch(&MyWalSnd->latch);
+
         /*
          * Emergency bailout if postmaster has died. This is to avoid the
          * necessity for manual cleanup of all postmaster children.
@@ -718,60 +721,81 @@ WalSndLoop(void)
         /* Normal exit from the walsender is here */
         if (walsender_shutdown_requested)
         {
-            /* Inform the standby that XLOG streaming was done */
+            /* Inform the standby that XLOG streaming is done */
             pq_puttextmessage('C', "COPY 0");
             pq_flush();
 
             proc_exit(0);
         }
 
+        /* Check for input from the client */
+        ProcessRepliesIfAny();
+
         /*
          * If we don't have any pending data in the output buffer, try to send
-         * some more.
+         * some more. If there is some, we don't bother to call XLogSend
+         * again until we've flushed it ... but we'd better assume we are not
+         * caught up.
          */
         if (!pq_is_send_pending())
-        {
             XLogSend(output_message, &caughtup);
+        else
+            caughtup = false;
+
+        /* Try to flush pending output to the client */
+        if (pq_flush_if_writable() != 0)
+            break;
 
+        /* If nothing remains to be sent right now ... */
+        if (caughtup && !pq_is_send_pending())
+        {
             /*
-             * Even if we wrote all the WAL that was available when we started
-             * sending, more might have arrived while we were sending this
-             * batch. We had the latch set while sending, so we have not
-             * received any signals from that time. Let's arm the latch again,
-             * and after that check that we're still up-to-date.
+             * If we're in catchup state, move to streaming. This is an
+             * important state change for users to know about, since before
+             * this point data loss might occur if the primary dies and we
+             * need to failover to the standby. The state change is also
+             * important for synchronous replication, since commits that
+             * started to wait at that point might wait for some time.
              */
-            if (caughtup && !pq_is_send_pending())
+            if (MyWalSnd->state == WALSNDSTATE_CATCHUP)
             {
-                ResetLatch(&MyWalSnd->latch);
+                ereport(DEBUG1,
+                        (errmsg("standby \"%s\" has now caught up with primary",
+                                application_name)));
+                WalSndSetState(WALSNDSTATE_STREAMING);
+            }
 
+            /*
+             * When SIGUSR2 arrives, we send any outstanding logs up to the
+             * shutdown checkpoint record (i.e., the latest record) and exit.
+             * This may be a normal termination at shutdown, or a promotion,
+             * the walsender is not sure which.
+             */
+            if (walsender_ready_to_stop)
+            {
+                /* ... let's just be real sure we're caught up ... */
                 XLogSend(output_message, &caughtup);
+                if (caughtup && !pq_is_send_pending())
+                {
+                    walsender_shutdown_requested = true;
+                    continue; /* don't want to wait more */
+                }
             }
         }
 
-        /* Flush pending output to the client */
-        if (pq_flush_if_writable() != 0)
-            break;
-
         /*
-         * When SIGUSR2 arrives, we send any outstanding logs up to the
-         * shutdown checkpoint record (i.e., the latest record) and exit.
+         * We don't block if not caught up, unless there is unsent data
+         * pending in which case we'd better block until the socket is
+         * write-ready. This test is only needed for the case where XLogSend
+         * loaded a subset of the available data but then pq_flush_if_writable
+         * flushed it all --- we should immediately try to send more.
          */
-        if (walsender_ready_to_stop && !pq_is_send_pending())
-        {
-            XLogSend(output_message, &caughtup);
-            ProcessRepliesIfAny();
-            if (caughtup && !pq_is_send_pending())
-                walsender_shutdown_requested = true;
-        }
-
-        if ((caughtup || pq_is_send_pending()) &&
-            !got_SIGHUP &&
-            !walsender_shutdown_requested)
+        if (caughtup || pq_is_send_pending())
         {
             TimestampTz finish_time = 0;
-            long sleeptime;
+            long sleeptime = -1;
 
-            /* Reschedule replication timeout */
+            /* Determine time until replication timeout */
             if (replication_timeout > 0)
             {
                 long secs;
@@ -795,12 +819,16 @@ WalSndLoop(void)
                     sleeptime = WalSndDelay;
             }
 
-            /* Sleep */
+            /* Sleep until something happens or replication timeout */
             WaitLatchOrSocket(&MyWalSnd->latch, MyProcPort->sock,
                               true, pq_is_send_pending(),
                               sleeptime);
 
-            /* Check for replication timeout */
+            /*
+             * Check for replication timeout. Note we ignore the corner case
+             * possibility that the client replied just as we reached the
+             * timeout ... he's supposed to reply *before* that.
+             */
             if (replication_timeout > 0 &&
                 GetCurrentTimestamp() >= finish_time)
             {
@@ -814,24 +842,6 @@ WalSndLoop(void)
                 break;
             }
         }
-
-        /*
-         * If we're in catchup state, see if its time to move to streaming.
-         * This is an important state change for users, since before this
-         * point data loss might occur if the primary dies and we need to
-         * failover to the standby. The state change is also important for
-         * synchronous replication, since commits that started to wait at that
-         * point might wait for some time.
-         */
-        if (MyWalSnd->state == WALSNDSTATE_CATCHUP && caughtup)
-        {
-            ereport(DEBUG1,
-                    (errmsg("standby \"%s\" has now caught up with primary",
-                            application_name)));
-            WalSndSetState(WALSNDSTATE_STREAMING);
-        }
-
-        ProcessRepliesIfAny();
     }
 
     /*
