Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
Skip to content

Commit 9dd4178

Browse files
committed
Be more predictable about reporting "lock timeout" vs "statement timeout".
If both timeout indicators are set when we arrive at ProcessInterrupts, we've historically just reported "lock timeout". However, some buildfarm members have been observed to fail isolationtester's timeouts test by reporting "lock timeout" when the statement timeout was expected to fire first. The cause seems to be that the process is allowed to sleep longer than expected (probably due to heavy machine load) so that the lock timeout happens before we reach the point of reporting the error, and then this arbitrary tiebreak rule does the wrong thing. We can improve matters by comparing the scheduled timeout times to decide which error to report. I had originally proposed greatly reducing the 1-second window between the two timeouts in the test cases. On reflection that is a bad idea, at least for the case where the lock timeout is expected to fire first, because that would assume that it takes negligible time to get from statement start to the beginning of the lock wait. Thus, this patch doesn't completely remove the risk of test failures on slow machines. Empirically, however, the case this handles is the one we are seeing in the buildfarm. The explanation may be that the other case requires the scheduler to take the CPU away from a busy process, whereas the case fixed here only requires the scheduler to not give the CPU back right away to a process that has been woken from a multi-second sleep (and, perhaps, has been swapped out meanwhile). Back-patch to 9.3 where the isolationtester timeouts test was added. Discussion: <8693.1464314819@sss.pgh.pa.us>
1 parent d74048d commit 9dd4178

File tree

3 files changed

+35
-5
lines changed

3 files changed

+35
-5
lines changed

src/backend/tcop/postgres.c

+19-4
Original file line numberDiff line numberDiff line change
@@ -2909,6 +2909,9 @@ ProcessInterrupts(void)
29092909

29102910
if (QueryCancelPending)
29112911
{
2912+
bool lock_timeout_occurred;
2913+
bool stmt_timeout_occurred;
2914+
29122915
/*
29132916
* Don't allow query cancel interrupts while reading input from the
29142917
* client, because we might lose sync in the FE/BE protocol. (Die
@@ -2929,17 +2932,29 @@ ProcessInterrupts(void)
29292932

29302933
/*
29312934
* If LOCK_TIMEOUT and STATEMENT_TIMEOUT indicators are both set, we
2932-
* prefer to report the former; but be sure to clear both.
2935+
* need to clear both, so always fetch both.
29332936
*/
2934-
if (get_timeout_indicator(LOCK_TIMEOUT, true))
2937+
lock_timeout_occurred = get_timeout_indicator(LOCK_TIMEOUT, true);
2938+
stmt_timeout_occurred = get_timeout_indicator(STATEMENT_TIMEOUT, true);
2939+
2940+
/*
2941+
* If both were set, we want to report whichever timeout completed
2942+
* earlier; this ensures consistent behavior if the machine is slow
2943+
* enough that the second timeout triggers before we get here. A tie
2944+
* is arbitrarily broken in favor of reporting a lock timeout.
2945+
*/
2946+
if (lock_timeout_occurred && stmt_timeout_occurred &&
2947+
get_timeout_finish_time(STATEMENT_TIMEOUT) < get_timeout_finish_time(LOCK_TIMEOUT))
2948+
lock_timeout_occurred = false; /* report stmt timeout */
2949+
2950+
if (lock_timeout_occurred)
29352951
{
2936-
(void) get_timeout_indicator(STATEMENT_TIMEOUT, true);
29372952
LockErrorCleanup();
29382953
ereport(ERROR,
29392954
(errcode(ERRCODE_LOCK_NOT_AVAILABLE),
29402955
errmsg("canceling statement due to lock timeout")));
29412956
}
2942-
if (get_timeout_indicator(STATEMENT_TIMEOUT, true))
2957+
if (stmt_timeout_occurred)
29432958
{
29442959
LockErrorCleanup();
29452960
ereport(ERROR,

src/backend/utils/misc/timeout.c

+15-1
Original file line numberDiff line numberDiff line change
@@ -34,7 +34,7 @@ typedef struct timeout_params
3434
timeout_handler_proc timeout_handler;
3535

3636
TimestampTz start_time; /* time that timeout was last activated */
37-
TimestampTz fin_time; /* if active, time it is due to fire */
37+
TimestampTz fin_time; /* time it is, or was last, due to fire */
3838
} timeout_params;
3939

4040
/*
@@ -654,3 +654,17 @@ get_timeout_start_time(TimeoutId id)
654654
{
655655
return all_timeouts[id].start_time;
656656
}
657+
658+
/*
659+
* Return the time when the timeout is, or most recently was, due to fire
660+
*
661+
* Note: will return 0 if timeout has never been activated in this process.
662+
* However, we do *not* reset the fin_time when a timeout occurs, so as
663+
* not to create a race condition if SIGALRM fires just as some code is
664+
* about to fetch the value.
665+
*/
666+
TimestampTz
667+
get_timeout_finish_time(TimeoutId id)
668+
{
669+
return all_timeouts[id].fin_time;
670+
}

src/include/utils/timeout.h

+1
Original file line numberDiff line numberDiff line change
@@ -82,5 +82,6 @@ extern void disable_all_timeouts(bool keep_indicators);
8282
/* accessors */
8383
extern bool get_timeout_indicator(TimeoutId id, bool reset_indicator);
8484
extern TimestampTz get_timeout_start_time(TimeoutId id);
85+
extern TimestampTz get_timeout_finish_time(TimeoutId id);
8586

8687
#endif /* TIMEOUT_H */

0 commit comments

Comments
 (0)