Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
Skip to content

Commit 624b6f3

Browse files
committed
Handle elog(FATAL) during ROLLBACK more robustly.
Stress testing by Andreas Seltenreich disclosed longstanding problems that occur if a FATAL exit (e.g. due to receipt of SIGTERM) occurs while we are trying to execute a ROLLBACK of an already-failed transaction. In such a case, xact.c is in TBLOCK_ABORT state, so that AbortOutOfAnyTransaction would skip AbortTransaction and go straight to CleanupTransaction. This led to an assert failure in an assert-enabled build (due to the ROLLBACK's portal still having a cleanup hook) or without assertions, to a FATAL exit complaining about "cannot drop active portal". The latter's not disastrous, perhaps, but it's messy enough to want to improve it. We don't really want to run all of AbortTransaction in this code path. The minimum required to clean up the open portal safely is to do AtAbort_Memory and AtAbort_Portals. It seems like a good idea to do AtAbort_Memory unconditionally, to be entirely sure that we are starting with a safe CurrentMemoryContext. That means that if the main loop in AbortOutOfAnyTransaction does nothing, we need an extra step at the bottom to restore CurrentMemoryContext = TopMemoryContext, which I chose to do by invoking AtCleanup_Memory. This'll result in calling AtCleanup_Memory twice in many of the paths through this function, but that seems harmless and reasonably inexpensive. The original motivation for the assertion in AtCleanup_Portals was that we wanted to be sure that any user-defined code executed as a consequence of the cleanup hook runs during AbortTransaction not CleanupTransaction. That still seems like a valid concern, and now that we've seen one case of the assertion firing --- which means that exactly that would have happened in a production build --- let's replace the Assert with a runtime check. If we see the cleanup hook still set, we'll emit a WARNING and just drop the hook unexecuted. This has been like this a long time, so back-patch to all supported branches. Discussion: https://postgr.es/m/877ey7bmun.fsf@ansel.ydns.eu
1 parent 3883be3 commit 624b6f3

File tree

2 files changed

+44
-9
lines changed

2 files changed

+44
-9
lines changed

src/backend/access/transam/xact.c

Lines changed: 22 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -4208,6 +4208,9 @@ AbortOutOfAnyTransaction(void)
42084208
{
42094209
TransactionState s = CurrentTransactionState;
42104210

4211+
/* Ensure we're not running in a doomed memory context */
4212+
AtAbort_Memory();
4213+
42114214
/*
42124215
* Get out of any transaction or nested transaction
42134216
*/
@@ -4249,7 +4252,14 @@ AbortOutOfAnyTransaction(void)
42494252
break;
42504253
case TBLOCK_ABORT:
42514254
case TBLOCK_ABORT_END:
4252-
/* AbortTransaction already done, still need Cleanup */
4255+
4256+
/*
4257+
* AbortTransaction is already done, still need Cleanup.
4258+
* However, if we failed partway through running ROLLBACK,
4259+
* there will be an active portal running that command, which
4260+
* we need to shut down before doing CleanupTransaction.
4261+
*/
4262+
AtAbort_Portals();
42534263
CleanupTransaction();
42544264
s->blockState = TBLOCK_DEFAULT;
42554265
break;
@@ -4272,6 +4282,14 @@ AbortOutOfAnyTransaction(void)
42724282
case TBLOCK_SUBABORT_END:
42734283
case TBLOCK_SUBABORT_RESTART:
42744284
/* As above, but AbortSubTransaction already done */
4285+
if (s->curTransactionOwner)
4286+
{
4287+
/* As in TBLOCK_ABORT, might have a live portal to zap */
4288+
AtSubAbort_Portals(s->subTransactionId,
4289+
s->parent->subTransactionId,
4290+
s->curTransactionOwner,
4291+
s->parent->curTransactionOwner);
4292+
}
42754293
CleanupSubTransaction();
42764294
s = CurrentTransactionState; /* changed by pop */
42774295
break;
@@ -4280,6 +4298,9 @@ AbortOutOfAnyTransaction(void)
42804298

42814299
/* Should be out of all subxacts now */
42824300
Assert(s->parent == NULL);
4301+
4302+
/* If we didn't actually have anything to do, revert to TopMemoryContext */
4303+
AtCleanup_Memory();
42834304
}
42844305

42854306
/*

src/backend/utils/mmgr/portalmem.c

Lines changed: 22 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -436,8 +436,8 @@ MarkPortalDone(Portal portal)
436436
* well do that now, since the portal can't be executed any more.
437437
*
438438
* In some cases involving execution of a ROLLBACK command in an already
439-
* aborted transaction, this prevents an assertion failure caused by
440-
* reaching AtCleanup_Portals with the cleanup hook still unexecuted.
439+
* aborted transaction, this is necessary, or we'd reach AtCleanup_Portals
440+
* with the cleanup hook still unexecuted.
441441
*/
442442
if (PointerIsValid(portal->cleanup))
443443
{
@@ -464,8 +464,8 @@ MarkPortalFailed(Portal portal)
464464
* well do that now, since the portal can't be executed any more.
465465
*
466466
* In some cases involving cleanup of an already aborted transaction, this
467-
* prevents an assertion failure caused by reaching AtCleanup_Portals with
468-
* the cleanup hook still unexecuted.
467+
* is necessary, or we'd reach AtCleanup_Portals with the cleanup hook
468+
* still unexecuted.
469469
*/
470470
if (PointerIsValid(portal->cleanup))
471471
{
@@ -863,8 +863,15 @@ AtCleanup_Portals(void)
863863
if (portal->portalPinned)
864864
portal->portalPinned = false;
865865

866-
/* We had better not be calling any user-defined code here */
867-
Assert(portal->cleanup == NULL);
866+
/*
867+
* We had better not call any user-defined code during cleanup, so if
868+
* the cleanup hook hasn't been run yet, too bad; we'll just skip it.
869+
*/
870+
if (PointerIsValid(portal->cleanup))
871+
{
872+
elog(WARNING, "skipping cleanup for portal \"%s\"", portal->name);
873+
portal->cleanup = NULL;
874+
}
868875

869876
/* Zap it. */
870877
PortalDrop(portal, false);
@@ -1047,8 +1054,15 @@ AtSubCleanup_Portals(SubTransactionId mySubid)
10471054
if (portal->portalPinned)
10481055
portal->portalPinned = false;
10491056

1050-
/* We had better not be calling any user-defined code here */
1051-
Assert(portal->cleanup == NULL);
1057+
/*
1058+
* We had better not call any user-defined code during cleanup, so if
1059+
* the cleanup hook hasn't been run yet, too bad; we'll just skip it.
1060+
*/
1061+
if (PointerIsValid(portal->cleanup))
1062+
{
1063+
elog(WARNING, "skipping cleanup for portal \"%s\"", portal->name);
1064+
portal->cleanup = NULL;
1065+
}
10521066

10531067
/* Zap it. */
10541068
PortalDrop(portal, false);

0 commit comments

Comments
 (0)