
Commit 91c0570

Don't fail for > 1 walsenders in 019_replslot_limit, add debug messages.
So far the first of the retries introduced in f28bf66 has resolved the issue, but I (Andres) am still suspicious that the onset of these failures might indicate a real problem. To reduce noise, stop reporting a test failure when a retry resolves the problem.

To allow figuring out what makes the slot drop slow, add a few more debug messages to ReplicationSlotDropPtr.

See also commits afdeff1, fe0972e and f28bf66.

Discussion: https://postgr.es/m/20220327213219.smdvfkq2fl74flow@alap3.anarazel.de
1 parent da4b566 commit 91c0570
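
The fix follows a retry-then-assert pattern: intermediate iterations only emit diagnostics, and the test outcome is decided by a single check after the loop, so a transient condition no longer fails the test. A minimal sketch of that pattern, assuming Test::More and Time::HiRes as used elsewhere in the recovery tests; query_walsender_pids() is a hypothetical stand-in for the pg_stat_activity query, not part of this commit:

use Test::More;
use Time::HiRes qw(usleep);

my $pids;
foreach my $i (1 .. 20)
{
	# hypothetical helper standing in for the pg_stat_activity query
	$pids = query_walsender_pids();
	# exactly one pid: stop retrying
	last if $pids =~ /^[0-9]+$/;
	diag "multiple walsenders active in iteration $i";
	usleep(100_000);
}
# only this final check decides pass or fail
like($pids, qr/^[0-9]+$/, "have walsender pid $pids");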

2 files changed: 16 additions & 3 deletions

src/backend/replication/slot.c

Lines changed: 9 additions & 0 deletions
@@ -702,15 +702,22 @@ ReplicationSlotDropPtr(ReplicationSlot *slot)
 	slot->active_pid = 0;
 	slot->in_use = false;
 	LWLockRelease(ReplicationSlotControlLock);
+
+	elog(DEBUG3, "replication slot drop: %s: marked as not in use", NameStr(slot->data.name));
+
 	ConditionVariableBroadcast(&slot->active_cv);
 
+	elog(DEBUG3, "replication slot drop: %s: notified others", NameStr(slot->data.name));
+
 	/*
 	 * Slot is dead and doesn't prevent resource removal anymore, recompute
 	 * limits.
 	 */
 	ReplicationSlotsComputeRequiredXmin(false);
 	ReplicationSlotsComputeRequiredLSN();
 
+	elog(DEBUG3, "replication slot drop: %s: computed required", NameStr(slot->data.name));
+
 	/*
 	 * If removing the directory fails, the worst thing that will happen is
 	 * that the user won't be able to create a new slot with the same name
@@ -720,6 +727,8 @@ ReplicationSlotDropPtr(ReplicationSlot *slot)
 		ereport(WARNING,
 				(errmsg("could not remove directory \"%s\"", tmppath)));
 
+	elog(DEBUG3, "replication slot drop: %s: removed directory", NameStr(slot->data.name));
+
 	/*
 	 * Send a message to drop the replication slot to the stats collector.
 	 * Since there is no guarantee of the order of message transfer on a UDP
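
The new messages are emitted at DEBUG3, so they reach the server log only when log_min_messages is at least that verbose. A minimal sketch of how a TAP test could surface and inspect them, assuming the append_conf, restart and logfile methods of the PostgreSQL test node API plus slurp_file from PostgreSQL::Test::Utils; whether 019_replslot_limit.pl already configures this is not shown in this excerpt:

# illustrative sketch, not part of this commit; assumes the usual TAP preamble
# (use PostgreSQL::Test::Cluster; use PostgreSQL::Test::Utils; use Test::More;)
$node_primary3->append_conf('postgresql.conf', 'log_min_messages = debug3');
$node_primary3->restart;

# ... run the workload that drops the replication slot ...

my $log = PostgreSQL::Test::Utils::slurp_file($node_primary3->logfile);
diag "slot drop progress:\n" . join("\n", $log =~ /^.*replication slot drop:.*$/mg);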

src/test/recovery/t/019_replslot_limit.pl

Lines changed: 7 additions & 3 deletions
@@ -339,8 +339,8 @@
 # We've seen occasional cases where multiple walsender pids are active. It
 # could be that we're just observing process shutdown being slow. To collect
 # more information, retry a couple times, print a bit of debugging information
-# each iteration. For now report a test failure even if later iterations
-# succeed.
+# each iteration. Don't fail the test if retries find just one pid, the
+# buildfarm failures are too noisy.
 my $i = 0;
 while (1)
 {
@@ -349,7 +349,9 @@
 	$senderpid = $node_primary3->safe_psql('postgres',
 		"SELECT pid FROM pg_stat_activity WHERE backend_type = 'walsender'");
 
-	last if like($senderpid, qr/^[0-9]+$/, "have walsender pid $senderpid");
+	last if $senderpid =~ qr/^[0-9]+$/;
+
+	diag "multiple walsenders active in iteration $i";
 
 	# show information about all active connections
 	$node_primary3->psql('postgres',
@@ -370,6 +372,8 @@
 	usleep(100_000);
 }
 
+like($senderpid, qr/^[0-9]+$/, "have walsender pid $senderpid");
+
 my $receiverpid = $node_standby3->safe_psql('postgres',
 	"SELECT pid FROM pg_stat_activity WHERE backend_type = 'walreceiver'");
 like($receiverpid, qr/^[0-9]+$/, "have walreceiver pid $receiverpid");
