Commit 6cf1647

Fix header check for continuation records where standbys could be stuck

XLogPageRead() checks immediately for an invalid WAL record header on a standby, to be able to handle the case of continuation records that need to be read across two different sources. As written, the check was too generic, applying to any target LSN. Based on an analysis by Kyotaro Horiguchi, what really matters is to make sure that the page header is checked when attempting to read an LSN at the boundary of a segment, to handle the case of a continuation record that spans multiple pages across multiple segments, because WAL receivers, when spawned, request WAL from the beginning of a segment. This fix has been proposed by Kyotaro Horiguchi.

This could cause standbys to loop infinitely when dealing with a continuation record during a timeline jump, in the case where the contents of the record in the follow-up page are invalid.

Some regression tests are added to check such scenarios, able to reproduce the original problem. In the test, the contents of a continuation record are overwritten with junk zeros on its follow-up page, and replayed on standbys. This is inspired by 039_end_of_wal.pl, and is enough to show how standbys should react on promotion by not being stuck. Without the fix, the test would fail with a timeout. The test to reproduce the problem has been written by Alexander Kukushkin.

The original check has been introduced in 0668719, for a similar problem.

Author: Kyotaro Horiguchi, Alexander Kukushkin
Reviewed-by: Michael Paquier
Discussion: https://postgr.es/m/CAFh8B=mozC+e1wGJq0H=0O65goZju+6ab5AU7DEWCSUA2OtwDg@mail.gmail.com
Backpatch-through: 13

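The narrowing described in the commit message can be illustrated with a small standalone sketch. This is not the PostgreSQL code: `WalPtr`, `page_starts_segment()` and `needs_early_header_check()` are hypothetical names chosen for illustration, and the sketch only models the added condition, assuming a page address and a segment size expressed in bytes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative stand-in for a WAL byte position (hypothetical type name). */
typedef uint64_t WalPtr;

/*
 * Hypothetical helper mirroring the condition added by the commit:
 * true only when the page starts exactly at a segment boundary.
 */
static bool
page_starts_segment(WalPtr target_page_ptr, uint64_t wal_segment_size)
{
	assert(wal_segment_size > 0);
	return (target_page_ptr % wal_segment_size) == 0;
}

/*
 * Sketch of the gating logic: the early page header validation is only
 * attempted in standby mode and only for the first page of a segment.
 * The validation itself is not modeled here.
 */
static bool
needs_early_header_check(bool standby_mode, WalPtr target_page_ptr,
						 uint64_t wal_segment_size)
{
	return standby_mode &&
		page_starts_segment(target_page_ptr, wal_segment_size);
}
```

With a 16 MB segment size, for instance, a page at byte 33554432 passes the boundary test, while the page 8192 bytes later does not and would be left to the regular validation path.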
1 parent 23d7562 commit 6cf1647

3 files changed: +162 -6 lines changed
src/backend/access/transam/xlogrecovery.c

Lines changed: 7 additions & 6 deletions
@@ -3438,12 +3438,12 @@ XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr, int reqLen,
 	 * validates the page header anyway, and would propagate the failure up to
 	 * ReadRecord(), which would retry. However, there's a corner case with
 	 * continuation records, if a record is split across two pages such that
-	 * we would need to read the two pages from different sources. For
-	 * example, imagine a scenario where a streaming replica is started up,
-	 * and replay reaches a record that's split across two WAL segments. The
-	 * first page is only available locally, in pg_wal, because it's already
-	 * been recycled on the primary. The second page, however, is not present
-	 * in pg_wal, and we should stream it from the primary. There is a
+	 * we would need to read the two pages from different sources across two
+	 * WAL segments.
+	 *
+	 * The first page is only available locally, in pg_wal, because it's
+	 * already been recycled on the primary. The second page, however, is not
+	 * present in pg_wal, and we should stream it from the primary. There is a
 	 * recycled WAL segment present in pg_wal, with garbage contents, however.
 	 * We would read the first page from the local WAL segment, but when
 	 * reading the second page, we would read the bogus, recycled, WAL
@@ -3465,6 +3465,7 @@ XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr, int reqLen,
 	 * responsible for the validation.
 	 */
 	if (StandbyMode &&
+		(targetPagePtr % wal_segment_size) == 0 &&
 		!XLogReaderValidatePageHeader(xlogreader, targetPagePtr, readBuf))
 	{
 		/*

src/test/recovery/meson.build

Lines changed: 1 addition & 0 deletions
@@ -51,6 +51,7 @@ tests += {
       't/040_standby_failover_slots_sync.pl',
       't/041_checkpoint_at_promote.pl',
       't/042_low_level_backup.pl',
+      't/043_no_contrecord_switch.pl',
     ],
   },
 }

src/test/recovery/t/043_no_contrecord_switch.pl

Lines changed: 154 additions & 0 deletions
@@ -0,0 +1,154 @@
+# Copyright (c) 2021-2025, PostgreSQL Global Development Group
+
+# Tests for already-propagated WAL segments ending in incomplete WAL records.
+
+use strict;
+use warnings;
+
+use File::Copy;
+use PostgreSQL::Test::Cluster;
+use Test::More;
+use Fcntl qw(SEEK_SET);
+
+use integer;    # causes / operator to use integer math
+
+# Values queried from the server
+my $WAL_SEGMENT_SIZE;
+my $WAL_BLOCK_SIZE;
+my $TLI;
+
+# Build name of a WAL segment, used when filtering the contents of the server
+# logs.
+sub wal_segment_name
+{
+	my $tli = shift;
+	my $segment = shift;
+	return sprintf("%08X%08X%08X", $tli, 0, $segment);
+}
+
+# Calculate from a LSN (in bytes) its segment number and its offset, used
+# when filtering the contents of the server logs.
+sub lsn_to_segment_and_offset
+{
+	my $lsn = shift;
+	return ($lsn / $WAL_SEGMENT_SIZE, $lsn % $WAL_SEGMENT_SIZE);
+}
+
+# Get GUC value, converted to an int.
+sub get_int_setting
+{
+	my $node = shift;
+	my $name = shift;
+	return int(
+		$node->safe_psql(
+			'postgres',
+			"SELECT setting FROM pg_settings WHERE name = '$name'"));
+}
+
+# Find the start of a WAL page, based on an LSN in bytes.
+sub start_of_page
+{
+	my $lsn = shift;
+	return $lsn & ~($WAL_BLOCK_SIZE - 1);
+}
+
+my $primary = PostgreSQL::Test::Cluster->new('primary');
+$primary->init(allows_streaming => 1, has_archiving => 1);
+
+# The configuration is chosen here to minimize the friction with
+# concurrent WAL activity. checkpoint_timeout avoids noise with
+# checkpoint activity, and autovacuum is disabled to avoid any
+# WAL activity generated by it.
+$primary->append_conf(
+	'postgresql.conf', qq(
+autovacuum = off
+checkpoint_timeout = '30min'
+wal_keep_size = 1GB
+));
+
+$primary->start;
+$primary->backup('backup');
+
+$primary->safe_psql('postgres', "CREATE TABLE t AS SELECT 0");
+
+$WAL_SEGMENT_SIZE = get_int_setting($primary, 'wal_segment_size');
+$WAL_BLOCK_SIZE = get_int_setting($primary, 'wal_block_size');
+$TLI = $primary->safe_psql('postgres',
+	"SELECT timeline_id FROM pg_control_checkpoint()");
+
+# Get close to the end of the current WAL page, enough to fit the
+# beginning of a record that spans on two pages, generating a
+# continuation record.
+$primary->emit_wal(0);
+my $end_lsn =
+  $primary->advance_wal_out_of_record_splitting_zone($WAL_BLOCK_SIZE);
+
+# Do some math to find the record size that will overflow the page, and
+# write it.
+my $overflow_size = $WAL_BLOCK_SIZE - ($end_lsn % $WAL_BLOCK_SIZE);
+$end_lsn = $primary->emit_wal($overflow_size);
+$primary->stop('immediate');
+
+# Find the beginning of the page with the continuation record and fill
+# the entire page with zero bytes to simulate broken replication.
+my $start_page = start_of_page($end_lsn);
+my $wal_file = $primary->write_wal($TLI, $start_page, $WAL_SEGMENT_SIZE,
+	"\x00" x $WAL_BLOCK_SIZE);
+
+# Copy the file we just "hacked" to the archives.
+copy($wal_file, $primary->archive_dir);
+
+# Start standby nodes and make sure they replay the file "hacked" from
+# the archives of the primary.
+my $standby1 = PostgreSQL::Test::Cluster->new('standby1');
+$standby1->init_from_backup(
+	$primary, 'backup',
+	standby => 1,
+	has_restoring => 1);
+
+my $standby2 = PostgreSQL::Test::Cluster->new('standby2');
+$standby2->init_from_backup(
+	$primary, 'backup',
+	standby => 1,
+	has_restoring => 1);
+
+my $log_size1 = -s $standby1->logfile;
+my $log_size2 = -s $standby2->logfile;
+
+$standby1->start;
+$standby2->start;
+
+my ($segment, $offset) = lsn_to_segment_and_offset($start_page);
+my $segment_name = wal_segment_name($TLI, $segment);
+my $pattern =
+  qq(invalid magic number 0000 .* segment $segment_name.* offset $offset);
+
+# We expect both standby nodes to complain about an empty page when trying to
+# assemble the record that spans over two pages, so wait for such reports in
+# their logs.
+$standby1->wait_for_log($pattern, $log_size1);
+$standby2->wait_for_log($pattern, $log_size2);
+
+# Now check the case of a promotion with a timeline jump handled at
+# page boundary with a continuation record.
+$standby1->promote;
+
+# This command forces standby2 to read a continuation record from the page
+# that is filled with zero bytes.
+$standby1->safe_psql('postgres', 'SELECT pg_switch_wal()');
+
+# Make sure WAL moves forward.
+$standby1->safe_psql('postgres',
+	'INSERT INTO t SELECT * FROM generate_series(1, 1000)');
+
+# Configure standby2 to stream from just promoted standby1 (it also pulls WAL
+# files from the archive). It should be able to catch up.
+$standby2->enable_streaming($standby1);
+$standby2->reload;
+$standby1->wait_for_replay_catchup($standby2);
+
+my $result = $standby2->safe_psql('postgres', "SELECT count(*) FROM t");
+print "standby2: $result\n";
+is($result, qq(1001), 'check streamed content on standby2');
+
+done_testing();
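
The LSN arithmetic used by the test helpers (segment number and offset, start of page, size needed to overflow the current page, segment name) may be easier to follow as a worked example. The following is a minimal C sketch of that arithmetic only; the function names are hypothetical, the block size is assumed to be a power of two, and the segment name mirrors the test's own simplification of a zero middle field.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Segment number containing the given LSN (a byte position). */
static uint64_t
lsn_segment(uint64_t lsn, uint64_t seg_size)
{
	assert(seg_size > 0);
	return lsn / seg_size;
}

/* Byte offset of the LSN within its segment. */
static uint64_t
lsn_offset(uint64_t lsn, uint64_t seg_size)
{
	assert(seg_size > 0);
	return lsn % seg_size;
}

/* Start of the page containing the LSN; block_size must be a power of two. */
static uint64_t
start_of_page(uint64_t lsn, uint64_t block_size)
{
	assert(block_size > 0 && (block_size & (block_size - 1)) == 0);
	return lsn & ~(block_size - 1);
}

/* Number of bytes from the LSN to the end of its page. */
static uint64_t
bytes_to_page_end(uint64_t lsn, uint64_t block_size)
{
	assert(block_size > 0);
	return block_size - (lsn % block_size);
}

/* Segment name as built by the test: timeline, zero, segment number. */
static void
wal_segment_name(char *buf, size_t len, unsigned int tli, unsigned int segment)
{
	snprintf(buf, len, "%08X%08X%08X", tli, 0u, segment);
}
```

For example, with 16 MB segments and 8 kB pages, byte position 0x3002010 falls in segment 3 at offset 0x2010, its page starts at 0x3002000, and 8176 bytes remain before the end of that page; a record payload of about that size written there would continue onto the next page.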
