Commit 3c64dcb

Prevent references to invalid relation pages after fresh promotion

If a standby crashes after promotion before having completed its first post-recovery checkpoint, then the minimal recovery point which marks the LSN position where the cluster is able to reach consistency may be set to a position older than the first end-of-recovery checkpoint while all the WAL available should be replayed. This leads to the instance thinking that it contains inconsistent pages, causing a PANIC and a hard instance crash even if all the WAL available has not been replayed for certain sets of records replayed.

When in crash recovery, minRecoveryPoint is expected to always be set to InvalidXLogRecPtr, which forces the recovery to replay all the WAL available, so this commit makes sure that the local copy of minRecoveryPoint from the control file is initialized properly and stays as it is while crash recovery is performed. Once switching to archive recovery or if crash recovery finishes, then the local copy minRecoveryPoint can be safely updated.

Pavan Deolasee has reported and diagnosed the failure in the first place, and the base fix idea to rely on the local copy of minRecoveryPoint comes from Kyotaro Horiguchi, which has been expanded into a full-fledged patch by me. The test included in this commit has been written by Álvaro Herrera and Pavan Deolasee, which I have modified to make it faster and more reliable with sleep phases.

Backpatch down to all supported versions where the bug appears, aka 9.3 which is where the end-of-recovery checkpoint is not run by the startup process anymore. The test gets easily supported down to 10, still it has been tested on all branches.

Reported-by: Pavan Deolasee
Diagnosed-by: Pavan Deolasee
Reviewed-by: Pavan Deolasee, Kyotaro Horiguchi
Author: Michael Paquier, Kyotaro Horiguchi, Pavan Deolasee, Álvaro Herrera
Discussion: https://postgr.es/m/CABOikdPOewjNL=05K5CbNMxnNtXnQjhTx2F--4p4ruorCjukbA@mail.gmail.com

1 parent 249126e commit 3c64dcb
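
The rule described in the commit message can be sketched as a small standalone model. This is only an illustrative approximation, not the code from xlog.c: the `Toy*` types and `toy_*` functions are hypothetical names introduced here, the control file is reduced to a single field, and locking, timelines and the `force` argument are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-ins for the PostgreSQL definitions, simplified for this sketch. */
typedef uint64_t XLogRecPtr;
#define InvalidXLogRecPtr ((XLogRecPtr) 0)

/* Hypothetical, reduced model of the control file. */
typedef struct
{
    XLogRecPtr  minRecoveryPoint;
} ToyControlFile;

/* Hypothetical model of the startup process's local state. */
typedef struct
{
    XLogRecPtr  minRecoveryPoint;       /* local copy */
    bool        updateMinRecoveryPoint;
    bool        inArchiveRecovery;
} ToyRecoveryState;

/* Initialize the local copy: invalid in crash recovery, so all WAL is replayed. */
static void
toy_init(ToyRecoveryState *st, const ToyControlFile *cf, bool in_archive_recovery)
{
    st->inArchiveRecovery = in_archive_recovery;
    st->updateMinRecoveryPoint = true;
    st->minRecoveryPoint = in_archive_recovery
        ? cf->minRecoveryPoint
        : InvalidXLogRecPtr;
}

/* Returns true only if the control file value was advanced to lsn. */
static bool
toy_update_min_recovery_point(ToyRecoveryState *st, ToyControlFile *cf, XLogRecPtr lsn)
{
    if (!st->updateMinRecoveryPoint || lsn <= st->minRecoveryPoint)
        return false;

    /* Crash recovery: leave both copies alone and stop checking. */
    if (st->minRecoveryPoint == InvalidXLogRecPtr)
    {
        st->updateMinRecoveryPoint = false;
        return false;
    }

    /* Archive recovery: refresh the local copy, then advance if needed. */
    st->minRecoveryPoint = cf->minRecoveryPoint;
    if (st->minRecoveryPoint < lsn)
    {
        cf->minRecoveryPoint = lsn;
        st->minRecoveryPoint = lsn;
        return true;
    }
    return false;
}

/* Does a consistency check against minRecoveryPoint apply at all? */
static bool
toy_consistency_check_applies(const ToyRecoveryState *st)
{
    if (st->minRecoveryPoint == InvalidXLogRecPtr)
        return false;
    assert(st->inArchiveRecovery);
    return true;
}

/* Switching from crash recovery to archive recovery re-enables updates. */
static void
toy_switch_to_archive_recovery(ToyRecoveryState *st, const ToyControlFile *cf)
{
    st->inArchiveRecovery = true;
    st->minRecoveryPoint = cf->minRecoveryPoint;
    st->updateMinRecoveryPoint = true;
}
```

In this model, a state initialized with `in_archive_recovery = false` ignores whatever stale value the control file holds, so a later update request neither moves the control file value nor triggers the consistency check. Updates only resume after `toy_switch_to_archive_recovery` is called.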

2 files changed: +157 -31 lines changed

src/backend/access/transam/xlog.c

+70 -31
@@ -821,8 +821,14 @@ static XLogSource XLogReceiptSource = 0; /* XLOG_FROM_* code */
 static XLogRecPtr ReadRecPtr; /* start of last record read */
 static XLogRecPtr EndRecPtr; /* end+1 of last record read */

-static XLogRecPtr minRecoveryPoint; /* local copy of
-                                     * ControlFile->minRecoveryPoint */
+/*
+ * Local copies of equivalent fields in the control file. When running
+ * crash recovery, minRecoveryPoint is set to InvalidXLogRecPtr as we
+ * expect to replay all the WAL available, and updateMinRecoveryPoint is
+ * switched to false to prevent any updates while replaying records.
+ * Those values are kept consistent as long as crash recovery runs.
+ */
+static XLogRecPtr minRecoveryPoint;
 static TimeLineID minRecoveryPointTLI;
 static bool updateMinRecoveryPoint = true;

@@ -2711,20 +2717,26 @@ UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force)
     if (!updateMinRecoveryPoint || (!force && lsn <= minRecoveryPoint))
         return;

+    /*
+     * An invalid minRecoveryPoint means that we need to recover all the WAL,
+     * i.e., we're doing crash recovery. We never modify the control file's
+     * value in that case, so we can short-circuit future checks here too. The
+     * local values of minRecoveryPoint and minRecoveryPointTLI should not be
+     * updated until crash recovery finishes.
+     */
+    if (XLogRecPtrIsInvalid(minRecoveryPoint))
+    {
+        updateMinRecoveryPoint = false;
+        return;
+    }
+
     LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);

     /* update local copy */
     minRecoveryPoint = ControlFile->minRecoveryPoint;
     minRecoveryPointTLI = ControlFile->minRecoveryPointTLI;

-    /*
-     * An invalid minRecoveryPoint means that we need to recover all the WAL,
-     * i.e., we're doing crash recovery. We never modify the control file's
-     * value in that case, so we can short-circuit future checks here too.
-     */
-    if (minRecoveryPoint == 0)
-        updateMinRecoveryPoint = false;
-    else if (force || minRecoveryPoint < lsn)
+    if (force || minRecoveryPoint < lsn)
     {
         XLogRecPtr newMinRecoveryPoint;
         TimeLineID newMinRecoveryPointTLI;
@@ -3110,7 +3122,16 @@ XLogNeedsFlush(XLogRecPtr record)
      */
     if (RecoveryInProgress())
     {
-        /* Quick exit if already known updated */
+        /*
+         * An invalid minRecoveryPoint means that we need to recover all the
+         * WAL, i.e., we're doing crash recovery. We never modify the control
+         * file's value in that case, so we can short-circuit future checks
+         * here too.
+         */
+        if (XLogRecPtrIsInvalid(minRecoveryPoint))
+            updateMinRecoveryPoint = false;
+
+        /* Quick exit if already known to be updated or cannot be updated */
         if (record <= minRecoveryPoint || !updateMinRecoveryPoint)
             return false;

@@ -3124,20 +3145,8 @@ XLogNeedsFlush(XLogRecPtr record)
         minRecoveryPointTLI = ControlFile->minRecoveryPointTLI;
         LWLockRelease(ControlFileLock);

-        /*
-         * An invalid minRecoveryPoint means that we need to recover all the
-         * WAL, i.e., we're doing crash recovery. We never modify the control
-         * file's value in that case, so we can short-circuit future checks
-         * here too.
-         */
-        if (minRecoveryPoint == 0)
-            updateMinRecoveryPoint = false;
-
         /* check again */
-        if (record <= minRecoveryPoint || !updateMinRecoveryPoint)
-            return false;
-        else
-            return true;
+        return record > minRecoveryPoint;
     }

     /* Quick exit if already known flushed */
@@ -4269,6 +4278,12 @@ ReadRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr, int emode,
                 minRecoveryPoint = ControlFile->minRecoveryPoint;
                 minRecoveryPointTLI = ControlFile->minRecoveryPointTLI;

+                /*
+                 * The startup process can update its local copy of
+                 * minRecoveryPoint from this point.
+                 */
+                updateMinRecoveryPoint = true;
+
                 UpdateControlFile();
                 LWLockRelease(ControlFileLock);

@@ -6892,9 +6907,26 @@ StartupXLOG(void)
         /* No need to hold ControlFileLock yet, we aren't up far enough */
         UpdateControlFile();

-        /* initialize our local copy of minRecoveryPoint */
-        minRecoveryPoint = ControlFile->minRecoveryPoint;
-        minRecoveryPointTLI = ControlFile->minRecoveryPointTLI;
+        /*
+         * Initialize our local copy of minRecoveryPoint. When doing crash
+         * recovery we want to replay up to the end of WAL. Particularly, in
+         * the case of a promoted standby minRecoveryPoint value in the
+         * control file is only updated after the first checkpoint. However,
+         * if the instance crashes before the first post-recovery checkpoint
+         * is completed then recovery will use a stale location causing the
+         * startup process to think that there are still invalid page
+         * references when checking for data consistency.
+         */
+        if (InArchiveRecovery)
+        {
+            minRecoveryPoint = ControlFile->minRecoveryPoint;
+            minRecoveryPointTLI = ControlFile->minRecoveryPointTLI;
+        }
+        else
+        {
+            minRecoveryPoint = InvalidXLogRecPtr;
+            minRecoveryPointTLI = 0;
+        }

         /*
          * Reset pgstat data, because it may be invalid after recovery.
@@ -7861,6 +7893,8 @@ CheckRecoveryConsistency(void)
     if (XLogRecPtrIsInvalid(minRecoveryPoint))
         return;

+    Assert(InArchiveRecovery);
+
     /*
      * assume that we are called in the startup process, and hence don't need
      * a lock to read lastReplayedEndRecPtr
@@ -9949,11 +9983,16 @@ xlog_redo(XLogReaderState *record)
          * Update minRecoveryPoint to ensure that if recovery is aborted, we
          * recover back up to this point before allowing hot standby again.
          * This is important if the max_* settings are decreased, to ensure
-         * you don't run queries against the WAL preceding the change.
+         * you don't run queries against the WAL preceding the change. The
+         * local copies cannot be updated as long as crash recovery is
+         * happening and we expect all the WAL to be replayed.
          */
-        minRecoveryPoint = ControlFile->minRecoveryPoint;
-        minRecoveryPointTLI = ControlFile->minRecoveryPointTLI;
-        if (minRecoveryPoint != 0 && minRecoveryPoint < lsn)
+        if (InArchiveRecovery)
+        {
+            minRecoveryPoint = ControlFile->minRecoveryPoint;
+            minRecoveryPointTLI = ControlFile->minRecoveryPointTLI;
+        }
+        if (minRecoveryPoint != InvalidXLogRecPtr && minRecoveryPoint < lsn)
         {
             ControlFile->minRecoveryPoint = lsn;
             ControlFile->minRecoveryPointTLI = ThisTimeLineID;

+87
@@ -0,0 +1,87 @@
+# Test for promotion handling with WAL records generated post-promotion
+# before the first checkpoint is generated. This test case checks for
+# invalid page references at replay based on the minimum consistent
+# recovery point defined.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 1;
+
+# Initialize primary node
+my $alpha = get_new_node('alpha');
+$alpha->init(allows_streaming => 1);
+# Setting wal_log_hints to off is important to get invalid page
+# references.
+$alpha->append_conf("postgresql.conf", <<EOF);
+wal_log_hints = off
+EOF
+
+# Start the primary
+$alpha->start;
+
+# setup/start a standby
+$alpha->backup('bkp');
+my $bravo = get_new_node('bravo');
+$bravo->init_from_backup($alpha, 'bkp', has_streaming => 1);
+$bravo->append_conf('postgresql.conf', <<EOF);
+checkpoint_timeout=1h
+checkpoint_completion_target=0.9
+EOF
+$bravo->start;
+
+# Dummy table for the upcoming tests.
+$alpha->safe_psql('postgres', 'create table test1 (a int)');
+$alpha->safe_psql('postgres', 'insert into test1 select generate_series(1, 10000)');
+
+# take a checkpoint
+$alpha->safe_psql('postgres', 'checkpoint');
+
+# The following vacuum will set visibility map bits and create
+# problematic WAL records.
+$alpha->safe_psql('postgres', 'vacuum verbose test1');
+# Wait for last record to have been replayed on the standby.
+$alpha->wait_for_catchup($bravo, 'replay',
+    $alpha->lsn('insert'));
+
+# Now force a checkpoint on the standby. This seems unnecessary but for "some"
+# reason, the previous checkpoint on the primary does not reflect on the standby
+# and without an explicit checkpoint, it may start redo recovery from a much
+# older point, which includes even create table and initial page additions.
+$bravo->safe_psql('postgres', 'checkpoint');
+
+# Now just use a dummy table and run some operations to move minRecoveryPoint
+# beyond the previous vacuum.
+$alpha->safe_psql('postgres', 'create table test2 (a int, b text)');
+$alpha->safe_psql('postgres', 'insert into test2 select generate_series(1,10000), md5(random()::text)');
+$alpha->safe_psql('postgres', 'truncate test2');
+
+# Wait again for all records to be replayed.
+$alpha->wait_for_catchup($bravo, 'replay',
+    $alpha->lsn('insert'));
+
+# Do the promotion, which reinitializes minRecoveryPoint in the control
+# file so as WAL is replayed up to the end.
+$bravo->promote;
+
+# Truncate the table on the promoted standby, vacuum and extend it
+# again to create new page references. The first post-recovery checkpoint
+# has not happened yet.
+$bravo->safe_psql('postgres', 'truncate test1');
+$bravo->safe_psql('postgres', 'vacuum verbose test1');
+$bravo->safe_psql('postgres', 'insert into test1 select generate_series(1,1000)');
+
+# Now crash-stop the promoted standby and restart. This makes sure that
+# replay does not see invalid page references because of an invalid
+# minimum consistent recovery point.
+$bravo->stop('immediate');
+$bravo->start;
+
+# Check state of the table after full crash recovery. All its data should
+# be here.
+my $psql_out;
+$bravo->psql(
+    'postgres',
+    "SELECT count(*) FROM test1",
+    stdout => \$psql_out);
+is($psql_out, '1000', "Check that table state is correct");
