Commit 6672d79

Prevent WAL corruption after a standby promotion.
When a PostgreSQL instance performing archive recovery but not using standby mode is promoted, and the last WAL segment that it attempted to read ended in a partial record, the previous code would create invalid WAL on the new timeline. The WAL from the previous timeline would be copied to the new timeline up until the end of the last valid record, but instead of beginning to write WAL immediately afterwards, the promoted server would write an overwrite contrecord at the beginning of the next segment. The end of the previous segment would be left as all-zeroes, resulting in failures if anything tried to read WAL from that file.

The root of the issue is that ReadRecord() decides whether to set abortedRecPtr and missingContrecPtr based on the value of StandbyMode, but ReadRecord() switches to a new timeline based on the value of ArchiveRecoveryRequested. We shouldn't try to write an overwrite contrecord if we're switching to a new timeline, so change the test in ReadRecord() to check ArchiveRecoveryRequested instead.

Code fix by Dilip Kumar. Comments by me incorporating suggested language from Álvaro Herrera. Further review from Kyotaro Horiguchi and Sami Imseih.

Discussion: http://postgr.es/m/CAFiTN-t7umki=PK8dT1tcPV=mOUe2vNhHML6b3T7W7qqvvajjg@mail.gmail.com
Discussion: http://postgr.es/m/FB0DEA0B-E14E-43A0-811F-C1AE93D00FF3%40amazon.com
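The mismatch between the two flags can be illustrated with a toy model. This is only a sketch and not the PostgreSQL source: the two helper functions, their names, and the simplified `XLogRecPtr` typedef are hypothetical stand-ins that mirror the old and new conditions shown in the xlogrecovery.c hunk.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-ins for illustration; not the real PostgreSQL definitions. */
typedef uint64_t XLogRecPtr;
#define InvalidXLogRecPtr ((XLogRecPtr) 0)

/*
 * Hypothetical helper modelling the old test: remember the aborted record
 * whenever the server is not in standby mode.
 */
static bool
old_tracks_aborted_record(bool standby_mode, XLogRecPtr aborted_rec_ptr)
{
	return !standby_mode && aborted_rec_ptr != InvalidXLogRecPtr;
}

/*
 * Hypothetical helper modelling the new test: remember the aborted record
 * only when archive recovery was not requested, that is, when no timeline
 * switch will happen at the end of recovery.
 */
static bool
new_tracks_aborted_record(bool archive_recovery_requested,
						  XLogRecPtr aborted_rec_ptr)
{
	return !archive_recovery_requested && aborted_rec_ptr != InvalidXLogRecPtr;
}
```

In this model, the problematic case is archive recovery without standby mode: the old test returns true for a valid aborted record pointer even though a timeline switch is coming, while the new test returns false. For plain crash recovery (both flags false) the two tests agree.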
1 parent 620ac28 commit 6672d79

2 files changed: +19 -5 lines changed
src/backend/access/transam/xlog.c (+8)

@@ -5433,6 +5433,14 @@ StartupXLOG(void)
 	 */
 	if (!XLogRecPtrIsInvalid(missingContrecPtr))
 	{
+		/*
+		 * We should only have a missingContrecPtr if we're not switching to
+		 * a new timeline. When a timeline switch occurs, WAL is copied from
+		 * the old timeline to the new only up to the end of the last complete
+		 * record, so there can't be an incomplete WAL record that we need to
+		 * disregard.
+		 */
+		Assert(newTLI == endOfRecoveryInfo->lastRecTLI);
 		Assert(!XLogRecPtrIsInvalid(abortedRecPtr));
 		EndOfLog = missingContrecPtr;
 	}
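The invariant that the new assertion encodes can be sketched in isolation. The code below is a toy model under stated assumptions, not the StartupXLOG() implementation: `choose_end_of_log` and its parameters are hypothetical, the typedefs are simplified stand-ins, and the failed invariant is reported through the return value rather than through `Assert`.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-ins for illustration; not the real PostgreSQL definitions. */
typedef uint64_t XLogRecPtr;
typedef uint32_t TimeLineID;
#define InvalidXLogRecPtr ((XLogRecPtr) 0)

/*
 * Hypothetical helper: pick the position where new WAL should start.
 * By default that is the end of the last complete record. If an incomplete
 * record was remembered (missing_contrec_ptr is valid), WAL continues there
 * instead, but only if no timeline switch is taking place.
 *
 * Returns false if the invariant "a missing contrecord implies no timeline
 * switch" is violated; *end_of_log then holds the default position.
 */
static bool
choose_end_of_log(XLogRecPtr end_of_last_record,
				  XLogRecPtr missing_contrec_ptr,
				  TimeLineID new_tli,
				  TimeLineID last_rec_tli,
				  XLogRecPtr *end_of_log)
{
	*end_of_log = end_of_last_record;

	if (missing_contrec_ptr != InvalidXLogRecPtr)
	{
		if (new_tli != last_rec_tli)
			return false;		/* would place the overwrite contrecord wrongly */
		*end_of_log = missing_contrec_ptr;
	}
	return true;
}
```

With the change to ReadRecord() in this commit, the failing branch should be unreachable, which is why the real code can state the condition as an assertion.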

src/backend/access/transam/xlogrecovery.c (+11 -5)

@@ -3024,12 +3024,18 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
 		if (record == NULL)
 		{
 			/*
-			 * When not in standby mode we find that WAL ends in an incomplete
-			 * record, keep track of that record.  After recovery is done,
-			 * we'll write a record to indicate to downstream WAL readers that
-			 * that portion is to be ignored.
+			 * When we find that WAL ends in an incomplete record, keep track
+			 * of that record.  After recovery is done, we'll write a record to
+			 * indicate to downstream WAL readers that that portion is to be
+			 * ignored.
+			 *
+			 * However, when ArchiveRecoveryRequested = true, we're going to
+			 * switch to a new timeline at the end of recovery. We will only
+			 * copy WAL over to the new timeline up to the end of the last
+			 * complete record, so if we did this, we would later create an
+			 * overwrite contrecord in the wrong place, breaking everything.
 			 */
-			if (!StandbyMode &&
+			if (!ArchiveRecoveryRequested &&
 				!XLogRecPtrIsInvalid(xlogreader->abortedRecPtr))
 			{
 				abortedRecPtr = xlogreader->abortedRecPtr;