Commit 4b1dd9b

Fix timeline assignment in checkpoints with 2PC transactions
Any transactions found as still prepared by a checkpoint have their state data read from the WAL records generated by PREPARE TRANSACTION before being moved into their new location within pg_twophase/. While reading such records, the WAL reader uses the callback read_local_xlog_page() to read a page; this callback is shared across various parts of the system. Since 1148e22, this callback updates ThisTimeLineID when reading a record while in recovery, which is potentially helpful in the context of cascading WAL senders.

This update of ThisTimeLineID interacts badly with the checkpointer if a promotion happens while some 2PC data is read from its record: once ThisTimeLineID has been changed, any follow-up WAL records would be written to a timeline older than the promoted one. This results in consistency issues. For instance, a subsequent server restart would fail to find a valid checkpoint record, resulting in a PANIC.

This commit changes the code reading the 2PC data to reset the timeline once the 2PC record has been read, to avoid interfering with the static state of the checkpointer. It would be tempting to do the same thing directly in read_local_xlog_page(). However, based on the discussion that led to 1148e22, users may rely on the updates of ThisTimeLineID when a WAL record page is read in recovery, so changing this callback could break some cases that currently work.

A TAP test reproducing the issue is added, relying on a PITR to precisely trigger a promotion with a prepared transaction still tracked.

Per discussion with Heikki Linnakangas, Kyotaro Horiguchi, Fujii Masao and myself.

Author: Soumyadeep Chakraborty, Jimmy Yih, Kevin Yeap
Discussion: https://postgr.es/m/CAE-ML+_EjH_fzfq1F3RJ1=XaaNG=-Jz-i3JqkNhXiLAsM3z-Ew@mail.gmail.com
Backpatch-through: 10
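
The save-and-restore approach described in the message can be illustrated outside PostgreSQL with a toy model. This is only a sketch under stated assumptions: `toy_read_page()`, `toy_read_two_phase_data()` and `toy_current_timeline()` are hypothetical names invented for the illustration, the global merely imitates ThisTimeLineID, and none of this is the server's actual code.

```c
#include <assert.h>
#include <stdint.h>

/* Toy stand-in for the timeline identifier type. */
typedef uint32_t TimeLineID;

/* Toy stand-in for the process-wide ThisTimeLineID; starts on timeline 2. */
static TimeLineID ThisTimeLineID = 2;

/*
 * Hypothetical shared page-read callback. As a side effect it overwrites
 * the global timeline with the timeline the page was read from.
 */
void
toy_read_page(TimeLineID page_tli)
{
	ThisTimeLineID = page_tli;
}

/*
 * Hypothetical caller that must not leak the callback's side effect:
 * save the global first, call the callback, then restore the saved value.
 * Returns the timeline the callback switched to, for inspection only.
 */
TimeLineID
toy_read_two_phase_data(TimeLineID page_tli)
{
	TimeLineID	save_currtli = ThisTimeLineID;
	TimeLineID	seen_tli;

	toy_read_page(page_tli);
	seen_tli = ThisTimeLineID;

	/* Restore immediately, before anything else can use the global. */
	ThisTimeLineID = save_currtli;
	assert(ThisTimeLineID == save_currtli);

	return seen_tli;
}

/* Accessor so callers can check the global after the read. */
TimeLineID
toy_current_timeline(void)
{
	return ThisTimeLineID;
}
```

Calling `toy_read_two_phase_data(1)` while the global holds 2 lets the callback switch to timeline 1 internally, yet `toy_current_timeline()` still reports 2 afterwards. Restoring in the caller, rather than changing the callback, leaves the callback's behavior intact for other callers, which is the trade-off the message describes.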
1 parent 2f31414 commit 4b1dd9b

2 files changed, 104 additions and 2 deletions

src/backend/access/transam/twophase.c

Lines changed: 15 additions & 2 deletions
@@ -1376,15 +1376,20 @@ ParsePrepareRecord(uint8 info, char *xlrec, xl_xact_parsed_prepare *parsed)
  * twophase files and ReadTwoPhaseFile should be used instead.
  *
  * Note clearly that this function can access WAL during normal operation,
- * similarly to the way WALSender or Logical Decoding would do.
- *
+ * similarly to the way WALSender or Logical Decoding would do. While
+ * accessing WAL, read_local_xlog_page() may change ThisTimeLineID,
+ * particularly if this routine is called for the end-of-recovery checkpoint
+ * in the checkpointer itself, so save the current timeline number value
+ * and restore it once done.
+ *
  */
 static void
 XlogReadTwoPhaseData(XLogRecPtr lsn, char **buf, int *len)
 {
 	XLogRecord *record;
 	XLogReaderState *xlogreader;
 	char *errormsg;
+	TimeLineID save_currtli = ThisTimeLineID;
 
 	xlogreader = XLogReaderAllocate(wal_segment_size, &read_local_xlog_page,
 									NULL);
@@ -1395,6 +1400,14 @@ XlogReadTwoPhaseData(XLogRecPtr lsn, char **buf, int *len)
 				 errdetail("Failed while allocating a WAL reading processor.")));
 
 	record = XLogReadRecord(xlogreader, lsn, &errormsg);
+
+	/*
+	 * Restore immediately the timeline where it was previously, as
+	 * read_local_xlog_page() could have changed it if the record was read
+	 * while recovery was finishing or if the timeline has jumped in-between.
+	 */
+	ThisTimeLineID = save_currtli;
+
 	if (record == NULL)
 		ereport(ERROR,
 				(errcode_for_file_access(),
Lines changed: 89 additions & 0 deletions
@@ -0,0 +1,89 @@
+# Test for point-in-time-recovery (PITR) with prepared transactions
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 1;
+use File::Compare;
+
+# Initialize and start primary node with WAL archiving
+my $node_primary = get_new_node('primary');
+$node_primary->init(has_archiving => 1);
+$node_primary->append_conf(
+	'postgresql.conf', qq{
+max_wal_senders = 10
+wal_level = 'replica'
+max_prepared_transactions = 10});
+$node_primary->start;
+
+# Take backup
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Initialize node for PITR targeting a very specific restore point, just
+# after a PREPARE TRANSACTION is issued so as we finish with a promoted
+# node where this 2PC transaction needs an explicit COMMIT PREPARED.
+my $node_pitr = get_new_node('node_pitr');
+$node_pitr->init_from_backup(
+	$node_primary, $backup_name,
+	standby => 0,
+	has_restoring => 1);
+$node_pitr->append_conf(
+	'postgresql.conf', qq{
+max_prepared_transactions = 10
+recovery_target_name = 'rp'
+recovery_target_action = 'promote'});
+
+# Workload with a prepared transaction and the target restore point.
+$node_primary->psql(
+	'postgres', qq{
+CREATE TABLE foo(i int);
+BEGIN;
+INSERT INTO foo VALUES(1);
+PREPARE TRANSACTION 'fooinsert';
+SELECT pg_create_restore_point('rp');
+INSERT INTO foo VALUES(2);
+});
+
+# Find next WAL segment to be archived
+my $walfile_to_be_archived = $node_primary->safe_psql('postgres',
+	"SELECT pg_walfile_name(pg_current_wal_lsn());");
+
+# Make WAL segment eligible for archival
+$node_primary->safe_psql('postgres', 'SELECT pg_switch_wal()');
+
+# Wait until the WAL segment has been archived.
+my $archive_wait_query =
+  "SELECT '$walfile_to_be_archived' <= last_archived_wal FROM pg_stat_archiver;";
+$node_primary->poll_query_until('postgres', $archive_wait_query)
+  or die "Timed out while waiting for WAL segment to be archived";
+my $last_archived_wal_file = $walfile_to_be_archived;
+
+# Now start the PITR node.
+$node_pitr->start;
+
+# Wait until the PITR node exits recovery.
+$node_pitr->poll_query_until('postgres', "SELECT pg_is_in_recovery() = 'f';")
+  or die "Timed out while waiting for PITR promotion";
+
+# Commit the prepared transaction in the latest timeline and check its
+# result. There should only be one row in the table, coming from the
+# prepared transaction. The row from the INSERT after the restore point
+# should not show up, since our recovery target was older than the second
+# INSERT done.
+$node_pitr->psql('postgres', qq{COMMIT PREPARED 'fooinsert';});
+my $result = $node_pitr->safe_psql('postgres', "SELECT * FROM foo;");
+is($result, qq{1}, "check table contents after COMMIT PREPARED");
+
+# Insert more data and do a checkpoint. These should be generated on the
+# timeline chosen after the PITR promotion.
+$node_pitr->psql(
+	'postgres', qq{
+INSERT INTO foo VALUES(3);
+CHECKPOINT;
+});
+
+# Enforce recovery, the checkpoint record generated previously should
+# still be found.
+$node_pitr->stop('immediate');
+$node_pitr->start;
