Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
Skip to content

Commit b9cce9d

Browse files
committed
PANIC on fsync() failure.
On some operating systems, it doesn't make sense to retry fsync(), because dirty data cached by the kernel may have been dropped on write-back failure. In that case the only remaining copy of the data is in the WAL. A subsequent fsync() could appear to succeed, but not have flushed the data. That means that a future checkpoint could apparently complete successfully but have lost data. Therefore, violently prevent any future checkpoint attempts by panicking on the first fsync() failure. Note that we already did the same for WAL data; this change extends that behavior to non-temporary data files. Provide a GUC data_sync_retry to control this new behavior, for users of operating systems that don't eject dirty data, and possibly forensic/testing uses. If it is set to on and the write-back error was transient, a later checkpoint might genuinely succeed (on a system that does not throw away buffers on failure); if the error is permanent, later checkpoints will continue to fail. The GUC defaults to off, meaning that we panic. Back-patch to all supported releases. There is still a narrow window for error-loss on some operating systems: if the file is closed and later reopened and a write-back error occurs in the intervening time, but the inode has the bad luck to be evicted due to memory pressure before we reopen, we could miss the error. A later patch will address that with a scheme for keeping files with dirty data open at all times, but we judge that to be too complicated to back-patch. Author: Craig Ringer, with some adjustments by Thomas Munro Reported-by: Craig Ringer Reviewed-by: Robert Haas, Thomas Munro, Andres Freund Discussion: https://postgr.es/m/20180427222842.in2e4mibx45zdth5%40alap3.anarazel.de
1 parent a4d6d6a commit b9cce9d

File tree

12 files changed

+99
-18
lines changed

12 files changed

+99
-18
lines changed

doc/src/sgml/config.sgml

Lines changed: 32 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -7512,6 +7512,38 @@ dynamic_library_path = 'C:\tools\postgresql;H:\my_project\lib;$libdir'
75127512
</listitem>
75137513
</varlistentry>
75147514

7515+
<varlistentry id="guc-data-sync-retry" xreflabel="data_sync_retry">
7516+
<term><varname>data_sync_retry</varname> (<type>boolean</type>)
7517+
<indexterm>
7518+
<primary><varname>data_sync_retry</varname> configuration parameter</primary>
7519+
</indexterm>
7520+
</term>
7521+
<listitem>
7522+
<para>
7523+
When set to false, which is the default, <productname>PostgreSQL</productname>
7524+
will raise a PANIC-level error on failure to flush modified data files
7525+
to the filesystem. This causes the database server to crash.
7526+
</para>
7527+
<para>
7528+
On some operating systems, the status of data in the kernel's page
7529+
cache is unknown after a write-back failure. In some cases it might
7530+
have been entirely forgotten, making it unsafe to retry; the second
7531+
attempt may be reported as successful, when in fact the data has been
7532+
lost. In these circumstances, the only way to avoid data loss is to
7533+
recover from the WAL after any failure is reported, preferably
7534+
after investigating the root cause of the failure and replacing any
7535+
faulty hardware.
7536+
</para>
7537+
<para>
7538+
If set to true, <productname>PostgreSQL</productname> will instead
7539+
report an error but continue to run so that the data flushing
7540+
operation can be retried in a later checkpoint. Only set it to true
7541+
after investigating the operating system's treatment of buffered data
7542+
in case of write-back failure.
7543+
</para>
7544+
</listitem>
7545+
</varlistentry>
7546+
75157547
</variablelist>
75167548

75177549
</sect1>

src/backend/access/heap/rewriteheap.c

Lines changed: 3 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -974,7 +974,7 @@ logical_end_heap_rewrite(RewriteState state)
974974
while ((src = (RewriteMappingFile *) hash_seq_search(&seq_status)) != NULL)
975975
{
976976
if (FileSync(src->vfd) != 0)
977-
ereport(ERROR,
977+
ereport(data_sync_elevel(ERROR),
978978
(errcode_for_file_access(),
979979
errmsg("could not fsync file \"%s\": %m", src->path)));
980980
FileClose(src->vfd);
@@ -1192,7 +1192,7 @@ heap_xlog_logical_rewrite(XLogReaderState *r)
11921192
* doesn't seem worth the trouble.
11931193
*/
11941194
if (pg_fsync(fd) != 0)
1195-
ereport(ERROR,
1195+
ereport(data_sync_elevel(ERROR),
11961196
(errcode_for_file_access(),
11971197
errmsg("could not fsync file \"%s\": %m", path)));
11981198

@@ -1289,7 +1289,7 @@ CheckPointLogicalRewriteHeap(void)
12891289
* but it's currently not deemed worth the effort.
12901290
*/
12911291
else if (pg_fsync(fd) != 0)
1292-
ereport(ERROR,
1292+
ereport(data_sync_elevel(ERROR),
12931293
(errcode_for_file_access(),
12941294
errmsg("could not fsync file \"%s\": %m", path)));
12951295
CloseTransientFile(fd);

src/backend/access/transam/slru.c

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -921,7 +921,7 @@ SlruReportIOError(SlruCtl ctl, int pageno, TransactionId xid)
921921
path, offset)));
922922
break;
923923
case SLRU_FSYNC_FAILED:
924-
ereport(ERROR,
924+
ereport(data_sync_elevel(ERROR),
925925
(errcode_for_file_access(),
926926
errmsg("could not access status of transaction %u", xid),
927927
errdetail("Could not fsync file \"%s\": %m.",

src/backend/access/transam/timeline.c

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -402,7 +402,7 @@ writeTimeLineHistory(TimeLineID newTLI, TimeLineID parentTLI,
402402
}
403403

404404
if (pg_fsync(fd) != 0)
405-
ereport(ERROR,
405+
ereport(data_sync_elevel(ERROR),
406406
(errcode_for_file_access(),
407407
errmsg("could not fsync file \"%s\": %m", tmppath)));
408408

@@ -478,7 +478,7 @@ writeTimeLineHistoryFile(TimeLineID tli, char *content, int size)
478478
}
479479

480480
if (pg_fsync(fd) != 0)
481-
ereport(ERROR,
481+
ereport(data_sync_elevel(ERROR),
482482
(errcode_for_file_access(),
483483
errmsg("could not fsync file \"%s\": %m", tmppath)));
484484

src/backend/access/transam/xlog.c

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -3283,7 +3283,7 @@ XLogFileCopy(XLogSegNo destsegno, TimeLineID srcTLI, XLogSegNo srcsegno,
32833283
}
32843284

32853285
if (pg_fsync(fd) != 0)
3286-
ereport(ERROR,
3286+
ereport(data_sync_elevel(ERROR),
32873287
(errcode_for_file_access(),
32883288
errmsg("could not fsync file \"%s\": %m", tmppath)));
32893289

src/backend/replication/logical/snapbuild.c

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1606,6 +1606,9 @@ SnapBuildSerialize(SnapBuild *builder, XLogRecPtr lsn)
16061606
* fsync the file before renaming so that even if we crash after this we
16071607
* have either a fully valid file or nothing.
16081608
*
1609+
* It's safe to just ERROR on fsync() here because we'll retry the whole
1610+
* operation including the writes.
1611+
*
16091612
* TODO: Do the fsync() via checkpoints/restartpoints, doing it here has
16101613
* some noticeable overhead since it's performed synchronously during
16111614
* decoding?

src/backend/storage/file/fd.c

Lines changed: 41 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -138,6 +138,8 @@ int max_files_per_process = 1000;
138138
*/
139139
int max_safe_fds = 32; /* default if not changed */
140140

141+
/* Whether it is safe to continue running after fsync() fails. */
142+
bool data_sync_retry = false;
141143

142144
/* Debugging.... */
143145

@@ -433,11 +435,9 @@ pg_flush_data(int fd, off_t offset, off_t nbytes)
433435
*/
434436
rc = sync_file_range(fd, offset, nbytes,
435437
SYNC_FILE_RANGE_WRITE);
436-
437-
/* don't error out, this is just a performance optimization */
438438
if (rc != 0)
439439
{
440-
ereport(WARNING,
440+
ereport(data_sync_elevel(WARNING),
441441
(errcode_for_file_access(),
442442
errmsg("could not flush dirty data: %m")));
443443
}
@@ -509,7 +509,7 @@ pg_flush_data(int fd, off_t offset, off_t nbytes)
509509
rc = msync(p, (size_t) nbytes, MS_ASYNC);
510510
if (rc != 0)
511511
{
512-
ereport(WARNING,
512+
ereport(data_sync_elevel(WARNING),
513513
(errcode_for_file_access(),
514514
errmsg("could not flush dirty data: %m")));
515515
/* NB: need to fall through to munmap()! */
@@ -565,7 +565,7 @@ pg_flush_data(int fd, off_t offset, off_t nbytes)
565565
void
566566
fsync_fname(const char *fname, bool isdir)
567567
{
568-
fsync_fname_ext(fname, isdir, false, ERROR);
568+
fsync_fname_ext(fname, isdir, false, data_sync_elevel(ERROR));
569569
}
570570

571571
/*
@@ -994,7 +994,8 @@ LruDelete(File file)
994994
* to leak the FD than to mess up our internal state.
995995
*/
996996
if (close(vfdP->fd))
997-
elog(LOG, "could not close file \"%s\": %m", vfdP->fileName);
997+
elog(vfdP->fdstate & FD_TEMPORARY ? LOG : data_sync_elevel(LOG),
998+
"could not close file \"%s\": %m", vfdP->fileName);
998999
vfdP->fd = VFD_CLOSED;
9991000
--nfile;
10001001

@@ -1473,7 +1474,14 @@ FileClose(File file)
14731474
{
14741475
/* close the file */
14751476
if (close(vfdP->fd))
1476-
elog(LOG, "could not close file \"%s\": %m", vfdP->fileName);
1477+
{
1478+
/*
1479+
* We may need to panic on failure to close non-temporary files;
1480+
* see LruDelete.
1481+
*/
1482+
elog(vfdP->fdstate & FD_TEMPORARY ? LOG : data_sync_elevel(LOG),
1483+
"could not close file \"%s\": %m", vfdP->fileName);
1484+
}
14771485

14781486
--nfile;
14791487
vfdP->fd = VFD_CLOSED;
@@ -2898,6 +2906,9 @@ looks_like_temp_rel_name(const char *name)
28982906
* harmless cases such as read-only files in the data directory, and that's
28992907
* not good either.
29002908
*
2909+
* Note that if we previously crashed due to a PANIC on fsync(), we'll be
2910+
* rewriting all changes again during recovery.
2911+
*
29012912
* Note we assume we're chdir'd into PGDATA to begin with.
29022913
*/
29032914
void
@@ -3184,3 +3195,26 @@ fsync_parent_path(const char *fname, int elevel)
31843195

31853196
return 0;
31863197
}
3198+
3199+
/*
3200+
* Return the passed-in error level, or PANIC if data_sync_retry is off.
3201+
*
3202+
* Failure to fsync any data file is cause for immediate panic, unless
3203+
* data_sync_retry is enabled. Data may have been written to the operating
3204+
* system and removed from our buffer pool already, and if we are running on
3205+
* an operating system that forgets dirty data on write-back failure, there
3206+
* may be only one copy of the data remaining: in the WAL. A later attempt to
3207+
* fsync again might falsely report success. Therefore we must not allow any
3208+
* further checkpoints to be attempted. data_sync_retry can in theory be
3209+
* enabled on systems known not to drop dirty buffered data on write-back
3210+
* failure (with the likely outcome that checkpoints will continue to fail
3211+
* until the underlying problem is fixed).
3212+
*
3213+
* Any code that reports a failure from fsync() or related functions should
3214+
* filter the error level with this function.
3215+
*/
3216+
int
3217+
data_sync_elevel(int elevel)
3218+
{
3219+
return data_sync_retry ? elevel : PANIC;
3220+
}

src/backend/storage/smgr/md.c

Lines changed: 3 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -1037,7 +1037,7 @@ mdimmedsync(SMgrRelation reln, ForkNumber forknum)
10371037
while (v != NULL)
10381038
{
10391039
if (FileSync(v->mdfd_vfd) < 0)
1040-
ereport(ERROR,
1040+
ereport(data_sync_elevel(ERROR),
10411041
(errcode_for_file_access(),
10421042
errmsg("could not fsync file \"%s\": %m",
10431043
FilePathName(v->mdfd_vfd))));
@@ -1282,7 +1282,7 @@ mdsync(void)
12821282
bms_join(new_requests, requests);
12831283

12841284
errno = save_errno;
1285-
ereport(ERROR,
1285+
ereport(data_sync_elevel(ERROR),
12861286
(errcode_for_file_access(),
12871287
errmsg("could not fsync file \"%s\": %m",
12881288
path)));
@@ -1456,7 +1456,7 @@ register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
14561456
(errmsg("could not forward fsync request because request queue is full")));
14571457

14581458
if (FileSync(seg->mdfd_vfd) < 0)
1459-
ereport(ERROR,
1459+
ereport(data_sync_elevel(ERROR),
14601460
(errcode_for_file_access(),
14611461
errmsg("could not fsync file \"%s\": %m",
14621462
FilePathName(seg->mdfd_vfd))));

src/backend/utils/cache/relmapper.c

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -792,7 +792,7 @@ write_relmap_file(bool shared, RelMapFile *newmap,
792792
* CheckPointRelationMap.
793793
*/
794794
if (pg_fsync(fd) != 0)
795-
ereport(ERROR,
795+
ereport(data_sync_elevel(ERROR),
796796
(errcode_for_file_access(),
797797
errmsg("could not fsync relation mapping file \"%s\": %m",
798798
mapfilename)));

src/backend/utils/misc/guc.c

Lines changed: 9 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1658,6 +1658,15 @@ static struct config_bool ConfigureNamesBool[] =
16581658
NULL, NULL, NULL
16591659
},
16601660

1661+
{
1662+
{"data_sync_retry", PGC_POSTMASTER, ERROR_HANDLING_OPTIONS,
1663+
gettext_noop("Whether to continue running after a failure to sync data files."),
1664+
},
1665+
&data_sync_retry,
1666+
false,
1667+
NULL, NULL, NULL
1668+
},
1669+
16611670
/* End-of-list marker */
16621671
{
16631672
{NULL, 0, 0, NULL, NULL}, NULL, false, NULL, NULL, NULL

src/backend/utils/misc/postgresql.conf.sample

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -620,6 +620,7 @@
620620

621621
#exit_on_error = off # terminate session on any error?
622622
#restart_after_crash = on # reinitialize after backend crash?
623+
#data_sync_retry = off # retry or panic on failure to fsync data?
623624

624625

625626
#------------------------------------------------------------------------------

src/include/storage/fd.h

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -53,6 +53,7 @@ typedef int File;
5353

5454
/* GUC parameter */
5555
extern PGDLLIMPORT int max_files_per_process;
56+
extern PGDLLIMPORT bool data_sync_retry;
5657

5758
/*
5859
* This is private to fd.c, but exported for save/restore_backend_variables()
@@ -123,6 +124,7 @@ extern void fsync_fname(const char *fname, bool isdir);
123124
extern int durable_rename(const char *oldfile, const char *newfile, int loglevel);
124125
extern int durable_link_or_rename(const char *oldfile, const char *newfile, int loglevel);
125126
extern void SyncDataDirectory(void);
127+
extern int data_sync_elevel(int elevel);
126128

127129
/* Filename components for OpenTemporaryFile */
128130
#define PG_TEMP_FILES_DIR "pgsql_tmp"

0 commit comments

Comments
 (0)