Commit 428b1d6

Allow to trigger kernel writeback after a configurable number of writes.
Currently writes to the main data files of PostgreSQL all go through the OS page cache. This means that some operating systems can end up collecting a large number of dirty buffers in their respective page caches. When these dirty buffers are flushed to storage rapidly, be it because of fsync(), timeouts, or dirty ratios, latency for other reads and writes can increase massively. This is the primary reason for the regular massive stalls observed in real-world scenarios and artificial benchmarks; on rotating disks, stalls on the order of hundreds of seconds have been observed.

On Linux it is possible to control this by reducing the global dirty limits significantly, which mitigates the above problem. But global configuration is rather problematic because it affects other applications; also, PostgreSQL itself doesn't always want this behavior, e.g. for temporary files it's undesirable.

Several operating systems allow some control over the kernel page cache. Linux has sync_file_range(2); several POSIX systems have msync(2) and posix_fadvise(2). sync_file_range(2) is preferable because it requires no special setup, whereas msync() requires the to-be-flushed range to be mmap'ed. For the purpose of flushing dirty data, posix_fadvise(2) is the worst alternative, as flushing dirty data is just a side effect of POSIX_FADV_DONTNEED, which also removes the pages from the page cache. Thus the feature is enabled by default only on Linux, but can be enabled on all systems that have any of the above APIs.

While desirable and likely possible, this patch does not contain an implementation for Windows.

With the infrastructure added, writes made via checkpointer, bgwriter and normal user backends can be flushed after a configurable number of writes. Each of these sources of writes is controlled by a separate GUC: checkpoint_flush_after, bgwriter_flush_after and backend_flush_after respectively. They're separate because the appropriate thresholds differ, and because the performance considerations of controlled flushing are different for each of them.

A later patch will add checkpoint sorting; after that, flushes from the checkpoint will almost always be desirable. Bgwriter flushes are most of the time going to be random, which is slow on lots of storage hardware. Flushing in backends works well if the storage and bgwriter can keep up, but if not it can have negative consequences. This patch is likely to have negative performance consequences without checkpoint sorting, but unfortunately so has sorting without flush control.

Discussion: alpine.DEB.2.10.1506011320000.28433@sto
Author: Fabien Coelho and Andres Freund
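
The idea described in the message above, hinting writeback after a configurable number of writes, can be sketched roughly as follows. This is a minimal illustration and not the patch's code: the helper names `hint_writeback` and `write_with_flush_hints` are hypothetical, hint failures are simply ignored, and sync_file_range() is only used when the C library exposes it (glibc requires `_GNU_SOURCE` to be defined before the first include); otherwise the sketch falls back to posix_fadvise().

```c
#define _GNU_SOURCE             /* exposes sync_file_range() on glibc */
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: ask the kernel to start writing back the dirty pages
 * in [offset, offset + nbytes). This is only a hint, so errors are ignored.
 */
static void
hint_writeback(int fd, off_t offset, off_t nbytes)
{
#if defined(SYNC_FILE_RANGE_WRITE)
    /* Starts asynchronous writeback; the pages stay in the page cache. */
    (void) sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE);
#elif defined(POSIX_FADV_DONTNEED)
    /* Flushing is only a side effect here; the pages are also evicted. */
    (void) posix_fadvise(fd, offset, nbytes, POSIX_FADV_DONTNEED);
#else
    (void) fd;
    (void) offset;
    (void) nbytes;              /* no usable API: do nothing */
#endif
}

/*
 * Write nblocks consecutive blocks of blcksz bytes, issuing a writeback hint
 * after every flush_after blocks (0 disables the hints). A trailing partial
 * batch is left to the kernel. Returns the number of hints issued, or -1 if
 * a write fails.
 */
static int
write_with_flush_hints(int fd, const char *block, size_t blcksz,
                       int nblocks, int flush_after)
{
    off_t pending_start = 0;
    int   pending = 0;
    int   hints = 0;

    assert(blcksz > 0 && nblocks >= 0 && flush_after >= 0);

    for (int i = 0; i < nblocks; i++)
    {
        off_t off = (off_t) i * (off_t) blcksz;

        if (pwrite(fd, block, blcksz, off) != (ssize_t) blcksz)
            return -1;
        if (pending == 0)
            pending_start = off;
        pending++;

        if (flush_after > 0 && pending >= flush_after)
        {
            hint_writeback(fd, pending_start, (off_t) pending * (off_t) blcksz);
            hints++;
            pending = 0;
        }
    }
    return hints;
}
```

With 64 blocks and a threshold of 16, the sketch issues four hints, so at most 16 blocks' worth of dirty data from this writer is left without a writeback hint at any time.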
1 parent c82c92b commit 428b1d6

15 files changed: +601 -31 lines changed

doc/src/sgml/config.sgml

+87
@@ -1843,6 +1843,35 @@ include_dir 'conf.d'
 </para>
 </listitem>
 </varlistentry>
+
+<varlistentry id="guc-bgwriter-flush-after" xreflabel="bgwriter_flush_after">
+<term><varname>bgwriter_flush_after</varname> (<type>int</type>)
+<indexterm>
+<primary><varname>bgwriter_flush_after</> configuration parameter</primary>
+</indexterm>
+</term>
+<listitem>
+<para>
+Whenever more than <varname>bgwriter_flush_after</varname> bytes have
+been written by the bgwriter, attempt to force the OS to issue these
+writes to the underlying storage. Doing so will limit the amount of
+dirty data in the kernel's page cache, reducing the likelihood of
+stalls when an fsync is issued at the end of a checkpoint, or when
+the OS writes data back in larger batches in the background. Often
+that will result in greatly reduced transaction latency, but there
+also are some cases, especially with workloads that are bigger than
+<xref linkend="guc-shared-buffers">, but smaller than the OS's page
+cache, where performance might degrade. This setting may have no
+effect on some platforms. The valid range is between
+<literal>0</literal>, which disables controlled writeback, and
+<literal>2MB</literal>. The default is <literal>512Kb</> on Linux,
+<literal>0</> elsewhere. (Non-default values of
+<symbol>BLCKSZ</symbol> change the default and maximum.)
+This parameter can only be set in the <filename>postgresql.conf</>
+file or on the server command line.
+</para>
+</listitem>
+</varlistentry>
 </variablelist>
 
 <para>
@@ -1944,6 +1973,35 @@ include_dir 'conf.d'
 </para>
 </listitem>
 </varlistentry>
+
+<varlistentry id="guc-backend-flush-after" xreflabel="backend_flush_after">
+<term><varname>backend_flush_after</varname> (<type>int</type>)
+<indexterm>
+<primary><varname>backend_flush_after</> configuration parameter</primary>
+</indexterm>
+</term>
+<listitem>
+<para>
+Whenever more than <varname>backend_flush_after</varname> bytes have
+been written by a single backend, attempt to force the OS to issue
+these writes to the underlying storage. Doing so will limit the
+amount of dirty data in the kernel's page cache, reducing the
+likelihood of stalls when an fsync is issued at the end of a
+checkpoint, or when the OS writes data back in larger batches in the
+background. Often that will result in greatly reduced transaction
+latency, but there also are some cases, especially with workloads
+that are bigger than <xref linkend="guc-shared-buffers">, but smaller
+than the OS's page cache, where performance might degrade. This
+setting may have no effect on some platforms. The valid range is
+between <literal>0</literal>, which disables controlled writeback,
+and <literal>2MB</literal>. The default is <literal>128Kb</> on
+Linux, <literal>0</> elsewhere. (Non-default values of
+<symbol>BLCKSZ</symbol> change the default and maximum.)
+This parameter can only be set in the <filename>postgresql.conf</>
+file or on the server command line.
+</para>
+</listitem>
+</varlistentry>
 </variablelist>
 </sect2>
 </sect1>
@@ -2475,6 +2533,35 @@ include_dir 'conf.d'
 </listitem>
 </varlistentry>
 
+<varlistentry id="guc-checkpoint-flush-after" xreflabel="checkpoint_flush_after">
+<term><varname>checkpoint_flush_after</varname> (<type>int</type>)
+<indexterm>
+<primary><varname>checkpoint_flush_after</> configuration parameter</primary>
+</indexterm>
+</term>
+<listitem>
+<para>
+Whenever more than <varname>checkpoint_flush_after</varname> bytes
+have been written while performing a checkpoint, attempt to force the
+OS to issue these writes to the underlying storage. Doing so will
+limit the amount of dirty data in the kernel's page cache, reducing
+the likelihood of stalls when an fsync is issued at the end of the
+checkpoint, or when the OS writes data back in larger batches in the
+background. Often that will result in greatly reduced transaction
+latency, but there also are some cases, especially with workloads
+that are bigger than <xref linkend="guc-shared-buffers">, but smaller
+than the OS's page cache, where performance might degrade. This
+setting may have no effect on some platforms. The valid range is
+between <literal>0</literal>, which disables controlled writeback,
+and <literal>2MB</literal>. The default is <literal>128Kb</> on
+Linux, <literal>0</> elsewhere. (Non-default values of
+<symbol>BLCKSZ</symbol> change the default and maximum.)
+This parameter can only be set in the <filename>postgresql.conf</>
+file or on the server command line.
+</para>
+</listitem>
+</varlistentry>
+
 <varlistentry id="guc-checkpoint-warning" xreflabel="checkpoint_warning">
 <term><varname>checkpoint_warning</varname> (<type>integer</type>)
 <indexterm>
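
The parenthetical about BLCKSZ in the three entries above suggests that the limits scale with the block size. The arithmetic can be sketched as follows, assuming (this is an inference from the documented sizes, not something shown in this part of the diff) that the settings are kept as page counts: 256 pages for the 2MB maximum, 64 pages for the 512kB bgwriter default and 16 pages for the 128kB defaults at the standard 8kB block size. The constant and function names are hypothetical.

```c
#include <assert.h>

/* Assumed page counts, derived from the documented sizes at BLCKSZ = 8192. */
enum
{
    ASSUMED_MAX_FLUSH_PAGES = 256,          /* 2MB   / 8kB */
    ASSUMED_BGWRITER_DEFAULT_PAGES = 64,    /* 512kB / 8kB */
    ASSUMED_SMALL_DEFAULT_PAGES = 16        /* 128kB / 8kB */
};

/* Size in kB that a number of pages covers for a block size in bytes. */
static long
pages_to_kb(int pages, long blcksz)
{
    assert(pages >= 0 && blcksz >= 1024 && blcksz % 1024 == 0);
    return (long) pages * (blcksz / 1024);
}

/* Number of whole pages that fit into a size in kB, rounding down. */
static int
kb_to_pages(long kb, long blcksz)
{
    assert(kb >= 0 && blcksz >= 1024 && blcksz % 1024 == 0);
    return (int) (kb / (blcksz / 1024));
}
```

Under these assumptions, `pages_to_kb(ASSUMED_BGWRITER_DEFAULT_PAGES, 8192)` gives 512, while the same page count with a 32kB block size covers 2048kB, which is how a non-default BLCKSZ would change both the default and the maximum.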

doc/src/sgml/wal.sgml

+11
@@ -545,6 +545,17 @@
 unexpected variation in the number of WAL segments needed.
 </para>
 
+<para>
+On Linux and POSIX platforms <xref linkend="guc-checkpoint-flush-after">
+allows to force the OS that pages written by the checkpoint should be
+flushed to disk after a configurable number of bytes. Otherwise, these
+pages may be kept in the OS's page cache, inducing a stall when
+<literal>fsync</> is issued at the end of a checkpoint. This setting will
+often help to reduce transaction latency, but it also can an adverse effect
+on performance; particularly for workloads that are bigger than
+<xref linkend="guc-shared-buffers">, but smaller than the OS's page cache.
+</para>
+
 <para>
 The number of WAL segment files in <filename>pg_xlog</> directory depends on
 <varname>min_wal_size</>, <varname>max_wal_size</> and

src/backend/postmaster/bgwriter.c

+7-1
@@ -111,6 +111,7 @@ BackgroundWriterMain(void)
     sigjmp_buf local_sigjmp_buf;
     MemoryContext bgwriter_context;
     bool prev_hibernate;
+    WritebackContext wb_context;
 
     /*
      * Properly accept or ignore signals the postmaster might send us.
@@ -164,6 +165,8 @@ BackgroundWriterMain(void)
         ALLOCSET_DEFAULT_MAXSIZE);
     MemoryContextSwitchTo(bgwriter_context);
 
+    WritebackContextInit(&wb_context, &bgwriter_flush_after);
+
     /*
      * If an exception is encountered, processing resumes here.
      *
@@ -208,6 +211,9 @@ BackgroundWriterMain(void)
         /* Flush any leaked data in the top-level context */
         MemoryContextResetAndDeleteChildren(bgwriter_context);
 
+        /* re-initilialize to avoid repeated errors causing problems */
+        WritebackContextInit(&wb_context, &bgwriter_flush_after);
+
         /* Now we can allow interrupts again */
         RESUME_INTERRUPTS();
 
@@ -272,7 +278,7 @@ BackgroundWriterMain(void)
         /*
          * Do one cycle of dirty-buffer writing.
          */
-        can_hibernate = BgBufferSync();
+        can_hibernate = BgBufferSync(&wb_context);
 
         /*
          * Send off activity statistics to the stats collector
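
The hunks above initialize a writeback context with a pointer to the setting, re-initialize it after an error, and pass it into each write cycle. The sketch below illustrates why such a context might hold a pointer instead of a copied value: a changed setting takes effect without re-initialization, and re-initializing discards any queued state. `ToyWritebackContext` and its functions are hypothetical stand-ins; the real `WritebackContext` definition is not part of the lines shown here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the patch's WritebackContext. */
typedef struct ToyWritebackContext
{
    const int *max_pending;     /* points at the setting, so changes are seen */
    int        nr_pending;      /* writes queued since the last flush */
} ToyWritebackContext;

/* Bind the context to a setting and drop any queued state. */
static void
toy_context_init(ToyWritebackContext *ctx, const int *max_pending)
{
    assert(ctx != NULL && max_pending != NULL);
    ctx->max_pending = max_pending;
    ctx->nr_pending = 0;
}

/*
 * Record one write. Returns 1 when the configured limit has been reached and
 * a flush would be issued, 0 otherwise. A limit of 0 disables flushing.
 */
static int
toy_schedule_writeback(ToyWritebackContext *ctx)
{
    assert(ctx != NULL && ctx->max_pending != NULL);

    if (*ctx->max_pending <= 0)
    {
        ctx->nr_pending = 0;    /* disabled: forget anything queued */
        return 0;
    }

    ctx->nr_pending++;
    if (ctx->nr_pending >= *ctx->max_pending)
    {
        /* A real implementation would hand the queued ranges to the OS here. */
        ctx->nr_pending = 0;
        return 1;
    }
    return 0;
}
```

Calling `toy_context_init` again, as the error-recovery hunk does with `WritebackContextInit`, simply resets the pending count, so state left over from a failed cycle cannot trigger the same failure repeatedly.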

src/backend/storage/buffer/buf_init.c

+5
@@ -23,6 +23,7 @@ char *BufferBlocks;
 LWLockMinimallyPadded *BufferIOLWLockArray = NULL;
 LWLockTranche BufferIOLWLockTranche;
 LWLockTranche BufferContentLWLockTranche;
+WritebackContext BackendWritebackContext;
 
 
 /*
@@ -149,6 +150,10 @@ InitBufferPool(void)
 
     /* Init other shared buffer-management stuff */
     StrategyInitialize(!foundDescs);
+
+    /* Initialize per-backend file flush context */
+    WritebackContextInit(&BackendWritebackContext,
+                         &backend_flush_after);
 }
 
 /*
