Commit 70ec2f8

Improve the documentation about commit_delay.

Clarify the docs explaining what commit_delay does, and add a recommendation about a useful value for it, namely half of the single-page fsync time reported by pg_test_fsync. This is informed by testing of the new-in-9.3 implementation of commit_delay; in prior versions it was far harder to arrive at a useful setting.

In passing, do some wordsmithing and markup-fixing in the same general area.

Also, change pg_test_fsync's default time-per-test from 2 seconds to 5. The old value was about the minimum at which the results could be taken seriously at all, and so seems a tad optimistic as a default.

Peter Geoghegan, reviewed by Noah Misch; some additional editing by me

1 parent dcafdbc commit 70ec2f8
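
The recommendation in the commit message (start from half of the single-page fsync time reported by pg_test_fsync) can be illustrated with a small helper. This is only a sketch: `suggested_commit_delay_us` and `MAX_COMMIT_DELAY_US` are hypothetical names that are not part of PostgreSQL, and the 100000-microsecond cap is an assumption about the range the server accepts for commit_delay. The result is a starting point for tuning against a representative workload, not a final value.

```c
/*
 * Hypothetical helper, not part of PostgreSQL: derive a starting value for
 * commit_delay from the average single-8kB-write fsync time (microseconds)
 * that pg_test_fsync reports for the wal_sync_method in use.
 */

/* Assumed upper bound of the commit_delay setting, in microseconds. */
#define MAX_COMMIT_DELAY_US 100000

int
suggested_commit_delay_us(double fsync_usecs)
{
	double		half;

	/* A non-positive measurement gives no basis for a delay. */
	if (fsync_usecs <= 0.0)
		return 0;

	/* Recommended starting point: half of the measured flush time. */
	half = fsync_usecs / 2.0;

	/* Clamp to the assumed maximum of the setting. */
	if (half > MAX_COMMIT_DELAY_US)
		return MAX_COMMIT_DELAY_US;

	/* Round to the nearest whole microsecond. */
	return (int) (half + 0.5);
}
```

For example, if pg_test_fsync reports roughly 8000 microseconds per flush on a rotating disk, `suggested_commit_delay_us(8000.0)` yields 4000, which would then be written as `commit_delay = 4000` and refined by testing.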

4 files changed, with 118 additions and 70 deletions.

contrib/pg_test_fsync/pg_test_fsync.c

Lines changed: 1 addition & 1 deletion
@@ -60,7 +60,7 @@ do { \
 
 static const char *progname;
 
-static int secs_per_test = 2;
+static int secs_per_test = 5;
 static int needs_unlink = 0;
 static char full_buf[XLOG_SEG_SIZE],
 *buf,

doc/src/sgml/config.sgml

Lines changed: 5 additions & 4 deletions
@@ -1603,8 +1603,8 @@ include 'filename'
 <title>Write Ahead Log</title>
 
 <para>
-See also <xref linkend="wal-configuration"> for details on WAL
-and checkpoint tuning.
+For additional information on tuning these settings,
+see <xref linkend="wal-configuration">.
 </para>
 
 <sect2 id="runtime-config-wal-settings">
@@ -1957,7 +1957,7 @@ include 'filename'
 given interval. However, it also increases latency by up to
 <varname>commit_delay</varname> microseconds for each WAL
 flush. Because the delay is just wasted if no other transactions
-become ready to commit, it is only performed if at least
+become ready to commit, a delay is only performed if at least
 <varname>commit_siblings</varname> other transactions are active
 immediately before a flush would otherwise have been initiated.
 In <productname>PostgreSQL</> releases prior to 9.3,
@@ -1968,7 +1968,8 @@ include 'filename'
 the first process that becomes ready to flush waits for the configured
 interval, while subsequent processes wait only until the leader
 completes the flush. The default <varname>commit_delay</> is zero
-(no delay), and only honored if <varname>fsync</varname> is enabled.
+(no delay). No delays are performed unless <varname>fsync</varname>
+is enabled.
 </para>
 </listitem>
 </varlistentry>

doc/src/sgml/pgtestfsync.sgml

Lines changed: 4 additions & 4 deletions
@@ -36,8 +36,8 @@
 difference in real database throughput, especially since many database servers
 are not speed-limited by their transaction logs.
 <application>pg_test_fsync</application> reports average file sync operation
-time in microseconds for each wal_sync_method, which can be used to inform
-efforts to optimize the value of <varname>commit_delay</varname>.
+time in microseconds for each wal_sync_method, which can also be used to
+inform efforts to optimize the value of <xref linkend="guc-commit-delay">.
 </para>
 </refsect1>
 
@@ -72,8 +72,8 @@
 <para>
 Specifies the number of seconds for each test. The more time
 per test, the greater the test's accuracy, but the longer it takes
-to run. The default is 2 seconds, which allows the program to
-complete in about 30 seconds.
+to run. The default is 5 seconds, which allows the program to
+complete in under 2 minutes.
 </para>
 </listitem>
 </varlistentry>

doc/src/sgml/wal.sgml

Lines changed: 108 additions & 61 deletions
@@ -133,7 +133,7 @@
 (<acronym>BBU</>) disk controllers. In such setups, the synchronize
 command forces all data from the controller cache to the disks,
 eliminating much of the benefit of the BBU. You can run the
-<xref linkend="pgtestfsync"> module to see
+<xref linkend="pgtestfsync"> program to see
 if you are affected. If you are affected, the performance benefits
 of the BBU can be regained by turning off write barriers in
 the file system or reconfiguring the disk controller, if that is
@@ -372,11 +372,12 @@
 asynchronous commit, but it is actually a synchronous commit method
 (in fact, <varname>commit_delay</varname> is ignored during an
 asynchronous commit). <varname>commit_delay</varname> causes a delay
-just before a synchronous commit attempts to flush
-<acronym>WAL</acronym> to disk, in the hope that a single flush
-executed by one such transaction can also serve other transactions
-committing at about the same time. Setting <varname>commit_delay</varname>
-can only help when there are many concurrently committing transactions.
+just before a transaction flushes <acronym>WAL</acronym> to disk, in
+the hope that a single flush executed by one such transaction can also
+serve other transactions committing at about the same time. The
+setting can be thought of as a way of increasing the time window in
+which transactions can join a group about to participate in a single
+flush, to amortize the cost of the flush among multiple transactions.
 </para>
 
 </sect1>
@@ -394,15 +395,16 @@
 <para>
 <firstterm>Checkpoints</firstterm><indexterm><primary>checkpoint</></>
 are points in the sequence of transactions at which it is guaranteed
-that the heap and index data files have been updated with all information written before
-the checkpoint. At checkpoint time, all dirty data pages are flushed to
-disk and a special checkpoint record is written to the log file.
-(The changes were previously flushed to the <acronym>WAL</acronym> files.)
+that the heap and index data files have been updated with all
+information written before that checkpoint. At checkpoint time, all
+dirty data pages are flushed to disk and a special checkpoint record is
+written to the log file. (The change records were previously flushed
+to the <acronym>WAL</acronym> files.)
 In the event of a crash, the crash recovery procedure looks at the latest
 checkpoint record to determine the point in the log (known as the redo
 record) from which it should start the REDO operation. Any changes made to
-data files before that point are guaranteed to be already on disk. Hence, after
-a checkpoint, log segments preceding the one containing
+data files before that point are guaranteed to be already on disk.
+Hence, after a checkpoint, log segments preceding the one containing
 the redo record are no longer needed and can be recycled or removed. (When
 <acronym>WAL</acronym> archiving is being done, the log segments must be
 archived before being recycled or removed.)
@@ -411,31 +413,32 @@
 <para>
 The checkpoint requirement of flushing all dirty data pages to disk
 can cause a significant I/O load. For this reason, checkpoint
-activity is throttled so I/O begins at checkpoint start and completes
-before the next checkpoint starts; this minimizes performance
+activity is throttled so that I/O begins at checkpoint start and completes
+before the next checkpoint is due to start; this minimizes performance
 degradation during checkpoints.
 </para>
 
 <para>
 The server's checkpointer process automatically performs
-a checkpoint every so often. A checkpoint is created every <xref
+a checkpoint every so often. A checkpoint is begun every <xref
 linkend="guc-checkpoint-segments"> log segments, or every <xref
 linkend="guc-checkpoint-timeout"> seconds, whichever comes first.
 The default settings are 3 segments and 300 seconds (5 minutes), respectively.
-In cases where no WAL has been written since the previous checkpoint, new
-checkpoints will be skipped even if checkpoint_timeout has passed.
-If WAL archiving is being used and you want to put a lower limit on
-how often files are archived in order to bound potential data
-loss, you should adjust archive_timeout parameter rather than the checkpoint
-parameters. It is also possible to force a checkpoint by using the SQL
+If no WAL has been written since the previous checkpoint, new checkpoints
+will be skipped even if <varname>checkpoint_timeout</> has passed.
+(If WAL archiving is being used and you want to put a lower limit on how
+often files are archived in order to bound potential data loss, you should
+adjust the <xref linkend="guc-archive-timeout"> parameter rather than the
+checkpoint parameters.)
+It is also possible to force a checkpoint by using the SQL
 command <command>CHECKPOINT</command>.
 </para>
 
 <para>
 Reducing <varname>checkpoint_segments</varname> and/or
 <varname>checkpoint_timeout</varname> causes checkpoints to occur
-more often. This allows faster after-crash recovery (since less work
-will need to be redone). However, one must balance this against the
+more often. This allows faster after-crash recovery, since less work
+will need to be redone. However, one must balance this against the
 increased cost of flushing dirty data pages more often. If
 <xref linkend="guc-full-page-writes"> is set (as is the default), there is
 another factor to consider. To ensure data page consistency,
@@ -450,7 +453,7 @@
 Checkpoints are fairly expensive, first because they require writing
 out all currently dirty buffers, and second because they result in
 extra subsequent WAL traffic as discussed above. It is therefore
-wise to set the checkpointing parameters high enough that checkpoints
+wise to set the checkpointing parameters high enough so that checkpoints
 don't happen too often. As a simple sanity check on your checkpointing
 parameters, you can set the <xref linkend="guc-checkpoint-warning">
 parameter. If checkpoints happen closer together than
@@ -498,7 +501,7 @@
 altered when building the server). You can use this to estimate space
 requirements for <acronym>WAL</acronym>.
 Ordinarily, when old log segment files are no longer needed, they
-are recycled (renamed to become the next segments in the numbered
+are recycled (that is, renamed to become future segments in the numbered
 sequence). If, due to a short-term peak of log output rate, there
 are more than 3 * <varname>checkpoint_segments</varname> + 1
 segment files, the unneeded segment files will be deleted instead
@@ -507,64 +510,108 @@
 
 <para>
 In archive recovery or standby mode, the server periodically performs
-<firstterm>restartpoints</><indexterm><primary>restartpoint</></>
+<firstterm>restartpoints</>,<indexterm><primary>restartpoint</></>
 which are similar to checkpoints in normal operation: the server forces
 all its state to disk, updates the <filename>pg_control</> file to
 indicate that the already-processed WAL data need not be scanned again,
-and then recycles any old log segment files in <filename>pg_xlog</>
-directory. A restartpoint is triggered if at least one checkpoint record
-has been replayed and <varname>checkpoint_timeout</> seconds have passed
-since last restartpoint. In standby mode, a restartpoint is also triggered
-if <varname>checkpoint_segments</> log segments have been replayed since
-last restartpoint and at least one checkpoint record has been replayed.
+and then recycles any old log segment files in the <filename>pg_xlog</>
+directory.
 Restartpoints can't be performed more frequently than checkpoints in the
 master because restartpoints can only be performed at checkpoint records.
+A restartpoint is triggered when a checkpoint record is reached if at
+least <varname>checkpoint_timeout</> seconds have passed since the last
+restartpoint. In standby mode, a restartpoint is also triggered if at
+least <varname>checkpoint_segments</> log segments have been replayed
+since the last restartpoint.
 </para>
 
 <para>
 There are two commonly used internal <acronym>WAL</acronym> functions:
-<function>LogInsert</function> and <function>LogFlush</function>.
-<function>LogInsert</function> is used to place a new record into
+<function>XLogInsert</function> and <function>XLogFlush</function>.
+<function>XLogInsert</function> is used to place a new record into
 the <acronym>WAL</acronym> buffers in shared memory. If there is no
-space for the new record, <function>LogInsert</function> will have
+space for the new record, <function>XLogInsert</function> will have
 to write (move to kernel cache) a few filled <acronym>WAL</acronym>
-buffers. This is undesirable because <function>LogInsert</function>
+buffers. This is undesirable because <function>XLogInsert</function>
 is used on every database low level modification (for example, row
 insertion) at a time when an exclusive lock is held on affected
 data pages, so the operation needs to be as fast as possible. What
 is worse, writing <acronym>WAL</acronym> buffers might also force the
 creation of a new log segment, which takes even more
 time. Normally, <acronym>WAL</acronym> buffers should be written
-and flushed by a <function>LogFlush</function> request, which is
+and flushed by an <function>XLogFlush</function> request, which is
 made, for the most part, at transaction commit time to ensure that
 transaction records are flushed to permanent storage. On systems
-with high log output, <function>LogFlush</function> requests might
-not occur often enough to prevent <function>LogInsert</function>
+with high log output, <function>XLogFlush</function> requests might
+not occur often enough to prevent <function>XLogInsert</function>
 from having to do writes. On such systems
 one should increase the number of <acronym>WAL</acronym> buffers by
-modifying the configuration parameter <xref
-linkend="guc-wal-buffers">. When
+modifying the <xref linkend="guc-wal-buffers"> parameter. When
 <xref linkend="guc-full-page-writes"> is set and the system is very busy,
-setting this value higher will help smooth response times during the
-period immediately following each checkpoint.
+setting <varname>wal_buffers</> higher will help smooth response times
+during the period immediately following each checkpoint.
 </para>
 
 <para>
 The <xref linkend="guc-commit-delay"> parameter defines for how many
-microseconds the server process will sleep after writing a commit
-record to the log with <function>LogInsert</function> but before
-performing a <function>LogFlush</function>. This delay allows other
-server processes to add their commit records to the log so as to have all
-of them flushed with a single log sync. No sleep will occur if
-<xref linkend="guc-fsync">
-is not enabled, or if fewer than <xref linkend="guc-commit-siblings">
-other sessions are currently in active transactions; this avoids
-sleeping when it's unlikely that any other session will commit soon.
-Note that on most platforms, the resolution of a sleep request is
-ten milliseconds, so that any nonzero <varname>commit_delay</varname>
-setting between 1 and 10000 microseconds would have the same effect.
-Good values for these parameters are not yet clear; experimentation
-is encouraged.
+microseconds a group commit leader process will sleep after acquiring a
+lock within <function>XLogFlush</function>, while group commit
+followers queue up behind the leader. This delay allows other server
+processes to add their commit records to the WAL buffers so that all of
+them will be flushed by the leader's eventual sync operation. No sleep
+will occur if <xref linkend="guc-fsync"> is not enabled, or if fewer
+than <xref linkend="guc-commit-siblings"> other sessions are currently
+in active transactions; this avoids sleeping when it's unlikely that
+any other session will commit soon. Note that on some platforms, the
+resolution of a sleep request is ten milliseconds, so that any nonzero
+<varname>commit_delay</varname> setting between 1 and 10000
+microseconds would have the same effect. Note also that on some
+platforms, sleep operations may take slightly longer than requested by
+the parameter.
+</para>
+
+<para>
+Since the purpose of <varname>commit_delay</varname> is to allow the
+cost of each flush operation to be amortized across concurrently
+committing transactions (potentially at the expense of transaction
+latency), it is necessary to quantify that cost before the setting can
+be chosen intelligently. The higher that cost is, the more effective
+<varname>commit_delay</varname> is expected to be in increasing
+transaction throughput, up to a point. The <xref
+linkend="pgtestfsync"> program can be used to measure the average time
+in microseconds that a single WAL flush operation takes. A value of
+half of the average time the program reports it takes to flush after a
+single 8kB write operation is often the most effective setting for
+<varname>commit_delay</varname>, so this value is recommended as the
+starting point to use when optimizing for a particular workload. While
+tuning <varname>commit_delay</varname> is particularly useful when the
+WAL log is stored on high-latency rotating disks, benefits can be
+significant even on storage media with very fast sync times, such as
+solid-state drives or RAID arrays with a battery-backed write cache;
+but this should definitely be tested against a representative workload.
+Higher values of <varname>commit_siblings</varname> should be used in
+such cases, whereas smaller <varname>commit_siblings</varname> values
+are often helpful on higher latency media. Note that it is quite
+possible that a setting of <varname>commit_delay</varname> that is too
+high can increase transaction latency by so much that total transaction
+throughput suffers.
+</para>
+
+<para>
+When <varname>commit_delay</varname> is set to zero (the default), it
+is still possible for a form of group commit to occur, but each group
+will consist only of sessions that reach the point where they need to
+flush their commit records during the window in which the previous
+flush operation (if any) is occurring. At higher client counts a
+<quote>gangway effect</> tends to occur, so that the effects of group
+commit become significant even when <varname>commit_delay</varname> is
+zero, and thus explicitly setting <varname>commit_delay</varname> tends
+to help less. Setting <varname>commit_delay</varname> can only help
+when (1) there are some concurrently committing transactions, and (2)
+throughput is limited to some degree by commit rate; but with high
+rotational latency this setting can be effective in increasing
+transaction throughput with as few as two clients (that is, a single
+committing client with one sibling transaction).
 </para>
 
 <para>
@@ -574,9 +621,9 @@
 All the options should be the same in terms of reliability, with
 the exception of <literal>fsync_writethrough</>, which can sometimes
 force a flush of the disk cache even when other options do not do so.
-However, it's quite platform-specific which one will be the fastest;
-you can test option speeds using the <xref
-linkend="pgtestfsync"> module.
+However, it's quite platform-specific which one will be the fastest.
+You can test the speeds of different options using the <xref
+linkend="pgtestfsync"> program.
 Note that this parameter is irrelevant if <varname>fsync</varname>
 has been turned off.
 </para>
@@ -585,7 +632,7 @@
 Enabling the <xref linkend="guc-wal-debug"> configuration parameter
 (provided that <productname>PostgreSQL</productname> has been
 compiled with support for it) will result in each
-<function>LogInsert</function> and <function>LogFlush</function>
+<function>XLogInsert</function> and <function>XLogFlush</function>
 <acronym>WAL</acronym> call being logged to the server log. This
 option might be replaced by a more general mechanism in the future.
 </para>
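
The conditions under which the revised wal.sgml text says a group commit leader sleeps can be summarized as a single predicate. This is a minimal sketch under stated assumptions: `leader_should_sleep` and its parameter names are hypothetical, and it only restates the documented rules (a nonzero commit_delay, fsync enabled, and at least commit_siblings other sessions in active transactions). It is not the server's actual code, which may be structured differently.

```c
#include <stdbool.h>

/*
 * Hypothetical predicate, not PostgreSQL source: decide whether a group
 * commit leader should sleep for commit_delay before flushing WAL,
 * following the rules stated in the documentation text above.
 *
 * commit_delay_us    configured commit_delay, in microseconds
 * fsync_enabled      whether the fsync setting is on
 * active_siblings    other sessions currently in active transactions
 * commit_siblings    configured commit_siblings threshold
 */
bool
leader_should_sleep(int commit_delay_us, bool fsync_enabled,
					int active_siblings, int commit_siblings)
{
	/* Zero (the default) means no delay is configured. */
	if (commit_delay_us <= 0)
		return false;

	/* No delays are performed unless fsync is enabled. */
	if (!fsync_enabled)
		return false;

	/* Sleep only if enough other transactions might commit soon. */
	return active_siblings >= commit_siblings;
}
```

With commit_siblings set to 5, a call such as `leader_should_sleep(4000, true, 4, 5)` returns false, because fewer than five other sessions are in active transactions and the delay would likely be wasted.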
