133 | 133 | (<acronym>BBU</>) disk controllers. In such setups, the synchronize
134 | 134 | command forces all data from the controller cache to the disks,
135 | 135 | eliminating much of the benefit of the BBU. You can run the
136 |     | - <xref linkend="pgtestfsync"> module to see
    | 136 | + <xref linkend="pgtestfsync"> program to see
137 | 137 | if you are affected. If you are affected, the performance benefits
138 | 138 | of the BBU can be regained by turning off write barriers in
139 | 139 | the file system or reconfiguring the disk controller, if that is

372 | 372 | asynchronous commit, but it is actually a synchronous commit method
373 | 373 | (in fact, <varname>commit_delay</varname> is ignored during an
374 | 374 | asynchronous commit). <varname>commit_delay</varname> causes a delay
375 |     | - just before a synchronous commit attempts to flush
376 |     | - <acronym>WAL</acronym> to disk, in the hope that a single flush
377 |     | - executed by one such transaction can also serve other transactions
378 |     | - committing at about the same time. Setting <varname>commit_delay</varname>
379 |     | - can only help when there are many concurrently committing transactions.
    | 375 | + just before a transaction flushes <acronym>WAL</acronym> to disk, in
    | 376 | + the hope that a single flush executed by one such transaction can also
    | 377 | + serve other transactions committing at about the same time. The
    | 378 | + setting can be thought of as a way of increasing the time window in
    | 379 | + which transactions can join a group about to participate in a single
    | 380 | + flush, to amortize the cost of the flush among multiple transactions.
380 | 381 | </para>
381 | 382 |
382 | 383 | </sect1>
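The amortization argument in the added lines can be illustrated with a toy model. This is an editorial sketch, not part of the patch, and the 2.0 ms flush cost is an invented figure:

```python
# Toy model of group commit: one WAL flush has a fixed cost, and every
# transaction whose commit record rides along shares that cost equally.
# The 2.0 ms figure below is hypothetical, chosen only for illustration.

def per_transaction_flush_cost(flush_cost_ms, group_size):
    """Average flush cost paid by each transaction in one group."""
    return flush_cost_ms / group_size

solo = per_transaction_flush_cost(2.0, 1)      # a lone committer pays it all
grouped = per_transaction_flush_cost(2.0, 10)  # ten committers share one flush
print(solo, grouped)  # 2.0 0.2
```

Widening the join window with `commit_delay` raises the likely group size, which is exactly the quantity in the denominator here.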

394 | 395 | <para>
395 | 396 | <firstterm>Checkpoints</firstterm><indexterm><primary>checkpoint</></>
396 | 397 | are points in the sequence of transactions at which it is guaranteed
397 |     | - that the heap and index data files have been updated with all information written before
398 |     | - the checkpoint. At checkpoint time, all dirty data pages are flushed to
399 |     | - disk and a special checkpoint record is written to the log file.
400 |     | - (The changes were previously flushed to the <acronym>WAL</acronym> files.)
    | 398 | + that the heap and index data files have been updated with all
    | 399 | + information written before that checkpoint. At checkpoint time, all
    | 400 | + dirty data pages are flushed to disk and a special checkpoint record is
    | 401 | + written to the log file. (The change records were previously flushed
    | 402 | + to the <acronym>WAL</acronym> files.)
401 | 403 | In the event of a crash, the crash recovery procedure looks at the latest
402 | 404 | checkpoint record to determine the point in the log (known as the redo
403 | 405 | record) from which it should start the REDO operation. Any changes made to
404 |     | - data files before that point are guaranteed to be already on disk. Hence, after
405 |     | - a checkpoint, log segments preceding the one containing
    | 406 | + data files before that point are guaranteed to be already on disk.
    | 407 | + Hence, after a checkpoint, log segments preceding the one containing
406 | 408 | the redo record are no longer needed and can be recycled or removed. (When
407 | 409 | <acronym>WAL</acronym> archiving is being done, the log segments must be
408 | 410 | archived before being recycled or removed.)

411 | 413 | <para>
412 | 414 | The checkpoint requirement of flushing all dirty data pages to disk
413 | 415 | can cause a significant I/O load. For this reason, checkpoint
414 |     | - activity is throttled so I/O begins at checkpoint start and completes
415 |     | - before the next checkpoint starts; this minimizes performance
    | 416 | + activity is throttled so that I/O begins at checkpoint start and completes
    | 417 | + before the next checkpoint is due to start; this minimizes performance
416 | 418 | degradation during checkpoints.
417 | 419 | </para>
418 | 420 |
419 | 421 | <para>
420 | 422 | The server's checkpointer process automatically performs
421 |     | - a checkpoint every so often. A checkpoint is created every <xref
    | 423 | + a checkpoint every so often. A checkpoint is begun every <xref
422 | 424 | linkend="guc-checkpoint-segments"> log segments, or every <xref
423 | 425 | linkend="guc-checkpoint-timeout"> seconds, whichever comes first.
424 | 426 | The default settings are 3 segments and 300 seconds (5 minutes), respectively.
425 |     | - In cases where no WAL has been written since the previous checkpoint, new
426 |     | - checkpoints will be skipped even if checkpoint_timeout has passed.
427 |     | - If WAL archiving is being used and you want to put a lower limit on
428 |     | - how often files are archived in order to bound potential data
429 |     | - loss, you should adjust archive_timeout parameter rather than the checkpoint
430 |     | - parameters. It is also possible to force a checkpoint by using the SQL
    | 427 | + If no WAL has been written since the previous checkpoint, new checkpoints
    | 428 | + will be skipped even if <varname>checkpoint_timeout</> has passed.
    | 429 | + (If WAL archiving is being used and you want to put a lower limit on how
    | 430 | + often files are archived in order to bound potential data loss, you should
    | 431 | + adjust the <xref linkend="guc-archive-timeout"> parameter rather than the
    | 432 | + checkpoint parameters.)
    | 433 | + It is also possible to force a checkpoint by using the SQL
431 | 434 | command <command>CHECKPOINT</command>.
432 | 435 | </para>
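The triggering rule stated above — a checkpoint begins every `checkpoint_segments` log segments or every `checkpoint_timeout` seconds, whichever comes first, and is skipped when no WAL has been written — can be sketched as a predicate. This is an illustration only, not the server's actual logic:

```python
def checkpoint_due(segments_since_last, seconds_since_last,
                   wal_written_since_last,
                   checkpoint_segments=3, checkpoint_timeout=300):
    """Should a new checkpoint begin?  Defaults mirror the documented ones."""
    if not wal_written_since_last:
        return False  # skip: no WAL since the previous checkpoint
    return (segments_since_last >= checkpoint_segments
            or seconds_since_last >= checkpoint_timeout)

print(checkpoint_due(3, 120, True))   # True: segment limit reached first
print(checkpoint_due(1, 300, True))   # True: timeout reached first
print(checkpoint_due(0, 400, False))  # False: nothing to checkpoint
```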
433 | 436 |
434 | 437 | <para>
435 | 438 | Reducing <varname>checkpoint_segments</varname> and/or
436 | 439 | <varname>checkpoint_timeout</varname> causes checkpoints to occur
437 |     | - more often. This allows faster after-crash recovery (since less work
438 |     | - will need to be redone). However, one must balance this against the
    | 440 | + more often. This allows faster after-crash recovery, since less work
    | 441 | + will need to be redone. However, one must balance this against the
439 | 442 | increased cost of flushing dirty data pages more often. If
440 | 443 | <xref linkend="guc-full-page-writes"> is set (as is the default), there is
441 | 444 | another factor to consider. To ensure data page consistency,

450 | 453 | Checkpoints are fairly expensive, first because they require writing
451 | 454 | out all currently dirty buffers, and second because they result in
452 | 455 | extra subsequent WAL traffic as discussed above. It is therefore
453 |     | - wise to set the checkpointing parameters high enough that checkpoints
    | 456 | + wise to set the checkpointing parameters high enough so that checkpoints
454 | 457 | don't happen too often. As a simple sanity check on your checkpointing
455 | 458 | parameters, you can set the <xref linkend="guc-checkpoint-warning">
456 | 459 | parameter. If checkpoints happen closer together than

498 | 501 | altered when building the server). You can use this to estimate space
499 | 502 | requirements for <acronym>WAL</acronym>.
500 | 503 | Ordinarily, when old log segment files are no longer needed, they
501 |     | - are recycled (renamed to become the next segments in the numbered
    | 504 | + are recycled (that is, renamed to become future segments in the numbered
502 | 505 | sequence). If, due to a short-term peak of log output rate, there
503 | 506 | are more than 3 * <varname>checkpoint_segments</varname> + 1
504 | 507 | segment files, the unneeded segment files will be deleted instead
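The 3 * `checkpoint_segments` + 1 threshold above gives a quick bound on the space normally occupied by kept-around segment files. Assuming the standard 16 MB segment size (which, as the surrounding text notes, can be altered when building the server):

```python
SEGMENT_SIZE_MB = 16  # standard build-time default; may differ on your build

def recycle_limit_mb(checkpoint_segments=3):
    """Space taken by the most segment files normally kept for recycling."""
    return (3 * checkpoint_segments + 1) * SEGMENT_SIZE_MB

print(recycle_limit_mb(3))  # 160: ten 16 MB files with the default setting
```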

507 | 510 |
508 | 511 | <para>
509 | 512 | In archive recovery or standby mode, the server periodically performs
510 |     | - <firstterm>restartpoints</><indexterm><primary>restartpoint</></>
    | 513 | + <firstterm>restartpoints</>,<indexterm><primary>restartpoint</></>
511 | 514 | which are similar to checkpoints in normal operation: the server forces
512 | 515 | all its state to disk, updates the <filename>pg_control</> file to
513 | 516 | indicate that the already-processed WAL data need not be scanned again,
514 |     | - and then recycles any old log segment files in <filename>pg_xlog</>
515 |     | - directory. A restartpoint is triggered if at least one checkpoint record
516 |     | - has been replayed and <varname>checkpoint_timeout</> seconds have passed
517 |     | - since last restartpoint. In standby mode, a restartpoint is also triggered
518 |     | - if <varname>checkpoint_segments</> log segments have been replayed since
519 |     | - last restartpoint and at least one checkpoint record has been replayed.
    | 517 | + and then recycles any old log segment files in the <filename>pg_xlog</>
    | 518 | + directory.
520 | 519 | Restartpoints can't be performed more frequently than checkpoints in the
521 | 520 | master because restartpoints can only be performed at checkpoint records.
    | 521 | + A restartpoint is triggered when a checkpoint record is reached if at
    | 522 | + least <varname>checkpoint_timeout</> seconds have passed since the last
    | 523 | + restartpoint. In standby mode, a restartpoint is also triggered if at
    | 524 | + least <varname>checkpoint_segments</> log segments have been replayed
    | 525 | + since the last restartpoint.
522 | 526 | </para>
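The restartpoint rule in the added lines can likewise be sketched as a predicate (illustrative only; the real decision is made by the checkpointer when a checkpoint record is reached during replay):

```python
def restartpoint_due(at_checkpoint_record, seconds_since_last,
                     segments_replayed_since_last, standby_mode,
                     checkpoint_timeout=300, checkpoint_segments=3):
    """May a restartpoint be triggered at this point in replay?"""
    if not at_checkpoint_record:
        return False  # restartpoints are only possible at checkpoint records
    if seconds_since_last >= checkpoint_timeout:
        return True
    # In standby mode, replayed WAL volume is an additional trigger.
    return standby_mode and segments_replayed_since_last >= checkpoint_segments

print(restartpoint_due(True, 300, 0, False))   # True: timeout elapsed
print(restartpoint_due(True, 10, 3, True))     # True: standby segment trigger
print(restartpoint_due(False, 999, 99, True))  # False: not at a record
```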
523 | 527 |
524 | 528 | <para>
525 | 529 | There are two commonly used internal <acronym>WAL</acronym> functions:
526 |     | - <function>LogInsert</function> and <function>LogFlush</function>.
527 |     | - <function>LogInsert</function> is used to place a new record into
    | 530 | + <function>XLogInsert</function> and <function>XLogFlush</function>.
    | 531 | + <function>XLogInsert</function> is used to place a new record into
528 | 532 | the <acronym>WAL</acronym> buffers in shared memory. If there is no
529 |     | - space for the new record, <function>LogInsert</function> will have
    | 533 | + space for the new record, <function>XLogInsert</function> will have
530 | 534 | to write (move to kernel cache) a few filled <acronym>WAL</acronym>
531 |     | - buffers. This is undesirable because <function>LogInsert</function>
    | 535 | + buffers. This is undesirable because <function>XLogInsert</function>
532 | 536 | is used on every database low level modification (for example, row
533 | 537 | insertion) at a time when an exclusive lock is held on affected
534 | 538 | data pages, so the operation needs to be as fast as possible. What
535 | 539 | is worse, writing <acronym>WAL</acronym> buffers might also force the
536 | 540 | creation of a new log segment, which takes even more
537 | 541 | time. Normally, <acronym>WAL</acronym> buffers should be written
538 |     | - and flushed by a <function>LogFlush</function> request, which is
    | 542 | + and flushed by an <function>XLogFlush</function> request, which is
539 | 543 | made, for the most part, at transaction commit time to ensure that
540 | 544 | transaction records are flushed to permanent storage. On systems
541 |     | - with high log output, <function>LogFlush</function> requests might
542 |     | - not occur often enough to prevent <function>LogInsert</function>
    | 545 | + with high log output, <function>XLogFlush</function> requests might
    | 546 | + not occur often enough to prevent <function>XLogInsert</function>
543 | 547 | from having to do writes. On such systems
544 | 548 | one should increase the number of <acronym>WAL</acronym> buffers by
545 |     | - modifying the configuration parameter <xref
546 |     | - linkend="guc-wal-buffers">. When
    | 549 | + modifying the <xref linkend="guc-wal-buffers"> parameter. When
547 | 550 | <xref linkend="guc-full-page-writes"> is set and the system is very busy,
548 |     | - setting this value higher will help smooth response times during the
549 |     | - period immediately following each checkpoint.
    | 551 | + setting <varname>wal_buffers</> higher will help smooth response times
    | 552 | + during the period immediately following each checkpoint.
550 | 553 | </para>
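The pressure described above — `XLogInsert` having to write out filled buffers itself when `XLogFlush` cannot keep up — can be mimicked with a toy counter. This is a deliberately crude editorial model, not the server's actual buffer management:

```python
class WalBuffers:
    """Toy model of a fixed pool of WAL buffers in shared memory."""

    def __init__(self, n_buffers):
        self.n_buffers = n_buffers
        self.dirty = 0                 # filled, not-yet-written buffers
        self.forced_insert_writes = 0  # undesirable writes done by inserts

    def xlog_insert(self):
        if self.dirty == self.n_buffers:
            # No free buffer: the insert must write one out itself, while
            # an exclusive lock is held on the affected data page.
            self.forced_insert_writes += 1
            self.dirty -= 1
        self.dirty += 1

    def xlog_flush(self):
        self.dirty = 0                 # commit-time flush empties the pool

buf = WalBuffers(n_buffers=4)
for _ in range(6):        # heavy log output with no intervening commit
    buf.xlog_insert()
print(buf.forced_insert_writes)  # 2: a larger pool would avoid these
```

Raising `wal_buffers` plays the role of a larger `n_buffers` here.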
551 | 554 |
552 | 555 | <para>
553 | 556 | The <xref linkend="guc-commit-delay"> parameter defines for how many
554 |     | - microseconds the server process will sleep after writing a commit
555 |     | - record to the log with <function>LogInsert</function> but before
556 |     | - performing a <function>LogFlush</function>. This delay allows other
557 |     | - server processes to add their commit records to the log so as to have all
558 |     | - of them flushed with a single log sync. No sleep will occur if
559 |     | - <xref linkend="guc-fsync">
560 |     | - is not enabled, or if fewer than <xref linkend="guc-commit-siblings">
561 |     | - other sessions are currently in active transactions; this avoids
562 |     | - sleeping when it's unlikely that any other session will commit soon.
563 |     | - Note that on most platforms, the resolution of a sleep request is
564 |     | - ten milliseconds, so that any nonzero <varname>commit_delay</varname>
565 |     | - setting between 1 and 10000 microseconds would have the same effect.
566 |     | - Good values for these parameters are not yet clear; experimentation
567 |     | - is encouraged.
    | 557 | + microseconds a group commit leader process will sleep after acquiring a
    | 558 | + lock within <function>XLogFlush</function>, while group commit
    | 559 | + followers queue up behind the leader. This delay allows other server
    | 560 | + processes to add their commit records to the WAL buffers so that all of
    | 561 | + them will be flushed by the leader's eventual sync operation. No sleep
    | 562 | + will occur if <xref linkend="guc-fsync"> is not enabled, or if fewer
    | 563 | + than <xref linkend="guc-commit-siblings"> other sessions are currently
    | 564 | + in active transactions; this avoids sleeping when it's unlikely that
    | 565 | + any other session will commit soon. Note that on some platforms, the
    | 566 | + resolution of a sleep request is ten milliseconds, so that any nonzero
    | 567 | + <varname>commit_delay</varname> setting between 1 and 10000
    | 568 | + microseconds would have the same effect. Note also that on some
    | 569 | + platforms, sleep operations may take slightly longer than requested by
    | 570 | + the parameter.
    | 571 | + </para>
    | 572 | +
    | 573 | + <para>
    | 574 | + Since the purpose of <varname>commit_delay</varname> is to allow the
    | 575 | + cost of each flush operation to be amortized across concurrently
    | 576 | + committing transactions (potentially at the expense of transaction
    | 577 | + latency), it is necessary to quantify that cost before the setting can
    | 578 | + be chosen intelligently. The higher that cost is, the more effective
    | 579 | + <varname>commit_delay</varname> is expected to be in increasing
    | 580 | + transaction throughput, up to a point. The <xref
    | 581 | + linkend="pgtestfsync"> program can be used to measure the average time
    | 582 | + in microseconds that a single WAL flush operation takes. A value of
    | 583 | + half of the average time the program reports it takes to flush after a
    | 584 | + single 8kB write operation is often the most effective setting for
    | 585 | + <varname>commit_delay</varname>, so this value is recommended as the
    | 586 | + starting point to use when optimizing for a particular workload. While
    | 587 | + tuning <varname>commit_delay</varname> is particularly useful when the
    | 588 | + WAL log is stored on high-latency rotating disks, benefits can be
    | 589 | + significant even on storage media with very fast sync times, such as
    | 590 | + solid-state drives or RAID arrays with a battery-backed write cache;
    | 591 | + but this should definitely be tested against a representative workload.
    | 592 | + Higher values of <varname>commit_siblings</varname> should be used in
    | 593 | + such cases, whereas smaller <varname>commit_siblings</varname> values
    | 594 | + are often helpful on higher latency media. Note that it is quite
    | 595 | + possible that a setting of <varname>commit_delay</varname> that is too
    | 596 | + high can increase transaction latency by so much that total transaction
    | 597 | + throughput suffers.
    | 598 | + </para>
    | 599 | +
    | 600 | + <para>
    | 601 | + When <varname>commit_delay</varname> is set to zero (the default), it
    | 602 | + is still possible for a form of group commit to occur, but each group
    | 603 | + will consist only of sessions that reach the point where they need to
    | 604 | + flush their commit records during the window in which the previous
    | 605 | + flush operation (if any) is occurring. At higher client counts a
    | 606 | + <quote>gangway effect</> tends to occur, so that the effects of group
    | 607 | + commit become significant even when <varname>commit_delay</varname> is
    | 608 | + zero, and thus explicitly setting <varname>commit_delay</varname> tends
    | 609 | + to help less. Setting <varname>commit_delay</varname> can only help
    | 610 | + when (1) there are some concurrently committing transactions, and (2)
    | 611 | + throughput is limited to some degree by commit rate; but with high
    | 612 | + rotational latency this setting can be effective in increasing
    | 613 | + transaction throughput with as few as two clients (that is, a single
    | 614 | + committing client with one sibling transaction).
568 | 615 | </para>
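The tuning rule added above — start from half the average time pg_test_fsync reports for a flush after a single 8kB write — is simple arithmetic; the 3000 microsecond sample below is an invented figure:

```python
def suggested_commit_delay(avg_flush_usec):
    """Recommended starting point: half the average WAL flush time, in
    microseconds, as reported by pg_test_fsync for a single 8kB write."""
    return avg_flush_usec / 2

# Suppose pg_test_fsync reported an average of 3000 microseconds per
# single 8kB write flush (hypothetical, typical of rotating disks):
print(suggested_commit_delay(3000))  # 1500.0 -> try commit_delay = 1500
```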
569 | 616 |
570 | 617 | <para>

574 | 621 | All the options should be the same in terms of reliability, with
575 | 622 | the exception of <literal>fsync_writethrough</>, which can sometimes
576 | 623 | force a flush of the disk cache even when other options do not do so.
577 |     | - However, it's quite platform-specific which one will be the fastest;
578 |     | - you can test option speeds using the <xref
579 |     | - linkend="pgtestfsync"> module.
    | 624 | + However, it's quite platform-specific which one will be the fastest.
    | 625 | + You can test the speeds of different options using the <xref
    | 626 | + linkend="pgtestfsync"> program.
580 | 627 | Note that this parameter is irrelevant if <varname>fsync</varname>
581 | 628 | has been turned off.
582 | 629 | </para>

585 | 632 | Enabling the <xref linkend="guc-wal-debug"> configuration parameter
586 | 633 | (provided that <productname>PostgreSQL</productname> has been
587 | 634 | compiled with support for it) will result in each
588 |     | - <function>LogInsert</function> and <function>LogFlush</function>
    | 635 | + <function>XLogInsert</function> and <function>XLogFlush</function>
589 | 636 | <acronym>WAL</acronym> call being logged to the server log. This
590 | 637 | option might be replaced by a more general mechanism in the future.
591 | 638 | </para>