|
1 | | -<!-- $Header: /cvsroot/pgsql/doc/src/sgml/wal.sgml,v 1.11 2001/09/29 04:02:19 tgl Exp $ --> |
| 1 | +<!-- $Header: /cvsroot/pgsql/doc/src/sgml/wal.sgml,v 1.12 2001/10/26 23:10:21 tgl Exp $ --> |
2 | 2 |
|
3 | 3 | <chapter id="wal">
|
4 | 4 | <title>Write-Ahead Logging (<acronym>WAL</acronym>)</title>
|
|
88 | 88 | transaction identifiers. Once UNDO is implemented,
|
89 | 89 | <filename>pg_clog</filename> will no longer be required to be
|
90 | 90 | permanent; it will be possible to remove
|
91 | | - <filename>pg_clog</filename> at shutdown, split it into segments |
92 | | - and remove old segments. |
| 91 | + <filename>pg_clog</filename> at shutdown. (However, the urgency |
| 92 | + of this concern has decreased greatly with the adoption of a segmented |
| 93 | + storage method for <filename>pg_clog</filename> --- it is no longer |
| 94 | + necessary to keep old <filename>pg_clog</filename> entries around |
| 95 | + forever.) |
93 | 96 | </para>
|
94 | 97 |
|
95 | 98 | <para>
|
|
116 | 119 | copying the data files (operating system copy commands are not
|
117 | 120 | suitable).
|
118 | 121 | </para>
|
| 122 | + |
| 123 | + <para> |
| 124 | + A difficulty standing in the way of realizing these benefits is that they |
| 125 | + require saving <acronym>WAL</acronym> entries for considerable periods |
| 126 | + of time (e.g., as long as the longest possible transaction if transaction |
| 127 | + UNDO is wanted). The present <acronym>WAL</acronym> format is |
| 128 | + extremely bulky since it includes many disk page snapshots. |
| 129 | + This is not a serious concern at present, since the entries only need |
| 130 | + to be kept for one or two checkpoint intervals; but to achieve |
| 131 | + these future benefits some sort of compressed <acronym>WAL</acronym> |
| 132 | + format will be needed. |
| 133 | + </para> |
119 | 134 | </sect2>
|
120 | 135 | </sect1>
|
121 | 136 |
|
|
133 | 148 | <para>
|
134 | 149 | <acronym>WAL</acronym> logs are stored in the directory
|
135 | 150 | <Filename><replaceable>$PGDATA</replaceable>/pg_xlog</Filename>, as
|
136 | | - a set of segment files, each 16 MB in size. Each segment is |
137 | | - divided into 8 kB pages. The log record headers are described in |
| 151 | + a set of segment files, each 16MB in size. Each segment is |
| 152 | + divided into 8KB pages. The log record headers are described in |
138 | 153 | <filename>access/xlog.h</filename>; record content is dependent on
|
139 | 154 | the type of event that is being logged. Segment files are given
|
140 | 155 | ever-increasing numbers as names, starting at
|
|
147 | 162 | The <acronym>WAL</acronym> buffers and control structure are in
|
148 | 163 | shared memory, and are handled by the backends; they are protected
|
149 | 164 | by lightweight locks. The demand on shared memory is dependent on the
|
150 | | - number of buffers; the default size of the <acronym>WAL</acronym> |
151 | | - buffers is 64 kB. |
| 165 | + number of buffers. The default size of the <acronym>WAL</acronym> |
| 166 | + buffers is 8 buffers of 8KB each, or 64KB. |
152 | 167 | </para>
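
To make the arithmetic above concrete, here is a minimal Python sketch of the
shared-memory demand; the 8KB page size and the buffer counts come from the
text above, and wal_buffer_memory is an illustrative helper, not a PostgreSQL
function:

    # Shared memory consumed by the WAL buffers: one 8KB page per buffer.
    XLOG_PAGE_BYTES = 8 * 1024

    def wal_buffer_memory(wal_buffers: int = 8) -> int:
        """Bytes of shared memory taken by the given number of WAL buffers."""
        return wal_buffers * XLOG_PAGE_BYTES

    print(wal_buffer_memory())    # 65536 bytes = 64KB, the default of 8 buffers
    print(wal_buffer_memory(32))  # 262144 bytes = 256KB
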
|
153 | 168 |
|
154 | 169 | <para>
|
|
166 | 181 | disk drives that falsely report a successful write to the kernel,
|
167 | 182 | when, in fact, they have only cached the data and not yet stored it
|
168 | 183 | on the disk. A power failure in such a situation may still lead to
|
169 | | - irrecoverable data corruption; administrators should try to ensure |
170 | | - that disks holding <productname>PostgreSQL</productname>'s data and |
| 184 | + irrecoverable data corruption. Administrators should try to ensure |
| 185 | + that disks holding <productname>PostgreSQL</productname>'s |
171 | 186 | log files do not make such false reports.
|
172 | 187 | </para>
|
173 | 188 |
|
|
179 | 194 | checkpoint's position is saved in the file
|
180 | 195 | <filename>pg_control</filename>. Therefore, when recovery is to be
|
181 | 196 | done, the backend first reads <filename>pg_control</filename> and
|
182 | | - then the checkpoint record; next it reads the redo record, whose |
183 | | - position is saved in the checkpoint, and begins the REDO operation. |
184 | | - Because the entire content of the pages is saved in the log on the |
185 | | - first page modification after a checkpoint, the pages will be first |
186 | | - restored to a consistent state. |
| 197 | + then the checkpoint record; then it performs the REDO operation by |
| 198 | + scanning forward from the log position indicated in the checkpoint |
| 199 | + record. |
| 200 | + Because the entire content of data pages is saved in the log on the |
| 201 | + first page modification after a checkpoint, all pages changed since |
| 202 | + the checkpoint will be restored to a consistent state. |
187 | 203 | </para>
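
As an illustration of the recovery sequence just described, the following
Python sketch models the log as a simple list of records; the record layout
and the field names are invented for the example and are not PostgreSQL's
actual on-disk structures:

    # Toy model of crash recovery: pg_control points at the last checkpoint
    # record, which in turn stores the position (the redo pointer) to replay from.
    wal = [
        {"pos": 0, "action": "update page 17"},
        {"pos": 1, "action": "checkpoint", "redo": 0},
        {"pos": 2, "action": "update page 42"},
        {"pos": 3, "action": "update page 17"},
    ]
    pg_control = {"last_checkpoint": 1}

    def recover(pg_control, wal):
        checkpoint = wal[pg_control["last_checkpoint"]]  # read pg_control, then the checkpoint record
        for record in wal[checkpoint["redo"]:]:          # scan forward from the redo position
            if record["action"] == "checkpoint":
                continue                                 # checkpoint records carry no page changes
            print("replaying:", record["action"])        # re-apply each logged change

    recover(pg_control, wal)
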
|
188 | 204 |
|
189 | 205 | <para>
|
|
217 | 233 | buffers. This is undesirable because <function>LogInsert</function>
|
218 | 234 | is used on every database low level modification (for example,
|
219 | 235 | tuple insertion) at a time when an exclusive lock is held on
|
220 | | - affected data pages and the operation is supposed to be as fast as |
221 | | - possible; what is worse, writing <acronym>WAL</acronym> buffers may |
222 | | - also cause the creation of a new log segment, which takes even more |
| 236 | + affected data pages, so the operation needs to be as fast as |
| 237 | + possible. What is worse, writing <acronym>WAL</acronym> buffers may |
| 238 | + also force the creation of a new log segment, which takes even more |
223 | 239 | time. Normally, <acronym>WAL</acronym> buffers should be written
|
224 | 240 | and flushed by a <function>LogFlush</function> request, which is
|
225 | 241 | made, for the most part, at transaction commit time to ensure that
|
|
230 | 246 | one should increase the number of <acronym>WAL</acronym> buffers by
|
231 | 247 | modifying the <varname>WAL_BUFFERS</varname> parameter. The default
|
232 | 248 | number of <acronym>WAL</acronym> buffers is 8. Increasing this
|
233 | | - value will have an impact on shared memory usage. |
| 249 | + value will correspondingly increase shared memory usage. |
234 | 250 | </para>
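
The following toy simulation (an assumption made for illustration, not
PostgreSQL's actual buffer-management code) shows why a larger WAL_BUFFERS
setting reduces the chance that LogInsert itself has to write pages: records
accumulate in the buffers and, ideally, are only written by LogFlush at commit.

    # Count how often an insert finds every buffer page occupied and must write
    # one out itself, for a transaction that fills pages_needed WAL pages
    # before committing (the numbers are made up for the illustration).
    def forced_writes_by_insert(pages_needed: int, wal_buffers: int) -> int:
        forced = 0
        pages_in_memory = 0
        for _ in range(pages_needed):
            if pages_in_memory == wal_buffers:   # no free buffer: LogInsert must write one
                forced += 1
                pages_in_memory -= 1
            pages_in_memory += 1                 # the new page now occupies a buffer
        return forced                            # remaining pages are flushed at commit

    print(forced_writes_by_insert(pages_needed=25, wal_buffers=8))   # 17 forced writes
    print(forced_writes_by_insert(pages_needed=25, wal_buffers=32))  # 0 forced writes
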
|
235 | 251 |
|
236 | 252 | <para>
|
|
243 | 259 | log (known as the redo record) it should start the REDO operation,
|
244 | 260 | since any changes made to data files before that record are already
|
245 | 261 | on disk. After a checkpoint has been made, any log segments written
|
246 | | - before the undo records are removed, so checkpoints are used to free |
247 | | - disk space in the <acronym>WAL</acronym> directory. (When |
248 | | - <acronym>WAL</acronym>-based <acronym>BAR</acronym> is implemented, |
249 | | - the log segments can be archived instead of just being removed.) |
250 | | - The checkpoint maker is also able to create a few log segments for |
251 | | - future use, so as to avoid the need for |
252 | | - <function>LogInsert</function> or <function>LogFlush</function> to |
253 | | - spend time in creating them. |
| 262 | + before the undo records are no longer needed and can be recycled or |
| 263 | + removed. (When <acronym>WAL</acronym>-based <acronym>BAR</acronym> is |
| 264 | + implemented, the log segments would be archived before being recycled |
| 265 | + or removed.) |
254 | 266 | </para>
|
255 | 267 |
|
256 | 268 | <para>
|
257 | | - The <acronym>WAL</acronym> log is held on the disk as a set of 16 |
258 | | - MB files called <firstterm>segments</firstterm>. By default a new |
259 | | - segment is created only if more than 75% of the current segment is |
260 | | - used. One can instruct the server to pre-create up to 64 log segments |
| 269 | + The checkpoint maker is also able to create a few log segments for |
| 270 | + future use, so as to avoid the need for |
| 271 | + <function>LogInsert</function> or <function>LogFlush</function> to |
| 272 | + spend time in creating them. (If that happens, the entire database |
| 273 | + system will be delayed by the creation operation, so it's better if |
| 274 | + the files can be created in the checkpoint maker, which is not on |
| 275 | + anyone's critical path.) |
| 276 | + By default a new 16MB segment file is created only if more than 75% of |
| 277 | + the current segment has been used. This is inadequate if the system |
| 278 | + generates more than 4MB of log output between checkpoints. |
| 279 | + One can instruct the server to pre-create up to 64 log segments |
261 | 280 | at checkpoint time by modifying the <varname>WAL_FILES</varname>
|
262 | 281 | configuration parameter.
|
263 | 282 | </para>
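
As a quick check of the 4MB figure above: it is just the 25% of a 16MB segment
that remains below the 75% threshold (illustrative arithmetic only, in Python):

    # A further segment is pre-created only once 75% of the current 16MB
    # segment is used, so at most 25% of a segment is guaranteed to be
    # ready in advance of the inserts that will need it.
    SEGMENT_MB = 16
    CREATION_THRESHOLD = 0.75

    headroom_mb = SEGMENT_MB * (1 - CREATION_THRESHOLD)
    print(headroom_mb)  # 4.0 MB -- past this rate of WAL output per checkpoint
                        # interval, raising WAL_FILES may be worthwhile
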
|
264 | 283 |
|
265 | | - <para> |
266 | | - For faster after-crash recovery, it would be better to create |
267 | | - checkpoints more often. However, one should balance this against |
268 | | - the cost of flushing dirty data pages; in addition, to ensure data |
269 | | - page consistency, the first modification of a data page after each |
270 | | - checkpoint results in logging the entire page content, thus |
271 | | - increasing output to log and the log's size. |
272 | | - </para> |
273 | | - |
274 | 284 | <para>
|
275 | 285 | The postmaster spawns a special backend process every so often
|
276 | 286 | to create the next checkpoint. A checkpoint is created every
|
|
281 | 291 | <command>CHECKPOINT</command>.
|
282 | 292 | </para>
|
283 | 293 |
|
| 294 | + <para> |
| 295 | + Reducing <varname>CHECKPOINT_SEGMENTS</varname> and/or |
| 296 | + <varname>CHECKPOINT_TIMEOUT</varname> causes checkpoints to be |
| 297 | + done more often. This allows faster after-crash recovery (since |
| 298 | + less work will need to be redone). However, one must balance this against |
| 299 | + the increased cost of flushing dirty data pages more often. In addition, |
| 300 | + to ensure data page consistency, the first modification of a data page |
| 301 | + after each checkpoint results in logging the entire page content. |
| 302 | + Thus a smaller checkpoint interval increases the volume of output to |
| 303 | + the log, partially negating the goal of using a smaller interval, and |
| 304 | + in any case causing more disk I/O. |
| 305 | + </para> |
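
A toy calculation (all numbers are assumed, purely for illustration) of the
effect just described: the first change to each page after a checkpoint logs
the whole 8KB page, while later changes log only small records, so a shorter
checkpoint interval inflates the total volume of WAL written.

    # WAL volume for `updates` row updates spread evenly over `distinct_pages`
    # pages and `intervals` checkpoint intervals; assumes every page receives
    # its first post-checkpoint modification in each interval.
    def wal_volume_kb(updates: int, distinct_pages: int, intervals: int,
                      page_kb: int = 8, record_kb: float = 0.1) -> float:
        full_page_images = min(updates, distinct_pages * intervals)
        return full_page_images * page_kb + (updates - full_page_images) * record_kb

    print(wal_volume_kb(100_000, 1_000, intervals=1))   # 17900.0 KB
    print(wal_volume_kb(100_000, 1_000, intervals=10))  # 89000.0 KB
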
| 306 | + |
| 307 | + <para> |
| 308 | + The number of 16MB segment files will always be at least |
| 309 | + <varname>WAL_FILES</varname> + 1, and will normally not exceed |
| 310 | + <varname>WAL_FILES</varname> + 2 * <varname>CHECKPOINT_SEGMENTS</varname> |
| 311 | + + 1. This may be used to estimate space requirements for WAL. Ordinarily, |
| 312 | + when an old log segment file is no longer needed, it is recycled (renamed |
| 313 | + to become the next sequential future segment). If, due to a short-term |
| 314 | + peak of log output rate, there are more than <varname>WAL_FILES</varname> + |
| 315 | + 2 * <varname>CHECKPOINT_SEGMENTS</varname> + 1 segment files, then unneeded |
| 316 | + segment files will be deleted instead of recycled until the system gets |
| 317 | + back under this limit. (If this happens on a regular basis, |
| 318 | + <varname>WAL_FILES</varname> should be increased to avoid it. Deleting log |
| 319 | + segments that will only have to be created again later is expensive and |
| 320 | + pointless.) |
| 321 | + </para> |
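
A worked example of the space estimate above (the WAL_FILES and
CHECKPOINT_SEGMENTS values below are hypothetical settings, not
recommendations):

    # pg_xlog normally holds between WAL_FILES + 1 and
    # WAL_FILES + 2 * CHECKPOINT_SEGMENTS + 1 segment files of 16MB each.
    SEGMENT_MB = 16

    def wal_space_mb(wal_files: int, checkpoint_segments: int):
        low = (wal_files + 1) * SEGMENT_MB
        high = (wal_files + 2 * checkpoint_segments + 1) * SEGMENT_MB
        return low, high

    print(wal_space_mb(wal_files=0, checkpoint_segments=3))   # (16, 112) MB
    print(wal_space_mb(wal_files=8, checkpoint_segments=16))  # (144, 656) MB
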
| 322 | + |
284 | 323 | <para>
|
285 | 324 | The <varname>COMMIT_DELAY</varname> parameter defines for how many
|
286 | 325 | microseconds the backend will sleep after writing a commit
|
|
294 | 333 | Note that on most platforms, the resolution of a sleep request is
|
295 | 334 | ten milliseconds, so that any nonzero <varname>COMMIT_DELAY</varname>
|
296 | 335 | setting between 1 and 10000 microseconds will have the same effect.
|
| 336 | + Good values for these parameters are not yet clear; experimentation |
| 337 | + is encouraged. |
297 | 338 | </para>
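
To see the effect of that resolution limit, here is a small Python sketch; the
model of the kernel rounding each nonzero sleep up to a whole 10ms tick is an
assumption made for illustration:

    import math

    # Effective sleep for a COMMIT_DELAY value when the kernel rounds sleep
    # requests up to 10-millisecond ticks; zero means no sleep at all.
    def effective_delay_us(commit_delay_us: int, tick_us: int = 10_000) -> int:
        if commit_delay_us == 0:
            return 0
        return math.ceil(commit_delay_us / tick_us) * tick_us

    print(effective_delay_us(1))       # 10000 us
    print(effective_delay_us(10000))   # 10000 us -- same effect as setting 1
    print(effective_delay_us(15000))   # 20000 us
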
|
298 | 339 |
|
299 | 340 | <para>
|
|