| 1 | +src/backend/storage/page/README |
| 2 | + |
| 3 | +Checksums |
| 4 | +--------- |
| 5 | + |
| 6 | +Checksums on data pages are designed to detect corruption by the I/O system. |
| 7 | +We do not protect buffers against uncorrectable memory errors, since these |
| 8 | +have a very low measured incidence according to research on large server farms, |
| 9 | +http://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf, discussed on |
| 10 | +the -hackers list on 2010/12/22. |
| 11 | + |
| 12 | +Currently, checksums must be enabled system-wide at initdb time. |
| 13 | + |
| 14 | +The checksum is not valid at all times on a data page! |
| 15 | +The checksum is valid when the page leaves the shared pool and is checked |
| 16 | +when it later re-enters the shared pool as a result of I/O. |
| 17 | +We set the checksum on a buffer in the shared pool immediately before we |
| 18 | +flush the buffer. As a result we implicitly invalidate the page's checksum |
| 19 | +when we modify the page for a data change or even a hint. This means that |
| 20 | +many or even most pages in shared buffers have invalid page checksums, |
| 21 | +so be careful how you interpret the pd_checksum field. |
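
The lifecycle described above can be illustrated with a toy model. This is
only a sketch under stated assumptions: ToyPage, toy_flush(), toy_read(),
toy_modify() and toy_checksum_is_current() are hypothetical names, and the
Fletcher-16 sum is a stand-in, not the checksum algorithm PostgreSQL uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SIZE 64

/* Hypothetical toy page: a 16-bit checksum field plus payload bytes. */
typedef struct ToyPage
{
	uint16_t	checksum;
	uint8_t		data[TOY_PAGE_SIZE];
} ToyPage;

/* Stand-in 16-bit checksum (Fletcher-16); NOT PostgreSQL's algorithm. */
uint16_t
toy_checksum(const uint8_t *buf, size_t len)
{
	uint16_t	a = 0;
	uint16_t	b = 0;

	for (size_t i = 0; i < len; i++)
	{
		a = (uint16_t) ((a + buf[i]) % 255);
		b = (uint16_t) ((b + a) % 255);
	}
	return (uint16_t) ((b << 8) | a);
}

/* Flush: stamp the checksum immediately before the page leaves the pool. */
void
toy_flush(ToyPage *shared, ToyPage *disk)
{
	shared->checksum = toy_checksum(shared->data, TOY_PAGE_SIZE);
	*disk = *shared;
}

/* Read: verify as the page re-enters the pool. Returns 1 if OK, else 0. */
int
toy_read(const ToyPage *disk, ToyPage *shared)
{
	if (disk->checksum != toy_checksum(disk->data, TOY_PAGE_SIZE))
		return 0;
	*shared = *disk;
	return 1;
}

/* In-memory change: the checksum field is deliberately left untouched. */
void
toy_modify(ToyPage *shared, size_t off, uint8_t value)
{
	assert(off < TOY_PAGE_SIZE);
	shared->data[off] = value;
}

/* Does the stored checksum match the current contents? */
int
toy_checksum_is_current(const ToyPage *page)
{
	return page->checksum == toy_checksum(page->data, TOY_PAGE_SIZE);
}
```

In this model, a page that has been flushed and then changed with toy_modify()
carries a stale checksum until the next toy_flush(), which is why the stored
field cannot be trusted for a page that is sitting in the pool.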
| 22 | + |
| 23 | +That means that WAL-logged changes to a page do NOT update the page checksum, |
| 24 | +so full page images may not have a valid checksum. But those page images have |
| 25 | +the WAL CRC covering them and so are verified separately from this |
| 26 | +mechanism. WAL replay should not test the checksum of a full-page image. |
| 27 | + |
| 28 | +The best way to understand this is that WAL CRCs protect records entering the |
| 29 | +WAL stream, and data page verification protects blocks entering the shared |
| 30 | +buffer pool. They are similar in purpose, yet completely separate. Together |
| 31 | +they ensure we are able to detect errors in data re-entering |
| 32 | +PostgreSQL-controlled memory. Note also that the WAL checksum is a 32-bit CRC, |
| 33 | +whereas the page checksum is only 16 bits. |
| 34 | + |
| 35 | +Any write of a data block can cause a torn page if the write is unsuccessful. |
| 36 | +Full page writes, which are stored in WAL, protect us from that. Setting hint |
| 37 | +bits when a page is already dirty is OK because a full page write must already |
| 38 | +have been written for it since the last checkpoint. Setting hint bits on an |
| 39 | +otherwise clean page can allow torn pages; this doesn't normally matter since |
| 40 | +they are just hints, but when the page has checksums, then losing a few bits |
| 41 | +would cause the checksum to be invalid. So if we have full_page_writes = on |
| 42 | +and checksums enabled then we must write a WAL record specifically so that we |
| 43 | +record a full page image in WAL. Hint bit updates should be protected using |
| 44 | +MarkBufferDirtyHint(), which is responsible for writing the full-page image |
| 45 | +when necessary. |
| 46 | + |
| 47 | +New WAL records cannot be written during recovery, so hint bits set during |
| 48 | +recovery must not dirty the page if the buffer is not already dirty, when |
| 49 | +checksums are enabled. Systems in Hot-Standby mode may benefit from hint bits |
| 50 | +being set, but with checksums enabled, a page cannot be dirtied after setting a |
| 51 | +hint bit (due to the torn page risk). So, the standby must wait for full-page |
| 52 | +images containing the hint bit updates to arrive from the master. |
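
The hint bit rules in the last two paragraphs can be summarized as a small
decision function. This is a sketch, not the logic of MarkBufferDirtyHint():
HintAction and hint_action() are hypothetical names, "buffer already dirty"
is used as a simplification of "a full page image has been written since the
last checkpoint", and the final case (checksums on, full_page_writes off,
clean page, not in recovery) is an assumption, since the text above does not
spell it out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical outcome of setting a hint bit on a page. */
typedef enum HintAction
{
	HINT_DIRTY_ONLY,			/* just mark the buffer dirty */
	HINT_LOG_FPI_THEN_DIRTY,	/* write a full-page image to WAL first */
	HINT_SKIP_DIRTY				/* set the hint in memory, do not dirty */
} HintAction;

HintAction
hint_action(bool checksums_enabled, bool full_page_writes,
			bool buffer_already_dirty, bool in_recovery)
{
	/* Without checksums a torn hint bit is harmless. */
	if (!checksums_enabled)
		return HINT_DIRTY_ONLY;

	/* Already dirty: a full-page image was written since the checkpoint. */
	if (buffer_already_dirty)
		return HINT_DIRTY_ONLY;

	/* No new WAL can be written during recovery, so leave the page clean. */
	if (in_recovery)
		return HINT_SKIP_DIRTY;

	/* Clean page: protect the hint with a full-page image in WAL. */
	if (full_page_writes)
		return HINT_LOG_FPI_THEN_DIRTY;

	/* Assumption: with full_page_writes off, no image is logged. */
	return HINT_DIRTY_ONLY;
}

/* Usage example covering the three cases the text states explicitly. */
void
hint_action_selfcheck(void)
{
	assert(hint_action(true, true, true, false) == HINT_DIRTY_ONLY);
	assert(hint_action(true, true, false, false) == HINT_LOG_FPI_THEN_DIRTY);
	assert(hint_action(true, true, false, true) == HINT_SKIP_DIRTY);
}
```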