Add new README file for pages/checksums

simonat2ndQuadrant · simonat2ndQuadrant · commit 9df56f6d91c3 · 2013-03-22T14:21:58.000Z
diff --git a/src/backend/storage/page/README b/src/backend/storage/page/README
@@ -0,0 +1,52 @@
+src/backend/storage/page/README
+
+Checksums
+---------
+
+Checksums on data pages are designed to detect corruption by the I/O system.
+We do not protect buffers against uncorrectable memory errors, since these
+have a very low measured incidence according to research on large server farms,
+http://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf, discussed
+on the -hackers list, 2010/12/22.
+
+Currently, checksums must be enabled system-wide at initdb time.
+
+The checksum on a data page is not always valid!
+The checksum is valid when the page leaves the shared pool and is checked
+when it later re-enters the shared pool as a result of I/O.
+We set the checksum on a buffer in the shared pool immediately before we
+flush the buffer. As a result we implicitly invalidate the page's checksum
+when we modify the page for a data change or even a hint. This means that
+many or even most pages in shared buffers have invalid page checksums,
+so be careful how you interpret the pd_checksum field.
+
+That means that WAL-logged changes to a page do NOT update the page checksum,
+so full page images may not have a valid checksum. But those page images have
+the WAL CRC covering them and so are verified separately from this
+mechanism. WAL replay should not test the checksum of a full-page image.
+
+The best way to understand this is that WAL CRCs protect records entering the
+WAL stream, and data page verification protects blocks entering the shared
+buffer pool. They are similar in purpose, yet completely separate.  Together
+they ensure we are able to detect errors in data re-entering
+PostgreSQL-controlled memory. Note also that the WAL checksum is a 32-bit CRC,
+whereas the page checksum is only 16 bits.
+
+Any write of a data block can cause a torn page if the write is unsuccessful.
+Full page writes, which are stored in WAL, protect us from that.  Setting hint
+bits when a page is already dirty is OK because a full page write must already
+have been written for it since the last checkpoint.  Setting hint bits on an
+otherwise clean page can allow torn pages; this doesn't normally matter since
+they are just hints, but when the page has checksums, then losing a few bits
+would cause the checksum to be invalid.  So if we have full_page_writes = on
+and checksums enabled then we must write a WAL record specifically so that we
+record a full page image in WAL.  Hint bit updates should be protected using
+MarkBufferDirtyHint(), which is responsible for writing the full-page image
+when necessary.
+
+New WAL records cannot be written during recovery, so when checksums are
+enabled, hint bits set during recovery must not dirty the page if the buffer
+is not already dirty.  Systems in Hot-Standby mode may benefit from hint bits
+being set, but with checksums enabled, a clean page cannot be dirtied after
+setting a hint bit (due to the torn page risk).  So the standby must wait for
+full-page images containing the hint bit updates to arrive from the master.
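
Note on the hint-bit rules: the last two paragraphs of the README can be read
as a small decision table. The C sketch below is a toy model of that text only,
not PostgreSQL's MarkBufferDirtyHint(). The names HintAction and hint_action()
are hypothetical, and the outcome for checksums enabled with
full_page_writes = off is an assumption, since the README does not spell that
case out.

```c
#include <stdbool.h>

/* Hypothetical names; this models the README text, not the bufmgr code. */
typedef enum
{
    HINT_DIRTY_ONLY,         /* mark the buffer dirty, no extra WAL needed */
    HINT_LOG_FPI_AND_DIRTY,  /* write a full-page image to WAL, then dirty */
    HINT_SET_BUT_KEEP_CLEAN  /* set the hint in memory, do not dirty the page */
} HintAction;

/*
 * Decide what setting a hint bit on a buffer should do, following the
 * README's rules in order.
 */
HintAction
hint_action(bool already_dirty, bool checksums,
            bool full_page_writes, bool in_recovery)
{
    /* A dirty page already had a full-page image since the last checkpoint. */
    if (already_dirty)
        return HINT_DIRTY_ONLY;

    /* Without checksums, a torn hint-bit write is harmless. */
    if (!checksums)
        return HINT_DIRTY_ONLY;

    /* No new WAL can be written during recovery, so the page stays clean. */
    if (in_recovery)
        return HINT_SET_BUT_KEEP_CLEAN;

    /* Clean page, checksums on: protect it with a full-page image. */
    if (full_page_writes)
        return HINT_LOG_FPI_AND_DIRTY;

    /* ASSUMPTION: with full_page_writes = off the torn-page risk is accepted. */
    return HINT_DIRTY_ONLY;
}
```

For instance, hint_action(false, true, true, false) gives
HINT_LOG_FPI_AND_DIRTY, the case the README assigns to MarkBufferDirtyHint(),
while the same call with in_recovery = true gives HINT_SET_BUT_KEEP_CLEAN.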