
Commit 54d0e28

Add some documentation about how we WAL-log filesystem actions.
Per a question from Robert Haas.
1 parent 594419e

File tree: 1 file changed, +80 -1 lines
  • src/backend/access/transam


src/backend/access/transam/README

Lines changed: 80 additions & 1 deletion
@@ -1,4 +1,4 @@
-$PostgreSQL: pgsql/src/backend/access/transam/README,v 1.13 2009/12/19 01:32:33 sriggs Exp $
+$PostgreSQL: pgsql/src/backend/access/transam/README,v 1.14 2010/09/17 00:42:39 tgl Exp $

The Transaction System
======================
@@ -543,6 +543,85 @@ consistency. Such insertions occur after WAL is operational, so they can
and should write WAL records for the additional generated actions.

Write-Ahead Logging for Filesystem Actions
------------------------------------------

The previous section described how to WAL-log actions that only change page
contents within shared buffers. For that type of action it is generally
possible to check all likely error cases (such as insufficient space on the
page) before beginning to make the actual change. Therefore we can make
the change and the creation of the associated WAL log record "atomic" by
wrapping them into a critical section --- the odds of failure partway
through are low enough that PANIC is acceptable if it does happen.
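
(For illustration, the critical-section pattern described above usually has
roughly the following shape in the backend C code. This is only an outline:
the actual page modification and the XLogRecData setup are elided, and rmid,
info, rdata, buffer, page, and recptr stand in for the record-specific
details.)

    START_CRIT_SECTION();

    /* ... apply the change to the page in the shared buffer ... */
    MarkBufferDirty(buffer);

    /* ... describe the change in a WAL record and stamp the page with it ... */
    recptr = XLogInsert(rmid, info, rdata);
    PageSetLSN(page, recptr);

    END_CRIT_SECTION();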

Clearly, that approach doesn't work for cases where there's a significant
probability of failure within the action to be logged, such as creation
of a new file or database. We don't want to PANIC, and we especially don't
want to PANIC after having already written a WAL record that says we did
the action --- if we did, replay of the record would probably fail again
and PANIC again, making the failure unrecoverable. This means that the
ordinary WAL rule of "write WAL before the changes it describes" doesn't
work, and we need a different design for such cases.

There are several basic types of filesystem actions that have this
issue. Here is how we deal with each:

1. Adding a disk page to an existing table.

This action isn't WAL-logged at all. We extend a table by writing a page
of zeroes at its end. We must actually do this write so that we are sure
the filesystem has allocated the space. If the write fails we can just
error out normally. Once the space is known allocated, we can initialize
and fill the page via one or more normal WAL-logged actions. Because it's
possible that we crash between extending the file and writing out the WAL
entries, we have to treat discovery of an all-zeroes page in a table or
index as being a non-error condition. In such cases we can just reclaim
the space for re-use.
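
(A sketch of this ordering in plain POSIX C; the real backend code goes
through the storage manager layer instead, and an all-zeroes page is later
recognized as a non-error case. SKETCH_BLCKSZ and extend_by_one_zero_page
are invented names used only for illustration.)

    #include <string.h>
    #include <unistd.h>

    #define SKETCH_BLCKSZ 8192      /* stand-in for PostgreSQL's BLCKSZ */

    /* Append one all-zeroes page at the current end of the file. */
    static int
    extend_by_one_zero_page(int fd)
    {
        char    zeroes[SKETCH_BLCKSZ];

        memset(zeroes, 0, sizeof(zeroes));
        if (write(fd, zeroes, sizeof(zeroes)) != (ssize_t) sizeof(zeroes))
            return -1;          /* ordinary error; nothing has been WAL-logged */
        return 0;               /* space is known allocated; the page contents
                                 * are filled in later by WAL-logged actions */
    }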

2. Creating a new table, which requires a new file in the filesystem.

We try to create the file, and if successful we make a WAL record saying
we did it. If not successful, we can just throw an error. Notice that
there is a window where we have created the file but not yet written any
WAL about it to disk. If we crash during this window, the file remains
on disk as an "orphan". It would be possible to clean up such orphans
by having database restart search for files that don't have any committed
entry in pg_class, but that currently isn't done because of the possibility
of deleting data that is useful for forensic analysis of the crash.
Orphan files are harmless --- at worst they waste a bit of disk space ---
because we check for on-disk collisions when allocating new relfilenode
OIDs. So cleaning up isn't really necessary.
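
(Again only a sketch of the ordering just described, in plain C;
create_relation_file and wal_log_file_create are invented names, the latter
standing in for emitting the actual WAL record.)

    #include <fcntl.h>
    #include <unistd.h>

    extern void wal_log_file_create(const char *path);     /* hypothetical */

    static int
    create_relation_file(const char *path)
    {
        int     fd = open(path, O_RDWR | O_CREAT | O_EXCL, 0600);

        if (fd < 0)
            return -1;          /* just throw an error; no WAL has been written */
        close(fd);

        /*
         * Only now, with the file known to exist, is the WAL record written.
         * A crash in between leaves at worst a harmless orphan file.
         */
        wal_log_file_create(path);
        return 0;
    }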

3. Deleting a table, which requires an unlink() that could fail.

Our approach here is to WAL-log the operation first, but to treat failure
of the actual unlink() call as a warning rather than error condition.
Again, this can leave an orphan file behind, but that's cheap compared to
the alternatives. Since we can't actually do the unlink() until after
we've committed the DROP TABLE transaction, throwing an error would be out
of the question anyway. (It may be worth noting that the WAL entry about
the file deletion is actually part of the commit record for the dropping
transaction.)
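
(A sketch of the post-commit unlink in plain C; in the backend the warning
would be issued with ereport(WARNING, ...), and unlink_dropped_file is an
invented name.)

    #include <errno.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    /* Called only after the dropping transaction's commit record is on disk. */
    static void
    unlink_dropped_file(const char *path)
    {
        if (unlink(path) < 0)
            fprintf(stderr, "WARNING: could not remove file \"%s\": %s\n",
                    path, strerror(errno));     /* warn, don't error */
    }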

4. Creating and deleting databases and tablespaces, which requires creating
and deleting directories and entire directory trees.

These cases are handled similarly to creating individual files, ie, we
try to do the action first and then write a WAL entry if it succeeded.
The potential amount of wasted disk space is rather larger, of course.
In the creation case we try to delete the directory tree again if creation
fails, so as to reduce the risk of wasted space. Failure partway through
a deletion operation results in a corrupt database: the DROP failed, but
some of the data is gone anyway. There is little we can do about that,
though, and in any case it was presumably data the user no longer wants.
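
(A sketch of the creation case in plain C; every helper name here is
invented, with copy_template_tree, remove_tree, and wal_log_dir_create
standing in for the real template copy, recursive cleanup, and WAL
insertion.)

    #include <sys/stat.h>

    extern int  copy_template_tree(const char *src, const char *dst);  /* hypothetical */
    extern void remove_tree(const char *path);                         /* hypothetical */
    extern void wal_log_dir_create(const char *path);                  /* hypothetical */

    static int
    create_database_dir(const char *template_path, const char *db_path)
    {
        if (mkdir(db_path, 0700) < 0)
            return -1;                      /* nothing created, nothing logged */

        if (copy_template_tree(template_path, db_path) < 0)
        {
            remove_tree(db_path);           /* try not to leave wasted space */
            return -1;
        }

        wal_log_dir_create(db_path);        /* WAL entry only after success */
        return 0;
    }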

In all of these cases, if WAL replay fails to redo the original action
we must panic and abort recovery. The DBA will have to manually clean up
(for instance, free up some disk space or fix directory permissions) and
then restart recovery. This is part of the reason for not writing a WAL
entry until we've successfully done the original action.


Asynchronous Commit
-------------------