$PostgreSQL: pgsql/src/backend/access/transam/README,v 1.14 2010/09/17 00:42:39 tgl Exp $

The Transaction System
======================

consistency. Such insertions occur after WAL is operational, so they can
and should write WAL records for the additional generated actions.


Write-Ahead Logging for Filesystem Actions
------------------------------------------

The previous section described how to WAL-log actions that only change page
contents within shared buffers. For that type of action it is generally
possible to check all likely error cases (such as insufficient space on the
page) before beginning to make the actual change. Therefore we can make
the change and the creation of the associated WAL record "atomic" by
wrapping them in a critical section --- the odds of failure partway
through are low enough that PANIC is acceptable if it does happen.

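As a rough illustration of that ordering, here is a sketch in C. It is
schematic only: every type and helper name in it (Page, Item, Size,
page_has_room, apply_page_change, append_wal_record, and the
critical-section and error macros) is a hypothetical stand-in invented for
this README, not the actual backend API.

    /*
     * Schematic sketch only; all names are hypothetical stand-ins.
     * Check for likely failures first, then perform the page change and
     * emit its WAL record as one unit inside a critical section, where
     * any unexpected failure is escalated to PANIC.
     */
    void
    do_logged_page_change(Page page, Item item, Size item_size)
    {
        /* Step 1: error checks while failing is still harmless;
         * nothing has been modified yet. */
        if (!page_has_room(page, item_size))
            REPORT_ERROR("not enough room on page");

        /* Step 2: change the buffer and log it "atomically"; a failure
         * in here would leave the change and its WAL record out of
         * sync, so it is promoted to PANIC, which is tolerable because
         * these steps are very unlikely to fail. */
        BEGIN_CRITICAL_SECTION();
        apply_page_change(page, item);      /* modify the shared buffer */
        append_wal_record(page, item);      /* WAL record describing it */
        END_CRITICAL_SECTION();
    }
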
Clearly, that approach doesn't work for cases where there's a significant
probability of failure within the action to be logged, such as creation
of a new file or database. We don't want to PANIC, and we especially don't
want to PANIC after having already written a WAL record that says we did
the action --- if we did, replay of the record would probably fail again
and PANIC again, making the failure unrecoverable. This means that the
ordinary WAL rule of "write WAL before the changes it describes" doesn't
work, and we need a different design for such cases.

There are several basic types of filesystem actions that have this
issue. Here is how we deal with each:

1. Adding a disk page to an existing table.

This action isn't WAL-logged at all. We extend a table by writing a page
of zeroes at its end. We must actually do this write so that we are sure
the filesystem has allocated the space. If the write fails we can just
error out normally. Once the space is known to be allocated, we can
initialize and fill the page via one or more normal WAL-logged actions.
Because it's possible that we crash between extending the file and writing
out the WAL entries, we have to treat discovery of an all-zeroes page in a
table or index as a non-error condition. In such cases we can just reclaim
the space for re-use.

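A minimal sketch of this case, written against plain POSIX file calls
rather than the backend's storage-manager layer; the function name, the
fixed 8192-byte page size, and the error convention are all illustrative
assumptions.

    #include <string.h>
    #include <unistd.h>

    #define PAGE_SIZE 8192              /* stand-in for the block size */

    /* Illustrative only: extend a file by one page of zeroes. */
    static int
    extend_with_zero_page(int fd, off_t page_no)
    {
        char    zeroes[PAGE_SIZE];

        memset(zeroes, 0, sizeof(zeroes));

        /* The write must really be issued, so we know the filesystem
         * has allocated the space.  If it fails, we simply report an
         * ordinary error; no WAL has mentioned this page yet. */
        if (pwrite(fd, zeroes, sizeof(zeroes),
                   page_no * (off_t) PAGE_SIZE) != (ssize_t) sizeof(zeroes))
            return -1;

        /* The page stays all-zeroes until later, WAL-logged actions
         * initialize it.  A crash before then leaves an all-zeroes
         * page behind, which readers treat as empty and reclaimable
         * rather than as corruption. */
        return 0;
    }
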
2. Creating a new table, which requires a new file in the filesystem.

We try to create the file, and if successful we make a WAL record saying
we did it. If not successful, we can just throw an error. Notice that
there is a window where we have created the file but not yet written any
WAL about it to disk. If we crash during this window, the file remains
on disk as an "orphan". It would be possible to clean up such orphans by
having the post-crash restart process search for files that don't have any
committed entry in pg_class, but that currently isn't done because of the
possibility of deleting data that is useful for forensic analysis of the
crash. Orphan files are harmless --- at worst they waste a bit of disk
space --- because we check for on-disk collisions when allocating new
relfilenode OIDs. So cleaning up isn't really necessary.

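A sketch of the ordering for this case, again in terms of plain POSIX
calls; log_file_creation() is a hypothetical placeholder for writing the
WAL record, not a real backend function.

    #include <fcntl.h>
    #include <unistd.h>

    void    log_file_creation(const char *path);    /* hypothetical */

    /* Illustrative only: create the file first, WAL-log it second. */
    static int
    create_relation_file(const char *path)
    {
        int     fd = open(path, O_CREAT | O_EXCL | O_WRONLY, 0600);

        if (fd < 0)
            return -1;            /* nothing logged yet: ordinary error */
        close(fd);

        /* Window: a crash here leaves the file on disk with no WAL or
         * catalog entry describing it.  That is a harmless orphan,
         * because newly allocated relfilenode OIDs are checked against
         * on-disk collisions. */

        log_file_creation(path);    /* WAL record only after success */
        return 0;
    }
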
3. Deleting a table, which requires an unlink() that could fail.

Our approach here is to WAL-log the operation first, but to treat failure
of the actual unlink() call as a warning rather than an error condition.
Again, this can leave an orphan file behind, but that's cheap compared to
the alternatives. Since we can't actually do the unlink() until after
we've committed the DROP TABLE transaction, throwing an error would be out
of the question anyway. (It may be worth noting that the WAL entry about
the file deletion is actually part of the commit record for the dropping
transaction.)

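A sketch of the commit-time half of this case; by the time it runs the
drop is committed and its WAL entry (part of the commit record) is already
on disk, so the only sane response to an unlink() failure is a warning.
The function name and warning form are illustrative assumptions.

    #include <stdio.h>
    #include <unistd.h>

    /* Illustrative only: runs after the dropping transaction has
     * committed, so failure must not be raised as an error. */
    static void
    unlink_dropped_relation(const char *path)
    {
        if (unlink(path) < 0)
            fprintf(stderr,
                    "WARNING: could not remove file \"%s\"\n", path);
        /* A leftover file is merely an orphan wasting some disk space,
         * which is cheap compared to the alternatives. */
    }
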
4. Creating and deleting databases and tablespaces, which requires creating
and deleting directories and entire directory trees.

These cases are handled similarly to creating individual files, ie, we
try to do the action first and then write a WAL entry if it succeeded.
The potential amount of wasted disk space is rather larger, of course.
In the creation case we try to delete whatever part of the directory tree
was created if creation fails, so as to reduce the risk of wasted space.
Failure partway through a deletion operation results in a corrupt
database: the DROP failed, but some of the data is gone anyway. There is
little we can do about that, though, and in any case it was presumably
data the user no longer wants.

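A sketch of the creation half of this case; mkdir() is plain POSIX, while
copy_template_contents(), remove_tree(), and log_directory_creation() are
hypothetical placeholders used only to show the ordering and the cleanup
on failure.

    #include <sys/stat.h>

    int     copy_template_contents(const char *path);   /* hypothetical */
    void    remove_tree(const char *path);              /* hypothetical */
    void    log_directory_creation(const char *path);   /* hypothetical */

    /* Illustrative only: do the filesystem work first, WAL-log it only
     * if everything succeeded, and clean up after partial failures so
     * the disk space isn't wasted. */
    static int
    create_database_directory(const char *path)
    {
        if (mkdir(path, 0700) < 0)
            return -1;                /* nothing logged: ordinary error */

        if (!copy_template_contents(path))
        {
            remove_tree(path);          /* remove what was created */
            return -1;
        }

        log_directory_creation(path);   /* WAL entry only after success */
        return 0;
    }
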
In all of these cases, if WAL replay fails to redo the original action
we must panic and abort recovery. The DBA will have to manually clean up
(for instance, free up some disk space or fix directory permissions) and
then restart recovery. This is part of the reason for not writing a WAL
entry until we've successfully done the original action.


Asynchronous Commit
-------------------
