Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
Skip to content

Commit c74ad4e

Browse files
committed
Avoid throwing ERROR during WAL replay of DROP TABLESPACE.
Although we will not even issue an XLOG_TBLSPC_DROP WAL record unless removal of the tablespace's directories succeeds, that does not guarantee that the same operation will succeed during WAL replay. Foreseeable reasons for it to fail include temp files created in the tablespace by Hot Standby backends, wrong directory permissions on a standby server, etc etc. The original coding threw ERROR if replay failed to remove the directories, but that is a serious overreaction. Throwing an error aborts recovery, and worse means that manual intervention will be needed to get the database to start again, since otherwise the same error will recur on subsequent attempts to replay the same WAL record. And the consequence of failing to remove the directories is only that some probably-small amount of disk space is wasted, so it hardly seems justified to throw an error. Accordingly, arrange to report such failures as LOG messages and keep going when a failure occurs during replay. Back-patch to 9.0 where Hot Standby was introduced. In principle such problems can occur in earlier releases, but Hot Standby increases the odds of trouble significantly. Given the lack of field reports of such issues, I'm satisfied with patching back as far as the patch applies easily.
1 parent f1b8a84 commit c74ad4e

File tree

1 file changed

+47
-18
lines changed

1 file changed

+47
-18
lines changed

src/backend/commands/tablespace.c

Lines changed: 47 additions & 18 deletions
Original file line numberDiff line numberDiff line change
@@ -631,9 +631,13 @@ create_tablespace_directories(const char *location, const Oid tablespaceoid)
631631
/*
632632
* destroy_tablespace_directories
633633
*
634-
* Attempt to remove filesystem infrastructure
634+
* Attempt to remove filesystem infrastructure for the tablespace.
635635
*
636-
* 'redo' indicates we are redoing a drop from XLOG; okay if nothing there
636+
* 'redo' indicates we are redoing a drop from XLOG; in that case we should
637+
* not throw an ERROR for problems, just LOG them. The worst consequence of
638+
* not removing files here would be failure to release some disk space, which
639+
* does not justify throwing an error that would require manual intervention
640+
* to get the database running again.
637641
*
638642
* Returns TRUE if successful, FALSE if some subdirectory is not empty
639643
*/
@@ -686,6 +690,16 @@ destroy_tablespace_directories(Oid tablespaceoid, bool redo)
686690
pfree(linkloc_with_version_dir);
687691
return true;
688692
}
693+
else if (redo)
694+
{
695+
/* in redo, just log other types of error */
696+
ereport(LOG,
697+
(errcode_for_file_access(),
698+
errmsg("could not open directory \"%s\": %m",
699+
linkloc_with_version_dir)));
700+
pfree(linkloc_with_version_dir);
701+
return false;
702+
}
689703
/* else let ReadDir report the error */
690704
}
691705

@@ -699,7 +713,7 @@ destroy_tablespace_directories(Oid tablespaceoid, bool redo)
699713
sprintf(subfile, "%s/%s", linkloc_with_version_dir, de->d_name);
700714

701715
/* This check is just to deliver a friendlier error message */
702-
if (!directory_is_empty(subfile))
716+
if (!redo && !directory_is_empty(subfile))
703717
{
704718
FreeDir(dirdesc);
705719
pfree(subfile);
@@ -709,7 +723,7 @@ destroy_tablespace_directories(Oid tablespaceoid, bool redo)
709723

710724
/* remove empty directory */
711725
if (rmdir(subfile) < 0)
712-
ereport(ERROR,
726+
ereport(redo ? LOG : ERROR,
713727
(errcode_for_file_access(),
714728
errmsg("could not remove directory \"%s\": %m",
715729
subfile)));
@@ -721,31 +735,38 @@ destroy_tablespace_directories(Oid tablespaceoid, bool redo)
721735

722736
/* remove version directory */
723737
if (rmdir(linkloc_with_version_dir) < 0)
724-
ereport(ERROR,
738+
{
739+
ereport(redo ? LOG : ERROR,
725740
(errcode_for_file_access(),
726741
errmsg("could not remove directory \"%s\": %m",
727742
linkloc_with_version_dir)));
743+
pfree(linkloc_with_version_dir);
744+
return false;
745+
}
728746

729747
/*
730748
* Try to remove the symlink. We must however deal with the possibility
731749
* that it's a directory instead of a symlink --- this could happen during
732750
* WAL replay (see TablespaceCreateDbspace), and it is also the case on
733751
* Windows where junction points lstat() as directories.
752+
*
753+
* Note: in the redo case, we'll return true if this final step fails;
754+
* there's no point in retrying it.
734755
*/
735756
linkloc = pstrdup(linkloc_with_version_dir);
736757
get_parent_directory(linkloc);
737758
if (lstat(linkloc, &st) == 0 && S_ISDIR(st.st_mode))
738759
{
739760
if (rmdir(linkloc) < 0)
740-
ereport(ERROR,
761+
ereport(redo ? LOG : ERROR,
741762
(errcode_for_file_access(),
742763
errmsg("could not remove directory \"%s\": %m",
743764
linkloc)));
744765
}
745766
else
746767
{
747768
if (unlink(linkloc) < 0)
748-
ereport(ERROR,
769+
ereport(redo ? LOG : ERROR,
749770
(errcode_for_file_access(),
750771
errmsg("could not remove symbolic link \"%s\": %m",
751772
linkloc)));
@@ -1456,30 +1477,38 @@ tblspc_redo(XLogRecPtr lsn, XLogRecord *record)
14561477
xl_tblspc_drop_rec *xlrec = (xl_tblspc_drop_rec *) XLogRecGetData(record);
14571478

14581479
/*
1459-
* If we issued a WAL record for a drop tablespace it is because there
1460-
* were no files in it at all. That means that no permanent objects
1461-
* can exist in it at this point.
1480+
* If we issued a WAL record for a drop tablespace it implies that
1481+
* there were no files in it at all when the DROP was done. That means
1482+
* that no permanent objects can exist in it at this point.
14621483
*
14631484
* It is possible for standby users to be using this tablespace as a
14641485
* location for their temporary files, so if we fail to remove all
14651486
* files then do conflict processing and try again, if currently
14661487
* enabled.
1488+
*
1489+
* Other possible reasons for failure include bollixed file permissions
1490+
* on a standby server when they were okay on the primary, etc etc.
1491+
* There's not much we can do about that, so just remove what we can
1492+
* and press on.
14671493
*/
14681494
if (!destroy_tablespace_directories(xlrec->ts_id, true))
14691495
{
14701496
ResolveRecoveryConflictWithTablespace(xlrec->ts_id);
14711497

14721498
/*
14731499
* If we did recovery processing then hopefully the backends who
1474-
* wrote temp files should have cleaned up and exited by now. So
1475-
* lets recheck before we throw an error. If !process_conflicts
1476-
* then this will just fail again.
1500+
* wrote temp files should have cleaned up and exited by now. So
1501+
* retry before complaining. If we fail again, this is just a LOG
1502+
* condition, because it's not worth throwing an ERROR for (as
1503+
* that would crash the database and require manual intervention
1504+
* before we could get past this WAL record on restart).
14771505
*/
14781506
if (!destroy_tablespace_directories(xlrec->ts_id, true))
1479-
ereport(ERROR,
1507+
ereport(LOG,
14801508
(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
1481-
errmsg("tablespace %u is not empty",
1482-
xlrec->ts_id)));
1509+
errmsg("directories for tablespace %u could not be removed",
1510+
xlrec->ts_id),
1511+
errhint("You can remove the directories manually if necessary.")));
14831512
}
14841513
}
14851514
else
@@ -1495,14 +1524,14 @@ tblspc_desc(StringInfo buf, uint8 xl_info, char *rec)
14951524
{
14961525
xl_tblspc_create_rec *xlrec = (xl_tblspc_create_rec *) rec;
14971526

1498-
appendStringInfo(buf, "create ts: %u \"%s\"",
1527+
appendStringInfo(buf, "create tablespace: %u \"%s\"",
14991528
xlrec->ts_id, xlrec->ts_path);
15001529
}
15011530
else if (info == XLOG_TBLSPC_DROP)
15021531
{
15031532
xl_tblspc_drop_rec *xlrec = (xl_tblspc_drop_rec *) rec;
15041533

1505-
appendStringInfo(buf, "drop ts: %u", xlrec->ts_id);
1534+
appendStringInfo(buf, "drop tablespace: %u", xlrec->ts_id);
15061535
}
15071536
else
15081537
appendStringInfo(buf, "UNKNOWN");

0 commit comments

Comments
 (0)