
Commit 71e8e66

Cope with data-offset-less archive files during out-of-order restores.
pg_dump produces custom-format archive files that lack data offsets when it is unable to seek its output. Up to now that's been a hazard for pg_restore. But if pg_restore is able to seek in the archive file, there is no reason to throw up our hands when asked to restore data blocks out of order. Instead, whenever we are searching for a data block, record the locations of the blocks we passed over (that is, fill in the missing data-offset fields in our in-memory copy of the TOC data). Then, when we hit a case that requires going backwards, we can just seek back.

Also track the furthest point that we've searched to, and seek back to there when beginning a search for a new data block. This avoids possible O(N^2) time consumption, by ensuring that each data block is examined at most twice. (On Unix systems, that's at most twice per parallel-restore job; but since Windows uses threads here, the threads can share block location knowledge, reducing the amount of duplicated work.)

We can also improve the code a bit by using fseeko() to skip over data blocks during the search.

This is all of some use even in simple restores, but it's really significant for parallel pg_restore. In that case, we require seekability of the input already, and we will very probably need to do out-of-order restores.

Back-patch to v12, as this fixes a regression introduced by commit 548e509. Before that, parallel restore avoided requesting out-of-order restores, so it would work on a data-offset-less archive. Now it will again.

Ideally this patch would include some test coverage, but there are other open bugs that need to be fixed before we can extend our coverage of parallel restore very much. Plan to revisit that later.

David Gilman and Tom Lane; reviewed by Justin Pryzby

Discussion: https://postgr.es/m/CALBH9DDuJ+scZc4MEvw5uO-=vRyR2=QF9+Yh=3hPEnKHWfS81A@mail.gmail.com
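
The search strategy described in the commit message can be illustrated with a small stand-alone sketch. This is not the pg_restore code: the toy archive layout (an int block ID, an int payload length, then the payload), the `ToyArchive` struct, and the `toy_*` function names are all assumptions made up for this example, and it uses standard `fseek()`/`ftell()` instead of `fseeko()` to stay portable. It only shows the two ideas: remember the offset of every block skipped during a forward scan, and start each new forward scan from the furthest point reached so far.

```c
#include <stdio.h>
#include <string.h>

#define MAX_ID 16               /* toy limit on block IDs (assumption) */

typedef struct
{
    long offset[MAX_ID];        /* header offset per block ID, -1 if unknown */
    long last_pos;              /* furthest point a forward scan has reached */
    int  headers_read;          /* instrumentation: headers examined so far */
} ToyArchive;

/* Append one toy block: ID, payload length, payload bytes. */
static void
toy_write_block(FILE *fp, int id, const char *payload)
{
    int len = (int) strlen(payload);

    fwrite(&id, sizeof(id), 1, fp);
    fwrite(&len, sizeof(len), 1, fp);
    fwrite(payload, 1, (size_t) len, fp);
}

static void
toy_init(ToyArchive *ar)
{
    for (int i = 0; i < MAX_ID; i++)
        ar->offset[i] = -1;
    ar->last_pos = 0;           /* toy archive has no TOC, data starts at 0 */
    ar->headers_read = 0;
}

/*
 * Leave fp positioned at the payload of block "want" and return the payload
 * length, or return -1 if the block cannot be found.
 */
static int
toy_find_block(FILE *fp, ToyArchive *ar, int want)
{
    int  id, len;
    long start;

    if (want < 0 || want >= MAX_ID)
        return -1;

    /* Known offset: seek straight to it. Otherwise resume the forward scan. */
    start = (ar->offset[want] >= 0) ? ar->offset[want] : ar->last_pos;
    if (fseek(fp, start, SEEK_SET) != 0)
        return -1;

    for (;;)
    {
        long this_pos = ftell(fp);

        if (fread(&id, sizeof(id), 1, fp) != 1 ||
            fread(&len, sizeof(len), 1, fp) != 1)
            return -1;          /* hit EOF without finding the block */
        ar->headers_read++;

        if (id == want)
        {
            long end = this_pos + (long) (2 * sizeof(int)) + len;

            /* Only ever move the scan start point forward. */
            if (end > ar->last_pos)
                ar->last_pos = end;
            return len;
        }

        /* Remember where the skipped block lives, then jump over its data. */
        if (id >= 0 && id < MAX_ID && ar->offset[id] < 0)
            ar->offset[id] = this_pos;
        if (fseek(fp, len, SEEK_CUR) != 0)
            return -1;
    }
}
```

With four blocks written in the order 1, 2, 3, 4 and requested in the order 3, 1, 4, 2, this sketch reads six block headers in total: three for the first request and one for each later request, since blocks 1 and 2 were recorded while scanning for 3 and block 4 is found immediately at `last_pos`. That stays within the "at most twice per block" bound the commit message describes.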
1 parent 447cf2f commit 71e8e66

2 files changed: +117 -34 lines changed
doc/src/sgml/ref/pg_restore.sgml

Lines changed: 8 additions & 7 deletions
@@ -246,12 +246,14 @@ PostgreSQL documentation
       <term><option>--jobs=<replaceable class="parameter">number-of-jobs</replaceable></option></term>
       <listitem>
        <para>
-        Run the most time-consuming parts
-        of <application>pg_restore</application> &mdash; those which load data,
-        create indexes, or create constraints &mdash; using multiple
-        concurrent jobs. This option can dramatically reduce the time
+        Run the most time-consuming steps
+        of <application>pg_restore</application> &mdash; those that load data,
+        create indexes, or create constraints &mdash; concurrently, using up
+        to <replaceable class="parameter">number-of-jobs</replaceable>
+        concurrent sessions. This option can dramatically reduce the time
         to restore a large database to a server running on a
-        multiprocessor machine.
+        multiprocessor machine. This option is ignored when emitting a script
+        rather than connecting directly to a database server.
        </para>

        <para>
@@ -274,8 +276,7 @@ PostgreSQL documentation
         Only the custom and directory archive formats are supported
         with this option.
         The input must be a regular file or directory (not, for example, a
-        pipe). This option is ignored when emitting a script rather
-        than connecting directly to a database server. Also, multiple
+        pipe or standard input). Also, multiple
         jobs cannot be used together with the
         option <option>--single-transaction</option>.
        </para>

src/bin/pg_dump/pg_backup_custom.c

Lines changed: 109 additions & 27 deletions
@@ -70,6 +70,8 @@ typedef struct
 {
     CompressorState *cs;
     int         hasSeek;
+    /* lastFilePos is used only when reading, and may be invalid if !hasSeek */
+    pgoff_t     lastFilePos;    /* position after last data block we've read */
 } lclContext;

 typedef struct
@@ -181,8 +183,13 @@ InitArchiveFmt_Custom(ArchiveHandle *AH)

         ReadHead(AH);
         ReadToc(AH);
-    }

+        /*
+         * Remember location of first data block (i.e., the point after TOC)
+         * in case we have to search for desired data blocks.
+         */
+        ctx->lastFilePos = _getFilePos(AH, ctx);
+    }
 }

 /*
@@ -420,13 +427,62 @@ _PrintTocData(ArchiveHandle *AH, TocEntry *te)
     {
         /*
          * We cannot seek directly to the desired block. Instead, skip over
-         * block headers until we find the one we want. This could fail if we
-         * are asked to restore items out-of-order.
+         * block headers until we find the one we want. Remember the
+         * positions of skipped-over blocks, so that if we later decide we
+         * need to read one, we'll be able to seek to it.
+         *
+         * When our input file is seekable, we can do the search starting from
+         * the point after the last data block we scanned in previous
+         * iterations of this function.
          */
-        _readBlockHeader(AH, &blkType, &id);
+        if (ctx->hasSeek)
+        {
+            if (fseeko(AH->FH, ctx->lastFilePos, SEEK_SET) != 0)
+                fatal("error during file seek: %m");
+        }

-        while (blkType != EOF && id != te->dumpId)
+        for (;;)
         {
+            pgoff_t     thisBlkPos = _getFilePos(AH, ctx);
+
+            _readBlockHeader(AH, &blkType, &id);
+
+            if (blkType == EOF || id == te->dumpId)
+                break;
+
+            /* Remember the block position, if we got one */
+            if (thisBlkPos >= 0)
+            {
+                TocEntry   *otherte = getTocEntryByDumpId(AH, id);
+
+                if (otherte && otherte->formatData)
+                {
+                    lclTocEntry *othertctx = (lclTocEntry *) otherte->formatData;
+
+                    /*
+                     * Note: on Windows, multiple threads might access/update
+                     * the same lclTocEntry concurrently, but that should be
+                     * safe as long as we update dataPos before dataState.
+                     * Ideally, we'd use pg_write_barrier() to enforce that,
+                     * but the needed infrastructure doesn't exist in frontend
+                     * code. But Windows only runs on machines with strong
+                     * store ordering, so it should be okay for now.
+                     */
+                    if (othertctx->dataState == K_OFFSET_POS_NOT_SET)
+                    {
+                        othertctx->dataPos = thisBlkPos;
+                        othertctx->dataState = K_OFFSET_POS_SET;
+                    }
+                    else if (othertctx->dataPos != thisBlkPos ||
+                             othertctx->dataState != K_OFFSET_POS_SET)
+                    {
+                        /* sanity check */
+                        pg_log_warning("data block %d has wrong seek position",
+                                       id);
+                    }
+                }
+            }
+
             switch (blkType)
             {
                 case BLK_DATA:
@@ -442,7 +498,6 @@ _PrintTocData(ArchiveHandle *AH, TocEntry *te)
                           blkType);
                     break;
             }
-            _readBlockHeader(AH, &blkType, &id);
         }
     }
     else
@@ -454,20 +509,18 @@ _PrintTocData(ArchiveHandle *AH, TocEntry *te)
         _readBlockHeader(AH, &blkType, &id);
     }

-    /* Produce suitable failure message if we fell off end of file */
+    /*
+     * If we reached EOF without finding the block we want, then either it
+     * doesn't exist, or it does but we lack the ability to seek back to it.
+     */
     if (blkType == EOF)
     {
-        if (tctx->dataState == K_OFFSET_POS_NOT_SET)
-            fatal("could not find block ID %d in archive -- "
-                  "possibly due to out-of-order restore request, "
-                  "which cannot be handled due to lack of data offsets in archive",
-                  te->dumpId);
-        else if (!ctx->hasSeek)
+        if (!ctx->hasSeek)
             fatal("could not find block ID %d in archive -- "
                   "possibly due to out-of-order restore request, "
                   "which cannot be handled due to non-seekable input file",
                   te->dumpId);
-        else                    /* huh, the dataPos led us to EOF? */
+        else
             fatal("could not find block ID %d in archive -- "
                   "possibly corrupt archive",
                   te->dumpId);
@@ -493,6 +546,20 @@ _PrintTocData(ArchiveHandle *AH, TocEntry *te)
                   blkType);
             break;
     }
+
+    /*
+     * If our input file is seekable but lacks data offsets, update our
+     * knowledge of where to start future searches from. (Note that we did
+     * not update the current TE's dataState/dataPos. We could have, but
+     * there is no point since it will not be visited again.)
+     */
+    if (ctx->hasSeek && tctx->dataState == K_OFFSET_POS_NOT_SET)
+    {
+        pgoff_t     curPos = _getFilePos(AH, ctx);
+
+        if (curPos > ctx->lastFilePos)
+            ctx->lastFilePos = curPos;
+    }
 }

 /*
@@ -550,6 +617,7 @@ _skipBlobs(ArchiveHandle *AH)
 static void
 _skipData(ArchiveHandle *AH)
 {
+    lclContext *ctx = (lclContext *) AH->formatData;
     size_t      blkLen;
     char       *buf = NULL;
     int         buflen = 0;
@@ -558,19 +626,27 @@ _skipData(ArchiveHandle *AH)
     blkLen = ReadInt(AH);
     while (blkLen != 0)
     {
-        if (blkLen > buflen)
+        if (ctx->hasSeek)
         {
-            if (buf)
-                free(buf);
-            buf = (char *) pg_malloc(blkLen);
-            buflen = blkLen;
+            if (fseeko(AH->FH, blkLen, SEEK_CUR) != 0)
+                fatal("error during file seek: %m");
         }
-        if ((cnt = fread(buf, 1, blkLen, AH->FH)) != blkLen)
+        else
         {
-            if (feof(AH->FH))
-                fatal("could not read from input file: end of file");
-            else
-                fatal("could not read from input file: %m");
+            if (blkLen > buflen)
+            {
+                if (buf)
+                    free(buf);
+                buf = (char *) pg_malloc(blkLen);
+                buflen = blkLen;
+            }
+            if ((cnt = fread(buf, 1, blkLen, AH->FH)) != blkLen)
+            {
+                if (feof(AH->FH))
+                    fatal("could not read from input file: end of file");
+                else
+                    fatal("could not read from input file: %m");
+            }
         }

         blkLen = ReadInt(AH);
@@ -806,6 +882,9 @@ _Clone(ArchiveHandle *AH)
 {
     lclContext *ctx = (lclContext *) AH->formatData;

+    /*
+     * Each thread must have private lclContext working state.
+     */
     AH->formatData = (lclContext *) pg_malloc(sizeof(lclContext));
     memcpy(AH->formatData, ctx, sizeof(lclContext));
     ctx = (lclContext *) AH->formatData;
@@ -815,10 +894,13 @@ _Clone(ArchiveHandle *AH)
         fatal("compressor active");

     /*
+     * We intentionally do not clone TOC-entry-local state: it's useful to
+     * share knowledge about where the data blocks are across threads.
+     * _PrintTocData has to be careful about the order of operations on that
+     * state, though.
+     *
      * Note: we do not make a local lo_buf because we expect at most one BLOBS
-     * entry per archive, so no parallelism is possible. Likewise,
-     * TOC-entry-local state isn't an issue because any one TOC entry is
-     * touched by just one worker child.
+     * entry per archive, so no parallelism is possible.
      */
 }
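
The `_skipData` hunk above chooses between seeking past a data chunk and reading it into a throwaway buffer. A minimal stand-alone sketch of that choice follows. The function name `toy_skip_bytes` and its `has_seek` parameter are hypothetical, and the sketch deliberately differs from the patch: it uses standard `fseek()` rather than `fseeko()`, returns an error code rather than calling `fatal()`, and reads through a fixed-size stack buffer instead of allocating one as large as the chunk.

```c
#include <stdio.h>

/*
 * Skip "len" bytes of input. Returns 0 on success, -1 on a seek error or
 * if the input ends before "len" bytes could be consumed by reading.
 */
static int
toy_skip_bytes(FILE *fp, long len, int has_seek)
{
    char buf[4096];

    /* Seekable input: just move the file position, no data is copied. */
    if (has_seek)
        return (fseek(fp, len, SEEK_CUR) == 0) ? 0 : -1;

    /* Non-seekable input (e.g. a pipe): read and discard in pieces. */
    while (len > 0)
    {
        size_t want = (len < (long) sizeof(buf)) ? (size_t) len : sizeof(buf);

        if (fread(buf, 1, want, fp) != want)
            return -1;
        len -= (long) want;
    }
    return 0;
}
```

One behavioral difference is worth keeping in mind when reading the patch: seeking beyond the end of a regular file normally succeeds, so in the seek branch a truncated archive would only be noticed by the next read, whereas the read branch reports the short input immediately.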
