
Commit a45c78e

Rearrange pg_dump's handling of large objects for better efficiency.
Commit c0d5be5 caused pg_dump to create a separate BLOB metadata TOC entry for each large object (blob), but it did not touch the ancient decision to put all the blobs' data into a single "BLOBS" TOC entry. This is bad for a few reasons: for databases with millions of blobs, the TOC becomes unreasonably large, causing performance issues; selective restore of just some blobs is quite impossible; and we cannot parallelize either dump or restore of the blob data, since our architecture for that relies on farming out whole TOC entries to worker processes.

To improve matters, let's group multiple blobs into each blob metadata TOC entry, and then make corresponding per-group blob data TOC entries. Selective restore using pg_restore's -l/-L switches is then possible, though only at the group level. (Perhaps we should provide a switch to allow forcing one-blob-per-group for users who need precise selective restore and don't have huge numbers of blobs. This patch doesn't do that, instead just hard-wiring the maximum number of blobs per entry at 1000.)

The blobs in a group must all have the same owner, since the TOC entry format only allows one owner to be named. In this implementation we also require them to all share the same ACL (grants); the archive format wouldn't require that, but pg_dump's representation of DumpableObjects does. It seems unlikely that either restriction will be problematic for databases with huge numbers of blobs.

The metadata TOC entries now have a "desc" string of "BLOB METADATA", and their "defn" string is just a newline-separated list of blob OIDs. The restore code has to generate creation commands, ALTER OWNER commands, and drop commands (for --clean mode) from that. We would need special-case code for ALTER OWNER and drop in any case, so the alternative of keeping the "defn" as directly executable SQL code for creation wouldn't buy much, and it seems like it'd bloat the archive to little purpose.
Since we require the blobs of a metadata group to share the same ACL, we can furthermore store only one copy of that ACL, and then make pg_restore regenerate the appropriate commands for each blob. This saves space in the dump file not only by removing duplicative SQL command strings, but by not needing a separate TOC entry for each blob's ACL. In turn, that reduces client-side memory requirements for handling many blobs. ACL TOC entries that need this special processing are labeled as "ACL"/"LARGE OBJECTS nnn..nnn". If we have a blob with a unique ACL, continue to label it as "ACL"/"LARGE OBJECT nnn". We don't actually have to make such a distinction, but it saves a few cycles during restore for the easy case, and it seems like a good idea to not change the TOC contents unnecessarily.

The data TOC entries ("BLOBS") are exactly the same as before, except that now there can be more than one, so we'd better give them identifying tag strings.

Also, commit c0d5be5 put the new BLOB metadata TOC entries into SECTION_PRE_DATA, which perhaps is defensible in some ways, but it's a rather odd choice considering that we go out of our way to treat blobs as data. Moreover, because parallel restore handles the PRE_DATA section serially, this means we'd only get part of the parallelism speedup we could hope for. Move these entries into SECTION_DATA, letting us parallelize the lo_create calls not just the data loading when there are many blobs. Add dependencies to ensure that we won't try to load data for a blob we've not yet created.

As this stands, we still generate a separate TOC entry for any comment or security label attached to a blob. I feel comfortable in believing that comments and security labels on blobs are rare, so this patch should be enough to get most of the useful TOC compression for blobs.
We have to bump the archive file format version number, since existing versions of pg_restore wouldn't know they need to do something special for BLOB METADATA, plus they aren't going to work correctly with multiple BLOBS entries or multiple-large-object ACL entries.

The directory and tar-file format handlers need some work for multiple BLOBS entries: they used to hard-wire the file name as "blobs.toc", which is replaced here with "blobs_<dumpid>.toc". The 002_pg_dump.pl test script also knows about that and requires minor updates. (I had to drop the test for manually-compressed blobs.toc files with LZ4, because lz4's obtuse command line design requires explicit specification of the output file name which seems impractical here. I don't think we're losing any useful test coverage thereby; that test stanza seems completely duplicative with the gzip and zstd cases anyway.)

In passing, centralize management of the lo_buf used to hold data while restoring blobs. The code previously had each format handler create lo_buf, which seems rather pointless given that the format handlers all make it the same way. Moreover, the format handlers never use lo_buf directly, making this setup a failure from a separation-of-concerns standpoint. Let's move the responsibility into pg_backup_archiver.c, which is the only module concerned with lo_buf. The reason to do this in this patch is that it allows a centralized fix for the now-false assumption that we never restore blobs in parallel.

Also, get rid of dead code in DropLOIfExists: it's been a long time since we had any need to be able to restore to a pre-9.0 server.

Discussion: https://postgr.es/m/a9f9376f1c3343a6bb319dce294e20ac@EX13D05UWC001.ant.amazon.com
1 parent 5eac8ce commit a45c78e

11 files changed: +530 −261 lines

src/bin/pg_dump/common.c

+26
@@ -47,6 +47,8 @@ static DumpId lastDumpId = 0;	/* Note: 0 is InvalidDumpId */
  * expects that it can move them around when resizing the table.  So we
  * cannot make the DumpableObjects be elements of the hash table directly;
  * instead, the hash table elements contain pointers to DumpableObjects.
+ * This does have the advantage of letting us map multiple CatalogIds
+ * to one DumpableObject, which is useful for blobs.
  *
  * It turns out to be convenient to also use this data structure to map
  * CatalogIds to owning extensions, if any.  Since extension membership
@@ -700,6 +702,30 @@ AssignDumpId(DumpableObject *dobj)
 	}
 }
 
+/*
+ * recordAdditionalCatalogID
+ *	  Record an additional catalog ID for the given DumpableObject
+ */
+void
+recordAdditionalCatalogID(CatalogId catId, DumpableObject *dobj)
+{
+	CatalogIdMapEntry *entry;
+	bool		found;
+
+	/* CatalogId hash table must exist, if we have a DumpableObject */
+	Assert(catalogIdHash != NULL);
+
+	/* Add reference to CatalogId hash */
+	entry = catalogid_insert(catalogIdHash, catId, &found);
+	if (!found)
+	{
+		entry->dobj = NULL;
+		entry->ext = NULL;
+	}
+	Assert(entry->dobj == NULL);
+	entry->dobj = dobj;
+}
+
 /*
  * Assign a DumpId that's not tied to a DumpableObject.
  *

src/bin/pg_dump/pg_backup_archiver.c

+80 −26
@@ -512,7 +512,20 @@ RestoreArchive(Archive *AHX)
 			 * don't necessarily emit it verbatim; at this point we add an
 			 * appropriate IF EXISTS clause, if the user requested it.
 			 */
-			if (*te->dropStmt != '\0')
+			if (strcmp(te->desc, "BLOB METADATA") == 0)
+			{
+				/* We must generate the per-blob commands */
+				if (ropt->if_exists)
+					IssueCommandPerBlob(AH, te,
+										"SELECT pg_catalog.lo_unlink(oid) "
+										"FROM pg_catalog.pg_largeobject_metadata "
+										"WHERE oid = '", "'");
+				else
+					IssueCommandPerBlob(AH, te,
+										"SELECT pg_catalog.lo_unlink('",
+										"')");
+			}
+			else if (*te->dropStmt != '\0')
 			{
 				if (!ropt->if_exists ||
 					strncmp(te->dropStmt, "--", 2) == 0)
@@ -528,12 +541,12 @@ RestoreArchive(Archive *AHX)
 				{
 					/*
 					 * Inject an appropriate spelling of "if exists".  For
-					 * large objects, we have a separate routine that
+					 * old-style large objects, we have a routine that
 					 * knows how to do it, without depending on
 					 * te->dropStmt; use that.  For other objects we need
 					 * to parse the command.
 					 */
-					if (strncmp(te->desc, "BLOB", 4) == 0)
+					if (strcmp(te->desc, "BLOB") == 0)
 					{
 						DropLOIfExists(AH, te->catalogId.oid);
 					}
@@ -1290,7 +1303,7 @@ EndLO(Archive *AHX, Oid oid)
 **********/
 
 /*
- * Called by a format handler before any LOs are restored
+ * Called by a format handler before a group of LOs is restored
  */
 void
 StartRestoreLOs(ArchiveHandle *AH)
@@ -1309,7 +1322,7 @@ StartRestoreLOs(ArchiveHandle *AH)
 }
 
 /*
- * Called by a format handler after all LOs are restored
+ * Called by a format handler after a group of LOs is restored
  */
 void
 EndRestoreLOs(ArchiveHandle *AH)
@@ -1343,6 +1356,12 @@ StartRestoreLO(ArchiveHandle *AH, Oid oid, bool drop)
 	AH->loCount++;
 
 	/* Initialize the LO Buffer */
+	if (AH->lo_buf == NULL)
+	{
+		/* First time through (in this process) so allocate the buffer */
+		AH->lo_buf_size = LOBBUFSIZE;
+		AH->lo_buf = (void *) pg_malloc(LOBBUFSIZE);
+	}
 	AH->lo_buf_used = 0;
 
 	pg_log_info("restoring large object with OID %u", oid);
@@ -2988,19 +3007,20 @@ _tocEntryRequired(TocEntry *te, teSection curSection, ArchiveHandle *AH)
 	{
 		/*
 		 * Special Case: If 'SEQUENCE SET' or anything to do with LOs, then it
-		 * is considered a data entry.  We don't need to check for the BLOBS
-		 * entry or old-style BLOB COMMENTS, because they will have hadDumper
-		 * = true ... but we do need to check new-style BLOB ACLs, comments,
+		 * is considered a data entry.  We don't need to check for BLOBS or
+		 * old-style BLOB COMMENTS entries, because they will have hadDumper =
+		 * true ... but we do need to check new-style BLOB ACLs, comments,
 		 * etc.
 		 */
 		if (strcmp(te->desc, "SEQUENCE SET") == 0 ||
 			strcmp(te->desc, "BLOB") == 0 ||
+			strcmp(te->desc, "BLOB METADATA") == 0 ||
 			(strcmp(te->desc, "ACL") == 0 &&
-			 strncmp(te->tag, "LARGE OBJECT ", 13) == 0) ||
+			 strncmp(te->tag, "LARGE OBJECT", 12) == 0) ||
 			(strcmp(te->desc, "COMMENT") == 0 &&
-			 strncmp(te->tag, "LARGE OBJECT ", 13) == 0) ||
+			 strncmp(te->tag, "LARGE OBJECT", 12) == 0) ||
 			(strcmp(te->desc, "SECURITY LABEL") == 0 &&
-			 strncmp(te->tag, "LARGE OBJECT ", 13) == 0))
+			 strncmp(te->tag, "LARGE OBJECT", 12) == 0))
 			res = res & REQ_DATA;
 		else
 			res = res & ~REQ_DATA;
@@ -3035,12 +3055,13 @@ _tocEntryRequired(TocEntry *te, teSection curSection, ArchiveHandle *AH)
 		if (!(ropt->sequence_data && strcmp(te->desc, "SEQUENCE SET") == 0) &&
 			!(ropt->binary_upgrade &&
 			  (strcmp(te->desc, "BLOB") == 0 ||
+			   strcmp(te->desc, "BLOB METADATA") == 0 ||
 			   (strcmp(te->desc, "ACL") == 0 &&
-				strncmp(te->tag, "LARGE OBJECT ", 13) == 0) ||
+				strncmp(te->tag, "LARGE OBJECT", 12) == 0) ||
 			   (strcmp(te->desc, "COMMENT") == 0 &&
-				strncmp(te->tag, "LARGE OBJECT ", 13) == 0) ||
+				strncmp(te->tag, "LARGE OBJECT", 12) == 0) ||
 			   (strcmp(te->desc, "SECURITY LABEL") == 0 &&
-				strncmp(te->tag, "LARGE OBJECT ", 13) == 0))))
+				strncmp(te->tag, "LARGE OBJECT", 12) == 0))))
 			res = res & REQ_SCHEMA;
 	}
 
@@ -3607,18 +3628,35 @@ _printTocEntry(ArchiveHandle *AH, TocEntry *te, bool isData)
 	}
 
 	/*
-	 * Actually print the definition.
+	 * Actually print the definition.  Normally we can just print the defn
+	 * string if any, but we have three special cases:
 	 *
-	 * Really crude hack for suppressing AUTHORIZATION clause that old pg_dump
+	 * 1. A crude hack for suppressing AUTHORIZATION clause that old pg_dump
 	 * versions put into CREATE SCHEMA.  Don't mutate the variant for schema
 	 * "public" that is a comment.  We have to do this when --no-owner mode is
 	 * selected.  This is ugly, but I see no other good way ...
+	 *
+	 * 2. BLOB METADATA entries need special processing since their defn
+	 * strings are just lists of OIDs, not complete SQL commands.
+	 *
+	 * 3. ACL LARGE OBJECTS entries need special processing because they
+	 * contain only one copy of the ACL GRANT/REVOKE commands, which we must
+	 * apply to each large object listed in the associated BLOB METADATA.
	 */
 	if (ropt->noOwner &&
 		strcmp(te->desc, "SCHEMA") == 0 && strncmp(te->defn, "--", 2) != 0)
 	{
 		ahprintf(AH, "CREATE SCHEMA %s;\n\n\n", fmtId(te->tag));
 	}
+	else if (strcmp(te->desc, "BLOB METADATA") == 0)
+	{
+		IssueCommandPerBlob(AH, te, "SELECT pg_catalog.lo_create('", "')");
+	}
+	else if (strcmp(te->desc, "ACL") == 0 &&
+			 strncmp(te->tag, "LARGE OBJECTS", 13) == 0)
+	{
+		IssueACLPerBlob(AH, te);
+	}
 	else
 	{
 		if (te->defn && strlen(te->defn) > 0)
@@ -3639,18 +3677,31 @@ _printTocEntry(ArchiveHandle *AH, TocEntry *te, bool isData)
 		te->owner && strlen(te->owner) > 0 &&
 		te->dropStmt && strlen(te->dropStmt) > 0)
 	{
-		PQExpBufferData temp;
+		if (strcmp(te->desc, "BLOB METADATA") == 0)
+		{
+			/* BLOB METADATA needs special code to handle multiple LOs */
+			char	   *cmdEnd = psprintf(" OWNER TO %s", fmtId(te->owner));
 
-		initPQExpBuffer(&temp);
-		_getObjectDescription(&temp, te);
+			IssueCommandPerBlob(AH, te, "ALTER LARGE OBJECT ", cmdEnd);
+			pg_free(cmdEnd);
+		}
+		else
+		{
+			/* For all other cases, we can use _getObjectDescription */
+			PQExpBufferData temp;
 
-		/*
-		 * If _getObjectDescription() didn't fill the buffer, then there is no
-		 * owner.
-		 */
-		if (temp.data[0])
-			ahprintf(AH, "ALTER %s OWNER TO %s;\n\n", temp.data, fmtId(te->owner));
-		termPQExpBuffer(&temp);
+			initPQExpBuffer(&temp);
+			_getObjectDescription(&temp, te);
+
+			/*
+			 * If _getObjectDescription() didn't fill the buffer, then there
+			 * is no owner.
+			 */
+			if (temp.data[0])
+				ahprintf(AH, "ALTER %s OWNER TO %s;\n\n",
+						 temp.data, fmtId(te->owner));
+			termPQExpBuffer(&temp);
+		}
 	}
 
 	/*
@@ -4749,6 +4800,9 @@ CloneArchive(ArchiveHandle *AH)
 	/* clone has its own error count, too */
 	clone->public.n_errors = 0;
 
+	/* clones should not share lo_buf */
+	clone->lo_buf = NULL;
+
 	/*
 	 * Connect our new clone object to the database, using the same connection
 	 * parameters used for the original connection.

src/bin/pg_dump/pg_backup_archiver.h

+6 −1
@@ -68,10 +68,12 @@
 #define K_VERS_1_15 MAKE_ARCHIVE_VERSION(1, 15, 0)	/* add
 													 * compression_algorithm
 													 * in header */
+#define K_VERS_1_16 MAKE_ARCHIVE_VERSION(1, 16, 0)	/* BLOB METADATA entries
+													 * and multiple BLOBS */
 
 /* Current archive version number (the format we can output) */
 #define K_VERS_MAJOR 1
-#define K_VERS_MINOR 15
+#define K_VERS_MINOR 16
 #define K_VERS_REV 0
 #define K_VERS_SELF MAKE_ARCHIVE_VERSION(K_VERS_MAJOR, K_VERS_MINOR, K_VERS_REV)
 
@@ -448,6 +450,9 @@ extern void InitArchiveFmt_Tar(ArchiveHandle *AH);
 extern bool isValidTarHeader(char *header);
 
 extern void ReconnectToServer(ArchiveHandle *AH, const char *dbname);
+extern void IssueCommandPerBlob(ArchiveHandle *AH, TocEntry *te,
+								const char *cmdBegin, const char *cmdEnd);
+extern void IssueACLPerBlob(ArchiveHandle *AH, TocEntry *te);
 extern void DropLOIfExists(ArchiveHandle *AH, Oid oid);
 
 void		ahwrite(const void *ptr, size_t size, size_t nmemb, ArchiveHandle *AH);

src/bin/pg_dump/pg_backup_custom.c

+2 −9
@@ -140,10 +140,6 @@ InitArchiveFmt_Custom(ArchiveHandle *AH)
 	ctx = (lclContext *) pg_malloc0(sizeof(lclContext));
 	AH->formatData = (void *) ctx;
 
-	/* Initialize LO buffering */
-	AH->lo_buf_size = LOBBUFSIZE;
-	AH->lo_buf = (void *) pg_malloc(LOBBUFSIZE);
-
 	/*
 	 * Now open the file
 	 */
@@ -342,7 +338,7 @@ _EndData(ArchiveHandle *AH, TocEntry *te)
 }
 
 /*
- * Called by the archiver when starting to save all BLOB DATA (not schema).
+ * Called by the archiver when starting to save BLOB DATA (not schema).
 * This routine should save whatever format-specific information is needed
 * to read the LOs back into memory.
 *
@@ -402,7 +398,7 @@ _EndLO(ArchiveHandle *AH, TocEntry *te, Oid oid)
 }
 
 /*
- * Called by the archiver when finishing saving all BLOB DATA.
+ * Called by the archiver when finishing saving BLOB DATA.
 *
 * Optional.
 */
@@ -902,9 +898,6 @@ _Clone(ArchiveHandle *AH)
 	 * share knowledge about where the data blocks are across threads.
 	 * _PrintTocData has to be careful about the order of operations on that
 	 * state, though.
-	 *
-	 * Note: we do not make a local lo_buf because we expect at most one BLOBS
-	 * entry per archive, so no parallelism is possible.
 	 */
 }