
Commit 568d413

Use an MVCC snapshot, rather than SnapshotNow, for catalog scans.
SnapshotNow scans have the undesirable property that, in the face of concurrent updates, the scan can fail to see either the old or the new versions of the row. In many cases, we work around this by requiring DDL operations to hold AccessExclusiveLock on the object being modified; in some cases, the existing locking is inadequate and random failures occur as a result. This commit doesn't change anything related to locking, but will hopefully pave the way to allowing lock strength reductions in the future.

The major issue that has held us back from making this change in the past is that taking an MVCC snapshot is significantly more expensive than using a static special snapshot such as SnapshotNow. However, testing of various worst-case scenarios reveals that this problem is not severe except under fairly extreme workloads. To mitigate those problems, we avoid retaking the MVCC snapshot for each new scan; instead, we take a new snapshot only when invalidation messages have been processed. The catcache machinery already requires that invalidation messages be sent before releasing the related heavyweight lock; else other backends might rely on locally-cached data rather than scanning the catalog at all. Thus, making snapshot reuse dependent on the same guarantees shouldn't break anything that wasn't already subtly broken.

Patch by me. Review by Michael Paquier and Andres Freund.
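To illustrate the caller-visible convention (the diffs below make exactly this substitution), here is a minimal, hypothetical sketch of a catalog lookup after this commit: passing NULL instead of SnapshotNow lets systable_beginscan() register a catalog MVCC snapshot via GetCatalogSnapshot() and lets systable_endscan() release it. The function name and the choice of pg_class are illustrative only, not part of the patch.

/* Hypothetical example, not part of this commit. */
#include "postgres.h"

#include "access/genam.h"
#include "access/heapam.h"
#include "access/skey.h"
#include "access/sysattr.h"
#include "catalog/indexing.h"
#include "catalog/pg_class.h"
#include "utils/fmgroids.h"
#include "utils/rel.h"

static void
example_pg_class_lookup(Oid relid)
{
    Relation    rel;
    SysScanDesc scan;
    ScanKeyData key;
    HeapTuple   tup;

    ScanKeyInit(&key,
                ObjectIdAttributeNumber,
                BTEqualStrategyNumber, F_OIDEQ,
                ObjectIdGetDatum(relid));

    rel = heap_open(RelationRelationId, AccessShareLock);

    /* Before this commit the fourth argument would have been SnapshotNow. */
    scan = systable_beginscan(rel, ClassOidIndexId, true,
                              NULL, 1, &key);

    while (HeapTupleIsValid(tup = systable_getnext(scan)))
    {
        /* examine the pg_class tuple here */
    }

    systable_endscan(scan);
    heap_close(rel, AccessShareLock);
}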
1 parent 384f933 commit 568d413

69 files changed, +616 -352 lines changed

contrib/dblink/dblink.c

+1-1
@@ -2046,7 +2046,7 @@ get_pkey_attnames(Relation rel, int16 *numatts)
 				ObjectIdGetDatum(RelationGetRelid(rel)));
 
 	scan = systable_beginscan(indexRelation, IndexIndrelidIndexId, true,
-							  SnapshotNow, 1, &skey);
+							  NULL, 1, &skey);
 
 	while (HeapTupleIsValid(indexTuple = systable_getnext(scan)))
 	{

contrib/sepgsql/label.c

+1-1
@@ -727,7 +727,7 @@ exec_object_restorecon(struct selabel_handle * sehnd, Oid catalogId)
 	rel = heap_open(catalogId, AccessShareLock);
 
 	sscan = systable_beginscan(rel, InvalidOid, false,
-							   SnapshotNow, 0, NULL);
+							   NULL, 0, NULL);
 	while (HeapTupleIsValid(tuple = systable_getnext(sscan)))
 	{
 		Form_pg_database datForm;

doc/src/sgml/indexam.sgml

+1-1
@@ -713,7 +713,7 @@ amrestrpos (IndexScanDesc scan);
    When using an MVCC-compliant snapshot, there is no problem because
    the new occupant of the slot is certain to be too new to pass the
    snapshot test.  However, with a non-MVCC-compliant snapshot (such as
-   <literal>SnapshotNow</>), it would be possible to accept and return
+   <literal>SnapshotAny</>), it would be possible to accept and return
    a row that does not in fact match the scan keys.  We could defend
    against this scenario by requiring the scan keys to be rechecked
    against the heap row in all cases, but that is too expensive.  Instead,

src/backend/access/heap/heapam.c

+19-5
@@ -80,7 +80,7 @@ static HeapScanDesc heap_beginscan_internal(Relation relation,
 					Snapshot snapshot,
 					int nkeys, ScanKey key,
 					bool allow_strat, bool allow_sync,
-					bool is_bitmapscan);
+					bool is_bitmapscan, bool temp_snap);
 static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
 					TransactionId xid, CommandId cid, int options);
 static XLogRecPtr log_heap_update(Relation reln, Buffer oldbuf,

@@ -1286,7 +1286,17 @@ heap_beginscan(Relation relation, Snapshot snapshot,
 			   int nkeys, ScanKey key)
 {
 	return heap_beginscan_internal(relation, snapshot, nkeys, key,
-								   true, true, false);
+								   true, true, false, false);
+}
+
+HeapScanDesc
+heap_beginscan_catalog(Relation relation, int nkeys, ScanKey key)
+{
+	Oid			relid = RelationGetRelid(relation);
+	Snapshot	snapshot = RegisterSnapshot(GetCatalogSnapshot(relid));
+
+	return heap_beginscan_internal(relation, snapshot, nkeys, key,
+								   true, true, false, true);
 }
 
 HeapScanDesc

@@ -1295,22 +1305,22 @@ heap_beginscan_strat(Relation relation, Snapshot snapshot,
 					 bool allow_strat, bool allow_sync)
 {
 	return heap_beginscan_internal(relation, snapshot, nkeys, key,
-								   allow_strat, allow_sync, false);
+								   allow_strat, allow_sync, false, false);
 }
 
 HeapScanDesc
 heap_beginscan_bm(Relation relation, Snapshot snapshot,
 				  int nkeys, ScanKey key)
 {
 	return heap_beginscan_internal(relation, snapshot, nkeys, key,
-								   false, false, true);
+								   false, false, true, false);
 }
 
 static HeapScanDesc
 heap_beginscan_internal(Relation relation, Snapshot snapshot,
 						int nkeys, ScanKey key,
 						bool allow_strat, bool allow_sync,
-						bool is_bitmapscan)
+						bool is_bitmapscan, bool temp_snap)
 {
 	HeapScanDesc scan;
 

@@ -1335,6 +1345,7 @@ heap_beginscan_internal(Relation relation, Snapshot snapshot,
 	scan->rs_strategy = NULL;	/* set in initscan */
 	scan->rs_allow_strat = allow_strat;
 	scan->rs_allow_sync = allow_sync;
+	scan->rs_temp_snap = temp_snap;
 
 	/*
 	 * we can use page-at-a-time mode if it's an MVCC-safe snapshot

@@ -1421,6 +1432,9 @@ heap_endscan(HeapScanDesc scan)
 	if (scan->rs_strategy != NULL)
 		FreeAccessStrategy(scan->rs_strategy);
 
+	if (scan->rs_temp_snap)
+		UnregisterSnapshot(scan->rs_snapshot);
+
 	pfree(scan);
 }
 
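As a usage note: heap_beginscan_catalog() takes no snapshot argument; it registers a catalog snapshot itself and marks the scan with rs_temp_snap so that heap_endscan() unregisters it. A hypothetical caller might look like the sketch below (the function name is illustrative; the same pattern appears in the bootstrap.c and aclchk.c hunks later in this commit).

/* Hypothetical example, not part of this commit. */
#include "postgres.h"

#include "access/heapam.h"
#include "catalog/pg_type.h"
#include "utils/rel.h"

static int
example_count_pg_type_rows(void)
{
    Relation    rel;
    HeapScanDesc scan;
    HeapTuple   tup;
    int         n = 0;

    rel = heap_open(TypeRelationId, AccessShareLock);

    /* Formerly: heap_beginscan(rel, SnapshotNow, 0, NULL) */
    scan = heap_beginscan_catalog(rel, 0, NULL);

    while ((tup = heap_getnext(scan, ForwardScanDirection)) != NULL)
        n++;

    heap_endscan(scan);         /* also unregisters the catalog snapshot */
    heap_close(rel, AccessShareLock);

    return n;
}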

src/backend/access/index/genam.c

+33-1
@@ -28,6 +28,7 @@
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
 #include "utils/rel.h"
+#include "utils/snapmgr.h"
 #include "utils/tqual.h"
 
 

@@ -231,7 +232,7 @@ BuildIndexValueDescription(Relation indexRelation,
  * rel: catalog to scan, already opened and suitably locked
  * indexId: OID of index to conditionally use
  * indexOK: if false, forces a heap scan (see notes below)
- * snapshot: time qual to use (usually should be SnapshotNow)
+ * snapshot: time qual to use (NULL for a recent catalog snapshot)
  * nkeys, key: scan keys
  *
  * The attribute numbers in the scan key should be set for the heap case.

@@ -266,6 +267,19 @@ systable_beginscan(Relation heapRelation,
 	sysscan->heap_rel = heapRelation;
 	sysscan->irel = irel;
 
+	if (snapshot == NULL)
+	{
+		Oid			relid = RelationGetRelid(heapRelation);
+
+		snapshot = RegisterSnapshot(GetCatalogSnapshot(relid));
+		sysscan->snapshot = snapshot;
+	}
+	else
+	{
+		/* Caller is responsible for any snapshot. */
+		sysscan->snapshot = NULL;
+	}
+
 	if (irel)
 	{
 		int			i;

@@ -401,6 +415,9 @@ systable_endscan(SysScanDesc sysscan)
 	else
 		heap_endscan(sysscan->scan);
 
+	if (sysscan->snapshot)
+		UnregisterSnapshot(sysscan->snapshot);
+
 	pfree(sysscan);
 }
 

@@ -444,6 +461,19 @@ systable_beginscan_ordered(Relation heapRelation,
 	sysscan->heap_rel = heapRelation;
 	sysscan->irel = indexRelation;
 
+	if (snapshot == NULL)
+	{
+		Oid			relid = RelationGetRelid(heapRelation);
+
+		snapshot = RegisterSnapshot(GetCatalogSnapshot(relid));
+		sysscan->snapshot = snapshot;
+	}
+	else
+	{
+		/* Caller is responsible for any snapshot. */
+		sysscan->snapshot = NULL;
+	}
+
 	/* Change attribute numbers to be index column numbers. */
 	for (i = 0; i < nkeys; i++)
 	{

@@ -494,5 +524,7 @@ systable_endscan_ordered(SysScanDesc sysscan)
 {
 	Assert(sysscan->irel);
 	index_endscan(sysscan->iscan);
+	if (sysscan->snapshot)
+		UnregisterSnapshot(sysscan->snapshot);
 	pfree(sysscan);
 }
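The ownership rule established here applies to both scan flavors: when the caller passes NULL, genam.c registers a catalog snapshot, remembers it in sysscan->snapshot, and unregisters it at endscan time; an explicitly supplied snapshot is never touched, so keeping it registered remains the caller's job (as with pg_largeobject_aclmask_snapshot in the aclchk.c hunks below). A hypothetical sketch of the ordered variant under the NULL convention; the function name and relation arguments are illustrative only.

/* Hypothetical example, not part of this commit. */
#include "postgres.h"

#include "access/genam.h"
#include "access/heapam.h"
#include "access/skey.h"
#include "catalog/pg_largeobject.h"
#include "utils/fmgroids.h"
#include "utils/rel.h"

static void
example_walk_large_object(Relation lo_heap, Relation lo_index, Oid loid)
{
    ScanKeyData skey;
    SysScanDesc sd;
    HeapTuple   tuple;

    ScanKeyInit(&skey,
                Anum_pg_largeobject_loid,
                BTEqualStrategyNumber, F_OIDEQ,
                ObjectIdGetDatum(loid));

    /* NULL snapshot: genam.c registers and later releases a catalog snapshot. */
    sd = systable_beginscan_ordered(lo_heap, lo_index, NULL, 1, &skey);

    while (HeapTupleIsValid(tuple = systable_getnext_ordered(sd, ForwardScanDirection)))
    {
        /* process one pg_largeobject chunk here */
    }

    systable_endscan_ordered(sd);
}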

src/backend/access/nbtree/README

+4-3
@@ -141,9 +141,10 @@ deletes index entries before deleting tuples, the super-exclusive lock
 guarantees that VACUUM can't delete any heap tuple that an indexscanning
 process might be about to visit.  (This guarantee works only for simple
 indexscans that visit the heap in sync with the index scan, not for bitmap
-scans.  We only need the guarantee when using non-MVCC snapshot rules such
-as SnapshotNow, so in practice this is only important for system catalog
-accesses.)
+scans.  We only need the guarantee when using non-MVCC snapshot rules; in
+an MVCC snapshot, it wouldn't matter if the heap tuple were replaced with
+an unrelated tuple at the same TID, because the new tuple wouldn't be
+visible to our scan anyway.)
 
 Because a page can be split even while someone holds a pin on it, it is
 possible that an indexscan will return items that are no longer stored on

src/backend/access/rmgrdesc/xactdesc.c

+4-1
@@ -69,11 +69,14 @@ xact_desc_commit(StringInfo buf, xl_xact_commit *xlrec)
 				appendStringInfo(buf, " catalog %u", msg->cat.catId);
 			else if (msg->id == SHAREDINVALRELCACHE_ID)
 				appendStringInfo(buf, " relcache %u", msg->rc.relId);
-			/* remaining cases not expected, but print something anyway */
+			/* not expected, but print something anyway */
 			else if (msg->id == SHAREDINVALSMGR_ID)
 				appendStringInfo(buf, " smgr");
+			/* not expected, but print something anyway */
 			else if (msg->id == SHAREDINVALRELMAP_ID)
 				appendStringInfo(buf, " relmap");
+			else if (msg->id == SHAREDINVALSNAPSHOT_ID)
+				appendStringInfo(buf, " snapshot %u", msg->sn.relId);
 			else
 				appendStringInfo(buf, " unknown id %d", msg->id);
 		}

src/backend/bootstrap/bootstrap.c

+4-4
@@ -611,7 +611,7 @@ boot_openrel(char *relname)
 	{
 		/* We can now load the pg_type data */
 		rel = heap_open(TypeRelationId, NoLock);
-		scan = heap_beginscan(rel, SnapshotNow, 0, NULL);
+		scan = heap_beginscan_catalog(rel, 0, NULL);
 		i = 0;
 		while ((tup = heap_getnext(scan, ForwardScanDirection)) != NULL)
 			++i;

@@ -620,7 +620,7 @@ boot_openrel(char *relname)
 		while (i-- > 0)
 			*app++ = ALLOC(struct typmap, 1);
 		*app = NULL;
-		scan = heap_beginscan(rel, SnapshotNow, 0, NULL);
+		scan = heap_beginscan_catalog(rel, 0, NULL);
 		app = Typ;
 		while ((tup = heap_getnext(scan, ForwardScanDirection)) != NULL)
 		{

@@ -918,7 +918,7 @@ gettype(char *type)
 		}
 		elog(DEBUG4, "external type: %s", type);
 		rel = heap_open(TypeRelationId, NoLock);
-		scan = heap_beginscan(rel, SnapshotNow, 0, NULL);
+		scan = heap_beginscan_catalog(rel, 0, NULL);
 		i = 0;
 		while ((tup = heap_getnext(scan, ForwardScanDirection)) != NULL)
 			++i;

@@ -927,7 +927,7 @@ gettype(char *type)
 		while (i-- > 0)
 			*app++ = ALLOC(struct typmap, 1);
 		*app = NULL;
-		scan = heap_beginscan(rel, SnapshotNow, 0, NULL);
+		scan = heap_beginscan_catalog(rel, 0, NULL);
 		app = Typ;
 		while ((tup = heap_getnext(scan, ForwardScanDirection)) != NULL)
 		{

src/backend/catalog/aclchk.c

+16-13
@@ -788,7 +788,7 @@ objectsInSchemaToOids(GrantObjectType objtype, List *nspnames)
 							ObjectIdGetDatum(namespaceId));
 
 				rel = heap_open(ProcedureRelationId, AccessShareLock);
-				scan = heap_beginscan(rel, SnapshotNow, 1, key);
+				scan = heap_beginscan_catalog(rel, 1, key);
 
 				while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
 				{

@@ -833,7 +833,7 @@ getRelationsInNamespace(Oid namespaceId, char relkind)
 				CharGetDatum(relkind));
 
 	rel = heap_open(RelationRelationId, AccessShareLock);
-	scan = heap_beginscan(rel, SnapshotNow, 2, key);
+	scan = heap_beginscan_catalog(rel, 2, key);
 
 	while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
 	{

@@ -1332,7 +1332,7 @@ RemoveRoleFromObjectACL(Oid roleid, Oid classid, Oid objid)
 					ObjectIdGetDatum(objid));
 
 		scan = systable_beginscan(rel, DefaultAclOidIndexId, true,
-								  SnapshotNow, 1, skey);
+								  NULL, 1, skey);
 
 		tuple = systable_getnext(scan);
 

@@ -1452,7 +1452,7 @@ RemoveDefaultACLById(Oid defaclOid)
 				ObjectIdGetDatum(defaclOid));
 
 	scan = systable_beginscan(rel, DefaultAclOidIndexId, true,
-							  SnapshotNow, 1, skey);
+							  NULL, 1, skey);
 
 	tuple = systable_getnext(scan);
 

@@ -2705,7 +2705,7 @@ ExecGrant_Largeobject(InternalGrant *istmt)
 
 		scan = systable_beginscan(relation,
 								  LargeObjectMetadataOidIndexId, true,
-								  SnapshotNow, 1, entry);
+								  NULL, 1, entry);
 
 		tuple = systable_getnext(scan);
 		if (!HeapTupleIsValid(tuple))

@@ -3468,7 +3468,7 @@ pg_aclmask(AclObjectKind objkind, Oid table_oid, AttrNumber attnum, Oid roleid,
 			return pg_language_aclmask(table_oid, roleid, mask, how);
 		case ACL_KIND_LARGEOBJECT:
 			return pg_largeobject_aclmask_snapshot(table_oid, roleid,
-												   mask, how, SnapshotNow);
+												   mask, how, NULL);
 		case ACL_KIND_NAMESPACE:
 			return pg_namespace_aclmask(table_oid, roleid, mask, how);
 		case ACL_KIND_TABLESPACE:

@@ -3856,10 +3856,13 @@ pg_language_aclmask(Oid lang_oid, Oid roleid,
  * Exported routine for examining a user's privileges for a largeobject
  *
  * When a large object is opened for reading, it is opened relative to the
- * caller's snapshot, but when it is opened for writing, it is always relative
- * to SnapshotNow, as documented in doc/src/sgml/lobj.sgml.  This function
- * takes a snapshot argument so that the permissions check can be made relative
- * to the same snapshot that will be used to read the underlying data.
+ * caller's snapshot, but when it is opened for writing, a current
+ * MVCC snapshot will be used.  See doc/src/sgml/lobj.sgml.  This function
+ * takes a snapshot argument so that the permissions check can be made
+ * relative to the same snapshot that will be used to read the underlying
+ * data.  The caller will actually pass NULL for an instantaneous MVCC
+ * snapshot, since all we do with the snapshot argument is pass it through
+ * to systable_beginscan().
  */
 AclMode
 pg_largeobject_aclmask_snapshot(Oid lobj_oid, Oid roleid,

@@ -4644,7 +4647,7 @@ pg_language_ownercheck(Oid lan_oid, Oid roleid)
  * Ownership check for a largeobject (specified by OID)
  *
  * This is only used for operations like ALTER LARGE OBJECT that are always
- * relative to SnapshotNow.
+ * relative to an up-to-date snapshot.
  */
 bool
 pg_largeobject_ownercheck(Oid lobj_oid, Oid roleid)

@@ -4670,7 +4673,7 @@ pg_largeobject_ownercheck(Oid lobj_oid, Oid roleid)
 
 	scan = systable_beginscan(pg_lo_meta,
 							  LargeObjectMetadataOidIndexId, true,
-							  SnapshotNow, 1, entry);
+							  NULL, 1, entry);
 
 	tuple = systable_getnext(scan);
 	if (!HeapTupleIsValid(tuple))

@@ -5032,7 +5035,7 @@ pg_extension_ownercheck(Oid ext_oid, Oid roleid)
 
 	scan = systable_beginscan(pg_extension,
 							  ExtensionOidIndexId, true,
-							  SnapshotNow, 1, entry);
+							  NULL, 1, entry);
 
 	tuple = systable_getnext(scan);
 	if (!HeapTupleIsValid(tuple))

src/backend/catalog/catalog.c

+10-14
@@ -218,20 +218,16 @@ IsReservedName(const char *name)
  * Given the OID of a relation, determine whether it's supposed to be
  * shared across an entire database cluster.
  *
- * Hard-wiring this list is pretty grotty, but we really need it so that
- * we can compute the locktag for a relation (and then lock it) without
- * having already read its pg_class entry.  If we try to retrieve relisshared
- * from pg_class with no pre-existing lock, there is a race condition against
- * anyone who is concurrently committing a change to the pg_class entry:
- * since we read system catalog entries under SnapshotNow, it's possible
- * that both the old and new versions of the row are invalid at the instants
- * we scan them.  We fix this by insisting that updaters of a pg_class
- * row must hold exclusive lock on the corresponding rel, and that users
- * of a relation must hold at least AccessShareLock on the rel *before*
- * trying to open its relcache entry.  But to lock a rel, you have to
- * know if it's shared.  Fortunately, the set of shared relations is
- * fairly static, so a hand-maintained list of their OIDs isn't completely
- * impractical.
+ * In older releases, this had to be hard-wired so that we could compute the
+ * locktag for a relation and lock it before examining its catalog entry.
+ * Since we now have MVCC catalog access, the race conditions that made that
+ * a hard requirement are gone, so we could look at relaxing this restriction.
+ * However, if we scanned the pg_class entry to find relisshared, and only
+ * then locked the relation, pg_class could get updated in the meantime,
+ * forcing us to scan the relation again, which would definitely be complex
+ * and might have undesirable performance consequences.  Fortunately, the set
+ * of shared relations is fairly static, so a hand-maintained list of their
+ * OIDs isn't completely impractical.
  */
 bool
 IsSharedRelation(Oid relationId)

src/backend/catalog/dependency.c

+3-3
@@ -558,7 +558,7 @@ findDependentObjects(const ObjectAddress *object,
 		nkeys = 2;
 
 	scan = systable_beginscan(*depRel, DependDependerIndexId, true,
-							  SnapshotNow, nkeys, key);
+							  NULL, nkeys, key);
 
 	while (HeapTupleIsValid(tup = systable_getnext(scan)))
 	{

@@ -733,7 +733,7 @@ findDependentObjects(const ObjectAddress *object,
 		nkeys = 2;
 
 	scan = systable_beginscan(*depRel, DependReferenceIndexId, true,
-							  SnapshotNow, nkeys, key);
+							  NULL, nkeys, key);
 
 	while (HeapTupleIsValid(tup = systable_getnext(scan)))
 	{

@@ -1069,7 +1069,7 @@ deleteOneObject(const ObjectAddress *object, Relation *depRel, int flags)
 		nkeys = 2;
 
 	scan = systable_beginscan(*depRel, DependDependerIndexId, true,
-							  SnapshotNow, nkeys, key);
+							  NULL, nkeys, key);
 
 	while (HeapTupleIsValid(tup = systable_getnext(scan)))
 	{
