Commit 898e5e3

Allow ATTACH PARTITION with only ShareUpdateExclusiveLock.

We still require AccessExclusiveLock on the partition itself, because otherwise an insert that violates the newly-imposed partition constraint could be in progress at the same time that we're changing that constraint; only the lock level on the parent relation is weakened.

To make this safe, we have to cope with (at least) three separate problems. First, relevant DDL might commit while we're in the process of building a PartitionDesc. If so, find_inheritance_children() might see a new partition while the RELOID system cache still has the old partition bound cached, and even before invalidation messages have been queued. To fix that, if we see that the pg_class tuple seems to be missing or to have a null relpartbound, refetch the value directly from the table. We can't get the wrong value, because DETACH PARTITION still requires AccessExclusiveLock throughout; if we ever want to change that, this will need more thought. In testing, I found it quite difficult to hit even the null-relpartbound case; the race condition is extremely tight, but the theoretical risk is there.

Second, successive calls to RelationGetPartitionDesc might not return the same answer. The query planner will get confused if looking up the PartitionDesc for a particular relation does not return a consistent answer for the entire duration of query planning. Likewise, query execution will get confused if the same relation seems to have a different PartitionDesc at different times. Invent a new PartitionDirectory concept and use it to ensure consistency. This ensures that a single invocation of either the planner or the executor sees the same view of the PartitionDesc from beginning to end, but it does not guarantee that the planner and the executor see the same view. Since this allows pointers to old PartitionDesc entries to survive even after a relcache rebuild, also postpone removing the old PartitionDesc entry until we're certain no one is using it.

For the most part, it seems to be OK for the planner and executor to have different views of the PartitionDesc, because the executor will just ignore any concurrently added partitions which were unknown at plan time; those partitions won't be part of the inheritance expansion, but invalidation messages will trigger replanning at some point. Normally, this happens by the time the very next command is executed, but if the next command acquires no locks and executes a prepared query, it can manage not to notice until a new transaction is started. We might want to tighten that up, but it's material for a separate patch. There would still be a small window where a query that started just after an ATTACH PARTITION command committed might fail to notice its results -- but only if the command starts before the commit has been acknowledged to the user. All in all, the warts here around serializability seem small enough to be worth accepting for the considerable advantage of being able to add partitions without a full table lock.

Although in general the consequences of new partitions showing up between planning and execution are limited to the query not noticing the new partitions, run-time partition pruning will get confused in that case, so that's the third problem that this patch fixes. Run-time partition pruning assumes that indexes into the PartitionDesc are stable between planning and execution. So, add code so that if new partitions are added between plan time and execution time, the indexes stored in the subplan_map[] and subpart_map[] arrays within the plan's PartitionedRelPruneInfo get adjusted accordingly. There does not seem to be a simple way to generalize this scheme to cope with partitions that are removed, mostly because they could then get added back again with different bounds, but it works OK for added partitions.

This code does not try to ensure that every backend participating in a parallel query sees the same view of the PartitionDesc. That currently doesn't matter, because we never pass PartitionDesc indexes between backends. Each backend will ignore the concurrently added partitions which it notices, and it doesn't matter if different backends are ignoring different sets of concurrently added partitions. If in the future that matters, for example because we allow writes in parallel query and want all participants to do tuple routing to the same set of partitions, the PartitionDirectory concept could be improved to share PartitionDescs across backends. There is a draft patch to serialize and restore PartitionDescs on the thread where this patch was discussed, which may be a useful place to start.

Patch by me. Thanks to Alvaro Herrera, David Rowley, Simon Riggs, Amit Langote, and Michael Paquier for discussion, and to Alvaro Herrera for some review.

Discussion: http://postgr.es/m/CA+Tgmobt2upbSocvvDej3yzokd7AkiT+PvgFH+a9-5VV1oJNSQ@mail.gmail.com
Discussion: http://postgr.es/m/CA+TgmoZE0r9-cyA-aY6f8WFEROaDLLL7Vf81kZ8MtFCkxpeQSw@mail.gmail.com
Discussion: http://postgr.es/m/CA+TgmoY13KQZF-=HNTrt9UYWYx3_oYOQpu9ioNT49jGgiDpUEA@mail.gmail.com
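
To illustrate the PartitionDirectory idea described in the message, here is a minimal standalone sketch. It is not the PostgreSQL implementation, which is not part of the excerpt shown here. Every name in it (`SketchPartDesc`, `SketchRelation`, `SketchDirectory`, `sketch_directory_lookup`) is hypothetical, and it assumes a small fixed-size array rather than whatever lookup structure the real code uses. The point it tries to show is only this: the first lookup for a relation is remembered, so later lookups through the same directory return the same descriptor even if the relation's current descriptor has been replaced in the meantime.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; these are NOT the PostgreSQL definitions. */
typedef unsigned int Oid;

typedef struct SketchPartDesc
{
    int         nparts;         /* number of partitions */
    const Oid  *oids;           /* partition OIDs, in bound order */
} SketchPartDesc;

typedef struct SketchRelation
{
    Oid         relid;
    const SketchPartDesc *current;  /* may be swapped by concurrent DDL */
} SketchRelation;

#define SKETCH_DIR_MAX 16       /* arbitrary limit chosen for the sketch */

typedef struct SketchDirEntry
{
    Oid         relid;
    const SketchPartDesc *pd;   /* descriptor seen at first lookup */
} SketchDirEntry;

typedef struct SketchDirectory
{
    int         nentries;
    SketchDirEntry entries[SKETCH_DIR_MAX];
} SketchDirectory;

void
sketch_directory_init(SketchDirectory *dir)
{
    dir->nentries = 0;
}

/*
 * Return the descriptor remembered for rel, recording the relation's
 * current descriptor the first time the relation is looked up.
 */
const SketchPartDesc *
sketch_directory_lookup(SketchDirectory *dir, const SketchRelation *rel)
{
    for (int i = 0; i < dir->nentries; i++)
    {
        if (dir->entries[i].relid == rel->relid)
            return dir->entries[i].pd;
    }

    assert(dir->nentries < SKETCH_DIR_MAX);
    dir->entries[dir->nentries].relid = rel->relid;
    dir->entries[dir->nentries].pd = rel->current;
    return dir->entries[dir->nentries++].pd;
}
```

In this sketch one directory would be created per planner run and one per executor run, which mirrors the guarantee stated above: consistency within each phase, but not between the two. The sketch deliberately leaves out the lifetime problem the message mentions, namely keeping the old descriptor alive until nobody is using it.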
1 parent ec51727 commit 898e5e3

21 files changed, +314 -45 lines changed

doc/src/sgml/ddl.sgml

+2 -1
@@ -3827,7 +3827,8 @@ ALTER TABLE measurement ATTACH PARTITION measurement_y2008m02
     the system will be able to skip the scan to validate the implicit
     partition constraint. Without such a constraint, the table will be
     scanned to validate the partition constraint while holding an
-    <literal>ACCESS EXCLUSIVE</literal> lock on the parent table.
+    <literal>ACCESS EXCLUSIVE</literal> lock on that partition
+    and a <literal>SHARE UPDATE EXCLUSIVE</literal> lock on the parent table.
     One may then drop the constraint after <command>ATTACH PARTITION</command>
     is finished, because it is no longer necessary.
    </para>

src/backend/commands/copy.c

+1 -1
@@ -2556,7 +2556,7 @@ CopyFrom(CopyState cstate)
      * CopyFrom tuple routing.
      */
     if (cstate->rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
-        proute = ExecSetupPartitionTupleRouting(NULL, cstate->rel);
+        proute = ExecSetupPartitionTupleRouting(estate, NULL, cstate->rel);

     if (cstate->whereClause)
         cstate->qualexpr = ExecInitQual(castNode(List, cstate->whereClause),

src/backend/commands/tablecmds.c

+3
@@ -3692,6 +3692,9 @@ AlterTableGetLockLevel(List *cmds)
                 break;

             case AT_AttachPartition:
+                cmd_lockmode = ShareUpdateExclusiveLock;
+                break;
+
             case AT_DetachPartition:
                 cmd_lockmode = AccessExclusiveLock;
                 break;
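
To make the practical effect of this lock-level change concrete, the following standalone sketch encodes the conflict sets that the PostgreSQL documentation lists for table-level lock modes and checks two modes against each other. It is an illustration only: the enum, the bitmask table, and `sketch_locks_conflict` are hypothetical names invented here, not PostgreSQL's lock manager API.

```c
#include <assert.h>
#include <stdbool.h>

/* Table-level lock modes, numbered here for illustration only. */
typedef enum SketchLockMode
{
    SK_ACCESS_SHARE = 1,
    SK_ROW_SHARE,
    SK_ROW_EXCLUSIVE,
    SK_SHARE_UPDATE_EXCLUSIVE,
    SK_SHARE,
    SK_SHARE_ROW_EXCLUSIVE,
    SK_EXCLUSIVE,
    SK_ACCESS_EXCLUSIVE
} SketchLockMode;

#define SK_BIT(mode) (1u << (mode))

/* Conflict sets, indexed by lock mode; entry 0 is unused. */
static const unsigned sketch_conflicts[] = {
    0,
    /* ACCESS SHARE */
    SK_BIT(SK_ACCESS_EXCLUSIVE),
    /* ROW SHARE */
    SK_BIT(SK_EXCLUSIVE) | SK_BIT(SK_ACCESS_EXCLUSIVE),
    /* ROW EXCLUSIVE */
    SK_BIT(SK_SHARE) | SK_BIT(SK_SHARE_ROW_EXCLUSIVE) |
    SK_BIT(SK_EXCLUSIVE) | SK_BIT(SK_ACCESS_EXCLUSIVE),
    /* SHARE UPDATE EXCLUSIVE */
    SK_BIT(SK_SHARE_UPDATE_EXCLUSIVE) | SK_BIT(SK_SHARE) |
    SK_BIT(SK_SHARE_ROW_EXCLUSIVE) | SK_BIT(SK_EXCLUSIVE) |
    SK_BIT(SK_ACCESS_EXCLUSIVE),
    /* SHARE */
    SK_BIT(SK_ROW_EXCLUSIVE) | SK_BIT(SK_SHARE_UPDATE_EXCLUSIVE) |
    SK_BIT(SK_SHARE_ROW_EXCLUSIVE) | SK_BIT(SK_EXCLUSIVE) |
    SK_BIT(SK_ACCESS_EXCLUSIVE),
    /* SHARE ROW EXCLUSIVE */
    SK_BIT(SK_ROW_EXCLUSIVE) | SK_BIT(SK_SHARE_UPDATE_EXCLUSIVE) |
    SK_BIT(SK_SHARE) | SK_BIT(SK_SHARE_ROW_EXCLUSIVE) |
    SK_BIT(SK_EXCLUSIVE) | SK_BIT(SK_ACCESS_EXCLUSIVE),
    /* EXCLUSIVE */
    SK_BIT(SK_ROW_SHARE) | SK_BIT(SK_ROW_EXCLUSIVE) |
    SK_BIT(SK_SHARE_UPDATE_EXCLUSIVE) | SK_BIT(SK_SHARE) |
    SK_BIT(SK_SHARE_ROW_EXCLUSIVE) | SK_BIT(SK_EXCLUSIVE) |
    SK_BIT(SK_ACCESS_EXCLUSIVE),
    /* ACCESS EXCLUSIVE conflicts with every mode */
    SK_BIT(SK_ACCESS_SHARE) | SK_BIT(SK_ROW_SHARE) |
    SK_BIT(SK_ROW_EXCLUSIVE) | SK_BIT(SK_SHARE_UPDATE_EXCLUSIVE) |
    SK_BIT(SK_SHARE) | SK_BIT(SK_SHARE_ROW_EXCLUSIVE) |
    SK_BIT(SK_EXCLUSIVE) | SK_BIT(SK_ACCESS_EXCLUSIVE)
};

/* True if a lock in mode "wanted" must wait for a lock held in mode "held". */
bool
sketch_locks_conflict(SketchLockMode held, SketchLockMode wanted)
{
    assert(held >= SK_ACCESS_SHARE && held <= SK_ACCESS_EXCLUSIVE);
    assert(wanted >= SK_ACCESS_SHARE && wanted <= SK_ACCESS_EXCLUSIVE);
    return (sketch_conflicts[held] & SK_BIT(wanted)) != 0;
}
```

Read against the hunk above: with the parent held in SHARE UPDATE EXCLUSIVE mode, plain reads (ACCESS SHARE) and inserts or updates (ROW EXCLUSIVE) on the parent do not conflict, whereas under ACCESS EXCLUSIVE they would all wait. Two concurrent ATTACH PARTITION commands on the same parent still conflict with each other, because SHARE UPDATE EXCLUSIVE conflicts with itself.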

src/backend/executor/execPartition.c

+77 -19
@@ -167,7 +167,8 @@ static void ExecInitRoutingInfo(ModifyTableState *mtstate,
                     PartitionDispatch dispatch,
                     ResultRelInfo *partRelInfo,
                     int partidx);
-static PartitionDispatch ExecInitPartitionDispatchInfo(PartitionTupleRouting *proute,
+static PartitionDispatch ExecInitPartitionDispatchInfo(EState *estate,
+                              PartitionTupleRouting *proute,
                               Oid partoid, PartitionDispatch parent_pd, int partidx);
 static void FormPartitionKeyDatum(PartitionDispatch pd,
                       TupleTableSlot *slot,
@@ -201,7 +202,8 @@ static void find_matching_subplans_recurse(PartitionPruningData *prunedata,
  * it should be estate->es_query_cxt.
  */
 PartitionTupleRouting *
-ExecSetupPartitionTupleRouting(ModifyTableState *mtstate, Relation rel)
+ExecSetupPartitionTupleRouting(EState *estate, ModifyTableState *mtstate,
+                               Relation rel)
 {
     PartitionTupleRouting *proute;
     ModifyTable *node = mtstate ? (ModifyTable *) mtstate->ps.plan : NULL;
@@ -223,7 +225,8 @@ ExecSetupPartitionTupleRouting(ModifyTableState *mtstate, Relation rel)
      * parent as NULL as we don't need to care about any parent of the target
      * partitioned table.
      */
-    ExecInitPartitionDispatchInfo(proute, RelationGetRelid(rel), NULL, 0);
+    ExecInitPartitionDispatchInfo(estate, proute, RelationGetRelid(rel),
+                                  NULL, 0);

     /*
      * If performing an UPDATE with tuple routing, we can reuse partition
@@ -424,7 +427,8 @@ ExecFindPartition(ModifyTableState *mtstate,
                  * Create the new PartitionDispatch. We pass the current one
                  * in as the parent PartitionDispatch
                  */
-                subdispatch = ExecInitPartitionDispatchInfo(proute,
+                subdispatch = ExecInitPartitionDispatchInfo(mtstate->ps.state,
+                                                            proute,
                                                             partdesc->oids[partidx],
                                                             dispatch, partidx);
                 Assert(dispatch->indexes[partidx] >= 0 &&
@@ -988,7 +992,8 @@ ExecInitRoutingInfo(ModifyTableState *mtstate,
  * PartitionDispatch later.
  */
 static PartitionDispatch
-ExecInitPartitionDispatchInfo(PartitionTupleRouting *proute, Oid partoid,
+ExecInitPartitionDispatchInfo(EState *estate,
+                              PartitionTupleRouting *proute, Oid partoid,
                               PartitionDispatch parent_pd, int partidx)
 {
     Relation rel;
@@ -997,6 +1002,10 @@ ExecInitPartitionDispatchInfo(PartitionTupleRouting *proute, Oid partoid,
     int dispatchidx;
     MemoryContext oldcxt;

+    if (estate->es_partition_directory == NULL)
+        estate->es_partition_directory =
+            CreatePartitionDirectory(estate->es_query_cxt);
+
     oldcxt = MemoryContextSwitchTo(proute->memcxt);

     /*
@@ -1008,7 +1017,7 @@ ExecInitPartitionDispatchInfo(PartitionTupleRouting *proute, Oid partoid,
         rel = table_open(partoid, RowExclusiveLock);
     else
         rel = proute->partition_root;
-    partdesc = RelationGetPartitionDesc(rel);
+    partdesc = PartitionDirectoryLookup(estate->es_partition_directory, rel);

     pd = (PartitionDispatch) palloc(offsetof(PartitionDispatchData, indexes) +
                                     partdesc->nparts * sizeof(int));
@@ -1554,6 +1563,10 @@ ExecCreatePartitionPruneState(PlanState *planstate,
     ListCell *lc;
     int i;

+    if (estate->es_partition_directory == NULL)
+        estate->es_partition_directory =
+            CreatePartitionDirectory(estate->es_query_cxt);
+
     n_part_hierarchies = list_length(partitionpruneinfo->prune_infos);
     Assert(n_part_hierarchies > 0);

@@ -1610,18 +1623,6 @@ ExecCreatePartitionPruneState(PlanState *planstate,
             int n_steps;
             ListCell *lc3;

-            /*
-             * We must copy the subplan_map rather than pointing directly to
-             * the plan's version, as we may end up making modifications to it
-             * later.
-             */
-            pprune->subplan_map = palloc(sizeof(int) * pinfo->nparts);
-            memcpy(pprune->subplan_map, pinfo->subplan_map,
-                   sizeof(int) * pinfo->nparts);
-
-            /* We can use the subpart_map verbatim, since we never modify it */
-            pprune->subpart_map = pinfo->subpart_map;
-
             /* present_parts is also subject to later modification */
             pprune->present_parts = bms_copy(pinfo->present_parts);

@@ -1633,7 +1634,64 @@ ExecCreatePartitionPruneState(PlanState *planstate,
              */
             partrel = ExecGetRangeTableRelation(estate, pinfo->rtindex);
             partkey = RelationGetPartitionKey(partrel);
-            partdesc = RelationGetPartitionDesc(partrel);
+            partdesc = PartitionDirectoryLookup(estate->es_partition_directory,
+                                                partrel);
+
+            /*
+             * Initialize the subplan_map and subpart_map. Since detaching a
+             * partition requires AccessExclusiveLock, no partitions can have
+             * disappeared, nor can the bounds for any partition have changed.
+             * However, new partitions may have been added.
+             */
+            Assert(partdesc->nparts >= pinfo->nparts);
+            pprune->subplan_map = palloc(sizeof(int) * partdesc->nparts);
+            if (partdesc->nparts == pinfo->nparts)
+            {
+                /*
+                 * There are no new partitions, so this is simple. We can
+                 * simply point to the subpart_map from the plan, but we must
+                 * copy the subplan_map since we may change it later.
+                 */
+                pprune->subpart_map = pinfo->subpart_map;
+                memcpy(pprune->subplan_map, pinfo->subplan_map,
+                       sizeof(int) * pinfo->nparts);
+
+                /* Double-check that list of relations has not changed. */
+                Assert(memcmp(partdesc->oids, pinfo->relid_map,
+                       pinfo->nparts * sizeof(Oid)) == 0);
+            }
+            else
+            {
+                int pd_idx = 0;
+                int pp_idx;
+
+                /*
+                 * Some new partitions have appeared since plan time, and
+                 * those are reflected in our PartitionDesc but were not
+                 * present in the one used to construct subplan_map and
+                 * subpart_map. So we must construct new and longer arrays
+                 * where the partitions that were originally present map to the
+                 * same place, and any added indexes map to -1, as if the
+                 * new partitions had been pruned.
+                 */
+                pprune->subpart_map = palloc(sizeof(int) * partdesc->nparts);
+                for (pp_idx = 0; pp_idx < partdesc->nparts; ++pp_idx)
+                {
+                    if (pinfo->relid_map[pd_idx] != partdesc->oids[pp_idx])
+                    {
+                        pprune->subplan_map[pp_idx] = -1;
+                        pprune->subpart_map[pp_idx] = -1;
+                    }
+                    else
+                    {
+                        pprune->subplan_map[pp_idx] =
+                            pinfo->subplan_map[pd_idx];
+                        pprune->subpart_map[pp_idx] =
+                            pinfo->subpart_map[pd_idx++];
+                    }
+                }
+                Assert(pd_idx == pinfo->nparts);
+            }

             n_steps = list_length(pinfo->pruning_steps);
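
The index adjustment in the last hunk can be tried outside the executor with a small standalone sketch. It is a simplified restatement under stated assumptions, not the patch's code: `sketch_remap` and its parameters are hypothetical names, `malloc` stands in for `palloc`, and a single map is rebuilt where the hunk rebuilds both `subplan_map` and `subpart_map`. It also assumes, as the hunk does, that both OID lists are in the same relative order and that every plan-time partition is still present. The sketch additionally checks the plan-side index against its array length before reading; that guard is a choice made here.

```c
#include <assert.h>
#include <stdlib.h>

typedef unsigned int Oid;       /* stand-in for PostgreSQL's Oid */

/*
 * Build an execution-time map of length exec_nparts from a plan-time map of
 * length plan_nparts.  Partitions known at plan time keep their plan-time
 * entry; partitions that appeared afterwards get -1, as if they had been
 * pruned.  Returns a malloc'd array (caller frees), or NULL if out of memory.
 */
int *
sketch_remap(const Oid *plan_oids, const int *plan_map, int plan_nparts,
             const Oid *exec_oids, int exec_nparts)
{
    int        *exec_map;
    int         plan_idx = 0;

    assert(exec_nparts >= plan_nparts);

    exec_map = malloc(sizeof(int) * (exec_nparts > 0 ? exec_nparts : 1));
    if (exec_map == NULL)
        return NULL;

    for (int exec_idx = 0; exec_idx < exec_nparts; exec_idx++)
    {
        if (plan_idx < plan_nparts &&
            plan_oids[plan_idx] == exec_oids[exec_idx])
            exec_map[exec_idx] = plan_map[plan_idx++];  /* known at plan time */
        else
            exec_map[exec_idx] = -1;                    /* added since planning */
    }

    /* Every plan-time partition must have been matched exactly once. */
    assert(plan_idx == plan_nparts);
    return exec_map;
}
```

For example, if the plan saw partitions with OIDs {10, 30} mapped to subplans {0, 1}, and execution sees {10, 20, 30, 40}, the rebuilt map is {0, -1, 1, -1}: the two partitions attached after planning are treated as pruned.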

src/backend/executor/execUtils.c

+8
@@ -54,6 +54,7 @@
 #include "mb/pg_wchar.h"
 #include "nodes/nodeFuncs.h"
 #include "parser/parsetree.h"
+#include "partitioning/partdesc.h"
 #include "storage/lmgr.h"
 #include "utils/builtins.h"
 #include "utils/memutils.h"
@@ -214,6 +215,13 @@ FreeExecutorState(EState *estate)
         estate->es_jit = NULL;
     }

+    /* release partition directory, if allocated */
+    if (estate->es_partition_directory)
+    {
+        DestroyPartitionDirectory(estate->es_partition_directory);
+        estate->es_partition_directory = NULL;
+    }
+
     /*
      * Free the per-query memory context, thereby releasing all working
      * memory, including the EState node itself.

src/backend/executor/nodeModifyTable.c

+1 -1
@@ -2186,7 +2186,7 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
     if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE &&
         (operation == CMD_INSERT || update_tuple_routing_needed))
         mtstate->mt_partition_tuple_routing =
-            ExecSetupPartitionTupleRouting(mtstate, rel);
+            ExecSetupPartitionTupleRouting(estate, mtstate, rel);

     /*
      * Build state for collecting transition tuples. This requires having a

src/backend/nodes/copyfuncs.c

+1
@@ -1197,6 +1197,7 @@ _copyPartitionedRelPruneInfo(const PartitionedRelPruneInfo *from)
     COPY_SCALAR_FIELD(nexprs);
     COPY_POINTER_FIELD(subplan_map, from->nparts * sizeof(int));
     COPY_POINTER_FIELD(subpart_map, from->nparts * sizeof(int));
+    COPY_POINTER_FIELD(relid_map, from->nparts * sizeof(int));
     COPY_POINTER_FIELD(hasexecparam, from->nexprs * sizeof(bool));
     COPY_SCALAR_FIELD(do_initial_prune);
     COPY_SCALAR_FIELD(do_exec_prune);

src/backend/nodes/outfuncs.c

+1
@@ -947,6 +947,7 @@ _outPartitionedRelPruneInfo(StringInfo str, const PartitionedRelPruneInfo *node)
     WRITE_INT_FIELD(nexprs);
     WRITE_INT_ARRAY(subplan_map, node->nparts);
     WRITE_INT_ARRAY(subpart_map, node->nparts);
+    WRITE_OID_ARRAY(relid_map, node->nparts);
     WRITE_BOOL_ARRAY(hasexecparam, node->nexprs);
     WRITE_BOOL_FIELD(do_initial_prune);
     WRITE_BOOL_FIELD(do_exec_prune);

src/backend/nodes/readfuncs.c

+1
@@ -2386,6 +2386,7 @@ _readPartitionedRelPruneInfo(void)
     READ_INT_FIELD(nexprs);
     READ_INT_ARRAY(subplan_map, local_node->nparts);
     READ_INT_ARRAY(subpart_map, local_node->nparts);
+    READ_OID_ARRAY(relid_map, local_node->nparts);
     READ_BOOL_ARRAY(hasexecparam, local_node->nexprs);
     READ_BOOL_FIELD(do_initial_prune);
     READ_BOOL_FIELD(do_exec_prune);

src/backend/optimizer/plan/planner.c

+4
@@ -56,6 +56,7 @@
 #include "parser/analyze.h"
 #include "parser/parsetree.h"
 #include "parser/parse_agg.h"
+#include "partitioning/partdesc.h"
 #include "rewrite/rewriteManip.h"
 #include "storage/dsm_impl.h"
 #include "utils/rel.h"
@@ -567,6 +568,9 @@ standard_planner(Query *parse, int cursorOptions, ParamListInfo boundParams)
         result->jitFlags |= PGJIT_DEFORM;
     }

+    if (glob->partition_directory != NULL)
+        DestroyPartitionDirectory(glob->partition_directory);
+
     return result;
 }

src/backend/optimizer/util/inherit.c

+8 -1
@@ -147,6 +147,10 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
     {
         Assert(rte->relkind == RELKIND_PARTITIONED_TABLE);

+        if (root->glob->partition_directory == NULL)
+            root->glob->partition_directory =
+                CreatePartitionDirectory(CurrentMemoryContext);
+
         /*
          * If this table has partitions, recursively expand and lock them.
          * While at it, also extract the partition key columns of all the
@@ -246,7 +250,10 @@ expand_partitioned_rtentry(PlannerInfo *root, RangeTblEntry *parentrte,
     int i;
     RangeTblEntry *childrte;
     Index childRTindex;
-    PartitionDesc partdesc = RelationGetPartitionDesc(parentrel);
+    PartitionDesc partdesc;
+
+    partdesc = PartitionDirectoryLookup(root->glob->partition_directory,
+                                        parentrel);

     check_stack_depth();

src/backend/optimizer/util/plancat.c

+2 -1
@@ -2086,7 +2086,8 @@ set_relation_partition_info(PlannerInfo *root, RelOptInfo *rel,

     Assert(relation->rd_rel->relkind == RELKIND_PARTITIONED_TABLE);

-    partdesc = RelationGetPartitionDesc(relation);
+    partdesc = PartitionDirectoryLookup(root->glob->partition_directory,
+                                        relation);
     partkey = RelationGetPartitionKey(relation);
     rel->part_scheme = find_partition_scheme(root, relation);
     Assert(partdesc != NULL && rel->part_scheme != NULL);
