
Commit 45be99f

Support parallel joins, and make related improvements.
The core innovation of this patch is the introduction of the concept of a partial path; that is, a path which if executed in parallel will generate a subset of the output rows in each process.  Gathering a partial path produces an ordinary (complete) path.  This allows us to generate paths for parallel joins by joining a partial path for one side (which at the baserel level is currently always a Partial Seq Scan) to an ordinary path on the other side.  This is subject to various restrictions at present, especially that this strategy seems unlikely to be sensible for merge joins, so only nested loop and hash join paths are generated.

This also allows an Append node to be pushed below a Gather node in the case of a partitioned table.

Testing revealed that early versions of this patch made poor decisions in some cases, which turned out to be caused by the fact that the original cost model for Parallel Seq Scan wasn't very good.  So this patch tries to make some modest improvements in that area.

There is much more to be done in the area of generating good parallel plans in all cases, but this seems like a useful step forward.

Patch by me, reviewed by Dilip Kumar and Amit Kapila.
1 parent a7de3dc commit 45be99f

File tree

15 files changed: +875, -119 lines changed


src/backend/executor/execParallel.c

+40, -26
@@ -167,25 +167,25 @@ ExecParallelEstimate(PlanState *planstate, ExecParallelEstimateContext *e)
 	e->nnodes++;
 
 	/* Call estimators for parallel-aware nodes. */
-	switch (nodeTag(planstate))
+	if (planstate->plan->parallel_aware)
 	{
-		case T_SeqScanState:
-			ExecSeqScanEstimate((SeqScanState *) planstate,
-								e->pcxt);
-			break;
-		default:
-			break;
+		switch (nodeTag(planstate))
+		{
+			case T_SeqScanState:
+				ExecSeqScanEstimate((SeqScanState *) planstate,
+									e->pcxt);
+				break;
+			default:
+				break;
+		}
 	}
 
 	return planstate_tree_walker(planstate, ExecParallelEstimate, e);
 }
 
 /*
- * Ordinary plan nodes won't do anything here, but parallel-aware plan nodes
- * may need to initialize shared state in the DSM before parallel workers
- * are available.  They can allocate the space they previous estimated using
- * shm_toc_allocate, and add the keys they previously estimated using
- * shm_toc_insert, in each case targeting pcxt->toc.
+ * Initialize the dynamic shared memory segment that will be used to control
+ * parallel execution.
  */
 static bool
 ExecParallelInitializeDSM(PlanState *planstate,
@@ -202,15 +202,26 @@ ExecParallelInitializeDSM(PlanState *planstate,
 	/* Count this node. */
 	d->nnodes++;
 
-	/* Call initializers for parallel-aware plan nodes. */
-	switch (nodeTag(planstate))
+	/*
+	 * Call initializers for parallel-aware plan nodes.
+	 *
+	 * Ordinary plan nodes won't do anything here, but parallel-aware plan
+	 * nodes may need to initialize shared state in the DSM before parallel
+	 * workers are available.  They can allocate the space they previously
+	 * estimated using shm_toc_allocate, and add the keys they previously
+	 * estimated using shm_toc_insert, in each case targeting pcxt->toc.
+	 */
+	if (planstate->plan->parallel_aware)
 	{
-		case T_SeqScanState:
-			ExecSeqScanInitializeDSM((SeqScanState *) planstate,
-									 d->pcxt);
-			break;
-		default:
-			break;
+		switch (nodeTag(planstate))
+		{
+			case T_SeqScanState:
+				ExecSeqScanInitializeDSM((SeqScanState *) planstate,
+										 d->pcxt);
+				break;
+			default:
+				break;
+		}
 	}
 
 	return planstate_tree_walker(planstate, ExecParallelInitializeDSM, d);
@@ -623,13 +634,16 @@ ExecParallelInitializeWorker(PlanState *planstate, shm_toc *toc)
 		return false;
 
 	/* Call initializers for parallel-aware plan nodes. */
-	switch (nodeTag(planstate))
+	if (planstate->plan->parallel_aware)
 	{
-		case T_SeqScanState:
-			ExecSeqScanInitializeWorker((SeqScanState *) planstate, toc);
-			break;
-		default:
-			break;
+		switch (nodeTag(planstate))
+		{
+			case T_SeqScanState:
+				ExecSeqScanInitializeWorker((SeqScanState *) planstate, toc);
+				break;
+			default:
+				break;
+		}
 	}
 
 	return planstate_tree_walker(planstate, ExecParallelInitializeWorker, toc);

src/backend/nodes/outfuncs.c

+3, -1
@@ -1591,6 +1591,8 @@ _outPathInfo(StringInfo str, const Path *node)
 	else
 		_outBitmapset(str, NULL);
 	WRITE_BOOL_FIELD(parallel_aware);
+	WRITE_BOOL_FIELD(parallel_safe);
+	WRITE_INT_FIELD(parallel_degree);
 	WRITE_FLOAT_FIELD(rows, "%.0f");
 	WRITE_FLOAT_FIELD(startup_cost, "%.2f");
 	WRITE_FLOAT_FIELD(total_cost, "%.2f");
@@ -1768,7 +1770,6 @@ _outGatherPath(StringInfo str, const GatherPath *node)
 	_outPathInfo(str, (const Path *) node);
 
 	WRITE_NODE_FIELD(subpath);
-	WRITE_INT_FIELD(num_workers);
 	WRITE_BOOL_FIELD(single_copy);
 }
 
@@ -1890,6 +1891,7 @@ _outRelOptInfo(StringInfo str, const RelOptInfo *node)
 	WRITE_NODE_FIELD(reltargetlist);
 	WRITE_NODE_FIELD(pathlist);
 	WRITE_NODE_FIELD(ppilist);
+	WRITE_NODE_FIELD(partial_pathlist);
 	WRITE_NODE_FIELD(cheapest_startup_path);
 	WRITE_NODE_FIELD(cheapest_total_path);
 	WRITE_NODE_FIELD(cheapest_unique_path);

src/backend/optimizer/README

+54, -1
@@ -851,4 +851,57 @@ lateral reference.  (Perhaps now that that stuff works, we could relax the
 pullup restriction?)
 
 
--- bjm & tgl
+Parallel Query and Partial Paths
+--------------------------------
+
+Parallel query involves dividing up the work that needs to be performed
+either by an entire query or some portion of the query in such a way that
+some of that work can be done by one or more worker processes, which are
+called parallel workers.  Parallel workers are a subtype of dynamic
+background workers; see src/backend/access/transam/README.parallel for a
+fuller description.  The academic literature on parallel query suggests
+that parallel execution strategies can be divided into essentially two
+categories: pipelined parallelism, where the execution of the query is
+divided into multiple stages and each stage is handled by a separate
+process; and partitioning parallelism, where the data is split between
+multiple processes and each process handles a subset of it.  The
+literature, however, suggests that gains from pipeline parallelism are
+often very limited due to the difficulty of avoiding pipeline stalls.
+Consequently, we do not currently attempt to generate query plans that
+use this technique.
+
+Instead, we focus on partitioning parallelism, which does not require
+that the underlying table be partitioned.  It only requires that (1)
+there is some method of dividing the data from at least one of the base
+tables involved in the relation across multiple processes, (2) each
+process is allowed to handle its own portion of the data, and (3) the
+results are then collected.  Requirements (2) and (3) are satisfied by
+the executor node Gather, which launches any number of worker processes
+and executes its single child plan in all of them (and perhaps in the
+leader also, if the children aren't generating enough data to keep the
+leader busy).  Requirement (1) is handled by the SeqScan node: when
+invoked with parallel_aware = true, this node will, in effect, partition
+the table on a block-by-block basis, returning a subset of the tuples
+from the relation in each worker where that SeqScan is executed.  A
+similar scheme could be (and probably should be) implemented for bitmap
+heap scans.
+
+Just as we do for non-parallel access methods, we build Paths to
+represent access strategies that can be used in a parallel plan.  These
+are, in essence, the same strategies that are available in the
+non-parallel plan, but there is an important difference: a path that
+will run beneath a Gather node returns only a subset of the query
+results in each worker, not all of them.  To form a path that can
+actually be executed, the (rather large) cost of the Gather node must be
+accounted for.  For this reason among others, paths intended to run
+beneath a Gather node - which we call "partial" paths since they return
+only a subset of the results in each worker - must be kept separate from
+ordinary paths (see RelOptInfo's partial_pathlist and the function
+add_partial_path).
+
+One of the keys to making parallel query effective is to run as much of
+the query in parallel as possible.  Therefore, we expect it to generally
+be desirable to postpone the Gather stage until as near to the top of the
+plan as possible.  Expanding the range of cases in which more work can be
+pushed below the Gather (and costing it accurately) is likely to keep us
+busy for a long time to come.
