Commit 90b1271

tglsfdc authored and Commitfest Bot committed
Improve hash join's handling of tuples with null join keys.

In a plain join, we can just summarily discard an input tuple with null join key(s), since it cannot match anything from the other side of the join (assuming a strict join operator). However, if the tuple comes from the outer side of an outer join then we have to emit it with null-extension of the other side. Up to now, hash joins did that by inserting the tuple into the hash table as though it were a normal tuple. This is unnecessarily inefficient though, since the required processing is far simpler than for a potentially-matchable tuple. Worse, if there are a lot of such tuples they will bloat the hash bucket they go into, possibly causing useless repeated attempts to split that bucket or increase the number of batches. We have a report of a large join vainly creating many thousands of batches when faced with such input.

This patch improves the situation by keeping such tuples out of the hash table altogether, instead pushing them into a separate tuplestore from which we return them later. (One might consider trying to return them immediately; but that would require substantial refactoring, and it doesn't work anyway for the case where we rescan an unmodified hash table.)

This works even in parallel hash joins, because whichever worker reads a null-keyed tuple can just return it; there's no need for consultation with other workers. Thus the tuplestores are local storage even in a parallel join.
1 parent 1722d5e commit 90b1271
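
The routing rule described in the commit message can be sketched in isolation. The following is a minimal, self-contained C illustration and not code from the patch: `DemoTuple`, `DemoRoute`, `demo_route_tuple` and `demo_route_all` are hypothetical names, and a single integer key stands in for the real join keys. It only shows the three-way decision (hash table, side store, or discard).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an input tuple with a single join key. */
typedef struct DemoTuple
{
    bool key_is_null;   /* true if the join key is NULL */
    int  key;           /* key value; meaningful only when not NULL */
} DemoTuple;

/* Where a build-side tuple ends up in this sketch. */
typedef enum DemoRoute
{
    DEMO_ROUTE_HASH_TABLE,  /* potentially matchable: goes into the hash table */
    DEMO_ROUTE_NULL_STORE,  /* null key, but must be emitted null-extended later */
    DEMO_ROUTE_DISCARD      /* null key and nobody needs it */
} DemoRoute;

/* Decide the destination of one tuple. */
static DemoRoute
demo_route_tuple(const DemoTuple *tup, bool keep_null_tuples)
{
    if (!tup->key_is_null)
        return DEMO_ROUTE_HASH_TABLE;
    if (keep_null_tuples)
        return DEMO_ROUTE_NULL_STORE;
    return DEMO_ROUTE_DISCARD;
}

/* Tally of destinations, used here only to make the routing observable. */
typedef struct DemoCounts
{
    size_t hashed;
    size_t stored;
    size_t discarded;
} DemoCounts;

/* Route every tuple of an input array and count the outcomes. */
static DemoCounts
demo_route_all(const DemoTuple *tuples, size_t n, bool keep_null_tuples)
{
    DemoCounts counts = {0, 0, 0};

    assert(tuples != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
    {
        switch (demo_route_tuple(&tuples[i], keep_null_tuples))
        {
            case DEMO_ROUTE_HASH_TABLE:
                counts.hashed++;
                break;
            case DEMO_ROUTE_NULL_STORE:
                counts.stored++;
                break;
            case DEMO_ROUTE_DISCARD:
                counts.discarded++;
                break;
        }
    }
    return counts;
}
```

With `keep_null_tuples` set, null-keyed tuples are counted as stored instead of hashed, so they never contribute to the size of any hash bucket in this model.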

13 files changed: 356 additions & 59 deletions

src/backend/executor/execExpr.c

Lines changed: 10 additions & 10 deletions
@@ -4282,25 +4282,25 @@ ExecBuildHash32FromAttrs(TupleDesc desc, const TupleTableSlotOps *ops,
  * 'hash_exprs'. When multiple expressions are present, the hash values
  * returned by each hash function are combined to produce a single hash value.
  *
+ * If any hash_expr yields NULL and the corresponding hash function is strict,
+ * the created ExprState will return NULL.
+ *
  * desc: tuple descriptor for the to-be-hashed expressions
  * ops: TupleTableSlotOps for the TupleDesc
  * hashfunc_oids: Oid for each hash function to call, one for each 'hash_expr'
- * collations: collation to use when calling the hash function.
- * hash_expr: list of expressions to hash the value of
- * opstrict: array corresponding to the 'hashfunc_oids' to store op_strict()
+ * collations: collation to use when calling the hash function
+ * hash_exprs: list of expressions to hash the value of
+ * opstrict: strictness flag for each hash function
  * parent: PlanState node that the 'hash_exprs' will be evaluated at
  * init_value: Normally 0, but can be set to other values to seed the hash
  *     with some other value. Using non-zero is slightly less efficient but can
  *     be useful.
- * keep_nulls: if true, evaluation of the returned ExprState will abort early
- *     returning NULL if the given hash function is strict and the Datum to hash
- *     is null. When set to false, any NULL input Datums are skipped.
  */
 ExprState *
 ExecBuildHash32Expr(TupleDesc desc, const TupleTableSlotOps *ops,
                     const Oid *hashfunc_oids, const List *collations,
                     const List *hash_exprs, const bool *opstrict,
-                    PlanState *parent, uint32 init_value, bool keep_nulls)
+                    PlanState *parent, uint32 init_value)
 {
     ExprState *state = makeNode(ExprState);
     ExprEvalStep scratch = {0};
@@ -4377,8 +4377,8 @@ ExecBuildHash32Expr(TupleDesc desc, const TupleTableSlotOps *ops,
         fmgr_info(funcid, finfo);

         /*
-         * Build the steps to evaluate the hash function's argument have it so
-         * the value of that is stored in the 0th argument of the hash func.
+         * Build the steps to evaluate the hash function's argument, placing
+         * the value in the 0th argument of the hash func.
          */
         ExecInitExprRec(expr,
                         state,
@@ -4413,7 +4413,7 @@ ExecBuildHash32Expr(TupleDesc desc, const TupleTableSlotOps *ops,
         scratch.d.hashdatum.fcinfo_data = fcinfo;
         scratch.d.hashdatum.fn_addr = finfo->fn_addr;

-        scratch.opcode = opstrict[i] && !keep_nulls ? strict_opcode : opcode;
+        scratch.opcode = opstrict[i] ? strict_opcode : opcode;
         scratch.d.hashdatum.jumpdone = -1;

         ExprEvalPushStep(state, &scratch);
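
The revised header comment says the built expression returns NULL whenever a key is NULL and its hash function is strict. A minimal sketch of that rule, under stated assumptions, follows. `DemoKey`, `demo_rotl1`, `demo_hash_u32` and `demo_hash_keys` are hypothetical names. The multiplicative toy hash and the rotate-then-XOR combining step are choices made for illustration, as is the decision that a NULL under a non-strict function adds nothing to the running hash. None of this is taken from the diff above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one evaluated hash key. */
typedef struct DemoKey
{
    bool     is_null;
    uint32_t value;
} DemoKey;

/* Rotate a 32-bit word left by one bit. */
static uint32_t
demo_rotl1(uint32_t w)
{
    return (w << 1) | (w >> 31);
}

/* Toy per-key hash; NOT one of PostgreSQL's hash functions. */
static uint32_t
demo_hash_u32(uint32_t v)
{
    return v * 2654435761u;
}

/*
 * Combine the hashes of all keys, starting from init_value.
 *
 * Returns false (meaning "the result is NULL") as soon as a NULL key meets
 * a strict hash function. Otherwise stores the combined hash in *result
 * and returns true.
 */
static bool
demo_hash_keys(const DemoKey *keys, const bool *opstrict, size_t nkeys,
               uint32_t init_value, uint32_t *result)
{
    uint32_t hash = init_value;

    assert(result != NULL);
    for (size_t i = 0; i < nkeys; i++)
    {
        if (keys[i].is_null && opstrict[i])
            return false;       /* strict function, NULL input: NULL result */

        hash = demo_rotl1(hash);
        if (!keys[i].is_null)
            hash ^= demo_hash_u32(keys[i].value);
        /* assumption: a NULL under a non-strict function adds nothing */
    }

    *result = hash;
    return true;
}
```

A caller that gets `false` back knows the tuple cannot match anything through a strict join operator, which is the signal the hash node acts on when it decides whether to store or drop the tuple.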

src/backend/executor/nodeHash.c

Lines changed: 55 additions & 11 deletions
@@ -154,8 +154,11 @@ MultiExecPrivateHash(HashState *node)
     econtext = node->ps.ps_ExprContext;

     /*
-     * Get all tuples from the node below the Hash node and insert into the
-     * hash table (or temp files).
+     * Get all tuples from the node below the Hash node and insert the
+     * potentially-matchable ones into the hash table (or temp files). Tuples
+     * that can't possibly match because they have null join keys are dumped
+     * into a separate tuplestore, or just summarily discarded if we don't
+     * need to emit them with null-extension.
      */
     for (;;)
     {
@@ -175,6 +178,7 @@ MultiExecPrivateHash(HashState *node)

         if (!isnull)
         {
+            /* normal case with a non-null join key */
             uint32 hashvalue = DatumGetUInt32(hashdatum);
             int bucketNumber;

@@ -193,6 +197,14 @@ MultiExecPrivateHash(HashState *node)
             }
             hashtable->totalTuples += 1;
         }
+        else if (node->keep_null_tuples)
+        {
+            /* null join key, but we must save tuple to be emitted later */
+            if (node->null_tuple_store == NULL)
+                node->null_tuple_store = ExecHashBuildNullTupleStore(hashtable);
+            tuplestore_puttupleslot(node->null_tuple_store, slot);
+        }
+        /* else we can discard the tuple immediately */
     }

     /* resize the hash table if needed (NTUP_PER_BUCKET exceeded) */
@@ -223,7 +235,6 @@ MultiExecParallelHash(HashState *node)
     HashJoinTable hashtable;
     TupleTableSlot *slot;
     ExprContext *econtext;
-    uint32 hashvalue;
     Barrier *build_barrier;
     int i;

@@ -283,6 +294,7 @@ MultiExecParallelHash(HashState *node)
             for (;;)
             {
                 bool isnull;
+                uint32 hashvalue;

                 slot = ExecProcNode(outerNode);
                 if (TupIsNull(slot))
@@ -296,8 +308,19 @@ MultiExecParallelHash(HashState *node)
                                                        &isnull));

                 if (!isnull)
+                {
+                    /* normal case with a non-null join key */
                     ExecParallelHashTableInsert(hashtable, slot, hashvalue);
-                hashtable->partialTuples++;
+                    hashtable->partialTuples++;
+                }
+                else if (node->keep_null_tuples)
+                {
+                    /* null join key, but save tuple to be emitted later */
+                    if (node->null_tuple_store == NULL)
+                        node->null_tuple_store = ExecHashBuildNullTupleStore(hashtable);
+                    tuplestore_puttupleslot(node->null_tuple_store, slot);
+                }
+                /* else we can discard the tuple immediately */
             }

             /*
@@ -405,14 +428,10 @@ ExecInitHash(Hash *node, EState *estate, int eflags)

     Assert(node->plan.qual == NIL);

-    /*
-     * Delay initialization of hash_expr until ExecInitHashJoin(). We cannot
-     * build the ExprState here as we don't yet know the join type we're going
-     * to be hashing values for and we need to know that before calling
-     * ExecBuildHash32Expr as the keep_nulls parameter depends on the join
-     * type.
-     */
+    /* these fields will be filled by ExecInitHashJoin() */
     hashstate->hash_expr = NULL;
+    hashstate->null_tuple_store = NULL;
+    hashstate->keep_null_tuples = false;

     return hashstate;
 }
@@ -2748,6 +2767,31 @@ ExecHashRemoveNextSkewBucket(HashJoinTable hashtable)
     }
 }

+/*
+ * Build a tuplestore suitable for holding null-keyed input tuples.
+ * (This function doesn't care whether it's for outer or inner tuples.)
+ *
+ * Note that in a parallel hash join, each worker has its own tuplestore(s)
+ * for these. There's no need to interact with other workers to decide
+ * what to do with them. So they're always in private storage.
+ */
+Tuplestorestate *
+ExecHashBuildNullTupleStore(HashJoinTable hashtable)
+{
+    Tuplestorestate *tstore;
+    MemoryContext oldcxt;
+
+    /*
+     * We keep the tuplestore in the hashCxt to ensure it won't go away too
+     * soon. Size it at work_mem/16 so that it doesn't bloat the node's space
+     * consumption too much.
+     */
+    oldcxt = MemoryContextSwitchTo(hashtable->hashCxt);
+    tstore = tuplestore_begin_heap(false, false, work_mem / 16);
+    MemoryContextSwitchTo(oldcxt);
+    return tstore;
+}
+
 /*
  * Reserve space in the DSM segment for instrumentation data.
  */
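
The two ideas in `ExecHashBuildNullTupleStore` and its callers, creating the store only when the first null-keyed tuple arrives and capping its memory at one sixteenth of `work_mem`, can be modelled outside PostgreSQL. The sketch below is a toy model under stated assumptions. `DemoStore`, `demo_store_create` and `demo_store_put` are hypothetical names. It tracks byte counts rather than tuples, and it merely flags the point where the budget is exceeded, where a real tuplestore would start writing to temporary files.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy model of a memory-budgeted side store for null-keyed tuples. */
typedef struct DemoStore
{
    size_t budget_bytes;    /* in-memory budget: work_mem / 16 */
    size_t used_bytes;      /* bytes accounted for so far */
    size_t ntuples;         /* number of tuples put into the store */
    bool   spilled;         /* true once the budget has been exceeded */
} DemoStore;

/* Allocate a store sized at one sixteenth of work_mem (given in kilobytes). */
static DemoStore *
demo_store_create(size_t work_mem_kb)
{
    DemoStore *store = malloc(sizeof(DemoStore));

    if (store == NULL)
        abort();
    store->budget_bytes = (work_mem_kb / 16) * 1024;
    store->used_bytes = 0;
    store->ntuples = 0;
    store->spilled = false;
    return store;
}

/*
 * Record one tuple of the given size, creating the store on first use.
 * Callers start with *store_p == NULL, so inputs without any null-keyed
 * tuples never pay for a store at all.
 */
static void
demo_store_put(DemoStore **store_p, size_t work_mem_kb, size_t tuple_bytes)
{
    assert(store_p != NULL);
    if (*store_p == NULL)
        *store_p = demo_store_create(work_mem_kb);

    (*store_p)->used_bytes += tuple_bytes;
    (*store_p)->ntuples++;
    if ((*store_p)->used_bytes > (*store_p)->budget_bytes)
        (*store_p)->spilled = true;     /* a real store would go to disk here */
}
```

Keeping the budget small relative to `work_mem` means the side store adds little to the node's memory footprint, which is the motivation the patch comment gives for the divisor.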
