
Commit 78530c8

Add hash_mem_multiplier GUC.
Add a GUC that acts as a multiplier on work_mem.  It gets applied when sizing executor node hash tables that were previously size constrained using work_mem alone.

The new GUC can be used to preferentially give hash-based nodes more memory than the generic work_mem limit.  It is intended to enable admin tuning of the executor's memory usage.  Overall system throughput and system responsiveness can be improved by giving hash-based executor nodes more memory (especially over sort-based alternatives, which are often much less sensitive to being memory constrained).

The default value for hash_mem_multiplier is 1.0, which is also the minimum valid value.  This means that hash-based nodes continue to apply work_mem in the traditional way by default.

hash_mem_multiplier is generally useful.  However, it is being added now due to concerns about hash aggregate performance stability for users that upgrade to Postgres 13 (which added disk-based hash aggregation in commit 1f39bce).  While the old hash aggregate behavior risked out-of-memory errors, it is nevertheless likely that many users actually benefited.  Hash agg's previous indifference to work_mem during query execution was not just faster; it also accidentally made aggregation resilient to grouping estimate problems (at least in cases where this didn't create destabilizing memory pressure).

hash_mem_multiplier can provide a certain kind of continuity with the behavior of Postgres 12 hash aggregates in cases where the planner incorrectly estimates that all groups (plus related allocations) will fit in work_mem/hash_mem.  This seems necessary because hash-based aggregation is usually much slower when only a small fraction of all groups can fit.  Even when it isn't possible to totally avoid hash aggregates that spill, giving hash aggregation more memory will reliably improve performance (the same cannot be said for external sort operations, which appear to be almost unaffected by memory availability provided it's at least possible to get a single merge pass).

The PostgreSQL 13 release notes should advise users that increasing hash_mem_multiplier can help with performance regressions associated with hash aggregation.  That can be taken care of by a later commit.

Author: Peter Geoghegan
Reviewed-By: Álvaro Herrera, Jeff Davis
Discussion: https://postgr.es/m/20200625203629.7m6yvut7eqblgmfo@alap3.anarazel.de
Discussion: https://postgr.es/m/CAH2-WzmD%2Bi1pG6rc1%2BCjc4V6EaFJ_qSuKCCHVnH%3DoruqD-zqow%40mail.gmail.com
Backpatch: 13-, where disk-based hash aggregation was introduced.
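To make the mechanics concrete: the effective budget for hash-based nodes is simply work_mem scaled by hash_mem_multiplier, clamped to the same MAX_KILOBYTES cap that work_mem itself honors.  The standalone C sketch below mirrors the arithmetic of the get_hash_mem() helper added by this commit (see nodeHash.c further down); the GUC values and the MAX_KILOBYTES stand-in are hard-coded assumptions for illustration only.

#include <limits.h>
#include <stdio.h>

/* Hard-coded stand-ins for the real GUCs and the guc.h cap (assumptions) */
#define MAX_KILOBYTES INT_MAX              /* platform-dependent in the real tree */
static int    work_mem = 4096;             /* KB; the 4MB default */
static double hash_mem_multiplier = 2.0;   /* 1.0 is the default */

/* Mirrors the arithmetic of get_hash_mem() */
static int
effective_hash_mem(void)
{
    double  hash_mem = (double) work_mem * hash_mem_multiplier;

    /* Clamp rather than error out, so the two GUCs can be set independently */
    if (hash_mem < MAX_KILOBYTES)
        return (int) hash_mem;
    return MAX_KILOBYTES;
}

int
main(void)
{
    /* work_mem = 4MB and hash_mem_multiplier = 2.0 yields 8192 KB */
    printf("hash-based nodes may use up to %d KB\n", effective_hash_mem());
    return 0;
}

Sort-based nodes continue to see plain work_mem; only hash-based nodes consult the scaled value.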
1 parent 3a232a3 commit 78530c8

19 files changed: +205 -81 lines changed

doc/src/sgml/config.sgml

Lines changed: 51 additions & 9 deletions
@@ -1663,22 +1663,64 @@ include_dir 'conf.d'
      </term>
      <listitem>
       <para>
-        Sets the maximum amount of memory to be used by a query operation
+        Sets the base maximum amount of memory to be used by a query operation
        (such as a sort or hash table) before writing to temporary disk files.
        If this value is specified without units, it is taken as kilobytes.
        The default value is four megabytes (<literal>4MB</literal>).
        Note that for a complex query, several sort or hash operations might be
-        running in parallel; each operation will be allowed to use as much memory
-        as this value specifies before it starts to write data into temporary
-        files.  Also, several running sessions could be doing such operations
-        concurrently.  Therefore, the total memory used could be many
-        times the value of <varname>work_mem</varname>; it is necessary to
-        keep this fact in mind when choosing the value.  Sort operations are
-        used for <literal>ORDER BY</literal>, <literal>DISTINCT</literal>, and
-        merge joins.
+        running in parallel; each operation will generally be allowed
+        to use as much memory as this value specifies before it starts
+        to write data into temporary files.  Also, several running
+        sessions could be doing such operations concurrently.
+        Therefore, the total memory used could be many times the value
+        of <varname>work_mem</varname>; it is necessary to keep this
+        fact in mind when choosing the value.  Sort operations are used
+        for <literal>ORDER BY</literal>, <literal>DISTINCT</literal>,
+        and merge joins.
        Hash tables are used in hash joins, hash-based aggregation, and
        hash-based processing of <literal>IN</literal> subqueries.
       </para>
+       <para>
+        Hash-based operations are generally more sensitive to memory
+        availability than equivalent sort-based operations.  The
+        memory available for hash tables is computed by multiplying
+        <varname>work_mem</varname> by
+        <varname>hash_mem_multiplier</varname>.  This makes it
+        possible for hash-based operations to use an amount of memory
+        that exceeds the usual <varname>work_mem</varname> base
+        amount.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry id="guc-hash-mem-multiplier" xreflabel="hash_mem_multiplier">
+      <term><varname>hash_mem_multiplier</varname> (<type>floating point</type>)
+      <indexterm>
+       <primary><varname>hash_mem_multiplier</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Used to compute the maximum amount of memory that hash-based
+        operations can use.  The final limit is determined by
+        multiplying <varname>work_mem</varname> by
+        <varname>hash_mem_multiplier</varname>.  The default value is
+        1.0, which makes hash-based operations subject to the same
+        simple <varname>work_mem</varname> maximum as sort-based
+        operations.
+       </para>
+       <para>
+        Consider increasing <varname>hash_mem_multiplier</varname> in
+        environments where spilling by query operations is a regular
+        occurrence, especially when simply increasing
+        <varname>work_mem</varname> results in memory pressure (memory
+        pressure typically takes the form of intermittent out of
+        memory errors).  A setting of 1.5 or 2.0 may be effective with
+        mixed workloads.  Higher settings in the range of 2.0 - 8.0 or
+        more may be effective in environments where
+        <varname>work_mem</varname> has already been increased to 40MB
+        or more.
+       </para>
       </listitem>
      </varlistentry>
 

doc/src/sgml/ref/postgres-ref.sgml

Lines changed: 4 additions & 4 deletions
@@ -338,10 +338,10 @@ PostgreSQL documentation
      <term><option>-S</option> <replaceable class="parameter">work-mem</replaceable></term>
      <listitem>
       <para>
-        Specifies the amount of memory to be used by internal sorts and hashes
-        before resorting to temporary disk files.  See the description of the
-        <varname>work_mem</varname> configuration parameter in <xref
-        linkend="runtime-config-resource-memory"/>.
+        Specifies the base amount of memory to be used by sorts and
+        hash tables before resorting to temporary disk files.  See the
+        description of the <varname>work_mem</varname> configuration
+        parameter in <xref linkend="runtime-config-resource-memory"/>.
       </para>
      </listitem>
     </varlistentry>

doc/src/sgml/runtime.sgml

Lines changed: 6 additions & 4 deletions
@@ -1326,10 +1326,12 @@ Out of Memory: Killed process 12345 (postgres).
    system running out of memory, you can avoid the problem by changing
    your configuration.  In some cases, it may help to lower memory-related
    configuration parameters, particularly
-   <link linkend="guc-shared-buffers"><varname>shared_buffers</varname></link>
-   and <link linkend="guc-work-mem"><varname>work_mem</varname></link>.  In
-   other cases, the problem may be caused by allowing too many connections
-   to the database server itself.  In many cases, it may be better to reduce
+   <link linkend="guc-shared-buffers"><varname>shared_buffers</varname></link>,
+   <link linkend="guc-work-mem"><varname>work_mem</varname></link>, and
+   <link linkend="guc-hash-mem-multiplier"><varname>hash_mem_multiplier</varname></link>.
+   In other cases, the problem may be caused by allowing too many
+   connections to the database server itself.  In many cases, it may
+   be better to reduce
    <link linkend="guc-max-connections"><varname>max_connections</varname></link>
    and instead make use of external connection-pooling software.
   </para>

src/backend/executor/execGrouping.c

Lines changed: 3 additions & 2 deletions
@@ -165,13 +165,14 @@ BuildTupleHashTableExt(PlanState *parent,
 {
     TupleHashTable hashtable;
     Size        entrysize = sizeof(TupleHashEntryData) + additionalsize;
+    int         hash_mem = get_hash_mem();
     MemoryContext oldcontext;
     bool        allow_jit;
 
     Assert(nbuckets > 0);
 
-    /* Limit initial table size request to not more than work_mem */
-    nbuckets = Min(nbuckets, (long) ((work_mem * 1024L) / entrysize));
+    /* Limit initial table size request to not more than hash_mem */
+    nbuckets = Min(nbuckets, (long) ((hash_mem * 1024L) / entrysize));
 
     oldcontext = MemoryContextSwitchTo(metacxt);
 
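The behavioral change in BuildTupleHashTableExt() above is only that the initial bucket-count request is capped by the scaled hash_mem budget instead of bare work_mem.  A rough standalone sketch of that clamp, with the entry size, requested bucket count, and hash_mem value all invented for illustration:

#include <stdio.h>

#define Min(x, y) ((x) < (y) ? (x) : (y))

int
main(void)
{
    long    nbuckets = 1000000;     /* hypothetical caller estimate of group count */
    long    entrysize = 64;         /* hypothetical bytes per hash entry */
    int     hash_mem = 8192;        /* KB, e.g. work_mem=4MB with hash_mem_multiplier=2.0 */

    /* Same clamp as the patched BuildTupleHashTableExt() */
    nbuckets = Min(nbuckets, (hash_mem * 1024L) / entrysize);

    /* 8192 * 1024 / 64 = 131072, so the initial request is trimmed to 131072 */
    printf("initial nbuckets = %ld\n", nbuckets);
    return 0;
}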

src/backend/executor/nodeAgg.c

Lines changed: 16 additions & 14 deletions
@@ -203,7 +203,7 @@
  * entries (and initialize new transition states), we instead spill them to
  * disk to be processed later.  The tuples are spilled in a partitioned
  * manner, so that subsequent batches are smaller and less likely to exceed
- * work_mem (if a batch does exceed work_mem, it must be spilled
+ * hash_mem (if a batch does exceed hash_mem, it must be spilled
  * recursively).
  *
  * Spilled data is written to logical tapes.  These provide better control
@@ -212,7 +212,7 @@
  *
  * Note that it's possible for transition states to start small but then
  * grow very large; for instance in the case of ARRAY_AGG.  In such cases,
- * it's still possible to significantly exceed work_mem.  We try to avoid
+ * it's still possible to significantly exceed hash_mem.  We try to avoid
 * this situation by estimating what will fit in the available memory, and
 * imposing a limit on the number of groups separately from the amount of
 * memory consumed.
@@ -1516,7 +1516,7 @@ build_hash_table(AggState *aggstate, int setno, long nbuckets)
 
     /*
      * Used to make sure initial hash table allocation does not exceed
-     * work_mem.  Note that the estimate does not include space for
+     * hash_mem.  Note that the estimate does not include space for
      * pass-by-reference transition data values, nor for the representative
      * tuple of each group.
      */
@@ -1782,7 +1782,7 @@ hashagg_recompile_expressions(AggState *aggstate, bool minslot, bool nullcheck)
 }
 
 /*
- * Set limits that trigger spilling to avoid exceeding work_mem. Consider the
+ * Set limits that trigger spilling to avoid exceeding hash_mem. Consider the
  * number of partitions we expect to create (if we do spill).
  *
  * There are two limits: a memory limit, and also an ngroups limit. The
@@ -1796,13 +1796,14 @@ hash_agg_set_limits(double hashentrysize, double input_groups, int used_bits,
 {
     int         npartitions;
     Size        partition_mem;
+    int         hash_mem = get_hash_mem();
 
-    /* if not expected to spill, use all of work_mem */
-    if (input_groups * hashentrysize < work_mem * 1024L)
+    /* if not expected to spill, use all of hash_mem */
+    if (input_groups * hashentrysize < hash_mem * 1024L)
     {
         if (num_partitions != NULL)
             *num_partitions = 0;
-        *mem_limit = work_mem * 1024L;
+        *mem_limit = hash_mem * 1024L;
         *ngroups_limit = *mem_limit / hashentrysize;
         return;
     }
@@ -1824,14 +1825,14 @@
         HASHAGG_WRITE_BUFFER_SIZE * npartitions;
 
     /*
-     * Don't set the limit below 3/4 of work_mem. In that case, we are at the
+     * Don't set the limit below 3/4 of hash_mem. In that case, we are at the
      * minimum number of partitions, so we aren't going to dramatically exceed
      * work mem anyway.
      */
-    if (work_mem * 1024L > 4 * partition_mem)
-        *mem_limit = work_mem * 1024L - partition_mem;
+    if (hash_mem * 1024L > 4 * partition_mem)
+        *mem_limit = hash_mem * 1024L - partition_mem;
     else
-        *mem_limit = work_mem * 1024L * 0.75;
+        *mem_limit = hash_mem * 1024L * 0.75;
 
     if (*mem_limit > hashentrysize)
         *ngroups_limit = *mem_limit / hashentrysize;
@@ -1989,19 +1990,20 @@ hash_choose_num_partitions(double input_groups, double hashentrysize,
     int         partition_limit;
     int         npartitions;
     int         partition_bits;
+    int         hash_mem = get_hash_mem();
 
     /*
      * Avoid creating so many partitions that the memory requirements of the
-     * open partition files are greater than 1/4 of work_mem.
+     * open partition files are greater than 1/4 of hash_mem.
      */
     partition_limit =
-        (work_mem * 1024L * 0.25 - HASHAGG_READ_BUFFER_SIZE) /
+        (hash_mem * 1024L * 0.25 - HASHAGG_READ_BUFFER_SIZE) /
         HASHAGG_WRITE_BUFFER_SIZE;
 
     mem_wanted = HASHAGG_PARTITION_FACTOR * input_groups * hashentrysize;
 
     /* make enough partitions so that each one is likely to fit in memory */
-    npartitions = 1 + (mem_wanted / (work_mem * 1024L));
+    npartitions = 1 + (mem_wanted / (hash_mem * 1024L));
 
     if (npartitions > partition_limit)
         npartitions = partition_limit;
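As a worked example of the hash_agg_set_limits() branches above, suppose hash_mem works out to 8192 KB, each group is estimated at roughly 200 bytes, and the planner expects 100,000 groups.  The sketch below follows the same two branches as the patched code, but treats the per-partition buffer overhead (partition_mem) as a given, hypothetical input rather than recomputing it via hash_choose_num_partitions():

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Simplified sketch of the limit-setting logic shown above.  partition_mem is
 * treated as a given, hypothetical input instead of being derived from the
 * partition count and buffer sizes.
 */
static void
set_limits_sketch(double hashentrysize, double input_groups, int hash_mem,
                  size_t partition_mem, size_t *mem_limit, uint64_t *ngroups_limit)
{
    /* if not expected to spill, use all of hash_mem (a KB value) */
    if (input_groups * hashentrysize < hash_mem * 1024.0)
    {
        *mem_limit = (size_t) hash_mem * 1024;
        *ngroups_limit = (uint64_t) (*mem_limit / hashentrysize);
        return;
    }

    /* otherwise leave room for partition buffers, but keep at least 3/4 of hash_mem */
    if ((size_t) hash_mem * 1024 > 4 * partition_mem)
        *mem_limit = (size_t) hash_mem * 1024 - partition_mem;
    else
        *mem_limit = (size_t) (hash_mem * 1024.0 * 0.75);

    *ngroups_limit = (uint64_t) (*mem_limit / hashentrysize);
}

int
main(void)
{
    size_t      mem_limit;
    uint64_t    ngroups_limit;

    /* 100,000 expected groups at ~200 bytes each do not fit in 8MB of hash_mem */
    set_limits_sketch(200.0, 100000.0, 8192, 256 * 1024, &mem_limit, &ngroups_limit);

    /* prints mem_limit = 8126464 bytes, ngroups_limit = 40632 groups */
    printf("mem_limit = %zu bytes, ngroups_limit = %llu groups\n",
           mem_limit, (unsigned long long) ngroups_limit);
    return 0;
}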

src/backend/executor/nodeHash.c

Lines changed: 61 additions & 19 deletions
@@ -39,6 +39,7 @@
 #include "port/atomics.h"
 #include "port/pg_bitutils.h"
 #include "utils/dynahash.h"
+#include "utils/guc.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/syscache.h"
@@ -506,7 +507,7 @@ ExecHashTableCreate(HashState *state, List *hashOperators, List *hashCollations,
     hashtable->spaceAllowed = space_allowed;
     hashtable->spaceUsedSkew = 0;
     hashtable->spaceAllowedSkew =
-        hashtable->spaceAllowed * SKEW_WORK_MEM_PERCENT / 100;
+        hashtable->spaceAllowed * SKEW_HASH_MEM_PERCENT / 100;
     hashtable->chunks = NULL;
     hashtable->current_chunk = NULL;
     hashtable->parallel_state = state->parallel_state;
@@ -665,7 +666,7 @@ ExecHashTableCreate(HashState *state, List *hashOperators, List *hashCollations,
 
 void
 ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
-                        bool try_combined_work_mem,
+                        bool try_combined_hash_mem,
                         int parallel_workers,
                         size_t *space_allowed,
                         int *numbuckets,
@@ -682,6 +683,7 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
     int         nbatch = 1;
     int         nbuckets;
     double      dbuckets;
+    int         hash_mem = get_hash_mem();
 
     /* Force a plausible relation size if no info */
     if (ntuples <= 0.0)
@@ -698,16 +700,16 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
     inner_rel_bytes = ntuples * tupsize;
 
     /*
-     * Target in-memory hashtable size is work_mem kilobytes.
+     * Target in-memory hashtable size is hash_mem kilobytes.
      */
-    hash_table_bytes = work_mem * 1024L;
+    hash_table_bytes = hash_mem * 1024L;
 
     /*
-     * Parallel Hash tries to use the combined work_mem of all workers to
-     * avoid the need to batch.  If that won't work, it falls back to work_mem
+     * Parallel Hash tries to use the combined hash_mem of all workers to
+     * avoid the need to batch.  If that won't work, it falls back to hash_mem
      * per worker and tries to process batches in parallel.
      */
-    if (try_combined_work_mem)
+    if (try_combined_hash_mem)
         hash_table_bytes += hash_table_bytes * parallel_workers;
 
     *space_allowed = hash_table_bytes;
@@ -728,7 +730,7 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
      */
     if (useskew)
     {
-        skew_table_bytes = hash_table_bytes * SKEW_WORK_MEM_PERCENT / 100;
+        skew_table_bytes = hash_table_bytes * SKEW_HASH_MEM_PERCENT / 100;
 
         /*----------
          * Divisor is:
@@ -751,7 +753,7 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
     /*
      * Set nbuckets to achieve an average bucket load of NTUP_PER_BUCKET when
      * memory is filled, assuming a single batch; but limit the value so that
-     * the pointer arrays we'll try to allocate do not exceed work_mem nor
+     * the pointer arrays we'll try to allocate do not exceed hash_mem nor
      * MaxAllocSize.
      *
     * Note that both nbuckets and nbatch must be powers of 2 to make
@@ -790,10 +792,10 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
         long        bucket_size;
 
         /*
-         * If Parallel Hash with combined work_mem would still need multiple
-         * batches, we'll have to fall back to regular work_mem budget.
+         * If Parallel Hash with combined hash_mem would still need multiple
+         * batches, we'll have to fall back to regular hash_mem budget.
          */
-        if (try_combined_work_mem)
+        if (try_combined_hash_mem)
         {
             ExecChooseHashTableSize(ntuples, tupwidth, useskew,
                                     false, parallel_workers,
@@ -805,7 +807,7 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
         }
 
         /*
-         * Estimate the number of buckets we'll want to have when work_mem is
+         * Estimate the number of buckets we'll want to have when hash_mem is
          * entirely full.  Each bucket will contain a bucket pointer plus
          * NTUP_PER_BUCKET tuples, whose projected size already includes
         * overhead for the hash code, pointer to the next tuple, etc.
@@ -820,8 +822,8 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
         /*
          * Buckets are simple pointers to hashjoin tuples, while tupsize
          * includes the pointer, hash code, and MinimalTupleData.  So buckets
-         * should never really exceed 25% of work_mem (even for
-         * NTUP_PER_BUCKET=1); except maybe for work_mem values that are not
+         * should never really exceed 25% of hash_mem (even for
+         * NTUP_PER_BUCKET=1); except maybe for hash_mem values that are not
          * 2^N bytes, where we might get more because of doubling.  So let's
          * look for 50% here.
          */
@@ -1095,15 +1097,17 @@ ExecParallelHashIncreaseNumBatches(HashJoinTable hashtable)
                 /* Figure out how many batches to use. */
                 if (hashtable->nbatch == 1)
                 {
+                    int         hash_mem = get_hash_mem();
+
                     /*
                      * We are going from single-batch to multi-batch.  We need
                      * to switch from one large combined memory budget to the
-                     * regular work_mem budget.
+                     * regular hash_mem budget.
                      */
-                    pstate->space_allowed = work_mem * 1024L;
+                    pstate->space_allowed = hash_mem * 1024L;
 
                     /*
-                     * The combined work_mem of all participants wasn't
+                     * The combined hash_mem of all participants wasn't
                      * enough.  Therefore one batch per participant would be
                      * approximately equivalent and would probably also be
                      * insufficient.  So try two batches per participant,
@@ -2855,7 +2859,7 @@ ExecParallelHashTupleAlloc(HashJoinTable hashtable, size_t size,
 
     /*
     * Check if our space limit would be exceeded.  To avoid choking on
-     * very large tuples or very low work_mem setting, we'll always allow
+     * very large tuples or very low hash_mem setting, we'll always allow
      * each backend to allocate at least one chunk.
      */
     if (hashtable->batches[0].at_least_one_chunk &&
@@ -3366,3 +3370,41 @@ ExecParallelHashTuplePrealloc(HashJoinTable hashtable, int batchno, size_t size)
 
     return true;
 }
+
+/*
+ * Get a hash_mem value by multiplying the work_mem GUC's value by the
+ * hash_mem_multiplier GUC's value.
+ *
+ * Returns a work_mem style KB value that hash-based nodes (including but not
+ * limited to hash join) use in place of work_mem.  This is subject to the
+ * same restrictions as work_mem itself.  (There is no such thing as the
+ * hash_mem GUC, but it's convenient for our callers to pretend that there
+ * is.)
+ *
+ * Exported for use by the planner, as well as other hash-based executor
+ * nodes.  This is a rather random place for this, but there is no better
+ * place.
+ */
+int
+get_hash_mem(void)
+{
+    double      hash_mem;
+
+    Assert(hash_mem_multiplier >= 1.0);
+
+    hash_mem = (double) work_mem * hash_mem_multiplier;
+
+    /*
+     * guc.c enforces a MAX_KILOBYTES limitation on work_mem in order to
+     * support the assumption that raw derived byte values can be stored in
+     * 'long' variables.  The returned hash_mem value must also meet this
+     * assumption.
+     *
+     * We clamp the final value rather than throw an error because it should
+     * be possible to set work_mem and hash_mem_multiplier independently.
+     */
+    if (hash_mem < MAX_KILOBYTES)
+        return (int) hash_mem;
+
+    return MAX_KILOBYTES;
+}
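One practical consequence of the ExecChooseHashTableSize() changes above is that Parallel Hash sizes its initial single-batch attempt against the combined hash_mem of the leader and all workers, falling back to a per-participant hash_mem budget only if that still is not enough.  A small sketch of the combined-budget step, with the worker count and GUC values chosen arbitrarily for illustration:

#include <stdio.h>

/* Hypothetical settings: work_mem = 4MB, hash_mem_multiplier = 2.0 */
static long
hash_mem_kb(void)
{
    return (long) (4096 * 2.0);
}

int
main(void)
{
    int     parallel_workers = 3;
    long    hash_table_bytes = hash_mem_kb() * 1024L;

    /*
     * Mirrors the try_combined_hash_mem path: each worker contributes one
     * additional hash_mem-sized share to the single shared hash table.
     */
    hash_table_bytes += hash_table_bytes * parallel_workers;

    /* 8MB * (1 + 3) = 32MB available before falling back to batching */
    printf("combined budget = %ld bytes\n", hash_table_bytes);
    return 0;
}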
