
Commit bd10ec5

Detect redundant GROUP BY columns using UNIQUE indexes
d4c3a15 added support that when the GROUP BY contained all of the columns
belonging to a relation's PRIMARY KEY, all other columns belonging to that
relation would be removed from the GROUP BY clause. That's possible because
all other columns are functionally dependent on the PRIMARY KEY and those
columns alone ensure the groups are distinct.

Here we expand on that optimization and allow it to work for any unique
indexes on the table rather than just the PRIMARY KEY index. This normally
requires that all columns in the index are defined with NOT NULL, however, we
can relax that requirement when the index is defined with NULLS NOT DISTINCT.

When there are multiple suitable indexes to allow columns to be removed, we
prefer the index with the least number of columns as this allows us to remove
the highest number of GROUP BY columns. One day, we may want to revisit that
decision as it may make more sense to use the narrower set of columns in
terms of the width of the data types and stored/queried data.

This also adjusts the code to make use of RelOptInfo.indexlist rather than
looking up the catalog tables.

In passing, add another short-circuit path to allow bailing out earlier in
cases where it's certainly not possible to remove redundant GROUP BY columns.
This early exit is now cheaper to do than when this code was originally
written as 00b4146 made it cheaper to check for empty Bitmapsets.

Patch originally by Zhang Mingli and later worked on by jian he, but after I
(David) worked on it, there was very little of the original left.

Author: Zhang Mingli, jian he, David Rowley
Reviewed-by: jian he, Andrei Lepikhov
Discussion: https://postgr.es/m/327990c8-b9b2-4b0c-bffb-462249f82de0%40Spark
1 parent d8f3351 commit bd10ec5
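
As a rough sketch of the behaviour described above (the table, column, and
index names here are hypothetical and not taken from this commit; the
regression test changes below exercise the real cases), a unique index lets
the planner drop the other grouped columns:

-- Minimal illustration; assumes a server built with this change.
create table widgets (id int not null, sku text, label text, qty int);
create unique index widgets_id_uidx on widgets (id);

-- "label" is functionally dependent on the unique NOT NULL column "id",
-- so the Group Key in the EXPLAIN output should reduce to just "id".
explain (costs off) select id, label, sum(qty) from widgets group by id, label;

-- A NULLable unique column only qualifies when the index is declared
-- NULLS NOT DISTINCT, since an ordinary unique index allows many NULL rows.
create unique index widgets_sku_uidx on widgets (sku) nulls not distinct;
explain (costs off) select sku, label, sum(qty) from widgets group by sku, label;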

File tree

5 files changed: 200 additions & 22 deletions

src/backend/optimizer/plan/initsplan.c

Lines changed: 98 additions & 22 deletions
@@ -400,24 +400,21 @@ add_vars_to_attr_needed(PlannerInfo *root, List *vars,
  *
  * Since some other DBMSes do not allow references to ungrouped columns, it's
  * not unusual to find all columns listed in GROUP BY even though listing the
- * primary-key columns would be sufficient.  Deleting such excess columns
- * avoids redundant sorting work, so it's worth doing.
+ * primary-key columns, or columns of a unique constraint would be sufficient.
+ * Deleting such excess columns avoids redundant sorting or hashing work, so
+ * it's worth doing.
  *
  * Relcache invalidations will ensure that cached plans become invalidated
- * when the underlying index of the pkey constraint is dropped.
- *
- * Currently, we only make use of pkey constraints for this, however, we may
- * wish to take this further in the future and also use unique constraints
- * which have NOT NULL columns.  In that case, plan invalidation will still
- * work since relations will receive a relcache invalidation when a NOT NULL
- * constraint is dropped.
+ * when the underlying supporting indexes are dropped or if a column's NOT
+ * NULL attribute is removed.
  */
 void
 remove_useless_groupby_columns(PlannerInfo *root)
 {
     Query      *parse = root->parse;
     Bitmapset **groupbyattnos;
     Bitmapset **surplusvars;
+    bool        tryremove = false;
     ListCell   *lc;
     int         relid;
 
@@ -457,10 +454,24 @@ remove_useless_groupby_columns(PlannerInfo *root)
         /* OK, remember we have this Var */
         relid = var->varno;
         Assert(relid <= list_length(parse->rtable));
+
+        /*
+         * If this isn't the first column for this relation then we now have
+         * multiple columns.  That means there might be some that can be
+         * removed.
+         */
+        tryremove |= !bms_is_empty(groupbyattnos[relid]);
         groupbyattnos[relid] = bms_add_member(groupbyattnos[relid],
                                               var->varattno - FirstLowInvalidHeapAttributeNumber);
     }
 
+    /*
+     * No Vars or didn't find multiple Vars for any relation in the GROUP BY?
+     * If so, nothing can be removed, so don't waste more effort trying.
+     */
+    if (!tryremove)
+        return;
+
     /*
      * Consider each relation and see if it is possible to remove some of its
      * Vars from GROUP BY.  For simplicity and speed, we do the actual removal
@@ -472,9 +483,10 @@ remove_useless_groupby_columns(PlannerInfo *root)
     foreach(lc, parse->rtable)
     {
         RangeTblEntry *rte = lfirst_node(RangeTblEntry, lc);
+        RelOptInfo *rel;
         Bitmapset  *relattnos;
-        Bitmapset  *pkattnos;
-        Oid         constraintOid;
+        Bitmapset  *best_keycolumns = NULL;
+        int32       best_nkeycolumns = PG_INT32_MAX;
 
         relid++;
 
@@ -495,19 +507,83 @@ remove_useless_groupby_columns(PlannerInfo *root)
         if (bms_membership(relattnos) != BMS_MULTIPLE)
             continue;
 
-        /*
-         * Can't remove any columns for this rel if there is no suitable
-         * (i.e., nondeferrable) primary key constraint.
-         */
-        pkattnos = get_primary_key_attnos(rte->relid, false, &constraintOid);
-        if (pkattnos == NULL)
-            continue;
+        rel = root->simple_rel_array[relid];
 
         /*
-         * If the primary key is a proper subset of relattnos then we have
-         * some items in the GROUP BY that can be removed.
+         * Now check each index for this relation to see if there are any with
+         * columns which are a proper subset of the grouping columns for this
+         * relation.
          */
-        if (bms_subset_compare(pkattnos, relattnos) == BMS_SUBSET1)
+        foreach_node(IndexOptInfo, index, rel->indexlist)
+        {
+            Bitmapset  *ind_attnos;
+            bool        nulls_check_ok;
+
+            /*
+             * Skip any non-unique and deferrable indexes.  Predicate indexes
+             * have not been checked yet, so we must skip those too as the
+             * predOK check that's done later might fail.
+             */
+            if (!index->unique || !index->immediate || index->indpred != NIL)
+                continue;
+
+            /* For simplicity, we currently don't support expression indexes */
+            if (index->indexprs != NIL)
+                continue;
+
+            ind_attnos = NULL;
+            nulls_check_ok = true;
+            for (int i = 0; i < index->nkeycolumns; i++)
+            {
+                /*
+                 * We must insist that the index columns are all defined NOT
+                 * NULL otherwise duplicate NULLs could exist.  However, we
+                 * can relax this check when the index is defined with NULLS
+                 * NOT DISTINCT as there can only be 1 NULL row, therefore
+                 * functional dependency on the unique columns is maintained,
+                 * despite the NULL.
+                 */
+                if (!index->nullsnotdistinct &&
+                    !bms_is_member(index->indexkeys[i],
+                                   rel->notnullattnums))
+                {
+                    nulls_check_ok = false;
+                    break;
+                }
+
+                ind_attnos =
+                    bms_add_member(ind_attnos,
+                                   index->indexkeys[i] -
+                                   FirstLowInvalidHeapAttributeNumber);
+            }
+
+            if (!nulls_check_ok)
+                continue;
+
+            /*
+             * Skip any indexes where the indexed columns aren't a proper
+             * subset of the GROUP BY.
+             */
+            if (bms_subset_compare(ind_attnos, relattnos) != BMS_SUBSET1)
+                continue;
+
+            /*
+             * Record the attribute numbers from the index with the fewest
+             * columns.  This allows the largest number of columns to be
+             * removed from the GROUP BY clause.  In the future, we may wish
+             * to consider using the narrowest set of columns and looking at
+             * pg_statistic.stawidth as it might be better to use an index
+             * with, say two INT4s, rather than, say, one long varlena column.
+             */
+            if (index->nkeycolumns < best_nkeycolumns)
+            {
+                best_keycolumns = ind_attnos;
+                best_nkeycolumns = index->nkeycolumns;
+            }
+        }
+
+        /* Did we find a suitable index? */
+        if (!bms_is_empty(best_keycolumns))
         {
             /*
              * To easily remember whether we've found anything to do, we don't
@@ -518,7 +594,7 @@ remove_useless_groupby_columns(PlannerInfo *root)
                                      (list_length(parse->rtable) + 1));
 
             /* Remember the attnos of the removable columns */
-            surplusvars[relid] = bms_difference(relattnos, pkattnos);
+            surplusvars[relid] = bms_difference(relattnos, best_keycolumns);
         }
     }
 
src/backend/optimizer/util/plancat.c

Lines changed: 1 addition & 0 deletions
@@ -457,6 +457,7 @@ get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent,
     info->indrestrictinfo = NIL;    /* set later, in indxpath.c */
     info->predOK = false;           /* set later, in indxpath.c */
     info->unique = index->indisunique;
+    info->nullsnotdistinct = index->indnullsnotdistinct;
     info->immediate = index->indimmediate;
     info->hypothetical = false;
 
src/include/nodes/pathnodes.h

Lines changed: 2 additions & 0 deletions
@@ -1187,6 +1187,8 @@ struct IndexOptInfo
     bool        predOK;
     /* true if a unique index */
     bool        unique;
+    /* true if the index was defined with NULLS NOT DISTINCT */
+    bool        nullsnotdistinct;
     /* is uniqueness enforced immediately? */
     bool        immediate;
     /* true if index doesn't really exist */

src/test/regress/expected/aggregates.out

Lines changed: 67 additions & 0 deletions
@@ -1448,6 +1448,73 @@ explain (costs off) select * from p_t1 group by a,b,c,d;
          ->  Seq Scan on p_t1_2
 (5 rows)
 
+create unique index t3_c_uidx on t3(c);
+-- Ensure we don't remove any columns from the GROUP BY for a unique
+-- index on a NULLable column.
+explain (costs off) select b,c from t3 group by b,c;
+      QUERY PLAN
+----------------------
+ HashAggregate
+   Group Key: b, c
+   ->  Seq Scan on t3
+(3 rows)
+
+-- Make the column NOT NULL and ensure we remove the redundant column
+alter table t3 alter column c set not null;
+explain (costs off) select b,c from t3 group by b,c;
+      QUERY PLAN
+----------------------
+ HashAggregate
+   Group Key: c
+   ->  Seq Scan on t3
+(3 rows)
+
+-- When there are multiple supporting unique indexes and the GROUP BY contains
+-- columns to cover all of those, ensure we pick the index with the least
+-- number of columns so that we can remove more columns from the GROUP BY.
+explain (costs off) select a,b,c from t3 group by a,b,c;
+      QUERY PLAN
+----------------------
+ HashAggregate
+   Group Key: c
+   ->  Seq Scan on t3
+(3 rows)
+
+-- As above but try ordering the columns differently to ensure we get the
+-- same result.
+explain (costs off) select a,b,c from t3 group by c,a,b;
+      QUERY PLAN
+----------------------
+ HashAggregate
+   Group Key: c
+   ->  Seq Scan on t3
+(3 rows)
+
+-- Ensure we don't use a partial index as proof of functional dependency
+drop index t3_c_uidx;
+create index t3_c_uidx on t3 (c) where c > 0;
+explain (costs off) select b,c from t3 group by b,c;
+      QUERY PLAN
+----------------------
+ HashAggregate
+   Group Key: b, c
+   ->  Seq Scan on t3
+(3 rows)
+
+-- A unique index defined as NULLS NOT DISTINCT does not need a supporting NOT
+-- NULL constraint on the indexed columns.  Ensure the redundant columns are
+-- removed from the GROUP BY for such a table.
+drop index t3_c_uidx;
+alter table t3 alter column c drop not null;
+create unique index t3_c_uidx on t3(c) nulls not distinct;
+explain (costs off) select b,c from t3 group by b,c;
+      QUERY PLAN
+----------------------
+ HashAggregate
+   Group Key: c
+   ->  Seq Scan on t3
+(3 rows)
+
 drop table t1 cascade;
 NOTICE:  drop cascades to table t1c
 drop table t2;

src/test/regress/sql/aggregates.sql

Lines changed: 32 additions & 0 deletions
@@ -507,6 +507,38 @@ create temp table p_t1_2 partition of p_t1 for values in(2);
 -- Ensure we can remove non-PK columns for partitioned tables.
 explain (costs off) select * from p_t1 group by a,b,c,d;
 
+create unique index t3_c_uidx on t3(c);
+
+-- Ensure we don't remove any columns from the GROUP BY for a unique
+-- index on a NULLable column.
+explain (costs off) select b,c from t3 group by b,c;
+
+-- Make the column NOT NULL and ensure we remove the redundant column
+alter table t3 alter column c set not null;
+explain (costs off) select b,c from t3 group by b,c;
+
+-- When there are multiple supporting unique indexes and the GROUP BY contains
+-- columns to cover all of those, ensure we pick the index with the least
+-- number of columns so that we can remove more columns from the GROUP BY.
+explain (costs off) select a,b,c from t3 group by a,b,c;
+
+-- As above but try ordering the columns differently to ensure we get the
+-- same result.
+explain (costs off) select a,b,c from t3 group by c,a,b;
+
+-- Ensure we don't use a partial index as proof of functional dependency
+drop index t3_c_uidx;
+create index t3_c_uidx on t3 (c) where c > 0;
+explain (costs off) select b,c from t3 group by b,c;
+
+-- A unique index defined as NULLS NOT DISTINCT does not need a supporting NOT
+-- NULL constraint on the indexed columns.  Ensure the redundant columns are
+-- removed from the GROUP BY for such a table.
+drop index t3_c_uidx;
+alter table t3 alter column c drop not null;
+create unique index t3_c_uidx on t3(c) nulls not distinct;
+explain (costs off) select b,c from t3 group by b,c;
+
 drop table t1 cascade;
 drop table t2;
 drop table t3;
