Commit eae056c

Apply multiple multivariate MCV lists when possible
Until now we've only used a single multivariate MCV list per relation,
covering the largest number of clauses. So for example given a query

    SELECT * FROM t WHERE a = 1 AND b = 1 AND c = 1 AND d = 1

and extended statistics on (a,b) and (c,d), we'd only pick and use one
of them. This commit improves this by repeatedly picking and applying
the best statistics (matching the largest number of remaining clauses)
until no additional statistics is applicable.

This greedy algorithm is simple, but may not be optimal. A different
choice of statistics may leave fewer clauses unestimated and/or give
better estimates for some other reason.

This can however happen only when there are overlapping statistics, and
selecting one makes it impossible to use the other. E.g. with statistics
on (a,b), (c,d), (b,c,d), we may pick either (a,b) and (c,d) or (b,c,d).
But it's not clear which option is the best one.

We however assume cases like this are rare, and the easiest solution is
to define statistics covering the whole group of correlated columns. In
the future we might support overlapping stats, using some of the clauses
as conditions (in conditional probability sense).

Author: Tomas Vondra
Reviewed-by: Mark Dilger, Kyotaro Horiguchi
Discussion: https://postgr.es/m/20191028152048.jc6pqv5hb7j77ocp@development
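The greedy selection described in the commit message can be sketched in isolation. This is a simplified illustration, not the PostgreSQL implementation: column sets are plain bitmasks rather than Bitmapset, the names ColumnSet, choose_best and apply_greedy are invented here, and the rule that a statistic must cover at least two columns to be worth applying is an assumption of this sketch.

```c
#include <assert.h>

/* Hypothetical simplification: a set of columns is a bitmask (bit i = column i). */
typedef unsigned int ColumnSet;

/* Count the columns present in a set. */
static int
count_columns(ColumnSet s)
{
    int n = 0;

    while (s != 0)
    {
        n += (int) (s & 1u);
        s >>= 1;
    }
    return n;
}

/*
 * Return the index of the statistic covering the most distinct columns of the
 * remaining clauses, or -1 if none covers at least two columns (an assumption
 * of this sketch). A clause whose column set is 0 is already estimated.
 */
static int
choose_best(const ColumnSet *stats, int nstats,
            const ColumnSet *clauses, int nclauses)
{
    int best = -1;
    int best_matched = 1;       /* a candidate must beat this, i.e. match >= 2 */

    for (int s = 0; s < nstats; s++)
    {
        ColumnSet matched = 0;

        for (int c = 0; c < nclauses; c++)
        {
            /* the clause counts only if all its columns are in the statistic */
            if (clauses[c] != 0 && (clauses[c] & ~stats[s]) == 0)
                matched |= clauses[c];
        }

        if (count_columns(matched) > best_matched)
        {
            best = s;
            best_matched = count_columns(matched);
        }
    }
    return best;
}

/*
 * Repeatedly pick the best statistic and mark the clauses it covers as
 * estimated (by clearing their column sets). The indexes of the applied
 * statistics are stored in 'applied', which needs room for nstats entries.
 * Returns the number of statistics applied.
 */
static int
apply_greedy(const ColumnSet *stats, int nstats,
             ColumnSet *clauses, int nclauses, int *applied)
{
    int napplied = 0;

    assert(nstats >= 0 && nclauses >= 0);

    for (;;)
    {
        int s = choose_best(stats, nstats, clauses, nclauses);

        if (s < 0)
            break;              /* no (additional) statistic is applicable */

        for (int c = 0; c < nclauses; c++)
        {
            if (clauses[c] != 0 && (clauses[c] & ~stats[s]) == 0)
                clauses[c] = 0;
        }
        applied[napplied++] = s;
    }
    return napplied;
}
```

With statistics on (a,b) and (c,d) and one clause per column, the sketch applies both statistics. Adding an overlapping statistic on (b,c,d) makes it pick that one first, after which (a,b) and (c,d) no longer match enough clauses and the clause on a is left unestimated, which is the suboptimal case the commit message warns about.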
1 parent aaa6761 commit eae056c

3 files changed: +167 -64 lines changed

src/backend/statistics/extended_stats.c

+75 -64
@@ -1148,9 +1148,13 @@ statext_is_compatible_clause(PlannerInfo *root, Node *clause, Index relid,
  * statext_mcv_clauselist_selectivity
  *      Estimate clauses using the best multi-column statistics.
  *
- * Selects the best extended (multi-column) statistic on a table (measured by
- * the number of attributes extracted from the clauses and covered by it), and
- * computes the selectivity for the supplied clauses.
+ * Applies available extended (multi-column) statistics on a table. There may
+ * be multiple applicable statistics (with respect to the clauses), in which
+ * case we use greedy approach. In each round we select the best statistic on
+ * a table (measured by the number of attributes extracted from the clauses
+ * and covered by it), and compute the selectivity for the supplied clauses.
+ * We repeat this process with the remaining clauses (if any), until none of
+ * the available statistics can be used.
  *
  * One of the main challenges with using MCV lists is how to extrapolate the
  * estimate to the data not covered by the MCV list. To do that, we compute
@@ -1194,11 +1198,6 @@ statext_is_compatible_clause(PlannerInfo *root, Node *clause, Index relid,
  * 'estimatedclauses' is an input/output parameter. We set bits for the
  * 0-based 'clauses' indexes we estimate for and also skip clause items that
  * already have a bit set.
- *
- * XXX If we were to use multiple statistics, this is where it would happen.
- * We would simply repeat this on a loop on the "remaining" clauses, possibly
- * using the already estimated clauses as conditions (and combining the values
- * using conditional probability formula).
  */
 static Selectivity
 statext_mcv_clauselist_selectivity(PlannerInfo *root, List *clauses, int varRelid,
@@ -1208,14 +1207,7 @@ statext_mcv_clauselist_selectivity(PlannerInfo *root, List *clauses, int varReli
     ListCell   *l;
     Bitmapset **list_attnums;
     int         listidx;
-    StatisticExtInfo *stat;
-    List       *stat_clauses;
-    Selectivity simple_sel,
-                mcv_sel,
-                mcv_basesel,
-                mcv_totalsel,
-                other_sel,
-                sel;
+    Selectivity sel = 1.0;
 
     /* check if there's any stats that might be useful for us. */
     if (!has_stats_of_kind(rel->statlist, STATS_EXT_MCV))
@@ -1250,65 +1242,84 @@ statext_mcv_clauselist_selectivity(PlannerInfo *root, List *clauses, int varReli
         listidx++;
     }
 
-    /* find the best suited statistics object for these attnums */
-    stat = choose_best_statistics(rel->statlist, STATS_EXT_MCV,
-                                  list_attnums, list_length(clauses));
-
-    /* if no matching stats could be found then we've nothing to do */
-    if (!stat)
-        return 1.0;
+    /* apply as many extended statistics as possible */
+    while (true)
+    {
+        StatisticExtInfo *stat;
+        List       *stat_clauses;
+        Selectivity simple_sel,
+                    mcv_sel,
+                    mcv_basesel,
+                    mcv_totalsel,
+                    other_sel,
+                    stat_sel;
+
+        /* find the best suited statistics object for these attnums */
+        stat = choose_best_statistics(rel->statlist, STATS_EXT_MCV,
+                                      list_attnums, list_length(clauses));
+
+        /* if no (additional) matching stats could be found then we've nothing to do */
+        if (!stat)
+            break;
 
-    /* Ensure choose_best_statistics produced an expected stats type. */
-    Assert(stat->kind == STATS_EXT_MCV);
+        /* Ensure choose_best_statistics produced an expected stats type. */
+        Assert(stat->kind == STATS_EXT_MCV);
 
-    /* now filter the clauses to be estimated using the selected MCV */
-    stat_clauses = NIL;
+        /* now filter the clauses to be estimated using the selected MCV */
+        stat_clauses = NIL;
 
-    listidx = 0;
-    foreach(l, clauses)
-    {
-        /*
-         * If the clause is compatible with the selected statistics, mark it
-         * as estimated and add it to the list to estimate.
-         */
-        if (list_attnums[listidx] != NULL &&
-            bms_is_subset(list_attnums[listidx], stat->keys))
+        listidx = 0;
+        foreach(l, clauses)
         {
-            stat_clauses = lappend(stat_clauses, (Node *) lfirst(l));
-            *estimatedclauses = bms_add_member(*estimatedclauses, listidx);
+            /*
+             * If the clause is compatible with the selected statistics, mark it
+             * as estimated and add it to the list to estimate.
+             */
+            if (list_attnums[listidx] != NULL &&
+                bms_is_subset(list_attnums[listidx], stat->keys))
+            {
+                stat_clauses = lappend(stat_clauses, (Node *) lfirst(l));
+                *estimatedclauses = bms_add_member(*estimatedclauses, listidx);
+
+                bms_free(list_attnums[listidx]);
+                list_attnums[listidx] = NULL;
+            }
+
+            listidx++;
         }
 
-        listidx++;
-    }
+        /*
+         * First compute "simple" selectivity, i.e. without the extended
+         * statistics, and essentially assuming independence of the
+         * columns/clauses. We'll then use the various selectivities computed from
+         * MCV list to improve it.
+         */
+        simple_sel = clauselist_selectivity_simple(root, stat_clauses, varRelid,
+                                                   jointype, sjinfo, NULL);
 
-    /*
-     * First compute "simple" selectivity, i.e. without the extended
-     * statistics, and essentially assuming independence of the
-     * columns/clauses. We'll then use the various selectivities computed from
-     * MCV list to improve it.
-     */
-    simple_sel = clauselist_selectivity_simple(root, stat_clauses, varRelid,
-                                               jointype, sjinfo, NULL);
+        /*
+         * Now compute the multi-column estimate from the MCV list, along with the
+         * other selectivities (base & total selectivity).
+         */
+        mcv_sel = mcv_clauselist_selectivity(root, stat, stat_clauses, varRelid,
+                                             jointype, sjinfo, rel,
+                                             &mcv_basesel, &mcv_totalsel);
 
-    /*
-     * Now compute the multi-column estimate from the MCV list, along with the
-     * other selectivities (base & total selectivity).
-     */
-    mcv_sel = mcv_clauselist_selectivity(root, stat, stat_clauses, varRelid,
-                                         jointype, sjinfo, rel,
-                                         &mcv_basesel, &mcv_totalsel);
+        /* Estimated selectivity of values not covered by MCV matches */
+        other_sel = simple_sel - mcv_basesel;
+        CLAMP_PROBABILITY(other_sel);
 
-    /* Estimated selectivity of values not covered by MCV matches */
-    other_sel = simple_sel - mcv_basesel;
-    CLAMP_PROBABILITY(other_sel);
+        /* The non-MCV selectivity can't exceed the 1 - mcv_totalsel. */
+        if (other_sel > 1.0 - mcv_totalsel)
+            other_sel = 1.0 - mcv_totalsel;
 
-    /* The non-MCV selectivity can't exceed the 1 - mcv_totalsel. */
-    if (other_sel > 1.0 - mcv_totalsel)
-        other_sel = 1.0 - mcv_totalsel;
+        /* Overall selectivity is the combination of MCV and non-MCV estimates. */
+        stat_sel = mcv_sel + other_sel;
+        CLAMP_PROBABILITY(stat_sel);
 
-    /* Overall selectivity is the combination of MCV and non-MCV estimates. */
-    sel = mcv_sel + other_sel;
-    CLAMP_PROBABILITY(sel);
+        /* Factor the estimate from this MCV to the oveall estimate. */
+        sel *= stat_sel;
+    }
 
     return sel;
 }
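The arithmetic in the loop body above can be isolated from the planner to see how each statistic contributes one factor to the result. The following is a minimal standalone sketch, not PostgreSQL code: the McvInputs struct and the functions stat_selectivity and combined_selectivity are names invented for illustration, and the four input values are supplied by the caller instead of being computed from a real MCV list.

```c
#include <assert.h>

/* Clamp a value into [0, 1], mirroring what CLAMP_PROBABILITY does in the diff. */
static double
clamp_probability(double p)
{
    if (p < 0.0)
        return 0.0;
    if (p > 1.0)
        return 1.0;
    return p;
}

/* Hypothetical container for the four values one loop iteration works with. */
typedef struct McvInputs
{
    double simple_sel;      /* estimate assuming independent clauses */
    double mcv_sel;         /* frequency of MCV items matching the clauses */
    double mcv_basesel;     /* independence-based frequency of those items */
    double mcv_totalsel;    /* total frequency of all items in the MCV list */
} McvInputs;

/* Selectivity contributed by a single statistic, as in one loop iteration. */
static double
stat_selectivity(const McvInputs *in)
{
    /* part of the simple estimate attributed to values outside the MCV matches */
    double other_sel = clamp_probability(in->simple_sel - in->mcv_basesel);

    /* it cannot exceed the fraction of the table not covered by the MCV list */
    if (other_sel > 1.0 - in->mcv_totalsel)
        other_sel = 1.0 - in->mcv_totalsel;

    return clamp_probability(in->mcv_sel + other_sel);
}

/* Multiply the per-statistic estimates, starting from 1.0 as the commit does. */
static double
combined_selectivity(const McvInputs *stats, int nstats)
{
    double sel = 1.0;

    assert(nstats >= 0);

    for (int i = 0; i < nstats; i++)
        sel *= stat_selectivity(&stats[i]);

    return sel;
}
```

Starting from 1.0 means that when no statistic applies the function returns 1.0, which matches the early `return 1.0` that the old code used. Multiplying the factors treats the column groups covered by different statistics as independent of each other.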

src/test/regress/expected/stats_ext.out

+57
@@ -836,6 +836,63 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_bool WHERE NOT a AND
          1 |      0
 (1 row)
 
+-- check the ability to use multiple MCV lists
+CREATE TABLE mcv_lists_multi (
+    a INTEGER,
+    b INTEGER,
+    c INTEGER,
+    d INTEGER
+);
+INSERT INTO mcv_lists_multi (a, b, c, d)
+     SELECT
+         mod(i,5),
+         mod(i,5),
+         mod(i,7),
+         mod(i,7)
+     FROM generate_series(1,5000) s(i);
+ANALYZE mcv_lists_multi;
+-- estimates without any mcv statistics
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 AND b = 0');
+ estimated | actual
+-----------+--------
+       200 |   1000
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE c = 0 AND d = 0');
+ estimated | actual
+-----------+--------
+       102 |    714
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 AND b = 0 AND c = 0 AND d = 0');
+ estimated | actual
+-----------+--------
+         4 |    142
+(1 row)
+
+-- create separate MCV statistics
+CREATE STATISTICS mcv_lists_multi_1 (mcv) ON a, b FROM mcv_lists_multi;
+CREATE STATISTICS mcv_lists_multi_2 (mcv) ON c, d FROM mcv_lists_multi;
+ANALYZE mcv_lists_multi;
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 AND b = 0');
+ estimated | actual
+-----------+--------
+      1000 |   1000
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE c = 0 AND d = 0');
+ estimated | actual
+-----------+--------
+       714 |    714
+(1 row)
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 AND b = 0 AND c = 0 AND d = 0');
+ estimated | actual
+-----------+--------
+       143 |    142
+(1 row)
+
+DROP TABLE mcv_lists_multi;
 -- Permission tests. Users should not be able to see specific data values in
 -- the extended statistics, if they lack permission to see those values in
 -- the underlying table.
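The estimates for the four-clause query in the expected output above can be reproduced by hand. The working below assumes that ANALYZE sees all 5000 rows, so the per-column frequencies are exact (1000/5000 for a = 0 and b = 0, 714/5000 for c = 0 and d = 0) and each MCV list covers every value combination. Those are assumptions of this illustration, not statements taken from the commit.

```latex
\begin{align*}
\text{without extended statistics:}\quad
  & 5000 \cdot \left(\tfrac{1000}{5000}\right)^{2}
         \cdot \left(\tfrac{714}{5000}\right)^{2}
    \approx 4.08 \;\to\; 4 \\
\text{with both MCV lists:}\quad
  & 5000 \cdot \tfrac{1000}{5000} \cdot \tfrac{714}{5000}
    = 142.8 \;\to\; 143
\end{align*}
```

The first line multiplies four independent per-column selectivities. The second multiplies one factor per MCV list, which is what the new loop does, and lands next to the actual count of 142 rows.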

src/test/regress/sql/stats_ext.sql

+35
@@ -535,6 +535,41 @@ SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_bool WHERE NOT a AND
 
 SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_bool WHERE NOT a AND b AND NOT c');
 
+-- check the ability to use multiple MCV lists
+CREATE TABLE mcv_lists_multi (
+    a INTEGER,
+    b INTEGER,
+    c INTEGER,
+    d INTEGER
+);
+
+INSERT INTO mcv_lists_multi (a, b, c, d)
+     SELECT
+         mod(i,5),
+         mod(i,5),
+         mod(i,7),
+         mod(i,7)
+     FROM generate_series(1,5000) s(i);
+
+ANALYZE mcv_lists_multi;
+
+-- estimates without any mcv statistics
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 AND b = 0');
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE c = 0 AND d = 0');
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 AND b = 0 AND c = 0 AND d = 0');
+
+-- create separate MCV statistics
+CREATE STATISTICS mcv_lists_multi_1 (mcv) ON a, b FROM mcv_lists_multi;
+CREATE STATISTICS mcv_lists_multi_2 (mcv) ON c, d FROM mcv_lists_multi;
+
+ANALYZE mcv_lists_multi;
+
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 AND b = 0');
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE c = 0 AND d = 0');
+SELECT * FROM check_estimated_rows('SELECT * FROM mcv_lists_multi WHERE a = 0 AND b = 0 AND c = 0 AND d = 0');
+
+DROP TABLE mcv_lists_multi;
+
 -- Permission tests. Users should not be able to see specific data values in
 -- the extended statistics, if they lack permission to see those values in
 -- the underlying table.
