
Commit 95bee94

Fix misestimation of n_distinct for a nearly-unique column with many nulls.
If ANALYZE found no repeated non-null entries in its sample, it set the column's stadistinct value to -1.0, intending to indicate that the entries are all distinct. But what this value actually means is that the number of distinct values is 100% of the table's rowcount, and thus it was overestimating the number of distinct values by however many nulls there are. This could lead to very poor selectivity estimates, as for example in a recent report from Andreas Joseph Krogh. We should discount the stadistinct value by whatever we've estimated the nulls fraction to be. (That is what will happen if we choose to use a negative stadistinct for a column that does have repeated entries, so this code path was just inconsistent.)

In addition to fixing the stadistinct entries stored by several different ANALYZE code paths, adjust the logic where get_variable_numdistinct() forces an "all distinct" estimate on the basis of finding a relevant unique index. Unique indexes don't reject nulls, so there's no reason to assume that the null fraction doesn't apply.

Back-patch to all supported branches. Back-patching is a bit of a judgment call, but this problem seems to affect only a few users (else we'd have identified it long ago), and it's bad enough when it does happen that destabilizing plan choices in a worse direction seems unlikely.

Patch by me, with documentation wording suggested by Dean Rasheed.

Report: <VisenaEmail.26.df42f82acae38a58.156463942b8@tc7-visena>
Discussion: <16143.1470350371@sss.pgh.pa.us>
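To make the size of the error concrete, here is a small standalone sketch (not part of the patch; the row count and null fraction are made-up numbers) showing how the old and new stadistinct values translate into the planner's distinct-value estimate for a nearly-unique column that is half nulls:

/*
 * Illustrative sketch only -- not part of the patch.  A negative
 * stadistinct is a multiplier on the table's row count, so the planner's
 * distinct-value estimate is -stadistinct * ntuples.
 */
#include <stdio.h>

int
main(void)
{
    double      ntuples = 1000000.0;    /* hypothetical pg_class.reltuples */
    double      stanullfrac = 0.5;      /* ANALYZE estimated 50% nulls */

    double      old_stadistinct = -1.0;                         /* pre-patch */
    double      new_stadistinct = -1.0 * (1.0 - stanullfrac);   /* patched */

    printf("old estimate: %.0f distinct values\n", -old_stadistinct * ntuples);
    printf("new estimate: %.0f distinct values\n", -new_stadistinct * ntuples);

    /*
     * Prints 1000000 vs. 500000: the old value overestimated by exactly
     * the number of nulls, which is what skewed selectivity estimates.
     */
    return 0;
}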
1 parent 8a8c6b5 commit 95bee94

6 files changed: +34 -21 lines changed

doc/src/sgml/catalogs.sgml

Lines changed: 3 additions & 3 deletions

@@ -5849,9 +5849,9 @@
       <entry>The number of distinct nonnull data values in the column.
       A value greater than zero is the actual number of distinct values.
       A value less than zero is the negative of a multiplier for the number
-      of rows in the table; for example, a column in which values appear about
-      twice on the average could be represented by
-      <structfield>stadistinct</> = -0.5.
+      of rows in the table; for example, a column in which about 80% of the
+      values are nonnull and each nonnull value appears about twice on
+      average could be represented by <structfield>stadistinct</> = -0.4.
       A zero value means the number of distinct values is unknown.
       </entry>
      </row>
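For reference, the arithmetic behind the new -0.4 example, using only the numbers the documentation itself assumes for a table of N rows:

    nonnull fraction    = 0.8              (about 80% of the values are nonnull)
    appearances/value   = 2                (each nonnull value appears about twice)
    distinct values     = (0.8 * N) / 2 = 0.4 * N
    stadistinct         = -(distinct / N) = -0.4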

src/backend/commands/analyze.c

Lines changed: 11 additions & 5 deletions

@@ -2049,8 +2049,11 @@ compute_distinct_stats(VacAttrStatsP stats,
 
         if (nmultiple == 0)
         {
-            /* If we found no repeated values, assume it's a unique column */
-            stats->stadistinct = -1.0;
+            /*
+             * If we found no repeated non-null values, assume it's a unique
+             * column; but be sure to discount for any nulls we found.
+             */
+            stats->stadistinct = -1.0 * (1.0 - stats->stanullfrac);
         }
         else if (track_cnt < track_max && toowide_cnt == 0 &&
                  nmultiple == track_cnt)
@@ -2426,8 +2429,11 @@ compute_scalar_stats(VacAttrStatsP stats,
 
         if (nmultiple == 0)
         {
-            /* If we found no repeated values, assume it's a unique column */
-            stats->stadistinct = -1.0;
+            /*
+             * If we found no repeated non-null values, assume it's a unique
+             * column; but be sure to discount for any nulls we found.
+             */
+            stats->stadistinct = -1.0 * (1.0 - stats->stanullfrac);
         }
         else if (toowide_cnt == 0 && nmultiple == ndistinct)
         {
@@ -2753,7 +2759,7 @@ compute_scalar_stats(VacAttrStatsP stats,
         else
             stats->stawidth = stats->attrtype->typlen;
         /* Assume all too-wide values are distinct, so it's a unique column */
-        stats->stadistinct = -1.0;
+        stats->stadistinct = -1.0 * (1.0 - stats->stanullfrac);
     }
     else if (null_cnt > 0)
     {
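Both hunks rely on stats->stanullfrac having been set earlier in the same ANALYZE pass from the sample's null count (the rangetypes diff below shows that assignment). A simplified sketch of the relationship, with hypothetical sample counts (this is not the actual ANALYZE code):

/*
 * Simplified sketch, not the actual ANALYZE code; the sample counts are
 * hypothetical.  It shows how the stored stadistinct follows from the
 * fraction of nulls seen in the sample when no repeated non-null values
 * were found.
 */
static double
sketch_stadistinct(int null_cnt, int samplerows)
{
    double      stanullfrac = (double) null_cnt / (double) samplerows;

    /* no repeated non-null values: unique, discounted for nulls only */
    return -1.0 * (1.0 - stanullfrac);
}

/* e.g. sketch_stadistinct(12000, 30000) yields -0.6 */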

src/backend/tsearch/ts_typanalyze.c

Lines changed: 1 addition & 1 deletion

@@ -295,7 +295,7 @@ compute_tsvector_stats(VacAttrStats *stats,
         stats->stawidth = total_width / (double) nonnull_cnt;
 
         /* Assume it's a unique column (see notes above) */
-        stats->stadistinct = -1.0;
+        stats->stadistinct = -1.0 * (1.0 - stats->stanullfrac);
 
         /*
          * Construct an array of the interesting hashtable items, that is,

src/backend/utils/adt/rangetypes_typanalyze.c

Lines changed: 3 additions & 1 deletion

@@ -203,7 +203,9 @@ compute_range_stats(VacAttrStats *stats, AnalyzeAttrFetchFunc fetchfunc,
         /* Do the simple null-frac and width stats */
         stats->stanullfrac = (double) null_cnt / (double) samplerows;
         stats->stawidth = total_width / (double) non_null_cnt;
-        stats->stadistinct = -1.0;
+
+        /* Estimate that non-null values are unique */
+        stats->stadistinct = -1.0 * (1.0 - stats->stanullfrac);
 
         /* Must copy the target values into anl_context */
         old_cxt = MemoryContextSwitchTo(stats->anl_context);

src/backend/utils/adt/selfuncs.c

Lines changed: 8 additions & 4 deletions

@@ -4738,14 +4738,16 @@ double
 get_variable_numdistinct(VariableStatData *vardata, bool *isdefault)
 {
     double      stadistinct;
+    double      stanullfrac = 0.0;
     double      ntuples;
 
     *isdefault = false;
 
     /*
      * Determine the stadistinct value to use.  There are cases where we can
      * get an estimate even without a pg_statistic entry, or can get a better
-     * value than is in pg_statistic.
+     * value than is in pg_statistic.  Grab stanullfrac too if we can find it
+     * (otherwise, assume no nulls, for lack of any better idea).
      */
     if (HeapTupleIsValid(vardata->statsTuple))
     {
@@ -4754,6 +4756,7 @@ get_variable_numdistinct(VariableStatData *vardata, bool *isdefault)
 
         stats = (Form_pg_statistic) GETSTRUCT(vardata->statsTuple);
         stadistinct = stats->stadistinct;
+        stanullfrac = stats->stanullfrac;
     }
     else if (vardata->vartype == BOOLOID)
     {
@@ -4777,7 +4780,7 @@ get_variable_numdistinct(VariableStatData *vardata, bool *isdefault)
         {
             case ObjectIdAttributeNumber:
             case SelfItemPointerAttributeNumber:
-                stadistinct = -1.0;     /* unique */
+                stadistinct = -1.0;     /* unique (and all non null) */
                 break;
             case TableOidAttributeNumber:
                 stadistinct = 1.0;      /* only 1 value */
@@ -4799,10 +4802,11 @@ get_variable_numdistinct(VariableStatData *vardata, bool *isdefault)
      * If there is a unique index or DISTINCT clause for the variable, assume
      * it is unique no matter what pg_statistic says; the statistics could be
      * out of date, or we might have found a partial unique index that proves
-     * the var is unique for this query.
+     * the var is unique for this query.  However, we'd better still believe
+     * the null-fraction statistic.
      */
     if (vardata->isunique)
-        stadistinct = -1.0;
+        stadistinct = -1.0 * (1.0 - stanullfrac);
 
     /*
      * If we had an absolute estimate, use that.
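For context, here is a minimal sketch (simplified, and not the actual body of get_variable_numdistinct()) of how a negative stadistinct becomes an absolute distinct-value estimate further down, and why the unique-index shortcut should still honor the null fraction:

/*
 * Simplified sketch only -- not the real PostgreSQL function.  A unique
 * index guarantees distinctness of non-null values but does not reject
 * nulls, so the "assume unique" shortcut is still scaled by
 * (1 - stanullfrac).
 */
#include <stdbool.h>

static double
sketch_numdistinct(double stadistinct, double stanullfrac,
                   bool isunique, double ntuples)
{
    if (isunique)
        stadistinct = -1.0 * (1.0 - stanullfrac);

    if (stadistinct > 0.0)
        return stadistinct;             /* absolute count from ANALYZE */
    if (stadistinct < 0.0)
        return -stadistinct * ntuples;  /* multiplier on the row count */

    return ntuples;                     /* unknown; the real code falls back
                                         * to a default guess instead */
}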

src/include/catalog/pg_statistic.h

Lines changed: 8 additions & 7 deletions

@@ -57,13 +57,14 @@ CATALOG(pg_statistic,2619) BKI_WITHOUT_OIDS
     * > 0      actual number of distinct values
     * < 0      negative of multiplier for number of rows
     * The special negative case allows us to cope with columns that are
-    * unique (stadistinct = -1) or nearly so (for example, a column in
-    * which values appear about twice on the average could be represented
-    * by stadistinct = -0.5).  Because the number-of-rows statistic in
-    * pg_class may be updated more frequently than pg_statistic is, it's
-    * important to be able to describe such situations as a multiple of
-    * the number of rows, rather than a fixed number of distinct values.
-    * But in other cases a fixed number is correct (eg, a boolean column).
+    * unique (stadistinct = -1) or nearly so (for example, a column in which
+    * non-null values appear about twice on the average could be represented
+    * by stadistinct = -0.5 if there are no nulls, or -0.4 if 20% of the
+    * column is nulls).  Because the number-of-rows statistic in pg_class may
+    * be updated more frequently than pg_statistic is, it's important to be
+    * able to describe such situations as a multiple of the number of rows,
+    * rather than a fixed number of distinct values.  But in other cases a
+    * fixed number is correct (eg, a boolean column).
     * ----------------
     */
    float4      stadistinct;
