Commit 66a7e6b

Improve estimation of IN/NOT IN by assuming array elements are distinct.
In constructs such as "x IN (1,2,3,4)" and "x <> ALL(ARRAY[1,2,3,4])", we formerly always used a general-purpose assumption that the probability of success is independent for each comparison of "x" to an array element. But in real-world usage of these constructs, that's a pretty poor assumption; it's much saner to assume that the array elements are distinct and so the match probabilities are disjoint.

Apply that assumption if the operator appears to behave as equality (for ANY) or inequality (for ALL). But fall back to the normal independent-probabilities calculation if this yields an impossible result, ie probability > 1 or < 0.

We could protect ourselves against bad estimates even more by explicitly checking for equal array elements, but that is expensive and doesn't seem worthwhile: doing it would amount to optimizing for poorly-written queries at the expense of well-written ones.

Daniele Varrazzo and Tom Lane, after a suggestion by Ants Aasma
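The two assumptions described in the message can be written out side by side. The notation here is an assumption for illustration, not taken from the commit: $p_i$ is the estimated selectivity of `x = a_i` and $q_i$ the estimated selectivity of `x <> a_i` for the $i$-th of $n$ array elements.

```latex
\begin{align*}
\text{independent, } x = \mathrm{ANY}(a):\quad
  & P_{\mathrm{ind}} = 1 - \prod_{i=1}^{n} (1 - p_i) \\
\text{disjoint, } x = \mathrm{ANY}(a):\quad
  & P_{\mathrm{dis}} = \sum_{i=1}^{n} p_i \\[4pt]
\text{independent, } x \ne \mathrm{ALL}(a):\quad
  & P_{\mathrm{ind}} = \prod_{i=1}^{n} q_i \\
\text{disjoint, } x \ne \mathrm{ALL}(a):\quad
  & P_{\mathrm{dis}} = 1 - \sum_{i=1}^{n} (1 - q_i)
                     = 1 + \sum_{i=1}^{n} (q_i - 1)
\end{align*}
```

The disjoint forms can leave the interval $[0, 1]$ (for example, when the $p_i$ sum to more than 1), which is the "impossible result" case where the message says the independent calculation is used instead.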
1 parent 1ed7f0e commit 66a7e6b

1 file changed: +71 -3 lines changed


src/backend/utils/adt/selfuncs.c

Lines changed: 71 additions & 3 deletions
@@ -1712,6 +1712,7 @@ scalararraysel(PlannerInfo *root,
     RegProcedure oprsel;
     FmgrInfo oprselproc;
     Selectivity s1;
+    Selectivity s1disjoint;
 
     /* First, deconstruct the expression */
     Assert(list_length(clause->args) == 2);
@@ -1768,6 +1769,19 @@ scalararraysel(PlannerInfo *root,
         return (Selectivity) 0.5;
     fmgr_info(oprsel, &oprselproc);
 
+    /*
+     * In the array-containment check above, we must only believe that an
+     * operator is equality or inequality if it is the default btree equality
+     * operator (or its negator) for the element type, since those are the
+     * operators that array containment will use. But in what follows, we can
+     * be a little laxer, and also believe that any operators using eqsel() or
+     * neqsel() as selectivity estimator act like equality or inequality.
+     */
+    if (oprsel == F_EQSEL || oprsel == F_EQJOINSEL)
+        isEquality = true;
+    else if (oprsel == F_NEQSEL || oprsel == F_NEQJOINSEL)
+        isInequality = true;
+
     /*
      * We consider three cases:
      *
@@ -1802,7 +1816,23 @@ scalararraysel(PlannerInfo *root,
                           ARR_ELEMTYPE(arrayval),
                           elmlen, elmbyval, elmalign,
                           &elem_values, &elem_nulls, &num_elems);
-        s1 = useOr ? 0.0 : 1.0;
+
+        /*
+         * For generic operators, we assume the probability of success is
+         * independent for each array element. But for "= ANY" or "<> ALL",
+         * if the array elements are distinct (which'd typically be the case)
+         * then the probabilities are disjoint, and we should just sum them.
+         *
+         * If we were being really tense we would try to confirm that the
+         * elements are all distinct, but that would be expensive and it
+         * doesn't seem to be worth the cycles; it would amount to penalizing
+         * well-written queries in favor of poorly-written ones. However, we
+         * do protect ourselves a little bit by checking whether the
+         * disjointness assumption leads to an impossible (out of range)
+         * probability; if so, we fall back to the normal calculation.
+         */
+        s1 = s1disjoint = (useOr ? 0.0 : 1.0);
+
         for (i = 0; i < num_elems; i++)
         {
             List *args;
@@ -1829,11 +1859,25 @@ scalararraysel(PlannerInfo *root,
                                               ObjectIdGetDatum(operator),
                                               PointerGetDatum(args),
                                               Int32GetDatum(varRelid)));
+
             if (useOr)
+            {
                 s1 = s1 + s2 - s1 * s2;
+                if (isEquality)
+                    s1disjoint += s2;
+            }
             else
+            {
                 s1 = s1 * s2;
+                if (isInequality)
+                    s1disjoint += s2 - 1.0;
+            }
         }
+
+        /* accept disjoint-probability estimate if in range */
+        if ((useOr ? isEquality : isInequality) &&
+            s1disjoint >= 0.0 && s1disjoint <= 1.0)
+            s1 = s1disjoint;
     }
     else if (rightop && IsA(rightop, ArrayExpr) &&
              !((ArrayExpr *) rightop)->multidims)
@@ -1845,7 +1889,16 @@ scalararraysel(PlannerInfo *root,
 
         get_typlenbyval(arrayexpr->element_typeid,
                         &elmlen, &elmbyval);
-        s1 = useOr ? 0.0 : 1.0;
+
+        /*
+         * We use the assumption of disjoint probabilities here too, although
+         * the odds of equal array elements are rather higher if the elements
+         * are not all constants (which they won't be, else constant folding
+         * would have reduced the ArrayExpr to a Const). In this path it's
+         * critical to have the sanity check on the s1disjoint estimate.
+         */
+        s1 = s1disjoint = (useOr ? 0.0 : 1.0);
+
         foreach(l, arrayexpr->elements)
         {
             Node *elem = (Node *) lfirst(l);
@@ -1871,11 +1924,25 @@ scalararraysel(PlannerInfo *root,
                                               ObjectIdGetDatum(operator),
                                               PointerGetDatum(args),
                                               Int32GetDatum(varRelid)));
+
             if (useOr)
+            {
                 s1 = s1 + s2 - s1 * s2;
+                if (isEquality)
+                    s1disjoint += s2;
+            }
             else
+            {
                 s1 = s1 * s2;
+                if (isInequality)
+                    s1disjoint += s2 - 1.0;
+            }
         }
+
+        /* accept disjoint-probability estimate if in range */
+        if ((useOr ? isEquality : isInequality) &&
+            s1disjoint >= 0.0 && s1disjoint <= 1.0)
+            s1 = s1disjoint;
     }
     else
     {
@@ -1911,7 +1978,8 @@ scalararraysel(PlannerInfo *root,
 
         /*
          * Arbitrarily assume 10 elements in the eventual array value (see
-         * also estimate_array_length)
+         * also estimate_array_length). We don't risk an assumption of
+         * disjoint probabilities here.
          */
         for (i = 0; i < 10; i++)
         {
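
The accumulation and fallback logic added to both loops in the diff above can be tried in isolation. What follows is only a minimal sketch under stated assumptions: `combine_array_selectivity` and its parameters are hypothetical names, not part of `selfuncs.c`, and the per-element selectivities are passed in as plain doubles rather than obtained from the operator's estimator function.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical standalone helper (not PostgreSQL code): combine per-element
 * selectivities s2[0..n-1] the way the patched loops appear to.
 *
 *   use_or         true for "op ANY (array)", false for "op ALL (array)"
 *   is_equality    operator is believed to act like "="
 *   is_inequality  operator is believed to act like "<>"
 */
double
combine_array_selectivity(const double *s2, size_t n,
                          bool use_or, bool is_equality, bool is_inequality)
{
    double s1 = use_or ? 0.0 : 1.0;  /* independent-probability estimate */
    double s1disjoint = s1;          /* disjoint-probability estimate */
    size_t i;

    for (i = 0; i < n; i++)
    {
        if (use_or)
        {
            s1 = s1 + s2[i] - s1 * s2[i];
            if (is_equality)
                s1disjoint += s2[i];
        }
        else
        {
            s1 = s1 * s2[i];
            if (is_inequality)
                s1disjoint += s2[i] - 1.0;
        }
    }

    /* accept the disjoint estimate only if it is a valid probability */
    if ((use_or ? is_equality : is_inequality) &&
        s1disjoint >= 0.0 && s1disjoint <= 1.0)
        s1 = s1disjoint;

    return s1;
}
```

For instance, with four elements that each have selectivity 0.1 under `= ANY`, this sketch yields about 0.4 from the disjoint sum, where the independent formula alone would give about 0.3439. With three elements of selectivity 0.4 the sum is 1.2, so the result falls back to the independent value of about 0.784.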
