Home > mailing lists

Re: Consider the number of columns in the sort cost model - Mailing list pgsql-hackers

From	Alena Rybakina
Subject	Re: Consider the number of columns in the sort cost model
Date	October 28, 2024 12:48:19
Msg-id	0870a8c4-4579-4428-bacb-c604463ab2ae@postgrespro.ru Whole thread Raw
In response to	Re: Consider the number of columns in the sort cost model (Kirill Reshke <reshkekirill@gmail.com>)
List	pgsql-hackers

Tree view

Hi! Thank you for your work on this subject!

On 23.10.2024 04:39, Andrei Lepikhov wrote:
> On 15/10/2024 12:15, David Rowley wrote:
>> On Tue, 15 Oct 2024 at 17:48, Andrei Lepikhov <lepihov@gmail.com> wrote:
>> I think maybe what is worth working on is seeing if you can better
>> estimate the number of tiebreak comparisons by checking the number of
>> distinct values for each sort key.  That might require what we
>> discussed on another thread about having estimate_num_groups() better
>> use the n_distinct estimate from the EquivalenceMember with the least
>> distinct values.  It'll be another question if all that can be done
>> and this all still perform well enough for cost_sort(). You may have
>> to write it first before we can figure that out.  It may be possible
>> to buy back some of the overheads with some additional caching...
>> Perhaps that could be done within the EquivalenceClass.
> It seems that your idea with an equivalence class works.
> In the patchset attached, I show all the ideas (I initially wanted to 
> discuss it step-by-step).
> In the attached, the first patch is about choosing adequate expression 
> from the list of EC members.
> The second one is the current patch, which considers the number of 
> sort columns.
> Third patch - elaboration of the second patch. Use only the trustful 
> statistics here - the number of ndistinct values on the first sorted 
> column.
> And the last patch is a demo of how I'm going to use the previous 
> three patches and add one more strategy to improve the order of 
> columns in the GROUP-BY clause.
>
I have some question about your patches.

1. You consider two lists root->group_pathkeys and 
root->processed_groupClause.

As far as I understand, this is due to the fact that the pathkey 
elements are identical to the group clause and identity verification is 
precisely checked by this condition:

if (foreach_current_index(lc1) >= root->num_groupby_pathkeys)

To be honest, I didn’t find information about this in the code, but did 
I understand everything correctly?

2. I noticed that statistics of distinct values are calculated several 
times. Maybe I miss something, but this can be avoided if these 
statistics can be saved after calculation. For example, I saw that it is 
possible that you calculate the same statistic information for the same 
equivalence members in cost_incremental_sort and identify_sort_ecmember. 
Is it possible to store information in a structure and use it later?

3. I think you should initialize the variable ndist in your patch. I 
faced the compile complaining during your code compilation.

costsize.c: In function ‘identify_sort_ecmember’:
costsize.c:6694:42: error: ‘ndist’ may be used uninitialized 
[-Werror=maybe-uninitialized]
  6694 |                         em->em_ndistinct = ndist;
       |                         ~~~~~~~~~~~~~~~~~^~~~~
costsize.c:6680:33: note: ‘ndist’ was declared here
  6680 |                         double  ndist;
       |                                 ^~~
cc1: all warnings being treated as errors
gmake[4]: *** [<builtin>: costsize.o] Error 1

4. Before executing examine_variable function I think you should check 
that it is not null.

@@ -575,6 +575,8 @@ get_useful_group_keys_orderings(PlannerInfo *root, 
Path *path)
               */
              continue;

+        Assert (node != NULL);
+
          examine_variable(root, node, 0, &vardata);
          if (!HeapTupleIsValid(vardata.statsTuple))
              continue;


-- 
Regards,
Alena Rybakina
Postgres Professional

pgsql-hackers by date:

From: Peter Eisentraut
Date: 28 October 2024, 12:45:21
Subject: Re: further #include cleanup (IWYU)

From: Heikki Linnakangas
Date: 28 October 2024, 13:56:03
Subject: Re: IPC::Run::time[r|out] vs our TAP tests

Re: Consider the number of columns in the sort cost model - Mailing list pgsql-hackers

Previous

Next