Re: Consider the number of columns in the sort cost model - Mailing list pgsql-hackers
From | Alena Rybakina |
---|---|
Subject | Re: Consider the number of columns in the sort cost model |
Date | |
Msg-id | 0870a8c4-4579-4428-bacb-c604463ab2ae@postgrespro.ru Whole thread Raw |
In response to | Re: Consider the number of columns in the sort cost model (Kirill Reshke <reshkekirill@gmail.com>) |
List | pgsql-hackers |
Hi! Thank you for your work on this subject! On 23.10.2024 04:39, Andrei Lepikhov wrote: > On 15/10/2024 12:15, David Rowley wrote: >> On Tue, 15 Oct 2024 at 17:48, Andrei Lepikhov <lepihov@gmail.com> wrote: >> I think maybe what is worth working on is seeing if you can better >> estimate the number of tiebreak comparisons by checking the number of >> distinct values for each sort key. That might require what we >> discussed on another thread about having estimate_num_groups() better >> use the n_distinct estimate from the EquivalenceMember with the least >> distinct values. It'll be another question if all that can be done >> and this all still perform well enough for cost_sort(). You may have >> to write it first before we can figure that out. It may be possible >> to buy back some of the overheads with some additional caching... >> Perhaps that could be done within the EquivalenceClass. > It seems that your idea with an equivalence class works. > In the patchset attached, I show all the ideas (I initially wanted to > discuss it step-by-step). > In the attached, the first patch is about choosing adequate expression > from the list of EC members. > The second one is the current patch, which considers the number of > sort columns. > Third patch - elaboration of the second patch. Use only the trustful > statistics here - the number of ndistinct values on the first sorted > column. > And the last patch is a demo of how I'm going to use the previous > three patches and add one more strategy to improve the order of > columns in the GROUP-BY clause. > I have some question about your patches. 1. You consider two lists root->group_pathkeys and root->processed_groupClause. As far as I understand, this is due to the fact that the pathkey elements are identical to the group clause and identity verification is precisely checked by this condition: if (foreach_current_index(lc1) >= root->num_groupby_pathkeys) To be honest, I didn’t find information about this in the code, but did I understand everything correctly? 2. I noticed that statistics of distinct values are calculated several times. Maybe I miss something, but this can be avoided if these statistics can be saved after calculation. For example, I saw that it is possible that you calculate the same statistic information for the same equivalence members in cost_incremental_sort and identify_sort_ecmember. Is it possible to store information in a structure and use it later? 3. I think you should initialize the variable ndist in your patch. I faced the compile complaining during your code compilation. costsize.c: In function ‘identify_sort_ecmember’: costsize.c:6694:42: error: ‘ndist’ may be used uninitialized [-Werror=maybe-uninitialized] 6694 | em->em_ndistinct = ndist; | ~~~~~~~~~~~~~~~~~^~~~~ costsize.c:6680:33: note: ‘ndist’ was declared here 6680 | double ndist; | ^~~ cc1: all warnings being treated as errors gmake[4]: *** [<builtin>: costsize.o] Error 1 4. Before executing examine_variable function I think you should check that it is not null. @@ -575,6 +575,8 @@ get_useful_group_keys_orderings(PlannerInfo *root, Path *path) */ continue; + Assert (node != NULL); + examine_variable(root, node, 0, &vardata); if (!HeapTupleIsValid(vardata.statsTuple)) continue; -- Regards, Alena Rybakina Postgres Professional
pgsql-hackers by date: