Re: POC: GROUP BY optimization
От | Andrey V. Lepikhov |
---|---|
Тема | Re: POC: GROUP BY optimization |
Дата | |
Msg-id | 82114b8f-d3a1-ee58-f840-466e695f90c0@postgrespro.ru обсуждение исходный текст |
Ответ на | Re: POC: GROUP BY optimization ("Andrey V. Lepikhov" <a.lepikhov@postgrespro.ru>) |
Список | pgsql-hackers |
> On 7/22/21 3:58 AM, Tomas Vondra wrote: > I've simplified the costing a bit, and the attached version actually > undoes all the "suspicious" plan changes in postgres_fdw. It changes one > new plan, but that seems somewhat reasonable, as it pushes sort to the > remote side. I tried to justify heap-sort part of the compute_cpu_sort_cost() routine and realized, that here we may have a mistake. After a week of efforts, I don't found any research papers on dependence of bounded heap-sort time compexity on number of duplicates. So, I suppose self-made formula, based on simple logical constructions: 1. We should base on initial formula: cost ~ N*LOG2(M), where M - output_tuples. 2. Realize, that full representation of this formula is: cost ~ N*LOG2(min{N,M}) 3. In the case of multicolumn, number of comparisons for each next column can be estimated by the same formula, but arranged to a number of tuples per group: comparisons ~ input * LOG2(min{input,M}) 4. Realize, that for the case, when M > input, we should change this formula a bit: comparisons ~ max{input,M} * LOG2(min{input,M}) Remember, that in our case M << tuples. So, general formula for bounded heap sort can be written as: cost ~ N * sum(max{n_i,M}/N * LOG2(min{n_i,M})), i=1,ncols where n_1 == N, n_i - number of tuples per group, estimated from previous iteration. In attachment - an implementation of this approach. -- regards, Andrey Lepikhov Postgres Professional
Вложения
В списке pgsql-hackers по дате отправления: