Re: POC: GROUP BY optimization

Поиск

Список

Период

Сортировка

От	Andrey V. Lepikhov
Тема	Re: POC: GROUP BY optimization
Дата	1 февраля 2022 г. 09:48:11
Msg-id	82114b8f-d3a1-ee58-f840-466e695f90c0@postgrespro.ru обсуждение исходный текст
Ответ на	Re: POC: GROUP BY optimization ("Andrey V. Lepikhov" <a.lepikhov@postgrespro.ru>)
Список	pgsql-hackers

Дерево обсуждения

> On 7/22/21 3:58 AM, Tomas Vondra wrote:
> I've simplified the costing a bit, and the attached version actually
> undoes all the "suspicious" plan changes in postgres_fdw. It changes one
> new plan, but that seems somewhat reasonable, as it pushes sort to the
> remote side.

I tried to justify heap-sort part of the compute_cpu_sort_cost() routine 
and realized, that here we may have a mistake.
After a week of efforts, I don't found any research papers on dependence 
of bounded heap-sort time compexity on number of duplicates.

So, I suppose self-made formula, based on simple logical constructions:

1. We should base on initial formula: cost ~ N*LOG2(M), where M - 
output_tuples.
2. Realize, that full representation of this formula is:

cost ~ N*LOG2(min{N,M})

3. In the case of multicolumn, number of comparisons for each next 
column can be estimated by the same formula, but arranged to a number of 
tuples per group:

comparisons ~ input * LOG2(min{input,M})

4. Realize, that for the case, when M > input, we should change this 
formula a bit:

comparisons ~ max{input,M} * LOG2(min{input,M})

Remember, that in our case M << tuples.
So, general formula for bounded heap sort can be written as:

cost ~ N * sum(max{n_i,M}/N * LOG2(min{n_i,M})), i=1,ncols

where n_1 == N, n_i - number of tuples per group, estimated from 
previous iteration.

In attachment - an implementation of this approach.

-- 
regards,
Andrey Lepikhov
Postgres Professional

Вложения

bounded_heap_sort_fix.txt

В списке pgsql-hackers по дате отправления:

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Re: POC: GROUP BY optimization

Вложения