Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
Skip to content

Commit bbd5d65

Browse files
committed
Update detail for new todo items.
1 parent 7bbe216 commit bbd5d65

File tree

1 file changed

+252
-1
lines changed

1 file changed

+252
-1
lines changed

doc/TODO.detail/optimizer

Lines changed: 252 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1059,7 +1059,7 @@ From owner-pgsql-hackers@hub.org Thu Jan 20 18:45:32 2000
10591059
Received: from renoir.op.net (root@renoir.op.net [207.29.195.4])
10601060
by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id TAA00672
10611061
for <pgman@candle.pha.pa.us>; Thu, 20 Jan 2000 19:45:30 -0500 (EST)
1062-
Received: from hub.org (hub.org [216.126.84.1]) by renoir.op.net (o1/$Revision: 1.15 $) with ESMTP id TAA01989 for <pgman@candle.pha.pa.us>; Thu, 20 Jan 2000 19:39:15 -0500 (EST)
1062+
Received: from hub.org (hub.org [216.126.84.1]) by renoir.op.net (o1/$Revision: 1.16 $) with ESMTP id TAA01989 for <pgman@candle.pha.pa.us>; Thu, 20 Jan 2000 19:39:15 -0500 (EST)
10631063
Received: from localhost (majordom@localhost)
10641064
by hub.org (8.9.3/8.9.3) with SMTP id TAA00957;
10651065
Thu, 20 Jan 2000 19:35:19 -0500 (EST)
@@ -1586,3 +1586,254 @@ support a couple gigs of RAM now.
15861586

15871587
************
15881588

1589+
From pgsql-hackers-owner+M6019@hub.org Mon Aug 21 11:47:56 2000
1590+
Received: from hub.org (root@hub.org [216.126.84.1])
1591+
by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id LAA07289
1592+
for <pgman@candle.pha.pa.us>; Mon, 21 Aug 2000 11:47:55 -0400 (EDT)
1593+
Received: from hub.org (majordom@localhost [127.0.0.1])
1594+
by hub.org (8.10.1/8.10.1) with SMTP id e7LFlpT03383;
1595+
Mon, 21 Aug 2000 11:47:51 -0400 (EDT)
1596+
Received: from mail.fct.unl.pt (fct1.si.fct.unl.pt [193.136.120.1])
1597+
by hub.org (8.10.1/8.10.1) with SMTP id e7LFlaT03243
1598+
for <pgsql-hackers@postgresql.org>; Mon, 21 Aug 2000 11:47:37 -0400 (EDT)
1599+
Received: (qmail 7416 invoked by alias); 21 Aug 2000 15:54:33 -0000
1600+
Received: (qmail 7410 invoked from network); 21 Aug 2000 15:54:32 -0000
1601+
Received: from eros.si.fct.unl.pt (193.136.120.112)
1602+
by fct1.si.fct.unl.pt with SMTP; 21 Aug 2000 15:54:32 -0000
1603+
Date: Mon, 21 Aug 2000 16:48:08 +0100 (WEST)
1604+
From: =?iso-8859-1?Q?Tiago_Ant=E3o?= <tra@fct.unl.pt>
1605+
X-Sender: tiago@eros.si.fct.unl.pt
1606+
To: Tom Lane <tgl@sss.pgh.pa.us>
1607+
cc: pgsql-hackers@postgresql.org
1608+
Subject: Re: [HACKERS] Optimisation deficiency: currval('seq')-->seq scan,
1609+
constant-->index scan
1610+
In-Reply-To: <1731.966868649@sss.pgh.pa.us>
1611+
Message-ID: <Pine.LNX.4.21.0008211626250.25226-100000@eros.si.fct.unl.pt>
1612+
MIME-Version: 1.0
1613+
Content-Type: TEXT/PLAIN; charset=US-ASCII
1614+
X-Mailing-List: pgsql-hackers@postgresql.org
1615+
Precedence: bulk
1616+
Sender: pgsql-hackers-owner@hub.org
1617+
Status: ORr
1618+
1619+
On Mon, 21 Aug 2000, Tom Lane wrote:
1620+
1621+
> > One thing it might be interesting (please tell me if you think
1622+
> > otherwise) would be to improve pg with better statistical information, by
1623+
> > using, for example, histograms.
1624+
>
1625+
> Yes, that's been on the todo list for a while.
1626+
1627+
If it's ok and nobody is working on that, I'll look on that subject.
1628+
I'll start by looking at the analize portion of vacuum. I'm thinking in
1629+
using arrays for the histogram (I've never used the array data type of
1630+
postgres).
1631+
Should I use 7.0.2 or the cvs version?
1632+
1633+
1634+
> Interesting article. We do most of what she talks about, but we don't
1635+
> have anything like the ClusterRatio statistic. We need it --- that was
1636+
> just being discussed a few days ago in another thread. Do you have any
1637+
> reference on exactly how DB2 defines that stat?
1638+
1639+
1640+
I don't remember seeing that information spefically. From what I've
1641+
read I can speculate:
1642+
1643+
1. They have clusterratios for both indexes and the relation itself.
1644+
2. They might use an index even if there is no "order by" if the table
1645+
has a low clusterratio: just to get the RIDs, then sort the RIDs and
1646+
fetch.
1647+
3. One possible way to calculate this ratio:
1648+
a) for tables
1649+
SeqScan
1650+
if tuple points to a next tuple on the same page then its
1651+
"good"
1652+
ratio = # good tuples / # all tuples
1653+
b) for indexes (high speculation ratio here)
1654+
foreach pointed RID in index
1655+
if RID is in same page of next RID in index than mark as
1656+
"good"
1657+
1658+
I suspect that if a tuple size is big (relative to page size) than the
1659+
cluster ratio is always low.
1660+
1661+
A tuple might also be "good" if it pointed to the next page.
1662+
1663+
Tiago
1664+
1665+
1666+
From pgsql-hackers-owner+M6152@hub.org Wed Aug 23 13:00:33 2000
1667+
Received: from hub.org (root@hub.org [216.126.84.1])
1668+
by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id NAA10259
1669+
for <pgman@candle.pha.pa.us>; Wed, 23 Aug 2000 13:00:33 -0400 (EDT)
1670+
Received: from hub.org (majordom@localhost [127.0.0.1])
1671+
by hub.org (8.10.1/8.10.1) with SMTP id e7NGsPN83008;
1672+
Wed, 23 Aug 2000 12:54:25 -0400 (EDT)
1673+
Received: from mail.fct.unl.pt (fct1.si.fct.unl.pt [193.136.120.1])
1674+
by hub.org (8.10.1/8.10.1) with SMTP id e7NGniN81749
1675+
for <pgsql-hackers@postgresql.org>; Wed, 23 Aug 2000 12:49:44 -0400 (EDT)
1676+
Received: (qmail 9869 invoked by alias); 23 Aug 2000 15:10:04 -0000
1677+
Received: (qmail 9860 invoked from network); 23 Aug 2000 15:10:04 -0000
1678+
Received: from eros.si.fct.unl.pt (193.136.120.112)
1679+
by fct1.si.fct.unl.pt with SMTP; 23 Aug 2000 15:10:04 -0000
1680+
Date: Wed, 23 Aug 2000 16:03:42 +0100 (WEST)
1681+
From: =?iso-8859-1?Q?Tiago_Ant=E3o?= <tra@fct.unl.pt>
1682+
X-Sender: tiago@eros.si.fct.unl.pt
1683+
To: Tom Lane <tgl@sss.pgh.pa.us>
1684+
cc: Jules Bean <jules@jellybean.co.uk>, pgsql-hackers@postgresql.org
1685+
Subject: Re: [HACKERS] Optimisation deficiency: currval('seq')-->seq scan,
1686+
constant-->index scan
1687+
In-Reply-To: <27971.967041030@sss.pgh.pa.us>
1688+
Message-ID: <Pine.LNX.4.21.0008231543340.4273-100000@eros.si.fct.unl.pt>
1689+
MIME-Version: 1.0
1690+
Content-Type: TEXT/PLAIN; charset=US-ASCII
1691+
X-Mailing-List: pgsql-hackers@postgresql.org
1692+
Precedence: bulk
1693+
Sender: pgsql-hackers-owner@hub.org
1694+
Status: ORr
1695+
1696+
Hi!
1697+
1698+
On Wed, 23 Aug 2000, Tom Lane wrote:
1699+
1700+
> Yes, we know about that one. We have stats about the most common value
1701+
> in a column, but no information about how the less-common values are
1702+
> distributed. We definitely need stats about several top values not just
1703+
> one, because this phenomenon of a badly skewed distribution is pretty
1704+
> common.
1705+
1706+
1707+
An end-biased histogram has stats on top values and also on the least
1708+
frequent values. So if a there is a selection on a value that is well
1709+
bellow average, the selectivity estimation will be more acurate. On some
1710+
research papers I've read, it's refered that this is a better approach
1711+
than equi-width histograms (which are said to be the "industry" standard).
1712+
1713+
I not sure whether to use a table or a array attribute on pg_stat for
1714+
the histogram, the problem is what could be expected from the size of the
1715+
attribute (being a text). I'm very affraid of the cost of going through
1716+
several tuples on a table (pg_histogram?) during the optimization phase.
1717+
1718+
One other idea would be to only have better statistics for special
1719+
attributes requested by the user... something like "analyze special
1720+
table(column)".
1721+
1722+
Best Regards,
1723+
Tiago
1724+
1725+
1726+
1727+
From pgsql-hackers-owner+M6160@hub.org Thu Aug 24 00:21:39 2000
1728+
Received: from hub.org (root@hub.org [216.126.84.1])
1729+
by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id AAA27662
1730+
for <pgman@candle.pha.pa.us>; Thu, 24 Aug 2000 00:21:38 -0400 (EDT)
1731+
Received: from hub.org (majordom@localhost [127.0.0.1])
1732+
by hub.org (8.10.1/8.10.1) with SMTP id e7O46w585951;
1733+
Thu, 24 Aug 2000 00:06:58 -0400 (EDT)
1734+
Received: from sss2.sss.pgh.pa.us (sss.pgh.pa.us [209.114.166.2])
1735+
by hub.org (8.10.1/8.10.1) with ESMTP id e7O3uv583775
1736+
for <pgsql-hackers@postgresql.org>; Wed, 23 Aug 2000 23:56:57 -0400 (EDT)
1737+
Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1])
1738+
by sss2.sss.pgh.pa.us (8.9.3/8.9.3) with ESMTP id XAA20973;
1739+
Wed, 23 Aug 2000 23:56:35 -0400 (EDT)
1740+
To: =?iso-8859-1?Q?Tiago_Ant=E3o?= <tra@fct.unl.pt>
1741+
cc: Jules Bean <jules@jellybean.co.uk>, pgsql-hackers@postgresql.org
1742+
Subject: Re: [HACKERS] Optimisation deficiency: currval('seq')-->seq scan, constant-->index scan
1743+
In-reply-to: <Pine.LNX.4.21.0008231543340.4273-100000@eros.si.fct.unl.pt>
1744+
References: <Pine.LNX.4.21.0008231543340.4273-100000@eros.si.fct.unl.pt>
1745+
Comments: In-reply-to =?iso-8859-1?Q?Tiago_Ant=E3o?= <tra@fct.unl.pt>
1746+
message dated "Wed, 23 Aug 2000 16:03:42 +0100"
1747+
Date: Wed, 23 Aug 2000 23:56:35 -0400
1748+
Message-ID: <20970.967089395@sss.pgh.pa.us>
1749+
From: Tom Lane <tgl@sss.pgh.pa.us>
1750+
X-Mailing-List: pgsql-hackers@postgresql.org
1751+
Precedence: bulk
1752+
Sender: pgsql-hackers-owner@hub.org
1753+
Status: OR
1754+
1755+
=?iso-8859-1?Q?Tiago_Ant=E3o?= <tra@fct.unl.pt> writes:
1756+
> One other idea would be to only have better statistics for special
1757+
> attributes requested by the user... something like "analyze special
1758+
> table(column)".
1759+
1760+
This might actually fall out "for free" from the cheapest way of
1761+
implementing the stats. We've talked before about scanning btree
1762+
indexes directly to obtain data values in sorted order, which makes
1763+
it very easy to find the most common values. If you do that, you
1764+
get good stats for exactly those columns that the user has created
1765+
indexes on. A tad indirect but I bet it'd be effective...
1766+
1767+
regards, tom lane
1768+
1769+
From pgsql-hackers-owner+M6165@hub.org Thu Aug 24 05:33:02 2000
1770+
Received: from hub.org (root@hub.org [216.126.84.1])
1771+
by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id FAA14309
1772+
for <pgman@candle.pha.pa.us>; Thu, 24 Aug 2000 05:33:01 -0400 (EDT)
1773+
Received: from hub.org (majordom@localhost [127.0.0.1])
1774+
by hub.org (8.10.1/8.10.1) with SMTP id e7O9X0584670;
1775+
Thu, 24 Aug 2000 05:33:00 -0400 (EDT)
1776+
Received: from athena.office.vi.net (office-gwb.fulham.vi.net [194.88.77.158])
1777+
by hub.org (8.10.1/8.10.1) with ESMTP id e7O9Ix581216
1778+
for <pgsql-hackers@postgresql.org>; Thu, 24 Aug 2000 05:19:03 -0400 (EDT)
1779+
Received: from grommit.office.vi.net [192.168.1.200] (mail)
1780+
by athena.office.vi.net with esmtp (Exim 3.12 #1 (Debian))
1781+
id 13Rt2Y-00073I-00; Thu, 24 Aug 2000 10:11:14 +0100
1782+
Received: from jules by grommit.office.vi.net with local (Exim 3.12 #1 (Debian))
1783+
id 13Rt2Y-0005GV-00; Thu, 24 Aug 2000 10:11:14 +0100
1784+
Date: Thu, 24 Aug 2000 10:11:14 +0100
1785+
From: Jules Bean <jules@jellybean.co.uk>
1786+
To: Tom Lane <tgl@sss.pgh.pa.us>
1787+
Cc: Tiago Ant?o <tra@fct.unl.pt>, pgsql-hackers@postgresql.org
1788+
Subject: Re: [HACKERS] Optimisation deficiency: currval('seq')-->seq scan, constant-->index scan
1789+
Message-ID: <20000824101113.N17510@grommit.office.vi.net>
1790+
References: <1731.966868649@sss.pgh.pa.us> <Pine.LNX.4.21.0008211626250.25226-100000@eros.si.fct.unl.pt> <20000823133418.F17510@grommit.office.vi.net> <27971.967041030@sss.pgh.pa.us>
1791+
Mime-Version: 1.0
1792+
Content-Type: text/plain; charset=us-ascii
1793+
Content-Disposition: inline
1794+
User-Agent: Mutt/1.2i
1795+
In-Reply-To: <27971.967041030@sss.pgh.pa.us>; from tgl@sss.pgh.pa.us on Wed, Aug 23, 2000 at 10:30:30AM -0400
1796+
X-Mailing-List: pgsql-hackers@postgresql.org
1797+
Precedence: bulk
1798+
Sender: pgsql-hackers-owner@hub.org
1799+
Status: OR
1800+
1801+
On Wed, Aug 23, 2000 at 10:30:30AM -0400, Tom Lane wrote:
1802+
> Jules Bean <jules@jellybean.co.uk> writes:
1803+
> > I have in a table a 'category' column which takes a small number of
1804+
> > (basically fixed) values. Here by 'small', I mean ~1000, while the
1805+
> > table itself has ~10 000 000 rows. Some categories have many, many
1806+
> > more rows than others. In particular, there's one category which hits
1807+
> > over half the rows. Because of this (AIUI) postgresql assumes
1808+
> > that the query
1809+
> > select ... from thistable where category='something'
1810+
> > is best served by a seqscan, even though there is an index on
1811+
> > category.
1812+
>
1813+
> Yes, we know about that one. We have stats about the most common value
1814+
> in a column, but no information about how the less-common values are
1815+
> distributed. We definitely need stats about several top values not just
1816+
> one, because this phenomenon of a badly skewed distribution is pretty
1817+
> common.
1818+
1819+
ISTM that that might be enough, in fact.
1820+
1821+
If you have stats telling you that the most popular value is 'xyz',
1822+
and that it constitutes 50% of the rows (i.e. 5 000 000) then you can
1823+
conclude that, on average, other entries constitute a mere 5 000
1824+
000/999 ~~ 5000 entries, and it would be definitely be enough.
1825+
(That's assuming you store the number of distinct values somewhere).
1826+
1827+
1828+
> BTW, if your highly-popular value is actually a dummy value ('UNKNOWN'
1829+
> or something like that), a fairly effective workaround is to replace the
1830+
> dummy entries with NULL. The system does account for NULLs separately
1831+
> from real values, so you'd then get stats based on the most common
1832+
> non-dummy value.
1833+
1834+
I can't really do that. Even if I could, the distribution is very
1835+
skewed -- so the next most common makes up a very high proportion of
1836+
what's left. I forget the figures exactly.
1837+
1838+
Jules
1839+

0 commit comments

Comments
 (0)