Re: a few crazy ideas about hash joins - Mailing list pgsql-hackers
From | Robert Haas |
---|---|
Subject | Re: a few crazy ideas about hash joins |
Date | |
Msg-id | 603c8f070904041939j2b28f322mfbcfdfa6294f6902@mail.gmail.com |
In response to | Re: a few crazy ideas about hash joins (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses | Re: a few crazy ideas about hash joins |
List | pgsql-hackers |
On Fri, Apr 3, 2009 at 5:10 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Fri, Apr 3, 2009 at 4:29 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Correct, but you've got the details all wrong. The real problem is
>>> that the planner might discard a join path hash(A,B) at level 2
>>> because it loses compared to, say, merge(A,B). But when we get to
>>> level three, perhaps hash(hash(A,B),C) would've been the best plan
>>> due to synergy of the two hashes. We'll never find that out unless
>>> we keep the "inferior" hash path around. We can certainly do that;
>>> the question is what's it going to cost us to allow more paths to
>>> survive to the next join level. (And I'm afraid the answer may be
>>> "plenty"; it's a combinatorial explosion we're looking at here.)
>
>> That would be crazy. I think doing it the way I suggested is
>> correct, just not guaranteed to catch every case. The reality is
>> that even if we took Greg Stark's suggestion of detecting this
>> situation only in the executor, we'd still get some benefit out of
>> this. If we take my intermediate approach, we'll catch more cases
>> where this is a win. What you're suggesting here would catch every
>> conceivable case, but at the expense of what I'm sure would be an
>> unacceptable increase in planning time for very limited benefit.
>
> Maybe, maybe not. I've seen plenty of plans that have several
> mergejoins stacked up on top of each other with no intervening sorts.
> There is 0 chance that the planner would have produced that if it
> thought that it had to re-sort at each level; something else would
> have looked cheaper. I think that your proposals will end up getting
> very little of the possible benefit, because the planner will fail to
> choose plan trees in which the optimization can be exploited.

Well, I'm all ears if you have suggestions for improvement.

For sorts, we use PathKeys to represent the ordering of each path and
keep the paths for each set of pathkeys. By analogy, we could maintain
a list of PathHash structures for each path, representing the tables
that had already been hashed. add_path() would then have to consider
both the PathHash structures and the PathKey structures before
concluding that a path was definitely worse than some path previously
found (see the sketch below). At each level of the join tree, we'd
need to truncate PathHash structures that provably have no further use
(e.g. on a base table that does not appear again above the level of
the join already planned), to avoid keeping around paths that appeared
to be better only because we didn't know that the hashes they had
built are worthless in practice. Maybe that wouldn't even be that
expensive, actually, because in lots of cases we'll know that the
relevant table doesn't appear elsewhere in the query and therefore
won't need to save any extra paths. But I think we'd have to write the
code and benchmark it to really know.

I guess the reason I'm not too worked up about this is that, in my
experience, the planner nearly always prefers hash joins on small
tables, even when an index is present - the queries I'm worried about
optimizing don't need any additional encouragement to use hash joins;
they're doing it already. But certainly it doesn't hurt to see how
many cases we can pick up.
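To make the analogy concrete, here is a minimal sketch of what that
extended add_path() dominance test might look like. PathHash, the
path_hashes field on Path, and path_hashes_contained_in() are all
hypothetical names invented for illustration; compare_path_costs(),
pathkeys_contained_in(), bms_equal(), and the List macros are the real
planner primitives, though the real add_path() logic is considerably
more involved than this:

```c
#include "postgres.h"
#include "nodes/bitmapset.h"
#include "optimizer/pathnode.h"
#include "optimizer/paths.h"

/*
 * Hypothetical: one PathHash records a hash table that a path has
 * already built and that joins above it could reuse.
 */
typedef struct PathHash
{
    Relids      hashed_relids;   /* base rels covered by the hash table */
    List       *hash_clauses;    /* hashable join clauses it serves */
} PathHash;

/*
 * Hypothetical helper: true if every hash in 'a' is also available in
 * 'b', i.e. 'a' offers no reusable hash table that 'b' lacks.
 */
static bool
path_hashes_contained_in(List *a, List *b)
{
    ListCell   *lc;

    foreach(lc, a)
    {
        PathHash   *pha = (PathHash *) lfirst(lc);
        ListCell   *lc2;
        bool        found = false;

        foreach(lc2, b)
        {
            PathHash   *phb = (PathHash *) lfirst(lc2);

            if (bms_equal(pha->hashed_relids, phb->hashed_relids))
            {
                found = true;
                break;
            }
        }
        if (!found)
            return false;
    }
    return true;
}

/*
 * Sketch of the dominance test add_path() would need: an old path only
 * dominates a new one if it is no more expensive, at least as well
 * ordered, AND carries at least as many reusable hashes.  Assumes a
 * new List *path_hashes field has been added to Path.
 */
static bool
path_is_dominated(Path *new_path, Path *old_path)
{
    if (compare_path_costs(old_path, new_path, TOTAL_COST) > 0)
        return false;        /* old path is more expensive */
    if (!pathkeys_contained_in(new_path->pathkeys, old_path->pathkeys))
        return false;        /* new path is better sorted */
    if (!path_hashes_contained_in(new_path->path_hashes,
                                  old_path->path_hashes))
        return false;        /* new path has a hash the old one lacks */
    return true;
}
```

Whether the extra paths that survive this weaker pruning test pay for
themselves is exactly the benchmarking question above.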
...Robert