Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
Skip to content

Commit e5a11a8

Browse files
committed
Improve hash method for bitmapsets: some examination of actual outputs
shows that adding a circular shift between words greatly improves the distribution of hash outputs.
1 parent 1f01d59 commit e5a11a8

File tree

1 file changed

+21
-7
lines changed

1 file changed

+21
-7
lines changed

src/backend/nodes/bitmapset.c

Lines changed: 21 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -14,7 +14,7 @@
1414
* Copyright (c) 2003-2005, PostgreSQL Global Development Group
1515
*
1616
* IDENTIFICATION
17-
* $PostgreSQL: pgsql/src/backend/nodes/bitmapset.c,v 1.8 2005/06/08 23:02:04 tgl Exp $
17+
* $PostgreSQL: pgsql/src/backend/nodes/bitmapset.c,v 1.9 2005/06/15 16:24:07 tgl Exp $
1818
*
1919
*-------------------------------------------------------------------------
2020
*/
@@ -769,22 +769,36 @@ bms_first_member(Bitmapset *a)
769769
*
770770
* Note: we must ensure that any two bitmapsets that are bms_equal() will
771771
* hash to the same value; in practice this means that trailing all-zero
772-
* words cannot affect the result. Longitudinal XOR provides a reasonable
773-
* hash value that has this property.
772+
* words cannot affect the result. The circular-shift-and-XOR hash method
773+
* used here has this property, so long as we work from back to front.
774+
*
775+
* Note: you might wonder why we bother with the circular shift; at first
776+
* glance a straight longitudinal XOR seems as good and much simpler. The
777+
* reason is empirical: this gives a better distribution of hash values on
778+
* the bitmapsets actually generated by the planner. A common way to have
779+
* multiword bitmapsets is "a JOIN b JOIN c JOIN d ...", which gives rise
780+
* to rangetables in which base tables and JOIN nodes alternate; so
781+
* bitmapsets of base table RT indexes tend to use only odd-numbered or only
782+
* even-numbered bits. A straight longitudinal XOR would preserve this
783+
* property, leading to a much smaller set of possible outputs than if
784+
* we include a shift.
774785
*/
775786
uint32
776787
bms_hash_value(const Bitmapset *a)
777788
{
778789
bitmapword result = 0;
779-
int nwords;
780790
int wordnum;
781791

782-
if (a == NULL)
792+
if (a == NULL || a->nwords <= 0)
783793
return 0; /* All empty sets hash to 0 */
784-
nwords = a->nwords;
785-
for (wordnum = 0; wordnum < nwords; wordnum++)
794+
for (wordnum = a->nwords; --wordnum > 0; )
786795
{
787796
result ^= a->words[wordnum];
797+
if (result & ((bitmapword) 1 << (BITS_PER_BITMAPWORD - 1)))
798+
result = (result << 1) | 1;
799+
else
800+
result = (result << 1);
788801
}
802+
result ^= a->words[0];
789803
return (uint32) result;
790804
}

0 commit comments

Comments
 (0)