<!-- doc/src/sgml/hash.sgml -->

<chapter id="hash-index">
<title>Hash Indexes</title>

 <indexterm>
  <primary>index</primary>
  <secondary>Hash</secondary>
 </indexterm>

<sect1 id="hash-intro">
 <title>Overview</title>

 <para>
  <productname>PostgreSQL</productname>
  includes an implementation of persistent on-disk hash indexes,
  which are fully crash recoverable. Any data type can be indexed by a
  hash index, including data types that do not have a well-defined linear
  ordering. Hash indexes store only the hash value of the data being
  indexed, thus there are no restrictions on the size of the data column
  being indexed.
 </para>

 <para>
  Hash indexes support only single-column indexes and do not allow
  uniqueness checking.
 </para>

 <para>
  Hash indexes support only the <literal>=</literal> operator,
  so <literal>WHERE</literal> clauses that specify range operations will not
  be able to take advantage of hash indexes.
 </para>
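
 <para>
  For example, assuming a table <structname>accounts</structname> with a
  <type>text</type> column <structfield>email</structfield> (names chosen
  here purely for illustration), a hash index could be created and used for
  an equality lookup like this:
<programlisting>
-- create a single-column hash index; UNIQUE is not supported
CREATE INDEX accounts_email_hash_idx ON accounts USING hash (email);

-- only equality comparisons can make use of the index
SELECT * FROM accounts WHERE email = 'someone@example.com';
</programlisting>
  A condition such as <literal>email &gt; 'm'</literal> cannot use this
  index, since hash indexes do not support range comparisons.
 </para>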

 <para>
  Each hash index tuple stores just the 4-byte hash value, not the actual
  column value. As a result, hash indexes may be much smaller than B-trees
  when indexing longer data items such as UUIDs, URLs, etc. The absence of
  the column value also makes all hash index scans lossy. Hash indexes may
  take part in bitmap index scans and backward scans.
 </para>
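
 <para>
  One way to see the size difference is to build both kinds of index on the
  same long column and compare them with
  <function>pg_relation_size</function>; the table and column names below
  are hypothetical:
<programlisting>
-- a long, randomly distributed key such as a UUID
CREATE INDEX events_uuid_btree_idx ON events USING btree (event_uuid);
CREATE INDEX events_uuid_hash_idx  ON events USING hash  (event_uuid);

-- compare the on-disk sizes of the two indexes
SELECT pg_size_pretty(pg_relation_size('events_uuid_btree_idx')) AS btree_size,
       pg_size_pretty(pg_relation_size('events_uuid_hash_idx'))  AS hash_size;
</programlisting>
 </para>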

 <para>
  Hash indexes are best optimized for <command>SELECT</command> and
  <command>UPDATE</command>-heavy workloads that use equality scans on larger
  tables. In a B-tree index, searches must descend through the tree until the
  leaf page is found. In tables with millions of rows, this descent can
  increase access time to data. In contrast, a hash index allows accessing
  its bucket pages (the equivalent of B-tree leaf pages) directly, thereby
  potentially reducing index access time in larger tables. This reduction in
  <quote>logical I/O</quote> becomes even more pronounced on indexes/data
  larger than <varname>shared_buffers</varname>/RAM.
 </para>
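
 <para>
  Whether the planner actually chooses a hash index for a given query depends
  on its cost estimates and the table's statistics, but this can be checked
  with <command>EXPLAIN</command>. Using the hypothetical index from the
  example above, an equality lookup would be expected to show an index scan
  or bitmap scan on the hash index:
<programlisting>
EXPLAIN SELECT * FROM accounts WHERE email = 'someone@example.com';
</programlisting>
 </para>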

 <para>
  Hash indexes have been designed to cope with uneven distributions of
  hash values. Direct access to the bucket pages works well if the hash
  values are evenly distributed. When inserts cause a bucket page to
  become full, additional overflow pages are chained to that specific
  bucket page, locally expanding the storage for index tuples that match
  that hash value. When scanning a hash bucket during queries, we need to
  scan through all of the overflow pages. Thus an unbalanced hash index
  might actually be worse than a B-tree in terms of the number of block
  accesses required, for some data.
 </para>
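
 <para>
  If the <filename>pgstattuple</filename> extension is installed, the
  <function>pgstathashindex</function> function can be used to check how
  many overflow pages a hash index has accumulated relative to its bucket
  pages (the index name below is hypothetical):
<programlisting>
CREATE EXTENSION IF NOT EXISTS pgstattuple;

-- a large overflow_pages count relative to bucket_pages suggests an
-- unbalanced index
SELECT bucket_pages, overflow_pages, bitmap_pages, free_percent
FROM pgstathashindex('accounts_email_hash_idx');
</programlisting>
 </para>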

 <para>
  As a result of the overflow cases, hash indexes are most suitable for
  unique or nearly unique data, or data with a low number of rows per hash
  bucket.
  One possible way to avoid problems is to exclude highly non-unique
  values from the index using a partial index condition, but this may
  not be suitable in many cases.
 </para>
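
 <para>
  As an illustration of the partial-index approach, suppose a hypothetical
  <structfield>status</structfield> column in which the value
  <literal>archived</literal> is extremely common and is never looked up by
  equality; it could be excluded from the index like this:
<programlisting>
CREATE INDEX orders_status_hash_idx ON orders USING hash (status)
  WHERE status &lt;&gt; 'archived';
</programlisting>
  Queries can use such an index only when their <literal>WHERE</literal>
  condition implies <literal>status &lt;&gt; 'archived'</literal>.
 </para>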

 <para>
  Like B-trees, hash indexes perform simple index tuple deletion. This
  is a deferred maintenance operation that deletes index tuples that are
  known to be safe to delete (those whose item identifier's
  <literal>LP_DEAD</literal> bit is already set). If an insert finds no
  space is available on a page, we try to avoid creating a new overflow
  page by attempting to remove dead index tuples. Removal cannot occur if
  the page is pinned at that time. Deletion of dead index pointers also
  occurs during <command>VACUUM</command>.
 </para>

 <para>
  If it can, <command>VACUUM</command> will also try to squeeze the index
  tuples onto as few overflow pages as possible, minimizing the overflow
  chain. Overflow pages that become empty can be recycled for reuse in
  other buckets, though we never return them to the operating system.
  There is currently no provision to shrink a hash index, other than by
  rebuilding it with <command>REINDEX</command>.
  There is no provision for reducing the number of buckets, either.
 </para>
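
 <para>
  For example, a bloated hash index could be rebuilt, and thereby shrunk,
  with a command such as the following (the index name is hypothetical):
<programlisting>
REINDEX INDEX accounts_email_hash_idx;
</programlisting>
 </para>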

 <para>
  Hash indexes may expand the number of bucket pages as the number of
  rows indexed grows. The hash key-to-bucket-number mapping is chosen so that
  the index can be incrementally expanded. When a new bucket is to be added to
  the index, exactly one existing bucket will need to be <quote>split</quote>,
  with some of its tuples being transferred to the new bucket according to the
  updated key-to-bucket-number mapping.
 </para>

 <para>
  The expansion occurs in the foreground, which could increase execution
  time for user inserts. Thus, hash indexes may not be suitable for tables
  with a rapidly increasing number of rows.
 </para>

</sect1>

<sect1 id="hash-implementation">
 <title>Implementation</title>

 <para>
  There are four kinds of pages in a hash index: the meta page (page zero),
  which contains statically allocated control information; primary bucket
  pages; overflow pages; and bitmap pages, which keep track of overflow
  pages that have been freed and are available for re-use. For addressing
  purposes, bitmap pages are regarded as a subset of the overflow pages.
 </para>
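
 <para>
  The different page types can be observed with the
  <filename>pageinspect</filename> extension, if it is installed (the index
  name below is hypothetical):
<programlisting>
CREATE EXTENSION IF NOT EXISTS pageinspect;

-- block zero of a hash index is always the metapage
SELECT hash_page_type(get_raw_page('accounts_email_hash_idx', 0));

-- block 1 is the primary page of the first bucket
SELECT hash_page_type(get_raw_page('accounts_email_hash_idx', 1));
</programlisting>
 </para>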

 <para>
  Both scanning the index and inserting tuples require locating the bucket
  where a given tuple ought to be located. To do this, we need the bucket
  count, highmask, and lowmask from the metapage; however, it's undesirable
  for performance reasons to have to lock and pin the metapage for
  every such operation. Instead, we retain a cached copy of the metapage
  in each backend's relcache entry. This will produce the correct bucket
  mapping as long as the target bucket hasn't been split since the last
  cache refresh.
 </para>

 <para>
  Primary bucket pages and overflow pages are allocated independently since
  any given index might need more or fewer overflow pages relative to its
  number of buckets. The hash code uses an interesting set of addressing
  rules to support a variable number of overflow pages while not having to
  move primary bucket pages around after they are created.
 </para>

 <para>
  Each row in the table indexed is represented by a single index tuple in
  the hash index. Hash index tuples are stored in bucket pages, and if
  they exist, overflow pages. We speed up searches by keeping the index entries
  in any one index page sorted by hash code, thus allowing binary search to be
  used within an index page. Note however that there is <emphasis>no</emphasis>
  assumption about the relative ordering of hash codes across different index
  pages of a bucket.
 </para>
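
 <para>
  This within-page ordering can be observed with the
  <function>hash_page_items</function> function from
  <filename>pageinspect</filename>; its <structfield>data</structfield>
  column holds the stored hash codes, which should appear in ascending order
  within any one page (the index name is again hypothetical):
<programlisting>
SELECT itemoffset, ctid, data
FROM hash_page_items(get_raw_page('accounts_email_hash_idx', 1));
</programlisting>
 </para>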

 <para>
  The bucket splitting algorithms to expand the hash index are too complex to
  be worthy of mention here, though they are described in more detail in
  <filename>src/backend/access/hash/README</filename>.
  The split algorithm is crash safe and can be restarted if not completed
  successfully.
 </para>

</sect1>

</chapter>