Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
Skip to content

Commit f3690fb

Browse files
author
Amit Kapila
committed
Doc: Hash Indexes.
A new chapter for Hash Indexes, designed to help users understand how they work and when to use them. Backpatch-through 10 where we have made hash indexes durable. Author: Simon Riggs Reviewed-By: Justin Pryzby, Amit Kapila Discussion: https://postgr.es/m/CANbhV-HRjNPYgHo--P1ewBrFJ-GpZPb9_25P7=Wgu7s7hy_sLQ@mail.gmail.com
1 parent 95e2f92 commit f3690fb

File tree

3 files changed

+164
-0
lines changed

3 files changed

+164
-0
lines changed

doc/src/sgml/filelist.sgml

+1
Original file line numberDiff line numberDiff line change
@@ -89,6 +89,7 @@
8989
<!ENTITY spgist SYSTEM "spgist.sgml">
9090
<!ENTITY gin SYSTEM "gin.sgml">
9191
<!ENTITY brin SYSTEM "brin.sgml">
92+
<!ENTITY hash SYSTEM "hash.sgml">
9293
<!ENTITY planstats SYSTEM "planstats.sgml">
9394
<!ENTITY tableam SYSTEM "tableam.sgml">
9495
<!ENTITY indexam SYSTEM "indexam.sgml">

doc/src/sgml/hash.sgml

+162
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,162 @@
1+
<!-- doc/src/sgml/hash.sgml -->
2+
3+
<chapter id="hash-index">
4+
<title>Hash Indexes</title>
5+
6+
<indexterm>
7+
<primary>index</primary>
8+
<secondary>Hash</secondary>
9+
</indexterm>
10+
11+
<sect1 id="hash-intro">
12+
<title>Overview</title>
13+
14+
<para>
15+
<productname>PostgreSQL</productname>
16+
includes an implementation of persistent on-disk hash indexes,
17+
which are fully crash recoverable. Any data type can be indexed by a
18+
hash index, including data types that do not have a well-defined linear
19+
ordering. Hash indexes store only the hash value of the data being
20+
indexed, thus there are no restrictions on the size of the data column
21+
being indexed.
22+
</para>
23+
24+
<para>
25+
Hash indexes support only single-column indexes and do not allow
26+
uniqueness checking.
27+
</para>
28+
29+
<para>
30+
Hash indexes support only the <literal>=</literal> operator,
31+
so WHERE clauses that specify range operations will not be able to take
32+
advantage of hash indexes.
33+
</para>
34+
35+
<para>
36+
Each hash index tuple stores just the 4-byte hash value, not the actual
37+
column value. As a result, hash indexes may be much smaller than B-trees
38+
when indexing longer data items such as UUIDs, URLs, etc. The absence of
39+
the column value also makes all hash index scans lossy. Hash indexes may
40+
take part in bitmap index scans and backward scans.
41+
</para>
42+
43+
<para>
44+
Hash indexes are best optimized for SELECT and UPDATE-heavy workloads
45+
that use equality scans on larger tables. In a B-tree index, searches must
46+
descend through the tree until the leaf page is found. In tables with
47+
millions of rows, this descent can increase access time to data. The
48+
equivalent of a leaf page in a hash index is referred to as a bucket page. In
49+
contrast, a hash index allows accessing the bucket pages directly,
50+
thereby potentially reducing index access time in larger tables. This
51+
reduction in "logical I/O" becomes even more pronounced on indexes/data
52+
larger than shared_buffers/RAM.
53+
</para>
54+
55+
<para>
56+
Hash indexes have been designed to cope with uneven distributions of
57+
hash values. Direct access to the bucket pages works well if the hash
58+
values are evenly distributed. When inserts mean that the bucket page
59+
becomes full, additional overflow pages are chained to that specific
60+
bucket page, locally expanding the storage for index tuples that match
61+
that hash value. When scanning a hash bucket during queries, we need to
62+
scan through all of the overflow pages. Thus an unbalanced hash index
63+
might actually be worse than a B-tree in terms of number of block
64+
accesses required, for some data.
65+
</para>
66+
67+
<para>
68+
As a result of the overflow cases, we can say that hash indexes are
69+
most suitable for unique, nearly unique data or data with a low number
70+
of rows per hash bucket.
71+
One possible way to avoid problems is to exclude highly non-unique
72+
values from the index using a partial index condition, but this may
73+
not be suitable in many cases.
74+
</para>
75+
76+
<para>
77+
Like B-Trees, hash indexes perform simple index tuple deletion. This
78+
is a deferred maintenance operation that deletes index tuples that are
79+
known to be safe to delete (those whose item identifier's LP_DEAD bit
80+
is already set). If an insert finds no space is available on a page we
81+
try to avoid creating a new overflow page by attempting to remove dead
82+
index tuples. Removal cannot occur if the page is pinned at that time.
83+
Deletion of dead index pointers also occurs during VACUUM.
84+
</para>
85+
86+
<para>
87+
If it can, VACUUM will also try to squeeze the index tuples onto as
88+
few overflow pages as possible, minimizing the overflow chain. If an
89+
overflow page becomes empty, overflow pages can be recycled for reuse
90+
in other buckets, though we never return them to the operating system.
91+
There is currently no provision to shrink a hash index, other than by
92+
rebuilding it with REINDEX.
93+
There is no provision for reducing the number of buckets, either.
94+
</para>
95+
96+
<para>
97+
Hash indexes may expand the number of bucket pages as the number of
98+
rows indexed grows. The hash key-to-bucket-number mapping is chosen so that
99+
the index can be incrementally expanded. When a new bucket is to be added to
100+
the index, exactly one existing bucket will need to be "split", with some of
101+
its tuples being transferred to the new bucket according to the updated
102+
key-to-bucket-number mapping.
103+
</para>
104+
105+
<para>
106+
The expansion occurs in the foreground, which could increase execution
107+
time for user inserts. Thus, hash indexes may not be suitable for tables
108+
with rapidly increasing number of rows.
109+
</para>
110+
111+
</sect1>
112+
113+
<sect1 id="hash-implementation">
114+
<title>Implementation</title>
115+
116+
<para>
117+
There are four kinds of pages in a hash index: the meta page (page zero),
118+
which contains statically allocated control information; primary bucket
119+
pages; overflow pages; and bitmap pages, which keep track of overflow
120+
pages that have been freed and are available for re-use. For addressing
121+
purposes, bitmap pages are regarded as a subset of the overflow pages.
122+
</para>
123+
124+
<para>
125+
Both scanning the index and inserting tuples require locating the bucket
126+
where a given tuple ought to be located. To do this, we need the bucket
127+
count, highmask, and lowmask from the metapage; however, it's undesirable
128+
for performance reasons to have to have to lock and pin the metapage for
129+
every such operation. Instead, we retain a cached copy of the metapage
130+
in each backend's relcache entry. This will produce the correct bucket
131+
mapping as long as the target bucket hasn't been split since the last
132+
cache refresh.
133+
</para>
134+
135+
<para>
136+
Primary bucket pages and overflow pages are allocated independently since
137+
any given index might need more or fewer overflow pages relative to its
138+
number of buckets. The hash code uses an interesting set of addressing
139+
rules to support a variable number of overflow pages while not having to
140+
move primary bucket pages around after they are created.
141+
</para>
142+
143+
<para>
144+
Each row in the table indexed is represented by a single index tuple in
145+
the hash index. Hash index tuples are stored in bucket pages, and if
146+
they exist, overflow pages. We speed up searches by keeping the index entries
147+
in any one index page sorted by hash code, thus allowing binary search to be
148+
used within an index page. Note however that there is *no* assumption about
149+
the relative ordering of hash codes across different index pages of a bucket.
150+
</para>
151+
152+
<para>
153+
The bucket splitting algorithms to expand the hash index are too complex to
154+
be worthy of mention here, though are described in more detail in
155+
<filename>src/backend/access/hash/README</filename>.
156+
The split algorithm is crash safe and can be restarted if not completed
157+
successfully.
158+
</para>
159+
160+
</sect1>
161+
162+
</chapter>

doc/src/sgml/postgres.sgml

+1
Original file line numberDiff line numberDiff line change
@@ -266,6 +266,7 @@ break is not needed in a wider output rendering.
266266
&spgist;
267267
&gin;
268268
&brin;
269+
&hash;
269270
&storage;
270271
&bki;
271272
&planstats;

0 commit comments

Comments
 (0)