
Commit c11453c

hash: Add write-ahead logging support.
The warning about hash indexes not being write-ahead logged and their use being discouraged has been removed.  "snapshot too old" is now supported for tables with hash indexes.  Most importantly, barring bugs, hash indexes will now be crash-safe and usable on standbys.

This commit doesn't yet add WAL consistency checking for hash indexes, as we now have for other index types; a separate patch has been submitted to cure that lack.

Amit Kapila, reviewed and slightly modified by me.  The larger patch series of which this is a part has been reviewed and tested by Álvaro Herrera, Ashutosh Sharma, Mark Kirkwood, Jeff Janes, and Jesper Pedersen.

Discussion: http://postgr.es/m/CAA4eK1JOBX=YU33631Qh-XivYXtPSALh514+jR8XeD7v+K3r_Q@mail.gmail.com
1 parent 2b32ac2 commit c11453c
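The crash-safety property this commit buys can be sketched in miniature: every page modification first appends a record to a durable log, and recovery replays the log to rebuild page state lost in a crash. The following Python sketch is purely illustrative (the class and method names are invented for this example, not PostgreSQL's actual WAL machinery):

```python
# Minimal write-ahead-logging sketch: log the change before applying it to
# the volatile "buffer cache"; after a crash that loses dirty pages, a redo
# pass over the log reconstructs a consistent state.
class MiniWAL:
    def __init__(self):
        self.log = []      # durable log records (assumed to survive a crash)
        self.pages = {}    # volatile buffer cache: page id -> list of items

    def insert(self, page, item):
        self.log.append(("insert", page, item))       # WAL record first...
        self.pages.setdefault(page, []).append(item)  # ...then the data page

    def crash(self):
        self.pages = {}    # dirty pages are lost; the log survives

    def recover(self):
        for op, page, item in self.log:               # redo pass
            if op == "insert":
                self.pages.setdefault(page, []).append(item)

wal = MiniWAL()
wal.insert("bucket0", "one")
wal.insert("bucket0", "two")
wal.crash()
wal.recover()
print(wal.pages["bucket0"])   # ['one', 'two'] -- contents restored after crash
```

Before this commit, hash index changes were the one index type skipping the "WAL record first" step, which is why they could not survive a crash or be replayed on a standby.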

File tree

25 files changed

+2001
-140
lines changed

25 files changed

+2001
-140
lines changed

contrib/pageinspect/expected/hash.out

-1
@@ -1,7 +1,6 @@
 CREATE TABLE test_hash (a int, b text);
 INSERT INTO test_hash VALUES (1, 'one');
 CREATE INDEX test_hash_a_idx ON test_hash USING hash (a);
-WARNING: hash indexes are not WAL-logged and their use is discouraged
 \x
 SELECT hash_page_type(get_raw_page('test_hash_a_idx', 0));
 -[ RECORD 1 ]--+---------

contrib/pgstattuple/expected/pgstattuple.out

-2
@@ -131,7 +131,6 @@ select * from pgstatginindex('test_ginidx');
 (1 row)
 
 create index test_hashidx on test using hash (b);
-WARNING: hash indexes are not WAL-logged and their use is discouraged
 select * from pgstathashindex('test_hashidx');
  version | bucket_pages | overflow_pages | bitmap_pages | zero_pages | live_items | dead_items | free_percent
 ---------+--------------+----------------+--------------+------------+------------+------------+--------------
@@ -226,7 +225,6 @@ ERROR:  "test_partition" is not an index
 -- an actual index of a partitioned table should work though
 create index test_partition_idx on test_partition(a);
 create index test_partition_hash_idx on test_partition using hash (a);
-WARNING: hash indexes are not WAL-logged and their use is discouraged
 -- these should work
 select pgstatindex('test_partition_idx');
          pgstatindex

doc/src/sgml/backup.sgml

-13
@@ -1536,19 +1536,6 @@ archive_command = 'local_backup_script.sh "%p" "%f"'
    technique.  These will probably be fixed in future releases:
 
   <itemizedlist>
-   <listitem>
-    <para>
-     Operations on hash indexes are not presently WAL-logged, so
-     replay will not update these indexes.  This will mean that any new inserts
-     will be ignored by the index, updated rows will apparently disappear and
-     deleted rows will still retain pointers.  In other words, if you modify a
-     table with a hash index on it then you will get incorrect query results
-     on a standby server.  When recovery completes it is recommended that you
-     manually <xref linkend="sql-reindex">
-     each such index after completing a recovery operation.
-    </para>
-   </listitem>
-
   <listitem>
    <para>
     If a <xref linkend="sql-createdatabase">

doc/src/sgml/config.sgml

+3-4
@@ -2153,10 +2153,9 @@ include_dir 'conf.d'
        has materialized a result set, no error will be generated even if the
        underlying rows in the referenced table have been vacuumed away.
        Some tables cannot safely be vacuumed early, and so will not be
-       affected by this setting.  Examples include system catalogs and any
-       table which has a hash index.  For such tables this setting will
-       neither reduce bloat nor create a possibility of a <literal>snapshot
-       too old</> error on scanning.
+       affected by this setting, such as system catalogs.  For such tables
+       this setting will neither reduce bloat nor create a possibility
+       of a <literal>snapshot too old</> error on scanning.
       </para>
      </listitem>
     </varlistentry>

doc/src/sgml/high-availability.sgml

-6
@@ -2351,12 +2351,6 @@ LOG:  database system is ready to accept read only connections
    These can and probably will be fixed in future releases:
 
   <itemizedlist>
-   <listitem>
-    <para>
-     Operations on hash indexes are not presently WAL-logged, so
-     replay will not update these indexes.
-    </para>
-   </listitem>
   <listitem>
    <para>
     Full knowledge of running transactions is required before snapshots

doc/src/sgml/indices.sgml

-12
@@ -193,18 +193,6 @@ CREATE INDEX <replaceable>name</replaceable> ON <replaceable>table</replaceable>
 </synopsis>
  </para>
 
-  <caution>
-   <para>
-    Hash index operations are not presently WAL-logged,
-    so hash indexes might need to be rebuilt with <command>REINDEX</>
-    after a database crash if there were unwritten changes.
-    Also, changes to hash indexes are not replicated over streaming or
-    file-based replication after the initial base backup, so they
-    give wrong answers to queries that subsequently use them.
-    For these reasons, hash index use is presently discouraged.
-   </para>
-  </caution>
-
  <para>
   <indexterm>
    <primary>index</primary>

doc/src/sgml/ref/create_index.sgml

-13
@@ -510,19 +510,6 @@ Indexes:
    they can be useful.
   </para>
 
-  <caution>
-   <para>
-    Hash index operations are not presently WAL-logged,
-    so hash indexes might need to be rebuilt with <command>REINDEX</>
-    after a database crash if there were unwritten changes.
-    Also, changes to hash indexes are not replicated over streaming or
-    file-based replication after the initial base backup, so they
-    give wrong answers to queries that subsequently use them.
-    Hash indexes are also not properly restored during point-in-time
-    recovery.  For these reasons, hash index use is presently discouraged.
-   </para>
-  </caution>
-
  <para>
   Currently, only the B-tree, GiST, GIN, and BRIN index methods support
   multicolumn indexes.  Up to 32 fields can be specified by default.

src/backend/access/hash/Makefile

+1-1
@@ -13,6 +13,6 @@ top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
 OBJS = hash.o hashfunc.o hashinsert.o hashovfl.o hashpage.o hashsearch.o \
-	hashsort.o hashutil.o hashvalidate.o
+	hashsort.o hashutil.o hashvalidate.o hash_xlog.o
 
 include $(top_srcdir)/src/backend/common.mk

src/backend/access/hash/README

+119-19
@@ -213,7 +213,7 @@ this flag must be clear before splitting a bucket; thus, a bucket can't be
 split again until the previous split is totally complete.
 
 The moved-by-split flag on a tuple indicates that tuple is moved from old to
-new bucket.  Concurrent scans can skip such tuples till the split operation
+new bucket.  Concurrent scans will skip such tuples until the split operation
 is finished.  Once the tuple is marked as moved-by-split, it will remain so
 forever but that does no harm.  We have intentionally not cleared it as that
 can generate an additional I/O which is not necessary.
@@ -287,13 +287,17 @@ The insertion algorithm is rather similar:
 	if current page is full, release lock but not pin, read/exclusive-lock
 	 next page; repeat as needed
 	>> see below if no space in any page of bucket
+	take buffer content lock in exclusive mode on metapage
 	insert tuple at appropriate place in page
-	mark current page dirty and release buffer content lock and pin
-	if the current page is not a bucket page, release the pin on bucket page
-	pin meta page and take buffer content lock in exclusive mode
+	mark current page dirty
 	increment tuple count, decide if split needed
-	mark meta page dirty and release buffer content lock and pin
-	done if no split needed, else enter Split algorithm below
+	mark meta page dirty
+	write WAL for insertion of tuple
+	release the buffer content lock on metapage
+	release buffer content lock on current page
+	if current page is not a bucket page, release the pin on bucket page
+	if split is needed, enter Split algorithm below
+	release the pin on metapage
 
 To speed searches, the index entries within any individual index page are
 kept sorted by hash code; the insertion code must take care to insert new
@@ -328,12 +332,17 @@ existing bucket in two, thereby lowering the fill ratio:
 	try to finish the split and the cleanup work
 	if that succeeds, start over; if it fails, give up
 	mark the old and new buckets indicating split is in progress
+	mark both old and new buckets as dirty
+	write WAL for allocation of new page for split
 	copy the tuples that belongs to new bucket from old bucket, marking
 	 them as moved-by-split
+	write WAL record for moving tuples to new page once the new page is full
+	or all the pages of old bucket are finished
 	release lock but not pin for primary bucket page of old bucket,
 	 read/shared-lock next page; repeat as needed
 	clear the bucket-being-split and bucket-being-populated flags
 	mark the old bucket indicating split-cleanup
+	write WAL for changing the flags on both old and new buckets
 
 The split operation's attempt to acquire cleanup-lock on the old bucket number
 could fail if another process holds any lock or pin on it. We do not want to
@@ -369,6 +378,8 @@ The fourth operation is garbage collection (bulk deletion):
 	acquire cleanup lock on primary bucket page
 	loop:
 		scan and remove tuples
+		mark the target page dirty
+		write WAL for deleting tuples from target page
 		if this is the last bucket page, break out of loop
 		pin and x-lock next page
 		release prior lock and pin (except keep pin on primary bucket page)
@@ -383,7 +394,8 @@ The fourth operation is garbage collection (bulk deletion):
 	check if number of buckets changed
 	if so, release content lock and pin and return to for-each-bucket loop
 	else update metapage tuple count
-	mark meta page dirty and release buffer content lock and pin
+	mark meta page dirty and write WAL for update of metapage
+	release buffer content lock and pin
 
 Note that this is designed to allow concurrent splits and scans.  If a split
 occurs, tuples relocated into the new bucket will be visited twice by the
@@ -425,18 +437,16 @@ Obtaining an overflow page:
 	search for a free page (zero bit in bitmap)
 	if found:
 		set bit in bitmap
-		mark bitmap page dirty and release content lock
+		mark bitmap page dirty
 		take metapage buffer content lock in exclusive mode
 		if first-free-bit value did not change,
 			update it and mark meta page dirty
-		release meta page buffer content lock
-		return page number
 	else (not found):
 		release bitmap page buffer content lock
 		loop back to try next bitmap page, if any
 	-- here when we have checked all bitmap pages; we hold meta excl. lock
 	extend index to add another overflow page; update meta information
-	mark meta page dirty and release buffer content lock
+	mark meta page dirty
 	return page number
 
 It is slightly annoying to release and reacquire the metapage lock
@@ -456,12 +466,17 @@ like this:
 
 	-- having determined that no space is free in the target bucket:
 	remember last page of bucket, drop write lock on it
-	call free-page-acquire routine
 	re-write-lock last page of bucket
 	if it is not last anymore, step to the last page
-	update (former) last page to point to new page
+	execute free-page-acquire (obtaining an overflow page) mechanism
+	described above
+	update (former) last page to point to the new page and mark buffer dirty
 	write-lock and initialize new page, with back link to former last page
-	write and release former last page
+	write WAL for addition of overflow page
+	release the locks on meta page and bitmap page acquired in
+	free-page-acquire algorithm
+	release the lock on former last page
+	release the lock on new overflow page
 	insert tuple into new page
 	-- etc.

@@ -488,12 +503,14 @@ accessors of pages in the bucket. The algorithm is:
488503
determine which bitmap page contains the free space bit for page
489504
release meta page buffer content lock
490505
pin bitmap page and take buffer content lock in exclusive mode
491-
update bitmap bit
492-
mark bitmap page dirty and release buffer content lock and pin
493-
if page number is less than what we saw as first-free-bit in meta:
494506
retake meta page buffer content lock in exclusive mode
507+
move (insert) tuples that belong to the overflow page being freed
508+
update bitmap bit
509+
mark bitmap page dirty
495510
if page number is still less than first-free-bit,
496511
update first-free-bit field and mark meta page dirty
512+
write WAL for delinking overflow page operation
513+
release buffer content lock and pin
497514
release meta page buffer content lock and pin
498515

499516
We have to do it this way because we must clear the bitmap bit before
@@ -504,8 +521,91 @@ page acquirer will scan more bitmap bits than he needs to. What must be
 avoided is having first-free-bit greater than the actual first free bit,
 because then that free page would never be found by searchers.
 
-All the freespace operations should be called while holding no buffer
-locks.  Since they need no lmgr locks, deadlock is not possible.
+The reason for moving tuples from the overflow page while delinking the latter
+is to make that an atomic operation.  Not doing so could lead to spurious reads
+on standby.  Basically, the user might see the same tuple twice.
+
+
+WAL Considerations
+------------------
+
+The hash index operations like create index, insert, delete, bucket split,
+allocate overflow page, and squeeze in themselves don't guarantee hash index
+consistency after a crash.  To provide robustness, we write WAL for each of
+these operations.
+
+CREATE INDEX writes multiple WAL records.  First, we write a record to cover
+the initialization of the metapage, followed by one for each new bucket
+created, followed by one for the initial bitmap page.  It's not important for
+index creation to appear atomic, because the index isn't yet visible to any
+other transaction, and the creating transaction will roll back in the event of
+a crash.  It would be difficult to cover the whole operation with a single
+write-ahead log record anyway, because we can log only a fixed number of
+pages, as given by XLR_MAX_BLOCK_ID (32), with current XLog machinery.
+
+Ordinary item insertions (that don't force a page split or need a new overflow
+page) are single WAL entries.  They touch a single bucket page and the
+metapage.  The metapage is updated during replay as it is updated during
+original operation.
+
+If an insertion causes the addition of an overflow page, there will be one
+WAL entry for the new overflow page and a second entry for the insert itself.
+
+If an insertion causes a bucket split, there will be one WAL entry for the
+insert itself, followed by a WAL entry for allocating a new bucket, followed
+by a WAL entry for each overflow bucket page in the new bucket to which the
+tuples are moved from the old bucket, followed by a WAL entry to indicate that
+the split is complete for both old and new buckets.  A split operation which
+requires overflow pages to complete the operation will need to write a WAL
+record for each new allocation of an overflow page.
+
+As splitting involves multiple atomic actions, it's possible that the system
+crashes between moving tuples from bucket pages of the old bucket to the new
+bucket.  In such a case, after recovery, the old and new buckets will be
+marked with bucket-being-split and bucket-being-populated flags respectively,
+which indicates that a split is in progress for those buckets.  The reader
+algorithm works correctly, as it will scan both the old and new buckets when
+the split is in progress as explained in the reader algorithm section above.
+
+We finish the split at the next insert or split operation on the old bucket as
+explained in the insert and split algorithms above.  It could be done during
+searches, too, but it seems best not to put any extra updates in what would
+otherwise be a read-only operation (updating is not possible in hot standby
+mode anyway).  It would seem natural to complete the split in VACUUM, but since
+splitting a bucket might require allocating a new page, it might fail if you
+run out of disk space.  That would be bad during VACUUM - the reason for
+running VACUUM in the first place might be that you ran out of disk space,
+and now VACUUM won't finish because you're out of disk space.  In contrast,
+an insertion can require enlarging the physical file anyway.
+
+Deletion of tuples from a bucket is performed for two reasons: to remove dead
+tuples, and to remove tuples that were moved by a bucket split.  A WAL entry
+is made for each bucket page from which tuples are removed, and then another
+WAL entry is made when we clear the needs-split-cleanup flag.  If dead tuples
+are removed, a separate WAL entry is made to update the metapage.
+
+As deletion involves multiple atomic operations, it is quite possible that the
+system crashes after (a) removing tuples from some of the bucket pages, (b)
+before clearing the garbage flag, or (c) before updating the metapage.  If the
+system crashes before completing (b), it will again try to clean the bucket
+during the next vacuum or insert after recovery, which can have some
+performance impact, but it will work fine.  If the system crashes before
+completing (c), after recovery there could be some additional splits until the
+next vacuum updates the metapage, but the other operations like insert, delete
+and scan will work correctly.  We can fix this problem by actually updating
+the metapage based on the delete operation during replay, but it's not clear
+whether it's worth the complication.
+
+A squeeze operation moves tuples from one of the buckets later in the chain to
+one of the buckets earlier in the chain and writes a WAL record when either
+the bucket to which it is writing tuples is filled or the bucket from which it
+is removing the tuples becomes empty.
+
+As a squeeze operation involves writing multiple atomic operations, it is
+quite possible that the system crashes before completing the operation on the
+entire bucket.  After recovery, the operations will work correctly, but the
+index will remain bloated and this can impact performance of read and insert
+operations until the next vacuum squeezes the bucket completely.
 
 
 Other Notes
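The moved-by-split protocol the README relies on for crash recovery can be sketched in a few lines: tuples copied to the new bucket carry a flag, and while the split is unfinished a scan skips flagged tuples in the new bucket (having already seen them in the old one), so each tuple is seen exactly once whether or not the split completed before a crash. This Python sketch is purely illustrative, with invented names, not PostgreSQL's on-disk format:

```python
# Buckets are lists of (value, flags) pairs; MOVED_BY_SPLIT marks tuples
# copied into the new bucket by an in-progress split.
MOVED_BY_SPLIT = "moved-by-split"

def scan_bucket(bucket, split_in_progress):
    # While the split is unfinished, flagged tuples logically still belong
    # to the old bucket, so the reader skips them here.
    for value, flags in bucket:
        if split_in_progress and MOVED_BY_SPLIT in flags:
            continue
        yield value

old_bucket = [(1, set()), (2, set()), (3, set())]
new_bucket = []

# Split begins: copy tuples whose hash now maps to the new bucket,
# marking each copy moved-by-split (assume tuples 2 and 3 move).
for value, flags in old_bucket:
    if value in (2, 3):
        new_bucket.append((value, {MOVED_BY_SPLIT}))

# Mid-split (e.g. after a crash and recovery): the reader scans BOTH
# buckets, and the flagged copies are skipped, so nothing is seen twice.
seen = list(scan_bucket(old_bucket, True)) + list(scan_bucket(new_bucket, True))
print(sorted(seen))   # [1, 2, 3]

# After the split-complete WAL record: the old bucket drops the moved
# tuples and the flags no longer matter; the copies become visible.
old_bucket = [(1, set())]
seen = list(scan_bucket(old_bucket, False)) + list(scan_bucket(new_bucket, False))
print(sorted(seen))   # [1, 2, 3]
```

This is why the WAL design above only needs the bucket-being-split and bucket-being-populated flags to survive a crash: readers stay correct in either state, and the split is finished lazily by the next insert or split on the old bucket.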
