
Commit c4afdca
Fix two serious bugs introduced into hash indexes by the 8.4 patch that made
hash indexes keep entries sorted by hash value.

First, the original plans for concurrency assumed that insertions would
happen only at the end of a page, which is no longer true; this could cause
scans to transiently fail to find index entries in the presence of concurrent
insertions. We can compensate by teaching scans to re-find their position
after re-acquiring read locks.

Second, neither the bucket split nor the bucket compaction logic had been
fixed to preserve hashvalue ordering, so application of either of those
processes could lead to permanent corruption of an index, in the sense that
searches might fail to find entries that are present. This patch fixes the
split and compaction logic to preserve hashvalue ordering, but it cannot do
anything about pre-existing corruption. We will need to recommend reindexing
all hash indexes in the 8.4.2 release notes.

To buy back the performance loss hereby induced in split and compaction, fix
them to use PageIndexMultiDelete instead of retail PageIndexTupleDelete
operations. We might later want to do something with qsort'ing the page
contents rather than doing a binary search for each insertion, but that
seemed more invasive than I cared to risk in a back-patch.

Per bug #5157 from Jeff Janes and subsequent investigation.
1 parent ef59fa0 commit c4afdca
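
The invariant at stake is compact: within any one page of a hash index, entries are kept sorted by hash value, and nothing is assumed about ordering across the pages of a bucket. As a hypothetical illustration only (this checker is not part of the commit; the function name is ours, and the caller is assumed to hold a pin and read lock on the page's buffer), the invariant could be verified like this:

#include "postgres.h"
#include "access/hash.h"
#include "storage/bufpage.h"

/*
 * Hypothetical sketch (not from this commit): verify that the entries
 * within one hash index page are sorted by hash value.
 */
static bool
hash_page_is_sorted(Page page)
{
	OffsetNumber maxoff = PageGetMaxOffsetNumber(page);
	OffsetNumber off;
	uint32		prev_hash = 0;

	for (off = FirstOffsetNumber; off <= maxoff; off = OffsetNumberNext(off))
	{
		IndexTuple	itup = (IndexTuple)
			PageGetItem(page, PageGetItemId(page, off));
		uint32		hashkey = _hash_get_indextuple_hashkey(itup);

		if (hashkey < prev_hash)
			return false;		/* in-page ordering violated */
		prev_hash = hashkey;
	}
	return true;
}

An index corrupted by the pre-fix split or compaction logic would fail such a check on some page, which is why the commit message recommends reindexing.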

File tree

6 files changed (+250 -190 lines)

src/backend/access/hash/README  (+35 -15)

@@ -1,14 +1,15 @@
-$PostgreSQL: pgsql/src/backend/access/hash/README,v 1.8 2008/03/20 17:55:14 momjian Exp $
+$PostgreSQL: pgsql/src/backend/access/hash/README,v 1.9 2009/11/01 21:25:25 tgl Exp $
 
 Hash Indexing
 =============
 
-This directory contains an implementation of hash indexing for Postgres. Most
-of the core ideas are taken from Margo Seltzer and Ozan Yigit, A New Hashing
-Package for UNIX, Proceedings of the Winter USENIX Conference, January 1991.
-(Our in-memory hashtable implementation, src/backend/utils/hash/dynahash.c,
-also relies on some of the same concepts; it is derived from code written by
-Esmond Pitt and later improved by Margo among others.)
+This directory contains an implementation of hash indexing for Postgres.
+Most of the core ideas are taken from Margo Seltzer and Ozan Yigit,
+A New Hashing Package for UNIX, Proceedings of the Winter USENIX Conference,
+January 1991. (Our in-memory hashtable implementation,
+src/backend/utils/hash/dynahash.c, also relies on some of the same concepts;
+it is derived from code written by Esmond Pitt and later improved by Margo
+among others.)
 
 A hash index consists of two or more "buckets", into which tuples are
 placed whenever their hash key maps to the bucket number. The
@@ -32,6 +33,14 @@ rebuilding it with REINDEX. Overflow pages can be recycled for reuse
 in other buckets, but we never give them back to the operating system.
 There is no provision for reducing the number of buckets, either.
 
+As of PostgreSQL 8.4, hash index entries store only the hash code, not the
+actual data value, for each indexed item. This makes the index entries
+smaller (perhaps very substantially so) and speeds up various operations.
+In particular, we can speed searches by keeping the index entries in any
+one index page sorted by hash code, thus allowing binary search to be used
+within an index page. Note however that there is *no* assumption about the
+relative ordering of hash codes across different index pages of a bucket.
+
 
 Page Addressing
 ---------------
@@ -205,8 +214,18 @@ the reader ensures that the target bucket calculation is valid (otherwise
 the bucket might be split before the reader arrives at it, and the target
 entries might go into the new bucket). Holding the bucket sharelock for
 the remainder of the scan prevents the reader's current-tuple pointer from
-being invalidated by other processes. Notice though that the reader need
-not prevent other buckets from being split or compacted.
+being invalidated by splits or compactions. Notice that the reader's lock
+does not prevent other buckets from being split or compacted.
+
+To keep concurrency reasonably good, we require readers to cope with
+concurrent insertions, which means that they have to be able to re-find
+their current scan position after re-acquiring the page sharelock. Since
+deletion is not possible while a reader holds the bucket sharelock, and
+we assume that heap tuple TIDs are unique, this can be implemented by
+searching for the same heap tuple TID previously returned. Insertion does
+not move index entries across pages, so the previously-returned index entry
+should always be on the same page, at the same or higher offset number,
+as it was before.
 
 The insertion algorithm is rather similar:
 
@@ -220,19 +239,20 @@ The insertion algorithm is rather similar:
 	read/exclusive-lock current page of bucket
 	if full, release, read/exclusive-lock next page; repeat as needed
 	>> see below if no space in any page of bucket
-	insert tuple
+	insert tuple at appropriate place in page
 	write/release current page
 	release bucket share-lock
 	read/exclusive-lock meta page
 	increment tuple count, decide if split needed
 	write/release meta page
 	done if no split needed, else enter Split algorithm below
 
-It is okay for an insertion to take place in a bucket that is being
-actively scanned, because it does not change the position of any existing
-item in the bucket, so scan states are not invalidated. We only need the
-short-term buffer locks to ensure that readers do not see a
-partially-updated page.
+To speed searches, the index entries within any individual index page are
+kept sorted by hash code; the insertion code must take care to insert new
+entries in the right place. It is okay for an insertion to take place in a
+bucket that is being actively scanned, because readers can cope with this
+as explained above. We only need the short-term buffer locks to ensure
+that readers do not see a partially-updated page.
 
 It is clearly impossible for readers and inserters to deadlock, and in
 fact this algorithm allows them a very high degree of concurrency.
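
The README text above says in-page sorting allows "binary search to be used within an index page"; the sketch below makes that concrete. It is modeled on the 8.4 helper _hash_binsearch in hashutil.c, but the name and details here are illustrative rather than the exact library code:

#include "postgres.h"
#include "access/hash.h"
#include "storage/bufpage.h"

/*
 * Illustrative binary search over a hash index page whose entries are
 * sorted by hash code.  Returns the first offset whose entry's hash code
 * is >= hash_value; if every entry is smaller, returns one past the last
 * offset, i.e. the append position.
 */
static OffsetNumber
hash_page_binsearch(Page page, uint32 hash_value)
{
	OffsetNumber lower = FirstOffsetNumber;
	OffsetNumber upper = OffsetNumberNext(PageGetMaxOffsetNumber(page));

	while (upper > lower)
	{
		OffsetNumber off = (lower + upper) / 2;
		IndexTuple	itup = (IndexTuple)
			PageGetItem(page, PageGetItemId(page, off));

		if (_hash_get_indextuple_hashkey(itup) < hash_value)
			lower = OffsetNumberNext(off);
		else
			upper = off;
	}
	return lower;
}

The returned offset doubles as the insertion point the insertion code needs (see the hashinsert.c hunk below), so each per-page insert costs O(log n) comparisons.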

src/backend/access/hash/hash.c  (+49 -22)

@@ -8,7 +8,7 @@
  *
  *
  * IDENTIFICATION
- *	  $PostgreSQL: pgsql/src/backend/access/hash/hash.c,v 1.113 2009/07/29 20:56:18 tgl Exp $
+ *	  $PostgreSQL: pgsql/src/backend/access/hash/hash.c,v 1.114 2009/11/01 21:25:25 tgl Exp $
  *
  * NOTES
  *	  This file contains only the public interface routines.
@@ -206,8 +206,10 @@ hashgettuple(PG_FUNCTION_ARGS)
 	ScanDirection dir = (ScanDirection) PG_GETARG_INT32(1);
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
 	Relation	rel = scan->indexRelation;
+	Buffer		buf;
 	Page		page;
 	OffsetNumber offnum;
+	ItemPointer current;
 	bool		res;
 
 	/* Hash indexes are always lossy since we store only the hash code */
@@ -225,8 +227,38 @@ hashgettuple(PG_FUNCTION_ARGS)
 	 * appropriate direction. If we haven't done so yet, we call a routine to
 	 * get the first item in the scan.
 	 */
-	if (ItemPointerIsValid(&(so->hashso_curpos)))
+	current = &(so->hashso_curpos);
+	if (ItemPointerIsValid(current))
 	{
+		/*
+		 * An insertion into the current index page could have happened while
+		 * we didn't have read lock on it. Re-find our position by looking
+		 * for the TID we previously returned. (Because we hold share lock on
+		 * the bucket, no deletions or splits could have occurred; therefore
+		 * we can expect that the TID still exists in the current index page,
+		 * at an offset >= where we were.)
+		 */
+		OffsetNumber maxoffnum;
+
+		buf = so->hashso_curbuf;
+		Assert(BufferIsValid(buf));
+		page = BufferGetPage(buf);
+		maxoffnum = PageGetMaxOffsetNumber(page);
+		for (offnum = ItemPointerGetOffsetNumber(current);
+			 offnum <= maxoffnum;
+			 offnum = OffsetNumberNext(offnum))
+		{
+			IndexTuple	itup;
+
+			itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+			if (ItemPointerEquals(&scan->xs_ctup.t_self, &itup->t_tid))
+				break;
+		}
+		if (offnum > maxoffnum)
+			elog(ERROR, "failed to re-find scan position within index \"%s\"",
+				 RelationGetRelationName(rel));
+		ItemPointerSetOffsetNumber(current, offnum);
+
 		/*
 		 * Check to see if we should kill the previously-fetched tuple.
 		 */
@@ -235,16 +267,14 @@ hashgettuple(PG_FUNCTION_ARGS)
 			/*
 			 * Yes, so mark it by setting the LP_DEAD state in the item flags.
 			 */
-			offnum = ItemPointerGetOffsetNumber(&(so->hashso_curpos));
-			page = BufferGetPage(so->hashso_curbuf);
 			ItemIdMarkDead(PageGetItemId(page, offnum));
 
 			/*
 			 * Since this can be redone later if needed, it's treated the same
 			 * as a commit-hint-bit status update for heap tuples: we mark the
 			 * buffer dirty but don't make a WAL log entry.
 			 */
-			SetBufferCommitInfoNeedsSave(so->hashso_curbuf);
+			SetBufferCommitInfoNeedsSave(buf);
 		}
 
 		/*
@@ -262,7 +292,7 @@ hashgettuple(PG_FUNCTION_ARGS)
 	{
 		while (res)
 		{
-			offnum = ItemPointerGetOffsetNumber(&(so->hashso_curpos));
+			offnum = ItemPointerGetOffsetNumber(current);
 			page = BufferGetPage(so->hashso_curbuf);
 			if (!ItemIdIsDead(PageGetItemId(page, offnum)))
 				break;
@@ -517,7 +547,8 @@ hashbulkdelete(PG_FUNCTION_ARGS)
 		HashPageOpaque opaque;
 		OffsetNumber offno;
 		OffsetNumber maxoffno;
-		bool		page_dirty = false;
+		OffsetNumber deletable[MaxOffsetNumber];
+		int			ndeletable = 0;
 
 		vacuum_delay_point();
 
@@ -529,9 +560,10 @@ hashbulkdelete(PG_FUNCTION_ARGS)
 		Assert(opaque->hasho_bucket == cur_bucket);
 
 		/* Scan each tuple in page */
-		offno = FirstOffsetNumber;
 		maxoffno = PageGetMaxOffsetNumber(page);
-		while (offno <= maxoffno)
+		for (offno = FirstOffsetNumber;
+			 offno <= maxoffno;
+			 offno = OffsetNumberNext(offno))
 		{
 			IndexTuple	itup;
 			ItemPointer htup;
@@ -541,30 +573,25 @@ hashbulkdelete(PG_FUNCTION_ARGS)
 			htup = &(itup->t_tid);
 			if (callback(htup, callback_state))
 			{
-				/* delete the item from the page */
-				PageIndexTupleDelete(page, offno);
-				bucket_dirty = page_dirty = true;
-
-				/* don't increment offno, instead decrement maxoffno */
-				maxoffno = OffsetNumberPrev(maxoffno);
-
+				/* mark the item for deletion */
+				deletable[ndeletable++] = offno;
 				tuples_removed += 1;
 			}
 			else
-			{
-				offno = OffsetNumberNext(offno);
-
 				num_index_tuples += 1;
-			}
 		}
 
 		/*
-		 * Write page if needed, advance to next page.
+		 * Apply deletions and write page if needed, advance to next page.
 		 */
 		blkno = opaque->hasho_nextblkno;
 
-		if (page_dirty)
+		if (ndeletable > 0)
+		{
+			PageIndexMultiDelete(page, deletable, ndeletable);
 			_hash_wrtbuf(rel, buf);
+			bucket_dirty = true;
+		}
 		else
 			_hash_relbuf(rel, buf);
 	}
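
The split and compaction fixes the commit message describes live in hashovfl.c and hashpage.c, whose hunks are not shown on this page. Per the commit message, they follow the same pattern as the hashbulkdelete change above: moved tuples go to their hashkey-ordered position on the destination page, and source-page removals are batched through a single PageIndexMultiDelete call. A hedged sketch of that pattern (illustrative names; not the committed code):

#include "postgres.h"
#include "access/hash.h"
#include "storage/bufpage.h"

/*
 * Illustrative sketch (not the committed code): move all tuples from a
 * source page to a destination buffer while preserving the per-page
 * hash-value ordering, then remove them from the source in one batch.
 * Both pages are assumed pinned and write-locked by the caller.
 */
static void
hash_move_page_tuples(Relation rel, Buffer wbuf, Page rpage)
{
	OffsetNumber deletable[MaxOffsetNumber];
	int			ndeletable = 0;
	OffsetNumber roffnum;
	OffsetNumber rmaxoffnum = PageGetMaxOffsetNumber(rpage);

	for (roffnum = FirstOffsetNumber;
		 roffnum <= rmaxoffnum;
		 roffnum = OffsetNumberNext(roffnum))
	{
		IndexTuple	itup = (IndexTuple)
			PageGetItem(rpage, PageGetItemId(rpage, roffnum));
		Size		itemsz = MAXALIGN(IndexTupleDSize(*itup));

		/* _hash_pgaddtup picks the hashkey-ordered slot on the write page */
		(void) _hash_pgaddtup(rel, wbuf, itemsz, itup);

		/* remember the source item; delete in one batch afterward */
		deletable[ndeletable++] = roffnum;
	}

	if (ndeletable > 0)
		PageIndexMultiDelete(rpage, deletable, ndeletable);
}

The real split and compaction code must additionally handle the destination page filling up and walk overflow-page chains; the sketch omits that to isolate the ordered-insert-plus-batch-delete pattern.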

src/backend/access/hash/hashinsert.c  (+10 -13)

@@ -8,7 +8,7 @@
  *
  *
  * IDENTIFICATION
- *	  $PostgreSQL: pgsql/src/backend/access/hash/hashinsert.c,v 1.52 2009/01/01 17:23:35 momjian Exp $
+ *	  $PostgreSQL: pgsql/src/backend/access/hash/hashinsert.c,v 1.53 2009/11/01 21:25:25 tgl Exp $
  *
  *-------------------------------------------------------------------------
  */
@@ -20,10 +20,6 @@
 #include "utils/rel.h"
 
 
-static OffsetNumber _hash_pgaddtup(Relation rel, Buffer buf,
-			   Size itemsize, IndexTuple itup);
-
-
 /*
  *	_hash_doinsert() -- Handle insertion of a single index tuple.
  *
@@ -180,15 +176,16 @@ _hash_doinsert(Relation rel, IndexTuple itup)
 /*
  *	_hash_pgaddtup() -- add a tuple to a particular page in the index.
  *
- *		This routine adds the tuple to the page as requested; it does
- *		not write out the page. It is an error to call pgaddtup() without
- *		a write lock and pin.
+ * This routine adds the tuple to the page as requested; it does not write out
+ * the page. It is an error to call pgaddtup() without pin and write lock on
+ * the target buffer.
+ *
+ * Returns the offset number at which the tuple was inserted. This function
+ * is responsible for preserving the condition that tuples in a hash index
+ * page are sorted by hashkey value.
 */
-static OffsetNumber
-_hash_pgaddtup(Relation rel,
-			   Buffer buf,
-			   Size itemsize,
-			   IndexTuple itup)
+OffsetNumber
+_hash_pgaddtup(Relation rel, Buffer buf, Size itemsize, IndexTuple itup)
 {
 	OffsetNumber itup_off;
 	Page		page;
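
The hunk above is cut off at the local declarations, so the body that enforces the new "sorted by hashkey value" contract is not visible in this excerpt. A plausible completion of the function — a sketch assuming the 8.4 helpers _hash_binsearch and _hash_get_indextuple_hashkey are used to locate the slot, not necessarily the committed text — would be:

	uint32		hashkey;

	/* sketch: continue the function body shown in the hunk above */
	_hash_checkpage(rel, buf, LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
	page = BufferGetPage(buf);

	/* binary-search the slot that keeps the page sorted by hash key */
	hashkey = _hash_get_indextuple_hashkey(itup);
	itup_off = _hash_binsearch(page, hashkey);

	if (PageAddItem(page, (Item) itup, itemsize, itup_off, false, false)
		== InvalidOffsetNumber)
		elog(ERROR, "failed to add index item to \"%s\"",
			 RelationGetRelationName(rel));

	return itup_off;
}

Because the function now returns the chosen offset and is exported (the static declaration was removed above), the split and compaction code can reuse it to place moved tuples in hashkey order.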
