
Commit 36a35c5

Compress GIN posting lists, for smaller index size.
GIN posting lists are now encoded using varbyte encoding, which allows them
to fit in much smaller space than the straight ItemPointer array format used
before. The new encoding is used both for the lists stored in-line in entry
tree items and for posting tree leaf pages.

To maintain backwards compatibility and keep pg_upgrade working, the code
can still read old-style pages and tuples. Posting tree leaf pages in the
new format are flagged with the GIN_COMPRESSED flag, to distinguish old- and
new-format pages. Likewise, entry tree tuples in the new format have a
GIN_ITUP_COMPRESSED flag set in a bit that was previously unused.

This patch bumps GIN_CURRENT_VERSION from 1 to 2. New indexes created with
version 9.4 will therefore have version number 2 in the metapage, while old
pg_upgraded indexes will have version 1. The code treats them the same, but
it might come in handy in the future, if we want to drop support for the
uncompressed format.

Alexander Korotkov and me. Reviewed by Tomas Vondra and Amit Langote.
1 parent 243ee26 commit 36a35c5

File tree

13 files changed: +2309 -718 lines

contrib/pgstattuple/expected/pgstattuple.out

+1-1
@@ -123,6 +123,6 @@ create index test_ginidx on test using gin (b);
 select * from pgstatginindex('test_ginidx');
  version | pending_pages | pending_tuples
 ---------+---------------+----------------
-       1 |             0 |              0
+       2 |             0 |              0
 (1 row)

src/backend/access/gin/README

+114-9
@@ -135,15 +135,15 @@ same category of null entry are merged into one index entry just as happens
 with ordinary key entries.

 * In a key entry at the btree leaf level, at the next SHORTALIGN boundary,
-there is an array of zero or more ItemPointers, which store the heap tuple
-TIDs for which the indexable items contain this key. This is called the
-"posting list". The TIDs in a posting list must appear in sorted order.
-If the list would be too big for the index tuple to fit on an index page,
-the ItemPointers are pushed out to a separate posting page or pages, and
-none appear in the key entry itself. The separate pages are called a
-"posting tree"; they are organized as a btree of ItemPointer values.
-Note that in either case, the ItemPointers associated with a key can
-easily be read out in sorted order; this is relied on by the scan
+there is a list of item pointers, in compressed format (see the Posting List
+Compression section), pointing to the heap tuples for which the indexable
+items contain this key. This is called the "posting list".
+
+If the list would be too big for the index tuple to fit on an index page, the
+ItemPointers are pushed out to a separate posting page or pages, and none
+appear in the key entry itself. The separate pages are called a "posting
+tree" (see below). Note that in either case, the ItemPointers associated with
+a key can easily be read out in sorted order; this is relied on by the scan
 algorithms.

 * The index tuple header fields of a leaf key entry are abused as follows:
@@ -163,6 +163,11 @@ algorithms.

 * The posting list can be accessed with GinGetPosting(itup)

+* If GinITupIsCompressed(itup), the posting list is stored in compressed
+format. Otherwise it is just an array of ItemPointers. New tuples are always
+stored in compressed format; uncompressed items can be present if the
+database was migrated from a 9.3 or earlier version.
+
 2) Posting tree case:

 * ItemPointerGetBlockNumber(&itup->t_tid) contains the index block number
@@ -210,6 +215,76 @@ fit on one pending-list page must have those pages to itself, even if this
 results in wasting much of the space on the preceding page and the last
 page for the tuple.)

+Posting tree
+------------
+
+If a posting list is too large to store in-line in a key entry, a posting
+tree is created. A posting tree is a B-tree structure, where the ItemPointer
+is used as the key.
+
+Internal posting tree pages use the standard PageHeader and the same "opaque"
+struct as other GIN pages, but do not contain regular index tuples. Instead,
+the content of the page is an array of PostingItem structs. Each PostingItem
+consists of the block number of the child page, and the right bound of that
+child page, as an ItemPointer. The right bound of the page is stored right
+after the page header, before the PostingItem array.
+
+Posting tree leaf pages also use the standard PageHeader and opaque struct,
+and the right bound of the page is stored right after the page header, but
+the page content consists of 0-32 compressed posting lists, and an
+additional array of regular uncompressed item pointers. The compressed
+posting lists are stored one after another, between the page header and
+pd_lower. The uncompressed array is stored between pd_upper and pd_special.
+The space between pd_lower and pd_upper is unused, which allows full-page
+images of posting tree leaf pages to skip the unused space in the middle
+(buffer_std = true in XLogRecData). For historical reasons, this does not
+apply to internal pages, or to uncompressed leaf pages migrated from earlier
+versions.
+
+The item pointers are stored in a number of independent compressed posting
+lists (also called segments), instead of one big one, to make random access
+to a given item pointer faster: to find an item in a compressed list, you
+have to read the list from the beginning, but when the items are split into
+multiple lists, you can first skip over to the list containing the item
+you're looking for, and read only that segment. Also, an update only needs
+to re-encode the affected segment.
+
+The uncompressed items array is used for insertions, to avoid re-encoding
+a compressed list on every update. If there is room on a page, an insertion
+simply puts the new item in the right place in the uncompressed array.
+When a page becomes full, it is rewritten, merging all the uncompressed
+items into the compressed lists. When reading, the uncompressed array and
+the compressed lists are read in tandem, and merged into one stream of
+sorted item pointers.
+
+Posting List Compression
+------------------------
+
+To fit as many item pointers on a page as possible, posting tree leaf pages
+and posting lists stored inline in entry tree leaf tuples use a lightweight
+form of compression. We take advantage of the fact that the item pointers
+are stored in sorted order. Instead of storing the block and offset number
+of each item pointer separately, we store the difference from the previous
+item. That in itself doesn't do much, but it allows us to use so-called
+varbyte encoding to compress them.
+
+Varbyte encoding is a method to encode integers, allowing smaller numbers
+to take less space at the cost of larger numbers. Each integer is
+represented by a variable number of bytes. The high bit of each byte
+determines whether the next byte is still part of this number. Therefore,
+to read a single varbyte encoded number, you have to read bytes until you
+find a byte with the high bit not set.
+
+When encoding, the block and offset number forming the item pointer are
+combined into a single integer. The offset number is stored in the 11 low
+bits (see MaxHeapTuplesPerPageBits in ginpostinglist.c), and the block
+number is stored in the higher bits. That requires 43 bits in total, which
+conveniently fits in at most 6 bytes.
+
+A compressed posting list is passed around and stored on disk in a
+PackedPostingList struct. The first item in the list is stored uncompressed
+as a regular ItemPointerData, followed by the length of the list in bytes,
+followed by the packed items.
+
 Concurrency
 -----------

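To make the delta + varbyte scheme above concrete, here is a minimal
standalone C sketch of the encoding step. The helper names (pack_itemptr,
encode_varbyte) and the buffer handling are illustrative stand-ins, not the
actual routines in ginpostinglist.c:

/*
 * Sketch of delta + varbyte encoding of item pointers, per the README text
 * above. Simplified: real GIN stores the first item of a segment
 * uncompressed in the header and applies deltas from there.
 */
#include <stdint.h>
#include <stdio.h>

#define OFFSET_BITS 11			/* cf. MaxHeapTuplesPerPageBits */

/* Pack (block, offset) into one integer: offset in the 11 low bits. */
static uint64_t
pack_itemptr(uint32_t block, uint16_t offset)
{
	return ((uint64_t) block << OFFSET_BITS) | offset;
}

/* Append 'val' to 'buf' in varbyte format; returns the number of bytes. */
static int
encode_varbyte(uint64_t val, unsigned char *buf)
{
	int			n = 0;

	while (val > 0x7F)
	{
		buf[n++] = 0x80 | (val & 0x7F); /* high bit set: more bytes follow */
		val >>= 7;
	}
	buf[n++] = (unsigned char) val;		/* high bit clear: last byte */
	return n;
}

int
main(void)
{
	/* A sorted posting list, as (block, offset) pairs. */
	uint32_t	blocks[] = {17, 17, 9000};
	uint16_t	offsets[] = {3, 40, 1};
	unsigned char buf[18];		/* worst case: 6 bytes per item */
	uint64_t	prev = 0;
	int			len = 0;

	for (int i = 0; i < 3; i++)
	{
		uint64_t	cur = pack_itemptr(blocks[i], offsets[i]);

		/* Store the difference from the previous item, varbyte-encoded. */
		len += encode_varbyte(cur - prev, buf + len);
		prev = cur;
	}
	printf("3 item pointers encoded in %d bytes\n", len);	/* prints 8 */
	return 0;
}

Decoding mirrors this: read bytes until one has its high bit clear,
reassemble the 7-bit groups, and add the result to the previous item's
packed value.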
@@ -260,6 +335,36 @@ page-deletions safe; it stamps the deleted pages with an XID and keeps the
 deleted pages around with the right-link intact until all concurrent scans
 have finished.)

+Compatibility
+-------------
+
+Compression of TIDs was introduced in 9.4. Some GIN indexes could remain in
+the uncompressed format because of pg_upgrade from 9.3 or earlier versions.
+For compatibility, the old uncompressed format is also supported. The
+following rules are used to handle it:
+
+* The GIN_ITUP_COMPRESSED flag marks index tuples that contain a posting
+list. This flag is stored in the high bit of
+ItemPointerGetBlockNumber(&itup->t_tid). Use GinItupIsCompressed(itup) to
+check the flag.
+
+* Posting tree pages in the new format are marked with the GIN_COMPRESSED
+flag. Macros GinPageIsCompressed(page) and GinPageSetCompressed(page) are
+used to check and set this flag.
+
+* All scan operations check the format of the posting list and use the
+corresponding code to read its content.
+
+* When updating an index tuple containing an uncompressed posting list, it
+will be replaced with a new index tuple containing a compressed list.
+
+* When updating an uncompressed posting tree leaf page, it is compressed.
+
+* If vacuum finds some dead TIDs in uncompressed posting lists, they are
+converted into compressed posting lists. This assumes that the compressed
+posting list fits in the space occupied by the uncompressed list. IOW, we
+assume that the compressed version of the page, with the dead items removed,
+takes less space than the old uncompressed version.
+
 Limitations
 -----------

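As a small illustration of the flag placement described in the Compatibility
section, here is a sketch of carrying a format bit in the otherwise-unused
high bit of a 32-bit block number. The names and the exact bit value below
are simplified stand-ins for the real GIN_ITUP_COMPRESSED and
GinItupIsCompressed definitions in PostgreSQL's GIN headers:

#include <stdbool.h>
#include <stdint.h>

/* Illustrative flag bit: the high bit of the block number. */
#define ITUP_COMPRESSED_FLAG	((uint32_t) 1 << 31)

/* Mark a tuple's block number as carrying a compressed posting list. */
static uint32_t
mark_compressed(uint32_t blkno)
{
	return blkno | ITUP_COMPRESSED_FLAG;
}

/* Does this tuple carry a compressed posting list? */
static bool
is_compressed(uint32_t blkno)
{
	return (blkno & ITUP_COMPRESSED_FLAG) != 0;
}

/* Recover the real block number, ignoring the flag bit. */
static uint32_t
strip_flag(uint32_t blkno)
{
	return blkno & ~ITUP_COMPRESSED_FLAG;
}

Tuples written before the format change simply have the bit clear, so the
same test distinguishes both formats without any extra per-tuple metadata.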
src/backend/access/gin/ginbtree.c

+40-33
@@ -325,9 +325,10 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 {
 	Page		page = BufferGetPage(stack->buffer);
 	XLogRecData *payloadrdata;
-	bool		fit;
+	GinPlaceToPageRC rc;
 	uint16		xlflags = 0;
 	Page		childpage = NULL;
+	Page		newlpage = NULL, newrpage = NULL;

 	if (GinPageIsData(page))
 		xlflags |= GIN_INSERT_ISDATA;
@@ -345,16 +346,17 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 	}

 	/*
-	 * Try to put the incoming tuple on the page. If it doesn't fit,
-	 * placeToPage method will return false and leave the page unmodified, and
-	 * we'll have to split the page.
+	 * Try to put the incoming tuple on the page. placeToPage will decide
+	 * if the page needs to be split.
 	 */
-	START_CRIT_SECTION();
-	fit = btree->placeToPage(btree, stack->buffer, stack->off,
-							 insertdata, updateblkno,
-							 &payloadrdata);
-	if (fit)
+	rc = btree->placeToPage(btree, stack->buffer, stack,
+							insertdata, updateblkno,
+							&payloadrdata, &newlpage, &newrpage);
+	if (rc == UNMODIFIED)
+		return true;
+	else if (rc == INSERTED)
 	{
+		/* placeToPage did START_CRIT_SECTION() */
 		MarkBufferDirty(stack->buffer);

 		/* An insert to an internal page finishes the split of the child. */
@@ -373,7 +375,6 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,

 		xlrec.node = btree->index->rd_node;
 		xlrec.blkno = BufferGetBlockNumber(stack->buffer);
-		xlrec.offset = stack->off;
 		xlrec.flags = xlflags;

 		rdata[0].buffer = InvalidBuffer;
@@ -415,20 +416,16 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,

 		return true;
 	}
-	else
+	else if (rc == SPLIT)
 	{
 		/* Didn't fit, have to split */
 		Buffer		rbuffer;
-		Page		newlpage;
 		BlockNumber savedRightLink;
-		Page		rpage;
 		XLogRecData rdata[2];
 		ginxlogSplit data;
 		Buffer		lbuffer = InvalidBuffer;
 		Page		newrootpg = NULL;

-		END_CRIT_SECTION();
-
 		rbuffer = GinNewBuffer(btree->index);

 		/* During index build, count the new page */
@@ -443,12 +440,9 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 		savedRightLink = GinPageGetOpaque(page)->rightlink;

 		/*
-		 * newlpage is a pointer to memory page, it is not associated with a
-		 * buffer. stack->buffer is not touched yet.
+		 * newlpage and newrpage are pointers to memory pages, not associated
+		 * with buffers. stack->buffer is not touched yet.
 		 */
-		newlpage = btree->splitPage(btree, stack->buffer, rbuffer, stack->off,
-									insertdata, updateblkno,
-									&payloadrdata);

 		data.node = btree->index->rd_node;
 		data.rblkno = BufferGetBlockNumber(rbuffer);
@@ -481,8 +475,6 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 		else
 			rdata[0].next = payloadrdata;

-		rpage = BufferGetPage(rbuffer);
-
 		if (stack->parent == NULL)
 		{
 			/*
@@ -508,7 +500,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 			data.lblkno = BufferGetBlockNumber(lbuffer);
 			data.flags |= GIN_SPLIT_ROOT;

-			GinPageGetOpaque(rpage)->rightlink = InvalidBlockNumber;
+			GinPageGetOpaque(newrpage)->rightlink = InvalidBlockNumber;
 			GinPageGetOpaque(newlpage)->rightlink = BufferGetBlockNumber(rbuffer);

 			/*
@@ -517,20 +509,20 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 			 * than overwriting the original page directly, so that we can still
 			 * abort gracefully if this fails.)
 			 */
-			newrootpg = PageGetTempPage(rpage);
-			GinInitPage(newrootpg, GinPageGetOpaque(newlpage)->flags & ~GIN_LEAF, BLCKSZ);
+			newrootpg = PageGetTempPage(newrpage);
+			GinInitPage(newrootpg, GinPageGetOpaque(newlpage)->flags & ~(GIN_LEAF | GIN_COMPRESSED), BLCKSZ);

 			btree->fillRoot(btree, newrootpg,
 							BufferGetBlockNumber(lbuffer), newlpage,
-							BufferGetBlockNumber(rbuffer), rpage);
+							BufferGetBlockNumber(rbuffer), newrpage);
 		}
 		else
 		{
 			/* split non-root page */
 			data.rrlink = savedRightLink;
 			data.lblkno = BufferGetBlockNumber(stack->buffer);

-			GinPageGetOpaque(rpage)->rightlink = savedRightLink;
+			GinPageGetOpaque(newrpage)->rightlink = savedRightLink;
 			GinPageGetOpaque(newlpage)->flags |= GIN_INCOMPLETE_SPLIT;
 			GinPageGetOpaque(newlpage)->rightlink = BufferGetBlockNumber(rbuffer);
 		}
@@ -550,16 +542,24 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 		START_CRIT_SECTION();

 		MarkBufferDirty(rbuffer);
+		MarkBufferDirty(stack->buffer);

+		/*
+		 * Restore the temporary copies over the real buffers. But don't free
+		 * the temporary copies yet, WAL record data points to them.
+		 */
 		if (stack->parent == NULL)
 		{
-			PageRestoreTempPage(newlpage, BufferGetPage(lbuffer));
 			MarkBufferDirty(lbuffer);
-			newlpage = newrootpg;
+			memcpy(BufferGetPage(stack->buffer), newrootpg, BLCKSZ);
+			memcpy(BufferGetPage(lbuffer), newlpage, BLCKSZ);
+			memcpy(BufferGetPage(rbuffer), newrpage, BLCKSZ);
+		}
+		else
+		{
+			memcpy(BufferGetPage(stack->buffer), newlpage, BLCKSZ);
+			memcpy(BufferGetPage(rbuffer), newrpage, BLCKSZ);
 		}
-
-		PageRestoreTempPage(newlpage, BufferGetPage(stack->buffer));
-		MarkBufferDirty(stack->buffer);

 		/* write WAL record */
 		if (RelationNeedsWAL(btree->index))
@@ -568,7 +568,7 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,

 			recptr = XLogInsert(RM_GIN_ID, XLOG_GIN_SPLIT, rdata);
 			PageSetLSN(BufferGetPage(stack->buffer), recptr);
-			PageSetLSN(rpage, recptr);
+			PageSetLSN(BufferGetPage(rbuffer), recptr);
 			if (stack->parent == NULL)
 				PageSetLSN(BufferGetPage(lbuffer), recptr);
 		}
@@ -582,6 +582,11 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 		if (stack->parent == NULL)
 			UnlockReleaseBuffer(lbuffer);

+		pfree(newlpage);
+		pfree(newrpage);
+		if (newrootpg)
+			pfree(newrootpg);
+
 		/*
 		 * If we split the root, we're done. Otherwise the split is not
 		 * complete until the downlink for the new page has been inserted to
@@ -592,6 +597,8 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack,
 		else
 			return false;
 	}
+	else
+		elog(ERROR, "unknown return code from GIN placeToPage method: %d", rc);
 }

 /*

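The rc comparisons in the diff above (UNMODIFIED, INSERTED, SPLIT) imply a
three-valued result type for the placeToPage method. The following sketch of
that enum is a reading of this diff: the comments summarize the contract
visible in ginPlaceToPage, while the real typedef lives in PostgreSQL's
gin_private.h:

/* Result of the placeToPage method, as dispatched on in ginPlaceToPage(). */
typedef enum
{
	UNMODIFIED,					/* page untouched; caller returns at once */
	INSERTED,					/* tuple placed; placeToPage started the
								 * critical section, caller dirties the
								 * buffer and writes WAL */
	SPLIT						/* page must split; newlpage/newrpage hold
								 * the new left and right page images */
} GinPlaceToPageRC;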