Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
Skip to content

Commit 1bd4bc8

Browse files
Optimize nbtree backwards scans.
Make nbtree backwards scans optimistically access the next page to be read to the left by following a prevPage block number that's now stashed in currPos when the leaf page is first read. This approach matches the one taken during forward scans, which follow a symmetric nextPage block number from currPos. We stash both a prevPage and a nextPage, since the scan direction might change (when fetching from a scrollable cursor). Backwards scans will no longer need to lock the same page twice, except in rare cases where the scan detects a concurrent page split (or page deletion). Testing has shown this optimization to be particularly effective during parallel index-only backwards scans: ~12% reductions in query execution time are quite possible. We're much better off being optimistic; concurrent left sibling page splits are rare in general. It's possible that we'll need to lock more pages than the pessimistic approach would have, but only when there are _multiple_ concurrent splits of the left sibling page we now start at. If there's just a single concurrent left sibling page split, the new approach to scanning backwards will at least break even relative to the old one (we'll acquire the same number of leaf page locks as before). The optimization from this commit has long been contemplated by comments added by commit 2ed5b87, which changed the rules for locking/pinning during nbtree index scans. The approach that that commit introduced to leaf level link traversal when scanning forwards is now more or less applied all the time, regardless of the direction we're scanning in. Following uniform conventions around sibling link traversal is simpler. The only real remaining difference between our forward and backwards handling is that our backwards handling must still detect and recover from any concurrent left sibling splits (and concurrent page deletions), as documented in the nbtree README. That is structured as a single, isolated extra step that takes place in _bt_readnextpage. Also use this opportunity to further simplify the functions that deal with reading pages and traversing sibling links on the leaf level, and to document their preconditions and postconditions (with respect to things like buffer locks, buffer pins, and seizing the parallel scan). This enhancement completely supersedes the one recently added by commit 3f44959. Author: Matthias van de Meent <boekewurm+postgres@gmail.com> Author: Peter Geoghegan <pg@bowt.ie> Discussion: https://postgr.es/m/CAEze2WgpBGRgTTxTWVPXc9+PB6fc1a7t+VyGXHzfnrFXcQVxnA@mail.gmail.com Discussion: https://postgr.es/m/CAH2-WzkBTuFv7W2+84jJT8mWZLXVL0GHq2hMUTn6c9Vw=eYrCw@mail.gmail.com
1 parent 9e2d813 commit 1bd4bc8

File tree

4 files changed

+388
-466
lines changed

4 files changed

+388
-466
lines changed

src/backend/access/nbtree/nbtree.c

Lines changed: 49 additions & 26 deletions
Original file line numberDiff line numberDiff line change
@@ -66,7 +66,9 @@ typedef enum
6666
*/
6767
typedef struct BTParallelScanDescData
6868
{
69-
BlockNumber btps_scanPage; /* latest or next page to be scanned */
69+
BlockNumber btps_nextScanPage; /* next page to be scanned */
70+
BlockNumber btps_lastCurrPage; /* page whose sibling link was copied into
71+
* btps_nextScanPage */
7072
BTPS_State btps_pageStatus; /* indicates whether next page is
7173
* available for scan. see above for
7274
* possible states of parallel scan. */
@@ -550,7 +552,8 @@ btinitparallelscan(void *target)
550552
BTParallelScanDesc bt_target = (BTParallelScanDesc) target;
551553

552554
SpinLockInit(&bt_target->btps_mutex);
553-
bt_target->btps_scanPage = InvalidBlockNumber;
555+
bt_target->btps_nextScanPage = InvalidBlockNumber;
556+
bt_target->btps_lastCurrPage = InvalidBlockNumber;
554557
bt_target->btps_pageStatus = BTPARALLEL_NOT_INITIALIZED;
555558
ConditionVariableInit(&bt_target->btps_cv);
556559
}
@@ -575,7 +578,8 @@ btparallelrescan(IndexScanDesc scan)
575578
* consistency.
576579
*/
577580
SpinLockAcquire(&btscan->btps_mutex);
578-
btscan->btps_scanPage = InvalidBlockNumber;
581+
btscan->btps_nextScanPage = InvalidBlockNumber;
582+
btscan->btps_lastCurrPage = InvalidBlockNumber;
579583
btscan->btps_pageStatus = BTPARALLEL_NOT_INITIALIZED;
580584
SpinLockRelease(&btscan->btps_mutex);
581585
}
@@ -591,26 +595,38 @@ btparallelrescan(IndexScanDesc scan)
591595
* start just yet (only backends that call from _bt_first are capable of
592596
* starting primitive index scans, which they indicate by passing first=true).
593597
*
594-
* If the return value is true, *pageno returns the next or current page
595-
* of the scan (depending on the scan direction). An invalid block number
596-
* means the scan hasn't yet started, or that caller needs to start the next
597-
* primitive index scan (if it's the latter case we'll set so.needPrimScan).
598-
* The first time a participating process reaches the last page, it will return
599-
* true and set *pageno to P_NONE; after that, further attempts to seize the
600-
* scan will return false.
598+
* If the return value is true, *next_scan_page returns the next page of the
599+
* scan, and *last_curr_page returns the page that *next_scan_page came from.
600+
* An invalid *next_scan_page means the scan hasn't yet started, or that
601+
* caller needs to start the next primitive index scan (if it's the latter
602+
* case we'll set so.needPrimScan). The first time a participating process
603+
* reaches the last page, it will return true and set *next_scan_page to
604+
* P_NONE; after that, further attempts to seize the scan will return false.
601605
*
602-
* Callers should ignore the value of pageno if the return value is false.
606+
* Callers should ignore the value of *next_scan_page and *last_curr_page if
607+
* the return value is false.
603608
*/
604609
bool
605-
_bt_parallel_seize(IndexScanDesc scan, BlockNumber *pageno, bool first)
610+
_bt_parallel_seize(IndexScanDesc scan, BlockNumber *next_scan_page,
611+
BlockNumber *last_curr_page, bool first)
606612
{
607613
BTScanOpaque so = (BTScanOpaque) scan->opaque;
608614
bool exit_loop = false;
609615
bool status = true;
610616
ParallelIndexScanDesc parallel_scan = scan->parallel_scan;
611617
BTParallelScanDesc btscan;
612618

613-
*pageno = P_NONE;
619+
*next_scan_page = P_NONE;
620+
*last_curr_page = InvalidBlockNumber;
621+
622+
/*
623+
* Reset so->currPos, and initialize moreLeft/moreRight such that the next
624+
* call to _bt_readnextpage treats this backend similarly to a serial
625+
* backend that steps from *last_curr_page to *next_scan_page (unless this
626+
* backend's so->currPos is initialized by _bt_readfirstpage before then).
627+
*/
628+
BTScanPosInvalidate(so->currPos);
629+
so->currPos.moreLeft = so->currPos.moreRight = true;
614630

615631
if (first)
616632
{
@@ -660,7 +676,7 @@ _bt_parallel_seize(IndexScanDesc scan, BlockNumber *pageno, bool first)
660676
array->cur_elem = btscan->btps_arrElems[i];
661677
skey->sk_argument = array->elem_values[array->cur_elem];
662678
}
663-
*pageno = InvalidBlockNumber;
679+
*next_scan_page = InvalidBlockNumber;
664680
exit_loop = true;
665681
}
666682
else
@@ -688,7 +704,8 @@ _bt_parallel_seize(IndexScanDesc scan, BlockNumber *pageno, bool first)
688704
* of advancing it to a new page!
689705
*/
690706
btscan->btps_pageStatus = BTPARALLEL_ADVANCING;
691-
*pageno = btscan->btps_scanPage;
707+
*next_scan_page = btscan->btps_nextScanPage;
708+
*last_curr_page = btscan->btps_lastCurrPage;
692709
exit_loop = true;
693710
}
694711
SpinLockRelease(&btscan->btps_mutex);
@@ -703,17 +720,21 @@ _bt_parallel_seize(IndexScanDesc scan, BlockNumber *pageno, bool first)
703720

704721
/*
705722
* _bt_parallel_release() -- Complete the process of advancing the scan to a
706-
* new page. We now have the new value btps_scanPage; some other backend
723+
* new page. We now have the new value btps_nextScanPage; another backend
707724
* can now begin advancing the scan.
708725
*
709-
* Callers whose scan uses array keys must save their scan_page argument so
726+
* Callers whose scan uses array keys must save their curr_page argument so
710727
* that it can be passed to _bt_parallel_primscan_schedule, should caller
711-
* determine that another primitive index scan is required. If that happens,
712-
* scan_page won't be scanned by any backend (unless the next primitive index
713-
* scan lands on scan_page).
728+
* determine that another primitive index scan is required.
729+
*
730+
* Note: unlike the serial case, parallel scans don't need to remember both
731+
* sibling links. next_scan_page is whichever link is next given the scan's
732+
* direction. That's all we'll ever need, since the direction of a parallel
733+
* scan can never change.
714734
*/
715735
void
716-
_bt_parallel_release(IndexScanDesc scan, BlockNumber scan_page)
736+
_bt_parallel_release(IndexScanDesc scan, BlockNumber next_scan_page,
737+
BlockNumber curr_page)
717738
{
718739
ParallelIndexScanDesc parallel_scan = scan->parallel_scan;
719740
BTParallelScanDesc btscan;
@@ -722,7 +743,8 @@ _bt_parallel_release(IndexScanDesc scan, BlockNumber scan_page)
722743
parallel_scan->ps_offset);
723744

724745
SpinLockAcquire(&btscan->btps_mutex);
725-
btscan->btps_scanPage = scan_page;
746+
btscan->btps_nextScanPage = next_scan_page;
747+
btscan->btps_lastCurrPage = curr_page;
726748
btscan->btps_pageStatus = BTPARALLEL_IDLE;
727749
SpinLockRelease(&btscan->btps_mutex);
728750
ConditionVariableSignal(&btscan->btps_cv);
@@ -778,13 +800,13 @@ _bt_parallel_done(IndexScanDesc scan)
778800
/*
779801
* _bt_parallel_primscan_schedule() -- Schedule another primitive index scan.
780802
*
781-
* Caller passes the block number most recently passed to _bt_parallel_release
803+
* Caller passes the curr_page most recently passed to _bt_parallel_release
782804
* by its backend. Caller successfully schedules the next primitive index scan
783805
* if the shared parallel state hasn't been seized since caller's backend last
784806
* advanced the scan.
785807
*/
786808
void
787-
_bt_parallel_primscan_schedule(IndexScanDesc scan, BlockNumber prev_scan_page)
809+
_bt_parallel_primscan_schedule(IndexScanDesc scan, BlockNumber curr_page)
788810
{
789811
BTScanOpaque so = (BTScanOpaque) scan->opaque;
790812
ParallelIndexScanDesc parallel_scan = scan->parallel_scan;
@@ -796,10 +818,11 @@ _bt_parallel_primscan_schedule(IndexScanDesc scan, BlockNumber prev_scan_page)
796818
parallel_scan->ps_offset);
797819

798820
SpinLockAcquire(&btscan->btps_mutex);
799-
if (btscan->btps_scanPage == prev_scan_page &&
821+
if (btscan->btps_lastCurrPage == curr_page &&
800822
btscan->btps_pageStatus == BTPARALLEL_IDLE)
801823
{
802-
btscan->btps_scanPage = InvalidBlockNumber;
824+
btscan->btps_nextScanPage = InvalidBlockNumber;
825+
btscan->btps_lastCurrPage = InvalidBlockNumber;
803826
btscan->btps_pageStatus = BTPARALLEL_NEED_PRIMSCAN;
804827

805828
/* Serialize scan's current array keys */

0 commit comments

Comments
 (0)