Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
Skip to content

Commit 11dce63

Browse files
committed
Fix recently-introduced performance problem in ts_headline().
The new hlCover() algorithm that I introduced in commit c9b0c67 turns out to potentially take O(N^2) or worse time on long documents, if there are many occurrences of individual query words but few or no substrings that actually satisfy the query. (One way to hit this behavior is with a "common_word & rare_word" type of query.) This seems unavoidable given the original goal of checking every substring of the document, so we have to back off that idea. Fortunately, it seems unlikely that anyone would really want headlines spanning all of a long document, so we can avoid the worse-than-linear behavior by imposing a maximum length of substring that we'll consider. For now, just hard-wire that maximum length as a multiple of max_words times max_fragments. Perhaps at some point somebody will argue for exposing it as a ts_headline parameter, but I'm hesitant to make such a feature addition in a back-patched bug fix. I also noted that the hlFirstIndex() function I'd added in that commit was unnecessarily stupid: it really only needs to check whether a HeadlineWordEntry's item pointer is null or not. This wouldn't make all that much difference in typical cases with queries having just a few terms, but a cycle shaved is a cycle earned. In addition, add a CHECK_FOR_INTERRUPTS call in TS_execute_recurse. This ensures that hlCover's loop is cancellable if it manages to take a long time, and it may protect some other TS_execute callers as well. Back-patch to 9.6 as the previous commit was. I also chose to add the CHECK_FOR_INTERRUPTS call to 9.5. The old hlCover() algorithm seems to avoid the O(N^2) behavior, at least on the test case I tried, but nonetheless it's not very quick on a long document. Per report from Stephen Frost. Discussion: https://postgr.es/m/20200724160535.GW12375@tamriel.snowman.net
1 parent 0d9b64f commit 11dce63

File tree

2 files changed

+35
-25
lines changed

2 files changed

+35
-25
lines changed

src/backend/tsearch/wparser_def.c

Lines changed: 32 additions & 25 deletions
Original file line numberDiff line numberDiff line change
@@ -2010,33 +2010,29 @@ checkcondition_HL(void *opaque, QueryOperand *val, ExecPhraseData *data)
20102010
* Returns -1 if no such index
20112011
*/
20122012
static int
2013-
hlFirstIndex(HeadlineParsedText *prs, TSQuery query, int pos)
2013+
hlFirstIndex(HeadlineParsedText *prs, int pos)
20142014
{
20152015
int i;
20162016

2017-
/* For each word ... */
20182017
for (i = pos; i < prs->curwords; i++)
20192018
{
2020-
/* ... scan the query to see if this word matches any operand */
2021-
QueryItem *item = GETQUERY(query);
2022-
int j;
2023-
2024-
for (j = 0; j < query->size; j++)
2025-
{
2026-
if (item->type == QI_VAL &&
2027-
prs->words[i].item == &item->qoperand)
2028-
return i;
2029-
item++;
2030-
}
2019+
if (prs->words[i].item != NULL)
2020+
return i;
20312021
}
20322022
return -1;
20332023
}
20342024

20352025
/*
20362026
* hlCover: try to find a substring of prs' word list that satisfies query
20372027
*
2038-
* At entry, *p must be the first word index to consider (initialize this to
2039-
* zero, or to the next index after a previous successful search).
2028+
* At entry, *p must be the first word index to consider (initialize this
2029+
* to zero, or to the next index after a previous successful search).
2030+
* We will consider all substrings starting at or after that word, and
2031+
* containing no more than max_cover words. (We need a length limit to
2032+
* keep this from taking O(N^2) time for a long document with many query
2033+
* words but few complete matches. Actually, since checkcondition_HL is
2034+
* roughly O(N) in the length of the substring being checked, it's even
2035+
* worse than that.)
20402036
*
20412037
* On success, sets *p to first word index and *q to last word index of the
20422038
* cover substring, and returns true.
@@ -2045,7 +2041,8 @@ hlFirstIndex(HeadlineParsedText *prs, TSQuery query, int pos)
20452041
* words used in the query.
20462042
*/
20472043
static bool
2048-
hlCover(HeadlineParsedText *prs, TSQuery query, int *p, int *q)
2044+
hlCover(HeadlineParsedText *prs, TSQuery query, int max_cover,
2045+
int *p, int *q)
20492046
{
20502047
int pmin,
20512048
pmax,
@@ -2059,7 +2056,7 @@ hlCover(HeadlineParsedText *prs, TSQuery query, int *p, int *q)
20592056
* appearing in the query; there's no point in trying endpoints in between
20602057
* such points.
20612058
*/
2062-
pmin = hlFirstIndex(prs, query, *p);
2059+
pmin = hlFirstIndex(prs, *p);
20632060
while (pmin >= 0)
20642061
{
20652062
/* This useless assignment just keeps stupider compilers quiet */
@@ -2080,7 +2077,7 @@ hlCover(HeadlineParsedText *prs, TSQuery query, int *p, int *q)
20802077
return true;
20812078
}
20822079
/* Nope, so advance pmax to next feasible endpoint */
2083-
nextpmax = hlFirstIndex(prs, query, pmax + 1);
2080+
nextpmax = hlFirstIndex(prs, pmax + 1);
20842081

20852082
/*
20862083
* If this is our first advance past pmin, then the result is also
@@ -2091,7 +2088,7 @@ hlCover(HeadlineParsedText *prs, TSQuery query, int *p, int *q)
20912088
nextpmin = nextpmax;
20922089
pmax = nextpmax;
20932090
}
2094-
while (pmax >= 0);
2091+
while (pmax >= 0 && pmax - pmin < max_cover);
20952092
/* No luck here, so try next feasible startpoint */
20962093
pmin = nextpmin;
20972094
}
@@ -2193,7 +2190,7 @@ get_next_fragment(HeadlineParsedText *prs, int *startpos, int *endpos,
21932190
static void
21942191
mark_hl_fragments(HeadlineParsedText *prs, TSQuery query, bool highlightall,
21952192
int shortword, int min_words,
2196-
int max_words, int max_fragments)
2193+
int max_words, int max_fragments, int max_cover)
21972194
{
21982195
int32 poslen,
21992196
curlen,
@@ -2220,7 +2217,7 @@ mark_hl_fragments(HeadlineParsedText *prs, TSQuery query, bool highlightall,
22202217
covers = palloc(maxcovers * sizeof(CoverPos));
22212218

22222219
/* get all covers */
2223-
while (hlCover(prs, query, &p, &q))
2220+
while (hlCover(prs, query, max_cover, &p, &q))
22242221
{
22252222
startpos = p;
22262223
endpos = q;
@@ -2375,7 +2372,7 @@ mark_hl_fragments(HeadlineParsedText *prs, TSQuery query, bool highlightall,
23752372
*/
23762373
static void
23772374
mark_hl_words(HeadlineParsedText *prs, TSQuery query, bool highlightall,
2378-
int shortword, int min_words, int max_words)
2375+
int shortword, int min_words, int max_words, int max_cover)
23792376
{
23802377
int p = 0,
23812378
q = 0;
@@ -2393,7 +2390,7 @@ mark_hl_words(HeadlineParsedText *prs, TSQuery query, bool highlightall,
23932390
if (!highlightall)
23942391
{
23952392
/* examine all covers, select a headline using the best one */
2396-
while (hlCover(prs, query, &p, &q))
2393+
while (hlCover(prs, query, max_cover, &p, &q))
23972394
{
23982395
/*
23992396
* Count words (curlen) and interesting words (poslen) within
@@ -2549,6 +2546,7 @@ prsd_headline(PG_FUNCTION_ARGS)
25492546
int shortword = 3;
25502547
int max_fragments = 0;
25512548
bool highlightall = false;
2549+
int max_cover;
25522550
ListCell *l;
25532551

25542552
/* Extract configuration option values */
@@ -2588,6 +2586,15 @@ prsd_headline(PG_FUNCTION_ARGS)
25882586
defel->defname)));
25892587
}
25902588

2589+
/*
2590+
* We might eventually make max_cover a user-settable parameter, but for
2591+
* now, just compute a reasonable value based on max_words and
2592+
* max_fragments.
2593+
*/
2594+
max_cover = Max(max_words * 10, 100);
2595+
if (max_fragments > 0)
2596+
max_cover *= max_fragments;
2597+
25912598
/* in HighlightAll mode these parameters are ignored */
25922599
if (!highlightall)
25932600
{
@@ -2612,10 +2619,10 @@ prsd_headline(PG_FUNCTION_ARGS)
26122619
/* Apply appropriate headline selector */
26132620
if (max_fragments == 0)
26142621
mark_hl_words(prs, query, highlightall, shortword,
2615-
min_words, max_words);
2622+
min_words, max_words, max_cover);
26162623
else
26172624
mark_hl_fragments(prs, query, highlightall, shortword,
2618-
min_words, max_words, max_fragments);
2625+
min_words, max_words, max_fragments, max_cover);
26192626

26202627
/* Fill in default values for string options */
26212628
if (!prs->startsel)

src/backend/utils/adt/tsvector_op.c

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1868,6 +1868,9 @@ TS_execute_recurse(QueryItem *curitem, void *arg, uint32 flags,
18681868
/* since this function recurses, it could be driven to stack overflow */
18691869
check_stack_depth();
18701870

1871+
/* ... and let's check for query cancel while we're at it */
1872+
CHECK_FOR_INTERRUPTS();
1873+
18711874
if (curitem->type == QI_VAL)
18721875
return chkcond(arg, (QueryOperand *) curitem,
18731876
NULL /* don't need position info */ );

0 commit comments

Comments
 (0)