Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
Skip to content

Commit 4e2477b

Browse files
committed
Fix strange behavior (and possible crashes) in full text phrase search.
In an attempt to simplify the tsquery matching engine, the original phrase search patch invented rewrite rules that would rearrange a tsquery so that no AND/OR/NOT operator appeared below a PHRASE operator. But this approach had numerous problems. The rearrangement step was missed by ts_rewrite (and perhaps other places), allowing tsqueries to be created that would cause Assert failures or perhaps crashes at execution, as reported by Andreas Seltenreich. The rewrite rules effectively defined semantics for operators underneath PHRASE that were buggy, or at least unintuitive. And because rewriting was done in tsqueryin() rather than at execution, the rearrangement was user-visible, which is not very desirable --- for example, it might cause unexpected matches or failures to match in ts_rewrite. As a somewhat independent problem, the behavior of nested PHRASE operators was only sane for left-deep trees; queries like "x <-> (y <-> z)" did not behave intuitively at all. To fix, get rid of the rewrite logic altogether, and instead teach the tsquery execution engine to manage AND/OR/NOT below a PHRASE operator by explicitly computing the match location(s) and match widths for these operators. This requires introducing some additional fields into the publicly visible ExecPhraseData struct; but since there's no way for third-party code to pass such a struct to TS_phrase_execute, it shouldn't create an ABI problem as long as we don't move the offsets of the existing fields. Another related problem was that index searches supposed that "!x <-> y" could be lossily approximated as "!x & y", which isn't correct because the latter will reject, say, "x q y" which the query itself accepts. This required some tweaking in TS_execute_ternary along with the main tsquery engine. Back-patch to 9.6 where phrase operators were introduced. While this could be argued to change behavior more than we'd like in a stable branch, we have to do something about the crash hazards and index-vs-seqscan inconsistency, and it doesn't seem desirable to let the unintuitive behaviors induced by the rewriting implementation stand as precedent. Discussion: https://postgr.es/m/28215.1481999808@sss.pgh.pa.us Discussion: https://postgr.es/m/26706.1482087250@sss.pgh.pa.us
1 parent eaac6c7 commit 4e2477b

File tree

14 files changed

+673
-599
lines changed

14 files changed

+673
-599
lines changed

doc/src/sgml/datatype.sgml

Lines changed: 0 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -3959,15 +3959,7 @@ SELECT 'fat &amp; rat &amp; ! cat'::tsquery;
39593959
tsquery
39603960
------------------------
39613961
'fat' &amp; 'rat' &amp; !'cat'
3962-
3963-
SELECT '(fat | rat) &lt;-&gt; cat'::tsquery;
3964-
tsquery
3965-
-----------------------------------
3966-
'fat' &lt;-&gt; 'cat' | 'rat' &lt;-&gt; 'cat'
39673962
</programlisting>
3968-
3969-
The last example demonstrates that <type>tsquery</type> sometimes
3970-
rearranges nested operators into a logically equivalent formulation.
39713963
</para>
39723964

39733965
<para>

doc/src/sgml/textsearch.sgml

Lines changed: 25 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -264,7 +264,7 @@ SELECT 'fat &amp; cow'::tsquery @@ 'a fat cat sat on a mat and ate a fat rat'::t
264264
text, any more than a <type>tsvector</type> is. A <type>tsquery</type>
265265
contains search terms, which must be already-normalized lexemes, and
266266
may combine multiple terms using AND, OR, NOT, and FOLLOWED BY operators.
267-
(For details see <xref linkend="datatype-tsquery">.) There are
267+
(For syntax details see <xref linkend="datatype-tsquery">.) There are
268268
functions <function>to_tsquery</>, <function>plainto_tsquery</>,
269269
and <function>phraseto_tsquery</>
270270
that are helpful in converting user-written text into a proper
@@ -323,6 +323,8 @@ text @@ text
323323
at least one of its arguments must appear, while the <literal>!</> (NOT)
324324
operator specifies that its argument must <emphasis>not</> appear in
325325
order to have a match.
326+
For example, the query <literal>fat &amp; ! rat</> matches documents that
327+
contain <literal>fat</> but not <literal>rat</>.
326328
</para>
327329

328330
<para>
@@ -377,6 +379,28 @@ SELECT phraseto_tsquery('the cats ate the rats');
377379
then <literal>&amp;</literal>, then <literal>&lt;-&gt;</literal>,
378380
and <literal>!</literal> most tightly.
379381
</para>
382+
383+
<para>
384+
It's worth noticing that the AND/OR/NOT operators mean something subtly
385+
different when they are within the arguments of a FOLLOWED BY operator
386+
than when they are not, because within FOLLOWED BY the exact position of
387+
the match is significant. For example, normally <literal>!x</> matches
388+
only documents that do not contain <literal>x</> anywhere.
389+
But <literal>!x &lt;-&gt; y</> matches <literal>y</> if it is not
390+
immediately after an <literal>x</>; an occurrence of <literal>x</>
391+
elsewhere in the document does not prevent a match. Another example is
392+
that <literal>x &amp; y</> normally only requires that <literal>x</>
393+
and <literal>y</> both appear somewhere in the document, but
394+
<literal>(x &amp; y) &lt;-&gt; z</> requires <literal>x</>
395+
and <literal>y</> to match at the same place, immediately before
396+
a <literal>z</>. Thus this query behaves differently from
397+
<literal>x &lt;-&gt; z &amp; y &lt;-&gt; z</>, which will match a
398+
document containing two separate sequences <literal>x z</> and
399+
<literal>y z</>. (This specific query is useless as written,
400+
since <literal>x</> and <literal>y</> could not match at the same place;
401+
but with more complex situations such as prefix-match patterns, a query
402+
of this form could be useful.)
403+
</para>
380404
</sect2>
381405

382406
<sect2 id="textsearch-intro-configurations">

src/backend/utils/adt/tsginidx.c

Lines changed: 18 additions & 10 deletions
Original file line numberDiff line numberDiff line change
@@ -212,7 +212,7 @@ checkcondition_gin(void *checkval, QueryOperand *val, ExecPhraseData *data)
212212
* Evaluate tsquery boolean expression using ternary logic.
213213
*/
214214
static GinTernaryValue
215-
TS_execute_ternary(GinChkVal *gcv, QueryItem *curitem)
215+
TS_execute_ternary(GinChkVal *gcv, QueryItem *curitem, bool in_phrase)
216216
{
217217
GinTernaryValue val1,
218218
val2,
@@ -230,25 +230,32 @@ TS_execute_ternary(GinChkVal *gcv, QueryItem *curitem)
230230
switch (curitem->qoperator.oper)
231231
{
232232
case OP_NOT:
233-
result = TS_execute_ternary(gcv, curitem + 1);
233+
/* In phrase search, always return MAYBE since we lack positions */
234+
if (in_phrase)
235+
return GIN_MAYBE;
236+
result = TS_execute_ternary(gcv, curitem + 1, in_phrase);
234237
if (result == GIN_MAYBE)
235238
return result;
236239
return !result;
237240

238241
case OP_PHRASE:
239242

240243
/*
241-
* GIN doesn't contain any information about positions, treat
244+
* GIN doesn't contain any information about positions, so treat
242245
* OP_PHRASE as OP_AND with recheck requirement
243246
*/
244-
*gcv->need_recheck = true;
247+
*(gcv->need_recheck) = true;
248+
/* Pass down in_phrase == true in case there's a NOT below */
249+
in_phrase = true;
250+
245251
/* FALL THRU */
246252

247253
case OP_AND:
248-
val1 = TS_execute_ternary(gcv, curitem + curitem->qoperator.left);
254+
val1 = TS_execute_ternary(gcv, curitem + curitem->qoperator.left,
255+
in_phrase);
249256
if (val1 == GIN_FALSE)
250257
return GIN_FALSE;
251-
val2 = TS_execute_ternary(gcv, curitem + 1);
258+
val2 = TS_execute_ternary(gcv, curitem + 1, in_phrase);
252259
if (val2 == GIN_FALSE)
253260
return GIN_FALSE;
254261
if (val1 == GIN_TRUE && val2 == GIN_TRUE)
@@ -257,10 +264,11 @@ TS_execute_ternary(GinChkVal *gcv, QueryItem *curitem)
257264
return GIN_MAYBE;
258265

259266
case OP_OR:
260-
val1 = TS_execute_ternary(gcv, curitem + curitem->qoperator.left);
267+
val1 = TS_execute_ternary(gcv, curitem + curitem->qoperator.left,
268+
in_phrase);
261269
if (val1 == GIN_TRUE)
262270
return GIN_TRUE;
263-
val2 = TS_execute_ternary(gcv, curitem + 1);
271+
val2 = TS_execute_ternary(gcv, curitem + 1, in_phrase);
264272
if (val2 == GIN_TRUE)
265273
return GIN_TRUE;
266274
if (val1 == GIN_FALSE && val2 == GIN_FALSE)
@@ -307,7 +315,7 @@ gin_tsquery_consistent(PG_FUNCTION_ARGS)
307315

308316
res = TS_execute(GETQUERY(query),
309317
&gcv,
310-
TS_EXEC_CALC_NOT | TS_EXEC_PHRASE_AS_AND,
318+
TS_EXEC_CALC_NOT | TS_EXEC_PHRASE_NO_POS,
311319
checkcondition_gin);
312320
}
313321

@@ -343,7 +351,7 @@ gin_tsquery_triconsistent(PG_FUNCTION_ARGS)
343351
gcv.map_item_operand = (int *) (extra_data[0]);
344352
gcv.need_recheck = &recheck;
345353

346-
res = TS_execute_ternary(&gcv, GETQUERY(query));
354+
res = TS_execute_ternary(&gcv, GETQUERY(query), false);
347355

348356
if (res == GIN_TRUE && recheck)
349357
res = GIN_MAYBE;

src/backend/utils/adt/tsgistidx.c

Lines changed: 7 additions & 10 deletions
Original file line numberDiff line numberDiff line change
@@ -359,25 +359,22 @@ gtsvector_consistent(PG_FUNCTION_ARGS)
359359
if (ISALLTRUE(key))
360360
PG_RETURN_BOOL(true);
361361

362-
PG_RETURN_BOOL(TS_execute(
363-
GETQUERY(query),
362+
/* since signature is lossy, cannot specify CALC_NOT here */
363+
PG_RETURN_BOOL(TS_execute(GETQUERY(query),
364364
(void *) GETSIGN(key),
365-
TS_EXEC_PHRASE_AS_AND,
366-
checkcondition_bit
367-
));
365+
TS_EXEC_PHRASE_NO_POS,
366+
checkcondition_bit));
368367
}
369368
else
370369
{ /* only leaf pages */
371370
CHKVAL chkval;
372371

373372
chkval.arrb = GETARR(key);
374373
chkval.arre = chkval.arrb + ARRNELEM(key);
375-
PG_RETURN_BOOL(TS_execute(
376-
GETQUERY(query),
374+
PG_RETURN_BOOL(TS_execute(GETQUERY(query),
377375
(void *) &chkval,
378-
TS_EXEC_PHRASE_AS_AND | TS_EXEC_CALC_NOT,
379-
checkcondition_arr
380-
));
376+
TS_EXEC_PHRASE_NO_POS | TS_EXEC_CALC_NOT,
377+
checkcondition_arr));
381378
}
382379
}
383380

src/backend/utils/adt/tsquery.c

Lines changed: 13 additions & 12 deletions
Original file line numberDiff line numberDiff line change
@@ -557,13 +557,11 @@ findoprnd_recurse(QueryItem *ptr, uint32 *pos, int nnodes, bool *needcleanup)
557557
curitem->oper == OP_OR ||
558558
curitem->oper == OP_PHRASE);
559559

560-
if (curitem->oper == OP_PHRASE)
561-
*needcleanup = true; /* push OP_PHRASE down later */
562-
563560
(*pos)++;
564561

565562
/* process RIGHT argument */
566563
findoprnd_recurse(ptr, pos, nnodes, needcleanup);
564+
567565
curitem->left = *pos - tmp; /* set LEFT arg's offset */
568566

569567
/* process LEFT argument */
@@ -574,8 +572,9 @@ findoprnd_recurse(QueryItem *ptr, uint32 *pos, int nnodes, bool *needcleanup)
574572

575573

576574
/*
577-
* Fills in the left-fields previously left unfilled. The input
578-
* QueryItems must be in polish (prefix) notation.
575+
* Fill in the left-fields previously left unfilled.
576+
* The input QueryItems must be in polish (prefix) notation.
577+
* Also, set *needcleanup to true if there are any QI_VALSTOP nodes.
579578
*/
580579
static void
581580
findoprnd(QueryItem *ptr, int size, bool *needcleanup)
@@ -687,15 +686,17 @@ parse_tsquery(char *buf,
687686
memcpy((void *) GETOPERAND(query), (void *) state.op, state.sumlen);
688687
pfree(state.op);
689688

690-
/* Set left operand pointers for every operator. */
689+
/*
690+
* Set left operand pointers for every operator. While we're at it,
691+
* detect whether there are any QI_VALSTOP nodes.
692+
*/
691693
findoprnd(ptr, query->size, &needcleanup);
692694

693695
/*
694-
* QI_VALSTOP nodes should be cleaned and OP_PHRASE should be pushed
695-
* down
696+
* If there are QI_VALSTOP nodes, delete them and simplify the tree.
696697
*/
697698
if (needcleanup)
698-
return cleanup_fakeval_and_phrase(query);
699+
query = cleanup_tsquery_stopwords(query);
699700

700701
return query;
701702
}
@@ -1088,6 +1089,9 @@ tsqueryrecv(PG_FUNCTION_ARGS)
10881089
*/
10891090
findoprnd(item, size, &needcleanup);
10901091

1092+
/* Can't have found any QI_VALSTOP nodes */
1093+
Assert(!needcleanup);
1094+
10911095
/* Copy operands to output struct */
10921096
for (i = 0; i < size; i++)
10931097
{
@@ -1105,9 +1109,6 @@ tsqueryrecv(PG_FUNCTION_ARGS)
11051109

11061110
SET_VARSIZE(query, len + datalen);
11071111

1108-
if (needcleanup)
1109-
PG_RETURN_TSQUERY(cleanup_fakeval_and_phrase(query));
1110-
11111112
PG_RETURN_TSQUERY(query);
11121113
}
11131114

0 commit comments

Comments
 (0)