Re: [GENERAL] Full Text Search combined with Fuzzy
От | Artur Zakirov |
---|---|
Тема | Re: [GENERAL] Full Text Search combined with Fuzzy |
Дата | |
Msg-id | 93e265be-81d7-a65f-4ad4-5eb816fd75a9@postgrespro.ru обсуждение исходный текст |
Ответ на | Re: [GENERAL] Full Text Search combined with Fuzzy (Nicolas Paris <niparisco@gmail.com>) |
Ответы |
Re: [GENERAL] Full Text Search combined with Fuzzy
|
Список | pgsql-general |
On 03.03.2017 15:49, Nicolas Paris wrote: > > Hi Oleg, > > Thanks. I thought pgtrgm was not able to index my long texts because of > limitation of 8191 bytes per index row for btree. > > Then I found out it is possible to use pgtrgm over a GIN/GIST index. > My final use case is phrase mining in texts. > > I want my application returns texts that contains approximatly the user > entry: > > Eg: user search "Hello Word" > a text containing "blah blah blah hello world blah blah blah" would be > returned. > > Test: > postgres=# CREATE table test_trgm (texts text); > CREATE TABLE > postgres=# CREATE INDEX ON test_trgm USING GIN(texts gin_trgm_ops); > CREATE INDEX > postgres=# SET enable_seqscan = OFF; > SET > postgres=# insert into test_trgm VALUES ('blah blah blah hello world blah blah blah'); > INSERT 0 1 > postgres=# insert into test_trgm VALUES ('blah blah blah hello word blah blah blah'); > INSERT 0 1 > postgres=# SELECT texts, similarity(texts, 'hello word') FROM test_trgm WHERE texts % 'hello word'; > texts | similarity > -------------------------------------------+------------ > blah blah blah hello world blah blah blah | 0.473684 > blah blah blah hello word blah blah blah | 0.6875 > (2 rows) > > postgres=# EXPLAIN SELECT texts, similarity(texts, 'hello word') FROM test_trgm WHERE texts % 'hello word'; > QUERY PLAN > ----------------------------------------------------------------------------------- > Bitmap Heap Scan on test_trgm (cost=52.01..56.03 rows=1 width=32) > Recheck Cond: (texts % 'hello word'::text) > -> Bitmap Index Scan on test_trgm_texts_idx (cost=0.00..52.01 rows=1 width=0) > Index Cond: (texts % 'hello word'::text) > (4 rows) > > Conclusion: If I d'say 0.4 is my threshold, would this methodology meet > my requirements ? > > Thanks for the help ! > Hello, If you use PostgreSQL 9.6, then word_similarity() can help you [1]. For example: postgres=# SELECT texts, word_similarity('hello word', texts) FROM test_trgm WHERE 'hello word' <% texts; texts | word_similarity -------------------------------------------+----------------- blah blah blah hello world blah blah blah | 0.818182 blah blah blah hello word blah blah blah | 1 (2 rows) 1. https://www.postgresql.org/docs/9.6/static/pgtrgm.html -- Artur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company
В списке pgsql-general по дате отправления: