Sphinx High Performance Full Text Search For MySQL Presentation
Presented by MySQL AB® & O’Reilly Media, Inc.
High-performance full-text search for MySQL
[Diagram: Source CHUNK10, Index CHUNK10, Host GREP10]
Searching 101 – the client side
Create a client object
Set up the options
Fire the query
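<?php
include ( "sphinxapi.php" );
$cl = new SphinxClient ();                            // 1. create a client object
$cl->SetMatchMode ( SPH_MATCH_PHRASE );               // 2. set up the options
$cl->SetSortMode ( SPH_SORT_EXTENDED, "price desc" );
$res = $cl->Query ( "ipod nano", "products" );        // 3. fire the query
if ( $res===false )                                   // Query() returns false on failure
    die ( "query failed: " . $cl->GetLastError () );
var_dump ( $res );
?>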
Searching 102 – match contents
Matches will always have document ID, weight
Matches can also have numeric attributes
No string attributes yet (pull ‘em from MySQL)
print_r ( $result["matches"][0] ); // prints:
Array
(
    [id] => 123
    [weight] => 101421
    [attrs] => Array
        (
            [group_id] => 12345678901
            [added] => 1207261463
        )
)
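Since string columns have to come from MySQL, one way to fetch them is a single follow-up query keyed on the matched document IDs. The sketch below is only an illustration: the helper name `buildFetchQuery`, the table `products`, and the columns `title` and `vendor` are hypothetical, and it assumes matches arrive as a plain list with an `id` field, as in the print_r output above. It relies on MySQL's `FIELD()` function to keep the order Sphinx returned.

```php
<?php
// Hypothetical helper: build one MySQL query that fetches string columns
// for the matched documents while preserving Sphinx's result order.
// $table and $columns are assumed to be trusted identifiers, not user input.
function buildFetchQuery ( array $matches, $table, array $columns )
{
    $ids = array ();
    foreach ( $matches as $match )
        $ids[] = (int) $match["id"];   // cast to int so the list is safe to inline
    if ( !$ids )
        return null;                   // nothing matched, nothing to fetch
    $list = implode ( ",", $ids );
    return "SELECT id, " . implode ( ", ", $columns )
        . " FROM $table WHERE id IN ($list)"
        . " ORDER BY FIELD(id, $list)";
}

// Example input shaped like the slide's match array (values are made up).
$matches = array (
    array ( "id" => 123, "weight" => 101421 ),
    array ( "id" => 45,  "weight" => 90010 ),
);
echo buildFetchQuery ( $matches, "products", array ( "title", "vendor" ) ), "\n";
```

Only the rows Sphinx has already selected and ordered are read, so MySQL does a handful of primary-key lookups instead of sorting the full match set.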
Searching 103 – why attributes
Short answer – efficiency
Long answer – efficient filtering, sorting, and
grouping for big result sets (over 1,000 matches)
Real-world example:
Using Sphinx for searching only, then sorting just
1,000 matches in MySQL – up to 2-3 seconds
Using Sphinx for both searching and sorting –
under 0.1 second
Random row IO in MySQL, no row IO in Sphinx
Now imagine there are 1,000,000 matches…
Moving parts
SQL query parts that can be moved to Sphinx
Filtering – WHERE vs. SetFilter() or fake keyword
Sorting – ORDER BY vs. SetSortMode()
Grouping – GROUP BY vs. SetGroupBy()
Up to 100x (!) improvement vs. “naïve” approach
Rule of thumb – move everything you can from
MySQL to Sphinx
Rule of thumb 2.0 – apply sacred knowledge of
Sphinx pipeline (and then move everything)
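The mapping in the list above could look roughly like this in client code. This is a sketch under assumptions: it expects sphinxapi.php to be included already (as in the Searching 101 example), which provides SphinxClient and the SPH_* constants, and the attribute names `vendor_id` and `price` as well as both function names are hypothetical.

```php
<?php
// Assumes sphinxapi.php has been included elsewhere; it defines
// SphinxClient and the SPH_* constants used below.
// Attribute names vendor_id and price are hypothetical.

// SQL being replaced: ... WHERE vendor_id IN (3, 7) ORDER BY price ASC
function applyFilterAndSort ( $cl )
{
    $cl->SetFilter ( "vendor_id", array ( 3, 7 ) );        // WHERE vendor_id IN (3, 7)
    $cl->SetSortMode ( SPH_SORT_EXTENDED, "price asc" );   // ORDER BY price ASC
    return $cl;
}

// SQL being replaced: ... GROUP BY vendor_id ORDER BY COUNT(*) DESC
function applyGrouping ( $cl )
{
    // third argument sorts the groups; @count is the per-group match count
    $cl->SetGroupBy ( "vendor_id", SPH_GROUPBY_ATTR, "@count desc" );
    return $cl;
}

// Usage, given a configured client:
//   $res = applyFilterAndSort ( new SphinxClient () )->Query ( "ipod nano", "products" );
```

Keeping each SQL clause next to its API call makes it easier to check that nothing is still left for MySQL to do after the search.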
Searching pipeline in 30 seconds
Search, WHERE, rank, ORDER/GROUP
“Cheap” boolean searching first
Then filters (WHERE clause)
Then “expensive” relevance ranking
Then sorting (ORDER BY clause) and/or grouping
(GROUP BY clause)
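To make the stage order concrete, here is a toy in-memory model of the four steps. It is not how Sphinx is implemented: the documents, the `price` attribute, the function name `searchPipeline`, and the "count keyword occurrences" ranking are all invented for illustration.

```php
<?php
// Toy model of the pipeline order: match, filter, rank, sort.
$docs = array (
    1 => array ( "text" => "ipod nano 8gb",       "price" => 149 ),
    2 => array ( "text" => "ipod nano nano case", "price" => 19 ),
    3 => array ( "text" => "ipod touch",          "price" => 229 ),
    4 => array ( "text" => "nano ipod dock",      "price" => 500 ),
);

function searchPipeline ( array $docs, array $keywords, $maxPrice )
{
    // 1. cheap boolean match: keep documents that contain every keyword
    $candidates = array ();
    foreach ( $docs as $id => $doc )
    {
        $words = explode ( " ", $doc["text"] );
        if ( !array_diff ( $keywords, $words ) )
            $candidates[$id] = $words;
    }
    // 2. filter on attribute values (the WHERE clause)
    foreach ( $candidates as $id => $words )
        if ( $docs[$id]["price"] > $maxPrice )
            unset ( $candidates[$id] );
    // 3. "expensive" ranking, done only for the survivors:
    //    here simply the number of keyword occurrences
    $matches = array ();
    foreach ( $candidates as $id => $words )
        $matches[] = array ( "id" => $id,
            "weight" => count ( array_intersect ( $words, $keywords ) ) );
    // 4. sort (ORDER BY weight DESC, id ASC)
    usort ( $matches, function ( $a, $b ) {
        if ( $a["weight"] != $b["weight"] )
            return $b["weight"] - $a["weight"];
        return $a["id"] - $b["id"];
    } );
    return $matches;
}

print_r ( searchPipeline ( $docs, array ( "ipod", "nano" ), 200 ) );
```

Document 3 is dropped by the boolean match and document 4 by the price filter, so only documents 1 and 2 are ever ranked; that ordering is the point of running the cheap steps first.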
Searching pipeline details
Query is evaluated as a boolean query
CPU and IO, O(sum(docs_per_keyword))
Candidates are filtered
based on their attribute values
CPU only, O(sum(docs_per_keyword))
Relevance rank (weight) is computed
CPU and IO, O(sum(hits_per_keyword))
Matches are sorted and grouped
CPU only, O(filtered_matches_count)
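The cost figures above can be turned into a rough back-of-the-envelope estimate. The per-keyword numbers below are made up, the function name `stageWork` is hypothetical, and the result is only the quantity inside each O(...), not a time prediction.

```php
<?php
// Hypothetical per-keyword statistics: matching documents and total hits.
$stats = array (
    "ipod" => array ( "docs" => 50000, "hits" => 120000 ),
    "nano" => array ( "docs" => 8000,  "hits" => 15000 ),
);

// Returns the work units per stage, following the slide's cost model.
function stageWork ( array $stats, $filteredMatches )
{
    $docs = 0;
    $hits = 0;
    foreach ( $stats as $kw )
    {
        $docs += $kw["docs"];
        $hits += $kw["hits"];
    }
    return array (
        "boolean match" => $docs,            // CPU and IO, sum(docs_per_keyword)
        "filter"        => $docs,            // CPU only,   sum(docs_per_keyword)
        "rank"          => $hits,            // CPU and IO, sum(hits_per_keyword)
        "sort/group"    => $filteredMatches, // CPU only,   filtered_matches_count
    );
}

print_r ( stageWork ( $stats, 1200 ) );
```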
Filters vs. fake keywords
The key idea – instead of using an attribute,
inject a fake keyword when indexing
sql_query = SELECT id, title, vendor ...
vs.
$client->ResetGroupBy ();
$client->SetSortMode ( SPH_SORT_EXTENDED, "price asc" );
$client->SetLimit ( 0, 10 );
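The fake-keyword idea from the "Filters vs. fake keywords" slide could be sketched on the client side as follows. The `_vendor123` naming scheme and both helper names are hypothetical. The sketch also assumes the same token was added to the indexed text at indexing time (for example by concatenating it in sql_query), and that the match mode treats it as one more required keyword.

```php
<?php
// Hypothetical naming scheme: the token "_vendor123" stands for vendor = 123.
function fakeKeyword ( $attr, $value )
{
    return "_" . $attr . (int) $value;   // cast keeps the token purely numeric
}

// Append the fake keyword to the user's query instead of calling SetFilter().
function withFakeKeyword ( $query, $attr, $value )
{
    return trim ( $query . " " . fakeKeyword ( $attr, $value ) );
}

echo withFakeKeyword ( "ipod nano", "vendor", 123 ), "\n";  // ipod nano _vendor123
```

With this approach the restriction is applied in the cheap boolean-matching stage rather than in the later filtering stage.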