SlideShare ist ein Scribd-Unternehmen logo
1 von 69
Downloaden Sie, um offline zu lesen
Full Text Search
     Throwdown
  Bill Karwin, Percona Inc.
In a full text
                                   search, the search
                                   engine examines
                                   all of the words in
                                   every stored
                                   document as it
                                   tries to match
                                   search words
                                   supplied by the
                                   user.


http://www.flickr.com/photos/tryingyouth/
StackOverflow Test Data
• Data dump, exported December 2011
• 7.4 million Posts = 8.18 GB




                                      www.percona.com
StackOverflow ER diagram


searchable
   text




                      www.percona.com
The Baseline:
Naive Search Predicates
Some people, when confronted with a problem, think
	

 “I know, I’ll use regular expressions.”
	

 Now they have two problems.
	

 	

   	

   	

   	

   	

   	

   	

   	

   	

   — Jamie Zawinsky




                                                                         www.percona.com
Accuracy issue
• Irrelevant or false matching words ‘one’, ‘money’,
   ‘prone’, etc.:
     SELECT * FROM Posts
      WHERE Body LIKE '%one%'
• Regular expressions in MySQL support escapes
   for word boundaries:
     SELECT * FROM Posts
      WHERE Body RLIKE '[[:<:]]one[[:>:]]'




                                          www.percona.com
Performance issue
• LIKE with wildcards:
  ! SELECT * FROM Posts
                                        49 sec
    WHERE title LIKE '%performance%'
    !   OR body LIKE '%performance%'
    !   OR tags LIKE '%performance%';

• POSIX regular expressions:
  ! SELECT * FROM Posts                7 min 57 sec
    WHERE title RLIKE '[[:<:]]performance[[:>:]]'
    !   OR body RLIKE '[[:<:]]performance[[:>:]]'
    !   OR tags RLIKE '[[:<:]]performance[[:>:]]';




                                           www.percona.com
Why so slow?
CREATE TABLE TelephoneBook (
  ! FullName VARCHAR(50));

CREATE INDEX name_idx ON TelephoneBook
  ! (FullName);

INSERT INTO TelephoneBook VALUES
  ! ('Riddle, Thomas'),
  ! ('Thomas, Dean');



                               www.percona.com
Why so slow?
• Search for all with last name “Thomas”
                                                uses index
    SELECT * FROM telephone_book
     WHERE full_name LIKE 'Thomas%'


• Search for all with first name “Thomas”
    SELECT * FROM telephone_book
     WHERE full_name LIKE '%Thomas'

                                       can’t use index



                                            www.percona.com
Because:




    B-Tree indexes can’t
☞   search for substrings



                      www.percona.com
• FULLTEXT in MyISAM
• FULLTEXT in InnoDB
• Apache Solr
• Sphinx Search
• Trigraphs
FULLTEXT
in MyISAM
FULLTEXT Index with MyISAM
• Special index type for MyISAM
• Integrated with SQL queries
• Indexes always in sync with data
• Balances features vs. speed vs. space




                                          www.percona.com
Insert Data into Index (MyISAM)

mysql> INSERT INTO Posts
 SELECT * FROM PostsSource;


          time: 33 min, 34 sec




                                 www.percona.com
Build Index on Data (MyISAM)

mysql> CREATE FULLTEXT INDEX PostText
 ! ON Posts(title, body, tags);


          time: 31 min, 18 sec




                                 www.percona.com
Querying

SELECT * FROM Posts
 WHERE MATCH(column(s))
 AGAINST('query pattern');
                      must include all columns
                      of your index, in the
                      order you defined




                                www.percona.com
Natural Language Mode (MyISAM)

• Searches concepts with free text queries:
  ! SELECT * FROM Posts
    WHERE MATCH(title, body, tags )
    AGAINST('mysql performance'
    IN NATURAL LANGUAGE MODE)
    LIMIT 100;

                     time with index:
                     200 milliseconds

                                         www.percona.com
Query Profile:
     Natural Language Mode (MyISAM)
+-------------------------+----------+
| Status                  | Duration |
+-------------------------+----------+
| starting                | 0.000068 |
| checking permissions    | 0.000006 |
| Opening tables          | 0.000017 |
| init                    | 0.000032 |
| System lock             | 0.000007 |
| optimizing              | 0.000007 |
| statistics              | 0.000018 |
| preparing               | 0.000006 |
| FULLTEXT initialization | 0.198358 |
| executing               | 0.000012 |
| Sending data            | 0.001921 |
| end                     | 0.000005 |
| query end               | 0.000003 |
| closing tables          | 0.000018 |
| freeing items           | 0.000341 |
| cleaning up             | 0.000012 |
+-------------------------+----------+
                                         www.percona.com
Boolean Mode (MyISAM)

• Searches words using mini-language:
  ! SELECT * FROM Posts
    WHERE MATCH(title, body, tags)
    AGAINST('+mysql +performance'
    IN BOOLEAN MODE)
    LIMIT 100;
             time with index:
             16 milliseconds



                                        www.percona.com
Query Profile:
            Boolean Mode (MyISAM)
+-------------------------+----------+
| Status                  | Duration |
+-------------------------+----------+
| starting                | 0.000031 |
| checking permissions    | 0.000003 |
| Opening tables          | 0.000008 |
| init                    | 0.000017 |
| System lock             | 0.000004 |
| optimizing              | 0.000004 |
| statistics              | 0.000008 |
| preparing               | 0.000003 |
| FULLTEXT initialization | 0.000008 |
| executing               | 0.000001 |
| Sending data            | 0.015703 |
| end                     | 0.000004 |
| query end               | 0.000002 |
| closing tables          | 0.000007 |
| freeing items           | 0.000381 |
| cleaning up             | 0.000007 |
+-------------------------+----------+
                                         www.percona.com
FULLTEXT
in InnoDB
FULLTEXT Index with InnoDB
• Under development in MySQL 5.6
  • I’m testing 5.6.6 m1
• Usage very similar to FULLTEXT in MyISAM
• Integrated with SQL queries
• Indexes always* in sync with data
• Read the blogs for more details:
  •   http://blogs.innodb.com/wp/2011/07/overview-and-getting-started-with-innodb-fts/
  •   http://blogs.innodb.com/wp/2011/07/innodb-full-text-search-tutorial/
  •   http://blogs.innodb.com/wp/2011/07/innodb-fts-performance/
  •   http://blogs.innodb.com/wp/2011/07/difference-between-innodb-fts-and-myisam-fts/


                                                                             www.percona.com
Insert Data into Index (InnoDB)

mysql> INSERT INTO Posts
 SELECT * FROM PostsSource;


          time: 55 min 46 sec




                                www.percona.com
Build Index on Data (InnoDB)
• Still under development; you might see problems:
mysql> CREATE FULLTEXT INDEX PostText
 ! ON Posts(title, body, tags);
ERROR 2013 (HY000): Lost connection to
 MySQL server during query




                                        www.percona.com
Build Index on Data (InnoDB)
• Solution: make sure you define a primary key
   column `FTS_DOC_ID` explicitly:
mysql> ALTER TABLE Posts
 CHANGE COLUMN PostId
 `FTS_DOC_ID` BIGINT UNSIGNED;
mysql> CREATE FULLTEXT INDEX PostText
 ! ON Posts(title, body, tags);


                   time: 25 min 27 sec

                                         www.percona.com
Natural Language Mode (InnoDB)

• Searches concepts with free text queries:
  ! SELECT * FROM Posts
    WHERE MATCH(title, body, tags)
    AGAINST('mysql performance'
    IN NATURAL LANGUAGE MODE)
    LIMIT 100;

                     time with index:
                     740 milliseconds

                                         www.percona.com
Query Profile:
     Natural Language Mode (InnoDB)
+-------------------------+----------+
| Status                  | Duration |
+-------------------------+----------+
| starting                | 0.000074 |
| checking permissions    | 0.000007 |
| Opening tables          | 0.000020 |
| init                    | 0.000034 |
| System lock             | 0.000007 |
| optimizing              | 0.000009 |
| statistics              | 0.000020 |
| preparing               | 0.000008 |
| FULLTEXT initialization | 0.577257 |
| executing               | 0.000013 |
| Sending data            | 0.106279 |
| end                     | 0.000018 |
| query end               | 0.000012 |
| closing tables          | 0.000018 |
| freeing items           | 0.055584 |
| cleaning up             | 0.000039 |
+-------------------------+----------+
                                         www.percona.com
Boolean Mode (InnoDB)

• Searches words using mini-language:
  ! SELECT * FROM Posts
    WHERE MATCH(title, body, tags)
    AGAINST('+mysql +performance'
    IN BOOLEAN MODE)
    LIMIT 100;
             time with index:
             350 milliseconds



                                        www.percona.com
Query Profile:
             Boolean Mode (InnoDB)
+-------------------------+----------+
| Status                  | Duration |
+-------------------------+----------+
| starting                | 0.000064 |
| checking permissions    | 0.000005 |
| Opening tables          | 0.000017 |
| init                    | 0.000047 |
| System lock             | 0.000007 |
| optimizing              | 0.000009 |
| statistics              | 0.000019 |
| preparing               | 0.000008 |
| FULLTEXT initialization | 0.347172 |
| executing               | 0.000014 |
| Sending data            | 0.008089 |
| end                     | 0.000011 |
| query end               | 0.000012 |
| closing tables          | 0.000015 |
| freeing items           | 0.001570 |
| cleaning up             | 0.000023 |
+-------------------------+----------+
                                         www.percona.com
Apache Solr
Apache Solr
• http://lucene.apache.org/solr/
• Formerly known as Lucene, started 2001
• Apache License
• Java implementation
• Web service architecture
• Many sophisticated search features



                                     www.percona.com
DataImportHandler
• conf/solrconfig.xml:
. . .
<requestHandler name="/dataimport"
 class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">data-config.xml</str>
  </lst>
</requestHandler>
. . .




                                                  www.percona.com
DataImportHandler
• conf/data-config.xml:
 <dataConfig>
   <dataSource type="JdbcDataSource"
                driver="com.mysql.jdbc.Driver"
                url="jdbc:mysql://localhost/testpattern?useUnicode=true"
                batchSize="-1"
                user="xxxx"
                password="xxxx"/>
   <document>
     <entity name="id"
              query="SELECT PostId, ParentId, Title, Body, Tags FROM Posts">
     </entity>
   </document>
 </dataConfig>                          extremely important
                                       to avoid buffering the
                                        whole query result!


                                                           www.percona.com
DataImportHandler
•conf/schema.xml:
. . .
<fields>
  <field name="PostId" type="string" indexed="true" stored="true" required="true" />
   <field name="ParentId" type="string" indexed="true" stored="true" required="false" />
   <field name="Title" type="text_general" indexed="false" stored="false"
   required="false" />
   <field name="Body" type="text_general" indexed="false" stored="false" required="false" /
   >
   <field name="Tags" type="text_general" indexed="false" stored="false" required="false" /
   >

   <field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/
   >
<fields>

<uniqueKey>PostId</uniqueKey>
<defaultSearchField>text</defaultSearchField>

<copyField source="Title" dest="text"/>
<copyField source="Body" dest="text"/>
<copyField source="Tags" dest="text"/>
. . .

                                                                  www.percona.com
Insert Data into Index (Solr)

• http://localhost:8983/solr/dataimport?
   command=full-import


              time: 14 min 28 sec




                                           www.percona.com
Searching Solr

• http://localhost:8983/solr/select/?q=mysql+AND
   +performance


              time: 79ms

           Query results are cached (like MySQL Query Cache),
           so they return much faster on subsequent execution




                                                        www.percona.com
Sphinx Search
Sphinx Search
• http://sphinxsearch.com/
• Started in 2001
• GPLv2 license
• C++ implementation
• SphinxSE storage engine for MySQL
• Supports MySQL protocol, SQL-like queries
• Many sophisticated search features


                                     www.percona.com
sphinx.conf
source src1
{
! type = mysql
! sql_host = localhost
! sql_user = xxxx
! sql_pass = xxxx
! sql_db = testpattern
! sql_query = SELECT PostId, ParentId, Title,
! ! Body, Tags FROM Posts
! sql_query_info = SELECT * FROM Posts 
! ! WHERE PostId=$id
}



                                   www.percona.com
sphinx.conf
! index test1
  {
  ! source = src1
  ! path = C:Sphinxdata
  }




                             www.percona.com
Insert Data into Index (Sphinx)
!   C:Sphinx> indexer.exe -c sphinx.conf.in --verbose test1

    Sphinx 2.0.5-release (r3309)

    using config file 'sphinx.conf'...
    indexing index 'test1'...
    collected 7397507 docs, 5731.8 MB         time: 8 min 20 sec
    sorted 920.3 Mhits, 100.0% done
    total 7397507 docs, 5731776959 bytes
    total 500.149 sec, 11460138 bytes/sec, 14790.60 docs/sec
    total 11 reads, 15.898 sec, 314584.8 kb/call avg, 1445.3
      msec/call avg
    total 542 writes, 3.129 sec, 12723.3 kb/call avg, 5.7 msec/
      call avg

    Execution time: 500.196 s



                                                  www.percona.com
Querying index
$ mysql --port 9306

Server version: 2.0.5-release (r3309)

mysql> SELECT * FROM test1 WHERE MATCH('mysql performance');

+---------+--------+
| id      | weight |
+---------+--------+
| 6016856 |   6600 |
| 4207641 |   6595 |
| 2656325 |   6593 |
| 7192928 |   5605 |
| 8118235 |   5598 |
. . .
20 rows in set (0.02 sec)



                                                www.percona.com
Querying index
mysql> SHOW META;

+---------------+-------------+
| Variable_name | Value       |
+---------------+-------------+
| total         | 1000        |
| total_found   | 7672        |
| time          | 0.013       |
| keyword[0]    | mysql       |
| docs[0]       | 162287      |   time: 13ms
| hits[0]       | 363694      |
| keyword[1]    | performance |
| docs[1]       | 147249      |
| hits[1]       | 210895      |
+---------------+-------------+




                                           www.percona.com
Trigraphs
Trigraphs Overview
• Not very fast, but still better than LIKE / RLIKE
• Generic, portable SQL solution
• No dependency on version, storage engine, third-
   party technology




                                        www.percona.com
Three-Letter Sequences
! CREATE TABLE AtoZ (
  !   c! ! CHAR(1),
  !   PRIMARY KEY (c));
! INSERT INTO AtoZ (c)
  VALUES ('a'), ('b'), ('c'), ...
! CREATE TABLE Trigraphs (
  !   Tri!
         !    CHAR(3),
  !   PRIMARY KEY (Tri));
! INSERT INTO Trigraphs (Tri)
  SELECT CONCAT(t1.c, t2.c, t3.c)
  FROM AtoZ t1 JOIN AtoZ t2 JOIN AtoZ t3;




                                            www.percona.com
Insert Data Into Index
my $sth = $dbh1->prepare("SELECT * FROM Posts") or die $dbh1->errstr;
$sth->execute() or die $dbh1->errstr;
$dbh2->begin_work;
my $i = 0;
while (my $row = $sth->fetchrow_hashref ) {
  my $text = lc(join('|', ($row->{title}, $row->{body}, $row->{tags})));
  my %tri;
  map($tri{$_}=1, ( $text =~ m/[[:alpha:]]{3}/g ));
  next unless %tri;
  my $tuple_list = join(",", map("('$_',$row->{postid})", keys %tri));
  my $sql = "INSERT IGNORE INTO PostsTrigraph (tri, PostId) VALUES
 $tuple_list";
  $dbh2->do($sql) or die "SQL = $sql, ".$dbh2->errstr;
  if (++$i % 1000 == 0) {
    print ".";
    $dbh2->commit;
    $dbh2->begin_work;
                                            time: 116 min 50 sec
}
  }
                                            space: 16.2GiB
print ".n";
$dbh2->commit;                              rows: 519 million

                                                          www.percona.com
Indexed Lookups
! SELECT p.*                        time: 46 sec
  FROM Posts p
  JOIN PostsTrigraph t1 ON
  ! t1.PostId = p.PostId AND t1.Tri = 'mys'
  !




                                      www.percona.com
Search Among Fewer Matches
! SELECT p.*                        time: 19 sec
  FROM Posts p
  JOIN PostsTrigraph t1 ON
  ! t1.PostId = p.PostId AND t1.Tri = 'mys'
  JOIN PostsTrigraph t2 ON
  ! t2.PostId = p.PostId AND t2.Tri = 'per'




                                      www.percona.com
Search Among Fewer Matches
! SELECT p.*                        time: 22 sec
  FROM Posts p
  JOIN PostsTrigraph t1 ON
  ! t1.PostId = p.PostId AND t1.Tri = 'mys'
  JOIN PostsTrigraph t2 ON
  ! t2.PostId = p.PostId AND t2.Tri = 'per'
  JOIN PostsTrigraph t3 ON
  ! t3.PostId = p.PostId AND t3.Tri = 'for'




                                      www.percona.com
Search Among Fewer Matches
! SELECT p.*                          time: 13.6 sec
  FROM Posts p
  JOIN PostsTrigraph t1 ON
  ! t1.PostId = p.PostId AND   t1.Tri = 'mys'
  JOIN PostsTrigraph t2 ON
  ! t2.PostId = p.PostId AND   t2.Tri = 'per'
  JOIN PostsTrigraph t3 ON
  ! t3.PostId = p.PostId AND   t3.Tri = 'for'
  JOIN PostsTrigraph t4 ON
  ! t4.PostId = p.PostId AND   t4.Tri = 'man'




                                      www.percona.com
Narrow Down Further
! SELECT p.*                                    time: 13.8 sec
  FROM Posts p
  JOIN PostsTrigraph t1 ON
  ! t1.PostId = p.PostId AND          t1.Tri = 'mys'
  JOIN PostsTrigraph t2 ON
  ! t2.PostId = p.PostId AND          t2.Tri = 'per'
  JOIN PostsTrigraph t3 ON
  ! t3.PostId = p.PostId AND          t3.Tri = 'for'
  JOIN PostsTrigraph t4 ON
  ! t4.PostId = p.PostId AND          t4.Tri = 'man'
 WHERE CONCAT(p.title,p.body,p.tags) LIKE '%mysql%'
 !   AND CONCAT(p.title,p.body,p.tags) LIKE '%performance%';




                                                www.percona.com
And the winner is...




     Jarrett Campbell
     http://www.flickr.com/people/77744839@N00
Time to Insert Data into Index

LIKE expression   n/a

FULLTEXT MyISAM 33 min, 34 sec

FULLTEXT InnoDB   55 min, 46 sec

Apache Solr       14 min, 28 sec

Sphinx Search     8 min, 20 sec

Trigraphs         116 min, 50 sec
                                    www.percona.com
Insert Data into Index (sec)
8000


6000


4000


2000


   0
       LIKE   MyISAM InnoDB   Solr   Sphinx   Trigraph

                                          www.percona.com
Time to Build Index on Data

LIKE expression   n/a

FULLTEXT MyISAM 31 min, 18 sec

FULLTEXT InnoDB   25 min, 27 sec

Apache Solr       n/a

Sphinx Search     n/a

Trigraphs         n/a
                                   www.percona.com
Build Index on Data (sec)
4000


3000


2000


1000
                               n/a     n/a         n/a

   0
        LIKE   MyISAM InnoDB   Solr   Sphinx    Trigraph

                                             www.percona.com
Index Storage

LIKE expression     n/a

FULLTEXT MyISAM 2382 MiB

FULLTEXT InnoDB     ? MiB

Apache Solr         2766 MiB

Sphinx Search       3355 MiB

Trigraphs           16589 MiB
                                  www.percona.com
Index Storage (MiB)
20000


15000


10000


 5000


    0
        LIKE   MyISAM InnoDB   Solr   Sphinx   Trigraph

                                           www.percona.com
Query Speed

LIKE expression    49,000ms - 399,000ms

FULLTEXT MyISAM 16-200ms

FULLTEXT InnoDB    350-740ms

Apache Solr        79ms

Sphinx Search      13ms

Trigraphs          13800ms
                                          www.percona.com
Query Speed (ms)
400000
350000
300000
250000
200000
150000
100000
 50000
     0
         LIKE   MyISAM InnoDB   Solr   Sphinx   Trigraph

                                            www.percona.com
1000


          Query Speed (ms)
 750




 500




 250




   0
       LIKE   MyISAM InnoDB   Solr   Sphinx   Trigraph

                                          www.percona.com
Bottom Line
                    build   insert   storage     query      solution

                                                49k-399k
LIKE expression       0       0         0
                                                   ms
                                                              SQL



FULLTEXT MyISAM     31:18   33:28    2382MiB    16-200ms     MySQL



FULLTEXT InnoDB     25:27   55:46       ?       350-740ms   MySQL 5.6



Apache Solr          n/a    14:28    2766MiB      79ms        Java



Sphinx Search        n/a     8:20    3487MiB      13ms        C++



Trigraphs            n/a    116:50   16.2 GiB   13,800ms      SQL


                                                 www.percona.com
Final Thoughts
• Third-party search engines are complex to keep in
   sync with data, and adding another type of
   server adds more operations work for you.
• Built-in FULLTEXT indexes are therefore useful
    even if they are not absolutely the fastest.
• Different search implementations may return
    different results, so you should evaluate what
    works best for your project.
• Any indexed search solution is orders of
    magnitude better than LIKE!

                                          www.percona.com
New York, October 1-2, 2012
London, December 3-4, 2012
Santa Clara, April 22-25, 2013




                  www.percona.com/live
Expert instructors
        In-person training
      Custom onsite training
       Live virtual training
http://www.percona.com/training

                               www.percona.com
http://www.pragprog.com/titles/bksqla/
                                www.percona.com
Copyright 2012 Bill Karwin
           www.slideshare.net/billkarwin

          Released under a Creative Commons 3.0 License:
           http://creativecommons.org/licenses/by-nc-nd/3.0/
              You are free to share - to copy, distribute and
             transmit this work, under the following conditions:




   Attribution.               Noncommercial.          No Derivative Works.
You must attribute this   You may not use this work       You may not alter,
 work to Bill Karwin.      for commercial purposes.      transform, or build
                                                           upon this work.

                                                               www.percona.com

Weitere ähnliche Inhalte

Andere mochten auch

Hierarchical data models in Relational Databases
Hierarchical data models in Relational DatabasesHierarchical data models in Relational Databases
Hierarchical data models in Relational Databasesnavicorevn
 
Full Text Search with Lucene
Full Text Search with LuceneFull Text Search with Lucene
Full Text Search with LuceneWO Community
 
Rank Your Results with PostgreSQL Full Text Search (from PGConf2015)
Rank Your Results with PostgreSQL Full Text Search (from PGConf2015)Rank Your Results with PostgreSQL Full Text Search (from PGConf2015)
Rank Your Results with PostgreSQL Full Text Search (from PGConf2015)Jamey Hanson
 
Indexing the MySQL Index: Key to performance tuning
Indexing the MySQL Index: Key to performance tuningIndexing the MySQL Index: Key to performance tuning
Indexing the MySQL Index: Key to performance tuningOSSCube
 
Inno db internals innodb file formats and source code structure
Inno db internals innodb file formats and source code structureInno db internals innodb file formats and source code structure
Inno db internals innodb file formats and source code structurezhaolinjnu
 
Intro to Apache Lucene and Solr
Intro to Apache Lucene and SolrIntro to Apache Lucene and Solr
Intro to Apache Lucene and SolrGrant Ingersoll
 
Facettensuche mit Lucene und Solr
Facettensuche mit Lucene und SolrFacettensuche mit Lucene und Solr
Facettensuche mit Lucene und SolrThomas Koch
 
Efficient Top-k Algorithms for Fuzzy Search in String Collections
Efficient Top-k Algorithms for Fuzzy Search in String CollectionsEfficient Top-k Algorithms for Fuzzy Search in String Collections
Efficient Top-k Algorithms for Fuzzy Search in String Collectionsrvernica
 
Enterprise Search Solution: Apache SOLR. What's available and why it's so cool
Enterprise Search Solution: Apache SOLR. What's available and why it's so coolEnterprise Search Solution: Apache SOLR. What's available and why it's so cool
Enterprise Search Solution: Apache SOLR. What's available and why it's so coolEcommerce Solution Provider SysIQ
 

Andere mochten auch (19)

Extensible Data Modeling
Extensible Data ModelingExtensible Data Modeling
Extensible Data Modeling
 
Sql query patterns, optimized
Sql query patterns, optimizedSql query patterns, optimized
Sql query patterns, optimized
 
SQL Outer Joins for Fun and Profit
SQL Outer Joins for Fun and ProfitSQL Outer Joins for Fun and Profit
SQL Outer Joins for Fun and Profit
 
Sql Antipatterns Strike Back
Sql Antipatterns Strike BackSql Antipatterns Strike Back
Sql Antipatterns Strike Back
 
How to Design Indexes, Really
How to Design Indexes, ReallyHow to Design Indexes, Really
How to Design Indexes, Really
 
Sql Injection Myths and Fallacies
Sql Injection Myths and FallaciesSql Injection Myths and Fallacies
Sql Injection Myths and Fallacies
 
Practical Object Oriented Models In Sql
Practical Object Oriented Models In SqlPractical Object Oriented Models In Sql
Practical Object Oriented Models In Sql
 
Models for hierarchical data
Models for hierarchical dataModels for hierarchical data
Models for hierarchical data
 
Hierarchical data models in Relational Databases
Hierarchical data models in Relational DatabasesHierarchical data models in Relational Databases
Hierarchical data models in Relational Databases
 
Full Text Search with Lucene
Full Text Search with LuceneFull Text Search with Lucene
Full Text Search with Lucene
 
Rank Your Results with PostgreSQL Full Text Search (from PGConf2015)
Rank Your Results with PostgreSQL Full Text Search (from PGConf2015)Rank Your Results with PostgreSQL Full Text Search (from PGConf2015)
Rank Your Results with PostgreSQL Full Text Search (from PGConf2015)
 
Indexing the MySQL Index: Key to performance tuning
Indexing the MySQL Index: Key to performance tuningIndexing the MySQL Index: Key to performance tuning
Indexing the MySQL Index: Key to performance tuning
 
Inno db internals innodb file formats and source code structure
Inno db internals innodb file formats and source code structureInno db internals innodb file formats and source code structure
Inno db internals innodb file formats and source code structure
 
Intro to Apache Lucene and Solr
Intro to Apache Lucene and SolrIntro to Apache Lucene and Solr
Intro to Apache Lucene and Solr
 
Facettensuche mit Lucene und Solr
Facettensuche mit Lucene und SolrFacettensuche mit Lucene und Solr
Facettensuche mit Lucene und Solr
 
Apache Solr Workshop
Apache Solr WorkshopApache Solr Workshop
Apache Solr Workshop
 
Efficient Top-k Algorithms for Fuzzy Search in String Collections
Efficient Top-k Algorithms for Fuzzy Search in String CollectionsEfficient Top-k Algorithms for Fuzzy Search in String Collections
Efficient Top-k Algorithms for Fuzzy Search in String Collections
 
Trees and Hierarchies in SQL
Trees and Hierarchies in SQLTrees and Hierarchies in SQL
Trees and Hierarchies in SQL
 
Enterprise Search Solution: Apache SOLR. What's available and why it's so cool
Enterprise Search Solution: Apache SOLR. What's available and why it's so coolEnterprise Search Solution: Apache SOLR. What's available and why it's so cool
Enterprise Search Solution: Apache SOLR. What's available and why it's so cool
 

Kürzlich hochgeladen

Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesZilliz
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 

Kürzlich hochgeladen (20)

Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector Databases
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 

Full Text Search Throwdown

  • 1. Full Text Search Throwdown Bill Karwin, Percona Inc.
  • 2. In a full text search, the search engine examines all of the words in every stored document as it tries to match search words supplied by the user. http://www.flickr.com/photos/tryingyouth/
  • 3. StackOverflow Test Data • Data dump, exported December 2011 • 7.4 million Posts = 8.18 GB www.percona.com
  • 4. StackOverflow ER diagram searchable text www.percona.com
  • 6. Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems. — Jamie Zawinsky www.percona.com
  • 7. Accuracy issue • Irrelevant or false matching words ‘one’, ‘money’, ‘prone’, etc.: SELECT * FROM Posts WHERE Body LIKE '%one%' • Regular expressions in MySQL support escapes for word boundaries: SELECT * FROM Posts WHERE Body RLIKE '[[:<:]]one[[:>:]]' www.percona.com
  • 8. Performance issue • LIKE with wildcards: ! SELECT * FROM Posts 49 sec WHERE title LIKE '%performance%' ! OR body LIKE '%performance%' ! OR tags LIKE '%performance%'; • POSIX regular expressions: ! SELECT * FROM Posts 7 min 57 sec WHERE title RLIKE '[[:<:]]performance[[:>:]]' ! OR body RLIKE '[[:<:]]performance[[:>:]]' ! OR tags RLIKE '[[:<:]]performance[[:>:]]'; www.percona.com
  • 9. Why so slow? CREATE TABLE TelephoneBook ( ! FullName VARCHAR(50)); CREATE INDEX name_idx ON TelephoneBook ! (FullName); INSERT INTO TelephoneBook VALUES ! ('Riddle, Thomas'), ! ('Thomas, Dean'); www.percona.com
  • 10. Why so slow? • Search for all with last name “Thomas” uses index SELECT * FROM telephone_book WHERE full_name LIKE 'Thomas%' • Search for all with first name “Thomas” SELECT * FROM telephone_book WHERE full_name LIKE '%Thomas' can’t use index www.percona.com
  • 11. Because: B-Tree indexes can’t ☞ search for substrings www.percona.com
  • 12. • FULLTEXT in MyISAM • FULLTEXT in InnoDB • Apache Solr • Sphinx Search • Trigraphs
  • 14. FULLTEXT Index with MyISAM • Special index type for MyISAM • Integrated with SQL queries • Indexes always in sync with data • Balances features vs. speed vs. space www.percona.com
  • 15. Insert Data into Index (MyISAM) mysql> INSERT INTO Posts SELECT * FROM PostsSource; time: 33 min, 34 sec www.percona.com
  • 16. Build Index on Data (MyISAM) mysql> CREATE FULLTEXT INDEX PostText ! ON Posts(title, body, tags); time: 31 min, 18 sec www.percona.com
  • 17. Querying SELECT * FROM Posts WHERE MATCH(column(s)) AGAINST('query pattern'); must include all columns of your index, in the order you defined www.percona.com
  • 18. Natural Language Mode (MyISAM) • Searches concepts with free text queries: ! SELECT * FROM Posts WHERE MATCH(title, body, tags ) AGAINST('mysql performance' IN NATURAL LANGUAGE MODE) LIMIT 100; time with index: 200 milliseconds www.percona.com
  • 19. Query Profile: Natural Language Mode (MyISAM) +-------------------------+----------+ | Status | Duration | +-------------------------+----------+ | starting | 0.000068 | | checking permissions | 0.000006 | | Opening tables | 0.000017 | | init | 0.000032 | | System lock | 0.000007 | | optimizing | 0.000007 | | statistics | 0.000018 | | preparing | 0.000006 | | FULLTEXT initialization | 0.198358 | | executing | 0.000012 | | Sending data | 0.001921 | | end | 0.000005 | | query end | 0.000003 | | closing tables | 0.000018 | | freeing items | 0.000341 | | cleaning up | 0.000012 | +-------------------------+----------+ www.percona.com
  • 20. Boolean Mode (MyISAM) • Searches words using mini-language: ! SELECT * FROM Posts WHERE MATCH(title, body, tags) AGAINST('+mysql +performance' IN BOOLEAN MODE) LIMIT 100; time with index: 16 milliseconds www.percona.com
  • 21. Query Profile: Boolean Mode (MyISAM) +-------------------------+----------+ | Status | Duration | +-------------------------+----------+ | starting | 0.000031 | | checking permissions | 0.000003 | | Opening tables | 0.000008 | | init | 0.000017 | | System lock | 0.000004 | | optimizing | 0.000004 | | statistics | 0.000008 | | preparing | 0.000003 | | FULLTEXT initialization | 0.000008 | | executing | 0.000001 | | Sending data | 0.015703 | | end | 0.000004 | | query end | 0.000002 | | closing tables | 0.000007 | | freeing items | 0.000381 | | cleaning up | 0.000007 | +-------------------------+----------+ www.percona.com
  • 23. FULLTEXT Index with InnoDB • Under development in MySQL 5.6 • I’m testing 5.6.6 m1 • Usage very similar to FULLTEXT in MyISAM • Integrated with SQL queries • Indexes always* in sync with data • Read the blogs for more details: • http://blogs.innodb.com/wp/2011/07/overview-and-getting-started-with-innodb-fts/ • http://blogs.innodb.com/wp/2011/07/innodb-full-text-search-tutorial/ • http://blogs.innodb.com/wp/2011/07/innodb-fts-performance/ • http://blogs.innodb.com/wp/2011/07/difference-between-innodb-fts-and-myisam-fts/ www.percona.com
  • 24. Insert Data into Index (InnoDB) mysql> INSERT INTO Posts SELECT * FROM PostsSource; time: 55 min 46 sec www.percona.com
  • 25. Build Index on Data (InnoDB) • Still under development; you might see problems: mysql> CREATE FULLTEXT INDEX PostText ! ON Posts(title, body, tags); ERROR 2013 (HY000): Lost connection to MySQL server during query www.percona.com
  • 26. Build Index on Data (InnoDB) • Solution: make sure you define a primary key column `FTS_DOC_ID` explicitly: mysql> ALTER TABLE Posts CHANGE COLUMN PostId `FTS_DOC_ID` BIGINT UNSIGNED; mysql> CREATE FULLTEXT INDEX PostText ! ON Posts(title, body, tags); time: 25 min 27 sec www.percona.com
  • 27. Natural Language Mode (InnoDB) • Searches concepts with free text queries: ! SELECT * FROM Posts WHERE MATCH(title, body, tags) AGAINST('mysql performance' IN NATURAL LANGUAGE MODE) LIMIT 100; time with index: 740 milliseconds www.percona.com
  • 28. Query Profile: Natural Language Mode (InnoDB) +-------------------------+----------+ | Status | Duration | +-------------------------+----------+ | starting | 0.000074 | | checking permissions | 0.000007 | | Opening tables | 0.000020 | | init | 0.000034 | | System lock | 0.000007 | | optimizing | 0.000009 | | statistics | 0.000020 | | preparing | 0.000008 | | FULLTEXT initialization | 0.577257 | | executing | 0.000013 | | Sending data | 0.106279 | | end | 0.000018 | | query end | 0.000012 | | closing tables | 0.000018 | | freeing items | 0.055584 | | cleaning up | 0.000039 | +-------------------------+----------+ www.percona.com
  • 29. Boolean Mode (InnoDB) • Searches words using mini-language: ! SELECT * FROM Posts WHERE MATCH(title, body, tags) AGAINST('+mysql +performance' IN BOOLEAN MODE) LIMIT 100; time with index: 350 milliseconds www.percona.com
  • 30. Query Profile: Boolean Mode (InnoDB) +-------------------------+----------+ | Status | Duration | +-------------------------+----------+ | starting | 0.000064 | | checking permissions | 0.000005 | | Opening tables | 0.000017 | | init | 0.000047 | | System lock | 0.000007 | | optimizing | 0.000009 | | statistics | 0.000019 | | preparing | 0.000008 | | FULLTEXT initialization | 0.347172 | | executing | 0.000014 | | Sending data | 0.008089 | | end | 0.000011 | | query end | 0.000012 | | closing tables | 0.000015 | | freeing items | 0.001570 | | cleaning up | 0.000023 | +-------------------------+----------+ www.percona.com
  • 32. Apache Solr • http://lucene.apache.org/solr/ • Formerly known as Lucene, started 2001 • Apache License • Java implementation • Web service architecture • Many sophisticated search features www.percona.com
  • 33. DataImportHandler • conf/solrconfig.xml: . . . <requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler"> <lst name="defaults"> <str name="config">data-config.xml</str> </lst> </requestHandler> . . . www.percona.com
  • 34. DataImportHandler • conf/data-config.xml: <dataConfig> <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost/testpattern?useUnicode=true" batchSize="-1" user="xxxx" password="xxxx"/> <document> <entity name="id" query="SELECT PostId, ParentId, Title, Body, Tags FROM Posts"> </entity> </document> </dataConfig> extremely important to avoid buffering the whole query result! www.percona.com
  • 35. DataImportHandler •conf/schema.xml: . . . <fields> <field name="PostId" type="string" indexed="true" stored="true" required="true" /> <field name="ParentId" type="string" indexed="true" stored="true" required="false" /> <field name="Title" type="text_general" indexed="false" stored="false" required="false" /> <field name="Body" type="text_general" indexed="false" stored="false" required="false" / > <field name="Tags" type="text_general" indexed="false" stored="false" required="false" / > <field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/ > <fields> <uniqueKey>PostId</uniqueKey> <defaultSearchField>text</defaultSearchField> <copyField source="Title" dest="text"/> <copyField source="Body" dest="text"/> <copyField source="Tags" dest="text"/> . . . www.percona.com
  • 36. Insert Data into Index (Solr) • http://localhost:8983/solr/dataimport? command=full-import time: 14 min 28 sec www.percona.com
  • 37. Searching Solr • http://localhost:8983/solr/select/?q=mysql+AND +performance time: 79ms Query results are cached (like MySQL Query Cache), so they return much faster on subsequent execution www.percona.com
  • 39. Sphinx Search • http://sphinxsearch.com/ • Started in 2001 • GPLv2 license • C++ implementation • SphinxSE storage engine for MySQL • Supports MySQL protocol, SQL-like queries • Many sophisticated search features www.percona.com
  • 40. sphinx.conf source src1 { ! type = mysql ! sql_host = localhost ! sql_user = xxxx ! sql_pass = xxxx ! sql_db = testpattern ! sql_query = SELECT PostId, ParentId, Title, ! ! Body, Tags FROM Posts ! sql_query_info = SELECT * FROM Posts ! ! WHERE PostId=$id } www.percona.com
  • 41. sphinx.conf ! index test1 { ! source = src1 ! path = C:Sphinxdata } www.percona.com
  • 42. Insert Data into Index (Sphinx) ! C:Sphinx> indexer.exe -c sphinx.conf.in --verbose test1 Sphinx 2.0.5-release (r3309) using config file 'sphinx.conf'... indexing index 'test1'... collected 7397507 docs, 5731.8 MB time: 8 min 20 sec sorted 920.3 Mhits, 100.0% done total 7397507 docs, 5731776959 bytes total 500.149 sec, 11460138 bytes/sec, 14790.60 docs/sec total 11 reads, 15.898 sec, 314584.8 kb/call avg, 1445.3 msec/call avg total 542 writes, 3.129 sec, 12723.3 kb/call avg, 5.7 msec/ call avg Execution time: 500.196 s www.percona.com
  • 43. Querying index $ mysql --port 9306 Server version: 2.0.5-release (r3309) mysql> SELECT * FROM test1 WHERE MATCH('mysql performance'); +---------+--------+ | id | weight | +---------+--------+ | 6016856 | 6600 | | 4207641 | 6595 | | 2656325 | 6593 | | 7192928 | 5605 | | 8118235 | 5598 | . . . 20 rows in set (0.02 sec) www.percona.com
  • 44. Querying index mysql> SHOW META; +---------------+-------------+ | Variable_name | Value | +---------------+-------------+ | total | 1000 | | total_found | 7672 | | time | 0.013 | | keyword[0] | mysql | | docs[0] | 162287 | time: 13ms | hits[0] | 363694 | | keyword[1] | performance | | docs[1] | 147249 | | hits[1] | 210895 | +---------------+-------------+ www.percona.com
  • 46. Trigraphs Overview • Not very fast, but still better than LIKE / RLIKE • Generic, portable SQL solution • No dependency on version, storage engine, third- party technology www.percona.com
  • 47. Three-Letter Sequences ! CREATE TABLE AtoZ ( ! c! ! CHAR(1), ! PRIMARY KEY (c)); ! INSERT INTO AtoZ (c) VALUES ('a'), ('b'), ('c'), ... ! CREATE TABLE Trigraphs ( ! Tri! ! CHAR(3), ! PRIMARY KEY (Tri)); ! INSERT INTO Trigraphs (Tri) SELECT CONCAT(t1.c, t2.c, t3.c) FROM AtoZ t1 JOIN AtoZ t2 JOIN AtoZ t3; www.percona.com
  • 48. Insert Data Into Index my $sth = $dbh1->prepare("SELECT * FROM Posts") or die $dbh1->errstr; $sth->execute() or die $dbh1->errstr; $dbh2->begin_work; my $i = 0; while (my $row = $sth->fetchrow_hashref ) { my $text = lc(join('|', ($row->{title}, $row->{body}, $row->{tags}))); my %tri; map($tri{$_}=1, ( $text =~ m/[[:alpha:]]{3}/g )); next unless %tri; my $tuple_list = join(",", map("('$_',$row->{postid})", keys %tri)); my $sql = "INSERT IGNORE INTO PostsTrigraph (tri, PostId) VALUES $tuple_list"; $dbh2->do($sql) or die "SQL = $sql, ".$dbh2->errstr; if (++$i % 1000 == 0) { print "."; $dbh2->commit; $dbh2->begin_work; time: 116 min 50 sec } } space: 16.2GiB print ".n"; $dbh2->commit; rows: 519 million www.percona.com
  • 49. Indexed Lookups ! SELECT p.* time: 46 sec FROM Posts p JOIN PostsTrigraph t1 ON ! t1.PostId = p.PostId AND t1.Tri = 'mys' ! www.percona.com
  • 50. Search Among Fewer Matches ! SELECT p.* time: 19 sec FROM Posts p JOIN PostsTrigraph t1 ON ! t1.PostId = p.PostId AND t1.Tri = 'mys' JOIN PostsTrigraph t2 ON ! t2.PostId = p.PostId AND t2.Tri = 'per' www.percona.com
  • 51. Search Among Fewer Matches ! SELECT p.* time: 22 sec FROM Posts p JOIN PostsTrigraph t1 ON ! t1.PostId = p.PostId AND t1.Tri = 'mys' JOIN PostsTrigraph t2 ON ! t2.PostId = p.PostId AND t2.Tri = 'per' JOIN PostsTrigraph t3 ON ! t3.PostId = p.PostId AND t3.Tri = 'for' www.percona.com
  • 52. Search Among Fewer Matches ! SELECT p.* time: 13.6 sec FROM Posts p JOIN PostsTrigraph t1 ON ! t1.PostId = p.PostId AND t1.Tri = 'mys' JOIN PostsTrigraph t2 ON ! t2.PostId = p.PostId AND t2.Tri = 'per' JOIN PostsTrigraph t3 ON ! t3.PostId = p.PostId AND t3.Tri = 'for' JOIN PostsTrigraph t4 ON ! t4.PostId = p.PostId AND t4.Tri = 'man' www.percona.com
  • 53. Narrow Down Further ! SELECT p.* time: 13.8 sec FROM Posts p JOIN PostsTrigraph t1 ON ! t1.PostId = p.PostId AND t1.Tri = 'mys' JOIN PostsTrigraph t2 ON ! t2.PostId = p.PostId AND t2.Tri = 'per' JOIN PostsTrigraph t3 ON ! t3.PostId = p.PostId AND t3.Tri = 'for' JOIN PostsTrigraph t4 ON ! t4.PostId = p.PostId AND t4.Tri = 'man' WHERE CONCAT(p.title,p.body,p.tags) LIKE '%mysql%' ! AND CONCAT(p.title,p.body,p.tags) LIKE '%performance%'; www.percona.com
  • 54. And the winner is... Jarrett Campbell http://www.flickr.com/people/77744839@N00
  • 55. Time to Insert Data into Index LIKE expression n/a FULLTEXT MyISAM 33 min, 34 sec FULLTEXT InnoDB 55 min, 46 sec Apache Solr 14 min, 28 sec Sphinx Search 8 min, 20 sec Trigraphs 116 min, 50 sec www.percona.com
  • 56. Insert Data into Index (sec) 8000 6000 4000 2000 0 LIKE MyISAM InnoDB Solr Sphinx Trigraph www.percona.com
  • 57. Time to Build Index on Data LIKE expression n/a FULLTEXT MyISAM 31 min, 18 sec FULLTEXT InnoDB 25 min, 27 sec Apache Solr n/a Sphinx Search n/a Trigraphs n/a www.percona.com
  • 58. Build Index on Data (sec) 4000 3000 2000 1000 n/a n/a n/a 0 LIKE MyISAM InnoDB Solr Sphinx Trigraph www.percona.com
  • 59. Index Storage LIKE expression n/a FULLTEXT MyISAM 2382 MiB FULLTEXT InnoDB ? MiB Apache Solr 2766 MiB Sphinx Search 3355 MiB Trigraphs 16589 MiB www.percona.com
  • 60. Index Storage (MiB) 20000 15000 10000 5000 0 LIKE MyISAM InnoDB Solr Sphinx Trigraph www.percona.com
  • 61. Query Speed LIKE expression 49,000ms - 399,000ms FULLTEXT MyISAM 16-200ms FULLTEXT InnoDB 350-740ms Apache Solr 79ms Sphinx Search 13ms Trigraphs 13800ms www.percona.com
  • 62. Query Speed (ms) 400000 350000 300000 250000 200000 150000 100000 50000 0 LIKE MyISAM InnoDB Solr Sphinx Trigraph www.percona.com
  • 63. 1000 Query Speed (ms) 750 500 250 0 LIKE MyISAM InnoDB Solr Sphinx Trigraph www.percona.com
  • 64. Bottom Line build insert storage query solution 49k-399k LIKE expression 0 0 0 ms SQL FULLTEXT MyISAM 31:18 33:28 2382MiB 16-200ms MySQL FULLTEXT InnoDB 25:27 55:46 ? 350-740ms MySQL 5.6 Apache Solr n/a 14:28 2766MiB 79ms Java Sphinx Search n/a 8:20 3487MiB 13ms C++ Trigraphs n/a 116:50 16.2 GiB 13,800ms SQL www.percona.com
  • 65. Final Thoughts • Third-party search engines are complex to keep in sync with data, and adding another type of server adds more operations work for you. • Built-in FULLTEXT indexes are therefore useful even if they are not absolutely the fastest. • Different search implementations may return different results, so you should evaluate what works best for your project. • Any indexed search solution is orders of magnitude better than LIKE! www.percona.com
  • 66. New York, October 1-2, 2012 London, December 3-4, 2012 Santa Clara, April 22-25, 2013 www.percona.com/live
  • 67. Expert instructors In-person training Custom onsite training Live virtual training http://www.percona.com/training www.percona.com
  • 69. Copyright 2012 Bill Karwin www.slideshare.net/billkarwin Released under a Creative Commons 3.0 License: http://creativecommons.org/licenses/by-nc-nd/3.0/ You are free to share - to copy, distribute and transmit this work, under the following conditions: Attribution. Noncommercial. No Derivative Works. You must attribute this You may not use this work You may not alter, work to Bill Karwin. for commercial purposes. transform, or build upon this work. www.percona.com