Search Server

Sphinx is an open source full text search server, designed from
the ground up with performance, relevance (a.k.a. search
quality), and integration simplicity in mind.

•   Craigslist serves 200 million queries/day
•   Used by Slashdot, Mozilla, Meetup
•   Scales to billions of documents (distributed)
•   Supports almost any data source (SQL, XML, etc.)
•   Batch and real-time indexes




                                                      By Andrew Kandels
What is a Search Server?

Sphinx is like a database because…

• It has a schema
• It has field types (integer, boolean, strings, dates)
• It responds to queries (SQL, API):

  SELECT * FROM Books WHERE MATCH('a rose by any other name')
Documents

Sphinx indexes data from just about any source.
SELECT
  CONCAT(a.first_name, ' ', a.last_name) AS full_name,
  COUNT(b.book_id) AS num_books,
  MIN(b.publish_date) AS first_published
FROM
  author a
INNER JOIN book b
  ON a.author_id = b.author_id
GROUP BY
  a.author_id, a.first_name, a.last_name


<?xml version="1.0"?>
<author>
  <id>1433</id>
  <name>Mark Twain</name>
  <books>
     <book>A Connecticut Yankee in King Arthur’s Court</book>
  </books>
</author>
How it Works

Sphinx parses plain text queries and answers with rows.

Search

@author_id 15 "Mark Twain" king << arthur

Results

1. document=1433, weight=1692, createdAt=Jan 1 1889
Relevance

Only the strongest will survive, but relevance is in the
eye of the beholder. Some factors include:

• How many times did our keywords match?
• How many times did they repeat in the query?
• How frequently do keywords appear?
• Do keywords in the document appear in the same order as
  the query?
• Did we match exactly, or is it a stemmed match?
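
Two of the factors above (hit counts and keyword order) can be sketched as a toy scorer. This is not Sphinx's ranker: the function name `toy_relevance` and the bonus of 2 are assumptions made up for illustration.

```python
def toy_relevance(query, document):
    """Score = number of keyword hits, plus a bonus when the query words
    appear in the document in the same order as in the query."""
    q_words = query.lower().split()
    d_words = document.lower().split()

    # Factor 1: how many times do the query keywords occur in the document?
    hits = sum(d_words.count(w) for w in set(q_words))

    # Factor 2: do the keywords appear in query order? Walk the document
    # once, advancing through the query each time the next word is seen.
    next_idx = 0
    for w in d_words:
        if next_idx < len(q_words) and w == q_words[next_idx]:
            next_idx += 1
    in_order = bool(q_words) and next_idx == len(q_words)

    return hits + (2 if in_order else 0)  # the bonus of 2 is arbitrary
```

With the document "king arthur sat", the query "king arthur" scores higher than "arthur king" even though both match the same two keywords.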
B-Tree Index

User
Row #   First Name   Last Name   City         State   Notes
1       Allison      Janney      Baltimore    MA      Cregg
2       John         Spencer     Des Moines   IA      McGarry
3       Bradley      Whitford    Newport      VA      Lyman
4       Martin       Sheen       Seattle      WA      Bartlett
5       Janel        Moloney     Hollywood    CA      Moss
6       Richard      Schiff      Lincoln      NE      Ziegler

Index (Last Name (4))
Contents   Row #
Jann       1
Molo       5
Schi       6
Shee       4
Spen       2
Whit       3



A B-tree is a tree data structure that keeps data sorted and allows searches,
sequential access, insertions, and deletions in logarithmic time.
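
The lookup side of such an index can be sketched in a few lines of Python. A sorted list searched with `bisect` stands in for the B-tree here (an assumption for brevity: both give logarithmic lookups, but a real B-tree also keeps inserts and deletes cheap). The row numbers assume the table rows are numbered top to bottom.

```python
import bisect

# Last names from the table above, keyed by row number (1..6).
last_names = {1: "Janney", 2: "Spencer", 3: "Whitford",
              4: "Sheen", 5: "Moloney", 6: "Schiff"}

# Index on the first 4 characters of the last name, kept sorted.
index = sorted((name[:4], row) for row, name in last_names.items())
keys = [key for key, _ in index]

def lookup(last_name):
    """Return row numbers whose 4-character prefix matches, via binary search."""
    prefix = last_name[:4]
    lo = bisect.bisect_left(keys, prefix)
    hi = bisect.bisect_right(keys, prefix)
    return [row for _, row in index[lo:hi]]
```

Because only the prefix is indexed, a hit still has to be checked against the full row; `lookup("Sheen")` and `lookup("Sheedy")` land on the same entry.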
Logical Queries

Logical conditions return a boolean result based on an
expression:

country = 'United States'
AND num_published >= 50
AND (author_id = 5 OR author_id = 8 OR author_id = 10)


Logical queries can be complex and typically evaluate against
the whole value of a column.
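
As a rough illustration, the same condition can be evaluated row by row in Python. The sample rows are made up for this sketch; only the condition itself comes from the slide.

```python
# Hypothetical rows; each comparison looks at a column's whole value.
authors = [
    {"author_id": 5, "country": "United States", "num_published": 62},
    {"author_id": 8, "country": "United States", "num_published": 12},
    {"author_id": 10, "country": "Canada", "num_published": 75},
]

def matches(row):
    """Boolean result of the expression above for a single row."""
    return (row["country"] == "United States"
            and row["num_published"] >= 50
            and row["author_id"] in (5, 8, 10))

selected = [row["author_id"] for row in authors if matches(row)]
```

Only author 5 passes all three conditions. Note that `country` must equal the entire string; there is no notion of matching a word inside it.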
Stemming

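Stemming (a form of morphology processing) is the process of reducing inflected
or derived words to their stem, base or root form.

For example, “fishing”, “fished” and “fishes” all reduce to “fish”. Synonyms are
a separate matter: “dove” and “pigeon” are different words that can mean the
same thing, and stemming alone will not connect them.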
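
To make the idea of reducing a word to its stem concrete, here is a toy suffix stripper. It is only a sketch: the suffix list and the three-character minimum are assumptions, and real stemmers (such as the Porter algorithm) apply far more careful rules.

```python
SUFFIXES = ("ing", "ed", "es", "s")  # checked in this order

def toy_stem(word):
    """Strip one common English suffix, if enough of the word remains."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters so short words survive intact.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word
```

Irregular forms show the limits of the approach: "caught" is returned unchanged and never meets "catch".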
Tokenizing

Sphinx breaks down documents into keywords. This is called tokenization.

Characters that would normally break a word can be given exceptions, so that
keywords like AT&T, C++ or T-Mobile are kept whole.

Short words can be ignored (those below the index's min_word_len setting), but a
placeholder is saved to support proximity and phrase searching.
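
A minimal tokenizer along these lines might look as follows. The exception list, the punctuation set and the 3-character threshold are assumptions chosen to mirror the slide; Sphinx's own tokenizer is configured through sphinx.conf and behaves differently in detail.

```python
import re

EXCEPTIONS = {"at&t", "c++", "t-mobile"}  # kept whole; illustrative list
MIN_WORD_LEN = 3                          # shorter words become placeholders

def tokenize(text):
    """Split text into keywords; None marks the position of a skipped word."""
    tokens = []
    for chunk in text.lower().split():
        chunk = chunk.strip(".,;:!?\"'()")
        if chunk in EXCEPTIONS:
            tokens.append(chunk)
            continue
        # Outside the exceptions, any non-alphanumeric character breaks words.
        for word in re.findall(r"[a-z0-9]+", chunk):
            tokens.append(word if len(word) >= MIN_WORD_LEN else None)
    return tokens
```

Keeping the `None` placeholders preserves word positions, which is what lets a phrase or proximity query count the gap left by an ignored word.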
Full Text Index

                                          Inversion

Document              Index (Full Text)
A man caught a fish   [spacer]
                      man, person, human, being
                      caught, catch, catcher, catching, catches
                      [spacer]
                      fish, fishing, fished, fisher

                                           Metadata
                      man                    2              1
                      caught                 3              1
                      fish                   5              1
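
The inversion step can be sketched as a mapping from each keyword to the documents and word positions where it occurs. This is an illustration only: the function name and the 3-character cutoff are assumptions, and the stemmed and related forms shown above are left out.

```python
from collections import defaultdict

def build_inverted_index(documents):
    """Map each keyword to {doc_id: [positions]}; positions start at 1."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for position, word in enumerate(text.lower().split(), start=1):
            if len(word) < 3:  # short word: skipped, position still counted
                continue
            index[word].setdefault(doc_id, []).append(position)
    return dict(index)

index = build_inverted_index({1: "A man caught a fish"})
```

For the sample sentence this yields positions 2, 3 and 5 for "man", "caught" and "fish" in document 1, which lines up with the metadata rows above if those numbers are read as word position and document.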
Full Text Queries
Searches across multiple columns or within the contents of a column, also known
as keyword searching.

Boolean Search                           fiction (Twain | Dickens)

Phrase Search                            "Mark Twain"

Field-Based Search                       @author_id 15

Proximity Search                         "fear itself"~2

Strict Order                             fear << itself

Field Position Limit                     @author[4] Mark

Quorum Search                            "the world is a wonderful place"/3

Same Sentence/Paragraph                  fear SENTENCE itself
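
Of these, quorum matching is the easiest to sketch: a document matches when at least N of the query words are present. The helper below is a simplified illustration with an assumed name; it ignores stemming, stopwords and short-word handling, all of which a real index would apply first.

```python
def quorum_match(query_words, document, threshold):
    """True if at least `threshold` distinct query words occur in the document."""
    doc_words = set(document.lower().split())
    found = {w.lower() for w in query_words} & doc_words
    return len(found) >= threshold

query = "the world is a wonderful place".split()
```

With a threshold of 3, "what a wonderful world" matches (world, a, wonderful) while "a place" does not.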
Getting Sphinx
Download it from http://www.sphinxsearch.com (RPM, DEB, Tarball)
Important Files and Binaries
A successful Sphinx installation will yield the following:

searchd                                      The search daemon, answers queries
indexer                                      Collects documents and builds the index
search                                       Performs a search (useful for debugging)
sphinx.conf                                  Defines your data and configures your
                                             indexes and daemon
Sphinx.conf
Defaults to /etc/sphinx/sphinx.conf, but can exist anywhere.

It can even be executable:

#!/usr/bin/env php
source mysource
{
   type = mysql
   sql_host = <?php echo DB_HOST; ?>
}
Sphinx.conf Blocks
The contents of sphinx.conf consist of several named blocks:

source                                     Defines your data source and queries
index                                      Defines which sources to index and how
indexer                                    Configures the indexer utility
searchd                                    Configures the search daemon
Source
Define the connection to your database and query in the source block.
source filmssource
{
  type = mysql
  sql_host = localhost
  sql_user = root
  sql_pass =
  sql_db = sakila

  sql_query = \
    SELECT f.film_id, f.title, f.description, \
      f.release_year, f.rating, l.name AS language \
    FROM film f \
    INNER JOIN language l \
      ON l.language_id = f.language_id

  sql_attr_uint = release_year
  sql_attr_string = rating
  sql_attr_string = language
}
Index
Define which sources to include and index parameters:
index films
{
   source = filmssource
   charset_type = utf-8
   path = /home/andrew/sphinx/films
   stopwords = /home/andrew/sphinx/stopwords.txt
   enable_star = 1
   min_word_len = 2
   min_prefix_len = 0
   min_infix_len = 2
}
Indexer (optional)
Configure the indexing process which runs occasionally as a batch:
indexer
{
   mem_limit = 256M
}
Searchd (optional)
Configure the search daemon (searchd) which answers queries:
searchd
{
  listen = localhost:9312
  listen = localhost:9306:mysql41
  log = /home/andrew/sphinx.log
  read_timeout = 8
  max_children = 30
  pid_file = /home/andrew/sphinx.pid
  max_matches = 25
  seamless_rotate = 1
  preopen_indexes = 1
  unlink_old = 1
}
stopwords.txt
To generate stopwords from your data, use the indexer binary:
indexer --config /path/to/sphinx.conf \
     --buildstops /path/to/stopwords.txt 25 films

of
who
must
in
and
the
mad
An


Builds a stopwords.txt file with the 25 most commonly found words.
Use --buildfreqs to include counts.

Stopwords can dramatically reduce index size and build time, but it’s a
good idea to inspect the output before using it!
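
The idea behind building a stopword list from your own data is a plain frequency count. The sketch below shows that idea in Python; it is not how indexer is implemented, and the function name is an assumption.

```python
from collections import Counter

def build_stopwords(documents, top_n):
    """Return the `top_n` most frequent words across all documents."""
    counts = Counter()
    for text in documents:
        counts.update(text.lower().split())
    return [word for word, _ in counts.most_common(top_n)]
```

This also shows why the output deserves a look before use: a word such as "mad" can rank highly in a small or unusual corpus and still be something users search for.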
Build your Index
To generate your index, use the indexer binary:
indexer --config /path/to/sphinx.conf --all --rotate
Sphinx 2.0.4-release (r3135)
Copyright (c) 2001-2012, Andrew Aksyonoff
Copyright (c) 2008-2012, Sphinx Technologies Inc (http://sphinxsearch.com)

using config file 'sphinx.conf'...
indexing index 'films'...
collected 1000 docs, 0.1 MB
sorted 0.3 Mhits, 100.0% done
total 1000 docs, 108077 bytes
total 0.148 sec, 727012 bytes/sec, 6726.80 docs/sec
total 3 reads, 0.003 sec, 675.6 kb/call avg, 1.1 msec/call avg
total 11 writes, 0.004 sec, 331.8 kb/call avg, 0.4 msec/call avg
Start the Server
Start the server by executing the searchd binary:
searchd --config /path/to/sphinx.conf
Sphinx 2.0.4-release (r3135)
Copyright (c) 2001-2012, Andrew Aksyonoff
Copyright (c) 2008-2012, Sphinx Technologies Inc (http://sphinxsearch.com)

using config file 'sphinx.conf'...
listening on 127.0.0.1:9312
listening on 127.0.0.1:9306
precaching index 'films'
precached 1 indexes in 0.001 sec
Run a Search
Test your index by running a search:
search --limit 3 robot
Sphinx 2.0.4-release (r3135)
Copyright (c) 2001-2012, Andrew Aksyonoff
Copyright (c) 2008-2012, Sphinx Technologies Inc (http://sphinxsearch.com)

using config file './sphinx.conf'...
index 'films': query 'robot ': returned 77 matches of 77 total in 0.000 sec

displaying matches:
1. document=138, weight=1612, release_year=2006, rating=R, language=English
2. document=920, weight=1612, release_year=2006, rating=G, language=English
3. document=6, weight=1581, release_year=2006, rating=PG, language=English

words:
1. 'robot': 77 documents, 79 hits
MySQL Interface
You can query Sphinx using the MySQL protocol:
mysql -h127.0.0.1 -P 9306
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 1
Server version: 2.0.4-release (r3135)

Copyright (c) 2000, 2010, Oracle and/or its affiliates. All rights reserved.
This software comes with ABSOLUTELY NO WARRANTY. This is free software,
and you are welcome to modify and redistribute it under the GPL v2 license

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql>
MySQL Interface
Queries are written in SphinxQL, which is much like SQL:
mysql> SELECT *
FROM films
WHERE MATCH('robot')
ORDER BY release_year DESC
LIMIT 5;
+------+--------+--------------+--------+----------+
| id   | weight | release_year | rating | language |
+------+--------+--------------+--------+----------+
|    6 |   1581 |         2006 | PG     | English  |
|   16 |   1581 |         2006 | NC-17  | English  |
|   25 |   1581 |         2006 | G      | English  |
|   42 |   1581 |         2006 | NC-17  | English  |
|   61 |   1581 |         2006 | G      | English  |
+------+--------+--------------+--------+----------+
5 rows in set (0.00 sec)
MySQL Interface
Additional metrics can also be retrieved:
mysql> SHOW META;
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| total         | 77    |
| total_found   | 77    |
| time          | 0.000 |
| keyword[0]    | robot |
| docs[0]       | 77    |
| hits[0]       | 79    |
+---------------+-------+
6 rows in set (0.00 sec)
MySQL Interface
You can even do grouping:
mysql> SELECT rating, COUNT(*) AS num_movies,
          MIN(release_year) AS first_year
     FROM films
     GROUP BY rating
     ORDER BY num_movies DESC;
+------+--------+--------------+--------+------------+--------+
| id   | weight | release_year | rating | first_year | @count |
+------+--------+--------------+--------+------------+--------+
|    7 |      1 |         2006 | PG-13  |       2006 |    223 |
|    3 |      1 |         2006 | NC-17  |       2006 |    210 |
|    8 |      1 |         2006 | R      |       2006 |    195 |
|    1 |      1 |         2006 | PG     |       2006 |    194 |
|    2 |      1 |         2006 | G      |       2006 |    178 |
+------+--------+--------------+--------+------------+--------+
5 rows in set (0.00 sec)
Other Applications
Sphinx does more than just full text search. It has other practical
applications as well:

• Metrics and Reporting
• Data Warehouse
• Materialized Views
• Operational Data Store
• Offloading Queries
Quick and Dirty PHP
Integrate Sphinx by using any MySQL driver (like PDO):
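
The same approach works from any language with a MySQL client. Below is a minimal sketch in Python rather than PHP, written against the generic DB-API interface. It assumes a driver that uses the `%s` parameter style (PyMySQL is assumed here) and the `films` index and port 9306 from the earlier slides; the function name `search_films` is made up for illustration.

```python
def search_films(conn, keywords, limit=5):
    """Run a SphinxQL full-text query over a MySQL-protocol connection.

    `conn` is any DB-API connection opened against searchd's MySQL
    listener, for example pymysql.connect(host="127.0.0.1", port=9306).
    """
    sql = ("SELECT id, release_year, rating FROM films "
           "WHERE MATCH(%s) ORDER BY release_year DESC LIMIT %s")
    cursor = conn.cursor()
    try:
        # The driver escapes the keywords; never build MATCH() by hand.
        cursor.execute(sql, (keywords, int(limit)))
        return cursor.fetchall()
    finally:
        cursor.close()
```

Passing the connection in keeps the function independent of any one driver and makes it easy to exercise with a stub in tests.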
SphinxAPI
Or use a native extension like SphinxClient for PHP:




Download it here: http://pecl.php.net/sphinx
Indexing Strategies
Sphinx supports several types of indexes:

• Disk
• In-memory
• Distributed
• Real-time
Main+delta Batch Indexes
Disk indexes often use the main+delta(s) strategy:
• One or more delta indexes collect new data as often as every minute.
• Larger batch indexes rebuild daily, weekly or even less frequently.


Disk indexes have the following benefits:
• They can be re-indexed online without interruption (--rotate)
• They can be distributed over filesystems and hardware
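
The main+delta split can be pictured with a small sketch. It assumes one common scheme, where the highest document id is recorded at each full rebuild and the delta picks up everything newer; the function names are hypothetical and the dictionaries stand in for real index results.

```python
def split_main_delta(doc_ids, last_main_id):
    """Partition document ids: the main index holds everything up to the
    id recorded at the last full rebuild, the delta holds newer ones."""
    main = [d for d in doc_ids if d <= last_main_id]
    delta = [d for d in doc_ids if d > last_main_id]
    return main, delta

def search_both(main_hits, delta_hits):
    """Combine {doc_id: weight} hits from the main and delta indexes."""
    merged = dict(main_hits)
    merged.update(delta_hits)
    return merged
```

Rebuilding the delta only touches the documents above the cutoff, which is why it can run every minute while the main index is rebuilt far less often.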
The End

There’s a book!   Andrew Kandels
                  Website: http://andrewkandels.com
                  Twitter: @andrewkandels
                  Facebook/G+: No thanks

Sphinx Full Text Search Server

  • 1. Search Server Sphinx is an open source full text search server, designed from the ground up with performance, relevance (a.k.a. search quality), and integration simplicity in mind. • Craigslist serves 200 million queries/day • Used by Slashdot, Mozilla, Meetup • Scales to billions of documents (distributed) • Support almost any data source (SQL, XML, etc.) • Batch and real-time indexes By Andrew Kandels
  • 2. What is a Search Server? Sphinx is like a database because… • It has a schema • It has field types (integer, boolean, strings, dates) • It responds to queries (SQL, API): SELECT * FROM Books WHERE MATCH(“a rose by any other name”)
  • 3. Documents Sphinx indexes data from just about any source. SELECT CONCAT(a.first_name, ' ', a.last_name) AS full_name, COUNT(b.book_id) AS num_books, MIN(b.publish_date) AS first_published FROM author a INNER JOIN book b ON a.author_id = b.author_id <?xml version=“1.0”?> <author> <id>1433</id> <name>Mark Twain</name> <books> <book>A Connecticut Yankee in King Arthur’s Court</book> </books> </author>
  • 4. How it Works Sphinx parses plain text queries and answers with rows. Search @author_id 15 “Mark Twain” king << arthur Results 1. document=1433, weight=1692, createdAt=Jan 1 1889
  • 5. Relevance Only the strongest will survive; but, relevance is in the eye of the beholder. Some factors include: • How many times did our keywords match? • How many times did they repeat in the query? • How frequently do keywords appear? • Do keywords in the document appear in the same order as the query? • Did we match exactly, or is it a stemmed match?
  • 6. B-Tree Index User Index (Last Name (4)) First Name Last Name City State Notes Row # Contents Allison Janney Baltimore MA Cregg 1 Jann John Spencer Des Moines IA McGarry 5 Molo Bradley Whitford Newport VA Lyman 6 Schi Martin Sheen Seattle WA Bartlett 4 Shee Janel Moloney Hollywood CA Moss 2 Spen Richard Schiff Lincoln NE Ziegler 3 Whit A B-tree is a tree data structure that keeps data sorted and allows searches, sequential access, insertions, and deletions in logarithmic time.
  • 7. Logical Queries
    Logical conditions return a boolean result based on an expression:
        country = 'United States' AND num_published >= 50 AND (author_id = 5 OR author_id = 8 OR author_id = 10)
    Logical queries can be complex and typically evaluate against the whole value of a column.
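    The same condition can be written as a plain predicate to show that every comparison uses a column's whole value. The rows below are hypothetical sample data, not taken from the slides.

```python
# Hypothetical author rows for illustration only.
authors = [
    {"author_id": 5, "country": "United States", "num_published": 61},
    {"author_id": 8, "country": "United Kingdom", "num_published": 75},
    {"author_id": 10, "country": "United States", "num_published": 12},
]

def matches(row):
    """Evaluate the slide's logical expression against one row."""
    return (
        row["country"] == "United States"
        and row["num_published"] >= 50
        and row["author_id"] in (5, 8, 10)
    )

selected = [row["author_id"] for row in authors if matches(row)]
```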
  • 8. Stemming
    Stemming (a.k.a. morphology) is the process of reducing inflected or derived words to their stem, base or root form. For example, "fishing" and "fished" both reduce to "fish". The words are different, but they can mean the same thing.
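    A naive suffix stripper shows the idea. This is a deliberately simplified sketch, not the algorithm Sphinx uses; the suffix list and the minimum stem length of 3 are assumptions.

```python
# Longer suffixes are listed first so "es" is tried before "s".
SUFFIXES = ("ing", "ed", "es", "s")

def stem(word):
    """Strip one common English suffix, keeping at least 3 characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word
```

    A production stemmer handles far more cases (irregular forms, doubled consonants, other languages), which is why this should be read as an illustration only.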
  • 9. Tokenizing
    Sphinx breaks down documents into keywords. This is called tokenization. Word breaker characters allow exception cases for keywords like AT&T, C++ or T-Mobile. Short words are ignored (words shorter than the min_word_len setting), but a placeholder is saved to support proximity and phrase searching.
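    The sketch below mimics this behaviour in miniature: exceptions are kept whole, short words are dropped, and positions still advance so phrase and proximity distances stay correct. The exception set, the threshold of 3, and the `tokenize` name are assumptions for illustration, not Sphinx internals.

```python
import re

EXCEPTIONS = {"at&t", "c++", "t-mobile"}  # keywords kept whole
MIN_WORD_LEN = 3                          # assumed threshold for this sketch

def tokenize(text):
    """Return (position, keyword) pairs; skipped words keep their position slot."""
    tokens = []
    for position, raw in enumerate(text.lower().split(), start=1):
        # Strip punctuation unless the raw word is a registered exception.
        word = raw if raw in EXCEPTIONS else re.sub(r"[^a-z0-9]", "", raw)
        if len(word) >= MIN_WORD_LEN:
            tokens.append((position, word))
        # Shorter words are not stored, but `position` still advances.
    return tokens
```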
  • 10. Full Text Index Inversion
    Document: A man caught a fish
    Index (Full Text):
        A       [spacer]
        man     man, person, human, being
        caught  caught, catch, catcher, catching, catches
        a       [spacer]
        fish    fish, fishing, fished, fisher
    Metadata:
        man     2  1
        caught  3  1
        fish    5  1
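    An inverted index maps each keyword to the places it occurs. The following is a minimal sketch assuming whitespace tokenization and a hypothetical `build_index` helper; it stores (document id, position) pairs and skips short words while preserving their positions.

```python
from collections import defaultdict

def build_index(documents, min_word_len=3):
    """Map each keyword to a list of (doc_id, position) occurrences."""
    index = defaultdict(list)
    for doc_id, text in documents.items():
        for position, word in enumerate(text.lower().split(), start=1):
            if len(word) >= min_word_len:
                index[word].append((doc_id, position))
    return dict(index)

index = build_index({1: "A man caught a fish"})
```

    Looking up a keyword is then a single dictionary access instead of a scan over every document.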
  • 11. Full Text Queries
    Searches multiple columns or within contents in columns, also known as Keyword Searching.
        Boolean Search            fiction AND (Twain OR Dickens)
        Phrase Search             "Mark Twain"
        Field-Based Search        @author_id 15
        Proximity Search          "fear itself"~2, fear << itself
        Substring Search          @author[4] Mark
        Quorum Search             "the world is a wonderful place"/3
        Same Sentence/Paragraph   fear SENTENCE itself
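    Two of these query types are easy to sketch by hand. The helpers below are hypothetical and operate on raw strings rather than a real index; they only illustrate what a phrase match and a quorum match mean.

```python
def phrase_match(phrase, text):
    """True if the words of `phrase` appear contiguously and in order."""
    p, t = phrase.lower().split(), text.lower().split()
    return any(t[i:i + len(p)] == p for i in range(len(t) - len(p) + 1))

def quorum_match(words, text, threshold):
    """True if at least `threshold` distinct query words occur in the text."""
    present = set(text.lower().split())
    found = sum(1 for word in set(words.lower().split()) if word in present)
    return found >= threshold
```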
  • 12. Getting Sphinx Download it from http://www.sphinxsearch.com (RPM, DEB, Tarball)
  • 13. Important Files and Binaries
    A successful Sphinx installation will yield the following:
        searchd       The search daemon, answers queries
        indexer       Collects documents and builds the index
        search        Performs a search (useful for debugging)
        sphinx.conf   Defines your data and configures your indexes and daemon
  • 14. Sphinx.conf
    Defaults to /etc/sphinx/sphinx.conf, but can exist anywhere. It can even be executable:
        #!/usr/bin/env php
        source mysource
        {
            type     = mysql
            sql_host = <?php echo DB_HOST; ?>
        }
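    The same trick works with other interpreters, since the script only has to print a valid configuration. Below is a hedged Python sketch of the idea; the `DB_HOST` environment variable and the `render_config` helper are assumptions made for this example.

```python
#!/usr/bin/env python3
import os

def render_config(host):
    """Return a sphinx.conf source block with the given database host."""
    return (
        "source mysource\n"
        "{\n"
        "    type     = mysql\n"
        f"    sql_host = {host}\n"
        "}\n"
    )

if __name__ == "__main__":
    # DB_HOST is a hypothetical environment variable; falls back to localhost.
    print(render_config(os.environ.get("DB_HOST", "localhost")), end="")
```

    Generating the file this way keeps credentials and hostnames out of the configuration itself.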
  • 15. Sphinx.conf Blocks
    The contents of sphinx.conf consist of several named blocks:
        source    Defines your data source and queries
        index     Defines which sources to index for searching
        indexer   Configures the indexer utility
        searchd   Configures the search daemon
  • 16. Source
    Define the connection to your database and query in the source block.
        source filmssource
        {
            type      = mysql
            sql_host  = localhost
            sql_user  = root
            sql_pass  =
            sql_db    = sakila
            sql_query = \
                SELECT f.film_id, f.title, f.description, f.release_year, \
                    f.rating, l.name as language \
                FROM film f \
                INNER JOIN language l ON l.language_id = f.language_id
            sql_attr_uint   = release_year
            sql_attr_string = rating
            sql_attr_string = language
        }
  • 17. Index
    Define which sources to include and index parameters:
        index films
        {
            source         = filmssource
            charset_type   = utf-8
            path           = /home/andrew/sphinx/films
            stopwords      = /home/andrew/sphinx/stopwords.txt
            enable_star    = 1
            min_word_len   = 2
            min_prefix_len = 0
            min_infix_len  = 2
        }
  • 18. Indexer (optional)
    Configure the indexing process, which runs occasionally as a batch:
        indexer
        {
            mem_limit = 256M
        }
  • 19. Searchd (optional)
    Configure the search daemon (searchd), which answers queries:
        searchd
        {
            listen          = localhost:9312
            listen          = localhost:9306:mysql41
            log             = /home/andrew/sphinx.log
            read_timeout    = 8
            max_children    = 30
            pid_file        = /home/andrew/sphinx.pid
            max_matches     = 25
            seamless_rotate = 1
            preopen_indexes = 1
            unlink_old      = 1
        }
  • 20. stopwords.txt
    To generate stopwords from your data, use the indexer binary:
        indexer --config /path/to/sphinx.conf --buildstops /path/to/stopwords.txt 25
        of who must in and the mad An
    Builds a stopwords.txt file with the 25 most commonly found words. Use --buildfreqs to include counts.
    Stopwords can dramatically reduce the index size and time-to-build, but it's a good idea to inspect the output before using it!
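    Conceptually, building a stopword list is a frequency count over the corpus. The sketch below shows that idea with the standard library; it is not what the indexer does internally, and `build_stops` is a hypothetical name.

```python
from collections import Counter

def build_stops(documents, limit):
    """Return the `limit` most frequent words across all documents."""
    counts = Counter(
        word for text in documents for word in text.lower().split()
    )
    return [word for word, _count in counts.most_common(limit)]
```

    Inspecting the result matters because a frequent word is not always a meaningless one; a film catalogue might rank a genre name among its most common terms.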
  • 21. Build your Index
    To generate your index, use the indexer binary:
        indexer --config /path/to/sphinx.conf --all --rotate
        Sphinx 2.0.4-release (r3135)
        Copyright (c) 2001-2012, Andrew Aksyonoff
        Copyright (c) 2008-2012, Sphinx Technologies Inc (http://sphinxsearch.com)
        using config file 'sphinx.conf'...
        indexing index 'films'...
        collected 1000 docs, 0.1 MB
        sorted 0.3 Mhits, 100.0% done
        total 1000 docs, 108077 bytes
        total 0.148 sec, 727012 bytes/sec, 6726.80 docs/sec
        total 3 reads, 0.003 sec, 675.6 kb/call avg, 1.1 msec/call avg
        total 11 writes, 0.004 sec, 331.8 kb/call avg, 0.4 msec/call avg
  • 22. Start the Server
    Start the server by executing the searchd binary:
        searchd --config /path/to/sphinx.conf
        Sphinx 2.0.4-release (r3135)
        Copyright (c) 2001-2012, Andrew Aksyonoff
        Copyright (c) 2008-2012, Sphinx Technologies Inc (http://sphinxsearch.com)
        using config file 'sphinx.conf'...
        listening on 127.0.0.1:9312
        listening on 127.0.0.1:9306
        precaching index 'films'
        precached 1 indexes in 0.001 sec
  • 23. Run a Search
    Test your index by running a search:
        search --limit 3 robot
        Sphinx 2.0.4-release (r3135)
        Copyright (c) 2001-2012, Andrew Aksyonoff
        Copyright (c) 2008-2012, Sphinx Technologies Inc (http://sphinxsearch.com)
        using config file './sphinx.conf'...
        index 'films': query 'robot ': returned 77 matches of 77 total in 0.000 sec
        displaying matches:
        1. document=138, weight=1612, release_year=2006, rating=R, language=English
        2. document=920, weight=1612, release_year=2006, rating=G, language=English
        3. document=6, weight=1581, release_year=2006, rating=PG, language=English
        words:
        1. 'robot': 77 documents, 79 hits
  • 24. MySQL Interface
    You can query Sphinx using the MySQL protocol:
        mysql -h127.0.0.1 -P 9306
        Reading table information for completion of table and column names
        You can turn off this feature to get a quicker startup with -A
        Welcome to the MySQL monitor. Commands end with ; or \g.
        Your MySQL connection id is 1
        Server version: 2.0.4-release (r3135)
        Copyright (c) 2000, 2010, Oracle and/or its affiliates. All rights reserved.
        This software comes with ABSOLUTELY NO WARRANTY. This is free software,
        and you are welcome to modify and redistribute it under the GPL v2 license
        Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
        mysql>
  • 25. MySQL Interface
    Queries are written in SphinxQL, which is much like SQL:
        mysql> SELECT * FROM films WHERE MATCH('robot') ORDER BY release_year DESC LIMIT 5;
        +------+--------+--------------+--------+----------+
        | id   | weight | release_year | rating | language |
        +------+--------+--------------+--------+----------+
        |    6 |   1581 |         2006 | PG     | English  |
        |   16 |   1581 |         2006 | NC-17  | English  |
        |   25 |   1581 |         2006 | G      | English  |
        |   42 |   1581 |         2006 | NC-17  | English  |
        |   61 |   1581 |         2006 | G      | English  |
        +------+--------+--------------+--------+----------+
        5 rows in set (0.00 sec)
  • 26. MySQL Interface
    Additional metrics can also be retrieved:
        mysql> SHOW META;
        +---------------+-------+
        | Variable_name | Value |
        +---------------+-------+
        | total         | 77    |
        | total_found   | 77    |
        | time          | 0.000 |
        | keyword[0]    | robot |
        | docs[0]       | 77    |
        | hits[0]       | 79    |
        +---------------+-------+
        6 rows in set (0.00 sec)
  • 27. MySQL Interface
    You can even do grouping:
        mysql> SELECT rating, COUNT(*) AS num_movies, MIN(release_year) AS first_year FROM films GROUP BY rating ORDER BY num_movies DESC;
        +------+--------+--------------+--------+------------+--------+
        | id   | weight | release_year | rating | first_year | @count |
        +------+--------+--------------+--------+------------+--------+
        |    7 |      1 |         2006 | PG-13  |       2006 |    223 |
        |    3 |      1 |         2006 | NC-17  |       2006 |    210 |
        |    8 |      1 |         2006 | R      |       2006 |    195 |
        |    1 |      1 |         2006 | PG     |       2006 |    194 |
        |    2 |      1 |         2006 | G      |       2006 |    178 |
        +------+--------+--------------+--------+------------+--------+
        5 rows in set (0.00 sec)
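    Because searchd speaks the MySQL protocol, any MySQL client library should be able to send the same SphinxQL. Here is a hedged Python sketch: `search_films` is a hypothetical helper that accepts any DB-API connection using the `%s` parameter style, and the commented usage assumes the third-party PyMySQL package and a searchd listening on port 9306 as configured earlier.

```python
def search_films(conn, keywords, limit=5):
    """Run a SphinxQL full-text query over an open DB-API connection."""
    sql = (
        "SELECT * FROM films WHERE MATCH(%s) "
        "ORDER BY release_year DESC LIMIT %s"
    )
    cursor = conn.cursor()
    try:
        # Parameters are passed separately so the driver handles quoting.
        cursor.execute(sql, (keywords, int(limit)))
        return cursor.fetchall()
    finally:
        cursor.close()

# Usage (assumes PyMySQL is installed and searchd is running):
#   import pymysql
#   conn = pymysql.connect(host="127.0.0.1", port=9306)
#   rows = search_films(conn, "robot")
```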
  • 28. Other Applications
    Sphinx does more than just full text search. It has other practical applications as well:
    • Metrics and Reporting
    • Data Warehouse
    • Materialized Views
    • Operational Data Store
    • Offloading Queries
  • 29. Quick and Dirty PHP Integrate Sphinx by using any MySQL driver (like PDO):
  • 30. SphinxAPI Or use a native extension like SphinxClient for PHP: Download it here: http://pecl.php.net/sphinx
  • 31. Indexing Strategies
    Sphinx supports several types of indexes:
    • Disk
    • In-memory
    • Distributed
    • Real-time
  • 32. Main+delta Batch Indexes
    Disk indexes often use the main+delta(s) strategy:
    • One or more delta indexes collect new data as often as every minute.
    • Larger batch indexes rebuild daily, weekly or even less frequently.
    Disk indexes have the following benefits:
    • They can be re-indexed online without interruption (--rotate)
    • They can be distributed over filesystems and hardware
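    The main+delta idea can be modelled in a few lines: the large index covers documents up to some id, the small delta covers everything newer, and a search consults both. This is a toy sketch with in-memory dictionaries; the id threshold and both helper names are assumptions for illustration.

```python
def split_main_delta(documents, max_main_id):
    """Partition documents into the big batch index and the small delta."""
    main = {i: text for i, text in documents.items() if i <= max_main_id}
    delta = {i: text for i, text in documents.items() if i > max_main_id}
    return main, delta

def search(keyword, *indexes):
    """Return sorted ids of documents containing `keyword` in any index."""
    return sorted(
        doc_id
        for index in indexes
        for doc_id, text in index.items()
        if keyword in text.lower().split()
    )
```

    Only the delta needs frequent rebuilding, which stays cheap because it holds just the documents added since the last main build.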
  • 33. The End There’s a book! Andrew Kandels Website: http://andrewkandels.com Twitter: @andrewkandels Facebook/G+: No thanks