These are the slides from my talk at the 2012 Sphinx Search Day in Santa Clara, California. They provide a high-level picture of where Sphinx is used at craigslist, along with a bit of history, issues, and future work.
3. CL Sphinx Infrastructure
• Live Sphinx
• ~30 million postings
• end users searching for stuff on craigslist
• Team Sphinx
• ~100 million postings
• additional indexes of postings for internal use (including non-live postings)
4. CL Sphinx Infrastructure
• Archive Sphinx
• older postings (~3 billion)
• constantly growing in size
• Real-Time Sphinx
• last ~2 days worth of postings
• Forums Sphinx
• ~150 million forum postings
6. Back in 2008
• MySQL FULL TEXT (MyISAM)
• 25 Servers
• Melted Down Frequently
• Desperately Needed a Solution
• This was my first project at craigslist...
• Looked at Solr, Sphinx, Xapian
• Sphinx felt like the right fit
7. Making Sphinx Work
• Benchmarking showed promising results
• Query performance was great
• ~800qps/instance
• back then we only needed 1,200/sec
• Indexing performance too
• Can index documents far faster than I can make the XML for input (from Perl)
• Can’t index and serve at the same time, though...
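A quick back-of-the-envelope check on the benchmark numbers above. This sketch assumes throughput adds up linearly across instances, which the slides do not claim; the function name is made up for illustration.

```python
import math

QPS_PER_INSTANCE = 800   # approximate benchmark figure from the slide
REQUIRED_QPS = 1200      # what the site needed back then, from the slide


def instances_needed(required_qps, qps_per_instance):
    """Smallest instance count whose combined throughput covers the load.

    Assumes linear scaling and no headroom for failures or indexing.
    """
    return math.ceil(required_qps / qps_per_instance)


print(instances_needed(REQUIRED_QPS, QPS_PER_INSTANCE))  # 2
```

In practice more instances than this minimum are needed, since an instance can't index and serve at the same time.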
8. “Live” Sphinx
• One index per city (~700 indexes)
• Main + Delta
• xmlpipe2 input
• Data all fits on a single machine
• 32bit ids
• High churn rate
• Settled on Master/Slave model w/rsync replication
• Deployed in January, 2009
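For readers who have not seen xmlpipe2, here is a minimal sketch of producing that input format. The original pipeline was written in Perl; this Python version is only an illustration, and the field names (title, body), the attribute (posted), and the shape of the posting dicts are assumptions, not the real craigslist schema.

```python
from xml.sax.saxutils import escape


def xmlpipe2_docset(postings):
    """Render postings as an xmlpipe2 document stream for the Sphinx indexer.

    `postings` is an iterable of dicts with hypothetical keys:
    id (int), title (str), body (str), posted (Unix timestamp).
    """
    out = [
        '<?xml version="1.0" encoding="utf-8"?>',
        "<sphinx:docset>",
        "<sphinx:schema>",
        '<sphinx:field name="title"/>',                   # full-text field
        '<sphinx:field name="body"/>',                    # full-text field
        '<sphinx:attr name="posted" type="timestamp"/>',  # sortable attribute
        "</sphinx:schema>",
    ]
    for p in postings:
        out.append('<sphinx:document id="%d">' % p["id"])
        out.append("<title>%s</title>" % escape(p["title"]))
        out.append("<body>%s</body>" % escape(p["body"]))
        out.append("<posted>%d</posted>" % p["posted"])
        out.append("</sphinx:document>")
    out.append("</sphinx:docset>")
    return "\n".join(out)


if __name__ == "__main__":
    sample = [{"id": 1, "title": "Bike & helmet", "body": "Like new",
               "posted": 1333238400}]
    print(xmlpipe2_docset(sample))
```

The indexer reads this stream from the command named in the source's xmlpipe_command setting, so a script like this would just write to standard output.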
10. Main+Delta Indexes
[Diagram. Labels: logical index, index, transient delta (today), regular merge from transient delta, periodic merge to clean house.]
11. Early Issues
• Monitoring
• Persistent Connections w/prefork
• hacked up my own initially
• Index merge crashes/bugs
• We’re always running svn snapshots
12. Early Success
• Replaced the 25 MySQL servers
• Used 10 sphinx servers (2 masters, 8 slaves)
• Search traffic continued to increase
• Tons of headroom!
• Typical search is under 5ms
• New Features
• “nearby” search
• sort by: recent, price, best match
13. Early Mistakes
• Stopwords
• Not setting query limits
• Sphinx handled this just fine!
• ASCII-only
• Query mangling
• need to understand how users search and what they expect to find
• UpdateAttributes (no kill lists!)
15. Growth
• Wanted Sphinx for “internal” use
• Created internal “team sphinx” with more indexed data
• includes not visible postings
• includes additional fields
• Space became an issue, so had to build some simple sharding into our code
• 2 clusters: even/odd split for indexes
16. Live Sphinx Today
• 300+ million queries/day
• 5,000 queries/sec peak load
• removed stopwords
• threaded workers
• dict=keywords
• wildcard search enabled
• UTF-8 (mostly) and charset_table
• blend_chars
• kill lists (no searchd on masters)
• sharded (3 masters, 18 slaves) on blades
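Why kill lists matter with main+delta: when a posting is edited or deleted, its old copy is still sitting in the main index. A kill list attached to the delta tells searchd to suppress those ids in main. The following is a toy model of that behavior in plain Python, not Sphinx code; the function and argument names are invented for illustration, and it ignores ranking and sort order.

```python
def search_main_plus_delta(main_hits, delta_hits, kill_list):
    """Combine query hits the way a main+delta pair with a kill list behaves.

    main_hits:  doc ids matching the query in the main index
    delta_hits: doc ids matching the query in the delta index
    kill_list:  doc ids the delta declares stale in main
    """
    killed = set(kill_list)
    # Drop stale main copies; the delta's versions (if any) win.
    surviving = [doc_id for doc_id in main_hits if doc_id not in killed]
    return surviving + list(delta_hits)


# Posting 2 was edited (fresh copy in delta), posting 3 was deleted.
print(search_main_plus_delta([1, 2, 3], [2], kill_list=[2, 3]))  # [1, 2]
```

Without the kill list, the same query would return the stale copies of postings 2 and 3 from main.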
19. Archive Sphinx
• The Archive Project!
• 2.5 billion postings
• Growing by ~1.6 million daily
• String attributes
• 4 shards, each is a 1 master, 2 slave cluster
• Bucket based on UserID (not city)
• Low query volume
• Need a way to reindex all docs
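A minimal sketch of bucketing by UserID across the 4 shards. The slides do not say how the bucket is computed; a plain modulo is assumed here, and the function name is hypothetical.

```python
NUM_SHARDS = 4  # from the slide: 4 shards, each a 1 master, 2 slave cluster


def archive_shard(user_id, num_shards=NUM_SHARDS):
    """Map a numeric UserID to a shard number (assumed: simple modulo)."""
    return user_id % num_shards


for uid in (8, 9, 10, 11):
    print(uid, "->", archive_shard(uid))
```

With a scheme like this, every posting from one user lands on the same shard, so a search restricted to a single user only has to touch one cluster.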
20. Real-Time Sphinx
• There’s a delay in indexing data on the master and replicating to the slaves...
• What if we want to offer “real-time search” of your own postings?
21. So I built something...
• Known as rtsd (real-time search daemon)
• Sphinx instance with MySQL Protocol
• Primarily uses in-memory indexes
• Used to bridge the gap between “now” and “archive sphinx”
• Configured as an N day rolling window
• Runs on archive sphinx master hosts
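The N day rolling window can be pictured as follows. This is only a sketch of the idea: how rtsd actually expires old postings is not covered in the slides, and the data shape and names here are assumptions.

```python
import time

SECONDS_PER_DAY = 86400


def expired_ids(postings, window_days, now=None):
    """Return ids of postings that have aged out of the rolling window.

    postings:    dict mapping doc id -> posted time as a Unix timestamp
                 (hypothetical shape, for illustration only)
    window_days: N, the size of the window in days
    """
    if now is None:
        now = time.time()
    cutoff = now - window_days * SECONDS_PER_DAY
    return sorted(doc_id for doc_id, posted in postings.items()
                  if posted < cutoff)


now = 1_000_000
postings = {1: now - 3 * SECONDS_PER_DAY, 2: now - 3600}
print(expired_ids(postings, window_days=2, now=now))  # [1]
```

Anything returned here would be dropped from the real-time index, on the assumption that archive sphinx has picked it up by then.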
22. Sphinx Time Horizons
Classic Team Archive rtsd
0-20min All
20m-1day Visible All All
1-60 days Visible All All
60+ days All
Note: Visible postings are findable on the site.
28. Future Work
• autonomous nodes (no master/slave)
• many-core blades with SSD storage
• better performance metrics
• we drop a lot of data on the floor
• log mining and analysis
• sphinx for “table of contents” (browsing)
• haproxy in front of sphinx
• generic sharding code
• testing framework
29. Sphinx Wishlist
• 32 -> 64 bit migration tool
• capture stats at daemon shut down
• RT optimizations for DELETE (high churn)
• distributed search (agent) config with multiple servers per index (for failover and load):
31. Craigslist is Hiring!
• Developers
• Back-end
• Front-end
• Systems Administrators
• Network Engineers
• Email: z@craigslist.org (plain text resume!)