SlideShare ist ein Scribd-Unternehmen logo
1 von 31
Sphinx at Craigslist
      Jeremy Zawodny
          craigslist
Brief Overview
CL Sphinx Infrastructure
• Live Sphinx
 • ~30 million postings
 • end users searching for stuff on craigslist
• Team Sphinx
 • ~100 million postings
 • additional indexes of postings for internal use
    (including non-live postings)
CL Sphinx Infrastructure
• Archive Sphinx
 • older postings (~3 billion)
 • constantly growing in size
• Real-Time Sphinx
 • last ~2 days worth of postings
• Forums Sphinx
 • ~150 million forum postings
How We Got Here
Back in 2008
• MySQL FULL TEXT (MyISAM)
• 25 Servers
• Melted Down Frequently
• Desperately Needed a Solution
• This was my first project at craigslist...
• Looked at Solr, Sphinx, Xapian
• Sphinx felt like the right fit
Making Sphinx Work
• Benchmarking showed promising results
 • Query performance was great
   • ~800qps/instance
   • back then we only needed 1,200/sec
 • Indexing performance too
   • Can index documents far faster than I can
      make the XML for input (from Perl)
• Can’t index and serve at the same time, though...
“Live” Sphinx
• One index per city (~700 indexes)
 • Main + Delta
 • xmlpipe2 input
• Data all fits on a single machine
• 32bit ids
• High churn rate
• Settled on Master/Slave model w/rsync replication
• Deployed in January, 2009
Master/Slave Clusters
• Number of slaves varies (typically 3-7)
          master                    master
       slave   slave             slave   slave


          master                    master
       slave   slave             slave   slave
Main+Delta Indexes

                         delta
     Regular Merge
  from transient delta
                         today



     Periodic Merge              Logical
     to clean house               Index




                         index
Early Issues
• Monitoring
• Persistent Connections w/prefork
 • hacked up my own initially
• Index merge crashes/bugs
• We’re aways running svn snapshots
Early Success
• Replaced the 25 MySQL servers
• Used 10 sphinx servers (2 masters, 8 slaves)
• Search traffic continued to increase
• Tons of headroom!
• Typical search is under 5ms
• New Features
 • “nearby” search
 • sort by: recent, price, best match
Early Mistakes
• Stopwords
• Not setting query limits
 • Sphinx handled this just fine!
• ASCII-only
• Query mangling
 • need to understand how users search and what
    they expect to find
• UpdateAttributes (no kill lists!)
What Then?
Growth
• Wanted Sphinx for “internal” use
• Created internal “team sphinx” with more indexed
  data
 • includes not visible postings
 • includes additional fields
• Space became an issue, so had to build some simple
  sharding into our code
 • 2 clusters: even/odd split for indexes
Live Sphinx Today
•   300+ million queries/day
•   5,000 queries/sec peak load
•   removed stopwords
•   threaded workers
•   dict=keywords
•   wildcard search enabled
•   UTF-8 (mostly) and charset_table
•   blend_chars
•   kill lists (no searchd on masters)
•   sharded (3 masters, 18 slaves) on blades
Sharding
Query Volume
Archive Sphinx
• The Archive Project!
• 2.5 billion postings
• Growing by ~1.6 million daily
• String attributes
• 4 shards, each is a 1 master, 2 slave cluster
• Bucket based on UserID (not city)
• Low query volume
• Need a way to reindex all docs
Real-Time Sphinx
• There’s a delay in indexing data on the master and
  replicating to the slaves...
• What if we want to offer “real-time search” of your
  own postings?
So I built something...
• Known as rtsd (real-time search daemon)
• Sphinx instance with MySQL Protocol
• Primarily uses in-memory indexes
• Used to bridge the gap between “now” and
  “archive sphinx”
• Configured as an N day rolling window
• Runs on archive sphinx master hosts
Sphinx Time Horizons
            Classic     Team      Archive       rtsd

0-20min                                         All

20m-1day    Visible      All         All

1-60 days   Visible      All         All

60+ days                             All

    Note:Visible postings are findable on the site.
rtsd overview

PostingInfo table




rtsd_consumer       redis queue




                    rtsd_indexer   PostingCache




                    rtsd_sphinx




    webbie            webbie         webbie
Daily Posting Buckets
• 3 indexes
 • yesterday
 • today
 • tomorrow
• (DayofYear(PostedDate)%3) = $index_num
• Nightly cron to “TRUNCATE RTINDEX” on the
  “tomorrow” index
 • sponsored feature!
rtsd indexes
rtsd virtual indexes
rtsd virtual indexes
Future Work
•   autonomous nodes (no master/slave)
    •   many-core blades with SSD storage
•   better performance metrics
    •   we drop a lot of data on the floor
•   log mining and analysis
•   sphinx for “table of contents” (browsing)
•   haproxy in front of sphinx
•   generic sharding code
•   testing framework
Sphinx Wishlist
• 32 -> 64 bit migration tool
• capture stats at daemon shut down
• RT optimizations for DELETE (high churn)
• distributed search (agent) config with multiple
  servers per index (for failover and load):
Sphinx Wishlist
• 32 -> 64 bit migration tool
• capture stats at daemon shut down
• RT optimizations for DELETE (high churn)
• distributed search (agent) config with multiple
  servers per index (for failover and load):
Craigslist is Hiring!
• Developers
 • Back-end
 • Front-end
• Systems Administrators
• Network Engineers
• Email: z@craiglist.org plain text resume!

Weitere ähnliche Inhalte

Was ist angesagt?

«Scrapy internals» Александр Сибиряков, Scrapinghub
«Scrapy internals» Александр Сибиряков, Scrapinghub«Scrapy internals» Александр Сибиряков, Scrapinghub
«Scrapy internals» Александр Сибиряков, Scrapinghubit-people
 
Zero to 1 Billion+ Records: A True Story of Learning & Scaling GameChanger
Zero to 1 Billion+ Records: A True Story of Learning & Scaling GameChangerZero to 1 Billion+ Records: A True Story of Learning & Scaling GameChanger
Zero to 1 Billion+ Records: A True Story of Learning & Scaling GameChangerMongoDB
 
Optimizing MongoDB: Lessons Learned at Localytics
Optimizing MongoDB: Lessons Learned at LocalyticsOptimizing MongoDB: Lessons Learned at Localytics
Optimizing MongoDB: Lessons Learned at Localyticsandrew311
 
MongoDB World 2016: Poster Sessions eBook
MongoDB World 2016: Poster Sessions eBookMongoDB World 2016: Poster Sessions eBook
MongoDB World 2016: Poster Sessions eBookMongoDB
 
MongoFr : MongoDB as a log Collector
MongoFr : MongoDB as a log CollectorMongoFr : MongoDB as a log Collector
MongoFr : MongoDB as a log CollectorPierre Baillet
 
A New MongoDB Sharding Architecture for Higher Availability and Better Resour...
A New MongoDB Sharding Architecture for Higher Availability and Better Resour...A New MongoDB Sharding Architecture for Higher Availability and Better Resour...
A New MongoDB Sharding Architecture for Higher Availability and Better Resour...leifwalsh
 
MongoDB Basic Concepts
MongoDB Basic ConceptsMongoDB Basic Concepts
MongoDB Basic ConceptsMongoDB
 
Managing 50K+ Redis Databases Over 4 Public Clouds ... with a Tiny Devops Team
Managing 50K+ Redis Databases Over 4 Public Clouds ... with a Tiny Devops TeamManaging 50K+ Redis Databases Over 4 Public Clouds ... with a Tiny Devops Team
Managing 50K+ Redis Databases Over 4 Public Clouds ... with a Tiny Devops TeamRedis Labs
 
企業・業界情報プラットフォームSPEEDAにおけるElasticsearchの活用
企業・業界情報プラットフォームSPEEDAにおけるElasticsearchの活用企業・業界情報プラットフォームSPEEDAにおけるElasticsearchの活用
企業・業界情報プラットフォームSPEEDAにおけるElasticsearchの活用Akira Kitauchi
 
A simple introduction to redis
A simple introduction to redisA simple introduction to redis
A simple introduction to redisZhichao Liang
 
MongoDB Hadoop DC
MongoDB Hadoop DCMongoDB Hadoop DC
MongoDB Hadoop DCMike Dirolf
 
Back to Basics Webinar 6: Production Deployment
Back to Basics Webinar 6: Production DeploymentBack to Basics Webinar 6: Production Deployment
Back to Basics Webinar 6: Production DeploymentMongoDB
 
Sharding
ShardingSharding
ShardingMongoDB
 
Why Your MongoDB Needs Redis
Why Your MongoDB Needs RedisWhy Your MongoDB Needs Redis
Why Your MongoDB Needs RedisItamar Haber
 
Drupal meets PostgreSQL for DrupalCamp MSK 2014
Drupal meets PostgreSQL for DrupalCamp MSK 2014Drupal meets PostgreSQL for DrupalCamp MSK 2014
Drupal meets PostgreSQL for DrupalCamp MSK 2014Kate Marshalkina
 
Webinar Back to Basics 3 - Introduzione ai Replica Set
Webinar Back to Basics 3 - Introduzione ai Replica SetWebinar Back to Basics 3 - Introduzione ai Replica Set
Webinar Back to Basics 3 - Introduzione ai Replica SetMongoDB
 
Redis overview for Software Architecture Forum
Redis overview for Software Architecture ForumRedis overview for Software Architecture Forum
Redis overview for Software Architecture ForumChristopher Spring
 
From MySQL to MongoDB at Wordnik (Tony Tam)
From MySQL to MongoDB at Wordnik (Tony Tam)From MySQL to MongoDB at Wordnik (Tony Tam)
From MySQL to MongoDB at Wordnik (Tony Tam)MongoSF
 
Memory: The New Disk
Memory: The New DiskMemory: The New Disk
Memory: The New DiskTim Lossen
 

Was ist angesagt? (20)

«Scrapy internals» Александр Сибиряков, Scrapinghub
«Scrapy internals» Александр Сибиряков, Scrapinghub«Scrapy internals» Александр Сибиряков, Scrapinghub
«Scrapy internals» Александр Сибиряков, Scrapinghub
 
Zero to 1 Billion+ Records: A True Story of Learning & Scaling GameChanger
Zero to 1 Billion+ Records: A True Story of Learning & Scaling GameChangerZero to 1 Billion+ Records: A True Story of Learning & Scaling GameChanger
Zero to 1 Billion+ Records: A True Story of Learning & Scaling GameChanger
 
Optimizing MongoDB: Lessons Learned at Localytics
Optimizing MongoDB: Lessons Learned at LocalyticsOptimizing MongoDB: Lessons Learned at Localytics
Optimizing MongoDB: Lessons Learned at Localytics
 
MongoDB World 2016: Poster Sessions eBook
MongoDB World 2016: Poster Sessions eBookMongoDB World 2016: Poster Sessions eBook
MongoDB World 2016: Poster Sessions eBook
 
MongoFr : MongoDB as a log Collector
MongoFr : MongoDB as a log CollectorMongoFr : MongoDB as a log Collector
MongoFr : MongoDB as a log Collector
 
A New MongoDB Sharding Architecture for Higher Availability and Better Resour...
A New MongoDB Sharding Architecture for Higher Availability and Better Resour...A New MongoDB Sharding Architecture for Higher Availability and Better Resour...
A New MongoDB Sharding Architecture for Higher Availability and Better Resour...
 
MongoDB Basic Concepts
MongoDB Basic ConceptsMongoDB Basic Concepts
MongoDB Basic Concepts
 
Managing 50K+ Redis Databases Over 4 Public Clouds ... with a Tiny Devops Team
Managing 50K+ Redis Databases Over 4 Public Clouds ... with a Tiny Devops TeamManaging 50K+ Redis Databases Over 4 Public Clouds ... with a Tiny Devops Team
Managing 50K+ Redis Databases Over 4 Public Clouds ... with a Tiny Devops Team
 
企業・業界情報プラットフォームSPEEDAにおけるElasticsearchの活用
企業・業界情報プラットフォームSPEEDAにおけるElasticsearchの活用企業・業界情報プラットフォームSPEEDAにおけるElasticsearchの活用
企業・業界情報プラットフォームSPEEDAにおけるElasticsearchの活用
 
A simple introduction to redis
A simple introduction to redisA simple introduction to redis
A simple introduction to redis
 
MongoDB Hadoop DC
MongoDB Hadoop DCMongoDB Hadoop DC
MongoDB Hadoop DC
 
Back to Basics Webinar 6: Production Deployment
Back to Basics Webinar 6: Production DeploymentBack to Basics Webinar 6: Production Deployment
Back to Basics Webinar 6: Production Deployment
 
Sharding
ShardingSharding
Sharding
 
Why Your MongoDB Needs Redis
Why Your MongoDB Needs RedisWhy Your MongoDB Needs Redis
Why Your MongoDB Needs Redis
 
Mongodb
MongodbMongodb
Mongodb
 
Drupal meets PostgreSQL for DrupalCamp MSK 2014
Drupal meets PostgreSQL for DrupalCamp MSK 2014Drupal meets PostgreSQL for DrupalCamp MSK 2014
Drupal meets PostgreSQL for DrupalCamp MSK 2014
 
Webinar Back to Basics 3 - Introduzione ai Replica Set
Webinar Back to Basics 3 - Introduzione ai Replica SetWebinar Back to Basics 3 - Introduzione ai Replica Set
Webinar Back to Basics 3 - Introduzione ai Replica Set
 
Redis overview for Software Architecture Forum
Redis overview for Software Architecture ForumRedis overview for Software Architecture Forum
Redis overview for Software Architecture Forum
 
From MySQL to MongoDB at Wordnik (Tony Tam)
From MySQL to MongoDB at Wordnik (Tony Tam)From MySQL to MongoDB at Wordnik (Tony Tam)
From MySQL to MongoDB at Wordnik (Tony Tam)
 
Memory: The New Disk
Memory: The New DiskMemory: The New Disk
Memory: The New Disk
 

Andere mochten auch

Fulltext engine for non fulltext searches
Fulltext engine for non fulltext searchesFulltext engine for non fulltext searches
Fulltext engine for non fulltext searchesAdrian Nuta
 
Advanced fulltext search with Sphinx
Advanced fulltext search with SphinxAdvanced fulltext search with Sphinx
Advanced fulltext search with SphinxAdrian Nuta
 
Managing Big Data with MySQL
Managing Big Data with MySQLManaging Big Data with MySQL
Managing Big Data with MySQLmwasaha mwagambo
 
Sphinx - High performance full-text search for MySQL
Sphinx - High performance full-text search for MySQLSphinx - High performance full-text search for MySQL
Sphinx - High performance full-text search for MySQLNguyen Van Vuong
 
MySQL Indexing - Best practices for MySQL 5.6
MySQL Indexing - Best practices for MySQL 5.6MySQL Indexing - Best practices for MySQL 5.6
MySQL Indexing - Best practices for MySQL 5.6MYXPLAIN
 
MySQL Performance Tips & Best Practices
MySQL Performance Tips & Best PracticesMySQL Performance Tips & Best Practices
MySQL Performance Tips & Best PracticesIsaac Mosquera
 
Fast querying indexing for performance (4)
Fast querying   indexing for performance (4)Fast querying   indexing for performance (4)
Fast querying indexing for performance (4)MongoDB
 
MySQL Performance Tuning: Top 10 Tips
MySQL Performance Tuning: Top 10 TipsMySQL Performance Tuning: Top 10 Tips
MySQL Performance Tuning: Top 10 TipsOSSCube
 
10 SQL Tricks that You Didn't Think Were Possible
10 SQL Tricks that You Didn't Think Were Possible10 SQL Tricks that You Didn't Think Were Possible
10 SQL Tricks that You Didn't Think Were PossibleLukas Eder
 
Indexing with MongoDB
Indexing with MongoDBIndexing with MongoDB
Indexing with MongoDBMongoDB
 

Andere mochten auch (12)

Fulltext engine for non fulltext searches
Fulltext engine for non fulltext searchesFulltext engine for non fulltext searches
Fulltext engine for non fulltext searches
 
Advanced fulltext search with Sphinx
Advanced fulltext search with SphinxAdvanced fulltext search with Sphinx
Advanced fulltext search with Sphinx
 
SphinxSearch
SphinxSearchSphinxSearch
SphinxSearch
 
Managing Big Data with MySQL
Managing Big Data with MySQLManaging Big Data with MySQL
Managing Big Data with MySQL
 
Sphinx - High performance full-text search for MySQL
Sphinx - High performance full-text search for MySQLSphinx - High performance full-text search for MySQL
Sphinx - High performance full-text search for MySQL
 
MySQL Indexing - Best practices for MySQL 5.6
MySQL Indexing - Best practices for MySQL 5.6MySQL Indexing - Best practices for MySQL 5.6
MySQL Indexing - Best practices for MySQL 5.6
 
MySQL Performance Tips & Best Practices
MySQL Performance Tips & Best PracticesMySQL Performance Tips & Best Practices
MySQL Performance Tips & Best Practices
 
Fast querying indexing for performance (4)
Fast querying   indexing for performance (4)Fast querying   indexing for performance (4)
Fast querying indexing for performance (4)
 
MySQL Performance Tuning: Top 10 Tips
MySQL Performance Tuning: Top 10 TipsMySQL Performance Tuning: Top 10 Tips
MySQL Performance Tuning: Top 10 Tips
 
How to Design Indexes, Really
How to Design Indexes, ReallyHow to Design Indexes, Really
How to Design Indexes, Really
 
10 SQL Tricks that You Didn't Think Were Possible
10 SQL Tricks that You Didn't Think Were Possible10 SQL Tricks that You Didn't Think Were Possible
10 SQL Tricks that You Didn't Think Were Possible
 
Indexing with MongoDB
Indexing with MongoDBIndexing with MongoDB
Indexing with MongoDB
 

Ähnlich wie Sphinx at Craigslist in 2012

London devops logging
London devops loggingLondon devops logging
London devops loggingTomas Doran
 
My Sql And Search At Craigslist
My Sql And Search At CraigslistMy Sql And Search At Craigslist
My Sql And Search At CraigslistMySQLConference
 
Frontera: open source, large scale web crawling framework
Frontera: open source, large scale web crawling frameworkFrontera: open source, large scale web crawling framework
Frontera: open source, large scale web crawling frameworkScrapinghub
 
Chirp 2010: Scaling Twitter
Chirp 2010: Scaling TwitterChirp 2010: Scaling Twitter
Chirp 2010: Scaling TwitterJohn Adams
 
RedisConf18 - Redis at LINE - 25 Billion Messages Per Day
RedisConf18 - Redis at LINE - 25 Billion Messages Per DayRedisConf18 - Redis at LINE - 25 Billion Messages Per Day
RedisConf18 - Redis at LINE - 25 Billion Messages Per DayRedis Labs
 
Messaging, interoperability and log aggregation - a new framework
Messaging, interoperability and log aggregation - a new frameworkMessaging, interoperability and log aggregation - a new framework
Messaging, interoperability and log aggregation - a new frameworkTomas Doran
 
John adams talk cloudy
John adams   talk cloudyJohn adams   talk cloudy
John adams talk cloudyJohn Adams
 
Devnexus 2018
Devnexus 2018Devnexus 2018
Devnexus 2018Roy Russo
 
Fixing twitter
Fixing twitterFixing twitter
Fixing twitterRoger Xia
 
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...smallerror
 
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...xlight
 
Percona Live London 2014: Serve out any page with an HA Sphinx environment
Percona Live London 2014: Serve out any page with an HA Sphinx environmentPercona Live London 2014: Serve out any page with an HA Sphinx environment
Percona Live London 2014: Serve out any page with an HA Sphinx environmentspil-engineering
 
Sharding Methods for MongoDB
Sharding Methods for MongoDBSharding Methods for MongoDB
Sharding Methods for MongoDBMongoDB
 
Cassandra vs. Redis
Cassandra vs. RedisCassandra vs. Redis
Cassandra vs. RedisTim Lossen
 
Dev nexus 2017
Dev nexus 2017Dev nexus 2017
Dev nexus 2017Roy Russo
 
Is NoSQL The Future of Data Storage?
Is NoSQL The Future of Data Storage?Is NoSQL The Future of Data Storage?
Is NoSQL The Future of Data Storage?Saltmarch Media
 
Speed up your Symfony2 application and build awesome features with Redis
Speed up your Symfony2 application and build awesome features with RedisSpeed up your Symfony2 application and build awesome features with Redis
Speed up your Symfony2 application and build awesome features with RedisRicard Clau
 

Ähnlich wie Sphinx at Craigslist in 2012 (20)

London devops logging
London devops loggingLondon devops logging
London devops logging
 
My Sql And Search At Craigslist
My Sql And Search At CraigslistMy Sql And Search At Craigslist
My Sql And Search At Craigslist
 
Frontera: open source, large scale web crawling framework
Frontera: open source, large scale web crawling frameworkFrontera: open source, large scale web crawling framework
Frontera: open source, large scale web crawling framework
 
Chirp 2010: Scaling Twitter
Chirp 2010: Scaling TwitterChirp 2010: Scaling Twitter
Chirp 2010: Scaling Twitter
 
Zero mq logs
Zero mq logsZero mq logs
Zero mq logs
 
RedisConf18 - Redis at LINE - 25 Billion Messages Per Day
RedisConf18 - Redis at LINE - 25 Billion Messages Per DayRedisConf18 - Redis at LINE - 25 Billion Messages Per Day
RedisConf18 - Redis at LINE - 25 Billion Messages Per Day
 
Messaging, interoperability and log aggregation - a new framework
Messaging, interoperability and log aggregation - a new frameworkMessaging, interoperability and log aggregation - a new framework
Messaging, interoperability and log aggregation - a new framework
 
John adams talk cloudy
John adams   talk cloudyJohn adams   talk cloudy
John adams talk cloudy
 
Devnexus 2018
Devnexus 2018Devnexus 2018
Devnexus 2018
 
L6.sp17.pptx
L6.sp17.pptxL6.sp17.pptx
L6.sp17.pptx
 
Fixing twitter
Fixing twitterFixing twitter
Fixing twitter
 
Fixing_Twitter
Fixing_TwitterFixing_Twitter
Fixing_Twitter
 
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...
 
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...
 
Percona Live London 2014: Serve out any page with an HA Sphinx environment
Percona Live London 2014: Serve out any page with an HA Sphinx environmentPercona Live London 2014: Serve out any page with an HA Sphinx environment
Percona Live London 2014: Serve out any page with an HA Sphinx environment
 
Sharding Methods for MongoDB
Sharding Methods for MongoDBSharding Methods for MongoDB
Sharding Methods for MongoDB
 
Cassandra vs. Redis
Cassandra vs. RedisCassandra vs. Redis
Cassandra vs. Redis
 
Dev nexus 2017
Dev nexus 2017Dev nexus 2017
Dev nexus 2017
 
Is NoSQL The Future of Data Storage?
Is NoSQL The Future of Data Storage?Is NoSQL The Future of Data Storage?
Is NoSQL The Future of Data Storage?
 
Speed up your Symfony2 application and build awesome features with Redis
Speed up your Symfony2 application and build awesome features with RedisSpeed up your Symfony2 application and build awesome features with Redis
Speed up your Symfony2 application and build awesome features with Redis
 

Kürzlich hochgeladen

Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfhans926745
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 

Kürzlich hochgeladen (20)

Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 

Sphinx at Craigslist in 2012

  • 1. Sphinx at Craigslist Jeremy Zawodny craigslist
  • 3. CL Sphinx Infrastructure • Live Sphinx • ~30 million postings • end users searching for stuff on craigslist • Team Sphinx • ~100 million postings • additional indexes of postings for internal use (including non-live postings)
  • 4. CL Sphinx Infrastructure • Archive Sphinx • older postings (~3 billion) • constantly growing in size • Real-Time Sphinx • last ~2 days worth of postings • Forums Sphinx • ~150 million forum postings
  • 5. How We Got Here
  • 6. Back in 2008 • MySQL FULL TEXT (MyISAM) • 25 Servers • Melted Down Frequently • Desperately Needed a Solution • This was my first project at craigslist... • Looked at Solr, Sphinx, Xapian • Sphinx felt like the right fit
  • 7. Making Sphinx Work • Benchmarking showed promising results • Query performance was great • ~800qps/instance • back then we only needed 1,200/sec • Indexing performance too • Can index documents far faster than I can make the XML for input (from Perl) • Can’t index and serve at the same time, though...
  • 8. “Live” Sphinx • One index per city (~700 indexes) • Main + Delta • xmlpipe2 input • Data all fits on a single machine • 32bit ids • High churn rate • Settled on Master/Slave model w/rsync replication • Deployed in January, 2009
  • 9. Master/Slave Clusters • Number of slaves varies (typically 3-7) master master slave slave slave slave master master slave slave slave slave
  • 10. Main+Delta Indexes delta Regular Merge from transient delta today Periodic Merge Logical to clean house Index index
  • 11. Early Issues • Monitoring • Persistent Connections w/prefork • hacked up my own initially • Index merge crashes/bugs • We’re aways running svn snapshots
  • 12. Early Success • Replaced the 25 MySQL servers • Used 10 sphinx servers (2 masters, 8 slaves) • Search traffic continued to increase • Tons of headroom! • Typical search is under 5ms • New Features • “nearby” search • sort by: recent, price, best match
  • 13. Early Mistakes • Stopwords • Not setting query limits • Sphinx handled this just fine! • ASCII-only • Query mangling • need to understand how users search and what they expect to find • UpdateAttributes (no kill lists!)
  • 15. Growth • Wanted Sphinx for “internal” use • Created internal “team sphinx” with more indexed data • includes not visible postings • includes additional fields • Space became an issue, so had to build some simple sharding into our code • 2 clusters: even/odd split for indexes
  • 16. Live Sphinx Today • 300+ million queries/day • 5,000 queries/sec peak load • removed stopwords • threaded workers • dict=keywords • wildcard search enabled • UTF-8 (mostly) and charset_table • blend_chars • kill lists (no searchd on masters) • sharded (3 masters, 18 slaves) on blades
  • 19. Archive Sphinx • The Archive Project! • 2.5 billion postings • Growing by ~1.6 million daily • String attributes • 4 shards, each is a 1 master, 2 slave cluster • Bucket based on UserID (not city) • Low query volume • Need a way to reindex all docs
  • 20. Real-Time Sphinx • There’s a delay in indexing data on the master and replicating to the slaves... • What if we want to offer “real-time search” of your own postings?
  • 21. So I built something... • Known as rtsd (real-time search daemon) • Sphinx instance with MySQL Protocol • Primarily uses in-memory indexes • Used to bridge the gap between “now” and “archive sphinx” • Configured as an N day rolling window • Runs on archive sphinx master hosts
  • 22. Sphinx Time Horizons Classic Team Archive rtsd 0-20min All 20m-1day Visible All All 1-60 days Visible All All 60+ days All Note:Visible postings are findable on the site.
  • 23. rtsd overview PostingInfo table rtsd_consumer redis queue rtsd_indexer PostingCache rtsd_sphinx webbie webbie webbie
  • 24. Daily Posting Buckets • 3 indexes • yesterday • today • tomorrow • (DayofYear(PostedDate)%3) = $index_num • Nightly cron to “TRUNCATE RTINDEX” on the “tomorrow” index • sponsored feature!
  • 28. Future Work • autonomous nodes (no master/slave) • many-core blades with SSD storage • better performance metrics • we drop a lot of data on the floor • log mining and analysis • sphinx for “table of contents” (browsing) • haproxy in front of sphinx • generic sharding code • testing framework
  • 29. Sphinx Wishlist • 32 -> 64 bit migration tool • capture stats at daemon shut down • RT optimizations for DELETE (high churn) • distributed search (agent) config with multiple servers per index (for failover and load):
  • 30. Sphinx Wishlist • 32 -> 64 bit migration tool • capture stats at daemon shut down • RT optimizations for DELETE (high churn) • distributed search (agent) config with multiple servers per index (for failover and load):
  • 31. Craigslist is Hiring! • Developers • Back-end • Front-end • Systems Administrators • Network Engineers • Email: z@craiglist.org plain text resume!

Hinweis der Redaktion

  1. \n
  2. \n
  3. \n
  4. \n
  5. \n
  6. \n
  7. \n
  8. \n
  9. \n
  10. \n
  11. \n
  12. \n
  13. \n
  14. \n
  15. \n
  16. \n
  17. \n
  18. \n
  19. \n
  20. \n
  21. \n
  22. \n
  23. \n
  24. \n
  25. \n
  26. \n
  27. \n
  28. \n
  29. \n