MySQL and Search at Craigslist


           Jeremy Zawodny
        jzawodn@craigslist.org
          http://craigslist.org/

         Jeremy@Zawodny.com
    http://jeremy.zawodny.com/blog/
Who Am I?
● Creator and co-author of High Performance MySQL
● Creator of mytop
● Perl Hacker
● MySQL Geek
● Craigslist Engineer (as of July, 2008)
    – MySQL, Data, Search, Perl
● Ex-Yahoo (Perl, MySQL, Search, Web Services)
What is Craigslist?
● Local Classifieds
    – Jobs, Housing, Autos, Goods, Services
● ~500 cities world-wide
● Free
    – Except for jobs in ~18 cities and brokered apartments in NYC
    – Over 20B pageviews/month
    – 50M monthly users
    – 50+ countries, multiple languages
    – 40+M ads/month, 10+M images
What is Craigslist?
● Forums
    – 100M posts
    – 100s of forums
Technical and other Challenges
● High ad churn rate
    – Post half-life can be short
● Growth
● High traffic volume
● Back-end tools and data analysis needs
● Growth
● Need to archive postings... forever!
    – 100s of millions, searchable
● Internationalization and UTF-8
Technical and other Challenges
● Small Team
    – Fires take priority
    – Infrastructure gets creaky
    – Organic code and schema growth over years
● Growth
● Lack of abstractions
    – Too much embedded SQL in code
● Documentation vs. Institutional Knowledge
    – “Why do we have things configured like this?”
Goals
● Use Open Source
● Keep infrastructure small and simple
    – Lower power is good!
    – Efficiency all around
    – Do more with less
● Keep site easy and approachable
    – Don't overload with features
    – People are easily confused
Craigslist Internals Overview
[Diagram: tiers, top to bottom]
● Load Balancer
● Read Proxy Array (Perl + memcached) | Write Proxy Array ...
● Web Read Array (Apache 1.3 + mod_perl)
● Object Cache (Perl + memcached) | Search Cluster (Sphinx)
● Read DB Cluster (MySQL 5.0.xx)
● Not Included:
    – user db, image db
    – async tasks, email
    – accounting, internal tools
    – and more!
Vertical Partitioning: Roles

[Diagram labels]
● Users, Classifieds, Forums
● Write, Read, Long, Trash
● Stats, Archive
Vertical Partitioning
● Different roles have different access patterns
    – Sub-roles based on query type
● Easier to manage and scale
● Logical, self-contained data
● Servers may not need to be as big/fast/expensive
● Difficult to do retroactively
● Various named db “handles” in code
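The idea of named db “handles” per role and sub-role could look roughly like the sketch below. It is written in Python rather than the Perl used at Craigslist, and it rests on assumptions: the `Handle` class, the `HANDLES` registry, the hostnames, and the pairing of the sub-roles Write/Read/Long with Classifieds are all invented for illustration and do not come from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Handle:
    """A named database handle: which role and sub-role it talks to."""
    role: str      # e.g. "classifieds"
    sub_role: str  # e.g. "read", "write", "long"
    host: str      # hypothetical hostname, for illustration only


# Hypothetical registry; every hostname here is invented for this sketch.
HANDLES = {
    ("classifieds", "write"): Handle("classifieds", "write", "cl-write.example"),
    ("classifieds", "read"): Handle("classifieds", "read", "cl-read.example"),
    ("classifieds", "long"): Handle("classifieds", "long", "cl-long.example"),
    ("forums", "read"): Handle("forums", "read", "forums-read.example"),
}


def get_handle(role: str, sub_role: str) -> Handle:
    """Look up the handle for a role/sub-role pair, failing loudly if unmapped."""
    try:
        return HANDLES[(role, sub_role)]
    except KeyError:
        raise LookupError(f"no db handle configured for {role}/{sub_role}") from None
```

Code that asks for `get_handle("classifieds", "long")` never needs to know which server answers slow queries, so a role can be moved to different hardware by editing the registry alone.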
Horizontal Partitioning: Hydra

[Diagram: client; cluster_01, cluster_02, cluster_03, ... cluster_N]
Horizontal Partitioning: Hydra
● Need to retrofit a lot of code
● Need non-blocking Perl MySQL client
● Wrapped http://code.google.com/p/perl-mysql-async/
● Eventually can size DB boxes based on price/power and adjust mapping function(s)
    – Choose hardware first
    – Make the db “fit”
● Archiving lets us age a cluster instead of migrating its data to a new one.
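One way to read “adjust mapping function(s)” is a weighted mapping from a record key to a cluster, where a bigger box gets a larger share of the key space. The sketch below is an assumption about how such a function might look, not Hydra's actual code: the weights, the slot table, and the choice of CRC32 as the hash are all illustrative.

```python
import zlib

# Hypothetical cluster weights: a box with weight 2 gets twice as many slots.
CLUSTER_WEIGHTS = {"cluster_01": 1, "cluster_02": 1, "cluster_03": 2}


def build_slots(weights):
    """Expand weights into a slot table, e.g. {a: 1, b: 2} -> [a, b, b]."""
    slots = []
    for name in sorted(weights):
        slots.extend([name] * weights[name])
    return slots


def cluster_for(key, slots):
    """Map a key to a cluster using a stable hash (CRC32) modulo the slot count."""
    digest = zlib.crc32(str(key).encode("utf-8"))
    return slots[digest % len(slots)]
```

With a plain modulo scheme like this one, changing the weights remaps existing keys, so in practice the table would have to stay fixed for data that has already been written.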
Search Evolution
● Problem: Users want to find stuff.
● Solution: Use MySQL Full Text.
● ...time passes...
● Problem: MySQL Full Text Doesn't Scale!
● Solution: Use Sphinx.
● ...time passes...
● Problem: Sphinx doesn't scale!
● Solution: Patch Sphinx.
MySQL Full-Text Problems
● Hitting invisible limits
    – CPU not pegged, Memory available
    – Disk I/O not unreasonable
    – Locking / Mutex contention? Probably.
● MyISAM has occasional crashing / corruption
● 5 clusters of 5 machines
    – Partitioning based on city and category
    – All “hand balanced” and high-maintenance
● ~30M queries/day
    – Close to limits
Sphinx: My First CL Project
● Sphinx is designed for text search
● Fast and lean C++ code
● Forking model scales well on multi-core
● Control over indexing, weighting, etc.
● Also spent some time looking at Apache Solr
Search Implementation Details
● Partitioning based on cities (each has a numeric id)
● Attributes vs. Keywords
● Persistent Connections
    – Custom client and server modifications
● Minimal stopword List
● Partition into 2 clusters (1 master, 4 slaves)
Sphinx Incremental Indexing
● Re-index every N minutes
● Use main + delta strategy
    – Adopted as: index + today + delta
    – One set per city (~500 * 3)
● Slaves handle live queries, update via rsync
● Need lots of FDs
● Use all 4 cores to index
● Every night, perform “daily merge”
● Generate config files via Perl
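Generating the config for one index + today + delta set per city might be sketched as follows. The talk says this was done in Perl; this Python version is only an illustration. The index naming scheme (`city<ID>_index` and so on), the `src_` source names, and the `/var/sphinx/data` path are assumptions made up for the example; only the general `index name { source = ...; path = ... }` block shape is taken from Sphinx's config format.

```python
def index_names(city_id):
    """Return the three index names for one city: index + today + delta."""
    return [f"city{city_id}_{part}" for part in ("index", "today", "delta")]


def index_stanza(name, base_path="/var/sphinx/data"):
    """Render one index block; the source name and path layout are invented."""
    return (
        f"index {name}\n"
        "{\n"
        f"    source = src_{name}\n"
        f"    path   = {base_path}/{name}\n"
        "}\n"
    )


def generate_config(city_ids):
    """Concatenate stanzas for every city, three indexes per city."""
    return "\n".join(
        index_stanza(name) for cid in city_ids for name in index_names(cid)
    )
```

Calling `generate_config(range(1, 501))` yields 1,500 index blocks, which matches the “~500 * 3” figure and hints at why the setup needs lots of file descriptors.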
Sphinx Issues
● Merge bugs [fixed]
● File descriptor corruption [fixed]
● Persistent connections [fixed]
    – Overhead of fork() was substantial in our testing
    – 200 queries/sec vs. 1,000 queries/sec per box
● Missing attribute updates [unreported]
● Bogus docids in responses
● We need to upgrade to latest Sphinx soon
● Andrew and team have been excellent!
Search Project Results
● From 25 MySQL Boxes to 10 Sphinx
● Lots more headroom!
● New Features
    – Nearby Search
● No seizing or locking issues
● 1,000+ qps during peak w/room to grow
● 50M queries per day w/steady growth
● Cluster partitioning built but not needed (yet?)
● Better separation of code
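The daily totals quoted in the talk can be turned into average rates with a little arithmetic. The sketch below uses only figures from the slides (~30M queries/day on 25 MySQL boxes, 50M queries/day on 10 Sphinx boxes); spreading the load evenly across all boxes is a simplifying assumption, since masters and slaves likely did not serve equal query traffic.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def average_qps(queries_per_day):
    """Average queries per second over a full day (ignores peak vs. off-peak)."""
    return queries_per_day / SECONDS_PER_DAY


mysql_avg = average_qps(30_000_000)   # about 347 qps on average
sphinx_avg = average_qps(50_000_000)  # about 579 qps on average

# Even split across boxes: a simplification, not a measured figure.
mysql_per_box = 30_000_000 / 25   # 1.2M queries/day per MySQL box
sphinx_per_box = 50_000_000 / 10  # 5M queries/day per Sphinx box
per_box_gain = sphinx_per_box / mysql_per_box  # roughly 4.2x
```

The average of roughly 579 qps sits well below the 1,000+ qps peak on the slide, which is the usual gap between a daily mean and peak-hour traffic.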
Sphinx Wishlist
● Efficient delete handling (kill lists)
● Non-fatal “missing” indexes
● Index dump tool
● Live document add/change/delete
● Built-in replication
● Stats and counters
● Text attributes
● Protocol checksum
Data Archiving, Replication, Indexes
● Problem: We want to keep everything.
● Solution: Archive to an archive cluster.
● Problem: Archiving is too painful. Index updates are expensive! Slaves affected.
● Solution: Archive with home-grown eventually consistent replication.
Data Archiving: OOB Replication
● Eventual Consistency
● Master process
    – SET SQL_LOG_BIN=0
    – Select expired IDs
    – Export records from live master
    – Import records into archive master
    – Delete expired from live master
    – Add IDs to list
Data Archiving: OOB Replication
● Slave process
    – One per MySQL slave
    – Throttled to minimize impact
    – State kept on slave
        ● Clone friendly
    – Simple logic
        ● Select expired IDs added since my sequence number
        ● Delete expired records
        ● Update local “last seen” sequence number
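The master and slave processes on these two slides can be simulated end to end with a small sketch. It uses Python's `sqlite3` in-memory databases as stand-ins for the MySQL servers, so it shows the control flow only. The table names (`postings`, `expired_ids`, `state`), the `expired` flag column, and the use of a batch size as a stand-in for throttling are all assumptions; the talk does not describe the real schema or where the ID list lives.

```python
import sqlite3


def make_db():
    """Create an in-memory database standing in for one MySQL server."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE postings (id INTEGER PRIMARY KEY, body TEXT, expired INTEGER)")
    db.execute("CREATE TABLE expired_ids (seq INTEGER PRIMARY KEY AUTOINCREMENT, posting_id INTEGER)")
    db.execute("CREATE TABLE state (last_seen INTEGER)")
    db.execute("INSERT INTO state VALUES (0)")
    return db


def master_archive_pass(live, archive):
    """Master process: move expired rows from the live master to the archive master."""
    # On MySQL the first step on the slide is SET SQL_LOG_BIN=0, which keeps
    # this session's statements out of the binary log so slaves do not replay them.
    rows = live.execute(
        "SELECT id, body FROM postings WHERE expired = 1 ORDER BY id"
    ).fetchall()
    for posting_id, body in rows:
        archive.execute("INSERT INTO postings (id, body) VALUES (?, ?)", (posting_id, body))
        live.execute("DELETE FROM postings WHERE id = ?", (posting_id,))
        # Record the ID so each slave can catch up later at its own pace.
        live.execute("INSERT INTO expired_ids (posting_id) VALUES (?)", (posting_id,))
    archive.commit()
    live.commit()
    return len(rows)


def slave_purge_pass(slave, id_list, batch_size=100):
    """Slave process: delete rows whose IDs were listed after this slave's last-seen sequence."""
    (last_seen,) = slave.execute("SELECT last_seen FROM state").fetchone()
    rows = id_list.execute(
        "SELECT seq, posting_id FROM expired_ids WHERE seq > ? ORDER BY seq LIMIT ?",
        (last_seen, batch_size),
    ).fetchall()
    for seq, posting_id in rows:
        slave.execute("DELETE FROM postings WHERE id = ?", (posting_id,))
        last_seen = seq
    # State lives on the slave itself, so a cloned slave resumes from the same point.
    slave.execute("UPDATE state SET last_seen = ?", (last_seen,))
    slave.commit()
    return len(rows)
```

Because each slave tracks only its own “last seen” sequence number, a slave that falls behind or is rebuilt from a clone simply keeps calling `slave_purge_pass` until it returns 0.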
Long Term Data Archiving
● Schema coupling is bad
    – ALTER TABLE takes forever
    – Lots of NULLs flying around
● CouchDB or similar long-term?
    – Schema-free feels like a good fit
● Tested some home grown solutions already
● Separate storage and indexing?
    – Indexing with Sphinx?
Drizzle, XtraDB, Future Stuff
● CouchDB looks very interesting. Maybe for archive?
● XtraDB / InnoDB plugin
    – Better concurrency
    – Better tuning of InnoDB internals
● libdrizzle + Perl
    – DBI/DBD may not fit an async model well
    – Can talk to both MySQL and Drizzle!
● Oracle buying Sun?!?!
We're Hiring!
● Work in San Francisco
● Flexible, Small Company
● Excellent Benefits
● Help Millions of People Every Week
● We Need Perl/MySQL Hackers
● Come Help us Scale and Grow
Questions?

Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 

MySQL and Search at Craigslist

  • 1. MySQL and Search at Craigslist
      Jeremy Zawodny
      jzawodn@craigslist.org, http://craigslist.org/
      Jeremy@Zawodny.com, http://jeremy.zawodny.com/blog/
  • 2. Who Am I?
      ● Creator and co-author of High Performance MySQL
      ● Creator of mytop
      ● Perl Hacker
      ● MySQL Geek
      ● Craigslist Engineer (as of July, 2008)
          – MySQL, Data, Search, Perl
      ● Ex-Yahoo (Perl, MySQL, Search, Web Services)
  • 4. What is Craigslist?
      ● Local Classifieds
          – Jobs, Housing, Autos, Goods, Services
      ● ~500 cities world-wide
      ● Free
          – Except for jobs in ~18 cities and brokered apartments in NYC
          – Over 20B pageviews/month
          – 50M monthly users
          – 50+ countries, multiple languages
          – 40+M ads/month, 10+M images
  • 5. What is Craigslist?
      ● Forums
          – 100M posts
          – 100s of forums
  • 6. Technical and other Challenges
      ● High ad churn rate
          – Post half-life can be short
      ● Growth
      ● High traffic volume
      ● Back-end tools and data analysis needs
      ● Growth
      ● Need to archive postings... forever!
          – 100s of millions, searchable
      ● Internationalization and UTF-8
  • 7. Technical and other Challenges
      ● Small Team
          – Fires take priority
          – Infrastructure gets creaky
          – Organic code and schema growth over years
      ● Growth
      ● Lack of abstractions
          – Too much embedded SQL in code
      ● Documentation vs. Institutional Knowledge
          – “Why do we have things configured like this?”
  • 8. Goals
      ● Use Open Source
      ● Keep infrastructure small and simple
          – Lower power is good!
          – Efficiency all around
          – Do more with less
      ● Keep site easy and approachable
          – Don't overload with features
          – People are easily confused
  • 9. Craigslist Internals Overview [diagram; labels only]
      ● Load Balancer
      ● Read Proxy Array, Write Proxy Array: Perl + memcached ...
      ● Web Read Array: Apache 1.3 + mod_perl
      ● Object Cache: Perl + memcached
      ● Search Cluster: Sphinx
      ● Read DB Cluster: MySQL 5.0.xx
      ● Not Included:
          – user db, image db
          – async tasks, email
          – accounting, internal tools
          – and more!
  • 10. Vertical Partitioning: Roles [diagram; labels only]
      Users, Classifieds, Forums, Write, Read, Long, Trash, Stats, Archive
  • 11. Vertical Partitioning
      ● Different roles have different access patterns
          – Sub-roles based on query type
      ● Easier to manage and scale
      ● Logical, self-contained data
      ● Servers may not need to be as big/fast/expensive
      ● Difficult to do retroactively
      ● Various named db “handles” in code
  • 12. Horizontal Partitioning: Hydra [diagram; labels only]
      client, cluster_01, cluster_02, cluster_03, ..., cluster_N
  • 13. Horizontal Partitioning: Hydra
      ● Need to retrofit a lot of code
      ● Need non-blocking Perl MySQL client
      ● Wrapped http://code.google.com/p/perl-mysql-async/
      ● Eventually can size DB boxes based on price/power and adjust mapping function(s)
          – Choose hardware first
          – Make the db “fit”
      ● Archiving lets us age a cluster instead of migrating its data to a new one.
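
The slides mention a mapping function from client to cluster but do not show one. A minimal sketch of what a weighted mapping could look like follows. It is written in Python purely for illustration (the talk's stack is Perl), and the weights, the hashing scheme, and the name `cluster_for` are all assumptions, not Craigslist's implementation.

```python
import hashlib
from bisect import bisect_right
from itertools import accumulate

# Hypothetical weights: a bigger or cheaper-to-run box gets a larger share of keys.
CLUSTER_WEIGHTS = {"cluster_01": 1, "cluster_02": 1, "cluster_03": 2}


def cluster_for(key, weights=CLUSTER_WEIGHTS):
    """Map a record key to a cluster name, in proportion to each cluster's weight."""
    names = sorted(weights)
    # Cumulative upper bounds per cluster, e.g. [1, 2, 4] for weights 1, 1, 2.
    bounds = list(accumulate(weights[name] for name in names))
    # Hash the key so the mapping is stable across processes and machines.
    digest = hashlib.md5(str(key).encode("utf-8")).digest()
    slot = int.from_bytes(digest[:8], "big") % bounds[-1]
    return names[bisect_right(bounds, slot)]
```

Changing the weights changes where existing keys land, so a real deployment would need a migration or lookup step that this sketch leaves out.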
  • 14. Search Evolution
      ● Problem: Users want to find stuff.
      ● Solution: Use MySQL Full Text.
      ● ...time passes...
      ● Problem: MySQL Full Text Doesn't Scale!
      ● Solution: Use Sphinx.
      ● ...time passes...
      ● Problem: Sphinx doesn't scale!
      ● Solution: Patch Sphinx.
  • 15. MySQL Full-Text Problems
      ● Hitting invisible limits
          – CPU not pegged, Memory available
          – Disk I/O not unreasonable
          – Locking / Mutex contention? Probably.
      ● MyISAM has occasional crashing / corruption
      ● 5 clusters of 5 machines
          – Partitioning based on city and category
          – All “hand balanced” and high-maintenance
      ● ~30M queries/day
          – Close to limits
  • 16. Sphinx: My First CL Project
      ● Sphinx is designed for text search
      ● Fast and lean C++ code
      ● Forking model scales well on multi-core
      ● Control over indexing, weighting, etc.
      ● Also spent some time looking at Apache Solr
  • 17. Search Implementation Details
      ● Partitioning based on cities (each has a numeric id)
      ● Attributes vs. Keywords
      ● Persistent Connections
          – Custom client and server modifications
      ● Minimal stopword List
      ● Partition into 2 clusters (1 master, 4 slaves)
  • 18. Sphinx Incremental Indexing
      ● Re-index every N minutes
      ● Use main + delta strategy
          – Adopted as: index + today + delta
          – One set per city (~500 * 3)
      ● Slaves handle live queries, update via rsync
      ● Need lots of FDs
      ● Use all 4 cores to index
      ● Every night, perform “daily merge”
      ● Generate config files via Perl
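
To make the "one set per city" arithmetic concrete, here is a rough sketch of generating the three index definitions for each city. The slides say the real generator is written in Perl; Python is used here only for illustration, and the naming scheme, the `base_path` default, and both function names are hypothetical. Only the `index { source = ...; path = ... }` block shape follows Sphinx's config format.

```python
# The three layers named on the slide: index + today + delta.
LAYERS = ("index", "today", "delta")


def index_names(city_id):
    """Return the three index names for one city (hypothetical naming scheme)."""
    return [f"city{city_id}_{layer}" for layer in LAYERS]


def render_city_config(city_id, base_path="/var/sphinx"):
    """Render sphinx.conf-style index blocks for one city as a string."""
    blocks = []
    for name in index_names(city_id):
        blocks.append(
            f"index {name}\n"
            "{\n"
            f"    source = {name}\n"
            f"    path = {base_path}/{name}\n"
            "}\n"
        )
    return "\n".join(blocks)
```

With roughly 500 cities this yields about 1,500 index definitions, which helps explain why the slide calls out file descriptors and generated config files.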
  • 20. Sphinx Issues
      ● Merge bugs [fixed]
      ● File descriptor corruption [fixed]
      ● Persistent connections [fixed]
          – Overhead of fork() was substantial in our testing
          – 200 queries/sec vs. 1,000 queries/sec per box
      ● Missing attribute updates [unreported]
      ● Bogus docids in responses
      ● We need to upgrade to latest Sphinx soon
      ● Andrew and team have been excellent!
  • 21. Search Project Results
      ● From 25 MySQL Boxes to 10 Sphinx
      ● Lots more headroom!
      ● New Features
          – Nearby Search
      ● No seizing or locking issues
      ● 1,000+ qps during peak w/room to grow
      ● 50M queries per day w/steady growth
      ● Cluster partitioning built but not needed (yet?)
      ● Better separation of code
  • 22. Sphinx Wishlist
      ● Efficient delete handling (kill lists)
      ● Non-fatal “missing” indexes
      ● Index dump tool
      ● Live document add/change/delete
      ● Built-in replication
      ● Stats and counters
      ● Text attributes
      ● Protocol checksum
  • 23. Data Archiving, Replication, Indexes
      ● Problem: We want to keep everything.
      ● Solution: Archive to an archive cluster.
      ● Problem: Archiving is too painful. Index updates are expensive! Slaves affected.
      ● Solution: Archive with home-grown eventually consistent replication.
  • 24. Data Archiving: OOB Replication
      ● Eventual Consistency
      ● Master process
          – SET SQL_LOG_BIN=0
          – Select expired IDs
          – Export records from live master
          – Import records into archive master
          – Delete expired from live master
          – Add IDs to list
  • 25. Data Archiving: OOB Replication
      ● Slave process
          – One per MySQL slave
          – Throttled to minimize impact
          – State kept on slave
              ● Clone friendly
          – Simple logic
              ● Select expired IDs added since my sequence number
              ● Delete expired records
              ● Update local “last seen” sequence number
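
The master and slave steps on the two slides above can be sketched as follows. This is an illustration under stated assumptions, not the talk's code: it uses Python's built-in `sqlite3` with in-memory databases as stand-ins for the MySQL servers, a hypothetical `postings` table with an `expired` flag, and a plain Python list in place of the shared ID list. SQLite has no binary log, so the `SET SQL_LOG_BIN=0` step appears only as a comment.

```python
import sqlite3


def make_db():
    """Create an in-memory stand-in for one MySQL server (hypothetical schema)."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE postings (id INTEGER PRIMARY KEY, body TEXT, expired INTEGER)"
    )
    return db


def archive_on_master(live, archive, expired_log):
    """Move expired rows from the live master to the archive master.

    In MySQL this would run with SET SQL_LOG_BIN=0 so the deletes are not
    replicated; each slave purges on its own schedule instead.
    expired_log is a list of (sequence_number, id) pairs.
    """
    rows = live.execute(
        "SELECT id, body, expired FROM postings WHERE expired = 1"
    ).fetchall()
    archive.executemany("INSERT INTO postings VALUES (?, ?, ?)", rows)
    live.executemany("DELETE FROM postings WHERE id = ?", [(r[0],) for r in rows])
    for row in rows:
        expired_log.append((len(expired_log) + 1, row[0]))


def purge_on_slave(slave, expired_log, last_seen):
    """Delete rows logged after last_seen and return the new last_seen value."""
    for seq, posting_id in expired_log:
        if seq > last_seen:
            slave.execute("DELETE FROM postings WHERE id = ?", (posting_id,))
            last_seen = seq
    return last_seen
```

Because each slave tracks only its own "last seen" sequence number, a cloned slave can resume from the copied state, and running the purge twice is harmless.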
  • 26. Long Term Data Archiving
      ● Schema coupling is bad
          – ALTER TABLE takes forever
          – Lots of NULLs flying around
      ● CouchDB or similar long-term?
          – Schema-free feels like a good fit
      ● Tested some home grown solutions already
      ● Separate storage and indexing?
          – Indexing with Sphinx?
  • 27. Drizzle, XtraDB, Future Stuff
      ● CouchDB looks very interesting. Maybe for archive?
      ● XtraDB / InnoDB plugin
          – Better concurrency
          – Better tuning of InnoDB internals
      ● libdrizzle + Perl
          – DBI/DBD may not fit an async model well
          – Can talk to both MySQL and Drizzle!
      ● Oracle buying Sun?!?!
  • 28. We're Hiring!
      ● Work in San Francisco
      ● Flexible, Small Company
      ● Excellent Benefits
      ● Help Millions of People Every Week
      ● We Need Perl/MySQL Hackers
      ● Come Help us Scale and Grow