Lessons Learned from Migrating 2+ Billion Documents at Craigslist
Jeremy Zawodny
jzawodn@craigslist.org
Jeremy@Zawodny.com
http://blog.zawodny.com/

Outline
  • Recap of last year’s MongoSV talk: the archive, why MongoDB, etc. (http://www.10gen.com/video/mongosv2010/craigslist)
  • The Infrastructure
  • The Lessons
  • Wishlist
  • Q&A

Craigslist Numbers
  • 2 data centers
  • ~500 servers
  • ~100 MySQL servers
  • ~700 cities, worldwide
  • ~1 billion hits/day
  • ~1.5 million posts/day

Archive: Where Data Goes To Die
  • Live numbers: ~1.75M posts/day, ~14 day avg. lifetime, ~60 day retention, ~100M posts
  • We keep all postings
  • Users reuse postings
  • Daily archive migration
  • Internal query tools

Archive Pain
  • Coupled schemas
  • Big indexes
  • Hardware failures
  • Replication lag
  • Poor search
  • Human time costs

MongoDB Wins
  • Scalable
  • Fast
  • Friendly
  • Proven
  • Pragmatic
  • Approachable

MongoDB Details
  • Plan for 5 billion documents
  • Average size: 2KB
  • 3 replica sets, 3 servers each
  • Deploy to 2 datacenters
  • Same deployment in each datacenter
  • Posting ID is the sharding key
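
The figures on this slide imply a rough storage budget. The sketch below only multiplies out the slide's own numbers; it ignores indexes, padding, and the oplog, so treat the result as a lower bound rather than a measured figure.

```python
# Back-of-the-envelope sizing using only the numbers from the slide.
# Indexes, storage overhead, and the oplog are NOT included (assumption).
DOCS = 5_000_000_000          # planned document count
AVG_DOC_BYTES = 2 * 1024      # "Average size: 2KB"
SHARDS = 3                    # one replica set per shard
REPLICAS_PER_SHARD = 3        # servers in each replica set


def sizing(docs, avg_bytes, shards, replicas):
    """Return (raw_bytes, bytes_per_shard, bytes_per_datacenter)."""
    raw = docs * avg_bytes
    per_shard = raw / shards
    per_datacenter = raw * replicas  # every replica holds a full copy of its shard
    return raw, per_shard, per_datacenter


if __name__ == "__main__":
    raw, per_shard, per_dc = sizing(DOCS, AVG_DOC_BYTES, SHARDS, REPLICAS_PER_SHARD)
    tb = 1000 ** 4
    print(f"raw data:        {raw / tb:.2f} TB")
    print(f"per shard:       {per_shard / tb:.2f} TB")
    print(f"per data center: {per_dc / tb:.2f} TB")
```

At roughly 10 TB of raw documents, each shard member holds several terabytes, far more than RAM on typical servers of the time.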

MongoDB Architecture
  • Typical sharding with replica sets (external Sphinx full-text indexers not pictured)
  • [Diagram: four clients, three mongos routers, three config servers, and three shards (shard001, shard002, shard003), each a replica set]

Lesson: Know Your Hardware
  • MongoDB on blades really sucks
  • Single 10k RPM disks can’t take it when data is noticeably larger than RAM
  • Mongo operations can hit the client timeout (30 sec default)
  • Even once-a-minute cron jobs start to spew
  • Lots of time wasted in the development environment, trying different kernels, tuning, etc.
  • Most noticeable during heavy writes, but can happen if pages fall out of RAM for other reasons

Lesson: Replica Sets Rock
  • Lots of reboots happened during dev environment troubleshooting
  • Each time, one of the remaining nodes took over
  • No “reclone”, no config file or DNS changes
  • Stuff “just worked” while nodes bounced up and down

Lesson: Know Your Data
  • MongoDB is UTF-8
  • Some of our older data is decidedly NOT UTF-8
  • We had lots of sloppy encoding issues, and every one of them had to be cleaned up
  • Start data load. Wait 12-36 hours. Witness fail. Fix code. Start over. Sigh.
  • This is a combination of having been sloppy and having old data. Even with a lot less history, this can bite you.
  • Get your encoding house in order!
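
One way to avoid the load, wait, fail cycle is to scan the source data for invalid UTF-8 before starting the load. The following is a minimal sketch, not the approach used at Craigslist: the function names are hypothetical, and the cp1252 fallback is an assumption about what the legacy encoding might be.

```python
from typing import Iterable, List, Tuple


def to_utf8_text(raw: bytes, fallback: str = "cp1252") -> Tuple[str, bool]:
    """Decode raw bytes; return (text, was_valid_utf8).

    The fallback encoding is a guess (assumption); bytes it cannot map
    are replaced with U+FFFD rather than raising.
    """
    try:
        return raw.decode("utf-8"), True
    except UnicodeDecodeError:
        return raw.decode(fallback, errors="replace"), False


def find_bad_records(records: Iterable[bytes]) -> List[int]:
    """Return the positions of records that are not valid UTF-8."""
    return [i for i, raw in enumerate(records) if not to_utf8_text(raw)[1]]


if __name__ == "__main__":
    sample = [b"plain ascii", "caf\u00e9".encode("latin-1"), "caf\u00e9".encode("utf-8")]
    print("records needing cleanup:", find_bad_records(sample))
```

Running a pass like this over the whole data set first turns a failure late in a multi-hour load into a report you can act on up front.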

Lesson: Know Your Data Size
  • MongoDB has a document size limit: 4MB in 1.6.x, 16MB in 1.8.x
  • What to do with outliers? In our case, trim off some useless data.
  • Going from relational to document makes this sort of problem easy to have: one parent, many children
  • It’d be nice if the limit were easier to change, but clients have it hard-coded too
  • Compression would help, of course
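
The "one parent, many children" outlier can be caught before insert by measuring each document and trimming its child list when it is too large. This sketch makes two assumptions that are not in the slides: JSON length stands in for the real BSON size (a driver's BSON encoder would give the exact figure), and the oldest children are the ones safe to drop. `trim_to_limit` and the `revisions` field are hypothetical names.

```python
import json

MAX_DOC_BYTES = 16 * 1024 * 1024  # the 16MB limit cited for 1.8.x


def approx_size(doc) -> int:
    """Approximate encoded size; JSON is only a stand-in for BSON (assumption)."""
    return len(json.dumps(doc, ensure_ascii=False).encode("utf-8"))


def trim_to_limit(doc, child_key, limit=MAX_DOC_BYTES):
    """Drop children from the front of doc[child_key] until the doc fits.

    Returns (trimmed_copy, dropped_count); the input document is not modified.
    """
    trimmed = dict(doc)
    children = list(doc.get(child_key, []))
    trimmed[child_key] = children
    dropped = 0
    while children and approx_size(trimmed) > limit:
        children.pop(0)
        dropped += 1
    return trimmed, dropped


if __name__ == "__main__":
    parent = {"_id": 1, "revisions": ["x" * 100] * 10}
    small, dropped = trim_to_limit(parent, "revisions", limit=500)
    print(f"dropped {dropped} children, now {approx_size(small)} bytes")
```

Logging the dropped count per document makes it easy to review how many outliers were affected before committing to the trimmed data.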

Lesson: Know Your Data Types
  • Field types and conversions can be expensive to do after the fact!
  • MongoDB treats strings and numbers differently, but some programming languages (such as Perl) don’t make that distinction obvious
  • This has indexing implications when you later look for 123456789 but had unknowingly stored “123456789”
  • http://search.cpan.org/dist/MongoDB/lib/MongoDB/DataTypes.pod

Data Types, continued
  • “If the type of a field is ambiguous and important to your application, you should document what you expect the application to send to the database and convert your data to those types before sending.”
  • Do you know how to do that in your language of choice?
  • Some drivers may make a “guess” that gets it right most of the time
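
The advice quoted above, converting data to the documented types before sending, can be sketched as a small coercion step in front of the insert. The schema and field names here are hypothetical, and Python is used for illustration only; the talk's own code was in Perl.

```python
# Hypothetical schema: the types the application promises to send.
SCHEMA = {"posting_id": int, "price": float, "title": str}


def coerce(doc, schema=SCHEMA):
    """Return a copy of doc with known fields converted to their declared types.

    Unknown fields and None values pass through unchanged; a value that
    cannot be converted raises ValueError, which is preferable to storing
    the wrong type silently.
    """
    out = {}
    for key, value in doc.items():
        want = schema.get(key)
        if want is None or value is None or isinstance(value, want):
            out[key] = value
        else:
            out[key] = want(value)
    return out


if __name__ == "__main__":
    row = {"posting_id": "123456789", "price": "10.5", "title": "bike"}
    print(coerce(row))
    # The string and the number are different values, which is why a
    # lookup by number misses a document stored with the string.
    print("123456789" == 123456789)
```

Doing the conversion at write time is cheap; doing it after billions of documents have been stored means rewriting them and rebuilding indexes.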

Lesson: Know Some Sharding
  • The balancer can be your frenemy
  • Initial insert rate: 8,000/sec
  • Later drops to 200/sec
  • Too much time is spent waiting to page in data that’s going to be sent to another node and never looked at (locally) again
  • Pre-split your data if possible
  • http://blog.zawodny.com/2011/03/06/mongodb-pre-splitting-for-faster-data-loading-and-importing/
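
Pre-splitting means deciding the chunk boundaries and their shard placement before the bulk load, so the balancer has nothing to move afterwards. The sketch below only computes such a plan. It assumes posting IDs are spread roughly evenly across a known numeric range, and the function names and ID range are hypothetical. Applying the plan to a cluster (for example with MongoDB's `split` and `moveChunk` admin commands) is left out.

```python
def split_points(min_id, max_id, chunks):
    """Return chunks-1 evenly spaced boundaries strictly between min_id and max_id."""
    if chunks < 2 or max_id <= min_id:
        return []
    step = (max_id - min_id) / chunks
    return [min_id + round(step * i) for i in range(1, chunks)]


def plan_chunks(min_id, max_id, chunks, shards):
    """Return [(lower, upper, shard), ...], assigning chunks to shards round-robin."""
    bounds = [min_id] + split_points(min_id, max_id, chunks) + [max_id]
    return [
        (bounds[i], bounds[i + 1], shards[i % len(shards)])
        for i in range(len(bounds) - 1)
    ]


if __name__ == "__main__":
    shards = ["shard001", "shard002", "shard003"]  # names from the architecture slide
    for lower, upper, shard in plan_chunks(0, 1000, 4, shards):
        print(f"[{lower}, {upper}) -> {shard}")
```

Even spacing is only reasonable when the key is uniformly distributed; for a skewed key, boundaries sampled from the real data would be a better starting point.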

Lesson: Know Some Replica Sets
  • Replica set re-sync requires index rebuilds on the secondary
  • Most painful when a secondary is down too long and can’t catch up using the oplog
  • Typically happens during high write volumes
  • In a large data set, the index rebuilding can take a couple of days, even without many indexes
  • What if you lose another node while that is happening?
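
How long a secondary can be down before it can no longer catch up depends on how much write history the oplog holds. A rough estimate is the oplog size divided by the write throughput. In the sketch below the 50 GiB oplog size is purely hypothetical, the write rate and document size are borrowed from earlier slides, and treating each oplog entry as about one document in size is an assumption.

```python
def oplog_window_hours(oplog_bytes, writes_per_sec, avg_entry_bytes):
    """Estimate how many hours of writes fit in the oplog.

    Assumes a steady write rate and that each oplog entry is roughly
    avg_entry_bytes long; real entries carry extra metadata.
    """
    bytes_per_sec = writes_per_sec * avg_entry_bytes
    if bytes_per_sec <= 0:
        return float("inf")
    return oplog_bytes / bytes_per_sec / 3600


if __name__ == "__main__":
    hypothetical_oplog = 50 * 1024 ** 3   # 50 GiB, an assumed size
    hours = oplog_window_hours(hypothetical_oplog, 8_000, 2 * 1024)
    print(f"estimated oplog window: {hours:.2f} hours")
```

With these inputs the window is a bit under one hour, which illustrates why a node that drops out during a heavy load is likely to need a full re-sync.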

MongoDB Wishlist
  • Replica set node re-sync without index rebuilding
  • Record (or field) compression (not everyone uses a filesystem that offers compression)
  • Method to tap into the oplog so that changes can be fed to external indexers (Sphinx, Redis, etc.)
  • Hash-based sharding (coming soon?)
  • Cluster snapshot/backup tool

craigslist is hiring!
Send resumes to: z@craigslist.org (plain text or PDF, no Word docs!)
  • Front-end Engineering: HTML, CSS, JavaScript, jQuery (Mobile too)
  • Network Administration: routers, switches, load balancers, etc.
  • Back-end Engineering: Linux, Apache, Perl, MySQL, MongoDB, Redis, Gearman, etc.
  • Systems Administration: help keep all those systems running

craigslist is hiring! (continued)
  • Laid back, non-corporate environment
  • Engineering-driven culture
  • Lots of interesting technical challenges
  • Easy SF commute
  • Excellent benefits and pay
  • High-impact work: millions use craigslist daily
