MongoHQ knows there is something special about 100 GB of data. Our customers who hit 100 GB are running core pieces of their business on a scalable MongoDB platform. In this presentation, we will walk through a cloud-focused scaling checklist that will help you quickly and securely blow past the 100 GB milestone. Using customer examples and best-practice MongoDB use cases, we'll help prepare you to get to the data size your business needs.
2. www.mongohq.com Scaling Checklist for MongoDB
MongoHQ
www.mongohq.com | @mongohq
MongoHQ is a fully-managed platform used by
developers to deploy, host and scale open-source
databases.
Chris Winslett
chris@mongohq.com
I’ve spoken at a number of MongoDB conferences on
optimizing queries. I’ve been with MongoHQ for two
years – prior to that I built applications for the education
and technical sectors.
3. www.mongohq.com Scaling Checklist for MongoDB
TL;DR
• 100GB of data is relatively big data
• MongoDB has comparative advantages
• MongoDB has absolute constraints
• Know the MongoDB gauges
• Surpassing 100GB requires:
– Understanding the absolute constraints
– Knowing your application’s data consumption
– Optimizing data consumption toward the comparative
advantages
4. www.mongohq.com Scaling Checklist for MongoDB
Audience Survey
What is your data size? Choose the biggest
bucket.
A. < 10GB
B. < 50GB
C. < 75GB
D. < 100GB
E. > 100 GB
5. www.mongohq.com Scaling Checklist for MongoDB
100 GB Checklist
1. Identify your data behavior
2. Use MongoDB for comparative advantages
3. Know the MongoDB indexing constraints
4. Refactor schema to simplify queries
5. Remove data that does not fit MongoDB
6. Separate hot and cold data
7. Stop using `mongodump`
8. Check your gauges
9. Avoid queries causing page faults
10. Track and monitor slow queries
11. Buying time with hardware
7. www.mongohq.com Scaling Checklist for MongoDB
Identify your data behavior
1. Small v. Large – type of data
2. Fast v. Slow – behavior of data
3. Complex v. Simple – type of queries
4. Known v. Unknown – behavior of queries
5. Queuing v. Application data
This can happen at the planning, staging, or production phase.
9. www.mongohq.com Scaling Checklist for MongoDB
Modern applications have all patterns
[Chart: Small/Large (x) by Fast/Slow (y), plotting main application collections, application metadata, secondary application collections, internal metrics, event logs and event data, queues/OLTP/messages, and data rendered in background]
10. www.mongohq.com Scaling Checklist for MongoDB
Where doesn’t MongoDB excel?
[Chart: Small/Large (x) by Fast/Slow (y), plotting main application collections, application metadata, secondary application collections, internal metrics, event logs and event data, queues/OLTP/messages, and data rendered in background]
11. www.mongohq.com Scaling Checklist for MongoDB
4th dimension is time
[Chart: Small/Large (x) by Fast/Slow (y), showing the main application collections as today’s data and as last week’s data]
12. www.mongohq.com Scaling Checklist for MongoDB
Data-types to avoid with MongoDB
[Chart: Small/Large (x) by Fast/Slow (y), plotting main application collections, application metadata, secondary application collections, internal metrics, event logs and event data, queues/OLTP/messages, and data rendered in background]
14. www.mongohq.com Scaling Checklist for MongoDB
Modern applications have all types of queries
[Chart: Unknown/Known (x) by Simple/Complex (y), plotting data discovery, application search, key value, single range query, user generated search, internal metrics, and multi-range query]
15. www.mongohq.com Scaling Checklist for MongoDB
Queries to Avoid with MongoDB
[Chart: Unknown/Known (x) by Simple/Complex (y), plotting data discovery, application search, key value, single range query, user generated search, internal metrics, and multi-range query]
16. www.mongohq.com Scaling Checklist for MongoDB
4th Dimension is Time
[Chart: Unknown/Known (x) by Simple/Complex (y), showing the real-time core of the application as today’s data and as last week’s data]
17. www.mongohq.com Scaling Checklist for MongoDB
Queries and MongoDB
[Chart: Unknown/Known (x) by Simple/Complex (y), with regions labeled MongoDB, SQL, and Elasticsearch (two regions)]
19. www.mongohq.com Scaling Checklist for MongoDB
MongoDB’s Technical Comparative
Advantage
• Expressive data structure allows simplification of
complex data relationships
• Create simple, known queries and return
expressive relationships
• On-the-fly addition of attributes / columns
• Total Cost of Ownership*
21. www.mongohq.com Scaling Checklist for MongoDB
MongoDB Indexing Constraints
• Only one index can be used per query
• Only one range operator can be used per
index
• Range operator must be the last field on index
• Know how to use the right side of indexes
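As a rough illustration of the constraints above, here is a minimal sketch of a compound index whose last field carries the only range condition. The collection `events` and the fields `account_id` and `created_at` are assumptions made up for this example; they do not come from the talk.

```javascript
// Hypothetical collection (`events`) and fields (`account_id`, `created_at`),
// chosen only for illustration.

// Compound index: the equality field comes first, the range field last.
const eventsIndex = { account_id: 1, created_at: 1 };

// One equality match plus a single range operator on the last index field.
function eventsSince(db, accountId, since) {
  return db.events.find({
    account_id: accountId,        // equality match on the index prefix
    created_at: { $gte: since },  // the only range condition
  });
}

// In the mongo shell, the index would be created once with:
//   db.events.createIndex(eventsIndex)   // ensureIndex in older shells
```

Passing `db` in as an argument keeps the sketch independent of any particular driver or shell session.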
23. www.mongohq.com Scaling Checklist for MongoDB
What does it mean to optimize?
Scaling to 100GB involves moving queries from complex to simple and unknown to known.
[Chart: Unknown/Known (x) by Simple/Complex (y), with arrows from the Start positions to Finish]
24. www.mongohq.com Scaling Checklist for MongoDB
Example of simplifying a query.
Goal: find the most recent messages for a person’s message stream.
Naïve query:
db.messages.find({$or: [{recipient_id: <id>}, {sender_id: <id>}]}).sort({_id: -1})
Second attempt:
db.messages.find({participant_ids: <id>}).sort({_id: -1})
Best approach:
db.users.find({_id: <id>})
25. www.mongohq.com Scaling Checklist for MongoDB
Naïve Query
Document
{
  _id: <id>,
  message: "Wow, this pizza is good!",
  sender_id: <user_id>,
  recipient_id: <user_id>
}
Query
db.messages.find({$or: [{recipient_id: <id>}, {sender_id: <id>}]}).sort({_id: -1})
26. www.mongohq.com Scaling Checklist for MongoDB
Second Attempt
Document
{
  _id: <id>,
  message: "Wow, this pizza is good!",
  sender_id: <sender_id>,
  recipient_id: <recipient_id>,
  participant_ids: [<sender_id>, <recipient_id>]
}
Query
db.messages.find({participant_ids: <id>}).sort({_id: -1})
27. www.mongohq.com Scaling Checklist for MongoDB
Best approach
Document
Hint: use $push with $sort and $slice to keep the last 50
{
  _id: <id>,
  name: "Clark Kent",
  recent_messages: [
    <…50 denormalized messages…>
  ]
}
Query
db.users.find({_id: <id>})
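The hint on this slide can be sketched as an update document that combines `$push` with the `$each`, `$sort`, and `$slice` modifiers. The helper names below are hypothetical, and `applyLocally` is only a plain-JavaScript model of the server's behavior that assumes numeric `_id` values so ordering is easy to follow.

```javascript
// Hypothetical helper: builds the update that keeps only the newest
// `limit` messages embedded on a user document.
function recentMessagesUpdate(message, limit = 50) {
  return {
    $push: {
      recent_messages: {
        $each: [message],   // append the new message
        $sort: { _id: 1 },  // order oldest first, newest last
        $slice: -limit,     // keep only the last `limit` elements
      },
    },
  };
}

// Plain-JavaScript model of the update's effect, for illustration only.
// Assumes numeric _id values; real ObjectIds are compared by the server.
function applyLocally(existing, update) {
  const spec = update.$push.recent_messages;
  const merged = existing.concat(spec.$each);
  merged.sort((a, b) => a._id - b._id);
  return merged.slice(spec.$slice);
}

// In the mongo shell, the same update would be applied to both parties:
//   db.users.update({ _id: <sender_id> }, recentMessagesUpdate(msg))
//   db.users.update({ _id: <recipient_id> }, recentMessagesUpdate(msg))
```

The write does a little more work so that the read stays a single lookup by `_id`.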
28. www.mongohq.com Scaling Checklist for MongoDB
How did we optimize?
We took a known, complex query and made it simple.
[Chart: Unknown/Known (x) by Simple/Complex (y), with an arrow from Start to Finish]
29. www.mongohq.com Scaling Checklist for MongoDB
Methods for Simplifying Queries
• Bucket values
• Create summary attributes
• Pre-compute values
• Use expressive document structures
• Sort and filter at the application level
• Create summary documents
• Divide and measure (more on this later)
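To make "bucket values" and "pre-compute values" concrete, here is one possible sketch: store an hour bucket on each document at write time so that a read can use an equality match instead of a time range. The field names `created_at`, `account_id`, and `hour_bucket` are assumptions for illustration only.

```javascript
// Hypothetical field names (`created_at`, `hour_bucket`), for illustration.

// Collapse a timestamp to its UTC hour, e.g. "2014-03-05T14".
function hourBucket(date) {
  return date.toISOString().slice(0, 13);
}

// Pre-compute the bucket at write time so reads can use an equality match.
function withHourBucket(event) {
  return Object.assign({}, event, { hour_bucket: hourBucket(event.created_at) });
}

// Read side: an equality match on the bucket instead of a range on created_at.
function eventsInHourQuery(accountId, date) {
  return { account_id: accountId, hour_bucket: hourBucket(date) };
}
```

The bucket width (an hour here) is a design choice; it should match the granularity the application actually reads at.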
31. www.mongohq.com Scaling Checklist for MongoDB
Remove “unrefactorable” data
[Chart: Small/Large (x) by Fast/Slow (y), plotting main application collections, application metadata, secondary application collections, internal metrics, event logs and event data, queues/OLTP/messages, and data rendered in background, with one region labeled Redis]
32. www.mongohq.com Scaling Checklist for MongoDB
Move up and right, or find another tool
[Chart: Unknown/Known (x) by Simple/Complex (y), with a region labeled MongoDB, plotting data discovery, application search, user generated search, and multi-range query]
34. www.mongohq.com Scaling Checklist for MongoDB
4th Dimension is Time
[Chart: Unknown/Known (x) by Simple/Complex (y), showing the real-time core of the application as today’s data (fast) and as last week’s data (slower)]
35. www.mongohq.com Scaling Checklist for MongoDB
Separate Data with Cross Purposes
• If today’s data must be fast, and last
week’s data can be slow:
– Age out today’s data using TTL collections
– Use another database for last week’s data
– Use high-RAM-ratio, SSD-backed machines for
today’s data
– Use cheaper hardware for last week’s data
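The TTL bullet above might look like the following sketch. A TTL index is an ordinary single-field index on a date field with the `expireAfterSeconds` option; the helper name and the `created_at` field are assumptions for illustration.

```javascript
// Hypothetical helper: adds a TTL index so documents expire once the
// date in `field` is older than `maxAgeSeconds`.
function addTtlIndex(collection, field, maxAgeSeconds) {
  return collection.createIndex(
    { [field]: 1 },                         // TTL needs a single date field
    { expireAfterSeconds: maxAgeSeconds }   // server deletes older documents
  );
}

// One day, expressed in seconds.
const ONE_DAY_SECONDS = 24 * 60 * 60;

// In the mongo shell, for a hypothetical collection of today's data:
//   addTtlIndex(db.todays_events, "created_at", ONE_DAY_SECONDS)
```

Expiry runs in a background task, so documents disappear shortly after the deadline rather than exactly on it. Anything that must outlive the TTL has to be copied to the slower store before it expires.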
36. www.mongohq.com Scaling Checklist for MongoDB
MongoDB Doesn’t have Joins
Data doesn’t have to be adjacent.
Divide, measure, conquer.
38. www.mongohq.com Scaling Checklist for MongoDB
Stop Using `mongodump`
`mongodump` is a long-running table scan that
exports all documents. This pushes hot data out of RAM and
causes performance issues.
Self-hosting: use MongoDB MMS and
Backup
As-a-service: ask your vendor about backup
alternatives
42. www.mongohq.com Scaling Checklist for MongoDB
Avoid Page Faults like the Plague
[Bar chart: MongoDB operations per second (axis from 0 to 8000) for workloads with 50%, 1%, and 0% table scans]
43. www.mongohq.com Scaling Checklist for MongoDB
What type of queries cause page faults?
[Chart: Unknown/Known (x) by Simple/Complex (y), with a region labeled MongoDB, plotting data discovery, application search, user generated search, and multi-range query]
45. www.mongohq.com Scaling Checklist for MongoDB
Track & Remove Slow Queries
• system.profile collection – link
• MongoDB professor – link
• Dex – link
• MongoHQ Slow Query Tracker and Profiler -
link
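As a starting point for the first item in this list, the built-in profiler can be switched on and queried roughly as follows. The function names and the 100 ms threshold are assumptions for illustration; `db` is a mongo shell database handle.

```javascript
// Turn on profiling for operations slower than `slowMs` milliseconds.
// Level 1 records only slow operations; level 2 would record everything.
function enableSlowQueryProfiling(db, slowMs = 100) {
  return db.setProfilingLevel(1, slowMs);
}

// Read the slowest recorded operations from the system.profile collection.
function slowestOperations(db, slowMs = 100, limit = 10) {
  return db.system.profile
    .find({ millis: { $gte: slowMs } })  // `millis` is the operation duration
    .sort({ millis: -1 })                // slowest first
    .limit(limit);
}
```

Profiling adds some overhead of its own, so a sensible threshold matters on a busy database.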
47. www.mongohq.com Scaling Checklist for MongoDB
Buying time with hardware has a
limited life
• Don’t get addicted to buying more hardware.
• Before any purchasing decision, always
– consider optimization
– investigate separating and paring down data
48. www.mongohq.com Scaling Checklist for MongoDB
100 GB Checklist
1. Identify your data behavior
2. Use MongoDB for comparative advantages
3. Know the MongoDB indexing constraints
4. Refactor schema to simplify queries
5. Remove data that does not fit MongoDB
6. Separate hot and cold data
7. Stop using `mongodump`
8. Check your gauges
9. Avoid queries causing page faults
10. Track and monitor slow queries
11. Buying time with hardware
• 4 years of experience
• Run 50,000 total MongoDB databases
• Run multi-terabyte sharded environments
• We have a philosophy of optimize, then shard
• Our real enjoyment is seeing a company grow, use our platform, and find value with our platform
If your company is creating data, and you and your customers have created 100GB of data, that is a fast-growing business. In some cases, 10GB is a good, growing business.
• Unless you are digesting Twitter’s firehose or scraping the web, consuming someone else’s data, there is something special about 100GB of data
• 10GB and growing is a good as-a-service business on a good clip
• As you approach 50GB of data, your next hurdle is 100GB
• We are thinking about building applications that are planning for 100GB of data
Quick survey to see what type of audience we have. If you would, just respond manually with your dataSize, or with your letter, in the chat. We will tally those values in a moment.
Let’s get started looking at the checklist.
• Reasonable number
• The first 3 build the case for how you should think about MongoDB. Please bear with me through these sections; I am framing the discussion. 1 – 3 will be longer, but will lend a theme to later examples
• 4 – 6 are a set of techniques for improving performance
• 7 – 10 are some commandments
• 11 leaves you with some words of wisdom
Let’s start off by identifying your data’s behavior.
I’ve put together two different sets of axes for us to look at: data and queries.
• Data type: small v. large
• Data type: fast v. slow
• Query type: complex v. simple
• Query type: known v. unknown
Before engineers check out from this talk: this exercise is quick, easy, and important for mapping your data usage. It will help you understand the different types of data that compose your application. It will also help when searching for the best tools for the job. I am proposing two sets of axes here, but I am sure there are more. After discussing with customers, these two charts are a good starting point, and offer a good way to think of data growth.
Here is our first axis: fast and slow on the y axis, small and large on the x axis. My question for you is: what type of data do you have? Do you have fast and large data? Do you have fast and small data? Slow and large data? What type of data do you have?
Increasingly, we find modern applications have all data types.
Data’s characteristics are not static. Over time, your data type changes. In the chart, we show aging data moving from fast and small to larger and slower. When discussing the same set of data, we have to discuss the assumption of time. Two engineers can talk about the same collection of data: one engineer is thinking of last week’s data, another is thinking of this week’s data, and they are arguing different use cases.
Green is good. MongoDB excels at use cases with many types of data, except queues, messaging, and OLTP. If you have small, fast data, MongoDB is not the tool for that. Notice that I don’t go to the ends of the fast, large, or small axes; I recognize there are edge cases past the capabilities of MongoDB performance.
The next set of axes we are introducing is query type. Previously, we discussed data type and answered “How does your data behave?” Now, we are answering “How do you retrieve your data?” These axes are not as intuitive as our prior axes.
• Simple query: single value
• Complex query: multiple values, multiple conditions, multiple ranges
• Known query: you precisely know the arguments you are querying
• Unknown query: you do not yet know the arguments you are querying
As before, modern applications have all types of queries. The key positions on this spectrum are upper right, “simple and known”, and lower left, “complex and unknown”.
• Simple and known represents many modern NoSQL databases: key-value stores
• Unknown and complex is what I term “data discovery”: the data has not turned into information, and I want to go through the process of turning raw data into actionable information. It typically involves very complex queries and an unknown end state for the queries
On the rest of the spectrum, we have internal metrics, which are often very structured datasets, and single range queries. Across from each other, we have application search and user search. Applications typically have a search mechanism.
Which queries are off limits in MongoDB? Data discovery.
As with data types, the queries required of data also change over time. Today’s data is suitable for quick application usage. Last week’s data requires analysis: data discovery. These are important notions when working with increasingly larger data. We recognize that all data created similarly will have different requirements during its life. Recognize these nuances, and adapt.
MongoDB dominates the “simple / known” quadrant. We
• If you’ve used MongoDB, JSON-like data
• Expressive documents on complex relationships
• With creativity you can create simple, known queries and these complex relationships
• On-the-fly addition of attributes / columns
MongoDB uses B-tree indexes. The indexing constraints are:
• absolute
• simple
List the indexing constraints:
• No intersections of indexes
• Operators are $or, $and, $sort, $nin, $in, $ne, $gte, $lte
• Any violation of these constraints will lead to table scans
Hopefully, with 1 – 3, I’ve built the case for simplicity. Now we will answer the question: “What does it mean to optimize for 100 GB of data?”
Scaling your query and database interactions will move you toward the simple and known quadrant, approaching an interaction that is similar to a key-value store.
Imagine a messaging system that captures messages between two parties. For any person, you want to find that person’s most recent messages.
• Naïve example: use the $or operator. Of course, we learned that $sort does not work with $or; this will cause table scans
• Second attempt: query on participants
• Best approach: use NoSQL for what it does best
Here is a view of aging data. How does data become “cold”? How does your data become “hot”? What data needs to be fast? What data can be slow?
Keep fast data fast, and keep cold data separate. Over time, as we’ve stated, our data:
• Becomes larger
• Becomes slower
• Requires complex queries
• Filters on unknown conditions
Keep that data separate from today’s data.
Backups are important. Don’t use `mongodump` to make them (unless you have a 3rd member you want to run them against).