2. Or:
● Cautionary Tales
● Don’t solve the wrong problems
● Bad schemas hurt ops too
● etc.
3. The Stories
● Are (mostly) true, and (mostly) actually happened
● Names have been changed to protect the (mostly)
innocent
● No animals were harmed during the making of this
presentation
○ Perhaps a few DBAs and engineers had light
emotional scarring
● Some of the people who inspired the stories may well be here today at MongoDB London
4. Story #1: Bill the Bulk Updater
● Bill built a system that tracked status information for
entities in his business domain
● State changes for this system happened in batches:
○ Sometimes 10% of entities were updated
○ Sometimes 100% were updated
● Essentially, lots of random updates (sketched below)
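A minimal sketch of one such batch update, assuming a hypothetical entities collection and filter (none of the original system's names are known):
db.entities.update(
  { region : "EMEA" },  // hypothetical filter - sometimes matching 10% of entities, sometimes all
  { $set : { status : "PROCESSED", updatedAt : new Date() } },
  { multi : true }      // apply the change to every matching document
)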
6. What about production?
● Bill’s system was a success!
● The product grew, and the number of entities increased
by a factor of 5
● Not a problem - add more shards!
12. What did we recommend?
● Scale the random I/O vertically, not horizontally
● Sometimes a combination of vertical & horizontal
scaling is the best approach
14. Story #2: Gary the Game Developer
● Gary was launching a AAA game title
● MongoDB would provide the backend for the players' online experience
● Launched worldwide on the same day, with midnight launches in each region
15. Complex Cloud Deployment
● Deployed in the cloud, but on very beefy instances
● 32 vCPU, 244GiB RAM, 8 x SSD
● A single mongod was unable to stress such instances
● Hence “micro-sharding” was required to get the most out of them
16. Micro-What?
Micro-Sharding is the practice of deploying multiple relatively small (hence “micro”) shards on
large hosts to better take advantage of available resources which are difficult to utilise with a
single mongod instance.
For example, 9 shards evenly distributed across 3 hosts:
HOST1: Primary1, Primary2, Primary3, Secondary4, Secondary5, Secondary6, Secondary7, Secondary8, Secondary9
HOST2: Secondary1, Secondary2, Secondary3, Primary4, Primary5, Primary6, Secondary7, Secondary8, Secondary9
HOST3: Secondary1, Secondary2, Secondary3, Secondary4, Secondary5, Secondary6, Primary7, Primary8, Primary9
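As a sketch of how such a layout is wired together (host names and ports are assumptions, not taken from the original deployment), each shard's replica set is registered with the cluster from a mongos:
sh.addShard( "shard1/host1:27018" )  // replica set shard1, seeded from its member on host1
sh.addShard( "shard2/host1:27019" )
sh.addShard( "shard3/host1:27020" )
// ...and so on for shards 4-9, whose members are seeded from host2 and host3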
17. Extensive Pre-Production Testing
● Load tested
● Failover and Backups tested
● Procedures, architecture reviewed
● Basically, lots of testing/reviewing was done (all
passed)
18. However…
The production layout of mongod processes was actually 8 shards on 3 hosts, reproduced below.
This layout caused a problem in production. But it was tested and had no issues, right?
Almost: the backup process was tested, and load was tested, but not together…
HOST1: Primary1, Primary2, Primary3, Secondary4, Secondary5, Secondary6, Secondary7, Secondary8
HOST2: Secondary1, Secondary2, Secondary3, Primary4, Primary5, Primary6, Secondary7, Secondary8
HOST3: Secondary1, Secondary2, Secondary3, Secondary4, Secondary5, Secondary6, Primary7, Primary8
19. The Backup Process
Backups took place on a single host (Host2 in the layout below).
The databases were locked, an LVM snapshot was taken, and then the lock was released.
This was almost instantaneous in pre-production testing (no load); not so in production.
HOST1: Primary1, Primary2, Primary3, Secondary4, Secondary5, Secondary6, Secondary7, Secondary8
HOST2: Secondary1, Secondary2, Secondary3, Primary4, Primary5, Primary6, Secondary7, Secondary8
HOST3: Secondary1, Secondary2, Secondary3, Secondary4, Secondary5, Secondary6, Primary7, Primary8
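A minimal sketch of that lock-snapshot-unlock sequence, using the standard shell commands (the LVM step is shown only as a comment, and the volume names are assumptions):
db.fsyncLock()     // flush pending writes and block writes for a consistent snapshot
// at the OS level, snapshot the data volume, e.g.:
//   lvcreate --snapshot --name mongo-backup --size 10G /dev/vg0/mongo-data
db.fsyncUnlock()   // release the lock - under load, this window was far from instantaneous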
20. Backup Under Load
Once load was introduced to the equation, the snapshots were no longer instantaneous.
The primaries on the host taking the backup became unresponsive, but did not fail over.
This eventually caused a cascading failure, bringing the whole cluster down.
HOST1: Primary1, Primary2, Primary3, Secondary4, Secondary5, Secondary6, Secondary7, Secondary8
HOST2: Secondary1, Secondary2, Secondary3, Primary4, Primary5, Primary6, Secondary7, Secondary8
HOST3: Secondary1, Secondary2, Secondary3, Secondary4, Secondary5, Secondary6, Primary7, Primary8
22. What did we recommend?
A new process layout was proposed (below); backups are still taken on Host2, which now holds no primaries.
The database lock was removed: it was not necessary, since an LVM snapshot is already point-in-time consistent.
Limits were also put on maximum connections, just in case.
HOST1: Primary1, Primary2, Primary3, Primary4, Secondary5, Secondary6, Secondary7, Secondary8
HOST2: Secondary1, Secondary2, Secondary3, Secondary4, Secondary5, Secondary6, Secondary7, Secondary8
HOST3: Secondary1, Secondary2, Secondary3, Secondary4, Primary5, Primary6, Primary7, Primary8
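As a sketch of the "just in case" safeguard: connection usage can be watched from the shell with a standard command, and mongod exposes a maxIncomingConnections setting; the figures below are illustrative, not from the deployment:
db.serverStatus().connections
// e.g. { "current" : 742, "available" : 19258, "totalCreated" : 51023 }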
23. Summary
No single cause:
● Small issue with the deployment layout
● Small error in the backup process
● Lack of integration in the testing plan (backup and load were never tested together)
● Relatively new system
● Some bad luck
Led to:
● Large outage, slow cautious recovery
24. Story #3: Rita the Retailer
Rita the Retailer had an ecommerce site, selling
diverse goods in 20+ countries.
25. Product Catalog: Original Schema
{
_id: 375,
en_US : { name : ..., description : ..., <etc...> },
en_GB : { name : ..., description : ..., <etc...> },
fr_FR : { name : ..., description : ..., <etc...> },
de_DE : ...,
de_CH : ...,
<... and so on for other locales... >
}
26. What’s good about this schema?
● Each document contains all the data about a given
product, across all languages/locales
● Very efficient way to retrieve the English, French,
German, etc. translations of a single product’s
information in one query
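For example (product id illustrative), a single query returns every locale:
db.catalog.find( { _id : 375 } )   // one round trip returns the product in all languages/locales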
27. However…
That is not how the product data is
actually used
(except perhaps by translation staff)
29. Which means…
The Product Catalog’s data model
did not fit the way the data was
accessed.
30. Consequences
● Each document contained ~20x more data than any
common use case needed
● MongoDB lets you request just a subset of a
document’s contents (using a projection), but…
○ Typically the whole document will get loaded into RAM to serve the request
● There are other overheads for reading from disk into
memory (like readahead)
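For example (locale and product id illustrative), a projection trims the response but not the read:
db.catalog.find( { _id : 375 }, { en_GB : true } )
// only en_GB is returned over the wire, but the whole ~20-locale document
// is still read from disk and paged into RAM to serve the request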
31. Therefore…
Less than 5% of data loaded into RAM from disk is
actually required at the time - highly inefficient
32. Visualising the problem
{ _id: 42,
en_US : { name : ..., description : ..., <etc...> },
en_GB : { name : ..., description : ..., <etc...> },
fr_FR : { name : ..., description : ..., <etc...> },
de_DE : ...,
de_CH : ...,
<... and so on for other locales... > }
<READAHEAD OVERHEAD>
{ _id: 709,
en_US : { name : ..., description : ..., <etc...> },
en_GB : { name : ..., description : ..., <etc...> },
fr_FR : { name : ..., description : ..., <etc...> },
de_DE : ...,
de_CH : ...,
<... and so on for other locales... > }
<READAHEAD OVERHEAD>
{ _id: 3600,
en_US : { name : ..., description : ..., <etc...> },
en_GB : { name : ..., description : ..., <etc...> },
fr_FR : { name : ..., description : ..., <etc...> },
de_DE : ...,
de_CH : ...,
<... and so on for other locales... > }
- Data in RED (the single requested locale) are loaded into RAM and used.
- Data in BLUE (every other locale) take up memory but are not required.
- Readahead padding in GREEN makes things even more inefficient.
34. What did we recommend?
● Design for your use case, your dominant query pattern
○ In this case: 99.99% of queries want the product data for exactly one locale at a time
○ Hence, alter the schema appropriately
● Eliminate inefficiencies on the system
○ Make reading from disk less wasteful and maximise I/O capabilities: reduce readahead settings (on Linux, e.g. with blockdev --setra)
35. Schema: Before & After
Schema Before (embedded):
{ _id: 375,
en_US : { name : ..., description : ..., <etc...> },
en_GB : { name : ..., description : ..., <etc...> },
fr_FR : { name : ..., description : ..., <etc...> },
<... and so on for other locales... >
}
Query Before:
db.catalog.find( { _id : 375 } , { en_US : true } );
db.catalog.find( { _id : 375 } , { fr_FR : true } );
db.catalog.find( { _id : 375 } , { de_DE : true } );
Schema After (one document per locale):
{ _id: "375-en_US",
name : ..., description : ..., <etc...> }
{ _id: "375-en_GB",
name : ..., description : ..., <etc...> }
{ _id: "375-fr_FR",
name : ..., description : ..., <etc...> }
... and so on for other locales ...
Query After:
db.catalog.find( { _id : "375-en_US" } );
db.catalog.find( { _id : "375-fr_FR" } );
db.catalog.find( { _id : "375-de_DE" } );
36. Consequences of Changes
● Queries induced minimal overhead
● More than 20x as many distinct products fit in memory at once
● Disk I/O utilization reduced
● UI latency decreased
● Happier Customers
● Profit (well, we hope)
37. Conclusions
● MongoDB can be used for a wide range of (sometimes pretty cool) use cases
● A small problem can seem much bigger
when it happens in production
● We are here to help - if you hit a problem, it’s
likely you are not the first to hit it
● We can provide a fresh perspective and advice based on experience, to prevent and solve issues
39. Further Reading for Retail/Catalogs
● Antoine Girbal (my team mate) has produced a full reference architecture for this type of application
○ Blog part 1: http://tmblr.co/ZiOADx1RRsAWe
○ Blog part 2: http://tmblr.co/ZiOADx1LfVmfm
● Detailed presentations and talks from MongoDB World:
○ http://www.mongodb.com/presentations/retail-reference-architecture-part-1-flexible-searchable-low-latency-product-catalog
○ http://www.mongodb.com/presentations/retail-reference-architecture-part-2-real-time-geo-distributed-inventory
○ http://www.mongodb.com/presentations/retail-reference-architecture-part-3-scalable-insight-component-providing-user-history
Editor's Notes
Field/Trenches
We also try to make it about interesting systems
Generally, MongoDB works well, but like all complex systems, it has its foibles, pitfalls and preferences
Despite the blurb, will not be talking about binary shard keys
Some borrowed, some merged into a single narrative
More ops focused take on borrowed stories
Sensor data, say from a trucking fleet, some real time data, rest uploaded in batches
He set up a sharded collection across 4 shards, all using locally-attached, commodity storage.
Everything worked well in the test environment…
Bill's Bulk Updates randomly affected an ever larger data set.
In order to cope with the database size, Bill added more shards.
The cluster scaled linearly, as intended.
Imagine that the sample rate was going to go from once a minute to once every 5 seconds
You can run 200 shards; we have customers that have been doing so for years
But if you are already worried about the TCO at 20 shards, 200 is a significant problem
Just because you can add horizontal capacity, does not mean it is the optimum solution
When you have identified your scaling driver (I/O, memory, CPU, locking), always consider alternatives to simply adding more of the same
Spinning rust (commodity hard disks) is generally poor for random I/O, but cheap. SSDs are expensive, but an order of magnitude faster for random I/O
Another alternative would be a high-performance SAN or a RAID of commodity spinning disks, but that was not an option here.
Fewer, beefier shards.
Bill went with SSDs.
Ultimately, Bill only needed about 4 shards, so cost savings overall.
Bill, and Bill's boss, were very relieved.
Lots of hype, so Gary’s load was going to go from zero to insane in a matter of minutes/hours when the game launched
Taking story 1 to an extreme
Usually sharding = one host per data bearing node. Good for scaling horizontally on hosts with modest capabilities.
Micro sharding deploys more than one mongod process on a host. In this case 8 data bearing processes per host for a total of 24 = 8 “micro” shards on 3 hosts
Taking story 1 to an extreme
Individual tests were OK, but not combined
The process followed our reference documentation, for the general use case:
Lock the database, take a backup (snapshot), release the lock
Generally you would do this on a host running a secondary, not on a host running 8 mongod processes, some of them primary
Connections built up on the clients; more were opened; hosts began to run out of resources (open files, ephemeral ports, memory in general)
Which boils down to……
Well, not quite, the dev people were on the call trying to fix this too
No recurrence, still running without an issue
In fact, this is how the data was used because of the current schema
Which means lots of I/O, lots of page faults, low cache hit rate, slowness – even though rough calculations say it should fit in RAM
Out of RAM, no problem, just download more! With the cloud, VMs, containers this is close to being true, but there is still a cost
It might fix things, but it’s expensive and the real problem is the efficiency, or lack thereof
Not the only approach, but allowed for minimal changes on the client side
Disk accessed more efficiently and less often, actually allows the readahead to potentially go lower too by freeing up IOPS capacity
For example, at the “Ask The Experts” table which I will be at for the next hour following this presentation