2. 2JUNE 2014
Agenda
• CMP.LY and CommandPost
• What is MongoDB Management Service?
• Performance Tuning
• MongoDB Issues we've faced
  • Slow response times and delayed writes
  • Unindexed queries
  • Increased Replication Lag and Plummeting oplog Window
• Keep your deployment healthy with MMS
  • Using MMS Alerts
  • Using MMS Backups
CMP.LY: A venture-funded NYC startup that offers proprietary social media, monitoring, measurement, insight and compliance solutions for Fortune 100
CommandPost: A Monitoring, Measurement & Insights (MMI) tool for managed social communications.
Use CommandPost to:
• Track and measure cross-platform in real-time
• Identify and attribute high-value engagement
• Analyze and segment engaged audience
• Optimize content and engagement strategies
• Address compliance needs
MongoDB Management Service
• Free MongoDB Monitoring
• MongoDB Backup in the Cloud
• Free cloud service, or available to run on-prem with Standard or Enterprise subscriptions
• Automation coming soon. FTW!
Ops
Makes MongoDB easier to use and manage
Who Is MMS for?
• Developers
• Ops Team
• MongoDB Technical Service Team
How To Do Performance Tuning?
• Assess the problem and establish acceptable behavior.
• Measure the performance before modification.
• Identify the bottleneck.
• Remove the bottleneck.
• Measure performance after modification to confirm.
• Keep it or revert it and repeat.
Adapted from [http://en.wikipedia.org/wiki/Performance_tuning]
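The measure, change, re-measure loop above can be sketched in a few lines of JavaScript. This is only an illustration and not something from the talk: `median`, `evaluateChange`, and the sample timings are all hypothetical.

```javascript
// Median of a list of timings (ms); less sensitive to outliers than the mean.
function median(samples) {
  const sorted = [...samples].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Compare timings taken before and after a change against the
// "acceptable behavior" agreed on up front, then keep or revert.
function evaluateChange(beforeMs, afterMs, acceptableMs) {
  const before = median(beforeMs);
  const after = median(afterMs);
  return {
    before,
    after,
    acceptable: after <= acceptableMs,
    decision: after < before ? "keep" : "revert",
  };
}

// Hypothetical response times in milliseconds.
const result = evaluateChange([420, 390, 450], [120, 140, 110], 200);
console.log(result);
```

A change that is kept but still not `acceptable` simply means another pass through the loop.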
Concurrency
• What is it?
• How did it affect us?
• How did MMS help identify it?
• How did we diagnose the issue in our app and fix it?
• Today
Concurrency in MongoDB
• MongoDB uses a readers-writer lock
• Many read operations can share a read lock
• If a write lock exists, a single write operation holds the lock exclusively
• No other read or write operations can share the lock
• Locks are "writer-greedy"
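To make the locking rules above concrete, here is a toy, single-threaded model of a writer-greedy readers-writer lock. It is a sketch for intuition only, not MongoDB's implementation. The class name `WriterGreedyLock` is hypothetical, and so is the rule that new readers queue behind a waiting writer, which is one simple way to model "the write gets the lock first".

```javascript
class WriterGreedyLock {
  constructor() {
    this.activeReaders = 0;
    this.writerActive = false;
    this.waitingReaders = [];
    this.waitingWriters = [];
    this.grants = []; // order in which the lock was granted
  }

  acquire(kind, name) {
    if (kind === "read") {
      // Readers share the lock unless a writer holds it or is waiting.
      if (!this.writerActive && this.waitingWriters.length === 0) {
        this.activeReaders += 1;
        this.grants.push("read:" + name);
      } else {
        this.waitingReaders.push(name);
      }
    } else if (!this.writerActive && this.activeReaders === 0) {
      // A writer needs the lock exclusively.
      this.writerActive = true;
      this.grants.push("write:" + name);
    } else {
      this.waitingWriters.push(name);
    }
  }

  release(kind) {
    if (kind === "read") this.activeReaders -= 1;
    else this.writerActive = false;
    if (this.writerActive || this.activeReaders > 0) return;
    // Lock is free: a waiting writer wins over any waiting readers.
    if (this.waitingWriters.length > 0) {
      this.writerActive = true;
      this.grants.push("write:" + this.waitingWriters.shift());
      return;
    }
    while (this.waitingReaders.length > 0) {
      this.activeReaders += 1;
      this.grants.push("read:" + this.waitingReaders.shift());
    }
  }
}

const lock = new WriterGreedyLock();
lock.acquire("read", "r1");  // granted
lock.acquire("read", "r2");  // granted, shares the read lock
lock.acquire("write", "w1"); // waits for the readers to finish
lock.acquire("read", "r3");  // waits behind w1
lock.release("read");        // one reader still active
lock.release("read");        // lock free: w1 is granted before r3
lock.release("write");       // now r3 is granted
console.log(lock.grants);
```

The point of the model is the last three calls: while `w1` holds the lock nothing else runs, which is why long or frequent writes slow everything down.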
How Did This Affect Us?
• Slow API response times due to slow database operations
• Delayed writes
• Backed up queues
Lock % Greater than 100%?!?!?
• Lock % is the time spent in the write lock state. It is the sum of the global lock and the hottest database's lock at that time, which can push the value above 100%
• Global lock percentage is a derived metric:
  % of time in global lock (small number)
  +
  % of time locked by hottest ("most locked") database
• Because the data is sampled and combined, it is possible to see values over 100%.
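The sum described above is simple enough to write down. The function below is a sketch of that arithmetic only, assuming the two inputs are already available as percentages. `lockPercent` and the sample numbers are hypothetical, not MMS internals.

```javascript
// Derived lock % = global lock % + lock % of the hottest database.
function lockPercent(globalPct, perDatabasePct) {
  const values = Object.values(perDatabasePct);
  const hottest = values.length ? Math.max(...values) : 0;
  return globalPct + hottest;
}

// Hypothetical sample: a small global figure plus one very busy database.
const sample = lockPercent(8, { app: 97, logs: 12 });
console.log(sample); // 105
```

Because the two terms are sampled separately and then added, a result above 100 does not mean the database was locked for more than all of the time.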
Diagnosis
• Identified the write-heavy collections in our applications
• Used application logs to identify slow API responses
• Analyzed MongoDB logs to identify slow database queries
Message Queues
• Controlled writes to specific collections using Pub/Sub
  • We chose Amazon SQS
  • Other options include Redis, Beanstalkd, IronMQ or any other message queue
• Created a consistent flow of writes instead of bursts
• Reduced the length and frequency of write locks by controlling the flow/speed of writes
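The idea of turning bursts into a steady flow can be sketched without any particular queue service. The snippet below uses a plain in-memory array as the queue. It does not use the Amazon SQS API, and `scheduleWrites` and the batch size are hypothetical.

```javascript
// Split a burst of pending writes into fixed-size batches, one batch per tick.
// A real consumer would pull from the message queue and apply one batch
// per interval instead of building the whole schedule up front.
function scheduleWrites(pending, maxPerTick) {
  const ticks = [];
  for (let i = 0; i < pending.length; i += maxPerTick) {
    ticks.push(pending.slice(i, i + maxPerTick));
  }
  return ticks;
}

// A burst of 10 writes arrives at once; the database sees at most 4 per tick.
const burst = Array.from({ length: 10 }, (_, i) => ({ id: i }));
const ticks = scheduleWrites(burst, 4);
console.log(ticks.map((batch) => batch.length)); // [ 4, 4, 2 ]
```

Capping the writes applied per tick bounds how long the write lock is held in any one stretch, at the cost of some added write latency.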
Using Multiple Databases
• As of version 2.2, MongoDB implements locks at a per database granularity for most read and write operations
  • Planned to be at the document level in version 2.8
• Moved write-heavy collections to new (separate) databases
Using Sharding
• Improves concurrency by distributing collections across multiple mongod instances
• Locks are per-mongod instance
Indexing
• What is it?
• How did it affect us?
• How did MMS help identify it?
• How did we diagnose the issue in our app and fix it?
• Today
Indexing with MongoDB
• Support for efficient execution of queries
• Without indexes, MongoDB must scan every document in the collection
• Example
Wed Jul 17 13:40:14 [conn28600] query x.y [snip] ntoreturn:16 ntoskip:0
nscanned:16779 scanAndOrder:1 keyUpdates:0 numYields: 906 locks(micros)
r:46877422 nreturned:16 reslen:6948 38172ms
38 seconds! Scanned 17k documents, returned 16
• Create indexes to support all queries, especially common and user-facing ones
• Collection scans can push the entire working set out of RAM
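The numbers called out in the log line above can be pulled out programmatically. This is a rough sketch that assumes the 2.x-era log format shown on the slide. `parseSlowQuery` is a hypothetical helper, not part of any MongoDB tool.

```javascript
// The log line from the slide, joined back into a single string.
const line =
  "Wed Jul 17 13:40:14 [conn28600] query x.y [snip] ntoreturn:16 ntoskip:0 " +
  "nscanned:16779 scanAndOrder:1 keyUpdates:0 numYields: 906 locks(micros) " +
  "r:46877422 nreturned:16 reslen:6948 38172ms";

// Extract the fields that matter when hunting for unindexed queries.
function parseSlowQuery(logLine) {
  const field = (name) => {
    const m = logLine.match(new RegExp("\\b" + name + ":\\s*(\\d+)"));
    return m ? Number(m[1]) : null;
  };
  const duration = logLine.match(/(\d+)ms\s*$/);
  return {
    nscanned: field("nscanned"),
    nreturned: field("nreturned"),
    scanAndOrder: field("scanAndOrder") === 1, // sort done in memory, not via an index
    millis: duration ? Number(duration[1]) : null,
  };
}

const parsed = parseSlowQuery(line);
// Documents scanned per document returned; a well-indexed query is close to 1.
const scanRatio = parsed.nscanned / parsed.nreturned;
console.log(parsed, scanRatio);
```

Here the ratio is over a thousand documents scanned for each one returned, which is the signature of a missing or unsuitable index.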
How Did this Affect Us?
• Our web apps became slow
• Queries began to time out
• Longer operations mean longer lock times
MMS: Identifying Indexing Issues
Page Faults
• The number of times that MongoDB requires data not located in physical memory and must read it from disk via virtual memory.
Diagnosis
• Log Analysis
  • Use mtools to analyze MongoDB logs
    • mlogfilter
      • filter logs for slow queries, collection scans, etc.
    • mplotqueries
      • graph query response times and volumes
  • https://github.com/rueckstiess/mtools
Diagnosis
• Monitoring application logs
• Enabling "notablescan" option in development and testing versions of apps
• MongoDB profiling
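One way to turn on the "notablescan" option at runtime is the `setParameter` admin command. The helper below is a sketch: `setNoTableScan` is a hypothetical name, and `db` is assumed to be the mongo shell's database handle or anything exposing a compatible `adminCommand` method. Use it only on development and test instances.

```javascript
// With notablescan on, queries that would need a collection scan
// fail instead of running slowly.
function setNoTableScan(db, enabled) {
  return db.adminCommand({ setParameter: 1, notablescan: enabled ? 1 : 0 });
}

// In the mongo shell: setNoTableScan(db, true)
```

Failing fast in testing surfaces unindexed queries before they reach production.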
The MongoDB Profiler
• Collects fine-grained data about MongoDB write operations, cursors and database commands on a running mongod instance.
• The default slow operation threshold (slowOpThresholdMs) is 100 ms; it can be changed from the mongo shell
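As a sketch of what changing the threshold from the shell can look like: the two helpers below wrap the shell's `db.setProfilingLevel()` and a query on the `system.profile` collection. The helper names and the 50 ms figure are hypothetical, and `db` is assumed to be the mongo shell's database handle.

```javascript
// Profile only slow operations (level 1) and set the slow threshold in ms.
function profileSlowOps(db, slowMs) {
  return db.setProfilingLevel(1, slowMs);
}

// Most recent profiled operations that took longer than `minMillis`.
function recentSlowOps(db, minMillis, limit) {
  return db.system.profile
    .find({ millis: { $gt: minMillis } })
    .sort({ ts: -1 })
    .limit(limit);
}

// In the mongo shell:
//   profileSlowOps(db, 50)
//   recentSlowOps(db, 50, 5)
```

Profiling adds some overhead of its own, so level 1 with a sensible threshold is usually a safer starting point than profiling every operation.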
Our Remedies
• Add indexes!
• Make sure queries are covered
• Utilize the projection specification to limit fields (data) returned
Adding Indexes
• Improved performance for common queries
• Alleviates the need to go to disk for many operations
Projection Specification
Controls the amount of data that needs to be (de-)serialized for use in your app
• We used it to limit data returned in embedded documents and arrays
db.inventory.find( { type: 'food' }, { item: 1, qty: 1 } )
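Combining an index with a projection gives a covered query, where the index alone can answer the request. The sketch below extends the `inventory` example. The index definition and the helper name `coveredInventoryQuery` are assumptions, `db` is the mongo shell's database handle, and `createIndex` is the current shell method (2.x-era shells used `ensureIndex`).

```javascript
function coveredInventoryQuery(db) {
  // The index holds the filter field plus every field the query returns.
  db.inventory.createIndex({ type: 1, item: 1, qty: 1 });
  // _id is not in the index, so it must be excluded for the query to be covered.
  return db.inventory.find({ type: "food" }, { item: 1, qty: 1, _id: 0 });
}

// In the mongo shell: coveredInventoryQuery(db)
```

Leaving `_id` in the projection would force MongoDB to fetch each document, and the query would no longer be covered.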
Replication
• What is it?
• How did it affect us?
• How did MMS help identify it?
• How did we diagnose the issue in our app?
• How did we fix it?
• Today
What is Replication?
• A replica set is a group of mongod processes that maintain the same data set.
• Replica sets provide redundancy and high availability, and are the basis for all production deployments
What Is the Oplog?
• A special capped collection that keeps a rolling record of all operations that modify the data stored in your databases.
• Operations are first applied on the primary and then recorded to its oplog.
• Secondary members then copy and apply these operations in an asynchronous process.
What is Replication Lag?
• A delay between an operation on the primary and the application of that operation from the oplog to the secondary.
• Effects of excessive lag
  • "Lagged" members are ineligible to quickly become primary
  • Increases the possibility that distributed read operations will be inconsistent.
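Given that definition, lag can be estimated by comparing the time of the last operation applied on each secondary with the primary's. The sketch below works on an object shaped like the output of the shell's `rs.status()` (`members` with `name`, `stateStr` and `optimeDate`). The host names and timestamps are made up, and `replicationLagSeconds` is a hypothetical helper.

```javascript
// Seconds each secondary is behind the primary, keyed by member name.
function replicationLagSeconds(status) {
  const primary = status.members.find((m) => m.stateStr === "PRIMARY");
  if (!primary) return null; // no primary, nothing to compare against
  const lag = {};
  for (const member of status.members) {
    if (member.stateStr === "SECONDARY") {
      lag[member.name] = (primary.optimeDate - member.optimeDate) / 1000;
    }
  }
  return lag;
}

// Hypothetical replica set status.
const status = {
  members: [
    { name: "db1:27017", stateStr: "PRIMARY", optimeDate: new Date("2014-06-01T12:00:30Z") },
    { name: "db2:27017", stateStr: "SECONDARY", optimeDate: new Date("2014-06-01T12:00:28Z") },
    { name: "db3:27017", stateStr: "SECONDARY", optimeDate: new Date("2014-06-01T11:58:30Z") },
  ],
};
const lag = replicationLagSeconds(status);
console.log(lag);
```

In this made-up example `db2` is 2 seconds behind and `db3` is 120 seconds behind.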
How did this affect us?
• Degraded overall health of our production deployment.
• Distributed reads are no longer eventually consistent.
• Unable to bring new secondary members online.
• Caused MMS Backups to do full re-syncs.
Diagnosis
• Possible causes of replication lag include network latency, disk throughput, concurrency and/or write concern settings
• Size of operations to be replicated
• Confirmed non-issues for us
  • Network latency
  • Disk throughput
• Possible issues for us
  • Concurrency/write concern
  • Size of operations, because the entire document is written to the oplog
Concurrency/Write Concern
• Our applications apply many updates very quickly
• All operations need to be replicated to secondary members
• We use the default write concern: Acknowledged
  • The mongod confirms receipt of the write operation
  • Allows clients to catch network, duplicate key and other errors
Operation Size Was the Issue
Collection A (most active)
Total Updates: 3,373
Total Size of updates: 6.5 GB
Activity accounted for nearly 87% of total traffic
Collection B (next most active)
Total Updates: 85,423
Total Size of updates: 740 MB
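Dividing each total by its update count shows why Collection A dominated. The arithmetic below uses the figures from the slide and assumes decimal units (1 GB = 10^9 bytes, 1 MB = 10^6 bytes), since the slide does not say which convention was used.

```javascript
const GB = 1e9; // assumption: decimal units
const MB = 1e6;

const collections = [
  { name: "A", updates: 3373, bytes: 6.5 * GB },
  { name: "B", updates: 85423, bytes: 740 * MB },
];

// Average size of a single update, in bytes.
for (const c of collections) {
  c.avgBytes = c.bytes / c.updates;
  console.log(c.name, (c.avgBytes / 1000).toFixed(1) + " KB per update");
}

// How much larger an average update to A is than one to B.
const ratio = collections[0].avgBytes / collections[1].avgBytes;
console.log(Math.round(ratio) + "x");
```

Collection A averages roughly 1.9 MB per update against under 9 KB for Collection B, so far fewer updates produced far more oplog traffic.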
Fast-Growing oplog Causes Issues
Replication oplog Window: approximate hours available in the primary's oplog
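The relationship between write volume and the oplog window is a single division. The sketch below uses made-up numbers, and `oplogWindowHours` is a hypothetical helper. On a live replica set the measured figure comes from MMS or from the shell's `db.getReplicationInfo()`.

```javascript
// Approximate hours of history the oplog can hold at a given write rate.
function oplogWindowHours(oplogBytes, bytesWrittenPerHour) {
  return oplogBytes / bytesWrittenPerHour;
}

// Hypothetical 10 GB oplog.
const calm = oplogWindowHours(10e9, 0.5e9); // 0.5 GB of oplog entries per hour
const busy = oplogWindowHours(10e9, 5e9);   // 5 GB of oplog entries per hour
console.log(calm, busy); // 20 2
```

A tenfold increase in bytes written shrinks the window tenfold. A secondary or backup agent that falls further behind than the window can no longer catch up from the oplog and needs a full re-sync.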
How We Fixed It
• Changed our schema
• Changed the types of updates that were made to documents
• Both allowed us to utilize atomic operations
• Led to smaller updates
• Smaller updates == less oplog space used
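A rough way to see the effect of switching to atomic operations is to compare the size of a whole-document rewrite with a targeted update operator. The document below is invented, and JSON length is only a stand-in for BSON size. How an update is actually recorded in the oplog depends on the MongoDB version and the type of update.

```javascript
// Hypothetical document with a large embedded array.
const doc = {
  _id: "post-1",
  title: "Hello",
  likes: 41,
  comments: Array.from({ length: 200 }, (_, i) => ({ user: "u" + i, text: "comment " + i })),
};

// Option 1: read, modify and write back the entire document.
const fullReplace = { ...doc, likes: doc.likes + 1 };

// Option 2: an atomic operator that touches a single field.
const atomicUpdate = { $inc: { likes: 1 } };

// Serialized length as a rough proxy for the size of the operation.
const size = (value) => JSON.stringify(value).length;
console.log(size(fullReplace), size(atomicUpdate));
```

The atomic form stays the same size no matter how large the document grows, which is what keeps oplog usage down.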
Watch for Warnings
• You will be warned if
  • You are running outdated versions
  • You have startup warnings
  • A mongod is publicly visible
• Pay attention to these warnings
MMS Backups
• Engineered by MongoDB
• Continuous backup with point-in-time recovery
• Fully managed backups
Using MMS Backups
• Seeding new secondaries
• Repairing replica set members
• Development and testing databases
• Restores are free!
Summary
• Know what's expected and "normal" in your systems
• Know when and what changes in your systems
• Utilize MMS alerts, visualizations and warnings to keep things running smoothly
Developers: what we're focused on today. Track bottlenecks.
Ops team: great for small teams where your developers are also part of your ops team (DevOps). Monitor health of clusters, back up databases, automate updates and add capacity.
MongoDB technical service team: helps them help you
Important for us because we maintain a small tech team
PRO-TIP: Know what is "normal" for your system.
Know what changed when something happens, what you expect normal behavior to be, and what your normal MMS metrics are
A readers-writer lock allows concurrent read access to the database, but gives exclusive access to a single write operation
"Writer-greedy": when both a read and a write are waiting for a lock, MongoDB grants the lock to the write.
The exclusivity of write locks is one of the keys to why getting our lock % under control is so important.
Lock %
Time spent in the write lock state; the sum of the global lock and the hottest database at that time can make the value > 100%
Our issue: the primary database was showing a write lock % of 150-175%
Global lock percentage has remained about the same
Primary client-facing database has seen lock % drop
Developed by a MongoDB engineer
- Purple bar indicates downtime
- Alerts for down hosts, down agents and more
- According to Technical Services, in many cases, fixing warnings will fix issues