5. Some History
• 1970's Relational Databases Invented
– Storage is expensive
– Data is normalized
– Data storage is abstracted away from app
• 1980's RDBMS commercialized
– Client/Server model
– SQL becomes the standard
• 1990's Things begin to change
– Client/Server=> 3-tier architecture
– Internet and the Web
7. Some History
• 2000's Web 2.0
– "Social Media"
– E-Commerce
– Decrease of HW prices
– Increase of collected data
• Result
– Need to scale
• How do we keep up?
9. Developers
• Agile Development Methodology
– Shorter development cycles
– Constant evolution of requirements
– Flexibility at design time
• Relational Schema
– Hard to evolve
• must stay in sync with the application
18. MongoDB History
• Designed/developed by founders of DoubleClick, ShopWiki, Gilt Groupe, etc.
• First production site March 2008 – businessinsider.com
• Open source – AGPL, written in C++
• Version 0.8 – first official release, February 2009
• Version 2.4 – March 2013
23. Better Data Locality
• Data model means "entities" can reside "together"
• Optimize schema for read and write access patterns
• Minimize "seeks" as they dominate IO slowdown
• Failure to take advantage of the document model:
– no improved performance
– all of the disadvantages with none of the advantages!
– an incorrect model can also overshoot, embedding all data in a single document
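The locality idea above can be sketched with plain Python dicts standing in for documents; the post/comment shapes and helper names here are illustrative, not a MongoDB API:

```python
# Illustrative sketch: a blog post with its comments embedded in one
# document, so one read returns the post and its comments together
# (one seek), versus a normalized layout that needs extra lookups.
embedded_post = {
    "_id": 1,
    "title": "Better Data Locality",
    "author": "alice",
    "comments": [                       # related entities reside "together"
        {"user": "bob", "text": "Nice post"},
        {"user": "carol", "text": "Agreed"},
    ],
}

# Normalized equivalent: comments stored separately and joined by key,
# which costs additional reads/seeks at query time.
normalized_post = {"_id": 1, "title": "Better Data Locality", "author": "alice"}
normalized_comments = [
    {"post_id": 1, "user": "bob", "text": "Nice post"},
    {"post_id": 1, "user": "carol", "text": "Agreed"},
]

def fetch_embedded(post):
    """One document access returns everything."""
    return post["title"], post["comments"]

def fetch_normalized(post, comments):
    """Requires a second scan/lookup to gather the comments."""
    related = [c for c in comments if c["post_id"] == post["_id"]]
    return post["title"], related
```

Both functions return the same data; the difference is how many separate accesses (and disk seeks) it takes to gather it.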
25. In-memory Caching
• memory-mapped files
• caching handled by the OS
• naturally leaves the most frequently accessed data in RAM
• have enough RAM to fit indexes and the working data set for best performance
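The memory-mapped-file mechanism itself is easy to demonstrate; a minimal sketch using Python's stdlib `mmap` (the file and contents are made up):

```python
import mmap
import os
import tempfile

# The file's pages are mapped into the process address space; the OS
# page cache decides what stays in RAM -- the application just reads bytes.
path = os.path.join(tempfile.mkdtemp(), "data.bin")
with open(path, "wb") as f:
    f.write(b"x" * 4096)                # one page of data

with open(path, "r+b") as f:
    mm = mmap.mmap(f.fileno(), 0)       # map the whole file
    first = mm[:4]                      # reads go through the page cache
    mm[:4] = b"abcd"                    # writes dirty the cached page
    mm.flush()                          # OS writes dirty pages back to disk
    mm.close()
```

Because the OS owns the cache, frequently touched pages stay resident with no cache-management code in the application at all.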
27. Auto-Sharding
• Horizontal scaling is "built in" to the product
• Replication is for HA
• Sharding is for scaling
• Number of servers in a replica set is based on HA requirements
• Number of shards is based on capacity needed vs. single server/replica set capacity
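The two sizing decisions above reduce to simple arithmetic; a sketch with made-up capacity figures (the 90,000/20,000 ops numbers and helper names are illustrative):

```python
import math

def shards_needed(required_ops_per_sec, per_replset_ops_per_sec):
    """Shard count: total capacity needed vs. what one replica set can serve."""
    return math.ceil(required_ops_per_sec / per_replset_ops_per_sec)

def total_servers(num_shards, members_per_replica_set):
    """Each shard is a replica set; its member count is driven by HA needs."""
    return num_shards * members_per_replica_set

# e.g. 90,000 ops/sec needed at 20,000 ops/sec per replica set -> 5 shards;
# with 3-member replica sets that is 15 data-bearing servers.
```

The point of the split: adding replica set members improves availability, not write capacity; only adding shards adds capacity.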
28. MongoDB Performance*

                Top 5 Marketing Firm   Government Agency       Top 5 Investment Bank
Data            Key/value              10+ fields, arrays,     20+ fields, arrays,
                                       nested documents        nested documents
Queries         Key-based              Compound queries        Compound queries
                1–100 docs/query       Range queries           Range queries
                80/20 read/write       MapReduce               50/50 read/write
                                       20/80 read/write
Servers         ~250                   ~50                     ~40
Ops/sec         1,200,000              500,000                 30,000

* These figures are provided as examples. Your application governs your performance.
33. What
• There is one thing that is absolutely mandatory in order to succeed at capacity planning
• Without it, you will not be successful
• We must have REQUIREMENTS from the business
– without requirements, we're building a roadmap without knowing the desired destination
Imagine building a car without knowing its required top speed, acceleration, MPG, and cost.
35. What
• Availability: what is the uptime requirement?
• Throughput
– average reads/writes/users
– peak throughput?
– OPS (operations per second)? per hour? per day?
• Responsiveness
– what is acceptable latency?
– is higher latency during peak times acceptable?
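Turning these requirement questions into numbers is mostly unit conversion; an illustrative sketch (the daily volume and 4x peak factor are made-up example inputs):

```python
def avg_ops_per_sec(ops_per_day):
    """Convert a per-day operation count into an average rate."""
    return ops_per_day / 86_400          # seconds per day

def peak_ops_per_sec(ops_per_day, peak_factor):
    """Peaks are what you must provision for, not the daily average."""
    return avg_ops_per_sec(ops_per_day) * peak_factor

# 43.2M operations/day averages 500 ops/sec; a 4x peak means
# provisioning for 2,000 ops/sec.
```

This is why "per day? per hour?" matters on the slide: the same daily volume implies very different hardware depending on how peaky the traffic is.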
38. When
• At the beginning, before production – but after you launch you must continue the process
• Lack of future planning: failure to project the performance drop-off as the amount of data increases
• Process (steps) -> ACTIONS
– Requirements: ask, guess, try/measure
– Understand application needs
– Choose hardware to meet that pattern (...)
– Determine how many machines you need
– Monitor to recognize growth exceeding current capacity
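The last step – recognizing growth before it exceeds capacity – amounts to alerting on a headroom threshold rather than at 100%; a hypothetical sketch (the 80% threshold is an assumed policy, not a MongoDB default):

```python
def needs_more_capacity(current_ops, capacity_ops, headroom=0.8):
    """Flag growth early: alert once sustained load passes a headroom
    threshold (80% here) instead of waiting until capacity is exhausted,
    since provisioning new servers and moving data takes time."""
    return current_ops >= capacity_ops * headroom
```

Checking this on every monitoring interval gives the lead time that later slides note is needed for data movement and provisioning.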
39. Capacity Planning: What?
• Understand Resources
– Storage
– Memory
– CPU
– Network
• Understand Your Application
– Monitor and Collect Metrics
– Model to Predict Change
– Allocate and Deploy
– (repeat process)
40. Resource Usage
• Storage
– IOPS
– Size
– Data & Loading Patterns
• Memory
– Working Set
• CPU
– Speed
– Cores
• Network
– Latency
– Throughput
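For the Storage/Size line, a rough estimate is document count times average document size plus overheads; an illustrative sketch (the 15% index and 10% padding figures are placeholder assumptions, not MongoDB constants):

```python
def storage_bytes(doc_count, avg_doc_bytes, index_overhead=0.15, padding=0.1):
    """Rough data-size estimate: raw documents plus assumed index and
    padding overheads. Measure real overheads on a sample data set and
    substitute them for these placeholders."""
    return doc_count * avg_doc_bytes * (1 + index_overhead + padding)

# e.g. 1M documents averaging 1 KB each -> roughly 1.28 GB with these
# assumed overheads.
```

The same estimate feeds the Memory row: comparing it against available RAM tells you whether the working set can stay cached.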
58. Starter Questions
• What is the working set?
– How does that equate to memory?
– How much disk access will that require?
• How efficient are the queries?
• What is the rate of data change?
• How big are the highs and lows?
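The first two questions can be answered with back-of-the-envelope checks; a hypothetical sketch (the sizes, the 1 GB OS reserve, and the hit ratio are made-up inputs):

```python
def fits_in_ram(working_set_gb, index_gb, ram_gb, os_reserve_gb=1.0):
    """Best performance when indexes plus the working data set fit in RAM
    (leaving something for the OS); otherwise reads spill to disk."""
    return working_set_gb + index_gb <= ram_gb - os_reserve_gb

def disk_reads_per_sec(read_ops, cache_hit_ratio):
    """Reads that miss the in-memory working set become disk accesses."""
    return read_ops * (1 - cache_hit_ratio)

# e.g. a 20 GB working set with 5 GB of indexes fits on a 32 GB server;
# at 1,000 reads/sec a 95% hit ratio still means ~50 disk reads/sec.
```

That last number is what "how much disk access will that require" is asking: it must stay within what the storage layer's IOPS can sustain.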
59. Deployment Types
All of these use the same resources:
• Single Instance
• Multiple Instances (Replica Set)
• Cluster (Sharding)
• Data Centers
61. Monitoring
• CLI and internal status commands
• mongostat, mongotop, db.serverStatus()
• Plug-ins for Munin, Nagios, Cacti, etc.
• Integration via SNMP with other tools
• MMS
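mongostat derives per-second rates from successive counter snapshots reported by the server; a minimal sketch of that delta calculation, using a hand-written dict in the shape of the `opcounters` section of `db.serverStatus()` (the values are made up):

```python
def ops_rates(prev, curr, interval_sec):
    """Per-second operation rates from two counter snapshots taken
    interval_sec apart -- the core of what mongostat displays."""
    return {op: (curr[op] - prev[op]) / interval_sec for op in prev}

# Two opcounters-style snapshots taken 60 seconds apart:
prev = {"insert": 1000, "query": 5000, "update": 300}
curr = {"insert": 1600, "query": 5900, "update": 360}
# -> 10 inserts/sec, 15 queries/sec, 1 update/sec over that interval
```

Collecting these rates over time is the raw material for the modeling and growth monitoring described on the capacity planning slides.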
67. Velocity of Change
• Limitations -> changes take time
– Data Movement
– Allocation/Provisioning (servers/mem/disk)
• Improvement
– Limit the Size of Change (if you can)
– Increase Frequency
– MEASURE its effect
– Practice