A Systematic Approach to Capacity Planning in the Real World
1. @Twitter | Velocity 2013 1
Bryce Yan, Arun Kejariwal
(@bryce_yan, @arun_kejariwal)
Capacity Engineering @ Twitter
June 2013
3. @Twitter | Velocity 2013 3
Approaches to Capacity Planning
• Throw hardware at the problem
• Reactive approach
o How much?
o What kind? (Inventory management etc.)
Poor UX
Bottom line
4. @Twitter | Velocity 2013 4
Capacity Planning is Non-trivial
• Organic growth
Over 200M monthly active users [1]
• Events planned or unplanned
Events/incidents (e.g., Super Bowl ’13 blackout; data center and cloud outages [2, 3])
Behavioral response
o Demographics, Cultural
o Retweets, Photos, Vines
Tax different services/applications
o Different capacity requests
[1] https://twitter.com/twitter/status/281051652235087872
[2] http://arstechnica.com/information-technology/2012/10/hurricane-sandy-takes-data-centers-offline-with-flooding-power-outages/
[3] http://www.zdnet.com/amazons-compute-cloud-has-a-networking-hiccup-7000005776/
5. @Twitter | Velocity 2013 5
Capacity Planning is Non-trivial (cont’d)
• Evolving product development landscape
New features
New products
• New hardware platforms
Purchase pipeline
How much and when to buy – Cost performance trade-off
• Overall goal
User Experience
Operational footprint
7. @Twitter | Velocity 2013 7
Capacity Modeling
• Takes core drivers as inputs to generate usage demand
Forecasts the amount of work based on core driver projections
• Relates the work metric to a primary resource to identify the capacity threshold
Primary resources
o Computing power (CPU, RAM)
o Storage (disk I/O, disk space)
o Network (network bandwidth)
• Generate hardware demand based on the limiting primary resource
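One way to read "limiting primary resource" is the resource that needs the most servers to absorb the forecast work. The sketch below follows that reading. The function name, resource names, units, and all numbers are hypothetical and are not from the talk.

```python
import math

def servers_needed(demand, per_server_capacity):
    """Return per-resource server counts and the limiting resource.

    Both arguments map a resource name to a value in the same unit.
    The limiting resource is taken to be the one that requires the
    most servers (an assumption about what the slide means).
    """
    counts = {
        resource: math.ceil(demand[resource] / per_server_capacity[resource])
        for resource in demand
    }
    limiting = max(counts, key=counts.get)
    return counts, limiting

# Hypothetical forecast demand and per-server thresholds.
demand = {"cpu_cores": 960, "disk_iops": 150_000, "network_gbps": 40}
per_server = {"cpu_cores": 16, "disk_iops": 5_000, "network_gbps": 1}

counts, limiting = servers_needed(demand, per_server)
print(counts, limiting)
```

With these numbers CPU needs 60 servers, disk I/O 30, and network 40, so CPU drives the hardware demand.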
8. @Twitter | Velocity 2013 8
Core Drivers
• Underlying business metrics that drive demand for more capacity
Active Users
Tweets per second (TPS)
Favorites per second (FPS)
Requests per second (RPS)
• Normalized by Active Users to isolate user engagement
• Project user engagement and Active Users independently
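The normalization above can be sketched as follows: divide the core driver by Active Users to get engagement, project engagement and Active Users separately, then multiply the projections to recover the core driver forecast. This is a minimal sketch that assumes a straight-line trend for both series; the helper names and the history values are hypothetical.

```python
def linear_fit(ys):
    """Least-squares line through (0, ys[0]), (1, ys[1]), ...

    Returns (slope, intercept).
    """
    n = len(ys)
    x_mean = (n - 1) / 2
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(ys))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean

def project(ys, steps):
    """Extend the fitted line `steps` periods past the end of ys."""
    slope, intercept = linear_fit(ys)
    n = len(ys)
    return [intercept + slope * (n + k) for k in range(steps)]

# Hypothetical history, in millions.
active_users = [100.0, 110.0, 120.0, 130.0]
photos_per_day = [50.0, 66.0, 84.0, 104.0]

# Normalize the core driver by Active Users to isolate engagement.
engagement = [p / u for p, u in zip(photos_per_day, active_users)]

# Project the two series independently, then recombine.
users_forecast = project(active_users, steps=2)
engagement_forecast = project(engagement, steps=2)
photos_forecast = [u * e for u, e in zip(users_forecast, engagement_forecast)]
print(photos_forecast)
```

A non-linear trend would need a different fit for that series; only the projection helper changes, not the recombination step.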
9. @Twitter | Velocity 2013 9
Core Drivers (cont’d)
[Figure: Normalized Core Drivers for Engagement. Per-active-user values vs. time for Favorites (polynomial trend) and Retweets (linear trend)]
[Figure: Active Users aka User Growth. Active user count vs. time, with linear trend]
10. @Twitter | Velocity 2013 10
Core Drivers (cont’d)
[Figure: User Growth: Active Users vs. time, with linear trend]
[Figure: Engagement: Photos/Active User vs. time, with linear trend]
[Figure: Core Driver: Photos per Day vs. time, actuals and forecast]
11. @Twitter | Velocity 2013 11
Capacity Threshold
• Primary resource scalability threshold
Determined by load testing
o Synthetic load
o Replaying production traffic
o Real-time production traffic
Test systems may be
o Isolated replicas of production
o Staging systems in production
o Production systems
[Figure: Average Response Times vs CPU. Service response time plotted against CPU, with one point on the curve marked X]
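A load test of this kind yields pairs of resource utilization and response time, and the threshold can be read off as the highest utilization that still meets a response-time limit. The sketch below assumes such a limit exists; the function name, the 100 ms limit, and the sample points are hypothetical.

```python
def capacity_threshold(samples, max_response_ms):
    """Highest tested CPU level whose response time stays within the limit.

    samples: (cpu_percent, avg_response_ms) pairs from a load test.
    The scan stops at the first level that breaches the limit, so a
    later dip below the limit is not counted. Returns None if even the
    lowest tested level breaches the limit.
    """
    threshold = None
    for cpu, response_ms in sorted(samples):
        if response_ms > max_response_ms:
            break
        threshold = cpu
    return threshold

# Hypothetical load-test results: (CPU %, average response time in ms).
load_test = [(20, 45), (40, 48), (60, 55), (70, 70), (80, 120), (90, 400)]

threshold = capacity_threshold(load_test, max_response_ms=100)
print(threshold)
```

Here the response time first exceeds the limit at 80% CPU, so 70% is taken as the per-server threshold.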
12. @Twitter | Velocity 2013 12
Hardware Demand
• Core driver → capacity threshold → scaling formula → server count
• Example
Core driver: Requests per Second
Per server request throughput determined by capacity threshold
Scaling formula for Sizing
Number of Servers = (RPS) / Per Server Threshold
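Applied to a forecast, the sizing formula above becomes a one-line calculation per period. The sketch rounds up, since a fractional server cannot be provisioned; the RPS values and the per-server threshold are hypothetical.

```python
import math

def servers_for(rps, per_server_rps):
    """Number of Servers = RPS / Per Server Threshold, rounded up."""
    return math.ceil(rps / per_server_rps)

# Hypothetical RPS forecast and per-server threshold from load testing.
rps_forecast = [42_000, 45_250, 50_000]
per_server_rps = 500

server_forecast = [servers_for(rps, per_server_rps) for rps in rps_forecast]
print(server_forecast)  # [84, 91, 100]
```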
[Figure: Core driver (RPS) and server count vs. time. Series: RPS (Actuals), RPS (Forecast), # Servers (Actuals), # Servers (Forecast)]
14. @Twitter | Velocity 2013 14
Capacity Planning Methodology
• Predict expected value based on historical and temporal statistical analysis
Metrics
o Average, Standard deviation, 95th, 99th percentile
Techniques
o Moving Average – EMA (exponential moving average)
o Correlation
o β analysis
o MACD
o Forecasting – ARIMA
• Limitations
Changing usage patterns
o Organic growth, behavioral, cultural
Event driven
o Super Bowl: how will the game turn out?
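The summary metrics and the EMA named on this slide can be computed with a few lines of standard-library code. This is a minimal sketch: the smoothing factor 2 / (span + 1) and the nearest-rank percentile are common conventions chosen here as assumptions, and the sample values are hypothetical.

```python
import math
import statistics

def ema(values, span):
    """Exponential moving average with alpha = 2 / (span + 1).

    The first output equals the first input (no warm-up period).
    """
    alpha = 2 / (span + 1)
    smoothed = [values[0]]
    for v in values[1:]:
        prev = smoothed[-1]
        smoothed.append(prev + alpha * (v - prev))
    return smoothed

def percentile(values, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical response-time samples in milliseconds, with one spike.
response_ms = [120, 135, 128, 410, 131, 126, 133, 129, 138, 125]

print(statistics.mean(response_ms), statistics.stdev(response_ms))
print(percentile(response_ms, 95), percentile(response_ms, 99))
print(ema(response_ms, span=3))
```

The single spike dominates the high percentiles but moves the average much less, which is one reason to track both.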
15. @Twitter | Velocity 2013 15
Capacity Planning Methodology (contd.)
• Correlation Analysis
Assess the relation between resource metric(s) and core driver
Caution: Correlation does not imply causation
[Figure: Core Driver, Network, and CPU vs. time]
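The relation between a core driver and each resource metric can be measured with the Pearson correlation coefficient. The sketch below assumes that is the coefficient meant here; the series values are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical samples taken at the same timestamps.
rps = [100, 200, 300, 400, 500]          # core driver
cpu = [12, 22, 31, 43, 52]               # CPU utilization, %
network = [5.0, 5.2, 4.9, 5.1, 5.0]      # network throughput, Gbps

cpu_corr = pearson(rps, cpu)
network_corr = pearson(rps, network)
print(cpu_corr, network_corr)
```

In this made-up data CPU tracks the core driver closely while network does not, which would point to CPU as the resource to model against RPS.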
17. @Twitter | Velocity 2013 17
Capacity Planning Methodology (contd.)
[Figure: Rolling Correlation vs. time]
• Correlation varies over time
Growing user base
New products, features
• Rolling correlation analysis – captures the time-varying nature
Raw time series
EMA
Challenge: What should the window width be?
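A rolling correlation recomputes the coefficient over a sliding window, so a change in the relationship shows up as a change in the output series. This sketch uses Pearson correlation over fixed-width windows of the raw series; the window width of 4 and the data are hypothetical, and the code assumes no window is constant.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def rolling_correlation(xs, ys, window):
    """Correlation over each run of `window` consecutive points."""
    return [
        pearson(xs[end - window:end], ys[end - window:end])
        for end in range(window, len(xs) + 1)
    ]

# Hypothetical series: the resource follows the driver, then diverges.
driver = [1, 2, 3, 4, 5, 6, 7, 8]
resource = [2, 4, 6, 8, 10, 9, 8, 7]

rolling = rolling_correlation(driver, resource, window=4)
print(rolling)
```

A narrow window reacts quickly but is noisy; a wide window is smoother but lags, which is the trade-off behind the window-width question above.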
20. @Twitter | Velocity 2013 20
Capacity Planning Methodology (contd.)
• β varies over time
New products, features
New metric to log
[Figure: Rolling Beta vs. time]
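The slides do not define β. The sketch below assumes the usual finance-style definition, the covariance of the resource metric with the core driver divided by the variance of the core driver, which is also the least-squares slope. The window width and data are hypothetical.

```python
def beta(driver, resource):
    """cov(driver, resource) / var(driver): resource change per unit of driver."""
    n = len(driver)
    mean_d = sum(driver) / n
    mean_r = sum(resource) / n
    cov = sum((d - mean_d) * (r - mean_r) for d, r in zip(driver, resource))
    var = sum((d - mean_d) ** 2 for d in driver)
    return cov / var

def rolling_beta(driver, resource, window):
    """Beta over each run of `window` consecutive points."""
    return [
        beta(driver[end - window:end], resource[end - window:end])
        for end in range(window, len(driver) + 1)
    ]

# Hypothetical series: CPU cost per request doubles partway through,
# as it might after a new feature ships.
rps = [100, 200, 300, 400, 500, 600]
cpu = [10, 20, 30, 60, 80, 100]

betas = rolling_beta(rps, cpu, window=3)
print(betas)
```

In this example β moves from about 0.1 to about 0.2, so the same RPS forecast would translate into roughly twice the CPU demand.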
21. @Twitter | Velocity 2013 21
Capacity Planning Methodology (contd.)
• Growth: Detecting breakout
MACD: Moving Average Convergence Divergence
Difference between the n-width and m-width EMAs (n > m)
Diverging EMAs
o Commonly used as a buy/sell signal in the context of a stock
o Early detection of a potential capacity ask
[Figure: MACD and MACD Signal vs. time]
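A minimal sketch of MACD-based breakout detection, using the conventional definition: the MACD line is the short-width EMA minus the long-width EMA, and the signal line is an EMA of the MACD line. The widths 12, 26, and 9 are the customary defaults from stock charting, not values given in the talk, and the traffic series is hypothetical.

```python
def ema(values, span):
    """Exponential moving average with alpha = 2 / (span + 1)."""
    alpha = 2 / (span + 1)
    smoothed = [values[0]]
    for v in values[1:]:
        prev = smoothed[-1]
        smoothed.append(prev + alpha * (v - prev))
    return smoothed

def macd(values, fast=12, slow=26, signal=9):
    """Return (macd_line, signal_line) for a series."""
    fast_ema = ema(values, fast)
    slow_ema = ema(values, slow)
    macd_line = [f - s for f, s in zip(fast_ema, slow_ema)]
    signal_line = ema(macd_line, signal)
    return macd_line, signal_line

# Hypothetical core driver: flat for 30 periods, then steady growth.
series = [100.0] * 30 + [100.0 + 5 * k for k in range(1, 21)]

macd_line, signal_line = macd(series)

# Flag the first period where the MACD line rises above its signal line.
breakout = next(
    i for i, (m, s) in enumerate(zip(macd_line, signal_line)) if m > s
)
print(breakout)
```

While the series is flat both EMAs coincide and the MACD line stays at zero; once growth starts the short EMA pulls ahead of the long one, which is the divergence the slide describes.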
23. @Twitter | Velocity 2013 23
Join the Flock
• We are hiring!!
https://twitter.com/JoinTheFlock
https://twitter.com/jobs
Contact us: @bryce_yan, @arun_kejariwal
Like problem solving?
Like challenges?
Be at the cutting edge
Make an impact