Scaling Workshop
Provisioning and Capacity Planning
Brian Brazil
Founder
Who am I?
Engineer passionate about running software reliably in production.
● TCD CS Degree
● Google SRE for 7 years, working on high-scale reliable systems such as
Adwords, Adsense, Ad Exchange, Billing, Database
● Boxever TL Systems & Infrastructure, applied processes and technology to let the
company scale and reduce operational load
● Contributor to many open source projects, including Prometheus, Ansible,
Python, Aurora and Zookeeper.
● Founder of Robust Perception, making scalability and efficiency available to
everyone
Goals
At the end of the workshop you will be able to:
● Estimate how much spare capacity you have in less than 5 minutes
● Estimate how much runway that capacity provides
● Determine how many machines you need
● Spot common potential problems as you scale
This should set you up for your first 1-2 years, if not more
Audience
This is an introductory workshop to teach you the basics.
Your company:
● Uses Unix in production
● Has a relatively simple setup/small number of machines
● Operations primarily performed by developers
● Performance has not been a primary consideration in your product
I’m also going to focus on webservices-type systems rather than offline processing
or batch.
Capacity
Estimate your capacity in 3 easy steps!
1. Measure bottleneck resource at peak traffic
2. Divide to get fraction of limit
3. Divide peak traffic by that fraction
Estimate your capacity in 3 not so easy steps!
1. What’s your bottleneck? How do you measure it?
2. What’s your bottleneck’s limit?
3. What’s your peak traffic?
Step 1: What’s the bottleneck?
The most common bottlenecks:
1. CPU
2. Disk I/O
Less common: network, disk space, external resources, quotas, hardcoded limits,
contention/locking, memory, file descriptors, port numbers, humans
Step 1: Where’s the bottleneck?
Look at CPU % and Disk I/O Utilisation on each type of machine.
If you’ve monitoring, use that.
Failing that:
sudo apt-get install sysstat
iostat -x 5
Step 1: Iostat
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           4.24    0.00    1.18    0.98    0.00   93.60

Device:  rrqm/s  wrqm/s    r/s    w/s   rkB/s   wkB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
sda        0.00    1.40   0.00   3.80    0.00   45.20     23.79      0.00   1.05     0.00     1.05   0.84   0.32
sdb        0.00    1.40   0.00  21.00    0.00  267.20     25.45      0.09   4.11     0.00     4.11   4.11   8.64
sdc        0.00    1.40   0.00  20.00    0.00  267.20     26.72      0.06   3.24     0.00     3.24   3.24   6.48
md0        0.00    0.00   0.00   2.00    0.00    8.00      8.00      0.00   0.00     0.00     0.00   0.00   0.00
The numbers you care about are %idle and %util.
%idle is the percentage of CPU not in use, so CPU utilisation is 100 minus %idle.
%util is the percentage of disk I/O capacity in use; take the largest value across devices.
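If you would rather not read the numbers by eye, here is a minimal Python sketch that pulls them out of one report. It assumes the column layout shown above (newer sysstat versions print different columns), and the function name parse_iostat is made up for illustration. Note that the first report iostat prints covers the time since boot, so feed it a later one.

```python
SAMPLE_REPORT = """\
avg-cpu: %user %nice %system %iowait %steal %idle
4.24 0.00 1.18 0.98 0.00 93.60

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 1.40 0.00 3.80 0.00 45.20 23.79 0.00 1.05 0.00 1.05 0.84 0.32
sdb 0.00 1.40 0.00 21.00 0.00 267.20 25.45 0.09 4.11 0.00 4.11 4.11 8.64
sdc 0.00 1.40 0.00 20.00 0.00 267.20 26.72 0.06 3.24 0.00 3.24 3.24 6.48
md0 0.00 0.00 0.00 2.00 0.00 8.00 8.00 0.00 0.00 0.00 0.00 0.00 0.00
"""


def parse_iostat(report):
    """Return (cpu_used_percent, busiest_disk_util_percent) for one report."""
    rows = [line.split() for line in report.splitlines() if line.strip()]
    cpu_used = None
    disk_utils = []
    for index, fields in enumerate(rows):
        if fields[0] == "avg-cpu:":
            # The values sit on the next row; %idle is the last column.
            cpu_used = 100.0 - float(rows[index + 1][-1])
        elif fields[0].startswith("Device"):
            util_column = fields.index("%util")
            for device_row in rows[index + 1:]:
                if len(device_row) != len(fields):
                    break  # no longer a device row
                disk_utils.append(float(device_row[util_column]))
    if cpu_used is None or not disk_utils:
        raise ValueError("no avg-cpu or Device section found")
    return cpu_used, max(disk_utils)


cpu_used, disk_util = parse_iostat(SAMPLE_REPORT)
print(f"CPU in use: {cpu_used:.2f}%, busiest disk: {disk_util:.2f}%")
```

For the sample above this works out to 6.40% CPU in use and 8.64% on the busiest disk (sdb).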
Step 2: What’s the limit?
We now know the CPU and disk I/O usage on each machine at peak.
Which is the bottleneck though?
Need to know the limit. Rules of thumb:
● 80% limit for CPU
● 50% limit for Disk I/O
Step 2: Division
Find how full each CPU and disk is.
Say we had a disk 10% utilised, and a CPU 20% utilised (80% idle).
0.1/0.5 = 0.2 => Disk IO is at 20% of limit
0.2/0.8 = 0.25 => CPU is at 25% of limit
CPU is our bottleneck, with 25% of capacity used.
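The same division as a small Python sketch. The limits are the rules of thumb from the previous slide; the names LIMITS, fraction_of_limit and find_bottleneck are illustrative only.

```python
# Rules of thumb from the slides: safe limits as fractions of the resource.
LIMITS = {"cpu": 0.8, "disk_io": 0.5}


def fraction_of_limit(utilisation, limit):
    """How full a resource is relative to its safe limit (both as 0..1)."""
    return utilisation / limit


def find_bottleneck(utilisation, limits=LIMITS):
    """Return (resource_name, fraction_of_limit) for the fullest resource."""
    fractions = {
        name: fraction_of_limit(used, limits[name])
        for name, used in utilisation.items()
    }
    name = max(fractions, key=fractions.get)
    return name, fractions[name]


# The example from the slide: CPU 20% utilised, disk 10% utilised.
name, fraction = find_bottleneck({"cpu": 0.2, "disk_io": 0.1})
print(f"{name} is the bottleneck at {fraction:.0%} of its limit")
```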
Step 2: Utilisation Visualisation
Step 3: Peak traffic
Now that we know how full our bottleneck is, we need to know how much capacity
we have.
Figure out how much traffic you were handling around the time you measured CPU
and disk utilisation.
You might do this via monitoring, by parsing logs or, if you’re really stuck, with tcpdump.
Step 3: The 2nd division
Let’s say our queries per second (qps) was 10 around peak.
Our CPU was our bottleneck, and about 25% of our limit.
10/0.25 = 40qps
So we can currently handle a maximum traffic of around 40qps
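As a sketch in Python, assuming (as the slides do) that the bottleneck's utilisation grows roughly in line with traffic. The function name is hypothetical.

```python
def estimate_capacity_qps(peak_qps, fraction_of_limit):
    """Traffic the system could carry once the bottleneck reaches its limit."""
    if fraction_of_limit <= 0:
        raise ValueError("fraction_of_limit must be positive")
    return peak_qps / fraction_of_limit


# The example from the slide: 10qps at peak, bottleneck at 25% of its limit.
capacity = estimate_capacity_qps(peak_qps=10, fraction_of_limit=0.25)
spare = capacity - 10
print(f"capacity ~{capacity:.0f}qps, spare ~{spare:.0f}qps")
```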
Step 3: Capacity Visualisation
Now you can estimate your capacity in 3 easy steps!
1. Measure bottleneck resource at peak traffic
○ Use monitoring or iostat to measure utilisation, say CPU at 20%
2. Divide to get fraction of limit
○ With a limit of 80% for CPU, you’re 20/80 = 25% full
3. Divide peak traffic by that fraction
○ Traffic was 10qps, so 10/0.25 = 40qps capacity
Runway
How much runway do you have?
You now have a rough idea of how much capacity you have to spare.
In the example here, we’re using 10qps out of 40qps capacity.
How long will that 30qps last you?
The two main factors are new customers and organic growth.
New Customers
New customers/partners are your main source of traffic.
Look at your traffic graphs around the time a new customer started using your
system.
If the customer had, say, 1M users and you saw peak traffic increase by 10qps, you
can now predict how much traffic future customers will need.
Based on sales predictions, you can tell how much capacity you’ll need for new
customers.
Organic growth
Over time your existing customers/partners will use the system more and more,
new employees are hired, they get new customers etc.
Look at your monitoring’s traffic graphs over a few months to see what the trend is
like. Do your best to ignore the impact of launches.
Calculate your % growth month on month.
Starting out, it’s likely that organic growth will not be your main consideration.
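One way to get the month-on-month figure, sketched in Python. The peak qps values below are invented for illustration, and using a compound average over the whole period is an assumption rather than something the workshop prescribes.

```python
def month_on_month_growth(monthly_peak_qps):
    """Percentage change in peak traffic from each month to the next."""
    return [
        (current / previous - 1) * 100
        for previous, current in zip(monthly_peak_qps, monthly_peak_qps[1:])
    ]


def average_monthly_growth(monthly_peak_qps):
    """Compound average growth per month, as a percentage."""
    months = len(monthly_peak_qps) - 1
    ratio = monthly_peak_qps[-1] / monthly_peak_qps[0]
    return (ratio ** (1 / months) - 1) * 100


# Hypothetical peak qps for four consecutive months (not from the workshop).
peaks = [8.0, 8.4, 8.8, 9.3]
print([f"{growth:.1f}%" for growth in month_on_month_growth(peaks)])
print(f"average: {average_monthly_growth(peaks):.1f}% per month")
```

Take the readings from months without a launch where you can, so that new customers don't get counted as organic growth.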
Calculating runway
Once again in the example here, we’re using 10qps out of 40qps capacity.
Each 1M user customer generates 10qps of additional traffic.
You also expect a negligible amount of organic growth.
This means you can handle 3M more users’ worth of new customers (30qps spare at 10qps per 1M users).
If you’re signing up one 1M user customer per month, that gives you 3 months.
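The runway arithmetic as a Python sketch. It ignores organic growth, as the example does, and the function and parameter names are illustrative.

```python
def runway_months(capacity_qps, current_qps, qps_per_customer, customers_per_month):
    """Months until new customers use up the spare capacity."""
    spare_qps = capacity_qps - current_qps
    customers_that_fit = spare_qps / qps_per_customer
    return customers_that_fit / customers_per_month


# The example from the slide: 10qps used of 40qps, 10qps per 1M-user customer,
# one such customer signed per month.
months = runway_months(
    capacity_qps=40, current_qps=10, qps_per_customer=10, customers_per_month=1
)
print(f"runway: about {months:.0f} months")
```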
Provisioning
Provisioning vs Capacity Planning
Capacity Planning:
In 6 months I will have 7 new customers, and need to be able to handle 100qps in
total
Provisioning:
To handle 100qps I need X frontends and Y databases
Provisioning: What can a machine handle?
Continuing our example, let’s say we had 4 machines and each reported being at
CPU 20% (25% of the 80% limit) while dealing with 10qps each.
The key metric is qps per machine.
10qps / 0.2 of a machine = 50qps/machine at 100% CPU
You can only safely use 80% of the machine, so 50 * 0.8 = 40qps
So we can handle 40qps per machine.
Provisioning: How many machines do I need?
If we want to handle 100qps, we need 100/40 = 2.5 machines. So 3 machines.
For each type of machine, calculate the incoming external qps it can handle and
how many you need.
Don’t fret about $10/month worth of cost, it’s not worth your time.
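The two provisioning calculations above, sketched in Python. The function names are made up, and the 0.8 default is the CPU rule of thumb from earlier in the workshop.

```python
import math


def safe_qps_per_machine(qps, utilisation, limit=0.8):
    """qps one machine can serve while staying at or under its safe limit."""
    qps_at_full_utilisation = qps / utilisation
    return qps_at_full_utilisation * limit


def machines_needed(target_qps, per_machine_qps):
    """Whole machines required to serve target_qps; always round up."""
    return math.ceil(target_qps / per_machine_qps)


# The example from the slides: 10qps per machine at 20% CPU, target of 100qps.
per_machine = safe_qps_per_machine(qps=10, utilisation=0.2)
print(f"{per_machine:.0f}qps per machine")
print(f"{machines_needed(100, per_machine)} machines for 100qps")
```

Run this once per type of machine, using the external qps that type has to serve.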
Provisioning: Visualisation
Review: The Basics
● Estimating capacity:
○ Measure bottleneck at peak
○ Find how near bottleneck is to the limit
○ Calculate spare capacity based on peak traffic
● Keep an eye on new customers/partners and organic growth to track runway
● For provisioning, calculate qps/machine for each type of machine
Life is not Basic
A few wrinkles
I’ve glossed over a lot of detail so you can go away from today’s workshop with
something you can immediately use.
Some questions ye may have:
● Why measure at peak traffic?
● What if I don’t have much traffic?
● Why 80% limit on CPU and 50% on disk?
● What if a machine fails?
● What if things aren’t that simple?
● Doesn’t autoscaling take care of all this for me?
Why measure at peak traffic?
As your utilisation increases:
● Latency increases
● Performance decreases
In addition, the skew due to a constant background of CPU
usage is decreased.
Measuring at peak helps allow for these factors.
Beware the knee.
What if I don’t have much traffic?
If you don’t have enough traffic to show up in top or iotop, then these techniques
won’t help you much.
You could loadtest, but that takes time. Or use rules of thumb.
Easier way: Use latency to estimate throughput.
If your queries take 10ms, then you can probably handle 100/s
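The latency rule of thumb as a Python sketch. The concurrency parameter is an assumption added for illustration: with its default of 1 the function reproduces the slide's figure, which treats queries as handled one at a time.

```python
def throughput_from_latency(latency_ms, concurrency=1):
    """Rough queries per second if each query takes latency_ms to serve."""
    if latency_ms <= 0:
        raise ValueError("latency_ms must be positive")
    return 1000.0 / latency_ms * concurrency


# The example from the slide: 10ms per query.
print(f"about {throughput_from_latency(10):.0f} queries/s")
```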
Why 80% limit on CPU and 50% on disk?
For CPU, the utilisation/latency curve means you want to avoid running at too high a
utilisation.
If you have the CPU to yourself, 90-95% is safe in a controlled environment with
good loadtesting. This is uncommon, so leave a safety margin for OS processes etc.
For spinning disks the impact of utilisation tends to be more problematic, and
background tasks tend to use a lot of disk.
What if a machine fails?
You should generally add 2 extra machines beyond what you need to serve peak
qps. This is commonly known as “n+2”.
This is to allow for one machine failure, and to let you take down a machine to
push a new binary, perform maintenance or whatever.
This also gives you some slack in your capacity. As you grow, more sophisticated
math is required.
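A minimal Python sketch of n+2, reusing the 100qps target and 40qps per machine from the provisioning example. The function name is hypothetical, and a fixed spare of 2 is the simple rule for a small fleet, not the more sophisticated math you need later.

```python
import math


def machines_to_provision(peak_qps, per_machine_qps, spare_machines=2):
    """Machines needed to serve peak, plus spares for failure and maintenance."""
    serving = math.ceil(peak_qps / per_machine_qps)
    return serving + spare_machines


# 100qps at 40qps per machine needs 3 machines to serve, so provision 5.
print(machines_to_provision(peak_qps=100, per_machine_qps=40))
```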
What if things aren’t that simple?
Lots of other issues can throw a spanner in the works.
● Heterogeneous machines
● Varying machine performance
● Varying traffic mixes
● Multiple datacenters
● Multi-tiered services
As a general rule try to keep things simple. A perfect model is brittle and usually
takes more time than it’s worth.
Doesn’t autoscaling take care of all this for me?
Short answer
No
Long answer
Haha, Haha.
No
Doesn’t autoscaling take care of all this for me?
EC2 Autoscaling can eliminate some of the day-to-day work in provisioning
servers.
There’s operational and complexity overhead, as you have to maintain images and
systems that can be spun up.
You have to wait for instances to spin up, so you can’t rely on it completely for sudden
spikes. You need to do the math to tune it to be able to handle spikes.
You still have to tune everything. Control systems are hard.
Wrapping Up
Monitoring Matters
A common thread through this workshop is that monitoring is what should be
providing you the information you need to make operational decisions.
Make sure you have a good monitoring system.
Logs are not monitoring, though better than nothing.
I recommend Prometheus.io: If it didn’t exist I would have created it.
Production Matters
Provisioning and capacity planning is just one aspect of production. There are many
others involved with running your company:
● Deployment
● Change Management
● Configuration Management
● Reliability
● Architecture
● Design Feasibility
● Cost Management
● Performance Tuning
● SLAs
● Contract Sanity Check
● Debugging
● Alerting
● Oncall
● Incident Management
Robust Perception can help you with all of this and more.
Questions?
Blog: www.robustperception.io/blog
Twitter: @RobustPerceiver
Email: brian.brazil@robustperception.io
Linkedin: https://ie.linkedin.com/in/brianbrazil

Provisioning and Capacity Planning Workshop (Dogpatch Labs, September 2015)

  • 1. Scaling Workshop Provisioning and Capacity Planning Brian Brazil Founder
  • 2. Who am I? Engineer passionate about running software reliably in production. ● TCD CS Degree ● Google SRE for 7 years, working on high-scale reliable systems such as Adwords, Adsense, Ad Exchange, Billing, Database ● Boxever TL Systems&Infrastructure, applied processes and technology to let allow company to scale and reduce operational load ● Contributor to many open source projects, including Prometheus, Ansible, Python, Aurora and Zookeeper. ● Founder of Robust Perception, making scalability and efficiency available to everyone
  • 3. Goals At the end of the workshop you will be able to: ● Estimate how much spare capacity you have in less than 5 minutes ● Estimate how much runway that capacity provides ● Determine how many machines you need ● Spot common potential problems as you scale This should set you up for your first 1-2 years, if not more
  • 4. Audience This is an introductory workshop to teach you the basics. Your company: ● Uses Unix in production ● Has a relatively simple setup/small number of machines ● Operations primarily performed by developers ● Performance has not been a primary consideration in your product I’m also going to focus on webservices-type systems rather than offline processing or batch.
  • 6. Estimate your capacity in 3 easy steps! 1. Measure bottleneck resource at peak traffic 2. Divide to get fraction of limit 3. Multiply by peak traffic
  • 7. Estimate your capacity in 3 not so easy steps! 1. What’s your bottleneck? How do you measure it? 2. What’s your bottleneck’s limit? 3. What’s your peak traffic?
  • 8. Step 1: What’s the bottleneck? The most common bottlenecks: 1. CPU 2. Disk I/O Less common: network, disk space, external resources, quotas, hardcoded limits, contention/locking, memory, file descriptors, port numbers, humans
  • 9. Step 1: Where’s the bottleneck? Look at CPU % and Disk I/O Utilisation on each type of machine. If you’ve monitoring, use that. Failing that: sudo apt-get install sysstat iostat -x 5
  • 10. Step 1: Iostat avg-cpu: %user %nice %system %iowait %steal %idle 4.24 0.00 1.18 0.98 0.00 93.60 Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util sda 0.00 1.40 0.00 3.80 0.00 45.20 23.79 0.00 1.05 0.00 1.05 0.84 0.32 sdb 0.00 1.40 0.00 21.00 0.00 267.20 25.45 0.09 4.11 0.00 4.11 4.11 8.64 sdc 0.00 1.40 0.00 20.00 0.00 267.20 26.72 0.06 3.24 0.00 3.24 3.24 6.48 md0 0.00 0.00 0.00 2.00 0.00 8.00 8.00 0.00 0.00 0.00 0.00 0.00 0.00 The numbers you care about are %idle and %util. %idle is the amount of CPU not in use. %util is the amount of disk I/O in use, take the biggest one.
  • 11. Step 2: What’s the limit? We now know the CPU and disk I/O usage on each machine at peak. Which is the bottleneck though? Need to know the limit. Rules of thumb: ● 80% limit for CPU ● 50% limit for Disk I/O
  • 12. Step 2: Division Find how full each CPU and disk is. Say we had a disk 10% utilised, and a CPU 20% utilised (80% idle). 0.1/0.5 = 0.2 => Disk IO is at 20% of limit 0.2/0.8 = 0.25 => CPU is at 25% of limit CPU is our bottleneck, with 25% of capacity used.
  • 13. Step 2: Utilisation Visualisation
  • 14. Step 3: Peak traffic Now that we know how full our bottleneck is, we need to know how much capacity we have. Figure out how much traffic you were handling around the time you measured CPU and disk utilisation. You might do this via monitoring, by parsing logs or, if you’re really stuck, with tcpdump.
  • 15. Step 3: The 2nd division Let’s say our queries per second (qps) was 10 around peak. Our CPU was our bottleneck, and about 25% of our limit. 10/0.25 = 40qps So we can currently handle a maximum traffic of around 40qps
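As a sketch of this second division (the function names are illustrative, not from the workshop):

```python
def max_capacity_qps(peak_qps, fraction_of_limit):
    """Estimate the maximum traffic the bottleneck can take.

    peak_qps: traffic measured at peak.
    fraction_of_limit: how far the bottleneck was towards its limit (0..1].
    """
    if fraction_of_limit <= 0:
        raise ValueError("fraction_of_limit must be positive")
    return peak_qps / fraction_of_limit


def spare_capacity_qps(peak_qps, fraction_of_limit):
    """Capacity left over beyond the current peak."""
    return max_capacity_qps(peak_qps, fraction_of_limit) - peak_qps


# The slide's example: 10qps at peak, CPU at 25% of its limit.
print(max_capacity_qps(10, 0.25), spare_capacity_qps(10, 0.25))
```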
  • 16. Step 3: Capacity Visualisation
  • 17. Now you can estimate your capacity in 3 easy steps! 1. Measure bottleneck resource at peak traffic ○ Use monitoring or iostat to see how close you are to the limit, say CPU is 20% utilised 2. Divide to get fraction of limit ○ With a limit of 80% for CPU, you’re at 20/80 = 25% of the limit 3. Multiply by peak traffic ○ Being at 25% of the limit means capacity is 1/0.25 = 4 times peak traffic. Traffic was 10qps, so 10 × 4 = 10/0.25 = 40qps capacity
  • 19. How much runway do you have? You now have a rough idea of how much capacity you have to spare. In the example here, we’re using 10qps out of 40qps capacity. How long will that 30qps last you? The two main factors are new customers and organic growth.
  • 20. New Customers New customers/partners are your main source of traffic. Look at your traffic graphs around the time a new customer started using your system. If the customer had, say, 1M users and you saw a 10qps increase in peak traffic, you can now predict how much traffic future customers will need. Based on sales predictions, you can tell how much capacity you’ll need for new customers.
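One way to turn that observation into a prediction is a simple per-user rate. The sketch below assumes traffic scales linearly with a customer's user count, which is a simplification; the function names are illustrative.

```python
def qps_per_user(added_peak_qps, customer_users):
    """Peak qps added per user, measured from one customer launch."""
    if customer_users <= 0:
        raise ValueError("customer_users must be positive")
    return added_peak_qps / customer_users


def predicted_peak_qps(future_users, rate):
    """Extra peak qps expected from future customers with this many users in total."""
    return future_users * rate


# The slide's example: a 1M-user customer added 10qps at peak.
rate = qps_per_user(added_peak_qps=10, customer_users=1_000_000)
# Hypothetical sales forecast: customers totalling 3M users.
print(f"{predicted_peak_qps(3_000_000, rate):.1f} qps of extra peak traffic")
```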
  • 21. Organic growth Over time your existing customers/partners will use the system more and more, new employees are hired, they get new customers etc. Look at your monitoring’s traffic graphs over a few months to see what the trend is like. Do your best to ignore the impact of launches. Calculate your % growth month on month. Starting out, it’s likely that organic growth will not be your main consideration.
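Month-on-month growth can be estimated from two peak readings taken from the traffic graphs. This is a rough sketch that assumes steady compound growth between the two readings; the figures in the example are hypothetical.

```python
def monthly_growth_rate(first_peak_qps, last_peak_qps, months_between):
    """Compound month-on-month growth between two peak-traffic readings.

    Returns a fraction, e.g. 0.05 for 5% per month.
    """
    if first_peak_qps <= 0 or months_between <= 0:
        raise ValueError("need a positive starting qps and a positive number of months")
    return (last_peak_qps / first_peak_qps) ** (1.0 / months_between) - 1.0


# Hypothetical readings: peak went from 8qps to 10qps over 3 months,
# with launch-driven jumps already excluded.
rate = monthly_growth_rate(8.0, 10.0, 3)
print(f"organic growth: {rate:.1%} per month")
```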
  • 22. Calculating runway Once again in the example here, we’re using 10qps out of 40qps capacity, leaving 30qps spare. Each 1M user customer generates 10qps of additional traffic. You also expect a negligible amount of organic growth. This means you can handle 30/10 = 3 more customers, or 3M more users’ worth of new customers. If you’re signing up one 1M user customer per month, that gives you 3 months.
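The runway calculation can be sketched as a month-by-month simulation, which also lets organic growth be added later. The model (growth applied to existing traffic, then new customers added at the end of each month) and all names are assumptions made for illustration.

```python
def runway_months(capacity_qps, current_qps, qps_per_customer,
                  customers_per_month, monthly_growth=0.0, max_months=120):
    """Number of whole months before peak traffic would exceed capacity.

    monthly_growth is organic growth as a fraction (0.05 = 5% per month).
    max_months caps the search so zero growth cannot loop forever.
    """
    traffic = current_qps
    for month in range(max_months):
        next_traffic = (traffic * (1.0 + monthly_growth)
                        + qps_per_customer * customers_per_month)
        if next_traffic > capacity_qps:
            return month
        traffic = next_traffic
    return max_months


# The slide's example: 10qps used of 40qps, one 10qps customer per month.
print(runway_months(capacity_qps=40, current_qps=10,
                    qps_per_customer=10, customers_per_month=1))
```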
  • 24. Provisioning vs Capacity Planning Capacity Planning: In 6 months I will have 7 new customers, and need to be able to handle 100qps in total Provisioning: To handle 100qps I need X frontends and Y databases
  • 25. Provisioning: What can a machine handle? Continuing our example, let’s say we had 4 machines and each reported CPU at 20% (25% of the 80% limit) while dealing with 10qps each. The key metric is qps per machine. 10qps uses 0.2 of a machine, so 10/0.2 = 50qps for a whole machine. You can only safely use 80% of the machine, so 50 × 0.8 = 40qps. So we can handle 40qps per machine.
  • 26. Provisioning: How many machines do I need? If we want to handle 100qps, we need 100/40 = 2.5 machines. So 3 machines. For each type of machine, calculate the incoming external qps it can handle and how many you need. Don’t fret about $10/month worth of cost, it’s not worth your time.
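The two provisioning slides above reduce to a division and a round-up. A minimal sketch, with illustrative function names and the 80% CPU rule of thumb as the default limit:

```python
import math


def qps_per_machine(measured_qps, cpu_util, cpu_limit=0.8):
    """Safe qps for one machine, scaled up from a measurement at peak.

    measured_qps: qps the machine handled when measured.
    cpu_util: its CPU utilisation at that time (0..1).
    """
    if cpu_util <= 0:
        raise ValueError("cpu_util must be positive")
    return measured_qps / cpu_util * cpu_limit


def machines_needed(target_qps, per_machine_qps):
    """Whole machines required to serve target_qps (always rounds up)."""
    return math.ceil(target_qps / per_machine_qps)


per_machine = qps_per_machine(measured_qps=10, cpu_util=0.2)
print(per_machine, machines_needed(100, per_machine))
```

Rounding up rather than to the nearest integer matters here: 2.5 machines means 3, never 2.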
  • 28. Review: The Basics ● Estimating capacity: ○ Measure bottleneck at peak ○ Find how near bottleneck is to the limit ○ Calculate spare capacity based on peak traffic ● Keep an eye on new customers/partners and organic growth to track runway ● For provisioning, calculate qps/machine for each type of machine
  • 29. Life is not Basic
  • 30. A few wrinkles I’ve glossed over a lot of detail so you can go away from today’s workshop with something you can immediately use. Some questions ye may have: ● Why measure at peak traffic? ● What if I don’t have much traffic? ● Why 80% limit on CPU and 50% on disk? ● What if a machine fails? ● What if things aren’t that simple? ● Doesn’t autoscaling take care of all this for me?
  • 31. Why measure at peak traffic? As your utilisation increases: ● Latency increases ● Performance decreases In addition, the skew caused by constant background CPU usage is smaller at peak. Measuring at peak helps allow for these factors. Beware the knee: the point on the utilisation/latency curve where latency starts rising sharply.
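To get a feel for the knee, a textbook single-server queue (M/M/1) is a handy toy model: its mean response time is the unloaded service time divided by (1 - utilisation). This is an assumption brought in purely for illustration; real systems will not follow the curve exactly, and the workshop does not prescribe this model.

```python
def relative_latency(utilisation):
    """Mean latency as a multiple of unloaded latency in an M/M/1 queue.

    utilisation is a fraction in the range [0, 1).
    """
    if not 0.0 <= utilisation < 1.0:
        raise ValueError("utilisation must be in [0, 1)")
    return 1.0 / (1.0 - utilisation)


for u in (0.5, 0.8, 0.9, 0.95):
    print(f"{u:.0%} utilised -> {relative_latency(u):.1f}x the unloaded latency")
```

In this model latency grows slowly at low utilisation and very quickly near 100%, which is why a measurement taken at low load says little about behaviour near the limit.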
  • 32. What if I don’t have much traffic? If you don’t have enough traffic to show up in top or iotop, then these techniques won’t help you much. You could loadtest, but that takes time. Or use rules of thumb. Easier way: Use latency to estimate throughput. If your queries take 10ms, then you can probably handle 100/s
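The latency rule of thumb is just the reciprocal of the time per query. A minimal sketch, assuming each worker handles one query at a time and latency stays constant under load (both simplifications); `concurrent_workers` is a hypothetical parameter added for illustration.

```python
def estimated_qps(latency_ms, concurrent_workers=1):
    """Rough throughput estimate from per-query latency.

    Assumes each worker handles queries one after another at a constant latency.
    """
    if latency_ms <= 0:
        raise ValueError("latency_ms must be positive")
    return concurrent_workers * 1000.0 / latency_ms


# The slide's example: 10ms per query is about 100 queries per second.
print(estimated_qps(10))
```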
  • 33. Why 80% limit on CPU and 50% on disk? For CPU, the utilisation/latency curve means you want to avoid running at too high a utilisation. If you have the CPU to yourself, 90-95% is safe in a controlled environment with good loadtesting. This is uncommon, so leave a safety margin for OS processes etc. For spinning disks the impact of utilisation tends to be more problematic, and background tasks tend to use a lot of disk.
  • 34. What if a machine fails? You generally should add 2 extra machines beyond what you need to serve peak qps. This is commonly known as “n+2”. This is to allow for one machine failure, and to let you take down a machine to push a new binary, perform maintenance or whatever. This also gives you some slack in your capacity. As you grow, more sophisticated math is required.
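A quick sketch of n+2 on top of the machine count, with a check that peak traffic is still covered when two machines are out. The names are illustrative, and the 40qps per machine figure is carried over from the earlier example.

```python
import math


def machines_to_provision(peak_qps, qps_per_machine, spares=2):
    """Machines needed for peak (n), plus spares for failure and maintenance."""
    needed = math.ceil(peak_qps / qps_per_machine)
    return needed + spares


def can_serve_peak(total_machines, machines_down, peak_qps, qps_per_machine):
    """True if the machines still up can carry peak traffic."""
    return (total_machines - machines_down) * qps_per_machine >= peak_qps


total = machines_to_provision(peak_qps=100, qps_per_machine=40)
# One machine failed and one down for a push: still fine with n+2.
print(total, can_serve_peak(total, 2, 100, 40))
```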
  • 35. What if things aren’t that simple? Lots of other issues can throw a spanner in the works. ● Heterogeneous machines ● Varying machine performance ● Varying traffic mixes ● Multiple datacenters ● Multi-tiered services As a general rule try to keep things simple. A perfect model is brittle and usually takes more time than it’s worth.
  • 38. Doesn’t autoscaling take care of all this for me? Short answer: No. Long answer: Haha, haha. No.
  • 39. Doesn’t autoscaling take care of all this for me? EC2 Autoscaling can eliminate some of the day-to-day work in provisioning servers. There’s operational and complexity overhead, as you have to maintain images and systems that can be spun up. You have to wait for instances to spin up, so you can’t rely on it completely for sudden spikes. You need to do math to tune it to be able to handle spikes. You still have to tune everything. Control systems are hard.
  • 41. Monitoring Matters A common thread through this workshop is that monitoring is what should be providing you the information you need to make operational decisions. Make sure you have a good monitoring system. Logs are not monitoring, though better than nothing. I recommend Prometheus.io: If it didn’t exist I would have created it.
  • 42. Production Matters Provisioning and capacity planning are just one aspect of production. There are many others involved with running your company: ● Deployment ● Change Management ● Configuration Management ● Reliability ● Architecture ● Design Feasibility ● Cost Management ● Performance Tuning ● SLAs ● Contract Sanity Check ● Debugging ● Alerting ● Oncall ● Incident Management Robust Perception can help you with all of this and more.
  • 43. Questions? Blog: www.robustperception.io/blog Twitter: @RobustPerceiver Email: brian.brazil@robustperception.io Linkedin: https://ie.linkedin.com/in/brianbrazil