Capacity Management: John Allspaw, Operations Engineering
the book I’m writing
???
Rules of Thumb; Planning/Forecasting; Stupid Capacity Tricks (with some Flickr statistics sprinkled in)
Things that can cause downtime * (should be the last thing you need to worry about)
Capacity != Performance
Thank You, HPC Industry! A lot of great deployment and management tricks come from them, adopted by web ops
Good Measurement Tools
Clouds need planning too
Metrics
Metrics: photos processed per minute, average processing time per photo, apache requests, concurrent busy apache procs
Metrics: here, total CPU = ~1.12 * # busy apache procs (ymmv)
2400 photos per minute being uploaded right NOW (Tuesday afternoon)
Ceilings: the maximum amount of “work” your resources will allow before degradation or failure
Forget Benchmarking
Find your ceilings. Know what you have left. The End.
Use real, live production data to find ceilings. Production: “it’s like a lab, but bigger!”
Like database ceilings: replication lag is bad!
Ceilings: waiting on disk. Sustained disk I/O wait above 40% creates slave lag* (*for us, YMMV)
35,000 photo requests per second on a Tuesday peak
Safety Factors
Safety Factors Ceiling * Factor of Safety = UR LIMITZ
Safety Factors webserver!
Safety Factors: “safe” ceiling @ 85% CPU. 85% total CPU = ~76 busy apache procs; everything below that is what you have left
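The arithmetic behind the ~76 figure can be sketched as follows. This is a minimal illustration, assuming the linear relation from the Metrics slide (total CPU = ~1.12 * busy apache procs) holds all the way up to the safe ceiling. The function names are hypothetical and the current-peak value in the usage line is made up.

```python
CPU_PER_BUSY_PROC = 1.12  # % total CPU per busy apache proc (from the Metrics slide, ymmv)
SAFE_CPU_CEILING = 85.0   # "safe" ceiling in % total CPU (from this slide)


def safe_proc_limit(cpu_ceiling_pct: float, cpu_per_proc_pct: float) -> int:
    """Busy-proc count at which total CPU reaches the safe ceiling."""
    return round(cpu_ceiling_pct / cpu_per_proc_pct)


def headroom(limit: int, current_peak: int) -> int:
    """What you have left: the gap between the safe limit and today's peak."""
    return limit - current_peak


limit = safe_proc_limit(SAFE_CPU_CEILING, CPU_PER_BUSY_PROC)
print(limit)                # 76
print(headroom(limit, 50))  # 26, using a made-up current peak of 50 busy procs
```

The 1.12 coefficient is specific to the hardware and workload on the slide, so it would need to be re-measured for any other setup.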
Safety Factors: a Yahoo Front Page link to Chinese New Year photos caused an 8% spike in photo requests/second
Forecasting
Forecasting Fictional Example: webservers
Forecasting: fictional example, 15 webservers, 1 week. Note the peak of the week
Forecasting: ...bigger sample, 6 weeks... isolate the peaks...
Forecasting: ...“Add a Trendline” with some decent correlation... now not too shabby
Forecasting: 15 servers @ 76 busy apache proc limit = 1140 total procs. That is the ceiling, and the gap below it is what you have left. When do we hit it? The trendline will tell you
Forecasting: (1140 - 726) / 42.751 = 9.68 (week #10, duh)
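The trendline step can be sketched in a few lines. This is a hedged illustration, not the spreadsheet method from the slides: it reads 726 as the trendline's intercept and 42.751 as its slope (busy procs per week), which is an assumption about a chart that is not reproduced in this transcript. The function names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def week_ceiling_is_hit(ceiling, slope, intercept):
    """Solve ceiling = slope * x + intercept for x (in weeks)."""
    return (ceiling - intercept) / slope


# Numbers from the slide: 15 servers * 76 procs = 1140 total procs.
ceiling = 15 * 76
print(round(week_ceiling_is_hit(ceiling, 42.751, 726), 2))  # 9.68
```

In practice `fit_line` would be fed the isolated weekly peaks (week number, peak busy procs), and its output passed to `week_ceiling_is_hit`.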
Forecasting Automation: use http://fityk.sf.net to automate the curve-fit
Forecasting Fictional Example: storage consumption
Forecasting Automation: actual Flickr storage consumption from early 2005, in GB (ceiling is fictional). The fitted curve will tell you when the ceiling is reached
Forecasting Automation: cmd line script output

    jallspaw:~]$ cfityk ./fit-storage.fit
    1> # Fityk script. Fityk version: 0.8.2
    2> @0 < '/home/jallspaw/storage-consumption.xy'
    15 points. No explicit std. dev. Set as sqrt(y)
    3> guess Quadratic
    New function %_1 was created.
    4> fit
    Initial values:  lambda=0.001  WSSR=464.564
    #1:  WSSR=0.90162  lambda=0.0001  d(WSSR)=-463.663  (99.8059%)
    #2:  WSSR=0.736787  lambda=1e-05  d(WSSR)=-0.164833  (18.2818%)
    #3:  WSSR=0.736763  lambda=1e-06  d(WSSR)=-2.45151e-05  (0.00332729%)
    #4:  WSSR=0.736763  lambda=1e-07  d(WSSR)=-3.84524e-11  (5.21909e-09%)
    Fit converged.
    Better fit found (WSSR = 0.736763, was 464.564, -99.8414%).
    5> info formula in @0
    # storage-consumption
    14147.4+146.657*x+0.786854*x^2
    6> quit
    bye...

Forecasting Automation (SAME)
fityk gave: y = 0.786854x^2 + 146.657x + 14147.4 (R^2 = 99.84)
Excel gave: y = 0.7675x^2 + 146.96x + 14147.3 (R^2 = 99.84)
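Once the curve is fitted, finding when it crosses the ceiling is a matter of solving a quadratic. The sketch below uses the fityk coefficients shown above; the 20000 GB ceiling is a hypothetical value (the slides only say the ceiling is fictional), and the unit of x is whatever time step the data file used, which the transcript does not state.

```python
import math

# Coefficients from the fityk output: y = A*x^2 + B*x + C, y in GB.
A, B, C = 0.786854, 146.657, 14147.4


def storage_at(x: float) -> float:
    """Fitted storage consumption (GB) at time step x."""
    return A * x * x + B * x + C


def time_until_ceiling(ceiling_gb: float) -> float:
    """Positive root of A*x^2 + B*x + (C - ceiling) = 0; 0.0 if already past it."""
    if ceiling_gb <= C:
        return 0.0
    discriminant = B * B - 4 * A * (C - ceiling_gb)
    return (-B + math.sqrt(discriminant)) / (2 * A)


HYPOTHETICAL_CEILING_GB = 20000.0  # assumption, not from the slides
x = time_until_ceiling(HYPOTHETICAL_CEILING_GB)
print(f"ceiling reached at x = {x:.1f}")
```

Extrapolating a quadratic far beyond the sampled range is risky, so a forecast like this is best refreshed as new data points arrive.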
Capacity Health
High and Low Water Marks: per server, squid requests per second. Alert if higher than the high mark; alert if lower than the low mark
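A water-mark check is simple enough to sketch directly. The thresholds and function name below are hypothetical; only the high/low alerting idea and the ceiling concept come from the slides.

```python
def check_water_marks(value: float, low: float, high: float) -> str:
    """Classify a per-server metric against low and high water marks."""
    if value > high:
        return "high"  # approaching the ceiling: alert
    if value < low:
        return "low"   # suspiciously idle (e.g. out of rotation): alert
    return "ok"


# Hypothetical marks for squid requests/second on one server.
LOW_MARK, HIGH_MARK = 100.0, 950.0

for reqs_per_sec in (40.0, 600.0, 1010.0):
    print(reqs_per_sec, check_water_marks(reqs_per_sec, LOW_MARK, HIGH_MARK))
```

The low mark matters as much as the high one: a server doing far less work than its peers is often broken or misconfigured rather than healthy.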
A good dashboard looks something like... (yes, fictional numbers)

type      #   limit/box  ceiling units  limit (total)  current (peak)  % peak  est. days left
www       20  80         busy procs     1600           1000            62.50%  36
shard db  20  40         I/O wait       800            220             27.50%  120
squid     18  950        req/sec        17,100         11,400          66.67%  48
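The derived columns of such a dashboard can be computed from three inputs per resource type. This is a rough sketch using the fictional numbers from the slide; the `Resource` class and its field names are hypothetical. The "est. days left" column is left out because it depends on a growth forecast that the slide does not show.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    kind: str
    boxes: int
    limit_per_box: float
    units: str
    current_peak: float

    @property
    def total_limit(self) -> float:
        """Cluster-wide ceiling: number of boxes times the per-box limit."""
        return self.boxes * self.limit_per_box

    @property
    def percent_of_peak(self) -> float:
        """How much of the cluster-wide ceiling the current peak uses."""
        return 100.0 * self.current_peak / self.total_limit


ROWS = [
    Resource("www", 20, 80, "busy procs", 1000),
    Resource("shard db", 20, 40, "I/O wait", 220),
    Resource("squid", 18, 950, "req/sec", 11400),
]

for r in ROWS:
    print(f"{r.kind:<9} {r.total_limit:>7,.0f} {r.units:<11} {r.percent_of_peak:6.2f}%")
```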
Diagonal Scaling: vertically scaling your already horizontal nodes
Diagonal Scaling example: image processing, 4 cores vs. 8 cores (about the same CPU “usage” per box)
Diagonal Scaling example: image processing throughput. ~45 images/min @ peak vs. ~140 images/min @ peak (same CPU usage, but ~3x more work). “Processing” means making 4 sizes from originals
Diagonal Scaling example: image processing
went from: 23 Dell PE860s, 3008.4 Watts, 1035 photos/min, 23U rack
to: 8 HP DL140 G3s, 1036.8 Watts, 1120 photos/min (75% faster, even), 8U rack !!!
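The before/after figures can be normalized per rack unit and per watt to make the comparison explicit. This is a small sketch using only the numbers on the slide; the `Fleet` class and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Fleet:
    name: str
    boxes: int
    watts: float
    photos_per_min: float
    rack_units: int

    def per_rack_unit(self) -> float:
        """Throughput per U of rack space."""
        return self.photos_per_min / self.rack_units

    def per_watt(self) -> float:
        """Throughput per watt drawn."""
        return self.photos_per_min / self.watts


before = Fleet("Dell PE860", 23, 3008.4, 1035, 23)
after = Fleet("HP DL140 G3", 8, 1036.8, 1120, 8)

for fleet in (before, after):
    print(f"{fleet.name}: {fleet.per_rack_unit():.0f} photos/min per U, "
          f"{fleet.per_watt():.2f} photos/min per watt")
print(f"throughput per watt improved {after.per_watt() / before.per_watt():.1f}x")
```

The per-U figures (45 and 140) line up with the per-box throughput quoted on the previous slide.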
3.52 terabytes will be consumed today (on a Tuesday)
2nd Order Effects (beware the wandering bottleneck) running hot, so add more
2nd Order Effects (beware the wandering bottleneck) running great now, so more traffic! now these run hot
Stupid Capacity Tricks
Stupid Capacity Tricks: quick and dirty management
DSH http://freshmeat.net/projects/dsh

    [root@netmon101 ~]# cat group.of.servers
    www100
    www118
    dbcontacts3
    admin1
    admin2

Stupid Capacity Tricks: quick and dirty management

    [root@netmon101 ~]# dsh -N group.of.servers
    dsh> date
    executing 'date'
    www100:  Mon Jun 23 14:14:53 UTC 2008
    www118:  Mon Jun 23 14:14:53 UTC 2008
    dbcontacts3:  Mon Jun 23 07:14:53 PDT 2008
    admin1:  Mon Jun 23 14:14:53 UTC 2008
    admin2:  Mon Jun 23 14:14:53 UTC 2008
    dsh>
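Output like this is easy to post-process. The sketch below parses the `host: date` lines from the slide and flags any host whose timezone differs from the majority (here, dbcontacts3 on PDT). It does not invoke dsh itself; the function name is hypothetical, and the parser assumes the default `date` output format shown above.

```python
from collections import Counter

SAMPLE = """\
dsh> date
executing 'date'
www100:  Mon Jun 23 14:14:53 UTC 2008
www118:  Mon Jun 23 14:14:53 UTC 2008
dbcontacts3:  Mon Jun 23 07:14:53 PDT 2008
admin1:  Mon Jun 23 14:14:53 UTC 2008
admin2:  Mon Jun 23 14:14:53 UTC 2008
dsh>
"""


def timezone_outliers(dsh_output: str) -> dict[str, str]:
    """Map each host whose timezone differs from the majority to its timezone."""
    zones = {}
    for line in dsh_output.splitlines():
        host, sep, rest = line.partition(":")
        fields = rest.split()
        # Expect: weekday, month, day, HH:MM:SS, timezone, year.
        if sep and len(fields) == 6:
            zones[host.strip()] = fields[4]
    if not zones:
        return {}
    majority, _count = Counter(zones.values()).most_common(1)[0]
    return {host: zone for host, zone in zones.items() if zone != majority}


print(timezone_outliers(SAMPLE))  # {'dbcontacts3': 'PDT'}
```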
Stupid Capacity Tricks: Turn Stuff OFF
Stupid Capacity Tricks: Turn Stuff OFF: uploads (photo), uploads (video), uploads by email, various API things, various mobile things, various search things, etc., etc.
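One common way to make features switchable is a set of flags that request handlers consult. The sketch below is an assumption about how that could look, not a description of Flickr's mechanism (the slides do not show one); all flag names and functions are hypothetical.

```python
# Hypothetical feature flags, loosely mirroring the list on the slide.
FEATURES = {
    "photo_uploads": True,
    "video_uploads": True,
    "email_uploads": True,
    "api": True,
    "mobile": True,
    "search": True,
}


def turn_off(flags: dict[str, bool], *names: str) -> dict[str, bool]:
    """Return a copy of the flags with the named features disabled."""
    unknown = [name for name in names if name not in flags]
    if unknown:
        raise KeyError(f"unknown features: {unknown}")
    updated = dict(flags)
    for name in names:
        updated[name] = False
    return updated


def is_enabled(flags: dict[str, bool], name: str) -> bool:
    """Unknown features count as disabled, so a typo fails closed."""
    return flags.get(name, False)


degraded = turn_off(FEATURES, "video_uploads", "search")
print(is_enabled(degraded, "photo_uploads"))  # True
print(is_enabled(degraded, "search"))         # False
```

Returning a copy rather than mutating in place keeps the normal configuration intact, so switching everything back on is just a matter of using `FEATURES` again.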
Stupid Capacity Tricks: Outages Happen
Stupid Capacity Tricks: Hit the Pause Button
thanks
http://flickr.com/photos/bondidwhat/402089763/
http://flickr.com/photos/74876632@N00/2394833962/
http://flickr.com/photos/42311564@N00/220394633/
http://flickr.com/photos/unloveable/2422483859/
http://flickr.com/photos/absolutwade/149702085/
http://flickr.com/photos/krawiec/521836276/
http://flickr.com/photos/eschipul/1560875648/
http://flickr.com/photos/library_of_congress/2179060841/
http://flickr.com/photos/jekkyl/511187885/
http://flickr.com/photos/ab8wn/368021672/
http://flickr.com/photos/jaxxon/165559708/
http://flickr.com/photos/sparktography/75499095/
We’re Hiring! flickr.com/jobs Come see me!
questions?

Capacity Management from Flickr

  • 1.
  • 2. the book I’m writing
  • 3. ???
  • 4. Rules of Thumb Planning/Forecasting Stupid Capacity Tricks (with some Flickr statistics sprinkled in)
  • 5.
  • 6.
  • 7.
  • 8.
  • 9.
  • 10.
  • 11.
  • 12.
  • 13. 2400 photos per minute being uploaded right NOW (Tuesday afternoon)
  • 14. Ceilings the most amount of “work” your resources will allow before degradation or failure
  • 16. Find your ceilings what you have left The End
  • 17. Use real live production data to find ceilings Production: “it’s like a lab, but bigger!”
  • 18. Like: database ceilings replication lag: bad!
  • 19. Ceilings waiting on disk too much sustained disk I/O wait for >40% creates slave lag* *for us, YMMV
  • 20. 35,000 photo requests per second on a Tuesday peak
  • 22. Safety Factors Ceiling * Factor of Safety = UR LIMITZ
  • 24. “ safe” ceiling @85% CPU Safety Factors 85% total CPU = ~76 busy apache procs what you have left
  • 25. Safety Factors Yahoo Front Page link to Chinese NewYear Photos (photo requests/second) (8% spike)
  • 28. Forecasting Fictional example: 15 webservers. 1 week. peak of the week
  • 29. ...bigger sample, 6 weeks....isolate the peaks... Forecasting
  • 30. ...”Add a Trendline” with some decent correlation... Forecasting now not too shabby
  • 31. Forecasting 15 servers @76 busy apache proc limit = 1140 total procs when is this? this will tell you when it is ceiling what you have left
  • 32. Forecasting (week #10, duh) (1140-726) / 42.751 = 9.68
  • 33.
  • 34. Forecasting Fictional Example: storage consumption
  • 35. Forecasting Automation actual flickr storage consumption from early 2005, in GB (ceiling is fictional) this will tell you when this is
  • 36. Forecasting Automation cmd line script output jallspaw:~]$cfityk ./fit-storage.fit 1> # Fityk script. Fityk version: 0.8.2 2> @0 < '/home/jallspaw/storage-consumption.xy' 15 points. No explicit std. dev. Set as sqrt(y) 3> guess Quadratic New function %_1 was created. 4> fit Initial values: lambda=0.001 WSSR=464.564 #1: WSSR=0.90162 lambda=0.0001 d(WSSR)=-463.663 (99.8059%) #2: WSSR=0.736787 lambda=1e-05 d(WSSR)=-0.164833 (18.2818%) #3: WSSR=0.736763 lambda=1e-06 d(WSSR)=-2.45151e-05 (0.00332729%) #4: WSSR=0.736763 lambda=1e-07 d(WSSR)=-3.84524e-11 (5.21909e-09%) Fit converged. Better fit found (WSSR = 0.736763, was 464.564, -99.8414%). 5> info formula in @0 # storage-consumption 14147.4+146.657*x+0.786854*x^2 6> quit bye...
  • 37. Forecasting Automation (SAME) fityk gave: y = 0.786854x 2 + 146.657x + 14147.4 ( R 2 = 99.84) Excel gave: y = 0.7675x 2 + 146.96x + 14147.3 ( R 2 = 99.84)
  • 38.
  • 39. High and Low Water Marks alert if higher alert if lower Per server, squid requests per second
  • 40. A good dashboard looks something like... (yes, fictional numbers) type # limit/box ceiling units limit (total) current (peak) % peak Est days left www 20 80 busy procs 1600 1000 62.50% 36 shard db 20 40 I/O wait 800 220 27.50% 120 squid 18 950 req/sec 17,100 11,400 66.67% 48
  • 41.
  • 42. Diagonal Scaling example: image processing 4 cores 8 cores (about the same CPU “usage” per box)
  • 43. ~45 images/min @ peak ~140 images/min @ peak (same CPU usage, but ~3x more work) “ processing” means making 4 sizes from originals Diagonal Scaling example: image processing throughput
  • 44. Diagonal Scaling example: image processing 3008.4 Watts 1036.8 Watts went from: 23 Dell PE860s 8 HP DL140 G3s to: 1035 photos/min 1120 photos/min ( 75% faster, even) 23U rack 8U rack !!!
  • 45. 3.52 terabytes will be consumed today (on a Tuesday)
  • 46. 2nd Order Effects (beware the wandering bottleneck) running hot, so add more
  • 47. 2nd Order Effects (beware the wandering bottleneck) running great now, so more traffic! now these run hot
  • 49. Stupid Capacity Tricks: quick and dirty management. DSH http://freshmeat.net/projects/dsh
      [root@netmon101 ~]# cat group.of.servers
      www100
      www118
      dbcontacts3
      admin1
      admin2
  • 50. Stupid Capacity Tricks: quick and dirty management
      [root@netmon101 ~]# dsh -N group.of.servers
      dsh> date
      executing 'date'
      www100: Mon Jun 23 14:14:53 UTC 2008
      www118: Mon Jun 23 14:14:53 UTC 2008
      dbcontacts3: Mon Jun 23 07:14:53 PDT 2008
      admin1: Mon Jun 23 14:14:53 UTC 2008
      admin2: Mon Jun 23 14:14:53 UTC 2008
      dsh>
  • 51.
  • 52. Stupid Capacity Tricks: Turn Stuff OFF: uploads (photo), uploads (video), uploads by email, various API things, various mobile things, various search things, etc., etc.
  • 53.
  • 54.
  • 55. thanks http://flickr.com/photos/bondidwhat/402089763/ http://flickr.com/photos/74876632@N00/2394833962/ http://flickr.com/photos/42311564@N00/220394633/ http://flickr.com/photos/unloveable/2422483859/ http://flickr.com/photos/absolutwade/149702085/ http://flickr.com/photos/krawiec/521836276/ http://flickr.com/photos/eschipul/1560875648/ http://flickr.com/photos/library_of_congress/2179060841/ http://flickr.com/photos/jekkyl/511187885/ http://flickr.com/photos/ab8wn/368021672/ http://flickr.com/photos/jaxxon/165559708/ http://flickr.com/photos/sparktography/75499095/
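
A minimal sketch of what slides 35-37 are after: take the quadratic that fityk reported and solve for when the fitted curve reaches the ceiling. The function name `time_to_ceiling` and the 20,000 GB ceiling are illustrative assumptions (the ceiling on the slide is fictional too), and x is in whatever units the samples in storage-consumption.xy use, which the slides do not state.

```python
import math


def time_to_ceiling(a, b, c, ceiling):
    """Return the smallest x > 0 where a*x^2 + b*x + c reaches `ceiling`.

    Returns None when the fitted curve never reaches the ceiling for x > 0
    (for example, when consumption is already past it).
    """
    k = c - ceiling  # solve a*x^2 + b*x + k = 0
    if a == 0:
        # Degenerate (linear) fit.
        if b == 0:
            return None
        x = -k / b
        return x if x > 0 else None
    disc = b * b - 4 * a * k
    if disc < 0:
        return None  # the curve never touches the ceiling
    root = math.sqrt(disc)
    candidates = [(-b - root) / (2 * a), (-b + root) / (2 * a)]
    positive = [x for x in candidates if x > 0]
    return min(positive) if positive else None


# Coefficients as printed by fityk on slide 36.
A, B, C = 0.786854, 146.657, 14147.4
HYPOTHETICAL_CEILING_GB = 20000.0  # assumption, not a Flickr number

x = time_to_ceiling(A, B, C, HYPOTHETICAL_CEILING_GB)
print("fitted curve reaches the ceiling at x =", x)
```

Run from cron against fresh fit coefficients, something like this would give the "how long do I have" number without opening Excel. It is still extrapolation, not prediction.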

Editor's notes

  1. Only two more chapters to go. :)
  2. How many of you manage servers for your site? How many of you know how many servers you have? (databases, webservers, etc.) How many of you collect metrics for all of your capacity resources?
  3. I’ll be repeating some concepts that I’ve talked about in other presentations on the same topic... 1. Planning 2. Manage 3. Stupid Catastrophe Tricks with some random statistics from Flickr sprinkled throughout
  4. By “capacity problems” I mean NORMAL capacity trends, not spiking ones. By edge cases, I mean usage patterns that exist outside the realm of normal operation. Examples: users with 60,000 tags on 20 photos (not possible anymore)...search API calls with 60 ORs (not possible anymore)
  5. The High Performance Computing industry has created a lot of tools and deployment philosophies that web operations can learn from.
  6. It *DOESN’T* matter which tool you use, as long as it can satisfy these criteria.
  7. Knowing what system resources mean in terms of application usage puts the whole capacity shebang into context. Another example would be: Max QPS for a MySQL server = X users, Y photos.
  8. (that’s about 44 per second.)
  9. Artificial stress testing is rarely good for testing real capacity ceilings. It’s great for comparing two different hardware platforms, tho.
  10. How many of you know how many QPS your MySQL machines can do without degrading or failing? (slave lag, anyone?)
  11. Find ceilings by measuring *real* data from production. WHY?? 1. Development “cycles” are TIGHT, so code changes, so load characteristics change all the time. (sometimes in big ways) 2. Edge cases get shown in production, not in my imagination. (60k tags on 100 photos?) 3. Too much time wasted on artificial test setups to get accuracy that doesn’t matter.
  12. Sometimes you don’t have to increase load artificially, you bump up against the limits naturally...
  13. So, our ceiling is disk I/O wait, and it’s around 30-40% that we want to stay under... WHY does this happen? I don’t know, and I don’t care, not right now....
  14. Squid requests per second, at peak, on a Tuesday.
  15. Structural and mechanical engineering use a Factor of Safety (FoS) when designing components that experience load, both stress and strain: bridges, airbags, buildings, seatbelts, toasters. So should we, as web operations.
  16. Whether you express it as a “reserve”, or “overhead”, or some fraction/percentage of your ultimate limit, you should know what these are for ALL of your resources. Civil and mechanical engineers use them when designing bridges, airbags, buildings, seatbelts, toasters. So should we.
  17. Being a multi-core box, I know I can get up to 95% user CPU without performance degradation, but I don’t want to get that far. We’ll plan to have our webservers reach 85% so we have wiggle room for sporadic spikes above that.
  18. Being a multi-core box, I know I can get up to 95% user CPU without performance degradation, but I don’t want to get that far. We’ll plan to have our webservers reach 85% so we have wiggle room for sporadic spikes above that. Why 85%? Because history has told me that is the size of spike I could see (15%).
  19. Being a multi-core box, I know I can get up to 95% user CPU without performance degradation, but I don’t want to get that far. We’ll plan to have our webservers reach 85% so we have wiggle room for sporadic spikes above that.
  20. Forecasting capacity means making educated guesses about the future by using data from the past. Throw in any knowledge you have about: - timing of feature launches - seasonal differences - new hardware deployment
  21. We’ll just go through a simple example of using extrapolation and curve-fitting to make a prediction on how data (capacity metrics) will change in the future. THERE IS NO SUCH THING AS PREDICTING THE FUTURE.
  22. RRDtool data, put into Excel. Need a bigger sample than 1 week, not enough peaks....let’s try 6 weeks...
  23. ...try 6 weeks of data...
  24. “Add a Trendline” is a feature in Excel. Note the # of weeks at the bottom. The R-squared number is the “coefficient of determination”, which indicates how good a “fit” the equation is to the data.
  25. This is a linear equation given for the curve-fitting function.
  26. Using Excel is time-consuming; you should be able to automate this so you can keep tabs on it more easily. fityk has a command-line version, cfityk.
  27. Same drill with Excel.
  28. The same! Yay!
  29. High and low-water marks
  30. ADD THE ESTIMATED TIME LEFT ON HERE...
  31. Yay! Savings all around!
  32. Terabytes will be consumed today, not including video.
  33. Watch out for 2nd-order effects when deploying new/faster machines. When throttles are opened, the dam can get moved down the river.
  34. Watch out for 2nd-order effects when deploying new/faster machines. When throttles are opened, the dam can get moved down the river. Artur mentioned that faster pages = more traffic...we see the same thing.
  35. Some well-known tips and tricks for when the shit hits the fan.
  36. Before capistrano, before puppet, there was dsh. Quick and dirty.
  37. Running a command on any arbitrary number of hosts, interactively. Not revolutionary, but useful.
  38. Better to be mostly up than down for features that aren’t used much.
  39. Hosting is like $7.95 a month for a blog. Spend the cash.
  40. We have put squid in front of our search cluster and cached aggressively when we ran close to capacity.
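
The safety-factor and dashboard arithmetic from the slides (notes 15-19 and the dashboard slide) can be sketched in a few lines. This is an illustration rather than anything Flickr ran: the function names are made up, and the 1.12 CPU-per-busy-proc ratio is the talk's own webserver figure (YMMV).

```python
# Ratio quoted in the talk for Flickr's webservers:
# total CPU % = ~1.12 * number of busy apache procs.
CPU_PERCENT_PER_BUSY_PROC = 1.12


def busy_procs_for_cpu(cpu_percent):
    """Busy apache procs that correspond to a given total CPU percentage."""
    return cpu_percent / CPU_PERCENT_PER_BUSY_PROC


def dashboard_row(boxes, limit_per_box, current_peak):
    """Return (total limit, percent of limit used at peak) for one resource."""
    total_limit = boxes * limit_per_box
    percent_peak = 100.0 * current_peak / total_limit
    return total_limit, percent_peak


# "Safe" ceiling at 85% CPU, expressed in busy apache procs (~76 on the slide).
print("safe busy procs per box:", round(busy_procs_for_cpu(85.0)))

# Rows from the (fictional) dashboard slide: (type, boxes, limit/box, current peak).
rows = [
    ("www", 20, 80, 1000),
    ("shard db", 20, 40, 220),
    ("squid", 18, 950, 11400),
]
for name, boxes, limit_per_box, current_peak in rows:
    total, pct = dashboard_row(boxes, limit_per_box, current_peak)
    print(f"{name}: limit (total) {total}, % peak {pct:.2f}%")
```

The "est. days left" column is deliberately left out: it needs a growth forecast per resource, which is what the curve-fitting step is for.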