Diagonal scaling example: image processing throughput. Before: ~45 images/min at peak. After: ~140 images/min at peak (same CPU usage, but ~3x more work). “Processing” means making 4 sizes from the originals.
Stupid Capacity Tricks: quick and dirty management.

[root@netmon101 ~]# dsh -N group.of.servers
dsh> date
executing 'date'
www100: Mon Jun 23 14:14:53 UTC 2008
www118: Mon Jun 23 14:14:53 UTC 2008
dbcontacts3: Mon Jun 23 07:14:53 PDT 2008
admin1: Mon Jun 23 14:14:53 UTC 2008
admin2: Mon Jun 23 14:14:53 UTC 2008
dsh>
Stupid Capacity Tricks: turn stuff OFF.
- uploads (photo)
- uploads (video)
- uploads by email
- various API things
- various mobile things
- various search things
- etc., etc.
How many of you manage servers for your site? How many of you know how many servers you have? (databases, webservers, etc.) How many of you collect metrics for all of your capacity resources?
I’ll be repeating some concepts that I’ve talked about in other presentations on the same topic...
1. Planning
2. Management
3. Stupid Capacity Tricks
...with some random statistics from Flickr sprinkled throughout.
By “capacity problems” I mean NORMAL capacity trends, not spiking ones. By edge cases, I mean usage patterns that exist outside the realm of normal operation. Examples: users with 60,000 tags on 20 photos (not possible anymore)...search API calls with 60 ORs (not possible anymore)
The High Performance Computing industry has created a lot of tools and deployment philosophies that web operations can learn from.
It *DOESN’T* matter which tool you use, as long as it can satisfy these criteria.
Knowing what system resources mean in terms of application usage puts the whole capacity shebang into context. Another example would be: Max QPS for a MySQL server = X users, Y photos.
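A sketch of what that mapping might look like in code; every number and name below is made up for illustration, not a real Flickr figure:

# Hypothetical measured values -- substitute your own.
QPS_CEILING = 2000            # max QPS one MySQL slave handles before degrading
QPS_PER_ACTIVE_USER = 0.4     # avg queries/sec one active user generates
PHOTOS_PER_ACTIVE_USER = 150  # avg photos per active user

def capacity_in_app_terms(qps_ceiling=QPS_CEILING):
    """Translate a raw resource ceiling into application units."""
    max_users = qps_ceiling / QPS_PER_ACTIVE_USER
    max_photos = max_users * PHOTOS_PER_ACTIVE_USER
    return max_users, max_photos

users, photos = capacity_in_app_terms()
print(f"one slave ~= {users:,.0f} active users (~{photos:,.0f} photos)")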
(that’s about 44 per second.)
Artificial stress testing is rarely good for testing real capacity ceilings. It’s great for comparing two different hardware platforms, though.
How many of you know how many QPS your MySQL machines can do without degrading or failing? (slave lag, anyone?)
Find ceilings by measuring *real* data from production. WHY??
1. Development “cycles” are TIGHT, so code changes, so load characteristics change all the time (sometimes in big ways).
2. Edge cases show up in production, not in my imagination. (60k tags on 100 photos?)
3. Too much time gets wasted on artificial test setups, chasing accuracy that doesn’t matter.
Sometimes you don’t have to increase load artificially, you bump up against the limits naturally...
So, our ceiling is disk I/O wait, and we want to stay under roughly 30-40%... WHY does this happen? I don’t know, and I don’t care, not right now....
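A rough illustration (not the actual Flickr tooling) of pulling a ceiling out of production samples: given collected pairs of I/O wait vs. replication lag, report the lowest I/O wait at which degradation showed up. The sample data and the lag threshold are invented:

# Invented production samples: (cpu_iowait_percent, replication_lag_seconds)
samples = [
    (12, 0), (18, 0), (25, 0), (31, 2),
    (36, 5), (42, 30), (45, 90), (28, 0),
]

LAG_THRESHOLD = 1  # seconds of slave lag we consider "degraded"

def observed_ceiling(samples, threshold=LAG_THRESHOLD):
    """Lowest I/O wait level at which degradation was observed."""
    degraded = [iowait for iowait, lag in samples if lag > threshold]
    return min(degraded) if degraded else None

print(f"degradation first seen around {observed_ceiling(samples)}% I/O wait")
# -> keep normal peaks comfortably below that (hence the 30-40% band above).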
Squid requests per second, at peak, on a Tuesday.
Structural and mechanical engineering use a Factor of Safety (FoS) when designing components that experience load, both stress and strain: bridges, airbags, buildings, seatbelts, toasters. So should we, as web operations.
Whether you express it as a “reserve”, or “overhead”, or some fraction/percentage of your ultimate limit, you should know what these are for ALL of your resources.
Being a multi-core box, I know I can get up to 95% user CPU without performance degradation, but I don’t want to get that far. We’ll plan to have our webservers reach 85% so we have wiggle room for sporadic spikes above that. Why 85%? Because that’s what history has told me I could see spikes of (~15%).
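The same arithmetic as a tiny sketch, in the factor-of-safety spirit; the ceiling and target come from the example above, and the box counts are purely illustrative:

import math

CPU_CEILING = 95.0   # % user CPU where degradation begins (the measured ceiling)
PLANNED_PEAK = 85.0  # % we actually plan to, leaving wiggle room for spikes

def servers_needed(current_servers, current_peak_pct, target_pct=PLANNED_PEAK):
    """How many identical servers keep peak utilization at the planned target?"""
    total_work = current_servers * current_peak_pct   # "work" in CPU-percent units
    return math.ceil(total_work / target_pct)

# Illustrative only: 10 webservers peaking at 92% -> 11 boxes to get back under 85%.
print(servers_needed(10, 92.0))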
Forecasting capacity means making educated guesses about the future by using data from the past. Throw in any knowledge you have about:
- timing of feature launches
- seasonal differences
- new hardware deployment
We’ll just go through a simple example of using extrapolation and curve-fitting to make a prediction on how data (capacity metrics) will change in the future. THERE IS NO SUCH THING AS PREDICTING THE FUTURE.
RRDtool data, put into Excel. Need a bigger sample than 1 week, not enough peaks....let’s try 6 weeks...
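If you want to skip the copy-and-paste into Excel, something like this dumps an RRD into a CSV you can fit against; the RRD filename, the single data source, and the six-week window are assumptions for the example:

import csv
import subprocess

RRD_FILE = "cpu_user.rrd"     # assumption: wherever you collect the metric
OUT_FILE = "cpu_user_6w.csv"

# Pull six weeks of averaged data points out of the RRD.
raw = subprocess.run(
    ["rrdtool", "fetch", RRD_FILE, "AVERAGE", "--start", "now-6weeks", "--end", "now"],
    capture_output=True, text=True, check=True,
).stdout

with open(OUT_FILE, "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["timestamp", "value"])
    for line in raw.splitlines():
        if ":" not in line:
            continue                      # skip the DS-name header and blank lines
        ts, value = line.split(":", 1)
        value = value.split()[0]          # take the first data source only
        if "nan" not in value.lower():
            writer.writerow([ts.strip(), float(value)])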
...try 6 weeks of data...
“Add a Trendline” is a feature in Excel. Note the number of weeks at the bottom. The R-squared number is the “coefficient of determination”, which indicates how good a fit the equation is to the data.
This is the linear equation produced by the curve-fitting function.
Using Excel is time-consuming; you should be able to automate this so you can keep tabs on it more easily. fityk has a command-line version, cfityk.
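A rough idea of what that automation could look like in Python rather than cfityk: a least-squares line fit, its R-squared, and an extrapolation to when the trend crosses a ceiling. The CSV name and the ceiling value are assumptions carried over from the earlier sketches:

import csv
import numpy as np

CEILING = 85.0   # assumed ceiling (e.g. the planned-peak CPU % from earlier)

# Load the (timestamp, value) pairs written out by the RRD-export sketch above.
ts, vals = [], []
with open("cpu_user_6w.csv") as f:
    for row in csv.DictReader(f):
        ts.append(float(row["timestamp"]))
        vals.append(float(row["value"]))

x, y = np.array(ts), np.array(vals)

# Least-squares straight line: y = slope * x + intercept
slope, intercept = np.polyfit(x, y, 1)
predicted = slope * x + intercept

# Coefficient of determination (R^2): how well the line fits the data.
ss_res = ((y - predicted) ** 2).sum()
ss_tot = ((y - y.mean()) ** 2).sum()
r_squared = 1 - ss_res / ss_tot

# Extrapolate: when does the trend line cross the ceiling?
if slope > 0:
    days_left = ((CEILING - intercept) / slope - x[-1]) / 86400
    print(f"R^2 = {r_squared:.3f}, ~{days_left:.0f} days until the ceiling")
else:
    print(f"R^2 = {r_squared:.3f}, trend is flat or falling")

In practice you’d fit against daily or weekly peaks (as the slides do) rather than every raw sample.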
Same drill with Excel.
The same! Yay!
High- and low-water marks
Yay! Savings all around!
Terabytes will be consumed today, not including video.
Watch out for 2nd-order effects when deploying new/faster machines. When throttles are opened, the dam can get moved down the river. Artur mentioned that faster pages = more traffic...we see the same thing.
Some well-known tips and tricks for when the shit hits the fan.
Before Capistrano, before Puppet, there was dsh. Quick and dirty.
Running a command on any arbitrary number of hosts, interactively. Not revolutionary, but useful.
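dsh itself is the tool shown in the session earlier; if you wanted to fake the non-interactive version of it, a sketch along these lines would do (the host list is made up, and real dsh has plenty of options this ignores):

import subprocess
from concurrent.futures import ThreadPoolExecutor

# Made-up host group; in real life this comes from your host inventory.
HOSTS = ["www100", "www118", "dbcontacts3", "admin1", "admin2"]

def run_on(host, command):
    """Run a command on one host over ssh and return (host, output)."""
    result = subprocess.run(
        ["ssh", "-o", "BatchMode=yes", host, command],
        capture_output=True, text=True,
    )
    return host, (result.stdout or result.stderr).strip()

def run_everywhere(command, hosts=HOSTS):
    """Poor man's 'dsh -N group.of.servers': run one command on every host."""
    with ThreadPoolExecutor(max_workers=len(hosts)) as pool:
        for host, output in pool.map(lambda h: run_on(h, command), hosts):
            print(f"{host}: {output}")

run_everywhere("date")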
Better to be mostly up (minus the features that aren’t used much) than down.
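A bare-bones sketch of the kind of on/off switch that makes “turn stuff off” possible; the flag names mirror the “turn stuff OFF” list earlier, and the dict-in-code format is just an assumption (in practice it would live somewhere you can flip without a code push):

# Assumed feature flags keyed by name.
FLAGS = {
    "uploads_photo": True,
    "uploads_video": True,
    "uploads_by_email": True,
    "api": True,
    "mobile": True,
    "search": True,
}

def feature_enabled(name, flags=FLAGS):
    """Check a flag at request time; unknown features default to off."""
    return flags.get(name, False)

# During a capacity crunch, flip the expensive, little-used stuff off:
FLAGS["uploads_by_email"] = False
FLAGS["uploads_video"] = False

if not feature_enabled("uploads_video"):
    print("video uploads temporarily disabled")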
Hosting is like $7.95 a month for a blog. Spend the cash.
We put Squid in front of our search cluster and cached aggressively when we ran close to capacity.