Diagonal scaling example: image processing throughput. Before: ~45 images/min at peak. After: ~140 images/min at peak (same CPU usage, but ~3x more work). “Processing” means making 4 sizes from the originals.
Stupid Capacity Tricks: quick and dirty management.

[root@netmon101 ~]# dsh -N group.of.servers
dsh> date
executing 'date'
www100: Mon Jun 23 14:14:53 UTC 2008
www118: Mon Jun 23 14:14:53 UTC 2008
dbcontacts3: Mon Jun 23 07:14:53 PDT 2008
admin1: Mon Jun 23 14:14:53 UTC 2008
admin2: Mon Jun 23 14:14:53 UTC 2008
dsh>
Stupid Capacity Tricks: turn stuff OFF.
- uploads (photo)
- uploads (video)
- uploads by email
- various API things
- various mobile things
- various search things
- etc., etc.
How many of you manage servers for your site? How many of you know how many servers you have? (databases, webservers, etc.) How many of you collect metrics for all of your capacity resources?
I’ll be repeating some concepts that I’ve talked about in other presentations on the same topic...
1. Planning
2. Management
3. Stupid Capacity Tricks
...with some random statistics from Flickr sprinkled throughout.
By “capacity problems” I mean NORMAL capacity trends, not spiking ones. By edge cases, I mean usage patterns that exist outside the realm of normal operation. Examples: users with 60,000 tags on 20 photos (not possible anymore)...search API calls with 60 ORs (not possible anymore)
The High Performance Computing industry has created a lot of tools and deployment philosophies that web operations can learn from.
It *DOESN’T* matter which tool you use, as long as it can satisfy these criteria.
Knowing what system resources mean in terms of application usage puts the whole capacity shebang into context. Another example would be: Max QPS for a MySQL server = X users, Y photos.
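A sketch of what that mapping might look like in code; every number and name below is made up for illustration, not a real Flickr figure:

# Hypothetical measured values -- substitute your own.
QPS_CEILING = 2000            # max QPS one MySQL slave handles before degrading
QPS_PER_ACTIVE_USER = 0.4     # avg queries/sec one active user generates
PHOTOS_PER_ACTIVE_USER = 150  # avg photos per active user

def capacity_in_app_terms(qps_ceiling=QPS_CEILING):
    """Translate a raw resource ceiling into application units."""
    max_users = qps_ceiling / QPS_PER_ACTIVE_USER
    max_photos = max_users * PHOTOS_PER_ACTIVE_USER
    return max_users, max_photos

users, photos = capacity_in_app_terms()
print(f"one slave ~= {users:,.0f} active users (~{photos:,.0f} photos)")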
(that’s about 44 per second.)
Artificial stress testing is rarely good for testing real capacity ceilings. It’s great for comparing two different hardware platforms, though.
How many of you know how many QPS your MySQL machines can do without degrading or failing? (slave lag, anyone?)
Find ceilings by measuring *real* data from production. WHY??
1. Development “cycles” are TIGHT, so code changes, so load characteristics change all the time (sometimes in big ways).
2. Edge cases show up in production, not in my imagination. (60k tags on 100 photos?)
3. Too much time gets wasted on artificial test setups, chasing accuracy that doesn’t matter.
Sometimes you don’t have to increase load artificially, you bump up against the limits naturally...
So, our ceiling is disk I/O wait, and we want to stay under roughly 30-40%... WHY does this happen? I don’t know, and I don’t care, not right now....
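A rough illustration (not the actual Flickr tooling) of pulling a ceiling out of production samples: given collected pairs of I/O wait vs. replication lag, report the lowest I/O wait at which degradation showed up. The sample data and the lag threshold are invented:

# Invented production samples: (cpu_iowait_percent, replication_lag_seconds)
samples = [
    (12, 0), (18, 0), (25, 0), (31, 2),
    (36, 5), (42, 30), (45, 90), (28, 0),
]

LAG_THRESHOLD = 1  # seconds of slave lag we consider "degraded"

def observed_ceiling(samples, threshold=LAG_THRESHOLD):
    """Lowest I/O wait level at which degradation was observed."""
    degraded = [iowait for iowait, lag in samples if lag > threshold]
    return min(degraded) if degraded else None

print(f"degradation first seen around {observed_ceiling(samples)}% I/O wait")
# -> keep normal peaks comfortably below that (hence the 30-40% band above).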
Squid requests per second, at peak, on a Tuesday.
Structural and mechanical engineering use a Factor of Safety (FoS) when designing components that experience load, both stress and strain: bridges, airbags, buildings, seatbelts, toasters. So should we, as web operations.
Whether you express it as a “reserve”, or “overhead”, or some fraction/percentage of your ultimate limit, you should know what these are for ALL of your resources.
Being a multi-core box, I know I can get up to 95% user CPU without performance degradation, but I don’t want to get that far. We’ll plan to have our webservers reach 85% so we have wiggle room for sporadic spikes above that. Why 85%? Because that’s what history has told me I could see spikes of (~15%).
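The same arithmetic as a tiny sketch, in the factor-of-safety spirit; the ceiling and target come from the example above, and the box counts are purely illustrative:

import math

CPU_CEILING = 95.0   # % user CPU where degradation begins (the measured ceiling)
PLANNED_PEAK = 85.0  # % we actually plan to, leaving wiggle room for spikes

def servers_needed(current_servers, current_peak_pct, target_pct=PLANNED_PEAK):
    """How many identical servers keep peak utilization at the planned target?"""
    total_work = current_servers * current_peak_pct   # "work" in CPU-percent units
    return math.ceil(total_work / target_pct)

# Illustrative only: 10 webservers peaking at 92% -> 11 boxes to get back under 85%.
print(servers_needed(10, 92.0))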
Forecasting capacity means making educated guesses about the future by using data from the past. Throw in any knowledge you have about:
- timing of feature launches
- seasonal differences
- new hardware deployment
We’ll just go through a simple example of using extrapolation and curve-fitting to make a prediction on how data (capacity metrics) will change in the future. THERE IS NO SUCH THING AS PREDICTING THE FUTURE.
RRDtool data, put into Excel. Need a bigger sample than 1 week, not enough peaks....let’s try 6 weeks...
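If you want to skip the copy-and-paste into Excel, something like this dumps an RRD into a CSV you can fit against; the RRD filename, the single data source, and the six-week window are assumptions for the example:

import csv
import subprocess

RRD_FILE = "cpu_user.rrd"     # assumption: wherever you collect the metric
OUT_FILE = "cpu_user_6w.csv"

# Pull six weeks of averaged data points out of the RRD.
raw = subprocess.run(
    ["rrdtool", "fetch", RRD_FILE, "AVERAGE", "--start", "now-6weeks", "--end", "now"],
    capture_output=True, text=True, check=True,
).stdout

with open(OUT_FILE, "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["timestamp", "value"])
    for line in raw.splitlines():
        if ":" not in line:
            continue                      # skip the DS-name header and blank lines
        ts, value = line.split(":", 1)
        value = value.split()[0]          # take the first data source only
        if "nan" not in value.lower():
            writer.writerow([ts.strip(), float(value)])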
...try 6 weeks of data...
“Add a Trendline” is a feature in Excel. Note the number of weeks at the bottom. The R-squared number is the “coefficient of determination”, which indicates how good a fit the equation is to the data.
This is the linear equation produced by the curve-fitting function.
Using Excel is time-consuming; you should be able to automate this so you can keep tabs on it more easily. fityk has a command-line version, cfityk.
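A rough idea of what that automation could look like in Python rather than cfityk: a least-squares line fit, its R-squared, and an extrapolation to when the trend crosses a ceiling. The CSV name and the ceiling value are assumptions carried over from the earlier sketches:

import csv
import numpy as np

CEILING = 85.0   # assumed ceiling (e.g. the planned-peak CPU % from earlier)

# Load the (timestamp, value) pairs written out by the RRD-export sketch above.
ts, vals = [], []
with open("cpu_user_6w.csv") as f:
    for row in csv.DictReader(f):
        ts.append(float(row["timestamp"]))
        vals.append(float(row["value"]))

x, y = np.array(ts), np.array(vals)

# Least-squares straight line: y = slope * x + intercept
slope, intercept = np.polyfit(x, y, 1)
predicted = slope * x + intercept

# Coefficient of determination (R^2): how well the line fits the data.
ss_res = ((y - predicted) ** 2).sum()
ss_tot = ((y - y.mean()) ** 2).sum()
r_squared = 1 - ss_res / ss_tot

# Extrapolate: when does the trend line cross the ceiling?
if slope > 0:
    days_left = ((CEILING - intercept) / slope - x[-1]) / 86400
    print(f"R^2 = {r_squared:.3f}, ~{days_left:.0f} days until the ceiling")
else:
    print(f"R^2 = {r_squared:.3f}, trend is flat or falling")

In practice you’d fit against daily or weekly peaks (as the slides do) rather than every raw sample.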
Same drill with Excel.
The same! Yay!
High- and low-water marks
Yay! Savings all around!
Terabytes will be consumed today, not including video.
Watch out for 2nd-order effects when deploying new/faster machines. When throttles are opened, the dam can get moved down the river. Artur mentioned that faster pages = more traffic...we see the same thing.
Some well-known tips and tricks for when the shit hits the fan.
Before Capistrano, before Puppet, there was dsh. Quick and dirty.
Running a command on any arbitrary number of hosts, interactively. Not revolutionary, but useful.
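dsh itself is the tool shown in the session earlier; if you wanted to fake the non-interactive version of it, a sketch along these lines would do (the host list is made up, and real dsh has plenty of options this ignores):

import subprocess
from concurrent.futures import ThreadPoolExecutor

# Made-up host group; in real life this comes from your host inventory.
HOSTS = ["www100", "www118", "dbcontacts3", "admin1", "admin2"]

def run_on(host, command):
    """Run a command on one host over ssh and return (host, output)."""
    result = subprocess.run(
        ["ssh", "-o", "BatchMode=yes", host, command],
        capture_output=True, text=True,
    )
    return host, (result.stdout or result.stderr).strip()

def run_everywhere(command, hosts=HOSTS):
    """Poor man's 'dsh -N group.of.servers': run one command on every host."""
    with ThreadPoolExecutor(max_workers=len(hosts)) as pool:
        for host, output in pool.map(lambda h: run_on(h, command), hosts):
            print(f"{host}: {output}")

run_everywhere("date")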
Better to be mostly up (minus the features that aren’t used much) than down.
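A bare-bones sketch of the kind of on/off switch that makes “turn stuff off” possible; the flag names mirror the “turn stuff OFF” list earlier, and the dict-in-code format is just an assumption (in practice it would live somewhere you can flip without a code push):

# Assumed feature flags keyed by name.
FLAGS = {
    "uploads_photo": True,
    "uploads_video": True,
    "uploads_by_email": True,
    "api": True,
    "mobile": True,
    "search": True,
}

def feature_enabled(name, flags=FLAGS):
    """Check a flag at request time; unknown features default to off."""
    return flags.get(name, False)

# During a capacity crunch, flip the expensive, little-used stuff off:
FLAGS["uploads_by_email"] = False
FLAGS["uploads_video"] = False

if not feature_enabled("uploads_video"):
    print("video uploads temporarily disabled")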
Hosting is like $7.95 a month for a blog. Spend the cash.
We put Squid in front of our search cluster and cached aggressively when we ran close to capacity.