Peak load and bursty traffic are problem spaces that are often (and tragically) confused with each other, invariably to the detriment of both ops and users. While peak load is all about capacity management, in a bursty situation you might have to prioritize - or even drop! - requests. Knowing which requests to process, and how to actually process them, is the world of Active Queue Management (AQM). While AQM has long been exclusively the domain of the TCP/IP crowd, it has slowly been making its way into the world of cloud services, albeit with much (faulty!) wheel-reinventing.
Join me as I take you through the world of Active Queue Management, back-pressure, load-ramping, and tactical avoidance, things that most people should be architecting into their services, but aren't.
Just part of one cluster failed, but a threshold had been passed
No worries, we’ll just bounce that one cluster, it’ll all be good
Total System Meltdown
All the calls keep retrying, causing memory utilization to go through the roof
Voicemail conversion was going on independent of everything else, causing CPU utilization to spike
Eventually, the cache timed out, and tried to reload stuff from the disk.
And then everyone tries the Apps, and the Twitters and the facebooks and the everythings.
Some of us have been confronted by this
Total System Meltdown
And you always get asked this
There is only so much planning you can do. At some point, the 1000-year flood hits.
The point being, shit will happen. The question is, when shit happens, can you clean up?
It’s not just us
It’s not just us!!! We are not alone! (Breaking Benjamin) (5 of 5 leading providers…)
Do you have disks in the loop? Maybe humans? Or large data? (Postgres data moved to a backup datacenter?)
Yeah, right. It’s what everybody sez. And then shit happens.
How fast are you? How quickly can you come back up? Can you store enough state to survive?
Is BufferBloat a problem?
Once you are up, can you draw down the queue fast enough? Or at all, for that matter?
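A back-of-the-envelope way to answer the drain question: the backlog only shrinks by whatever capacity is left over after serving new arrivals. The sketch below is a minimal illustration under that assumption; the function name and the numbers in the usage lines are made up for the example, not taken from any real incident.

```python
import math

def drain_time_seconds(backlog, arrival_rate, service_rate):
    """Seconds needed to empty a backlog while new requests keep arriving.

    backlog      -- requests already queued when the service comes back
    arrival_rate -- new requests per second still coming in
    service_rate -- requests per second the recovered service can handle
    Returns math.inf when the service cannot outpace arrivals.
    """
    surplus = service_rate - arrival_rate  # spare capacity, requests/second
    if surplus <= 0:
        return math.inf  # the queue never drains ("or at all, for that matter")
    return backlog / surplus

# Hypothetical numbers: 60,000 queued requests, 900 req/s arriving, 1,000 req/s served.
print(drain_time_seconds(60_000, 900, 1_000))   # 600.0 seconds, i.e. ten minutes
print(drain_time_seconds(60_000, 1_000, 1_000)) # inf
```

With only 10% headroom, ten minutes of accumulated retries cost another ten minutes of degraded service after recovery, which is why the amount of spare capacity matters more than the raw service rate.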
Is backpressure going to be a problem?
If the answer is “Yes”, then the talk is over, because it just works.
What if the answer is “No”? (Now we have a story)
Programmable: if you’re lucky, your infrastructure will automagically support ramping.
Fake it. People respond subconsciously to these, and actually wait. You can even get away with dropping the request. (This assumes that you can recover in time.)
This happens inside the airport too! Passengers self-select the best gates to enter (intelligent routing).
(Programmable, behavioral, and self-managed.) The planes move around different runways before leaving, to free up gates and make passengers think something is happening. (Always take the first flight out! And the last flight back!)
Surprisingly, airlines are ridiculously good at AQM.
The question is, what do you do when you can’t come up in time? 3 gallon bucket, 5 gallons of water…
Just start dropping when the queue fills up. This is pretty bad: global synchronization becomes a problem. Planes don’t take off till they get clearance from the other end.
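In networking this "drop when full" policy is usually called tail drop. A minimal sketch of it, assuming a simple in-process request queue (the class and method names are hypothetical, chosen for illustration):

```python
from collections import deque

class TailDropQueue:
    """Bounded FIFO that rejects new requests once it is full (tail drop)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = deque()
        self.dropped = 0  # how many requests were turned away

    def offer(self, request):
        """Enqueue the request, or drop it if the queue is already full."""
        if len(self.items) >= self.capacity:
            self.dropped += 1
            return False
        self.items.append(request)
        return True

    def poll(self):
        """Dequeue the oldest request, or None if the queue is empty."""
        return self.items.popleft() if self.items else None

# The 3-gallon bucket with 5 gallons of water:
bucket = TailDropQueue(capacity=3)
accepted = [bucket.offer(gallon) for gallon in range(5)]
print(accepted)        # [True, True, True, False, False]
print(bucket.dropped)  # 2
```

Every caller that arrives while the queue is full gets rejected at the same moment, so they all tend to retry at the same moment too. That is the synchronization problem the note above refers to.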
Slow Start, AQM, RED, CoDel, … Why don’t we learn from networks?
RED / SRED (RED in a different light: the toilet bowl)
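For readers who have not met RED (Random Early Detection) before, the core idea is to start dropping a small, growing fraction of requests before the queue is full, based on a smoothed average of the queue length. The sketch below shows only that core idea applied to a request queue. The class name, thresholds, `max_p`, and the smoothing `weight` are illustrative assumptions (real network RED uses a much smaller weight), and the refinement that spaces drops evenly between arrivals is left out.

```python
import random

class RedQueue:
    """Simplified Random Early Detection for a request queue (illustrative only)."""

    def __init__(self, min_th, max_th, max_p=0.1, weight=0.2, rng=None):
        self.min_th = min_th    # below this average length, never drop
        self.max_th = max_th    # at or above this average length, always drop
        self.max_p = max_p      # drop probability just below max_th
        self.weight = weight    # smoothing factor for the moving average
        self.rng = rng or random.Random()
        self.avg = 0.0          # exponentially weighted average queue length
        self.items = []

    def drop_probability(self):
        """Drop probability for the current average queue length."""
        if self.avg < self.min_th:
            return 0.0
        if self.avg >= self.max_th:
            return 1.0
        return self.max_p * (self.avg - self.min_th) / (self.max_th - self.min_th)

    def offer(self, request):
        """Update the average, then enqueue or randomly drop the request."""
        self.avg = (1 - self.weight) * self.avg + self.weight * len(self.items)
        if self.rng.random() < self.drop_probability():
            return False
        self.items.append(request)
        return True
```

Because drops are random and begin early, different callers get pushed back at different times instead of all at once, which is what tail drop gets wrong.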
The 3rd priority airport always gets the shaft
F(low) RED: RED on a per-flow basis (the entire route map). Kinda the default. (Discard the second request.)
RED-PD, P(referential) D(rop): does RED only for high-bandwidth flows (high-traffic routes). (Throttle spammy clients. Or features.)
Fixed two bugs in RED. Made it feedback based (self-tuning). The toilet diagram caused problems.
Sliced bread has nothing on it. (Dave Taht)
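One way to picture "throttle spammy clients" in a service is sketched below. To be clear, this is not the RED-PD algorithm itself (which identifies heavy flows from their drop history); it is a much simpler stand-in that counts requests per client in a window and drops only from clients above an assumed fair share. All names and the fair-share rule are hypothetical.

```python
import random
from collections import Counter

class HeavyClientThrottle:
    """Drop requests only from clients that exceed a per-window fair share."""

    def __init__(self, fair_share, rng=None):
        self.fair_share = fair_share  # requests per window a client may send freely
        self.counts = Counter()       # requests seen per client in this window
        self.rng = rng or random.Random()

    def new_window(self):
        """Forget the counts; call this at the start of each time window."""
        self.counts.clear()

    def drop_probability(self, client):
        """0 for well-behaved clients; grows toward 1 as a client overshoots."""
        seen = self.counts[client]
        if seen <= self.fair_share:
            return 0.0
        return 1.0 - self.fair_share / seen

    def admit(self, client):
        """Record the request and decide whether to let it through."""
        self.counts[client] += 1
        return self.rng.random() >= self.drop_probability(client)
```

Quiet clients never see a drop, while a client sending four times its share loses roughly three out of four requests, so the pain lands on whoever is causing the load. The same structure works if the key is a feature name instead of a client.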
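Assuming this note refers to CoDel (listed among the network techniques earlier), here is a heavily simplified sketch of its idea: watch how long each request sat in the queue rather than how long the queue is, and start dropping only when that delay has stayed above a target for a whole interval, then drop faster the longer the problem persists. The class name and the explicit `now` arguments are assumptions made so the example is deterministic; the 5 ms target and 100 ms interval are CoDel's usual network defaults and would need different values for a service. Details of the full algorithm, such as reusing the drop count when congestion returns quickly, are omitted.

```python
import math
from collections import deque

class CoDelQueue:
    """Simplified CoDel-style queue that drops at dequeue time (illustrative only)."""

    def __init__(self, target=0.005, interval=0.100):
        self.target = target      # acceptable time in queue, seconds
        self.interval = interval  # how long delay must stay bad before dropping
        self.items = deque()      # (enqueue_time, request) pairs
        self.first_above = None   # deadline after which sustained delay is "bad"
        self.dropping = False
        self.drop_next = 0.0      # time of the next scheduled drop
        self.count = 0            # drops in the current dropping episode

    def offer(self, request, now):
        self.items.append((now, request))

    def _delay_is_bad(self, enqueued_at, now):
        """True once queueing delay has stayed above target for a full interval."""
        if now - enqueued_at < self.target:
            self.first_above = None
            return False
        if self.first_above is None:
            self.first_above = now + self.interval
            return False
        return now >= self.first_above

    def poll(self, now):
        """Return the next request to serve, dropping stale ones along the way."""
        while self.items:
            enqueued_at, request = self.items.popleft()
            if not self._delay_is_bad(enqueued_at, now):
                self.dropping = False
                return request
            if not self.dropping:
                # Delay has been bad for a whole interval: drop and start an episode.
                self.dropping = True
                self.count = 1
                self.drop_next = now + self.interval / math.sqrt(self.count)
                continue
            if now >= self.drop_next:
                # Still bad: drop again, with drops spaced closer each time.
                self.count += 1
                self.drop_next += self.interval / math.sqrt(self.count)
                continue
            return request
        return None
```

A short burst that clears within one interval is never dropped, which is the property that makes this approach attractive for bursty traffic as opposed to sustained overload.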
Know when something breaks
Know what broke