Massive scale with no fear and no secrets, for Apigee and our customers, whether in our cloud or in their data centers. At I Love APIs 2014, Nicola Cardace, global performance practice lead at Apigee, along with Aaron Strey, senior engineer at Target, walked us through a customer success story: a journey of zen-like discipline and obsessive customer focus, in which the secrets of the craft are disclosed. It is a report on delivering architectures at scale and on the thought processes Apigee applies during a customer engagement.
3. Performance Engineering
Capacity Planning
In the Cloud:
• We handle traffic surges
• We manage your capacity for growth in traffic
On-premises:
• We work with our customers to plan, deploy, and support
My name is Aaron Strey and I am a senior engineer on the API team at Target. Last year I had the responsibility of ensuring that we were ready to handle Black Friday. Outside of what we are talking about today, the things that really excite me in the industry are real-time processing, the Lambda Architecture, and containers (Docker).
We process tens of millions of API requests per day (the cloud is our solution).
If you are from the US you know what Black Friday is. For those that aren’t…
We see roughly a tenfold increase in traffic.
Stability is paramount
My volunteering story (I said I would do it, panicked, then came to see it as an awesome challenge)
The rest of this talk will provide a little insight into the things we learned along the way.
These are the things that come along with making performance testing a first-class citizen:
Integrate performance testing into your CI pipeline
Structure your plans in a way that makes sense
Use version control for your scripts (they should be checked into source control with all of your other code; a minimal example follows this list)
Performance testing should be owned by the development team
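The talk does not name a load tool, but as an illustration, here is a minimal sketch of the kind of script we mean, written for Locust (a Python load-testing tool); the endpoint and think times are hypothetical, not Target's actual API.

# loadtest/api_user.py -- lives in the same repo as the API it exercises.
# A minimal Locust scenario; the /products endpoint and timings are
# illustrative assumptions.
from locust import HttpUser, task, between

class ApiUser(HttpUser):
    wait_time = between(1, 3)  # simulated think time between requests

    @task
    def get_products(self):
        # Exercise a representative read path of the API.
        self.client.get("/products?limit=10")

Checked in next to the application code, a script like this can be launched from the pipeline with something like locust -f loadtest/api_user.py --host <environment-url>.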
Take the time to think about the types of tests you want to run! (A worked example of these ratios follows this list.)
Stress: Figure out the load you can handle and still meet your SLA
Load: 80% of stress
Soak: 80% of load (48 hours? More?)
Spike: alternate back and forth between 80% and 120% of load
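As a worked example of those ratios, here is a minimal sketch in Python; the 1,000 TPS stress result is a hypothetical number, not a Target figure.

# Derive the other test profiles from a measured stress-test result.
stress_tps = 1000              # hypothetical: max TPS that still meets the SLA
load_tps = 0.8 * stress_tps    # load test: 80% of stress
soak_tps = 0.8 * load_tps      # soak test: 80% of load, run for 48+ hours
spike_low = 0.8 * load_tps     # spike test alternates back and forth
spike_high = 1.2 * load_tps    # between 80% and 120% of load

print(f"load={load_tps:.0f} soak={soak_tps:.0f} "
      f"spike={spike_low:.0f}-{spike_high:.0f} TPS")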
Load testing writes (as opposed to reads) presents some additional challenges, especially when we are talking about ecommerce. We are not a unicorn company yet; we are on our way, but we can't spin up an exact production replica at the drop of a hat. Our options (a stub sketch follows this list):
Stub out responses as far back in the stack as possible
Build out a lower environment that is as close to production as possible
Assume the risk of not testing
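As an illustration of the first option, here is a minimal sketch of a stub that stands in for a downstream write dependency, so the API's write path can be loaded without mutating real systems; the port, path, and payload are hypothetical.

# stub_service.py -- hypothetical stand-in for a downstream dependency
# (e.g. an order or inventory service) during write-heavy load tests.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class StubHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Drain the request body, then return a canned success response.
        self.rfile.read(int(self.headers.get("Content-Length", 0)))
        body = json.dumps({"status": "accepted"}).encode()
        self.send_response(202)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the stub quiet under load

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), StubHandler).serve_forever()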
After you get all of your technology in place, this is absolutely the hardest part about load testing…
There is a key industry development that has taken place over the last couple of years that makes this much easier: big data.
We log every single API request that is made to our platform and we persist it for future analysis. This puts us in a great position to estimate future load. The formula we used: divide the max average TPS over any 60-second window on last Thanksgiving by the max average TPS over any 60-second window from two Sundays before Thanksgiving, and use the result as a multiplier.
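A minimal sketch of that calculation, assuming per-second request counts have already been extracted from the logs (the sample data here is randomly generated, purely for illustration):

# Estimate a holiday multiplier from logged per-second request counts.
import random

random.seed(0)  # hypothetical sample data standing in for real log extracts
baseline_tps = [random.randint(80, 120) for _ in range(3600)]   # ordinary Sunday
holiday_tps = [random.randint(800, 1200) for _ in range(3600)]  # Thanksgiving

def max_60s_avg_tps(per_second_counts):
    """Max average TPS over any sliding 60-second window."""
    w = 60
    return max(sum(per_second_counts[i:i + w]) / w
               for i in range(len(per_second_counts) - w + 1))

multiplier = max_60s_avg_tps(holiday_tps) / max_60s_avg_tps(baseline_tps)
print(f"multiplier: {multiplier:.1f}x")  # apply to current peaks to project load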
In many cases replaying production traffic was our best option.
What happens when there is no history? (net new API consumers)
In the era of big data our ability to monitor and troubleshoot is unparalleled.
-We log everything and we persist it, at every layer of the stack. We continually poll our infrastructure with tools like vmstat, top, netstat, and iostat, and we forward that information to a central location (a polling sketch follows this list).
-We also log and monitor application level details like the request URI for every single request.
-Log and monitor data store logs.
-Putting them all in the same location allows us to query data ad-hoc and tie together events at different layers of our API stack (we use Splunk).
-Probably the single most important thing we did.
-Target gave an entire presentation on API logging and monitoring alone.
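A minimal sketch of that kind of infrastructure poller, assuming stats are appended to a local file that a forwarder (such as Splunk's) ships to the central location; the commands, interval, and path are illustrative.

# poll_host_stats.py -- periodically capture OS-level stats and append
# them to a log file for a central forwarder to pick up.
import subprocess
import time

COMMANDS = ["vmstat 1 2", "iostat", "netstat -s"]  # illustrative set
LOG_PATH = "/var/log/host_stats.log"               # hypothetical path
INTERVAL_SECONDS = 60

while True:
    with open(LOG_PATH, "a") as log:
        for cmd in COMMANDS:
            out = subprocess.run(cmd.split(), capture_output=True,
                                 text=True).stdout
            # Timestamp each sample so events can be correlated later.
            log.write(f"{time.time()} {cmd}\n{out}\n")
    time.sleep(INTERVAL_SECONDS)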
We are currently maturing our ability to integrate performance testing into our continuous integration pipeline. The idea is that every time we make a code change, we run a bit of load to gauge any performance increase or decrease (a sketch follows).
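A minimal sketch of such a CI gate; the threshold, baseline file, and run_load_test hook are hypothetical, and a real pipeline would wrap an actual load tool.

# ci_perf_gate.py -- run a short load test on every change and fail the
# build if throughput regresses too far from a stored baseline.
import json
import sys

BASELINE_FILE = "perf_baseline.json"  # hypothetical, stored by CI
MAX_REGRESSION = 0.10                 # fail if >10% below baseline TPS

def run_load_test() -> float:
    """Run a short load test and return observed TPS (stub to fill in)."""
    raise NotImplementedError("wrap your load tool here")

def main() -> int:
    with open(BASELINE_FILE) as f:
        baseline_tps = json.load(f)["tps"]
    observed_tps = run_load_test()
    if observed_tps < baseline_tps * (1 - MAX_REGRESSION):
        print(f"FAIL: {observed_tps:.0f} TPS vs baseline {baseline_tps:.0f}")
        return 1
    print(f"OK: {observed_tps:.0f} TPS vs baseline {baseline_tps:.0f}")
    return 0

if __name__ == "__main__":
    sys.exit(main())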
Think Chaos Monkey
At our last internal hack day, a handful of engineers on my team built a proof of concept that does the following (a sketch follows this list):
Queries the Chef server to get a list of nodes running a specific API cookbook
Connects to a specified number of those nodes and kills the process running the API
Runs some load
Brings the nodes back up
Ensures there was no performance impact from bringing down nodes
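A minimal sketch of that flow, assuming Chef's knife CLI and passwordless SSH are available; the cookbook name, service name, node count, and managing the API process with systemctl are all hypothetical.

# chaos_poc.py -- hack-day-style proof of concept: stop the API process
# on a few Chef-managed nodes, run load, then restore them.
import subprocess

COOKBOOK = "api"         # hypothetical cookbook name
SERVICE = "api-service"  # hypothetical service name
NODES_TO_KILL = 2

def sh(cmd: str) -> str:
    return subprocess.run(cmd, shell=True, check=True,
                          capture_output=True, text=True).stdout

# 1. Query the Chef server for nodes running the API cookbook.
nodes = sh(f"knife search node 'recipe:{COOKBOOK}' -i").split()

# 2. Connect to a specified number of nodes and stop the API process.
victims = nodes[:NODES_TO_KILL]
for node in victims:
    sh(f"ssh {node} sudo systemctl stop {SERVICE}")

try:
    # 3. Run some load while the nodes are down.
    sh("python ci_perf_gate.py")
finally:
    # 4. Bring the nodes back up, even if the load run failed.
    for node in victims:
        sh(f"ssh {node} sudo systemctl start {SERVICE}")
# 5. Check the results to confirm there was no performance impact.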