1. Ensuring Performance in a Fast-Paced Environment
Martin Spier
Performance Engineering @ Netflix
@spiermar
mspier@netflix.com
2. Martin Spier
● Performance Engineer @ Netflix
● Previously @ Expedia and Dell
● Performance
○ Architecture, Tuning and Profiling
○ Testing and Frameworks
○ Tool Development
● Blog @ http://overloaded.io
● Twitter @spiermar
3. Agenda
● How Things Worked
○ Pass/Fail Testing, Manual
● How Netflix Works
○ Development Model, Freedom & Responsibility
● Rethinking Performance
○ Tools, Methodologies, Canary Analysis,
Performance Test Framework, Public Cloud,
Automated Analysis
7. ● World's leading Internet television network
● ⅓ of all traffic heading into American homes at
peak hours
● > 50 million members
● > 40 countries
● > 1 billion hours of TV shows and movies per
month
● Hundreds of different client devices
8. Freedom and Responsibility
● Culture deck* is TRUE
○ 11M+ views
● Minimal process
● Context over control
● Root access to everything
● No approvals required
● Only Senior Engineers
* http://www.slideshare.net/reed2001/culture-1798664
9. Independent Development Teams
● Highly aligned, loosely coupled
● Free to define release cycles
● Free to use any methodology
● But it’s an agile environment
● And there is a “paved road”
10. Development Agility
● Continuous innovation cycle
● Shorter development cycles
● Continuous delivery
● Self-service deployments
● A/B Tests
● Failure cost close to zero
● Lower time to market
● Innovation > Risk
12. Performance Engineering
● Not a part of any development team
● Not a shared service
● Improve and maintain performance through consultation
● Provide self-service performance analysis utilities
● Disseminate performance best practices
16. Red/Black Pushes
● New builds are rolled out as new
Auto-Scaling Groups (ASGs)
● Elastic Load Balancers (ELBs)
control the traffic going to each
ASG
● Fast and simple rollback if issues
are found
● Canary Clusters are used to test
builds before a full rollout
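The rollout-and-rollback flow above can be sketched as a small simulation. This is an illustrative model only, not Netflix's actual tooling: the class and function names (AutoScalingGroup, LoadBalancer, red_black_push) are hypothetical stand-ins for the real ASG/ELB machinery.

```python
# Minimal sketch of a red/black push: the new build comes up as a fresh ASG,
# a canary check gates the traffic shift, and the old ASG is kept around so
# rollback is just re-pointing the load balancer.

class AutoScalingGroup:
    def __init__(self, build):
        self.build = build

class LoadBalancer:
    """Tracks which ASG currently receives production traffic."""
    def __init__(self, active):
        self.active = active      # the "black" (current) ASG
        self.previous = None      # kept alive for fast rollback

    def shift_to(self, new_asg):
        self.previous, self.active = self.active, new_asg

    def rollback(self):
        # Fast and simple: point traffic back at the old ASG.
        if self.previous is not None:
            self.active, self.previous = self.previous, self.active

def red_black_push(elb, new_build, canary_passes):
    """Deploy new_build as a new ASG; shift traffic only if the canary passes."""
    red = AutoScalingGroup(new_build)
    if not canary_passes(red):    # test on a canary cluster before full rollout
        return elb.active.build   # traffic never shifted; old build stays live
    elb.shift_to(red)
    return elb.active.build
```

Note the design point from the slide: because the previous ASG is retained rather than torn down, rollback does not require a redeploy.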
17. Squeeze Tests
● Stress Test, with Production Load
● Steering Production Traffic
● Understand the Upper Limits of Capacity
● Adjust Auto-Scaling Policies
● Automated Squeeze Tests
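The idea of a squeeze test can be sketched as a simple ramp loop: keep steering more production traffic at one cluster until a latency objective is breached, then use the observed limit to set auto-scaling policy. Everything here is illustrative; latency_at is a stand-in for real measurements, and the SLO, step size, and headroom values are made-up parameters.

```python
# Illustrative squeeze test: ramp load until the latency SLO is breached,
# then report the last load level that still met it.

def squeeze(latency_at, slo_ms=100.0, step_rps=100, max_rps=10_000):
    """Return the highest load (requests/sec) that still meets the SLO."""
    last_good = 0
    for rps in range(step_rps, max_rps + 1, step_rps):
        if latency_at(rps) > slo_ms:   # SLO breached: upper limit found
            break
        last_good = rps
    return last_good

def scale_out_threshold(limit_rps, headroom=0.7):
    """Set the auto-scaling trigger at a safety margin below the limit."""
    return int(limit_rps * headroom)
```

With a toy latency model of `rps / 10` milliseconds, the cluster's upper limit comes out at 1000 rps and the scale-out policy would trigger at 700 rps.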
18. Simian Army
● Ensures cloud handles failures
through regular testing
● The Monkeys
○ Chaos Monkey: Resiliency
○ Latency: Artificial Delays
○ Conformity: Best-practices
○ Janitor: Unused Instances
○ Doctor: Health checks
○ Security: Security Violations
○ Chaos Gorilla: AZ Failure
○ Chaos Kong: Region Failure
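Chaos Monkey's core behavior, randomly terminating instances to prove the service tolerates instance loss, can be reduced to a toy sketch. This is not the real Simian Army code; here "terminating" an instance just removes it from an in-memory group.

```python
import random

# Toy Chaos Monkey: kill one random instance per Auto-Scaling Group and
# report the victims, so resiliency can be exercised regularly.

def chaos_monkey(groups, rng=random.Random(0)):
    """groups: dict of ASG name -> list of instance ids (mutated in place).
    Returns a dict of ASG name -> terminated instance id."""
    terminated = {}
    for name, instances in groups.items():
        if instances:                    # skip empty groups
            victim = rng.choice(instances)
            instances.remove(victim)     # "terminate" the instance
            terminated[name] = victim
    return terminated
```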
19. Canary Release
“Canary release is a technique to reduce the risk
of introducing a new software version in
production by slowly rolling out the change to a
small subset of users before rolling it out to the
entire infrastructure and making it available to
everybody.”
20. Automatic Canary Analysis (ACA)
Exactly what the name implies. An automated
way of analyzing a canary release.
21. ACA: Use Case
● You are a service owner and have finished
implementing a new feature into your application.
● You want to determine if the new build, v1.1, performs comparably to the existing build.
● The new build is deployed automatically to a canary
cluster
● A small percentage of production traffic is steered to the
canary cluster
● After a short period of time, canary analysis
is triggered
22. Automated Canary Analysis
● For a given set of metrics, ACA will compare
samples from control and canary;
● Determine if they are comparable;
● Identify any metrics that deviate from the
baseline;
● And generate a score that indicates the overall
similarity of the canary.
23. Automated Canary Analysis
● The score will be associated
with a Go/No-Go decision;
● And the new build will be
rolled out (or not) to the rest
of the production
environment.
● No workload definitions
● No synthetic load
● No environment issues
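The comparison-and-score flow from the last two slides can be sketched in a few lines. This is a deliberately simplified model, not Netflix's ACA: the relative-deviation check, the 10% tolerance, and the 95-point Go/No-Go threshold are all illustrative assumptions.

```python
from statistics import mean

# Hedged ACA sketch: for each metric, compare canary samples against the
# control baseline, flag large deviations, and compute an overall
# similarity score that drives a Go/No-Go decision.

def canary_score(control, canary, tolerance=0.10):
    """control, canary: dicts of metric name -> list of samples.
    Returns (score 0-100, list of metrics that deviate from baseline)."""
    deviations = []
    for metric in control:
        base = mean(control[metric])
        cand = mean(canary[metric])
        # Relative deviation from the control baseline.
        if base and abs(cand - base) / abs(base) > tolerance:
            deviations.append(metric)
    ok = len(control) - len(deviations)
    score = 100 * ok / len(control)
    return score, deviations

def go_no_go(score, threshold=95.0):
    """Associate the score with a Go/No-Go rollout decision."""
    return "GO" if score >= threshold else "NO-GO"
```

Because the canary serves real production traffic, this comparison needs no workload definition, no synthetic load, and no dedicated test environment, exactly the point of the slide above.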
24. When is it appropriate?
What about pre-production Performance Testing?
26. Remember the short release cycles?
With the short time span between production builds,
pre-production tests don’t warn us much sooner.
(And there’s ACA)
27. So when?
When it brings value. Not just because it is part of a process.
28. When? Use Cases
● New Services
● Initial Cluster Sizing
● Large Code Refactoring
● Architecture Changes
● Workload Changes
● Proof of Concept
● Instance Type Migration
29. Use Cases, cont.
● Troubleshooting
● Tuning
● Teams that release less frequently
○ Intermediary Builds
● Base Components (Paved Road)
○ Amazon Machine Images (AMIs)
○ Platform
○ Common Libraries
30. Who?
● Push “tests” to development teams
● Development understands the product; they developed it
● Performance Engineering knows the tools
and techniques (so we help!)
● Easier to scale the effort!
31. How? Environment
● Free to create any environment configuration
● Integration stack
● Full production-like or scaled-down environment
● Hybrid model
○ Performance + integration stack
● Production testing
32. How? Monitoring
● We developed our own
tools
● Commercial tools did
not work for us
● Open source
● Atlas and Vector
39. Takeaways
● Canary analysis
● Testing only when it brings VALUE
● Leveraging cloud for tests
● Automated test analysis
● Pushing execution to development teams
● Open source tools