Ensuring Performance in a Fast-Paced Environment 
Martin Spier 
Performance Engineering @ Netflix 
@spiermar 
mspier@netflix.com 
Performance & Capacity 2014 by CMG
Martin Spier 
● Performance Engineer @ Netflix 
● Previously @ Expedia and Dell 
● Performance 
o Architecture, Tuning and Profiling 
o Testing and Frameworks 
o Tool Development 
● Blog @ http://overloaded.io 
● Twitter @spiermar
● World's leading Internet television network 
● ⅓ of all traffic heading into American homes at 
peak hours 
● > 50 million members 
● > 40 countries 
● > 1 billion hours of TV shows and movies per 
month 
● 100s of different client devices
Agenda 
● How Netflix Works 
o Culture, Development Model, High-level 
Architecture, Platform 
● Ensuring Performance 
o Auto-Scaling, Squeeze Tests, Simian Army, Hystrix, 
Redundancy, Canary Analysis, Performance Test 
Framework, Large Scale Tests
Freedom and Responsibility 
● Culture deck* is TRUE 
o 9M+ views 
● Minimal process 
● Context over control 
● Root access to everything 
● No approvals required 
● Only Senior Engineers 
* http://www.slideshare.net/reed2001/culture-1798664
Independent Development Teams 
● Highly aligned, loosely coupled 
● Free to define release cycles 
● Free to choose any methodology 
● But it’s an agile environment 
● And there is a “paved road”
Development Agility 
● Continuous innovation cycle 
● Shorter development cycles 
● Automate everything! 
● Self-service deployments 
● A/B Tests 
● Failure cost close to zero 
● Lower time to market 
● Innovation > Risk
Architecture 
● Scalable and Resilient 
● Micro-services 
● Stateless 
● Assume Failure 
● Backwards Compatible 
● Service Discovery
Zuul & Dynamic Routing 
● Zuul, the front door for all requests from devices and 
websites to the backend of the Netflix streaming 
application 
● Dynamic Routing 
● Monitoring 
● Resiliency and Security 
● Region and AZ Failure 
* https://github.com/Netflix/zuul
Cloud 
● Amazon’s AWS 
● Multi-region Active/Active 
● Ephemeral Instances 
● Auto-Scaling 
● Netflix OSS (https://github.com/Netflix)
Performance Engineering 
● Not a part of any development team 
● Not a shared service 
● Improve and maintain performance and 
reliability through consultation 
● Provide self-service performance analysis utilities 
● Disseminate performance best practices 
● And we’re hiring!
What about Performance?
Auto-Scaling 
● 5-6x Intraday 
● Auto-Scaling Groups (ASGs) 
● Reactive Auto-Scaling 
● Predictive Auto-Scaling (Scryer)
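The reactive side of auto-scaling can be sketched as a simple threshold policy: scale out when load rises above a target band, scale in when it falls below. This is an illustrative sketch only, not Netflix's actual scaling logic; `reactive_scale` and its threshold values are hypothetical.

```python
def reactive_scale(current_instances, avg_load_pct,
                   scale_up_at=60.0, scale_down_at=30.0, step_pct=25):
    """Threshold-based reactive scaling: add or remove a fixed percentage
    of capacity when average load leaves the configured band."""
    step = max(1, current_instances * step_pct // 100)
    if avg_load_pct > scale_up_at:
        return current_instances + step           # scale out
    if avg_load_pct < scale_down_at:
        return max(1, current_instances - step)   # scale in
    return current_instances                      # within band: no change
```

Predictive auto-scaling (Scryer) differs in that it scales *ahead* of the load curve from historical traffic patterns, rather than reacting to the current measurement as above.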
Squeeze Tests 
● Stress Test, with Production Load 
● Steering Production Traffic 
● Understand the Upper Limits of Capacity 
● Adjust Auto-Scaling Policies 
● Automated Squeeze Tests
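The core of a squeeze test is stepping per-instance load upward (by steering more production traffic onto fewer instances) until an SLO check fails; the last passing level is the usable per-instance capacity. A minimal sketch, with hypothetical load steps and SLO check:

```python
def squeeze_capacity(per_instance_rps, slo_ok):
    """Walk per-instance load upward until the SLO check fails; return the
    last load level that still met the SLO (illustrative only)."""
    last_good = 0
    for rps in per_instance_rps:      # increasing load steps
        if not slo_ok(rps):
            break
        last_good = rps
    return last_good

# Example: pretend latency breaches the SLO above 450 req/s per instance.
limit = squeeze_capacity([100, 200, 300, 400, 500], lambda r: r <= 450)
```

The measured limit then feeds back into the auto-scaling policy (e.g., scale out well before any instance approaches `limit`).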
Red/Black Pushes 
● New builds are rolled out as new 
Auto-Scaling Groups (ASGs) 
● Elastic Load Balancers (ELBs) 
control the traffic going to each 
ASG 
● Fast and simple rollback if issues 
are found 
● Canary Clusters are used to test 
builds before a full rollout
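The rollback property of red/black pushes comes from keeping the old ASG warm behind the ELB: shifting traffic back is just a routing change. A toy sketch of that decision (in practice ELB weight changes are gradual and the health signal comes from monitoring, not a single callback):

```python
def red_black_push(elb, old_asg, new_asg, healthy):
    """Shift traffic from the old ASG to the new one; if the new build is
    unhealthy, flip traffic straight back (fast, simple rollback)."""
    elb[new_asg] = 100
    elb[old_asg] = 0
    if not healthy(new_asg):
        elb[old_asg], elb[new_asg] = 100, 0   # rollback: old ASG kept warm
        return "rolled back"
    return "promoted"

elb = {"v1.0": 100, "v1.1": 0}
result = red_black_push(elb, "v1.0", "v1.1", healthy=lambda asg: False)
```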
Monitoring: Atlas 
● Humongous, 1.2 billion distinct time 
series 
● Integrated with all systems, production 
and test 
● 1 minute resolution, quick roll ups 
● 12-month persistence 
● API and querying UI 
● System and Application Level 
● Servo (github.com/Netflix/servo) 
● Custom dashboards
Vector 
● 1 second Resolution 
● No Persistence 
● Leverages Performance Co-Pilot (PCP) 
● System-level Metrics 
● Java Metrics (parfait) 
● ElasticSearch, Cassandra 
● Flame Graphs (Brendan Gregg)
Mogul 
● ASG and Instance Level 
● Resource Demand 
● Performance Characteristics 
● Downstream Dependencies
Slalom 
● Cluster Level 
● High-level Demand Flow 
● Cross-application Request 
Tracing 
● Downstream and Upstream 
Demand
Canary Release 
“Canary release is a technique to reduce the risk 
of introducing a new software version in 
production by slowly rolling out the change to a 
small subset of users before rolling it out to the 
entire infrastructure and making it available to 
everybody.”
Automatic Canary Analysis (ACA) 
Exactly what the name implies. An automated 
way of analyzing a canary release.
ACA: Use Case 
● You are a service owner and have finished 
implementing a new feature into your application. 
● You want to determine if the new build, v1.1, is 
performing comparably to the existing build. 
● The new build is deployed automatically to a canary 
cluster 
● A small percentage of production traffic is steered to the 
canary cluster 
● After a short period of time, canary analysis 
is triggered
Automated Canary Analysis 
● For a given set of metrics, ACA will compare 
samples from baseline and canary; 
● Determine if they are analogous; 
● Identify any metrics that deviate from the 
baseline; 
● And generate a score that indicates the overall 
similarity of the canary.
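The comparison and scoring steps above can be sketched as follows. This is an illustrative simplification (ACA's real statistics are more sophisticated); `canary_score`, the 10% tolerance, and the sample metrics are all hypothetical.

```python
from statistics import mean

def canary_score(baseline, canary, tolerance=0.10):
    """Compare canary metric samples against the baseline: flag metrics
    whose mean deviates by more than `tolerance`, and score the overall
    similarity as the fraction of metrics that stayed within bounds."""
    deviating = []
    for name in baseline:
        b, c = mean(baseline[name]), mean(canary[name])
        if b and abs(c - b) / abs(b) > tolerance:
            deviating.append(name)
    score = 1 - len(deviating) / len(baseline)
    return score, deviating

baseline = {"latency_ms": [100, 102, 98], "cpu_pct": [40, 42, 41]}
canary   = {"latency_ms": [101, 99, 103], "cpu_pct": [55, 57, 56]}
baseline["error_rate"] = [0.01, 0.01, 0.01]
canary["error_rate"]   = [0.01, 0.01, 0.01]
score, bad = canary_score(baseline, canary)   # cpu_pct deviates; 2 of 3 pass
```

A score like this is what gets mapped to the Go/No-Go decision on the next slide.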
Automated Canary Analysis 
● The score will be associated 
with a Go/No-Go decision; 
● And the new build will be 
rolled out (or not) to the rest 
of the production 
environment. 
● No workload definitions 
● No synthetic load
What about pre-production 
Performance 
Testing? 
When is it appropriate?
Not always! 
Sometimes it doesn't make sense to run 
performance tests.
Remember the short release cycles? 
With the short time span between production builds, 
pre-production tests don’t warn us much sooner. 
(And there’s ACA)
So when? 
When it brings value. Not just because it is 
part of a process.
When? Use Cases 
● New Services 
● Large Code Refactoring 
● Architecture Changes 
● Workload Changes 
● Proof of Concept 
● Initial Cluster Sizing 
● Instance Type Migration
Use Cases, cont. 
● Troubleshooting 
● Tuning 
● Teams that release less frequently 
o Intermediary Builds 
● Base Components (Paved Road) 
o Amazon Cloud Images (AMIs) 
o Platform 
o Common Libraries
Who? 
● Push “tests” to development teams 
● Development understands the product, they 
developed it 
● Performance Engineering knows the tools 
and techniques (so we help!) 
● Easier to scale the effort!
How? Environment 
● Free to create any environment configuration 
● Integration stack 
● Full production-like or scaled-down environment 
● Hybrid model 
o Performance + integration stack 
● Production testing
How? Test Framework 
● Built around JMeter
How? Test Framework 
● Runs on Amazon’s EC2 
● Leverages Jenkins for orchestration
How? Analysis 
● In-house developed web analysis tool and API 
● Results persisted on Amazon’s S3 and RDS
How? Analysis 
● Automated analysis built-in (thresholds) 
● Customized alerts 
● Interface with monitoring tools
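The built-in threshold analysis amounts to checking each measured metric against a configured limit and raising alerts on breaches. A minimal sketch; `check_thresholds` and the metric names are hypothetical:

```python
def check_thresholds(results, thresholds):
    """Automated pass/fail analysis: compare each measured metric against
    its configured threshold and collect alerts for breaches."""
    alerts = []
    for metric, limit in thresholds.items():
        value = results.get(metric)
        if value is not None and value > limit:
            alerts.append(f"{metric}: {value} exceeds threshold {limit}")
    return alerts

alerts = check_thresholds(
    {"p99_latency_ms": 480, "error_rate_pct": 0.2},   # measured in the test run
    {"p99_latency_ms": 250, "error_rate_pct": 1.0},   # configured limits
)
```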
Large Scale Tests 
● > 100k req/s 
● > 100 load generators 
● High Throughput Components 
o In-Memory Caches 
● Component scaling 
● Full production tests
Large Scale Tests: Problems 
● Your test client is likely the first bottleneck 
● Components are (often) not designed to 
scale 
o Great performance per node; 
o But they don’t scale horizontally. 
o Controller, data feeder, load generator*, result 
collection, result analysis, monitoring 
* often the exception
Large Scale Tests: Single Controller 
● Single controller, multiple load generators 
● Controller also serves as data feeder 
● Controller collects all results synchronously 
● Controller aggregates monitoring data 
● Batch and async might alleviate the problem 
● Analysis of large result sets is heavy (think 
percentiles)
Large Scale Tests: Distributed Model 
● Data Feeding and Load Generation 
o No Controller 
o Independent Load Generators 
● Data Collection and Monitoring 
o Decentralized Monitoring Platform 
● Data Analysis 
o Aggregation at node level 
o Hive/Pig 
o ElasticSearch
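Node-level aggregation works because bucketed histograms, unlike raw percentiles, merge with a per-bucket sum: each load generator keeps its own latency histogram and only the merged counts are needed for a global percentile. An illustrative sketch with made-up bucket bounds:

```python
def percentile_from_histogram(hist, q):
    """Approximate a percentile from a bucketed histogram: walk buckets in
    order until the cumulative count reaches the q-th fraction of samples."""
    total = sum(hist.values())
    target = q * total
    seen = 0
    for upper_bound in sorted(hist):
        seen += hist[upper_bound]
        if seen >= target:
            return upper_bound
    return max(hist)

# Per-node latency histograms: bucket upper bound (ms) -> sample count.
node_a = {10: 50, 20: 30, 50: 15, 100: 5}
node_b = {10: 40, 20: 40, 50: 10, 100: 10}
# Merging is a per-bucket sum, so no node ships raw samples to a controller.
merged = {b: node_a.get(b, 0) + node_b.get(b, 0) for b in set(node_a) | set(node_b)}
p99 = percentile_from_histogram(merged, 0.99)
```

This is why the distributed model scales where the single-controller model (which must collect every raw result to compute percentiles) does not.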
Takeaways 
● Canary analysis 
● Testing only when it brings VALUE 
● Leveraging cloud for tests 
● Automated test analysis 
● Pushing execution to development teams 
● Open source tools
Martin Spier 
mspier@netflix.com 
@spiermar 
http://overloaded.io/
References 
● parfait (https://code.google.com/p/parfait/) 
● servo (https://github.com/Netflix/servo) 
● hystrix (https://github.com/Netflix/Hystrix) 
● culture deck (http://www.slideshare.net/reed2001/culture-1798664) 
● zuul (https://github.com/Netflix/zuul) 
● scryer (http://techblog.netflix.com/2013/11/scryer-netflixs-predictive-auto-scaling.html)
Backup Slides
Simian Army 
● Ensures cloud handles failures 
through regular testing 
● The Monkeys 
o Chaos Monkey: Resiliency 
o Latency: Artificial Delays 
o Conformity: Best-practices 
o Janitor: Unused Instances 
o Doctor: Health checks 
o Security: Security Violations 
o Chaos Gorilla: AZ Failure 
o Chaos Kong: Region Failure
Hystrix 
“... is a latency and fault 
tolerance library designed to 
isolate points of access to 
remote systems ...” 
● Stop cascading failures. 
● Fallbacks and graceful degradation 
● Fail fast and rapid recovery 
● Thread and semaphore isolation with 
circuit breakers 
● Real-time monitoring and 
configuration changes 
* https://github.com/Netflix/Hystrix
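Hystrix itself is a Java library; the circuit-breaker idea behind its fail-fast behavior can be sketched in a few lines of Python. This is a conceptual toy, not Hystrix's real API (which also covers thread/semaphore isolation, timeouts, and rolling statistics):

```python
class CircuitBreaker:
    """Minimal circuit-breaker sketch in the spirit of Hystrix: after
    `max_failures` consecutive errors the circuit opens, and calls fail
    fast to a fallback instead of hitting the remote system again."""
    def __init__(self, max_failures=3):
        self.max_failures = max_failures
        self.failures = 0

    def call(self, func, fallback):
        if self.failures >= self.max_failures:   # open: fail fast
            return fallback()
        try:
            result = func()
            self.failures = 0                    # closed: reset on success
            return result
        except Exception:
            self.failures += 1
            return fallback()                    # graceful degradation

breaker = CircuitBreaker(max_failures=2)
def flaky():
    raise RuntimeError("remote system down")
responses = [breaker.call(flaky, lambda: "fallback") for _ in range(4)]
```

After the second failure the remote call is never attempted again until the breaker is reset, which is what stops a slow dependency from cascading.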
Real-time Analytics Platform (RTA) 
● ACA runs on top of RTA 
● Compute Engines 
o OpenCPU (R) 
o OpenPY (Python) 
● Data Sources 
o Real-time Monitoring Systems 
o Big Data Platforms 
● Reporting, Scheduling, Persistence
Slow Performance Regression 
● Deviation => “acceptable” regression 
● Small performance regressions might sneak in 
● Short release cycle = many releases 
● Many releases = cumulative regression
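The arithmetic behind the slow-regression risk: a per-release deviation small enough to pass any single canary check compounds across a short release cycle. The 1% and 50-release figures below are illustrative, not from the talk:

```python
# A 1% "acceptable" latency regression per release compounds multiplicatively:
# after 50 releases the service is roughly 64% slower than the original build.
per_release = 1.01
releases = 50
cumulative = per_release ** releases - 1   # ~0.64, i.e. ~64% total regression
```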
Testing Lower Level Components 
● Base AMIs 
o OS (Linux), tools and agents 
● Common Application Platform 
● Common Libraries 
● Reference Application 
o Leverages a common architecture (front, middle, 
data, memcache, jar clients, Hystrix) 
o Implements functions that stress 
specific resources (cpu, service, db)
