2. Netflix Facts
❖ Leading Video streaming Service
❖ 140+ million paid subscribers globally
❖ 190 countries
❖ Millions of hours watched per month
❖ $13 billion spent on content per year
❖ 15% of the world’s Internet bandwidth
❖ 1997 - Netflix was founded
❖ 1999 - DVD distribution launched
❖ 2007 - Video streaming launched
❖ 2010 - Expanded into Canada
❖ 2014 - Expanded into Europe
❖ 2016 - Globally Launched
Amer Ather
Netflix Performance Engineering
4. Load Balancing across AWS Regions
❖ Multiple active AWS regions
❖ Traffic load balanced across 3 AWS regions
❖ Takes into account geographical location of subscriber
❖ Enough capacity to handle region failures gracefully
❖ Region failover is handled via Netflix Gateway (Zuul) and DNS steering
Note: Netflix avoids unbalanced regions by shifting a portion of local traffic to remote regions
[Diagram: Netflix Control Plane spanning AWS regions us-west-2, eu-west-1, and us-east-1]
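The steering behavior on this slide can be sketched roughly as follows. This is a minimal illustration, not Netflix's implementation: the location names, the `PREFERENCE` table, and the `pick_region` function are all hypothetical, and the real system combines Zuul and DNS steering in ways not shown here.

```python
# Hypothetical sketch of geography-aware region steering with failover.
REGIONS = ["us-west-2", "us-east-1", "eu-west-1"]

# Assumed proximity ranking per coarse subscriber location (illustrative only).
PREFERENCE = {
    "americas-west": ["us-west-2", "us-east-1", "eu-west-1"],
    "americas-east": ["us-east-1", "us-west-2", "eu-west-1"],
    "europe": ["eu-west-1", "us-east-1", "us-west-2"],
}


def pick_region(location, healthy, shift_fraction=None, bucket=0.0):
    """Pick a region for one request.

    healthy: set of regions currently accepting traffic.
    shift_fraction: region -> fraction of its local traffic to move to the
        next-closest healthy region (used to rebalance a hot region).
    bucket: number in [0, 1), e.g. derived from a stable hash of the subscriber.
    """
    shift_fraction = shift_fraction or {}
    candidates = [r for r in PREFERENCE[location] if r in healthy]
    if not candidates:
        raise RuntimeError("no healthy region available")
    first = candidates[0]
    # Shift a slice of local traffic to the next region if rebalancing is on.
    if len(candidates) > 1 and bucket < shift_fraction.get(first, 0.0):
        return candidates[1]
    return first
```

Removing a region from `healthy` models a regional failover; setting `shift_fraction` models moving a portion of local traffic to a remote region.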
6. Zuul - Front Door to Netflix Ecosystem
Self Service Routing
❖ Traffic sharding
❖ Gradual migration
❖ Canary and squeeze testing
❖ Authentication
❖ Security rules to reject traffic from bad devices
Resiliency and LB
❖ Failover around server failures: slow response, GC
❖ Graceful traffic ramp up to newly launched instances
❖ Blacklist bad instances
❖ Prevent overloading
❖ Track server utilization
Anomaly Detection
❖ Aggregate error rates to detect if service is in trouble
❖ Contextual alerting about anomalies to support and service teams
❖ Helps with root cause and correlation
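As an illustration of the "graceful traffic ramp up" item above, here is a minimal sketch of one way a load balancer might weight newly launched instances. The linear ramp, the 120-second warm-up, the 5% floor, and both function names are assumptions made for this example; the slide does not describe Zuul's actual algorithm.

```python
import random


def ramp_weight(age_seconds, warmup_seconds=120.0, floor=0.05):
    """Weight grows linearly from `floor` to 1.0 over the warm-up window."""
    if warmup_seconds <= 0:
        return 1.0
    progress = min(max(age_seconds / warmup_seconds, 0.0), 1.0)
    return floor + (1.0 - floor) * progress


def choose_instance(instances, now, rng=random):
    """Weighted random choice; instances is a list of (name, launch_time)."""
    names = [name for name, _ in instances]
    weights = [ramp_weight(now - launched) for _, launched in instances]
    return rng.choices(names, weights=weights, k=1)[0]
```

A fresh instance receives only a small share of requests at first, which gives caches and JIT compilation time to warm up before it takes a full share.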
8. API - Netflix Edge Services
❖ Tier 1 service
❖ Serves Netflix devices
❖ Composes calls to mid-tier services required to construct a response
❖ Orchestrates UI requests to mid-tier services
❖ Fallback logic to avert customer-facing outages
❖ Abstracts away mid-tier changes from UI development
❖ Facade over the entirety of Netflix mid-tier services
❖ Proxies device requests to reduce network chattiness and latency
❖ Promotes a request/response model that best fits each device's unique requirements
9. Edge PaaS - Netflix Edge Services
❖ Decouple device UI development from API and mid-tier service changes
❖ Per-device endpoints customized for device type for a richer experience
❖ Each endpoint is isolated in a container for better visibility and debugging
❖ Node Quark platform for ease of Node.js development and integration with the Netflix platform
❖ Titus Container Platform
❖ Cloud deployment via Spinnaker CI/CD platform
❖ RSL (Remote Service Layer) for data access to the API tier via remote calls
❖ Device-specific API instead of traditional REST API (device code is mostly written in JavaScript)
11. Load Shedding (server)
❖ When a service is running in steady state: concurrency = service time x service rate
❖ Requests in excess of this concurrency limit cannot be serviced immediately
❖ The service has two options: queue or reject
❖ Netflix services reject requests over the set limit to avoid oversaturation
❖ Server-side throttling is performed by setting a cap on the concurrent requests a service can handle
Netflix microservices use a servlet filter mechanism, part of the platform library, to intercept requests of interest and throttle them based on the current load on the server
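The steady-state relation and the reject-over-limit policy above can be sketched as follows. This is an illustrative Python stand-in for a filter-style throttle, not the Netflix platform library; the class and function names, and the choice of a 503 status for rejected requests, are assumptions.

```python
import threading


def steady_state_concurrency(service_time_s, service_rate_rps):
    """concurrency = service time x service rate (Little's law)."""
    return service_time_s * service_rate_rps


class ConcurrencyLimiter:
    """Caps the number of in-flight requests; excess requests are rejected."""

    def __init__(self, limit):
        self._limit = limit
        self._inflight = 0
        self._lock = threading.Lock()

    def try_acquire(self):
        with self._lock:
            if self._inflight >= self._limit:
                return False
            self._inflight += 1
            return True

    def release(self):
        with self._lock:
            self._inflight -= 1


def handle(limiter, work):
    """Filter-style wrapper: reject instead of queueing when over the limit."""
    if not limiter.try_acquire():
        return 503, "rejected: over concurrency limit"
    try:
        return 200, work()
    finally:
        limiter.release()
```

For example, a service with a 50 ms service time handling 200 requests per second sustains about 10 concurrent requests, so a cap near that value sheds load only when the service is already saturated.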
12. Fault Tolerance (Client)
Netflix microservices based on gRPC do not use the Ribbon and Hystrix libraries, as the features they offer are already provided in gRPC.
❖ Services protect themselves from latency and failure conditions:
➢ 5xx responses, connection refused, and timeouts
❖ A retried request can be routed to the next server by load balancing until max retries are reached
❖ Fail fast and rapid recovery
❖ Fallback and graceful degradation
❖ Fallback to failure paths to avert outages
❖ Stop cascading failures
❖ AWS zone-aware load balancing (zone affinity)
Netflix microservices use the Ribbon RestClient and Hystrix libraries to set up latency and failure tolerance for downstream dependency services.
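The retry-then-fallback behavior listed above can be sketched generically. This is not the Ribbon or Hystrix API: `DownstreamError`, `call_with_retry`, and the round-robin server rotation are hypothetical names and simplifications chosen for this example.

```python
class DownstreamError(Exception):
    """Stands in for a 5xx response, a refused connection, or a timeout."""


def call_with_retry(servers, request, send, fallback, max_retries=2):
    """Try the request on successive servers, then fall back.

    send(server, request) performs the call and raises DownstreamError on
    failure. fallback(request) returns a degraded but usable response.
    """
    for attempt in range(max_retries + 1):
        server = servers[attempt % len(servers)]  # next server on each retry
        try:
            return send(server, request)
        except DownstreamError:
            continue  # fail fast on this server and move on
    # All attempts failed: degrade gracefully instead of propagating the error.
    return fallback(request)
```

Returning a fallback rather than raising keeps a failing dependency from cascading into its callers.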
13. Auto Discover Concurrency Limits
❖ Setting concurrency limits manually in a changing environment is challenging
❖ Requires constant care and monitoring due to changes in load characteristics
❖ Better approach: identify concurrency limits dynamically and throttle requests before the service degrades
❖ Concept is borrowed from TCP congestion control algorithms:
➢ A congestion window determines how many packets can be transferred without incurring timeouts
➢ Tracks the ratio of minimum latency to time-sampled latency => RTTnoload/RTTactual
➢ Grow window (increase request rate) if ratio = 1
➢ Shrink window (decrease request rate) if ratio < 1
❖ Limit is adjusted using a formula: newLimit = currentLimit x (RTTnoload/RTTactual) + queueSize
Netflix microservices are in the process of migrating to gRPC from the internal Ribbon IPC mechanism. Netflix has open sourced a gRPC library for dynamically auto-detecting the concurrency limits of a service.
queueSize is a tunable that determines how fast the limit can grow
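The limit formula above can be tried out with a few lines of Python. This is a sketch of the formula as written on the slide, not the open-sourced library's code; the function name and the `min_limit`/`max_limit` clamps are assumptions added so the example stays bounded.

```python
def next_limit(current_limit, rtt_noload, rtt_actual, queue_size,
               min_limit=1.0, max_limit=1000.0):
    """newLimit = currentLimit x (RTTnoload / RTTactual) + queueSize."""
    # Ratio is 1 when there is no queueing and drops below 1 as latency rises.
    ratio = min(1.0, rtt_noload / rtt_actual)
    new_limit = current_limit * ratio + queue_size
    # Assumed safety clamps, not part of the formula on the slide.
    return max(min_limit, min(max_limit, new_limit))
```

With `current_limit=100` and `queue_size=4`, a ratio of 1 grows the limit to 104, while a doubling of latency (ratio 0.5) shrinks it to 54.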
15. Microservice Architecture
Architecture designed to decompose one large monolithic application into a suite of small services, where each service:
❖ Implements different sets of business logic
❖ Is a software module exposed on the network via a web API
❖ Interacts via some form of RPC mechanism: Netflix Ribbon, gRPC
❖ Uses RPC as a thin layer over standard transports: HTTP/1.1, HTTP/2
❖ Exchanges data via JSON or Protocol Buffers
❖ Builds, deploys, upgrades, and scales independently
❖ Can be developed in different languages: Java, Python, Go, Node.js..
❖ Is free to choose its own datastore for persistence: Cassandra, memcached, Redis, Elasticsearch, MongoDB..
❖ Relies on platform libraries that support retry, timeouts, load balancing, and fallback
❖ Is massively scalable due to loose coupling, stateless model, and data sharding
16. Microservices Design Rules
❖ Services should not share data or database
❖ Services expose their data and functionality only through a well-defined service interface
❖ Transactions should not span multiple services, as that violates their autonomy
❖ One service should not lock resources of another service
❖ API first: take into account upstream service (client) requirements and dependencies on downstream services. Externalizable (open to the public) without major effort
❖ Split service into multiple microservices when functions performed by a service have no strong
relationship with one another.
Ideally, each service team should own the release cycle as well as the production
operations (DevOps) of their service
17. Monolithic vs MicroServices
Monolithic
➢ Data Center Architecture
➢ Design for predictable scalability
➢ Relational DB: Oracle, MySQL
➢ Strong consistency
➢ Shared database
➢ Serial and synchronized processing
➢ Design to avoid failures
➢ Infrequent and slower updates
➢ Manual management
➢ Failures may result in outage
➢ Limited scalability due to stateful design
Microservices
➢ Cloud Architecture
➢ Decomposed and decentralized
➢ Design for elastic scale
➢ Polyglot persistence (mix of datastores)
➢ Eventual consistency
➢ Sharded datasets
➢ Parallel and async processing
➢ Design for failure
➢ Frequent updates (more features)
➢ Self-management (DevOps, CI/CD)
➢ Massively scalable due to stateless design goals
➢ Immutable infrastructure
18. RESTful Service (Web API)
❖ A platform that exposes data as a resource on which to operate
All client actions on a resource (identified by URI) are represented by HTTP CRUD methods:
➢ POST: Create | GET: Read | PUT/PATCH: Update | DELETE: Delete
➢ HTTP status codes (2xx, 3xx, 4xx, 5xx) are returned with the response
❖ Server response is sent in JSON
❖ A simple client (curl) can be used to invoke REST methods
❖ Each request/response is stateless and thus can be cached and massively scaled
❖ Client maintains state and furnishes it to the server with every request
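To make the method-to-CRUD mapping above concrete, here is a minimal in-memory sketch. The `/titles` resource, the `handle` function, and the specific status codes returned are illustrative assumptions, not a Netflix API.

```python
import json


def handle(store, method, path, body=None):
    """Map an HTTP method on /titles or /titles/<id> to a CRUD operation.

    Returns (status_code, json_text). `store` is a plain dict keyed by id.
    """
    parts = [p for p in path.split("/") if p]
    if not parts or parts[0] != "titles" or len(parts) > 2:
        return 404, json.dumps({"error": "not found"})
    if len(parts) == 1:  # collection URI: only Create is supported here
        if method != "POST":
            return 405, json.dumps({"error": "method not allowed"})
        new_id = str(max((int(k) for k in store), default=0) + 1)
        store[new_id] = dict(body or {})
        return 201, json.dumps({"id": new_id})
    item_id = parts[1]
    if item_id not in store:
        return 404, json.dumps({"error": "not found"})
    if method == "GET":  # Read
        return 200, json.dumps(store[item_id])
    if method == "PUT":  # Update by full replacement
        store[item_id] = dict(body or {})
        return 200, json.dumps(store[item_id])
    if method == "PATCH":  # Update by partial modification
        store[item_id].update(body or {})
        return 200, json.dumps(store[item_id])
    if method == "DELETE":  # Delete
        del store[item_id]
        return 204, ""
    return 405, json.dumps({"error": "method not allowed"})
```

The function keeps no per-client session: everything needed to serve a request arrives with that request, which is what makes responses cacheable and the service easy to scale out.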
19. “REST-ish” API - Netflix Falcor
❖ REST interface works well for large hypermedia resources
❖ Web apps deal with structured data that can be a large number of small resources, e.g. video metadata
❖ Latency becomes a major constraint when fetching these small resources via REST calls on mobile networks
➢ Rendering the Netflix home page on a device may require 20-30 REST calls to the server
❖ REST-ish API: the developer wants to do more with a single REST call
❖ REST-ish API is less RESTful and more RPC
➢ URL of a resource is used to invoke a procedure call
➢ URL query string becomes RPC parameters
❖ Falcor represents data as one giant JSON model and offers an async API that allows data to be pushed to the model via callback
❖ Same benefits as REST (cache consistency, loose coupling)
❖ Batching multiple requests results in a single network request
❖ Falcor represents data as a JSON graph by using references. This avoids duplicates and stale data by storing each value in one place
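The idea of references in a JSON graph can be sketched in Python. The `{"$type": "ref", "value": [...]}` shape follows Falcor's JSON Graph convention; the `get_path` helper, the sample graph, and the rule of following a reference wherever it is encountered are simplifications for this example and not Falcor's actual evaluation logic.

```python
def is_ref(node):
    return isinstance(node, dict) and node.get("$type") == "ref"


def get_path(graph, path, max_hops=16):
    """Walk `path` from the graph root, following references as they appear."""
    node = graph
    for i, key in enumerate(path):
        node = node[key]
        if is_ref(node):
            if max_hops <= 0:
                raise ValueError("too many reference hops (possible cycle)")
            # Restart from the root at the referenced path, keeping the tail.
            target = list(node["value"]) + list(path[i + 1:])
            return get_path(graph, target, max_hops - 1)
    return node


# Hypothetical data: two lists point at the same title object.
graph = {
    "titlesById": {99: {"name": "Example Title", "rating": 5}},
    "lists": {
        0: {0: {"$type": "ref", "value": ["titlesById", 99]}},
        1: {0: {"$type": "ref", "value": ["titlesById", 99]}},
    },
}
```

Because both lists hold a reference rather than a copy, updating `graph["titlesById"][99]["rating"]` once is visible through either list, which is how a single stored value avoids stale duplicates.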
20. gRPC for Microservices (Benefits)
❖ RPC framework for building microservices that uses HTTP/2 transport to support advanced features:
➢ Request multiplexing (streams), pipelining, server push; binary protocol
❖ Protocol Buffers are used for defining and serializing structured data into an efficient binary format. No JSON schema.
❖ A client can invoke a method on a different machine as if it were a local object. RPC methods become RPC endpoints
❖ Netty transport provides async and non-blocking IO
❖ Strongly typed and versioned. Simplified API (struct in, struct out)
❖ Decouples the interface from any specific programming language via IDL
❖ Automated code generation to implement service interfaces (API): clients, server, data models, metrics, logging, tracing, failover, retry, deadline, cancellation, etc. Supports plugins to extend features
❖ HttpRule in service definition to define mapping of an RPC method to HTTP REST methods
22. Netflix Global Cache (EVCache)
❖ Stateless microservices often maintain state in caches or a persistent tier
❖ Caches offer loose coupling by maintaining state for stateless services
❖ EVCache is a RAM+NVMe based key-value store (memcached) that offers a low-latency and scalable caching solution. Optimized for cloud and Netflix use cases.
❖ Maintains state in-region and across regions via a global replication design to serve requests originating from any region, using Kafka-based cross-region replication
➢ Eventual, tunable consistency model that tolerates inconsistency for some time
➢ Asynchronous replication keeps local cache operations from being affected by transient failures in updating caches in other regions
➢ Avoids the “thundering herd” scenario that may result from cold caches after region failover
❖ Caching tier is used for caching computed data and data retrieved from persistence stores like Cassandra, S3, DB…
❖ EVCache tier is also used for replica and instance cache warming:
➢ To recover data from lost EVCache instances
➢ To scale up the caching tier for more storage and network capacity
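A toy model can show why asynchronous replication gives eventual consistency across regions. This is not EVCache's client API or replication protocol: `RegionCache`, `replicate`, and the in-memory `outbox` (standing in for the Kafka-based replication stream) are hypothetical.

```python
from collections import deque


class RegionCache:
    """In-memory stand-in for one region's cache tier."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        self.outbox = deque()  # pending cross-region updates

    def set(self, key, value):
        # The local write completes immediately; replication happens later.
        self.data[key] = value
        self.outbox.append((key, value))

    def get(self, key, loader=None):
        """Read from cache; on a miss, optionally load from the backing store."""
        if key in self.data:
            return self.data[key]
        if loader is None:
            return None
        value = loader(key)
        self.data[key] = value
        return value


def replicate(source, targets):
    """Drain the source outbox and apply each update to the other regions."""
    applied = 0
    while source.outbox:
        key, value = source.outbox.popleft()
        for target in targets:
            target.data[key] = value
        applied += 1
    return applied
```

Between `set` and `replicate`, the remote region still returns the old value (or a miss), which is the window of inconsistency the consistency model tolerates. Keeping remote caches populated this way is also what leaves them warm if traffic fails over.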
24. What is Immutable Infrastructure
❖ Servers are never modified in production, merely replaced with new, updated ones
❖ No reboots or individual server changes in production during a server's lifespan
❖ Changes are made to the base image and then deployed on new server instances
➢ Older server instances are terminated after successful deployment
❖ Roll back changes in case of problems
❖ Guarantees a known stable state if frequently destroyed and redeployed. No configuration drift
❖ Follows the Infrastructure-as-Code methodology, which rebuilds the whole environment from scratch using easy-to-adjust manifests
Netflix microservices offer “fast properties” that allow enabling/disabling limited features dynamically while the service is running in production.
25. Immutable Infrastructure (CI/CD Platform)
Updates, canary analysis, and deployments are fully automated and orchestrated via a Continuous Integration and Continuous Deployment or Delivery (CI/CD) platform
26. Public Cloud as Immutable Infrastructure
❖ Disposable or throw away cloud instances
❖ Decide what infrastructure and services to manage. Public cloud providers offer a number of useful managed services
❖ Cloud deployable entities: VM, Firecracker, Containers, Fargate, Lambda
❖ No hardware to repair or troubleshoot. Just provision a new one
❖ Failures are non-events. Health check failures result in redirected traffic
❖ Bad or terminated instances are replaced without human intervention
❖ No service downtime due to massive deployment and fault tolerance
❖ Elastic capacity and pay-as-you-go model
❖ Auto scaling rules keep enough resources available to meet load demand
❖ Global reach and availability to execute disaster recovery plans
My presentation on Public Cloud Computing Workshop
28. Planning for Failure
Regional Failures (Nimble)
❖ A drop in the SPS (Stream Starts Per Second) metric triggers regional failover
❖ Regional failover is executed in 7 minutes
❖ Failover efficiency is achieved by keeping dark capacity online in each region
❖ Dark capacity is whitelisted to take production traffic at failover time
Limited Scope Failures (ChAP)
❖ ChAP tests service resilience to failures and validates that fallbacks behave as expected
❖ ChAP helps uncover systemic weaknesses that may occur when higher latency is induced
❖ ChAP service uses the FIT (Failure Injection Testing) framework for fine-grained control of failure and its impact
❖ Zuul gateway updates requests with FIT metadata that provides failure context to the microservices involved
❖ Microservices check the FIT context to determine whether a particular request should be impacted
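The flow of failure context described above might look roughly like this. It is a sketch only: the `fit` request field, the `inject_failure_at` key, and both function names are invented for illustration and do not reflect the real FIT metadata format.

```python
def should_fail(fit_context, injection_point):
    """True if the request's failure context targets this injection point."""
    if not fit_context:
        return False
    return injection_point in fit_context.get("inject_failure_at", [])


def call_dependency(request, injection_point, call, fallback):
    """Call a dependency, honoring injected failures carried by the request."""
    if should_fail(request.get("fit"), injection_point):
        # Simulated failure: exercise the fallback path without a real outage.
        return fallback()
    try:
        return call()
    except Exception:
        return fallback()
```

Only requests tagged at the gateway take the failure path, so an experiment can be limited to a small slice of traffic while everything else is served normally.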
30. Open Connect Appliances (OCA)
Features
❖ Netflix-managed CDN
❖ Directed caching appliances
❖ Deployed at IXP and ISP colocations globally
❖ Serve subscribers from a location closer to them for an optimal viewing experience
❖ Reduce ISP and Netflix cost of transporting content
❖ Local caching leads to reduced and responsible use of the Internet
❖ Network capacity of ~100Gbps
❖ Stores a portion of the Netflix catalog
❖ 24x7 monitoring
Periodic Fill and Allocation
❖ Push fill methodology
❖ Download popular content during non-peak bandwidth periods
❖ Incremental and tiered filling
❖ ML models compute content popularity to decide which titles to cache by aggregating title/file usage and viewing history
❖ HCA algorithm for content distribution that offers efficient use of server resources
❖ Adapt algorithms to deal with dynamics of regional member preferences, evolving network conditions, and new markets
31. Adaptive Streaming
Adaptive Bitrate (ABR)
❖ Maximize video quality without rebuffer events
❖ Adapt to network events by picking a different bitrate
❖ Multiple profiles (formats) encoded for every title
❖ Supported bitrates: 235 kbps - 5 Mb/s (4K video)
❖ Cache one or more files for each quadruple:
➢ title, profile, bitrate, language
➢ E.g.: one episode of The Crown = 1200 files
❖ Video codecs: H.264/AVC, HEVC, VP9, AV1
❖ Audio codecs: AAC, DD+, Atmos
❖ Max resolution: 1080p, 2160p, 4K, HDR, HFR
Per-shot encoding
❖ Dynamic Optimizer encoding
❖ Allocate bits optimally for best overall quality
❖ Selects best encoding recipe per shot
❖ Remove redundancy in video stream via spatial and temporal prediction/correlation
❖ 64% fewer bits for the same quality
❖ Good streaming experience at < 200 kbps
❖ 4 GB plan = 30 hours of quality viewing
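Two points from this slide lend themselves to a short sketch: picking a bitrate that fits the measured throughput, and the arithmetic behind the data-plan figure. The ladder values, the 0.8 safety factor, and both function names are assumptions for illustration (only the 235 kbps and 5 Mb/s endpoints come from the slide), and 1 GB is taken as 10^9 bytes.

```python
# Hypothetical bitrate ladder in kbps; only the endpoints come from the slide.
LADDER_KBPS = [235, 750, 1750, 3000, 5000]


def pick_bitrate(throughput_kbps, ladder=LADDER_KBPS, safety=0.8):
    """Highest bitrate that fits within a safety margin of the throughput."""
    budget = throughput_kbps * safety
    eligible = [b for b in ladder if b <= budget]
    # If nothing fits, use the lowest rung rather than stalling playback.
    return max(eligible) if eligible else min(ladder)


def average_kbps(plan_gb, hours):
    """Average bitrate implied by consuming plan_gb over the given hours."""
    bits = plan_gb * 1e9 * 8
    return bits / (hours * 3600) / 1000
```

Under these assumptions, a 4 GB plan spread over 30 hours corresponds to an average of roughly 296 kbps for audio and video combined.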
33. Event Stream Processing at Cloud Scale
❖ Collects, aggregates, processes, and moves data at cloud scale
➢ 500 billion events or 1.3 PB of data per day
➢ Peak traffic: 8 million events/sec or 24 GB/s
❖ Types of event streams flowing into the pipeline:
➢ Video viewing and UI activities
➢ Device error logs, diagnostics, and perf events
❖ Kafka as a replicated persistent message queue
➢ Multiple copies with a 12-24 hour retention period
❖ Data is ingested into Kafka fronting clusters via a Java library or a Kafka REST endpoint
❖ Route events from Kafka to various sinks:
➢ Elasticsearch - for near-real-time analysis
➢ S3 bucket - imported into Hive for data warehouse and big data analytics
➢ Consumer Kafka tier - used by streaming services like Mantis and Spark Streaming
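The volume figures above can be cross-checked with simple arithmetic. The only inputs are the four numbers on the slide; decimal units (1 PB = 10^15 bytes, 1 GB = 10^9 bytes) and the variable names are assumptions.

```python
SECONDS_PER_DAY = 24 * 60 * 60

events_per_day = 500e9        # 500 billion events per day
bytes_per_day = 1.3e15        # 1.3 PB per day
peak_events_per_s = 8e6       # 8 million events/sec
peak_bytes_per_s = 24e9       # 24 GB/s

# Daily totals expressed as average rates.
avg_events_per_s = events_per_day / SECONDS_PER_DAY
avg_bytes_per_s = bytes_per_day / SECONDS_PER_DAY

# Implied event sizes, on average and at peak.
avg_event_bytes = bytes_per_day / events_per_day
peak_event_bytes = peak_bytes_per_s / peak_events_per_s

# How far the peak rate sits above the daily average.
peak_to_avg = peak_events_per_s / avg_events_per_s
```

The figures are mutually consistent: about 5.8 million events/sec and 15 GB/s on average, an event size of roughly 2.6 to 3 KB, and a peak about 1.4 times the daily average.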
34. Business Insight
Consumer Insight
❖ Predicts viewing habits
❖ Fuels recommendation engines
❖ Qualitative research
❖ A/B testing
❖ Adaptive row ordering
❖ Title placement
❖ ..
Events gathered
❖ Time of day content is watched
❖ Time spent selecting content
❖ Playback stopped by user or network congestion
❖ Bookmarking
❖ ..
Anomaly Detection
❖ Device firmware differences
❖ Real-time diagnostics
❖ Device health check
❖ Network throughput and congestion differences across ISP networks
❖ ..
36. Self Service Monitoring and Debugging
Monitoring: Build custom dashboards for better correlation and root cause analysis
❖ Service monitoring (Atlas/Lumen) - Telemetry system for microservice health and auto-scaling
❖ Device monitoring (Mantis) - Event streams from devices filtered for: health check, perf, analytics..
❖ Host monitoring (Vector) - Web UI for on-demand system monitoring
❖ Ad hoc monitoring (Abyss) - Low-level deep dive performance analysis for escalated issues
Anomaly Detection: Detect slower performing instances, infrastructure issues and faulty hardware
❖ Alerts - Set up alerts on Atlas metrics and filters on Mantis event streams
❖ Chronos - Tracks infrastructure changes (service or OS). Changes are logged to aid root cause analysis
❖ jvmquake - Terminate nodes with abnormal Garbage Collection Time (GC)
❖ aws_io_detection - Monitor and terminate nodes with IO errors
❖ BaseAMI is frequently updated to detect and remedy known cloud infrastructure issues
Tracing: Low level analysis and distributed tracing
❖ FlameGraph - Aggregates CPU profiling data. Helps identify hot stacks
❖ FlameScope - Identifies the cause of CPU usage variation at sub-second granularity
❖ Zipkin | Slalom - Distributed tracing and dependency graphing for microservices
❖ Java Flight Recorder - A profiling and event collection framework available in OpenJDK
37. Netflix Tech Blogs Resources
❖ Netflix Edge Load Balancing
❖ API - making API Resilient to Failures
❖ Cache Warming for Stateful Service
❖ Netflix Falcor and JSON Graph
❖ Regional Failover in 7 minutes
❖ FIT: Failure Injection Testing
❖ ChAP: Netflix Chaos Automation Platform
❖ Distributing Content to Netflix CDN
❖ Machine Learning to Improve Streaming Quality
❖ Netflix Playback and Downloads
❖ Stream-processing with Mantis
❖ Netflix Stream Data Pipeline
❖ Vector: on-host performance monitoring
❖ Extending Vector with eBPF
❖ Atlas: Netflix Telemetry Platform
❖ Flamegraph: visualize cpu profiling data
❖ FlameScope: Trace Event, Chrome and More Profile Formats
❖ Titus: Running Containers at Scale at Netflix
❖ Spinnaker: Global Continuous Delivery