An edge gateway is an essential piece of infrastructure for large-scale, cloud-based services. This presentation details the purpose, benefits, and use cases of an edge gateway that provides security, traffic management, and cross-region resiliency in the cloud. It discusses how a gateway can enhance continuous deployment, help test new service versions, and provide service insights. It also covers philosophical and architectural approaches to deciding what belongs in a gateway and what belongs in services. Real examples of how gateway services sit in front of nearly all of Netflix's consumer-facing traffic show how gateway infrastructure is used in highly available, massive-scale services.
7. From the Internet to Services in the Cloud
[Diagram: Gateway boxes and a box labeled "?????" in front of origins labeled Origin (API), API, and Website]
8. Our Edge Gateway @ Netflix
Handles most netflix.com hosts
Over 20 production Zuul clusters
~50 ELBs
Gateway handles ~10 origin services
9. Netflix Gateway Scale
Tens of billions of requests per day
3 AWS regions
Over 1000 device types
Hundreds of permutations of protocols and device versions
19. Anti-patterns of most cloud proxies
Static configurations
Service push needed to change behavior
Limited range of functionality
Limited to HTTP
20. Zuul Created
2012
Dynamically injected and compiled filters
Manipulate requests and responses
Headers / Body / etc
Change routing
Add metrics and other functions
Built on Netflix’s OSS stack
Open Sourced
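The idea of dynamically injected and compiled filters can be illustrated with a minimal sketch. This is not Zuul's actual API: the `Request`, `FilterChain`, and `apply` names, the filter naming scheme, and the origin names are all assumptions made for illustration. It shows filter source being compiled while the process is running and then changing headers and routing.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class Request:
    path: str
    headers: Dict[str, str] = field(default_factory=dict)
    route: str = "default-origin"  # hypothetical origin cluster name


# A filter is any callable that inspects or mutates a request in place.
Filter = Callable[[Request], None]


class FilterChain:
    """Holds filters that can be added or replaced without a restart."""

    def __init__(self) -> None:
        self._filters: Dict[str, Filter] = {}

    def load(self, name: str, source: str) -> None:
        # Compile the filter source at runtime. By convention (an assumption
        # of this sketch) the source must define a function `apply(request)`.
        namespace: dict = {}
        exec(compile(source, name, "exec"), namespace)
        self._filters[name] = namespace["apply"]

    def run(self, request: Request) -> Request:
        # Filters run in name order, so a numeric prefix controls ordering.
        for name in sorted(self._filters):
            self._filters[name](request)
        return request


# Filter source that could arrive from a store or file while the gateway runs.
ROUTE_FILTER = '''
def apply(request):
    if request.path.startswith("/api/"):
        request.route = "api-origin"
        request.headers["X-Routed-By"] = "gateway"
'''

chain = FilterChain()
chain.load("10-route-api", ROUTE_FILTER)
routed = chain.run(Request(path="/api/titles"))
```

Loading a new source string under the same name replaces the old filter, which is what lets behavior change without a service push.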
21. Zuul - A Victim of Success
Easy and convenient
Instant results
High adoption
Happy customers
Business logic in proxy
Affects system resiliency
Zuul team in critical path
35. A Global Cloud Deployment
[Diagram: three regions, each with a Network Tier, Presentation Tier, Business services Tier, and Persistence Tier; labeled components: Proxy, Websites, API, DB]
36. Global Cloud Routing
[Diagram: three regions, each with a Network Tier, Presentation Tier, Business services Tier, and Persistence Tier; labeled components: Proxy, Websites, API, DB]
37. A Failing region
[Diagram: three regions, each with a Network Tier, Presentation Tier, Business services Tier, and Persistence Tier; labeled components: Proxy, Websites, API, DB]
38. Gateway routing to other regions
[Diagram: three regions, each with a Network Tier, Presentation Tier, Business services Tier, and Persistence Tier; labeled components: Proxy, Websites, API, DB]
45. A Room with a View - Insights
[Diagram: Gateway boxes, origins labeled Origin (API), API, and Website, and an Insights box]
46. What’s Next for Netflix’s Gateway?
Gateway as a service
Self-service dynamic routing / route validation
Control APIs for special routing functions
Netty Based Zuul (using RxNetty)
Handling persistent connections
Non-blocking, async
Transport protocol agnostic routing
Reactive Socket http://reactivesocket.io/
Our gateway strategy will change the way you think about resiliency, debugging, continuous delivery, service operations, and insights.
Devices slow to update
Need emergency policies
Fast action
Limited range of functionality
Hard to program
Authentication
Authorization
Static responses / Origin specific headers
Why?
Federation of logic across systems creates complexity
Minimize gateway dependencies to maximize availability
Origin services run many clusters
Route to service clusters based on dynamic routing rules
Shape or reject traffic based on service, regional health, or attack
React fast in emergencies
Realtime analytics and insights
Ensures request delivery from internet to services running in the cloud
Dynamically changes routing behaviors
Routes to services
Services have multiple clusters
Clusters have dynamically changing nodes
Bridges multiple cloud regions and data centers
Provides system insights
Same service: Subclusters for many purposes
Set up by filters in Zuul
Self serviceable by cluster owners
Automated Quality assurance / Test Automation
Targeted debugging
Test Automation
Canary / Baseline
A/B testing of service behavior per build
Squeeze Testing
Service capacity testing
Trickle traffic
Instrumented builds
Sticky Canary
A/B testing of client behavior per origin build
Trickling traffic into clusters
High Overhead profiling tools
“Coalmine”
verbose logging
Server capacity testing
Gateway gradually increases traffic until performance degradation is detected
Automated or manual
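The squeeze-testing loop described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the production mechanism: latency is used as the degradation signal, and the `squeeze_test` function, its parameters, and the toy `fake_cluster` model are all hypothetical.

```python
from typing import Callable


def squeeze_test(measure_latency_ms: Callable[[float], float],
                 baseline_ms: float,
                 start_rps: float = 100.0,
                 step_rps: float = 50.0,
                 max_rps: float = 2000.0,
                 tolerance: float = 1.4) -> float:
    """Raise traffic step by step and return the highest tested rate at
    which latency stayed within tolerance * baseline_ms."""
    last_good = 0.0
    rps = start_rps
    while rps <= max_rps:
        latency = measure_latency_ms(rps)
        if latency > tolerance * baseline_ms:
            break  # degradation detected: stop increasing traffic
        last_good = rps
        rps += step_rps
    return last_good


def fake_cluster(rps: float) -> float:
    # Toy stand-in for a real measurement: flat 20 ms up to 500 rps,
    # then latency grows by 1 ms for every extra 10 rps.
    return 20.0 if rps <= 500 else 20.0 + (rps - 500) / 10


capacity = squeeze_test(fake_cluster, baseline_ms=20.0)
```

In a manual run, an operator would take the place of the loop and decide when to stop; in an automated run, the measurement callback would read real metrics from the cluster under test.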
Isolate requests by customer, route, type of device, or any routing rule
Debug node(s) are often instrumented to give verbose logging
Custom Request Routing
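A rough sketch of isolating requests for targeted debugging follows. The `Rule` type, the attribute names, and the cluster names `api-debug` and `api-main` are assumptions made for illustration; the point is only that a rule can match on any combination of customer, device type, and route, and send matching requests to an instrumented cluster.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Rule:
    """A routing rule; a field left as None matches any value."""
    target: str
    customer_id: Optional[str] = None
    device_type: Optional[str] = None
    path_prefix: Optional[str] = None

    def matches(self, customer_id: str, device_type: str, path: str) -> bool:
        return ((self.customer_id is None or self.customer_id == customer_id)
                and (self.device_type is None or self.device_type == device_type)
                and (self.path_prefix is None or path.startswith(self.path_prefix)))


def choose_cluster(rules: List[Rule], customer_id: str, device_type: str,
                   path: str, default: str = "api-main") -> str:
    # The first matching rule wins; everything else goes to the default cluster.
    for rule in rules:
        if rule.matches(customer_id, device_type, path):
            return rule.target
    return default


# Send one customer's TV traffic to a debug cluster with verbose logging.
rules = [Rule(target="api-debug", customer_id="cust-42", device_type="tv")]
debug_target = choose_cluster(rules, "cust-42", "tv", "/api/titles")
```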
Compare server behavior and metrics
Equal traffic rates hit both clusters
Automated part of production push process
Error rates
CPU for equivalent work
Automated metrics analysis returns a score of how well the canary cluster performed
A poor score stops the push process
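One simple way to turn the canary and baseline comparison into a score is sketched below. The scoring rule (fraction of metrics that are no more than 10% worse than baseline), the threshold, and the metric names and values are all assumptions for illustration, not the analysis actually used.

```python
from typing import Dict


def canary_score(baseline: Dict[str, float], canary: Dict[str, float],
                 tolerance: float = 0.10) -> float:
    """Fraction of metrics where the canary is at most `tolerance` worse
    than the baseline. Assumes lower is better for every metric."""
    passed = 0
    for name, base_value in baseline.items():
        if canary[name] <= base_value * (1 + tolerance):
            passed += 1
    return passed / len(baseline)


def should_continue_push(score: float, threshold: float = 0.9) -> bool:
    # A poor score stops the push process.
    return score >= threshold


# Illustrative numbers only: both clusters received equal traffic rates.
baseline = {"error_rate": 0.0100, "cpu_ms_per_request": 4.0}
canary = {"error_rate": 0.0105, "cpu_ms_per_request": 6.0}

score = canary_score(baseline, canary)
keep_pushing = should_continue_push(score)
```

Here the error rate is within tolerance but CPU per request is not, so the canary passes one of two checks and the push would stop.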
Servers may be healthy, but data may be bad
API changes that affect devices
Data changes certain devices can’t interpret
Protocol and transport changes that some devices can’t accept
Testing thousands of device types would be a time-consuming, tedious process.
Sticky Canary idea - Stick all requests for a small subset of customers for a limited time to a “sticky canary” or “sticky baseline”
If servers are equivalent, there should be no behavioral differences.
Insights can help find these anomalies
Limited scope of impact - a very small subset of customers could be affected but only for a short period of time
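The sticky canary idea can be sketched as a deterministic assignment: hash the customer into a bucket so the same customer always lands on the same cluster, and only do so inside a limited time window. The function name, cluster labels, hashing scheme, and default fraction below are assumptions for illustration.

```python
import hashlib


def sticky_cluster(customer_id: str, experiment: str, now: float,
                   start: float, duration_s: float,
                   fraction: float = 0.001) -> str:
    """Return 'sticky-canary', 'sticky-baseline', or 'main' for a customer.

    `fraction` of customers go to the canary and an equal share to the
    baseline, but only while start <= now < start + duration_s.
    """
    if not (start <= now < start + duration_s):
        return "main"  # outside the limited time window
    key = f"{experiment}:{customer_id}".encode()
    digest = hashlib.sha256(key).digest()
    # Map the first 8 bytes of the hash to a value in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    if bucket < fraction:
        return "sticky-canary"
    if bucket < 2 * fraction:
        return "sticky-baseline"
    return "main"


# With fraction=0.5 every customer is in one of the two sticky groups,
# which makes the behavior easy to see in a small example.
first = sticky_cluster("cust-1", "build-123", now=10, start=0, duration_s=60, fraction=0.5)
again = sticky_cluster("cust-1", "build-123", now=30, start=0, duration_s=60, fraction=0.5)
```

Because the assignment depends only on the customer and the experiment name, all of a customer's requests within the window reach the same cluster, so device-level behavior differences between canary and baseline can be compared.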
Reroute to the region closer to the client (DNS accuracy issues, etc.)
Reroute due to region failure.
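Both kinds of rerouting can be sketched as a lookup over an ordered preference list per region. The region names, the `PREFERENCE` table, and the health map are assumptions for illustration; the deck states only that three AWS regions are used.

```python
from typing import Dict, List


def pick_region(client_region: str, healthy: Dict[str, bool],
                preference: Dict[str, List[str]]) -> str:
    """Return the first healthy region in the client's preference order."""
    for region in preference[client_region]:
        if healthy.get(region, False):
            return region
    raise RuntimeError("no healthy region available")


# Hypothetical preference order: nearest region first, then fallbacks.
PREFERENCE = {
    "us-east-1": ["us-east-1", "us-west-2", "eu-west-1"],
    "us-west-2": ["us-west-2", "us-east-1", "eu-west-1"],
    "eu-west-1": ["eu-west-1", "us-east-1", "us-west-2"],
}

# One region is failing, so its traffic is routed to the next choice.
health = {"us-east-1": False, "us-west-2": True, "eu-west-1": True}
target = pick_region("us-east-1", health, PREFERENCE)
```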