Serverless functions (like AWS Lambda, Google Cloud Functions, and Azure Functions) have the ability to scale almost infinitely to handle massive workload spikes. While this is a great solution for compute, it can be a MAJOR PROBLEM for other downstream resources like RDBMS, third-party APIs, legacy systems, and even most managed services hosted by your cloud provider. Whether you’re maxing out database connections, exceeding API quotas, or simply flooding a system with too many requests at once, serverless functions can DDoS your components and potentially take down your application. In this talk, we’ll discuss strategies and architectural patterns to create highly resilient serverless applications that can mitigate and alleviate pressure on “non-serverless” downstream systems during peak load times.
2. Jeremy Daly
• CTO at AlertMe.news
• Consult with companies building in the cloud
• 20+ year veteran of technology startups
• Started working with AWS in 2009 and started using
Lambda in 2015
• Blogger (jeremydaly.com), OSS contributor, speaker
• Publish the Off-by-none serverless newsletter
• Host of the Serverless Chats podcast
@jeremy_daly
3. Agenda
• What is resiliency and what is serverless?
• Working with “less-than-scalable” RDBMS
• Using unreliable APIs
• Managing API quotas
• Decoupling our services
• Other “non-serverless” components
@jeremy_daly
4. What is resiliency?
“The ability of a software solution to absorb the impact
of a problem in one or more parts of a system, while
continuing to provide an acceptable service level to the
business.” ~ IBM
@jeremy_daly
IT’S NOT ABOUT PREVENTING FAILURE
IT’S UNDERSTANDING HOWTO GRACEFULLY DEAL WITH IT
5. What does it mean to be serverless?
• No server management
• Flexible scaling
• Pay for value
• Automated high availability
@jeremy_daly
6. Serverless versus non-serverless
@jeremy_daly
ElastiCache
RDS
EMR Amazon ES
Redshift
Fargate
Anything “on EC2”Lambda Cognito Kinesis
S3 DynamoDB SQS
SNS API Gateway CloudWatch
AppSync IoT Comprehend
Serverless Managed Not Serverless
DocumentDB
(MongoDB)
Managed Streaming
for Kafka
Definitely
7. Everything has limits!
• Reserved Concurrency 🚦
• FunctionTimeouts ⏳
• Memory Limits 🧠
• NetworkThroughput 🚰
Some components are better than others
@jeremy_daly
Know
Your
Limits
8. Simple Serverless Web Service
Client
API Gateway Lambda DynamoDB
@jeremy_daly
Highly Scalable Highly Scalable Highly Scalable
9. “I want my, I want my, I want my SQL”
~ Dire Straits
10. Simple Serverless Web Service
Client
API Gateway Lambda
@jeremy_daly
Highly Scalable Highly Scalable NotThat Scalable 😳
RDS
^
not so
RDBMS and FaaS don’t play nicely together:
• Concurrency model doesn’t allow connection pooling
• Limited number of DB connections available
• Recycled containers create zombies
11. Ways to Manage DB Connections
• Increase max_connections setting
• Limit concurrent executions
• Lower your connection timeouts
• Limit connections per username
• Close connection before function ends
@jeremy_daly
🤞
😡
⚠
🎲
😱
👎
12. Better Ways to Manage DB Connections
• Implement a good caching strategy 💾
• Buffer events for throttling and durability 🏋! " ♂
• Manage connections ourselves 🤔
@jeremy_daly
👎
13. hit
Implement a good caching strategy
Client API Gateway RDSLambda
Elasticache
Key Points:
• Create new RDS connections ONLY on misses
• Make sureTTLs are set appropriately
• Include the ability to invalidate cache
@jeremy_daly
YOU STILL NEEDTO
SIZEYOUR DATABASE
CLUSTERS APPROPRIATELY
14. Do you really need immediate feedback?
Synchronous Communication ⏳
Services can be invoked by other services and must wait for a reply.
This is considered a blocking request, because the invoking service
cannot finish executing until a response is received.
Asynchronous Communication 🚀
This is a non-blocking request. A service can invoke (or trigger)
another service directly or it can use another type of communication
channel to queue information.The service typically only needs to wait
for confirmation (ack) that the request was received.
@jeremy_daly
15. RDS
Buffer events for throttling and durability
Client API Gateway
SQS
Queue
SQS
(DLQ)
Lambda Lambda
(throttled)
ack
“Asynchronous”
Request
Synchronous
Request
@jeremy_daly
Key Points:
• SQS adds durability
• Throttled Lambdas reduce downstream pressure
• Failed events are stored for further inspection/replay
Limit the
concurrency to match
RDS throughput
16. Manage connections ourselves
1. Count open connections
2. Close connection if connection ratio threshold exceeded
3. Close sleeping connections with high time values
4. Retry connections with exponential back off
@jeremy_daly
19. Close connection if over ratio threshold
@jeremy_daly
If we exceed the
connection ratio
Calculate our timeout
Try to kill zombies
If no zombies,
terminate connection
Else, just try to kill
zombies
20. Close sleeping connections with high time values
@jeremy_daly
Query processlist for zombies
Kill zombies
21. Retry connections with exponential back off
@jeremy_daly
If error trying to connect
Retry with Jitter
22. Does this really work?
@jeremy_daly
• Aurora Serverless (2 ACUs)
• 90 connections available
• 1,024 MB of memory
• 500 users/sec for one minute
• Avg. response time was 41 ms
• ZERO ERRORS
23. We shouldn’t have to do this!
@jeremy_daly
Amazon
Aurora Serverless
Aurora Serverless
DATA API
Doesn’t solve the max_connections issue Getting better, but limited to Aurora Serverless
25. Manage calls to third-party APIs
• Implement a good caching strategy
• Buffer events for throttling and durability
• Implement circuit breakers 🚦
@jeremy_daly
26. DynamoDB
Stripe API
The Circuit Breaker
API
Consumer
API Gateway Lambda
Key Points:
• Cache your cache with warm functions
• Use a reasonable failure count
• Understand idempotency
Status
Check CLOSED
OPEN
Increment Failure Count
HALF OPEN
“Everything fails all the time.”
~WernerVogels
@jeremy_daly
🔥
🔥
Elasticache
or
27. What about quotas?
• Concurrency has no effect on frequency ⏰
• Stateless functions are not coordinated 😿
• Step Functions would be very expensive 💰
• Adding state wouldn’t prevent needless invocations 🗑
@jeremy_daly
28. Can we build a better system?
• 100% serverless
• Cost effective
• Scalable
• Resilient
• Efficient
• Coordinated
@jeremy_daly
BUT I DON’T HAVE
TIMETOTELLYOU
ABOUT IT!
YES!
29. Lambda Orchestrator
(concurrency 1)
The Lambda Orchestrator
DynamoDB
LambdaWorker
LambdaWorker
LambdaWorker
Concurrent Executions
of the SAME function
SQS (DLQ)
@jeremy_daly
CloudWatch Rule
(trigger every minute)
SQS QueueSQS (DLQ)
Status?
Gmail API
250 Quota Units
per minute
jeremydaly.com/throttling-third-party-api-calls-with-aws-lambda
31. Multicasting with SNS
Key Points:
• SNS has a “well-defined API”
• Decouples downstream processes
• Allows multiple subscribers with message filters
Client
SNS
“Asynchronous”
Request
ack
Event Service
@jeremy_daly
HTTP
SMS
Lambda
SQS
Email
32. Multicasting with EventBridge
Key Points:
• Create up to 100 event buses per account
• Allows multiple subscribers with RULES and EVENT PATTERNS
• Forward events to other accounts
@jeremy_daly
Asynchronous
“PutEvents” Request
ack
w/ event id
Amazon
EventBridge
Lambda
SQS
Client
Step Function
Event Bus
+13 others
33. Stripe API
@jeremy_daly
Distribute & Throttle
ack
SQS
Queue Lambda
(concurrency 25)
SNS
Topic
Client API
Gateway
Lambda
Order Service
total > $0
Key Points:
• SNS to SQS is “guaranteed” (100,010 retries)
• Filter events to selectively trigger services
• Manage throttling/quotas per service
RDS
SQS
Queue Lambda
(concurrency 10)
SMS Alerting Service
Twilio API
SQS
Queue Lambda
(concurrency 5)
Billing Service
status == ”order_complete”
Event Service
35. “Non-serverless” components are inevitable
• Know the limits of your components
• Use a good caching strategy
• Embrace asynchronous processes
• Buffer and throttle events to distributed systems
• Utilize eventual consistency
@jeremy_daly
👈