SELF-HEALING SERVERLESS APPLICATIONS
Glue Conference 2018
NATE TAGGART
AWS | LAMBDA FEATURES PAGE
AWS Lambda invokes your code only when
needed and automatically scales to support the
rate of incoming requests without requiring you
to configure anything. There is no limit to the
number of requests your code can handle.
The Promise:
SELF-HEALING SERVERLESS APPLICATIONS | PG2
The Reality:
(The slide repeats the quote above, marked up with suggested edits. The inserted words that survive extraction are "sometimes", "certain", "every", "can", "but", "architecture", and "are properly", each qualifying one of the claims.)
What to expect when you’re not expecting.
Common Serverless Failures
FOR LAMBDA-BASED ARCHITECTURES
FAILURE TYPES
• Runtime Error:
  • Uncaught Exception
  • Timeout
  • Bad State
• Scaling:
  • Concurrency Limits
  • Spawn Limits
  • Bottlenecking
FAILURE TYPE: Uncaught Exception (Runtime Error)
DESCRIPTION
An event triggers your Lambda to run, but your code raises an unhandled exception.
DEFAULT BEHAVIOR
Synchronous invocations:
• Function fails
• Returns error to caller
• Logs timestamp, error message, & stack trace to CloudWatch
Asynchronous invocations:
• Retries automatically, up to three attempts in total (more if reading from a stream)
• Caller is unaware of error
• Logs timestamp, error message, & stack trace to CloudWatch
FAILURE TYPE: Timeout (Runtime Error)
DESCRIPTION
An event triggers your Lambda to run, but execution does not complete within the configured maximum execution time. (Lambda’s default configuration is a 3-second timeout.)
DEFAULT BEHAVIOR
Synchronous invocations:
• Lambda returns error to caller (if client hasn’t timed out)
• Logs timestamp and error message to CloudWatch
Asynchronous invocations:
• Retries automatically, up to three attempts in total (more if reading from a stream)
• Caller is unaware of error
• Logs timestamp & error message to CloudWatch
FAILURE TYPE: Bad State (Runtime Error)
DESCRIPTION
An event triggers your Lambda to run, but the message is malformed or state is improperly provided, causing unexpected behavior.
DEFAULT BEHAVIOR
When noisy:
• Behaves as Uncaught Exception
• Visible in CloudWatch, but may be difficult to diagnose without event visibility
When silent:
• Unexpected application behavior
• Can be lost permanently
• Can tank performance and dramatically spike costs
FAILURE TYPE: Concurrency Limits (Scaling)
DESCRIPTION
Your application becomes throttled because it needs more Lambda instances than AWS allows to run concurrently for your account. Your compute can’t scale high enough.
DEFAULT BEHAVIOR
Unbuffered invocations:
• Fails to invoke
• No retry
• Visible in CloudWatch metrics, but not in logs
Buffered invocations:
• Initially fails to invoke
• Will eventually continue reading from stream as volume drops
FAILURE TYPE: Spawn Limits (Scaling)
DESCRIPTION
Your application becomes throttled because it needs more new Lambda instances than AWS allows your account to spawn. Your compute can’t scale fast enough.
DEFAULT BEHAVIOR
Unbuffered invocations:
• Fails to invoke
• No retry
• Visible in CloudWatch metrics, nothing in logs (but really non-obvious)
Buffered invocations:
• Initially fails to invoke
• Will eventually continue reading from stream as volume drops
FAILURE TYPE: Bottlenecking (Scaling)
DESCRIPTION
Your application is throttled due to throughput pressure upstream or downstream of your Lambda. Your architecture can’t scale enough.
DEFAULT BEHAVIOR
Upstream bottlenecks:
• Fails to invoke
• No retry
• Visible in CloudWatch, as long as you know where to look
Downstream bottlenecks:
• Can throw errors, time out, and/or distribute failures to other functions
• Can cause cascading failures
• Can tank performance and dramatically spike costs
Introducing:
Self-Healing Serverless Applications
Self-Healing Design Principles
LEADING PRACTICES FOR RESILIENT SYSTEMS
Learn to fail.
STANDARDIZE
• Introduce universal instrumentation
• Collect event-centric diagnostics
• Give everyone visibility
PLAN FOR FAILURE
• Identify service limits
• Use self-throttling
• Consider alternative resource types
FAIL GRACEFULLY
• Reroute and unblock
• Automate known solutions
• Notify a human
Scenario: Uncaught Exceptions
WHEN THINGS BREAK AND YOU DON’T KNOW WHY
PROBLEM
Lambda periodically fails. Error messages and stack traces are visible in CloudWatch logs. Failing events are lost, making reproduction difficult.
KEY PRINCIPLES
• Introduce universal instrumentation
• Collect event-centric diagnostics
• Give everyone visibility
SOLUTION
• Use function wrapper or decorator pattern
• Capture and log events which fail
Decrease time to resolution by capturing event data.
Event Diagnostics Wrapper Example
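The code from this slide did not survive extraction. As a stand-in, here is a minimal Python sketch of the decorator pattern named in the solution, not the speaker's original code. The names `log_failed_events` and `handler`, and the `order_id` field, are hypothetical. The wrapper logs the full triggering event when the handler raises, then re-raises so Lambda's default error behavior is unchanged.

```python
import functools
import json
import logging

logger = logging.getLogger("event_diagnostics")


def log_failed_events(handler):
    """Wrap a Lambda-style handler so the triggering event is logged on failure."""

    @functools.wraps(handler)
    def wrapper(event, context):
        try:
            return handler(event, context)
        except Exception:
            # default=str keeps the log call from failing on values
            # that are not JSON-serializable.
            logger.exception(
                "Handler %s failed; event=%s",
                handler.__name__,
                json.dumps(event, default=str),
            )
            raise  # Re-raise so the platform's default error handling still applies.

    return wrapper


@log_failed_events
def handler(event, context):
    # Hypothetical business logic: fails when "order_id" is missing.
    return {"order_id": event["order_id"], "status": "processed"}
```

With the event in the log next to the stack trace, a failing invocation can be replayed locally by passing the logged payload back into `handler`.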
Scenario: Upstream bottleneck
WHEN YOUR LAMBDAS AREN’T GETTING INVOKED
PROBLEM
API Gateway hits throughput limits, so some requests never invoke Lambda.
KEY PRINCIPLES
• Identify service limits
• Use self-throttling
• Notify a human
SOLUTION
• Implement retries with exponential backoff logic for 429 responses
• Raise alarm on: 4XXError
Don’t overlook client-side solutions to backend failures.
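A client-side retry with exponential backoff for 429 responses could look roughly like the Python sketch below. The function name `call_with_backoff`, its parameters, and the `send` callable (anything that returns a status code and body) are assumptions made for illustration. The jittered delay is one common variant, not something the slides prescribe.

```python
import random
import time


def call_with_backoff(send, max_attempts=5, base_delay=0.1, max_delay=5.0,
                      sleep=time.sleep):
    """Call send() until it returns a non-429 status or attempts run out.

    send is any zero-argument callable returning (status_code, body).
    """
    for attempt in range(max_attempts):
        status, body = send()
        if status != 429:
            return status, body
        if attempt < max_attempts - 1:
            # Exponential backoff with jitter: wait a random time up to
            # base_delay * 2^attempt, capped at max_delay.
            delay = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, delay))
    # Still throttled after the final attempt: hand the 429 back to the caller.
    return status, body
```

Passing `sleep` as a parameter keeps the function testable without real waiting. In a real client, `send` would wrap the HTTP call to the API Gateway endpoint.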
Scenario: Timeouts
WHEN EXECUTION TAKES TOO LONG
PROBLEM
Lambda is periodically timing out.
KEY PRINCIPLES
• Introduce universal instrumentation
• Use self-throttling
• Consider alternative resource types
SOLUTION
• Use function wrapper or decorator pattern
• Evaluate Fargate or alternative long-running resources
Enforce your own limits.
Timeout Wrapper Example
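The code from this slide was lost in extraction as well. The Python sketch below shows one way to enforce your own limit, and is not the speaker's original code. It raises an exception shortly before Lambda's deadline so the function can fail on its own terms. `get_remaining_time_in_millis()` is the method the Lambda context object provides. `HandlerTimeout`, `enforce_timeout`, `FakeContext`, and `slow_handler` are hypothetical names. The sketch relies on `SIGALRM`, so it assumes a Unix runtime and a handler running in the main thread.

```python
import functools
import signal
import time


class HandlerTimeout(Exception):
    """Raised when the handler exceeds its self-imposed time budget."""


def enforce_timeout(safety_margin_ms=500):
    """Raise HandlerTimeout shortly before the platform deadline would hit."""

    def decorator(handler):
        @functools.wraps(handler)
        def wrapper(event, context):
            budget_ms = context.get_remaining_time_in_millis() - safety_margin_ms
            if budget_ms <= 0:
                raise HandlerTimeout("no time budget left before start")

            def on_alarm(signum, frame):
                raise HandlerTimeout(f"exceeded {budget_ms} ms budget")

            previous = signal.signal(signal.SIGALRM, on_alarm)
            signal.setitimer(signal.ITIMER_REAL, budget_ms / 1000.0)
            try:
                return handler(event, context)
            finally:
                signal.setitimer(signal.ITIMER_REAL, 0)  # Cancel the pending alarm.
                signal.signal(signal.SIGALRM, previous)  # Restore the old handler.

        return wrapper

    return decorator


class FakeContext:
    """Hypothetical stand-in for the Lambda context object, for local runs."""

    def __init__(self, remaining_ms):
        self._remaining_ms = remaining_ms

    def get_remaining_time_in_millis(self):
        return self._remaining_ms


@enforce_timeout(safety_margin_ms=100)
def slow_handler(event, context):
    time.sleep(2)  # Simulates work that would overrun the budget.
    return "done"
```

Catching `HandlerTimeout` inside the wrapper, for example to log the event or park it for later, is the natural next step. The sketch only shows where that hook would go.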
Scenario: Stream processing gets “stuck”
WHEN FAILURES ARE BLOCKING THE REST OF THE STREAM
PROBLEM
Lambda exceptions and/or timeouts are blocking processing of a Kinesis shard.
KEY PRINCIPLES
• Reroute and unblock
• Automate known solutions
• Consider alternative resource types
SOLUTION
• Introduce state machine-type logic
• Move bad messages to alternate stream
• Potentially architect with Fargate or SNS
Small failures are preferable to large ones.
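The "move bad messages to alternate stream" idea can be sketched in Python as below. `process_batch`, `process`, and `send_to_alternate` are hypothetical names. In practice `send_to_alternate` might write to a second stream or a queue, but here it is any callable, so the sketch stays runnable without AWS.

```python
def process_batch(records, process, send_to_alternate):
    """Process each record; divert failures instead of failing the whole batch.

    process and send_to_alternate are caller-supplied callables.
    """
    diverted = 0
    for record in records:
        try:
            process(record)
        except Exception as exc:
            # Park the bad record with its error so the rest of the
            # batch (and the shard behind it) can keep moving.
            send_to_alternate({"record": record, "error": repr(exc)})
            diverted += 1
    return {"processed": len(records) - diverted, "diverted": diverted}
```

Because the function returns normally even when individual records fail, one poison message becomes a small, visible failure rather than a blocked shard.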
Scenario: Downstream bottleneck
WHEN LAMBDA IS OUT-SCALING YOUR DATABASE
PROBLEM
Your Lambdas have scaled up but are depleting your RDS database connection pools.
KEY PRINCIPLES
• Identify service limits
• Automate known solutions
• Give everyone visibility
SOLUTION
• Always close database connections
• Scale your database
• Map your dependencies
Scale dependencies, too.
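"Always close database connections" can be illustrated with a short Python sketch. It uses the standard-library `sqlite3` module purely as a stand-in for an RDS client so that it runs without a server. The `count_orders` function and the `orders` table are hypothetical. The point is the `closing()` pattern, which releases the connection even when the query raises.

```python
import sqlite3
from contextlib import closing


def count_orders(db_path):
    """Open a connection, run one query, and always release the connection."""
    # closing() calls .close() on exit, whether the block succeeds or raises.
    with closing(sqlite3.connect(db_path)) as conn:
        with closing(conn.cursor()) as cur:
            cur.execute(
                "CREATE TABLE IF NOT EXISTS orders (id INTEGER PRIMARY KEY)"
            )
            cur.execute("SELECT COUNT(*) FROM orders")
            return cur.fetchone()[0]
```

Calling `count_orders(":memory:")` runs the sketch against a throwaway in-memory database. The same structure applies to other DB-API drivers, though connection arguments differ per driver.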
Thank you!
@stackeryio
