Jürgen Etzlstorfer
@jetzlstorfer
Technology Strategist
A framework for self-healing applications –
the path to enable auto-remediation
Developer Week Nürnberg, 27th June 2018
confidential
The journey
 Why self-healing applications?
 What is needed for self-healing applications
 Auto-remediation as part of a CI/CD pipeline
 Build your own auto-remediation
On average, a single transaction uses 82 different types of technology
Browser
Multi-geo
Mobile Network
Code
Hosts
Logs
IoT
3rd parties
Services
Cloud SDN
Containers
Applications are getting more complex!
Problem
• Not reproducible in Test and cannot be
diagnosed with current tooling
• After months of investigation, with customers
being impacted, the root cause of the issue
still cannot be found
Impact
• Issue causes severe slowdowns and timeouts
for users, eventually requiring a manual
failover to the DR site
• Operations team misled by current alerting
during its investigation
Consequences
• Poor customer experience drives
poor conversion rates
• Recurring issue for months
• 479 hours lost in the war room to date
• 6 teams and one 3rd party were involved
• Happening more frequently
• Cost so far: £23,950 ($32,494)
• Brand reputation impacted by bad tweets
Consequences of complexity
MTTD (mean time to detect) vs MTTR (mean time to repair)
If you write applications,
they will break eventually
~ Murphy's law
What if you had
something similar to
a self-healing robot?
What is needed for self-healing applications?
 Monitoring: know what’s going on in your
applications
 End-to-end
 Full-stack – fully integrated in production
(or even in staging)
 Automation/Execution: perform
mitigation/remediation actions
 Access to all systems
 Automation system should be isolated from
production system
APIs
Know what's going on in your
applications
 Monitor your applications: identify the root cause
of the problem!
How to enable remediation
Monitoring
1. Applications are monitored
2. Thresholds are breached
3. Problem is analyzed
4. Problem notification is sent
Mitigation
5. Event is received
6. Job is triggered
7. Playbook is executed
8. Problem is remediated
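The chain from breached threshold to executed playbook can be sketched in a few lines of JavaScript. This is only an illustration under stated assumptions, not the tooling shown in the talk: the metric names, threshold values, and playbook file names (`THRESHOLDS`, `PLAYBOOKS`, `scale-out.yml`, `rollback.yml`) are hypothetical.

```javascript
// Hypothetical static thresholds per metric (assumed values, for illustration only).
const THRESHOLDS = { responseTimeMs: 500, errorRatePct: 5 };

// Hypothetical mapping from a breached metric to the playbook that mitigates it.
const PLAYBOOKS = new Map([
  ["responseTimeMs", "scale-out.yml"],
  ["errorRatePct", "rollback.yml"],
]);

// Monitoring side: compare a sample against the thresholds and
// return one problem event per breached metric.
function detectProblems(sample) {
  return Object.keys(THRESHOLDS)
    .filter((metric) => sample[metric] > THRESHOLDS[metric])
    .map((metric) => ({
      metric,
      value: sample[metric],
      threshold: THRESHOLDS[metric],
    }));
}

// Mitigation side: receive an event and turn it into a job naming the playbook to run.
// Returns null when no playbook is registered for the metric.
function toJob(event) {
  const playbook = PLAYBOOKS.get(event.metric);
  if (!playbook) return null;
  return { playbook, reason: `${event.metric}=${event.value} > ${event.threshold}` };
}

const sample = { responseTimeMs: 820, errorRatePct: 1.2 };
const jobs = detectProblems(sample).map(toJob).filter(Boolean);
console.log(jobs);
```

Only the response time exceeds its threshold here, so a single job for the hypothetical `scale-out.yml` playbook is produced.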
How to automate?
 Automation engines
 Ansible (Tower), StackStorm, …
 Serverless approaches
 AWS Lambda, Azure Functions, …
How to enable auto-remediation
1. Full-stack environment is monitored
2. Anomalies are detected automatically
3. Root cause analysis is performed
4. Problem notification is sent
5. Event is received
6. Job is triggered
7. Playbook is executed
8. Problem is remediated
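To illustrate how automatic anomaly detection differs from a fixed threshold, here is a minimal sketch that flags a value when it deviates from a learned baseline. It is a simple mean-and-standard-deviation check chosen for illustration; it is not the detection algorithm of any particular monitoring product, and the function name `isAnomaly` and the factor `k = 3` are assumptions.

```javascript
// Returns true when `value` deviates from the mean of `history`
// by more than `k` sample standard deviations (a simple baseline check).
function isAnomaly(history, value, k = 3) {
  if (history.length < 2) return false; // not enough data for a baseline
  const mean = history.reduce((sum, x) => sum + x, 0) / history.length;
  const variance =
    history.reduce((sum, x) => sum + (x - mean) ** 2, 0) / (history.length - 1);
  const stdDev = Math.sqrt(variance);
  if (stdDev === 0) return value !== mean; // flat baseline: any change is a deviation
  return Math.abs(value - mean) > k * stdDev;
}

const responseTimesMs = [102, 98, 105, 99, 101, 97, 103];
console.log(isAnomaly(responseTimesMs, 104)); // false
console.log(isAnomaly(responseTimesMs, 400)); // true
```

No threshold has to be configured by hand: the baseline adapts to whatever is normal for the monitored service.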
Scenario: How to mitigate a bad deployment?
Version 123: Staging → Approve Staging → Production → Approve Production → Up and running
Version 124: Staging → Approve Staging → Production → Approve Production → Remediation (rollback)
Steps to mitigate the bad deployment
1. Fetch information about the event
2. Process the data
3. Select the corresponding remediation action
4. Execute the remediation action
Keep track of all automation steps
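The four steps, together with the audit trail, might look like the following sketch. Everything named here is hypothetical: the `ACTIONS` catalog, the notification fields (`id`, `service`, `previousVersion`, `rootCauseType`), and the action results are placeholders, and a real action would call a deployment or automation API instead of returning a string.

```javascript
// Hypothetical catalog mapping a root-cause type to a remediation action.
const ACTIONS = new Map([
  ["BAD_DEPLOYMENT", (p) => `rolled back ${p.service} to ${p.previousVersion}`],
  ["HIGH_CPU", (p) => `scaled out ${p.service}`],
]);

function remediate(notification) {
  const audit = []; // keep track of all automation steps
  const log = (step, detail) => audit.push({ step, detail });

  // 1. Fetch information about the event (here: taken from the notification itself).
  const problem = { ...notification };
  log("fetch", `problem ${problem.id}`);

  // 2. Process the data: determine which kind of root cause we are dealing with.
  const type = problem.rootCauseType;
  log("process", `root cause type ${type}`);

  // 3. Select the corresponding remediation action.
  const action = ACTIONS.get(type);
  if (!action) {
    log("select", "no matching action, escalating to a human");
    return { remediated: false, audit };
  }
  log("select", type);

  // 4. Execute the remediation action.
  const result = action(problem);
  log("execute", result);
  return { remediated: true, audit };
}

const outcome = remediate({
  id: "P-1",
  service: "checkout",
  previousVersion: "123",
  rootCauseType: "BAD_DEPLOYMENT",
});
console.log(outcome.audit.map((e) => `${e.step}: ${e.detail}`).join("\n"));
```

Returning the audit list with every outcome, including the case where no action matches, keeps each automated decision traceable.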
Auto-remediation with Ansible (Tower)
 APIs are key to enable automation
 Ansible Tower makes extensive use of APIs internally and also exposes them externally
 Ansible playbooks are scripts that are executed from a central host on different machines
 Multiple operating systems are supported
 Idempotent
 Playbooks can be orchestrated in workflows and job templates
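As a rough sketch of what triggering a job template from outside could look like, the snippet below builds (but does not send) a request for the Tower REST API launch endpoint, `/api/v2/job_templates/<id>/launch/`. The host `tower.example.com`, template id `42`, the token, and the `pid` variable are placeholders. Bearer-token authentication is an assumption that depends on the Tower version, and `extra_vars` are only honored if the job template is configured to prompt for variables on launch.

```javascript
// Builds the HTTP request for launching a job template through the
// Ansible Tower REST API. The request is returned, not sent.
function buildLaunchRequest(towerUrl, templateId, token, extraVars) {
  const base = towerUrl.replace(/\/+$/, ""); // drop trailing slashes
  return {
    url: `${base}/api/v2/job_templates/${templateId}/launch/`,
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${token}`, // assumed auth scheme
    },
    body: JSON.stringify({ extra_vars: extraVars }),
  };
}

// Placeholder values for illustration only.
const request = buildLaunchRequest("https://tower.example.com/", 42, "my-token", {
  pid: "1234",
});
console.log(request.method, request.url);
```

The returned object can be handed to any HTTP client; a monitoring tool's webhook integration would typically issue the equivalent request when a problem opens.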
---
- name: rollback to previous version
  hosts: localhost
  vars:
    # ...
  tasks:
    # Record in Dynatrace that the remediation has started
    - name: push comment to dynatrace
      uri:
        url: "{{ dtcommentapiurl }}"
        method: POST
        body_format: json
        body:
          comment: "Remediation playbook started."
          user: "{{ commentuser }}"
          context: "Ansible Tower"

    # Look up the deployment events attached to each impacted entity
    - name: fetch custom deployment events
      uri:
        url: "{{ dtdeploymentapiurl }}"
        return_content: yes
      with_items: "{{ impactedEntities }}"
      register: customproperties
      ignore_errors: no

    - name: parse deployment events
      set_fact:
        deployment_events: "{{ item.json.events }}"
      with_items: "{{ customproperties.results }}"
      register: app_result

    # Call the remediation URL stored with the deployment event
    - name: call remediation action
      uri:
        url: "{{ myItem.remediationAction }}"
        method: POST
        body_format: json
        body: "{{ payload | to_json }}"
        return_content: yes
      ignore_errors: yes
      register: result

    - name: push success comment to dynatrace
      uri:
        url: "{{ dtcommentapiurl }}"
        method: POST
        body_format: json
        body:
          comment: "Invoked remediation action successfully executed: {{ result.content }}"
          user: "{{ commentuser }}"
          context: "Ansible Tower"
      when: result.status == 200

    - name: push error comment to dynatrace
      uri:
        # ...
        body:
          comment: "Invoked remediation action failed: {{ result.content }}"
          user: "{{ commentuser }}"
          context: "Ansible Tower"
      when: result.status != 200
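The playbook above reads a `remediationAction` URL from a deployment event, which means the deployment step has to attach that URL in the first place. The sketch below shows one way this could be done and how the most recent action could be picked again later. The field names (`eventType`, `attachRules`, `deploymentName`, `deploymentVersion`, `source`, `remediationAction`, `startTime`) follow the Dynatrace Events API v1 as far as I understand it and should be checked against the current API reference; the helper names and sample values are hypothetical.

```javascript
// Builds the body of a custom deployment event that carries the URL
// of the remediation action to call if this deployment goes bad.
function buildDeploymentEvent({ entityIds, name, version, remediationAction }) {
  return {
    eventType: "CUSTOM_DEPLOYMENT",
    attachRules: { entityIds },
    source: "Ansible Tower",
    deploymentName: name,
    deploymentVersion: version,
    remediationAction,
  };
}

// Picks the remediation action of the most recent event that defines one.
// Assumes each fetched event has a numeric `startTime` (epoch milliseconds).
function latestRemediationAction(events) {
  const candidates = events
    .filter((event) => event.remediationAction)
    .sort((a, b) => b.startTime - a.startTime);
  return candidates.length > 0 ? candidates[0].remediationAction : undefined;
}

const deploymentEvent = buildDeploymentEvent({
  entityIds: ["SERVICE-0000000000000001"], // placeholder entity id
  name: "Deploy version 124",
  version: "124",
  remediationAction: "https://tower.example.com/api/v2/job_templates/42/launch/",
});
console.log(JSON.stringify(deploymentEvent, null, 2));
```

Storing the rollback URL with the deployment keeps the playbook generic: it never needs to know which service it is rolling back, only which URL the deployment registered.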
Auto-remediation with Serverless approaches
 No need for separate installation / maintenance of a system
 Pay-as-you-go (most often for free)
 Support for a variety of languages
 No built-in support for automation tasks
// remediation
dtUtils.getProblemDetails(myProblem.pid, function (err, resp) {
  if (err || !resp.ok) {
    console.error("error getProblemDetails for pid " + myEvent.pid + ": " + JSON.stringify(err));
    return callback(err);
  }
  var myRankedEvents = resp.body.result.rankedEvents;
  console.info("rankedEvents: " + JSON.stringify(myRankedEvents));
  if (myRankedEvents != null) {
    var myRootCause = getRootCause(myRankedEvents);
    if (myRootCause != undefined) {
      // root cause found
      console.info("root cause for PID " + myEvent.pid + ": " + JSON.stringify(myRootCause.eventType));
      triggerRemediationAction(myProblem, myRootCause, function (err, res, remediationAction) {
        if (err) {
          console.error("error for remediation of " + myEvent.pid + " (" + myRootCause.eventType + "): " +
            JSON.stringify(err));
          // Record the failure as a comment on the problem. A failure to add the
          // comment is only logged, so that callback is not invoked twice.
          addComment(myEvent.pid, "error when performing remediation " + JSON.stringify(err), function (commentErr, res) {
            if (commentErr) {
              console.error("error adding comment: " + JSON.stringify(commentErr));
            }
          });
          return callback(err);
        }
        var remediationLog = "Auto-remediation: " + remediationAction.title + " executed:\n " +
          remediationAction.description;
        // ...
      });
    }
  }
});
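The snippet above calls a `getRootCause` helper that is not shown on the slide. One plausible shape for it is sketched below, assuming each ranked event carries a boolean `isRootCause` flag; that field name and the sample event types are assumptions about the problem-details payload rather than something the slide confirms.

```javascript
// Returns the first ranked event flagged as root cause, or undefined
// when the list is missing or contains no such event.
function getRootCause(rankedEvents) {
  if (!Array.isArray(rankedEvents)) return undefined;
  return rankedEvents.find((event) => event.isRootCause === true);
}

// Hypothetical sample data for illustration.
const rankedEvents = [
  { eventType: "SERVICE_RESPONSE_TIME_DEGRADED", isRootCause: false },
  { eventType: "CUSTOM_DEPLOYMENT", isRootCause: true },
];
const rootCause = getRootCause(rankedEvents);
console.log(rootCause ? rootCause.eventType : "no root cause found");
```

Returning `undefined` when nothing is flagged matches the `myRootCause != undefined` check in the handler, so the remediation is skipped when no root cause is known.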
Comparison
 Automation Platforms
 Runbook/Playbook automation built-in
 Step-by-step instructions (yaml)
 Specialized for deployment, provisioning,
configuration management
 Maintenance of platform needed
 Serverless
 Different vendors
 Different languages (js, java, python, …)
 Not limited to runbooks
 No support for typical runbook tasks
Auto-remediation is a safety net
It does not fix your problem
https://blogs.msdn.microsoft.com/visualstudioalmrangers/2017/04/17/set-up-a-cicd-pipeline-for-your-team-services-extension/
Embed auto-remediation in your CI/CD pipeline
Shift-Left: Break Pipeline Earlier
Path to NoOps: Self-Healing, …
Shift-Right: Tags, Deploys, Events
Actionable Feedback Loops
Injecting speed &
quality: automatic gate
at test & performance
• Continuous Performance Validation for daily builds
• Root Cause details automatically pushed to JIRA
• Decisions made to compare, break, or good-to-go
Shift-left: engage Dev with earlier & automated feedback
Shift-right: empower Ops with more context to react faster
https://github.com/Dynatrace/AWSDevOpsTutorial
pushDynatraceDeploymentEvent: pushes deployment info to Dynatrace entities
validateBuildDynatraceWorker: compares builds and approves/rejects the pipeline
pushDynatraceDeploymentEvent: pushes deployment info to Dynatrace entities
validateBuildDynatraceWorker: validates production and approves/rejects the pipeline
handleDynatraceProblemNotification: executes auto-remediating actions, e.g. rollback
[Figure: Build 6 and Build 7 each pass Auto-Approve!/Auto-Reject! gates on the way to Production]
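The approve/reject decision of such a build gate can be sketched as a comparison of a candidate build against a baseline build. This is an illustration only, not the logic of `validateBuildDynatraceWorker`: the function name `validateBuild`, the metric names, the sample values, and the 10% tolerance are assumptions, and the sketch treats every metric as "higher is worse".

```javascript
// Compares a candidate build against a baseline and rejects it when any
// metric regresses by more than `maxRegressionPct` percent.
function validateBuild(baseline, candidate, maxRegressionPct = 10) {
  const violations = [];
  for (const [metric, baseValue] of Object.entries(baseline)) {
    const value = candidate[metric];
    // Skip metrics the candidate did not report or that have no usable baseline.
    if (value === undefined || baseValue === 0) continue;
    const changePct = ((value - baseValue) / baseValue) * 100;
    if (changePct > maxRegressionPct) {
      violations.push({ metric, baseValue, value, changePct: Math.round(changePct) });
    }
  }
  return { approved: violations.length === 0, violations };
}

// Hypothetical measurements for two consecutive builds.
const build6 = { responseTimeMs: 200, failureRatePct: 1 };
const build7 = { responseTimeMs: 260, failureRatePct: 1 };

const verdict = validateBuild(build6, build7);
console.log(verdict.approved ? "Auto-Approve!" : "Auto-Reject!", verdict.violations);
```

Here the response time of build 7 is 30% above build 6, so the gate rejects the build; the violation list gives the pipeline something actionable to report back to the developers.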
How to start?
1. Monitor your environment
2. Define your runbooks
3. Start small, with the low-hanging fruit
 What are frequent issues?
 Of these, which ones are easy to deal with?
4. Build more and more automation along the way
Cultural Change!
AI to the rescue
Automated selection or generation of solution: AI, big data, …
Automated calling of scripts: Ansible Tower, Workflows, …
Predefined actions to execute: runbooks, shell scripts, batch files, …
www.dynatrace.com
Jürgen Etzlstorfer
Technology Strategist
juergen.etzlstorfer@dynatrace.com
@jetzlstorfer
Thank you!

Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 

A framework for self-healing applications – the path to enable auto-remediation

  • 1. Jürgen Etzlstorfer @jetzlstorfer Technology Strategist A framework for self-healing applications – the path to enable auto-remediation Developer Week Nürnberg, 27th June 2018
  • 2. confidential The journey: Why self-healing applications? What is needed for self-healing applications. Auto-remediation as part of a CI/CD pipeline. Build your own auto-remediation.
  • 3. On average, a single transaction uses 82 different types of technology Browser Multi-geo Mobile Network Code Hosts Logs IoT 3rd parties Services Cloud SDN Containers Applications are getting more complex!
  • 4. Problem: • Not repeatable in Test and cannot be troubleshot with current tooling • After months of investigation and customers being impacted, the root cause of the issue cannot be found. Impact: • Issue causes severe slowdowns and timeouts for the users, eventually needing a manual failover to the DR site • Operations team misled by current alerting on their investigation path. Consequences: • Poor customer experience drives poor conversion rates. Recurring issue for months. 479 hours lost in War-room up to today. 6 teams and one 3rd party were involved. Happening more frequently. Has cost so far £23,950. Brand reputation impacted by bad tweets. $32,494. Consequences of complexity
  • 5.
  • 7. If you write applications, they will break eventually ~ Murphy's law
  • 8. What if you had something similar to a self-healing robot?
  • 9. What is needed for self-healing applications? Monitoring: know what's going on in your applications (end-to-end; full-stack, fully integrated in production or even in staging). Automation/Execution: perform mitigation/remediation actions (access to all systems; automation system should be isolated from production system). APIs
  • 10. Know what's going on in your applications: Monitor your applications. Identify the root cause of the problem!
  • 11. Applications are monitored Thresholds are breached Problem is analyzed Problem notification is sent Event is received Job is triggered Playbook is executed Problem is remediated How to enable remediation Monitoring Mitigation
  • 12. How to automate? Automation engines: Ansible (Tower), StackStorm, … Serverless approaches: AWS Lambda, Azure Functions, …
  • 13. Full-stack environment is monitored Anomalies are detected automatically Root cause analysis is performed Problem notification is sent Event is received Job is triggered Playbook is executed Problem is remediated How to enable auto-remediation
  • 14. Version 123 Staging Approve Staging Production Approve Production Up and running Version 124 Scenario: How to mitigate a bad deployment? Staging Approve Staging Production Approve Production Remediation Roll- back
  • 15. Steps to mitigate the bad deployment: (1) Fetch information about the event. (2) Process the data. (3) Select the corresponding remediation action. (4) Execute the remediation action. (5) Keep track of all automation steps.
  • 16. Auto-remediation with Ansible (Tower): APIs are key to enable automation. Ansible Tower makes extensive use of APIs internally and also exposes them externally. Ansible playbooks are scripts that are executed from a central host on different machines (multiple OS are supported; idempotent). Playbooks can be orchestrated in workflows and job templates.
  • 17.
        ---
        - name: rollback to previous version
          hosts: localhost
          vars: ...
          tasks:
            # Leave a comment on the problem so the remediation run is traceable
            - name: push comment to dynatrace
              uri:
                url: "{{dtcommentapiurl}}"
                method: POST
                body_format: json
                body: '{ "comment": "Remediation playbook started.", "user": "{{commentuser}}", "context": "Ansible Tower" }'
            # Look up the deployment events attached to each impacted entity
            - name: fetch custom deployment events
              uri:
                url: "{{dtdeploymentapiurl}}"
                return_content: yes
              with_items: "{{ impactedEntities }}"
              register: customproperties
              ignore_errors: no
            - name: parse deployment events
              set_fact:
                deployment_events: "{{item.json.events}}"
              with_items: "{{ customproperties.results }}"
              register: app_result
  • 18.
            # Invoke the remediation action recorded with the deployment event
            - name: call remediation action
              uri:
                url: "{{ myItem.remediationAction }}"
                method: POST
                body_format: json
                body: "{{ payload | to_json }}"
                return_content: yes
              ignore_errors: yes
              register: result
            - name: push success comment to dynatrace
              uri:
                url: "{{dtcommentapiurl}}"
                method: POST
                body_format: json
                body: '{ "comment": "Invoked remediation action successfully executed: {{result.content}}", "user": "{{commentuser}}", "context": "Ansible Tower" }'
              when: result.status == 200
            - name: push error comment to dynatrace
              ...
                body: '{ "comment": "Invoked remediation action failed: {{result.content}}", "user": "{{commentuser}}", "context": "Ansible Tower" }'
              when: result.status != 200
  • 19. Auto-remediation with serverless approaches: No need for separate installation / maintenance of a system. Pay-as-you-go (most often for free). Support for a variety of languages. No built-in support for automation tasks.
  • 20.
        // remediation
        dtUtils.getProblemDetails(myProblem.pid, function (err, resp) {
          if (err || !resp.ok) {
            console.error("error getProblemDetails for pid " + myEvent.pid + ": " + JSON.stringify(err));
            return callback(err);
          }
          var myRankedEvents = resp.body.result.rankedEvents;
          console.info("rankedEvents: " + JSON.stringify(myRankedEvents));
          if (myRankedEvents != null) {
            var myRootCause = getRootCause(myRankedEvents);
            if (myRootCause != undefined) { // root cause found
              console.info("root cause for PID " + myEvent.pid + ": " + JSON.stringify(myRootCause.eventType));
              triggerRemediationAction(myProblem, myRootCause, function (err, res, remediationAction) {
                if (err) {
                  console.error("error for remediation of " + myEvent.pid + " (" + myRootCause.eventType + "): " + JSON.stringify(err));
                  // record the failure as a comment on the problem before giving up
                  addComment(myEvent.pid, "error when performing remediation " + JSON.stringify(err), function (err, res) {
                    if (err) {
                      return callback(err);
                    }
                  });
                  return callback(err);
                }
                var remediationLog = "Auto-remediation: " + remediationAction.title + " executed:\n " + remediationAction.description;
  • 21. Comparison. Automation platforms: runbook/playbook automation built-in; step-by-step instructions (YAML); specialized for deployment, provisioning, configuration management; maintenance of platform needed. Serverless: different vendors; different languages (JS, Java, Python, …); not limited to runbooks; no support for typical runbook tasks.
  • 22. Auto-remediation is a safety net. It does not fix your problem.
  • 24. Embed auto-remediation in your CI/CD pipeline. Shift-Left: Break Pipeline Earlier. Path to NoOps: Self-Healing, … Shift-Right: Tags, Deploys, Events. Actionable Feedback Loops
  • 25. Injecting speed & quality: automatic gate at test & performance • Continuous Performance Validation for daily builds • Root Cause details automatically pushed to JIRA • Decisions made to compare, break, or good-to-go. Shift-left: engage Dev with earlier & automated feedback
  • 27. https://github.com/Dynatrace/AWSDevOpsTutorial pushDynatraceDeploymentEvent Pushes Deployment Info to Dynatrace Entities validateBuildDynatraceWorker Compares Builds and Approves/Rejects Pipeline pushDynatraceDeploymentEvent Pushes Deployment Info to Dynatrace Entities validateBuildDynatraceWorker Validates Production and Approves/Rejects Pipeline handleDynatraceProblemNotification Executes Auto-Remediating Actions, e.g: Rollback Build 6 Build 7 Production Production Auto-Approve! Auto-Reject! Auto-Approve! Auto-Reject!
  • 28. How to start? 1. Monitor your environment 2. Define your runbooks 3. Start small and with low-hanging fruits: What are frequent issues? Of these, which ones are easy to deal with? 4. Build more and more automation along the way. Cultural Change!
  • 30. AI to the rescue Automated selection or generation of solution AI, big data, … Automated calling of scripts Ansible Tower, Workflows, … Predefined actions to execute Runbooks, Shell scripts, batch files, …
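
The five steps on slide 15 (fetch the event, process the data, select the remediation action, execute it, keep track of every step) can be sketched in a few lines of JavaScript. This is a minimal sketch under stated assumptions, not the implementation shown in the talk: the event shape, the `remediationCatalog` map, and the `handleProblem` function are hypothetical names, and the actions are stubs that only describe what they would do rather than calling a real system.

```javascript
// Hypothetical catalog: maps a root-cause type to a remediation action.
// The actions are stubs that only return a description of what they would do.
const remediationCatalog = new Map([
  ["DEPLOYMENT", { title: "Rollback", run: (e) => `roll back ${e.service} to version ${e.previousVersion}` }],
  ["PROCESS_CRASH", { title: "Restart", run: (e) => `restart ${e.service}` }],
]);

// Walks through the five steps and returns the outcome plus an audit log.
function handleProblem(event, catalog = remediationCatalog) {
  const log = []; // step 5: keep track of all automation steps

  // Step 1: fetch information about the event (here it is simply passed in).
  log.push(`received problem ${event.pid}`);

  // Step 2: process the data (normalize the root-cause type).
  const rootCauseType = String(event.rootCause || "UNKNOWN").toUpperCase();

  // Step 3: select the corresponding remediation action.
  const action = catalog.get(rootCauseType);
  if (!action) {
    log.push(`no remediation action defined for ${rootCauseType}`);
    return { remediated: false, log };
  }

  // Step 4: execute the remediation action.
  const outcome = action.run(event);
  log.push(`${action.title} executed: ${outcome}`);
  return { remediated: true, log };
}

// Usage: the bad-deployment scenario from slide 14 (version 124 rolled back to 123).
const result = handleProblem({
  pid: "P-1",
  rootCause: "deployment",
  service: "frontend",
  previousVersion: "123",
});
console.log(result.log.join("\n"));
```

Returning the log alongside the outcome keeps the "keep track" step explicit; in practice those entries would more likely be pushed back to the monitoring tool as comments, as the playbook on slides 17 and 18 does.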

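The serverless snippet on slide 20 calls `getRootCause(myRankedEvents)` without showing the helper. One way such a helper might look is sketched below. It assumes each ranked event carries a boolean `isRootCause` flag next to its `eventType`; that field name is an assumption about the response shape and is not shown on the slide.

```javascript
// Hypothetical helper: returns the first ranked event flagged as the root cause,
// or undefined when none is flagged (matching the `!= undefined` check on slide 20).
function getRootCause(rankedEvents) {
  if (!Array.isArray(rankedEvents)) {
    return undefined;
  }
  return rankedEvents.find((event) => Boolean(event) && event.isRootCause === true);
}

// Usage with made-up sample data.
const sampleEvents = [
  { eventType: "SERVICE_RESPONSE_TIME_DEGRADED", isRootCause: false },
  { eventType: "CUSTOM_DEPLOYMENT", isRootCause: true },
];
const rootCause = getRootCause(sampleEvents);
console.log(rootCause ? rootCause.eventType : "no root cause found");
```

Returning `undefined` rather than throwing lets the caller decide whether a missing root cause should be escalated to a human operator.
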
Editor's Notes

  1. that’s not going to be easy – container and cloud platforms allow for faster deployments, independent release cycles WHILE increasing operational complexity monolith to microservice, in memory call / network call, Istio (more hops, more technologies) – overall we see on average 82! applications are incredibly complex how it works end-to-end? nobody knows all parts ...
  2. Real customer problem in a complex cloud environment Problem is not only the money spent on this, but also time and bad brand reputation – problem was that
  3. Does your Enterprise look like this today?
  4. Bob has many layers to look through for problems. Mean time to Recovery (MTTR) for application problems could take 72 hours or more. Can Bob find the problem quickly let alone fix it? What about the impact? In many cases the Mean Time to Discovery (MTTD) takes up two-thirds of the MTTR. In that time how many other users or applications may be impacted?
  5. It might not break immediately, but there will be a point in time when your applications will break. It can be a broken dependency, an infrastructure failure, or a database slowdown severely impacting your service. Either way, your application will break. Murphy's law: whatever can go wrong, will go wrong!
  6. A self-healing robot fixing itself when it experiences troubles. This could mean freeing up additional resources, restarting things that are not doing well, rolling back to a state where everything worked perfectly…
  7. Monitoring: End-to-end means that you have to track the complete path of your requests so that you are not looking at black boxes. Full-stack: monitoring has to cover your complete application stack, from frontend to backend technologies. Automation: means that you can execute automatically what you would otherwise do manually in case of outages.
  8. What we see a lot in customer environments is that the actual root cause of the problem is buried somewhere other than where you would expect at first sight. For example, if your services experience a slowdown, the actual problem might be the network, or the underlying database of a different service that the service you are looking at depends on.
  9. Let's take a look… What measures are needed for enabling remediation? As a prerequisite, we have to make sure we somehow monitor our applications, simply because we need to know what's going on in either our application or our environment. We define thresholds that should not be breached. We then look at the dashboards, and once the thresholds are breached we analyze the problem and send it over to someone else. This could be either a human operator or even an automation platform. We can for example employ XXX and trigger a previously defined job that executes a playbook. Basically it's a sequence of instructions to automate tasks that can include restarts of processes, scaling up the environments, …
  10. We at Dynatrace have automated this process, since the traditional way still means a lot of manual monitoring and looking at dashboards. We achieve this by using our own monitoring tool and integrating it with 3rd party vendors. Also, Dynatrace provides full-stack monitoring to detect issues in any layer of your environment. Automatic baselining further allows anomalies to be detected automatically without the need to manually define thresholds, since these might differ substantially between applications. Our AI-based root cause analysis finally detects the real root cause of the problem and sends exactly this notification. Now a third party vendor such as Ansible Tower can take over.
  11. As an example, let's take a look at a simple delivery pipeline. When deploying a new version, we make sure to carefully test our new build. However, despite thorough tests in staging and maybe even in production, errors might occur. Although the pipeline was built to fail early, this is not always possible. So it might happen that the error is only discovered in production. If the error occurs on a Saturday night, it might not be possible to inspect it immediately and schedule counter-actions. Therefore, with auto-remediation in place, we can for example automatically roll back to the previous stable version to save the weekend.
  12. - you see the problem in the picture for automation?
  13. As we can see, being able to automate lies at the core of enabling auto-remediation or self-healing. First you need to have runbooks or scripts that can kick in every time they are needed. Next you can connect your tools of choice to these scripts to enable auto-remediation. However, you still have to have dedicated runbooks for each scenario in place and have to connect the right problems to the right counter-actions. Finally, with self-healing we can leverage the power of AI and big data to fully understand the root causes of problems and automatically determine executable steps for remediations.
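
Note 10 contrasts manually defined thresholds with automatic baselining. The idea can be illustrated with a deliberately simple sketch that derives a per-application threshold from that application's own history instead of a fixed number. This is not Dynatrace's algorithm: the mean-plus-three-standard-deviations rule, the function names `buildBaseline` and `isAnomaly`, and the sample response times are all assumptions made for illustration.

```javascript
// Derives a baseline (mean and standard deviation) from historical samples,
// e.g. response times in milliseconds.
function buildBaseline(samples) {
  if (!Array.isArray(samples) || samples.length === 0) {
    throw new Error("at least one sample is required");
  }
  const mean = samples.reduce((sum, value) => sum + value, 0) / samples.length;
  const variance =
    samples.reduce((sum, value) => sum + (value - mean) ** 2, 0) / samples.length;
  return { mean, stdDev: Math.sqrt(variance) };
}

// Flags a value as anomalous when it exceeds mean + tolerance * stdDev.
function isAnomaly(value, baseline, tolerance = 3) {
  return value > baseline.mean + tolerance * baseline.stdDev;
}

// Two hypothetical services with very different normal behavior.
const fastService = buildBaseline([100, 110, 90, 105, 95]);
const slowService = buildBaseline([1000, 1100, 900]);

// The same 400 ms response time is judged against each service's own baseline.
console.log(isAnomaly(400, fastService));
console.log(isAnomaly(400, slowService));
```

Because the threshold is computed per service, a value that is alarming for one application can be perfectly normal for another, which is the reason the note gives for not defining thresholds by hand.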