StackStorm is an open source automation platform that treats automation tasks as events. It allows users to create workflows by connecting triggers to actions using rules. Some key benefits of StackStorm include reducing mean time to resolution, avoiding failures through automated fixes, reducing risk of human error, and helping engineers sleep better by automating incident response. It has been used by companies like Symantec and Dimension Data for tasks like OpenStack cluster remediation and legacy systems replacement.
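The trigger, rule, action chain described above can be illustrated with a small sketch. This is not StackStorm's API or its rule format; the `Rule` class, the `dispatch` function, and the trigger and action names are hypothetical, chosen only to show how a rule connects an incoming event to an action.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

Event = Dict[str, Any]


@dataclass
class Rule:
    """Hypothetical rule: when `trigger` fires and `criteria` holds, run `action`."""
    name: str
    trigger: str
    criteria: Callable[[Event], bool]
    action: Callable[[Event], str]


def dispatch(trigger: str, payload: Event, rules: List[Rule]) -> List[str]:
    """Run the action of every rule that matches the trigger and its criteria."""
    results = []
    for rule in rules:
        if rule.trigger == trigger and rule.criteria(payload):
            results.append(rule.action(payload))
    return results


def isolate_port(event: Event) -> str:
    # Placeholder action (assumption): a real action would call a switch API.
    return f"isolated port for {event['vm']}"


# One illustrative rule, modeled on the "malware detected, isolate port" use case.
RULES = [
    Rule(
        name="isolate_on_malware",
        trigger="malware.detected",
        criteria=lambda e: e.get("severity") == "high",
        action=isolate_port,
    )
]

if __name__ == "__main__":
    print(dispatch("malware.detected", {"vm": "vm-42", "severity": "high"}, RULES))
```

Keeping the criteria separate from the action means the same action can be reused by several rules with different conditions.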
21. What SHOULD be automated?
From: Practice of Cloud System Administration, by Thomas Limoncelli
22. What HAS BEEN automated with StackStorm
• Security checks
– On malware detection in a VM, isolate the network port on a switch
• App blue-green deployment
– On Jenkins tests passed, bring up a new VM cluster, deploy and configure the app, set the load balancer to send a % of traffic to the new app, monitor, roll forward or back out
• Networking
– On BGP peer going down: collect troubleshooting data, post on Slack & create a JIRA ticket
– On link aggregation member error: check load; if the capacity of the rest of the LAG bundle is enough, disable the link with the error
• OpenStack
– Orphan VM clean-up: on orphans detected, shut down, email owner, keep for a few days, delete
– VM evacuation on HW failures: on host RAID failure, get the list of impacted VMs, email VM owners, evacuate VMs, create a JIRA ticket for hardware replacement.
• Service remediation:
– Cassandra “node down” recovery: on ring node dying, deploy a new node, configure it, add it to the ring.
– Remediating RabbitMQ, Galera cluster, MySQL, and more…
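The link aggregation case in the list above hinges on one decision: can the remaining LAG members carry the current load if the faulty link is disabled? A minimal sketch of that check follows. The function name, the data layout, and the 80% utilization ceiling are assumptions made for illustration; they do not come from the talk or from any StackStorm pack.

```python
from typing import Dict


def can_disable_member(
    load_gbps: Dict[str, float],
    capacity_gbps: Dict[str, float],
    faulty: str,
    max_utilization: float = 0.8,
) -> bool:
    """Return True if the other LAG members can absorb the whole bundle's load.

    `max_utilization` is an assumed safety margin: the remaining links
    should not be loaded beyond this fraction of their combined capacity.
    """
    if faulty not in capacity_gbps:
        raise ValueError(f"unknown LAG member: {faulty}")
    # Traffic on the faulty link has to move to the other members too.
    total_load = sum(load_gbps.values())
    remaining_capacity = sum(
        cap for member, cap in capacity_gbps.items() if member != faulty
    )
    return total_load <= remaining_capacity * max_utilization


if __name__ == "__main__":
    capacity = {"eth1": 10.0, "eth2": 10.0, "eth3": 10.0}
    light = {"eth1": 2.0, "eth2": 3.0, "eth3": 1.0}
    heavy = {"eth1": 9.0, "eth2": 9.0, "eth3": 9.0}
    print(can_disable_member(light, capacity, "eth2"))
    print(can_disable_member(heavy, capacity, "eth2"))
```

In a workflow, a check like this would sit between the "collect load" step and the "disable link" step, so the link is only disabled when the bundle has room to spare.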
24. “Auto-Remediation & Automation at Facebook” @ Auto-Remediation Meetup SF
https://www.meetup.com/Auto-Remediation-and-Event-Driven-Automation/events/236704012/
Facebook FBAR:
43 % Problem Fixed
51 % False positives
94 % Automated
25. “Auto-Remediation & Automation at Facebook” @ Auto-Remediation Meetup SF
https://www.meetup.com/Auto-Remediation-and-Event-Driven-Automation/events/236704012/
FBAR & StackStorm are friends
StackStorm is inspired by FBAR
StackStorm and FB FBAR collaborating since 2014
26. “Sleep Better at Night: OpenStack Cloud Auto-Healing” @ OpenStack Summit Barcelona
Mirantis: Auto-remediating 2,000 node OpenStack cluster at Symantec with StackStorm
27. “Sleep Better at Night: OpenStack Cloud Auto-Healing” @ OpenStack Summit Barcelona
Mirantis: Auto-remediating 2,000 node OpenStack cluster at Symantec with StackStorm
User: Symantec (Mirantis)
Use case: OpenStack cluster remediation
Presented by Mirantis at OpenStack Barcelona
StackStorm at Symantec
28. On-call, Without Automation
Timeline from 2:00 AM to 2:30 AM (marks at 2:00, 2:02, 2:07, 2:10, 2:15, 2:20, 2:30)
Steps shown: PagerDuty alert; engineer wakes up; logs in and ACKs; checks runbook; studies the alert; runs diagnostics; fixes the problem
Source: “Winston: Helping Netflix Engineers Sleep at Night” @ Qcon ‘16 SF
https://goo.gl/lHzq4r
33. Device support
Proven approach from cloud compute space
Diagram labels: NAPALM; automation packs; integration packs; Python bindings; device / OS interfaces
Some workflows [1] leverage unique device functions, so they call actions of the device’s integration pack.
Other workflows [2] need to be abstracted and treat devices alike (e.g. IXP provisioning on a mixture of SLX and MLX), so they use the “unified” NAPALM pack.
device drivers
Integration packs expose device configurations and operations as st2 actions. VDX and SLX packs will expose most or all device functionality; MLX is “best effort”. The NAPALM integration pack provides “unified” device actions, just as libcloud does in the “compute” domain.
While integration packs can call device interfaces directly, Python bindings provide reuse and abstract device OS versions. PyNOS binds VDX via NETCONF. Bindings for SLX and MLX don’t exist yet.
NAPALM is an open source project that exposes a unified Python API to interact with different network devices via device drivers.
Devices expose various interfaces: RESTCONF, NETCONF, CLI/TELNET, SNMP.
Devices shown: VDX, MLX, SLX, other vendors
Diagram markers: [1], [2]
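The “unified” approach on this slide, one API in front of per-device drivers, can be sketched as follows. This sketch does not use the NAPALM library or any real device pack; the `DeviceDriver` base class, the `SlxDriver` and `MlxDriver` classes, and the `collect_facts` function are hypothetical stand-ins, and the drivers return canned data instead of talking to a device.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Tuple, Type


class DeviceDriver(ABC):
    """Hypothetical unified interface that every device driver implements."""

    def __init__(self, hostname: str) -> None:
        self.hostname = hostname

    @abstractmethod
    def get_facts(self) -> Dict[str, str]:
        """Return basic device facts in a vendor-neutral shape."""


class SlxDriver(DeviceDriver):
    def get_facts(self) -> Dict[str, str]:
        # Placeholder: a real driver would query the device here.
        return {"hostname": self.hostname, "platform": "SLX"}


class MlxDriver(DeviceDriver):
    def get_facts(self) -> Dict[str, str]:
        # Placeholder: a real driver would query the device here.
        return {"hostname": self.hostname, "platform": "MLX"}


# Registry mapping a platform name to its driver class.
DRIVERS: Dict[str, Type[DeviceDriver]] = {"slx": SlxDriver, "mlx": MlxDriver}


def get_driver(platform: str) -> Type[DeviceDriver]:
    """Look up the driver class for a platform name (case-insensitive)."""
    try:
        return DRIVERS[platform.lower()]
    except KeyError:
        raise ValueError(f"no driver for platform: {platform}") from None


def collect_facts(inventory: List[Tuple[str, str]]) -> List[Dict[str, str]]:
    """Unified action: treat all devices alike, whatever the platform."""
    return [get_driver(platform)(host).get_facts() for host, platform in inventory]


if __name__ == "__main__":
    print(collect_facts([("edge1", "slx"), ("edge2", "mlx")]))
```

A workflow of type [2] would call only `collect_facts`, while a workflow of type [1] would reach for a device-specific driver when it needs a function the unified interface does not offer.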
35. “Innovation at Dimension Data: Optimizing Operations with Event Driven Automation”
https://stackstorm.com/2016/12/15/dimension-data-devops-beyond-deployment/
Dimension Data (SP, part of NTT)
36. “Innovation at Dimension Data: Optimizing Operations with Event Driven Automation”
https://stackstorm.com/2016/12/15/dimension-data-devops-beyond-deployment/
• Integrate IT systems & tools
• Security automation
• Legacy Run-book replacement
• Automation-aaS to end-users
• Top st2 contributors
Dimension Data (SP, part of NTT)
41. StackStorm is like …
AWS Lambda, AWS Step Functions
Open source, for DIY serverless
42. StackStorm is like …
Sensors, Actions, Rules, Workflows
IT Domains: Config mgmt, Storage, Networking, Containers, Cloud Infra, Monitoring, Ops Support
Step Functions
AWS Lambda
43. Serverless with Swarm
for Genomic Annotation Computing
Dmitri Zimine
http://github.com/dzimine/serverle
@dzimine
Image by Miki Yoshihito, Creative Commons license
44. Many Use Cases – One Platform
StackStorm automation platform
• Network automation
• Assisted troubleshooting
• Auto-remediation
• IT process integration
• IoT (Internet of Things)
• Serverless
• CI/CD (continuous deployment)
• NFV
• Security orchestration
• ChatOps
50. Why not legacy Runbook Automation?
• Microsoft System Center Orchestrator
• HP Operations Orchestration
• Cisco Process Orchestrator (CPO)
• VMware vCO / vRealize
They do not DevOps
53. Infrastructure as code
Case Study
• Service catalog backed by workflows
• Automate provisioning on VMware/OpenStack, 4 data centers
• Before: CPO, operator updates via GUI, click and pray, x4
• After: StackStorm, dev -> code review -> staging -> QA -> prod
64. Contribute! Everything counts
• Use StackStorm.
Try it, find automation, nail POC. Let us know, good & bad.
curl -sSL https://stackstorm.com/packages/install.sh | bash -s
docs.stackstorm.com/install
• Commit code. Become a “community maintainer”
It is not hard (2 days?). We help & support.
• Spread the word
Blog. Tweet. Talk. Mention. Bug. GitHub Star!