A quick introduction to AIOps: the business reasons why the CI/CD pipeline needs to constantly improve, and how this can be accomplished with data that's already available, using machine learning and other algorithms.
How to apply machine learning into your CI/CD pipeline
1. How to Apply Machine
Learning into Your
CI/CD Pipeline
Alon Weiss / Sealights
2. ● Complexity
○ Architecture
○ Deployment & infrastructure
○ Technologies
○ Product → Service
○ Visibility
● Load
○ Automation → more tests, longer CI cycles
○ Shift left → more tests and build steps
○ Support more devices & platforms → More tests
○ Lack of resources = bottleneck
● Human resources
○ Hiring
○ Lack of time and expertise to research, plan & execute strategic engineering tasks
The Digital Transformation and DevOps
3. AI-Ops for CI/CD | Business Impact
“We see continued evidence that software speed, stability, and availability contribute to organizational
performance (including profitability, productivity, and customer satisfaction). Our highest performers
are twice as likely to meet or exceed their organizational performance goals.” - DORA 2019
5. Alon Weiss
Chief Architect @SeaLights
alonweiss
alonw@sealights.io | www.sealights.io
The world’s #1 Software Quality Intelligence Platform, which
speeds up execution without sacrificing quality.
6. Research & Inspiration
Our daily needs - Releasing Faster and with Higher Quality
Continuous Delivery: Reliable Software Releases Through Build, Test, and
Deployment Automation / Jez Humble, David Farley
Accelerate / Nicole Forsgren, Jez Humble and Gene Kim
“Market Guide for AIOps Platforms”, “Artificial Intelligence for IT
Operations Delivers Improved Business Outcomes” by Gartner
“Take The Mystery Out Of AI for IT Operations (AIOps)” by Forrester
7. AI-Ops | Definition / Gartner
AIOps platforms combine big data and machine
learning functionality to support all primary IT
operations functions through the scalable ingestion and
analysis of the ever-increasing volume, variety and
velocity of data generated by IT.
The goal of the analytics effort is the discovery of
patterns — novel elements used to look forward in time
to predict possible incidents and emerging usage
profiles — and to look backward in time to determine
the root causes of current system behaviors.
8. AI-Ops | Definition / Forrester
Software that applies AI/ML or other advanced
analytics to business and operations data to make
correlations and provide prescriptive and
predictive answers in real time. These insights
produce real-time business performance KPIs,
allow teams to resolve incidents faster, and help
avoid incidents altogether.
9. AI-Ops | Use cases
● IT groups
○ Monitoring (IT Infrastructure, SREs)
■ excessive data usage
■ communication patterns
■ intrusion detection
○ Security
○ Release pipeline (the majority of this talk)
■ Release faster and with greater confidence in quality
● Non-IT
○ Demand / Order processing / Customer Satisfaction
○ Business Health
○ Marketing
10. AI-Ops | Usage Patterns
● Noise reduction (e.g. Alert Consolidation)
● Root Cause Analysis (e.g. during/after
incidents)
● Incident prevention (extrapolate future
events to prevent breakdowns)
● Anomaly detection beyond thresholds and
rule-based systems
● Initiating action using automation or
escalation
11. AI-Ops | Existing Tools
AI-Ops platforms
BigPanda
“Intelligent Automation for IT
Incident Management”
Moogsoft
“Purpose-Built AIOps Platform
for IT. Less Noise. Faster Fixes.
Shorter Outages.”
APMs
NewRelic AI - NRAI launched
last week
Appdynamics - “Central
Nervous System”
Dynatrace - “Davis”
Splunk
Trends
“Current tools and processes aren’t
up to the task of monitoring today’s
apps and their underpinnings”
- Forrester
“AIOps tools show a “right-shift”
across the four stages of
monitoring — data acquisition,
aggregation, analysis and action —
with their core capabilities at data
aggregation and analysis. As the
technology matures further, users
will be able to leverage proactive
advice from the platform, enabling
the action stage.”
- Gartner
13. AI-Ops for CI/CD | Data Sources
● GitHub / GitLab / Bitbucket / Azure Devops
● JIRA / ServiceNow
● Jenkins / *Pipelines / others
● Test Stages - coverage per test, timing, pass/fail
● Static scanners - code quality, dependencies, automated code review
● APMs
● Logs - ELK, Splunk
● Calendars / IM status
● Provisioning - Terraform, Ansible, Puppet, Chef
● Salesforce
14. Release Pipeline Components (pre-production)
● Build Queues
○ Pain: important jobs' wait time
○ Solution: prioritize or parallelize
● Build + Package
○ Pain: time
○ Solution: smaller components, parallelize
● Tests (unit, integration, Selenium, e2e)
○ Pain: time, time-to-failure, test failure RCA
○ Solution: Test Impact Analysis and test prioritization; pinpoint root cause to developers
● Infrastructure & Provisioning
○ Pain: limited resources, cold starts
○ Solution: provision ahead of time
● Risk management
○ Pain: manual and mostly gut-based
○ Solution: AI-assisted (anomaly detection)
● Monitoring after deployment
○ Pain: engineers are rarely involved
○ Solution: notify stakeholders and facilitate RCA
15. AI-Ops for CI/CD | Optimized Build Queues
● Goal:
○ Some jobs are more important than others. Prioritize the queues.
● Data Sources:
○ JIRA - issue types, relations, priority, severity, custom fields
○ VCS - commit history (also available on Jenkins) and change area/scope
○ Jenkins - Historic build graph and timing
○ Salesforce - customer account importance
● Machine Learning algorithm family:
○ Graph Neural Network to determine priority; Regression to estimate build length
● Usage:
○ A CI plugin determines and assigns the priority, then sorts the queue
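The prioritization idea above can be sketched with a toy scoring model. All field names and weights below are hypothetical, and the mean-of-history duration estimate merely stands in for a real regression model trained on Jenkins build history:

```python
from dataclasses import dataclass, field
from statistics import mean

@dataclass
class Job:
    name: str
    issue_priority: int      # e.g. from JIRA: 1 (blocker) .. 5 (trivial)
    customer_weight: float   # e.g. derived from Salesforce account importance
    past_durations: list = field(default_factory=list)  # minutes, from Jenkins

def estimated_duration(job: Job) -> float:
    # Stand-in for a regression model: predict length from history.
    return mean(job.past_durations) if job.past_durations else 30.0

def score(job: Job) -> float:
    # Higher score = run first. Short, urgent jobs jump the queue.
    urgency = (6 - job.issue_priority) * job.customer_weight
    return urgency / estimated_duration(job)

def prioritize(queue: list) -> list:
    # Sort the build queue by descending score.
    return sorted(queue, key=score, reverse=True)

queue = [
    Job("nightly-full", 4, 1.0, [120, 110]),
    Job("hotfix-build", 1, 3.0, [10, 12]),
    Job("feature-ci", 3, 1.5, [25, 35]),
]
print([j.name for j in prioritize(queue)])
```

In a real plugin, the score would feed the CI server's queue sorter rather than reordering a Python list.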
16. AI-Ops for CI/CD | Smart Testing
● Goal:
○ Use Test Impact Analysis to run the minimal set of tests that are necessary
○ Fail fast
○ Eliminate Overlapping tests
● Data Sources:
○ Git/Build tool - build content and changes
○ Deep Coverage tools - per-test coverage
● Machine Learning algorithm family:
○ Classification
○ Statistical models
● Usage:
○ Find impacted tests by cross-referencing the changes and past test history
○ Deep integration with test runners so they run only those that are needed
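A minimal sketch of the cross-referencing step, assuming a per-test coverage map was already collected by a deep-coverage tool during a previous full run (the file and test names are made up):

```python
# Per-test coverage map: which source files each test touches,
# as recorded by a deep-coverage tool on an earlier full run.
coverage = {
    "test_login":    {"auth.py", "session.py"},
    "test_checkout": {"cart.py", "payment.py"},
    "test_profile":  {"auth.py", "profile.py"},
}

def impacted_tests(changed_files, coverage):
    # A test is impacted if it covers any file in the change set.
    return sorted(t for t, files in coverage.items() if files & set(changed_files))

print(impacted_tests(["auth.py"], coverage))
```

Only the impacted subset is then handed to the test runner, which is where the "deep integration" comes in.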
17. AI-Ops for CI/CD | Flaky test detection
● Goal:
○ Isolate and weed out flaky tests
● Data Sources:
○ VCS - commit history (also available on Jenkins) and change area/scope
○ Deep Coverage tools - per-test coverage
○ Jenkins / Test Runners - Test results and history
● Machine Learning algorithm family:
○ Regression
○ Statistical models
● Usage:
○ If a test flips between passing/failing status without a detected change to explain it, it
may be flaky
○ Automatically quarantine tests, notify author
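The flip-without-change heuristic can be sketched in a few lines; the commit SHAs and the two-flip threshold below are purely illustrative:

```python
def looks_flaky(results, min_flips=2):
    """results: list of (commit_sha, passed) in chronological order.
    A test that flips outcome on the SAME commit, with no code change
    to explain it, is a flakiness suspect."""
    flips = 0
    for (sha_a, ok_a), (sha_b, ok_b) in zip(results, results[1:]):
        if sha_a == sha_b and ok_a != ok_b:
            flips += 1
    return flips >= min_flips

history = [("abc1", True), ("abc1", False), ("abc1", True), ("def2", True)]
print(looks_flaky(history))
```

A flagged test would then be quarantined automatically and its author notified, as the slide suggests.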
18. AI-Ops for CI/CD | Infrastructure provisioning
● Goal:
○ Prevent resource contention in CI/CD
○ Minimize wait time for resource provisioning
● Data Sources:
○ Jenkins / *Pipelines - job history and graph
○ Infrastructure - historic demand & usage, real-time capacity
○ IM / Calendar - engineers' availability
● Machine Learning algorithm family:
○ Predictive analytics (Regression)
○ Statistical models
● Usage:
○ Update autoscaler targets continuously based on real-time and historic capacity
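A toy version of the prediction step, assuming demand history has already been bucketed by hour of day from Jenkins job history; the 1.5-sigma safety buffer is an arbitrary choice for illustration, not a recommendation:

```python
from statistics import mean, stdev

# Historic CI agent demand sampled at the same hour of day over past
# weeks (e.g. from Jenkins job history plus infrastructure metrics).
demand_at_9am = [8, 10, 9, 12, 11, 10, 9]

def autoscaler_target(history, buffer_sigmas=1.5):
    """Provision ahead of time: mean demand plus a safety buffer,
    so agents are warm before the build rush (no cold starts)."""
    return round(mean(history) + buffer_sigmas * stdev(history))

print(autoscaler_target(demand_at_9am))
```

The same calculation, run per hour, also scales the pool down during off-peak hours, which is where the infrastructure-cost saving comes from.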
19. AI-Ops for CI/CD | Smart Risk Management
● Goal:
○ Formally introduce the concept of Risk Management to the semi-automatic review
process
○ Find common risks
■ Untested code and configuration changes
■ Anomalies
● Test time
● Code paths
● Network usage pattern
● Git: Big changes & unusual churn, New contributors,
Self-merging PRs, Long-running PR
● Data Sources:
○ VCS
○ APMs, NPMDs tools
20. AI-Ops for CI/CD | Smart Risk Management
● Machine Learning algorithm family: Anomaly Detection
● Usage:
○ Evaluate risk using Anomaly Detection, 3rd party tools (e.g. GitPrime)
○ Put smart quality gates in place
○ Require manual approval only when risks are too high
○ Determine APM thresholds and rollout configuration according to risk
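A minimal z-score sketch of the anomaly-detection step; the metrics and the 2-sigma threshold are illustrative, and real platforms use far richer models:

```python
from statistics import mean, stdev

def risk_flags(change, history, threshold=2.0):
    """Flag metrics of a new change that deviate more than `threshold`
    standard deviations from the team's historic baseline
    (simple z-score anomaly detection)."""
    flags = []
    for metric, value in change.items():
        baseline = [h[metric] for h in history]
        mu, sigma = mean(baseline), stdev(baseline)
        if sigma and abs(value - mu) / sigma > threshold:
            flags.append(metric)
    return flags

history = [
    {"lines_changed": 120, "files_touched": 4},
    {"lines_changed": 90,  "files_touched": 3},
    {"lines_changed": 150, "files_touched": 5},
    {"lines_changed": 110, "files_touched": 4},
]
change = {"lines_changed": 900, "files_touched": 4}  # unusual churn
print(risk_flags(change, history))
```

A quality gate could then require manual approval only when the flag list is non-empty, matching the "approve only when risk is high" usage above.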
21. AI-Ops for CI/CD | Proactive Root Cause Analysis
● Goal:
○ Facilitate root cause analysis for production and test failures
● Data Sources:
○ VCS - commit history (also available on Jenkins) and change area/scope
○ ALMs - incidents, stack frames
○ Log collectors - capture messages, function names, stack frames
● Machine Learning algorithm family:
○ None! Good old text indexing
● Usage:
○ Cross reference the suspected code areas and logs with the commit history and
escalate to the contributors
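Since no ML is needed here, the cross-referencing can be as simple as set intersection over indexed tokens; the commit data and stack frames below are made up:

```python
# "Good old text indexing": map tokens from recent commits (changed
# function names, file names) to commit metadata, then match tokens
# extracted from a failure's stack trace against that index.
commits = [
    {"sha": "a1b2", "author": "dana", "tokens": {"parse_config", "config.py"}},
    {"sha": "c3d4", "author": "lee",  "tokens": {"render_page", "views.py"}},
]

def suspects(stack_frames, commits):
    # Return commits whose changed symbols appear in the failing stack.
    frames = set(stack_frames)
    return [c["sha"] for c in commits if c["tokens"] & frames]

stack = ["main", "load_settings", "parse_config"]  # from the log collector
print(suspects(stack, commits))
```

The matching commits' authors are then the natural people to escalate the failure to.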
22. AI-Ops Market | Market Direction and Forecast
Devops adoption is accelerating:
“The proportion of our highest performers has almost tripled, now comprising 20% of all teams.
This shows that excellence is possible - those that execute on key capabilities see the benefits.”
AIOps adoption is increasing, and platforms are the next big thing:
“By 2020, approximately 50% of enterprises will actively use AIOps technologies ... up from 10%
today” - Gartner
“Over the next 5 years, wide-scope AIOps platforms will become the de facto form-factor for the
delivery of AIOps functionality as opposed to AIOps functionality embedded in a monitoring tool
like APM” - Gartner
Architecture - Microservices, Serverless technologies, FaaS/Lambda
Deployment - Monoliths → Microservices, K8s value comes with its cost
Technologies - Teams often choose their own mix. More common to see a lot of technologies (java, node, python, .NET, etc.)
Product - a product is now a service, needs an ecosystem of monitoring and support
Visibility - multiple tools (APMs, network monitoring, logs)
Load - MORE services, pipelines
IT-generated data grows by 2-3x per annum
AIOps described as extending the monitoring from Application, through Infrastructure to Business and Integrations
It’s tough to keep up with the demands from business stakeholders
I&O teams have become too siloed by discipline
1st gen - observation, 2nd gen - diagnosis. The problem of visibility is getting worse and will continue to do so
IT - monitoring too many systems and too much data, using rules and thresholds instead of anomaly-detection algorithms
NR AI - noise reduction, improved correlations, augmented intelligence
The central concept: The Change. Get it to production as soon as possible, with minimal human involvement and without sacrificing quality or security.
Shifting left = “CI before the CI”, so we find and fix problems at the right time, before they affect others.
Key principles of CI:
Build quality in - automation & fail fast
Work in small batches
Computers perform repetitive tasks; people solve problems
Continuous Improvement
Everyone is responsible
Note: In a cloud-native, unconstrained datacenter, build queues are replaced with scheduling algorithms (e.g. Kubernetes-native solutions like Jenkins X, Tekton, GitLab)
Fail fast - Shift left using Pull Requests
Run as many test stages as possible before they are merged
Can be further correlated with ALM and Log collectors to find the difference between the “passing” and “failing” profiles
This also minimizes Infrastructure cost by scaling down during off-peak hours
Every change imposes a risk, but proper risk management is not a common practice
Change-advisory boards are manual and error-prone
Shoutout to GitPrime and their “20 patterns to watch for in your engineering team”
Risk management should be an optional final step before accepting the change
The same data can flow backwards - give hint to developers when they are touching code that’s known to be sensitive / erroneous.
This “risky code” can be shown during code review to ensure the future modifications are well tested