How to Apply Machine Learning into Your CI/CD Pipeline
Alon Weiss / Sealights
The Digital Transformation and DevOps
● Complexity
○ Architecture
○ Deployment & infrastructure
○ Technologies
○ Product → Service
○ Visibility
● Load
○ Automation → more tests, longer CI cycles
○ Shift left → more tests and build steps
○ Support more devices & platforms → more tests
○ Lack of resources = bottleneck
● Human resources
○ Hiring
○ Lack of time and expertise to research, plan & execute strategic engineering tasks
AI-Ops for CI/CD | Business Impact
“We see continued evidence that software speed, stability, and availability contribute to organizational
performance (including profitability, productivity, and customer satisfaction). Our highest performers
are twice as likely to meet or exceed their organizational performance goals.” - DORA 2019
Alon Weiss
Chief Architect @SeaLights
alonweiss
alonw@sealights.io | www.sealights.io
The world’s #1 Software Quality Intelligence Platform that
speeds up execution without sacrificing quality.
Research & Inspiration
Our daily needs - Releasing Faster and with Higher Quality
Continuous Delivery: Reliable Software Releases Through Build, Test, and
Deployment Automation / Jez Humble, David Farley
Accelerate / Nicole Forsgren, Jez Humble and Gene Kim
“Market Guide for AIOps Platforms”, “Artificial Intelligence for IT
Operations Delivers Improved Business Outcomes” by Gartner
“Take The Mystery Out Of AI for IT Operations (AIOps)” by Forrester
AI-Ops | Definition / Gartner
AIOps platforms combine big data and machine
learning functionality to support all primary IT
operations functions through the scalable ingestion and
analysis of the ever-increasing volume, variety and
velocity of data generated by IT.
The goal of the analytics effort is the discovery of
patterns — novel elements used to look forward in time
to predict possible incidents and emerging usage
profiles — and to look backward in time to determine
the root causes of current system behaviors.
AI-Ops | Definition / Forrester
Software that applies AI/ML or other advanced
analytics to business and operations data to make
correlations and provide prescriptive and
predictive answers in real time. These insights
produce real-time business performance KPIs,
allow teams to resolve incidents faster, and help
avoid incidents altogether.
AI-Ops | Use cases
● IT groups
○ Monitoring (IT Infrastructure, SREs)
■ excessive data usage
■ communication patterns
■ intrusion detection
○ Security
○ Release pipeline (the majority of this talk)
■ Release faster and with greater confidence in quality
● Non-IT
○ Demand / Order processing / Customer Satisfaction
○ Business Health
○ Marketing
AI-Ops | Usage Patterns
● Noise reduction (e.g. Alert Consolidation)
● Root Cause Analysis (e.g. during/after
incidents)
● Incident prevention (extrapolate future
events to prevent breakdowns)
● Anomaly detection beyond thresholds and
rule-based systems
● Initiating action using automation or
escalation
AI-Ops | Existing Tools
● AI-Ops platforms
○ BigPanda - “Intelligent Automation for IT Incident Management”
○ Moogsoft - “Purpose-Built AIOps Platform for IT. Less Noise. Faster Fixes. Shorter Outages.”
● APMs
○ NewRelic AI - NRAI launched last week
○ Appdynamics - “Central Nervous System”
○ Dynatrace - “Davis”
○ Splunk
Trends
“Current tools and processes aren’t
up to the task of monitoring today’s
apps and their underpinnings”
- Forrester
“AIOps tools show a ‘right-shift’
across the four stages of
monitoring — data acquisition,
aggregation, analysis and action —
with their core capabilities at data
aggregation and analysis. As the
technology matures further, users
will be able to leverage proactive
advice from the platform, enabling
the action stage.”
- Gartner
Applying AI-Ops to the CI/CD pipeline
AI-Ops for CI/CD | Data Sources
● GitHub / GitLab / Bitbucket / Azure DevOps
● JIRA / ServiceNow
● Jenkins / *Pipelines / others
● Test Stages - coverage per test, timing, pass/fail
● Static scanners - code quality, dependencies, automated code review
● APMs
● Logs - ELK, Splunk
● Calendars / IM status
● Provisioning - Terraform, Ansible, Puppet, Chef
● Salesforce
Release Pipeline Components (pre-production)
● Build Queues
○ Pain: Important jobs wait time
○ Solution: Prioritize or parallelize
● Build+Package
○ Pain: Time
○ Solution: Smaller components, parallelize
● Tests (Unit tests, Integration, Selenium, e2e)
○ Pain: Time, time-to-failure, test failure RCA
○ Solution: Test Impact Analysis and Test Prioritization; pinpoint root cause to developers
● Infrastructure & Provisioning
○ Pain: Limited resources, cold starts
○ Solution: Provision ahead of time
● Risk management
○ Pain: Manual and mostly gut-based
○ Solution: AI assisted (anomaly detection)
● Monitoring after deployment
○ Pain: Engineers are rarely involved
○ Solution: Notify stakeholders and facilitate RCA
AI-Ops for CI/CD | Optimized Build Queues
● Goal:
○ Some jobs are more important than others. Prioritize the queues.
● Data Sources:
○ JIRA - issue types, relations, priority, severity, custom fields
○ VCS - commit history (also available on Jenkins) and change area/scope
○ Jenkins - Historic build graph and timing
○ Salesforce - customer account importance
● Machine Learning algorithm family:
○ Graph Neural Network to determine priority
○ Regression to determine build length
● Usage:
○ A CI plugin to determine and assign the priority, then sort the queue
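A minimal sketch of the "assign a priority, then sort the queue" step. It does not implement a Graph Neural Network: a hand-weighted score stands in for the learned priority model, and `predicted_minutes` stands in for the output of a build-length regression. The `QueuedJob` fields, the 0.7/0.3 weights and the sample jobs are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueuedJob:
    name: str
    issue_priority: int       # hypothetical: 1 (lowest) to 5 (highest), e.g. from the issue tracker
    customer_weight: float    # hypothetical: 0.0 to 1.0, e.g. customer account importance
    predicted_minutes: float  # hypothetical: output of a build-length regression


def priority_score(job: QueuedJob) -> float:
    """Hand-tuned stand-in for a learned priority model (weights are assumptions)."""
    return 0.7 * (job.issue_priority / 5) + 0.3 * job.customer_weight


def sort_queue(jobs):
    """Highest score first; among equal scores, the shorter predicted build goes first."""
    return sorted(jobs, key=lambda j: (-priority_score(j), j.predicted_minutes))


queue = [
    QueuedJob("docs-typo", 1, 0.1, 4.0),
    QueuedJob("hotfix-login", 5, 0.9, 12.0),
    QueuedJob("feature-export", 3, 0.5, 25.0),
]
ordered_names = [job.name for job in sort_queue(queue)]
print(ordered_names)
```

In a real plugin the score function would be replaced by whatever model is trained on the data sources listed above; the sorting step stays the same.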
AI-Ops for CI/CD | Smart Testing
● Goal:
○ Use Test Impact Analysis to run the minimal set of tests that are necessary
○ Fail fast
○ Eliminate Overlapping tests
● Data Sources:
○ Git/Build tool - build content and changes
○ Deep Coverage tools - per-test coverage
● Machine Learning algorithm family:
○ Classification
○ Statistical models
● Usage:
○ Find impacted tests by cross-referencing the changes and past test history
○ Deep integration with test runners so they run only those that are needed
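The cross-referencing step can be sketched as a set intersection between the changed files and each test's recorded coverage, followed by a fail-fast ordering based on past failure rates. This is a simplified illustration, not a classifier; the function names, the file-level granularity and the sample data are assumptions.

```python
def impacted_tests(changed_files, coverage_by_test):
    """Return tests whose recorded coverage touches at least one changed file."""
    changed = set(changed_files)
    return sorted(
        test for test, files in coverage_by_test.items() if changed & set(files)
    )


def order_fail_fast(tests, failure_rate):
    """Run historically failure-prone tests first so a bad change fails early."""
    return sorted(tests, key=lambda t: (-failure_rate.get(t, 0.0), t))


# Hypothetical per-test coverage, as a deep coverage tool might report it.
coverage = {
    "test_login": {"auth.py", "session.py"},
    "test_checkout": {"cart.py", "payment.py"},
    "test_profile": {"auth.py", "profile.py"},
}
# Hypothetical historic failure rates per test.
history = {"test_login": 0.02, "test_profile": 0.10}

selected = impacted_tests(["auth.py"], coverage)
ordered = order_fail_fast(selected, history)
print(selected, ordered)
```

Here `test_checkout` is skipped because nothing it covers has changed.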
AI-Ops for CI/CD | Flaky test detection
● Goal:
○ Isolate and weed out flaky tests
● Data Sources:
○ VCS - commit history (also available on Jenkins) and change area/scope
○ Deep Coverage tools - per-test coverage
○ Jenkins / Test Runners - Test results and history
● Machine Learning algorithm family:
○ Regression
○ Statistical models
● Usage:
○ If a test flips between passing/failing status without a detected change to explain it, it
may be flaky
○ Automatically quarantine tests, notify author
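One simple way to express the flip rule above: group results by test and code revision, and flag any test that both passed and failed on the same revision. This is a rule-based sketch rather than a regression or statistical model, and the `(test, revision, passed)` record shape is an assumption.

```python
from collections import defaultdict


def find_flaky_candidates(runs):
    """runs: iterable of (test_name, code_revision, passed) tuples.

    A test is a flaky candidate if it has both a pass and a fail
    recorded against the same revision, i.e. no code change explains the flip.
    """
    outcomes = defaultdict(set)
    for test, revision, passed in runs:
        outcomes[(test, revision)].add(passed)
    return sorted({test for (test, _), seen in outcomes.items() if len(seen) > 1})


# Hypothetical test history.
runs = [
    ("test_upload", "abc123", True),
    ("test_upload", "abc123", False),  # same revision, different result
    ("test_login", "abc123", True),
    ("test_login", "def456", False),   # result changed, but so did the code
]
candidates = find_flaky_candidates(runs)
print(candidates)
```

A fuller version would also use per-test coverage to ignore revisions that did not touch code the test exercises.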
AI-Ops for CI/CD | Infrastructure provisioning
● Goal:
○ Prevent resource contention in CI/CD
○ Minimize wait time for resource provisioning
● Data Sources:
○ Jenkins / * Pipelines - job history and graph
○ Infrastructure - historic demand & usage, real-time capacity
○ IM / Calendar - Engineers availability
● Machine Learning algorithm family:
○ Predictive analytics (Regression)
○ Statistical models
● Usage:
○ Update autoscaler targets continuously based on real-time and historic capacity
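As an illustration of the regression idea, the sketch below fits a least-squares line to recent demand, extrapolates one interval ahead, and converts the forecast into an agent count. The function names, the 25% headroom and the `jobs_per_agent` parameter are assumptions; a production forecast would likely also account for seasonality and calendar data.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def target_agents(history, jobs_per_agent, headroom=1.25):
    """history: concurrent job counts at consecutive, equally spaced intervals."""
    xs = list(range(len(history)))
    slope, intercept = fit_line(xs, history)
    predicted = max(0.0, slope * len(history) + intercept)  # next interval
    return math.ceil(predicted * headroom / jobs_per_agent)


demand = [10, 12, 14, 16, 18]  # hypothetical concurrent jobs per interval
agents = target_agents(demand, jobs_per_agent=4)
print(agents)
```

The result would be fed to the autoscaler as its new target ahead of the predicted demand, which is what avoids cold starts.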
AI-Ops for CI/CD | Smart Risk Management
● Goal:
○ Formally introduce the concept of Risk Management to the semi-automatic review
process
○ Find common risks
■ Untested code and configuration changes
■ Anomalies
● Test time
● Code paths
● Network usage pattern
● Git: Big changes & unusual churn, new contributors,
self-merging PRs, long-running PRs
● Data Sources:
○ VCS
○ APM and NPMD tools
AI-Ops for CI/CD | Smart Risk Management
● Machine Learning algorithm family: Anomaly Detection
● Usage:
○ Evaluate risk using Anomaly Detection, 3rd party tools (e.g. GitPrime)
○ Put smart quality gates in place
○ Require manual approval only when risks are too high
○ Determine APM thresholds and rollout configuration according to risk
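A smart quality gate of this kind could be sketched with a basic z-score check: compare each signal of the current change with its own history and require manual approval only when one deviates strongly. The z-score is just one simple anomaly detector chosen for illustration; the signal names, the 3-sigma threshold and the sample data are assumptions.

```python
from statistics import mean, stdev


def z_score(value, history):
    """Number of sample standard deviations between value and the historic mean."""
    sigma = stdev(history)
    if sigma == 0:
        return 0.0 if value == history[0] else float("inf")
    return (value - mean(history)) / sigma


def needs_manual_approval(signals, baselines, threshold=3.0):
    """True when any signal deviates from its baseline by more than `threshold` sigmas."""
    return any(
        abs(z_score(signals[name], baselines[name])) > threshold for name in signals
    )


# Hypothetical baselines from previous builds and merges.
baselines = {
    "test_minutes": [10, 11, 9, 10, 10],
    "lines_changed": [40, 55, 35, 60, 50],
}
routine_change = {"test_minutes": 10.5, "lines_changed": 52}
risky_change = {"test_minutes": 10.5, "lines_changed": 900}

print(needs_manual_approval(routine_change, baselines))
print(needs_manual_approval(risky_change, baselines))
```

The same score could also drive the rollout configuration, for example a slower canary for changes that are unusual but below the approval threshold.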
AI-Ops for CI/CD | Proactive Root Cause Analysis
● Goal:
○ Facilitate root cause analysis for production and test failures
● Data Sources:
○ VCS - commit history (also available on Jenkins) and change area/scope
○ ALMs - incidents, stack frames
○ Log collectors - capture messages, function names, stack frames
● Machine Learning algorithm family:
○ None! Good old text indexing
● Usage:
○ Cross reference the suspected code areas and logs with the commit history and
escalate to the contributors
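The cross-referencing described above needs no model, only an index. The sketch below indexes commits by file path, extracts file paths from Python-style traceback frames in a log, and returns the contributors who last touched those files. The commit record shape, the frame pattern and the sample data are assumptions; other languages and log formats would need their own patterns.

```python
import re
from collections import defaultdict

# Matches Python-style traceback frames such as: File "pkg/mod.py", line 42
FRAME_PATTERN = re.compile(r'File "([^"]+)", line \d+')


def index_commits(commits):
    """Map each file path to the commits that touched it (input assumed newest first)."""
    index = defaultdict(list)
    for commit in commits:
        for path in commit["files"]:
            index[path].append(commit)
    return index


def suspects(log_text, index):
    """Return (author, sha) pairs for commits touching files named in the log."""
    found = []
    for path in FRAME_PATTERN.findall(log_text):
        for commit in index.get(path, []):
            entry = (commit["author"], commit["sha"])
            if entry not in found:
                found.append(entry)
    return found


# Hypothetical commit history and log excerpt.
commits = [
    {"sha": "c3", "author": "dana", "files": ["billing/invoice.py"]},
    {"sha": "c2", "author": "lee", "files": ["auth/session.py", "billing/invoice.py"]},
    {"sha": "c1", "author": "sam", "files": ["docs/readme.md"]},
]
log = (
    "Traceback (most recent call last):\n"
    '  File "billing/invoice.py", line 42, in total\n'
    "ZeroDivisionError: division by zero"
)
to_notify = suspects(log, index_commits(commits))
print(to_notify)
```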
AI-Ops Market | Market Direction and Forecast
DevOps adoption is accelerating:
“The proportion of our highest performers has almost tripled, now comprising 20% of all teams.
This shows that excellence is possible - those that execute on key capabilities see the benefits.”
AIOps adoption is increasing, platforms are the next big thing:
“By 2020, approximately 50% of enterprises will actively use AIOps technologies ... up from 10%
today” - Gartner
“Over the next 5 years, wide-scope AIOps platforms will become the de facto form-factor for the
delivery of AIOps functionality as opposed to AIOps functionality embedded in a monitoring tool
like APM” - Gartner
Thank You!
Of course, we’re hiring! :-)
Questions, anyone?

Weitere ähnliche Inhalte

Was ist angesagt?

Was ist angesagt? (20)

Chaos Engineering for Docker
Chaos Engineering for DockerChaos Engineering for Docker
Chaos Engineering for Docker
 
Introduction to Chaos Engineering
Introduction to Chaos EngineeringIntroduction to Chaos Engineering
Introduction to Chaos Engineering
 
Observability: Beyond the Three Pillars with Spring
Observability: Beyond the Three Pillars with SpringObservability: Beyond the Three Pillars with Spring
Observability: Beyond the Three Pillars with Spring
 
Observability for Modern Applications (CON306-R1) - AWS re:Invent 2018
Observability for Modern Applications (CON306-R1) - AWS re:Invent 2018Observability for Modern Applications (CON306-R1) - AWS re:Invent 2018
Observability for Modern Applications (CON306-R1) - AWS re:Invent 2018
 
DevSecOps and the CI/CD Pipeline
 DevSecOps and the CI/CD Pipeline DevSecOps and the CI/CD Pipeline
DevSecOps and the CI/CD Pipeline
 
Observability
ObservabilityObservability
Observability
 
A Definition of Done for DevSecOps
A Definition of Done for DevSecOpsA Definition of Done for DevSecOps
A Definition of Done for DevSecOps
 
How Uber scaled its Real Time Infrastructure to Trillion events per day
How Uber scaled its Real Time Infrastructure to Trillion events per dayHow Uber scaled its Real Time Infrastructure to Trillion events per day
How Uber scaled its Real Time Infrastructure to Trillion events per day
 
Building an SRE Organization @ Squarespace
Building an SRE Organization @ SquarespaceBuilding an SRE Organization @ Squarespace
Building an SRE Organization @ Squarespace
 
AIOps - The next 5 years
AIOps - The next 5 yearsAIOps - The next 5 years
AIOps - The next 5 years
 
CNCF Online - Data Protection Guardrails using Open Policy Agent (OPA).pdf
CNCF Online - Data Protection Guardrails using Open Policy Agent (OPA).pdfCNCF Online - Data Protection Guardrails using Open Policy Agent (OPA).pdf
CNCF Online - Data Protection Guardrails using Open Policy Agent (OPA).pdf
 
Road to (Enterprise) Observability
Road to (Enterprise) ObservabilityRoad to (Enterprise) Observability
Road to (Enterprise) Observability
 
The vital role of AIOps in overcoming IT operational challenges - DEM07-SR - ...
The vital role of AIOps in overcoming IT operational challenges - DEM07-SR - ...The vital role of AIOps in overcoming IT operational challenges - DEM07-SR - ...
The vital role of AIOps in overcoming IT operational challenges - DEM07-SR - ...
 
Fall in Love with Graphs and Metrics using Grafana
Fall in Love with Graphs and Metrics using GrafanaFall in Love with Graphs and Metrics using Grafana
Fall in Love with Graphs and Metrics using Grafana
 
Infrastructure as Code Maturity Model v1
Infrastructure as Code Maturity Model v1Infrastructure as Code Maturity Model v1
Infrastructure as Code Maturity Model v1
 
Application Monitoring using Datadog
Application Monitoring using DatadogApplication Monitoring using Datadog
Application Monitoring using Datadog
 
Build real-time streaming data pipelines to AWS with Confluent
Build real-time streaming data pipelines to AWS with ConfluentBuild real-time streaming data pipelines to AWS with Confluent
Build real-time streaming data pipelines to AWS with Confluent
 
[Study Guide] Google Professional Cloud Architect (GCP-PCA) Certification
[Study Guide] Google Professional Cloud Architect (GCP-PCA) Certification[Study Guide] Google Professional Cloud Architect (GCP-PCA) Certification
[Study Guide] Google Professional Cloud Architect (GCP-PCA) Certification
 
Cloud-Native Observability
Cloud-Native ObservabilityCloud-Native Observability
Cloud-Native Observability
 
More Than Monitoring: How Observability Takes You From Firefighting to Fire P...
More Than Monitoring: How Observability Takes You From Firefighting to Fire P...More Than Monitoring: How Observability Takes You From Firefighting to Fire P...
More Than Monitoring: How Observability Takes You From Firefighting to Fire P...
 

Ähnlich wie How to apply machine learning into your CI/CD pipeline

The differing ways to monitor and instrument
The differing ways to monitor and instrumentThe differing ways to monitor and instrument
The differing ways to monitor and instrument
Jonah Kowall
 
Agile Gurugram 2023 | Observability for Modern Applications. How does it help...
Agile Gurugram 2023 | Observability for Modern Applications. How does it help...Agile Gurugram 2023 | Observability for Modern Applications. How does it help...
Agile Gurugram 2023 | Observability for Modern Applications. How does it help...
AgileNetwork
 
Curiosity Software, Infuse and Kumoco present: The Democratisation of Testing
Curiosity Software, Infuse and Kumoco present: The Democratisation of TestingCuriosity Software, Infuse and Kumoco present: The Democratisation of Testing
Curiosity Software, Infuse and Kumoco present: The Democratisation of Testing
Curiosity Software Ireland
 
Bridging the Gap: from Data Science to Production
Bridging the Gap: from Data Science to ProductionBridging the Gap: from Data Science to Production
Bridging the Gap: from Data Science to Production
Florian Wilhelm
 

Ähnlich wie How to apply machine learning into your CI/CD pipeline (20)

AI for Software Engineering
AI for Software EngineeringAI for Software Engineering
AI for Software Engineering
 
DATA @ NFLX (Tableau Conference 2014 Presentation)
DATA @ NFLX (Tableau Conference 2014 Presentation)DATA @ NFLX (Tableau Conference 2014 Presentation)
DATA @ NFLX (Tableau Conference 2014 Presentation)
 
Challenges of Operationalising Data Science in Production
Challenges of Operationalising Data Science in ProductionChallenges of Operationalising Data Science in Production
Challenges of Operationalising Data Science in Production
 
Monitoring and Instrumentation Strategies: Tips and Best Practices - AppSphere16
Monitoring and Instrumentation Strategies: Tips and Best Practices - AppSphere16Monitoring and Instrumentation Strategies: Tips and Best Practices - AppSphere16
Monitoring and Instrumentation Strategies: Tips and Best Practices - AppSphere16
 
The differing ways to monitor and instrument
The differing ways to monitor and instrumentThe differing ways to monitor and instrument
The differing ways to monitor and instrument
 
Building a Real-Time Security Application Using Log Data and Machine Learning...
Building a Real-Time Security Application Using Log Data and Machine Learning...Building a Real-Time Security Application Using Log Data and Machine Learning...
Building a Real-Time Security Application Using Log Data and Machine Learning...
 
"Quality Assurance: Achieving Excellence in startup without a Dedicated QA", ...
"Quality Assurance: Achieving Excellence in startup without a Dedicated QA", ..."Quality Assurance: Achieving Excellence in startup without a Dedicated QA", ...
"Quality Assurance: Achieving Excellence in startup without a Dedicated QA", ...
 
Observability for Application Developers (1)-1.pptx
Observability for Application Developers (1)-1.pptxObservability for Application Developers (1)-1.pptx
Observability for Application Developers (1)-1.pptx
 
Jonathon Wright - Intelligent Performance Cognitive Learning (AIOps)
Jonathon Wright - Intelligent Performance Cognitive Learning (AIOps)Jonathon Wright - Intelligent Performance Cognitive Learning (AIOps)
Jonathon Wright - Intelligent Performance Cognitive Learning (AIOps)
 
Agile Testing Process Analytics: From Data to Insightful Information
Agile Testing Process Analytics: From Data to Insightful InformationAgile Testing Process Analytics: From Data to Insightful Information
Agile Testing Process Analytics: From Data to Insightful Information
 
Building an Open Source AppSec Pipeline - 2015 Texas Linux Fest
Building an Open Source AppSec Pipeline - 2015 Texas Linux FestBuilding an Open Source AppSec Pipeline - 2015 Texas Linux Fest
Building an Open Source AppSec Pipeline - 2015 Texas Linux Fest
 
SplunkLive! Frankfurt 2018 - Integrating Metrics & Logs
SplunkLive! Frankfurt 2018 - Integrating Metrics & LogsSplunkLive! Frankfurt 2018 - Integrating Metrics & Logs
SplunkLive! Frankfurt 2018 - Integrating Metrics & Logs
 
Productionizing Machine Learning with a Microservices Architecture
Productionizing Machine Learning with a Microservices ArchitectureProductionizing Machine Learning with a Microservices Architecture
Productionizing Machine Learning with a Microservices Architecture
 
Agile Gurugram 2023 | Observability for Modern Applications. How does it help...
Agile Gurugram 2023 | Observability for Modern Applications. How does it help...Agile Gurugram 2023 | Observability for Modern Applications. How does it help...
Agile Gurugram 2023 | Observability for Modern Applications. How does it help...
 
Curiosity Software, Infuse and Kumoco present: The Democratisation of Testing
Curiosity Software, Infuse and Kumoco present: The Democratisation of TestingCuriosity Software, Infuse and Kumoco present: The Democratisation of Testing
Curiosity Software, Infuse and Kumoco present: The Democratisation of Testing
 
DevOps Powered by Splunk
DevOps Powered by SplunkDevOps Powered by Splunk
DevOps Powered by Splunk
 
On the Application of AI for Failure Management: Problems, Solutions and Algo...
On the Application of AI for Failure Management: Problems, Solutions and Algo...On the Application of AI for Failure Management: Problems, Solutions and Algo...
On the Application of AI for Failure Management: Problems, Solutions and Algo...
 
(Technologies) AI, Machine Learning, Predictive Analytics, IIOT, Cloud,Web-fr...
(Technologies) AI, Machine Learning, Predictive Analytics, IIOT, Cloud,Web-fr...(Technologies) AI, Machine Learning, Predictive Analytics, IIOT, Cloud,Web-fr...
(Technologies) AI, Machine Learning, Predictive Analytics, IIOT, Cloud,Web-fr...
 
SplunkLive! Munich 2018: Integrating Metrics and Logs
SplunkLive! Munich 2018: Integrating Metrics and LogsSplunkLive! Munich 2018: Integrating Metrics and Logs
SplunkLive! Munich 2018: Integrating Metrics and Logs
 
Bridging the Gap: from Data Science to Production
Bridging the Gap: from Data Science to ProductionBridging the Gap: from Data Science to Production
Bridging the Gap: from Data Science to Production
 

Kürzlich hochgeladen

IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
Enterprise Knowledge
 

Kürzlich hochgeladen (20)

What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 

How to apply machine learning into your CI/CD pipeline

  • 1. How to Apply Machine Learning into Your CI/CD Pipeline Alon Weiss / Sealights
  • 2. ● Complexity ○ Architecture ○ Deployment & infrastructure ○ Technologies ○ Product → Service ○ Visibility ● Load ○ Automation → more tests, longer CI cycles ○ Shift left → more tests and build steps ○ Support more devices & platforms → More tests ○ Lack of resources = bottleneck ● Human resources ○ Hiring ○ Lack of time and expertise to research, plan & execute strategic engineering tasks The Digital Transformation and DevOps
  • 3. AI-Ops for CI/CD | Business Impact “We see continued evidence that software speed, stability, and availability contribute to organizational performance (including profitability, productivity, and customer satisfaction). Our highest performers are twice as likely to meet or exceed their organizational performance goals.” - DORA 2019
  • 4. AI-Ops for CI/CD | Business Impact
  • 5. Alon Weiss Chief Architect @SeaLights alonweiss alonw@sealights.io | www.sealights.io The world’s #1 Software Quality Intelligence Platform that fastens executions without sacrificing quality.
  • 6. Research & Inspiration Our daily needs - Releasing Faster and with Higher Quality Continuous Delivery: Reliable Software Releases Through Build, Test, and Deployment Automation / Jez Humble, David Farley Accelerate / Nicole Forsgren, Jez Humble and Gene Kim “Market Guide for AIOps Platforms”, “Artificial Intelligence for IT Operations Delivers Improved Business Outcomes” by Gartner “Take The Mystery Out Of AI for IT Operations (AIOps)” by Forrester
  • 7. AI-Ops | Definition / Gartner AIOps platforms combine big data and machine learning functionality to support all primary IT operations functions through the scalable ingestion and analysis of the ever-increasing volume, variety and velocity of data generated by IT. The goal of the analytics effort is the discovery of patterns — novel elements used to look forward in time to predict possible incidents and emerging usage profiles — and to look backward in time to determine the root causes of current system behaviors .
  • 8. AI-Ops | Definition / Forrester Software that applies AI/ML or other advanced analytics to business and operations data to make correlations and provide prescriptive and predictive answers in real time. These insights produce real-time business performance KPIs, allow teams to resolve incidents faster, and help avoid incidents altogether.
  • 9. AI-Ops | Use cases ● IT groups ○ Monitoring (IT Infrastructure, SREs) ■ excessive data usage ■ communication patterns ■ intrusion detection ○ Security ○ Release pipeline (the majority of this talk) ■ Release faster and with greater confidence in quality ● Non-IT ○ Demand / Order processing / Customer Satisfaction ○ Business Health ○ Marketing
  • 10. AI-Ops | Usage Patterns ● Noise reduction (e.g. Alert Consolidation) ● Root Cause Analysis (e.g. during/after incidents) ● Incident prevention (extrapolate future events to prevent breakdowns) ● Anomaly detection beyond thresholds and rule-based systems ● Initiating action using automation or escalation
  • 11. AI-Ops | Existing Tools AI-Ops platforms BigPanda “Intelligent Automation for IT Incident Management” Moogsoft “Purpose-Built AIOps Platform for IT. Less Noise. Faster Fixes. Shorter Outages." APMs NewRelic AI - NRAI launched last week Appdynamics - “Central Nervous System” Dynatrace - “Davis” Splunk Trends “Current tools and processes aren’t up to the task of monitoring today’s apps and their underpinnings” - Forrester “AIOps tools show a “right-shift” across the four stages of monitoring — data acquisition, aggregation, analysis and action — with their core capabilities at data aggregation and analysis. As the technology matures further, users will be able to leverage proactive advice from the platform, enabling the action stage. ” - Gartner
  • 12. Applying AI-Ops to the CI/CD pipeline
  • 13. AI-Ops for CI/CD | Data Sources ● GitHub / GitLab / Bitbucket / Azure Devops ● JIRA / ServiceNow ● Jenkins / *Pipelines / others ● Test Stages - coverage per test, timing, pass/fail ● Static scanners - code quality, dependencies, automated code review ● APMs ● Logs - ELK, Splunk ● Calendars / IM status ● Provisioning - Terraform, Ansible, Puppet, Chef ● Salesforce
  • 14. Release Pipeline Components (pre-production) Pain Solution Build Queues Important jobs wait time Prioritize or Parallelize Build+Package Time Smaller components, parallelize Tests (Unit tests, Integration, Selenium ,e2e) Time Time-to-failure Test failure RCA Test Impact Analysis and Test Prioritization Pinpoint root cause to developers Infrastructure & Provisioning Limited resources Cold starts Provision ahead of time Risk management Manual and mostly gut-based AI assisted (anomaly detection) Monitoring after deployment Engineers are rarely involved Notify stakeholders and facilitate RCA
  • 15. AI-Ops for CI/CD | Optimized Build Queues ● Goal: ○ Some jobs are more important than others. Prioritize the queues. ● Data Sources: ○ JIRA - issue types, relations, priority, severity, custom fields ○ VCS - commit history (also available on Jenkins) and change area/scope ○ Jenkins - Historic build graph and timing ○ Salesforce - customer account importance ● Machine Learning algorithm family: Graph Neural Network to determine priority, Regression to determine build length ● Usage: A CI plugin to determine and assign the priority, then sort the queue
  • 16. AI-Ops for CI/CD | Smart Testing ● Goal: ○ Use Test Impact Analysis to run the minimal set of tests that are necessary ○ Fail fast ○ Eliminate Overlapping tests ● Data Sources: ○ Git/Build tool - build content and changes ○ Deep Coverage tools - per-test coverage ● Machine Learning algorithm family: ○ Classification ○ Statistical models ● Usage: ○ Find impacted tests by cross-referencing the changes and past test history ○ Deep integration with test runners so they run only those that are needed
  • 17. AI-Ops for CI/CD | Flaky test detection ● Goal: ○ Isolate and weed out flaky tests ● Data Sources: ○ VCS - commit history (also available on Jenkins) and change area/scope ○ Deep Coverage tools - per-test coverage ○ Jenkins / Test Runners - Test results and history ● Machine Learning algorithm family: ○ Regression ○ Statistical models ● Usage: ○ If a test flips between passing/failing status without a detected change to explain it, it may be flaky ○ Automatically quarantine tests, notify author
  • 18. AI-Ops for CI/CD | Infrastructure provisioning ● Goal: ○ Prevent resource contention in CI/CD ○ Minimize wait time for resource provisioning ● Data Sources: ○ Jenkins / * Pipelines - job history and graph ○ Infrastructure - historic demand & usage, real-time capacity ○ IM / Calendar - Engineers availability ● Machine Learning algorithm family: ○ Predictive analytics (Regression) ○ Statistical models ● Usage: ○ Update autoscaler targets continuously based on real-time and historic capacity
  • 19. AI-Ops for CI/CD | Smart Risk Management ● Goal: ○ Formally introduce the concept of Risk Management to the semi-automatic review process ○ Find common risks ■ Untested code and configuration changes ■ Anomalies ● Test time ● Code paths ● Network usage pattern ● Git: Big changes & unusual churn, New contributors, Self-merging PRs, Long-running PR ● Data Sources: ○ VCS ○ APMs, NPMDs tools
  • 20. AI-Ops for CI/CD | Smart Risk Management ● Machine Learning algorithm family: Anomaly Detection ● Usage: ○ Evaluate risk using Anomaly Detection, 3rd party tools (e.g. GitPrime) ○ Put smart quality gates in place ○ Require manual approval only when risks are too high ○ Determine APM thresholds and rollout configuration according to risk
  • 21. AI-Ops for CI/CD | Proactive Root Cause Analysis ● Goal: ○ Facilitate root cause analysis for production and test failures ● Data Sources: ○ VCS - commit history (also available on Jenkins) and change area/scope ○ ALMs - incidents, stack frames ○ Log collectors - capture messages, function names, stack frames ● Machine Learning algorithm family: ○ None! Good old text indexing ● Usage: ○ Cross reference the suspected code areas and logs with the commit history and escalate to the contributors
  • 22. AI-Ops Market | Market Direction and Forecast DevOps adoption is accelerating: “The proportion of our highest performers has almost tripled, now comprising 20% of all teams. This shows that excellence is possible - those that execute on key capabilities see the benefits.” AIOps adoption is increasing, platforms are the next big thing: “By 2020, approximately 50% of enterprises will actively use AIOps technologies ... up from 10% today” - Gartner “Over the next 5 years, wide-scope AIOps platforms will become the de facto form-factor for the delivery of AIOps functionality as opposed to AIOps functionality embedded in a monitoring tool like APM” - Gartner
  • 23. Thank You! Of course, we’re hiring! :-) Questions, anyone?
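The Smart Testing slide (16) describes finding impacted tests by cross-referencing a change with per-test coverage. A minimal sketch of that lookup follows. It is a plain set intersection, not the classification or statistical models the slide mentions, and the function name `impacted_tests`, the coverage map, and the file names are all hypothetical.

```python
from typing import Dict, Iterable, Set


def impacted_tests(changed_files: Iterable[str],
                   coverage: Dict[str, Set[str]]) -> Set[str]:
    """Return the tests whose recorded coverage touches a changed file."""
    changed = set(changed_files)
    return {test for test, files in coverage.items() if files & changed}


# Hypothetical per-test coverage, as a deep coverage tool might report it.
coverage = {
    "test_login": {"auth.py", "session.py"},
    "test_checkout": {"cart.py", "payment.py"},
    "test_profile": {"auth.py", "profile.py"},
}

# Hypothetical change set, as read from Git or the build tool.
selected = impacted_tests(["auth.py"], coverage)
print(sorted(selected))  # ['test_login', 'test_profile']
```

In practice a test with no coverage record yet (for example a newly added test) would probably be run anyway, since the map cannot show that it is unaffected.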

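The flaky-test slide (17) flags a test that flips between passing and failing without a change to explain it. One simple way to approximate "no change" is to look for tests that both passed and failed on the same commit. The sketch below assumes that simplification; the `Run` tuple layout and the name `flaky_candidates` are hypothetical, and it does not use the regression or statistical models named on the slide.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple

# (test name, commit SHA, passed?) as collected from a test runner's history.
Run = Tuple[str, str, bool]


def flaky_candidates(runs: Iterable[Run]) -> List[str]:
    """Return tests that both passed and failed on the same commit."""
    outcomes = defaultdict(set)
    for test, sha, passed in runs:
        outcomes[(test, sha)].add(passed)
    # Two distinct outcomes for one (test, commit) pair means the result
    # flipped while the code stayed the same.
    return sorted({test for (test, _sha), seen in outcomes.items()
                   if len(seen) == 2})


runs = [
    ("test_a", "c1", True),
    ("test_a", "c1", False),   # same commit, different result
    ("test_b", "c1", True),
    ("test_b", "c2", False),   # result changed, but so did the commit
]
print(flaky_candidates(runs))  # ['test_a']
```

A fuller version could use per-test coverage to decide whether a new commit actually touched code the test exercises, and feed the candidates into a quarantine step.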
Editor's Notes

  1. Architecture - Microservices, Serverless technologies, FaaS/Lambda. Deployment - Monoliths → Microservices; the value of K8s comes at a cost. Technologies - Teams often choose their own mix, so it is common to see many technologies (Java, Node, Python, .NET, etc.). Product - A product is now a service and needs an ecosystem of monitoring and support. Visibility - Multiple tools (APMs, network monitoring, logs). Load - More services, pipelines. This is the first slide. Need to polish this to be perfect. Hook. Tell them what you’re going to tell them. TODO: Insert quotes from research papers regarding the increasing requirements from CI/CD and DevOps. TODO: Insert stats from DORA 2019 (pg 5, 18) on speed, stability, and availability contributing to organizational performance, and from there to profitability, productivity and customer satisfaction.
  2. IT-generated data grows by 2-3x per annum
  3. AIOps is described as extending monitoring from Application, through Infrastructure, to Business and Integrations. It’s tough to keep up with the demands from business stakeholders. I&O teams have become too siloed by discipline. 1st gen - observation, 2nd gen - diagnosis. The problem of visibility is getting worse and will continue to do so.
  4. IT - monitoring too many Systems, Data, using Rules and Thresholds instead of using Anomaly Detection algorithms
  5. NR AI - noise reduction, improved correlations, augmented intelligence
  6. The central concept: The Change. Get it to production as soon as possible, with minimal human involvement and without sacrificing quality and security. Shifting left = “CI before the CI”, so we find and fix problems at the right time, before they affect others. Key principles of CI: build quality in (automation & fail fast); work in small batches; computers perform repetitive tasks, people solve problems; continuous improvement; everyone is responsible.
  7. Note: In a cloud-native, unconstrained datacenter, build queues are replaced with scheduling algorithms (e.g. kubernetes-native solutions like Jenkins X, Tekton, Gitlab)
  8. Fail fast - Shift left using Pull Requests. Run as many test stages as possible before they are merged.
  9. Can be further correlated with ALM and Log collectors to find the difference between the “passing” and “failing” profiles
  10. This also minimizes Infrastructure cost by scaling down during off-peak hours
  11. Every change poses a risk, but proper risk management is not a common practice. Change-advisory boards are manual and error-prone. Shoutout to GitPrime and their “20 patterns to watch for in your engineering team”. Add New Relic, Dynatrace, AppD logos.
  12. Risk management should be an optional final step before accepting the change
  13. The same data can flow backwards - give hints to developers when they are touching code that’s known to be sensitive / erroneous. This “risky code” can be shown during code review to ensure future modifications are well tested.
  14. Shoutout to GitPrime and their “20 patterns to watch for in your engineering team”
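The Smart Risk Management slides and the notes above suggest requiring manual approval only when a change looks anomalous. As a rough illustration, the sketch below scores one metric of a change (lines changed is used here as an assumed example) against its history with a z-score and gates on a threshold. The function names, the sample history, and the threshold of 3.0 are hypothetical; a real setup would likely combine several signals and a more robust anomaly detector.

```python
from statistics import mean, stdev
from typing import Sequence


def z_score(value: float, history: Sequence[float]) -> float:
    """How many sample standard deviations `value` lies from the historic mean."""
    if len(history) < 2:
        return 0.0  # not enough history to judge
    spread = stdev(history)
    if spread == 0:
        # All past values are identical: any deviation is maximally unusual.
        return 0.0 if value == mean(history) else float("inf")
    return (value - mean(history)) / spread


def needs_manual_approval(value: float, history: Sequence[float],
                          threshold: float = 3.0) -> bool:
    """Quality gate: ask for a human only when the change is an outlier."""
    return abs(z_score(value, history)) > threshold


# Hypothetical lines-changed counts for recent merged pull requests.
history = [100, 120, 90, 110, 105]
print(needs_manual_approval(900, history))  # True
print(needs_manual_approval(115, history))  # False
```

The same gate could be applied to other signals listed on the slide, such as test time or network usage, each with its own history.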