1 © Hortonworks Inc. 2011 – 2017 All Rights Reserved
Mool – Automated Root Cause Analysis using ML
Rohit Choudhary & Gaurav Nagar, Hortonworks
Introduction
 HDP
– Cumulative Big Data Package with 25+ Certified Open Source Apache Projects
– Source code arrives from both the community and internal engineering
 QE and Certification Process
– Every change goes through Git and Gerrit
– System tests are written for each component; hundreds of new tests are added every release
 Release Stability
– Determined by System Test failure and pass percentages
– Once new features and system tests are at 100%, we call the release done!
 Releases
– On-premise Releases
– Cloud Releases – HDI and HDC
Problem Statement
 Test Suite Size
– System tests are organized into 700 suites, also called splits
– Several thousand test cases, executed in every run
 Infrastructure
– YarnCloud infrastructure & OpenStack infrastructure
– 700 × 5-node+ HDP clusters – creation and teardown
– Test suites are run on each cluster and logs are collected
– Tests produce 1–1.5 TB of system logs across our stack every day
 Failure Assessments and Subsequent Process
– Component owners undertake the responsibilities of identifying failures
– Time-consuming and repetitive, without increasing system knowledge
– Restrictive (reduces our ability to release faster)
Mool – Automated Log Analysis
 Root Cause across components in one click
– Identify common failure causes across components
 Recommend Actions instead of assisted search with
– Systemic Knowledge/Repository of Errors and their associations
– Recency of occurrence
– Source modifications as data features
– Current and past reported issues in ticketing systems
 Integrate with downstream process lifecycle
– Test Analysis
– Ticketing system integration
Mool – Sanskrit for “root”
Past Industry Efforts – AALA @Siemens
 Automated Log Analysis Using Machine Learning, Weixi Li, Uppsala Universitet
“The biggest limitation in our project is that it is hard to find experts to analyze the logs manually. Although the clustering algorithms
do not need any feedback during learning, the evaluation of different models need to compare our prediction results with the true
answers. But it is not possible to find the experts to do manual log analysis in this project, therefore, the evaluation is based on the
test system verdicts.”
RCA Analysis Process
 1. Feature Extraction
– Log Message Feature Extraction
– Test Failure Feature Extraction
 2. Enrichment
– Enriched with Test Execution Time
– Origin Components
 3. Learning
– Error Categorization
– RCA Analysis
– Error Repository Upgrades
Algebra
[Diagram: test cases TC1–TC4 linked to errors E1–E4; dimensions labeled Run ID, Component, Suite]
Test Case – Error Correlation
TC1 = {E1, E2, E4}
TC2 = {E1, E3}
TC3 = {E3, E4}
TC4 = {E1, E4}
Error – Test Case Correlation (the converse mapping)
E1 = {TC1, TC2, TC4}
E4 = {TC1, TC3, TC4}
Where Components = {C1, C2, C3, C4}
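The converse mapping can be derived mechanically from the test case sets. The following is a minimal sketch, not the authors' implementation: the dictionary holds the slide's values, while the function name `invert` and the ranking step are assumptions added for illustration. Note that inverting the slide's data places TC3 in E4's set, because TC3 lists E4.

```python
from collections import defaultdict

# Test case -> errors observed during that test case (values from the slide).
test_case_errors = {
    "TC1": {"E1", "E2", "E4"},
    "TC2": {"E1", "E3"},
    "TC3": {"E3", "E4"},
    "TC4": {"E1", "E4"},
}


def invert(mapping):
    """Build the converse relation: error -> set of test cases it appears in."""
    inverse = defaultdict(set)
    for test_case, errors in mapping.items():
        for error in errors:
            inverse[error].add(test_case)
    return dict(inverse)


error_test_cases = invert(test_case_errors)

# Illustrative assumption: errors shared by the most test cases are ranked
# first as candidate common causes (ties broken by error name).
ranked = sorted(error_test_cases, key=lambda e: (-len(error_test_cases[e]), e))
```

Here `ranked` puts E1 and E4 ahead of E3 and E2, since each of them appears in three of the four test cases.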
Algebra - Explained
[Diagram: single cluster run. A timeline runs from T = 0 to T = f, with T = t marked between them.]
Suite       Errors          Test Cases
Suite i     E1, E2, E3…     TC1, TC2, TC3…
Suite j     E1, E2, E3…     TC1, TC2, TC3…
Suite k     E1, E2, E3…     TC1, TC2, TC3…
Suite l…    E1, E2, E3…     TC1, TC2, TC3…
Suite n     E1, E2, E3…     TC1, TC2, TC3…
Algebra - Explained
[Diagram: multi-cluster run. Suite i appears four times, each with its own timeline.]
Suite      From      At        To        Errors          Test Cases
Suite i    T = 0     T = t     T = f     E1, E2, E3…     TC1, TC2, TC3…
Suite i    T = t1    T = t2    T = f2    E1, E2, E3…     TCi1, TCi2, TCi3…
Suite i    T = t2    T = t3    T = f3    E1, E2, E3…     TC1, TC2, TC3…
Suite i    T = t3    T = t4    T = f4    E1, E2, E3…     TC1, TC2, TC3…
Error Paths and Feature Extraction
 Test Suites and their Stack Calls
– Hive Suite: Hive Server2, Yarn, ATS, HDFS
– Spark: Livy, Yarn, HDFS
– Oozie Workflow: Pig, Hive, Yarn, HDFS
 Errors
– E1, E2, E3
– E1, E2, E3, E4, E5, E6
– En….
Test Case Features = {name, suite_name, start_time, end_time, status}
Error Features = {stacktrace, message, occurrence_time, origin, category, file_name}
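The two feature sets map naturally onto record types. The sketch below uses the field names from the slide; the class names `CaseRecord` and `ErrorRecord`, the helper `errors_for`, the field types, and the sample values are all hypothetical. It assumes, as the enrichment step suggests, that an error is tied to a test case when its occurrence time falls inside the test's execution window.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class CaseRecord:
    # Field names follow the slide's "Test Case Features"; types are assumed.
    name: str
    suite_name: str
    start_time: datetime
    end_time: datetime
    status: str


@dataclass(frozen=True)
class ErrorRecord:
    # Field names follow the slide's "Error Features"; types are assumed.
    stacktrace: str
    message: str
    occurrence_time: datetime
    origin: str
    category: str
    file_name: str


def errors_for(test_case, errors):
    """Return errors whose occurrence_time lies inside the test's execution window."""
    return [
        e for e in errors
        if test_case.start_time <= e.occurrence_time <= test_case.end_time
    ]


# Hypothetical sample data for illustration only.
case = CaseRecord("test_insert", "Hive Suite",
                  datetime(2017, 1, 1, 10, 0), datetime(2017, 1, 1, 10, 5), "FAILED")
errors = [
    ErrorRecord("trace-a", "container killed", datetime(2017, 1, 1, 10, 2),
                "Yarn", "resource", "yarn-nodemanager.log"),
    ErrorRecord("trace-b", "lease expired", datetime(2017, 1, 1, 10, 7),
                "HDFS", "io", "hdfs-namenode.log"),
]
matched = errors_for(case, errors)
```

Only the Yarn error is matched here; the HDFS error occurred two minutes after the test ended.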
Salient Points: Failure Sample & Error Samples
Test Case Failures
System Interactions
[Diagram labels]
– Customer Reports
– Source Code
– Historical Error DB
– Ticket Systems
– Data Pipeline
– Metadata Store
– Ensemble Modeling & Learning
– Recommendations
– Automated Actions
Application Architecture
 Ingestion
– Log Accumulation/release branch
– Grok parsers for HDP/Ambari components
 Unsupervised Learning
– Identical Match (Stacktrace)
– Nearest Match (Levenshtein Adaptation)
– RCA/Associative Analysis
– Error Hierarchy Association
 Outcome
– Automated Ticket Processing
– Recommendation Based on Recency
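The identical-match and nearest-match stages can be sketched as a two-step lookup against a store of known stack traces. The slide does not describe the "Levenshtein Adaptation", so this example uses the plain Levenshtein edit distance; the function names, the SHA-1 fingerprint, and the `max_ratio` threshold of 0.2 are assumptions chosen for illustration.

```python
import hashlib


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def fingerprint(stacktrace: str) -> str:
    """Hash used for the identical-match stage (assumed; any stable hash works)."""
    return hashlib.sha1(stacktrace.encode("utf-8")).hexdigest()


def classify(stacktrace, known, max_ratio=0.2):
    """Return (error_id, 'identical' | 'nearest') or (None, 'new')."""
    fp = fingerprint(stacktrace)
    for error_id, known_trace in known.items():
        if fingerprint(known_trace) == fp:
            return error_id, "identical"
    best_id, best_ratio = None, None
    for error_id, known_trace in known.items():
        longest = max(len(stacktrace), len(known_trace), 1)
        ratio = levenshtein(stacktrace, known_trace) / longest
        if best_ratio is None or ratio < best_ratio:
            best_id, best_ratio = error_id, ratio
    if best_ratio is not None and best_ratio <= max_ratio:
        return best_id, "nearest"
    return None, "new"


# Hypothetical known errors for illustration only.
known = {
    "E1": "java.io.IOException: Connection refused to host nn1:8020",
    "E2": "java.lang.NullPointerException at Foo.bar",
}
```

Normalizing the distance by the longer string's length keeps the threshold comparable across short and long traces; a production system would likely also strip timestamps and host names before comparing.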
Deployment Architecture
 Data Source
– Test Clusters
– Log daemons push logs into HDFS and trigger analysis at end of run
 Data Processing
– Split Processor
– Livy (Job Server)
– Spark Jobs
– Storage: HDFS, MetaData Store
 Application
– Web Application
– Manual input for selection/rejection of outcome
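Triggering analysis at the end of a run could be done by submitting a Spark batch job to Livy over its REST API. The sketch below builds a `POST /batches` request with the standard `file` and `args` fields; the host name, HDFS path, argument names, and the helper `build_batch_request` are hypothetical, and the request is only constructed, not sent.

```python
import json
import urllib.request


def build_batch_request(livy_url, job_file, args):
    """Build (but do not send) a POST /batches request for a Livy server."""
    payload = {"file": job_file, "args": list(args)}
    return urllib.request.Request(
        url=livy_url.rstrip("/") + "/batches",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical endpoint, job file, and arguments for illustration only.
request = build_batch_request(
    "http://livy.example.com:8998",
    "hdfs:///apps/mool/rca_job.py",
    ["--run-id", "12345"],
)
# urllib.request.urlopen(request) would submit the job; it is left out so the
# sketch runs without a Livy server.
```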
RCA Versus Error Graph Creation
 Error graph creation failed
– The FP-Growth algorithm did not yield the desired results
– Too many closed loops, cyclic dependencies
– Time as a split dimension was not enough
 Moved towards RCAs
– Origin of the error chain was easier to find out
– Accuracy was higher
– Enough data supporting multiple code-flows
 Easier to validate through our system analysts
– Unsupervised Learning is hard to validate without manual intervention
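For context on what the abandoned approach was mining: FP-Growth finds itemsets that occur together at least a minimum number of times. The sketch below is not FP-Growth; it is a brute-force counter that yields the same frequent itemsets on small inputs, applied to the four test case error sets from the Algebra slide. The function name and the `min_support` and `max_size` parameters are assumptions for illustration.

```python
from collections import Counter
from itertools import combinations

# Each "transaction" is the set of errors seen in one test case
# (values from the Algebra slide).
transactions = [
    {"E1", "E2", "E4"},
    {"E1", "E3"},
    {"E3", "E4"},
    {"E1", "E4"},
]


def frequent_itemsets(transactions, min_support, max_size=2):
    """Count every itemset up to max_size and keep those meeting min_support."""
    counts = Counter()
    for transaction in transactions:
        for size in range(1, max_size + 1):
            for itemset in combinations(sorted(transaction), size):
                counts[itemset] += 1
    return {itemset: n for itemset, n in counts.items() if n >= min_support}


frequent = frequent_itemsets(transactions, min_support=2)
```

The only frequent pair here is (E1, E4). Co-occurrence of this kind carries no direction, so it cannot by itself say which error in the pair came first.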
RCA Rejections
 False Positives are very prevalent
– Dominating Exceptions because of frequent code path execution
– They are repetitive and need to be ignored, statistically based on decile values
 Priority versus Ignored versus Historical
– Historical RCAs, based on source code changes and recency, allow a final decision
– If corresponding tickets are open, then those issues take priority
 Common Exceptions or Common RCAs
– Prioritize the ones that are causing cross-component failures
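The rejection and prioritization rules above can be sketched as a two-step filter. The slide does not say which decile or which statistic is used, so this example assumes that errors whose occurrence count exceeds the ninth decile are treated as dominating and ignored, and that the remaining errors are ranked by how many components they affect. All names and sample counts are hypothetical.

```python
import statistics


def split_by_decile(error_counts, decile=9):
    """Separate dominating errors (count above the chosen decile) from the rest."""
    cut_points = statistics.quantiles(list(error_counts.values()), n=10)
    threshold = cut_points[decile - 1]
    ignored = {e for e, count in error_counts.items() if count > threshold}
    kept = {e for e in error_counts if e not in ignored}
    return kept, ignored


def prioritize(kept, error_components):
    """Rank remaining errors by the number of components they affect."""
    return sorted(kept, key=lambda e: (-len(error_components.get(e, ())), e))


# Hypothetical occurrence counts; E10 sits on a frequently executed code path.
error_counts = {
    "E1": 5, "E2": 3, "E3": 4, "E4": 6, "E5": 2,
    "E6": 7, "E7": 3, "E8": 5, "E9": 4, "E10": 900,
}
# Hypothetical error -> affected components mapping.
error_components = {
    "E1": {"Hive", "Yarn", "HDFS"},
    "E4": {"Yarn"},
    "E6": {"Hive", "Spark"},
}

kept, ignored = split_by_decile(error_counts)
ranking = prioritize(kept, error_components)
```

With this data only E10 is ignored, and E1 leads the ranking because it spans three components.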
RCA Graph
Quick Stats
Item                                          Data
Total Run IDs Analyzed                        14410
Total Splits Across Components                115 K
Raw Errors Parsed from Logs                   120 M
Unique Errors                                 45025
Total Test Case Failures                      170 K
Errors Related to Failed Test Cases           592 K
Unique Errors Related to Failed Test Cases    30570
Adoption Challenges
 Great for fast-changing code bases
– Individual component owners have reported up to 99% accuracy
– Multi-component use case scenarios need improvement
 Log collection required multiple iterations
– Order of logs being written and collected
– Central Log server issues
 Stable releases are harder to instrument
– Our internal team has been unable to use it
– Source code changes are minimal/recency parameters are harder to provide
 Unsupervised learning verification is harder
– Very hard to judge model performance effectively without manual intervention
Future Work
 Unsupervised learning validation using automated techniques
 Online processing using Spark Streaming
 Event-based error detection on live production clusters
 Correlation with other log events/customer use cases
Thank You
Rohit Choudhary & Gaurav Nagar

DevEX - reference for building teams, processes, and platforms
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 

Mool - Automated Log Analysis using Data Science and ML

  • 1. Mool – Automated Root Cause Analysis using ML
      Rohit Choudhary & Gaurav Nagar, Hortonworks
  • 2. Introduction
      ▪ HDP
        – Cumulative Big Data package with 25+ certified open source Apache projects
        – Source code arrives from both the community and internal engineering
      ▪ QE and Certification Process
        – Every change goes through Git and Gerrit
        – System tests are written for each component; 100s of new tests are added every release
      ▪ Release Stability
        – Determined by system test failure and pass percentages
        – Once new features and system tests are at 100%, we call the release done!
      ▪ Releases
        – On-premise releases
        – Cloud releases – HDI and HDC
  • 3. Problem Statement
      ▪ Test Suite Size
        – System tests are organized as Suites, also called Splits (700 of them)
        – Several 1000s of test cases, executed in every run
      ▪ Infrastructure
        – YarnCloud infrastructure and OpenStack infrastructure
        – 700 × 5-node+ HDP clusters – creation and teardown
        – Test suites are run on each cluster and logs are collected
        – Tests produce 1–1.5 TB of system logs across our stack every day
      ▪ Failure Assessments and Subsequent Process
        – Component owners take responsibility for identifying failures
        – Time-consuming and repetitive, without increasing system knowledge
        – Restrictive (reduces our ability to release faster)
  • 4. Mool – Automated Log Analysis
      ▪ Root cause across components in one click
        – Identify common failure causes across components
      ▪ Recommend actions instead of assisted search, using:
        – Systemic knowledge/repository of errors and their associations
        – Recency of occurrence
        – Source modifications as data features
        – Current and past reported issues in ticketing systems
      ▪ Integrate with downstream process lifecycle
        – Test analysis
        – Ticketing system integration
      Mool – Sanskrit, meaning "root"
  • 5. Past Industry Efforts – AALA @Siemens
      ▪ Automated Log Analysis Using Machine Learning, Weixi Li, Uppsala Universitet
        “The biggest limitation in our project is that it is hard to find experts to analyze the logs manually. Although the clustering algorithms do not need any feedback during learning, the evaluation of different models need to compare our prediction results with the true answers. But it is not possible to find the experts to do manual log analysis in this project, therefore, the evaluation is based on the test system verdicts.”
  • 6. RCA Analysis Process (diagram)
      ▪ 1. Feature Extraction: Log Message Feature Extraction, Test Failure Feature Extraction
      ▪ 2. Enrichment: enriched with test execution time and origin components
      ▪ 3. Learning: Error Categorization, RCA Analysis, Error Repository Upgrades
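The log message feature extraction step can be illustrated with a minimal sketch. The log layout below (a log4j-style `timestamp LEVEL origin: message` line), the regular expression, and the function name `extract_error` are all assumptions made for illustration; the actual parsers and log formats used by Mool are not shown in the slides.

```python
import re
from typing import Optional

# Assumed log4j-style layout: "<date> <time>,<ms> <LEVEL> <origin>: <message>".
# This pattern is hypothetical and not taken from the deck.
LOG_LINE = re.compile(
    r"^(?P<occurrence_time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+"
    r"(?P<level>[A-Z]+)\s+"
    r"(?P<origin>[\w.$]+):\s+"
    r"(?P<message>.*)$"
)


def extract_error(line: str) -> Optional[dict]:
    """Return the fields of an ERROR/FATAL line, or None for any other line."""
    match = LOG_LINE.match(line)
    if match is None or match["level"] not in {"ERROR", "FATAL"}:
        return None
    return match.groupdict()


# Hypothetical sample line.
sample = "2017-03-01 12:00:05,123 ERROR org.example.Server: Connection refused"
error = extract_error(sample)
```

Lines that do not match the pattern, or that are below ERROR level, are dropped here; a real pipeline would likely also stitch multi-line stack traces onto the preceding error line.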
  • 7. Algebra
      ▪ Diagram: nodes TC1–TC4 and E1–E4, labelled with Run ID, Component, Suite
      ▪ Test Case – Error Correlation
        – TC1 = {E1, E2, E4}
        – TC2 = {E1, E3}
        – TC3 = {E3, E4}
        – TC4 = {E1, E4}
      ▪ Error – Test Case Correlation (conversely)
        – E1 = {TC1, TC2, TC4}
        – E4 = {TC1, TC3, TC4}
      ▪ Where Components = {C1, C2, C3, C4}
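The converse mapping on this slide is a plain inversion of the test case to error sets. The sketch below uses the four sets listed above; the function name `invert` and the ranking step are illustrative assumptions, not something the deck specifies. Inverting the listed sets gives E1 = {TC1, TC2, TC4} and E4 = {TC1, TC3, TC4}.

```python
from collections import defaultdict

# Test case -> errors observed while it ran (values from the slide).
test_case_errors = {
    "TC1": {"E1", "E2", "E4"},
    "TC2": {"E1", "E3"},
    "TC3": {"E3", "E4"},
    "TC4": {"E1", "E4"},
}


def invert(tc_to_errors):
    """Build the converse mapping: error -> test cases it co-occurred with."""
    error_to_tcs = defaultdict(set)
    for test_case, errors in tc_to_errors.items():
        for err in errors:
            error_to_tcs[err].add(test_case)
    return dict(error_to_tcs)


error_test_cases = invert(test_case_errors)

# Assumed ranking: errors touching the most failed test cases come first,
# with ties broken by name so the order is deterministic.
ranked = sorted(error_test_cases, key=lambda e: (-len(error_test_cases[e]), e))
```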
  • 8. Algebra - Explained (diagram: single cluster run)
      ▪ Suites i … n run on one cluster along a time axis from T = 0 to T = f, observed at T = t
      ▪ Each suite carries its own errors (E1, E2, E3…) and test cases (TC1, TC2, TC3…)
  • 9. Algebra - Explained (diagram: multi-cluster run)
      ▪ Suite i appears four times, each with its own time axis: T = 0 to T = f, T = t1 to T = f2, T = t2 to T = f3, T = t3 to T = f4
      ▪ Each run carries its own errors (E1, E2, E3…) and test cases (TC1, TC2, TC3…)
  • 10. Error Paths and Feature Extraction
      ▪ Diagram labels: Test Suites (Hive Suite, Oozie Workflow); Stack Call (Hive Server2, Yarn, ATS, HDFS, Livy, Pig, Hive, Spark); Errors (E1, E2, E3 … En)
      ▪ Test Case Features = {name, suite_name, start_time, end_time, status}
      ▪ Error Features = {stacktrace, message, occurrence_time, origin, category, file_name}
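The two feature sets above map naturally onto two record types. The sketch below is an assumption about how they could be represented and joined: the class names, the field types, and the time-window rule in `errors_during` (an error belongs to a test case if it occurred between the test's start and end time) are illustrative, and all sample values are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass
class CaseFeatures:
    """Test case features as listed on the slide."""
    name: str
    suite_name: str
    start_time: datetime
    end_time: datetime
    status: str


@dataclass
class ErrorFeatures:
    """Error features as listed on the slide."""
    stacktrace: str
    message: str
    occurrence_time: datetime
    origin: str
    category: str
    file_name: str


def errors_during(case: CaseFeatures, errors: List[ErrorFeatures]) -> List[ErrorFeatures]:
    """Assumed enrichment rule: keep errors logged inside the test's time window."""
    return [e for e in errors if case.start_time <= e.occurrence_time <= case.end_time]


# Hypothetical sample data.
case = CaseFeatures("test_insert", "hive", datetime(2017, 3, 1, 12, 0),
                    datetime(2017, 3, 1, 12, 5), "FAILED")
errors = [
    ErrorFeatures("", "Connection refused", datetime(2017, 3, 1, 12, 1),
                  "HiveServer2", "network", "hiveserver2.log"),
    ErrorFeatures("", "Disk full", datetime(2017, 3, 1, 13, 0),
                  "HDFS", "storage", "datanode.log"),
]
matched = errors_during(case, errors)
```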
  • 11. Salient Points: Failure Sample & Error Samples
      ▪ Screenshot: Test Case Failures
  • 12. System Interactions (diagram)
      ▪ Diagram labels: Ensemble Modeling & Learning, Customer Reports, Data Pipeline, Source Code, Historical Error DB, Ticket Systems, Recommendations, Automated Actions, Metadata Store
  • 13. Application Architecture (diagram)
      ▪ Stages: Ingestion, Unsupervised Learning, Outcome
      ▪ Diagram labels:
        – Log accumulation/release branch
        – Grok parsers for HDP/Ambari components
        – Identical Match (Stacktrace)
        – Nearest Match (Levenshtein Adaptation)
        – RCA/Associative Analysis
        – Error Hierarchy Association
        – Automated Ticket Processing
        – Recommendation Based on Recency
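The identical-match and nearest-match steps can be sketched as a two-stage lookup against known stack traces. The slide only names a "Levenshtein Adaptation"; the version below uses the standard Levenshtein edit distance with a length-normalized cutoff, which is an assumption. The function names and the `max_ratio` threshold are hypothetical.

```python
from typing import List, Optional, Tuple


def levenshtein(a: str, b: str) -> int:
    """Standard edit distance, computed row by row to keep memory small."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def match_error(stacktrace: str, known: List[str],
                max_ratio: float = 0.2) -> Tuple[Optional[str], str]:
    """Try an identical match first, then the nearest known trace within max_ratio."""
    if stacktrace in known:
        return stacktrace, "identical"
    if not known:
        return None, "new"
    best = min(known, key=lambda k: levenshtein(stacktrace, k))
    ratio = levenshtein(stacktrace, best) / max(len(stacktrace), len(best))
    if ratio <= max_ratio:
        return best, "nearest"
    return None, "new"


# Hypothetical repository of known errors.
known_errors = [
    "java.io.IOException: Connection refused at host-1",
    "java.lang.NullPointerException at Foo.bar",
]
result = match_error("java.io.IOException: Connection refused at host-2", known_errors)
```

Normalizing by length keeps long stack traces that differ only in a host name or line number together, while short unrelated messages stay apart. Comparing every new trace against every known one is quadratic, so a production system would likely bucket or hash traces first.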
  • 14. Deployment Architecture (diagram)
      ▪ Layers: Data Source, Data Processing, Application
      ▪ Diagram labels: Test Clusters, Log Daemon, Split Processor, Storage, HDFS, Livy (Job Server), Spark Jobs, MetaData Store, Web Application
      ▪ Log daemons push logs into HDFS
      ▪ Analysis is triggered at the end of a run
      ▪ Manual input for selection/rejection of the outcome
  • 15. RCA Versus Error Graph Creation
      ▪ Error graph creation failed
        – FP-Growth algorithm did not yield the desired results
        – Too many closed loops, cyclic dependencies
        – Time as a split dimension was not enough
      ▪ Moved towards RCAs
        – Origin of the error chain was easier to find
        – Accuracy was higher
        – Enough data supporting multiple code flows
      ▪ Easier to validate through our system analysts
        – Unsupervised learning is hard to validate without manual intervention
  • 16. RCA Rejections
      ▪ False positives are very prevalent
        – Dominating exceptions arise from frequently executed code paths
        – They are repetitive and need to be ignored statistically, based on decile values
      ▪ Priority versus Ignored versus Historical
        – Historical RCAs, based on source code changes and recency, allow the final decision
        – If corresponding tickets are open, those issues take priority
      ▪ Common exceptions or common RCAs
        – Prioritize the ones that are causing cross-component failures
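One way to read "ignored statistically, based on decile values" is to drop errors whose occurrence count falls above the ninth decile of all counts. That reading, the cutoff, the function name `split_by_decile`, and the sample counts are all assumptions made for illustration; the slide does not say which decile or which statistic Mool uses.

```python
from statistics import quantiles
from typing import Dict, Set, Tuple


def split_by_decile(counts: Dict[str, int]) -> Tuple[Set[str], Set[str]]:
    """Split errors into (kept, ignored) using the ninth decile of their counts."""
    # quantiles(..., n=10) returns the nine cut points between the ten deciles.
    ninth_decile = quantiles(counts.values(), n=10)[-1]
    ignored = {err for err, n in counts.items() if n > ninth_decile}
    kept = set(counts) - ignored
    return kept, ignored


# Hypothetical occurrence counts: E1 dominates because it sits on a hot code path.
occurrences = {
    "E1": 1000, "E2": 3, "E3": 5, "E4": 2, "E5": 4,
    "E6": 6, "E7": 1, "E8": 7, "E9": 8, "E10": 9,
}
kept, ignored = split_by_decile(occurrences)
```

With these counts only the dominating exception E1 is filtered out, leaving the rarer errors as RCA candidates. `statistics.quantiles` needs at least two data points, so a very small run would need a fallback.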
  • 17. RCA Graph
  • 18. Quick Stats
      Item                                          Data
      Total Run IDs analyzed                        14410
      Total Splits across components                115 K
      Raw errors parsed from logs                   120 M
      Unique errors                                 45025
      Total test case failures                      170 K
      Errors related to failed test cases           592 K
      Unique errors related to failed test cases    30570
  • 19. Adoption Challenges
      ▪ Great for fast-changing code bases
        – Individual component owners have reported up to 99% accuracy
        – Multi-component use case scenarios need improvement
      ▪ Log collection required multiple iterations
        – Order of logs being written and collected
        – Central log server issues
      ▪ Stable releases are harder to instrument
        – Our internal team has been unable to use it
        – Source code changes are minimal, so recency parameters are harder to provide
      ▪ Unsupervised learning verification is harder
        – Very hard to judge the performance of models effectively without manual intervention
  • 20. Future Work
      ▪ Unsupervised learning validation using automated techniques
      ▪ Online processing using Spark Streaming
      ▪ Event-based error detection on live production clusters
      ▪ Correlation with other log events/customer use cases
  • 21. Thank You
      Rohit Choudhary & Gaurav Nagar

Editor's notes

  1. TALK TRACK Mool is the application th [NEXT SLIDE]