[DSC Europe 22] AIOPS – How can machine learning help in IT operations - Damir Kopljar

Making sure that one application works 24/7 is never easy, but making sure all services are running in a large enterprise is definitely a challenging task. On the other hand, today everybody is talking about big data, AI, and machine learning, but many companies still struggle to find a use case where machine learning can make a difference, or to build a production-ready ML system. AIOPS tries to combine those two fields, big data and machine learning, to automate IT operations processes. How can machine learning help in IT operations, what does it take to build a machine learning system that cooperates with developers, and what can we expect in the future… these are just some of the questions we will try to answer in this talk.

  1. Meet Filip. Company: Big Telco. Role: Director of Operations & E2E QA.
  2. The problem: Telco mobile app is not working. >300 IT systems, 150+ million logs daily. How to write a good rule?
  3. Today’s systems are getting harder and harder to monitor. (Metric panel repeated across the slide: Latency: 50ms, Requests: 200, 40x Errors: 2, 50x Errors: 1, Memory: 150mb, Disk: 20%, Spans: 20, Num of users: 200.)
  4. Problem definition. Idea: let a machine learning model learn what the normal state is. Train normal behavior based on historic data. Detect anomalies in real time.
  5. How to train the model? (Diagram labels: Avg latency; Raw logs; 30 sec features; Normal state for service X; Num of logs; Errors; Part of day; New data; Machine learning model.)
  6. Problem started. System doesn’t look good / System seems normal. Service 1, getDetails endpoint: 42 errors, high latency.
  7. Building AIOPS solution. (Diagram labels: App logs; APM logs; DB logs; ETL service; Feature store; AI service; Anomaly score; Alert; Action; Model repository; Dataset repository; Retrain jobs; Training environment; Data prep jobs; Inferencing environment.)
  8. Some problems require paradigm shift. Traditional programming: data and rules in, desired behavior out. Machine learning: data and desired behavior in, rules out.
  9. Reusable Building Blocks
  10. How can machine learning help you in your business context?
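
The idea on slides 4 to 6 (aggregate raw logs into 30-second features, learn the normal state from historic data, then score new data) can be sketched roughly as follows. This is only an illustration under stated assumptions: the `LogRecord` schema, the `ALERT_THRESHOLD` value, and all function names are hypothetical, and the talk does not say which model was used. A simple per-feature z-score baseline stands in for the machine learning model, and the "part of day" feature is left out for brevity.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean, pstdev

WINDOW_SECONDS = 30      # window length taken from the "30 sec features" slide
ALERT_THRESHOLD = 3.0    # assumed cut-off for raising an alert, not from the talk


@dataclass
class LogRecord:
    """Hypothetical log schema; the slides do not show the real one."""
    timestamp: float   # seconds
    latency_ms: float
    is_error: bool


def window_features(records):
    """Aggregate raw log records into one feature row per 30-second window."""
    buckets = defaultdict(list)
    for rec in records:
        buckets[int(rec.timestamp // WINDOW_SECONDS)].append(rec)
    return [
        {
            "num_logs": float(len(recs)),
            "errors": float(sum(rec.is_error for rec in recs)),
            "avg_latency": mean(rec.latency_ms for rec in recs),
        }
        for _, recs in sorted(buckets.items())
    ]


def fit_normal_state(rows):
    """Learn the mean and standard deviation of each feature from historic windows."""
    return {
        name: (mean(row[name] for row in rows), pstdev([row[name] for row in rows]))
        for name in rows[0]
    }


def anomaly_score(state, row):
    """Largest absolute z-score across features; higher means further from normal."""
    score = 0.0
    for name, (mu, sigma) in state.items():
        sigma = sigma or 1.0  # constant features have zero spread; avoid dividing by zero
        score = max(score, abs(row[name] - mu) / sigma)
    return score


# Synthetic "historic" traffic: one log per second, occasional errors.
historic = [LogRecord(float(t), 50.0 + (t % 5), t % 100 == 0) for t in range(600)]
state = fit_normal_state(window_features(historic))

# One quiet window and one window resembling the incident slide (42 errors, high latency).
quiet = [LogRecord(6000.0 + i, 52.0, False) for i in range(30)]
incident = [LogRecord(9000.0 + i * 0.5, 400.0, i < 42) for i in range(50)]

normal_score = anomaly_score(state, window_features(quiet)[0])
incident_score = anomaly_score(state, window_features(incident)[0])
print("quiet window alert:", normal_score > ALERT_THRESHOLD)
print("incident window alert:", incident_score > ALERT_THRESHOLD)
```

In a real setup the scoring step would presumably sit behind the feature store and AI service from slide 7, and the baseline would be replaced by whatever model the retrain jobs produce.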

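
The paradigm shift on slide 8 can be made concrete with a toy contrast: in traditional programming a person writes the rule, while in machine learning the rule is derived from data plus the desired behavior. The sketch below is an assumption-laden illustration, not the speaker's method; the threshold of 10, the labelled examples, and the function names are all made up for the example.

```python
def handwritten_rule(errors: int) -> bool:
    # Traditional programming: a person picks the threshold.
    # The value 10 is an arbitrary illustration.
    return errors > 10


def learn_threshold(examples):
    """Pick the error-count threshold that classifies the most labelled examples correctly.

    `examples` is a list of (error_count, is_incident) pairs, i.e. data plus
    the desired behavior.
    """
    candidates = sorted({errors for errors, _ in examples})
    best_threshold, best_correct = candidates[0], -1
    for threshold in candidates:
        correct = sum((errors > threshold) == label for errors, label in examples)
        if correct > best_correct:
            best_threshold, best_correct = threshold, correct
    return best_threshold


# Hypothetical labelled windows: (number of errors, was it an incident?)
examples = [(0, False), (1, False), (2, False), (3, False),
            (5, True), (8, True), (42, True)]

threshold = learn_threshold(examples)


def learned_rule(errors: int) -> bool:
    # Machine learning: the rule comes out of the data and labels.
    return errors > threshold


print("learned threshold:", threshold)
print("hand-written rule flags 5 errors:", handwritten_rule(5))
print("learned rule flags 5 errors:", learned_rule(5))
```

The hand-written rule misses the window with 5 errors because its threshold was guessed, whereas the learned rule adapts to whatever the labelled history says an incident looks like.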