Mauro Pezzè - Self-healing cloud systems at SIT Insights in Technology 2019

The first "Insights in Technology Conference" was held in Schaffhausen on December 16, 2019. The event was organized by the Schaffhausen Institute of Technology (SIT). The special guest was Nobel Prize winner Wolfgang Ketterle.

Schaffhausen Institute of Technology website: http://sit.org

  1. Self-healing cloud systems. Mauro Pezzè, Schaffhausen Institute of Technology. © 2019 Schaffhausen Institute of Technology. All rights reserved.
  2. Cloud in finance. The cloud is transforming the banking industry as banks adopt cloud solutions to help deliver against increasing customer expectations. "The cloud and emerging technologies such as AI and machine learning serve as both a catalyst and a reason to change for the financial industry." (Financial industry adopts cloud solutions, IBM Expert Advice, April 2019)
  3. Runtime failures
  4. Unavoidable
  5. Expensive: 10 hours average downtime per year, IWGCR; 1.25B$ to 2.5B$ total cost of unplanned application downtime per year, Fortune; 0.5M$ to 1M$ average cost of a critical app failure per hour, IDC
  6. Finance software is not bug free (1). Less than a week into 2016, HSBC became the first bank to suffer a major IT outage. Millions of the bank's customers were unable to access online accounts. Services only returned to normal after a two-day outage. The bank's chief operating officer Jack Hackett blamed a "complex technical issue" with its internal systems.
  7. Finance software is not bug free (2). In August 2015 a reported 275,000 individual payments failed to be processed by HSBC, which left many potentially without pay before the Bank Holiday weekend. The cause of this major failure was a problem with its electronic payment system for its business banking users, which affected salary payments.
  8. Finance software is not bug free (3). In April 2015, Bloomberg's London office suffered a software glitch resulting in their trading terminals going down for two hours. In a statement Bloomberg said: "Service has been fully restored. We experienced a combination of hardware and software failures in the network, which caused an excessive volume of network traffic."
  9. Finance software is not bug free (4). In June 2015 about 600,000 payments failed to enter the accounts of RBS overnight, including wages and benefit payments. Many took several days to come through. The bank's chief officer said a "technology fault meant we could not ingest a file from a third-party provider" … In 2012, 6.5 million RBS customers experienced an outage due to batch scheduling software, a glitch for which the bank was subsequently fined 56 million pounds.
  10. Self-healing (cloud) systems: preventing, tolerating, and removing failures by predicting failures, locating bugs, working around failures, and fixing bugs.
  11. State-of-the-art (cloud) healing solutions. Monitoring tools: kube-state-metrics, metrics-server, Envoy, Helm charts. Self-healing tools: liveness/readiness probes, health indicators, pod phase, probe, restart, … Tools: Monasca (monitoring), Aodh (alarming), Congress (policy-based governance), Mistral (workflow), Senlin (clustering service), Vitrage (root cause analysis), Watcher (optimisation), Masakari (compute healing advice), Freezer-dr (compute healing advice), Doctor (fault management), Fault Genes Working Group (fault classification and recovery strategy), Craton (fleet management). Features: monitoring, hardware/system recovery, pod recovery. Limitations: performance interference, no knowledge of system status, no knowledge of applications.
  12. STAR: moving on from limitations (performance interference, no knowledge of system status, no knowledge of applications) to features (limited performance interference, knowledge of application composition, holistic hierarchical system view).
  13. STAR [timeline diagram]: normal state, fault activation, error state, failure; prediction, failure alert, localisation, faulty component, healing.
  14. STAR [architecture diagram]: System, Sensor, Actuator, Failure Predictor, Fault Localizer, Healer, monitor.
  15. STAR (cloud) monitors: Linux Server, OpenStack, Clearwater; cross-layer partial monitoring with built-in facilities.
  16. STAR failure predictor 1.0: data analytics, machine learning.
  17. STAR failure predictor 1.0, data analytics [architecture diagram]: an Anomaly Detector maps monitored KPIs to anomalies, and an Anomaly Classifier maps anomalies to failure alerts with failure type and fault location. Example KPI: KPI2 = (Packets Received, R7).
  18. STAR failure predictor 1.0, deep autoencoder [network diagram]: inputs v1 to v10 are reconstructed as v̂1 to v̂10 through five hidden layers of 8, 6, 3, 6, and 8 units.
  19. STAR failure predictor 1.0 [architecture diagram]: spurious anomalous KPIs versus failure-prone anomalous KPIs.
  20. STAR failure predictor 1.0: precise failure prediction and fault localization, but (extensive) training with seeded faults.
  21. STAR failure predictor 2.0: machine learning.
  22. STAR failure predictor 2.0 [architecture diagram]: one-class Support Vector Machine with RBF kernel over the KPI vectors KPI1(t), KPI2(t), KPI3(t), KPI4(t), …, KPIn*m(t) sampled over time (09:00, 10:00, …, 16:00).
  23. STAR failure predictor 2.0: precise failure prediction with no seeded faults, but no fault localization and still (extensive) training.
  24. STAR failure predictor 3.0: energy-based models, deep learning.
  25. STAR failure predictor 3.0 [architecture diagram]: a Free Energy Calculator maps monitored KPIs to failure alerts.
  26. STAR failure predictor 3.0, baseline model with normal data: free energy Gtrain(t) of a model with layers v and h. Sample KPI data, with columns KPI1 = (M1, R1), …, KPIn*m = (Mn, Rm): at 5', 2500.00, …, 4645.33; at 10', 2500.00, …, 3833.20; at 15', 2500.00, …, 3981.20; at 20', …
  27. STAR failure predictor 3.0, predicting failures in error state: free energy Gfaulty(t) of the same model on faulty data. Sample KPI data, with columns KPI1 = (M1, R1), …, KPIn*m = (Mn, Rm): at 5', 2500.00, …, 4645.33; at 10', 2776.47, …, 3833.20; at 15', 2776.47, …, 3981.20; at 20', …
  28. STAR failure predictor 3.0 performance: precision 95.64%, recall 99.98%; training time ~24 seconds on a 16 GB RAM laptop with 3840 NVIDIA CUDA cores; input size: 350 KPIs.
  29. STAR failure predictor 3.0: precise failure prediction, no fault localization, negligible overhead, online incremental training.
  30. STAR fault localizer: KPI ranking.
  31. STAR fault localizer [architecture diagram]: a Graph Generator builds graphs from anomalies and a Graph Ranker ranks their nodes. Node: KPI = (M, R); edge: KPIi → KPIj, Granger causality with probability wij. Example nodes: (Retransmitted Packets, VM), (Retransmitted Packets, Server), (Db latency, Server), (Memory Usage, Server) / (# of Connections, Server) / (# of Processes, Server). Example timeline: 09:00, 10:00, failure alert at 15:40, with ranking (M1, R1), (M70, R5), (M15, R5), (M7, R5).
  32. STAR fault localizer [architecture diagram]: fault injection and fault localization.
  33. STAR fault localizer: precise localisation, no training, no overhead.
  34. STAR healer: NLP (and more) for learning automatic workarounds.
  35. STAR healer [architecture diagram]: an automatic workaround generator derives automatic workarounds from natural language annotations, using contextual NLP (word embedding and word mover's distance).
  36. STAR healer: NLP to automatically identify workarounds.
  37. The STAR approach, failure prediction: extensive experience with data analytics, machine learning, and deep learning; excellent results on large-scale industrial systems for packet loss/corruption/latency, CPU hogs, memory leaks, and excessive workload.
  38. The STAR approach, fault localizer: extensive experience with machine learning and KPI ranking; excellent results on large-scale industrial systems for packet loss/corruption/latency, CPU hogs, memory leaks, and excessive workload.
  39. The STAR approach, healer: experience with NLP (Natural Language Processing) to identify automatic workarounds; excellent results on small-scale systems.
  40. Plans. From classic cloud to highly dynamic cloud configurations (microservices and Kubernetes): Predictor 3.0 (deep learning), dynamically devolving system models. From functional and performance issues to cybersecurity breaches: empirical studies on cybersecurity breaches. From simple to pervasive automatic workarounds: NLP on pervasive contradicting annotations, image and video processing.
  41. SIT research today. Two research chairs in software engineering / verification / security: Bertrand Meyer, SIT Professor of Software Engineering and Provost; Mauro Pezzè, SIT Professor of Software Quality and Cybersecurity. Software Quality and Cybersecurity (SQC): program, environment, people; reliability and protection. Outputs: software, papers, PhD theses, patents, technology transfer.
  42. Come join us! University, research, techpark, ecosystem, R&D centers, startups.
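
The free-energy idea behind failure predictor 3.0 (slides 24 to 27) can be illustrated with a minimal sketch. The slides do not say which energy-based model is used, so this assumes a Bernoulli restricted Boltzmann machine with KPIs scaled to [0, 1]. The function names, the hand-picked parameters, and the fixed tolerance are all hypothetical and are not taken from STAR.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def free_energy(v, weights, visible_bias, hidden_bias):
    """Free energy of a Bernoulli RBM for one KPI vector v.

    F(v) = -sum_i a_i * v_i - sum_j softplus(b_j + sum_i W[i][j] * v_i)
    """
    visible_term = sum(a_i * v_i for a_i, v_i in zip(visible_bias, v))
    hidden_term = 0.0
    for j, b_j in enumerate(hidden_bias):
        activation = b_j + sum(weights[i][j] * v_i for i, v_i in enumerate(v))
        hidden_term += softplus(activation)
    return -visible_term - hidden_term


def deviates_from_baseline(v, model, baseline_energy, tolerance):
    """Flag a KPI vector whose free energy drifts away from the baseline."""
    weights, visible_bias, hidden_bias = model
    gap = abs(free_energy(v, weights, visible_bias, hidden_bias) - baseline_energy)
    return gap > tolerance


# Hypothetical parameters; a real model would be trained on normal data.
model = ([[2.0, 0.0], [0.0, 2.0]], [0.0, 0.0], [0.0, 0.0])
normal_kpis = [0.1, 0.1]   # assumed normalised KPI sample in the normal state
faulty_kpis = [0.9, 0.9]   # assumed normalised KPI sample in the error state

g_train = free_energy(normal_kpis, *model)  # plays the role of Gtrain(t)
print(deviates_from_baseline(normal_kpis, model, g_train, tolerance=0.5))
print(deviates_from_baseline(faulty_kpis, model, g_train, tolerance=0.5))
```

In this sketch an alert is raised when the free energy of the current KPI vector moves further than a tolerance from the value observed on normal data. How STAR actually compares Gtrain(t) and Gfaulty(t) is not stated in the slides.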

Editor's notes

  • Building a model of the normal behavior for each application from collections of pods running the
    same application, by relying on fast deep learning techniques (Deep Belief Networks, Deep
    Convolutional Neural Networks) trained in a semi-supervised fashion, without relying on faulty data
    for training.
  • Improving supervised learning techniques for performance deviation analysis, leveraging user-based
    SLA violations as labels for each application, e.g. distribution of response times below a
    certain threshold.
  • Analyzing distributions of response time at service level and exploiting hypothesis testing and
    regression techniques to predict behavior and detect deviations from the norm. Salacia will
    implement fast algorithms based on standard machine learning techniques for fast and robust
    non-linear regression.
  • Localising faults by analyzing the relation between the health status at application level and the
    application topology retrieved from Weave Scope as an adjacency list of containers, and issuing
    fault alerts that indicate the culprit application and/or pod, by exploiting the information on the
    application topology.
  • Activating self-healing procedures, which will leverage self-healing functionalities of Kubernetes to
    implement self-healing actions on the pods that Salacia localises as responsible for the faulty
    behavior at application level.
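
The localisation step in the notes above (health status combined with a topology given as an adjacency list of containers) can be sketched as follows. The notes do not describe Salacia's actual algorithm, so the heuristic is an assumption: a container is reported as a culprit when it is unhealthy and none of the containers it depends on is unhealthy. The function name, the dictionary layout, and the sample topology are hypothetical.

```python
def localise_culprits(topology, healthy):
    """Return unhealthy containers whose dependencies are all healthy.

    topology: dict mapping a container to the list of containers it calls
              (an adjacency list; assumed layout, not Weave Scope's format).
    healthy:  dict mapping a container to True/False; containers missing
              from the dict are assumed healthy.
    """
    # Collect every container, including ones that only appear as a dependency.
    nodes = set(topology)
    for dependencies in topology.values():
        nodes.update(dependencies)

    culprits = []
    for node in nodes:
        if healthy.get(node, True):
            continue
        dependencies = topology.get(node, [])
        # An unhealthy node with an unhealthy dependency is treated as a victim.
        if all(healthy.get(dep, True) for dep in dependencies):
            culprits.append(node)
    return sorted(culprits)


# Hypothetical application topology and health status.
topology = {
    "frontend": ["cart", "catalog"],
    "cart": ["db"],
    "catalog": ["db"],
    "db": [],
}
healthy = {"frontend": False, "cart": False, "catalog": True, "db": True}

print(localise_culprits(topology, healthy))
```

With this sample the frontend is unhealthy only because the cart is, so only the cart is reported. The heuristic reports nothing when the unhealthy containers form a dependency cycle, which a real localiser would need to handle.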
  • Building a model of the normal behavior for each application from collections of pods running the
    same application, by relying on fast deep learning techniques (Deep Belief Networks, Deep
    Convolutional Neural Networks) trained in a semi-supervised fashion, without relying on faulty data
    for training
     Improving supervised learning techniques for performance deviation analysis, leveraging userbased
    SLA violation as labels for each application , eg. distribution of response times below a
    certain threshold
     Analyzing distributions of response time at service level and exploit hypothesis testing and
    regression techniques to predict behavior and detect deviations from the norm. Salacia will
    implement fast algorithms based on standard machine learning techniques for fast and robust nonlinear
    regression.
     Localising faults by analyzing the relation between the health status at application level and the
    application topology retrieved from weave scope as an adjacency list of containers, and issuing
    11/30
    fault alerts that indicate the culprit application and/or pod, by exploiting the information on the
    application topology.
     Activating self-healing procedures, which will leverage self-healing functionalities of Kubernetes to
    implement self-healing actions on the pods that Salacia localises as responsible for the faulty
    behavior at application level.
  • Building a model of the normal behavior for each application from collections of pods running the
    same application, by relying on fast deep learning techniques (Deep Belief Networks, Deep
    Convolutional Neural Networks) trained in a semi-supervised fashion, without relying on faulty data
    for training
     Improving supervised learning techniques for performance deviation analysis, leveraging userbased
    SLA violation as labels for each application , eg. distribution of response times below a
    certain threshold
     Analyzing distributions of response time at service level and exploit hypothesis testing and
    regression techniques to predict behavior and detect deviations from the norm. Salacia will
    implement fast algorithms based on standard machine learning techniques for fast and robust nonlinear
    regression.
     Localising faults by analyzing the relation between the health status at application level and the
    application topology retrieved from weave scope as an adjacency list of containers, and issuing
    11/30
    fault alerts that indicate the culprit application and/or pod, by exploiting the information on the
    application topology.
     Activating self-healing procedures, which will leverage self-healing functionalities of Kubernetes to
    implement self-healing actions on the pods that Salacia localises as responsible for the faulty
    behavior at application level.
  • Building a model of the normal behavior for each application from collections of pods running the
    same application, by relying on fast deep learning techniques (Deep Belief Networks, Deep
    Convolutional Neural Networks) trained in a semi-supervised fashion, without relying on faulty data
    for training
     Improving supervised learning techniques for performance deviation analysis, leveraging userbased
    SLA violation as labels for each application , eg. distribution of response times below a
    certain threshold
     Analyzing distributions of response time at service level and exploit hypothesis testing and
    regression techniques to predict behavior and detect deviations from the norm. Salacia will
    implement fast algorithms based on standard machine learning techniques for fast and robust nonlinear
    regression.
     Localising faults by analyzing the relation between the health status at application level and the
    application topology retrieved from weave scope as an adjacency list of containers, and issuing
    11/30
    fault alerts that indicate the culprit application and/or pod, by exploiting the information on the
    application topology.
     Activating self-healing procedures, which will leverage self-healing functionalities of Kubernetes to
    implement self-healing actions on the pods that Salacia localises as responsible for the faulty
    behavior at application level.
  • Building a model of the normal behavior for each application from collections of pods running the
    same application, by relying on fast deep learning techniques (Deep Belief Networks, Deep
    Convolutional Neural Networks) trained in a semi-supervised fashion, without relying on faulty data
    for training
     Improving supervised learning techniques for performance deviation analysis, leveraging userbased
    SLA violation as labels for each application , eg. distribution of response times below a
    certain threshold
     Analyzing distributions of response time at service level and exploit hypothesis testing and
    regression techniques to predict behavior and detect deviations from the norm. Salacia will
    implement fast algorithms based on standard machine learning techniques for fast and robust nonlinear
    regression.
     Localising faults by analyzing the relation between the health status at application level and the
    application topology retrieved from weave scope as an adjacency list of containers, and issuing
    11/30
    fault alerts that indicate the culprit application and/or pod, by exploiting the information on the
    application topology.
     Activating self-healing procedures, which will leverage self-healing functionalities of Kubernetes to
    implement self-healing actions on the pods that Salacia localises as responsible for the faulty
    behavior at application level.
  • Building a model of the normal behavior for each application from collections of pods running the
    same application, by relying on fast deep learning techniques (Deep Belief Networks, Deep
    Convolutional Neural Networks) trained in a semi-supervised fashion, without relying on faulty data
    for training
     Improving supervised learning techniques for performance deviation analysis, leveraging userbased
    SLA violation as labels for each application , eg. distribution of response times below a
    certain threshold
     Analyzing distributions of response time at service level and exploit hypothesis testing and
    regression techniques to predict behavior and detect deviations from the norm. Salacia will
    implement fast algorithms based on standard machine learning techniques for fast and robust nonlinear
    regression.
     Localising faults by analyzing the relation between the health status at application level and the
    application topology retrieved from weave scope as an adjacency list of containers, and issuing
    11/30
    fault alerts that indicate the culprit application and/or pod, by exploiting the information on the
    application topology.
     Activating self-healing procedures, which will leverage self-healing functionalities of Kubernetes to
    implement self-healing actions on the pods that Salacia localises as responsible for the faulty
    behavior at application level.
  • Building a model of the normal behavior for each application from collections of pods running the
    same application, by relying on fast deep learning techniques (Deep Belief Networks, Deep
    Convolutional Neural Networks) trained in a semi-supervised fashion, without relying on faulty data
    for training
     Improving supervised learning techniques for performance deviation analysis, leveraging userbased
    SLA violation as labels for each application , eg. distribution of response times below a
    certain threshold
     Analyzing distributions of response time at service level and exploit hypothesis testing and
    regression techniques to predict behavior and detect deviations from the norm. Salacia will
    implement fast algorithms based on standard machine learning techniques for fast and robust nonlinear
    regression.
     Localising faults by analyzing the relation between the health status at application level and the
    application topology retrieved from weave scope as an adjacency list of containers, and issuing
    11/30
    fault alerts that indicate the culprit application and/or pod, by exploiting the information on the
    application topology.
     Activating self-healing procedures, which will leverage self-healing functionalities of Kubernetes to
    implement self-healing actions on the pods that Salacia localises as responsible for the faulty
    behavior at application level.
  • Building a model of the normal behavior for each application from collections of pods running the
    same application, by relying on fast deep learning techniques (Deep Belief Networks, Deep
    Convolutional Neural Networks) trained in a semi-supervised fashion, without relying on faulty data
    for training
     Improving supervised learning techniques for performance deviation analysis, leveraging userbased
    SLA violation as labels for each application , eg. distribution of response times below a
    certain threshold
     Analyzing distributions of response time at service level and exploit hypothesis testing and
    regression techniques to predict behavior and detect deviations from the norm. Salacia will
    implement fast algorithms based on standard machine learning techniques for fast and robust nonlinear
    regression.
     Localising faults by analyzing the relation between the health status at application level and the
    application topology retrieved from weave scope as an adjacency list of containers, and issuing
    11/30
    fault alerts that indicate the culprit application and/or pod, by exploiting the information on the
    application topology.
     Activating self-healing procedures, which will leverage self-healing functionalities of Kubernetes to
    implement self-healing actions on the pods that Salacia localises as responsible for the faulty
    behavior at application level.
  • Building a model of the normal behavior for each application from collections of pods running the
    same application, by relying on fast deep learning techniques (Deep Belief Networks, Deep
    Convolutional Neural Networks) trained in a semi-supervised fashion, without relying on faulty data
    for training
     Improving supervised learning techniques for performance deviation analysis, leveraging userbased
    SLA violation as labels for each application , eg. distribution of response times below a
    certain threshold
     Analyzing distributions of response time at service level and exploit hypothesis testing and
    regression techniques to predict behavior and detect deviations from the norm. Salacia will
    implement fast algorithms based on standard machine learning techniques for fast and robust nonlinear
    regression.
     Localising faults by analyzing the relation between the health status at application level and the
    application topology retrieved from weave scope as an adjacency list of containers, and issuing
    11/30
    fault alerts that indicate the culprit application and/or pod, by exploiting the information on the
    application topology.
     Activating self-healing procedures, which will leverage self-healing functionalities of Kubernetes to
    implement self-healing actions on the pods that Salacia localises as responsible for the faulty
    behavior at application level.
  • Building a model of the normal behavior for each application from collections of pods running the
    same application, by relying on fast deep learning techniques (Deep Belief Networks, Deep
    Convolutional Neural Networks) trained in a semi-supervised fashion, without relying on faulty data
    for training
     Improving supervised learning techniques for performance deviation analysis, leveraging userbased
    SLA violation as labels for each application , eg. distribution of response times below a
    certain threshold
     Analyzing distributions of response time at service level and exploit hypothesis testing and
    regression techniques to predict behavior and detect deviations from the norm. Salacia will
    implement fast algorithms based on standard machine learning techniques for fast and robust nonlinear
    regression.
     Localising faults by analyzing the relation between the health status at application level and the
    application topology retrieved from weave scope as an adjacency list of containers, and issuing
    11/30
    fault alerts that indicate the culprit application and/or pod, by exploiting the information on the
    application topology.
     Activating self-healing procedures, which will leverage self-healing functionalities of Kubernetes to
    implement self-healing actions on the pods that Salacia localises as responsible for the faulty
    behavior at application level.
  • Building a model of the normal behavior for each application from collections of pods running the
    same application, by relying on fast deep learning techniques (Deep Belief Networks, Deep
    Convolutional Neural Networks) trained in a semi-supervised fashion, without relying on faulty data
    for training
     Improving supervised learning techniques for performance deviation analysis, leveraging userbased
    SLA violation as labels for each application , eg. distribution of response times below a
    certain threshold
     Analyzing distributions of response time at service level and exploit hypothesis testing and
    regression techniques to predict behavior and detect deviations from the norm. Salacia will
    implement fast algorithms based on standard machine learning techniques for fast and robust nonlinear
    regression.
     Localising faults by analyzing the relation between the health status at application level and the
    application topology retrieved from weave scope as an adjacency list of containers, and issuing
    11/30
    fault alerts that indicate the culprit application and/or pod, by exploiting the information on the
    application topology.
     Activating self-healing procedures, which will leverage self-healing functionalities of Kubernetes to
    implement self-healing actions on the pods that Salacia localises as responsible for the faulty
    behavior at application level.
  • Building a model of the normal behavior for each application from collections of pods running the
    same application, by relying on fast deep learning techniques (Deep Belief Networks, Deep
    Convolutional Neural Networks) trained in a semi-supervised fashion, without relying on faulty data
    for training
     Improving supervised learning techniques for performance deviation analysis, leveraging userbased
    SLA violation as labels for each application , eg. distribution of response times below a
    certain threshold
     Analyzing distributions of response time at service level and exploit hypothesis testing and
    regression techniques to predict behavior and detect deviations from the norm. Salacia will
    implement fast algorithms based on standard machine learning techniques for fast and robust nonlinear
    regression.
     Localising faults by analyzing the relation between the health status at application level and the
    application topology retrieved from weave scope as an adjacency list of containers, and issuing
    11/30
    fault alerts that indicate the culprit application and/or pod, by exploiting the information on the
    application topology.
     Activating self-healing procedures, which will leverage self-healing functionalities of Kubernetes to
    implement self-healing actions on the pods that Salacia localises as responsible for the faulty
    behavior at application level.
