Mauro Pezzè - Self-healing cloud systems at SIT Insights in Technology 2019

© 2019 All rights reserved. Schaffhausen Institute of Technology
Mauro Pezzè, Schaffhausen Institute of Technology
Self-healing cloud systems

Cloud in finance
The cloud is transforming the banking industry as banks adopt cloud solutions to help deliver against increasing customer expectations.
“The cloud and emerging technologies such as AI and machine learning serve as both a catalyst and a reason to change for the financial industry.”
Financial industry adopts cloud solutions, IBM Expert Advice, April 2019

runtime failures

Unavoidable

Expensive
10 hours: average downtime per year (IWGCR)
$1.25B to $2.5B: total cost of unplanned application downtime per year (Fortune)
$0.5M to $1M: average cost of a critical app failure per hour (IDC)

Finance software is not bug free 1..
Less than a week into 2016, HSBC became the first bank
to suffer a major IT outage. Millions of the bank’s
customers were unable to access online accounts.
Services only returned to normal after a two-day outage.
The bank’s chief operating officer Jack Hackett blamed a
‘complex technical issue’ with its internal systems.

Finance software is not bug free ..2..
In August 2015 a reported 275,000 individual payments
failed to be processed by HSBC, which left many
potentially without pay before the Bank Holiday weekend.
The cause of this major failure was a problem with its
electronic payment system for its business banking users
which affected salary payments.

In April 2015, Bloomberg’s London office suffered a software
glitch resulting in their trading terminals going down for two
hours.
In a statement Bloomberg said: “Service has been fully
restored. We experienced a combination of hardware and
software failures in the network, which caused an
excessive volume of network traffic.”

In June 2015 about 600,000 payments failed to enter the
accounts of RBS overnight, including wages and benefit
payments. Many took several days to come through. The
bank’s chief officer said a ‘technology fault meant we could
not ingest a file from a third-party provider’…
In 2012, 6.5 million RBS customers experienced an outage
due to batch scheduling software, a glitch for which the
bank was subsequently fined 56 million pounds.

Self-healing (cloud) systems
Preventing, tolerating, and removing failures by
• Predicting failures
• Locating bugs
• Working around failures
• Fixing bugs

State-of-the-art (Cloud)
healing solutions
Monitoring tools:
• kube-state-metrics
• metrics-server
• Envoy
• Helm charts
Self-healing tools:
• Liveness/Readiness probes
• Health indicators
• Pod phase, probe, restart
• …
Limitations
performance interference
no knowledge of system status
no knowledge of applications
Tools:
• Monasca: monitoring
• Aodh: alarming
• Congress: policy-based governance
• Mistral: workflow
• Senlin: clustering service
• Vitrage: root cause analysis
• Watcher: optimisation
• Masakari: compute healing advice
• Freezer-dr: compute healing advice
• Doctor: fault management
• Fault Genes Working Group: fault classification and recovery strategy
• Craton: fleet management
Features
monitoring
hardware/system recovery
Pod recovery

STAR: moving on
From (limitations)
• Performance interference
• No knowledge of system status
• No knowledge of applications
To (features)
• Limited performance interference
• Knowledge of application composition
• Holistic hierarchical system view

[Figure: timeline from normal state to error state, starting at fault activation. Failure Prediction emits a failure alert, Failure Localisation identifies the faulty component, and Healing follows.]

[Figure: STAR architecture. A monitor connects the sensors of the system to the Failure Predictor, Fault Localizer, and Healer, which act on the system through an actuator.]
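The loop implied by this architecture can be sketched as follows. This is only an illustration under stated assumptions, not the STAR implementation: the function names, the threshold rule, and the "blame the largest KPI" rule are all hypothetical stand-ins for the Failure Predictor, Fault Localizer, and Healer named on the slide.

```python
from typing import Callable, Dict, Optional

# All names below are hypothetical; the slides only name the components.
KPIs = Dict[str, float]  # e.g. {"memory": 0.93}


def self_healing_step(
    sense: Callable[[], KPIs],
    predict: Callable[[KPIs], bool],
    localize: Callable[[KPIs], str],
    heal: Callable[[str], None],
) -> Optional[str]:
    """Run one monitor -> predict -> localize -> heal cycle.

    Returns the component that was healed, or None if no failure was predicted.
    """
    kpis = sense()             # sensor: collect the monitored KPIs
    if not predict(kpis):      # failure predictor: raise an alert or not
        return None
    faulty = localize(kpis)    # fault localizer: pick the suspected component
    heal(faulty)               # healer, acting through the actuator
    return faulty


# Toy stand-ins: alert when any KPI exceeds 0.9, blame the largest KPI.
healed = []
result = self_healing_step(
    sense=lambda: {"cpu": 0.40, "memory": 0.95},
    predict=lambda k: max(k.values()) > 0.9,
    localize=lambda k: max(k, key=k.get),
    heal=healed.append,
)
```

Passing the four stages as callables keeps the loop independent of how each stage is realised, which matches the way the later slides swap one predictor version for another.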

(Cloud) Monitors
Cross-layer partial monitoring with built-in facilities
• Linux Server
• OpenStack
• Clearwater

Failure predictor 1.0
Data analytics
Machine learning

Failure predictor 1.0: data analytics
[Figure: monitored KPIs, for example KPI2 = (Packets Received, R7), enter an Anomaly Detector; its anomalies enter an Anomaly Classifier, which outputs failure alerts with failure type and fault location.]

Failure predictor 1.0: deep autoencoder
[Figure: the predictor pipeline (monitored KPIs, Anomaly Detector, anomalies, Anomaly Classifier, failure alerts with failure type and fault location), illustrated with a deep autoencoder: inputs v1 to v10, five hidden layers h(1) to h(5) of 8, 6, 3, 6, and 8 units, and reconstructed outputs v̂1 to v̂10.]

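Failure predictor 1.0: anomaly classification
[Figure: the predictor pipeline (monitored KPIs, Anomaly Detector, anomalies, Anomaly Classifier, failure alerts with failure type and fault location), with anomalies marked as either spurious anomalous KPIs or failure-prone anomalous KPIs.]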

Precise failure prediction and fault localization,
but (extensive) training with seeded faults

Machine learning

Failure predictor 2.0: one-class Support Vector Machine with RBF kernel
[Figure: the predictor pipeline (monitored KPIs, Anomaly Detector, anomalies, Anomaly Classifier, failure alerts with failure type and fault location), fed with the vectors KPI1(t), KPI2(t), KPI3(t), KPI4(t), …, KPIn*m(t) sampled over time (09:00, 10:00, …, 16:00).]
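For reference, the textbook form of a one-class SVM with an RBF kernel is given below. The slides do not state the parameter values used, so the kernel width γ, the coefficients α_i, and the offset ρ are left symbolic. Here the x_i are training vectors of KPI values collected under normal behaviour, and a new vector x is treated as anomalous when f(x) = -1.

```latex
k(x, y) = \exp\left(-\gamma \lVert x - y \rVert^2\right), \qquad
f(x) = \operatorname{sgn}\left(\sum_{i=1}^{N} \alpha_i \, k(x_i, x) - \rho\right)
```

Because the model is fitted on normal data only, this formulation is consistent with the slide's point that no seeded faults are needed for training.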

Precise failure prediction
NO fault localization
NO seeded faults, but still (extensive) training

Energy-based models
Deep Learning

Failure predictor 3.0
[Figure: monitored KPIs enter a Free Energy Calculator, which outputs failure alerts.]

Failure predictor 3.0: baseline model with normal data
Free energy Gtrain(t)
[Figure: model with layers v and h, fed with the KPI time series below.]
Time   KPI1 (M1, R1)   …   KPIn*m (Mn, Rm)
5’     2500.00         …   4645.33
10’    2500.00         …   3833.20
15’    2500.00         …   3981.20
20’    …               …   …

Failure predictor 3.0: predicting failures in error state
Free energy Gfaulty(t)
[Figure: model with layers v and h, fed with the faulty data below.]
Time   KPI1 (M1, R1)   …   KPIn*m (Mn, Rm)
5’     2500.00         …   4645.33
10’    2776.47         …   3833.20
15’    2776.47         …   3981.20
20’    …               …   …
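The comparison between a baseline free energy and the free energy of new data can be sketched as below. The slides name only "free energy" and layers v and h, so this sketch assumes a Gaussian-Bernoulli restricted Boltzmann machine with unit variance, which may differ from the model actually used. All parameters are hypothetical, and the KPIs are assumed to be normalised beforehand.

```python
import math
from typing import List, Sequence


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def free_energy(
    v: Sequence[float],
    a: Sequence[float],
    b: Sequence[float],
    w: Sequence[Sequence[float]],
) -> float:
    """Free energy of a Gaussian-Bernoulli RBM with unit variance (an assumption).

    F(v) = sum_i (v_i - a_i)^2 / 2 - sum_j softplus(b_j + sum_i v_i * w[i][j])
    """
    quadratic = sum((vi - ai) ** 2 for vi, ai in zip(v, a)) / 2.0
    hidden = sum(
        softplus(bj + sum(vi * w[i][j] for i, vi in enumerate(v)))
        for j, bj in enumerate(b)
    )
    return quadratic - hidden


def alerts(samples, a, b, w, threshold) -> List[bool]:
    """Flag samples whose free energy exceeds a threshold taken from normal data."""
    return [free_energy(v, a, b, w) > threshold for v in samples]


# Hypothetical parameters: 2 normalised KPIs, 1 hidden unit.
a, b, w = [0.0, 0.0], [0.0], [[0.5], [0.5]]
normal = [[0.0, 0.1], [0.1, 0.0]]
faulty = [[3.0, -3.0]]
# Baseline: the highest free energy seen on normal data.
threshold = max(free_energy(v, a, b, w) for v in normal)
```

Taking the threshold as the maximum over normal samples is one simple choice; a percentile of the baseline distribution would tolerate outliers in the training data better.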

Failure predictor 3.0
Precision: 95.64%
Recall: 99.98%
Performance
• Training time: ~24 seconds
• Hardware: laptop with 16 GB RAM and 3840 NVIDIA CUDA cores
• Input size: 350 KPIs
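As a reminder of how the two reported metrics are computed, here is a minimal sketch. The counts are hypothetical and chosen only to make the arithmetic easy to follow; they are not the data behind the figures above.

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of raised failure alerts that corresponded to a real failure."""
    return tp / (tp + fp)


def recall(tp: int, fn: int) -> float:
    """Fraction of real failures for which an alert was raised."""
    return tp / (tp + fn)


# Hypothetical counts: true positives, false positives, false negatives.
tp, fp, fn = 90, 10, 30
p = precision(tp, fp)  # 90 / 100
r = recall(tp, fn)     # 90 / 120
```

Read this way, a precision of 95.64% means that roughly 4 in 100 alerts are false alarms, and a recall of 99.98% means that almost no failure goes unpredicted.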

Precise failure prediction
NO fault localization
Negligible overhead
Online incremental training

Fault Localizer
KPI ranking

Fault Localizer
[Figure: cloud sensor and actuator, Anomaly Detector, Anomaly Classifier, and failure alerts; the anomalies go to a Graph Generator, whose graphs go to a Graph Ranker.]
Graph model
• node: KPI = (M, R)
• edge: KPIi → KPIj, Granger causality with probability wij
Example nodes: (Retransmitted Packets, VM), (Retransmitted Packets, Server), (Db latency, Server), (Memory Usage, Server), (# of Connections, Server), (# of Processes, Server)
Timeline: 09:00, 10:00, 15:40 (failure alert)
Ranking: (M1, R1), (M70, R5), (M15, R5), (M7, R5)
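A ranking over such a causality graph can be sketched as below. The slides do not say which ranking algorithm the Graph Ranker uses, so this sketch assumes weighted PageRank on the reversed graph; that is a substitute chosen for illustration, and the edge weights are hypothetical.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str, float]  # (cause KPI, effect KPI, weight w_ij)


def rank_kpis(edges: List[Edge], damping: float = 0.85, iters: int = 100) -> List[str]:
    """Rank KPIs with weighted PageRank on the reversed causality graph.

    Score flows from each effect back to its causes, so KPIs that directly or
    indirectly Granger-cause many other anomalous KPIs end up at the top.
    """
    nodes = sorted({n for src, dst, _ in edges for n in (src, dst)})
    # Reverse every edge: effect -> cause.
    out: Dict[str, List[Tuple[str, float]]] = {n: [] for n in nodes}
    for cause, effect, weight in edges:
        out[effect].append((cause, weight))

    score = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        nxt = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for n in nodes:
            total = sum(w for _, w in out[n])
            if total == 0.0:
                # Dangling node (a pure cause): spread its score uniformly.
                for m in nodes:
                    nxt[m] += damping * score[n] / len(nodes)
            else:
                for m, w in out[n]:
                    nxt[m] += damping * score[n] * w / total
        score = nxt
    return sorted(nodes, key=lambda n: score[n], reverse=True)


# Hypothetical weights on KPI names taken from the slide.
edges = [
    ("(Memory Usage, Server)", "(Db latency, Server)", 0.9),
    ("(Db latency, Server)", "(Retransmitted Packets, Server)", 0.8),
    ("(Db latency, Server)", "(Retransmitted Packets, VM)", 0.6),
]
ranking = rank_kpis(edges)
```

In this toy graph the memory KPI ranks first because every other anomalous KPI is reachable from it along causal edges, which is the intuition behind pointing at the top-ranked node as the likely fault location.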

Fault Localizer
[Figure: the localizer pipeline (cloud sensor and actuator, Anomaly Detector, Anomaly Classifier, failure alerts, Graph Generator, Graph Ranker), producing the fault localization and shown together with fault injection.]

Fault Localizer
Precise localisation
No training
No overhead

Healer
NLP (and more)
for learning automatic
workarounds

Healer
[Figure: the pipeline (Anomaly Detector, Anomaly Classifier, failure alerts, Graph Generator, Graph Ranker) extended with an Automatic workaround generator, which takes natural language annotations and produces automatic workarounds.]
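Contextual NLP: word embedding and word mover distance
[Figure: word mover distance between the phrases “a threat is found” (FROM) and “search from danger” (TO), using the embeddings of the words danger, threat, search, and found.]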
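The idea of comparing phrases through word embeddings can be sketched as below. This sketch uses the relaxed word mover distance, a cheaper lower bound, rather than the exact optimal-transport distance, and the 2-D embeddings are made up for illustration; real embeddings would come from a trained model, with stop words such as "a", "is", and "from" removed first.

```python
import math
from typing import Dict, List, Sequence

# Hypothetical 2-D embeddings, made up for illustration only.
EMBEDDING: Dict[str, Sequence[float]] = {
    "threat": (1.0, 0.0),
    "danger": (0.9, 0.1),
    "found": (0.0, 1.0),
    "search": (0.1, 0.9),
    "table": (-1.0, -1.0),
}


def relaxed_wmd(doc_a: List[str], doc_b: List[str]) -> float:
    """Relaxed word mover distance, a lower bound on the exact distance.

    Each word of doc_a moves all of its (uniform) weight to the closest word
    of doc_b; the exact distance would also constrain how much weight each
    word of doc_b may receive.
    """
    def nearest(word: str) -> float:
        return min(math.dist(EMBEDDING[word], EMBEDDING[other]) for other in doc_b)

    return sum(nearest(word) for word in doc_a) / len(doc_a)


close = relaxed_wmd(["threat", "found"], ["danger", "search"])
far = relaxed_wmd(["threat", "found"], ["table"])
```

With these toy vectors the two phrases from the slide come out much closer to each other than to an unrelated word, which is the property a healer would rely on to match annotations that describe the same behaviour in different words.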

Healer
NLP to automatically identify
workarounds

The STAR approach: FAILURE PREDICTION
Extensive experience with
• data analytics
• machine learning
• deep learning
Excellent results on large scale industrial systems for
• packet loss/corruption/latency
• CPU hogs
• memory leaks
• excessive workload

The STAR approach: FAULT LOCALIZER
Extensive experience with
• machine learning
• KPI ranking
Excellent results on large scale industrial systems for
• packet loss/corruption/latency
• CPU hogs
• memory leaks
• excessive workload

The STAR approach: HEALER
Experience with
• NLP (Natural Language Processing) to identify automatic workarounds
Excellent results on small scale systems

Plans
From classic cloud to highly dynamic cloud configurations
(Microservices and Kubernetes)
• Predictor 3.0: deep learning
• Dynamically evolving system models
From functional and performance issues to cybersecurity breaches
• Empirical studies on cybersecurity breaches
From simple to pervasive automatic workarounds
• NLP on pervasive contradicting annotations
• Image and video processing

SIT research today
Two research chairs in software engineering / verification / security
• Bertrand Meyer, SIT Professor of Software Engineering and Provost
• Mauro Pezzè, SIT Professor of Software Quality and Cybersecurity
Software Quality and Cybersecurity (SQC)
Program
Environment
People
Reliability & Protection
Outputs
• Software
• Papers
• PhD theses
• Patents
• Technology transfer

Come join us!
UNIVERSITY • RESEARCH • TECHPARK • ECOSYSTEM • R&D CENTERS • STARTUPS
