SlideShare ist ein Scribd-Unternehmen logo
1 von 29
Downloaden Sie, um offline zu lesen
9th International Workshop on Software Engineering for Resilient Systems
September 4-5, 2017, Geneva, Switzerland
Prof. Dr. Jorge Cardoso
University of Coimbra
Portugal
Cloud Reliability
Decreasing outage frequency using fault injection
Chief Architect for Cloud Operations and Analytics
Huawei Research, Munich
Germany
2
l Title: Cloud Reliability: Decreasing outage frequency using fault injection
l Abstract: In 2016, Google Cloud had 74 minutes of total downtime, Microsoft Azure had 270 minutes, and 108 minutes
of downtime for Amazon Web Services (see cloudharmony.com). Reliability is one of the most important properties of a
successful cloud platform. Several approaches can be explored to increase reliability ranging from automated
replication, to live migration, and to formal system analysis. Another interesting approach is to use software fault
injection to test a platform during prototyping, implementation and operation. Fault injection was popularized by Netflix
and their Chaos Monkey fault-injection tool to test cloud applications. The main idea behind this technique is to inject
failures in a controlled manner to guarantee the ability of a system to survive failures during operations. This talk will
explain how fault injection can also be applied to detect vulnerabilities of OpenStack cloud platform and how to
effectively and efficiently detect the damages caused by the faults injected.
l Acknowledgments: This research was conducted in collaboration with Deutsche Telekom/T-Systems and with Ankur
Bhatia from the Technical University of Munich to analyze the reliability and resilience of modern public cloud platforms.
Executive Summary
l Short CV: Dr. Jorge Cardoso is Chief Architect for Cloud Operations and Analytics at
Huawei’s German Research Centre (GRC) in Munich. He is also Professor at the University
of Coimbra since 2009. In 2013 and 2014, he was a Guest Professor at the Karlsruhe
Institute of Technology (KIT) and a Fellow at the Technical University of Dresden (TU
Dresden). Previously, he worked for major companies such as SAP Research (Germany) on
the Internet of services and the Boeing Company in Seattle (USA) on Enterprise Application
Integration. Since 2013, he is the Vice-Chair of the KEYSTONE COST Action, a EU
research network bringing together more than 70 researchers from 26 countries. He has a
Ph.D. in Computer Science from the University of Georgia (USA).
3
Huawei at a Glance
4
Winning Consumer Loyalty and
Building Brand Influence
5
Globalized Resource Deployment and
Localized Business Operations
6
Huawei Public Cloud Products and Services
Compute
Elastic
Cloud
Server
Image
Mgmt
Service
Auto
Scaling Container
Service
Dedicated Cloud Service
Storage
Elastic IP
Elastic
Volume
Service
Object
Storage
Service
Elastic
Load
Balance
Virtual
Private
Cloud
Volume
Backup
Service
Management &
Deployment
Cloud Eye Identity and
Access
Management
Direct
Connect
MaaS
Network DNS
Security Anti-DDoS
Database
Enterprise
Application
Relational
Database
Service
Workspace
SAP Hana/SAP
Suite (IaaS)
MapReduce
Solution SAP Hana/SAP
Suite (IaaS)
7
Global Deployment of Public Cloud Services
u Huawei Enterprise Cloud (HEC) public cloud regions include Langfang and Suzhou in China.
l Open Telefonica Cloud worldwide regions include Mexico, Brazil, Chile, USA, Spain, Peru, Argentina, and Colombia.
n China Telecom Cloud (CTC) public cloud regions include Guizhou and Beijing in China.
▲ Orange Cloud Business (OBS) public cloud regions include France, Singapore, USA, Holland, and Africa.
u Open Telekom Cloud (OTC), Germany
Mexico
Brazil
Chile
Colombia
USA
Peru
Spain
2016 Q1
2016 Q4
Argentina
2017 Jan
2017 July
France
Singapore
USA
Holland
Africa Region 2018 Jan
Beijing
Guizhou
Langfang
Suzhou
2015 Q2
Germany
8
OpenLab Munich
9
Fault Injection into Clouds
Enables to Automatically Test and Repair OpenStack and Cloud
Applications
CLOUD APPLICATION
HUAWEI FusionSphere
The system works by intentionally injecting different failures, test the ability to survive
them, and learn how to predict and repair failures preemptively
Failure
Repair
Test
10
OpenStack Architecture
nova-api nova-compute L2	agent L3	agentDHCP	agent
11
Fault Injection Plans
Director
Create
VM
1
I: None
O: VM name
N: 1/controller
Deployer
Update
DB
1
I: None
O: None
N: 1/controller
Selector
Get
compute
PPID
1
I: None
O: (node, pid)+
N: 1/controllers
V: 1/nova-comp.
mon-fri: 9h-17h, every
3h
Observer
Wait for
State
1
I: VM name
O: VM name,
state
N: 1/controller
V: 1/{ networking,
block_device_ma
pping, spawning}
Injector
Inject
Fault
1
I: (node, pid)+
O: none
V: 1/{SIGKILL,
SIGTERM,
SIGINT}
Detector
Check
VM
State?
1
I: VM name
O: VM state
N: 1/controller
Cleaner
Delete
VM
1
I: VM name
O: None
N: 1/localhost
(node, pid)+ VM_name
VM_name
FIP Scenario Get PPID Variability Inject Fault Variability Wait for State Variability 2 Result
ID12 Create VM Controller One of
controller
services
Send signal
to process
Signal {SIGKILL,
SIGTERM,
SIGINT}
Loop, read status,
sleep
States { networking,
block_device_mapping,
spawning}
T1 Create VM B_xzy Controller nova-compute Send signal SIGKILL Wait for networking OK
T2 Create VM B_jsk Controller nova-api Send signal SIGTERM Wait for block_device_mapping NOK
T999
Rapporteur
Generat
e Report
1
I: data pipeline
O: None
N: 1/localhost
Variability
12
l With improvements in processing power, network and storage technologies, cloud
platforms have witnessed an unprecedented growth in complexity.
l Due to the increase in complexity, the need to efficiently diagnose failures in
cloud platforms has also risen.
l Because of these challenges, cloud operators often develop new sets of tests to
diagnose failures.
l But as mentioned before, cloud platforms are continuously evolving. They
undergo modification and frequent updates, and have periodic release cycles.
Hence, the tests developed become outdated and there is a constant need to
modify them when a new release is available. Therefore, this approach is costly
for the cloud operators.
Background - Requirement
Work done In collaboration with Ankur Bhatia from Technical University of Munich (TUM), Germany
13
l As with most software, the validation of all the modules of a cloud platform is
done through a test suite containing a large number of unit tests. It is a part of
the software development process where the smallest testable part of an
application, called unit, along with associated control data are individually and
independently tested.
l Executing unit tests is a very effective way to test code integration during the
development of software. They are often executed to validate changes made by
developers and to guarantee that the code is error free.
l Although unit tests are extremely useful for the purpose of development and
integration, they are not meant to diagnose failures.
Solution
14
Fault Injection Plans
Director
Create
VM
1
I: None
O: VM name
N: 1/controller
Deployer
Update
DB
1
I: None
O: None
N: 1/controller
Selector
Get
compute
PPID
1
I: None
O: (node, pid)+
N: 1/controllers
mon-fri: 9h-17h, every
3h
Observer
Wait for
State
1
I: VM name
O: VM name,
state
N: 1/controller
Injector
Inject
Fault
1
I: (node, pid)+
O: none
Detector
Check
VM
State?
1
I: VM name
O: VM state
N: 1/controller
Cleaner
Delete
VM
1
I: VM name
O: None
N: 1/localhost
VM_name
Rapporteur
Generat
e Report
1
I: data pipeline
O: None
N: 1/localhost
Unit	tests. It	is	a	part	of	the	
software	development	
process	where	the	smallest	
testable	part	of	a	cloud	
platform,	called	unit,	along	
with	associated	control	data	
are	individually	and	
independently	tested.	
Failure	diagnosis	system.	
Diagnoses	failures	in	cloud	
platforms	making	use	of	unit	
tests	and	helps	to	detect	
nonresponsive	and	failed	
services.	
Unit	tests	for	failure	
diagnosis.	Reuse	the	already	
developed	unit	tests	to	test	
a	cloud	platform	at	runtime.
15
Using unit tests for failure diagnosis
presents a set of challenges
l Unit tests do not provide any information
about the nonresponsive or failed services in
cloud platforms.
p The execution of unit tests generates a list of
passed and failed tests. This list can help to
locate software errors or to find issues with
individual modules of the code but cannot
diagnose failures as there are no relationships
between unit tests and services running on a
cloud platform.
l With the increase in codebase of cloud
platforms, the number of unit tests also
increases. Thus, it takes a considerable
amount of time to execute them.
Challenges
Experiments
l OpenStack has more than 1500 unit tests
l Used for development and integration
l Only validate APIs
l Time consuming
p E.g., unit tests to create a VM, uploading a large
operating system image, etc., need a few minutes to
execute.
p It can take up to 3 to 4 hours to execute all unit tests.
Considering the reliability requirements of 99.95%,
cloud platforms can have a downtime of only 21.6
minutes per month.
p Hence, the time required to execute unit tests is
considerably high.
l Not able to directly detect services that are
not functioning as expected
16
The approach of 3 phases efficiently
diagnoses a cloud platform
l 1. Reduce the number of unit tests using
the set cover algorithm
l 2. Establish relationships between the
reduced unit tests and the services running
on a cloud platform. These relationships
help to determine nonresponsive or failed
services. This is done by simulating failure
of services in cloud platforms and running
unit tests in this environment.
l 3. Construct a decision tree based on these
relationships to select the unit tests that are
most relevant to diagnose failures. The ID3
algorithm is used to construct a decision
tree.
General Approach
Results
l The approach diagnoses failures by running
only 4-5% of the original unit tests.
17
l Unit Tests Reduction (Phase 1). Reduce the number of unit tests written during the code
development
l Service Mapping (Phase 2). Establish relationships between the reduced unit tests and
services. It further reduces the number of unit tests based on these relationships
l Sequential Diagnosis (Phase 3). Use relationships to construct a decision tree to select the
most relevant unit tests to diagnose failures
General Approach
tempest.api.volume.admin.test_volume_types_negative.Vol
umeTypesNegativeV2Test.test_create_with_empty_name
tempest.api.volume.admin.test_volume_types_negative.Vol
umeTypesNegativeV2Test.test_create_with_nonexistent_vo
lume_type
tempest.api.volume.admin.test_volume_types_negative.Vol
umeTypesNegativeV2Test.test_delete_nonexistent_type_id
tempest.api.volume.admin.test_volume_types_negative.Vol
umeTypesNegativeV2Test.test_get_nonexistent_type_id
tempest.api.volume.admin.test_volume_types_negative.Vol
umeTypesNegativeV2Test.test_create_with_empty_name
empest.api.volume.admin.test_volume_types_negative.Volu
meTypesNegativeV2Test.test_create_with_nonexistent_vol
ume_type
tempest.api.volume.admin.test_volume_types_negative.Vol
umeTypesNegativeV2Test.test_delete_nonexistent_type_idt
empest.api.volume.admin.test_volume_types_negative.Volu
meTypesNegativeV2Test.test_create_with_empty_name
tempest.api.volume.admin.test_volume_types_negative.Vol
umeTypesNegativeV2Test.test_create_with_nonexistent_vo
lume_type
tempest.api.volume.admin.test_volume_types_negative.Vol
umeTypesNegativeV2Test.test_delete_nonexistent_type_id
tempest.api.volume.admin.test_volume_types_negative.Vol
umeTypesNegativeV2Test.test_get_nonexistent_type_id
tempest.api.volume.admin.test_volume_types_negative.Vol
umeTypesNegativeV2Test.test_create_with_empty_name
empest.api.volume.admin.test_volume_types_negative.Volu
meTypesNegativeV2Test.test_create_with_nonexistent_vol
ume_type
tempest.api.volume.admin.test_volume_types_negative.Vol
umeTypesNegativeV2Test.test_delete_nonexistent_type_id
Unit Tests
Reduction
tempest.api.volume.admin.test_volume_types_n
egative.VolumeTypesNegativeV2Test.test_creat
e_with_empty_name
tempest.api.volume.admin.test_volume_types_n
egative.VolumeTypesNegativeV2Test.test_creat
e_with_nonexistent_volume_type
tempest.api.volume.admin.test_volume_types_n
egative.VolumeTypesNegativeV2Test.test_delet
e_nonexistent_type_id
tempest.api.volume.admin.test_volume_types_n
egative.VolumeTypesNegativeV2Test.test_get_
nonexistent_type_id
Service
Mapping
Sequential
Diagnosis
Reduced unit testsUnit tests
Phase 1 Phase 2 Phase 3
Decision tree
Faulty
Service
Service mapping
1 0 1
1 0 0
t1
s1 s2 .. ..
1
1tm
..
sn
18
l Identify modules (1/5)
p Identify the modules available in the unit test framework
l Construct Abstract Syntax Tree (2/5)
p For each module, an AST is constructed to establish a
relationship between unit test methods and support
methods
l Filter AST (3/5)
p All irrelevant support methods are filtered out from the
AST using a Inverse Document Frequency
l Support method deduplication (4/5)
p Unit test methods that perform redundant operations
are eliminated by finding the minimum set cover. Time
to execute an unit test is the cost of a set
l Cross module deduplication (5/5)
p Unit test methods from one module are compared with
methods of other modules and are further reduced
Phase 1 - Unit Test Reduction
List of
all unit
tests
List of all
modules
Identify
modules
Construct AST Filter AST
Support
method
de-duplication
Abstract
Syntax Tree
Final
reduced
unit tests
Eliminate
redundant unit
test methods
across modules
Eliminate
redundant unit
test methods
within modules
Reduced
unit tests
Filtered
AST
Unit test manager
Identify irrelevant
support methods
Cross module
de-duplication
19
l Identify modules (1/5)
p A module is the source file which contains
the definition of the unit test method
l Construct AST (2/5)
p For all the modules construct an AST
l Traversed the AST to identify unit test
methods and their support methods
l For each unit test method, a set of its
support methods is created:
p test_create_flavor: {create_flavor,
assertEqual}
p test_delete_flavor: {create_flavor,
delete_fl, assertTrue}
Phase 1 (1-2/5)
tempest.api.compute.flavors.test_flavors.			Flavors.																									test_create_flavor
Class name Unit test
method
Path of the module:
tempest/api/compute/flavors/test_flavors.py
Part	1 Part	2 Part	3
A	simple	AST
class Flavors (base.Test):
def test_create_flavor(self):
new_flavor_id = self.create_flavor()
self.assertEqual(new_flavor_id, flavor_id)
def test_delete_flavor(self):
flavor = self.create_flavor(flavor_name)
del_flavor = self.delete_fl(flavor)
self.assertTrue(some_statement)
A	Python	module	(test_flavor.py)	containing	the	
implementations	of	unit	tests
test_flavor.py
Flavors
test_create_
flavor
test_delete_
flavor
Assert
Equal
create_
flavor
delete_fl assertTrue
create_
flavor
Module
Class
Unit test
methods
Support
methods
Note: Apart from the nodes shown in the figure, an AST
contains other nodes like variables, constructors, etc.
Since they are not relevant to us, they are excluded for
simplicity.
Identify Module & Construct AST
20
Phase 1 (3/5)
l Support methods are eliminated
p Some of the support methods are specific
to the unit test framework
p E.g., assertEquals
l Use Inverted Document Frequency
(IDF) technique
p Higher occurrence -> lower IDF
l Calculated IDF for support methods
p Set of documents = ASTs of modules
l Eliminate support methods having a
IDF below a threshold alpha
p These support methods identified do not
play any role in the reduction of the unit
tests
p They are irrelevant
Module 1
Class A:
def
U1(self):
Assert()
S1()
def
U2(self):
S2()
S3()
Class B:
def
U3(self):
S2()
S3()
AST construction
Unit test
methods
Support
Methods
(words)
Class Class
U1
Module
S1 S2 S1
U2 U3
Asser
t
Filter AST
21
Phase 1 (3/5)
IDF(x)	=	log	(#	documents/	#	documents	with	word	x)
Unit test
methods
Support
Methods
(words)
Document 1 Document 2 Document 3 Document 4
Class Class
U1
Module
S1 S2 S1
U2 U3
Class Class
U4
Module
S3
Asser
t
S2 S3
U5 U6
Class Class
U7
Module
S5
Asser
t
S5 S5
U8 U9
Class Class
U10
Module
S6
Asser
t
S7 S18
U11 U12
Asser
t
§ The	support	methods	Assert is	present	in	all	the	modules	->	its	IDF	is	log(4/4)	=	0
§ Similarly,	IDF	for	S1 is	log(4),	S2 is	log(2)
§ Thus,	Assert	is	irrelevant.		It	is	eliminated	from	the	list	of	support	methods.
22
Phase 1 (4/5)
l Each module has unit test methods which
call support methods
l In most cases, a subset of the unit test
methods call all the support methods
l Thus, some unit test methods are
redundant
l Prune the ASTs constructed
p Pruning is done using the minimum set cover
l Two important parameters: cost and
coverage
p Cost is the time a unit test takes to execute
p Coverage is the percentage of support methods
to consider in the set cover
Support method deduplication
23
Phase 1 (4/5)
Support method (S) and time in seconds for each
Unit test method (U)
U1 : {S1 , S3}, 15
U2 : {S1 , S3,} ,8
U3 : {S2}, 9
Class A
U1
Module 1
S1
U2 U3
Class B
S1 S3 S2S3
Loop 1:
a1 = 15/2
a2 = 8/2
a3 = 9/1
I = {S1, S3}
Class A
Module 1
U2
Class B
S1 S3
Loop 1:
a1 = 15/2
a2 = 8/2
a3 = 9/1
I = {S1, S3}
Loop 2:
a1 = 15/0
a2 = 8/0
a3 = 9/1
I = {S1, S2, S3}
Class A
Module 1
U2 U3
Class B
S1 S3 S2
Minimum coverage requested = 100 %
U = {S1 , S2 , S3}
Required set = all elements of U
Minimum coverage requested = 50 %
U = {S1 , S2 , S3}
Required set = é(50/100 ) * 3ù = 2
Coverage obtained: 66.6 %Coverage obtained: 100 %
In case 2, the minimum requested is 50% so any
2 support methods out of 3 are sufficient. U2
contains 2 distinct support methods and has the
least cost. Hence, U1 and U3 are eliminated.
In case 1, minimum coverage is100 %, so all the
support methods must be present. U2 and U3
covers all the distinct support methods and has
the least cost. Hence, U1 is eliminated.
24
Phase 1 (5/5)
U5 : {S4, S5} U10 : {S4}
U11 : {S10, S4}
U12 : {S12, S5}
§ The support method de-duplication subsystem guarantees
that {S4, S5, S10, S12} are called by the subset of the unit
test methods in Module 4
§ All the support methods called by U5 (i.e., {S4, S5}) are
already covered in Module 4.
§ Hence, U5 is redundant and can be eliminated
Cross module deduplication
§ The AST pruning reduces the unit test methods by finding the minimum set cover of the support methods in the same
module. However, some unit test methods are covered in other modules
§ Thus, the reduction can be improved without losing the coverage
§ Cross module deduplication compares the support methods of a unit test method from one module with the universe
of the support methods of other modules
Module 4
Class G:
def U10(self):
Assert()
S4()
def U11(self):
S10()
S4()
Class H:
def U12(self):
S12()
S5()
Module 2
Class C:
def U4(self):
Assert()
S2()
def U5(self):
S4()
S5()
Class D:
def U6(self):
S4()
S7()
Our experiment with OpenStack enabled us to reduce 1391 unit tests
created by developers to 538 tests with 100% coverage
25
Phase 2
Reduced set of unit
tests from Phase 1
Unit test and
service mapping
Isomorphic unit
test elimination
Test service mapper
Matrix
representing
relationships
between unit
tests and
testability of
services
Reduced matrix representing
relationships between unit
tests and testability of services
Service Mapping
One of the major challenges of using unit tests for failure diagnosis is the lack of relationships between unit tests and services
running on cloud platforms. The execution of unit tests outputs a list of passed, failed and skipped tests.
§ Establish a relation between the unit tests and the services they can test
§ Isomorphic unit test elimination
unittestUi
unittestUi
Experiment: 20 OpenStack services running and 538 unit tests after
reduction from Phase 1. At the end of Phase 2, the isomorphic test
elimination procedure reduced the number of test to 25.
For each service s:
1. Disable s (stop/shutdown)
to simulate a failure of a
service in a cloud platform
2. Run reduced set of unit
tests. Unit tests that
depend on service s will
fail. Record result.
3. Enable s (start)
The execution time for unit tests U1 is 10 seconds, U2 is 9
seconds, U3 is 13 seconds, U4 is 19 seconds and U5 is 3
seconds. Hence, U1, U5 and U2 are selected
26
Phase 3
unit test 3
unit test 2 unit test 4
unit test 1
unit test 5
service 1
service 2
service 5
service 5
service 3
service 1
service 4
fail
fail fail
fail
fail
pass
pass
pass
pass
Sequential Diagnosis
The main task of this phase is to analyze these relationships by constructing a decision tree and use the tree to detect failed
service(s) in a cloud platform in operation. Moreover, the failed services are detected without executing all the unit tests.
§ Construction of the decision tree
§ Execution of unit tests based on the decision tree
The tree of the Figure (right)
§ Execute unit test 3
§ If it fails, unit test 2 is executed
§ The failure of unit test 2 indicates
that service 1 is in a failed state
In some cases, there might be more than
one possible service in a failed state
§ If unit test 2 passes, then either service
2 or service 5 or both are in a failed
state
A failed service(s) can be detected by running log(n) number of unit tests instead of n, the total
number of unit tests after eliminating the isomorphic unit tests. Hence, the tree enables the
users to detect the failed service(s) automatically without executing all the unit tests.
27
Results
We evaluated the approach with OpenStack = 1391 unit
tests
Phase 1. Reduce the number of unit tests to 538 with
100% coverage
Phase 2. Established relationships between the reduced
set of unit tests and the OpenStack services. There were
20 services. Eliminated isomorphic unit tests. At the end of
Phase 2, we were able to reduce the number of unit tests
to 25.
Phase 3. Build decision tree using the relationships
established. 5 unit tests on average.
1391
538
25
5
28
Open Positions and CFP
l FTE, PostDoc, PhD
Students
p Fault injection, fault models,
fault libraries, fault plans,
brake and rebuild systems all
day long, …
p Cloud Operations and
Analytics for High Availability
p AI, machine learning, data
mining, and time-series
analysis for Cloud operations
Copyright©2015 Huawei Technologies Co., Ltd. All Rights Reserved.
The information in this document may contain predictive statements including, without limitation, statements regarding the future financial and operating results, future product
portfolio, new technology, etc. There are a number of factors that could cause actual results and developments to differ materially from those expressed or implied in the predictive
statements. Therefore, such information is provided for reference purpose only and constitutes neither an offer nor an acceptance. Huawei may change the information at any time
without notice.
HUAWEI ENTERPRISE ICT SOLUTIONS A BETTER WAY

Weitere ähnliche Inhalte

Was ist angesagt?

Disaster recovery with cloud computing
Disaster recovery with cloud computingDisaster recovery with cloud computing
Disaster recovery with cloud computingIsrael Roy Sambu
 
The new stack isn’t a stack: Fragmentation and terraforming 
the service layer
The new stack isn’t a stack: Fragmentation and terraforming 
the service layerThe new stack isn’t a stack: Fragmentation and terraforming 
the service layer
The new stack isn’t a stack: Fragmentation and terraforming 
the service layerDonnie Berkholz
 
Faster deep learning solutions from training to inference - Amitai Armon & Ni...
Faster deep learning solutions from training to inference - Amitai Armon & Ni...Faster deep learning solutions from training to inference - Amitai Armon & Ni...
Faster deep learning solutions from training to inference - Amitai Armon & Ni...Codemotion Tel Aviv
 
Security and Advanced Automation in the Enterprise
Security and Advanced Automation in the EnterpriseSecurity and Advanced Automation in the Enterprise
Security and Advanced Automation in the EnterpriseAmazon Web Services
 
Deep Learning Use Cases using OpenPOWER systems
Deep Learning Use Cases using OpenPOWER systemsDeep Learning Use Cases using OpenPOWER systems
Deep Learning Use Cases using OpenPOWER systemsGanesan Narayanasamy
 
IRJET- Usage of Multiple Clouds for Storing and Securing Data through Identit...
IRJET- Usage of Multiple Clouds for Storing and Securing Data through Identit...IRJET- Usage of Multiple Clouds for Storing and Securing Data through Identit...
IRJET- Usage of Multiple Clouds for Storing and Securing Data through Identit...IRJET Journal
 
App Performance Tip: Sharing Flash Across Virtualized Workloads
App Performance Tip: Sharing Flash Across Virtualized WorkloadsApp Performance Tip: Sharing Flash Across Virtualized Workloads
App Performance Tip: Sharing Flash Across Virtualized WorkloadsDataCore Software
 
Achieving observability-in-modern-applications
Achieving observability-in-modern-applicationsAchieving observability-in-modern-applications
Achieving observability-in-modern-applicationsJulio Antúnez Tarín
 
SplunkLive! - Splunk for IT Operations
SplunkLive! - Splunk for IT OperationsSplunkLive! - Splunk for IT Operations
SplunkLive! - Splunk for IT OperationsSplunk
 
Splunk Enterpise for Information Security Hands-On
Splunk Enterpise for Information Security Hands-OnSplunk Enterpise for Information Security Hands-On
Splunk Enterpise for Information Security Hands-OnSplunk
 
CMG White Paper
CMG White PaperCMG White Paper
CMG White PaperLen Jejer
 
CodeValue Architecture Next 2018 - Executive track dilemmas and solutions in...
CodeValue Architecture Next 2018 - Executive track  dilemmas and solutions in...CodeValue Architecture Next 2018 - Executive track  dilemmas and solutions in...
CodeValue Architecture Next 2018 - Executive track dilemmas and solutions in...Erez PEDRO
 
Adapting to a Cambrian AI/SW/HW explosion with open co-design competitions an...
Adapting to a Cambrian AI/SW/HW explosion with open co-design competitions an...Adapting to a Cambrian AI/SW/HW explosion with open co-design competitions an...
Adapting to a Cambrian AI/SW/HW explosion with open co-design competitions an...Grigori Fursin
 
A Novel Method of Directly Auditing Integrity On Encrypted Data
A Novel Method of Directly Auditing Integrity On Encrypted DataA Novel Method of Directly Auditing Integrity On Encrypted Data
A Novel Method of Directly Auditing Integrity On Encrypted DataIRJET Journal
 
Splunk Enterprise for IT Troubleshooting
Splunk Enterprise for IT TroubleshootingSplunk Enterprise for IT Troubleshooting
Splunk Enterprise for IT TroubleshootingSplunk
 
scalable distributed service integrity attestation for software-as-a-service ...
scalable distributed service integrity attestation for software-as-a-service ...scalable distributed service integrity attestation for software-as-a-service ...
scalable distributed service integrity attestation for software-as-a-service ...swathi78
 

Was ist angesagt? (20)

Disaster recovery with cloud computing
Disaster recovery with cloud computingDisaster recovery with cloud computing
Disaster recovery with cloud computing
 
Webinar_DevOps_Nov10_D2
Webinar_DevOps_Nov10_D2Webinar_DevOps_Nov10_D2
Webinar_DevOps_Nov10_D2
 
The new stack isn’t a stack: Fragmentation and terraforming 
the service layer
The new stack isn’t a stack: Fragmentation and terraforming 
the service layerThe new stack isn’t a stack: Fragmentation and terraforming 
the service layer
The new stack isn’t a stack: Fragmentation and terraforming 
the service layer
 
Incident response in cloud environments
Incident response in cloud environmentsIncident response in cloud environments
Incident response in cloud environments
 
Faster deep learning solutions from training to inference - Amitai Armon & Ni...
Faster deep learning solutions from training to inference - Amitai Armon & Ni...Faster deep learning solutions from training to inference - Amitai Armon & Ni...
Faster deep learning solutions from training to inference - Amitai Armon & Ni...
 
Security and Advanced Automation in the Enterprise
Security and Advanced Automation in the EnterpriseSecurity and Advanced Automation in the Enterprise
Security and Advanced Automation in the Enterprise
 
Deep Learning Use Cases using OpenPOWER systems
Deep Learning Use Cases using OpenPOWER systemsDeep Learning Use Cases using OpenPOWER systems
Deep Learning Use Cases using OpenPOWER systems
 
IRJET- Usage of Multiple Clouds for Storing and Securing Data through Identit...
IRJET- Usage of Multiple Clouds for Storing and Securing Data through Identit...IRJET- Usage of Multiple Clouds for Storing and Securing Data through Identit...
IRJET- Usage of Multiple Clouds for Storing and Securing Data through Identit...
 
Observability
ObservabilityObservability
Observability
 
Aplicações Potenciais de Deep Learning à Indústria do Petróleo
Aplicações Potenciais de Deep Learning à Indústria do PetróleoAplicações Potenciais de Deep Learning à Indústria do Petróleo
Aplicações Potenciais de Deep Learning à Indústria do Petróleo
 
App Performance Tip: Sharing Flash Across Virtualized Workloads
App Performance Tip: Sharing Flash Across Virtualized WorkloadsApp Performance Tip: Sharing Flash Across Virtualized Workloads
App Performance Tip: Sharing Flash Across Virtualized Workloads
 
Achieving observability-in-modern-applications
Achieving observability-in-modern-applicationsAchieving observability-in-modern-applications
Achieving observability-in-modern-applications
 
SplunkLive! - Splunk for IT Operations
SplunkLive! - Splunk for IT OperationsSplunkLive! - Splunk for IT Operations
SplunkLive! - Splunk for IT Operations
 
Splunk Enterpise for Information Security Hands-On
Splunk Enterpise for Information Security Hands-OnSplunk Enterpise for Information Security Hands-On
Splunk Enterpise for Information Security Hands-On
 
CMG White Paper
CMG White PaperCMG White Paper
CMG White Paper
 
CodeValue Architecture Next 2018 - Executive track dilemmas and solutions in...
CodeValue Architecture Next 2018 - Executive track  dilemmas and solutions in...CodeValue Architecture Next 2018 - Executive track  dilemmas and solutions in...
CodeValue Architecture Next 2018 - Executive track dilemmas and solutions in...
 
Adapting to a Cambrian AI/SW/HW explosion with open co-design competitions an...
Adapting to a Cambrian AI/SW/HW explosion with open co-design competitions an...Adapting to a Cambrian AI/SW/HW explosion with open co-design competitions an...
Adapting to a Cambrian AI/SW/HW explosion with open co-design competitions an...
 
A Novel Method of Directly Auditing Integrity On Encrypted Data
A Novel Method of Directly Auditing Integrity On Encrypted DataA Novel Method of Directly Auditing Integrity On Encrypted Data
A Novel Method of Directly Auditing Integrity On Encrypted Data
 
Splunk Enterprise for IT Troubleshooting
Splunk Enterprise for IT TroubleshootingSplunk Enterprise for IT Troubleshooting
Splunk Enterprise for IT Troubleshooting
 
scalable distributed service integrity attestation for software-as-a-service ...
scalable distributed service integrity attestation for software-as-a-service ...scalable distributed service integrity attestation for software-as-a-service ...
scalable distributed service integrity attestation for software-as-a-service ...
 

Ähnlich wie Cloud Reliability: Decreasing outage frequency using fault injection

Let's banish "it works on my machine"
Let's banish "it works on my machine"Let's banish "it works on my machine"
Let's banish "it works on my machine"Stephanie Locke
 
Prometheus for Monitoring Metrics (Fermilab 2018)
Prometheus for Monitoring Metrics (Fermilab 2018)Prometheus for Monitoring Metrics (Fermilab 2018)
Prometheus for Monitoring Metrics (Fermilab 2018)Brian Brazil
 
Frank Cohen - Are We Ready For Cloud Testing - EuroSTAR 2010
Frank Cohen - Are We Ready For Cloud Testing - EuroSTAR 2010Frank Cohen - Are We Ready For Cloud Testing - EuroSTAR 2010
Frank Cohen - Are We Ready For Cloud Testing - EuroSTAR 2010TEST Huddle
 
What is Cloud Testing Everything you need to know.pdf
What is Cloud Testing Everything you need to know.pdfWhat is Cloud Testing Everything you need to know.pdf
What is Cloud Testing Everything you need to know.pdfpcloudy2
 
Effektives Consulting - Performance Engineering
Effektives Consulting - Performance EngineeringEffektives Consulting - Performance Engineering
Effektives Consulting - Performance Engineeringhitdhits
 
Unit Testing Essay
Unit Testing EssayUnit Testing Essay
Unit Testing EssayDani Cox
 
Harnessing the Cloud for Performance Testing- Impetus White Paper
Harnessing the Cloud for Performance Testing- Impetus White PaperHarnessing the Cloud for Performance Testing- Impetus White Paper
Harnessing the Cloud for Performance Testing- Impetus White PaperImpetus Technologies
 
IRJET- An Efficient Hardware-Oriented Runtime Approach for Stack-Based Softwa...
IRJET- An Efficient Hardware-Oriented Runtime Approach for Stack-Based Softwa...IRJET- An Efficient Hardware-Oriented Runtime Approach for Stack-Based Softwa...
IRJET- An Efficient Hardware-Oriented Runtime Approach for Stack-Based Softwa...IRJET Journal
 
Surekha_haoop_exp
Surekha_haoop_expSurekha_haoop_exp
Surekha_haoop_expsurekhakadi
 
How to effectively perform Multiple Device Testing on Cloud (1).pdf
How to effectively perform Multiple Device Testing on Cloud (1).pdfHow to effectively perform Multiple Device Testing on Cloud (1).pdf
How to effectively perform Multiple Device Testing on Cloud (1).pdfpCloudy
 
Fuzzing101 uvm-reporting-and-mitigation-2011-02-10
Fuzzing101 uvm-reporting-and-mitigation-2011-02-10Fuzzing101 uvm-reporting-and-mitigation-2011-02-10
Fuzzing101 uvm-reporting-and-mitigation-2011-02-10Codenomicon
 
Operations: Production Readiness Review – How to stop bad things from Happening
Operations: Production Readiness Review – How to stop bad things from HappeningOperations: Production Readiness Review – How to stop bad things from Happening
Operations: Production Readiness Review – How to stop bad things from HappeningAmazon Web Services
 
DevOps - Introduction to data science
DevOps - Introduction to data scienceDevOps - Introduction to data science
DevOps - Introduction to data scienceFrank Kienle
 
Finding Bugs, Fixing Bugs, Preventing Bugs — Exploiting Automated Tests to In...
Finding Bugs, Fixing Bugs, Preventing Bugs — Exploiting Automated Tests to In...Finding Bugs, Fixing Bugs, Preventing Bugs — Exploiting Automated Tests to In...
Finding Bugs, Fixing Bugs, Preventing Bugs — Exploiting Automated Tests to In...University of Antwerp
 

Ähnlich wie Cloud Reliability: Decreasing outage frequency using fault injection (20)

Path to continuous delivery
Path to continuous deliveryPath to continuous delivery
Path to continuous delivery
 
Let's banish "it works on my machine"
Let's banish "it works on my machine"Let's banish "it works on my machine"
Let's banish "it works on my machine"
 
Prometheus for Monitoring Metrics (Fermilab 2018)
Prometheus for Monitoring Metrics (Fermilab 2018)Prometheus for Monitoring Metrics (Fermilab 2018)
Prometheus for Monitoring Metrics (Fermilab 2018)
 
Frank Cohen - Are We Ready For Cloud Testing - EuroSTAR 2010
Frank Cohen - Are We Ready For Cloud Testing - EuroSTAR 2010Frank Cohen - Are We Ready For Cloud Testing - EuroSTAR 2010
Frank Cohen - Are We Ready For Cloud Testing - EuroSTAR 2010
 
What is Cloud Testing Everything you need to know.pdf
What is Cloud Testing Everything you need to know.pdfWhat is Cloud Testing Everything you need to know.pdf
What is Cloud Testing Everything you need to know.pdf
 
Effektives Consulting - Performance Engineering
Effektives Consulting - Performance EngineeringEffektives Consulting - Performance Engineering
Effektives Consulting - Performance Engineering
 
Unit Testing Essay
Unit Testing EssayUnit Testing Essay
Unit Testing Essay
 
Harnessing the Cloud for Performance Testing- Impetus White Paper
Harnessing the Cloud for Performance Testing- Impetus White PaperHarnessing the Cloud for Performance Testing- Impetus White Paper
Harnessing the Cloud for Performance Testing- Impetus White Paper
 
IRJET- An Efficient Hardware-Oriented Runtime Approach for Stack-Based Softwa...
IRJET- An Efficient Hardware-Oriented Runtime Approach for Stack-Based Softwa...IRJET- An Efficient Hardware-Oriented Runtime Approach for Stack-Based Softwa...
IRJET- An Efficient Hardware-Oriented Runtime Approach for Stack-Based Softwa...
 
Surekha_haoop_exp
Surekha_haoop_expSurekha_haoop_exp
Surekha_haoop_exp
 
Rv11
Rv11Rv11
Rv11
 
How to effectively perform Multiple Device Testing on Cloud (1).pdf
How to effectively perform Multiple Device Testing on Cloud (1).pdfHow to effectively perform Multiple Device Testing on Cloud (1).pdf
How to effectively perform Multiple Device Testing on Cloud (1).pdf
 
DevOps explained
DevOps explainedDevOps explained
DevOps explained
 
Seminar
SeminarSeminar
Seminar
 
ShwetaKBijay-resume
ShwetaKBijay-resumeShwetaKBijay-resume
ShwetaKBijay-resume
 
Fuzzing101 uvm-reporting-and-mitigation-2011-02-10
Fuzzing101 uvm-reporting-and-mitigation-2011-02-10Fuzzing101 uvm-reporting-and-mitigation-2011-02-10
Fuzzing101 uvm-reporting-and-mitigation-2011-02-10
 
Operations: Production Readiness Review – How to stop bad things from Happening
Operations: Production Readiness Review – How to stop bad things from HappeningOperations: Production Readiness Review – How to stop bad things from Happening
Operations: Production Readiness Review – How to stop bad things from Happening
 
DevOps - Introduction to data science
DevOps - Introduction to data scienceDevOps - Introduction to data science
DevOps - Introduction to data science
 
Finding Bugs, Fixing Bugs, Preventing Bugs — Exploiting Automated Tests to In...
Finding Bugs, Fixing Bugs, Preventing Bugs — Exploiting Automated Tests to In...Finding Bugs, Fixing Bugs, Preventing Bugs — Exploiting Automated Tests to In...
Finding Bugs, Fixing Bugs, Preventing Bugs — Exploiting Automated Tests to In...
 
Resume
ResumeResume
Resume
 

Mehr von Jorge Cardoso

On the Application of AI for Failure Management: Problems, Solutions and Algo...
On the Application of AI for Failure Management: Problems, Solutions and Algo...On the Application of AI for Failure Management: Problems, Solutions and Algo...
On the Application of AI for Failure Management: Problems, Solutions and Algo...Jorge Cardoso
 
AIOps: Anomalous Span Detection in Distributed Traces Using Deep Learning
AIOps: Anomalous Span Detection in Distributed Traces Using Deep LearningAIOps: Anomalous Span Detection in Distributed Traces Using Deep Learning
AIOps: Anomalous Span Detection in Distributed Traces Using Deep LearningJorge Cardoso
 
AIOps: Anomalies Detection of Distributed Traces
AIOps: Anomalies Detection of Distributed TracesAIOps: Anomalies Detection of Distributed Traces
AIOps: Anomalies Detection of Distributed TracesJorge Cardoso
 
Recapitulation Workshop Cloud Reliability Resilience 2016
Recapitulation Workshop Cloud Reliability Resilience 2016Recapitulation Workshop Cloud Reliability Resilience 2016
Recapitulation Workshop Cloud Reliability Resilience 2016Jorge Cardoso
 
Evolution and Overview of Linked USDL
Evolution and Overview of Linked USDLEvolution and Overview of Linked USDL
Evolution and Overview of Linked USDLJorge Cardoso
 
Ten years of service research from a computer science perspective
Ten years of service research from a computer science perspectiveTen years of service research from a computer science perspective
Ten years of service research from a computer science perspectiveJorge Cardoso
 
Cloud Computing Automation: Integrating USDL and TOSCA
 Cloud Computing Automation: Integrating USDL and TOSCA Cloud Computing Automation: Integrating USDL and TOSCA
Cloud Computing Automation: Integrating USDL and TOSCAJorge Cardoso
 
Open Service Network Analysis
Open Service Network AnalysisOpen Service Network Analysis
Open Service Network AnalysisJorge Cardoso
 
Open Semantic Service Networks: Modeling and Analysis
Open Semantic Service Networks: Modeling and AnalysisOpen Semantic Service Networks: Modeling and Analysis
Open Semantic Service Networks: Modeling and AnalysisJorge Cardoso
 
Modeling Service Relationships for Service Networks
Modeling Service Relationships for Service NetworksModeling Service Relationships for Service Networks
Modeling Service Relationships for Service NetworksJorge Cardoso
 
Challenges for Open Semantic Service Networks : models, theory, applications
Challenges for Open Semantic Service Networks: models, theory, applications Challenges for Open Semantic Service Networks: models, theory, applications
Challenges for Open Semantic Service Networks : models, theory, applications Jorge Cardoso
 
Description and portability of cloud services with USDL and TOSCA
Description and portability of cloud services with USDL and TOSCADescription and portability of cloud services with USDL and TOSCA
Description and portability of cloud services with USDL and TOSCAJorge Cardoso
 
Open Semantic Service Networks
Open Semantic Service NetworksOpen Semantic Service Networks
Open Semantic Service NetworksJorge Cardoso
 
Dynamic Open Semantic Service Networks
Dynamic Open Semantic Service NetworksDynamic Open Semantic Service Networks
Dynamic Open Semantic Service NetworksJorge Cardoso
 
Genssiz Projects: Year 2012 2013
Genssiz Projects: Year 2012 2013Genssiz Projects: Year 2012 2013
Genssiz Projects: Year 2012 2013Jorge Cardoso
 
IEEE SE2012 Internet-based self-services
IEEE SE2012 Internet-based self-servicesIEEE SE2012 Internet-based self-services
IEEE SE2012 Internet-based self-servicesJorge Cardoso
 
Community based harversting for USDL
Community based harversting for USDLCommunity based harversting for USDL
Community based harversting for USDLJorge Cardoso
 

Mehr von Jorge Cardoso (19)

On the Application of AI for Failure Management: Problems, Solutions and Algo...
On the Application of AI for Failure Management: Problems, Solutions and Algo...On the Application of AI for Failure Management: Problems, Solutions and Algo...
On the Application of AI for Failure Management: Problems, Solutions and Algo...
 
AIOps: Anomalous Span Detection in Distributed Traces Using Deep Learning
AIOps: Anomalous Span Detection in Distributed Traces Using Deep LearningAIOps: Anomalous Span Detection in Distributed Traces Using Deep Learning
AIOps: Anomalous Span Detection in Distributed Traces Using Deep Learning
 
AIOps: Anomalies Detection of Distributed Traces
AIOps: Anomalies Detection of Distributed TracesAIOps: Anomalies Detection of Distributed Traces
AIOps: Anomalies Detection of Distributed Traces
 
Recapitulation Workshop Cloud Reliability Resilience 2016
Recapitulation Workshop Cloud Reliability Resilience 2016Recapitulation Workshop Cloud Reliability Resilience 2016
Recapitulation Workshop Cloud Reliability Resilience 2016
 
Shape the Cloud
Shape the CloudShape the Cloud
Shape the Cloud
 
Evolution and Overview of Linked USDL
Evolution and Overview of Linked USDLEvolution and Overview of Linked USDL
Evolution and Overview of Linked USDL
 
Ten years of service research from a computer science perspective
Ten years of service research from a computer science perspectiveTen years of service research from a computer science perspective
Ten years of service research from a computer science perspective
 
Cloud Computing Automation: Integrating USDL and TOSCA
 Cloud Computing Automation: Integrating USDL and TOSCA Cloud Computing Automation: Integrating USDL and TOSCA
Cloud Computing Automation: Integrating USDL and TOSCA
 
Open Service Network Analysis
Open Service Network AnalysisOpen Service Network Analysis
Open Service Network Analysis
 
Open Semantic Service Networks: Modeling and Analysis
Open Semantic Service Networks: Modeling and AnalysisOpen Semantic Service Networks: Modeling and Analysis
Open Semantic Service Networks: Modeling and Analysis
 
Modeling Service Relationships for Service Networks
Modeling Service Relationships for Service NetworksModeling Service Relationships for Service Networks
Modeling Service Relationships for Service Networks
 
Linked USDL
Linked USDLLinked USDL
Linked USDL
 
Challenges for Open Semantic Service Networks : models, theory, applications
Challenges for Open Semantic Service Networks: models, theory, applications Challenges for Open Semantic Service Networks: models, theory, applications
Challenges for Open Semantic Service Networks : models, theory, applications
 
Description and portability of cloud services with USDL and TOSCA
Description and portability of cloud services with USDL and TOSCADescription and portability of cloud services with USDL and TOSCA
Description and portability of cloud services with USDL and TOSCA
 
Open Semantic Service Networks
Open Semantic Service NetworksOpen Semantic Service Networks
Open Semantic Service Networks
 
Dynamic Open Semantic Service Networks
Dynamic Open Semantic Service NetworksDynamic Open Semantic Service Networks
Dynamic Open Semantic Service Networks
 
Genssiz Projects: Year 2012 2013
Genssiz Projects: Year 2012 2013Genssiz Projects: Year 2012 2013
Genssiz Projects: Year 2012 2013
 
IEEE SE2012 Internet-based self-services
IEEE SE2012 Internet-based self-servicesIEEE SE2012 Internet-based self-services
IEEE SE2012 Internet-based self-services
 
Community based harversting for USDL
Community based harversting for USDLCommunity based harversting for USDL
Community based harversting for USDL
 

Kürzlich hochgeladen

A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 

Kürzlich hochgeladen (20)

A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 

Cloud Reliability: Decreasing outage frequency using fault injection

  • 1. 9th International Workshop on Software Engineering for Resilient Systems September 4-5, 2017, Geneva, Switzerland Prof. Dr. Jorge Cardoso University of Coimbra Portugal Cloud Reliability Decreasing outage frequency using fault injection Chief Architect for Cloud Operations and Analytics Huawei Research, Munich Germany
  • 2. 2 l Title: Cloud Reliability: Decreasing outage frequency using fault injection l Abstract: In 2016, Google Cloud had 74 minutes of total downtime, Microsoft Azure had 270 minutes, and 108 minutes of downtime for Amazon Web Services (see cloudharmony.com). Reliability is one of the most important properties of a successful cloud platform. Several approaches can be explored to increase reliability ranging from automated replication, to live migration, and to formal system analysis. Another interesting approach is to use software fault injection to test a platform during prototyping, implementation and operation. Fault injection was popularized by Netflix and their Chaos Monkey fault-injection tool to test cloud applications. The main idea behind this technique is to inject failures in a controlled manner to guarantee the ability of a system to survive failures during operations. This talk will explain how fault injection can also be applied to detect vulnerabilities of OpenStack cloud platform and how to effectively and efficiently detect the damages caused by the faults injected. l Acknowledgments: This research was conducted in collaboration with Deutsche Telekom/T-Systems and with Ankur Bhatia from the Technical University of Munich to analyze the reliability and resilience of modern public cloud platforms. Executive Summary l Short CV: Dr. Jorge Cardoso is Chief Architect for Cloud Operations and Analytics at Huawei’s German Research Centre (GRC) in Munich. He is also Professor at the University of Coimbra since 2009. In 2013 and 2014, he was a Guest Professor at the Karlsruhe Institute of Technology (KIT) and a Fellow at the Technical University of Dresden (TU Dresden). Previously, he worked for major companies such as SAP Research (Germany) on the Internet of services and the Boeing Company in Seattle (USA) on Enterprise Application Integration. Since 2013, he is the Vice-Chair of the KEYSTONE COST Action, a EU research network bringing together more than 70 researchers from 26 countries. He has a Ph.D. in Computer Science from the University of Georgia (USA).
  • 3. 3 Huawei at a Glance
  • 4. 4 Winning Consumer Loyalty and Building Brand Influence
  • 5. 5 Globalized Resource Deployment and Localized Business Operations
  • 6. 6 Huawei Public Cloud Products and Services Compute Elastic Cloud Server Image Mgmt Service Auto Scaling Container Service Dedicated Cloud Service Storage Elastic IP Elastic Volume Service Object Storage Service Elastic Load Balance Virtual Private Cloud Volume Backup Service Management & Deployment Cloud Eye Identity and Access Management Direct Connect MaaS Network DNS Security Anti-DDoS Database Enterprise Application Relational Database Service Workspace SAP Hana/SAP Suite (IaaS) MapReduce Solution SAP Hana/SAP Suite (IaaS)
  • 7. 7 Global Deployment of Public Cloud Services u Huawei Enterprise Cloud (HEC) public cloud regions include Langfang and Suzhou in China. l Open Telefonica Cloud worldwide regions include Mexico, Brazil, Chile, USA, Spain, Peru, Argentina, and Colombia. n China Telecom Cloud (CTC) public cloud regions include Guizhou and Beijing in China. ▲ Orange Cloud Business (OBS) public cloud regions include France, Singapore, USA, Holland, and Africa. u Open Telekom Cloud (OTC), Germany Mexico Brazil Chile Colombia USA Peru Spain 2016 Q1 2016 Q4 Argentina 2017 Jan 2017 July France Singapore USA Holland Africa Region 2018 Jan Beijing Guizhou Langfang Suzhou 2015 Q2 Germany
  • 9. 9 Fault Injection into Clouds Enables to Automatically Test and Repair OpenStack and Cloud Applications CLOUD APPLICATION HUAWEI FusionSphere The system works by intentionally injecting different failures, test the ability to survive them, and learn how to predict and repair failures preemptively Failure Repair Test
  • 10. 10 OpenStack Architecture nova-api nova-compute L2 agent L3 agentDHCP agent
  • 11. 11 Fault Injection Plans Director Create VM 1 I: None O: VM name N: 1/controller Deployer Update DB 1 I: None O: None N: 1/controller Selector Get compute PPID 1 I: None O: (node, pid)+ N: 1/controllers V: 1/nova-comp. mon-fri: 9h-17h, every 3h Observer Wait for State 1 I: VM name O: VM name, state N: 1/controller V: 1/{ networking, block_device_ma pping, spawning} Injector Inject Fault 1 I: (node, pid)+ O: none V: 1/{SIGKILL, SIGTERM, SIGINT} Detector Check VM State? 1 I: VM name O: VM state N: 1/controller Cleaner Delete VM 1 I: VM name O: None N: 1/localhost (node, pid)+ VM_name VM_name FIP Scenario Get PPID Variability Inject Fault Variability Wait for State Variability 2 Result ID12 Create VM Controller One of controller services Send signal to process Signal {SIGKILL, SIGTERM, SIGINT} Loop, read status, sleep States { networking, block_device_mapping, spawning} T1 Create VM B_xzy Controller nova-compute Send signal SIGKILL Wait for networking OK T2 Create VM B_jsk Controller nova-api Send signal SIGTERM Wait for block_device_mapping NOK T999 Rapporteur Generat e Report 1 I: data pipeline O: None N: 1/localhost Variability
  • 12. 12 l With improvements in processing power, network and storage technologies, cloud platforms have witnessed an unprecedented growth in complexity. l Due to the increase in complexity, the need to efficiently diagnose failures in cloud platforms has also risen. l Because of these challenges, cloud operators often develop new sets of tests to diagnose failures. l But as mentioned before, cloud platforms are continuously evolving. They undergo modification and frequent updates, and have periodic release cycles. Hence, the tests developed become outdated and there is a constant need to modify them when a new release is available. Therefore, this approach is costly for the cloud operators. Background - Requirement Work done In collaboration with Ankur Bhatia from Technical University of Munich (TUM), Germany
  • 13. 13 l As with most software, the validation of all the modules of a cloud platform is done through a test suite containing a large number of unit tests. It is a part of the software development process where the smallest testable part of an application, called unit, along with associated control data are individually and independently tested. l Executing unit tests is a very effective way to test code integration during the development of software. They are often executed to validate changes made by developers and to guarantee that the code is error free. l Although unit tests are extremely useful for the purpose of development and integration, they are not meant to diagnose failures. Solution
  • 14. 14 Fault Injection Plans Director Create VM 1 I: None O: VM name N: 1/controller Deployer Update DB 1 I: None O: None N: 1/controller Selector Get compute PPID 1 I: None O: (node, pid)+ N: 1/controllers mon-fri: 9h-17h, every 3h Observer Wait for State 1 I: VM name O: VM name, state N: 1/controller Injector Inject Fault 1 I: (node, pid)+ O: none Detector Check VM State? 1 I: VM name O: VM state N: 1/controller Cleaner Delete VM 1 I: VM name O: None N: 1/localhost VM_name Rapporteur Generat e Report 1 I: data pipeline O: None N: 1/localhost Unit tests. It is a part of the software development process where the smallest testable part of a cloud platform, called unit, along with associated control data are individually and independently tested. Failure diagnosis system. Diagnoses failures in cloud platforms making use of unit tests and helps to detect nonresponsive and failed services. Unit tests for failure diagnosis. Reuse the already developed unit tests to test a cloud platform at runtime.
  • 15. 15 Using unit tests for failure diagnosis presents a set of challenges l Unit tests do not provide any information about the nonresponsive or failed services in cloud platforms. p The execution of unit tests generates a list of passed and failed tests. This list can help to locate software errors or to find issues with individual modules of the code but cannot diagnose failures as there are no relationships between unit tests and services running on a cloud platform. l With the increase in codebase of cloud platforms, the number of unit tests also increases. Thus, it takes a considerable amount of time to execute them. Challenges Experiments l OpenStack has more than 1500 unit tests l Used for development and integration l Only validate APIs l Time consuming p E.g., unit tests to create a VM, uploading a large operating system image, etc., need a few minutes to execute. p It can take up to 3 to 4 hours to execute all unit tests. Considering the reliability requirements of 99.95%, cloud platforms can have a downtime of only 21.6 minutes per month. p Hence, the time required to execute unit tests is considerably high. l Not able to directly detect services that are not functioning as expected
  • 16. 16 The approach of 3 phases efficiently diagnoses a cloud platform l 1. Reduce the number of unit tests using the set cover algorithm l 2. Establish relationships between the reduced unit tests and the services running on a cloud platform. These relationships help to determine nonresponsive or failed services. This is done by simulating failure of services in cloud platforms and running unit tests in this environment. l 3. Construct a decision tree based on these relationships to select the unit tests that are most relevant to diagnose failures. The ID3 algorithm is used to construct a decision tree. General Approach Results l The approach diagnoses failures by running only 4-5% of the original unit tests.
  • 17. 17 l Unit Tests Reduction (Phase 1). Reduce the number of unit tests written during the code development l Service Mapping (Phase 2). Establish relationships between the reduced unit tests and services. It further reduces the number of unit tests based on these relationships l Sequential Diagnosis (Phase 3). Use relationships to construct a decision tree to select the most relevant unit tests to diagnose failures General Approach tempest.api.volume.admin.test_volume_types_negative.Vol umeTypesNegativeV2Test.test_create_with_empty_name tempest.api.volume.admin.test_volume_types_negative.Vol umeTypesNegativeV2Test.test_create_with_nonexistent_vo lume_type tempest.api.volume.admin.test_volume_types_negative.Vol umeTypesNegativeV2Test.test_delete_nonexistent_type_id tempest.api.volume.admin.test_volume_types_negative.Vol umeTypesNegativeV2Test.test_get_nonexistent_type_id tempest.api.volume.admin.test_volume_types_negative.Vol umeTypesNegativeV2Test.test_create_with_empty_name empest.api.volume.admin.test_volume_types_negative.Volu meTypesNegativeV2Test.test_create_with_nonexistent_vol ume_type tempest.api.volume.admin.test_volume_types_negative.Vol umeTypesNegativeV2Test.test_delete_nonexistent_type_idt empest.api.volume.admin.test_volume_types_negative.Volu meTypesNegativeV2Test.test_create_with_empty_name tempest.api.volume.admin.test_volume_types_negative.Vol umeTypesNegativeV2Test.test_create_with_nonexistent_vo lume_type tempest.api.volume.admin.test_volume_types_negative.Vol umeTypesNegativeV2Test.test_delete_nonexistent_type_id tempest.api.volume.admin.test_volume_types_negative.Vol umeTypesNegativeV2Test.test_get_nonexistent_type_id tempest.api.volume.admin.test_volume_types_negative.Vol umeTypesNegativeV2Test.test_create_with_empty_name empest.api.volume.admin.test_volume_types_negative.Volu meTypesNegativeV2Test.test_create_with_nonexistent_vol ume_type tempest.api.volume.admin.test_volume_types_negative.Vol umeTypesNegativeV2Test.test_delete_nonexistent_type_id Unit Tests Reduction tempest.api.volume.admin.test_volume_types_n egative.VolumeTypesNegativeV2Test.test_creat e_with_empty_name tempest.api.volume.admin.test_volume_types_n egative.VolumeTypesNegativeV2Test.test_creat e_with_nonexistent_volume_type tempest.api.volume.admin.test_volume_types_n egative.VolumeTypesNegativeV2Test.test_delet e_nonexistent_type_id tempest.api.volume.admin.test_volume_types_n egative.VolumeTypesNegativeV2Test.test_get_ nonexistent_type_id Service Mapping Sequential Diagnosis Reduced unit testsUnit tests Phase 1 Phase 2 Phase 3 Decision tree Faulty Service Service mapping 1 0 1 1 0 0 t1 s1 s2 .. .. 1 1tm .. sn
  • 18. 18 l Identify modules (1/5) p Identify the modules available in the unit test framework l Construct Abstract Syntax Tree (2/5) p For each module, an AST is constructed to establish a relationship between unit test methods and support methods l Filter AST (3/5) p All irrelevant support methods are filtered out from the AST using a Inverse Document Frequency l Support method deduplication (4/5) p Unit test methods that perform redundant operations are eliminated by finding the minimum set cover. Time to execute an unit test is the cost of a set l Cross module deduplication (5/5) p Unit test methods from one module are compared with methods of other modules and are further reduced Phase 1 - Unit Test Reduction List of all unit tests List of all modules Identify modules Construct AST Filter AST Support method de-duplication Abstract Syntax Tree Final reduced unit tests Eliminate redundant unit test methods across modules Eliminate redundant unit test methods within modules Reduced unit tests Filtered AST Unit test manager Identify irrelevant support methods Cross module de-duplication
  • 19. 19 l Identify modules (1/5) p A module is the source file which contains the definition of the unit test method l Construct AST (2/5) p For all the modules construct an AST l Traversed the AST to identify unit test methods and their support methods l For each unit test method, a set of its support methods is created: p test_create_flavor: {create_flavor, assertEqual} p test_delete_flavor: {create_flavor, delete_fl, assertTrue} Phase 1 (1-2/5) tempest.api.compute.flavors.test_flavors. Flavors. test_create_flavor Class name Unit test method Path of the module: tempest/api/compute/flavors/test_flavors.py Part 1 Part 2 Part 3 A simple AST class Flavors (base.Test): def test_create_flavor(self): new_flavor_id = self.create_flavor() self.assertEqual(new_flavor_id, flavor_id) def test_delete_flavor(self): flavor = self.create_flavor(flavor_name) del_flavor = self.delete_fl(flavor) self.assertTrue(some_statement) A Python module (test_flavor.py) containing the implementations of unit tests test_flavor.py Flavors test_create_ flavor test_delete_ flavor Assert Equal create_ flavor delete_fl assertTrue create_ flavor Module Class Unit test methods Support methods Note: Apart from the nodes shown in the figure, an AST contains other nodes like variables, constructors, etc. Since they are not relevant to us, they are excluded for simplicity. Identify Module & Construct AST
  • 20. 20 Phase 1 (3/5) l Support methods are eliminated p Some of the support methods are specific to the unit test framework p E.g., assertEquals l Use Inverted Document Frequency (IDF) technique p Higher occurrence -> lower IDF l Calculated IDF for support methods p Set of documents = ASTs of modules l Eliminate support methods having a IDF below a threshold alpha p These support methods identified do not play any role in the reduction of the unit tests p They are irrelevant Module 1 Class A: def U1(self): Assert() S1() def U2(self): S2() S3() Class B: def U3(self): S2() S3() AST construction Unit test methods Support Methods (words) Class Class U1 Module S1 S2 S1 U2 U3 Asser t Filter AST
  • 21. 21 Phase 1 (3/5) IDF(x) = log (# documents/ # documents with word x) Unit test methods Support Methods (words) Document 1 Document 2 Document 3 Document 4 Class Class U1 Module S1 S2 S1 U2 U3 Class Class U4 Module S3 Asser t S2 S3 U5 U6 Class Class U7 Module S5 Asser t S5 S5 U8 U9 Class Class U10 Module S6 Asser t S7 S18 U11 U12 Asser t § The support methods Assert is present in all the modules -> its IDF is log(4/4) = 0 § Similarly, IDF for S1 is log(4), S2 is log(2) § Thus, Assert is irrelevant. It is eliminated from the list of support methods.
  • 22. 22 Phase 1 (4/5) l Each module has unit test methods which call support methods l In most cases, a subset of the unit test methods call all the support methods l Thus, some unit test methods are redundant l Prune the ASTs constructed p Pruning is done using the minimum set cover l Two important parameters: cost and coverage p Cost is the time a unit test takes to execute p Coverage is the percentage of support methods to consider in the set cover Support method deduplication
  • 23. 23 Phase 1 (4/5) Support method (S) and time in seconds for each Unit test method (U) U1 : {S1 , S3}, 15 U2 : {S1 , S3,} ,8 U3 : {S2}, 9 Class A U1 Module 1 S1 U2 U3 Class B S1 S3 S2S3 Loop 1: a1 = 15/2 a2 = 8/2 a3 = 9/1 I = {S1, S3} Class A Module 1 U2 Class B S1 S3 Loop 1: a1 = 15/2 a2 = 8/2 a3 = 9/1 I = {S1, S3} Loop 2: a1 = 15/0 a2 = 8/0 a3 = 9/1 I = {S1, S2, S3} Class A Module 1 U2 U3 Class B S1 S3 S2 Minimum coverage requested = 100 % U = {S1 , S2 , S3} Required set = all elements of U Minimum coverage requested = 50 % U = {S1 , S2 , S3} Required set = é(50/100 ) * 3ù = 2 Coverage obtained: 66.6 %Coverage obtained: 100 % In case 2, the minimum requested is 50% so any 2 support methods out of 3 are sufficient. U2 contains 2 distinct support methods and has the least cost. Hence, U1 and U3 are eliminated. In case 1, minimum coverage is100 %, so all the support methods must be present. U2 and U3 covers all the distinct support methods and has the least cost. Hence, U1 is eliminated.
  • 24. 24 Phase 1 (5/5) U5 : {S4, S5} U10 : {S4} U11 : {S10, S4} U12 : {S12, S5} § The support method de-duplication subsystem guarantees that {S4, S5, S10, S12} are called by the subset of the unit test methods in Module 4 § All the support methods called by U5 (i.e., {S4, S5}) are already covered in Module 4. § Hence, U5 is redundant and can be eliminated Cross module deduplication § The AST pruning reduces the unit test methods by finding the minimum set cover of the support methods in the same module. However, some unit test methods are covered in other modules § Thus, the reduction can be improved without losing the coverage § Cross module deduplication compares the support methods of a unit test method from one module with the universe of the support methods of other modules Module 4 Class G: def U10(self): Assert() S4() def U11(self): S10() S4() Class H: def U12(self): S12() S5() Module 2 Class C: def U4(self): Assert() S2() def U5(self): S4() S5() Class D: def U6(self): S4() S7() Our experiment with OpenStack enabled us to reduce 1391 unit tests created by developers to 538 tests with 100% coverage
  • 25. 25 Phase 2 Reduced set of unit tests from Phase 1 Unit test and service mapping Isomorphic unit test elimination Test service mapper Matrix representing relationships between unit tests and testability of services Reduced matrix representing relationships between unit tests and testability of services Service Mapping One of the major challenges of using unit tests for failure diagnosis is the lack of relationships between unit tests and services running on cloud platforms. The execution of unit tests outputs a list of passed, failed and skipped tests. § Establish a relation between the unit tests and the services they can test § Isomorphic unit test elimination unittestUi unittestUi Experiment: 20 OpenStack services running and 538 unit tests after reduction from Phase 1. At the end of Phase 2, the isomorphic test elimination procedure reduced the number of test to 25. For each service s: 1. Disable s (stop/shutdown) to simulate a failure of a service in a cloud platform 2. Run reduced set of unit tests. Unit tests that depend on service s will fail. Record result. 3. Enable s (start) The execution time for unit tests U1 is 10 seconds, U2 is 9 seconds, U3 is 13 seconds, U4 is 19 seconds and U5 is 3 seconds. Hence, U1, U5 and U2 are selected
  • 26. 26 Phase 3 unit test 3 unit test 2 unit test 4 unit test 1 unit test 5 service 1 service 2 service 5 service 5 service 3 service 1 service 4 fail fail fail fail fail pass pass pass pass Sequential Diagnosis The main task of this phase is to analyze these relationships by constructing a decision tree and use the tree to detect failed service(s) in a cloud platform in operation. Moreover, the failed services are detected without executing all the unit tests. § Construction of the decision tree § Execution of unit tests based on the decision tree The tree of the Figure (right) § Execute unit test 3 § If it fails, unit test 2 is executed § The failure of unit test 2 indicates that service 1 is in a failed state In some cases, there might be more than one possible service in a failed state § If unit test 2 passes, then either service 2 or service 5 or both are in a failed state A failed service(s) can be detected by running log(n) number of unit tests instead of n, the total number of unit tests after eliminating the isomorphic unit tests. Hence, the tree enables the users to detect the failed service(s) automatically without executing all the unit tests.
  • 27. 27 Results We evaluated the approach with OpenStack = 1391 unit tests Phase 1. Reduce the number of unit tests to 538 with 100% coverage Phase 2. Established relationships between the reduced set of unit tests and the OpenStack services. There were 20 services. Eliminated isomorphic unit tests. At the end of Phase 2, we were able to reduce the number of unit tests to 25. Phase 3. Build decision tree using the relationships established. 5 unit tests on average. 1391 538 25 5
  • 28. 28 Open Positions and CFP l FTE, PostDoc, PhD Students p Fault injection, fault models, fault libraries, fault plans, brake and rebuild systems all day long, … p Cloud Operations and Analytics for High Availability p AI, machine learning, data mining, and time-series analysis for Cloud operations
  • 29. Copyright©2015 Huawei Technologies Co., Ltd. All Rights Reserved. The information in this document may contain predictive statements including, without limitation, statements regarding the future financial and operating results, future product portfolio, new technology, etc. There are a number of factors that could cause actual results and developments to differ materially from those expressed or implied in the predictive statements. Therefore, such information is provided for reference purpose only and constitutes neither an offer nor an acceptance. Huawei may change the information at any time without notice. HUAWEI ENTERPRISE ICT SOLUTIONS A BETTER WAY