SlideShare ist ein Scribd-Unternehmen logo
1 von 27
Downloaden Sie, um offline zu lesen
Design for Recovery,York EngD Programme, 2010 	

Slide 1	

Design for recovery
	

Prof. Ian Sommerville
Design for Recovery,York EngD Programme, 2010 	

Slide 2	

Objectives	

•  To discuss the notion of ‘failure’ in software systems	

•  To explain why this conventional notion of ‘failure’ is not
appropriate for many LSCITS	

•  To propose an approach to failure management in LSCITS
based on recoverability rather than failure avoidance
Design for Recovery,York EngD Programme, 2010 	

Slide 3	

Complex IT systems	

•  Organisational systems that support different functions
within an organisation	

•  Can usually be considered as systems of systems, ie
different parts are systems in their own right	

•  Usually distributed and normally constructed by
integrating existing systems/components/services	

•  Not subject to limitations derived from the laws of
physics (so, no natural constraints on their size)	

•  Data intensive, with very long lifetime data	

•  An integral part of wider socio-technical systems
Design for Recovery,York EngD Programme, 2010 	

Slide 4	

What is failure?	

•  From a reductionist perspective, a failure can be
considered to be ‘a deviation from a specification’.	

•  An oracle can examine a specification and observe a
system’s behaviour and detect failures.	

•  Failure is an absolute - the system has either failed or it
hasn’t	

•  Of course, some failures are more serious than others; it
is widely accepted that failures with minor consequences
are to be expected and tolerated
Design for Recovery,York EngD Programme, 2010 	

Slide 5	

A question to the audience	

•  A hospital system is designed to maintain information about available
beds for incoming patients and to provide information about the
number of beds to the admissions unit. 	

•  It is assumed that the hospital has a number of empty beds and this
changes over time. The variable B reflects the number of empty beds
known to the system.	

•  Sometimes the system reports that the number of empty beds is the
actual number available; sometimes the system reports that fewer
than the actual number are available . 	

•  In circumstances where the system reports that an incorrect number
of beds are available, is this a failure?
Design for Recovery,York EngD Programme, 2010 	

Slide 6	

Bed management system	

•  The percentage of system users who considered the
system’s incorrect reporting of the number of available
beds to be a failure was 0%.	

•  Mostly, the number did not matter so long as it was
greater than 1. What mattered was whether or not
patients could be admitted to the hospital.	

•  When the hospital was very busy (available beds = 0),
then people understood that it was practically impossible
for the system to be accurate. 	

•  They used other methods to find out whether or not a
bed was available for an incoming patient.
Design for Recovery,York EngD Programme, 2010 	

Slide 7	

Failure is a judgement	

•  Specifications are a simplification of reality. 	

•  Users don’t read and don’t care about specifications	

•  Whether or not system behaviour should be considered to
be a failure, depends on the judgement of an observer of that
behaviour	

•  This judgement depends on:	

•  The observer’s expectations	

•  The observer’s knowledge and experience	

•  The observer’s role	

•  The observer’s context or situation	

•  The observer’s authority
Design for Recovery,York EngD Programme, 2010 	

Slide 8	

System failure	

•  ‘Failures’ are not just catastrophic events but normal,
everyday system behaviour that disrupts normal work and
that mean that people have to spend more time on a task
than necessary	

•  A system failure occurs when a direct or indirect user of
a system has to carry out extra work, over and above
that normally required to carry out some task, in
response to some inappropriate system behaviour	

•  This extra work constitutes the cost of recovery from
system failure
Design for Recovery,York EngD Programme, 2010 	

Slide 9	

Failures are inevitable	

•  Technical reasons	

•  When systems are composed of opaque and uncontrolled components,
the behaviour of these components cannot be completely understood	

•  Failures often can be considered to be failures in data rather than failures
in behaviour	

•  Socio-technical reasons	

•  Changing contexts of use mean that the judgement on what constitutes a
failure changes as the effectiveness of the system in supporting work
changes	

•  Different stakeholders will interpret the same behaviour in different
ways because of different interpretations of ‘the problem’
Design for Recovery,York EngD Programme, 2010 	

Slide 10	

Conflict inevitability	

•  Impossible to establish a set of requirements where
stakeholder conflicts are all resolved	

•  Therefore, successful operation of a system for one set of
stakeholders will inevitably mean ‘failure’ for another set
of stakeholders	

•  Groups of stakeholders in organisations are often in
perennial conflict (e.g. managers and clinicians in a
hospital). The support delivered by a system depends on
the power held at some time by a stakeholder group.
Design for Recovery,York EngD Programme, 2010 	

Slide 11	

Where are we?	

•  Large-scale information systems are inevitably complex
systems	

•  Such systems cannot be created using a reductionist
approach	

•  Failures are a judgement and this may change over time	

•  Failures are inevitable and cannot be engineered out of a
system
Design for Recovery,York EngD Programme, 2010 	

Slide 12	

The way forward	

•  Software design has to be seen as part of a wider process
of LSCITS engineering	

•  We need to accept that technical system failures will
always occur and examine how we can design these
systems to allow the broader socio-technical systems, in
which these technical systems are used, to recognise,
diagnose and recover from these failures
Design for Recovery,York EngD Programme, 2010 	

Slide 13	

Software dependability	

•  A reductionist approach to software dependability takes the view that
software failures are a consequence of software faults	

•  Techniques to improve dependability include	

•  Fault avoidance	

•  Fault detection	

•  Fault tolerance	

•  These approaches have taken us quite a long way in improving
software dependability. However, further progress is unlikely to be
achieved by further improvement of these techniques as they rely on
a reductionist view of failure.
Design for Recovery,York EngD Programme, 2010 	

Slide 14	

Failure recovery	

•  Recognition	

•  Recognise that inappropriate behaviour has occurred	

•  Hypothesis	

•  Formulate an explanation for the unexpected behaviour	

•  Recovery	

•  Take steps to compensate for the problem that has arisen
Design for Recovery,York EngD Programme, 2010 	

Slide 15	

Coping with failure	

•  Socio-technical systems are remarkably robust because
people are good at coping with unexpected situations
when things go wrong.	

•  We have the unique ability to apply previous experience from
different areas to unseen problems.	

•  Individuals can take the initiative, adopt responsibilities and,
where necessary, break the rules or step outside the normal
process of doing things.	

•  People can prioritise and focus on the essence of a problem
Design for Recovery,York EngD Programme, 2010 	

Slide 16	

Recovering from failure	

•  Local knowledge 	

•  Who to call; who knows what; where things are	

•  Process reconfiguration	

•  Doing things in a different way from that defined in the ‘standard’ process	

•  Work-arounds, breaking the rules (safe violations)	

•  Redundancy and diversity	

•  Maintaining copies of information in different forms from that maintained
in a software system	

•  Informal information annotation	

•  Using multiple communication channels	

•  Trust	

•  Relying on others to cope
Design for Recovery,York EngD Programme, 2010 	

Slide 17	

Design for recovery	

•  The aim of a strategy of design for recovery is to:	

•  Ensure that system design decisions do not increase the amount of
recovery work required	

•  Make system design decisions that make it easier to recover from
problems (i.e. reduce extra work required)	

•  Earlier recognition of problems	

•  Visibility to make hypotheses easier to formulate	

•  Flexibility to support recovery actions	

•  Designing for recovery is an holistic approach to system design and not
(just) the identification of ‘recovery requirements’ 	

•  Should support the natural ability of people and organisations to cope with
problems
Design for Recovery,York EngD Programme, 2010 	

Slide 18	

Problems 	

	

•  Security and recoverability	

•  Automation hiding	

•  Process tyranny	

•  Multi-organisational systems
Design for Recovery,York EngD Programme, 2010 	

Slide 19	

Security and recoverability	

•  There is an inherent tension between security and
recoverability	

•  Recoverability	

•  Relies on trusting operators of the system not to abuse privileges
that they may have been granted to help recover from problems	

•  Security	

•  Relies on mistrusting users and restricting access to information
on a ‘need to know’ basis
Design for Recovery,York EngD Programme, 2010 	

Slide 20	

Automation hiding	

•  A problem with automation is that information becomes subject to
organizational policies that restrict access to that information.	

•  Even when access is not restricted, we don’t have any shared culture
in how to organise a large information store	

•  Generally, authorisation models maintained by the system are based
on normal rather than exceptional operation.	

•  When problems arise and/or when people are unavailable, breaking
the rules to solve these problems is made more difficult.
Design for Recovery,York EngD Programme, 2010 	

Slide 21	

Process tyranny	

•  Increasingly, there is a notion that ‘standard’ business
processes can be defined and embedded in systems that
support these processes	

•  Implicitly or explicitly, the system enforces the use of the
‘standard’ process	

•  But this assumes three things:	

•  The standard process is always appropriate	

•  The standard process has anticipated all possible failures	

•  The system can be respond in a timely way to process changes
Design for Recovery,York EngD Programme, 2010 	

Slide 22	

Multi-organisational systems	

•  Many rules enforced in different ways by different systems.	

•  No single manager or owner of the system . Who do you call when
failures occur?	

•  Information is distributed - users may not be aware of where
information is located, who owns information, etc.	

•  Processes involve remote actors so process reconfiguration is more
difficult	

•  Restricted information channels (e.g. help unavailable outside normal
business hours; no phone numbers published, etc.)	

•  Lack of trust. Owners of components will blame other components
for system failure. Learning is inhibited and trust compromised.
Design for Recovery,York EngD Programme, 2010 	

Slide 23	

Design guidelines	

•  Local knowledge	

•  Process reconfiguration	

•  Redundancy and diversity
Design for Recovery,York EngD Programme, 2010 	

Slide 24	

Local knowledge	

•  Local knowledge includes knowledge of who does what,
how authority structures can be bypassed, what rules can
be broken, etc.	

•  Impossible to replicate entirely in distributed systems but
some steps can be taken	

•  Maintain information about the provenance of data	

•  Who provided the data, where the data came from, when it
was created, edited, etc.	

•  Maintain organisational models	

•  Who is responsible for what, contact details
Design for Recovery,York EngD Programme, 2010 	

Slide 25	

Process reconfiguration	

•  Make workflows explicit rather than embedding them in the software	

•  Not just ‘continue’ buttons! Users should know where they are and
where they are supposed to go	

•  Support workflow navigation/interruption/restart	

•  Design systems with an ‘emergency mode’ where the the system
changes from enforcing policies to auditing actions	

•  This would allow the rules to be broken but the system would maintain a
log of what has been done and why so that subsequent investigations
could trace what happened	

•  Support ‘Help, I’m in trouble!’ as well as ‘Help, I need information?’
Design for Recovery,York EngD Programme, 2010 	

Slide 26	

Redundancy and diversity	

•  Maintaining a single ‘golden copy’ of data may be efficient but it may
not be effective or desirable	

•  Encourage the creation of ‘shadow systems’ and provide import
and export from these systems	

•  Allow schemas to be extended	

•  Schemas for data are rarely designed for problem solving. Always
allow informal extension (a free text box) so that annotations,
explanations and additional information can be provided	

•  Maintain organisational models	

•  To allow for multi-channel communications when things go wrong
Design for Recovery,York EngD Programme, 2010 	

Slide 27	

Summary	

•  A reductionist approach to software engineering is no longer viable.
on its own, for complex systems engineering	

•  Improving existing software engineering methods will help but will
not deal with the problems of complexity that are inherent in
distributed systems of systems	

•  We must learn to live with normal, everyday failures	

•  Design for recovery involves designing so that the work required to
recover from a failure is minimised	

•  Recovery strategies include supporting information redundancy and
annotation and maintaining organisational models

Weitere ähnliche Inhalte

Was ist angesagt?

Intelligent Decision Support Systems
Intelligent Decision Support SystemsIntelligent Decision Support Systems
Intelligent Decision Support Systems
Gildardo Sanchez-Ante
 
Dialogue management system
Dialogue management systemDialogue management system
Dialogue management system
Mayank Agarwal
 
Artificial intelligence
Artificial intelligenceArtificial intelligence
Artificial intelligence
Kitty Soso
 
Project management
Project managementProject management
Project management
David Terry
 
The Usability of everyday things
The Usability of everyday thingsThe Usability of everyday things
The Usability of everyday things
Rosa Quiroga
 
Decision Support and Knowledge Based Systems
Decision Support and Knowledge Based SystemsDecision Support and Knowledge Based Systems
Decision Support and Knowledge Based Systems
Jivan Nepali
 
GDSS Group Decision Support System
GDSS Group Decision Support SystemGDSS Group Decision Support System
GDSS Group Decision Support System
Enaam Alotaibi
 

Was ist angesagt? (20)

Unit 1 DSS
Unit 1 DSSUnit 1 DSS
Unit 1 DSS
 
Intelligent Decision Support Systems
Intelligent Decision Support SystemsIntelligent Decision Support Systems
Intelligent Decision Support Systems
 
LSCITS-engineering
LSCITS-engineeringLSCITS-engineering
LSCITS-engineering
 
Dit yvol3iss41
Dit yvol3iss41Dit yvol3iss41
Dit yvol3iss41
 
Dialogue management system
Dialogue management systemDialogue management system
Dialogue management system
 
Mark Dean Notes
Mark Dean NotesMark Dean Notes
Mark Dean Notes
 
Decision support system
Decision support systemDecision support system
Decision support system
 
Decision support systems, group decision support systems,expert systems-manag...
Decision support systems, group decision support systems,expert systems-manag...Decision support systems, group decision support systems,expert systems-manag...
Decision support systems, group decision support systems,expert systems-manag...
 
dss
dssdss
dss
 
Scenario based recovery metrics
Scenario based recovery metricsScenario based recovery metrics
Scenario based recovery metrics
 
Artificial intelligence
Artificial intelligenceArtificial intelligence
Artificial intelligence
 
Slide01 introductory
Slide01 introductorySlide01 introductory
Slide01 introductory
 
ISCRAM 2013: Applying ISO 9241-110 Dialogue Principles to Tablet Applications...
ISCRAM 2013: Applying ISO 9241-110 Dialogue Principles to Tablet Applications...ISCRAM 2013: Applying ISO 9241-110 Dialogue Principles to Tablet Applications...
ISCRAM 2013: Applying ISO 9241-110 Dialogue Principles to Tablet Applications...
 
Gdss
GdssGdss
Gdss
 
Project management
Project managementProject management
Project management
 
St josephs project management
St josephs project managementSt josephs project management
St josephs project management
 
The Usability of everyday things
The Usability of everyday thingsThe Usability of everyday things
The Usability of everyday things
 
Decision Support and Knowledge Based Systems
Decision Support and Knowledge Based SystemsDecision Support and Knowledge Based Systems
Decision Support and Knowledge Based Systems
 
GDSS Group Decision Support System
GDSS Group Decision Support SystemGDSS Group Decision Support System
GDSS Group Decision Support System
 
IDAS and the Accounting Professional
IDAS and the Accounting ProfessionalIDAS and the Accounting Professional
IDAS and the Accounting Professional
 

Ähnlich wie Designing Complex Systems for Recovery (LSCITS EngD 2011)

system development life cycle
system development life cycle system development life cycle
system development life cycle
Sumit Yadav
 
CS 5032 L4 requirements engineering 2013
CS 5032 L4 requirements engineering 2013CS 5032 L4 requirements engineering 2013
CS 5032 L4 requirements engineering 2013
Ian Sommerville
 

Ähnlich wie Designing Complex Systems for Recovery (LSCITS EngD 2011) (20)

System success and failure
System success and failureSystem success and failure
System success and failure
 
Dependability requirements for LSCITS
Dependability requirements for LSCITSDependability requirements for LSCITS
Dependability requirements for LSCITS
 
Responsibility Modelling
Responsibility ModellingResponsibility Modelling
Responsibility Modelling
 
L7 Design For Recovery
L7 Design For RecoveryL7 Design For Recovery
L7 Design For Recovery
 
Towards Dementia-friendly Smart Home
Towards Dementia-friendly Smart HomeTowards Dementia-friendly Smart Home
Towards Dementia-friendly Smart Home
 
system development life cycle
system development life cycle system development life cycle
system development life cycle
 
Requirements Engineering for LSCITS
Requirements Engineering for LSCITSRequirements Engineering for LSCITS
Requirements Engineering for LSCITS
 
Usability Evaluation
Usability EvaluationUsability Evaluation
Usability Evaluation
 
CS 5032 L4 requirements engineering 2013
CS 5032 L4 requirements engineering 2013CS 5032 L4 requirements engineering 2013
CS 5032 L4 requirements engineering 2013
 
Requirement Gathering
Requirement GatheringRequirement Gathering
Requirement Gathering
 
REQUIREMENT ENGINEERING
REQUIREMENT ENGINEERINGREQUIREMENT ENGINEERING
REQUIREMENT ENGINEERING
 
Requirementengg
RequirementenggRequirementengg
Requirementengg
 
SRE.pptx
SRE.pptxSRE.pptx
SRE.pptx
 
5. SE RequirementEngineering task.ppt
5. SE RequirementEngineering task.ppt5. SE RequirementEngineering task.ppt
5. SE RequirementEngineering task.ppt
 
Requirements reality
Requirements realityRequirements reality
Requirements reality
 
BA and Beyond 19 Sponsor spotlight - Namahn - Beating complexity with complexity
BA and Beyond 19 Sponsor spotlight - Namahn - Beating complexity with complexityBA and Beyond 19 Sponsor spotlight - Namahn - Beating complexity with complexity
BA and Beyond 19 Sponsor spotlight - Namahn - Beating complexity with complexity
 
Usability requirements
Usability requirements Usability requirements
Usability requirements
 
Unit 3_Evaluation Technique.pptx
Unit 3_Evaluation Technique.pptxUnit 3_Evaluation Technique.pptx
Unit 3_Evaluation Technique.pptx
 
Ai lecture 06 applications of es
Ai lecture 06 applications of esAi lecture 06 applications of es
Ai lecture 06 applications of es
 
Comp10 unit10 lecture_slides
Comp10 unit10 lecture_slidesComp10 unit10 lecture_slides
Comp10 unit10 lecture_slides
 

Mehr von Ian Sommerville

Security case buffer overflow
Security case buffer overflowSecurity case buffer overflow
Security case buffer overflow
Ian Sommerville
 
CS5032 Case study Ariane 5 launcher failure
CS5032 Case study Ariane 5 launcher failureCS5032 Case study Ariane 5 launcher failure
CS5032 Case study Ariane 5 launcher failure
Ian Sommerville
 
CS5032 Case study Kegworth air disaster
CS5032 Case study Kegworth air disasterCS5032 Case study Kegworth air disaster
CS5032 Case study Kegworth air disaster
Ian Sommerville
 
CS5032 L19 cybersecurity 1
CS5032 L19 cybersecurity 1CS5032 L19 cybersecurity 1
CS5032 L19 cybersecurity 1
Ian Sommerville
 
CS5032 L20 cybersecurity 2
CS5032 L20 cybersecurity 2CS5032 L20 cybersecurity 2
CS5032 L20 cybersecurity 2
Ian Sommerville
 
L17 CS5032 critical infrastructure
L17 CS5032 critical infrastructureL17 CS5032 critical infrastructure
L17 CS5032 critical infrastructure
Ian Sommerville
 
CS5032 Case study Maroochy water breach
CS5032 Case study Maroochy water breachCS5032 Case study Maroochy water breach
CS5032 Case study Maroochy water breach
Ian Sommerville
 
CS 5032 L18 Critical infrastructure 2: SCADA systems
CS 5032 L18 Critical infrastructure 2: SCADA systemsCS 5032 L18 Critical infrastructure 2: SCADA systems
CS 5032 L18 Critical infrastructure 2: SCADA systems
Ian Sommerville
 
CS5032 L9 security engineering 1 2013
CS5032 L9 security engineering 1 2013CS5032 L9 security engineering 1 2013
CS5032 L9 security engineering 1 2013
Ian Sommerville
 
CS5032 L10 security engineering 2 2013
CS5032 L10 security engineering 2 2013CS5032 L10 security engineering 2 2013
CS5032 L10 security engineering 2 2013
Ian Sommerville
 
CS5032 L11 validation and reliability testing 2013
CS5032 L11 validation and reliability testing 2013CS5032 L11 validation and reliability testing 2013
CS5032 L11 validation and reliability testing 2013
Ian Sommerville
 
CS 5032 L12 security testing and dependability cases 2013
CS 5032 L12  security testing and dependability cases 2013CS 5032 L12  security testing and dependability cases 2013
CS 5032 L12 security testing and dependability cases 2013
Ian Sommerville
 
CS 5032 L7 dependability engineering 2013
CS 5032 L7 dependability engineering 2013CS 5032 L7 dependability engineering 2013
CS 5032 L7 dependability engineering 2013
Ian Sommerville
 
CS 5032 L6 reliability and security specification 2013
CS 5032 L6 reliability and security specification 2013CS 5032 L6 reliability and security specification 2013
CS 5032 L6 reliability and security specification 2013
Ian Sommerville
 
CS 5032 L5 safety specification 2013
CS 5032 L5 safety specification 2013CS 5032 L5 safety specification 2013
CS 5032 L5 safety specification 2013
Ian Sommerville
 
CS 5032 L3 socio-technical systems 2013
CS 5032 L3 socio-technical systems 2013CS 5032 L3 socio-technical systems 2013
CS 5032 L3 socio-technical systems 2013
Ian Sommerville
 

Mehr von Ian Sommerville (20)

Conceptual systems design
Conceptual systems designConceptual systems design
Conceptual systems design
 
An introduction to LSCITS
An introduction to LSCITSAn introduction to LSCITS
An introduction to LSCITS
 
Internet worm-case-study
Internet worm-case-studyInternet worm-case-study
Internet worm-case-study
 
Designing software for a million users
Designing software for a million usersDesigning software for a million users
Designing software for a million users
 
Security case buffer overflow
Security case buffer overflowSecurity case buffer overflow
Security case buffer overflow
 
CS5032 Case study Ariane 5 launcher failure
CS5032 Case study Ariane 5 launcher failureCS5032 Case study Ariane 5 launcher failure
CS5032 Case study Ariane 5 launcher failure
 
CS5032 Case study Kegworth air disaster
CS5032 Case study Kegworth air disasterCS5032 Case study Kegworth air disaster
CS5032 Case study Kegworth air disaster
 
CS5032 L19 cybersecurity 1
CS5032 L19 cybersecurity 1CS5032 L19 cybersecurity 1
CS5032 L19 cybersecurity 1
 
CS5032 L20 cybersecurity 2
CS5032 L20 cybersecurity 2CS5032 L20 cybersecurity 2
CS5032 L20 cybersecurity 2
 
L17 CS5032 critical infrastructure
L17 CS5032 critical infrastructureL17 CS5032 critical infrastructure
L17 CS5032 critical infrastructure
 
CS5032 Case study Maroochy water breach
CS5032 Case study Maroochy water breachCS5032 Case study Maroochy water breach
CS5032 Case study Maroochy water breach
 
CS 5032 L18 Critical infrastructure 2: SCADA systems
CS 5032 L18 Critical infrastructure 2: SCADA systemsCS 5032 L18 Critical infrastructure 2: SCADA systems
CS 5032 L18 Critical infrastructure 2: SCADA systems
 
CS5032 L9 security engineering 1 2013
CS5032 L9 security engineering 1 2013CS5032 L9 security engineering 1 2013
CS5032 L9 security engineering 1 2013
 
CS5032 L10 security engineering 2 2013
CS5032 L10 security engineering 2 2013CS5032 L10 security engineering 2 2013
CS5032 L10 security engineering 2 2013
 
CS5032 L11 validation and reliability testing 2013
CS5032 L11 validation and reliability testing 2013CS5032 L11 validation and reliability testing 2013
CS5032 L11 validation and reliability testing 2013
 
CS 5032 L12 security testing and dependability cases 2013
CS 5032 L12  security testing and dependability cases 2013CS 5032 L12  security testing and dependability cases 2013
CS 5032 L12 security testing and dependability cases 2013
 
CS 5032 L7 dependability engineering 2013
CS 5032 L7 dependability engineering 2013CS 5032 L7 dependability engineering 2013
CS 5032 L7 dependability engineering 2013
 
CS 5032 L6 reliability and security specification 2013
CS 5032 L6 reliability and security specification 2013CS 5032 L6 reliability and security specification 2013
CS 5032 L6 reliability and security specification 2013
 
CS 5032 L5 safety specification 2013
CS 5032 L5 safety specification 2013CS 5032 L5 safety specification 2013
CS 5032 L5 safety specification 2013
 
CS 5032 L3 socio-technical systems 2013
CS 5032 L3 socio-technical systems 2013CS 5032 L3 socio-technical systems 2013
CS 5032 L3 socio-technical systems 2013
 

Kürzlich hochgeladen

EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
Earley Information Science
 

Kürzlich hochgeladen (20)

Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 

Designing Complex Systems for Recovery (LSCITS EngD 2011)

  • 1. Design for Recovery,York EngD Programme, 2010 Slide 1 Design for recovery Prof. Ian Sommerville
  • 2. Design for Recovery,York EngD Programme, 2010 Slide 2 Objectives •  To discuss the notion of ‘failure’ in software systems •  To explain why this conventional notion of ‘failure’ is not appropriate for many LSCITS •  To propose an approach to failure management in LSCITS based on recoverability rather than failure avoidance
  • 3. Design for Recovery,York EngD Programme, 2010 Slide 3 Complex IT systems •  Organisational systems that support different functions within an organisation •  Can usually be considered as systems of systems, ie different parts are systems in their own right •  Usually distributed and normally constructed by integrating existing systems/components/services •  Not subject to limitations derived from the laws of physics (so, no natural constraints on their size) •  Data intensive, with very long lifetime data •  An integral part of wider socio-technical systems
  • 4. Design for Recovery,York EngD Programme, 2010 Slide 4 What is failure? •  From a reductionist perspective, a failure can be considered to be ‘a deviation from a specification’. •  An oracle can examine a specification and observe a system’s behaviour and detect failures. •  Failure is an absolute - the system has either failed or it hasn’t •  Of course, some failures are more serious than others; it is widely accepted that failures with minor consequences are to be expected and tolerated
  • 5. Design for Recovery,York EngD Programme, 2010 Slide 5 A question to the audience •  A hospital system is designed to maintain information about available beds for incoming patients and to provide information about the number of beds to the admissions unit. •  It is assumed that the hospital has a number of empty beds and this changes over time. The variable B reflects the number of empty beds known to the system. •  Sometimes the system reports that the number of empty beds is the actual number available; sometimes the system reports that fewer than the actual number are available . •  In circumstances where the system reports that an incorrect number of beds are available, is this a failure?
  • 6. Design for Recovery,York EngD Programme, 2010 Slide 6 Bed management system •  The percentage of system users who considered the system’s incorrect reporting of the number of available beds to be a failure was 0%. •  Mostly, the number did not matter so long as it was greater than 1. What mattered was whether or not patients could be admitted to the hospital. •  When the hospital was very busy (available beds = 0), then people understood that it was practically impossible for the system to be accurate. •  They used other methods to find out whether or not a bed was available for an incoming patient.
  • 7. Design for Recovery,York EngD Programme, 2010 Slide 7 Failure is a judgement •  Specifications are a simplification of reality. •  Users don’t read and don’t care about specifications •  Whether or not system behaviour should be considered to be a failure, depends on the judgement of an observer of that behaviour •  This judgement depends on: •  The observer’s expectations •  The observer’s knowledge and experience •  The observer’s role •  The observer’s context or situation •  The observer’s authority
  • 8. Design for Recovery,York EngD Programme, 2010 Slide 8 System failure •  ‘Failures’ are not just catastrophic events but normal, everyday system behaviour that disrupts normal work and that mean that people have to spend more time on a task than necessary •  A system failure occurs when a direct or indirect user of a system has to carry out extra work, over and above that normally required to carry out some task, in response to some inappropriate system behaviour •  This extra work constitutes the cost of recovery from system failure
  • 9. Design for Recovery,York EngD Programme, 2010 Slide 9 Failures are inevitable •  Technical reasons •  When systems are composed of opaque and uncontrolled components, the behaviour of these components cannot be completely understood •  Failures often can be considered to be failures in data rather than failures in behaviour •  Socio-technical reasons •  Changing contexts of use mean that the judgement on what constitutes a failure changes as the effectiveness of the system in supporting work changes •  Different stakeholders will interpret the same behaviour in different ways because of different interpretations of ‘the problem’
  • 10. Design for Recovery,York EngD Programme, 2010 Slide 10 Conflict inevitability •  Impossible to establish a set of requirements where stakeholder conflicts are all resolved •  Therefore, successful operation of a system for one set of stakeholders will inevitably mean ‘failure’ for another set of stakeholders •  Groups of stakeholders in organisations are often in perennial conflict (e.g. managers and clinicians in a hospital). The support delivered by a system depends on the power held at some time by a stakeholder group.
  • 11. Design for Recovery,York EngD Programme, 2010 Slide 11 Where are we? •  Large-scale information systems are inevitably complex systems •  Such systems cannot be created using a reductionist approach •  Failures are a judgement and this may change over time •  Failures are inevitable and cannot be engineered out of a system
  • 12. Design for Recovery,York EngD Programme, 2010 Slide 12 The way forward •  Software design has to be seen as part of a wider process of LSCITS engineering •  We need to accept that technical system failures will always occur and examine how we can design these systems to allow the broader socio-technical systems, in which these technical systems are used, to recognise, diagnose and recover from these failures
  • 13. Design for Recovery,York EngD Programme, 2010 Slide 13 Software dependability •  A reductionist approach to software dependability takes the view that software failures are a consequence of software faults •  Techniques to improve dependability include •  Fault avoidance •  Fault detection •  Fault tolerance •  These approaches have taken us quite a long way in improving software dependability. However, further progress is unlikely to be achieved by further improvement of these techniques as they rely on a reductionist view of failure.
  • 14. Design for Recovery,York EngD Programme, 2010 Slide 14 Failure recovery •  Recognition •  Recognise that inappropriate behaviour has occurred •  Hypothesis •  Formulate an explanation for the unexpected behaviour •  Recovery •  Take steps to compensate for the problem that has arisen
  • 15. Design for Recovery,York EngD Programme, 2010 Slide 15 Coping with failure •  Socio-technical systems are remarkably robust because people are good at coping with unexpected situations when things go wrong. •  We have the unique ability to apply previous experience from different areas to unseen problems. •  Individuals can take the initiative, adopt responsibilities and, where necessary, break the rules or step outside the normal process of doing things. •  People can prioritise and focus on the essence of a problem
  • 16. Design for Recovery,York EngD Programme, 2010 Slide 16 Recovering from failure •  Local knowledge •  Who to call; who knows what; where things are •  Process reconfiguration •  Doing things in a different way from that defined in the ‘standard’ process •  Work-arounds, breaking the rules (safe violations) •  Redundancy and diversity •  Maintaining copies of information in different forms from that maintained in a software system •  Informal information annotation •  Using multiple communication channels •  Trust •  Relying on others to cope
  • 17. Design for Recovery,York EngD Programme, 2010 Slide 17 Design for recovery •  The aim of a strategy of design for recovery is to: •  Ensure that system design decisions do not increase the amount of recovery work required •  Make system design decisions that make it easier to recover from problems (i.e. reduce extra work required) •  Earlier recognition of problems •  Visibility to make hypotheses easier to formulate •  Flexibility to support recovery actions •  Designing for recovery is an holistic approach to system design and not (just) the identification of ‘recovery requirements’ •  Should support the natural ability of people and organisations to cope with problems
  • 18. Design for Recovery,York EngD Programme, 2010 Slide 18 Problems •  Security and recoverability •  Automation hiding •  Process tyranny •  Multi-organisational systems
  • 19. Design for Recovery,York EngD Programme, 2010 Slide 19 Security and recoverability •  There is an inherent tension between security and recoverability •  Recoverability •  Relies on trusting operators of the system not to abuse privileges that they may have been granted to help recover from problems •  Security •  Relies on mistrusting users and restricting access to information on a ‘need to know’ basis
  • 20. Design for Recovery,York EngD Programme, 2010 Slide 20 Automation hiding •  A problem with automation is that information becomes subject to organizational policies that restrict access to that information. •  Even when access is not restricted, we don’t have any shared culture in how to organise a large information store •  Generally, authorisation models maintained by the system are based on normal rather than exceptional operation. •  When problems arise and/or when people are unavailable, breaking the rules to solve these problems is made more difficult.
  • 21. Design for Recovery,York EngD Programme, 2010 Slide 21 Process tyranny •  Increasingly, there is a notion that ‘standard’ business processes can be defined and embedded in systems that support these processes •  Implicitly or explicitly, the system enforces the use of the ‘standard’ process •  But this assumes three things: •  The standard process is always appropriate •  The standard process has anticipated all possible failures •  The system can be respond in a timely way to process changes
  • 22. Design for Recovery,York EngD Programme, 2010 Slide 22 Multi-organisational systems •  Many rules enforced in different ways by different systems. •  No single manager or owner of the system . Who do you call when failures occur? •  Information is distributed - users may not be aware of where information is located, who owns information, etc. •  Processes involve remote actors so process reconfiguration is more difficult •  Restricted information channels (e.g. help unavailable outside normal business hours; no phone numbers published, etc.) •  Lack of trust. Owners of components will blame other components for system failure. Learning is inhibited and trust compromised.
  • 23. Design for Recovery,York EngD Programme, 2010 Slide 23 Design guidelines •  Local knowledge •  Process reconfiguration •  Redundancy and diversity
  • 24. Design for Recovery,York EngD Programme, 2010 Slide 24 Local knowledge •  Local knowledge includes knowledge of who does what, how authority structures can be bypassed, what rules can be broken, etc. •  Impossible to replicate entirely in distributed systems but some steps can be taken •  Maintain information about the provenance of data •  Who provided the data, where the data came from, when it was created, edited, etc. •  Maintain organisational models •  Who is responsible for what, contact details
  • 25. Design for Recovery,York EngD Programme, 2010 Slide 25 Process reconfiguration •  Make workflows explicit rather than embedding them in the software •  Not just ‘continue’ buttons! Users should know where they are and where they are supposed to go •  Support workflow navigation/interruption/restart •  Design systems with an ‘emergency mode’ where the the system changes from enforcing policies to auditing actions •  This would allow the rules to be broken but the system would maintain a log of what has been done and why so that subsequent investigations could trace what happened •  Support ‘Help, I’m in trouble!’ as well as ‘Help, I need information?’
  • 26. Design for Recovery,York EngD Programme, 2010 Slide 26 Redundancy and diversity •  Maintaining a single ‘golden copy’ of data may be efficient but it may not be effective or desirable •  Encourage the creation of ‘shadow systems’ and provide import and export from these systems •  Allow schemas to be extended •  Schemas for data are rarely designed for problem solving. Always allow informal extension (a free text box) so that annotations, explanations and additional information can be provided •  Maintain organisational models •  To allow for multi-channel communications when things go wrong
  • 27. Design for Recovery,York EngD Programme, 2010 Slide 27 Summary •  A reductionist approach to software engineering is no longer viable. on its own, for complex systems engineering •  Improving existing software engineering methods will help but will not deal with the problems of complexity that are inherent in distributed systems of systems •  We must learn to live with normal, everyday failures •  Design for recovery involves designing so that the work required to recover from a failure is minimised •  Recovery strategies include supporting information redundancy and annotation and maintaining organisational models