FAILURE
OR … RESILIENCE
CS5032 Lecture 2
DR JOHN ROOKSBY
IN THIS LECTURE…
This lecture
• Will introduce you to many of the themes I will cover on
  the course.
• Will characterise failure as the norm rather than the
  exception in systems operation.
• Will outline why critical systems engineering must
  address organisational and human factors as well as
  technical issues.
• Will build upon the idea of socio-technical systems
  engineering introduced in the last lecture, and will
  introduce the idea of resilience engineering
A STORY
A professor has to give an important lecture. He wakes up
late because his alarm clock fails to go off.
His wife has left the house already. Unfortunately she has
left the kitchen tap running and it has flooded the floor.
The professor rushes to clean up the mess.
He gets to his car only to realise he has locked his car and
house keys inside.
He has left a spare house-key with a neighbour – but the
neighbour is away.
He phones his wife but she doesn’t answer.
A STORY
He calls a friend and asks for a lift, but the friend’s car is
broken down.
The professor sets off for the bus, but then remembers there
is a bus strike.
He calls a taxi, but the taxi company is overwhelmed because
of the bus strike.
He gives up, calls work and cancels the lecture.




This story is adapted from Perrow C (1984) Normal Accidents: Living with
High-Risk Technologies. Basic Books.
ABOUT FAILURE
Failure is a judgment
Failures are common
Failures often have multiple causes
Failures cascade
Some failures are more serious than others
Failures often have no ill effect
Failures can often be recovered from
Engineering cannot eliminate failure
Success is as complex as failure
FAILURE IS A JUDGMENT
What do we judge the exact failure to be?
• Failure to get to work? Failure to give lecture? The smaller
  failures that led to cancellation?
What do we judge to be a significant failure?
• Does cancelling a lecture matter?
• Can cancellation be corrected for?
Different perspectives can be taken on failure
• Different explanations often suit different purposes
• There may sometimes be no definite agreement about a
  failure, but this does not mean any interpretation will do.
Figure: Passport issuing 1998/9
Sources: Graph – The Passport Delays of Summer 1999, NAO Report. Images – BBC News.
FAILURES ARE COMMON
Errors and failures happen all the time, particularly in
complex systems where there is a lot to go wrong.
How many errors have you made in the last half an hour?

If servers in a data center have 99.999% reliability, what are the odds
that all will be working at any one time:
a) if it has 10,000 servers?
b) if it has 100,000 servers?
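The data center question can be worked through numerically. The sketch below assumes that "99.999% reliability" means each server is working at any given moment with probability 0.99999, and that servers fail independently of one another. Both are simplifying assumptions, and the function name is illustrative only. Under those assumptions the chance that all n servers are working is 0.99999 raised to the power n.

```python
def prob_all_up(n_servers: int, reliability: float = 0.99999) -> float:
    """Probability that every one of n independent servers is working."""
    return reliability ** n_servers


for n in (10_000, 100_000):
    print(f"{n:>7} servers: P(all working) = {prob_all_up(n):.3f}")
```

With these assumptions the answers are roughly 0.905 for 10,000 servers and 0.368 for 100,000 servers, so at the larger scale it is more likely than not that something is broken at any one time.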
http://www.time.com/time/photogallery/0,29307,2036928_2218548,00.html
FAILURES OFTEN HAVE
MULTIPLE CAUSES
There were multiple (mainly mundane) causes behind the
lecture cancellation:
  •   Human error (leaving tap running, forgetting keys)
  •   Practices and procedures (Waking up late, rushing)
  •   Technical failure (Alarm clock, Car)
  •   System design (Door allows you to be locked out)
  •   Environment (Lives too far from work)
  •   External failures (Bus strike, lack of taxi capability)
  •   Planning (Relying on a single lecturer)

Who or what is responsible?
Who has responsibility?
http://gizmodo.com/5844628/a-passenger-airplane-nearly-flew-upside-down-because-of-a-dumb-pilot
FAILURES CASCADE
Complex systems have a high number of components and
will be dependent on a high number of external factors.
These interdependencies may not always be apparent.
Often the cause or causes of failure are at one or more removes
from the failure itself
• A simplistic view is that there are chains of failure. A
  domino effect where one problem leads to another
• A more complex view is that failures have complex webs
  of causes and influences
• We may also view failures in terms of problems with
  defenses
Disasters often result from unfortunate coincidences and
combinations of failure.
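The simplistic "domino" view of cascading failure can be sketched as a walk over a dependency map. The component names and the map itself are hypothetical, and the sketch assumes that a component fails whenever anything it depends on fails. Real systems have redundancy and partial degradation, which is one reason the web-of-causes view is more realistic.

```python
from collections import deque

# Hypothetical dependency map: each component lists what it depends on.
DEPENDS_ON = {
    "web_app": ["auth", "database"],
    "auth": ["database"],
    "reporting": ["database"],
    "database": ["storage"],
    "storage": [],
    "email": [],
}


def affected_by(failed: str, depends_on: dict[str, list[str]]) -> set[str]:
    """Return every component that fails, directly or indirectly, when `failed` fails."""
    # Invert the map so we can look up who depends on a given component.
    dependents: dict[str, list[str]] = {name: [] for name in depends_on}
    for name, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(name)

    # Breadth-first walk outwards from the failed component.
    affected = {failed}
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for dependent in dependents.get(current, []):
            if dependent not in affected:
                affected.add(dependent)
                queue.append(dependent)
    return affected


print(sorted(affected_by("storage", DEPENDS_ON)))
```

In this made-up example a storage failure takes down the database, and through it the authentication, reporting and web components, while email is untouched. None of the affected components lists storage as a direct dependency except the database, which illustrates how a cause can sit at a remove from the failure that is noticed.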
SWISS CHEESE MODEL
Figure: layers of defence, labelled Operation, Software and Hardware
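One way to read the Swiss cheese model is that a problem only becomes a disaster when it passes through a hole in every layer of defence. The sketch below puts rough numbers on that idea. The per-layer probabilities are invented for illustration, and the calculation assumes the layers fail independently, which real defences often do not.

```python
import math

# Hypothetical chance that each layer fails to stop a given problem.
LAYER_MISS_PROBABILITY = {
    "hardware": 0.01,
    "software": 0.05,
    "operation": 0.10,
}


def prob_breach(miss_probabilities) -> float:
    """Probability a problem slips through every layer, assuming independent layers."""
    return math.prod(miss_probabilities)


all_layers = prob_breach(LAYER_MISS_PROBABILITY.values())
without_operation = prob_breach(
    p for name, p in LAYER_MISS_PROBABILITY.items() if name != "operation"
)

print(f"All three layers:        {all_layers:.6f}")
print(f"Without operation layer: {without_operation:.6f}")
```

With these invented figures, about 1 problem in 20,000 gets through all three layers, rising to 1 in 2,000 if the operational layer is removed. The independence assumption is the weak point: when holes in different layers line up for a common reason, the real risk is higher than the product suggests.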
SOME FAILURES ARE MORE
SERIOUS THAN OTHERS
It is often helpful to distinguish between faults, errors, failures,
disasters and catastrophes, but there is no consistently used
terminology.
Failure is a judgment
The seriousness of a failure is contextually dependent.
• Failure in a life-critical system vs in a word processor
• When is it acceptable for an aging component to fail?
• When is it acceptable to take risks (e.g. do maintenance)?
Engineers take different perspectives on failure. Some argue
that all failures, no matter how small, should be taken seriously.
Some argue we need systems to be “good enough”.
FAILURES OFTEN HAVE NO ILL
EFFECT
An error or failure may happen
many times with no ill effect.
• This can lead people to be
  complacent
• It may one day lead to disaster
For example, the Columbia shuttle
disaster occurred when foam shed during launch
damaged the thermal protection on the shuttle's left wing
• Similar foam strikes had
  happened many times
• NASA couldn’t believe this strike
  would cause the loss of
  Columbia
FAILURES CAN OFTEN BE
RECOVERED FROM
A disaster is rarely an instantaneous event. Often a disaster
results from an unfortunate combination of failures and often
these take place over a period of time.
• Failures can often be mitigated
• Failures can often be recovered from
A resilient system is one that is able to recover from failures.
It is the opposite of a brittle system.
We must give operators the ability to mitigate and recover
from failure.
Image from: ATSB TRANSPORT SAFETY REPORT Aviation Occurrence Investigation – AO-2010-089 Preliminary
ENGINEERING CANNOT
ELIMINATE FAILURES
Good engineering can greatly reduce but never eliminate the
possibility of failure.
• Testing can be used to find problems but never show their
  absence
• Formal methods can be used to eliminate design faults but
  this does not mean problems will not emerge in
  manufacturing or system operation
Critical systems engineering must focus on operation as well
as design.
Systems are increasingly operated as services rather than
products, so this risk is increasingly on the developers (!)
SUCCESS IS AS COMPLEX AS
FAILURE
We need to learn from success, not just failure
• But success is even harder to define than failure.
Success is a judgment
• One person’s success is another’s failure
• A successful system may just be one that hasn’t yet failed
Success can be studied in terms of
• Noteworthy success
• Ordinary operation
• “Successful failures”
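SOCIO-TECHNICAL SYSTEMS
ENGINEERING
Figure: a layered stack, from top to bottom:
• Society
• Organisations
• People and Processes
• Applications
• Communications + Data Management
• Operating Systems
• Equipment
The stack is labelled “Socio-Technical Systems Engineering” on the left
and “Software Engineering” on the right.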
RESILIENCE
Design for failure
• How can a system fail gracefully and appropriately?
Design for recovery
• How can a system be designed to support mitigation and
  recovery from failure?
Design for avoidance
• How can we reduce the number of failures a system will
  encounter?
For all of these we need to understand systems operation.
Critical systems engineering is not just about the design
process, but also about understanding operation.
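At the level of a single software component, designing for recovery and for graceful failure can be sketched as "retry, then degrade". The example below is a minimal illustration rather than a recommended implementation. The function names, the simulated flaky service and the "cached result" fallback are all hypothetical.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_recovery(
    operation: Callable[[], T],
    fallback: Callable[[], T],
    attempts: int = 3,
    delay_seconds: float = 0.0,
) -> T:
    """Try `operation` a few times (recovery), then fall back (graceful failure)."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception as error:
            print(f"attempt {attempt} failed: {error}")
            if attempt < attempts:
                time.sleep(delay_seconds)
    # Every attempt failed: return a degraded answer instead of crashing.
    return fallback()


def make_flaky(failures_before_success: int) -> Callable[[], str]:
    """Build a simulated service that fails a set number of times, then works."""
    remaining = failures_before_success

    def operation() -> str:
        nonlocal remaining
        if remaining > 0:
            remaining -= 1
            raise RuntimeError("service unavailable")
        return "live result"

    return operation


print(call_with_recovery(make_flaky(2), lambda: "cached result"))
print(call_with_recovery(make_flaky(5), lambda: "cached result"))
```

The first call recovers on its third attempt and returns the live result. The second exhausts its attempts and returns the cached result instead. Whether a stale or degraded answer is an acceptable way to fail is a judgment that depends on how the system is actually operated, which is the point the slide makes.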
Microsoft “containerised” data centre
SUMMARY
1. Failure is the norm, not the exception
2. Resilient systems are able to cope with, recover from and
   avoid failure
3. Resilience is a socio-technical, not technical problem
HOMEWORK
First read
Chapter 3 “The Human Contribution” from J Reason (2008)
The Human Contribution. Farnham, Ashgate.


Then
Make a note of any interesting slips, lapses, mistakes,
violations, etc. that you have made recently

Weitere ähnliche Inhalte

Was ist angesagt?

Human error and secure systems - DevOpsDays Ohio 2015
Human error and secure systems - DevOpsDays Ohio 2015Human error and secure systems - DevOpsDays Ohio 2015
Human error and secure systems - DevOpsDays Ohio 2015Dustin Collins
 
Mere Paas Teensy Hai (Nikhil Mittal)
Mere Paas Teensy Hai (Nikhil Mittal)Mere Paas Teensy Hai (Nikhil Mittal)
Mere Paas Teensy Hai (Nikhil Mittal)ClubHack
 
E guide weathering the storm at your business
E guide weathering the storm at your businessE guide weathering the storm at your business
E guide weathering the storm at your businessSoma Technology Group
 
Tool Box Talk - Human Induced Failures 2
Tool Box Talk  - Human Induced Failures  2Tool Box Talk  - Human Induced Failures  2
Tool Box Talk - Human Induced Failures 2Ricky Smith CMRP, CMRT
 
zNextGen Project Opening and Keynote at SHARE in Seattle 2010: Lessons Learne...
zNextGen Project Opening and Keynote at SHARE in Seattle 2010: Lessons Learne...zNextGen Project Opening and Keynote at SHARE in Seattle 2010: Lessons Learne...
zNextGen Project Opening and Keynote at SHARE in Seattle 2010: Lessons Learne...Sam Knutson
 
When Things Break
When Things BreakWhen Things Break
When Things BreakRon Graham
 
The complete-guide-to-home-computer-maintenance
The complete-guide-to-home-computer-maintenanceThe complete-guide-to-home-computer-maintenance
The complete-guide-to-home-computer-maintenanceeyob eshetu
 
Stranded on Infosec Island: Defending the Enterprise with Nothing but Windows...
Stranded on Infosec Island: Defending the Enterprise with Nothing but Windows...Stranded on Infosec Island: Defending the Enterprise with Nothing but Windows...
Stranded on Infosec Island: Defending the Enterprise with Nothing but Windows...Adrian Sanabria
 
Blameless Post-mortems: Everything You Ever Wanted to Know
Blameless Post-mortems: Everything You Ever Wanted to KnowBlameless Post-mortems: Everything You Ever Wanted to Know
Blameless Post-mortems: Everything You Ever Wanted to KnowVictorOps
 
Successful EMIS Implementation - Gaining User Acceptance
Successful EMIS Implementation - Gaining User AcceptanceSuccessful EMIS Implementation - Gaining User Acceptance
Successful EMIS Implementation - Gaining User AcceptanceRoberta Macklin
 
Typing Work by Complexity
Typing Work by ComplexityTyping Work by Complexity
Typing Work by ComplexityDerek W. Wade
 

Was ist angesagt? (16)

Human error and secure systems - DevOpsDays Ohio 2015
Human error and secure systems - DevOpsDays Ohio 2015Human error and secure systems - DevOpsDays Ohio 2015
Human error and secure systems - DevOpsDays Ohio 2015
 
Human errors
Human errorsHuman errors
Human errors
 
Dit yvol3iss41
Dit yvol3iss41Dit yvol3iss41
Dit yvol3iss41
 
Mere Paas Teensy Hai (Nikhil Mittal)
Mere Paas Teensy Hai (Nikhil Mittal)Mere Paas Teensy Hai (Nikhil Mittal)
Mere Paas Teensy Hai (Nikhil Mittal)
 
E guide weathering the storm at your business
E guide weathering the storm at your businessE guide weathering the storm at your business
E guide weathering the storm at your business
 
Tool Box Talk - Human Induced Failures 2
Tool Box Talk  - Human Induced Failures  2Tool Box Talk  - Human Induced Failures  2
Tool Box Talk - Human Induced Failures 2
 
zNextGen Project Opening and Keynote at SHARE in Seattle 2010: Lessons Learne...
zNextGen Project Opening and Keynote at SHARE in Seattle 2010: Lessons Learne...zNextGen Project Opening and Keynote at SHARE in Seattle 2010: Lessons Learne...
zNextGen Project Opening and Keynote at SHARE in Seattle 2010: Lessons Learne...
 
When Things Break
When Things BreakWhen Things Break
When Things Break
 
The complete-guide-to-home-computer-maintenance
The complete-guide-to-home-computer-maintenanceThe complete-guide-to-home-computer-maintenance
The complete-guide-to-home-computer-maintenance
 
Downtime-Whitepaper
Downtime-WhitepaperDowntime-Whitepaper
Downtime-Whitepaper
 
Stranded on Infosec Island: Defending the Enterprise with Nothing but Windows...
Stranded on Infosec Island: Defending the Enterprise with Nothing but Windows...Stranded on Infosec Island: Defending the Enterprise with Nothing but Windows...
Stranded on Infosec Island: Defending the Enterprise with Nothing but Windows...
 
Creating a Technology Disaster Plan
Creating a Technology Disaster PlanCreating a Technology Disaster Plan
Creating a Technology Disaster Plan
 
232 a7d01
232 a7d01232 a7d01
232 a7d01
 
Blameless Post-mortems: Everything You Ever Wanted to Know
Blameless Post-mortems: Everything You Ever Wanted to KnowBlameless Post-mortems: Everything You Ever Wanted to Know
Blameless Post-mortems: Everything You Ever Wanted to Know
 
Successful EMIS Implementation - Gaining User Acceptance
Successful EMIS Implementation - Gaining User AcceptanceSuccessful EMIS Implementation - Gaining User Acceptance
Successful EMIS Implementation - Gaining User Acceptance
 
Typing Work by Complexity
Typing Work by ComplexityTyping Work by Complexity
Typing Work by Complexity
 

Andere mochten auch

CS5032 Lecture 9: Learning from failure 1
CS5032 Lecture 9: Learning from failure 1CS5032 Lecture 9: Learning from failure 1
CS5032 Lecture 9: Learning from failure 1John Rooksby
 
Studying foursquare
Studying foursquareStudying foursquare
Studying foursquareMattias Rost
 
CS5032 Lecture 10: Learning from failure 2
CS5032 Lecture 10: Learning from failure 2CS5032 Lecture 10: Learning from failure 2
CS5032 Lecture 10: Learning from failure 2John Rooksby
 
Designing apps lecture
Designing apps lectureDesigning apps lecture
Designing apps lectureJohn Rooksby
 
Testing Sociotechnical Systems: Passport Issuing
Testing Sociotechnical Systems: Passport IssuingTesting Sociotechnical Systems: Passport Issuing
Testing Sociotechnical Systems: Passport IssuingJohn Rooksby
 
Testing Sociotechnical Systems: Heathrow Terminal 5
Testing Sociotechnical Systems: Heathrow Terminal 5Testing Sociotechnical Systems: Heathrow Terminal 5
Testing Sociotechnical Systems: Heathrow Terminal 5John Rooksby
 
Self tracking and digital health
Self tracking and digital healthSelf tracking and digital health
Self tracking and digital healthJohn Rooksby
 
Top 10 lies of Entrepreneurs
Top 10 lies of EntrepreneursTop 10 lies of Entrepreneurs
Top 10 lies of Entrepreneurshuer1278ft
 
Entrepreneurship & business modelling workshop
Entrepreneurship & business modelling  workshopEntrepreneurship & business modelling  workshop
Entrepreneurship & business modelling workshophgomersall
 
It's All About Execution in a Startup
It's All About Execution in a StartupIt's All About Execution in a Startup
It's All About Execution in a StartupAbhishek Shah
 
Leadership Mashups: 100 Entrepreneur Attributes
Leadership Mashups: 100 Entrepreneur AttributesLeadership Mashups: 100 Entrepreneur Attributes
Leadership Mashups: 100 Entrepreneur AttributesAdam Walz
 
17 Traits That Entrepreneurs Posses
17 Traits That Entrepreneurs Posses17 Traits That Entrepreneurs Posses
17 Traits That Entrepreneurs PossesAbhishek Shah
 
Building a Career as an Entrepreneur
Building a Career as an EntrepreneurBuilding a Career as an Entrepreneur
Building a Career as an EntrepreneurEric Tachibana
 
99 slideshares that every entrepreneur must read
99 slideshares that every entrepreneur must read99 slideshares that every entrepreneur must read
99 slideshares that every entrepreneur must readEric Tachibana
 
Stealing Your Einstein Ideas
Stealing Your Einstein IdeasStealing Your Einstein Ideas
Stealing Your Einstein IdeasAbhishek Shah
 
10 reasons it sucks to be an entrepreneur
10 reasons it sucks to be an entrepreneur10 reasons it sucks to be an entrepreneur
10 reasons it sucks to be an entrepreneurEric Tachibana
 
How to run a scrappy startup
How to run a scrappy startupHow to run a scrappy startup
How to run a scrappy startupRashmi Sinha
 

Andere mochten auch (20)

CS5032 Lecture 9: Learning from failure 1
CS5032 Lecture 9: Learning from failure 1CS5032 Lecture 9: Learning from failure 1
CS5032 Lecture 9: Learning from failure 1
 
Studying foursquare
Studying foursquareStudying foursquare
Studying foursquare
 
CS5032 Lecture 10: Learning from failure 2
CS5032 Lecture 10: Learning from failure 2CS5032 Lecture 10: Learning from failure 2
CS5032 Lecture 10: Learning from failure 2
 
Designing apps lecture
Designing apps lectureDesigning apps lecture
Designing apps lecture
 
Testing Sociotechnical Systems: Passport Issuing
Testing Sociotechnical Systems: Passport IssuingTesting Sociotechnical Systems: Passport Issuing
Testing Sociotechnical Systems: Passport Issuing
 
Testing Sociotechnical Systems: Heathrow Terminal 5
Testing Sociotechnical Systems: Heathrow Terminal 5Testing Sociotechnical Systems: Heathrow Terminal 5
Testing Sociotechnical Systems: Heathrow Terminal 5
 
Self tracking and digital health
Self tracking and digital healthSelf tracking and digital health
Self tracking and digital health
 
Innovation Can be Trained
Innovation Can be TrainedInnovation Can be Trained
Innovation Can be Trained
 
Top 10 lies of Entrepreneurs
Top 10 lies of EntrepreneursTop 10 lies of Entrepreneurs
Top 10 lies of Entrepreneurs
 
Entrepreneurship & business modelling workshop
Entrepreneurship & business modelling  workshopEntrepreneurship & business modelling  workshop
Entrepreneurship & business modelling workshop
 
It's All About Execution in a Startup
It's All About Execution in a StartupIt's All About Execution in a Startup
It's All About Execution in a Startup
 
Leadership Mashups: 100 Entrepreneur Attributes
Leadership Mashups: 100 Entrepreneur AttributesLeadership Mashups: 100 Entrepreneur Attributes
Leadership Mashups: 100 Entrepreneur Attributes
 
17 Traits That Entrepreneurs Posses
17 Traits That Entrepreneurs Posses17 Traits That Entrepreneurs Posses
17 Traits That Entrepreneurs Posses
 
How can entrepreneurial mindset be developed in organisations?
How can entrepreneurial mindset be developed in organisations?How can entrepreneurial mindset be developed in organisations?
How can entrepreneurial mindset be developed in organisations?
 
Building a Career as an Entrepreneur
Building a Career as an EntrepreneurBuilding a Career as an Entrepreneur
Building a Career as an Entrepreneur
 
99 slideshares that every entrepreneur must read
99 slideshares that every entrepreneur must read99 slideshares that every entrepreneur must read
99 slideshares that every entrepreneur must read
 
Stealing Your Einstein Ideas
Stealing Your Einstein IdeasStealing Your Einstein Ideas
Stealing Your Einstein Ideas
 
5 things I wish I knew before starting up
5 things I wish I knew before starting up5 things I wish I knew before starting up
5 things I wish I knew before starting up
 
10 reasons it sucks to be an entrepreneur
10 reasons it sucks to be an entrepreneur10 reasons it sucks to be an entrepreneur
10 reasons it sucks to be an entrepreneur
 
How to run a scrappy startup
How to run a scrappy startupHow to run a scrappy startup
How to run a scrappy startup
 

Ähnlich wie CS5032 Lecture 2: Failure

Chaos Engineering: Injecting Failure for Building Resilience in Systems
Chaos Engineering: Injecting Failure for Building Resilience in SystemsChaos Engineering: Injecting Failure for Building Resilience in Systems
Chaos Engineering: Injecting Failure for Building Resilience in SystemsYury Roa
 
Architectural Patterns of Resilient Distributed Systems
 Architectural Patterns of Resilient Distributed Systems Architectural Patterns of Resilient Distributed Systems
Architectural Patterns of Resilient Distributed SystemsInes Sombra
 
The 7 quests of resilient software design
The 7 quests of resilient software designThe 7 quests of resilient software design
The 7 quests of resilient software designUwe Friedrichsen
 
Embracing Failure - AzureDay Rome
Embracing Failure - AzureDay RomeEmbracing Failure - AzureDay Rome
Embracing Failure - AzureDay RomeAlberto Acerbis
 
Testing Safety Critical Systems (10-02-2014, VU amsterdam)
Testing Safety Critical Systems (10-02-2014, VU amsterdam)Testing Safety Critical Systems (10-02-2014, VU amsterdam)
Testing Safety Critical Systems (10-02-2014, VU amsterdam)Jaap van Ekris
 
From Duke of DevOps to Queen of Chaos - Api days 2018
From Duke of DevOps to Queen of Chaos - Api days 2018From Duke of DevOps to Queen of Chaos - Api days 2018
From Duke of DevOps to Queen of Chaos - Api days 2018Christophe Rochefolle
 
2015 05-07 - vu amsterdam - testing safety critical systems
2015 05-07 - vu amsterdam - testing safety critical systems2015 05-07 - vu amsterdam - testing safety critical systems
2015 05-07 - vu amsterdam - testing safety critical systemsJaap van Ekris
 
Debugging microservices in production
Debugging microservices in productionDebugging microservices in production
Debugging microservices in productionbcantrill
 
Resilient Functional Service Design
Resilient Functional Service DesignResilient Functional Service Design
Resilient Functional Service DesignUwe Friedrichsen
 
Microservices - stress-free and without increased heart attack risk
Microservices - stress-free and without increased heart attack riskMicroservices - stress-free and without increased heart attack risk
Microservices - stress-free and without increased heart attack riskUwe Friedrichsen
 
2016-04-28 - VU Amsterdam - testing safety critical systems
2016-04-28 - VU Amsterdam - testing safety critical systems2016-04-28 - VU Amsterdam - testing safety critical systems
2016-04-28 - VU Amsterdam - testing safety critical systemsJaap van Ekris
 
Architecting for failure - Why are distributed systems hard?
Architecting for failure - Why are distributed systems hard?Architecting for failure - Why are distributed systems hard?
Architecting for failure - Why are distributed systems hard?Markus Eisele
 
Microservices Gone Wrong!
Microservices Gone Wrong!Microservices Gone Wrong!
Microservices Gone Wrong!Bert Ertman
 
Debugging under fire: Keeping your head when systems have lost their mind
Debugging under fire: Keeping your head when systems have lost their mindDebugging under fire: Keeping your head when systems have lost their mind
Debugging under fire: Keeping your head when systems have lost their mindbcantrill
 
DockerCon SF 2019 - TDD is Dead
DockerCon SF 2019 - TDD is DeadDockerCon SF 2019 - TDD is Dead
DockerCon SF 2019 - TDD is DeadKevin Crawley
 
Chaos Engineering
Chaos EngineeringChaos Engineering
Chaos EngineeringYury Roa
 
Problem management foundation - Significant havoc in technology
Problem management foundation - Significant havoc in technologyProblem management foundation - Significant havoc in technology
Problem management foundation - Significant havoc in technologyRonald Bartels
 

Ähnlich wie CS5032 Lecture 2: Failure (20)

Chaos Engineering: Injecting Failure for Building Resilience in Systems
Chaos Engineering: Injecting Failure for Building Resilience in SystemsChaos Engineering: Injecting Failure for Building Resilience in Systems
Chaos Engineering: Injecting Failure for Building Resilience in Systems
 
Architectural Patterns of Resilient Distributed Systems
 Architectural Patterns of Resilient Distributed Systems Architectural Patterns of Resilient Distributed Systems
Architectural Patterns of Resilient Distributed Systems
 
The 7 quests of resilient software design
The 7 quests of resilient software designThe 7 quests of resilient software design
The 7 quests of resilient software design
 
Chaos engineering
Chaos engineering Chaos engineering
Chaos engineering
 
Embracing Failure - AzureDay Rome
Embracing Failure - AzureDay RomeEmbracing Failure - AzureDay Rome
Embracing Failure - AzureDay Rome
 
Testing Safety Critical Systems (10-02-2014, VU amsterdam)
Testing Safety Critical Systems (10-02-2014, VU amsterdam)Testing Safety Critical Systems (10-02-2014, VU amsterdam)
Testing Safety Critical Systems (10-02-2014, VU amsterdam)
 
From Duke of DevOps to Queen of Chaos - Api days 2018
From Duke of DevOps to Queen of Chaos - Api days 2018From Duke of DevOps to Queen of Chaos - Api days 2018
From Duke of DevOps to Queen of Chaos - Api days 2018
 
2015 05-07 - vu amsterdam - testing safety critical systems
2015 05-07 - vu amsterdam - testing safety critical systems2015 05-07 - vu amsterdam - testing safety critical systems
2015 05-07 - vu amsterdam - testing safety critical systems
 
Debugging microservices in production
Debugging microservices in productionDebugging microservices in production
Debugging microservices in production
 
Resilient Functional Service Design
Resilient Functional Service DesignResilient Functional Service Design
Resilient Functional Service Design
 
Introduction to Chaos Engineering
Introduction to Chaos EngineeringIntroduction to Chaos Engineering
Introduction to Chaos Engineering
 
dist_systems.pdf
dist_systems.pdfdist_systems.pdf
dist_systems.pdf
 
Microservices - stress-free and without increased heart attack risk
Microservices - stress-free and without increased heart attack riskMicroservices - stress-free and without increased heart attack risk
Microservices - stress-free and without increased heart attack risk
 
2016-04-28 - VU Amsterdam - testing safety critical systems
2016-04-28 - VU Amsterdam - testing safety critical systems2016-04-28 - VU Amsterdam - testing safety critical systems
2016-04-28 - VU Amsterdam - testing safety critical systems
 
Architecting for failure - Why are distributed systems hard?
Architecting for failure - Why are distributed systems hard?Architecting for failure - Why are distributed systems hard?
Architecting for failure - Why are distributed systems hard?
 
Microservices Gone Wrong!
Microservices Gone Wrong!Microservices Gone Wrong!
Microservices Gone Wrong!
 
Debugging under fire: Keeping your head when systems have lost their mind
Debugging under fire: Keeping your head when systems have lost their mindDebugging under fire: Keeping your head when systems have lost their mind
Debugging under fire: Keeping your head when systems have lost their mind
 
DockerCon SF 2019 - TDD is Dead
DockerCon SF 2019 - TDD is DeadDockerCon SF 2019 - TDD is Dead
DockerCon SF 2019 - TDD is Dead
 
Chaos Engineering
Chaos EngineeringChaos Engineering
Chaos Engineering
 
Problem management foundation - Significant havoc in technology
Problem management foundation - Significant havoc in technologyProblem management foundation - Significant havoc in technology
Problem management foundation - Significant havoc in technology
 

Mehr von John Rooksby

Implementing Ethics for a Mobile App Deployment
Implementing Ethics for a Mobile App DeploymentImplementing Ethics for a Mobile App Deployment
Implementing Ethics for a Mobile App DeploymentJohn Rooksby
 
Digital Health From an HCI Perspective - Geraldine Fitzpatrick
Digital Health From an HCI Perspective - Geraldine FitzpatrickDigital Health From an HCI Perspective - Geraldine Fitzpatrick
Digital Health From an HCI Perspective - Geraldine FitzpatrickJohn Rooksby
 
How to evaluate and improve the quality of mHealth behaviour change tools
How to evaluate and improve the quality of mHealth behaviour change toolsHow to evaluate and improve the quality of mHealth behaviour change tools
How to evaluate and improve the quality of mHealth behaviour change toolsJohn Rooksby
 
Guest lecture: Designing mobile apps
Guest lecture: Designing mobile appsGuest lecture: Designing mobile apps
Guest lecture: Designing mobile appsJohn Rooksby
 
Talk at UCL: Mobile Devices in Everyday Use
Talk at UCL: Mobile Devices in Everyday UseTalk at UCL: Mobile Devices in Everyday Use
Talk at UCL: Mobile Devices in Everyday UseJohn Rooksby
 
Intimacy and Mobile Devices
Intimacy and Mobile DevicesIntimacy and Mobile Devices
Intimacy and Mobile DevicesJohn Rooksby
 
CS5032 Lecture 20: Dependable infrastructure 2
CS5032 Lecture 20: Dependable infrastructure 2CS5032 Lecture 20: Dependable infrastructure 2
CS5032 Lecture 20: Dependable infrastructure 2John Rooksby
 
CS5032 Lecture 19: Dependable infrastructure
CS5032 Lecture 19: Dependable infrastructureCS5032 Lecture 19: Dependable infrastructure
CS5032 Lecture 19: Dependable infrastructureJohn Rooksby
 
CS5032 Lecture 14: Organisations and failure 2
CS5032 Lecture 14: Organisations and failure 2CS5032 Lecture 14: Organisations and failure 2
CS5032 Lecture 14: Organisations and failure 2John Rooksby
 
CS5032 Lecture 13: organisations and failure
CS5032 Lecture 13: organisations and failureCS5032 Lecture 13: organisations and failure
CS5032 Lecture 13: organisations and failureJohn Rooksby
 

Mehr von John Rooksby (12)

Implementing Ethics for a Mobile App Deployment
Implementing Ethics for a Mobile App DeploymentImplementing Ethics for a Mobile App Deployment
Implementing Ethics for a Mobile App Deployment
 
Digital Health From an HCI Perspective - Geraldine Fitzpatrick
Digital Health From an HCI Perspective - Geraldine FitzpatrickDigital Health From an HCI Perspective - Geraldine Fitzpatrick
Digital Health From an HCI Perspective - Geraldine Fitzpatrick
 
How to evaluate and improve the quality of mHealth behaviour change tools
How to evaluate and improve the quality of mHealth behaviour change toolsHow to evaluate and improve the quality of mHealth behaviour change tools
How to evaluate and improve the quality of mHealth behaviour change tools
 
Guest lecture: Designing mobile apps
Guest lecture: Designing mobile appsGuest lecture: Designing mobile apps
Guest lecture: Designing mobile apps
 
Talk at UCL: Mobile Devices in Everyday Use
Talk at UCL: Mobile Devices in Everyday UseTalk at UCL: Mobile Devices in Everyday Use
Talk at UCL: Mobile Devices in Everyday Use
 
Fitts' Law
Fitts' LawFitts' Law
Fitts' Law
 
Intimacy and Mobile Devices
Intimacy and Mobile DevicesIntimacy and Mobile Devices
Intimacy and Mobile Devices
 
Making data
Making dataMaking data
Making data
 
CS5032 Lecture 20: Dependable infrastructure 2
CS5032 Lecture 20: Dependable infrastructure 2CS5032 Lecture 20: Dependable infrastructure 2
CS5032 Lecture 20: Dependable infrastructure 2
 
CS5032 Lecture 19: Dependable infrastructure
CS5032 Lecture 19: Dependable infrastructureCS5032 Lecture 19: Dependable infrastructure
CS5032 Lecture 19: Dependable infrastructure
 
CS5032 Lecture 14: Organisations and failure 2
CS5032 Lecture 14: Organisations and failure 2CS5032 Lecture 14: Organisations and failure 2
CS5032 Lecture 14: Organisations and failure 2
 
CS5032 Lecture 13: organisations and failure
CS5032 Lecture 13: organisations and failureCS5032 Lecture 13: organisations and failure
CS5032 Lecture 13: organisations and failure
 


CS5032 Lecture 2: Failure

  • 4. IN THIS LECTURE…
    This lecture
      • Will introduce you to many of the themes I will cover on the course.
      • Will characterise failure as the norm rather than the exception in systems operation.
      • Will outline why critical systems engineering must address organisational and human factors as well as technical issues.
      • Will build upon the idea of socio-technical systems engineering introduced in the last lecture, and will introduce the idea of resilience engineering.
  • 5. A STORY A professor has to give an important lecture. He wakes up late because his alarm clock fails to go off. His wife has left the house already. Unfortunately she has left the kitchen tap running and it has flooded the floor. The professor rushes to clean up the mess. He gets to his car only to realise he has locked his car and house keys inside. He has left a spare house-key with a neighbour – but the neighbour is away. He phones his wife but she doesn’t answer.
  • 6. A STORY He calls a friend and asks for a lift, but the friend’s car is broken down. The professor sets off for the bus, but then remembers there is a bus strike. He calls a taxi, but the taxi company is overwhelmed because of the bus strike. He gives up, calls work and cancels the lecture.
    This story is adapted from Perrow C (1984) Normal Accidents: Living with High-Risk Technologies. Basic Books.
  • 7. ABOUT FAILURE
    Failure is a judgement
    Failures are common
    Failures often have multiple causes
    Failures cascade
    Some failures are more serious than others
    Failures often have no ill effect
    Failures can often be recovered from
    Engineering cannot eliminate failure
    Success is as complex as failure
  • 8. FAILURE IS A JUDGMENT
    What do we judge the exact failure to be?
      • Failure to get to work? Failure to give lecture? The smaller failures that led to cancellation?
    What do we judge to be a significant failure?
      • Does cancelling a lecture matter?
      • Can cancellation be corrected for?
    Different perspectives can be taken on failure
      • Different explanations often suit different purposes
      • There may sometimes be no definite agreement about a failure, but this does not mean any interpretation will do.
  • 9. Sources: Graph - The Passport Delays of Summer 1999. NAO Report. Images – BBC News Passport issuing 1998/9
  • 10. FAILURES ARE COMMON
    Errors and failures happen all the time, particularly in complex systems where there is a lot to go wrong.
    How many errors have you made in the last half an hour?
    If servers in a data center have 99.999% reliability, what are the odds that all will be working at any one time:
      a) if it has 10,000 servers?
      b) if it has 100,000 servers?
    http://www.time.com/time/photogallery/0,29307,2036928_2218548,00.html
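The server question on slide 10 can be worked through with a short calculation. This is a minimal sketch, assuming that "99.999% reliability" means each server is independently up with probability 0.99999 at any given moment. The function name is illustrative, and real server failures are often correlated, which this ignores.

```python
def prob_all_working(n_servers: int, reliability: float = 0.99999) -> float:
    """Probability that every one of n_servers is up at the same moment.

    Assumption (not from the slides): servers fail independently,
    so the individual probabilities simply multiply.
    """
    return reliability ** n_servers


for n in (10_000, 100_000):
    p = prob_all_working(n)
    print(f"{n:>7} servers: P(all working) = {p:.3f}, P(at least one down) = {1 - p:.3f}")
```

Under these assumptions the chance that everything is working is roughly 90% with 10,000 servers and roughly 37% with 100,000 servers, so at scale something is usually broken somewhere.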
  • 11. FAILURES OFTEN HAVE MULTIPLE CAUSES
    There were multiple (mainly mundane) causes behind the lecture cancellation:
      • Human error (leaving tap running, forgetting keys)
      • Practices and procedures (waking up late, rushing)
      • Technical failure (alarm clock, car)
      • System design (door allows you to be locked out)
      • Environment (lives too far from work)
      • External failures (bus strike, lack of taxi capacity)
      • Planning (relying on a single lecturer)
    Who or what is responsible? Who has responsibility?
  • 13. FAILURES CASCADE
    Complex systems have a high number of components and will be dependent on a high number of external factors. These interdependencies may not always be apparent.
    Often the cause or causes of failure are at an order of remove from the failure itself
      • A simplistic view is that there are chains of failure. A domino effect where one problem leads to another
      • A more complex view is that failures have complex webs of causes and influences
      • We may also view failures in terms of problems with defenses
    Disasters often result from unfortunate coincidences and combinations of failure.
  • 14. SWISS CHEESE MODEL
    [Figure: layers labelled Operation, Software, Hardware]
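One way to read the Swiss cheese model numerically is sketched below. It treats each layer named on the slide (Hardware, Software, Operation) as a defence with some chance of letting a hazard through, and an accident happens only when the holes in every layer line up. The probabilities are invented for illustration, and the layers are assumed to fail independently, which is a strong simplification.

```python
import math

# Hypothetical probabilities that a hazard slips through each defensive layer.
# The layer names come from the slide; the numbers are made up for illustration.
hole_probability = {"Hardware": 0.01, "Software": 0.05, "Operation": 0.10}


def prob_accident(layers: dict[str, float]) -> float:
    """Chance a hazard passes every layer, assuming layers fail independently."""
    return math.prod(layers.values())


print(f"All three layers:          {prob_accident(hole_probability):.6f}")

# Removing one layer of defence (here, operational checks) raises the risk.
without_operation = {k: v for k, v in hole_probability.items() if k != "Operation"}
print(f"Without operational layer: {prob_accident(without_operation):.6f}")
```

The independence assumption is the weak point of this sketch. When a common cause opens holes in several layers at once, the simple product understates the real risk.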
  • 15. SOME FAILURES ARE MORE SERIOUS THAN OTHERS
    It is often helpful to distinguish between faults, errors, failures, disasters and catastrophes. But there is no consistently used terminology.
    Failure is a judgment. The seriousness of a failure is contextually dependent.
      • Failure in a life-critical system vs in a word processor
      • When is it acceptable for an aging component to fail?
      • When is it acceptable to take risks (e.g. do maintenance)?
    Engineers take different perspectives on failure. Some argue that all failures, no matter how small, should be taken seriously. Some argue we need systems to be “good enough”.
  • 17. FAILURES OFTEN HAVE NO ILL EFFECT
    An error or failure may happen many times with no ill effect.
      • This can lead people to be complacent
      • It may one day lead to disaster
    For example the Columbia shuttle disaster occurred when foam damaged tiles on the shuttle
      • Similar foam strikes had happened many times
      • NASA couldn’t believe this strike would cause the loss of Columbia
  • 18. FAILURES CAN OFTEN BE RECOVERED FROM
    A disaster is rarely an instantaneous event. Often a disaster results from an unfortunate combination of failures and often these take place over a period of time.
      • Failures can often be mitigated
      • Failures can often be recovered from
    A resilient system is one that is able to recover from failures. It is the opposite of a brittle system.
    We must give operators the ability to mitigate and recover from failure.
  • 19. Image from: ATSB TRANSPORT SAFETY REPORT Aviation Occurrence Investigation – AO-2010-089 Preliminary
  • 20. ENGINEERING CANNOT ELIMINATE FAILURES
    Good engineering can greatly reduce but never eliminate the possibility of failure.
      • Testing can be used to find problems but never show their absence
      • Formal methods can be used to eliminate design faults but this does not mean problems will not emerge in manufacturing or system operation
    Critical systems engineering must focus on operation as well as design.
    Systems are increasingly operated as services rather than products, so this risk is increasingly on the developers (!)
  • 22. SUCCESS IS AS COMPLEX AS FAILURE
    We need to learn from success, not just failure
      • But success is even harder to define than failure.
    Success is a judgment
      • One person’s success is another’s failure
      • A successful system may just be one that hasn’t yet failed
    Success can be studied in terms of
      • Noteworthy success
      • Ordinary operation
      • “Successful failures”
  • 25. SOCIO-TECHNICAL SYSTEMS ENGINEERING
    [Diagram: layered system stack, top to bottom: Society; Organisations; People and Processes; Applications; Communications + Data Management; Operating Systems; Equipment. Side labels: Socio-Technical Systems Engineering; Software Engineering.]
  • 26. RESILIENCE
    Design for failure
      • How can a system fail gracefully and appropriately?
    Design for recovery
      • How can a system be designed to support mitigation and recovery from failure?
    Design for avoidance
      • How can we reduce the number of failures a system will encounter?
    For all of these we need to understand systems operation. Critical systems engineering is not just about the design process, but also about understanding operation.
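At the level of a single software component, designing for recovery and for graceful failure might look something like the sketch below. It retries a failing operation a few times (recovery) and then degrades to a fallback result instead of crashing (graceful failure). All names here are hypothetical, and this covers only a small technical fragment of resilience; the organisational and human side stressed in the lecture is not something code alone can capture.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_recovery(
    primary: Callable[[], T],
    fallback: Callable[[], T],
    retries: int = 3,
    delay_seconds: float = 0.0,
) -> T:
    """Try `primary` up to `retries` times, then degrade to `fallback`."""
    for attempt in range(1, retries + 1):
        try:
            return primary()
        except Exception as exc:  # deliberately broad: this is only a sketch
            print(f"attempt {attempt} failed: {exc}")
            time.sleep(delay_seconds)
    return fallback()


# Hypothetical flaky service: fails twice, then succeeds.
calls = {"count": 0}


def flaky_lookup() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("service unavailable")
    return "live data"


def cached_lookup() -> str:
    return "stale cached data"


result = call_with_recovery(flaky_lookup, cached_lookup)
print(result)
```

Here the third attempt succeeds, so the caller gets live data. If every attempt failed, the caller would receive the cached value, a degraded but usable outcome.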
  • 28. SUMMARY
    1. Failure is the norm, not the exception
    2. Resilient systems are able to cope with, recover from and avoid failure
    3. Resilience is a socio-technical, not technical problem
  • 29. HOMEWORK
    First read Chapter 3 “The Human Contribution” from J Reason (2008) The Human Contribution. Farnham, Ashgate.
    Then make a note of any interesting slips, lapses, mistakes, violations, etc. that you have made recently.

Editor's notes

  1. 90% (10,000 servers); 37% (100,000 servers)
  2. All Nippon Airways. Flight “flips” on 6/9/2011
  3. During the swine flu outbreak? How well has a medic washed hands? (line infections)
  4. Qantas Batam Island Engine Explosion
  5. Unsinkable system
  6. The failures and successes of the London Ambulance Service
  7. Talk also about virtualisation