4. IN THIS LECTURE…
This lecture
• Will introduce you to many of the themes I will cover on
the course.
• Will characterise failure as the norm rather than the
exception in systems operation.
• Will outline why critical systems engineering must
address organisational and human factors as well as
technical issues.
• Will build upon the idea of socio-technical systems
engineering introduced in the last lecture, and will
introduce the idea of resilience engineering.
5. A STORY
A professor has to give an important lecture. He wakes up
late because his alarm clock fails to go off.
His wife has left the house already. Unfortunately she has
left the kitchen tap running and it has flooded the floor.
The professor rushes to clean up the mess.
He gets to his car only to realise he has locked his car and
house keys inside.
He has left a spare house-key with a neighbour – but the
neighbour is away.
He phones his wife but she doesn’t answer.
6. A STORY
He calls a friend and asks for a lift, but the friend’s car is
broken down.
The professor sets off for the bus, but then remembers there
is a bus strike.
He calls a taxi, but the taxi company is overwhelmed because
of the bus strike.
He gives up, calls work and cancels the lecture.
This story is adapted from Perrow C (1984) Normal Accidents: Living with
High-Risk Technologies. Basic Books.
7. ABOUT FAILURE
Failure is a judgment
Failures are common
Failures often have multiple causes
Failures cascade
Some failures are more serious than others
Failures often have no ill effect
Failures can often be recovered from
Engineering cannot eliminate failure
Success is as complex as failure
8. FAILURE IS A JUDGMENT
What do we judge the exact failure to be?
• Failure to get to work? Failure to give lecture? The smaller
failures that led to cancellation?
What do we judge to be a significant failure?
• Does cancelling a lecture matter?
• Can cancellation be corrected for?
Different perspectives can be taken on failure
• Different explanations often suit different purposes
• There may sometimes be no definite agreement about a
failure, but this does not mean any interpretation will do.
9. PASSPORT ISSUING 1998/9
[Graph and images not reproduced]
Sources:
• Graph: The Passport Delays of Summer 1999, NAO Report.
• Images: BBC News
10. FAILURES ARE COMMON
Errors and failures happen all the time, particularly in
complex systems where there is a lot to go wrong.
How many errors have you made in the last half an hour?
If servers in a data center have 99.999% reliability, what
are the odds that all will be working at any one time:
a) if it has 10,000 servers?
b) if it has 100,000 servers?
http://www.time.com/time/photogallery/0,29307,2036928_2218548,00.html
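The question above can be worked through with a short calculation. This is a minimal sketch, assuming that "99.999% reliability" means each server is working at any given moment with probability 0.99999, and that servers fail independently of one another (a simplification; real failures are often correlated). The function name prob_all_up is illustrative only.

```python
def prob_all_up(n_servers: int, reliability: float = 0.99999) -> float:
    """Probability that every one of n_servers is working at the same moment.

    Assumes each server is up with probability `reliability` and that
    servers fail independently, so the per-server probabilities multiply.
    """
    return reliability ** n_servers


for n in (10_000, 100_000):
    print(f"{n:,} servers: P(all working) = {prob_all_up(n):.3f}")
```

Under these assumptions the answers are roughly 0.905 for 10,000 servers and 0.368 for 100,000 servers. Even with very reliable components, at the larger scale it is more likely than not that something is broken at any given moment.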
11. FAILURES OFTEN HAVE
MULTIPLE CAUSES
There were multiple (mainly mundane) causes behind the
lecture cancellation:
• Human error (leaving tap running, forgetting keys)
• Practices and procedures (Waking up late, rushing)
• Technical failure (Alarm clock, Car)
• System design (Door allows you to be locked out)
• Environment (Lives too far from work)
• External failures (Bus strike, lack of taxi capability)
• Planning (Relying on a single lecturer)
Who or what is responsible?
Who has responsibility?
13. FAILURES CASCADE
Complex systems have a high number of components and
will be dependent on a high number of external factors.
These interdependencies may not always be apparent.
Often the cause or causes of failure are at one or more
removes from the failure itself.
• A simplistic view is that there are chains of failure. A
domino effect where one problem leads to another
• A more complex view is that failures have complex webs
of causes and influences
• We may also view failures in terms of problems with
defenses
Disasters often result from unfortunate coincidences and
combinations of failure.
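The point about hidden interdependencies and combinations of failure can be illustrated with a toy model of the professor's story. This is a sketch only: the item names, and the rule that taxis are only obtainable while the buses are running, are assumptions made for illustration and are not part of Perrow's account.

```python
from itertools import combinations

# Hypothetical things that can fail on the morning of the lecture.
ITEMS = ["car keys", "friend's car", "bus service", "taxi company"]


def lecture_goes_ahead(failed):
    """Return True if the professor can still reach work, given failed items."""
    def ok(item):
        return item not in failed

    drive = ok("car keys")
    lift = ok("friend's car")
    bus = ok("bus service")
    # Hidden interdependency: with no buses running, taxis are overwhelmed,
    # so the taxi "defense" is not independent of the bus service.
    taxi = ok("taxi company") and bus
    # Any single working transport option is enough.
    return drive or lift or bus or taxi


def smallest_failure_sets():
    """Find the smallest combinations of failures that cancel the lecture."""
    for size in range(1, len(ITEMS) + 1):
        hits = [set(c) for c in combinations(ITEMS, size)
                if not lecture_goes_ahead(set(c))]
        if hits:
            return hits
    return []


print(smallest_failure_sets())
```

In this model no single failure, and no pair of failures, cancels the lecture. The one smallest combination that does is the loss of the car keys, the friend's car and the bus service together. The taxi company itself never fails, yet it provides no protection, because it quietly depends on the buses.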
15. SOME FAILURES ARE MORE
SERIOUS THAN OTHERS
It is often helpful to distinguish between faults, errors, failures,
disasters and catastrophes. But there is no consistently used
terminology.
Failure is a judgment
The seriousness of a failure is contextually dependent.
• Failure in a life-critical system vs in a word processor
• When is it acceptable for an aging component to fail?
• When is it acceptable to take risks (e.g. do maintenance)?
Engineers take different perspectives on failure. Some argue
that all failures, no matter how small, should be taken seriously.
Some argue we need systems to be “good enough”.
17. FAILURES OFTEN HAVE NO ILL
EFFECT
An error or failure may happen
many times with no ill effect.
• This can lead people to be
complacent
• It may one day lead to disaster
For example the Columbia shuttle
disaster occurred when foam shed
from the external tank damaged the
heat shielding on the left wing
• Similar foam strikes had
happened many times
• NASA managers did not believe
this strike could cause the loss
of Columbia
18. FAILURES CAN OFTEN BE
RECOVERED FROM
A disaster is rarely an instantaneous event. Often a disaster
results from an unfortunate combination of failures and often
these take place over a period of time.
• Failures can often be mitigated
• Failures can often be recovered from
A resilient system is one that is able to recover from failures.
It is the opposite of a brittle system.
We must give operators the ability to mitigate and recover
from failure.
20. ENGINEERING CANNOT
ELIMINATE FAILURES
Good engineering can greatly reduce but never eliminate the
possibility of failure.
• Testing can be used to find problems but never show their
absence
• Formal methods can be used to eliminate design faults but
this does not mean problems will not emerge in
manufacturing or system operation
Critical systems engineering must focus on operation as well
as design.
Systems are increasingly operated as services rather than
products, so this risk is increasingly on the developers (!)
22. SUCCESS IS AS COMPLEX AS
FAILURE
We need to learn from success, not just failure
• But success is even harder to define than failure.
Success is a judgment
• One person’s success is another’s failure
• A successful system may just be one that hasn’t yet failed
Success can be studied in terms of
• Noteworthy success
• Ordinary operation
• “Successful failures”
25. SOCIO-TECHNICAL SYSTEMS
ENGINEERING
[Diagram: layered stack, top to bottom]
• Society
• Organisations
• People and Processes
• Applications
• Communications + Data Management
• Operating Systems
• Equipment
Labels beside the stack: Socio-Technical Systems Engineering;
Software Engineering
26. RESILIENCE
Design for failure
• How can a system fail gracefully and appropriately?
Design for recovery
• How can a system be designed to support mitigation and
recovery from failure?
Design for avoidance
• How can we reduce the number of failures a system will
encounter?
For all of these we need to understand systems operation.
Critical systems engineering is not just about the design
process, but also about understanding operation.
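As a small software illustration of designing for recovery and for graceful failure, the sketch below retries an operation a few times and then falls back to a degraded answer rather than crashing. It is a minimal sketch under stated assumptions: call_with_recovery, make_flaky and the "cached data" fallback are hypothetical names invented for this example, and a real system would also need timeouts, logging and operator visibility.

```python
def call_with_recovery(operation, fallback, max_attempts=3):
    """Try `operation` up to max_attempts times, then degrade to `fallback`.

    Returns a (result, status) pair so the caller can see whether the
    answer is live or degraded.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return operation(), f"ok after {attempt} attempt(s)"
        except Exception as exc:  # design for recovery: retry on failure
            last_error = exc
    # Design for failure: fail gracefully with a degraded result.
    return fallback(), f"degraded: {last_error}"


def make_flaky(failures_before_success):
    """Build a hypothetical operation that fails a set number of times."""
    state = {"calls": 0}

    def operation():
        state["calls"] += 1
        if state["calls"] <= failures_before_success:
            raise RuntimeError("transient fault")
        return "live data"

    return operation


print(call_with_recovery(make_flaky(2), lambda: "cached data"))
print(call_with_recovery(make_flaky(5), lambda: "cached data"))
```

The first call recovers on its third attempt and returns live data. The second exhausts its attempts and returns the cached value with a status saying so, which leaves the decision about what to do next with the operator.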
28. SUMMARY
1. Failure is the norm, not the exception
2. Resilient systems are able to cope with, recover from and
avoid failure
3. Resilience is a socio-technical problem, not just a technical one
29. HOMEWORK
First read
Chapter 3 “The Human Contribution” from J Reason (2008)
The Human Contribution. Farnham, Ashgate.
Then
Make a note of any interesting slips, lapses, mistakes,
violations, etc. that you have made recently
Editor's notes
Data center question: roughly 90% (10,000 servers) and 37% (100,000 servers)
All Nippon Airways. Flight “flips” on 6/9/2011
During the swine flu outbreak? How well has a medic washed hands? (line infections)
Qantas Batam Island Engine Explosion
Unsinkable system
The failures and successes of the London Ambulance Service