The presentation was compiled by Thinking Dimensions Global in November 2012 for the ITSMF conference held in London. The content relates to the KEPNERandFOURIE process for dealing with incidents and problems in IT and in particular a means of determining the Root Cause and providing the best solution.
The presentation was co-presented by Dr Matthys Fourie and John Hudson of Thinking Dimensions Global.
Information Technology - Discover the Root Cause and Develop a solution through structured processes
1. John Hudson & Matt Fourie
5 November 2012
Go Direct to the Root Cause – itRCA the solution?
2. Agenda
• Introduction
• Current situation
• Components of a credible approach
• Minimalistic information, being specific and knowledge (wisdom) creation
• The Three critical investigation skills
1. Service Recovery Analysis
2. Technical Cause Analysis
3. Root Cause Analysis
• Client outcomes
• Questions & answers
“Most incident investigators ask the wrong questions, so do not change your people but change the questions they are asking”
Matt Fourie
3. Thinking Dimensions
Some of our recent clients...
Barclays IT
ANZ IT Division
Macquarie ITG
Unisys
Polypore IT
Medtronic IT
SITA Global
BT Financial
Westpac IT
McDonalds IT
Queensland Police IT
Lockheed Martin Space Systems
SPARQ IT
• Thinking Dimensions International - operating KEPNERandFOURIE company initiatives for the last 25 years
• Specialising in RCA Methodology for IT Incident and Problem Management
4. Global Presence
• Baxter International
• Blue Cross Blue Shield
• Bosch
• Caltex Oil
• Carraro
• Crown Cork and Seal
• Dometic
• Electrolux
• Federal Judiciary Center
• General Dynamics IT
• Hollister, Inc
• Infineon
• BASF
• Macquarie Bank IT
• BT Financial IT
• Stihl
• Westpac IT
• Maersk
• Norfolk Naval Shipyard
• Selig
• Siemens
• SITA
• SKF
Americas
• Canada
• Chile
• Peru
• USA
EMEA
• Germany
• Italy
• Netherlands
• Poland
• Saudi Arabia
• South Africa
• Spain
• Turkey
• United Kingdom
Asia Pacific
• Australia
• China
• India
• South Korea
• Thailand
• Singapore
6. The Three Skills…
Incident
1. itSRA®: Service Recovery Analysis (Recovery & Containment Tools & Templates)
2. itTCA®: Technical Cause Analysis (Technical Cause Process and Techniques)
3. itRCA®: Root Cause Analysis (Root Cause & FIX Checklist & Templates)
7. Current Default Root Causes
• Hardware
• Software
• “Human Error”
• Environment
Technical Cause
Root Cause
14. Incisive Thinking
Incident Statement | Technical Cause | Root Cause
Internet Banking Degrading
New browser configuration issue
Integrative testing not done properly
Encrypted “hello” message not returned
'Beta' Certificate used
Policy requirements for “production” environment not adhered to
20. Incisive Thinking
Incident Statement | Technical Cause | Root Cause
G-Force System Freezing
High volume
Too many users allowed access
G-Force SQL DB thread count exceeding maximum threads
G-Force program not closing out
Vendor implemented an untested program update
21. Basic phases of problem solving
Procedure for addressing an Incident
Divergent Thinking
1. State the purpose
2. Gather incident/problem detail
Convergent Thinking
3. Evaluate for causes
4. Confirm technical/root cause
   1. Testing
   2. Verifying cause
23. Good RCA…
YOU NEED TO SOLVE AN INCIDENT:
• QUICKLY [Service Recovery]
• ACCURATELY [Technical Cause]
• PERMANENTLY [Root Cause]
24. Factors in minimalistic approach
Factor | IS | BUT NOT
• What
• Where
• When
• How
• Why
• Who
I keep six honest serving-men
(They taught me all I knew);
Their names are What and Why and When
And How and Where and Who.
I send them over land and sea,
I send them east and west;
But after they have worked for me,
I give them all a rest.
Rudyard Kipling
25. Extreme Focus With “Specificity”
Object: Servers
Fault: Not communicating
“The key to success is to be insistent about specificity – the more specific you are the better your chances to solve an incident.”
KEPNERandFOURIE
Specificity Rules
• One object one fault
• Single-minded & simplistic
• Highly focused
• Must find the correct entry point
• Ask a question – expect an answer
29. Extreme Focus With “Specificity”
Object | Fault (each line more specific than the one before)
1. Servers | Not communicating
2. Data not transferred
3. Sent but not received by receiving servers
4. Data for Large Outlets | Not received
5. Sales turnover numbers for Large Outlets | Not received
Specificity Rules
• One object one fault
• Single-minded & simplistic
• Highly focused
• Must find the correct entry point
• Ask a question – expect an answer
30. Creating Intelligence
DATA | INFORMATION (IS, BUT NOT) | KNOWLEDGE (WHY NOT)
IS: Internet Banking | BUT NOT: Intranet Banking | WHY NOT: Different routing, SSL handshake
IS: Slow | BUT NOT: Freezing | WHY NOT: Volume?
IS: APAC users | BUT NOT: USA, UK | WHY NOT: ADSL lines
IS: Started Oct 1 | BUT NOT: Before | WHY NOT: New passwords
IS: Continuous | BUT NOT: After 4pm | WHY NOT: Different routing
Unexpected Outcomes
• “BUT NOT” clarifies the facts
• Creates a curious “contrast”
• Looking at answers at a “granular level”
• Stimulates deductive reasoning
32. Service Recovery [MTR]
FACTOR | IS | BUT NOT | REQUIREMENT
OBJECT | Mobile website access | PC website access | WHAT TO RESTORE
FAULT | Denied, not authorized | Slow/freezing | WHAT PROBLEMS TO REMOVE
WHO | Blackberry users | Other Smart phones | WHO
WHERE | Asia | ANZ, UK, USA | WHERE
IMPACT | Customer complaints | | TO WHAT EXTENT
PATTERN | Sporadic | Continuous | FOR HOW LONG
ACTIONS TO CONSIDER
33. Service Recovery [MTR]
Statement: Restore website access to customers
Key Solution Requirements, scored against the possible actions (columns 1 | 2 | 3 | 4 | 5)
1. Provide access to client to at least receive interim non-availability notice: 0 | 3 | 2 | 1 | 3
2. No loss of Data: 3 | 3 | 0 | 0 | 1
3. Should not impact System Performance: 1 | 0 | 3 | 1 | 0
4. ADSL compatible for Asia: 1 | 2 | 0 | 0 | 0
5. Improve reliability: 3 | 0 | 3 | 1 | 1
6. Implementation within the hour: 1 | 3 | 3 | 1 | 2
Possible Actions:
1. Upload or switch on simple site maintenance page
2. Set up or start up back up service
3. Reroute 20/80 service all to back up service
4. Restrict access to low load tasks only
5. Allow access based on region
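The matrix above can be compared action by action with a few lines of Python. This is only a sketch: the slide does not say how the scores are combined, so the unweighted column sum used here is an assumption, and the names `scores`, `actions` and `total_per_action` are illustrative rather than part of the KEPNERandFOURIE material.

```python
# Scores transcribed from the slide: one row per key solution requirement,
# one column per possible action (actions 1 to 5).
scores = [
    [0, 3, 2, 1, 3],  # 1. Interim non-availability notice reaches the client
    [3, 3, 0, 0, 1],  # 2. No loss of data
    [1, 0, 3, 1, 0],  # 3. Should not impact system performance
    [1, 2, 0, 0, 0],  # 4. ADSL compatible for Asia
    [3, 0, 3, 1, 1],  # 5. Improve reliability
    [1, 3, 3, 1, 2],  # 6. Implementation within the hour
]

actions = [
    "1. Upload or switch on simple site maintenance page",
    "2. Set up or start up back up service",
    "3. Reroute 20/80 service all to back up service",
    "4. Restrict access to low load tasks only",
    "5. Allow access based on region",
]


def total_per_action(matrix):
    """Sum each column, i.e. the total score of each action.

    Assumption: every requirement carries the same weight.
    """
    return [sum(column) for column in zip(*matrix)]


totals = total_per_action(scores)

# Highest total first; sorted() is stable, so ties keep slide order.
ranked = sorted(zip(actions, totals), key=lambda pair: pair[1], reverse=True)

for action, total in ranked:
    print(f"{total:>3}  {action}")
```

If some requirements matter more than others, each row could be multiplied by a weight before summing. The slide gives no weights, so none are assumed here.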
36. Technical Cause Analysis [TCA - MTTR]
Columns: IS | BUT NOT | WHY NOT
OBJECT: What object and which other object(s) not?
FAULT: What fault and which other typical faults not?
USERS: Who has the problem and who does not?
WHERE: Where are these users and where could they have been but are not?
TIMING: When did it happen first time and when not?
PATTERN: What is the pattern of faults and what could it have been but is not?
CYCLE: In which cycle does the problem occur and in which cycle does it not occur?
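One way to keep track of which of these questions are still unanswered is a small worksheet structure. The sketch below is a hypothetical illustration, not a KEPNERandFOURIE tool: the class `Row` and the helpers `new_worksheet` and `open_questions` are invented names, and the two example answers are taken from the Fireburst case used elsewhere in the deck.

```python
from dataclasses import dataclass

# Dimension names and questions as worded on the slide.
TEMPLATE = [
    ("OBJECT", "What object and which other object(s) not?"),
    ("FAULT", "What fault and which other typical faults not?"),
    ("USERS", "Who has the problem and who does not?"),
    ("WHERE", "Where are these users and where could they have been but are not?"),
    ("TIMING", "When did it happen first time and when not?"),
    ("PATTERN", "What is the pattern of faults and what could it have been but is not?"),
    ("CYCLE", "In which cycle does the problem occur and in which cycle does it not occur?"),
]


@dataclass
class Row:
    dimension: str
    question: str
    is_: str = ""       # the IS answer
    but_not: str = ""   # the BUT NOT answer

    def gaps(self):
        """Return the names of the columns that are still empty."""
        cells = (("IS", self.is_), ("BUT NOT", self.but_not))
        return [name for name, value in cells if not value.strip()]


def new_worksheet():
    return [Row(dimension, question) for dimension, question in TEMPLATE]


def open_questions(sheet):
    """List every dimension that still has an unanswered IS or BUT NOT."""
    return [
        f"{row.dimension} ({' / '.join(row.gaps())}): {row.question}"
        for row in sheet
        if row.gaps()
    ]


sheet = new_worksheet()
sheet[0].is_ = "Fireburst V2.0 connection"
sheet[0].but_not = "E-Express, Mango connections"
sheet[1].is_ = "Dropping"

for line in open_questions(sheet):
    print(line)
```

The WHY NOT column is left out on purpose: it is filled in only after the IS and BUT NOT facts are known.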
39. Technical Cause Analysis [TCA]
DIMENSION: IS | BUT NOT | WHY NOT
Object: Fireburst V2.0 connection | E-Express, Mango connections | F/B upgrade from V1 to V2, Poor testing issue
Fault: Dropping | Freezing, slow | Time out settings, configuration of drivers
Location of Object: ANZ, USA, UK | Asia | LAN, Proxy server issues, F/Wall rules
Timing: Monday, Sept 2nd with SOB | Any time earlier than Sept 2nd | Java upgrade, Netscape upgrade
Pattern: Continuous | Sporadic, Periodic | Don't know
Life Cycle: When doing a transaction | “x” time into transaction | Operator error, Code error on a specific page
Phase of Work: Just after logging in | Logging in or out | OS configuration issue, DNS issue
Possible Causes & Testing
1. Proxy server tampered with during the Java upgrade on the LAN: X
2. Java upgrade caused driver incompatibility with Fireburst website V2.0: √ √ X
3. Netscape upgrade caused driver incompatibility with Fireburst website V2.0: √ √ A1 √ √ √ √
A1: Only if the staff in Asia did not upgrade to Netscape
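The testing step above can be sketched in Python. Two assumptions are made that the slide does not state explicitly: that the marks are listed in dimension order (Object first), and that a single X is enough to reject a possible cause. The function name `evaluate` and the verdict wording are illustrative only.

```python
# Marks transcribed from the slide for each possible cause.
# "√" = the cause explains the IS / BUT NOT facts for that dimension,
# "X" = it does not, "A1" = it explains them only under assumption A1.
results = {
    "Proxy server tampered with during the Java upgrade on the LAN":
        ["X"],
    "Java upgrade caused driver incompatibility with Fireburst website V2.0":
        ["√", "√", "X"],
    "Netscape upgrade caused driver incompatibility with Fireburst website V2.0":
        ["√", "√", "A1", "√", "√", "√", "√"],
}

assumptions = {
    "A1": "Only if the staff in Asia did not upgrade to Netscape",
}


def evaluate(marks):
    """Return (verdict, assumptions_to_verify) for one possible cause.

    Assumed rule: any "X" rejects the cause; anything other than "√"
    or "X" is treated as an assumption that still has to be verified.
    """
    if "X" in marks:
        return "rejected", []
    to_verify = [mark for mark in marks if mark != "√"]
    verdict = "survives, verify assumptions" if to_verify else "survives"
    return verdict, to_verify


for cause, marks in results.items():
    verdict, to_verify = evaluate(marks)
    print(f"{verdict}: {cause}")
    for key in to_verify:
        print(f"    {key}: {assumptions[key]}")
```

Under these assumptions only the Netscape upgrade survives testing, and it remains a hypothesis until A1 has been checked with the staff in Asia.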
41. A Case of a good thinking process
• Deviation Statement
• Factor Analysis
• Possible causal factors
• Testing the causal hypotheses
• Find the underlying reason(s) for incident
'The truth, if it exists, is in the details'
“Bartlett – Familiar Quotations”
42. The Right Starting Point
• Find the technical cause first
• Do 5 Whys to get to the systemic level
• Find the root cause(s)
• Fix the incident/problem for good
“If a team has not solved an incident, the person with the information was not invited”
Chuck Kepner
43. Four Questions to get Started
• Is the object deviation within the control of your own system? Can you fix the root cause with actions under your control?
• Is the technical cause deviation in the vendor's system? Can you only fix the root cause with the vendor's help?
• Is the object deviation within the control of your own system? Can you only fix the root cause with the vendor's help?
• Is the technical cause deviation in the vendor's system? We would only be able to take avoiding actions.
ITRCA Max4
ITRCA Max4
RiskWise
44. Root Cause Analysis [RCA]
DIMENSION | IS | BUT NOT
APPLICATION: What application and which other applications not?
DEVIATION: What deviation do we have and which ones not?
FUNCTION: Which job/function/process is involved and which ones not?
WHO (USERS): Who has the problem and who does not?
WHERE: Where are these users and where could they have been but are not?
TIMING: When did it happen first time and when not?
FREQUENCY: How frequent is the fault occurring?
47. Root Cause Analysis [RCA]
COMPONENT: CAUSAL FACTORS
Decision Making: Process and Collaboration for inputs
Implementation issues: Resources and Scope & Definition of project
Standard Operating Procedures: Applicability of SOP and Awareness of SOP
Management: Management of Work and Staff
Measurement: KPIs and Roles & Responsibilities
CAUSAL ELEMENTS
• Critical stakeholder requirements not consulted for this task
• Inadequate authority levels for making good decisions
• Poor decision process and documentation for this task
• Inadequate standards guiding the decision making
• Time Zone difficulties hampering effective decision making
• Unrealistic time, cost and performance expectations
• Poor initial estimation of resources needed for the project
• Poor updated approval data making the procedure unclear
• Poor work guidance/coaching for correct performance
• Work standards for this task are not enforced
• Poor management support in getting this task done
• KPI and metrics regarding this output not clear or absent
• Poor feedback on this KPI
• Duplication and GAPS making roles and responsibilities difficult
48. Root Cause Analysis [RCA]
COMPONENT: CAUSAL FACTORS
Support: Internal and External Vendor support
Communications: Clarity of communications and instructions
Work Environment: Task Interference and consequences
Skills: Complexity and applicability
Testing Practices: Procedures and requirements
Personal: Aptitude and Attitude
CAUSAL ELEMENTS
• Overuse of the SME causing sub-standard work
• Poor continual vendor support for this output
• Continual interruptions in performing the task
• Task performance request not properly understood
• Work environment not conducive for the demands of the task
• Unrealistic task and performance expectation for this task
• Not having enough experience with similar tasks
• No vendor training provided for new product and/or service
• Poor risk analysis and decision pressure during testing
• Not all aspects tested and the test was incomplete
• Inadequate problem solving ability for this type of task
• Incumbent does not follow instructions or Standard Procedure
49. Testing the Hypothesis
1. The decision making process is too cumbersome to allow for own initiatives, and the staff member must make a choice from given alternatives, which is not optimal for the situation.
2. The job incumbent did not get the necessary support to do his job under a pressure situation, adding to task interference. ✗
3. External vendor support for certain technical decisions was not available and that resulted in a less optimized decision choice.
Final Conclusion and Action Plan:
In 2011 we were represented in 20 countries and in 12 different languages. TD has been growing steadily over the last 10 years. As the list shows, TD and its network were already working with a formidable set of global clients. 2011 was also the year in which TD formally decided, as part of its strategy, to focus exclusively on the IT market.
The procedure for problem solving is as follows. First, state the problem situation. Once you have the correct statement, and thus the correct “entry point” into the problem situation, you can gather the most relevant information pertaining to the problem. Once you have the information, you analyze it and then come to a mutually agreed answer.