We struggled with too many issues in our live products. They prevented project teams from making forecasts or developing new features without interruption. In this presentation I share how we successfully applied the ITIL Problem Management and Incident Management processes by taking only the best from them. This allowed us to fix the problems in our organization more effectively, while still letting different teams use different methodologies.
25. identify
Incident only when:
➔ game becomes unavailable, or
➔ game revenue drops by more than €XXX, or
➔ there are severe issues with the servers, or
➔ the fix can't wait for the next planned deployment
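The qualification rule above can be sketched as a single predicate. This is only an illustration: the `IssueReport` type, its field names, and the `is_incident` function are hypothetical, and the revenue threshold is passed in as a parameter because the slides do not state its value.

```python
from dataclasses import dataclass


@dataclass
class IssueReport:
    """Hypothetical summary of a reported live issue."""
    game_unavailable: bool
    revenue_drop_eur: float
    severe_server_issue: bool
    can_wait_for_next_deployment: bool


def is_incident(report: IssueReport, revenue_threshold_eur: float) -> bool:
    """Return True if any one of the four criteria from the slide holds."""
    return (
        report.game_unavailable
        or report.revenue_drop_eur > revenue_threshold_eur
        or report.severe_server_issue
        or not report.can_wait_for_next_deployment
    )


# Placeholder threshold for illustration; the real value is not given in the slides.
threshold = 1000.0
minor_bug = IssueReport(False, 0.0, False, True)
outage = IssueReport(True, 0.0, False, True)
print(is_incident(minor_bug, threshold), is_incident(outage, threshold))
```

Anything that does not satisfy the predicate stays in the team's normal backlog instead of triggering the incident process.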
26. We don’t panic when an incident occurs. We follow the process:
33. handle
➔ Elect a SWAT team
➔ Plan Communication
➔ Kick-off
➔ Check the Knowledge Base
➔ Create an IM chat group
➔ Send email notifications to stakeholders on every update
➔ Follow defined policies and guidelines
52. we ask questions:
could any of the incidents be prevented?
can we detect incident symptoms?
are there any patterns?
54. “the aim of incident management is to restore the service as
quickly as possible, often through a workaround, rather than
through trying to find a permanent solution which is the aim
of problem management.”
Summary
56. incident management process
IDENTIFICATION
‣ Receive data regarding the incident and ensure it is complete and clear.
‣ Qualify the issue as an Incident.
DELIVERABLES
➨ Incident process is triggered
HANDLING
‣ Elect a SWAT team to fix the incident.
‣ Decide on a War Room for the Huddle.
‣ Huddle and lay down an Action Plan.
‣ Send out an email notification to all stakeholders. No one is allowed to disturb the SWAT team while they actively investigate or resolve the Incident.
‣ Huddle regularly to update the Action Plan.
‣ Send out updates to all stakeholders.
‣ If DevOps is necessary, follow the “Emergency IT Support Policy”.
‣ Follow the “Live Actions Guidelines”.
DELIVERABLES
➨ Resolved incident (possibly via a workaround)
➨ Sent Incident report(s) to XXX email
CLOSURE
After the incident has been solved, make sure to:
‣ Communicate the results to relevant stakeholders by sending an email that follows the 'Issue on Live' closure procedure template.
‣ Take corrective actions to prevent the issue from happening again. Create JIRA tickets where possible.
‣ Evaluate possible procedure updates that can be made in the teams in the pipeline.
‣ Submit the “Incident Login” form.
DELIVERABLES
➨ Sent report to XXX email
➨ Submitted related JIRA tickets
➨ Submitted “Incident Login” form
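The three stages and their deliverables can be modelled as a small gate: an incident moves on only when the current stage's deliverables are done. The sketch below is an illustration under assumptions. The `Stage`, `Incident`, and `DELIVERABLES` names are hypothetical, and the deliverable labels are shortened paraphrases of the slide text.

```python
from dataclasses import dataclass, field
from enum import Enum


class Stage(Enum):
    IDENTIFICATION = 1
    HANDLING = 2
    CLOSURE = 3


# Deliverables per stage, paraphrased from the slide (labels are illustrative).
DELIVERABLES = {
    Stage.IDENTIFICATION: {"incident process triggered"},
    Stage.HANDLING: {"incident resolved", "incident report sent"},
    Stage.CLOSURE: {
        "closure report sent",
        "JIRA tickets submitted",
        "Incident Login form submitted",
    },
}


@dataclass
class Incident:
    title: str
    stage: Stage = Stage.IDENTIFICATION
    done: set = field(default_factory=set)
    closed: bool = False

    def missing(self) -> set:
        """Deliverables of the current stage that are not yet done."""
        return DELIVERABLES[self.stage] - self.done

    def advance(self) -> None:
        """Move to the next stage, or close the incident after CLOSURE."""
        if self.closed:
            raise ValueError("incident is already closed")
        if self.missing():
            raise ValueError(f"missing deliverables: {sorted(self.missing())}")
        if self.stage is Stage.CLOSURE:
            self.closed = True
        else:
            self.stage = Stage(self.stage.value + 1)


incident = Incident("Login outage")
incident.done.add("incident process triggered")
incident.advance()
print(incident.stage.name, sorted(incident.missing()))
```

Keeping the deliverables in one table makes it easy to adjust the gate when a team's procedure changes, without touching the transition logic.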
57. IDENTIFICATION
➔ Incident detected
➔ Send email (keep one thread)
HANDLING
➔ Action plan (five-minute huddle of the SWAT team in a war room)
➔ Send email
➔ Create / Update JIRAs (contact OPS if necessary)
➔ Fix (first QA, then Live)
➔ Resolved? (yes / no)
➔ Open >3 days? (yes / no)
➔ Add / Update Incident record (via Incident Login form)
CLOSURE
➔ Create JIRAs for fixing root cause or other related issues if possible
➔ Add / Update Incident record (via Incident Login form)
58. problem management process
PROBLEM DETECTION | ROOT CAUSE IDENTIFICATION | SOLUTION DEFINITION | PRIORITISATION | PROBLEM LOGGING | IMPLEMENTATION, CLOSURE
PROBLEM DETECTION
ACTIVITIES
‣ Define the problem
‣ Receive data regarding the problem from incident management
‣ Ensure the collected data is complete and clear
‣ Define which teams or departments are affected
‣ Gather other data from the day of the incident
‣ Analyze symptoms
‣ Analyze the data collected from various sources relating to the major incident
‣ Analyze historical data to see whether such a problem has occurred before
DELIVERABLES
➨ Analyzed problem
➨ Updated incident record
ROOT CAUSE IDENTIFICATION
ACTIVITIES
Problem investigation and diagnosis (requires tech experts)
‣ Conduct root cause analyses using various techniques if necessary:
• Make a sketch
• Draw an Ishikawa (fishbone) diagram
• Kepner-Tregoe
• Flow diagrams
• etc.
‣ Determine workarounds
‣ Think of potential solutions
‣ Assess the problem and the recommended actions to resolve it
DELIVERABLES
➨ Updated problem record
➨ Root cause detected
➨ Workaround(s) identified
SOLUTION DEFINITION
ACTIVITIES
‣ Identify the team for solution development
‣ Determine possible resolutions
‣ Choose the best approach
‣ Make sure the solution can effectively prevent recurrence
DELIVERABLES
➨ Updated problem record
➨ Other tasks in JIRA
➨ Updated incident records
➨ Defined resources that are necessary for implementation
PRIORITISATION
ACTIVITIES
‣ Identify the urgency and impact of this task
‣ Define a priority in the Problem management queue
‣ Identify who is responsible for the implementation
‣ Decide how this problem should be prioritized among the other tasks of the team
DELIVERABLES
➨ The task(s) have a priority
➨ The team leads are aware of the task and can plan it in their sprints
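One way to turn urgency and impact into a queue order is a simple additive score. The slides do not say how the two are combined, so the scheme below is an assumption modelled on common urgency/impact matrices; the `ProblemTask` type, the three-level scale, and the function names are hypothetical.

```python
from dataclasses import dataclass

# Assumed three-level scale; lower number means more pressing.
LEVELS = {"high": 1, "medium": 2, "low": 3}


@dataclass
class ProblemTask:
    key: str       # e.g. a JIRA key (illustrative)
    urgency: str   # "high", "medium" or "low"
    impact: str    # "high", "medium" or "low"


def priority(task: ProblemTask) -> int:
    """Combine urgency and impact into a priority from 1 (highest) to 5 (lowest)."""
    return LEVELS[task.urgency] + LEVELS[task.impact] - 1


def prioritise(queue: list) -> list:
    """Return the queue ordered by priority; ties keep their original order."""
    return sorted(queue, key=priority)


queue = [
    ProblemTask("PRB-3", "low", "medium"),
    ProblemTask("PRB-1", "high", "high"),
    ProblemTask("PRB-2", "medium", "high"),
]
print([task.key for task in prioritise(queue)])
```

The resulting order is only an input for the team leads, who still decide how each task fits among the other work in their sprints.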
PROBLEM LOGGING
ACTIVITIES
‣ Create a new JIRA record or update the old one:
• Unique ID, timestamp
• Name of submitter
• Links to associated problem records (with hierarchy if applicable)
• Links to associated incident records
• Problem description
• Problem category
• Status
• Severity and Impact
• Responsible person, team
• Affected game
• Associated JIRA records
• History of all actions taken
• Workaround
• Permanent solution (if known already)
DELIVERABLES
➨ Created/updated problem record
➨ Analyzed and updated incident data
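The fields of the problem record listed above map naturally onto a typed structure. The following is a minimal sketch, not the actual JIRA schema: the `ProblemRecord` class, its attribute names and types, and the `log` helper are all hypothetical.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class ProblemRecord:
    """Illustrative problem record with the fields named on the slide."""
    submitter: str
    description: str
    category: str
    status: str
    severity: str
    impact: str
    responsible_person: str
    responsible_team: str
    affected_game: str
    # Unique ID and timestamp are generated when the record is created.
    record_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    parent_problem: Optional[str] = None          # hierarchy, if applicable
    related_problems: List[str] = field(default_factory=list)
    related_incidents: List[str] = field(default_factory=list)
    related_jira_keys: List[str] = field(default_factory=list)
    history: List[str] = field(default_factory=list)
    workaround: Optional[str] = None
    permanent_solution: Optional[str] = None      # filled in once known

    def log(self, action: str) -> None:
        """Append a timestamped entry to the history of actions taken."""
        stamp = datetime.now(timezone.utc).isoformat()
        self.history.append(f"{stamp} {action}")


record = ProblemRecord(
    submitter="A. Submitter",
    description="Payments intermittently fail after deployment",
    category="backend",
    status="open",
    severity="high",
    impact="high",
    responsible_person="B. Owner",
    responsible_team="Payments",
    affected_game="Example Game",
)
record.related_incidents.append("INC-42")
record.log("Linked incident INC-42")
print(record.status, len(record.history))
```

All values in the usage example are invented for illustration.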
IMPLEMENTATION, CLOSURE
ACTIVITIES
‣ Conduct activities to implement the fix to the problem
‣ Verify that the solution is appropriate and close the problem record
‣ Submit a record to the Error Knowledge Base if applicable
‣ Share lessons learned via email if reasonable
‣ Ensure that all the associated incidents are closed with a proper fix or resolution
DELIVERABLES
➨ Updated incident record
➨ Updated problem record
➨ Updated Known Errors Knowledge Base spreadsheet
➨ Lessons learned shared
➨ Report is sent
59. DETECTION & LOGGING
➔ Choose the problem area
➔ Analyze related incident data (symptoms, relations, historical data)
➔ Request missing data (symptoms, relations, historical data)
➔ Create new Problem Record / update existing (JIRA)
ROOT CAUSE IDENTIFICATION & SOLUTION DEFINITION
➔ Identify root cause, workarounds
➔ Determine work for identified solutions (and choose implementation team)
➔ Prioritize
IMPLEMENTATION & CLOSURE
➔ Implement (by defined implementation team)
➔ Close (update knowledge base, submit lessons learned, send email)
Record updates shown alongside the steps: Incident record(s), Problem record, Known Errors
60. problem management process simplified
Prepare Meeting
Who: Problem Manager
Process summary:
- ensure the quality of the incident spreadsheet
- select follow-ups
- prefill the problem management spreadsheet
Effort: 3-5 man-hours
When: no later than 3 days before the meeting
Run Meeting
Chairperson: Problem Manager
Participants: PMs, APs, OPS representative
Frequency: monthly
Activities:
- identify problems
- prioritize and agree on actions
- define responsible teams
When: at the end of the month. In case of holidays or an emergency, moved to the next working day.
Outcomes: Assigned tasks
Take Actions
Who: the person made responsible for each problem, as defined in the meeting
Process summary:
- implement
- verify
- update all records
Outcomes: Updated incident and problem records
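The scheduling rules for the monthly meeting can be expressed as simple date arithmetic. This sketch rests on assumptions the slides do not state: "end of the month" is read as the last calendar day, weekends count as non-working days, and the "3 days" before the meeting are calendar days. All function names are hypothetical.

```python
import calendar
from datetime import date, timedelta


def is_working_day(day: date, holidays: frozenset = frozenset()) -> bool:
    """Monday to Friday, excluding the given holidays (assumed definition)."""
    return day.weekday() < 5 and day not in holidays


def meeting_date(year: int, month: int, holidays: frozenset = frozenset()) -> date:
    """Last day of the month, moved to the next working day if needed."""
    day = date(year, month, calendar.monthrange(year, month)[1])
    while not is_working_day(day, holidays):
        day += timedelta(days=1)
    return day


def preparation_deadline(meeting: date, days_before: int = 3) -> date:
    """Latest date by which the Problem Manager should finish preparation."""
    return meeting - timedelta(days=days_before)


meeting = meeting_date(2024, 5)
print(meeting.isoformat(), preparation_deadline(meeting).isoformat())
```

Note that under this reading the meeting can slip into the first days of the following month when the last day falls on a weekend or holiday.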