#PDSummit16
Using Incident Data to Build Better
Internal Processes
Andy Domeier
Director – System Operations
SPS Commerce
Twitter: @ajdomie
Agenda
• Incident Story Time
• SPS Commerce
• 3 Qualities of Effective Incident Management
• Tips for Getting There
Incident Story Time
A Tale of an Unhealthy Incident Culture
Congratulations
Most Importantly
Or Worse…
It’s not just about the
outage response….
Has it happened before?
Will it happen again?
Why did this happen?
• Supply Chain Communications Network
• Connecting over 60,000 Trading Partners Globally
• Services:
• Fulfillment
• Integration
• Item Management
• Analytics
Core
UX
Logging
APM
SysOps • Level 1 India
Engineer • On-Call
MGMT • On-Call
ChatOps
Automation
3 Qualities of Highly Effective
Incident Management
• Measurement
• If it Moves….Graph It!
• Credit - Ian Malpass – Etsy
• https://codeascraft.com/2011/02/15/measure-anything-measure-everything/
• Transparency
• System Health & Availability
• State of the Incident
• Collaboration
• Effective Cross Team Troubleshooting
• Effective Prevention Efforts
[Diagram: Measurement, Transparency, Collaboration]
Measurement: Where to start?
Measurement: Start with the Basics
• Basics:
• Total Counts
• MTTR
• Escalations
• Group by:
• Service
• Team
• Severity
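The basics above can be sketched in a few lines of Python. This is only an illustration, assuming incidents have already been exported as simple records; the field names (`service`, `severity`, `opened`, `resolved`, `escalated`) and the sample data are hypothetical and not taken from PagerDuty or the talk.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

# Hypothetical incident export; field names and values are illustrative only.
incidents = [
    {"service": "fulfillment", "severity": "sev1",
     "opened": "2016-09-01T10:00", "resolved": "2016-09-01T11:30", "escalated": True},
    {"service": "fulfillment", "severity": "sev2",
     "opened": "2016-09-03T08:00", "resolved": "2016-09-03T08:30", "escalated": False},
    {"service": "analytics", "severity": "sev2",
     "opened": "2016-09-04T14:00", "resolved": "2016-09-04T16:00", "escalated": True},
]


def minutes_to_resolve(incident):
    """Minutes between the opened and resolved timestamps of one incident."""
    opened = datetime.fromisoformat(incident["opened"])
    resolved = datetime.fromisoformat(incident["resolved"])
    return (resolved - opened).total_seconds() / 60


def summarize(records, key):
    """Total count, MTTR (minutes) and escalation count, grouped by `key`."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record)
    return {
        name: {
            "count": len(rows),
            "mttr_minutes": mean(minutes_to_resolve(r) for r in rows),
            "escalations": sum(1 for r in rows if r["escalated"]),
        }
        for name, rows in groups.items()
    }


by_service = summarize(incidents, "service")
by_severity = summarize(incidents, "severity")
```

The same `summarize` call works for a `team` field if the export carries one.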
Measurement: Make Sense of the Spikes
• Spikes typically indicate an issue that is larger in scope
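One simple way to surface such spikes, sketched here under assumptions: compare each day's alert count against the median day and flag anything far above it. The threshold factor of 3 and the sample counts are arbitrary choices for illustration, not values from the talk.

```python
from statistics import median


def find_spikes(daily_counts, factor=3.0):
    """Return the days whose count exceeds `factor` times the median day.

    The median is used as the baseline so that the spike itself does not
    inflate the value it is compared against.
    """
    baseline = median(daily_counts.values())
    return [day for day, count in daily_counts.items() if count > factor * baseline]


# Hypothetical alert totals per day.
daily_alerts = {"Mon": 12, "Tue": 9, "Wed": 85, "Thu": 11, "Fri": 10}
spike_days = find_spikes(daily_alerts)
```

Days flagged this way are candidates for a closer look at whether a shared dependency was involved.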
Trivia:
• A Group of Geese is a Flock
• A Group of Cows is a Herd
• A Group of Tigers is a Streak
• A Group of Alerts is ?????
PagerDuty:
Infrastructure Health Module (Preview)
Alert or Incident
Incidents
• Service Impacting
• Important to Detail & Understand
Example:
“The Site is Down”
Alerts
• Tactical & Explicit
• Important to Trend & Remediate
Examples:
“CPU > 99%”
“Disk Space @ 95%”
Measurement: Alert Analysis
• Trend Alert Totals Over Time
• Try to remove incident related alerts
• Group by:
• Alert Types – CPU, Memory, etc.
• Source Host – Common Themes
• Host Types – Database, App, Network, etc.
• Prioritize Time to Remediate
• Short Term & Long Term
• Manage Alert Fatigue
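A minimal sketch of this grouping, assuming each alert record notes whether it was tied to an incident. The `incident_id` field, the host names, and the sample data are hypothetical.

```python
from collections import Counter

# Hypothetical alert export; `incident_id` is None for standalone alerts.
alerts = [
    {"type": "cpu", "host": "db-01", "host_type": "database", "incident_id": None},
    {"type": "disk", "host": "db-01", "host_type": "database", "incident_id": None},
    {"type": "cpu", "host": "app-03", "host_type": "app", "incident_id": "INC-7"},
    {"type": "disk", "host": "db-01", "host_type": "database", "incident_id": None},
    {"type": "memory", "host": "app-02", "host_type": "app", "incident_id": None},
]


def top_offenders(records, key, n=3):
    """Count standalone alerts by `key`, leaving out incident-related ones."""
    standalone = [r for r in records if r["incident_id"] is None]
    return Counter(r[key] for r in standalone).most_common(n)


noisiest_hosts = top_offenders(alerts, "host")
noisiest_type = top_offenders(alerts, "type", n=1)
```

Sorting by count gives a rough remediation order: the hosts and alert types at the top are where fixes reduce the most noise.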
“Alert trends ignored
today are tomorrow’s
incidents…”
Measurement: Incident Rates & Cost
• Trend Incident Rates by Service
• Establishes Frequency & MTTR Trends
• Enables benchmarking (& Comparison)
• Enables forecasting to effectively plan time
• Establish Cost Metrics
• Recovery Efforts
• Capture the # of engineers involved in recovery efforts
• Capture the hours of engineering effort involved in recovery
• Customer Impact
• Correlate customer contacts to specific incidents
• Establish business metrics that can reflect customer impact
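As a rough sketch of the recovery-effort side of these cost metrics: record the engineers and engineering hours per incident, then roll them up by service. The record layout, the flat hourly rate, and the numbers below are assumptions for illustration; estimates are enough here.

```python
from dataclasses import dataclass


@dataclass
class RecoveryEffort:
    service: str
    engineers: int          # number of engineers pulled into the recovery
    engineer_hours: float   # total hours of effort across those engineers
    customer_contacts: int  # contacts correlated to this incident


def cost_by_service(efforts, hourly_rate):
    """Sum the estimated recovery cost per service at a flat hourly rate."""
    totals = {}
    for effort in efforts:
        cost = effort.engineer_hours * hourly_rate
        totals[effort.service] = totals.get(effort.service, 0.0) + cost
    return totals


# Hypothetical figures.
efforts = [
    RecoveryEffort("fulfillment", engineers=4, engineer_hours=6.0, customer_contacts=12),
    RecoveryEffort("fulfillment", engineers=2, engineer_hours=1.5, customer_contacts=0),
    RecoveryEffort("analytics", engineers=3, engineer_hours=4.0, customer_contacts=5),
]
costs = cost_by_service(efforts, hourly_rate=100.0)
contacts = sum(e.customer_contacts for e in efforts)
```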
Measurement: Incident Cause & Recovery
• Analyze Cause with Organization
• Potential Causes:
• Change Released – Reference Change Ticket
• Establish Objective Confidence Levels for Change (by Service)
• Code/Infra/Bug Issues – Reference Bug Ticket
• Creates a Tangible Cost to Priority Discussion
• 3rd Party Service Dependency (Cloud, Monitoring, ISP, etc.)
• Tangible Business Impact
• Recovery
• Corrective Action
• Monitoring Effectiveness
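One possible reading of an "objective confidence level for change" is the share of released changes that did not lead to an incident, per service. The sketch below assumes that definition and hypothetical counts; the talk does not specify a formula.

```python
def change_confidence(changes_released, incidents_from_change):
    """Fraction of released changes per service that caused no incident.

    This definition is an assumption made for illustration only.
    """
    confidence = {}
    for service, total in changes_released.items():
        if total == 0:
            continue  # no changes released, nothing to measure
        caused = incidents_from_change.get(service, 0)
        confidence[service] = 1 - caused / total
    return confidence


# Hypothetical counts: changes released, and incidents traced back to a change ticket.
released = {"fulfillment": 40, "analytics": 10, "integration": 0}
caused_incident = {"fulfillment": 2, "analytics": 5}
confidence = change_confidence(released, caused_incident)
```

Tracking this number over time gives the change discussion a shared figure to point at rather than a gut feeling.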
You can’t measure incidents you
avoided, so be sure to also measure
success.
Transparency: Current State & Historical
• Current State of Services & Incidents:
• Maintain a Service Status Page (Internal & External)
• Service Status – Outage, Degraded, etc….
• Incident Dashboard
• Severity
• Establishes Urgency Expectations
• Referenceable History
• Simplify Searching History
• Link Recovery Documentation to past Incidents
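A status page can be driven directly from open incidents. The sketch below assumes a severity-to-status mapping (`sev1` means Outage, `sev2` means Degraded) and made-up service names; both are illustrative choices, not something the talk prescribes.

```python
# Assumed mapping from incident severity to the status shown on the page.
SEVERITY_TO_STATUS = {"sev1": "Outage", "sev2": "Degraded"}
STATUS_RANK = {"Operational": 0, "Degraded": 1, "Outage": 2}


def service_status(services, open_incidents):
    """Return the worst status implied by the open incidents for each service."""
    status = {service: "Operational" for service in services}
    for incident in open_incidents:
        current = status.setdefault(incident["service"], "Operational")
        candidate = SEVERITY_TO_STATUS.get(incident["severity"], "Operational")
        if STATUS_RANK[candidate] > STATUS_RANK[current]:
            status[incident["service"]] = candidate
    return status


# Hypothetical open incidents.
open_incidents = [
    {"service": "fulfillment", "severity": "sev2"},
    {"service": "fulfillment", "severity": "sev1"},
    {"service": "analytics", "severity": "sev3"},
]
page = service_status(["fulfillment", "analytics", "integration"], open_incidents)
```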
Collaboration:
• Transparency of data has a cultural influence
• Fix it Together
• Inquisitive Troubleshooting
• Fix it Long Term
• Team recognizes impact and creates empathy
• Product Team Engagement
• Objective data on product performance
Measurement + Transparency + Collaboration
• Incident Response & Recovery Times Decrease
• Incident Frequency Decreases
• Incident Recovery Cost Decreases
• Engineering Output Increases
• Decision Making Abilities Improve
• Team Morale Improves
• And Most Importantly….
• Happy Confident Customers
Tips for Getting There
• Measure stuff
• Be transparent with your metrics
• Don’t try to do it all at once
• Don’t make your Incident process bulky
• Consistent Ceremonies
“Foster a Culture
that Challenges &
Learns from Failure..”
Thanks for listening!
Twitter: @ajdomie
Please provide
feedback for this
session by filling out
the feedback survey

PDSummit16 - Using Incident Data to Improve your Business

Editor's Notes

  1. We’ll go through a generic example of what commonly happens in an organization that doesn’t do Incident Management well.
  2. Once upon a time…. The site was down
  3. Good thing executives noticed before anyone else. Pretty sure she’s just sitting in there with her door closed, clicking refresh all day… Site outages are always the database! (Please note, in the middle, that the Network person left the room immediately.) That ended up being timely, as the DB team assumed it was the Network’s issue.
  4. Now we start to see customer contacts coming in as a result of this issue
  5. We have now entered the phase of the incident I refer to as the “Buckshot” phase: everyone is communicating poorly with everyone else, and none of it contributes to resolving the issue.
  6. Turn it off and turn it on again, brilliant! Poke that database performance as a root cause one more time….
  7. If you’re looking for me you can find me with my face in my palm at my desk….
  8. Congratulations, you just contributed to Global Warming in your own way… Someone get me some marshmallows!
  9. You just made your customer do this, or worse yet, they spent the time they were waiting for your site to come online looking for new service providers.
  10. They are spending the time wasted by your site being down looking for a new provider to replace you!
  11. But it’s not just about the outage response here. There is so much more to Incident Management than response and resolution. OK, I’m just having fun with the story here, but this is a terrible experience for customers and employees. Nobody talented wants to work in an environment like this, and nobody wants to do business with a company that operates like this.
  12. At SPS Commerce we’ve fostered a great culture around Incident Management.
  13. Microservices architecture & hybrid cloud, leveraging PagerDuty as a notification hub as well as an event emitter to help us start to trigger automated responses.
  14. There is so much data out there, where do we start?
  15. If you’re not already looking at some of this data I definitely recommend starting with the basics. And the PD UI does a great job of getting you there quickly.
  16. When you analyze your data, you will typically find that “spikes” in alert counts & MTTR indicate a much larger issue, one that likely had more significant impact on your service performance than, for example, a single disk space issue.
  17. PagerDuty recently released to Beta some work they’re doing on event processing and analysis. I think this is the beginning of something very interesting. This beta is a great way to visualize the concept of an alert versus an incident. You can see here (first animation) that we’re looking at a large group and volume of alerts compared to normal. I would likely assess this as an incident that is probably the result of a shared dependency having performance issues. In this 2nd example you can see that just a few small alert groupings are probably more indicative of isolated issues like CPU or Disk Space.
  18. This is something PD’s new OCC is going to be awesome for! What we have done: alert rates got us thinking more about “Incidents” vs. “Alerts”. Alerts are by nature more isolated, small-scale issues; it is important to respond to them as well as to remediate the go-forward risks at a larger scale. Incidents typically impact your customers, or risk doing so, and will absolutely cost you valuable engineering time. It’s critical that you leverage that to motivate a healthy short- and long-term response process.
  19. There are so many dimensions to an incident; it’s important to start small and iterate. Recovery Efforts – hours and #’s do not need to be perfect; give it a good swag. Establishing business metrics – Canary Testing.
  20. Change – helps you shape your confidence level with true numbers. Code/Bug – tangible priority discussion around cost of time, customer UX, and frequency of recurrence. Recovery – if you have a service that is regularly corrected by restarting it, maybe it’s time to automate that?
  21. Visible Status Page – the current state of service health should always be visible in a shared location (not in an email string; that’s exclusive, and we need to be inclusive during an incident). Referenceable – props to the PD road map on Event efforts.
  22. Decision Making Improves – you make better decisions as a business