SlideShare ist ein Scribd-Unternehmen logo
1 von 59
Downloaden Sie, um offline zu lesen
MANAGING YOUR HEROES
     The People Aspect of Monitoring
(a.k.a. Dealing with Outages and Failures)
               Alex Solomon
             alex@pagerduty.com
WHO AM I?
     Alex Solomon

    • Founder     / CEO of PagerDuty

    • Intersect   Inc.

    • Amazon.com




                                       2
DEFINITIONS



              3
Service Level Agreement (SLA)

 Mean Time To Resolution (MTTR)

      Mean Time To Response

Mean Time Between Failures (MTBF)



                                    4
OUTAGES



          5
Can we prevent them?




                       6
PREVENTING OUTAGES

  Single Points of Failure (SPOFs)
  Redundant systems

  Complex, monolithic systems
  Service-oriented architecture




                                     7
Netflix distributed SOA system




                                8
PREVENTING OUTAGES

                 Change

 (not much you can do about this one)




                                        9
OUTAGES


          10
FAILURE LIFECYCLE



                    11
Monitoring
                      detect failure



                                         Alert


                                       Investigate


                                          Fix



Root-cause Analysis

                                                     12
Critical Incident Timeline

              Alert                Investigate       Fix

            RESPONSE TIME
                                 RESOLUTION TIME


 Issue is               Engineer starts                    Issue is
detected               working on issue                      fixed




                                                                      13
MONITOR



          14
MONITOR EVERYTHING!
All levels of the stack
•   Data center
•   Network
•   Servers
•   Database
•   Application
•   Website
•   Business Metrics

                          15
WHY MONITOR EVERYTHING?


                 Metrics!

                 Metrics!

                 Metrics!




                            16
TOOLS
•   Internal monitoring (behind the firewall):
    •

    •

•   External monitoring (SaaS-based):
    •

    •

•   Metrics:
    •   Graphite or

                                                17
ALERT



        18
Best Practice: Categorize alerts by severity.




                                                19
SEVERITIES
Define severities based on business impact:
• sev1   - large scale business loss       {
                                           2 critical
                                           severities
• sev2   - small to medium business loss

• sev3



• sev4
       - no immediate business loss,
 customers may be impacted

       - no business loss, no
                                       {       2 non-critical
                                               severities
 customers impacted

                                                                20
Each severity level should have its own standard
operating procedure (SOP):
•   Who

•   How

•   Response time




                                                   21
•   Sev1: Major outage, all hands on deck
    •   Notify the entire team via phone and SMS
    •   Response time: 5 min
•   Sev2: Critical issue
    •   Notify the on-call person via phone and SMS
    •   Response time: 15 min
•   Sev3: Non-critical issue
    •   Notify the on-call person via email
    •   Response time: next day during business hours



                                                        22
•   Sev1 incidents
    •   Rare
    •   Rarely auto-generated
    •   Frequently start as sev2 which are upgraded to sev1




                                                              23
•   Sev2 incidents
    •   More common
    •   Mostly auto-generated




                                24
•   Sev3 incidents
    •   Non-critical incidents
    •   Can be auto-generated
    •   Can also be manually generated




                                         25
•   Severities can be downgraded or upgraded
    •   ex. sev2 ➞ sev1 (problem got worse)
    •   ex. sev1 ➞ sev2 (problem was partially fixed)
    •   ex. sev2 ➞ sev3 (critical problem was fixed but we
        still need to investigate root cause)




                                                            26
One more best-practice:

Alert before your systems fail completely




                                            27
Main benefit of severities

Only page on critical issues (sev1 or 2)




                                           28
Preserve sanity




                  29
Avoid “Peter and the wolf ” scenarios




                                        30
ON-CALL BEST PRACTICES


   Person      Team
    Level      Level




                         31
ON-CALL AT THE PERSON LEVEL


         Cellphone




                              32
Cellphone
Smart phone



    OR        AND




                    33
4G / 3G internet




4G hotspot   4G USB modem   3G/4G tethering



     (don’t forget your laptop)



                                              34
Page multiple times until you respond
 • Time    zero: email and SMS

 •1   min later: phone-call on cell

 •5   min later: phone-call on cell

 •5   min later: phone-call on landline

 •5   min later: phone-call to girlfriend



                                            35
Bonus: vibrating bluetooth bracelet




                                      36
ON-CALL AT THE TEAM LEVEL


  Rarely
  Do not send alerts to the entire team


                sev1 OK
                sev2 NO



                                          37
On-call schedules:
 •   Simple rotation-based schedule
     •   ex. weekly - everyone is on-call for a
         week at a time
 •   Set up a follow-the-sun schedule
     •   people in multiple timezones
     •   no night-shifts                          simple rotation




                                                                    38
What happens if the on-call person doesn’t respond at all?




                                                             39
If you care about uptime, you need redundancy in your on-call.

                                                                 40
Set up multiple on-call levels with automatic
escalation between them:

  Level 1: Primary on-call
       Escalate after 15 min

  Level 2: Secondary on-call
       Escalate after 20 min

  Level 3: Team on-call (alert entire team)




                                                41
Best Practice: Put management in the on-call chain
  Level 1: Primary on-call
       Escalate after 15 min

  Level 2: Secondary on-call
       Escalate after 20 min

  Level 3: Team on-call (alert entire team)
       Escalate after 20 min

  Level 4: Manager / Director



                                                     42
Best Practice: put software engineers in the
on-call chain
  •   Devops model
  •   Devs need to own the systems they write
  •   Getting paged provides a strong incentive to engineer
      better systems




                                                              43
Best Practice: measure on-call performance
“You can’t improve what you don’t measure.”

   •   Measure: mean-time-to-response
   •   Measure: % of issues that were escalated
   •   Set up policies to encourage good performance
       •   Put managers in on-call chain
       •   Pay people extra to do on-call



                                                       44
Network Operations Center




                            45
NOC with lots of Nagios goodness


                                   46
NOCs:
 •   Reduce the mean-time-to-response drastically
 •   Expensive (staffed 24x7 with multiple people)
 •   Train NOC staff to fix a good %age of issues
 •   As you scale your org, you may want a hybrid on-call
     approach (where NOC handles some issues, teams handle
     other issues directly)




                                                             47
Critical Incident Timeline

                 Alert                 Investigate          Fix
                 RESPONSE TIME
                                     RESOLUTION TIME

   Issue is               Engineer starts                         Issue is
  detected               working on issue                           fixed




                                   Alert
              Alerting system                Engineer gets to a
               gets ahold of                computer, connects
                somebody                        to internet



 Issue is                      Engineer is                Engineer starts
detected                      aware of issue             working on issue



                                                                             48
Alert
          Alerting system               Engineer gets to a
           gets ahold of               computer, connects
            somebody                       to internet




{
{
 Issue is                      Engineer is                  Engineer starts
detected                      aware of issue               working on issue



    How to minimize:                   How to minimize:
    •   Alert via phone & SMS          •   Carry 4G internet device +
                                           laptop at all times
    •   Alert multiple times via
        multiple channels              •   Set loud ringtone at night
    •   Failing that, escalate!
    •   Failing that, escalate to
        manager!

                                                                              49
RESEARCH & FIX



                 50
How do we reduce the amount of time needed to
             investigate and fix?

           Investigate       Fix




                                                51
Set up an Emergency Ops Guide:
  •   When you encounter a new failure, document it in the
      Guide
  •   Document symptoms, research steps, fixes
  •   Use a wiki




                                                             52
53
54
Automate fixes
           or


Add more fault tolerance




                           55
You need the right tools:
 •   Tools to help you diagnose problems faster
     •   Comprehensive monitoring, metrics and dashboards
     •   Tools that help search for problems in log files quickly (ie.
         Splunk)
 •   Tools to help your team communicate efficiently
     •   Voice: Conference bridge, Skype, Google Hangout
     •   Chat: Hipchat, Campfire


                                                                        56
Best Practice: Incident Commander




                                    57
Incident Commander:
 •   Essential for dealing with sev1 issues
 •   In charge of the situation
     •   Providers leadership, prevents analysis paralysis
     •   He/she directs people to do things
     •   Helps save time making decisions




                                                             58
Questions?


 Alex Solomon
alex@pagerduty.com

Weitere ähnliche Inhalte

Ähnlich wie Nagios Conference 2012 - Alex Solomon - Managing Your Heros

Web Application Remediation - OWASP San Antonio March 2007
Web Application Remediation - OWASP San Antonio March 2007Web Application Remediation - OWASP San Antonio March 2007
Web Application Remediation - OWASP San Antonio March 2007Denim Group
 
2015 05-07 - vu amsterdam - testing safety critical systems
2015 05-07 - vu amsterdam - testing safety critical systems2015 05-07 - vu amsterdam - testing safety critical systems
2015 05-07 - vu amsterdam - testing safety critical systemsJaap van Ekris
 
Testing Safety Critical Systems (10-02-2014, VU amsterdam)
Testing Safety Critical Systems (10-02-2014, VU amsterdam)Testing Safety Critical Systems (10-02-2014, VU amsterdam)
Testing Safety Critical Systems (10-02-2014, VU amsterdam)Jaap van Ekris
 
Vulnerability Management In An Application Security World
Vulnerability Management In An Application Security WorldVulnerability Management In An Application Security World
Vulnerability Management In An Application Security WorldDenim Group
 
Nagios Conference 2012 - Jason Cook - Nagios and Mod-Gearman
Nagios Conference 2012 - Jason Cook - Nagios and Mod-GearmanNagios Conference 2012 - Jason Cook - Nagios and Mod-Gearman
Nagios Conference 2012 - Jason Cook - Nagios and Mod-GearmanNagios
 
Webinar - Disaster in Japan: A Lesson in BCM
Webinar - Disaster in Japan: A Lesson in BCMWebinar - Disaster in Japan: A Lesson in BCM
Webinar - Disaster in Japan: A Lesson in BCMeasy2comply
 
Technical Debt - PHPBenelux
Technical Debt - PHPBeneluxTechnical Debt - PHPBenelux
Technical Debt - PHPBeneluxenaramore
 
Continous Monitoring
Continous MonitoringContinous Monitoring
Continous MonitoringNaresh Jain
 
Big Events Cause Network Mayhem
Big Events Cause Network MayhemBig Events Cause Network Mayhem
Big Events Cause Network MayhemPacketTrap Msp
 
Application Assessment Techniques
Application Assessment TechniquesApplication Assessment Techniques
Application Assessment TechniquesDenim Group
 
RIPE Atlas
RIPE AtlasRIPE Atlas
RIPE AtlasRIPE NCC
 
Modern Post-Exploitation Strategies - 44CON 2012
Modern Post-Exploitation Strategies - 44CON 2012Modern Post-Exploitation Strategies - 44CON 2012
Modern Post-Exploitation Strategies - 44CON 201244CON
 
“Sensu and Sensibility” - The Story of a Journey From #monitoringsucks to #mo...
“Sensu and Sensibility” - The Story of a Journey From #monitoringsucks to #mo...“Sensu and Sensibility” - The Story of a Journey From #monitoringsucks to #mo...
“Sensu and Sensibility” - The Story of a Journey From #monitoringsucks to #mo...Puppet
 
The Rugged Way in the Cloud--Building Reliability and Security into Software
The Rugged Way in the Cloud--Building Reliability and Security into SoftwareThe Rugged Way in the Cloud--Building Reliability and Security into Software
The Rugged Way in the Cloud--Building Reliability and Security into SoftwareJames Wickett
 
Rugged Dev: Building Reliability and Security Into Software
Rugged Dev: Building Reliability and Security Into SoftwareRugged Dev: Building Reliability and Security Into Software
Rugged Dev: Building Reliability and Security Into SoftwareInnoTech
 
Dan Cornell - The Real Cost of Software Remediation
Dan Cornell  - The Real Cost of Software RemediationDan Cornell  - The Real Cost of Software Remediation
Dan Cornell - The Real Cost of Software RemediationSource Conference
 
vCenter Operations 5: Level 300 training
vCenter Operations 5: Level 300 trainingvCenter Operations 5: Level 300 training
vCenter Operations 5: Level 300 trainingEric Sloof
 
Everything you ever wanted to know about deployment but were afraid to ask
Everything you ever wanted to know about deployment but were afraid to askEverything you ever wanted to know about deployment but were afraid to ask
Everything you ever wanted to know about deployment but were afraid to asklauraxthomson
 
DevOps for the sysadmin
DevOps for the sysadminDevOps for the sysadmin
DevOps for the sysadminRobert Nelson
 
2012 Velocity London: DevOps Patterns Distilled
2012 Velocity London: DevOps Patterns Distilled2012 Velocity London: DevOps Patterns Distilled
2012 Velocity London: DevOps Patterns DistilledGene Kim
 

Ähnlich wie Nagios Conference 2012 - Alex Solomon - Managing Your Heros (20)

Web Application Remediation - OWASP San Antonio March 2007
Web Application Remediation - OWASP San Antonio March 2007Web Application Remediation - OWASP San Antonio March 2007
Web Application Remediation - OWASP San Antonio March 2007
 
2015 05-07 - vu amsterdam - testing safety critical systems
2015 05-07 - vu amsterdam - testing safety critical systems2015 05-07 - vu amsterdam - testing safety critical systems
2015 05-07 - vu amsterdam - testing safety critical systems
 
Testing Safety Critical Systems (10-02-2014, VU amsterdam)
Testing Safety Critical Systems (10-02-2014, VU amsterdam)Testing Safety Critical Systems (10-02-2014, VU amsterdam)
Testing Safety Critical Systems (10-02-2014, VU amsterdam)
 
Vulnerability Management In An Application Security World
Vulnerability Management In An Application Security WorldVulnerability Management In An Application Security World
Vulnerability Management In An Application Security World
 
Nagios Conference 2012 - Jason Cook - Nagios and Mod-Gearman
Nagios Conference 2012 - Jason Cook - Nagios and Mod-GearmanNagios Conference 2012 - Jason Cook - Nagios and Mod-Gearman
Nagios Conference 2012 - Jason Cook - Nagios and Mod-Gearman
 
Webinar - Disaster in Japan: A Lesson in BCM
Webinar - Disaster in Japan: A Lesson in BCMWebinar - Disaster in Japan: A Lesson in BCM
Webinar - Disaster in Japan: A Lesson in BCM
 
Technical Debt - PHPBenelux
Technical Debt - PHPBeneluxTechnical Debt - PHPBenelux
Technical Debt - PHPBenelux
 
Continous Monitoring
Continous MonitoringContinous Monitoring
Continous Monitoring
 
Big Events Cause Network Mayhem
Big Events Cause Network MayhemBig Events Cause Network Mayhem
Big Events Cause Network Mayhem
 
Application Assessment Techniques
Application Assessment TechniquesApplication Assessment Techniques
Application Assessment Techniques
 
RIPE Atlas
RIPE AtlasRIPE Atlas
RIPE Atlas
 
Modern Post-Exploitation Strategies - 44CON 2012
Modern Post-Exploitation Strategies - 44CON 2012Modern Post-Exploitation Strategies - 44CON 2012
Modern Post-Exploitation Strategies - 44CON 2012
 
“Sensu and Sensibility” - The Story of a Journey From #monitoringsucks to #mo...
“Sensu and Sensibility” - The Story of a Journey From #monitoringsucks to #mo...“Sensu and Sensibility” - The Story of a Journey From #monitoringsucks to #mo...
“Sensu and Sensibility” - The Story of a Journey From #monitoringsucks to #mo...
 
The Rugged Way in the Cloud--Building Reliability and Security into Software
The Rugged Way in the Cloud--Building Reliability and Security into SoftwareThe Rugged Way in the Cloud--Building Reliability and Security into Software
The Rugged Way in the Cloud--Building Reliability and Security into Software
 
Rugged Dev: Building Reliability and Security Into Software
Rugged Dev: Building Reliability and Security Into SoftwareRugged Dev: Building Reliability and Security Into Software
Rugged Dev: Building Reliability and Security Into Software
 
Dan Cornell - The Real Cost of Software Remediation
Dan Cornell  - The Real Cost of Software RemediationDan Cornell  - The Real Cost of Software Remediation
Dan Cornell - The Real Cost of Software Remediation
 
vCenter Operations 5: Level 300 training
vCenter Operations 5: Level 300 trainingvCenter Operations 5: Level 300 training
vCenter Operations 5: Level 300 training
 
Everything you ever wanted to know about deployment but were afraid to ask
Everything you ever wanted to know about deployment but were afraid to askEverything you ever wanted to know about deployment but were afraid to ask
Everything you ever wanted to know about deployment but were afraid to ask
 
DevOps for the sysadmin
DevOps for the sysadminDevOps for the sysadmin
DevOps for the sysadmin
 
2012 Velocity London: DevOps Patterns Distilled
2012 Velocity London: DevOps Patterns Distilled2012 Velocity London: DevOps Patterns Distilled
2012 Velocity London: DevOps Patterns Distilled
 

Mehr von Nagios

Nagios XI Best Practices
Nagios XI Best PracticesNagios XI Best Practices
Nagios XI Best PracticesNagios
 
Jesse Olson - Nagios Log Server Architecture Overview
Jesse Olson - Nagios Log Server Architecture OverviewJesse Olson - Nagios Log Server Architecture Overview
Jesse Olson - Nagios Log Server Architecture OverviewNagios
 
Trevor McDonald - Nagios XI Under The Hood
Trevor McDonald  - Nagios XI Under The HoodTrevor McDonald  - Nagios XI Under The Hood
Trevor McDonald - Nagios XI Under The HoodNagios
 
Sean Falzon - Nagios - Resilient Notifications
Sean Falzon - Nagios - Resilient NotificationsSean Falzon - Nagios - Resilient Notifications
Sean Falzon - Nagios - Resilient NotificationsNagios
 
Marcus Rochelle - Landis+Gyr - Monitoring with Nagios Enterprise Edition
Marcus Rochelle - Landis+Gyr - Monitoring with Nagios Enterprise EditionMarcus Rochelle - Landis+Gyr - Monitoring with Nagios Enterprise Edition
Marcus Rochelle - Landis+Gyr - Monitoring with Nagios Enterprise EditionNagios
 
Janice Singh - Writing Custom Nagios Plugins
Janice Singh - Writing Custom Nagios PluginsJanice Singh - Writing Custom Nagios Plugins
Janice Singh - Writing Custom Nagios PluginsNagios
 
Dave Williams - Nagios Log Server - Practical Experience
Dave Williams - Nagios Log Server - Practical ExperienceDave Williams - Nagios Log Server - Practical Experience
Dave Williams - Nagios Log Server - Practical ExperienceNagios
 
Mike Weber - Nagios and Group Deployment of Service Checks
Mike Weber - Nagios and Group Deployment of Service ChecksMike Weber - Nagios and Group Deployment of Service Checks
Mike Weber - Nagios and Group Deployment of Service ChecksNagios
 
Mike Guthrie - Revamping Your 10 Year Old Nagios Installation
Mike Guthrie - Revamping Your 10 Year Old Nagios InstallationMike Guthrie - Revamping Your 10 Year Old Nagios Installation
Mike Guthrie - Revamping Your 10 Year Old Nagios InstallationNagios
 
Bryan Heden - Agile Networks - Using Nagios XI as the platform for Monitoring...
Bryan Heden - Agile Networks - Using Nagios XI as the platform for Monitoring...Bryan Heden - Agile Networks - Using Nagios XI as the platform for Monitoring...
Bryan Heden - Agile Networks - Using Nagios XI as the platform for Monitoring...Nagios
 
Matt Bruzek - Monitoring Your Public Cloud With Nagios
Matt Bruzek - Monitoring Your Public Cloud With NagiosMatt Bruzek - Monitoring Your Public Cloud With Nagios
Matt Bruzek - Monitoring Your Public Cloud With NagiosNagios
 
Lee Myers - What To Do When Nagios Notification Don't Meet Your Needs.
Lee Myers - What To Do When Nagios Notification Don't Meet Your Needs.Lee Myers - What To Do When Nagios Notification Don't Meet Your Needs.
Lee Myers - What To Do When Nagios Notification Don't Meet Your Needs.Nagios
 
Eric Loyd - Fractal Nagios
Eric Loyd - Fractal NagiosEric Loyd - Fractal Nagios
Eric Loyd - Fractal NagiosNagios
 
Marcelo Perazolo, Lead Software Architect, IBM Corporation - Monitoring a Pow...
Marcelo Perazolo, Lead Software Architect, IBM Corporation - Monitoring a Pow...Marcelo Perazolo, Lead Software Architect, IBM Corporation - Monitoring a Pow...
Marcelo Perazolo, Lead Software Architect, IBM Corporation - Monitoring a Pow...Nagios
 
Thomas Schmainda - Tracking Boeing Satellites With Nagios - Nagios World Conf...
Thomas Schmainda - Tracking Boeing Satellites With Nagios - Nagios World Conf...Thomas Schmainda - Tracking Boeing Satellites With Nagios - Nagios World Conf...
Thomas Schmainda - Tracking Boeing Satellites With Nagios - Nagios World Conf...Nagios
 
Nagios World Conference 2015 - Scott Wilkerson Opening
Nagios World Conference 2015 - Scott Wilkerson OpeningNagios World Conference 2015 - Scott Wilkerson Opening
Nagios World Conference 2015 - Scott Wilkerson OpeningNagios
 
Nrpe - Nagios Remote Plugin Executor. NRPE plugin for Nagios Core
Nrpe - Nagios Remote Plugin Executor. NRPE plugin for Nagios CoreNrpe - Nagios Remote Plugin Executor. NRPE plugin for Nagios Core
Nrpe - Nagios Remote Plugin Executor. NRPE plugin for Nagios CoreNagios
 
Nagios Log Server - Features
Nagios Log Server - FeaturesNagios Log Server - Features
Nagios Log Server - FeaturesNagios
 
Nagios Network Analyzer - Features
Nagios Network Analyzer - FeaturesNagios Network Analyzer - Features
Nagios Network Analyzer - FeaturesNagios
 
Nagios Conference 2014 - Dorance Martinez Cortes - Customizing Nagios
Nagios Conference 2014 - Dorance Martinez Cortes - Customizing NagiosNagios Conference 2014 - Dorance Martinez Cortes - Customizing Nagios
Nagios Conference 2014 - Dorance Martinez Cortes - Customizing NagiosNagios
 

Mehr von Nagios (20)

Nagios XI Best Practices
Nagios XI Best PracticesNagios XI Best Practices
Nagios XI Best Practices
 
Jesse Olson - Nagios Log Server Architecture Overview
Jesse Olson - Nagios Log Server Architecture OverviewJesse Olson - Nagios Log Server Architecture Overview
Jesse Olson - Nagios Log Server Architecture Overview
 
Trevor McDonald - Nagios XI Under The Hood
Trevor McDonald  - Nagios XI Under The HoodTrevor McDonald  - Nagios XI Under The Hood
Trevor McDonald - Nagios XI Under The Hood
 
Sean Falzon - Nagios - Resilient Notifications
Sean Falzon - Nagios - Resilient NotificationsSean Falzon - Nagios - Resilient Notifications
Sean Falzon - Nagios - Resilient Notifications
 
Marcus Rochelle - Landis+Gyr - Monitoring with Nagios Enterprise Edition
Marcus Rochelle - Landis+Gyr - Monitoring with Nagios Enterprise EditionMarcus Rochelle - Landis+Gyr - Monitoring with Nagios Enterprise Edition
Marcus Rochelle - Landis+Gyr - Monitoring with Nagios Enterprise Edition
 
Janice Singh - Writing Custom Nagios Plugins
Janice Singh - Writing Custom Nagios PluginsJanice Singh - Writing Custom Nagios Plugins
Janice Singh - Writing Custom Nagios Plugins
 
Dave Williams - Nagios Log Server - Practical Experience
Dave Williams - Nagios Log Server - Practical ExperienceDave Williams - Nagios Log Server - Practical Experience
Dave Williams - Nagios Log Server - Practical Experience
 
Mike Weber - Nagios and Group Deployment of Service Checks
Mike Weber - Nagios and Group Deployment of Service ChecksMike Weber - Nagios and Group Deployment of Service Checks
Mike Weber - Nagios and Group Deployment of Service Checks
 
Mike Guthrie - Revamping Your 10 Year Old Nagios Installation
Mike Guthrie - Revamping Your 10 Year Old Nagios InstallationMike Guthrie - Revamping Your 10 Year Old Nagios Installation
Mike Guthrie - Revamping Your 10 Year Old Nagios Installation
 
Bryan Heden - Agile Networks - Using Nagios XI as the platform for Monitoring...
Bryan Heden - Agile Networks - Using Nagios XI as the platform for Monitoring...Bryan Heden - Agile Networks - Using Nagios XI as the platform for Monitoring...
Bryan Heden - Agile Networks - Using Nagios XI as the platform for Monitoring...
 
Matt Bruzek - Monitoring Your Public Cloud With Nagios
Matt Bruzek - Monitoring Your Public Cloud With NagiosMatt Bruzek - Monitoring Your Public Cloud With Nagios
Matt Bruzek - Monitoring Your Public Cloud With Nagios
 
Lee Myers - What To Do When Nagios Notification Don't Meet Your Needs.
Lee Myers - What To Do When Nagios Notification Don't Meet Your Needs.Lee Myers - What To Do When Nagios Notification Don't Meet Your Needs.
Lee Myers - What To Do When Nagios Notification Don't Meet Your Needs.
 
Eric Loyd - Fractal Nagios
Eric Loyd - Fractal NagiosEric Loyd - Fractal Nagios
Eric Loyd - Fractal Nagios
 
Marcelo Perazolo, Lead Software Architect, IBM Corporation - Monitoring a Pow...
Marcelo Perazolo, Lead Software Architect, IBM Corporation - Monitoring a Pow...Marcelo Perazolo, Lead Software Architect, IBM Corporation - Monitoring a Pow...
Marcelo Perazolo, Lead Software Architect, IBM Corporation - Monitoring a Pow...
 
Thomas Schmainda - Tracking Boeing Satellites With Nagios - Nagios World Conf...
Thomas Schmainda - Tracking Boeing Satellites With Nagios - Nagios World Conf...Thomas Schmainda - Tracking Boeing Satellites With Nagios - Nagios World Conf...
Thomas Schmainda - Tracking Boeing Satellites With Nagios - Nagios World Conf...
 
Nagios World Conference 2015 - Scott Wilkerson Opening
Nagios World Conference 2015 - Scott Wilkerson OpeningNagios World Conference 2015 - Scott Wilkerson Opening
Nagios World Conference 2015 - Scott Wilkerson Opening
 
Nrpe - Nagios Remote Plugin Executor. NRPE plugin for Nagios Core
Nrpe - Nagios Remote Plugin Executor. NRPE plugin for Nagios CoreNrpe - Nagios Remote Plugin Executor. NRPE plugin for Nagios Core
Nrpe - Nagios Remote Plugin Executor. NRPE plugin for Nagios Core
 
Nagios Log Server - Features
Nagios Log Server - FeaturesNagios Log Server - Features
Nagios Log Server - Features
 
Nagios Network Analyzer - Features
Nagios Network Analyzer - FeaturesNagios Network Analyzer - Features
Nagios Network Analyzer - Features
 
Nagios Conference 2014 - Dorance Martinez Cortes - Customizing Nagios
Nagios Conference 2014 - Dorance Martinez Cortes - Customizing NagiosNagios Conference 2014 - Dorance Martinez Cortes - Customizing Nagios
Nagios Conference 2014 - Dorance Martinez Cortes - Customizing Nagios
 

Kürzlich hochgeladen

Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 

Kürzlich hochgeladen (20)

Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 

Nagios Conference 2012 - Alex Solomon - Managing Your Heros

  • 1. MANAGING YOUR HEROES The People Aspect of Monitoring (a.k.a. Dealing with Outages and Failures) Alex Solomon alex@pagerduty.com
  • 2. WHO AM I? Alex Solomon • Founder / CEO of PagerDuty • Intersect Inc. • Amazon.com 2
  • 4. Service Level Agreement (SLA) Mean Time To Resolution (MTTR) Mean Time To Response Mean Time Between Failures (MTBF) 4
  • 6. Can we prevent them? 6
  • 7. PREVENTING OUTAGES Single Points of Failure (SPOFs) Redundant systems Complex, monolithic systems Service-oriented architecture 7
  • 9. PREVENTING OUTAGES Change (not much you can do about this one) 9
  • 10. OUTAGES 10
  • 12. Monitoring detect failure Alert Investigate Fix Root-cause Analysis 12
  • 13. Critical Incident Timeline Alert Investigate Fix RESPONSE TIME RESOLUTION TIME Issue is Engineer starts Issue is detected working on issue fixed 13
  • 14. MONITOR 14
  • 15. MONITOR EVERYTHING! All levels of the stack • Data center • Network • Servers • Database • Application • Website • Business Metrics 15
  • 16. WHY MONITOR EVERYTHING? Metrics! Metrics! Metrics! 16
  • 17. TOOLS • Internal monitoring (behind the firewall): • • • External monitoring (SaaS-based): • • • Metrics: • Graphite or 17
  • 18. ALERT 18
  • 19. Best Practice: Categorize alerts by severity. 19
  • 20. SEVERITIES Define severities based on business impact: • sev1 - large scale business loss { 2 critical severities • sev2 - small to medium business loss • sev3 • sev4 - no immediate business loss, customers may be impacted - no business loss, no { 2 non-critical severities customers impacted 20
  • 21. Each severity level should have its own standard operating procedure (SOP): • Who • How • Response time 21
  • 22. Sev1: Major outage, all hands on deck • Notify the entire team via phone and SMS • Response time: 5 min • Sev2: Critical issue • Notify the on-call person via phone and SMS • Response time: 15 min • Sev3: Non-critical issue • Notify the on-call person via email • Response time: next day during business hours 22
  • 23. Sev1 incidents • Rare • Rarely auto-generated • Frequently start as sev2 which are upgraded to sev1 23
  • 24. Sev2 incidents • More common • Mostly auto-generated 24
  • 25. Sev3 incidents • Non-critical incidents • Can be auto-generated • Can also be manually generated 25
  • 26. Severities can be downgraded or upgraded • ex. sev2 ➞ sev1 (problem got worse) • ex. sev1 ➞ sev2 (problem was partially fixed) • ex. sev2 ➞ sev3 (critical problem was fixed but we still need to investigate root cause) 26
  • 27. One more best-practice: Alert before your systems fail completely 27
  • 28. Main benefit of severities Only page on critical issues (sev1 or 2) 28
  • 30. Avoid “Peter and the wolf ” scenarios 30
  • 31. ON-CALL BEST PRACTICES Person Team Level Level 31
  • 32. ON-CALL AT THE PERSON LEVEL Cellphone 32
  • 34. 4G / 3G internet 4G hotspot 4G USB modem 3G/4G tethering (don’t forget your laptop) 34
  • 35. Page multiple times until you respond • Time zero: email and SMS •1 min later: phone-call on cell •5 min later: phone-call on cell •5 min later: phone-call on landline •5 min later: phone-call to girlfriend 35
  • 37. ON-CALL AT THE TEAM LEVEL Rarely Do not send alerts to the entire team sev1 OK sev2 NO 37
  • 38. On-call schedules: • Simple rotation-based schedule • ex. weekly - everyone is on-call for a week at a time • Set up a follow-the-sun schedule • people in multiple timezones • no night-shifts simple rotation 38
  • 39. What happens if the on-call person doesn’t respond at all? 39
  • 40. If you care about uptime, you need redundancy in your on-call. 40
  • 41. Set up multiple on-call levels with automatic escalation between them: Level 1: Primary on-call Escalate after 15 min Level 2: Secondary on-call Escalate after 20 min Level 3: Team on-call (alert entire team) 41
  • 42. Best Practice: Put management in the on-call chain Level 1: Primary on-call Escalate after 15 min Level 2: Secondary on-call Escalate after 20 min Level 3: Team on-call (alert entire team) Escalate after 20 min Level 4: Manager / Director 42
  • 43. Best Practice: put software engineers in the on-call chain • Devops model • Devs need to own the systems they write • Getting paged provides a strong incentive to engineer better systems 43
  • 44. Best Practice: measure on-call performance “You can’t improve what you don’t measure.” • Measure: mean-time-to-response • Measure: % of issues that were escalated • Set up policies to encourage good performance • Put managers in on-call chain • Pay people extra to do on-call 44
  • 46. NOC with lots of Nagios goodness 46
  • 47. NOCs: • Reduce the mean-time-to-response drastically • Expensive (staffed 24x7 with multiple people) • Train NOC staff to fix a good %age of issues • As you scale your org, you may want a hybrid on-call approach (where NOC handles some issues, teams handle other issues directly) 47
  • 48. Critical Incident Timeline Alert Investigate Fix RESPONSE TIME RESOLUTION TIME Issue is Engineer starts Issue is detected working on issue fixed Alert Alerting system Engineer gets to a gets ahold of computer, connects somebody to internet Issue is Engineer is Engineer starts detected aware of issue working on issue 48
  • 49. Alert Alerting system Engineer gets to a gets ahold of computer, connects somebody to internet { { Issue is Engineer is Engineer starts detected aware of issue working on issue How to minimize: How to minimize: • Alert via phone & SMS • Carry 4G internet device + laptop at all times • Alert multiple times via multiple channels • Set loud ringtone at night • Failing that, escalate! • Failing that, escalate to manager! 49
  • 51. How do we reduce the amount of time needed to investigate and fix? Investigate Fix 51
  • 52. Set up an Emergency Ops Guide: • When you encounter a new failure, document it in the Guide • Document symptoms, research steps, fixes • Use a wiki 52
  • 53. 53
  • 54. 54
  • 55. Automate fixes or Add more fault tolerance 55
  • 56. You need the right tools: • Tools to help you diagnose problems faster • Comprehensive monitoring, metrics and dashboards • Tools that help search for problems in log files quickly (ie. Splunk) • Tools to help your team communicate efficiently • Voice: Conference bridge, Skype, Google Hangout • Chat: Hipchat, Campfire 56
  • 57. Best Practice: Incident Commander 57
  • 58. Incident Commander: • Essential for dealing with sev1 issues • In charge of the situation • Providers leadership, prevents analysis paralysis • He/she directs people to do things • Helps save time making decisions 58