Bringing Back the Love: How Situational Awareness Improves User Experience


http://www.flickr.com/photos/64123293@N00/5985619750/sizes/l/in/photostream/
Andrew White
Manager of Systems and Event Management at Nationwide Insurance

Mr. White leads a team of software developers focused on creating tools that collect and analyze health information from Nationwide's IT systems. These tools have a wide variety of applications, from fault detection and problem investigation to trend reporting and capacity planning.

Andrew has over ten years of experience designing and managing the deployment of systems management software. Prior to joining Nationwide, Andrew developed solutions for a wide variety of organizations, including the Mexican Secretaría de Hacienda y Crédito Público, Telmex, Wal-Mart of Mexico, JP Morgan Chase, and the US Navy Facilities and Engineering Command.
GROUND RULES FOR THIS SESSION…

1.  If you can’t tell if I am trying to be funny… GO AHEAD AND LAUGH!
2.  Feel free to text, tweet, yammer, or whatever. People gotta hear this!
3.  If you have a question, no need to wait until the end. Just interrupt me. Seriously… I don’t mind.




Follow Us: #ITSMSummit
My name is Andrew White

I lead a Systems and Event Management team
I am here today to talk about

Situational Awareness
Definitions:
SIT·U·A·TION – [SI-CHƏ-WĀ-SHƏN]
-noun
  1.  manner of being situated; location or position with reference to environment: The situation of the house allowed for a beautiful view.
  2.  condition; case; plight: He is in a desperate situation.
  3.  the state of affairs; combination of circumstances: The present international situation is dangerous.
  4.  a state of affairs of special or critical significance in the course of a play, novel, etc.

Not this Situation…

Think this situation…
AŸWAREŸNESS – [UH-WAIR-NIS]!
                    -noun"
                      1.  having knowledge; conscious;
                           cognizant: aware of danger. "
                      2.  informed; alert; knowledgeable;
                           sophisticated: She is one of the most
                           politically aware young women around. "




http://dc-cdn.virtacore.com/holding_door.jpg
When you put them together we get:

      The perception of and reaction to a set of changing events in terms of what can be done, instead of merely the recollection of stimuli.¹

      Most outages are the result of a lack of situational awareness

1. Adapted from Endsley, M.R. (1995b). Toward a theory of situation awareness in dynamic systems. Human Factors 37(1), 32–64.
http://www.picfaz.com/pic/1390/02/12/4.jpg
I am going to talk about some new capabilities that will help you.
Why do we lose situational awareness?
This is Magenta…

It doesn’t exist. :(
Cyan = 600nm - 620nm

Yellow = 510nm - 530nm

Magenta???
The two color wavelengths that produce it are not side-by-side in the spectrum
Squares A and B are the same color
We cannot control the way our brain processes information!
So… why do we lose situational awareness?
SOMETIMES WE MISS WHAT IS GOING ON

Say… what’s a mountain goat doing all the way up here in a cloud bank?
WHICH DO YOU USE WHEN?

We don’t have a tooling problem… we have an understanding problem!

[Diagram: Technology Areas; Tool, Tool, Tool]
Our systems are capable of producing a huge amount of data, both on the status of their own components and on the status of the environment. The problem with today’s systems is not a lack of information, but finding what is needed when it is needed.¹

1. Adapted from Endsley, M.R. (1995b). Toward a theory of situation awareness in dynamic systems. Human Factors 37(1), 32–64.

I would like to show why this happens…
BOYD’S OODA “LOOP”
[Diagram: Observe → Orient → Decide → Act, linked by Feed Forward and Feedback paths and by Implicit Guidance & Control]
   Observe: Observation, fed by Unfolding Circumstances, Outside Information, and Unfolding Interaction With Environment
   Orient: Cultural Norms, Knowledge Life Cycle, Cognitive Abilities, New Information, Prior Wisdom
   Decide: Decision (Hypothesis)
   Act: Action (Test)

   •     Note how observation shapes orientation, shapes decision, shapes action, and in turn is shaped by the feedback and other phenomena coming into our sensing or observing window.
   •     Also note how the entire “loop” (not just orientation) is an ongoing many-sided implicit cross-referencing process of projection, empathy, correlation, and rejection.

From “The Essence of Winning and Losing,” John R. Boyd, January 1996.
WHERE THE BREAKDOWN OCCURS
[Diagram: Current State → Situational Awareness → Decision → Performance of Actions, with Feedback to Current State; the OODA stages Observe, Orient, Decide, Act are shown beneath]

Situational Awareness
   Level 1: Perception of Elements in Current Situation
   Level 2: Comprehension of Current Situation
   Level 3: Projection of Future Status

Systemic Influences
   • System Capability
   • Interface Design
   • Stress & Workload
   • Complexity
   • Automation

Individual Influences
   • Goals & Objectives
   • Preconceptions
   • Expectations
   • Cognitive Processes: Long Term Memory, Automaticity
   • Abilities
   • Experience
   • Training

Adapted from Endsley, M.R. (1995b). Toward a theory of situation awareness in dynamic systems. Human Factors 37(1), 32–64.
Maybe.
Let me show you why this is important…
WE (IT) SELL PROMISES…

   The value of these promises depends on the customer’s perception that we are willing and capable of making good on the promise when the time comes. This perception is affected by the interactions they have with us.
Objective #1: Users Love Our IT Systems…




http://www.flickr.com/photos/anneacaso/3693155059/sizes/l/in/photostream/
WHAT THIS MEANS TO US…
There are a few inescapable facts we face:
1.  We need reliable systems to store the promises we make to our customers
2.  Our systems mirror the complexity of the businesses they support
3.  Our environments must be massive to scale to handle the workload
4.  There is too much activity for a single person to be totally situationally aware
5.  If the users can’t use it, it doesn’t work
EVENT MANAGEMENT FOCUS…
In addition to monitoring for performance, we are here to help manage availability.

 Our Formula:
 1.  Continually collect, categorize, and analyze all events from as many sources as possible
 2.  Correlate events and analyze them using previous outages as patterns to identify situations worth investigating
 3.  Notify a support team so the situation can be mitigated before becoming an outage
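
The three steps of the formula can be sketched roughly as follows. This is a minimal illustration, not the tooling described in the talk: the `Event` fields, the `PAST_OUTAGE_PATTERNS` table, and the `detect_situations` and `notify` helpers are all hypothetical names, and a real correlation engine would also consider timing, topology, and event volume.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """One categorized event (step 1). Field names are assumptions."""
    source: str    # system that emitted the event, e.g. "CICS"
    category: str  # normalized event type, e.g. "abend"


# Hypothetical signatures derived from previous outages (step 2):
# situation name -> set of (source, category) pairs seen together before it.
PAST_OUTAGE_PATTERNS = {
    "back-end transaction failure": {
        ("CICS", "abend"),
        ("MQ", "flow_interrupted"),
    },
}


def detect_situations(events, patterns=PAST_OUTAGE_PATTERNS):
    """Return the names of past-outage patterns fully present in the events."""
    seen = {(e.source, e.category) for e in events}
    return [name for name, signature in patterns.items() if signature <= seen]


def notify(situation, send=print):
    """Step 3: hand the situation to a support team (here, just print it)."""
    send(f"Investigate before it becomes an outage: {situation}")


if __name__ == "__main__":
    pool = [Event("CICS", "abend"), Event("MQ", "flow_interrupted")]
    for situation in detect_situations(pool):
        notify(situation)
```

Matching on a set of event signatures keeps the sketch small; the point is only that individually unremarkable events become a situation worth investigating once they are seen together.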
When all of these happen at the same time…

Ugh…
Bad Experience!

http://www.flickr.com/photos/gregphoto/4881356366/sizes/l/in/photostream/
OK.
So now what?
CLEANING UP THE LANDSCAPE
[Diagram labels: Silo, Monolithic Framework, Niche, Launch Pad, Information Bus]

Adapted from: Akella, Janaki. “IT Architecture: Cutting costs and complexity.” McKinsey Quarterly, 13 Nov 2009.
https://www.mckinseyquarterly.com/IT_architecture_Cutting_costs_and_complexity_2391
ONE INTEGRATED ENVIRONMENT
[Diagram labels: Mainframe, Distributed, Database, Network, Middleware, Storage, Business Telemetry, 3rd Party Providers, Event API, Event Pool, Enrichment & Correlation, Predictive, Event Catalog, Operational Data Warehouse, Presentation Framework, Paging, Service Desk, CMDB, Knowledge, Asset Mgmt]
CONCEPTUALIZING SITUATIONAL AWARENESS
[Diagram: Real-Time Event Streams, Patterns from Historical Data, and Causal Relationship from Past RCAs feed the Situational Awareness Engine, which produces Detected and Predicted Situations]

Adapted from http://www.slideshare.net/TimBassCEP/getting-started-in-cep-how-to-build-an-event-processing-application-presentation-717795
SITUATIONAL AWARENESS MODEL DESIGN
[Diagram: Data → Information → Knowledge → Intelligence]

Event Sources feed the Event Pipeline:
   Level 1: Event Taxonomy and Enrichment
   Level 2: Event Tracking
   Level 3: Situation Detection
   Level 4: Predictive Analysis
   Level 5: Runbook Automation

Other elements shown: Historical Event Archive; Patterns from Historical Data; Causal Relationship from Past RCAs; Solicitations for User Interaction via the Visualization Framework

Adapted from the JDL: Steinberg, A., & Bowman, C., Handbook of Multisensor Data Fusion, CRC Press, 2001
REQUIREMENTS FOR UNITY OF EFFORT

1. Command and Control
2. Shared Experience
3. Situational Awareness

Symptoms of Missing Elements
•  Command and control (No Leadership)
     •  The team lacks a clear direction
     •  Lots of activity, lack of progress
•  Shared Experience (Poor Relationships)
     •  Us vs. Them mentality
     •  Unhealthy competition
•  Situational Awareness (Poor Communication)
     •  Focused on cooperation, not collaboration
     •  Blame culture
     •  Infrequent or non-existent communication
Our success in any endeavor depends directly on our ability to solve problems

What do we need to do that?
You Gotta Have Skillz…
WHAT MATTERS MOST?

Cook County Hospital, Chicago, IL
Dr. Lee Goldman

The Goldman Algorithm
•  Is the patient feeling unstable angina?
•  Is there fluid in the patient’s lungs?
•  Is the patient’s systolic blood pressure below 100?

[Chart: Prediction of Patients Who Will Have a Heart Attack Within 72 Hours. Bars for Traditional Techniques and the Goldman Algorithm on a 0 to 100 scale.]

By paying attention to what really matters, Dr. Goldman improved the “false negatives” by 20 percentage points and eliminated the “false positives” altogether.
THE GOLDMAN ALGORITHM
Patient enters ED with suspected Acute Cardiac Ischemia → Perform Electrocardiogram (EKG)

1.  ECG Evidence of Acute Myocardial Infarction (MI)?
    ST-Segment Elevation ≥ 1mm in ≥ 2 Contiguous Leads (New or Unknown Age), or
    Pathologic Q Waves in ≥ 2 Contiguous Leads (New or Unknown Age)
       Yes: High Risk (Coronary Care Unit)
       No: go to 2

2.  ECG Evidence of Acute Ischemia?
    ST-Segment Depression ≥ 1mm in ≥ 2 Contiguous Leads (New or Unknown Age), or
    T-Wave Inversion in ≥ 2 Contiguous Leads (New or Unknown Age), or
    Left Bundle-Branch Block (New or Unknown Age)
       Yes: go to 3a
       No: go to 3b

3.  Urgent Factors Present?
    Rales Above Both Lung Bases
    Systolic Blood Pressure <100 mm Hg
    Unstable Ischemic Heart Disease
       3a. (Ischemia on ECG)     2 or 3 Factors: High Risk       0 or 1 Factors: Moderate Risk
       3b. (No ischemia on ECG)  2 or 3 Factors: Moderate Risk   1 Factor: Low Risk   0 Factors: Very Low Risk

Care settings shown: Coronary Care Unit (high risk), Inpatient Telemetry Unit (moderate risk), Observation Unit (low and very low risk)
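
The flowchart reduces to two ECG questions plus a count of urgent factors, which is what makes it a good model for alerting rules. The sketch below encodes only the branches shown on the slide. The function name `goldman_risk` and its parameters are hypothetical, and it is an illustration of the decision structure, not a clinical tool.

```python
def goldman_risk(ecg_shows_mi: bool, ecg_shows_ischemia: bool,
                 urgent_factors: int) -> str:
    """Classify risk following the branches of the flowchart.

    urgent_factors counts how many of the three urgent factors are
    present: rales above both lung bases, systolic blood pressure
    below 100 mm Hg, and unstable ischemic heart disease.
    """
    if not 0 <= urgent_factors <= 3:
        raise ValueError("urgent_factors must be between 0 and 3")

    # Branch 1: ECG evidence of acute MI settles the question at once.
    if ecg_shows_mi:
        return "High Risk"

    # Branch 2: ECG evidence of acute ischemia.
    if ecg_shows_ischemia:
        return "High Risk" if urgent_factors >= 2 else "Moderate Risk"

    # Branch 3: no ECG evidence, so the urgent factors decide alone.
    if urgent_factors >= 2:
        return "Moderate Risk"
    if urgent_factors == 1:
        return "Low Risk"
    return "Very Low Risk"


if __name__ == "__main__":
    print(goldman_risk(False, True, 1))   # Moderate Risk
    print(goldman_risk(False, False, 0))  # Very Low Risk
```

Note how few inputs the rule needs: every other observation about the patient is deliberately ignored.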
NICE.
What does this look like in our world?
WHAT GOOD MONITORING LOOKS LIKE
Elements of Good Monitoring

1.  System Availability
2.  Operating System Performance
3.  Hardware Monitoring
4.  Service/Daemon and Process Availability
5.  Error Logs
6.  Application Resource KPIs
7.  End-to-End Transactions
8.  Point of Failure Transactions
9.  Fail-Over Success
10. “Activity Monitors” and “Reverse Hockey Stick”

[Diagram: the numbered elements placed on an architecture of Corporate LANs & VPNs, Firewall, Load Balancers, Web Server Farm, Switch, Middleware, Data Power, Database, and Mainframe]
FINDING METRICS THAT MATTER
Evaluating the Effectiveness of a Metric

•  Will the metric be used in a report? If so, which one? How is it used in the report?
•  Will the metric be used in a dashboard? If so, which one? How will it be used?
•  What action(s) will be taken if an alert is generated? Who are the actors? Will a ticket be generated? If so, what severity?
•  How often is this event likely to occur? What is the impact if the event occurs? What is the likelihood it can be detected by monitoring?
•  Will the metric help identify the source of a problem? Is it a coincident / symptomatic indicator?
•  Is the metric always associated with a single problem? Could this metric become a false indicator?
•  What is the impact if this goes undetected?
•  What is the lifespan for this metric? What is the potential for changes that may reduce the efficacy of the metric?
ANATOMY OF AN OUTAGE
IM01109089: P0 - Affecting Multiple apps & Internet Sales West

1.  5:45-ish pm: CICS ABENDs start flooding MainView but not high enough to ticket
2.  6:00-ish pm: MQ flows are interrupted and are alerting in Flow Diagnostics
3.  6:04pm: Synthetic transactions fail, and at 6:14 the Ops Center confirms the issue and creates a P0 Incident
4.  6:54pm: Support teams investigate the interrupted flows and determine it is a “back-end” problem
5.  10:29pm: Support teams investigate MQ, rule it out, and ultimately decide to reset CICS to resolve the issue

[Diagram: the numbered events placed on an architecture of Corporate LANs & VPNs, Firewall, Load Balancer, Web Servers, WAS, Database, Message Queue, zOS MQ, zOS CICS, and DB2]
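
The cost of missing the first signal becomes concrete once the intervals are worked out. The sketch below uses only the clock times from the timeline and treats the two "-ish" times as exact, which is an assumption. The event keys and the `minutes_between` helper are hypothetical names.

```python
from datetime import datetime


def at(clock: str) -> datetime:
    """Parse an 'HH:MM' 24-hour time; the calendar date is irrelevant here."""
    return datetime.strptime(clock, "%H:%M")


# Clock times taken from the timeline ("-ish" times treated as exact).
timeline = {
    "cics_abends_start": at("17:45"),
    "mq_flows_alerting": at("18:00"),
    "synthetic_failure": at("18:04"),
    "p0_incident_opened": at("18:14"),
    "back_end_diagnosis": at("18:54"),
    "cics_reset_decision": at("22:29"),
}


def minutes_between(start: str, end: str) -> int:
    """Whole minutes elapsed between two named timeline events."""
    return int((timeline[end] - timeline[start]).total_seconds() // 60)


if __name__ == "__main__":
    # First symptom to incident: 29 minutes
    print(minutes_between("cics_abends_start", "p0_incident_opened"))
    # Incident to the CICS reset decision: 255 minutes (4 h 15 min)
    print(minutes_between("p0_incident_opened", "cics_reset_decision"))
    # First symptom to the CICS reset decision: 284 minutes (4 h 44 min)
    print(minutes_between("cics_abends_start", "cics_reset_decision"))
```

The component that was eventually reset, CICS, was already signalling trouble 29 minutes before the incident was opened and more than four and a half hours before the fix was chosen.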
DRIVING THE RIGHT KIND OF ACTION

Application
   End User Experience
      Gainesville, San Antonio, Des Moines, Columbus
      (each with Transaction 1, Transaction 2, … Transaction N)
   Infrastructure
      Network, Mainframe, Storage, Linux, Middleware, Database
      (each with KPI 1, KPI 2, … KPI N)
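
One way to read this hierarchy is as a status roll-up, where each node shows the worst state found beneath it. The slide does not state a roll-up rule, so the "worst status wins" rule below is an assumption, as are the `Status` levels, the `roll_up` helper, and the sample values.

```python
from enum import IntEnum


class Status(IntEnum):
    """Hypothetical health levels, ordered so that a higher value is worse."""
    OK = 0
    WARNING = 1
    CRITICAL = 2


def roll_up(node):
    """Return the worst Status found at or beneath a node.

    A node is either a Status (a transaction or KPI reading) or a dict
    mapping child names to further nodes.
    """
    if isinstance(node, Status):
        return node
    return max((roll_up(child) for child in node.values()), default=Status.OK)


# Sample readings, invented for illustration only.
application = {
    "End User Experience": {
        "Gainesville": {"Transaction 1": Status.OK, "Transaction 2": Status.OK},
        "Columbus": {"Transaction 1": Status.CRITICAL, "Transaction 2": Status.OK},
    },
    "Infrastructure": {
        "Mainframe": {"KPI 1": Status.WARNING, "KPI 2": Status.OK},
        "Database": {"KPI 1": Status.OK},
    },
}

if __name__ == "__main__":
    for branch, subtree in application.items():
        print(branch, roll_up(subtree).name)
    print("Application", roll_up(application).name)
```

With a view like this, a failing transaction in one location and a warning on one infrastructure area appear side by side under the same application, which is where the connection between them can be noticed.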
DRIVING THE RIGHT KIND OF ACTION!



                                                                           Application!


                            End User
                           Experience!                                                                       Infrastructure!



Gainesville!       San Antonio!       Des Moines!        Columbus!          Network!      Mainframe!    Storage!          Linux!        Middleware!   Database!



  Transaction 1!     Transaction 1!     Transaction 1!    Transaction 1!         KPI 1!        KPI 1!       KPI 1!             KPI 1!        KPI 1!       KPI 1!



  Transaction 2!     Transaction 2!     Transaction 2!    Transaction 2!         KPI 2!        KPI 2!       KPI 2!             KPI 2!        KPI 2!       KPI 2!



  Transaction N!     Transaction N!     Transaction N!    Transaction N!         KPI N!        KPI N!       KPI N!             KPI N!        KPI N!       KPI N!




    Follow Us: #ITSMSummit!
COMMON PROBLEM TYPES!

                          §    Design Problems!
                          §    Creative Problems!
                          §    Daily Problems!
                          §    People Problems!


               Rule-Based Approach              Event-Based Approach




EVENT-BASED PROBLEM SOLVING!



                    §    Appreciative Understanding!
                    §    Know What We Are Solving!
                    §    Create A Common Reality!
                    §    Solutions Based on Causes !




CAUSAL RELATIONSHIPS

①  Causes are effects, and effects are causes

Database Down          Drive Full             Logs Not Truncated
(Effect)               (Cause/Effect)         (Cause)

CAUSAL RELATIONSHIPS

②  You can keep identifying causes – there is no limit

End of the Universe    Database Down       Drive Full        Logs Not Truncated    Beginning of Time
(Effect)               (Primary Effect)    (Cause/Effect)    (Cause/Effect)        (Cause)

TWO IMPORTANT QUESTIONS

Ask “Why?”

End of the Universe    Database Down       Drive Full        Logs Not Truncated    Beginning of Time
(Effect)               (Primary Effect)    (Cause/Effect)    (Cause/Effect)        (Cause)

Ask “What”

RULES FOR CAUSAL RELATIONSHIPS

③  An Effect is often the result of multiple causes

SQL Server was not processing queries (Effect)
  -AND-
    Transaction log was unable to grow
    T: Drive at 0 Bytes free
      -AND-
        Logs were not truncated
          -AND-
            DBA on honeymoon vacation in Fiji
            Logs are truncated manually
            Company has only 1 DBA
            “Backup” DBA was not aware the logs require truncation
        Space allocations are fixed
          Lack of Control

RULES FOR CAUSAL RELATIONSHIPS

④  Causes need to be both necessary and sufficient

SQL Server was not processing queries (Effect)
  -AND-
    Transaction log was unable to grow (Transitory Cause)
    T: Drive at 0 Bytes free (Non-Transitory Cause & Effect)
      -AND-
        Logs were not truncated (Transitory Cause & Effect)
          -AND-
            DBA on honeymoon vacation in Fiji (Transitory Cause)
            Logs are truncated manually (Non-Transitory Cause)
            Company has only 1 DBA (Non-Transitory Cause)
            “Backup” DBA was not aware the logs require truncation (Non-Transitory Cause)
        Space allocations are fixed (Non-Transitory Cause)
          Lack of Control

HOW FIRE WORKS

[Timeline diagram, axis: Time]
Transitory:       Match Strike
Non-Transitory:   Oxygen, Heat, Fuel
Result:           Fire

Fire   -AND-   Oxygen
               Heat
               Fuel
               Match Strike

•  Transitory Causes act as catalysts to bring about change (think Transition)
•  Non-Transitory Causes are objects, properties/attributes, and status
TAKE A SOLOGIC RCA DIAGRAM!


                                                                                                                                                        DBA on honeymoon
                                                                                                                                                          vacation in Fiji!



                                                                                                                                                        Logs are truncated
                                                                                                                                                            manually!
                                                                                                                                Logs were not
                                                                                                                                  truncated!       -AND-!
                                                                                                 Transaction log was
                                                                                                   unable to grow!                                      Company has only 1
                                                                      SQL Server was not                                                                      DBA!
                                                                      processing queries!   -AND-!
                                                                                                     T: Drive at 0 Bytes
                                                                                                             free!         -AND-!                       “Backup” DBA was
                                                                                                                                                        not aware the logs
                                               The application                                                                                           require truncation!
                                              server was timing    -AND-!
                                                     out!
                                                                                                                               Space allocations
                                                                                                      DR SQL Cluster!             are fixed!                 Lack of Control!
                      Web Server                                       Only one database
                  returning 500 errors!   -AND-!                                            -AND-!
                                                                         cluster in use!
                                                                                                  DR Cluster being             More Information
                                                                                                 used for UAT testing!             Needed!
 Customers                                   Only one application      More Information
               -AND-!
Complaining!                                    server exists!             Needed!


                      Trying to do
                    business on the           Desired Condition!
                        website!




ADD THE EVIDENCE!

                                                                                            Statistical Data!

                                                                                                                                                           DBA on honeymoon
                                                                                                                                                             vacation in Fiji!
         Observation!


                                                                                                                                                           Logs are truncated
                                                                                                                                                               manually!
                                                                                                                                   Logs were not
                                                                                                                                     truncated!       -AND-!
                                                                                                      Transaction log was
                                                                                                        unable to grow!                                    Company has only 1
                                                                        SQL Server was not                                                                       DBA!
                                                                        processing queries!    -AND-!
                                                                                                        T: Drive at 0 Bytes
                                                                                                                free!         -AND-!                       “Backup” DBA was
                                                                                                                                                           not aware the logs
                                                 The application                                                                                            require truncation!
                                                server was timing    -AND-!
                                                       out!
                                                                                                                                  Space allocations
                                                                                                         DR SQL Cluster!             are fixed!                 Lack of Control!
                        Web Server                                       Only one database
                    returning 500 errors!   -AND-!                                             -AND-!
                                                                           cluster in use!
                                                                                                       DR Cluster being           More Information
                                                                                                      used for UAT testing!           Needed!
 Customers                                     Only one application      More Information
                -AND-!
Complaining!                                      server exists!             Needed!


                          Trying to do
                        business on the         Desired Condition!
                            website!



                                                                                                   Situational!


FAILURE MODE AND EFFECTS ANALYSIS!
                                                                                                                    DBA on honeymoon
                                                                                                                      vacation in Fiji!




                                                                                                                     Logs are truncated
                                                                                                                         manually!


                                           Transaction log is unable
                                                                               Logs were not truncated!     -AND-!
                                                   to grow!
                                                                                                                    Company has only 1
                                                                                                                          DBA!

                                                -AND-!
                                                                                                                   “Backup” DBA was not
                                                                                                                   aware the logs require
                                               T: Drive at 0 Bytes free!   -AND-!                                       truncation!
                                                                                                                     (Condition Cause)!


                                                                                Space allocations are
                                                                                       fixed!                           Lack of Control!
                                                                                    (Condition Cause)!
                      SQL Server Not                                                                               Minidump is configured
                        Available!     -OR-!                                                                         to write to C: Drive!




                                                                                                                     Server was ASRing
                                           SQL is unable to cache                                                        frequently!
                                               query results !



                                                -AND-!
                                                                               C: Drive at 0 Bytes free!   -OR-!

                                               Available RAM at 0                                                   Software distributions
                                                   Bytes Free!             -AND-!                                  were leaving files in the
                                                                                                                        TEMP folder!
                                                                                Kernel able to write to
                                                                                      page file!

                                                                                                                   %TEMP% configured to
                                                                                                                        C:\Temp!




GETTING TO OUR REQUIREMENTS!
                                 At least one point                                                                  DBA on honeymoon
                                                                                                                       vacation in Fiji!
                                 along each branch
                                    after the “OR”!

                                                                                                                     Logs are truncated
                                                                                                                         manually!
                                                                                       Logs were not
       Monitor the                                                                       truncated!          -AND-!
                                                       Transaction log is
     intersections at                                   unable to grow!
        the “OR’s”!                                                                                                 Company has only 1
                                                                                                                          DBA!
                                                      -AND-!
                                                                                                                    “Backup” DBA was not
                                                       T: Drive at 0 Bytes                                          aware the logs require
                                                               free!          -AND-!                                     truncation!
                                                                                                                      (Condition Cause)!

                                                                                  Space allocations are
                                                                                         fixed!                         Lack of Control!
                          SQL Server Not                                           (Condition Cause)!                   Minidump is
                            Available!       -OR-!                                                                  configured to write to
                                                                                                                         C: Drive!




                                                       SQL is unable to                                              Server was ASRing
                                                      cache query results !                                              frequently!



                                                      -AND-!
                                                                                   C: Drive at 0 Bytes      -OR-!
                                                                                          free!
                                                      Available RAM at 0                                            Software distributions
                                                          Bytes Free!         -AND-!                                 were leaving files in
                                                                                                                      the TEMP folder!
                                                                                  Kernel able to write to
                                                                                        page file!
                                                                                                                    %TEMP% configured
                                                                                                                       to C:\Temp!
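
The two monitoring rules on this slide can be sketched in Python. The tree below is a simplified, assumed reading of the diagram, and `Node` and `monitoring_points` are hypothetical names used only for illustration:

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    gate: str = "LEAF"  # "AND", "OR", or "LEAF"
    children: list = field(default_factory=list)


def monitoring_points(node, found=None):
    """Collect each OR intersection plus the first point on every branch after it."""
    if found is None:
        found = []
    if node.gate == "OR":
        for name in [node.name] + [child.name for child in node.children]:
            if name not in found:
                found.append(name)
    for child in node.children:
        monitoring_points(child, found)
    return found


# Simplified version of the slide's tree (shape is an assumption).
tree = Node("SQL Server Not Available", "OR", [
    Node("Transaction log is unable to grow", "AND", [
        Node("T: Drive at 0 Bytes free", "AND", [
            Node("Logs were not truncated"),
            Node("Space allocations are fixed"),
        ]),
    ]),
    Node("SQL is unable to cache query results", "AND", [
        Node("Available RAM at 0 Bytes Free"),
        Node("C: Drive at 0 Bytes free", "OR", [
            Node("Minidump is configured to write to C: Drive"),
            Node("Software distributions were leaving files in the TEMP folder"),
        ]),
    ]),
])

for point in monitoring_points(tree):
    print(point)
```

The sketch picks the first node on each branch after an "OR"; the slide only asks for at least one point per branch, so any node further along the branch would satisfy the rule as well.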




FMEA MATRIX (IMPACT CALCULATION)

Severity:
•  Negligible (1-2): no loss in functionality, mostly cosmetic
•  Marginal (3-4): temporary interruptions or the degradation lasts for a brief period of time
•  Critical (5-6): the problem will not resolve itself but a workaround exists allowing the problem to be bypassed
•  Serious (7-8): the problem will not resolve itself and no workaround is possible. Functionality is impaired or lost but the system is usable to some extent
•  Catastrophic (9-10): the system is completely unusable

Occurrence:
•  Improbable (1-2): less than 1 time per year
•  Remote (3-4): 1 time per year
•  Occasional (5-6): 1 time per month
•  Probable (7-8): 1 time per day
•  Chronic (9-10): 1 or more times per day

Detection:
•  Very high (1-2): during the design phase
•  High (3-4): during peer review or unit testing
•  Moderate (5-6): during system testing or acceptance testing
•  Remote (7-8): during or immediately after production deployment
•  Very Remote (9-10): only after heavy usage by users
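
One conventional way to combine three 1-10 ratings like these is the FMEA risk priority number (severity × occurrence × detection). The slide does not state its formula, so the sketch below assumes that convention; the function name and the example scores are hypothetical:

```python
def risk_priority_number(severity, occurrence, detection):
    """Conventional FMEA RPN: the product of three 1-10 ratings (assumed formula)."""
    ratings = (("severity", severity), ("occurrence", occurrence), ("detection", detection))
    for label, score in ratings:
        if not 1 <= score <= 10:
            raise ValueError(f"{label} must be between 1 and 10, got {score}")
    return severity * occurrence * detection


# Hypothetical scoring of a "T: Drive at 0 Bytes free" failure mode:
# Catastrophic (9), Remote occurrence (3), detected only around production deployment (8).
print(risk_priority_number(severity=9, occurrence=3, detection=8))  # 216
```

Higher products point to the failure modes that deserve monitoring investment first.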




FMEA MATRIX (EVIDENCE)

•  These are the events that help us to RULE IN a failure mode as a possible cause
•  These are the events that help us RULE OUT the failure mode as not relevant

HOW TO DETERMINE EVENT SEVERITY
Six Levels of Severity

•    The event severity is determined with respect to the component generating the event
•    The event severity does not consider impact or urgency
•    The incident priority is not determined by event severity
•    The event severity helps drive an effective triage when multiple events arrive at approximately the same time
•    Only after the affected components and their relationships to each other have been determined can impact and urgency be determined

[Diagram: related components]
Logical Server:     Virtual Machine 1, Virtual Machine 2
Physical Server:    Server 1, Server 2
Logical Volumes:    Volume Group 1, Volume Group 2
Physical Volumes:   Hard Drive 1, Hard Drive 2, Hard Drive 3

Severity         Description
Critical         The component has completely failed
Major            The component is operating but is in a degraded or crippled state
Minor            The component is functioning normally but is at risk of a more serious failure
Informational    The component is functioning normally but is reporting a change in state
Unknown          The component has changed its operating state but the effect is not known
Clear            The component is operating normally or a higher severity event has been resolved
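
A small Python sketch of severity-driven triage. The numeric ranks are an assumption (the slide only lists the six levels in this order), and the component names are borrowed from the diagram purely as sample data:

```python
from enum import IntEnum


class Severity(IntEnum):
    # Ranks are assumed; a higher value is triaged first.
    CLEAR = 0
    UNKNOWN = 1
    INFORMATIONAL = 2
    MINOR = 3
    MAJOR = 4
    CRITICAL = 5


def triage(events):
    """Order events by severity, highest first; ties keep their arrival order."""
    return sorted(events, key=lambda event: event[1], reverse=True)


# Events arriving at approximately the same time, as (component, severity).
arrivals = [
    ("Volume Group 1", Severity.MINOR),
    ("Hard Drive 2", Severity.CRITICAL),
    ("Server 1", Severity.MAJOR),
    ("Virtual Machine 1", Severity.INFORMATIONAL),
]

for component, severity in triage(arrivals):
    print(f"{severity.name:<14}{component}")
```

The ordering says nothing about incident priority; impact and urgency still depend on how the affected components relate to each other.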




MONITORING BASED ON PATTERNS!


 Layers of Pre-Defined Monitoring Patterns !
•  The OS template is deployed when the
   server is provisioned!
•  As a server is customized to fit its role,
   additional templates are deployed!




•  Templates are stacked on top of each
   other until no gaps remain!
•  This approach provides a high degree of
   standardization without sacrificing the
   ability to develop a custom solution !
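
Template stacking can be sketched in Python as a union of monitor sets. The template names and monitor names below are hypothetical placeholders, not the contents of any real monitoring product:

```python
# Hypothetical templates: each one is a set of monitors it deploys.
OS_TEMPLATE = {"cpu", "memory", "disk_space"}
ROLE_TEMPLATES = {
    "database": {"transaction_log_size", "query_latency"},
    "web": {"http_status", "response_time"},
}


def stack_templates(roles):
    """Start from the OS template, then layer one template per server role."""
    monitors = set(OS_TEMPLATE)
    for role in roles:
        monitors |= ROLE_TEMPLATES[role]
    return monitors


def coverage_gaps(required, roles):
    """Monitors the server needs that no stacked template provides."""
    return set(required) - stack_templates(roles)


needed = {"cpu", "transaction_log_size", "http_status"}
print(coverage_gaps(needed, ["database"]))         # {'http_status'}
print(coverage_gaps(needed, ["database", "web"]))  # set()
```

A custom monitor would simply be one more small template added to the stack, which keeps the standard layers untouched.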

APPLICATION-TECHNOLOGY MATRIX!
                                         Maps services, applications and technologies
                                         enabling:!
                                         • Monitoring investment prioritization!
                                         • Monitoring maturity!
                                         • Which templates need to be deployed when
                                         new hardware is acquired!
                                         • Whether a service has sufficient monitoring
                                         coverage based on its application components!
                                         • This approach allows for anticipating changes to
                                         a customer’s monitoring needs!




Scores indicate:!
0 – No Strategy!
1 – Limited Monitoring!
2 – Fully Integrated Strategy!
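
A minimal Python sketch of how such a matrix could be queried for coverage gaps. The service names and scores are made up for illustration; only the 0/1/2 scale comes from the slide:

```python
# Scores: 0 = No Strategy, 1 = Limited Monitoring, 2 = Fully Integrated Strategy.
# Services and scores are hypothetical sample data.
MATRIX = {
    "Online Quotes": {"Network": 2, "Linux": 2, "Middleware": 1, "Database": 2},
    "Claims Intake": {"Network": 2, "Mainframe": 0, "Storage": 1},
}


def services_with_gaps(matrix, minimum=2):
    """Map each service to the technologies scoring below the minimum."""
    gaps = {}
    for service, technologies in matrix.items():
        weak = sorted(tech for tech, score in technologies.items() if score < minimum)
        if weak:
            gaps[service] = weak
    return gaps


print(services_with_gaps(MATRIX))
```

Sorting the result by the number of weak technologies would give one simple way to prioritize monitoring investment.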

EVENT LIFECYCLE EXAMPLE

Software-Operating System

Activity (each paired with a Responsible Tool from the legend):
• Data Collection
• Anomaly Detection
• Event Generation
• Integration
• Event Processing
• Enrichment
• Event Suppression
• Correlation
• Root Cause Analysis
• Business Impact Analysis
• Automation
• Notification & Escalation
• Presentation
• User Interaction Tools
• Archiving
• Reporting

Activity (second Activity / Responsible Tool table):
• Trigger Ticket Request
• Create Ticket
• Update Event with IM#
• Trigger Courtesy Pages
• Send Pages

Legend:
• Element Manager
• Distributed Collectors
• Object Server Triggers
• Impact Policies
• ITNM RCA Engine
• Gateway Replication
• Webtop Event List
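
To make three of the activities listed above concrete (Enrichment, Event Suppression, Correlation), here is a minimal Python sketch of them chained together. It is an illustration only: the event fields, the topology lookup, and the maintenance-window rule are assumptions, and it does not represent how any of the tools in the legend implement these steps.

```python
from collections import defaultdict


def enrich(event, topology):
    """Enrichment: attach the owning service using a node -> service lookup."""
    return {**event, "service": topology.get(event["node"], "unknown")}


def suppress(events, maintenance_nodes):
    """Event suppression: drop events from nodes in a maintenance window."""
    return [e for e in events if e["node"] not in maintenance_nodes]


def correlate(events):
    """Correlation: group the remaining events by the service they affect."""
    groups = defaultdict(list)
    for event in events:
        groups[event["service"]].append(event)
    return dict(groups)


def process(raw_events, topology, maintenance_nodes):
    """Run the three stages in the order they appear on the slide."""
    enriched = [enrich(e, topology) for e in raw_events]
    return correlate(suppress(enriched, maintenance_nodes))


# Hypothetical sample data.
RAW = [
    {"node": "web01", "summary": "CPU high"},
    {"node": "db01", "summary": "Disk full"},
    {"node": "web02", "summary": "Ping fail"},
]
TOPOLOGY = {"web01": "portal", "web02": "portal", "db01": "billing"}
RESULT = process(RAW, TOPOLOGY, maintenance_nodes={"db01"})
```

With this sample data the `db01` event is suppressed and the two web events end up grouped under the `portal` service.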



AGGREGATION AND ANALYSIS OVERVIEW
[Diagram; labeled components:]
• Monitoring sources: LOB Managed Monitoring System, Service Provider Managed Monitoring System, Vendor Managed Monitoring System, Element Managers
• Distributed Collectors
• Processing stages: Common Event Format, Enrichment, Correlation and Event Suppression, Root Cause Analysis, Business Impact Analysis, Notification and Escalation, Automated Action, Archive and Report
• Topology And Relationship Database
• Business Telemetry Data
• Service Center and Enterprise Notification Tool
• Automated Action Tools
• Meta-Data Integration Bus: Other Enterprise Data, Document Sharing, Service Desk, CMDB, Batch Scheduling, Knowledge Database, Online Run Book, PBX/Call Manager
• Other connected systems: Automated Change Reconciliation, Predictive Analysis, Automated Provisioning System, Security Management
• Visualization Framework
How do we keep it evolving?!
FACILITATING PRODUCTION ASSURANCE
§  CritSits
   §  Start the CritSit meeting with an accounting of all the
       potential failure modes: which have been successfully
       ruled out, and which still need to be investigated
   §  Add other potential failure modes to the KT matrix
§  Problem Management
   §  Document the causal elements as new failure modes
   §  Disseminate new failure modes to Architecture, ESM and
       the Command Center
§  Reporting
   §  Produce a monthly newsletter to application owners with
       the list of failure modes they should discuss with their
       architects
   §  Incorporate failure modes into “Fault Line” analysis

DURING THE DESIGN PROCESS
•  Architects
   •  Certify that designs do not contain the known failure
      modes, or document that the failure mode does not present
      an unacceptable risk
   •  Document the requirements Solution Architects must
      follow to ensure the mitigation strategies are implemented
      completely
•  Developers
   •  Certify that designs do not contain the known failure
      modes, or document that the failure mode does not present
      an unacceptable risk
   •  Certify that the designs implement the mitigation strategies
      correctly

IMPROVING ENTERPRISE TOOLS
§  Systems Management
    §  Develop new monitoring requirements using the
        documented indications and contraindications
§  Event Management
    §  Develop new correlations tying indications and
        contraindications to failure modes, so that the failure
        modes “in play” can be ruled in or out more
        efficiently
§  Configuration Management
    §  Develop new discovery patterns using the
        documented indications and contraindications
    §  Develop automations to detect the presence of
        failure mode conditions and generate an event to
        the Event Management System
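
The idea of tying indications and contraindications to failure modes can be sketched in a few lines of Python. This is a hedged illustration, not the presenter's implementation: the failure mode names, the signal names, and the `triage` function are all hypothetical.

```python
# Hypothetical catalog: each failure mode lists signals that point toward it
# (indications) and signals that rule it out (contraindications).
FAILURE_MODES = {
    "connection-pool-exhaustion": {
        "indications": {"db_timeouts"},
        "contraindications": {"pool_below_limit"},
    },
    "disk-full": {
        "indications": {"write_errors"},
        "contraindications": {"disk_free_ok"},
    },
    "dns-failure": {
        "indications": {"name_resolution_errors"},
        "contraindications": {"dns_lookup_ok"},
    },
}


def triage(failure_modes, observed):
    """Classify each failure mode from the set of observed signals."""
    status = {}
    for name, mode in failure_modes.items():
        if mode["contraindications"] & observed:
            # Any observed contraindication rules the failure mode out.
            status[name] = "ruled out"
        elif mode["indications"] & observed:
            status[name] = "in play"
        else:
            status[name] = "needs investigation"
    return status


# Hypothetical signals observed during an incident.
OBSERVED = {"db_timeouts", "disk_free_ok"}
STATUS = triage(FAILURE_MODES, OBSERVED)
```

The resulting dictionary is the kind of accounting an incident lead could read out at the start of a call: what is ruled out, what is in play, and what still has no evidence either way. Letting a contraindication win over an indication is a design choice made for this sketch.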
DURING SERVICE SUPPORT

•  Command Centers and Support Teams
  –  Use the documented failure modes to rule out potential causes
  –  Each failure mode will have a documented process to
      follow to mitigate the impact once that failure mode is
      identified
•  Incident Managers
  –  Start bridge calls with an accounting of all the
     potential failure modes: which have been successfully
     ruled out, and which still need to be investigated
  –  Coordinate the investigation assignments and consolidate
     the investigation results


LET’S KEEP THE CONVERSATION GOING…
                    Andrew.P.White@Gmail.com

                    @SystemsMgmtZen

                    SystemsManagementZen.Wordpress.com

                    systemsmanagementzen.wordpress.com/feed/

                    ReverendDrew

                    614-306-3434




Bright talk bringing back the love - final

  • 1. Bringing Back the Love: How Situational Awareness Improves User Experience! http://www.flickr.com/photos/64123293@N00/5985619750/sizes/l/in/photostream/!
  • 2. Andrew White! Manager of Systems and ! Event Management At ! Nationwide Insurance! ! Mr. White leads a team of software developers focused on creating tools that collect and analyze health information from Nationwide's IT systems. These tools have a wide variety of applications, from fault detection and problem investigation to trend reporting and capacity planning.! ! Andrew has over ten years of experience designing and managing the deployment of systems management software. Prior to joining Nationwide, Andrew developed solutions for a wide variety of organizations, including the Mexican Secretaría de Hacienda y Crédito Público, Telmex, Wal-Mart of Mexico, JP Morgan Chase, and the US Navy Facilities and Engineering Command.!
  • 3. GROUND RULES FOR THIS SESSION… 1. If you can’t tell if I am trying to be funny… GO AHEAD AND LAUGH! 2. Feel free to text, tweet, yammer, or whatever. People gotta hear this! 3. If you have a question, no need to wait until the end. Just interrupt me. Seriously… I don’t mind. Follow Us: #ITSMSummit
  • 4. My name is Andrew White. I lead a Systems and Event Management team.
  • 5. I am here today to talk about Situational Awareness
  • 7. SIT·U·A·TION – [SI-CHƏ-WĀ-SHƏN], noun: 1. manner of being situated; location or position with reference to environment: The situation of the house allowed for a beautiful view. 2. condition; case; plight: He is in a desperate situation. 3. the state of affairs; combination of circumstances: The present international situation is dangerous. 4. a state of affairs of special or critical significance in the course of a play, novel, etc.
  • 8. Not this Situation… Think this situation…
  • 9. A·WARE·NESS – [UH-WAIR-NIS], noun: 1. having knowledge; conscious; cognizant: aware of danger. 2. informed; alert; knowledgeable; sophisticated: She is one of the most politically aware young women around.
  • 11. When you put them together we get: The perception of and reaction to a set of changing events in terms of what can be done, instead of merely the recollection of a stimulus. [1] Most outages are the result of a lack of situational awareness. [1] Adapted from Endsley, M.R. (1995b). Toward a theory of situation awareness in dynamic systems. Human Factors 37(1), 32–64.
  • 13. I am going to talk about some new capabilities that will help you.
  • 14. Why do we lose situational awareness?
  • 15. This is Magenta… It doesn’t exist. :(
  • 16. Cyan ≈ 490–520 nm, Yellow ≈ 570–590 nm, Magenta???
  • 17. The two color wavelengths that produce it are not side by side in the spectrum
  • 18. Squares A and B are the same color
  • 19. We cannot control the way our brain processes information!
  • 20. So… why do we lose situational awareness?
  • 21. SOMETIMES WE MISS WHAT IS GOING ON: “Say… what’s a mountain goat doing all the way up here in a cloud bank?”
  • 22. WHICH DO YOU USE WHEN? We don’t have a tooling problem… we have an understanding problem! [Diagram: many tools (Tool, Tool, Tool) spread across Technology Areas]
  • 23. Our systems are capable of producing a huge amount of data, both on the status of their own components and on the status of the environment. The problem with today’s systems is not a lack of information, but finding what is needed when it is needed. (Adapted from Endsley, M.R. (1995b). Toward a theory of situation awareness in dynamic systems. Human Factors 37(1), 32–64.)
  • 24. I would like to show why this happens…
  • 25. BOYD’S OODA “LOOP”: Observe → Orient → Decide (Hypothesis) → Act (Test)
      [Diagram labels. Into Observe: Unfolding Circumstances, Outside Information, Unfolding Interaction With Environment. Inside Orient: Cultural Norms, Knowledge Life Cycle, Cognitive Abilities, New Information, Prior Wisdom. Links: Feed Forward (between stages), Feedback, Implicit Guidance & Control.]
      •  Note how observation shapes orientation, shapes decision, shapes action, and in turn is shaped by the feedback and other phenomena coming into our sensing or observing window.
      •  Also note how the entire “loop” (not just orientation) is an ongoing many-sided implicit cross-referencing process of projection, empathy, correlation, and rejection.
      From “The Essence of Winning and Losing,” John R. Boyd, January 1996.
  • 26. WHERE THE BREAKDOWN OCCURS
      [Diagram: Current State → Situational Awareness → Decision → Performance of Actions, with Feedback to the Current State; overlaid with Observe, Orient, Decide, Act]
      •  Situational Awareness: Level 1, Perception of Elements in Current Situation; Level 2, Comprehension of Current Situation; Level 3, Projection of Future Status
      •  Systemic Influences: System Capability, Interface Design, Stress & Workload, Complexity, Automation
      •  Individual Influences: Goals & Objectives, Preconceptions, Expectations, Abilities, Experience, Training
      •  Cognitive Processes: Long Term Memory, Automaticity
      Adapted from Endsley, M.R. (1995b). Toward a theory of situation awareness in dynamic systems. Human Factors 37(1), 32–64.
  • 27.
  • 28. Maybe. Let me show you why this is important…
  • 29. WE (IT) SELL PROMISES… The value of these promises depends on the customer’s perception that we are willing and able to make good on the promise when the time comes. This perception is affected by the interactions they have with us.
  • 30. Objective #1: Users Love Our IT Systems… http://www.flickr.com/photos/anneacaso/3693155059/sizes/l/in/photostream/
  • 31. WHAT THIS MEANS TO US… There are a few inescapable facts we face: 1. We need reliable systems to store the promises we make to our customers. 2. Our systems mirror the complexity of the businesses they support. 3. Our environments must be massive to scale to handle the workload. 4. There is too much activity for a single person to be totally situationally aware. 5. If the users can’t use it, it doesn’t work.
  • 32. EVENT MANAGEMENT FOCUS… In addition to monitoring for performance, we are here to help manage availability. Our Formula: 1. Continually collect, categorize, and analyze all events from as many sources as possible. 2. Correlate events and analyze them using previous outages as patterns to identify situations worth investigating. 3. Notify a support team so the situation can be mitigated before becoming an outage.
  • 33. When all of these happen at the same time… Ugh…
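The three-step formula on slide 32 (collect, correlate against past outages, notify) can be sketched in a few lines. This is a minimal illustration, not the tooling described in the talk: the `Event` fields, the pattern name, and the source/category strings are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    source: str    # hypothetical source name, e.g. "CICS" or "MQ"
    category: str  # hypothetical category, e.g. "abend"


# Hypothetical patterns taken from previous outages: each situation maps to
# the set of (source, category) pairs that were seen together last time.
PATTERNS = {
    "back-end transaction failure": {
        ("CICS", "abend"),
        ("MQ", "flow_interrupted"),
    },
}


def detect_situations(events, patterns=PATTERNS):
    """Step 2: return the patterns whose every element is in the event pool."""
    seen = {(e.source, e.category) for e in events}
    return sorted(name for name, required in patterns.items() if required <= seen)


def notify(situations, send=print):
    """Step 3: hand each detected situation to a support team (here: print)."""
    for situation in situations:
        send(f"Investigate possible situation: {situation}")


if __name__ == "__main__":
    pool = [Event("CICS", "abend"), Event("MQ", "flow_interrupted")]  # step 1
    notify(detect_situations(pool))
```

Matching on whole sets keeps a single noisy event from paging anyone; only the combination seen in an earlier outage raises a situation.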
  • 36. CLEANING UP THE LANDSCAPE [Diagram labels: Launch Pad, Silo, Monolithic Framework, Niche, Information Bus] Adapted from: Akella, Janaki. “IT Architecture: Cutting costs and complexity.” McKinsey Quarterly, 13 Nov 2009. https://www.mckinseyquarterly.com/IT_architecture_Cutting_costs_and_complexity_2391
  • 37. ONE INTEGRATED ENVIRONMENT [Diagram labels: CMDB, Paging, Service Desk, Presentation Framework, 3rd Party Providers, Knowledge, Asset Mgmt, Enrichment & Correlation, Event API, Event Pool, Event Catalog, Predictive, Business Telemetry, Mainframe, Distributed, Database, Network, Middleware, Storage, Operational Data Warehouse]
  • 38. CONCEPTUALIZING SITUATIONAL AWARENESS [Diagram: Real-Time Event Streams, Causal Relationships from Past RCAs, and Patterns from Historical Data feed the Situational Awareness Engine, which produces Detected and Predicted Situations] Adapted from http://www.slideshare.net/TimBassCEP/getting-started-in-cep-how-to-build-an-event-processing-application-presentation-717795
  • 39. SITUATIONAL AWARENESS MODEL DESIGN
      Data → Information → Knowledge → Intelligence
      •  Level 1: Event Taxonomy and Enrichment
      •  Level 2: Event Tracking
      •  Level 3: Situation Detection
      •  Level 4: Predictive Analysis
      •  Level 5: Runbook Automation
      [Other diagram labels: Event Sources, Event Pipeline, Historical Event Archive, Solicitations for User Interaction via the Visualization Framework, Causal Relationships from Past RCAs, Patterns from Historical Data]
      Adapted from the JDL: Steinberg, A., & Bowman, C., Handbook of Multisensor Data Fusion, CRC Press, 2001
  • 40. REQUIREMENTS FOR UNITY OF EFFORT: 1. Command and Control, 2. Shared Experience, 3. Situational Awareness
      Symptoms of Missing Elements
      •  Command and control (No Leadership)
          •  The team lacks a clear direction
          •  Lots of activity, lack of progress
      •  Shared Experience (Poor Relationships)
          •  Us vs. Them mentality
          •  Unhealthy competition
      •  Situational Awareness (Poor Communication)
          •  Focused on cooperation, not collaboration
          •  Blame culture
          •  Infrequent or non-existent communication
  • 41. Our success in any endeavor depends directly on our ability to solve problems. What do we need to do that?
  • 42. You Gotta Have Skillz…
  • 43. WHAT MATTERS MOST? Cook County Hospital, Chicago, IL. Dr. Lee Goldman. The Goldman Algorithm: §  Is the patient feeling unstable angina? §  Is there fluid in the patient’s lungs? §  Is the patient’s systolic blood pressure below 100? [Bar chart: Prediction of Patients Who Will Have a Heart Attack Within 72 Hours; vertical axis 0 to 100; bars for Traditional Techniques and Goldman Algorithm] By paying attention to what really matters, Dr. Goldman improved the “false negatives” by 20 percentage points and eliminated the “false positives” altogether.
  • 44. THE GOLDMAN ALGORITHM
      [Flowchart] Patient enters ED with suspected Acute Cardiac Ischemia → Perform Electrocardiogram (EKG)
      •  ECG Evidence of Acute Myocardial Infarction (MI)? ST-Segment Elevation ≥ 1mm in ≥ 2 Contiguous Leads (New or Unknown Age) or Pathologic Q Waves in ≥ 2 Contiguous Leads (New or Unknown Age). Yes / No
      •  ECG Evidence of Acute Ischemia? ST-Segment Depression ≥ 1mm in ≥ 2 Contiguous Leads (New or Unknown Age) or T-Wave Inversion in ≥ 2 Contiguous Leads (New or Unknown Age) or Left Bundle-Branch Block (New or Unknown Age). Yes / No
      •  Urgent Factors Present? (asked on both branches) Rales Above Both Lung Bases; Systolic Blood Pressure <100 mm Hg; Unstable Ischemic Heart Disease
      •  Factor counts on the “Yes” branch: 2 or 3 Factors, 0 or 1 Factors. On the “No” branch: 2 or 3 Factors, 1 Factor, 0 Factors
      •  Outcomes: High Risk, Moderate Risk, Low Risk, Very Low Risk
      •  Dispositions: Coronary Care Unit, Inpatient Telemetry Unit, Observation Unit
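To show how little logic a rule like this needs, here is a minimal sketch of the flowchart on slide 44 as a function. The pairing of each branch with a risk level is an assumption read off the flattened diagram (the slide text lists the factor counts and the four risk levels but not the arrows between them), and the function name and parameters are hypothetical. It is an illustration of a decision tree, not a clinical tool.

```python
def goldman_risk(ecg_acute_mi: bool, ecg_acute_ischemia: bool, urgent_factors: int) -> str:
    """Classify risk from two ECG findings and a count of urgent factors (0-3).

    The branch-to-risk mapping below is an assumed reading of the slide.
    """
    if not 0 <= urgent_factors <= 3:
        raise ValueError("urgent_factors must be between 0 and 3")
    if ecg_acute_mi:
        return "High Risk"
    if ecg_acute_ischemia:
        # "Yes" branch of the ischemia question: two outcomes.
        return "High Risk" if urgent_factors >= 2 else "Moderate Risk"
    # "No" branch of the ischemia question: three outcomes.
    if urgent_factors >= 2:
        return "Moderate Risk"
    return "Low Risk" if urgent_factors == 1 else "Very Low Risk"


if __name__ == "__main__":
    print(goldman_risk(False, True, 1))
```

The point for monitoring is the shape: a handful of well-chosen inputs and a short tree, rather than every available reading.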
  • 45. NICE. What does this look like in our world?
  • 46. WHAT GOOD MONITORING LOOKS LIKE
      Elements of Good Monitoring
      1.  System Availability
      2.  Operating System Performance
      3.  Hardware Monitoring
      4.  Service/Daemon and Process Availability
      5.  Error Logs
      6.  Application Resource KPIs
      7.  End-to-End Transactions
      8.  Point of Failure Transactions
      9.  Fail-Over Success
      10. “Activity Monitors” and “Reverse Hockey Stick”
      [Diagram: the numbered elements are marked on an architecture of Corporate LANs & VPNs, Firewall, Switch, Load Balancers, Web Server Farm, Data Power, Middleware, Database, and Mainframe]
  • 47. FINDING METRICS THAT MATTER
      Evaluating the Effectiveness of a Metric
      §  Will the metric be used in a report? If so, which one? How is it used in the report?
      §  Will the metric be used in a dashboard? If so, which one? How will it be used?
      §  What action(s) will be taken if an alert is generated? Who are the actors? Will a ticket be generated? If so, what severity?
      §  How often is this event likely to occur? What is the impact if the event occurs? What is the likelihood it can be detected by monitoring?
      §  Will the metric help identify the source of a problem? Is it a coincident / symptomatic indicator?
      §  Is the metric always associated with a single problem? Could this metric become a false indicator?
      §  What is the impact if this goes undetected?
      §  What is the lifespan for this metric? What is the potential for changes that may reduce the efficacy of the metric?
  • 48. ANATOMY OF AN OUTAGE
      IM01109089: P0 - Affecting Multiple apps & Internet Sales West
      1.  5:45-ish pm: CICS ABENDs start flooding MainView but are not alerting high enough to ticket
      2.  6:00-ish pm: MQ flows start being interrupted; this appears in Flow Diagnostics
      3.  6:04pm: Synthetic transactions fail, and at 6:14 the Ops Center confirms the issue and creates a P0 Incident
      4.  6:54pm: Support teams investigate the interrupted flows and determine it is a “back-end” problem
      5.  10:29pm: Support teams investigate MQ, ultimately rule it out, and decide to reset CICS to resolve the issue
      [Diagram: the numbered steps are marked on an architecture of Corporate LANs & VPNs, Firewall, Load Balancer, Web Servers, WAS, Message Queue (MQ), zOS CICS, zOS DB2, and Database]
  • 49. DRIVING THE RIGHT KIND OF ACTION
      [Dashboard grid for an Application]
      •  End User Experience, one column per site (Gainesville, San Antonio, Des Moines, Columbus): Transaction 1, Transaction 2, … Transaction N
      •  Infrastructure, one column per technology (Network, Mainframe, Storage, Linux, Middleware, Database): KPI 1, KPI 2, … KPI N
  • 54.
  • 55. COMMON PROBLEM TYPES: §  Design Problems §  Creative Problems §  Daily Problems §  People Problems. [Diagram labels: Rule-Based Approach, Event-Based Approach]
  • 56. EVENT-BASED PROBLEM SOLVING: §  Appreciative Understanding §  Know What We Are Solving §  Create A Common Reality §  Solutions Based on Causes
  • 57. CAUSAL RELATIONSHIPS ①  Causes are effects, and effects are causes. [Chain: Logs Not Truncated (Cause) → Drive Full (Cause/Effect) → Database Down (Effect)]
  • 58. CAUSAL RELATIONSHIPS ②  You can keep identifying causes – there is no limit. [Chain: Beginning of Time (Cause) … Logs Not Truncated (Cause/Effect) … Drive Full (Cause/Effect) … Database Down (Primary Effect) … End of the Universe (Effect)]
  • 59. TWO IMPORTANT QUESTIONS: Ask “Why?” and Ask “What”. [Same chain as slide 58, from Beginning of Time (Cause) through Database Down (Primary Effect) to End of the Universe (Effect)]
  • 60. RULES FOR CAUSAL RELATIONSHIPS ③  An Effect is often the result of multiple causes
      [Cause tree leading to the effect “SQL Server was not processing queries”; causes are joined by -AND- gates. Node labels:]
      •  Transaction log was unable to grow
      •  Logs were not truncated
      •  T: Drive at 0 Bytes free
      •  DBA on honeymoon vacation in Fiji
      •  Logs are truncated manually
      •  Company has only 1 DBA
      •  “Backup” DBA was not aware the logs require truncation
      •  Space allocations are fixed
      •  Lack of Control
  • 61. RULES FOR CAUSAL RELATIONSHIPS ④  Causes need to be both necessary and sufficient
      [Same cause tree, joined by -AND- gates, with each node typed:]
      •  SQL Server was not processing queries (Effect)
      •  Transaction log was unable to grow (Transitory Cause)
      •  Logs were not truncated (Transitory Cause & Effect)
      •  T: Drive at 0 Bytes free (Non-Transitory Cause & Effect)
      •  DBA on honeymoon vacation in Fiji (Transitory Cause)
      •  Logs are truncated manually (Non-Transitory Cause)
      •  Company has only 1 DBA (Non-Transitory Cause)
      •  “Backup” DBA was not aware the logs require truncation (Non-Transitory Cause)
      •  Space allocations are fixed (Non-Transitory Cause)
      •  Lack of Control
  • 62. HOW FIRE WORKS
      [Diagram: over Time, Oxygen -AND- Heat -AND- Fuel -AND- Match Strike produce Fire; each cause is marked Transitory or Non-Transitory]
      •  Transitory Causes act as catalysts to bring about change (think Transition)
      •  Non-Transitory Causes are objects, properties/attributes, and status
  • 63. TAKE A SOLOGIC RCA DIAGRAM
      [RCA diagram; causes are joined by -AND- gates. Node labels, from the customer-facing effect back to the causes:]
      •  Customers Complaining; Trying to do business on the website (Desired Condition)
      •  Web Server returning 500 errors
      •  The application server was timing out; Only one application server exists (More Information Needed)
      •  SQL Server was not processing queries; Only one database cluster in use; DR SQL Cluster; DR Cluster being used for UAT testing (More Information Needed)
      •  Transaction log was unable to grow; Logs were not truncated; T: Drive at 0 Bytes free
      •  DBA on honeymoon vacation in Fiji; Logs are truncated manually; Company has only 1 DBA; “Backup” DBA was not aware the logs require truncation
      •  Space allocations are fixed; Lack of Control
  • 64. ADD THE EVIDENCE [The RCA diagram from slide 63, from “DBA on honeymoon vacation in Fiji” through “Customers Complaining”, with the causes annotated by the type of evidence supporting them: Statistical Data, Observation, or Situational]
  • 65. FAILURE MODES AND EFFECT ANALYSIS
      [Fault tree for “SQL Server Not Available”; branches are joined by -AND- and -OR- gates, and two causes are tagged (Condition Cause). Node labels:]
      •  Transaction log is unable to grow
      •  Logs were not truncated
      •  DBA on honeymoon vacation in Fiji
      •  Logs are truncated manually
      •  Company has only 1 DBA
      •  “Backup” DBA was not aware the logs require truncation
      •  T: Drive at 0 Bytes free
      •  Space allocations are fixed
      •  Lack of Control
      •  SQL is unable to cache query results
      •  Available RAM at 0 Bytes Free
      •  Kernel able to write to page file
      •  Server was ASRing frequently
      •  Minidump is configured to write to C: Drive
      •  C: Drive at 0 Bytes free
      •  Software distributions were leaving files in the TEMP folder
      •  %TEMP% configured to C:\Temp
  • 66. GETTING TO OUR REQUIREMENTS
      [The fault tree for “SQL Server Not Available” from slide 65, annotated with two monitoring requirements:]
      •  Monitor the intersections at the “OR’s”
      •  At least one point along each branch after the “OR”
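A fault tree with -AND- and -OR- gates can be evaluated mechanically once its leaves are monitored, which is why the slide asks for a monitored point on every branch after an “OR”. The sketch below is a minimal illustration. The tree is a simplified, hypothetical reading of slide 65 (the exact gate layout is not recoverable from the transcript), and the condition strings are illustrative labels.

```python
def evaluate(node, observed):
    """Return True if the failure described by `node` follows from `observed`.

    A node is either a leaf (a string naming a monitored condition) or a
    tuple (gate, children) where gate is "AND" or "OR".
    """
    if isinstance(node, str):
        return node in observed
    gate, children = node
    results = [evaluate(child, observed) for child in children]
    if gate == "AND":
        return all(results)
    if gate == "OR":
        return any(results)
    raise ValueError(f"unknown gate: {gate!r}")


# Hypothetical, simplified tree for "SQL Server Not Available".
SQL_SERVER_NOT_AVAILABLE = ("OR", [
    ("AND", ["logs were not truncated", "T: drive at 0 bytes free"]),
    ("AND", ["C: drive at 0 bytes free", "available RAM at 0 bytes free"]),
])

if __name__ == "__main__":
    observed = {"logs were not truncated", "T: drive at 0 bytes free"}
    print(evaluate(SQL_SERVER_NOT_AVAILABLE, observed))
```

If a branch under the OR has no monitored leaf, `evaluate` can never report that failure mode, which is the gap the slide's requirement is meant to close.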
  • 67. FMEA MATRIX (IMPACT CALCULATION)
      [Scale 1, when the failure is detected]
      •  Very high (1-2): during the design phase
      •  High (3-4): during peer review or unit testing
      •  Moderate (5-6): during system testing or acceptance testing
      •  Remote (7-8): during or immediately after production deployment
      •  Very Remote (9-10): only after heavy usage by users
      [Scale 2, effect on the system]
      •  Negligible (1-2): no loss in functionality, mostly cosmetic
      •  Marginal (3-4): temporary interruptions or the degradation lasts for a brief period of time
      •  Critical (5-6): the problem will not resolve itself but a workaround exists allowing the problem to be bypassed
      •  Serious (7-8): the problem will not resolve itself and no workaround is possible. Functionality is impaired or lost but the system is usable to some extent
      •  Catastrophic (9-10): the system is completely unusable
      [Scale 3, how often the failure occurs]
      •  Improbable (1-2): less than 1 time per year
      •  Remote (3-4): 1 time per year
      •  Occasional (5-6): 1 time per month
      •  Probable (7-8): 1 time per day
      •  Chronic (9-10): 1 or more times per day
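The slide gives three 1-10 scales but not how they are combined. A common convention in FMEA is the risk priority number, the product of the three ratings; the sketch below assumes that convention, which may differ from the calculation used in the talk. The failure modes and their ratings are hypothetical.

```python
def risk_priority_number(severity: int, occurrence: int, detection: int) -> int:
    """Conventional FMEA risk priority number: the product of three 1-10 ratings."""
    for name, value in (("severity", severity),
                        ("occurrence", occurrence),
                        ("detection", detection)):
        if not 1 <= value <= 10:
            raise ValueError(f"{name} must be between 1 and 10, got {value}")
    return severity * occurrence * detection


# Hypothetical ratings for two failure modes: (severity, occurrence, detection).
FAILURE_MODES = {
    "transaction log unable to grow": (9, 3, 7),
    "SQL unable to cache query results": (5, 5, 5),
}


def ranked(failure_modes):
    """Return failure mode names ordered from highest to lowest RPN."""
    return sorted(failure_modes,
                  key=lambda name: risk_priority_number(*failure_modes[name]),
                  reverse=True)


if __name__ == "__main__":
    for name in ranked(FAILURE_MODES):
        print(name, risk_priority_number(*FAILURE_MODES[name]))
```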
  • 68. FMEA MATRIX (EVIDENCE): These are the events that help us RULE OUT the failure mode as not relevant. These are the events that help us to RULE IN a failure mode as a possible cause.
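Ruling failure modes in or out from observed events is a set comparison. The sketch below is a minimal illustration under stated assumptions: the data layout, the event strings, and the precedence rule (an observed rule-out event wins over an observed rule-in event) are all hypothetical choices, not taken from the talk.

```python
def classify_failure_modes(failure_modes, observed_events):
    """Sort failure modes into ruled out, ruled in, or still undetermined.

    `failure_modes` maps a name to {"indications": set, "contraindications": set}.
    Assumption: any observed contraindication rules the mode out.
    """
    observed = set(observed_events)
    result = {"ruled_out": [], "ruled_in": [], "undetermined": []}
    for name, evidence in failure_modes.items():
        if observed & evidence["contraindications"]:
            result["ruled_out"].append(name)
        elif observed & evidence["indications"]:
            result["ruled_in"].append(name)
        else:
            result["undetermined"].append(name)
    return result


# Hypothetical evidence table for two failure modes.
FAILURE_MODES = {
    "transaction log unable to grow": {
        "indications": {"T: drive at 0 bytes free"},
        "contraindications": {"log truncation succeeded"},
    },
    "SQL unable to cache query results": {
        "indications": {"available RAM at 0 bytes free"},
        "contraindications": set(),
    },
}

if __name__ == "__main__":
    print(classify_failure_modes(FAILURE_MODES, ["T: drive at 0 bytes free"]))
```

The "undetermined" bucket is the list of failure modes that still need someone to investigate.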
  • 69. HOW TO DETERMINE EVENT SEVERITY
      Six Levels of Severity
      •  The event severity is determined with respect to the component generating the event
      •  The event severity does not consider impact or urgency
      •  The incident priority is not determined by event severity
      •  The event severity helps drive an effective triage when multiple events arrive at approximately the same time
      •  Only after the affected components and their relationships to each other have been determined can impact and urgency be determined
      Severity: Description
      •  Critical: The component has completely failed
      •  Major: The component is operating but is in a degraded or crippled state
      •  Minor: The component is functioning normally but is at risk of a more serious failure
      •  Informational: The component is functioning normally but is reporting a change in state
      •  Unknown: The component has changed its operating state but the effect is not known
      •  Clear: The component is operating normally or a higher severity event has been resolved
      [Diagram labels: Logical Server, Physical Server, Virtual Machine 1, Virtual Machine 2, Server Logical Volumes, Volume Group 1, Volume Group 2, Physical Volumes, Hard Drive 1, Hard Drive 2, Hard Drive 3]
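The six severity levels lend themselves to an ordered enumeration that drives triage when several events arrive together. In this minimal sketch the numeric ranks follow the order in which the slide lists the levels, which is an assumption (in particular, where Unknown should rank is a policy choice), and the `Event` fields are hypothetical.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    # Ranks follow the slide's listing order; the numbers are an assumption.
    CLEAR = 0
    UNKNOWN = 1
    INFORMATIONAL = 2
    MINOR = 3
    MAJOR = 4
    CRITICAL = 5


@dataclass(frozen=True)
class Event:
    component: str       # the component that generated the event
    severity: Severity   # judged for that component only, not business impact
    received_at: float   # arrival time in seconds


def triage_order(events):
    """Most severe first; ties are broken by arrival time."""
    return sorted(events, key=lambda e: (-e.severity, e.received_at))


if __name__ == "__main__":
    events = [
        Event("Hard Drive 1", Severity.MINOR, 1.0),
        Event("Volume Group 1", Severity.CRITICAL, 2.0),
        Event("Virtual Machine 1", Severity.MAJOR, 1.5),
    ]
    for event in triage_order(events):
        print(event.severity.name, event.component)
```

Incident priority is deliberately absent: per the slide, impact and urgency come later, once the affected components and their relationships are known.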
  • 70. MONITORING BASED ON PATTERNS
      Layers of Pre-Defined Monitoring Patterns
      •  The OS template is deployed when the server is provisioned
      •  As a server is customized to fit its role, additional templates are deployed
      •  Templates are stacked on top of each other until no gaps remain
      •  This approach provides a high degree of standardization without sacrificing the ability to develop a custom solution
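Stacking templates until no gaps remain can be pictured as merging dictionaries of checks and comparing the result with what the server's role requires. This is a minimal sketch with hypothetical template contents and check names; it is not the format of any particular monitoring product.

```python
def stack_templates(*templates):
    """Merge templates in order; a later layer overrides an earlier one."""
    merged = {}
    for template in templates:
        merged.update(template)
    return merged


def remaining_gaps(required_checks, deployed):
    """Return the required checks that no deployed template provides."""
    return sorted(set(required_checks) - set(deployed))


# Hypothetical templates: check name -> threshold.
OS_TEMPLATE = {"cpu_utilization_pct": 90, "filesystem_free_pct": 10}
DATABASE_TEMPLATE = {"transaction_log_free_pct": 15, "filesystem_free_pct": 20}

# Hypothetical requirements for a database server role.
REQUIRED = ["cpu_utilization_pct", "filesystem_free_pct",
            "transaction_log_free_pct", "replication_lag_seconds"]

if __name__ == "__main__":
    deployed = stack_templates(OS_TEMPLATE, DATABASE_TEMPLATE)
    print(deployed)
    print(remaining_gaps(REQUIRED, deployed))
```

A non-empty gap list means another template (or a custom check) is still needed for that role.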
  • 71. APPLICATION-TECHNOLOGY MATRIX
      Maps services, applications and technologies, enabling:
      •  Monitoring investment prioritization
      •  Monitoring maturity
      •  Which templates need to be deployed when new hardware is acquired
      •  Whether a service has sufficient monitoring coverage based on its application components
      •  This approach allows for anticipating changes to a customer’s monitoring needs
      Scores indicate: 0 – No Strategy; 1 – Limited Monitoring; 2 – Fully Integrated Strategy
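The coverage question on slide 71 (does a service have sufficient monitoring given its components?) can be answered by joining two small tables. The sketch below uses the slide's 0/1/2 scores; the service names, technologies, and the rule that anything below 2 counts as a weak spot are hypothetical.

```python
SCORE_LABELS = {
    0: "No Strategy",
    1: "Limited Monitoring",
    2: "Fully Integrated Strategy",
}


def weak_spots(service_components, technology_scores, minimum=2):
    """For each service, list its technologies scoring below `minimum`.

    Technologies missing from `technology_scores` are treated as 0.
    """
    return {
        service: sorted(t for t in technologies
                        if technology_scores.get(t, 0) < minimum)
        for service, technologies in service_components.items()
    }


# Hypothetical matrix inputs.
SERVICE_COMPONENTS = {
    "Internet Sales": ["Web Server", "Middleware", "Database"],
    "Claims": ["Mainframe", "Database"],
}
TECHNOLOGY_SCORES = {"Web Server": 2, "Middleware": 1, "Database": 2}

if __name__ == "__main__":
    for service, weak in weak_spots(SERVICE_COMPONENTS, TECHNOLOGY_SCORES).items():
        print(service, [(t, SCORE_LABELS[TECHNOLOGY_SCORES.get(t, 0)]) for t in weak])
```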
  • 72. EVENT LIFECYCLE EXAMPLE: Software-Operating System
      [Two-column table, Activity and Responsible Tool; the row pairing is not recoverable from the extracted text]
      •  Activities: Data Collection, Anomaly Detection, Event Generation, Integration, Event Processing, Enrichment, Event Suppression, Correlation, Root Cause Analysis, Business Impact Analysis, Automation, Notification & Escalation, Presentation, Archiving, Reporting
      •  Responsible Tools: Element Manager, Distributed Collectors, Object Server Triggers, Impact Policies, ITNM RCA Engine, Gateway Replication, Webtop Event List, Trigger Ticket Request, Create Ticket, Update Event with IM#, Trigger Courtesy Pages, Send Pages, User Interaction Tools
  • 73. AGGREGATION AND ANALYSIS OVERVIEW
      [Architecture diagram labels]
      •  Event sources: LOB Managed Monitoring System, Service Provider Managed Monitoring System, Vendor Managed Monitoring System, Element Managers, Distributed Collectors, Business and Enterprise Telemetry Data
      •  Processing: Common Event Format, Correlation and Enrichment, Event Suppression, Root Cause Analysis, Business Impact Analysis, Predictive Analysis, Topology And Relationship Database
      •  Outputs: Notification and Escalation, Notification Tool, Automated Action, Automated Action Tools, Archive and Report, Visualization Framework, Service Center
      •  Meta-Data Integration Bus: Document Sharing, Service Desk, Batch Scheduling, Knowledge Database, Online Run Book, PBX/Call Manager, CMDB, Automated Security Provisioning System, Management
      •  Labels whose pairing is unclear in the extracted text: Automated, Other, Change, Enterprise, Reconciliation, Data
  • 74. How do we keep it evolving?
  • 75. FACILITATING PRODUCTION ASSURANCE
      §  CritSits
          §  Start the CritSit meeting and provide an accounting of all the potential failure modes, which have been successfully ruled out, and which need to be investigated
          §  Include other potential failure modes into the KT matrix
      §  Problem Management
          §  Document the causal elements as new failure modes
          §  Disseminate new failure modes to Architecture, ESM and the Command Center
      §  Reporting
          §  Produce a monthly newsletter to application owners with the list of failure modes they should discuss with their architects
          §  Incorporate failure modes into “Fault Line” analysis
  • 76. DURING THE DESIGN PROCESS
      •  Architects
          •  Certify that designs do not contain the known failure modes or document that the failure mode does not present an unacceptable risk
          •  Document the requirements for Solution Architects to follow to ensure the mitigation strategies are implemented completely
      •  Developers
          •  Certify that designs do not contain the known failure modes or document that the failure mode does not present an unacceptable risk
          •  Certify the designs implement the mitigation strategies correctly
  • 77. IMPROVING ENTERPRISE TOOLS
      §  Systems Management
          §  Develop new monitoring requirements using the documented indications and contraindications
      §  Event Management
          §  Develop new correlations tying indications and contraindications to failure modes to assist in ruling out or ruling in those “in play” more efficiently
      §  Configuration Management
          §  Develop new discovery patterns using the documented indications and contraindications
          §  Develop automations to detect the presence of failure mode conditions and generate an event to the Event Management System
  • 78. DURING SERVICE SUPPORT
      •  Command Centers and Support Teams
          –  Use the failure modes to rule out potential failure modes
          –  Each failure mode will have a documented process to follow to mitigate the impact once a failure mode is identified
      •  Incident Managers
          –  Start bridge calls and provide an accounting of all the potential failure modes, which have been successfully ruled out, and which need to be investigated
          –  Coordinate the investigation assignments and consolidate the investigation results
  • 79. LET’S KEEP THE CONVERSATION GOING… Andrew.P.White@Gmail.com, @SystemsMgmtZen, SystemsManagementZen.Wordpress.com, systemsmanagementzen.wordpress.com/feed/, ReverendDrew, 614-306-3434