Top-Down Approach to Monitoring
July 30, 2015
1996
Tivoli Software acquired by IBM
Patrol Software acquired by BMC
Ethan Galstad creates a simple MS-DOS application designed to "ping" Novell NetWare servers
“HOW to monitor?” is the primary question
2015
https://www.bigpanda.io/monitoringscape/
Shifting from “How?” to “What?”
Bottom-Up Approach
Network Servers Apps
Overall System Health
Problem #1: Inflation of Tools
Problem #2: Inflation of “Whats”
Problem #3: Inflation of Alerts
We’re trying to answer a simple question:
Is our system in a healthy state?
No Alerts ≠ Healthy System
Many Alerts ≠ Unhealthy System
Healthy System = A system that continuously generates value for its users under a well-known set of KPIs
Top-Down Approach
KPIs UX
Overall System Health
Top-Down vs Bottom-Up
Top-Down (KPIs, UX):
• Selective
• Proactive
Bottom-Up (Network, Servers, Apps):
• Exhaustive
• Reactive
Definition of KPI
A key performance indicator (KPI) is a business metric used to evaluate factors that are crucial to the success of an organization. KPIs differ per organization.
Let’s play a game!
This is Sam
CPU Utilization | # Clicks on a button | Temperature
What does Sam’s company do?
The Pulse of Netflix
http://techblog.netflix.com/2015/02/sps-pulse-of-netflix-streaming.html
We sought out a single indicator that closely approximated our most important activity: viewing. We discovered that a server-side metric related to playback starts (the act of “clicking play”) had both a predictable pattern and fluctuated significantly when UI/device/server problems were happening. The Netflix streaming pulse was created. We named it “SPS” for “starts per second”.
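A minimal sketch of the idea behind a "starts per second" style metric, not Netflix's actual implementation: count start events per second, then flag seconds whose count strays too far from an expected value. The function names, the sample timestamps, and the 30% tolerance are all assumptions made for illustration.

```python
from collections import Counter


def starts_per_second(start_timestamps):
    """Count start events per whole second (epoch second -> number of starts)."""
    return Counter(int(ts) for ts in start_timestamps)


def deviates(observed_sps, expected_sps, tolerance=0.3):
    """Return True when observed SPS differs from the expected SPS by more
    than `tolerance`, expressed as a fraction of the expected value.
    The 0.3 default is a hypothetical threshold, not a recommended one."""
    if expected_sps <= 0:
        return observed_sps > 0
    return abs(observed_sps - expected_sps) / expected_sps > tolerance


# Hypothetical "clicking play" events, in seconds since the epoch.
events = [100.1, 100.4, 100.9, 101.2, 101.3, 102.0]
sps = starts_per_second(events)

# Compare each second against a hypothetical expected rate of 3 starts/second.
alerts = {second: deviates(count, expected_sps=3) for second, count in sps.items()}
```

In practice the expected value would come from the metric's own predictable pattern (for example, the same time of day in previous weeks) rather than a constant.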
Healthy SPS Pattern
http://techblog.netflix.com/2015/02/sps-pulse-of-netflix-streaming.html
Unhealthy SPS Pattern
http://techblog.netflix.com/2015/02/sps-pulse-of-netflix-streaming.html
What’s so special about SPS?
• SPS is easy for all stakeholders to understand
• One metric that covers different points of failure: server problems, device problems, etc.
• Most important: it’s a clear KPI that indicates when user experience is compromised
But what about root cause analysis?
KPIs UX
Overall System Health
Network Servers Apps
GitHub: need for speed
https://github.com/blog/1252-how-we-keep-github-fast
The most important factor in web application design is responsiveness. And the first step toward responsiveness is speed. But speed within a web application is complicated.
Start from the Top: Response Times Dashboard
https://github.com/blog/1252-how-we-keep-github-fast
• Each row represents a different major component
• Clicking one of the rows allows you to dive in and see the mean, 98th percentile, and 99.9th percentile response times
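As a rough illustration of the three numbers such a dashboard row shows, the sketch below computes the mean, 98th percentile, and 99.9th percentile for each component. It uses the nearest-rank percentile definition, which is an assumption; the source does not say how GitHub computes its percentiles. The component name and sample data are hypothetical.

```python
import math
from fractions import Fraction
from statistics import mean


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    ordered = sorted(values)
    # Fraction keeps the rank arithmetic exact (e.g. 99.9% of 1000 is 999).
    rank = max(1, math.ceil(Fraction(str(pct)) / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(response_times_ms):
    """Summary statistics for one dashboard row."""
    return {
        "mean": mean(response_times_ms),
        "p98": percentile(response_times_ms, 98),
        "p99.9": percentile(response_times_ms, 99.9),
    }


# Hypothetical response times (ms) for one hypothetical component.
samples = {"hypothetical-component": list(range(1, 1001))}
dashboard = {name: summarize(times) for name, times in samples.items()}
```

High percentiles matter here because a mean can look healthy while a small share of requests is very slow.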
Digging Deeper: Mission Control Bar
https://github.com/blog/1252-how-we-keep-github-fast
Total Time | Render Time | Cache & Database | JS & CSS Size
And Deeper
https://github.com/blog/1252-how-we-keep-github-fast
Render Breakdown
SQL Query Viewer
Why talk about BigPanda?
Because Pandas are awesome!
BigPanda
Because...
• We’re not Netflix or GitHub: we’re a growing startup (7 devs, 1 full-time Ops)
• We feel the pain!
• Our KPIs are easy to describe and understand (especially if you’re an Ops person)
BigPanda
As a unified dashboard on top of all your monitoring systems, and eventually a single point of truth for production incidents, our data pipeline has to be reliable and fast.
KPI: Low data pipeline latency
Pipeline Latency Metric
• Metrics are sent from within the apps
• Stored in Graphite
• Sum of all the average latencies of all alerts that went through the pipeline
• Monitored by Nagios
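The flow above can be sketched roughly as follows. This is one possible reading of the metric (the sum of each stage's average latency), not BigPanda's actual code. The stage names, metric path, and thresholds are hypothetical. The line format follows Graphite's plaintext protocol (`<path> <value> <timestamp>`), and the status values follow the standard Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL).

```python
import time


def pipeline_latency(stage_latencies_ms):
    """Sum the average latency of every pipeline stage.
    Assumption: each stage reports a list of per-alert latencies in ms."""
    return sum(
        sum(samples) / len(samples)
        for samples in stage_latencies_ms.values()
        if samples
    )


def graphite_line(path, value, timestamp=None):
    """Format one datapoint in Graphite's plaintext protocol."""
    ts = int(time.time()) if timestamp is None else int(timestamp)
    return f"{path} {value} {ts}\n"


def nagios_status(latency_ms, warn_ms, crit_ms):
    """Map a latency to a Nagios plugin exit code."""
    if latency_ms >= crit_ms:
        return 2  # CRITICAL
    if latency_ms >= warn_ms:
        return 1  # WARNING
    return 0  # OK


# Hypothetical stages and per-alert latencies (ms).
stages = {"ingest": [10, 20], "enrich": [30, 50], "notify": [5, 15]}
latency = pipeline_latency(stages)  # averages: 15 + 40 + 10
line = graphite_line("pipeline.latency.total", latency, timestamp=1438214400)
status = nagios_status(latency, warn_ms=50, crit_ms=100)
```

Sending `line` to Graphite would typically mean writing it to the Carbon plaintext listener over TCP (port 2003 by default); that step is left out so the sketch runs without a server.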
Pipeline Latency Metric
• Very good indicator of possible service outage
• Must-have for detection of SLA violations
• Very good indicator of performance bottlenecks (can be broken down into sub-pipelines, specific organizations, etc.)
• Simple and high-level: easy to explain to non-technical stakeholders (e.g. sales)
TL;DR
• The bottom-up approach (“monitor all the things”) is easier to start with, but soon leads to alert fatigue and disorientation.
• The top-down approach requires thought and custom instrumentation, but keeps you focused on what’s important.
• High-level metrics can be complemented by low-level metrics. Trying to deduce the former from the latter is futile.
• Take advantage of the rich monitoring landscape, but as a means to an end. Don’t let the tools dictate what you need to measure.
• Monitoring is, first of all, about your business.
Questions?
Thanks!

StatsCraft 2015: Top down approach to monitoring - Shahar Kedar