SRE Demystified
Practical Alerting
ganesh@ganeshniyer.com
ganesh.vigneswara@gmail.com
http://ganeshniyer.com
Dr Ganesh Neelakanta Iyer
SRE
Monitoring
• Monitoring a very large system is challenging for two reasons:
  • The sheer number of components being analyzed
  • The need to keep the maintenance burden on the engineers responsible for the system reasonably low
• A large system should be designed to aggregate signals and prune outliers
• Monitoring systems must allow alerting on high-level service objectives while retaining the granularity to inspect individual components as needed
https://landing.google.com/sre/sre-book/chapters/practical-alerting/
Borgmon monitoring at Google
• White-box monitoring
• Instead of executing custom scripts to detect system failures,
Borgmon relies on a common data exposition format
• This enables mass data collection with low overhead and avoids
the costs of subprocess execution and network connection setup
• The data is used both for rendering charts and creating
alerts, which are accomplished using simple arithmetic
• To facilitate mass collection, the metrics format had to be
standardized
Instrumentation of applications
• Applications can export map-valued variables, for example
• An example map-valued variable, showing 25 HTTP 200 responses and 12 HTTP 500s:
• http_responses map:code 200:25 404:0 500:12
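As a rough illustration of how a collector might read a line like the one above, here is a minimal Python sketch. The parsing rules (whitespace-separated fields, a `map:` prefix naming the label key, `value:count` pairs) are inferred only from this single example line, and `parse_map_variable` is a hypothetical helper, not Borgmon's actual parser.

```python
def parse_map_variable(line):
    """Parse one map-valued line such as
    'http_responses map:code 200:25 404:0 500:12'.

    Returns (variable name, label key, {label value: count}).
    The format is assumed from the single example on the slide.
    """
    name, kind, *pairs = line.split()
    if not kind.startswith("map:"):
        raise ValueError(f"not a map-valued variable: {line!r}")
    label_key = kind[len("map:"):]
    values = {}
    for pair in pairs:
        # Each pair looks like '200:25' (label value, then count).
        label_value, _, count = pair.partition(":")
        values[label_value] = int(count)
    return name, label_key, values


name, label_key, values = parse_map_variable(
    "http_responses map:code 200:25 404:0 500:12"
)
print(name, label_key, values)
```

Each entry in the map can then be treated as its own counter, distinguished by the `code` label.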
Storage in the Time-Series Arena
• A service is typically made up of many binaries running as
many tasks, on many machines, in many clusters
• Borgmon needs to keep all that data organized, while allowing
flexible querying and slicing of that data
• Borgmon stores all the data in an in-memory database,
regularly checkpointed to disk
• Data points have the form (timestamp, value) and are stored in
chronological lists called time-series
• Each time-series is named by a unique set of labels of the
form name=value
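The storage model described above can be sketched in a few lines of Python. This is a toy under stated assumptions, not Borgmon's implementation: the class name `TimeSeriesArena` and its methods are hypothetical, points are assumed to arrive in chronological order, and checkpointing to disk is left out.

```python
import time
from collections import defaultdict


class TimeSeriesArena:
    """Toy in-memory store: one chronological list per unique label set."""

    def __init__(self):
        # frozenset of (name, value) label pairs -> list of (timestamp, value)
        self._series = defaultdict(list)

    def append(self, labels, value, timestamp=None):
        """Add one data point to the series named by `labels`."""
        key = frozenset(labels.items())
        ts = time.time() if timestamp is None else timestamp
        self._series[key].append((ts, value))

    def get(self, labels):
        """Return a copy of the series for exactly this label set."""
        return list(self._series.get(frozenset(labels.items()), []))


arena = TimeSeriesArena()
labels = {"var": "http_requests", "job": "webserver", "instance": "host0:80"}
arena.append(labels, 10, timestamp=100)
arena.append(labels, 15, timestamp=110)
print(arena.get(labels))  # [(100, 10), (110, 15)]
```

Using a frozenset of label pairs as the key means the order in which labels are written does not change the identity of the series.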
Storage in the Time-Series Arena
[Figure: a time-series for errors, labeled by the original host each was collected from]
Labels and Vectors
• Time-series are stored as sequences of numbers and
timestamps, which are referred to as vectors
• Like vectors in linear algebra, these vectors are slices and cross-sections of
the multidimensional matrix of data points in the arena
• The name of a time-series is a labelset, because it’s implemented
as a set of labels expressed as key=value pairs. One of these
labels is the variable name itself, the key that appears on the varz
page
Labels and Vectors
• Example variable expression
{var=http_requests,job=webserver,instance=host0:80,service=web,zone=us-west}
Label    Value
var      The name of the variable
job      The name given to the type of server being monitored
service  A loosely defined collection of jobs that provide a service to users, either internal or external
zone     Location of the Borgmon that performed the collection of this variable
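To make the idea of querying by labels concrete, the following Python sketch selects every time-series whose label set contains a given subset of labels. The helper names `select` and `labelset`, the extra hosts, and the sample values are all hypothetical; only the label names come from the example expression above.

```python
def labelset(**labels):
    """Build a hashable label set from keyword arguments."""
    return frozenset(labels.items())


def select(series, **wanted):
    """Return the series whose label set contains every wanted label."""
    return {
        labels: points
        for labels, points in series.items()
        if set(wanted.items()) <= set(labels)
    }


# Three illustrative series; values and extra hosts are made up.
series = {
    labelset(var="http_requests", job="webserver", instance="host0:80",
             service="web", zone="us-west"): [(100, 10)],
    labelset(var="http_requests", job="webserver", instance="host1:80",
             service="web", zone="us-west"): [(100, 7)],
    labelset(var="http_requests", job="database", instance="host2:80",
             service="web", zone="us-west"): [(100, 3)],
}

matches = select(series, var="http_requests", job="webserver")
print(sorted(dict(labels)["instance"] for labels in matches))
```

Leaving out a label such as `instance` widens the query to every task of the job, which is the kind of slicing the label scheme is meant to allow.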
Rule Evaluation
• The Borgmon program code, also known as Borgmon
rules, consists of simple algebraic expressions that
compute time-series from other time-series
• Rules run in a parallel threadpool where possible, but are
dependent on ordering when using previously defined
rules as input
• Aggregation is the cornerstone of rule evaluation in a
distributed environment
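The points above can be illustrated with a small Python sketch: one rule derives a per-task rate from a counter, and a second rule, which depends on the first, sums those rates across tasks by dropping the `instance` label. This is not Borgmon rule syntax. The function names, label sets, and sample counters are assumptions for illustration, and counter resets are ignored.

```python
from collections import defaultdict


def rate(points):
    """Per-second increase between the first and last (timestamp, value)."""
    (t0, v0), (t1, v1) = points[0], points[-1]
    return (v1 - v0) / (t1 - t0)


def sum_without(rates, label):
    """Sum values across series after dropping one label from each label set."""
    totals = defaultdict(float)
    for labels, value in rates.items():
        reduced = frozenset((k, v) for k, v in labels if k != label)
        totals[reduced] += value
    return dict(totals)


# Hypothetical request counters sampled over 600 seconds on two tasks.
counters = {
    frozenset({("job", "webserver"), ("instance", "host0:80")}): [(0, 0), (600, 1200)],
    frozenset({("job", "webserver"), ("instance", "host1:80")}): [(0, 0), (600, 600)],
}

# Rule 1: per-task rate. Rule 2 uses rule 1's output, so order matters.
task_rates = {labels: rate(points) for labels, points in counters.items()}
cluster_rates = sum_without(task_rates, "instance")
print(cluster_rates)
```

The per-task rates are independent of each other and could be computed in parallel, while the aggregate has to wait for them.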
Example Rule
Example Alert Rule
• Creates an alert when the error ratio over 10 minutes exceeds
1% and the total number of errors exceeds 1 per second
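The condition described above can be written as a simple predicate. The Python sketch below is only an illustration of the logic, not the Borgmon rule itself: `should_alert` and its parameters are hypothetical names, and both inputs are assumed to be per-second rates already averaged over the 10-minute window.

```python
def should_alert(errors_per_sec, requests_per_sec,
                 max_ratio=0.01, min_errors_per_sec=1.0):
    """True when the error ratio exceeds 1% AND errors exceed 1 per second."""
    if requests_per_sec <= 0:
        # No traffic: the ratio is undefined, so do not alert.
        return False
    ratio = errors_per_sec / requests_per_sec
    return ratio > max_ratio and errors_per_sec > min_errors_per_sec


print(should_alert(errors_per_sec=5.0, requests_per_sec=200.0))   # True: 2.5% and 5/s
print(should_alert(errors_per_sec=0.5, requests_per_sec=10.0))    # False: 5% but only 0.5/s
print(should_alert(errors_per_sec=2.0, requests_per_sec=1000.0))  # False: 2/s but only 0.2%
```

Requiring both conditions keeps a low-traffic service from alerting on a handful of errors that happen to make up a large ratio.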
Maintaining the configuration
• Borgmon configuration separates the definition of the rules
from the targets being monitored
• Borgmon also supports language templates, which fall into two classes
• The first class codifies the emergent schema of
variables exported from a given library of code
• Such templates exist for the HTTP server library, memory
allocation, and the storage client library
• The second class manages the aggregation of data
from a single-server task up to the global service footprint
How to use Firebase Data Connect For Flutter
Daiki Mogmet Ito
 

Kürzlich hochgeladen (20)

Nordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptxNordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptx
 
UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6
 
HCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAUHCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAU
 
Driving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success StoryDriving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success Story
 
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
 
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
 
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development ProvidersYour One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
 
20240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 202420240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 2024
 
Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024
 
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfUnlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
 
Serial Arm Control in Real Time Presentation
Serial Arm Control in Real Time PresentationSerial Arm Control in Real Time Presentation
Serial Arm Control in Real Time Presentation
 
Webinar: Designing a schema for a Data Warehouse
Webinar: Designing a schema for a Data WarehouseWebinar: Designing a schema for a Data Warehouse
Webinar: Designing a schema for a Data Warehouse
 
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with SlackLet's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
 
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
 
Presentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of GermanyPresentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of Germany
 
Choosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptxChoosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptx
 
Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
 
How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
 
GenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizationsGenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizations
 
How to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For FlutterHow to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For Flutter
 

SRE Demystified - 07 - Practical Alerting

  • 3. Monitoring
      • Monitoring a very large system is challenging for a couple of reasons:
          • The sheer number of components being analyzed
          • The need to maintain a reasonably low maintenance burden on the engineers responsible for the system
      • A large system should be designed to aggregate signals and prune outliers
      • We need monitoring systems that allow us to alert for high-level service objectives, but retain the granularity to inspect individual components as needed
      Source: https://landing.google.com/sre/sre-book/chapters/practical-alerting/
  • 4. Borgmon monitoring at Google
      • White-box monitoring
      • Instead of executing custom scripts to detect system failures, Borgmon relies on a common data exposition format
      • This enables mass data collection with low overheads and avoids the costs of subprocess execution and network connection setup
      • The data is used both for rendering charts and creating alerts, which are accomplished using simple arithmetic
      • To facilitate mass collection, the metrics format had to be standardized
  • 5. Instrumentation of applications
      • Adding mapped variables, for example
      • An example map-valued variable, showing 25 HTTP 200 responses and 12 HTTP 500s:
          http_responses map:code 200:25 404:0 500:12
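As a rough illustration of how a collector might read such a line, here is a minimal Python sketch. The function name parse_map_variable is hypothetical, and the parsing rules are inferred only from the single example line on the slide; this is not Borgmon's actual parser.

```python
def parse_map_variable(line):
    """Parse a map-valued variable line into (name, label_key, values).

    Assumed layout, taken from the slide's example only:
        <name> map:<label_key> <label_value>:<count> ...
    """
    name, kind, *pairs = line.split()
    prefix, _, label_key = kind.partition(":")
    if prefix != "map" or not label_key:
        raise ValueError(f"not a map-valued variable: {line!r}")
    values = {}
    for pair in pairs:
        # Each pair looks like "200:25": label value, then the counter.
        label_value, _, raw_count = pair.partition(":")
        values[label_value] = int(raw_count)
    return name, label_key, values


name, label_key, values = parse_map_variable(
    "http_responses map:code 200:25 404:0 500:12"
)
print(name, label_key, values)
# prints: http_responses code {'200': 25, '404': 0, '500': 12}
```

Each entry of the map becomes its own counter, distinguished by the value of the code label.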
  • 6. Storage in the Time-Series Arena
      • A service is typically made up of many binaries running as many tasks, on many machines, in many clusters
      • Borgmon needs to keep all that data organized, while allowing flexible querying and slicing of that data
      • Borgmon stores all the data in an in-memory database, regularly checkpointed to disk
      • The data points have the form (timestamp, value) and are stored in chronological lists called time-series; each time-series is named by a unique set of labels of the form name=value
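The storage model described above can be sketched in a few lines of Python. This is a toy under stated assumptions: the class name Arena and its methods append and select are hypothetical, and checkpointing to disk is left out entirely.

```python
from collections import defaultdict


class Arena:
    """Toy in-memory store: one chronological list of points per labelset."""

    def __init__(self):
        self._series = defaultdict(list)

    def append(self, labels, timestamp, value):
        # A frozenset of (name, value) pairs is hashable and order-independent,
        # so it can act as the unique name of a time-series.
        self._series[frozenset(labels.items())].append((timestamp, value))

    def select(self, **matchers):
        """Return every series whose labels include all given name=value pairs."""
        wanted = set(matchers.items())
        return {
            key: points
            for key, points in self._series.items()
            if wanted <= key
        }


arena = Arena()
arena.append({"var": "http_errors", "instance": "host0:80"}, 1000, 0)
arena.append({"var": "http_errors", "instance": "host0:80"}, 1010, 2)
arena.append({"var": "http_errors", "instance": "host1:80"}, 1000, 1)

# Slice by variable name only: both hosts match.
print(len(arena.select(var="http_errors")))
# Slice down to a single host: one series with two points.
print(list(arena.select(instance="host0:80").values()))
```

Keying each series by its full labelset is what makes slicing flexible: a query supplies any subset of labels and gets back every series that carries them.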
  • 7. Storage in the Time-Series Arena
      [Figure: a time-series for errors, labeled by the original host each was collected from]
  • 8. Labels and Vectors
      • Time-series are stored as sequences of numbers and timestamps, which are referred to as vectors
      • Like vectors in linear algebra, these vectors are slices and cross-sections of the multidimensional matrix of data points in the arena
      • The name of a time-series is a labelset, because it’s implemented as a set of labels expressed as key=value pairs. One of these labels is the variable name itself, the key that appears on the varz page
  • 9. Labels and Vectors
      • Example variable expression:
          {var=http_requests,job=webserver,instance=host0:80,service=web,zone=us-west}

      Label     Value
      var       The name of the variable
      job       The name given to the type of server being monitored
      service   A loosely defined collection of jobs that provide a service to users, either internal or external
      zone      Location of the Borgmon that performed the collection of this variable
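To make the key=value structure of the variable expression concrete, the following Python sketch splits the example into a dictionary. The helper name parse_labelset is hypothetical, and the grammar is inferred from the one example above (no quoting or escaping is handled).

```python
def parse_labelset(expression):
    """Split '{k1=v1,k2=v2,...}' into a dict of label names to values."""
    text = expression.strip()
    if not (text.startswith("{") and text.endswith("}")):
        raise ValueError(f"expected braces around labelset: {expression!r}")
    labels = {}
    for pair in text[1:-1].split(","):
        key, separator, value = pair.partition("=")
        if not separator:
            raise ValueError(f"expected key=value, got {pair!r}")
        labels[key.strip()] = value.strip()
    return labels


labels = parse_labelset(
    "{var=http_requests,job=webserver,instance=host0:80,service=web,zone=us-west}"
)
print(labels["var"], labels["instance"])
# prints: http_requests host0:80
```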
  • 10. Rule Evaluation
      • The Borgmon program code, also known as Borgmon rules, consists of simple algebraic expressions that compute time-series from other time-series
      • Rules run in a parallel threadpool where possible, but are dependent on ordering when using previously defined rules as input
      • Aggregation is the cornerstone of rule evaluation in a distributed environment
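The idea of rules that compute time-series from other time-series, and then aggregate them, can be illustrated with a small Python sketch. It is not Borgmon rule syntax: the helper names rate and sum_without, the sample data, and the choice to ignore counter resets are all assumptions made for illustration.

```python
from collections import defaultdict


def rate(points):
    """Per-second increase between the first and last (timestamp, value) points.

    Assumes at least two points with distinct timestamps and no counter reset.
    """
    (t0, v0), (t1, v1) = points[0], points[-1]
    return (v1 - v0) / (t1 - t0)


def sum_without(values, dropped_label):
    """Sum per-series values after removing one label, grouping on the rest."""
    totals = defaultdict(float)
    for labels, value in values.items():
        remaining = frozenset((k, v) for k, v in labels if k != dropped_label)
        totals[remaining] += value
    return dict(totals)


# Hypothetical request counters sampled 600 seconds apart on two tasks.
series = {
    frozenset({("job", "webserver"), ("instance", "host0:80")}): [(0, 0), (600, 1200)],
    frozenset({("job", "webserver"), ("instance", "host1:80")}): [(0, 100), (600, 700)],
}

# First rule: a per-task rate computed from each raw counter series.
task_rates = {labels: rate(points) for labels, points in series.items()}

# Second rule: depends on the first, so it has to run after it.
job_rates = sum_without(task_rates, "instance")
print(job_rates[frozenset({("job", "webserver")})])
# prints: 3.0
```

The second rule takes the output of the first as its input, which is the ordering dependency mentioned above; the two per-task rates, in contrast, are independent of each other and could be computed in parallel.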
  • 12. Example Alert Rule
      • Creates an alert when the error ratio over 10 minutes exceeds 1% and the total number of errors exceeds 1 per second
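The condition described on this slide can be written out as a short Python check. This is only a sketch of the logic, not the Borgmon rule itself: the function name should_alert, its parameters, and the use of raw counts over the window are assumptions.

```python
WINDOW_SECONDS = 600  # the 10-minute window named on the slide


def should_alert(errors_in_window, requests_in_window,
                 ratio_threshold=0.01, rate_threshold=1.0):
    """True when both the error ratio and the error rate exceed their thresholds."""
    if requests_in_window == 0:
        return False  # no traffic, so no meaningful ratio
    error_ratio = errors_in_window / requests_in_window
    errors_per_second = errors_in_window / WINDOW_SECONDS
    return error_ratio > ratio_threshold and errors_per_second > rate_threshold


print(should_alert(1200, 60000))   # 2% ratio, 2 errors/s -> True
print(should_alert(5, 10))         # 50% ratio, but far below 1 error/s -> False
print(should_alert(1200, 600000))  # 2 errors/s, but only 0.2% ratio -> False
```

Requiring both conditions keeps a handful of failures on a nearly idle service from triggering the alert, even though its error ratio looks high.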
  • 13. Maintaining the configuration
      • Borgmon configuration separates the definition of the rules from the targets being monitored
      • Borgmon also supports language templates
      • The first class of templates simply codifies the emergent schema of variables exported from a given library of code
          • Such templates exist for the HTTP server library, memory allocation, and the storage client library
      • The second class of templates manages the aggregation of data from a single-server task to the global service footprint
  • 15. Dr Ganesh Neelakanta Iyer ganesh@ganeshniyer.com ganesh.vigneswara@gmail.com