SlideShare a Scribd company logo
1 of 15
Download to read offline
SRE Demystified
Practical Alerting
ganesh@ganeshniyer.com
ganesh.vigneswara@gmail.com,
http://ganeshniyer.com
Dr Ganesh Neelakanta Iyer
SRE
•
2https://image.slidesharecdn.com/devopssreatgooglescale-190121123035/95/devops-sre-at-google-scale-30-638.jpg?cb=1548074257
Monitoring
• Monitoring a very large system is challenging for a couple of
reasons:
• The sheer number of components being analyzed
• The need to maintain a reasonably low maintenance burden on the
engineers responsible for the system
• A large system should be designed to aggregate signals and
prune outliers
• We need monitoring systems that allow us to alert for high-
level service objectives, but retain the granularity to inspect
individual components as needed
3
https://landing.google.com/sre/sre-book/chapters/practical-alerting/
Borgmon monitoring at Google
• White-box monitoring
• Instead of executing custom scripts to detect system failures,
Borgmon relies on a common data exposition format
• This enables mass data collection with low overheads and avoids
the costs of subprocess execution and network connection setup
• The data is used both for rendering charts and creating
alerts, which are accomplished using simple arithmetic
• To facilitate mass collection, the metrics format had to be
standardized
4
https://landing.google.com/sre/sre-book/chapters/practical-alerting/
Instrumentation of applications
• Adding mapped variables for example
• An example map-valued variable
• Showing 25 HTTP 200 responses and 12 HTTP 500s:
• http_responses map:code 200:25 404:0 500:12
5
https://landing.google.com/sre/sre-book/chapters/practical-alerting/
Storage in the Time-Series Arena
• A service is typically made up of many binaries running as
many tasks, on many machines, in many clusters
• Borgmon needs to keep all that data organized, while allowing
flexible querying and slicing of that data
• Borgmon stores all the data in an in-memory database,
regularly checkpointed to disk
• The data points have the form (timestamp, value), and are
stored in chronological lists called time-series, and each time-
series is named by a unique set of labels, of the
form name=value.
6
https://landing.google.com/sre/sre-book/chapters/practical-alerting/
Storage in the Time-Series Arena
7
A time-series for errors labeled by the original host each was collected from
https://landing.google.com/sre/sre-book/chapters/practical-alerting/
Labels and Vectors
• Time-series are stored as sequences of numbers and
timestamps, which are referred to as vectors
• Like vectors in linear algebra, these vectors are slices and cross-sections of
the multidimensional matrix of data points in the arena
• The name of a time-series is a labelset, because it’s implemented
as a set of labels expressed as key=value pairs. One of these
labels is the variable name itself, the key that appears on the varz
page
8https://landing.google.com/sre/sre-book/chapters/practical-alerting/
Labels and Vectors
• Example variable expression
{var=http_requests,job=webserver,instance=host0:80,service=web,zone=us-west}
9
Label Value
var The name of the variable
job The name given to the type of server being monitored
service A loosely defined collection of jobs that provide a service to users,
either internal or external
zone Location of the Borgmon that performed the collection of this
variable
https://landing.google.com/sre/sre-book/chapters/practical-alerting/
Rule Evaluation
• The Borgmon program code, also known as Borgmon
rules, consists of simple algebraic expressions that
compute time-series from other time-series
• Rules run in a parallel threadpool where possible, but are
dependent on ordering when using previously defined
rules as input
• Aggregation is the cornerstone of rule evaluation in a
distributed environment
10
https://landing.google.com/sre/sre-book/chapters/practical-alerting/
Example Rule
11
https://landing.google.com/sre/sre-book/chapters/practical-alerting/
Example Alert Rule
• Creates an alert when the error ratio over 10 minutes exceeds
1% and the total number of errors exceeds 1 per second
12
https://landing.google.com/sre/sre-book/chapters/practical-alerting/
Maintaining the configuration
• Borgmon configuration separates the definition of the rules
from the targets being monitored
• Borgmon also supports language templates
• The first class simply codifies the emergent schema of
variables exported from a given library of code
• Such templates exist for the HTTP server library, memory
allocation, the storage client library
• The second class templates are to manage the aggregation
of data from a single-server task to the global service footprint
13
https://landing.google.com/sre/sre-book/chapters/practical-alerting/
References
14
Dr Ganesh Neelakanta Iyer
ganesh@ganeshniyer.com
ganesh.vigneswara@gmail.com

More Related Content

What's hot

SRE Demystified - 01 - SLO SLI and SLA
SRE Demystified - 01 - SLO SLI and SLASRE Demystified - 01 - SLO SLI and SLA
SRE Demystified - 01 - SLO SLI and SLADr Ganesh Iyer
 
Reconstructing the SRE
Reconstructing the SREReconstructing the SRE
Reconstructing the SREBob Wise
 
A Crash Course in Building Site Reliability
A Crash Course in Building Site ReliabilityA Crash Course in Building Site Reliability
A Crash Course in Building Site ReliabilityAcquia
 
Site (Service) Reliability Engineering
Site (Service) Reliability EngineeringSite (Service) Reliability Engineering
Site (Service) Reliability EngineeringMark Underwood
 
Gdb basics for my sql db as (percona live europe 2019)
Gdb basics for my sql db as (percona live europe 2019)Gdb basics for my sql db as (percona live europe 2019)
Gdb basics for my sql db as (percona live europe 2019)Valerii Kravchuk
 
How Small Team Get Ready for SRE (public version)
How Small Team Get Ready for SRE (public version)How Small Team Get Ready for SRE (public version)
How Small Team Get Ready for SRE (public version)Setyo Legowo
 
Agile at Salesforce From theory to practice, how to be agile at scale
Agile at Salesforce From theory to practice, how to be agile at scaleAgile at Salesforce From theory to practice, how to be agile at scale
Agile at Salesforce From theory to practice, how to be agile at scaleSalesforce Engineering
 
Getting started with Site Reliability Engineering (SRE)
Getting started with Site Reliability Engineering (SRE)Getting started with Site Reliability Engineering (SRE)
Getting started with Site Reliability Engineering (SRE)Abeer R
 
Site reliability engineering - Lightning Talk
Site reliability engineering - Lightning TalkSite reliability engineering - Lightning Talk
Site reliability engineering - Lightning TalkMichae Blakeney
 
Rapid Strategic SRE Assessments
Rapid Strategic SRE AssessmentsRapid Strategic SRE Assessments
Rapid Strategic SRE AssessmentsMarc Hornbeek
 
SRE for Everyone: Making Tomorrow Better Than Today
SRE for Everyone: Making Tomorrow Better Than Today SRE for Everyone: Making Tomorrow Better Than Today
SRE for Everyone: Making Tomorrow Better Than Today Rundeck
 
SRE 101 (Site Reliability Engineering)
SRE 101 (Site Reliability Engineering)SRE 101 (Site Reliability Engineering)
SRE 101 (Site Reliability Engineering)Hussain Mansoor
 
Building an SRE Organization @ Squarespace
Building an SRE Organization @ SquarespaceBuilding an SRE Organization @ Squarespace
Building an SRE Organization @ SquarespaceFranklin Angulo
 
SRE (service reliability engineer) on big DevOps platform running on the clou...
SRE (service reliability engineer) on big DevOps platform running on the clou...SRE (service reliability engineer) on big DevOps platform running on the clou...
SRE (service reliability engineer) on big DevOps platform running on the clou...DevClub_lv
 
Implementing SRE practices: SLI/SLO deep dive - David Blank-Edelman - DevOpsD...
Implementing SRE practices: SLI/SLO deep dive - David Blank-Edelman - DevOpsD...Implementing SRE practices: SLI/SLO deep dive - David Blank-Edelman - DevOpsD...
Implementing SRE practices: SLI/SLO deep dive - David Blank-Edelman - DevOpsD...DevOpsDays Tel Aviv
 
The Next Wave of Reliability Engineering
The Next Wave of Reliability EngineeringThe Next Wave of Reliability Engineering
The Next Wave of Reliability EngineeringMichael Kehoe
 

What's hot (20)

SRE Demystified - 01 - SLO SLI and SLA
SRE Demystified - 01 - SLO SLI and SLASRE Demystified - 01 - SLO SLI and SLA
SRE Demystified - 01 - SLO SLI and SLA
 
Reconstructing the SRE
Reconstructing the SREReconstructing the SRE
Reconstructing the SRE
 
A Crash Course in Building Site Reliability
A Crash Course in Building Site ReliabilityA Crash Course in Building Site Reliability
A Crash Course in Building Site Reliability
 
Sre summary
Sre summarySre summary
Sre summary
 
Site (Service) Reliability Engineering
Site (Service) Reliability EngineeringSite (Service) Reliability Engineering
Site (Service) Reliability Engineering
 
Gdb basics for my sql db as (percona live europe 2019)
Gdb basics for my sql db as (percona live europe 2019)Gdb basics for my sql db as (percona live europe 2019)
Gdb basics for my sql db as (percona live europe 2019)
 
How Small Team Get Ready for SRE (public version)
How Small Team Get Ready for SRE (public version)How Small Team Get Ready for SRE (public version)
How Small Team Get Ready for SRE (public version)
 
Agile at Salesforce From theory to practice, how to be agile at scale
Agile at Salesforce From theory to practice, how to be agile at scaleAgile at Salesforce From theory to practice, how to be agile at scale
Agile at Salesforce From theory to practice, how to be agile at scale
 
Getting started with Site Reliability Engineering (SRE)
Getting started with Site Reliability Engineering (SRE)Getting started with Site Reliability Engineering (SRE)
Getting started with Site Reliability Engineering (SRE)
 
Site reliability engineering - Lightning Talk
Site reliability engineering - Lightning TalkSite reliability engineering - Lightning Talk
Site reliability engineering - Lightning Talk
 
Rapid Strategic SRE Assessments
Rapid Strategic SRE AssessmentsRapid Strategic SRE Assessments
Rapid Strategic SRE Assessments
 
Observability
Observability Observability
Observability
 
SRE for Everyone: Making Tomorrow Better Than Today
SRE for Everyone: Making Tomorrow Better Than Today SRE for Everyone: Making Tomorrow Better Than Today
SRE for Everyone: Making Tomorrow Better Than Today
 
SRE 101 (Site Reliability Engineering)
SRE 101 (Site Reliability Engineering)SRE 101 (Site Reliability Engineering)
SRE 101 (Site Reliability Engineering)
 
SRE From Scratch
SRE From ScratchSRE From Scratch
SRE From Scratch
 
Building an SRE Organization @ Squarespace
Building an SRE Organization @ SquarespaceBuilding an SRE Organization @ Squarespace
Building an SRE Organization @ Squarespace
 
SRE (service reliability engineer) on big DevOps platform running on the clou...
SRE (service reliability engineer) on big DevOps platform running on the clou...SRE (service reliability engineer) on big DevOps platform running on the clou...
SRE (service reliability engineer) on big DevOps platform running on the clou...
 
Implementing SRE practices: SLI/SLO deep dive - David Blank-Edelman - DevOpsD...
Implementing SRE practices: SLI/SLO deep dive - David Blank-Edelman - DevOpsD...Implementing SRE practices: SLI/SLO deep dive - David Blank-Edelman - DevOpsD...
Implementing SRE practices: SLI/SLO deep dive - David Blank-Edelman - DevOpsD...
 
Shift left Observability
Shift left ObservabilityShift left Observability
Shift left Observability
 
The Next Wave of Reliability Engineering
The Next Wave of Reliability EngineeringThe Next Wave of Reliability Engineering
The Next Wave of Reliability Engineering
 

Similar to SRE Demystified - 07 - Practical Alerting

Overview of Postgres Utility Processes
Overview of Postgres Utility ProcessesOverview of Postgres Utility Processes
Overview of Postgres Utility ProcessesEDB
 
Webinar: Best Practices for Upgrading to MongoDB 3.2
Webinar: Best Practices for Upgrading to MongoDB 3.2Webinar: Best Practices for Upgrading to MongoDB 3.2
Webinar: Best Practices for Upgrading to MongoDB 3.2Dana Elisabeth Groce
 
MongoDB Days Silicon Valley: Best Practices for Upgrading to MongoDB
MongoDB Days Silicon Valley: Best Practices for Upgrading to MongoDBMongoDB Days Silicon Valley: Best Practices for Upgrading to MongoDB
MongoDB Days Silicon Valley: Best Practices for Upgrading to MongoDBMongoDB
 
Performance eng prakash.sahu
Performance eng prakash.sahuPerformance eng prakash.sahu
Performance eng prakash.sahuDr. Prakash Sahu
 
Oracle EBS Production Support - Recommendations
Oracle EBS Production Support - RecommendationsOracle EBS Production Support - Recommendations
Oracle EBS Production Support - RecommendationsVigilant Technologies
 
Introduction to Prometheus Monitoring (Singapore Meetup)
Introduction to Prometheus Monitoring (Singapore Meetup) Introduction to Prometheus Monitoring (Singapore Meetup)
Introduction to Prometheus Monitoring (Singapore Meetup) Arseny Chernov
 
515689311-Postgresql-DBA-Architecture.pptx
515689311-Postgresql-DBA-Architecture.pptx515689311-Postgresql-DBA-Architecture.pptx
515689311-Postgresql-DBA-Architecture.pptxssuser03ec3c
 
KoprowskiT_SQLSatMoscow_2AMaDisaterJustBegan
KoprowskiT_SQLSatMoscow_2AMaDisaterJustBeganKoprowskiT_SQLSatMoscow_2AMaDisaterJustBegan
KoprowskiT_SQLSatMoscow_2AMaDisaterJustBeganTobias Koprowski
 
Performance tuning Grails applications SpringOne 2GX 2014
Performance tuning Grails applications SpringOne 2GX 2014Performance tuning Grails applications SpringOne 2GX 2014
Performance tuning Grails applications SpringOne 2GX 2014Lari Hotari
 
KoprowskiT_SPBizConf_2AMaDisasterJustBegan
KoprowskiT_SPBizConf_2AMaDisasterJustBeganKoprowskiT_SPBizConf_2AMaDisasterJustBegan
KoprowskiT_SPBizConf_2AMaDisasterJustBeganTobias Koprowski
 
KoprowskiT_SPBizConference_2AMaDisasterJustBegan
KoprowskiT_SPBizConference_2AMaDisasterJustBeganKoprowskiT_SPBizConference_2AMaDisasterJustBegan
KoprowskiT_SPBizConference_2AMaDisasterJustBeganTobias Koprowski
 
FAULT TOLERANCE OF RESOURCES IN COMPUTATIONAL GRIDS
FAULT TOLERANCE OF RESOURCES IN COMPUTATIONAL GRIDSFAULT TOLERANCE OF RESOURCES IN COMPUTATIONAL GRIDS
FAULT TOLERANCE OF RESOURCES IN COMPUTATIONAL GRIDSMaurvi04
 
071410 sun a_1515_feldman_stephen
071410 sun a_1515_feldman_stephen071410 sun a_1515_feldman_stephen
071410 sun a_1515_feldman_stephenSteve Feldman
 
IRJET- Real Time Monitoring of Servers with Prometheus and Grafana for High A...
IRJET- Real Time Monitoring of Servers with Prometheus and Grafana for High A...IRJET- Real Time Monitoring of Servers with Prometheus and Grafana for High A...
IRJET- Real Time Monitoring of Servers with Prometheus and Grafana for High A...IRJET Journal
 
python_development.pptx
python_development.pptxpython_development.pptx
python_development.pptxLemonReddy1
 
Why advanced monitoring is key for healthy
Why advanced monitoring is key for healthyWhy advanced monitoring is key for healthy
Why advanced monitoring is key for healthyDenodo
 
High availability and disaster recovery in IBM PureApplication System
High availability and disaster recovery in IBM PureApplication SystemHigh availability and disaster recovery in IBM PureApplication System
High availability and disaster recovery in IBM PureApplication SystemScott Moonen
 
Oracle Enteprise Manager Cloud Control 12c - Setting Up Metrics and Monitorin...
Oracle Enteprise Manager Cloud Control 12c - Setting Up Metrics and Monitorin...Oracle Enteprise Manager Cloud Control 12c - Setting Up Metrics and Monitorin...
Oracle Enteprise Manager Cloud Control 12c - Setting Up Metrics and Monitorin...Leighton Nelson
 

Similar to SRE Demystified - 07 - Practical Alerting (20)

Overview of Postgres Utility Processes
Overview of Postgres Utility ProcessesOverview of Postgres Utility Processes
Overview of Postgres Utility Processes
 
Webinar: Best Practices for Upgrading to MongoDB 3.2
Webinar: Best Practices for Upgrading to MongoDB 3.2Webinar: Best Practices for Upgrading to MongoDB 3.2
Webinar: Best Practices for Upgrading to MongoDB 3.2
 
MongoDB Days Silicon Valley: Best Practices for Upgrading to MongoDB
MongoDB Days Silicon Valley: Best Practices for Upgrading to MongoDBMongoDB Days Silicon Valley: Best Practices for Upgrading to MongoDB
MongoDB Days Silicon Valley: Best Practices for Upgrading to MongoDB
 
Performance eng prakash.sahu
Performance eng prakash.sahuPerformance eng prakash.sahu
Performance eng prakash.sahu
 
Oracle EBS Production Support - Recommendations
Oracle EBS Production Support - RecommendationsOracle EBS Production Support - Recommendations
Oracle EBS Production Support - Recommendations
 
Introduction to Prometheus Monitoring (Singapore Meetup)
Introduction to Prometheus Monitoring (Singapore Meetup) Introduction to Prometheus Monitoring (Singapore Meetup)
Introduction to Prometheus Monitoring (Singapore Meetup)
 
515689311-Postgresql-DBA-Architecture.pptx
515689311-Postgresql-DBA-Architecture.pptx515689311-Postgresql-DBA-Architecture.pptx
515689311-Postgresql-DBA-Architecture.pptx
 
Visual Studio Profiler
Visual Studio ProfilerVisual Studio Profiler
Visual Studio Profiler
 
KoprowskiT_SQLSatMoscow_2AMaDisaterJustBegan
KoprowskiT_SQLSatMoscow_2AMaDisaterJustBeganKoprowskiT_SQLSatMoscow_2AMaDisaterJustBegan
KoprowskiT_SQLSatMoscow_2AMaDisaterJustBegan
 
Performance tuning Grails applications SpringOne 2GX 2014
Performance tuning Grails applications SpringOne 2GX 2014Performance tuning Grails applications SpringOne 2GX 2014
Performance tuning Grails applications SpringOne 2GX 2014
 
KoprowskiT_SPBizConf_2AMaDisasterJustBegan
KoprowskiT_SPBizConf_2AMaDisasterJustBeganKoprowskiT_SPBizConf_2AMaDisasterJustBegan
KoprowskiT_SPBizConf_2AMaDisasterJustBegan
 
KoprowskiT_SPBizConference_2AMaDisasterJustBegan
KoprowskiT_SPBizConference_2AMaDisasterJustBeganKoprowskiT_SPBizConference_2AMaDisasterJustBegan
KoprowskiT_SPBizConference_2AMaDisasterJustBegan
 
FAULT TOLERANCE OF RESOURCES IN COMPUTATIONAL GRIDS
FAULT TOLERANCE OF RESOURCES IN COMPUTATIONAL GRIDSFAULT TOLERANCE OF RESOURCES IN COMPUTATIONAL GRIDS
FAULT TOLERANCE OF RESOURCES IN COMPUTATIONAL GRIDS
 
071410 sun a_1515_feldman_stephen
071410 sun a_1515_feldman_stephen071410 sun a_1515_feldman_stephen
071410 sun a_1515_feldman_stephen
 
IRJET- Real Time Monitoring of Servers with Prometheus and Grafana for High A...
IRJET- Real Time Monitoring of Servers with Prometheus and Grafana for High A...IRJET- Real Time Monitoring of Servers with Prometheus and Grafana for High A...
IRJET- Real Time Monitoring of Servers with Prometheus and Grafana for High A...
 
python_development.pptx
python_development.pptxpython_development.pptx
python_development.pptx
 
Why advanced monitoring is key for healthy
Why advanced monitoring is key for healthyWhy advanced monitoring is key for healthy
Why advanced monitoring is key for healthy
 
High availability and disaster recovery in IBM PureApplication System
High availability and disaster recovery in IBM PureApplication SystemHigh availability and disaster recovery in IBM PureApplication System
High availability and disaster recovery in IBM PureApplication System
 
Good vs power automation frameworks
Good vs power automation frameworksGood vs power automation frameworks
Good vs power automation frameworks
 
Oracle Enteprise Manager Cloud Control 12c - Setting Up Metrics and Monitorin...
Oracle Enteprise Manager Cloud Control 12c - Setting Up Metrics and Monitorin...Oracle Enteprise Manager Cloud Control 12c - Setting Up Metrics and Monitorin...
Oracle Enteprise Manager Cloud Control 12c - Setting Up Metrics and Monitorin...
 

More from Dr Ganesh Iyer

SRE Demystified - 16 - NALSD - Non-Abstract Large System Design
SRE Demystified - 16 - NALSD - Non-Abstract Large System DesignSRE Demystified - 16 - NALSD - Non-Abstract Large System Design
SRE Demystified - 16 - NALSD - Non-Abstract Large System DesignDr Ganesh Iyer
 
SRE Demystified - 14 - SRE Practices overview
SRE Demystified - 14 - SRE Practices overviewSRE Demystified - 14 - SRE Practices overview
SRE Demystified - 14 - SRE Practices overviewDr Ganesh Iyer
 
SRE Demystified - 13 - Docs that matter -2
SRE Demystified - 13 - Docs that matter -2SRE Demystified - 13 - Docs that matter -2
SRE Demystified - 13 - Docs that matter -2Dr Ganesh Iyer
 
SRE Demystified - 12 - Docs that matter -1
SRE Demystified - 12 - Docs that matter -1 SRE Demystified - 12 - Docs that matter -1
SRE Demystified - 12 - Docs that matter -1 Dr Ganesh Iyer
 
SRE Demystified - 11 - Release management-2
SRE Demystified - 11 - Release management-2SRE Demystified - 11 - Release management-2
SRE Demystified - 11 - Release management-2Dr Ganesh Iyer
 
SRE Demystified - 10 - Release management-1
SRE Demystified - 10 - Release management-1SRE Demystified - 10 - Release management-1
SRE Demystified - 10 - Release management-1Dr Ganesh Iyer
 
SRE Demystified - 09 - Simplicity
SRE Demystified - 09 - SimplicitySRE Demystified - 09 - Simplicity
SRE Demystified - 09 - SimplicityDr Ganesh Iyer
 
SRE Demystified - 06 - Distributed Monitoring
SRE Demystified - 06 - Distributed MonitoringSRE Demystified - 06 - Distributed Monitoring
SRE Demystified - 06 - Distributed MonitoringDr Ganesh Iyer
 
SRE Demystified - 04 - Engagement Model
SRE Demystified - 04 - Engagement ModelSRE Demystified - 04 - Engagement Model
SRE Demystified - 04 - Engagement ModelDr Ganesh Iyer
 
SRE Demystified - 03 - Choosing SLIs and SLOs
SRE Demystified - 03 - Choosing SLIs and SLOsSRE Demystified - 03 - Choosing SLIs and SLOs
SRE Demystified - 03 - Choosing SLIs and SLOsDr Ganesh Iyer
 
Machine Learning for Statisticians - Introduction
Machine Learning for Statisticians - IntroductionMachine Learning for Statisticians - Introduction
Machine Learning for Statisticians - IntroductionDr Ganesh Iyer
 
Making Decisions - A Game Theoretic approach
Making Decisions - A Game Theoretic approachMaking Decisions - A Game Theoretic approach
Making Decisions - A Game Theoretic approachDr Ganesh Iyer
 
Game Theory and Engineering Applications
Game Theory and Engineering ApplicationsGame Theory and Engineering Applications
Game Theory and Engineering ApplicationsDr Ganesh Iyer
 
Machine Learning and its Applications
Machine Learning and its ApplicationsMachine Learning and its Applications
Machine Learning and its ApplicationsDr Ganesh Iyer
 
How to become a successful entrepreneur
How to become a successful entrepreneurHow to become a successful entrepreneur
How to become a successful entrepreneurDr Ganesh Iyer
 
Dockers and kubernetes
Dockers and kubernetesDockers and kubernetes
Dockers and kubernetesDr Ganesh Iyer
 
Containerization Principles Overview for app development and deployment
Containerization Principles Overview for app development and deploymentContainerization Principles Overview for app development and deployment
Containerization Principles Overview for app development and deploymentDr Ganesh Iyer
 
Game Theory and Engineering Applications
Game Theory and Engineering ApplicationsGame Theory and Engineering Applications
Game Theory and Engineering ApplicationsDr Ganesh Iyer
 
Demystifying Containerization Principles for Data Scientists
Demystifying Containerization Principles for Data ScientistsDemystifying Containerization Principles for Data Scientists
Demystifying Containerization Principles for Data ScientistsDr Ganesh Iyer
 

More from Dr Ganesh Iyer (20)

SRE Demystified - 16 - NALSD - Non-Abstract Large System Design
SRE Demystified - 16 - NALSD - Non-Abstract Large System DesignSRE Demystified - 16 - NALSD - Non-Abstract Large System Design
SRE Demystified - 16 - NALSD - Non-Abstract Large System Design
 
SRE Demystified - 14 - SRE Practices overview
SRE Demystified - 14 - SRE Practices overviewSRE Demystified - 14 - SRE Practices overview
SRE Demystified - 14 - SRE Practices overview
 
SRE Demystified - 13 - Docs that matter -2
SRE Demystified - 13 - Docs that matter -2SRE Demystified - 13 - Docs that matter -2
SRE Demystified - 13 - Docs that matter -2
 
SRE Demystified - 12 - Docs that matter -1
SRE Demystified - 12 - Docs that matter -1 SRE Demystified - 12 - Docs that matter -1
SRE Demystified - 12 - Docs that matter -1
 
SRE Demystified - 11 - Release management-2
SRE Demystified - 11 - Release management-2SRE Demystified - 11 - Release management-2
SRE Demystified - 11 - Release management-2
 
SRE Demystified - 10 - Release management-1
SRE Demystified - 10 - Release management-1SRE Demystified - 10 - Release management-1
SRE Demystified - 10 - Release management-1
 
SRE Demystified - 09 - Simplicity
SRE Demystified - 09 - SimplicitySRE Demystified - 09 - Simplicity
SRE Demystified - 09 - Simplicity
 
SRE Demystified - 06 - Distributed Monitoring
SRE Demystified - 06 - Distributed MonitoringSRE Demystified - 06 - Distributed Monitoring
SRE Demystified - 06 - Distributed Monitoring
 
SRE Demystified - 04 - Engagement Model
SRE Demystified - 04 - Engagement ModelSRE Demystified - 04 - Engagement Model
SRE Demystified - 04 - Engagement Model
 
SRE Demystified - 03 - Choosing SLIs and SLOs
SRE Demystified - 03 - Choosing SLIs and SLOsSRE Demystified - 03 - Choosing SLIs and SLOs
SRE Demystified - 03 - Choosing SLIs and SLOs
 
Machine Learning for Statisticians - Introduction
Machine Learning for Statisticians - IntroductionMachine Learning for Statisticians - Introduction
Machine Learning for Statisticians - Introduction
 
Making Decisions - A Game Theoretic approach
Making Decisions - A Game Theoretic approachMaking Decisions - A Game Theoretic approach
Making Decisions - A Game Theoretic approach
 
Cloud and Industry4.0
Cloud and Industry4.0Cloud and Industry4.0
Cloud and Industry4.0
 
Game Theory and Engineering Applications
Game Theory and Engineering ApplicationsGame Theory and Engineering Applications
Game Theory and Engineering Applications
 
Machine Learning and its Applications
Machine Learning and its ApplicationsMachine Learning and its Applications
Machine Learning and its Applications
 
How to become a successful entrepreneur
How to become a successful entrepreneurHow to become a successful entrepreneur
How to become a successful entrepreneur
 
Dockers and kubernetes
Dockers and kubernetesDockers and kubernetes
Dockers and kubernetes
 
Containerization Principles Overview for app development and deployment
Containerization Principles Overview for app development and deploymentContainerization Principles Overview for app development and deployment
Containerization Principles Overview for app development and deployment
 
Game Theory and Engineering Applications
Game Theory and Engineering ApplicationsGame Theory and Engineering Applications
Game Theory and Engineering Applications
 
Demystifying Containerization Principles for Data Scientists
Demystifying Containerization Principles for Data ScientistsDemystifying Containerization Principles for Data Scientists
Demystifying Containerization Principles for Data Scientists
 

Recently uploaded

AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?XfilesPro
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphNeo4j
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 

Recently uploaded (20)

AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 

SRE Demystified - 07 - Practical Alerting

  • 3. Monitoring • Monitoring a very large system is challenging for a couple of reasons: • The sheer number of components being analyzed • The need to maintain a reasonably low maintenance burden on the engineers responsible for the system • A large system should be designed to aggregate signals and prune outliers • We need monitoring systems that allow us to alert for high- level service objectives, but retain the granularity to inspect individual components as needed 3 https://landing.google.com/sre/sre-book/chapters/practical-alerting/
  • 4. Borgmon monitoring at Google • White-box monitoring • Instead of executing custom scripts to detect system failures, Borgmon relies on a common data exposition format • This enables mass data collection with low overheads and avoids the costs of subprocess execution and network connection setup • The data is used both for rendering charts and creating alerts, which are accomplished using simple arithmetic • To facilitate mass collection, the metrics format had to be standardized 4 https://landing.google.com/sre/sre-book/chapters/practical-alerting/
  • 5. Instrumentation of applications • Adding mapped variables for example • An example map-valued variable • Showing 25 HTTP 200 responses and 12 HTTP 500s: • http_responses map:code 200:25 404:0 500:12 5 https://landing.google.com/sre/sre-book/chapters/practical-alerting/
  • 6. Storage in the Time-Series Arena • A service is typically made up of many binaries running as many tasks, on many machines, in many clusters • Borgmon needs to keep all that data organized, while allowing flexible querying and slicing of that data • Borgmon stores all the data in an in-memory database, regularly checkpointed to disk • The data points have the form (timestamp, value), and are stored in chronological lists called time-series, and each time- series is named by a unique set of labels, of the form name=value. 6 https://landing.google.com/sre/sre-book/chapters/practical-alerting/
  • 7. Storage in the Time-Series Arena 7 A time-series for errors labeled by the original host each was collected from https://landing.google.com/sre/sre-book/chapters/practical-alerting/
  • 8. Labels and Vectors • Time-series are stored as sequences of numbers and timestamps, which are referred to as vectors • Like vectors in linear algebra, these vectors are slices and cross-sections of the multidimensional matrix of data points in the arena • The name of a time-series is a labelset, because it’s implemented as a set of labels expressed as key=value pairs. One of these labels is the variable name itself, the key that appears on the varz page 8https://landing.google.com/sre/sre-book/chapters/practical-alerting/
  • 9. Labels and Vectors • Example variable expression {var=http_requests,job=webserver,instance=host0:80,service=web,zone=us-west} 9 Label Value var The name of the variable job The name given to the type of server being monitored service A loosely defined collection of jobs that provide a service to users, either internal or external zone Location of the Borgmon that performed the collection of this variable https://landing.google.com/sre/sre-book/chapters/practical-alerting/
  • 10. Rule Evaluation • The Borgmon program code, also known as Borgmon rules, consists of simple algebraic expressions that compute time-series from other time-series • Rules run in a parallel threadpool where possible, but are dependent on ordering when using previously defined rules as input • Aggregation is the cornerstone of rule evaluation in a distributed environment 10 https://landing.google.com/sre/sre-book/chapters/practical-alerting/
  • 12. Example Alert Rule • Creates an alert when the error ratio over 10 minutes exceeds 1% and the total number of errors exceeds 1 per second 12 https://landing.google.com/sre/sre-book/chapters/practical-alerting/
  • 13. Maintaining the configuration • Borgmon configuration separates the definition of the rules from the targets being monitored • Borgmon also supports language templates • The first class simply codifies the emergent schema of variables exported from a given library of code • Such templates exist for the HTTP server library, memory allocation, the storage client library • The second class templates are to manage the aggregation of data from a single-server task to the global service footprint 13 https://landing.google.com/sre/sre-book/chapters/practical-alerting/
  • 15. Dr Ganesh Neelakanta Iyer ganesh@ganeshniyer.com ganesh.vigneswara@gmail.com