Latency SLOs Done Right
SREcon19 Americas
Fred Moyer
Developer Evangelist, Circonus
@phredmoyer #SREcon
Latency
Is it important?
Latency
For any of your services, how many requests were served within 500 ms over the last month?
Latency
For any of your services, how many requests were served within 250 ms over the last month?
Latency
How would you answer that question for your services?
Latency
How accurate would your answer be? 10% error? 20%? 50%? 200%?
I'm Fred and I like SLOs
- Developer Evangelist @Circonus
- Engineer who talks to people
- Writing code and breaking prod for 20 years
- @phredmoyer
- Likes C, Go, Perl, PostgreSQL
Talk Agenda
● SLO Refresher
● A Common Mistake
● Computing SLOs with log data
● Computing SLOs by counting requests
● Computing SLOs with histograms
Service Level Objectives
SLI - Service Level Indicator
SLO - Service Level Objective
SLA - Service Level Agreement
"99th percentile latency of homepage requests over the past 5 minutes < 300ms"

"SLIs drive SLOs which inform SLAs"

SLI - Service Level Indicator: a measure of the service that can be quantified

Excerpted from: "SLIs, SLOs, SLAs, oh my!" @sethvargo @lizthegrey
https://youtu.be/tEylFyxbDLE
"99th percentile homepage SLI will succeed 99.9% over trailing year"

SLO - Service Level Objective: a target for a Service Level Indicator
"99th percentile homepage SLI will succeed 99% over trailing year"

SLA - Service Level Agreement: a legal agreement
A Common Mistake
Averaging Percentiles
p95(W1 âˆȘ W2) != (p95(W1) + p95(W2)) / 2
Works fine when node workloads are symmetric
Hides problems when workloads are asymmetric
A Common Mistake
(chart: 99% of requests served here)
A Common Mistake
p95(W1) = 220ms
p95(W2) = 650ms
p95(W1 âˆȘ W2) = 230ms
(p95(W1) + p95(W2)) / 2 = 430ms
The average overstates the true p95 by nearly 2x
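The gap between those two numbers falls out of any skewed two-node workload. A quick sketch with synthetic data (the distributions and sample counts are illustrative, not the slide's actual data):

```python
import random

# Synthetic two-node workload: W1 takes most of the traffic and is fast,
# W2 takes a sliver of the traffic and is slow.
random.seed(42)
w1 = [random.uniform(10, 230) for _ in range(9500)]
w2 = [random.uniform(100, 700) for _ in range(500)]

def p95(samples):
    """Nearest-rank 95th percentile: sort, then index 95% of the way up."""
    ordered = sorted(samples)
    return ordered[int(0.95 * len(ordered))]

true_p95 = p95(w1 + w2)            # percentile of the merged samples
avg_p95 = (p95(w1) + p95(w2)) / 2  # the tempting-but-wrong average

# The averaged number lands far above the true merged p95.
print(true_p95, avg_p95)
```

Because W2 contributes only 5% of the samples, its slow tail barely moves the merged p95, yet it dominates the naive average.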
A Common Mistake: Averaging Percentiles
(chart: p95 actual = 230ms, p95 average = 430ms)
A Common Mistake
Log parser => metrics (mtail)
What metrics are you storing?
Averages?
p50, p90, p95, p99, p99.9, p99.99?
Computing SLOs with log data
"%{%d/%b/%Y %T}t.%{msec}t %{%z}t"
~100 bytes per log line
~1GB for 10M requests
Logs => HDFS
Logs => ElasticSearch/Splunk
ssh -- `grep ... | awk ... > 550 ... | wc -l`
Computing SLOs with log data
1. Extract samples for the time window
2. Sort the samples by value
3. Find the sample 5% of the count down from the largest
4. That's your p95
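Those four steps amount to a sort and an index. A minimal sketch, where the input list stands in for latencies already extracted from log lines:

```python
# Steps 2-4: sort the window's samples, then take the value that sits
# 5% of the count down from the largest.
def p95_from_samples(latencies_ms):
    ordered = sorted(latencies_ms)               # step 2: sort by value
    k = max(1, int(round(0.05 * len(ordered))))  # step 3: 5% count from the top
    return ordered[-k]                           # step 4: that sample is the p95

# 100 evenly spread samples: the 5th-from-largest value, 96, is the p95.
print(p95_from_samples(list(range(1, 101))))  # prints 96
```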
Computing SLOs with log data
"95th percentile SLI will succeed 99.9% over the trailing year"
1. Divide 1 year of samples into 1,000 slices
2. For each slice, calculate the SLI
3. Was the p95 SLI met for 999 slices? If so, the SLO was met
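The slice check above is a simple tally once each slice's SLI has been computed. A sketch, where `slice_p95s` is a hypothetical list of per-slice p95 values (one per slice of the trailing year) and the 500ms threshold is illustrative:

```python
# Steps 2-3: given one SLI result per slice, the SLO is met when at
# least 999 of the 1,000 slices stayed under the latency threshold.
def slo_met(slice_p95s, threshold_ms=500, required_good=999):
    good = sum(1 for p in slice_p95s if p < threshold_ms)
    return good >= required_good

print(slo_met([120] * 999 + [900]))       # one bad slice of 1,000: met
print(slo_met([120] * 998 + [900, 900]))  # two bad slices: not met
```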
Computing SLOs with log data
Pros:
1. Easy to configure logs to capture latency
2. Easy to roll your own processing code; some open source options out there
3. Accurate results
Cons:
1. Expensive (see log analysis solution pricing)
2. Sampling is possible but skews accuracy
3. Slow
4. Difficult to scale
Computing SLOs by counting requests
1. Count the number of requests that violate the SLI threshold
2. Count the total number of requests
3. % success = 100 - (#failed_reqs / #total_reqs) * 100
Similar to a Prometheus cumulative ('le') histogram
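The counting approach above boils down to two counters incremented per request. A minimal sketch (the class and method names are illustrative, not any particular library's API), replayed against the deck's worked example of 2,262 bad requests out of 60,124:

```python
# Two counters per SLI: total requests, and requests over the threshold.
class SliCounter:
    def __init__(self, threshold_ms):
        self.threshold_ms = threshold_ms
        self.total = 0
        self.failed = 0

    def observe(self, latency_ms):
        self.total += 1
        if latency_ms >= self.threshold_ms:  # violates the SLI threshold
            self.failed += 1

    def percent_success(self):
        return 100 - (self.failed / self.total) * 100

c = SliCounter(threshold_ms=30)
for _ in range(60124 - 2262):
    c.observe(12)  # under the threshold
for _ in range(2262):
    c.observe(45)  # over the threshold
print(round(c.percent_success(), 1))  # prints 96.2
```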
Computing SLOs by counting requests
SLO = 90% of reqs < 30ms
# bad requests = 2,262
# total requests = 60,124
100 - (2,262 / 60,124) * 100 = 96.2%
SLO was met
Computing SLOs by counting requests
Pros:
1. Simple to implement
2. Performant
3. Scalable
4. Accurate
Cons:
1. Fixed SLO threshold - changing it requires reconfiguring
2. Looking back at other thresholds is impossible
Computing SLOs with histograms
AKA distributions: sample counts in bins/buckets
Gil Tene's hdrhistogram.org
(chart: # samples vs sample value, annotated with mode, median q(0.5), mean, q(0.9), q(1))
Computing SLOs with histograms
Some histogram types:
1. Linear
2. Approximate
3. Fixed bin
4. Cumulative
5. Log linear
Log Linear Histogram
github.com/circonus-labs/libcircllhist
github.com/circonus-labs/circonusllhist
Mergeability
h(A âˆȘ B) = h(A) âˆȘ h(B)
A and B must have identical bin boundaries
Histograms can be aggregated both in space and in time
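With identical bin boundaries, the merge above is just per-bin addition of counts. A sketch using a plain `Counter` as a stand-in for a real histogram structure (the bin values and counts are made up):

```python
from collections import Counter

# Bin lower bound (ms) -> sample count, for two nodes with identical bins.
h_node_a = Counter({100: 40, 110: 7, 650: 3})
h_node_b = Counter({100: 25, 330: 5})

# h(A) merged with h(B): counts simply add bin by bin, which is why
# histograms aggregate cleanly across hosts (space) and windows (time).
h_merged = h_node_a + h_node_b
print(h_merged[100], sum(h_merged.values()))  # prints 65 80
```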
Computing SLOs with histograms
How many requests are faster than 330ms?
1. Walk the bins from lowest to highest until you reach 330ms
2. Sum the counts in those bins
3. Done
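Those three steps over a toy bin table (`hist` maps a bin's lower bound in ms to its sample count; the bin layout is illustrative):

```python
# Walk bins from lowest to highest, summing counts in bins that sit
# entirely below the threshold.
def count_faster_than(hist, threshold_ms):
    total = 0
    for lower in sorted(hist):     # step 1: walk bins lowest to highest
        if lower >= threshold_ms:  # stop once we reach the threshold
            break
        total += hist[lower]       # step 2: sum the counts
    return total                   # step 3: done

hist = {10: 500, 100: 300, 330: 40, 500: 10}
print(count_faster_than(hist, 330))  # prints 800
```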
@phredmoye #SREcon
So ... where are the bin boundaries?
For the libcircllhist implementation we have bins at:
... 320, 330, 340, ...
... and: 10, 11, 12, 13, ...
... and: 0.0000010, 0.0000011, 0.0000012, ...
For every decimal floating point number with two significant digits, we have a bin (within 10^{+/-128}).
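The two-significant-digit rule can be sketched as a small function. This is an illustrative reconstruction for positive values only, not libcircllhist's actual bin math, which also handles sign, zero, and the 10^{+/-128} range:

```python
import math

def bin_lower_bound(x):
    """Lower bound of the bin containing positive x: keep the first two
    significant decimal digits, zero the rest."""
    exp = math.floor(math.log10(x))  # decimal exponent of x
    scale = 10 ** (exp - 1)          # unit of the second significant digit
    return math.floor(x / scale) * scale

print(bin_lower_bound(337.4))  # prints 330
print(bin_lower_bound(12.7))   # prints 12
```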
Computing SLOs with histograms
Pros:
1. Space efficient (~300 bytes per histogram in practice, 10x more efficient than logs)
2. Full flexibility:
   - Thresholds can be chosen as needed and analyzed after the fact
   - Statistical methods apply: IQR, count_below, q(1), etc.
3. Mergeability (aggregate data across nodes)
4. Performance (ns insertions, ÎŒs percentile calculations)
5. Bounded error (half the bin size)
6. Several open source libraries available
Computing SLOs with histograms
Cons:
1. Math is more complex than other methods
2. Some loss of accuracy (<< 5%) in worst cases
Log Linear Histograms with Python
github.com/circonus-labs/libcircllhist
(autoconf && ./configure && make install)
github.com/circonus-labs/libcircllhist/tree/master/src/python
(pip install circllhist)
Log Linear Histograms with Python

h = Circllhist()        # make a new histogram
h.insert(123)           # insert value 123
h.insert(456)           # insert value 456
h.insert(789)           # insert value 789
print(h.count())        # prints 3
print(h.sum())          # prints 1368
print(h.quantile(0.5))  # prints 456
Log Linear Histograms with Python

from matplotlib import pyplot as plt
from circllhist import Circllhist

H = Circllhist()
# ... add latency data to H via insert()
H.plot()
plt.axvline(x=H.quantile(0.95), color='red')
Averaging Percentiles (chart, revisited with histogram data)
Conclusions
1. Averaging Percentiles is tempting, but misleading
2. Use counters or histograms to calculate SLOs correctly
3. Histograms give the most flexibility in choosing latency
thresholds, but only a couple libraries implement them
(libcircllhist, hdrhistogram)
4. Full support for (sparsely encoded-, HDR-) histograms in
TSDBs still lacking (except IRONdb).
#SREcon
Thank you!
Fred Moyer
Developer Evangelist, Circonus
@phredmoyer
slack.s.circonus
slideshare.net/redhotpenguin

Weitere Àhnliche Inhalte

Ähnlich wie SREcon americas 2019 - Latency SLOs Done Right

Become a Performance Diagnostics Hero
Become a Performance Diagnostics HeroBecome a Performance Diagnostics Hero
Become a Performance Diagnostics HeroTechWell
 
Top Java Performance Problems and Metrics To Check in Your Pipeline
Top Java Performance Problems and Metrics To Check in Your PipelineTop Java Performance Problems and Metrics To Check in Your Pipeline
Top Java Performance Problems and Metrics To Check in Your PipelineAndreas Grabner
 
From crash to testcase
From crash to testcaseFrom crash to testcase
From crash to testcaseRoel Van de Paar
 
Keynote: Machine Learning for Design Automation at DAC 2018
Keynote:  Machine Learning for Design Automation at DAC 2018Keynote:  Machine Learning for Design Automation at DAC 2018
Keynote: Machine Learning for Design Automation at DAC 2018Manish Pandey
 
Rails Software Metrics
Rails Software MetricsRails Software Metrics
Rails Software Metricschiel
 
Cより速いRubyăƒ—ăƒ­ă‚°ăƒ©ăƒ 
Cより速いRubyăƒ—ăƒ­ă‚°ăƒ©ăƒ Cより速いRubyăƒ—ăƒ­ă‚°ăƒ©ăƒ 
Cより速いRubyăƒ—ăƒ­ă‚°ăƒ©ăƒ kwatch
 
Improving Code Quality Through Effective Review Process
Improving Code Quality Through Effective  Review ProcessImproving Code Quality Through Effective  Review Process
Improving Code Quality Through Effective Review ProcessDr. Syed Hassan Amin
 
Your Own Metric System
Your Own Metric SystemYour Own Metric System
Your Own Metric SystemErin Dees
 
Dynomite at Erlang Factory
Dynomite at Erlang FactoryDynomite at Erlang Factory
Dynomite at Erlang Factorymoonpolysoft
 
Variables & Expressions
Variables & ExpressionsVariables & Expressions
Variables & ExpressionsRich Price
 
Developing a Culture of Quality Code (Midwest PHP 2020)
Developing a Culture of Quality Code (Midwest PHP 2020)Developing a Culture of Quality Code (Midwest PHP 2020)
Developing a Culture of Quality Code (Midwest PHP 2020)Scott Keck-Warren
 
magellan_mongodb_workload_analysis
magellan_mongodb_workload_analysismagellan_mongodb_workload_analysis
magellan_mongodb_workload_analysisPraveen Narayanan
 
The Ultimate Question of Programming, Refactoring, and Everything
The Ultimate Question of Programming, Refactoring, and EverythingThe Ultimate Question of Programming, Refactoring, and Everything
The Ultimate Question of Programming, Refactoring, and EverythingAndrey Karpov
 
The Ultimate Question of Programming, Refactoring, and Everything
The Ultimate Question of Programming, Refactoring, and EverythingThe Ultimate Question of Programming, Refactoring, and Everything
The Ultimate Question of Programming, Refactoring, and EverythingPVS-Studio
 
Ruby3x3: How are we going to measure 3x
Ruby3x3: How are we going to measure 3xRuby3x3: How are we going to measure 3x
Ruby3x3: How are we going to measure 3xMatthew Gaudet
 
Tips on how to improve the performance of your custom modules for high volume...
Tips on how to improve the performance of your custom modules for high volume...Tips on how to improve the performance of your custom modules for high volume...
Tips on how to improve the performance of your custom modules for high volume...Odoo
 
Dealing with Legacy Perl Code - Peter Scott
Dealing with Legacy Perl Code - Peter ScottDealing with Legacy Perl Code - Peter Scott
Dealing with Legacy Perl Code - Peter ScottO'Reilly Media
 
SDPHP - Percona Toolkit (It's Basically Magic)
SDPHP - Percona Toolkit (It's Basically Magic)SDPHP - Percona Toolkit (It's Basically Magic)
SDPHP - Percona Toolkit (It's Basically Magic)Robert Swisher
 
Speedy TDD with Rails
Speedy TDD with RailsSpeedy TDD with Rails
Speedy TDD with RailsPatchSpace Ltd
 

Ähnlich wie SREcon americas 2019 - Latency SLOs Done Right (20)

Become a Performance Diagnostics Hero
Become a Performance Diagnostics HeroBecome a Performance Diagnostics Hero
Become a Performance Diagnostics Hero
 
Top Java Performance Problems and Metrics To Check in Your Pipeline
Top Java Performance Problems and Metrics To Check in Your PipelineTop Java Performance Problems and Metrics To Check in Your Pipeline
Top Java Performance Problems and Metrics To Check in Your Pipeline
 
From crash to testcase
From crash to testcaseFrom crash to testcase
From crash to testcase
 
Keynote: Machine Learning for Design Automation at DAC 2018
Keynote:  Machine Learning for Design Automation at DAC 2018Keynote:  Machine Learning for Design Automation at DAC 2018
Keynote: Machine Learning for Design Automation at DAC 2018
 
Rails Software Metrics
Rails Software MetricsRails Software Metrics
Rails Software Metrics
 
Cより速いRubyăƒ—ăƒ­ă‚°ăƒ©ăƒ 
Cより速いRubyăƒ—ăƒ­ă‚°ăƒ©ăƒ Cより速いRubyăƒ—ăƒ­ă‚°ăƒ©ăƒ 
Cより速いRubyăƒ—ăƒ­ă‚°ăƒ©ăƒ 
 
Improving Code Quality Through Effective Review Process
Improving Code Quality Through Effective  Review ProcessImproving Code Quality Through Effective  Review Process
Improving Code Quality Through Effective Review Process
 
Your Own Metric System
Your Own Metric SystemYour Own Metric System
Your Own Metric System
 
Dynomite at Erlang Factory
Dynomite at Erlang FactoryDynomite at Erlang Factory
Dynomite at Erlang Factory
 
Variables & Expressions
Variables & ExpressionsVariables & Expressions
Variables & Expressions
 
Developing a Culture of Quality Code (Midwest PHP 2020)
Developing a Culture of Quality Code (Midwest PHP 2020)Developing a Culture of Quality Code (Midwest PHP 2020)
Developing a Culture of Quality Code (Midwest PHP 2020)
 
magellan_mongodb_workload_analysis
magellan_mongodb_workload_analysismagellan_mongodb_workload_analysis
magellan_mongodb_workload_analysis
 
The Ultimate Question of Programming, Refactoring, and Everything
The Ultimate Question of Programming, Refactoring, and EverythingThe Ultimate Question of Programming, Refactoring, and Everything
The Ultimate Question of Programming, Refactoring, and Everything
 
The Ultimate Question of Programming, Refactoring, and Everything
The Ultimate Question of Programming, Refactoring, and EverythingThe Ultimate Question of Programming, Refactoring, and Everything
The Ultimate Question of Programming, Refactoring, and Everything
 
PraveenBOUT++
PraveenBOUT++PraveenBOUT++
PraveenBOUT++
 
Ruby3x3: How are we going to measure 3x
Ruby3x3: How are we going to measure 3xRuby3x3: How are we going to measure 3x
Ruby3x3: How are we going to measure 3x
 
Tips on how to improve the performance of your custom modules for high volume...
Tips on how to improve the performance of your custom modules for high volume...Tips on how to improve the performance of your custom modules for high volume...
Tips on how to improve the performance of your custom modules for high volume...
 
Dealing with Legacy Perl Code - Peter Scott
Dealing with Legacy Perl Code - Peter ScottDealing with Legacy Perl Code - Peter Scott
Dealing with Legacy Perl Code - Peter Scott
 
SDPHP - Percona Toolkit (It's Basically Magic)
SDPHP - Percona Toolkit (It's Basically Magic)SDPHP - Percona Toolkit (It's Basically Magic)
SDPHP - Percona Toolkit (It's Basically Magic)
 
Speedy TDD with Rails
Speedy TDD with RailsSpeedy TDD with Rails
Speedy TDD with Rails
 

Mehr von Fred Moyer

Latency SLOs done right
Latency SLOs done rightLatency SLOs done right
Latency SLOs done rightFred Moyer
 
Comprehensive Container Based Service Monitoring with Kubernetes and Istio
Comprehensive Container Based Service Monitoring with Kubernetes and IstioComprehensive Container Based Service Monitoring with Kubernetes and Istio
Comprehensive Container Based Service Monitoring with Kubernetes and IstioFred Moyer
 
Comprehensive container based service monitoring with kubernetes and istio
Comprehensive container based service monitoring with kubernetes and istioComprehensive container based service monitoring with kubernetes and istio
Comprehensive container based service monitoring with kubernetes and istioFred Moyer
 
Effective management of high volume numeric data with histograms
Effective management of high volume numeric data with histogramsEffective management of high volume numeric data with histograms
Effective management of high volume numeric data with histogramsFred Moyer
 
Statistics for dummies
Statistics for dummiesStatistics for dummies
Statistics for dummiesFred Moyer
 
GrafanaCon EU 2018
GrafanaCon EU 2018GrafanaCon EU 2018
GrafanaCon EU 2018Fred Moyer
 
Fredmoyer postgresopen 2017
Fredmoyer postgresopen 2017Fredmoyer postgresopen 2017
Fredmoyer postgresopen 2017Fred Moyer
 
Better service monitoring through histograms sv perl 09012016
Better service monitoring through histograms sv perl 09012016Better service monitoring through histograms sv perl 09012016
Better service monitoring through histograms sv perl 09012016Fred Moyer
 
Better service monitoring through histograms
Better service monitoring through histogramsBetter service monitoring through histograms
Better service monitoring through histogramsFred Moyer
 
The Breakup - Logically Sharding a Growing PostgreSQL Database
The Breakup - Logically Sharding a Growing PostgreSQL DatabaseThe Breakup - Logically Sharding a Growing PostgreSQL Database
The Breakup - Logically Sharding a Growing PostgreSQL DatabaseFred Moyer
 
Learning go for perl programmers
Learning go for perl programmersLearning go for perl programmers
Learning go for perl programmersFred Moyer
 
Surge 2012 fred_moyer_lightning
Surge 2012 fred_moyer_lightningSurge 2012 fred_moyer_lightning
Surge 2012 fred_moyer_lightningFred Moyer
 
Apache Dispatch
Apache DispatchApache Dispatch
Apache DispatchFred Moyer
 
Ball Of Mud Yapc 2008
Ball Of Mud Yapc 2008Ball Of Mud Yapc 2008
Ball Of Mud Yapc 2008Fred Moyer
 
Data::FormValidator Simplified
Data::FormValidator SimplifiedData::FormValidator Simplified
Data::FormValidator SimplifiedFred Moyer
 

Mehr von Fred Moyer (16)

Latency SLOs done right
Latency SLOs done rightLatency SLOs done right
Latency SLOs done right
 
Comprehensive Container Based Service Monitoring with Kubernetes and Istio
Comprehensive Container Based Service Monitoring with Kubernetes and IstioComprehensive Container Based Service Monitoring with Kubernetes and Istio
Comprehensive Container Based Service Monitoring with Kubernetes and Istio
 
Comprehensive container based service monitoring with kubernetes and istio
Comprehensive container based service monitoring with kubernetes and istioComprehensive container based service monitoring with kubernetes and istio
Comprehensive container based service monitoring with kubernetes and istio
 
Effective management of high volume numeric data with histograms
Effective management of high volume numeric data with histogramsEffective management of high volume numeric data with histograms
Effective management of high volume numeric data with histograms
 
Statistics for dummies
Statistics for dummiesStatistics for dummies
Statistics for dummies
 
GrafanaCon EU 2018
GrafanaCon EU 2018GrafanaCon EU 2018
GrafanaCon EU 2018
 
Fredmoyer postgresopen 2017
Fredmoyer postgresopen 2017Fredmoyer postgresopen 2017
Fredmoyer postgresopen 2017
 
Better service monitoring through histograms sv perl 09012016
Better service monitoring through histograms sv perl 09012016Better service monitoring through histograms sv perl 09012016
Better service monitoring through histograms sv perl 09012016
 
Better service monitoring through histograms
Better service monitoring through histogramsBetter service monitoring through histograms
Better service monitoring through histograms
 
The Breakup - Logically Sharding a Growing PostgreSQL Database
The Breakup - Logically Sharding a Growing PostgreSQL DatabaseThe Breakup - Logically Sharding a Growing PostgreSQL Database
The Breakup - Logically Sharding a Growing PostgreSQL Database
 
Learning go for perl programmers
Learning go for perl programmersLearning go for perl programmers
Learning go for perl programmers
 
Surge 2012 fred_moyer_lightning
Surge 2012 fred_moyer_lightningSurge 2012 fred_moyer_lightning
Surge 2012 fred_moyer_lightning
 
Qpsmtpd
QpsmtpdQpsmtpd
Qpsmtpd
 
Apache Dispatch
Apache DispatchApache Dispatch
Apache Dispatch
 
Ball Of Mud Yapc 2008
Ball Of Mud Yapc 2008Ball Of Mud Yapc 2008
Ball Of Mud Yapc 2008
 
Data::FormValidator Simplified
Data::FormValidator SimplifiedData::FormValidator Simplified
Data::FormValidator Simplified
 

KĂŒrzlich hochgeladen

Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsArshad QA
 
%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrand%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrandmasabamasaba
 
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfonteinmasabamasaba
 
ManageIQ - Sprint 236 Review - Slide Deck
ManageIQ - Sprint 236 Review - Slide DeckManageIQ - Sprint 236 Review - Slide Deck
ManageIQ - Sprint 236 Review - Slide DeckManageIQ
 
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...Health
 
MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...
MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...
MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...Jittipong Loespradit
 
Microsoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdfMicrosoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdfWilly Marroquin (WillyDevNET)
 
Azure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdf
Azure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdfAzure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdf
Azure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdfryanfarris8
 
Sector 18, Noida Call girls :8448380779 Model Escorts | 100% verified
Sector 18, Noida Call girls :8448380779 Model Escorts | 100% verifiedSector 18, Noida Call girls :8448380779 Model Escorts | 100% verified
Sector 18, Noida Call girls :8448380779 Model Escorts | 100% verifiedDelhi Call girls
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfkalichargn70th171
 
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdfintroduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdfVishalKumarJha10
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerThousandEyes
 
Define the academic and professional writing..pdf
Define the academic and professional writing..pdfDefine the academic and professional writing..pdf
Define the academic and professional writing..pdfPearlKirahMaeRagusta1
 
The Top App Development Trends Shaping the Industry in 2024-25 .pdf
The Top App Development Trends Shaping the Industry in 2024-25 .pdfThe Top App Development Trends Shaping the Industry in 2024-25 .pdf
The Top App Development Trends Shaping the Industry in 2024-25 .pdfayushiqss
 
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM TechniquesAI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM TechniquesVictorSzoltysek
 
call girls in Vaishali (Ghaziabad) 🔝 >àŒ’8448380779 🔝 genuine Escort Service đŸ”âœ”ïžâœ”ïž
call girls in Vaishali (Ghaziabad) 🔝 >àŒ’8448380779 🔝 genuine Escort Service đŸ”âœ”ïžâœ”ïžcall girls in Vaishali (Ghaziabad) 🔝 >àŒ’8448380779 🔝 genuine Escort Service đŸ”âœ”ïžâœ”ïž
call girls in Vaishali (Ghaziabad) 🔝 >àŒ’8448380779 🔝 genuine Escort Service đŸ”âœ”ïžâœ”ïžDelhi Call girls
 
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...panagenda
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Modelsaagamshah0812
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxComplianceQuest1
 

KĂŒrzlich hochgeladen (20)

Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview Questions
 
%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrand%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrand
 
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
 
ManageIQ - Sprint 236 Review - Slide Deck
ManageIQ - Sprint 236 Review - Slide DeckManageIQ - Sprint 236 Review - Slide Deck
ManageIQ - Sprint 236 Review - Slide Deck
 
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
 
MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...
MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...
MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...
 
Microsoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdfMicrosoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdf
 
Azure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdf
Azure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdfAzure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdf
Azure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdf
 
Sector 18, Noida Call girls :8448380779 Model Escorts | 100% verified
Sector 18, Noida Call girls :8448380779 Model Escorts | 100% verifiedSector 18, Noida Call girls :8448380779 Model Escorts | 100% verified
Sector 18, Noida Call girls :8448380779 Model Escorts | 100% verified
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
 
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdfintroduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
 
Define the academic and professional writing..pdf
Define the academic and professional writing..pdfDefine the academic and professional writing..pdf
Define the academic and professional writing..pdf
 
The Top App Development Trends Shaping the Industry in 2024-25 .pdf
The Top App Development Trends Shaping the Industry in 2024-25 .pdfThe Top App Development Trends Shaping the Industry in 2024-25 .pdf
The Top App Development Trends Shaping the Industry in 2024-25 .pdf
 
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM TechniquesAI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
 
call girls in Vaishali (Ghaziabad) 🔝 >àŒ’8448380779 🔝 genuine Escort Service đŸ”âœ”ïžâœ”ïž
call girls in Vaishali (Ghaziabad) 🔝 >àŒ’8448380779 🔝 genuine Escort Service đŸ”âœ”ïžâœ”ïžcall girls in Vaishali (Ghaziabad) 🔝 >àŒ’8448380779 🔝 genuine Escort Service đŸ”âœ”ïžâœ”ïž
call girls in Vaishali (Ghaziabad) 🔝 >àŒ’8448380779 🔝 genuine Escort Service đŸ”âœ”ïžâœ”ïž
 
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Models
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docx
 

SREcon americas 2019 - Latency SLOs Done Right

  • 1. Latency SLOs Done Right SREcon19 Americas #SREcon@phredmoye
  • 4. Latency For any of your services, how many requests were served within 500 ms over the last month? @phredmoye #SREcon 500MS ?
  • 5. Latency For any of your services, how many requests were served within 250 ms over the last month? @phredmoye #SREcon 250MS ?
  • 6. Latency How would you answer that question for your services? @phredmoye #SREcon
  • 7. Latency How accurate would your answer be? @phredmoye #SREcon ?10% ERROR 20%ERROR 50% ERROR 200% ERROR
  • 8. I’m Fred and I like SLOs - Developer Evangelist @Circonus - Engineer who talks to people - Writing code and breaking prod for 20 years - @phredmoyer - Likes C, Go, Perl, PostgreSQL @phredmoye 100% UPTIME
  • 9. Talk Agenda ● SLO Refresher ● A Common Mistake ● Computing SLOs with log data ● Computing SLOs by counting requests ● Computing SLOs with histograms @phredmoye #SREcon
  • 10. Service Level Objectives SLI - Service Level Indicator SLO - Service Level Objectives SLA - Service Level Agreement @phredmoye #SREcon
  • 12. “99th percentile latency of homepage requests over the past 5 minutes < 300ms” “SLIs drive SLOs which inform SLAs” SLI - Service Level Indicator Measure of the service that can be quantified Excerpted from: “SLIs, SLOs, SLAs, oh my!” @sethvargo @lizthegrey https://youtu.be/tEylFyxbDLE
  • 13. “99th percentile homepage SLI will succeed 99.9% over trailing year” “SLIs drive SLOs which inform SLAs” SLO - Service Level Objective, a target for Service Level Indicators Excerpted from: “SLIs, SLOs, SLAs, oh my!” @sethvargo @lizthegrey https://youtu.be/tEylFyxbDLE
  • 14. “99th percentile homepage SLI will succeed 99% over trailing year” “SLIs drive SLOs which inform SLAs” SLA - Service Level Agreement, a legal agreement Excerpted from: “SLIs, SLOs, SLAs, oh my!” @sethvargo @lizthegrey https://youtu.be/tEylFyxbDLE
  • 15. Talk Agenda ● SLO Refresher ● A Common Mistake ● Computing SLOs with log data ● Computing SLOs by counting requests ● Computing SLOs with histograms @phredmoye #SREcon
  • 16. A Common Mistake @phredmoye Averaging Percentiles p95(W1 ∪ W2) != (p95(W1) + p95(W2))/2 Works fine when node workload is symmetric Hides problems when workloads are asymmetric #SREcon
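A quick sketch makes the slide's inequality concrete. The workload shapes and sample counts below are invented for illustration; only the technique (compare the merged p95 with the averaged per-node p95s) comes from the talk:

```python
import random

random.seed(42)  # deterministic synthetic data

def p95(samples):
    # nearest-rank style 95th percentile
    s = sorted(samples)
    return s[int(0.95 * (len(s) - 1))]

# asymmetric workloads: W1 is fast and busy, W2 is slow and quiet
w1 = [random.uniform(10, 220) for _ in range(9500)]
w2 = [random.uniform(100, 650) for _ in range(500)]

correct = p95(w1 + w2)              # p95 of the merged samples
averaged = (p95(w1) + p95(w2)) / 2  # the common mistake

# the averaged value lands far above the true p95 of the combined traffic
print(round(correct), round(averaged))
```

With symmetric workloads the two numbers happen to agree; skewing the split, as here, pulls them far apart.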
  • 18. A Common Mistake @phredmoye 99% of requests served here #SREcon
  • 20. @phredmoye p95(W1) = 220ms p95(W2) = 650ms p95(W1 ∪ W2) = 230ms (p95(W1)+p95(W2))/2 = 430ms ~200% difference A Common Mistake #SREcon
  • 21. @phredmoye Averaging Percentiles A Common Mistake p95 actual (230ms) p95 average (430ms) ERROR #SREcon
  • 22. A Common Mistake @phredmoye Log parser => Metrics (mtail) What metrics are you storing? Averages? p50, p90, p95, p99, p99.9, p99.99? #SREcon
  • 23. Talk Agenda ● SLO Refresher ● A Common Mistake ● Computing SLOs with log data ● Computing SLOs by counting requests ● Computing SLOs with histograms @phredmoye #SREcon
  • 24. Computing SLOs with log data @phredmoye "%{%d/%b/%Y %T}t.%{msec}t %{%z}t" #SREcon ~100 bytes per log line ~1GB for 10M requests
  • 25. @phredmoye Logs => HDFS Logs => ElasticSearch/Splunk ssh -- `grep ... | awk ... > 550 ... | wc -l` #SREcon Computing SLOs with log data
  • 26. @phredmoye 1. Extract samples for time window 2. Sort the samples by value 3. Find the sample 5% count from largest 4. That’s your p95 #SREcon Computing SLOs with log data
  • 27. @phredmoye “95th percentile SLI will succeed 99.9% trailing year” 1. Divide 1 year samples into 1,000 slices 2. For each slice, calculate SLI 3. Was p95 SLI met for 999 slices? Met SLO if so #SREcon Computing SLOs with log data
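The two recipes above can be sketched in a few lines of Python, assuming the raw latency samples (in milliseconds) have already been extracted from the logs. The function names are mine, not from the talk:

```python
def p95(samples):
    # sort the samples and take the one 5% of the way down from the largest
    s = sorted(samples)
    return s[int(0.95 * (len(s) - 1))]

def slo_met(samples, sli_ms, slices=1000, budget=1):
    # divide the window into equal slices; the SLO holds if the p95 SLI
    # was met in all but `budget` slices (999 of 1,000 for three nines)
    n = len(samples) // slices
    good = sum(1 for i in range(slices)
               if p95(samples[i * n:(i + 1) * n]) < sli_ms)
    return good >= slices - budget
```

The same function checks a toy window just as well, e.g. `slices=10, budget=1` for a ten-slice window with one slice of error budget.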
  • 28. Computing SLOs with log data @phredmoye Pros: 1. Easy to configure logs to capture latency 2. Easy to roll your own processing code, some open source options out there 3. Accurate results #SREcon Cons: 1. Expensive (see log analysis solution pricing) 2. Sampling possible but skews accuracy 3. Slow 4. Difficult to scale
  • 29. Talk Agenda ● SLO Refresher ● A Common Mistake ● Computing SLOs with log data ● Computing SLOs by counting requests ● Computing SLOs with histograms @phredmoye #SREcon
  • 30. @phredmoye 1. Count # of requests that violate SLI threshold 2. Count total number of requests 3. % success = 100 - (#failed_reqs/#total_reqs)*100 Similar to Prometheus cumulative ‘<=’ histogram #SREcon Computing SLOs by counting requests
  • 31. Computing SLOs by counting requests @phredmoye #SREcon
  • 32. Computing SLOs by counting requests @phredmoye SLO = 90% of reqs < 30ms # bad requests = 2,262 # total requests = 60,124 100-(2262/60124)*100=96.2% SLO was met #SREcon
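The arithmetic on this slide, spelled out:

```python
# counts kept by the service over the SLO window (numbers from the slide)
total_requests = 60_124
bad_requests = 2_262  # requests slower than the 30 ms SLI threshold

success_pct = 100 - (bad_requests / total_requests) * 100
print(round(success_pct, 1))  # 96.2 -- above the 90% target, so the SLO was met
```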
  • 33. @phredmoye Computing SLOs by counting requests #SREcon Pros: 1. Simple to implement 2. Performant 3. Scalable 4. Accurate Cons: 1. Fixed SLO threshold - must reconfigure 2. Look back impossible for other thresholds
  • 34. Talk Agenda ● SLO Refresher ● A Common Mistake ● Computing SLOs with log data ● Computing SLOs by counting requests ● Computing SLOs with histograms @phredmoye #SREcon
  • 35. Computing SLOs with histograms AKA distributions Sample counts in bins/buckets Gil Tene’s hdrhistogram.org (chart: sample value vs. # samples, annotated with median q(0.5), mode, q(0.9), q(1), and mean)
  • 36. @phredmoye Some histogram types: 1. Linear 2. Approximate 3. Fixed bin 4. Cumulative 5. Log Linear Computing SLOs with histograms #SREcon
  • 39. @phredmoye h(A ∪ B) = h(A) ∪ h(B) A & B must have identical bin boundaries Can be aggregated both in space and time Mergeability #SREcon
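A sketch of mergeability using plain `Counter`s as stand-ins for histograms with identical bin boundaries (bin lower bound -> sample count; the numbers are made up):

```python
from collections import Counter

# latency histograms from two nodes over the same bins
h_a = Counter({100: 40, 200: 30, 300: 10})
h_b = Counter({100: 5, 200: 25, 300: 60})

# merging across space (nodes) or time (windows) is bin-wise addition
h_merged = h_a + h_b
print(dict(h_merged))  # {100: 45, 200: 55, 300: 70}
```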
  • 40. @phredmoye How many requests are faster than 330ms? 1. Walk the bins lowest to highest until you reach 330ms 2. Sum the counts in those bins 3. Done Computing SLOs with histograms #SREcon
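The bin walk on this slide, sketched against a toy fixed-width histogram (10 ms bins; the counts are invented):

```python
# histogram: bin lower bound (ms) -> request count, with 10 ms wide bins
hist = {300: 120, 310: 80, 320: 40, 330: 15, 340: 5}

def faster_than(hist, threshold_ms, bin_width=10):
    # walk the bins from lowest to highest, summing the counts of every
    # bin that lies entirely below the threshold
    return sum(count for lower, count in sorted(hist.items())
               if lower + bin_width <= threshold_ms)

print(faster_than(hist, 330))  # 240: the [300,310), [310,320), and [320,330) bins
```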
  • 42. @phredmoye For the libcircllhist implementation we have bins at: ... 320, 330, 340, ... .... And: 10,11,12,13... .... And: 0.0000010, 0.0000011, 0.0000012, For every decimal floating point number, with 2 significant digits, we have a bin (within 10^{+/-128}). So ... where are the bin boundaries? #SREcon
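One way to picture those boundaries: a bin's lower edge is its value truncated to two significant decimal digits. This is my own illustration of the idea, not the libcircllhist code:

```python
import math

def bin_lower(x):
    # truncate a positive value to two significant decimal digits,
    # giving the lower edge of the log linear bin containing it
    exp = math.floor(math.log10(x))
    scale = 10.0 ** (exp - 1)
    return math.floor(x / scale) * scale

print(bin_lower(337.0))  # 330.0
print(bin_lower(12.4))   # 12.0
```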
  • 43. @phredmoye Pros: 1. Space Efficient (HH: ~300 bytes / histogram in practice, 10x more efficient than logs) 2. Full Flexibility: - Thresholds can be chosen as needed and analyzed - Statistical methods applicable, IQR, count_below, q(1), etc. 3. Mergeability (HH: Aggregate data across nodes) 4. Performance (ns insertions, μs percentile calculations) 5. Bounded error (half the bin size) 6. Several open source libraries available Computing SLOs with histograms #SREcon
  • 44. @phredmoye Computing SLOs with histograms #SREcon Cons: 1. Math is more complex than other methods 2. Some loss of accuracy (<<5%) in worst cases
  • 45. @phredmoye github.com/circonus-labs/libcircllhist (autoconf && ./configure && make install) github.com/circonus-labs/libcircllhist/tree/master/src/python (pip install circllhist) Log Linear histograms with Python #SREcon
  • 46. @phredmoye h = Circllhist() # make a new histogram h.insert(123) # insert value 123 h.insert(456) # insert value 456 h.insert(789) # insert value 789 print(h.count()) # prints 3 print(h.sum()) # prints 1,368 print(h.quantile(0.5)) # prints 456 #SREcon Log Linear histograms with Python
  • 47. @phredmoye from matplotlib import pyplot as plt from circllhist import Circllhist H = Circllhist() … # add latency data to H via insert() H.plot() plt.axvline(x=H.quantile(0.95), color="red") #SREcon Log Linear histograms with Python
  • 49. @phredmoye Conclusions 1. Averaging Percentiles is tempting, but misleading 2. Use counters or histograms to calculate SLOs correctly 3. Histograms give the most flexibility in choosing latency thresholds, but only a couple libraries implement them (libcircllhist, hdrhistogram) 4. Full support for (sparsely encoded-, HDR-) histograms in TSDBs still lacking (except IRONdb). #SREcon
  • 50. #SREcon Fred Moyer Developer Evangelist, Circonus @phredmoyer Thank you! slack.s.circonus slideshare.net/redhotpenguin

Editor's Notes

  1. Hello folks. Welcome to Latency SLOs Done Right. I want to give a shout out to my colleague Heinrich Hartmann who is a data scientist and originally did a blog post on the material I’m about to present. Let’s get started.
  2. I’m Fred Moyer, your host. There’s my twitter handle, it’s fredmoyer with a ph. More about me in a few minutes.
  3. How many people here think that latency is an important metric to track for their applications? Please raise your hands if you are already tracking latency in any of your services as a business critical metric.
  4. So now that we’ve seen a few folks here think that latency is an important metric to track, let me ask: for a given service in your infrastructure, how many requests were served faster than 500 milliseconds over the past month? I don’t expect anyone to have an exact answer on hand, this is a fairly specific question.
  5. And here’s a tighter version of the same question: for that same service, how many requests were served faster than 250 milliseconds over the past month? Again, I don’t expect anyone to have an exact answer on hand.
  6. But I want you to ask yourself how you could answer that question? Do you have the capabilities to glean this information from your systems? There are many tools out there which everyone here is familiar with, so your answer to that is probably yes.
  7. But I ask you, would your answer be correct? How accurate do you think it would be? As I mentioned, there’s a wide range of tools available to answer these questions, but often you need to question the answers they give. Another way to ask this question, is what level of errors do you expect in your answer?
  8. Today we’re going to be looking at how we can question those answers. First we’ll do a quick SLO refresher. Then we’ll look at a common mistake made when using percentiles. Next we’ll see how to calculate SLOs the right way using three different approaches: with log data, by counting requests, and with histograms. So let’s get started.
  9. Most people here are probably familiar with the Google SRE book. The concept of SLAs has been around for at least a decade, but service level objectives and service level indicators are also becoming ubiquitous amongst site reliability engineers. The amount of online content around these terms has been increasing rapidly. These three terms SLI, SLO, and SLA are now fairly standard lexicon amongst site reliability engineers.
  10. In addition to the Google SRE book, there are two other recent books that talk about service level objectives. The site reliability workbook has a dedicated chapter on service level objectives. Seeking SRE has a chapter on defining SLOs by Theo Schlossnagle, the Circonus CEO who got me into all this stuff five or six years ago. The site reliability workbook chapter 2 is about implementing SLOs. Go read it. One thing to note is that there isn’t one standard definition of SLO. Everyone’s business needs are different, and what you should take away from these books and their discussions of service level objectives is there are many ways to define them, so you should base your definition of a service level objective on the one that makes the most sense for your business.
  11. I’m pulling in a few excerpts from a great youtube video by Seth Vargo and Liz Fong Jones I recently watched which I think explains these concepts really well. I recommend watching the video to get an in depth understanding. The gist of that video is that SLIs drive SLOs which inform SLAs. A service level indicator is basically a metric derived measure of health for a service. For example, I could have an SLI that says my 99th percentile latency of homepage requests over the last 5 minutes should be less than 300 milliseconds.
  12. A service level objective is basically how we take a service level indicator, and extend the scope of it to quantify how we expect our service to perform over a strategic time interval. Drawing on the SLI we talked about in the previous slide, we could say that our SLO is that we want to meet the criteria set by that SLI for three nines over a trailing year window. SLAs take SLOs one step further, but use the same criteria. They are generally crafted by lawyers to limit the possibility of having to give customers money for those times when our service doesn’t perform like we committed to. SLAs are similar to SLOs, but the commitment level is relaxed, as we want our internal facing targets to be more strict than our external facing targets.
  13. A service level agreement is a legal agreement that is generally less restrictive than the SLO which the operations team is accustomed to delivering. It is crafted by lawyers and generally meant to be as risk averse as possible. When SLAs are violated, bad things happen. Customers notice, then they try to get money back from you. Executives call meetings, and folks get called on the carpet about why the SLA couldn’t be met. And here’s the kicker. If you don’t have an SLO, YOUR SLA IS YOUR SLO. So your internal reliability targets are now your external reliability targets. There’s a reason we separate SLOs and SLAs. One is a target we don’t ever want to miss. The other is a realistic measure of what we can achieve, but might not always accomplish. When we bring things like error budgets into the discussion, we want to be able to take risk with deployments and expect downtime so that we can move quickly. It’s not about moving fast and breaking things. It’s about using math to figure out what the risk is if we move a given speed.
  14. So we just did a brief refresher of Service Level Objectives, something folks here are probably somewhat familiar with already. I encourage folks here to read the books I’ve listed, but also to remember that SLOs are tools for the business, and should be tailored appropriately to your use cases. Now let’s look at a common mistake when using percentiles with SLOs - averaging percentiles.
  15. Averaging percentiles is probably the single most common mistake made when working with latency metrics. Why is this? Partly because averaging percentiles is actually a reasonable approach when systems are functioning normally and nodes are exhibiting symmetric workloads. It’s easy to get an idea of aggregate system performance in those situations by just adding up percentiles from nodes and dividing by the number of nodes, and the data from most monitoring systems makes this very easy to do. This approach becomes problematic, though, when node workloads are asymmetric. If you’re looking at an average of percentiles when that happens, you’ll rarely know it; this approach hides those asymmetries.
  16. This is a graph of 5-minute p99, p95, p90, and p50 over ~24 hours. What is the p99 over the entire time range? We have a peak of ~300 microseconds for 6 hours, and about 180 microseconds for the other 18 hours. So we can guess that the p99 is probably around 200 microseconds, right?
  17. What if I told you that 99% of the request volume here occurred during the elevated percentile levels? That makes sense, right? The system will likely be slower when it has a surge of requests thrown at it. This is a stark example of how percentile-based graphs can be deceiving.
  18. Let’s take a look at averaging percentiles over nodes. Here is the distribution of requests for 2 webservers, along with the corresponding p95s (this is over constant time). This distribution shows the sample number on the Y axis and the sample value on the X axis. You can see that the blue webserver had more samples at lower latencies, and the red webserver a flatter distribution of latencies that were generally higher than the blue webserver. What happens if we calculate p95 by averaging them vs calculating them by aggregating the samples?
  19. If we combine the samples and calculate the correct p95, we get 230 milliseconds. If I average the p95 from both sample sets, we get 430 milliseconds. That’s a 200% difference. If these sample distributions matched up exactly, we could average the p95s and get the same answer as taking the p95 of the aggregated samples, but that rarely happens, and it especially doesn’t happen when workloads are asymmetric.
  20. So how do people end up averaging percentiles? Well there are some common workflows that make it very easy. One way to collect latency data without instrumenting your application is to use a log parser like Google’s mtail which extracts latency metrics from logs, and then stores them in a Time Series Database or sends them to a statsd server or some other aggregation point. So what latency metric do you end up storing? Almost none of the open source tooling exposes the raw samples, they provide the average, the median, or any one of the common percentiles captured for analysis. So if you run mtail on each web server, you end up with storing latency percentiles for each node, which results in the graph we just showed. Sure, this is an easy situation to avoid if you have this knowledge, but that’s usually the exception as opposed to the rule. Though I’ve talked to folks who do this and say “yeah we know it’s not correct some of the time, but it’s the best we have right now and we don’t have time to change it”. Well, we’re going to see how we can do better.
  21. So let’s start off with our first of three approaches on how to compute service level objects the right way.
  22. The first approach is to compute our service level objective by using log data. This is an example Apache log line configuration to log request time as milliseconds. Pretty much anyone running a web service can log the time of the request with just a small configuration change. So for each log line emitted, we have to store about 100 bytes of data. Which means that 10M requests will cost us about one gigabyte. Remember though that if this is a web page, you’ll also be logging the request time to serve images and other static content. So if you are collecting more than just API request data you may need to multiply this number by several dozen.
  23. Once we have our logfile configured to emit latency, collecting it usually goes something like this. You stuff your metrics into a store like HDFS, Logstash, or Splunk. And then you either use Elasticsearch, Splunk, or good old fashioned grep and awk to query the logs. This is all pretty straightforward. You just need lots of servers.
  24. Once you have those latency metrics available, you can calculate your service level indicator. Just sort the samples and find the sample which is 5% from the top. That latency value is your 95th percentile. Some of the tools I just mentioned like Splunk and ElasticSearch have this capability built in.
  25. Now you can take that SLI and apply it across your SLO timeframe. Again this is a straightforward mathematical calculation and can be easily coded up or done with the tools I previously mentioned. Of course, the devil is in the details with this approach - as I previously mentioned, you’ll need a lot of servers, and this is not an option that you can really implement for realtime analysis.
  26. Let’s take a look at the pros of this approach. It’s easy to get latency out of your logs. The math is fairly easy to implement, you can use some open source tools or build your own. The results are accurate. I’ve taken this approach in the past using Splunk.
  27. So that covers approach one on computing SLOs the right way. Let’s look at another option of calculating SLOs, this time by counting requests.
  28. The approach to calculating a service level objective by counting requests is fairly simple. First you pick your SLI threshold, let’s choose 30 milliseconds. You instrument your application to count the number of total requests, and the number of requests that violated your SLI threshold. Then you calculate the percentage of successful requests - that’s your SLO. This approach is similar to the cumulative ‘le’ (less than or equal) histograms used by Prometheus. Those specify a number of predetermined bins and count up the number of requests that are under those thresholds.
  29. Here’s a visualization of this approach. The requests that violated our SLI are in red at the bottom of the image, the total count of requests are in grey. You can generate a graph like this with pretty much any monitoring or observability system out there.
  30. Our SLO is 90% of requests in less than 30 milliseconds. Calculating the SLO here is easy. 2,262 bad requests divided by 60,124 total is about 3.8%, so about 96.2% of requests succeeded. We met our SLO. Let’s take a look at the pros and cons of this approach.
  31. This is a simple approach to implement, the math is very easy. Any number of tools can be used for this approach. It’s very performant - counting requests is fast. It’s quite scalable. I can keep the two metrics I need in 128 bits of RAM, one 64 bit int for total requests, and one for unsuccessful requests. The results for this approach are accurate. It’s difficult to screw up the calculations.
  32. We’ve looked at two approaches for calculating SLOs, so now let’s talk about using histograms.
  33. There are several different types of histograms. I’m listing a few of some of the most common variations here. Linear, Approximate, Fixed bin, Cumulative, and Log Linear. These different types really represent attributes of certain histogram implementations, such that you could combine these to create a histogram that fits your business needs. For example, you could create a cumulative log linear histogram which has bins in powers of 10s, but each subsequent bin would contain the sum of the bins with lower values. For example, the hdrhistogram reference I mentioned in the previous slide is a log linear type histogram. I won’t go into detail about each of these types here, we’ll be looking at log linear histograms, but I’ve got a presentation up on slideshare which details each of these types if you want to learn more.
  34. This is the log linear histogram type that we’ve implemented at Circonus. Bin sizes increase by a factor of 10 every power of 10, and there are 90 bins between each power of 10. The X axis scale is in microseconds, y axis is number of samples. As an example, there are 90 bins between 100,000 and 1 million, each with a size of 10,000. This sample histogram shows latency distribution from a web service in microseconds. I’ve overlaid the average, median, and 90th percentile values. Note how the bin size increases from 10,000 to 100,000 at the 1 million value. I’ve listed the github repos here that show code implementations of this data structure in both C and Golang. This histogram represents about 50 million or so data samples - it’s relatively cheap to store in this format. The size needed to store it is invariant to the number of data samples.
  35. This is another implementation of the log linear histogram. This particular graph shows syscall latencies captured by eBPF for sysread and syswrite calls. Notice that each dataset in the histogram has several modes. We can also clearly see where the bin size changes at the 10 microsecond boundary. This particular histogram has 15 million samples in it.
  36. Mergeability. Histograms have the property of mergeability, which means that they can be merged together as long as they have a common set of bin boundaries. So if I have two histograms, each representing latency distributions, I can merge them together in one histogram. I could also do this for 10,000 histograms, each of those could represent the latency of a web server over a certain time period. I can also merge together histograms across time. I can take a distribution of yesterday’s latencies, and merge it with today’s to get an aggregate distribution for the combined time range.
  37. We can generate SLOs from histograms that contain latency data. Say I have a distribution of request latency in a histogram, and I want to ask how many requests are faster than 330 milliseconds. The math is simple, I walk the bins from lowest value to highest until I reach the bin that has a value of 330 milliseconds, aggregating the bin counts along the way. The sum of those samples is how many requests were faster than 330 milliseconds. Pretty simple.
  38. So I gave a lightning version of this talk a few months ago at NewOpsDays at Splunk, and then put my slides up on Twitter. Liz Fong-Jones apparently read them and brought up the good point about what happens if the value you are interested in falls between histogram bin boundaries. You don’t need to only be interested in sample values that lie on bin boundaries. If I had chosen 330 milliseconds, and my bin boundaries are 300 and 400 milliseconds, I can interpolate across those boundaries to get an approximate answer. We’ve found that errors in operational data using the log linear histogram I’ve shown have a maximum of about 5 percent. But this brings up the question of what binning algorithm is used with the log linear structure I’ve shown.
  39. In the implementation I’ve shown, we have bin boundaries at 320, 330, and 340. At the scale of 10, the bin boundaries occur at each integer. We can also represent much smaller values, which have increased precision. This log linear histogram implementation provides a very wide range of values while simultaneously being able to achieve high precision across those ranges. In practice, we generally see about 300 bins total needed to represent operational latency telemetry. The maximum error experienced with this type of data structure is bounded at a little less than 5%. Say I have a bin bounded at 10 and 11. If I insert several values of 10.99, those are interpolated in the bin to 10.5, which gives me an error of approximately 5%. That’s for a worst case sample set, which we pretty much never see in practice. So these bin boundaries provide a very good base for calculating SLOs against.
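That worst case works out as follows (a quick check of the arithmetic, not code from the talk):

```python
# worst case: every sample sits at the very top of the [10, 11) bin,
# but interpolation reports the bin midpoint
true_value = 10.99
reported = 10.5
error_pct = abs(reported - true_value) / true_value * 100
print(round(error_pct, 1))  # 4.5 -- just under the ~5% worst-case bound
```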
  40. So let’s summarize the pros of calculating SLOs using histograms. They are space efficient. In practice we see about 300 bytes per histogram, which is about 1/10th the size of the log data approach. We can choose our SLI thresholds as needed to calculate SLOs. We can also calculate inter-quartile ranges, get counts of samples below a certain threshold (remember that SLOs are business specific, not one size fits all), do standard deviation, and a number of other statistical calculations. They can be aggregated across both space and time. They are computationally efficient. The implementation I’ve just shown typically has nanosecond bin insertion latency, and percentiles can be calculated in a microsecond or so. If we have some extra time I’ll pull up the code and walk through it. Errors are bounded to half a bin size, which is typically worst case 5%. There are several open source libraries available as I’ve shown to do these kinds of calculations, you don’t need to go out there and write your own.
  41. So there are some downsides to calculating SLOs with histograms. The math is more complex than the other methods I’ve shown, but it’s still relatively simple when compared to t-digest and other quantile approximation methods. There is some loss of accuracy when compared to the other two methods, but in practice 5% is a worst case scenario which is never really seen in production workloads.
  42. So let’s take a look at how we can do some of these calculations with the log linear histogram library I mentioned, and Python. Python bindings to libcircllhist are available to install with the pip utility. You’ll need to have the libcircllhist C library installed first before running the pip command. There is not a way to specify a C dependency, at least not a way that I’m aware of.
  43. So here I’m going to create a histogram using this library, and then insert a few values. Each of these insertions should take about a nanosecond since that is handled in the C library. Then I can easily get a count of samples, and generate a quantile from these samples. The percentile calculation happens in the C library and doesn’t take more than a few microseconds. This is a pretty simple example that I think most folks here should be able to accomplish easily. Let’s look at something a little more complex.
  44. If I have a set of latency values for a web service, I can create a histogram from those as I’ve shown, and then generate a visual plot of that histogram using code that looks like this. I can also calculate my 95th percentile and draw a line on the graph. So let’s see what that might look like.
  45. This plot should look familiar - I showed it earlier in this presentation. I created this plot using actual service latency data and the commands that I just showed you. There’s about 10,000 samples here. You can scale this to several million samples with commodity hardware using the library I referenced. If we have some time leftover after questions I can show this in action.
  46. So let’s review what we’ve seen. Be careful of averaging percentiles; it’s very easy to do, but you can easily come up with results that will lead you to incorrect conclusions. The best approach for calculating SLOs is using counters or histograms. The approach using log data produces the correct results, but is economically inefficient. Histograms give you the widest range of flexibility for choosing different latency thresholds, but I’m only aware of two open source implementations, the Go and C log linear implementation from Circonus, and the hdrhistogram implementation in Java. There are not any time series databases that support storing either sparsely encoded or high dynamic range histograms yet except for IRONdb. Storing histogram data serialized on disk is one option, but you may run into some challenges at large scale.
  47. That’s it. It looks like I met my service level objective of talk length, and we have a small error budget for questions. Or if time allows, I can show some of the python code in action.