Richard “RichiH” Hartmann
Intro to open source observability
with Grafana, Prometheus,
Loki, and Tempo
I come from the trenches of tech
Bad tools are worse than no tools
Today’s reality: Disparate systems. Disparate data.
Back to the basics
Let’s rethink this
Or: Buzzwords, and their useful parts
Observability & SRE
Dirty secret
● Today's “cloud native scale” is the “Internet scale” of networks two decades ago
● The combination of metrics and events has been standard in
power measurement for half a century
● Modern tools with good engineering practices map very well onto
brownfield technology
Buzzword alert!
● “Observability”: cool new term, almost meaningless by now. What does it mean?
○ Pitfall alert: Cargo culting
○ It’s about changing the behaviour, not about changing the name
● “Monitoring” has come to mean collecting data, not using it
○ One extreme: Full text indexing
○ Other extreme: Data lake
● “Observability” is about enabling humans to understand complex
systems
○ Ask why it’s not working instead of just knowing that it’s not
If you can’t ask new questions on the fly, it’s not
observability
Complexity
● Fake complexity, a.k.a. bad design
○ Can be reduced
● Real, system-inherent complexity
○ Can be moved (monolith vs client-server vs microservices)
○ Must be compartmentalized (service boundaries)
○ Should be distilled meaningfully
SRE, an instantiation of DevOps
● At its core: Align incentives across the org
○ Error budgets allow devs, ops, PMs, etc. to optimize for shared benefits
● Measure it!
○ SLI: Service Level Indicator: What you measure
○ SLO: Service Level Objective: What you need to hit
○ SLA: Service Level Agreement: When you need to pay
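To make the relationship between SLI, SLO, and error budget concrete, here is a minimal sketch in Python. The 99.9% objective and the request counts are made-up illustration values, and `availability_sli` and `error_budget_remaining` are hypothetical helpers, not part of any tool named in this talk.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """SLI: the fraction of requests that were served successfully."""
    if total_requests == 0:
        return 1.0  # no traffic, so nothing failed
    return good_requests / total_requests


def error_budget_remaining(good_requests: int, total_requests: int, slo: float) -> float:
    """Fraction of the error budget still unspent (negative when overspent)."""
    allowed_failures = (1.0 - slo) * total_requests
    actual_failures = total_requests - good_requests
    if allowed_failures == 0:
        return 0.0 if actual_failures == 0 else float("-inf")
    return 1.0 - actual_failures / allowed_failures


# Hypothetical month: 1,000,000 requests, 400 of them failed, SLO of 99.9%.
sli = availability_sli(999_600, 1_000_000)
remaining = error_budget_remaining(999_600, 1_000_000, slo=0.999)
print(f"SLI={sli:.4%}, error budget remaining={remaining:.0%}")
```

A shared number like `remaining` is what lets devs, ops, and PMs argue about the same thing: while budget is left, shipping faster is fine; once it is spent, reliability work takes priority.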
Shared understanding
● Everyone uses the same tools & dashboards
○ Shared incentive to invest in tooling
○ Pooling of institutional system knowledge
○ Shared language & understanding of services
Services
● Service?
○ Compartmentalized complexity, with an interface
○ Different owners/teams
○ Contracts define interfaces
● Why “contract”: Shared agreement which MUST NOT be broken
○ Internal and external customers rely on what you build and maintain
● Other common term: layer
○ The Internet would not exist without network layering
○ Enables innovation, parallelizes human engineering
● Other examples: CPUs, hard disks, compute nodes, your lunch
Alerting
● Customers care about services being up, not about individual components
● Distinguish between different SLIs
○ Primary: service-relevant, for alerting
○ Secondary: informational, for debugging; might be the primary SLI of an underlying service
Anything currently or imminently impacting customer service must be alerted upon
But nothing(!) else
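The alerting rule above can be sketched as a small filter: only primary SLIs that breach their threshold page a human, while secondary SLIs stay available for debugging. This is an illustrative sketch; the `SLI` class, the metric names, and the thresholds are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLI:
    name: str
    value: float      # current measurement
    threshold: float  # a value above this counts as a breach
    primary: bool     # True: service-relevant; False: informational only

    def breached(self) -> bool:
        return self.value > self.threshold


def alerts_to_page(slis: list[SLI]) -> list[str]:
    """Page only for breached primary SLIs; everything else is for dashboards."""
    return [sli.name for sli in slis if sli.primary and sli.breached()]


# Hypothetical measurements for a checkout service.
slis = [
    SLI("checkout_error_ratio", value=0.02, threshold=0.01, primary=True),
    SLI("checkout_p99_latency_seconds", value=0.8, threshold=1.0, primary=True),
    SLI("db_replica_lag_seconds", value=45.0, threshold=30.0, primary=False),
]
print(alerts_to_page(slis))
```

The replica lag is over its threshold but does not page anyone here: it is a secondary SLI for this service, and it may well be a primary SLI for the team that owns the database.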
Prometheus
Prometheus 101
● Inspired by Google's Borgmon
● Time series database
● int64 millisecond timestamp, float64 value
● Instrumentation & exporters
● Not for event logging
● Dashboarding via Grafana
Main selling points
● Highly dynamic, built-in service discovery
● No hierarchical model, n-dimensional label set
● PromQL: for processing, graphing, alerting, and export
● Simple operation
● Highly efficient
Main selling points
● Prometheus is a pull-based system
● Black-box monitoring: Looking at a service from the outside (Does the server answer HTTP requests?)
● White-box monitoring: Instrumenting code from the inside (How much time does this subroutine take?)
● Every service should have its own metrics endpoint
● Hard API commitments within major versions
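As a rough illustration of "pull-based" and "every service has its own metrics endpoint", the sketch below serves a `/metrics` page using only the Python standard library. It is not how the official client libraries are implemented; the `REQUESTS` dictionary, the `serve` helper, and port 8000 are assumptions made for this example.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process counters, keyed by (method, code).
REQUESTS = {("get", "200"): 20, ("post", "500"): 12}


def render_metrics() -> str:
    """Render the counters in the Prometheus text exposition format."""
    lines = ["# TYPE http_requests_total counter"]
    for (method, code), value in sorted(REQUESTS.items()):
        lines.append(f'http_requests_total{{method="{method}",code="{code}"}} {value}')
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        # The monitoring system pulls: it sends a GET and we answer with text.
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics().encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8000) -> None:
    """Block forever, answering scrapes on the given (assumed) port."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling `serve()` starts the endpoint. The service only exposes its current state; deciding when and how often to collect it is left to whoever scrapes.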
Time series
● Time series are recorded values which change over time
● Individual events are usually merged into counters and/or histograms
● Changing values are recorded as gauges
● Typical examples
○ Requests to a webserver (counter)
○ Temperatures in a datacenter (gauge)
○ Service latency (histograms)
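The three metric types can be illustrated with toy classes. This is a simplified sketch for intuition only, not the Prometheus client library; the class and method names are chosen for this example.

```python
import bisect


class Counter:
    """Monotonically increasing count of events (e.g. requests served)."""

    def __init__(self) -> None:
        self.value = 0.0

    def inc(self, amount: float = 1.0) -> None:
        if amount < 0:
            raise ValueError("counters only go up")
        self.value += amount


class Gauge:
    """A value that can go up and down (e.g. a temperature)."""

    def __init__(self) -> None:
        self.value = 0.0

    def set(self, value: float) -> None:
        self.value = value


class Histogram:
    """Counts observations per bucket (e.g. request latency in seconds)."""

    def __init__(self, upper_bounds: list[float]) -> None:
        self.upper_bounds = sorted(upper_bounds)
        # One count per bound, plus a final catch-all (+Inf) bucket.
        self.bucket_counts = [0] * (len(self.upper_bounds) + 1)
        self.sum = 0.0
        self.count = 0

    def observe(self, value: float) -> None:
        # Index of the first bound >= value; larger values land in +Inf.
        index = bisect.bisect_left(self.upper_bounds, value)
        self.bucket_counts[index] += 1
        self.sum += value
        self.count += 1

    def cumulative(self) -> list[int]:
        """Observations less than or equal to each bound, ending with +Inf."""
        total, result = 0, []
        for bucket_count in self.bucket_counts:
            total += bucket_count
            result.append(total)
        return result


latency = Histogram([0.1, 0.5, 1.0])
for seconds in (0.05, 0.3, 0.5, 2.0):
    latency.observe(seconds)
print(latency.cumulative())
```

Note how the histogram merges individual events into a handful of numbers: the four observations are gone, but questions like "how many requests finished within 0.5 s" can still be answered.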
Super easy to emit, parse & read
http_requests_total{env="prod",method="post",code="200"} 1027
http_requests_total{env="prod",method="post",code="400"} 3
http_requests_total{env="prod",method="post",code="500"} 12
http_requests_total{env="prod",method="get",code="200"} 20
http_requests_total{env="test",method="post",code="200"} 372
http_requests_total{env="test",method="post",code="400"} 75
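To back up the "easy to parse" claim, here is a minimal Python sketch that parses the sample lines above and aggregates them along any label dimension. It is deliberately simplified: it assumes every line has a label set and ignores comments, escaped quotes, and optional timestamps, so it is not a full parser for the exposition format. `parse_sample` and `sum_by` are names made up for this example.

```python
import re
from collections import defaultdict

SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>.*)\}\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_sample(line: str) -> tuple[str, dict[str, str], float]:
    """Split one line into metric name, label set, and value."""
    match = SAMPLE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a labelled sample: {line!r}")
    labels = dict(LABEL_RE.findall(match["labels"]))
    return match["name"], labels, float(match["value"])


def sum_by(samples, label: str) -> dict[str, float]:
    """Aggregate sample values along one label dimension."""
    totals = defaultdict(float)
    for _name, labels, value in samples:
        totals[labels.get(label, "")] += value
    return dict(totals)


EXPOSITION = """\
http_requests_total{env="prod",method="post",code="200"} 1027
http_requests_total{env="prod",method="post",code="400"} 3
http_requests_total{env="prod",method="post",code="500"} 12
http_requests_total{env="prod",method="get",code="200"} 20
http_requests_total{env="test",method="post",code="200"} 372
http_requests_total{env="test",method="post",code="400"} 75
"""

samples = [parse_sample(line) for line in EXPOSITION.splitlines()]
print(sum_by(samples, "env"))
print(sum_by(samples, "code"))
```

Because labels are flat key-value pairs rather than a hierarchy, the same data can be sliced by `env`, by `code`, or by `method` without deciding on a tree structure up front.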
Scale
● Kubernetes is Borg
● Prometheus is Borgmon
● Google couldn't have run Borg without Borgmon (plus Omega and Monarch)
● Kubernetes & Prometheus are designed and written with each other in mind
Prometheus scale
● 1,000,000+ samples/second no problem on current hardware
● ~200,000 samples/second/core
● 16 bytes/sample compressed to 1.36 bytes/sample
The highest we saw in production on a single Prometheus instance was 15 million active time series at once!
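A quick back-of-the-envelope calculation shows what the compression figure means for disk usage. The sketch uses only the numbers from the slide (1,000,000 samples/second, 16 bytes raw, 1.36 bytes compressed) and counts sample bytes only; indexes, write-ahead logs, and other overhead are ignored, so treat the result as a lower bound.

```python
SAMPLES_PER_SECOND = 1_000_000      # ingestion rate quoted on the slide
RAW_BYTES_PER_SAMPLE = 16           # 8-byte timestamp + 8-byte float64 value
COMPRESSED_BYTES_PER_SAMPLE = 1.36  # compressed size quoted on the slide
SECONDS_PER_DAY = 24 * 60 * 60
GIB = 1024 ** 3


def bytes_per_day(samples_per_second: int, bytes_per_sample: float) -> float:
    """Sample payload written per day, ignoring index and WAL overhead."""
    return samples_per_second * SECONDS_PER_DAY * bytes_per_sample


raw = bytes_per_day(SAMPLES_PER_SECOND, RAW_BYTES_PER_SAMPLE)
compressed = bytes_per_day(SAMPLES_PER_SECOND, COMPRESSED_BYTES_PER_SAMPLE)

print(f"raw:        {raw / GIB:,.0f} GiB/day")
print(f"compressed: {compressed / GIB:,.0f} GiB/day")
print(f"ratio:      {raw / compressed:.1f}x")
```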
Long term storage
● Two long term storage solutions have Prometheus-team members
working on them
○ Thanos
■ Historically easier to run, but slower
■ Scales storage horizontally
○ Cortex
■ Easy to run these days
■ Scales storage, ingesters, and queriers horizontally
Cortex @ Grafana (largest cluster, 2021-09)
● ~65 million active series (just the one cluster)
● 668 CPU cores
● 3,349 GiB RAM
One customer running at 3 billion active series
Loki
Loki 101
● Follows the same label-based system as Prometheus
● No full text index needed, incredible speed
● Work with logs at scale, without the massive cost
● Access logs with the same label sets as metrics
● Turn logs into metrics, to make it easier to work with them
● Make direct use of syslog data, via Promtail
2019-12-11T10:01:02.123456789Z {env="prod",instance="1.1.1.1"} GET /about
● Timestamp: with nanosecond precision (indexed)
● Prometheus-style labels: key-value pairs (indexed)
● Content: log line (unindexed)
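The anatomy of the annotated log line above can be sketched in Python: split an entry into timestamp, label set, and content, then select by labels first and only scan the content of what is left. The textual layout is the slide's illustration, not necessarily Loki's wire format, and `LogEntry`, `parse_entry`, and `select` are hypothetical names for this example.

```python
import re
from dataclasses import dataclass

ENTRY_RE = re.compile(r'^(?P<ts>\S+)\s+\{(?P<labels>[^}]*)\}\s+(?P<content>.*)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


@dataclass(frozen=True)
class LogEntry:
    timestamp: str  # kept as text: datetime cannot hold nanoseconds
    labels: dict    # key-value pairs identifying the log stream
    content: str    # the raw log line, not indexed


def parse_entry(line: str) -> LogEntry:
    match = ENTRY_RE.match(line)
    if match is None:
        raise ValueError(f"unrecognised entry: {line!r}")
    labels = dict(LABEL_RE.findall(match["labels"]))
    return LogEntry(match["ts"], labels, match["content"])


def select(entries, selector: dict, needle: str = ""):
    """Narrow down by labels first, then scan the remaining content."""
    return [
        entry for entry in entries
        if all(entry.labels.get(k) == v for k, v in selector.items())
        and needle in entry.content
    ]


entries = [
    parse_entry('2019-12-11T10:01:02.123456789Z {env="prod",instance="1.1.1.1"} GET /about'),
    parse_entry('2019-12-11T10:01:03.000000001Z {env="test",instance="2.2.2.2"} GET /about'),
]
print(select(entries, {"env": "prod"}, needle="/about"))
```

The design choice this mirrors: a small label index finds the right streams cheaply, and the log content is searched by brute force only within those streams, which is why no full text index is needed.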
Loki @ Grafana Labs
● Queries regularly see 40 GiB/s
● Query terabytes of data in under a minute
○ Including complex processing of result sets
Tempo
Tempo
● Exemplars: Jump to traces from relevant logs & metrics
○ Native to Prometheus, Cortex, Thanos, and Loki
○ Exemplars work at Google scale, with the ease of Grafana
● Index and search by label sets available for those who need it
● Object store only: No Cassandra, Elastic, etc.
● 100% compatible with OpenTelemetry Tracing, Zipkin, Jaeger
● 100% of your traces, no sampling
Tempo @ Grafana Labs (2021-07)
● 2,200,000 samples per second @ 350 MiB/s
● 14-day retention @ 3 copies stored
● ~240 CPU cores (includes compression cost)
● ~450 GiB RAM
● 132 TiB object storage
● Latencies:
○ p99 - 2.5s
○ p90 - 2.3s
○ p50 - 1.6s
Bringing it together
From logs to traces
From metrics to traces
...and from traces to logs
All of this is Open Source and you can run it yourself
For a deeper dive into Open Source Observability, go to:
grafana.com/go/obscon2021
Thank you!
richih@richih.org
twitter.com/TwitchiH
github.com/RichiH/talks
