4. Why do I need monitoring at all?
• Know when things break (and act on it)
• Understand the performance characteristics of your applications
• Meet service level objectives
• Improve performance and reliability
5. Fundamentals: Push vs Pull
[Diagram: push model — the Metrics Agent next to the HTTP Server pushes events (success_event, error_event, …) to the Monitoring Service]
[Diagram: pull model — the Monitoring Service issues GET /metrics against the Metrics Agent and receives req_total, req_latency]
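The pull model boils down to the service exposing a /metrics endpoint that the monitoring server scrapes periodically. A stdlib-only Python sketch (metric names taken from the slide; values are illustrative):

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# In-memory metric values the endpoint exposes; a real agent would
# collect these from the instrumented application.
METRICS = {"req_total": 42, "req_latency": 0.135}

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        # Prometheus text exposition format: one "name value" line per sample.
        body = "".join(f"{name} {value}\n" for name, value in METRICS.items())
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.end_headers()
        self.wfile.write(body.encode())

    def log_message(self, *args):  # silence per-request logging
        pass

def serve(port=0):
    """Start the metrics endpoint in a background thread; return the bound port."""
    server = HTTPServer(("127.0.0.1", port), MetricsHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server.server_port
```

The monitoring service then simply runs GET /metrics against this port on its scrape interval; nothing is pushed.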
6. Fundamentals: Blackbox vs. Whitebox Monitoring
Blackbox Monitoring
• restricted to external service behaviour
• e.g. ping, http
Whitebox Monitoring
• shows internal service information
• e.g. request_processing_time, request_errors_total
[Diagram: blackbox — the Agent issues HTTP GET / against the HTTP Server and only sees the 200 OK response; whitebox — the Agent scrapes GET /metrics and receives errors_total, req_total, req_latency]
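The distinction can be sketched in a few lines of Python (the function and metric names here are illustrative, not from any real library):

```python
# Blackbox: only external behaviour is observable (did GET / return 200 OK?).
# `fetch` is injected so the probe can be exercised without a live network;
# in practice it would wrap e.g. urllib.request.urlopen.
def blackbox_probe(fetch, url):
    try:
        return fetch(url) == 200
    except OSError:
        return False

# Whitebox: the service itself exposes internal counters such as
# request_errors_total, typically via its /metrics endpoint.
internal_metrics = {"request_errors_total": 0, "req_total": 0}

def handle_request(ok=True):
    """Instrumented request handler updating the internal (whitebox) metrics."""
    internal_metrics["req_total"] += 1
    if not ok:
        internal_metrics["request_errors_total"] += 1
```

A blackbox probe can tell you the service is down; only the whitebox counters can tell you why requests are failing.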
7. Prometheus Project Overview
• Open-source monitoring tool, originally built at SoundCloud
• Heavily inspired by Google’s Borgmon
• Written in Go (mainly)
• Very active community (150+ contributors for core Prometheus)
• Member of the Cloud Native Computing Foundation
11. Service Discovery
• Could live without SD in static environments
• In dynamic environments (e.g. Kubernetes) you must use SD
• Pods, Services, … come and go → impossible to configure statically
[Diagram: Prometheus discovers scrape targets by querying the API Server for Pod 1, Pod 2, …, Pod n]
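Conceptually, service discovery just rebuilds the scrape-target list from whatever the cluster currently reports. A hedged Python sketch (`list_pods`, the tuple shape, and the metrics port are all illustrative assumptions, not a real Kubernetes API):

```python
def discover_targets(list_pods, port=8080):
    """Rebuild the scrape-target list from the pods that exist right now.
    `list_pods` stands in for a list/watch call against the cluster API
    server and returns (pod_name, pod_ip) pairs."""
    return {name: f"{ip}:{port}" for name, ip in list_pods()}
```

Re-running this on every discovery refresh means targets appear and disappear together with their pods, with no config edits.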
12. Jobs, Targets and Exporters
source: https://prometheus.io/docs/introduction/overview/
19. Timeseries
• Tracks values of a metric over time (timestamp t, value v)
• Timestamps increase (strictly) monotonically
• tₙ < tₙ₊₁ ∀ n ∈ ℕ
• Values can both increase or decrease
[Diagram: samples (t1,v1), (t2,v2), (t3,v3) plotted along a time axis]
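The two invariants above (strictly increasing timestamps, freely moving values) fit in a few lines; a minimal sketch, assuming a timeseries is just a list of (t, v) tuples:

```python
def append_sample(series, t, v):
    """Append a (timestamp, value) sample, enforcing that timestamps
    increase strictly monotonically; values may go up or down freely."""
    if series and t <= series[-1][0]:
        raise ValueError("timestamps must be strictly increasing")
    series.append((t, v))
```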
20. Metric Types
• Only relevant for Client libs (only untyped timeseries on the server)
• Counters: Values always increase
• requests_total, errors_total
• Gauges: Values can increase and decrease
• users_online, memory_free_bytes
• Histogram: Puts your measurements in buckets
• requests_latency_seconds_bucket
• Summary: Calculates percentiles over sliding time window
• requests_latency_seconds_summary
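The semantics of these types can be sketched stdlib-only (a simplified illustration, not the real client-library API; the Summary type's sliding-window percentiles are omitted for brevity):

```python
class Counter:
    """Values only ever increase (e.g. requests_total, errors_total)."""
    def __init__(self):
        self.value = 0.0
    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("counters cannot decrease")
        self.value += amount

class Gauge:
    """Values can increase and decrease (e.g. users_online, memory_free_bytes)."""
    def __init__(self):
        self.value = 0.0
    def set(self, value):
        self.value = value

class Histogram:
    """Puts observations into buckets (e.g. requests_latency_seconds_bucket)."""
    def __init__(self, buckets=(0.1, 0.5, 1.0, float("inf"))):
        self.buckets = sorted(buckets)
        self.counts = [0] * len(self.buckets)
        self.total = 0.0
    def observe(self, value):
        self.total += value
        # Prometheus buckets are cumulative: each le="x" bucket counts
        # every observation <= x.
        for i, le in enumerate(self.buckets):
            if value <= le:
                self.counts[i] += 1
```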
21. Labels
• A list of key/value pairs (k₁ = v₁, k₂ = v₂, …, kₙ = vₙ)
• Labels partition a metric into timeseries
• So for every label combination observed on a given metric, a separate
timeseries is created
req_total{job="job1", ver="0.1"}: 10 → timeseries_1
req_total{job="job1", ver="0.2"}: 3 → timeseries_2
req_total{job="job2", ver="0.1"}: 4 → timeseries_3
req_total{job="job1", ver="0.1"}: 12 → timeseries_1
req_total{job="job1"}: 1 → timeseries_4
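The partitioning can be sketched by keying storage on the frozen label set (a minimal Python sketch reproducing the slide's example):

```python
series = {}  # (metric, frozenset of label pairs) -> list of sample values

def record(metric, labels, value):
    """Each distinct label combination keys its own timeseries."""
    key = (metric, frozenset(labels.items()))
    series.setdefault(key, []).append(value)

record("req_total", {"job": "job1", "ver": "0.1"}, 10)
record("req_total", {"job": "job1", "ver": "0.2"}, 3)
record("req_total", {"job": "job2", "ver": "0.1"}, 4)
record("req_total", {"job": "job1", "ver": "0.1"}, 12)  # same labels -> same series
record("req_total", {"job": "job1"}, 1)                 # fewer labels -> new series
```

Five writes, but only four timeseries: the two samples with identical labels land in the same one.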
23. PromQL
• Query language to select and aggregate timeseries
• Queries evaluate to either
• Instant Vector: multiple timeseries, each with one sample at the same timestamp
• requests_total{job="prom-boot"}
• Range Vector: multiple timeseries, each with a range of samples over time
• requests_total{job="prom-boot"}[5m]
• Scalar: a single numeric value
• scalar(sum(requests_total))
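PromQL's rate() consumes a range vector of counter samples and produces a per-second increase. A deliberately simplified Python sketch of that computation (the real implementation also extrapolates to the window boundaries and handles counter resets):

```python
def rate(samples, now, window):
    """Per-second increase of a counter over the samples falling inside
    [now - window, now]; `samples` is a list of (timestamp, value) tuples."""
    in_window = [(t, v) for t, v in samples if now - window <= t <= now]
    if len(in_window) < 2:
        return 0.0  # not enough points to compute a slope
    (t0, v0), (tn, vn) = in_window[0], in_window[-1]
    return (vn - v0) / (tn - t0)
```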
24. Alerting Rules
ALERT ErrorRateHigh
IF rate(errors_total[5m]) > 10
LABELS {
service = "service_1",
severity = "warning"
}
ANNOTATIONS {
summary = "A high number of errors in service",
description = "{{ $value }} errors/s have been registered
within the last 5 minutes for {{ $labels.instance }}"
}
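What the rule evaluation amounts to can be sketched in Python (Prometheus itself evaluates rules server-side; this standalone function and its return shape are illustrative only):

```python
def evaluate_alert(error_rate, threshold=10.0):
    """Sketch of the ErrorRateHigh rule above: fire when the 5-minute
    error rate exceeds the threshold, attaching labels and annotations."""
    if error_rate <= threshold:
        return None  # condition not met -> alert stays inactive
    return {
        "alert": "ErrorRateHigh",
        "labels": {"service": "service_1", "severity": "warning"},
        "annotations": {
            "summary": "A high number of errors in service",
            "description": f"{error_rate} errors/s have been registered "
                           "within the last 5 minutes",
        },
    }
```

A firing alert carries its labels and rendered annotations on to the Alertmanager for routing and notification.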