stackconf 2023 | How to reduce expenses on monitoring with VictoriaMetrics by Roman Khavronenko.pdf

NETWAYS
How to reduce expenses on monitoring
with VictoriaMetrics
Roman Khavronenko | github.com/hagen1778
Roman Khavronenko
Co-founder of VictoriaMetrics
Software engineer with experience in distributed systems,
monitoring and high-performance services.
https://github.com/hagen1778
https://twitter.com/hagen1778
What this talk is about
1. Best ways to store and process metrics
2. Open source tools only
3. For people familiar with Prometheus,
Thanos, Mimir, VictoriaMetrics
Expenses!
You can either have a faster car…
…or be a smarter driver!
What can you gain from a simple replacement?
Prometheus remote-write benchmark
Prometheus vs VictoriaMetrics benchmark
# the number of nodeexporter instances to scrape
targetsCount: 1000
# how frequently to scrape nodeexporter targets
scrapeInterval: 15s
# rules evaluation interval
# https://awesome-prometheus-alerts.grep.to/rules.html#host-and-hardware-1
queryInterval: 30s
# scrapeConfigUpdatePercent is a churn rate generated once
# per scrapeConfigUpdateInterval
scrapeConfigUpdatePercent: 5
scrapeConfigUpdateInterval: 10m
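The load implied by this config can be worked out with a little arithmetic. The sketch below is only an illustration; it assumes that scrapeConfigUpdatePercent means that share of targets is replaced once per scrapeConfigUpdateInterval, as the config comment suggests.

```python
# Values copied from the benchmark config above.
targets_count = 1000
scrape_interval_s = 15
update_percent = 5
update_interval_s = 10 * 60

# Average number of target scrapes per second across all targets.
scrapes_per_second = targets_count / scrape_interval_s

# Assumption: update_percent of all targets is replaced in every churn step.
targets_per_update = targets_count * update_percent // 100
updates_per_day = 24 * 60 * 60 // update_interval_s
targets_replaced_per_day = targets_per_update * updates_per_day

print(f"{scrapes_per_second:.1f} scrapes/s")
print(f"{targets_per_update} targets replaced every {update_interval_s // 60}m")
print(f"{targets_replaced_per_day} targets replaced per day")
```

Under that assumption, the churn setting alone produces thousands of new target identities per day, which is why churn is part of the benchmark.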
16x faster!
1.9x faster!
1.7x less memory!
2.5x less!
Summary after 7d benchmark (1k nodeexporter targets)
Prometheus:
CPU avg used: 0.79 / 3 cores
Mem max used: 8.12 GiB / 12 GiB
Read latency avg:
50th - 70.5ms
99th - 7s
VictoriaMetrics:
CPU avg used: 0.76 / 3 cores
Mem max used: 4.5 GiB / 12 GiB
Read latency avg:
50th - 4.3ms
99th - 3.6s
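The ratios behind this summary can be recomputed directly from the numbers listed. This is a small sketch using only the values from the slide; the dictionary keys are names chosen here for illustration. The results need not match the chart annotations exactly, since those may be based on different aggregates.

```python
# Numbers copied from the 7d benchmark summary (latencies converted to ms).
prometheus = {"cpu_cores": 0.79, "mem_gib": 8.12, "p50_ms": 70.5, "p99_ms": 7000.0}
victoriametrics = {"cpu_cores": 0.76, "mem_gib": 4.5, "p50_ms": 4.3, "p99_ms": 3600.0}

# Ratio of Prometheus to VictoriaMetrics for every metric in the summary.
ratios = {key: prometheus[key] / victoriametrics[key] for key in prometheus}

for key, ratio in ratios.items():
    print(f"{key}: {ratio:.1f}x")
```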
Data transfer costs
Network data transfer costs
4.5x less!
Improving network compression
1. Increase compression level, trade CPU for network savings:
   a. -remoteWrite.vmProtoCompressLevel
2. Increase batch size, trade latency for compression:
   a. -remoteWrite.maxBlockSize
   b. -remoteWrite.maxRowsPerBlock
   c. -remoteWrite.flushInterval
3. Reduce entropy to improve compression:
   a. -remoteWrite.significantFigures
   b. -remoteWrite.roundDigits
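The first two trade-offs can be observed with any general-purpose compressor. The sketch below uses zlib from the Python standard library as a stand-in compressor and a made-up text encoding of samples; it does not reproduce the actual remote write wire format, and the series names are hypothetical.

```python
import random
import zlib

rng = random.Random(42)  # fixed seed so runs are repeatable

# Hypothetical samples in text form: a repeated series name plus a value.
lines = [
    f'node_cpu_seconds_total{{instance="host-{i % 50}",mode="idle"}} {rng.random():.6f}\n'
    for i in range(2000)
]

def compressed_size(rows, batch_size, level):
    """Total bytes after compressing rows in independent batches."""
    total = 0
    for start in range(0, len(rows), batch_size):
        batch = "".join(rows[start:start + batch_size]).encode()
        total += len(zlib.compress(batch, level))
    return total

small_batches = compressed_size(lines, batch_size=20, level=1)
large_batches = compressed_size(lines, batch_size=2000, level=1)
higher_level = compressed_size(lines, batch_size=2000, level=9)

print("small batches, low level: ", small_batches)
print("large batches, low level: ", large_batches)
print("large batches, high level:", higher_level)
```

Larger batches give the compressor more repeated context per block, and a higher level spends more CPU searching for matches; both reduce the bytes sent at the cost of latency or CPU.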
Keeping only significant figures
instance:cpu_utilization:ratio_avg{instance="foo"} 0.05055757575781
instance:cpu_utilization:ratio_avg{instance="bar"} 0.05058181818236
rules:
  - record: instance:cpu_utilization:ratio_avg
    expr: avg_over_time(instance:node_cpu_utilization:ratio[5m])
Keeping only significant figures
Applying --vm-significant-figures=8 to recording rules:
0.05055757575781 → 0.050557576
This reduced the storage size from 1.2 bytes to 0.8 bytes per sample.
See more at https://medium.com/victoriametrics-how-to-migrate-data-from-prometheus
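Rounding to significant figures is easy to reproduce. The sketch below is an illustration only: round_significant is a helper written for this example, the sample values are randomly generated, and compressing a text dump with zlib is a stand-in for the storage engine's own encoding, which works differently.

```python
import random
import zlib

def round_significant(value: float, figures: int) -> float:
    """Round value to the given number of significant decimal figures."""
    return float(f"{value:.{figures}g}")

print(round_significant(0.05055757575781, 8))  # the value from the slide

rng = random.Random(7)  # fixed seed so runs are repeatable
values = [0.05 + rng.random() / 1000 for _ in range(5000)]

def compressed_text_size(samples):
    """Bytes needed for a zlib-compressed, newline-separated text dump."""
    text = "\n".join(repr(v) for v in samples).encode()
    return len(zlib.compress(text, 6))

raw_size = compressed_text_size(values)
rounded_size = compressed_text_size([round_significant(v, 8) for v in values])

print("raw bytes:    ", raw_size)
print("rounded bytes:", rounded_size)
```

Dropping the trailing digits removes most of the noise from each value, so fewer bytes are needed per sample. The cost is precision, so this suits derived values such as averages better than raw counters.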
How to be smarter about data
Understanding the data - query tracing
VictoriaMetrics supports query tracing for detecting bottlenecks during query processing.
This is similar to EXPLAIN ANALYZE in PostgreSQL!
https://play.victoriametrics.com
Query tracing demo!
If the query tracing demo didn't work…
A typical query takes 4s to execute… Why?
Let's check the trace!
91% of the time was spent on vmselect while aggregating 9.4k series and 13 million data samples!
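A trace can also be requested and inspected outside the UI. The sketch below assumes that tracing is enabled by adding trace=1 to a query request and that the response carries a "trace" tree with duration_msec, message and children fields. The base URL, the sample trace and its messages are hypothetical and only mirror the 91% figure above.

```python
from urllib.parse import urlencode

def trace_query_url(base_url: str, query: str) -> str:
    """Build an instant-query URL with tracing enabled via trace=1."""
    return f"{base_url}/api/v1/query?" + urlencode({"query": query, "trace": "1"})

def slowest_steps(trace: dict, top: int = 3) -> list:
    """Flatten a trace tree and return the slowest nodes as (msec, message)."""
    nodes, stack = [], [trace]
    while stack:
        node = stack.pop()
        nodes.append((node["duration_msec"], node["message"]))
        stack.extend(node.get("children") or [])
    return sorted(nodes, reverse=True)[:top]

# Hypothetical trace, shaped like the assumed "trace" field of a response.
sample_trace = {
    "duration_msec": 4000.0,
    "message": "query_range: sum(rate(http_requests_total[5m]))",
    "children": [
        {"duration_msec": 3640.0, "message": "aggregate series on vmselect"},
        {"duration_msec": 360.0, "message": "fetch matching series from storage"},
    ],
}

print(trace_query_url("http://localhost:8428", "up"))
for msec, message in slowest_steps(sample_trace):
    print(f"{msec / sample_trace['duration_msec']:.0%} {message}")
```

Sorting the flattened nodes by duration points straight at the step that dominates the query time.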
How to improve query speed?
1. Add more resources to monitoring.
2. Or… be smarter about data!
Cardinality explorer demo!
https://play.victoriametrics.com
If cardinality explorer demo didn't work…
Cardinality explorer: summary
VictoriaMetrics allows exploring time series cardinality to identify:
● Metric names with the highest number of series
● Labels with the highest number of series
● Values with the highest number of series for the selected label
● label=value pairs with the highest number of series
● Labels with the highest number of unique values
* Built into VictoriaMetrics components
* Can also be pointed at a Prometheus URL
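The same statistics can be processed programmatically. The sketch below assumes a TSDB status response in the shape of the Prometheus-style /api/v1/status/tsdb endpoint, with sections such as seriesCountByMetricName and labelValueCountByLabelName. The sample body and all numbers in it are made up for illustration.

```python
def top_entries(tsdb_status: dict, section: str, limit: int = 3) -> list:
    """Return the (name, value) pairs with the largest values in one section."""
    entries = tsdb_status["data"].get(section) or []
    ranked = sorted(entries, key=lambda entry: entry["value"], reverse=True)
    return [(entry["name"], entry["value"]) for entry in ranked[:limit]]

# Hypothetical response body; the numbers are invented for this example.
sample_status = {
    "status": "success",
    "data": {
        "totalSeries": 9400,
        "seriesCountByMetricName": [
            {"name": "node_cpu_seconds_total", "value": 3200},
            {"name": "grpc_server_handled_total", "value": 5100},
            {"name": "up", "value": 1100},
        ],
        "labelValueCountByLabelName": [
            {"name": "instance", "value": 1000},
            {"name": "grpc_method", "value": 120},
        ],
    },
}

for section in ("seriesCountByMetricName", "labelValueCountByLabelName"):
    print(section, top_entries(sample_status, section, limit=2))
```

Ranking each section this way shows which metric names and labels are the first candidates for dropping, relabeling or aggregating.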
Streaming aggregation vs Recording rules
With recording rules, the number of time series stored in the TSDB is data-in plus the recording rules results.
With streaming aggregation, the number of time series stored in the TSDB is only what needs to be persisted.
How to use streaming aggregation
- match: "grpc_server_handled_total"  # time series selector
  interval: "2m"                      # on 2m interval
  outputs: ["total"]                  # aggregate as counter
  without: ["grpc_method"]            # group without label
Result:
grpc_server_handled_total:2m_without_grpc_method_total
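What such a rule does to the data can be sketched in a few lines. The function below is a simplified illustration, not the actual implementation: it sums the increases of each input counter into an output series grouped without the given labels, uses the first sample of a series only as a baseline, and ignores the flush interval. The sample label sets are hypothetical.

```python
from collections import defaultdict

def aggregate_total(samples, without):
    """Sum per-series counter increases into groups that drop `without` labels.

    samples: iterable of (labels_dict, counter_value) in arrival order.
    """
    last_seen = {}
    totals = defaultdict(float)
    for labels, value in samples:
        series_key = tuple(sorted(labels.items()))
        group_key = tuple(sorted((k, v) for k, v in labels.items() if k not in without))
        previous = last_seen.get(series_key)
        if previous is None:
            increase = 0.0  # assumption: first sample only sets the baseline
        elif value >= previous:
            increase = value - previous
        else:
            increase = value  # a drop in value is treated as a counter reset
        totals[group_key] += increase
        last_seen[series_key] = value
    return dict(totals)

# Two hypothetical input series that differ only in grpc_method.
samples = [
    ({"job": "api", "grpc_method": "Get"}, 10.0),
    ({"job": "api", "grpc_method": "Put"}, 5.0),
    ({"job": "api", "grpc_method": "Get"}, 14.0),
    ({"job": "api", "grpc_method": "Put"}, 8.0),
]

result = aggregate_total(samples, without={"grpc_method"})
print(result)
```

Two input series collapse into one output series here; with many values of grpc_method the reduction in stored series grows accordingly.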
How to use streaming aggregation
https://play.victoriametrics.com
Streaming aggregation: summary
1. Aggregates incoming samples in streaming mode before data is written to remote storage
2. Applies to all metrics received via any supported data ingestion protocol and/or scraped from Prometheus-compatible targets
3. An alternative to StatsD
4. An alternative to recording rules
5. Reduces the number of stored samples
6. Reduces the number of stored series
7. Compatible with tools supporting the Prometheus remote write protocol
Complexity penalty
Cortex architecture
Mimir architecture
VictoriaMetrics architecture
Complexity penalty
● Complex systems are harder to maintain
● Complex systems are harder to teach and learn
● Complex systems are more expensive to scale
Additional materials
1. Snapshot of Grafana dashboard from the benchmark
2. Benchmark repo for reproducing the test
3. Save network costs with VictoriaMetrics remote write protocol
4. VictoriaMetrics: achieving better compression than Gorilla for time series data
5. Streaming aggregation
6. VictoriaMetrics playground
Questions?
● https://github.com/VictoriaMetrics
● https://github.com/hagen1778