Lviv IT Arena is a conference designed for programmers, designers, developers, top managers, investors, entrepreneurs and startup founders. It takes place annually at the beginning of October in Lviv at the Arena Lviv stadium. In 2016 the conference gathered more than 1800 participants and over 100 speakers from companies like Microsoft, Philips, Twitter, Uber and IBM. More details about the conference are available at itarena.lviv.ua.
7. Metrics
Metrics @UBER are a first-class citizen
T0 Service
Handling ~500M telemetry timeseries
Writing ~3M values/sec and running ~1K queries/sec
50M minutes worth of data per sec
Growing >25% month over month
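To put the growth figure in perspective, here is a rough compounding calculation. It assumes the 25% monthly rate holds constant for a full year, which the slide does not claim (it only gives a lower bound), and it treats the ~3M values/sec write rate as steady.

```python
import math

# Assumption: growth stays at exactly 25% per month for 12 months.
monthly_growth = 0.25
months = 12

yearly_factor = (1 + monthly_growth) ** months
print(f"Growth over {months} months: {yearly_factor:.1f}x")

# Doubling time in months: solve (1 + g)^n = 2 for n.
doubling_months = math.log(2) / math.log(1 + monthly_growth)
print(f"Doubling time: {doubling_months:.1f} months")

# Values written per day at a steady ~3M values/sec.
values_per_day = 3_000_000 * 24 * 60 * 60
print(f"Values per day: {values_per_day:,}")
```

Under these assumptions the volume grows roughly 14.6x per year and doubles about every 3.1 months.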
13. Metrics Collection
Cassandra is a figure of epic tradition and of tragedy.
High write throughput
Cassandra data model supports time series data stores: DTCS (Date-Tiered Compaction Strategy)
Cassandra's native TTL support
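The properties listed above map onto table options in CQL. The following is a minimal sketch, not the schema from the talk: the keyspace and table name `metrics.datapoints`, the column names, and the 2-day retention are all hypothetical. The Python code only builds the CQL statement; running it would require a Cassandra driver and a live cluster.

```python
# Hypothetical schema: names and the retention period are illustrative only.
RETENTION_SECONDS = 2 * 24 * 60 * 60  # assumed 2-day retention

# One partition per series, rows ordered by time, DTCS compaction,
# and a table-wide default TTL so old datapoints expire on their own.
CREATE_TABLE_CQL = f"""
CREATE TABLE IF NOT EXISTS metrics.datapoints (
    series_id text,
    ts timestamp,
    value double,
    PRIMARY KEY (series_id, ts)
) WITH CLUSTERING ORDER BY (ts DESC)
  AND compaction = {{'class': 'DateTieredCompactionStrategy'}}
  AND default_time_to_live = {RETENTION_SECONDS};
"""

print(CREATE_TABLE_CQL)
```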
14. Metrics Collection
Cassandra - our use case
Separate clusters for different types of data
Clusters span multiple datacenters
Dynamically control to which cluster data is written
Forcibly deleting old data
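Dynamic control over the write target can be pictured as a routing table that is updated at runtime. This is a sketch under assumptions, not the mechanism used in the talk: the `ClusterRouter` class, the data-type labels, and the cluster names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ClusterRouter:
    """Maps a data type to a cluster name; routes can be changed at runtime."""
    default_cluster: str
    routes: Dict[str, str] = field(default_factory=dict)

    def set_route(self, data_type: str, cluster: str) -> None:
        # Later writes of this data type go to the new cluster.
        self.routes[data_type] = cluster

    def cluster_for(self, data_type: str) -> str:
        # Fall back to the default cluster for unrouted data types.
        return self.routes.get(data_type, self.default_cluster)


router = ClusterRouter(default_cluster="cassandra-general")
router.set_route("counters", "cassandra-counters")
first = router.cluster_for("counters")

# Redirect writes without restarting the writers.
router.set_route("counters", "cassandra-counters-v2")
second = router.cluster_for("counters")
other = router.cluster_for("timers")

print(first, second, other)
```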
https://github.com/m3db/m3db/
21. Observability: Past, Present, and Future
Alerting based on metrics
Query Based Alerting
graphite.absolute_threshold(
    'scale(sumSeries(transformNull(stats.*.counts.api.velocity_filter.uber.views.*.*.blocked, 0)), 0.1)',
    alias='velocity filter blocked requests',
    warning_over=0.1,
    critical_over=10.0,
)
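The alert definition above only declares the query and its limits. How a query result might be turned into an alert state can be sketched as follows. The function `absolute_threshold_state` is hypothetical, and the strict greater-than comparison is an assumption; the slides do not specify the evaluation semantics.

```python
from typing import Optional


def absolute_threshold_state(value: float,
                             warning_over: Optional[float] = None,
                             critical_over: Optional[float] = None) -> str:
    """Classify a single query result against static upper limits."""
    # Check the more severe limit first so it wins when both are exceeded.
    if critical_over is not None and value > critical_over:
        return "critical"
    if warning_over is not None and value > warning_over:
        return "warning"
    return "ok"


# Same limits as the alert above: warning_over=0.1, critical_over=10.0.
for value in (0.05, 2.0, 25.0):
    print(value, absolute_threshold_state(value, warning_over=0.1, critical_over=10.0))
```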
22. Observability: Past, Present, and Future
Alerting based on metrics
Classic Thresholding
Classic high / low thresholds have some intrinsic problems.
• Labor-intensive: each threshold is hand-tuned and manually updated.
• Too sensitive: hard to set thresholds for metrics with large fluctuations, even if there’s an obvious pattern.
• Not sensitive enough: thresholds take a long time to catch slow degradations.
• Poor UX: configuring really good alerts requires specialized knowledge of the query language.
• No guidance: system doesn’t offer automated root cause exploration.
23. Observability: Past, Present, and Future
Alerting based on metrics
Intelligent Monitoring
• Zero config: thresholds are set and maintained automatically.
• Dynamic adjustment: thresholds cope with noise, underlying growth, seasonality and rollouts.
• Rapid detection: embarrassingly parallel algorithm is efficient enough for minute-by-minute analysis at scale.
• Integrated UX: works within our existing telemetry and alert configuration systems.
• Helpful: automated root cause analysis.
In short, the only input is a list of business-critical metrics.
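The idea of thresholds that adjust themselves can be illustrated with a rolling window. This sketch is not the algorithm from the talk, which the slides do not describe. It swaps in a simple, well-known technique: bounds at the rolling median plus or minus k times the median absolute deviation (MAD). The function names, window size, and sample data are hypothetical.

```python
from statistics import median
from typing import List, Sequence, Tuple


def dynamic_bounds(history: Sequence[float], k: float = 3.0) -> Tuple[float, float]:
    """Return (lower, upper) as median +/- k * MAD of the recent history."""
    center = median(history)
    mad = median(abs(x - center) for x in history)
    return center - k * mad, center + k * mad


def find_anomalies(series: Sequence[float], window: int = 5, k: float = 3.0) -> List[int]:
    """Return indices whose value falls outside bounds derived from the previous window."""
    anomalies = []
    for i in range(window, len(series)):
        lower, upper = dynamic_bounds(series[i - window:i], k)
        if not lower <= series[i] <= upper:
            anomalies.append(i)
    return anomalies


# Illustrative data: a slowly growing metric with one sharp drop at index 7.
series = [100, 102, 101, 103, 102, 104, 103, 60, 105, 104]
print(find_anomalies(series))  # [7]
```

Because the bounds follow the recent median, gradual growth does not trigger alerts, and a single outlier barely shifts the bounds for the following points. A perfectly flat history gives a MAD of zero, so a production version would need a minimum band width.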
24. Observability: Past, Present, and Future
Alerting based on metrics
Dynamic Thresholds
The max lower threshold exceeds the min upper threshold
25. Observability: Past, Present, and Future
Alerting based on metrics
Outage Detection
< 1% outages missed.
6.5 out of 10 alerts are true issues.
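The two figures above correspond to the standard recall and precision measures: fewer than 1% of outages missed means recall above 0.99, and 6.5 true issues per 10 alerts means precision of 0.65. A small sketch of the arithmetic follows; the counts are hypothetical and chosen only to reproduce the ratios on the slide.

```python
def precision(true_alerts: int, total_alerts: int) -> float:
    """Share of fired alerts that pointed at a real issue."""
    return true_alerts / total_alerts


def recall(detected_outages: int, total_outages: int) -> float:
    """Share of real outages that triggered an alert."""
    return detected_outages / total_outages


# Hypothetical counts, not data from the talk.
p = precision(true_alerts=65, total_alerts=100)
r = recall(detected_outages=995, total_outages=1000)
print(p, r)
```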