You can't spell "monitoring" without "monoid"

Well, technically you can, but if you do you’re probably lying to yourself with the data you’re collecting.

If you care about aggregating your monitoring data over time or across dimensions like hosts or container instances, you care about monoids; you just might not know it! In this talk, I’ll explain what a monoid is (it’s not scary, I promise!) and why monoids form the basis for scalable telemetry data types. We’ll see how naive approaches to metrics can end up giving you the wrong answers to important questions, and how a more mathematically well-founded approach can fix those problems.

Published in: Data & Analytics

  1. I’M KEVIN. I WORK AT NEW RELIC. I LIKE MATH. I’m Kevin Scaldeferri. I work for New Relic as a Principal Engineer and distributed systems architect, and I’m sort of a math geek, which leads to writing talk titles like this one:
  2. YOU CAN’T SPELL MONITORING WITHOUT MONOID. Kevin Scaldeferri, New Relic. Before jumping into the math, first some motivation.
  3. I DON’T REALLY LIKE “METRIC TIME SERIES”. I have a confession to make: I don’t really like metric time series. “But they’re so simple!”
  4. EASY IS NOT THE SAME AS SIMPLE. Rich Hickey. No, they are easy, not simple. Doing the easy thing often intertwines multiple concepts in ways that complicate thinking about them.
  5. STORY TIME. Let me tell you a story. We’re having an incident, and people are looking for the root cause.
  6. “Aha! CPU on this DB is going up and to the right. Page the database team!” DB team: “Nope, that’s normal. That metric’s not a gauge, it’s an accumulative counter; it’s always up and to the right.”
  7. More unhelpful charts of accumulative counters. Why are all these instances different? Is there a real difference, or were they just restarted at different times?
  8. WHAT ABOUT GAUGES?
  9. PDX -> SEA: 10 ✈ @ 40 min, 1000 🚙 @ 3.5 hr. How long does it take to get from Portland to Seattle on average?
  10. PDX -> SEA: 10 ✈ @ 40 min, 1000 🚙 @ 3.5 hr. Avg = (10*40 + 1000*210) / 1010 = 208 min. Is this right? Not really.
  11. PDX -> SEA: 10 ✈ @ 40 min <- 120 people each; 1000 🚙 @ 3.5 hr <- 1 person each. Avg = (1200*40 + 1000*210) / 2200 = 117 min. You need to weight the average correctly: by travelers, not by vehicles. But what’s that got to do with metrics?
  12. AVERAGE RESPONSE TIME. Host A: 10ms, Host B: 12ms, Host C: 80ms. What is the average response time of this app, where one host is slow for some reason?
  13. AVERAGE RESPONSE TIME. Host A: 10ms, Host B: 12ms, Host C: 80ms. Average: 34ms?
  14. AVERAGE RESPONSE TIME. Host A: 10ms, Host B: 12ms, Host C: 80ms. Average: 34ms? NO!
  15. AVERAGE RESPONSE TIME. Host A: 10ms, Host B: 12ms, Host C: 80ms <- LB sends it less traffic. Average: 34ms? NO! Don’t average averages.
  16. PERCENTILES. Everyone knows percentiles are better than averages anyway.
  17. "MyResource.post-requests": { "p50": 0.001, "p75": 0.002, "p95": 0.006, "p98": 0.007, "p99": 0.008, "p999": 0.018 } A per-host p99 is easy to come by, but my SL[IOA] is the p99 for the app overall, and that can’t be computed from per-host percentiles.
  18. Gil Tene. Shameless appeal to authority.
  19. UNIQUE COUNTS. Businesses really care about unique counts. How many unique users are coming to the site? How many unique users have tried a new feature?
  20. UNIQUE USERS (daily): 10 18 20 19 17 15 12. Weekly unique users? We’re in trouble if we have daily unique counts and try to get a weekly value.
  21. UNIQUE USERS (daily): 10 18 20 19 17 15 12. 20 ≤ weekly unique users ≤ 111. It could be anywhere from 20 (the largest single day) to 111 (the sum of all days), which isn’t very satisfying to your business owner.
  22. WELL THIS IS SORT OF DEPRESSING
  23. MATH TO THE RESCUE!
  24. “A MONOID IS AN ALGEBRAIC STRUCTURE WITH A SINGLE ASSOCIATIVE BINARY OPERATION AND AN IDENTITY ELEMENT.” (Wikipedia) What is a monoid? … What does that mean?
  25. “ALGEBRAIC STRUCTURE” = DATA TYPE
  26. “ASSOCIATIVE BINARY OPERATION” = SOMETHING LIKE ADDITION
  27. “IDENTITY ELEMENT” = SOMETHING LIKE ZERO
  28. The same definition as an interface:

        interface Monoid<T> {
          // Associativity: (x + y) + z = x + (y + z)
          add(x: T, y: T): T
          // Identity: 0 + x = x = x + 0
          zero(): T
        }

      But it’s not just addition. For example, multiplication and string concatenation also satisfy these rules.
  29. HOW DOES THIS HELP? How does this simple concept help fix the problems with our easy approach?
  30. TEMPORAL AND DIMENSIONAL AGGREGATION
  31. AGGREGATION

        interval    1   2   3   4   5   6   7   8   9  10  11  12
        host A     10  14  15  19  17  15  12  11  12  15  14  17
        host B      9  13  12  15  16  17  16  14   9  11  12  15
        host C     10  15  13  16  13  19  15  16  13  13  12  14
        host D     10  13  13  17  14  20  13  15  12  12  13  15

      10-second resolution is great for tactical debugging.
  32. AGGREGATION

        interval  1-4  5-8  9-12
        host A     58   55    58
        host B     49   63    47
        host C     54   63    52
        host D     53   62    52

      But for long-term analysis that resolution is too expensive, and we want time roll-ups.
  33. AGGREGATION

        interval    1   2   3   4   5   6   7   8   9  10  11  12
        host A     10  14  15  19  17  15  12  11  12  15  14  17
        host B      9  13  12  15  16  17  16  14   9  11  12  15
        host C     10  15  13  16  13  19  15  16  13  13  12  14
        host D     10  13  13  17  14  20  13  15  12  12  13  15

      Similarly, we want all those high-cardinality dimensions to track down problems and answer ad-hoc questions.
  34. AGGREGATION

        interval    1   2   3   4   5   6   7   8   9  10  11  12
        all hosts  39  55  53  67  60  71  56  56  46  51  51  62

      But you also need to measure SLIs. And a year from now you won’t care about that container ID.
  35. ACCUMULATIVE COUNTERS. Replace accumulative counters with ...
  36. DELTA COUNTERS. ... delta counters.
  37. 12 REQUESTS TO THIS ENDPOINT WERE RECEIVED BY THIS HOST DURING THIS TIME INTERVAL. Useful Monitoring. Some of our sources of telemetry insist on giving us accumulators, but we need to convert them to something like this as quickly as possible.
  38. (AND YOU SHOULD SUM THEM). Useful Monitoring. And that measurement needs to tell us how to combine multiple data points.
  39. A MONOID IS BOTH THE DATA AND THE OPERATION. There’s more than one monoid on longs and doubles (sum, max, and min, for example), and we need to be clear about what’s sensible to do with a particular metric.
  40. MIN / MAX GAUGES. Don’t sum or average a max or min. Take the max of all your maxes and the min of all your mins.
  41. THE MAX MEMORY USED BY THIS HOST DURING THIS TIME INTERVAL WAS 1.2GB; AND AGGREGATE USING MAX. Useful Monitoring. This should be explicit, not something you have to extract from the metric name.
  42. MONOIDS COMPOSE
  43. AVERAGE RESPONSE TIME. Host A: 10ms, Host B: 12ms, Host C: 80ms. Average: ??? How do we do this right?
  44. AVERAGE RESPONSE TIME. Host A: 10s / 1000 reqs = 10ms avg; Host B: 10.8s / 900 reqs = 12ms avg; Host C: 9.6s / 120 reqs = 80ms avg. Average: ??? Break it into two sum monoids: one for the total time of requests and one for the total number of requests.
  45. AVERAGE RESPONSE TIME. Host A: 10s / 1000 reqs = 10ms avg; Host B: 10.8s / 900 reqs = 12ms avg; Host C: 9.6s / 120 reqs = 80ms avg. Avg: 30.4s / 2020 reqs = 15ms avg. Now we can aggregate correctly and get the right answer. The Prometheus histogram Bryan showed yesterday is a more complicated example, where you have to know exactly how to combine all those individual lines together. We can do better. We have structured logs, so why not structured metrics?
  46. APPROXIMATION WITH RIGOR. Monoids tell us how to design approximate algorithms that are still mathematically sound.
  47. UNIQUE COUNTS. Let’s revisit our unique count example.
  48. UNIQUE USERS (daily): 10 18 20 19 17 15 12. Weekly unique users? We know that the unique counts for each day aren’t sufficient to let us calculate the unique users for the week, so what should we do? This is not at all obvious.
  49. “HYPERLOGLOG: THE ANALYSIS OF A NEAR-OPTIMAL CARDINALITY ESTIMATION ALGORITHM” (Flajolet et al.). There has been lots of research, but at this point everyone pretty much agrees HyperLogLog is the way to go.
  50. UNIQUE USERS - HYPERLOGLOG: 1000110 0111101 01…; 1010011 0010110 001…; 1110010 0101101 00…; 0001110 0001101 01…; 1010010 0100101 00…; 1110110 0001101 11…; 1100100 0111010 111…. Weekly unique users = 25. Each sketch takes about 700 bytes, so you don’t want to track a ton of these, but that’s reasonable for high-value business metrics.
  51. PERCENTILES. What about percentiles? The good news is that there are lots of ways to approximate percentiles monoidally. But the bad news is also that there are lots of ways to approximate percentiles monoidally.
  52. RE-AGGREGATABLE PERCENTILES ▸ MomentSketch ▸ Q-Digest ▸ T-Digest ▸ GK-Array ▸ HDRHistogram ▸ Spectator histogram ▸ CLWY “Random” Algorithm ▸ DDSketch. This is not a complete list, just some of the best-known approaches. They all make tradeoffs in a multi-dimensional space of speed, size, and accuracy, and this is still an active area of research. Unlike unique counts, we don’t have a consensus about which approach all our monitoring tools should use, which makes it hard to compare data across multiple systems.
  53. Gratuitous dog photo in case you were getting overwhelmed by math about now.
  54. WRAPPING UP
  55. METRIC TIME SERIES ▸ Can be misleading / surprising ▸ Accumulative counters: please stop! ▸ Easy to do mathematical nonsense ▸ Accurate aggregation often impossible. Metric time series have been the easy and dominant paradigm for monitoring data over the last decade or so, but they present challenges in today’s environment.
  56. MONOIDS ▸ Data that tells us what math makes sense ▸ Collect high-resolution, high-cardinality data ▸ Aggregate after the fact as needed ▸ Composable ▸ Guides the design of approximate algorithms. Monoids provide a simple framework that allows us to build mathematically sound monitoring systems.
  57. CHALLENGES ▸ Self-describing data that includes how to aggregate ▸ Composite data types ▸ Universal support for HyperLogLogs ▸ Consensus on quantile estimation. If we’re adding units and descriptions for humans to our metrics (à la OpenCensus), why not richer type annotations? Quantiles are hard; maybe OpenTelemetry should tackle this.
  58. THANK YOU. KEVIN SCALDEFERRI, @KSCALDEF
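
To make slides 28 and 39 to 45 concrete, here is a minimal TypeScript sketch of the monoid interface with a sum monoid, a max monoid, and a composite "timing" monoid built from two sums. Only `Monoid`, `add`, and `zero` come from the slides; every other name (`sumMonoid`, `maxMonoid`, `Timing`, `combineAll`, `meanMs`) is an illustrative assumption, and the memory readings are made up. The host timings are the numbers from slide 44.

```typescript
// Sketch only: apart from Monoid/add/zero, all names here are hypothetical.
interface Monoid<T> {
  add(x: T, y: T): T; // must be associative
  zero(): T;          // identity element for add
}

// Sum: the right way to combine delta counters.
const sumMonoid: Monoid<number> = {
  add: (x, y) => x + y,
  zero: () => 0,
};

// Max: the right way to combine max gauges (identity is -Infinity).
const maxMonoid: Monoid<number> = {
  add: (x, y) => Math.max(x, y),
  zero: () => -Infinity,
};

// Fold any number of values with a monoid. Associativity is what lets the
// same fold work across time windows, across hosts, or across both.
function combineAll<T>(m: Monoid<T>, values: T[]): T {
  return values.reduce((acc, v) => m.add(acc, v), m.zero());
}

// Monoids compose: a pair of sums is again a monoid.
interface Timing {
  totalMs: number; // total time spent serving requests
  count: number;   // number of requests
}

const timingMonoid: Monoid<Timing> = {
  add: (x, y) => ({ totalMs: x.totalMs + y.totalMs, count: x.count + y.count }),
  zero: () => ({ totalMs: 0, count: 0 }),
};

// The average is derived only at query time, after aggregation.
function meanMs(t: Timing): number {
  return t.count === 0 ? NaN : t.totalMs / t.count;
}

// Hosts A, B, and C from slide 44.
const hostTimings: Timing[] = [
  { totalMs: 10000, count: 1000 },
  { totalMs: 10800, count: 900 },
  { totalMs: 9600, count: 120 },
];

const appTiming = combineAll(timingMonoid, hostTimings);
const appMeanMs = meanMs(appTiming); // 30400 / 2020, about 15 ms

// Averaging the per-host averages instead gives the misleading 34 ms.
const naiveMeanMs = combineAll(sumMonoid, [10, 12, 80]) / 3;

// Hypothetical per-host max-memory readings in GB, aggregated with max.
const peakMemoryGb = combineAll(maxMonoid, [1.2, 0.9, 1.1]);
```

Storing `Timing` rather than a precomputed average is the design choice the talk argues for: the stored value stays re-aggregatable, and the lossy division happens last.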
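
Slides 35 to 38 ask for accumulative counters to be converted to delta counters as early as possible. The sketch below shows one way that conversion might look. The function names `toDeltas` and `rollUp` and the cumulative readings are hypothetical, and treating any decrease as a restart from zero is an assumption, not something the talk specifies. The host A series is taken from slide 31.

```typescript
// Sketch only: toDeltas and rollUp are hypothetical helper names.

// Convert cumulative readings into per-interval deltas.
// Assumption: a reading lower than its predecessor means the process
// restarted and the counter began again from zero.
function toDeltas(readings: number[]): number[] {
  const deltas: number[] = [];
  let previous: number | undefined = undefined;
  for (const current of readings) {
    if (previous !== undefined) {
      deltas.push(current >= previous ? current - previous : current);
    }
    previous = current;
  }
  return deltas;
}

// Deltas form a sum monoid, so a time roll-up is just a sum per window.
function rollUp(deltas: number[], width: number): number[] {
  const out: number[] = [];
  for (let start = 0; start < deltas.length; start += width) {
    const window = deltas.slice(start, start + width);
    out.push(window.reduce((a, b) => a + b, 0));
  }
  return out;
}

// Made-up cumulative readings with a restart between 130 and 5.
const cumulative = [100, 112, 130, 5, 20, 38, 50];
const deltas = toDeltas(cumulative);  // [12, 18, 5, 15, 18, 12]
const rolledUp = rollUp(deltas, 3);   // [35, 45]

// Host A from slide 31, rolled up into the three windows of slide 32.
const hostA = [10, 14, 15, 19, 17, 15, 12, 11, 12, 15, 14, 17];
const hostARollUp = rollUp(hostA, 4); // [58, 55, 58]
```

Once the data is in delta form, a chart of it no longer goes "up and to the right" by construction, and instances restarted at different times become directly comparable.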
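
The unique-count problem on slides 20, 21, and 48 can be illustrated with exact sets: set union is a monoid with the empty set as identity, whereas daily counts alone only bound the weekly value between their maximum and their sum. The sketch below uses made-up user IDs and hypothetical names (`unionAll`, `mergeRegisters`). The second function shows the register-wise max that HyperLogLog uses to merge two sketches of equal size; it is not a full HyperLogLog implementation, and the register values are invented.

```typescript
// Sketch only: all names and data here are hypothetical.

// Exact unique counting as a monoid: union, with the empty set as identity.
function unionAll<T>(sets: Set<T>[]): Set<T> {
  const out = new Set<T>();
  sets.forEach((s) => s.forEach((x) => out.add(x)));
  return out;
}

// Three days of made-up user IDs.
const days: Set<string>[] = [
  new Set(["ann", "bob"]),
  new Set(["bob", "cy", "dee"]),
  new Set(["ann", "dee"]),
];

// With only the daily counts, all we can state are bounds.
const dailyCounts = days.map((d) => d.size);             // [2, 3, 2]
const lowerBound = Math.max(...dailyCounts);             // 3
const upperBound = dailyCounts.reduce((a, b) => a + b, 0); // 7

// With the mergeable structure, we get the actual answer.
const exactUnique = unionAll(days).size;                 // 4

// HyperLogLog keeps an array of small registers instead of the full set.
// Two sketches of the same size merge by taking the max of each register,
// which is associative and has the all-zero array as identity.
function mergeRegisters(a: number[], b: number[]): number[] {
  if (a.length !== b.length) {
    throw new Error("register arrays must have the same length");
  }
  return a.map((v, i) => Math.max(v, b[i] ?? 0));
}

const mergedRegisters = mergeRegisters([3, 1, 2], [0, 5, 2]); // [3, 5, 2]
```

Exact sets grow with the number of users, which is why the talk points to HyperLogLog: the sketch stays at a fixed size (about 700 bytes in the slide's example) while remaining mergeable.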
