Rethinking metrics: metrics 2.0

Published in: Engineering

  1. 1. rethinking metrics: metrics 2.0
  2. 2. by niteroi @ panoramio.com
  3. 3. vimeo.com/43800150
  4. 4. problems; Metrics 2.0 concepts; implementations; uses & ideas
  5. 5. terminology sync
  6. 6. (1234567890, 82) (1234567900, 123) (1234567910, 109) (1234567920, 77) db15.mysql.queries_running host=db15 mysql.queries_running
  7. 7. Problems
  8. 8. Vimeo.com pagerequests/s? server X write perf?
  9. 9. stats.hits.vimeo_com stats_counts.hits.vimeo_com stats.*.vimeo_requests collectd.db.disk.sda1.disk_time.write
  10. 10. Understanding metrics Terminology? Meaning? Prefix? Unit? Aggregation? Source?
  11. 11. Unclear, inconsistent terminology; format tightly coupled; lacking information
  12. 12. http://litlquest.com/forest-trees/see-forest-trees-2
  13. 13. O(S*P*A*C) S = # Sources P = # People A = # Aggregators C = #Complexity
  14. 14. Graphs and dashboards are a huge time sink.
  15. 15. metrics 2.0 concepts
  16. 16. Self-describing Standardized Orthogonal dimensions
  17. 17. stats.timers.dfs5. proxy-server.object.GET.200. timing.upper_90
  18. 18. { server: dfvimeodfsproxy5, http_method: GET, http_code: 200, unit: ms, metric_type: gauge, stat: upper_90, swift_type: object }
  19. 19. allow more characters unit: Req/s, site: vimeo.com, ...
  20. 20. Metadata meta: { src: proxy.py:458, from: diamond }
  21. 21. Datamodel
  22. 22. Any protocol
  23. 23. Source format … service=foo instance=host unit=B 123 1234567890 {s}foo.{i}host.{u}B 123 1234567890 <uuid> 125 1234567890 #separate data …
  24. 24. metrics20.org
  25. 25. MB/s Err/d Req/h ... B Err Warn Conn Job File Req ... SI + IEC
  26. 26. Immediate understanding of metrics; minimize time to graphs, alerting rules, debugging; compatibility & flexibility in tooling
  27. 27. Implementations examples
  28. 28. Carbon-tagger … stats.gauges.host.foo 125 1234567890 service=foo instance=host target_type=gauge unit=B 123 1234567890 …
  29. 29. Statsdaemon unit=B unit=B ... unit=ms unit=ms ... unit=B/s unit=ms stat=mean unit=ms stat=upper_90 ...
  30. 30. Keep metric tags in sync with data
  31. 31. Graphing & dashboarding Visualization Alerting
  32. 32. Graphing & Dashboarding
  33. 33. Graph Explorer
  34. 34. Graph-Explorer queries 101 proxy-server swift server:regex unit=ms (AND)
  35. 35. upper_90 (or stat=upper_90) from <datetime> to <datetime> avg over <timespec> (5M, 1h, 3d, ...)
  36. 36. Compare object put/get stack … http_method:(PUT|GET) swift_type=object avg by http_code,server
  37. 37. Comparing servers http_method:(PUT|GET) group by unit,target_type avg by http_code, swift_type,http_method
  38. 38. transcode unit=Job/s avg over <time> from <datetime> to <datetime>
  39. 39. Note: data is obfuscated
  40. 40. Bucketing sum by zone:eu-west|us-east| ap-southeast|us-west| sa-east|vimeo-df|vimeo-lv group by state
  41. 41. Note: data is obfuscated
  42. 42. Compare job states per region (zones bucket) group by zone
  43. 43. Note: data is obfuscated
  44. 44. Unit conversion unit=Mb/s network server:regex sum by server
  45. 45. Integration Metric unit=B/s Query unit=TB
  46. 46. Deriving Metric unit=B Query unit=GB/d
  47. 47. Future work: facet-based suggestions, custom trees
  48. 48. Dashboard definition queries = [ 'cpu usage sum by core', 'mem unit=B !total group by type:swap', 'stack network unit=Mb/s', 'unit=B (free|used) group by =mountpoint' ]
  49. 49. Equivalence servers.host.cpu.total.iowait → “core” : “_sum_” servers.host.cpu.<core-number>.iowait servers.host.loadavg.15
  50. 50. Future Work
  51. 51. ● Storage aggregation rules ● graphite API functions such as cumulative, summarize and smartSummarize ● consolidateBy & Graph renderers
  52. 52. Self-describing & standardized stat=upper/lower/mean/... target_type=counter..
  53. 53. Visualizations
  54. 54. From: dygraphs.com
  55. 55. Select your view
  56. 56. bin=10 bin=20 bin=30 bin=40 bin=50 bin=100
  57. 57. Alerting
  58. 58. unit=Err/s
  59. 59. Automatic cause & effect
  60. 60. Different algorithms for different things
  61. 61. Alert criticality & routing based on tags
  62. 62. integrating logs & metrics
  63. 63. Algorithms leverage both logs and metrics
  64. 64. Changing software
  65. 65. Conclusion structured self-describing standardized metrics = enabler
  66. 66. Conclusion: What are your concerns? Ideas? Let's make this better. Ready for early adopters! Work with me on next-gen telemetry! Tips on coordinating spec development? How do FB/G/AMZ/MS/APL/... do this stuff?
  67. 67. Seen in this presentation: metrics20.org vimeo.github.io/graph-explorer github.com/vimeo/timeserieswidget github.com/vimeo/carbon-tagger github.com/vimeo/statsdaemon github.com/graphite-ng/carbon-relay-ng github.com/Dieterbe/anthracite
  68. 68. You might also like: github.com/vimeo/graphite-influxdb github.com/vimeo/graphite-api-influxdb-docker github.com/vimeo/whisper-to-influxdb github.com/Dieterbe/influx-cli github.com/graphite-ng/graphite-ng github.com/vimeo/smoketcp github.com/vimeo/tailgate
  69. 69. Stay in touch! groups.google.com/forum/#!forum/metrics20 groups.google.com/forum/#!forum/it-telemetry twitter.com/Dieter_be dieter.plaetinck.be
  70. 70. Q&A
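
Slides 23 and 28 show a tagged line of the form `service=foo instance=host unit=B 123 1234567890`. A minimal Python sketch of reading such a line might look like the following. It assumes the layout is space-separated key=value tags followed by a value and a Unix timestamp, as the slide suggests. The function names `parse_metric_line` and `metric_id` are hypothetical and not taken from carbon-tagger or the metrics20.org spec, and the sorted-tags identity is only an illustration of order-independence.

```python
def parse_metric_line(line):
    """Split 'k=v k=v ... value timestamp' into (tags, value, timestamp).

    Assumed layout, based on the example line shown on the slide.
    """
    # The last two fields are value and timestamp; everything before is tags.
    *tag_fields, value, timestamp = line.split()
    tags = {}
    for field in tag_fields:
        key, sep, val = field.partition("=")
        if not sep or not key or not val:
            raise ValueError(f"not a key=value tag: {field!r}")
        tags[key] = val
    return tags, float(value), int(timestamp)


def metric_id(tags):
    """Hypothetical order-independent identity: sorted key=value pairs."""
    return " ".join(f"{k}={v}" for k, v in sorted(tags.items()))


tags, value, ts = parse_metric_line(
    "service=foo instance=host unit=B 123 1234567890"
)
print(metric_id(tags), value, ts)
```

Because the identity is built from sorted tags, two senders that emit the same tags in a different order would map to the same series in this sketch.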

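Slides 45 and 46 describe querying a `unit=B/s` metric as `unit=TB` (integration) and a `unit=B` metric as `unit=GB/d` (deriving). The arithmetic behind that idea can be sketched as below. This is not Graph-Explorer's implementation: the helper names, the left-rectangle integration rule, and the decimal SI prefix table are all assumptions made for illustration.

```python
# Decimal SI prefixes (assumed; IEC binary prefixes are left out here).
SI = {"": 1.0, "k": 1e3, "M": 1e6, "G": 1e9, "T": 1e12}
SECONDS_PER_DAY = 86400


def integrate(points):
    """Turn (timestamp, rate per second) points into a total amount.

    Uses each point's rate for the interval up to the next point.
    """
    total = 0.0
    for (t0, rate), (t1, _) in zip(points, points[1:]):
        total += rate * (t1 - t0)
    return total


def derive_per_day(points, prefix=""):
    """Turn (timestamp, amount) points into per-day rates, scaled by prefix."""
    rates = []
    for (t0, v0), (t1, v1) in zip(points, points[1:]):
        per_second = (v1 - v0) / (t1 - t0)
        rates.append((t1, per_second * SECONDS_PER_DAY / SI[prefix]))
    return rates


# Metric unit=B/s, queried as unit=TB: integrate, then scale to tera.
total_bytes = integrate([(0, 100.0), (10, 100.0), (20, 200.0)])
total_tb = total_bytes / SI["T"]

# Metric unit=B, queried as unit=GB/d: derive, then scale to giga per day.
gb_per_day = derive_per_day([(0, 0.0), (10, 1e9), (20, 3e9)], prefix="G")
```

The point of the sketch is that the conversion can be chosen from the units alone: a time denominator present on the metric but absent from the query implies integration, and the reverse implies deriving.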