
Monitoring Akka with Kamon 1.0

Talk at react.sphere conference
Krakow, 16.04.2018



  1. 1. Monitoring Akka with Kamon 1.0 Dr. Steffen Gebert
  2. 2. Insights into the inner workings of an application become crucial, at the latest when performance and scalability issues are encountered. This is especially challenging in distributed systems, for example when using Akka Cluster. A popular open-source solution for monitoring on the JVM in general, and Akka in particular, is Kamon. With its recently reached 1.0 milestone, it provides both metrics collection and tracing for Akka applications, whether they run standalone or distributed. This talk gives an introduction to Kamon 1.0 with a focus on its metrics features. It describes the basic setup using Prometheus and Grafana and gives an overview of the different modules and their APIs for implementing custom metrics. The resulting setup records both automatically exposed metrics about Akka's actor systems and metrics tailored to the monitored application's domain and service level indicators. Finally, lessons learned from a first-time user's experience of getting started with Kamon are reported. The example of adding instrumentation to EMnify's mobile core application illustrates how easy it is to get started, and how to kill Prometheus on a daily basis. Abstract
  3. 3. • Steffen • has a heart beating for infrastructure • writes code at EMnify • PhD in computer science, topic: software-based networks • EMnify • MVNO focussed on IoT • runs virtualized mobile core network • Würzburg/Berlin, Germany About Me & Us @StGebert Slides available at st-g.de/speaking
  4. 4. • Kamon Overview • Metrics Instrumentation • Setup: Kamon with Prometheus & Grafana • Experience at EMnify • Summary Agenda
  5. 5. • Our application is slow • Nagios did not tell us • APM did Application Performance Monitoring
  6. 6. Kamon
  7. 7. Kamon • Open Source • Monitoring for the JVM • Integrations for Akka • Release 1.0 in January 2018 kamon.io / github.com/kamon-io
  8. 8. • Tracing • Per-request call graph • Context propagation across nodes • Exemplary objectives: • Request profiling • Understanding call graph • Metrics Kamon: Feature Set
  9. 9. Exemplary Trace
  10. 10. • Tracing • Per-request call graph • Context propagation across nodes • Exemplary objectives: • Request profiling • Understanding call graph • Metrics • Time series data • Counters / gauges / distributions • Exemplary objectives: • Function call counts and latency • Open DB connections • User logins • Generated revenue Kamon: Feature Set
  11. 11. • Custom Metrics • added to your code where it makes sense • Automatic Instrumentation • integrations into Akka, Akka HTTP, Play, JDBC, Servlet • system and JVM metrics Metrics
  12. 12. • Counter • function calls • customer buying our product • Gauge • number of open DB connections • mailbox size Custom Metric Types
  13. 13. • Histogram • latencies • shopping cart total prices • Timer • latencies • RangeSampler • number of open DB connections • mailbox size Custom Metric Types (2) [Figure: histogram (single sample), observations per value]
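The difference between the metric types on the two slides above is easiest to see in code. The following is a minimal sketch using simplified stand-in classes written for this illustration; they are not Kamon's classes, and all names in it are hypothetical. It assumes that a range sampler keeps the minimum and maximum seen within a reporting window in addition to the current value, which is what makes it a better fit than a plain gauge for fast-changing values such as open DB connections.

```java
// Simplified stand-ins for illustration only; these are NOT Kamon's classes.
final class MetricTypesSketch {

    /** Counter: only ever goes up, e.g. the number of function calls. */
    static final class SketchCounter {
        private long total;

        void increment(long times) {
            if (times < 0) throw new IllegalArgumentException("counters cannot decrease");
            total += times;
        }

        long value() { return total; }
    }

    /** Gauge: remembers only the most recently set value. */
    static final class SketchGauge {
        private long current;

        void set(long value) { current = value; }

        long value() { return current; }
    }

    /** Range sampler: current value plus the min/max seen since the last window reset. */
    static final class SketchRangeSampler {
        private long current, min, max;

        void increment() { update(current + 1); }

        void decrement() { update(current - 1); }

        private void update(long v) {
            current = v;
            min = Math.min(min, v);
            max = Math.max(max, v);
        }

        long current() { return current; }
        long min() { return min; }
        long max() { return max; }

        /** Called once per reporting interval to start a new min/max window. */
        void resetWindow() { min = current; max = current; }
    }

    /** Opens three connections and closes two; returns {current, max}. */
    static long[] simulateConnections() {
        SketchRangeSampler open = new SketchRangeSampler();
        open.increment(); open.increment(); open.increment();
        open.decrement(); open.decrement();
        return new long[] { open.current(), open.max() };
    }

    public static void main(String[] args) {
        long[] r = simulateConnections();
        System.out.println("current=" + r[0] + " max=" + r[1]); // current=1 max=3
    }
}
```

A gauge sampled at the end of this sequence would report 1 and miss the peak of 3; the range sampler retains it.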
  14. 14. • Kamon.counter("hello.krakow").increment(); • Histogram hist = Kamon.histogram("age"); hist.record(33); hist.record(21); • CounterMetric c = Kamon.counter("participants"); Counter cReact = c.refine("conference", "react"); Counter cScala = c.refine("conference", "scala"); cReact.increment(42); Custom Metrics: Implementation
  15. 15. • Actor system metrics • processed messages • active actors • unhandled messages • dead letters • Per actor performance metrics • processing time (per message) • time in mailbox • mailbox sizes • errors Kamon Akka [Diagram: a message passing through the mailboxes of actors A, B and C]
  16. 16. • Metrics related to • routers • dispatchers • executors • actor groups • remoting (with kamon-akka-remote) • Requirement (AOP) • AspectJ Weaver or • Kanela (Kamon Agent) Kamon Akka (2)
  17. 17. Kamon + Prometheus + Grafana Setup
  18. 18. Related Projects Targets Time Series DB Dashboard simple_client DropWizard Metrics Micrometer Commercial Tools Datadog, Dynatrace, Instana, NewRelic, etc.
  19. 19. • Time Series Database • collection, storage & query of metrics data • based on Google's Borgmon, CNCF project • Pull-based model • scrapes configured targets • HTTP endpoints on monitored targets • Easy deployment • statically linked Golang binaries • single YAML config file • Alertmanager.. for alerting ;-) Prometheus
  20. 20. • Integrated time series database • on disk, no external dependency • fixed retention period, no long-term storage / downsampling • very efficient storage [1] • query language PromQL Prometheus TSDB [1] Storing 16 bytes at scale, Fabian Reinartz @ PromCon 2017
  21. 21. Setup Application Targets Node Exporter cAdvisor Service Discovery (AWS EC2, Kubernetes, etc.) Time Series DB Dashboard
  22. 22. • Exporter output (scraped by Prometheus via HTTP): myapp_checkouts{product="sim_4ff"} 42.0 myapp_checkouts{product="sim_embedded"} 5412.0 akka_system_dead_letters_total{system="test"} 224.0 … • Querying with PromQL rate(akka_system_dead_letters_total[5m]) // handles counter resets / overflows Ingesting & Querying
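To make the "handles counter resets" remark concrete, here is a minimal sketch of the idea behind `rate()`: any drop in a counter's value is treated as a restart from zero. It is a simplification and not Prometheus's implementation; in particular it ignores the extrapolation to the window boundaries that the real function performs. The class name and the sample values are hypothetical.

```java
final class CounterRateSketch {

    /** Total increase across the samples, treating any drop as a counter reset to zero. */
    static double increase(double[] samples) {
        double total = 0.0;
        for (int i = 1; i < samples.length; i++) {
            double prev = samples[i - 1];
            double cur = samples[i];
            // After a reset the counter restarted from 0, so the whole current value is new.
            total += (cur >= prev) ? (cur - prev) : cur;
        }
        return total;
    }

    /** Average per-second rate of increase over the given window. */
    static double ratePerSecond(double[] samples, double windowSeconds) {
        return increase(samples) / windowSeconds;
    }

    public static void main(String[] args) {
        // Hypothetical scrapes of akka_system_dead_letters_total; the application
        // restarted between the second and third sample.
        double[] samples = { 200, 220, 10, 40 };
        System.out.println(increase(samples));           // 60.0
        System.out.println(ratePerSecond(samples, 300)); // 0.2
    }
}
```

Without the reset handling, the naive difference between last and first sample (40 - 200) would be negative.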
  23. 23. • Just a frontend to supply PromQL queries and build dashboards • Kamon Akka dashboard available at grafana.com/dashboards/4469 Grafana
  24. 24. with Kamon EMnify's Experience
  25. 25. • Tick interval (Kamon) and scrape frequency (Prometheus) • both should match! • usually (?) 30s or 60s • for load tests, we went for 5s • hope to go for 15s in production • Deployment [for development / load tests] • EC2 instances tagged in CloudFormation plus EC2 service discovery • started simple (stupid): Prometheus in container on AWS ECS with EFS Our Experiences with Kamon+Prometheus Docker automated build config github.com/EMnify/prometheus-docker
  26. 26. • Little CPU resources + NFS storage + high cardinality = • High cardinality? • akka_actor_processing_time_seconds_bucket{⏎ class="com.example.SomethingFrequentlyUsed", ⏎ le="0.33", …⏎ path="mysystem/some-supervisor/$aX"} How to Kill Prometheus (Regularly)
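The cardinality problem on this slide comes down to simple arithmetic: every distinct combination of label values is its own time series, and a Prometheus histogram adds one `_bucket` series per `le` boundary plus `_sum` and `_count`. A `path` label that contains auto-generated actor names such as `$aX` therefore multiplies the series count by the number of anonymous actors ever created. The sketch below only illustrates that multiplication; the actor and bucket counts are hypothetical and not taken from the talk.

```java
final class CardinalitySketch {

    /** Time series produced by one histogram metric. */
    static long histogramSeries(long labelCombinations, int bucketsIncludingInf) {
        // Each label combination yields one _bucket series per boundary, plus _sum and _count.
        return labelCombinations * (bucketsIncludingInf + 2L);
    }

    public static void main(String[] args) {
        // Hypothetical numbers: 5,000 anonymous child actors, 10 buckets each.
        System.out.println(histogramSeries(5_000, 10)); // 60000
    }
}
```

That figure is for a single per-actor histogram; each further per-actor metric adds its own series on top.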
  27. 27. • Define actor groups kamon.akka.actor-groups += "mygroup" kamon.util.filters { "akka.tracked-actor" { excludes = ["mysystem/some-supervisor/*"] } mygroup { includes = ["mysystem/some-supervisor/*"] } } • Delete Prometheus data to recover • Continue to watch out for metrics with unnamed actors How to Fix Kamon to Not Kill Prometheus
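The effect of the two filters above can be sketched as glob matching on actor paths: a path is tracked individually only if it matches an include pattern and no exclude pattern, while the group filter collects the excluded children into one aggregated set of metrics. The code below is an illustrative re-implementation, not Kamon's filter code. It assumes that `*` matches within one path segment and `**` across segments, and that the `akka.tracked-actor` filter includes `**` (the slide shows only its excludes).

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.regex.Pattern;

final class ActorFilterSketch {

    /** Converts a glob to a regex: '**' crosses '/' boundaries, '*' stays within one segment. */
    static Pattern globToRegex(String glob) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < glob.length(); i++) {
            char c = glob.charAt(i);
            if (c == '*') {
                boolean doubleStar = i + 1 < glob.length() && glob.charAt(i + 1) == '*';
                sb.append(doubleStar ? ".*" : "[^/]*");
                if (doubleStar) i++; // consume the second '*'
            } else {
                sb.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(sb.toString());
    }

    /** Accepted when at least one include matches and no exclude matches. */
    static boolean accepts(String path, List<String> includes, List<String> excludes) {
        boolean included = includes.stream().anyMatch(g -> globToRegex(g).matcher(path).matches());
        boolean excluded = excludes.stream().anyMatch(g -> globToRegex(g).matcher(path).matches());
        return included && !excluded;
    }

    public static void main(String[] args) {
        List<String> trackedIncludes = Arrays.asList("**"); // assumption, not shown on the slide
        List<String> trackedExcludes = Arrays.asList("mysystem/some-supervisor/*");
        List<String> groupIncludes = Arrays.asList("mysystem/some-supervisor/*");

        String child = "mysystem/some-supervisor/$aX";
        System.out.println(accepts(child, trackedIncludes, trackedExcludes));          // false
        System.out.println(accepts(child, groupIncludes, Collections.emptyList()));    // true
    }
}
```

With this split, the anonymous children no longer produce one time series each; they only contribute to the metrics of `mygroup`.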
  28. 28. • Limit the number of samples per scrape: <scrape_config> # Per-scrape limit on number of scraped samples that will be accepted. [ sample_limit: <int> | default = 0 ] • Watch for limit kicking in: prometheus_target_scrapes_exceeded_sample_limit_total How to Fix Prometheus to Not Kill Itself
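The behaviour of `sample_limit` can be sketched as follows: Prometheus counts the samples in a scrape and, if the limit is exceeded, treats the entire scrape as failed rather than keeping a truncated part of it; a limit of 0 disables the check. The sketch counts sample lines in text exposition output. It is a simplification with hypothetical names, and it ignores that the real limit is applied after metric relabelling.

```java
final class SampleLimitSketch {

    /** Counts sample lines in text exposition output, skipping blank lines and # comments. */
    static int countSamples(String exposition) {
        int n = 0;
        for (String line : exposition.split("\n")) {
            String t = line.trim();
            if (!t.isEmpty() && !t.startsWith("#")) n++;
        }
        return n;
    }

    /** A limit of 0 means "no limit"; otherwise the whole scrape is accepted or rejected. */
    static boolean scrapeAccepted(String exposition, int sampleLimit) {
        return sampleLimit == 0 || countSamples(exposition) <= sampleLimit;
    }

    public static void main(String[] args) {
        String scrape =
            "# TYPE myapp_checkouts counter\n"
            + "myapp_checkouts{product=\"sim_4ff\"} 42.0\n"
            + "myapp_checkouts{product=\"sim_embedded\"} 5412.0\n";
        System.out.println(countSamples(scrape));      // 2
        System.out.println(scrapeAccepted(scrape, 1)); // false
    }
}
```

Because a rejected scrape stores nothing for that target, the limit protects Prometheus but leaves a gap in the data, which is why watching the `prometheus_target_scrapes_exceeded_sample_limit_total` counter matters.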
  29. 29. Bonus: Kamino
  30. 30. • Hosted service • by Kamon developers • currently in private beta • no price tags, yet • Great user experience for us • tailored to Akka monitoring • distributions over time • still, few rough edges Kamino Hosted Service Targets Time Series DB Dashboard
  31. 31. Per-Actor Metrics
  32. 32. Example: Fixing a Bottleneck [Figure annotations: restart, deployment]
  33. 33. • Kamon offers wide range of APM features • customized and automated metric collection • works with both on-prem/OSS and SaaS "backends" • super friendly community, thanks Ivan! • distributed tracing • Monitor your application (from the inside!) • now! • better start small Summary & Conclusion
  34. 34. Find me at the Speaker's Roundtable Questions, please!
  35. 35. Backup
  36. 36. • Data Collection • Core • Akka • Akka Remote • Akka HTTP • Play • JDBC • Executors • System Metrics • Reporting • Metrics: Prometheus, Kamino (WIP: Datadog, InfluxDB, statsd) • Tracing: Zipkin, Jaeger, Kamino • Logs: Logback • Context Propagation • Akka Remote, Akka HTTP, Play • http4s Kamon: Modules
  37. 37. Setup with Kamon [Diagram: a JVM runs your application (port 80) together with Kamon and kamon-prometheus (port 9095); Prometheus (storage, retrieval, PromQL; port 9090) scrapes kamon-prometheus and the Node Exporter (port 9100); Grafana queries Prometheus as its data source]
  38. 38. Kamon.histogram( "datavolume", MeasurementUnit.information().gigabytes(), DynamicRange.apply( 0, // lowestDiscernibleValue 10000, // highestTrackableValue 2 // significantValueDigits ) ); Measurement Units / Dynamic Ranges
  39. 39. Prometheus Architecture
  40. 40. • Kamon core trackable values • highest trackable values for range sampler / histogram • can be adjusted per metric • Default Prometheus histogram buckets might not fit • global default can be adjusted • PR pending for overriding per metric [1] Adjusting Value Ranges / Aggregation [1] kamon-io/kamon-prometheus#12
  41. 41. Histograms [Figures: histogram over time; histogram (single sample)] • Describe values better than avg/min/max do • Can be aggregated across nodes • Usually percentiles/quantiles are computed • Xth percentile: X% of the values are lower than <n> • Median (= 50th percentile) • SLO/SLA candidates: 90th/95th/99th percentile of response times
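Two of the points above, aggregation across nodes and percentile computation, can be sketched with plain bucket counts. Histograms with identical bucket boundaries are merged by adding the counts per bucket, and a percentile is read off as the upper bound of the first bucket whose cumulative count reaches the requested share of observations. This is a simplified illustration with hypothetical data; PromQL's `histogram_quantile`, for instance, additionally interpolates within the bucket.

```java
final class HistogramSketch {

    /** Merges two histograms with identical bucket boundaries by adding the counts. */
    static long[] merge(long[] a, long[] b) {
        if (a.length != b.length) throw new IllegalArgumentException("bucket layouts differ");
        long[] out = new long[a.length];
        for (int i = 0; i < a.length; i++) out[i] = a[i] + b[i];
        return out;
    }

    /** Upper bound of the first bucket whose cumulative count reaches the p-th percentile. */
    static double percentile(double[] upperBounds, long[] counts, double p) {
        long total = 0;
        for (long c : counts) total += c;
        if (total == 0) throw new IllegalArgumentException("no observations");
        long needed = (long) Math.ceil(total * p / 100.0);
        long cumulative = 0;
        for (int i = 0; i < counts.length; i++) {
            cumulative += counts[i];
            if (cumulative >= needed) return upperBounds[i];
        }
        return upperBounds[upperBounds.length - 1];
    }

    public static void main(String[] args) {
        double[] bounds = { 10, 20, 30, 40, 50 };  // bucket upper bounds
        long[] nodeA = { 5, 10, 20, 10, 5 };       // hypothetical observations per bucket
        long[] nodeB = { 0, 5, 15, 20, 10 };
        long[] cluster = merge(nodeA, nodeB);      // {5, 15, 35, 30, 15}
        System.out.println(percentile(bounds, cluster, 50)); // 30.0
        System.out.println(percentile(bounds, cluster, 95)); // 50.0
    }
}
```

The same cannot be done with pre-computed averages or percentiles per node, which is why histograms aggregate across nodes and those summaries do not.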
  42. 42. https://github.com/improbable-eng/thanos https://www.slideshare.net/BartomiejPotka/thanos-global-durable-prometheus-monitoring Thanos: Prometheus Long-Term Storage
  43. 43. Thanos: Global Scale
  44. 44. Our Prometheus Config:
      global:
        scrape_interval: 5s
        scrape_timeout: 5s
        evaluation_interval: 1m
      scrape_configs:
      - job_name: prometheus
        scrape_interval: 5s
        scrape_timeout: 5s
        metrics_path: /metrics
        scheme: http
        static_configs:
        - targets:
          - localhost:9090
      - job_name: kamon
        scrape_interval: 5s
        scrape_timeout: 5s
        metrics_path: /metrics
        scheme: http
        sample_limit: 5000
        ec2_sd_configs:
        - region: eu-west-1
          refresh_interval: 1m
          port: 9095
        relabel_configs:
        - source_labels: [__meta_ec2_tag_Environment]
          separator: ;
          regex: (.*)
          target_label: environment
          replacement: $1
          action: replace
        - source_labels: [__meta_ec2_private_ip]
          separator: ;
          regex: (.*)
          target_label: __address__
          replacement: ${1}:9095
          action: replace
        - source_labels: [__meta_ec2_tag_Name]
          separator: ;
          regex: (.*)
          target_label: instance
          replacement: ${1}:9095
          action: replace
        - source_labels: [__meta_ec2_instance_id]
          separator: ;
          regex: (.*)
          target_label: instance_id
          replacement: $1
          action: replace
        - source_labels: [__meta_ec2_tag_Platform]
          separator: ;
          regex: akka
          target_label: platform
          replacement: $1
          action: keep
        - source_labels: [__meta_ec2_tag_AkkaApplication]
          separator: ;
          regex: (.*)
          target_label: akka_application
          replacement: $1
          action: replace
        - source_labels: [__meta_ec2_tag_AkkaRole]
          separator: ;
          regex: (.*)
          target_label: akka_role
          replacement: $1
          action: replace
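The relabel rules in this configuration follow two patterns, which the sketch below mimics for a single source label: `replace` matches the source value against a fully anchored regex and writes the expanded replacement into the target label, while `keep` drops every target whose source value does not match. This is an illustration in plain Java and not Prometheus code; the class and method names are hypothetical, and the handling of multiple source labels joined by the separator is left out.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RelabelSketch {

    /** Anchors the regex at both ends, since relabel regexes must match the whole value. */
    private static Pattern anchored(String regex) {
        return Pattern.compile("\\A(?:" + regex + ")\\z");
    }

    /** Returns the new target label value, or null when the regex does not match. */
    static String replace(String sourceValue, String regex, String replacement) {
        Matcher m = anchored(regex).matcher(sourceValue);
        if (!m.matches()) return null; // no match: the target label is left untouched
        // The config writes group references as $1 or ${1}; Java only understands $1.
        String javaReplacement = replacement.replaceAll("\\$\\{(\\d+)\\}", "\\$$1");
        return m.replaceFirst(javaReplacement);
    }

    /** keep: the target survives only if the source value matches the regex. */
    static boolean keep(String sourceValue, String regex) {
        return anchored(regex).matcher(sourceValue).matches();
    }

    public static void main(String[] args) {
        // __meta_ec2_private_ip -> __address__ with the kamon-prometheus port appended
        System.out.println(replace("10.0.0.7", "(.*)", "${1}:9095")); // 10.0.0.7:9095
        // Only instances tagged Platform=akka are scraped
        System.out.println(keep("akka", "akka"));        // true
        System.out.println(keep("akka-legacy", "akka")); // false
    }
}
```

The `keep` rule on the `Platform` tag is what restricts the `kamon` job to Akka instances out of everything EC2 service discovery returns.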
