Distributed Tracing in the Wild

SpringOne Platform 2019
Title: Distributed Tracing in the Wild
Speakers: Adrian Cole, Zipkin guy, Pivotal; Tommy Ludwig, Software Developer, Pivotal; Narayanan Arunachalam, Senior Software Engineer, Netflix
YouTube: https://youtu.be/dJYHeRDxD5g

Published in: Software

  1. 1. Set your sites on tracing: An overview of distributed tracing practice
  2. 2. Introduction. Agenda: introduction, use case example, a typical Zipkin site, a typical Netflix site, wrapping up. #zipkin
  3. 3. Adrian Cole (GitHub: @adriancole, Twitter: @adrianfcole): • Spring Cloud at Pivotal • focused on distributed tracing • helped open Zipkin. Tommy Ludwig (GitHub: @shakuzen, Twitter: @TommyLudwig): • Spring team at Pivotal • project lead of Micrometer • Zipkin committer. Nara (GitHub: @narayaruna): • insight eng at Netflix • focused on telemetry platforms.
  4. 4. What is Distributed Tracing? Distributed tracing tracks production requests as they touch different parts of your architecture. Requests have a unique trace ID, which you can use to look up a trace diagram or log entries related to it. Causal diagrams are easier to understand than scrolling through logs.
  5. 5. Distributed tracing is neat! https://twitter.com/zipkinproject https://github.com/openzipkin/zipkin
  6. 6. We use it to solve our own latency problems. This dotted line is how much latency we took off worst request performance!
  7. 7. Why do I care? - Reduce time in triage by contextualizing errors and delays - Visualize latency like time in my service vs waiting for other services - Understand complex applications, like async code or microservices - See your architecture, with live dependency diagrams built from traces
  8. 8. How do I turn on tracing? A tracer is a utility library, similar to metrics or logging libraries. It is the mechanism used to trace an operation. Instrumentation is framework-specific code that uses a tracer to collect details such as the HTTP URL and request timing. Instrumentation must be configured and pointed to a tracing system for tracing to work. This is often done automatically with agents or frameworks like Spring Boot.
  9. 9. Distributed Tracing Vocabulary. A Span is an individual operation that took place. A span contains timestamped events and tags. A Trace is an end-to-end latency graph, composed of spans. A Tracer records spans and passes the context required to connect them into a trace. Instrumentation uses a tracer to record a task, such as an HTTP request, as a span.
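The vocabulary above can be sketched as a small data model. This is a minimal illustration, not Zipkin's or Brave's API: the `Span` and `Tracer` classes, the `handle_request` function and the example tag are all hypothetical names chosen for this sketch.

```python
import secrets
import time
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


def new_id() -> str:
    """Return a random 64-bit ID as 16 lowercase hex characters."""
    return secrets.token_hex(8)


def now_us() -> int:
    """Current time in epoch microseconds."""
    return time.time_ns() // 1000


@dataclass
class Span:
    """One operation: IDs that place it in a trace, plus timing, events and tags."""
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    name: str
    start_us: int = 0
    duration_us: int = 0
    annotations: List[Tuple[int, str]] = field(default_factory=list)  # timestamped events
    tags: Dict[str, str] = field(default_factory=dict)


class Tracer:
    """Hypothetical tracer: records spans and passes on the context that links them."""

    def __init__(self) -> None:
        self.finished: List[Span] = []

    def start_span(self, name: str, parent: Optional[Span] = None) -> Span:
        # A child inherits the trace ID and points at its parent; a root starts a new trace.
        trace_id = parent.trace_id if parent else new_id()
        parent_id = parent.span_id if parent else None
        return Span(trace_id, new_id(), parent_id, name, start_us=now_us())

    def finish(self, span: Span) -> None:
        span.duration_us = max(1, now_us() - span.start_us)
        self.finished.append(span)


def handle_request(tracer: Tracer) -> None:
    """Stand-in for instrumentation: records an HTTP request and one downstream call."""
    server = tracer.start_span("get /orders/{id}")
    server.tags["http.path"] = "/orders/42"
    client = tracer.start_span("get /items", parent=server)
    client.annotations.append((now_us(), "retry"))
    tracer.finish(client)
    tracer.finish(server)
```

The spans collected in `tracer.finished` share one trace ID, and the parent links are what let a backend draw them as a single trace.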
  10. 10. Tracers send trace data out of process. Tracers propagate IDs in-band, to tell the receiver there’s a trace in progress. Completed spans are reported out-of-band, to reduce overhead and allow for batching.
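A rough sketch of the two paths described above, assuming B3 multi-header propagation (the `X-B3-*` header names come from Zipkin's B3 format, which a later slide mentions). The `BatchingReporter` class and its `send` callback are hypothetical. This sketch flushes by batch size only, whereas a production reporter would typically also flush on a timer from a background thread.

```python
from typing import Callable, Dict, List, Optional, Tuple

B3_TRACE_ID = "X-B3-TraceId"
B3_SPAN_ID = "X-B3-SpanId"
B3_PARENT_SPAN_ID = "X-B3-ParentSpanId"
B3_SAMPLED = "X-B3-Sampled"


def inject(headers: Dict[str, str], trace_id: str, span_id: str,
           parent_id: Optional[str] = None, sampled: bool = True) -> Dict[str, str]:
    """In-band: add trace identifiers to the outgoing request headers."""
    headers[B3_TRACE_ID] = trace_id
    headers[B3_SPAN_ID] = span_id
    if parent_id:
        headers[B3_PARENT_SPAN_ID] = parent_id
    headers[B3_SAMPLED] = "1" if sampled else "0"
    return headers


def extract(headers: Dict[str, str]) -> Optional[Tuple[str, str]]:
    """In-band: return (trace_id, span_id) if a trace is in progress, else None."""
    lowered = {k.lower(): v for k, v in headers.items()}  # header names are case-insensitive
    trace_id = lowered.get(B3_TRACE_ID.lower())
    span_id = lowered.get(B3_SPAN_ID.lower())
    if trace_id and span_id:
        return trace_id, span_id
    return None


class BatchingReporter:
    """Out-of-band: buffer completed spans and hand them to `send` in batches."""

    def __init__(self, send: Callable[[List[dict]], None], batch_size: int = 100) -> None:
        self._send = send
        self._batch_size = batch_size
        self._buffer: List[dict] = []

    def report(self, span: dict) -> None:
        self._buffer.append(span)
        if len(self._buffer) >= self._batch_size:
            self.flush()

    def flush(self) -> None:
        if self._buffer:
            batch, self._buffer = self._buffer, []
            self._send(batch)
```

Keeping span data out of the request path means the application request carries only a few small headers, while the heavier span payloads travel separately.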
  11. 11. Zipkin is a distributed tracing system
  12. 12. Zipkin can be as simple as one file listening on one port. Command: $ curl -s localhost:9411/api/v2/services | jq . Output: [ "api-proxy", "auth-api", "phoenix" ]
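The same HTTP API can be called from code. The sketch below assumes a Zipkin server on localhost:9411, as in the curl command above. It lists services with `GET /api/v2/services` and reports spans with `POST /api/v2/spans` in Zipkin v2 JSON. The function names and the contents of `example_span` are illustrative only.

```python
import json
import urllib.request
from typing import List

ZIPKIN = "http://localhost:9411"  # assumed local Zipkin server, as on the slide


def services_url(base: str = ZIPKIN) -> str:
    return f"{base}/api/v2/services"


def list_services(base: str = ZIPKIN) -> List[str]:
    """Same request as the curl command: GET /api/v2/services."""
    with urllib.request.urlopen(services_url(base), timeout=5) as resp:
        return json.load(resp)


def encode_spans(spans: List[dict]) -> bytes:
    """Encode a list of Zipkin v2 spans as a JSON array."""
    return json.dumps(spans).encode("utf-8")


def report_spans(spans: List[dict], base: str = ZIPKIN) -> int:
    """POST spans to /api/v2/spans and return the HTTP status code."""
    req = urllib.request.Request(
        f"{base}/api/v2/spans",
        data=encode_spans(spans),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return resp.status


# Illustrative span; the values are made up for this sketch.
example_span = {
    "traceId": "0b899a5594e2a2eb",
    "id": "0b899a5594e2a2eb",
    "name": "delete /orders/{id}",
    "kind": "SERVER",
    "timestamp": 1570552235000000,  # epoch microseconds
    "duration": 126000,             # microseconds
    "localEndpoint": {"serviceName": "orders"},
    "tags": {"http.method": "DELETE", "http.status_code": "500"},
}
```

With a server running, `report_spans([example_span])` followed by `list_services()` should include "orders" once the span has been stored.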
  13. 13. Use case example. Agenda: introduction, use case example, a typical Zipkin site, a typical Netflix site, wrapping up. #zipkin
  14. 14. Tracing end to end testing
  15. 15. E2E testing example. Test scenarios: Scenario 1 (UpdateMyOrder): ● GET /orders ● GET /orders/{id} ● PATCH /orders/{id}. Scenario 2 (CancelMultipleOrders): ● GET /orders/search ● DELETE /orders/{id} ● DELETE /orders/{id} ● GET /orders/search. Test execution samples: 20191008163035 E2E Run #112: UpdateMyOrder...PASSED; CancelMultipleOrders...FAILED; DELETE /orders/{id}... HTTP 500 Could not delete order. 20191008173035 E2E Run #113: UpdateMyOrder...TIMEOUT 5s; CancelMultipleOrders...TIMEOUT 5s.
  16. 16. Goals: Identify the failing component, to assign investigation to the correct service team. No guessing when correlating specific test requests/scenarios with traces/logs.
  17. 17. Improved end-to-end testing. A known trace ID is associated with each request, so traces can be looked up from the test execution. With a correlation ID, a whole test scenario (multiple requests) can share a single ID to look up.
  18. 18. Sample test run with tracing. CancelMultipleOrders Scenario run #112 (correlation-id: 370f1f786e61dd62): ● GET /orders/search (trace: 989ef0d9fa1d6738) ● DELETE /orders/{id} (trace: d5841212ef1f7b03) ● DELETE /orders/{id} (trace: 0b899a5594e2a2eb). [Diagram labels: orders, items, prices, gateway]
  19. 19. Sample test run output with tracing. 20191008163035 E2E Run #112: UpdateMyOrder...PASSED; CancelMultipleOrders...FAILED (370f1f786e61dd62) [link to Zipkin search for traces with this correlation ID]; DELETE /orders/{id}... HTTP 500 (0b899a5594e2a2eb) [jump to Zipkin trace view] Could not delete order.
  20. 20. Zipkin view of E2E test DELETE /orders/{id} (trace: 0b899a5594e2a2eb). [Screenshot labels: Service; gateway, orders, items, prices; DELETE /{id}; DELETE /orders/{id}; Logs]
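One way the test runner from the slides above could assign IDs and print links is sketched below. Several parts are assumptions: the `X-Correlation-Id` header name, a `correlation-id` tag that the services' instrumentation would need to record, and the Zipkin UI paths used for the links. The `send` callback stands in for whatever HTTP client the test suite uses.

```python
import secrets
from typing import Callable, Dict, List, Tuple
from urllib.parse import urlencode

ZIPKIN_UI = "http://localhost:9411/zipkin"  # assumed UI base URL


def new_id() -> str:
    """Random 64-bit ID as 16 lowercase hex characters."""
    return secrets.token_hex(8)


def request_headers(correlation_id: str) -> Dict[str, str]:
    """Headers for one test request: a fresh, known trace ID plus the shared correlation ID."""
    return {
        "X-B3-TraceId": new_id(),
        "X-B3-SpanId": new_id(),
        "X-B3-Sampled": "1",                 # ask for this trace to be recorded
        "X-Correlation-Id": correlation_id,  # hypothetical header name
    }


def trace_link(trace_id: str) -> str:
    """Link to a single trace (UI path is an assumption)."""
    return f"{ZIPKIN_UI}/traces/{trace_id}"


def correlation_search_link(correlation_id: str) -> str:
    """Link to a search for all traces tagged with this correlation ID (tag name assumed)."""
    query = urlencode({"annotationQuery": f"correlation-id={correlation_id}"})
    return f"{ZIPKIN_UI}/?{query}"


def run_scenario(name: str, requests: List[Tuple[str, str]],
                 send: Callable[[str, str, Dict[str, str]], int]) -> List[str]:
    """Run each (method, path) through `send` and return report lines with trace links."""
    correlation_id = new_id()
    lines = [f"{name} (correlation-id: {correlation_id}) {correlation_search_link(correlation_id)}"]
    for method, path in requests:
        headers = request_headers(correlation_id)
        status = send(method, path, headers)
        lines.append(f"  {method} {path} HTTP {status} {trace_link(headers['X-B3-TraceId'])}")
    return lines
```

Because the test generates the IDs itself, a failed step can be matched to its trace without searching by timestamp.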
  21. 21. Spice up alerting with tracing
  22. 22. Alerts happen. Then what? ⚠ HighMaxLatencyAlert ⚠ Max latency on the orders service endpoint GET /orders/{id} has exceeded 500ms (1.26s). ⚠ HighErrorAlert ⚠ High error ratio on the orders service endpoint GET /orders/{id} (HTTP 500).
  23. 23. Alerts happen. Then what? ⚠ HighMaxLatencyAlert ⚠ Max latency on the orders service endpoint GET /orders/{id} has exceeded 500ms (1.26s). [Dashboard] [Traces] ⚠ HighErrorAlert ⚠ High error ratio on the orders service endpoint GET /orders/{id} (HTTP 500). [Dashboard] [Traces] The [Dashboard] link opens the metrics dashboard; the [Traces] link opens a trace search. Pass tag values as URL query parameters to filter to relevant results.
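A sketch of how an alert template might build its [Traces] link from tag values. It targets Zipkin's `/api/v2/traces` query parameters (`serviceName`, `spanName`, `annotationQuery`, `minDuration` in microseconds, `limit`). The `trace_search_url` function and the example values are hypothetical, and an alert aimed at the UI would use a UI URL instead.

```python
from typing import Dict, Optional
from urllib.parse import urlencode


def trace_search_url(base: str, service: str, span_name: Optional[str] = None,
                     tags: Optional[Dict[str, Optional[str]]] = None,
                     min_duration_ms: Optional[float] = None, limit: int = 10) -> str:
    """Build a trace search URL filtered by the values an alert already knows."""
    params = {"serviceName": service, "limit": str(limit)}
    if span_name:
        params["spanName"] = span_name
    if tags:
        # annotationQuery joins terms with " and "; a term is "key=value" or just "key".
        params["annotationQuery"] = " and ".join(
            key if value is None else f"{key}={value}" for key, value in tags.items()
        )
    if min_duration_ms is not None:
        params["minDuration"] = str(int(min_duration_ms * 1000))  # milliseconds to microseconds
    return f"{base}/api/v2/traces?{urlencode(params)}"


# Example: links for the two alerts on the slide (values are illustrative).
latency_link = trace_search_url("http://localhost:9411", "orders",
                                span_name="get /orders/{id}", min_duration_ms=500)
error_link = trace_search_url("http://localhost:9411", "orders",
                              span_name="get /orders/{id}",
                              tags={"http.status_code": "500", "error": None})
```

Filtering by the same service, endpoint and threshold as the alert means the first traces shown are likely to be the ones that triggered it.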
  24. 24. A typical Zipkin site. Agenda: introduction, use case example, a typical Zipkin site, a typical Netflix site, wrapping up. #zipkin
  25. 25. What is a Zipkin site? Site owner: an end user who champions Zipkin as a part of additional roles in their company. Many site owners are part time, yet contribute back to open source. Zipkin site: a production deployment of distributed tracing that considers Zipkin format, instrumentation or backends strategic to its observability function.
  26. 26. What information do we collect on Zipkin sites? * Introduction of the company context and team on tracing * System overview from application until visualization/analysis * Site-specific data conventions such as how services are named * Why tracing is important, goals and service level agreements * Status like adoption, ingestion and costs incurred
  27. 27. Why bother with tracing? Ascend Money says: Measure latency improvements before and after refactoring the services. Identify non-conformant service communications that deviate from the design. Hotels.com says: It helps in pointing out the worst offenders and makes it easier to identify performance improvements, such as network calls that could be done in parallel. Netflix says: The business value is in providing operational visibility into the systems and enhancing developer productivity.
  28. 28. What kind of infrastructure is involved? Effective tracing matches the architecture and skillset of the site owners. Sites have different application and tracing infrastructures.
  29. 29. So, a site doesn’t only run Zipkin server? Zipkin Server is the canonical backend, which receives Zipkin format and presents a UI. Some don’t run Zipkin server, or also run other products, for various reasons: * SaaS preference * APM integration * Hybrid setup
  30. 30. And... applications don’t always use Zipkin libraries?! Zipkin curates propagation and trace formats, which decouple sites from a mandate of using our code. By producing the same data, applications have more flexibility and choice: * 3rd party libraries * Proxies (service mesh) integration * In-house custom tools
  31. 31. Let’s look at a site that once used Zipkin server. Hotels.com started with a Zipkin backend, but are transitioning to Expedia Haystack (https://github.com/ExpediaDotCom/haystack), which provides more features like adaptive alerting. Applications still emit data in Zipkin v2 format, which is forwarded to Haystack with a tool they created called Pitchfork. Developers still use Zipkin on their laptops for local troubleshooting, as it is easy to run.
  32. 32. Let’s look at a site that didn’t initially use Zipkin server. Netflix created a Dapper-based tracing system to trace RPC calls involved in video streaming. This included framework libraries to produce trace headers and data. As Spring Boot became prevalent, Zipkin became more useful, as it is built into the tracing library Spring Cloud Sleuth. Netflix convert legacy spans into Zipkin v2 format in their Kafka/Flink pipeline. This allows traces to stitch together for query and analysis. Nara will talk about Netflix in a bit!
  33. 33. Let’s look at a site that never used Zipkin server. Infostellar architecture runs in Google Cloud, except ground station software that runs locally at an antenna site. Many components trace with Zipkin libraries, some with OpenTracing, some homegrown. All use Zipkin’s B3 format for propagation. Even when using Zipkin libraries, data is sent directly to Google Stackdriver for query and analysis. There’s no Zipkin server footprint at Infostellar.
  34. 34. Let’s look at a site that uses stock Zipkin server. Medidata is an entirely AWS architecture, using the zipkin-aws image, which allows HTTP and SQS span collection. They collect 100% of data into AWS-managed Elasticsearch storage. While the Zipkin service is standard, Medidata has a service that reads trace data from Elasticsearch, compares it with performance objectives in APIs and issues alerts when performance degrades.
  35. 35. OK, it is stock++: Medidata wrote SLAP
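The idea of comparing trace data with performance objectives can be illustrated with a short sketch. This is not Medidata's SLAP implementation, whose design the slides do not describe. It only assumes spans in Zipkin v2 JSON form (with `duration` in microseconds); the `objectives` table and the `find_violations` function are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

# (service name, span name) -> maximum acceptable duration in microseconds (hypothetical table)
Objectives = Dict[Tuple[str, str], int]


def find_violations(spans: Iterable[dict], objectives: Objectives) -> List[str]:
    """Return one message per span that is slower than its objective."""
    violations = []
    for span in spans:
        service = span.get("localEndpoint", {}).get("serviceName", "")
        key = (service, span.get("name", ""))
        limit = objectives.get(key)
        duration = span.get("duration", 0)
        if limit is not None and duration > limit:
            violations.append(
                f"{key[0]} {key[1]}: {duration}us exceeds {limit}us (trace {span.get('traceId')})"
            )
    return violations


# Example input: two spans for the same endpoint, one over the objective.
spans = [
    {"traceId": "989ef0d9fa1d6738", "name": "get /orders/{id}", "duration": 120000,
     "localEndpoint": {"serviceName": "orders"}},
    {"traceId": "0b899a5594e2a2eb", "name": "get /orders/{id}", "duration": 1260000,
     "localEndpoint": {"serviceName": "orders"}},
]
objectives = {("orders", "get /orders/{id}"): 500000}
alerts = find_violations(spans, objectives)
```

Including the trace ID in each message gives whoever receives the alert a direct pointer to an example of the slow request.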
  36. 36. Besides architecture, what’s different across sites? Data collection policy: Typeform always provision request IDs. Infostellar use antenna, satellite and plan tags for business context. LINE add company-specific tags like phase and instance ID. Expedia Haystack scrubs secrets. Data retention policy: Medidata retain 100% for 100 days. Netflix sample 100% of FIT experiments, 0.1% otherwise. SoundCloud retain a very low sample rate for 7 days. Tracing adoption rate: LINE is only one team’s services, Ascend is under 50%, Tyro is over 90%.
  37. 37. How do sites get started with tracing? Proxy: starting traces at a proxy can raise visibility of upstream and downstream. Typeform initialise a trace and request ID in their custom proxy. Single service: Hotels.com recognised that even though tracing is a team sport, starting with a single service can still add value. New Framework: Sites like Ascend rolled out tracing in new applications, as it was supported out of the box with Spring Boot (via Spring Cloud Sleuth). Green Field: Infostellar engineers had previous experience with tracing, and built their platform with tracing in mind.
  38. 38. A typical Netflix site. Agenda: introduction, use case example, a typical Zipkin site, a typical Netflix site, wrapping up. #zipkin
  39. 39. Why do we trace? * Troubleshooting in production * Understand service behavior for chaos experiments * Service impact for A/B testing * Understand service demands for device types to help with traffic routing
  40. 40. What do we need? A tracing platform that is flexible to satisfy current and future needs.
  41. 41. Is tracing new for Netflix? We developed a homegrown solution in 2013, based on Dapper, that worked well. We also have a custom tracing solution used only in certain services. Result: fragmentation.
  42. 42. Evolving tracing: Spring Boot adoption. OSS for trace collection: - Spring Cloud Sleuth - Zipkin Brave
  43. 43. Evolving tracing: legacy tracer. Roman ride both tracing systems (run the two side by side). Legacy spans are converted to Zipkin format in the backend.
  44. 44. Evolving tracing. [Architecture diagram labels: Kafka (trace_annotation); Legacy Tracer; Insight_ES; Tracing Stream Processing Job (Flink); Iceberg; Zipkin Tracer Publisher; Kafka (iep-tracing); Tracing API; Zipkin UI]
  45. 45. Sampling: * 0.1% random sampling for services with high traffic * On-demand 100% sampling driven by rules * Secondary sampling (in progress) * 100% for services with low traffic
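The first two sampling policies above can be sketched as a sampler that checks rules first and then falls back to a random rate. This is an illustration under assumptions, not Netflix's implementation: the `Sampler` class, the shape of the `request` dict and the `fit_experiment` field are hypothetical. In practice the decision is usually made once at the start of a trace and propagated downstream (for example in the B3 sampled flag).

```python
import random
from typing import Callable, List, Optional

Rule = Callable[[dict], bool]  # returns True when a request must be traced


class Sampler:
    """Rule-driven 100% sampling with a random-rate fallback."""

    def __init__(self, rate: float, rules: Optional[List[Rule]] = None,
                 rng: Optional[random.Random] = None) -> None:
        if not 0.0 <= rate <= 1.0:
            raise ValueError("rate must be between 0.0 and 1.0")
        self.rate = rate
        self.rules = list(rules or [])
        self._rng = rng or random.Random()

    def is_sampled(self, request: dict) -> bool:
        # On-demand sampling: any matching rule forces the trace to be recorded.
        if any(rule(request) for rule in self.rules):
            return True
        # Otherwise keep a random fraction of requests.
        return self._rng.random() < self.rate


def is_fit_experiment(request: dict) -> bool:
    """Hypothetical rule: trace every request that belongs to an experiment."""
    return bool(request.get("fit_experiment"))


high_traffic = Sampler(rate=0.001, rules=[is_fit_experiment])  # 0.1% plus rules
low_traffic = Sampler(rate=1.0)                                # 100%
```

A low-traffic service can afford to keep everything, while a high-traffic one keeps a small random fraction and relies on rules to capture the requests that matter for a specific investigation.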
  46. 46. Wrapping up. Agenda: introduction, use case example, a typical Zipkin site, a typical Netflix site, wrapping up. #zipkin
  47. 47. Wrapping up: Contribute to our site documents. Chat any time on Gitter: gitter.im/openzipkin/zipkin. Source: github.com/openzipkin/zipkin. #zipkin