4. 4
Observability is achieved through a set of tools and practices that aims
to turn data points and context into insights.
• Beyond traditional monitoring
• Constant partial degradation/failure
• Expect the unexpected
• Answer unknown questions about your system
5. 5
You want to provide a great experience for users of your system.
• Observability builds confidence in production
• Ownership. Give yourself the tools to be a good owner.
• MTTR is key – failures will happen
• early detection + fast recovery + increased understanding
* MTTR = mean time to recovery
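As a worked example of the definition above, MTTR is the total recovery time divided by the number of incidents. The sketch below assumes three hypothetical incidents that took 30, 10, and 20 minutes to recover; the class and method names are illustrative only.

```java
import java.time.Duration;
import java.util.List;

final class MttrExample {

    /** Mean time to recovery: total recovery time divided by the number of incidents. */
    static Duration meanTimeToRecovery(List<Duration> recoveryTimes) {
        if (recoveryTimes.isEmpty()) {
            throw new IllegalArgumentException("at least one incident is required");
        }
        Duration total = Duration.ZERO;
        for (Duration d : recoveryTimes) {
            total = total.plus(d);
        }
        return total.dividedBy(recoveryTimes.size());
    }

    public static void main(String[] args) {
        // Hypothetical recovery times for three incidents.
        List<Duration> incidents = List.of(
                Duration.ofMinutes(30), Duration.ofMinutes(10), Duration.ofMinutes(20));
        System.out.println("MTTR = " + meanTimeToRecovery(incidents).toMinutes() + " min");
    }
}
```

Earlier detection and faster recovery both shorten the individual durations, which is what lowers the mean.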
6. 6
• Finish your work faster/easier
• Find and fix problems sooner (before release, before QA)
• Improve your service by better understanding its behavior
8. 8
• Spring Boot Actuator is awesome.
• You get so much out-of-the-box.
• But... is it enough? Like most things, it depends.
• Information is inherently instance-scoped
9. Spring Boot Admin makes it easy to
access and use each instance’s
Actuator endpoints.
https://github.com/codecentric/spring-boot-admin
12. 12
• Any request spans multiple processes
• Need to stitch together local info and slice/drill-down
• Increased points of failure
• Scaling and ephemeral instances*
* Not strictly properties of a distributed system
14. 14
…
• 3 sides to observability
• Non-functional requirements (generic/specific)
• Overlap exists, but use all 3 for best insight
Source: Peter Bourgon, access date: 2018-05-18
http://peter.bourgon.org/blog/2017/02/21/metrics-tracing-and-logging.html
15. 15
When it comes to logging, metrics, and tracing:
• Common needs just work out-of-the-box.
• Custom needs can be met with a little extra effort.
See also: 80-20 rule
17. 17
• Arbitrary messages you want to find later
• Formatted to give context: logging levels, timestamp
• Message examples
• Exceptions/stack traces
• Additional context
• Access logs
• Request/response bodies
18. 18
[Diagram: a user who wants to check the logs must get them from the VMs running App1 and App2, then search them. 🤔 💥]
19. 19
• Does not scale; too much work and knowledge required
• Multithreaded, concurrent requests intermingle logs
• Low usability – searching is limited/difficult
21. 21
Spring Cloud Sleuth
• adds trace ID for request correlation
• Query all collected logs by any field or full-text search
• time window, service, log level, trace ID, message
Centralized, request-correlated, formatted logs
indexed and searchable across your system
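The idea of request-correlated logs can be sketched without any library. The code below is a simplified stand-in, not how Spring Cloud Sleuth is implemented: a `ThreadLocal` plays the role of the logging context, an in-memory list plays the role of the central log store, and the `[service,traceId]` line layout is an assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

final class CorrelatedLogSketch {

    // Stand-in for a logging context such as SLF4J's MDC: one trace ID per thread.
    private static final ThreadLocal<String> TRACE_ID = new ThreadLocal<>();

    // Stand-in for a central log store. Not thread-safe; this demo is single-threaded.
    static final List<String> COLLECTED = new ArrayList<>();

    static void handleRequest(String service, String traceId, String message) {
        TRACE_ID.set(traceId);
        try {
            log(service, "INFO", message);
        } finally {
            TRACE_ID.remove(); // do not leak the ID to the next request on this thread
        }
    }

    static void log(String service, String level, String message) {
        COLLECTED.add(String.format("%-5s [%s,%s] %s", level, service, TRACE_ID.get(), message));
    }

    /** Returns every collected line that carries the given trace ID, across all services. */
    static List<String> findByTraceId(String traceId) {
        return COLLECTED.stream()
                .filter(line -> line.contains("," + traceId + "]"))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        handleRequest("app1", "abc123", "order received");
        handleRequest("app2", "abc123", "payment accepted");
        handleRequest("app1", "def456", "unrelated request");
        findByTraceId("abc123").forEach(System.out::println);
    }
}
```

Because every line carries the trace ID, one query returns the lines of a single request even when several services and concurrent requests write to the same store.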
22. 22
Spring Boot
• Configurable via Spring Environment (see also Spring Cloud Config)
• log format – make a common format across applications
• log levels (logging.level.*)
• Configurable via Actuator (at runtime)
• log levels
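Changing a log level at runtime goes through the Actuator `loggers` endpoint, which accepts a POST with a `configuredLevel` field. The sketch below only builds the request with the JDK HTTP client (Java 11+); the base URL and logger name are hypothetical, and it assumes the `loggers` endpoint is exposed over HTTP in the target application.

```java
import java.net.URI;
import java.net.http.HttpRequest;

final class LoggerLevelRequestSketch {

    /** JSON body understood by the Actuator loggers endpoint. */
    static String body(String level) {
        return "{\"configuredLevel\":\"" + level + "\"}";
    }

    /** Builds (but does not send) the request that sets one logger's level. */
    static HttpRequest setLevel(String baseUrl, String loggerName, String level) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/actuator/loggers/" + loggerName))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body(level)))
                .build();
    }

    public static void main(String[] args) {
        // Hypothetical instance address and logger name.
        HttpRequest request = setLevel("http://localhost:8080", "com.example.travel", "DEBUG");
        System.out.println(request.method() + " " + request.uri());
        System.out.println(body("DEBUG"));
    }
}
```

Sending it would take a `java.net.http.HttpClient` and a running instance. Since the change applies to one instance only, each instance of a scaled service has to be addressed separately.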
23. 23
Spring Cloud Config – shared config properties
• Common log pattern
Travel Auto-configuration
• Correlation ID added to MDC
ELK
• Elasticsearch – log storage/querying/indexing
• Logstash – log forwarding/parsing
• Kibana – search / UI for querying Elasticsearch
25. 25
Characteristics:
• Aggregate time-series data; bounded size
• Can slice based on multiple dimensions/tags/labels*
Purpose:
• Visualize / identify trends and deviation
• Alerting based on metric queries
* See also https://www.datadoghq.com/blog/the-power-of-tagged-metrics/
26. 26
Example metric                  Type        Example tags
response time                   timer       uri, status, method
number of classes loaded        gauge
response body size              histogram   uri, status, method
number of garbage collections   counter     cause, action
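To make the role of tags concrete, here is a minimal in-memory sketch of a tagged counter. It is not Micrometer's API; the class and its methods are hypothetical and only show that each unique tag combination becomes its own series, which can then be sliced by any single tag.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

final class TaggedCounterSketch {

    // One series per unique combination of tag values.
    private final Map<Map<String, String>, LongAdder> series = new ConcurrentHashMap<>();

    void increment(Map<String, String> tags) {
        // Copy into a TreeMap so the key is independent of the caller's map.
        series.computeIfAbsent(new TreeMap<>(tags), k -> new LongAdder()).increment();
    }

    /** Sums every series whose tags contain the given key/value pair. */
    long sumWhere(String tagKey, String tagValue) {
        return series.entrySet().stream()
                .filter(e -> tagValue.equals(e.getKey().get(tagKey)))
                .mapToLong(e -> e.getValue().sum())
                .sum();
    }

    /** Number of stored series; grows with tag cardinality, not with request volume. */
    int seriesCount() {
        return series.size();
    }

    public static void main(String[] args) {
        TaggedCounterSketch requests = new TaggedCounterSketch();
        requests.increment(Map.of("uri", "/hello", "status", "200"));
        requests.increment(Map.of("uri", "/hello", "status", "500"));
        requests.increment(Map.of("uri", "/bye", "status", "200"));
        System.out.println("status=200: " + requests.sumWhere("status", "200"));
        System.out.println("uri=/hello: " + requests.sumWhere("uri", "/hello"));
        System.out.println("series: " + requests.seriesCount());
    }
}
```

The stored size depends on how many distinct tag combinations exist, which is why high-cardinality values such as user IDs are usually a poor fit for tags.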
30. 30
• Spring Boot 2 uses Micrometer as its native metrics library
• Micrometer supports many metrics backends
• e.g. Atlas, Datadog, Influx, Prometheus, SignalFx, Wavefront
• Instrumentation of common components auto-configured
• JVM/system, HTTP server/client requests, Spring Integration, DataSource…
• Custom metrics also easy to add
31. 31
• Configure via properties
• management.metrics.*
• Disable certain metrics
• Enable percentiles/SLAs/percentile histograms
• Common tags
• e.g. application name, instance, stack, region, zone
32. 32
Travel Service Starter (included in service-parent)
• Includes micrometer-registry-prometheus dependency
Travel Auto-configuration
• Common metric tag for application name (spring.application.name)
Travel Metrics Platform
• Micrometer library for metrics instrumentation/reporting
• Prometheus for metrics collection/storage/querying
• Grafana for dashboards/graphing sourced by Prometheus
33. 33
• Visualize metrics, compare over time
• Have a question you’re trying to answer
• Do NOT just stare at dashboards
39. 39
Distributed tracing: tracing across process boundaries
• Propagate context/hierarchy; join together after
• Request-scoped latency analysis across services
• Metrics lack request context
• Logging has local context but limited distributed info
40. 40
Tracing instrumented system
[Diagram: a user request passes through service1, service2, service3, and service4. Each service carries a tracer / instrumentation and reports to the tracing backend.]
① start span / sampling decision
② propagate trace context
③ continue trace
④ report spans
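The four steps in this diagram can be sketched with plain Java. This is an illustration under simplifying assumptions, not a tracer implementation: sampling is always on, the "backend" is an in-memory list, and the receiving service always creates a child span. The header names follow Zipkin's B3 propagation format; every class and method name is hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;

final class TracePropagationSketch {

    /** Minimal span identity: the trace it belongs to, its own ID, and its parent's ID. */
    static final class Span {
        final String traceId;
        final String spanId;
        final String parentId; // null for the root span

        Span(String traceId, String spanId, String parentId) {
            this.traceId = traceId;
            this.spanId = spanId;
            this.parentId = parentId;
        }
    }

    // Stand-in for the tracing backend: spans are collected in memory.
    static final List<Span> REPORTED = new ArrayList<>();

    static String newId() {
        // 64-bit ID rendered as 16 lowercase hex characters.
        return String.format("%016x", ThreadLocalRandom.current().nextLong());
    }

    // (1) Start a new trace at the edge of the system. Sampling is always on in this sketch.
    static Span startTrace() {
        return new Span(newId(), newId(), null);
    }

    // (2) Propagate the trace context on the outgoing call, here as B3 headers.
    static void inject(Span current, Map<String, String> headers) {
        headers.put("X-B3-TraceId", current.traceId);
        headers.put("X-B3-SpanId", current.spanId);
        headers.put("X-B3-Sampled", "1");
    }

    // (3) Continue the trace in the receiving service as a child of the caller's span.
    static Span continueTrace(Map<String, String> headers) {
        return new Span(headers.get("X-B3-TraceId"), newId(), headers.get("X-B3-SpanId"));
    }

    // (4) Report a finished span to the backend.
    static void report(Span span) {
        REPORTED.add(span);
    }

    public static void main(String[] args) {
        Span root = startTrace();
        Map<String, String> headers = new HashMap<>();
        inject(root, headers);
        Span child = continueTrace(headers);
        report(child);
        report(root);
        for (Span s : REPORTED) {
            System.out.println(s.traceId + " span=" + s.spanId + " parent=" + s.parentId);
        }
    }
}
```

The backend can rebuild the request tree afterwards because every reported span shares the trace ID and names its parent.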
48. 48
Together you have correlated logging, metrics, and tracing across the
whole system. Jump between each using common identifiers.
Adapted from: Adrian Cole, “Observability 3 ways: logging metrics and tracing”; access date: 2018-05-18
https://speakerdeck.com/adriancole/observability-3-ways-logging-metrics-and-tracing
49. 49
spring.application.name = Zipkin service name
Configure as Micrometer common tag
http_server_requests_seconds_count{exception="None",method="GET",status="200",uri="/hello",} 4.0
http_server_requests_seconds_sum{exception="None",method="GET",status="200",uri="/hello",} 0.02570928
http_server_requests_seconds_max{exception="None",method="GET",status="200",uri="/hello",} 0.0
Micrometer tags
Zipkin tags
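The `_count` and `_sum` samples shown on this slide are enough to derive an average response time: divide the sum by the count. The sketch below parses the two sample lines and does that arithmetic; the helper names are hypothetical, and the result is an average over all requests recorded so far rather than over a recent time window.

```java
final class AverageLatencySketch {

    /** Extracts the numeric sample value, i.e. the text after the last space. */
    static double sampleValue(String line) {
        return Double.parseDouble(line.substring(line.lastIndexOf(' ') + 1));
    }

    /** Mean request duration in seconds: sum of durations divided by request count. */
    static double meanSeconds(String countLine, String sumLine) {
        return sampleValue(sumLine) / sampleValue(countLine);
    }

    public static void main(String[] args) {
        String labels = "{exception=\"None\",method=\"GET\",status=\"200\",uri=\"/hello\",}";
        String countLine = "http_server_requests_seconds_count" + labels + " 4.0";
        String sumLine = "http_server_requests_seconds_sum" + labels + " 0.02570928";
        // 0.02570928 s over 4 requests is roughly 6.4 ms per request.
        System.out.printf("mean = %.2f ms%n", meanSeconds(countLine, sumLine) * 1000);
    }
}
```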
50. 50
Link to e.g. Kibana search by traceId
can also do Logs → Trace
https://github.com/openzipkin/zipkin/tree/master/zipkin-ui#how-do-i-find-logs-associated-with-a-particular-trace
51. 51
• Confirm request flow – does it match the expected
design/architecture?
• Check service dependencies in Zipkin
• Check request flow in Zipkin; jump to logs if necessary
• Filter by service name, span name, tags
• Adjust log levels via Actuator if necessary
52. 52
• Automated tests generate a correlation ID per test case execution.
• Use correlation ID to find the related traces in Zipkin.
[Diagram: correlation ID cID0001 is attached to both trace1 and trace2.]
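A minimal sketch of this lookup, assuming each test case generates one ID and every trace it triggers is recorded under that ID. The in-memory map stands in for a tag search in Zipkin, and all names here are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;

final class TestCorrelationSketch {

    // correlation ID -> IDs of the traces recorded with that correlation ID
    private final Map<String, List<String>> tracesByCorrelationId = new HashMap<>();

    /** One fresh ID per test case execution. */
    static String newCorrelationId() {
        return UUID.randomUUID().toString();
    }

    void recordTrace(String correlationId, String traceId) {
        tracesByCorrelationId.computeIfAbsent(correlationId, k -> new ArrayList<>()).add(traceId);
    }

    /** All traces that belong to one test case; empty if none were recorded. */
    List<String> findTraces(String correlationId) {
        return tracesByCorrelationId.getOrDefault(correlationId, List.of());
    }

    public static void main(String[] args) {
        TestCorrelationSketch index = new TestCorrelationSketch();
        // Fixed ID used here to mirror the slide; a test would call newCorrelationId().
        index.recordTrace("cID0001", "trace1");
        index.recordTrace("cID0001", "trace2");
        System.out.println(index.findTraces("cID0001"));
    }
}
```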
53. 53
• Manual tests (in non-production environments) from the browser can use
Zipkin Browser Extension to get the traceId for a browser request
• Where in the request flow did the error occur or why was it slow?
• Check request flow in Zipkin; jump to logs (if necessary)
• Adjust log levels via Actuator (if necessary)
54. 54
Detection
Investigation
Recovery
Adjustment
Alert / inquiry
1. Starts with an alert/report
2. Check metrics
3. Check tracing data (if needed)
4. Check logs (if needed)
5. Triage issue
6. Make adjustments to prevent recurrence
🔁
56. 56
• System-wide observability is crucial in distributed architectures
• Tools exist and Spring makes them easy to integrate
• Most common cases are covered out-of-the-box or configurable.
Custom instrumentation is possible as needed.
• Use the right tool for the job; synergize across tools
58. 58
• “Distributed Systems Observability” e-book by Cindy Sridharan:
http://distributed-systems-observability-ebook.humio.com/
• Articles by Cindy Sridharan (@copyconstruct): https://medium.com/@copyconstruct
• Talks by Charity Majors (@mipsytipsy): https://speakerdeck.com/charity
• “Observability+” articles by JBD (@rakyll): https://medium.com/observability