Observability with Spring-based distributed systems

Tommy Ludwig
Rakuten, Inc.
Travel Product Development Department
Foundation Office
Spring Fest 2018
2018-10-31

2
• Observability: what / why
• 3 pillars of observability: Logging, Metrics, Tracing
• Putting it all together

4
Observability is achieved through a set of tools and practices that aims
to turn data points and context into insights.
• Beyond traditional monitoring
• Constant partial degradation/failure
• Expect the unexpected
• Answer unknown questions about your system

5
You want to provide a great experience for users of your system.
• Observability builds confidence in production
• Ownership. Give yourself the tools to be a good owner.
• MTTR is key – failures are happening
• early detection + fast recovery + increased understanding
* MTTR = mean time to recovery

6
• Finish your work faster/easier
• Find and fix problems sooner (before release, before QA)
• Improve your service by better understanding its behavior

8
• Spring Boot Actuator is awesome.
• You get so much out-of-the-box.
• But... is it enough? Like most things, it depends.
• Information is inherently instance-scoped

Spring Boot Admin makes it easy to access and use each instance’s Actuator endpoints.
https://github.com/codecentric/spring-boot-admin

11
[Diagram: two users and three databases (DB) in a distributed system]

12
• Any request spans multiple processes
• Need to stitch together local info and slice/drill-down
• Increased points of failure
• Scaling and ephemeral instances*
* Not strictly properties of a distributed system

14
…
• 3 sides to observability
• Non-functional requirements (generic/specific)
• Overlap exists, but use all 3 for best insight
Source: Peter Bourgon, access date: 2018-05-18
http://peter.bourgon.org/blog/2017/02/21/metrics-tracing-and-logging.html

15
When it comes to logging, metrics, and tracing:
• Common needs just work out-of-the-box.
• Custom needs can be met with a little extra effort.
See also: 80-20 rule

17
• Arbitrary messages you want to find later
• Formatted to give context: logging levels, timestamp
• Message examples
• Exceptions/stack traces
• Additional context
• Access logs
• Request/response bodies

18
[Diagram: "I want to check the logs…" A person has to get the logs and search the logs on each VM separately, for every instance of App1 and App2. Legend: VM, App1, App2, Logs]

19
• Does not scale; Too much work and knowledge required
• Multithreaded, concurrent requests intermingle logs
• Low usability – searching is limited/difficult

20
[Diagram: App1 and App2 on each VM stream logs to a central log store service. A user sends a query request to query logs and gets back a collection of matching logs. Legend: VM, App1, App2, Logs]

21
Spring Cloud Sleuth
• adds trace ID for request correlation
• Query all collected logs by any field or full-text search
• time window, service, log level, trace ID, message
Centralized, request-correlated, formatted logs
indexed and searchable across your system
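
To illustrate why a trace ID matters once concurrent requests intermingle their logs, here is a minimal JDK-only sketch. It is not how Spring Cloud Sleuth or Elasticsearch work internally; the `LogEntry` fields, the sample trace IDs, and the in-memory list standing in for the central log store are all assumptions made for illustration.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical stand-in for a central log store; not a Sleuth, Logstash, or Elasticsearch API.
class LogCorrelationSketch {

    // One structured log line; the field names are illustrative assumptions.
    static final class LogEntry {
        final String traceId;
        final String service;
        final String level;
        final String message;

        LogEntry(String traceId, String service, String level, String message) {
            this.traceId = traceId;
            this.service = service;
            this.level = level;
            this.message = message;
        }

        @Override
        public String toString() {
            return level + " [" + service + "," + traceId + "] " + message;
        }
    }

    // Entries from two concurrent requests, intermingled across two services.
    static List<LogEntry> sampleLogs() {
        return List.of(
                new LogEntry("aaa111", "app1", "INFO", "received GET /hello"),
                new LogEntry("bbb222", "app1", "INFO", "received GET /bye"),
                new LogEntry("aaa111", "app2", "ERROR", "lookup failed"),
                new LogEntry("bbb222", "app2", "INFO", "lookup ok"));
    }

    // Keep only the entries that belong to one request, preserving arrival order.
    static List<LogEntry> byTraceId(List<LogEntry> logs, String traceId) {
        return logs.stream()
                .filter(e -> e.traceId.equals(traceId))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        byTraceId(sampleLogs(), "aaa111").forEach(System.out::println);
    }
}
```

Filtering on the trace ID recovers the single request's path through app1 and app2, which is the same kind of query a Kibana search by trace ID would run against the indexed logs.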

22
Spring Boot
• Configurable via Spring Environment (see also Spring Cloud Config)
• log format – make a common format across applications
• log levels (logging.level.*)
• Configurable via Actuator (at runtime)
• log levels
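
A sketch of what such shared logging configuration might look like in a properties file. The package name `com.example.travel` and the exact pattern are assumptions, not the settings used in the talk; the property keys themselves are standard Spring Boot ones.

```properties
# Log levels via the Spring Environment (logging.level.*)
logging.level.root=INFO
# Hypothetical application package
logging.level.com.example.travel=DEBUG

# A common console log format shared by all applications (illustrative pattern)
logging.pattern.console=%d{yyyy-MM-dd'T'HH:mm:ss.SSSXXX} %5p [${spring.application.name:-}] [%thread] %logger{36} - %msg%n

# Expose the Actuator loggers endpoint so levels can be changed at runtime
management.endpoints.web.exposure.include=health,info,loggers
```

With the loggers endpoint exposed, a level can be changed at runtime by sending a POST to /actuator/loggers/com.example.travel with the JSON body {"configuredLevel": "DEBUG"}.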

23
Spring Cloud Config – shared config properties
• Common log pattern
Travel Auto-configuration
• Correlation ID added to MDC
ELK
• Elasticsearch – log storage/querying/indexing
• Logstash – log forwarding/parsing
• Kibana – search / UI for querying Elasticsearch

25
Characteristics:
• Aggregate time-series data; bounded size
• Can slice based on multiple dimensions/tags/labels*
Purpose:
• Visualize / identify trends and deviation
• Alerting based on metric queries
* See also https://www.datadoghq.com/blog/the-power-of-tagged-metrics/
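
The idea of aggregate, bounded, sliceable metrics can be sketched with nothing but the JDK. This is a conceptual illustration only and deliberately not the Micrometer API; the class name, tag names, and sample values are assumptions.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Conceptual sketch of a dimensional (tagged) counter; this is NOT the Micrometer API.
class TaggedCounterSketch {

    // One aggregate per unique tag combination: size is bounded by tag
    // cardinality, not by the number of requests recorded.
    private final Map<Map<String, String>, LongAdder> series = new ConcurrentHashMap<>();

    void increment(Map<String, String> tags) {
        series.computeIfAbsent(Map.copyOf(tags), k -> new LongAdder()).increment();
    }

    // Slice: sum every series whose tags contain the given key/value pair.
    long sumWhere(String tagKey, String tagValue) {
        return series.entrySet().stream()
                .filter(e -> tagValue.equals(e.getKey().get(tagKey)))
                .mapToLong(e -> e.getValue().sum())
                .sum();
    }

    // Illustrative data: six requests spread over three tag combinations.
    static TaggedCounterSketch sample() {
        TaggedCounterSketch requests = new TaggedCounterSketch();
        for (int i = 0; i < 3; i++) {
            requests.increment(Map.of("uri", "/hello", "status", "200"));
        }
        requests.increment(Map.of("uri", "/hello", "status", "500"));
        for (int i = 0; i < 2; i++) {
            requests.increment(Map.of("uri", "/bye", "status", "200"));
        }
        return requests;
    }

    public static void main(String[] args) {
        TaggedCounterSketch requests = sample();
        System.out.println("requests to /hello: " + requests.sumWhere("uri", "/hello")); // 4
        System.out.println("responses with 200: " + requests.sumWhere("status", "200")); // 5
    }
}
```

The same six data points answer two different questions depending on the tag used to slice them, while only three counters are stored.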

26
Example metric                  Type        Example tags
response time                   timer       uri, status, method
number of classes loaded        gauge
response body size              histogram   uri, status, method
number of garbage collections   counter     cause, action

27
[Diagram: users send HTTP server requests to the controller of my-application; a person reads its metrics via HTTP GET metrics or metrics over JMX]

28
[Diagram: users send HTTP server requests through a load balancer (LB) to two my-application instances, each with its own controller]

29
[Diagram: each my-application instance publishes metrics to a metrics backend, which feeds alerts and visualization]

30
• Spring Boot 2 uses Micrometer as its native metrics library
• Micrometer supports many metrics backends
• e.g. Atlas, Datadog, InfluxDB, Prometheus, SignalFx, Wavefront
• Instrumentation of common components auto-configured
• JVM/system, HTTP server/client requests, Spring Integration, DataSource…
• Custom metrics also easy to add

31
• Configure via properties
• management.metrics.*
• Disable certain metrics
• Enable percentiles/SLAs/percentile histograms
• Common tags
• e.g. application name, instance, stack, region, zone
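
The options above might look like the following in a properties file. This is a sketch that assumes a Spring Boot 2.1-era application (where `management.metrics.tags.*` is available); the chosen meter names, thresholds, and the `region` value are illustrative assumptions.

```properties
# Disable certain metrics: turns off all meters whose name starts with "jvm"
management.metrics.enable.jvm=false

# Percentiles, SLA buckets, and percentile histograms for HTTP server requests
management.metrics.distribution.percentiles.http.server.requests=0.95,0.99
management.metrics.distribution.sla.http.server.requests=100ms,500ms
management.metrics.distribution.percentiles-histogram.http.server.requests=true

# Common tags added to every meter
management.metrics.tags.application=${spring.application.name}
# Hypothetical value
management.metrics.tags.region=region-1
```

Later Spring Boot versions renamed some of these keys (for example `sla` to `slo`), so check the reference documentation for the version in use.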

32
Travel Service Starter (included in service-parent)
• Includes micrometer-registry-prometheus dependency
• Common metric tag for application name (spring.application.name)
Travel Metrics Platform
• Micrometer library for metrics instrumentation/reporting
• Prometheus for metrics collection/storage/querying
• Grafana for dashboards/graphing sourced by Prometheus

33
• Visualize metrics, compare over time
• Have a question you’re trying to answer
• Do NOT just stare at dashboards

34
• 4 Golden signals
• Latency
• Errors
• Rate
• Saturation

35
• Don’t double alert!
• Symptoms, not causes

37
• Investigate a slow request
• Understand dependency/call relationship between services
• Where did the error occur in the request?

38
• Local tracing: Actuator /httptrace endpoint
• Latency data + request metadata
{
  "traces" : [ {
    "timestamp" : "2018-05-09T13:28:32.867Z",
    "principal" : {
      "name" : "alice"
    },
    "session" : {
      "id" : "728aebfe-8222-4dd2-856c-256104b20bfe"
    },
    "request" : {
      "method" : "GET",
      "uri" : "https://api.example.com",
      "headers" : {
        "Accept" : [ "application/json" ]
      }
    },
    "response" : {
      "status" : 200,
      "headers" : {
        "Content-Type" : [ "application/json" ]
      }
    },
    "timeTaken" : 3
  } ]
}
Source: Spring Boot Actuator Web API Documentation; access date: 2018-05-18
https://docs.spring.io/spring-boot/docs/2.0.2.RELEASE/actuator-api/html/#http-trace

39
Distributed tracing: tracing across process boundaries
• Propagate context/hierarchy; join together after
• Request-scoped latency analysis across services
• Metrics lack request context
• Logging has local context but limited distributed info

40
Tracing instrumented system
[Diagram: a user request flows through service1, service2, service3, and service4; the tracer / instrumentation in each service reports to the tracing backend]
① start span / sampling decision
② propagate trace context
③ continue trace
④ report spans
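
Steps ① to ③ above can be sketched by hand with the B3 header names that Zipkin-compatible tracers use. In a Spring application Spring Cloud Sleuth (via Brave) does this automatically, so this is only a simplified illustration: the class and method names are assumptions, and reporting spans to a backend (step ④) is left out.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;

// Hand-rolled sketch of trace-context propagation with B3 header names.
// Not the Brave or Sleuth API; span reporting (step 4) is omitted.
class TracePropagationSketch {

    static final class TraceContext {
        final String traceId;
        final String spanId;
        final String parentSpanId; // null for the root span
        final boolean sampled;

        TraceContext(String traceId, String spanId, String parentSpanId, boolean sampled) {
            this.traceId = traceId;
            this.spanId = spanId;
            this.parentSpanId = parentSpanId;
            this.sampled = sampled;
        }
    }

    // 64-bit ID encoded as 16 lower-hex characters.
    static String newId() {
        return String.format("%016x", ThreadLocalRandom.current().nextLong());
    }

    // Step 1: start a new trace and record the sampling decision.
    static TraceContext startTrace(boolean sampled) {
        return new TraceContext(newId(), newId(), null, sampled);
    }

    // A child span keeps the trace ID and points back to its parent.
    static TraceContext child(TraceContext parent) {
        return new TraceContext(parent.traceId, newId(), parent.spanId, parent.sampled);
    }

    // Step 2: write the context into the outgoing request headers.
    static Map<String, String> inject(TraceContext ctx) {
        Map<String, String> headers = new HashMap<>();
        headers.put("X-B3-TraceId", ctx.traceId);
        headers.put("X-B3-SpanId", ctx.spanId);
        if (ctx.parentSpanId != null) {
            headers.put("X-B3-ParentSpanId", ctx.parentSpanId);
        }
        headers.put("X-B3-Sampled", ctx.sampled ? "1" : "0");
        return headers;
    }

    // Step 3: read the context from incoming headers to continue the same trace.
    static TraceContext extract(Map<String, String> headers) {
        return new TraceContext(
                headers.get("X-B3-TraceId"),
                headers.get("X-B3-SpanId"),
                headers.get("X-B3-ParentSpanId"),
                "1".equals(headers.get("X-B3-Sampled")));
    }

    public static void main(String[] args) {
        TraceContext root = startTrace(true);                 // in service1
        Map<String, String> headers = inject(child(root));    // service1 calls service2
        TraceContext incoming = extract(headers);             // in service2
        System.out.println("same trace: " + incoming.traceId.equals(root.traceId));
    }
}
```

Because every hop carries the same trace ID and a parent span ID, the backend can join the separately reported spans into one request tree afterwards.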

42
2010: Google Dapper
2012: Twitter Zipkin
2015: OpenZipkin
2017: Zipkin Meetup #1
2018: Apache Incubator (today)
https://zipkin.io/
WIKI: https://cwiki.apache.org/confluence/display/ZIPKIN/

43
Source: Spring Cloud Sleuth reference documentation; access date: 2018-05-18
http://cloud.spring.io/spring-cloud-static/spring-cloud-sleuth/2.0.0.RC1/single/spring-cloud-sleuth.html#_distributed_tracing_with_zipkin
Zipkin UI workshop happening this week!
https://cwiki.apache.org/confluence/display/ZIPKIN/2018-10-29+Zipkin+UI+at+LINE+Tokyo

44
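[Diagram: the tracing instrumented system (s1, s2, s3, s4) reports spans over a transport to the Zipkin server, which consists of a collector, storage backed by a datastore, an API, and a UI used by a developer]
Transport:
• HTTP
• Kafka
• RabbitMQ
Datastore:
• In-memory *
• MySQL *
• Elasticsearch
• Cassandra
Reference:
https://zipkin.io/pages/architecture.html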

45
Tracing backend: Zipkin Server getting started
Spring Cloud Sleuth: spring-cloud-starter-zipkin dependency
• auto-configures tracing instrumentation (Zipkin’s Brave)
• reports recorded spans to Zipkin async/batched

46
Travel Service Starter (included in service-parent)
• Includes spring-cloud-starter-zipkin dependency (Spring Cloud Sleuth)
• Tag root span with correlation ID
Travel Cloud Config
• Zipkin server address
• Sampling %, skip patterns
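
As a sketch, the shared tracing settings could look like this with Spring Cloud Sleuth 2.x property names. The host name, sampling rate, and skip pattern are illustrative assumptions, not the values used by the Travel Cloud Config.

```properties
# Zipkin server address (hypothetical host)
spring.zipkin.base-url=http://zipkin.example.internal:9411/

# Sample roughly 10% of requests
spring.sleuth.sampler.probability=0.1

# Do not trace requests whose path matches this regular expression
spring.sleuth.web.skip-pattern=/actuator.*|/favicon.ico
```

Setting `spring.sleuth.web.skip-pattern` replaces Sleuth's default pattern rather than extending it, so any default exclusions that are still wanted need to be repeated.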

48
Together you have correlated logging, metrics, and tracing across the
whole system. Jump between each using common identifiers.
Adapted from: Adrian Cole, “Observability 3 ways: logging metrics and tracing”; access date: 2018-05-18
https://speakerdeck.com/adriancole/observability-3-ways-logging-metrics-and-tracing

49

spring.application.name = Zipkin service name
Configure as Micrometer common tag
http_server_requests_seconds_count{exception="None",method="GET",status="200",uri="/hello",} 4.0
http_server_requests_seconds_sum{exception="None",method="GET",status="200",uri="/hello",} 0.02570928
http_server_requests_seconds_max{exception="None",method="GET",status="200",uri="/hello",} 0.0
Micrometer tags
Zipkin tags

50

Link to e.g. Kibana search by traceId
Can also do Logs → Trace
https://github.com/openzipkin/zipkin/tree/master/zipkin-ui#how-do-i-find-logs-associated-with-a-particular-trace

51
• Confirm request flow – does it match the expected design/architecture?
• Check service dependencies in Zipkin
• Check request flow in Zipkin; jump to logs if necessary
• Filter by service name, span name, tags
• Adjust log levels via Actuator if necessary

52
• Automated tests generate a correlation ID per test case execution.
• Use correlation ID to find the related traces in Zipkin.
[Diagram: correlation ID cID0001 is attached to both trace1 and trace2]
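
A minimal sketch of the test side, assuming the correlation ID travels in a request header. The header name `X-Correlation-Id`, the ID format, and the URL are hypothetical; the deck does not say how the Travel auto-configuration reads the ID.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.util.UUID;

// Sketch: build a test request that carries a fresh correlation ID.
class CorrelationIdSketch {

    // Hypothetical header name; not taken from the talk.
    static final String HEADER = "X-Correlation-Id";

    // One new ID per test case execution, prefixed with the test name for readability.
    static HttpRequest requestFor(String testCaseName, String url) {
        String correlationId = testCaseName + "-" + UUID.randomUUID();
        return HttpRequest.newBuilder(URI.create(url))
                .header(HEADER, correlationId)
                .GET()
                .build();
    }

    public static void main(String[] args) {
        HttpRequest request = requestFor("checkout-test", "http://localhost:8080/hello");
        // Print the ID so a failing test can be looked up in Zipkin afterwards.
        System.out.println(request.headers().firstValue(HEADER).orElseThrow());
    }
}
```

If the service tags its root span with this ID, the printed value can be used as a tag query in Zipkin to find every trace that the test case produced.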

53
• Manual tests (in non-production environments) from the browser can use Zipkin Browser Extension to get the traceId for a browser request
• Where in the request flow did the error occur or why was it slow?
• Check request flow in Zipkin; jump to logs (if necessary)
• Adjust log levels via Actuator (if necessary)

54
[Diagram: a repeating cycle of Detection, Investigation, Recovery, and Adjustment, triggered by an alert or inquiry]
1. Starts with an alert/report
2. Check metrics
3. Check tracing data (if needed)
4. Check logs (if needed)
5. Triage issue
6. Make adjustment to prevent recurrence

56
• System-wide observability is crucial in distributed architectures
• Tools exist and Spring makes them easy to integrate
• Most common cases are covered out-of-the-box or configurable. Custom instrumentation is possible as needed.
• Use the right tool for the job; synergize across tools

58
• “Distributed Systems Observability” e-book by Cindy Sridharan:
http://distributed-systems-observability-ebook.humio.com/
• Articles by Cindy Sridharan (@copyconstruct): https://medium.com/@copyconstruct
• Talks by Charity Majors (@mipsytipsy): https://speakerdeck.com/charity
• “Observability+” articles by JBD (@rakyll): https://medium.com/observability
