Tommy Ludwig
Rakuten, Inc.
Travel Product Development Department
Foundation Office
Spring Fest 2018
2018-10-31
2
• Observability: what / why
• 3 pillars of observability: Logging, Metrics, Tracing
• Putting it all together
3
4
Observability is achieved through a set of tools and practices that aims
to turn data points and context into insights.
• Beyond traditional monitoring
• Constant partial degradation/failure
• Expect the unexpected
• Answer unknown questions about your system
5
You want to provide a great experience for users of your system.
• Observability builds confidence in production
• Ownership. Give yourself the tools to be a good owner.
• MTTR is key – failures will happen and are happening
• early detection + fast recovery + increased understanding
* MTTR = mean time to recovery
6
• Finish your work faster/easier
• Find and fix problems sooner (before release, before QA)
• Improve your service by better understanding its behavior
7
8
• Spring Boot Actuator is awesome.
• You get so much out-of-the-box.
• But... is it enough? Like most things, it depends.
• Information is inherently instance-scoped
Spring Boot Admin makes it easy to access and use each instance’s Actuator endpoints.
https://github.com/codecentric/spring-boot-admin
10
11
[Diagram: two users and three databases]
12
• Any request spans multiple processes
• Need to stitch together local info and slice/drill-down
• Increased points of failure
• Scaling and ephemeral instances*
* Not strictly properties of a distributed system
13
14
…
• 3 sides to observability
• Non-functional requirements (generic/specific)
• Overlap exists, but use all 3 for best insight
Source: Peter Bourgon, access date: 2018-05-18
http://peter.bourgon.org/blog/2017/02/21/metrics-tracing-and-logging.html
15
When it comes to logging, metrics, and tracing:
• Common needs just work out-of-the-box.
• Custom needs can be met with a little extra effort.
See also: 80-20 rule
16
17
• Arbitrary messages you want to find later
• Formatted to give context: logging levels, timestamp
• Message examples
  • Exceptions/stack traces
  • Additional context
  • Access logs
  • Request/response bodies
18
[Diagram: "I want to check the logs…" Each VM runs App1 and App2, each writing its own log files. To investigate a failure you have to get the logs from every VM and search them by hand. Legend: VM, App1, App2, Logs]
19
• Does not scale; too much work and knowledge required
• Multithreaded, concurrent requests intermingle logs
• Low usability – searching is limited/difficult
20
[Diagram: App1 and App2 on each VM stream logs to a central log store service. A user sends a query request to query the logs and gets back the collection of matching logs. Legend: VM, App1, App2, Logs]
21
Spring Cloud Sleuth
• Adds trace ID for request correlation
• Query all collected logs by any field or full-text search
  • time window, service, log level, trace ID, message
Centralized, request-correlated, formatted logs indexed and searchable across your system
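To make request correlation concrete, here is a minimal sketch in plain Java (no Spring dependencies) of what querying collected logs by trace ID amounts to: lines from concurrent requests arrive intermingled, and the trace ID pulls one request's lines back together. The class name `TraceLogQuery` and the simplified line layout `LEVEL [service,traceId] message` are assumptions made for illustration; they are not Sleuth's actual log format or API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: filters intermingled log lines down to one request.
class TraceLogQuery {

    // Assumed line layout for this sketch: "LEVEL [service,traceId] message"
    static String traceIdOf(String line) {
        int open = line.indexOf('[');
        if (open < 0) {
            return null; // line carries no trace context
        }
        int close = line.indexOf(']', open + 1);
        if (close < 0) {
            return null;
        }
        String[] parts = line.substring(open + 1, close).split(",");
        return parts.length == 2 ? parts[1].trim() : null;
    }

    // Returns, in original order, every line that belongs to the given trace.
    static List<String> byTraceId(List<String> lines, String traceId) {
        List<String> matches = new ArrayList<>();
        for (String line : lines) {
            if (traceId.equals(traceIdOf(line))) {
                matches.add(line);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<String> collected = List.of(
                "INFO  [app1,aaa111] GET /hello received",
                "INFO  [app1,bbb222] GET /bye received",
                "ERROR [app2,aaa111] downstream call failed",
                "INFO  [app2,bbb222] downstream call ok");
        // Prints only the two lines of request aaa111, across app1 and app2.
        byTraceId(collected, "aaa111").forEach(System.out::println);
    }
}
```

In a real setup the log store does this filtering through its index; the point of the sketch is only that a shared ID on every line is what makes the query possible.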
22
Spring Boot
• Configurable via Spring Environment (see also Spring Cloud Config)
  • log format – make a common format across applications
  • log levels (logging.level.*)
• Configurable via Actuator (at runtime)
  • log levels
23
Spring Cloud Config – shared config properties
• Common log pattern
Travel Auto-configuration
• Correlation ID added to MDC
ELK
• Elasticsearch – log storage/querying/indexing
• Logstash – log forwarding/parsing
• Kibana – search / UI for querying Elasticsearch
24
25
Characteristics:
• Aggregate time-series data; bounded size
• Can slice based on multiple dimensions/tags/labels*
Purpose:
• Visualize / identify trends and deviation
• Alerting based on metric queries
* See also https://www.datadoghq.com/blog/the-power-of-tagged-metrics/
26
Example metric                 Type       Example tags
response time                  timer      uri, status, method
number of classes loaded       gauge
response body size             histogram  uri, status, method
number of garbage collections  counter    cause, action
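The idea of slicing one metric by tags can be sketched without any metrics library. The class below is a hypothetical, stdlib-only stand-in for a tagged counter (it is not Micrometer's API): each distinct combination of tag values becomes its own series, so storage stays bounded by the number of tag combinations rather than by the number of requests.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical, stdlib-only stand-in for a tagged counter; not Micrometer's API.
class TaggedCounter {

    // One time series per unique combination of tag values.
    private final Map<String, LongAdder> series = new ConcurrentHashMap<>();

    // Sorting the tags makes the key independent of the order they were given in.
    private static String key(Map<String, String> tags) {
        return new TreeMap<>(tags).toString();
    }

    void increment(Map<String, String> tags) {
        series.computeIfAbsent(key(tags), k -> new LongAdder()).increment();
    }

    long count(Map<String, String> tags) {
        LongAdder adder = series.get(key(tags));
        return adder == null ? 0 : adder.sum();
    }

    // Bounded size: grows with distinct tag combinations, not with request volume.
    int seriesCount() {
        return series.size();
    }

    public static void main(String[] args) {
        TaggedCounter requests = new TaggedCounter();
        Map<String, String> ok = Map.of("uri", "/hello", "status", "200", "method", "GET");
        Map<String, String> failed = Map.of("uri", "/hello", "status", "500", "method", "GET");
        for (int i = 0; i < 1000; i++) {
            requests.increment(ok);
        }
        requests.increment(failed);
        System.out.println(requests.count(ok));      // 1000
        System.out.println(requests.count(failed));  // 1
        System.out.println(requests.seriesCount());  // 2
    }
}
```

This is also why high-cardinality tag values (for example a user ID or a raw URL with path variables) are usually avoided: every new value creates another series.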
27
[Diagram: users send HTTP server requests to the controller of a single my-application instance; an operator reads that instance's metrics via HTTP GET or over JMX]
28
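[Diagram: users send HTTP server requests through a load balancer (LB) to several my-application instances, each with its own controller; an operator looking at one instance sees only part of the picture]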
29
[Diagram: each my-application instance publishes metrics to a metrics backend, which drives alerts and visualization]
30
• Spring Boot 2 uses Micrometer as its native metrics library
• Micrometer supports many metrics backends
  • e.g. Atlas, Datadog, Influx, Prometheus, SignalFx, Wavefront
• Instrumentation of common components auto-configured
  • JVM/system, HTTP server/client requests, Spring Integration, DataSource…
• Custom metrics also easy to add
31
• Configure via properties
• management.metrics.*
• Disable certain metrics
• Enable percentiles/SLAs/percentile histograms
• Common tags
• e.g. application name, instance, stack, region, zone
32
Travel Service Starter (included in service-parent)
• Includes micrometer-registry-prometheus dependency
Travel Auto-configuration
• Common metric tag for application name (spring.application.name)
Travel Metrics Platform
• Micrometer library for metrics instrumentation/reporting
• Prometheus for metrics collection/storage/querying
• Grafana for dashboards/graphing sourced by Prometheus
33
• Visualize metrics, compare over time
• Have a question you’re trying to answer
• Do NOT just stare at dashboards
34
• 4 Golden signals
  • Latency
  • Errors
  • Rate
  • Saturation
35
• Don’t double alert!
• Symptoms, not causes
36
37
• Investigate a slow request
• Understand dependency/call relationship between services
• Where did the error occur in the request?
38
• Local tracing: Actuator /httptrace endpoint
• Latency data + request metadata
{
  "traces" : [ {
    "timestamp" : "2018-05-09T13:28:32.867Z",
    "principal" : {
      "name" : "alice"
    },
    "session" : {
      "id" : "728aebfe-8222-4dd2-856c-256104b20bfe"
    },
    "request" : {
      "method" : "GET",
      "uri" : "https://api.example.com",
      "headers" : {
        "Accept" : [ "application/json" ]
      }
    },
    "response" : {
      "status" : 200,
      "headers" : {
        "Content-Type" : [ "application/json" ]
      }
    },
    "timeTaken" : 3
  } ]
}
Source: Spring Boot Actuator Web API Documentation; access date: 2018-05-18
https://docs.spring.io/spring-boot/docs/2.0.2.RELEASE/actuator-api/html/#http-trace
39
Distributed tracing: tracing across process boundaries
• Propagate context/hierarchy; join together after
• Request-scoped latency analysis across services
• Metrics lack request context
• Logging has local context but limited distributed info
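Propagating context across a process boundary can be sketched in a few lines of plain Java. This is a simplified illustration, not how Brave or Sleuth are implemented: the class `TraceContext` and its methods are hypothetical, and the callee here always creates a new child span, whereas real instrumentation also handles sampling decisions, span timing, and reporting. The header names are the B3 names used by Zipkin.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical sketch of trace-context propagation; real tracers do much more.
class TraceContext {
    final String traceId;
    final String spanId;
    final String parentSpanId; // null for the root span

    TraceContext(String traceId, String spanId, String parentSpanId) {
        this.traceId = traceId;
        this.spanId = spanId;
        this.parentSpanId = parentSpanId;
    }

    // 64-bit random ID rendered as 16 lowercase hex characters.
    static String newId() {
        return String.format("%016x", ThreadLocalRandom.current().nextLong());
    }

    // First service in the request: start a new trace with a root span.
    static TraceContext startTrace() {
        String id = newId();
        return new TraceContext(id, id, null);
    }

    // Caller side: copy the context into outgoing request headers.
    Map<String, String> toHeaders() {
        Map<String, String> headers = new HashMap<>();
        headers.put("X-B3-TraceId", traceId);
        headers.put("X-B3-SpanId", spanId);
        if (parentSpanId != null) {
            headers.put("X-B3-ParentSpanId", parentSpanId);
        }
        return headers;
    }

    // Callee side: same trace ID, new span whose parent is the caller's span.
    static TraceContext continueFrom(Map<String, String> headers) {
        return new TraceContext(
                headers.get("X-B3-TraceId"), newId(), headers.get("X-B3-SpanId"));
    }

    public static void main(String[] args) {
        TraceContext inService1 = startTrace();
        TraceContext inService2 = continueFrom(inService1.toHeaders());
        System.out.println("service1 trace=" + inService1.traceId + " span=" + inService1.spanId);
        System.out.println("service2 trace=" + inService2.traceId + " span=" + inService2.spanId
                + " parent=" + inService2.parentSpanId);
    }
}
```

Because every span carries the same trace ID plus a pointer to its parent, the backend can join the spans reported by separate processes into one request tree afterwards.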
40
[Diagram: a user request enters a tracing-instrumented system of four services (service1 to service4); each service has a tracer / instrumentation, and spans are reported to the tracing backend, where a user views them]
① start span / sampling decision
② propagate trace context
③ continue trace
④ report spans
41
42
[2010] Google Dapper
[2012] Twitter Zipkin
[2015] OpenZipkin
[2017] Zipkin Meetup #1
[2018] Apache Incubator
Today
https://zipkin.io/
WIKI: https://cwiki.apache.org/confluence/display/ZIPKIN/
43
Source: Spring Cloud Sleuth reference documentation; access date: 2018-05-18
http://cloud.spring.io/spring-cloud-static/spring-cloud-sleuth/2.0.0.RC1/single/spring-cloud-sleuth.html#_distributed_tracing_with_zipkin
Zipkin UI workshop happening this week!
https://cwiki.apache.org/confluence/display/ZIPKIN/2018-10-29+Zipkin+UI+at+LINE+Tokyo
44
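[Diagram: a tracing-instrumented system (s1 to s4) reports spans over a transport to the Zipkin server, which consists of collector, storage, API, and UI; storage persists to a datastore, and users query through the UI]
Transport:
• HTTP
• Kafka
• RabbitMQ
Datastore:
• In-memory *
• MySQL *
• Elasticsearch
• Cassandra
Reference:
https://zipkin.io/pages/architecture.html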
45
Tracing backend: Zipkin Server getting started
Spring Cloud Sleuth: spring-cloud-starter-zipkin dependency
• auto-configures tracing instrumentation (Zipkin’s Brave)
• reports recorded spans to Zipkin async/batched
46
Travel Service Starter (included in service-parent)
• Includes spring-cloud-starter-zipkin dependency (Spring Cloud Sleuth)
Travel Auto-configuration
• Tag root span with correlation ID
Travel Cloud Config
• Zipkin server address
• Sampling %, skip patterns
47
48
Together you have correlated logging, metrics, and tracing across the
whole system. Jump between each using common identifiers.
Adapted from: Adrian Cole, “Observability 3 ways: logging metrics and tracing”; access date: 2018-05-18
https://speakerdeck.com/adriancole/observability-3-ways-logging-metrics-and-tracing
49
spring.application.name = Zipkin service name; also configure it as a Micrometer common tag
http_server_requests_seconds_count{exception="None",method="GET",status="200",uri="/hello",} 4.0
http_server_requests_seconds_sum{exception="None",method="GET",status="200",uri="/hello",} 0.02570928
http_server_requests_seconds_max{exception="None",method="GET",status="200",uri="/hello",} 0.0
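As a small worked example of reading these samples: the `_sum` series divided by the `_count` series gives the mean request duration for that tag combination. The sketch below does this in plain Java; the class `PrometheusSample` and its methods are hypothetical helpers written for this illustration, and in practice the division would normally be done by a query in the metrics backend.

```java
// Hypothetical helper for reading single samples in the Prometheus text format.
class PrometheusSample {

    // A sample line has the shape "name{labels} value"; the value follows the last space.
    static double valueOf(String line) {
        return Double.parseDouble(line.substring(line.lastIndexOf(' ') + 1).trim());
    }

    // Mean duration in seconds = total time spent / number of requests.
    static double meanSeconds(String countLine, String sumLine) {
        double count = valueOf(countLine);
        return count == 0 ? 0.0 : valueOf(sumLine) / count;
    }

    public static void main(String[] args) {
        String count = "http_server_requests_seconds_count"
                + "{exception=\"None\",method=\"GET\",status=\"200\",uri=\"/hello\",} 4.0";
        String sum = "http_server_requests_seconds_sum"
                + "{exception=\"None\",method=\"GET\",status=\"200\",uri=\"/hello\",} 0.02570928";
        // Four requests took about 25.7 ms in total, so roughly 6.4 ms each.
        System.out.println(meanSeconds(count, sum));
    }
}
```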
Micrometer tags
Zipkin tags
50
Trace → Logs: link to e.g. a Kibana search by traceId
Can also go Logs → Trace
https://github.com/openzipkin/zipkin/tree/master/zipkin-ui#how-do-i-find-logs-associated-with-a-particular-trace
51
• Confirm request flow – does it match the expected design/architecture?
• Check service dependencies in Zipkin
• Check request flow in Zipkin; jump to logs if necessary
• Filter by service name, span name, tags
• Adjust log levels via Actuator if necessary
52
• Automated tests generate a correlation ID per test case execution.
• Use correlation ID to find the related traces in Zipkin.
[Diagram: correlation ID cID0001 is attached to both trace1 and trace2]
53
• Manual tests (in non-production environments) from the browser can use the Zipkin Browser Extension to get the traceId for a browser request
• Where in the request flow did the error occur or why was it slow?
• Check request flow in Zipkin; jump to logs (if necessary)
• Adjust log levels via Actuator (if necessary)
54
Incident cycle: Detect → Investigate → Recover → Adjust (repeat); triggered by an alert or inquiry
1. Starts with an alert/report
2. Check metrics
3. Check tracing data (if needed)
4. Check logs (if needed)
5. Triage issue
6. Make adjustment to prevent recurrence
55
56
• System-wide observability is crucial in distributed architectures
• Tools exist and Spring makes them easy to integrate
• Most common cases are covered out-of-the-box or configurable.
Custom instrumentation is possible as needed.
• Use the right tool for the job; synergize across tools
58
• “Distributed Systems Observability” e-book by Cindy Sridharan:
http://distributed-systems-observability-ebook.humio.com/
• Articles by Cindy Sridharan (@copyconstruct): https://medium.com/@copyconstruct
• Talks by Charity Majors (@mipsytipsy): https://speakerdeck.com/charity
• “Observability+” articles by JBD (@rakyll): https://medium.com/observability

Observability with Spring-based distributed systems

  • 1. Tommy Ludwig Rakuten, Inc. Travel Product Development Department Foundation Office Spring Fest 2018 2018-10-31
  • 2. 2 • Observability: what / why • 3 pillars of observability: Logging, Metrics, Tracing • Putting it all together
  • 3. 3
  • 4. 4 Observability is achieved through a set of tools and practices that aims to turn data points and context into insights. • Beyond traditional monitoring • Constant partial degradation/failure • Expect the unexpected • Answer unknown questions about your system
  • 5. 5 You want to provide a great experience for users of your system. • Observability builds confidence in production • Ownership. Give yourself the tools to be a good owner. • MTTR is key – failures will are happening • early detection + fast recovery + increased understanding * MTTR = mean time to recovery
  • 6. 6 • Finish your work faster/easier • Find and fix problems sooner (before release, before QA) • Improve your service by better understanding its behavior
  • 7. 7
  • 8. 8 • Spring Boot Actuator is awesome. • You get so much out-of-the-box. • But... is it enough? Like most things, it depends. • Inherently information is instance-scoped
  • 9. Spring Boot Admin makes it easy to access and use each instance’s Actuator endpoints. https://github.com/codecentric/spring-boot-admin
  • 10. 10
  • 11. 11 DB DB DB User User 👤 👤
  • 12. 12 • Any request spans multiple processes • Need to stitch together local info and slice/drill-down • Increased points of failure • Scaling and ephemeral instances* * Not strictly properties of a distributed system
  • 13. 13
  • 14. 14 … • 3 sides to observability • Non-functional requirements (generic/specific) • Overlap exists, but use all 3 for best insight Source: Peter Bourgon, access date: 2018-05-18 http://peter.bourgon.org/blog/2017/02/21/metrics-tracing-and-logging.html
  • 15. 15 When it comes to logging, metrics, and tracing: • Common needs just work out-of-the-box. • Custom needs can be met with a little extra effort. See also: 80-20 rule
  • 16. 16
  • 17. 17 • Arbitrary messages you want to find later • Formatted to give context: logging levels, timestamp • Message examples • Exceptions/stack traces • Additional context • Access logs • Request/response bodies
  • 18. 18 [Diagram: "I want to check the logs…" Logs for App1 and App2 are spread across VMs; steps shown: get logs, search logs 🤔. Legend: VM, App1, App2, Logs, 💥]
  • 19. 19 • Does not scale; too much work and knowledge required • Multithreaded, concurrent requests intermingle logs • Low usability – searching is limited/difficult
  • 20. 20 [Diagram: apps on each VM stream logs to a central log store service; a query request to the store returns the collection of matching logs. Legend: VM, App1, App2, Logs]
  • 21. 21 Spring Cloud Sleuth • adds trace ID for request correlation • Query all collected logs by any field or full-text search • time window, service, log level, trace ID, message Centralized, request-correlated, formatted logs indexed and searchable across your system
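To illustrate why a trace ID makes intermingled logs searchable, here is a minimal sketch in plain Java. It is not how Spring Cloud Sleuth is implemented: the class `CorrelatedLog`, its methods, and the simplified log layout are hypothetical names chosen for illustration. A thread-local value stands in for the logging context (MDC), and an in-memory list stands in for the central log store.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch: every log line carries the trace ID of the request being handled.
class CorrelatedLog {
    // Stand-in for a logging context such as MDC: the trace ID bound to the current thread.
    private final ThreadLocal<String> traceId = ThreadLocal.withInitial(() -> "-");
    // Stand-in for a central log store that all services stream to.
    private final List<String> store = new ArrayList<>();

    void startRequest(String id) { traceId.set(id); }

    void endRequest() { traceId.remove(); }

    synchronized void info(String service, String message) {
        store.add(String.format("INFO [%s,%s] %s", service, traceId.get(), message));
    }

    // Query the store by trace ID, as one would in a log search UI.
    synchronized List<String> findByTraceId(String id) {
        return store.stream()
                .filter(line -> line.contains("," + id + "]"))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        CorrelatedLog log = new CorrelatedLog();
        log.startRequest("abc123");
        log.info("app1", "lookup user");
        log.endRequest();
        log.startRequest("def456");
        log.info("app2", "lookup hotel");
        log.endRequest();
        log.findByTraceId("abc123").forEach(System.out::println);
    }
}
```

Because the ID is attached where the line is written, a query needs no knowledge of which instance or thread handled the request.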
  • 22. 22 Spring Boot • Configurable via Spring Environment (see also Spring Cloud Config) • log format – make a common format across applications • log levels (logging.level.*) • Configurable via Actuator (at runtime) • log levels
  • 23. 23 Spring Cloud Config – shared config properties • Common log pattern Travel Auto-configuration • Correlation ID added to MDC ELK • Elasticsearch – log storage/querying/indexing • Logstash – log forwarding/parsing • Kibana – search / UI for querying Elasticsearch
  • 24. 24
  • 25. 25 Characteristics: • Aggregate time-series data; bounded size • Can slice based on multiple dimensions/tags/labels* Purpose: • Visualize / identify trends and deviation • Alerting based on metric queries * See also https://www.datadoghq.com/blog/the-power-of-tagged-metrics/
  • 26. 26 Example metric (type; example tags): • response time (timer; uri, status, method) • number of classes loaded (gauge) • response body size (histogram; uri, status, method) • number of garbage collections (counter; cause, action)
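The idea of slicing a metric by tags can be sketched in plain Java. This is not Micrometer's API: `TaggedCounter` and its methods are hypothetical names for illustration. The sketch keeps one counter per unique tag combination, so storage grows with the number of tag combinations rather than with the number of requests.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical sketch of a dimensional counter: one time series per unique tag combination.
class TaggedCounter {
    private final Map<Map<String, String>, LongAdder> series = new ConcurrentHashMap<>();

    // Increment the series identified by the given tags (alternating keys and values).
    void increment(String... keyValues) {
        series.computeIfAbsent(tags(keyValues), k -> new LongAdder()).increment();
    }

    // Sum every series whose tags include the given key/value pair.
    long sumWhere(String key, String value) {
        return series.entrySet().stream()
                .filter(e -> value.equals(e.getKey().get(key)))
                .mapToLong(e -> e.getValue().sum())
                .sum();
    }

    int seriesCount() { return series.size(); }

    private static Map<String, String> tags(String... keyValues) {
        if (keyValues.length % 2 != 0) {
            throw new IllegalArgumentException("tags must be key/value pairs");
        }
        Map<String, String> tags = new TreeMap<>();
        for (int i = 0; i < keyValues.length; i += 2) {
            tags.put(keyValues[i], keyValues[i + 1]);
        }
        return tags;
    }

    public static void main(String[] args) {
        TaggedCounter requests = new TaggedCounter();
        requests.increment("uri", "/hello", "status", "200", "method", "GET");
        requests.increment("uri", "/hello", "status", "500", "method", "GET");
        System.out.println("status=200: " + requests.sumWhere("status", "200"));
        System.out.println("series: " + requests.seriesCount());
    }
}
```

High-cardinality tag values (for example a user ID) would create a series per value, which is why tags are usually limited to a small set such as uri, status, and method.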
  • 27. 27 [Diagram: users (👥) send HTTP server requests to my-application; a user (👤) reads metrics via HTTP GET from a metrics controller, or over JMX]
  • 30. 30 • Spring Boot 2 uses Micrometer as its native metrics library • Micrometer supports many metrics backends • e.g. Atlas, Datadog, Influx, Prometheus, SignalFX, Wavefront • Instrumentation of common components auto-configured • JVM/system, HTTP server/client requests, Spring Integration, DataSource… • Custom metrics also easy to add
  • 31. 31 • Configure via properties • management.metrics.* • Disable certain metrics • Enable percentiles/SLAs/percentile histograms • Common tags • e.g. application name, instance, stack, region, zone
  • 32. 32 Travel Service Starter (included in service-parent) • Includes micrometer-registry-prometheus dependency Travel Auto-configuration • Common metric tag for application name (spring.application.name) Travel Metrics Platform • Micrometer library for metrics instrumentation/reporting • Prometheus for metrics collection/storage/querying • Grafana for dashboards/graphing sourced by Prometheus
  • 33. 33 • Visualize metrics, compare over time • Have a question you’re trying to answer • Do NOT just stare at dashboards
  • 34. 34 • 4 Golden signals • Latency • Errors • Rate • Saturation
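As a rough illustration of three of these signals, the sketch below derives latency, error, and rate figures from one window of request samples in plain Java. The class `GoldenSignals`, the nearest-rank percentile method, and the choice of HTTP 5xx as the error definition are assumptions made for this example. Saturation is left out because it depends on resource-specific measurements such as pool or queue usage.

```java
import java.util.Arrays;

// Hypothetical sketch: derive latency, error, and rate signals from one window of samples.
class GoldenSignals {
    // Nearest-rank percentile of the observed latencies, for p in (0, 100].
    static long percentile(long[] latenciesMs, double p) {
        if (latenciesMs.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        long[] sorted = latenciesMs.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(rank, 1) - 1];
    }

    // Fraction of requests that ended in a server error (HTTP 5xx).
    static double errorRatio(int[] statuses) {
        if (statuses.length == 0) {
            return 0.0;
        }
        long errors = Arrays.stream(statuses).filter(s -> s >= 500).count();
        return (double) errors / statuses.length;
    }

    // Requests per second over the window.
    static double rate(int requestCount, double windowSeconds) {
        return requestCount / windowSeconds;
    }

    public static void main(String[] args) {
        long[] latencies = {10, 20, 30, 40, 500};
        int[] statuses = {200, 200, 500, 404};
        System.out.println("p50 latency (ms): " + percentile(latencies, 50));
        System.out.println("error ratio: " + errorRatio(statuses));
        System.out.println("rate (req/s): " + rate(120, 60.0));
    }
}
```

A percentile is shown instead of an average because a single slow request (500 ms here) is hidden by the median but visible at the tail.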
  • 35. 35 • Don’t double alert! • Symptoms, not causes
  • 36. 36
  • 37. 37 • Investigate a slow request • Understand dependency/call relationship between services • Where did the error occur in the request?
  • 38. 38 • local tracing: Actuator /httptrace endpoint • Latency data + request metadata { "traces" : [ { "timestamp" : "2018-05-09T13:28:32.867Z", "principal" : { "name" : "alice" }, "session" : { "id" : "728aebfe-8222-4dd2-856c-256104b20bfe" }, "request" : { "method" : "GET", "uri" : "https://api.example.com", "headers" : { "Accept" : [ "application/json" ] } }, "response" : { "status" : 200, "headers" : { "Content-Type" : [ "application/json" ] } }, "timeTaken" : 3 } ] } Source: Spring Boot Actuator Web API Documentation; access date: 2018-05-18 https://docs.spring.io/spring-boot/docs/2.0.2.RELEASE/actuator-api/html/#http-trace
  • 39. 39 Distributed tracing: tracing across process boundaries • Propagate context/hierarchy; join together after • Request-scoped latency analysis across services • Metrics lack request context • Logging has local context but limited distributed info
  • 40. 40 [Diagram: a user (👤) calls a tracing-instrumented system of service1 to service4, each with a tracer / instrumentation, which reports to a tracing backend] ① start span / sampling decision ② propagate trace context ③ continue trace ④ report spans
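Steps ① to ③ can be sketched in plain Java using Zipkin's B3 header names (`X-B3-TraceId`, `X-B3-SpanId`, `X-B3-ParentSpanId`, `X-B3-Sampled`). This is a simplified illustration, not Brave's or Sleuth's implementation: the class `TraceContext` and its methods are hypothetical, and the callee here always creates a child span, whereas real tracers may handle the server side differently. Reporting spans (④) is omitted.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical sketch of trace context propagation using Zipkin's B3 header names.
class TraceContext {
    final String traceId;
    final String spanId;
    final String parentSpanId; // null for the root span
    final boolean sampled;

    TraceContext(String traceId, String spanId, String parentSpanId, boolean sampled) {
        this.traceId = traceId;
        this.spanId = spanId;
        this.parentSpanId = parentSpanId;
        this.sampled = sampled;
    }

    // (1) Start a trace at the edge: new trace ID, root span, sampling decision made once.
    static TraceContext newRoot(boolean sampled) {
        String id = newId();
        return new TraceContext(id, id, null, sampled);
    }

    // (2) Propagate: copy the context into the outgoing request headers.
    Map<String, String> toHeaders() {
        Map<String, String> headers = new HashMap<>();
        headers.put("X-B3-TraceId", traceId);
        headers.put("X-B3-SpanId", spanId);
        if (parentSpanId != null) {
            headers.put("X-B3-ParentSpanId", parentSpanId);
        }
        headers.put("X-B3-Sampled", sampled ? "1" : "0");
        return headers;
    }

    // (3) Continue: the callee keeps the trace ID and creates a child of the caller's span.
    static TraceContext continueFrom(Map<String, String> headers) {
        return new TraceContext(
                headers.get("X-B3-TraceId"),
                newId(),
                headers.get("X-B3-SpanId"),
                "1".equals(headers.get("X-B3-Sampled")));
    }

    // 64-bit ID rendered as 16 lowercase hex characters.
    private static String newId() {
        return String.format("%016x", ThreadLocalRandom.current().nextLong());
    }

    public static void main(String[] args) {
        TraceContext root = TraceContext.newRoot(true);
        TraceContext child = TraceContext.continueFrom(root.toHeaders());
        System.out.println("same trace: " + root.traceId.equals(child.traceId));
        System.out.println("child parent is root span: " + root.spanId.equals(child.parentSpanId));
    }
}
```

The shared trace ID is what lets the backend join spans from different processes, and the parent span ID is what rebuilds the call hierarchy.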
  • 41. 41
  • 43. 43 Source: Spring Cloud Sleuth reference documentation; access date: 2018-05-18 http://cloud.spring.io/spring-cloud-static/spring-cloud-sleuth/2.0.0.RC1/single/spring-cloud-sleuth.html#_distributed_tracing_with_zipkin Zipkin UI workshop happening this week! https://cwiki.apache.org/confluence/display/ZIPKIN/2018-10-29+Zipkin+UI+at+LINE+Tokyo
  • 44. 44 [Diagram: the tracing-instrumented system (👤, s1 to s4) sends spans over a transport to the Zipkin server, which consists of collector, storage, API, and UI (👩‍💻); storage is backed by a datastore] Transport: • HTTP • Kafka • RabbitMQ Datastore: • In-memory * • MySQL * • Elasticsearch • Cassandra Reference: https://zipkin.io/pages/architecture.html
  • 45. 45 Tracing backend: Zipkin Server getting started Spring Cloud Sleuth: spring-cloud-starter-zipkin dependency • auto-configures tracing instrumentation (Zipkin’s Brave) • reports recorded spans to Zipkin async/batched
  • 46. 46 Travel Service Starter (included in service-parent) • Includes spring-cloud-starter-zipkin dependency (Spring Cloud Sleuth) Travel Auto-configuration • Tag root span with correlation ID Travel Cloud Config • Zipkin server address • Sampling %, skip patterns
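What "sampling %, skip patterns" amounts to can be sketched in plain Java. This is an illustration under assumptions, not Sleuth's or Brave's sampler: the class `TraceSampler`, the 10,000-bucket scheme, and the example paths are hypothetical. Requests whose path matches a skip pattern are never traced; of the rest, a fixed share is traced, decided deterministically from the trace ID.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical sketch of a sampling decision: skip some paths, trace a fixed share of the rest.
class TraceSampler {
    private static final long BUCKETS = 10_000L;

    private final long threshold; // number of buckets (out of 10,000) that are traced
    private final List<Pattern> skipPatterns;

    TraceSampler(double probability, List<String> skipRegexes) {
        if (probability < 0.0 || probability > 1.0) {
            throw new IllegalArgumentException("probability must be between 0.0 and 1.0");
        }
        this.threshold = Math.round(probability * BUCKETS);
        this.skipPatterns = skipRegexes.stream().map(Pattern::compile).collect(Collectors.toList());
    }

    // Decide once, at the start of the trace; downstream services reuse the decision.
    boolean shouldTrace(String path, long traceId) {
        for (Pattern pattern : skipPatterns) {
            if (pattern.matcher(path).matches()) {
                return false;
            }
        }
        // Deterministic per trace ID, so the same ID always gets the same answer.
        return Math.floorMod(traceId, BUCKETS) < threshold;
    }

    public static void main(String[] args) {
        TraceSampler sampler = new TraceSampler(0.1, List.of("/actuator/.*"));
        System.out.println(sampler.shouldTrace("/hello", 999L));
        System.out.println(sampler.shouldTrace("/actuator/health", 999L));
    }
}
```

Skipping health-check style endpoints keeps noise out of the tracing backend, while the percentage bounds the overhead and storage cost of tracing normal traffic.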
  • 47. 47
  • 48. 48 Together you have correlated logging, metrics, and tracing across the whole system. Jump between each using common identifiers. Adapted from: Adrian Cole, “Observability 3 ways: logging metrics and tracing”; access date: 2018-05-18 https://speakerdeck.com/adriancole/observability-3-ways-logging-metrics-and-tracing
  • 49. 49 • spring.application.name = Zipkin service name • Configure as Micrometer common tag http_server_requests_seconds_count{exception="None",method="GET",status="200",uri="/hello",} 4.0 http_server_requests_seconds_sum{exception="None",method="GET",status="200",uri="/hello",} 0.02570928 http_server_requests_seconds_max{exception="None",method="GET",status="200",uri="/hello",} 0.0 [Figure labels: Micrometer tags, Zipkin tags]
  • 50. 50 • Trace → Logs: link to e.g. a Kibana search by traceId • Can also do Logs → Trace https://github.com/openzipkin/zipkin/tree/master/zipkin-ui#how-do-i-find-logs-associated-with-a-particular-trace
  • 51. 51 • Confirm request flow – does it match the expected design/architecture? • Check service dependencies in Zipkin • Check request flow in Zipkin; jump to logs if necessary • Filter by service name, span name, tags • Adjust log levels via Actuator if necessary
  • 52. 52 • Automated tests generate a correlation ID per test case execution. • Use correlation ID to find the related traces in Zipkin. cID0001 cID0001 trace1 trace2
  • 53. 53 • Manual tests (in non-production environments) from the browser can use Zipkin Browser Extension to get the traceId for a browser request • Where in the request flow did the error occur or why was it slow? • Check request flow in Zipkin; jump to logs (if necessary) • Adjust log levels via Actuator (if necessary)
  • 54. 54 Cycle: Detection • Investigation • Recovery • Adjustment (trigger: alert / inquiry) 1. Starts with an alert/report 2. Check metrics 3. Check tracing data (if needed) 4. Check logs (if needed) 5. Triage issue 6. Make adjustment to prevent recurrence 🔁
  • 55. 55
  • 56. 56 • System-wide observability is crucial in distributed architectures • Tools exist and Spring makes them easy to integrate • Most common cases are covered out-of-the-box or configurable. Custom instrumentation is possible as needed. • Use the right tool for the job; synergize across tools
  • 58. 58 • “Distributed Systems Observability” e-book by Cindy Sridharan: http://distributed-systems-observability-ebook.humio.com/ • Articles by Cindy Sridharan (@copyconstruct): https://medium.com/@copyconstruct • Talks by Charity Majors (@mipsytipsy): https://speakerdeck.com/charity • “Observability+” articles by JBD (@rakyll): https://medium.com/observability