QuASD/PROFES 2018, Wolfsburg: Talk by Marcus Ciolkowski (@M_Ciolkowski, Principal IT Consultant at QAware) and Florian Lautenschlager (@flolaut, Senior Software Engineer)
Abstract: Important and critical aspects of technical debt often surface at runtime only and are difficult to measure statically.
This is a particular challenge for cloud applications because of their highly distributed nature.
Fortunately, mature frameworks for collecting runtime data exist but need to be integrated.
In this paper, we report an experience from a project that implements a cloud application within Kubernetes on Azure.
To analyze the runtime data of this software system, we instrumented our services with Zipkin for distributed tracing; with Prometheus and Grafana for analyzing metrics; and with fluentd, Elasticsearch and Kibana for collecting, storing and exploring log files.
However, project team members did not utilize these runtime data until we created a unified and simple access using a chat bot.
We argue that even though your project collects runtime data, this is not sufficient to guarantee its usage: In order to be useful, a simple, unified access to different data sources is required that should be integrated into tools that are commonly used by team members.
Get the research paper: http://bitly.com/2QmSNwl
Making Runtime Data Useful for Incident Diagnosis: An Experience Report
1. Making Runtime Data Useful for Incident Diagnosis: An Experience Report
Wolfsburg, November 28, 2018
QuASD 2018
Florian Lautenschlager
Marcus Ciolkowski
2. Measuring Technical Debt is not novel.
But: Existing Approaches ignore dynamic behavior.
Current software (quality) measurement is based on static metrics
plenty of tools exist (e.g. SonarQube)
rather simple to measure
much experience in measurement and presentation
Dynamic aspects / KPIs are underrepresented
test coverage is the most common exception
dynamic indicators influence, e.g.
customer satisfaction
infrastructure and operation costs
tricky maintenance cases often hide here (e.g., leaks and locks)
3. Gap: Measure and evaluate dynamic aspects
One example of dynamic aspects: Runtime structure
runtime structure: actual call relationships at runtime
runtime structure is important
to gain understanding of system at runtime
to identify performance problems
Challenge: often difficult to detect statically
abstraction and inheritance (many false positives)
code injection
reflection
soft links
[Diagram: EventMgr adds events to and removes events from a Queue]
Object event = queue.remove();
Method act = event.getClass().getMethod("act");
act.invoke(event);
Which class is called?
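The reflection case can be made concrete with a small runnable sketch. It is not taken from the project; the event classes `LoginEvent` and `LogoutEvent` are hypothetical and only illustrate that the class whose `act` method runs is decided by whatever object sits in the queue at runtime, which a static call graph cannot see.

```java
import java.lang.reflect.Method;
import java.util.ArrayDeque;
import java.util.Queue;

public class ReflectiveDispatch {
    // Hypothetical event types; they share no interface that declares act().
    public static class LoginEvent {
        public String act() { return "LoginEvent.act"; }
    }

    public static class LogoutEvent {
        public String act() { return "LogoutEvent.act"; }
    }

    // Looks up act() on the runtime class of the event and invokes it.
    public static String dispatch(Object event) {
        try {
            Method act = event.getClass().getMethod("act");
            return (String) act.invoke(event);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("event has no callable act()", e);
        }
    }

    public static void main(String[] args) {
        Queue<Object> queue = new ArrayDeque<>();
        queue.add(new LoginEvent());
        queue.add(new LogoutEvent());
        // The call target changes with every element taken from the queue.
        while (!queue.isEmpty()) {
            System.out.println(dispatch(queue.remove()));
        }
    }
}
```

Nothing in `dispatch` names a concrete class, so only a trace recorded at runtime reveals the actual call relationship.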
4. Focus today: How to gather and use runtime data for incident diagnosis
What is incident diagnosis?
Problem report or ticket
task: find & fix cause quickly
How is it done today?
ask someone from the other team
(those who know how to get and interpret runtime data)
Idea: bring together runtime data in a unified, simple access
so everyone can easily use runtime data
and asks the experts only once equipped with more detailed information
7. Probes collect runtime data event-based or sampling-based.
Event-based on state change
• Event for every observed change
• Bad for high frequency changes (high overhead)
Sampling-based on timer interval
• State changes might be lost
• Good for high frequency changes (overhead is controllable)
[Figure: Two probes observe the state of a software system. The event-based probe emits runtime data for every event, i.e., every state change; the sampling-based probe emits runtime data once per interval.]
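The trade-off between the two probe styles can be sketched in a few lines. This is a simplified model rather than the project's instrumentation: the system state is assumed to be an array of integers, one entry per tick, and the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class ProbeSketch {
    // Event-based probe: records the initial state and every later state change.
    public static List<Integer> eventBased(int[] states) {
        List<Integer> recorded = new ArrayList<>();
        for (int i = 0; i < states.length; i++) {
            if (i == 0 || states[i] != states[i - 1]) {
                recorded.add(states[i]);
            }
        }
        return recorded;
    }

    // Sampling-based probe: records the state once per interval, whatever happened in between.
    public static List<Integer> sampled(int[] states, int interval) {
        List<Integer> recorded = new ArrayList<>();
        for (int i = 0; i < states.length; i += interval) {
            recorded.add(states[i]);
        }
        return recorded;
    }

    public static void main(String[] args) {
        int[] states = {1, 1, 2, 3, 3, 4, 4, 4};
        System.out.println(eventBased(states)); // [1, 2, 3, 4]
        System.out.println(sampled(states, 3)); // [1, 3, 4]
    }
}
```

The sampled series never sees state 2, but its cost is fixed by the interval; the event-based series is complete, but its cost grows with the change frequency.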
8. Runtime data does not have a predefined structure.
In practice, there are three types of runtime data.
Metric: Numeric value
Textual: Structured log message
Span: A {method|service} call; part of a (distributed) trace
Trace: Runtime data that belongs together and is ordered by time
[Figure: a metric shown as name, attribute, and value (example name incoming_request_count, attribute telekom); a structured log message shown as name/value pairs; a trace shown as spans along a time axis]
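One way to picture the three types is as plain data holders. The sketch below assumes Java 16 or later (records, `Stream.toList`); all field names are illustrative choices and do not reflect the data models of Prometheus, Elasticsearch, or Zipkin.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;

public class RuntimeDataTypes {
    // Metric: a named numeric value with attributes.
    public record Metric(String name, Map<String, String> attributes, double value) {}

    // Textual: a structured log message made of name/value pairs.
    public record LogMessage(Map<String, String> fields) {}

    // Span: one method or service call that belongs to a trace.
    public record Span(String traceId, String name, long startMicros) {}

    // Trace: all spans that belong together (same traceId), ordered by time.
    public static List<Span> trace(List<Span> spans, String traceId) {
        return spans.stream()
                .filter(span -> span.traceId().equals(traceId))
                .sorted(Comparator.comparingLong(Span::startMicros))
                .toList();
    }

    public static void main(String[] args) {
        // The attribute key "tenant" and the value 42 are made up for this example.
        Metric metric = new Metric("incoming_request_count", Map.of("tenant", "telekom"), 42);
        LogMessage log = new LogMessage(Map.of("message", "Connection Timeout"));
        List<Span> spans = List.of(
                new Span("t1", "Load User", 20),
                new Span("t2", "Other Request", 5),
                new Span("t1", "Login", 10));
        System.out.println(metric);
        System.out.println(log);
        System.out.println(trace(spans, "t1"));
    }
}
```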
10. Solution: All runtime data must contain the diagnosability context.
1. Detected: Alert on login failures for tenant smarthub
Metric: request_error{ tenant smarthub method
2. Analyze: Dig into the traces (spans) and logs (textual) of the failed logins for the tenant
Trace: spans Login, Validate Token, Load User, Get Roles, Get Rights
Tags: {tenant: "smarthub", user: "xb27"}
Logs: {"message": "Connection Timeout", "user": "xb27", "tenant": "smarthub"}
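A minimal sketch of how one context object could stamp all three kinds of runtime data is shown below. It assumes the context consists of tenant, user, and traceId; the class and method names are hypothetical, the trace ID is taken from the example links in this talk, and the JSON is assembled by hand without escaping, where a real service would use its logging and tracing libraries.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class DiagnosabilityContextSketch {
    public record Context(String tenant, String user, String traceId) {

        // Metric series name with the tenant as its only label (assumption: user and
        // traceId are left out of metric labels to keep the number of series small).
        public String metricSeries(String metricName) {
            return String.format("%s{tenant=\"%s\"}", metricName, tenant);
        }

        // Structured log line carrying the full context. No JSON escaping is done here.
        public String logLine(String message) {
            return String.format(
                    "{\"message\":\"%s\",\"user\":\"%s\",\"tenant\":\"%s\",\"traceId\":\"%s\"}",
                    message, user, tenant, traceId);
        }

        // Tags to attach to every span of the request.
        public Map<String, String> spanTags() {
            Map<String, String> tags = new LinkedHashMap<>();
            tags.put("tenant", tenant);
            tags.put("user", user);
            return tags;
        }
    }

    public static void main(String[] args) {
        Context context = new Context("smarthub", "xb27", "d0227acffbd8671");
        System.out.println(context.metricSeries("request_error"));
        System.out.println(context.logLine("Connection Timeout"));
        System.out.println(context.spanTags());
    }
}
```

Because the same tenant value appears in the metric, the log line, and the span tags, an alert on the metric can be followed into the matching traces and logs.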
11. Problem: The typical team member is not an expert in the toolchains.
[Figure: the toolchains around Smart Voice Hub, each tool labeled by its role (probes and collection; probes, collection, and storage; transport; storage and exploration; exploration), with the chatbot linking into them]
https://kibana.svh.de?traceId=d0227acffbd8671
https://zipkin.svh.de?traceId=d0227acffbd8671
https://prometheus.svh.de?
Solution: Build unified access to the toolchains!
13. The chatbot is our unified access layer.
It allows team members to interact with Smart Voice Hub.
The response always contains the traceId
(part of the diagnosability context).
In case of an error:
Zipkin Trace: Request Trace
Kibana Logs: Request Application log
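The reply logic described on this slide could look roughly like the sketch below. It is an assumption-laden illustration, not the project's chatbot: the class name, the reply layout, and the `error` flag are invented, and only the two base URLs and the `traceId` query parameter come from the example links in this talk.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class ChatbotReply {
    static final String ZIPKIN_BASE = "https://zipkin.svh.de";
    static final String KIBANA_BASE = "https://kibana.svh.de";

    // Builds the chat reply: the traceId is always included, deep links only on error.
    public static String reply(String answer, String traceId, boolean error) {
        StringBuilder text = new StringBuilder(answer)
                .append("\ntraceId: ").append(traceId);
        if (error) {
            String query = "?traceId=" + URLEncoder.encode(traceId, StandardCharsets.UTF_8);
            text.append("\nZipkin Trace: ").append(ZIPKIN_BASE).append(query);
            text.append("\nKibana Logs: ").append(KIBANA_BASE).append(query);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        System.out.println(reply("Done.", "d0227acffbd8671", false));
        System.out.println(reply("Login failed.", "d0227acffbd8671", true));
    }
}
```

Putting the links into the reply means a team member can jump from the failed request straight to its trace and logs without knowing how to query Zipkin or Kibana.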
16. Dynamic Analysis for Everyone, Not Just for Experts
We first had to convince people, but afterwards they loved the analysis tools.
An established and accepted toolchain is a building block
It pays off:
Better (more qualified) cross-team communication
Higher motivation to understand software behavior
Lower overhead for analysis experts
Increased awareness of runtime characteristics
First-level support is interested
17. Next Steps: Derive Runtime Architecture for Compliance Checks
Determine actual runtime architecture
from traces / spans
and log files
Construct model of runtime architecture
for visualization
abstract into functional / logical layers
for compliance checks
Towards resilience
improve dynamic indicators of technical debt, e.g.
resource consumption (e.g., time spent)
component health (predict & react early)
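As a rough illustration of this next step, the sketch below derives service-to-service dependencies from parent/child span relationships and compares them with a list of allowed dependencies. It assumes Java 16 or later; the span fields, the service names, and the allowed list are all hypothetical, and this is one possible approach rather than the one the authors plan to implement.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class RuntimeArchitecture {
    // Simplified span: its own ID, the ID of the calling span (null for the root), and its service.
    public record Span(String id, String parentId, String service) {}

    // Directed dependency between two services as observed at runtime.
    public record Edge(String caller, String callee) {}

    // Derives the runtime architecture: one edge per observed cross-service call.
    public static Set<Edge> dependencies(List<Span> spans) {
        Map<String, String> serviceBySpanId = new HashMap<>();
        for (Span span : spans) {
            serviceBySpanId.put(span.id(), span.service());
        }
        Set<Edge> edges = new LinkedHashSet<>();
        for (Span span : spans) {
            String callerService = span.parentId() == null ? null : serviceBySpanId.get(span.parentId());
            if (callerService != null && !callerService.equals(span.service())) {
                edges.add(new Edge(callerService, span.service()));
            }
        }
        return edges;
    }

    // Compliance check: observed dependencies that the intended architecture does not allow.
    public static Set<Edge> violations(Set<Edge> observed, Set<Edge> allowed) {
        Set<Edge> result = new LinkedHashSet<>(observed);
        result.removeAll(allowed);
        return result;
    }

    public static void main(String[] args) {
        List<Span> spans = List.of(
                new Span("1", null, "gateway"),
                new Span("2", "1", "login"),
                new Span("3", "2", "user-db"),
                new Span("4", "1", "user-db"));
        Set<Edge> allowed = Set.of(new Edge("gateway", "login"), new Edge("login", "user-db"));
        System.out.println(violations(dependencies(spans), allowed));
    }
}
```

In this made-up trace the gateway calls the user database directly, which the allowed list does not permit, so that edge is reported as a violation.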
18. QAware GmbH München
Aschauer Straße 32
81549 München
Tel.: +49 (0) 89 23 23 15 0
Fax: +49 (0) 89 23 23 15 129
github.com/qaware
linkedin.com/qaware
slideshare.net/qaware
twitter.com/qaware
xing.com/qaware
youtube.com/qawaregmbh
Marcus Ciolkowski
marcus.ciolkowski@qaware.de
@M_Ciolkowski