Analysing DIRAC's Behavior using Model Checking with Process Algebra
Daniela Remenska - Jeff Templon - Tim Willemse - Henri Bal - Kees Verstoep - Wan Fokkink
Philippe Charpentier - Ricardo Graciani - Elisa Lanciotti - Krzysztof Daniel Ciba - Stefan Roiser
Motivation

DIRAC background
▪ production activities and user analysis for LHCb
▪ distributed services and light-weight agents
"blackboard" or "shared-memory" paradigm
Figure 1: DIRAC subsystems
▪ jobs often get into incorrect (or inconsistent) states
▪ staging requests become stuck
▪ difficult to trace the root of such unexpected behavior (many scenarios and components)
▪ manual intervention necessary
There are formal or systematic approaches to tackle this!

From DIRAC to mCRL2

DIRAC (Python): ~150000 loc
Abstracting the implementation depends on the focus of the analysis.
Check for race-conditions
Agents update the state of shared entities.
Systems: Storage and Workload Mgmt
Entities: Jobs, Cache-Replicas, Tasks
Figure 2: Job state machine
Agents and storage become processes.
Control-flow is abstracted using mCRL2 non-deterministic choice and if-then-else constructs.
States of entities are described using custom abstract data types.

Verification

▪ Properties (Safety / Progress / Deadlock): the model-checker automatically probes them.
▪ Property violated: a counter-example trace is provided.
Figure 6: Violation of progress and safety requirements
Figure 7: "Zombie" job starts running after being killed
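The idea of exploring every interleaving and returning a counter-example trace when a safety property fails can be sketched in a few lines of Python. This is only a toy illustration, not the mCRL2 model or DIRAC code: the two agents (`scheduler`, `user`), the status names, and the functions `successors`, `is_zombie` and `find_counterexample` are all hypothetical. The scheduler reads the job status and later writes "Running"; a kill request may slip in between the two steps, which is one plausible way a "zombie" job could arise.

```python
from collections import deque

# State is (job_status, scheduler_pc, killed). All names are hypothetical.
# scheduler_pc: 0 = has not looked at the job, 1 = saw "Waiting", 2 = finished.

def successors(state, atomic):
    """Yield (action_label, next_state) for every step either agent can take."""
    status, sched_pc, killed = state
    if sched_pc == 0:
        if atomic:
            # Check and update in one indivisible step.
            new_status = "Running" if status == "Waiting" else status
            yield "scheduler.check_and_set_running", (new_status, 2, killed)
        else:
            # Only read the status now; the write happens in a later step.
            next_pc = 1 if status == "Waiting" else 2
            yield "scheduler.read_status", (status, next_pc, killed)
    elif sched_pc == 1:
        yield "scheduler.set_running", ("Running", 2, killed)
    if not killed:
        yield "user.kill", ("Killed", sched_pc, True)

def is_zombie(state):
    """Safety violation: the job is running although it was killed."""
    status, _, killed = state
    return status == "Running" and killed

def find_counterexample(atomic=False):
    """Breadth-first search of the state space; return a shortest bad trace or None."""
    initial = ("Waiting", 0, False)
    parent = {initial: None}  # state -> (previous_state, action_label)
    queue = deque([initial])
    while queue:
        state = queue.popleft()
        if is_zombie(state):
            trace = []
            while parent[state] is not None:
                state, label = parent[state]
                trace.append(label)
            return trace[::-1]
        for label, nxt in successors(state, atomic):
            if nxt not in parent:
                parent[nxt] = (state, label)
                queue.append(nxt)
    return None

print(find_counterexample(atomic=False))
print(find_counterexample(atomic=True))
```

With the non-atomic scheduler the search reports the trace `scheduler.read_status`, `user.kill`, `scheduler.set_running`; with the atomic check-and-set variant no violating state is reachable and the result is `None`. A real model checker works on far larger state spaces and richer property languages, but the principle of exhaustive exploration plus a trace back to the initial state is the same.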
Why Formal Methods?

Based on process algebra laws: no ambiguity
Model checking tools: full control over the execution of parallel processes. This way one gains more insight into the system behavior.
Automatically explore the entire state-space and check if some "interesting" properties hold.
Stronger than testing

Some drawbacks...

Abstraction of the "real" behavior is needed. This means one must build a sound model.
Expertise in formal methods and the system domain is necessary.
The state-space of the model can explode.

State-space generation

Figure 3: State-space visualisation with LTSView

Analysis & Issues

Problems can be discovered while building and debugging the model:
Figure 4a: XSim simulator trace of a job workflow
Figure 4b: DIRAC logging info of a job workflow
Figure 5: State-transition visualisation with DiaGraphica

Conclusions

Distributed systems are difficult to reason about: many components, all running in parallel.
Formal methods are a more rigorous addition to testing, as a way to improve software quality.
A sound model needs to be written manually. This requires experience and can be error-prone.
Similar techniques can be re-applied to similar systems, once the learning curve has been overcome.

Future Work

Automate (to some degree) the translation from code to model.

Language & Toolset

Actions: atomic building blocks; can carry data parameters
Processes: composed of actions, using algebra operators
Built-in data types: integers, booleans, lists, sets, bags
Abstract data types
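To give a feel for the Language & Toolset notions (actions that carry data, processes built from actions with algebra operators, and abstract data types), here is a small Python analogy. It is a sketch under strong simplifying assumptions and is not mCRL2 syntax: a process is reduced to the list of finite action traces it can perform, `seq` and `choice` stand in for the sequential-composition and choice operators that mCRL2 writes as `.` and `+`, and the `JobStatus` values, the action names and the job id are made up for illustration.

```python
from dataclasses import dataclass
from enum import Enum

class JobStatus(Enum):
    """An abstract data type for entity states (values are illustrative only)."""
    RUNNING = "Running"
    DONE = "Done"
    KILLED = "Killed"

@dataclass(frozen=True)
class Action:
    """An atomic building block that can carry data parameters."""
    name: str
    args: tuple = ()

def act(name, *args):
    """A process that performs exactly one action: one trace of length one."""
    return [(Action(name, args),)]

def seq(p, q):
    """Sequential composition: every trace of p followed by every trace of q."""
    return [tp + tq for tp in p for tq in q]

def choice(p, q):
    """Non-deterministic choice: behave either like p or like q."""
    return p + q

# A hypothetical job with id 1: after submission it either runs to
# completion or is killed before it ever starts.
job = seq(
    act("submit", 1),
    choice(
        seq(act("set_status", 1, JobStatus.RUNNING),
            act("set_status", 1, JobStatus.DONE)),
        act("set_status", 1, JobStatus.KILLED),
    ),
)

for trace in job:
    print(" . ".join(f"{a.name}{a.args}" for a in trace))
```

Listing traces like this loses the branching structure that a process-algebra semantics keeps, so it should be read as an intuition aid only; the real toolset works on labelled transition systems generated from the model.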