1. Budapest University of Technology and Economics
Department of Measurement and Information Systems
DISTRIBUTED INCREMENTAL GRAPH QUERIES
Gábor Szárnyas, Dániel Varró
2 February, 2015
22nd Minisymposium of the
Department of Measurement and Information Systems
4. Model Sizes
Models = graphs with 100M–1B elements
o Car industry
o Avionics
o Software analysis
o Cyber-physical systems
Source: Markus Scheidgen, Automated and Transparent Model Fragmentation for Persisting Large Models, 2012
Application          Model size
software models      10^8
sensor data          10^9
geo-spatial models   10^12
Validation may take hours
6. Motivating Example
Pattern for an AUTOSAR validation constraint
[Figure: graph pattern over Communication channel, Logical signal, Mapping, and Physical signal elements; validation distinguishes an invalid submodel from a valid submodel]
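7. Rete Algorithm
[Figure: Rete network with two Join nodes and an Antijoin node over the Communication channel, Logical signal, Mapping, and Physical signal inputs, ending in the result set]
Steps
o Fill indexer nodes
o Store interim results
o Read result set
o Edit model
o Propagate changes
o Read result set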
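The workflow on this slide (fill the indexer nodes, store interim results, read the result set, edit the model, propagate the change, read again) can be sketched as a toy Rete-style network. This is a minimal illustration and not EMF-INCQUERY's implementation: the class names, the tuple layout, and the example constraint (a logical signal on a channel with no mapping to a physical signal) are assumptions made for the sketch, and it uses one join instead of the two shown on the slide.

```python
from collections import defaultdict


class ResultSet:
    """Terminal node: holds the current matches of the query."""

    def __init__(self):
        self.matches = set()

    def update(self, tup, sign, side=None):
        if sign > 0:
            self.matches.add(tup)
        else:
            self.matches.discard(tup)


class JoinNode:
    """Joins left and right tuples on a key; both memories are the stored interim results."""

    def __init__(self, left_key, right_key, child, child_side=None):
        self.keys = {"left": left_key, "right": right_key}
        self.memory = {"left": defaultdict(set), "right": defaultdict(set)}
        self.child, self.child_side = child, child_side

    def update(self, tup, sign, side):
        key = self.keys[side](tup)
        bucket = self.memory[side][key]
        (bucket.add if sign > 0 else bucket.discard)(tup)
        other = "right" if side == "left" else "left"
        # Only tuples sharing the key are touched, so work follows the change size.
        for match in list(self.memory[other][key]):
            joined = tup + match if side == "left" else match + tup
            self.child.update(joined, sign, self.child_side)


class AntijoinNode:
    """Passes on left tuples that have no matching right tuple."""

    def __init__(self, left_key, right_key, child, child_side=None):
        self.keys = {"left": left_key, "right": right_key}
        self.left, self.right = defaultdict(set), defaultdict(set)
        self.child, self.child_side = child, child_side

    def update(self, tup, sign, side):
        key = self.keys[side](tup)
        if side == "left":
            (self.left[key].add if sign > 0 else self.left[key].discard)(tup)
            if not self.right[key]:
                self.child.update(tup, sign, self.child_side)
            return
        was_empty = not self.right[key]
        (self.right[key].add if sign > 0 else self.right[key].discard)(tup)
        now_empty = not self.right[key]
        if was_empty != now_empty:
            # First blocker arrived: retract matches. Last blocker left: restore them.
            for left_tup in list(self.left[key]):
                self.child.update(left_tup, +1 if now_empty else -1, self.child_side)


# Hypothetical model: (channel, logical) edges, (logical, mapping) and (mapping, physical) edges.
results = ResultSet()
anti = AntijoinNode(lambda t: t[1], lambda t: t[0], results)
join = JoinNode(lambda t: t[1], lambda t: t[0], anti, "right")

# Fill the network with the initial model, then read the result set.
anti.update(("ch1", "sigA"), +1, "left")
anti.update(("ch1", "sigB"), +1, "left")
join.update(("sigA", "m1"), +1, "left")
join.update(("m1", "phys1"), +1, "right")
before_edit = set(results.matches)

# Edit the model: map sigB to a physical signal; only the delta is propagated.
join.update(("sigB", "m2"), +1, "left")
join.update(("m2", "phys2"), +1, "right")
after_edit = set(results.matches)

# Undo part of the edit: the violation reappears without re-running the query.
join.update(("m2", "phys2"), -1, "right")
after_undo = set(results.matches)
```

Each node keeps its own memory, which is why reading the result set after an edit is cheap and why memory consumption grows with the model.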
9. EMF-INCQUERY
Rete-based incremental graph query engine
Open source Eclipse project
Typical use cases
o Validation
o Incremental model transformation
o Model synchronization
10. Single Workstation Limitations
Most tools handle fewer than 1M model elements
before exhausting resources
Best tools: <10M model elements
JVM limitations: cannot handle 15+ GB heap
memory efficiently
Proposed solution
o Horizontal scaling: distributed system
12. Goals of INCQUERY-D
Objectives
o Distributed incremental pattern matching
o Adapting EMF-INCQUERY’s tooling to distributed DBs
o Executed over a cloud infrastructure (COTS hardware)
Achieve scalability by avoiding memory bottleneck
o Sharding separately
• Data
• Indexers
• Query network
o In memory
• Index + query
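As an illustration of sharding the data layer, the sketch below assigns model elements to shards with a stable hash. Hash partitioning is an assumption chosen for the example; the slides do not say which partitioning scheme INCQUERY-D uses, and `shard_of`, `partition`, and the element identifiers are hypothetical names.

```python
import hashlib


def shard_of(element_id: str, shard_count: int) -> int:
    """Map an element identifier to a shard index, stable across processes."""
    digest = hashlib.sha256(element_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def partition(element_ids, shard_count):
    """Group element identifiers by the shard they are assigned to."""
    shards = [[] for _ in range(shard_count)]
    for element_id in element_ids:
        shards[shard_of(element_id, shard_count)].append(element_id)
    return shards


# Hypothetical identifiers spread over four database shards.
element_ids = [f"signal-{i}" for i in range(1000)]
shards = partition(element_ids, 4)
```

A cryptographic hash is used instead of Python's built-in `hash` because the latter is randomised per process for strings, which would send the same element to different shards on different hosts.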
14. Architecture and Data Representation
Is it possible to build a query engine which works
on various backends using different data
representation formats?
Is it possible to serve multiple users concurrently?
15. INCQUERY-D Architecture
[Figure: INCQUERY-D architecture. Servers 0 to 3 each hold one database shard (shard 0 to shard 3). The figure also shows the Transaction, the model access adapter, the distributed indexer, and the distributed query evaluation network (four Indexer nodes, two Join nodes, one Antijoin node).]
EMF-INCQUERY
o In-memory storage (in-memory EMF model)
o Indexing (indexer layer)
o Production network (Rete net)
• Stores intermediate query results
• Propagates changes
INCQUERY-D
o Distributed persistent storage
o Distributed indexing, notification
o Distributed production network
• Each intermediate node can be allocated to a different host
• Remote internode communication
16. Scalable Incremental Query Evaluation
Is it possible to utilise an incremental query
evaluation algorithm in a distributed system for
high performance query evaluation?
How can we benchmark a distributed system in a
reproducible manner?
17. Benchmark Results for Revalidation
Quick response time
for models with 88M elements
Different characteristics
18. Dimensions of Scalability
Infrastructure
o Number of machines
o Available memory / CPU
o Network performance
o Number of concurrent users
Model
o Model size
o Model characteristics
Queries
o Number of queries
o Query complexity
19. Optimisation and Dynamic Reconfiguration
How can we scale and optimise such a system?
How can the system adapt to the changes
o in the system?
o in the cloud environment?
How can we estimate the resources required by a
certain setup?
20. Dynamic Resource Allocation
[Figure: Rete nodes (Indexer, Join, Antijoin) allocated across Servers 0 to 3, annotated with memory usage percentages per server and Δ markers]
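One way to picture dynamic reallocation is a greedy rebalancer that moves Rete nodes off a server whose memory usage exceeds a threshold. This is only a sketch under stated assumptions: the 80% threshold, the greedy strategy, the function name `rebalance`, and all node sizes and server capacities below are hypothetical and are not taken from the slides.

```python
def rebalance(allocation, node_memory, capacity, threshold=0.8):
    """Return a new node -> server mapping after relieving overloaded servers.

    allocation:  node name -> server name
    node_memory: node name -> memory used by the node (GB)
    capacity:    server name -> available memory (GB)
    """
    allocation = dict(allocation)

    def usage(server):
        used = sum(node_memory[n] for n, s in allocation.items() if s == server)
        return used / capacity[server]

    moved = True
    while moved:
        moved = False
        busiest = max(capacity, key=usage)
        if usage(busiest) <= threshold:
            break  # no server is overloaded
        idlest = min(capacity, key=usage)
        # Try the smallest node first so that the migration stays cheap.
        candidates = sorted(
            (n for n, s in allocation.items() if s == busiest),
            key=node_memory.get,
        )
        for node in candidates:
            if usage(idlest) + node_memory[node] / capacity[idlest] <= threshold:
                allocation[node] = idlest
                moved = True
                break
        # If nothing fits on the idlest server, the loop stops without a move.
    return allocation


# Hypothetical setup: four servers with 10 GB each; server1 starts at 90%.
capacity = {"server0": 10, "server1": 10, "server2": 10, "server3": 10}
node_memory = {"indexer_a": 1, "indexer_b": 4, "join_a": 5, "join_b": 6, "antijoin": 7}
allocation = {
    "indexer_a": "server0",
    "indexer_b": "server1",
    "join_a": "server1",
    "join_b": "server2",
    "antijoin": "server3",
}
new_allocation = rebalance(allocation, node_memory, capacity)
```

In this example only `indexer_b` is moved, from `server1` to `server0`. A real system would also have to migrate the node's stored tuples and redirect its incoming and outgoing messages, which the sketch leaves out.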
21. Conclusion
MDE provides Big Data questions for research
Horizontal scaling is a way for querying large models
Theoretical challenges
o Distributed pattern matching algorithm
o Data representation
o Dynamic resource allocation
Practical challenges
o Integrating technologies: database, messaging
framework, monitoring, user interface, etc.
o High performance query evaluation
Scalability challenges
The size and complexity of the models are increasing.
Queries are more complex than typical queries in a transactional database.
Performance issues lower productivity and raise costs.
MDE scalability issues are well-known and documented. Also, models are continuously changing.
The Rete algorithm is well-known and proven in a single workstation environment.
One possible approach is to evaluate the queries incrementally.
...is a powerful tool, but it has its limitations.
A JVM cannot handle 15+ GB heap memory efficiently
Long GC pauses
Specialized JVMs (e.g. Azul Systems’ Zing)
Commercial, experimental
May require special hardware
This approach is well-known in the database community.
Also distributed triplestores and graph databases.
Are the assumptions needed?
"Amount of Rete communication ≪ model size", but I think it is better to express this in terms of the size of the change, because
We compare data of the same dimension
It is simpler
The complete result set of the query is needed, cf. the case where, e.g., only a single match is needed.
The cost of updating the Rete network is proportional to the size of the change in the distributed Rete network as well.
Typical MDE application: more reads than writes (= analytical).
EMF-IncQuery is a single workstation incremental graph query engine. Its scalability limitations arise from the memory consumption of the Rete net and of the in-memory model representation itself.
The transaction accesses the database through the model access adapter component. IncQuery-D extends EMF-IncQuery’s architecture.
Distributed systems introduce numerous challenges concerning scalability (e.g. distributed, parallel model load; distributed transformation and validation) and benchmarking (e.g. generating large instance models).
Dynamic reallocation of resources may be necessary.
Different representations can be used. They can also be mixed, e.g. one submodel may be stored in EMF, another in an RDF triplestore, and a third in a relational database.