The document discusses model comparison approaches for delta-compression. It describes comparing models at the element level by matching elements between models and identifying differences. It also discusses representation of differences for compression purposes and experiments comparing EMF Compare and EMF Compress on reverse engineered models from Git repositories.
Model Comparison for Delta-Compression
1. Model Comparison for Delta-Compression
Markus Scheidgen
scheidge@informatik.hu-berlin.de
@mscheidgen
BigMDE at STAF 2016, Vienna
2. Agenda
▶ Motivation for Delta-Compression
▶ Model Comparison: Approaches
▶ Experiments
▶ Conclusions
3. Motivation – Delta-Compression
▶ What it is: Only store the differences of similar models
▶ Where do we have a lot of similar models:
■ Model Versioning
■ Model-based Mining of Source Repositories with reverse engineering
▶ Why: Storage space and (indirectly) execution time for persistence operations (I/O, etc.)
8. Model Versioning – Approaches (or versioning in general)
▶ state-based: revisions r0, r1, r2, r3
■ e.g. models stored in regular version control systems (VCS)
▶ change-based (or operation-based): changes (+, -)
■ e.g. EMF-store; requires recording or inferring operations from the editing environment
■ or compare?
▶ hybrid (persist changes, appear state-based)
■ e.g. Git: you only see whole files; internally it uses pack-files with delta-compression
1. Altmanninger, K., Seidl, M., Wimmer, M.: A survey on model versioning approaches. International Journal of Web Information Systems 5(3), 271–304 (2009)
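The hybrid idea (persist changes, appear state-based) can be sketched in a few lines. This is only an illustration for lists of lines: the names `HybridStore`, `make_delta`, and `apply_delta` are hypothetical, the delta format is made up for this example, and it is not how Git's pack-files work internally.

```python
from difflib import SequenceMatcher


def make_delta(old, new):
    """Return the replace operations that turn list `old` into list `new`."""
    matcher = SequenceMatcher(a=old, b=new, autojunk=False)
    return [(i1, i2, new[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"]


def apply_delta(old, delta):
    """Patch `old` with a delta produced by make_delta."""
    result = list(old)
    # Apply from the back so that earlier indices stay valid.
    for i1, i2, replacement in reversed(delta):
        result[i1:i2] = replacement
    return result


class HybridStore:
    """Stores the first revision in full and every later one as a delta."""

    def __init__(self):
        self._base = None
        self._deltas = []

    def commit(self, lines):
        """Persist a new revision and return its revision number."""
        if self._base is None:
            self._base = list(lines)
        else:
            previous = self.checkout(len(self._deltas))
            self._deltas.append(make_delta(previous, list(lines)))
        return len(self._deltas)

    def checkout(self, revision):
        """Reconstruct the whole state of a revision from base and deltas."""
        state = list(self._base)
        for delta in self._deltas[:revision]:
            state = apply_delta(state, delta)
        return state
```

Clients only ever see whole states via `checkout`, while the store itself keeps one full copy plus differences. Reconstructing late revisions requires replaying all earlier deltas, which is the extraction-time cost that has to be weighed against storage space.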
17. Delta-Compression – Tradeoffs
▶ Dimensions: Comparison Time, Comparison Quality, Difference Model Usability, Extraction Time, Storage Space
▶ priority for showing diffs to users [model comparison]
▶ priority for using diffs in persistence [compression]
18. Model Comparison
▶ We have known how to compare lists (e.g. lines of code) for a long time
■ Myers' algorithm: O(ND)
▶ Models aren't lists, but graphs (with spanning trees)
■ but each feature in each model element is a "list" of values
■ we can compare two model elements, feature by feature
■ But which pairs of elements should we compare?
■ We need a prior step to establish pairs of (supposedly) matching elements
Myers, E.W.: An O(ND) difference algorithm and its variations. Algorithmica 1(1–4), 251–266 (1986)
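Feature-by-feature comparison of two already matched elements can be sketched as follows. The function `myers_distance` follows the greedy O(ND) scheme from the cited paper but only returns the length D of the shortest edit script, not the script itself. Representing a model element as a dictionary from feature names to lists of values, and the name `compare_elements`, are assumptions made for this example.

```python
def myers_distance(a, b):
    """Length D of the shortest insert/delete script turning a into b."""
    n, m = len(a), len(b)
    # v[k] holds the furthest x reached on diagonal k (where k = x - y).
    v = {1: 0}
    for d in range(n + m + 1):
        for k in range(-d, d + 1, 2):
            if k == -d or (k != d and v[k - 1] < v[k + 1]):
                x = v[k + 1]        # step down: take an element of b
            else:
                x = v[k - 1] + 1    # step right: drop an element of a
            y = x - k
            # Follow the diagonal while both sequences agree.
            while x < n and y < m and a[x] == b[y]:
                x += 1
                y += 1
            v[k] = x
            if x >= n and y >= m:
                return d
    return n + m


def compare_elements(left, right):
    """Per-feature edit distance for two matched elements.

    Each element is a dict mapping a feature name to its list of values.
    Only features that differ are reported.
    """
    differences = {}
    for feature in sorted(set(left) | set(right)):
        d = myers_distance(left.get(feature, []), right.get(feature, []))
        if d:
            differences[feature] = d
    return differences
```

For example, comparing a method whose `parameters` feature changed from `["int", "String"]` to `["int", "long", "String"]` reports a distance of 1 for `parameters` and nothing for unchanged features.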
19. Model Matching
▶ model 1, model 2 → matches → differences
▶ Matching determines the quality of the comparison
▶ Different strategies for matching model elements
■ signatures (cheap): just meta-class, [name, parameter types], parent
■ similarity (expensive): lots of different criteria and heuristics
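Signature-based matching, the cheap strategy above, can be sketched like this. The `Element` class and the function names are hypothetical and only mirror the criteria listed on the slide (meta-class, name, parameter types, parent); an actual implementation may build signatures differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Element:
    meta_class: str
    name: str
    parameter_types: tuple = ()
    parent: Optional["Element"] = None


def signature(element):
    """Signature built from meta-class, name, parameter types, and parent."""
    parent_signature = signature(element.parent) if element.parent else None
    return (element.meta_class, element.name,
            element.parameter_types, parent_signature)


def match_by_signature(model1, model2):
    """Pair elements with equal signatures using a single hash lookup each.

    Returns (matches, only_in_model1, only_in_model2).
    """
    index = {signature(e): e for e in model2}
    matches, only_in_model1 = [], []
    for element in model1:
        other = index.pop(signature(element), None)
        if other is None:
            only_in_model1.append(element)
        else:
            matches.append((element, other))
    return matches, only_in_model1, list(index.values())
```

Because every element is looked up by a hashed signature, matching cost grows roughly linearly with model size, whereas similarity-based matching generally has to score candidate pairs against each other.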
29. EMF-Compress
▶ We built a comparison framework for compression
■ signature-based matching
■ difference meta-model that allows patching
▶ http://github.com/markus1978/emf-compress
30. Experiments
▶ Reverse-engineered Git repositories containing Java code, using MoDisco
▶ Eclipse Foundation sources, i.e. Eclipse platform and plug-ins
▶ organized in different projects: JDT, CDT, EMF, ...
▶ available via GitHub
▶ Git repositories can be gathered automatically via GitHub's REST API
▶ We used the 200 largest Eclipse repositories that actually contained Java code: 6.6 GB Git, 400 MLOC, 250 GB (binary) models with 4 billion objects.
31. Experiment 1: EMF-Compare vs EMF-Compress
▶ only the first 1000 revisions of the 100 largest (by Git size) repos; only compilation units (CUs) with fewer than 20k elements: ~300k individual comparisons
[Figure: three panels comparing the strategies signature, similarity, lines, and parse: Differences model size (%), Number of matches (%), Avg. execution times (log scale, avg. time per compilation unit in ms)]
34. Experiment 2: Partial Comparison
▶ Problem: not all elements have a meaningful signature
▶ Two signature matching strategies:
■ Named-only: only match named elements, use equality for the contents of named elements
■ Meta-class: match named elements based on their signature, match the contents of named elements based on their parent and meta-class only
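A rough sketch of how the two strategies differ in what ends up in a delta. The representation of content elements as `(meta_class, value)` tuples, the function names, and the pairing of same-meta-class children in order of occurrence are assumptions made for illustration; they are not taken from EMF-Compress. Deleted children are ignored to keep the example short.

```python
from collections import defaultdict, deque


def delta_named_only(old_body, new_body):
    """Named-only: contents are compared by equality only.

    Any change inside a named element stores its whole new content.
    """
    return [] if old_body == new_body else list(new_body)


def delta_meta_class(old_body, new_body):
    """Meta-class: children of one parent are paired by meta-class.

    Children with the same meta-class are paired in order of occurrence
    (an assumption); only new or changed children are stored.
    """
    pool = defaultdict(deque)
    for child in old_body:
        pool[child[0]].append(child)
    changed = []
    for child in new_body:
        candidates = pool[child[0]]
        old = candidates.popleft() if candidates else None
        if old != child:
            changed.append(child)
    return changed
```

With a method body of two statements where only the return statement changes, the named-only variant stores both statements again, while the meta-class variant stores just the changed one.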
35. Experiment 2: Partial Comparison
[Figure: charts comparing Delta, UC, and Full. Per repository (cdt, cdo, ...ompare, emf, ...e.core): Size in GB with the Named-only Matcher, Size in GB with the Meta Class Matcher, and Lines in MLines. Totals: All vs Matched in MObjects for the Named-only Matcher and for the Meta Class Matcher, and All vs Matched Lines in MLines.]
36. Discussion
▶ Only reverse-engineered Java models; different results are possible for other meta-models
▶ Similarity-based matching != similarity-based matching: more evaluation with different qualities of similarity-based matching is necessary
▶ Signatures are not necessarily meta-model independent
▶ We need a better understanding of the relationship between comparison runtime and comparison quality
37. Conclusions
▶ EMF Compare is tailored for difference model usability, not for model-compression
■ insufficient execution times
■ wrong representation of differences
▶ EMF-Compress is an alternative: http://github.com/markus1978/emf-compress
▶ Better analysis of matching strategies necessary:
■ evaluation of more matching strategies
■ evaluation with models in different languages
44. Model-based Mining of Software Repositories
▶ MSR tools are already “model-based”, but in a proprietary manner
▶ Idea: use an existing reverse engineering framework and corresponding standard meta-models and modeling frameworks instead of proprietary solutions
▶ Goals
■ deal with heterogeneity (different version control systems, different languages)
■ reuse of existing meta-models, transformations, and languages
■ interoperability with existing analysis tools
■ retaining meaningful scalability
45. Model-based Mining of Software Repositories
▶ Scope
■ depends on the concrete MSR application and its goals
■ number of software projects: single repositories, large repositories, ultra-large repositories
■ Sources as text and text-based metrics, e.g. LOC
■ Declarations only: packages, classes, methods, but no statements, expressions, etc.
■ Full AST with or without cross-references
47. Reverse Engineering with MoDisco
▶ Model Discovery
▶ reverse engineering for Java, based on EMF
▶ discovery, i.e. finding sources (so-called compilation units) within projects, source folders, and packages
▶ uses Eclipse’s workspace and Java Development Tools (JDT)
▶ provides
■ discoverers for many languages: Java, Xtext, JSP, XML
■ creates instances of a Java EMF meta-model that corresponds to the handwritten JDT AST-model
■ provides transformation to language-independent artifacts, e.g. KDM
48. From Source Code- to Model-Repository
[Figure: a version control system holds snapshots (A1 B1), (A2 B1), (A2 B3), stored as A1 B1, A1-A2, B1-B3, from which models M1, M2, M3 are built; a reference f points to B.f. Operations shown: Checkout(r), d2CUs(r), Parse(d), Merge(r), Save(r), Load(r), Analysis(r). Processing variants over revisions (R) and compilation units (CUs): Checkout + Parse + Analysis; Checkout + (Parse + Merge) + Analysis; (Load + Merge) + Analysis; (Load + Analysis0).]
66. Importing and Traversing Source Code Repositories
[Figure: import and traverse phases over revisions R1, R2, R3 with compilation unit versions A1, A2, B1, B3 and a reference f (B.f?); individual steps are marked ✓ or ✗.]