This document discusses robust hashing techniques for models:
1. Robust hashing aims to generate the same or similar hashes for small variations of an input model, unlike standard hashing, which produces very different hashes for small changes. This makes it possible to detect manipulated copies.
2. The techniques aim for robustness against data distortions and for the ability to discriminate between different models (avoiding false positives and false negatives). Applications include search, classification, plagiarism detection, and model accountability.
3. The approach fragments models into overlapping pieces, assigns a signature to each piece via minhashing, and groups similar pieces into buckets using locality-sensitive hashing, minimizing the effect of variations while still detecting mutations. Testing showed robustness to model mutations and the ability to discriminate between different models.
3. A hash function is any function that can be used to map data of arbitrary size to data of a fixed size.
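As a concrete illustration of the fixed-size property (using Python's standard `hashlib` purely as an example; this is not the hashing scheme described in these slides):

```python
import hashlib

# SHA-256 maps inputs of any length to a fixed 256-bit digest
# (64 hex characters), illustrating the "arbitrary size in,
# fixed size out" property of a hash function.
for data in (b"a", b"a much longer input, " * 1000):
    digest = hashlib.sha256(data).hexdigest()
    print(len(data), "->", len(digest))  # digest length is always 64
```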
4. Hashing vs Robust Hashing
Hashing: small changes to the input generate very different hashes.
Robust hashing: small variations of the input generate the same or similar hashes.
This is key to searching for and detecting not only exact copies but also manipulated versions.
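The avalanche behavior of a standard hash can be seen in a quick sketch (SHA-256 and the toy "model" strings are illustrative assumptions, not the slides' actual setup):

```python
import hashlib

# Two serialized "models" differing by a single character.
h1 = hashlib.sha256(b"class Person { attr name : String }").hexdigest()
h2 = hashlib.sha256(b"class Person { attr name : string }").hexdigest()

# A standard hash shows avalanche behavior: the digests share almost
# nothing even though the inputs are nearly identical. A robust hash
# would instead map both inputs to the same or similar values.
matching = sum(a == b for a, b in zip(h1, h2))
print(h1 != h2, matching, "of", len(h1), "hex positions match")
```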
6. Properties
Robustness: resistance to data distortions; attempts to hide the copy are detected (avoiding false negatives).
Discrimination: capacity to detect that two models are indeed different (avoiding false positives).
12. Requirements
Independent of the storage format
Independent of the graphical layout
Independent of the concrete syntax
Must take into account not only element properties but also their relationships
14. Model fragmentation
• Avoids using individual elements as the hashing unit
• Fragments are created independently of each other
• They do not need to satisfy any semantic criteria
• Many overlapping fragments are used, for resilience
• Each fragment gets a signature via minhashing
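A minimal sketch of the minhash step, assuming a fragment is represented as a set of tokens; Python's built-in `hash()` salted with random values stands in for the family of hash functions, and the fragment contents are hypothetical:

```python
import random

def minhash_signature(tokens, num_hashes=16, seed=42):
    """For each of num_hashes simulated hash functions, keep the minimum
    hash value over the fragment's tokens. Similar token sets yield
    signatures that agree in roughly a Jaccard-similarity fraction of
    their positions."""
    rng = random.Random(seed)
    salts = [rng.getrandbits(32) for _ in range(num_hashes)]
    # Python's hash() is salted per interpreter run (PYTHONHASHSEED),
    # so signatures are stable within a run but not across runs.
    return [min(hash((salt, t)) & 0xFFFFFFFF for t in tokens)
            for salt in salts]

# Two hypothetical fragments (token sets) that overlap heavily.
frag_a = {"Person", "name", "String", "Address"}
frag_b = {"Person", "name", "String", "City"}

sig_a = minhash_signature(frag_a)
sig_b = minhash_signature(frag_b)
# Similar fragments agree in many signature positions.
overlap = sum(a == b for a, b in zip(sig_a, sig_b))
```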
16. Classification of fragments
• Similar fragments are grouped in the same bucket via locality-sensitive hashing
• Buckets minimize the effect of model variations (varied fragments still end up in the same bucket)
• Only a few samples are taken per bucket
• Mutations are not propagated across buckets
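The bucketing step can be sketched with the standard LSH banding technique over minhash signatures; the signatures and band/row sizes below are made-up illustrations, not the slides' actual parameters:

```python
from collections import defaultdict

def lsh_buckets(signatures, bands=4, rows=4):
    """Split each signature (length bands*rows) into bands; fragments
    whose values agree on an entire band share that band's bucket.
    A small mutation changes only some bands, so the fragments still
    collide in the remaining ones."""
    buckets = defaultdict(set)
    for frag_id, sig in signatures.items():
        for b in range(bands):
            band = tuple(sig[b * rows:(b + 1) * rows])
            buckets[(b, band)].add(frag_id)
    return buckets

sigs = {
    "f1": [3, 7, 1, 9,  2, 8, 5, 4,  6, 0, 3, 7,  1, 2, 9, 5],
    "f2": [3, 7, 1, 9,  2, 8, 5, 4,  6, 0, 3, 7,  1, 2, 9, 8],  # last value mutated
    "f3": [9, 1, 4, 2,  7, 3, 0, 6,  5, 8, 2, 1,  4, 7, 3, 0],  # unrelated fragment
}
buckets = lsh_buckets(sigs)
# f1 and f2 still share three band buckets despite the mutation;
# f3 shares no bucket with either, so it is discriminated.
candidates = {frozenset(v) for v in buckets.values() if len(v) > 1}
```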
18. Robustness: tested by automatically mutating an initial set of models.
Discrimination: tested by taking models from the ATL zoo and checking that they are recognized as different (even if some of them use a similar vocabulary).