Semantically-enabled Digital Investigations
Leveraging Semantic Web technologies for representing, integrating, correlating and querying multi-domain digital forensic data
3. Problem Area
• Complex attacks against
networked systems
• Multiple data sources of possible
evidentiary value
– Volume & Variety
– “looking for a needle in a stack of
needles” – Paul Pillar, CIA CoA
• Analysis of the collected digital
data
– Least formalized process step
– Relies on investigators’ expertise and
experience
2015-05-17 ISACA Dagen 2013
4. Digital Evidence / Investigations
• Reliable digital data that support
hypothesizing about a security
incident
• Sound methods for collecting and
interpreting digital data
• Reconstruct events found to be
criminal (DF)
• Investigate and learn from
information security breaches (IR)
5. Forensic Tools
• Interpreters between data
abstraction layers
– e.g. Reconstruct raw disk data into
filesystem hierarchy and objects (files,
directories)
• Evidence- but not investigation-
centric design
• Limited tool interoperability
– Manual integration of tool findings
– Multiple (proprietary, undocumented)
data formats/models
7. Semantic Web & Linked Data
Technologies
• “… information is given well-defined
meaning, better enabling computers
and people to work in cooperation”
(Tim Berners-Lee, 2001)
• Ontology – “explicit and formal
specification of a conceptualization”
– Entities, attributes, relationships
• Metadata – context-based or
domain-specific annotation of data
• Reasoning and inference of implicit facts
8. Semantic Web Architecture
• URI/IRI enables global data object
identification
• XML provides a machine-readable,
validatable data encoding scheme
• RDF(S) is a metadata data model and
knowledge representation language
– Subject-Property-Object/Value statements
– Class and Property hierarchies
• OWL 2 is a more expressive KR
language for specifying ontologies
– Restrictions, Equivalence, Cardinality,
Property Chains
• Rule and RDF-query languages
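The statement and hierarchy ideas above can be illustrated with a minimal sketch in plain Python. It is not an RDF library: triples are modelled as tuples, and every `ex:` name is a hypothetical placeholder rather than a term from the presented ontology. It shows how a class hierarchy lets a reasoner derive type statements that were never written down explicitly.

```python
# Toy subject-property-object statements; all ex: names are hypothetical.
RDF_TYPE = "rdf:type"
SUBCLASS_OF = "rdfs:subClassOf"

triples = {
    ("ex:Executable", SUBCLASS_OF, "ex:File"),
    ("ex:File", SUBCLASS_OF, "ex:DigitalArtifact"),
    ("ex:file42", RDF_TYPE, "ex:Executable"),
    ("ex:file42", "ex:fileName", "invoice.exe"),
}


def superclasses(cls, triples):
    """Return all direct and inherited superclasses of cls."""
    found, todo = set(), [cls]
    while todo:
        current = todo.pop()
        for s, p, o in triples:
            if s == current and p == SUBCLASS_OF and o not in found:
                found.add(o)
                todo.append(o)
    return found


def infer_types(triples):
    """Add the rdf:type statements implied by the class hierarchy."""
    inferred = set(triples)
    for s, p, o in triples:
        if p == RDF_TYPE:
            for parent in superclasses(o, triples):
                inferred.add((s, RDF_TYPE, parent))
    return inferred


closed = infer_types(triples)
```

After inference, `ex:file42` is also typed as `ex:File` and `ex:DigitalArtifact`, so a query for all digital artifacts would find it even though only its most specific class was asserted.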
12. Semantic Representation
• Resource Unique Identification Scheme
• Parsing tools able to process each source type with
respect to the domain ontology
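One possible shape for a unique identification scheme is sketched below. The namespace, the function name `mint_uri`, and the choice of hashing a natural key are all assumptions made for illustration; the slides do not specify the actual scheme. The point is that parsers for different sources produce the same identifier whenever they describe the same object in the same context.

```python
import hashlib
from urllib.parse import quote

BASE = "http://example.org/case/"  # hypothetical namespace


def mint_uri(case_id, source_type, natural_key):
    """Build a deterministic URI from a case, a source type and a natural key."""
    # Hash the natural key so that arbitrary paths or payloads yield a safe, short token.
    digest = hashlib.sha256(natural_key.encode("utf-8")).hexdigest()[:16]
    return f"{BASE}{quote(case_id, safe='')}/{quote(source_type, safe='')}/{digest}"


a = mint_uri("case-001", "disk", "/Users/alice/Downloads/invoice.exe")
b = mint_uri("case-001", "disk", "/Users/alice/Downloads/invoice.exe")
c = mint_uri("case-001", "pcap", "/Users/alice/Downloads/invoice.exe")
```

Here `a` and `b` are identical because the inputs are identical, while `c` differs because the source type differs.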
13. Evidence Integration
• Automated linking among homogeneous and heterogeneous evidence
sources based on key properties & matching rules
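A minimal sketch of rule-based linking follows, assuming two record sets (browser downloads and disk files) that share a file path as the key property. The record layout, the `ex:savedAs` link property, and the path normalization rule are hypothetical; the real matching rules are not given in the slides.

```python
def normalize_path(path):
    """Matching rule (assumed): compare paths case-insensitively with '/' separators."""
    return path.replace("\\", "/").lower()


def link_by_key(left, right, left_key, right_key, normalize, link_property):
    """Emit (left_id, link_property, right_id) for records whose normalized keys match."""
    index = {}
    for rec in right:
        index.setdefault(normalize(rec[right_key]), []).append(rec["id"])
    links = []
    for rec in left:
        for target in index.get(normalize(rec[left_key]), []):
            links.append((rec["id"], link_property, target))
    return links


downloads = [{"id": "ex:download7", "savedTo": "C:\\Users\\Alice\\Downloads\\invoice.exe"}]
disk_files = [
    {"id": "ex:file42", "path": "c:/users/alice/downloads/invoice.exe"},
    {"id": "ex:file43", "path": "c:/users/alice/notes.txt"},
]
links = link_by_key(downloads, disk_files, "savedTo", "path", normalize_path, "ex:savedAs")
```

The result is a single new statement connecting the download record to the matching disk file, which is what allows later queries to cross from one evidence source to the other.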
14. Evidence Correlation
• Link instances of dissimilar
type across a shared
domain
• Temporal Correlation
– Rules for establishing time
instant & interval relations
among recovered artifacts
• Mereological Correlation
– “partOf” transitivity relations
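Both correlation types can be sketched in a few lines. The interval classifier below is a coarse simplification of Allen-style interval relations (four outcomes only), and the closure function is a naive fixed-point loop; the timestamps and `ex:` identifiers are invented for illustration and do not come from the presented system.

```python
from datetime import datetime


def interval_relation(a_start, a_end, b_start, b_end):
    """Classify interval A against interval B using four coarse relations."""
    if a_end < b_start:
        return "before"
    if b_end < a_start:
        return "after"
    if a_start >= b_start and a_end <= b_end:
        return "within"
    return "overlaps"


def part_of_closure(pairs):
    """Transitive closure of direct (part, whole) pairs."""
    closure = set(pairs)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


# Hypothetical artifacts: a browsing session and a download that happened inside it.
session = (datetime(2013, 5, 1, 10, 0), datetime(2013, 5, 1, 10, 30))
download = (datetime(2013, 5, 1, 10, 5), datetime(2013, 5, 1, 10, 6))
relation = interval_relation(*download, *session)

# Hypothetical containment chain: packet -> TCP flow -> capture file.
parts = part_of_closure({
    ("ex:packet9", "ex:tcpFlow3"),
    ("ex:tcpFlow3", "ex:capture1"),
})
```

The closure adds the implied pair `("ex:packet9", "ex:capture1")`, so a question about the capture file can reach individual packets without an explicit statement linking them.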
16. Integrated Query
• Purpose-built triplestore (graph) database engine can
store the final dataset
– Up to billions of triples
• SQL-like queries against the integrated/correlated
evidence set
• Graph pattern matching
techniques
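Graph pattern matching can be illustrated with a small conjunctive matcher over triples. This is a sketch of the general idea, not the query engine of any particular triplestore; the data, the `ex:` property names, and the `?variable` convention for pattern terms are assumptions made for the example.

```python
def match_pattern(pattern, triple, bindings):
    """Unify one triple pattern with one triple; return extended bindings or None."""
    result = dict(bindings)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # A variable must keep the value it was bound to by earlier patterns.
            if result.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return result


def query(patterns, triples):
    """Return every variable binding that satisfies all patterns together."""
    solutions = [{}]
    for pattern in patterns:
        next_solutions = []
        for bindings in solutions:
            for triple in triples:
                extended = match_pattern(pattern, triple, bindings)
                if extended is not None:
                    next_solutions.append(extended)
        solutions = next_solutions
    return solutions


# Hypothetical integrated evidence set.
triples = [
    ("ex:file42", "ex:flaggedAs", "ex:Malicious"),
    ("ex:file42", "ex:downloadedFrom", "ex:ip_203_0_113_5"),
    ("ex:ip_203_0_113_5", "ex:belongsTo", "ex:ExampleISP"),
    ("ex:file43", "ex:downloadedFrom", "ex:ip_198_51_100_7"),
]
patterns = [
    ("?file", "ex:flaggedAs", "ex:Malicious"),
    ("?file", "ex:downloadedFrom", "?ip"),
    ("?ip", "ex:belongsTo", "?isp"),
]
results = query(patterns, triples)
```

Each pattern narrows the candidate bindings, so the three patterns together act as a join across disk, network, and online-lookup data.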
20. Sample Query
• “Is any file resident on the disk malicious and if yes where
has it been downloaded from and which ISP did the IP
belong to?”
22. Example Hypothesis Queries
• Have there been any unsuccessful connection attempts
from systems in the same network as the one that hosted
the malicious file?
• Which disk files have been created or accessed shortly after
the malicious file was downloaded?
• Has there been any successful connection between our
system and a known malicious host?
• Which files have been accessed shortly before the host
communicated with any blacklisted network host?
• Which websites have been visited by the user shortly
before the download of the malicious file?
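Several of these hypotheses hinge on "shortly after" or "shortly before", which the slides leave undefined. The sketch below assumes a fixed five-minute window and a simple list of file events; both the window and the event layout are illustrative assumptions, not the rules used in the presented system.

```python
from datetime import datetime, timedelta


def events_shortly_after(events, anchor_time, window):
    """Return events whose timestamp falls in (anchor_time, anchor_time + window]."""
    return [e for e in events if anchor_time < e["time"] <= anchor_time + window]


# Hypothetical download time of the malicious file.
download_time = datetime(2013, 5, 1, 10, 5)

# Hypothetical file system events recovered from the disk image.
file_events = [
    {"file": "ex:file42", "action": "created", "time": datetime(2013, 5, 1, 10, 5, 2)},
    {"file": "ex:file50", "action": "accessed", "time": datetime(2013, 5, 1, 10, 7)},
    {"file": "ex:file51", "action": "accessed", "time": datetime(2013, 5, 1, 11, 30)},
    {"file": "ex:file52", "action": "accessed", "time": datetime(2013, 5, 1, 9, 0)},
]
hits = events_shortly_after(file_events, download_time, timedelta(minutes=5))
```

Only the two events inside the window are returned; widening or narrowing `window` changes what counts as "shortly", which is an investigator's judgment call rather than a fixed property of the data.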
23. Summary
• Ability to represent and integrate heterogeneous data
• Supports the formulation and execution of complex queries
• Expandable (ontologies, rules, queries)
• Computational complexity depends on the ontology, rules,
and amount of data
• Reliance on online data sources may affect the accuracy of
the results
24. Future Work
• Advanced reasoning capabilities (e.g. detect
anti-forensic inconsistencies)
• Extended analysis techniques (e.g. additional
data sources, user activities)
• Large scale performance evaluation, distributed
architecture
• User-friendly graphical interface for rule/query
formulation and result navigation