Ontology reasoning is typically a computationally intensive operation. While soundness and completeness of results is required in some use cases, for many others, a sensible trade-off between computation efforts and correctness of results makes more sense. In this paper, we show that it is possible to approximate a central task in reasoning, i.e., A-box consistency checking, by training a machine learning model which approximates the behavior of that reasoner for a specific ontology. On four different datasets, we show that such learned models constantly achieve an accuracy above 95% at less than 2% of the runtime of a reasoner, using a decision tree with no more than 20 inner nodes. For example, this allows for validating 293M Microdata documents against the schema.org ontology in less than 90 minutes, compared to 18 days required by a state of the art ontology reasoner.
2. 05/31/16 Heiko Paulheim, Heiner Stuckenschmidt 2
How to Play
Jazz Improvisation
on the Piano
Heiko Paulheim, Heiner Stuckenschmidt
3. 05/31/16 Heiko Paulheim, Heiner Stuckenschmidt 3
Introduction
• Learning Improvisation
– Music comes with chord progressions
– Different scales fit different chords
– Scales are comprised of notes
• Playing Improvisation
– Inspect chord progression
– Identify matching scale(s)
– Play notes from those scales
4. 05/31/16 Heiko Paulheim, Heiner Stuckenschmidt 4
Introduction
• Essentially, the model is an ontology
Scale Chord
Note
Major Minor
consists of consists of
matches
• ...and improvisation playing
is a reasoning task:
– Given a chord progression, infer matching notes
5. 05/31/16 Heiko Paulheim, Heiner Stuckenschmidt 5
Introduction
• Challenge
– Reasoning is often slow
– But results are needed
in real time
• Solution
– Use fast approximate model
– Tolerate errors
(we call them “blue notes” ;-))
Guys, can you play a bit more slowly?
My reasoner is still running!
Thank God I don't
need reasoning
6. 05/31/16 Heiko Paulheim, Heiner Stuckenschmidt 6
Introduction
• How you really learn improvisation playing
Scale Chord
Note
Major Minor
consists of consists of
matches
1) Learn about chords, scales, etc.
2) Play, play, play
3) Forget about chords, scales, etc.
7. 05/31/16 Heiko Paulheim, Heiner Stuckenschmidt 7
Introduction
• Can we learn to imitate a reasoner?
– By learning a simpler approximate model
• Setup
– Observe what a reasoner is doing
– Use inputs and outputs as training signals
– Train a machine learning model
– Use the (fast) machine learning model
• In this paper
– Restricted to A-box consistency checking
– i.e., a binary classification task (consistent/inconsistent)
8. 05/31/16 Heiko Paulheim, Heiner Stuckenschmidt 8
Motivation
• Key assumptions
– We have to classify many A-boxes for the same T-box
• i.e., a T-box specific reasoning model makes sense
– 100% accuracy is often not needed where reasoning is used
• e.g., information retrieval,
recommender systems
– On the other hand, real-time results are desirable
– Larger parts of the ontology and
language expressivity
may be neglectable
• see, e.g., our findings
on schema.org*
* Meusel, Bizer, and Paulheim (2015): A web-scale study of the
adoption and evolution of the schema.org vocabulary over time
9. 05/31/16 Heiko Paulheim, Heiner Stuckenschmidt 9
Approach
• Obtain labeled training data using an actual reasoner
• Feed training data and labels into learning algorithm
– Requires propositional form (i.e., feature vectors)
• Use learned model as approximation
– Validate against ground truth (i.e., actual reasoner)
T-box
A-box
Ontology
Reasoner
ML
Classifier
consistency
Feature
Vector
Builder
instances
(features)
A-box
consistency
(prediction)
training data application data
10. 05/31/16 Heiko Paulheim, Heiner Stuckenschmidt 10
Approach
• Turning RDF graphs into proposition features
– We use root path kernels (aka walk kernels)*
– Variation: literals are represented by their datatype, not the actual value
* Lösch et al. (2012): Graph kernels for RDF data
_:1
_:2
_:3
schema.org/
director
schema.org/
actor
“Red Tails”schema.org/
name
schema.org/Person
rdf:type
rdf:type
“Anthony Hemingway”
schema.org/
name
“Cuba Gooding Jr.”
schema.org/
name
schema.org:name→
xsd:string
schema.org:director→
schema.org:name→
xsd:string
schema.org:actor→
schema.org:name→
xsd:string
schema.org:director→
rdf:type→
schema.org:Person
schema.org:actor→
rdf:type→
schema.org/Person
rdf:type→
schema.org/Person
`
_:1 1 1 1 1 1 0
_:2 1 0 0 0 0 1
_:3 1 0 0 0 0 1
11. 05/31/16 Heiko Paulheim, Heiner Stuckenschmidt 11
Evaluation
• Four cases
– Validating individual relation assertions in DBpedia+DOLCE*
• A relation assertion plus subject's and object's types
• DBpedia+DOLCE Ultra Lite ontology
– Validating individuals in YAGO+DOLCE
• An individual plus its types
• Using a mapping from YAGO to DOLCE Ultra Lite ontology
– Validating entire RDFa documents against the GoodRelations ontology
• From WebDataCommons
– Validating entire Microdata documents against the schema.org ontology
• From WebDataCommons
• Top level class disjointnesses added to schema.org ontology**
* Setting as in Paulheim and Gangemi (2015): Serving DBpedia with DOLCE - more than just
adding a cherry on top
** Using the OWL version from http://topbraid.org/schema/
12. 05/31/16 Heiko Paulheim, Heiner Stuckenschmidt 12
Evaluation
Tim Berners-Lee Royal Society
Award
award
range
Description
subclass of
Social Agent
disjoint
with
DBpedia
ontology
DBpedia
instances
DOLCE
ontology
Organisation
is a
Social Person
equivalent class
subclass of
• DBpedia and DOLCE Ultra Lite
– Mapping included since release 3.9 (2013)
– Adds few top level disjointnesses
• Can help detect inconsistent assignments
13. 05/31/16 Heiko Paulheim, Heiner Stuckenschmidt 13
Evaluation
• Disjointness axioms added for schema.org
– Between nine top level classes
– Human judgement based on intensional disjointness
WebDataCommons
14. 05/31/16 Heiko Paulheim, Heiner Stuckenschmidt 14
Evaluation
• Three variants for each case
– 11, 111, and 1,111 A-box
– Rationale: training on 10, 100, and 1,000 A-box, 10-fold CV
• Comparison and ground truth system: HermiT
15. 05/31/16 Heiko Paulheim, Heiner Stuckenschmidt 15
Results
• Using four well-known classifiers (in RapidMiner+Weka)
– In standard settings
– General: Decision Trees work best
• And are surprisingly small (<20 nodes)
17. 05/31/16 Heiko Paulheim, Heiner Stuckenschmidt 17
Results
• Scalability considerations: how long would it take to validate
– All statements in DBpedia (~15M)
– All instances in YAGO (~2.9M)
– All RDFa documents using GoodRelations (~356k)
– All Microdata documents using schema.org (~293M)
• Formula for extrapolation:
• Results:
18. 05/31/16 Heiko Paulheim, Heiner Stuckenschmidt 18
Summary
• Ontology reasoning can be well approximated
• Decision trees reach >95% accuracy w/ 1,000 labeled examples
• Decision trees are small (<20 nodes)
– Interesting to apply in computationally limited settings,
e.g., smart devices or sensors
• Approximation is highly scalable
– Validating all schema.org Microdata documents: 90 minutes vs. 18 days
• Why does it work so well?
– Learned models concentrate on relevant fragments
– Reasons for inconsistency are not equally distributed
→ learned models adapt to most frequent inconsistencies
19. 05/31/16 Heiko Paulheim, Heiner Stuckenschmidt 19
Outlook on Future Work
• Faster application of learned models possible
– Think: translate 20 node decision tree into a set of SPARQL queries
• Ontology summarization
– Learned models (e.g., trees) point to relevant concepts
• Uncovering inconsistency explanations
– e.g., structured prediction
• More effective and efficient reasoner usage
– Example selection by active learning
• Systematically investigate interplay between
– Feature representation
– Ontology complexity
– Learning paradigm
20. 05/31/16 Heiko Paulheim, Heiner Stuckenschmidt 20
Thank you!
Q/A session for a highly debated paper, as imagined by the speaker.
Details may vary.
21. 05/31/16 Heiko Paulheim, Heiner Stuckenschmidt 21
Fast Approximate
A-box Consistency Checking
using Machine Learning
Heiko Paulheim, Heiner Stuckenschmidt