A short talk on the problem of mining linked data (RDF) patterns, introducing a few preliminary notions towards the definition of generic linked data mining algorithms.
1. What makes a linked data pattern interesting?
Szymon Klarman
Department of Computer Science
Brunel University London
June 7, 2016
Connected Data London
#ConnectedData2016
2. Linked Data
data/knowledge represented in W3C standards OWL/RDF(S)
flexible, unrestrictive, extendible
machine (and human) accessible
connected into a global Web of Data
(open) and reusable (and when combined great things might happen!)
perfectly functional also in closed environments
3. RDF(S) = graph structure + logical inference
[Diagram: an RDF graph with two nodes, b and p.]
( b type Regulation )
( b label "GRB2 regulates GAB1" )
( b has participant A p )
( p type Protein )
( p has entity id UniProt:P34723 )
5. RDF(S) = graph structure + logical inference
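[Diagram: the data graph together with its RDFS schema.]
Data:
( b type Regulation )
( b has participant A p )
( p type Protein )
Schema:
( Regulation subclass of Molecular interaction )
( Molecular interaction subclass of Biological event )
( has participant A subproperty of has participant )
( has participant B subproperty of has participant )
has participant also has a declared domain and range; the range is Chemical.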
6. RDF(S) = graph structure + logical inference
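[Diagram: the data graph and schema, with the inferred triples added.]
Asserted:
( b type Regulation )
( b has participant A p )
( p type Protein )
Inferred:
( b has participant p )
( b type Molecular interaction )
( b type Biological event )
( p type Chemical )
Schema:
( Regulation subclass of Molecular interaction )
( Molecular interaction subclass of Biological event )
( has participant A subproperty of has participant )
( has participant B subproperty of has participant )
has participant also has a declared domain and range; the range is Chemical.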
7. RDF(S) = graph structure + logical inference
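[Diagram: the data graph, a query, and the schema.]
Data:
( b type Regulation )
( b has participant A p )
( p type Protein )
Querying:
( ?z type Biological event )
( ?z has participant ?y )
( ?y type Chemical )
Schema:
( Regulation subclass of Molecular interaction )
( Molecular interaction subclass of Biological event )
( has participant A subproperty of has participant )
( has participant B subproperty of has participant )
has participant also has a declared domain and range; the range is Chemical.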
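The interplay of graph structure and inference on these slides can be sketched in a few lines of Python. This is only an illustration, not the tooling used in the talk: the triples mirror the slide, the domain of "has participant" is an assumption (the slide does not make it recoverable), and `rdfs_closure` and `answer_query` are hypothetical helper names.

```python
# Toy RDFS reasoner. Class and property names mirror the slide.
SCHEMA = {
    ("Regulation", "subclass of", "Molecular interaction"),
    ("Molecular interaction", "subclass of", "Biological event"),
    ("has participant A", "subproperty of", "has participant"),
    ("has participant B", "subproperty of", "has participant"),
    ("has participant", "domain", "Biological event"),  # assumed for this sketch
    ("has participant", "range", "Chemical"),
}
DATA = {
    ("b", "type", "Regulation"),
    ("b", "has participant A", "p"),
    ("p", "type", "Protein"),
}


def rdfs_closure(data, schema):
    """Apply four RDFS rules until no new triple is produced."""
    facts = set(data)
    while True:
        new = set()
        for s, p, o in facts:
            for a, rel, b in schema:
                if rel == "subclass of" and p == "type" and o == a:
                    new.add((s, "type", b))
                elif rel == "subproperty of" and p == a:
                    new.add((s, b, o))
                elif rel == "domain" and p == a:
                    new.add((s, "type", b))
                elif rel == "range" and p == a:
                    new.add((o, "type", b))
        if new <= facts:
            return facts
        facts |= new


def answer_query(facts):
    """Answers to: ?z type Biological event, ?z has participant ?y, ?y type Chemical."""
    return sorted(
        (z, y)
        for z, p, y in facts
        if p == "has participant"
        and (z, "type", "Biological event") in facts
        and (y, "type", "Chemical") in facts
    )


print(answer_query(DATA))                        # → []
print(answer_query(rdfs_closure(DATA, SCHEMA)))  # → [('b', 'p')]
```

The query has no answer over the asserted triples alone; it only matches once the schema has been applied, which is the point of the "graph structure + logical inference" slides.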
8. Linked data mining
Emerging field: Workshop on Knowledge Discovery and Data Mining Meets
Linked Open Data since 2012 (+ Linked Data Mining Challenge).
Problems:
finding novel/surprising/interesting linked data patterns
identifying relevant semantic connections
predicting facts/links in knowledge graphs
Most modest yet fundamental task:
What’s in that linked data set?
Web of Data will soon contain a lot of significant answers (42!)...
...so we need to know how to ask the right question...
...so we need to understand what’s in these data sets.
Examples are from the Big Mechanism project (http://52.26.26.74/).
13. So what’s in that linked data set?
No structure...
14. Ontologies on the Web of Data
Concept & property hierarchies + type assertions make up most of the Web of Data.
B. Glimm, A. Hogan, M. Krötzsch, A. Polleres: „OWL: Yet to arrive on the Web of Data?”, 2012
Typical ontologies don’t reflect the actual
graph structure of data...
15. The actual „conceptual data model”
[Diagram: a conceptual data model. Classes: Biological event, Molecular interaction, Chemical / Event, Statement, Article, Journal, Submitter. Relations: represents / is represented by, is extracted from, published in, has submitter, has participant, type.]
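16. Linked data pattern
[Diagram: an example RDF graph.]
( GRB2_regulates_GAB1 type Regulation )
( GRB2_regulates_GAB1 type Biological event )
( GRB2_regulates_GAB1 has participant A GRB2_MOUSE )
( GRB2_regulates_GAB1 has participant B GAB1_MOUSE )
( GRB2_MOUSE type Protein )
( GAB1_MOUSE type Protein )
( statement_1 type Statement )
( statement_1 represents GRB2_regulates_GAB1 )
( GRB2_regulates_GAB1 is represented by statement_1 )
( statement_1 has submitter NaCTeM )
( statement_1 extracted from PMC123456 )
( NaCTeM type Submitter )
( PMC123456 type Article )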
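17. Linked data pattern
[Diagram: the same graph with individuals replaced by variables.]
( ?z type Regulation )
( ?z type Biological event )
( ?z has participant A ?x )
( ?z has participant B ?y )
( ?x type Protein )
( ?y type Protein )
( ?u type Statement )
( ?u represents ?z )
( ?z is represented by ?u )
( ?u has submitter ?v )
( ?u extracted from ?w )
( ?v type Submitter )
( ?w type Article )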
18. Linked data pattern
[Diagram: the graph pattern over the variables ?z, ?u, ?x, ?y, ?v, ?w, repeated from the previous slide.]
Linked data pattern ≈ conjunctive query / graph query
A query is a set of triples of the form:
( ?x type Concept )
( ?x Property ?y )
Linked data mining ≈ search through the query space
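To make "pattern ≈ conjunctive query" concrete, here is a minimal sketch of matching a set of triple patterns against a graph by backtracking. The data are the example triples from the slides; `is_var` and `matches` are hypothetical helper names, and a real system would use a SPARQL engine rather than this naive search.

```python
GRAPH = {
    ("GRB2_regulates_GAB1", "type", "Regulation"),
    ("GRB2_regulates_GAB1", "has participant A", "GRB2_MOUSE"),
    ("GRB2_regulates_GAB1", "has participant B", "GAB1_MOUSE"),
    ("GRB2_MOUSE", "type", "Protein"),
    ("GAB1_MOUSE", "type", "Protein"),
    ("statement_1", "type", "Statement"),
    ("statement_1", "represents", "GRB2_regulates_GAB1"),
    ("statement_1", "has submitter", "NaCTeM"),
    ("statement_1", "extracted from", "PMC123456"),
    ("NaCTeM", "type", "Submitter"),
    ("PMC123456", "type", "Article"),
}

PATTERN = [
    ("?z", "type", "Regulation"),
    ("?z", "has participant A", "?x"),
    ("?z", "has participant B", "?y"),
    ("?x", "type", "Protein"),
    ("?y", "type", "Protein"),
    ("?u", "represents", "?z"),
    ("?u", "type", "Statement"),
    ("?u", "has submitter", "?v"),
    ("?v", "type", "Submitter"),
    ("?u", "extracted from", "?w"),
    ("?w", "type", "Article"),
]


def is_var(term):
    """Variables are written with a leading question mark, as on the slides."""
    return term.startswith("?")


def matches(pattern, graph, binding=None):
    """Yield every variable binding under which all pattern triples occur in the graph."""
    binding = binding or {}
    if not pattern:
        yield dict(binding)
        return
    first, rest = pattern[0], pattern[1:]
    for triple in graph:
        candidate = dict(binding)
        ok = True
        for q, g in zip(first, triple):
            if is_var(q):
                # Bind the variable, or check it against its earlier binding.
                if candidate.setdefault(q, g) != g:
                    ok = False
                    break
            elif q != g:
                ok = False
                break
        if ok:
            yield from matches(rest, graph, candidate)


answers = list(matches(PATTERN, GRAPH))
print(len(answers), answers[0]["?v"])  # → 1 NaCTeM
```

Mining then amounts to generating many such pattern lists and scoring each one by what `matches` returns.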
19. When is a linked data pattern interesting?
Two evaluation criteria:
Frequency: the pattern has relatively many matches in the set;
Semantic content: the pattern contains relatively much information.
Frequency is the central criterion for the related problem of frequent
subgraph mining in the graph & multi-relational data setting.
⇢ linked data is graph data.
Semantic content criterion originates in logical/semantic theories of
information, and is used in inductive logic programming.
⇢ linked data is grounded in logic.
There is an inherent trade-off between the two criteria.
20. Frequency
The most frequent linked data patterns out there will always be:
X is something...
Something is somehow related to something else...
( ?x type owl:Thing )
( ?y type owl:Thing )
( ?x owl:topObjectProperty ?y )
X is an event of type...?
22. Semantic content
The linked data pattern with the most
semantic content is the entire RDF graph...
Pattern Q1 has more semantic content than pattern Q2 (over ontology O) if Q1 (with O) logically entails Q2.
Example:
Q1: ( ?z type Regulation ) ( ?z has participant A ?y ) ( ?y type Protein )
Q2: ( ?z type Biological event ) ( ?z has participant ?y ) ( ?y type Chemical )
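The entailment test can be approximated as follows: treat the variables of Q1 as constants, saturate Q1 with the schema, and check that every triple of Q2 appears in the result. This sketch assumes both patterns use the same variable names and covers only the subclass, subproperty and range rules; `saturate` and `has_more_content` are hypothetical names, not part of any library.

```python
SCHEMA = {
    ("Regulation", "subclass of", "Molecular interaction"),
    ("Molecular interaction", "subclass of", "Biological event"),
    ("has participant A", "subproperty of", "has participant"),
    ("has participant", "range", "Chemical"),
}

Q1 = {
    ("?z", "type", "Regulation"),
    ("?z", "has participant A", "?y"),
    ("?y", "type", "Protein"),
}
Q2 = {
    ("?z", "type", "Biological event"),
    ("?z", "has participant", "?y"),
    ("?y", "type", "Chemical"),
}


def saturate(triples, schema):
    """Add all triples derivable by the subclass, subproperty and range rules."""
    facts = set(triples)
    changed = True
    while changed:
        changed = False
        for s, p, o in list(facts):
            for a, rel, b in schema:
                derived = None
                if rel == "subclass of" and p == "type" and o == a:
                    derived = (s, "type", b)
                elif rel == "subproperty of" and p == a:
                    derived = (s, b, o)
                elif rel == "range" and p == a:
                    derived = (o, "type", b)
                if derived and derived not in facts:
                    facts.add(derived)
                    changed = True
    return facts


def has_more_content(q1, q2, schema):
    """True if q1 (with the schema) entails q2, with shared variables kept fixed."""
    return set(q2) <= saturate(q1, schema)


print(has_more_content(Q1, Q2, SCHEMA))  # → True
print(has_more_content(Q2, Q1, SCHEMA))  # → False
```

A general check would also have to search for a mapping between the variables of the two patterns; fixing the names keeps the sketch short.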
23. Trade-off
VALUE(Q) = weighted sum of FREQ(Q) and CONT(Q)
FREQ(Q) = #answers / #possible answers
CONT(Q) = 1 - Prob(Q is true a priori)
[Plot: Value, Freq and Cont curves; vertical axis from 0 to 1.2, horizontal axis from 0 to 900.]
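The trade-off can be written down as a small scoring function. The slide only says "weighted sum", so the `weight` parameter and its default of 0.5 are assumptions, as are the function names and the example numbers.

```python
def freq(n_answers, n_possible):
    """FREQ(Q): share of possible answers that actually match the pattern."""
    return n_answers / n_possible if n_possible else 0.0


def cont(prior_probability):
    """CONT(Q): one minus the a priori probability that the pattern is true."""
    return 1.0 - prior_probability


def value(n_answers, n_possible, prior_probability, weight=0.5):
    """VALUE(Q) as a weighted sum; the weight is a hypothetical tuning knob."""
    return weight * freq(n_answers, n_possible) + (1.0 - weight) * cont(prior_probability)


# A trivial pattern: matches everything, but is certain a priori.
print(value(900, 900, 1.0))   # → 0.5
# A more specific pattern: fewer matches, but far less likely a priori.
print(value(450, 900, 0.25))  # → 0.625
```

With equal weights, the more specific pattern scores higher even though it matches only half as often, which is the behaviour the trade-off is meant to capture.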
25. Algorithm
The space of all patterns over realistic linked data sets is virtually infinite.
But there are some good search heuristics:
use precomputed „promising” building blocks;
„climb up” over the most successful queries so far (but use a restart rule
to avoid getting stuck locally).
[Plot: Value, Freq and Cont curves; vertical axis from 0 to 1.2, horizontal axis from 0 to 900.]
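The two heuristics can be illustrated with a generic hill-climbing loop. This is a simplified sketch, not the algorithm from the talk: a query is a set of building blocks, each step tries to add one random block, and the search restarts from a fresh block after `patience` steps without improvement. The block list and the `toy_value` scoring function are invented for the example.

```python
import random


def climb(blocks, value, steps=2000, patience=20, seed=0):
    """Greedy search over sets of building blocks, with a restart rule."""
    rng = random.Random(seed)
    best = current = frozenset([rng.choice(blocks)])
    stale = 0
    for _ in range(steps):
        candidate = current | {rng.choice(blocks)}
        if value(candidate) > value(current):
            current, stale = candidate, 0  # climb up
        else:
            stale += 1
        if value(current) > value(best):
            best = current
        if stale >= patience:  # restart rule: avoid getting stuck locally
            current, stale = frozenset([rng.choice(blocks)]), 0
    return best


# Hypothetical building blocks: single triple patterns.
BLOCKS = [
    ("?z", "type", "Regulation"),
    ("?z", "has participant A", "?x"),
    ("?x", "type", "Protein"),
    ("?z", "type", "Journal"),
    ("?x", "type", "Submitter"),
]
GOOD = frozenset(BLOCKS[:3])


def toy_value(query):
    """Toy score: reward blocks that fit together, penalise the others."""
    return len(query & GOOD) - 2 * len(query - GOOD)


best = climb(BLOCKS, toy_value)
print(sorted(best), toy_value(best))
```

In a real setting `value` would be the VALUE(Q) score computed against the data set, and the blocks would be the precomputed "promising" fragments.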
26. What’s next...
The question „what’s in that linked data set?” is perhaps not the major one,
but the suggested notion of interestingness might well be:
„frequency vs. semantic content” trade-off reflects the dual – graphical
and logical – nature of the RDF(S) representation model.
many of the linked data mining tasks can be described as: given Q2 find
an interesting Q1 such that:
Q1 ⇢ Q2
other, more abstract criteria might also be necessary.
Linked data mining requires novel principles and foundational approaches.