Detection of Related Semantic Datasets Based on Frequent Subgraph Mining

Detection of Related
Semantic Datasets Based
on Frequent Subgraph
Mining
Mikel Emaldi1, Oscar Corcho2, and Diego López-de-Ipiña1
1 Deusto Institute of Technology - DeustoTech, University of Deusto, Bilbao, Spain
{m.emaldi,dipina}@deusto.es
2 Ontology Engineering Group, Departamento de Inteligencia Artificial, Facultad de Informática,
Universidad Politécnica de Madrid, Spain
ocorcho@fi.upm.es

Table of Contents
1. Introduction
2. State of the Art
3. Proposed solution
4. Evaluation
5. Conclusions
6. Current & Future Work
2

May, 2007
Evolution of the LOD Cloud
3

August, 2014
Evolution of the LOD Cloud
4

Introduction
• The LOD Cloud has grown enormously.
• Difficulties when…
• … exploring data provided by different datasets.
• … selecting from which dataset we want to browse
data.
• … trying to avoid overlapping existing data.
• … finding other datasets for interlinking our
dataset.
5

My New
Dataset
Linking Framework
● Silk
● Limes
● ...
Links
The problem of selecting linking sources
?
6

State of the Art I
• Nikolov, A., d’Aquin, M.: Identifying relevant sources
for data linking using a Semantic Web index. In:
Proceedings of the Linked Data on the Web workshop in
conjunction with the 20th international World Wide Web
conference. vol. 813 (2011).
• Extract literals from rdfs:label, foaf:name or dc:title properties
and search them in Sig.ma, grouping the results by source
dataset.
• More results a source has, more chances to be linked with the
original dataset.
7

State of the Art II
• Leme, L.A.P.P., Lopes, G.R., Nunes, B.P., Casanova,
M.A., Dietze, S.: Identifying candidate datasets for
data interlinking. In: Proceedings of the 13th
International Conference on Web Engineering. pp. 354–
366. Springer (2013)
• Naive Bayes classifier for establishing a ranking of related
datasets based on correlations among them for reducing the
search space.
• These correlations are based on existing links.
8

State of the Art III
• Lopes, G.R., Leme, L.A.P.P., Nunes, B.P., Casanova,
M.A., Dietze, S.: Recommending tripleset interlinking
through a social network approach. In: Proceedings of
the 14 th international conference on Web Information
Systems Engineering. pp. 149–161. Springer (2013)
• They use already existing links for establishing new links among
datasets.
9

Proposed Solution I
• Our goals:
• Do not use literals because they are a
scarce resource.
• Do not use already existing links
• Trying to solve the cold start problem.
10

Proposed Solution II
• A solution based on the graph structure of
the datasets.
• Hypothesis:
Given a set of datasets from the Semantic Web,
candidate datasets to be interlinked can be found
through the comparison of their structures.
11

But...
...some datasets are incredibly large!!! 12

Proposed Solution III
http://datahub.io/dataset/nytimes-linked-open-data
triples > 345,000
http://datahub.io/dataset/rkb-explorer-acm
triples > 12,000,000
http://datahub.io/dataset/rkb-explorer-dblp
triples > 24,000,000
13

Proposed Solution IV
• Extract the most frequent subgraph from
each graph.
• Summarizing the graph.
• Easing the matching task among graphs.
• Subgraphs are smaller.
• The FSM task only has to be launched once
per graph (until the graph is modified).
14

Proposed Solution - Frequent Subgraph Mining I
• Main objective: to extract the most frequent
subgraphs from a single graph or a set of graphs.
• According to the state of the art [1], four
characteristics define a FSM tool:
• Single graphs vs. transactions: ability for extracting the
most frequent subgraphs from a single graph or from a set
of graphs.
• Directed graphs: ability for dealing with directed graphs.
• Labeled vertices and/or edges: ability for dealing with
graphs including labels in their vertices and/or edges.
15

Proposed Solution - Frequent Subgraph Mining II
Solution
Single Graph /
Transactions
Directed
Graphs
Labeled Vertex Labeled Edges
Subdue Single Graph ✔ ✔ ✔
AGM Transactions ✔ ✔ ✔
FSG Transactions ✖ ✔ ✔
DP Mine Transactions ✔ ✔ ✔
MoFA Transactions ✖ ✔ ✖
gSpan Transactions ✖ ✔ ✔
FFSM Transactions ✖ ✔ ✔
GREW Single Graph ✖ ✔ ✔
Gaston Single Graph ✖ ✔ ✔
gApprox Transactions ✖ ✖ ✖
(h/v)Sigram Single Graph ✖ ✔ ✔
16

Proposed Solution - SUBDUE
• SUBDUE defines the most frequent subgraph as the
subgraph that once replaced by a single node, most
compresses the original graph.
• G → original graph and S → candidate subgraph:
• Where:
17

Proposed Solution - From RDF to SUBDUE
• Some transformations in the RDF graphs before
applying FSM.
• Transformation I: replacing URIs from subjects by
the rdf:type of each resource.
• Transformation II: removal of external links.
18

Proposed Solution - From RDF to SUBDUE II
19

Proposed Solution - From RDF to SUBDUE III
20

Proposed Solution - Most frequent subgraph extraction &
matching
• Application of SUBDUE for extracting
most frequent subgraphs.
• Application of SUBDUE’s Graph Matcher
(gm) tool for matching extracted
subgraphs.
• Computes the cost of transforming one
subgraph on another subgraph.
• Cost of all transformations = 1.
21

Evaluation
• Efficacy of the system.
• Gold Standard.
• Baselines.
• Results.
22

Evaluation - Gold Standard
• Two different sources:
• Already existing links extracted from the DataHub through
the property links:<target_dataset_id>.
• Some obvious links do not exist!!
• Survey different researchers on Semantic Web and Linked
Data.
• They evaluate pairs of dataset determining the possibility of
establishing links between them.
• Each pair has been evaluated by almost three researchers. 23

The survey arises a new problem…
24Too many combinations!!!

Evaluation - Gold Standard - Survey
• Reduction of the search space inspired by [2]:
• Reduction from 7,038 evaluations to 594.
• Agreement according to Fleiss’ Kappa: 41% → Moderate agreement
(according to Landis & Coch).
25

Evaluation - Baselines
Three baselines:
Baseline #1: common ontologies. As more common ontologies have two
datasets, more chances to establish links between them.
Baseline #2: ontology ranking. As more common ontologies with similar
usage have two datasets, more chances to establish links between them.
Baseline #3: common triples. As more common triples (predicate and
object) two datasets have, more chances to establish links between
them.
27

Evaluation - Results III
• Possible over-fitting of Baseline #1.
• As can be seen in
http://apps.morelab.deusto.es/iesd2015/datasets.csv,
many of used datasets are from RKB Explorer.
• Same ontologies, same model, etc.
• Many datasets from the LOD Cloud are offline:
• No RDF dump.
• No SPARQL endpoint.
• A cleaning is needed!!
29

Conclusions & Future Work
• Precise recommendations, but improvable
recall.
• Recommendations done are OK, but they are not
enough.
• Future work:
• String similarity techniques.
• Recall improved ≳ [0.10, 0.30]
• Include generated recommendations into the
workflow.
30

Credits
Images:
http://lod-cloud.net/
https://www.flickr.com/photos/ramv/10057168055
Bibliography:
[1] Jiang, C., Coenen, F., Zito, M.: A survey of frequent subgraph mining algorithms. The
Knowledge Engineering Review 28(1), 75–105 (2013).
[2] Fetahu, B., Dietze, S., Nunes, B.P., Casanova, M.A., Taibi, D., Nejdl, W.: A scalable
approach for efficiently generating structured dataset topic profiles. In: The Semantic
Web: Trends and Challenges, pp. 519–534. Springer (2014).
[3] LANDIS, J. R. y KOCH, G. G. The measurement of observer agreement for cate- gorical data.
biometrics, pa ́ginas 159–174, 1977. 91
32

Detection of Related Semantic Datasets Based on Frequent Subgraph Mining

Empfohlen

Empfohlen

Weitere ähnliche Inhalte

Was ist angesagt?

Was ist angesagt? (20)

Andere mochten auch

Andere mochten auch (8)

Ähnlich wie Detection of Related Semantic Datasets Based on Frequent Subgraph Mining

Ähnlich wie Detection of Related Semantic Datasets Based on Frequent Subgraph Mining (20)

Kürzlich hochgeladen

Kürzlich hochgeladen (20)

Detection of Related Semantic Datasets Based on Frequent Subgraph Mining

Hinweis der Redaktion