Detection of Related Semantic Datasets Based on Frequent Subgraph Mining
Mikel Emaldi (1), Oscar Corcho (2), and Diego López-de-Ipiña (1)
(1) Deusto Institute of Technology - DeustoTech, University of Deusto, Bilbao, Spain
{m.emaldi,dipina}@deusto.es
(2) Ontology Engineering Group, Departamento de Inteligencia Artificial, Facultad de Informática, Universidad Politécnica de Madrid, Spain
ocorcho@fi.upm.es
Table of Contents
1. Introduction
2. State of the Art
3. Proposed solution
4. Evaluation
5. Conclusions
6. Current & Future Work
Evolution of the LOD Cloud
[Figures: the LOD Cloud in May 2007 and in August 2014]
Introduction
• The LOD Cloud has grown enormously.
• This growth causes difficulties when…
  • … exploring data provided by different datasets.
  • … choosing which dataset to browse.
  • … trying to avoid duplicating data that already exists.
  • … finding other datasets to interlink with our own.
The problem of selecting linking sources
[Diagram: My New Dataset; Linking Framework (Silk, Limes, ...); Links; the dataset to link against is unknown (?)]
State of the Art I
• Nikolov, A., d’Aquin, M.: Identifying relevant sources for data linking using a Semantic Web index. In: Proceedings of the Linked Data on the Web workshop, in conjunction with the 20th International World Wide Web Conference. vol. 813 (2011).
  • Extracts literals from rdfs:label, foaf:name or dc:title properties and searches for them in Sig.ma, grouping the results by source dataset.
  • The more results a source returns, the more likely it is to be linkable with the original dataset.
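The ranking step described above can be sketched in a few lines of Python. This is only an illustration: `hits` stands in for the results of a keyword search over a Semantic Web index, and both its contents and the function name `rank_sources` are hypothetical, not taken from the cited work.

```python
from collections import Counter

def rank_sources(hits):
    """Rank source datasets by how many search hits they contributed.

    `hits` is an iterable of (literal, source_dataset) pairs: one entry
    per search result returned for a label extracted from our dataset.
    """
    counts = Counter(source for _literal, source in hits)
    # most_common() orders sources by descending hit count.
    return counts.most_common()

# Hypothetical search results for three extracted labels.
hits = [
    ("Bilbao", "dbpedia"),
    ("Bilbao", "geonames"),
    ("Madrid", "dbpedia"),
    ("Tim Berners-Lee", "dbpedia"),
    ("Madrid", "geonames"),
    ("Tim Berners-Lee", "dblp"),
]
ranking = rank_sources(hits)
```

The source at the top of `ranking` is the first candidate to try in the linking framework.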
State of the Art II
• Leme, L.A.P.P., Lopes, G.R., Nunes, B.P., Casanova, M.A., Dietze, S.: Identifying candidate datasets for data interlinking. In: Proceedings of the 13th International Conference on Web Engineering. pp. 354–366. Springer (2013)
  • A Naive Bayes classifier ranks related datasets according to the correlations among them, which reduces the search space.
  • These correlations are based on existing links.
State of the Art III
• Lopes, G.R., Leme, L.A.P.P., Nunes, B.P., Casanova, M.A., Dietze, S.: Recommending tripleset interlinking through a social network approach. In: Proceedings of the 14th International Conference on Web Information Systems Engineering. pp. 149–161. Springer (2013)
  • They use existing links to establish new links among datasets.
Proposed Solution I
• Our goals:
  • Do not use literals, because they are a scarce resource.
  • Do not use existing links.
  • Try to solve the cold-start problem.
Proposed Solution II
• A solution based on the graph structure of
the datasets.
• Hypothesis:
Given a set of datasets from the Semantic Web,
candidate datasets to be interlinked can be found
through the comparison of their structures.
But...
...some datasets are incredibly large!
Proposed Solution III
• http://datahub.io/dataset/nytimes-linked-open-data: more than 345,000 triples
• http://datahub.io/dataset/rkb-explorer-acm: more than 12,000,000 triples
• http://datahub.io/dataset/rkb-explorer-dblp: more than 24,000,000 triples
Proposed Solution IV
• Extract the most frequent subgraph from each graph.
  • This summarizes the graph.
  • It eases the matching task among graphs, because the subgraphs are smaller.
• The FSM task only has to be launched once per graph (until the graph is modified).
Proposed Solution - Frequent Subgraph Mining I
• Main objective: to extract the most frequent subgraphs from a single graph or a set of graphs.
• According to the state of the art [1], four characteristics define an FSM tool:
  • Single graphs vs. transactions: whether it extracts the most frequent subgraphs from a single graph or from a set of graphs.
  • Directed graphs: whether it can deal with directed graphs.
  • Labeled vertices and/or edges: whether it can deal with graphs that have labels on their vertices and/or edges.
Proposed Solution - Frequent Subgraph Mining II
Solution       Single Graph / Transactions   Directed Graphs   Labeled Vertices   Labeled Edges
Subdue         Single Graph                  ✔                 ✔                  ✔
AGM            Transactions                  ✔                 ✔                  ✔
FSG            Transactions                  ✖                 ✔                  ✔
DP Mine        Transactions                  ✔                 ✔                  ✔
MoFA           Transactions                  ✖                 ✔                  ✖
gSpan          Transactions                  ✖                 ✔                  ✔
FFSM           Transactions                  ✖                 ✔                  ✔
GREW           Single Graph                  ✖                 ✔                  ✔
Gaston         Single Graph                  ✖                 ✔                  ✔
gApprox        Transactions                  ✖                 ✖                  ✖
(h/v)Sigram    Single Graph                  ✖                 ✔                  ✔
Proposed Solution - SUBDUE
• SUBDUE defines the most frequent subgraph as the one that, once each of its instances is replaced by a single node, compresses the original graph the most.
• G → original graph and S → candidate subgraph. [Formula image not preserved.]
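As a hedged sketch of the compression measure referred to on this slide: the SUBDUE literature usually states it in minimum description length (MDL) terms. The notation below, including the name Compression and the function DL, is an assumption and may differ from what the slide showed.

```latex
\[
\mathrm{Compression}(S, G) = \frac{DL(S) + DL(G \mid S)}{DL(G)}
\]
```

Here DL(G) is the description length of the original graph, DL(S) that of the candidate subgraph, and DL(G | S) that of G after every instance of S has been replaced by a single node. Under this reading, the best subgraph is the one that minimizes the ratio.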
Proposed Solution - From RDF to SUBDUE
• The RDF graphs undergo some transformations before FSM is applied.
  • Transformation I: replacing the URIs of subjects with the rdf:type of each resource.
  • Transformation II: removal of external links.
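The two transformations can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: triples are plain Python string tuples, a URI is "external" when it does not start with the dataset's base URI, objects that are local typed resources are relabeled like subjects, and a resource with several types keeps only the last one seen. The function name `to_typed_graph` and the example data are hypothetical.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

def to_typed_graph(triples, base):
    """Return labeled edges (subject_type, predicate, object) for FSM.

    Transformation I:  resource URIs are replaced by their rdf:type.
    Transformation II: triples whose object is a URI outside `base`
                       (an external link) are dropped.
    """
    # Assumption: one type per resource; later declarations overwrite earlier ones.
    types = {s: o for s, p, o in triples if p == RDF_TYPE}
    edges = []
    for s, p, o in triples:
        if p == RDF_TYPE:
            continue  # already consumed: the type becomes the node label
        is_uri = o.startswith(("http://", "https://"))
        if is_uri and not o.startswith(base):
            continue  # Transformation II: external link
        # Transformation I (untyped nodes and literals keep their value)
        edges.append((types.get(s, s), p, types.get(o, o)))
    return edges

# Hypothetical dataset published under http://example.org/
base = "http://example.org/"
triples = [
    (base + "alice", RDF_TYPE, "http://xmlns.com/foaf/0.1/Person"),
    (base + "paper1", RDF_TYPE, "http://purl.org/ontology/bibo/Article"),
    (base + "paper1", "http://purl.org/dc/terms/creator", base + "alice"),
    (base + "alice", "http://www.w3.org/2002/07/owl#sameAs",
     "http://dbpedia.org/resource/Alice"),
    (base + "alice", "http://xmlns.com/foaf/0.1/name", "Alice"),
]
edges = to_typed_graph(triples, base)
```

After the transformation, every article-creator pair in the dataset yields the same labeled edge, which is what lets a frequent pattern emerge.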
Proposed Solution - From RDF to SUBDUE II and III
[Figures not preserved in this transcript.]
Proposed Solution - Most frequent subgraph extraction & matching
• SUBDUE is applied to extract the most frequent subgraphs.
• SUBDUE’s Graph Matcher (gm) tool is applied to match the extracted subgraphs.
  • It computes the cost of transforming one subgraph into another.
  • Every transformation has a cost of 1.
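To make the notion of transformation cost concrete, here is a brute-force Python sketch of a unit-cost graph edit distance. It is not the algorithm implemented by gm; it simply tries every vertex assignment, so it is only usable on tiny graphs. The graph representation (a dict of vertex labels plus a dict of labeled directed edges, at most one edge per ordered vertex pair, string labels) and the function name `transformation_cost` are assumptions made for the example.

```python
from itertools import permutations

def transformation_cost(g1, g2):
    """Minimum number of unit-cost edits that turn g1 into g2.

    A graph is (vertices, edges): `vertices` maps id -> label and
    `edges` maps (source_id, target_id) -> label.
    """
    v1, e1 = g1
    v2, e2 = g2
    n = max(len(v1), len(v2))
    ids1 = list(v1) + [None] * (n - len(v1))  # None = vertex to insert
    ids2 = list(v2) + [None] * (n - len(v2))  # None = vertex to delete
    best = None
    for perm in permutations(ids2):
        cost = 0
        mapping = {}
        for a, b in zip(ids1, perm):
            if a is None or b is None:
                cost += 1                     # vertex insertion or deletion
            else:
                mapping[a] = b
                cost += v1[a] != v2[b]        # vertex label substitution
        # Re-express g1's edges with g2's vertex ids; an edge touching a
        # deleted vertex has to be deleted as well (one edit each).
        mapped = {}
        for (s, t), label in e1.items():
            if s in mapping and t in mapping:
                mapped[(mapping[s], mapping[t])] = label
            else:
                cost += 1
        for key in mapped.keys() | e2.keys():
            # Present on one side only: insert/delete. Both sides: relabel if different.
            cost += mapped.get(key) != e2.get(key)
        if best is None or cost < best:
            best = cost
    return best

# Hypothetical subgraphs extracted from three datasets.
g_a = ({0: "Person", 1: "Person"}, {(0, 1): "knows"})
g_b = ({0: "Person", 1: "Organization"}, {(0, 1): "knows"})
g_c = ({0: "Person", 1: "Person", 2: "Paper"},
       {(0, 1): "knows", (0, 2): "author"})
cost_ab = transformation_cost(g_a, g_b)
cost_ac = transformation_cost(g_a, g_c)
```

A lower cost means the two summaries are structurally closer, so the corresponding datasets are better interlinking candidates.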
Evaluation
• Efficacy of the system.
• Gold Standard.
• Baselines.
• Results.
Evaluation - Gold Standard
• Two different sources:
  • Existing links extracted from the DataHub through the property links:<target_dataset_id>.
    • Some obvious links do not exist!
  • A survey of researchers working on the Semantic Web and Linked Data.
    • They evaluated pairs of datasets, judging whether links could be established between them.
    • Each pair has been evaluated by almost three researchers.
The survey raises a new problem…
Too many combinations!
Evaluation - Gold Standard - Survey
• Reduction of the search space inspired by [2]:
  • From 7,038 evaluations to 594.
• Agreement according to Fleiss’ kappa: 41% → moderate agreement (according to Landis & Koch [3]).
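For reference, Fleiss' kappa and the Landis & Koch reading of it can be computed as in the Python sketch below. It assumes every item is rated by exactly the same number of raters, which is a simplification with respect to the survey described here; the function names and the toy rating table are hypothetical.

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for a table where ratings[i][j] is the number of
    raters who assigned item i to category j.

    Assumes the same number of raters (at least 2) for every item.
    """
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    n_cats = len(ratings[0])
    # Mean observed agreement over all items.
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ) / n_items
    # Agreement expected by chance, from the category proportions.
    p_e = sum(
        (sum(row[j] for row in ratings) / (n_items * n_raters)) ** 2
        for j in range(n_cats)
    )
    return (p_bar - p_e) / (1 - p_e)

def landis_koch(kappa):
    """Verbal label for a kappa value on the Landis & Koch scale."""
    if kappa < 0:
        return "poor"
    for upper, label in [(0.20, "slight"), (0.40, "fair"), (0.60, "moderate"),
                         (0.80, "substantial")]:
        if kappa <= upper:
            return label
    return "almost perfect"

# Toy table: 4 dataset pairs, 3 raters each, categories (linkable, not linkable).
toy = [[3, 0], [2, 1], [0, 3], [1, 2]]
toy_kappa = fleiss_kappa(toy)
label_for_survey = landis_koch(0.41)
```

On this scale a value of 0.41 sits at the lower edge of the "moderate" band.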
[Figure: Survey]
Evaluation - Baselines
Three baselines:
Baseline #1: common ontologies. The more ontologies two datasets have in common, the more likely it is that links can be established between them.
Baseline #2: ontology ranking. The more ontologies with similar usage two datasets have in common, the more likely it is that links can be established between them.
Baseline #3: common triples. The more triples (predicate and object) two datasets have in common, the more likely it is that links can be established between them.
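Baselines #1 and #3 reduce to set intersections, as the Python sketch below shows. It rests on assumptions not stated in the slides: an ontology is identified by the namespace of the predicates a dataset uses, triples are plain string tuples, and the scores are raw counts. The function names and the sample data are hypothetical.

```python
def namespace(uri):
    """Vocabulary namespace of a term URI: everything up to and
    including the last '#' or '/'."""
    cut = max(uri.rfind("#"), uri.rfind("/"))
    return uri[: cut + 1]

def common_ontologies(triples_a, triples_b):
    """Baseline #1: number of vocabularies used by both datasets."""
    ns_a = {namespace(p) for _s, p, _o in triples_a}
    ns_b = {namespace(p) for _s, p, _o in triples_b}
    return len(ns_a & ns_b)

def common_triples(triples_a, triples_b):
    """Baseline #3: number of (predicate, object) pairs present in both."""
    po_a = {(p, o) for _s, p, o in triples_a}
    po_b = {(p, o) for _s, p, o in triples_b}
    return len(po_a & po_b)

# Hypothetical excerpts from two datasets.
dataset_a = [
    ("ex:a1", "http://xmlns.com/foaf/0.1/name", "Alice"),
    ("ex:a1", "http://purl.org/dc/terms/subject", "http://example.org/topic/lod"),
]
dataset_b = [
    ("ex:b7", "http://xmlns.com/foaf/0.1/mbox", "mailto:bob@example.org"),
    ("ex:b7", "http://purl.org/dc/terms/subject", "http://example.org/topic/lod"),
    ("ex:b7", "http://www.w3.org/2000/01/rdf-schema#label", "Bob"),
]
shared_vocabularies = common_ontologies(dataset_a, dataset_b)
shared_po_pairs = common_triples(dataset_a, dataset_b)
```

Ranking all candidate datasets by either count gives a baseline recommendation list to compare against the subgraph-based one.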
Evaluation - Results I
[Results slide: content not preserved in this transcript.]
Evaluation - Results III
• Possible over-fitting of Baseline #1.
  • As can be seen in http://apps.morelab.deusto.es/iesd2015/datasets.csv, many of the datasets used come from RKB Explorer.
  • Same ontologies, same model, etc.
• Many datasets from the LOD Cloud are offline:
  • No RDF dump.
  • No SPARQL endpoint.
  • A cleanup is needed!
Conclusions & Future Work
• Precise recommendations, but recall can be improved.
  • The recommendations made are correct, but there are not enough of them.
• Future work:
  • String similarity techniques.
    • Recall improved ≳ [0.10, 0.30]
  • Include generated recommendations into the workflow.
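The trade-off between precision and recall mentioned above can be made concrete with a short Python sketch. Dataset pairs are treated as unordered, and the gold standard and recommendation shown here are invented for illustration; they are not results from the evaluation.

```python
def precision_recall(recommended, gold):
    """Precision and recall of recommended dataset pairs against a gold standard.

    Both arguments are collections of frozensets, so that the pair
    (a, b) equals the pair (b, a).
    """
    recommended, gold = set(recommended), set(gold)
    hits = len(recommended & gold)
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall

# Invented example: four linkable pairs, one (correct) recommendation.
gold = {
    frozenset(pair)
    for pair in [("acm", "dblp"), ("acm", "citeseer"),
                 ("dblp", "citeseer"), ("nytimes", "dbpedia")]
}
recommended = {frozenset(("dblp", "acm"))}
precision, recall = precision_recall(recommended, gold)
```

A system in this situation is precise (everything it recommends is right) but has low recall (it misses most linkable pairs).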
Credits
Images:
http://lod-cloud.net/
https://www.flickr.com/photos/ramv/10057168055
Bibliography:
[1] Jiang, C., Coenen, F., Zito, M.: A survey of frequent subgraph mining algorithms. The Knowledge Engineering Review 28(1), 75–105 (2013).
[2] Fetahu, B., Dietze, S., Nunes, B.P., Casanova, M.A., Taibi, D., Nejdl, W.: A scalable approach for efficiently generating structured dataset topic profiles. In: The Semantic Web: Trends and Challenges, pp. 519–534. Springer (2014).
[3] Landis, J.R., Koch, G.G.: The measurement of observer agreement for categorical data. Biometrics, pp. 159–174 (1977).

Weitere ähnliche Inhalte

Was ist angesagt?

Master defence 2020 - Kateryna Liubonko - Matching Red Links to Wikidata Items
 Master defence 2020 - Kateryna Liubonko - Matching Red Links to Wikidata Items Master defence 2020 - Kateryna Liubonko - Matching Red Links to Wikidata Items
Master defence 2020 - Kateryna Liubonko - Matching Red Links to Wikidata Items
Lviv Data Science Summer School
 
The eNanoMapper database for nanomaterial safety information: storage and query
The eNanoMapper database for nanomaterial safety information: storage and queryThe eNanoMapper database for nanomaterial safety information: storage and query
The eNanoMapper database for nanomaterial safety information: storage and query
Nina Jeliazkova
 

Was ist angesagt? (20)

Tutorial Data Management and workflows
Tutorial Data Management and workflowsTutorial Data Management and workflows
Tutorial Data Management and workflows
 
Improving Model Predictions via Stacking and Hyper-parameters Tuning
Improving Model Predictions via Stacking and Hyper-parameters TuningImproving Model Predictions via Stacking and Hyper-parameters Tuning
Improving Model Predictions via Stacking and Hyper-parameters Tuning
 
Tutorial on Question Answering Systems
Tutorial on Question Answering Systems Tutorial on Question Answering Systems
Tutorial on Question Answering Systems
 
Linked Open Graph: browsing multiple SPARQL entry points to build your own LO...
Linked Open Graph: browsing multiple SPARQL entry points to build your own LO...Linked Open Graph: browsing multiple SPARQL entry points to build your own LO...
Linked Open Graph: browsing multiple SPARQL entry points to build your own LO...
 
Modeling the Impact of R & Python Packages: Dependency and Contributor Networks
Modeling the Impact of R & Python Packages: Dependency and Contributor NetworksModeling the Impact of R & Python Packages: Dependency and Contributor Networks
Modeling the Impact of R & Python Packages: Dependency and Contributor Networks
 
3. Stack - Data Structures using C++ by Varsha Patil
3. Stack - Data Structures using C++ by Varsha Patil3. Stack - Data Structures using C++ by Varsha Patil
3. Stack - Data Structures using C++ by Varsha Patil
 
Master defence 2020 - Kateryna Liubonko - Matching Red Links to Wikidata Items
 Master defence 2020 - Kateryna Liubonko - Matching Red Links to Wikidata Items Master defence 2020 - Kateryna Liubonko - Matching Red Links to Wikidata Items
Master defence 2020 - Kateryna Liubonko - Matching Red Links to Wikidata Items
 
Do it on your own - From 3 to 5 Star Linked Open Data with RMLio
Do it on your own - From 3 to 5 Star Linked Open Data with RMLioDo it on your own - From 3 to 5 Star Linked Open Data with RMLio
Do it on your own - From 3 to 5 Star Linked Open Data with RMLio
 
9. Searching & Sorting - Data Structures using C++ by Varsha Patil
9. Searching & Sorting - Data Structures using C++ by Varsha Patil9. Searching & Sorting - Data Structures using C++ by Varsha Patil
9. Searching & Sorting - Data Structures using C++ by Varsha Patil
 
Semantic Interpretation of User Query for Question Answering on Interlinked Data
Semantic Interpretation of User Query for Question Answering on Interlinked DataSemantic Interpretation of User Query for Question Answering on Interlinked Data
Semantic Interpretation of User Query for Question Answering on Interlinked Data
 
Analyzing the Evolution of Vocabulary Terms and Their Impact on the LOD Cloud
Analyzing the Evolution of Vocabulary Terms and Their Impact on the LOD CloudAnalyzing the Evolution of Vocabulary Terms and Their Impact on the LOD Cloud
Analyzing the Evolution of Vocabulary Terms and Their Impact on the LOD Cloud
 
Kaggle Competitions, New Friends, New Skills and New Opportunities
Kaggle Competitions, New Friends, New Skills and New OpportunitiesKaggle Competitions, New Friends, New Skills and New Opportunities
Kaggle Competitions, New Friends, New Skills and New Opportunities
 
The eNanoMapper database for nanomaterial safety information: storage and query
The eNanoMapper database for nanomaterial safety information: storage and queryThe eNanoMapper database for nanomaterial safety information: storage and query
The eNanoMapper database for nanomaterial safety information: storage and query
 
Software Tools, Methods and Applications of Machine Learning in Functional Ma...
Software Tools, Methods and Applications of Machine Learning in Functional Ma...Software Tools, Methods and Applications of Machine Learning in Functional Ma...
Software Tools, Methods and Applications of Machine Learning in Functional Ma...
 
OpenML Tutorial ECMLPKDD 2015
OpenML Tutorial ECMLPKDD 2015OpenML Tutorial ECMLPKDD 2015
OpenML Tutorial ECMLPKDD 2015
 
Supporting Springer Nature Editors by means of Semantic Technologies
Supporting Springer Nature Editors by means of Semantic TechnologiesSupporting Springer Nature Editors by means of Semantic Technologies
Supporting Springer Nature Editors by means of Semantic Technologies
 
Link Discovery Tutorial Introduction
Link Discovery Tutorial IntroductionLink Discovery Tutorial Introduction
Link Discovery Tutorial Introduction
 
Let your data shine... with OpenRefine
Let your data shine... with OpenRefineLet your data shine... with OpenRefine
Let your data shine... with OpenRefine
 
Data dissemination and materials informatics at LBNL
Data dissemination and materials informatics at LBNLData dissemination and materials informatics at LBNL
Data dissemination and materials informatics at LBNL
 
Automatic Classification of Springer Nature Proceedings with Smart Topic Miner
Automatic Classification of Springer Nature Proceedings with Smart Topic MinerAutomatic Classification of Springer Nature Proceedings with Smart Topic Miner
Automatic Classification of Springer Nature Proceedings with Smart Topic Miner
 

Andere mochten auch (8)

Graph Theory
Graph TheoryGraph Theory
Graph Theory
 
Morphy richard oven manual
Morphy richard oven manualMorphy richard oven manual
Morphy richard oven manual
 
Survey on Frequent Pattern Mining on Graph Data - Slides
Survey on Frequent Pattern Mining on Graph Data - SlidesSurvey on Frequent Pattern Mining on Graph Data - Slides
Survey on Frequent Pattern Mining on Graph Data - Slides
 
5.1 mining data streams
5.1 mining data streams5.1 mining data streams
5.1 mining data streams
 
5.5 graph mining
5.5 graph mining5.5 graph mining
5.5 graph mining
 
Large Graph Mining
Large Graph MiningLarge Graph Mining
Large Graph Mining
 
Lect12 graph mining
Lect12 graph miningLect12 graph mining
Lect12 graph mining
 
Data Mining: Graph mining and social network analysis
Data Mining: Graph mining and social network analysisData Mining: Graph mining and social network analysis
Data Mining: Graph mining and social network analysis
 

Ähnlich wie Detection of Related Semantic Datasets Based on Frequent Subgraph Mining

Cloud Programming Models: eScience, Big Data, etc.
Cloud Programming Models: eScience, Big Data, etc.Cloud Programming Models: eScience, Big Data, etc.
Cloud Programming Models: eScience, Big Data, etc.
Alexandru Iosup
 
Solr Distributed Indexing in WalmartLabs: Presented by Shengua Wan, WalmartLabs
Solr Distributed Indexing in WalmartLabs: Presented by Shengua Wan, WalmartLabsSolr Distributed Indexing in WalmartLabs: Presented by Shengua Wan, WalmartLabs
Solr Distributed Indexing in WalmartLabs: Presented by Shengua Wan, WalmartLabs
Lucidworks
 

Ähnlich wie Detection of Related Semantic Datasets Based on Frequent Subgraph Mining (20)

Keynote at AImWD
Keynote at AImWDKeynote at AImWD
Keynote at AImWD
 
Cloud Programming Models: eScience, Big Data, etc.
Cloud Programming Models: eScience, Big Data, etc.Cloud Programming Models: eScience, Big Data, etc.
Cloud Programming Models: eScience, Big Data, etc.
 
Spark
SparkSpark
Spark
 
Big Data HPC Convergence and a bunch of other things
Big Data HPC Convergence and a bunch of other thingsBig Data HPC Convergence and a bunch of other things
Big Data HPC Convergence and a bunch of other things
 
Materials Project computation and database infrastructure
Materials Project computation and database infrastructureMaterials Project computation and database infrastructure
Materials Project computation and database infrastructure
 
LOD2: State of Play WP2 - Storing and Querying Very Large Knowledge Bases
LOD2: State of Play WP2 - Storing and Querying Very Large Knowledge BasesLOD2: State of Play WP2 - Storing and Querying Very Large Knowledge Bases
LOD2: State of Play WP2 - Storing and Querying Very Large Knowledge Bases
 
Solr Distributed Indexing in WalmartLabs: Presented by Shengua Wan, WalmartLabs
Solr Distributed Indexing in WalmartLabs: Presented by Shengua Wan, WalmartLabsSolr Distributed Indexing in WalmartLabs: Presented by Shengua Wan, WalmartLabs
Solr Distributed Indexing in WalmartLabs: Presented by Shengua Wan, WalmartLabs
 
Maintaining scholarly standards in the digital age: Publishing historical gaz...
Maintaining scholarly standards in the digital age: Publishing historical gaz...Maintaining scholarly standards in the digital age: Publishing historical gaz...
Maintaining scholarly standards in the digital age: Publishing historical gaz...
 
Semi-Supervised Classification with Graph Convolutional Networks @ICLR2017読み会
Semi-Supervised Classification with Graph Convolutional Networks @ICLR2017読み会Semi-Supervised Classification with Graph Convolutional Networks @ICLR2017読み会
Semi-Supervised Classification with Graph Convolutional Networks @ICLR2017読み会
 
Linked Data at the OU - the story so far
Linked Data at the OU - the story so farLinked Data at the OU - the story so far
Linked Data at the OU - the story so far
 
Linked Energy Data Generation
Linked Energy Data GenerationLinked Energy Data Generation
Linked Energy Data Generation
 
Hide the Stack: Toward Usable Linked Data
Hide the Stack:Toward Usable Linked DataHide the Stack:Toward Usable Linked Data
Hide the Stack: Toward Usable Linked Data
 
Linked Open Data Visualization
Linked Open Data VisualizationLinked Open Data Visualization
Linked Open Data Visualization
 
Software tools for high-throughput materials data generation and data mining
Software tools for high-throughput materials data generation and data miningSoftware tools for high-throughput materials data generation and data mining
Software tools for high-throughput materials data generation and data mining
 
RDF-Gen: Generating RDF from streaming and archival data
RDF-Gen: Generating RDF from streaming and archival dataRDF-Gen: Generating RDF from streaming and archival data
RDF-Gen: Generating RDF from streaming and archival data
 
Better Hackathon 2020 - Fraunhofer IAIS - Semantic geo-clustering with SANSA
Better Hackathon 2020 - Fraunhofer IAIS - Semantic geo-clustering with SANSABetter Hackathon 2020 - Fraunhofer IAIS - Semantic geo-clustering with SANSA
Better Hackathon 2020 - Fraunhofer IAIS - Semantic geo-clustering with SANSA
 
What is a Data Commons and Why Should You Care?
What is a Data Commons and Why Should You Care? What is a Data Commons and Why Should You Care?
What is a Data Commons and Why Should You Care?
 
Large scale computing
Large scale computing Large scale computing
Large scale computing
 
Semantic Similarity and Selection of Resources Published According to Linked ...
Semantic Similarity and Selection of Resources Published According to Linked ...Semantic Similarity and Selection of Resources Published According to Linked ...
Semantic Similarity and Selection of Resources Published According to Linked ...
 
Lecture_2_Stats.pdf
Lecture_2_Stats.pdfLecture_2_Stats.pdf
Lecture_2_Stats.pdf
 

Kürzlich hochgeladen

development of diagnostic enzyme assay to detect leuser virus
development of diagnostic enzyme assay to detect leuser virusdevelopment of diagnostic enzyme assay to detect leuser virus
development of diagnostic enzyme assay to detect leuser virus
NazaninKarimi6
 
CYTOGENETIC MAP................ ppt.pptx
CYTOGENETIC MAP................ ppt.pptxCYTOGENETIC MAP................ ppt.pptx
CYTOGENETIC MAP................ ppt.pptx
Silpa
 
LUNULARIA -features, morphology, anatomy ,reproduction etc.
LUNULARIA -features, morphology, anatomy ,reproduction etc.LUNULARIA -features, morphology, anatomy ,reproduction etc.
LUNULARIA -features, morphology, anatomy ,reproduction etc.
Silpa
 
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bAsymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Sérgio Sacani
 
Cyathodium bryophyte: morphology, anatomy, reproduction etc.
Cyathodium bryophyte: morphology, anatomy, reproduction etc.Cyathodium bryophyte: morphology, anatomy, reproduction etc.
Cyathodium bryophyte: morphology, anatomy, reproduction etc.
Silpa
 
POGONATUM : morphology, anatomy, reproduction etc.
POGONATUM : morphology, anatomy, reproduction etc.POGONATUM : morphology, anatomy, reproduction etc.
POGONATUM : morphology, anatomy, reproduction etc.
Silpa
 
biology HL practice questions IB BIOLOGY
biology HL practice questions IB BIOLOGYbiology HL practice questions IB BIOLOGY
biology HL practice questions IB BIOLOGY
1301aanya
 
Digital Dentistry.Digital Dentistryvv.pptx
Digital Dentistry.Digital Dentistryvv.pptxDigital Dentistry.Digital Dentistryvv.pptx
Digital Dentistry.Digital Dentistryvv.pptx
MohamedFarag457087
 
Reboulia: features, anatomy, morphology etc.
Reboulia: features, anatomy, morphology etc.Reboulia: features, anatomy, morphology etc.
Reboulia: features, anatomy, morphology etc.
Silpa
 

Kürzlich hochgeladen (20)

Chemistry 5th semester paper 1st Notes.pdf
Chemistry 5th semester paper 1st Notes.pdfChemistry 5th semester paper 1st Notes.pdf
Chemistry 5th semester paper 1st Notes.pdf
 
development of diagnostic enzyme assay to detect leuser virus
development of diagnostic enzyme assay to detect leuser virusdevelopment of diagnostic enzyme assay to detect leuser virus
development of diagnostic enzyme assay to detect leuser virus
 
PATNA CALL GIRLS 8617370543 LOW PRICE ESCORT SERVICE
PATNA CALL GIRLS 8617370543 LOW PRICE ESCORT SERVICEPATNA CALL GIRLS 8617370543 LOW PRICE ESCORT SERVICE
PATNA CALL GIRLS 8617370543 LOW PRICE ESCORT SERVICE
 
CYTOGENETIC MAP................ ppt.pptx
CYTOGENETIC MAP................ ppt.pptxCYTOGENETIC MAP................ ppt.pptx
CYTOGENETIC MAP................ ppt.pptx
 
LUNULARIA -features, morphology, anatomy ,reproduction etc.
LUNULARIA -features, morphology, anatomy ,reproduction etc.LUNULARIA -features, morphology, anatomy ,reproduction etc.
LUNULARIA -features, morphology, anatomy ,reproduction etc.
 
Use of mutants in understanding seedling development.pptx
Use of mutants in understanding seedling development.pptxUse of mutants in understanding seedling development.pptx
Use of mutants in understanding seedling development.pptx
 
Human & Veterinary Respiratory Physilogy_DR.E.Muralinath_Associate Professor....
Human & Veterinary Respiratory Physilogy_DR.E.Muralinath_Associate Professor....Human & Veterinary Respiratory Physilogy_DR.E.Muralinath_Associate Professor....
Human & Veterinary Respiratory Physilogy_DR.E.Muralinath_Associate Professor....
 
Atp synthase , Atp synthase complex 1 to 4.
Atp synthase , Atp synthase complex 1 to 4.Atp synthase , Atp synthase complex 1 to 4.
Atp synthase , Atp synthase complex 1 to 4.
 
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bAsymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
 
Gwalior ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Gwalior ESCORT SERVICE❤CALL GIRL
Gwalior ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Gwalior ESCORT SERVICE❤CALL GIRLGwalior ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Gwalior ESCORT SERVICE❤CALL GIRL
Gwalior ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Gwalior ESCORT SERVICE❤CALL GIRL
 
Selaginella: features, morphology ,anatomy and reproduction.
Selaginella: features, morphology ,anatomy and reproduction.Selaginella: features, morphology ,anatomy and reproduction.
Selaginella: features, morphology ,anatomy and reproduction.
 
Cyathodium bryophyte: morphology, anatomy, reproduction etc.
Cyathodium bryophyte: morphology, anatomy, reproduction etc.Cyathodium bryophyte: morphology, anatomy, reproduction etc.
Cyathodium bryophyte: morphology, anatomy, reproduction etc.
 
POGONATUM : morphology, anatomy, reproduction etc.
POGONATUM : morphology, anatomy, reproduction etc.POGONATUM : morphology, anatomy, reproduction etc.
POGONATUM : morphology, anatomy, reproduction etc.
 
biology HL practice questions IB BIOLOGY
biology HL practice questions IB BIOLOGYbiology HL practice questions IB BIOLOGY
biology HL practice questions IB BIOLOGY
 
Cyanide resistant respiration pathway.pptx
Cyanide resistant respiration pathway.pptxCyanide resistant respiration pathway.pptx
Cyanide resistant respiration pathway.pptx
 
GBSN - Microbiology (Unit 3)Defense Mechanism of the body
GBSN - Microbiology (Unit 3)Defense Mechanism of the body GBSN - Microbiology (Unit 3)Defense Mechanism of the body
GBSN - Microbiology (Unit 3)Defense Mechanism of the body
 
Molecular markers- RFLP, RAPD, AFLP, SNP etc.
Molecular markers- RFLP, RAPD, AFLP, SNP etc.Molecular markers- RFLP, RAPD, AFLP, SNP etc.
Molecular markers- RFLP, RAPD, AFLP, SNP etc.
 
Digital Dentistry.Digital Dentistryvv.pptx
Digital Dentistry.Digital Dentistryvv.pptxDigital Dentistry.Digital Dentistryvv.pptx
Digital Dentistry.Digital Dentistryvv.pptx
 
GBSN - Biochemistry (Unit 2) Basic concept of organic chemistry
GBSN - Biochemistry (Unit 2) Basic concept of organic chemistry GBSN - Biochemistry (Unit 2) Basic concept of organic chemistry
GBSN - Biochemistry (Unit 2) Basic concept of organic chemistry
 
Reboulia: features, anatomy, morphology etc.
Reboulia: features, anatomy, morphology etc.Reboulia: features, anatomy, morphology etc.
Reboulia: features, anatomy, morphology etc.
 

Detection of Related Semantic Datasets Based on Frequent Subgraph Mining

  • 1. Detection of Related Semantic Datasets Based on Frequent Subgraph Mining Mikel Emaldi1, Oscar Corcho2, and Diego López-de-Ipiña1 1 Deusto Institute of Technology - DeustoTech, University of Deusto, Bilbao, Spain {m.emaldi,dipina}@deusto.es 2 Ontology Engineering Group, Departamento de Inteligencia Artificial, Facultad de Informática, Universidad Politécnica de Madrid, Spain ocorcho@fi.upm.es
  • 2. Table of Contents 1. Introduction 2. State of the Art 3. Proposed solution 4. Evaluation 5. Conclusions 6. Current & Future Work 2
  • 3. May, 2007 Evolution of the LOD Cloud 3
  • 4. August, 2014 Evolution of the LOD Cloud 4
  • 5. Introduction • The LOD Cloud has grown enormously. • Difficulties when… • … exploring data provided by different datasets. • … selecting from which dataset we want to browse data. • … trying to avoid overlapping existing data. • … finding other datasets for interlinking our dataset. 5
  • 6. My New Dataset Linking Framework ● Silk ● Limes ● ... Links The problem of selecting linking sources ? 6
  • 7. State of the Art I • Nikolov, A., d’Aquin, M.: Identifying relevant sources for data linking using a Semantic Web index. In: Proceedings of the Linked Data on the Web workshop in conjunction with the 20th international World Wide Web conference. vol. 813 (2011). • Extract literals from rdfs:label, foaf:name or dc:title properties and search them in Sig.ma, grouping the results by source dataset. • More results a source has, more chances to be linked with the original dataset. 7
  • 8. State of the Art II • Leme, L.A.P.P., Lopes, G.R., Nunes, B.P., Casanova, M.A., Dietze, S.: Identifying candidate datasets for data interlinking. In: Proceedings of the 13th International Conference on Web Engineering. pp. 354– 366. Springer (2013) • Naive Bayes classifier for establishing a ranking of related datasets based on correlations among them for reducing the search space. • These correlations are based on existing links. 8
  • 9. State of the Art III • Lopes, G.R., Leme, L.A.P.P., Nunes, B.P., Casanova, M.A., Dietze, S.: Recommending tripleset interlinking through a social network approach. In: Proceedings of the 14 th international conference on Web Information Systems Engineering. pp. 149–161. Springer (2013) • They use already existing links for establishing new links among datasets. 9
  • 10. Proposed Solution I • Our goals: • Do not use literals because they are a scarce resource. • Do not use already existing links • Trying to solve the cold start problem. 10
  • 11. Proposed Solution II • A solution based on the graph structure of the datasets. • Hypothesis: Given a set of datasets from the Semantic Web, candidate datasets to be interlinked can be found through the comparison of their structures. 11
  • 12. But... ...some datasets are incredibly large!!! 12
  • 13. Proposed Solution III http://datahub.io/dataset/nytimes-linked-open-data triples > 345,000 http://datahub.io/dataset/rkb-explorer-acm triples > 12,000,000 http://datahub.io/dataset/rkb-explorer-dblp triples > 24,000,000 13
  • 14. Proposed Solution IV • Extract the most frequent subgraph from each graph. • Summarizing the graph. • Easing the matching task among graphs. • Subgraphs are smaller. • The FSM task only has to be launched once per graph (until the graph is modified). 14
  • 15. Proposed Solution - Frequent Subgraph Mining I • Main objective: to extract the most frequent subgraphs from a single graph or a set of graphs. • According to the state of the art [1], four characteristics define a FSM tool: • Single graphs vs. transactions: ability for extracting the most frequent subgraphs from a single graph or from a set of graphs. • Directed graphs: ability for dealing with directed graphs. • Labeled vertices and/or edges: ability for dealing with graphs including labels in their vertices and/or edges. 15
  • 16. Proposed Solution - Frequent Subgraph Mining II Solution Single Graph / Transactions Directed Graphs Labeled Vertex Labeled Edges Subdue Single Graph ✔ ✔ ✔ AGM Transactions ✔ ✔ ✔ FSG Transactions ✖ ✔ ✔ DP Mine Transactions ✔ ✔ ✔ MoFA Transactions ✖ ✔ ✖ gSpan Transactions ✖ ✔ ✔ FFSM Transactions ✖ ✔ ✔ GREW Single Graph ✖ ✔ ✔ Gaston Single Graph ✖ ✔ ✔ gApprox Transactions ✖ ✖ ✖ (h/v)Sigram Single Graph ✖ ✔ ✔ 16
  • 17. Proposed Solution - SUBDUE • SUBDUE defines the most frequent subgraph as the subgraph that once replaced by a single node, most compresses the original graph. • G → original graph and S → candidate subgraph: • Where: 17
  • 18. Proposed Solution - From RDF to SUBDUE • Some transformations in the RDF graphs before applying FSM. • Transformation I: replacing URIs from subjects by the rdf:type of each resource. • Transformation II: removal of external links. 18
  • 19. Proposed Solution - From RDF to SUBDUE II 19
  • 20. Proposed Solution - From RDF to SUBDUE III 20
  • 21. Proposed Solution - Most frequent subgraph extraction & matching • Application of SUBDUE for extracting most frequent subgraphs. • Application of SUBDUE’s Graph Matcher (gm) tool for matching extracted subgraphs. • Computes the cost of transforming one subgraph on another subgraph. • Cost of all transformations = 1. 21
  • 22. Evaluation • Efficacy of the system. • Gold Standard. • Baselines. • Results. 22
  • 23. Evaluation - Gold Standard • Two different sources: • Already existing links extracted from the DataHub through the property links:<target_dataset_id>. • Some obvious links do not exist!! • Survey different researchers on Semantic Web and Linked Data. • They evaluate pairs of dataset determining the possibility of establishing links between them. • Each pair has been evaluated by almost three researchers. 23
  • 24. The survey arises a new problem… 24Too many combinations!!!
  • 25. Evaluation - Gold Standard - Survey • Reduction of the search space inspired by [2]: • Reduction from 7,038 evaluations to 594. • Agreement according to Fleiss’ Kappa: 41% → Moderate agreement (according to Landis & Coch). 25
  • 27. Evaluation - Baselines Three baselines: Baseline #1: common ontologies. The more ontologies two datasets have in common, the higher the chances of establishing links between them. Baseline #2: ontology ranking. The more ontologies with similar usage two datasets have in common, the higher the chances of establishing links between them. Baseline #3: common triples. The more triples (predicate and object) two datasets have in common, the higher the chances of establishing links between them.
  • 29. Evaluation - Results III • Possible over-fitting of Baseline #1. • As can be seen in http://apps.morelab.deusto.es/iesd2015/datasets.csv, many of the datasets used are from RKB Explorer. • Same ontologies, same model, etc. • Many datasets from the LOD Cloud are offline: • No RDF dump. • No SPARQL endpoint. • A cleaning is needed!!
  • 30. Conclusions & Future Work • Precise recommendations, but improvable recall. • The recommendations made are correct, but there are not enough of them. • Future work: • String similarity techniques. • Recall improved ≳ [0.10, 0.30] • Include generated recommendations into the workflow.
  • 31. Detection of Related Semantic Datasets Based on Frequent Subgraph Mining Mikel Emaldi1, Oscar Corcho2, and Diego López-de-Ipiña1 1 Deusto Institute of Technology - DeustoTech, University of Deusto, Bilbao, Spain {m.emaldi,dipina}@deusto.es 2 Ontology Engineering Group, Departamento de Inteligencia Artificial, Facultad de Informática, Universidad Politécnica de Madrid, Spain ocorcho@fi.upm.es
  • 32. Credits Images: http://lod-cloud.net/ https://www.flickr.com/photos/ramv/10057168055 Bibliography: [1] Jiang, C., Coenen, F., Zito, M.: A survey of frequent subgraph mining algorithms. The Knowledge Engineering Review 28(1), 75–105 (2013). [2] Fetahu, B., Dietze, S., Nunes, B.P., Casanova, M.A., Taibi, D., Nejdl, W.: A scalable approach for efficiently generating structured dataset topic profiles. In: The Semantic Web: Trends and Challenges, pp. 519–534. Springer (2014). [3] Landis, J.R., Koch, G.G.: The measurement of observer agreement for categorical data. Biometrics, pp. 159–174 (1977).
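
Transformations I and II from slide 18 can be illustrated with a minimal sketch. It is not the authors' implementation: triples are modelled as plain Python string tuples, and the names `generalize`, `is_uri` and `local_hosts`, as well as the `example.org` URIs in the usage note, are hypothetical. It also assumes that a typed resource is replaced by its rdf:type wherever it appears (as subject or object) and that an "external link" is a URI object hosted outside the dataset's own hosts.

```python
from urllib.parse import urlparse

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def is_uri(term):
    """Assumption: any http(s) string is a URI; everything else is a literal."""
    return term.startswith(("http://", "https://"))


def generalize(triples, local_hosts):
    """Apply Transformations I and II to a list of (s, p, o) string triples."""
    # Map every typed resource to its first declared rdf:type.
    types = {}
    for s, p, o in triples:
        if p == RDF_TYPE:
            types.setdefault(s, o)

    result = []
    for s, p, o in triples:
        if p == RDF_TYPE:
            continue  # the type is now carried by the node label itself
        # Transformation II: drop triples pointing to URIs hosted elsewhere.
        if is_uri(o) and o not in types and urlparse(o).netloc not in local_hosts:
            continue
        # Transformation I: label resources by their rdf:type, not their URI.
        result.append((types.get(s, s), p, types.get(o, o)))
    return result
```

Given a person typed as `foaf:Person` who made a paper typed as `bibo:Article`, has a `foaf:name` literal, and an `owl:sameAs` link to DBpedia, calling `generalize(triples, {"example.org"})` keeps a `(Person, made, Article)` edge and the name literal, and drops the DBpedia link. Repeated structures of this kind are what a frequent subgraph miner can then pick up, since the unique URIs no longer make every node distinct.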

Editor's Notes

  1. Good afternoon. My name is Mikel Emaldi, from the University of Deusto in Bilbao, and I’m going to present our work, “Detection of Related Semantic Datasets Based on Frequent Subgraph Mining”.
  2. This is the outline of the presentation; as you can see, it is a very typical outline.
  3. First, I’m going to introduce the problem that we have tried to solve in this work. As you know, this is the first snapshot of the Linked Open Data Cloud, from its early years. As can be seen, there were few datasets, and the cloud could be explored with the naked eye; you didn’t need much help to understand the datasets in this version of the LOD Cloud.
  4. But since 2007 the LOD Cloud has grown enormously, and we hope that it is going to keep growing. Today you need some help to explore the whole cloud; you can’t look for the datasets you want with the naked eye, and it is harder to search for data inside the cloud.
  5. As I have said, the current size of the cloud is a problem when we want to explore data provided by different datasets; when we have to select which dataset to browse data from; when we are trying to avoid overlapping our new data with already existing data; or, and this is the focus of our work, when we want to find datasets related to ours in order to interlink our dataset with others that already exist in the LOD Cloud.
  6. This is a representation of the problem: you have created a new dataset and you want to link it with others using a linking framework such as Silk, Limes, or whatever system you prefer; but exploring the LOD Cloud to find those others is now very hard. It is going to take you a lot of time and effort.
  7. To the best of our knowledge, there are few works that deal with this problem. One of them is this work by Andriy Nikolov. In this work they take a graph and extract literals from some of its properties, for example rdfs:label, foaf:name or dc:title, and they search for these literals in Sig.ma. After that, they group the results returned by Sig.ma by source. The more results a source has, the higher its chances of being linked with the original dataset.
  8. Another work is the one by Luiz André Paes Leme. In this work they use a Naive Bayes classifier to establish a ranking of related datasets, based on correlations among them, in order to reduce the search space.
  9. The last one that we have found is the work by Giseli Rabello Lopes. Explained briefly, in this work they use already existing links to establish new links among datasets.
  10. After analyzing the state of the art, we wanted to propose an alternative solution: one that uses neither literals nor already existing links, as the works shown previously do. Why? In the case of literals, because they are a scarce resource; literals are usually not long enough for a complex analysis. You can find a lot of n-grams (for example, name “Mikel”, surname “Emaldi”), but you can’t usually find long text in the literal properties of RDF graphs. And why did we want to avoid the already existing links? Because, maybe not in the LOD Cloud but in another scenario, we may have to deal with a set of datasets without any links. And if there are no links, we can’t apply a solution based on already existing links. We have tried to solve the cold start problem.
  11. And finally, we thought about a solution based on the graph structure of the datasets. Our main hypothesis is that, given a set of datasets from the Semantic Web, candidate datasets to be interlinked can be found by comparing their structures. Our idea was that by analyzing the structure of the graphs we could find similarities among them, and some clues indicating that these graphs could be interlinked in some way.
  12. But this solution raises a big problem: some datasets are too large for a direct comparison of their structures.
  13. For example… A direct match is impossible because the graphs formed by these datasets are too large.
  14. To deal with this problem, we have proposed extracting the most frequent subgraph from each graph. This way we can summarize the graph and ease the matching task, because the subgraphs are smaller and can be matched easily. You may be wondering: “well, the frequent subgraph mining task is computationally expensive, and you still have to process the same big graphs”. But the frequent subgraph mining task only has to be launched once per graph, and the extracted subgraph remains valid for matching against all the new graphs introduced in the LOD Cloud.
  15. Some basic notes about frequent subgraph mining. It is a technique widely used in biology and chemistry, and it consists of extracting the most frequent subgraphs from a single graph or from a set of graphs. According to the state of the art, there are four characteristics that distinguish the different FSM tools. Some tools extract the most frequent subgraphs from a single graph, while others are designed to extract them from a set of different and independent graphs. In addition, there are tools that cannot deal with directed graphs; dealing with directed graphs is computationally more expensive. Finally, some tools cannot deal with graphs that contain labels in their vertices or edges.
  16. According to the RDF model, the graphs are directed, and they have labels in their vertices and edges. For this reason, among all the tools that we have analysed, the only candidates to be used in this work are SUBDUE and DPMine. But, because our proposed solution deals with single graphs, we have chosen SUBDUE for extracting the most frequent subgraphs.
  17. SUBDUE works this way: it defines the most frequent subgraph as the subgraph that, once replaced by a single node, most compresses the original graph. This formula shows the total compression rate, where size(G) is the size of the whole graph, size(S) is the size of the candidate subgraph being evaluated, and size(G|S) is the size of the original graph once the candidate subgraph is replaced by a single node.
  18. Returning to our solution: before applying SUBDUE to the RDF graphs, we apply some transformations to them. The first transformation consists of replacing the subject URI of each resource with its rdf:type. We decided to do this because each resource in a Linked Dataset has its own unique URI, so these nodes could never form part of a most frequent subgraph: there are as many URIs as there are resources in the graph. For example, given this example and its graph representation [CHANGE SLIDE] The second transformation concerns the external links included in a graph. Although we know that these external links could provide useful information for our task, we have discarded them for two reasons. The first is that, in addition to the computational cost of processing these external links, the cost and time spent retrieving them over the network would have to be added to the total cost of processing a graph. The second is that one of our goals was to solve the cold start problem, so we preferred to work in a scenario without external links.
  19. … we can see how the original graph has been transformed into a new graph that describes it in a more generic way. In the original graph we can see that it describes Tim Berners-Lee and his publication “The Semantic Web”, and in the transformed graph we can see that it describes a person and one of their publications. [GO BACK].
  20. Here we can see the resulting example graph without the external links.
  21. The next step is to apply SUBDUE to extract the most frequent subgraphs. Here you can see an example of a SUBDUE file: each vertex has a unique ID, and the edges are defined immediately afterwards. Because of the size of some graphs, these files are split into chunks and analysed incrementally; this is a feature of SUBDUE. Once the most frequent subgraphs are extracted, they are matched with SUBDUE’s Graph Matcher tool. This tool performs a very simple matching: it computes the cost of transforming one subgraph into another. In this case, all operations (creation, modification and deletion of a vertex or an edge) have the same transformation cost. Finally, the similarity ratio between datasets is calculated.
  22. Now that our solution has been explained, it’s time to show the evaluation results. For this evaluation we have developed a gold standard and a set of baselines.
  23. To develop the Gold Standard, two sources have been used. The first one is the profile of each dataset in the DataHub. One of the requirements to be in the LOD Cloud is a description of the links that a dataset has. Through the DataHub API we have retrieved these links to establish which datasets are linked together. But there was a problem: some obvious links do not exist, maybe because the publisher of the dataset didn’t review it to look for new related datasets; we don’t know. This issue produced a lot of false positives in our system. To solve it, we surveyed different researchers working on Semantic Web and Linked Data. These researchers were asked about the possibility of establishing links between different pairs of datasets.
  24. But, again, the same problem: there were a lot of combinations to be evaluated by these researchers. The truth is that we have not used all the datasets from the LOD Cloud to evaluate our system, but there were still a lot of combinations.
  25. To solve this problem, and based on another work, we tried to reduce the set of links to be evaluated. We applied the evidence depicted in this figure: we assume that if two datasets share links to the same other datasets, they could be linked. Here we can see that datasets A, B, C and D share links with datasets I, F, E, H and G, so we can conclude that there is a high chance of establishing links among these datasets. Applying this method, we reduced the number of evaluations from more than 7,000 to around 600. Once the survey was completed, an agreement of 41% was achieved, which is a moderate agreement according to [3]. Maybe a higher agreement could be achieved with a more exhaustive definition of the question. Some of the surveyed researchers said that they sometimes had doubts about a pair of datasets because they felt they were considering very complex relationships between them.
  26. Here you can see the survey. It’s a very simple web page that shows the title and a description of the dataset, along with an example resource and its DataHub URL (in case they want more information).
  27. These baselines are simple solutions to the proposed problem that allow us to measure how much better our solution is. The first baseline is based on the assumption that the more ontologies two datasets share, the more similar they are and the higher their chances of being interlinked. This concept is represented through this score, where N1 and N2 are the sets of ontologies from Dataset1 and Dataset2, respectively. The second baseline computes the similarity between a pair of datasets based on the ontologies used to describe them, but it also takes into account the usage ranking of each ontology. This usage is computed as the sum of the properties and classes of an ontology used in a dataset. The similarity between rankings is computed through Kendall’s Tau, where Tau1 and Tau2 are the rankings and n is the size of the rankings. According to Kendall’s Tau, if the i-th and j-th elements of the rankings are the same element, K returns 0; otherwise it returns 1. Finally, the third baseline assumes that the more triples two datasets have in common, the higher the chances of establishing links between them. In this baseline the subjects of the resources have been removed because, according to the Linked Data paradigm, they are unique.
  28. Here the results of the evaluation are shown. As can be seen, the precision of the proposed solution is high enough to consider it a valid solution but, on the other hand, the recall shows that we have to work harder to improve these results.
  29. Regarding the good recall obtained by the first baseline [SHOW RESULT], there is a possible over-fitting of the results. One reason for this over-fitting could be the high number of datasets belonging to RKB Explorer used in the evaluation. These datasets use the same ontologies, the same model, etc. When we tried to build a more heterogeneous set of datasets, we found that many datasets from the LOD Cloud were offline, with no RDF dump and no SPARQL endpoint. For this reason, we think that it could be interesting to perform some dataset cleaning over the LOD Cloud. This is only a difficulty that we had during this work, but I think it is worth mentioning here.
  30. Finally, to conclude this presentation, we can say that this work is able to produce precise recommendations of candidate datasets to be linked, but more work has to be done to improve the recall. In the future (or rather, in our current work), we are applying string similarity techniques to find similarities among datasets described by different ontologies. As a spoiler, we can say that the recall improves slightly, by between 10 and 30%, depending on the threshold. Another improvement that we want to make is to turn the matching process into an iterative one, feeding the recommendations made back into the workflow.
  31. That’s all, thank you for your attention.
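
The compression measure described in note 17 can be sketched as follows. The formula itself is not preserved in this transcript, so the sketch assumes the size-based evaluation commonly described for SUBDUE: the size of a graph is its number of vertices plus its number of edges, and a candidate S is scored as size(G) / (size(S) + size(G|S)), so that a higher value means better compression. The function names and the numbers in the usage note are hypothetical.

```python
def graph_size(num_vertices, num_edges):
    """Size of a graph as vertices plus edges (assumed size measure)."""
    return num_vertices + num_edges


def compression_value(size_g, size_s, size_g_given_s):
    """Assumed size-based score: larger means S compresses G better."""
    return size_g / (size_s + size_g_given_s)


def best_candidate(size_g, candidates):
    """Return the name of the candidate with the highest score.

    `candidates` maps a candidate name to the pair (size(S), size(G|S)).
    """
    return max(
        candidates,
        key=lambda name: compression_value(size_g, *candidates[name]),
    )
```

As an illustration, take a graph with 20 vertices and 30 edges, so size(G) = 50. Suppose candidate A has 3 vertices and 2 edges and occurs 4 times without overlap: collapsing each occurrence into one node leaves 12 vertices and 22 edges, so size(G|A) = 34. Suppose candidate B has 2 vertices and 1 edge and occurs 5 times, which leaves 15 vertices and 25 edges, so size(G|B) = 40. Then `best_candidate(50, {"A": (5, 34), "B": (3, 40)})` selects A, because 50/39 is larger than 50/43.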