Profile-based Dataset Recommendation
for RDF Data Linking
PhD Thesis Defense of: Mohamed BEN ELLEFI
LIRMM – Montpellier, France
19/12/2016
Thesis Supervisors:
Zohra BELLAHSENE
Konstantin TODOROV
 7 industrial and academic partners join forces to discover, test, and
implement big open data.
http://www.datalyse.fr/
This work is financed by
2
Outline
o Context
o Vocabulary Recommendation with Datavore
o Datasets Recommendation: Problem Statement
o Datasets Recommendation: Topic Profile-based Approach
o Datasets Recommendation: Intensional Profile-based Approach
o Conclusion & Open Issues
[Figure: example graph of resources connected by sameAs, knows and worksOn links.]
Web Evolving
Web of Hypertext: documents (Doc) connected by hyperlinks.
Linked Data: 12 datasets in 2007, 570 datasets in 2014.
The Linking Open Data cloud diagram: http://lod-cloud.net/
Linked Data Life-cycle
Stages, from Raw Data to Published Linked Data: Modeling, Conversion, Publishing, Interlinking, Maintaining.
Linked Data Life-cycle (1/4): Modeling
• Vocabulary Search
• Vocabulary Terms Selection
• Vocabulary Edition
Linked Data Life-cycle (2/4): Conversion
• Transforming information from the raw data source to RDF data using the selected vocabulary…
Linked Data Life-cycle (3/4): Publishing
• Hosting the linked dataset and its metadata publicly and making it accessible…
Linked Data Life-cycle (4/4): Interlinking
• Datasets Search
• Candidate Selection
• Data Linking
Focus on:
1. Recommending Vocabulary Terms (Modeling)
2. Recommending Candidate Datasets (Interlinking)
Focus on: Recommending Vocabulary Terms (Modeling)
Motivation: Modeling Linked Data
“.. whatever the domain of your vocabulary, someone else has probably done it already.”
--- Cookbook for translating relational data models to RDF schemas
Reusing existing vocabularies:
• Ontology search engines, e.g., the trusted LOV [1] (information on more than 500 vocabularies) …
• Ontology development tools, e.g., Protégé [2] …
• What keywords to use for the search
• How to select vocabularies
• Which metadata can help for modeling
[1] http://lov.okfn.org/ ; [2] http://protege.stanford.edu
Datavore Ecosystem
[Figure: Datavore architecture in six numbered steps. An input dataset (texte, RD) is passed to Datavore, which uses a Translator API service (cleaning, translating) to produce LOV search keywords, and queries the Linked Open Vocabularies (LOV) SPARQL endpoint (terms search, terms extraction, metadata extraction, triples extraction).]
Outputs:
• Ranked lists of vocabulary terms.
• The corresponding metadata.
• Triple suggestions.
Mohamed Ben Ellefi, Zohra Bellahsene, Konstantin
Todorov. Datavore: A Vocabulary Recommender Tool
Assisting Linked Data Modeling. BDA 2015.
Datavore Tool
• A GUI desktop application: http://www.lirmm.fr/benellefi/Datavore_ExeFile
• A demonstration video: http://www.lirmm.fr/benellefi/Datavore_VideoDemo
Mohamed Ben Ellefi, Zohra Bellahsene, Konstantin Todorov. Datavore: A Vocabulary Recommender Tool Assisting Linked Data Modeling. (Posters & Demos) ISWC 2015.
Focus on: Recommending Candidate Datasets (Interlinking)
Entity Linking Challenges
“A dataset is a set of RDF triples that are published, maintained or aggregated by a single provider."
---- Source: https://www.w3.org/TR/void/#dataset
• Data reuse and in-links focus on trusted reference graphs, e.g., DBpedia, Freebase, etc.
• Few datasets are actually used…
• A long tail of potentially suitable yet under-recognized datasets.
Candidate Datasets Selection: Problem Statement (1/2)
Source: “How to find candidates to link my lovely dataset?”
Candidate Datasets Selection: Problem Statement (2/2)
Source, on receiving the candidates: “Thank you for the recommendations.”
• Dataset recommendation for data linking is the task of computing a rank score for each of a set of target datasets with respect to a source dataset.
• The rank score indicates the relatedness between the source and the target dataset.
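The task definition above can be read as a small interface: a relatedness function scores each target against the source, and the recommender sorts by that score. The sketch below is only an illustration of that reading; the function names, the dataset names, and the Jaccard relatedness over hand-made descriptor sets are all assumptions, not part of the thesis.

```python
from typing import Callable, Dict, List, Set, Tuple

def recommend(source: str,
              targets: List[str],
              relatedness: Callable[[str, str], float]) -> List[Tuple[str, float]]:
    """Return (target, rank score) pairs, best candidate first."""
    scored = [(t, relatedness(source, t)) for t in targets if t != source]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical descriptors standing in for real dataset profiles.
DESCRIPTORS: Dict[str, Set[str]] = {
    "my-dataset": {"city", "place", "person"},
    "geo-data": {"city", "place", "region"},
    "bio-data": {"gene", "protein"},
}

def jaccard(a: str, b: str) -> float:
    """Toy relatedness: overlap of the two descriptor sets."""
    x, y = DESCRIPTORS[a], DESCRIPTORS[b]
    return len(x & y) / len(x | y)

ranking = recommend("my-dataset", ["geo-data", "bio-data"], jaccard)
print(ranking)
```

Any profile-based similarity can be plugged in as `relatedness` without changing the ranking step.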
Related Work
(1) Nikolov and d'Aquin, 2011
A keyword-based search approach:
(i) Extracts literals from instances of the source dataset and searches Sig.ma for potentially relevant entities.
(ii) Filters out irrelevant datasets by measuring semantic concept similarities (OM).
Sig.ma is currently down!
(2) Mehdi et al. 2014
1) Input: a set of domain-specific keywords provided manually by an expert.
2) For each keyword, the system runs a set of eight queries: {original-case, proper-case, lower-case, upper-case} × {no-lang-tag, @en-tag}.
3) The output consists of a list of target datasets.
Costly input!
(3) Leme et al. 2013
The ranking is based on Bayesian criteria and on the popularity (existing links) of the datasets.
Cold Start Problem!
(4) Lopes et al. 2014
Extends (3) by exploring the correlation between different sets of features (properties, classes and vocabularies) and the links to compute new rank score functions. Recall of 100% | MAP of 60%.
To improve efficiency!
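The eight queries per keyword described in (2) are the cross product of four case variants and two language-tag variants. A minimal sketch of generating them follows; the literal syntax and the use of `str.title` for proper-case are assumptions, since the slides do not show how Mehdi et al. build their queries.

```python
from itertools import product
from typing import List

# Four case variants, in the order listed on the slide.
CASE_VARIANTS = (
    lambda s: s,   # original-case
    str.title,     # proper-case (assumed to mean title case)
    str.lower,     # lower-case
    str.upper,     # upper-case
)
# Two language-tag variants: no tag, and the @en tag.
LANG_TAGS = ("", "@en")

def literal_variants(keyword: str) -> List[str]:
    """Build the 4 x 2 = 8 RDF-style literal forms for one keyword."""
    return [f'"{case(keyword)}"{tag}'
            for case, tag in product(CASE_VARIANTS, LANG_TAGS)]

variants = literal_variants("linked Data")
for v in variants:
    print(v)
```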
 To deal with real world LOD datasets.
 To provide a new recommender system with a greater efficiency.
17
18
Profile-based Recommendation: Motivation
[Figure: Homer Simpson vs. Peter Griffin. Both buy the same items (similar taste), so an item bought by one is recommended to the other.]
Same profile (taste) behavior!
Profile-based Recommendation: Motivation
Hypothesis:
Linked Data: if two datasets are strongly similar (profile-based similarity), we can consider that they may have the same connectivity behaviour…
What would a dataset profile be?
Semantic Web Datasets Profiling
(*) An RDF dataset profile can be seen as the formal representation of a set of features that describe a dataset and allow the comparison of different datasets with regard to their characteristics. The feature set is dependent on a given application scenario and task.
(*) Mohamed Ben Ellefi, Zohra Bellahsene, John Breslin, Elena Demidova, Stefan Dietze, Julian Szymanski, Konstantin Todorov. Dataset Profiling - a Guide to Features, Methods, Applications and Vocabularies. Under major revision at the Semantic Web Journal.
Dataset Profile Features
o Semantic: Domain/Topic, Context, Index elements, Schema/Instances
o Qualitative: Trust, Accessibility, Representation, Context, Degree of connectivity
o Statistical: Schema Level, Instance Level
o Temporal: Global, Instance-specific, Semantics-specific
Topic Dataset Profile
Semantic feature: Topic/Domain
[Figure: datasets (Semantic Web Dog Food, l3s-dblp) linked to topics (Semantic Web, Data management, WWW Consortium standards, Information retrieval).]
B. Fetahu, S. Dietze, B. Pereira Nunes, M. Antonio Casanova, D. Taibi, and W. Nejdl. A scalable approach for efficiently generating structured dataset topic profiles. In Proceedings of the 11th ESWC, 2014.
Topic Profile-based Datasets Recommendation
o Steps 1-3: Preprocessing/Learning step
o Step 4: Ranking system
Learning/ Preprocessing
[Figure: for each source dataset, its topics and their weights, together with Connectivity(Di, Dj) between datasets.]
Learning/ Preprocessing: Topics Signatures
o A dataset is modeled as a set of topics: a dataset's profile.
o Inversely, a topic is modeled as a set of datasets assigned to it: a topic's signature.
• σ is a connectivity behaviour measure.
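The profile/signature duality amounts to inverting a dataset-to-topics map into a topic-to-datasets map. The sketch below shows only that inversion; the dataset-to-topic assignments are invented for illustration, and the connectivity measure σ used in the thesis is not modeled here.

```python
from collections import defaultdict
from typing import Dict, Set

def topic_signatures(profiles: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Invert dataset profiles (dataset -> topics) into topic signatures
    (topic -> datasets assigned to it)."""
    signatures: Dict[str, Set[str]] = defaultdict(set)
    for dataset, topics in profiles.items():
        for topic in topics:
            signatures[topic].add(dataset)
    return dict(signatures)

# Hypothetical profiles; the real ones come from the topic profiles graph.
profiles = {
    "semantic-web-dog-food": {"Semantic Web", "Information retrieval"},
    "l3s-dblp": {"Semantic Web", "Data management"},
}
signatures = topic_signatures(profiles)
print(signatures)
```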
Target Datasets Ranking
Let D0 be a new dataset to be linked:
1- Extract Profile(D0).
2- Build a pool of target datasets from the signatures (the result of the learning step).
3- Rank the target datasets.
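The three steps can be sketched as follows. The ranking formula used in the thesis is not given in this transcript, so the score below (the fraction of D0's topics whose signature contains the target) is a stand-in chosen for illustration, and the topic and dataset names are hypothetical.

```python
from typing import Dict, List, Set, Tuple

def rank_targets(profile_d0: Set[str],
                 signatures: Dict[str, Set[str]]) -> List[Tuple[str, float]]:
    """Pool every dataset found in the signatures of D0's topics, then
    score each one by the share of D0's topics it appears under.
    NOTE: this score is an illustrative assumption, not the thesis formula."""
    counts: Dict[str, int] = {}
    for topic in profile_d0:
        for dataset in signatures.get(topic, set()):
            counts[dataset] = counts.get(dataset, 0) + 1
    scored = [(d, c / len(profile_d0)) for d, c in counts.items()]
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))

signatures = {
    "geography": {"geonames", "linkedgeodata"},
    "cities": {"geonames"},
    "biology": {"uniprot"},
}
ranking = rank_targets({"geography", "cities"}, signatures)
print(ranking)
```

Datasets that never share a topic with D0 (here `uniprot`) do not enter the pool at all, which is where the search space reduction comes from.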
Experimental Setup
Training Set:
• The topic profiles graph → from its available SPARQL endpoint: http://data-observatory.org/lod-profiles/profile-explorer.
• 76 datasets and 185 392 topics.
• The evaluation data (ED) → the current topology of the LOD-Cloud, using the datahub2void tool (https://github.com/lod-cloud/datahub2void).
o We made the ED available on http://www.lirmm.fr/benellefi/void.ttl
Testing Set:
Source Datasets: all the 76 datasets indexed by the topic profiles graph.
Target Datasets: 258 datasets from the LOD cloud group (http://datahub.io/group/lodcloud).
Evaluation Framework
• Leave-one-out (5-fold cross-validation).
(1) To select a source dataset in the evaluation data.
(2) To consider the dataset as unlinked.
(3) To recommend target datasets using our system.
(4) To evaluate the recommendation.
[Figure: the selected dataset and its owl:sameAs links to Target 1, Target 2, …, Target n, before and after being treated as unlinked.]
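The four evaluation steps can be sketched as a loop over the evaluation data: each source is treated as unlinked, the recommender is asked for targets, and the result is compared with the links that were hidden. The recommender and the link data below are toy placeholders, not the system or the ED from the thesis.

```python
from typing import Callable, Dict, List, Set, Tuple

def precision_recall_f1(recommended: List[str],
                        relevant: Set[str]) -> Tuple[float, float, float]:
    """Standard set-based precision, recall and F1 for one source dataset."""
    hits = len(set(recommended) & relevant)
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

def evaluate(links: Dict[str, Set[str]],
             recommend: Callable[[str], List[str]]) -> Dict[str, Tuple[float, float, float]]:
    """For each source: hide its known links, recommend, then score."""
    results = {}
    for source, true_targets in links.items():
        # The recommender only receives the source name, never its links.
        results[source] = precision_recall_f1(recommend(source), true_targets)
    return results

# Toy evaluation data and a toy recommender that returns every other dataset.
links = {"A": {"B", "C"}, "B": {"A"}}
all_datasets = ("A", "B", "C", "D")
results = evaluate(links, lambda s: [d for d in all_datasets if d != s])
print(results)
```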
Evaluation Results (1/3)
Recall/Precision/F1-Score over all DS ∈ ED
- Average recall: 81%.
- 59% of the DS have a recall of 100%.
- Average precision: 19%.
False Positive Rate?
Evaluation Results (2/3)
FP-Rate over all DS ∈ ED
FP overestimation: a small average FP-Rate of 13%.
Evaluation Results (3/3)
Search Space Reduction over all DS ∈ ED
The original search space size: 258 datasets.
Average search space size reduction: up to 86%.
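For reference, the two quantities reported here can be computed per source dataset as below. The false positive rate follows the usual FP / (FP + TN) definition; the search space reduction formula (one minus the fraction of targets that are recommended) is an assumption about how the figure was obtained, and the dataset names are placeholders.

```python
from typing import Iterable

def false_positive_rate(recommended: Iterable[str],
                        relevant: Iterable[str],
                        all_targets: Iterable[str]) -> float:
    """FP / (FP + TN): share of non-relevant targets that were recommended."""
    recommended, relevant, all_targets = set(recommended), set(relevant), set(all_targets)
    negatives = all_targets - relevant
    false_positives = recommended & negatives
    return len(false_positives) / len(negatives) if negatives else 0.0

def search_space_reduction(recommended: Iterable[str],
                           all_targets: Iterable[str]) -> float:
    """Assumed definition: 1 - |recommended| / |all targets|."""
    return 1.0 - len(set(recommended)) / len(set(all_targets))

all_targets = [f"t{i}" for i in range(10)]
recommended = ["t0", "t2", "t3"]
relevant = ["t0", "t1"]
fpr = false_positive_rate(recommended, relevant, all_targets)
reduction = search_space_reduction(recommended, all_targets)
print(fpr, reduction)
```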
31
Baselines & Comparison (1/3)
Baselines are available on http://www.lirmm.fr/benellefi/Baselines.rar.
Baselines & Comparison (2/3)
Recall values of our approach vs. baselines over all DS ∈ ED
• Baselines fail to provide any results at all for some datasets.
• Our approach is more stable and outperforms the baselines in the majority of recommendations.
Baselines & Comparison (3/3)
AVG Precision, Recall and F1-score values over all recommendation lists for all source datasets.

                Our approach   Shared Keywords   Shared Linksets   Shared Topics Profiles
AVG Precision   19%            9%                9%                3%
AVG Recall      81%            41%               11%               13%
AVG F1-Score    24%            10%               8%                4%

• The baseline approaches have produced better results than our system in a limited number of cases.
• The shared tags baseline generated an F-Score of 100% on:
Note: This is due to the fact that these two datasets are tagged by the same provenance (data.oceandrilling.org) → they share the same set of tags.
34
Discussion
Topic-profiles dataset recommendation approach:
Advantages:
• Original search space reduction.
• Average recall: 81% & Average precision: 19%.
• Ranking results available on http://www.lirmm.fr/benellefi/results.csv.
Challenges:
• Precision needs to be improved.
• Learning data is not complete.
→ Breaking up with the learning step.
Mohamed Ben Ellefi, Zohra Bellahsene, Stefan Dietze and Konstantin Todorov. Beyond Established Knowledge Graphs: Recommending Web Datasets for Data Linking. ICWE 2016.
35
Motivation
Coreference resolution:
• Different datasets may contain different resources that refer to the same real-world entity.
• Following the LD best practices, these resources are generally represented by the same or similar types.
Hypothesis
Datasets that share at least one pair of similar concepts are likely to contain at least one potential pair of instances to be linked, i.e., an “owl:sameAs” statement.
Hypothesis: An Example
Types of the resource Montpellier in three datasets:
• dbpedia.org: dbo:Place, dbo:Location, dbo:Settlement, schema:Place, umbel-rc:PopulatedPlace, yago:Commune108541609, …
• linkedgeodata.org: lgeod:Place, lgeod:City
• yago-knowledge.org: yago:District, yago:Municipality, yago:Region, yago:Town, yago:Commune, …
Similar concepts:
City ≈ Town
Place = Place
Settlement ≈ Commune
Similarity Measures to Use:
• Known from ontology matching, WordNet-based similarity:
o Wu Palmer
o Lin's
• UMBC measure [1]: combines semantic distance in WordNet with frequency of occurrence and co-occurrence of terms in a large external corpus (the web).
[1] L. Han, A. L. Kashyap, T. Finin, J. Mayfield, and J. Weese. UMBC EBIQUITY-CORE: Semantic textual similarity systems. In Proc. of *SEM, Association for Computational Linguistics, 2013.
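For reference, the two WordNet-based measures have the following standard textbook definitions, where lcs(c1, c2) is the least common subsumer of the two concepts, depth is the depth in the WordNet taxonomy, and IC is the information content estimated from a corpus. The slides do not restate these formulas, so this is background notation rather than the thesis's own.

```latex
\[
\mathrm{sim}_{\mathrm{WuP}}(c_1, c_2)
  = \frac{2 \cdot \mathrm{depth}\bigl(\mathrm{lcs}(c_1, c_2)\bigr)}
         {\mathrm{depth}(c_1) + \mathrm{depth}(c_2)},
\qquad
\mathrm{sim}_{\mathrm{Lin}}(c_1, c_2)
  = \frac{2 \cdot \mathrm{IC}\bigl(\mathrm{lcs}(c_1, c_2)\bigr)}
         {\mathrm{IC}(c_1) + \mathrm{IC}(c_2)}
\]
```

Both measures lie in [0, 1], which is why a single threshold θ can be applied to each.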
Intensional Approach to Datasets Recommendation
(1) Preprocessing → (2) Target Datasets Filtering (CCD*) → (3) Datasets Ranking (Cosine)
* CCD = Cluster of Comparable Datasets
Intensional Approach to Datasets Recommendation (1/3)
1. Preprocessing: profiling of DS and of each DT, giving PL(DS), PD(DS), PL(DT), PD(DT)
• Dataset Label Profile, PL(DS): a set of n schema concept labels corresponding to DS.
• Dataset Document Profile, PD(DS): the concatenation of PL(DS) and the textual descriptions of the n schema concepts.
Example: Montpellier ∈ DS; <Montpellier, rdf:type, dbo:Town>
→ PL(DS) = "town", …
→ PD(DS) = “…Usually, a town is thought of as larger than a village but smaller than a city, though there are exceptions to this rule."
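The two profiles can be sketched as simple functions over the schema concepts used in a dataset. The vocabulary lookup table below is a hypothetical stand-in for descriptions fetched from a vocabulary catalogue such as LOV; the `dbo:Town` text is the one quoted on the slide, and the `dbo:City` text is invented for the example.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical lookup: concept -> (label, textual description).
VOCAB: Dict[str, Tuple[str, str]] = {
    "dbo:Town": ("town", "Usually, a town is thought of as larger than a "
                         "village but smaller than a city."),
    "dbo:City": ("city", "A large and permanent settlement."),  # invented text
}

def label_profile(concepts: Iterable[str]) -> List[str]:
    """PL(DS): the labels of the schema concepts used in the dataset."""
    return sorted({VOCAB[c][0] for c in concepts if c in VOCAB})

def document_profile(concepts: Iterable[str]) -> str:
    """PD(DS): the labels concatenated with the concepts' descriptions."""
    known = sorted(c for c in set(concepts) if c in VOCAB)
    descriptions = [VOCAB[c][1] for c in known]
    return " ".join(label_profile(known) + descriptions)

pl = label_profile(["dbo:Town"])
pd_text = document_profile(["dbo:Town"])
print(pl)
print(pd_text)
```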
Intensional Approach to Datasets Recommendation (2/3)
2. Target Datasets Filtering
Two datasets DS and DT are comparable if there exists at least one pair of similar labels between their label profiles: sim(PL(DS), PL(DT)).
• We identify CCD(DS), a Cluster of Comparable Datasets related to DS.
• All the linking candidates for DS are found in its cluster CCD(DS).
• The next step consists of ranking the DT's in CCD(DS).
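The filtering step can be sketched as follows: a target belongs to CCD(DS) as soon as one label pair reaches the similarity threshold θ. The comparison `>= theta` and the toy similarity table are assumptions for the example; in the thesis the similarity would be Wu Palmer, Lin or UMBC.

```python
from typing import Callable, Dict, List, Set

def comparable(pl_s: Set[str], pl_t: Set[str],
               sim: Callable[[str, str], float], theta: float) -> bool:
    """True if at least one label pair is similar enough (assumed: >= theta)."""
    return any(sim(a, b) >= theta for a in pl_s for b in pl_t)

def ccd(source: str, label_profiles: Dict[str, Set[str]],
        sim: Callable[[str, str], float], theta: float) -> List[str]:
    """Cluster of Comparable Datasets for the given source."""
    pl_s = label_profiles[source]
    return [t for t, pl_t in label_profiles.items()
            if t != source and comparable(pl_s, pl_t, sim, theta)]

# Toy similarity: identical labels score 1.0, one hand-made pair scores 0.8.
TOY_SIM = {frozenset({"city", "town"}): 0.8}

def toy_sim(a: str, b: str) -> float:
    return 1.0 if a == b else TOY_SIM.get(frozenset({a, b}), 0.0)

label_profiles = {"DS": {"city", "person"}, "DT1": {"town"}, "DT2": {"gene"}}
cluster = ccd("DS", label_profiles, toy_sim, theta=0.7)
print(cluster)
```

Raising θ shrinks the cluster: with `theta=0.9` the same data yields an empty CCD(DS).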
43
Intensional Approach to Datasets Recommendation (3/3)
3. Datasets Ranking
1) Forming a corpus of profile documents: PD(DS) and all PD(DT).
2) Building a vector space model by indexing the documents in the corpus.
3) Computing TF-IDF + cosine similarity between the document vectors in the corpus.
4) Ranking each DT in the cluster CCD(DS) with respect to DS.
• A mapping between datasets is returned based on their label profiles: PL(DS), PL(DT).
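The ranking steps can be sketched in plain Python, without any external library. Whitespace tokenization and the smoothed IDF variant are choices made for this example and may differ from the indexing used in the thesis; the documents are toy profile texts.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

def tfidf_vectors(docs: Dict[str, str]) -> Dict[str, Dict[str, float]]:
    """Index the corpus: one sparse TF-IDF vector per profile document."""
    tokenized = {name: text.lower().split() for name, text in docs.items()}
    n = len(tokenized)
    df = Counter(term for tokens in tokenized.values() for term in set(tokens))
    vectors = {}
    for name, tokens in tokenized.items():
        tf = Counter(tokens)
        # Smoothed IDF (an assumed variant): log((1 + n) / (1 + df)) + 1.
        vectors[name] = {t: (c / len(tokens)) * (math.log((1 + n) / (1 + df[t])) + 1)
                         for t, c in tf.items()}
    return vectors

def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rank(source: str, cluster: List[str], docs: Dict[str, str]) -> List[Tuple[str, float]]:
    """Rank each DT of the cluster by cosine similarity to the source."""
    vectors = tfidf_vectors(docs)
    scored = [(t, cosine(vectors[source], vectors[t])) for t in cluster]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

docs = {
    "DS": "town city settlement place",
    "DT1": "city town municipality",
    "DT2": "gene protein sequence",
}
ranking = rank("DS", ["DT1", "DT2"], docs)
print(ranking)
```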
Experimental Setup
• Leave-one-out evaluation.
• LOD-Cloud (http://datahub.io/group/lodcloud) → 90 responsive datasets.
• The profile PD → from the LOV (lov.okfn.org).
• Wu Palmer & Lin's → the 2013 WS4J Java API (https://code.google.com/archive/p/ws4j/).
• The UMBC measure → its available web API (http://swoogle.umbc.edu/SimService).
• The evaluation data (ED) → the current topology of the LOD-Cloud (the ED is available on http://www.lirmm.fr/benellefi/void.ttl).
Evaluation Results (1/3)
• For each DS ∈ ED, we evaluated the selection of target datasets in the cluster CCD(DS) in terms of recall.
Result: the recall value remains 100% in the following threshold θ intervals:
o Wu Palmer: θ ∈ [0, 0.9]
o Lin: θ ∈ [0, 0.8]
o UMBC: θ ∈ [0, 0.7]
• In the following, the evaluation is restricted to these intervals in order to guarantee a recall of 100% on the ED.
Evaluation Results (2/3)
MAP@R over all DS ∈ ED
Parameter tuning:
o Wu Palmer: θ = 0.9
o Lin: θ = 0.8
o UMBC: θ = 0.7
Evaluation Results (3/3)
Mean Precision@K over all DS ∈ ED using the three different similarity measures over their best setting

                      P@5    P@10   P@15   P@20
Wu Palmer (θ = 0.9)   0.56   0.52   0.53   0.51
Lin (θ = 0.8)         0.57   0.54   0.55   0.51
UMBC (θ = 0.7)        0.58   0.54   0.53   0.53

UMBC @ θ = 0.7 is the best setting for our ranking approach.
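For readers unfamiliar with the ranking metrics, the sketch below computes Precision@K and average precision for one ranked list, using their common information-retrieval definitions; averaging these over all source datasets gives the mean values. The exact MAP@R variant used in the thesis is not spelled out in the slides, and the ranked list here is a toy example.

```python
from typing import List, Set

def precision_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Share of the top-k recommendations that are true link targets."""
    return sum(1 for d in ranked[:k] if d in relevant) / k

def average_precision(ranked: List[str], relevant: Set[str]) -> float:
    """Mean of the precision values taken at each relevant hit."""
    hits, total = 0, 0.0
    for position, dataset in enumerate(ranked, start=1):
        if dataset in relevant:
            hits += 1
            total += hits / position
    return total / len(relevant) if relevant else 0.0

ranked = ["a", "b", "c", "d"]
relevant = {"a", "c"}
p_at_2 = precision_at_k(ranked, relevant, 2)
ap = average_precision(ranked, relevant)
print(p_at_2, ap)
```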
Baselines & Comparison (1/2)
• Baseline #1: All datasets are represented by their document profiles (PD).
1) A vector space model by indexing the PD → NO CCD clusters.
2) A TF-IDF + cosine similarity between the document vectors.
• Baseline #2: All datasets are represented by their label profiles (PL).
1) CCD(DS) using UMBC @ θ = 0.7.
2) AvgUMBC: a ranking function that assigns a score to each DT ∈ CCD(DS) → NO TF-IDF + cosine similarity.
Baselines & Comparison (2/2)
P@R over all DS ∈ ED: Baseline #1, Baseline #2, Proposed Approach
• Efficiency with an AVG P@R up to 53%, compared to 49% and 39% for the baselines.
• Precision up to 100% for DS from geographic and governmental domains.
• Baseline #2 produces better results for a limited number of datasets:
o RKB explorer datasets share a high number of identical labels in their PL.
Result Analysis
• A high performance
o An average precision of 53% for a recall of 100%.
o Independence of the dataset size or the schema cardinality.
• False positives overestimation.
o The ED is far from being complete as a ground truth.
o We ran “SILK” on some FP recommendations. Example of discovered linksets (owl:sameAs) among: rkb-explorer-unlocode, yovisto, datos-bcn-cl, datos-bcn-uk.
• A fairer evaluation can be given if better ED are used.
Mohamed Ben Ellefi, Zohra Bellahsene, Stefan Dietze and Konstantin Todorov. Dataset Recommendation for Data Linking: an Intensional Approach. In Proceedings of the 13th ESWC, 2016.
Topic Profiles-based vs. Intensional-based

Datasets Recommendation: Intensional Profile-based Approach
• Semantic profile features: Intensional Profile
• No learning step.
• Average recall: 100%
• Mean average precision: 53%.
• A mapping between the schema concepts.
Ranking results available on http://www.lirmm.fr/benellefi/CCD-CosineRank_Result.csv.

Datasets Recommendation: Topic Profile-based Approach
• Semantic profile features: Topic Profiles
• Learning data dependency.
• Average recall: 81%
• Mean average precision: 19%.
• A new topic profiles propagation approach.
Ranking results available on http://www.lirmm.fr/benellefi/results.csv.
Conclusion
• The choice of the profile features is dependent on a given application scenario and task.
• Learning data dependency:
o Learning data is not complete → Effectiveness to be improved.
o The learning break-off → Much greater efficiency.
• Better performance with datasets having high-quality intensional profiles:
o The awareness of the richness of datasets' schema descriptions.
o The importance of reusing existing vocabulary terms: e.g., tools such as Datavore can ease the vocabulary reuse task in linked data modeling.
Open Issues
Dataset recommendation:
o A reliable ground truth (e.g., crowdsourcing-based) and benchmark data.
o To improve the quality of the intensional profiles:
• the population of the schema elements,
• the dataset context,
• multilingualism, …
o Investigating the effectiveness of ML techniques.
Vocabulary recommendation:
o A new evaluation framework for the linked data modeling process, e.g., in a user-study manner and crowdsourcing-based.
o To examine the vocabulary terms ranking strategies based on the data structure factors, e.g., tabular sources modeling vs. web pages annotation.
Twitter: @benellefi
• Mohamed Ben Ellefi, Zohra Bellahsene, Stefan Dietze and Konstantin Todorov. Dataset Recommendation for Data Linking: an Intensional Approach. In Proceedings of the 13th ESWC 2016; Crete, Greece.
• Mohamed Ben Ellefi, Zohra Bellahsene, Stefan Dietze and Konstantin Todorov. Beyond Established Knowledge Graphs: Recommending Web Datasets for Data Linking. In Proceedings of the 16th ICWE 2016; Lugano, Switzerland.
• Manel Achichi, Mohamed Ben Ellefi, Danai Symeonidou, Konstantin Todorov. Automatic Key Selection for Data Linking. In Proceedings of the 20th EKAW 2016; Bologna, Italy.
• Mohamed Ben Ellefi, Zohra Bellahsene, Konstantin Todorov. Datavore: A Vocabulary Recommender Tool Assisting Linked Data Modeling. (Posters & Demos) ISWC 2015; Bethlehem, PA, USA.
o This paper was also presented at BDA 2015, Île de Porquerolles, France.
• Mohamed Ben Ellefi, Zohra Bellahsene, François Scharffe, Konstantin Todorov. Towards Semantic Dataset Profiling. In PROFILES@ESWC 2014; Crete, Greece.

Weitere ähnliche Inhalte

Was ist angesagt?

Open domain Question Answering System - Research project in NLP
Open domain  Question Answering System - Research project in NLPOpen domain  Question Answering System - Research project in NLP
Open domain Question Answering System - Research project in NLPGVS Chaitanya
 
Supporting Springer Nature Editors by means of Semantic Technologies
Supporting Springer Nature Editors by means of Semantic TechnologiesSupporting Springer Nature Editors by means of Semantic Technologies
Supporting Springer Nature Editors by means of Semantic TechnologiesFrancesco Osborne
 
Haystack 2018 - Algorithmic Extraction of Keywords Concepts and Vocabularies
Haystack 2018 - Algorithmic Extraction of Keywords Concepts and VocabulariesHaystack 2018 - Algorithmic Extraction of Keywords Concepts and Vocabularies
Haystack 2018 - Algorithmic Extraction of Keywords Concepts and VocabulariesMax Irwin
 
Mid-Ontology Learning from Linked Data @JIST2011
Mid-Ontology Learning from Linked Data @JIST2011Mid-Ontology Learning from Linked Data @JIST2011
Mid-Ontology Learning from Linked Data @JIST2011Lihua Zhao
 
Standard Datasets in Information Retrieval
Standard Datasets in Information Retrieval Standard Datasets in Information Retrieval
Standard Datasets in Information Retrieval Jean Brenda
 
Approach to leverage Websites to APIs through Semantics
Approach to leverage Websites to APIs through SemanticsApproach to leverage Websites to APIs through Semantics
Approach to leverage Websites to APIs through SemanticsIoannis Stavrakantonakis
 
Keystone Summer School 2015: Mauro Dragoni, Ontologies For Information Retrieval
Keystone Summer School 2015: Mauro Dragoni, Ontologies For Information RetrievalKeystone Summer School 2015: Mauro Dragoni, Ontologies For Information Retrieval
Keystone Summer School 2015: Mauro Dragoni, Ontologies For Information RetrievalMauro Dragoni
 
Named Entity Recognition from Online News
Named Entity Recognition from Online NewsNamed Entity Recognition from Online News
Named Entity Recognition from Online NewsBernardo Najlis
 
An Evolution of Deep Learning Models for AI2 Reasoning Challenge
An Evolution of Deep Learning Models for AI2 Reasoning ChallengeAn Evolution of Deep Learning Models for AI2 Reasoning Challenge
An Evolution of Deep Learning Models for AI2 Reasoning ChallengeTraian Rebedea
 
Lecture: Summarization
Lecture: SummarizationLecture: Summarization
Lecture: SummarizationMarina Santini
 
A Scalable Approach for Efficiently Generating Structured Dataset Topic Profiles
A Scalable Approach for Efficiently Generating Structured Dataset Topic ProfilesA Scalable Approach for Efficiently Generating Structured Dataset Topic Profiles
A Scalable Approach for Efficiently Generating Structured Dataset Topic ProfilesBesnik Fetahu
 
Closing the Gap: Data Models for Documentary Linguistics
Closing the Gap: Data Models for Documentary LinguisticsClosing the Gap: Data Models for Documentary Linguistics
Closing the Gap: Data Models for Documentary LinguisticsBaden Hughes
 
Crowdsourcing Linked Data Quality Assessment
Crowdsourcing Linked Data Quality AssessmentCrowdsourcing Linked Data Quality Assessment
Crowdsourcing Linked Data Quality AssessmentMaribel Acosta Deibe
 
Searching Linked Data
Searching Linked DataSearching Linked Data
Searching Linked DataThanh Tran
 
Translating Ontologies in Real-World Settings
Translating Ontologies in Real-World SettingsTranslating Ontologies in Real-World Settings
Translating Ontologies in Real-World SettingsMauro Dragoni
 
Ethnograph 10 Jul07
Ethnograph 10 Jul07Ethnograph 10 Jul07
Ethnograph 10 Jul07Clara Kwan
 
Lecture: Semantic Word Clouds
Lecture: Semantic Word CloudsLecture: Semantic Word Clouds
Lecture: Semantic Word CloudsMarina Santini
 
Ethnograph 11 Jul07
Ethnograph 11 Jul07Ethnograph 11 Jul07
Ethnograph 11 Jul07Clara Kwan
 

Was ist angesagt? (20)

Open domain Question Answering System - Research project in NLP
Open domain  Question Answering System - Research project in NLPOpen domain  Question Answering System - Research project in NLP
Open domain Question Answering System - Research project in NLP
 
Question answering
Question answeringQuestion answering
Question answering
 
Supporting Springer Nature Editors by means of Semantic Technologies
Supporting Springer Nature Editors by means of Semantic TechnologiesSupporting Springer Nature Editors by means of Semantic Technologies
Supporting Springer Nature Editors by means of Semantic Technologies
 
Haystack 2018 - Algorithmic Extraction of Keywords Concepts and Vocabularies
Haystack 2018 - Algorithmic Extraction of Keywords Concepts and VocabulariesHaystack 2018 - Algorithmic Extraction of Keywords Concepts and Vocabularies
Haystack 2018 - Algorithmic Extraction of Keywords Concepts and Vocabularies
 
Mid-Ontology Learning from Linked Data @JIST2011
Mid-Ontology Learning from Linked Data @JIST2011Mid-Ontology Learning from Linked Data @JIST2011
Mid-Ontology Learning from Linked Data @JIST2011
 
Standard Datasets in Information Retrieval
Standard Datasets in Information Retrieval Standard Datasets in Information Retrieval
Standard Datasets in Information Retrieval
 
Approach to leverage Websites to APIs through Semantics
Approach to leverage Websites to APIs through SemanticsApproach to leverage Websites to APIs through Semantics
Approach to leverage Websites to APIs through Semantics
 
Keystone Summer School 2015: Mauro Dragoni, Ontologies For Information Retrieval
Keystone Summer School 2015: Mauro Dragoni, Ontologies For Information RetrievalKeystone Summer School 2015: Mauro Dragoni, Ontologies For Information Retrieval
Keystone Summer School 2015: Mauro Dragoni, Ontologies For Information Retrieval
 
NISO/NFAIS Joint Virtual Conference: Connecting the Library to the Wider Wor...
Profile-based Dataset Recommendation for RDF Data Linking

  • 1. Profile-based Dataset Recommendation for RDF Data Linking. PhD Thesis Defense of: Mohamed BEN ELLEFI. LIRMM – Montpellier, France, 19/12/2016. Thesis Supervisors: Zohra BELLAHSENE, Konstantin TODOROV.
  • 2. Funding: http://www.datalyse.fr/ (7 industrial and academic partners join forces to discover, test, and implement big open data).
  • 3. Outline: Context; Vocabulary Recommendation with Datavore; Datasets Recommendation: Problem Statement; Datasets Recommendation: Topic Profile-based Approach; Datasets Recommendation: Intensional Profile-based Approach; Conclusion & Open Issues.
  • 7. Web Evolving: from the Web of Hypertext (documents connected by hyperlinks) to Linked Data. The Linking Open Data cloud diagram (http://lod-cloud.net/) grew from 12 datasets in 2007 to 570 datasets in 2014.
  • 9. Linked Data Life-cycle (1/4). Stages: Modeling, Publishing, Conversion, Interlinking, Maintaining (from Raw Data to Published Linked Data). Modeling: Vocabulary Search, Vocabulary Terms Selection, Vocabulary Edition.
  • 10. Linked Data Life-cycle (2/4). Conversion: transforming information from the raw data source to RDF data using the selected vocabulary…
  • 11. Linked Data Life-cycle (3/4). Publishing: hosting the linked dataset and its metadata publicly and making it accessible…
  • 12. Linked Data Life-cycle (4/4). Interlinking: Datasets Search, Candidate Selection, Data Linking.
  • 13. Focus on: Recommending Vocabulary Terms (Modeling stage) and Recommending Candidate Datasets (Interlinking stage).
  • 14. Outline: Context; Vocabulary Recommendation with Datavore; Datasets Recommendation: Problem Statement; Datasets Recommendation: Topic Profile-based Approach; Datasets Recommendation: Intensional Profile-based Approach; Conclusion & Open Issues.
  • 15. Focus on: Recommending Vocabulary Terms (Modeling stage).
  • 17. Motivation: Modeling Linked Data. “.. whatever the domain of your vocabulary, someone else has probably done it already.” (Cookbook for translating relational data models to RDF schemas). Reusing existing vocabularies: ontology search engines, e.g., the trusted LOV (information on more than 500 vocabularies, http://lov.okfn.org/); ontology development tools, e.g., Protégé (http://protege.stanford.edu). Open questions: What keywords to use for the search? How to select vocabularies? Which metadata can help for modeling?
  • 18. Datavore Ecosystem (diagram, six numbered steps). Input: dataset (text, RD). Translator API services: cleaning and translating the keywords. LOV search and LOV SPARQL endpoint (Linked Open Vocabularies): terms search, terms extraction, metadata extraction, triples extraction. Datavore output: ranked lists of vocabulary terms, the corresponding metadata, and triple suggestions.
  • 19. Datavore Tool. A GUI desktop application: http://www.lirmm.fr/benellefi/Datavore_ExeFile. A demonstration video: http://www.lirmm.fr/benellefi/Datavore_VideoDemo. References: Mohamed Ben Ellefi, Zohra Bellahsene, Konstantin Todorov. Datavore: A Vocabulary Recommender Tool Assisting Linked Data Modeling. BDA 2015. Mohamed Ben Ellefi, Zohra Bellahsene, Konstantin Todorov. Datavore: A Vocabulary Recommender Tool Assisting Linked Data Modeling. (Posters & Demos) ISWC 2015.
  • 20. Outline: Context; Vocabulary Recommendation with Datavore; Datasets Recommendation: Problem Statement; Datasets Recommendation: Topic Profile-based Approach; Datasets Recommendation: Intensional Profile-based Approach; Conclusion & Open Issues.
  • 21. Focus on: Recommending Candidate Datasets (Interlinking stage).
  • 23. Entity Linking Challenges. “A dataset is a set of RDF triples that are published, maintained or aggregated by a single provider." (Source: https://www.w3.org/TR/void/#dataset). Data reuse and in-links focus on trusted reference graphs, e.g., DBpedia, Freebase, etc. Few datasets are actually used, leaving a long tail of potentially suitable yet under-recognized datasets.
  • 24. Candidate Datasets Selection: Problem Statement (1/2). Source: "How to find candidates to link my lovely dataset?"
  • 25. Candidate Datasets Selection: Problem Statement (2/2). Dataset recommendation for data linking is the task of computing a rank score for each of a set of target datasets with respect to a source dataset. The rank score indicates the relatedness between the source and the target dataset. Source, on receiving the candidates: "Thank you for the recommendations."
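The problem statement on slide 25 can be illustrated with a minimal sketch. It assumes each dataset is summarised by a sparse profile (feature name to weight) and uses cosine similarity as the rank score; both the profile contents and the choice of cosine are assumptions made for illustration, not the scoring function defined in the thesis. The names `cosine`, `rank_targets`, and the `target_*` datasets are hypothetical.

```python
from math import sqrt


def cosine(p, q):
    """Cosine similarity between two sparse profiles (feature -> weight)."""
    dot = sum(w * q.get(f, 0.0) for f, w in p.items())
    norm_p = sqrt(sum(w * w for w in p.values()))
    norm_q = sqrt(sum(w * w for w in q.values()))
    if norm_p == 0 or norm_q == 0:
        return 0.0  # an empty profile is related to nothing
    return dot / (norm_p * norm_q)


def rank_targets(source, targets):
    """Return (target name, rank score) pairs, most related first."""
    scored = [(name, cosine(source, profile)) for name, profile in targets.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical profiles, invented for this example.
source = {"semantic web": 1.0, "information retrieval": 0.5}
targets = {
    "target_a": {"semantic web": 0.8, "information retrieval": 0.4},
    "target_b": {"biology": 1.0},
    "target_c": {"information retrieval": 1.0},
}

ranking = rank_targets(source, targets)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Any other relatedness measure could be swapped in for `cosine` without changing the shape of the task: one score per target dataset, sorted in decreasing order.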
  • 28. Related Work. (1) Nikolov and d'Aquin, 2011: a keyword-based search approach that (i) extracts literals from instances of the source dataset and searches Sig.ma for potentially relevant entities, and (ii) filters out irrelevant datasets by measuring semantic concept similarities (OM). Limitation: Sig.ma is currently down. (2) Mehdi et al., 2014: 1) the input is a set of domain-specific keywords provided manually by an expert; 2) for each keyword, the system runs a comparison against a set of eight queries: {original-case, proper-case, lower-case, upper-case} * {no-lang-tag, @en-tag}; 3) the output is a list of target datasets. Limitation: costly input. (3) Leme et al., 2013: the ranking is based on Bayesian criteria and on the popularity (existing links) of the datasets. Limitation: cold start problem. (4) Lopes et al., 2014: extends (3) by exploring the correlation between different sets of features (properties, classes and vocabularies) and the links to compute new rank score functions. Recall of 100%, MAP of 60%. Limitation: efficiency can be improved. Goals: to deal with real-world LOD datasets and to provide a new recommender system with greater efficiency.
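The eight-query scheme attributed to Mehdi et al. on the slide is the Cartesian product of four case variants and two language-tag options. The sketch below only enumerates those combinations; how the original system actually builds and executes its queries is not given in the slides, so rendering each variant as a quoted RDF literal and using Python's `str.title()` for "proper-case" are assumptions. The function name `query_variants` is hypothetical.

```python
from itertools import product


def query_variants(keyword):
    """Enumerate {original, proper, lower, upper} x {no tag, @en} literals."""
    cases = [keyword, keyword.title(), keyword.lower(), keyword.upper()]
    lang_tags = ["", "@en"]  # no language tag, English language tag
    return [f'"{text}"{tag}' for text, tag in product(cases, lang_tags)]


variants = query_variants("linked Data")
for v in variants:
    print(v)
```

Because every keyword expands into eight lookups and the keywords themselves come from an expert, the cost noted on the slide grows with the size of the keyword list.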
  • 29. Outline: o Context o Vocabulary Recommendation with Datavore o Datasets Recommendation: Problem Statement o Datasets Recommendation: Topic Profile-based Approach o Datasets Recommendation: Intensional Profile-based Approach o Conclusion & Open Issues
  • 30. Profile-based Recommendation: Motivation. [Figure: Homer Simpson vs. Peter Griffin, two buyers with a similar taste; what one buys is recommended to the other.] Same profile (taste), same behavior!
  • 31. Profile-based Recommendation: Motivation. Hypothesis (Linked Data): if two datasets are strongly similar (profile-based similarity), we can consider that they may have the same connectivity behaviour. Open question: what should a dataset profile be?
  • 32. Semantic Web Datasets Profiling. An RDF dataset profile can be seen as the formal representation of a set of features that describe a dataset and allow the comparison of different datasets with regard to their characteristics. The feature set depends on a given application scenario and task (*). Dataset profile features: o Semantic: Domain/Topic, Context, Index elements, Schema/Instances. o Qualitative: Trust, Accessibility, Representation, Context, Degree of connectivity. o Statistical: Schema level, Instance level. o Temporal: Global, Instance-specific, Semantics-specific. (*) Mohamed Ben Ellefi, Zohra Bellahsene, John Breslin, Elena Demidova, Stefan Dietze, Julian Szymanski, Konstantin Todorov. Dataset Profiling - a Guide to Features, Methods, Applications and Vocabularies. Under major revision at the Semantic Web Journal.
  • 33. Topic Dataset Profile (semantic feature: Topic/Domain). [Figure: datasets (Semantic Web Dog Food, l3s-dblp) connected to topics (Semantic Web, Data management, WWW Consortium standards, Information retrieval).] B. Fetahu, S. Dietze, B. Pereira Nunes, M. Antonio Casanova, D. Taibi, and W. Nejdl. A scalable approach for efficiently generating structured dataset topic profiles. In Proceedings of the 11th ESWC 2014.
  • 34. Topic Profile-based Datasets Recommendation. o Steps 1-3: Preprocessing/Learning step o Step 4: Ranking system
  • 38. Learning/Preprocessing: Topic Signatures. o A dataset is modeled as a set of topics: the dataset's profile. o Inversely, a topic is modeled as a set of datasets assigned to it: the topic's signature. o σ is a connectivity behaviour measure.
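The inversion from dataset profiles to topic signatures can be sketched as follows. This is only an illustration: the dictionary layout, the function name `topic_signatures` and the `threshold` parameter (a dataset is kept in a signature only if its weight exceeds the threshold, as suggested in the speaker notes) are assumptions, and the connectivity measure σ is not modeled.

```python
from collections import defaultdict


def topic_signatures(profiles: dict[str, dict[str, float]],
                     threshold: float = 0.0) -> dict[str, dict[str, float]]:
    """Invert dataset profiles {dataset: {topic: weight}} into
    topic signatures {topic: {dataset: weight}}.

    Assumption: a dataset belongs to a topic's signature only when its
    weight for that topic is strictly greater than `threshold`.
    """
    signatures: dict[str, dict[str, float]] = defaultdict(dict)
    for dataset, topics in profiles.items():
        for topic, weight in topics.items():
            if weight > threshold:
                signatures[topic][dataset] = weight
    return dict(signatures)


# Hypothetical toy profiles, for illustration only.
profiles = {
    "dataset-A": {"Semantic Web": 0.9, "Information retrieval": 0.1},
    "dataset-B": {"Semantic Web": 0.4},
}
print(topic_signatures(profiles, threshold=0.3))
```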
  • 39. Target Datasets Ranking. Let D0 be a new dataset to be linked: 1- Extract Profile(D0). 2- Constitute a pool of target datasets from the signatures (the result of the learning step). 3- Rank the target datasets [ranking formula not preserved in the transcript].
  • 40. Experimental Setup. Training set: o The topic profiles graph, from its available SPARQL endpoint: http://data-observatory.org/lod-profiles/profile-explorer (76 datasets and 185,392 topics). o The evaluation data (ED): the current topology of the LOD cloud, obtained with the datahub2void tool (https://github.com/lod-cloud/datahub2void). We made the ED available at http://www.lirmm.fr/benellefi/void.ttl. Testing set: o Source datasets: all 76 datasets indexed by the topic profiles graph. o Target datasets: 258 datasets from the LOD cloud group (http://datahub.io/group/lodcloud).
  • 41. Evaluation Framework: leave-one-out (5-fold cross-validation). (1) Select a source dataset in the evaluation data. (2) Consider the dataset as unlinked. (3) Recommend target datasets using our system. (4) Evaluate the recommendation. [Figure: the selected dataset and its owl:sameAs links to Target 1, Target 2, …, Target n.]
  • 42. Evaluation Results (1/3): Recall/Precision/F1-score over all DS ∈ ED. - Average recall: 81%. - 59% of the DS have a recall of 100%. - Average precision: 19%. What about the false positive rate?
  • 43. Evaluation Results (2/3): FP-rate over all DS ∈ ED. FP overestimation: a small average FP-rate of 13%.
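The measures reported on these slides can be computed per source dataset from a recommendation list and the set of datasets it is actually linked to. The sketch below uses the standard set-based definitions; the function name `evaluate` and the choice to count true negatives over the whole search space are assumptions, not details confirmed by the slides.

```python
def evaluate(recommended: set[str], relevant: set[str],
             search_space_size: int) -> dict[str, float]:
    """Precision, recall, F1 and false positive rate for one source dataset.

    Assumption: every dataset in the search space that is neither
    recommended nor relevant counts as a true negative.
    """
    tp = len(recommended & relevant)   # correctly recommended targets
    fp = len(recommended - relevant)   # recommended but not linked in the ED
    fn = len(relevant - recommended)   # linked in the ED but missed
    tn = search_space_size - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    fp_rate = fp / (fp + tn) if fp + tn else 0.0
    return {"precision": precision, "recall": recall,
            "f1": f1, "fp_rate": fp_rate}


# Hypothetical toy example: 4 recommendations, 2 of them correct,
# in a search space of 10 datasets.
print(evaluate({"a", "b", "c", "d"}, {"a", "b"}, search_space_size=10))
```

Because the evaluation data is incomplete, some recommendations counted as false positives here may in fact be valid linking candidates.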
  • 44. Evaluation Results (3/3): search space reduction over all DS ∈ ED. The original search space size is 258 datasets. The average search space reduction is up to 86%.
  • 45. Baselines & Comparison (1/3). Three baselines are used; they are available at http://www.lirmm.fr/benellefi/Baselines.rar.
  • 46. Baselines & Comparison (2/3): recall values of our approach vs. the baselines over all DS ∈ ED. o The baselines fail to provide any results at all for some datasets. o Our approach is more stable and outperforms the baselines in the majority of recommendations.
  • 48. Baselines & Comparison (3/3). Average precision, recall and F1-score values over all recommendation lists for all source datasets:
        Measure         Our approach   Shared Keywords   Shared Linksets   Shared Topics Profiles
        AVG Precision   19%            9%                9%                3%
        AVG Recall      81%            41%               11%               13%
        AVG F1-Score    24%            10%               8%                4%
    o The baseline approaches have produced better results than our system in a limited number of cases. o Note: the shared tags baseline generated an F-score of 100% on oceandrilling-janusamp, which is connected to oceandrilling-codices. This is because these two datasets are tagged by the same provenance (data.oceandrilling.org) and therefore share the same set of tags.
  • 50. Discussion: topic profiles-based dataset recommendation approach. Advantages: o Original search space reduction. o Average recall: 81%; average precision: 19%. o Ranking results available at http://www.lirmm.fr/benellefi/results.csv. Challenges: o Precision needs to be improved. o Learning data is not complete. Next: breaking up with the learning step. Mohamed Ben Ellefi, Zohra Bellahsene, Stefan Dietze and Konstantin Todorov. Beyond Established Knowledge Graphs: Recommending Web Datasets for Data Linking. ICWE 2016.
  • 51. Outline: o Context o Vocabulary Recommendation with Datavore o Datasets Recommendation: Problem Statement o Datasets Recommendation: Topic Profile-based Approach o Datasets Recommendation: Intensional Profile-based Approach o Conclusion & Open Issues
  • 52. Motivation: coreference resolution. o Different datasets may contain different resources that refer to the same real-world entity. o Following the LD best practices, these resources are generally represented by the same or similar types. Hypothesis: datasets that share at least one pair of similar concepts are likely to contain at least one potential pair of instances to be linked, i.e., by an owl:sameAs statement.
  • 55. Similarity Measures to Use. o Known from ontology matching, WordNet-based similarity: Wu Palmer and Lin's. o UMBC measure [1]: combines semantic distance in WordNet with the frequency of occurrence and co-occurrence of terms in a large external corpus (the web). [1] L. Han, A. L. Kashyap, T. Finin, J. Mayfield, and J. Weese. UMBC EBIQUITY-CORE: Semantic textual similarity systems. In Proc. of *SEM, Association for Computational Linguistics, 2013.
  • 56. Intensional Approach to Datasets Recommendation: (1) Preprocessing, (2) Target Datasets Filtering (CCD*), (3) Datasets Ranking (cosine). * CCD = Cluster of Comparable Datasets.
  • 57. Intensional Approach to Datasets Recommendation (1/3), step 1: Preprocessing (profiling of DS and of each DT). o Dataset Label Profile, PL(DS): a set of n schema concept labels corresponding to DS. o Dataset Document Profile, PD(DS): the concatenation of PL(DS) and the textual descriptions of the n schema concepts. Example: Montpellier ∈ DS; <Montpellier, rdf:type, dbo:Town> gives PL(DS) = {"town", …} and PD(DS) = "…Usually, a town is thought of as larger than a village but smaller than a city, though there are exceptions to this rule."
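The two intensional profiles can be illustrated with a minimal sketch. The function names, the plain dictionaries standing in for a vocabulary lookup (the slides obtain labels and descriptions from LOV) and the lower-casing of labels are assumptions made for illustration.

```python
def label_profile(concepts: list[str], labels: dict[str, str]) -> set[str]:
    """PL(D): the set of labels of the schema concepts used in dataset D.

    Assumption: labels are lower-cased; concepts without a known label
    are skipped.
    """
    return {labels[c].lower() for c in concepts if c in labels}


def document_profile(concepts: list[str], labels: dict[str, str],
                     descriptions: dict[str, str]) -> str:
    """PD(D): the labels of PL(D) concatenated with the textual
    descriptions of the same schema concepts."""
    parts = sorted(label_profile(concepts, labels))
    parts += [descriptions[c] for c in concepts if c in descriptions]
    return " ".join(parts)


# Toy data mirroring the Montpellier example on the slide.
concepts = ["dbo:Town"]
labels = {"dbo:Town": "town"}
descriptions = {
    "dbo:Town": "Usually, a town is thought of as larger than a village "
                "but smaller than a city."
}
print(label_profile(concepts, labels))
print(document_profile(concepts, labels, descriptions))
```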
  • 58. Intensional Approach to Datasets Recommendation (2/3), step 2: Target Datasets Filtering. o Two datasets DS and DT are comparable if there exists at least one similarity between their label profiles: sim(PL(DS), PL(DT)). o We identify CCD(DS), a Cluster of Comparable Datasets related to DS. o All the linking candidates for DS are found in its cluster CCD(DS). o The next step consists of ranking the DT's in CCD(DS).
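The filtering step can be sketched with a pluggable similarity function. In the sketch below, `toy_sim` and its hand-written score table are stand-ins for a real measure such as Wu Palmer, Lin or UMBC, and the use of `>=` against the threshold `theta` is an assumption.

```python
from typing import Callable

# Hypothetical similarity scores, standing in for a WordNet-based measure.
TOY_SCORES = {frozenset({"city", "town"}): 0.8}


def toy_sim(a: str, b: str) -> float:
    """Toy label similarity: 1.0 for identical labels, table lookup otherwise."""
    if a == b:
        return 1.0
    return TOY_SCORES.get(frozenset({a, b}), 0.0)


def comparable(pl_s: set[str], pl_t: set[str],
               sim: Callable[[str, str], float], theta: float) -> bool:
    """True if at least one pair of labels is similar enough."""
    return any(sim(a, b) >= theta for a in pl_s for b in pl_t)


def ccd(source: str, label_profiles: dict[str, set[str]],
        sim: Callable[[str, str], float], theta: float) -> set[str]:
    """Cluster of Comparable Datasets for `source`: every other dataset
    whose label profile is comparable with the source's label profile."""
    return {name for name, pl in label_profiles.items()
            if name != source
            and comparable(label_profiles[source], pl, sim, theta)}


label_profiles = {"DS": {"town"}, "DT1": {"city", "river"}, "DT2": {"protein"}}
print(ccd("DS", label_profiles, toy_sim, theta=0.7))
```

A lower threshold enlarges the cluster and favors recall, while a higher one shrinks the candidate set that has to be ranked.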
  • 59. Intensional Approach to Datasets Recommendation (3/3), step 3: Datasets Ranking. 1) Form a corpus from the profile documents: PD(DS) and all PD(DT). 2) Build a vector space model by indexing the documents in the corpus. 3) Compute TF-IDF + cosine similarity between the document vectors in the corpus. 4) Rank each DT in the cluster CCD(DS) with respect to DS. o A mapping between datasets is also returned, based on their label profiles: PL(DS), PL(DT).
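The four ranking steps can be sketched with the standard library only. This is an illustration under stated assumptions rather than the system's implementation: the tokenizer and the tiny stop-word list approximate the lower-casing, tokenizing and stop-word filtering mentioned in the speaker notes, and the smoothed IDF formula is one common variant chosen here for convenience.

```python
import math
import re
from collections import Counter

# Assumption: a tiny stop-word list, for illustration only.
STOP_WORDS = {"a", "an", "the", "of", "is", "as", "than", "but", "to", "and", "in"}


def tokenize(text: str) -> list[str]:
    """Lower-case, split on non-alphanumerics, drop stop words."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower())
            if t not in STOP_WORDS]


def tfidf_vectors(docs: dict[str, str]) -> dict[str, dict[str, float]]:
    """Build one sparse TF-IDF vector per document (smoothed IDF)."""
    counts = {name: Counter(tokenize(text)) for name, text in docs.items()}
    n_docs = len(docs)
    doc_freq = Counter(term for c in counts.values() for term in c)
    return {
        name: {term: tf * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
               for term, tf in c.items()}
        for name, c in counts.items()
    }


def cosine(u: dict[str, float], v: dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_targets(source: str, docs: dict[str, str],
                 candidates: set[str]) -> list[tuple[str, float]]:
    """Rank candidate targets by cosine similarity to the source document."""
    vectors = tfidf_vectors(docs)
    scores = {t: cosine(vectors[source], vectors[t]) for t in candidates}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


docs = {"DS": "town city village", "DT1": "town city river", "DT2": "protein gene"}
print(rank_targets("DS", docs, {"DT1", "DT2"}))
```

In the approach described on the slide, `candidates` would be the cluster CCD(DS) produced by the filtering step, so only comparable datasets are ranked.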
  • 61. Experimental Setup. o Leave-one-out evaluation. o LOD cloud (http://datahub.io/group/lodcloud): 90 responsive datasets. o The profile PD: built from the LOV (lov.okfn.org). o Wu Palmer and Lin's: the 2013 WS4J Java API (https://code.google.com/archive/p/ws4j/). o The UMBC measure: its available web API (http://swoogle.umbc.edu/SimService). o The evaluation data (ED): the current topology of the LOD cloud (the ED is available at http://www.lirmm.fr/benellefi/void.ttl).
  • 62. Evaluation Results (1/3). o For each DS ∈ ED, we evaluated the selection of target datasets in the cluster CCD(DS) in terms of recall. Result: the recall value remains 100% in the following threshold θ intervals: o Wu Palmer: θ ∈ [0, 0.9] o Lin: θ ∈ [0, 0.8] o UMBC: θ ∈ [0, 0.7]. o In the following, the evaluation is restricted to these intervals in order to guarantee a recall of 100% on the ED.
  • 63. Evaluation Results (2/3): MAP@R over all DS ∈ ED. Parameter tuning: o Wu Palmer: θ = 0.9 o Lin: θ = 0.8 o UMBC: θ = 0.7
  • 64. Evaluation Results (3/3). Mean Precision@K over all DS ∈ ED using the three similarity measures at their best setting:
        Measure               P@5    P@10   P@15   P@20
        Wu Palmer (θ = 0.9)   0.56   0.52   0.53   0.51
        Lin (θ = 0.8)         0.57   0.54   0.55   0.51
        UMBC (θ = 0.7)        0.58   0.54   0.53   0.53
    UMBC at θ = 0.7 is the best setting for our ranking approach.
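The ranking measures used in these results can be sketched as follows. Precision@K is standard; for MAP@R the exact formula is not preserved in the transcript, so the sketch assumes the usual average precision accumulated up to the rank R(q) at which recall reaches 1, averaged over all source datasets. All function names are hypothetical.

```python
def precision_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k recommendations that are relevant."""
    if k <= 0:
        return 0.0
    return sum(1 for d in ranked[:k] if d in relevant) / k


def average_precision_at_full_recall(ranked: list[str],
                                     relevant: set[str]) -> float:
    """Average of the precision values at each relevant hit, stopping at
    the rank where all relevant datasets have been retrieved."""
    if not relevant:
        return 0.0
    hits = 0
    precisions = []
    for rank, dataset in enumerate(ranked, start=1):
        if dataset in relevant:
            hits += 1
            precisions.append(hits / rank)
            if hits == len(relevant):  # recall has reached 1 at this rank
                break
    return sum(precisions) / len(relevant)


def map_at_r(runs: list[tuple[list[str], set[str]]]) -> float:
    """Mean of the per-source average precisions over all source datasets."""
    if not runs:
        return 0.0
    return sum(average_precision_at_full_recall(r, rel)
               for r, rel in runs) / len(runs)


ranked = ["a", "x", "b", "y"]
relevant = {"a", "b"}
print(precision_at_k(ranked, relevant, 2))
print(average_precision_at_full_recall(ranked, relevant))
```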
  • 65. Baselines & Comparison (1/2). o Baseline #1: all datasets are represented by their document profiles (PD). 1) A vector space model built by indexing the PD. 2) TF-IDF + cosine similarity between the document vectors. No CCD clusters. o Baseline #2: all datasets are represented by their label profiles (PL). 1) CCD(DS) using UMBC at θ = 0.7. 2) AvgUMBC, a ranking function that assigns a score to each DT ∈ CCD(DS) [formula not preserved in the transcript]. No TF-IDF + cosine similarity.
  • 66. Baselines & Comparison (2/2): P@R over all DS ∈ ED for Baseline #1, Baseline #2 and the proposed approach. o Efficiency: an average P@R of up to 53%, compared to 49% and 39% for the baselines. o Precision of up to 100% for DS from the geographic and governmental domains. o Baseline #2 produces better results for a limited number of datasets: the RKB Explorer datasets share a high number of identical labels in their PL.
  • 67. Result Analysis. o High performance: an average precision of 53% for a recall of 100%; independence of the dataset size or the schema cardinality. o False positives overestimation: a fairer evaluation can be given if better ED are used. The ED is far from complete as a ground truth. We ran SILK on some FP recommendations. [Figure: examples of discovered owl:sameAs linksets among rkb-explorer-unlocode, yovisto, datos-bcn-cl and datos-bcn-uk.] Mohamed Ben Ellefi, Zohra Bellahsene, Stefan Dietze and Konstantin Todorov. Dataset Recommendation for Data Linking: an Intensional Approach. In Proceedings of the 13th ESWC 2016.
  • 68. Topic Profiles-based vs. Intensional Profile-based Dataset Recommendation. Topic profiles-based approach: o Semantic profile features: topic profiles. o Learning data dependency. o Average recall: 81%. o Mean average precision: 19%. o A new topic profiles propagation approach. o Ranking results available at http://www.lirmm.fr/benellefi/results.csv. Intensional profile-based approach: o Semantic profile features: intensional profile. o No learning step. o Average recall: 100%. o Mean average precision: 53%. o Mappings between the schema concepts. o Ranking results available at http://www.lirmm.fr/benellefi/CCD-CosineRank_Result.csv.
  • 69. Outline: o Context o Vocabulary Recommendation with Datavore o Datasets Recommendation: Problem Statement o Datasets Recommendation: Topic Profile-based Approach o Datasets Recommendation: Intensional Profile-based Approach o Conclusion & Open Issues
  • 70. Conclusion. o The choice of the profile features depends on the given application scenario and task. o Learning data dependency: the learning data is not complete, so effectiveness remains to be improved. o Breaking off from the learning step: much greater efficiency. o Better performance with datasets having high-quality intensional profiles: awareness of the richness of dataset schema descriptions, and importance of reusing existing vocabulary terms (e.g., tools such as Datavore can ease the vocabulary reuse task in linked data modeling).
  • 71. Open Issues. Dataset recommendation: o A reliable ground truth (e.g., crowdsourcing-based) and benchmark data. o Improving the quality of the intensional profiles: the population of the schema elements, the dataset context, multilingualism, … o Investigating the effectiveness of ML techniques. Vocabulary recommendation: o A new evaluation framework for the linked data modeling process, e.g., in the form of a user study and crowdsourcing-based. o Examining the vocabulary term ranking strategies based on data structure factors, e.g., tabular sources modeling vs. web pages annotation.
  • 73. o Mohamed Ben Ellefi, Zohra Bellahsene, Stefan Dietze and Konstantin Todorov. Dataset Recommendation for Data Linking: an Intensional Approach. In Proceedings of the 13th ESWC 2016; Crete, Greece. o Mohamed Ben Ellefi, Zohra Bellahsene, Stefan Dietze and Konstantin Todorov. Beyond Established Knowledge Graphs: Recommending Web Datasets for Data Linking. In Proceedings of the 16th ICWE 2016; Lugano, Switzerland. o Manel Achichi, Mohamed Ben Ellefi, Danai Symeonidou, Konstantin Todorov. Automatic Key Selection for Data Linking. In Proceedings of the 20th EKAW 2016; Bologna, Italy. o Mohamed Ben Ellefi, Zohra Bellahsene, Konstantin Todorov. Datavore: A Vocabulary Recommender Tool Assisting Linked Data Modeling. (Posters & Demos) ISWC 2015; Bethlehem, PA, USA. This paper was also presented at BDA'2015, Île de Porquerolles, France. o Mohamed Ben Ellefi, Zohra Bellahsene, François Scharffe, Konstantin Todorov. Towards Semantic Dataset Profiling. In PROFILES@ESWC 2014; Crete, Greece.

Editor's notes

  1. Respected members of the examination committee and dear …, the title of my dissertation is … Profile-based
  2. I start by presenting the project that financed our work: Datalyse includes 7 … working on big open data. For more information about the project, you can check datalyse.fr.
  3. This is the overview of what I am presenting today. I will talk quickly … So yes, I am going to talk about 3 years of my life in 40 minutes, which is very ambitious.
  4. Here we have a snapshot showing that Google processes 2.4 million searches every minute. Yes, that is a lot of information on the web …
  5. To be more concrete, let's take an example of some entities extracted from the web.
  6. For a better exploitation of this information, Linked Data consists in linking these entities with typed links. Now, technically, what changed in the web?
  7. The main idea: in the web of hypertext, links are anchor relations written in HTML … Linked Data consists of knowledge graphs where links between things are described in RDF. Snapshots of the LOD cloud show how the graph has evolved from 12 datasets. How do we publish our data as Linked Data?
  8. There are different visions. Data goes through a number of stages to be published as Linked Data.
  9. Search for potential vocabulary
  10. The conversion stage where data is transformed to RDF using the selected ontology
  11. Hosting LD and making it accessible, which of course includes controlling the access. Access control is a mechanism through which an agent (typically an HTTP server, in this case)
  12. Potential target datasets matching rules
  13. In this thesis, we focused on both the …
  14. Intro followed by Context and Motivation
  15. In this thesis, we focused on both the …
  16. Here we have the recommendation of building on, instead of replicating, existing RDF schemas and vocabularies
  17. There is a need for a tool that recommends vocabularies, to assist the metadata designer in reusing existing vocabulary terms when developing the on
  18. Metadata = the set of object properties and datatype properties that can be used with C. Triples extraction = relations between concepts from different lists.
  19. This is a snapshot of the Datavore graphical user interface, for better user assistance in the vocabulary selection …
  20. Intro followed by Context and Motivation
  21. Which is a pre-processing of the interlinking
  22. First we recall what is an RDF dataset…
  23. We highlight a challenging problem in the field of entity linking where …
  24. Preprocessing of the interlinking; the LOD as a search space. However, we should note that we deal with source datasets that are not yet linked or published, as well as with already linked and published datasets that we plan to enrich!
  25. Aim to reduce the effort of
  26. To the best of our knowledge, only a few works have looked at identifying candidate datasets for interlinking. Overcome the drawbacks of
  27. Overcome the drawbacks of
  28. We aim to provide … No precision/recall evaluation is reported, except for (4). No list of datasets with their rank performance per source dataset is given. A direct comparison is not possible!
  29. Intro followed by Context and Motivation
  30. A user-based recommendation. Two key notions that we have to keep in mind: profile and behavior. Note: there are many recommendation systems, but only a few approaches are in the context of the linking process.
  31. The topic profile is our grouping technique. Note: there are many recommendation systems, but only a few approaches are in the context of the linking process.
  32. "What are the features that describe a dataset and allow us to define it?"
  33. Weighted bipartite graph. Topics are DBpedia categories.
  34. We have a source dataset presented by its weigh
  35. C(., .) is a connectivity function based on the number of the links between two datasets in both directions.
  36. C(., .) is a connectivity function based on the number of the links between two datasets in both directions.
  37. A dataset DTK is significant with respect to a topic Tk if its weight in the LDPG is greater than a given threshold value in (0, 1). A topic Tk is modeled by the set of its significant datasets together with their respective weights. (To be removed.)
  38. Since this ED is the only available data that we have for both training we opted for a .. In 5-fold cross-validation, the ED was randomly split into two subsets: (1) the first, containing a random 80% of the linked datasets in the ED, was used as the training set; (2) the second, containing the remaining linked datasets (i.e., a random 20% of the ED), was retained as the validation data for tests (i.e., the test set).
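The 80%/20% split described in this note can be sketched as a generic 5-fold procedure. The function name `five_fold_splits`, the fixed random seed and the round-robin assignment of shuffled items to folds are assumptions for illustration; the note does not say how the random split was drawn.

```python
import random
from typing import Iterator


def five_fold_splits(linked_datasets: list[str],
                     seed: int = 0) -> Iterator[tuple[list[str], list[str]]]:
    """Yield (train, test) pairs: each fold holds out about 20% of the
    linked datasets for testing and trains on the remaining 80%."""
    items = sorted(linked_datasets)
    random.Random(seed).shuffle(items)          # reproducible shuffle
    folds = [items[i::5] for i in range(5)]     # 5 disjoint folds
    for i, test in enumerate(folds):
        train = [d for j, fold in enumerate(folds) if j != i for d in fold]
        yield train, test


# Hypothetical dataset identifiers, for illustration only.
datasets = [f"dataset-{i}" for i in range(10)]
for train, test in five_fold_splits(datasets):
    print(len(train), len(test))
```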
  39. The LOD cloud as ED, for each DS ∈ ED. We evaluate the recommendation based on the main links.
  40. Indicates that every time you call a positive, you have a probability of being right. We confirm our hypothesis of
  41. It is around 40 recommendations out of the total of 258, which considerably reduces the user effort.
  42. To the best of our knowledge, there does not exist a common benchmark for dataset interlinking recommendation.
  43. For example, the shared tags baseline generated an F-score of 100% on oceandrilling-janusamp, which is connected to oceandrilling-codices, because these two datasets are tagged by the same provenance (data.oceandrilling.org).
  44. I should have computed the p-value (significance test).
  45. Crowdsourcing techniques. ML techniques, such as classification or clustering, for the recommendation task.
  46. The question here: if we break up with
  47. Intro followed by Context and Motivation
  48. Metadata designers reuse and build on, instead of replicating, existing RDF schemas and vocabularies.
  49. Here we have an example of the city of Montpellier in France. This city is described by similar types in different datasets. We observe that we have the same or similar concepts across these different datasets.
  50. We observe that we have same/similar concepts
  51. We need a similarity measure that can detect the similarity between terms such as “city” and “town”. Latent Semantic Analysis (LSA) word similarity relies on word co-occurrence in the same contexts, computed on a three-billion-word corpus of good-quality English. D(x, y) is the minimal WordNet [16] path length between x and y. According to the authors, using e^(-α·D(x, y)) to transform the simple shortest path length has been shown to be very effective when the parameter α is set to 0.25.
  52. Here we have a global view of our proposed approach pipeline, which consists of three main steps … Add a note on CCD …
  53. We extract the dataset profiles. Vocabularies that are very generic and widespread have a negative impact: acting like hub nodes, they dilute the results. Therefore, we decided to remove VoID, FOAF and SKOS.
  54. We note that we use the term "cluster" in its general meaning, referring to a set of datasets grouped together by their similarity and not in a machine learning sense.
  55. As an additional contribution, our method returns the mappings between the schema concepts across datasets, a particularly useful input for the data linking step. The standard analyzer of Lucene: converts text to lower case + tokenizes + filters English stop words. Standard NLP techniques.
  56. Precision at rank k. Mean average precision at Recall = 1, MAP@R, where R(q) corresponds to the rank at which recall reaches 1 for the q-th dataset and TotalDS is the total number of source datasets in the evaluation.
  57. The first step is to form a CCD(DS) for each DS. The CCD construction process depends on the similarity measure on dataset profiles. We observed that the recall value remains 100% in the following threshold intervals
  58. Here we have the parameter setting for each measure, where we found that the best MAP is achieved in
  59. Furthermore, we provide an evaluation in terms of … UMBC = stability in the performance
  60. The first baseline
  61. This is an evaluation of our system against … in terms of … Baseline #2, which is based on the label profile.
  62. The matching rules required as input by SILK make the process infeasible to perform manually over the entire LOD. An exhausting task.
  63. Intro followed by Context and Motivation
  64. In the Linked Data life cycle, we addressed the stage … For modeling, we provided … For interlinking, we addressed the preprocessing step, where we proposed two approaches for dataset identification … We also show how
  65. ML techniques, such as classification or clustering, for the recommendation task.
  66. Vocabularies that are very generic and widespread have a negative impact: acting like hub nodes, they dilute the results. Therefore, we decided to remove VoID, FOAF and SKOS.
  67. Intro followed by Context and Motivation
  68. Precision at rank k. Mean average precision at Recall = 1, MAP@R, where R(q) corresponds to the rank at which recall reaches 1 for the q-th dataset and TotalDS is the total number of source datasets in the evaluation.
  69. So the question now: what is this semantic representation of the web?