Profile-based Dataset Recommendation for RDF Data Linking

PhD Thesis Defense of: Mohamed BEN ELLEFI
LIRMM – Montpellier, France
19/12/2016
Thesis Supervisors:
Zohra BELLAHSENE
Konstantin TODOROV

This work is financed by:
• 7 industrial and academic partners join forces to discover, test, and implement big open data.
http://www.datalyse.fr/

Outline
o Context
o Vocabulary Recommendation with Datavore
o Datasets Recommendation: Problem Statement
o Datasets Recommendation: Topic Profile-based Approach
o Datasets Recommendation: Intensional Profile-based Approach
o Conclusion & Open Issues

[Figure: example graph of resources connected by sameAs, knows and worksOn links]

Web Evolving
• Web of Hypertext: documents (Doc) connected by hyperlinks
• Linked Data: 12 datasets in 2007, 570 datasets in 2014
The Linking Open Data cloud diagram: http://lod-cloud.net/

Linked Data Life-cycle
Modeling
Publishing
Conversion
Interlinking
Published Linked Data
Raw Data
Maintaining

Linked Data Life-cycle (1/4)
Modeling:
• Vocabulary Search
• Vocabulary Terms Selection
• Vocabulary Edition

Conversion:
• Transforming information from raw data source to RDF data using the selected vocabulary…

Publishing:
• Hosting the linked dataset and its metadata publicly and making them accessible…

Interlinking:
• Datasets Search
• Candidate Selection
• Data Linking

Focus on:
• Modeling: Recommending Vocabulary Terms
• Interlinking: Recommending Candidate Datasets


Focus on:
• Modeling: Recommending Vocabulary Terms


Motivation: Modeling Linked Data
“.. whatever the domain of your vocabulary, someone else has probably done it already.”
(Cookbook for translating relational data models to RDF schemas)
Reusing existing vocabularies:
• Ontology search engines, e.g., the trusted LOV [1] (information on more than 500 vocabularies) …
• Ontology development tools, e.g., Protégé [2] …
Questions raised:
• What keywords to use for the search
• How to select vocabularies
• Which metadata can help for modeling
[1] http://lov.okfn.org/
[2] http://protege.stanford.edu

Datavore Ecosystem
[Figure: Datavore architecture, steps 1 to 6]
Input: input dataset (text, RD)
Services used by Datavore:
• Translator API: cleaning, translating
• LOV search (keywords) and LOV SPARQL endpoint (Linked Open Vocabularies): terms search, terms extraction, metadata extraction, triples extraction
Output:
• Ranked lists of vocabulary terms.
• The corresponding metadata.
• Triple suggestions.

Datavore Tool
• A GUI desktop application: http://www.lirmm.fr/benellefi/Datavore_ExeFile
• A demonstration video: http://www.lirmm.fr/benellefi/Datavore_VideoDemo
Mohamed Ben Ellefi, Zohra Bellahsene, Konstantin Todorov. Datavore: A Vocabulary Recommender Tool Assisting Linked Data Modeling. BDA 2015.
Mohamed Ben Ellefi, Zohra Bellahsene, Konstantin Todorov. Datavore: A Vocabulary Recommender Tool Assisting Linked Data Modeling. (Posters & Demos) ISWC 2015.


Focus on:
• Interlinking: Recommending Candidate Datasets


Entity Linking Challenges
“A dataset is a set of RDF triples that are published, maintained or aggregated by a single provider.”
(Source: https://www.w3.org/TR/void/#dataset)
• Data reuse and in-links are focused on trusted reference graphs, e.g., DBpedia, Freebase, etc.
• Few datasets are actually used…
• A long tail of potentially suitable yet under-recognized datasets.

Candidate Datasets Selection: Problem Statement (1/2)
Source: “How to find candidates to link my lovely dataset?”

Candidate Datasets Selection: Problem Statement (2/2)
• Dataset recommendation for data linking is the task of computing a rank score for each of a set of target datasets with respect to a source dataset.
• The rank score indicates the relatedness between the source and the target dataset.
Source, on receiving the candidates: “Thank you for the recommendations”
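
The problem statement above can be read as a generic ranking routine. The following is a minimal sketch only: the function name `recommend`, the dataset names and the toy scores are hypothetical, and the relatedness function is passed in as a parameter because the slide does not fix one at this point.

```python
from typing import Callable, Iterable, List, Tuple


def recommend(source: str,
              targets: Iterable[str],
              score: Callable[[str, str], float]) -> List[Tuple[str, float]]:
    """Return (target, rank score) pairs, most related target first."""
    ranked = [(t, score(source, t)) for t in targets if t != source]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


# Toy relatedness scores, invented purely for illustration.
toy_scores = {("src", "dbpedia"): 0.9, ("src", "yago"): 0.4, ("src", "geonames"): 0.7}
ranking = recommend("src", ["yago", "dbpedia", "geonames"],
                    lambda s, t: toy_scores[(s, t)])
```

Any of the profile-based similarities discussed later could be plugged in as `score`.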

Related Work
(1) Nikolov and d'Aquin, 2011
A keyword-based search approach:
(i) Extracts literals from instances of the source dataset and searches Sig.ma for potentially relevant entities.
(ii) Filters out irrelevant datasets by measuring semantic concept similarities (ontology matching, OM).
(2) Mehdi et al., 2014
1) Input: a set of domain-specific keywords provided manually by an expert.
2) For each keyword, the system runs a comparison against a set of eight queries: {original-case, proper-case, lower-case, upper-case} × {no-lang-tag, @en-tag}.
3) The output consists of a list of target datasets.
(3) Leme et al., 2013
The ranking is based on Bayesian criteria and on the popularity (existing links) of the datasets.
(4) Lopes et al., 2014
Extends (3) by exploring the correlation between different sets of features (properties, classes and vocabularies) and the links to compute new rank score functions. Recall of 100% | MAP of 60%.
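
To make the eight query variants of Mehdi et al. (2) concrete, here is a minimal sketch that enumerates them for one keyword. It assumes that "proper-case" means title case and that each variant is rendered as a quoted literal with an optional `@en` tag; both choices, and the name `query_variants`, are illustrative rather than taken from the original system.

```python
from itertools import product
from typing import List


def query_variants(keyword: str) -> List[str]:
    """Enumerate {original, proper, lower, upper} x {no tag, @en} literals."""
    cases = [keyword, keyword.title(), keyword.lower(), keyword.upper()]
    tags = ["", "@en"]
    return [f'"{case}"{tag}' for case, tag in product(cases, tags)]


variants = query_variants("linked Data")
```

Four casings times two language-tag options gives the eight queries mentioned on the slide.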

Related Work: limitations
(1) Sig.ma is currently down!
(2) Costly input (keywords provided by an expert)!
(3) Cold start problem!
(4) To improve efficiency!
• To deal with real world LOD datasets.
• To provide a new recommender system with greater efficiency.


Profile-based Recommendation: Motivation
Homer Simpson vs. Peter Griffin: similar taste, same profile (taste) behavior!
[Figure: both characters buy the same items; an item bought by one is recommended to the other]

Profile-based Recommendation: Motivation
Hypothesis:
Linked Data: if two datasets are strongly similar (profile-based similarity), we can consider that they may have the same connectivity behaviour…
What would a dataset profile be?

Semantic Web Datasets Profiling
(*) An RDF dataset profile can be seen as the formal representation of a set of features that describe a dataset and allow the comparison of different datasets with regard to their characteristics. The feature set is dependent on a given application scenario and task.
Dataset Profile Features:
• Semantic: Domain/Topic, Context, Index elements, Schema/Instances
• Qualitative: Trust, Accessibility, Representation, Context, Degree of connectivity
• Statistical: Schema Level, Instance Level
• Temporal: Global, Instance-specific, Semantics-specific
(*) Mohamed Ben Ellefi, Zohra Bellahsene, John Breslin, Elena Demidova, Stefan Dietze, Julian Szymanski, Konstantin Todorov. Dataset Profiling - a Guide to Features, Methods, Applications and Vocabularies. Under major revision at the Semantic Web Journal.

Topic Dataset Profile
Semantic feature: Topic/Domain
[Figure: datasets (Semantic Web Dog Food, l3s-dblp) linked to topics (Semantic Web, Data management, WWW Consortium standards, Information retrieval)]
B. Fetahu, S. Dietze, B. Pereira Nunes, M. Antonio Casanova, D. Taibi, and W. Nejdl. A scalable approach for efficiently generating structured dataset topic profiles. In Proceedings of the 11th ESWC, 2014.

Topic Profile-based Datasets Recommendation
o Steps 1-3: Preprocessing / Learning step
o Step 4: Ranking system

Learning / Preprocessing
[Figure: each source dataset is described by its topics and their weights; Connectivity(Di, Dj) is computed between datasets]

Topics Signatures
o A dataset is modeled as a set of topics: a dataset's profile.
o Inversely, a topic is modeled as a set of datasets assigned to it: a topic's signature.
o σ is a connectivity behaviour measure.
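
The profile/signature duality above amounts to inverting a mapping. The sketch below is illustrative only: the dataset and topic names echo the earlier topic profile figure, but the assignments of topics to datasets are invented, and the connectivity measure σ is not modeled.

```python
from collections import defaultdict
from typing import Dict, Set


def topic_signatures(profiles: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Invert dataset -> topics (profiles) into topic -> datasets (signatures)."""
    signatures = defaultdict(set)
    for dataset, topics in profiles.items():
        for topic in topics:
            signatures[topic].add(dataset)
    return dict(signatures)


# Hypothetical profiles, for illustration only.
profiles = {
    "semantic-web-dog-food": {"Semantic Web", "Information retrieval"},
    "l3s-dblp": {"Data management", "Information retrieval"},
}
signatures = topic_signatures(profiles)
```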

Target Datasets Ranking
Let D0 be a new dataset to be linked:
1- Extract Profile(D0).
2- Constitute a pool of target datasets from the signatures (the result of the learning step).
3- Rank the target datasets.
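
As a rough illustration of steps 2 and 3, the sketch below gathers a candidate pool from the signatures of the topics in Profile(D0) and orders it. The scoring used here, a simple count of shared topics, is a stand-in assumption and not the ranking function of the thesis, which relies on the learned connectivity behaviour; the function name is hypothetical as well.

```python
from typing import Dict, List, Set, Tuple


def rank_by_shared_topics(profile_d0: Set[str],
                          signatures: Dict[str, Set[str]]) -> List[Tuple[str, int]]:
    """Pool every dataset found in the signatures of D0's topics, then rank.

    Stand-in score (assumption): number of D0 topics whose signature
    contains the candidate dataset.
    """
    counts: Dict[str, int] = {}
    for topic in profile_d0:
        for dataset in signatures.get(topic, ()):
            counts[dataset] = counts.get(dataset, 0) + 1
    # Highest count first; ties broken alphabetically for determinism.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


toy_signatures = {"t1": {"A", "B"}, "t2": {"B"}, "t3": {"C"}}
toy_ranking = rank_by_shared_topics({"t1", "t2"}, toy_signatures)
```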

Experimental Setup
Training Set:
• The topic profiles graph → from its available SPARQL endpoint: http://data-observatory.org/lod-profiles/profile-explorer.
• 76 datasets and 185 392 topics.
• The evaluation data (ED) → the current topology of the LOD-Cloud, obtained using the datahub2void tool (https://github.com/lod-cloud/datahub2void).
o We made the ED available on http://www.lirmm.fr/benellefi/void.ttl
Testing Set:
• Source Datasets: all the 76 datasets indexed by the topic profiles graph.
• Target Datasets: 258 datasets from the LOD cloud group (http://datahub.io/group/lodcloud).

Evaluation Framework
• Leave-one-out (5-fold cross-validation).
(1) Select a source dataset in the evaluation data.
(2) Consider the dataset as unlinked (its owl:sameAs links to Target 1, Target 2, …, Target n are hidden).
(3) Recommend target datasets using our system.
(4) Evaluate the recommendation against the hidden links.
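
The four evaluation steps can be sketched as a small loop. This is an illustrative outline under stated assumptions: `linksets` stands for the evaluation data as a hypothetical mapping from each source dataset to its true targets, and `recommend` is a placeholder for the recommender under test.

```python
from typing import Callable, Dict, Iterable, Set, Tuple


def precision_recall_f1(recommended: Iterable[str],
                        true_targets: Iterable[str]) -> Tuple[float, float, float]:
    """Set-based precision, recall and F1 of one recommendation list."""
    rec, truth = set(recommended), set(true_targets)
    tp = len(rec & truth)
    precision = tp / len(rec) if rec else 0.0
    recall = tp / len(truth) if truth else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def leave_one_out(linksets: Dict[str, Set[str]],
                  recommend: Callable[[str, Dict[str, Set[str]]], Iterable[str]]):
    results = {}
    for source, true_targets in linksets.items():       # (1) select a source
        known = {s: t for s, t in linksets.items() if s != source}  # (2) hide its links
        recommended = recommend(source, known)           # (3) recommend
        results[source] = precision_recall_f1(recommended, true_targets)  # (4) evaluate
    return results


toy_links = {"a": {"x", "y"}, "b": {"x"}}
toy_results = leave_one_out(toy_links, lambda source, known: {"x", "z"})
```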

Evaluation Results (1/3)
Recall/Precision/F1-Score over all DS ∈ ED
- Average recall: 81%.
- 59% of the DS have a recall of 100%.
- Average precision: 19%.
False positive rate?

FP-Rate over all DS ∈ ED
FP overestimation: a small average FP-Rate of 13%

Search Space Reduction over all DS ∈ ED
• The original search space size: 258 datasets.
• The average search space reduction is up to 86%.
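
For orientation, the two quantities reported above can be computed as follows. The formulas are the usual textbook definitions and are assumed here, since the slides do not spell them out: the false positive rate is FP / (FP + TN), and the search space reduction is one minus the fraction of the 258 candidates that is recommended.

```python
from typing import Iterable


def fp_rate(recommended: Iterable[str],
            true_targets: Iterable[str],
            all_candidates: Iterable[str]) -> float:
    """False positive rate: wrongly recommended / all non-target candidates."""
    negatives = set(all_candidates) - set(true_targets)
    false_positives = (set(recommended) - set(true_targets)) & negatives
    return len(false_positives) / len(negatives) if negatives else 0.0


def search_space_reduction(n_recommended: int, n_total: int = 258) -> float:
    """Share of the original search space that no longer has to be examined."""
    return 1 - n_recommended / n_total


toy_fp = fp_rate({"a", "b"}, {"a"}, {"a", "b", "c", "d", "e"})
toy_reduction = search_space_reduction(36)
```

Under this assumed definition, a recommendation list of 36 datasets out of 258 would correspond to a reduction of roughly 86%.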

Baselines & Comparison (1/3)
[Figure: definitions of the three baselines, numbered 1 to 3]
Baselines are available on http://www.lirmm.fr/benellefi/Baselines.rar.

Recall values of our approach vs. baselines over all DS ∈ ED
• Baselines fail to provide any results at all for some datasets.
• Our approach is more stable and outperforms the baselines in the majority of recommendations.


AVG Precision, Recall and F1-score values over all recommendation lists for all source datasets:
                Our approach   Shared Keywords   Shared Linksets   Shared Topics Profiles
AVG Precision   19%            9%                9%                3%
AVG Recall      81%            41%               11%               13%
AVG F1-Score    24%            10%               8%                4%
Note:
• The baseline approaches have produced better results than our system in a limited number of cases.
• The shared tags baseline generated an F-Score of 100% on two datasets. This is because these two datasets come from the same provenance (data.oceandrilling.org) and therefore share the same set of tags.


Discussion
Topic-profiles dataset recommendation approach:
Advantages:
• Original search space reduction
• Average recall: 81% & average precision: 19%.
• Ranking results available on http://www.lirmm.fr/benellefi/results.csv.
Challenges:
• Precision needs to be improved.
• Learning data is not complete.
→ Breaking up with the learning step.
Mohamed Ben Ellefi, Zohra Bellahsene, Stefan Dietze and Konstantin Todorov. Beyond Established Knowledge Graphs Recommending Web Datasets for Data Linking. ICWE 2016.

Outline
o Context
o Vocabulary Recommendation with Datavore
o Datasets Recommendation: Problem Statement
o Datasets Recommendation: Topic Profile-based Approach
o Datasets Recommendation: Intensional Profile-based Approach
o Conclusion & Open Issues

Motivation
Coreference resolution:
• Different datasets may contain different resources that refer to the same real world entity.
• Following the LD best practices, these resources are generally represented by the same or similar types.
Hypothesis
Datasets that share at least one pair of similar concepts are likely to contain at least one potential pair of instances to be linked, i.e., an “owl:sameAs” statement.


Hypothesis: An Example
Types of the resource Montpellier in three datasets:
• dbpedia.org: dbo:Place, dbo:Location, dbo:Settlement, schema:Place, umbel-rc:PopulatedPlace, yago:Commune108541609, …
• linkedgeodata.org: lgeod:Place, lgeod:City
• yago-knowledge.org: yago:District, yago:Municipality, yago:Region, yago:Town, yago:Commune, …
Similar concepts:
• City ≈ Town
• Place = Place
• Settlement ≈ Commune

Similarity Measures to Use:
• Known from ontology matching, WordNet-based similarity:
o Wu Palmer
o Lin's
• UMBC measure [1]: combines semantic distance in WordNet with frequency of occurrence and co-occurrence of terms in a large external corpus (the web).
[1] L. Han, A. L. Kashyap, T. Finin, J. Mayfield, and J. Weese. UMBC EBIQUITY-CORE: Semantic textual similarity systems. In Proc. of *SEM, Association for Computational Linguistics, 2013.
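
For reference, the two WordNet-based measures named above are commonly defined as follows. These are the standard textbook formulations rather than formulas taken from the slides; $\mathrm{lcs}(c_1, c_2)$ denotes the least common subsumer of the two concepts in the WordNet hierarchy, $\mathrm{depth}$ the depth of a concept in that hierarchy, and $IC(c) = -\log p(c)$ the information content of a concept estimated from a corpus.

```latex
\[
\mathrm{sim}_{\mathrm{WuPalmer}}(c_1, c_2)
  = \frac{2 \cdot \mathrm{depth}\bigl(\mathrm{lcs}(c_1, c_2)\bigr)}
         {\mathrm{depth}(c_1) + \mathrm{depth}(c_2)}
\qquad
\mathrm{sim}_{\mathrm{Lin}}(c_1, c_2)
  = \frac{2 \cdot IC\bigl(\mathrm{lcs}(c_1, c_2)\bigr)}
         {IC(c_1) + IC(c_2)}
\]
```

Both measures take values in [0, 1], which is why a single threshold θ can later be applied to each of them.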

Intensional Approach to Datasets Recommendation
1. Preprocessing
2. Target Datasets Filtering (CCD*)
3. Datasets Ranking (Cosine)
* CCD = Cluster of Comparable Datasets

Intensional Approach to Datasets Recommendation (1/3)
Step 1, Preprocessing: DS and each DT are profiled into PL(DS), PD(DS), PL(DT), PD(DT).
• Dataset Label Profile, PL(DS): a set of n schema concept labels corresponding to DS.
• Dataset Document Profile, PD(DS): the concatenation of PL(DS) and the textual descriptions of the n schema concepts.
Example: Montpellier ∈ DS; <Montpellier, rdf:type, dbo:Town>
• PL(DS) = "town", …
• PD(DS) = “…Usually, a town is thought of as larger than a village but smaller than a city, though there are exceptions to this rule.”

Target Datasets Filtering
Step 2:
• Two datasets DS and DT are comparable if there exists at least one pair of similar labels between their label profiles, according to sim(PL(DS), PL(DT)) and a threshold θ.
• We identify CCD(DS), a Cluster of Comparable Datasets related to DS.
• All the linking candidates for DS are found in its cluster CCD(DS).
• The next step consists of ranking the DT's in CCD(DS).
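
The filtering step can be sketched as below. The similarity function is a parameter so that Wu Palmer, Lin or UMBC could be plugged in; the toy similarity table, the use of `>=` for the threshold test, and the names `comparable` and `ccd` are assumptions made for illustration.

```python
from typing import Callable, Dict, Set


def comparable(labels_s: Set[str], labels_t: Set[str],
               sim: Callable[[str, str], float], theta: float) -> bool:
    """True if at least one label pair reaches the similarity threshold."""
    return any(sim(a, b) >= theta for a in labels_s for b in labels_t)


def ccd(source: str, label_profiles: Dict[str, Set[str]],
        sim: Callable[[str, str], float], theta: float) -> Set[str]:
    """Cluster of Comparable Datasets for `source`."""
    source_labels = label_profiles[source]
    return {name for name, labels in label_profiles.items()
            if name != source and comparable(source_labels, labels, sim, theta)}


def toy_sim(a: str, b: str) -> float:
    """Toy similarity: identical labels score 1.0, city/town 0.8, others 0."""
    if a == b:
        return 1.0
    return {frozenset(("city", "town")): 0.8}.get(frozenset((a, b)), 0.0)


toy_profiles = {"DS": {"town", "person"}, "DT1": {"city"}, "DT2": {"protein"}}
toy_cluster = ccd("DS", toy_profiles, toy_sim, theta=0.7)
```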

Datasets Ranking
Step 3:
1) Forming a corpus from the profile documents: PD(DS) and all PD(DT).
2) Building a vector space model by indexing the documents in the corpus.
3) Computing TF-IDF + cosine similarity between the document vectors in the corpus.
4) Ranking each DT in the cluster CCD(DS) with respect to DS.
• A mapping between datasets is returned based on their label profiles: PL(DS), PL(DT).
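
A minimal, dependency-free sketch of the ranking steps follows. It assumes whitespace tokenization, a plain `tf * log(N / df)` weighting, and a corpus restricted to the source and its cluster; the actual indexing and weighting variant used in the thesis is not specified on the slide, and the example documents are invented.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def tfidf_vectors(docs: Dict[str, str]) -> Dict[str, Dict[str, float]]:
    """Build one sparse TF-IDF vector per document."""
    tokenized = {name: text.lower().split() for name, text in docs.items()}
    n_docs = len(tokenized)
    df = Counter(term for tokens in tokenized.values() for term in set(tokens))
    vectors = {}
    for name, tokens in tokenized.items():
        tf = Counter(tokens)
        vectors[name] = {term: (count / len(tokens)) * math.log(n_docs / df[term])
                         for term, count in tf.items()}
    return vectors


def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_cluster(source: str, cluster: List[str],
                 docs: Dict[str, str]) -> List[Tuple[str, float]]:
    """Rank every dataset of the cluster by cosine similarity to the source."""
    vectors = tfidf_vectors({name: docs[name] for name in [source, *cluster]})
    scores = {target: cosine(vectors[source], vectors[target]) for target in cluster}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


toy_docs = {"DS": "town city place",
            "DT1": "city town municipality",
            "DT2": "protein gene"}
toy_rank = rank_cluster("DS", ["DT1", "DT2"], toy_docs)
```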


Experimental Setup
• Leave-one-out evaluation.
• LOD-Cloud (http://datahub.io/group/lodcloud): 90 responsive datasets.
• The profile PD → from the LOV (lov.okfn.org).
• Wu Palmer & Lin's → the 2013 WS4J Java API (https://code.google.com/archive/p/ws4j/).
• The UMBC measure → its available web API (http://swoogle.umbc.edu/SimService).
• The evaluation data (ED) → the current topology of the LOD-Cloud (the ED is available on http://www.lirmm.fr/benellefi/void.ttl).

• For each DS ∈ ED, we evaluated the selection of target datasets in the cluster CCD(DS) in terms of recall.
Result: the recall value remains 100% in the following threshold θ intervals:
o Wu Palmer: θ ∈ [0, 0.9]
o Lin: θ ∈ [0, 0.8]
o UMBC: θ ∈ [0, 0.7]
• In the following, the evaluation is restricted to these intervals in order to guarantee a recall of 100% in the ED.

Parameter tuning:
MAP@R over all DS ∈ ED
o Wu Palmer: θ = 0.9
o Lin: θ = 0.8
o UMBC: θ = 0.7

Mean Precision@K over all DS ∈ ED using the three similarity measures, each at its best setting:
                      P@5    P@10   P@15   P@20
Wu Palmer (θ = 0.9)   0.56   0.52   0.53   0.51
Lin (θ = 0.8)         0.57   0.54   0.55   0.51
UMBC (θ = 0.7)        0.58   0.54   0.53   0.53
UMBC @ θ = 0.7 is the best setting for our ranking approach.
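
The ranking metrics used in these results can be sketched as follows. Precision@K is standard; for the "@R" variant the sketch assumes that R is the number of relevant (truly linked) targets of the source and that average precision is truncated at rank R, which is one common reading and may differ in detail from the thesis.

```python
from typing import List, Set


def precision_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k recommendations that are relevant."""
    return sum(1 for d in ranked[:k] if d in relevant) / k


def average_precision_at_r(ranked: List[str], relevant: Set[str]) -> float:
    """Average precision truncated at R = number of relevant targets."""
    r = len(relevant)
    if r == 0:
        return 0.0
    hits, total = 0, 0.0
    for rank, dataset in enumerate(ranked[:r], start=1):
        if dataset in relevant:
            hits += 1
            total += hits / rank
    return total / r


toy_ranked = ["a", "b", "c", "d"]
toy_relevant = {"a", "c"}
```

MAP@R would then be the mean of `average_precision_at_r` over all source datasets in the evaluation data.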

• Baseline #1: All datasets are represented by their document profiles (PD).
1) A vector space model built by indexing the PD → no CCD clusters.
2) A TF-IDF + cosine similarity between the document vectors.
• Baseline #2: All datasets are represented by their label profiles (PL).
1) CCD(DS) using UMBC @ θ = 0.7.
2) AvgUMBC is a ranking function that assigns a score to each DT ∈ CCD(DS) → no TF-IDF + cosine similarity.

P@R over all DS ∈ ED: Baseline #1 vs. Baseline #2 vs. Proposed Approach
• Efficiency with an AVG P@R up to 53%, compared to 49% and 39% for the baselines.
• Precision up to 100% for DS from geographic and governmental domains.
• Baseline #2 produces better results for a limited number of datasets:
o RKB explorer datasets share a high number of identical labels in their PL.

Result Analysis
• A high performance
o An average precision of 53% for a recall of 100%.
o Independence of the dataset size or the schema cardinality.
• False positives overestimation.
• A fairer evaluation can be given if better ED are used.
o The ED is far from being complete as a ground truth.
o We ran “SILK” on some FP recommendations. Examples of discovered owl:sameAs linksets involve: rkb-explorer-unlocode, yovisto, datos-bcn-cl, datos-bcn-uk.
Mohamed Ben Ellefi, Zohra Bellahsene, Stefan Dietze and Konstantin Todorov. Dataset Recommendation for Data Linking: an Intensional Approach. In Proceedings of the 13th ESWC, 2016.

Topic Profiles-based vs. Intensional-based
Dataset recommendation: topic profiles-based approach
• Semantic profile features: topic profiles
• Learning data dependency.
• Average recall: 81%
• Mean average precision: 19%.
• A new topic profiles propagation approach
Ranking results available on http://www.lirmm.fr/benellefi/results.csv.
Dataset recommendation: intensional profile-based approach
• Semantic profile features: intensional profile
• No learning step.
• Average recall: 100%
• Mean average precision: 53%.
• A mapping between the schema concepts
Ranking results available on http://www.lirmm.fr/benellefi/CCD-CosineRank_Result.csv.


Conclusion
• The choice of the profile features is dependent on a given application scenario and task.
• Learning data dependency:
o Learning data is not complete → effectiveness to be improved.
o The learning break-off → much greater efficiency.
• Better performance with datasets having high quality intensional profiles:
o The awareness of the richness of datasets schema descriptions.
o The importance of reusing existing vocabulary terms: e.g., tools such as Datavore can ease the vocabulary reuse task in linked data modeling.

Open Issues
Dataset recommendation:
o A reliable ground truth (e.g., crowdsourcing-based) and benchmark data.
o To improve the quality of the intensional profiles:
• the population of the schema elements,
• the dataset context,
• multilingualism, …
o Investigating the effectiveness of ML techniques.
Vocabulary recommendation:
o A new evaluation framework for the linked data modeling process, e.g., in a user study manner and crowdsourcing-based.
o To examine the vocabulary terms ranking strategies based on the data structure factors, e.g., tabular sources modeling vs. web pages annotation.

• Mohamed Ben Ellefi, Zohra Bellahsene, Stefan Dietze and Konstantin Todorov. Dataset Recommendation for Data Linking: an Intensional Approach. In Proceedings of the 13th ESWC 2016; Crete, Greece.
• Mohamed Ben Ellefi, Zohra Bellahsene, Stefan Dietze and Konstantin Todorov. Beyond Established Knowledge Graphs Recommending Web Datasets for Data Linking. In Proceedings of the 16th ICWE 2016; Lugano, Switzerland.
• Manel Achichi, Mohamed Ben Ellefi, Danai Symeonidou, Konstantin Todorov. Automatic Key Selection for Data Linking. In Proceedings of the 20th EKAW 2016; Bologna, Italy.
• Mohamed Ben Ellefi, Zohra Bellahsene, Konstantin Todorov. Datavore: A Vocabulary Recommender Tool Assisting Linked Data Modeling. (Posters & Demos) ISWC 2015; Bethlehem, PA, USA.
o This paper was also presented at BDA 2015, Île de Porquerolles, France.
• Mohamed Ben Ellefi, Zohra Bellahsene, François Scharffe, Konstantin Todorov. Towards Semantic Dataset Profiling. In PROFILES@ESWC, Crete, Greece, 2014.
