This document summarizes Mohamed Ben Ellefi's PhD thesis defense on profile-based dataset recommendation for RDF data linking. The thesis proposes two approaches: a topic profile-based approach and an intensional profile-based approach. The topic profile-based approach models datasets as topics and recommends target datasets based on similarity between source and target topic profiles, achieving an average recall of 81% and reducing the search space by 86%. The approach shows better performance than baselines but needs improvement on precision.
Profile-based Dataset Recommendation for RDF Data Linking
1. Profile-based Dataset Recommendation
for RDF Data Linking
PhD Thesis Defense of: Mohamed BEN ELLEFI
LIRMM – Montpellier, France
19/12/2016
Thesis Supervisors:
Zohra BELLAHSENE
Konstantin TODOROV
2. This work is financed by the Datalyse project (http://www.datalyse.fr/):
7 industrial and academic partners join forces to discover, test, and implement big open data.
3. Outline
o Context
o Vocabulary Recommendation with Datavore
o Datasets Recommendation: Problem Statement
o Datasets Recommendation: Topic Profile-based Approach
o Datasets Recommendation: Intensional Profile-based Approach
o Conclusion & Open Issues
7. Web Evolving
Web of Hypertext: documents (Doc) connected by hyperlinks.
Linked Data: 12 datasets in 2007, 570 datasets in 2014.
The Linking Open Data cloud diagram: http://lod-cloud.net/
17. Motivation: Modeling Linked Data
“... whatever the domain of your vocabulary, someone else has probably done it already.”
--- Cookbook for translating relational data models to RDF schemas
Reusing existing vocabularies:
Ontology search engines: e.g., the trusted LOV [1] (information on more than 500 vocabularies) …
Ontology development tools: e.g., Protégé [2] …
• What keywords to use for the search
• How to select vocabularies
• Which metadata can help for modeling
[1] http://lov.okfn.org/ ; [2] http://protege.stanford.edu
19. Datavore Tool
• A GUI desktop application: http://www.lirmm.fr/benellefi/Datavore_ExeFile
• A demonstration video: http://www.lirmm.fr/benellefi/Datavore_VideoDemo
Mohamed Ben Ellefi, Zohra Bellahsene, Konstantin Todorov. Datavore: A Vocabulary Recommender Tool Assisting Linked Data Modeling. BDA 2015.
Mohamed Ben Ellefi, Zohra Bellahsene, Konstantin Todorov. Datavore: A Vocabulary Recommender Tool Assisting Linked Data Modeling. (Posters & Demos) ISWC 2015.
23. Entity Linking Challenges
“A dataset is a set of RDF triples that are published, maintained or aggregated by a single provider.”
--- Source: https://www.w3.org/TR/void/#dataset
Data reuse and in-links are focused on trusted reference graphs, e.g., DBpedia, Freebase, etc.
Few datasets are actually used…
A long tail of potentially suitable yet under-recognized datasets.
25. Candidate Datasets Selection: Problem Statement (2/2)
Dataset recommendation for data linking is the task of computing a rank score for each of a set of target datasets with respect to a source dataset.
The rank score indicates the relatedness between the source and the target dataset.
[Figure: a source dataset receives a list of candidate datasets; caption: "Thank you for the recommendations"]
27. Related Work
(1) Nikolov and d'Aquin, 2011
A keyword-based search approach:
(i) Extracts literals from instances of the source dataset and searches Sig.ma for potentially relevant entities.
(ii) Filters out irrelevant datasets by measuring semantic concept similarities (OM).
Sig.ma is currently down!
(2) Mehdi et al., 2014
1) The input is a set of domain-specific keywords provided manually by an expert.
2) For each keyword, the system runs a set of eight queries: {original-case, proper-case, lower-case, upper-case} × {no-lang-tag, @en-tag}.
3) The output consists of a list of target datasets.
Costly input!
(3) Leme et al., 2013
The ranking is based on Bayesian criteria and on the popularity (existing links) of the datasets.
Cold-start problem!
(4) Lopes et al., 2014
Extends (3) by exploring the correlation between different sets of features (properties, classes and vocabularies) and the links, in order to compute new rank score functions. Recall of 100% | MAP of 60%.
Efficiency to be improved!
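The eight query variants used per keyword in (2) can be illustrated with a short Python sketch. It is only an illustration of the 4 × 2 combination: the helper name `keyword_variants`, the use of title case for "proper-case" and the quoted-literal syntax are assumptions, not details taken from Mehdi et al.

```python
from itertools import product

# Assumed case transformations; "proper-case" is approximated by str.title.
CASE_VARIANTS = {
    "original-case": lambda s: s,
    "proper-case": str.title,
    "lower-case": str.lower,
    "upper-case": str.upper,
}
# Assumed literal suffixes: plain literal vs. English language tag.
LANG_TAGS = {"no-lang-tag": "", "@en-tag": "@en"}


def keyword_variants(keyword):
    """Return the 4 x 2 = 8 literal forms of one keyword (hypothetical helper)."""
    return [
        f'"{case_fn(keyword)}"{tag}'
        for (_, case_fn), (_, tag) in product(CASE_VARIANTS.items(), LANG_TAGS.items())
    ]


variants = keyword_variants("open Data")
for v in variants:
    print(v)
```

Each expert-provided keyword therefore costs eight lookups, which is one reason the manual input is described as costly.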
Our aims:
• To deal with real-world LOD datasets.
• To provide a new recommender system with greater efficiency.
31. Profile-based Recommendation: Motivation
Hypothesis:
Linked Data: if two datasets are strongly similar (profile-based similarity), we can consider that they may have the same connectivity behaviour…
What would a dataset profile be?
32. Semantic Web Datasets Profiling
(*) An RDF dataset profile can be seen as the formal representation of a set of features that describe a dataset and allow the comparison of different datasets with regard to their characteristics. The feature set is dependent on a given application scenario and task.
Dataset Profile Features:
o Semantic: Domain/Topic, Context, Index elements, Schema/Instances
o Qualitative: Trust, Accessibility, Representation, Context, Degree of connectivity
o Statistical: Schema Level, Instance Level
o Temporal: Global, Instance-specific, Semantics-specific
(*) Mohamed Ben Ellefi, Zohra Bellahsene, John Breslin, Elena Demidova, Stefan Dietze, Julian Szymanski, Konstantin Todorov. Dataset Profiling - a Guide to Features, Methods, Applications and Vocabularies. Under major revision at the Semantic Web Journal.
38. Learning / Preprocessing: Topic Signatures
o A dataset is modeled as a set of topics: the dataset's profile.
o Inversely, a topic is modeled as a set of datasets assigned to it: the topic's signature.
𝝈 is a connectivity behaviour measure.
39. Target Datasets Ranking
Let D0 be a new dataset to be linked:
1- Extract Profile(D0).
2- Constitute a pool of target datasets from the signatures (the result of the learning step).
3- Ranking target datasets:
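Steps 1 and 2 can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the dataset names, topic weights and the `min_weight` significance threshold are invented, and the ranking score of step 3 is deliberately left out because its formula is not given in the text.

```python
from collections import defaultdict

# Toy topic profiles: dataset -> {topic: weight}. Names and weights are illustrative.
profiles = {
    "ds_a": {"Geography": 0.9, "Cities": 0.6},
    "ds_b": {"Geography": 0.7},
    "ds_c": {"Music": 0.8},
}


def topic_signatures(profiles, min_weight=0.5):
    """Invert dataset profiles into topic -> {dataset: weight}.

    Only datasets whose weight exceeds min_weight (an assumed
    significance threshold) are kept in a topic's signature.
    """
    signatures = defaultdict(dict)
    for dataset, topics in profiles.items():
        for topic, weight in topics.items():
            if weight > min_weight:
                signatures[topic][dataset] = weight
    return dict(signatures)


def candidate_pool(source_profile, signatures):
    """Collect every dataset found in the signature of one of the source's topics."""
    pool = set()
    for topic in source_profile:
        pool.update(signatures.get(topic, {}))
    return pool


signatures = topic_signatures(profiles)
new_profile = {"Geography": 0.8}  # stands in for Profile(D0)
pool = candidate_pool(new_profile, signatures)
print(sorted(pool))
```

The pool is the reduced search space; a ranking function would then order its members with respect to D0.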
40. Experimental Setup
Training Set:
The topic profiles graph, obtained from its available SPARQL endpoint: http://data-observatory.org/lod-profiles/profile-explorer.
76 datasets and 185 392 topics.
The evaluation data (ED): the current topology of the LOD cloud, obtained using the datahub2void tool (https://github.com/lod-cloud/datahub2void).
o We made the ED available on http://www.lirmm.fr/benellefi/void.ttl
Testing Set:
Source Datasets: all 76 datasets indexed by the topic profiles graph.
Target Datasets: 258 datasets from the LOD cloud group (http://datahub.io/group/lodcloud).
41. Evaluation Framework
Leave-one-out (5-fold cross-validation).
(1) Select a source dataset in the evaluation data.
(2) Consider the dataset as unlinked.
(3) Recommend target datasets using our system.
(4) Evaluate the recommendation.
[Figure: the selected dataset and its owl:sameAs links to Target 1, Target 2, …, Target n]
42. Evaluation Results (1/3)
Recall/Precision/F1-Score over all DS ∈ ED
- Average recall: 81%.
- 59% of the DS have a recall of 100%.
- Average precision: 19%.
False Positive Rate?
44. Evaluation Results (3/3)
Search Space Reduction over all DS ∈ ED
The original search space size: 258 datasets.
The average search space size reduction is up to 86%.
45. Baselines & Comparison (1/3)
Baselines are available on http://www.lirmm.fr/benellefi/Baselines.rar.
46. Baselines & Comparison (2/3)
Recall values of our approach vs. baselines over all DS ∈ ED
Baselines fail to provide any results at all for some datasets.
Our approach is more stable and outperforms the baselines in the majority of recommendations.
48. Baselines & Comparison (3/3)
AVG Precision, Recall and F1-score values over all recommendation lists for all source datasets:

              | Our approach | Shared Keywords | Shared Linksets | Shared Topics Profiles
AVG Precision | 19%          | 9%              | 9%              | 3%
AVG Recall    | 81%          | 41%             | 11%             | 13%
AVG F1-Score  | 24%          | 10%             | 8%              | 4%

Note:
The baseline approaches have produced better results than our system in a limited number of cases.
The shared tags baseline generated an F-Score of 100% on oceandrilling-janusamp, which is connected to oceandrilling-codices.
This is due to the fact that these two datasets are tagged by the same provenance (data.oceandrilling.org) and share the same set of tags.
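The averaged figures can be read with the help of a small Python sketch. It assumes that precision, recall and F1 are computed per source dataset and then averaged (a macro average), which is what "over all recommendation lists for all source datasets" suggests; the recommendation lists below are invented. Under that assumption the average F1 need not equal the harmonic mean of the average precision and the average recall, which would explain why 19% and 81% sit next to an F1 of 24% rather than roughly 31%.

```python
def precision_recall_f1(recommended, relevant):
    """Precision, recall and F1 of one recommendation list against the true links."""
    recommended, relevant = set(recommended), set(relevant)
    hits = len(recommended & relevant)
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def macro_average(results):
    """Average each of the three measures over all source datasets."""
    n = len(results)
    return tuple(sum(r[i] for r in results) / n for i in range(3))


# Two invented source datasets with their recommended and truly linked targets.
runs = [
    precision_recall_f1(["t1", "t2", "t3", "t4"], ["t1"]),
    precision_recall_f1(["t1", "t2"], ["t1", "t5"]),
]
avg_p, avg_r, avg_f1 = macro_average(runs)
harmonic_of_averages = 2 * avg_p * avg_r / (avg_p + avg_r)
print(avg_p, avg_r, avg_f1, harmonic_of_averages)
```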
50. Discussion
Topic-profiles dataset recommendation approach:
Advantages:
Original search space reduction.
Average recall: 81% & average precision: 19%.
Ranking results available on http://www.lirmm.fr/benellefi/results.csv.
Challenges:
Precision needs to be improved.
Learning data is not complete.
Next step: breaking up with the learning step.
Mohamed Ben Ellefi, Zohra Bellahsene, Stefan Dietze and Konstantin Todorov. Beyond Established Knowledge Graphs: Recommending Web Datasets for Data Linking. ICWE 2016.
52. Motivation
Coreference resolution:
Different datasets may contain different resources that refer to the same real-world entity.
Following the LD best practices, these resources are generally represented by the same or similar types.
Hypothesis:
Datasets that share at least one pair of similar concepts are likely to contain at least one potential pair of instances to be linked, i.e., an “owl:sameAs” statement.
55. Similarity Measures to Use:
Known from ontology matching, WordNet-based similarity:
o Wu Palmer
o Lin's
UMBC measure [1]: combines semantic distance in WordNet with the frequency of occurrence and co-occurrence of terms in a large external corpus (the web).
[1] L. Han, A. L. Kashyap, T. Finin, J. Mayfield, and J. Weese. UMBC EBIQUITY-CORE: Semantic textual similarity systems. In Proc. of *SEM, Association for Computational Linguistics, 2013.
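To make the first measure concrete, here is a minimal Python sketch of the Wu-Palmer similarity, 2 · depth(lcs) / (depth(a) + depth(b)), on a toy taxonomy. The taxonomy is a hypothetical stand-in for WordNet, depth is counted with the root at depth 1 (one common convention), and the code is not the WS4J implementation used in the experiments.

```python
# Toy is-a taxonomy (child -> parent); a hypothetical stand-in for WordNet.
PARENT = {
    "location": "entity",
    "settlement": "location",
    "town": "settlement",
    "city": "settlement",
    "organism": "entity",
}


def path_to_root(concept):
    """Return the concept followed by all of its ancestors up to the root."""
    path = [concept]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def depth(concept):
    """Number of nodes on the path to the root (the root has depth 1)."""
    return len(path_to_root(concept))


def wu_palmer(a, b):
    """Wu-Palmer similarity based on the least common subsumer (lcs)."""
    ancestors_b = set(path_to_root(b))
    lcs = next(c for c in path_to_root(a) if c in ancestors_b)
    return 2 * depth(lcs) / (depth(a) + depth(b))


score = wu_palmer("town", "city")
print(score)
```

With this toy hierarchy, "town" and "city" share the deep ancestor "settlement" and score high, whereas "town" and "organism" only meet at the root and score low.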
57. Intensional Approach to Datasets Recommendation (1/3)
Step 1: Preprocessing (profiling of DS and DT)
Dataset Label Profile, PL(DS): a set of n schema concept labels corresponding to DS.
Dataset Document Profile, PD(DS): the concatenation of PL(DS) and the textual descriptions of the n schema concepts.
Example: Montpellier ∈ DS;
<Montpellier, rdf:type, dbo:Town>
PL(DS) = "town", …
PD(DS) = “…Usually, a town is thought of as larger than a village but smaller than a city, though there are exceptions to this rule.”
[Figure: DS and DT are each profiled into PL(DS), PD(DS) and PL(DT), PD(DT)]
58. Intensional Approach to Datasets Recommendation (2/3)
Step 2: Target Datasets Filtering
Two datasets DS and DT are comparable if there exists at least one similarity between their label profiles: sim(PL(DS), PL(DT)).
We identify CCD(DS), a Cluster of Comparable Datasets related to DS.
All the linking candidates for DS are found in its cluster CCD(DS).
The next step consists of ranking the DT's in CCD(DS).
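The filtering step can be sketched in Python as below. The sketch assumes that "comparable" means at least one pair of labels whose similarity reaches a threshold θ (the use of ≥ is an assumption), and `toy_sim` is a placeholder for a real measure such as Wu Palmer, Lin or UMBC; dataset names and labels are invented.

```python
def comparable(labels_s, labels_t, sim, theta):
    """True if at least one pair of labels has a similarity of at least theta."""
    return any(sim(a, b) >= theta for a in labels_s for b in labels_t)


def cluster_of_comparable_datasets(source, label_profiles, sim, theta):
    """Return CCD(source): every other dataset comparable to the source."""
    source_labels = label_profiles[source]
    return {
        name
        for name, labels in label_profiles.items()
        if name != source and comparable(source_labels, labels, sim, theta)
    }


def toy_sim(a, b):
    """Placeholder similarity; a real run would call a WordNet-based measure."""
    if a == b:
        return 1.0
    return 0.8 if {a, b} == {"town", "city"} else 0.0


# Invented label profiles PL for three datasets.
label_profiles = {
    "ds_source": {"town", "river"},
    "ds_geo": {"city", "country"},
    "ds_music": {"album", "artist"},
}
ccd = cluster_of_comparable_datasets("ds_source", label_profiles, toy_sim, theta=0.7)
print(ccd)
```

Raising θ shrinks the cluster: with θ = 0.9 the "town"/"city" pair no longer qualifies and the cluster of this toy source becomes empty.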
59. Intensional Approach to Datasets Recommendation (3/3)
Step 3: Target Datasets Ranking
1) Forming a corpus from the profile documents: PD(DS) and all PD(DT).
2) Building a vector space model by indexing the documents in the corpus.
3) Computing TF-IDF + cosine similarity between the document vectors in the corpus.
4) Ranking each DT in the cluster CCD(DS) with respect to DS.
A mapping between datasets is returned based on their label profiles: PL(DS), PL(DT).
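The four ranking steps can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the system's implementation: tokenization is a simple lowercase whitespace split, the TF-IDF weighting uses a smoothed IDF of log(N/df) + 1, the document texts are invented, and every target is ranked rather than only those in CCD(DS).

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build one sparse TF-IDF vector (term -> weight) per document."""
    tokenized = {name: text.lower().split() for name, text in docs.items()}
    n_docs = len(tokenized)
    df = Counter(term for tokens in tokenized.values() for term in set(tokens))
    vectors = {}
    for name, tokens in tokenized.items():
        tf = Counter(tokens)
        vectors[name] = {
            term: (count / len(tokens)) * (math.log(n_docs / df[term]) + 1.0)
            for term, count in tf.items()
        }
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_targets(source, docs):
    """Rank every other document by cosine similarity to the source document."""
    vectors = tfidf_vectors(docs)
    scores = {
        name: cosine(vectors[source], vec)
        for name, vec in vectors.items()
        if name != source
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Invented document profiles PD.
docs = {
    "ds_source": "town a town is larger than a village",
    "ds_geo": "city a city is a large town",
    "ds_music": "album a collection of audio recordings",
}
ranking = rank_targets("ds_source", docs)
for name, score in ranking:
    print(name, round(score, 3))
```

In this toy corpus the geographic target ranks first because it shares the term "town" with the source, while the music target only shares a frequent function word.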
61. Experimental Setup
Leave-one-out evaluation.
Datasets: 90 responsive datasets from the LOD cloud (http://datahub.io/group/lodcloud).
The profile PD: from the LOV (lov.okfn.org).
Wu Palmer & Lin's: the 2013 WS4J Java API (https://code.google.com/archive/p/ws4j/).
The UMBC measure: its available web API (http://swoogle.umbc.edu/SimService).
The evaluation data (ED): the current topology of the LOD cloud (the ED is available on http://www.lirmm.fr/benellefi/void.ttl).
62. Evaluation Results (1/3)
For each DS ∈ ED, we evaluated the selection of target datasets in the cluster CCD(DS) in terms of recall.
Result: the recall value remains 100% in the following threshold 𝜽 intervals:
o Wu Palmer: 𝜽 ∈ [0, 0.9]
o Lin: 𝜽 ∈ [0, 0.8]
o UMBC: 𝜽 ∈ [0, 0.7]
In the following, the evaluation is restricted to these intervals in order to guarantee a recall of 100% in the ED.
63. Evaluation Results (2/3)
MAP@R over all DS ∈ ED
Parameter tuning:
o Wu Palmer: 𝜃 = 0.9
o Lin: 𝜃 = 0.8
o UMBC: 𝜃 = 0.7
64. Evaluation Results (3/3)
Mean Precision@K over all DS ∈ ED using the three different similarity measures at their best settings:

                    | P@5  | P@10 | P@15 | P@20
Wu Palmer (𝜽 = 0.9) | 0.56 | 0.52 | 0.53 | 0.51
Lin (𝜽 = 0.8)       | 0.57 | 0.54 | 0.55 | 0.51
UMBC (𝜽 = 0.7)      | 0.58 | 0.54 | 0.53 | 0.53

UMBC @ 𝜃 = 0.7 is the best setting for our ranking approach.
65. Baselines & Comparison (1/2)
Baseline #1: All datasets are represented by their document profiles (PD). No CCD clusters.
1) A vector space model built by indexing the PD.
2) TF-IDF + cosine similarity between the document vectors.
Baseline #2: All datasets are represented by their label profiles (PL). No TF-IDF + cosine similarity.
1) CCD(DS) using UMBC @ 𝜃 = 0.7.
2) AvgUMBC: a ranking function that assigns a score to each DT ∈ CCD(DS).
66. Baselines & Comparison (2/2)
P@R over all DS ∈ ED: Baseline #1, Baseline #2 and the proposed approach.
Efficiency with an AVG P@R up to 53%, compared to 49% and 39% for the baselines.
Precision up to 100% for DS from geographic and governmental domains.
Baseline #2 produces better results for a limited number of datasets:
o RKB explorer datasets share a high number of identical labels in their PL.
67. Result Analysis
A high performance:
o An average precision of 53% for a recall of 100%.
o Independence of the dataset size or the schema cardinality.
False positives overestimation:
A fairer evaluation can be given if better ED are used.
o The ED is far from being complete as a ground truth.
o We ran “SILK” on some FP recommendations. Datasets involved in the discovered owl:sameAs linksets: rkb-explorer-unlocode, yovisto, datos-bcn-cl, datos-bcn-uk.
Mohamed Ben Ellefi, Zohra Bellahsene, Stefan Dietze and Konstantin Todorov. Dataset Recommendation for Data Linking: an Intensional Approach. In Proceedings of the 13th ESWC 2016.
68. Topic Profiles-based vs. Intensional-based

                          | Topic profiles-based approach             | Intensional profile-based approach
Semantic profile features | Topic profiles                            | Intensional profile
Learning                  | Learning data dependency                  | No learning step
Average recall            | 81%                                       | 100%
Mean average precision    | 19%                                       | 53%
Additional contribution   | A new topic profiles propagation approach | Mappings between the schema concepts
Ranking results           | http://www.lirmm.fr/benellefi/results.csv | http://www.lirmm.fr/benellefi/CCD-CosineRank_Result.csv
70. Conclusion
The choice of the profile features is dependent on a given application scenario and task.
Learning data dependency:
o Learning data is not complete → effectiveness to be improved.
o Breaking off from the learning step → much greater efficiency.
Better performance with datasets having high-quality intensional profiles:
o The awareness of the richness of dataset schema descriptions.
The importance of reusing existing vocabulary terms:
o e.g., tools such as Datavore can ease the vocabulary reuse task in linked data modeling.
71. Open Issues
Dataset recommendation:
o A reliable ground truth (e.g., crowdsourcing-based) and benchmark data.
o To improve the quality of the intensional profiles:
• the population of the schema elements,
• the dataset context,
• multilingualism, …
o Investigating the effectiveness of ML techniques.
Vocabulary recommendation:
o A new evaluation framework for the linked data modeling process, e.g., in a user study manner and crowdsourcing-based.
o To examine the vocabulary terms ranking strategies based on the data structure factors, e.g., tabular sources modeling vs. web pages annotation.
73. Mohamed Ben Ellefi, Zohra Bellahsene, Stefan Dietze and Konstantin Todorov. Dataset Recommendation for Data Linking: an Intensional Approach. In Proceedings of the 13th ESWC 2016; Crete, Greece.
Mohamed Ben Ellefi, Zohra Bellahsene, Stefan Dietze and Konstantin Todorov. Beyond Established Knowledge Graphs: Recommending Web Datasets for Data Linking. In Proceedings of the 16th ICWE 2016; Lugano, Switzerland.
Manel Achichi, Mohamed Ben Ellefi, Danai Symeonidou, Konstantin Todorov. Automatic Key Selection for Data Linking. In Proceedings of the 20th EKAW 2016; Bologna, Italy.
Mohamed Ben Ellefi, Zohra Bellahsene, Konstantin Todorov. Datavore: A Vocabulary Recommender Tool Assisting Linked Data Modeling. (Posters & Demos) ISWC 2015; Bethlehem, PA, USA.
This paper was also presented at BDA 2015, Île de Porquerolles, France.
Mohamed Ben Ellefi, Zohra Bellahsene, François Scharffe, Konstantin Todorov. Towards Semantic Dataset Profiling. In PROFILES@ESWC 2014; Crete, Greece.
Editor's Notes
Respected members of the examination committee and dear …
So,
The title of my dissertation is … Profile-based
I start by presenting the project that financed our work: Datalyse includes 7 … to work on big open data.
For more information about the project, you can check datalyse.fr.
This is the overview of what I will present today.
I will talk quickly …
So yes, I am going to talk about 3 years of my life in 40 minutes, which is very ambitious.
Here we have a snapshot showing that Google literally processes 2.4 million searches every minute.
Yes, that is a lot of information on the web …
To be more concrete, let's take an example of some entities extracted from the web.
For better exploitation of this information, Linked Data consists in linking these entities with typed links.
Now, technically, what changed in the web?
The main idea consists in
Links are relation anchors written in HTML …
Linked Data consists of knowledge graphs where links between things are described in RDF.
Snapshots of the LOD that show how the graph evolved from 12 datasets
How to publish our data as Linked Data.
There are different visions.
Data goes through a number of stages to be published as Linked Data.
Search for potential vocabularies.
The conversion stage, where data is transformed to RDF using the selected ontology.
Hosting LD and making it accessible, which includes controlling the access of course.
Access Control is a mechanism through which an Agent (typically an HTTP server, in this case)
Potential target datasets
matching rules
In this thesis, we focused on both the …
Intro followed by Context and Motivation
Here we have the recommendation of building on, instead of replicating, existing RDF schemas and vocabularies.
There is a need for a tool that recommends vocabularies to assist the metadata designer in reusing existing vocabulary terms when developing the on
Metadata = the set of object properties and datatype properties that can be used with C
Triples extraction = relations between concepts from different lists.
This is a snapshot of the Datavore graphical user interface for better user assistance in the vocabulary selection…
Which is a preprocessing step of the interlinking.
First we recall what an RDF dataset is…
We highlight a challenging problem in the field of entity linking where …
LOD as a search space
However, we should note that we deal with source datasets that are not yet linked or published, as well as already linked and published datasets that we plan to enrich!
Aim to reduce the effort of
To the best of our knowledge, only a few works have looked at identifying candidate datasets for interlinking.
Overcome the drawbacks of
We aim to provide
No Precision/Recall evaluation except for (4).
No list of datasets and their rank performance per dataset considered as source!
Direct comparison is not possible!
A user-based recommendation
Two keys that we have to keep in mind: Profile and Behavior
Note: There are many recommendation systems, but only a few approaches are in the context of the linking process.
Topic Profile is our grouping technique
"What are the features that describe a dataset and allow to define it?"
Weighted bipartite graph
Topics are DBpedia categories
We have a source dataset represented by its weigh
C(., .) is a connectivity function based on the number of links between two datasets in both directions.
A dataset DTK is significant with respect to a topic Tk if its weight in the LDPG is greater than a given value in (0; 1).
A topic Tk is modeled by the set of its significant datasets together with their respective weights,
Remove
Since this ED is the only available data that we have for both training we opted for a ..
In 5-fold cross-validation, the ED was randomly split into two subsets: 1 - the first, containing a random 80% of the linked datasets in the ED, was used as the training set;
2 - the second, containing the remaining linked datasets (i.e., a random 20% of the ED), was retained as the validation data for tests (i.e., the test set).
LOD cloud as ED
for each DS ∈ ED.
We evaluate the recommendation based on the main links
indicates that every time you call a positive, you have a probability of being right
We confirm our hypothesis of
It is around 40 recommendations out of the total 258,
which considerably reduces the user effort.
To the best of our knowledge, there does not exist a common benchmark for dataset interlinking recommendation.
For example, the shared tags baseline generated an F-Score of 100% on oceandrilling-janusamp, which is connected to oceandrilling-codices, due to the fact that these two datasets are tagged by the same provenance (data.oceandrilling.org).
I should have computed the p-value (significance test).
crowdsourcing techniques
ML techniques, such as classification or clustering, for the recommendation task.
The question here: if we break up with
Metadata designers reuse and build on, instead of replicating, existing RDF schemas and vocabularies.
Here we have an example of the city of Montpellier in France.
This city is described by similar types in different datasets.
We observe that we have the same or similar concepts between these different datasets.
We need a similarity measure that can detect the similarity between terms such as “city” and “town”.
Latent Semantic Analysis (LSA) word similarity, which relies on word co-occurrence in the same contexts, computed over a three-billion-word corpus of good-quality English. D(x, y) is the minimal WordNet [16] path length between x and y. According to the authors, using e^{-αD(x, y)} to transform the simple shortest path length has been shown to be very efficient when the parameter α is set to 0.25.
Here we have a global view of our proposed approach pipeline, which consists of three main steps …
Add a note on CCD …
We extract dataset profiles.
Vocabularies which are very generic and widespread have a negative impact, acting like hub nodes, which dilute the results. Therefore, we decided to remove VoID, FOAF and SKOS.
We note that we use the term "cluster" in its general meaning, referring to a set of datasets grouped together by their similarity, and not in a machine learning sense.
As an additional contribution, our method returns the mappings between the schema concepts across datasets, a particularly useful input for the data linking step.
The Standard Analyzer of Lucene: transforms text into lower case + tokenizes + filters English stop words.
Standard NLP techniques
precision at rank k
Mean average precision at Recall = 1, MAP@R, where R(q) corresponds to the rank at which recall reaches 1 for the q-th dataset and TotalDS is the total number of source datasets in the evaluation.
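The two measures can be sketched in Python as follows. The MAP@R formula itself is not reproduced in these notes, so the sketch assumes that it averages, over all source datasets, the precision measured at the rank R(q) where recall first reaches 1; the function names and the example rankings are illustrative.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k recommended datasets that are truly linked."""
    return sum(1 for d in ranked[:k] if d in relevant) / k


def rank_of_full_recall(ranked, relevant):
    """Rank R at which every relevant dataset has been seen, or None if never."""
    found = 0
    for position, dataset in enumerate(ranked, start=1):
        if dataset in relevant:
            found += 1
            if found == len(relevant):
                return position
    return None


def mean_precision_at_full_recall(runs):
    """Average P@R(q) over all source datasets q (assumed reading of MAP@R)."""
    values = []
    for ranked, relevant in runs:
        r = rank_of_full_recall(ranked, relevant)
        values.append(precision_at_k(ranked, relevant, r) if r else 0.0)
    return sum(values) / len(values)


# Two invented source datasets: (ranked recommendations, truly linked targets).
runs = [
    (["t1", "t2", "t3"], {"t1", "t3"}),
    (["t4", "t5"], {"t4"}),
]
map_at_r = mean_precision_at_full_recall(runs)
print(map_at_r)
```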
The first step is to form a CCD(DS) for each DS. The CCD construction process depends on the similarity measure on dataset profiles. We observed that the recall value remains 100% in the following threshold intervals
Here we have the parameter setting for each measure, where we found that the best MAP is achieved in
Furthermore, we provide an evaluation in terms of
UMBC = stability in the performance
The first baseline
This is an evaluation of our system against … in terms of
Baseline #2, which is based on the label profile
The matching rules required as input by SILK make the process infeasible to perform manually over the entire LOD.
Exhausting task
In the Linked Data life cycle, we addressed the stage
For modeling, we provided …
For interlinking, we addressed the preprocessing step, where we proposed two approaches for dataset identification …
Also, we show how
So the question now: what is this semantic representation of the web?