1. Motivation
Data on the Web
Curation and profiling of Linked Data
KnowEscape workshop, Open Knowledge Conference 2013 (OKCon2013)
Stefan Dietze¹, Besnik Fetahu¹, Mathieu d’Aquin²
¹ L3S Research Center (Germany); ² The Open University (UK)
http://linkedup-project.eu
http://purl.org/dietze
@stefandietze
19/09/2013 1Stefan Dietze
2. “LinkedUp” – Linking Web Data for Education
European project aimed at advancing take-up of open data and related technologies
http://linkedup-project.eu
Success models: data & applications
• LinkedUp Challenge to identify innovative tools & applications
• Evaluation methods and approaches
• http://www.linkedup-challenge.org/
Data curation
• Collecting & exposing open data of educational relevance => LinkedUp Data Catalog
• Profiling and linking of Web Data for education => educational data graph
• http://data.linkededucation.org
Technology transfer & community-building
• Disseminating knowledge & building communities (educators, computer scientists, data engineers)
• http://linkedup-project.eu/events
Gathering stakeholder feedback: use cases and requirements
• http://linkedup-challenge.org/#usecases
3. Problem: too many datasets, too little information
Example: http://datahub.io/dataset/bbc (60,000,000 triples)
LOD: 300+ datasets, 32+ billion distinct RDF statements
DataHub: 6000+ open datasets
Using/exploiting Linked Data in education?
• Lack of reliable dataset metadata about:
  - Resource types
  - Topics & disciplines
  - Quality, currentness & availability
  - Provenance
• Lack of links and cross-dataset references
• Lack of scalable query methods
4. Goal: dataset metadata & search for data consumers
• “LinkedUp/Linked Education cloud” as “expanded” subset of the LOD cloud at The DataHub (http://datahub.io/groups/linked-education)
• RDF (VoID) catalog of datasets = dataset of datasets (Linked Education Catalog): classification of datasets according to, e.g., represented types, disciplines/topics, data quality, accessibility
• Links and coreferences => unified view on data => Linked Education Graph
• Infrastructure, unified (SPARQL) endpoint & APIs for distributed/federated querying
LinkedUp approach: data curation and dataset profiling
Educational Datasets => automated processing to generate:
• LinkedUp Catalog: descriptive VoID/RDF dataset catalog
• LinkedUp Links: data links
5. Linked Data “Observatory” for linking and profiling
Processing steps:
• Endpoint Retrieval & Graph Extraction
• Schema Extraction and Mapping
• Sample Graph Extraction (per dataset)
• NER & NED (per resource)
• Interlinking & Co-Resolution (cross-dataset)
• Category Mapping, Normalisation, Filtering
Outputs:
• Dataset Catalog/Index
• Links/Cross-references
Disambiguation example: rdfs:label “…ECB….” ?
• dbpedia:European_Central_Bank (dbpedia:Finance)
• dbpedia:England-Wales-Cricket-Board (dbpedia:Sports)
Dataset metadata (RDF/VoID):
• Schema mappings (types, properties)
• Entities & categories
• Topic relevance scores
• Availability, currentness data (tbc)
References:
[WEBSCI‘13] Assessing the Educational Linked Data Landscape, D’Aquin, M., Adamou, A., Dietze, S., ACM Web Science 2013 (WebSci2013), Paris, France, May 2013.
[ESWC‘13] Combining a co-occurrence-based and a semantic measure for entity linking, B. P. Nunes, S. Dietze, M.A. Casanova, R. Kawase, B. Fetahu, and W. Nejdl, ESWC 2013 - 10th Extended Semantic Web Conference (May 2013).
[ISWC‘13] Generating structured Profiles of Linked Data Graphs, Fetahu, B., Adamou, A., Dietze, S., d’Aquin, M., Nunes, B.P., ISWC2013 – 12th International Semantic Web Conference; under review.
6. Schema assessment and mapping
Co-occurrence of data types (in 146 datasets: 144 vocabularies, 588 highly overlapping types, 719 properties)
http://datahub.io/group/linked-education
How do types from different datasets relate? po:Programme ? sioc:Item
BBC Programme:
<po:Programme …>
  <po:title>Secret Universe – The Life of the Cell</po:title>
  …
</po:Programme…>
SlideShare Set:
<sioc:Item …>
  <label>Viral diseases & bacteria</label>
  …
</sioc:Item ….>
Assessing the Educational Linked Data Landscape, D’Aquin, M., Adamou, A., Dietze, S., ACM Web Science 2013 (WebSci2013), Paris, France, May 2013.
7. Schema assessment and mapping
Co-occurrence of data types (in 146 datasets: 144 vocabularies, 588 highly overlapping types, 719 properties)
Co-occurrence graph after mapping (201 frequent types mapped into 79 classes)
Source types from the BBC Programme and SlideShare Set examples: po:Programme, sioc:Item
Mapped classes shown: bibo:Slideshow, bibo:Film, bibo:Document
Assessing the Educational Linked Data Landscape, D’Aquin, M., Adamou, A., Dietze, S., ACM Web Science 2013 (WebSci2013), Paris, France, May 2013.
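The co-occurrence counting behind the two slides above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the dataset names, the use of foaf:Person, and the mapping of all three types onto bibo:Document are assumptions made only for the example.

```python
from collections import Counter
from itertools import combinations


def type_cooccurrence(dataset_types, mapping=None):
    """Count in how many datasets each unordered pair of types occurs together.

    dataset_types: dataset name -> set of type names used in that dataset.
    mapping: optional type -> class mapping applied before counting.
    """
    mapping = mapping or {}
    counts = Counter()
    for types in dataset_types.values():
        # Apply the mapping, de-duplicate, and sort so pairs have a stable order.
        mapped = sorted({mapping.get(t, t) for t in types})
        counts.update(combinations(mapped, 2))
    return counts


# Hypothetical input: three datasets and the types they use.
dataset_types = {
    "bbc": {"po:Programme", "foaf:Person"},
    "slideshare": {"sioc:Item", "foaf:Person"},
    "videos": {"yo:Video", "foaf:Person"},
}
# Hypothetical mapping, chosen only to show how mapping merges sparse pairs.
mapping = {
    "po:Programme": "bibo:Document",
    "sioc:Item": "bibo:Document",
    "yo:Video": "bibo:Document",
}

before = type_cooccurrence(dataset_types)
after = type_cooccurrence(dataset_types, mapping)
print(before)
print(after)
```

Before mapping, each document-like type co-occurs with foaf:Person in a single dataset; after mapping, the three pairs collapse into one pair seen in all three datasets, which is the effect the mapped co-occurrence graph is meant to show.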
8. LinkedUp Data Catalog in a nutshell
http://datahub.io/group/linked-education
http://data.linkededucation.org/linkedup/catalog/
• VoID dataset catalog: browse, explore and query for datasets/types
• Federated queries using type mappings
9. Dataset topic profiling: data heterogeneity?
Video:
<yo:Video 8748720>
  <dc:title>Pluto & the Dwarf Planets</dc:title>
  …
</yo:Video 8748720>
Slideset:
<sioc:Item 2139393292>
  <title>Planetary motion & gravity</title>
  …
</sioc:Item 2139393292>
Programme:
<po:Programme519215>
  <po:Series>Wonders of the Solar System</po:Series>
  <po:Episode>Emp. of the Sun</po:Episode>
  <po:Actor>Brian Cox</po:Actor>
</po:Programme519215>
Open questions:
• Topics/categories addressed?
• Relatedness of resources/entities? (types, semantics)
References:
Combining a co-occurrence-based and a semantic measure for entity linking, B. P. Nunes, S. Dietze, M.A. Casanova, R. Kawase, B. Fetahu, and W. Nejdl, ESWC 2013 - 10th Extended Semantic Web Conference (May 2013).
Generating structured Profiles of Linked Data Graphs, Fetahu, B., Adamou, A., Dietze, S., d’Aquin, M., Nunes, B.P., ISWC2013 – 12th International Semantic Web Conference; under review.
10. Data disambiguation, linking & profiling
Which entities do the Video and Programme examples mention?
• Pluto? (Video title: “Pluto & the Dwarf Planets”)
• Sun? (Programme episode: “Emp. of the Sun”)
• Brian Cox? (Programme actor)
11. Data disambiguation, linking & profiling
Video, Programme and Slideset examples linked to DBpedia:
• db:Pluto (Dwarf Planet)
• db:Sun
• db:Astronomical Objects
• db:Astronomy
12. Data disambiguation, linking & profiling
Data linking
• Computation of connectivity scores between resources/entities
• Method: combination of
  (i) a semantic (graph-based) connectivity score (SCS) with
  (ii) a Web co-occurrence-based measure (CBM), similar to the Normalized Google Distance (NGD)
• For (i): adaptation of the Katz index from social network analysis (SNA) for (linked) data graphs (considering the number and length of paths over transversal properties)
• Example scores shown on the slide: SCS = 0.32, CBM = 0.24
Dataset categorisation
• Computation of normalised (DBpedia) category relevance scores for datasets
http://purl.org/vol/doc/
http://purl.org/vol/ns/
Example resources: Online Lecture, Slideset and Programme, linked to db:Pluto (Dwarf Planet), db:Sun, db:Astronomical Objects, db:Astronomy
Online Lecture:
<yov:Lecture8748720>
  <title>Pluto & the Dwarf Planets</title>
  …
</yov:Lecture8748720>
Combining a co-occurrence-based and a semantic measure for entity linking, B. P. Nunes, S. Dietze, M.A. Casanova, R. Kawase, B. Fetahu, and W. Nejdl, ESWC 2013 - 10th Extended Semantic Web Conference (May 2013).
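To make the graph-based part of the score more concrete, the following is a minimal sketch of a plain, truncated Katz index: it sums beta^l times the number of walks of length l between two nodes. It is not the adapted measure from the cited paper (which additionally restricts paths to transversal properties); the damping factor, the maximum length and the edges of the toy graph are all assumptions for illustration.

```python
def katz_score(graph, source, target, beta=0.5, max_length=4):
    """Truncated Katz index between two nodes.

    graph: dict mapping node -> iterable of neighbour nodes (list an edge in
    both directions to treat it as undirected).
    Returns the sum over l = 1..max_length of beta**l * (walks of length l).
    """
    # walks[n] = number of walks of the current length from source to n
    walks = {source: 1}
    score = 0.0
    for length in range(1, max_length + 1):
        next_walks = {}
        for node, count in walks.items():
            for neighbour in graph.get(node, ()):
                next_walks[neighbour] = next_walks.get(neighbour, 0) + count
        walks = next_walks
        score += (beta ** length) * walks.get(target, 0)
    return score


# Hypothetical toy graph; the edges are invented for the example.
graph = {
    "db:Pluto": ["db:Astronomical_Objects"],
    "db:Sun": ["db:Astronomical_Objects", "db:Astronomy"],
    "db:Astronomical_Objects": ["db:Pluto", "db:Sun", "db:Astronomy"],
    "db:Astronomy": ["db:Sun", "db:Astronomical_Objects"],
}

pluto_sun = katz_score(graph, "db:Pluto", "db:Sun", beta=0.5, max_length=3)
print(pluto_sun)
```

Shorter connections dominate because each extra hop is damped by beta, so two resources joined by many short paths score higher than two joined by a single long one.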
13. Dataset profiling
Goal: extracting representative metadata (“topic profile”) for each dataset
Approach: computation of normalised (DBpedia) category relevance scores, using representative sample resource sets per resource type & dataset
Example: Programme resource (po:Programme519215) with db:Sun, db:Astronomical Objects and db:Astronomy in the DBpedia category graph
Generating structured Profiles of Linked Data Graphs, Fetahu, B., Adamou, A., Dietze, S., d’Aquin, M., Nunes, B.P., ISWC2013 – 12th International Semantic Web Conference; under review.
14. Dataset profiling: topic extraction process (1/2)
Processing steps, outputs and dataset metadata as on slide 5, with Sample Graph Extraction performed per dataset/type.
Step 1 – NER:
Online NER & NED vs. incremental similarity-based “NER”:
• Online NER: DBpedia Spotlight
• Incremental & similarity-based NER: compare (via Jaccard index) the textual descriptions of already extracted entities with the literal values of a resource instance (assumption: entities are likely to recur within a dataset)
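The incremental, similarity-based step can be sketched as below: token sets of already known entity descriptions are compared with a literal value using the Jaccard index. This is only an illustration under stated assumptions; the tokenisation, the 0.2 threshold and the example descriptions are hypothetical and not taken from the slides.

```python
import re


def tokens(text):
    """Lower-cased word tokens of a string, as a set."""
    return set(re.findall(r"\w+", text.lower()))


def jaccard(a, b):
    """Jaccard index |A & B| / |A | B| of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def match_known_entities(literal, known_entities, threshold=0.2):
    """Return (entity, score) pairs whose description resembles the literal.

    known_entities: entity URI -> textual description of an entity that was
    already extracted earlier (for example by an online NER service).
    threshold: hypothetical cut-off chosen for the example.
    """
    literal_tokens = tokens(literal)
    hits = []
    for uri, description in known_entities.items():
        score = jaccard(literal_tokens, tokens(description))
        if score >= threshold:
            hits.append((uri, score))
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Hypothetical descriptions of previously extracted entities.
known = {
    "dbpedia:Sun": "the sun is the star at the centre of the solar system",
    "dbpedia:European_Central_Bank": "central bank of the euro area",
}
hits = match_known_entities("Wonders of the Solar System: Empire of the Sun", known)
print(hits)
```

Only literals that match no known entity well enough would then need to be sent to the online NER service, which is where the recurrence assumption pays off.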
15. Dataset profiling: topic extraction process (continued)
Step 2 – Computation of profile (ranked categories)
• Entities => DBpedia categories = “Topics”: extraction of topics from DBpedia entities via dcterms:subject
• Expand the set of topics by leveraging the hierarchical category organisation (skos:broader)
• Normalised topic score (over topics and datasets) for a topic t and a dataset D, based on:
  - # of entities for dataset D
  - # of entities for all datasets
  - # of entities for t in dataset D
  - # of entities for t for all datasets
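As a rough illustration of Step 2, the sketch below expands each entity's categories along a skos:broader-style hierarchy and then counts the four quantities the slide names for one topic and one dataset. It deliberately stops at the counts and does not assume how they are combined into the normalised score. The category names, the dataset names and the choice to count distinct entities across datasets are assumptions.

```python
from collections import namedtuple

TopicCounts = namedtuple("TopicCounts", "entities_in_d entities_all t_in_d t_all")


def expand_topics(topics, broader):
    """Add all ancestor categories reachable via a broader-category mapping."""
    seen = set()
    stack = list(topics)
    while stack:
        topic = stack.pop()
        if topic not in seen:
            seen.add(topic)
            stack.extend(broader.get(topic, ()))
    return seen


def topic_counts(entity_topics, dataset_entities, broader, topic, dataset):
    """Count the four quantities for one (topic, dataset) pair.

    entity_topics: entity -> directly assigned categories (dcterms:subject).
    dataset_entities: dataset -> set of entities extracted from it.
    Assumption: entities are counted once even if they occur in several datasets.
    """
    def has_topic(entity):
        return topic in expand_topics(entity_topics.get(entity, ()), broader)

    all_entities = set().union(*dataset_entities.values())
    in_d = dataset_entities[dataset]
    return TopicCounts(
        entities_in_d=len(in_d),
        entities_all=len(all_entities),
        t_in_d=sum(1 for e in in_d if has_topic(e)),
        t_all=sum(1 for e in all_entities if has_topic(e)),
    )


# Hypothetical categories, hierarchy and datasets.
entity_topics = {
    "db:Pluto": {"cat:Dwarf_planets"},
    "db:Sun": {"cat:Stars"},
    "db:European_Central_Bank": {"cat:Central_banks"},
}
broader = {
    "cat:Dwarf_planets": ["cat:Astronomical_objects"],
    "cat:Stars": ["cat:Astronomical_objects"],
    "cat:Astronomical_objects": ["cat:Astronomy"],
}
dataset_entities = {
    "video-set": {"db:Pluto", "db:Sun"},
    "finance-set": {"db:European_Central_Bank", "db:Sun"},
}

counts = topic_counts(entity_topics, dataset_entities, broader,
                      "cat:Astronomy", "video-set")
print(counts)
```

Because of the hierarchy expansion, cat:Astronomy is credited to both db:Pluto and db:Sun even though neither is tagged with it directly.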
17. LinkedUp Data Catalog – hands-on, in a nutshell
http://data.linkededucation.org
http://data.linkededucation.org/linkedup/catalog/sparql
http://data.linkededucation.org/request/pipeline/sparql
Demonstrated: type mappings, topic profiles/scores, query federation
Prefixes used in the queries below:
  PREFIX void: <http://rdfs.org/ns/void#>
  PREFIX vol: <http://purl.org/vol/ns/>
  PREFIX dcterms: <http://purl.org/dc/terms/>
  PREFIX aiiso: <http://purl.org/vocab/aiiso/schema#>
  PREFIX owl: <http://www.w3.org/2002/07/owl#>
  PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
Querying FOR datasets
• Retrieving datasets for categories:
  SELECT ?datasetname ?link ?score WHERE {
    ?linkset a void:Linkset .
    ?linkset vol:hasLink ?link .
    ?link vol:linksResource <http://dbpedia.org/resource/Category:Technology> .
    ?link vol:hasScore ?score .
    ?dataset a void:Dataset .
    ?linkset void:target ?dataset .
    ?dataset dcterms:title ?datasetname .
    FILTER (?score > 0.5)
  }
• Retrieving datasets describing schools:
  SELECT DISTINCT ?endpoint ?cl WHERE {
    ?ds void:sparqlEndpoint ?endpoint .
    { { ?ds void:classPartition [ void:class ?cl ] }
      UNION { ?ds void:subset [ void:classPartition [ void:class ?cl ] ] } }
    { { ?cl owl:equivalentClass aiiso:School }
      UNION { ?cl rdfs:subClassOf aiiso:School }
      UNION { FILTER ( str(?cl) = str(aiiso:School) ) } }
  }
Querying THE datasets
• Federated queries using mappings between aiiso:School and other “school” types:
  SELECT DISTINCT ?endpoint ?school ?cl WHERE {
    … as above …
    SERVICE SILENT ?endpoint { ?school a ?cl }
  }
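For readers who want to try the catalog from a script, the sketch below only builds the request URL for a SPARQL-protocol GET request, where the query text travels in the `query` parameter. The query shown is a simplified variant written for this example, and whether the endpoint from the slide is still reachable is not something the sketch assumes.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Endpoint URL as given on the slide; its availability is not guaranteed.
CATALOG_ENDPOINT = "http://data.linkededucation.org/linkedup/catalog/sparql"

# Simplified illustrative query: list SPARQL endpoints of catalogued datasets.
ENDPOINT_QUERY = """
PREFIX void: <http://rdfs.org/ns/void#>
SELECT DISTINCT ?endpoint WHERE {
  ?ds void:sparqlEndpoint ?endpoint .
}
LIMIT 10
"""


def sparql_get_url(endpoint, query):
    """Build a SPARQL-protocol GET URL with the query in the 'query' parameter."""
    return endpoint + "?" + urlencode({"query": query})


url = sparql_get_url(CATALOG_ENDPOINT, ENDPOINT_QUERY)
print(url)

# Round-trip check: decoding the URL yields the original query text again.
decoded = parse_qs(urlsplit(url).query)["query"][0]
print(decoded == ENDPOINT_QUERY)
```

The resulting URL can be fetched with any HTTP client; requesting a result format such as SPARQL JSON results is typically done through the Accept header.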
18. Outlook in a nutshell
• Merging the two VoID datasets
  - Datasets and type mappings (LinkedUp Catalog)
  - Category annotations (data.linkededucation.org)
• Extracting statistical observations (RDF Data Cube)
• Feeding data back into the DataHub
• Application to entire LOD cloud group on DataHub
• Consideration of additional profiling features
  - Quality aspects
  - Dataset and link dynamics
  - Temporal and spatial coverage (=> http://www.duraark.eu)
19. LinkedUp Vidi Competition
Tools and demos that analyse or integrate open web data for educational purposes
• Wanted: applications and tools that address real educational needs
• Anyone can participate - researchers, students, developers, industry
• Challenging focused tracks with clear goals
• More data, more challenging, more support, more prizes
More info: http://linkedup-challenge.org/
Launch on 4 November 2013
Submission deadline is 14 February 2014
20,000 Euro prize money
20. Thank you!
Contact
http://purl.org/dietze | @stefandietze
See also (data)
http://datahub.io/group/linked-education
http://data.linkededucation.org
http://data.linkededucation.org/linkedup/catalog/
http://lak.linkededucation.org
See also (general)
http://linkedup-project.eu
http://linkedup-challenge.org
http://linkededucation.org
http://linkeduniversities.org