Turning Data into Knowledge – profiling and interlinking Web datasets 
Stefan Dietze 
L3S Research Center 
- KESW2014 - 
Turning Data into Knowledge (KESW2014 Keynote)

Keynote at KESW2014 (http://2014.kesw.ru/) about Dataset Characterisation and Linking.


  1. Turning Data into Knowledge – profiling and interlinking Web datasets
     Stefan Dietze, L3S Research Center
     KESW2014, 30/09/14
  2. Introduction (http://www.l3s.de/)
     Recent work on Linked Data exploration/discovery/search
     - Entity interlinking & dataset interlinking recommendation
     - Dataset profiling
     - Data consistency & conflicts
     Research areas
     - Web science, Information Retrieval, Semantic Web & Linked Data, data & knowledge integration (mapping, classification, interlinking)
     - Application domains: education/TEL, Web archiving, …
     Some projects (see also: http://purl.org/dietze)
  3. Linked Data is awesome, but...
     - "HTTP-accessibility" (SPARQL, URI-dereferencing)
     - "Structure" & "Semantics" (=> shared/linked vocabularies)
     - "Interlinked"
     - "Persistent"
     Hm, really? …why are there so few datasets actually used?
     - Data reuse and in-links are focused on trusted "reference graphs" such as DBpedia, Freebase etc.
     - Long tail of LD datasets which are neither reused nor linked to (LOD Cloud alone: 300+ datasets, 50 bn triples)
     - Explanations?
  4. Linked data is more diverse (and messy) than we think
     Accessibility of datasets?
     - Less than 50% of all SPARQL endpoints are actually responsive at a given point in time [Buil-Aranda2013]
     - "THE" SPARQL protocol? No, but many variants & subsets
     "Semantics", links, quality?
     - …data accuracy (e.g. DBpedia)? [Paulheim2013]
     - …vocabulary reuse? [D’AquinWebSci13]
     - …schema compliance (RDFS, schemas)? [HoganJWS2012]
     [Figure] SPARQL endpoint availability over time [Buil-Aranda2013]
     References
     - [Buil-Aranda2013] SPARQL Web-Querying Infrastructure: Ready for Action?, Buil-Aranda, C., Hogan, A., Umbrich, J., Vandenbussche, P.-Y., International Semantic Web Conference 2013 (ISWC2013).
     - [D’AquinWebSci13] Assessing the Educational Linked Data Landscape, D’Aquin, M., Adamou, A., Dietze, S., ACM Web Science 2013 (WebSci2013), Paris, France, May 2013.
     - [Paulheim2013] Type Inference on Noisy RDF Data, Paulheim, H., Bizer, C., Semantic Web – ISWC 2013, Lecture Notes in Computer Science Volume 8218, 2013, pp 510-525.
     - [HoganJWS2012] An empirical survey of Linked Data conformance, Hogan, A., Umbrich, J., Harth, A., Cyganiak, R., Polleres, A., Decker, S., Journal of Web Semantics 14, 2012.
  5. What about data consistency?
     Analyzing Relative Incompleteness of Movie Descriptions in the Web of Data: A Case Study, Yuan, W., Demidova, E., Dietze, S., Zhu, X., International Semantic Web Conference 2014 (ISWC2014).
  6. Too many/diverse datasets, too little knowledge
     - Topics? Which datasets are useful & trustworthy for case XY (e.g. "learning about the solar system")? Which topics are covered?
     - Types? Which datasets describe statistics, videos, slides, publications etc.?
     - Quality? Currentness, dynamics, accessibility/reliability, data quantity & quality?
  7. Data mapping, linking and profiling
     [Diagram] Dataset Catalog/Registry with Dataset Metadata; vocabularies BIBO, AIISO, FOAF; topics db:Astronomy, db:Astro. Objects
     - Entity & dataset disambiguation & linking [ESWC13]
     - Topic profile extraction [WWW13, ESWC14]
     - Schema mappings [WebSci13] (yov:Video, po:Programme, bibo:Film)
     Example records contained in the datasets:
     - BBC Programme: <po:Programme …> <po:Series>Wonders of the Solar System</.> <po:Actor>Brian Cox</…> </po:Programme…>
     - Yovisto Video: <yo:Video …> <dc:title>Pluto & the Dwarf Planets</dc:title> … </yo:Video…>
  8. Schemas/vocabularies on the Web: XKCD 927
     https://xkcd.com/927/ (schemas & vocabularies)
  9. Schema assessment and mapping
     - Co-occurrence of data types (in 146 datasets: 144 vocabularies, 588 highly overlapping types, 719 properties)
     [Figure] Example types: po:Programme, sioc:Item, yov:Video, ?
     Assessing the Educational Linked Data Landscape, D’Aquin, M., Adamou, A., Dietze, S., ACM Web Science 2013 (WebSci2013), Paris, France, May 2013.
  10. Schema assessment and mapping
      - Co-occurrence of data types (in 146 datasets: 144 vocabularies, 588 highly overlapping types, 719 properties)
      - Co-occurrence after mapping into most frequent schemas (201 frequent types mapped into 79 classes)
      [Figure] Example types: bibo:Film, bibo:Document, po:Programme, sioc:Item, foaf:Document, yov:Video, typeX
      Assessing the Educational Linked Data Landscape, D’Aquin, M., Adamou, A., Dietze, S., ACM Web Science 2013 (WebSci2013), Paris, France, May 2013.
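As a rough illustration of why mapping types into frequent schemas changes the co-occurrence picture, the sketch below counts type pairs per dataset before and after applying a mapping table. The dataset names and the mapping itself are invented for this example and are not taken from [WebSci13]; only the type names appear on the slides.

```python
from collections import Counter
from itertools import combinations

# Hypothetical input: the set of types each dataset uses (illustrative only).
DATASET_TYPES = {
    "dataset_a": {"po:Programme", "sioc:Item"},
    "dataset_b": {"yov:Video", "foaf:Document"},
    "dataset_c": {"bibo:Film", "sioc:Item"},
}

# Hypothetical mapping of source types into classes of a frequently used schema.
TYPE_MAPPING = {
    "po:Programme": "bibo:Film",
    "yov:Video": "bibo:Film",
    "sioc:Item": "bibo:Document",
    "foaf:Document": "bibo:Document",
}


def map_types(types, mapping):
    """Replace each type by its mapped class; unmapped types are kept."""
    return {mapping.get(t, t) for t in types}


def co_occurrence(dataset_types):
    """Count, for every unordered pair of types, the datasets using both."""
    counts = Counter()
    for types in dataset_types.values():
        for pair in combinations(sorted(types), 2):
            counts[pair] += 1
    return counts


before = co_occurrence(DATASET_TYPES)
mapped = {name: map_types(types, TYPE_MAPPING)
          for name, types in DATASET_TYPES.items()}
after = co_occurrence(mapped)
```

In this toy input no type pair is shared by two datasets before mapping, while after mapping all three datasets share the same pair, which is the effect the second bullet describes.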
  11. Application: LinkedUp Data Catalog in a nutshell
      - RDF (VoID) dataset catalog: browse & query distributed datasets
      - Federated queries using type mappings
      - Live information about endpoint accessibility
      http://data.linkededucation.org/linkedup/catalog/
      http://datahub.io/group/linked-education
      [Screenshot label] DBpedia categories
  12. Towards profiling: dataset disambiguation/linking
      - Relatedness of entities, meaningfulness of paths? [ESWC13]
      - Extraction of "topics" & relatedness of datasets [ESWC14]
      [Diagram] yov:Video and po:Programme records linked to db:Astro. Objects or db:CartoonCharacters?
      - BBC Programme: <po:Programme …> <po:Series>Wonders of the Solar System</.> <po:Actor>Brian Cox</…> </po:Programme…>
      - Yovisto Video: <yo:Video …> <dc:title>Pluto & the Dwarf Planets</dc:title> … </yo:Video…>
  13. Dataset disambiguation/linking
      Computation of connectivity scores between entities
      - Combination of (i) a semantic (graph-based) connectivity score (SCS) with (ii) a Web co-occurrence-based measure (CBM) (similar to NGD)
      - For (i): adaptation of the Katz index from SNA for (linked) data graphs (considering path number and path lengths of transversal properties)
      [Diagram] db:Pluto (Dwarf Planet), db:Astronomical Objects, db:Sun, db:Astronomy; example scores SCS = 0.32, CBM = 0.24
      - BBC Programme: <po:Programme …> <po:Series>Wonders of the Solar System</.> <po:Actor>Brian Cox</…> </po:Programme…>
      - Yovisto Video: <yo:Video …> <dc:title>Pluto & the Dwarf Planets</dc:title> … </yo:Video…>
      Combining a co-occurrence-based and a semantic measure for entity linking, Nunes, B. P., Dietze, S., Casanova, M. A., Kawase, R., Fetahu, B., Nejdl, W., ESWC 2013 - 10th Extended Semantic Web Conference (May 2013).
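The two kinds of score named on this slide can be sketched roughly as follows. This is a simplified stand-in, not the SCS or CBM from [ESWC13]: the Katz-style score counts all walks in a toy undirected graph (no property filtering), and the co-occurrence measure is the plain Normalized Google Distance rather than the paper's variant. The graph, the damping factor `beta`, the cut-off `max_length` and all hit counts are assumptions made for the example.

```python
import math

# Toy undirected graph (illustrative; not real DBpedia data).
GRAPH = {
    "db:Pluto": {"db:Astronomical_Objects", "db:Sun"},
    "db:Astronomical_Objects": {"db:Pluto", "db:Astronomy"},
    "db:Sun": {"db:Pluto", "db:Astronomy"},
    "db:Astronomy": {"db:Astronomical_Objects", "db:Sun"},
}


def count_walks(graph, source, target, length):
    """Number of walks with exactly `length` edges from source to target."""
    frontier = {source: 1}
    for _ in range(length):
        nxt = {}
        for node, ways in frontier.items():
            for neighbour in graph.get(node, ()):
                nxt[neighbour] = nxt.get(neighbour, 0) + ways
        frontier = nxt
    return frontier.get(target, 0)


def katz_score(graph, source, target, beta=0.5, max_length=4):
    """Katz-style score: walks of length l are damped by beta ** l."""
    return sum(beta ** length * count_walks(graph, source, target, length)
               for length in range(1, max_length + 1))


def normalized_google_distance(fx, fy, fxy, n):
    """Plain NGD from hit counts fx, fy, joint count fxy and index size n.

    Assumes fx, fy > 0 and n larger than both counts.
    """
    if fxy == 0:
        return math.inf  # the terms never co-occur
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))
```

Shorter and more numerous connections raise the Katz-style score, while a smaller NGD means the two terms co-occur more often relative to their individual frequencies; how the two are combined into one linking decision is described in the cited paper, not here.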
  14. Entity linking: evaluation
      - Evaluation based on USA Today News items (80,000 entity pairs)
      - Manually created gold standard (1,000 entity pairs)
      - Baseline: Explicit Semantic Analysis (ESA) => CBM/SCS: "relatedness"; ESA: "similarity"
      [Figure] Precision/Recall/F1 for SCS, CBM, ESA
      Combining a co-occurrence-based and a semantic measure for entity linking, Nunes, B. P., Dietze, S., Casanova, M. A., Kawase, R., Fetahu, B., Nejdl, W., ESWC 2013 - 10th Extended Semantic Web Conference (May 2013).
  15. "SCS Connector" demo
      http://lod2.inf.puc-rio.br/scs/SemConnectivities
      SCS Connector: Quantifying and Visualising Semantic Paths between Entity Pairs, Nunes, B. P., Herrera, J. E. T., Taibi, D., Lopes, G. R., Casanova, M. A., Dietze, S., Demo Paper at 11th Extended Semantic Web Conference (ESWC2014), Heraklion, Crete, Greece (2014). *BEST ESWC2014 DEMO AWARD*
  16. Dataset profiling: what's the data about?
      - Extracting representative (DBpedia) categories ("topic profile") & entities for arbitrary datasets
      - Sounds easy? But how to do that for 300+ datasets with < 50 bn triples?
      - Scalability vs representativeness: sampling & ranking for a good scalability/accuracy balance [ESWC2014] (applied to all responsive LOD datasets)
      [Diagram] Dataset Catalog/Registry with Dataset Metadata; db:Astronomy, db:Astro. Objects, db:Pluto (Dwarf Planet); yov:Video
      - Yovisto Video: <yo:Video …> <dc:title>Pluto & the Dwarf Planets</dc:title> … </yo:Video…>
      A Scalable Approach for Efficiently Generating Structured Dataset Topic Profiles, Fetahu, B., Dietze, S., Nunes, B. P., Casanova, M. A., Nejdl, W., 11th Extended Semantic Web Conference (ESWC2014), Crete, Greece (2014).
  17. Efficient dataset profiling: method
      1. Sampling of resource instances (random sampling, weighted sampling, resource centrality sampling)
      2. Entity and topic extraction (NER via DBpedia Spotlight, category mapping and expansion)
      3. Normalisation and ranking (using graphical models such as PageRank with Priors, HITS with Priors and K-Step Markov)
      Result: weighted dataset-topic profile graph
      A Scalable Approach for Efficiently Generating Structured Dataset Topic Profiles, Fetahu, B., Dietze, S., Nunes, B. P., Casanova, M. A., Nejdl, W., 11th Extended Semantic Web Conference (ESWC2014), Crete, Greece (2014).
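The ranking step (3) can be illustrated with a small power-iteration sketch of PageRank biased towards a prior distribution. This is a generic formulation under stated assumptions, not the implementation from [ESWC2014]: the toy dataset-entity-category graph, the back-probability `beta`, the iteration count and the handling of nodes without outgoing edges are all choices made for this example.

```python
def pagerank_with_priors(graph, priors, beta=0.15, iterations=50):
    """Rank nodes by a random walk that jumps back to the priors.

    graph:  dict mapping a node to the list of nodes it links to
    priors: dict of non-negative prior weights (at least one must be > 0)
    beta:   probability of jumping back to the prior distribution
    """
    nodes = set(graph) | {t for targets in graph.values() for t in targets}
    total = sum(priors.get(n, 0.0) for n in nodes)
    prior = {n: priors.get(n, 0.0) / total for n in nodes}
    rank = dict(prior)
    for _ in range(iterations):
        nxt = {n: beta * prior[n] for n in nodes}
        for node in nodes:
            targets = graph.get(node, ())
            if targets:
                share = (1 - beta) * rank[node] / len(targets)
                for target in targets:
                    nxt[target] += share
            else:
                # Node without out-links: hand its mass back to the priors.
                for n in nodes:
                    nxt[n] += (1 - beta) * rank[node] * prior[n]
        rank = nxt
    return rank


# Toy profile graph: dataset -> extracted entities -> categories (illustrative).
PROFILE_GRAPH = {
    "dataset": ["db:Pluto", "db:Sun"],
    "db:Pluto": ["cat:Dwarf_planets", "cat:Astronomical_objects"],
    "db:Sun": ["cat:Astronomical_objects"],
}
scores = pagerank_with_priors(PROFILE_GRAPH, {"dataset": 1.0})
```

With the prior concentrated on the dataset node, categories reached through more of its entities end up with higher scores, which is the intuition behind a weighted dataset-topic profile.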
  18. Dataset profiling: exploring LOD datasets/topics in a nutshell
      http://data-observatory.org/lod-profiles/
      - Automatic extraction of dataset "topics" [ESWC2014] => RDF/VoID dataset profiles
      - Visualisation & exploration of dataset-topic graph (datasets, topics, relationships)
      - Includes all (responsive) datasets of LOD Cloud
  19. Dataset profiling: evaluation
      Datasets & ground truth
      - Yovisto, Oxpoints, LAK Dataset, Semantic Web Dogfood
      - Crowd-sourced topic indicators from datasets (keywords, tags)
      - Manual mapping to entities & category extraction (ranking according to frequency)
      Baselines
      - 1) LDA, 2) tf/idf (applied to entire datasets)
      - Topic extraction according to our approach, weighting/ranking based on term weight
      Measure
      - NDCG @ rank l
      - Performance (time/NDCG) for different sampling strategies/sizes etc.
      [Figure] NDCG (averaged over all datasets)
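For readers unfamiliar with the measure, a minimal NDCG @ rank l sketch follows. The slides do not say which NDCG variant was used; this version assumes linear gain, a log2 position discount, and an ideal ranking built by re-sorting the same relevance list, all of which are assumptions of the example.

```python
import math


def dcg(relevances, k):
    """Discounted cumulative gain over the first k ranked items."""
    return sum(rel / math.log2(position + 2)
               for position, rel in enumerate(relevances[:k]))


def ndcg(ranked_relevances, k):
    """DCG of the given ranking divided by the DCG of the ideal ordering."""
    ideal = dcg(sorted(ranked_relevances, reverse=True), k)
    return dcg(ranked_relevances, k) / ideal if ideal > 0 else 0.0
```

Here `ranked_relevances` holds the ground-truth relevance of each extracted topic in the order the profiling method ranked it; a perfectly ordered list scores 1.0, and averaging the per-dataset values gives the kind of figure the slide reports.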
  20. What (dataset) have these categories in common?
      - dbp:Category:1955_births
      - dbp:Category:People_from_London
      - dbp:Category:Buzzwords
      - dbp:Category:Semantic_Web
      - dbp:Category:Web_Services
      - dbp:Category:HTTP
      - dbp:Category:Unitarian_Universalists
      - dbp:Category:World_Wide_Web
      - dbp:Category:Royal_Medal_winners
  21. Diversity of category profile for a single publication
      DBLP entry: Berners-Lee, Tim; Hendler, James; Lassila, Ora (2001). "The Semantic Web". Scientific American Magazine.
      - foaf:Person: dbp:Tim_Berners-Lee
      - foaf:Document: dbp:Semantic_Web
      First-level categories (dcterms:subject):
      - dbp:Category:1955_births
      - dbp:Category:People_from_London
      - dbp:Category:Buzzwords
      - dbp:Category:Semantic_Web
      - dbp:Category:Web_Services
      - dbp:Category:HTTP
      - dbp:Category:Unitarian_Universalists
      - dbp:Category:World_Wide_Web
      - dbp:Category:Royal_Medal_winners
  22. Type-specific exploration of dataset categories
      http://data-observatory.org/led-explorer/
      Type-specific views on datasets/categories
      - "Document" (foaf:Document)
      - "Person" (foaf:Person)
      - "Course" (aiiso:Course)
      Currently applied to datasets in the LinkedUp Catalog only (as schema mappings are already available there)
      Exploring type-specific topic profiles of datasets: a demo for educational linked data, Taibi, D., Dietze, S., Fetahu, B., Fulantelli, G., Demo at International Semantic Web Conference 2014 (ISWC2014).
  23. data.l3s.de: the L3S DataHub
  24. KEYSTONE & PROFILES 2014
      KEYSTONE: semantic keyword-based search on structured data sources (2013-2017), http://www.keystone-cost.eu/
      - Research network focused on distributed search and dataset profiling, spanning Semantic Web, Databases, etc.
      - Open to new members (beyond Europe)
      PROFILES2014 - Dataset PROFIling & fEderated Search for Linked Data, http://www.keystone-cost.eu/profiles
      - Workshop collocated with ESWC2014
      IJSWIS Special Issue on … LD search & profiling, http://www.ijswis.org/?q=node/51/
      - Deadline 8 December 2014
  25. Summing up
      Summary
      - Increasing amounts of data => require knowledge about nature and relationships of datasets
      - Profiling: scalable methods for extracting dataset metadata
      - Interlinking: connectivity of entities or datasets
      What about LD evolution?
      - In RDF graphs (e.g. LOD Cloud), "all" nodes are connected
      - Impact of evolution on preservation, linking and enrichment?
      - Which parts of datasets to preserve (entity "neighbourhood")? => semantic relatedness/relevance/entity retrieval
      - Link correctness in evolving LD?
      - ….
  26. Спасибо! Thank You!
      See also (general)
      - http://purl.org/dietze
      - http://linkedup-project.eu
      - http://duraark.eu
      - http://data.l3s.de
      See also (data)
      - http://data.l3s.de
      - http://data.linkededucation.org
      - http://lak.linkededucation.org
      Acknowledgements
      - Besnik Fetahu (L3S)
      - Elena Demidova (L3S)
      - Bernardo Pereira Nunes (PUC Rio)
      - Marco Casanova (PUC Rio)
      - Luiz Andre Paes Leme (PUC Rio)
      - Giseli Lopes (PUC Rio)
      - Davide Taibi (CNR, IT)
      - Mathieu d’Aquin (Open University, UK)
      - and many more…
