
Semantic Linking & Retrieval for Digital Libraries

An overview of recent work on entity linking and retrieval in large corpora, specifically bibliographic data. The work addresses both traditional Linked Data and knowledge graphs as well as data extracted from Web markup, such as the Web Data Commons.

  1. Semantic Linking & Retrieval for Digital Libraries (Backup)
     Dr. Stefan Dietze, 11.02.2016, Institut für Informatik, Universität Bonn
     Slide footer: 29/03/16, Stefan Dietze
  2. Overview: research/application context
     Information (types): bibliographic (meta)data; research information; educational (meta)data; Web & social data
     Stakeholders: archival organisations; digital libraries; publishers; resource providers/consumers
     Domains: Life Sciences; Computer Science; Learning Analytics; ...
     Data-centric tasks: publishing, preservation, annotation, crawling, search, retrieval, ...
  3. Overview: contents
     - Introduction & motivation
     - Publishing, linking and profiling: publishing & linking (bibliographic) data; dataset profiling & linking
     - Retrieval & search: entity retrieval in large graphs; embedded (bibliographic) Web data; entity summarisation from Web markup
     - Outlook and future directions
  4. Overview: contents
     - Introduction & motivation
     - Publishing, linking and profiling: publishing & linking (bibliographic) data; dataset profiling & linking
     - Retrieval & search: entity retrieval in large graphs; embedded (bibliographic) Web data; entity summarisation from Web markup
     - Outlook and future directions
     Annotations on the slide: knowledge graphs and linked data; beyond LD: embedded semantics; [ESWC13, ESWC14], [ISWC15], [WebSci13, SWJ15], [ongoing]
  5. Linked Data diversity: example library & scholarly data
     - Linked Data: W3C standards & de-facto standard for sharing data on the Web (roughly 1000 datasets, 100 bn triples), adopted specifically by the library/GLAM sector & life sciences
     - Strong focus (still) on established knowledge graphs, e.g. Yago, DBpedia, Freebase
     Vocabularies/Schemas: BIBO (Bibliographic Ontology); BIRO (Bibliographic Reference Ontology); CITO (Citation Typing Ontology); SPAR vocabularies (incl. CITO, BIRO); SWRC (Semantic Web Dogfood); Functional Requirements for Bibliographic Records (FRBR); Nature Publishing Group Ontology; mEducator Educational Resources; ...
     Datasets: Europeana; British Library; German, French and Spanish national libraries; Nature Publishing Group; Hochschulbibliothekszentrum NRW; Elsevier Scholarly Publications; TED Talks; mEducator Linked Educational Resources; Open Courseware Consortium; LAK Dataset; ...
     Initiatives: W3C Library Linked Data Incubator Group; Linked Library Data group on DataHub; LinkedUniversities.org; LinkedEducation.org; W3C Linked Open Education Community Group; ...
  6. Challenge: efficient search for suitable resources & datasets
     - "Quality": currency, dynamics, accessibility [Buil-Aranda2013], correctness [Paulheim2013], schema compliance [Hogan2012]
     - Domains/topics: which datasets/resources address topic XY (e.g. "microbiology")?
     - Types: statistical data, bibliographic resources, AV resources, scholarly publications?
     - Links: related datasets?
  7. Data publishing, linking and profiling: LinkedUp Dataset Catalog/Registry
     http://data.linkededucation.org/linkedup/catalog/
     - LinkedUp project (FP7 project: L3S, OU, OKFN, Elsevier, Exact Learning Solutions)
     - LinkedUp Catalog: largest collection of LD/Open Data for educationally relevant resources (approx. 50 datasets)
     - Original datasets published with key content providers, automatically extracted metadata
  8. mEducator: medical educational resources
     - EC-funded eContentPlus project (2009-2012)
     - Exploratory search through semantic and clustering techniques
     - Lifting/enriching/clustering medical metadata
     - Common vocabularies (MESH, SNOMED, Bioportal etc)
     - mEducator dataset: first Linked Data corpus of enriched OER metadata, used by a number of applications
     References:
     Dietze, S., Kaldoudi, E., Dovrolis, E., Giordano, D., Spampinato, C., Hendrix, M., Protopsaltis, A., Taibi, D., Yu, H. Q. (2013), Socio-semantic Integration of Educational Resources – the Case of the mEducator Project, in Journal of Universal Computer Science (J.UCS), Vol. 19, No. 11, pp. 1543-1569.
     Dietze, S., Taibi, D., Yu, H. Q., Dovrolis, N., A Linked Dataset of Medical Educational Resources, British Journal of Educational Technology (BJET), Volume 46, Issue 5, pages 1123–1129, September 2015.
  9. LAK Dataset: facilitating scientometrics
     - Linked Data corpus of "Learning Analytics" publications of the last 5 years (~ 800 publications)
     - Metadata, full-text & automated linking (DBLP, SWDF, DBpedia)
     - Wide adoption (http://lak.linkededucation.org)
     - Cooperation of (partner logos not preserved in the transcript)
     Steps: 1. Data extraction & vocabulary definition; 2. Entity co-reference resolution & linking; 3. Applications & analysis
     Concept              Type                 #
     Reference            npg:Citation         7885
     Author               foaf:Person          1214
     Conference Paper     swrc:InProceedings   652
     Organization         foaf:Organization    365
     Journal Paper        bibo:Article         45
     Proceedings Volume   swrc:Proceedings     15
     Journal Volume       bibo:Journal         9
     Reference: Dietze, S., Taibi, D., D’Aquin, M., Facilitating Scientometrics in Learning Analytics and Educational Data Mining - the LAK Dataset, Semantic Web Journal, 2015.
  10. LinkedUp Catalog in a nutshell: dataset index & registry, federated search
     - "Federated queries" through schema mappings
     - Dataset accessibility
     - Linking & topic profiling
     Focus of the following slides: Schema/Types
  11. Schema analysis & mapping
     Co-occurrence of types (in 146 datasets: 144 vocabularies, 588 overlapping types, 719 predicates)
     [Figure: type co-occurrence graph; labelled nodes include po:Programme, yov:Video, bibo:Book]
     Reference: D’Aquin, M., Adamou, A., Dietze, S., Assessing the Educational Linked Data Landscape, ACM Web Science 2013 (WebSci2013), Paris, France, May 2013.
  12. Schema analysis & mapping
     Co-occurrence of types (in 146 datasets: 144 vocabularies, 588 overlapping types, 719 predicates)
     Co-occurrence after mapping (201 frequently occurring types, mapped into 79 types)
     [Figure: type co-occurrence graph after mapping; labelled nodes include bibo:Film, bibo:Document, po:Programme, bibo:Book, foaf:Document, yov:Video, typeX]
     Reference: D’Aquin, M., Adamou, A., Dietze, S., Assessing the Educational Linked Data Landscape, ACM Web Science 2013 (WebSci2013), Paris, France, May 2013.
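The type co-occurrence analysis on slides 11 and 12 can be illustrated with a small sketch. It counts, for every pair of types, in how many datasets both appear, optionally after applying a type mapping. The function name, the dataset contents and the mapping used in the usage note are assumptions for illustration; they are not taken from the WebSci2013 paper.

```python
from collections import Counter
from itertools import combinations


def type_cooccurrence(datasets, mapping=None):
    """Count in how many datasets each pair of types occurs together.

    datasets: dict of dataset name -> set of type names used in it.
    mapping:  optional dict of type -> canonical type (hypothetical mapping).
    """
    mapping = mapping or {}
    counts = Counter()
    for types in datasets.values():
        # Apply the mapping first, so that mapped types collapse into one.
        mapped = {mapping.get(t, t) for t in types}
        for pair in combinations(sorted(mapped), 2):
            counts[pair] += 1
    return counts
```

With three toy datasets that each pair foaf:Document with a different media type, mapping yov:Video and po:Programme onto one shared type raises the co-occurrence count of that shared type with foaf:Document from 1 to 2. This is the effect the "after mapping" slide visualises.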
  13. LinkedUp Catalog in a nutshell: dataset index & registry, federated search
     http://data.linkededucation.org/linkedup/catalog/
     - "Federated queries" through schema mappings
     - Dataset accessibility
     - Linking & topic profiling
     Focus of the following slides: Dataset topic profiles
  14. Data(set) interlinking
     - Linking entities/datasets through a combination of (i) a "semantic (graph-based) connectivity score (SCS)" (based on Katz centrality) and (ii) a "co-occurrence-based measure (CBM)" (similar to Normalised Google Distance)
     - Evaluation: outperforming Explicit Semantic Analysis (ESA)
     Example: a Yovisto video (yov:Video) and a British Library book (bibo:Book), linked with SCS = 0.32 and CBM = 0.24
       Yovisto Video: <yo:Video …> <dc:title> Lecture 29 – Stem Cells </dc:title> … </yo:Video…>
       British Library Book: <bibo:Book …> <bibo:title>Über den Hungertyphus</.> <bibo:creator>Rudolf Virchov</…> </bibo:Book…>
       Entities shown: db:Medicine, db:Rudolf Virchow, db:Cell Biology; candidates for disambiguation: db:Cell (Biology) vs db:Cell (Microprocessor)
     Reference: Nunes, B. P., Dietze, S., Casanova, M.A., Kawase, R., Fetahu, B., Nejdl, W., Combining a co-occurrence-based and a semantic measure for entity linking, ESWC 2013 - 10th Extended Semantic Web Conference (May 2013).
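To make the two measures on slide 14 more concrete, here is a minimal sketch of the ideas they build on. It is not the SCS or CBM formula from the ESWC 2013 paper. `katz_connectivity` is an assumed, simplified Katz-style score that sums walks between two entities, damped by walk length. `normalised_google_distance` is the standard Normalised Google Distance, which the slide only says the CBM is "similar to". Function names, the damping factor and the walk-length cut-off are assumptions.

```python
import math


def katz_connectivity(graph, source, target, beta=0.5, max_len=3):
    """Katz-style score: sum over walk lengths l of beta**l * (#walks of length l).

    graph: dict of node -> iterable of neighbour nodes.
    """
    score = 0.0
    frontier = {source: 1}  # node -> number of walks of the current length ending there
    for length in range(1, max_len + 1):
        nxt = {}
        for node, walks in frontier.items():
            for neighbour in graph.get(node, ()):
                nxt[neighbour] = nxt.get(neighbour, 0) + walks
        score += (beta ** length) * nxt.get(target, 0)
        frontier = nxt
    return score


def normalised_google_distance(fx, fy, fxy, n):
    """Standard NGD from occurrence counts fx, fy, co-occurrence count fxy, corpus size n."""
    if fxy == 0:
        return math.inf  # the two terms never co-occur
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))
```

A higher Katz-style score means two entities are connected by more and shorter walks. A lower NGD means two terms co-occur more often than their individual frequencies would suggest.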
  15. Scalable profiling of datasets
     - Extraction of representative (DBpedia) categories ("topic profile") for arbitrary datasets
     - Technically trivial, but scalability issues: LOD Cloud 1000+ datasets with <100 billion RDF statements
     - Efficient approach: sampling & ranking for balance between scalability and precision/recall
     Example: Yovisto video (yov:Video) in the Dataset Catalog/Registry, <yo:Video …> <dc:title>Lecture 29 – Stem Cells</dc:title> … </yo:Video…>, with extracted topics db:Cell (Biology), db:Cell biology, db:Biology
     Reference: Fetahu, B., Dietze, S., Nunes, B. P., Casanova, M. A., Nejdl, W., A Scalable Approach for Efficiently Generating Structured Dataset Topic Profiles, 11th Extended Semantic Web Conference (ESWC2014), Crete, Greece, (2014).
  16. Efficient dataset profiling
     1. Sampling of resources (random sampling, weighted sampling, resource centrality sampling)
     2. Entity & topic extraction (NER via DBpedia Spotlight, category mapping & expansion)
     3. Normalisation & ranking (graph-based models such as PageRank with Priors, HITS with Priors & K-Step Markov)
     - Result: weighted dataset-topic profile graph
     Reference: Fetahu, B., Dietze, S., Nunes, B. P., Casanova, M. A., Nejdl, W., A Scalable Approach for Efficiently Generating Structured Dataset Topic Profiles, 11th Extended Semantic Web Conference (ESWC2014), Crete, Greece, (2014).
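The ranking step (step 3) can be sketched with a plain power-iteration version of PageRank with priors, one of the graph models the slide names. This is a generic textbook formulation under stated assumptions, not the implementation from the ESWC2014 paper. The graph layout (dataset -> entities -> categories), the node names, `back_prob` and the iteration count are all hypothetical.

```python
def pagerank_with_priors(graph, priors, back_prob=0.15, iterations=50):
    """Personalised PageRank by power iteration.

    graph:  dict of node -> list of successor nodes.
    priors: dict of node -> non-negative prior weight (here: the dataset node).
    With probability back_prob the walk jumps back to a node drawn from the priors.
    """
    nodes = set(graph) | {m for succ in graph.values() for m in succ} | set(priors)
    total = sum(priors.values())
    prior = {n: priors.get(n, 0.0) / total for n in nodes}
    rank = dict(prior)
    for _ in range(iterations):
        nxt = {n: back_prob * prior[n] for n in nodes}
        for n in nodes:
            successors = graph.get(n, [])
            if successors:
                share = (1 - back_prob) * rank[n] / len(successors)
                for m in successors:
                    nxt[m] += share
            else:
                # Dangling node: hand its mass back according to the priors.
                for m in nodes:
                    nxt[m] += (1 - back_prob) * rank[n] * prior[m]
        rank = nxt
    return rank
```

Starting the priors at the dataset node means categories reachable through many sampled entities accumulate more rank, which yields a weighted dataset-topic profile of the kind the slide describes.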
  17. Search & exploration of datasets through topic profiles, in a nutshell
     http://data-observatory.org/lod-profiles/
     - Applied to the entire LOD cloud/graph
     - Visual exploration of extracted RDF dataset profiles (datasets, topics, relationships)
     - Evaluation results: K-Step Markov (10% sampling size) outperforms baselines (LDA, tf/idf on entire datasets)
  18. Search: entity retrieval on large structured datasets? In a nutshell
     Challenges:
     - How to efficiently retrieve related entities/resources for a given query?
     - Explicit entity links (owl:sameAs etc) are sparse yet important to facilitate state-of-the-art methods (e.g. BM25F, Blanco et al, ISWC2011)
     - Query type affinity?
     Example: entities related to <James D. Watson> in a large dataset/crawl, e.g. LinkedUp dataset graph, LIVIVO dataset, BTC2014
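Slide 18 names BM25F as the state-of-the-art baseline. The following is a minimal sketch of one common simplified BM25F formulation (field-weighted, length-normalised term frequency with saturation). It is not the exact variant or parameter setting used by Blanco et al. Field names, weights, `k1`, `b` and the idf smoothing are assumptions.

```python
import math


def bm25f_score(query_terms, doc, field_weights, avg_len, doc_freq, n_docs,
                k1=1.2, b=0.75):
    """Score one entity description against a query.

    doc:           dict of field name -> list of tokens.
    field_weights: dict of field name -> weight (hypothetical values).
    avg_len:       dict of field name -> average field length in the collection.
    doc_freq:      dict of term -> number of documents containing the term.
    """
    score = 0.0
    for term in query_terms:
        pseudo_tf = 0.0
        for field, weight in field_weights.items():
            tokens = doc.get(field, [])
            if not tokens:
                continue
            # Per-field length normalisation, then weighting.
            norm = 1 - b + b * len(tokens) / avg_len[field]
            pseudo_tf += weight * tokens.count(term) / norm
        if pseudo_tf == 0:
            continue
        df = doc_freq.get(term, 0)
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))  # smoothed, always positive
        score += idf * pseudo_tf / (k1 + pseudo_tf)
    return score
```

Because the label field carries a higher weight than the description field in this setup, an entity whose label matches the query ranks above one that mentions the query term only in its description.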
  19. Entity retrieval: approach in a nutshell
     (I) Offline processing (clustering to address link sparsity)
       1. Feature vectors (lexical and structural features)
       2. Bucketing: per type (LSH algorithm)
       3. Clustering: X-means & spectral clustering per bucket
     (II) Online processing (retrieval)
       1. Retrieval & expansion: a) BM25F results, b) expansion from clusters (related entities)
       2. Re-ranking (context terms & query type affinity)
     Reference: Fetahu, B., Gadiraju, U., Dietze, S., Improving Entity Retrieval on Structured Data, 14th International Semantic Web Conference (ISWC2015), Bethlehem, US, (2015).
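The online phase (II) can be sketched as follows: take the base retrieval results, add entities that share a cluster with any result, then re-rank all candidates. This is an illustration of the idea only. The scoring function (base score plus context-term overlap plus type affinity), the data layout and all names are assumptions, not the method from the ISWC paper.

```python
def expand_and_rerank(base_results, clusters, entities, query_terms,
                      query_type, type_affinity):
    """Expand base results with cluster members and re-rank.

    base_results:  dict of entity id -> base retrieval score (e.g. from BM25F).
    clusters:      dict of entity id -> cluster id (computed offline).
    entities:      dict of entity id -> {"terms": set of str, "type": str}.
    type_affinity: dict of (query type, entity type) -> affinity in [0, 1].
    """
    members = {}
    for entity_id, cluster_id in clusters.items():
        members.setdefault(cluster_id, set()).add(entity_id)

    # Expansion: cluster neighbours enter with a base score of 0.
    candidates = dict(base_results)
    for entity_id in base_results:
        for member in members.get(clusters.get(entity_id), ()):
            candidates.setdefault(member, 0.0)

    def score(entity_id):
        info = entities[entity_id]
        overlap = len(query_terms & info["terms"]) / len(query_terms)
        affinity = type_affinity.get((query_type, info["type"]), 0.0)
        return candidates[entity_id] + overlap + affinity

    return sorted(candidates, key=score, reverse=True)
```

In this toy setup, an entity that the base retrieval missed can still surface through its cluster and overtake a lexically similar entity of the wrong type. That matches the slide's motivation of remedying sparse explicit links.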
  20. Entity retrieval: evaluation
     Dataset: BTC2014 (1.4 billion triples); 92 SemSearch queries
     Methods: our approaches XM (X-means) and SP (spectral clustering); baselines B (BM25F) and S1 (Tonon et al [SIGIR12])
     Conclusions: XM & SP outperform the baselines; clustering remedies link sparsity; relevance to the query is crucial
     Reference: Fetahu, B., Gadiraju, U., Dietze, S., Improving Entity Retrieval on Structured Data, 14th International Semantic Web Conference (ISWC2015), Bethlehem, US, (2015).
  21. Overview: contents
     - Introduction & motivation
     - Publishing, linking and profiling: publishing & linking (bibliographic) data; dataset profiling & linking
     - Retrieval & search: entity retrieval in large graphs; embedded (bibliographic) Web data; entity summarisation from Web markup
     - Outlook and future directions
     Covered so far: [ESWC13, ESWC14], [ISWC15], [WebSci13, SWJ15]. Outcomes & impact?
  22. Tangible outcomes / impact
     - Open Datasets
     - Applications
     - Vocabularies & Schemas: VOL + vocabularies for educational resource & service modeling
     - Initiatives & Working Groups: W3C Community Group "Open Linked Education"; DCMI Task Force on LRMI; W3C Schema Bib Extend Group; tutorial & workshop series on Linked Data & Learning; LinkedUniversities, LinkedEducation.org; KEYSTONE WG "Search and Profiling of LD"; ...
     http://linkeduniversties.org
  23. Overview: contents
     - Introduction & motivation
     - Publishing, linking and profiling: publishing & linking (bibliographic) data; dataset profiling & linking
     - Retrieval & search: entity retrieval in large graphs; embedded (bibliographic) Web data; entity summarisation from Web markup
     - Outlook and future directions
     Next: beyond LD: embedded semantics
  24. The Web as a knowledge base: semantics on the Web?
     - The Web: approx. 46,000,000,000,000 (46 trillion) Web pages indexed by Google
     - Linked Data: approx. 1000 datasets & 100 billion statements, a different order of magnitude wrt scale & dynamics
     - Other "semantics" (structured facts) on the Web?
  25. Embedded semantics: Web page markup & schema.org
     - Embedded markup (RDFa, Microdata, Microformats) for interpretation of Web documents (search, retrieval)
     - Arbitrary vocabularies; schema.org used at scale (700 classes, 1000 predicates)
     - Adoption on the Web: 26% (2014 Google study of 12 bn Web pages)
     - "Web Data Commons" (Meusel & Paulheim [ISWC2014]): markup from Common Crawl (2.2 billion pages), 17 billion RDF quads; markup in 26% of pages, 14% of PLDs in 2013 (increase from 6% in 2011)
     - Same order of magnitude as "the Web"
     Example markup:
       <div itemscope itemtype="http://schema.org/Movie">
         <h1 itemprop="name">Forrest Gump</h1>
         <span>Actor: <span itemprop="actor">Tom Hanks</span></span>
         <span itemprop="genre">Drama</span>
         ...
       </div>
     RDF statements:
       node1 actor _node-x
       node1 actor Robin Wright
       node1 genre Comedy
       node2 actor T. Hanks
       node2 distributed by Paramount Pic.
       node3 actor Tom Cruise
       node3 distributed by Paramount Pic.
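How markup like the Forrest Gump example turns into statements can be sketched with Python's standard `html.parser`. This is a deliberately minimal illustration, not a conforming Microdata parser and not the Web Data Commons extraction pipeline. It assumes flat items whose property elements contain only text, and the `node1`, `node2` identifiers and the `type` predicate name are made up for the example.

```python
from html.parser import HTMLParser


class MicrodataExtractor(HTMLParser):
    """Collect (node, predicate, value) tuples from simple, flat Microdata."""

    def __init__(self):
        super().__init__()
        self.statements = []
        self._node = None   # identifier of the current itemscope
        self._count = 0
        self._prop = None   # itemprop currently being read, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemscope" in attrs:
            self._count += 1
            self._node = f"node{self._count}"
            if attrs.get("itemtype"):
                self.statements.append((self._node, "type", attrs["itemtype"]))
        elif "itemprop" in attrs and self._node:
            self._prop = attrs["itemprop"]
            self._text = []

    def handle_data(self, data):
        if self._prop is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        # Assumes the property element has no child elements.
        if self._prop is not None:
            value = "".join(self._text).strip()
            self.statements.append((self._node, self._prop, value))
            self._prop = None


def extract_statements(html):
    parser = MicrodataExtractor()
    parser.feed(html)
    return parser.statements
```

Each page yields its own local node identifiers. That is why the same real-world entity ends up as many unlinked descriptions across a crawl, which is the co-reference problem raised on the next slide.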
  26. Using markup as global knowledge base - state of the art
     - Glimmer (http://glimmer.research.yahoo.com): entity retrieval (BM25F) on WDC dataset [Blanco, Mika & Vigna, ISWC2011]
     - Challenges: specific characteristics of markup data
     Characteristics and examples:
     - Coreferences: 18,000 results for <"Iphone 6", type, s:Product> (8.6 quads on average)
     - Redundancy: <s, schema:name, "Iphone 6"> occurring 1000 times in WDC2013
     - Lack of links: largely unlinked entity descriptions / subgraphs
     - Errors (typos & schema violations, see Meusel et al [ESWC2015]):
       - Wrong namespaces, such as http://schma.org
       - Undefined types & predicates: 9.7% in WDC, less common than in LOD
       - Confusion of datatype and object properties: <s1, s:publisher, "Springer">, 24.35% object property issues vs 8% in LOD
       - Data property range violations: e.g. literals vs numbers (12.6% in WDC vs 4.6% in LOD)
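One of the error classes above, misspelled namespaces such as http://schma.org, lends itself to a simple repair heuristic. The sketch below uses fuzzy string matching on the host name. It is an assumed illustration, not the heuristics published by Meusel & Paulheim; the function name and the similarity threshold are hypothetical.

```python
from difflib import SequenceMatcher
from urllib.parse import urlsplit

SCHEMA_HOST = "schema.org"


def fix_schema_namespace(iri, threshold=0.8):
    """Rewrite IRIs whose host is a near-miss of schema.org; leave others untouched."""
    parts = urlsplit(iri)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    # Similarity in [0, 1]; 1.0 means the host is exactly schema.org.
    if SequenceMatcher(None, host, SCHEMA_HOST).ratio() >= threshold:
        return "http://" + SCHEMA_HOST + parts.path
    return iri
```

A threshold this loose would also rewrite legitimate hosts that happen to resemble schema.org, so in practice such a rule needs a whitelist or manual review.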
  27. Using markup as global knowledge base/graph?
     - Goal: obtaining an entity summary (or entity-centric knowledge graph) for a given query?
     - Tasks: document annotation, knowledge base augmentation, semantic enrichments
     Source: Web page markup from Web crawls, WDC or large (domain-specific) crawls, e.g. publishers, universities, libraries etc
     Query: Nucleic Acids, type:(Article)
     Entity summary/graph:
       name           Molecular structure of nucleic acids
       author         James D. Watson, Francis Crick
       publisher      Nature
       datePublished  1953
  28. Ongoing work: entity summarisation from markup data
     - Extract (domain-specific) knowledge bases and knowledge graphs for digital libraries
     - Experiments on WDC data: 87.6% MAP; coverage: on average 57% additional facts compared to DBpedia
     Source: Web page markup from Web crawls, WDC or large (domain-specific) crawls, e.g. publishers, universities, libraries etc
     Query: Nucleic Acids, type:(Article)
     Steps: 1. Retrieval; 2. Fact selection (clustering, heuristics, trained classifier)
     Candidate facts:
       node1 name Molecular structure of nucleic acids
       node1 author James D. Watson
       node1 publisher Nature
       node1 datePublished 1956
       node1 datePublished 1953
       node2 name Francis Crick
       node2 name Cricks
     Entity summary/graph:
       name           Molecular structure of nucleic acids
       author         James D. Watson, Francis Crick
       publisher      Nature
       datePublished  1953
     New queries: James D. Watson, type:(Person); Francis Crick, type:(Person); Nature, type:(Organization)
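The fact selection step can be illustrated with the simplest possible stand-in: frequency voting over candidate facts. The slide names clustering, heuristics and a trained classifier for this step; the sketch below does not implement any of those. It only shows the input and output shape of the task. The function name, the set of single-valued predicates and the observation counts in the usage note are assumptions.

```python
from collections import Counter, defaultdict


def select_facts(candidates, single_valued=("name", "publisher", "datePublished")):
    """Pick facts for an entity summary by simple frequency voting.

    candidates: list of (predicate, value) pairs, one per observation in the crawl.
    For single-valued predicates the most frequent value wins; for all other
    predicates every distinct value is kept, in order of first appearance.
    """
    by_predicate = defaultdict(Counter)
    for predicate, value in candidates:
        by_predicate[predicate][value] += 1

    summary = {}
    for predicate, counts in by_predicate.items():
        if predicate in single_valued:
            summary[predicate] = [counts.most_common(1)[0][0]]
        else:
            summary[predicate] = list(counts)
    return summary
```

If, hypothetically, datePublished 1953 is observed on two pages and 1956 on one, voting keeps 1953 and both authors, matching the summary on the slide. Voting cannot break the tie when each value appears once, which is one reason a learned selection step is attractive.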
  29. Embedded data for digital libraries / life sciences?
     [Figure: number of entities and number of statements per PLD (ranked), count on a log scale]
     Unprecedented source of bibliographic data
     - Metadata about scholarly articles (s:ScholarlyArticle): 6,793,764 quads, 1,184,623 entities, 429 distinct predicates (in WDC / 1 type alone)
     - Top 5 domains: Springer, MDPI, BMJ, diabetesjournals.org, mendeley.com, Biodiversitylibrary.org
     Domains, topics, disciplines?
     - Life Sciences and Computer Science predominant
     - Top-10 article titles
     - Most important publishers/journals, libraries represented
     => Domain-specific & targeted crawls = unprecedented source of data
  30. Future work: improving entity-centric tasks for digital libraries
     Sources: knowledge graphs and LD (Yago, Freebase, Pubmed, DBLP etc); Linked Data; embedded data; unstructured (Web) documents
     Web data as knowledge resource: background knowledge/structured data; training data & ground truths; ...
     Improving data-centric tasks for large (bibliographic/life sciences) corpora, e.g. LIVIVO: KB construction & augmentation; document annotation; entity recognition, disambiguation, interlinking; search & retrieval; ...
     Example entities:
       node1 name Molecular structure of nucleic acids
       node1 author James D. Watson
       node1 publisher Nature
       node1 datePublished 1956
       node1 datePublished 1953
       node2 name Francis Crick
       node2 name Cricks
       node2 born 1916
  31. Acknowledgements: team
     Besnik Fetahu (L3S); Ivana Marenzi (L3S); Ujwal Gadiraju (L3S); Eelco Herder (L3S); Ran Yu (L3S); Ricardo Kawase (L3S); Pracheta Sahoo (L3S, IIT India); Bernardo Pereira Nunes (L3S, PUC Rio); + external collaborators
  32. References (presented work)
     - Dietze, S., Taibi, D., D’Aquin, M., Facilitating Scientometrics in Learning Analytics and Educational Data Mining - the LAK Dataset, Semantic Web Journal, 2016.
     - Dietze, S., Kaldoudi, E., Dovrolis, E., Giordano, D., Spampinato, C., Hendrix, M., Protopsaltis, A., Taibi, D., Yu, H. Q. (2013), Socio-semantic Integration of Educational Resources – the Case of the mEducator Project, in Journal of Universal Computer Science (J.UCS), Vol. 19, No. 11, pp. 1543-1569.
     - Dietze, S., Taibi, D., Yu, H. Q., Dovrolis, N., A Linked Dataset of Medical Educational Resources, British Journal of Educational Technology (BJET), Volume 46, Issue 5, pages 1123–1129, September 2015.
     - Gadiraju, U., Demartini, G., Kawase, R., Dietze, S., Human beyond the Machine: Challenges and Opportunities of Microtask Crowdsourcing, in: IEEE Intelligent Systems, Volume 30, Issue 4, Jul/Aug 2015.
     - Gadiraju, U., Kawase, R., Dietze, S., Demartini, G., Understanding Malicious Behavior in Crowdsourcing Platforms: The Case of Online Surveys, ACM CHI Conference on Human Factors in Computing Systems (CHI2015), April 18-23, Seoul, Korea.
     - Fetahu, B., Gadiraju, U., Dietze, S., Improving Entity Retrieval on Structured Data, 14th International Semantic Web Conference (ISWC2015), Bethlehem, US, (2015).
     - Fetahu, B., Dietze, S., Nunes, B. P., Casanova, M. A., Nejdl, W., A Scalable Approach for Efficiently Generating Structured Dataset Topic Profiles, 11th Extended Semantic Web Conference (ESWC2014), Crete, Greece, (2014).
     - D’Aquin, M., Adamou, A., Dietze, S., Assessing the Educational Linked Data Landscape, ACM Web Science 2013 (WebSci2013), Paris, France, May 2013.
     - Nunes, B. P., Dietze, S., Casanova, M.A., Kawase, R., Fetahu, B., Nejdl, W., Combining a co-occurrence-based and a semantic measure for entity linking, in: The Semantic Web: Semantics and Big Data, Proceedings of the 10th Extended Semantic Web Conference (ESWC2013), Lecture Notes in Computer Science Vol. 7882, Springer Berlin Heidelberg, 2013.
     http://www.stefandietze.net
  33. Selected related work
     Entity retrieval
     - Alberto Tonon, Gianluca Demartini, and Philippe Cudré-Mauroux: Combining Inverted Indices and Structured Search for Ad-hoc Object Retrieval. In: 35th Annual ACM SIGIR Conference (SIGIR 2012), Portland, Oregon, USA, August 2012.
     - Roi Blanco, Peter Mika, Sebastiano Vigna: Effective and Efficient Entity Search in RDF Data. International Semantic Web Conference (ISWC) 2011, pages 83-97.
     Embedded markup & Web Data Commons
     - Robert Meusel, Petar Petrovski, Christian Bizer: The WebDataCommons Microdata, RDFa and Microformat Dataset Series. Proceedings of the 13th International Semantic Web Conference (ISWC 2014), RBDS Track, Trentino, Italy, October 2014.
     - Robert Meusel and Heiko Paulheim: Heuristics for Fixing Common Errors in Deployed schema.org Microdata. Proceedings of the 12th Extended Semantic Web Conference (ESWC 2015), Portoroz, Slovenia, May 2015.
     Linked Data quality
     - Carlos Buil-Aranda, Aidan Hogan, Jürgen Umbrich, Pierre-Yves Vandenbussche: SPARQL Web-Querying Infrastructure: Ready for Action? International Semantic Web Conference 2013 (ISWC2013).
     - Paulheim, H., Bizer, C.: Type Inference on Noisy RDF Data. Semantic Web – ISWC 2013, Lecture Notes in Computer Science Volume 8218, 2013, pp 510-525.
     - Hogan, A., Umbrich, J., Harth, A., Cyganiak, R., Polleres, A., Decker, S.: An empirical survey of Linked Data conformance. Journal of Web Semantics 14, 2012.
  34. Thank you
     - http://stefandietze.net
     - http://data.l3s.de
     - http://data.linkededucation.org/linkedup/catalog
