LOHAI: Providing a baseline for KOS based automatic indexing
Kai Eckert
Automatic KOS-based indexing, i.e. indexing based on a restricted, controlled vocabulary, a thesaurus or a classification, can play an important role in closing the gap between intellectually indexed, high-quality publications and the mass of unindexed publications. Especially for unknown, heterogeneous publications such as web publications, simple processes that do not rely on manually created training data are needed. With this contribution, we propose a straightforward linguistic indexer that can be used as a basis for one's own developments and for experiments and analyses to explore one's own documents and KOSs; it uses state-of-the-art information retrieval techniques and hence forms a suitable baseline for evaluations. Finally, it is free and open source.
Guidance, Please! Towards a Framework for RDF-based Constraint Languages.
Kai Eckert
Presentation held at the DCMI Conference 2015 in São Paulo.
http://dcevents.dublincore.org/IntConf/dc-2015/paper/view/386
In the context of the DCMI RDF Application Profile task group and the W3C Data Shapes Working Group, solutions for the proper formulation of constraints and for the validation of RDF data against these constraints are being developed. Several approaches and constraint languages exist, but there is no clear favorite, and none of the languages is able to meet all requirements raised by data practitioners. To support the work, a comprehensive, community-driven database has been created in which case studies, use cases, requirements and solutions are collected. Based on this database, we have so far published 81 types of constraints that are required by various stakeholders for data applications. We are using this collection of constraint types to gain a better understanding of the expressiveness of existing solutions and of the gaps that still need to be filled. Regarding the implementation of constraint languages, we have already proposed to use high-level languages to describe the constraints, but to map them to SPARQL queries in order to execute the actual validation; we have demonstrated this approach for the Web Ontology Language in its current version 2 (OWL 2) and for Description Set Profiles (DSP). In this paper, we generalize from the experience of implementing OWL 2 and DSP by introducing an abstraction layer that is able to describe constraints of any constraint type in such a way that mappings from high-level constraint languages to this intermediate representation can be created more or less straightforwardly. We demonstrate that using another layer on top of SPARQL helps to implement validation consistently across constraint languages, simplifies the actual implementation of new languages, and supports the transformation of semantically equivalent constraints across constraint languages.
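To make the mapping idea concrete, here is a minimal sketch (in Python with rdflib) of how a single high-level constraint type, minimum cardinality, could be compiled to a SPARQL query whose solutions are the violations. The constraint, class, property and data are invented for illustration; the paper's actual intermediate representation is more general than this.

```python
# Sketch: compile a high-level constraint to SPARQL and run it for
# validation. All names and data are hypothetical.
from rdflib import Graph

data = """
@prefix ex: <http://example.org/> .
ex:book1 a ex:Book ; ex:title "A Title" .
ex:book2 a ex:Book ; ex:title "Another Title" ; ex:creator ex:alice .
"""

g = Graph()
g.parse(data=data, format="turtle")

def min_cardinality_query(cls, prop, n):
    # "Every instance of cls has at least n values for prop",
    # expressed as a query that returns the violating resources.
    return f"""
        SELECT ?resource WHERE {{
            ?resource a <{cls}> .
            OPTIONAL {{ ?resource <{prop}> ?value }}
        }}
        GROUP BY ?resource
        HAVING (COUNT(?value) < {n})
    """

q = min_cardinality_query("http://example.org/Book",
                          "http://example.org/creator", 1)
for row in g.query(q):
    print(f"violation: {row.resource} has too few creators")
# Expected: violation: http://example.org/book1 has too few creators
```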
Metadata Provenance Tutorial at SWIB 13, Part 1
Kai Eckert
The slides of part one of the Metadata Provenance Tutorial (Linked Data Provenance). Part 2 is here: http://de.slideshare.net/MagnusPfeffer/metadata-provenance-tutorial-part-2-modelling-provenance-in-rdf
The Metadata Provenance Task Group aims to define a data model that allows for making assertions about description sets. Creating a shared model of the data elements required to describe an aggregation of metadata statements makes it possible to collectively import, access, use and publish facts about the quality, rights, timeliness, data source type, trust situation, etc. of the described statements. In this paper we describe the preliminary model created by the task group, together with first examples that demonstrate how the model is to be used.
Specialising the EDM for Digitised Manuscripts (SWIB13)
Kai Eckert
The DM2E project developed a data model to standardize metadata for digitized manuscripts. It specialized the Europeana Data Model (EDM) by adding over 50 new properties and 23 classes to better represent physical and conceptual aspects of manuscripts. The DM2E model was documented in PDF and OWL formats and made available online for humans and machines. Future work includes addressing uncertain statements about timespans and creators.
The Linked Open Citation Database (LOC-DB) aims to create fully linked and curated reference lists as part of the cataloging process in libraries, covering monographs, conference papers, and journal articles. The architecture involves OCR processing as well as linking to existing LOC-DB instances and to linked open data. The project reuses the Open Citations Data Model and collaborates on extending and maintaining it, with the goal of making existing citation databases obsolete by encouraging publishers to provide citation data openly.
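As a rough illustration of what reuse of the Open Citations Data Model can look like at the RDF level, a hedged sketch using rdflib: cito:cites is an actual CiTO property from the SPAR ontologies, but the resource URIs are invented, and the real LOC-DB model covers far more (bibliographic metadata, provenance of the OCR step, and so on).

```python
# Sketch only: one citation link in the spirit of the Open Citations
# Data Model. cito:cites is a real CiTO property; the URIs are invented.
from rdflib import Graph, Namespace

CITO = Namespace("http://purl.org/spar/cito/")
EX = Namespace("http://example.org/loc-db/")

g = Graph()
g.bind("cito", CITO)
g.bind("ex", EX)
# "article1 cites monograph1" as a single, openly publishable statement.
g.add((EX.article1, CITO.cites, EX.monograph1))
print(g.serialize(format="turtle"))
```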
JudaicaLink: Linked Data in the Jewish Studies FID
Kai Eckert
This document discusses the JudaicaLink project, which aims to create a central portal for accessing digital Judaica collections and linking different data sources as linked open data. It describes ongoing work to contextualize collections, enrich metadata, re-transliterate text from Romanized to Hebrew script, and extract and link relevant information from sources like the YIVO Encyclopedia and Biographisches Handbuch der Rabbiner. The system uses a triple store, SPARQL endpoint, and static site generator to manage and deploy the linked data.
Towards Interoperable Metadata Provenance
Kai Eckert
The document discusses metadata provenance and proposes a model for tracking it with named graphs and semantic web technologies: each metadata source is represented as a named graph, the source and a confidence value are recorded as triples about that graph, and SPARQL queries retrieve metadata based on provenance information such as source or confidence value.
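A minimal sketch of that model, assuming hypothetical vocabulary terms (ex:source, ex:confidence): each source's statements go into their own named graph, a separate graph carries the provenance triples about those graphs, and a SPARQL query filters by confidence.

```python
# Sketch of the named-graph approach summarised above; the provenance
# vocabulary (ex:source, ex:confidence) is invented for illustration.
from rdflib import Dataset, Literal, Namespace

EX = Namespace("http://example.org/")
ds = Dataset()

auto = ds.graph(EX.autoIndexed)      # named graph for one source
auto.add((EX.doc1, EX.subject, EX.cholera))

prov = ds.graph(EX.provenance)       # graph holding the metametadata
prov.add((EX.autoIndexed, EX.source, EX.indexerV1))
prov.add((EX.autoIndexed, EX.confidence, Literal(0.7)))

q = """
PREFIX ex: <http://example.org/>
SELECT ?doc ?concept WHERE {
  GRAPH ?g { ?doc ex:subject ?concept }
  GRAPH ex:provenance { ?g ex:confidence ?conf }
  FILTER (?conf >= 0.5)
}
"""
for row in ds.query(q):
    print(row.doc, row.concept)
```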
Linked Open Projects (DCMI Library Community)
Kai Eckert
This document discusses making data from research projects more reusable by publishing it as linked open data. It describes several existing research projects that have generated datasets and outlines how publishing their data using standards like RDF and SPARQL could make it more accessible and reusable, allowing the datasets to be more easily combined and integrated. The document then presents the Linked Data Service developed by Mannheim University Library as a way to publish such project data and provides examples of how the data can be queried and reused through this service.
This document discusses the need for and use of metametadata, or metadata about metadata, in two scenarios: crosswalks between metadata schemas and integrating metadata from different sources. It proposes using metametadata to record additional provenance information like the rule or source that generated a metadata statement to help with tasks like debugging crosswalks, updating rules, and improving search by weighting statements. Examples are given of how this could be implemented using RDF reification or named graphs.
Slides (in German) for a talk by Magnus Pfeffer and Kai Eckert. We propose linked data and semantic web technology as an infrastructure for publishing the results of research projects for easy reuse.
Crowdsourcing the Assembly of Concept Hierarchies
Kai Eckert
How to create a taxonomy using a paid workforce provided by Amazon Mechanical Turk, with an evaluative comparison to an existing community of motivated students and domain experts.
Presentation held at JCDL 2010, Brisbane, Australia (http://www.jcdl2010.org).
A Unified Approach for Representing Metametadata
Kai Eckert
This document proposes a unified approach for representing metametadata, or statements about statements, using RDF reification. It discusses two scenarios where metametadata is needed: crosswalks between metadata schemas, and integrating metadata from different sources. For each scenario, examples are given of how metametadata could provide additional context about the origin and generation of statements to help debug errors, update rules, and assess statement quality. The approach uses RDF and SPARQL to represent and query this metametadata.
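To make the reification approach concrete, a small rdflib sketch: the reification vocabulary (rdf:Statement, rdf:subject, rdf:predicate, rdf:object) is standard RDF, while the provenance property ex:generatedBy and all resource names are hypothetical.

```python
# Sketch: statement-level metametadata via RDF reification.
# ex:generatedBy and the resources are invented for illustration.
from rdflib import BNode, Graph, Namespace
from rdflib.namespace import RDF

EX = Namespace("http://example.org/")
g = Graph()

# The statement itself: doc1 has the subject heading "cholera".
g.add((EX.doc1, EX.subject, EX.cholera))

# A reified copy of that statement, so it can be annotated.
st = BNode()
g.add((st, RDF.type, RDF.Statement))
g.add((st, RDF.subject, EX.doc1))
g.add((st, RDF.predicate, EX.subject))
g.add((st, RDF.object, EX.cholera))
g.add((st, EX.generatedBy, EX.crosswalkRule42))  # the metametadata

print(g.serialize(format="turtle"))
```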
1. Thesaurus Visualisation with ICE-Map and SEMTINEL
Kai Eckert
Universitätsbibliothek, Universität Mannheim
PETRUS Workshop, Deutsche Nationalbibliothek, Frankfurt, 21 March 2011
ZBW Workshop, Hamburg, 10 March 2011
2. Research Focus
● Increasing the efficiency of thesaurus maintenance in libraries.
● Developing tools and processes that make alternative methods of subject indexing usable without endangering quality.
● Supporting humans in building, maintaining and using a thesaurus through the best possible automation.
● Thereby enabling thesaurus-based search applications in areas where this has so far been too costly.
3. Visual Data Mining
Cholera outbreak, 1854. John Snow discovers the cause through data visualisation.
Our motivation: "I want to see that!"
4. ICE-Map Visualisation
● Motivation: "I want to see that!"
● What does the thesaurus actually look like?
● Which concepts have been assigned?
● Are there areas that are predominantly used?
● How do the assignments differ when different methods are used (intellectual, automatic, tagging, ...)?
5. Where Do We Start?
6. How Do I Visualise a Thesaurus?
7. Slice-and-Dice Algorithm
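For readers who want to reproduce the layout shown on this slide, a minimal Python sketch of the classic slice-and-dice treemap: each hierarchy level splits its rectangle proportionally to subtree sizes, alternating the cutting direction per level. This is the textbook algorithm with an invented toy hierarchy, not SEMTINEL's actual implementation.

```python
# Slice-and-dice treemap layout sketch; node = (label, size, children).
def slice_and_dice(node, x, y, w, h, vertical=True, out=None):
    if out is None:
        out = []
    label, size, children = node
    out.append((label, x, y, w, h))
    total = sum(c[1] for c in children)
    offset = 0.0
    for child in children:
        frac = child[1] / total if total else 0
        if vertical:   # cut the rectangle with vertical lines
            slice_and_dice(child, x + offset * w, y, w * frac, h,
                           not vertical, out)
        else:          # cut the rectangle with horizontal lines
            slice_and_dice(child, x, y + offset * h, w, h * frac,
                           not vertical, out)
        offset += frac
    return out

# Toy thesaurus: sizes could be the number of assigned documents.
thesaurus = ("root", 10, [("economics", 7, []), ("law", 3, [])])
for label, x, y, w, h in slice_and_dice(thesaurus, 0, 0, 100, 100):
    print(f"{label}: x={x:.0f} y={y:.0f} w={w:.0f} h={h:.0f}")
```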
8. Squarified Layout
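Likewise, a hedged sketch of the squarified layout named on this slide (after Bruls, Huizing and van Wijk): rows are filled greedily along the shorter side as long as the worst aspect ratio in the row keeps improving. Again a textbook rendition with invented data, not SEMTINEL's code; areas are assumed positive and to sum to the rectangle's area.

```python
# Squarified treemap layout sketch; areas sorted in descending order.
def worst_ratio(row, side):
    s = sum(row)
    return max(max(side * side * r / (s * s), s * s / (side * side * r))
               for r in row)

def squarify(areas, x, y, w, h, out=None):
    if out is None:
        out = []
    if not areas:
        return out
    side = min(w, h)
    row, i = [areas[0]], 1
    # Grow the row while the worst aspect ratio does not get worse.
    while (i < len(areas) and
           worst_ratio(row + [areas[i]], side) <= worst_ratio(row, side)):
        row.append(areas[i])
        i += 1
    s = sum(row)
    if w >= h:  # lay the row vertically along the left edge
        rw, yy = s / h, y
        for r in row:
            out.append((x, yy, rw, r / rw))
            yy += r / rw
        squarify(areas[i:], x + rw, y, w - rw, h, out)
    else:       # lay the row horizontally along the top edge
        rh, xx = s / w, x
        for r in row:
            out.append((xx, y, r / rh, rh))
            xx += r / rh
        squarify(areas[i:], x, y + rh, w, h - rh, out)
    return out

print(squarify([6, 6, 4, 3, 2, 2, 1], 0, 0, 6, 4))
```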
9. Intuitive Identification of Problematic Concepts
● Very high number of assignments:
– too general (should be split up)
– not significant
– erroneous assignments
● Very low number of assignments:
– too specialised (should be merged with other concepts)
– missing synonyms
– not significant
– missing assignments
10. Taking the Thesaurus Hierarchy into Account
● A high number of assignments is expected higher in the hierarchy:
– more general terms
● A low number of assignments is expected lower in the hierarchy:
– more specific concepts
11. IC Difference Analysis
● Information content (IC):
– proposed by Resnik
– based on the probability of occurrence in the document base
– IC(c) = -log P(c)
● Intrinsic information content (IIC):
– proposed by Seco, Veale and Hayes
– based on the number of subordinate concepts
– IIC(c) = -log((hypo(c) + 1) / max)
– alternatively: reference-set IC, e.g. from manually assigned subject headings
● IC difference: ΔIC(c) = IC(c) - IIC(c)
Intuitively: a value between -1 and 1 that indicates whether a concept occurs conspicuously often or rarely relative to its position in the thesaurus, or in comparison to the reference.
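A small Python sketch of the analysis on this slide, using the formulas above on a toy hierarchy with invented assignment counts. The normalisation of both IC values to [0, 1] (dividing each by its maximum) is an assumption made so that the difference lands in the [-1, 1] range mentioned on the slide.

```python
# IC difference sketch: corpus-based IC (Resnik) minus intrinsic IC
# (Seco, Veale and Hayes). Hierarchy and counts are invented.
from math import log

children = {"economy": ["banking", "trade"], "banking": [], "trade": []}
doc_freq = {"economy": 50, "banking": 40, "trade": 10}  # assigned keywords
total = sum(doc_freq.values())     # 100 assignments overall
max_concepts = len(children)       # concepts in the thesaurus

def hypo(c):
    # Number of subordinate concepts (descendants) of c.
    return sum(1 + hypo(ch) for ch in children[c])

def ic(c):
    # Corpus-based IC, normalised to [0, 1] (normalisation is an assumption).
    return -log(doc_freq[c] / total) / log(total)

def iic(c):
    # Intrinsic IC, normalised to [0, 1]: 0 for the root, 1 for leaves.
    return -log((hypo(c) + 1) / max_concepts) / log(max_concepts)

for c in children:
    # Negative: assigned more often than its hierarchy position suggests.
    print(f"{c}: dIC = {ic(c) - iic(c):+.2f}")
```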
12. ICE-Map Visualisation
13. ICE-Map Visualisation
14. Applications of the ICE-Map Analysis
15. Terminology
● IC difference analysis: the statistical framework for computing the IC difference of a concept.
● ICE-Map visualisation: the visualisation of the IC difference analysis by means of a treemap, plus navigation support (tree view, root line).
● SEMTINEL: the platform for developing and using analyses and visualisations, i.e. everything else.
"Sorry for the confusion ;-)"
16. SEMTINEL is...
– a platform for developing your own applications
● integration into thesaurus-based search applications
– an extensible application for which you can develop your own modules
● analyses, visualisations, import/export filters, editors, ...
– an application for developing analyses
● development at runtime, not yet implemented
– an application for experimenting
● combining existing analyses and visualisations
– a tool for thesaurus creators and users
● using the tools developed by others
17. NetBeans Platform
18. SEMTINEL Architecture
19. SEMTINEL Data Model
20. Experiment API
(Diagram: an experiment's Configuration combines Datasets with Analyses/Visualisations and produces Output.)
21. Configuring an Experiment
Drag-and-drop support.
Extensible data model.
Multiple selection possible.
Register and register set.
22. Hierarchical Analyses
23. Explanation API
● Every analysis provides information:
– What does the analysis do?
– Which analyses does it build upon?
– What are the input values?
– Which intermediate results were computed?
– Which result is returned?
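A purely hypothetical Python sketch of what such an Explanation API could look like; SEMTINEL itself is built on the NetBeans Platform in Java, and all class, method and field names here are invented to illustrate the idea of self-describing analyses.

```python
# Hypothetical sketch: an analysis that can explain itself, per the
# five questions on this slide. Names are invented.
class ExplainableAnalysis:
    def __init__(self, name, description, depends_on=()):
        self.name = name
        self.description = description    # what the analysis does
        self.depends_on = depends_on      # analyses it builds upon
        self.inputs = {}                  # input values
        self.intermediates = {}           # intermediate results
        self.result = None                # returned result

    def explain(self):
        return "\n".join([
            f"{self.name}: {self.description}",
            f"  builds on: {[a.name for a in self.depends_on]}",
            f"  inputs: {self.inputs}",
            f"  intermediates: {self.intermediates}",
            f"  result: {self.result}",
        ])

ic = ExplainableAnalysis("IC", "corpus-based information content")
ic.inputs = {"concept": "banking", "doc_freq": 40}
ic.intermediates = {"P(c)": 0.4}
ic.result = 0.92
diff = ExplainableAnalysis("IC difference", "IC minus intrinsic IC", (ic,))
print(ic.explain())
print(diff.explain())
```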
25. Grouping of Experiments
Group Management
26. Thank you.
http://www.semtinel.org
Questions and suggestions: eckert@bib.uni-mannheim.de