Mark Zöpfgen (German National Library) presented the library's activities in content extraction and the semantic web. The library maintains the National Bibliography, which contains all national print and electronic publications since 1913. Together with other institutions, it produces an authority file with metadata (the GND, “Gemeinsame Normdatei”). Its activities in content extraction and the semantic web span several projects. In these, the library builds an ontology for generating the data and for enabling multilingual access to subjects, so that the holdings of the German National Library become internationally accessible. Manual effort is also invested in providing high-quality translations of the subject headings of the bibliographic records into English and French. So far, an Open Linked Data service for distributing the data is available, and the data can be downloaded in RDF format under the Creative Commons Zero license. The main goals of the German National Library are:
- continuously improving the poor formal quality of the bibliographic data, whose content is highly reliable.
- building an integrated portal with search engine and linked data.
- integrating German bibliographic data into The European Library and finding standards for providing it in linked data format.
- increasing the precision of multilingual term mappings, on the assumption that terms rarely match one-to-one.
- motivating external parties to work with the RDF data and improving search possibilities.
1. Deutsche Nationalbibliothek – Software-supported Bibliographic Recording and Linked Data
Lider Roadmapping Workshop
Mark Zöpfgen
Leipzig, 02.09.2014
2. Overview
- DNB – German National Library
- Activities in Content Extraction and Semantic Web
- MACS
- PETRUS
- Open Linked Data
- Motivation/Challenges
3. DNB – the German National Library
- central archival library and national bibliographic center for the Federal Republic of Germany
- collects, permanently archives, comprehensively documents and bibliographically records all German and German-language publications since 1913, foreign publications about Germany, and translations of German works (German National Bibliography)
- produces, in collaboration with other institutions, the Integrated Authority File (GND, “Gemeinsame Normdatei”)
- makes them available to the public
- develops and maintains bibliographic rules and standards for Germany
- plays a significant role in the development of international library standards.
Inventory: ~ 27.8 million bibliographic units; ~ 719,000 online publications (mainly PDF and EPUB)
5. Activities in the Context of Content Extraction and Semantic Web:
Contentus
CrissCross
Culturegraph
Linked Data Service
MACS
Unidissen
VIAF
PETRUS
…
For more information see:
http://www.dnb.de/DE/Wir/Projekte/projekte_node.html
6. MACS – Multilingual Access to Subjects I/IV
- Creation of a multilingual retrieval vocabulary for research in bibliographic databases.
- Links between subject headings of LCSH (Library of Congress Subject Headings), RAMEAU (Répertoire d'autorité-matière encyclopédique et alphabétique unifié) and GND (Gemeinsame Normdatei)
- In cooperation with SNB (Swiss National Library)
- Currently ~ 63000 links, which have been imported into the GND records
Use Cases
Make data of DNB internationally available (search via LCSH/RAMEAU subject headings)
Search in the Library of Congress / Bibliothèque nationale de France with GND subject headings
Possibility to adopt subject headings from bibliographic records (e.g. in the case of translations)
- Link: http://www.dnb.de/DE/Wir/Kooperation/MACS/macs.html
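The links described on this slide can be pictured as small clusters of equivalent headings, one set per vocabulary. The following is a minimal sketch under assumptions: the `Link` class, the `translate` function and the sample headings are hypothetical illustrations, not the MACS data model or actual MACS links. Sets are used because a heading in one vocabulary may correspond to several headings in another.

```python
from dataclasses import dataclass, field

@dataclass
class Link:
    # One cluster of equivalent subject headings, keyed by vocabulary name.
    headings: dict = field(default_factory=dict)  # e.g. {"GND": {"Bibliothek"}}

def translate(links, source_vocab, heading, target_vocab):
    # Collect all target-vocabulary headings linked to the given source heading.
    result = set()
    for link in links:
        if heading in link.headings.get(source_vocab, set()):
            result |= link.headings.get(target_vocab, set())
    return sorted(result)

# Made-up sample links for illustration only.
links = [
    Link({"GND": {"Bibliothek"}, "LCSH": {"Libraries"}, "RAMEAU": {"Bibliothèques"}}),
    Link({"GND": {"Bibliothek"}, "LCSH": {"Library buildings"}}),
]

print(translate(links, "GND", "Bibliothek", "LCSH"))
```

A search entered with a GND heading could then be expanded to every linked LCSH or RAMEAU heading before querying the foreign catalogue.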
7. MACS – Multilingual Access to Subjects II/IV
– Maintenance: The links are created and updated using the LMI (Link Management Interface). The LMI provides a web interface; data is stored in a central database.
– Data Export / Import: The links are exported via an OAI interface. The import into the CBS (Central Bibliographic Database) is currently done by a manually initiated script.
– Planned:
Integration in the search portal of TEL (The European Library)
Provision via linked data service (currently not integrated)
Regular update between LMI and CBS
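If the OAI interface mentioned above follows OAI-PMH (an assumption; the slides do not name the protocol version), an incremental export could be requested roughly as sketched here. The base URL is a placeholder, since the real LMI address is not given, and `list_records_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

def list_records_url(base_url, metadata_prefix="oai_dc", from_date=None):
    # Build an OAI-PMH ListRecords request; `from` restricts the harvest
    # to records changed on or after the given date (YYYY-MM-DD).
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date
    return f"{base_url}?{urlencode(params)}"

# Placeholder endpoint, for illustration only.
print(list_records_url("https://example.org/oai", from_date="2014-09-01"))
```

A scheduled job issuing such a request with the date of the last run is one way the planned regular update between LMI and CBS could be driven.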
10. Petrus – Software-supported Bibliographical Recording I/V
Why software supported?
Growing number of online publications (see graphic below).
The German National Library is looking to reduce its traditional indexing operations in areas where they are no longer feasible due to the continually growing number of publications, or are no longer necessary because of technological developments.
Figure: number of online publications per year (bar chart; values as extracted): 2007: 13525; 2008: 17651; 2009: 29823; 2010: 112766
11. Petrus – Software-supported Bibliographical Recording II/V
Classification
Based upon the DNB “Sachgruppen” (subject categories) ~ first two layers of the DDC
Statistical procedure; training corpus of ~ 300,000 objects with known classes (full text and tables of contents). The objects are limited to 40,000 characters.
After stemming, the data model is generated. An SVM (support vector machine) is used as the classifier. After the model has been created, a 3-fold validation is executed in order to verify its quality.
The model can be transferred to an “endpoint”, which is a stand-alone application. The endpoint communicates via a web service interface.
In use since January 2012; currently ~ 400 objects/day
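The validation step described above can be sketched as follows. This is a hedged illustration, not the PETRUS implementation: the SVM is replaced by a much simpler bag-of-words overlap classifier so the example stays dependency-free, stemming is omitted, and the function names, sample texts and class labels are made up. Only the 40,000-character limit and the 3-fold scheme come from the slide.

```python
from collections import Counter

MAX_CHARS = 40_000  # per-object limit stated on the slide

def tokenize(text):
    # Truncate, lowercase, split on whitespace; real stemming is omitted here.
    return text[:MAX_CHARS].lower().split()

def train(docs, labels):
    # Stand-in for the SVM: one summed bag-of-words vector per class.
    model = {}
    for doc, label in zip(docs, labels):
        model.setdefault(label, Counter()).update(tokenize(doc))
    return model

def predict(model, doc):
    # Pick the class whose vocabulary overlaps most with the document.
    tokens = tokenize(doc)
    return max(sorted(model), key=lambda c: sum(model[c][t] for t in tokens))

def k_fold_accuracy(docs, labels, k=3):
    # Hold out every k-th object in turn, train on the rest, average accuracy.
    scores = []
    for fold in range(k):
        test_idx = [i for i in range(len(docs)) if i % k == fold]
        train_idx = [i for i in range(len(docs)) if i % k != fold]
        model = train([docs[i] for i in train_idx], [labels[i] for i in train_idx])
        hits = sum(predict(model, docs[i]) == labels[i] for i in test_idx)
        scores.append(hits / len(test_idx))
    return sum(scores) / k

# Made-up toy corpus; the labels only mimic subject-category codes.
docs = ["law court judge", "physics energy atom", "court law statute",
        "atom energy quantum", "judge statute court", "quantum physics atom"]
labels = ["340", "530", "340", "530", "340", "530"]
print(k_fold_accuracy(docs, labels))
```

Each object is used for testing exactly once, so the averaged score estimates how the model behaves on publications it has not seen during training.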
12. Petrus – Software-supported Bibliographical Recording III/V
Keywording
Linguistic text analysis: language recognition, identification of sentences, words, phrases etc.
Term matching with a dictionary based on the Integrated Authority File (72000 subject headings), disambiguation
Term ranking (dependent on position and frequency)
The keywording process can eventually be transferred to an “endpoint” (analogous to the classification model)
~ 80 objects/day
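Term matching and ranking by position and frequency, as listed above, might look roughly like this. The weighting formula, the function name `rank_keywords` and the sample data are assumptions made for illustration; the slides do not describe the actual scoring, and this sketch handles only single-word headings and skips disambiguation.

```python
import re

def rank_keywords(text, headings, top_n=3):
    # Score each dictionary heading found in the text: every occurrence counts,
    # and occurrences near the start of the text count more (assumed weighting).
    words = re.findall(r"\w+", text.lower())
    scores = {}
    for pos, word in enumerate(words):
        if word in headings:
            weight = 1.0 + (len(words) - pos) / len(words)
            scores[word] = scores.get(word, 0.0) + weight
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:top_n]

# Made-up dictionary and text, for illustration only.
headings = {"libraries", "metadata", "archives"}
text = "Metadata matters. Libraries publish metadata and archives keep metadata."
print(rank_keywords(text, headings))
```

Frequent terms accumulate more weight, and among equally frequent terms the one appearing earlier in the text ranks higher.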
13. Petrus – Software-supported Bibliographical Recording IV/V
(1) List of publications to be processed
(2) Metadata to be imported from the bibliographic database
(3) (Full-text) objects to be imported from the repository
(4) Transfer via a web service interface
(5) Return of results
(6) Storage of the results in the bibliographic database
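The six steps above can be read as a simple processing loop. The sketch below is an assumption-laden illustration: `process_publications` and all callables passed to it are hypothetical stand-ins, and no real database, repository or endpoint is contacted.

```python
def process_publications(ids, fetch_metadata, fetch_fulltext, call_endpoint, store):
    # Steps (1)-(6): walk the list, gather inputs, call the endpoint, store results.
    for pub_id in ids:                              # (1) list of publications
        metadata = fetch_metadata(pub_id)           # (2) from the bibliographic database
        fulltext = fetch_fulltext(pub_id)           # (3) from the repository
        result = call_endpoint(metadata, fulltext)  # (4) transfer, (5) returned result
        store(pub_id, result)                       # (6) back into the bibliographic database

# Stand-in callables so the sketch runs without any real system behind it.
stored = {}
process_publications(
    ids=["pub-1", "pub-2"],
    fetch_metadata=lambda pub_id: {"id": pub_id},
    fetch_fulltext=lambda pub_id: f"full text of {pub_id}",
    call_endpoint=lambda metadata, fulltext: {"class": "unknown", "chars": len(fulltext)},
    store=stored.__setitem__,
)
print(stored)
```

Passing the four operations in as functions keeps the loop independent of how the database, the repository and the web service are actually reached.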
14. Petrus – Software-supported Bibliographical Recording V/V
Return of the classification software
Appearance in the bibliographic record
15. Open Linked Data I/III
- DNB provides high-quality, mainly intellectually created data.
- Authority file (GND) and National Bibliography are available in RDF format
- Data is published under the Creative Commons Zero license
- Currently, the data can be accessed via the Portal (for single records) or downloaded:
http://datendienst.dnb.de/cgi-bin/mabit.pl?userID=opendata&pass=opendata&cmd=login
- Target groups are research facilities and non-profit organisations as well as commercial service suppliers (e.g. search engines, knowledge management systems)
16. Open Linked Data II/III
– Bibliographic data is highly reliable, but has poor formal quality (free-text fields), so conversion requires considerable effort
– The data was converted using Metafacture, which was developed by culturegraph.org (www.culturegraph.org)
17. Open Linked Data III/III
Bibliographic record of „Winnetou“ leads to Karl May - detail
… leads to place of birth
Coordinates
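The chain shown on this slide (record, author, place of birth, coordinates) is a walk along linked-data triples. A minimal sketch, assuming simplified stand-in identifiers and predicate names rather than the real GND URIs and ontology terms; the coordinate value is approximate and included only for illustration:

```python
# Simplified triples (subject, predicate, object); all identifiers are stand-ins.
triples = [
    ("Winnetou", "creator", "Karl May"),
    ("Karl May", "placeOfBirth", "Hohenstein-Ernstthal"),
    ("Hohenstein-Ernstthal", "coordinates", "approx. 50.8 N, 12.7 E"),
]

def follow(start, predicates):
    # Walk from `start` along the given predicates, returning every node visited.
    path = [start]
    for predicate in predicates:
        nxt = next((o for s, p, o in triples if s == path[-1] and p == predicate), None)
        if nxt is None:
            break
        path.append(nxt)
    return path

print(follow("Winnetou", ["creator", "placeOfBirth", "coordinates"]))
```

In the published RDF the same walk would follow URIs instead of plain strings, which is what allows external datasets to be linked in at any step.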
18. Motivation of the DNB
- Motivate external parties to work with RDF data, e.g. linking it with other ontologies.
- Improve search: Access by themes, browsing, unveiling relations between cultural entities.
Technical Challenges
- Improve the accessibility (e.g. by services such as MACS)
- Search: Integrate Portal (knowledge representation, user interaction) with search engine and linked data
19. Questions?
Mark Zöpfgen
Deutsche Nationalbibliothek
Informationstechnik
Adickesallee 1
D-60322 Frankfurt am Main
Phone: +49-69-1525-1705
mailto: m.zoepfgen@d-nb.de
http://www.d-nb.de