Slides of the paper Turning Digitised Material into a Diachronic Corpus: Metadata Challenges in the Nederlab Project by Katrien Depuydt and Hennie Brugman at the 3rd Edition of the DATeCH2019 International Conference
7. Diachronic Research
What is available?
Different locations
Different ways of accessing the data
Different data formats
Different metadata schemes
Historical language
8. Nederlab Project (01-2013 – 06-2018)
o Research Dutch language, culture and society
o Literary, linguistic and historical research
o Diachronic corpus (ca. 600 – present)
o Research portal
o www.nederlab.nl
10. Diachronic text corpus
o 25 collections (so far)
o Text formats: ALTO, TEI-XML, ABBYY XML, proprietary XML, Word documents, PDF, FoLiA
o Metadata: DIDL (Digital Item Declaration Language), TEI header, proprietary XML, CMDI
11. Metadata requirements
What information do we need?
Analysis of characteristics of a particular author
Research of a phenomenon through time
Language development
12. Metadata requirements
(1) Give accurate provenance information for each word in the
text
(2) Identify the authors of the texts (by linking text authors
to a thesaurus with author information)
(3) Provide a genre classification
(4) Provide a possibility to keep collection specific metadata
(5) Retain the link to the source data used to build the corpus (so as to
be able to link from the text in the Nederlab corpus to the text in the
online source collection)
(6) Provide the necessary information for version control
(7) Provide information on text quality (OCR or ground truth quality)
(8) Provide information on IPR
(9) Link versions of the same text
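The nine requirements above can be read as fields of a per-text metadata record. The following is a minimal sketch in Python, not the Nederlab data model: the class `TextRecord`, its field names, the choice of which fields count as mandatory, and the example values are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TextRecord:
    """Hypothetical record; each field mirrors one requirement from the slide."""
    provenance: Optional[str] = None                          # (1) provenance
    author_ids: list = field(default_factory=list)            # (2) links to an author thesaurus
    genre: Optional[str] = None                               # (3) genre classification
    collection_specific: dict = field(default_factory=dict)   # (4) collection-specific metadata
    source_url: Optional[str] = None                          # (5) link back to the source collection
    version: Optional[str] = None                             # (6) version control information
    text_quality: Optional[str] = None                        # (7) OCR or ground truth quality
    ipr: Optional[str] = None                                 # (8) IPR information
    related_versions: list = field(default_factory=list)      # (9) other versions of the same text


# Assumption: these fields are treated as mandatory; (4) and (9) may be empty.
REQUIRED = ("provenance", "author_ids", "genre", "source_url",
            "version", "text_quality", "ipr")


def missing_fields(record: TextRecord) -> list:
    """Return the names of mandatory fields that are still empty."""
    return [name for name in REQUIRED if not getattr(record, name)]


record = TextRecord(
    provenance="DBNL",
    author_ids=["author-0001"],               # hypothetical thesaurus identifier
    source_url="https://example.org/text/1",  # placeholder URL
    version="1",
    text_quality="ground truth",
    ipr="public domain",
)
print(missing_fields(record))
```

A check of this kind makes it visible, per collection, which of the requirements the delivered metadata cannot yet satisfy.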
13. Who is the author of these words? When did the
author write these words? What is the date of the
witness, the physical object carrying the text?
14. Why are the metadata accompanying digital
objects coming from digital libraries, archives or
other electronic text collections not sufficient?
15. Date Witness vs. Date Text
Jacob van Maerlant
Der Naturen Bloeme
written: ca. 1270
manuscript: ca. 1350-1365
LEIDEN, UB : BPL 14 A
16. Date Witness vs. Date Text
P.C. Hooft
Nederlandsche Histoorien
written: 1628-1647
printed: 1642 (20), 1654 (7),
1656 (27)
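The difference between the two dates matters as soon as a corpus is queried by period. The sketch below, in Python, stores both dates for the two examples above and selects texts by either one. The names `DatedText`, `overlaps` and `select` are hypothetical, the "ca." dates are treated as exact years, and for Hooft only the first printing listed (1642) is used as the witness date; these are simplifying assumptions, not how Nederlab stores dates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatedText:
    title: str
    author: str
    text_date: tuple     # (earliest, latest) year the text was written
    witness_date: tuple  # (earliest, latest) year of the witness


def overlaps(period: tuple, start: int, end: int) -> bool:
    """True if the (earliest, latest) period intersects [start, end]."""
    first, last = period
    return first <= end and last >= start


def select(records: list, start: int, end: int, by: str = "text") -> list:
    """Return titles whose text date or witness date falls in [start, end]."""
    if by not in ("text", "witness"):
        raise ValueError("by must be 'text' or 'witness'")
    key = "text_date" if by == "text" else "witness_date"
    return [r.title for r in records if overlaps(getattr(r, key), start, end)]


corpus = [
    DatedText("Der Naturen Bloeme", "Jacob van Maerlant",
              (1270, 1270), (1350, 1365)),
    DatedText("Nederlandsche Histoorien", "P.C. Hooft",
              (1628, 1647), (1642, 1642)),
]

# Thirteenth-century Dutch: found by date of text, missed by date of witness.
print(select(corpus, 1201, 1300, by="text"))
print(select(corpus, 1201, 1300, by="witness"))
```

With only the witness date available, a query for thirteenth-century language would miss Maerlant's text altogether, and a query for the fourteenth century would return it as if it had been written then.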
17. Date Witness – Date Text
Text edition: Jacob van Maerlant, Der naturen bloeme (ed. Eelco Verwijs). J.B. Wolters, Groningen 1878
Dates involved:
o Ca. 1270, Maerlant
o Ca. 1350-1365, Maerlant
o Before 1878, Verwijs
o 1878, Verwijs
18. Metadata Book Sufficient
o Mien Visser-Düker, Baron van Hippelepip.
Nutsuitgeverij, Zaltbommel 1917
o Story told in eight chapters.
19. Metadata Book Insufficient
o Lucas Zasy, Borgerliicke huyshoudingh.
Cornelis van Damme, Rotterdam 1628
Cornelis van Damme
Poems by different authors
20. Metadata Book Insufficient
o A.J. Vervoorn, Antilliaans Nederlands.
Kabinet voor Nederlands-Antilliaanse Zaken,
Den Haag z.j. [1976]
o Elaborate quotations from different authors
21. Metadata Book Insufficient
Pieter van Dam's Beschrijvinge van de Oostindische Compagnie 1639-1701.
Uitgegeven door F.W. Stapel en C.W.Th. baron van Boetzelaer van Asperen en Dubbeldam
(jaar van publicatie: 1927-1954). Rijksgeschiedkundige publicaties deel 1.1 Grote Serie 63, deel 1.2 Grote Serie 68,
deel 2.1 Grote Serie 74, deel 2.2 Grote Serie 76, deel 2.3 Grote Serie 83, deel 3 Grote Serie 87 en deel 4 Grote Serie 96.
23. Corpus Processing Strategy 1
Three large collections with good metadata for each
digital object in the collection:
o The KB newspaper collection from 1618-1899,
consisting of 12,335,066 clipped articles of
OCR'ed newspapers (scalability issues)
o Early Dutch Books Online (post-correction)
o DBNL collection (GT quality)
Metadata converted into the Nederlab CMDI
format
24. Beta version research portal 2015
Having only the publication date is a severe issue for:
o Searching
o Linguistic annotation strategy
25. Corpus Processing Strategy 2
Smaller text collections
Collection format TEI FoLiA
Extract historical text from text editions
Careful determination of what counts as a text
Date witness vs. Date text
29. Metadata scheme CMDI
Five profiles
o NederlabTitle
o NederlabDependentTitle
o NederlabSeriesTitle
o NederlabDocumentPart
o NederlabPerson
Building blocks of each profile:
o Information specific for the profile
o NLCore: administrative information
o NLCollectionSpecific
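The composition of a profile from its three building blocks can be sketched as follows. This is an illustration in Python only: real CMDI records are XML, and the class `NederlabRecord`, its attribute names and the example field contents are assumptions. Only the five profile names and the three block names come from the slide.

```python
from dataclasses import dataclass, field

# The five profile names listed on the slide.
PROFILES = (
    "NederlabTitle",
    "NederlabDependentTitle",
    "NederlabSeriesTitle",
    "NederlabDocumentPart",
    "NederlabPerson",
)


@dataclass
class NederlabRecord:
    """Hypothetical in-memory view of one metadata record."""
    profile: str
    profile_specific: dict   # information specific to the profile
    nl_core: dict            # NLCore: administrative information
    nl_collection_specific: dict = field(default_factory=dict)  # NLCollectionSpecific

    def __post_init__(self):
        # Reject profile names that are not among the five listed ones.
        if self.profile not in PROFILES:
            raise ValueError(f"unknown profile: {self.profile}")


title = NederlabRecord(
    profile="NederlabTitle",
    profile_specific={"title": "Der Naturen Bloeme"},    # illustrative content
    nl_core={"collection": "DBNL"},                       # illustrative content
    nl_collection_specific={"original_id": "example-1"},  # placeholder value
)
print(title.profile)
```

Keeping the collection-specific block separate means the original metadata of a source collection can be retained without forcing it into the shared fields.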
30. To conclude
Der Naturen Bloeme by Jacob van Maerlant:
Several manuscripts in several text editions, in
different digital collections and corpora
E.g. the edition of this text by Eelco Verwijs
in the Delpher Google Books collection and the DBNL
o with different metadata,
o without the information about the date
of the witness of the edited text
<Bibliotheca Neerlandica Manuscripta>
31. o We should evolve towards metadata models
which take both library requirements and
diachronic research into account.
o Common infrastructure to share metadata
information (Researchers / Libraries)
32. Future of Nederlab
o Add new collections
o Reprocess stage 1 collections (DBNL) for the
metadata
o Improve linguistic annotation (CLARIAH-PLUS
project)
o Improve the portal