Nederlab: metadata challenges
Turning Digitised Material into a Diachronic Corpus
Katrien Depuydt (katrien.depuydt@ivdnt.org)
Hennie Brugman (Hennie.brugman@di.knaw.nl)
Diachronic Research
What is available?
Different locations
Different ways of accessing the data
Different data formats
Different metadata schemes
Historical language
7
Nederlab Project (01-2013 – 06-2018)
o Research Dutch language, culture and society
o Literary, linguistic and historical research
o Diachronic corpus (ca. 600 – present)
o Research portal
o www.nederlab.nl
8
Diachronic text corpus
o 25 collections (so far)
o Text formats: ALTO, TEI-XML, ABBYY XML, proprietary XML, Word documents, PDF, FoLiA
o Metadata: DIDL (Digital Item Declaration Language), TEI header, proprietary XML, CMDI
10
Metadata requirements
What information do we need?
Analysis of characteristics of a particular author
Research of a phenomenon through time
Language development
11
Metadata requirements
(1) Give accurate provenance information for each word in the text
(2) Identify the authors of the texts (by linking text authors to a thesaurus with author information)
(3) Provide a genre classification
(4) Make it possible to keep collection-specific metadata
(5) Retain the link to the source data used to build the corpus (so that the text in the Nederlab corpus can link to the text in the online source collection)
(6) Provide the necessary information for version control
(7) Provide information on text quality (OCR or ground-truth quality)
(8) Provide information on IPR
(9) Link versions of the same text
12
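As a rough illustration of how the nine requirements might come together in one record, here is a minimal Python sketch. All class and field names are hypothetical and are not taken from the Nederlab CMDI profiles; the numbers in the comments refer to the requirements listed above.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TextRecord:
    """Hypothetical record; field names are illustrative, not Nederlab's."""
    title: str
    author_ids: list[str]            # (2) links into an author thesaurus
    text_date: tuple[int, int]       # (1) when the words were written
    witness_date: tuple[int, int]    # (1) date of the physical carrier
    source_url: str                  # (5) link back to the source collection
    genre: Optional[str] = None      # (3) genre classification
    version: str = "1"               # (6) version control
    text_quality: str = "unknown"    # (7) e.g. "ocr" or "ground-truth"
    rights: Optional[str] = None     # (8) IPR statement
    same_text_as: list[str] = field(default_factory=list)              # (9)
    collection_specific: dict[str, str] = field(default_factory=dict)  # (4)


def missing_requirements(rec: TextRecord) -> list[str]:
    """Return the names of optional fields that are still unfilled."""
    missing = []
    if rec.genre is None:
        missing.append("genre")
    if rec.text_quality == "unknown":
        missing.append("text_quality")
    if rec.rights is None:
        missing.append("rights")
    return missing
```

A check like `missing_requirements` is one way to see, per collection, which requirements the incoming metadata cannot yet satisfy.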
Who is the author of these words? When did the author write these words? What is the date of the witness, the physical object carrying the text?
13
Why are the metadata accompanying digital
objects coming from digital libraries, archives or
other electronic text collections not sufficient?
14
Date Witness vs. Date Text
Jacob van Maerlant
Der Naturen Bloeme
written: ca. 1270
manuscript: ca. 1350-1365
LEIDEN, UB : BPL 14 A
15
Date Witness vs. Date Text
P.C. Hooft
Nederlandsche Histoorien
written: 1628-1647
printed: 1642 (20), 1654 (7),
1656 (27)
16
Date Witness – Date Text
Text edition: Jacob van Maerlant, Der naturen bloeme (ed. Eelco Verwijs). J.B. Wolters, Groningen 1878
o Ca. 1270, Maerlant
o Ca. 1350-1365, Maerlant
o Before 1878, Verwijs
o 1878, Verwijs
17
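The distinction matters as soon as a corpus is filtered by period. The sketch below is a minimal Python illustration, with hypothetical class and function names, that stores the date of the text, the date of the witness and the date of the edition as separate year ranges and selects on any one of them. The years are those given on the slides for Der naturen bloeme; treating "ca. 1270" as the single year 1270 is a simplifying assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class YearRange:
    start: int
    end: int

    def overlaps(self, other: "YearRange") -> bool:
        """True when the two inclusive year ranges share at least one year."""
        return self.start <= other.end and other.start <= self.end


@dataclass(frozen=True)
class Dating:
    text: YearRange      # when the author wrote the words
    witness: YearRange   # manuscript or print carrying the text
    edition: YearRange   # modern text edition


der_naturen_bloeme = Dating(
    text=YearRange(1270, 1270),
    witness=YearRange(1350, 1365),
    edition=YearRange(1878, 1878),
)


def in_period(dating: Dating, period: YearRange, by: str = "text") -> bool:
    """Select on one of the three dates: "text", "witness" or "edition"."""
    return getattr(dating, by).overlaps(period)


thirteenth_century = YearRange(1201, 1300)
print(in_period(der_naturen_bloeme, thirteenth_century, by="text"))     # True
print(in_period(der_naturen_bloeme, thirteenth_century, by="edition"))  # False
```

A record that carries only the 1878 publication date would place this thirteenth-century text in the nineteenth century.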
Metadata Book Sufficient
o Mien Visser-Düker, Baron van Hippelepip.
Nutsuitgeverij, Zaltbommel 1917
o Story told in eight chapters.
18
Metadata Book Insufficient
o Lucas Zasy, Borgerliicke huyshoudingh. Cornelis van Damme, Rotterdam 1628
o Cornelis van Damme
o Poems by different authors
19
Metadata Book Insufficient
o A.J. Vervoorn, Antilliaans Nederlands. Kabinet voor Nederlands-Antilliaanse Zaken, Den Haag z.j. [1976]
o Elaborate quotations, different authors
20
Metadata Book Insufficient
o Pieter van Dam's Beschrijvinge van de Oostindische Compagnie 1639-1701. Edited by F.W. Stapel and C.W.Th. baron van Boetzelaer van Asperen en Dubbeldam (year of publication: 1927-1954). Rijksgeschiedkundige publicaties: vol. 1.1 Grote Serie 63, vol. 1.2 Grote Serie 68, vol. 2.1 Grote Serie 74, vol. 2.2 Grote Serie 76, vol. 2.3 Grote Serie 83, vol. 3 Grote Serie 87 and vol. 4 Grote Serie 96.
21
Metadata Insufficient
o Resolutions of the town council of Alkmaar
22
Corpus Processing Strategy 1
Three large collections with good metadata for each digital object in the collection
o The KB newspaper collection from 1618-1899, consisting of 12,335,066 clipped articles from OCR'ed newspapers → scalability issues
o Early Dutch Books Online → post-correction
o DBNL collection → ground-truth (GT) quality
Metadata → converted into the Nederlab CMDI format
23
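Converting source metadata into one common format while keeping collection-specific fields could look roughly like the Python sketch below. It is an illustration only: the source field names are Dublin Core-style names chosen for the example, the target field names are hypothetical, and none of this reflects the actual Nederlab CMDI conversion.

```python
# Hypothetical mapping from source field names to common field names.
# A library record's date is the publication date of the object, so it is
# mapped to the witness date, not to the date of the text.
FIELD_MAP = {
    "dc:title": "title",
    "dc:creator": "author",
    "dc:date": "witness_date",
    "dc:identifier": "source_id",
}


def to_common_record(source: dict[str, str], collection: str) -> dict:
    """Map known fields; keep everything else as collection-specific metadata."""
    record: dict = {"collection": collection, "collection_specific": {}}
    for key, value in source.items():
        target = FIELD_MAP.get(key)
        if target is None:
            # Unmapped fields are retained rather than dropped.
            record["collection_specific"][key] = value
        else:
            record[target] = value
    return record
```

Keeping unmapped fields in a separate bucket means nothing from the source collection is lost, and the source identifier preserves the link back to the original object.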
Beta version research portal 2015
Having only the publication date: severe issue
Searching
Linguistic annotation strategy
24
Corpus Processing Strategy 2
Smaller text collections
Collection format TEI  FoLiA
Extract historical text from text editions
Careful determination as to what is a text?
Date witness vs. Date text
25
Corpus Processing Strategy 3
o Existing corpora
• historical corpora
• corpora of present-day Dutch
28
Metadata scheme CMDI
Five profiles
o NederlabTitle
o NederlabDependentTitle
o NederlabSeriesTitle
o NederlabDocumentPart
o NederlabPerson
Building blocks of each profile:
o Information specific for the profile
o NLCore: administrative information
o NLCollectionSpecific
29
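The structure of the five profiles and their three building blocks can be pictured with a short Python sketch. The profile and block names come from the slide; the class names and the keys inside each block are hypothetical placeholders, since the slide does not list the actual CMDI elements.

```python
from dataclasses import dataclass, field
from enum import Enum


class Profile(Enum):
    TITLE = "NederlabTitle"
    DEPENDENT_TITLE = "NederlabDependentTitle"
    SERIES_TITLE = "NederlabSeriesTitle"
    DOCUMENT_PART = "NederlabDocumentPart"
    PERSON = "NederlabPerson"


@dataclass
class MetadataRecord:
    profile: Profile
    profile_specific: dict[str, str]   # information specific for the profile
    nl_core: dict[str, str]            # NLCore: administrative information
    nl_collection_specific: dict[str, str] = field(default_factory=dict)


# Keys below are placeholders for illustration, not real CMDI element names.
record = MetadataRecord(
    profile=Profile.TITLE,
    profile_specific={"title": "Der naturen bloeme"},
    nl_core={"collection": "example-collection"},
)
print(record.profile.value)  # NederlabTitle
```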
To conclude
Der Naturen Bloeme by Jacob van Maerlant:
several manuscripts in several text editions, in different digital collections and corpora
e.g. the edition of this text by Eelco Verwijs, in the Delpher Google Books collection and the DBNL
o with different metadata,
o without information about the date of the witness of the edited text
<Bibliotheca Neerlandica Manuscripta>
30
o We should evolve towards metadata models
which take both library requirements and
diachronic research into account.
o Common infrastructure to share metadata
information (Researchers / Libraries)
31
Future of Nederlab
o Add new collections
o Reprocess stage 1 collections (DBNL) for the
metadata
o Improve linguistic annotation (CLARIAH+ project)
o Improve the portal
32
