2. Generale Missieven
• yearly letters from the Governor-General and Council of the Dutch East India Company to the company's directors (the Heren XVII)
• 1610-1761
• 13 volumes
• 565 letters
• 10,000 pages
resources.huygens.knaw.nl/vocgeneralemissiven
4. Computing Companion
helps a student of a corpus to approach it with her computer-aided intelligence
It is a toolkit / model / framework / ethos to
1. get corpus data into RAM
2. compute with it efficiently
3. harvest results
4. recycle results back to the corpus
and to do this in a way that
1. is reproducible
2. reduces friction
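In plain Python the four steps might be sketched like this (the file layout and the frequency computation are illustrative assumptions, not the actual toolkit):

```python
from collections import Counter
from pathlib import Path

def load_corpus(folder):
    """Step 1: get corpus data into RAM, one string per page file."""
    return {p.name: p.read_text(encoding="utf-8")
            for p in sorted(Path(folder).glob("*.txt"))}

def compute(pages):
    """Step 2: compute with it efficiently, e.g. word frequencies."""
    freqs = Counter()
    for text in pages.values():
        freqs.update(text.lower().split())
    return freqs

def harvest(freqs, path):
    """Step 3: harvest results to a file."""
    with open(path, "w", encoding="utf-8") as f:
        for word, n in freqs.most_common():
            f.write(f"{word}\t{n}\n")

# Step 4: results harvested this way can be read back in a later run
# and attached to the corpus as new annotations.
```

Keeping every step a plain function over plain files is what makes the loop reproducible: rerunning the script from the same corpus folder yields the same harvest.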
7. Source: TEI
• OK for automatic processing, but very discouraging for manual checking and double-checking
• very long lines
• inhuman file names
8. Laundry - trim0
• some pages are hopeless
• we re-sourced the data from the OCR strings of the Huygens website
• cases:
• letters whose original content is not in the TEI (though editorial content and metadata are there)
• pages with big (landscape) tables that resulted in pathological TEI
9. Humane data!
• file names are page numbers
• metadata is flattened
• much of the XML overhead is gone
• line breaks are reflected in the layout
All the inherent problems in this dataset are still there. But now we have hope to see them, and to tackle them.
10. Laundry - trim1
text separation:
• mark folio references
• correct the markup of page headers
without this step:
• loss of original text
• contamination of original text
example: vol. 2, p. 538 (before and after)
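Marking folio references might look like this (the exact notation of folio references in the missives is an assumption here; the real patterns are more varied):

```python
import re

# Hypothetical pattern: folio references such as "fol. 123" or "fol. 123v"
FOLIO_RE = re.compile(r"\bfol\.\s*(\d+[rv]?)\b")

def mark_folios(line):
    """Wrap folio references in an explicit marker, so that later steps
    can separate them from the original text instead of losing or
    contaminating it."""
    return FOLIO_RE.sub(r"«folio \1»", line)
```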
11. Laundry - trim2
• metadata:
• re-distil it from the letter headings
• check it
• run diagnostics
example: before and after
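Re-distilling metadata from a letter heading might be sketched as follows; the heading shape assumed here (author, place, date) is purely illustrative, not the edition's actual format:

```python
import re

# Hypothetical heading shape: "I. BOTH, BANTAM 1 januari 1614."
HEADING_RE = re.compile(
    r"^(?P<author>[^,]+),\s*(?P<place>[A-Z][^\d]+?)\s+(?P<date>\d{1,2} \w+ \d{4})\.?$"
)

def distil(heading):
    """Return a metadata dict from a letter heading, or None if the
    heading does not match: those cases go into a diagnostics report."""
    m = HEADING_RE.match(heading.strip())
    return m.groupdict() if m else None
```

The point of the diagnostics step is exactly the `None` branch: every heading the pattern fails to match is a case to inspect by hand.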
12. Laundry - trim3 - the mother of all laundries
• get the editorial remarks under tight control, even when they spread across pages
• detect all 12,000+ footnote bodies correctly (done)
• connect all footnote refs to their bodies (done)
None of this is feasible without successful completion of the previous steps.
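The ref-to-body connection step can be sketched as a lookup over (page, number) pairs; that footnote numbers restart per page is an assumption about this edition's numbering scheme:

```python
def connect_footnotes(refs, bodies):
    """Connect footnote references to footnote bodies.

    refs:   list of (page, number) pairs, in reading order
    bodies: dict mapping (page, number) to the footnote text
    Returns (links, unresolved): the resolved pairs and the dangling
    references that need manual inspection.
    """
    links, unresolved = {}, []
    for ref in refs:
        if ref in bodies:
            links[ref] = bodies[ref]
        else:
            unresolved.append(ref)
    return links, unresolved
```

With 12,000+ footnote bodies, the `unresolved` list is the work list: it only shrinks to empty once the earlier laundry steps have cleaned the pages the bodies live on.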
18. Centrifuge
With clean XML in hand, we centrifuge the tag material out of the clean laundry:
• we squeeze out all tag material (the moisture)
• leaving only pure content (the dry clothes)
• ready to process (ready to wear)
Result: clean, dry stuff: Text-Fabric
github.com/Dans-labs/clariah-gm/tf/
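A minimal sketch of the centrifuge step, assuming the pages are well-formed XML strings; in the real conversion the squeezed-out tag material becomes Text-Fabric features rather than being thrown away:

```python
import xml.etree.ElementTree as ET

def centrifuge(xml_string):
    """Squeeze the tag material (moisture) out of an XML page,
    leaving only the pure text content (dry clothes)."""
    root = ET.fromstring(xml_string)
    return "".join(root.itertext())
```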
23. • start
• move around programmatically
• search
• get in focus
• compute
• refine by computing
• export to Excel
• collect work sheets
• annotate
• insights are the new data
• share
• let others collect your data as easily as you collected this corpus
annotation/tutorials/missieven
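The export/annotate/collect round trip can be mimicked with a tab-separated sheet, which Excel opens directly (the column layout here is an illustrative assumption; Text-Fabric has export helpers of its own):

```python
import csv

def export_sheet(rows, path):
    """Export search results as a tab-separated work sheet
    (node id, text, empty annotation column)."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        w = csv.writer(f, delimiter="\t")
        w.writerow(["node", "text", "annotation"])
        w.writerows(rows)

def collect_sheet(path):
    """Collect the sheet after manual annotation: the filled-in
    annotation column becomes new data on top of the corpus."""
    with open(path, newline="", encoding="utf-8") as f:
        r = csv.reader(f, delimiter="\t")
        next(r)  # skip the header row
        return {int(node): ann for node, _text, ann in r if ann}
```

This is the "insights are the new data" loop: what comes back from the sheet is keyed to corpus nodes, so it can be shared and collected like the corpus itself.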
30. what does this road mean?
• for researchers?
• for CLARIAH?
• for DANS / eScience Center / Humanities Cluster / HuygensING
31. researchers
• a short road to being completely "hands on" with their own corpora
• compute in their first programming language
• no technological overhead outside their computing scope (no XML, RDF, or PIDs)
• no metadata intricacy
• focus on the data according to their own mental concepts: the data features
TF corpora
32. CLARIAH
• a unified practice to compute with corpora:
• students of different corpora can share practices
• they can build cookbooks that transcend their particular corpus
• remember the "peculiarity of missives"?
• nearly the same recipe exists for a dozen corpora
• where is the greater gain:
• sorting out metadata?
• or supporting the processing of metadata?
33. DANS / eScience / HuC / archives
Text-Fabric uses GitHub as its data backend!
• GitHub is unique in supporting versioned data check-in / check-out
• GitHub is a hub toward top-notch preservation services: Zenodo, Software Heritage Foundation
YET:
• GH is optimized for code, not for (big) data
• although you can have private repos, GH has little support for access roles there
AND
• GH's diffing techniques may be over the top for data
34. DANS / eScience / HuC / archives
We need another data backend:
• based on the practices of a FAIR repository
• where researchers have the same kind of control as they have in GitHub
• that supports versioning
• where you can download specific versions of specific subfolders of specific datasets under program control: an API
35. DANS / eScience / HuC / archives
• We need a TextHub, a Data Station for processable, annotated Text
• One corpus has many authors that deliver many parts of the data
• Authors control their own parts and share them from places they "own" on the Hub
• Users grab those parts from the Hub under program control
• And deliver the new parts they create to the Hub
36. DANS / eScience / HuC / archives
DANS: provide the Hub (Data Station in Dataverse)
eScience: support best computing practices around the Hub
HuC: consider the Hub as a hop-on to larger infrastructure
Archives: invest in resources on the shelf: make them Hub ready
37. Computing Companion
helps a student of a corpus to approach it with her computer-aided intelligence
corpus data into memory
compute
harvest
share & recycle
be reproducible
go smoothly
dirk.roorda@dans.knaw.nl