2. Generale Missieven
• yearly letters from the Governor-General and Council of the Dutch East India Company to the company's directors (the Heren XVII)
• 1610-1761
• 13 volumes
• 565 letters
• 10,000 pages
resources.huygens.knaw.nl/vocgeneralemissiven
4. Computing Companion
helps a student of a corpus to approach it with her computer-aided intelligence
It is a toolkit / model / framework / ethos to
1. get corpus data into RAM
2. compute with it efficiently
3. harvest results
4. recycle results back to the corpus
and to do this in a way that
1. is reproducible
2. reduces friction
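In plain Python the four steps might be sketched like this (the file layout and the frequency computation are illustrative assumptions, not the actual toolkit):

```python
from collections import Counter
from pathlib import Path

def load_corpus(folder):
    """Step 1: get corpus data into RAM, one string per page file."""
    return {p.name: p.read_text(encoding="utf-8")
            for p in sorted(Path(folder).glob("*.txt"))}

def compute(pages):
    """Step 2: compute with it efficiently, e.g. word frequencies."""
    freqs = Counter()
    for text in pages.values():
        freqs.update(text.lower().split())
    return freqs

def harvest(freqs, path):
    """Step 3: harvest results to a file."""
    with open(path, "w", encoding="utf-8") as f:
        for word, n in freqs.most_common():
            f.write(f"{word}\t{n}\n")

# Step 4: results harvested this way can be read back in a later run
# and attached to the corpus as new annotations.
```

Keeping every step a plain function over plain files is what makes the loop reproducible: rerunning the script from the same corpus folder yields the same harvest.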
7. Source: TEI
• OK for automatic processing, but very discouraging for manual checking and double-checking
• very long lines
• inhuman file names
8. Laundry - trim0
• some pages are hopeless
• we re-sourced the data from the OCR strings of the Huygens website
• cases:
• letters whose original content is not in the TEI (though editorial content and metadata are there)
• pages with big (landscape) tables that resulted in pathological TEI
9. Humane data!
• file names are page numbers
• metadata is flattened
• much of the XML overhead is gone
• line breaks are reflected in the layout
All the inherent problems in this dataset are still there. But now we have hope to see them, and to tackle them.
10. Laundry - trim1
text separation:
• mark folio references
• correct the markup of page headers
without this step:
• loss of original text
• contamination of original text
example: vol. 2, p. 538 (before and after)
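Marking folio references might look like this (the exact notation of folio references in the missives is an assumption here; the real patterns are more varied):

```python
import re

# Hypothetical pattern: folio references such as "fol. 123" or "fol. 123v"
FOLIO_RE = re.compile(r"\bfol\.\s*(\d+[rv]?)\b")

def mark_folios(line):
    """Wrap folio references in an explicit marker, so that later steps
    can separate them from the original text instead of losing or
    contaminating it."""
    return FOLIO_RE.sub(r"«folio \1»", line)
```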
11. Laundry - trim2
• metadata:
• re-distil it from the letter headings
• check it
• run diagnostics
example: before and after
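Re-distilling metadata from a letter heading might be sketched as follows; the heading shape assumed here (author, place, date) is purely illustrative, not the edition's actual format:

```python
import re

# Hypothetical heading shape: "I. BOTH, BANTAM 1 januari 1614."
HEADING_RE = re.compile(
    r"^(?P<author>[^,]+),\s*(?P<place>[A-Z][^\d]+?)\s+(?P<date>\d{1,2} \w+ \d{4})\.?$"
)

def distil(heading):
    """Return a metadata dict from a letter heading, or None if the
    heading does not match: those cases go into a diagnostics report."""
    m = HEADING_RE.match(heading.strip())
    return m.groupdict() if m else None
```

The point of the diagnostics step is exactly the `None` branch: every heading the pattern fails to match is a case to inspect by hand.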
12. Laundry - trim3 - the mother of all laundries
• get the editorial remarks under tight control, even when they spread across pages
• detect all 12,000+ footnote bodies correctly (done)
• connect all footnote refs to their bodies (done)
None of this is feasible without successful completion of the previous steps.
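The ref-to-body connection step can be sketched as a lookup over (page, number) pairs; that footnote numbers restart per page is an assumption about this edition's numbering scheme:

```python
def connect_footnotes(refs, bodies):
    """Connect footnote references to footnote bodies.

    refs:   list of (page, number) pairs, in reading order
    bodies: dict mapping (page, number) to the footnote text
    Returns (links, unresolved): the resolved pairs and the dangling
    references that need manual inspection.
    """
    links, unresolved = {}, []
    for ref in refs:
        if ref in bodies:
            links[ref] = bodies[ref]
        else:
            unresolved.append(ref)
    return links, unresolved
```

With 12,000+ footnote bodies, the `unresolved` list is the work list: it only shrinks to empty once the earlier laundry steps have cleaned the pages the bodies live on.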
18. Centrifuge
With clean XML in hand, we centrifuge the tag material out of the clean laundry:
• we squeeze out all tag material (the moisture)
• leaving only pure content (the dry clothes)
• ready to process (ready to wear)
Result: clean, dry stuff: Text-Fabric
github.com/Dans-labs/clariah-gm/tf/
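A minimal sketch of the centrifuge step, assuming the pages are well-formed XML strings; in the real conversion the squeezed-out tag material becomes Text-Fabric features rather than being thrown away:

```python
import xml.etree.ElementTree as ET

def centrifuge(xml_string):
    """Squeeze the tag material (moisture) out of an XML page,
    leaving only the pure text content (dry clothes)."""
    root = ET.fromstring(xml_string)
    return "".join(root.itertext())
```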
23. • start
• move around programmatically
• search
• get in focus
• compute
• refine by computing
• export to Excel
• collect work sheets
• annotate
• insights are the new data
• share
• let others collect your data as easily as you collected this corpus
annotation/tutorials/missieven
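The export/annotate/collect round trip can be mimicked with a tab-separated sheet, which Excel opens directly (the column layout here is an illustrative assumption; Text-Fabric has export helpers of its own):

```python
import csv

def export_sheet(rows, path):
    """Export search results as a tab-separated work sheet
    (node id, text, empty annotation column)."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        w = csv.writer(f, delimiter="\t")
        w.writerow(["node", "text", "annotation"])
        w.writerows(rows)

def collect_sheet(path):
    """Collect the sheet after manual annotation: the filled-in
    annotation column becomes new data on top of the corpus."""
    with open(path, newline="", encoding="utf-8") as f:
        r = csv.reader(f, delimiter="\t")
        next(r)  # skip the header row
        return {int(node): ann for node, _text, ann in r if ann}
```

This is the "insights are the new data" loop: what comes back from the sheet is keyed to corpus nodes, so it can be shared and collected like the corpus itself.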
30. what does this road mean?
• for researchers?
• for CLARIAH?
• for DANS / eScience Center / Humanities Cluster / HuygensING
31. researchers
• a short road to being completely "hands on" with their own corpora
• compute in their first programming language
• no technological overhead outside their computing scope (no XML, RDF, or PIDs)
• no metadata intricacy
• focus on the data according to their own mental concepts: the data features
TF corpora
32. CLARIAH
• a unified practice to compute with corpora:
• students of different corpora can share practices
• they can build cookbooks that transcend their particular corpus
• remember the "peculiarity of missives"?
• nearly the same recipe exists for a dozen corpora
• where is the greater gain:
• sorting out metadata?
• or supporting the processing of metadata?
33. DANS / eScience / HuC / archives
Text-Fabric uses GitHub as its data backend!
• GitHub is unique in supporting versioned data check-in / check-out
• GitHub is a hub toward top-notch preservation services: Zenodo, Software Heritage Foundation
YET:
• GH is optimized for code, not for (big) data
• although you can have private repos, GH has little support for access roles there
AND
• GH's diffing techniques may be over the top for data
34. DANS / eScience / HuC / archives
We need another data backend:
• based on the practices of a FAIR repository
• where researchers have the same kind of control as they have in GitHub
• that supports versioning
• where you can download specific versions of specific subfolders of specific datasets under program control: an API
35. DANS / eScience / HuC / archives
• We need a TextHub, a Data Station for processable, annotated Text
• One corpus has many authors that deliver many parts of the data
• Authors control their own parts and share them from places they "own" on the Hub
• Users grab those parts from the Hub under program control
• And deliver the new parts they create to the Hub
36. DANS / eScience / HuC / archives
DANS: provide the Hub (Data Station in Dataverse)
eScience: support best computing practices around the Hub
HuC: consider the Hub as a hop-on to larger infrastructure
Archives: invest in resources on the shelf: make them Hub ready
37. Computing Companion
helps a student of a corpus to approach it with her computer-aided intelligence
corpus data into memory
compute
harvest
share & recycle
be reproducible
go smoothly
dirk.roorda@dans.knaw.nl