SlideShare ist ein Scribd-Unternehmen logo
1 von 37
Downloaden Sie, um offline zu lesen
Generale Missieven
// clariah-wp6 use case 1
2020-11-17 DANS-
research meeting

Dirk Roorda

dirk.roorda@dans.knaw.nl
Generale Missieven
• yearly letters from governor
and board of the Dutch East
Indian Company to the Dutch
government (Heren XVII)

• 1610-1761

• 13 volumes

• 565 letters

• 10,000 pages
resources.huygens.knaw.nl/vocgeneralemissiven
Inside
page header
letter head
provenance
modern
editor
modern
editor
transcribed
original text
transcribed
original text
provenance
footnotes by
modern editor
resources.huygens.knaw.nl/retroboeken
1960 I Coolhaas
1985 VIII Coolhaas†
1988 IX Van Goor
2004 X Van Goor
1997 XI
Schooneveld-
Oosterling
2007 XII
Schooneveld-
Oosterling
2007 XIII s'Jacob
Computing Companion
helps a student of a corpus to approach it with her computer-aided intelligence
It is a toolkit / model / framework /
ethos to

1. get corpus data into RAM

2. compute with it efficiently

3. harvest results

4. recycle results back to the corpus
and to do this in a way that

1. is reproducible

2. reduces friction
that's a long and winding road
Source: TEI
page number
it's ok for automatic
processing,

very discouraging for
manual checking and
double checking
very long lines
inhuman file names
Laundry - trim0
• some pages are hopeless

• we re-sourced data from the OCR strings of the
Huygens website

• cases:

• letters without original content not in TEI (but
there is editorial content and metadata)

• pages with big tables (landscape) resulted in
pathological TEI
Humane data!
file names
are page
numbers metadata is flattened
much of the XML overhead is gone
line breaks are
reflected in the
layout
All the inherent
problems in this
dataset are still there.

But now we have
hope to see them,

to tackle them.
Laundry - trim1
text separation:

• mark folio references

• correct the markup of page
headers

without this step: 

• loss of original text

• contamination of original text
vol. 2 p 538
before
after
Laundry - trim2
• metadata

• re-distil from
letter headings

• check

• diagnostics
before
after
Laundry - trim3 - the mother of all laundries
• get the editorial remarks under tight control

even when they spread across pages

• detect all 12,000+ footnote bodies correctly (done)

• connect all footnote refs to their bodies (done)
None of this is feasible without successful completion of the previous steps.
745.3 92 9)
( ... retourschepen 745. 3. 929 )
or
running trim3
in progress
finally
corrections, corrections
End of laundry
github.com/Dans-labs/clariah-gm/xml/
Centrifuge
• Result:

• clean, dry stuff: Text-Fabric
github.com/Dans-labs/clariah-gm/tf/
With clean XML in hand, We centrifuge
the XML out of the clean laundry:

• we squeeze out all tag material
(moisture)

• leaving only pure content (dry clothes)

• ready to process (ready to wear)
The end of the road?
Local browse/
search interface
computing
interface
tutorialnotebooksonline
nbviewer
• start 

• move around programmatically

• search
• get in focus

• compute
• refine by computing

• exportExcel
• collect work sheets

• annotate
• insights are the new data

• share
• let others collect your data as easily as you
collected this corpus
annotation/tutorials/missieven
searchsearch
compute
compute
compute
compute
annotatebrat
annotate
shareshare
the road ends
what does this road mean?
• for researchers?

• for CLARIAH?

• for DANS / eScience Center / Humanities Cluster / HuygensING

researchers
• short road to be completely "hands
on" with their own corpora

• compute in their first programming
language: "XML"

• no technological overhead outside
their computing scope: XML, RDF, PID

• no metadata intricacy

• focus on data according to their own
mental concepts: the data features
TF corpora
CLARIAH
• a unified practice to compute with corpora:

• students of different corpora can share practices

• they can build cookbooks that transcend their
particular corpus

• remember "peculiarity of missives"?

• nearly the same recipe exists for a dozen
corpora

• where is greater gain:

• sorting out metadata?

• support the processing of metadata ?
TF corpora
DANS / eScience / HuC / archives
Text-Fabric uses GitHub as data-backend!

• GitHub is unique in supporting versioned data check-in / check-out

• GitHub is a hub toward top-notch preservation services: Zenodo, Software Heritage Foundation

YET: 

• GH is optimized for code, not (big) data

• although you can do private repos, there GH has little support for access roles

AND

• GH's diffing techniques maybe over the top for data
DANS / eScience / HuC / archives
We need another data backend:

• based on the practices of a FAIR repository

• where researchers have the same kind of control as they have in GitHub

• that supports versioning

• where you can download specific versions of specific subfolders of
specific datasets under program control: API
DANS / eScience / HuC / archives
• We need a TextHub, a Data Station for processable, annotated Text

• One corpus has many authors that deliver many parts of the data

• Authors control their own parts and share them from places they "own" on
the Hub

• Users grab those parts from the Hub under program control

• And deliver the new parts they create to the Hub
DANS / eScience / HuC / archives
DANS: provide the Hub (Data Station in Dataverse)

eScience: support best computing practices around the Hub

HuC: consider the Hub as a hop-on to larger infrastructure

Archives: invest in resources on the shelf: make them Hub ready
Computing Companion
helps a student of a corpus to approach it with her computer-aided intelligence
corpus data into memory

compute

harvest

share & recycle
be reproducible

go smoothly
dirk.roorda@dans.knaw.nl

Weitere ähnliche Inhalte

Was ist angesagt?

Basic Hadoop Architecture V1 vs V2
Basic  Hadoop Architecture  V1 vs V2Basic  Hadoop Architecture  V1 vs V2
Basic Hadoop Architecture V1 vs V2VIVEKVANAVAN
 
InfluxDB Internals
InfluxDB InternalsInfluxDB Internals
InfluxDB InternalsInfluxData
 
Redis & MongoDB: Stop Big Data Indigestion Before It Starts
Redis & MongoDB: Stop Big Data Indigestion Before It StartsRedis & MongoDB: Stop Big Data Indigestion Before It Starts
Redis & MongoDB: Stop Big Data Indigestion Before It StartsItamar Haber
 
Open Source is Good for Both Business and Humanity - DockerCon 2016
Open Source is Good for Both Business and Humanity - DockerCon 2016 Open Source is Good for Both Business and Humanity - DockerCon 2016
Open Source is Good for Both Business and Humanity - DockerCon 2016 {code}
 
Geek camp
Geek campGeek camp
Geek campjdhok
 
Alluxio+Presto: An Architecture for Fast SQL in the Cloud
Alluxio+Presto: An Architecture for Fast SQL in the CloudAlluxio+Presto: An Architecture for Fast SQL in the Cloud
Alluxio+Presto: An Architecture for Fast SQL in the CloudAlluxio, Inc.
 
Presto on Alluxio Hands-On Lab
Presto on Alluxio Hands-On LabPresto on Alluxio Hands-On Lab
Presto on Alluxio Hands-On LabAlluxio, Inc.
 
DOWNSAMPLING DATA
DOWNSAMPLING DATADOWNSAMPLING DATA
DOWNSAMPLING DATAInfluxData
 
Achieving Separation of Compute and Storage in a Cloud World
Achieving Separation of Compute and Storage in a Cloud WorldAchieving Separation of Compute and Storage in a Cloud World
Achieving Separation of Compute and Storage in a Cloud WorldAlluxio, Inc.
 
Accelerate Analytics and ML in the Hybrid Cloud Era
Accelerate Analytics and ML in the Hybrid Cloud EraAccelerate Analytics and ML in the Hybrid Cloud Era
Accelerate Analytics and ML in the Hybrid Cloud EraAlluxio, Inc.
 
Getting Started on Hadoop
Getting Started on HadoopGetting Started on Hadoop
Getting Started on HadoopPaco Nathan
 

Was ist angesagt? (15)

Basic Hadoop Architecture V1 vs V2
Basic  Hadoop Architecture  V1 vs V2Basic  Hadoop Architecture  V1 vs V2
Basic Hadoop Architecture V1 vs V2
 
Hadoop data analysis
Hadoop data analysisHadoop data analysis
Hadoop data analysis
 
InfluxDB Internals
InfluxDB InternalsInfluxDB Internals
InfluxDB Internals
 
Redis & MongoDB: Stop Big Data Indigestion Before It Starts
Redis & MongoDB: Stop Big Data Indigestion Before It StartsRedis & MongoDB: Stop Big Data Indigestion Before It Starts
Redis & MongoDB: Stop Big Data Indigestion Before It Starts
 
Open Source is Good for Both Business and Humanity - DockerCon 2016
Open Source is Good for Both Business and Humanity - DockerCon 2016 Open Source is Good for Both Business and Humanity - DockerCon 2016
Open Source is Good for Both Business and Humanity - DockerCon 2016
 
RubiX
RubiXRubiX
RubiX
 
Apache cassandra
Apache cassandraApache cassandra
Apache cassandra
 
Geek camp
Geek campGeek camp
Geek camp
 
Hive training
Hive trainingHive training
Hive training
 
Alluxio+Presto: An Architecture for Fast SQL in the Cloud
Alluxio+Presto: An Architecture for Fast SQL in the CloudAlluxio+Presto: An Architecture for Fast SQL in the Cloud
Alluxio+Presto: An Architecture for Fast SQL in the Cloud
 
Presto on Alluxio Hands-On Lab
Presto on Alluxio Hands-On LabPresto on Alluxio Hands-On Lab
Presto on Alluxio Hands-On Lab
 
DOWNSAMPLING DATA
DOWNSAMPLING DATADOWNSAMPLING DATA
DOWNSAMPLING DATA
 
Achieving Separation of Compute and Storage in a Cloud World
Achieving Separation of Compute and Storage in a Cloud WorldAchieving Separation of Compute and Storage in a Cloud World
Achieving Separation of Compute and Storage in a Cloud World
 
Accelerate Analytics and ML in the Hybrid Cloud Era
Accelerate Analytics and ML in the Hybrid Cloud EraAccelerate Analytics and ML in the Hybrid Cloud Era
Accelerate Analytics and ML in the Hybrid Cloud Era
 
Getting Started on Hadoop
Getting Started on HadoopGetting Started on Hadoop
Getting Started on Hadoop
 

Ähnlich wie General Missives

CityLABS Workshop: Working with large tables
CityLABS Workshop: Working with large tablesCityLABS Workshop: Working with large tables
CityLABS Workshop: Working with large tablesEnrico Daga
 
OU RSE Tutorial Big Data Cluster
OU RSE Tutorial Big Data ClusterOU RSE Tutorial Big Data Cluster
OU RSE Tutorial Big Data ClusterEnrico Daga
 
Querix 4 gl app analyzer 2016 journey to the center of your 4gl application
Querix 4 gl app analyzer 2016 journey to the center of your 4gl applicationQuerix 4 gl app analyzer 2016 journey to the center of your 4gl application
Querix 4 gl app analyzer 2016 journey to the center of your 4gl applicationBeGooden-IT Consulting
 
Big Data Technologies - Hadoop
Big Data Technologies - HadoopBig Data Technologies - Hadoop
Big Data Technologies - HadoopTalentica Software
 
Drupal, git and sanity
Drupal, git and sanityDrupal, git and sanity
Drupal, git and sanityCharlie Morris
 
Python the lingua franca of FEWS
Python the lingua franca of FEWSPython the lingua franca of FEWS
Python the lingua franca of FEWSLindsay Millard
 
2018 ABRF Tools for improving rigor and reproducibility in bioinformatics
2018 ABRF Tools for improving rigor and reproducibility in bioinformatics2018 ABRF Tools for improving rigor and reproducibility in bioinformatics
2018 ABRF Tools for improving rigor and reproducibility in bioinformaticsStephen Turner
 
Asbury Hadoop Overview
Asbury Hadoop OverviewAsbury Hadoop Overview
Asbury Hadoop OverviewBrian Enochson
 
SQL Server Konferenz 2014 - SSIS & HDInsight
SQL Server Konferenz 2014 - SSIS & HDInsightSQL Server Konferenz 2014 - SSIS & HDInsight
SQL Server Konferenz 2014 - SSIS & HDInsightTillmann Eitelberg
 
Kernel Recipes 2016 - Kernel documentation: what we have and where it’s going
Kernel Recipes 2016 - Kernel documentation: what we have and where it’s goingKernel Recipes 2016 - Kernel documentation: what we have and where it’s going
Kernel Recipes 2016 - Kernel documentation: what we have and where it’s goingAnne Nicolas
 
Building Applications using Apache Hadoop
Building Applications using Apache HadoopBuilding Applications using Apache Hadoop
Building Applications using Apache HadoopC4Media
 
Shaping the Future: To Globus Compute and Beyond!
Shaping the Future: To Globus Compute and Beyond!Shaping the Future: To Globus Compute and Beyond!
Shaping the Future: To Globus Compute and Beyond!Globus
 
Agile Data: Building Hadoop Analytics Applications
Agile Data: Building Hadoop Analytics ApplicationsAgile Data: Building Hadoop Analytics Applications
Agile Data: Building Hadoop Analytics ApplicationsDataWorks Summit
 
Agile Data Science: Building Hadoop Analytics Applications
Agile Data Science: Building Hadoop Analytics ApplicationsAgile Data Science: Building Hadoop Analytics Applications
Agile Data Science: Building Hadoop Analytics ApplicationsRussell Jurney
 
Introduction to Apache Hadoop Ecosystem
Introduction to Apache Hadoop EcosystemIntroduction to Apache Hadoop Ecosystem
Introduction to Apache Hadoop EcosystemMahabubur Rahaman
 

Ähnlich wie General Missives (20)

CityLABS Workshop: Working with large tables
CityLABS Workshop: Working with large tablesCityLABS Workshop: Working with large tables
CityLABS Workshop: Working with large tables
 
OU RSE Tutorial Big Data Cluster
OU RSE Tutorial Big Data ClusterOU RSE Tutorial Big Data Cluster
OU RSE Tutorial Big Data Cluster
 
Kubeflow.pptx
Kubeflow.pptxKubeflow.pptx
Kubeflow.pptx
 
Querix 4 gl app analyzer 2016 journey to the center of your 4gl application
Querix 4 gl app analyzer 2016 journey to the center of your 4gl applicationQuerix 4 gl app analyzer 2016 journey to the center of your 4gl application
Querix 4 gl app analyzer 2016 journey to the center of your 4gl application
 
Big Data Technologies - Hadoop
Big Data Technologies - HadoopBig Data Technologies - Hadoop
Big Data Technologies - Hadoop
 
Drupal, git and sanity
Drupal, git and sanityDrupal, git and sanity
Drupal, git and sanity
 
Python the lingua franca of FEWS
Python the lingua franca of FEWSPython the lingua franca of FEWS
Python the lingua franca of FEWS
 
R meetup 20161011v2
R meetup 20161011v2R meetup 20161011v2
R meetup 20161011v2
 
Data Science
Data ScienceData Science
Data Science
 
2018 ABRF Tools for improving rigor and reproducibility in bioinformatics
2018 ABRF Tools for improving rigor and reproducibility in bioinformatics2018 ABRF Tools for improving rigor and reproducibility in bioinformatics
2018 ABRF Tools for improving rigor and reproducibility in bioinformatics
 
Asbury Hadoop Overview
Asbury Hadoop OverviewAsbury Hadoop Overview
Asbury Hadoop Overview
 
SQL Server Konferenz 2014 - SSIS & HDInsight
SQL Server Konferenz 2014 - SSIS & HDInsightSQL Server Konferenz 2014 - SSIS & HDInsight
SQL Server Konferenz 2014 - SSIS & HDInsight
 
Kernel Recipes 2016 - Kernel documentation: what we have and where it’s going
Kernel Recipes 2016 - Kernel documentation: what we have and where it’s goingKernel Recipes 2016 - Kernel documentation: what we have and where it’s going
Kernel Recipes 2016 - Kernel documentation: what we have and where it’s going
 
Building Applications using Apache Hadoop
Building Applications using Apache HadoopBuilding Applications using Apache Hadoop
Building Applications using Apache Hadoop
 
Shaping the Future: To Globus Compute and Beyond!
Shaping the Future: To Globus Compute and Beyond!Shaping the Future: To Globus Compute and Beyond!
Shaping the Future: To Globus Compute and Beyond!
 
Agile Data: Building Hadoop Analytics Applications
Agile Data: Building Hadoop Analytics ApplicationsAgile Data: Building Hadoop Analytics Applications
Agile Data: Building Hadoop Analytics Applications
 
Agile Data Science: Building Hadoop Analytics Applications
Agile Data Science: Building Hadoop Analytics ApplicationsAgile Data Science: Building Hadoop Analytics Applications
Agile Data Science: Building Hadoop Analytics Applications
 
getFamiliarWithHadoop
getFamiliarWithHadoopgetFamiliarWithHadoop
getFamiliarWithHadoop
 
Introduction to Apache Hadoop Ecosystem
Introduction to Apache Hadoop EcosystemIntroduction to Apache Hadoop Ecosystem
Introduction to Apache Hadoop Ecosystem
 
Intro to Big Data
Intro to Big DataIntro to Big Data
Intro to Big Data
 

Mehr von Dirk Roorda

Text Display (when it gets tricky)
Text Display (when it gets tricky)Text Display (when it gets tricky)
Text Display (when it gets tricky)Dirk Roorda
 
Quran and Text-Fabric
Quran and Text-FabricQuran and Text-Fabric
Quran and Text-FabricDirk Roorda
 
Ancient corpora analysis
Ancient corpora analysisAncient corpora analysis
Ancient corpora analysisDirk Roorda
 
Verbal Valency in Hebrew Verbs
Verbal Valency in Hebrew VerbsVerbal Valency in Hebrew Verbs
Verbal Valency in Hebrew VerbsDirk Roorda
 
Data management for researchers
Data management for researchersData management for researchers
Data management for researchersDirk Roorda
 
Annotating the Hebrew Bible
Annotating the Hebrew BibleAnnotating the Hebrew Bible
Annotating the Hebrew BibleDirk Roorda
 
20151111 utrecht ver theolbibliothecarissen
20151111 utrecht ver theolbibliothecarissen20151111 utrecht ver theolbibliothecarissen
20151111 utrecht ver theolbibliothecarissenDirk Roorda
 
Text as Data: processing the Hebrew Bible
Text as Data: processing the Hebrew BibleText as Data: processing the Hebrew Bible
Text as Data: processing the Hebrew BibleDirk Roorda
 
Datamanagement for Research: A Case Study
Datamanagement for Research: A Case StudyDatamanagement for Research: A Case Study
Datamanagement for Research: A Case StudyDirk Roorda
 
Datamanagement for Research: A Case Study
Datamanagement for Research: A Case StudyDatamanagement for Research: A Case Study
Datamanagement for Research: A Case StudyDirk Roorda
 
Hebrew Bible as Data: Laboratory, Sharing, Lessons
Hebrew Bible as Data: Laboratory, Sharing, LessonsHebrew Bible as Data: Laboratory, Sharing, Lessons
Hebrew Bible as Data: Laboratory, Sharing, LessonsDirk Roorda
 
Laf fabric-dh benelux2014
Laf fabric-dh benelux2014Laf fabric-dh benelux2014
Laf fabric-dh benelux2014Dirk Roorda
 
Data Analysis in the Hebrew Bible
Data Analysis in the Hebrew BibleData Analysis in the Hebrew Bible
Data Analysis in the Hebrew BibleDirk Roorda
 

Mehr von Dirk Roorda (20)

TF-FAIR.pdf
TF-FAIR.pdfTF-FAIR.pdf
TF-FAIR.pdf
 
Textpy
TextpyTextpy
Textpy
 
Text Display (when it gets tricky)
Text Display (when it gets tricky)Text Display (when it gets tricky)
Text Display (when it gets tricky)
 
Tf in-context
Tf in-contextTf in-context
Tf in-context
 
Quran and Text-Fabric
Quran and Text-FabricQuran and Text-Fabric
Quran and Text-Fabric
 
Ancient corpora analysis
Ancient corpora analysisAncient corpora analysis
Ancient corpora analysis
 
Qdf2tf
Qdf2tfQdf2tf
Qdf2tf
 
Text fabric
Text fabricText fabric
Text fabric
 
Verbal Valency in Hebrew Verbs
Verbal Valency in Hebrew VerbsVerbal Valency in Hebrew Verbs
Verbal Valency in Hebrew Verbs
 
Data management for researchers
Data management for researchersData management for researchers
Data management for researchers
 
Annotating the Hebrew Bible
Annotating the Hebrew BibleAnnotating the Hebrew Bible
Annotating the Hebrew Bible
 
20151111 utrecht ver theolbibliothecarissen
20151111 utrecht ver theolbibliothecarissen20151111 utrecht ver theolbibliothecarissen
20151111 utrecht ver theolbibliothecarissen
 
Text as Data: processing the Hebrew Bible
Text as Data: processing the Hebrew BibleText as Data: processing the Hebrew Bible
Text as Data: processing the Hebrew Bible
 
Datamanagement for Research: A Case Study
Datamanagement for Research: A Case StudyDatamanagement for Research: A Case Study
Datamanagement for Research: A Case Study
 
Award
AwardAward
Award
 
Datamanagement for Research: A Case Study
Datamanagement for Research: A Case StudyDatamanagement for Research: A Case Study
Datamanagement for Research: A Case Study
 
Hebrew Bible as Data: Laboratory, Sharing, Lessons
Hebrew Bible as Data: Laboratory, Sharing, LessonsHebrew Bible as Data: Laboratory, Sharing, Lessons
Hebrew Bible as Data: Laboratory, Sharing, Lessons
 
Laf fabric-dh benelux2014
Laf fabric-dh benelux2014Laf fabric-dh benelux2014
Laf fabric-dh benelux2014
 
Data Analysis in the Hebrew Bible
Data Analysis in the Hebrew BibleData Analysis in the Hebrew Bible
Data Analysis in the Hebrew Bible
 
LAF Fabric
LAF FabricLAF Fabric
LAF Fabric
 

Kürzlich hochgeladen

General AI for Medical Educators April 2024
General AI for Medical Educators April 2024General AI for Medical Educators April 2024
General AI for Medical Educators April 2024Janet Corral
 
Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3JemimahLaneBuaron
 
Measures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDMeasures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDThiyagu K
 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfsanyamsingh5019
 
IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...
IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...
IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...PsychoTech Services
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdfSoniaTolstoy
 
Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationnomboosow
 
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...christianmathematics
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactPECB
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxiammrhaywood
 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104misteraugie
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introductionMaksud Ahmed
 
APM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAPM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAssociation for Project Management
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Sapana Sha
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityGeoBlogs
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdfQucHHunhnh
 
social pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajansocial pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajanpragatimahajan3
 

Kürzlich hochgeladen (20)

General AI for Medical Educators April 2024
General AI for Medical Educators April 2024General AI for Medical Educators April 2024
General AI for Medical Educators April 2024
 
Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3
 
Measures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDMeasures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SD
 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdf
 
IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...
IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...
IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
 
Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communication
 
Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1
 
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
 
Advance Mobile Application Development class 07
Advance Mobile Application Development class 07Advance Mobile Application Development class 07
Advance Mobile Application Development class 07
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global Impact
 
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptxINDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introduction
 
APM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAPM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across Sectors
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activity
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdf
 
social pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajansocial pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajan
 

General Missives

  • 1. Generale Missieven // clariah-wp6 use case 1 2020-11-17 DANS- research meeting Dirk Roorda dirk.roorda@dans.knaw.nl
  • 2. Generale Missieven • yearly letters from governor and board of the Dutch East Indian Company to the Dutch government (Heren XVII) • 1610-1761 • 13 volumes • 565 letters • 10,000 pages resources.huygens.knaw.nl/vocgeneralemissiven
  • 3. Inside page header letter head provenance modern editor modern editor transcribed original text transcribed original text provenance footnotes by modern editor resources.huygens.knaw.nl/retroboeken 1960 I Coolhaas 1985 VIII Coolhaas† 1988 IX Van Goor 2004 X Van Goor 1997 XI Schooneveld- Oosterling 2007 XII Schooneveld- Oosterling 2007 XIII s'Jacob
  • 4. Computing Companion helps a student of a corpus to approach it with her computer-aided intelligence It is a toolkit / model / framework / ethos to 1. get corpus data into RAM 2. compute with it efficiently 3. harvest results 4. recycle results back to the corpus and to do this in a way that 1. is reproducible 2. reduces friction
  • 5. that's a long and winding road
  • 6.
  • 7. Source: TEI page number it's ok for automatic processing, very discouraging for manual checking and double checking very long lines inhuman file names
  • 8. Laundry - trim0 • some pages are hopeless • we re-sourced data from the OCR strings of the Huygens website • cases: • letters without original content not in TEI (but there is editorial content and metadata) • pages with big tables (landscape) resulted in pathological TEI
  • 9. Humane data! file names are page numbers metadata is flattened much of the XML overhead is gone line breaks are reflected in the layout All the inherent problems in this dataset are still there. But now we have hope to see them, to tackle them.
  • 10. Laundry - trim1 text separation: • mark folio references • correct the markup of page headers without this step: • loss of original text • contamination of original text vol. 2 p 538 before after
  • 11. Laundry - trim2 • metadata • re-distil from letter headings • check • diagnostics before after
  • 12. Laundry - trim3 - the mother of all laundries • get the editorial remarks under tight control even when they spread across pages • detect all 12,000+ footnote bodies correctly (done) • connect all footnote refs to their bodies (done) None of this is feasible without successful completion of the previous steps.
  • 13. 745.3 92 9) ( ... retourschepen 745. 3. 929 ) or
  • 14.
  • 18. Centrifuge • Result: • clean, dry stuff: Text-Fabric github.com/Dans-labs/clariah-gm/tf/ With clean XML in hand, We centrifuge the XML out of the clean laundry: • we squeeze out all tag material (moisture) • leaving only pure content (dry clothes) • ready to process (ready to wear)
  • 19. The end of the road?
  • 23. • start • move around programmatically • search • get in focus • compute • refine by computing • exportExcel • collect work sheets • annotate • insights are the new data • share • let others collect your data as easily as you collected this corpus annotation/tutorials/missieven
  • 30. what does this road mean? • for researchers? • for CLARIAH? • for DANS / eScience Center / Humanities Cluster / HuygensING

  • 31. researchers • short road to be completely "hands on" with their own corpora • compute in their first programming language: "XML" • no technological overhead outside their computing scope: XML, RDF, PID • no metadata intricacy • focus on data according to their own mental concepts: the data features TF corpora
  • 32. CLARIAH • a unified practice to compute with corpora: • students of different corpora can share practices • they can build cookbooks that transcend their particular corpus • remember "peculiarity of missives"? • nearly the same recipe exists for a dozen corpora • where is greater gain: • sorting out metadata? • support the processing of metadata ? TF corpora
  • 33. DANS / eScience / HuC / archives Text-Fabric uses GitHub as data-backend! • GitHub is unique in supporting versioned data check-in / check-out • GitHub is a hub toward top-notch preservation services: Zenodo, Software Heritage Foundation YET: • GH is optimized for code, not (big) data • although you can do private repos, there GH has little support for access roles AND • GH's diffing techniques maybe over the top for data
  • 34. DANS / eScience / HuC / archives We need another data backend: • based on the practices of a FAIR repository • where researchers have the same kind of control as they have in GitHub • that supports versioning • where you can download specific versions of specific subfolders of specific datasets under program control: API
  • 35. DANS / eScience / HuC / archives • We need a TextHub, a Data Station for processable, annotated Text • One corpus has many authors that deliver many parts of the data • Authors control their own parts and share them from places they "own" on the Hub • Users grab those parts from the Hub under program control • And deliver the new parts they create to the Hub
  • 36. DANS / eScience / HuC / archives DANS: provide the Hub (Data Station in Dataverse) eScience: support best computing practices around the Hub HuC: consider the Hub as a hop-on to larger infrastructure Archives: invest in resources on the shelf: make them Hub ready
  • 37. Computing Companion helps a student of a corpus to approach it with her computer-aided intelligence corpus data into memory compute harvest share & recycle be reproducible go smoothly dirk.roorda@dans.knaw.nl