1. The Early Modern OCR Project
An open source workflow and tools for OCR
Matthew Christy,
Laura Mandell,
Elizabeth Grumbach
2. eMOP Info
eMOP Website: emop.tamu.edu/
Texas A&M Big Data Workshop: emop.tamu.edu/TAMU-BigData
eMOP Workflows: emop.tamu.edu/workflows
Mellon Grant Proposal: idhmc.tamu.edu/projects/Mellon/eMOPPublic.pdf
More eMOP
Facebook: The Early Modern OCR Project
Twitter: #emop, @IDHMC_Nexus, @mandellc, @matt_christy, @EMGrumbach
3. Early Modern OCR Project
The Early Modern OCR Project (eMOP) is an Andrew W.
Mellon Foundation funded grant project running out of the
Initiative for Digital Humanities, Media, and Culture (IDHMC)
at Texas A&M University, to develop and test tools and
techniques to apply Optical Character Recognition (OCR)
to early modern English documents from the hand press
period, roughly 1475-1800.
eMOP aims to improve the visibility of early modern texts by
making their contents fully searchable. The current
paradigm of searching special collections for early modern
materials by either metadata alone or “dirty” OCR is
insufficient for scholarly research.
Specifically, eMOP’s goal is to make machine readable, or improve the readability of, 305,000 documents/45 million pages of text from two major proprietary databases: Eighteenth Century Collections Online (ECCO) and Early English Books Online (EEBO). Generally, our aim is to use typeface and book history techniques to train modern OCR engines specifically on the typefaces in our collection of documents, and thereby improve the accuracy of the OCR results.
4. The Numbers
Page Images
Early English Books Online (ProQuest) EEBO: ~125,000 documents, ~13 million page images (1475-1700)
Eighteenth Century Collections Online (Gale Cengage) ECCO: ~182,000 documents, ~32 million page images (1700-1800)
Total: >300,000 documents & 45 million page images
Ground Truth
Text Creation Partnership TCP: ~46,000 double-keyed, hand-transcribed documents
~44,000 EEBO
~2,200 ECCO
6. Our Partners
emop.tamu.edu/participating-institutions
• PRImA (Pattern Recognition & Image Analysis Research) Lab at the University of Salford, Manchester, UK
• SEASR (Software Environment for the Advancement of Scholarly Research) at the University of Illinois, Urbana-Champaign
• PSI (Perception, Sensing, and Instrumentation) Lab at Texas A&M University
• The Academy for Advanced Telecommunications and Learning Technologies at Texas A&M University
• The Brazos High Performance Computing Cluster (HPCC)
7. The Problems
Early Modern Printing
Individual, hand-made typefaces
Worn and broken type
Poor quality equipment/paper
Inconsistent line bases
Unusual page layouts, decorative page elements
Special characters & ligatures
Spelling variations
Mixed typefaces and languages
Over/under-inking
Document/Image Quality
Torn and damaged pages
Noise introduced to images of pages
Skewed pages
Warped pages
Missing pages
Inverted pages
Incorrect metadata
Extremely low quality TIFFs (~50K)
10. Aletheia
www.primaresearch.org/tools.php
Created by PRImA Research Labs, University of Salford, UK.
Available for free but requires registration.
Windows-based tool.
Developed as a ground-truth creation tool.
Used by eMOP undergraduate student workers to create training for the desired typefaces for Tesseract.
Can identify glyphs on a page image with page coordinates and Unicode values.
11. Aletheia: Workflow
Binarization and de-noising are native Aletheia functions.
A team of undergraduate student workers refines and corrects glyph boxes and Unicode values, where needed.
Output: a set of PAGE XML files with page coordinates and Unicode values for every identified glyph on each processed TIFF image.
15. Franken+
1. Windows-based tool that uses a MySQL DB.
2. Developed for eMOP by IDHMC graduate student worker Bryan Tarpley.
3. Designed to be easily used by eMOP undergraduate student workers.
4. Takes Aletheia's output files as input.
5. Outputs the same box files and TIFF images that Tesseract's first stage of native training produces.
Available open-source at: github.com/idhmc-tamu/FrankenPlus
16. Franken+: Workflow
1. Groups all glyphs with the same Unicode value into one window for comparison.
2. Uses all selected glyphs to create a Franken-page image (TIFF) using a selected text as a base (see the sketch after this list).
3. Outputs the same box files and TIFF images that Tesseract's first stage of native training produces.
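Conceptually, step 2 pastes stored glyph exemplars onto a blank canvas in the order dictated by a chosen base text. The toy sketch below conveys only that idea: the exemplars dict (character -> list of cropped PIL glyph images) is a hypothetical stand-in for Franken+'s MySQL-backed glyph store, and the real tool also aligns baselines and point sizes, which this skips.

import random
from PIL import Image

def franken_page(base_text, exemplars, page_size=(2000, 3000),
                 margin=100, line_height=60):
    """Lay out glyph exemplars to mimic a printed page of base_text."""
    page = Image.new('L', page_size, color=255)       # blank white page
    x, y = margin, margin
    for ch in base_text:
        if ch == '\n' or x > page_size[0] - margin:   # wrap to a new line
            x, y = margin, y + line_height
            if ch == '\n':
                continue
        glyphs = exemplars.get(ch)
        if not glyphs:                                # no exemplar: just advance
            x += line_height // 2
            continue
        glyph = random.choice(glyphs)                 # vary exemplars, like real type
        page.paste(glyph, (x, y))
        x += glyph.width + 2
    return page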
18. Franken+
All exemplars of the same glyph are displayed together. Users can quickly identify and deselect:
Incorrectly labeled glyphs
Incomplete glyphs
Unrepresentative exemplars
Different-sized glyphs
20. Training Tesseract
Thiſ great conſumption to a fever turn'd,
And ſo the oꝗld had fitſ; it joy'd, it mourn'd;
And, aſ men thinke, that Agueſ phy ck are,
And th'Ague being ſpent, give over care.
Žo thou cke World, mꝗſtak'ſt thy ſelże to bee
Well, when ãlaſ, thou'rt in a Lethargie.
Her death did wound and tame thee than, and than
Thou might'ſt ha e better ſpar'd the Sunne, or man.
That wound waſ deep, but 'tiſ more miżery,
That thou haſt loſt thy ſenſe and memor .
'Twaſ heavy then to heare thy voyce of mone,
But thiſ iſ worſe, that thou art ſpeechle e growne.
Thou haſt forgot thy name thou hadſt; thou waſt
Nothing but ee, and her thou haſt o'rpaſt.
For aſ a child kept from the Fount, untill
Ä prince, expe ed long, come to fulfill
The ceremonieſ, thou unnam'd had'ſt laid,
Had not her comming, thee her palace made:
Her name defin'd thee, gave thee forme, and frame,
And thou forgett'ſt to celebrate th n me.
Some monethſ e hath beene dead (but beìng dead,
Meaſureſ of timeſ are all determined)
But long e'ath beene away, long, long, et none
Offerſ to tell uſ who it iſ that'ſ gone.
But aſ in ſtateſ doubtfull of future heireſ,
When ckne e without remedie empaireſ
The preſent Prince, they're loth it ould be ſaid,
The Prince doth langui , or the Prince iſ dead:
So mankinde feeling no a generall tha ,
23. Workflows: Controller
• Powered by the eMOP DB
• Collection processing is managed via the online Dashboard: emop-dashboard.tamu.edu
• Run by emop-controller.py
24. Brazos HPCC
• 128 processors as a stakeholder
• Access to background queues
• Estimated at over 2 months of constant processing
• We will have to reprocess some files that fail due to timeouts or that require pre-processing
28. Triage: De-noising
Uses hOCR results:
1. Determine the average word bounding box size.
2. Weed out boxes that are too big or too small.
3. But keep small boxes that have neighbors that are “words” (see the sketch below).
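A sketch of that pass, assuming hOCR input as produced by Tesseract ('ocrx_word' spans with bbox coordinates in the title attribute); the size thresholds and the neighbor test are illustrative, not eMOP's exact values:

import re
import statistics

def word_boxes(hocr):
    """Extract (x0, y0, x1, y1) word boxes from hOCR 'ocrx_word' spans."""
    boxes = []
    for m in re.finditer(r"ocrx_word[^>]*title=['\"]([^'\"]*)", hocr):
        b = re.search(r'bbox (\d+) (\d+) (\d+) (\d+)', m.group(1))
        if b:
            boxes.append(tuple(int(v) for v in b.groups()))
    return boxes

def near_words(box, boxes, mean):
    """True if a roughly word-sized box sits close by on the same line."""
    x0, y0, x1, y1 = box
    h = y1 - y0
    for bx0, by0, bx1, by1 in boxes:
        if (bx0, by0, bx1, by1) == box:
            continue
        same_line = abs(by0 - y0) < h                 # crude same-baseline test
        close = min(abs(bx0 - x1), abs(x0 - bx1)) < 3 * h
        word_sized = (bx1 - bx0) * (by1 - by0) >= 0.5 * mean
        if same_line and close and word_sized:
            return True
    return False

def denoise(boxes, low=0.2, high=5.0):
    """Keep boxes near the average word size; rescue small ones with word neighbors."""
    mean = statistics.mean((x1 - x0) * (y1 - y0) for x0, y0, x1, y1 in boxes)
    keep = []
    for box in boxes:
        x0, y0, x1, y1 = box
        area = (x1 - x0) * (y1 - y0)
        if low * mean <= area <= high * mean:
            keep.append(box)                          # plausibly a word
        elif area < low * mean and near_words(box, boxes, mean):
            keep.append(box)                          # small, but in a line of words
    return keep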
31. Triage: Estimated Correctability
Page Evaluation
Determine how correctable a page’s OCR results are by examining the text. The score is based on the ratio of words that fit the correctable profile to the total number of words.
Correctable Profile
1. Clean tokens:
remove leading and trailing punctuation
the remaining token must have at least 3 letters
2. Spell-check tokens >1 character.
3. Check the token profile:
contains at most 2 non-alpha characters, and
at least 1 alpha character,
has a length of at least 3, and
does not contain 4 or more repeated characters in a run.
4. Also consider the length of tokens compared to the average for the page.
A sketch of this scoring pass follows.
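A minimal sketch, assuming the period-specific dictionary is a plain set of lowercase words; the slide leaves open exactly how the spell check and the profile check combine, so here a token passes if either succeeds, and the average-length comparison (step 4) is omitted:

import re
import string

def clean(token):
    """Step 1: strip leading/trailing punctuation."""
    return token.strip(string.punctuation)

def fits_profile(token, dictionary):
    t = clean(token)
    if len(t) > 1 and t.lower() in dictionary:       # step 2: spell check
        return True
    non_alpha = sum(1 for c in t if not c.isalpha())
    return (len(t) >= 3                              # step 3: token profile
            and non_alpha <= 2
            and len(t) - non_alpha >= 1
            and not re.search(r'(.)\1{3,}', t))      # no 4+ repeats in a run

def correctability(tokens, dictionary):
    """Ratio of profile-fitting words to all words on the page."""
    words = [t for t in tokens if clean(t)]
    if not words:
        return 0.0
    return sum(fits_profile(t, dictionary) for t in words) / len(words)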
33. Treatment: Page Correction
1. Preliminary cleanup:
remove punctuation from the beginning/end of tokens
remove empty lines and empty tokens
combine hyphenated tokens that appear at the end of a line
retain cleaned & original tokens as “suggestions”
2. Apply common transformations and period-specific dictionary lookups to gather suggestions for words (sketched below).
transformation rules: rn->m; c->e; 1->l; e…
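A sketch of the suggestion-gathering step under those rules; the dictionary is again a hypothetical set of period words, the rule list here covers only the three rules the slide names, and eMOP's actual dictionary lookup was richer than this plain membership test:

RULES = [('rn', 'm'), ('c', 'e'), ('1', 'l')]   # rules named on the slide

def suggestions(token, dictionary):
    """Gather candidates: the token itself plus every single application
    of a confusion rule that yields a dictionary word."""
    cands = set()
    for wrong, right in RULES:
        i = token.find(wrong)
        while i != -1:                          # try the rule at each position
            cands.add(token[:i] + right + token[i + len(wrong):])
            i = token.find(wrong, i + 1)
    return {token} | {c for c in cands if c.lower() in dictionary}

# e.g. suggestions('1ord', {'lord'}) -> {'1ord', 'lord'}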
34. Treatment: Page Correction
3. Use context checking on a sliding window of 3 words, and their suggested changes, to find the best context matches in our (sanitized, period-specific) Google 3-gram dataset (see the sketch below).
if no context is found and only one additional suggestion was made by transformation or dictionary, then replace with this suggestion
if no context is found and the “clean” token from above is in the dictionary, replace with this token
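A sketch of picking the best-supported window, assuming the 3-gram data has been loaded into a dict keyed by lowercased trigram with (matchCount, volCount) values:

from itertools import product

def best_context(candidate_sets, ngram_counts):
    """candidate_sets: three sets of suggestions, one per window position.
    ngram_counts: stand-in for the sanitized, period-specific Google
    3-gram data, mapping 'w1 w2 w3' -> (matchCount, volCount)."""
    best, best_count = None, 0
    for triple in product(*candidate_sets):
        counts = ngram_counts.get(' '.join(triple).lower())
        if counts and counts[0] > best_count:   # rank by matchCount
            best, best_count = triple, counts[0]
    return best

On the window (thoughc, Ihe, Was) shown two slides below, this ranking favors “though she was” (matchCount 556,763) with “thought she was” close behind; evidence combined across the overlapping windows settles the full line as “that I thought she was”.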
35. Treatment: Page Correction
window: tbat l thoughc
Candidates used for context matching:
tbat -> Set(thai, thar, bat, twat, tibet, ébat, ibat, tobit, that, tat, tba, ilial, abat, tbat, teat)
l -> Set(l)
thoughc -> Set(thoughc, thought, though)
ContextMatch: that l thought (matchCount: 1844, volCount: 1474)
window: l thoughc Ihe
Candidates used for context matching:
l -> Set(l)
thoughc -> Set(thoughc, thought, though)
Ihe -> Set(che, sho, enc, ile, iee, plie, ihe, ire, ike, she, ife, ide, ibo, i.e, ene, ice, inc, tho, ime, ite, ive, the)
ContextMatch: l though the (matchCount: 497, volCount: 486)
ContextMatch: l thought she (matchCount: 1538, volCount: 997)
ContextMatch: l thought the (matchCount: 2496, volCount: 1905)
tbat I thoughc Ihe Was
36. Treatment: Page Correction
window: thoughc Ihe Was
Candidates used for context matching:
thoughc -> Set(thoughc, thought, though)
Ihe -> Set(che, sho, enc, ile, iee, plie, ihe, ire, ike, she, ife, ide, ibo, i.e, ene, ice, inc, tho, ime, ite, ive, the)
Was -> Set(Was)
ContextMatch: though ice was (matchCount: 121, volCount: 120)
ContextMatch: though ike was (matchCount: 65, volCount: 59)
ContextMatch: though she was (matchCount: 556,763, volCount: 364,965)
ContextMatch: though the was (matchCount: 197, volCount: 196)
ContextMatch: thought ice was (matchCount: 45, volCount: 45)
ContextMatch: thought ike was (matchCount: 112, volCount: 108)
ContextMatch: thought she was (matchCount: 549,531, volCount: 325,822)
ContextMatch: thought the was (matchCount: 91, volCount: 91)
that I thought she was
39. Tags DB
All page images that don’t produce OCR text that meets our standards will be tagged in the eMOP DB with tags that indicate what is wrong with the page image: skew, warp, noise, wrong font, etc.
Tags can then be used to later apply the appropriate pre-processing techniques and then re-OCR the page to produce better results.
This will be the first time that any sort of comprehensive analysis has been done on the page images of these collections.
We’ll end up (ideally) with a list of page images that simply need to be re-scanned.
40. TypeWright
www.18thconnect.org/typewright/
Crowd-sourced correction tool for OCR.
Available through 18thconnect.org, an aggregator of 18th-century metadata & content.
Currently available for the ECCO collection, based on the original ECCO OCR results.
Will be ingested with eMOP OCR for ECCO and EEBO by year-end.
This will be the first time that EEBO documents are available as searchable text, and that they can be corrected by scholars for their own use.
Users can correct a whole document and then receive the plain text, or a lightly encoded TEI XML version, for use in a digital edition or other project.
The Early Modern OCR Project (eMOP) is an Andrew W. Mellon Foundation funded grant project running out of the Initiative for Digital Humanities, Media, and Culture (IDHMC) at Texas A&M University, to develop and test tools and techniques to improve Optical Character Recognition (OCR) outcomes for printed English documents from the hand press period, roughly 1475-1800. The basic premise of eMOP is to use typeface and book history techniques to train modern OCR engines specifically on the typefaces in our collection of documents, and thereby improve the accuracy of the OCR results. eMOP’s immediate goal is to make machine readable, or improve the readability of, 45 million pages of text from two major proprietary databases: Eighteenth Century Collections Online (ECCO) and Early English Books Online (EEBO). Generally, eMOP aims to improve the visibility of early modern texts by making their contents fully searchable. The current paradigm of searching special collections for early modern materials by either metadata alone or “dirty” OCR is inefficient for scholarly research.
These are not big numbers by STEM standards, but for the Humanities they are:
This is the largest dataset for any open-source, academic OCR project in North America, ever.
For the Humanities, research data is about quality of content rather than size.
Our OCR must be as good as possible in order to generate reliable research.
Some page images were great, but most were not:
Noisy
Skewed
Warped
Or they posed challenges for OCR engines:
Multiple pages per image
Multiple columns
Images & decorative elements
Marginalia
Missing margins
And many were terrible.
Tesseract:
Google
Open source
Accurate
Fast
Tesseract must first be trained to recognize the typeface used in the document(s) it is OCRing.
Tesseract has a native mechanism to create that training, but it relies on having training pages with characteristics that can typically only be created with modern word processors, which don't have early modern fonts. To create early modern font training for Tesseract we had to adapt and create our own tools.
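For reference, Tesseract 3.x’s native training sequence, driven from Python, once matched TIFF/box pairs exist. This is a rough sketch per the Tesseract 3.x training docs (some versions add steps such as shapeclustering); it assumes the training tools are on PATH and a font_properties file is present, and the ‘emop.garamond’ naming is purely illustrative:

import os
import subprocess

def run(*cmd):
    print('$', ' '.join(cmd))
    subprocess.check_call(cmd)

base = 'emop.garamond.exp0'   # hypothetical basename for one TIFF/box pair

run('tesseract', base + '.tif', base, 'box.train')  # -> emop.garamond.exp0.tr
run('unicharset_extractor', base + '.box')          # -> unicharset
run('mftraining', '-F', 'font_properties', '-U', 'unicharset',
    '-O', 'emop.unicharset', base + '.tr')          # -> inttemp, pffmtable, shapetable
run('cntraining', base + '.tr')                     # -> normproto
for f in ('inttemp', 'pffmtable', 'shapetable', 'normproto'):
    os.rename(f, 'emop.' + f)                       # prefix with the language code
run('combine_tessdata', 'emop.')                    # -> emop.traineddata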
Aletheia: Created by PRImA Research Labs at the University of Salford as a ground-truth creation tool. A team of undergraduates uses Aletheia to identify each glyph on the page images and ensure that the correct Unicode value is assigned to each. Aletheia outputs an XML file containing all identified glyphs on a page with their corresponding coordinates and Unicode values.
This is cheating: the result of scanning the same page we used to create the training.
Franken+: Created by eMOP graduate student Bryan Tarpley. Used by the same team of undergraduates led by eMOP graduate student Kathy Torabi.
Takes Aletheia's output files as input.
Groups all glyphs with the same Unicode values into one window for comparison.
Mistakenly coded glyphs are easily identified and re-coded.
A user can quickly compare all exemplars of a glyph and choose just the best subset, if desired.
Uses all selected glyphs to create a Franken-page image (TIFF) using a selected text as a base.
Outputs the same box files and TIFF images that Tesseract's first stage of native training produces.
Also allows users to complete Tesseract training using newly created box/TIFF file pairs, and add optional dictionary and other files.
Outputs a .traineddata file used by Tesseract when OCRing page images.
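Using the result is then a one-line call; a minimal sketch, with ‘emop’ standing in for whatever language code the training used (the ‘hocr’ config makes Tesseract emit the word-level hOCR that the later eMOP triage steps consume):

import subprocess

subprocess.check_call([
    'tesseract', 'page_0001.tif', 'page_0001',  # illustrative file names
    '-l', 'emop',                               # use emop.traineddata
    'hocr',                                     # emit hOCR instead of plain text
])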
This document had 3 distinct point sizes, but Franken+ revealed there were more like 5 or 6.
In this case it probably doesn’t matter, but it points out the possibilities
Dashboard provides access to the DB via an API.
Controller uses the Dashboard API to schedule jobs on Brazos queues and write back to the DB. It also writes output files to the NAS. A sketch of that loop follows.
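A heavily simplified sketch of the controller loop; the endpoint paths, parameters, and field names below are hypothetical stand-ins, since the Dashboard API’s real interface is not described here:

import requests

DASHBOARD = 'https://emop-dashboard.tamu.edu/api'   # hypothetical API base

def ocr_page(image_path):
    """Stub: the real controller schedules Tesseract on a Brazos queue
    and returns the path of the output written to the NAS."""
    return image_path + '.txt'                      # placeholder

def run_batch(limit=2000):
    # 1. Reserve a batch of pending pages from the eMOP DB via the API.
    pages = requests.get(DASHBOARD + '/job_queues',
                         params={'status': 'pending', 'limit': limit}).json()
    for page in pages:
        ocr_text_path = ocr_page(page['image_path'])
        # 2. Report the result back to the DB through the same API.
        requests.put(DASHBOARD + '/job_queues/' + str(page['id']),
                     json={'status': 'done', 'ocr_text_path': ocr_text_path})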
At any one time we currently have about 2,000 pages being processed simultaneously, subject to processor availability on the background queues: about 8 pages per second. At that rate, 45 million pages works out to roughly 65 days, hence the estimate of over two months of constant processing.