1. The Early Modern OCR Project
An open source workflow and tools for OCR
Matthew Christy,
Laura Mandell,
Elizabeth Grumbach
2. eMOP Info
eMOP Website: emop.tamu.edu/
Texas A&M Big Data Workshop: emop.tamu.edu/TAMU-BigData
eMOP Workflows: emop.tamu.edu/workflows
Mellon Grant Proposal: idhmc.tamu.edu/projects/Mellon/eMOPPublic.pdf
More eMOP
Facebook: The Early Modern OCR Project
Twitter: #emop, @IDHMC_Nexus, @mandellc, @matt_christy, @EMGrumbach
3. Early Modern OCR Project
The Early Modern OCR Project (eMOP) is an Andrew W.
Mellon Foundation funded grant project running out of the
Initiative for Digital Humanities, Media, and Culture (IDHMC)
at Texas A&M University, to develop and test tools and
techniques to apply Optical Character Recognition (OCR)
to early modern English documents from the hand press
period, roughly 1475-1800.
eMOP aims to improve the visibility of early modern texts by
making their contents fully searchable. The current
paradigm of searching special collections for early modern
materials by either metadata alone or “dirty” OCR is
insufficient for scholarly research.
Specifically, eMOP’s goal is to make machine readable, or improve the readability of, 305,000 documents/45 million pages of text from two major proprietary databases: Eighteenth Century Collections Online (ECCO) and Early English Books Online (EEBO). Generally, our aim is to use typeface and book history techniques to train modern OCR engines specifically on the typefaces in our collection of documents, and thereby improve the accuracy of the OCR results.
4. The Numbers
Page Images
Early English Books Online (ProQuest) EEBO: ~125,000 documents, ~13 million page images (1475-1700)
Eighteenth Century Collections Online (Gale Cengage) ECCO: ~182,000 documents, ~32 million page images (1700-1800)
Total: >300,000 documents & 45 million page images
Ground Truth
Text Creation Partnership TCP: ~46,000 double-keyed, hand-transcribed documents
~44,000 EEBO
~2,200 ECCO
6. Our Partners
emop.tamu.edu/participating-institutions
• PRImA (Pattern Recognition & Image Analysis Research) Lab at the University of Salford, Manchester, UK
• SEASR (Software Environment for the Advancement of Scholarly Research) at the University of Illinois, Urbana-Champaign
• PSI (Perception, Sensing, and Instrumentation) Lab at Texas A&M University
• The Academy for Advanced Telecommunications and Learning Technologies at Texas A&M University
• The Brazos High Performance Computing Cluster (HPCC)
7. The Problems
Early Modern Printing
Individual, hand-made typefaces
Worn and broken type
Poor quality equipment/paper
Inconsistent line bases
Unusual page layouts, decorative page elements
Special characters & ligatures
Spelling variations
Mixed typefaces and languages
Over/under-inking
Document/Image Quality
Torn and damaged pages
Noise introduced to images of pages
Skewed pages
Warped pages
Missing pages
Inverted pages
Incorrect metadata
Extremely low quality TIFFs (~50K)
10. Aletheia
www.primaresearch.org/tools.php
Created by PRImA Research Labs, University of Salford, UK.
Available for free but requires registration.
Windows-based tool.
Developed as a ground-truth creation tool.
Used by eMOP undergraduate student workers to create training for the desired typefaces for Tesseract.
Can identify glyphs on a page image with page coordinates and Unicode values.
11. Aletheia: Workflow
Binarization and de-noising are native Aletheia functions.
A team of undergraduate student workers refines and corrects glyph boxes and Unicode values, where needed.
Output: a set of PAGE XML files with page coordinates and Unicode values for every identified glyph on each processed TIFF image.
15. Franken+
1. Windows-based tool that uses a MySQL DB.
2. Developed for eMOP by IDHMC graduate student worker Bryan Tarpley.
3. Designed to be easily used by eMOP undergraduate student workers.
4. Takes Aletheia's output files as input.
5. Outputs the same box files and TIFF images that Tesseract's first stage of native training produces.
Available open-source at: github.com/idhmc-tamu/FrankenPlus
16. Franken+: Workflow
1. Groups all glyphs with the same Unicode value into one window for comparison.
2. Uses all selected glyphs to create a Franken-page image (TIFF) using a selected text as a base (see the sketch after this list).
3. Outputs the same box files and TIFF images that Tesseract's first stage of native training produces.
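Conceptually, step 2 pastes stored glyph exemplars onto a blank canvas in the order dictated by a chosen base text. The toy sketch below conveys only that idea: the exemplars dict (character -> list of cropped PIL glyph images) is a hypothetical stand-in for Franken+'s MySQL-backed glyph store, and the real tool also aligns baselines and point sizes, which this skips.

import random
from PIL import Image

def franken_page(base_text, exemplars, page_size=(2000, 3000),
                 margin=100, line_height=60):
    """Lay out glyph exemplars to mimic a printed page of base_text."""
    page = Image.new('L', page_size, color=255)       # blank white page
    x, y = margin, margin
    for ch in base_text:
        if ch == '\n' or x > page_size[0] - margin:   # wrap to a new line
            x, y = margin, y + line_height
            if ch == '\n':
                continue
        glyphs = exemplars.get(ch)
        if not glyphs:                                # no exemplar: just advance
            x += line_height // 2
            continue
        glyph = random.choice(glyphs)                 # vary exemplars, like real type
        page.paste(glyph, (x, y))
        x += glyph.width + 2
    return page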
18. Franken+
All exemplars of the same glyph are displayed together. Users can quickly identify and deselect:
Incorrectly labeled glyphs
Incomplete glyphs
Unrepresentative exemplars
Different-sized glyphs
20. Training Tesseract
Thiſ great conſumption to a fever turn'd,
And ſo the oꝗld had fitſ; it joy'd, it mourn'd;
And, aſ men thinke, that Agueſ phy ck are,
And th'Ague being ſpent, give over care.
Žo thou cke World, mꝗſtak'ſt thy ſelże to bee
Well, when ãlaſ, thou'rt in a Lethargie.
Her death did wound and tame thee than, and than
Thou might'ſt ha e better ſpar'd the Sunne, or man.
That wound waſ deep, but 'tiſ more miżery,
That thou haſt loſt thy ſenſe and memor .
'Twaſ heavy then to heare thy voyce of mone,
But thiſ iſ worſe, that thou art ſpeechle e growne.
Thou haſt forgot thy name thou hadſt; thou waſt
Nothing but ee, and her thou haſt o'rpaſt.
For aſ a child kept from the Fount, untill
Ä prince, expe ed long, come to fulfill
The ceremonieſ, thou unnam'd had'ſt laid,
Had not her comming, thee her palace made:
Her name defin'd thee, gave thee forme, and frame,
And thou forgett'ſt to celebrate th n me.
Some monethſ e hath beene dead (but beìng dead,
Meaſureſ of timeſ are all determined)
But long e'ath beene away, long, long, et none
Offerſ to tell uſ who it iſ that'ſ gone.
But aſ in ſtateſ doubtfull of future heireſ,
When ckne e without remedie empaireſ
The preſent Prince, they're loth it ould be ſaid,
The Prince doth langui , or the Prince iſ dead:
So mankinde feeling no a generall tha ,
23. Workflows: Controller
• Powered by the eMOP DB
• Collection processing is managed via the online Dashboard: emop-dashboard.tamu.edu
• Run by emop-controller.py
24. Brazos HPCC
• 128 processors as a stakeholder
• Access to background queues
• Estimated at over 2 months of constant processing
• We will have to reprocess some files that fail due to timeouts or that require pre-processing
28. Triage: De-noising
Uses hOCR results:
1. Determine the average word bounding box size.
2. Weed out boxes that are too big or too small.
3. But keep small boxes that have neighbors that are “words” (see the sketch below).
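A sketch of that pass, assuming hOCR input as produced by Tesseract ('ocrx_word' spans with bbox coordinates in the title attribute); the size thresholds and the neighbor test are illustrative, not eMOP's exact values:

import re
import statistics

def word_boxes(hocr):
    """Extract (x0, y0, x1, y1) word boxes from hOCR 'ocrx_word' spans."""
    boxes = []
    for m in re.finditer(r"ocrx_word[^>]*title=['\"]([^'\"]*)", hocr):
        b = re.search(r'bbox (\d+) (\d+) (\d+) (\d+)', m.group(1))
        if b:
            boxes.append(tuple(int(v) for v in b.groups()))
    return boxes

def near_words(box, boxes, mean):
    """True if a roughly word-sized box sits close by on the same line."""
    x0, y0, x1, y1 = box
    h = y1 - y0
    for bx0, by0, bx1, by1 in boxes:
        if (bx0, by0, bx1, by1) == box:
            continue
        same_line = abs(by0 - y0) < h                 # crude same-baseline test
        close = min(abs(bx0 - x1), abs(x0 - bx1)) < 3 * h
        word_sized = (bx1 - bx0) * (by1 - by0) >= 0.5 * mean
        if same_line and close and word_sized:
            return True
    return False

def denoise(boxes, low=0.2, high=5.0):
    """Keep boxes near the average word size; rescue small ones with word neighbors."""
    mean = statistics.mean((x1 - x0) * (y1 - y0) for x0, y0, x1, y1 in boxes)
    keep = []
    for box in boxes:
        x0, y0, x1, y1 = box
        area = (x1 - x0) * (y1 - y0)
        if low * mean <= area <= high * mean:
            keep.append(box)                          # plausibly a word
        elif area < low * mean and near_words(box, boxes, mean):
            keep.append(box)                          # small, but in a line of words
    return keep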
31. Triage: Estimated Correctability
Page Evaluation
Determine how correctable a page’s OCR results are by examining the text. The score is based on the ratio of words that fit the correctable profile to the total number of words.
Correctable Profile
1. Clean tokens:
remove leading and trailing punctuation
the remaining token must have at least 3 letters
2. Spell-check tokens >1 character.
3. Check the token profile:
contains at most 2 non-alpha characters, and
at least 1 alpha character,
has a length of at least 3, and
does not contain 4 or more repeated characters in a run.
4. Also consider the length of tokens compared to the average for the page.
A sketch of this scoring pass follows.
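A minimal sketch, assuming the period-specific dictionary is a plain set of lowercase words; the slide leaves open exactly how the spell check and the profile check combine, so here a token passes if either succeeds, and the average-length comparison (step 4) is omitted:

import re
import string

def clean(token):
    """Step 1: strip leading/trailing punctuation."""
    return token.strip(string.punctuation)

def fits_profile(token, dictionary):
    t = clean(token)
    if len(t) > 1 and t.lower() in dictionary:       # step 2: spell check
        return True
    non_alpha = sum(1 for c in t if not c.isalpha())
    return (len(t) >= 3                              # step 3: token profile
            and non_alpha <= 2
            and len(t) - non_alpha >= 1
            and not re.search(r'(.)\1{3,}', t))      # no 4+ repeats in a run

def correctability(tokens, dictionary):
    """Ratio of profile-fitting words to all words on the page."""
    words = [t for t in tokens if clean(t)]
    if not words:
        return 0.0
    return sum(fits_profile(t, dictionary) for t in words) / len(words)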
33. Treatment: Page Correction
1. Preliminary cleanup:
remove punctuation from the beginning/end of tokens
remove empty lines and empty tokens
combine hyphenated tokens that appear at the end of a line
retain cleaned & original tokens as “suggestions”
2. Apply common transformations and period-specific dictionary lookups to gather suggestions for words (sketched below).
transformation rules: rn->m; c->e; 1->l; e…
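A sketch of the suggestion-gathering step under those rules; the dictionary is again a hypothetical set of period words, the rule list here covers only the three rules the slide names, and eMOP's actual dictionary lookup was richer than this plain membership test:

RULES = [('rn', 'm'), ('c', 'e'), ('1', 'l')]   # rules named on the slide

def suggestions(token, dictionary):
    """Gather candidates: the token itself plus every single application
    of a confusion rule that yields a dictionary word."""
    cands = set()
    for wrong, right in RULES:
        i = token.find(wrong)
        while i != -1:                          # try the rule at each position
            cands.add(token[:i] + right + token[i + len(wrong):])
            i = token.find(wrong, i + 1)
    return {token} | {c for c in cands if c.lower() in dictionary}

# e.g. suggestions('1ord', {'lord'}) -> {'1ord', 'lord'}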
34. Treatment: Page Correction
3. Use context checking on a sliding window of 3 words, and their suggested changes, to find the best context matches in our (sanitized, period-specific) Google 3-gram dataset (see the sketch below).
if no context is found and only one additional suggestion was made by transformation or dictionary, then replace with this suggestion
if no context is found and the “clean” token from above is in the dictionary, replace with this token
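A sketch of picking the best-supported window, assuming the 3-gram data has been loaded into a dict keyed by lowercased trigram with (matchCount, volCount) values:

from itertools import product

def best_context(candidate_sets, ngram_counts):
    """candidate_sets: three sets of suggestions, one per window position.
    ngram_counts: stand-in for the sanitized, period-specific Google
    3-gram data, mapping 'w1 w2 w3' -> (matchCount, volCount)."""
    best, best_count = None, 0
    for triple in product(*candidate_sets):
        counts = ngram_counts.get(' '.join(triple).lower())
        if counts and counts[0] > best_count:   # rank by matchCount
            best, best_count = triple, counts[0]
    return best

On the window (thoughc, Ihe, Was) shown two slides below, this ranking favors “though she was” (matchCount 556,763) with “thought she was” close behind; evidence combined across the overlapping windows settles the full line as “that I thought she was”.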
35. Treatment: Page Correction
window: tbat l thoughc
Candidates used for context matching:
tbat -> Set(thai, thar, bat, twat, tibet, ébat, ibat, tobit, that, tat, tba, ilial, abat, tbat, teat)
l -> Set(l)
thoughc -> Set(thoughc, thought, though)
ContextMatch: that l thought (matchCount: 1844, volCount: 1474)
window: l thoughc Ihe
Candidates used for context matching:
l -> Set(l)
thoughc -> Set(thoughc, thought, though)
Ihe -> Set(che, sho, enc, ile, iee, plie, ihe, ire, ike, she, ife, ide, ibo, i.e, ene, ice, inc, tho, ime, ite, ive, the)
ContextMatch: l though the (matchCount: 497, volCount: 486)
ContextMatch: l thought she (matchCount: 1538, volCount: 997)
ContextMatch: l thought the (matchCount: 2496, volCount: 1905)
tbat I thoughc Ihe Was
36. Treatment: Page Correction
window: thoughc Ihe Was
Candidates used for context matching:
thoughc -> Set(thoughc, thought, though)
Ihe -> Set(che, sho, enc, ile, iee, plie, ihe, ire, ike, she, ife, ide, ibo, i.e, ene, ice, inc, tho, ime, ite, ive, the)
Was -> Set(Was)
ContextMatch: though ice was (matchCount: 121, volCount: 120)
ContextMatch: though ike was (matchCount: 65, volCount: 59)
ContextMatch: though she was (matchCount: 556,763, volCount: 364,965)
ContextMatch: though the was (matchCount: 197, volCount: 196)
ContextMatch: thought ice was (matchCount: 45, volCount: 45)
ContextMatch: thought ike was (matchCount: 112, volCount: 108)
ContextMatch: thought she was (matchCount: 549,531, volCount: 325,822)
ContextMatch: thought the was (matchCount: 91, volCount: 91)
that I thought she was
39. Tags DB
All page images that don’t produce OCR text that meets our standards will be tagged in the eMOP DB with tags that indicate what is wrong with the page image: skew, warp, noise, wrong font, etc.
Tags can then be used to later apply the appropriate pre-processing techniques and then re-OCR the page to produce better results.
This will be the first time that any sort of comprehensive analysis has been done on the page images of these collections.
We’ll end up (ideally) with a list of page images that simply need to be re-scanned.
40. TypeWright
www.18thconnect.org/typewright/
Crowd-sourced correction tool for OCR.
Available through 18thconnect.org, an aggregator of 18th-century metadata & content.
Currently available for the ECCO collection, based on the original ECCO OCR results.
Will be ingested with eMOP OCR for ECCO and EEBO by year-end.
This will be the first time that EEBO documents are available as searchable text, and that they can be corrected by scholars for their own use.
Users can correct a whole document and then receive the plain text, or a lightly encoded TEI XML version, for use in a digital edition or other project.
The Early Modern OCR Project (eMOP) is an Andrew W. Mellon Foundation funded grant project running out of the Initiative for Digital Humanities, Media, and Culture (IDHMC) at Texas A&M University, to develop and test tools and techniques to improve Optical Character Recognition (OCR) outcomes for printed English documents from the hand press period, roughly 1475-1800. The basic premise of eMOP is to use typeface and book history techniques to train modern OCR engines specifically on the typefaces in our collection of documents, and thereby improve the accuracy of the OCR results. eMOP’s immediate goal is to make machine readable, or improve the readability of, 45 million pages of text from two major proprietary databases: Eighteenth Century Collections Online (ECCO) and Early English Books Online (EEBO). Generally, eMOP aims to improve the visibility of early modern texts by making their contents fully searchable. The current paradigm of searching special collections for early modern materials by either metadata alone or “dirty” OCR is inefficient for scholarly research.
These are not big numbers by STEM standards, but for the Humanities they are:
This is the largest dataset for any open-source, academic OCR project in North America, ever.
For the Humanities, research data is about quality of content rather than size.
Our OCR must be as good as possible in order to generate reliable research.
Some page images were great, but most were not:
Noisy
Skewed
Warped
Or they posed challenges for OCR engines:
Multiple pages per image
Multiple columns
Images & decorative elements
Marginalia
Missing margins
And many were terrible.
Tesseract:
Google
Open source
Accurate
Fast
Tesseract must first be trained to recognize the typeface used in the document(s) it is OCRing.
Tesseract has a native mechanism to create that training, but it relies on having training pages with characteristics that can typically only be created with modern word processors, which don't have early modern fonts. To create early modern font training for Tesseract we had to adapt and create our own tools.
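For reference, Tesseract 3.x’s native training sequence, driven from Python, once matched TIFF/box pairs exist. This is a rough sketch per the Tesseract 3.x training docs (some versions add steps such as shapeclustering); it assumes the training tools are on PATH and a font_properties file is present, and the ‘emop.garamond’ naming is purely illustrative:

import os
import subprocess

def run(*cmd):
    print('$', ' '.join(cmd))
    subprocess.check_call(cmd)

base = 'emop.garamond.exp0'   # hypothetical basename for one TIFF/box pair

run('tesseract', base + '.tif', base, 'box.train')  # -> emop.garamond.exp0.tr
run('unicharset_extractor', base + '.box')          # -> unicharset
run('mftraining', '-F', 'font_properties', '-U', 'unicharset',
    '-O', 'emop.unicharset', base + '.tr')          # -> inttemp, pffmtable, shapetable
run('cntraining', base + '.tr')                     # -> normproto
for f in ('inttemp', 'pffmtable', 'shapetable', 'normproto'):
    os.rename(f, 'emop.' + f)                       # prefix with the language code
run('combine_tessdata', 'emop.')                    # -> emop.traineddata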
Aletheia: Created by PRImA Research Labs at the University of Salford as a ground-truth creation tool. A team of undergraduates uses Aletheia to identify each glyph on the page images and ensure that the correct Unicode value is assigned to each. Aletheia outputs an XML file containing all identified glyphs on a page with their corresponding coordinates and Unicode values.
This is cheating: the result of scanning the same page we used to create the training.
Franken+: Created by eMOP graduate student Bryan Tarpley. Used by the same team of undergraduates led by eMOP graduate student Kathy Torabi.
Takes Aletheia's output files as input.
Groups all glyphs with the same Unicode values into one window for comparison.
Mistakenly coded glyphs are easily identified and re-coded.
A user can quickly compare all exemplars of a glyph and choose just the best subset, if desired.
Uses all selected glyphs to create a Franken-page image (TIFF) using a selected text as a base.
Outputs the same box files and TIFF images that Tesseract's first stage of native training produces.
Also allows users to complete Tesseract training using newly created box/TIFF file pairs, and add optional dictionary and other files.
Outputs a .traineddata file used by Tesseract when OCRing page images.
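Using the result is then a one-line call; a minimal sketch, with ‘emop’ standing in for whatever language code the training used (the ‘hocr’ config makes Tesseract emit the word-level hOCR that the later eMOP triage steps consume):

import subprocess

subprocess.check_call([
    'tesseract', 'page_0001.tif', 'page_0001',  # illustrative file names
    '-l', 'emop',                               # use emop.traineddata
    'hocr',                                     # emit hOCR instead of plain text
])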
This document had 3 distinct point sizes, but Franken+ revealed there were more like 5 or 6.
In this case it probably doesn’t matter, but it points out the possibilities
Dashboard provides access to the DB via an API.
Controller uses the Dashboard API to schedule jobs on Brazos queues and write back to the DB. It also writes output files to the NAS. A sketch of that loop follows.
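A heavily simplified sketch of the controller loop; the endpoint paths, parameters, and field names below are hypothetical stand-ins, since the Dashboard API’s real interface is not described here:

import requests

DASHBOARD = 'https://emop-dashboard.tamu.edu/api'   # hypothetical API base

def ocr_page(image_path):
    """Stub: the real controller schedules Tesseract on a Brazos queue
    and returns the path of the output written to the NAS."""
    return image_path + '.txt'                      # placeholder

def run_batch(limit=2000):
    # 1. Reserve a batch of pending pages from the eMOP DB via the API.
    pages = requests.get(DASHBOARD + '/job_queues',
                         params={'status': 'pending', 'limit': limit}).json()
    for page in pages:
        ocr_text_path = ocr_page(page['image_path'])
        # 2. Report the result back to the DB through the same API.
        requests.put(DASHBOARD + '/job_queues/' + str(page['id']),
                     json={'status': 'done', 'ocr_text_path': ocr_text_path})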
At any one time we currently have about 2,000 pages being processed simultaneously, subject to processor availability on the background queues: about 8 pages per second. At that rate, 45 million pages works out to roughly 65 days, hence the estimate of over two months of constant processing.