SlideShare a Scribd company logo
1 of 41
The Early Modern OCR Project
An open source workflow and tools for OCR
Matthew Christy,
Laura Mandell,
Elizabeth Grumbach
1
 emop.tamu.edu/
 Texas A&M Big Data Workshop
 emop.tamu.edu/TAMU-
BigData
 eMOP Workflows
 emop.tamu.edu/workflows
 Mellon Grant Proposal
 idhmc.tamu.edu/projects/
Mellon/eMOPPublic.pdf
eMOP Info
eMOP Website More eMOP
 Facebook
 The Early Modern OCR
Project
 Twitter
 #emop
 @IDHMC_Nexus
 @mandellc
 @matt_christy
 @EMGrumbach
2
Early Modern OCR Project
 The Early Modern OCR Project (eMOP) is an Andrew W.
Mellon Foundation funded grant project running out of the
Initiative for Digital Humanities, Media, and Culture (IDHMC)
at Texas A&M University, to develop and test tools and
techniques to apply Optical Character Recognition (OCR)
to early modern English documents from the hand press
period, roughly 1475-1800.
 eMOP aims to improve the visibility of early modern texts by
making their contents fully searchable. The current
paradigm of searching special collections for early modern
materials by either metadata alone or “dirty” OCR is
insufficient for scholarly research.
3
Specifically, eMOP’s goal is to make
machine readable, or improve the
readability, for 305,000 document/45
million pages of text from two major
proprietary databases: Eighteenth
Century Collections Online (ECCO)
and Early English Books Online (EEBO).
Generally, our aim is to use typeface
and book history techniques to train
modern OCR engines specifically on
the typefaces in our collection of
documents, and thereby improve the
accuracy of the OCR results.
The Numbers
Page Images
 Early English Books online
(Proquest) EEBO: ~125,000
documents, ~13 million
pages images (1475-1700)
 Eighteenth Century
Collections Online (Gale
Cengage) ECCO: ~182,000
documents, ~32 million
page images (1700-1800)
 Total: >300,000 documents
& 45 million page images.
Ground Truth
 Text Creation Partnership TCP:
~46,000 double-keyed hand
transcribed documents
 44,000 EEBO
 2,200 ECCO
4
5
6
• PRImA (Pattern Recognition & Image Analysis Research) Lab at
the University of Salford, Manchester, UK
• SEASR (Software Environment for the Advancement of Scholarly
Research) at the University of Illinois, Urbana-Champaign
• PSI (Perception, Sensing, and Instrumentation) Lab at Texas
A&M University
• The Academy for Advanced Telecommunications and Learning
Technologies at Texas A&M University
• The Brazos High Performance Computing Cluster (HPCC)
OurPartnersemop.tamu.edu/participating-institutions
The Problems
Early Modern Printing
 Individual, hand-made
typefaces
 Worn and broken type
 Poor quality
equipment/paper
 Inconsistent line bases
 Unusual page layouts,
decorative page elements,
 Special characters &
ligatures
 Spelling variations
 Mixed typefaces and
languages
 over/under-inking
7
Document/Image Quality
 Torn and damaged
pages
 Noise introduced to
images of pages
 Skewed pages
 Warped pages
 Missing pages
 Inverted pages
 Incorrect metadata
 Extremely low quality
TIFFs (~50K)
Page Images 8
TrainingTesseract9
Aletheia
10
www.primaresearch.org/too
ls.php
Available for free but requires
registration.
 Created by PRImA Research
Labs, University of Salford, UK.
 Windows based tool.
 Developed as a groundtruth
creation tool
 Used by eMOP undergraduate
student workers to create training
of desired typeface for Tesseract.
 Can identify glyphs on a page
image with page coordinates and
Unicode values.
Aletheia:Workflow11  Binarization and Denoise are native Aletheia functions
 A team of Undergraduate student workers refines and
corrects glyph boxes and unicode values, where needed.
 Output: A set of PAGE XML files with page coordinates and
unicode values for every identified glyph on each processed
TIFF image.
Aletheia: Glyph Recognition 12
Uses Tesseract to find glyphs
Aletheia: I/O 13
We then convert PAGE XML
file to Tesseract Box file using
XSLT
Tesseract Training 14
Franken+
1. Windows based tool that uses a
MySQL DB.
2. Developed for eMOP by IDHMC
Graduate student worker Bryan
Tarpley.
3. Designed to be easily used by
eMOP Undergraduate student
workers
4. Takes Aletheia's output files as
input.
5. Outputs the same box files and TIFF
images that Tesseract's first stage
of native training.
 Available open-source at:
github.com/idhmc-
tamu/FrankenPlus
15
Franken+Workflow16
1. Groups all glyphs with
the same Unicode
values into one window
for comparison.
2. Uses all selected glyphs
to create a Franken-
page image (TIFF) using
a selected text as a
base.
3. Outputs the same box
files and TIFF images
that Tesseract's first
stage of native training.
Franken+ Ingestion 17
Franken+ 18
 All exemplars of the
same glyph are
displayed together.
 Users can quickly
identify and
deselect:
 Incorrectly labeled
glyphs
 Incomplete glyphs
 Unrepresentative
exemplars
 Different sized glyphs
19
Franken+
TrainingTesseract
20
Thiſ great conſumption to a fever turn'd,
And ſo the oꝗld had fitſ; it joy'd, it mourn'd;
And, aſ men thinke, that Agueſ phy ck are,
And th'Ague being ſpent, give over care.
Žo thou cke World, mꝗſtak'ſt thy ſelże to bee
Well, when ãlaſ, thou'rt in a Lethargie.
Her death did wound and tame thee than, and than
Thou might'ſt ha e better ſpar'd the Sunne, or man.
That wound waſ deep, but 'tiſ more miżery,
That thou haſt loſt thy ſenſe and memor .
'Twaſ heavy then to heare thy voyce of mone,
But thiſ iſ worſe, that thou art ſpeechle e growne.
Thou haſt forgot thy name thou hadſt; thou waſt
Nothing but ee, and her thou haſt o'rpaſt.
For aſ a child kept from the Fount, untill
Ä prince, expe ed long, come to fulfill
The ceremonieſ, thou unnam'd had'ſt laid,
Had not her comming, thee her palace made:
Her name defin'd thee, gave thee forme, and frame,
And thou forgett'ſt to celebrate th n me.
Some monethſ e hath beene dead (but beìng dead,
Meaſureſ of timeſ are all determined)
But long e'ath beene away, long, long, et none
Offerſ to tell uſ who it iſ that'ſ gone.
But aſ in ſtateſ doubtfull of future heireſ,
When ckne e without remedie empaireſ
The preſent Prince, they're loth it ould be ſaid,
The Prince doth langui , or the Prince iſ dead:
So mankinde feeling no a generall tha ,
Franken+ Results 21
AFTER
BEFORE
eMOP
TesseractTraining22
23
Workflows-Controller
• Powered by the eMOP
DB
• Collection processing is
managed via the online
Dashboard
emop-dashboard.tamu.edu
• Run by emop-controller.py
Brazos HPCC
• 128 processors as a stakeholder
• Access to background queues
• Estimated at over 2 months of
constant processing
• We will have to reprocess some
files that fail due to timeouts or
that require pre-processing
24
Post-ProcessingTriage
25
Post-Processing
Triage
26
Post-Processing 27
Triage
Treatment Diagnosis
Triage:De-noising
28
 Uses hOCR results
1. Determine average
word bounding box
size
2. Weed out boxes are
that too big or too
small
3. But keep small boxes
that have neighbors
that are “words”
Triage: De-noising 29
Before: 35% After: 58%
Triage: De-noising 30
Before After
Triage: Estimated Correctability 31
Page Evaluation
 Determine how correctable a
page’s OCR results are by
examining the text.
 The score is based on the ratio
of words that fit the
correctable profile to the total
number of words
Correctable Profile
1. Clean tokens:
 remove leading and trailing
punctuation
 remaining token must have at
least 3 letters
2. Spell check tokens >1
character
3. Check token profile :
 contain at most 2 non-alpha
characters, and
 at least 1 alpha character,
 have a length of at least 3,
 and do not contain 4 or more
repeated characters in a run
4. Also consider length of tokens
compared to average for the
page
32Triage: Estimated Correctability
Treatment: Page Correction 33
1. Preliminary cleanup
 remove punctuation from begin/end of
tokens
 remove empty lines and empty tokens
 combine hyphenated tokens that appear
at the end of a line
 retain cleaned & original tokens as
“suggestions”
2. Apply common transformations and
period specific dictionary lookups to
gather suggestions for words.
 transformation rules: rn->m; c->e; 1->l; e
34Treatment: Page Correction
3. Use context checking on a sliding window of 3 words,
and their suggested changes, to find the best context
matches in our(sanitized, period-specific) Google 3-
gram dataset
 if no context is found and only one additional suggestion
was made from transformation or dictionary, then
replace with this suggestion
 if no context and “clean” token from above is in the
dictionary, replace with this token
35Treatment: Page Correction
window: tbat l thoughc
Candidates used for context matching:
 tbat -> Set(thai, thar, bat, twat, tibet, ébat, ibat, tobit, that, tat, tba, ilial,
abat, tbat, teat)
 l -> Set(l)
 thoughc -> Set(thoughc, thought, though)
ContextMatch: that l thought (matchCount: 1844 , volCount: 1474)
window: l thoughc Ihe
Candidates used for context matching:
 l -> Set(l)
 thoughc -> Set(thoughc, thought, though)
 Ihe -> Set(che, sho, enc, ile, iee, plie, ihe, ire, ike, she, ife, ide, ibo, i.e,
ene, ice, inc, tho, ime, ite, ive, the)
ContextMatch: l though the (matchCount: 497 , volCount: 486)
ContextMatch: l thought she (matchCount: 1538 , volCount: 997)
ContextMatch: l thought the (matchCount: 2496 , volCount: 1905)
tbat I thoughc Ihe Was
36Treatment: Page Correction
window: thoughc Ihe Was
Candidates used for context matching:
 thoughc -> Set(thoughc, thought, though)
 Ihe -> Set(che, sho, enc, ile, iee, plie, ihe, ire, ike, she, ife, ide, ibo, i.e,
ene, ice, inc, tho, ime, ite, ive, the)
 Was -> Set(Was)
ContextMatch: though ice was (matchCount: 121 , volCount: 120)
ContextMatch: though ike was (matchCount: 65 , volCount: 59)
ContextMatch: though she was (matchCount: 556,763 , volCount:
364,965)
ContextMatch: though the was (matchCount: 197 , volCount: 196)
ContextMatch: thought ice was (matchCount: 45 , volCount: 45)
ContextMatch: thought ike was (matchCount: 112 , volCount: 108)
ContextMatch: thought she was (matchCount: 549,531 , volCount:
325,822)
ContextMatch: thought the was (matchCount: 91 , volCount: 91)
that I thought she was
Treatment: Results 37
Treatment: Results 38
Tags DB
 All page images that don’t produce OCR text that
meets our standards will be tagged in the eMOP DB
with tags that indicate what is wrong with the page
image.
 skew, warp, noise, wrong font, etc.
 Tags can then be used to later apply the
appropriate pre-processing techniques and then
re-OCR the page to produce better results.
 This will be the first time that any sort of
comprehensive analysis has been done on the
page images of these collections.
 We’ll end up (ideally) with a list of page images that
simply need to be re-scanned.
39
 www.18thconnect.org/typew
right/
 Available through
18thconnect.org
An aggregator of 18th
Century metadata & content
This will be the first time that
EEBO documents are available
as searchable text, and that
they can be corrected by
scholars for their own use.
TypeWright
 Crowd-sourced correction
tool for OCR.
 Currently available for ECCO
collection, based on original
ECCO OCR results.
 Will be ingested with eMOP
OCR for ECCO and EEBO by
year-end.
 Users can correct a whole
document and then receive
the text or a lightly encoded
TEI XML version for use in a
digital edition or whatever.
40
TypeWright
Edit button identifies
“TypeWright” enabled
Texts.
41

More Related Content

What's hot

Navigating the Storm: eMOP, Big DH Projects, and Agile Steering Standards
Navigating the Storm: eMOP, Big DH Projects, and Agile Steering StandardsNavigating the Storm: eMOP, Big DH Projects, and Agile Steering Standards
Navigating the Storm: eMOP, Big DH Projects, and Agile Steering StandardsLiz Grumbach
 
Dh2014 e mopcobre-complete
Dh2014 e mopcobre-completeDh2014 e mopcobre-complete
Dh2014 e mopcobre-completeLaura Mandell
 
Learning to assess Linked Data relationships using Genetic Programming
Learning to assess Linked Data relationships using Genetic ProgrammingLearning to assess Linked Data relationships using Genetic Programming
Learning to assess Linked Data relationships using Genetic ProgrammingVrije Universiteit Amsterdam
 
How well does your Instance Matching system perform? Experimental evaluation ...
How well does your Instance Matching system perform? Experimental evaluation ...How well does your Instance Matching system perform? Experimental evaluation ...
How well does your Instance Matching system perform? Experimental evaluation ...Holistic Benchmarking of Big Linked Data
 
Does Data Quality lays in facts, or in acts?
Does Data Quality lays in facts, or in acts?Does Data Quality lays in facts, or in acts?
Does Data Quality lays in facts, or in acts?jeansoulin
 
Quick tour all handout
Quick tour all handoutQuick tour all handout
Quick tour all handoutYi-Shin Chen
 
Digitization Projects Tech Con 2006
Digitization Projects Tech Con 2006Digitization Projects Tech Con 2006
Digitization Projects Tech Con 2006Regina Koury
 
The Semantic Web - Interacting with the Unknown
The Semantic Web - Interacting with the UnknownThe Semantic Web - Interacting with the Unknown
The Semantic Web - Interacting with the UnknownSteffen Staab
 
Quick Tour of Text Mining
Quick Tour of Text MiningQuick Tour of Text Mining
Quick Tour of Text MiningYi-Shin Chen
 

What's hot (11)

Navigating the Storm: eMOP, Big DH Projects, and Agile Steering Standards
Navigating the Storm: eMOP, Big DH Projects, and Agile Steering StandardsNavigating the Storm: eMOP, Big DH Projects, and Agile Steering Standards
Navigating the Storm: eMOP, Big DH Projects, and Agile Steering Standards
 
Dh2014 e mopcobre-complete
Dh2014 e mopcobre-completeDh2014 e mopcobre-complete
Dh2014 e mopcobre-complete
 
Learning to assess Linked Data relationships using Genetic Programming
Learning to assess Linked Data relationships using Genetic ProgrammingLearning to assess Linked Data relationships using Genetic Programming
Learning to assess Linked Data relationships using Genetic Programming
 
How well does your Instance Matching system perform? Experimental evaluation ...
How well does your Instance Matching system perform? Experimental evaluation ...How well does your Instance Matching system perform? Experimental evaluation ...
How well does your Instance Matching system perform? Experimental evaluation ...
 
Aspects of NLP Practice
Aspects of NLP PracticeAspects of NLP Practice
Aspects of NLP Practice
 
Does Data Quality lays in facts, or in acts?
Does Data Quality lays in facts, or in acts?Does Data Quality lays in facts, or in acts?
Does Data Quality lays in facts, or in acts?
 
Quick tour all handout
Quick tour all handoutQuick tour all handout
Quick tour all handout
 
Digitization Projects Tech Con 2006
Digitization Projects Tech Con 2006Digitization Projects Tech Con 2006
Digitization Projects Tech Con 2006
 
The Semantic Web - Interacting with the Unknown
The Semantic Web - Interacting with the UnknownThe Semantic Web - Interacting with the Unknown
The Semantic Web - Interacting with the Unknown
 
NLP Project Full Cycle
NLP Project Full CycleNLP Project Full Cycle
NLP Project Full Cycle
 
Quick Tour of Text Mining
Quick Tour of Text MiningQuick Tour of Text Mining
Quick Tour of Text Mining
 

Similar to eMOP-PennSt-lunch

2013 siam-cse-big-data
2013 siam-cse-big-data2013 siam-cse-big-data
2013 siam-cse-big-datac.titus.brown
 
eScience: A Transformed Scientific Method
eScience: A Transformed Scientific MethodeScience: A Transformed Scientific Method
eScience: A Transformed Scientific MethodDuncan Hull
 
A Distributed Architecture System for Recognizing Textual Entailment
A Distributed Architecture System for Recognizing Textual EntailmentA Distributed Architecture System for Recognizing Textual Entailment
A Distributed Architecture System for Recognizing Textual EntailmentFaculty of Computer Science
 
2014 khmer protocols
2014 khmer protocols2014 khmer protocols
2014 khmer protocolsc.titus.brown
 
Machine Learning ICS 273A
Machine Learning ICS 273AMachine Learning ICS 273A
Machine Learning ICS 273Abutest
 
Machine Learning ICS 273A
Machine Learning ICS 273AMachine Learning ICS 273A
Machine Learning ICS 273Abutest
 
Knowledge base system appl. p 1,2-ver1
Knowledge base system appl.  p 1,2-ver1Knowledge base system appl.  p 1,2-ver1
Knowledge base system appl. p 1,2-ver1Taymoor Nazmy
 
HPCAC - the state of bioinformatics in 2017
HPCAC - the state of bioinformatics in 2017HPCAC - the state of bioinformatics in 2017
HPCAC - the state of bioinformatics in 2017philippbayer
 
2013 py con awesome big data algorithms
2013 py con awesome big data algorithms2013 py con awesome big data algorithms
2013 py con awesome big data algorithmsc.titus.brown
 
ADAM—Spark Summit, 2014
ADAM—Spark Summit, 2014ADAM—Spark Summit, 2014
ADAM—Spark Summit, 2014fnothaft
 
Develop and design hybrid genetic algorithms with multiple objectives in data...
Develop and design hybrid genetic algorithms with multiple objectives in data...Develop and design hybrid genetic algorithms with multiple objectives in data...
Develop and design hybrid genetic algorithms with multiple objectives in data...khalil IBRAHIM
 
Big data 4 webmonday
Big data 4 webmondayBig data 4 webmonday
Big data 4 webmondayDaniel Koller
 

Similar to eMOP-PennSt-lunch (20)

2013 siam-cse-big-data
2013 siam-cse-big-data2013 siam-cse-big-data
2013 siam-cse-big-data
 
2015 illinois-talk
2015 illinois-talk2015 illinois-talk
2015 illinois-talk
 
2016 bergen-sars
2016 bergen-sars2016 bergen-sars
2016 bergen-sars
 
eScience: A Transformed Scientific Method
eScience: A Transformed Scientific MethodeScience: A Transformed Scientific Method
eScience: A Transformed Scientific Method
 
2014 abic-talk
2014 abic-talk2014 abic-talk
2014 abic-talk
 
A Distributed Architecture System for Recognizing Textual Entailment
A Distributed Architecture System for Recognizing Textual EntailmentA Distributed Architecture System for Recognizing Textual Entailment
A Distributed Architecture System for Recognizing Textual Entailment
 
2016 davis-plantbio
2016 davis-plantbio2016 davis-plantbio
2016 davis-plantbio
 
2014 khmer protocols
2014 khmer protocols2014 khmer protocols
2014 khmer protocols
 
Machine Learning ICS 273A
Machine Learning ICS 273AMachine Learning ICS 273A
Machine Learning ICS 273A
 
Machine Learning ICS 273A
Machine Learning ICS 273AMachine Learning ICS 273A
Machine Learning ICS 273A
 
Knowledge base system appl. p 1,2-ver1
Knowledge base system appl.  p 1,2-ver1Knowledge base system appl.  p 1,2-ver1
Knowledge base system appl. p 1,2-ver1
 
HPCAC - the state of bioinformatics in 2017
HPCAC - the state of bioinformatics in 2017HPCAC - the state of bioinformatics in 2017
HPCAC - the state of bioinformatics in 2017
 
2013 py con awesome big data algorithms
2013 py con awesome big data algorithms2013 py con awesome big data algorithms
2013 py con awesome big data algorithms
 
ADAM—Spark Summit, 2014
ADAM—Spark Summit, 2014ADAM—Spark Summit, 2014
ADAM—Spark Summit, 2014
 
TRank ISWC2013
TRank ISWC2013TRank ISWC2013
TRank ISWC2013
 
Develop and design hybrid genetic algorithms with multiple objectives in data...
Develop and design hybrid genetic algorithms with multiple objectives in data...Develop and design hybrid genetic algorithms with multiple objectives in data...
Develop and design hybrid genetic algorithms with multiple objectives in data...
 
Big data 4 webmonday
Big data 4 webmondayBig data 4 webmonday
Big data 4 webmonday
 
1645 track 2 pafka
1645 track 2 pafka1645 track 2 pafka
1645 track 2 pafka
 
Dropbox CEC 2011 New Orleans
Dropbox CEC 2011 New OrleansDropbox CEC 2011 New Orleans
Dropbox CEC 2011 New Orleans
 
2014 aus-agta
2014 aus-agta2014 aus-agta
2014 aus-agta
 

Recently uploaded

SOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning PresentationSOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning Presentationcamerronhm
 
Wellbeing inclusion and digital dystopias.pptx
Wellbeing inclusion and digital dystopias.pptxWellbeing inclusion and digital dystopias.pptx
Wellbeing inclusion and digital dystopias.pptxJisc
 
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...pradhanghanshyam7136
 
Unit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptxUnit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptxVishalSingh1417
 
Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)Jisc
 
HMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptx
HMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptxHMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptx
HMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptxmarlenawright1
 
FSB Advising Checklist - Orientation 2024
FSB Advising Checklist - Orientation 2024FSB Advising Checklist - Orientation 2024
FSB Advising Checklist - Orientation 2024Elizabeth Walsh
 
REMIFENTANIL: An Ultra short acting opioid.pptx
REMIFENTANIL: An Ultra short acting opioid.pptxREMIFENTANIL: An Ultra short acting opioid.pptx
REMIFENTANIL: An Ultra short acting opioid.pptxDr. Ravikiran H M Gowda
 
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...Nguyen Thanh Tu Collection
 
Making communications land - Are they received and understood as intended? we...
Making communications land - Are they received and understood as intended? we...Making communications land - Are they received and understood as intended? we...
Making communications land - Are they received and understood as intended? we...Association for Project Management
 
Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...
Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...
Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...Pooja Bhuva
 
Sensory_Experience_and_Emotional_Resonance_in_Gabriel_Okaras_The_Piano_and_Th...
Sensory_Experience_and_Emotional_Resonance_in_Gabriel_Okaras_The_Piano_and_Th...Sensory_Experience_and_Emotional_Resonance_in_Gabriel_Okaras_The_Piano_and_Th...
Sensory_Experience_and_Emotional_Resonance_in_Gabriel_Okaras_The_Piano_and_Th...Pooja Bhuva
 
ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.MaryamAhmad92
 
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptxMaritesTamaniVerdade
 
Micro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdfMicro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdfPoh-Sun Goh
 
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptxHMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptxEsquimalt MFRC
 
Python Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docxPython Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docxRamakrishna Reddy Bijjam
 
Salient Features of India constitution especially power and functions
Salient Features of India constitution especially power and functionsSalient Features of India constitution especially power and functions
Salient Features of India constitution especially power and functionsKarakKing
 

Recently uploaded (20)

Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024
 
SOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning PresentationSOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning Presentation
 
Wellbeing inclusion and digital dystopias.pptx
Wellbeing inclusion and digital dystopias.pptxWellbeing inclusion and digital dystopias.pptx
Wellbeing inclusion and digital dystopias.pptx
 
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
 
Unit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptxUnit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptx
 
Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)
 
HMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptx
HMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptxHMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptx
HMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptx
 
FSB Advising Checklist - Orientation 2024
FSB Advising Checklist - Orientation 2024FSB Advising Checklist - Orientation 2024
FSB Advising Checklist - Orientation 2024
 
REMIFENTANIL: An Ultra short acting opioid.pptx
REMIFENTANIL: An Ultra short acting opioid.pptxREMIFENTANIL: An Ultra short acting opioid.pptx
REMIFENTANIL: An Ultra short acting opioid.pptx
 
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
 
Making communications land - Are they received and understood as intended? we...
Making communications land - Are they received and understood as intended? we...Making communications land - Are they received and understood as intended? we...
Making communications land - Are they received and understood as intended? we...
 
Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...
Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...
Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...
 
Sensory_Experience_and_Emotional_Resonance_in_Gabriel_Okaras_The_Piano_and_Th...
Sensory_Experience_and_Emotional_Resonance_in_Gabriel_Okaras_The_Piano_and_Th...Sensory_Experience_and_Emotional_Resonance_in_Gabriel_Okaras_The_Piano_and_Th...
Sensory_Experience_and_Emotional_Resonance_in_Gabriel_Okaras_The_Piano_and_Th...
 
ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.
 
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
 
Micro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdfMicro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdf
 
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptxHMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
 
Python Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docxPython Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docx
 
Salient Features of India constitution especially power and functions
Salient Features of India constitution especially power and functionsSalient Features of India constitution especially power and functions
Salient Features of India constitution especially power and functions
 
Spatium Project Simulation student brief
Spatium Project Simulation student briefSpatium Project Simulation student brief
Spatium Project Simulation student brief
 

eMOP-PennSt-lunch

  • 1. The Early Modern OCR Project An open source workflow and tools for OCR Matthew Christy, Laura Mandell, Elizabeth Grumbach 1
  • 2.  emop.tamu.edu/  Texas A&M Big Data Workshop  emop.tamu.edu/TAMU- BigData  eMOP Workflows  emop.tamu.edu/workflows  Mellon Grant Proposal  idhmc.tamu.edu/projects/ Mellon/eMOPPublic.pdf eMOP Info eMOP Website More eMOP  Facebook  The Early Modern OCR Project  Twitter  #emop  @IDHMC_Nexus  @mandellc  @matt_christy  @EMGrumbach 2
  • 3. Early Modern OCR Project  The Early Modern OCR Project (eMOP) is an Andrew W. Mellon Foundation funded grant project running out of the Initiative for Digital Humanities, Media, and Culture (IDHMC) at Texas A&M University, to develop and test tools and techniques to apply Optical Character Recognition (OCR) to early modern English documents from the hand press period, roughly 1475-1800.  eMOP aims to improve the visibility of early modern texts by making their contents fully searchable. The current paradigm of searching special collections for early modern materials by either metadata alone or “dirty” OCR is insufficient for scholarly research. 3 Specifically, eMOP’s goal is to make machine readable, or improve the readability, for 305,000 document/45 million pages of text from two major proprietary databases: Eighteenth Century Collections Online (ECCO) and Early English Books Online (EEBO). Generally, our aim is to use typeface and book history techniques to train modern OCR engines specifically on the typefaces in our collection of documents, and thereby improve the accuracy of the OCR results.
  • 4. The Numbers Page Images  Early English Books online (Proquest) EEBO: ~125,000 documents, ~13 million pages images (1475-1700)  Eighteenth Century Collections Online (Gale Cengage) ECCO: ~182,000 documents, ~32 million page images (1700-1800)  Total: >300,000 documents & 45 million page images. Ground Truth  Text Creation Partnership TCP: ~46,000 double-keyed hand transcribed documents  44,000 EEBO  2,200 ECCO 4
  • 5. 5
  • 6. 6 • PRImA (Pattern Recognition & Image Analysis Research) Lab at the University of Salford, Manchester, UK • SEASR (Software Environment for the Advancement of Scholarly Research) at the University of Illinois, Urbana-Champaign • PSI (Perception, Sensing, and Instrumentation) Lab at Texas A&M University • The Academy for Advanced Telecommunications and Learning Technologies at Texas A&M University • The Brazos High Performance Computing Cluster (HPCC) OurPartnersemop.tamu.edu/participating-institutions
  • 7. The Problems Early Modern Printing  Individual, hand-made typefaces  Worn and broken type  Poor quality equipment/paper  Inconsistent line bases  Unusual page layouts, decorative page elements,  Special characters & ligatures  Spelling variations  Mixed typefaces and languages  over/under-inking 7 Document/Image Quality  Torn and damaged pages  Noise introduced to images of pages  Skewed pages  Warped pages  Missing pages  Inverted pages  Incorrect metadata  Extremely low quality TIFFs (~50K)
  • 10. Aletheia 10 www.primaresearch.org/too ls.php Available for free but requires registration.  Created by PRImA Research Labs, University of Salford, UK.  Windows based tool.  Developed as a groundtruth creation tool  Used by eMOP undergraduate student workers to create training of desired typeface for Tesseract.  Can identify glyphs on a page image with page coordinates and Unicode values.
  • 11. Aletheia:Workflow11  Binarization and Denoise are native Aletheia functions  A team of Undergraduate student workers refines and corrects glyph boxes and unicode values, where needed.  Output: A set of PAGE XML files with page coordinates and unicode values for every identified glyph on each processed TIFF image.
  • 12. Aletheia: Glyph Recognition 12 Uses Tesseract to find glyphs
  • 13. Aletheia: I/O 13 We then convert PAGE XML file to Tesseract Box file using XSLT
  • 15. Franken+ 1. Windows based tool that uses a MySQL DB. 2. Developed for eMOP by IDHMC Graduate student worker Bryan Tarpley. 3. Designed to be easily used by eMOP Undergraduate student workers 4. Takes Aletheia's output files as input. 5. Outputs the same box files and TIFF images that Tesseract's first stage of native training.  Available open-source at: github.com/idhmc- tamu/FrankenPlus 15
  • 16. Franken+Workflow16 1. Groups all glyphs with the same Unicode values into one window for comparison. 2. Uses all selected glyphs to create a Franken- page image (TIFF) using a selected text as a base. 3. Outputs the same box files and TIFF images that Tesseract's first stage of native training.
  • 18. Franken+ 18  All exemplars of the same glyph are displayed together.  Users can quickly identify and deselect:  Incorrectly labeled glyphs  Incomplete glyphs  Unrepresentative exemplars  Different sized glyphs
  • 20. TrainingTesseract 20 Thiſ great conſumption to a fever turn'd, And ſo the oꝗld had fitſ; it joy'd, it mourn'd; And, aſ men thinke, that Agueſ phy ck are, And th'Ague being ſpent, give over care. Žo thou cke World, mꝗſtak'ſt thy ſelże to bee Well, when ãlaſ, thou'rt in a Lethargie. Her death did wound and tame thee than, and than Thou might'ſt ha e better ſpar'd the Sunne, or man. That wound waſ deep, but 'tiſ more miżery, That thou haſt loſt thy ſenſe and memor . 'Twaſ heavy then to heare thy voyce of mone, But thiſ iſ worſe, that thou art ſpeechle e growne. Thou haſt forgot thy name thou hadſt; thou waſt Nothing but ee, and her thou haſt o'rpaſt. For aſ a child kept from the Fount, untill Ä prince, expe ed long, come to fulfill The ceremonieſ, thou unnam'd had'ſt laid, Had not her comming, thee her palace made: Her name defin'd thee, gave thee forme, and frame, And thou forgett'ſt to celebrate th n me. Some monethſ e hath beene dead (but beìng dead, Meaſureſ of timeſ are all determined) But long e'ath beene away, long, long, et none Offerſ to tell uſ who it iſ that'ſ gone. But aſ in ſtateſ doubtfull of future heireſ, When ckne e without remedie empaireſ The preſent Prince, they're loth it ould be ſaid, The Prince doth langui , or the Prince iſ dead: So mankinde feeling no a generall tha ,
  • 23. 23 Workflows-Controller • Powered by the eMOP DB • Collection processing is managed via the online Dashboard emop-dashboard.tamu.edu • Run by emop-controller.py
  • 24. Brazos HPCC • 128 processors as a stakeholder • Access to background queues • Estimated at over 2 months of constant processing • We will have to reprocess some files that fail due to timeouts or that require pre-processing 24
  • 28. Triage:De-noising 28  Uses hOCR results 1. Determine average word bounding box size 2. Weed out boxes are that too big or too small 3. But keep small boxes that have neighbors that are “words”
  • 31. Triage: Estimated Correctability 31 Page Evaluation  Determine how correctable a page’s OCR results are by examining the text.  The score is based on the ratio of words that fit the correctable profile to the total number of words Correctable Profile 1. Clean tokens:  remove leading and trailing punctuation  remaining token must have at least 3 letters 2. Spell check tokens >1 character 3. Check token profile :  contain at most 2 non-alpha characters, and  at least 1 alpha character,  have a length of at least 3,  and do not contain 4 or more repeated characters in a run 4. Also consider length of tokens compared to average for the page
  • 33. Treatment: Page Correction 33 1. Preliminary cleanup  remove punctuation from begin/end of tokens  remove empty lines and empty tokens  combine hyphenated tokens that appear at the end of a line  retain cleaned & original tokens as “suggestions” 2. Apply common transformations and period specific dictionary lookups to gather suggestions for words.  transformation rules: rn->m; c->e; 1->l; e
  • 34. 34Treatment: Page Correction 3. Use context checking on a sliding window of 3 words, and their suggested changes, to find the best context matches in our(sanitized, period-specific) Google 3- gram dataset  if no context is found and only one additional suggestion was made from transformation or dictionary, then replace with this suggestion  if no context and “clean” token from above is in the dictionary, replace with this token
  • 35. 35Treatment: Page Correction window: tbat l thoughc Candidates used for context matching:  tbat -> Set(thai, thar, bat, twat, tibet, ébat, ibat, tobit, that, tat, tba, ilial, abat, tbat, teat)  l -> Set(l)  thoughc -> Set(thoughc, thought, though) ContextMatch: that l thought (matchCount: 1844 , volCount: 1474) window: l thoughc Ihe Candidates used for context matching:  l -> Set(l)  thoughc -> Set(thoughc, thought, though)  Ihe -> Set(che, sho, enc, ile, iee, plie, ihe, ire, ike, she, ife, ide, ibo, i.e, ene, ice, inc, tho, ime, ite, ive, the) ContextMatch: l though the (matchCount: 497 , volCount: 486) ContextMatch: l thought she (matchCount: 1538 , volCount: 997) ContextMatch: l thought the (matchCount: 2496 , volCount: 1905) tbat I thoughc Ihe Was
  • 36. 36Treatment: Page Correction window: thoughc Ihe Was Candidates used for context matching:  thoughc -> Set(thoughc, thought, though)  Ihe -> Set(che, sho, enc, ile, iee, plie, ihe, ire, ike, she, ife, ide, ibo, i.e, ene, ice, inc, tho, ime, ite, ive, the)  Was -> Set(Was) ContextMatch: though ice was (matchCount: 121 , volCount: 120) ContextMatch: though ike was (matchCount: 65 , volCount: 59) ContextMatch: though she was (matchCount: 556,763 , volCount: 364,965) ContextMatch: though the was (matchCount: 197 , volCount: 196) ContextMatch: thought ice was (matchCount: 45 , volCount: 45) ContextMatch: thought ike was (matchCount: 112 , volCount: 108) ContextMatch: thought she was (matchCount: 549,531 , volCount: 325,822) ContextMatch: thought the was (matchCount: 91 , volCount: 91) that I thought she was
  • 39. Tags DB  All page images that don’t produce OCR text that meets our standards will be tagged in the eMOP DB with tags that indicate what is wrong with the page image.  skew, warp, noise, wrong font, etc.  Tags can then be used to later apply the appropriate pre-processing techniques and then re-OCR the page to produce better results.  This will be the first time that any sort of comprehensive analysis has been done on the page images of these collections.  We’ll end up (ideally) with a list of page images that simply need to be re-scanned. 39
  • 40.  www.18thconnect.org/typew right/  Available through 18thconnect.org An aggregator of 18th Century metadata & content This will be the first time that EEBO documents are available as searchable text, and that they can be corrected by scholars for their own use. TypeWright  Crowd-sourced correction tool for OCR.  Currently available for ECCO collection, based on original ECCO OCR results.  Will be ingested with eMOP OCR for ECCO and EEBO by year-end.  Users can correct a whole document and then receive the text or a lightly encoded TEI XML version for use in a digital edition or whatever. 40

Editor's Notes

  1. The Early Modern OCR Project (eMOP) is an Andrew W. Mellon Foundation funded grant project running out of the Initiative for Digital Humanities, Media, and Culture (IDHMC) at Texas A&M University, to develop and test tools and techniques to improve Optical Character Recognition (OCR) outcomes for printed English documents from the hand press period, roughly 1475-1800. The basic premise of eMOP is to use typeface and book history techniques to train modern OCR engines specifically on the typefaces in our collection of documents, and thereby improve the accuracy of the OCR results. eMOP’s immediate goal is to make machine readable, or improve the readability, for 45 million pages of text from two major proprietary databases: Eighteenth Century Collections Online (ECCO) and Early English Books Online (EEBO). Generally, eMOP aims to improve the visibility of early modern texts by making their contents fully searchable. The current paradigm of searching special collections for early modern materials by either metadata alone or “dirty” OCR is inefficient for scholarly research.
  2. These are not big numbers by STEM standards, but for the Humanities they are: This is the largest dataset for any open-source, academic OCR project in North American, ever For the Humanities research data is about quality of content rather than size Our OCR must be as good as possible in order to generate reliable research
  3. Some were great most were not Noisy Skewed Warped Or they posed challenges for OCR engines Multiple pages per image Multiple columns Images & decorative elements Marginalia Missing margins many were terrible
  4. Tesseract: Google Open source Accurate Fast Tesseract must first be trained to recognize the typeface used in the document(s) it is OCRing. Tesseract has a native mechanism to create that training, but it relies on having training pages with characteristics that can typically only be created with modern word processors, which don't have early modern fonts. To create early modern font training for Tesseract we had to adapt and create our own tools.
  5. Aletheia: Created by PRImA Research Labs at the University of Salford, as a groundtruth creation tool. A team of undergraduates uses Aletheia to identify each glyph on the page images, and ensure that the correct Unicode value is assigned to each. Aletheia outputs an XML file containing all identified glyphs on a page with their corresponding coordinates and Unicode values.
  6. This is cheating: the result of scanning the same page we used to create the training.
  7. Franken+: Created by eMOP graduate student Bryan Tarpley. Used by the same team of undergraduates led by eMOP graduate student Kathy Torabi. Takes Aletheia's output files as input. Groups all glyphs with the same Unicode values into one window for comparison. Mistakenly coded glyphs are easily identified and re-coded. A user can quickly compare all exemplars of a glyph and choose just the best subset, if desired. Uses all selected glyphs to create a Franken-page image (TIFF) using a selected text as a base. Outputs the same box files and TIFF images that Tesseract's first stage of native training. Also allows users to complete Tesseract training using newly created box/TIFF file pairs, and add optional dictionary and other files. Outputs a .traineddata file used by Tesseract when OCRing page images.
  8. This document had 3 distinct point sizes, but Franken+ revealed there were more like 5 or 6. In this case it probably doesn’t matter, but it points out the possibilities
  9. Dashboard provides access to the DB via an API Controller uses the Dashboard API to schedule jobs on Brazos queues and write back to the DB. It also writes output files to the NAS
  10. at any one time we currently have about 2000 pages being processed at the same time that’s subject to processor availability on the background queues about 8 pages per second.
  11. Before: 55% After: 73%