Automatic OCR correction http://overproof.projectcomputing.com
Correcting noisy OCR
- Context beats Confusion
[ presentation viewable at http://goo.gl/n85gR6 ]
who are we?
● Australian software company
● developers John and Kent
● we put theory into practice
main target - newspapers
● the first draft of history
● popular if made available
● usually poorly digitized
● too extensive for full human correction
goals
● run on commodity cloud server
● optimal for noisy text
● at least 1000 words/sec
● correct at least 50% of errors
division of labour
[diagram: MANAGER, TRIAGE and CORE stages; labels: good, bad, models]
snippets for the core
● prefer triaged good words at start/end
● column aware
● some easy corrections applied
● some suggestions supplied
● bag of topic words available
● surrounding noise level indicated
error contexts
● spell: vowals or consonnants
● type: you jit teh wrng key
● OCR: roprcroiitativcs cf thc Coveriuient
● random: anygh<eg 0at7happen
confusion cost matrix
93: w ← w
155: e ← e
3750: c ← e
4451: m ← rn
6652: rn ← m
11065: E ← m
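The slide does not say how these costs are derived. One common choice, assumed here, is a scaled negative log-probability estimated from aligned OCR/truth pairs. In the sketch below the counts, the `SCALE` factor and the function name are all invented for illustration, and pairs are read as (OCR output, true text), following the "OCR ← truth" direction of the later "Parhumuitar} ← Parliamentary" examples.

```python
import math
from collections import Counter

# Hypothetical counts of aligned (ocr_output, true_text) segments.
# These numbers are invented; they are NOT taken from the slides.
pair_counts = Counter({
    ("w", "w"): 9900, ("vv", "w"): 100,
    ("e", "e"): 9800, ("c", "e"): 200,
    ("m", "m"): 9500, ("rn", "m"): 500,
})

SCALE = 1000  # assumed scaling so that costs come out as convenient integers


def build_cost_table(counts):
    """Cost of seeing `ocr` when the truth is `true`: -log P(ocr | true), scaled."""
    totals = Counter()
    for (ocr, true), n in counts.items():
        totals[true] += n
    return {
        (ocr, true): round(-math.log(n / totals[true]) * SCALE)
        for (ocr, true), n in counts.items()
    }


costs = build_cost_table(pair_counts)
```

Under this assumption a correct reading is cheap but not free, and rarer confusions cost more, which matches the ordering of the figures on the slide.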
word cost (eg rnorniny|morning)
language cost
● lexicon frequency
● entity list
● rare word list
● character 4-gram
error cost
● edit sum
● visual correlation
● generator hint
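A minimal sketch of how a language cost and an error cost might be combined into one word cost. The slide lists the ingredients but not the formula, so the simple sum, the toy lexicon, the flat 4-gram fallback and every name below are assumptions.

```python
import math

# Hypothetical toy models; a real system would load these from training data.
LEXICON_FREQ = {"morning": 5000, "mourning": 300, "warning": 1200}
TOTAL_WORDS = 1_000_000
UNKNOWN_4GRAM_COST = 4.0  # assumed flat cost per character 4-gram for unknown words


def language_cost(word):
    """Cheap for frequent lexicon words; crude 4-gram style fallback otherwise."""
    freq = LEXICON_FREQ.get(word.lower())
    if freq:
        return -math.log(freq / TOTAL_WORDS)
    n_grams = max(1, len(word) - 3)
    return UNKNOWN_4GRAM_COST * n_grams


def word_cost(candidate, error_cost, weight=1.0):
    """Total cost of proposing `candidate`, given the cost of the edits needed."""
    return language_cost(candidate) + weight * error_cost


def best_candidate(candidates):
    """candidates: iterable of (word, error_cost) pairs; returns the cheapest word."""
    return min(candidates, key=lambda c: word_cost(*c))[0]


choice = best_candidate([("morning", 2.0), ("mourning", 2.5), ("rnorniny", 0.0)])
```

Leaving the OCR string unchanged has zero error cost, but its language cost is high, so a plausible lexicon word can still win.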
word character confusion
m o r n i n g
r n o r n i n y
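The alignment above pairs "rn" with "m" and "y" with "g". One way to score such an alignment, sketched here as an assumption rather than the authors' method, is a weighted edit distance whose dynamic programme also allows multi-character confusion segments. The small costs are invented and keyed as (OCR segment, true segment).

```python
def alignment_cost(ocr, truth, confusions, match=0, sub=10, indel=12):
    """Cheapest way to explain `ocr` as a noisy reading of `truth`."""
    inf = float("inf")
    n, m = len(ocr), len(truth)
    dp = [[inf] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            cur = dp[i][j]
            if cur == inf:
                continue
            if i < n and j < m:  # one-to-one character: match or generic substitution
                step = match if ocr[i] == truth[j] else sub
                dp[i + 1][j + 1] = min(dp[i + 1][j + 1], cur + step)
            if i < n:  # spurious character in the OCR text
                dp[i + 1][j] = min(dp[i + 1][j], cur + indel)
            if j < m:  # true character dropped by the OCR
                dp[i][j + 1] = min(dp[i][j + 1], cur + indel)
            # known confusions, possibly spanning several characters
            for (o_seg, t_seg), cost in confusions.items():
                if ocr.startswith(o_seg, i) and truth.startswith(t_seg, j):
                    a, b = i + len(o_seg), j + len(t_seg)
                    dp[a][b] = min(dp[a][b], cur + cost)
    return dp[n][m]


# Hypothetical costs, much smaller than the slide's figures, for readability.
confusions = {("rn", "m"): 3, ("y", "g"): 4}
cost = alignment_cost("rnorniny", "morning", confusions)
```

Here the cheapest explanation uses "rn" for "m" once and "y" for "g" once, and matches every other character exactly.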
visual correlation
suggestion methods
● gift
● common, cached
● language
● entities
● split/join
● generated (magic)
searching for gold (A*)
[figure: search tree of candidate character expansions]
purple nodes: working priority queue
red nodes: output priority queue
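A rough sketch of a best-first search with a working priority queue and an output list, in the spirit of the two queues named above. It is not the authors' generator: it walks a lexicon trie, uses a zero heuristic (so it is really uniform-cost search, the simplest special case of A*), and handles only confusions whose true side is a single character. The trie layout, the costs and the function names are assumptions.

```python
import heapq
from itertools import count


def build_trie(words):
    """Nested dicts keyed by character; the key None marks the end of a word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[None] = word
    return root


def suggest(ocr, trie, confusions, sub=10, limit=3):
    """Return up to `limit` (cost, word) pairs, cheapest first."""
    tie = count()  # tie-breaker so heapq never has to compare dict nodes
    work = [(0, next(tie), 0, trie)]  # working priority queue
    out = []  # finished suggestions, in order of increasing cost
    seen = set()
    while work and len(out) < limit:
        cost, _, pos, node = heapq.heappop(work)
        if (pos, id(node)) in seen:
            continue
        seen.add((pos, id(node)))
        if pos == len(ocr) and None in node:
            out.append((cost, node[None]))
            continue
        for ch, child in node.items():
            if ch is None:
                continue
            if pos < len(ocr):  # exact match or generic substitution
                step = 0 if ocr[pos] == ch else sub
                heapq.heappush(work, (cost + step, next(tie), pos + 1, child))
            for (o_seg, t_seg), c in confusions.items():
                # an OCR segment standing in for the single true character `ch`
                if t_seg == ch and ocr.startswith(o_seg, pos):
                    heapq.heappush(work, (cost + c, next(tie), pos + len(o_seg), child))
    return out


trie = build_trie(["morning", "mourning", "warning", "moaning"])
results = suggest("rnorniny", trie, {("rn", "m"): 3, ("y", "g"): 4})
```

A real A* would add an admissible estimate of the remaining cost to each queue priority, so that unpromising branches are expanded later or never.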
amazing generated suggestions
Parhumuitar} ← Parliamentary
I.iulwuvB ← Railways
Itegtniont ← Regiment
niltfltory ← adultery
uj.rccu.eut ← agreement
couniutfc.o ← committee
cnuipuii ← company
dctoimiuatJOu ← determination
uiidcrtkikcr’a ← undertaker’s
selecting best combination
unsiejitlv: unsightly, unseemly, unsettle, unsteady, Unsightly, urgently
bohavlour: behaviour, behavour, behavior, Behaviour, behaviours, behaving
abonf: about, above, along, been
am: am, an, a, in, as
unsiejitlv: unsightly, unseemly, unsettle, unsteady, Unsightly, urgently
disgrie: disgrace, disagree, disguise, desire, degree, disease
[NOTE: word joins and splits are also supported]
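Choosing one candidate per word so that the whole sequence is cheapest can be sketched as a Viterbi-style pass over the candidate lists. The slides do not name the algorithm, so this is an assumption, and the word costs and bigram costs below are invented purely to show the mechanics.

```python
def best_sequence(lattice, bigram_cost):
    """lattice: list of {candidate: word_cost} dicts, one per OCR word.

    Returns (total_cost, chosen_words) for the cheapest path.
    """
    # best[w] = (cost of the cheapest path ending in w, that path)
    best = {w: (c, [w]) for w, c in lattice[0].items()}
    for column in lattice[1:]:
        nxt = {}
        for word, word_cost in column.items():
            prev_cost, prev_path = min(
                ((pc + bigram_cost(p, word), path) for p, (pc, path) in best.items()),
                key=lambda t: t[0],
            )
            nxt[word] = (prev_cost + word_cost, prev_path + [word])
        best = nxt
    return min(best.values(), key=lambda t: t[0])


# Hypothetical costs: "unsightly" is the cheapest word on its own ...
lattice = [
    {"unsightly": 3, "unseemly": 4, "urgently": 6},
    {"behaviour": 1, "behaving": 5},
]
# ... but only one word pair is known to the (invented) bigram model.
KNOWN_BIGRAMS = {("unseemly", "behaviour"): 1}


def bigram_cost(prev, word):
    return KNOWN_BIGRAMS.get((prev, word), 5)


total, words = best_sequence(lattice, bigram_cost)
```

With these made-up numbers the context term outweighs the per-word cost, so the pair known to the bigram model is chosen over the individually cheapest first word.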
training
● 5-grams - subset selection
● corpus 1,2,3-grams - statistical build
● extra word lists - easy
● error model - bootstrap or new pairs
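As an illustration of the "corpus 1,2,3-grams" step, word n-gram counts can be collected in a single pass. This is only a sketch of the counting; how the counts are smoothed or pruned in the statistical build is not described on the slide, and the function name is hypothetical.

```python
from collections import Counter


def ngram_counts(tokens, max_n=3):
    """Count all word n-grams of length 1..max_n in a token list."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


counts = ngram_counts("the cat sat on the mat".split())
```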
testing
● 65000 words ground truth including foreign (US) newspapers
● all measures exceeded goal:
○ search errors (article word types)
○ read errors (article word tokens)
○ entropy weighted term errors
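The first two measures could be approximated as below. The exact definitions used in the evaluation are not given on the slide, so both functions are assumptions: recall over distinct word types as a proxy for search quality, and a token-level mismatch rate as a proxy for reading quality. Token matching uses `difflib.SequenceMatcher` from the Python standard library.

```python
from difflib import SequenceMatcher


def type_recall(truth_tokens, ocr_tokens):
    """Fraction of distinct ground-truth words that appear somewhere in the OCR text."""
    truth_types = set(truth_tokens)
    return len(truth_types & set(ocr_tokens)) / len(truth_types)


def token_error_rate(truth_tokens, ocr_tokens):
    """Fraction of ground-truth tokens with no aligned identical OCR token."""
    matcher = SequenceMatcher(None, truth_tokens, ocr_tokens, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return 1 - matched / len(truth_tokens)


truth = "the railways committee met this morning".split()
ocr = "the I.iulwuvB committee met this rnorniny".split()
recall = type_recall(truth, ocr)
errors = token_error_rate(truth, ocr)
```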
SMH sample
                      Before   After   Change
Recall                83.8%    94.1%   recall misses reduced 63.3%
Raw Error Rate        18.5%    5.5%    errors reduced 70.1%
Weighted Error Rate   16.2%    6.7%    weighted errors reduced 59.4%
questions?
National Library of Australia’s TROVE
● 1.4m distinct visitors/month
● 16m pageviews/month
● 80% of usage is old newspapers
○ 13m pages, over 600 titles
○ 85k lines corrected/day
Even this massive volunteer effort cannot keep up
● < 2% of errors have been corrected
● % corrected is declining
● Hence searching is unreliable, OCR’ed text is hard to read and reuse
● Trove’s accuracy is “typical”
159 randomly selected news articles from The Sydney Morning Herald
47.4K words hand-corrected to ground truth
SMH sample
                        Before   After   Change
Recall                  83.8%    94.1%   recall misses reduced 63.3%
False positive recall   26.7%    9.1%    false positives reduced 65.8%
Raw Error Rate          18.5%    5.5%    errors reduced 70.1%
Weighted Error Rate     16.2%    6.7%    weighted errors reduced 59.4%
49 randomly selected news articles from LoC Chronicling America
18.1K words hand-corrected to ground truth
LOC sample
                        Before   After   Change
Recall                  84.0%    93.1%   recall misses reduced 56.6%
False positive recall   23.6%    8.8%    false positives reduced 62.8%
Raw Error Rate          19.1%    6.4%    errors reduced 66.7%
Weighted Error Rate     16.0%    7.7%    weighted errors reduced 51.8%


Datech2014 - Session 3 - Correcting Noisy OCR: Context Beats Confusion
