Simon Tanner
Blog: simon-tanner.blogspot.co.uk
Twitter: @SimonTanner


      www.slideshare.net/KDCS/
King’s Digital Consultancy Services




              www.digitalconsultancy.net
Deciding whether Optical Character Recognition is feasible
(PDF document) created for the Oxford University Digital
Library
www.odl.ox.ac.uk/papers/OCRFeasibility_final.pdf

Measuring Mass Text Digitization Quality and Usefulness:
Lessons Learned from Assessing the OCR Accuracy of the
British Library's 19th Century Online Newspaper Archive
www.dlib.org/dlib/july09/munoz/07munoz.html



www.impact-project.eu
Uniformity
Language
Text alignment
Complexity of alignment
Lines, graphics and pictures
Handwriting
Evaluating OCR accuracy is about more than just
character to character accuracy rates
    Character accuracy rates are misleading
    (more later…)
It is also about assessing the functionality enabled
through the OCR’s output
    Search accuracy
    Volume of hits returned
    Ability to structure searches and results
    Accuracy of result ranking
    Amount of correction required to
    achieve the required performance
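
A minimal sketch of why the character rate and the word rate can diverge, assuming both are measured as 1 minus edit distance divided by the length of the correct text. The function names and the sample strings are illustrative assumptions, not taken from any OCR package or from the test material.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (insert, delete, substitute cost 1 each)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def accuracy(truth, ocr):
    """1 - edit_distance / len(truth), floored at 0. Works on strings or word lists."""
    if not truth:
        return 1.0 if not ocr else 0.0
    return max(0.0, 1.0 - levenshtein(truth, ocr) / len(truth))


# Hypothetical ground truth and OCR output (long s misread as f).
truth = "Unanimity amongst Christians"
ocr = "Unanimity amongft Chriftians"

char_acc = accuracy(truth, ocr)                  # compared character by character
word_acc = accuracy(truth.split(), ocr.split())  # compared word by word

print(f"characters: {char_acc:.1%}")  # 92.9%
print(f"words:      {word_acc:.1%}")  # 33.3%
```

Two wrong characters out of 28 leave the character rate above 90%, yet two of the three words no longer match a search term.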
Consider this scenario:
  1,000 words with 5,000 characters
  (an average of 5 per word) excluding spaces

  90% character accuracy means:
    4,500 characters correct
    At most 900 words correct (90%), if the 500 errors cluster in 100 words
    At least 500 words correct (50%), if each error falls in a different word
    Reality is somewhere in between
    Depending on the number of “significant
    words” the search results could still
    be almost 100% or near zero
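
The bounds in this scenario can be worked through as a short calculation. This is a sketch under simplifying assumptions: every word has the same length, and each error is a single wrong character. The function name `word_accuracy_bounds` is hypothetical.

```python
import math


def word_accuracy_bounds(n_words, chars_per_word, char_accuracy):
    """Return (best_case, worst_case) counts of fully correct words."""
    total_chars = n_words * chars_per_word
    errors = round(total_chars * (1 - char_accuracy))
    # Best case: errors are packed into as few words as possible.
    fewest_bad_words = math.ceil(errors / chars_per_word)
    # Worst case: every error lands in a different word.
    most_bad_words = min(errors, n_words)
    return n_words - fewest_bad_words, n_words - most_bad_words


best, worst = word_accuracy_bounds(1000, 5, 0.90)
print(best, worst)  # 900 500
```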
[Chart: y-axis 50 to 100; x-axis years 1801 to 1900. Series: characters, words, words with capital letter start, significant words, each with a polynomial (Poly.) trend line.]
OCR Results
OCR Engine    % characters correct    % words correct    No. of corrections
FineReader    91.1                    70.9               110
PrimeOCR      93.95                   79.1               79

Total number of characters = 2109
Total number of words = 379
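
For this sample, the number of corrections appears to line up with the number of incorrect words. That reading is an inference from the figures, not something the slide states, and the function name below is hypothetical. A quick check:

```python
def words_needing_correction(total_words, pct_words_correct):
    """Estimate incorrect words from a word accuracy percentage."""
    return round(total_words * (100 - pct_words_correct) / 100)


total_words = 379
pct_words_correct = {"FineReader": 70.9, "PrimeOCR": 79.1}

estimates = {
    engine: words_needing_correction(total_words, pct)
    for engine, pct in pct_words_correct.items()
}
print(estimates)  # {'FineReader': 110, 'PrimeOCR': 79}
```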




                                                  I am petfood, God toil! uttedy-toverthroW, at feaft; $gy abafe
                                                  Men's affections tp; and seal for all Party-making Notions
                                                  amdngft CfiriftiansybefGieirie will raife his,Church to that prof-
                                                  perous, flourilhing State prophefied of, and prOmifed in the
                                                  Scrip* tures. There mult be more Love, and Charity, and
                                                  Unanimity amongft Chriftians,.
OCR Results

OCR Engine    % characters correct    % words correct    No. of corrections
FineReader    73.7                    57.5               31
PrimeOCR      75.9                    62.37              28


Total number of characters = 411
Total number of words = 73




                                                                          A THEATRE
                                                                           erein be reprc-fented as wel the miferies &
                                                                          calamities tijat foiioto tht too*
                                                                          e^jr alfo the greate toyts and
                                                                          plefures tobtcf) tbe fatrfc faltooenio^
                                                                          An Argument both profitable and
                                                                          dele&able, to all that finccrcly
                                                                          loue the word of Codt'.
                                                                          *Deuifedby S. hhnv&n~ derlS^oodt.
                                                                          s 3^ Scene and allowed according to the order
                                                                          appointed.
                                                                          , ^ Imprinted at London by Henry Bynncman*
                                                                          Anno Domini.
                                                                          CVM PHIT


Digitisation Doctor Optical Character Recognition
