SlideShare a Scribd company logo
1 of 22
Tesseract OCR Engine
What it is, where it came from,
      where it is going.
  Ray Smith, Google Inc
         OSCON 2007
Contents
•   Introduction & history of OCR
•   Tesseract architecture & methods
•   Announcing Tesseract 2.00
•   Training Tesseract
•   Future enhancements
A Brief History of OCR
• What is Optical Character Recognition?




                         OCR


  My invention relates to statistical machines
 of the type in which successive comparisons
 are made between a character and a charac-
A Brief History of OCR
• OCR predates electronic computers!




                US Patent 1915993, Filed Apr 27, 1931
A Brief History of OCR
•   1929 – Digit recognition machine
•   1953 – Alphanumeric recognition machine
•   1965 – US Mail sorting
•   1965 – British banking system
•   1976 – Kurzweil reading machine
•   1985 – Hardware-assisted PC software
•   1988 – Software-only PC software
•   1994-2000 – Industry consolidation
Tesseract Background
• Developed on HP-UX at HP between 1985
  and 1994 to run in a desktop scanner.
• Came neck and neck with Caere and XIS
  in the 1995 UNLV test.
 (See http://www.isri.unlv.edu/downloads/AT-1995.pdf )
• Never used in an HP product.
• Open sourced in 2005. Now on:
  http://code.google.com/p/tesseract-ocr
• Highly portable.
Tesseract OCR Architecture
Input: Gray or Color Image
[+ Region Polygons]                        Adaptive
                                         Thresholding
                                                           Binary Image


                                             Character
                             Find Text                     Connected
                                             Outlines
                             Lines and                     Component
                               Words                        Analysis
         Character
         Outlines
         Organized
         Into Words
                                 Recognize               Recognize
                                   Word                    Word
                                  Pass 1                  Pass 2
Adaptive Thresholding is Essential
        Some examples of how difficult it can be to make a binary image
                   Taken from the UNLV Magazine set.
                  (http://www.isri.unlv.edu/ISRI/OCRtk )
Baselines are rarely perfectly straight

• Text Line Finding – skew independent –
  published at ICDAR’95 Montreal.
  (http://scholar.google.com/scholar?q=skew+detection+smith)
• Baselines are approximated by quadratic splines
  to account for skew and curl.
• Meanline, ascender and descender lines are a
  constant displacement from baseline.
• Critical value is the x-height.
Spaces between words are tricky too

 • Italics, digits, punctuation all create
   special-case font-dependent spacing.
 • Fully justified text in narrow columns can
   have vastly varying spacing on different
   lines.
Tesseract: Recognize Word


                                             No
  Character      Character
                                     Done?
  Chopper        Associator

                                   Yes

                                              Adapt to
                                               Word




  Static                      Adaptive
                                              Number
              Dictionary
 Character                    Character
                                              Parser
 Classifier                   Classifier
Outline Approximation



 Original Image      Outlines of components     Polygonal Approximation



Polygonal approximation is a double-edged sword.
Noise and some pertinent information are both lost.
Tesseract: Features and Matching




Prototype   Character     Extracted   Match of      Match of
            to classify   Features    Prototype     Features To
                                      To Features   Prototype

• Static classifier uses outline fragments as
  features. Broken characters are easily
  recognizable by a small->large matching
  process in classifier. (This is slow.)
• Adaptive classifier uses the same technique!
  (Apart from normalization method.)
Announcing tesseract-2.00
• Fully Unicode (UTF-8) capable
• Already trained for 6 Latin-based
  languages (Eng, Fra, Ita, Deu, Spa, Nld)
• Code and documented process to train at
  http://code.google.com/p/tesseract-ocr
• UNLV regression test framework
• Other minor fixes
Training Tesseract
                                                                       Tesseract Data Files


                                                                           User-words
                                   Wordlist2dawg
      Word List
                                                                           Word-dawg,
                                                                           Freq-dawg
      Training
    page images                                   mfTraining                inttemp,
                                  Character                                 pffmtable
              Tesseract
                                  Features
Tesseract
                                  (*.tr files)    cnTraining
 +manual
                                                                            normproto
correction


                  Unicharset_extractor                   Addition of
      Box files                          unicharset                         unicharset
                                                         character
                                                         properties

                                                       Manual
                                                                          DangAmbigs
                                                      Data Entry
Tesseract Dictionaries
                                         Tesseract Data Files
                Usually Empty
                                             User-words
            Infrequent
Word List
            Word List                        Word-dawg,
                         Wordlist2dawg
                                             Freq-dawg
            Frequent
            Word List
Tesseract Shape Data
                                                         Tesseract Data Files



                                Prototype Shape Features

                           Expected Feature Counts

      Training
    page images                             mfTraining        inttemp,
                             Character                        pffmtable
              Tesseract
                             Features
Tesseract
                             (*.tr files)   cnTraining
 +manual
                                                              normproto
correction


                      Character Normalization Features
      Box files
Tesseract Character Data
                                                                       Tesseract Data Files




      Training
    page images

                          List of Characters + ctype information
Tesseract
 +manual
correction


                  Unicharset_extractor                   Addition of
      Box files                          unicharset                         unicharset
                                                         character
                                                         properties

                                                       Manual
                                                                          DangAmbigs
                                                      Data Entry
     Typical OCR errors eg e<->c, rn<->m etc
Accuracy Results
          Comparison of current results against 1995 UNLV results

Testid   Testset    Character                        Non-stopword

                    Errors      Accuracy   Change    Errors     Accuracy   Change

1995     Bus.3B     5959        98.14%               1293       95.73%

1995     Doe3.3B    36349       97.52%               7042       94.87%

1995     Mag.3B     15043       97.74%               3379       94.99%

1995     News.3B    6432        98.69%               1502       96.94%

Gcc4.1   Bus.3B     6258        98.04%     5.02%     1312       95.67%     1.47%

Gcc4.1   Doe3.3B    28589       98.05%     -21.35%   6692       95.12%     -4.97%

Gcc4.1   Mag.3B     14800       97.78%     -1.62%    3123       95.37%     -7.58%

Gcc4.1   News.3B    7524        98.47%     16.98%    1220       97.51%     -18.77%

Gcc4.1   Total      57171                  -10.37%   12347                 -6.58%
Commercial OCR v Tesseract
• 100+ languages.     • 6 languages + growing.
• Accuracy is good    • Accuracy was good in
  now.                  1995.
• Sophisticated app   • No UI yet.
  with complex UI.
• Works on complex    • Page layout analysis
  magazine pages.       coming soon.
• Windows Mostly.     • Runs on Linux, Mac,
                        Windows, more...
• Costs $130-$500     • Open source – Free!
Tesseract Future

•   Page layout analysis.
•   More languages.
•   Improve accuracy.
•   Add a UI.
The End
• For more information see:
  http://code.google.com/p/tesseract-ocr

More Related Content

What's hot

A STUDY ON OPTICAL CHARACTER RECOGNITION TECHNIQUES
A STUDY ON OPTICAL CHARACTER RECOGNITION TECHNIQUESA STUDY ON OPTICAL CHARACTER RECOGNITION TECHNIQUES
A STUDY ON OPTICAL CHARACTER RECOGNITION TECHNIQUES
ijcsitcejournal
 
optical character recognition system
optical character recognition systemoptical character recognition system
optical character recognition system
Vijay Apurva
 
Biometric Signature Recognization
 Biometric Signature Recognization Biometric Signature Recognization
Biometric Signature Recognization
Faimin Khan
 

What's hot (20)

OCR Presentation (Optical Character Recognition)
OCR Presentation (Optical Character Recognition)OCR Presentation (Optical Character Recognition)
OCR Presentation (Optical Character Recognition)
 
A STUDY ON OPTICAL CHARACTER RECOGNITION TECHNIQUES
A STUDY ON OPTICAL CHARACTER RECOGNITION TECHNIQUESA STUDY ON OPTICAL CHARACTER RECOGNITION TECHNIQUES
A STUDY ON OPTICAL CHARACTER RECOGNITION TECHNIQUES
 
optical character recognition system
optical character recognition systemoptical character recognition system
optical character recognition system
 
Optical Character Recognition
Optical Character RecognitionOptical Character Recognition
Optical Character Recognition
 
ocr
ocrocr
ocr
 
Text extraction from images
Text extraction from imagesText extraction from images
Text extraction from images
 
Image authentication techniques based on Image watermarking
Image authentication techniques based on Image watermarkingImage authentication techniques based on Image watermarking
Image authentication techniques based on Image watermarking
 
Lung Cancer Detection using Image Processing Techniques
Lung Cancer Detection using Image Processing TechniquesLung Cancer Detection using Image Processing Techniques
Lung Cancer Detection using Image Processing Techniques
 
Signature verification in biometrics
Signature verification in biometricsSignature verification in biometrics
Signature verification in biometrics
 
Ocr abstract
Ocr abstractOcr abstract
Ocr abstract
 
Digital Image Processing: Digital Image Fundamentals
Digital Image Processing: Digital Image FundamentalsDigital Image Processing: Digital Image Fundamentals
Digital Image Processing: Digital Image Fundamentals
 
Image segmentation
Image segmentation Image segmentation
Image segmentation
 
Lecture 3 image sampling and quantization
Lecture 3 image sampling and quantizationLecture 3 image sampling and quantization
Lecture 3 image sampling and quantization
 
Biometric Signature Recognization
 Biometric Signature Recognization Biometric Signature Recognization
Biometric Signature Recognization
 
Optical character recognition (ocr) ppt
Optical character recognition (ocr) pptOptical character recognition (ocr) ppt
Optical character recognition (ocr) ppt
 
3D Password
3D Password3D Password
3D Password
 
Pattern recognition voice biometrics
Pattern recognition voice biometricsPattern recognition voice biometrics
Pattern recognition voice biometrics
 
Text extraction From Digital image
Text extraction From Digital imageText extraction From Digital image
Text extraction From Digital image
 
Optical character recognition IEEE Paper Study
Optical character recognition IEEE Paper StudyOptical character recognition IEEE Paper Study
Optical character recognition IEEE Paper Study
 
OCR (Optical Character Recognition)
OCR (Optical Character Recognition) OCR (Optical Character Recognition)
OCR (Optical Character Recognition)
 

Viewers also liked

CS_rapport_final_fr_v3_1
CS_rapport_final_fr_v3_1CS_rapport_final_fr_v3_1
CS_rapport_final_fr_v3_1
Solin TEM
 
Os Keysholistic
Os KeysholisticOs Keysholistic
Os Keysholistic
oscon2007
 
Os Keyshacks
Os KeyshacksOs Keyshacks
Os Keyshacks
oscon2007
 
J Ruby Whirlwind Tour
J Ruby Whirlwind TourJ Ruby Whirlwind Tour
J Ruby Whirlwind Tour
oscon2007
 
Os Ellistutorial
Os EllistutorialOs Ellistutorial
Os Ellistutorial
oscon2007
 
Solr Presentation5
Solr Presentation5Solr Presentation5
Solr Presentation5
oscon2007
 
OCaml Labs introduction at OCaml Consortium 2012
OCaml Labs introduction at OCaml Consortium 2012OCaml Labs introduction at OCaml Consortium 2012
OCaml Labs introduction at OCaml Consortium 2012
Anil Madhavapeddy
 
Os Peytonjones
Os PeytonjonesOs Peytonjones
Os Peytonjones
oscon2007
 

Viewers also liked (20)

CS_rapport_final_fr_v3_1
CS_rapport_final_fr_v3_1CS_rapport_final_fr_v3_1
CS_rapport_final_fr_v3_1
 
OCR using Tesseract
OCR using TesseractOCR using Tesseract
OCR using Tesseract
 
Os Keysholistic
Os KeysholisticOs Keysholistic
Os Keysholistic
 
Os Napier
Os NapierOs Napier
Os Napier
 
FregeDay: Roadmap for resolving differences between Haskell and Frege (Ingo W...
FregeDay: Roadmap for resolving differences between Haskell and Frege (Ingo W...FregeDay: Roadmap for resolving differences between Haskell and Frege (Ingo W...
FregeDay: Roadmap for resolving differences between Haskell and Frege (Ingo W...
 
Os Keyshacks
Os KeyshacksOs Keyshacks
Os Keyshacks
 
J Ruby Whirlwind Tour
J Ruby Whirlwind TourJ Ruby Whirlwind Tour
J Ruby Whirlwind Tour
 
Os Ellistutorial
Os EllistutorialOs Ellistutorial
Os Ellistutorial
 
FregeDay: Parallelism in Frege compared to GHC Haskell (Volker Steiss)
FregeDay: Parallelism in Frege compared to GHC Haskell (Volker Steiss) FregeDay: Parallelism in Frege compared to GHC Haskell (Volker Steiss)
FregeDay: Parallelism in Frege compared to GHC Haskell (Volker Steiss)
 
Solr Presentation5
Solr Presentation5Solr Presentation5
Solr Presentation5
 
Os Harkins
Os HarkinsOs Harkins
Os Harkins
 
FregeFX - JavaFX with Frege, a Haskell for the JVM
FregeFX - JavaFX with Frege, a Haskell for the JVMFregeFX - JavaFX with Frege, a Haskell for the JVM
FregeFX - JavaFX with Frege, a Haskell for the JVM
 
Frege - consequently functional programming for the JVM
Frege - consequently functional programming for the JVMFrege - consequently functional programming for the JVM
Frege - consequently functional programming for the JVM
 
FregeDay: Design and Implementation of the language (Ingo Wechsung)
FregeDay: Design and Implementation of the language (Ingo Wechsung)FregeDay: Design and Implementation of the language (Ingo Wechsung)
FregeDay: Design and Implementation of the language (Ingo Wechsung)
 
Software Transactional Memory (STM) in Frege
Software Transactional Memory (STM) in Frege Software Transactional Memory (STM) in Frege
Software Transactional Memory (STM) in Frege
 
Frege Tutorial at JavaOne 2015
Frege Tutorial at JavaOne 2015Frege Tutorial at JavaOne 2015
Frege Tutorial at JavaOne 2015
 
2 architecture anddatastructures
2 architecture anddatastructures2 architecture anddatastructures
2 architecture anddatastructures
 
OCaml Labs introduction at OCaml Consortium 2012
OCaml Labs introduction at OCaml Consortium 2012OCaml Labs introduction at OCaml Consortium 2012
OCaml Labs introduction at OCaml Consortium 2012
 
Os Peytonjones
Os PeytonjonesOs Peytonjones
Os Peytonjones
 
OCR processing with deep learning: Apply to Vietnamese documents
OCR processing with deep learning: Apply to Vietnamese documents OCR processing with deep learning: Apply to Vietnamese documents
OCR processing with deep learning: Apply to Vietnamese documents
 

Similar to Os Raysmith

Simplifying Database Development (OSCON 2009)
Simplifying Database Development (OSCON 2009)Simplifying Database Development (OSCON 2009)
Simplifying Database Development (OSCON 2009)
PostgreSQL Experts, Inc.
 
[DCTPE2010] Biodiversity & Drupal
[DCTPE2010] Biodiversity & Drupal[DCTPE2010] Biodiversity & Drupal
[DCTPE2010] Biodiversity & Drupal
Drupal Taiwan
 
Turmeric SOA Cloud Mashups
Turmeric SOA Cloud MashupsTurmeric SOA Cloud Mashups
Turmeric SOA Cloud Mashups
kingargyle
 

Similar to Os Raysmith (20)

Simplifying Database Development (OSCON 2009)
Simplifying Database Development (OSCON 2009)Simplifying Database Development (OSCON 2009)
Simplifying Database Development (OSCON 2009)
 
GTC 2012: NVIDIA OpenGL in 2012
GTC 2012: NVIDIA OpenGL in 2012GTC 2012: NVIDIA OpenGL in 2012
GTC 2012: NVIDIA OpenGL in 2012
 
Model-Driven Software Development - Language Workbenches & Syntax Definition
Model-Driven Software Development - Language Workbenches & Syntax DefinitionModel-Driven Software Development - Language Workbenches & Syntax Definition
Model-Driven Software Development - Language Workbenches & Syntax Definition
 
Nltk natural language toolkit overview and application @ PyCon.tw 2012
Nltk  natural language toolkit overview and application @ PyCon.tw 2012Nltk  natural language toolkit overview and application @ PyCon.tw 2012
Nltk natural language toolkit overview and application @ PyCon.tw 2012
 
[DCTPE2010] Biodiversity & Drupal
[DCTPE2010] Biodiversity & Drupal[DCTPE2010] Biodiversity & Drupal
[DCTPE2010] Biodiversity & Drupal
 
stream processing engine
stream processing enginestream processing engine
stream processing engine
 
Using S1000D and SCORM to Integrate Documentation and Training
Using S1000D and SCORM to Integrate Documentation and TrainingUsing S1000D and SCORM to Integrate Documentation and Training
Using S1000D and SCORM to Integrate Documentation and Training
 
Maximize the Value of Raster Data Using FME
Maximize the Value of Raster Data Using FMEMaximize the Value of Raster Data Using FME
Maximize the Value of Raster Data Using FME
 
Building a SQL Database that Works
Building a SQL Database that WorksBuilding a SQL Database that Works
Building a SQL Database that Works
 
The Phenoscape Knowledgebase
The Phenoscape KnowledgebaseThe Phenoscape Knowledgebase
The Phenoscape Knowledgebase
 
Evolution: It's a process
Evolution: It's a processEvolution: It's a process
Evolution: It's a process
 
Erlang/OTP for Rubyists
Erlang/OTP for RubyistsErlang/OTP for Rubyists
Erlang/OTP for Rubyists
 
XML Schema Patterns for Databinding
XML Schema Patterns for DatabindingXML Schema Patterns for Databinding
XML Schema Patterns for Databinding
 
Corel draw vba object model
Corel draw vba object modelCorel draw vba object model
Corel draw vba object model
 
Cascon2011_5_rules+owl
Cascon2011_5_rules+owlCascon2011_5_rules+owl
Cascon2011_5_rules+owl
 
Extraction of topic evolutions from references in scientific articles and its...
Extraction of topic evolutions from references in scientific articles and its...Extraction of topic evolutions from references in scientific articles and its...
Extraction of topic evolutions from references in scientific articles and its...
 
3 training
3 training3 training
3 training
 
Turmeric SOA Cloud Mashups
Turmeric SOA Cloud MashupsTurmeric SOA Cloud Mashups
Turmeric SOA Cloud Mashups
 
CSHALS 2013
CSHALS 2013CSHALS 2013
CSHALS 2013
 
Multi-layer Annotation in Dependency Structure
Multi-layer Annotation in Dependency StructureMulti-layer Annotation in Dependency Structure
Multi-layer Annotation in Dependency Structure
 

More from oscon2007

Os Fitzpatrick Sussman Wiifm
Os Fitzpatrick Sussman WiifmOs Fitzpatrick Sussman Wiifm
Os Fitzpatrick Sussman Wiifm
oscon2007
 
Performance Whack A Mole
Performance Whack A MolePerformance Whack A Mole
Performance Whack A Mole
oscon2007
 
Os Lanphier Brashears
Os Lanphier BrashearsOs Lanphier Brashears
Os Lanphier Brashears
oscon2007
 
Os Fitzpatrick Sussman Swp
Os Fitzpatrick Sussman SwpOs Fitzpatrick Sussman Swp
Os Fitzpatrick Sussman Swp
oscon2007
 
Os Berlin Dispelling Myths
Os Berlin Dispelling MythsOs Berlin Dispelling Myths
Os Berlin Dispelling Myths
oscon2007
 
Os Jonphillips
Os JonphillipsOs Jonphillips
Os Jonphillips
oscon2007
 
Os Urnerupdated
Os UrnerupdatedOs Urnerupdated
Os Urnerupdated
oscon2007
 
Adventures In Copyright Reform
Adventures In Copyright ReformAdventures In Copyright Reform
Adventures In Copyright Reform
oscon2007
 
Railsconf2007
Railsconf2007Railsconf2007
Railsconf2007
oscon2007
 
Oscon Mitchellbaker
Oscon MitchellbakerOscon Mitchellbaker
Oscon Mitchellbaker
oscon2007
 

More from oscon2007 (20)

Os Borger
Os BorgerOs Borger
Os Borger
 
Os Fitzpatrick Sussman Wiifm
Os Fitzpatrick Sussman WiifmOs Fitzpatrick Sussman Wiifm
Os Fitzpatrick Sussman Wiifm
 
Os Bunce
Os BunceOs Bunce
Os Bunce
 
Yuicss R7
Yuicss R7Yuicss R7
Yuicss R7
 
Performance Whack A Mole
Performance Whack A MolePerformance Whack A Mole
Performance Whack A Mole
 
Os Fogel
Os FogelOs Fogel
Os Fogel
 
Os Lanphier Brashears
Os Lanphier BrashearsOs Lanphier Brashears
Os Lanphier Brashears
 
Os Tucker
Os TuckerOs Tucker
Os Tucker
 
Os Fitzpatrick Sussman Swp
Os Fitzpatrick Sussman SwpOs Fitzpatrick Sussman Swp
Os Fitzpatrick Sussman Swp
 
Os Furlong
Os FurlongOs Furlong
Os Furlong
 
Os Berlin Dispelling Myths
Os Berlin Dispelling MythsOs Berlin Dispelling Myths
Os Berlin Dispelling Myths
 
Os Kimsal
Os KimsalOs Kimsal
Os Kimsal
 
Os Pruett
Os PruettOs Pruett
Os Pruett
 
Os Alrubaie
Os AlrubaieOs Alrubaie
Os Alrubaie
 
Os Jonphillips
Os JonphillipsOs Jonphillips
Os Jonphillips
 
Os Urnerupdated
Os UrnerupdatedOs Urnerupdated
Os Urnerupdated
 
Adventures In Copyright Reform
Adventures In Copyright ReformAdventures In Copyright Reform
Adventures In Copyright Reform
 
Railsconf2007
Railsconf2007Railsconf2007
Railsconf2007
 
Oscon Mitchellbaker
Oscon MitchellbakerOscon Mitchellbaker
Oscon Mitchellbaker
 
Os Sharp
Os SharpOs Sharp
Os Sharp
 

Recently uploaded

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
WSO2
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Victor Rentea
 

Recently uploaded (20)

Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024
 

Os Raysmith

  • 1. Tesseract OCR Engine What it is, where it came from, where it is going. Ray Smith, Google Inc OSCON 2007
  • 2. Contents • Introduction & history of OCR • Tesseract architecture & methods • Announcing Tesseract 2.00 • Training Tesseract • Future enhancements
  • 3. A Brief History of OCR • What is Optical Character Recognition? OCR My invention relates to statistical machines of the type in which successive comparisons are made between a character and a charac-
  • 4. A Brief History of OCR • OCR predates electronic computers! US Patent 1915993, Filed Apr 27, 1931
  • 5. A Brief History of OCR • 1929 – Digit recognition machine • 1953 – Alphanumeric recognition machine • 1965 – US Mail sorting • 1965 – British banking system • 1976 – Kurzweil reading machine • 1985 – Hardware-assisted PC software • 1988 – Software-only PC software • 1994-2000 – Industry consolidation
  • 6. Tesseract Background • Developed on HP-UX at HP between 1985 and 1994 to run in a desktop scanner. • Came neck and neck with Caere and XIS in the 1995 UNLV test. (See http://www.isri.unlv.edu/downloads/AT-1995.pdf ) • Never used in an HP product. • Open sourced in 2005. Now on: http://code.google.com/p/tesseract-ocr • Highly portable.
  • 7. Tesseract OCR Architecture Input: Gray or Color Image [+ Region Polygons] Adaptive Thresholding Binary Image Character Find Text Connected Outlines Lines and Component Words Analysis Character Outlines Organized Into Words Recognize Recognize Word Word Pass 1 Pass 2
  • 8. Adaptive Thresholding is Essential Some examples of how difficult it can be to make a binary image Taken from the UNLV Magazine set. (http://www.isri.unlv.edu/ISRI/OCRtk )
  • 9. Baselines are rarely perfectly straight • Text Line Finding – skew independent – published at ICDAR’95 Montreal. (http://scholar.google.com/scholar?q=skew+detection+smith) • Baselines are approximated by quadratic splines to account for skew and curl. • Meanline, ascender and descender lines are a constant displacement from baseline. • Critical value is the x-height.
  • 10. Spaces between words are tricky too • Italics, digits, punctuation all create special-case font-dependent spacing. • Fully justified text in narrow columns can have vastly varying spacing on different lines.
  • 11. Tesseract: Recognize Word No Character Character Done? Chopper Associator Yes Adapt to Word Static Adaptive Number Dictionary Character Character Parser Classifier Classifier
  • 12. Outline Approximation Original Image Outlines of components Polygonal Approximation Polygonal approximation is a double-edged sword. Noise and some pertinent information are both lost.
  • 13. Tesseract: Features and Matching Prototype Character Extracted Match of Match of to classify Features Prototype Features To To Features Prototype • Static classifier uses outline fragments as features. Broken characters are easily recognizable by a small->large matching process in classifier. (This is slow.) • Adaptive classifier uses the same technique! (Apart from normalization method.)
  • 14. Announcing tesseract-2.00 • Fully Unicode (UTF-8) capable • Already trained for 6 Latin-based languages (Eng, Fra, Ita, Deu, Spa, Nld) • Code and documented process to train at http://code.google.com/p/tesseract-ocr • UNLV regression test framework • Other minor fixes
  • 15. Training Tesseract Tesseract Data Files User-words Wordlist2dawg Word List Word-dawg, Freq-dawg Training page images mfTraining inttemp, Character pffmtable Tesseract Features Tesseract (*.tr files) cnTraining +manual normproto correction Unicharset_extractor Addition of Box files unicharset unicharset character properties Manual DangAmbigs Data Entry
  • 16. Tesseract Dictionaries Tesseract Data Files Usually Empty User-words Infrequent Word List Word List Word-dawg, Wordlist2dawg Freq-dawg Frequent Word List
  • 17. Tesseract Shape Data Tesseract Data Files Prototype Shape Features Expected Feature Counts Training page images mfTraining inttemp, Character pffmtable Tesseract Features Tesseract (*.tr files) cnTraining +manual normproto correction Character Normalization Features Box files
  • 18. Tesseract Character Data Tesseract Data Files Training page images List of Characters + ctype information Tesseract +manual correction Unicharset_extractor Addition of Box files unicharset unicharset character properties Manual DangAmbigs Data Entry Typical OCR errors eg e<->c, rn<->m etc
  • 19. Accuracy Results Comparison of current results against 1995 UNLV results Testid Testset Character Non-stopword Errors Accuracy Change Errors Accuracy Change 1995 Bus.3B 5959 98.14% 1293 95.73% 1995 Doe3.3B 36349 97.52% 7042 94.87% 1995 Mag.3B 15043 97.74% 3379 94.99% 1995 News.3B 6432 98.69% 1502 96.94% Gcc4.1 Bus.3B 6258 98.04% 5.02% 1312 95.67% 1.47% Gcc4.1 Doe3.3B 28589 98.05% -21.35% 6692 95.12% -4.97% Gcc4.1 Mag.3B 14800 97.78% -1.62% 3123 95.37% -7.58% Gcc4.1 News.3B 7524 98.47% 16.98% 1220 97.51% -18.77% Gcc4.1 Total 57171 -10.37% 12347 -6.58%
  • 20. Commercial OCR v Tesseract • 100+ languages. • 6 languages + growing. • Accuracy is good • Accuracy was good in now. 1995. • Sophisticated app • No UI yet. with complex UI. • Works on complex • Page layout analysis magazine pages. coming soon. • Windows Mostly. • Runs on Linux, Mac, Windows, more... • Costs $130-$500 • Open source – Free!
  • 21. Tesseract Future • Page layout analysis. • More languages. • Improve accuracy. • Add a UI.
  • 22. The End • For more information see: http://code.google.com/p/tesseract-ocr