Lessons from Indic OCR Development
National Conference on Free Software
Nishad T R
NIT, Calicut
http://www.himili.com/ocr/
Overview
  • History and Evolution of OCR
  • When, Where, Why and How of OCR
  • Selection of an OCR Engine and other gears
  • Putting it all together, and why Tesseract architectural style
  • Challenges in Indic OCR
  • Lessons learned and applied
  • Where is it NOW?
OCR in General
  • Engine
  • Training Data
  • Input Tools
  • Output formatting tools
15-Nov-08
Three competitors
  • Ocrad: Ocrad is the GNU OCR program. It was written by Antonio Diaz Diaz and is licensed under the GPL.
  • GOCR: GOCR is an OCR program written by Joerg Schulenburg and others. It is licensed under the GPL.
  • Tesseract: Originally developed at HP, Tesseract was released as open source in 2005, and its development has been sponsored by Google since 2006.
And how they performed
Again how they performed
And the winner is …
  • Tesseract gives extremely good output at a reasonable speed. It is the clear overall winner of the test. The only caveat is that one absolutely must convert the input to bitonal.
  • Ocrad gives reasonable output at extremely high speed. It can be useful in applications where speed is more important than accuracy.
  • GOCR gives poor output at a slow speed.
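The bitonal caveat can be illustrated with a minimal sketch. It assumes the page is already available as rows of 0-255 grayscale values and uses a fixed global threshold; the function name `to_bitonal` and the default threshold of 128 are illustrative choices, not something the slides specify. A real pipeline would more likely use an imaging library and an adaptive threshold.

```python
def to_bitonal(gray_rows, threshold=128):
    """Map each 0-255 grayscale pixel to 0 (black) or 255 (white).

    The threshold is a hypothetical default; scanned pages often need
    a value chosen per image.
    """
    return [[255 if px >= threshold else 0 for px in row] for row in gray_rows]


# A tiny 2x3 "page" of grayscale pixels.
page = [
    [250, 240, 30],
    [20, 128, 200],
]
print(to_bitonal(page))
```

Pixels at or above the threshold become white and the rest become black, so the engine sees only two tones.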
Development Process Evolution
  • Fostering Contributions
      • developer focus and avoiding starvation
      • code, code review, documentation, support
  • Recognizing Ego
      • trust and good intentions
      • beware of maniacal focus
  • Limits of volunteerism
      • eight knives and an apple (dining developer problem)
      • eight knives and a pumpkin
      • eight pumpkins and no knives
How Debayan tamed Matra
http://debayanin.googlepages.com/hackingtesseract
And how they performed
To train for another language, you have to create eight data files in the tessdata subdirectory, where xxx is the language code. Language codes follow the ISO 639-3 standard.
  • tessdata/xxx.freq-dawg
  • tessdata/xxx.word-dawg
  • tessdata/xxx.user-words
  • tessdata/xxx.inttemp
  • tessdata/xxx.normproto
  • tessdata/xxx.pffmtable
  • tessdata/xxx.unicharset
  • tessdata/xxx.DangAmbigs
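As a rough aid when assembling the eight files listed above, the following sketch reports which of them are still missing for a given language code. The helper names `expected_files` and `missing_files` are hypothetical, and the example uses `mal`, the ISO 639-3 code for Malayalam, purely as an illustration.

```python
from pathlib import Path

# The eight file suffixes named on the slide.
TRAINING_SUFFIXES = [
    "freq-dawg", "word-dawg", "user-words", "inttemp",
    "normproto", "pffmtable", "unicharset", "DangAmbigs",
]


def expected_files(lang, tessdata_dir="tessdata"):
    """Return the paths the training process is expected to produce."""
    return [Path(tessdata_dir) / f"{lang}.{suffix}" for suffix in TRAINING_SUFFIXES]


def missing_files(lang, tessdata_dir="tessdata"):
    """Return the expected paths that do not exist yet."""
    return [p for p in expected_files(lang, tessdata_dir) if not p.is_file()]


if __name__ == "__main__":
    for path in missing_files("mal"):
        print("missing:", path)
```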
The BOX File concept
Command:
    tesseract fontfile.tif fontfile batch.nochop makebox
Sample box file:
    അ 8 682 53 703
    ആ 62 676 112 703
    ഇ 121 676 155 705
    ഈ 165 677 220 705
    ഉ 232 677 256 704
    ഊ 265 677 313 705
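A box file of this kind is straightforward to read programmatically. The sketch below assumes the usual Tesseract column order (character, left, bottom, right, top, with the origin at the bottom-left of the image); the slide itself does not label the columns, so treat that order as an assumption. The `Box` type and the parser names are hypothetical.

```python
from typing import NamedTuple


class Box(NamedTuple):
    char: str
    left: int
    bottom: int
    right: int
    top: int


def parse_box_line(line):
    """Parse one 'char left bottom right top' line (assumed column order)."""
    char, left, bottom, right, top = line.split()[:5]
    return Box(char, int(left), int(bottom), int(right), int(top))


def parse_box_text(text):
    """Parse every non-empty line of a box file's contents."""
    return [parse_box_line(line) for line in text.splitlines() if line.strip()]


SAMPLE = """\
അ 8 682 53 703
ആ 62 676 112 703
ഇ 121 676 155 705
"""

for box in parse_box_text(SAMPLE):
    # Print each character with its box width and height in pixels.
    print(box.char, box.right - box.left, box.top - box.bottom)
```

Having the boxes as structured records makes it easier to spot obviously wrong entries, such as a box with zero or negative width, before training.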
In Kindergarten
His Teacher
JTesseract is the Tesseract GUI responsible for easing the training process.
  • JTesseract is released under the Apache 2.0 license.
  • JTesseract currently works only on the Windows platform.
  • Developed by Ruwan Janapriya Egoda Gamage (http://www.janapriya.net)
Features
  • Visual box file editing
  • Project based training process
His Classmates
nopapaper
LibTIFF
This software provides support for the Tag Image File Format (TIFF), a widely used format for storing image data. The latest version of the TIFF specification is available online in several different formats, as are a number of Technical Notes (TTNs).
Windows GUI
Questions?
Places to see:
  • Front Door: http://code.google.com/p/tesseract-ocr
  • jtesseract: http://code.google.com/p/jtesseract/
  • FreeOCR: http://www.freeocr.net
  • http://www.himili.com/ocr
