Lessons from Indic OCR Development
1. Lessons from Indic OCR Development. National Conference on Free Software. Nishad T R, NIT, Calicut. http://www.himili.com/ocr/
2. Overview: History and Evolution of OCR; When, Where, Why and How of OCR; Selection of an OCR Engine and other gears; Putting it all together, and why Tesseract; Architectural style; Challenges in Indic OCR; Lessons learned and applied; Where is it NOW?
3. OCR in General: Engine; Training Data; Input Tools; Output formatting tools. 15-Nov-08
4. Three competitors. Ocrad: Ocrad is the GNU OCR program. It was written by Antonio Diaz Diaz and is licensed under the GPL. GOCR: GOCR is an OCR program written by Joerg Schulenburg and others. It is licensed under the GPL. Tesseract: Under the sponsorship of Google, Tesseract was made open source in 2006.
7. And the winner is …. Tesseract gives extremely good output at a reasonable speed. It is the clear overall winner of the test. The only caveat is that one absolutely must convert the input to bitonal. Ocrad gives reasonable output at extremely high speed. It can be useful in applications where speed is more important than accuracy. GOCR gives poor output at a slow speed.
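The bitonal caveat above can be illustrated with a minimal sketch. The slides do not say how the input was converted, so the fixed threshold of 128 and the function name `to_bitonal` are assumptions made purely for illustration; a real pipeline would usually rely on an imaging tool and possibly an adaptive threshold.

```python
def to_bitonal(gray, threshold=128):
    """Map 8-bit grayscale pixels to 0 (black) or 1 (white).

    `gray` is a list of rows, each a list of ints in 0..255.
    The fixed threshold is an assumption, not taken from the slides.
    """
    return [[1 if px >= threshold else 0 for px in row] for row in gray]


# Hypothetical 2x3 grayscale patch used only to show the call.
sample = [
    [250, 240, 30],
    [12, 200, 128],
]
bitonal = to_bitonal(sample)
```

Every pixel ends up as either 0 or 1, which is the two-level form the slide says Tesseract requires.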
8. Development Process Evolution. Fostering contributions: developer focus and avoiding starvation; code, code review, documentation, support. Recognizing ego: trust and good intentions; beware of maniacal focus. Limits of volunteerism: eight knives and an apple (dining developer problem); eight knives and a pumpkin; eight pumpkins and no knives.
9. How Debayan tamed Matra http://debayanin.googlepages.com/hackingtesseract
10. And how they performed. To train for another language, you have to create 8 data files in the tessdata subdirectory. Language codes follow the ISO 639-3 standard.
tessdata/xxx.freq-dawg
tessdata/xxx.word-dawg
tessdata/xxx.user-words
tessdata/xxx.inttemp
tessdata/xxx.normproto
tessdata/xxx.pffmtable
tessdata/xxx.unicharset
tessdata/xxx.DangAmbigs
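As a rough sketch of the file list above, the following builds the eight expected paths for a language code and reports which are still missing. The helper names `expected_files` and `missing_files` are hypothetical, and `mal` (the ISO 639-3 code for Malayalam) is used only as an example.

```python
from pathlib import Path

# The eight suffixes listed on the slide.
SUFFIXES = [
    "freq-dawg", "word-dawg", "user-words", "inttemp",
    "normproto", "pffmtable", "unicharset", "DangAmbigs",
]


def expected_files(lang, tessdata="tessdata"):
    """Return the eight training data paths for an ISO 639-3 code."""
    return [Path(tessdata) / f"{lang}.{suffix}" for suffix in SUFFIXES]


def missing_files(lang, tessdata="tessdata"):
    """Return the expected paths that do not exist on disk yet."""
    return [p for p in expected_files(lang, tessdata) if not p.exists()]


if __name__ == "__main__":
    for path in missing_files("mal"):
        print("missing:", path)
```

A check like this can be run after each training step to see which of the eight files remain to be produced.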
11. The BOX File concept
Command: tesseract fontfile.tif fontfile batch.nochop makebox
Sample Box:
അ 8 682 53 703
ആ 62 676 112 703
ഇ 121 676 155 705
ഈ 165 677 220 705
ഉ 232 677 256 704
ഊ 265 677 313 705
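The sample box entries can be read with a few lines of code. This is a minimal sketch that assumes each entry is a character followed by left, bottom, right and top pixel coordinates measured from the bottom-left corner of the image, which is how the Tesseract box format is generally described; the `Box` type and `parse_box_line` helper are hypothetical names.

```python
from typing import NamedTuple


class Box(NamedTuple):
    char: str
    left: int
    bottom: int
    right: int
    top: int

    @property
    def width(self):
        return self.right - self.left

    @property
    def height(self):
        return self.top - self.bottom


def parse_box_line(line):
    """Parse one box entry: a character, then four integer coordinates.

    Any extra trailing fields are ignored (assumption: only the first
    four numbers matter for the format shown on the slide).
    """
    parts = line.split()
    left, bottom, right, top = (int(v) for v in parts[1:5])
    return Box(parts[0], left, bottom, right, top)


box = parse_box_line("അ 8 682 53 703")
print(box.char, box.width, box.height)
```

Reading the file this way makes it easy to spot boxes with implausible sizes before feeding them to training.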
13. His Teacher. JTesseract is the Tesseract GUI responsible for easing the training process. JTesseract is released under the Apache 2.0 license. JTesseract currently works only on the Windows platform. Developed by Ruwan Janapriya Egoda Gamage, http://www.janapriya.net. Features: visual box file editing; project-based training process.
15. LibTIFF. This software provides support for the Tag Image File Format (TIFF), a widely used format for storing image data. The latest version of the TIFF specification is available online in several different formats, as are a number of Technical Notes (TTNs).
17. Questions? Places to see: Front Door http://code.google.com/p/tesseract-ocr; jtesseract http://code.google.com/p/jtesseract/; FreeOCR http://www.freeocr.net; http://www.himili.com/ocr