2. Architecture and Data Structures
A quick tour of the Tesseract Code
Ray Smith, Google Inc.
Tesseract Tutorial: DAS 2014 Tours France
A Note about the Coordinate System
● The pixel edges are aligned with integer coordinates.
● (0, 0) is at bottom-left.
● Width = right - left => no silly +1/-1.
Note: The API exposes a more common top-down system.
[Figure: pixel grid with tick marks 0, 1, 2 on both axes]
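The convention above can be sketched in a few lines. This is an illustration only: `EdgeBox`, `TopDownBox`, and `ToTopDown` are hypothetical names, not Tesseract types, and the flip formula is the usual way of converting between a bottom-left and a top-left origin, assumed here rather than taken from the slides.

```cpp
#include <cassert>

// Hypothetical box in the internal convention: coordinates sit on pixel
// edges and (0, 0) is the bottom-left corner of the image.
struct EdgeBox {
  int left, bottom, right, top;
  // Because edges (not pixel centres) are stored, no +1/-1 is needed.
  int width() const { return right - left; }
  int height() const { return top - bottom; }
};

// Hypothetical box in a top-down convention, with (0, 0) at the top-left.
struct TopDownBox {
  int left, top, right, bottom;
};

// Flip the y axis; image_height is the height of the full image in pixels.
TopDownBox ToTopDown(const EdgeBox& box, int image_height) {
  assert(box.bottom >= 0 && box.top <= image_height);
  return TopDownBox{box.left, image_height - box.top,
                    box.right, image_height - box.bottom};
}
```

For a box covering two pixels across and one pixel up from the origin of a 2-pixel-high image, `EdgeBox{0, 0, 2, 1}` has width 2 and height 1, and `ToTopDown(box, 2)` puts its top at 1 and its bottom at 2.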
Tesseract System Architecture
Nominally a pipeline, but not really, as there is a lot of revisiting of old decisions.
Tesseract Word Recognizer
The ‘C’ Legacy
● Large chunks of the code written originally in C.
● Major rewrite in ~1991 with new C++ code.
● C->C++ migration gradual over time since.
● The majority of formerly global functions now live in a convenience class, one per source directory (for thread compatibility).
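Why moving globals into a class helps with threads can be shown with a small sketch. `SketchClassify` and its members are hypothetical stand-ins, not Tesseract code; the point is only that state owned by an instance is no longer shared by every caller in the process.

```cpp
// Hypothetical convenience class: what used to be a global counter and a
// global function operating on it are now a member variable and a method.
class SketchClassify {
 public:
  // Formerly a free function that incremented a global variable.
  int CountBlob() { return ++blobs_seen_; }
  int blobs_seen() const { return blobs_seen_; }

 private:
  int blobs_seen_ = 0;  // formerly a global variable
};
```

Two `SketchClassify` objects keep separate counts, so two recognizer instances (for example one per thread) would not trample each other's state.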
Directory Structure ~ Functional Architecture
Directory   Class
API         TessBaseAPI
ccutil      CCUtil
cutil       CUtil
ccstruct    CCStruct
dict        Dict
classify    Classify
wordrec     Wordrec
textord     Textord
ccmain      Tesseract
cube
Key Data Structures = Page Hierarchy
Core page outlines:    BLOCK, ROW, WERD, C_BLOB, C_OUTLINE
Layout (old):          TO_BLOCK, TO_ROW, BLOBNBOX
Layout:                WorkingPartSet, ColPartition
Normalized outlines:   TWERD, TBLOB, TESSLINE, EDGEPT, TPOINT
Results:               PAGE_RES, BLOCK_RES, ROW_RES, WERD_RES, WERD_CHOICE, BLOB_CHOICE
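As a rough picture of what a page hierarchy means in code, here is a sketch with one level of containment per structure. The `Page*` types are hypothetical and use `std::vector` for brevity; they only mirror the BLOCK, ROW, WERD, C_BLOB, C_OUTLINE nesting, which is assumed here to run from block down to outline, and say nothing about the fields of the real classes.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical, simplified stand-ins for the core page structures.
struct PageOutline { int point_count; };                    // cf. C_OUTLINE
struct PageBlob { std::vector<PageOutline> outlines; };     // cf. C_BLOB
struct PageWord { std::vector<PageBlob> blobs; };           // cf. WERD
struct PageRow { std::vector<PageWord> words; };            // cf. ROW
struct PageBlock { std::vector<PageRow> rows; };            // cf. BLOCK

// Walk the hierarchy top-down and count the blobs in one block.
std::size_t CountBlobs(const PageBlock& block) {
  std::size_t total = 0;
  for (const PageRow& row : block.rows) {
    for (const PageWord& word : row.words) {
      total += word.blobs.size();
    }
  }
  return total;
}
```

Most passes over a page follow the same shape: iterate blocks, then rows, then words, and do the real work at the blob or outline level.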
Software Engineering - Building Blocks
Text:         UNICHARSET, STRING
Containers:   GenericVector, ELIST, CLIST
Coordinates:  TBOX, ICOORD, FCOORD
Key Parts of the Call Hierarchy
TessBaseAPI::Recognize
  Tesseract::SegmentPage
    Tesseract::AutoPageSeg
    Textord::TextordPage
  Tesseract::recog_all_words
    Tesseract::classify_word_and_language
      Tesseract::chop_word_main
        Wordrec::SegSearch
          Classify::AdaptiveClassifier
          LanguageModel::UpdateState
Tesseract’s List Implementation
● Predates STL
● Allows control over ownership of list elements
● Uses nasty macros instead of templates
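The three bullets above can be illustrated with a toy intrusive list stamped out by a macro. This is a sketch of the general technique, not Tesseract's ELIST code: `SketchLink`, `SKETCH_LISTIZE`, and `DemoBlob` are hypothetical names, and the real implementation has many more operations.

```cpp
#include <cassert>

// Hypothetical link base class: elements carry their own next pointer.
struct SketchLink {
  SketchLink* next = nullptr;
};

// Macro that generates a typed list class for CLASS (no templates involved).
// The list owns whatever is still linked into it when it is destroyed;
// extract_front() hands ownership back to the caller.
#define SKETCH_LISTIZE(CLASS)                                              \
  class CLASS##_SKETCH_LIST {                                              \
   public:                                                                 \
    CLASS##_SKETCH_LIST() = default;                                       \
    CLASS##_SKETCH_LIST(const CLASS##_SKETCH_LIST&) = delete;              \
    CLASS##_SKETCH_LIST& operator=(const CLASS##_SKETCH_LIST&) = delete;   \
    ~CLASS##_SKETCH_LIST() {                                               \
      while (!empty()) delete extract_front();                             \
    }                                                                      \
    bool empty() const { return head_ == nullptr; }                        \
    void add_front(CLASS* item) {                                          \
      assert(item != nullptr && item->next == nullptr);                    \
      item->next = head_;                                                  \
      head_ = item;                                                        \
    }                                                                      \
    CLASS* extract_front() {                                               \
      CLASS* item = static_cast<CLASS*>(head_);                            \
      if (item != nullptr) {                                               \
        head_ = item->next;                                                \
        item->next = nullptr;                                              \
      }                                                                    \
      return item;                                                         \
    }                                                                      \
                                                                           \
   private:                                                                \
    SketchLink* head_ = nullptr;                                           \
  };

// An element type opts in by deriving from the link class.
struct DemoBlob : SketchLink {
  explicit DemoBlob(int h) : height(h) {}
  int height;
};

// Defines class DemoBlob_SKETCH_LIST.
SKETCH_LISTIZE(DemoBlob)
```

Because the link lives inside the element, moving an element between lists is a pointer update with no allocation or copy, and ownership is explicit: an element belongs either to a list or to whoever extracted it.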
List Example
tordmain.cpp:
  float Textord::filter_noise_blobs(
      BLOBNBOX_LIST *src_list,      // original list
      BLOBNBOX_LIST *noise_list,    // noise list
      BLOBNBOX_LIST *small_list) {  // small blobs
    BLOBNBOX_IT src_it(src_list);   // iterators
    BLOBNBOX_IT noise_it(noise_list);
    BLOBNBOX_IT small_it(small_list);
    // Walk the list once: mark the start, stop when we cycle back to it.
    for (src_it.mark_cycle_pt(); !src_it.cycled_list();
         src_it.forward()) {
      BLOBNBOX *blob = src_it.data();
      if (blob->bounding_box().height() < textord_max_noise_size)
        noise_it.add_after_then_move(src_it.extract());  // move to noise list
      else if (blob->enclosed_area() >=
               blob->bounding_box().area() * textord_noise_area_ratio)
        small_it.add_after_then_move(src_it.extract());  // move to small list
    }
    // (rest of the function not shown on the slide)
  }
blobbox.h:
  class BLOBNBOX : public ELIST_LINK {
    …
  };
  // Defines classes:
  //   BLOBNBOX_LIST: a list of BLOBNBOX
  //   BLOBNBOX_IT: list iterator
  ELISTIZEH(BLOBNBOX)

blobbox.cpp:
  // Implementation of some of the
  // list functions.
  ELISTIZE(BLOBNBOX)
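For readers more at home with the standard library, the same move-between-lists pattern can be approximated with `std::list::splice`, which relinks a node without copying it, much as `extract()` followed by `add_after_then_move()` does above. This is only an analogy under stated assumptions: `SketchBlob`, `FilterNoiseSketch`, and the threshold parameters are hypothetical and do not reproduce Tesseract's actual filtering logic.

```cpp
#include <cassert>
#include <iterator>
#include <list>

// Hypothetical blob summary; the real BLOBNBOX computes these on demand.
struct SketchBlob {
  int height;         // bounding-box height
  int box_area;       // bounding-box area
  int enclosed_area;  // area enclosed by the outline
};

// Move short blobs to *noise and sufficiently filled blobs to *small,
// leaving everything else in *src.
void FilterNoiseSketch(std::list<SketchBlob>* src,
                       std::list<SketchBlob>* noise,
                       std::list<SketchBlob>* small,
                       int max_noise_size, double noise_area_ratio) {
  assert(src != nullptr && noise != nullptr && small != nullptr);
  for (auto it = src->begin(); it != src->end();) {
    auto next = std::next(it);  // remember the successor before relinking
    if (it->height < max_noise_size) {
      noise->splice(noise->end(), *src, it);
    } else if (it->enclosed_area >= it->box_area * noise_area_ratio) {
      small->splice(small->end(), *src, it);
    }
    it = next;
  }
}
```

The one thing to watch in both versions is iteration while removing: the Tesseract iterator handles an `extract()` at the current position itself, while the sketch saves `next` before splicing.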
TessBaseAPI : Simple example
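Main API class provides initialization, image input, and text/hOCR/PDF output:
  #include <cstdio>
  #include <tesseract/baseapi.h>
  #include <leptonica/allheaders.h>

  tesseract::TessBaseAPI api;
  api.Init(NULL, "eng");                // NULL datapath: default tessdata location
  Pix* pix = pixRead("phototest.tif");  // Leptonica image reader
  api.SetImage(pix);
  char* text = api.GetUTF8Text();       // caller owns the returned buffer
  printf("%s\n", text);
  delete [] text;
  pixDestroy(&pix);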
TessBaseAPI : Multipage example
  // `filename` (input path) and `fout` (an open FILE*) are assumed to be
  // set up by the caller.
  tesseract::TessBaseAPI api;
  api.Init(NULL, "eng");
  tesseract::TessResultRenderer* renderer =
      new tesseract::TessPDFRenderer(api.GetDatapath());
  api.ProcessPages(filename, NULL, 0, renderer);
  const char* data;
  inT32 data_len;
  if (renderer->GetOutput(&data, &data_len)) {
    fwrite(data, 1, data_len, fout);
    fclose(fout);
  }
  delete renderer;
ResultIterator for getting the real details
  tesseract::ResultIterator* it = api.GetIterator();
  if (it != NULL) {  // NULL when there are no recognition results
    do {
      int left, top, right, bottom;
      if (it->BoundingBox(tesseract::RIL_WORD, &left, &top, &right, &bottom)) {
        char* text = it->GetUTF8Text(tesseract::RIL_WORD);
        printf("%s %d %d %d %d\n", text, left, top, right, bottom);
        delete [] text;
      }
    } while (it->Next(tesseract::RIL_WORD));
    delete it;
  }
Thanks for Listening!
Questions?
