Bureau Ingénieur Tomasi S.A.R.L.
Adaptive
OCR
Solution
______________________________________________________________________________________
B.I.T. Bureau Ingénieur Tomasi Sarl – 2649 route de Pujaudran - 31530 Mérenvielle -Tel : 05 62 13 95 32 - e-mail :
bit.entreprise.at@gmail.com
N° SIRET 503 902 983 00017 R. C. TOULOUSE
Summary
I. General presentation
II. Binarisation
III. Segmentation
IV. OCR Recognition
V. Sequencer
VI. Post-OCR correction with Spellchecking
VII. Pictures Treatment/Export
VIII. Export of content
IX. Contact
I. General presentation
B.I.T. has developed an adaptive OCR solution called BIT-Alpha.
This semi-automatic OCR adapts to all types of text, regardless of their
language, typeface or age.
Developed specifically for processing historical and heritage documents,
BIT-Alpha supports scientific research and makes document content accessible.
BIT-Alpha is a single tool covering the whole workflow:
• Binarisation
• Segmentation
• OCR recognition
• Post-OCR correction with spellchecking
• Picture processing/Export
• Export of content
II. Binarisation
BIT-Alpha offers three binarisation modes:
• Threshold binarisation, ideal for newspapers
BIT-Alpha analyses the document region by region rather than as a whole, so the
threshold is not the same at the top, at the bottom or in the corners of the
page. The binarisation therefore adapts to the different contrast levels within
the document.
• Binarisation with the “Niblack” algorithm
BIT-Alpha analyses the contrast variance around each letter. This lets it
distinguish a letter from a colour spot close to it, and so remove background
noise without removing parts of the letter.
BIT-Alpha performs the variance analysis over pixel neighbourhoods and so
determines whether a pixel belongs to a text area, a non-text area, an interline
space or a picture.
• Binarisation based on an algorithm developed by B.I.T.
Using this advanced spectral-decomposition algorithm, BIT-Alpha can redraw and
reconstruct damaged letters, as if it were choosing the optimal paint brush
(fine or broad). It also preserves very fine character strokes that other
algorithms may delete.
These binarisation modes prepare the document as well as possible in order to
get the best OCR results achievable for historical and heritage documents.
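The Niblack mode described above can be illustrated with the textbook form of the algorithm, in which each pixel is compared with a local threshold T = m + k·s, where m and s are the mean and standard deviation of its neighbourhood. The following is a minimal sketch in pure Python, not BIT-Alpha's implementation: the function name `niblack_binarise`, the window radius and the value k = -0.2 are illustrative assumptions.

```python
from statistics import mean, pstdev


def niblack_binarise(image, radius=1, k=-0.2):
    """Binarise a greyscale image given as rows of values (0 = black, 255 = white).

    Returns rows of 1 (ink) and 0 (background). A pixel counts as ink when
    its value is below mean + k * std of the surrounding window
    (textbook Niblack rule; radius and k are assumed values).
    """
    height, width = len(image), len(image[0])
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            # Collect the neighbourhood, clipped at the image borders.
            window = [
                image[j][i]
                for j in range(max(0, y - radius), min(height, y + radius + 1))
                for i in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            threshold = mean(window) + k * pstdev(window)
            row.append(1 if image[y][x] < threshold else 0)
        result.append(row)
    return result


# A single dark pixel on a light background is kept as ink.
page = [
    [200, 200, 200],
    [200, 40, 200],
    [200, 200, 200],
]
print(niblack_binarise(page))
```

Because the threshold is recomputed for every window, a page with uneven lighting or staining gets a different cut-off in each area, which is the property the text attributes to local binarisation.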
III. Segmentation
BIT-Alpha segments titles, subtitles, pictures, picture captions, chapters
and articles, for example in newspapers:
Fraktur, dated 1805 to 1944: segmentation of title, subtitles and chapters
During segmentation, BIT-Alpha detects each line, then each word within a line,
and then each character within a word.
Note that BIT-Alpha can output the position of each character (for example to
an ALTO file).
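The line, word and character hierarchy described above maps naturally onto nested elements with bounding boxes. The sketch below shows one way to hold such a hierarchy and serialise it in an ALTO-style layout. The class and function names are hypothetical, and the element names (`TextLine`, `String`, `Glyph`) and attributes follow the ALTO schema only as an assumption to be checked against the schema version in use.

```python
from dataclasses import dataclass, field
import xml.etree.ElementTree as ET


@dataclass
class Box:
    hpos: int    # left edge
    vpos: int    # top edge
    width: int
    height: int


@dataclass
class Glyph:
    content: str
    box: Box


@dataclass
class Word:
    glyphs: list = field(default_factory=list)

    @property
    def content(self):
        return "".join(g.content for g in self.glyphs)

    @property
    def box(self):
        return union(g.box for g in self.glyphs)


def union(boxes):
    """Smallest box that encloses all given boxes."""
    boxes = list(boxes)
    left = min(b.hpos for b in boxes)
    top = min(b.vpos for b in boxes)
    right = max(b.hpos + b.width for b in boxes)
    bottom = max(b.vpos + b.height for b in boxes)
    return Box(left, top, right - left, bottom - top)


def box_attrs(box):
    return {"HPOS": str(box.hpos), "VPOS": str(box.vpos),
            "WIDTH": str(box.width), "HEIGHT": str(box.height)}


def line_to_alto(words):
    """Build a TextLine element with one String per word and one Glyph per character."""
    line = ET.Element("TextLine", box_attrs(union(w.box for w in words)))
    for word in words:
        string = ET.SubElement(
            line, "String", {"CONTENT": word.content, **box_attrs(word.box)})
        for glyph in word.glyphs:
            ET.SubElement(
                string, "Glyph", {"CONTENT": glyph.content, **box_attrs(glyph.box)})
    return line


word = Word([Glyph("i", Box(10, 5, 4, 12)), Glyph("n", Box(15, 8, 8, 9))])
print(ET.tostring(line_to_alto([word]), encoding="unicode"))
```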
IV. OCR Recognition
Developed for processing historical and heritage documents, BIT-Alpha is an
adaptive OCR capable of adapting to all types of text, regardless of their
language, typeface or age.
Characters can be learned manually or automatically:
• Manual
Training with human intervention:
The digital signatures of characters are stored in memory.
Because the “image” of a character is much larger than its digital signature,
BIT-Alpha can build larger databases than tools that store character “images”.
• Automatic
Training without human intervention:
➢ BIT-Alpha can learn characters automatically from the text being
processed. During a batch process, BIT-Alpha reads and recognises the
characters it already knows; those recognised with high reliability are then
used to train the OCR engine. As a result, BIT-Alpha’s reliability increases
with each processed page.
➢ A spellchecking database adapted to the type of documents being processed
(for example a Latin database) can be loaded into BIT-Alpha. When BIT-Alpha
recognises a word from the database, it automatically learns all the
characters that make up this word. BIT-Alpha can handle databases of more
than 500 000 words.
➢ BIT-Alpha can identify the typefaces used in a text, even when they are
mixed: Gothic (before 1845), Fraktur (after 1845), Antiqua, Cursive, Greek,
Hebrew...
➢ BIT-Alpha can recognise and read embellished letters, miniatures and
abbreviations, and can deal with unusual characters.
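The two automatic training routes described above (learning from characters recognised with high reliability, and learning from words found in the spellchecking database) can be sketched as follows. This is an illustration under stated assumptions, not BIT-Alpha's engine: the `SignatureStore` class, the tuple-based "signature" and the 0.95 confidence threshold are all hypothetical.

```python
from collections import defaultdict


class SignatureStore:
    """Keeps, for each character, the signatures (feature tuples) learned so far."""

    def __init__(self):
        self.samples = defaultdict(list)

    def add(self, char, signature):
        self.samples[char].append(tuple(signature))

    def count(self, char):
        return len(self.samples.get(char, []))


def self_train(store, recognised, min_confidence=0.95):
    """Learn only from characters recognised with high reliability.

    `recognised` is an iterable of (char, signature, confidence) triples.
    Returns the number of samples added to the store.
    """
    added = 0
    for char, signature, confidence in recognised:
        if confidence >= min_confidence:
            store.add(char, signature)
            added += 1
    return added


def learn_from_dictionary(store, word_glyphs, lexicon):
    """If the recognised word is in the lexicon, learn all of its characters.

    `word_glyphs` is a list of (char, signature) pairs for one word.
    Returns True when the word was found and learned.
    """
    word = "".join(char for char, _ in word_glyphs)
    if word not in lexicon:
        return False
    for char, signature in word_glyphs:
        store.add(char, signature)
    return True


store = SignatureStore()
page = [("a", (1, 0), 0.99), ("b", (0, 1), 0.60), ("a", (1, 1), 0.97)]
print(self_train(store, page), store.count("a"), store.count("b"))
```

Gating on confidence keeps doubtful readings out of the training set, so each processed page can only add samples the engine was already fairly sure about.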
V. Sequencer
The Sequencer can:
• Reconstruct fragmented characters: a letter is sometimes fragmented
into two or more parts. BIT-Alpha recognises the fragments of a letter and
reassembles them.
Recognition of the right-hand side of a lower-case “n” (RKN)
Recognition of the left-hand side of a lower-case “n” (LKN)
Assembly of the two fragments by the Sequencer and reconstruction of
the “n”
• Expand abbreviations
In Roman writing, a “q” followed by “;” stands for “que”.
• Correct wrong sequences of letters
Where another OCR reads “nnn”, the Sequencer corrects it to “mm”. BIT-Alpha
takes into account the typical letter sequences of the language of the document
being processed and can therefore correct wrong sequences.
For example, in Latin the wrong sequence “dcn” is changed into the typical
one, “den”. Likewise, the wrong sequence “qn” is changed into “qu”, the
typical one in Latin.
The Sequencer comes with more than 900 sequences preprogrammed in BIT-Alpha’s
database. With each use, the Sequencer’s database can be extended and,
conversely, preprogrammed sequences that cause problems can be removed.
VI. Post-OCR correction with Spellchecking
BIT-Alpha’s post-OCR correction is based on the Levenshtein distance
algorithm. BIT-Alpha computes the edit distance between two words: the word in
the text and the reference word from the database. Different editing operations
correspond to different OCR mistakes and may carry different weights. With
this technique BIT-Alpha can reconstitute words or separate them with
blanks where needed. For example, compound words (very common in
German) can be verified by checking their components individually against
known words from the database. For Latin texts, where compound words
rarely occur, BIT-Alpha instead inserts blanks between words that have run
together.
The post-OCR correction can be switched off, and the user can adjust how
aggressively it corrects the raw OCR results.
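The ideas in this section (an edit distance with operation-specific weights, a limit on how far a correction may go, and splitting of run-together words against a word list) can be sketched as below. This is a generic illustration of weighted Levenshtein correction, not BIT-Alpha's code: the function names, the cost table `OCR_COSTS` and the `max_distance` parameter are assumptions.

```python
def weighted_levenshtein(source, target, sub_cost=None, indel_cost=1.0):
    """Edit distance where selected substitutions can be cheaper than 1.0.

    `sub_cost` maps (read_char, intended_char) to a cost, so that confusions
    typical of OCR (for example "c" read for "e") weigh less.
    """
    sub_cost = sub_cost or {}
    previous = [j * indel_cost for j in range(len(target) + 1)]
    for i, s in enumerate(source, start=1):
        current = [i * indel_cost]
        for j, t in enumerate(target, start=1):
            if s == t:
                substitution = previous[j - 1]
            else:
                substitution = previous[j - 1] + sub_cost.get((s, t), 1.0)
            current.append(min(
                previous[j] + indel_cost,     # delete s
                current[j - 1] + indel_cost,  # insert t
                substitution,                 # keep or substitute
            ))
        previous = current
    return previous[-1]


def correct(word, lexicon, max_distance=1.0, sub_cost=None):
    """Return the closest lexicon entry, or the word itself if none is close enough.

    A smaller `max_distance` makes the correction less aggressive.
    Assumes a non-empty lexicon.
    """
    best = min(lexicon, key=lambda entry: weighted_levenshtein(word, entry, sub_cost))
    if weighted_levenshtein(word, best, sub_cost) <= max_distance:
        return best
    return word


def split_words(text, lexicon):
    """Split run-together text into lexicon words (longest first), or return None."""
    if not text:
        return []
    for end in range(len(text), 0, -1):
        if text[:end] in lexicon:
            rest = split_words(text[end:], lexicon)
            if rest is not None:
                return [text[:end]] + rest
    return None


OCR_COSTS = {("c", "e"): 0.3}  # hypothetical weight for a common confusion
print(weighted_levenshtein("dcn", "den", OCR_COSTS))
print(correct("dcn", ["den", "dum"], max_distance=0.5, sub_cost=OCR_COSTS))
print(split_words("inprincipio", {"in", "principio"}))
```

Setting `max_distance` to 0 effectively switches the correction off, which mirrors the adjustable aggressiveness mentioned in the text.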
VII. Pictures Treatment/Export
BIT-Alpha has advanced technology for processing pictures (for
example in newspapers).
BIT-Alpha can detect pictures, remove the dithering by interpolation and
deliver a high-quality true-color digital image.
Dithered image (binary)
Interpolated image without dithering (greyscale)
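The text does not say how BIT-Alpha interpolates dithered pictures. As a rough illustration of the principle only, the sketch below turns a binary dithered image into greyscale by averaging each pixel with its neighbours (a simple box filter). The function name `interpolate_dithered` and the window radius are assumptions.

```python
def interpolate_dithered(binary, radius=1):
    """Convert a dithered binary image to greyscale by local averaging.

    `binary` is a list of rows of 0 (black) and 1 (white). Each output
    pixel is the share of white pixels in its neighbourhood, scaled to 0..255.
    """
    height, width = len(binary), len(binary[0])
    grey = []
    for y in range(height):
        row = []
        for x in range(width):
            # Neighbourhood clipped at the image borders.
            window = [
                binary[j][i]
                for j in range(max(0, y - radius), min(height, y + radius + 1))
                for i in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            row.append(round(255 * sum(window) / len(window)))
        grey.append(row)
    return grey


# A checkerboard dither pattern becomes a mid-grey area.
checkerboard = [
    [0, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
]
for row in interpolate_dithered(checkerboard):
    print(row)
```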
VIII. Export of content:
The results can be exported in different formats, for example:
• TXT
• PDF with highlighting (text as a transparent overlay on the original image,
so that it can be searched, selected and copied)
• BIT-Alpha can create a lightweight PDF by reducing the resolution (dpi) of
the document in order to facilitate exchange of the document or online
publication.
• ALTO (positions in pixels or in tenths of a millimetre)
• TEI
• HTML
The HTML export from BIT-Alpha keeps mathematical formulae, pictures,
etc. and places them at the same position as in the original
document.
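The ALTO export listed above can express positions in pixels or in tenths of a millimetre. Converting between the two depends only on the scan resolution, since one inch is 25.4 mm, that is 254 tenths of a millimetre. The helper names below are hypothetical and the sketch assumes the dpi of the scanned image is known.

```python
def pixels_to_mm10(pixels, dpi):
    """Convert a length in pixels to tenths of a millimetre at the given dpi."""
    return pixels * 254 / dpi


def mm10_to_pixels(mm10, dpi):
    """Convert a length in tenths of a millimetre back to pixels."""
    return mm10 * dpi / 254


# At 300 dpi, 300 pixels span one inch, i.e. 254 tenths of a millimetre.
print(pixels_to_mm10(300, 300))
print(mm10_to_pixels(254, 300))
```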
IX. Contact
Head of sales department
Anne Tomasi,
+33 786 844 845
bit.entreprise.at@gmail.com
