Imago OCR
             Open-source toolkit for chemical
               structure image recognition




             http://ggasoftware.com/opensource/imago/
14/08/2012                 GGA Software Services LLC    1
Project goals
      • Perform optical chemical structure
        recognition for a wide range of
        raster images:
             – different image formats
             – various scanning quality (or even photos)
             – complex structures and uncommon features

      • Provide a complete toolset for embedding the
        recognition engine in any other application

Applications
      • Automated processing of articles and patents
             – similarity analysis
      • Chemical database search (PubChem, etc.)
      • “The Deep Web indexing”
             – development of a universal chemical search
               engine;
             – conversion of human-readable data to
               machine-readable formats

Use case
      • Source image (input):
             – BMP, DIB, JPG, JPE, PNG, PBM, PGM, PPM, SR, RAS, TIFF
             – Images from scanner/camera
             – PDF document
      • imago converts the source image to MOL format (output):
             – MDL Molfile
             – SMILES (requires Indigo)
             – Rendered image (requires Indigo)



Supported features

      • Multiple bonds

      • Single-up & single-down bonds

      • Bridged bonds

      • Aromatic rings


Supported features
      • Superatom labels,
        charges, isotopes

      • Abbreviations expansion

      • R-groups handling

      • Query features

Engine structure


      • Raster level: image loader, prefilter & binarization
      • Primitives level: vectorization & separation
      • Structural level: logical layout analyzer, molecule export




Preliminary filters
      • Pass-through filter
             – For rendered images (only binarization)
      • Cross-correlation based filter
             – For scanned images (quite fast)
      • Logical analysis based filter
             – For low-quality photos
             – Takes some time for processing
      • Imago can auto-detect a suitable filter

Cross-correlation based filter
      • Figure panels: source image, strong threshold, weak threshold
      • Filter result: an image assembled from those weak-threshold
        image segments whose cross-correlation (CC) value with the
        corresponding strong-threshold image segments passes the
        restrictions



Logical analysis based filter
      •      Removes noise (spots, light glares)
      •      Suitable for out-of-focus images
      •      Can process low-contrast images
      •      Removes unusual artifacts
      •      Deals with multicolor photos

      • Keywords: Wiener filtering, wave
        algorithm, weak segmentation

Preliminary separation
      • Separate labels and graphics:




      •      Hu moments classifier (d1)
      •      Contours analysis (d2)
      •      Approximation criteria (d3)
      •      Object is symbol if f(d1, d2, d3) > c0

Vectorization
      • Convert pixels to a matching polyline:




      • Minimization of mean distance between
        original and vectorized structure
             – Penalty for extra segments
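The minimization above can be sketched as a cost function: the mean distance from the source pixels to a candidate polyline, plus a fixed penalty for every segment beyond the first. This is only an illustration under stated assumptions; the names `polyline_cost` and `best_polyline` and the penalty value 0.5 are hypothetical and not taken from Imago.

```python
import math

def point_segment_distance(p, a, b):
    """Euclidean distance from point p to the segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    length_sq = dx * dx + dy * dy
    if length_sq == 0:
        return math.hypot(px - ax, py - ay)
    # Projection parameter, clamped so the foot stays on the segment.
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / length_sq))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def polyline_cost(pixels, polyline, penalty=0.5):
    """Mean pixel-to-polyline distance plus a penalty per extra segment."""
    segments = list(zip(polyline, polyline[1:]))
    mean_dist = sum(
        min(point_segment_distance(p, a, b) for a, b in segments)
        for p in pixels
    ) / len(pixels)
    return mean_dist + penalty * (len(segments) - 1)

def best_polyline(pixels, candidates, penalty=0.5):
    """Pick the candidate polyline with the lowest cost."""
    return min(candidates, key=lambda c: polyline_cost(pixels, c, penalty))

# An L-shaped stroke: one straight chord fits badly, two segments fit
# exactly, and a third segment only adds penalty.
pixels = [(x, 0) for x in range(6)] + [(5, y) for y in range(1, 6)]
candidates = [
    [(0, 0), (5, 5)],
    [(0, 0), (5, 0), (5, 5)],
    [(0, 0), (3, 0), (5, 0), (5, 5)],
]
best = best_polyline(pixels, candidates)
```

The penalty term is what keeps the search from adding vertices indefinitely: extra segments never increase the mean distance, so without it the longest candidate would always win or tie.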

Logical layout analysis
      • Mapping labels to bonds
             – Group labels into superatoms
      • Finding multiple bonds
             – Dissolving short edges
             – Connecting bridged bonds
      • Removal of captions that are clearly unrelated
      • Detection of aromatic rings
             – Figuring out the orientation of stereo bonds and
               aromatizing the molecule if circles are present

Adaptive methods or particular cases?
      • Adaptive methods
             – Based on optimization of some function
             – Wider input class range
             – Probably better results in hard cases
      • Particular-case methods
             – Based on some criteria
             – Stability
             – Good performance
             – Easier implementation


Particular case methods

      • What is it?

      • Line? Tested line criteria: no.
      • Character? Tested against ‘A’: no.
                 … Tested against ‘Z’: no.
      • Ring? no.
      • Unrecognizable object – ignore.
Adaptive methods

      • What is it?

      • Line: approximation: d=1.6
      • Character? Compared with ‘C’: d=6.1
                 … Compared with ‘L’: d=3.2
      • Ring? approximation: d=653.3
      • Final decision depends on neighbors
Decision tree

      • Left branch: the object is a bond
             – Neighboring label with d=0.1 (almost surely recognized)
             – Segments group is recognized as bond + label with
               d=0.1+1.6=1.7
      • Right branch: the object is a letter ‘l’
             – Neighbors: bond with d=0.0, “C” with d=0.1
             – Segments group is recognized as bond + label of
               two chars with d=0.0+0.1+3.2=3.3
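The arithmetic behind the two interpretations can be reproduced in a few lines. The sketch below uses the metric values from the slides and assumes that the interpretation with the smaller total is preferred; the dictionary layout and names are illustrative only.

```python
def total_distance(parts):
    """Sum the metric values of all objects in one interpretation."""
    return sum(d for _, d in parts)

# Two interpretations of the same group of segments (values from the slides).
hypotheses = {
    "bond + label": [("bond", 1.6), ("label", 0.1)],
    "bond + two-character label": [("bond", 0.0), ("C", 0.1), ("l", 3.2)],
}

# Round to one decimal to hide floating-point noise in the sums.
scores = {name: round(total_distance(parts), 1)
          for name, parts in hypotheses.items()}

# Assumption: the interpretation with the minimal total distance wins.
best = min(scores, key=scores.get)
```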


Metrics
      • For symbols
             – Distance between sets of Fourier descriptors
      • For graphics
             – Distance between approximated and source image
      • For single-up bonds
             – f(average fill, relative size, etc.)
      • For single-down bonds
             – f(distance between segments, line thickness, etc.)
      • … (every recognition method has a metric
        function)

Labels correction
      • Any recognized symbol can have alternatives,
        e.g. A (metric value of 3.2), R (4.9), P (5.0)
      • Imago keeps information about probable captions
        (periodic table, abbreviations)
      • Labels correction: select the combination of
        symbol alternatives that forms a probable caption and
        has the minimal sum of metric values
      • Makes it possible to recognize partially broken labels
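Labels correction can be pictured as a small combinatorial search: enumerate the combinations of per-symbol alternatives, keep only those that spell a known caption, and take the one with the smallest metric sum. The sketch below is a simplified illustration; the tiny `KNOWN_CAPTIONS` set, the function name and the sample metric values are all made up for the example.

```python
from itertools import product

# Assumed stand-in for the periodic table and abbreviation dictionary.
KNOWN_CAPTIONS = {"Cl", "Ca", "OH", "Br"}

def correct_label(alternatives, known=KNOWN_CAPTIONS):
    """alternatives: one list of (char, metric) pairs per symbol position.

    Returns (caption, total_metric) for the known caption with the minimal
    metric sum, or None when no combination spells a known caption.
    """
    best = None
    for combo in product(*alternatives):
        caption = "".join(ch for ch, _ in combo)
        total = sum(d for _, d in combo)
        if caption in known and (best is None or total < best[1]):
            best = (caption, total)
    return best

# The top-ranked characters spell "GI", which is not a valid caption,
# so the search falls back to the cheapest valid combination.
alternatives = [
    [("G", 0.9), ("C", 1.1), ("O", 2.5)],
    [("I", 0.8), ("l", 1.0), ("a", 4.0)],
]
result = correct_label(alternatives)
```

An exhaustive product grows quickly with label length, which is one reason to cap the number of alternatives kept per symbol.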

Recognition
      • Image recognition is a search for the vectorized
        result that gives the minimal distance between the
        vectorized form and the original image
      • Can be formalized depending on metrics
      • Search is exhaustive
             – Needs some restrictions to achieve good speed




Trade-off: restricted adaptive methods
      • Limit metric values: d < 0.5 means surely; d > 10.0 means
        impossible
      • Limit Euclidean distances for neighbor search (up to
        100 pixels)
      • Limit the number of alternatives (no more than 10)
      • Assume the image filling rate is less than 10%
      • Assume the distances between single-down bond segments
        are in the range 5..10 pixels
      • Assume the symbol aspect ratio is in the range 0.5..2.0
      • Some more assumptions with “magic” constants
      • This gains speed and stability
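The first and third restrictions can be sketched as a pruning step applied to the candidate list of one object. The thresholds come from the slide; how they are applied (a sure match ends the search, impossible candidates are dropped) is an assumption made for this illustration, as is the function name `prune`.

```python
SURE = 0.5             # d < 0.5: treated as a sure match
IMPOSSIBLE = 10.0      # d > 10.0: treated as impossible
MAX_ALTERNATIVES = 10  # keep at most this many candidates

def prune(candidates):
    """candidates: list of (name, d). Returns the shortlist to explore."""
    ranked = sorted(candidates, key=lambda c: c[1])
    sure = [c for c in ranked if c[1] < SURE]
    if sure:
        # Assumption: a sure match makes further alternatives unnecessary.
        return sure[:1]
    plausible = [c for c in ranked if c[1] <= IMPOSSIBLE]
    return plausible[:MAX_ALTERNATIVES]

# Values from the "Adaptive methods" slide: the ring hypothesis is dropped.
shortlist = prune([("ring", 653.3), ("C", 6.1), ("L", 3.2), ("line", 1.6)])

# A near-perfect match short-circuits the search.
sure_only = prune([("O", 2.0), ("C", 0.1)])
```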


Configuration clusters
      • For scanned images
             – Strict adaptive method limits (fast, <300ms per image)
      • For photos and low-quality images
             – Flexible limits (less than a second per image on average)
      • For high-resolution images
             – up to 5 seconds
      • For handwritten structures
             – up to 10 seconds in complex cases
      • Imago supports auto-detection of suitable
        configuration cluster


Configuration cluster creation
      • Makes it possible to achieve a better recognition success rate
        for a specific type of images:
             – different render type
             – images captured differently (scanner type, lighting
               conditions, etc.)
      • The process is automated
             – a test set of the target image type is required
             – takes some time
             – an application of machine learning

Machine learning
      • Test set: a number of pairs (image; related MDL
        molfile)
      • Imago will tune the method parameters to
        gain the best score on the test collection
             – Metrics included
             – No information directly related to the test set (such as a
               character table) is stored
      • Criteria for the complete set can be formed from a
        small subset of the same type
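The slides do not say which optimization procedure Imago uses for tuning. As a stand-in, the sketch below shows the simplest possible one, an exhaustive grid search that keeps the parameter combination with the best score. The parameter names `threshold` and `max_gap` and the `toy_score` function are hypothetical; in a real setup the score would come from recognizing every test image and comparing the result with its molfile.

```python
from itertools import product

def tune(parameter_grid, score):
    """Try every parameter combination and keep the best-scoring one."""
    names = sorted(parameter_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(parameter_grid[n] for n in names)):
        params = dict(zip(names, values))
        current = score(params)
        if current > best_score:
            best_params, best_score = params, current
    return best_params, best_score

def toy_score(params):
    """Placeholder score that peaks at threshold=0.4, max_gap=7."""
    return -abs(params["threshold"] - 0.4) - abs(params["max_gap"] - 7) / 10

grid = {"threshold": [0.2, 0.4, 0.6], "max_gap": [5, 7, 9]}
best_params, best_score = tune(grid, toy_score)
```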

Learning effectiveness
      • Used the Image2Structure test set with a different
        renderer
      • Initial results (before training): 202/944
        correct, similarity value: 74.54%
      • Trained on a set of 50 images with the new renderer
      • Trained results: 831/944 correct, similarity
        value: 98.33% on the whole set

Comparison: overall scores 1
      • Image2Structure set from TREC 2011 Chemical IR Track
        (ambiguous & partial structures removed): original files

                                  OSRA 1.4.0   Imago 1.0   Imago 2.0 beta 1
         Absolutely correct       769 / 944    540 / 944   861 / 944
         Almost correct [1]       +31          +49         +43
         Average time             2.54s        0.20s       0.31s
         Average similarity [2]   94.57%       89.59%      98.26%

             [1] similarity value is greater than 95%;
             [2] ratio of correct elements (atoms and bonds); extra and
                 missing elements are counted too.
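To compare the tools on a common scale, the counts in the table can be converted to percentages of the 944 images. The short script below does only that arithmetic on the reported numbers; the helper name `rates` is illustrative.

```python
TOTAL = 944

# (absolutely correct, additional almost-correct) per tool, from the table.
results = {
    "OSRA 1.4.0": (769, 31),
    "Imago 1.0": (540, 49),
    "Imago 2.0 beta 1": (861, 43),
}

def rates(correct, almost, total=TOTAL):
    """Return (exact rate, exact-or-almost rate) as percentages."""
    return 100.0 * correct / total, 100.0 * (correct + almost) / total

for name, (correct, almost) in results.items():
    exact, combined = rates(correct, almost)
    print(f"{name:18s} exact {exact:5.1f}%   exact or almost {combined:5.1f}%")
```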


Comparison: overall scores 2
      • Image2Structure re-rendered using appropriate molfiles

                                  OSRA 1.4.0   Imago 1.0   Imago 2.0 beta 1
         Absolutely correct       796 / 944    604 / 944   831 / 944
         Almost correct [1]       +20          +58         +29
         Average time             4.57s        0.47s       1.24s
         Average similarity [2]   93.45%       95.38%      98.33%

             [1] similarity value is greater than 95%;
             [2] ratio of correct elements (atoms and bonds); extra and
                 missing elements are counted too.


Common issues resolved
      • Figure: source images compared with OSRA and Imago results
        for three issues:
             – Large gap
             – Lines too close
             – No more symbols

Imago Library
      • API: a set of methods for
             –   Image loading
             –   Configuration clusters setup
             –   Retrieving molfile results
             –   Partial processing (filtering, approximation, validation)
      • Bindings for C/C++, Java
      • Cross-platform implementation (Windows, Linux, Mac)
      • Dependencies:
             – Boost library (Boost Software License)
             – OpenCV library (BSD license)
             – Indigo (optional)


Thank you for your attention!

      • Imago OCR:
        http://ggasoftware.com/opensource/imago/

      • Try imago recognition engine online:
        http://ggasoftware.com/opensource/imago/online/




Appendix A
             Imago: technical details




Pass-through prefilter
      • Count black, white and other pixels
      • If (black + white) > t0 ∙ others,
             – recolor the other pixels to black → image is binarized
             – otherwise schedule another prefilter call
      • Downscale the image accurately when it
        is too large (>5 Mpix)
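The counting rule above is simple enough to sketch directly. The example below works on a grayscale image stored as a list of rows and returns `None` when another prefilter should be scheduled. The value of `t0`, the function name and the image representation are assumptions, and the downscaling step is left out.

```python
BLACK, WHITE = 0, 255

def pass_through(image, t0=10.0):
    """image: list of rows of 8-bit grayscale values.

    Returns a binarized copy, or None when the image is not close enough
    to bilevel and another prefilter should be tried.
    """
    flat = [p for row in image for p in row]
    black = flat.count(BLACK)
    white = flat.count(WHITE)
    others = len(flat) - black - white
    if black + white > t0 * others:
        # Nearly bilevel already: recolor the few remaining pixels to black.
        return [[WHITE if p == WHITE else BLACK for p in row] for row in image]
    return None

# A rendered image with a single anti-aliased pixel (value 128).
rendered = [
    [255, 255, 255, 255],
    [255,   0,   0, 255],
    [255,   0, 128, 255],
    [255, 255, 255, 255],
]
binarized = pass_through(rendered)

# A photo-like patch with no pure black or white pixels is rejected.
photo = [[10, 200], [90, 140]]
rejected = pass_through(photo)
```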




Cross-correlation prefilter
      • Smooth source image → smoothed
             – Pyramidal reduce 2x, then pyramidal upsample 2x
      • Apply an adaptive threshold binarization filter to the smoothed image:
             – With threshold t0 → strong
             – With threshold t1 → weak
      • Segment the (strong, weak) images using the wavemap algorithm
      • For each weak segment find the corresponding strong segment and
        calculate their intersection:
             – If the ratio of intersection area to original segment area is
               less than c0, remove this segment (bad segment)
      • If the reassembled image contains a rectangular structure R, crop
        the image to the inner dimensions of R (locate molecules)
      • Calculate the average pixel intensity of the good segments and try to add
        other pixels whose intensity passes this boundary (if they do not
        affect segment connectivity)
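The segment-matching step can be sketched with segments represented as sets of pixel coordinates. For each weak segment, the code finds the strong segment that overlaps it most and keeps the weak segment only if the overlap covers at least a fraction `c0` of its area. Taking the weak segment as the "original segment", the value of `c0` and the function name are all assumptions of this sketch; thresholding and segmentation themselves are not shown.

```python
def filter_weak_segments(weak_segments, strong_segments, c0=0.3):
    """Segments are sets of (x, y) pixels.

    Keep a weak segment only if its best-matching strong segment covers
    at least c0 of the weak segment's area.
    """
    kept = []
    for weak in weak_segments:
        overlap = max(
            (len(weak & strong) for strong in strong_segments), default=0
        )
        if overlap / len(weak) >= c0:
            kept.append(weak)
    return kept

# A 10-pixel stroke whose core also survives the strong threshold.
stroke_weak = {(x, 0) for x in range(10)}
stroke_strong = {(x, 0) for x in range(2, 8)}

# A faint noise blob that appears only under the weak threshold.
noise_weak = {(20, 20), (21, 20), (20, 21), (21, 21)}

good = filter_weak_segments([stroke_weak, noise_weak], [stroke_strong])
```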

Separator details
      • Given a binarized set of segments, classify
        them into two main groups: letters and
        chemical bond representations
      • The classification result is based on the value of
        C = k0 ∙ r0 + k1 ∙ r1 + k2 ∙ r2
             – where (r0, r1, r2) are the submethod results
             – and (k0, k1, k2) are configurable weight constants
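The weighted sum is straightforward to express in code. In the sketch below the weights, the decision threshold and the direction of the comparison are assumptions chosen for illustration: each r value is treated as evidence that the segment is a bond, so a high C is read as "bond".

```python
def separator_score(r, k):
    """C = k0*r0 + k1*r1 + k2*r2 for submethod results r and weights k."""
    if len(r) != len(k):
        raise ValueError("need one weight per submethod result")
    return sum(ki * ri for ki, ri in zip(k, r))

def classify(r, k=(0.4, 0.3, 0.3), threshold=0.5):
    """Label a segment from its three submethod results.

    Assumption: each r value is the submethod's estimate that the
    segment is a bond, so a large weighted sum means "bond".
    """
    return "bond" if separator_score(r, k) > threshold else "symbol"

stroke = classify((1, 0.9, 0.8))   # all three submethods lean towards bond
letter = classify((0, 0.2, 0.1))   # all three lean towards symbol
```

Keeping the weights configurable lets one submethod dominate for image types where the others are unreliable.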



Separator: Hu moments
      • Hu moments usually differ for characters and
        bonds, so a classification tree can be
        computed
      • Note: some objects
        cannot be classified
        that way
      • Result: r0 = 0 for symbols, r0 = 1 for bonds
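To see why Hu moments separate strokes from compact glyphs, it is enough to compute the first invariant, phi1 = eta20 + eta02, for two toy shapes. The sketch below does this in plain Python for shapes given as pixel lists. It shows only one of the seven invariants and no classification tree, so it is an illustration of the feature, not of Imago's classifier.

```python
def first_hu_moment(pixels):
    """phi1 = eta20 + eta02 for a binary shape given as (x, y) pixels."""
    n = len(pixels)  # zeroth moment mu00 of a binary shape
    cx = sum(x for x, _ in pixels) / n
    cy = sum(y for _, y in pixels) / n
    mu20 = sum((x - cx) ** 2 for x, _ in pixels)
    mu02 = sum((y - cy) ** 2 for _, y in pixels)
    # Normalized central moments of order 2: eta_pq = mu_pq / mu00 ** 2.
    return (mu20 + mu02) / n ** 2

line = [(x, 0) for x in range(9)]                    # elongated, bond-like
blob = [(x, y) for x in range(3) for y in range(3)]  # compact, glyph-like

phi_line = first_hu_moment(line)
phi_blob = first_hu_moment(blob)
```

Both shapes have nine pixels, yet the elongated one has a much larger phi1, which is the kind of difference a decision tree over Hu moments can split on.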



Separator: contours analysis
      • Extract the outer contour of the binarized segment S;
             – approximate the chain contour using the Teh-Chin chain
               approximation algorithm;
             – approximate the polygon once again, taking the line thickness
               as the approximation parameter;
             – calculate the offsets of the contour points by a clockwise step;
             – the output is a chain of sequential vectors normalized by their
               perimeters;
      • Compare the chain result to the set of patterns describing
        valid structures
             – The set consists of 8x8 matrices where the cell (j, k) denotes
               the probability of changing the jth direction to the kth.
      • Result of this method is r1 – probability of {S is a bond}
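An 8x8 direction-transition matrix of the kind described above can be built from an approximated contour as follows. Each polygon edge is quantized to one of eight directions, and the matrix counts how often direction j is followed by direction k, normalized per row. This sketch covers only the construction of the observed matrix; the pattern set and the comparison that yields r1 are not shown, and the function names are illustrative.

```python
import math

def direction(a, b):
    """Quantize the vector a -> b to one of 8 directions (0..7)."""
    angle = math.atan2(b[1] - a[1], b[0] - a[0])
    return round(angle / (math.pi / 4)) % 8

def transition_matrix(polygon):
    """Row-normalized counts of direction changes along a closed polygon."""
    n = len(polygon)
    dirs = [direction(polygon[i], polygon[(i + 1) % n]) for i in range(n)]
    counts = [[0] * 8 for _ in range(8)]
    for i in range(n):
        counts[dirs[i]][dirs[(i + 1) % n]] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix

# Outline of a thin horizontal bar (a bond-like segment).
bar = [(0, 0), (10, 0), (10, 1), (0, 1)]
m = transition_matrix(bar)
```

For the bar, the only transitions are 0→2, 2→4, 4→6 and 6→0, each with probability 1, which is the signature of a rectangle traversed in one rotational sense.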


Separator: approximation criteria
      • For a given segment S we calculate its best
        approximation with n line segments (d0) and
        the closest distance to the most probable
        character (d1)
             – If d1 < d0 and n > n0, the segment probably
               represents a character
                • Check its width/height ratio and height/average_height
                  ratio: penalty p0 if these criteria are not met
                • Result is r2 = 1 - (d1 [+ p0]) – probability of {S is a bond}
             – Otherwise the result is r2 = d0 – probability of {S is a bond}

Bonds skeleton analysis
      •      Dissolve short edges
      •      Join closest vertices
      •      Dissolve intermediate vertices
      •      Find multiple edges
      •      Connect bridged bonds
      •      Shrink short bonds
      •      Detect and mark suspicious edges


Basic labels analysis
      • Location analysis: check against the baseline
             – Subscripts lie below the baseline
             – Capitals are mostly above the baseline
      • Calculate distances to all possible characters
      • Adjust the distances using topological features
      • Select the best candidate and calculate
        the recognition quality

Superatoms analysis
      • Concatenate recognized characters into labels
      • Check chemical validity
      • If the validity check fails, try to find the most
        probable alternative using other distance map
        elements
      • If no such alternative is found, try to
        recognize the less probable characters as
        bonds
      • Handle R-group semantics and special characters: X, Q, A

Appendix B
             Imago: workflow features




Related continuous integration system
      • Figure labels: versions list, test sets, results estimation
Explanation: continuous integration
      • Some logically grounded changes may
        decrease the recognition rate → a convenient
        tracking tool is required
      • Good way to improve overall stability
      • Useful visual representation of the machine-
        learning progress




Embedded HTML-based logging system

      • Figure labels: embedded images, variables and parameters dump,
        call hierarchy, performance counters




Explanation: logging system
      • Structured logs (reports) offer
             – a convenient way to detect bugs;
             – an exact visual representation of the internal
               processes
      • Several improvements may become evident just by
        looking through the logs
      • The performance decrease is comparable to that of
        (usual) plaintext logs
      • Stability is not affected

Authors
      •      Rostislav Chutkov
      •      Michael Rybalkin
      •      Kliton Andrea
      •      Victor Smolov

      • GGA Software Services LLC




Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 

Imago OCR: Open-source toolkit for chemical structure image recognition

  • 1. Imago OCR Open-source toolkit for chemical structure image recognition http://ggasoftware.com/opensource/imago/ 14/08/2012 GGA Software Services LLC 1
  • 2. Project goals
      • Perform optical chemical structure recognition applicable to a wide range of raster images:
          – different image formats
          – varying scan quality (or even photos)
          – complex structures and uncommon features
      • Provide a complete toolset for embedding the recognition engine in any other application
  • 3. Applications
      • Automated processing of articles and patents
          – similarity analysis
      • Chemical database search (PubChem, etc.)
      • "Deep Web" indexing
          – development of a universal chemical search engine
          – conversion of human-readable data to machine-readable formats
  • 4. Use case: source image → imago → MOL format
      • Source image (input to imago):
          – BMP, DIB, JPG, JPE, PNG, PBM, PGM, PPM, SR, RAS, TIFF
          – Images from scanner/camera
          – PDF document
      • MOL format (output of imago):
          – MDL Molfile
          – SMILES (requires Indigo)
          – Rendered image (requires Indigo)
  • 5. Supported features
      • Multiple bonds
      • Single-up & single-down bonds
      • Bridged bonds
      • Aromatic rings
  • 6. Supported features
      • Superatom labels, charges, isotopes
      • Abbreviation expansion
      • R-group handling
      • Query features
  • 7. Engine structure
      • Raster level: image loader, prefilter & binarization
      • Primitives level: vectorization & separation
      • Structural level: logical layout analyzer, molecule export
  • 8. Preliminary filters
      • Pass-through filter
          – For rendered images (binarization only)
      • Cross-correlation based filter
          – For scanned images (quite fast)
      • Logical analysis based filter
          – For low-quality photos
          – Takes some time to process
      • Imago can auto-detect a suitable filter
  • 9. Cross-correlation based filter
      • Figure panels: source image, strong threshold, weak threshold, filter result
      • Filter result: an image assembled from those weak-threshold segments whose cross-correlation (CC) value with the corresponding strong-threshold segments passes the restrictions
  • 10. Logical analysis based filter
      • Removes noise (spots, light glare)
      • Suitable for out-of-focus images
      • Can process low-contrast images
      • Removes unusual artifacts
      • Deals with multicolor photos
      • Keywords: Wiener filtering, wave algorithm, weak segmentation
  • 11. Preliminary separation
      • Separate labels and graphics:
          – Hu moments classifier (d1)
          – Contours analysis (d2)
          – Approximation criteria (d3)
      • Object is a symbol if f(d1, d2, d3) > c0
  • 12. Vectorization
      • Convert pixels to a matching polyline:
          – Minimization of the mean distance between the original and the vectorized structure
          – Penalty for extra segments
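The trade-off on this slide (mean distance versus a penalty for extra segments) can be illustrated with a small sketch. It is not Imago's implementation: the Ramer-Douglas-Peucker simplification, the tolerance list and the `segment_penalty` value are all assumptions chosen for illustration, and the input is assumed to be an ordered list of at least two (x, y) points.

```python
import math

def point_segment_distance(p, a, b):
    """Distance from point p to the line segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    length_sq = dx * dx + dy * dy
    if length_sq == 0.0:
        return math.hypot(px - ax, py - ay)
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / length_sq))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def simplify(points, tolerance):
    """Ramer-Douglas-Peucker simplification (assumed stand-in technique)."""
    if len(points) < 3:
        return list(points)
    distances = [point_segment_distance(p, points[0], points[-1])
                 for p in points[1:-1]]
    worst = max(range(len(distances)), key=distances.__getitem__)
    if distances[worst] <= tolerance:
        return [points[0], points[-1]]
    split = worst + 1  # index of the farthest point in `points`
    left = simplify(points[:split + 1], tolerance)
    right = simplify(points[split:], tolerance)
    return left[:-1] + right

def cost(points, polyline, segment_penalty):
    """Mean distance of the pixels to the polyline plus a per-segment penalty."""
    segments = list(zip(polyline, polyline[1:]))
    mean_distance = sum(
        min(point_segment_distance(p, a, b) for a, b in segments)
        for p in points
    ) / len(points)
    return mean_distance + segment_penalty * len(segments)

def vectorize(points, tolerances=(0.5, 1.0, 2.0, 4.0), segment_penalty=0.3):
    """Pick the candidate polyline with the lowest penalized cost."""
    candidates = [simplify(points, t) for t in tolerances]
    return min(candidates, key=lambda poly: cost(points, poly, segment_penalty))

# An L-shaped stroke: two segments fit exactly, one segment fits poorly.
stroke = [(0, 0), (1, 0), (2, 0), (3, 0), (3, 1), (3, 2), (3, 3)]
print(vectorize(stroke))
```

A larger `segment_penalty` pushes the choice toward fewer, longer segments; a smaller one lets the polyline follow the pixels more closely.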
  • 13. Logical layout analysis
      • Mapping labels to bonds
          – Group labels into superatoms
      • Finding multiple bonds
          – Dissolving short edges
          – Connecting bridged bonds
      • Removal of clearly unrelated captions
      • Detection of aromatic rings
          – Figuring out the orientation of stereo bonds and aromatizing the molecule if circles are present
  • 14. Adaptive methods or particular cases?
      • Adaptive methods
          – Based on optimization of some function
          – Wider input class range
          – Probably better results in hard cases
      • Particular-case methods
          – Based on some criteria
          – Stability
          – Good performance
          – Easier implementation
  • 15. Particular case methods
      • What is it?
      • Line? Tested line criteria: no.
      • Character? Tested against ‘A’: no. … Tested against ‘Z’: no.
      • Ring? No.
      • Unrecognizable object: ignore.
  • 16. Adaptive methods
      • What is it?
      • Line? Approximation: d=1.6
      • Character? Compared with ‘C’: d=6.1 … Compared with ‘L’: d=3.2
      • Ring? Approximation: d=653.3
      • Final decision depends on neighbors
  • 17. Decision tree
      • Neighbors: bond with d=0.0; label “C” with d=0.1 (almost surely recognized)
      • If the object is a bond, the segments group is recognized as bond + label with d=0.1+1.6=1.7
      • If the object is a letter ‘l’, the segments group is recognized as bond + label of two chars with d=0.0+0.1+3.2=3.3
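The comparison on this slide amounts to summing the metric values along each branch and keeping the branch with the smallest total. A minimal sketch, assuming each hypothesis is simply a list of per-element distances (the dictionary layout and the function name `best_interpretation` are illustrative, not part of Imago):

```python
def best_interpretation(hypotheses):
    """Return (name, total distance) of the hypothesis with the smallest sum."""
    totals = {name: sum(parts) for name, parts in hypotheses.items()}
    best = min(totals, key=totals.get)
    return best, totals[best]

# Distances taken from slides 16 and 17: the ambiguous object scores
# d=1.6 as a line and d=3.2 as the letter 'L'.
hypotheses = {
    "bond + label 'C'": [0.0, 0.1, 1.6],
    "bond + two-char label": [0.0, 0.1, 3.2],
}
name, total = best_interpretation(hypotheses)
print(name, round(total, 2))
```

With these numbers the ambiguous object is read as a bond, because 1.7 is smaller than 3.3.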
  • 18. Metrics
      • For symbols
          – Distance between sets of Fourier descriptors
      • For graphics
          – Distance between the approximated and the source image
      • For single-up bonds
          – f(average fill, relative size, etc.)
      • For single-down bonds
          – f(distance between segments, line thickness, etc.)
      • … (every recognition method has a metric function)
  • 19. Labels correction
      • Any recognized symbol can have alternatives, e.g. A (metric value of 3.2), R (4.9), P (5.0)
      • Imago keeps information about probable captions (periodic table, abbreviations)
      • Labels correction: select the combination of symbol alternatives that forms a probable caption and has the minimal sum of metric values
      • Makes it possible to recognize partially broken labels
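The selection rule on this slide can be sketched as a brute-force search over combinations of alternatives. This is an illustration under stated assumptions, not Imago's code: `KNOWN_CAPTIONS` is a tiny hypothetical dictionary, and the alternatives for the second symbol are invented; only the values A (3.2), R (4.9), P (5.0) come from the slide.

```python
from itertools import product

# Hypothetical stand-in for the periodic table and abbreviation list.
KNOWN_CAPTIONS = {"Ph", "Rb", "Ac"}

def correct_label(alternatives, known=KNOWN_CAPTIONS):
    """alternatives: one list of (character, metric value) pairs per symbol.

    Returns (caption, summed metric) for the known caption with the smallest
    sum, or None if no combination forms a known caption.
    """
    best = None
    for combo in product(*alternatives):
        text = "".join(ch for ch, _ in combo)
        if text not in known:
            continue
        total = sum(d for _, d in combo)
        if best is None or total < best[1]:
            best = (text, total)
    return best

first = [("A", 3.2), ("R", 4.9), ("P", 5.0)]   # values from the slide
second = [("h", 0.5), ("b", 2.0)]              # assumed for illustration
print(correct_label([first, second]))
```

Although "A" has the best individual score, "Ah" is not a known caption, so the search settles on "Ph" (5.0 + 0.5), which beats "Rb" (4.9 + 2.0).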
  • 20. Recognition
      • Image recognition is a search for the vectorized result that gives the minimal distance between the vectorized form and the original image
      • Can be formalized depending on the metrics
      • The search is exhaustive
          – Needs some restrictions to achieve good speed
  • 21. Trade-off: restricted adaptive methods
      • Limit metric values: d < 0.5 means certain; d > 10.0 means impossible
      • Limit Euclidean distances for the neighbor search (up to 100 pixels)
      • Limit the number of alternatives (no more than 10)
      • Assume the image filling rate is less than 10%
      • Assume the distance between single-down bond segments is in the range 5..10 pixels
      • Assume the symbol aspect ratio is in the range 0.5..2.0
      • Some more assumptions with “magic” constants
      • Gains speed and stability
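The first and third restrictions can be sketched as a pruning step applied to a list of candidate interpretations. The thresholds come from the slide; how Imago actually applies them is not described, so treating a "certain" match as grounds to stop considering other candidates is an assumption of this sketch.

```python
SURE = 0.5             # d < 0.5: certain (from the slide)
IMPOSSIBLE = 10.0      # d > 10.0: impossible (from the slide)
MAX_ALTERNATIVES = 10  # no more than 10 alternatives (from the slide)

def prune(alternatives):
    """alternatives: iterable of (name, d) pairs; returns the ones worth keeping."""
    ranked = sorted(alternatives, key=lambda item: item[1])
    # Assumption: a certain match makes every other alternative unnecessary.
    if ranked and ranked[0][1] < SURE:
        return ranked[:1]
    feasible = [item for item in ranked if item[1] <= IMPOSSIBLE]
    return feasible[:MAX_ALTERNATIVES]

# Distances from slide 16: the ring hypothesis is discarded outright.
print(prune([("ring", 653.3), ("L", 3.2), ("line", 1.6)]))
```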
  • 22. Configuration clusters
      • For scanned images
          – Strict limits on adaptive methods (fast, <300ms per image)
      • For photos and low-quality images
          – Flexible limits (less than a second per image on average)
      • For high-resolution images
          – up to 5 seconds
      • For handwritten structures
          – up to 10 seconds in complex cases
      • Imago supports auto-detection of a suitable configuration cluster
  • 23. Configuration cluster creation
      • Improves the recognition success rate for a specified image type:
          – different render type
          – images captured differently (scanner type, lighting conditions, etc.)
      • The process is automated
          – a test set of images of the target type is required
          – takes some time
          – an application of machine learning
  • 24. Machine learning
      • Test set: a number of pairs (image; related MDL molfile)
      • Imago will tune the method parameters to gain the best score on the test collection
          – Metrics included
          – No information directly related to the test set (such as a character table) is stored
      • Criteria for the complete set are formed from a small subset of the same type
  • 25. Learning effectiveness
      • Used the Img2Structure test set rendered with a different renderer:
      • Initial results (before training): 202/944 correct, similarity value: 74.54%
      • Trained on a set of 50 images from the new renderer
      • Trained results: 831/944 correct, similarity value: 98.33% on the whole set
  • 26. Comparison: overall scores 1
      • Image2Structure set from TREC 2011 Chemical IR Track (ambiguous & partial structures removed): original files

                                  OSRA 1.4.0    Imago 1.0    Imago 2.0 beta 1
        Absolutely correct        769 / 944     540 / 944    861 / 944
        Almost correct [1]        +31           +49          +43
        Average time              2.54s         0.20s        0.31s
        Average similarity [2]    94.57%        89.59%       98.26%

        [1] similarity value is greater than 95%
        [2] ratio of correct elements (atoms and bonds); extra and missing elements are counted too
  • 27. Comparison: overall scores 2
      • Image2Structure re-rendered using appropriate molfiles

                                  OSRA 1.4.0    Imago 1.0    Imago 2.0 beta 1
        Absolutely correct        796 / 944     604 / 944    831 / 944
        Almost correct [1]        +20           +58          +29
        Average time              4.57s         0.47s        1.24s
        Average similarity [2]    93.45%        95.38%       98.33%

        [1] similarity value is greater than 95%
        [2] ratio of correct elements (atoms and bonds); extra and missing elements are counted too
  • 28. Common issues resolved
      • Figure: source image compared with OSRA and Imago results for three cases: large gap; lines too close; no more symbols
  • 29. Imago Library
      • API: a set of methods for
          – Image loading
          – Configuration clusters setup
          – Retrieving molfile results
          – Partial processing (filtering, approximation, validation)
      • Bindings for C/C++, Java
      • Cross-platform implementation (Windows, Linux, Mac)
      • Dependencies:
          – Boost library (Boost Software License)
          – OpenCV library (BSD license)
          – Indigo (optional)
  • 30. Thank you for the attention!
      • Imago OCR: http://ggasoftware.com/opensource/imago/
      • Try the imago recognition engine online: http://ggasoftware.com/opensource/imago/online/
  • 31. Appendix A. Imago: technical details
  • 32. Pass-through prefilter
      • Count black, white and other pixels
      • If (black + white) > t0 ∙ others:
          – recolor others to black → image is binarized
          – else schedule another prefilter call
      • Perform an accurate image downscale when the image is too large (>5Mpix)
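The counting test on this slide can be sketched in a few lines of plain Python. The representation (rows of 8-bit grayscale values), the default `t0` and the convention of returning `None` to signal "schedule another prefilter" are assumptions for illustration; the downscaling step is left out.

```python
def pass_through_prefilter(pixels, t0=10.0, black=0, white=255):
    """pixels: list of rows of grayscale values.

    Returns a binarized copy if the image is already almost black-and-white,
    otherwise None (meaning: another prefilter should be tried).
    """
    flat = [value for row in pixels for value in row]
    n_black = sum(1 for value in flat if value == black)
    n_white = sum(1 for value in flat if value == white)
    n_other = len(flat) - n_black - n_white
    if n_black + n_white > t0 * n_other:
        # Recolor every non-white pixel to black, as on the slide.
        return [[value if value == white else black for value in row]
                for row in pixels]
    return None

# A mostly clean rendered image with one anti-aliased (gray) pixel.
rendered = [[255] * 5 for _ in range(5)]
rendered[2][1], rendered[2][2], rendered[2][3] = 0, 128, 0
print(pass_through_prefilter(rendered)[2])
```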
  • 33. Cross-correlation prefilter
      • Smooth the source image → smoothed
          – Pyramidal reduce 2x, then pyramidal upsample 2x
      • Apply an adaptive threshold binarization filter to the smoothed image:
          – With threshold t0 → strong
          – With threshold t1 → weak
      • Segment the (strong, weak) images using the wavemap algorithm
      • For each weak segment, find the appropriate strong segment and calculate the intersection:
          – If the ratio of the intersection area to the original segment area is less than c0, remove this segment (bad segment)
      • If the reassembled image contains a rectangular structure R, crop the image to the inner dimensions of R (locate molecules)
      • Calculate the average pixel intensity of the good segments and try to add other pixels whose intensity passes this boundary (if they do not affect segment connectivity)
  • 34. Separator details
      • Given a binarized set of segments, classify them into two main groups: letters and chemical bond representations
      • The classification result is based on the value of C = k0 ∙ r0 + k1 ∙ r1 + k2 ∙ r2
          – where (r0, r1, r2) are the submethod results
          – and (k0, k1, k2) are weight constants (configurable)
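The weighted sum on this slide can be written out directly. The weights and the decision threshold below are hypothetical placeholders (the slides only say the weights are configurable), and reading a high value of C as "bond" is an assumption based on slides 35 to 37, where each submethod result is described as a probability that the segment is a bond.

```python
def separator_score(r, k=(0.4, 0.3, 0.3)):
    """C = k0*r0 + k1*r1 + k2*r2 for submethod results r = (r0, r1, r2)."""
    return sum(ki * ri for ki, ri in zip(k, r))

def classify(r, k=(0.4, 0.3, 0.3), threshold=0.5):
    """Assumed decision rule: a high score means the segment is a bond."""
    return "bond" if separator_score(r, k) > threshold else "symbol"

print(classify((1, 0.9, 0.8)))  # all three submethods lean toward a bond
print(classify((0, 0.2, 0.1)))  # all three submethods lean toward a symbol
```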
  • 35. Separator: Hu moments
      • Hu moments usually differ for characters and bonds, so a classification tree can be computed
      • Note: some objects cannot be classified that way
      • Result: r0 = 0 for symbols, r0 = 1 for bonds
  • 36. Separator: contours analysis
      • Extract the outer contour of the binarized segment S:
          – approximate the chain contour using the Teh-Chin chain approximation algorithm;
          – approximate the polygon once again, taking the line thickness as an approximation parameter;
          – calculate the offsets of the contour points in clockwise steps;
          – the output is a chain of sequential vectors normalized by their perimeters
      • Compare the resulting chain to the set of patterns describing valid structures
          – The set consists of 8x8 matrices where cell (j, k) denotes the probability of changing from the j-th direction to the k-th
      • The result of this method is r1, the probability of {S is a bond}
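The 8x8 pattern matrices can be illustrated with a short sketch that builds such a matrix from a contour given as an 8-direction chain code and scores a contour against a pattern. The slides do not say how the comparison is performed, so the scoring function (mean probability of the observed transitions) and both function names are assumptions.

```python
def transition_matrix(directions):
    """Build an 8x8 matrix where cell (j, k) is the observed probability of
    changing from direction j to direction k along a closed contour.

    directions: list of ints in 0..7 (8-direction chain code).
    """
    counts = [[0] * 8 for _ in range(8)]
    for cur, nxt in zip(directions, directions[1:] + directions[:1]):
        counts[cur][nxt] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix

def pattern_likelihood(directions, pattern):
    """Assumed score: mean probability the pattern gives to each transition."""
    pairs = list(zip(directions, directions[1:] + directions[:1]))
    return sum(pattern[j][k] for j, k in pairs) / len(pairs)

# Contour of a long thin rectangle (a bond-like shape): mostly straight runs.
bond_contour = [0] * 10 + [2] + [4] * 10 + [6]
bond_pattern = transition_matrix(bond_contour)
print(bond_pattern[0][0], bond_pattern[0][2])
```

In this sketch a pattern learned from bond-like contours gives a high likelihood to other contours dominated by long straight runs and a low one to contours that change direction often.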
  • 37. Separator: approximation criteria
      • For a given segment S, calculate its best approximation with n line segments (d0) and the closest distance to the most probable character (d1)
          – If d1 < d0 and n > n0, the segment probably represents a character
      • Check its width/height ratio and height/average_height ratio: penalty p0 if these criteria are not met
      • Result is r2 = 1 - (d1 [+ p0]): probability of {S is a bond}
          – Result is r2 = d0: probability of {S is a bond}
  • 38. Bonds skeleton analysis
      • Dissolve short edges
      • Join closest vertices
      • Dissolve intermediate vertices
      • Find multiple edges
      • Connect bridged bonds
      • Shrink short bonds
      • Detect and mark suspicious edges
  • 39. Basic labels analysis
      • Location analysis: check against the baseline
          – Subscripts lie below the line
          – Capitals lie mostly above the line
      • Calculate distances to all possible characters
      • Adjust the distances using topological features
      • Select the best candidate and calculate the recognition quality
  • 40. Superatoms analysis
      • Concatenate recognized characters into labels
      • Check chemical validity
      • If the validity check fails, try to find the most probable alternative using other distance map elements
      • If no such alternative is found, try to recognize the less probable characters as bonds
      • Handle R-semantics and special characters: X, Q, A
  • 41. Appendix B. Imago: workflow features
  • 42. Related continuous integration system
      • Screenshot showing: versions list, test sets, …, results estimation
  • 43. Explanation: continuous integration
      • Some logically grounded changes may decrease the recognition rate → a convenient tracking tool is required
      • A good way to improve overall stability
      • A useful visual representation of machine-learning progress
  • 44. Embedded HTML-based logging system
      • Screenshot showing: embedded images, variables and parameters dump, call hierarchy, performance counters
  • 45. Explanation: logging system
      • Structured logs (reports) offer:
          – a convenient way to detect bugs;
          – an exact visual representation of the internal processes
      • Several improvements may become evident just by looking through the logs
      • The performance overhead is comparable to that of ordinary plain-text logs
      • Stability is not affected
  • 46. Authors
      • Rostislav Chutkov
      • Michael Rybalkin
      • Kliton Andrea
      • Victor Smolov
      • GGA Software Services LLC