Text Extraction using Document Structure Features
and Support Vector Machines
Konstantinos Zagoris, Nikos Papamarkos
Image Processing and Multimedia Laboratory
Department of Electrical & Computer Engineering
Democritus University of Thrace
67100 Xanthi, Greece
Email: papamark@ee.duth.gr
http://ipml.ee.duth.gr/~papamark/
• Document images such as technical articles, business letters, faxes and newspapers are now abundant, yet most carry no indexing information
• For systems such as OCR to exploit them successfully, a text localization technique must first be applied
• Bottom-up techniques
• Top-down techniques
• The proposed technique continues the work “PLA using RLSA and a neural network” by C. Strouthopoulos, N. Papamarkos and C. Chamzas
• In the proposed work the feature set is adaptive
• The feature reduction technique is simpler
• The classifier is an SVM
1. Apply preprocessing techniques (binarization, noise reduction)
2. Locate, merge and extract blocks
3. Extract the features from the blocks
4. Find the blocks that contain text using Support Vector Machines
5. Locate or extract the text blocks and present them to the user
[Figures: the original document; after the pre-processing step; the connected components; the expanded connected components; the final blocks]
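The block-location stage starts from connected components. As a minimal sketch only, the following labels the components of a binarized image and returns their bounding boxes. The function name, the 8-connectivity and the list-of-lists image format are assumptions made for illustration; the expansion and merging that produce the final blocks are not covered.

```python
from collections import deque

def connected_component_boxes(image):
    """Return one bounding box (top, left, bottom, right) per 8-connected
    component of foreground (1) pixels, in row-major order of discovery."""
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill from this unvisited foreground pixel.
            top, left, bottom, right = r, c, r, c
            queue = deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and image[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes

# Toy binary page: 1 = document object, 0 = background.
page = [
    [1, 1, 0, 0, 0],
    [0, 1, 0, 0, 1],
    [0, 0, 0, 0, 1],
    [1, 0, 0, 0, 0],
]
boxes = connected_component_boxes(page)
```

On this toy page the sketch finds three components; a real pipeline would then expand and merge neighbouring boxes into blocks.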
• The features are a set of suitable Document Structure Elements (DSEs) that the blocks contain
• A DSE is any 3x3 binary block
• There are 2^9 = 512 DSEs in total

The pixel order of the DSEs:

b8 b7 b6
b5 b4 b3
b2 b1 b0

$$L_j = \sum_{i=0}^{8} 2^{i} \, b_{ji}$$

[Figure: the DSE of L142]
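The index of a DSE is the 9-bit number formed by its pixels. A small sketch of that mapping follows, assuming the pixel order runs from b8 at the top-left to b0 at the bottom-right; the function names are illustrative only.

```python
def dse_index(block):
    """Index L_j = sum over i of 2**i * b_ji for a 3x3 binary block.

    `block` is a list of three rows; reading it row by row gives
    b8, b7, ..., b0 (assumed pixel order).
    """
    index = 0
    for row in block:
        for pixel in row:
            # Shifting left makes the first pixel read the highest bit (b8).
            index = (index << 1) | pixel
    return index

def dse_pattern(index):
    """Inverse mapping: the 3x3 binary block of the DSE with this index."""
    bits = [(index >> i) & 1 for i in range(8, -1, -1)]  # b8 ... b0
    return [bits[0:3], bits[3:6], bits[6:9]]

pattern_142 = dse_pattern(142)
```

Under this assumed order, 142 = 2^7 + 2^3 + 2^2 + 2^1, so L142 has pixels b7, b3, b2 and b1 set.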
• The initial descriptor of a block is the histogram of the DSEs that the block contains
• The length of the initial descriptor is 510
• The L0 and L511 DSEs are removed because they correspond to pure background and pure document objects, respectively
• A feature reduction algorithm is then applied to reduce the number of features
• The selected features are the DSEs that most reliably separate the text blocks from the others
• We call this feature reduction algorithm Feature Standard Deviation Analysis of Structure Elements (FSDASE)
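The initial descriptor can be sketched as a histogram over every 3x3 window of a block. Whether the windows overlap is not stated here, so the overlapping (sliding) scan below is an assumption, as are the function name and the use of raw counts rather than normalized frequencies.

```python
def dse_histogram(block):
    """Count the DSEs found in a binary block and drop L0 and L511.

    Returns a list of 510 counts; position k holds the count of DSE L(k+1).
    """
    counts = [0] * 512
    rows, cols = len(block), len(block[0])
    for r in range(rows - 2):
        for c in range(cols - 2):
            # Build the 9-bit DSE index of the window whose top-left is (r, c).
            index = 0
            for y in range(r, r + 3):
                for x in range(c, c + 3):
                    index = (index << 1) | block[y][x]
            counts[index] += 1
    # Remove pure background (L0) and pure document object (L511).
    return counts[1:511]

block = [
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [1, 1, 0, 0],
]
descriptor = dse_histogram(block)
```

This 3x4 block contains two windows, L142 and L276, so only two entries of the 510-element descriptor are non-zero.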
• Find the standard deviation over the text blocks, SDXT(Ln), for each DSE Ln
• Find the standard deviation over the non-text blocks, SDXP(Ln), for each DSE Ln
• Normalize them
• Then define the vector O(Ln) as O(Ln) = |SDXT´(Ln) - SDXP´(Ln)|
• Finally, keep the 32 DSEs with the 32 largest values of O(Ln)
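The FSDASE steps above can be sketched as follows. The normalization is not specified in the slides, so dividing each standard-deviation vector by its maximum is an assumption; the population standard deviation and the function name are likewise illustrative choices.

```python
from statistics import pstdev

def fsdase_select(text_descriptors, non_text_descriptors, keep=32):
    """Return the indices of the `keep` features with the largest
    O(Ln) = |SDXT'(Ln) - SDXP'(Ln)|."""
    n_features = len(text_descriptors[0])

    def normalized_sd(descriptors):
        # Standard deviation of each feature across the given blocks.
        sds = [pstdev([d[n] for d in descriptors]) for n in range(n_features)]
        peak = max(sds)
        # Assumed normalization: scale so the largest SD becomes 1.
        return [sd / peak if peak > 0 else 0.0 for sd in sds]

    sd_text = normalized_sd(text_descriptors)
    sd_non_text = normalized_sd(non_text_descriptors)
    o = [abs(t - p) for t, p in zip(sd_text, sd_non_text)]
    ranked = sorted(range(n_features), key=lambda n: o[n], reverse=True)
    return ranked[:keep]

# Toy data: two text blocks and two non-text blocks with three features each.
text = [[0, 5, 1], [10, 5, 3]]
non_text = [[4, 0, 2], [4, 8, 4]]
selected = fsdase_select(text, non_text, keep=2)
```

In the toy data, feature 0 varies only among text blocks and feature 1 only among non-text blocks, so both are kept, while feature 2 varies similarly in both classes and is dropped.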
• The goal of FSDASE is to find the DSEs whose standard deviation is maximal over the text blocks and minimal over the non-text blocks, or vice versa
• A training dataset is required
• This is not a problem, because such a dataset is already required for training the SVMs
• The final block descriptor is therefore a vector of 32 elements, holding the frequencies of the 32 selected DSEs in the block
• The descriptor can adapt to the demands of each set of document images
• A noisy document has a different set of DSEs than a clean document
• If more computational power is available, the descriptor size can easily be increased above 32
• This descriptor is used to train the Support Vector Machines
• Based on statistical learning theory
• They need training data
• They separate the space in which the training data reside into two classes
• The training data must be linearly separable
• If the training data are not linearly separable (as in our case), they are mapped from the input space to a feature space using the kernel method
• Our experiments showed the Radial Basis Function, $\exp(-\gamma \lVert x - x' \rVert^2)$, to be the most robust kernel
• The SVM parameters are determined by a cross-validation procedure using a grid search
• The SVM output classifies each block as text or non-text
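For reference, a minimal sketch of the Radial Basis Function kernel in its standard squared-distance form, exp(-γ‖x - x'‖²). The function names are illustrative, and the value of gamma would come from the grid search rather than being fixed as here.

```python
import math

def rbf_kernel(x, y, gamma):
    """Standard RBF kernel: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)

def gram_matrix(vectors, gamma):
    """Kernel value for every pair of descriptors."""
    return [[rbf_kernel(u, v, gamma) for v in vectors] for u in vectors]

# Three toy 2-element descriptors; real block descriptors have 32 elements.
descriptors = [[0.0, 0.0], [3.0, 4.0], [0.0, 1.0]]
gram = gram_matrix(descriptors, gamma=0.1)
```

The kernel equals 1 for identical descriptors and decays toward 0 as they move apart; a larger gamma makes the decay faster.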
• The Document Image Database from the University of Oulu is employed
• In our experiments we used its set of 48 article documents
• Those document images contain a mixture of text and pictures
• Five images from this database were selected, and their extracted blocks were used to determine the proper DSEs and as training samples for the SVMs
• The overall results are:

Document Images   Blocks   Success Rate
48                25958    98.453%
[Figures: five examples, each pairing the original image with the output of the proposed method]
• A bottom-up text localization technique is proposed that detects and extracts homogeneous text from document images
• A connected component analysis technique is applied to detect the objects of the document
• A flexible descriptor based on structural elements is extracted
• The descriptor can adapt to the demands of each set of document images
• For example, a noisy document has a different set of DSEs than a clean document
• If more computational power is available, the descriptor size can easily be increased above 32
• A trained SVM classifies the objects as text or non-text
• The experimental results are very promising
THANK YOU!
