using Document Structure Features
and Support Vector Machines
Konstantinos Zagoris, Nikos Papamarkos
Image Processing and Multimedia Laboratory
Department of Electrical & Computer Engineering
Democritus University of Thrace
67100 Xanthi, Greece
Email: papamark@ee.duth.gr
http://ipml.ee.duth.gr/~papamark/
• Nowadays, there is an abundance of document images, such as technical articles, business letters, faxes and newspapers, without any indexing information
• For systems such as OCR to exploit them successfully, a text localization technique must be employed
• Bottom-up techniques
• Top-down techniques
• The proposed technique is a continuation of the work "PLA using RLSA and a neural network" by C. Strouthopoulos, N. Papamarkos and C. Chamzas
• In the proposed work, the feature set is adaptive
• The feature reduction technique is simpler
• The classifier is an SVM
1. Apply preprocessing techniques (binarization, noise reduction)
2. Locate, merge and extract blocks
3. Extract the features from the blocks
4. Find the blocks which contain text using Support Vector Machines
5. Locate or extract the text blocks and present them to the user
[Figure panels: the original document; after the pre-processing step; the connected components; the expanded connected components; the final blocks]
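As a rough illustration of the connected-component step, the sketch below labels foreground regions of a binary image and returns their bounding boxes. It is not the authors' implementation: the function name `connected_component_boxes`, the use of 8-connectivity and the list-of-rows image format are assumptions, and the expansion and merging of components into final blocks is not covered.

```python
from collections import deque

def connected_component_boxes(image):
    """Return bounding boxes (top, left, bottom, right) of the 8-connected
    foreground regions of a binary image given as a list of rows of 0/1."""
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill from an unvisited foreground pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            top, left, bottom, right = r, c, r, c
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and image[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes

# Toy page with two separate objects (assumed: 1 = document object, 0 = background).
page = [
    [1, 1, 0, 0, 0],
    [0, 1, 0, 0, 1],
    [0, 0, 0, 0, 1],
]
print(connected_component_boxes(page))
```

Each bounding box could then serve as a candidate block for the feature extraction that follows.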
The features are a set of suitable Document Structure Elements (DSEs) that the blocks contain
A DSE is any 3x3 binary block
There are 2^9 = 512 DSEs in total
b8 b7 b6
b5 b4 b3
b2 b1 b0
The Pixel Order of the DSEs
$L_j = \sum_{i=0}^{8} b_{ji} \, 2^{i}$
The DSE of L142
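The numbering of a DSE can be illustrated with a few lines of code. This is only a sketch: the helper name `dse_index` is hypothetical, and it assumes the pixel order reads b8 b7 b6 / b5 b4 b3 / b2 b1 b0 from top-left to bottom-right, with pixel b_i weighted by 2^i.

```python
def dse_index(block):
    """Map a 3x3 binary block (three rows of 0/1) to its DSE number L.

    Assumes the pixel order b8 b7 b6 / b5 b4 b3 / b2 b1 b0, so the
    top-left pixel carries weight 2**8 and the bottom-right 2**0.
    """
    index = 0
    for row in block:
        for pixel in row:
            index = (index << 1) | pixel  # append the next bit
    return index

# Under the assumed pixel order, this pattern sets b7, b3, b2 and b1.
example = [
    [0, 1, 0],
    [0, 0, 1],
    [1, 1, 0],
]
print(dse_index(example))  # 142 = 128 + 8 + 4 + 2
```

With this convention, an all-background block maps to 0 and an all-foreground block maps to 511.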
• The initial descriptor of the block is the histogram of the DSEs that the block contains
• The length of the initial descriptor is 510
• The L0 and L511 DSEs are removed because they correspond to pure background and pure document objects, respectively
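A minimal sketch of the initial descriptor follows. It assumes that the 3x3 window slides over the block one pixel at a time (the slides do not say whether windows overlap) and that the top-left pixel is the most significant bit; the name `dse_histogram` is hypothetical.

```python
def dse_histogram(block):
    """Count every 3x3 DSE in a binary block (list of rows of 0/1) and
    return the 510 counts for L1..L510, dropping L0 and L511."""
    rows, cols = len(block), len(block[0])
    counts = [0] * 512
    for r in range(rows - 2):
        for c in range(cols - 2):
            index = 0
            # Read the window row by row; the first pixel ends up as bit 8.
            for dy in range(3):
                for dx in range(3):
                    index = (index << 1) | block[r + dy][c + dx]
            counts[index] += 1
    # Discard pure background (L0) and pure document object (L511).
    return counts[1:511]

block = [
    [0, 1, 0],
    [0, 0, 1],
    [1, 1, 0],
]
histogram = dse_histogram(block)
print(len(histogram), histogram[141])
```

Element k of the returned list holds the count of DSE L(k+1), so the single window of the toy block is counted at position 141.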
• A feature reduction algorithm is applied to reduce the number of features.
• The selected features are the DSEs that most reliably separate the text blocks from the others.
• We call this feature reduction algorithm Feature Standard Deviation Analysis of Structure Elements (FSDASE)
Find the standard deviation SDXT(Ln) over the text blocks for each DSE Ln
Find the standard deviation SDXP(Ln) over the non-text blocks for each DSE Ln
Normalize them to obtain SDXT´(Ln) and SDXP´(Ln)
Then define the O(Ln) vector as
O(Ln) = |SDXT´(Ln) - SDXP´(Ln)|
Finally, take the 32 DSEs that correspond to the 32 largest values of O(Ln).
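The selection steps above could be sketched as follows. The slides do not specify the normalization or the kind of standard deviation, so this example assumes the population standard deviation and division by the largest value in each vector; the function name `fsdase` and the toy data are hypothetical.

```python
from statistics import pstdev

def fsdase(text_hists, non_text_hists, keep=32):
    """Return the indices of the `keep` features whose normalized standard
    deviations differ most between text and non-text blocks."""
    n = len(text_hists[0])
    sd_text = [pstdev([h[i] for h in text_hists]) for i in range(n)]
    sd_non_text = [pstdev([h[i] for h in non_text_hists]) for i in range(n)]

    def normalize(values):
        # Assumed normalization: scale so that the largest value becomes 1.
        peak = max(values)
        return [v / peak if peak else 0.0 for v in values]

    o = [abs(a - b)
         for a, b in zip(normalize(sd_text), normalize(sd_non_text))]
    ranked = sorted(range(n), key=lambda i: o[i], reverse=True)
    return sorted(ranked[:keep])

# Toy histograms with four features instead of 510.
text_blocks = [[0, 4, 1, 0], [8, 4, 1, 2]]
non_text_blocks = [[3, 0, 1, 5], [3, 6, 1, 5]]
print(fsdase(text_blocks, non_text_blocks, keep=2))
```

In the toy data, feature 0 varies only among the text blocks and feature 1 only among the non-text blocks, so both are selected.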
• The goal of FSDASE is to find those DSEs that have maximum SD in the text blocks and minimum SD in the non-text blocks, or vice versa
• A training dataset is required
• This is not a problem, because such a dataset is already required for training the SVMs
• Therefore, the final block descriptor is a vector with 32 elements, corresponding to the frequencies of the 32 selected DSEs in the block
The descriptor has the ability to adapt to the demands of each set of document images
A noisy document has a different set of DSEs than a clean document
If more computational power is available, the descriptor size can easily be increased beyond 32
This descriptor is used to train the Support Vector Machines
• SVMs are based on statistical learning theory
• They need training data
• They separate the space in which the training data reside into two classes
• The training data must be linearly separable
• If the training data are not linearly separable (as in our case), they are mapped from the input space to a feature space using the kernel method
• Our experiments showed the Radial Basis Function, $\exp\{-\gamma \|x - x'\|^2\}$, to be the most robust kernel
• The SVM parameters are determined by a cross-validation procedure using a grid search
• The SVM output classifies each block as text or non-text
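To make the kernel and the parameter search concrete, the sketch below evaluates the standard RBF kernel and enumerates a candidate grid. The slides do not give the grid that was searched, so the exponentially spaced values for the penalty parameter C and for gamma are assumptions, as are all names; the cross-validation itself is not shown.

```python
import math
from itertools import product

def rbf_kernel(x, y, gamma):
    """Standard RBF kernel: exp(-gamma * ||x - y||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)

# Assumed search grid: powers of two for C and gamma.
c_values = [2.0 ** e for e in range(-5, 16, 2)]
gamma_values = [2.0 ** e for e in range(-15, 4, 2)]
grid = list(product(c_values, gamma_values))

# Identical descriptors give 1; the value decays as they move apart.
print(rbf_kernel([1.0, 2.0], [1.0, 2.0], gamma=0.5))
print(rbf_kernel([1.0, 0.0], [0.0, 0.0], gamma=1.0))
print(len(grid))
```

In a grid search, each (C, gamma) pair would be scored by cross-validation on the training blocks, and the best-scoring pair kept.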
• The Document Image Database from the University of Oulu is employed
• In our experiments, we used the set of 48 article documents
• These document images contain a mixture of text and pictures
• From this database, five images were selected, and the blocks extracted from them were used to determine the proper DSEs and as training samples for the SVMs
• The overall results are:
Document Images | Blocks | Success Rate
48 | 25958 | 98.453%
[Figures: five example pairs, each showing the original image and the output of the proposed method]
• A bottom-up text localization technique is proposed that detects and extracts homogeneous text from document images
• A connected component analysis technique is applied, which detects the objects of the document
• A flexible descriptor is extracted based on structural elements
• The descriptor has the ability to adapt to the demands of each set of document images
• For example, a noisy document has a different set of DSEs than a clean document
• If more computational power is available, the descriptor size can easily be increased beyond 32
• A trained SVM classifies the objects as text or non-text
• The experimental results are very promising
ΕΥΧΑΡΙΣΤΩ!
THANK YOU!

 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector Databases
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 

Document Structure Features and SVMs Locate Text

  • 1. using Document Structure Features and Support Vector Machines Konstantinos Zagoris, Nikos Papamarkos Image Processing and Multimedia Laboratory Department of Electrical & Computer Engineering Democritus University of Thrace 67100 Xanthi, Greece Email: papamark@ee.duth.gr http://ipml.ee.duth.gr/~papamark/
  • 2. Nowadays, there is an abundance of document images, such as technical articles, business letters, faxes and newspapers, without any indexing information. For systems such as OCR to exploit them successfully, a text localization technique must be employed.
  • 3. Bottom-up techniques; top-down techniques.
  • 4. The proposed technique is a continuation of the work “PLA using RLSA and a neural network” by C. Strouthopoulos, N. Papamarkos and C. Chamzas. In the proposed work the feature set is adaptive, the feature reduction technique is simpler, and the classifier is an SVM.
  • 5. (1) Apply preprocessing techniques (binarization, noise reduction). (2) Locate, merge and extract blocks. (3) Extract the features from the blocks. (4) Find the blocks that contain text using Support Vector Machines. (5) Locate or extract the text blocks and present them to the user.
  • 6. Figure panels: the original document; after the pre-processing step; the connected components; the expanded connected components; the final blocks.
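
The block-location step on slides 5 and 6 starts from connected components. The following is a minimal sketch of that starting point only, assuming a binary image stored as a list of rows with 1 for foreground pixels and 8-connectivity. The function name `connected_component_boxes` is hypothetical, and the expansion and merging of components shown on the slides is not reproduced here.

```python
from collections import deque


def connected_component_boxes(image):
    """Return bounding boxes (top, left, bottom, right) of 8-connected components."""
    rows = len(image)
    cols = len(image[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill from an unvisited foreground pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            top, left, bottom, right = r, c, r, c
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and image[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes
```

Each returned box can then be cropped from the binary image and passed on to feature extraction.
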
  • 7. The features are a set of suitable Document Structure Elements (DSEs) that the blocks contain. A DSE is any 3x3 binary block, so there are $2^9 = 512$ DSEs in total. The pixel order of a DSE, row by row, is: b8 b7 b6 / b5 b4 b3 / b2 b1 b0. Each DSE is numbered $L_j = \sum_{i=0}^{8} 2^i b_{ji}$. (Figure: the DSE of L142.)
  • 8. The initial descriptor of the block is the histogram of the DSEs that the block contains. The length of the initial descriptor is 510: the L0 and L511 DSEs are removed because they correspond to pure background and pure document objects, respectively. A feature reduction algorithm is then applied to reduce the number of features. The selected features are the DSEs that most reliably separate the text blocks from the others. We call this feature reduction algorithm Feature Standard Deviation Analysis of Structure Elements (FSDASE).
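
The DSE numbering and the 510-bin initial descriptor from slides 7 and 8 can be sketched as follows. Two points are assumptions rather than statements from the slides: the pixel order is taken to be b8 at the top-left through b0 at the bottom-right, and the 3x3 windows are taken to overlap, sliding one pixel at a time. The names `dse_index` and `dse_histogram` are hypothetical.

```python
def dse_index(block):
    """Map a 3x3 binary block to its DSE number L = sum(2**i * b_i)."""
    # ASSUMPTION: reading the block row by row yields b8, b7, ..., b0.
    bits = [pixel for row in block for pixel in row]
    return sum(bit << (8 - pos) for pos, bit in enumerate(bits))


def dse_histogram(image):
    """Count the DSEs in a binary block image and drop L0 and L511."""
    counts = [0] * 512
    rows, cols = len(image), len(image[0])
    # ASSUMPTION: windows overlap, moving one pixel at a time.
    for r in range(rows - 2):
        for c in range(cols - 2):
            window = [image[r + k][c:c + 3] for k in range(3)]
            counts[dse_index(window)] += 1
    # 510 bins remain; bin k of the result corresponds to DSE L(k+1).
    return counts[1:511]


# The block below has b7, b3, b2 and b1 set: 128 + 8 + 4 + 2 = 142.
print(dse_index([[0, 1, 0], [0, 0, 1], [1, 1, 0]]))  # 142
```

Under this bit order an all-background window maps to L0 and an all-foreground window to L511, which matches the two DSEs the slides remove.
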
  • 9. Find the standard deviation for the text blocks, SDXT(Ln), for each DSE Ln. Find the standard deviation for the non-text blocks, SDXP(Ln), for each DSE Ln. Normalize them to obtain SDXT′(Ln) and SDXP′(Ln). Then define the vector O(Ln) = |SDXT′(Ln) − SDXP′(Ln)|. Finally, take the 32 DSEs that correspond to the 32 largest values of O(Ln).
  • 10. The goal of FSDASE is to find those DSEs that have maximum SD in the text blocks and minimum SD in the non-text blocks, or the opposite. A training dataset is required; this does not cause a problem, because such a dataset is already required for training the SVMs. Therefore the final block descriptor is a vector with 32 elements, corresponding to the frequencies of the 32 selected DSEs in the block.
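
The FSDASE steps on slides 9 and 10 can be sketched as below. The slides do not say how the standard deviations are normalized, so dividing each vector by its largest value is an assumption made here for illustration. The function name `fsdase_select` and its `keep` parameter are hypothetical.

```python
from statistics import pstdev


def fsdase_select(text_descriptors, non_text_descriptors, keep=32):
    """Return indices of the `keep` features with the largest O = |SDXT' - SDXP'|."""

    def normalized_sd(descriptors):
        # Standard deviation of each feature (column) across the blocks.
        sds = [pstdev(column) for column in zip(*descriptors)]
        # ASSUMPTION: normalize by the largest SD; the slides do not say how.
        peak = max(sds) or 1.0
        return [sd / peak for sd in sds]

    sd_text = normalized_sd(text_descriptors)
    sd_non_text = normalized_sd(non_text_descriptors)
    scores = [abs(t - p) for t, p in zip(sd_text, sd_non_text)]
    # Rank feature indices by score, largest first, and keep the top ones.
    ranked = sorted(range(len(scores)), key=lambda n: scores[n], reverse=True)
    return sorted(ranked[:keep])
```

Each descriptor is one block's DSE histogram. With the 510-bin initial descriptor, a returned index k refers to DSE L(k+1), and the final 32-element descriptor is obtained by reading only those bins.
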
  • 11. The descriptor can adapt to the demands of each set of document images: a noisy document has a different set of DSEs than a clean document. If more computational power is available, the descriptor size can easily be increased above 32. This descriptor is used to train the Support Vector Machines.
  • 12. SVMs are based on statistical learning theory. They need training data. They separate the space in which the training data reside into two classes. The training data must be linearly separable.
  • 13. If the training data are not linearly separable (as in our case), they are mapped from the input space to a feature space using the kernel method. Our experiments showed the Radial Basis Function, $\exp\{-\gamma \|x - x'\|^2\}$, to be the most robust kernel. The parameters of the SVMs are determined by a cross-validation procedure using a grid search. The output of the SVM classifies each block as text or non-text.
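
As a small illustration of slide 13, the sketch below evaluates the Radial Basis Function kernel in its standard squared-distance form and builds the kind of candidate grid a grid search would iterate over. The penalty parameter `C` and the base-2 logarithmic spacing are assumptions (a common convention), not values given in the slides, and both function names are hypothetical. Training and cross-validating the SVM itself is left to an SVM library.

```python
import math


def rbf_kernel(x, x_prime, gamma):
    """Radial basis function kernel exp(-gamma * ||x - x'||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, x_prime))
    return math.exp(-gamma * squared_distance)


def parameter_grid(c_exponents, gamma_exponents):
    """Candidate (C, gamma) pairs on a base-2 logarithmic grid (assumed spacing)."""
    return [(2.0 ** c, 2.0 ** g) for c in c_exponents for g in gamma_exponents]


# Identical descriptors give the maximum kernel value of 1.0.
print(rbf_kernel([1.0, 2.0], [1.0, 2.0], gamma=0.5))  # 1.0
```

In a grid search, each (C, gamma) pair would be scored by cross-validation on the training blocks and the best-scoring pair kept.
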
  • 14. The Document Image Database from the University of Oulu is employed. In our experiments we used the set of 48 article documents. Those document images contained a mixture of text and pictures. From this database, five images were selected, and the extracted blocks were used to determine the proper DSEs and as training samples for the SVMs. The overall results are: document images: 48; blocks: 25958; success rate: 98.453%.
  • 15–19. Original image and the output of the proposed method (five examples).
  • 20. A bottom-up text localization technique is proposed that detects and extracts homogeneous text from document images. A connected component analysis technique is applied to detect the objects of the document. A flexible descriptor based on structural elements is extracted. The descriptor can adapt to the demands of each set of document images; for example, a noisy document has a different set of DSEs than a clean document. If more computational power is available, the descriptor size can easily be increased above 32. A trained SVM classifies the objects as text or non-text. The experimental results are very promising.