This document presents a bottom-up text localization technique that uses document structure features and support vector machines. The technique detects and extracts text from document images through several steps: preprocessing, locating and merging blocks, extracting features from blocks, and using support vector machines trained on the features to classify blocks as containing text or not. The technique uses a flexible feature descriptor based on structural elements that can adapt to different document image types. Experimental results on document images show a 98.5% success rate in classifying blocks.
Text extraction using document structure features and support vector machines
1. Text Extraction Using Document Structure Features
and Support Vector Machines
Konstantinos Zagoris, Nikos Papamarkos
Image Processing and Multimedia Laboratory
Department of Electrical & Computer Engineering
Democritus University of Thrace
67100 Xanthi, Greece
Email: papamark@ee.duth.gr
http://ipml.ee.duth.gr/~papamark/
2. Nowadays, there is an abundance of document
images, such as technical articles, business letters,
faxes and newspapers, without any indexing
information
In order to successfully exploit them in systems
such as OCR, a text localization technique must be
employed
4. The proposed technique is a continuation
of the work “PLA using RLSA and a neural
network” of C. Strouthopoulos, N.
Papamarkos and C. Chamzas
In the proposed work the feature set is
adaptive
The feature reduction technique is simpler
The classifier is an SVM
6. Figures: the original document; the document after the pre-processing step;
the connected components; the expanded connected components; the final blocks
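The connected-component step illustrated above can be sketched in a few lines. This is an illustrative BFS flood fill over a binary image, not the authors' implementation:

```python
import numpy as np
from collections import deque

def label_components(binary):
    """4-connected component labeling of a binary image via BFS flood fill.

    binary: 2-D numpy array where nonzero pixels are document objects.
    Returns a label image (0 = background) and the number of components.
    """
    h, w = binary.shape
    labels = np.zeros((h, w), dtype=int)
    current = 0
    for sy in range(h):
        for sx in range(w):
            if binary[sy, sx] and labels[sy, sx] == 0:
                current += 1               # start a new component
                labels[sy, sx] = current
                queue = deque([(sy, sx)])
                while queue:
                    y, x = queue.popleft()
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        if 0 <= ny < h and 0 <= nx < w \
                                and binary[ny, nx] and labels[ny, nx] == 0:
                            labels[ny, nx] = current
                            queue.append((ny, nx))
    return labels, current
```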
7. The features are a set of suitable
Document Structure Elements (DSEs)
that the blocks contain
A DSE is any 3×3 binary block
There are 2⁹ = 512 DSEs in total
The pixel order of the DSEs:
b8 b7 b6
b5 b4 b3
b2 b1 b0
Each DSE is labelled by the number Lⱼ = Σᵢ₌₀⁸ bⱼᵢ · 2ⁱ
(e.g. the DSE of L142)
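The labelling rule can be sketched as follows (the exact pixel order, b8…b0 read row by row, is our reading of the slide's figure):

```python
def dse_label(block):
    """Label of a 3x3 binary block: L = sum over i of b_i * 2**i.

    block: 3x3 nested list of 0/1 pixels, laid out in the pixel order
    b8 b7 b6 / b5 b4 b3 / b2 b1 b0 (our assumed reading of the figure).
    """
    # Flatten row-major (b8..b0), then reverse so b0 gets weight 2**0
    bits = [px for row in block for px in row][::-1]
    return sum(b << i for i, b in enumerate(bits))
```

With this rule, the all-background block gets label 0 and the all-object block gets label 511, matching the two DSEs discarded on the next slide.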
8. The initial descriptor of the block is the histogram
of the DSEs that the block contains
The length of the initial descriptor is 510
The L0 and L511 DSEs are removed because they
correspond to pure background and pure
document objects, respectively
A feature reduction algorithm is applied which
reduces the number of features.
The selected features are the DSEs that most
reliably separate the text blocks from the
others.
We call this feature reduction algorithm Feature
Standard Deviation Analysis of Structure Elements
(FSDASE)
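A minimal sketch of the initial 510-bin descriptor, assuming the block is a binary 2-D numpy array scanned with all 3×3 windows (the window-scanning detail is our assumption, not stated on the slides):

```python
import numpy as np

def block_descriptor(block):
    """Histogram of DSEs over all 3x3 windows of a binary block.

    Bins L0 (pure background) and L511 (pure object) are dropped,
    leaving the 510-dimensional initial descriptor.
    """
    # Weights in the pixel order b8..b0 read row by row
    weights = 2 ** np.arange(9)[::-1].reshape(3, 3)
    h, w = block.shape
    hist = np.zeros(512, dtype=int)
    for y in range(h - 2):
        for x in range(w - 2):
            label = int(np.sum(block[y:y + 3, x:x + 3] * weights))
            hist[label] += 1
    return hist[1:511]  # remove L0 and L511
```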
9. Find the standard deviation SDXT(Ln) over the
text blocks, for each DSE Ln
Find the standard deviation SDXP(Ln) over the
non-text blocks, for each DSE Ln
Normalize them
Then define the O(Ln) vector as
O(Ln) = |SD′XT(Ln) − SD′XP(Ln)|
Finally, take the 32 DSEs that correspond
to the 32 largest values of O(Ln).
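The FSDASE steps above can be sketched in numpy. The slides only say "normalize them", so the max-normalization used here is our assumption:

```python
import numpy as np

def fsdase(text_desc, nontext_desc, k=32):
    """FSDASE feature selection sketch.

    text_desc, nontext_desc: arrays of shape (num_blocks, 510) holding
    the initial DSE histograms of text and non-text training blocks.
    Returns the indices of the k DSEs with the largest O(Ln).
    """
    sd_xt = np.std(text_desc, axis=0)     # SDXT(Ln), per DSE
    sd_xp = np.std(nontext_desc, axis=0)  # SDXP(Ln), per DSE
    # Normalization (assumed): scale each vector into [0, 1]
    sd_xt = sd_xt / sd_xt.max() if sd_xt.max() > 0 else sd_xt
    sd_xp = sd_xp / sd_xp.max() if sd_xp.max() > 0 else sd_xp
    o = np.abs(sd_xt - sd_xp)             # O(Ln)
    return np.argsort(o)[::-1][:k]        # indices of the k largest values
```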
10. The goal of the FSDASE is to find those
DSEs whose frequency has maximum standard
deviation over the text blocks and minimum
over the non-text blocks, or vice versa
A training dataset is required
This is not a problem, because such a
dataset is already required for training
the SVMs
Therefore the final block descriptor is a vector
with 32 elements, corresponding to the
frequencies of the 32 selected DSEs in the
block
11. The descriptor has the ability to adapt to
the demands of each set of document
images
A noisy document yields a different set of
DSEs than a clean document
If more computational power is available,
the descriptor size can easily be increased
beyond 32
This descriptor is used to train the Support
Vector Machines
12. Based on statistical learning theory
They need training data
They separate the space in which the training
data reside into two classes.
The training data must be linearly separable.
13. If the training data are not linearly separable (as in our case),
they are mapped from the input space to a feature space
using the kernel method
Our experiments showed the Radial Basis Function
exp(−γ‖x − x′‖²) to be the most robust kernel
The parameters of the SVMs are selected by a cross-
validation procedure using a grid search
The output of the SVM classifies each block as text or not
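The slides use SVMs; as a dependency-free illustration of the same kernel trick, here is a kernel perceptron (a stand-in classifier, not the authors' SVM) using the RBF kernel on XOR-style data that no linear classifier can separate:

```python
import numpy as np

def rbf(x, y, gamma=1.0):
    # RBF kernel K(x, x') = exp(-gamma * ||x - x'||^2)
    return np.exp(-gamma * np.sum((x - y) ** 2))

def train_kernel_perceptron(X, y, gamma=1.0, epochs=20):
    """Kernel perceptron in dual form: one coefficient alpha[i] per point."""
    n = len(X)
    alpha = np.zeros(n)
    K = np.array([[rbf(X[i], X[j], gamma) for j in range(n)]
                  for i in range(n)])
    for _ in range(epochs):
        for i in range(n):
            score = np.sum(alpha * y * K[:, i])
            pred = 1.0 if score >= 0 else -1.0
            if pred != y[i]:
                alpha[i] += 1.0  # mistake-driven update
    return alpha

def predict(X_train, y_train, alpha, x, gamma=1.0):
    score = sum(a * t * rbf(xt, x, gamma)
                for a, t, xt in zip(alpha, y_train, X_train))
    return 1.0 if score >= 0 else -1.0
```

The kernel replaces every inner product with K(x, x′), so the classifier is effectively linear in a higher-dimensional feature space; an SVM with the same kernel additionally maximizes the margin there.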
14. The Document Image Database from the University of
Oulu is employed
In our experiments we used the set of 48 article
documents
Those document images contained a mixture of text
and pictures
From this database, five images were selected, and the
extracted blocks were used to determine the proper DSEs
and to serve as training samples for the SVMs
The overall results are:
Document Images    Blocks    Success Rate
48                 25958     98.453%
20. A bottom-up text localization technique is proposed
that detects and extracts homogeneous text from
document images
A connected component analysis technique is applied
which detects the objects of the document
A flexible descriptor based on structural elements is
extracted
The descriptor has the ability to adapt to the demands
of each set of document images
For example, a noisy document yields a different set of
DSEs than a clean document
If more computational power is available, the
descriptor size can easily be increased beyond 32
A trained SVM classifies the objects as text or non-text
The experimental results are very promising