co:op-READ-Convention Marburg - Basilis Gatos

Hard Tasks in the Background - Layout Analysis
Technology meets Scholarship, or how Handwritten Text Recognition will Revolutionize Access to Archival
Collections. With a special focus on biographical data in archives.
Hessian State Archives in Marburg, 19-21 January 2016
Computational Intelligence Laboratory
Institute of Informatics and Telecommunications
National Center for Scientific Research "Demokritos"
Agia Paraskevi, Athens, Greece
Basilis Gatos

Outline
► The Computational Intelligence Laboratory of NCSR Demokritos
► Introduction – Problem Definition
► Preprocessing Tasks before Layout Analysis
► Page Segmentation & Document Understanding
► Form & Table Analysis
► Text Line Detection
► Word Detection
► Layout Analysis tasks in READ Project

The Computational Intelligence Laboratory of NCSR Demokritos
National Centre of Scientific Research "DEMOKRITOS":
The largest self-governing research organisation, under the supervision of the
Greek Government.
It is composed of the following Institutes:
• Biosciences & Applications
• Nuclear & Particle Physics
• Informatics & Telecommunications
• Nuclear & Radiological Sciences & Technology, Energy & Safety
• Advanced Materials, Physicochemical Processes, Nanotechnology & Microsystems

CIL Activities Chart
• Computational Intelligence, Pattern recognition background: Neural Networks, Biologically
inspired modelling, Bayesian networks, Machine learning
• Multimedia Optical Information Processing, Semantic analysis & Retrieval: Image, Video,
3D Graphics
• Document image processing and understanding
• Medical signal and image analysis
• Environmental applications
• Information retrieval from the Web
• …

• A strong involvement in the research field of Document Image Analysis and Recognition in
the last 25 years
• Our specific research interests lie in document image preprocessing (binarization,
deskew, dewarping, image enhancement), segmentation, recognition, word spotting,
writer identification and performance evaluation, mainly for historical and handwritten
documents. We also work on VOCR and logo detection
• More than 150 journal and conference publications
• Our group consists of 11 people working in the field of Document Image Analysis and
Recognition (Researchers, Research Associates and PhD students)
– Researchers: S. Perantonis, B. Gatos, I. Pratikakis (Assistant Professor, DUTH)
– Research Associates: G. Louloudis, N. Stamatopoulos, G. Sfikas, K. Zagoris
– PhD students: A. Papandreou, K. Alexopoulos, G. Retsinas, G. Barlas
• We are involved in a series of national and EU projects (READ, tranScriptorium,
OldDocPro, IMPACT, CASAM, BOEMIE, POLYTIMO, D-SCRIBE etc.). We also hold contracts
supporting several companies in processing handwritten documents (Greek Army Archives),
analyzing handwritten forms and business documents (receipts and invoices), and
detecting logos in videos (tennis games).
• We serve on the program committees of several international Conferences and Workshops
(e.g. ICDAR 2011, ICFHR 2012, ICDAR 2013, CBDAR 2013, International Workshop on
Historical Document Imaging and Processing 2013) as well as on the Editorial Board of the
International Journal on Document Analysis and Recognition (IJDAR). We are also the
co-organizers of the International Conference on Frontiers in Handwriting Recognition
(ICFHR) that was held in Greece in 2014 and of DAS 2016 (11 April 2016, Santorini).

Introduction – Problem Definition
Historical handwritten documents often suffer from several degradations,
have low quality, exhibit dense layout, and may have touching adjacent text lines
and arbitrary text line skew.

Page segmentation is the task of extracting homogeneous components
from page images (detect both text and non-text areas, discriminate
handwritten from possible machine printed text, classify non-text areas as
decorations, ruled lines, noise etc.)

Document understanding or logical layout analysis refers to the logical
and semantic analysis of document parts in order to extract human
understandable information and codify it into machine-readable form
(detect reading order, page numbers, headers, marginal elements or other
use-case oriented information).

Text line and word detection are used after layout analysis in order to
provide the proper input to a recognition or a word spotting engine.

Preprocessing Tasks before Layout Analysis
Image Enhancement in order to improve the quality of the original image.

Binarization (conversion of a grayscale or colour image into a binary
image) helps to separate the text from the background, reduces image
storage space and allows efficient and quick further processing for page
segmentation.
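
A minimal sketch of global binarization follows. The slides do not name a specific method, so Otsu's threshold is used here as an assumed, commonly cited choice; the function names `otsu_threshold` and `binarize` are hypothetical, and the image is assumed to be a list of rows of 8-bit grey values. Degraded historical pages usually call for local or adaptive thresholds instead, so this only illustrates the conversion step itself.

```python
def otsu_threshold(gray):
    """Return the grey level t (0-255) that maximizes the between-class
    variance of the split 'pixels <= t' versus 'pixels > t'."""
    hist = [0] * 256
    for row in gray:
        for value in row:
            hist[value] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    count_low, sum_low = 0, 0
    for t in range(256):
        count_low += hist[t]
        sum_low += t * hist[t]
        count_high = total - count_low
        if count_low == 0 or count_high == 0:
            continue  # one class is empty, so this is not a valid split
        mean_low = sum_low / count_low
        mean_high = (sum_all - sum_low) / count_high
        # Between-class variance up to the constant factor 1 / total**2
        between = count_low * count_high * (mean_low - mean_high) ** 2
        if between > best_var:
            best_t, best_var = t, between
    return best_t


def binarize(gray):
    """Map dark pixels (<= threshold) to 1 (ink) and all others to 0 (background)."""
    t = otsu_threshold(gray)
    return [[1 if value <= t else 0 for value in row] for row in gray]
```

Plain lists keep the sketch free of dependencies; an array library would be the usual choice for real page images.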

Border removal in order to detect and remove black borders as well as
noise regions from the scanned document image.

Skew/orientation correction, a document image normalization step, restores text areas
to a horizontal alignment (0° angle).
Page curl / warping correction to correct image distortions.

Page Segmentation & Document Understanding
Historical handwritten documents do not have strict layout rules and thus,
page segmentation and layout analysis methods need to be invariant to
layout inconsistencies, irregularities in script and writing style, skew,
fluctuating text lines, variable shapes of decorative entities etc.

Existing techniques:
Most of the state-of-the-art methodologies focus on machine-printed or
modern handwritten documents and only a few deal with historical
handwritten documents.
► For machine-printed documents:
■ XY-Cuts: The pixels of the document image are projected horizontally and vertically.
We then look for the largest possible white gap in the projection and split the image
into two sub-images at this gap. The procedure is repeated recursively, changing
direction, until a stopping criterion is fulfilled.
■ Run Length Smearing Algorithm: Smearing of black pixels.
■ Docstrum: For each connected component we compute the k-nearest-neighbours.
■ Voronoi, Whitespace Algorithms, …
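
The XY-cut procedure described above can be sketched as follows. This is an illustrative reading, not a reference implementation: the image is assumed to be a list of rows with 1 for black pixels, the function name `xy_cut` and the parameter `min_gap` are hypothetical, and the stopping criterion is assumed to be "no white gap of at least `min_gap` pixels in either direction".

```python
def xy_cut(img, min_gap=2):
    """Recursive XY-cut on a binary image (list of rows, 1 = black pixel).

    Returns the bounding boxes of the leaf blocks as (top, bottom, left, right),
    with bottom and right exclusive.
    """
    blocks = []

    def projection(region, horizontal):
        top, bottom, left, right = region
        if horizontal:  # number of black pixels per row
            return [sum(img[y][left:right]) for y in range(top, bottom)]
        # number of black pixels per column
        return [sum(img[y][x] for y in range(top, bottom)) for x in range(left, right)]

    def widest_gap(profile):
        # Widest run of empty positions enclosed by two inked positions.
        inked = [i for i, count in enumerate(profile) if count > 0]
        best = (0, 0, 0)  # (gap width, first empty index, next inked index)
        for a, b in zip(inked, inked[1:]):
            if b - a - 1 > best[0]:
                best = (b - a - 1, a + 1, b)
        return best

    def bounding_box(region):
        top, _, left, _ = region
        rows = [i for i, c in enumerate(projection(region, True)) if c > 0]
        cols = [i for i, c in enumerate(projection(region, False)) if c > 0]
        return (top + rows[0], top + rows[-1] + 1, left + cols[0], left + cols[-1] + 1)

    def split(region, horizontal, other_direction_failed):
        top, bottom, left, right = region
        gap, start, end = widest_gap(projection(region, horizontal))
        if gap < min_gap:
            if other_direction_failed:
                blocks.append(bounding_box(region))  # assumed stopping criterion
            else:
                split(region, not horizontal, True)  # try the other direction
            return
        if horizontal:
            parts = [(top, top + start, left, right), (top + end, bottom, left, right)]
        else:
            parts = [(top, bottom, left, left + start), (top, bottom, left + end, right)]
        for part in parts:
            split(part, not horizontal, False)

    if any(any(row) for row in img):
        split((0, len(img), 0, len(img[0])), True, False)
    return sorted(blocks)
```

On a page with two column blocks above a full-width block, the sketch first cuts between the upper and lower parts and then between the two columns.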

Related work:
V. Malleron, V. Eglin, H. Emptoz, S. Dord-Crousle, P. Regnier, “Text Lines and
Snippets Extraction for 19th Century Handwriting Documents Layout
Analysis”, ICDAR 2009, pp. 1001-1005. (Universite de Lyon, CNRS)

Related work:
M. Bulacu, R. van Koert, L. Schomaker, T. van der Zant, "Layout analysis of
handwritten historical documents for searching the archive of the Cabinet of the Dutch
Queen", ICDAR 2007, pp. 357-361. (University of Groningen, The Netherlands)
Detection of the rule lines of the tables and the page margins.
Two methods were tested: the first uses color information (and then horizontal and
vertical projections), while the second takes gray-scale images as input (binarization,
detection of long vertical black runs to find vertical rule lines, process columns and
extract long horizontal black runs to detect horizontal rule lines).
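
The "long vertical black runs" step of the second method can be illustrated with a short sketch. It is not the cited authors' code: the function name `long_vertical_runs` and the parameter `min_length` are hypothetical, and the input is assumed to be an already binarized image given as a list of rows with 1 for black pixels.

```python
def long_vertical_runs(binary, min_length):
    """Return (column, top, bottom) for every vertical run of black pixels
    that is at least min_length pixels long; bottom is exclusive."""
    height, width = len(binary), len(binary[0])
    runs = []
    for x in range(width):
        start = None
        # Iterate one step past the last row so a run touching the bottom is closed.
        for y in range(height + 1):
            black = y < height and binary[y][x] == 1
            if black and start is None:
                start = y
            elif not black and start is not None:
                if y - start >= min_length:
                    runs.append((x, start, y))
                start = None
    return runs
```

Runs much longer than a typical character height are candidates for rule lines, while short runs mostly belong to pen strokes; the same scan over rows instead of columns gives the horizontal case.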

Related work:
S. S. Bukhari, T. M. Breuel, A. Asi, J. El-Sana, “Layout Analysis for Arabic Historical
Document Images Using Machine Learning”, ICFHR 2012, pp. 635-640. (Technical
University of Kaiserslautern, Germany - Ben-Gurion University of the Negev, Israel)
Features are extracted at the connected-component level, and a multi-layer
perceptron classifier is used to classify connected components as main-body
or side-note text.
Component Shape:
1. Normalized height: the height of a component divided by the height of
an input document image.
2. Foreground area: number of foreground pixels in the rescaled area of a
component divided by the total number of pixels in the rescaled area.
3. Relative distance: the relative distance of a connected component
from the center of the document.
4. Orientation: the orientation of a connected component is estimated
with respect to its neighborhood.
Component Context:

Related work:
S. Nicolas, T. Paquet and L. Heutte, "Complex Handwritten Page Segmentation Using
Contextual Models", DIAL 2006, pp. 46-59. (Laboratoire PSI – Université de Rouen)
Task 1: Label the main regions of the manuscripts such as text body, margins, header,
footer, page number and marginal annotations.
Colour legend: red = page number, green = header, blue = text body, pink = footer,
cyan = text block, yellow = margin
Task 2: Detect pseudowords, deletions, diacritics and background.
Colour legend: white = background, green = normal text, blue = erasure, pink = diacritic
Markov Random Field models using multiresolution pixel density feature extraction are
used (results for Task 1: ~90%, number of images: 69).

• Vertical lines are detected based on a fuzzy smoothing method.
• We also process the vertical white runs of the image.
• We treat cases of text overlapping with rule lines.
• Success rate: ~90% using a set of 500 representative images.

N. Stamatopoulos, G. Louloudis and B. Gatos, “Goal-Oriented Performance Evaluation
Methodology for Page Segmentation Techniques”, ICDAR 2015.
• It is a pixel-based approach which avoids the dependence on a strictly defined
ground-truth.
• The proposed evaluation measure is correlated with the percentage of the text
information to which the subsequent processing (e.g. text line segmentation and
recognition) can be applied successfully.

Form & Table Analysis
• Forms/Tables contain structured information
• Allows extraction of semantic information due to syntactical knowledge
• Form documents:
• Get information about the content of a record
• E.g. Index or Table of Contents
• Concrete search for form documents

Challenges:
• Bad condition of historical documents (faded ink, stains, mold)
• Small variations of the form layout between consecutive versions
• Geometrically similar layouts
• Handwritten filled-in data can affect (global) form features
• Small inter-class variance for certain form types (e.g. table of contents)
• Different form types can have the same logical structure (based on the description)
• Restoration of handwritten text needed after form dropout
• “Hand drawn” forms in historical documents (small variations of spacing, … )

Form Processing:

State-of-the-art:
• Global Image Based Features
• Methods based on Hierarchical Descriptions
• Local and Structural Features
• Subgraphs as combination of 2 or more primitives (E. Saund)
• Line information + preprinted labels

B. Gatos, D. Danatsas, I. Pratikakis and S. J. Perantonis, "Automatic Table Detection
in Document Images", 3rd International Conference on Advances in Pattern
Recognition (ICAPR'05), pp. 612-621, Bath, UK, August 2005.

Text Line Detection
► Process for defining the region of every text line on a document
image
► Crucial part of the workflow since its performance seriously affects
word segmentation and HTR
► How is a text line region defined?
► Using a baseline
► Or using a polygon area

Text Line Detection
Challenges:
► Difference in the skew angle between lines on the page or even
along the same text line

► Overlapping text lines
► Touching text lines

► Additions above the text line
► Deleted text

Text Line Detection
State-of-the-art:
► Projection Profile methods
■ Find local minima/maxima on the projection profile
■ Work on the entire image or on vertical strips
■ Can handle some degree of skew
► Hough-based methods
■ Calculate the Hough transform of a set of interest points
■ Points include gravity centers of connected components (CCs), minima points of CCs,
or all black pixels
■ Skewed text lines can be detected
■ Errors occur when the skew varies along the width of a text line
► Smearing methods
■ Consecutive black pixels along the horizontal direction are smeared
■ The white space between them is filled with black pixels if its length is within
a predefined threshold
► Seam carving methods: find the paths with the minimum cost from the left part
to the right part of the document image using dynamic programming
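
As an illustration of the smearing idea in the list above, the sketch below fills short white runs between black pixels along each row. It is a simplified, assumed reading of horizontal run-length smearing; the function name `smear_rows` and the parameter `max_gap` (the predefined threshold) are hypothetical, and the image is a list of rows with 1 for black pixels.

```python
def smear_rows(binary, max_gap):
    """Fill every white run of at most max_gap pixels that lies between
    two black pixels in the same row. Returns a new image."""
    smeared = []
    for row in binary:
        out = list(row)
        last_black = None  # column of the most recent black pixel in this row
        for x, value in enumerate(row):
            if value == 1:
                gap = 0 if last_black is None else x - last_black - 1
                if 0 < gap <= max_gap:
                    for fill in range(last_black + 1, x):
                        out[fill] = 1
                last_black = x
        smeared.append(out)
    return smeared
```

After smearing, the characters of one text line tend to merge into long horizontal blobs, so connected components of the smeared image can serve as text line candidates. Leading and trailing white runs are left untouched because they are not enclosed by black pixels.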

Text Line Detection
Success rate: ~85% using a set of more than 400 images (>10,000 text lines)

Text Line Detection
Pipeline: Binarization → Remove Underlines → Baselines Detection → Skew Correction →
Slant Correction → Size Normalization (text line → normalized text line)
Cost function:
$$\mathcal{F}(u, l) = \bigl(P_w(l) - P_w(u)\bigr) \cdot e^{-\frac{(l - u - h)^2}{s}}$$
$P_w(y)$: difference of the horizontal projections above and below $y$
$h$: dominant height
Upper zone: 30, main zone: 60, lower zone: 30
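
The cost function above can be evaluated by brute force over all row pairs, as in the sketch below. Several points are assumptions rather than statements from the slides: `u` and `l` are read as the upper and lower baseline rows (with `l` exclusive), `P_w(y)` is read as the ink in the `w` rows above `y` minus the ink in the `w` rows starting at `y`, and the names `p_w` and `find_baselines` as well as the default `w=3` are hypothetical.

```python
import math


def horizontal_projection(binary):
    """Number of black pixels (value 1) in each row."""
    return [sum(row) for row in binary]


def p_w(proj, y, w):
    """Assumed reading of P_w(y): ink in the w rows above y
    minus ink in the w rows starting at y."""
    above = sum(proj[max(0, y - w):y])
    below = sum(proj[y:y + w])
    return above - below


def find_baselines(binary, h, s, w=3):
    """Return the pair (u, l) that maximizes
    F(u, l) = (P_w(l) - P_w(u)) * exp(-((l - u - h) ** 2) / s),
    where h is the dominant height and s controls the tolerance around h."""
    proj = horizontal_projection(binary)
    rows = len(proj)
    best, best_score = None, -math.inf
    for u in range(rows):
        for l in range(u + 1, rows + 1):
            contrast = p_w(proj, l, w) - p_w(proj, u, w)
            score = contrast * math.exp(-((l - u - h) ** 2) / s)
            if score > best_score:
                best, best_score = (u, l), score
    return best
```

Under this reading the first factor rewards a sharp change from sparse to dense ink at `u` and from dense to sparse ink at `l`, while the exponential factor penalizes pairs whose distance deviates from the dominant height `h`.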

Word Detection
► Process of defining the regions of words of a text line
► Necessary for keyword spotting
► Two-step procedure:
■ Distance computation
■ Gap classification
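
The two steps can be sketched on the horizontal extents of the connected components of one text line. This is a deliberately simple illustration under stated assumptions: distances are bounding-box gaps, the classifier is a fixed `threshold` (in practice the threshold is often derived from the gap statistics of the line or learned), and the names `bounding_box_gaps` and `segment_words` are hypothetical.

```python
def bounding_box_gaps(boxes):
    """Step 1, distance computation.

    boxes: (left, right) horizontal extents of connected components, right
    exclusive. Overlapping or touching extents are merged first. Returns the
    merged extents and the white gap between each pair of neighbours."""
    merged = []
    for left, right in sorted(boxes):
        if merged and left <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], right)
        else:
            merged.append([left, right])
    gaps = [b[0] - a[1] for a, b in zip(merged, merged[1:])]
    return merged, gaps


def segment_words(boxes, threshold):
    """Step 2, gap classification: a gap wider than threshold separates two
    words, any other gap is treated as spacing inside a word."""
    merged, gaps = bounding_box_gaps(boxes)
    if not merged:
        return []
    words = [list(merged[0])]
    for box, gap in zip(merged[1:], gaps):
        if gap > threshold:
            words.append(list(box))   # inter-word gap: start a new word
        else:
            words[-1][1] = box[1]     # intra-word gap: extend the current word
    return [tuple(word) for word in words]
```

For example, components at `(0, 5)`, `(4, 6)`, `(7, 12)`, `(20, 26)` and `(28, 30)` with a threshold of 4 pixels yield two word regions, because only the gap of 8 pixels exceeds the threshold.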

Word Detection
Challenges:
► Skew along the text line
► Existence of slant
► Punctuation marks

► Non-uniform spacing of words

Word Detection
State-of-the-art:
► Distance computation stage
► Several distance metrics used in the literature:
■ Euclidean
■ Bounding box
■ Minimum run-length
■ Convex hull
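
Three of the listed distance metrics can be compared on two neighbouring components, as sketched below. The definitions are simplified assumptions (horizontal bounding-box gap, minimum Euclidean distance between pixel centres, shortest horizontal white run over shared rows); the function names are hypothetical, each component is a list of `(x, y)` black-pixel coordinates, and `a` is assumed to lie to the left of `b`. The convex hull metric is omitted.

```python
import math


def bounding_box_distance(a, b):
    """Number of white columns between the bounding boxes of a and b
    (0 if the boxes overlap or touch)."""
    return max(0, min(x for x, _ in b) - max(x for x, _ in a) - 1)


def euclidean_distance(a, b):
    """Minimum Euclidean distance between a pixel of a and a pixel of b."""
    return min(math.dist(p, q) for p in a for q in b)


def min_run_length_distance(a, b):
    """Shortest horizontal white run between a and b over the rows that
    both components occupy; None if they share no row."""
    right_edge = {}
    for x, y in a:
        right_edge[y] = max(x, right_edge.get(y, x))
    left_edge = {}
    for x, y in b:
        left_edge[y] = min(x, left_edge.get(y, x))
    shared_rows = right_edge.keys() & left_edge.keys()
    runs = [left_edge[y] - right_edge[y] - 1
            for y in shared_rows if left_edge[y] > right_edge[y]]
    return min(runs) if runs else None
```

The metrics can disagree for slanted or overlapping components, which is one reason several of them appear in the literature. Note that the Euclidean value is measured between pixel centres, so it is one unit larger than the white run for two pixels on the same row.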

Word Detection
Success rate: ~80% using a set of more than 400 images (>100,000 words)

Layout Analysis tasks in READ Project
Task 6.1. Image pre-processing and enhancement (M1-M36) Lead: NCSR
Task 6.2. Basic layout analysis (M1-M36) Lead: CVL
Task 6.3. Table and forms analysis (M1-M36) Lead: CVL
Task 6.4. Segmentation of text regions (M1-M36) Lead: NCSR
Task 6.5. Document understanding (M1-M36) Lead: XRCE
