Basilis Gatos (Computational Intelligence Laboratory, Institute of Informatics and Telecommunications, National Center for Scientific Research “Demokritos”, GR): Hard Tasks in the Background. Layout analysis
co:op-READ-Convention Marburg
Technology meets Scholarship, or how Handwritten Text Recognition will Revolutionize Access to Archival Collections.
With a special focus on biographical data in archives
Hessian State Archives Marburg, Friedrichsplatz 15, D-35037 Marburg
19-21 January 2016
Computational Intelligence Laboratory
Institute of Informatics and Telecommunications
National Center for Scientific Research "Demokritos"
Agia Paraskevi, Athens, Greece
Basilis Gatos
2. 2 Hard Tasks in the Background - Layout Analysis Technology meets Scholarship, Marburg, 19-21 January 2016
Outline
► The Computational Intelligence Laboratory of NCSR Demokritos
► Introduction – Problem Definition
► Preprocessing Tasks before Layout Analysis
► Page Segmentation & Document Understanding
► Form & Table Analysis
► Text Line Detection
► Word Detection
► Layout Analysis tasks in READ Project
The Computational Intelligence Laboratory of NCSR Demokritos
National Centre of Scientific Research "DEMOKRITOS":
The largest self-governing research organisation, under the supervision of the
Greek Government.
It is composed of the following Institutes:
Biosciences & Applications
Nuclear & Particle Physics
Informatics & Telecommunications
Nuclear & Radiological Sciences & Technology, Energy & Safety
Advanced Materials, Physicochemical Processes, Nanotechnology & Microsystems
CIL Activities Chart
Computational Intelligence - Pattern recognition background: Neural Networks, Biologically inspired modelling, Bayesian networks, Machine learning
Multimedia Optical Information Processing, Semantic analysis & Retrieval: Image, Video, 3D Graphics
Document image processing and understanding
Medical signal and image analysis,
Environmental applications,
Information retrieval from the Web
…
• A strong involvement in the research field of Document Image Analysis and Recognition over the last 25 years
• Our specific research interests lie in document image preprocessing (binarization, deskew, dewarping, image enhancement), segmentation, recognition, word spotting, writer identification and performance evaluation, mainly for historical and handwritten documents. We also work on VOCR and logo detection
• More than 150 journal and conference publications
• Our group consists of 11 people working in the field of Document Image Analysis and Recognition (Researchers, Research Associates and PhD students)
– Researchers: S. Perantonis, B. Gatos, I. Pratikakis (Ass. Professor, DUTH)
– Research Associates: G. Louloudis, N. Stamatopoulos, G. Sfikas, K. Zagoris
– PhD students: A. Papandreou, K. Alexopoulos, G. Retsinas, G. Barlas
• We are involved in a series of national and EU projects (READ, tranScriptorium, OldDocPro, IMPACT, CASAM, BOEMIE, POLYTIMO, D-SCRIBE etc.). We also hold contracts to support several companies in processing handwritten documents (Greek Army Archives), analyzing handwritten forms and business documents (receipts and invoices), and detecting logos in videos (tennis games).
• We serve on the program committees of several international conferences and workshops (e.g. ICDAR 2011, ICFHR 2012, ICDAR 2013, CBDAR 2013, International Workshop on Historical Document Imaging and Processing 2013) as well as on the Editorial Board of the International Journal on Document Analysis and Recognition (IJDAR). We are also the co-organizers of the International Conference on Frontiers in Handwriting Recognition (ICFHR), which was held in Greece in 2014, and of DAS 2016 (11 April 2016, Santorini).
Introduction – Problem Definition
Historical handwritten documents often suffer from several degradations,
have low quality, exhibit dense layout, and may have touching adjacent
text lines and arbitrary text line skew.
Page segmentation is the task of extracting homogeneous components
from page images (detect both text and non-text areas, discriminate
handwritten from possible machine printed text, classify non-text areas as
decorations, ruled lines, noise etc.)
Document understanding or logical layout analysis refers to the logical
and semantic analysis of document parts in order to extract human
understandable information and codify it into machine-readable form
(detect reading order, page numbers, headers, marginal elements or other
use-case oriented information).
Text line and word detection are used after layout analysis in order to
provide the proper input to a recognition or a word spotting engine.
Preprocessing Tasks before Layout Analysis
Image Enhancement in order to improve the quality of the original image.
Binarization (conversion of a grayscale or colour image into a binary
image) helps to separate the text from the background, reduces image
storage space, and allows efficient and quick further processing for page
segmentation.
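A minimal sketch of binarization, assuming a grayscale image stored as rows of integers in 0..255. The slides do not name a specific binarization method; Otsu's global threshold is used here only as a well-known baseline, and the function names are illustrative. Degraded historical pages usually need locally adaptive methods instead.

```python
def otsu_threshold(gray):
    """Return the Otsu threshold for an image given as rows of 0..255 ints."""
    hist = [0] * 256
    for row in gray:
        for v in row:
            hist[v] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0 = 0      # number of pixels with value <= t
    sum0 = 0    # sum of their gray values
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue
        m0 = sum0 / w0
        m1 = (sum_all - sum0) / w1
        # Between-class variance (up to a constant factor).
        var_between = w0 * w1 * (m0 - m1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(gray):
    """Map dark pixels (<= threshold) to 1 (ink) and the rest to 0 (background)."""
    t = otsu_threshold(gray)
    return [[1 if v <= t else 0 for v in row] for row in gray]
```

The result uses 1 for ink and 0 for background, which is the convention a later page segmentation step would typically consume.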
Border removal in order to detect and remove black borders as well as
noise regions from the scanned document image.
Skew/orientation correction, a document image normalization step, is
useful in order to restore text areas to a horizontal alignment (0° angle).
Page curl / warping correction to correct image distortions.
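A minimal sketch of skew estimation, assuming a binary image with 1 for ink. The slides do not say how skew is measured; the approach below is the common projection-profile heuristic (try candidate angles, keep the one whose row profile is sharpest). The function names, the angle range and the step are illustrative assumptions.

```python
import math


def profile_score(binary, angle_deg):
    """Sum of squared row counts after projecting ink pixels along angle_deg."""
    t = math.tan(math.radians(angle_deg))
    counts = {}
    for y, row in enumerate(binary):
        for x, v in enumerate(row):
            if v:
                r = round(y - x * t)   # row the pixel would fall on after deskew
                counts[r] = counts.get(r, 0) + 1
    return sum(c * c for c in counts.values())


def estimate_skew(binary, max_angle=5.0, step=0.5):
    """Return the candidate angle (degrees) that gives the sharpest profile.

    Positive angles mean the text descends to the right in image coordinates
    (y grows downwards).
    """
    n = int(round(max_angle / step))
    candidates = [i * step for i in range(-n, n + 1)]
    return max(candidates, key=lambda a: profile_score(binary, a))
```

Once the angle is known, the page would be rotated by the opposite angle to bring the text lines back to 0°.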
Page Segmentation & Document Understanding
Historical handwritten documents do not have strict layout rules and thus
page segmentation and layout analysis methods need to be invariant to
layout inconsistencies, irregularities in script and writing style, skew,
fluctuating text lines, variable shapes of decorative entities etc.
Existing techniques:
Most of the state-of-the-art methodologies focus on machine-printed or
modern handwritten documents and only a few deal with historical
handwritten documents.
► For machine-printed documents:
■ XY-Cuts: The pixels of the document image are projected
horizontally and vertically. Then we look for the largest possible
white gap in the projection and split the image into two sub-images
at this gap. We repeat this procedure recursively, changing
direction, until a stopping criterion is fulfilled.
■ Run Length Smearing Algorithm: Smearing of black pixels.
■ Docstrum: For each connected component we compute the
k-nearest-neighbours
■ Voronoi, Whitespace Algorithms, …
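The XY-Cuts procedure described above can be sketched as follows. This is a simplified illustration, not a reference implementation: it assumes a binary image with 1 for ink, a hypothetical `min_gap` parameter for the smallest white gap worth cutting, and a stopping criterion of "no sufficiently wide gap in either direction".

```python
def _extent(profile, offset):
    """Absolute (first, last + 1) indices of the non-zero part of a profile."""
    idx = [i for i, v in enumerate(profile) if v]
    return offset + idx[0], offset + idx[-1] + 1


def _widest_gap(profile, offset, min_gap):
    """Widest run of zeros enclosed by non-zero entries, as absolute indices."""
    best, start = None, None
    for i, v in enumerate(profile):
        if v == 0 and start is None:
            start = i
        elif v != 0 and start is not None:
            if i - start >= min_gap and (best is None or i - start > best[1] - best[0]):
                best = (offset + start, offset + i)
            start = None
    return best


def xy_cut(img, box=None, min_gap=2, cut_rows=True, failed_before=False):
    """Return blocks as (top, bottom, left, right), bottom/right exclusive."""
    if box is None:
        box = (0, len(img), 0, len(img[0]))
    top, bottom, left, right = box
    rows = [sum(img[y][left:right]) for y in range(top, bottom)]
    if not any(rows):
        return []                                   # empty region
    top, bottom = _extent(rows, top)                # shrink to the ink extent
    cols = [sum(img[y][x] for y in range(top, bottom)) for x in range(left, right)]
    left, right = _extent(cols, left)
    if cut_rows:
        profile = [sum(img[y][left:right]) for y in range(top, bottom)]
        gap = _widest_gap(profile, top, min_gap)
    else:
        profile = [sum(img[y][x] for y in range(top, bottom)) for x in range(left, right)]
        gap = _widest_gap(profile, left, min_gap)
    if gap is None:
        if failed_before:                           # no gap in either direction: stop
            return [(top, bottom, left, right)]
        return xy_cut(img, (top, bottom, left, right), min_gap, not cut_rows, True)
    a, b = gap
    if cut_rows:
        parts = [(top, a, left, right), (b, bottom, left, right)]
    else:
        parts = [(top, bottom, left, a), (top, bottom, b, right)]
    blocks = []
    for part in parts:                              # recurse, changing direction
        blocks += xy_cut(img, part, min_gap, not cut_rows, False)
    return blocks
```

Because the method relies on straight white gaps, it suits machine-printed pages; skewed or touching handwritten lines rarely leave such gaps.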
Related work:
V. Malleron, V. Eglin, H. Emptoz, S. Dord-Crousle, P. Regnier, “Text Lines and
Snippets Extraction for 19th Century Handwriting Documents Layout
Analysis”, ICDAR 2009, pp. 1001-1005. (Universite de Lyon, CNRS)
Related work:
M. Bulacu, R. van Koert, L. Schomaker, T. van der Zant, "Layout analysis of
handwritten historical documents for searching the archive of the Cabinet of the Dutch
Queen", ICDAR 2007, pp. 357-361. (University of Groningen, The Netherlands)
Detection of the rule lines of the tables and the page margins.
Two methods were tested: the first uses color information (followed by
horizontal and vertical projections), while the second takes gray-scale
images as input (binarization, detection of long vertical black runs to find
vertical rule lines, then processing of the columns and extraction of long
horizontal black runs to detect horizontal rule lines).
Related work:
S. S. Bukhari, T. M. Breuel, A. Asi, J. El-Sana, “Layout Analysis for Arabic Historical
Document Images Using Machine Learning”, ICFHR 2012, pp. 635-640. (Technical
University of Kaiserslautern, Germany - Ben-Gurion University of the Negev, Israel)
Features are extracted at the connected-component level, and a multi-layer
perceptron classifier is used to classify connected components as
main-body or side-note text.
Component Shape:
1. Normalized height: the height of a component divided by the height of
an input document image.
2. Foreground area: number of foreground pixels in the rescaled area of a
component divided by the total number of pixels in the rescaled area.
3. Relative distance: the relative distance of a connected component
from the center of the document.
4. Orientation: the orientation of a connected component is estimated
with respect to its neighborhood.
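A minimal sketch of the first three features listed above, assuming a connected component given as a list of unique (y, x) ink pixels. Two simplifications are assumptions of this sketch and not taken from the cited paper: the foreground area is measured on the component's bounding box instead of a rescaled window, and the relative distance is normalized by the half-diagonal of the page.

```python
import math


def component_features(pixels, image_height, image_width):
    """Return (normalized_height, foreground_area, relative_distance)."""
    ys = [y for y, _ in pixels]
    xs = [x for _, x in pixels]
    top, bottom, left, right = min(ys), max(ys), min(xs), max(xs)
    height = bottom - top + 1
    width = right - left + 1
    # 1. Component height relative to the page height.
    normalized_height = height / image_height
    # 2. Share of ink pixels inside the bounding box (assumed stand-in
    #    for the rescaled area mentioned in the slide).
    foreground_area = len(pixels) / (height * width)
    # 3. Distance of the component centre from the page centre,
    #    normalized by the half-diagonal (assumed normalization).
    cy, cx = (top + bottom) / 2, (left + right) / 2
    dy = cy - (image_height - 1) / 2
    dx = cx - (image_width - 1) / 2
    half_diag = math.hypot((image_height - 1) / 2, (image_width - 1) / 2)
    relative_distance = math.hypot(dy, dx) / half_diag
    return normalized_height, foreground_area, relative_distance
```

Such a feature vector is what a classifier would receive for each component; side notes tend to sit far from the page centre, which is why the relative distance is informative.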
Component Context:
Related work:
S. Nicolas, T. Paquet and L. Heutte, "Complex Handwritten Page Segmentation Using
Contextual Models", DIAL 2006, pp. 46-59. (Laboratoire PSI – Université de Rouen)
Task 1: Label the main regions of the manuscripts such as text body, margins, header,
footer, page number and marginal annotations
Colour legend: red = page number, green = header, blue = text body, pink = footer, cyan = text block, yellow = margin
Task 2: Detect pseudowords, deletions, diacritics and background
Colour legend: white = background, green = normal text, blue = erasure, pink = diacritic
Markov Random Field models using multiresolution pixel density feature
extraction are used (results for Task 1: ~90%, number of images: 69).
Vertical lines are detected based on a fuzzy smoothing method.
We also process the vertical white runs of the image.
Cases of text overlapping with rule lines are treated.
Success rate: ~90% using a set of 500 representative images
N. Stamatopoulos, G. Louloudis and B. Gatos, “Goal-Oriented
Performance Evaluation Methodology for
Page Segmentation Techniques”, ICDAR 2015.
It is a pixel-based approach which avoids the
dependence on a strictly defined ground-truth.
The proposed evaluation measure is correlated
with the percentage of the text information to which
the subsequent processing (e.g. text line segmentation
and recognition) can be applied successfully.
Form & Table Analysis
Forms/Tables contain structured information
Allows extraction of semantic information due to syntactical knowledge
Form documents:
Get information of the content of a record
E.g. Index or Table of Contents
Concrete search for form documents
Challenges:
Bad condition of historical documents (faded ink, stains, mold)
Small variations of the form layout between consecutive versions
Geometrically similar layouts
Handwritten filled-in data can affect (global) form features
Small inter-class variance for certain form types (e.g. table of contents)
Different form types can have the same logical structure (based on the description)
Restoration of handwritten text needed after form dropout
“Hand drawn” forms in historical documents (small variations of spacing, … )
State-of-the-art:
Global Image Based Features
Methods based on Hierarchical Descriptions
Local and Structural Features
Subgraphs as combination of 2 or more primitives (E. Saund)
Line information + preprinted labels
B. Gatos, D. Danatsas, I. Pratikakis and S. J. Perantonis, "Automatic Table Detection
in Document Images", 3rd International Conference on Advances in Pattern
Recognition (ICAPR'05), pp. 612-621, Bath, UK, August 2005.
Text Line Detection
► Process for defining the region of every text line on a document
image
► Crucial part of the workflow since its performance seriously affects
word segmentation and HTR
► How is a text line region defined?
► Using a baseline
► Or using a polygon area
Challenges:
► Difference in the skew angle between lines on the page or even
along the same text line
► Overlapping text lines
► Touching text lines
► Additions above the text line
► Deleted text
State-of-the-art:
► Projection Profile methods
  ► Find local minima/maxima on the projection profile
  ► Work on entire image/vertical strips
  ► Can handle some degree of skew
► Hough based methods
  ► Calculate the Hough transform of a set of interest points
  ► Points include gravity centers of CCs, minima points of CCs, all black pixels
  ► Skewed text lines can be detected
  ► Errors occur when text lines present different skew along their width
► Smearing methods
  ► Consecutive black pixels along the horizontal direction are smeared
  ► The white space between them is filled with black pixels if its length is within
  a predefined threshold
► Seam carving methods - find the paths with the minimum cost from the left part
to the right part of the document image using dynamic programming
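Two of the ideas listed above, smearing and projection profiles, can be sketched together. This is a simplified illustration assuming a binary image with 1 for ink; `max_gap` stands for the predefined threshold, and `line_bands` only splits at completely empty rows instead of searching for local minima of the profile, so it would not separate touching or overlapping lines.

```python
def smear_rows(binary, max_gap):
    """Fill horizontal white runs of length <= max_gap that lie between ink pixels."""
    out = []
    for row in binary:
        new_row = list(row)
        last_ink = None
        for x, v in enumerate(row):
            if v:
                if last_ink is not None and 0 < x - last_ink - 1 <= max_gap:
                    for k in range(last_ink + 1, x):
                        new_row[k] = 1
                last_ink = x
        out.append(new_row)
    return out


def line_bands(binary):
    """Return (top, bottom) row ranges, bottom exclusive, whose horizontal
    projection is non-zero."""
    bands, start = [], None
    for y, row in enumerate(binary):
        has_ink = any(row)
        if has_ink and start is None:
            start = y
        elif not has_ink and start is not None:
            bands.append((start, y))
            start = None
    if start is not None:
        bands.append((start, len(binary)))
    return bands
```

For example, `smear_rows([[1, 0, 0, 1, 0, 0, 0, 0, 1]], 2)` fills the two-pixel gap but leaves the four-pixel gap untouched, so characters of one word merge while distant components stay apart.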
Success rate: ~85% using a set of more than 400 images (>10.000 text lines)
text line → Binarization → Remove Underlines → Baselines Detection → Skew Correction → Slant Correction → Size Normalization → normalized text line
Cost Function: $\mathcal{F}(u, l) = \left(P_w(l) - P_w(u)\right) \cdot e^{-\frac{(l - u - h)^2}{s}}$
$P_w(y)$: difference of horizontal projections above and below $y$
$h$: dominant height
Upper Zone: 30
Main Zone: 60
Lower Zone: 30
Word Detection
► Process of defining the regions of words of a text line
► Necessary for keyword spotting
► Two step procedure:
► Distance computation
► Gap classification
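The two-step procedure above can be sketched as follows, assuming each connected component of a text line is given by its horizontal extent (left, right) with right exclusive. The bounding-box distance is used for step 1 and a fixed threshold for step 2; the function names and the threshold value are illustrative assumptions, since real systems usually learn the gap classifier or derive the threshold from the gap distribution.

```python
def merge_overlaps(boxes):
    """Merge components whose horizontal extents overlap or touch."""
    merged = []
    for left, right in sorted(boxes):
        if merged and left <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], right)
        else:
            merged.append([left, right])
    return merged


def segment_words(boxes, threshold):
    """Group components of one text line into words.

    Step 1 (distance computation): bounding-box gap between neighbours.
    Step 2 (gap classification): gaps above the threshold separate words.
    """
    comps = merge_overlaps(boxes)
    if not comps:
        return []
    words = [list(comps[0])]
    for left, right in comps[1:]:
        gap = left - words[-1][1]
        if gap > threshold:
            words.append([left, right])      # inter-word gap: start a new word
        else:
            words[-1][1] = right             # intra-word gap: extend the word
    return [tuple(w) for w in words]
```

With `threshold=4`, the components `[(0, 5), (7, 12), (4, 6), (20, 25), (27, 30)]` are grouped into the two words `(0, 12)` and `(20, 30)`.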
Challenges:
► Skew along the text line
► Existence of slant
► Punctuation marks
► Non-uniform spacing of words
State-of-the-art:
► Distance computation stage
► Several distance metrics used in the literature
  ► Euclidean
  ► Bounding box
  ► Minimum Run-length
  ► Convex Hull
Success rate: ~80% using a set of more than 400 images (>100.000 words)
Layout Analysis tasks in READ Project
Task 6.1. Image pre-processing and enhancement (M1-M36) Lead: NCSR
Task 6.2. Basic layout analysis (M1-M36) Lead: CVL
Task 6.3. Table and forms analysis (M1-M36) Lead: CVL
Task 6.4. Segmentation of text regions (M1-M36) Lead: NCSR
Task 6.5. Document understanding (M1-M36) Lead: XRCE