Hard Tasks in the Background - Layout Analysis
Technology meets Scholarship, or how Handwritten Text Recognition will Revolutionize Access to Archival
Collections. With a special focus on biographical data in archives.
Hessian State Archives in Marburg, 19-21 January 2016
Computational Intelligence Laboratory
Institute of Informatics and Telecommunications
National Center for Scientific Research "Demokritos"
Agia Paraskevi, Athens, Greece
Basilis Gatos
2 Hard Tasks in the Background - Layout Analysis Technology meets Scholarship, Marburg, 19-21 January 2016
Outline
► The Computational Intelligence Laboratory of NCSR Demokritos
► Introduction – Problem Definition
► Preprocessing Tasks before Layout Analysis
► Page Segmentation & Document Understanding
► Form & Table Analysis
► Text Line Detection
► Word Detection
► Layout Analysis tasks in READ Project
The Computational Intelligence Laboratory of NCSR Demokritos
National Centre of Scientific Research "DEMOKRITOS":
The largest self-governing research organisation, under the supervision of the
Greek Government.
It is composed of the following Institutes:
• Biosciences & Applications
• Nuclear & Particle Physics
• Informatics & Telecommunications
• Nuclear & Radiological Sciences & Technology, Energy & Safety
• Advanced Materials, Physicochemical Processes, Nanotechnology & Microsystems
The Computational Intelligence Laboratory of NCSR Demokritos
CIL Activities Chart
Background: computational intelligence and pattern recognition (neural networks, biologically inspired modelling, Bayesian networks, machine learning)
Multimedia and optical information processing, semantic analysis and retrieval (image, video, 3D graphics)
Document image processing and understanding; medical signal and image analysis; environmental applications; information retrieval from the Web; …
The Computational Intelligence Laboratory of NCSR Demokritos
• Strong involvement in the research field of Document Image Analysis and Recognition over
the last 25 years
• Our specific research interests lie in document image preprocessing (binarization,
deskew, dewarping, image enhancement), segmentation, recognition, word spotting,
writer identification and performance evaluation, mainly for historical and handwritten
documents. We also work on VOCR and logo detection
• More than 150 journal and conference publications
• Our group consists of 11 people working in the field of Document Image Analysis and
Recognition (Researchers, Research Associates and PhD students)
– Researchers: S. Perantonis, B. Gatos, I. Pratikakis (Ass. Professor, DUTH)
– Research Associates: G. Louloudis, N. Stamatopoulos, G. Sfikas, K. Zagoris
– PhD students: A. Papandreou, K. Alexopoulos, G. Retsinas, G. Barlas
• We are involved in a series of national and EU projects (READ, tranScriptorium,
OldDocPro, IMPACT, CASAM, BOEMIE, POLYTIMO, D-SCRIBE etc.). We also hold contracts
supporting several companies in processing handwritten documents (Greek Army Archives),
analyzing handwritten forms and business documents (receipts and invoices), and
detecting logos in videos (tennis games).
• We serve on the program committees of several international conferences and workshops
(e.g. ICDAR 2011, ICFHR 2012, ICDAR 2013, CBDAR 2013, International Workshop on Historical
Document Imaging and Processing 2013) as well as on the Editorial Board of the
International Journal on Document Analysis and Recognition (IJDAR). We are also the
co-organizers of the International Conference on Frontiers in Handwriting Recognition
(ICFHR) that was held in Greece in 2014 and of DAS 2016 (11 April 2016, Santorini).
Introduction – Problem Definition
Historical handwritten documents often suffer from several degradations,
have low quality, exhibit dense layout, and may have touching adjacent
text lines and arbitrary text line skew.
Introduction – Problem Definition
Page segmentation is the task of extracting homogeneous components
from page images (detect both text and non-text areas, discriminate
handwritten from possible machine printed text, classify non-text areas as
decorations, ruled lines, noise etc.)
Introduction – Problem Definition
Document understanding or logical layout analysis refers to the logical
and semantic analysis of document parts in order to extract human
understandable information and codify it into machine-readable form
(detect reading order, page numbers, headers, marginal elements or other
use-case oriented information).
Introduction – Problem Definition
Text line and word detection are used after layout analysis in order to
provide the proper input to a recognition or a word spotting engine.
Preprocessing Tasks before Layout Analysis
Image Enhancement in order to improve the quality of the original image.
Binarization (conversion of a grayscale or colour image into a binary
image) helps to separate the text from the background, reduces image
storage space, and allows efficient and quick further processing for page
segmentation.
Border removal in order to detect and remove black borders as well as
noise regions from the scanned document image.
Skew/orientation correction, a document image normalization step, is
useful in order to restore text areas to horizontal alignment at a 0° angle.
Page curl and warping correction to correct image distortions.
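As an illustration of the binarization step, here is a minimal sketch using Otsu's global threshold. The slides do not say which binarization method is used, so Otsu is an assumption chosen for brevity; the function names and the convention that 1 marks ink are also illustrative. Degraded historical pages usually need adaptive (local) thresholds instead of a single global one.

```python
from typing import List


def otsu_threshold(gray: List[List[int]]) -> int:
    """Return the grey level t (0..255) that maximizes between-class variance."""
    hist = [0] * 256
    for row in gray:
        for v in row:
            hist[v] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_dark = 0    # number of pixels with value <= t
    sum_dark = 0  # sum of the values of those pixels
    for t in range(256):
        w_dark += hist[t]
        sum_dark += t * hist[t]
        w_bright = total - w_dark
        if w_dark == 0 or w_bright == 0:
            continue  # one class is empty, variance undefined
        mean_dark = sum_dark / w_dark
        mean_bright = (sum_all - sum_dark) / w_bright
        var_between = w_dark * w_bright * (mean_dark - mean_bright) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(gray: List[List[int]]) -> List[List[int]]:
    """Map a grayscale image to a binary one: 1 = ink (dark), 0 = background."""
    t = otsu_threshold(gray)
    return [[1 if v <= t else 0 for v in row] for row in gray]
```

Calling `binarize` on a small grayscale patch marks the dark pixels as ink and leaves the bright ones as background.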
Page Segmentation & Document Understanding
Historical handwritten documents do not have strict layout rules, and thus
page segmentation and layout analysis methods need to be invariant to
layout inconsistencies, irregularities in script and writing style, skew,
fluctuating text lines, variable shapes of decorative entities etc.
Page Segmentation & Document Understanding
Existing techniques:
Most of the state-of-the-art methodologies focus on machine-printed or
modern handwritten documents and only a few deal with historical
handwritten documents.
► For machine-printed documents:
■ XY-Cuts: The pixels of the document image are projected
horizontally and vertically. We then look for the largest possible
white gap in the projection and split the image into two
sub-images at this gap. The procedure is repeated recursively, changing
direction, until a stopping criterion is fulfilled.
■ Run Length Smearing Algorithm: black pixels are smeared by filling
the short white runs between them.
■ Docstrum: For each connected component we compute its
k nearest neighbours.
■ Voronoi, Whitespace Algorithms, …
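The XY-Cuts procedure described above can be sketched as follows. This is a minimal illustration, not a reference implementation: the image is assumed to be a binary list of rows with 1 for black pixels, the stopping criterion is assumed to be "no white gap of at least `min_gap` rows or columns remains", and all names are hypothetical.

```python
from typing import List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (top, left, bottom, right); bottom/right exclusive


def projection(img: List[List[int]], box: Box, axis: int) -> List[int]:
    """Black-pixel counts per row (axis=0) or per column (axis=1) inside box."""
    top, left, bottom, right = box
    if axis == 0:
        return [sum(img[y][left:right]) for y in range(top, bottom)]
    return [sum(img[y][x] for y in range(top, bottom)) for x in range(left, right)]


def widest_gap(profile: List[int], min_gap: int) -> Optional[Tuple[int, int]]:
    """Widest run of empty positions lying between two inked positions."""
    inked = [i for i, v in enumerate(profile) if v > 0]
    if not inked:
        return None
    best = None
    start = None
    for i in range(inked[0], inked[-1] + 1):
        if profile[i] == 0:
            if start is None:
                start = i
        elif start is not None:
            if i - start >= min_gap and (best is None or i - start > best[1] - best[0]):
                best = (start, i)
            start = None
    return best


def xy_cut(img: List[List[int]], box: Optional[Box] = None,
           axis: int = 0, min_gap: int = 2) -> List[Box]:
    """Recursively split box at the widest white gap, alternating direction."""
    if box is None:
        box = (0, 0, len(img), len(img[0]))
    top, left, bottom, right = box
    # Try the requested direction first, then the other one.
    for a in (axis, 1 - axis):
        gap = widest_gap(projection(img, box, a), min_gap)
        if gap is None:
            continue
        if a == 0:  # horizontal cut: upper and lower sub-images
            first = (top, left, top + gap[0], right)
            second = (top + gap[1], left, bottom, right)
        else:       # vertical cut: left and right sub-images
            first = (top, left, bottom, left + gap[0])
            second = (top, left + gap[1], bottom, right)
        return xy_cut(img, first, 1 - a, min_gap) + xy_cut(img, second, 1 - a, min_gap)
    return [box]  # stopping criterion: no sufficiently wide gap in either direction
```

The returned boxes are the leaves of the cut tree. On skewed or densely written historical pages the projections rarely contain clean white gaps, which is one reason such methods are mainly used for machine-printed documents.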
Page Segmentation & Document Understanding
Related work:
V. Malleron, V. Eglin, H. Emptoz, S. Dord-Crousle, P. Regnier, “Text Lines and
Snippets Extraction for 19th Century Handwriting Documents Layout
Analysis”, ICDAR 2009, pp. 1001-1005. (Université de Lyon, CNRS)
Page Segmentation & Document Understanding
Related work:
M. Bulacu, R. van Koert, L. Schomaker, T. van der Zant, "Layout analysis of
handwritten historical documents for searching the archive of the Cabinet of the Dutch
Queen", ICDAR 2007, pp. 357-361. (University of Groningen, The Netherlands)
Detection of the rule lines of the tables and the page margins.
Two methods were tested: the first uses color information (followed by horizontal and
vertical projections), while the second takes gray-scale images as input
(binarization, detection of long vertical black runs to find vertical rule lines, then
processing of the columns and extraction of long horizontal black runs to detect
horizontal rule lines).
Page Segmentation & Document Understanding
Related work:
S. S. Bukhari, T. M. Breuel, A. Asi, J. El-Sana, “Layout Analysis for Arabic Historical
Document Images Using Machine Learning”, ICFHR 2012, pp. 635-640. (Technical
University of Kaiserslautern, Germany - Ben-Gurion University of the Negev, Israel)
Features are extracted at the connected-component level, and a multi-layer
perceptron classifier is used to classify connected components as
main-body or side-note text.
Component Shape:
1. Normalized height: the height of a component divided by the height of
the input document image.
2. Foreground area: number of foreground pixels in the rescaled area of a
component divided by the total number of pixels in the rescaled area.
3. Relative distance: the relative distance of a connected component
from the center of the document.
4. Orientation: the orientation of a connected component is estimated
with respect to its neighborhood.
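The first three component-shape features listed above could be computed roughly as follows. This is a sketch under stated assumptions, not the cited authors' code: the component is represented by its bounding box on a binary page (1 = foreground), the "rescaled area" is approximated by the unscaled bounding box, and the relative distance is normalized by the half-diagonal of the page, which is one plausible reading of "relative". Orientation is omitted because it depends on the neighbouring components.

```python
import math
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (top, left, bottom, right); bottom/right exclusive


def shape_features(page: List[List[int]], box: Box) -> Dict[str, float]:
    """Shape features of one connected component given by its bounding box."""
    top, left, bottom, right = box
    page_h, page_w = len(page), len(page[0])
    height, width = bottom - top, right - left

    # Foreground pixels inside the bounding box (assumes the box holds one component).
    ink = sum(page[y][x] for y in range(top, bottom) for x in range(left, right))

    # Distance from the box centre to the page centre, scaled to [0, 1].
    cy, cx = (top + bottom) / 2, (left + right) / 2
    dist = math.hypot(cx - page_w / 2, cy - page_h / 2)
    half_diag = math.hypot(page_w / 2, page_h / 2)

    return {
        "normalized_height": height / page_h,
        "foreground_area": ink / (height * width),
        "relative_distance": dist / half_diag,
    }
```

A feature vector of this kind, one per connected component, is the sort of input a classifier such as a multi-layer perceptron would receive.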
Component Context:
Page Segmentation & Document Understanding
Related work:
S. Nicolas, T. Paquet and L. Heutte, "Complex Handwritten Page Segmentation Using
Contextual Models", DIAL 2006, pp. 46-59. (Laboratoire PSI – Université de Rouen)
Task 1: Label the main regions of the manuscripts such as text body, margins, header,
footer, page number and marginal annotations
(legend: red = page number, green = header, blue = text body, pink = footer,
cyan = text block, yellow = margin)
Task 2: Detect pseudo-words, deletions, diacritics and background
(legend: white = background, green = normal text, blue = erasure, pink = diacritic)
Markov Random Field models using multiresolution pixel density feature
extraction are used (results for Task 1: ~90%; number of images: 69).
Page Segmentation & Document Understanding
• Vertical lines are detected based on a fuzzy smoothing method.
• We also process the vertical white runs of the image
• Cases of text overlapping with rule lines are treated
• Success rate: ~90% using a set of 500 representative images
Page Segmentation & Document Understanding
N. Stamatopoulos, G. Louloudis and B. Gatos, “Goal-Oriented
Performance Evaluation Methodology for
Page Segmentation Techniques”, ICDAR 2015.
• It is a pixel-based approach which avoids the
dependence on a strictly defined ground-truth.
• The proposed evaluation measure is correlated
with the percentage of the text information on which
the subsequent processing (e.g. text line segmentation
and recognition) can be applied successfully.
Form & Table Analysis
• Forms/tables contain structured information
• Allows extraction of semantic information due to syntactical knowledge
• Form documents:
• Get information on the content of a record
• E.g. index or table of contents
• Concrete search for form documents
Form & Table Analysis
Challenges:
• Bad condition of historical documents (faded ink, stains, mold)
• Small variations in the form layout between consecutive versions
• Geometrically similar layouts
• Handwritten filled-in data can affect (global) form features
Form & Table Analysis
Challenges:
• Small inter-class variance for certain form types (e.g. table of contents)
• Different form types can have the same logical structure (based on the description)
• Restoration of handwritten text needed after form dropout
• "Hand-drawn" forms in historical documents (small variations of spacing, …)
Form & Table Analysis
Form Processing:
Form & Table Analysis
State-of-the-art:
• Global image-based features
• Methods based on hierarchical descriptions
• Local and structural features
• Subgraphs as combinations of 2 or more primitives (E. Saund)
• Line information + preprinted labels
Form & Table Analysis
B. Gatos, D. Danatsas, I. Pratikakis and S. J. Perantonis, "Automatic Table Detection
in Document Images", 3rd International Conference on Advances in Pattern
Recognition (ICAPR'05), pp. 612-621, Bath, UK, August 2005.
Text Line Detection
► Process for defining the region of every text line on a document image
► Crucial part of the workflow since its performance seriously affects
word segmentation and HTR
► How is a text line region defined?
– Using a baseline
– Or using a polygon area
Text Line Detection
Challenges:
► Difference in the skew angle between lines on the page or even
along the same text line
Text Line Detection
Challenges:
► Overlapping text lines
► Touching text lines
Text Line Detection
Challenges:
► Additions above the text line
► Deleted text
Text Line Detection
State-of-the-art:
► Projection profile methods
– Find local minima/maxima on the projection profile
– Work on the entire image or on vertical strips
– Can handle some degree of skew
► Hough-based methods
– Calculate the Hough transform of a set of interest points
– Points include gravity centers of CCs, minima points of CCs, or all black pixels
– Skewed text lines can be detected
– Errors occur when the skew varies along the width of a text line
► Smearing methods
– Consecutive black pixels along the horizontal direction are smeared
– The white space between them is filled with black pixels if its length is within
a predefined threshold
► Seam carving methods: find the paths with the minimum cost from the left part
to the right part of the document image using dynamic programming
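The smearing idea in the list above can be sketched in a few lines. This is a minimal illustration on a binary image stored as a list of rows (1 = black), with `max_gap` standing in for the predefined threshold; the function name is hypothetical. Only white runs bounded by black pixels on both sides are filled, so page margins stay white.

```python
from typing import List


def smear_rows(img: List[List[int]], max_gap: int) -> List[List[int]]:
    """Fill horizontal white runs of length <= max_gap that lie between black pixels."""
    out = [row[:] for row in img]  # work on a copy, leave the input untouched
    for row in out:
        last_black = None  # column of the most recent black pixel in this row
        for x, v in enumerate(row):
            if v != 1:
                continue
            if last_black is not None:
                gap = x - last_black - 1
                if 0 < gap <= max_gap:
                    for k in range(last_black + 1, x):
                        row[k] = 1
            last_black = x
    return out
```

After smearing, the connected components of the result approximate text lines, provided the threshold is larger than typical inter-word gaps and adjacent lines do not touch.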
Text Line Detection
Success rate: ~85% using a set of more than 400 images (>10,000 text lines)
Text Line Detection
Pipeline (text line → normalized text line): Binarization, Remove Underlines, Baselines Detection, Skew Correction, Slant Correction, Size Normalization
Cost function: $\mathcal{F}(u, l) = \big(P_w(l) - P_w(u)\big) \cdot e^{-\frac{(|l - u| - h)^2}{s}}$
$P_w(y)$: difference of the horizontal projections above and below $y$
$h$: dominant height
Upper Zone: 30
Main Zone: 60
Lower Zone: 30
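To make the cost function concrete, the sketch below evaluates it for one candidate pair of upper and lower baselines. It reads the formula as F(u, l) = (P_w(l) − P_w(u)) · exp(−(|l − u| − h)² / s); the absolute value is an assumption about the original typesetting, and the values of P_w are taken as precomputed input because the slide does not define the window w or the sign convention in detail. The function name is hypothetical.

```python
import math
from typing import Sequence


def baseline_pair_score(p_w: Sequence[float], u: int, l: int,
                        h: float, s: float) -> float:
    """Evaluate F(u, l) for upper baseline row u and lower baseline row l.

    p_w[y] holds the precomputed projection difference P_w(y) at row y,
    h is the dominant height and s controls how quickly the weight decays.
    """
    contrast = p_w[l] - p_w[u]
    # The weight is 1 when the baselines are exactly h rows apart
    # and decays as |l - u| moves away from h.
    weight = math.exp(-((abs(l - u) - h) ** 2) / s)
    return contrast * weight
```

The exponential term favours baseline pairs whose distance matches the dominant height, so a pair with strong projection contrast but an implausible spacing is penalized.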
Word Detection
► Process of defining the regions of the words of a text line
► Necessary for keyword spotting
► Two-step procedure:
– Distance computation
– Gap classification
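The two-step procedure can be illustrated with the simplest possible choices. In this sketch, which is an assumption-laden toy and not the method behind the reported results, each connected component of a text line is reduced to its horizontal extent, the distance is the bounding-box gap between neighbours, and gap classification is a fixed threshold. Real systems typically learn or estimate the threshold per line or per page.

```python
from typing import List, Tuple

Extent = Tuple[int, int]  # (left, right) column range of a component; right exclusive


def bounding_box_gaps(extents: List[Extent]) -> Tuple[List[Extent], List[int]]:
    """Step 1: merge horizontally overlapping components and measure the gaps."""
    merged: List[Extent] = []
    for left, right in sorted(extents):
        if merged and left <= merged[-1][1]:
            # Overlapping or touching components form a single unit.
            merged[-1] = (merged[-1][0], max(merged[-1][1], right))
        else:
            merged.append((left, right))
    gaps = [merged[i + 1][0] - merged[i][1] for i in range(len(merged) - 1)]
    return merged, gaps


def split_words(extents: List[Extent], threshold: int) -> List[Extent]:
    """Step 2: gaps wider than threshold separate words, the rest are intra-word."""
    merged, gaps = bounding_box_gaps(extents)
    if not merged:
        return []
    words = [[merged[0]]]
    for unit, gap in zip(merged[1:], gaps):
        if gap > threshold:
            words.append([unit])   # inter-word gap: start a new word
        else:
            words[-1].append(unit)  # intra-word gap: extend the current word
    return [(w[0][0], w[-1][1]) for w in words]
```

Slant and punctuation marks distort bounding-box distances, which is one motivation for alternative distance metrics.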
Word Detection
Challenges:
► Skew along the text line
► Existence of slant
► Punctuation marks
Word Detection
Challenges:
► Non-uniform spacing of words
Word Detection
State-of-the-art:
► Distance computation stage
► Several distance metrics used in the literature
– Euclidean
– Bounding box
– Minimum run-length
– Convex hull
Word Detection
Success rate: ~80% using a set of more than 400 images (>100,000 words)
Layout Analysis tasks in READ Project
Task 6.1. Image pre-processing and enhancement (M1-M36) Lead: NCSR
Task 6.2. Basic layout analysis (M1-M36) Lead: CVL
Task 6.3. Table and forms analysis (M1-M36) Lead: CVL
Task 6.4. Segmentation of text regions (M1-M36) Lead: NCSR
Task 6.5. Document understanding (M1-M36) Lead: XRCE

 
ICARUS-Meeting #17 | Transparency - Accessibility – Dialogue. How a creative ...
ICARUS-Meeting #17 | Transparency - Accessibility – Dialogue. How a creative ...ICARUS-Meeting #17 | Transparency - Accessibility – Dialogue. How a creative ...
ICARUS-Meeting #17 | Transparency - Accessibility – Dialogue. How a creative ...
 
ICARUS-Meeting #17 | Transparency - Accessibility – Dialogue. How a creative ...
ICARUS-Meeting #17 | Transparency - Accessibility – Dialogue. How a creative ...ICARUS-Meeting #17 | Transparency - Accessibility – Dialogue. How a creative ...
ICARUS-Meeting #17 | Transparency - Accessibility – Dialogue. How a creative ...
 
ICARUS-Meeting #17 | Transparency - Accessibility – Dialogue. How a creative ...
ICARUS-Meeting #17 | Transparency - Accessibility – Dialogue. How a creative ...ICARUS-Meeting #17 | Transparency - Accessibility – Dialogue. How a creative ...
ICARUS-Meeting #17 | Transparency - Accessibility – Dialogue. How a creative ...
 
ICARUS-Meeting #17 | Transparency - Accessibility – Dialogue. How a creative ...
ICARUS-Meeting #17 | Transparency - Accessibility – Dialogue. How a creative ...ICARUS-Meeting #17 | Transparency - Accessibility – Dialogue. How a creative ...
ICARUS-Meeting #17 | Transparency - Accessibility – Dialogue. How a creative ...
 
ICARUS-Meeting #17 | Transparency - Accessibility – Dialogue. How a creative ...
ICARUS-Meeting #17 | Transparency - Accessibility – Dialogue. How a creative ...ICARUS-Meeting #17 | Transparency - Accessibility – Dialogue. How a creative ...
ICARUS-Meeting #17 | Transparency - Accessibility – Dialogue. How a creative ...
 

Kürzlich hochgeladen

Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
ssuser79fe74
 
Pests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdf
PirithiRaju
 
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
PirithiRaju
 
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsHubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
Sérgio Sacani
 
Biopesticide (2).pptx .This slides helps to know the different types of biop...
Biopesticide (2).pptx  .This slides helps to know the different types of biop...Biopesticide (2).pptx  .This slides helps to know the different types of biop...
Biopesticide (2).pptx .This slides helps to know the different types of biop...
RohitNehra6
 
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bAsymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Sérgio Sacani
 

Kürzlich hochgeladen (20)

GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)
 
Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)
 
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43bNightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
 
COST ESTIMATION FOR A RESEARCH PROJECT.pptx
COST ESTIMATION FOR A RESEARCH PROJECT.pptxCOST ESTIMATION FOR A RESEARCH PROJECT.pptx
COST ESTIMATION FOR A RESEARCH PROJECT.pptx
 
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
 
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
 
Botany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfBotany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdf
 
Pests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdf
 
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
 
CELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdfCELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdf
 
Isotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on IoIsotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on Io
 
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsHubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
 
Biopesticide (2).pptx .This slides helps to know the different types of biop...
Biopesticide (2).pptx  .This slides helps to know the different types of biop...Biopesticide (2).pptx  .This slides helps to know the different types of biop...
Biopesticide (2).pptx .This slides helps to know the different types of biop...
 
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCRStunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
 
GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)
 
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...
 
Chemistry 4th semester series (krishna).pdf
Chemistry 4th semester series (krishna).pdfChemistry 4th semester series (krishna).pdf
Chemistry 4th semester series (krishna).pdf
 
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
 
GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)
 
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bAsymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
 

co:op-READ-Convention Marburg - Basilis Gatos

  • 1. Hard Tasks in the Background - Layout Analysis Technology meets Scholarship, or how Handwritten Text Recognition will Revolutionize Access to Archival Collections. With a special focus on biographical data in archives. Hessian State Archives in Marburg, 19-21 January 2016 Computational Intelligence Laboratory Institute of Informatics and Telecommunications National Center for Scientific Research "Demokritos“ Agia Paraskevi, Athens, Greece Basilis Gatos
  • 2. Outline
      ► The Computational Intelligence Laboratory of NCSR Demokritos
      ► Introduction – Problem Definition
      ► Preprocessing Tasks before Layout Analysis
      ► Page Segmentation & Document Understanding
      ► Form & Table Analysis
      ► Text Line Detection
      ► Word Detection
      ► Layout Analysis tasks in READ Project
  • 3. The Computational Intelligence Laboratory of NCSR Demokritos
      National Centre of Scientific Research "Demokritos": the largest self-governing research organisation, under the supervision of the Greek Government. It is composed of the following Institutes:
      ■ Biosciences & Applications
      ■ Nuclear & Particle Physics
      ■ Informatics & Telecommunications
      ■ Nuclear & Radiological Sciences & Technology, Energy & Safety
      ■ Advanced Materials, Physicochemical Processes, Nanotechnology & Microsystems
  • 4. The Computational Intelligence Laboratory of NCSR Demokritos: CIL Activities Chart
      ► Computational Intelligence / Pattern recognition background: Neural Networks, Biologically inspired modelling, Bayesian networks, Machine learning
      ► Multimedia Optical Information Processing, Semantic analysis & Retrieval: Image, Video, 3D Graphics
      ► Document image processing and understanding, Medical signal and image analysis, Environmental applications, Information retrieval from the Web …
  • 5. The Computational Intelligence Laboratory of NCSR Demokritos
      • Strong involvement in the research field of Document Image Analysis and Recognition over the last 25 years.
      • Our specific research interests lie in document image preprocessing (binarization, deskew, dewarping, image enhancement), segmentation, recognition, word spotting, writer identification and performance evaluation, mainly for historical and handwritten documents. We also work on VOCR and logo detection.
      • More than 150 journal and conference publications.
      • Our group consists of 11 people working in the field of Document Image Analysis and Recognition (Researchers, Research Associates and PhD students):
          – Researchers: S. Perantonis, B. Gatos, I. Pratikakis (Ass. Professor, DUTH)
          – Research Associates: G. Louloudis, N. Stamatopoulos, G. Sfikas, K. Zagoris
          – PhD students: A. Papandreou, K. Alexopoulos, G. Retsinas, G. Barlas
      • We are involved in a series of national and EU projects (READ, tranScriptorium, OldDocPro, IMPACT, CASAM, BOEMIE, POLYTIMO, D-SCRIBE etc.) and in contracts supporting several companies in processing handwritten documents (Greek Army Archives), analyzing handwritten forms and business documents (receipts and invoices), and detecting logos in videos (tennis games).
      • We serve on the program committees of several international conferences and workshops (e.g. ICDAR 2011, ICFHR 2012, ICDAR 2013, CBDAR 2013, International Workshop on Historical Document Imaging and Processing 2013) and on the Editorial Board of the International Journal on Document Analysis and Recognition (IJDAR). We are also co-organizers of the International Conference on Frontiers in Handwriting Recognition (ICFHR), held in Greece in 2014, and of DAS 2016 (11 April 2016, Santorini).
  • 6. Introduction – Problem Definition
      Historical handwritten documents often suffer from degradations and low quality, exhibit dense layouts, and may have touching adjacent text lines and arbitrary text line skew.
  • 7. Introduction – Problem Definition
      Page segmentation is the task of extracting homogeneous components from page images (detecting both text and non-text areas, discriminating handwritten from possible machine-printed text, classifying non-text areas as decorations, ruled lines, noise etc.).
  • 8. Introduction – Problem Definition
      Document understanding, or logical layout analysis, refers to the logical and semantic analysis of document parts in order to extract human-understandable information and codify it into machine-readable form (detecting reading order, page numbers, headers, marginal elements or other use-case oriented information).
  • 9. Introduction – Problem Definition
      Document understanding, or logical layout analysis, refers to the logical and semantic analysis of document parts in order to extract human-understandable information and codify it into machine-readable form (detecting reading order, page numbers, headers, marginal elements or other use-case oriented information).
  • 10. Introduction – Problem Definition
      Text line and word detection are applied after layout analysis in order to provide the proper input to a recognition or word spotting engine.
  • 11. Preprocessing Tasks before Layout Analysis
      Image enhancement, in order to improve the quality of the original image.
  • 12. Preprocessing Tasks before Layout Analysis
      Image enhancement, in order to improve the quality of the original image.
      Binarization (conversion of a grayscale or colour image into a binary image) helps to separate the text from the background, requires less image storage space, and allows efficient and quick further processing for page segmentation.
  • 13. Preprocessing Tasks before Layout Analysis
      Image enhancement, in order to improve the quality of the original image.
      Binarization (conversion of a grayscale or colour image into a binary image) helps to separate the text from the background, requires less image storage space, and allows efficient and quick further processing for page segmentation.
      Border removal, in order to detect and remove black borders as well as noise regions from the scanned document image.
  • 14. Preprocessing Tasks before Layout Analysis
      Image enhancement, in order to improve the quality of the original image.
      Binarization (conversion of a grayscale or colour image into a binary image) helps to separate the text from the background, requires less image storage space, and allows efficient and quick further processing for page segmentation.
      Border removal, in order to detect and remove black borders as well as noise regions from the scanned document image.
      Skew/orientation correction, a document image normalization step, restores text areas to horizontal alignment (0° angle).
      Page curl and warping correction, to correct image distortions.
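The binarization step can be illustrated with a short sketch. The slides do not name a specific binarization method, so the example below swaps in Otsu's global threshold purely for illustration; degraded historical pages may call for local or adaptive methods instead. It assumes an 8-bit grayscale image given as a list of rows of integers in 0..255, and the function names `otsu_threshold` and `binarize` are hypothetical.

```python
def otsu_threshold(pixels):
    """Return the grey level that maximises the between-class variance.

    Assumption: `pixels` is a flat list of integers in 0..255.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels with grey level <= t (dark class)
    sum0 = 0  # sum of grey levels in the dark class
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue          # dark class still empty
        w1 = total - w0
        if w1 == 0:
            break             # bright class empty from here on
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        between_var = w0 * w1 * (mean0 - mean1) ** 2
        if between_var > best_var:
            best_var, best_t = between_var, t
    return best_t


def binarize(image):
    """Map a grayscale image (list of rows) to 1 = ink (dark), 0 = background."""
    t = otsu_threshold([p for row in image for p in row])
    return [[1 if p <= t else 0 for p in row] for row in image]
```

Calling `binarize` on a small page with dark strokes on a bright background returns a 0/1 image that the later segmentation sketches can consume directly.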
  • 15. Preprocessing Tasks before Layout Analysis
  • 16. Preprocessing Tasks before Layout Analysis
  • 17. Page Segmentation & Document Understanding
      Historical handwritten documents do not have strict layout rules; page segmentation and layout analysis methods therefore need to be invariant to layout inconsistencies, irregularities in script and writing style, skew, fluctuating text lines, variable shapes of decorative entities etc.
  • 18. Page Segmentation & Document Understanding
      Existing techniques: Most of the state-of-the-art methodologies focus on machine-printed or modern handwritten documents and only a few deal with historical handwritten documents.
      ► For machine-printed documents:
          ■ XY-Cuts: The pixels of the document image are projected horizontally and vertically. We then look for the largest possible white gap in the projection and split the image into two sub-images at this gap. The procedure is repeated recursively, changing direction each time, until a stopping criterion is fulfilled.
          ■ Run Length Smearing Algorithm: Smearing of black pixels.
          ■ Docstrum: For each connected component we compute the k-nearest-neighbours.
          ■ Voronoi, Whitespace Algorithms, …
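The XY-Cuts procedure described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation behind any of the cited systems: the page is a binary image given as a list of rows with 1 for ink, and the names `xy_cut` and `min_gap` are hypothetical. The stopping criterion assumed here is that no white gap of at least `min_gap` pixels remains in either direction.

```python
def xy_cut(img, min_gap=2):
    """Recursive XY-cut on a binary image (1 = ink).

    Returns leaf blocks as (top, left, bottom, right), bottom/right exclusive.
    """
    blocks = []

    def profile(t, l, b, r, horizontal):
        if horizontal:  # ink count per row
            return [sum(img[y][l:r]) for y in range(t, b)]
        return [sum(img[y][x] for y in range(t, b)) for x in range(l, r)]

    def trim(t, l, b, r):
        """Shrink the region to the bounding box of its ink, or None if empty."""
        ys = [i for i, v in enumerate(profile(t, l, b, r, True)) if v]
        xs = [i for i, v in enumerate(profile(t, l, b, r, False)) if v]
        if not ys:
            return None
        return t + ys[0], l + xs[0], t + ys[-1] + 1, l + xs[-1] + 1

    def widest_gap(values):
        """Return (start, length) of the longest run of zeros, or (0, 0)."""
        best, start = (0, 0), None
        for i, v in enumerate(values + [1]):  # sentinel closes a trailing run
            if v == 0 and start is None:
                start = i
            elif v != 0 and start is not None:
                if i - start > best[1]:
                    best = (start, i - start)
                start = None
        return best

    def split(t, l, b, r, horizontal, tried_other=False):
        box = trim(t, l, b, r)
        if box is None:
            return
        t, l, b, r = box
        start, length = widest_gap(profile(t, l, b, r, horizontal))
        if length >= min_gap:
            if horizontal:  # cut between rows, then switch direction
                split(t, l, t + start, r, False)
                split(t + start + length, l, b, r, False)
            else:           # cut between columns, then switch direction
                split(t, l, b, l + start, True)
                split(t, l + start + length, b, r, True)
        elif not tried_other:
            split(t, l, b, r, not horizontal, True)
        else:
            blocks.append((t, l, b, r))  # no usable gap in either direction

    if img and img[0]:
        split(0, 0, len(img), len(img[0]), True)
    return blocks
```

On a page with two columns above a full-width paragraph, the first horizontal cut separates the two bands and the following vertical cut separates the columns, giving three blocks.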
  • 19. Page Segmentation & Document Understanding
      Related work: V. Malleron, V. Eglin, H. Emptoz, S. Dord-Crousle, P. Regnier, “Text Lines and Snippets Extraction for 19th Century Handwriting Documents Layout Analysis”, ICDAR 2009, pp. 1001-1005. (Universite de Lyon, CNRS)
  • 20. Page Segmentation & Document Understanding
      Related work: M. Bulacu, R. van Koert, L. Schomaker, T. van der Zant, "Layout analysis of handwritten historical documents for searching the archive of the Cabinet of the Dutch Queen", ICDAR 2007, pp. 357-361. (University of Groningen, The Netherlands)
      Detection of the rule lines of the tables and the page margins. Two methods were tested: the first uses color information (followed by horizontal and vertical projections), while the second takes gray-scale images as input (binarization, detection of long vertical black runs to find vertical rule lines, then processing of the columns and extraction of long horizontal black runs to detect horizontal rule lines).
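The "long vertical black runs" step mentioned above can be sketched as follows. This is only an illustration of the idea, not the authors' code: it assumes an already binarized image given as a list of rows with 1 for ink, and the names `long_vertical_runs` and `min_len` are hypothetical.

```python
def long_vertical_runs(img, min_len):
    """Find vertical black runs of at least `min_len` pixels.

    Returns (x, top, bottom) triples with `bottom` exclusive. Columns that
    contain such runs are candidates for vertical rule lines.
    """
    runs = []
    height = len(img)
    width = len(img[0]) if img else 0
    for x in range(width):
        start = None
        for y in range(height + 1):           # one extra step closes a run
            ink = y < height and img[y][x] == 1
            if ink and start is None:
                start = y
            elif not ink and start is not None:
                if y - start >= min_len:
                    runs.append((x, start, y))
                start = None
    return runs
```

Applying the same function to the transposed image would give the long horizontal runs used for horizontal rule lines.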
  • 21. Page Segmentation & Document Understanding
      Related work: S. S. Bukhari, T. M. Breuel, A. Asi, J. El-Sana, “Layout Analysis for Arabic Historical Document Images Using Machine Learning”, ICFHR 2012, pp. 635-640. (Technical University of Kaiserslautern, Germany - Ben-Gurion University of the Negev, Israel)
      Features are extracted at the connected-component level, and a multi-layer perceptron classifier is used to classify connected components as main-body or side-notes text.
      Component Shape:
        1. Normalized height: the height of a component divided by the height of the input document image.
        2. Foreground area: number of foreground pixels in the rescaled area of a component divided by the total number of pixels in the rescaled area.
        3. Relative distance: the relative distance of a connected component from the center of the document.
        4. Orientation: the orientation of a connected component is estimated with respect to its neighborhood.
      Component Context:
  • 22. Page Segmentation & Document Understanding
      Related work: S. Nicolas, T. Paquet and L. Heutte, "Complex Handwritten Page Segmentation Using Contextual Models", DIAL 2006, pp. 46-59. (Laboratoire PSI – Université de Rouen)
      Task 1: Label the main regions of the manuscripts, such as text body, margins, header, footer, page number and marginal annotations.
        Colour legend: red = page number, green = header, blue = text body, pink = footer, cyan = text block, yellow = margin
      Task 2: Detect pseudowords, deletions, diacritics and background.
        Colour legend: white = background, green = normal text, blue = erasure, pink = diacritic
      Markov Random Field models using multiresolution pixel density feature extraction are used (results for Task 1: ~90%, number of images: 69).
  • 23. Page Segmentation & Document Understanding
      ■ Vertical lines are detected based on a fuzzy smoothing method.
      ■ We also process the vertical white runs of the image.
      ■ Cases of text overlapping with rule lines are treated.
      ■ Success rate ~90% using a set of 500 representative images.
  • 24. Page Segmentation & Document Understanding
      N. Stamatopoulos, G. Louloudis and B. Gatos, “Goal-Oriented Performance Evaluation Methodology for Page Segmentation Techniques”, ICDAR 2015.
      ■ It is a pixel-based approach which avoids the dependence on a strictly defined ground truth.
      ■ The proposed evaluation measure is correlated with the percentage of the text information to which the subsequent processing (e.g. text line segmentation and recognition) can be applied successfully.
  • 25. Form & Table Analysis
      ■ Forms/Tables contain structured information
      ■ Allows extraction of semantic information due to syntactical knowledge
      ■ Form documents:
          ■ Get information on the content of a record
          ■ E.g. Index or Table of Contents
          ■ Concrete search for form documents
  • 26. Form & Table Analysis
      Challenges:
      ■ Bad condition of historical documents (faded ink, stains, mold)
      ■ Small variations of the form layout between consecutive versions
      ■ Geometrically similar layouts
      ■ Handwritten filled-in data can affect (global) form features
  • 27. Form & Table Analysis
      Challenges:
      ■ Small inter-class variance for certain form types (e.g. table of contents)
      ■ Different form types can have the same logical structure (based on the description)
      ■ Restoration of handwritten text needed after form dropout
      ■ “Hand drawn” forms in historical documents (small variations of spacing, … )
  • 28. Form & Table Analysis
      Form Processing:
  • 29. Form & Table Analysis
      State-of-the-art:
      ■ Global Image Based Features
      ■ Methods based on Hierarchical Descriptions
      ■ Local and Structural Features
      ■ Subgraphs as combination of 2 or more primitives (E. Saund)
      ■ Line information + preprinted labels
  • 30. Form & Table Analysis
      B. Gatos, D. Danatsas, I. Pratikakis and S. J. Perantonis, "Automatic Table Detection in Document Images", 3rd International Conference on Advances in Pattern Recognition (ICAPR'05), pp. 612-621, Bath, UK, August 2005.
  • 31. Text Line Detection
      ► Process for defining the region of every text line on a document image
      ► Crucial part of the workflow, since its performance seriously affects word segmentation and HTR
      ► How is a text line region defined?
          ► Using a baseline
          ► Or using a polygon area
  • 32. Text Line Detection
      Challenges:
      ► Difference in the skew angle between lines on the page or even along the same text line
  • 33. Text Line Detection
      Challenges:
      ► Overlapping text lines
      ► Touching text lines
  • 34. Text Line Detection
      Challenges:
      ► Additions above the text line
      ► Deleted text
  • 35. Text Line Detection
      State-of-the-art:
      ► Projection Profile methods
          ► Find local minima/maxima on the projection profile
          ► Work on entire image/vertical strips
          ► Can handle some degree of skew
      ► Hough based methods
          ► Calculate the Hough transform of a set of interest points
          ► Points include gravity centers of CCs, minima points of CCs, all black pixels
          ► Skewed text lines can be detected
          ► Errors occur when text lines present different skew along their width
      ► Smearing methods
          ► Consecutive black pixels along the horizontal direction are smeared
          ► The white space between them is filled with black pixels if its length is within a predefined threshold
      ► Seam carving methods: find the paths with the minimum cost from the left part to the right part of the document image using dynamic programming
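Two of the families listed above are simple enough to sketch directly. The example below is a minimal illustration under stated assumptions, not a state-of-the-art detector: the image is binary (list of rows, 1 = ink), `smear_row` applies the horizontal smearing rule to a single row, and `text_line_bands` is a simplified projection-profile detector that treats rows whose ink count falls below `min_ink` as the valleys between lines instead of searching for true local minima. Both function names and both parameters are hypothetical.

```python
def smear_row(row, max_gap):
    """Horizontal smearing of one row.

    A white run between two black pixels is filled with black if its
    length is at most `max_gap`.
    """
    out = list(row)
    last_black = None
    for x, v in enumerate(row):
        if v:
            if last_black is not None and 0 < x - last_black - 1 <= max_gap:
                for k in range(last_black + 1, x):
                    out[k] = 1
            last_black = x
    return out


def text_line_bands(img, min_ink=1):
    """Simplified projection-profile line detection.

    Returns (top, bottom) row ranges, bottom exclusive, in which the
    horizontal projection (ink pixels per row) is at least `min_ink`.
    """
    profile = [sum(row) for row in img]
    bands, start = [], None
    for y, v in enumerate(profile + [0]):  # sentinel closes a trailing band
        if v >= min_ink and start is None:
            start = y
        elif v < min_ink and start is not None:
            bands.append((start, y))
            start = None
    return bands
```

Neither sketch handles the skewed, overlapping or touching lines named in the challenges; applying the profile to vertical strips, as the slide notes, is one way to tolerate some skew.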
  • 36. Text Line Detection
  • 37. Text Line Detection
      Success rate: ~85% using a set of more than 400 images (>10,000 text lines)
  • 38. Text Line Detection
      Pipeline: text line → Binarization → Remove Underlines → Baselines Detection → Skew Correction → Slant Correction → Size Normalization → normalized text line
      Cost function: $\mathcal{F}(u, l) = \left(P_w(l) - P_w(u)\right) \cdot e^{-\frac{(l - u - h)^2}{s}}$
      $P_w(y)$: difference of the horizontal projections above and below $y$
      $h$: dominant height
      Upper Zone: 30, Main Zone: 60, Lower Zone: 30
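The baseline cost function on this slide can be evaluated with a small sketch. Several details are assumptions, since the slide gives only the formula: $P_w(y)$ is taken here as the ink in the `w` rows above `y` minus the ink in the `w` rows starting at `y`, `s` is treated as a free spread parameter, and the exponent is read as $(l - u - h)^2 / s$. The names `window_diff`, `baseline_cost` and `find_baselines` are hypothetical, and the exhaustive search is chosen for clarity only.

```python
import math


def window_diff(proj, y, w):
    """Assumed P_w(y): projection mass in the w rows above y minus the w rows from y on."""
    above = sum(proj[max(0, y - w):y])
    below = sum(proj[y:y + w])
    return above - below


def baseline_cost(proj, u, l, h, s, w):
    """F(u, l) = (P_w(l) - P_w(u)) * exp(-((l - u) - h)^2 / s)."""
    contrast = window_diff(proj, l, w) - window_diff(proj, u, w)
    return contrast * math.exp(-((l - u - h) ** 2) / s)


def find_baselines(proj, h, s=10.0, w=2):
    """Exhaustively pick the (upper, lower) baseline pair with the highest cost.

    `proj` is the horizontal projection of one text line (ink pixels per row)
    and `h` is the dominant character height in rows.
    """
    best = None
    for u in range(len(proj)):
        for l in range(u + 1, len(proj) + 1):
            cost = baseline_cost(proj, u, l, h, s, w)
            if best is None or cost > best[0]:
                best = (cost, u, l)
    return best[1], best[2]
```

With this reading, the first factor rewards a sharp rise in ink at `u` and a sharp drop at `l`, while the exponential penalises pairs whose distance deviates from the dominant height `h`.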
  • 39. Text Line Detection
  • 40. Word Detection
      ► Process of defining the regions of the words of a text line
      ► Necessary for keyword spotting
      ► Two-step procedure:
          ► Distance computation
          ► Gap classification
  • 41. Word Detection
      Challenges:
      ► Skew along the text line
      ► Existence of slant
      ► Punctuation marks
  • 42. Word Detection
      Challenges:
      ► Non-uniform spacing of words
  • 43. Word Detection
      State-of-the-art:
      ► Distance computation stage
      ► Several distance metrics used in the literature:
          ► Euclidean
          ► Bounding box
          ► Minimum Run-length
          ► Convex Hull
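The two-step procedure (distance computation, then gap classification) can be sketched with the bounding-box metric from the list above. This is a minimal illustration under stated assumptions: each connected component of a text line is reduced to its horizontal extent `(left, right)` with `right` exclusive, and gaps are classified with a single fixed `threshold`, which stands in for whatever classifier a real system would use. The names `bounding_box_gaps` and `group_words` are hypothetical.

```python
def bounding_box_gaps(boxes):
    """Step 1, distance computation with the bounding-box metric.

    Returns the boxes sorted left to right and the horizontal gap before
    each box after the first; overlapping boxes get a gap of 0.
    """
    ordered = sorted(boxes)
    gaps = []
    reach = ordered[0][1] if ordered else 0  # rightmost x covered so far
    for left, right in ordered[1:]:
        gaps.append(max(0, left - reach))
        reach = max(reach, right)
    return ordered, gaps


def group_words(boxes, threshold):
    """Step 2, gap classification: a gap above `threshold` separates two words."""
    ordered, gaps = bounding_box_gaps(boxes)
    if not ordered:
        return []
    words = [list(ordered[0])]
    for (left, right), gap in zip(ordered[1:], gaps):
        if gap > threshold:
            words.append([left, right])                   # inter-word gap
        else:
            words[-1][1] = max(words[-1][1], right)       # intra-word gap
    return [tuple(word) for word in words]
```

A fixed threshold is exactly what the non-uniform word spacing named in the challenges breaks, which is why the gap classification step is treated as a problem of its own.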
  • 44. Word Detection
      ■ Success rate: ~80% using a set of more than 400 images (>100,000 words)
  • 45. Layout Analysis tasks in READ Project
      Task 6.1. Image pre-processing and enhancement (M1-M36), Lead: NCSR
      Task 6.2. Basic layout analysis (M1-M36), Lead: CVL
      Task 6.3. Table and forms analysis (M1-M36), Lead: CVL
      Task 6.4. Segmentation of text regions (M1-M36), Lead: NCSR
      Task 6.5. Document understanding (M1-M36), Lead: XRCE