Datech2014 - Automatic Article Extraction in Old Newspapers Digitized Collections

Automatic Article Extraction in Old Newspapers
Digitized Collections
David Hébert
May 19th 2014
David Hébert, Thomas Palfray, Pierrick Tranouez, Stéphane Nicolas, Thierry Paquet

Document digitization
David Hébert - Datech - May 19th 2014
The white of the limestone calanques flecked with the green sprung from a rainy
spring. The blue of the harbour from which, in the distance, the Planier
lighthouse emerges. All around, the city of concrete and tiles as far as the eye can see.
All the way to other hills... The roof terrace of Le Corbusier's building
offers a panoramic view unique in Marseille. To this
promontory one must add the shouts of the children of the nursery school,
whose terrace at the Cité radieuse, 56 metres above the ground, makes an
incredible playground.

180 years of diversity
PlaIR: Regional Indexing Platform
Enrichment of the « Journal de Rouen »
• 1762 – 1947
• Approximately 300 000 images
• Various layouts

Plan
1. Proposed Approach
2. Logical labeling at pixel level
3. Logical structure extraction
4. Results
5. Conclusion and future work

Overview of our method
Step 1: Physico-logical entities extraction (logical labeling at pixel level)
• Labelling at the pixel level
• Contextualisation
• Graphical model
• Discriminative model
=> The CRF
Step 2: Article reconstruction (logical structure extraction)
• Higher level of analysis
• Block identification
• Taking advantage of the hierarchical organisation of information
• Finding a reading order

Conditional Random Fields
Proposed by Lafferty, McCallum and Pereira in 2001 [Lafferty 01] for part-of-speech tagging
Given a sequence of observations X, find the best label sequence Y
Example: given a sequence of words, find the role of each word in the sentence
=> observations are words (discrete observations)
=> labels describe the role of each word in the sentence
[Lafferty 01] John Lafferty, Andrew McCallum & Fernando Pereira. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling
Sequence Data. In Proc. 18th International Conf. on Machine Learning, pages 282-289, 2001.
[Figure: linear-chain CRF linking observations x_{t-1}, x_t, x_{t+1} to labels y_{t-1}, y_t, y_{t+1}. Potentials are combined locally, then globally over the sequence.]

Feature functions
A generic notation for feature functions covers two kinds of functions:
- Observation functions
- Transition functions
- Each feature function is linked to a parameter λk
[Figure: chain with observations x1, x2, ..., xT and labels y_{t-1}, y_t]
Parameter estimation: maximise the conditional log-likelihood on N
labelled examples
Inference: given X, find Y* = argmax_Y P(Y | X)
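
The inference step can be illustrated with a minimal sketch, assuming a linear-chain CRF whose observation and transition feature functions have already been weighted by their parameters λk and summed into two score tables. The names `viterbi`, `obs_score` and `trans_score`, and the toy labels, are hypothetical and do not come from the slides. Viterbi decoding is the standard way to find the best label sequence in a chain model; the slides do not say which decoder the authors used.

```python
def viterbi(obs_score, trans_score):
    """Return the highest-scoring label sequence Y* for a linear chain.

    obs_score[t][y]   : summed weighted observation features at position t for label y
    trans_score[a][b] : summed weighted transition features for label a followed by label b
    """
    n_labels = len(trans_score)
    best = list(obs_score[0])  # best score of any path ending in each label
    back = []                  # back-pointers, one list per position after the first
    for t in range(1, len(obs_score)):
        new_best, pointers = [], []
        for y in range(n_labels):
            # Best previous label when the current label is y.
            prev = max(range(n_labels), key=lambda p: best[p] + trans_score[p][y])
            pointers.append(prev)
            new_best.append(best[prev] + trans_score[prev][y] + obs_score[t][y])
        best = new_best
        back.append(pointers)
    # Follow the back-pointers from the best final label.
    y = max(range(n_labels), key=lambda label: best[label])
    path = [y]
    for pointers in reversed(back):
        y = pointers[y]
        path.append(y)
    path.reverse()
    return path


# Toy example (hypothetical): label 0 = text, label 1 = separator, 3 positions.
obs = [[2.0, 0.0], [0.0, 1.0], [2.0, 0.0]]
smooth = [[1.0, -1.0], [-1.0, 1.0]]   # transitions reward staying in the same label
flat = [[0.0, 0.0], [0.0, 0.0]]       # transitions ignored
```

With `flat` transitions the middle position follows its local observation score; with `smooth` transitions the sequence-level decision overrides it, which is the contextualisation a CRF brings over independent per-position labelling.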

Which physico-logical entities?
Pixels are described with numerical values
These require some data adaptation to feed the CRF: multi-scale quantization
[Figure: CRF chain with labels y1, y2, ..., yT over observations x1, x2, ..., xT built from numerical descriptors]
D. Hébert, T. Paquet, S. Nicolas, Continuous CRF with Multi-scale Quantization Feature Functions Application to Structure Extraction in Old Newspaper, ICDAR 2011
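
The slide names multi-scale quantization without giving the scheme. As a rough sketch only, one way to turn a numerical descriptor into discrete symbols is to bin it uniformly at several resolutions and keep one bin index per scale. The function name, the uniform bins and the scale values below are assumptions for illustration; the scheme in the cited ICDAR 2011 paper may differ.

```python
def multiscale_quantize(value, lo, hi, scales=(2, 4, 8)):
    """Map one numerical descriptor to one bin index per scale.

    Assumption: uniform bins over [lo, hi]; values outside the range are clipped.
    """
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    clipped = min(max(value, lo), hi)
    fraction = (clipped - lo) / (hi - lo)
    # Coarse scales give few, robust bins; fine scales give many, precise bins.
    return [min(int(fraction * n_bins), n_bins - 1) for n_bins in scales]
```

Each (scale, bin) pair can then act as a discrete observation, which is the kind of input a CRF with discrete feature functions expects.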

Experiments
Identification of:
- Text lines
- Titles
- Horizontal separators
- Vertical separators
- Noisy areas
- Characters
- Inter-character white spaces
- Inter-word white spaces
• Observations are horizontal run lengths.
• An observation is described by:
- its length
- the median length of the vertical runs
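
The two descriptors above can be sketched as follows on a binarised image stored as a list of rows (0 = white, 1 = ink). The function names are hypothetical. The sketch also assumes that "the vertical runs" are the vertical runs passing through each pixel of the horizontal run, which the slide does not state explicitly.

```python
from itertools import groupby
from statistics import median


def horizontal_runs(row):
    """Split one image row into (value, start, length) runs of identical pixels."""
    runs, start = [], 0
    for value, group in groupby(row):
        length = len(list(group))
        runs.append((value, start, length))
        start += length
    return runs


def vertical_run_length(image, r, c):
    """Length of the vertical run of same-valued pixels passing through (r, c)."""
    value = image[r][c]
    top = r
    while top > 0 and image[top - 1][c] == value:
        top -= 1
    bottom = r
    while bottom < len(image) - 1 and image[bottom + 1][c] == value:
        bottom += 1
    return bottom - top + 1


def describe_runs(image, r):
    """One (length, median vertical run length) pair per horizontal run of row r."""
    features = []
    for _value, start, length in horizontal_runs(image[r]):
        verticals = [vertical_run_length(image, r, c) for c in range(start, start + length)]
        features.append((length, median(verticals)))
    return features


# Tiny hypothetical image: 3 rows, 4 columns.
image = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 0],
]
```

Calling `describe_runs(image, 0)` yields one observation per run of the first row, each carrying the two numerical descriptors listed on the slide.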

A generic data model
• Not a complete document model
• A model of columns of information
• A model of entity sequences
=> A model generic enough for
various layouts

Recap of the approach
1. Physico-logical entities extraction (pixel-level analysis): DONE
2. Article reconstruction: higher level of analysis to identify articles

Article reconstruction

[Figure: example page whose blocks are labelled D, O, R, B, F, S, Z, A, P, W]

Reading order
[Figure: blocks labelled D, O, R, B, F, S, Z, A, P, W arranged in reading order]
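
The slides do not detail how the reading order is computed. As an illustrative sketch only, and not the authors' method, a Manhattan layout can be read column by column, left to right, and top to bottom inside each column. The `Block` type, the `reading_order` function and the `column_tolerance` parameter are hypothetical.

```python
from typing import NamedTuple


class Block(NamedTuple):
    name: str
    left: int  # x coordinate of the left edge, in pixels
    top: int   # y coordinate of the top edge, in pixels


def reading_order(blocks, column_tolerance=10):
    """Order blocks column by column (left to right), top to bottom within a column.

    Assumption: a block joins the current column when its left edge is within
    column_tolerance pixels of the left edge of that column's first block.
    """
    columns = []
    for block in sorted(blocks, key=lambda b: b.left):
        if columns and block.left - columns[-1][0].left <= column_tolerance:
            columns[-1].append(block)
        else:
            columns.append([block])
    return [b.name for column in columns for b in sorted(column, key=lambda b: b.top)]


# Hypothetical two-column page.
page = [Block("B", 205, 0), Block("A", 0, 300), Block("D", 0, 10), Block("C", 200, 400)]
```

A real newspaper page also needs rules for titles spanning several columns and for separators, which this sketch ignores.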

Results
Quantitative evaluation:
42 images evaluated manually
226 true articles
245 articles detected
194 correct detections (85.84%)
Over-segmentation rate of 8.41%
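
The reported percentage can be checked from the counts above. The precision figure below is not given on the slide; it is derived here from the same counts. Reading the over-segmentation rate as a share of the 226 true articles is an assumption.

```python
true_articles = 226
detected_articles = 245
correct_detections = 194

# Share of true articles that were correctly detected (the figure on the slide).
detection_rate = correct_detections / true_articles
# Share of detected articles that are correct (derived, not reported on the slide).
precision = correct_detections / detected_articles
# Assumption: the 8.41% over-segmentation rate is measured against the true articles.
over_segmented = round(0.0841 * true_articles)

print(f"detection rate: {detection_rate:.2%}")
print(f"precision: {precision:.2%}")
print(f"over-segmented articles (assumed base of 226): {over_segmented}")
```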
• 21 550 documents of 4 pages on average
(101 978 images) on the platform:
http://plair.univ-rouen.fr
• 550 000 articles
• Approximately 20 days of computation (8 cores)

Results on other layouts

Conclusion and future work
Presentation of a two-step logical segmentation method:
- Physico-logical entities segmentation with a CRF
- Article identification with a generic layout model
Suitable for complex Manhattan layouts with a small set of rules
Average article detection rate of 85%
Future work:
- Improve the CRF model (descriptors and/or the label descriptions)
- Add variability in the description of an entity (typically the definition of a separator)

The end…
Thanks for your attention
Questions?