Automatic Article Extraction in Old Newspapers
Digitized Collections
David Hébert
May 19th 2014
David Hébert, Thomas Palfray, Pierrick Tranouez, Stéphane Nicolas, Thierry Paquet
Document digitization
David Hébert - Datech - May 19th 2014 2
The white of the calanques' limestone, flecked with the green sprung from a
rainy spring. The blue of the harbour, from which the Planier lighthouse rises
in the distance. All around, the city of concrete and tiles as far as the eye
can see. As far as other hills... The roof terrace of Le Corbusier's building
offers a panoramic view unique in Marseille. To this promontory must be added
the shouts of the children of the nursery school, for whom the terrace of the
Cité radieuse, 56 metres above the ground, forms an incredible
playground.
180 years of diversity
PlaIR: Regional Indexation Platform
Enrichment of the « Journal de Rouen »
• 1762 – 1947
• Approximately 300,000 images
• Various layouts
Plan
1. Proposed Approach
2. Logical labeling at pixel level
3. Logical structure extraction
4. Results
5. Conclusion and future work
Overview of our method
Step 1: Physico-logical entities extraction (logical labeling at pixel level)
• Labelling at the pixel level
• Contextualisation
• Graphical model
• Discriminative model
=> The CRF
Step 2: Article reconstruction (logical structure extraction)
• Higher level of analysis
• Block identification
• Taking advantage of the hierarchical organisation of information
• Finding a reading order
Conditional Random Fields
Proposed by Lafferty, McCallum and Pereira in 2001 for part-of-speech tagging
Given a sequence of observations X, find the best label sequence Y
Given a sequence of words, find the role of each word in the sentence
=> observations are words (discrete observations)
=> labels describe the role of each word in the sentence
[Lafferty 01] John Lafferty, Andrew McCallum & Fernando Pereira. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling
Sequence Data. In Proc. 18th International Conf. on Machine Learning, pages 282-289, 2001.
[Figure: chain of observations x(t-1), x(t), x(t+1) and labels y(t-1), y(t), y(t+1). Annotations: local combination of potentials; global combination over the sequence.]
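The formulas behind these two annotations did not survive on this slide. As a reminder only, and assuming the standard linear-chain form described in [Lafferty 01], they can be written as below. λ_k is the parameter named in the slides; f_k, Ψ_t and Z(X) are notations assumed here, and the exact expressions shown in the talk may differ.

```latex
% Local combination of potentials at position t
\Psi_t(y_{t-1}, y_t, X) = \exp\Big( \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, X, t) \Big)

% Global combination over the sequence, normalised by Z(X)
p(Y \mid X) = \frac{1}{Z(X)} \prod_{t=1}^{T} \Psi_t(y_{t-1}, y_t, X),
\qquad
Z(X) = \sum_{Y'} \prod_{t=1}^{T} \Psi_t(y'_{t-1}, y'_t, X)
```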
Feature functions
Generic notation for a feature function, which covers two kinds of functions:
- Observation functions
- Transition functions
- Each feature function is linked to a parameter λk
[Figure: observations x1, x2, ..., xT and consecutive labels y(t-1), y(t)]
Parameter estimation: maximize the conditional log-likelihood over N
labelled examples
Inference: given X, find Y* = argmax over Y of P(Y | X)
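For chain-structured CRFs, the inference step (finding the best label sequence for a given X) is commonly solved with the Viterbi algorithm. The sketch below is a minimal illustration, not the authors' implementation. It assumes the weighted feature functions have already been summed into two score tables; `obs_scores` and `trans_scores` are hypothetical names introduced here.

```python
def viterbi(obs_scores, trans_scores):
    """Return the label sequence with the highest total score.

    obs_scores[t][y]   : summed weighted observation functions for label y at position t
    trans_scores[a][b] : summed weighted transition functions for moving from label a to b
    (Both tables are assumptions of this sketch, in the log domain.)
    """
    n_labels = len(trans_scores)
    best = [list(obs_scores[0])]   # best[t][y]: best score of a path ending in y at t
    back = []                      # back[t-1][y]: previous label on that best path
    for t in range(1, len(obs_scores)):
        row, ptr = [], []
        for y in range(n_labels):
            prev = max(range(n_labels), key=lambda a: best[-1][a] + trans_scores[a][y])
            row.append(best[-1][prev] + trans_scores[prev][y] + obs_scores[t][y])
            ptr.append(prev)
        best.append(row)
        back.append(ptr)
    # Backtrack from the best final label.
    y = max(range(n_labels), key=lambda a: best[-1][a])
    path = [y]
    for ptr in reversed(back):
        y = ptr[y]
        path.append(y)
    return path[::-1]
```

With two labels, a bonus of 1 for keeping the same label, and observation scores `[[3, 0], [0, 1], [0, 5]]`, the decoder prefers the path `[0, 1, 1]`: it switches label early so that it can stay on label 1 where the evidence is strongest.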
Which physico-logical entities?
Pixels are described with numerical values
These values need some adaptation before they can
feed the CRF:
multi-scale quantization
[Figure: numerical descriptors x1, x2, ..., xT under a chain of labels y1, y2, ..., yT]
D. Hébert, T. Paquet, S. Nicolas, Continuous CRF with Multi-scale Quantization Feature Functions Application to Structure Extraction in Old Newspaper, ICDAR 2011
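The slide does not detail the quantization scheme; the ICDAR 2011 paper is the authoritative description. As a rough illustration of the general idea only, the sketch below turns one numerical descriptor into one discrete symbol per scale, which a discrete feature function could then test. The function name, the uniform binning and the default scales (2, 4 and 8 bins) are all assumptions of this sketch.

```python
def multiscale_quantize(value, low, high, bins_per_scale=(2, 4, 8)):
    """Map one numerical descriptor to a (scale, bin index) pair per scale.

    Uniform bins over [low, high] are an assumption of this sketch,
    not the scheme of the cited paper.
    """
    if high <= low:
        raise ValueError("high must be greater than low")
    # Clamp so that out-of-range values fall into the first or last bin.
    clamped = min(max(value, low), high)
    ratio = (clamped - low) / (high - low)
    symbols = []
    for n_bins in bins_per_scale:
        index = min(int(ratio * n_bins), n_bins - 1)
        symbols.append((n_bins, index))
    return symbols
```

Coarse scales group many values under one symbol, while fine scales keep more detail, so the model can weight whichever resolution turns out to be informative.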
Experiments
Identification of:
- Text lines
- Titles
- Horizontal separators
- Vertical separators
- Noisy areas
- Characters
- Inter-character white spaces
- Inter-word white spaces
• Observations are horizontal runs of pixels.
• An observation is described by:
- its length
- the median length of the vertical runs
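The run-based observations above can be sketched as follows. This is one possible reading, not the authors' code: it assumes a binarized image stored as a list of rows, and it assumes that "the median length of the vertical runs" refers to the vertical runs crossing the pixels of the horizontal run. All function names are hypothetical.

```python
from statistics import median

def horizontal_runs(row):
    """Split one image row into maximal runs of identical pixel values.

    Returns a list of (start_column, length, value) tuples.
    """
    runs = []
    start = 0
    for col in range(1, len(row) + 1):
        if col == len(row) or row[col] != row[start]:
            runs.append((start, col - start, row[start]))
            start = col
    return runs

def vertical_run_length(image, r, c):
    """Length of the vertical run of same-valued pixels passing through (r, c)."""
    value = image[r][c]
    top = r
    while top > 0 and image[top - 1][c] == value:
        top -= 1
    bottom = r
    while bottom < len(image) - 1 and image[bottom + 1][c] == value:
        bottom += 1
    return bottom - top + 1

def describe_runs(image, r):
    """Describe each horizontal run of row r by (length, median vertical run length)."""
    features = []
    for start, length, _value in horizontal_runs(image[r]):
        heights = [vertical_run_length(image, r, c) for c in range(start, start + length)]
        features.append((length, median(heights)))
    return features
```

On a small 3x4 test image, `describe_runs(image, 1)` returns one `(length, median height)` pair per run of the middle row, which is the two-value description listed on the slide.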
A generic model of data
• Not a complete document model
• A model of columns of information
• A model of entity sequences
=> A model generic enough for
various layouts
Recap of the approach
Step 1: Physico-logical entities extraction (pixel-level analysis): DONE
Step 2: Article reconstruction (higher level of analysis to identify articles)
Article reconstruction
[Figures: article reconstruction illustrated step by step on a page whose elements are labelled D, O, R, B, F, S, Z, A, P, W; the last figure adds the reading order]
Results
Quantitative evaluation:
42 images evaluated manually
226 true articles
245 articles detected
194 correct detections (85.84% of the true articles)
Over-segmentation rate of 8.41%
• 21,550 documents of 4 pages on average
(101,978 images) on the platform:
http://plair.univ-rouen.fr
• 550,000 articles
• Approximately 20 days of computation (8 cores)
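The reported percentages can be checked with a few lines of arithmetic. The slides do not define the over-segmentation rate; the formula below (surplus of detected articles relative to true articles) is an assumption that happens to reproduce the 8.41% figure. The time per image is likewise only a rough wall-clock estimate derived from the totals above.

```python
true_articles = 226
detected_articles = 245
correct_detections = 194

detection_rate = correct_detections / true_articles
# Assumption: over-segmentation is counted as the surplus of detected
# articles relative to the number of true articles.
over_segmentation = (detected_articles - true_articles) / true_articles

images = 101_978
days = 20
# Rough wall-clock time per image over the whole run (8 cores together).
seconds_per_image = days * 24 * 3600 / images

print(f"detection rate: {detection_rate:.2%}")
print(f"over-segmentation: {over_segmentation:.2%}")
print(f"wall-clock seconds per image: {seconds_per_image:.1f}")
```

Under these assumptions the run averages a little under 17 seconds of wall-clock time per image.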
Results on other layouts
Conclusion and future work
Presentation of a logical segmentation method in two steps:
- Physico-logical entities segmentation with a CRF
- Article identification with a generic layout model
Suitable for complex Manhattan layouts with a small set of rules
Average article detection rate of 85%
Future work:
- Improve the CRF model (descriptors and/or label descriptions)
- Add variability in the description of an entity (typically the definition of a
separator)
The end…
Thanks for your attention
Questions?
