A Bimodal Crowdsourcing Platform for
Demographic Historical Manuscripts
Alicia Fornés, Josep Lladós, Joan Mas, Joana Maria Pujades, Anna Cabré
Computer Vision Center - Centre for Demographic Studies
Universitat Autònoma de Barcelona
Index
• Introduction
• 5CofM project: The Barcelona Marriage Licenses
• Bi-modal Crowdsourcing Platform
• Contents view
• Labeling view
• Running experience
• Generalization to other kinds of documents
• Conclusions
5CofM: Barcelona Marriage Licenses
5CofM project: Five Centuries of Marriages
• Advanced Grant – European Research Council.
• 2011 – 2016.
• Partners:
• Universitat Autònoma de Barcelona (UAB)
• Centre for Demographic Studies (CED).
• Computer Vision Center (CVC).
• Aim:
This project is based on data mining of the Llibres d'Esposalles conserved at the
Archive of the Barcelona Cathedral. This extraordinary data source comprises 291 books
of marriage license records, with information on approximately 610,000 unions
celebrated in over 250 parishes of the Diocese between 1451 and 1905.
The Barcelona Marriage Licenses
The Marriage Licenses contain information about:
– The couple (groom/bride)
– Their parents
– Their occupation (job)
– The place of origin
– The parish (church) where they married
– The fee that was paid (depending on their social class)
[Figure: sample license record with the annotated fields NAME, DATE, JOB, PLACE and FEE]
The Barcelona Marriage Licenses
[Figure: an index page and a marriage license page]
The Barcelona Marriage Licenses
“Llibres d’esposalles” from the Archives of the Barcelona Cathedral
• 244 books
• From 1451 to 1905
• Approximately 550,000 marriage licenses
Ground truth
• From volume 69
• 50 documents
• 20 classes
[Figure: index page and marriage license page; labeled regions: husband’s surname, marriage license, fee]
The Barcelona Marriage Licenses: Continuity
[Figure: sample pages from 1481 (volume 3), 1601 (volume 61), 1729 (volume 127) and 1860 (volume 200); labeled regions on the pages: husband’s surname, marriage license, fee]
The Barcelona Marriage Licenses: Fees
Marriage license fees for the two-year period that starts on
the first of May, 1627 and ends on the last day of April, 1629:
• Dukes, Marquises, Counts and Viscounts: 12 ll
• Noble knights and Lords of vassals: 2 ll 6 s
• Knights, Honored Citizens and Bourgeois: 1 ll 4 s
• Merchants, Notaries of Barcelona, Shopkeepers of distinguished materials, Chemists and Druggists: 12 s
• Shopkeepers of materials, Royal Notaries, Surgeons, Traders, Solicitors, Middlemen and Artists: 6 s
• The rest: 4 s
• The poor ones, for the love of God: -
The Barcelona Marriage Licenses
CED objectives (scholars)
– Genealogical tree
• Ancestors / descendants
– Immigration / Emigration
• Family names appear / disappear
• French surnames (descendants)
– Population (by num. of marriages)
• Plagues, epidemics, baby boom
– Parish churches
• Neighborhood is/becomes rich/poor
– Evolution of a family name
• Jobs, fees (higher or lower)
– Relationships between families
• Strategic, commercial reasons
CVC objectives (computer scientists)
– Layout analysis
• Text-line segmentation
– Word Spotting
• Query by example
• Query by string
– Handwriting Recognition
– Syntactic analysis
Document Image Analysis: Tasks
• Layout analysis: detect (crop) records, lines and words for subsequent recognition.
• Full transcription: convert images to editable text.
• Word spotting: given a query word, locate visually similar word snippets at image level.
Sample line transcription: dit dia rebere$ de Hieronym Ponsich corder de Bar^(a) fill de Jua$ Pon=
[Figure: page segmented into blocks, lines and words]
Technical architecture
[Diagram: three linked spaces (Image Space, Transcription Space, Contextual knowledge Space). Labeled steps: Scanning, HW recognition, Crowdsourcing, Data mining (Harmonization, Record linkage), Exploitation.]
Crowdsourcing platform
• Manual transcription → tedious and time-consuming task
• Crowdsourcing Platform (Divide & Conquer)
  • Split and distribute a large number of small and simple tasks
• Crowdsourcing architecture:
  • Image space (digitized documents)
  • Transcription space (extraction of information)
  • Contextual space (semantic meaning)
Crowdsourcing platform
• Web-based application: Integration of two points of view
  • Contents view: Semantic information → demographic research
  • Labeling view: Ground-truthing → document analysis research
http://www.cvc.uab.es/5cofm/
Crowdsourcing platform: Administration
Administration: managing documents and users
Crowdsourcing platform: User login
Contents view (semantics): Form filling
Contents view (semantics): Form filling (Indices)
Contents view (semantics): Checking correction
Check for possible spelling errors (words that appear only once?)
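The check above (flagging words that appear only once) can be sketched as a simple frequency count. This is a minimal illustration, not the platform's actual code: the function name `rare_values`, the `max_count` parameter and the sample surnames are all assumptions made for the example.

```python
from collections import Counter

def rare_values(values, max_count=1):
    """Return the values that occur at most max_count times, sorted alphabetically.

    Values are compared case-insensitively after trimming whitespace.
    """
    counts = Counter(v.strip().lower() for v in values if v.strip())
    return sorted(v for v, n in counts.items() if n <= max_count)

# Hypothetical transcribed surnames; "Teixidr" is a deliberate typo.
surnames = ["Teixidor", "Ferrer", "Teixidor", "Ferrer", "Teixidr", "Ponsich", "Ponsich"]
print(rare_values(surnames))  # → ['teixidr']
```

A value that occurs once is only a candidate: it may be a genuine rare name, so a person still has to confirm the correction.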
Contents view (semantics): Record Linkage
• Record linkage → genealogical tree
• Batch process searches for links between individuals:
  • Parents’ marriage, brothers’/sisters’ marriages
• The search allows spelling variations
  • String edit distance (Levenshtein), with different costs for substitutions
  • Useful for harmonization of names, surnames…
• The expert decides the correct linkage from the candidates

Bride's marriage record                   Candidate parents' marriage record
Year  Bride     Father          Mother    Year  Groom            Bride   Similarity
1638  Jeronima  Lluis Teixidor  Paula     1606  Lluis Teixidor   Paula   1
1638  Joana     Nicolau Ferrer  Antiga    1613  Nicolau Ferrera  Antiga  0.95
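The spelling-tolerant search can be illustrated with a Levenshtein distance in which each substitution has its own cost. This is a sketch under stated assumptions: the cheap letter pairs in `CHEAP_PAIRS`, the 0.5 cost and the length-based normalization in `similarity` are hypothetical choices, not the platform's actual settings, so the scores will not necessarily match the similarity values in the table.

```python
def weighted_edit_distance(a, b, sub_cost=None, indel_cost=1.0):
    """Levenshtein distance where sub_cost(x, y) prices each substitution."""
    if sub_cost is None:
        sub_cost = lambda x, y: 1.0
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = [j * indel_cost for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * indel_cost]
        for j, cb in enumerate(b, start=1):
            cost = 0.0 if ca == cb else sub_cost(ca, cb)
            curr.append(min(prev[j] + indel_cost,      # delete ca
                            curr[j - 1] + indel_cost,  # insert cb
                            prev[j - 1] + cost))       # match or substitute
        prev = curr
    return prev[-1]

# Hypothetical letter pairs that are cheap to interchange (assumption).
CHEAP_PAIRS = {frozenset("iy"), frozenset("cs"), frozenset("bv")}

def spelling_cost(x, y):
    """Substitution cost: 0.5 for a cheap pair, 1.0 otherwise."""
    return 0.5 if frozenset((x, y)) in CHEAP_PAIRS else 1.0

def similarity(a, b):
    """Map the distance to [0, 1]; 1.0 means identical strings (ignoring case)."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - weighted_edit_distance(a, b, spelling_cost) / longest

print(similarity("Lluis Teixidor", "Lluis Teixidor"))  # → 1.0
print(weighted_edit_distance("ferrer", "ferrera"))     # → 1.0
```

Candidates above a chosen similarity threshold would then be shown to the expert, who makes the final linkage decision.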
Labeling view (annotation): Transcription (lines)
Literal transcription → ground truth for handwriting recognition methods
Labeling view (annotation): Word Labeling
Word meta-data:
• Bounding box (coordinates)
• Category (e.g. groom’s name, occupation…)
• The system does the automatic correspondence → the user validates!
Integrated platform: put into correspondence contents view ↔ labeling view
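One way to picture the word meta-data and the automatic correspondence is the sketch below. The slides do not say how the system matches words to form fields, so exact text matching is a simplifying assumption here. The `WordLabel` class, the `propose_categories` function, the field names and the pixel coordinates are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class WordLabel:
    text: str                        # literal transcription of the word
    bbox: Tuple[int, int, int, int]  # (x, y, width, height) in pixels
    category: Optional[str] = None   # e.g. "groom_name"; None if unlabeled

def propose_categories(words, form):
    """Propose a category for every word whose text equals a form field token."""
    lookup = {}
    for category, value in form.items():
        for token in value.lower().split():
            lookup.setdefault(token, category)  # first field wins on clashes
    for word in words:
        word.category = lookup.get(word.text.lower())
    return words

# Hypothetical form values (contents view) and segmented words (labeling view).
form = {"groom_name": "Hieronym", "groom_surname": "Ponsich", "groom_occupation": "corder"}
words = [
    WordLabel("de", (10, 5, 30, 20)),
    WordLabel("Hieronym", (45, 5, 120, 20)),
    WordLabel("Ponsich", (170, 5, 100, 20)),
    WordLabel("corder", (275, 5, 90, 20)),
]
for w in propose_categories(words, form):
    print(w.text, w.category)
```

The proposals are only suggestions: as the slide says, the user validates them.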
Running Experience
ADVANTAGES
• Digital source
  • Not necessary to go to the Archive
  • No timetable limitations
• Parallelization
  • Many users work simultaneously
• Centralization
  • Easier management of images, users, database...
  • Easy to see “who works on what”
• Automatic control
  • The system forces users to fill in some fields and raises warnings
  • Useful for detection of spelling errors (auto-correction)
• Security
  • Frequent back-ups
  • Users can view the documents assigned to them, but not download them
• Monitoring
  • The administrator can monitor the users’ work and provide feedback
• Visualization and comfort
  • Drag (move), zoom in/out
DISADVANTAGES
• An Internet connection is always needed
  • If the system is down (e.g. maintenance) → no one can work
Generalization to other demographic manuscripts
• The platform has been adapted for census documents
Conclusions
• Web-based crowdsourcing platform for demographic manuscripts
• Integrates the needs of demographers and computer scientists
Future directions
• Improve validation
  • Combine the output of several users
  • Compare with the output of document analysis techniques
• Mobile-based applications
  • For crowdsourcing → faster ground-truth generation
  • For browsing and searching → user-friendly interfaces
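Combining the output of several users could, for instance, be done by majority voting. The slides do not name a combination method, so this is only one possible sketch. The function `combine_answers` and its `min_agreement` threshold are assumptions.

```python
from collections import Counter

def combine_answers(answers, min_agreement=0.5):
    """Majority vote over user answers.

    Returns (winner, share). The winner is None when the most frequent
    answer does not exceed the min_agreement share of all answers.
    """
    if not answers:
        return None, 0.0
    normalized = [a.strip() for a in answers]
    winner, votes = Counter(normalized).most_common(1)[0]
    share = votes / len(normalized)
    return (winner if share > min_agreement else None), share

# Five hypothetical transcriptions of the same word.
print(combine_answers(["Ferrer", "Ferrer", "Ferrera", "Ferrer", "Ferer"]))  # → ('Ferrer', 0.6)
```

Items without a clear majority (winner `None`) could be routed to an expert or compared with the output of automatic document analysis.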
Crowdsourcing on mobile devices
Initial: 29 pages
• Task 1, Page layout: R · 30 s/T · 1 T/P · 29 P
• Task 2, Bounding Box: R · 30 s/T · 18 T/P · 29 P
• Task 3, Word Segmentation: R · 10 s/T · 360 T/P · 29 P
Legend: R = redundancy (R = 5; each task is solved by different people), s/T = seconds per task, T/P = tasks per page, P = pages
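Reading each "R · s/T · T/P · P" expression as a product (an assumption about the slide's notation) gives the total user time per task type. The sketch below works through that arithmetic with the figures from the slide; the `Task` class and `total_seconds` function are illustrative names.

```python
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    seconds_per_task: int  # s/T
    tasks_per_page: int    # T/P

REDUNDANCY = 5  # R: each task is solved by 5 different people
PAGES = 29      # P

TASKS = [
    Task("Page layout", 30, 1),
    Task("Bounding box", 30, 18),
    Task("Word segmentation", 10, 360),
]

def total_seconds(task, redundancy=REDUNDANCY, pages=PAGES):
    """Total user time for one task type: R x s/T x T/P x P."""
    return redundancy * task.seconds_per_task * task.tasks_per_page * pages

for t in TASKS:
    print(f"{t.name}: {total_seconds(t)} s")
# Page layout: 4350 s
# Bounding box: 78300 s
# Word segmentation: 522000 s

print(f"Total: {sum(total_seconds(t) for t in TASKS) / 3600:.1f} h")  # → Total: 168.0 h
```

Under this reading, word segmentation dominates the effort, which is why splitting it into many short tasks for many users matters most.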
Browsing the marriage licenses on a mobile device
Thank you!!

Weitere ähnliche Inhalte

Ähnlich wie Crowdsourcing Platform for Historical Manuscripts

Forty Years of the OTA
Forty Years of the OTAForty Years of the OTA
Forty Years of the OTAMartin Wynne
 
Neven Vrček - Role of Governments, Academy & Science Parks - University of Za...
Neven Vrček - Role of Governments, Academy & Science Parks - University of Za...Neven Vrček - Role of Governments, Academy & Science Parks - University of Za...
Neven Vrček - Role of Governments, Academy & Science Parks - University of Za...CUBCCE Conference
 
Gaenovium - Open data in the Netherlands
Gaenovium - Open data in the NetherlandsGaenovium - Open data in the Netherlands
Gaenovium - Open data in the NetherlandsBob Coret
 
ESDG seminar 2019: reconstructing a country
ESDG seminar 2019: reconstructing a countryESDG seminar 2019: reconstructing a country
ESDG seminar 2019: reconstructing a countryRick Mourits
 
Nemeth Marton - Widening the limits of cognitive reception with online digita...
Nemeth Marton - Widening the limits of cognitive reception with online digita...Nemeth Marton - Widening the limits of cognitive reception with online digita...
Nemeth Marton - Widening the limits of cognitive reception with online digita...BOBCATSSS 2017
 
[DCSB] Dr Gabriel Bodard (KCL) “A View on Digital Classics Collaboration: fro...
[DCSB] Dr Gabriel Bodard (KCL) “A View on Digital Classics Collaboration: fro...[DCSB] Dr Gabriel Bodard (KCL) “A View on Digital Classics Collaboration: fro...
[DCSB] Dr Gabriel Bodard (KCL) “A View on Digital Classics Collaboration: fro...Digital Classicist Seminar Berlin
 
Digital archaeology and museums
Digital archaeology and museumsDigital archaeology and museums
Digital archaeology and museumsdejp3
 
Widening the limits of cognitive reception with online digital library graph ...
Widening the limits of cognitive reception with online digital library graph ...Widening the limits of cognitive reception with online digital library graph ...
Widening the limits of cognitive reception with online digital library graph ...Marton Nemeth
 
Research into Practice case study 2: Library linked data implementations an...
	Research into Practice case study 2:  Library linked data implementations an...	Research into Practice case study 2:  Library linked data implementations an...
Research into Practice case study 2: Library linked data implementations an...Hazel Hall
 
Standards in health informatics - Problem, clinical models and terminologies
Standards in health informatics - Problem, clinical models and terminologiesStandards in health informatics - Problem, clinical models and terminologies
Standards in health informatics - Problem, clinical models and terminologiesSilje Ljosland Bakke
 
Semantic Web for Cultural Heritage valorisation
Semantic Web for Cultural Heritage valorisationSemantic Web for Cultural Heritage valorisation
Semantic Web for Cultural Heritage valorisationValentina Carriero
 
Liber 2014 - Chain Reactions: TEL & RLUK on their Linked Open data.
Liber 2014 - Chain Reactions: TEL & RLUK on their Linked Open data.Liber 2014 - Chain Reactions: TEL & RLUK on their Linked Open data.
Liber 2014 - Chain Reactions: TEL & RLUK on their Linked Open data.Mike Mertens
 
The National Bibliographic Knowledgebase
The National Bibliographic KnowledgebaseThe National Bibliographic Knowledgebase
The National Bibliographic KnowledgebaseJisc
 
Linked Statistical Data 101
Linked Statistical Data 101Linked Statistical Data 101
Linked Statistical Data 101Oscar Corcho
 
SENESCHAL: Semantic ENrichment Enabling Sustainability of arCHAeological Link...
SENESCHAL: Semantic ENrichment Enabling Sustainability of arCHAeological Link...SENESCHAL: Semantic ENrichment Enabling Sustainability of arCHAeological Link...
SENESCHAL: Semantic ENrichment Enabling Sustainability of arCHAeological Link...CIGScotland
 
OpenAIRE workshop @ OR2016 - From Repositories, for repositories
OpenAIRE workshop @ OR2016 - From Repositories, for repositoriesOpenAIRE workshop @ OR2016 - From Repositories, for repositories
OpenAIRE workshop @ OR2016 - From Repositories, for repositoriesOpenAIRE
 

Ähnlich wie Crowdsourcing Platform for Historical Manuscripts (20)

Forty Years of the OTA
Forty Years of the OTAForty Years of the OTA
Forty Years of the OTA
 
Neven Vrček - Role of Governments, Academy & Science Parks - University of Za...
Neven Vrček - Role of Governments, Academy & Science Parks - University of Za...Neven Vrček - Role of Governments, Academy & Science Parks - University of Za...
Neven Vrček - Role of Governments, Academy & Science Parks - University of Za...
 
Gaenovium - Open data in the Netherlands
Gaenovium - Open data in the NetherlandsGaenovium - Open data in the Netherlands
Gaenovium - Open data in the Netherlands
 
ESDG seminar 2019: reconstructing a country
ESDG seminar 2019: reconstructing a countryESDG seminar 2019: reconstructing a country
ESDG seminar 2019: reconstructing a country
 
Nemeth Marton - Widening the limits of cognitive reception with online digita...
Nemeth Marton - Widening the limits of cognitive reception with online digita...Nemeth Marton - Widening the limits of cognitive reception with online digita...
Nemeth Marton - Widening the limits of cognitive reception with online digita...
 
Open access in Latin America and the Caribbean (LAC)
Open access in Latin America and the Caribbean (LAC)Open access in Latin America and the Caribbean (LAC)
Open access in Latin America and the Caribbean (LAC)
 
[DCSB] Dr Gabriel Bodard (KCL) “A View on Digital Classics Collaboration: fro...
[DCSB] Dr Gabriel Bodard (KCL) “A View on Digital Classics Collaboration: fro...[DCSB] Dr Gabriel Bodard (KCL) “A View on Digital Classics Collaboration: fro...
[DCSB] Dr Gabriel Bodard (KCL) “A View on Digital Classics Collaboration: fro...
 
Digital archaeology and museums
Digital archaeology and museumsDigital archaeology and museums
Digital archaeology and museums
 
Widening the limits of cognitive reception with online digital library graph ...
Widening the limits of cognitive reception with online digital library graph ...Widening the limits of cognitive reception with online digital library graph ...
Widening the limits of cognitive reception with online digital library graph ...
 
Research into Practice case study 2: Library linked data implementations an...
	Research into Practice case study 2:  Library linked data implementations an...	Research into Practice case study 2:  Library linked data implementations an...
Research into Practice case study 2: Library linked data implementations an...
 
Open access in Latin America and the Caribbean (LAC)
Open access in Latin America and the Caribbean (LAC)Open access in Latin America and the Caribbean (LAC)
Open access in Latin America and the Caribbean (LAC)
 
Standards in health informatics - Problem, clinical models and terminologies
Standards in health informatics - Problem, clinical models and terminologiesStandards in health informatics - Problem, clinical models and terminologies
Standards in health informatics - Problem, clinical models and terminologies
 
Winter, Chandler, Biedenbach, Pearson, and Stanton, "It’s Only as Good as the...
Winter, Chandler, Biedenbach, Pearson, and Stanton, "It’s Only as Good as the...Winter, Chandler, Biedenbach, Pearson, and Stanton, "It’s Only as Good as the...
Winter, Chandler, Biedenbach, Pearson, and Stanton, "It’s Only as Good as the...
 
Semantic Web for Cultural Heritage valorisation
Semantic Web for Cultural Heritage valorisationSemantic Web for Cultural Heritage valorisation
Semantic Web for Cultural Heritage valorisation
 
Liber 2014 - Chain Reactions: TEL & RLUK on their Linked Open data.
Liber 2014 - Chain Reactions: TEL & RLUK on their Linked Open data.Liber 2014 - Chain Reactions: TEL & RLUK on their Linked Open data.
Liber 2014 - Chain Reactions: TEL & RLUK on their Linked Open data.
 
The National Bibliographic Knowledgebase
The National Bibliographic KnowledgebaseThe National Bibliographic Knowledgebase
The National Bibliographic Knowledgebase
 
Linked Statistical Data 101
Linked Statistical Data 101Linked Statistical Data 101
Linked Statistical Data 101
 
SENESCHAL: Semantic ENrichment Enabling Sustainability of arCHAeological Link...
SENESCHAL: Semantic ENrichment Enabling Sustainability of arCHAeological Link...SENESCHAL: Semantic ENrichment Enabling Sustainability of arCHAeological Link...
SENESCHAL: Semantic ENrichment Enabling Sustainability of arCHAeological Link...
 
OpenAIRE workshop @ OR2016 - From Repositories, for repositories
OpenAIRE workshop @ OR2016 - From Repositories, for repositoriesOpenAIRE workshop @ OR2016 - From Repositories, for repositories
OpenAIRE workshop @ OR2016 - From Repositories, for repositories
 
Ee bdm ws-v1
Ee bdm ws-v1Ee bdm ws-v1
Ee bdm ws-v1
 

Mehr von IMPACT Centre of Competence

Mehr von IMPACT Centre of Competence (20)

Session6 01.helmut schmid
Session6 01.helmut schmidSession6 01.helmut schmid
Session6 01.helmut schmid
 
Session1 03.hsian-an wang
Session1 03.hsian-an wangSession1 03.hsian-an wang
Session1 03.hsian-an wang
 
Session7 03.katrien depuydt
Session7 03.katrien depuydtSession7 03.katrien depuydt
Session7 03.katrien depuydt
 
Session7 02.peter kiraly
Session7 02.peter kiralySession7 02.peter kiraly
Session7 02.peter kiraly
 
Session6 04.giuseppe celano
Session6 04.giuseppe celanoSession6 04.giuseppe celano
Session6 04.giuseppe celano
 
Session6 03.sandra young
Session6 03.sandra youngSession6 03.sandra young
Session6 03.sandra young
 
Session6 02.jeremi ochab
Session6 02.jeremi ochabSession6 02.jeremi ochab
Session6 02.jeremi ochab
 
Session5 04.evangelos varthis
Session5 04.evangelos varthisSession5 04.evangelos varthis
Session5 04.evangelos varthis
 
Session5 03.george rehm
Session5 03.george rehmSession5 03.george rehm
Session5 03.george rehm
 
Session5 02.tom derrick
Session5 02.tom derrickSession5 02.tom derrick
Session5 02.tom derrick
 
Session5 01.rutger vankoert
Session5 01.rutger vankoertSession5 01.rutger vankoert
Session5 01.rutger vankoert
 
Session4 04.senka drobac
Session4 04.senka drobacSession4 04.senka drobac
Session4 04.senka drobac
 
Session3 04.arnau baro
Session3 04.arnau baroSession3 04.arnau baro
Session3 04.arnau baro
 
Session3 03.christian clausner
Session3 03.christian clausnerSession3 03.christian clausner
Session3 03.christian clausner
 
Session3 02.kimmo ketunnen
Session3 02.kimmo ketunnenSession3 02.kimmo ketunnen
Session3 02.kimmo ketunnen
 
Session3 01.clemens neudecker
Session3 01.clemens neudeckerSession3 01.clemens neudecker
Session3 01.clemens neudecker
 
Session2 04.ashkan ashkpour
Session2 04.ashkan ashkpourSession2 04.ashkan ashkpour
Session2 04.ashkan ashkpour
 
Session2 03.juri opitz
Session2 03.juri opitzSession2 03.juri opitz
Session2 03.juri opitz
 
Session2 02.christian reul
Session2 02.christian reulSession2 02.christian reul
Session2 02.christian reul
 
Session2 01.emad mohamed
Session2 01.emad mohamedSession2 01.emad mohamed
Session2 01.emad mohamed
 

Kürzlich hochgeladen

Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demoHarshalMandlekar2
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESmohitsingh558521
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxBkGupta21
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 

Kürzlich hochgeladen (20)

Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demo
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 

Crowdsourcing Platform for Historical Manuscripts

  • 1. A Bimodal Crowdsourcing Platform for Demographic Historical Manuscripts Alicia Fornés, Josep Lladós, Joan Mas, Joana Maria Pujades, Anna Cabré Computer Vision Center - Centre for Demographic Studies Universitat Autònoma de Barcelona
  • 2. 2 Index  Introduction  5CofM project: The Barcelona Marriage Licenses  Bi-modal Crowdsourcing Platform  Contents view  Labeling view  Running experience  Generalization to other kind of documents  Conclusions
  • 3. 3 5CofM: Barcelona Marriage Licenses 5CofM project: Five Centuries of Marriages • Advanced Grant – European Research Council. • 2011 – 2016. • Partners: • Universitat Autònoma de Barcelona (UAB) • Centre for Demographic Studies (CED). • Computer Vision Center (CVC). • Aim: This project is based on the data-mining of the Llibres d'Esposalles conserved at the Archive of the Barcelona Cathedral. This extraordinary data source comprises 291 books of marriage licenses records, with information of approximately 610.000 unions celebrated in over 250 parishes of the Diocese between 1451 and 1905.
  • 4. 4 The Barcelona Marriage Licenses The Marriage Licenses contain information about: – The couple (groom/bride) – Their parents – Their occupation (job) – The place of origin – The parish (church) where they married – The fee that was paid (depending on their social class) NAME DATE JOB PLACE FEE NAME NAME
  • 5. 5 The Barcelona Marriage Licenses Index Marriage Licenses
  • 6. 6 The Barcelona Marriage Licenses “Llibres d’esposalles” from the Archives of the Barcelona Cathedral • 244 books • From 1451 to 1905 • Approximately 550.000 marriages licenses Ground truth • From the volume 69 • 50 documents • 20 classes Index License marriage Husband’s surname License marriage Fee 6
  • 7. 7 The Barcelona Marriage Licenses: Continuity 1481: volume 3 1601: volume 61 Marriage license Husband’s surname 1729: volume 127 1860: volume 200 Fee Marriage license Fee Husband’s surname Marriage license Fee Husband’s surname Marriage license Fee
  • 8. 8 The Barcelona Marriage Licenses: Fees Marriage licenses fees for the two year period that starts on the first of May, 1627 and ends on the last day of April, 1629 Dukes, Marquises, Counts and Viscounts. Noble knights and Lords of vassals. Knights, Honored Citizens and Bourgeoisies. Merchants, Notaries of Barcelona, Shopkeepers of distinguish materials, Chemists and Druggists. Shopkeepers of materials, Royal Notaries, Surgeons, Traders, Solicitors, Middlemen and Artists. The rest. The poor ones for the love of God. 12 ll 2ll 6s 1ll 4s 12s 6s 4s -
  • 9. 9 CED objectives (scholars) – Genealogic tree • Ancestors / descendants – Immigration / Emigration • Family names appear / disappear • French surnames (descendants) – Population (by num. of marriages) • Plagues, epidemics, baby boom – Parish churches • Neighborhood is/becomes rich/poor – Evolution of a family name • Jobs, fees (higher or lower) – Relationships between families • Strategic, commercial reasons CVC objectives (computer scientists) – Layout analysis • Text-line segmentation – Word Spotting • Query by example • Query by string – Handwriting Recognition – Syntactic analysis The Barcelona Marriage Licenses
  • 10. Document Image Analysis: Tasks • Layout analysis: detect (crop) records, lines and words for subsequent recognition. • Full transcription: convert images to editable text. • Word spotting: given a query word, locate visually similar word snippets at the image level. Example line transcription: dit dia rebere$ de Hieronym Ponsich corder de Bar^(a) fill de Jua$ Pon= Figure labels: BLOCKS, LINES, WORDS.
  • 12. Technical architecture. Spaces: Image Space, Transcription Space, Contextual knowledge Space. Processes: scanning, HW recognition, crowdsourcing, data mining (harmonization, record linkage), exploitation.
  • 13. Crowdsourcing platform • Manual transcription → tedious and time-consuming task • Crowdsourcing platform (divide and conquer) • Split the work into a large number of small, simple tasks and distribute them • Crowdsourcing architecture: • Image space (digitized documents) • Transcription space (extraction of information) • Contextual space (semantic meaning)
  • 14. Crowdsourcing platform • Web-based application: integration of two points of view • Contents view: semantic information → demographic research • Labeling view: ground-truthing → document analysis research http://www.cvc.uab.es/5cofm/
  • 18. Contents view (semantics): Form filling (Indices)
  • 19. Contents view (semantics): Checking correction. Check for possible spelling errors (words that appear only once?)
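The "words that appear only once" check can be sketched as follows. This is a minimal illustration, not the platform's code: the function name `flag_singletons` and the sample surnames are assumptions made for the example, and a value that occurs once is only a candidate for review, not a confirmed error.

```python
from collections import Counter

def flag_singletons(values):
    """Return the values that occur exactly once (case-insensitive),
    in order of first appearance. These are candidates for review."""
    counts = Counter(v.strip().lower() for v in values)
    return [v for v in values if counts[v.strip().lower()] == 1]

# Hypothetical transcribed surnames; "Ferrera" occurs only once.
surnames = ["Teixidor", "Ferrer", "Teixidor", "Ferrera", "Ferrer", "Ponsich", "Ponsich"]
print(flag_singletons(surnames))
```

A rare but correct surname would also be flagged, so the result is best shown to a human checker rather than corrected automatically.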
  • 20. Contents view (semantics): Record Linkage • Record linkage → genealogical tree • A batch process searches for links between individuals: • Parents’ marriage, brothers’/sisters’ marriages • The search allows spelling variations • String edit distance (Levenshtein), with different costs for substitutions • Useful for harmonization of names, surnames… • The expert decides the correct linkage from the candidates. Example candidates (child’s record | candidate parents’ marriage | similarity): 1638, bride Jeronima, father Lluis Teixidor, mother Paula | 1606, groom Lluis Teixidor, bride Paula | 1; 1638, bride Joana, father Nicolau Ferrer, mother Antiga | 1613, groom Nicolau Ferrera, bride Antiga | 0.95
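A Levenshtein distance with per-pair substitution costs, as named on the slide, might look like the sketch below. The slides do not give the cost table or the way distance is turned into a similarity score, so both are assumptions here: `sub_cost` is a hypothetical dictionary of cheaper substitutions, and `similarity` normalizes by the length of the longer string.

```python
def weighted_edit_distance(a, b, sub_cost=None, indel_cost=1.0):
    """Levenshtein distance where selected character pairs are cheaper
    to substitute. sub_cost maps frozenset({c1, c2}) -> cost."""
    sub_cost = sub_cost or {}
    # prev[j] = distance between the processed prefix of a and b[:j]
    prev = [j * indel_cost for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * indel_cost]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                cost = 0.0
            else:
                cost = sub_cost.get(frozenset((ca, cb)), 1.0)
            curr.append(min(prev[j] + indel_cost,       # delete ca
                            curr[j - 1] + indel_cost,   # insert cb
                            prev[j - 1] + cost))        # substitute
        prev = curr
    return prev[-1]

def similarity(a, b, **kwargs):
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - weighted_edit_distance(a.lower(), b.lower(), **kwargs) / longest

print(similarity("Lluis Teixidor Paula", "Lluis Teixidor Paula"))
print(round(similarity("Nicolau Ferrer Antiga", "Nicolau Ferrera Antiga"), 2))
```

Under this assumed normalization the second pair differs by one inserted letter over 22 characters, which gives a score of about 0.95. Lowering the cost for pairs that scribes commonly interchanged (for example `{frozenset("iy"): 0.5}`) makes such variants rank higher among the candidates shown to the expert.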
  • 22. Labeling view (annotation): Transcription (lines). Literal transcription → ground truth for handwriting recognition methods
  • 23. Labeling view (annotation): Word Labeling. Word metadata: • Bounding box (coordinates) • Category (e.g. groom’s name, occupation…) • The system computes the correspondence automatically → the user validates! Integrated platform: the contents view and the labeling view are put into correspondence
  • 25. Running Experience. ADVANTAGES • Digital source • No need to go to the Archive • No timetable limitations • Parallelization • Many users work simultaneously • Centralization • Easier management of images, users, database... • Easy to see “who works on what” • Automatic control • The system forces users to fill in some fields and raises warnings • Useful for detecting spelling errors (auto-correction)
  • 26. Running Experience. ADVANTAGES • Security • Frequent backups • Users can view the documents assigned to them, but not download them • Monitoring • The administrator can monitor the users’ work and provide feedback • Visualization and comfort • Drag (move), zoom in/out. DISADVANTAGES • An Internet connection is always needed • If the system is down (e.g. maintenance) → no one can work
  • 28. Generalization to other demographic manuscripts • The platform has been adapted for census documents
  • 30. Conclusions • Web-based crowdsourcing platform for demographic manuscripts • Integrates the needs of demographers and computer scientists. Future directions • Improve validation • Combine the output of several users • Compare with the output of document analysis techniques • Mobile-based applications • For crowdsourcing → faster ground-truth generation • For browsing and searching → user-friendly interfaces
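One simple way to "combine the output of several users" is majority voting over redundant answers. The slides list this only as a future direction and do not say which method is intended, so the sketch below is an assumption: the function name `combine_transcriptions` and the `min_agreement` threshold are hypothetical.

```python
from collections import Counter

def combine_transcriptions(answers, min_agreement=0.5):
    """Majority vote over redundant transcriptions of the same word.
    Returns (winner, agreement); winner is None when no answer is
    given by more than min_agreement of the users."""
    if not answers:
        return None, 0.0
    counts = Counter(a.strip() for a in answers)
    best, votes = counts.most_common(1)[0]
    agreement = votes / len(answers)
    return (best if agreement > min_agreement else None), agreement

# Five hypothetical users transcribing the same surname.
print(combine_transcriptions(["Ferrer", "Ferrer", "Ferrera", "Ferrer", "Ferre"]))
```

Words without a clear majority come back as `None` and can be routed to an expert, or compared with the output of an automatic recognizer.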
  • 31. Crowdsourcing on mobile devices. Initial experiment (29 pages). Redundancy: each task is solved by different people (R = 5). Notation: s/T = seconds per task, T/P = tasks per page, P = pages. Task 1, page layout: R · 30 s/T · 1 T/P · 29 P. Task 2, bounding box: R · 30 s/T · 18 T/P · 29 P. Task 3, word segmentation: R · 10 s/T · 360 T/P · 29 P
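The products on this slide can be evaluated directly. The sketch below assumes that each product is the total crowd time for a task type (redundancy × seconds per task × tasks per page × pages); the numbers come from the slide, while the names `TASKS` and `total_seconds` are made up for the example.

```python
REDUNDANCY = 5   # R: each task is solved by 5 different people
PAGES = 29       # P: pages in the initial experiment

# task name -> (seconds per task, tasks per page), as given on the slide
TASKS = {
    "Page layout": (30, 1),
    "Bounding box": (30, 18),
    "Word segmentation": (10, 360),
}

def total_seconds(seconds_per_task, tasks_per_page,
                  redundancy=REDUNDANCY, pages=PAGES):
    """Total crowd time in seconds: R * s/T * T/P * P."""
    return redundancy * seconds_per_task * tasks_per_page * pages

for name, (secs, per_page) in TASKS.items():
    total = total_seconds(secs, per_page)
    print(f"{name}: {total} s ({total / 3600:.2f} h)")
```

With the slide's figures this gives 4,350 s for page layout, 78,300 s for bounding boxes and 522,000 s (145 hours) for word segmentation, so word segmentation dominates the total effort.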
  • 32. Browsing the marriage licenses on a mobile device