University Library of KU Leuven 
Sam Alloing and Demmy Verbeke
University Library of KU Leuven 
Divisions involved: 
Arts Faculty Library 
•Collections and services focused on ongoing research and teaching in the Faculty of Arts 
•Some special collections (e.g. Gulden Librije) 
LIBIS 
•Provides services for libraries, museums and archives (inside and outside the university) 
Digitisation Unit 
•Among others, the Digital Lab: high-tech digital photography centre
Why did we get involved? 
Digitization infrastructure and experience were already in place, but focused on visualization. Now: digitization of textual material with a view to creating digital text corpora for research 
http://www.arts.kuleuven.be/ono/meso/projects/digitalisatie 
http://www.illuminare.be/rich_project 
http://www.europeana-photography.eu
Corpus 
13 books from the pretiosa collection of the Gulden Librije: 
-translations from Latin 
-books that had not been digitized yet 
Augustinus, Stad Gods (1876-8)
Augustinus, Belydenis (1741)
Boëthius, Vertroostinge der wysgeerte (1703)
Horatius, Over de dichtkunst (1866)
Horatius, Hekeldichten en brieven (1728)
Nepos, Leevens van doorlugtige mannen (1796)
Nepos, Leeven der doorluchtige veld-ooversten (1726)
Ovidius, Treur-digten (1814-5)
Ovidius, Treur-gesangen (1692)
Seneca, Christelycke Seneca (1705)
Tacitus, Vande ghedenkwaerdige geschiedenissen der Romeinen (1645)
Vergilius, Wercken (1737)
Vergilius, Aeneis (1662)
Assumptions 
•As automated as possible 
•Try as soon as possible, to fail early 
•Use ALTO format throughout the workflow
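Since ALTO is the exchange format for the whole workflow, reading the recognised text back out of an ALTO file is a recurring step. A minimal sketch, assuming the usual ALTO layout in which each `TextLine` element holds `String` elements with a `CONTENT` attribute. The sample document and the helper names `local_name` and `alto_lines` are illustrative assumptions, not part of the project's code:

```python
import xml.etree.ElementTree as ET

# Illustrative ALTO fragment (assumed sample, not taken from the corpus files).
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout>
    <Page ID="P1">
      <PrintSpace>
        <TextBlock ID="B1">
          <TextLine ID="L1">
            <String CONTENT="Vande"/>
            <SP/>
            <String CONTENT="ghedenkwaerdige"/>
          </TextLine>
          <TextLine ID="L2">
            <String CONTENT="geschiedenissen"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def local_name(tag):
    """Drop the '{namespace}' prefix so the code works across ALTO versions."""
    return tag.rsplit("}", 1)[-1]


def alto_lines(xml_text):
    """Return one string per TextLine, joining the CONTENT of its String children."""
    root = ET.fromstring(xml_text)
    lines = []
    for element in root.iter():
        if local_name(element.tag) == "TextLine":
            words = [
                child.get("CONTENT", "")
                for child in element
                if local_name(child.tag) == "String"
            ]
            lines.append(" ".join(words))
    return lines


if __name__ == "__main__":
    for line in alto_lines(SAMPLE_ALTO):
        print(line)
```

Matching on the local tag name rather than a fixed namespace is a design choice here: ALTO files produced by different tools may declare different schema versions.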
Workflow OCR 
Attestation 
Improving 
•User pattern training 
•Use dictionary 
•Improve images 
Executing OCR 
Digitisation 
Evaluation set 
ocrevalUAtion 
Lesson learnt: a high error rate is not necessarily bad 
Aletheia 
•Create ground truth 
•User friendly 
Lessons learnt: 
•B&W images 
•Remove border 
•Biggest problem: letters from other pages showing through the paper 
ABBYY FineReader engine 
•Useful sample applications 
•Windows
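The evaluation step compares OCR output against ground truth. A minimal sketch of one common measure, the character error rate (edit distance divided by the length of the ground truth). This is an illustration of the general idea, not a description of how ocrevalUAtion computes its figures; the function names are assumptions:

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete char_a
                    current[j - 1] + 1,      # insert char_b
                    previous[j - 1] + cost,  # substitute or match
                )
            )
        previous = current
    return previous[-1]


def character_error_rate(ground_truth, ocr_output):
    """Edit distance normalised by the length of the ground truth."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return levenshtein(ground_truth, ocr_output) / len(ground_truth)


if __name__ == "__main__":
    # Two historical spellings from the corpus titles, used here as stand-ins
    # for a ground-truth line and an OCR result.
    print(character_error_rate("doorluchtige", "doorlugtige"))
```

Because old spelling varies so much, a raw error rate measured this way can overstate how unusable a text is, which fits the lesson noted on the slide.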
Workflow NER 
Attestation 
Training set 
Test set 
Execute NER 
Model 
Input 
Europeana Newspaper NER 
•ALTO input from OCR 
•Lesson learnt: a lot of resources (RAM) needed 
INL Attestation tool 
Lesson learnt: much more ground truth needed than for OCR 
NERT of INL 
80/20 split training/test 
NERT of INL 
•Different split training and test set 
•Create variants from old spelling 
Improving
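The 80/20 division of the attested ground truth into a training set and a test set can be sketched as follows. This is a generic illustration, assuming the ground truth is available as a list of annotated sentences; the function name, the seed and the sample data are hypothetical and do not reflect how NERT performs the split:

```python
import random


def split_train_test(sentences, train_fraction=0.8, seed=42):
    """Shuffle reproducibly, then cut the list into training and test parts."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be between 0 and 1")
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


if __name__ == "__main__":
    annotated = [f"sentence {n}" for n in range(10)]  # placeholder ground truth
    train, test = split_train_test(annotated)
    print(len(train), "training sentences,", len(test), "test sentences")
```

Trying a different split, as mentioned under "Improving", only requires changing `train_fraction` or the seed.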
Results NER 
                 Precision   Recall   F1
Overall          0.6257      0.5130   0.5638
Location         0.675       0.2903   0.40601
Organization     1.0         0.1666   0.2857
Person           0.6207      0.5571   0.5871
Segmentation     0.6634      0.5438   0.5977

Classification accuracy: 0.9433 
> 60% recognised correctly (precision) 
≈ 50% of the entities found (recall)
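The F1 values in the NER results are the harmonic mean of precision and recall. A short sketch that recomputes them from the reported precision and recall figures (the function name `f1_score` is an assumption for illustration):

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported on the first results slide.
REPORTED = {
    "Overall": (0.6257, 0.5130),
    "Location": (0.675, 0.2903),
    "Organization": (1.0, 0.1666),
    "Person": (0.6207, 0.5571),
    "Segmentation": (0.6634, 0.5438),
}

if __name__ == "__main__":
    for label, (precision, recall) in REPORTED.items():
        print(f"{label:<14}{f1_score(precision, recall):.4f}")
```

The harmonic mean explains why Organization scores low despite perfect precision: a recall of 0.1666 pulls F1 down to about 0.29.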
Results NER, an experiment 
Input 
Corrected file 
Training file 
Test file 
Split 
Combine 
                 Precision   Recall   F1
Overall          0.8398      0.7954   0.8170
Location         0.8741      0.6720   0.7599
Organization     1.0         0.5      0.6666
Person           0.8320      0.8320   0.8320
Segmentation     0.8920      0.8448   0.8677

Classification accuracy: 0.9415 
80% recognised correctly (precision) 
≈ 80% entities found (recall)
Next steps 
•Create an OCR and NER platform for the university and as part of the LIBIS services 
•New project about OCR and (early modern) Latin texts 
•Looking into other tools: 
•Lexicon building 
•Border detection 
•Automatically remove ‘noise’ from a page 
•NER: 
•Learning to use Latin (and Greek)
Thanks! 
Questions? 
•Sam Alloing (Sam.Alloing@libis.kuleuven.be) 
•Demmy Verbeke (Demmy.Verbeke@arts.kuleuven.be; @viroviacum) 
•http://bib.kuleuven.be/english/ub
