AI for digitized cultural heritage
Qurator.ai @ Berlin State Library
Clemens Neudecker (@cneudecker)
EuropeanaTech x AI webinar
21 May 2021
Berlin State Library (SBB)
● Established 1661 in Berlin (Kingdom of Prussia)
● Largest research library in Germany
(25M media objects, 2.5 PetaBytes digital data storage)
● Forms part of the larger LAM (libraries, archives, museums) legal entity
Prussian Cultural Heritage Foundation (SPK)
● https://staatsbibliothek-berlin.de/
● In-house Digitization Center since 2007
○ ~80 concurrent digitization projects
○ ~2M scanned images annual production
● Digital collections give access to ~185k digitized documents
(mostly Public Domain)
● https://digital.staatsbibliothek-berlin.de/
Qurator.ai @ SBB
● SBB responsible for sub-project 10: “AI for digitized cultural heritage”
● Main goal: improve the quality and efficiency of (document) digitization
● Full recognition and enrichment
pipeline for digitized documents
● Development of open source tools
https://github.com/qurator-spk
● Publication of open datasets
https://zenodo.org/communities/stabi
● Releases of trained models
https://qurator-data.de/
● Showcases (only available in German)
https://qurator.ai/innovationlab/staatsbibliothek-zu-berlin/
Image Preprocessing: Binarization
● Binarization (i.e. the conversion of colour/greyscale images to black or white pixels) can be used to
increase the contrast between background (paper) and foreground (ink) and to remove defects, noise
etc. which improves subsequent processes
● Many OCR engines expect binarized images for recognition
● Training of autoencoder model for document image binarization
https://github.com/qurator-spk/sbb_binarization
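To make the idea of binarization concrete, here is a minimal sketch using classic Otsu thresholding. This is a swapped-in textbook technique, not the autoencoder model trained in sbb_binarization; the function names and the tiny sample image are assumptions for illustration only.

```python
def otsu_threshold(pixels):
    """Return the grey level that maximises between-class variance (Otsu's method)."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg, sum_bg = 0, 0
    for t in range(256):
        w_bg += hist[t]          # pixels at or below t count as "ink"
        if w_bg == 0:
            continue
        w_fg = total - w_bg      # remaining pixels count as "paper"
        if w_fg == 0:
            break
        sum_bg += t * hist[t]
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        var_between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(image):
    """Map a greyscale image (rows of 0..255 ints) to pure black (0) / white (255)."""
    flat = [p for row in image for p in row]
    t = otsu_threshold(flat)
    return [[0 if p <= t else 255 for p in row] for row in image]


# Hypothetical 2x3 greyscale patch: dark ink on light paper.
sample = [[30, 40, 200],
          [35, 210, 220]]
binary = binarize(sample)
```

A global threshold like this tends to struggle with stains and uneven illumination, which is one reason a learned model can be attractive for historical material.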
Document Image Analysis
● High-quality analysis of document layout is key for all subsequent tasks
● Training of multiple ResNet50-U-Net models for pixelwise segmentation
● 1st iteration (“pure” ML)
○ some problems with headings,
drop capitals, reading order
● 2nd iteration (“hybrid”)
○ additional heuristics deliver
improvements for textlines
and reading order detection
https://github.com/qurator-spk/eynollah
[Figure: segmentation examples showing text regions and text lines]
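As an illustration of what a reading-order heuristic on top of segmentation output might look like, the sketch below groups region bounding boxes into columns by horizontal overlap and reads each column top to bottom. It is a simplified assumption for illustration, not the heuristic implemented in eynollah; the `Region` type, the 50% overlap rule and the sample coordinates are all hypothetical.

```python
from typing import NamedTuple


class Region(NamedTuple):
    name: str
    x: int  # left edge in pixels
    y: int  # top edge in pixels
    w: int  # width
    h: int  # height


def reading_order(regions):
    """Order regions column by column (left to right), top to bottom within a column."""
    columns = []
    for r in sorted(regions, key=lambda r: r.x):
        for col in columns:
            col_left = min(c.x for c in col)
            col_right = max(c.x + c.w for c in col)
            overlap = min(col_right, r.x + r.w) - max(col_left, r.x)
            # Assumed rule: join a column if more than half the region overlaps it.
            if overlap > 0.5 * r.w:
                col.append(r)
                break
        else:
            columns.append([r])
    columns.sort(key=lambda col: min(c.x for c in col))
    return [r.name for col in columns for r in sorted(col, key=lambda c: c.y)]


# Hypothetical two-column page, given in arbitrary order.
page = [
    Region("D", 128, 90, 100, 40),
    Region("B", 12, 80, 98, 50),
    Region("C", 130, 10, 100, 50),
    Region("A", 10, 10, 100, 50),
]
order = reading_order(page)
```

A rule this simple breaks on headings that span several columns or on drop capitals, which matches the kinds of problems listed for the first iteration.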
Image (Similarity) Search
● Document layout analysis provides (pixel coordinate) information about image content contained in
the digitized documents
● Extraction (and release) of ~600k graphical elements from document images
● Training an image classification
model on the basis of ImageNet
● Detection of regions of interest (ROI) within images using YOLO v3
● Approximate nearest neighbour
search for similar images
● Alternative search and browse
entry to digitised collections
https://github.com/qurator-spk/sbb_images
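The similarity search step can be sketched as a nearest-neighbour lookup over feature vectors. The example below uses exact brute-force cosine similarity rather than an approximate index, and the three-dimensional vectors and image names are made up for illustration; in practice the vectors would come from the classification model and be much larger.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def most_similar(query, index, k=2):
    """Return the names of the k indexed vectors closest to the query."""
    ranked = sorted(index.items(),
                    key=lambda item: cosine_similarity(query, item[1]),
                    reverse=True)
    return [name for name, _ in ranked[:k]]


# Hypothetical feature vectors for three extracted graphical elements.
index = {
    "map": [0.9, 0.1, 0.0],
    "portrait": [0.1, 0.9, 0.2],
    "initial": [0.0, 0.2, 0.9],
}
hits = most_similar([1.0, 0.2, 0.0], index, k=2)
```

Brute force compares the query against every vector, so it scales linearly with collection size; approximate nearest neighbour indexes trade a little accuracy for much faster lookups on collections of this scale.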
OCR / Text Recognition
● Traditionally, OCR for historical documents is hard
(Fraktur fonts, complex layouts, defects and
damages, historical spelling)
● Thanks to deep learning for OCR (Calamari) and
public ground truth (GT) datasets (GT4HistOCR), nearly
error-free OCR is now possible!
● A single (language independent) OCR model can be
applied for both Fraktur + Antiqua (also mixed)
● Initial evaluations show reductions of
Character-Error-Rate from ~20% to ~2%
https://github.com/qurator-spk/ocrd_calamari
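For readers unfamiliar with the metric: character error rate (CER) is commonly computed as the edit distance between the OCR output and the ground truth, divided by the ground truth length. The sketch below follows that common definition; it is not taken from the project's own evaluation code, and the sample strings are made up.

```python
def levenshtein(a, b):
    """Minimum number of character insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution or match
        prev = cur
    return prev[-1]


def cer(ground_truth, ocr_text):
    """Character error rate relative to the ground truth length."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return levenshtein(ground_truth, ocr_text) / len(ground_truth)


# One wrong character in a 16-character word.
rate = cer("Staatsbibliothek", "Staatsbibliothel")
```

For scale, a CER of 2% corresponds to roughly one wrong character in every 50, while 20% means one in every five.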
OCR Postcorrection
● Even with highly accurate OCR, there remain a few recognition errors
● Idea: train a machine translation model to “translate” OCR errors to correct words
● Challenges:
○ retain historical spelling variants
○ avoid introducing new errors
● Two-step model (seq2seq LSTM):
○ First, detect the parts of text with errors
(this helps artificially increase the error
density in the input for step two)
○ Translate (i.e. correct) errors in the OCR text
● Relative OCR accuracy improvement: 18%
https://github.com/qurator-spk/sbb_ocr_postcorrection
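The two-step structure (detect first, then correct only what was flagged) can be sketched as follows. Plain dictionary lookups stand in for both seq2seq LSTM models here, so this shows only the control flow, not the actual method; the lexicon, the correction table and the sample sentence are hypothetical.

```python
def detect_suspicious(tokens, lexicon):
    """Step 1: return the positions of tokens that look like OCR errors."""
    return [i for i, tok in enumerate(tokens) if tok.lower() not in lexicon]


def correct(token, corrections):
    """Step 2: 'translate' a flagged token; leave it unchanged if no correction is known."""
    return corrections.get(token, token)


def postcorrect(text, lexicon, corrections):
    """Run detection, then correction, touching only the flagged tokens."""
    tokens = text.split()
    for i in detect_suspicious(tokens, lexicon):
        tokens[i] = correct(tokens[i], corrections)
    return " ".join(tokens)


# The lexicon deliberately contains the historical spelling "seyn",
# so step 1 never hands it to the corrector.
lexicon = {"es", "muß", "ein", "buch", "seyn"}
corrections = {"Bnch": "Buch"}
fixed = postcorrect("Es muß ein Bnch seyn", lexicon, corrections)
```

Because unflagged tokens never reach the second step, the design addresses both challenges named above: historical variants are retained, and the corrector has fewer chances to introduce new errors.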
Named Entity Recognition
● Named Entity Recognition (NER) is used to identify proper names of persons, locations,
organizations in unstructured text (here: OCR results)
● Unsupervised Pre-Training of BERT model on the digitized historical documents
● Supervised Training of BERT model for NER with labeled data for German NER
● Results are state of the art with an F1 score of 85.6%
https://github.com/qurator-spk/sbb_ner
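As a reminder of what the F1 figure measures, the sketch below computes entity-level precision, recall and F1 from exact span matches. The tuple layout (start, end, type) and the sample annotations are assumptions for illustration; the project's evaluation setup may differ in detail.

```python
def f1_score(gold, predicted):
    """Harmonic mean of precision and recall over exactly matching entity spans."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical annotations as (start_token, end_token, entity_type).
gold = [(0, 2, "PER"), (5, 6, "LOC"), (9, 11, "ORG")]
predicted = [(0, 2, "PER"), (5, 6, "ORG"), (9, 11, "ORG")]  # one wrong type
score = f1_score(gold, predicted)
```

Here two of three predictions are correct and two of three gold entities are found, so precision, recall and F1 all equal 2/3.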
Named Entity Disambiguation and Linking
● Entities recognized by NER can be ambiguous
● Example: “Paris is in France”
- Paris the city or Paris (Hilton) the person?
● Necessary to determine the correct entity by context
● Establishing a knowledge base for comparison based on Wikidata/Wikipedia
(harvesting of all articles for the corresponding categories)
● Training of a “context-comparison” BERT embeddings model that decides for a given entity
in the OCR text whether it is similar to a Wikipedia lemma
● Enrichment of the OCR text with links to Wikidata IDs and geo-coordinates for toponyms
https://github.com/qurator-spk/sbb_ned
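The idea of deciding by context can be illustrated with a deliberately simple stand-in: instead of BERT embeddings, the sketch below compares the words around a mention with a short description of each candidate, using Jaccard overlap. The candidate labels and descriptions are hypothetical and only serve the "Paris" example from the slide.

```python
def jaccard(a, b):
    """Share of distinct words that two word lists have in common."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def link_entity(context, candidates):
    """Pick the candidate whose description overlaps most with the mention's context."""
    context_words = context.lower().split()
    return max(candidates,
               key=lambda name: jaccard(context_words, candidates[name].lower().split()))


# Hypothetical knowledge base entries for the ambiguous mention "Paris".
candidates = {
    "Paris (city)": "capital city of France on the Seine",
    "Paris Hilton (person)": "American media personality and businesswoman",
}
best = link_entity("Paris is in France", candidates)
```

Word overlap only works when the same surface words appear on both sides; learned embeddings can also match contexts that are related in meaning but share no vocabulary.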
Data Annotation
● neat (named entity annotation tool) for data annotation (and OCR correction)
● Simple, browser-based JavaScript tool
(no installation or rights required)
● TSV (tab-separated-values)
as internal working format
● Embeds image snippets
via IIIF Image API to aid with annotation
● Due to popular demand (i.e. Covid-19),
neat can now also be used for OCR correction
or transcription (e.g. to create GT)
https://github.com/qurator-spk/neat
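To show how an image snippet can be requested through the IIIF Image API, the sketch below builds a URL following the API's `{identifier}/{region}/{size}/{rotation}/{quality}.{format}` pattern. The server address and identifier are placeholders, the helper name is hypothetical, and this is not code taken from neat.

```python
from urllib.parse import quote


def iiif_region_url(base, identifier, x, y, w, h, size="max"):
    """Build an IIIF Image API URL for a rectangular region given in pixels.

    size="max" follows Image API 3.0; servers that only implement 2.0 expect "full".
    """
    region = f"{x},{y},{w},{h}"
    # Identifiers may contain reserved characters, so percent-encode them.
    return f"{base}/{quote(identifier, safe='')}/{region}/{size}/0/default.jpg"


# Placeholder server and identifier: crop a text line 300 px wide and 50 px high.
url = iiif_region_url("https://iiif.example.org/image", "page-0001", 100, 200, 300, 50)
```

Because the crop is expressed in the URL itself, an annotation tool only needs the pixel coordinates from layout analysis and never has to store or cut images on its own.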
Future Work
● Processing all the digitized documents in SBB with the Qurator pipeline would give us some greatly
improved data to extend this work, and for training better models
● But AI/ML is quite demanding on computation: with our current server (36 CPU cores, 2x V100 GPUs,
192 GiB RAM) this would take years. What can we do to increase throughput without sacrificing
performance?
● Methods that combine computer vision
(document image analysis) and natural
language processing (OCR text content)
features promise further improvements
● Extending current developments to other
languages and scripts (esp. Asian) and layouts (e.g. right-to-left, vertical)
● Provision of interactive demos in our SBB LAB https://lab.sbb.berlin/
Thank you for your attention!
Questions?

EuropeanaTech x AI: Qurator.ai @ Berlin State Library

  • 1. AI for digitized cultural heritage Qurator.ai @ Berlin State Library Clemens Neudecker (@cneudecker) EuropeanaTech x AI webinar 21 May 2021
  • 2. Berlin State Library (SBB) ● Established 1661 in Berlin (Kingdom of Prussia) ● Largest research library in Germany (25M media objects, 2.5 PetaBytes digital data storage) ● Forms part of the larger LAM legal entity Prussian Cultural Heritage Foundation (SPK) ● https://staatsbibliothek-berlin.de/ ● In-house Digitization Center since 2007 ○ ~80 concurrent digitization projects ○ ~2M scanned images annual production ● Digital collections give access to ~185k digitized documents (mostly Public Domain) ● https://digital.staatsbibliothek-berlin.de/
  • 3. Qurator.ai @ SBB ● SBB responsible for sub-project 10: “AI for digitized cultural heritage” ● Main goal: improve the quality and efficiency of (document) digitization ● Full recognition and enrichment pipeline for digitized documents ● Development of open source tools https://github.com/qurator-spk ● Publication of open datasets https://zenodo.org/communities/stabi ● Releases of trained models https://qurator-data.de/ ● Showcases (only available in German) https://qurator.ai/innovationlab/staatsbibliothek-zu-berlin/
  • 4. Image Preprocessing: Binarization ● Binarization (i.e. the conversion of colour/greyscale images to black or white pixels) can be used to increase the contrast between background (paper) and foreground (ink) and to remove defects, noise etc. which improves subsequent processes ● OCR engines require binarized images for recognition ● Training of autoencoder model for document image binarization https://github.com/qurator-spk/sbb_binarization
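To make the idea of binarization concrete, here is a minimal sketch using classical global thresholding (Otsu's method). This is not the autoencoder model from sbb_binarization; it is a swapped-in textbook technique that only illustrates what "conversion to black or white pixels" means. The function names and the representation of an image as a list of rows of 0-255 grey levels are assumptions made for this example.

```python
def otsu_threshold(pixels):
    """Return the grey level (0-255) that maximizes between-class variance."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg, sum_bg = 0, 0
    for t in range(256):
        # Pixels with grey level <= t form the dark class (ink).
        w_bg += hist[t]
        sum_bg += t * hist[t]
        if w_bg == 0:
            continue
        w_fg = total - w_bg
        if w_fg == 0:
            break
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        between_var = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if between_var > best_var:
            best_var, best_t = between_var, t
    return best_t


def binarize(image):
    """Map every pixel to 0 (ink) or 255 (paper) using one global threshold."""
    flat = [p for row in image for p in row]
    t = otsu_threshold(flat)
    return [[0 if p <= t else 255 for p in row] for row in image]


# Hypothetical 2x3 greyscale patch: dark ink on bright paper.
patch = [[30, 40, 200], [220, 35, 210]]
print(binarize(patch))
```

A single global threshold tends to fail on stains, bleed-through and uneven lighting, which is one reason a learned model can be preferable for historical documents.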
  • 5. Document Image Analysis ● High-quality analysis of document layout is key for all subsequent tasks ● Training of multiple ResNet50-U-Net models for pixelwise segmentation ● 1st iteration (“pure” ML) ○ some problems with headings, drop capitals, reading order ● 2nd iteration (“hybrid”) ○ additional heuristics deliver improvements for text line and reading order detection https://github.com/qurator-spk/eynollah (Figure panels: text regions, text lines)
  • 6. Image (Similarity) Search ● Document layout analysis provides (pixel coordinate) information about image content contained in the digitized documents ● Extraction (and release) of ~600k graphical elements from document images ● Training an image classification model on the basis of ImageNet ● Region of interest (ROI) detection within images using YOLO v3 ● Approximate nearest neighbour search for similar images ● Alternative search and browse entry to digitized collections https://github.com/qurator-spk/sbb_images
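The similarity search step can be sketched as ranking feature vectors by cosine similarity. The version below is an exact, brute-force scan, not the approximate nearest neighbour index mentioned on the slide; an approximate index would replace the linear scan to scale to hundreds of thousands of images. The three-dimensional vectors and image names are made up for illustration; in practice the vectors would come from the image classification model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query, index, k=3):
    """Return the names of the k vectors in `index` closest to `query`."""
    scored = [(cosine_similarity(query, vec), name) for name, vec in index.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:k]]


# Hypothetical feature vectors for three extracted illustrations.
index = {
    "map": [1.0, 0.0, 0.1],
    "portrait": [0.0, 1.0, 0.0],
    "chart": [0.9, 0.1, 0.2],
}
print(most_similar([1.0, 0.0, 0.0], index, k=2))
```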
  • 7. OCR / Text Recognition ● Traditionally, OCR for historical documents is hard (Fraktur fonts, complex layouts, defects and damages, historical spelling) ● Thanks to deep learning for OCR (Calamari) and public GT datasets (GT4HistOCR), nearly error-free OCR is now possible! ● A single (language independent) OCR model can be applied for both Fraktur + Antiqua (also mixed) ● Initial evaluations show reductions of Character-Error-Rate from ~20% to ~2% https://github.com/qurator-spk/ocrd_calamari
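For readers unfamiliar with the metric, Character Error Rate is commonly defined as the edit (Levenshtein) distance between the OCR output and the ground truth, divided by the length of the ground truth. The sketch below follows that common definition; the slides do not state which exact variant or normalization was used in the evaluation, so treat this as an assumption.

```python
def levenshtein(a, b):
    """Minimum number of character insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution or match
        prev = cur
    return prev[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance normalized by the length of the reference text."""
    if not reference:
        raise ValueError("reference must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


# One wrong character in a 16-character word.
print(character_error_rate("Staatsbibliothek", "Staatsbibliothel"))
```

Under this definition, a drop from ~20% to ~2% means going from roughly one wrong character in five to one in fifty.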
  • 8. OCR Postcorrection ● Even with highly accurate OCR, there remain a few recognition errors ● Idea: train a machine translation model to “translate” OCR errors to correct words ● Challenges: ○ retain historical spelling variants ○ avoid introducing new errors ● Two-step model (seq2seq LSTM): ○ First, detect the parts of text with errors (this helps artificially increase the error density in the input for step two) ○ Translate (i.e. correct) errors in the OCR text ● Relative OCR accuracy improvement: 18% https://github.com/qurator-spk/sbb_ocr_postcorrection
  • 9. Named Entity Recognition ● Named Entity Recognition (NER) is used to identify proper names of persons, locations, organizations in unstructured text (here: OCR results) ● Unsupervised pre-training of BERT model on the digitized historical documents ● Supervised training of BERT model for NER with labeled data for German NER ● Results are state of the art with an F1 score of 85.6% https://github.com/qurator-spk/sbb_ner
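The F1 score is the harmonic mean of precision and recall. A minimal sketch of an entity-level computation follows, assuming exact-match scoring where an entity counts as correct only if its span and type both agree with the gold annotation. The slides do not say which scoring scheme produced the 85.6%, and the (start, end, type) tuples below are invented for the example.

```python
def f1_score(gold, predicted):
    """Entity-level F1 with exact match on (start, end, type) tuples."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical annotations: token offsets plus entity type.
gold = [(0, 2, "PER"), (5, 6, "LOC"), (9, 11, "ORG")]
predicted = [(0, 2, "PER"), (5, 6, "ORG"), (9, 11, "ORG"), (12, 13, "LOC")]
print(round(f1_score(gold, predicted), 4))
```

In this example two of four predictions are correct (precision 0.5) and two of three gold entities are found (recall 2/3), giving F1 = 4/7.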
  • 10. Named Entity Disambiguation and Linking ● Entities recognized by NER can be ambiguous ● Example: “Paris is in France” - Paris the city or Paris (Hilton) the person? ● Necessary to determine the correct entity by context ● Establishing a knowledge base for comparison based on Wikidata/Wikipedia (harvesting of all articles for the corresponding categories) ● Training of a “context-comparison” BERT embeddings model that decides for a given entity in the OCR text whether it is similar to a Wikipedia lemma ● Enrichment of the OCR text with links to Wikidata IDs and geo-coordinates for toponyms https://github.com/qurator-spk/sbb_ned
  • 11. Data Annotation ● neat (named entity annotation tool) for data annotation (and OCR correction) ● Simple, browser-based JavaScript tool (no installation or rights required) ● TSV (tab-separated values) as internal working format ● Embeds image snippets via IIIF Image API to aid with annotation ● Due to popular demand (i.e. Covid-19), neat can now also be used for OCR correction or transcription (e.g. to create GT) https://github.com/qurator-spk/neat
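How TSV rows and IIIF image snippets fit together can be sketched as follows. The IIIF Image API addresses a region with a URL of the form {base}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}, where region is "x,y,w,h" in pixels. The column names (TOKEN, NE-TAG, left, top, right, bottom), the base URL and the identifier below are hypothetical and are not taken from neat itself. The size value "full" follows Image API 2.x; version 3 uses "max" instead.

```python
import csv
import io
from urllib.parse import quote


def iiif_region_url(base_url, identifier, left, top, right, bottom,
                    size="full", fmt="jpg"):
    """Build an IIIF Image API URL for a rectangular pixel region."""
    region = f"{left},{top},{right - left},{bottom - top}"  # x,y,w,h
    ident = quote(identifier, safe="")
    return f"{base_url.rstrip('/')}/{ident}/{region}/{size}/0/default.{fmt}"


def snippet_urls(tsv_text, base_url, identifier):
    """Yield (token, snippet URL) pairs for each row of a TSV annotation file."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        url = iiif_region_url(base_url, identifier,
                              int(row["left"]), int(row["top"]),
                              int(row["right"]), int(row["bottom"]))
        yield row["TOKEN"], url


# Hypothetical annotation file with one token and its bounding box.
sample = "TOKEN\tNE-TAG\tleft\ttop\tright\tbottom\nBerlin\tB-LOC\t100\t200\t260\t240\n"
for token, url in snippet_urls(sample, "https://example.org/iiif", "page-0001"):
    print(token, url)
```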
  • 12. Future Work ● Processing all the digitized documents in SBB with the Qurator pipeline would give us greatly improved data to extend this work and to train better models ● But AI/ML is computationally demanding: with our current server (36 CPU cores, 2x V100, 192 GiB RAM) this would take years. What can we do to increase throughput without sacrificing performance? ● Methods that combine computer vision (document image analysis) and natural language processing (OCR text content) features promise further improvements ● Extending current developments to other languages and scripts (esp. Asian) and layouts (e.g. right-to-left, vertical) ● Provision of interactive demos in our SBB LAB https://lab.sbb.berlin/
  • 13. Thank you for your attention! Questions?