Named Entity Recognition for Europeana Newspapers


Overview of Europeana Newspapers Named Entity Recognition for Oceanic Exchanges Workshop, Stuttgart, Germany, 8-9 May 2018

NER for Europeana Newspapers
Clemens Neudecker (@cneudecker)
Staatsbibliothek zu Berlin –
Preußischer Kulturbesitz

Why Named Entity Recognition?
• Analysis* of query log files from the National Library of Wales
newspaper website: the vast majority of search queries contain
either person or place names
* Paul Gooding, "Exploring Usage of Digital Newspaper Archives through Web Log Analysis:
A Case Study of Welsh Newspapers Online", presented at DH2014, Lausanne
• Improving information retrieval
• Linking to authority files (Linked Data)
• Historical social network analysis (HNA/SNA)

Languages
• Dutch (1614 – 1900)
• French (1814 – 1944)
• German (1721 – 1949)
• Together approx. 50% of the total collection

Many challenges
• Historical data (language)
• Noisy data (OCR)
• Multilingual data
• Lack of extensive metadata
• Lack of open resources (tagged corpora, gazetteers)
• Lack of common annotation guidelines
• Limitations of annotation tools

Reuse of existing NER tools
• Simple evaluation of
– Apache OpenNLP
– Stanford CoreNLP
– GATE
• Stanford CoreNLP chosen because
– Java-based (thread safe, scalable)
– Good performance (F-measure)
– Strong and active community
– Rather robust against noisy input (CRF)
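The F-measure used to compare the tools can be illustrated with a minimal sketch. It assumes entity annotations are represented as sets of (token index, entity type) pairs; this representation and the sample annotations are hypothetical, not the scoring scheme actually used in the evaluation.

```python
def f_measure(gold, predicted, beta=1.0):
    """Return (precision, recall, F-beta) for two collections of annotations."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    b2 = beta * beta
    f_score = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_score


# Hypothetical annotations: (token index, entity type)
gold = {(0, "PER"), (5, "LOC"), (9, "ORG"), (12, "LOC")}
predicted = {(0, "PER"), (5, "LOC"), (9, "LOC")}

precision, recall, f1 = f_measure(gold, predicted)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

An annotation only counts as correct here when both position and type match, which is a strict reading; more lenient matching schemes would give higher scores.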

Approach
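• Adaptation of Stanford CoreNLP by the KB National Library of the Netherlands
to directly consume ENMAP (= Europeana Newspapers METS/ALTO profile) objects
• Export option: ALTO v3 with tags added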
<String STYLEREFS="ID7" HEIGHT="132.0" WIDTH="570.0" HPOS="5937.0"
VPOS="3279.0" CONTENT="Reynolds" WC="0.95238096" TAGREFS="Tag5">
</String>
<String STYLEREFS="ID7" HEIGHT="102.0" WIDTH="540.0" HPOS="18438.0"
VPOS="22008.0" CONTENT="Baltimore" WC="0.82539684" TAGREFS="Tag10">
</String>
…
<Tags>
<NamedEntityTag ID="Tag5" TYPE="Person" LABEL="Reynolds"/>
<NamedEntityTag ID="Tag10" TYPE="Location" LABEL="Baltimore"/>
</Tags>
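A consumer of this export could resolve each TAGREFS value against the Tags section roughly as follows. This is a sketch under assumptions: the embedded snippet is a simplified, namespace-free stand-in for a real ALTO file (real files declare an XML namespace and a fuller layout hierarchy), and the function name `tagged_strings` is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for an ALTO v3 file; the <alto> wrapper, the <TextLine>
# grouping and the untagged "of" token are assumptions for this example.
ALTO_SNIPPET = """
<alto>
  <Tags>
    <NamedEntityTag ID="Tag5" TYPE="Person" LABEL="Reynolds"/>
    <NamedEntityTag ID="Tag10" TYPE="Location" LABEL="Baltimore"/>
  </Tags>
  <TextLine>
    <String CONTENT="Reynolds" WC="0.95238096" TAGREFS="Tag5"/>
    <String CONTENT="of" WC="0.9"/>
    <String CONTENT="Baltimore" WC="0.82539684" TAGREFS="Tag10"/>
  </TextLine>
</alto>
"""


def tagged_strings(alto_xml):
    """Return (content, entity type, OCR word confidence) for each tagged String."""
    root = ET.fromstring(alto_xml.strip())
    # Index the tag definitions by ID so references can be resolved quickly.
    tags = {tag.get("ID"): tag for tag in root.iter("NamedEntityTag")}
    results = []
    for string in root.iter("String"):
        # TAGREFS may hold several space-separated IDs; untagged words have none.
        for ref in string.get("TAGREFS", "").split():
            tag = tags.get(ref)
            if tag is not None:
                confidence = float(string.get("WC", "nan"))
                results.append((string.get("CONTENT"), tag.get("TYPE"), confidence))
    return results


entities = tagged_strings(ALTO_SNIPPET)
for content, entity_type, confidence in entities:
    print(content, entity_type, confidence)
```

Keeping the WC value next to each entity makes it possible to filter out entities recognised on words with low OCR confidence.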

Annotation
• Quick evaluation of annotation tools:
– BRAT
– WebAnno
– INL Attestation Tool
• Choice of INL Attestation Tool since:
– Optimized for tagging speed
– Supported by consortium partner (INL/IVDNT)

Corpus creation
• Selection of 100 pages per language
• Processing of the OCRed texts with Stanford NER to get initial tagging results
• Manual verification and annotation

Corpus statistics
Language   # tokens   # PER   # LOC   # ORG
French     207,000    5,672   5,614   2,574
Dutch      182,483    4,492   4,448   1,160
German      96,735    7,914   6,143   2,784

Language   # tokens   # PER   # LOC   # ORG
French     100%       2.75%   2.71%   1.24%
Dutch      100%       2.46%   2.44%   0.64%
German     100%       8.18%   6.35%   2.88%

Language   Word-Error-Rate (Bag of Words)   Reading Order Success Rate
French     16.6%                            19.9%
Dutch      17.6%                            23.2%
German     15.9% / 21.9%                    13.6%
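The percentage rows can be recomputed from the raw counts as each entity count divided by the token count. The sketch below does this with the figures from the slide; the dictionary layout and the function name `entity_shares` are illustrative choices. The French token count looks rounded, so its recomputed shares may differ from the slide in the last digit.

```python
# Counts copied from the corpus statistics slide.
corpus = {
    "French": {"tokens": 207_000, "PER": 5_672, "LOC": 5_614, "ORG": 2_574},
    "Dutch": {"tokens": 182_483, "PER": 4_492, "LOC": 4_448, "ORG": 1_160},
    "German": {"tokens": 96_735, "PER": 7_914, "LOC": 6_143, "ORG": 2_784},
}


def entity_shares(counts):
    """Return each entity type's share of all tokens, in percent."""
    tokens = counts["tokens"]
    return {
        label: round(100 * counts[label] / tokens, 2)
        for label in ("PER", "LOC", "ORG")
    }


for language, counts in corpus.items():
    print(language, entity_shares(counts))
```

The German sample is about half the size of the other two but contains roughly three times as many entities per token.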

ner-app
https://github.com/EuropeanaNewspapers/ner-app

ner-corpora
https://github.com/EuropeanaNewspapers/ner-corpora

Evaluation DE
• M. Riedl and S. Padó, "A Named Entity Recognition Shootout for German",
Proceedings of ACL, Melbourne, Australia (2018). To appear.

NER vs OCR success rate
[Chart comparing NER and OCR success rates; axis ticks run from 0.25 to 0.95]

Improving performance
• Possible additional features
– Distributional similarity (Clark 2003)
– Semantic generalization (Faruqui & Padó 2010)
– Word embeddings (Braune 2017)
• Gazetteers
– Person names, historical place names
• Data cleanup and improvement
– https://github.com/EuropeanaNewspapers/ner-corpora/wiki
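One way a gazetteer could help with noisy OCR is approximate lookup, so that a misread token still matches a known name. The sketch below uses Python's standard `difflib` for this; the four-entry gazetteer, the 0.8 similarity cutoff and the function name `lookup` are assumptions for illustration, not part of the project's tooling.

```python
from difflib import get_close_matches

# Tiny hypothetical gazetteer of place names.
GAZETTEER = ["Amsterdam", "Baltimore", "Berlin", "Stuttgart"]


def lookup(token, gazetteer=GAZETTEER, cutoff=0.8):
    """Return the closest gazetteer entry for a possibly misread token, or None."""
    by_lower = {name.lower(): name for name in gazetteer}
    matches = get_close_matches(token.lower(), list(by_lower), n=1, cutoff=cutoff)
    return by_lower[matches[0]] if matches else None


print(lookup("Ba1timore"))  # OCR confused "l" with "1"
print(lookup("Zeitung"))    # ordinary word, not a place name
```

A lower cutoff tolerates more OCR errors but also produces more false matches, so the threshold would need tuning per language and print quality.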

Trias NER
• Combination and voting of different NER
classifiers, e.g.
– Stanford CoreNLP
– spaCy
– NLTK
• Inspiration:
https://github.com/KBNLresearch/Trias_NER
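The voting idea can be sketched as a per-token majority vote over the label sequences produced by several classifiers. The example assumes all classifiers label the same tokenisation and uses hard-coded, hypothetical outputs instead of calling the tools themselves; Trias_NER may combine results differently.

```python
from collections import Counter


def vote(predictions, min_votes=2):
    """Combine per-token labels from several classifiers by majority vote.

    predictions: one label sequence per classifier, all of equal length.
    Tokens without a label reaching min_votes fall back to "O" (no entity).
    """
    if len({len(labels) for labels in predictions}) != 1:
        raise ValueError("all classifiers must label the same tokens")
    voted = []
    for labels in zip(*predictions):
        label, count = Counter(labels).most_common(1)[0]
        voted.append(label if count >= min_votes else "O")
    return voted


tokens = ["Reynolds", "visited", "Baltimore", "yesterday"]
# Hypothetical outputs of three classifiers for the tokens above.
classifier_a = ["PER", "O", "LOC", "O"]
classifier_b = ["PER", "O", "ORG", "O"]
classifier_c = ["O", "O", "LOC", "PER"]

combined = vote([classifier_a, classifier_b, classifier_c])
print(list(zip(tokens, combined)))
```

Requiring at least two agreeing classifiers discards labels that only one tool proposes, trading some recall for precision.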

Disambiguation
• Disambiguation of person and place names
• Inspiration:
https://github.com/KBNLresearch/europeananp-dbpedia-disambiguation

Linking
• Linking of recognised and disambiguated NEs
to authority files (e.g. Wikidata, GND)
• Inspiration:
https://github.com/KBNLresearch/dac

