An Open Corpus for Named Entity
Recognition in Historic Newspapers
Clemens Neudecker
Berlin State Library
@cneudecker
LREC2016, 23-28 May 2016, Portorož, Slovenia
Background
• Europeana Newspapers EU-project:
www.europeana-newspapers.eu
• OCRed 12m pages of historic newspapers
from Europe (an estimated 25 billion words!)
• Newspaper content from 23 libraries, in 40
languages, covering 4 centuries (1618-1990)
• Public domain full-text available for download
per language/content provider
Formats & Standards
• Full-text produced in ALTO
• Metadata (structural) in METS
• Metadata (bibliographic) in EDM
• Not a fan of XML?
Good ol' plain text (UTF-8) is also available…
research.europeana.eu/itemtype/newspapers
• Currently working on:
– API for text/search
– API for images (IIIF)
Approach
• 3 languages selected for NER:
Dutch, German, French – in collab. with
• Content in these languages constitutes about
50% of the overall full-text in the collection
Methodology
• Select 100 representative pages per language
– If a classifier already exists for the given language,
run it on the selected 100 pages
– Ingest tagged/untagged pages to annotation tool
– Manually add/correct annotations
(>=2 librarians per language)
– Export and convert tagged data to BIO format
– Train classifier from BIO & gazetteers (if available)
– Evaluate derived classifier using 4-fold cross-validation
– Repeat until classification performance converges
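The "convert tagged data to BIO format" step can be sketched as follows. This is a minimal illustration, not the project's converter: the span representation (token start, exclusive end, entity type) and the example sentence are assumptions made up for the sketch.

```python
from typing import List, Tuple

# Hypothetical input format: entity spans as (start, end, type) over token
# indices, with `end` exclusive. The project's real export format may differ.
Span = Tuple[int, int, str]


def to_bio(tokens: List[str], spans: List[Span]) -> List[Tuple[str, str]]:
    """Label each token B-<type> (begin), I-<type> (inside) or O (outside)."""
    labels = ["O"] * len(tokens)
    for start, end, etype in spans:
        labels[start] = f"B-{etype}"
        for i in range(start + 1, end):
            labels[i] = f"I-{etype}"
    return list(zip(tokens, labels))


# Made-up example sentence, one token per output line as in BIO training files.
tokens = ["Joshua", "Reynolds", "visited", "Baltimore", "."]
spans = [(0, 2, "PER"), (3, 4, "LOC")]
for token, label in to_bio(tokens, spans):
    print(f"{token}\t{label}")
```

The tab-separated token/label layout is a common input shape for CRF-based taggers; the exact column order expected by a given tool should be checked against its documentation.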
NER software
• Tested Stanford NER, OpenNLP, NLTK, GATE
• Adaptation of Stanford NER package (CRF)
– Mature, well-documented, widely used
– Open source (GPL)
– Thread-safe & platform-independent (JVM)
– Machine learning scales out more easily
to multiple languages
– Prior experience working with CRF
NER encoding in ALTO
• In ALTO versions >2.1, this is possible:
<String STYLEREFS="ID7" HEIGHT="132.0" WIDTH="570.0" HPOS="5937.0"
VPOS="3279.0" CONTENT="Reynolds" WC="0.95238096" TAGREFS="Tag5">
</String>
<String STYLEREFS="ID7" HEIGHT="102.0" WIDTH="540.0" HPOS="18438.0"
VPOS="22008.0" CONTENT="Baltimore" WC="0.82539684" TAGREFS="Tag10">
</String>
…
<Tags>
<NamedEntityTag ID="Tag5" TYPE="Person" LABEL="Reynolds"/>
<NamedEntityTag ID="Tag10" TYPE="Location" LABEL="Baltimore"/>
</Tags>
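Reading such annotations back out is a matter of resolving each TAGREFS value against the Tags section. The sketch below uses Python's standard xml.etree.ElementTree on a simplified fragment modelled on the slide. The wrapper elements, the untagged "in" token and the absence of an XML namespace are assumptions for illustration; real ALTO files declare a namespace and nest String elements more deeply.

```python
import xml.etree.ElementTree as ET

# Simplified, namespace-free fragment modelled on the slide (an assumption;
# real ALTO files declare a namespace and wrap Strings in Page/TextLine etc.).
ALTO_FRAGMENT = """
<alto>
  <Tags>
    <NamedEntityTag ID="Tag5" TYPE="Person" LABEL="Reynolds"/>
    <NamedEntityTag ID="Tag10" TYPE="Location" LABEL="Baltimore"/>
  </Tags>
  <Layout>
    <String CONTENT="Reynolds" WC="0.95238096" TAGREFS="Tag5"/>
    <String CONTENT="in" WC="0.9"/>
    <String CONTENT="Baltimore" WC="0.82539684" TAGREFS="Tag10"/>
  </Layout>
</alto>
"""


def named_entities(alto_xml: str):
    """Return (word, entity type) pairs for every String carrying a TAGREFS."""
    root = ET.fromstring(alto_xml.strip())
    # Map tag IDs to their entity type.
    tag_types = {t.get("ID"): t.get("TYPE") for t in root.iter("NamedEntityTag")}
    result = []
    for string in root.iter("String"):
        # TAGREFS may hold several whitespace-separated IDs.
        for ref in string.get("TAGREFS", "").split():
            if ref in tag_types:
                result.append((string.get("CONTENT"), tag_types[ref]))
    return result


print(named_entities(ALTO_FRAGMENT))
```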
Annotation
• Evaluated BRAT, WebAnno, INL Attestation
• Reasons for selection of INL Attestation:
– Speed
– Support of ALTO format
– Support from INL available
Annotation stats
Language   # tokens   # PER   # LOC   # ORG
French      207,000   5,672   5,614   2,574
Dutch       182,483   4,492   4,448   1,160
German       96,735   7,914   6,143   2,784

Language   % tokens   % PER   % LOC   % ORG
French         100%   2.75%   2.71%   1.24%
Dutch          100%   2.46%   2.44%   0.64%
German         100%   8.18%   6.35%   2.88%

Language   Word-Error-Rate (Bag of Words)   Reading Order Success Rate
French     16.6%                            19.9%
Dutch      17.6%                            23.2%
German     15.9% / 21.9%                    13.6%
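The percentage table follows from the raw counts by dividing each entity count by the token count. A short check, using only the figures from the slide (the variable and function names are illustrative):

```python
# Counts copied from the annotation stats slide: (tokens, PER, LOC, ORG).
counts = {
    "French": (207_000, 5_672, 5_614, 2_574),
    "Dutch": (182_483, 4_492, 4_448, 1_160),
    "German": (96_735, 7_914, 6_143, 2_784),
}


def entity_shares(tokens: int, per: int, loc: int, org: int):
    """Entity counts as a percentage of all tokens, rounded to two decimals."""
    return tuple(round(100 * n / tokens, 2) for n in (per, loc, org))


shares = {lang: entity_shares(*row) for lang, row in counts.items()}
for lang, (per, loc, org) in shares.items():
    print(f"{lang}: PER {per}%  LOC {loc}%  ORG {org}%")
```

This reproduces the slide's percentages except for French PER, which comes out at 2.74% rather than 2.75%. One possible explanation is that the French token count of 207,000 is a rounded figure, but the slides do not say.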
Challenges
• Clear, comprehensive & common guidelines
for manual annotation
• OCR quality – on average 80% word accuracy
• Wide variation in historical spelling
• Mix of languages on a single page
• Lack/loss of metadata on page/word level
• Some data corruption occurred when ingesting
pre-tagged data into the annotation tool
Attempted workarounds
• Introduce OCR error patterns into training data
→ actually yields lower precision/recall
• Introduce a spelling variation module in the
NER classifier
→ rewrite rules (e.g. "frorn" → "from")
→ high integration effort
→ requires a reasonable number of rules
→ abandoned due to high complexity
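To make the rewrite-rule idea concrete, here is a small sketch of what such a rule could look like. It is not the abandoned module: only the "frorn" to "from" example comes from the slide, while the second rule, the tiny lexicon and the policy of accepting a rewrite only when the result is a known word are assumptions for illustration.

```python
from typing import List, Set, Tuple

# Hypothetical character-level rules (wrong, right). Only "rn" -> "m" is
# motivated by the slide's "frorn" example; "vv" -> "w" is made up.
RULES: List[Tuple[str, str]] = [("rn", "m"), ("vv", "w")]

# Hypothetical lexicon of known word forms.
LEXICON: Set[str] = {"from", "corn", "we"}


def normalise(token: str,
              rules: List[Tuple[str, str]] = RULES,
              lexicon: Set[str] = LEXICON) -> str:
    """Apply a rewrite rule only if it turns an unknown token into a known word."""
    if token.lower() in lexicon:
        return token  # already a known word, e.g. "corn" must not become "com"
    for wrong, right in rules:
        candidate = token.replace(wrong, right)
        if candidate != token and candidate.lower() in lexicon:
            return candidate
    return token  # no rule produced a known word; leave the token untouched


print(normalise("frorn"))       # from
print(normalise("corn"))        # corn
print(normalise("Baltirnore"))  # Baltirnore (not in the toy lexicon)
```

Even this toy version hints at the cost noted on the slide: each rule needs a guard against over-correction, and coverage depends on the size of both the rule set and the lexicon.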
Evaluation NL
Derived via 4-fold cross-validation (25 out of 100 annotated pages)
Evaluation FR
Derived via 4-fold cross-validation (25 out of 100 annotated pages)
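The evaluation setup for both languages can be sketched as below: 100 pages split into four folds of 25, each fold held out once, with precision, recall and F1 computed on the held-out pages. The slides do not say how pages were assigned to folds or whether scoring was per entity or per token, so the round-robin split and exact-match entity scoring here are assumptions.

```python
from typing import Iterator, List, Set, Tuple

Entity = Tuple[int, int, str]  # (start token, end token, type); illustrative


def train_test_splits(pages: List[int], k: int = 4) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) page lists; each fold is held out exactly once."""
    folds = [pages[i::k] for i in range(k)]  # round-robin assignment (assumed)
    for i in range(k):
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        yield train, folds[i]


def precision_recall_f1(gold: Set[Entity], predicted: Set[Entity]) -> Tuple[float, float, float]:
    """Exact-match scores: a prediction counts only if span and type both agree."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


splits = list(train_test_splits(list(range(100))))
print([(len(train), len(test)) for train, test in splits])

# Toy example: one of two predictions matches one of two gold entities.
gold = {(0, 2, "PER"), (3, 4, "LOC")}
predicted = {(0, 2, "PER"), (5, 6, "ORG")}
print(precision_recall_f1(gold, predicted))
```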
Use cases
• Improving search, information retrieval
– Within digital newspapers, a vast majority of
user queries are person and place names
• Linking of named entities to authority files
to create linked data
– The classification and disambiguation of named
entities allows the assignment of unique
identifiers from authoritative sources – thus
enabling cross-language/cross-collection linking
Next steps
• Volunteers wanted!
Help correct corpus and collaboratively create a
free dataset – instructions on GitHub wiki:
– github.com/EuropeanaNewspapers/ner-corpora/wiki/Corpus-cleanup
• Plans to improve performance:
– Add distributional similarity as feature (Clark 2003)
– Semantic generalisation (Faruqui & Padó 2010)
– Specialised gazetteers (e.g. list of historic place names)
– Data, data, data
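As an illustration of how a specialised gazetteer might feed into a classifier, the sketch below marks tokens that match a list of place names, preferring the longest match. The gazetteer entries and the example sentence are invented, and how such a feature would be wired into the NER package is not covered by the slides.

```python
from typing import Dict, List, Tuple

# Hypothetical gazetteer of historic place names, keyed by lower-cased tokens.
GAZETTEER: Dict[Tuple[str, ...], str] = {
    ("nieuw", "amsterdam"): "LOC",
    ("batavia",): "LOC",
}


def gazetteer_features(tokens: List[str],
                       gazetteer: Dict[Tuple[str, ...], str] = GAZETTEER,
                       max_len: int = 3) -> List[str]:
    """Return one feature per token: the gazetteer type if matched, else "O"."""
    features = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        # Try the longest candidate first so multi-word names win.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            entry = tuple(t.lower() for t in tokens[i:i + n])
            if entry in gazetteer:
                features[i:i + n] = [gazetteer[entry]] * n
                i += n
                break
        else:
            i += 1  # no entry starts at this token
    return features


tokens = ["Schip", "uit", "Nieuw", "Amsterdam", "naar", "Batavia"]
print(list(zip(tokens, gazetteer_features(tokens))))
```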
Open resources
• Europeana Newspapers NER dataset (CC0):
– github.com/EuropeanaNewspapers/ner-corpora
• Europeana Newspapers NER software (EUPL):
– github.com/EuropeanaNewspapers/europeananp-ner
– github.com/EuropeanaNewspapers/europeananp-dbpedia-disambiguation
• Annotated ALTO files:
– lab.kbresearch.nl/static/html/eunews.html
References
• C. Neudecker, W.J. Faber, L. Wilms, T. van Veen:
Large scale refinement of digital historical
newspapers with named entity recognition.
Proceedings of the IFLA Newspaper Section
Satellite Meeting, 2014, Geneva, Switzerland.
• Y. Mossalam, A. Abi-Haidar, J.G. Ganascia:
Unsupervised named entity recognition and
disambiguation: An application to old French
journals.
Advances in Data Mining. Applications and
Theoretical Aspects, Springer LNCS, 2014.
Thank you for your attention!
Questions?
Clemens Neudecker
Berlin State Library
@cneudecker
