Slicing and Dicing a Newspaper Corpus
for Historical Ecology Research
Marieke van Erp

Jesse de Does

Katrien Depuydt

Rob Lenders

Thomas van Goethem
Image source: https://kidsbiology.com/wp-content/uploads/2016/10/Martes-americana1141684271.jpg
SERPENS in a Nutshell
• Historical ecologists are starting to use
newspaper corpora for their research

• The abundance of data is both a blessing and a
curse 

• SERPENS aims to make the computer do the
‘boring’ work of filtering relevant articles from
irrelevant ones 

• Historical ecology researchers can then spend
more time on the ‘hard’ analyses

• Partners: 

• Funded by:
Why pest and nuisance species?
• Ambivalent relationship:

• Food, fur, totem

• Diseases, agricultural damage

• Relationships change over time 

• Exotic species, reintroductions, plagues

• Understanding the past helps us to
understand current ecological conditions

• Useful to policy makers, conservation
biologists, etc.
Muskrat Image source: http://www.virtualmuseum.ca/sgc-cms/expositions-exhibitions/faune_urbaine-urban_wildlife/medias/sheets/47.jpg
Why newspapers?
• Which species were considered “pest and
nuisance species”?

• Why were they considered as such?

• How did humans respond? 

Also more tangible information:

• Extermination methods, number of
incidents/sightings, statistics, fur prices
First hurdle: OCR
• The older the source, the harder it is to read 

• OCR errors may result in relevant
documents being missed and irrelevant
documents being retrieved

• We don’t try to ‘fix’ bad OCR but rank
documents by OCR quality through lexicon
overlap
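
A minimal sketch of ranking by lexicon overlap, assuming the score is simply the share of word tokens found in a reference word list. The slides do not give the exact formula, so the tokenisation, the function names and the toy Dutch lexicon below are all illustrative assumptions.

```python
import re

def lexicon_overlap(text, lexicon):
    """Fraction of alphabetic tokens in `text` that occur in `lexicon`."""
    # Letters only: digits introduced by OCR errors split a word into fragments.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    if not tokens:
        return 0.0
    known = sum(1 for tok in tokens if tok in lexicon)
    return known / len(tokens)

def rank_by_ocr_quality(documents, lexicon):
    """Return (doc_id, score) pairs sorted from highest to lowest overlap."""
    scored = [(doc_id, lexicon_overlap(text, lexicon))
              for doc_id, text in documents.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy lexicon and documents, invented for illustration only.
LEXICON = {"de", "wolf", "is", "gezien", "bij", "het", "bos"}
DOCS = {
    "clean": "De wolf is gezien bij het bos",
    "noisy": "Dc w0lf is gczien bij hct bos",
}
ranking = rank_by_ocr_quality(DOCS, LEXICON)
```

Documents below some overlap threshold could then be routed to the "Bad OCR" category instead of being corrected. Where to put that threshold is a design choice the slides leave open.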
Ambiguity
• Wolf: animal 

• Wolf: last name

• Wolf in sheep’s clothing

• …

• Context of the document needed to find the
right meaning
Experimental Setup
SERPENS Categories
• Natural history

• Nuisance, material damage

• Nuisance, immaterial damage

• Pest control

• Hunt for economic reasons

• Prevention 

• Accidents

• Figurative

• Other beast

• No beast

• Bad OCR
Training a new topic classifier
• Manually classified 9,940 documents

• Replace occurrences of animal names from
queries with “—ANIMAL—”

• 10-fold cross-validation

• Various experiments to measure the impact of
settings and dataset size

• Code available at: https://github.com/CLARIAH/serpens/
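
Two of the preprocessing steps above can be sketched in plain Python: masking the queried animal names and splitting the data into 10 folds. This is an illustration under assumptions, not the code from the SERPENS repository. The helper names `mask_animals` and `k_fold_indices` are hypothetical, and the whole-word, case-insensitive matching rule is a guess.

```python
import re

PLACEHOLDER = "—ANIMAL—"  # placeholder token as written on the slide

def mask_animals(text, animal_names):
    """Replace whole-word, case-insensitive matches of the query names."""
    if not animal_names:
        return text
    # Longest names first so multi-word names win over their parts.
    names = sorted(animal_names, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(n) for n in names) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(PLACEHOLDER, text)

def k_fold_indices(n_items, k=10):
    """Yield (train_indices, test_indices) for k near-equal contiguous folds."""
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_items))
        yield train, test
        start += size

masked = mask_animals("De wolf en de bunzing", ["wolf", "bunzing"])
folds = list(k_fold_indices(9940, k=10))
```

The masking cannot tell the animal from the surname "Wolf". Both become the placeholder, and the classifier has to separate them by context. In practice the documents would be shuffled before the contiguous split.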
Results for different algorithms
Zooming in (snippets)
Results per class, linear SVM (snippets)
Learning curves
• Total dataset consists of nearly 10,000
annotated examples 

• Learning curves are a measure of
performance vs training set size 

• Results converge rapidly: for the two-class
problem, ~1,000 examples already achieve
90% accuracy
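
How a learning curve is computed can be shown with a toy example: train on growing prefixes of the training set and score each model on the same held-out set. The word-count classifier, the labels `beast` and `no_beast`, and the four training sentences are invented for illustration. The slides report results for a linear SVM, which this sketch does not reproduce.

```python
from collections import Counter

def train(examples):
    """Toy bag-of-words model: word counts per label."""
    counts = {}
    for text, label in examples:
        counts.setdefault(label, Counter()).update(text.lower().split())
    return counts

def predict(model, text):
    """Pick the label whose training words overlap most with `text`."""
    words = text.lower().split()
    # sorted() makes tie-breaking deterministic (first label alphabetically).
    return max(sorted(model), key=lambda label: sum(model[label][w] for w in words))

def accuracy(model, examples):
    correct = sum(1 for text, label in examples if predict(model, text) == label)
    return correct / len(examples)

def learning_curve(train_set, test_set, sizes):
    """Return (training size, held-out accuracy) for each size in `sizes`."""
    return [(n, accuracy(train(train_set[:n]), test_set)) for n in sizes]

# Invented toy data; labels are hypothetical stand-ins for a two-class setup.
TRAIN = [
    ("polecat shot near the farm", "beast"),
    ("mr wolf elected mayor", "no_beast"),
    ("marten killed chickens", "beast"),
    ("concert by the wolf quartet", "no_beast"),
]
TEST = [
    ("farm chickens killed", "beast"),
    ("mayor attends concert", "no_beast"),
]
curve = learning_curve(TRAIN, TEST, sizes=[1, 2, 4])
```

Plotting accuracy against training size shows where the curve flattens, which is the point at which extra annotation stops paying off.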
Preliminary analysis
• Public perception of Mustelidae
(European polecat)

• Combination of distant and close
reading approaches

• Newspaper archives not
equally well digitised over time

• Trends in news may affect
reporting on animals
Lessons Learnt & Future Work
• Domain use cases often need specific
solutions 

• Document classification already very useful
to historical ecologists (probably also to
other domain experts)

• 1,000 annotated examples sufficient for
two-class classification 

• Extend to more species 

• Improve classification sub-categories 

• Add sentiment/opinions
Image source: https://upload.wikimedia.org/wikipedia/commons/1/1c/American_mink.jpg Mink
Shameless plug: 3rd Workshop on Humanities in the Semantic Web
image source: https://www.thesun.co.uk/wp-content/uploads/2017/07/nintchdbpict0001286085811.jpg
Questions?
