Text Mining Biodiversity 20160127

Text Mining Biodiversity
S. Ananiadou
E. Milios
W. Ulate

Partners
4/15/2016 Mining Biodiversity

Outline
1. Introduction
2. Creating a Term Inventory of Biodiversity
3. Interactive Visualization of Inventory
4. Creating a Text Mining Infrastructure for Biodiversity
5. Interactive Clustering of Search Engine results
6. OCR Error correction
7. Social media platform
8. Impact

What do we want to do?
(Diagram labels: Social Media, Visualisation, Semantic Metadata)
http://miningbiodiversity.org
Help transform BHL into a next-generation social digital
library through a multi-disciplinary approach that includes:
• Text Mining
• Machine learning
• History of Science
• Environmental History & Studies
• Library and Information Science
• Social Media

Creating the Term Inventory: why we need it
• A species can usually be referred to in multiple ways, e.g., by its
scientific name or by vernacular names
– Balaena mysticetus: bowhead whale, bowhead
– Spizella passerina: chipping sparrow
• Identify synonymous terms in biodiversity text
• Why? To go beyond keyword-based search!

Search Results Using Vernacular Names
• Searching for a vernacular name of “Balaena mysticetus” returns
different results from searching for the scientific name

Keyword-based Search: Ambiguity
A single keyword can have several meanings:
• Boxwood
– Historic place in Alabama?
– North American term for plants in the Buxaceae family?
• Box
– Container?
– Term for boxwood in other English-speaking countries?

Methods: Distributional Semantics
• Determine the meaning of terms and phrases by looking at the context
and the meaning of individual words
Example: context vector of the phrase “bowhead whale” (context word: weight)
balaena: 43.99, mysticetus: 39.99, alaska: 25.06, seals: 23.92,
distribution: 20.84, ringed: 19.86, catch: 19.52, quota: 17.91, …, murray: 5.62
Context vectors of the individual words “bowhead” and “whale”:
balaena: 43.99, alaska: 25.06, catch: 19.52, quota: 17.91, …
mysticetus: 39.99, seals: 23.92, distribution: 20.84, ringed: 19.52, …, murray: 5.62
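
The idea on this slide can be sketched in a few lines. The following is a minimal illustration, not the project's implementation: it assumes raw co-occurrence counts within a small window as weights (the slide's weights clearly come from some other weighting scheme) and assumes simple addition as the way to combine word vectors into a phrase vector. The function names and the toy sentences are hypothetical.

```python
from collections import Counter

def context_vector(term, sentences, window=2):
    """Count the words that occur within `window` tokens of `term`."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, token in enumerate(tokens):
            if token != term:
                continue
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            # Every token in the window except the term occurrence itself.
            counts.update(t for j, t in enumerate(tokens[lo:hi], start=lo) if j != i)
    return counts

def compose(*vectors):
    """Additive composition (an assumption): sum the word vectors."""
    total = Counter()
    for vector in vectors:
        total.update(vector)
    return total

# Toy corpus, invented for illustration only.
sentences = [
    "the bowhead whale balaena mysticetus lives near alaska",
    "catch quota for bowhead in alaska",
    "ringed seals and the whale",
]

bowhead = context_vector("bowhead", sentences)
whale = context_vector("whale", sentences)
phrase = compose(bowhead, whale)
print(phrase.most_common(3))
```

Each vector maps a context word to a weight, which mirrors the layout of the example above: one vector for the phrase and one for each of its words.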

Distributional semantics methods
Terms most similar to “balaena mysticetus”:
Term                  Score
balaena glacialis     0.7896
bowhead whale         0.7392
bowhead               0.7074
bowhead whales        0.6999
eubalaena glacialis   0.6905
minke whale           0.6864
humpback whale        0.6490
sperm whale           0.6440
finback whale         0.6322
sei whale             0.6287
eubalaena japonica    0.6065
brydes whale          0.6052
humpback whales       0.6000
finback whales        0.5998
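
A ranked list like this one can be produced by comparing the vector of a query term with the vectors of all other terms. The slides do not name the similarity measure; the sketch below assumes cosine similarity over sparse vectors stored as dictionaries. The vectors and the `nearest` helper are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors (dict: word -> weight)."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def nearest(query, vectors, k=3):
    """Return the k terms most similar to `query`, best first."""
    scored = [
        (term, cosine(vectors[query], vector))
        for term, vector in vectors.items()
        if term != query
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Toy vectors, invented for illustration only.
vectors = {
    "balaena mysticetus": {"arctic": 3.0, "whale": 2.0, "baleen": 1.0},
    "bowhead whale": {"arctic": 2.0, "whale": 3.0, "baleen": 1.0},
    "sperm whale": {"whale": 3.0, "squid": 2.0},
    "boxwood": {"shrub": 3.0, "hedge": 1.0},
}

for term, score in nearest("balaena mysticetus", vectors, k=2):
    print(f"{term} {score:.4f}")
```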

Experiments
• Training data: all English texts from the BHL
– About 26 million pages with a size of 49 GB
• Evaluation data: synonymous terms from the Catalogue of Life (CoL)
– Select 500 scientific names and their synonyms from the CoL
• Results at top-20

Category   Class      #terms in CoL   #terms in BHL   #average synonyms in CoL
Birds      Aves       1140            818             2.28
Mammals    Mammalia   1131            726             2.26
Plants     Plantae    1141            826             2.28

Category   Pre@20    Re@20
Birds      69.41%    63%
Mammals    62.12%    53.84%
Plants     56.17%    21.43%
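
For readers unfamiliar with the column names, Pre@20 and Re@20 stand for precision and recall over the top 20 ranked candidates. The slide does not define them further, so the sketch below uses the textbook definitions, which may differ from the exact variant behind the reported figures. The ranked list and gold synonyms are invented.

```python
def precision_recall_at_k(ranked, gold, k=20):
    """Textbook precision@k and recall@k for one query term.

    ranked: candidate synonyms, best first
    gold:   set of reference synonyms (e.g. taken from a catalogue)
    """
    top = ranked[:k]
    hits = sum(1 for term in top if term in gold)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall

# Hypothetical system output and reference synonyms for one scientific name.
ranked = ["bowhead whale", "minke whale", "bowhead", "sperm whale"]
gold = {"bowhead whale", "bowhead", "greenland right whale"}

precision, recall = precision_recall_at_k(ranked, gold, k=3)
print(f"P@3 = {precision:.2f}, R@3 = {recall:.2f}")
```

Averaging these values over all evaluated names gives one figure per category.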

3. Interactive visualization of term inventory

Video: TermInventoryVisualization

4. Creating a text mining infrastructure for biodiversity
• Argo (http://argo.nactem.ac.uk): a Web-based, graphical text mining (TM) workbench
• Straightforward integration of tools into modular, extensible,
reconfigurable and reusable workflows
(Image source: LEGO DUPLO)

Annotation Workflow for Biodiversity
1. Pre-processing
2. Dictionary lookup
3. Machine learning-based recognition
4. Relation extraction
5. Saving
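
The five stages above can be pictured as a chain of interchangeable components. The sketch below is only an illustration of that modular idea in plain Python; it does not use or imitate the workbench's actual components or API. The dictionary entries, the `occurs_in` relation type, and the suffix rule that stands in for a trained recogniser are all assumptions.

```python
import json
import re

# Hypothetical mini-dictionary; a real workflow would load a term inventory.
DICTIONARY = {
    "balaena mysticetus": "Species",
    "bowhead whale": "Species",
    "alaska": "Location",
}

def preprocess(text):
    """Pre-processing: wrap the raw text in a document record."""
    return {"text": text, "entities": [], "relations": []}

def dictionary_lookup(doc):
    """Dictionary lookup: annotate every occurrence of a known term."""
    lowered = doc["text"].lower()
    for term, label in DICTIONARY.items():
        for match in re.finditer(re.escape(term), lowered):
            doc["entities"].append({"start": match.start(), "end": match.end(),
                                    "label": label, "source": "dictionary"})
    return doc

def ml_recognition(doc):
    """Stand-in for a trained model: tag capitalised words ending in -idae."""
    for match in re.finditer(r"\b[A-Z][a-z]+idae\b", doc["text"]):
        doc["entities"].append({"start": match.start(), "end": match.end(),
                                "label": "Family", "source": "stand-in"})
    return doc

def relation_extraction(doc):
    """Pair each Species with each Location (hypothetical relation type)."""
    species = [e for e in doc["entities"] if e["label"] == "Species"]
    places = [e for e in doc["entities"] if e["label"] == "Location"]
    for s in species:
        for p in places:
            doc["relations"].append({"type": "occurs_in",
                                     "arg1": [s["start"], s["end"]],
                                     "arg2": [p["start"], p["end"]]})
    return doc

def save(doc):
    """Saving: serialise the annotated document to JSON."""
    return json.dumps(doc, indent=2)

WORKFLOW = [preprocess, dictionary_lookup, ml_recognition, relation_extraction, save]

def run_workflow(text, steps=WORKFLOW):
    """Feed the output of each step into the next one."""
    result = text
    for step in steps:
        result = step(result)
    return result

text = "The bowhead whale (Balaena mysticetus, family Balaenidae) is hunted in Alaska."
output = run_workflow(text)
print(output)
```

Because every step takes and returns the same document record, steps can be swapped, removed or reordered by editing the `WORKFLOW` list.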

5. Interactive clustering of search engine results
• Goal: to cluster BHL search engine results
• Input dataset: output of an “OR” query based on the following terms:
– Kangaroo
– Lion
– Rabbit
– Shark
• Only titles of books or articles are considered in clustering
• Interactive clustering based on the key terms of the titles
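
The slides do not describe the clustering algorithm. As a rough illustration of what key-term-based interactive clustering of titles could look like, the sketch below assigns each title to the cluster whose user-chosen key terms it shares most, and lets the user refine the result by adding a key term and clustering again. The titles, the cluster names, and the overlap rule are all assumptions.

```python
import re

STOPWORDS = {"the", "of", "and", "a", "in", "on", "for", "to", "its"}

def keyterms(title):
    """Key terms of a title: its lower-cased words minus stopwords."""
    return {t for t in re.findall(r"[a-z]+", title.lower()) if t not in STOPWORDS}

def cluster_titles(titles, cluster_terms):
    """Assign each title to the cluster sharing the most key terms with it."""
    clusters = {name: [] for name in cluster_terms}
    clusters["unassigned"] = []
    for title in titles:
        terms = keyterms(title)
        best, best_overlap = "unassigned", 0
        for name, seeds in cluster_terms.items():
            overlap = len(terms & seeds)
            if overlap > best_overlap:
                best, best_overlap = name, overlap
        clusters[best].append(title)
    return clusters

# Hypothetical titles returned by the OR query.
titles = [
    "The kangaroo and its kin",
    "Lion hunting in Africa",
    "Sharks of the Pacific",
    "The rabbit in Australia",
    "A monograph of the Macropodidae",
]

# Round 1: the user starts with the query terms as key terms.
cluster_terms = {
    "kangaroo": {"kangaroo"},
    "lion": {"lion"},
    "rabbit": {"rabbit"},
    "shark": {"shark", "sharks"},
}
first = cluster_titles(titles, cluster_terms)

# Round 2: the user adds a key term to a cluster and re-clusters.
cluster_terms["kangaroo"].add("macropodidae")
second = cluster_titles(titles, cluster_terms)
print(second["kangaroo"])
```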

6. OCR error correction
• Correct errors in natural language texts
– Spelling errors (e.g. the => teh)
– Grammar errors (e.g. this is => this are)
• Outline

OCR error correction
• Input
– Document
– Component selection (select components to use for processing)
• Correction candidates
– A list of candidates with confidence for each error
• Component structure
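
To make the input and output described above concrete, here is a minimal sketch: a document and a list of selected components go in, and a list of correction candidates with a confidence value comes out for each detected error. It is not the project's system. The single spelling component, the string-similarity confidence from Python's `difflib`, the cutoff, and the tiny lexicon are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher

def correction_candidates(error, lexicon, cutoff=0.6, limit=5):
    """Rank lexicon words by string similarity to the erroneous token."""
    scored = []
    for word in lexicon:
        confidence = SequenceMatcher(None, error, word).ratio()
        if confidence >= cutoff:
            scored.append((word, round(confidence, 3)))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]

def spelling_component(document, lexicon):
    """Treat every token missing from the lexicon as a possible error."""
    results = {}
    for token in document.split():
        word = token.strip(".,;:()").lower()
        if word and word not in lexicon:
            results[word] = correction_candidates(word, lexicon)
    return results

# Registry of available components; only a spelling component is sketched here.
COMPONENTS = {"spelling": spelling_component}

def correct(document, selected, lexicon):
    """Run the selected components and collect their candidate lists."""
    return {name: COMPONENTS[name](document, lexicon) for name in selected}

# Hypothetical lexicon and OCR output.
lexicon = ["biodiversity", "diversity", "biology", "library", "heritage"]
document = "Biodiverstiy Heritage Library"

report = correct(document, ["spelling"], lexicon)
for error, candidates in report["spelling"].items():
    print(error, candidates)
```

A grammar component could be registered in `COMPONENTS` in the same way and switched on through the `selected` list.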

Making Biodiversity Digital Objects More Social and Shareable
Follow us on Twitter: @SMLabTO

“My Tweeps” app
mytweeps.com
Helping BHL (and other organizations)
to get daily insights about their Twitter
followers (or Tweeps) and what they
are interested in.
We call it a "reverse" Twitter because
instead of seeing tweets from people
whom you follow, the app shows you
tweets from people who follow you.

We also partnered with Altmetric to better understand who shares BHL content
across various social media platforms, and why.

Enhanced Searching of BHL Content
• Faceted search
• Automatically generated questions
• Time-sensitive search

Enhanced Document Viewing
• Page in PDF/image format
• OCR-corrected text with colour-coded annotations

The Team
• NaCTeM
• Ryerson
• Dalhousie
• Missouri Botanical Garden
• Smithsonian Libraries (contract)
