The document discusses text mining techniques to analyze biodiversity documents from the Biodiversity Heritage Library (BHL). It outlines 8 areas: 1) creating a term inventory, 2) visualizing the inventory, 3) developing a text mining infrastructure, 4) interactive clustering of search results, 5) OCR error correction, 6) a social media platform, 7) the impact of these techniques, and 8) videos demonstrating some of the techniques. The overall goal is to transform BHL into a next-generation social digital library through multi-disciplinary approaches including text mining, machine learning, and social media.
3. Outline
1. Introduction
2. Creating a Term Inventory of Biodiversity
3. Interactive Visualization of Inventory
4. Creating a Text Mining Infrastructure for Biodiversity
5. Interactive Clustering of Search Engine results
6. OCR Error correction
7. Social media platform
8. Impact
5. Social Media
Visualisation
Semantic Metadata
What do we want to do?
4/15/2016 Mining Biodiversity
http://miningbiodiversity.org
Help transform BHL into a next-generation social digital
library through a multi-disciplinary approach that includes:
• Text Mining
• Machine learning
• History of Science
• Environmental History & Studies
• Library and Information Science
• Social Media
6. Creating the Term Inventory: why we need it
• A species name can usually be expressed in multiple ways, e.g., using
scientific names or vernacular names
– Balaena mysticetus: bowhead whale, bowhead
– Spizella passerina: chipping sparrow
• Identify synonymous terms in biodiversity text
• Why? To go beyond keyword-based search!
7. Search Results Using Vernacular Names
Vernacular name of “Balaena
mysticetus”
Different results!!
11. Experiments
• Training data: all English texts from the BHL
• about 26 million pages with a size of 49GB
• Evaluation data: synonymous terms from the Catalogue of Life
• Select 500 scientific names and their synonyms from the CoL
• Results at top-20
Category   Class      #terms in CoL   #terms in BHL   #average synonyms in CoL
Birds      Aves       1140            818             2.28
Mammals    Mammalia   1131            726             2.26
Plants     Plantae    1141            826             2.28
Category   Pre@20    Re@20
Birds      69.41%    63%
Mammals    62.12%    53.84%
Plants     56.17%    21.43%
17. 5. Interactive clustering of search engine results
• Goal: to cluster BHL search engine results
• Input dataset: output of an “Or” query based on the following terms:
1. Kangaroo
2. Lion
3. Rabbit
4. Shark
• Only titles of books or articles are considered in clustering
• Interactive clustering based on the keyterms of the titles
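The slide does not say which clustering algorithm is used. As a minimal sketch of the idea, the code below groups titles by the query keyterm they contain, which is the simplest possible stand-in for clustering on title keyterms. The function name `cluster_titles`, the plural handling, and the example titles are all assumptions made for illustration.

```python
import re
from collections import defaultdict

# The four terms of the "Or" query from the slide.
QUERY_TERMS = ["kangaroo", "lion", "rabbit", "shark"]

def cluster_titles(titles, keyterms=QUERY_TERMS):
    """Group titles by the keyterms they mention (hypothetical helper)."""
    clusters = defaultdict(list)
    for title in titles:
        words = set(re.findall(r"[a-z]+", title.lower()))
        # Match the keyterm itself or its simple plural form.
        matched = [t for t in keyterms if t in words or t + "s" in words]
        for term in matched:
            clusters[term].append(title)
        if not matched:
            clusters["other"].append(title)
    return dict(clusters)

# Invented titles, not actual BHL search results.
titles = [
    "The Kangaroo and its kin",
    "Sharks of the Pacific",
    "Lions and rabbits of fable",
    "A flora of Missouri",
]
clusters = cluster_titles(titles)
```

A title that mentions two keyterms lands in both groups, which is one way an interactive interface could let the user decide where it belongs.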
19. 6. OCR error correction
• Correct errors in natural language texts
• Spelling errors (e.g. the => teh)
• Grammar errors (e.g. this is => this are)
• Outline
20. OCR error correction
• Input
• Document
• Component selection (select components to use for processing)
• Correction candidates
• A list of candidates with confidence for each error
• Component structure
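To make "a list of candidates with confidence for each error" concrete, here is a minimal sketch. It assumes a plain word lexicon and uses string similarity from Python's standard `difflib` module as the confidence score. This is a swapped-in technique for illustration, not the project's correction components, and the lexicon and the misread word are invented.

```python
import difflib

def correction_candidates(word, lexicon, n=3, cutoff=0.7):
    """Return (candidate, confidence) pairs for a word missing from the lexicon."""
    if word in lexicon:
        return []  # known word: nothing to correct
    matches = difflib.get_close_matches(word, lexicon, n=n, cutoff=cutoff)
    # Use the similarity ratio (0..1) as a stand-in confidence value.
    return [
        (m, round(difflib.SequenceMatcher(None, word, m).ratio(), 2))
        for m in matches
    ]

lexicon = ["the", "whale", "sparrow", "species"]  # toy lexicon
cands = correction_candidates("whalc", lexicon)   # "e" misread as "c"
```

Each processing component could contribute its own candidates in this shape, so that the lists can be merged and ranked by confidence.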
24. “My Tweeps” app
mytweeps.com
Helping BHL (and other organizations)
to get daily insights about their Twitter
followers (or Tweeps) and what they
are interested in.
We call it a "reverse" Twitter because
instead of seeing tweets from people
whom you follow, the app shows you
tweets from people who follow you.
Follow us on Twitter: @SMLabTO
25. We also partnered with Altmetric to better understand who shares BHL content
across various social media platforms, and why
Shortcuts for fast forward of VLC videos: http://www.shortcutworld.com/en/win/VLC-Media-Player.html
Before starting, go to display settings and make the projector screen the main screen, so that videos pop up there and not on the laptop screen.
BHL is the data source
IMLS is the Funding Agency
Missouri Botanical Garden is the partner for the US
Smithsonian Libraries is a contractor (not sure if we should include it)
Sophia
Sophia
Sophia
Sophia
Evangelos
Evangelos
William (Anatoliy’s video has voice, so it is self-explanatory)
William
The Biodiversity Heritage Library (BHL) is a consortium of natural history and botanical libraries that cooperate to digitize and make accessible the legacy literature of biodiversity held in their collections and to make that literature available for open access and responsible use as a part of a global “biodiversity commons.” The BHL consortium works with the international taxonomic community, rights holders, and other interested parties to ensure that this biodiversity heritage is made available to a global audience through open access principles. In partnership with the Internet Archive and through local digitization efforts, the BHL has digitized more than 48 million pages of taxonomic literature, representing over 100,000 titles and over 170,000 volumes.
MiBIO will integrate TM tools within an interoperable platform to provide a semantic search system for the BHL, enhanced through clustering and visualisation capabilities.
MiBIO will also provide a social media environment, which will enable BHL users to discuss, link and share digital artifacts posted to social media sites linked to the BHL search portal. The outcome will be the transformation of the BHL from a Digital Library (DL) into a Social Digital Library (SDL). This will be achieved through the enrichment of its historical digital archives with semantic metadata generated by TM.
Furthermore, by leveraging existing social media sites and providing facilities for their integration with the BHL, we will engage a community of users to exploit the BHL as a forum for the exchange of ideas.
In a nutshell, we have incorporated into BHL three elements, as part of the Mining Biodiversity project: Visualisation, Social Media and Semantic Metadata.
Such variants may degrade the performance of a keyword-based search engine and, moreover, cause difficulties for non-expert users (users who are not familiar with scientific names).
To alleviate the problem of searching for variants, we have compiled a terminological inventory containing semantic variants of biodiversity terms (e.g., for mammals, birds and plants) by using distributional semantic methods.
Learn the representation vector of each term
Calculate the cosine similarity between two terms
Extract top-20 candidates of synonyms.
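The last two steps can be sketched as follows, assuming each term already has a representation vector stored as a sparse dictionary. The function names and the toy vectors are invented for illustration and are not BHL data.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def top_candidates(term, vectors, k=20):
    """Rank every other term by cosine similarity to `term`; keep the top k."""
    target = vectors[term]
    scored = [
        (other, cosine(target, vec))
        for other, vec in vectors.items()
        if other != term
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy vectors (hypothetical context counts).
vectors = {
    "balaena mysticetus": {"arctic": 3.0, "whale": 5.0, "ice": 2.0},
    "bowhead whale": {"arctic": 2.0, "whale": 4.0, "ice": 1.0},
    "spizella passerina": {"sparrow": 4.0, "nest": 2.0},
}
ranked = top_candidates("balaena mysticetus", vectors, k=2)
```

With these toy vectors, "bowhead whale" ranks first because it shares all of its context words with "balaena mysticetus", while "spizella passerina" shares none.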
And here is the search result when we use a common name of the previous term; it consists of only one document related to “bowhead whale”.
Evidently, the search engine returns a different result from the previous one …
Another problem with keyword-based search, as mentioned above, is ambiguity.
If one searches for “Boxwood”, a keyword-based system wouldn’t know whether they were referring to a place in Alabama or to the North American term for plants in the Buxaceae family. It will just return all documents pertaining to both.
Nor will it know whether a query for “Box” refers to the same plant family (apparently this is how other English-speaking countries refer to it) or to a container.
We then implemented two distributional semantic models. The first one is a count-based model that determines the …
For example, within a 7-word window, this is the context vector of “bowhead whale” -- SA rubbish frequency
In this manner, for each name, we generate a list of names ranked by similarity.
For “balaena mysticetus”, for example, we obtained the following list.
Determine the meaning of a term by considering all lexical units occurring within an N-word window.
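A count-based context vector of this kind can be sketched as below. The sketch assumes the window extends N tokens on each side of the term (the notes do not say whether the window is one-sided or two-sided), and the function name and the example sentence are invented.

```python
from collections import Counter

def context_vector(tokens, term, window=7):
    """Count words within `window` tokens on either side of each occurrence of `term`."""
    term_tokens = term.split()
    n = len(term_tokens)
    counts = Counter()
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == term_tokens:
            left = tokens[max(0, i - window):i]
            right = tokens[i + n:i + n + window]
            counts.update(left + right)
    return counts

# Invented example text, already lower-cased and tokenised.
tokens = (
    "the bowhead whale lives in arctic waters and "
    "the bowhead whale feeds near sea ice"
).split()
vec = context_vector(tokens, "bowhead whale", window=2)
```

Here a 2-word window is used so the result is easy to check by hand: "the" is counted twice, once before each occurrence of the term.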
We have conducted our experiments on the Biodiversity Heritage Library (BHL) corpus. The corpus size is about 49 GB.
We have created a gold-standard dataset of synonymous terms based on the Catalogue of Life. For each scientific name, we extract the corresponding common names and synonyms.
We then randomly picked 500 species whose class is Aves. As a result, we obtained about 1,100 bird names (both vernacular and scientific), of which about 800 occur in the BHL corpus.
According to CoL, the average number of synonyms for each scientific name is about 2.
We did the same process with mammal and plant names.
The following are the precision and recall scores at top-20.
Among the three categories, the performance of bird names is the best.
For plant names, the lower performance can be explained by the fact that, unlike for mammals and birds, most synonyms of plant names are themselves scientific names, which are more difficult to detect than vernacular names.
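For reference, here is a minimal sketch of the textbook precision and recall at a cut-off k for a single query term. The notes do not spell out how Pre@20 and Re@20 were computed, so the authors' exact definition may differ. The candidate list and gold set below are invented.

```python
def precision_recall_at_k(candidates, gold, k=20):
    """Textbook precision@k and recall@k for one ranked candidate list."""
    top = candidates[:k]
    hits = sum(1 for c in top if c in gold)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall

# Hypothetical ranked candidates and gold synonyms for one scientific name.
candidates = ["bowhead whale", "right whale", "bowhead", "narwhal"]
gold = {"bowhead whale", "bowhead", "greenland whale"}
p, r = precision_recall_at_k(candidates, gold, k=4)
```

In this toy case two of the four candidates are gold synonyms, and two of the three gold synonyms are recovered. Scores for a whole category would then be averaged over all of its query terms.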
Shift+Arrow Right/Arrow Left: Jump 3 seconds forward/backward
Alt+Arrow Right/Arrow Left: Jump 10 seconds forward/backward
Ctrl+Arrow Right/Arrow Left: Jump 1 minute forward/backward
- Frequency of species names can be visually explored, or queried by a search interface
- Clicking on a species name acts as a query to retrieve its top-20 semantically related species.
-- Their semantically related score can be inspected
-- A blue color denotes that the species names appear as synonym in the CoL
- Interactive visualizations were constructed for mammals, plants and birds
[and in case somebody asks:]
- Images, which were crawled from external open sources, may help assess visually species' relatedness based on their visible features.
Species names are shown in bubbles
Larger bubbles denote species more frequently mentioned in the biodiversity literature
Upon interaction (semantically) related species can be inspected
Color opacity indicates degree of relatedness
Blue color indicates that species also appear as synonyms in CoL
Images are retrieved from open data collections (e.g. Wikipedia)
Web-based application: No installation; Access with a web browser
Multi-user system: Remote collaborative annotation
Supports the Unstructured Information Management Architecture (UIMA), cloud and high-performance computing
This is the workflow that we put together using Argo.
Without going too much into detail, I will just point out the general types of processing it tries to do: pre-processing (sentence splitting, tokenisation and part-of-speech tagging), matching against dictionaries or controlled vocabularies such as the ENVO and PATO ontologies, machine learning-based recognition of entities, extraction of relations based on the results of dependency parsing, and serialisation of the generated annotations.
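To give a feel for the shape of such a workflow, here is a toy sketch in plain Python covering only sentence splitting, tokenisation, dictionary matching and serialisation. It does not use Argo or UIMA. The vocabulary entries and labels are hypothetical and not taken from ENVO or PATO, and the machine learning and dependency parsing stages are left out.

```python
import json
import re

# Tiny stand-in vocabulary (hypothetical entries and labels).
VOCABULARY = {"forest": "HABITAT", "river": "HABITAT", "large": "QUALITY"}

def split_sentences(text):
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenise(sentence):
    """Naive tokeniser: runs of word characters."""
    return re.findall(r"\w+", sentence)

def annotate(text):
    """Match tokens against the vocabulary and record one annotation per hit."""
    annotations = []
    for index, sentence in enumerate(split_sentences(text)):
        for token in tokenise(sentence):
            label = VOCABULARY.get(token.lower())
            if label:
                annotations.append(
                    {"sentence": index, "text": token, "label": label}
                )
    return annotations

def serialise(annotations):
    """Write the annotations out as JSON."""
    return json.dumps(annotations, indent=2)

result = annotate("The large whale left. It swam up the River.")
output = serialise(result)
```

Each function plays the role of one workflow component: the output of one stage is the input of the next, and the final stage writes the annotations to a portable format.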