A talk given at the Geological Society of London, UK on 2016/03/09 as part of the Lyell meeting on Palaeoinformatics. http://www.geolsoc.org.uk/lyell16 #lyell16
2. About Me
Currently a Postdoc at plantsci.cam.ac.uk
a Fellow of the software.ac.uk/fellows ('Class of 2016')
a researcher with contentmine.org
3. About This Talk (A little warning!)
● Don't expect to see much biology in this talk
● I'm going to talk about informatics
● I will focus on context, background and methods,
rather than 'results' per se
● There will be more questions than answers :)
5. New
Open Data
Easy-to-use
Quick
Images
Audio
Interactive Maps
Citable
API access
Open Source
Infrastructure
It’s not KE Emu :)
6. What I want to do:
link specimen records to their mentions in the literature
“Micro-computed tomography scan slice through four bat skulls, displaying the relative position of
the three semicircular canals within the skull. Scans are from the following species: (A) Pteropus
rodricensis (BMNH.76.3.15.14); …”
NHM Data Portal Link (Stable, Unique Identifier)
http://data.nhm.ac.uk/specimen/69e97f52-0275-4a82-9fa6-cf1c3949f408
Article DOI (Stable, Unique Identifier)
http://dx.doi.org/10.1371/journal.pone.0061998
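A minimal sketch of the linking idea, using the caption and DOI quoted above. The regular expression, the function name `find_specimen_mentions` and the record layout are illustrative assumptions, not the pipeline actually used in this project; real specimen citations vary far more than this pattern allows.

```python
import re

# Hypothetical pattern: 'BMNH', an optional dot or space, then a dotted catalogue number.
SPECIMEN_RE = re.compile(r"\bBMNH[.\s]?(\d+(?:\.\d+)+)")

def find_specimen_mentions(text, doi):
    """Return one record per specimen code found in text, tagged with the article DOI."""
    return [
        {"specimen": "BMNH " + m.group(1), "doi": doi}
        for m in SPECIMEN_RE.finditer(text)
    ]

caption = (
    "Scans are from the following species: "
    "(A) Pteropus rodricensis (BMNH.76.3.15.14); ..."
)
mentions = find_specimen_mentions(caption, "10.1371/journal.pone.0061998")
print(mentions)
```

Each record pairs a specimen code with a stable article identifier; matching the code to a Data Portal record is a separate step that this sketch does not attempt.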
7. 114,000,000
scholarly papers available online
36,000,000 of which are
‘Biology’ / ‘Environmental Studies’ / ‘Geosciences’ / ‘Multidisciplinary’
Khabsa, M. and Giles, C. L. 2014. The number of scholarly documents on the public web. PLoS ONE
8. Sadly, the vast majority of papers are only ‘available’ online to paying subscribers and no
institution in the world has access to everything. Not even close to everything!
9. In 2016, libraries pay subscriptions, and individuals pay per-article fees,
to access even out-of-copyright works??
http://outofcopyright.eu/rights-after-digitisation/
11. This is what a PDF looks like
PDF is NOT a
good method
of exchanging
information
12. HTML is better, but lacks standardisation
+ italics & bold preserved, semantic links to figures & tables
- lacks standardisation
13. The industry standard format for
scholarly articles is JATS XML
● The Journal Article Tag Suite (JATS)
is an application of NISO Z39.96-2015, which defines a set of XML elements and
attributes for tagging journal articles
● Standardising the format of digital scholarly publications is HIGHLY desirable
e.g. for this project, knowing whether the string 'NHM' occurs in the Materials section, rather
than the Acknowledgements section, is hugely helpful.
Much harder to do with PDF/HTML.
Section-based search already implemented in EuropePMC!
→ Section level search functionality in Europe PMC. Kafkas et al (2015) J Biomed Semantics
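A rough sketch of why tagged sections help, using a tiny hand-written JATS-like fragment. The sample XML and the function name `sections_mentioning` are assumptions made for illustration; this is not how Europe PMC implements its section-level search.

```python
import xml.etree.ElementTree as ET

# Tiny hand-written JATS-like fragment, for illustration only.
JATS_SAMPLE = """
<article>
  <body>
    <sec sec-type="materials|methods">
      <title>Materials and Methods</title>
      <p>Specimens were borrowed from the NHM, London.</p>
    </sec>
  </body>
  <back>
    <ack>
      <title>Acknowledgements</title>
      <p>We thank the NHM Library staff.</p>
    </ack>
  </back>
</article>
"""

def sections_mentioning(xml_text, needle):
    """Return the titles of <sec> and <ack> elements whose text contains needle."""
    root = ET.fromstring(xml_text)
    hits = []
    for elem in root.iter():
        if elem.tag in ("sec", "ack"):
            text = "".join(elem.itertext())
            if needle in text:
                hits.append(elem.findtext("title", default="(untitled)"))
    return hits

print(sections_mentioning(JATS_SAMPLE, "NHM"))
```

Nested sections would be reported once per enclosing `<sec>`, so a real implementation would need to decide which level of the hierarchy to report.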
14. A plea for full text XML
A minority of journals do not provide full text XML
✓PLOS, eLife, PeerJ, Pensoft, Wiley, Elsevier, Springer,
NPG, Ubiquity Press, Copernicus, Hindawi, MDPI
✘ Geological Society of London Publications,
Magnolia Press, a long tail of smaller publishers
16. Image credit: Ubiquity Press
http://ubiquitypress.tumblr.com/post/96012592921/the-right-to-read-is-the-right-to-mine
UK Copyright Law has changed recently, giving a specific copyright exemption
for non-commercial text and data mining work
17. A complicated, fragmented landscape of relevant journals
Nature + Science + PNAS + Phytotaxa + Zootaxa
BioOne Journals (131)
Springer Journals (32)
Wiley Journals (22)
Taylor & Francis Journals (14)
Elsevier Journals (12)
Oxford University Press Journals (8)
SciELO Journals (7) [Open Access but not in PMC]
Ecological Society of America Journals (6)
Geological Society Journals (4)
CSIRO Journals (4)
Cambridge University Press Journals (3)
Royal Society Journals (2)
Journal-omics!
18. I discover 'new' journals every week
e.g. last week I 'found' Oryctos (published between 1998 and 2010), still behind a
paywall. Does anyone have access to this journal? Please let me know.
http://www.dinosauria.org/oryctos.php
How are we meant to achieve a comprehensive
aggregation of research literature (to do rigorous science,
inclusive of all the evidence) when it is so unhelpfully
scattered and we don't even know where it all is?
20. I don’t just find in-text mentions.
I’m trying to match them up to our
NHM Data Portal records too!
Specimens in RED do not appear
to be on the Data Portal ...yet
Blue globe represents a PLOS ONE paper
21. Searching ALL full texts is
not enough!!!
A significant number of specimens are
probably ‘hiding-out’ in supplementary
data files of all sorts of formats.
Google Scholar does not index SI
Web of Science doesn’t either
Nor does Scopus
At scale, journal-held supplementary
data files are the ‘darkest corners’ of
science
“Specimens were deposited in the collections of the California Academy of Sciences' Department of
Herpetology (CAS), the British Museum of Natural History (BMNH) and of author GJM (Table S1)”
10.1371/journal.pone.0104628 http://rossmounce.co.uk/2015/06/20/deep-indexing-supplementary-data-files/
22. Why write such descriptive papers in natural
language? Keep data as data!
The above was published in 2013(!)
23. Almost nothing in Nature & Science ‘full (short) text’
Context: 15 years' worth of full text research in Nature & Science examined
Science: only 11 NHM specimens found in 39,600 full texts.
Nature: similar story. <30 specimens in 14,132 full texts.
Clearly there are more,
but it’s all buried in supplementary materials :(
24. Blue globe represents a PLOS ONE paper
Very few specimens occur in more than one paper
Can you guess what BMNH 37001 is?
Hint: it’s a very famous specimen!
Grey represents an NHMUK specimen
25. Huge variation in how specimens are cited (not helpful!)
PI AZ 8459 TEXSpruce6067
BM000922891 NYRaz054
BMNH(E)609062 MSB00509
Belize_CW_All_1071 F1629082
BM-BRIT-EURO 3948 OR.5379
“BMNH” is not necessarily British Museum of Natural History (UK).
Can also be Beijing Museum of Natural History (CN) or Bell Museum of Natural History (US)
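A small sketch of the ambiguity problem, assuming a hand-made lookup table that holds only the three expansions of "BMNH" listed above. The names `ACRONYM_CANDIDATES`, `split_citation` and `candidate_institutions` are hypothetical; a real lookup would need a full registry of collection codes.

```python
# Candidate expansions taken from the slide above; a real table would be far larger.
ACRONYM_CANDIDATES = {
    "BMNH": [
        "British Museum of Natural History (UK)",
        "Beijing Museum of Natural History (CN)",
        "Bell Museum of Natural History (US)",
    ],
}

def split_citation(citation):
    """Split a specimen citation into (leading letters, remainder)."""
    i = 0
    while i < len(citation) and citation[i].isalpha():
        i += 1
    return citation[:i].upper(), citation[i:].strip()

def candidate_institutions(citation):
    """List every institution the leading acronym could stand for."""
    prefix, _ = split_citation(citation)
    return ACRONYM_CANDIDATES.get(prefix, [])

for code in ["BMNH(E)609062", "BM000922891", "BMNH 37001"]:
    print(code, "->", candidate_institutions(code))
```

More than one candidate means the acronym alone cannot identify the collection, so other context (author affiliations, taxon, locality) would be needed to resolve it.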
26. Where possible use standard/permanent identifiers
Want to discuss a particular collection? Use the official GRSciColl identifier
The Global Registry of Scientific Collections (GRSciColl)
http://grscicoll.org/
Which for the Natural History Museum, London (UK) is: NHMUK
http://biocol.org/urn:lsid:biocol.org:col:34665
Want to cite the BM Archaeopteryx specimen?
NHMUK PV OR 37001
http://data.nhm.ac.uk/object/57ee3bf1-0a74-4ae4-a588-ba9ea8dc5265
27. Credit: Davies KTJ, Bates PJJ, Maryanto I, Cotton JA, Rossiter SJ (2013) The
Evolution of Bat Vestibular Systems in the Face of Potential Antagonistic Selection
Pressures for Flight and Echolocation. PLoS ONE 8(4): e61998.
doi:10.1371/journal.pone.0061998
Openly-licensed data on specimens, published elsewhere, could
be re-incorporated back into the online museum catalogue. A
one-stop shop for information.
Beyond-linking:
repatriation of knowledge
This is a CT-scan of “BMNH 76.3.15.14”.
Without mining, I wouldn’t know this data exists.
Perhaps it could also be made available on the portal?
http://data.nhm.ac.uk/specimen/69e97f52-0275-4a82-9fa6-cf1c3949f408
28. Does published info make it back ‘home’ to the collections?
BMNH 2013.2.13.3 is listed on the portal as “Petrochromis nov.sp. Takahashi”
I found it (by text mining) here: http://dx.doi.org/10.1007/s10228-014-0396-9
It’s now called Petrochromis horii n. sp., according to the paper.
What mechanisms are there to feed newer information back into the collection?
Content mining could definitely help keep collections data up-to-date!
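One way this could work, sketched with the single example from this slide. The dictionary layouts and the function name `suggest_name_updates` are assumptions for illustration; this is not an existing Data Portal mechanism, and any suggestion would still need review by a curator.

```python
def suggest_name_updates(portal_names, literature_mentions):
    """Flag specimens whose name in the literature differs from the portal record."""
    suggestions = []
    for mention in literature_mentions:
        current = portal_names.get(mention["specimen"])
        if current is not None and current != mention["name"]:
            suggestions.append({
                "specimen": mention["specimen"],
                "portal_name": current,
                "literature_name": mention["name"],
                "source_doi": mention["doi"],
            })
    return suggestions

# Portal record and mined mention, both taken from the example on this slide.
portal_names = {"BMNH 2013.2.13.3": "Petrochromis nov.sp. Takahashi"}
literature_mentions = [
    {
        "specimen": "BMNH 2013.2.13.3",
        "name": "Petrochromis horii",
        "doi": "10.1007/s10228-014-0396-9",
    }
]

for s in suggest_name_updates(portal_names, literature_mentions):
    print(s)
```

Keeping the source DOI with each suggestion lets whoever reviews it check the paper before changing the catalogue.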
29. Can we create a (better) digital NHM metadata catalogue
entirely from the literature, hundreds of years before the NHM
themselves complete their own digitisation programme?
Given funding and time, perhaps…
30. Acknowledgements
Sincere thanks to:
Aime Rankin for help with the project
The NHM Library staff, particularly Sarah Vincent for actively supporting my content mining
Nancy Chillingsworth (IPR, NHM London)
Mark Wilkinson (Life Sciences, NHM London)
Peter Murray-Rust & the ContentMine team
Vince Smith (Life Sciences, NHM London)
Ben Scott (NHM Data Portal Lead Architect)
Rod Page (University of Glasgow)
All of the Biodiversity Informatics team
http://contentmine.org/
For a more detailed version of this talk on
YouTube see: bit.ly/nhmlink