Museum Impact
Linking-up our specimens with
research published on them
Dr Ross Mounce
@rmounce
Talk Structure
● Background: the collections, the research literature
● Interesting things you should know about access to research
○ The costs of knowledge $$$
● Examples of content mining
○ Including a video demo!
● My work (in progress) on finding NHM specimens in recent literature
Source: http://www.nhm.ac.uk/our-science/collections.html © The Trustees of the Natural History Museum, London
● New
● Open Data
● Easy-to-use
● Quick
● Images
● Audio
● Interactive Maps
● Citable
● API access
● Open Source Infrastructure
It’s not KE EMu :)
What I want to do:
link specimen records to their mentions in the literature
“Micro-computed tomography scan slice through four bat skulls, displaying the relative
position of the three semicircular canals within the skull. Scans are from the following
species: (A) Pteropus rodricensis (BMNH.76.3.15.14); …”
NHM Data Portal Link (Stable, Unique Identifier)
http://data.nhm.ac.uk/specimen/69e97f52-0275-4a82-9fa6-cf1c3949f408
Article DOI (Stable, Unique Identifier)
http://dx.doi.org/10.1371/journal.pone.0061998
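One way to sketch the linking step: pull catalogue-number-like strings out of a passage of text and pair each one with the DOI of the article it came from. The pattern below is an assumption that covers only the number styles shown in these slides (e.g. `BMNH.76.3.15.14`, `BMNH 37001`). `SPECIMEN_RE`, `find_specimens` and `link_mentions` are hypothetical names, not the code actually used for this project.

```python
import re

# Assumed pattern: institution code (BMNH or NHMUK), an optional space or dot,
# then a dotted catalogue number. Real citations vary far more than this.
SPECIMEN_RE = re.compile(r"\b(BMNH|NHMUK)[ .]?(\d+(?:\.\d+)*)")


def find_specimens(text):
    """Return normalised specimen codes (e.g. 'BMNH 76.3.15.14') in order of appearance."""
    return [f"{code} {number}" for code, number in SPECIMEN_RE.findall(text)]


def link_mentions(text, doi):
    """Pair every specimen mention found in `text` with the article DOI."""
    return [{"specimen": s, "doi": doi} for s in find_specimens(text)]


caption = (
    "Scans are from the following species: "
    "(A) Pteropus rodricensis (BMNH.76.3.15.14); ..."
)
links = link_mentions(caption, "10.1371/journal.pone.0061998")
print(links)
```

Normalising the separator (dot or space) to a single space is a design choice made here so that `BMNH.76.3.15.14` and `BMNH 76.3.15.14` compare as equal; matching against Data Portal records would still need a lookup step that is not shown.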
114,000,000
scholarly papers available online
36,000,000 of which are
‘Biology’ / ‘Environmental Studies’ / ‘Geosciences’ / ‘Multidisciplinary’
Khabsa, M. and Giles, C. L. 2014. The number of scholarly documents on the public web. PLoS ONE
Sadly, the vast majority of papers are only ‘available’ online to paying subscribers and
no institution in the world has access to everything. Not even close to everything!
Cheryl Hall (2014) FOI request https://www.whatdotheyknow.com/request/academic_journal_subscription_co
We rent access to knowledge. Companies profiteer from it
NHM journal subscription spend, by financial year:
2004/05 £357,197.79
2005/06 £383,214.29
2006/07 £340,690.33
2007/08 £381,526.57
2008/09 £441,706.36
2009/10 £437,539.71
2010/11 £430,105.08
2011/12 £449,515.12
2012/13 £469,007.50
2013/14 £494,913.01
10-year-total: £4,185,415.76
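As a quick check of the arithmetic, the ten annual figures above can be summed exactly with decimal arithmetic. This is only a sketch: the variable names are illustrative, and the percentage change between the first and last year is derived here rather than taken from the slides.

```python
from decimal import Decimal

# Annual figures as listed above (GBP), keyed by financial year.
spend = {
    "2004/05": Decimal("357197.79"),
    "2005/06": Decimal("383214.29"),
    "2006/07": Decimal("340690.33"),
    "2007/08": Decimal("381526.57"),
    "2008/09": Decimal("441706.36"),
    "2009/10": Decimal("437539.71"),
    "2010/11": Decimal("430105.08"),
    "2011/12": Decimal("449515.12"),
    "2012/13": Decimal("469007.50"),
    "2013/14": Decimal("494913.01"),
}

# Decimal avoids the rounding drift that binary floats can introduce with pence.
total = sum(spend.values(), Decimal("0"))

# Relative change from the first listed year to the last, in percent.
rise = (spend["2013/14"] / spend["2004/05"] - 1) * 100

print(f"10-year total: £{total:,.2f}")
print(f"Change 2004/05 to 2013/14: {rise:.1f}%")
```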
Tax Year Revenue Profit Profit Margin
2004 £1363m £460m 33.75%
2005 £1436m £449m 31.25%
2006 £1521m £465m 30.57%
2007 £1507m £477m 31.65%
2008 £1700m £568m 33.41%
2009 £1985m £693m 34.91%
2010 £2026m £724m 35.74%
2011 £2058m £768m 37.30%
2012 £2063m £780m 37.81%
2013 £2126m £826m 38.85%
Source: RELX Group (Parent company of Elsevier) Company Reports
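The margin column is simply profit divided by revenue. The sketch below recomputes it from the rounded £m figures in the table; because those inputs are rounded, the recomputed values can differ from the stated ones by a few hundredths of a percentage point (2005 and 2011 in particular). `margin_pct` is an illustrative helper name.

```python
# (year, revenue in £m, profit in £m, margin % as printed in the table)
rows = [
    (2004, 1363, 460, 33.75),
    (2005, 1436, 449, 31.25),
    (2006, 1521, 465, 30.57),
    (2007, 1507, 477, 31.65),
    (2008, 1700, 568, 33.41),
    (2009, 1985, 693, 34.91),
    (2010, 2026, 724, 35.74),
    (2011, 2058, 768, 37.30),
    (2012, 2063, 780, 37.81),
    (2013, 2126, 826, 38.85),
]


def margin_pct(revenue, profit):
    """Profit margin as a percentage of revenue."""
    return 100 * profit / revenue


for year, revenue, profit, stated in rows:
    computed = margin_pct(revenue, profit)
    print(f"{year}: computed {computed:.2f}%  stated {stated:.2f}%")
```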
Actually, the NHM’s annual bill isn’t bad compared to others
Source: Lawson S and Meghreblian B. (2015) Journal subscription
expenditure of UK higher education institutions. F1000Research
http://shiny.retr0.me/journal_costs/
Content Mining provides more bang for your buck
Making fuller use of our expensively provisioned access
● If the NHM is going to pay £500,000 per year to rent journals, why not use the
access to this resource to its fullest?
● I can’t read everything with my human eyes but…
computers can!
● If you can process one document with a computer,
you can process a million: content mining
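The "one document, then a million" idea can be sketched as a function that handles a single file, plus a loop that applies it to every file in a folder. Everything here is illustrative: the folder layout, the `.xml` extension, the `BMNH` search term, the made-up demo files and the function names are all assumptions, not the tooling used in this project.

```python
import re
import tempfile
from pathlib import Path

TERM = re.compile(r"\bBMNH\b")  # assumed search term, for illustration only


def count_mentions(path):
    """Process ONE document: count occurrences of the term in a text file."""
    text = Path(path).read_text(encoding="utf-8", errors="replace")
    return len(TERM.findall(text))


def mine_folder(folder, pattern="*.xml"):
    """Process MANY documents: the same function applied to every matching file."""
    hits = {}
    for path in sorted(Path(folder).rglob(pattern)):
        n = count_mentions(path)
        if n:
            hits[path.name] = n
    return hits


# Tiny self-contained demo using two made-up files in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.xml").write_text(
        "<p>Specimens BMNH 1.1.1.1 and BMNH 2.2.2.2 were examined.</p>",
        encoding="utf-8",
    )
    Path(tmp, "b.xml").write_text("<p>No museum specimens cited.</p>", encoding="utf-8")
    demo_hits = mine_folder(tmp)

print(demo_hits)
```

The per-document function knows nothing about the corpus, so scaling up is only a matter of pointing `mine_folder` at a bigger directory.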
Recent examples of Content Mining
Fig. 6 from the paper: brachiopod body-size estimates
Red line: humans
Grey bars: machines (PaleoDeepDive)
Better than PaleoDB?
I think so. PaleoDeepDive (PDD) is more clearly linked to evidence than PaleoDB (PDB).
Provenance matters.
Recent examples of Content Mining (Images)
3-second
image analysis
source: 10.1099/ijs.0.65212-0
(Zymobacter_palmae:261,((((Chromohalobacter_canadensis:42,
(Chromohalobacter_sarecensis:96,Chromohalobacter_nigrandesensls:154):41):80,
(Chromohalobacter_marismortui:125,Chromohalobacter_beijerinckii:103):164):61,
(Chromohalobacter_israelensis:11,Chromohalobacter_salexigens:11):92):293,
((Halomonas_halodurans:328,(Halomonas_ventosae:100,(Halomonas_pacifica:116,
(Halomonas_halophila:223,(Halomonas_eurihalina:27,Halomonas_elongate:58):236):
79):41):46):72,(Halomonas_desiderata:187,(Halomonas_pantelleriensis:173,
Halomonas_muralis:190):70):30):110):187);
Outputs re-usable Newick & NeXML; no manual input required.
Can replot data, re-analyse, or combine many to make a supertree!
PLUTo Project
Mounce, Murray-Rust, Wills (in prep.)
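To illustrate what "re-usable" means in practice, here is a minimal sketch that reads the Newick string shown above and lists its taxa. It relies on a regex shortcut that works for this tree because internal nodes are unlabelled; a proper Newick parser would be needed in general. Taxon names are kept exactly as extracted, and `leaf_names` and `is_balanced` are illustrative helper names.

```python
import re
from collections import Counter

# The Newick string from the slide, rejoined onto one line (names as extracted).
newick = (
    "(Zymobacter_palmae:261,((((Chromohalobacter_canadensis:42,"
    "(Chromohalobacter_sarecensis:96,Chromohalobacter_nigrandesensls:154):41):80,"
    "(Chromohalobacter_marismortui:125,Chromohalobacter_beijerinckii:103):164):61,"
    "(Chromohalobacter_israelensis:11,Chromohalobacter_salexigens:11):92):293,"
    "((Halomonas_halodurans:328,(Halomonas_ventosae:100,(Halomonas_pacifica:116,"
    "(Halomonas_halophila:223,(Halomonas_eurihalina:27,Halomonas_elongate:58):236):"
    "79):41):46):72,(Halomonas_desiderata:187,(Halomonas_pantelleriensis:173,"
    "Halomonas_muralis:190):70):30):110):187);"
)


def leaf_names(tree):
    """Leaf labels are the names that directly follow '(' or ','."""
    return re.findall(r"[(,]\s*([A-Za-z_][^():,;\s]*)", tree)


def is_balanced(tree):
    """Crude sanity check: equal numbers of opening and closing parentheses."""
    return tree.count("(") == tree.count(")")


leaves = leaf_names(newick)
genera = Counter(name.split("_")[0] for name in leaves)

print(len(leaves), "taxa")
print(dict(genera))
```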
How to get a sufficient volume of journal articles?
● The ContentMine (CM) team are actively developing new tools
& training workshops to help researchers get into content
mining: be it text, data, or image mining
● CM are a not-for-profit Shuttleworth-funded project led by
Peter Murray-Rust
● All the software tools are open source and available on github:
https://github.com/ContentMine/
● I’m a Scientific Advisor with the ContentMine
● Try getpapers OR quickscrape to get journal content en masse
https://github.com/ContentMine/getpapers
https://github.com/ContentMine/quickscrape
http://contentmine.org/
Want to download more than a million (OA) papers?
● No problem. PMC to the rescue!
● PMC has a full-text Open Access-only subset which you can download easily for free
● >1,100,000 full texts in XML (compressed) is just 16.6GB
Source: Neil Saunders (2014) https://rpubs.com/neilfws/45828
http://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/
Are there NHM specimens in the PMC OA subset?
PMC is medically focused, so one wouldn’t expect it to be
rich in organismal biology. However, there is some relevant content:
ALL of PLOS ONE is in the PMC OA subset.
Over 100,000 articles in that journal alone!
https://github.com/rossmounce/NHM-specimens
Version-controlled data on github
open for scrutiny & collaboration
Searching ALL full texts is not enough!!!
A significant number of specimens are probably ‘hiding out’ in supplementary data files in all sorts of formats.
Google Scholar does not index supplementary information (SI)
Web of Science doesn’t either
Nor does Scopus
At scale, journal-held supplementary data files are the ‘darkest corners’ of science
“Specimens were deposited in the collections of the California Academy of Sciences'
Department of Herpetology (CAS), the British Museum of Natural History (BMNH) and of
author GJM (Table S1)”
Source: 10.1371/journal.pone.0104628
http://rossmounce.co.uk/2015/06/20/deep-indexing-supplementary-data-files/
I don’t just find in-text mentions.
I’m trying to match them up to our
NHM Data Portal records too!
Specimens in RED do not
appear to be on the Data Portal
...yet
Blue globe represents a PLOS ONE paper
Grey represents an NHMUK specimen
Very few specimens occur in more than one paper
Can you guess what BMNH 37001 is?
Hint: it’s famous!
Mining over 200 subscription-access / non-PMC journals
from 2000 to 2015 inclusive
Nature + Science + PNAS + Phytotaxa + Zootaxa
BioOne Journals (131)
Springer Journals (32)
Wiley Journals (22)
Taylor & Francis Journals (14)
Elsevier Journals (12)
Oxford University Press Journals (8)
SciELO Journals (7) [Open Access but not in PMC]
Ecological Society of America Journals (6)
Geological Society Journals (4)
CSIRO Journals (4)
Cambridge University Press Journals (3)
Royal Society Journals (2)
Journal-omics!
Thanks to a recent change in UK copyright law,
text and data mining for non-commercial research purposes is legal in the UK,
provided that you have legitimate access to the resource you want to mine
(e.g. a paid-for institutional subscription)
http://blogs.lse.ac.uk/impactofsocialsciences/2014/06/04/the-right-to-read-is-the-right-to-mine-tdm/
Image credit: Ubiquity Press
http://ubiquitypress.tumblr.com/post/96012592921/the-right-to-read-is-the-right-to-mine
So far… (very much still in progress)
Almost nothing in Nature & Science ‘full (short) text’
Context: 15 years’ worth of full-text research in Nature & Science examined.
Science: only 11 NHM specimens found in ~39,600 texts.
Nature: similar story. <30 specimens in 14,132 ‘full’ texts.
Clearly there are more, but they are all buried in supplementary materials :(
Shoving all the research details into non-searchable
supplementary materials is bad for science
● For the avoidance of doubt, this is not a criticism of authors. This is squarely
aimed at journals that artificially restrict the ‘length’ of research articles online.
e.g. Prüfer, K. et al. 2014. The complete genome sequence of a Neanderthal from
the Altai Mountains. Nature 505, 43-49.
7 pages (in print), 12 pages (in PDF, with extra data tables & figures)
The supplementary data file?
249 pages!
Someone needs to build a searchable index of
supplementary data. ASAP
Interactive plot: https://plot.ly/~rossmounce/22.embed
“Micro-computed tomography scan slice through four bat skulls, displaying the relative
position of the three semicircular canals within the skull. Scans are from the following
species: (A) Pteropus rodricensis (BMNH.76.3.15.14); …”
NHM Data Portal Link (Stable, Unique Identifier)
http://data.nhm.ac.uk/specimen/69e97f52-0275-4a82-9fa6-cf1c3949f408
Article DOI (Stable, Unique Identifier)
http://dx.doi.org/10.1371/journal.pone.0061998
Huge potential to go beyond mere linking-up of identifiers.
This specimen & others have been CT scanned in the PLOS ONE paper.
We could do data, media and knowledge ‘repatriation’ back to the museum/portal.
Credit: Davies KTJ, Bates PJJ, Maryanto I, Cotton JA, Rossiter SJ (2013)
The Evolution of Bat Vestibular Systems in the Face of Potential
Antagonistic Selection Pressures for Flight and Echolocation. PLoS ONE 8
(4): e61998. doi:10.1371/journal.pone.0061998
Openly-licensed data on specimens, published elsewhere,
could be re-incorporated back into the online museum
catalogue. A one-stop shop for information.
Beyond-linking:
repatriation of knowledge
This is a CT-scan of “BMNH 76.3.15.14”.
Without mining, I wouldn’t know this data exists.
Perhaps it could also be made available on the portal?
http://data.nhm.ac.uk/specimen/69e97f52-0275-4a82-9fa6-cf1c3949f408
Does published info make it back ‘home’ to the collections?
BMNH 2013.2.13.3 is on the portal as “Petrochromis nov.sp. Takahashi”
I found it (by text mining) here: http://dx.doi.org/10.1007/s10228-014-0396-9
According to the paper, it’s now called Petrochromis horii n. sp.
What mechanisms are there to feed newer information back into the collection?
Content mining could definitely help keep collections data up-to-date!
Acknowledgements
● Sincere thanks to:
○ The NHM Library staff, particularly Sarah Vincent for actively supporting my content mining
○ Nancy Chillingsworth (IPR, NHM London)
○ Mark Wilkinson (Life Sciences, NHM London)
○ Peter Murray-Rust & the ContentMine team
○ Vince Smith (Life Sciences, NHM London)
○ Ben Scott (NHM Data Portal Lead Architect)
○ Rod Page (University of Glasgow)
○ All of the Biodiversity Informatics team
http://contentmine.org/
Please ask me questions!
Feedback appreciated :)
@rmounce
ross.mounce@nhm.ac.uk

Weitere ähnliche Inhalte

Was ist angesagt?

Text and Data Mining explained at FTDM
Text and Data Mining explained at FTDMText and Data Mining explained at FTDM
Text and Data Mining explained at FTDMpetermurrayrust
 
Sharing re-usable phylogenetic data: we're not there yet
Sharing re-usable phylogenetic data: we're not there yetSharing re-usable phylogenetic data: we're not there yet
Sharing re-usable phylogenetic data: we're not there yetRoss Mounce
 
Workshop 5: Uptake of, and concepts in text and data mining
Workshop 5: Uptake of, and concepts in text and data miningWorkshop 5: Uptake of, and concepts in text and data mining
Workshop 5: Uptake of, and concepts in text and data miningRoss Mounce
 
Amanuens.is HUmans and machines annotating scholarly literature
Amanuens.is HUmans and machines annotating scholarly literatureAmanuens.is HUmans and machines annotating scholarly literature
Amanuens.is HUmans and machines annotating scholarly literaturepetermurrayrust
 
Cochrane workshop 2016
Cochrane workshop 2016Cochrane workshop 2016
Cochrane workshop 2016TheContentMine
 
Automatic Extraction of Knowledge from the Literature
Automatic Extraction of Knowledge from the LiteratureAutomatic Extraction of Knowledge from the Literature
Automatic Extraction of Knowledge from the Literaturepetermurrayrust
 
Content Mining at Wellcome Trust
Content Mining at Wellcome TrustContent Mining at Wellcome Trust
Content Mining at Wellcome Trustpetermurrayrust
 
High throughput mining of the scholarly literature
High throughput mining of the scholarly literature High throughput mining of the scholarly literature
High throughput mining of the scholarly literature TheContentMine
 
ContentMine + EPMC: Finding Zika!
ContentMine + EPMC: Finding Zika!ContentMine + EPMC: Finding Zika!
ContentMine + EPMC: Finding Zika!petermurrayrust
 
Open software and knowledge for MIOSS
Open software and knowledge for MIOSSOpen software and knowledge for MIOSS
Open software and knowledge for MIOSSpetermurrayrust
 
ContentMine (TDM) at JISC Digifest
ContentMine (TDM) at JISC DigifestContentMine (TDM) at JISC Digifest
ContentMine (TDM) at JISC Digifestpetermurrayrust
 
Open Access for Early Career Researchers
Open Access for Early Career ResearchersOpen Access for Early Career Researchers
Open Access for Early Career ResearchersRoss Mounce
 
Open software and knowledge for MIOSS
Open software and knowledge for MIOSS Open software and knowledge for MIOSS
Open software and knowledge for MIOSS TheContentMine
 
Amanuens.is HUmans and machines annotating scholarly literature
Amanuens.is HUmans and machines annotating scholarly literature Amanuens.is HUmans and machines annotating scholarly literature
Amanuens.is HUmans and machines annotating scholarly literature TheContentMine
 
High throughput mining of the scholarly literature
High throughput mining of the scholarly literatureHigh throughput mining of the scholarly literature
High throughput mining of the scholarly literaturepetermurrayrust
 
Automatic Extraction of Knowledge from Biomedical literature
Automatic Extraction of Knowledge from Biomedical literatureAutomatic Extraction of Knowledge from Biomedical literature
Automatic Extraction of Knowledge from Biomedical literaturepetermurrayrust
 
Automatic Extraction of Knowledge from Biomedical literature
Automatic Extraction of Knowledge from Biomedical literature Automatic Extraction of Knowledge from Biomedical literature
Automatic Extraction of Knowledge from Biomedical literature TheContentMine
 
Content Mining of Science in Europe
Content Mining of Science in EuropeContent Mining of Science in Europe
Content Mining of Science in Europepetermurrayrust
 
MESUR: Making sense and use of usage data
MESUR: Making sense and use of usage dataMESUR: Making sense and use of usage data
MESUR: Making sense and use of usage dataHerbert Van de Sompel
 

Was ist angesagt? (20)

Text and Data Mining explained at FTDM
Text and Data Mining explained at FTDMText and Data Mining explained at FTDM
Text and Data Mining explained at FTDM
 
Sharing re-usable phylogenetic data: we're not there yet
Sharing re-usable phylogenetic data: we're not there yetSharing re-usable phylogenetic data: we're not there yet
Sharing re-usable phylogenetic data: we're not there yet
 
Workshop 5: Uptake of, and concepts in text and data mining
Workshop 5: Uptake of, and concepts in text and data miningWorkshop 5: Uptake of, and concepts in text and data mining
Workshop 5: Uptake of, and concepts in text and data mining
 
Amanuens.is HUmans and machines annotating scholarly literature
Amanuens.is HUmans and machines annotating scholarly literatureAmanuens.is HUmans and machines annotating scholarly literature
Amanuens.is HUmans and machines annotating scholarly literature
 
Cochrane workshop 2016
Cochrane workshop 2016Cochrane workshop 2016
Cochrane workshop 2016
 
Automatic Extraction of Knowledge from the Literature
Automatic Extraction of Knowledge from the LiteratureAutomatic Extraction of Knowledge from the Literature
Automatic Extraction of Knowledge from the Literature
 
Cochrane workshop2016
Cochrane workshop2016Cochrane workshop2016
Cochrane workshop2016
 
Content Mining at Wellcome Trust
Content Mining at Wellcome TrustContent Mining at Wellcome Trust
Content Mining at Wellcome Trust
 
High throughput mining of the scholarly literature
High throughput mining of the scholarly literature High throughput mining of the scholarly literature
High throughput mining of the scholarly literature
 
ContentMine + EPMC: Finding Zika!
ContentMine + EPMC: Finding Zika!ContentMine + EPMC: Finding Zika!
ContentMine + EPMC: Finding Zika!
 
Open software and knowledge for MIOSS
Open software and knowledge for MIOSSOpen software and knowledge for MIOSS
Open software and knowledge for MIOSS
 
ContentMine (TDM) at JISC Digifest
ContentMine (TDM) at JISC DigifestContentMine (TDM) at JISC Digifest
ContentMine (TDM) at JISC Digifest
 
Open Access for Early Career Researchers
Open Access for Early Career ResearchersOpen Access for Early Career Researchers
Open Access for Early Career Researchers
 
Open software and knowledge for MIOSS
Open software and knowledge for MIOSS Open software and knowledge for MIOSS
Open software and knowledge for MIOSS
 
Amanuens.is HUmans and machines annotating scholarly literature
Amanuens.is HUmans and machines annotating scholarly literature Amanuens.is HUmans and machines annotating scholarly literature
Amanuens.is HUmans and machines annotating scholarly literature
 
High throughput mining of the scholarly literature
High throughput mining of the scholarly literatureHigh throughput mining of the scholarly literature
High throughput mining of the scholarly literature
 
Automatic Extraction of Knowledge from Biomedical literature
Automatic Extraction of Knowledge from Biomedical literatureAutomatic Extraction of Knowledge from Biomedical literature
Automatic Extraction of Knowledge from Biomedical literature
 
Automatic Extraction of Knowledge from Biomedical literature
Automatic Extraction of Knowledge from Biomedical literature Automatic Extraction of Knowledge from Biomedical literature
Automatic Extraction of Knowledge from Biomedical literature
 
Content Mining of Science in Europe
Content Mining of Science in EuropeContent Mining of Science in Europe
Content Mining of Science in Europe
 
MESUR: Making sense and use of usage data
MESUR: Making sense and use of usage dataMESUR: Making sense and use of usage data
MESUR: Making sense and use of usage data
 

Andere mochten auch

The PLUTo project @iEvoBio 2014
The PLUTo project @iEvoBio 2014The PLUTo project @iEvoBio 2014
The PLUTo project @iEvoBio 2014Ross Mounce
 
How can repositories support the text-mining of their content and why?
How can repositories support the text-mining of their content and why? How can repositories support the text-mining of their content and why?
How can repositories support the text-mining of their content and why? Nancy Pontika
 
Leveraging the power of the web - Rocky Mountain Advanced Computing Conference
Leveraging the power of the web - Rocky Mountain Advanced Computing Conference Leveraging the power of the web - Rocky Mountain Advanced Computing Conference
Leveraging the power of the web - Rocky Mountain Advanced Computing Conference Kaitlin Thaney
 
Subscription costs versus open access costs, & Dissolving journals' boundaries
Subscription costs versus open access costs, & Dissolving journals' boundariesSubscription costs versus open access costs, & Dissolving journals' boundaries
Subscription costs versus open access costs, & Dissolving journals' boundariesAlex Holcombe
 
The OpenCon Intro to Open Data
The OpenCon Intro to Open DataThe OpenCon Intro to Open Data
The OpenCon Intro to Open DataRoss Mounce
 
SocialCite makes its debut at the HighWire Press meeting
SocialCite makes its debut at the HighWire Press meetingSocialCite makes its debut at the HighWire Press meeting
SocialCite makes its debut at the HighWire Press meetingKent Anderson
 
Research publication support for scholars in Brazil: Rising to the challenge
Research publication support for scholars in Brazil: Rising to the challengeResearch publication support for scholars in Brazil: Rising to the challenge
Research publication support for scholars in Brazil: Rising to the challengeRon Martinez
 
Open Access: Which Side Are You On
Open Access: Which Side Are You OnOpen Access: Which Side Are You On
Open Access: Which Side Are You OnJill Cirasella
 
Fifty shades of green and gold: open access to scholarly information
Fifty shades of green and gold: open access to scholarly informationFifty shades of green and gold: open access to scholarly information
Fifty shades of green and gold: open access to scholarly informationhierohiero
 

Andere mochten auch (10)

The PLUTo project @iEvoBio 2014
The PLUTo project @iEvoBio 2014The PLUTo project @iEvoBio 2014
The PLUTo project @iEvoBio 2014
 
How can repositories support the text-mining of their content and why?
How can repositories support the text-mining of their content and why? How can repositories support the text-mining of their content and why?
How can repositories support the text-mining of their content and why?
 
Open Access Publishing, Threat or Opportunity?
Open Access Publishing, Threat or Opportunity?Open Access Publishing, Threat or Opportunity?
Open Access Publishing, Threat or Opportunity?
 
Leveraging the power of the web - Rocky Mountain Advanced Computing Conference
Leveraging the power of the web - Rocky Mountain Advanced Computing Conference Leveraging the power of the web - Rocky Mountain Advanced Computing Conference
Leveraging the power of the web - Rocky Mountain Advanced Computing Conference
 
Subscription costs versus open access costs, & Dissolving journals' boundaries
Subscription costs versus open access costs, & Dissolving journals' boundariesSubscription costs versus open access costs, & Dissolving journals' boundaries
Subscription costs versus open access costs, & Dissolving journals' boundaries
 
The OpenCon Intro to Open Data
The OpenCon Intro to Open DataThe OpenCon Intro to Open Data
The OpenCon Intro to Open Data
 
SocialCite makes its debut at the HighWire Press meeting
SocialCite makes its debut at the HighWire Press meetingSocialCite makes its debut at the HighWire Press meeting
SocialCite makes its debut at the HighWire Press meeting
 
Research publication support for scholars in Brazil: Rising to the challenge
Research publication support for scholars in Brazil: Rising to the challengeResearch publication support for scholars in Brazil: Rising to the challenge
Research publication support for scholars in Brazil: Rising to the challenge
 
Open Access: Which Side Are You On
Open Access: Which Side Are You OnOpen Access: Which Side Are You On
Open Access: Which Side Are You On
 
Fifty shades of green and gold: open access to scholarly information
Fifty shades of green and gold: open access to scholarly informationFifty shades of green and gold: open access to scholarly information
Fifty shades of green and gold: open access to scholarly information
 

Ähnlich wie Museum impact: linking-up specimens with research published on them

ContentMining for France and Europe; Lessons from 2 years in UK
ContentMining for France and Europe; Lessons from 2 years in UKContentMining for France and Europe; Lessons from 2 years in UK
ContentMining for France and Europe; Lessons from 2 years in UKpetermurrayrust
 
ContentMining in Neuroscience
ContentMining in NeuroscienceContentMining in Neuroscience
ContentMining in Neurosciencepetermurrayrust
 
ContentMining in Neuroscience
ContentMining in NeuroscienceContentMining in Neuroscience
ContentMining in NeuroscienceTheContentMine
 
ContentMining in Neuroscience
ContentMining in NeuroscienceContentMining in Neuroscience
ContentMining in NeuroscienceTheContentMine
 
Content Mining at Wellcome Trust
Content Mining at Wellcome TrustContent Mining at Wellcome Trust
Content Mining at Wellcome TrustTheContentMine
 
Rapid biomedical search
Rapid biomedical search Rapid biomedical search
Rapid biomedical search petermurrayrust
 
ContentMining for Synthetic Biology
ContentMining for Synthetic BiologyContentMining for Synthetic Biology
ContentMining for Synthetic Biologypetermurrayrust
 
ContentMining for Synthetic Biology
ContentMining for Synthetic BiologyContentMining for Synthetic Biology
ContentMining for Synthetic BiologyTheContentMine
 
Liberating facts from the scientific literature - Jisc Digifest 2016
Liberating facts from the scientific literature - Jisc Digifest 2016Liberating facts from the scientific literature - Jisc Digifest 2016
Liberating facts from the scientific literature - Jisc Digifest 2016Jisc
 
THOR Workshop - Data Publishing PLOS
THOR Workshop - Data Publishing PLOSTHOR Workshop - Data Publishing PLOS
THOR Workshop - Data Publishing PLOSMaaike Duine
 
ContentMine: Mining the Scientific Literature
ContentMine: Mining the Scientific LiteratureContentMine: Mining the Scientific Literature
ContentMine: Mining the Scientific Literaturepetermurrayrust
 
Why ContentMining is useful
Why ContentMining is usefulWhy ContentMining is useful
Why ContentMining is usefulTheContentMine
 
Why ContentMining is useful
Why ContentMining is usefulWhy ContentMining is useful
Why ContentMining is usefulpetermurrayrust
 
High throughput mining of the scholarly literature: journals and theses
High throughput mining of the scholarly literature: journals and thesesHigh throughput mining of the scholarly literature: journals and theses
High throughput mining of the scholarly literature: journals and thesespetermurrayrust
 
The culture of researchData
The culture of researchDataThe culture of researchData
The culture of researchDatapetermurrayrust
 
Big Data and ContentMining for Libraries
Big Data and ContentMining for LibrariesBig Data and ContentMining for Libraries
Big Data and ContentMining for Librariespetermurrayrust
 
ContentMine: Open Data and Social Machines
ContentMine: Open Data and Social MachinesContentMine: Open Data and Social Machines
ContentMine: Open Data and Social MachinesTheContentMine
 
ContentMine: Open Data and Social Machines
ContentMine: Open Data and Social MachinesContentMine: Open Data and Social Machines
ContentMine: Open Data and Social Machinespetermurrayrust
 
High throughput mining of the scholarly literature; talk at NIH
High throughput mining of the scholarly literature; talk at NIHHigh throughput mining of the scholarly literature; talk at NIH
High throughput mining of the scholarly literature; talk at NIHpetermurrayrust
 

Ähnlich wie Museum impact: linking-up specimens with research published on them (20)

ContentMining for France and Europe; Lessons from 2 years in UK
ContentMining for France and Europe; Lessons from 2 years in UKContentMining for France and Europe; Lessons from 2 years in UK
ContentMining for France and Europe; Lessons from 2 years in UK
 
ContentMining in Neuroscience
ContentMining in NeuroscienceContentMining in Neuroscience
ContentMining in Neuroscience
 
ContentMining in Neuroscience
ContentMining in NeuroscienceContentMining in Neuroscience
ContentMining in Neuroscience
 
ContentMining in Neuroscience
ContentMining in NeuroscienceContentMining in Neuroscience
ContentMining in Neuroscience
 
Content Mining at Wellcome Trust
Content Mining at Wellcome TrustContent Mining at Wellcome Trust
Content Mining at Wellcome Trust
 
Rapid biomedical search
Rapid biomedical search Rapid biomedical search
Rapid biomedical search
 
ContentMining for Synthetic Biology
ContentMining for Synthetic BiologyContentMining for Synthetic Biology
ContentMining for Synthetic Biology
 
ContentMining for Synthetic Biology
ContentMining for Synthetic BiologyContentMining for Synthetic Biology
ContentMining for Synthetic Biology
 
BHL Tech Report
BHL Tech ReportBHL Tech Report
BHL Tech Report
 
Liberating facts from the scientific literature - Jisc Digifest 2016
Liberating facts from the scientific literature - Jisc Digifest 2016Liberating facts from the scientific literature - Jisc Digifest 2016
Liberating facts from the scientific literature - Jisc Digifest 2016
 
THOR Workshop - Data Publishing PLOS
THOR Workshop - Data Publishing PLOSTHOR Workshop - Data Publishing PLOS
THOR Workshop - Data Publishing PLOS
 
ContentMine: Mining the Scientific Literature
ContentMine: Mining the Scientific LiteratureContentMine: Mining the Scientific Literature
ContentMine: Mining the Scientific Literature
 
Why ContentMining is useful
Why ContentMining is usefulWhy ContentMining is useful
Why ContentMining is useful
 
Why ContentMining is useful
Why ContentMining is usefulWhy ContentMining is useful
Why ContentMining is useful
 
High throughput mining of the scholarly literature: journals and theses
High throughput mining of the scholarly literature: journals and thesesHigh throughput mining of the scholarly literature: journals and theses
High throughput mining of the scholarly literature: journals and theses
 
The culture of researchData
The culture of researchDataThe culture of researchData
The culture of researchData
 
Big Data and ContentMining for Libraries
Big Data and ContentMining for LibrariesBig Data and ContentMining for Libraries
Big Data and ContentMining for Libraries
 
ContentMine: Open Data and Social Machines
ContentMine: Open Data and Social MachinesContentMine: Open Data and Social Machines
ContentMine: Open Data and Social Machines
 
ContentMine: Open Data and Social Machines
ContentMine: Open Data and Social MachinesContentMine: Open Data and Social Machines
ContentMine: Open Data and Social Machines
 
High throughput mining of the scholarly literature; talk at NIH
High throughput mining of the scholarly literature; talk at NIHHigh throughput mining of the scholarly literature; talk at NIH
High throughput mining of the scholarly literature; talk at NIH
 

Mehr von Ross Mounce

Liberating OA figures from PDF to Flickr (A Pro-iBiosphere talk)
Liberating OA figures from PDF to Flickr (A Pro-iBiosphere talk)Liberating OA figures from PDF to Flickr (A Pro-iBiosphere talk)
Liberating OA figures from PDF to Flickr (A Pro-iBiosphere talk)Ross Mounce
 
Social Media For Researchers
Social Media For ResearchersSocial Media For Researchers
Social Media For ResearchersRoss Mounce
 
Social Media for Science
Social Media for ScienceSocial Media for Science
Social Media for ScienceRoss Mounce
 
Phylogenetic Congruence between Cranial and Postcranial Characters in Archosa...
Phylogenetic Congruence between Cranial and Postcranial Characters in Archosa...Phylogenetic Congruence between Cranial and Postcranial Characters in Archosa...
Phylogenetic Congruence between Cranial and Postcranial Characters in Archosa...Ross Mounce
 

Mehr von Ross Mounce (7)

Liberating OA figures from PDF to Flickr (A Pro-iBiosphere talk)
Liberating OA figures from PDF to Flickr (A Pro-iBiosphere talk)Liberating OA figures from PDF to Flickr (A Pro-iBiosphere talk)
Liberating OA figures from PDF to Flickr (A Pro-iBiosphere talk)
 
Social Media For Researchers
Social Media For ResearchersSocial Media For Researchers
Social Media For Researchers
 
Social Media for Science
Social Media for ScienceSocial Media for Science
Social Media for Science
 
Herding Cats
Herding CatsHerding Cats
Herding Cats
 
Content Mining
Content MiningContent Mining
Content Mining
 
Phylogenetic Congruence between Cranial and Postcranial Characters in Archosa...
Phylogenetic Congruence between Cranial and Postcranial Characters in Archosa...Phylogenetic Congruence between Cranial and Postcranial Characters in Archosa...
Phylogenetic Congruence between Cranial and Postcranial Characters in Archosa...
 
ProgPal2011
ProgPal2011ProgPal2011
ProgPal2011
 

Museum impact: linking-up specimens with research published on them

  • 1. Museum Impact Linking-up our specimens with research published on them Dr Ross Mounce @rmounce
  • 2. Talk Structure ● Background: the collections, the research literature ● Interesting things you should know about access to research ○ The costs of knowledge $$$ ● Examples of content mining ○ Including a video demo! ● My work (in progress) on finding NHM specimens in recent literature
  • 3. Source: http://www.nhm.ac.uk/our-science/collections.html © The Trustees of the Natural History Museum, London
  • 4. ● New ● Open Data ● Easy-to-use ● Quick ● Images ● Audio ● Interactive Maps ● Citable ● API access ● Open Source Infrastructure It’s not KE Emu :)
  • 5. What I want to do: link specimen records to their mentions in the literature “Micro-computed tomography scan slice through four bat skulls, displaying the relative position of the three semicircular canals within the skull. Scans are from the following species: (A) Pteropus rodricensis (BMNH.76.3.15.14); …” NHM Data Portal Link (Stable, Unique Identifier) http://data.nhm.ac.uk/specimen/69e97f52-0275-4a82-9fa6-cf1c3949f408 Article DOI (Stable, Unique Identifier) http://dx.doi.org/10.1371/journal.pone.0061998
  • 6. 114,000,000 scholarly papers available online 36,000,000 of which are ‘Biology’ / ‘Environmental Studies’ / ‘Geosciences’ / ‘Multidisciplinary’ Khabsa, M. and Giles, C. L. 2014. The number of scholarly documents on the public web. PLoS ONE
  • 7. Sadly, the vast majority of papers are only ‘available’ online to paying subscribers and no institution in the world has access to everything. Not even close to everything!
  • 8. Cheryl Hall (2014) FOI request https://www.whatdotheyknow.com/request/academic_journal_subscription_co We rent access to knowledge. Companies profiteer from it
       2004/05 £357,197.79
       2005/06 £383,214.29
       2006/07 £340,690.33
       2007/08 £381,526.57
       2008/09 £441,706.36
       2009/10 £437,539.71
       2010/11 £430,105.08
       2011/12 £449,515.12
       2012/13 £469,007.50
       2013/14 £494,913.01
       10-year total: £4,185,415.76
       Tax Year | Revenue | Profit | Profit Margin
       2004 | £1363m | £460m | 33.75%
       2005 | £1436m | £449m | 31.25%
       2006 | £1521m | £465m | 30.57%
       2007 | £1507m | £477m | 31.65%
       2008 | £1700m | £568m | 33.41%
       2009 | £1985m | £693m | 34.91%
       2010 | £2026m | £724m | 35.74%
       2011 | £2058m | £768m | 37.30%
       2012 | £2063m | £780m | 37.81%
       2013 | £2126m | £826m | 38.85%
       Source: RELX Group (Parent company of Elsevier) Company Reports
  • 9. Actually, the NHM’s annual bill isn’t bad compared to others Source: Lawson S and Meghreblian B. (2015) Journal subscription expenditure of UK higher education institutions. F1000Research http://shiny.retr0.me/journal_costs/
  • 10. Content Mining provides more bang for your buck Making fuller use of our expensively provisioned access ● If the NHM is going to pay £500,000 per year to rent journals, why not use the access to this resource to its fullest? ● I can’t read everything with my human eyes but… computers can! ● If you can process one document with a computer, you can process a million: content mining
  • 11. Recent examples of Content Mining Fig. 6 from the paper Brachiopod body-size estimates Red-line humans Grey bars machines (PaleoDeepDive) Better than PaleoDB ? I think so. PDD more clearly-linked to evidence than PDB Provenance matters.
  • 12. Recent examples of Content Mining (Images) 3-second image analysis source: 10.1099/ijs.0.65212-0 (Zymobacter_palmae:261,((((Chromohalobacter_canadensis:42, (Chromohalobacter_sarecensis:96,Chromohalobacter_nigrandesensls:154):41):80, (Chromohalobacter_marismortui:125,Chromohalobacter_beijerinckii:103):164):61, (Chromohalobacter_israelensis:11,Chromohalobacter_salexigens:11):92):293, ((Halomonas_halodurans:328,(Halomonas_ventosae:100,(Halomonas_pacifica:116, (Halomonas_halophila:223,(Halomonas_eurihalina:27,Halomonas_elongate:58):236): 79):41):46):72,(Halomonas_desiderata:187,(Halomonas_pantelleriensis:173, Halomonas_muralis:190):70):30):110):187); outputs re-usable Newick & NeXML no manual input required Can replot data, re-analyse, combine many to make a supertree! PLUTo Project Mounce, Murray-Rust, Wills (in prep.)
  • 13. How to get a sufficient volume of journal articles? ● The ContentMine (CM) team are actively developing new tools & training workshops to help researchers get into content mining: be it text, data, or image mining ● CM are a not-for-profit Shuttleworth-funded project led by Peter Murray-Rust ● All the software tools are open source and available on github: https://github.com/ContentMine/ ● I’m a Scientific Advisor with the ContentMine ● Try getpapers OR quickscrape to get journal content en masse https://github.com/ContentMine/getpapers https://github.com/ContentMine/quickscrape http://contentmine.org/
  • 14.
  • 15. Want to download more than a million (OA) papers? ● No problem. PMC to the rescue! ● PMC has a full text Open Access-only subset which you can download easily for free ● >1,100,000 full texts in XML (compressed) is just 16.6GB Source: Neil Saunders (2014) https://rpubs.com/neilfws/45828 http://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/
  • 16. Are there NHM specimens in the PMC OA subset? PMC is medically-focused, so one wouldn’t expect it to be rich in organismal biology, however …some relevant content ALL of PLOS ONE is in the PMC OA subset. Over 100,000 articles in that journal alone!
  • 18. Searching ALL full texts is not enough!!! A significant number of specimens are probably ‘hiding-out’ in supplementary data files of all sorts of formats. Google Scholar does not index SI Web of Science doesn’t either Nor does Scopus At scale, journal-held supplementary data files are the ‘darkest corners’ of science “Specimens were deposited in the collections of the California Academy of Sciences' Department of Herpetology (CAS), the British Museum of Natural History (BMNH) and of author GJM (Table S1)” 10.1371/journal.pone.0104628 http://rossmounce.co.uk/2015/06/20/deep-indexing-supplementary-data-files/
  • 19. I don’t just find in-text mentions. I’m trying to match them up to our NHM Data Portal records too! Specimens in RED do not appear to be on the Data Portal ...yet Blue globe represents a PLOS ONE paper
  • 20. Blue globe represents a PLOS ONE paper Very few specimens occur in more than one paper Can you guess what BMNH 37001 is? Hint: it’s famous! Grey represents an NHMUK specimen
  • 21. Mining over 200 subscription access / non-PMC journals from 2000 <-> 2015 inclusive Nature + Science + PNAS + Phytotaxa + Zootaxa BioOne Journals (131) Springer Journals (32) Wiley Journals (22) Taylor & Francis Journals (14) Elsevier Journals (12) Oxford University Press Journals (8) SciELO Journals (7) [Open Access but not in PMC] Ecological Society of America Journals (6) Geological Society Journals (4) CSIRO Journals (4) Cambridge University Press Journals (3) Royal Society Journals (2) Journal-omics!
  • 22. Thanks to a recent change in UK copyright law: text and data mining for non-commercial research purposes is legal (in the UK), (provided that you have legitimate access to the resource you want to mine e.g. a paid-for institutional subscription) http://blogs.lse.ac.uk/impactofsocialsciences/2014/06/04/the-right-to-read-is-the-right-to-mine-tdm/
  • 23. Image credit: Ubiquity Press http://ubiquitypress.tumblr.com/post/96012592921/the-right-to-read-is-the-right-to-mine
  • 24. So far… (very much still in progress)
  • 25. Almost nothing in Nature & Science ‘full (short) text’ Context: 15 years’ worth of full text research in Nature & Science examined. Science: only 11 NHM specimens found in ~39,600 texts. Nature: similar story. <30 specimens in 14,132 ‘full’ texts. Clearly there are more, but it’s all buried in supplementary materials :(
  • 26. Shoving all the research details into non-searchable supplementary materials is bad for science ● For the avoidance of doubt, this is not a criticism of authors. This is squarely aimed at journals that artificially restrict the ‘length’ of research articles online. e.g. Prufer, K. et al. 2014. The complete genome sequence of a Neanderthal from the Altai Mountains. Nature 2014, 505, 43-49. 7-pages (in paper), 12-pages (in PDF, with extra data tables & figures) The supplementary data file? 249 pages!
  • 27. Someone needs to build a searchable index of supplementary data. ASAP
  • 29. “Micro-computed tomography scan slice through four bat skulls, displaying the relative position of the three semicircular canals within the skull. Scans are from the following species: (A) Pteropus rodricensis (BMNH.76.3.15.14); …” NHM Data Portal Link (Stable, Unique Identifier) http://data.nhm.ac.uk/specimen/69e97f52-0275-4a82-9fa6-cf1c3949f408 Article DOI (Stable, Unique Identifier) http://dx.doi.org/10.1371/journal.pone.0061998 Huge potential to go beyond mere linking-up of identifiers. This specimen & others have been CT scanned in the PLOS ONE paper. We could do data, media and knowledge ‘repatriation’ back to the museum/portal.
  • 30. Credit: Davies KTJ, Bates PJJ, Maryanto I, Cotton JA, Rossiter SJ (2013) The Evolution of Bat Vestibular Systems in the Face of Potential Antagonistic Selection Pressures for Flight and Echolocation. PLoS ONE 8 (4): e61998. doi:10.1371/journal.pone.0061998 Openly-licensed data on specimens, published elsewhere, could be re-incorporated back into the online museum catalogue. A one-stop shop for information. Beyond-linking: repatriation of knowledge This is a CT-scan of “BMNH 76.3.15.14”. Without mining, I wouldn’t know this data exists. Perhaps it could also be made available on the portal? http://data.nhm.ac.uk/specimen/69e97f52-0275-4a82-9fa6-cf1c3949f408
  • 31. Does published info make it back ‘home’ to the collections? BMNH 2013.2.13.3 on the portal as “Petrochromis nov.sp. Takahashi” I found it (by text mining) here: http://dx.doi.org/10.1007/s10228-014-0396-9 It’s now called: Petrochromis horii n. sp. , according to the paper. What mechanisms are there to update newer information back into the collection? Content mining could definitely help keep collections data up-to-date!
  • 32. Acknowledgements ● Sincere thanks to: ○ The NHM Library staff, particularly Sarah Vincent for actively supporting my content mining ○ Nancy Chillingsworth (IPR, NHM London) ○ Mark Wilkinson (Life Sciences, NHM London) ○ Peter Murray-Rust & the ContentMine team ○ Vince Smith (Life Sciences, NHM London) ○ Ben Scott (NHM Data Portal Lead Architect) ○ Rod Page (University of Glasgow) ○ All of the Biodiversity Informatics team http://contentmine.org/
  • 33. Please ask me questions! Feedback appreciated :) @rmounce ross.mounce@nhm.ac.uk
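
The linking idea from slides 5 and 19 (find a specimen code in a paper's text, then pair it with the article DOI) can be sketched in a few lines of Python. This is only an illustrative sketch, not the pipeline used for the talk: the regular expression, the `Mention` record and the `find_mentions` function are all hypothetical names, and the pattern assumes codes look like the examples quoted in the slides (`BMNH.76.3.15.14`, `BMNH 2013.2.13.3`, `BMNH 37001`). Real specimen codes vary far more than this, and matching a code to a Data Portal record is a separate step not shown here.

```python
import re
from dataclasses import dataclass

# Hypothetical pattern (an assumption, not the talk's actual method):
# the prefix "BMNH" or "NHMUK", an optional space or dot separator,
# then a catalogue number made of dot-separated digit groups.
SPECIMEN_RE = re.compile(r"\b(BMNH|NHMUK)[ .]?(\d+(?:\.\d+)*)")


@dataclass(frozen=True)
class Mention:
    """One specimen code found in one article."""
    specimen: str  # normalised to "PREFIX number"
    doi: str       # DOI of the article the text came from


def find_mentions(text: str, doi: str) -> list[Mention]:
    """Return each distinct specimen code in `text`, in order of first appearance."""
    mentions: list[Mention] = []
    for prefix, number in SPECIMEN_RE.findall(text):
        mention = Mention(f"{prefix} {number}", doi)
        if mention not in mentions:  # skip repeats within the same article
            mentions.append(mention)
    return mentions


if __name__ == "__main__":
    caption = (
        "Scans are from the following species: "
        "(A) Pteropus rodricensis (BMNH.76.3.15.14); ..."
    )
    for m in find_mentions(caption, "10.1371/journal.pone.0061998"):
        print(m.specimen, "->", m.doi)
```

Normalising every hit to a single `PREFIX number` form makes it easier to compare mentions across papers. A bare institutional abbreviation such as "(BMNH)" with no catalogue number after it is deliberately not counted as a specimen.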