The document discusses plans to create a new data portal at data.nhm.ac.uk to address issues with finding, accessing, citing, and integrating research data and collection data from the Natural History Museum. It will provide a central access point, allow for integrated search and browse of datasets, and enable users to download, export, and analyze data. The portal will follow an open by default approach and be populated by museum staff. Development will occur over three years with initial focus on discovery of research datasets and collections data, followed by improved visualization and citation of data.
2. The problem – research data
Hard to find, access, cite and integrate
49 NHM science group papers in last 4 weeks
• 45 available online
(4 print only or behind pay walls)
• 9 had supplementary data files
• 39 papers with tables, charts & other data
o >1000 sequences
o 826 figures
o 76 tables
o 1 genome
• No collective view of these data (37 journals)
• No consistent way of citing NHM data
• No mechanism to integrate or version
• No way to repurpose data (retyping?)
Data via Carolyn Lowry e-mail, 13th Feb. 2013
4. The problem – collections data
Hard to find, access, cite and integrate
Initial problems
•Don’t know / can’t find the website
Botany http://www.nhm.ac.uk/research-curation/collections/search/results.jsp?mode=collections&department=32
Entomology http://www.nhm.ac.uk/research-curation/collections/search/results.jsp?mode=collections&department=40
Library http://www.nhm.ac.uk/research-curation/collections/search/results.jsp?mode=collections&department=36
Mineralogy http://www.nhm.ac.uk/research-curation/collections/search/results.jsp?mode=collections&department=55
Palaeontology http://www.nhm.ac.uk/research-curation/collections/search/results.jsp?mode=collections&department=34
Zoology http://www.nhm.ac.uk/research-curation/collections/search/results.jsp?mode=collections&department=38
7. The problem – collections data
Hard to find, access, cite and integrate
Initial problems
•Don’t know / can’t find the website
•6 different data collections
•23 interfaces & datasets of varying importance
•No priority to collection datasets
(119 specimens up to 28,000,000 specimens)
11. The problem – collections data
Hard to find, access, cite and integrate
Initial problems
•Don’t know / can’t find the website
•6 different collections
•23 interfaces & datasets of varying importance
•No priority to collection datasets
•Entomology collections don’t exist (404)
•Library doesn’t have any online collections!
Bigger issues
•Idiosyncratic browse or search
•No maps, few images & very slow
•No summary or statistics
•No download, export or custom views
•No integration with other data
•No author info or update info
•No means of specimen citation
•No exports to GBIF or associated projects
The data portal must correct these issues
12. The solution – data.nhm.ac.uk portal
High level issues
Functional requirements
•A central access point for NHM research & collections data
•The capacity to store/link and describe datasets
•Integrated search & browse of datasets
•The ability to cite datasets and specimen records in data sets
•The ability to integrate collections data
•Custom functions for sub-sections of data (e.g. initiatives, Virtual Herbarium)
•The capacity to download, export & analyse data
Principles
•Open-by-default: Capacity for embargoed and private data
•Sustainable: Self-populated by NHM staff (except collections data)
Exclusions
•Not a replacement for DAMS or KeEMu (a Web interface for these systems)
•Publications out of scope (focused on data sets)
•All annotations on data link back to the source (e.g. KeEMu)
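The open-by-default principle, with its capacity for embargoed and private data, could be expressed as a simple visibility rule. The sketch below is only an illustration under stated assumptions: the `Dataset` class and its `private` and `embargo_until` fields are hypothetical names, not part of any portal specification.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Dataset:
    name: str
    private: bool = False                 # hypothetical flag: owner keeps the data private
    embargo_until: Optional[date] = None  # hypothetical field: release date, if embargoed


def is_public(ds: Dataset, today: date) -> bool:
    """Open by default: hide a dataset only if it is private or still embargoed."""
    if ds.private:
        return False
    if ds.embargo_until is not None and today < ds.embargo_until:
        return False
    return True
```

A dataset created with no flags is visible immediately; an embargoed one becomes visible on its release date without further action from staff.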
13. The solution – data.nhm.ac.uk portal
System Overview
•Scope (Source Data): KeEMu (NHM); HerbCat (Kew); Other datasets etc… (Species dictionary, initiatives, Scratchpads etc); User contributed datasets
•File types (formats): DwC-A; PhyloXML; neXML; Nexus; Excel, CSV; Other
•Registry (Discovery & download): NHM specimens; Kew specimens; Private
•Subportals (Branded slices of data): Subportal 1, e.g. Disease initiative; Subportal 2, e.g. Kew / NHM Virtual Herbarium
•Explorer: Map view; Table view; Statistics view; Analytic view (R)
14. Portal overview – adding data sets
Quick & easy, semi-automated workflow
1. Name the dataset
2. Upload / link the data file
3. Describe the data file
4. Theme & tag
5. Add additional resources
6. Temporal coverage
7. Geographic coverage
8. Save & finish
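As a rough illustration of how the eight-step workflow might map onto a CKAN dataset record, the sketch below assembles a JSON payload of the kind accepted by CKAN's `package_create` action. The helper names (`slugify`, `build_dataset_payload`) and the `temporal_coverage` / `geographic_coverage` extras keys are assumptions made for this example; the deck does not specify a schema.

```python
import json
import re


def slugify(title: str) -> str:
    """Step 1: derive a lowercase, URL-safe dataset name from its title."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def build_dataset_payload(title, file_url, file_description, tags,
                          extra_resources=(), temporal=None, geographic=None):
    """Collect the workflow steps into one dict (hypothetical helper)."""
    # Steps 2-3: the main data file and its description.
    resources = [{"url": file_url, "description": file_description}]
    # Step 5: any additional resources, given here as plain URLs.
    resources += [{"url": url} for url in extra_resources]
    # Steps 6-7: coverage stored as free-form extras (assumed key names).
    extras = []
    if temporal:
        extras.append({"key": "temporal_coverage", "value": temporal})
    if geographic:
        extras.append({"key": "geographic_coverage", "value": geographic})
    return {
        "name": slugify(title),
        "title": title,
        "resources": resources,
        "tags": [{"name": tag} for tag in tags],  # Step 4
        "extras": extras,
    }


def to_json(payload: dict) -> str:
    """Step 8: serialise the record, ready to send to the portal."""
    return json.dumps(payload, indent=2)
```

On a CKAN site the serialised record would typically be POSTed to `/api/3/action/package_create` with an API key; that call is left out here so the sketch stays runnable offline.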
16. Portal overview – data set display
Exploring research data sets
•Name
•Authors
•License
•Tags
•Download
•Metadata about the dataset
•Technical info. (extracted from data file)
•Geographic scope
•Developer tools
•“Social”
17. Portal overview – collections data
Main interface
•Toggle map, table & stats views
•Search, download & display options
•No. records
•No. georef. records
•Zoomable map
•Applied filters
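The record counts shown in the main interface (number of records and number of georeferenced records, after any applied filters) amount to a small aggregation. The sketch below is a minimal in-memory version under assumed names: the `Record` fields and the `summarise` function are hypothetical, and a real portal would presumably compute these counts in its database.

```python
from typing import Callable, Iterable, Optional, Sequence, TypedDict


class Record(TypedDict):
    collection: str             # hypothetical field, e.g. "Botany"
    latitude: Optional[float]   # None when the specimen is not georeferenced
    longitude: Optional[float]


def summarise(records: Iterable[Record],
              filters: Sequence[Callable[[Record], bool]] = ()) -> dict:
    """Count records passing every applied filter, and how many have coordinates."""
    selected = [r for r in records if all(f(r) for f in filters)]
    georeferenced = [
        r for r in selected
        if r["latitude"] is not None and r["longitude"] is not None
    ]
    return {"records": len(selected), "georeferenced": len(georeferenced)}
```

Each filter is an ordinary predicate, so a filter on collection name is `lambda r: r["collection"] == "Botany"`, and filters combine by being listed together.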
18. Portal overview – collections data
Additional interfaces
Collections views
•Tables
•Statistical summary
•Download
Specimen record views
•Full record
•Summary preview
•Data field mappings
19. Portal overview
Some example data portals & software
Data.gov.uk & CKAN
•UK government data portal
•Uses CKAN, open-source data portal platform
•Used by national & regional governments
•Links into Drupal, DataCite & NHM systems
•http://data.gov.uk & http://ckan.org/
Canadensys & CartoDB
•Canadian network of biodiversity collections
•Almost 1 million specimens, 18 datasets
•Uses CartoDB mapping solution
•Create dynamic maps, analyse data and build location-aware geospatial applications
•Widely used, cloud data storage, PostGIS
•http://data.canadensys.net & http://cartodb.com/
20. Portal development
Timeline & resources
Year 1 – Dataset discovery
•Technical & functional specification (Vizz. subcontract)
•Data workflows (KeEMu & research datasets)
•Functional alpha prototype (CKAN)
Year 2 – Visualisation
•Mapping & statistical functionality (CartoDB)
•Social and annotation functions
•Stable beta release at http://data.nhm.ac.uk
Year 3 – Citation & analysis
•DataCite DOIs on datasets & specimens
•Initial Web analytical functions (R)
•Initiative sub-portals including Virt. Herbarium
Resources
•1x Developer (Ben Scott) for 3 years
•Vizzuality subcontract (circa £xxk - TBC)
•ICT capital, travel & software (circa £25k)
21. Portal consultation
Feedback & next steps
Documentation
•Overview specification - http://goo.gl/qjioh
•Project Initiation Document - http://goo.gl/oRr2j
Initial stakeholder meetings (Feb. – May)
•ICT Group (David Thomas, Chris Sleep & Gavin Malarky)
•Darrell Siebert and the KE EMu user group
•NHM Collections Committee & Initiative leaders
•Kew Gardens & Virtual Herbarium Reps.
•GBIF, NBN, UK DataCite team at BL, NERC
•Digital Facility Team
•Vizzuality
Wider consultation
•Example data types / sets
•Specialist search options & vocabularies
•Specialist Earth Science needs
FEEDBACK & LINKS
Slides:
Feedback: vince+portal@vsmith.info
Specification: http://goo.gl/qjioh
PID: http://goo.gl/oRr2j
22. Two more things
Wikipedian in Residence
•Four month post with Science Museum
•Starting March / April
•Work with NHM staff to improve Wikipedia
•Run events with NHM staff & volunteers
•Work with the GLAM group at Imperial College
•Focus on NHM science themes & specimens
•Not about promotion of “The NHM”
Biodiversity Informatics Workshop – May 2013
•One full day - date TBC
•Outputs from ViBRANT & e-Monocot
•Includes Scratchpads & the Biodiversity Data Journal
•What we do, how it’s used and where we are going
•Includes links to NHM informatics & digitisation initiatives
23. Portal overview – data citation
Unique identifiers for datasets & specimen records
Why cite data
•URLs are not persistent
•e.g. Wren JD: URL decay in MEDLINE - a 4-year follow-up study. Bioinformatics. 2008 Jun 1;24(11):1381-5 (circa 40% decay)
•Measure our digital footprint
•Puts research data on par with articles
•Facilitates data mining
What gets an identifier
•NHM specimen records (suffix of NHM IDs), e.g. http://dx.doi.org/BMNH_PBI_00388325
•NHM research datasets (files)
•Insert into publications
How to cite data
•Digital Object Identifiers (DOIs)
•Widely used & understood on articles
•Operates in collaboration with DataCite
•Part of an International consortium
•Mixes NHM data with other domains
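To make the citation idea concrete, the sketch below builds a DOI from a specimen identifier used as the suffix, turns it into a resolver URL, and formats a citation string. Several things here are assumptions: the prefix `10.5555` is a placeholder rather than an NHM prefix, the example identifier is pieced together from the fragments on the slide, and the "Creator (Year). Title. Publisher. Identifier" layout follows the general DataCite recommendation rather than anything stated in the deck.

```python
from urllib.parse import quote

RESOLVER = "http://dx.doi.org/"  # resolver shown on the slide
DOI_PREFIX = "10.5555"           # hypothetical placeholder prefix, NOT the NHM's


def specimen_doi(specimen_id: str, prefix: str = DOI_PREFIX) -> str:
    """Use the existing NHM specimen ID as the DOI suffix."""
    return f"{prefix}/{specimen_id}"


def doi_url(doi: str) -> str:
    """Build a resolvable URL, percent-encoding anything unsafe in the DOI."""
    return RESOLVER + quote(doi, safe="/")


def cite(creators: str, year: int, title: str, doi: str,
         publisher: str = "Natural History Museum") -> str:
    """Format a citation as Creator (Year). Title. Publisher. Identifier."""
    return f"{creators} ({year}). {title}. {publisher}. {doi_url(doi)}"
```

For example, `doi_url(specimen_doi("BMNH_PBI_00388325"))` gives a link that could be inserted into a publication and that would keep resolving even if the portal's own URLs change.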