data.nhm.ac.uk
NHM data portal update

Part of the informatics initiative (2013-15)




Vince Smith & Ben Scott
The problem – research data
    Hard to find, access, cite and integrate
    49 NHM science group papers in last 4 weeks
    (Data via Carolyn Lowry e-mail, 13th Feb. 2013)
    •45 available online (4 print only or behind pay walls)
    •9 had supplementary data files
    •39 papers with tables, charts & other data
        o >1000 sequences
        o 826 figures
        o 76 tables
        o 1 genome

    •No collective view of these data (37 journals)
    •No consistent way of citing NHM data
    •No mechanism to integrate or version
    •No way to repurpose data (retyping?)
The problem – collections data
   Hard to find, access, cite and integrate
   Initial problems
   •Don’t know / can’t find the website
     Botany http://www.nhm.ac.uk/research-curation/collections/search/results.jsp?mode=collections&department=32
     Entomology http://www.nhm.ac.uk/research-curation/collections/search/results.jsp?mode=collections&department=40
     Library http://www.nhm.ac.uk/research-curation/collections/search/results.jsp?mode=collections&department=36
     Mineralogy http://www.nhm.ac.uk/research-curation/collections/search/results.jsp?mode=collections&department=55
     Palaeontology http://www.nhm.ac.uk/research-curation/collections/search/results.jsp?mode=collections&department=34
     Zoology http://www.nhm.ac.uk/research-curation/collections/search/results.jsp?mode=collections&department=38
   •6 different data collections
   •23 interfaces & datasets of varying importance
     •No priority to collection datasets




   (from 119 specimens up to 28,000,000 specimens)
   •Entomology collections don’t exist (404)
   •Library doesn’t have any online collections!

   Bigger issues
   •Idiosyncratic browse or search
   •No maps, few images & very slow
   •No summary or statistics
   •No download, export or custom views
   •No integration with other data
   •No author info or update info
   •No means of specimen citation
   •No exports to GBIF or associated projects

   The data portal must correct these issues
The solution – data.nhm.ac.uk portal
   High level issues
 Functional requirements
 •A central access point for NHM research & collections data
 •The capacity to store, link and describe datasets
 •Integrated search & browse of datasets
 •The ability to cite datasets and specimen records in data sets
 •The ability to integrate collections data
 •Custom functions for sub-sections of data (e.g. initiatives, Virtual Herbarium)
 •The capacity to download, export & analyse data

 Principles
 •Open-by-default: Capacity for embargoed and private data
 •Sustainable: Self-populated by NHM staff (except collections data)

 Exclusions
 •Not a replacement for DAMS or KeEMu (a Web interface for these systems)
 •Publications out of scope (focused on data sets)
 •All annotations on data link back to the source (e.g. KeEMu)
The solution – data.nhm.ac.uk portal
  System Overview
  Scope (source data)
  •KeEMu (NHM)
  •HerbCat (Kew)
  •Other datasets (Species dictionary, initiatives, Scratchpads etc.)
  •User contributed datasets

  File types (formats)
  •DwC-A, PhyloXML, neXML, Nexus, Excel, CSV etc.

  Registry (discovery & download)
  •NHM specimens
  •Kew specimens
  •Other
  •Private

  Subportals (branded slices of data)
  •Subportal 1, e.g. Disease initiative
  •Subportal 2, e.g. Kew / NHM Virtual Herbarium

  Explorer
  •Map view
  •Table view
  •Statistics view
  •Analytic view (R)
Portal overview – adding data sets
        Quick & easy, semi-automated workflow

  1. Name the dataset
  2. Upload / link the data file
  3. Describe the data file
  4. Theme & tag
  5. Add additional resources
  6. Temporal coverage
  7. Geographic coverage
  8. Save & finish
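
The eight steps above could be captured as a single metadata record that is checked before saving. The following is only a minimal sketch: the `DatasetRecord` class, its field names and the validation rules are hypothetical and are not taken from the portal or from CKAN.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional, Tuple


@dataclass
class DatasetRecord:
    """Hypothetical metadata record; one group of fields per workflow step."""
    name: str                                   # 1. Name the dataset
    data_file: str                              # 2. Upload / link the data file (path or URL)
    description: str = ""                       # 3. Describe the data file
    theme: Optional[str] = None                 # 4. Theme & tag
    tags: List[str] = field(default_factory=list)
    resources: List[str] = field(default_factory=list)  # 5. Additional resources
    temporal_start: Optional[date] = None       # 6. Temporal coverage
    temporal_end: Optional[date] = None
    # 7. Geographic coverage as (west, south, east, north) in decimal degrees
    bounding_box: Optional[Tuple[float, float, float, float]] = None

    def problems(self) -> List[str]:
        """8. Save & finish: list the reasons the record cannot be saved yet."""
        issues = []
        if not self.name.strip():
            issues.append("name is required")
        if not self.data_file.strip():
            issues.append("data file is required")
        if (self.temporal_start and self.temporal_end
                and self.temporal_start > self.temporal_end):
            issues.append("temporal coverage starts after it ends")
        if self.bounding_box is not None:
            west, south, east, north = self.bounding_box
            if not (-180 <= west <= east <= 180 and -90 <= south <= north <= 90):
                issues.append("geographic bounding box is out of range")
        return issues


# Illustrative values only; not a real NHM dataset.
record = DatasetRecord(
    name="Example measurements",
    data_file="measurements.csv",
    tags=["example"],
    temporal_start=date(2012, 1, 1),
    temporal_end=date(2012, 12, 31),
    bounding_box=(-8.0, 49.0, 2.0, 61.0),
)
print(record.problems())
```

Only the name and the data file are treated as mandatory here, which is one way to keep the workflow quick; the real portal may require more.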
Portal overview – search interface
   Discovering research data sets



   Screenshot labels: Search; Results; Browse & search criteria; Datasets matching criteria; Individual dataset; Advanced display options
Portal overview – data set display
    Exploring research data sets

    Screenshot labels: Name; License; Authors; Tags; Download; Metadata about the dataset; Technical info. (extracted from data file); Geographic scope; “Social”; Developer tools
Portal overview – collections data
   Main interface

   Screenshot labels: No. records; No. georef. records; Toggle map, table & stats views; Search, download & display options; Zoomable map; Applied filters
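
The two counters on this screen (number of records and number of georeferenced records) depend on the applied filters. A minimal sketch of that calculation follows; the `Specimen` class, its fields and the sample rows are invented for illustration and do not reflect the KE EMu schema or the portal's code.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class Specimen:
    """Hypothetical specimen record with optional coordinates."""
    catalogue_number: str
    collection: str
    latitude: Optional[float] = None
    longitude: Optional[float] = None

    @property
    def georeferenced(self) -> bool:
        # A record can be mapped only if both coordinates are present.
        return self.latitude is not None and self.longitude is not None


def summarise(specimens: Iterable[Specimen],
              collection: Optional[str] = None) -> Dict[str, int]:
    """Apply an optional collection filter, then count records and mappable records."""
    matched = [s for s in specimens
               if collection is None or s.collection == collection]
    return {
        "records": len(matched),
        "georeferenced": sum(1 for s in matched if s.georeferenced),
    }


# Invented sample rows.
sample = [
    Specimen("A1", "Botany", 51.5, -0.18),
    Specimen("A2", "Botany"),
    Specimen("B1", "Zoology", -33.9, 18.4),
]
print(summarise(sample))
print(summarise(sample, collection="Botany"))
```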
Portal overview – collections data
       Additional interfaces
       Collections views: Tables; Statistical summary; Download
       Specimen record views: Full record; Summary preview; Data field mappings
Portal overview
     Some example data portals & software

Data.gov.uk & CKAN
•UK government data portal
•Uses CKAN, open-source data portal platform
•Used by national & regional governments
•Links into Drupal, DataCite & NHM systems
•http://data.gov.uk & http://ckan.org/



Canadensys & CartoDB
•Canadian network of biodiversity collections
•Almost 1 million specimens, 18 datasets
•Uses CartoDB mapping solution
•Create dynamic maps, analyse data and build location-aware geospatial applications
•Widely used, cloud data storage, PostGIS
•http://data.canadensys.net & http://cartodb.com/
Portal development
     Timeline & resources
Year 1 – Dataset discovery
•Technical & functional specification (Vizz. subcontract)
•Data workflows (KeEMu & research datasets)
•Functional alpha prototype (CKAN)

Year 2 – Visualisation
•Mapping & statistical functionality (CartoDB)
•Social and annotation functions
•Stable beta release at http://data.nhm.ac.uk

Year 3 – Citation & analysis
•DataCite DOIs on datasets & specimens
•Initial Web analytical functions (R)
•Initiative sub-portals including Virt. Herbarium

Resources
•1x Developer (Ben Scott) for 3 years
•Vizzuality subcontract (circa £xxk - TBC)
•ICT capital, travel & software (circa £25k)
Portal consultation
     Feedback & next steps
Documentation
•Overview specification - http://goo.gl/qjioh
•Project Initiation Document - http://goo.gl/oRr2j

Initial stakeholder meetings (Feb. – May)
•ICT Group (David Thomas, Chris Sleep & Gavin Malarky)
•Darrell Siebert and the KE EMu user group
•NHM Collections Committee & Initiative leaders
•Kew Gardens & Virtual Herbarium Reps.
•GBIF, NBN, UK DataCite team at BL, NERC
•Digital Facility Team
•Vizzuality
Wider consultation
•Example data types / sets
•Specialist search options & vocabularies
•Specialist Earth Science needs

Feedback & links
•Slides:
•Feedback: vince+portal@vsmith.info
•Specification: http://goo.gl/qjioh
•PID: http://goo.gl/oRr2j
Two more things
Wikipedian in Residence
•Four month post with Science Museum
•Starting March / April
•Work with NHM staff to improve Wikipedia
•Run events with NHM staff & volunteers
•Work with the GLAM group at Imperial College
•Focus on NHM science themes & specimens
•Not about promotion of “The NHM”



Biodiversity Informatics Workshop – May 2013
•One full day - date TBC
•Outputs from ViBRANT & e-Monocot
•Includes Scratchpads & the Biodiversity Data Journal
•What we do, how it's used and where we are going
•Includes links to NHM informatics & digitisation initiatives
Portal overview – data citation
      Unique identifiers for datasets & specimen records
Why cite data
•URLs are not persistent
•e.g. Wren JD: URL decay in MEDLINE - a 4-year follow-up study. Bioinformatics. 2008 Jun 1;24(11):1381-5 (circa 40% decay)
•Measure our digital footprint
•Puts research data on par with articles
•Facilitates data mining
What gets an identifier
•NHM specimen records (suffix of NHM IDs), e.g. http://dx.doi.org/BMNH_PBI_00388325
•NHM research datasets (files)

•Insert into publications

How to cite data
•Digital Object Identifiers (DOIs)
•Widely used & understood on articles
•Operates in collaboration with DataCite
•Part of an International consortium
•Mixes NHM data with other domains
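
As a rough illustration of using an NHM specimen ID as the DOI suffix, the sketch below joins a registrant prefix and a specimen ID and then builds a resolver link. The prefix `10.XXXX` is a placeholder, not an assigned NHM or DataCite prefix, and the function names are hypothetical; only the `http://dx.doi.org/` resolver and the example ID come from the slide.

```python
from urllib.parse import quote

DOI_RESOLVER = "http://dx.doi.org/"   # resolver shown on the slide
PLACEHOLDER_PREFIX = "10.XXXX"        # placeholder only, not a real assigned prefix


def specimen_doi(specimen_id: str, prefix: str = PLACEHOLDER_PREFIX) -> str:
    """Build a DOI string of the form <prefix>/<suffix> from a specimen ID."""
    suffix = specimen_id.strip()
    if not suffix:
        raise ValueError("specimen ID must not be empty")
    if not prefix.startswith("10."):
        raise ValueError("DOI prefixes start with '10.'")
    return f"{prefix}/{suffix}"


def doi_url(doi: str) -> str:
    """Return a resolver URL for the DOI, percent-encoding unsafe characters."""
    return DOI_RESOLVER + quote(doi, safe="/")


doi = specimen_doi("BMNH_PBI_00388325")
print(doi)
print(doi_url(doi))
```

Keeping the existing specimen ID as the suffix means a DOI can be derived from a catalogue record without a separate lookup table.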

Vince smith-delivering biodiversity knowledge in the information age-notext
 
The biodiversity informatics landscape: a systematics perspective
The biodiversity informatics landscape: a systematics perspectiveThe biodiversity informatics landscape: a systematics perspective
The biodiversity informatics landscape: a systematics perspective
 
Building data infrastructures for science
Building data infrastructures for scienceBuilding data infrastructures for science
Building data infrastructures for science
 
Don't make me think: biodiversity data publishing made easy
Don't make me think: biodiversity data publishing made easyDon't make me think: biodiversity data publishing made easy
Don't make me think: biodiversity data publishing made easy
 
Delivering biodiversity knowledge in the information age
Delivering biodiversity knowledge in the information ageDelivering biodiversity knowledge in the information age
Delivering biodiversity knowledge in the information age
 
The Biodiversity Informatics Landscape
The Biodiversity Informatics LandscapeThe Biodiversity Informatics Landscape
The Biodiversity Informatics Landscape
 
Don’t make me think: biodiversity data publishing made easy
Don’t make me think: biodiversity data publishing made easyDon’t make me think: biodiversity data publishing made easy
Don’t make me think: biodiversity data publishing made easy
 
Digitised collections: Toward a digital strategy for for the NHM, London
Digitised collections: Toward a digital strategy for for the NHM, LondonDigitised collections: Toward a digital strategy for for the NHM, London
Digitised collections: Toward a digital strategy for for the NHM, London
 
Virtual Research Environments supporting biodiversity research: Needs & prior...
Virtual Research Environments supporting biodiversity research: Needs & prior...Virtual Research Environments supporting biodiversity research: Needs & prior...
Virtual Research Environments supporting biodiversity research: Needs & prior...
 

2013 02 data portal science group update -v smith

  • 1. data.nhm.ac.uk NHM data portal update Part of the informatics initiative (2013-15) Vince Smith & Ben Scott
  • 2. The problem – research data Hard to find, access, cite and integrate • 45 available online (4 print only or behind pay walls) • 9 had supplementary data files • 39 papers with tables, charts & other data o>1000 sequences o826 figures o76 tables o1 genome • No collective view of these data (37 journals) • No consistent way of citing NHM data • No mechanism to integrate or version • No way to repurpose data (retyping?) 49 NHM science group papers in last 4 weeks Data via Carolyn Lowry e-mail, 13th Feb. 2013
  • 3. The problem – collections data Hard to find, access, cite and integrate Initial problems •Don’t know / can’t find the website
  • 4. The problem – collections data Hard to find, access, cite and integrate Initial problems •Don’t know / can’t find the website Botany http://www.nhm.ac.uk/research-curation/collections/search/results.jsp?mode=collections&department=32 Entomology http://www.nhm.ac.uk/research-curation/collections/search/results.jsp?mode=collections&department=40 Library http://www.nhm.ac.uk/research-curation/collections/search/results.jsp?mode=collections&department=36 Mineralogy http://www.nhm.ac.uk/research-curation/collections/search/results.jsp?mode=collections&department=55 Palaeontology http://www.nhm.ac.uk/research-curation/collections/search/results.jsp?mode=collections&department=34 Zoology http://www.nhm.ac.uk/research-curation/collections/search/results.jsp?mode=collections&department=38
  • 5. The problem – collections data Hard to find, access, cite and integrate Initial problems •Don’t know / can’t find the website •6 different data collections
  • 6. The problem – collections data Hard to find, access, cite and integrate Initial problems •Don’t know / can’t find the website •6 different data collections •23 interfaces & datasets of varying importance
  • 7. The problem – collections data Hard to find, access, cite and integrate Initial problems •Don’t know / can’t find the website •6 different data collections •23 interfaces & datasets of varying importance •No priority to collection datasets (from 119 specimens up to 28,000,000 specimens)
  • 8. The problem – collections data Hard to find, access, cite and integrate Initial problems •Don’t know / can’t find the website •6 different data collections •23 interfaces & datasets of varying importance •No priority to collection datasets •Entomology collections don’t exist (404)
  • 9. The problem – collections data Hard to find, access, cite and integrate Initial problems •Don’t know / can’t find the website •6 different collections •23 interfaces & datasets of varying importance •No priority to collection datasets •Entomology collections don’t exist (404) •Library doesn’t have any online collections!
  • 10. The problem – collections data Hard to find, access, cite and integrate Initial problems •Don’t know / can’t find the website •6 different collections •23 interfaces & datasets of varying importance •No priority to collection datasets •Entomology collections don’t exist (404) •Library doesn’t have any online collections! Bigger issues •Idiosyncratic browse or search
  • 11. The problem – collections data Hard to find, access, cite and integrate Initial problems •Don’t know / can’t find the website •6 different collections •23 interfaces & datasets of varying importance •No priority to collection datasets •Entomology collections don’t exist (404) •Library doesn’t have any online collections! Bigger issues •Idiosyncratic browse or search •No maps, few images & very slow •No summary or statistics •No download, export or custom views •No integration with other data •No author info or update info •No means of specimen citation •No exports to GBIF or associated projects The data portal must correct these issues
  • 12. The solution – data.nhm.ac.uk portal High level issues Functional requirements •A central access point for NHM research & collections data •The capacity to store/link and describe datasets •Integrated search & browse of datasets •The ability to cite datasets and specimen records in datasets •The ability to integrate collections data •Custom functions for sub-sections of data (e.g. initiatives, Virtual Herbarium) •The capacity to download, export & analyse data Principles •Open-by-default: Capacity for embargoed and private data •Sustainable: Self-populated by NHM staff (except collections data) Exclusions •Not a replacement for DAMS or KeEMu (a Web interface for these systems) •Publications out of scope (focused on datasets) •All annotations on data link back to the source (e.g. KeEMu)
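The "open-by-default" principle on this slide, with its allowance for embargoed and private data, can be illustrated with a small visibility rule. This is only a sketch under stated assumptions: the `Dataset` class, its `private` and `embargo_until` fields, and the `is_publicly_visible` function are hypothetical names invented for illustration and do not come from the portal specification.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Dataset:
    """Hypothetical dataset record; field names are assumptions."""
    name: str
    private: bool = False                  # explicitly withheld from the public
    embargo_until: Optional[date] = None   # first day the data may be shown


def is_publicly_visible(ds: Dataset, today: date) -> bool:
    """Open by default: hide only private or still-embargoed datasets."""
    if ds.private:
        return False
    if ds.embargo_until is not None and today < ds.embargo_until:
        return False
    return True


open_ds = Dataset("Example open dataset")
embargoed = Dataset("Example embargoed dataset", embargo_until=date(2014, 1, 1))
hidden = Dataset("Example private dataset", private=True)

print(is_publicly_visible(open_ds, date(2013, 2, 13)))
print(is_publicly_visible(embargoed, date(2013, 2, 13)))
print(is_publicly_visible(hidden, date(2013, 2, 13)))
```

The default values carry the principle: a dataset created with no extra settings is public, and restricting it requires a deliberate choice.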
  • 13. The solution – data.nhm.ac.uk portal System Overview [Diagram labels: Scope (Source Data): KeEMu (NHM); HerbCat (Kew); Other (Species dictionary, initiatives, Scratchpads etc); Private user contributed datasets. File types (formats): DwC-A, PhyloXML, neXML, Nexus, Excel, CSV etc… Registry (Discovery & download): NHM specimens; Kew specimens; Other datasets. Subportals (Branded slices of data): Subportal 1, e.g. Disease initiative; Subportal 2, e.g. Kew / NHM Virtual Herbarium. Explorer: Map view; Table view; Statistics view; Analytic view (R)]
  • 14. Portal overview – adding data sets Quick & easy, semi-automated workflow 1. Name the dataset 2. Upload / link the data file 3. Describe the data file 4. Theme & tag 5. Add additional resources 6. Temporal coverage 7. Geographic coverage 8. Save & finish
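The eight-step workflow on this slide can be pictured as a submission record that is checked before the final "Save & finish" step. The sketch below is illustrative only: `DatasetSubmission`, its field names, and the decision to treat step 5 (additional resources) as optional are all assumptions, not part of the portal design.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class DatasetSubmission:
    """Hypothetical record mirroring steps 1-7 of the slide."""
    name: str = ""                                          # step 1
    data_file: str = ""                                     # step 2: path or link
    description: str = ""                                   # step 3
    themes_and_tags: List[str] = field(default_factory=list)        # step 4
    additional_resources: List[str] = field(default_factory=list)   # step 5
    temporal_coverage: Optional[Tuple[int, int]] = None     # step 6: (start, end) year
    geographic_coverage: str = ""                           # step 7


def missing_steps(sub: DatasetSubmission) -> List[str]:
    """List the steps still to be completed (step 5 assumed optional)."""
    checks = [
        ("1. Name the dataset", bool(sub.name.strip())),
        ("2. Upload / link the data file", bool(sub.data_file.strip())),
        ("3. Describe the data file", bool(sub.description.strip())),
        ("4. Theme & tag", bool(sub.themes_and_tags)),
        ("6. Temporal coverage", sub.temporal_coverage is not None),
        ("7. Geographic coverage", bool(sub.geographic_coverage.strip())),
    ]
    return [label for label, done in checks if not done]


def can_save(sub: DatasetSubmission) -> bool:
    """Step 8 (save & finish) is allowed once nothing is missing."""
    return not missing_steps(sub)


draft = DatasetSubmission(name="Example dataset", data_file="example.csv")
for step in missing_steps(draft):
    print("Still to do:", step)
```

Reporting the outstanding steps by their slide labels keeps the check aligned with what a contributor sees in the form.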
  • 15. Portal overview – search interface Discovering research data sets [Screenshot labels: Search criteria; Browse & search; Results (datasets matching criteria); Individual dataset; Advanced display options]
  • 16. Portal overview – data set display Exploring research data sets License Name Authors Tags Download Metadata about the dataset Technical Info. (extracted from data file) Geographic Developer “Social” scope tools
  • 17. Portal overview – collections data Main interface [Screenshot labels: Toggle map, table & stats views; Search, download & display options; No. records; No. Georef. records; Zoomable map; Applied filters]
  • 18. Portal overview – collections data Additional interfaces Collections views Specimen record views Tables Statistical summary Full record Summary Data field Download preview mappings
  • 19. Portal overview Some example data portals & software Data.gov & CKAN •UK government data portal •Uses CKAN, open-source data portal platform •Used by national & regional governments •Links into Drupal, DataCite & NHM systems •http://data.gov.uk & http://ckan.org/ Canadensys & CartoDB •Canadian network of biodiversity collections •Almost 1 million specimens, 18 datasets •Uses CartoDB mapping solution •Create dynamic maps, analyze and build location aware and geospatial applications •Widely used, cloud data storage, PostGIS •http://data.canadensys.net & http://cartodb.com/
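Since the slide names CKAN as the platform, dataset discovery could in principle go through CKAN's action API, whose `package_search` action accepts `q` and `rows` parameters and returns a JSON object with `success` and `result` keys. The sketch below only builds the request URL and reads titles from a response-shaped dictionary; it makes no network call. Whether the NHM portal exposes this API at the base URL shown is an assumption, and the helper names and sample response are hypothetical.

```python
from urllib.parse import urlencode


def package_search_url(base_url: str, query: str, rows: int = 10) -> str:
    """Build a CKAN action-API URL for a free-text dataset search."""
    params = urlencode({"q": query, "rows": rows})
    return f"{base_url.rstrip('/')}/api/3/action/package_search?{params}"


def dataset_titles(response: dict) -> list:
    """Extract dataset titles from a package_search-style response."""
    if not response.get("success"):
        return []
    return [
        pkg.get("title", pkg.get("name", ""))
        for pkg in response["result"]["results"]
    ]


# Assumed base URL; the slides give it as the planned beta address.
url = package_search_url("http://data.nhm.ac.uk/", "herbarium specimens", rows=5)
print(url)

# Hand-written sample in the shape CKAN returns, not real portal output.
sample = {
    "success": True,
    "result": {"count": 1, "results": [{"name": "example", "title": "Example"}]},
}
print(dataset_titles(sample))
```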
  • 20. Portal development Timeline & resources Year 1 – Dataset discovery •Technical & functional specification (Vizz. subcontract) •Data workflows (KeEMu & research datasets) •Functional alpha prototype (CKAN) Year 2 – Visualisation •Mapping & statistical functionality (CartoDB) •Social and annotation functions •Stable beta release at http://data.nhm.ac.uk Year 3 – Citation & analysis •DataCite DOIs on datasets & specimens •Initial Web analytical functions (R) •Initiative sub-portals including Virt. Herbarium Resources •1x Developer (Ben Scott) for 3 years •Vizzuality subcontract (circa £xxk - TBC) •ICT capital, travel & software (circa £25k)
  • 21. Portal consultation Feedback & next steps Documentation •Overview specification - http://goo.gl/qjioh •Project Initiation Document - http://goo.gl/oRr2j Initial stakeholder meetings (Feb. – May) •ICT Group (David Thomas, Chris Sleep & Gavin Malarky) •Darrell Siebert and the KE EMu user group •NHM Collections Committee & Initiative leaders •Kew Gardens & Virtual Herbarium Reps. •GBIF, NBN, UK DataCite team at BL, NERC •Digital Facility Team •Vizzuality Wider consultation •Example data types / sets •Specialist search options & vocabularies •Specialist Earth Science needs FEEDBACK & LINKS Slides: Feedback: vince+portal@vsmith.info Specification: http://goo.gl/qjioh PID: http://goo.gl/oRr2j
  • 22. Two more things Wikipedian in Residence •Four month post with Science Museum •Starting March / April •Work with NHM staff to improve Wikipedia •Run events with NHM staff & volunteers •Work with the GLAM group at Imperial College •Focus on NHM science themes & specimens •Not about promotion of “The NHM” Biodiversity Informatics Workshop – May 2013 •One full day - date TBC •Outputs from ViBRANT & e-Monocot •Includes Scratchpads & the Biodiversity Data Journal •What we do, how it’s used and where we are going •Includes links to NHM informatics & digitisation initiatives
  • 23. Portal overview – data citation Unique identifiers for datasets & specimen records Why cite data •URLs are not persistent (e.g. Wren JD: URL decay in MEDLINE - a 4-year follow-up study. Bioinformatics. 2008 Jun 1;24(11):1381-5: circa 40% decay) •Measure our digital footprint •Puts research data on par with articles •Facilitates data mining What gets an identifier •NHM specimen records (suffix of NHM IDs), e.g. http://dx.doi.org/BMNH_PBI_00388325 •NHM research datasets (files) •Insert into publications How to cite data •Digital Object Identifiers (DOIs) •Widely used & understood on articles •Operates in collaboration with DataCite •Part of an international consortium •Mixes NHM data with other domains
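The idea of using an NHM specimen ID as the suffix of a DOI can be sketched as follows. A registered DOI always begins with a registrant prefix of the form `10.NNNN`, which the slide's example URL does not show, so the prefix below is a clearly hypothetical placeholder. The function names are also assumptions; only the resolver address and the specimen ID `PBI_00388325` come from the slide.

```python
from urllib.parse import quote

DOI_RESOLVER = "http://dx.doi.org/"   # resolver address shown on the slide
HYPOTHETICAL_PREFIX = "10.XXXX"       # placeholder; not a real registrant prefix


def specimen_doi(specimen_id: str, prefix: str = HYPOTHETICAL_PREFIX) -> str:
    """Form a DOI string by using the specimen ID as the suffix."""
    suffix = specimen_id.strip()
    if not suffix:
        raise ValueError("specimen_id must not be empty")
    return f"{prefix}/{suffix}"


def resolver_url(doi: str) -> str:
    """Turn a DOI string into a resolvable URL, escaping unsafe characters."""
    return DOI_RESOLVER + quote(doi, safe="/")


doi = specimen_doi("PBI_00388325")
print(doi)
print(resolver_url(doi))
```

Keeping the existing specimen ID as the suffix means a citation can be derived from the catalogue record without a separate lookup table, which appears to be the intent of "suffix of NHM IDs".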