Whither Small Data?
Some Thoughts on Managing Research Data
February 26, 2013
Anita de Waard
VP Research Data Collaborations, Elsevier RDS
a.dewaard@elsevier.com
Why should data be saved?
A. Hold scientists accountable: Data Preservation
  – Preserve record of scientific process, provenance
  – Enable reproducible research
B. Do better science: Data Use
  – Use results obtained by others!
  – Improve interdisciplinary work
C. Enable long-term access: Sustainable Models
  – Use for technology transfer; societal/industrial
    development
  – Reward scientists for data creation (credit/attribution)
  – Allow public/others insight/use of results
Where The Data Goes Now:
  > 50 My papers; 2 M scientists; 2 My papers/year
  – Majority of data (90%?) is stored on local hard drives
  – Some data (8%?) stored in large, generic data repositories:
      Dryad: 7,631 files; Dataverse: 0.6 My; DataCite: 1.5 My
  – A small portion of data (1-2%?) stored in small, topic-focused
    data repositories:
      PDB: 88.3 k; PetDB: 1.5 k; SedDB: 0.6 k; MiRB: 25 k; TAIR: 72.1 k
Key Needs:
  (Same figure as on the previous slide, overlaid with these labels)
  – DEVELOP SUSTAINABLE MODELS
  – INCREASE DATA PRESERVATION
A. Data Preservation:
• Issues:
  – Currently data is often used by single researchers or
    small groups: many different, idiosyncratic formats
  – Often not in electronic form (maps, images)
  – No metadata: when, where, by whom, WHY was this
    data collected?
• Needs:
  – Tools to make data export/storage simple and
    unavoidable
  – Policies that make data sharing mandatory and simple
  – Systems that reward data sharing/digitisation
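The four metadata questions above (when, where, by whom, why) could be captured at export time with a small record type. What follows is only a minimal sketch under stated assumptions, not a tool described in the talk: the class `DatasetMetadata`, its field names, and the helper `missing_fields` are all hypothetical.

```python
from dataclasses import dataclass, fields
from datetime import date
from typing import List, Optional


@dataclass
class DatasetMetadata:
    """Hypothetical minimal record: when, where, by whom, why."""
    collected_on: Optional[date] = None  # when
    location: str = ""                   # where
    collected_by: str = ""               # by whom
    purpose: str = ""                    # why


def missing_fields(record: DatasetMetadata) -> List[str]:
    """Return the names of the fields that are still empty."""
    return [f.name for f in fields(record) if not getattr(record, f.name)]


# Illustrative record with two of the four questions answered.
record = DatasetMetadata(collected_on=date(2013, 2, 26),
                         collected_by="A. Researcher")
print(missing_fields(record))
```

An export tool built along these lines could refuse to store a data set until `missing_fields` returns an empty list, which is one way to make the metadata step "simple and unavoidable".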
B. Data Use:
• Issues:
  – In generic data repositories, data cannot be used
    because of inadequate metadata, lack of quality
    review, lack of provenance
  – It’s expensive to make data useable!
  – Domain-specific data stores are not cross-
    searchable across discipline/national borders
• Needs:
  – Standardised metadata systems across
    systems/repositories and tools to apply them
    easily
  – Integration layers to enable cross-repository
    queries
  – A funding model to enable long-term preservation
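One way to read "integration layer" is a thin adapter that maps each repository's own field names onto a shared metadata schema before searching. The sketch below assumes two in-memory repositories with made-up records and field names; `FIELD_MAPS`, `REPOSITORIES`, `normalise`, and `cross_search` are hypothetical names, and no real repository API is being modelled.

```python
# Hypothetical mapping from each repository's local field names
# to a shared schema (title, author, subject).
FIELD_MAPS = {
    "repo_a": {"title": "title", "creator": "author", "keywords": "subject"},
    "repo_b": {"name": "title", "pi": "author", "tags": "subject"},
}

# Made-up records standing in for two domain-specific data stores.
REPOSITORIES = {
    "repo_a": [
        {"title": "Basalt samples", "creator": "Lee", "keywords": ["geochemistry"]},
    ],
    "repo_b": [
        {"name": "Sediment cores", "pi": "Kim", "tags": ["geochemistry", "marine"]},
        {"name": "Mouse brain scans", "pi": "Roy", "tags": ["imaging"]},
    ],
}


def normalise(repo, record):
    """Translate one repository-specific record into the shared schema."""
    mapping = FIELD_MAPS[repo]
    common = {mapping[key]: value for key, value in record.items() if key in mapping}
    common["source"] = repo  # keep provenance of the record
    return common


def cross_search(subject):
    """Return normalised records from all repositories tagged with `subject`."""
    hits = []
    for repo, records in REPOSITORIES.items():
        for record in records:
            common = normalise(repo, record)
            if subject in common.get("subject", []):
                hits.append(common)
    return hits


for hit in cross_search("geochemistry"):
    print(hit["source"], hit["title"])
```

The point of the design is that the query is written once against the shared schema; adding a repository means adding one entry to the field map rather than a new search routine.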
C. Sustainable Models:
• Issues:
  – Many successful domain-specific data repositories
    are running out of funding
  – Is adding metadata something you want to keep
    paying PhD+ scientists to do?
  – Unclear who foots the bill: the researcher? The
    institute? The grant agency? For how long?
• Needs:
  – Attribution models for rewarding scientists
  – Policies to improve cross-domain and cross-national
    collaborations
  – Funding models to sustain databases long-term
Linking papers to research data:
Database            Object Linked          Displayed
Pangaea             Google Maps location   Map with location
Protein Data Bank   PDB protein            3D protein visualisation
GenBank             Gene name              NCBI Gene Viewer
Exoplanets +        Exoplanet name         Rich information on extrasolar planets
Species +           Species name           Rich information on species
Towards ‘wrapping papers around data’
  1. Store metadata on all materials
  2. Track the methods while doing them
  3. Write papers that ‘wrap around’ this
  4. Don’t ‘send’ your papers; just expose them to the outside world
  5. Invite reviews; open data to trusted parties, at trusted time
  6. Allow apps/tools to integrate
[Figure: materials each labelled ‘metadata’; sample paper text: “Rats were
subjected to two grueling tests (click on fig 2 to see underlying data).
These results suggest that the neurological pain pro-”; activity labels:
Review, Edit, Revise; Calculate, coordinate…; Compile, comment, compare…]
Research Data Services:
A. Increase Data Preservation:
   Help increase the amount and quality of data
   preserved and shared
B. Improve Data Use:
   Help increase the value and usability of the data shared
   by increasing annotation, normalization, and provenance,
   enabling enhanced interoperability
C. Develop Sustainable Models:
   Help measure and deliver credit for shared data to the
   researchers, the institute, and the funding body,
   enabling more sustainable platforms.
Guiding Principles of RDS:
• In principle, all open data stays open, and URLs,
  front end etc. stay where they are (i.e. with the
  repository)
• Collaboration is tailored to each data repository’s
  unique needs/interests (‘service-model’ type):
  – Aspects where collaboration is needed are discussed
  – A collaboration plan is drawn up using a Service-Level
    Agreement: agree on time, conditions, etc.
• Transparent business model
• Very small (2-3 people) department; immediate
  communication; instant deployment of ideas
Three pilots:
1. Carnegie Mellon Electrophysiology Lab:
  A. Data Input: Develop a suite of tools to enable simple
     data capture on a handheld device, add metadata
     during the experiment, store it with raw traces and create
     a dashboard for viewing
  B. Data Use: Integrate with NIF and eagle-i ontologies,
     enable access through NIF; combine with other sources
2. ImageVault, with Duke CIVM:
  A. Data Input: Get 3D image data into a common format and
     resolution, annotated to allow comparison
  B. Data Use: View other image data sets & do image
     analytics
  C. Sustainable Models: Create funding for 3D image sets:
     free layer for raw data/subscription analytics.
3. IEDA Data Rescue Process Study
Data Rescue:
  – Identify 3-5 data sets that need to be ‘rescued’
  – Work with investigators to identify data sources,
    formats
  – Work with IEDA to define metadata standards,
    quality checks etc.
Data Rescue Process:
  – A group of data wranglers performs ‘electrification’
    and annotation
  – (Open source) software is developed where needed,
    to help this process
  – We help develop common standards, if needed
Data Rescue Process Study:
Jointly publish a report on a ‘gap analysis’ comparing
where we are now vs. where we need to be, including:
   – What we did (data imported, processes/standards
     created/described; software built; user tests,
     outcomes)
   – Effort involved (time, software, equipment, skills, etc.)
   – How easy it would be to scale up; what share of the data
     out there could be handled this way
   – Recommendations for tools and skills that are
     needed, if we want to scale up this process
Summary:
• Three key issues:
  A. Data Preservation
  B. Data Use
  C. Sustainable Models
• Elsevier’s approach:
  – Linking data to papers
  – Wrap papers around data
  – Explore role in the research data space
• Elsevier RDS:
  – Three pilots (CMU, Duke, IEDA) to investigate issues
  – We’ll report back in about a year!
Questions?

            Anita de Waard
VP Research Data Collaborations, Elsevier
       a.dewaard@elsevier.com


Editor's notes

  1. Are current modes of publication and excessive reliance on essentially only one medium (articles and books) serving scholarship or limiting it?