This document summarises a talk on open access to research data and peer review of data publications. It argues that, as a first step, the data underpinning journal articles should be made concurrently available in accessible databases, and cites the Royal Society's 2012 report advocating that all scientific literature and data be online and interoperable. Key issues in linking data to the scientific record are data persistence, data and metadata quality, and attribution and credit for data producers. Examples from astronomy show reuse of archived data leading to new publications, while a zoology study found that the availability of data sets declines sharply with article age. The document distinguishes levels of research data, from raw to processed to published, and surveys initiatives for open data publication and peer review.
Jonathan Tedds Distinguished Lecture at DLab, UC Berkeley, 12 Sep 2013: "The Open Research Challenge: Peer Review and Publication of Research Data"
1. THE OPEN RESEARCH CHALLENGE: PEER
REVIEW AND PUBLICATION OF RESEARCH DATA
Dr Jonathan Tedds jat26@le.ac.uk @jtedds
Senior Research Fellow,
D2K Data to Knowledge Research Group
(University of Leicester)
PI #PREPARDE http://www.le.ac.uk/projects/preparde
4. Science as an Open Enterprise Report
Why open?
• As a first step towards this intelligent
openness, data that underpin a journal article
should be made concurrently available in an
accessible database
• We are now on the brink of an achievable aim:
for all science literature to be online, for all of
the data to be online and for the two to be
interoperable. [p.7]
• Royal Society June 2012, Science as an Open
Enterprise,
http://royalsociety.org/policy/projects/science
-public-enterprise/report/
• Issues linking data to the scientific record:
– Data persistence
– Data and metadata quality
– Attribution and credit for data producers
• Geoffrey Boulton (Edinburgh), Lead author:
– “Science has been sleepwalking into crisis of
replicability...and of the credibility of science”
– “Publishing articles without making the data
available is scientific malpractice”
5. Data Reuse: asking new questions
Hubble Space Telescope
• Papers based upon reuse of archived observations now exceed those based on the
use described in the original proposal.
– http://archive.stsci.edu/hst/bibliography/pubstat.html
• See also work by Piwowar & Vision re life sciences: “Data reuse and the open data
citation advantage”
– http://peerj.com/preprints/1/
6. Oh, and…. says so :P
We are committed to openness in scientific research data to speed up the progress of
scientific discovery, create innovation, ensure that the results of scientific research are
as widely available as practical, enable transparency in science and engage the public
in the scientific process.
• To the greatest extent and with the fewest constraints possible publicly funded
scientific research data should be open, while at the same time respecting
concerns in relation to privacy, safety, security and commercial interests, whilst
acknowledging the legitimate concerns of private partners.
• Open scientific research data should be easily discoverable, accessible, assessable,
intelligible, useable, and wherever possible interoperable to specific quality
standards.
• To ensure successful adoption by scientific communities, open scientific research
data principles will need to be underpinned by an appropriate policy environment,
including recognition of researchers fulfilling these principles, and appropriate
digital infrastructure.
7. Scale of the problem: who, what, when, where…?
http://blogs.scientificamerican.com/absolutely-maybe/2013/09/10/opening-a-can-of-data-
sharing-worms/
• Timothy Vines and colleagues studied the availability of zoological data sets and how it changes over time
– gathered 516 papers published between 1991 and 2011
– then they tried to track the data down…
• Even tracking down the authors was a challenge
– Over time a dwindling minority of papers were accompanied by author email addresses that still
functioned
• Only 37% of the data sets - even from papers published in 2011 - were still findable and retrievable
– the proportion dropped for each earlier publication year
• For papers published in 1991
– only 7% of the data could be confirmed to still exist and be retrievable
– few authors could be found, and most of those reported that their data were lost or inaccessible
8. This isn’t new…
Henry Oldenburg
– inveterate correspondent
– now thought of as a scientist
• Had the idea to publish Philosophical
Transactions (1665):
o Should be written in the vernacular, not Latin
o Underlying evidence must be concurrently
published
o Helped propel Europe at the time
o Concept of scientific self-correction
• science able to right its own errors
• Wrote: "thought fit to employ the [printing]
press … Universal Good of Mankind"
o How do we achieve these ends in the post-
Gutenberg era?
9. Science ecosystems (Peter Fox, Rensselaer)
• These elements are what enable scientists to
explore/ confirm/ deny their research ideas
and collaborate!
• Abduction as well as induction and deduction
• Accountability, Proof, Explanation, Justification, Verifiability
• 'Transparency' -> Translucency
• Trust, Identity, Citability, Integrateability
10. Data as a “public good” (2011)
• Public good
• Preservation
• Discovery
• Confidentiality
• First use
• Recognition
• Public funding
12. So what do we mean by publishing data?
• The familiar:
– Supplementary tables via journal or
– Archived raw or calibrated facility data
– Discipline specific and institutional / national archives
• Data under the graph?
– In order to reproduce and adapt article analysis
• “Research ready” open data
– In order to reuse and repurpose
– for interdisciplinary researchers, community, business
– Ideally peer reviewed?
13. Research data example - level 1:
• A typical example from physical
sciences (astronomy) distinguishes
between broad categories within
the research data spectrum:
• raw/initially auto-processed data
produced at a research facility such
as an observatory
• typically made publicly
available in this format after an
embargo period of e.g. 1 year
• in some cases available
immediately - e.g. Swift
Gamma Ray Burst satellite
14. "research ready" processed data which has been fully calibrated,
combined and cleaned/annotated
• often produced by individuals or collaborations
• rarely available to anyone outside the collaboration except upon
request/collaboration
• needed for re-analysis or scientific reuse, unless you have detailed
sub-domain-specific knowledge and contextual information sufficient to
reproduce it from raw data
• considered to enable a competitive advantage for producers
• may be produced by dedicated data scientists on behalf of the
community for major survey/missions e.g. ESA XMM-Survey Science
Centre (Leicester), NASA, NCAR…
Research data example – level 2
15. output dataset – following
detailed analysis of research
ready datasets
• forms the data under the
graph in a journal publication
following analysis of research
ready datasets
• Might be available as a Table
via journal, CDS etc
• May not be available outside
the collaboration except upon
request/collaboration
• may well generate future
additional samples and papers
for the owning collaboration
on top of the original
• other researchers may request
the data for their own research
but may not get it!
Research data example – level 3
16. ….and STOP!
• Next project
– Proposal long since written
– Probably already underway…
• Feel free to email ME if you would like to work
on an idea using this dataset or code
– As long as I’m a co-author on the paper!
– You have to go through me to find out what you
really need to know to reuse the data/code
17. • published catalogue type representation of published output dataset
• NOT a “data paper”….but could be
• optional in many cases, mandatory for most major surveys
• usually made available via project specific online resource
• may be provided as table of parameters based on research
ready dataset, usually linked from and associated with a journal
• specifically produced in order for the wider community to reuse
(cite!) and repurpose if wanted
• The well-known Sloan Digital Sky Survey is a classic example; a more
recent one is the 2XMMi X-ray catalogue (the largest X-ray survey of
the sky), with which I am closely involved.
Research data example – level 4a
19. data paper describing and linking to output dataset(s)
Research data example – level 4b
Live Data paper! The dataset citation is the first thing in the paper and is also included in the reference list (to take advantage of citation count systems).
DOI: 10.1002/gdj3.2
20. Data Publications
• Initiatives for open publication of data,
datasets, data papers and open peer
review
• RDMF8 ‘Engaging with the publishers’:
http://www.dcc.ac.uk/events/research-
data-management-forum-rdmf/rdmf8-
engaging-publishers
• Particularly Rebecca Lawrence on peer
review policies
• Earth System Science Data:
http://www.earth-system-science-
data.net/
• Pensoft Data Publishing Policies and
Guidelines for Biodiversity Data
http://www.pensoft.net/news.php?n=59
Slide: Simon Hodson (Jisc / CODATA)
21. ODE Data Publication Pyramid:
Pubs
Supps
Data Archives
Data on Disks and in Drawers
(1) Top of the pyramid is stable but small
(2) Risk that supplements to articles turn into data dumping places
(3) Too many disciplines lack a community-endorsed data archive
(4) Estimates are that at least 75% of research data is never made openly available
22. From Mayernik et al (in prep) – PREPARDE project - Most cited
Bulletin of the American Meteorological Society (BAMS) articles.
Data from Web of Science, gathered on June 11, 2013
Article | Data paper? | Citations | Article details
1 | Yes | 10,113 | Kalnay, E; et al. The NCEP/NCAR 40-year reanalysis project, 1996.
2 | No | 3,201 | Torrence, C; Compo, GP. A practical guide to wavelet analysis, 1998.
3 | No | 2,367 | Mantua, NJ; et al. A Pacific interdecadal climate oscillation with impacts on salmon production, 1997.
4 | Yes | 1,987 | Kistler, R; et al. The NCEP-NCAR 50-year reanalysis: Monthly means CD-ROM and documentation, 2001.
5 | Yes | 1,791 | Xie, PP; Arkin, PA. Global precipitation: A 17-year monthly analysis based on gauge observations, satellite estimates, and numerical model outputs, 1997.
6 | Yes | 1,448 | Kanamitsu, M; et al. NCEP-DOE AMIP-II reanalysis (R-2), 2002.
7 | No | 1,014 | Baldocchi, D; et al. FLUXNET: A new tool to study the temporal and spatial variability of ecosystem-scale carbon dioxide, water vapor, and energy flux densities, 2001.
8 | Yes | 902 | Rossow, WB; Schiffer, RA. Advances in understanding clouds from ISCCP, 1999.
9 | Yes | 900 | Rossow, WB; Schiffer, RA. ISCCP cloud data products, 1991.
10 | No | 877 | Hess, M; Koepke, P; Schult, I. Optical properties of aerosols and clouds: The software package OPAC, 1998.
11 | No | 815 | Willmott, CJ. Some comments on the evaluation of model performance, 1982.
12 | No | 815 | Trenberth, KE. The definition of El Nino, 1997.
13 | Yes | 785 | Woodruff, SD; Slutz, RJ; et al. A comprehensive ocean-atmosphere data set, 1987.
14 | Yes | 776 | Meehl, G.A.; et al. The WCRP CMIP3 multimodel dataset - A
23. It’s a long road….
What do researchers need to make this all possible?
– Incentives - citations, promotion, support
• long way to go
– Institutional and funder policy framework
• mostly there now?
– Appropriate discipline specific community centres of expertise
• rare, mostly limited to big science niches or very broad but may be
poorly sustained
– Institutional support services for the basics
• pilots to date
– Software tools that are open and can be adapted
• on the way
24. PREPARDE: Peer REview for
Publication & Accreditation of Research
Data in the Earth sciences
Jonathan Tedds (Leicester), Sarah Callaghan (BADC), Fiona Murphy
(Wiley), Rebecca Lawrence (F1000R), Geraldine Stoneham (MRC),
Elizabeth Newbold (BL), Rachel Kotarski (BL), Matthew Mayernik
(NCAR), John Kunze, Carly Strasser (CDL), Angus Whyte (DCC), Becca
Wilson (Leicester), Simon Hodson (Jisc) and #PREPARDE project team
+ Geraldine Clement Stoneham (MRC), Elizabeth Newbold, Rachel
Kotarski (BL) on data peer review
http://www.le.ac.uk/projects/preparde
25. • Partnership formed between Royal
Meteorological Society and academic
publishers Wiley Blackwell to develop a
mechanism for the formal publication of data in
the Open Access Geoscience Data Journal
• GDJ publishes short data articles cross-linked
to, and citing, datasets that have been
deposited in approved data centres and
awarded DOIs (or other permanent identifier).
• A data article describes a dataset, giving details
of its collection, processing, software, file
formats, etc., without the requirement of novel
analyses or ground breaking conclusions.
• the when, how and why data was collected
and what the data-product is.
http://www.geosciencedata.com/
PREPARDE key use case: Geoscience Data Journal,
Wiley-Blackwell and the Royal Meteorological Society
26. • capture the processes and procedures required to publish a scientific dataset
– ingestion into a data repository
– formal publication in a data journal
• address key issues in data publication
– how to peer-review a dataset?
– what criteria are needed for a repository to be considered objectively
trustworthy?
– how can datasets and journal publications be effectively cross-linked for the
benefit of the wider research community?
• PREPARDE team includes key expertise in
– Research
– academic publishing
– data management
• Earth Sciences focus but produce general guidelines applicable to a wide range of
scientific disciplines and data publication types incl life sciences (F1000R)
PREPARDE: Peer REview for Publication & Accreditation of Research
Data in the Earth sciences http://www.le.ac.uk/projects/preparde
27. How: to publish data in GDJ
The traditional online journal model (any online journal system):
1) Author prepares the paper using word processing software with the journal template.
2) Author submits the paper as a PDF/Word file.
3) Reviewer reviews the PDF file against the journal's acceptance criteria.
The overlay journal model for publishing data (Geoscience Data Journal):
1) Author prepares the data paper using word processing software and the dataset using appropriate tools.
2a) Author submits the data paper to the journal.
2b) Author submits the dataset to a repository (e.g. BADC, BODC).
3) Reviewer reviews the data paper and the dataset it points to against the journal's acceptance criteria.
28. Live Data paper!
The dataset citation is the first thing in the paper and is also included in the reference list (to take advantage of citation count systems).
DOI: 10.1002/gdj3.2
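Putting the dataset citation first, and in the reference list, lets citation count systems pick it up. A minimal sketch of assembling such a citation, loosely following the DataCite-recommended form Creator (PublicationYear): Title. Publisher. Identifier; the helper name and example values are invented:

```python
# Hypothetical helper: formats a dataset citation in the DataCite-recommended
# style, so it can appear first in a data paper and in the reference list.
def format_data_citation(creators, year, title, publisher, doi):
    """Return a dataset citation string; `creators` is a list of 'Family, Given'."""
    authors = "; ".join(creators)
    return f"{authors} ({year}): {title}. {publisher}. doi:{doi}"

citation = format_data_citation(
    ["Tedds, J."], 2013, "Example output dataset",
    "Example Data Centre", "10.1234/example")
print(citation)
# -> Tedds, J. (2013): Example output dataset. Example Data Centre. doi:10.1234/example
```

Real data journals impose their own house citation styles; this only illustrates the principle of citing the dataset by creator, year and DOI rather than by URL.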
29. Trust: Repository accreditation
• Link between data paper and dataset is crucial!
• How do data journal editors know a repository
is trustworthy?
• How can repositories prove they’re
trustworthy?
• What makes a repository trustworthy?
• Many things: mission, processes, expertise,
workflows, history, systems, documentation, …
• Assessing trustworthiness requires assessing
the entire repository workflow
• PREPARDE / IDCC13 Workshop – report out soon!
• Peer review of data is implicitly peer review of
repository
And what does "trustworthy" mean, when you get right down to it?
30. DataCite Repository List
• working document
• initiated via a collaboration
between the British Library,
BioMed Central and the Digital
Curation Centre
• aims to capture growing number
of repositories for research data
• provided for information purposes
only:
• DataCite provides no
endorsements of quality or
suitability of the repositories
• encourage community
participation in developing
this resource
http://www.datacite.org/repolist/
31. Dryad Data Repository
JDAP: Joint Data Archiving Policy
Joint Data Archiving Policy: http://datadryad.org/jdap
Joint declarations, Feb 2010, in American Naturalist, Evolution, the Journal of Evolutionary
Biology, Molecular Ecology, Heredity, and other key journals in evolution and ecology:
http://www.journals.uchicago.edu/doi/full/10.1086/650340
This journal requires, as a condition for publication, that data supporting the results
in the paper should be archived in an appropriate public archive, such as GenBank,
TreeBASE, Dryad, or the Knowledge Network for Biocomplexity.
Allows embargos of up to one year;
allows exceptions for, e.g., sensitive
information such as human subject data
or the location of endangered species.
‘Data that have an established standard
repository, such as DNA sequences,
should continue to be archived in the
appropriate repository, such as GenBank.
For more idiosyncratic data, the data
can be placed in a more flexible digital
data library such as the National Science
Foundation-sponsored Dryad archive at
http://datadryad.org.'
Slide: Simon Hodson (Dryad / Jisc / CODATA)
32. PREPARDE and Bi-Directional Data Linking
• already have a link from the GDJ data
article to the data repository via DOI
• GDJ can also pull the standard DOI
metadata attached to that DOI from the
DataCite metadata store
• need to figure out a way so GDJ can
inform the repository that their dataset
has been cited/published!
• At this time, we have a manual work-
around (i.e. email)
• Workshop on cross-linking between
data centres and publishers 30th April
2013 at RAL, UK
• Report out soon!
[Diagram: BADC and NCAR data centres exchange standardised metadata with GDJ via the DataCite Metadata Store]
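GDJ pulling the standard DOI metadata from the DataCite metadata store can be sketched as below. The REST endpoint and JSON shape follow the public DataCite API as I understand it; treat both as assumptions, and note the example parses a canned response rather than making a live request:

```python
import json

def datacite_url(doi):
    """Build the (assumed) DataCite REST API URL for a DOI record."""
    return f"https://api.datacite.org/dois/{doi}"

def extract_core_metadata(response_text):
    """Pull title, publisher and year out of a DataCite JSON:API response body."""
    attrs = json.loads(response_text)["data"]["attributes"]
    return {
        "title": attrs["titles"][0]["title"],
        "publisher": attrs["publisher"],
        "year": attrs["publicationYear"],
    }

# A trimmed sample response, so the parsing can be seen without a network call:
sample = json.dumps({"data": {"attributes": {
    "titles": [{"title": "Example dataset"}],
    "publisher": "Example Data Centre",
    "publicationYear": 2013}}})
print(extract_core_metadata(sample))
```

In a live setting the journal would fetch `datacite_url(doi)` over HTTPS; the reverse link (repository learning that its dataset was cited) is exactly the gap the slide says is still handled manually.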
33. Peer review of data: the Perfect Disaster?
• Support for the peer review process
– scholars contributing peer reviews with little formal
reward
– opportunity to polish and refine understanding of the
cutting edge of research
• But peer review system under stress
– exploding number of journals, conferences, and grant
applications
– self-publication tools - blogs and wikis - allow scholars to
disseminate their research results and products
• Faster and more directly
• Now adding research data into the publication and peer
review queues …
34. Peer-review of data
• Technical
– author guidelines for GDJ
– Funder Data Value Checklist
– implicit peer review of repository?
• Scientific
– pre-publication?
– post-publication? E.g. F1000R
– guidelines on uncertainty e.g. IPCC
– discipline specific?
– EU Inspire spatial formatting
• Societal
– contribution to human knowledge
– reliability http://libguides.luc.edu/content.php?pid=5464&sid=164619
35. Open Peer Review of Data
ESSD peer review ensures that the datasets are:
Plausible with no immediately detectable problems;
Sufficiently high quality, with their limitations clearly stated;
Well annotated by standard metadata and available
from a certified data center/repository;
Customary with regard to their format(s) and/or
access protocol, and expected to be useable for the
foreseeable future;
Openly accessible (toll free)
Earth System Science Data journal:
http://www.earth-system-science-data.net/
Rebecca Lawrence, Data Publishing: peer review,
shared standards and collaboration,
http://www.dcc.ac.uk/events/research-data-
management-forum-rdmf/rdmf8-engaging-
publishers
Faculty 1000 Open Peer Review
Sanity check:
Format and suitable basic structure adherence
A standard basic protocol structure is adhered to
Data stored in the most appropriate and stable
location
Open Peer Review:
Is the method used appropriate for the scientific
question being asked?
Has enough information been provided to be able to
replicate the experiment?
Have appropriate controls been conducted, and the
data presented?
Is the data in a useable format/structure?
Are stated data limitations and possible sources of error appropriately described?
Does the data 'look' OK? (optional; e.g. microarray data)
36. Draft Recommendations on Peer-
review of data
• Summary Recommendations from
Workshop at British Library, 11 March
2013
• Workshop attendees included funders,
publishers, repository managers,
researchers ….
• Draft recommendations put up for
discussion and feedback captured
• Feedback from the community still
welcome
• 2nd workshop 24 June: put
recommendations to peer reviewers!
http://libguides.luc.edu/content.php?pid=5464&sid=164619
Document at: http://bit.ly/DataPRforComment
Feedback to: https://www.jiscmail.ac.uk/DATA-
PUBLICATION
37. Draft Recommendations on data peer review
Summary Recommendations from Workshop at the British Library, 11 March 2013
• Connecting data review with data management planning
• Connecting scientific, technical review and curation
• Connecting data review with article review
• 4-5 draft recommendations in each of above
• Assist Researchers, Publishers, Journal Editors, Reviewers,
Data Centres, Institutional Repositories to map requirements
for data peer review
• Matrix of stakeholders vs processes
– Assist in assigning responsibilities for given context
– New for most disciplines
– Learn from disciplines where this already happens
38. Connecting data review with data management planning
1. All research funders should at least require a “data sharing plan” as part of
all funding proposals, and if a submitted data sharing plan is inadequate,
appropriate amendments should be proposed.
2. Research organisations should manage research data according to
recognised standards, providing relevant assurance to funders so that
additional technical requirements do not need to be assessed as part of the
funding application peer review. (Additional note: Research organisations
need to provide adequate technical capacity to support the management of
the data that the researchers generate.)
3. Research organisations and funders should ensure that adequate funding is
available within an award to encourage good data management practice.
4. Data sharing plans should indicate how the data can and will be shared and
publishers should refuse to publish papers which do not clearly indicate how
underlying data can be accessed, where appropriate.
39. Connecting scientific, technical review and curation
1. Articles and their underlying data or metadata (by the same or other
authors) should be multi-directionally linked, with appropriate
management for data versioning.
2. Journal editors should check data repository ingest policies to avoid
duplication of effort, but provide further technical review of important
aspects of the data where needed. (Additional note: A map of
ingest/curation policies of the different repositories should be generated.)
3. If there is a practical/technical issue with data access (e.g. files don’t open
or exist), then the journal should inform the repository of the issue. If
there is a scientific issue with the data, then the journal should inform the
author in the first instance; if the author does not respond adequately to
serious issues, then the journal should inform the institution who should
take the appropriate action. Repositories should have a clear policy in
place to deal with any feedback.
40. Connecting data review with article review
1. For all articles where the underlying data is being submitted, authors need to
provide adequate methods and software/infrastructure information as part of
their article. Publishers of these articles should have a clear data peer review
process for authors and referees.
2. Publishers should provide simple and, where appropriate, discipline-specific data
review (technical and scientific) checklists as basic guidance for reviewers.
3. Authors should clearly state the location of the underlying data. Publishers should
provide a list of known trusted repositories or, if necessary, provide advice to
authors and reviewers of alternative suitable repositories for the storage of their
data.
4. For data peer review, the authors (and journal) should ensure that the data
underpinning the publication, and any tools required to view it, should be fully
accessible to the referee. The referees and the journal need to then ensure
appropriate access is in place following publication.
5. Repositories need to provide clear terms and conditions for access, and ensure
that datasets have permanent and unique identifiers.
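The requirement that datasets carry permanent and unique identifiers can be supported by even a simple syntactic check at submission time. A minimal sketch; the regex covers only the common '10.prefix/suffix' DOI shape, not whether the DOI actually resolves:

```python
import re

# Minimal shape check for DOIs as permanent identifiers: a '10.' registrant
# prefix of 4-9 digits, a '/', and a non-empty suffix. Syntax only.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")

def looks_like_doi(identifier):
    """Return True if the string has the basic 10.NNNN/suffix DOI shape."""
    return bool(DOI_PATTERN.match(identifier))

print(looks_like_doi("10.1002/gdj3.2"))    # the GDJ data paper DOI from the talk
print(looks_like_doi("example.com/data"))  # a plain URL is not a DOI
```

A journal or repository would pair such a check with an actual resolution test against the DOI system before accepting the identifier.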
41. Publishing research data
• Research is heavily context specific
=> So keep it context specific?
• Publishers and Professional & Learned Societies can help
galvanise agreement of researchers
• To define how they want their data represented and preserved for reuse &
citation => your researchers need you
• Along with funder appointed peer review committees often stronger
connection than to institution
• Institutional managers/services cannot cover wide range of discipline specific
expertise
• Just as publishers don't cover all fields
• Societies remain a relatively constant connection as a career progresses, while researchers:
• change host institution
• change field(s)
• change technique
• In the UK, RCUK funding for APCs (article processing charges) is strictly for articles only….
– How to fund APCs for depositing data in repositories?
– Will publishers charge APCs to publish data papers?
42. Research and the long tail
Slide: Carole Goble (DCC roadshow East Midlands, 2012-02-07, CC-BY-SA)
[Diagram contrasting the head of the curve (public data sets such as PDB, GenBank, UniProt, Pfam, CATH/SCOP protein structure classifications and ChemSpider, produced at industrial scale via high-throughput experimental methods and commons-based production, and preserved) with the long tail of spreadsheets and notebooks: local, lost, cherry-picked results]
43. Enabling Open Data Publishing
• Active Data Management Planning
– built in at proposal stage
• Local institutional tweaks of funder and local templates
– Implemented and evolved in project
• Data Management Plan as a live, evolving object
• Annotate data on the fly – lab notebook approach
– Curated & preserved using permanent identifiers
• Appropriate repository and data collection descriptors
44. Active Data Storage: Identifying the Holy Grail ?
• “what is needed is a tool to transparently sync local and
network storage” (Marieke Guy, JISCMRD2 Nottingham)
• CKAN (Orbital, Lincoln)?
• Research Data Toolkit (Herts) – hybrid solutions
• UC3 suite at CDL…
• DropBox-like functionality a must
• Usability
• Technical interoperability
• aim to help create databases for research data so
• facilitates collaboration and data sharing
• enables the subsequent publication of datasets
• challenge is to ensure data are
• documented
• Preserved
• service is sustainable
46. Halogen as template for research data management #jiscmrd
Requirements Analysis – must be iterative!
Data Management Plan – use DMPonline (UK Digital Curation Centre)
Scalable research data management infrastructure
pilot phase to nationally available resource
LAMP stack IT infrastructure: host research database – work with JISC/DCC
A model for the long term delivery of a data management service within the
institution including
support, maintenance, governance & charging policies
Include researchers, IT services, research support office, library services etc.
48. BENEFITS
New research opportunities
Cross database work – seed new research samples
Scholarly communication/access to national resources
Key to English Place Names (Nottingham)
Portable Antiquities Scheme (British Museum)
Verification, re-purposing, re-use of data
Cleaning & enhancing private research datasets for reuse & correlation
No re-creation of data
Increased transparency
excellent training for best practice in research data management
Increasing research productivity
Build in cleaning, annotation, enhancement into normal research workflows
research datasets may immediately be reusable and interoperable
Impact & Knowledge Transfer
Reuse IT infrastructure
Increasing skills base of researchers/students/staff
50. CHALLENGES
interdisciplinary research database
ingest each input dataset in form such that sufficient information is carried forward
to enable interoperation
Cultural differences
versioning & provenance for input datasets
which software tools, infrastructure, query interface?
suitable for multi disciplinary researchers
Requirements upon the institution for sustaining the research assets & skills
Requirements upon the researchers
Annotating
Refreshing
Maintenance of datasets
55. Suggested timeline for implementing
institutional research data management
From Whyte & Tedds (2011), DCC Briefing
http://www.dcc.ac.uk/resources/briefing-papers/making-case-rdm
56. Challenge for institutions
– Rise to scientific and research challenge
• Not just a management challenge
• Responsibility for the knowledge they create
– Library
• “Doing the wrong things through the wrong people”?
• Challenge for library to enable:
• curation of data and publications
• active support from data scientists
• from centralised to dispersed support
• Expert centres such as D-Lab essential!
– IT Service
• Provide research data platforms for researchers:
– Active storage
– Enable collaboration
– Connect to preservation services through Library
57. But that’s not all…
What about the software underpinning data driven
research?
If we’re going to publish as open data:
How do we help researchers to store, annotate and
discover the datasets they create?
How do you sustain and reuse that?
58. Biomedical Research Infrastructure Software Service Kit
A vision for cloud-based open source research applications
#BRISSKit
http://www.brisskit.le.ac.uk
59. BRISSKit context:
The I4Health goal of applying knowledge engineering to close the
‘ICT gap’ between research and healthcare (Beck, T. et al 2012)
62. The semantic bridge
• OBiBa Onyx: records participant consent, questionnaire data and primary specimen IDs
• i2b2: cohort selection and data querying
• Bio-ontology as the bridge between them!
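The bridge idea can be illustrated in miniature: questionnaire fields captured in Onyx are mapped to ontology concept codes so that i2b2 can query cohorts over them. This sketch is entirely hypothetical; the field names, fact layout and the use of these particular SNOMED-style codes are invented for illustration, not taken from BRISSKit:

```python
# Entirely hypothetical mapping table from Onyx questionnaire field names
# to ontology concept codes (the codes here are illustrative assumptions).
ONYX_TO_ONTOLOGY = {
    "smoking_status": "SNOMED:77176002",
    "diabetes_type2": "SNOMED:44054006",
}

def to_i2b2_facts(participant_id, answers):
    """Translate raw questionnaire answers into (patient, concept) fact rows."""
    facts = []
    for field, value in answers.items():
        concept = ONYX_TO_ONTOLOGY.get(field)
        if concept and value:  # only map known fields with affirmative answers
            facts.append((participant_id, concept))
    return facts

print(to_i2b2_facts("P001", {"smoking_status": True, "unknown_field": True}))
```

The point is the shape of the translation, not the codes: the ontology layer is what lets a cohort query posed in clinical terms reach data collected in research terms.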
63. Research Software Sustainability
• OS community engagement
• standards compliance
• consortium approach
• work with grain of researchers
• discipline specific forks?
• Github versioning an example for research data?
• OS Community Engagement Charter
• defining engagement with existing & new OS
communities
• including adoption & code commitments
See Rob Baxter blog: “The research software engineer”
• http://dirkgorissen.com/2012/09/13/the-research-
software-engineer/
64. Lessons for institutions?
Can’t do it all in house!
But many disciplines don’t have data centres
Build coalition of institutional actors
Essential to have high level support
Take and shape
Identify what you do have in-house
Access external tools, standards where possible
Active storage, collaboration, eprints…
Propose best of breed for (inter)national reuse
Share benefits (and costs) over academic networks
Sustainability the key challenge
As much cultural as technical – needs networks…
66. Accepted Research Data Alliance Interest Group
Publishing Data
• http://rd-alliance.org/
• Close coordination with ICSU-WDS working group, CODATA and other
ongoing initiatives in data publication
– WDS under International Council of Science, RDA wider
– Avoid duplication within related RDA and WDS WGs – join up
– For WDS partnerships between publishers and data centres key
• scope the territory – gap analysis
• Use RDA Forum and new http://jiscmail.ac.uk/data-publication 350+ list
• Take findings from RDA / WDS group(s) and trial in other communities /
disciplines / institutional repositories
68. “Keep reaching for the stars”
• increase the trustworthiness
and value of individual data
sets
• strengthen the findings based
on cited data sets
• increase the transparency and
traceability of data and
publications
• enable reuse and repurposing
i.e. Problems but extraordinary
opportunities – all hands on deck!
69. Thank you for listening and thanks to CDL, D-
Lab and the project partners
Dr Jonathan Tedds jat26@le.ac.uk @jtedds
Senior Research Fellow,
D2K Data to Knowledge Research Group
(University of Leicester)
#PREPARDE http://www.le.ac.uk/projects/preparde
Mailing list: http://jiscmail.ac.uk/DATA-PUBLICATION