Managing and Sharing Research Data: Good practices for an ideal world...in the real world.
1. Managing and Sharing
Research Data:
Good Practices for an Ideal World…
in the Real World
Martin Donnelly
Digital Curation Centre
University of Edinburgh
University of Sheffield
19 January 2012
2. Running Order
1. Introduction
2. What is meant by managing research data?
3. Research data management and research ethics/integrity
4. Context and policy
5. The Why
Pt. 1 – It’s A Good Thing
Pt. 2 – Carrots
Pt. 3 – Sticks
6. Practicalities and Moving Forward
7. Sheffield Stories
8. Last Words
9. Q+A
4. Digital Curation Centre
- Founded in 2004 to support research in UK higher and further
education in the preservation, curation and management of
digital resources
- Major funder is JISC
- Original focus on publications / biblio; now more emphasis on
research data management
- Support to JISC projects, especially the two Managing Research
Data programmes...
http://www.jisc.ac.uk/whatwedo/programmes/di_researchmanagement/managingresearchdata.aspx
- Tools, training, guidance, consultancy, other resources/studies…
- Three partner sites: Edinburgh (lead), Bath and Glasgow
6. What is meant by managing
research data?
Lots of strands…
- Ensuring physical integrity of files and helping to preserve them
- Ensuring safety of content (data protection, ethics, etc)
- Describing the data (via metadata) and recording its history
- Providing or enabling appropriate access at the right time, or
restricting access, as appropriate
- Transferring custody at some point, and possibly destroying the data
In short, RDM means meeting funder, institutional,
disciplinary and other requirements/norms across various
areas and at different times, in sympathy with the nature
of the data itself, for the benefit of yourself, your
institution, and the wider community, as appropriate.
8. RDM and research ethics/integrity
- RDM is increasingly seen as a core research competency, along with things
like writing and referencing (see RCUK Common Principles >>)
9. Policy Streamlining
RCUK Common Principles on Data Policy
Key messages:
1. Data are a public good
2. Adherence to community standards and best practice
3. Metadata for discoverability and access
4. Recognise constraints on what data to release
5. Permit embargo periods delaying data release
6. Acknowledgement of / compliance with T&Cs
7. Data management and sharing activities should be explicitly funded
http://www.rcuk.ac.uk/research/Pages/DataPolicy.aspx
10. RDM and research ethics/integrity
- RDM is increasingly seen as a core research competency, along with things
like writing and referencing (see RCUK principles >>)
- Research outputs (which constitute the scientific record) are often based on
the collection, analysis and processing of data / sources / information
- Reproducibility and verifiability are fundamental principles in many
disciplines. In other disciplines, including those where research cannot be
replicated such as social and environmental sciences, the longevity of the
data from which the findings are derived is equally crucial
- Some data is unique and cannot be replaced if destroyed or lost, yet only by
referring to trustworthy data can research be judged as sound
- Therefore data must be accessible and comprehensible in order to back up
claims, and enable third parties to reproduce (or validate) results
- Additionally, there is increasing demand for public (or Open) access to
publicly-funded research outputs, including data, but more on that later…
12. Institutional and funder perspectives
- Research today is technology enabled and data intensive
- Data as long-term asset; identify and preserve
- The fragility and cost of digital data; curate to reuse and
preserve
- Data sharing: research pooling, cross-disciplinary and global
partnering, new research from old, the wealth of knowledge
- The cost of technology and human infrastructures
- Pressure to show return on public investment of £3.5bn
- Compliance with legislation and funder policies
- The data deluge: volume and complexity, not just in HEIs
- Financial and human consequences from lost data
- The cost of administering unmanaged datasets
13. Context
“For science to effectively function, and for society to
reap the full benefits from scientific endeavours, it is
crucial that science data be made open”
Surfing the Tsunami
Science, 11 February 2011
15. Policy
RCUK Policy and Code of Conduct on the Governance of Good Research Conduct, 2008 (updated October 2011)
UNACCEPTABLE RESEARCH CONDUCT includes mismanagement or inadequate preservation of data and/or primary materials, including failure to:
- keep clear and accurate records of the research procedures followed and the results obtained, including interim results;
- hold records securely in paper or electronic form;
- make relevant primary data and research evidence accessible to others for reasonable periods after the completion of the research: data should normally be preserved and accessible for 10 yrs (in some cases 20 yrs or longer);
- manage data according to the research funder’s data policy and all relevant legislation;
- wherever possible, deposit data permanently within a national collection.
Responsibility for proper management and preservation of data and primary materials is shared between the researcher and the research organisation.
EPSRC expects all those institutions it funds
- to develop a roadmap that aligns their policies and processes with EPSRC’s expectations by 1st May 2012;
- to be fully compliant with these expectations by 1st May 2015.
Compliance will be monitored and non-compliance investigated.
Failure to share research data could result in the imposition of sanctions.
17. The Why (pt. 1)
It’s A Good Thing
– Data as a public good (see RCUK Common Principles)
– Others can build upon your work (the Shoulders of
Giants, Newton) and it may be useful in ways you did
not foresee, beyond your discipline (‘fresh eyes and
new techniques or approaches’)
– Passing custody enables you to leave the preservation
legwork to the specialists
– You won’t be around forever, but your work might be
18. The Why (pt. 2)
Incentives, or “Why Should I Spend Time On This
When I Have Other Things To Worry About?”
- Impact. Linking papers to data increases citation rates,
see for example Henneken & Accomazzi, Smithsonian
Astrophysical Observatory:
http://arxiv.org/PS_cache/arxiv/pdf/1111/1111.3618v1.pdf (pre-print)
- Warning! Some numbers follow…
19. Institutional cost saving
Researcher career benefits
Growing popularity of re-use
Sharing as a catalyst
for discovery
http://www.dcc.ac.uk/resources/briefing-papers
21. Impact
- Making data accessible increases citation rates
- Better for authors; better for publishers
- Piwowar, Day & Fridsma (2007):
- 45% of studies make data accessible
- They receive 85% of citations
- N.B. correlation is not causation…
doi:10.1371/journal.pone.0000308
4th DCC Roadshow - Oxford. Kevin Ashley, DCC, CC-BY-SA, 2011-09-14
22. Key findings
- 2.98 more publications per dataset if archived
- 2.77 more if ‘informally shared’
Or correct for some confounding factors…
- 2.42 more if archived
- 2.31 more if informally shared
[Bar chart: Archived, Shared and Not shared; Raw vs Corrected; vertical axis 0 to 3]
“The enduring value of social science research: The use and reuse of primary research data”
Amy M. Pienta, George Alter, Jared Lyle
http://hdl.handle.net/2027.42/78307
Presented in Torino, April 2010: “Organisation, Economics and Policy of Scientific Research”
23. The Why (pt. 2)
More incentives…
- Increased citations help with the
Research Excellence Framework
- Research councils are increasingly
rejecting submissions on the basis of
poor data management plans
- So you get more funding if you do
this right…
24. The Why (pt. 3)
Sticks…
- Some funders require you to make your data available for many
years after project funding has ceased. So laying adequate data
preservation foundations should be near the top of your list
when planning any new research project.
- Funder rejections on basis of poor data management.
- EPSRC roadmap requirement (N.B. It is likely that DMPs will form
part of many institutional infrastructures) - the institution has
overall responsibility for this, but everyone will need to play a
part, and EPSRC is an important funder at Sheffield. Others may
follow suit…
25. The Why (pt. 3)
Government pressure on RCs…
6.9 The Research Councils expect the researchers they fund to deposit published
articles or conference proceedings in an open access repository at or around the
time of publication. But this practice is unevenly enforced. Therefore, as an
immediate step, we have asked the Research Councils to ensure the researchers
they fund fulfil the current requirements. Additionally, the Research Councils
have now agreed to invest £2 million in the development, by 2013, of a UK
‘Gateway to Research’. In the first instance this will allow ready access to
Research Council funded research information and related data but it will be
designed so that it can also include research funded by others in due course. The
Research Councils will work with their partners and users to ensure information is
presented in a readily reusable form, using common formats and open standards.
http://www.bis.gov.uk/assets/biscore/innovation/docs/i/11-1387-innovation-
and-research-strategy-for-growth.pdf
26. The Why (pt. 3)
- In addition to funders and institutions, prestige journals like Science and Nature already
have data policies in place, and the tendency is towards increasing requirements and
scrutiny here as well as with the funders…
Nature and Science data policies
Nature
Such material must be hosted on an accredited independent site (URL and accession numbers to be provided by the author), or sent to the Nature journal
at submission, either uploaded via the journal's online submission service, or if the files are too large or in an unsuitable format for this purpose, on
CD/DVD (five copies). Such material cannot solely be hosted on an author's personal or institutional web site.[4]
Nature requires the reviewer to determine if all of the supplementary data and methods have been archived. The policy advises reviewers to consider
several questions, including: "Should the authors be asked to provide supplementary methods or data to accompany the paper online? (Such data might
include source code for modelling studies, detailed experimental protocols or mathematical derivations.)"[5]
Science
Database deposition policy – Science supports the efforts of databases that aggregate published data for the use of the scientific community.
Therefore, before publication, large data sets (including microarray data, protein or DNA sequences, and atomic coordinates or electron microscopy
maps for macromolecular structures) must be deposited in an approved database and an accession number provided for inclusion in the published
paper.[6]
Materials and methods – Science now requests that, in general, authors place the bulk of their description of materials and methods online as
supporting material, providing only as much methods description in the print manuscript as is necessary to follow the logic of the text. (Obviously, this
restriction will not apply if the paper is fundamentally a study of a new method or technique.)[7]
REFERENCES
[4] “Availability of Data and Materials: The Policy of Nature Magazine”
[5] “Guide to Publication Policies of the Nature Journals,” published March 14, 2007.
[6] “General Policies of Science Magazine”
[7] “Preparing Your Supporting Online Material”
- Finally, a data management plan requirement is very likely to feature in EC FP8 (“Horizon
28. Practicalities
…or, Areas Where The DCC Can Help
- Assessing Need
- Delivering Support
- Developing Strategic Institutional
Research Data Management Support
- Policy
- Advocacy
- Planning
- Tools
- Training
www.dcc.ac.uk
29. Three areas for thought
1. Documentation and metadata
2. Backup
3. Depositing data for the long term
30. Documentation and Metadata
- Could you, or someone else, make sense of
your data five years from now? What about
five minutes from now?
- Metadata is ‘data about data’
- Simple documentation (study level)
– Use consistent file names and informative labels
– Version control
– E.g. ABC_Study4_output_2012-01-19_v1.xls
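One way to keep file names consistent is to generate and check them with a small script rather than typing them by hand. The sketch below is only an illustration: the field layout (project, study, content, date, version) is an assumption inferred from the example name above, and the function names build_name, parse_name and next_version are hypothetical.

```python
import re
from datetime import date

# Assumed layout, modelled on the example name on this slide:
# <project>_<study>_<content>_<YYYY-MM-DD>_v<N>.<ext>
NAME_PATTERN = re.compile(
    r"^(?P<project>[A-Za-z0-9]+)_(?P<study>[A-Za-z0-9]+)_(?P<content>[A-Za-z0-9]+)"
    r"_(?P<date>\d{4}-\d{2}-\d{2})_v(?P<version>\d+)\.(?P<ext>[A-Za-z0-9]+)$"
)


def build_name(project, study, content, day, version, ext):
    """Assemble a file name from its parts, writing the date as YYYY-MM-DD."""
    return f"{project}_{study}_{content}_{day.isoformat()}_v{version}.{ext}"


def parse_name(name):
    """Split a conforming file name into its parts, or return None if it does not conform."""
    match = NAME_PATTERN.match(name)
    if match is None:
        return None
    parts = match.groupdict()
    parts["version"] = int(parts["version"])
    return parts


def next_version(name):
    """Return the same file name with the version number increased by one (date unchanged)."""
    parts = parse_name(name)
    if parts is None:
        raise ValueError(f"not a conforming file name: {name}")
    return build_name(
        parts["project"], parts["study"], parts["content"],
        date.fromisoformat(parts["date"]), parts["version"] + 1, parts["ext"],
    )
```

Writing the date as year-month-day means that an alphabetical file listing is also a chronological one, and parse_name can be used to flag files that have drifted from the convention.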
31. Documentation and Metadata
- You may wish to maintain a separate log of high
level metadata about each dataset (text file,
spreadsheet or database)
- Research context (when, where, who)
- Data history (preparation, processing)
- Where and how to access the data
- Access rights and permissions
- Link to supplementary materials, related data,
documents, publications
- Wherever possible, use standardised
vocabularies and metadata formats
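As a rough illustration of the separate log described above, the sketch below appends one row per dataset to a CSV file and refuses entries with blank fields. The column names are assumptions derived from the bullet points on this slide, not a standard metadata schema, and the example values are invented placeholders; where a disciplinary metadata standard exists, it should take precedence.

```python
import csv
import os

# Illustrative column names based on the bullets above (not a formal standard).
FIELDS = [
    "dataset", "when", "where", "who", "history",
    "access_location", "access_rights", "related_materials",
]

# Placeholder values for demonstration only.
EXAMPLE_ENTRY = {
    "dataset": "ABC_Study4",
    "when": "2012-01-19",
    "where": "Example site",
    "who": "Example researcher",
    "history": "Raw output exported from instrument; no processing yet",
    "access_location": "Departmental shared drive (example)",
    "access_rights": "Project team only",
    "related_materials": "None yet",
}


def append_entry(log_path, entry):
    """Append one dataset description to the CSV log at log_path."""
    missing = [name for name in FIELDS if not entry.get(name)]
    if missing:
        raise ValueError(f"missing metadata fields: {missing}")
    # Write the header row only when the log is new or empty.
    is_new = not os.path.exists(log_path) or os.path.getsize(log_path) == 0
    with open(log_path, "a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow(entry)
```

A plain CSV file is used here because it opens in any spreadsheet or text editor; a database would serve equally well for larger collections.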
32. Backup
- What would happen to your data if there was a
fire in your office tonight?
- Automatic backup
- Find out if this is available in your Department or
School
- Best practice is at least one automatic off-site
backup
- Manual backup
- Set repeat reminders, e.g. via online calendar
- N.B. Backup and archiving are not the same thing!
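For manual backups, it helps to confirm that each copy actually matches the original. The following is a minimal sketch, assuming the backup destination (backup_dir, a hypothetical parameter) is a folder on separate or off-site storage; it does not replace an automatic departmental backup service, and the function names are illustrative.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_file(source, backup_dir):
    """Copy one file into backup_dir and check that the copy matches the original."""
    source = Path(source)
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    target = backup_dir / source.name
    shutil.copy2(source, target)  # copy2 also tries to preserve timestamps
    if sha256_of(source) != sha256_of(target):
        raise IOError(f"backup of {source} does not match the original")
    return target
```

Recording the checksum alongside the backup also makes it possible to detect silent corruption later, by recomputing the digest and comparing.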
33. Depositing Data for the Long Term
- Check copyright, consent and Data Protection
status
- Identify the appropriate archive / data centre
- Submit form/sample data/supporting
documentation for review
- If accepted, sign Licence Agreement
- Deposit data
- Dissemination?
34. That’s a lot to remember…
It is, but the DCC’s Checklist
for a Data Management Plan
provides a comprehensive list
of issues you might need to
consider…
Not all of it will be relevant to
your work. Start with the
section headings, and use
DMP Online to make your life
easier…
37. Moving Forward
There are lots of guidance resources
available already, e.g.
www.lib.cam.ac.uk/preservation/incremental/
and www.glasgow.ac.uk/datamanagement and
Research Data MANTRA
http://datalib.edina.ac.uk/mantra/
… and Sheffield-focused resources are on the
way.
44. Last Words
- You may be in a small group with not much capacity for
huge changes, but no one expects miracles
- Starting with incremental changes now is better than
burying your head in the sand and hitting a brick wall
later
- You’re not alone! There are lots of resources available,
both institutionally and at a national level
46. Q+A
FAQs pt. 1
Q. I don’t have time for all of this.
A. You should have: the RCUK councils explicitly state that data management
activities should be included as part of funding applications, and institutions
are bound to meet their obligations. It’s not necessary for every researcher to
become an expert in all aspects of RDM, just to know what their role is in the
bigger picture.
Q. How are data management plans actually assessed?
A. It varies from funder to funder. The AHRC has a technical review college,
and ADS has internal guidance on what to look for when marking. All funders
provide markers' guidelines which probably say something about DMPs, but
these tend not to be public documents. A notable exception is ESRC, where
markers’ guidance is produced by the UK Data Archive. We’re hearing more
and more stories of bids rejected on the basis of poor DMPs, so the review
processes may soon become more transparent. Interestingly, the AHRC crops
up in this context more often than the others.
47. Q+A
FAQs pt. 2
Q. Won’t sharing my data mean people can steal my work?
A. No. Others might find things you didn’t (or weren’t looking for), but you
should receive proper attribution. Additionally, most funders permit
embargo periods to enable the original data collectors/creators to benefit
from their work. The risk of plagiarism is the same as publishing a paper.
Q. How could I possibly share confidential data?
A. If it’s confidential, you probably shouldn’t! Techniques such as
anonymisation and aggregation can be applied in order to safeguard
personal information, and data with commercial significance may also be
protected. It depends on policies and consortium agreements etc, which
should be clearly communicated. ESRC/UKDA, for example, provide advice
on ‘What to tell participants’ re. confidentiality / anonymisation:
http://www.data-archive.ac.uk/create-manage/consent-ethics/consent?index=7
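To give a feel for what aggregation can look like in practice, here is a small sketch that drops direct identifiers, groups exact ages into bands, and suppresses small groups. The field names (region, age), the band width and the minimum cell size of 3 are all assumptions chosen for illustration; real disclosure control should follow the advice of the relevant data centre and ethics process.

```python
from collections import Counter


def age_band(age, width=10):
    """Map an exact age to a band such as '30-39'."""
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"


def aggregate(records, min_count=3):
    """Count participants per (region, age band).

    Names and other fields are ignored, so they never reach the output,
    and any cell with fewer than min_count participants is suppressed.
    """
    counts = Counter((r["region"], age_band(r["age"])) for r in records)
    return {cell: n for cell, n in counts.items() if n >= min_count}
```

Suppressing small cells matters because a count of one or two in a small region can still point to an identifiable person.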
48. Thank you
Martin Donnelly
Digital Curation Centre
University of Edinburgh
www.dcc.ac.uk/dmponline
martin.donnelly@ed.ac.uk
Twitter: @mkdDCC
This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 2.5 UK: Scotland License. To view a copy of this license, (a) visit http://creativecommons.org/licenses/by-nc-sa/2.5/scotland/; or (b) send a letter to Creative Commons, 543 Howard Street, 5th Floor, San Francisco, California, 94105, USA.
Image credits:
slide 12 - http://www.psdgraphics.com/3d/gold-pound-symbol/
Slide credits:
Kevin Ashley and Graham Pryor, DCC Edinburgh; Andrew McHugh, DCC Glasgow