1. Introduction to
Digital Curation
DTC Archive
Knowledge Elicitation Workshop
Windermere, August 9th, 2011
Gareth Knight, Digital Curation Specialist
Centre for e-Research
2. Session Overview
1. What is digital curation and
preservation?
2. Why should I curate?
• Reasons to curate your data
• Threats to protect against
3. How do you perform digital curation?
• Organisational requirements
• Practical steps
3. Curation and the digital lifecycle
Digital Curation
"The activity of managing and promoting the
use of data from its point of creation, to
ensure it is fit for contemporary purpose,
and available for discovery and reuse"
Philip Lord, et al. 2004
Digital Preservation
"…refers to the series of managed
activities necessary to ensure
continued access to digital materials
for as long as necessary."
Beagrie & Jones, Preservation Management of Digital
Materials: A Handbook, 2001, p. 10
4. Why Curate content?
Researcher perspective
1. Protect value of content
2. Maximise visibility and impact of researcher
3. Enable continued development and use
Institutional Perspective
4. Protect financial investment
5. Evidence of operation and impact
6. Compliance with appropriate regulations
5. Digital lifecycle threats (1)
(1) Inability to locate:
Data files have been lost/corrupted and an
alternative copy cannot be found.
• “What data was created by Project/Person X in 2005?”
• “I wrote a contract & stored it on a shared drive in
2008. Where is it located now?”
• “Person X wrote a useful software application before
they left the institution. What happened to it? Has it
been deleted or archived?”
(2) Inability to access:
Data files cannot be accessed due to media
and/or software issues
•Floppy/tape reader not available to read
media
•Format obsolescence - access software becomes
obsolete through gradual replacement (e.g.
MS Word 5, 95, XP), closure of the developer, or OS
incompatibility
http://failblog.org/2008/02/08/floppy-fail/
6. Digital lifecycle threats (2)
(3) Uncertainty over content
Many data files exist, but it is unclear what
constitutes the final product of the research process
and what is a by-product of investigation.
• Lots of different formats: which is the original and which is
derivative?
• What is the organisational structure?
(4) Inability to understand:
Data files can be accessed using
appropriate software, but context of research
content cannot be established
•When/How/By whom was the data created /
recorded? Does this affect the interpretation?
•What is the intended meaning of a db/spreadsheet
column or data value?
Will we need a Rosetta Stone to understand content?
http://failblog.org/2008/02/08/floppy-fail/
7. Digital lifecycle threats (3)
(5) Questionable authenticity
Content interpretation changes between
software. How do you know that content
represents the author’s intent?
•Decoding differences between product versions, e.g.
MS Word 97 & 2010
•Alternative products, e.g. OpenOffice vs. MS Word
•Change may be introduced by format conversion tools
(6) Uncertainty regarding usage rights
Rights issues associated with publication
and use of content are unclear. Should the
institution err on the side of caution?
•Do you have the right to store data?
•Do you have the right to preserve it?
•Do you have the right to publish it over time? Does
this expire?
http://www.flickr.com/photos/kris3198/2409340274/
[Image labels: MS Word 2, MS Word 2010]
8. Institutional requirements to curate data
• Organisational infrastructure:
Must maintain an organisational structure capable of
accepting, maintaining and making available information in a
trustworthy manner.
• Technical infrastructure:
Must possess appropriate strategy and procedures to
maintain digital objects at a technical level
• Legal:
Must possess IPR or appropriate permission to curate and
preserve information
• Money:
Require £££ to pay for the staff, electricity, space, etc. needed for
long-term curation. Budgeting is difficult as the real costs of long-term
digital preservation are not clear.
9. Data Storage
Reality: ALL digital storage media are unreliable:
• Gradual degradation over time
• Unexpected failure or loss through power surge, sudden motion,
or theft
• 3rd-party storage providers can close
their service & delete your content
Practical approaches to take:
• Appraise - do you need to keep everything?
• Store content on at least 2 forms of storage in different
locations, e.g. one on office shared drive and a remote copy
(offsite backup)
• Perform regular fixity checks to test your backups
• Copy data files to new media every 2-5 years after first
creation
http://www.flickr.com/photos/timypenburg/5442288539/
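The fixity check recommended on this slide can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed tool: it assumes SHA-256 checksums are an acceptable fixity measure, and the function names (sha256_of, build_manifest, compare_copies) are hypothetical names chosen for this example.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file path (relative to root) to its checksum."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def compare_copies(primary: Path, backup: Path) -> dict[str, list[str]]:
    """Report files that are missing from, or differ in, the backup copy."""
    original = build_manifest(primary)
    copy = build_manifest(backup)
    return {
        "missing": sorted(set(original) - set(copy)),
        "changed": sorted(
            name
            for name in original.keys() & copy.keys()
            if original[name] != copy[name]
        ),
    }
```

Calling compare_copies(Path("shared_drive/project"), Path("offsite/project")) (both paths are placeholders) returns two lists. An empty "missing" and "changed" list suggests the backup still matches the primary copy. Saving the manifest at the time of deposit would also allow later checks against the original state, not only between copies.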
10. Data Organisation
•If someone examined your data for the first time, what
would they wish to know?
• What research collection is contained within the directory?
• What type of information does it contain?
• Where can I find specific content, e.g. final report, analysis
data?
•Practical approaches to take:
• Establish a directory structure that clearly distinguishes between
groups of files (e.g. reports, photographs). Use sub-directories
for sub-categories (e.g. topic, date, version)
• Adopt a consistent approach to organising directories (across
your department, if possible)
• Label files in a manner that allows purpose, version and other
relevant information to be quickly identified (e.g. using
filename, cover page)
http://www.flickr.com/photos/amcclen/253640379/
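One way to make file labels consistent is to generate and check them with a small script. The sketch below assumes a hypothetical naming convention of project_label_date_version.extension; the convention itself and the names make_filename and parse_filename are illustrative assumptions, not something the slides prescribe.

```python
import re
from datetime import date
from typing import Optional

# Hypothetical convention: <project>_<label>_<YYYY-MM-DD>_v<NN>.<ext>
NAME_PATTERN = re.compile(
    r"^(?P<project>[a-z0-9]+)_(?P<label>[a-z0-9-]+)_"
    r"(?P<date>\d{4}-\d{2}-\d{2})_v(?P<version>\d{2})\.(?P<ext>[a-z0-9]+)$"
)


def make_filename(project: str, label: str, created: date,
                  version: int, ext: str) -> str:
    """Build a filename that encodes purpose, date and version."""
    project_slug = re.sub(r"[^a-z0-9]+", "", project.lower())
    label_slug = re.sub(r"[^a-z0-9]+", "-", label.lower()).strip("-")
    return (
        f"{project_slug}_{label_slug}_{created.isoformat()}"
        f"_v{version:02d}.{ext.lower()}"
    )


def parse_filename(name: str) -> Optional[dict[str, str]]:
    """Return the parts of a conforming filename, or None if it does not conform."""
    match = NAME_PATTERN.match(name)
    return match.groupdict() if match else None
```

For example, make_filename("DTC", "Final Report", date(2011, 8, 9), 3, "pdf") gives "dtc_final-report_2011-08-09_v03.pdf", while parse_filename returns None for a name such as "report final FINAL2.doc", which flags it for renaming.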
11. Choosing appropriate formats
How do you choose the correct file format to store your
content in the short and long term?
• Each format has different capabilities & is not suitable for every task,
e.g. MS Word is not suitable for web access, etc.
• Some formats remove content or functionality to reduce file size &
limit use, e.g. JPEGs lose detail, PDFs are difficult to edit
Practical approaches to take:
Select different formats based upon needs, rather than a single format:
• Digital master: Preservation copy intended for long-term storage
• Dissemination: Access formats for use by specific users, e.g. PDF
•Format of the digital master:
• Try to use common, widely used formats supported by a range of
software tools.
• Store content in formats that support required attributes (e.g. 16
million colours) and will not degrade when resaved – ensure that you
re-examine your file after you’ve saved it
• Retain all data associated with the original creation/capture process;
it may contain information properties that are useful at a later date
12. Documentation
Description
Information necessary to interpret,
understand and use a given dataset
or set of documents
What would someone wish to
know about your content?
• Who created it?
• When was it created?
• Why was it created?
• Who funded it?
• What is the source of the
material used?
• What is the motivation for the
approach you took?
• What content can be
published?
• How can it be used?
Practical approach to take:
• Attach a cover page to your
document with relevant creator &
rights information
• Create a catalogue record for
your digital repository
• Create an administrative file for
internal use that can help
colleagues and repository staff
and assign it an appropriate
filename
http://www.flickr.com/photos/playingwithpsp/3031647963/
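The administrative file suggested above could be a small structured record stored next to the data. The following is a minimal sketch assuming JSON is an acceptable format. The field names mirror the questions on this slide, but the AdminRecord class, the default filename README_admin.json and the helper functions are hypothetical choices for illustration.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class AdminRecord:
    """Answers to the documentation questions for one dataset."""
    title: str
    creator: str           # Who created it?
    created: str           # When was it created? (e.g. an ISO date)
    purpose: str           # Why was it created?
    funder: str            # Who funded it?
    source: str            # What is the source of the material used?
    method_rationale: str  # Motivation for the approach taken
    publishable: str       # What content can be published?
    usage_terms: str       # How can it be used?


def missing_fields(record: AdminRecord) -> list[str]:
    """List the fields that have been left blank."""
    return [name for name, value in asdict(record).items() if not value.strip()]


def write_record(record: AdminRecord, directory: Path,
                 filename: str = "README_admin.json") -> Path:
    """Save the record as a JSON file inside the dataset directory."""
    target = directory / filename
    target.write_text(json.dumps(asdict(record), indent=2), encoding="utf-8")
    return target
```

Running missing_fields before write_record gives a quick check that no question has been left unanswered before the data is handed to colleagues or repository staff.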
13. Conclusion
•Digital curation is an institutional commitment to
maintaining data
• Data has value; protecting it should not be left to the researcher alone
• Infrastructure required to curate data in a trustworthy manner
•Many strategies available to curate & preserve
• Short & long-term choices can help or hinder content access over
time.
• Challenge is to select those that enable continued access, while
minimising the risk of data loss or corruption
•Easy steps to protect the value of your data:
• Store your data in 2 or more locations
• Organise it using an easy to understand structure
• Adopt a digital master format that is fit for purpose
• Document information that cannot be obtained elsewhere
14. Thank You for your attention
QUESTIONS?
Gareth Knight
gareth.knight@kcl.ac.uk