"Tomorrow, and tomorrow, and tomorrow": the players on the curation stage
1. a centre of expertise in data curation and preservation
“Tomorrow, and tomorrow, and tomorrow”: the players on the curation stage
Chris Rusbridge
Presentation at OCLC
Funded by:
This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 2.5
UK: Scotland License, excluding content that is the property of others. To view a copy of this license, (a) visit
http://creativecommons.org/licenses/by-nc-sa/2.5/scotland/ ; or (b) send a letter to Creative
Commons, 543 Howard Street, 5th Floor, San Francisco, California, 94105, USA.
2. a centre of expertise in data curation and preservation
•"To-morrow, and to-morrow, and to-morrow,
•Creeps in this petty pace from day to day,
•To the last syllable of recorded time;
•And all our yesterdays have lighted fools
•The way to dusty death.
•Out, out, brief candle!
•Life's but a walking shadow; a poor player,
•That struts and frets his hour upon the stage,
•And then is heard no more: it is a tale
•Told by an idiot, full of sound and fury,
•Signifying nothing."
•Shakespeare: Macbeth
OCLC October 2006
3. a centre of expertise in data curation and preservation
•Dunsinane Hill
Photo by Fabrice
4. a centre of expertise in data curation and preservation
5. a centre of expertise in data curation and preservation
6. a centre of expertise in data curation and preservation
Contents
• Curation and the Digital Curation Centre
• Science and Data Citations
• The “poor players” of data curation
• Sustainability of curated data
• Macbeth again…
7. a centre of expertise in data curation and preservation
Curation
• Data increasingly important as evidence
• Experimental verifiability (the basis of science)
• Unrepeatable observations & experiments
(particularly environmental in broadest sense)
• Legal, compliance & transactions
• Cultural resources
• “Preservation” view vs “Publishing” view
8. a centre of expertise in data curation and preservation
Lynch remarks
• Closing the Curation Conference
• 3 views of digital curation
• Finite process, handover to preservation
• Whole life process, evolving object(s)
• Collection as a living thing
9. a centre of expertise in data curation and preservation
Digital curation?
• Digital preservation: static; for later use
10. a centre of expertise in data curation and preservation
Digital curation?
• Digital curation: dynamic; in use now (and the future)
• Digital preservation: static; for later use
• Long-term
11. a centre of expertise in data curation and preservation
Digital curation
• Digital curation & preservation: dynamic and static; in use now (and the future) and for later use; long-term
• “maintaining and adding value to a trusted body of digital information for current and future use”
12. a centre of expertise in data curation and preservation
Mission
“The over-riding purpose of the DCC is to
support and promote continuing improvement
in the quality of data curation, and of
associated digital preservation”
13. a centre of expertise in data curation and preservation
Aims
• Establish vibrant research
• Build strong community relations
• Development activity leading to service
• Achieve the “virtuous circle”
• NOT a repository (funder mandate)!
14. a centre of expertise in data curation and preservation
Organisation to Engage & Collaborate
• DCC activities: community support & outreach; service definition & delivery; management & admin support; research; development co-ordination
• External partners: communities of practice (users); curation organisations (eg DPC); Associates Network; research collaborators; testbeds & tools; industry; standards bodies
15. a centre of expertise in data curation and preservation
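Organisation to Engage & Collaborate: Leads
• Community support & outreach: Bath
• Service definition & delivery: Glasgow
• Management & admin support: Edinburgh
• Research: Edinburgh
• Development co-ordination: CCLRC
• External partners: communities of practice (users); curation organisations (eg DPC); Associates Network; research collaborators; testbeds & tools; industry; standards bodies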
16. a centre of expertise in data curation and preservation
Achievements: Research
• Edinburgh Database Group
• Annotation, archiving, citation, lineage,
provenance, “publishing”
• CCLRC: metadata curation
• Glasgow: genre extraction
• Bath: repository interactions
17. a centre of expertise in data curation and preservation
Achievements: Development
• RLG/NARA Certification checklist
• Representation Information Registry/Repository
• Concept from Open Archival Information System
standard (OAIS) on preserving information
18. a centre of expertise in data curation and preservation
Achievements: Services
• Help Desk
• Workshops and events
• Curation manual
• Audit and certification
• From checklists to service?
• Standards and tools
• Representation information
• From tool to service?
19. a centre of expertise in data curation and preservation
Achievements: Outreach
• Developing web site content
• Conferences
• Associates Network and Forum
20. a centre of expertise in data curation and preservation
Achievements: management
• Developing international impact
• Developing policy impact
21. a centre of expertise in data curation and preservation
Associated work
• DCC LOCKSS Technical Support Service
(Lots of Copies Keep Stuff Safe)
• DCC SCARP Project
• Disciplinary approaches to sharing, curation, re-use and preservation
• EU projects associated
• CASPAR
• Digital Preservation Europe
• PLANETS
22. a centre of expertise in data curation and preservation
Phase 2
• Externally-moderated, reflective self-evaluation completed
• Phase 2 proposal (2007/10) to JISC
• Accepted: focus on science data, reduced scale
• EPSRC-funded Research continues until
2007/8
23. a centre of expertise in data curation and preservation
2nd International Digital Curation
Conference
• Research & invited presentations
• Glasgow, 21/22 November, 2006
• Please register at:
http://www.dcc.ac.uk/events/dcc-2006/
24. a centre of expertise in data curation and preservation
25. a centre of expertise in data curation and preservation
Data resource stages
• Curated data is created…
• Observations? Fixed!
• Or Acquired…
• Data brought/bought from outside
• Ingest
• Development
• Derived, refined, combined, processed data
• Potentially many stages
26. a centre of expertise in data curation and preservation
2MASS (Infrared)
SDSS (Visual)
Slide from Rajendra Bose
27. a centre of expertise in data curation and preservation
Slide from Rajendra Bose
28. a centre of expertise in data curation and preservation
New discovery…
• National Virtual Observatory
• Johns Hopkins press release: “Scientists working to create the
NVO, an online portal for astronomical research unifying dozens of
large astronomical databases, confirmed discovery of [a] new
brown dwarf recently. The star emerged from a computerized
search of information on millions of astronomical objects in two
separate astronomical databases. Thanks to an NVO prototype,
that search, formerly an endeavor requiring weeks or months of
human attention, took approximately two minutes.”
29. a centre of expertise in data curation and preservation
Context
• Data meaningless without context
• Linkage
• Metadata of many kinds
• Workflow!
• Provenance
• Computational lineage
• Authenticity
30. a centre of expertise in data curation and preservation
[Diagram of a data-processing workflow. Labels: NASA; Csat 8-day composite and subscene; PAR subscene; SST and Pbopt calc; Ctot calc; Zeu calc; PPeu calc; university research groups 1, 2 and 3; local decision-making body]
Slide from Rajendra Bose
31. a centre of expertise in data curation and preservation
Access and re-use
• Ethics and rights control access
• Weak in expressing this long-term
• Collaboration tools
• Annotation, discussion, review
• Re-use leading to change and development
• “Publication”
• Not just in “print”
• Underlying data should be “published”, too
• Citation…
32. a centre of expertise in data curation and preservation
CLADDIER citation investigation
“My last example was an MST data set held at the BADC, and I was
suggesting something like this (for a citation):
<Citation><Author> Natural Environment Research Council </Author>
<Title> Mesosphere-Stratosphere-Troposphere Radar at Aberystwyth </Title>
<Medium> Internet </Medium>
<Publisher> British Atmospheric Data Centre (BADC) </Publisher>
<PublicationDate status="ongoing"> 1990</PublicationDate>
<Identifier> badc.nerc.ac.uk/data/mst/v3/upd15032006</Identifier>
<Feature><FeatureType>http://featuretype.registry/verticalProfile</FeatureType>
<LocalID>200409031205</LocalID></Feature>
<AccessDate> Sep 21 2006 </AccessDate>
<AvailableAt><url>http://badc.nerc.ac.uk/data/mst/v3/</url></AvailableAt>
</Citation>
(Made up tags!)”
Bryan Lawrence Weblog
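The made-up tags in this citation are regular enough to be read by machine. The following is a minimal sketch in Python, assuming the quoted markup is treated as well-formed XML. The helper names `text_of` and `format_citation` and the layout of the rendered line are illustrative assumptions, not part of CLADDIER or of the weblog post.

```python
import xml.etree.ElementTree as ET

# The citation from the slide, using Bryan Lawrence's made-up tags.
CITATION_XML = """
<Citation>
  <Author> Natural Environment Research Council </Author>
  <Title> Mesosphere-Stratosphere-Troposphere Radar at Aberystwyth </Title>
  <Medium> Internet </Medium>
  <Publisher> British Atmospheric Data Centre (BADC) </Publisher>
  <PublicationDate status="ongoing"> 1990</PublicationDate>
  <Identifier> badc.nerc.ac.uk/data/mst/v3/upd15032006</Identifier>
  <Feature>
    <FeatureType>http://featuretype.registry/verticalProfile</FeatureType>
    <LocalID>200409031205</LocalID>
  </Feature>
  <AccessDate> Sep 21 2006 </AccessDate>
  <AvailableAt><url>http://badc.nerc.ac.uk/data/mst/v3/</url></AvailableAt>
</Citation>
"""


def text_of(root, path):
    """Return the stripped text at `path`, or an empty string if absent."""
    node = root.find(path)
    return node.text.strip() if node is not None and node.text else ""


def format_citation(xml_text):
    """Render the citation as one readable line (hypothetical layout)."""
    root = ET.fromstring(xml_text.strip())
    date = root.find("PublicationDate")
    year = text_of(root, "PublicationDate")
    # An "ongoing" dataset has a start year but no end year.
    if date is not None and date.get("status") == "ongoing":
        year += " onwards"
    return (
        f"{text_of(root, 'Author')} ({year}). {text_of(root, 'Title')} "
        f"[{text_of(root, 'Medium')}]. {text_of(root, 'Publisher')}. "
        f"Feature {text_of(root, 'Feature/LocalID')}. "
        f"Available at {text_of(root, 'AvailableAt/url')} "
        f"(accessed {text_of(root, 'AccessDate')})."
    )


print(format_citation(CITATION_XML))
```

The `status="ongoing"` attribute is the only part that needs special handling here: it signals that the dataset is still growing, which is why the access date and the feature's LocalID matter to the citation.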
33. a centre of expertise in data curation and preservation
CLADDIER 2: “Version of record”
• Role of Publisher: add value
• provision of catalogue metadata
• some commitment to maintenance of the resource
at the AvailableAt url
• some commitment to the resource being
conformant to the description of the Feature
• some commitment to the maintenance of the
mapping between the identifier [LocalID] and the
resource.
Bryan Lawrence Weblog
34. a centre of expertise in data curation and preservation
CLADDIER 3: persistence
• Wayback Machine
• Only snapshots (eg only 2004 version of Bryan’s home
page!)
• WebCite
• allows the creator of content to submit URLs for [archiving],
thus ensuring when one writes an academic document, the
material will be archived, and the citation will be persistent
• But no real help for data…
• “… only allow [data citation] when we believe in the
persistence of the organisation making the data
available…”
Bryan Lawrence Weblog
35. a centre of expertise in data curation and preservation
36. a centre of expertise in data curation and preservation
Citation
• Needs a stable resource to cite…
OWL Web Ontology Language
Reference
W3C Proposed Recommendation 15 December 2003
This version:
http://www.w3.org/TR/2003/PR-owl-ref-20031215/
Latest version:
http://www.w3.org/TR/owl-ref/
Previous version:
http://www.w3.org/TR/2003/CR-owl-ref-2003081
• (FRBR works & expressions?)
37. a centre of expertise in data curation and preservation
Citation…
• The date alone (as in common web citation
approaches) is not enough!
•[6] The CIA World Factbook.
•www.cia.gov/cia/publications/factbook/.
•Retrieved on 8 Jan 2006.
• Cited object likely to have changed…
• Citation should link to the cited object as it was!
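One way to see why the date alone is not enough is to record a fingerprint of the cited content alongside the retrieval date. The sketch below is an illustration under stated assumptions: the `Citation` class, the helpers `make_citation` and `still_matches`, and the choice of a SHA-256 digest are all hypothetical and are not proposed in the slides. The sample content is invented and no network access takes place.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Citation:
    url: str
    retrieved: str  # retrieval date as written in the citation
    sha256: str     # fingerprint of the content that was actually cited


def make_citation(url: str, retrieved: str, content: bytes) -> Citation:
    """Record what was cited, not just where and when it was found."""
    return Citation(url, retrieved, hashlib.sha256(content).hexdigest())


def still_matches(citation: Citation, current_content: bytes) -> bool:
    """True only if the current object is byte-identical to the cited one."""
    return hashlib.sha256(current_content).hexdigest() == citation.sha256


# Invented sample content standing in for two states of a web page.
cited = b"Population: 1,000 (2005 est.)"
revised = b"Population: 1,100 (2006 est.)"

ref = make_citation("www.cia.gov/cia/publications/factbook/", "8 Jan 2006", cited)
print(still_matches(ref, cited))    # compares against the state that was cited
print(still_matches(ref, revised))  # compares against a later revision
```

A digest can only reveal that the cited object has changed. Getting back to the object as it was still needs an archive of past states.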
38. a centre of expertise in data curation and preservation
Citation needs…
• An efficient way to reference and access “archived” past states
of a changing dataset (work in progress, Buneman et al)
• Not important for original observations
• Don’t mess with those data
• Less important for incremental datasets
• Later stuff should not invalidate earlier
• Very important for revisable datasets
• Eg Genomics… datasets that result from the combined work of
curators, or contain opinions or facts likely to change
• Eg Mapping… OS maps represent a huge database that changes
on a daily basis
39. a centre of expertise in data curation and preservation
XMLArch: System Architecture
[Diagram: Relational Database → Data Extractor → XML Snapshot at time t → XML Archiver (Pre-processor, Version Merger), which combines the snapshot with the XML Archive at time t - 1 → XML Archive at time t]
Carwyn Edwards
40. a centre of expertise in data curation and preservation
Consider the Record
• Business records
• Evidence of business decisions
• Formal: minutes, letters etc
• Informal: emails, notes
• Databases
• Records of Science
• Lab notebooks
• Proposals, reports, papers
• Data
• Informal parts generally poorly managed
41. a centre of expertise in data curation and preservation
Preservation & curation
• Use preserves
• Money preserves
• Redundancy good, monoculture bad?
• LOCKSS-type & other approaches…
• Bits are fragile and robust
• Don’t rely on portable media
• Look after them well
• Technology changes…
• How fast? What impact?
• Metadata matters! (Know what you’ve got)
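The "redundancy good" point can be illustrated by keeping several copies, comparing their fingerprints, and repairing any copy that disagrees with the majority. This is a deliberately simplified sketch and not the LOCKSS protocol: the function names, the in-memory dictionary of replicas and the strict-majority rule are all assumptions made for illustration.

```python
import hashlib
from collections import Counter


def digest(data: bytes) -> str:
    """Fingerprint used to compare copies without trusting any one of them."""
    return hashlib.sha256(data).hexdigest()


def audit_and_repair(replicas: dict[str, bytes]) -> list[str]:
    """Repair replicas that disagree with the majority; return their names.

    Raises ValueError if no single version is held by more than half of the
    replicas, since there is then no trustworthy copy to repair from.
    """
    votes = Counter(digest(data) for data in replicas.values())
    winner, count = votes.most_common(1)[0]
    if count * 2 <= len(replicas):
        raise ValueError("no majority: cannot decide which copy is good")
    good = next(d for d in replicas.values() if digest(d) == winner)
    damaged = [name for name, d in replicas.items() if digest(d) != winner]
    for name in damaged:
        replicas[name] = good  # overwrite the damaged copy with a good one
    return damaged


copies = {
    "site_a": b"observations 1990-2006",
    "site_b": b"observations 1990-2006",
    "site_c": b"observ\x00tions 1990-2006",  # simulated bit rot
}
print(audit_and_repair(copies))
```

With only two copies that disagree, the function refuses to guess, which is one reason "lots of copies" is safer than a single backup.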
42. a centre of expertise in data curation and preservation
Who are the curation players?
43. a centre of expertise in data curation and preservation
Curation: Individual
• “Small science” 2-3 times more data than “Big
science”, but much more at risk
• PhD student? RA? PI? Administrator? IT support?
• Data potentially on local hard drives, or at best
shared network drives
• May be inadequately protected
• Liable to policy-led deletion on resignation
• Individual “knows” too much
• Documentation/metadata unlikely to be adequate
• Tomorrow: gone!
44. a centre of expertise in data curation and preservation
Department: eCrystals
• Specialist department
archive (& national service)
• Workflow recording of lab
parameters (R4L)
• Public & private elements
• Trying to build eCrystals
federation (eBank 3)
• But… ReciprocalNet?
French COD efforts?
Fragmented discipline!
• Tomorrow: likely to continue
45. a centre of expertise in data curation and preservation
Institution: Cambridge Chemistry
• 175,000 small molecule
structures in CML
• Alongside Archaeology,
Manuscripts, Learning
Materials, etc
• No library curation skills;
dependent on research
group enthusiast
• Collection isolated from
other Chemistry
• Tomorrow: assured…
46. a centre of expertise in data curation and preservation
Community: CDL
• Shared effort from
group of institutions
• Comparison: OhioLINK?
• Document tradition, not
data
• Passive role re
collections
• Rely on departmental &
domain expertise
• Tomorrow: assured…
47. a centre of expertise in data curation and preservation
Community: SDSC?
• Data specialists
• Multiple disciplines
• Distinct from domains;
curation dependent on
external expertise
• Research ethos
• Tomorrow: dependent
on grant/contract
income & research
priorities
48. a centre of expertise in data curation and preservation
Community: LOCKSS?
• Self-selected group of
collectors: closest to genuine
open activity (despite
Alliance)?
• Traditionally libraries
collecting eJournals
• Model respects IPR
• No domain expertise; rely on
origins
• Data limitations…
• Tomorrow: potentially very
persistent (low cost, high
reliability, attack resistance,
distributed)
49. a centre of expertise in data curation and preservation
Discipline: Archaeology
• Staffed by archaeologist
curators
• Understand special
legal issues
• Strong relationship with
community & peers
• Internationally still
fragmented?
• Tomorrow: dependent
on research council
grants + deposit funding
50. a centre of expertise in data curation and preservation
Discipline: Astronomy
• Part of major
international effort
• Expensive shared
facilities, global reach
• Well integrated into
community
• Enable new science
• Tomorrow: assured by
community (another
large facility)
51. a centre of expertise in data curation and preservation
Discipline: Atmosphere
• Strong believer in need
for domain scientists as
curators
• Significant participant in
“community proxy”
agenda-setting activities
• Internationally
fragmented resources
• Tomorrow: mostly
dependent on grant
funding (but strong
commitment)
52. a centre of expertise in data curation and preservation
Discipline: Pharmacology
• International Scientific
Union
• Attempting to build
credit for data
contributions
• DB ownership rotates
• Tomorrow: extremely
limited funding
53. a centre of expertise in data curation and preservation
Discipline: Pharmacology
54. a centre of expertise in data curation and preservation
Discipline: Social Sciences
• Mature!
• Staffed by Social
Science curators
• Alert to opportunities
• Able to appraise
material offered
• Strong relationship to
discipline
• Tomorrow: assured
through broad mix of
funding streams
55. a centre of expertise in data curation and preservation
Publisher: Crystallography
• Publisher and Scientific
Union
• Created key domain
crystallographic standard
(CIF)
• Strong motivator for deposit
of structure data
• Consistent quality checks
• DOIs used for structure data
• Tomorrow: publishing
business model
Slide from IUCr
56. a centre of expertise in data curation and preservation
National bodies: British Library
• Serious and robust
approach
• Legal deposit powers &
responsibilities as driver
• Oriented primarily
towards “cultural
heritage” (broadly
interpreted)
• Little data, no science
domain experience
• Tomorrow: strong future
commitment
57. a centre of expertise in data curation and preservation
National bodies: TNA/NDAD
• Specialist archive for
government datasets
• Understand government
regulations, dynamics &
requirements
• Subject generalists;
disconnected from
associated science
• Technology specialists
(understand databases)
• Tomorrow: likely to pass
eventually to The National
Archives
58. a centre of expertise in data curation and preservation
National bodies: NOAA (etc)
• Government body
making serious data
available
• Domain scientists
curate data
• Operates in current
political context (!)
• Tomorrow: reasonably
assured but some
unfunded mandates?
59. a centre of expertise in data curation and preservation
3rd parties: OCLC?
• Should this be
community?
• Demand driven
• No domain science
expertise: rely on
origins
• Tomorrow: business
case
60. a centre of expertise in data curation and preservation
3rd parties: Portico
• Specific area: eJournals
• Depends on publisher
agreements
• No data or domain
science expertise
• Tomorrow: commitment
from Mellon +
publishers +
subscriptions, good
funding mix
61. a centre of expertise in data curation and preservation
3rd Parties: Iron Mountain
• Records management
IS a curation problem
• Organisations like this
very likely to branch out
• No domain science
expertise
• Tomorrow: business
case, viability, stock
market…
62. a centre of expertise in data curation and preservation
Institutions & the network
• Institutions have some fundamental
sustainability
• Disciplines live in the network; sustainability is
an issue
• Can we get the best of both?
63. a centre of expertise in data curation and preservation
Intersections…
[Table: columns Institution 1, Institution 2, Institution 3, etc; rows Discipline 1, Discipline 2, Discipline 3, etc; each discipline row is marked X against two institutions]
64. a centre of expertise in data curation and preservation
Who are the curation players
again?
65. a centre of expertise in data curation and preservation
Project StORe findings
• Discipline commonality from survey (Miller, UKDA, 2006):
• 2-way links between data & publication useful
• Barriers to actual deposit of data/outputs
• Sharing data important, likely between colleagues
• Perceived inconsistency across repositories
• Most common searching: Google type
• Researchers favour self-reliance rather than library support
• Recognise need for common minimum metadata
• Aim for pilot linking middleware demonstrator
• “Creating small scale ‘silos’ of information with institutional
repositories is not … a compelling information
management strategy in the ‘Google age’” (Heery &
Anderson for JISC, 2005)
66. a centre of expertise in data curation and preservation
Sustainability: tomorrow is the
emerging worry
• Sustainability work package in DCC (new
grant!)
• JISC/NDIIPP meeting addressed it
• AHRC report draft soon
• Research Information Network report draft
• JISC study on sustainable IT systems for HE
• Recent ARL/NSF workshop, NSF strategy
67. a centre of expertise in data curation and preservation
Sustainability of what?
• Repository as an organisation
• Repository as a service
• Repository as a system
• Repositories as a network (federation?)
• Collections and objects supported by
repositories
• Commit to collection: contract the manager!
68. a centre of expertise in data curation and preservation
Sustainability of what?
• Culture of deposit & re-use!
• One of the most important social dimensions, but out of scope
here…
• Curation service
• Separate service from collection
• Funding always finite: 5 + 5 then re-compete?
• Relay approach: hand on in good order
• Succession! Start with the plan for your own end…
• Data
• Digital object access when required (for long future time)
69. a centre of expertise in data curation and preservation
Sustainability for what?
• Variety of curation approaches
• Developing resource
• Preserving resource
• Significant properties have a big impact
• Produce bit stream as ingested?
• All the work for the consumer
• Produce full look and feel as ingested? Expensive!
• May also be unfamiliar for future consumer
• Somewhere between?
• Depends on goals…
70. a centre of expertise in data curation and preservation
Social factors
• Commitment essential… much more than anything else
(cf persistent identifiers)
• Funder requirements express social determination
• Policy & grant application forms, selection criteria
• Monitoring essential
• Legal, ethical, IPR impacts all significant
• Public good questions
• Academic credit (citations?)
• Free-loaders (embargos?)
• Disciplines are different!
• Workforce skills: researcher, data librarian/scientist
71. a centre of expertise in data curation and preservation
Sustainability a function of...
• Commitment
• Goals
• Value and cost
• Business model
• Time
• Environment
• Domain knowledge and information
• Dimensions (how much stuff)
• Technical approaches
• Usage
72. a centre of expertise in data curation and preservation
Financial sustainability 2: projects
• Traditional research project approach:
• Produces unsustainable resources
• PIs focus on next project proposal
• RAs focus on next job application
• Result: no metadata, orphan data
73. a centre of expertise in data curation and preservation
Financial sustainability 3: investment
• How you justify a long-term spend: persuasive? No!
Return on investment = value - cost
• Intangible asset: hard to value; situated, multi-scaled
• Aggregate rather than individual
• Academic value is key
• Citations support this: needs work
• Reputation is the target currency
• But dollars pay the bills!
74. a centre of expertise in data curation and preservation
Value
• “… the by-products of our research may be more significant than our
soon dated theoretical insights.” (Seeger 2004, quoted by Kevin
Bradley, APSR)
• “I think I would be safe in saying that worldwide hundreds of millions of
dollars’ worth of crystallographic data is lost each year. For spectra and
synthetic chemistry it will be at least 10 times greater. Many synthetic
chemists say they are interested in failed reactions - and these are
almost never published!” (Murray-Rust blog)
• Value of curation service can grow from re-use promotion & community
proxy activities (eg BADC & CF conventions, ICPSR & DDI)
• (But the value of data is easily negated, at creation and after)
75. a centre of expertise in data curation and preservation
Financial sustainability: the 8 pillars
of wisdom?
• Someone has to pay…
• Consumer pays: subscription or usage?
• Depositor pays (ie grant or institution)?
• Institution pays (IR, cf library/archive/museum)
• Community (discipline repository?) pays
• Government, or science funder
• Learned society?
• Volunteers (cf open source, social computing, LOCKSS)?
• Side effect (advertiser) pays (unlikely for much data?)
• Endowment or donor pays…
• Diversity?
76. a centre of expertise in data curation and preservation
Role of libraries
• 2-4% of university budgets (“There’s plenty of
money… there’s just not plenty of money for
everything!” Courant)?
• Traditional role in sustaining the raw material
of scholarship
• Looking for new roles in the digital world?
• Many unsaid assumptions from publishing
paradigm?
• Domain knowledge: wide but not deep
• Involvement in data creation low
77. a centre of expertise in data curation and preservation
So, tomorrow…
• Digital data repositories already sustained > 30 years
• How?
• Vision, leadership, commitment
• Libraries, archives, museums sustained 100s of
years
• How?
• Aggregate value proposition
• Perception now under threat!
• Collectively we need to identify the next steps toward
digital data sustainability, for tomorrow, and
tomorrow, and tomorrow!
78. a centre of expertise in data curation and preservation
Macbeth again…
•"To-morrow, and to-morrow, and to-morrow,
•Creeps in this petty pace from day to day,
•To the last syllable of recorded time;
•…it is a tale
•Told by an idiot, full of sound and fury,
•Signifying nothing."
79. a centre of expertise in data curation and preservation
Mission (impossible?)
• To that last syllable of recorded time
• Keep our tales forever full of significance!
Thank you
Editor's notes
Although this is a speech of despair in the play, we can read into it metaphors for long-term digital curation and preservation!
Another key point in the play, remember the prophecy?
So how would Macbeth have got on with modern tools? BTW, this might be an argument for migration rather than emulation; a much better interface than a map scratched on calf skin…
Whoops! Perhaps emulation (calf skin) was better after all; at least it would not be troubled by spelling mistakes. Anyway, Google never claimed to be a battlefield management system! However, we move on…
Initially we have concentrated on data extracted from relational databases, mainly because this is where the IUPHAR data is. 1) Extract to XML (a friendly hierarchical format). 2) Next we want to merge with the archive containing the previous versions. 3) Process and merge. 4) New archive with the latest version added. Demo ....
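The extract-then-merge steps can be sketched in a few lines. This is not the XMLArch implementation: it assumes keyed records in plain Python dictionaries instead of XML, integer snapshot times taken consecutively, and hypothetical names (`merge`, `snapshot_at`, the `receptor:` keys). It only illustrates the idea that a value unchanged between snapshots is stored once, with the span of times at which it was present.

```python
def merge(archive, snapshot, t):
    """Merge the snapshot taken at time t into the archive, in place.

    archive maps key -> list of [value, first_t, last_t] versions, so a value
    that is unchanged since the previous snapshot is stored only once.
    """
    for key, value in snapshot.items():
        versions = archive.setdefault(key, [])
        if versions and versions[-1][0] == value and versions[-1][2] == t - 1:
            versions[-1][2] = t              # unchanged: extend the interval
        else:
            versions.append([value, t, t])   # new or changed: add a version
    return archive


def snapshot_at(archive, t):
    """Reconstruct the dataset as it was at time t."""
    return {
        key: value
        for key, versions in archive.items()
        for value, first_t, last_t in versions
        if first_t <= t <= last_t
    }


archive = {}
merge(archive, {"receptor:1": "name A", "receptor:2": "name B"}, 1)
merge(archive, {"receptor:1": "name A", "receptor:2": "name B (revised)"}, 2)
merge(archive, {"receptor:1": "name A"}, 3)  # receptor:2 withdrawn

print(snapshot_at(archive, 1))
print(snapshot_at(archive, 3))
```

Because every past state can be rebuilt from the one archive, a citation can name a time t and still resolve to the data as it was then.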