Curating data for integrated science
1. a centre of expertise in data curation and preservation
Chris Rusbridge
NERC Data Management Workshop
February 2009
This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 2.5
UK: Scotland License. To view a copy of this license, (a) visit
http://creativecommons.org/licenses/by-nc-sa/2.5/scotland/; or (b) send a letter to
Creative Commons, 543 Howard Street, 5th Floor, San Francisco, California, 94105, USA.
2. a centre of expertise in data curation and preservation
Contents
• Curation
• Integrated science
• Poetry & Philosophy of D H Rumsfeld
• Designated Community & Knowledge Base
• Curation and integration
• Data and Texts
3. a centre of expertise in data curation and preservation
Curation
• Wikipedia
• Curator: a content specialist responsible for an institution's
collections and, together with a publications specialist, their
associated collections catalogs.
• Digital Curation: the curation, preservation, maintenance,
collection and archiving of digital assets
• Sheer curation: an approach to digital curation where
curation activities are quietly integrated into the normal work
flow of those creating and managing data and other digital
assets.
• DCC: Digital curation is maintaining and adding value
to a trusted body of digital information for current and
future use.
4. a centre of expertise in data curation and preservation
Integrated Science?
• Mostly educational: easy-to-swallow science
• Some strange things
• One nice essay
• Lots of environmental science
5. a centre of expertise in data curation and preservation
6. a centre of expertise in data curation and preservation
University of Integrated Science, California
• Degree Programs:
• Vertical reality
• Tachyon Holistic Wellness
• Tantra (including Sexual Alchemy for Singles 101)
• Vegan and Live Food Nutrition Masters Program
• …and that’s it!
7. a centre of expertise in data curation and preservation
Edward O Wilson (1998)
• “Science: organized systematic enterprise that gathers
knowledge about the world and condenses the knowledge
into testable laws and principles. Defining traits are
• 1st, confirmation of discoveries & support of hypotheses through repetition by
independent investigators, preferably with different tests & analyses;
• 2nd, mensuration, the quantitative description of the phenomena on
universally accepted scales;
• 3rd, economy, by which the largest amount of information is abstracted into a
simple and precise form, which can be unpacked to re-create detail;
• 4th, heuristics, the opening of avenues to new discovery and interpretation.
• And 5th, and finally, is consilience, the interlocking of causal explanations
across disciplines.”
• Consilience: “the concurrence of multiple inductions drawn from different
data sets”
• Wilson, E. O. (1998, 27 March). Integrated Science and The Coming Century of The
Environment. Science, 279, 2048-2049.
8. a centre of expertise in data curation and preservation
Wilson concluding
• “Arguably the foremost of global problems
grounded in the idiosyncrasies of human
nature is overpopulation and the destruction
of the environment. The crisis is not long-term
but here and now; it is upon us. Like it or not,
we are entering the century of the
environment, when science and polities will
give the highest priority to settling humanity
down before we wreck the planet.”
9. a centre of expertise in data curation and preservation
NCAR: January 2009
• The Integrated Science Program will promote scientific
frontiers that are dependent on an integrated approach,
across NCAR laboratories and across disciplines. ISP will
focus on thematic areas where the mission and
expertise at NCAR, and in the university atmospheric
and related sciences community, can be advanced by
contributions from the social and environmental
sciences beyond those that typically occur within single
programs or departments. These areas include, but are
not limited to, Earth system-society interactions,
building societal resilience to weather and climate
hazards, hydrologic sciences, and biogeochemistry.
10. a centre of expertise in data curation and preservation
Fisheries & Oceans Canada
• Integrated Science Data Management
(ISDM) Providing Access to Ocean Data
• “ISDM's mandate is to manage and archive
ocean data collected by DFO, or acquired
through national and international
programmes conducted in ocean areas
adjacent to Canada, and to disseminate
data, data products, and services to the
marine community in accordance with the
policies of the Department.”
11. a centre of expertise in data curation and preservation
Integrated Science
• We need a definition that works better;
something like:
“The application of multiple scientific disciplines
to one or more core scientific challenges”
• Examples of integrated sciences?
• Archaeology
• Environmental sciences
12. a centre of expertise in data curation and preservation
Integrated Science implications
• Scientists will be using unfamiliar data,
therefore
• Data curators and managers must make their
data available for unfamiliar users!
• And now for something unfamiliar?
13. a centre of expertise in data curation and preservation
Poetry & Philosophy of D H Rumsfeld
Hart Seely, April 2, 2003,
SLATE http://www.slate.com/id/2081042/
14. a centre of expertise in data curation and preservation
A Confession
‘Once in a while,
I'm standing here, doing something.
And I think,
"What in the world am I doing here?"
It's a big surprise.’
—May 16, 2001, interview with the New York Times
15. a centre of expertise in data curation and preservation
Clarity
‘I think what you'll find,
I think what you'll find is,
Whatever it is we do substantively,
There will be near-perfect clarity
As to what it is.
‘And it will be known,
And it will be known to the Congress,
And it will be known to you,
Probably before we decide it,
But it will be known.’
—Feb. 28, 2003, Department of Defense briefing
16. a centre of expertise in data curation and preservation
The Unknown
‘As we know,
There are known knowns.
There are things we know we know.
We also know
There are known unknowns.
That is to say
We know there are some things
We do not know.
But there are also unknown unknowns,
The ones we don't know
We don't know.’
—Feb. 12, 2002, Department of Defense news briefing
17. a centre of expertise in data curation and preservation
The 4th Rumsfeld?
• 3 epistemological classes (???)
• Known knowns
• Known unknowns
• Unknown unknowns
• 4th class?
• Unknown knowns?
• Critical issue for integrated sciences
18. a centre of expertise in data curation and preservation
Some OAIS Concepts?
• Knowledge Base: allows a consumer to understand
something
• Designated Community: the set of consumers for
whom the archive curates something
• Representation Information (RepInfo): helps you interpret a
data object, yielding an information object
• The amount and nature of RepInfo required is dependent on
the Knowledge Base of the Designated Community
• If you curate for project colleagues in the short term, little if
any RepInfo required
• If you curate for those unfamiliar with the data, more RepInfo
is needed
• (All broadly interpreted!)
• CCSDS (2002). Reference Model for an Open Archival Information System (OAIS).
Retrieved from http://public.ccsds.org/publications/archive/650x0b1.pdf
19. a centre of expertise in data curation and preservation
Time
• KB is f1(DC, t)
• DC is f2(t)
• RepInfo needed is f3(f1(DC, t), f2(t))
• (but none of these concepts can be precisely defined!)
• If DC is small and t is short (months to year or so),
then both may be ignored, and RepInfo be assumed
part of the KB
• If DC is extensive (eg cross-discipline) and t is long (5
years to 25 plus), then RepInfo must be articulated
• If t is very long, most bets are off (post-hoc
reconstruction likely to be needed)
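One way to read the rules of thumb above is as a small decision function. The sketch below is illustrative only: the numeric cut-offs, the class name CurationContext and the function repinfo_strategy are assumptions made for the example, since the slide gives no precise thresholds and notes that none of these concepts can be precisely defined.

```python
from dataclasses import dataclass

# Assumed cut-offs: the slide says only "months to year or so" and "very long".
SHORT_HORIZON_YEARS = 1.5
VERY_LONG_HORIZON_YEARS = 50.0


@dataclass
class CurationContext:
    community_is_extensive: bool  # DC reaches beyond close colleagues (e.g. cross-discipline)
    horizon_years: float          # t: how long the data must stay usable


def repinfo_strategy(ctx: CurationContext) -> str:
    """Return a coarse label for how much RepInfo work the context implies."""
    if ctx.horizon_years >= VERY_LONG_HORIZON_YEARS:
        return "reconstruct"  # most bets are off; expect post-hoc reconstruction
    if not ctx.community_is_extensive and ctx.horizon_years <= SHORT_HORIZON_YEARS:
        return "implicit"     # RepInfo can be assumed part of the Knowledge Base
    # Assumption: cases the slide leaves open are treated conservatively.
    return "articulate"       # RepInfo must be written down


print(repinfo_strategy(CurationContext(False, 0.5)))
print(repinfo_strategy(CurationContext(True, 10)))
```

Treating the unspecified middle cases (for example a small community over a long period) as "articulate" is a cautious choice made for this sketch, not something the slide states.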
20. a centre of expertise in data curation and preservation
What might RepInfo include
• Structure information: file format definitions, etc
• Semantic information: data dictionaries, code books etc
• Robust methods (working code?)
• Not to mention many kinds of metadata, provenance,
documentation of hidden assumptions, etc
• Cross-domain schemas one approach to articulating
RepInfo?
• (Never perfect, of course)
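The categories above could be tracked as a simple checklist attached to each dataset. This is a minimal sketch, not an OAIS-defined structure: the RepInfo class, its field names and the sample entries are hypothetical and chosen only to mirror the bullets on this slide.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RepInfo:
    """Checklist of Representation Information held for one dataset."""
    structure: List[str] = field(default_factory=list)  # file format definitions
    semantics: List[str] = field(default_factory=list)  # data dictionaries, code books
    methods: List[str] = field(default_factory=list)    # robust methods, working code
    context: List[str] = field(default_factory=list)    # metadata, provenance, assumptions

    def missing(self) -> List[str]:
        """Names of the categories for which nothing has been recorded yet."""
        return [name for name, items in vars(self).items() if not items]


# Illustrative entries only.
record = RepInfo(
    structure=["file format definition"],
    semantics=["code book for station identifiers"],
)
print(record.missing())
```

An empty category does not prove RepInfo is inadequate, and a filled one does not prove it is sufficient; the check only makes gaps visible.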
21. a centre of expertise in data curation and preservation
What about Rumsfeld 4?
• Biggest concern with unfamiliar user is
clashing concepts, eg different baselines,
units, geographies, granularity
• Especially where terms are ambiguous or
differently interpreted
• The KBs of two DCs conflict, potentially silently
• Happens all the time, of course
• The unspoken: tacit knowledge, unknown
knowns!
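A silent conflict of this kind can be made visible only if each community declares its conventions. The sketch below compares two such declarations; the function compare_conventions, the dictionary keys and the sample values are assumptions for illustration. Conventions that either side has not declared are reported separately, since those are the tacit "unknown knowns".

```python
from typing import Dict, List, Optional

# The kinds of convention named on the slide.
CONVENTIONS = ("baseline", "units", "geography", "granularity")


def compare_conventions(
    a: Dict[str, Optional[str]], b: Dict[str, Optional[str]]
) -> Dict[str, List[str]]:
    """Report conventions that clash and conventions left undeclared."""
    report: Dict[str, List[str]] = {"clash": [], "undeclared": []}
    for key in CONVENTIONS:
        value_a, value_b = a.get(key), b.get(key)
        if value_a is None or value_b is None:
            report["undeclared"].append(key)  # tacit knowledge: cannot be checked
        elif value_a != value_b:
            report["clash"].append(key)       # declared, and in disagreement
    return report


# Hypothetical declarations from two communities.
dataset_a = {"units": "degC", "baseline": "1961-1990"}
dataset_b = {"units": "K", "baseline": "1961-1990", "granularity": "monthly"}
print(compare_conventions(dataset_a, dataset_b))
```

A plain string comparison cannot tell that two differently worded declarations mean the same thing, which is one reason a shared vocabulary helps.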
22. a centre of expertise in data curation and preservation
Timing
• Curation starts before creation
• Before project proposal!
• Data acquisition should not happen at the end
• Continuous acquisition much better?
• Enforcement… or credit for data?
23. a centre of expertise in data curation and preservation
Other curation issues of concern
• Sustainability (work on your survival)
• Succession (what happens to your data if you don’t)
• Data audit (know what you’ve got)
• Data risk assessment (assess your chances of loss)
• Repository external audit???
• Provenance & computational lineage
• Archiving database changes
• Community proxy roles: help your communities
develop data standards & data practices
• DCC has tools & support for some of these…
24. a centre of expertise in data curation and preservation
… and what is the role of RDF?
25. a centre of expertise in data curation and preservation
RDF
• Anchors data to (well?) defined ontology or
schema
• Reduces 4th Rumsfeld risk?
• Allows processing by increasing class of tools
• More suited to comparatively isolated “facts”
or claims than substantial data arrays?
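To illustrate what anchoring to a schema means, the sketch below writes three isolated "facts" about one observation as RDF triples in N-Triples syntax, using only the Python standard library. Every IRI under http://example.org/ is a hypothetical placeholder, not a real vocabulary, and the helper functions are written for this example rather than taken from an RDF library.

```python
from typing import Iterable, Tuple

EX = "http://example.org/"  # hypothetical namespace, for illustration only


def iri(term: str) -> str:
    """Wrap an IRI in angle brackets, as N-Triples requires."""
    return f"<{term}>"


def literal(value: str) -> str:
    """Quote a plain string literal, escaping backslash, quote and line breaks."""
    escaped = (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'"{escaped}"'


def to_ntriples(triples: Iterable[Tuple[str, str, str]]) -> str:
    """Serialise (subject, predicate, object) terms, one statement per line."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


obs = iri(EX + "obs/1")
triples = [
    (obs, iri(EX + "vocab/measures"), iri(EX + "vocab/SeaSurfaceTemperature")),
    (obs, iri(EX + "vocab/unit"), iri(EX + "vocab/degreeCelsius")),
    (obs, iri(EX + "vocab/baseline"), literal("1961-1990")),
]
print(to_ntriples(triples))
```

Because the unit is an IRI rather than free text, two communities can check whether they mean the same thing by it. The same pattern becomes unwieldy for large numeric arrays, in line with the last bullet above.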
26. a centre of expertise in data curation and preservation
… and Research Outputs?
• Need more semantically aware texts to
support cross-community understanding
• Coded up (cf microformats, RDFa)
• People
• Citations & references
• Science features (eg chemicals, reactions)
• Graphs, spectra, tables linking to
• Supplementary data
• PDF is pretty bad at this
27. a centre of expertise in data curation and preservation
Thanks… and now for the experts!