Latest version of an oft-given talk/discussion about data citation and related issues, presented to data scientists at HQ of the National Ecological Observatory Network
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
Identity, Location, and Citation at NEON
1. Identity, Location, and Citation
Mark A. Parsons with help from Ruth Duerr and Peter Fox
!
!
!
!
NEON
7 February 2014
Unless otherwise noted, the slides in this presentation are licensed by Mark A. Parsons under a Creative Commons Attribution-Share Alike 3.0 License
3. Metaphor is for most people a
device of the poetic imagination and
the rhetorical flourish— a matter of
extraordinary rather than ordinary
language. Moreover, metaphor is
typically viewed as characteristic of
language alone, a matter of words
rather than thought or action. For
this reason, most people think they
can get along perfectly well without
metaphor.
We have found, on the contrary,
that metaphor is pervasive in
everyday life, not just in language
but in thought and action. Our
ordinary conceptual system, in
terms of which we both think and
act, is fundamentally metaphorical
in nature.
4. Is data publication the right metaphor?
M. A. Parsons & P. A. Fox
Data Science Journal, 2013
http://dx.doi.org/10.2481/dsj.WDS-042
5. Purpose of Data Citation
• Aid scientific reproducibility through direct, unambiguous connection to the precise data
used
• Credit for data authors and stewards
• Accountability for creators and stewards
• Track impact of data set
• Help identify data use (e.g., trackbacks)
• Data authors can verify how their data are being used.
• Users can better understand the application of the data.
!
• A locator/reference mechanism not a discovery mechanism per se
6. Identifier vs. Locator
• Human ID: Mark Alan Parsons (son of Robert A. and Ann M., etc.)
• every term defined independently (only unique in context/provenance)
• Alternative like a social security number (or ORCID) requires a well controlled
central authority and unchanging objects.
• Human Locator: Amos Eaton Hall, Room 209, 110 8th St., Troy, NY 12180.
• every term has a naming authority, i.e. a type of registry
!
• Data Set IDs: data set title, filename, database key, object id code (e.g. UUID),
etc.
• Data set Locators: URL, directory structure, catalog number, registered locator
(e.g. DOI), etc.
7. One of the main purposes of assigning DOI names (or any
persistent identifier) is to separate the location information
from any other metadata about a resource. Changeable
location information is not considered part of the resource
description. Once a resource has been registered with a
persistent identifier, the only location information relevant for
this resource from now on is that identifier, e.g., http://
dx.doi.org/10.xx.
!
— DataCite Metadata Scheme for the Publication and Citation of Research
Data, Version 2.2, July 2011 (my emphasis).
8. How data citation is currently done
• Citation of traditional publication that actually contains the data, e.g. a
parameterization value.
• Not mentioned, just used, e.g., in tables or figures
• Reference to name or source of data in text
• URL in text (with variable degrees of specificity)
• Citation of related paper (e.g. CRU Temp. records recommend citing two old
journal articles which do not contain the actual data or full description of
methods.)
• Citation of actual data set typically using recommended citation given by data
center
• Citation of data set including a persistent identifier/locator, typically a DOI
10. Data Citation Guidelines
• Federation of Earth Science Information Partners. 2012. http://commons.esipfed.org/node/308 and related
guidelines for the Group on Earth Observations (GEO)
• Best available for Earth system science. Not yet widely adopted.
• Digital Curation Center. 2011. http://www.dcc.ac.uk/resources/how-guides/cite-datasets
• Best overall guide. Not yet widely adopted.
• Digital Mapping Techniques '00 -- Workshop Proceedings. USGS Open-File Report 00-325 “Proposal for
Authorship and Citation Guidelines for Geologic Data Sets and Map Images in the Era of Digital Publication.” By
Stephen M. Richard http://pubs.usgs.gov/of/2000/of00-325/richard.html
• Detailed treatment of map-based data, but seemingly not well recognized. Does not address location.
• DataCite—a well-recognized consortium of libraries and related organizations working to define a citation
approach and assign DOIs. Also working to get data citations included in citation indices.
• CODATA/ICSTI and NAS Task Group conducted an excellent workshop that summarizes approaches and
issues: http://www.nap.edu/catalog.php?record_id=13564. A final report summarizing the state of the art will
be out soon.
• DataVerse Network Project—a standard from the social science community using a Handle locator and
“Universal Numerical Fingerprint” as a unique identifier.
• NASA DAACs, some NOAA and NSF centers adopting ESIP-based approaches.
• Various life and social science centers have standardized approaches with increasing adoption, e.g. Dryad.
11. The Evolution of Data Citation—Then
• Data was part of the literature—tables, maps, monographs, etc.—and we cited
accordingly. (Some data were still hoarded).
• Digital data becomes the norm. It’s messier and we forget how to do cite it
routinely.
• Initial efforts to define digital data citation in the late 90s - early 00s
• Right idea, little traction
• Partially conflated with the citing URLs issue
• A blossoming in the mid-late 00s.
• Multiple disciplines start developing approaches and guidelines
• DOI a big driver, especially for DataCite, but other identifiers used too
(Handles, LSIDs, UNFs, ARKs and good ol’ URI/Ls)
• A slightly competitive atmosphere
12. The Evolution of Data Citation—Now
• Now a consensus phase
• Out of Cite, Out of Mind: The Current State of Practice, Policy, and
Technology for the Citation of Data. 2013.
http://dx.doi.org/10.2481/dsj.OSOM13-043
• Draft Global Joint Declaration of Data Citation Principles. 2013.
http://www.force11.org/datacitation
13. The Evolution of Data Citation—Next
• Implementation phase just begun
• ESIP Guidelines adopted by a variety of NASA and NOAA data centers and
internationally by GEOSS.
• AGU Publishing Committee is developing author guidelines based on ESIP.
• ESA Data Policy requires data deposit bit not citation
• Other disciplines, notably social science, has relationships with publishers.
• Several data centers partnering with publishers, e.g. Elsevier’s “article of the
future”.
• It happens locally and requires culture change so debates will continue
14. What needs an identifier/locator?
What needs to be cited?
• Everything needs an identifier. Most things need locators. Intellectual content
needs citation.
• Different versions of things may need different identifiers/locators
• Subsets may need identifiers or clear reference to sub-setting process (e.g.
space and time).
• Different representations (conceptual models) may need different identifiers/
locators. E.g Maps.
15. News Flash! September 2011
Greenland ice shrinks 15% since 1999 according to
new edition of The Times Atlas
16. Headlines a couple days later
• UPDATED: Atlas Shrugged? 'Outraged' Glaciologists Say
Mappers Misrepresented Greenland Ice Melt
• Mapmakers' claim on shape of Greenland suddenly melts
away
• A greener Greenland? Times Atlas 'error' overstates global
warming
• Row over how much Greenland has shrunk
• Times Atlas is 'wrong on Greenland climate change'
• Times Atlas accused of 'absurd' climate change ice error
18. • Maurer, J. 2007. Atlas of the Cryosphere. Boulder, Colorado USA: National
Snow and Ice Data Center. Digital media.
• Bamber, J.L., R.L. Layberry, S.P. Gogenini. 2001. A new ice thickness and bed
data set for the Greenland ice sheet 1: Measurement, data reduction, and
errors. Journal of Geophysical Research. 106(D24): 33773-33780. Data
provided by the National Snow and Ice Data Center DAAC, University of
Colorado, Boulder, Colorado USA. Available at http://nsidc.org/data/
nsidc-0092.html. 25 October 2006.
• Bamber, J.L., R.L. Layberry, S.P. Gogenini. 2001. A new ice thickness and bed
data set for the Greenland ice sheet 2: Relationship between dynamics and
basal topography. Journal of Geophysical Research. 106(D24): 33781-33788.
Data provided by the National Snow and Ice Data Center DAAC, University of
Colorado, Boulder, Colorado USA. Available at http://nsidc.org/data/
nsidc-0092.html. 25 October 2006.
19. Basic data citation form and content
Per DataCite:
Creator. PublicationYear. Title. [Version]. Publisher. [ResourceType].
Identifier.
!
Per ESIP:
Author(s). ReleaseDate. Title, [version]. [editor(s)]. Archive and/or
Distributor. Locator. [date/time accessed]. [subset used].
!
20. An Example Citation
Cline, D., R. Armstrong, R. Davis, K. Elder, and G. Liston.
2002, Updated 2003. CLPX-Ground: ISA snow depth
transects and related measurements ver. 2.0. Edited by M.
Parsons and M. J. Brodzik. Boulder, CO: National Snow
and Ice Data Center. Data set accessed 2008-05-14 at
http://nsidc.org/data/nsidc-0175.html.
21. Author
Cline, D., R. Armstrong, R. Davis, K. Elder, and G. Liston.
2002, Updated 2003. CLPX-Ground: ISA snow depth
transects and related measurements, ver. 2.0. Edited by M.
Parsons and M. J. Brodzik. Boulder, CO: National Snow
and Ice Data Center. Data set accessed 2008-05-14 at
http://nsidc.org/data/nsidc-0175.html.
22. Release Date
Cline, D., R. Armstrong, R. Davis, K. Elder, and G. Liston.
2002, Updated 2003. CLPX-Ground: ISA snow depth
transects and related measurements, ver. 2.0. Edited by M.
Parsons and M. J. Brodzik. Boulder, CO: National Snow
and Ice Data Center. Data set accessed 2008-05-14 at
http://nsidc.org/data/nsidc-0175.html.
23. Title
Cline, D., R. Armstrong, R. Davis, K. Elder, and G. Liston.
2002, Updated 2003. CLPX-Ground: ISA snow depth
transects and related measurements, ver. 2.0. Edited by M.
Parsons and M. J. Brodzik. Boulder, CO: National Snow
and Ice Data Center. Data set accessed 2008-05-14 at
http://nsidc.org/data/nsidc-0175.html.
24. Version
Cline, D., R. Armstrong, R. Davis, K. Elder, and G. Liston.
2002, Updated 2003. CLPX-Ground: ISA snow depth
transects and related measurements, ver. 2.0. Edited by M.
Parsons and M. J. Brodzik. Boulder, CO: National Snow
and Ice Data Center. Data set accessed 2008-05-14 at
http://nsidc.org/data/nsidc-0175.html.
25. Editor
Cline, D., R. Armstrong, R. Davis, K. Elder, and G. Liston.
2002, Updated 2003. CLPX-Ground: ISA snow depth
transects and related measurements, ver. 2.0. Edited by M.
Parsons and M. J. Brodzik. Boulder, CO: National Snow
and Ice Data Center. Data set accessed 2008-05-14 at
http://nsidc.org/data/nsidc-0175.html.
26. Archive and/or Distributor
Cline, D., R. Armstrong, R. Davis, K. Elder, and G. Liston.
2002, Updated 2003. CLPX-Ground: ISA snow depth
transects and related measurements, ver. 2.0. Edited by M.
Parsons and M. J. Brodzik. Boulder, CO: National Snow
and Ice Data Center. Data set accessed 2008-05-14 at
http://nsidc.org/data/nsidc-0175.html.
27. Locator
Cline, D., R. Armstrong, R. Davis, K. Elder, and G. Liston.
2002, Updated 2003. CLPX-Ground: ISA snow depth
transects and related measurements, ver. 2.0. Edited by M.
Parsons and M. J. Brodzik. Boulder, CO: National Snow
and Ice Data Center. Data set accessed 2008-05-14 at
http://nsidc.org/data/nsidc-0175.html.
28. Locator
Cline, D., R. Armstrong, R. Davis, K. Elder, and G. Liston.
2002, Updated 2003. CLPX-Ground: ISA snow depth
transects and related measurements, ver. 2.0. Edited by M.
Parsons and M. J. Brodzik. Boulder, CO: National Snow
and Ice Data Center. Data set accessed 2008-05-14 at
http://dx.doi.org/10.5060/D4H41PBP.
29. An assessment of identification schemes
for digital Earth science data
Unique
Identifier
Lo
ca
to
rs
ID Scheme
Data Set
Item
Unique
Locator
Data Set
Item
Citable
Locator
Data Set
Item
Scientifically
Unique ID
Data Set
Item
URL/N/I
PURL
XRI
Handle
DOI
Ide
ntifi
ers
ARK
LSID
OID
UUID
Good
Fair
Poor
Adapted from Duerr, R. E., et al.. 2011. On the utility of identification schemes for digital Earth science data: An
assessment and recommendations. Earth Science Informatics. 4:139-160.
http://dx.doi.org/10.1007/s12145-011-0083-6
30. Suggested identifier practice for now
• DOI’s (or ARKs, maybe Handles) for data sets
• UUID’s (perhaps ARKs) for data items
• Systems need to be prepared to support multiple identifiers and locators over
time
• ESIP data citation guidelines (http://commons.esipfed.org/node/308 )
• Assign identifiers/locators to associated provenance and contextual materials, as
much as you can really.
• Continue to explore adding identifiers to people, organizations, instruments, etc.
• Participate in RDA and ESIP communities addressing these issues
31. Other topics
• Identifying/locating versions
• Identifying/locating subsets—“Micro-citation”
• Landing pages and content negotiation
• Identifiers/locators for more than data
• Choosing between EZID and CrossRef
32. Why the DOI?
• Not perfect but well understood by publishers
• Thomson Reuters collaborating with DataCite to get data citations in
their index.
!
But...
• What is the citable unit?
• How do we handle different versions?
• What about “retired” data?
• When is a DOI assigned?
33. Issues largely resolved by...
• A defined versioning scheme
• Good tracking and documentation of the versions
• Due diligence in archive and release practices
34. When to assign a DOI?
• First principle: Data should be citable as soon as they are available for use by
anyone other than the original authors.
• But...
• Most people (falsely) believe that a DOI implies permanence so how do we
cite transient data?
• Some believe that a DOI should not be assigned until the data has
undergone some level of review (e.g. Lawrence et al. 2010). So how do we
cite data used before the review?
• Data are often used by friends and collaborators in a raw, “unpublished”
state. Should this use be cited with a DOI?
• Near real time or preliminary data may only be available for a short
uncurated, period, and there may not be a good match between the
submission package and the distribution package. What gets the DOI?
When?
35. Versioning approach recommended by DCC
• “As DOIs are used to cite data as evidence, the dataset to which a DOI
points should also remain unchanged, with any new version receiving a
new DOI.”
• “There are two possible approaches the data repository can take: time
slices and snapshots.”
36. Versioning and locators:
some suggestions from NSIDC
• major version.minor version.[archive version]
• Individual stewards need to determine which are major vs. minor versions and describe the
nature and file/record range of every version.
• Assign DOIs to major versions.
• Old DOIs should be maintained and point to some appropriate page that explains what
happened to the old data if they were not archived.
• A new major version leads to the creation of a new collection-level metadata record that is
distributed to appropriate registries. The older metadata record should remain with a
pointer to the new version and with explanation of the status of the older version data.
• Major and minor version (after the first version) should be exposed with the data set title
and recommended citation.
• Minor versions should be explained in documentation, ideally in file-level metadata.
• Applying UUIDs to individual files upon ingest aids in tracking minor versions and historical
citations.
37. Basic data citation form and content
Author(s). ReleaseDate. Title, Version. [editor(s)]. Archive and/or
Distributor. Locator. [date/time accessed]. [subset used].
!
!
!
The best solution is to have unique identifiers or query IDs for
subsets, but that won’t be available for most data sets for a long
time, so we need alternative solutions...
38. !
February 8, 2011, 4:45 PM
Page Numbers for Kindle Books an Imperfect Solution
Neither solution is perfect—‘locations’ or
page numbers—because the problem is
unsolvable. The best we can hope for is a
choice...
Amazon’s Kindle will have
page numbers that
correspond to real books
and locations by passage.
http://pogue.blogs.nytimes.com/2011/02/08/page-numbers-for-kindle-books-an-imperfect-solution/
39. Chapter and Verse
• Bible
• Koran
• Bhagavad-Gita and
Ramayana
• other sacred texts
!
• A “structural index”
40. The “Archive Information Unit”
“An Archival Information Package whose Content Information is not
further broken down into other Content Information components, each of
which has its own complete Preservation Description Information. It can
be viewed as an ‘atomic’ AIP”
“From an Access viewpoint, new subsetting and manipulation capabilities
are beginning to blur the distinction between AICs and AIUs. Content
objects which used to be viewed as atomic can now be viewed as
containing a large variation of contents based on the subsetting
parameters chosen. In a more extreme example, the Content Information
of an AIU may not exist as a physical entity. The Content Information
could consist of several input files (or pointers to the AIPs containing
these data files) and an algorithm which uses these files to create the
data object of interest.”
• CCSDS. 2002. Reference Model for An Open Archival Information System (OAIS) CCSDS
41. Citation scenarios and production patterns
• What kind of “atomic” item is being cited—the “Archive Information
Unit (AIU)” (e.g., a data file, a data element within a file, a relational (or
other) database, a job “residue”)?
• How many AIUs items are in a typical citation for the scenario being
considered?
• What other digital or physical objects need to be available to make the
unit usable—the “Preservation Description Information (PDI)”?
!
Key Question:
• What structure or structures can we use to organize data collections
that might be common across Earth sciences?
54. A production pattern for Cline et al., 2003
field notebook
HTML
Doc.
+
analog to
digital w/ QC
Excel v1
distributed data set
printout
Couldn't find post, Excel v2
used GPS 5940 1197;
FAA4.4 and FAA4.5
unsafe, avalanche
area!
TRANSECT,IOP ,DATE
,TIME,UTME ,UTMN
,DEPTH
,SWET,SRUF,CNPY,
TEMP,SURVEYOR
,QC
,COMMENTS
,
,
,
,
,
,cm
,
,
,
,
deg-F,
,
,
FAA01.1 ,iop4,2003-03-25,1017,425941,4410860,
104,
d,
y,
n,
-999,"Fitzgerald, Matous, Dundas
","QC(000)
","
"!
FAA01.2 ,iop4,2003-03-25,1017,425956,4410860,
13,
d,
n,
n,
-999,"Fitzgerald, Matous, Dundas
","QC(000)
","
"!
...!
FAA04.1 ,iop4,2003-03-25,1221,425938,4411193,
325,
d,
y,
n,
-999,"Fitzgerald, Matous, Dundas
","QC(000)
","Couldn't
find post, used GPS 5940 1197; FAA4.4 and FAA4.5 unsafe, avalanche area!
100s
100s
ascii files
tarball
shapefiles
born
digital
1000s
camera
interim jpgs
collated and named jpgs
55. Crude, inaccurate production pattern for MODIS/Aqua Snow Cover
Daily L3 Global 500m Grid V005 (Hall et al., 2007)
Archives
GSFC
EDC
NSIDC
1,000,000s
MODAPs Processing
1 file/day/tile (grid cell)
Each file contains metadata
describing previous inputs
and detailed versioning
56. Doing it as best we can...?
• Hall, Dorothy K., George A. Riggs, and Vincent V. Salomonson. 2007, updated
daily. MODIS/Aqua Snow Cover Daily L3 Global 500m Grid V005.3, Oct. 2007Sep. 2008, 84°N, 75°W; 44°N, 10°W. Boulder, Colorado USA: National Snow
and Ice Data Center. Data set accessed 2008-11-01 at http://dx.doi.org/
10.1234/xxx.
• Hall, Dorothy K., George A. Riggs, and Vincent V. Salomonson. 2007, updated
daily. MODIS/Aqua Snow Cover Daily L3 Global 500m Grid V005.3, Oct. 2007Sep. 2008, Tiles (15,2;16,0;16,1;16,2;17,0;17,1). Boulder, Colorado USA:
National Snow and Ice Data Center. Data set accessed 2008-11-01 at http://
dx.doi.org/10.1234/xxx.
• Cline, D., R. Armstrong, R. Davis, K. Elder, and G. Liston. 2002, Updated 2003.
CLPX-Ground: ISA snow depth transects and related measurements, Version
2.0, shapefiles. Edited by M. Parsons and M. J. Brodzik. Boulder, CO: National
Snow and Ice Data Center. Data set accessed 2008-05-14 at http://dx.doi.org/
10.5060/D4H41PBP.
59. Conneg
• Many
examples,
but
what
follows
is
~
from:
http://www.crosscite.org/cn/
• What
is
it?
!
– Es
ce
que
vous
parlez
Français?
– Do
you
speak
html
or
JSON
or
RDF?
64. DataCite/EZID vs CrossRef/PILA
CrossRef/PILA
DataCite/EZID
• Primarily meant for data
• Numerical data
• Other research data outputs
• Support for common metadata needed
for finding or understanding data
• Supports point, bounding box, and place
names for geoLocations
• Can link metadata record to DataCite record
• Built in support for many types of
relationships to other resources including
physical objects
• Built in support for versioning
• Working to be included in citation
indexes, etc.
• Schema and services actively evolving
•Primarily meant for publications
•
•
•
•
Only registers metadata for works not
individual manifestations of a work
Schema allows multiple resolution
Can register data associated with a work
Must also provide linking information about
references in the work
•Support for common metadata needed
for finding publications or parts thereof
•
•
•
Books
Journals
Conference proceedings
•Well integrated into existing citation
metrics, indexing, etc. providers using a
pay for query model
Presented to the USGS Digital Object Identifiers Focus Group
Preparing Data for Ingest, presented 10/27/09 by R. Duerr
By Ruth Duerr, June 12, 2013
LID590DCL Foundations of Data Curation
65. EZID vs CrossRef/PILA
CrossRef/PILA
EZID
• EZID supports both DataCite’s DOI’s
and ARK’s (lower cost)
• EZID suggests using ARK’s prior to
decision to support an object in
perpetuity
• ARK’s can be deleted
• An ARK can be the suffix of a DOI
• ARK’s can be used at the granule level
using a single registration and a “pass
through” suffix
•DOI’s are the only locator supported
•Annual fee based on publishing
revenue
•Additional fee for each DOI assigned
•Additional fee if linkage information is
not provided for most content within 18
months
• Annual fee for up to 1 million IDs/yr
based on
• Profit/non-profit status
• Size/status of organization
• Considering development of single
DOI purchase capability
Presented to the USGS Digital Object Identifiers Focus Group
Preparing Data for Ingest, presented 10/27/09 by R. Duerr
By Ruth Duerr, June 12, 2013
LID590DCL Foundations of Data Curation
66. EZID vs CrossRef considerations
• Do you have mostly publications or mostly data or both?
• Do you want/need locators prior to making a decision
about long-term availability?
• Are citation indexes, citation metrics, or the ability to
support full-text access currently important to you?
• Which is more important to you - library or data
concepts?
• What kind of metadata do you have about the things you
need identifiers for?
Presented to the USGS Digital Object Identifiers Focus Group
Preparing Data for Ingest, presented 10/27/09 by R. Duerr
By Ruth Duerr, June 12, 2013
LID590DCL Foundations of Data Curation
67. Get involved!
• RDA proposed Working Group on citing dynamic data.
• http://rd-alliance.org/working-groups/data-citation-wg.html
• RDA WG on identifier “types”
• https://rd-alliance.org/working-groups/pid-information-types-wg.html
• ESIP Preservation and Stewardship Committee
• http://wiki.esipfed.org/index.php/Preservation_and_Stewardship