Data Exchange, Data Citation: An overview of some community work
1. Data Exchange, Data Citation:
An overview of some community
work
Todd Carpenter
Executive Director, NISO
May 20, 2012
Council of Science Editors
2. National Information Standards Organization
Non-profit industry trade association accredited by
ANSI
110+ Organizational Members
Divided roughly in thirds among publishers, libraries
and support service providers
Mission of developing and maintaining technical
standards related to
information, documentation, discovery and
distribution of published materials and media
Volunteer driven organization: 400+ contributors who
May 20, 2012 CSE: Data Standards and Data Citation 1
Issues
3. Standards are familiar, even if you don’t notice
Image: DanTaylor Image: Joel Washing
5. Large Scale Data-Driven
Science
Increasingly, scientific
breakthroughs will be
powered by advanced
computing capabilities
that help researchers
manipulate and explore
massive datasets.
6. Science data isn’t what it used to be
Image: Walters Art Museum Image:
Domenico, Caron, Davis, et al.
7. Challenges of Data Publishing/Sharing
Data
– is massive in scale.
– can be constantly changing.
– is difficult to describe/catalog.
– is often difficult to integrate with other data.
– can be complex to analyze (tools required).
– is hard to peer review.
– is challenging to curate.
– is challenging to preserve.
8. Data – Complexities
Storage
Provenance
Legal Issues
Technical
Interoperability
Metadata
Privacy
Identification
Citation
9. Problem with data, e.g.:
mutability
The fact that data sets are constantly changing poses
the following problems (not exhaustive, BTW)
– How does someone rerun your experiment when the
data set isn’t now what it was when you ran your
experiment?
– For citation – You have to cite the subset of the data
that existed on XXX date when you ran your analysis
– How can you preserve a snapshot of a dataset?
– How do you describe the dataset at that point in time?
– Problems of managing the metadata of each data
deposit
– Problems of synchronization across repositories
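One way to picture the "cite the subset that existed on a given date" problem is an append-only store in which records are only ever added, each with a deposit date. The sketch below is illustrative only: the class name `VersionedDataset`, its methods, and the sample readings are all invented for this example and do not come from any repository's actual API.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class VersionedDataset:
    """Hypothetical append-only store: records are never edited, only added."""
    deposits: list = field(default_factory=list)  # (deposit_date, record) pairs

    def deposit(self, when: date, record: dict) -> None:
        # Each deposit keeps its own date so earlier states can be rebuilt.
        self.deposits.append((when, record))

    def as_of(self, when: date) -> list:
        """Return the records that existed on the given date (the citable snapshot)."""
        return [rec for dep_date, rec in self.deposits if dep_date <= when]


# Made-up sample values, for illustration only.
ds = VersionedDataset()
ds.deposit(date(2012, 1, 10), {"station": "A", "temp_c": 4.1})
ds.deposit(date(2012, 3, 2), {"station": "B", "temp_c": 6.3})
ds.deposit(date(2012, 5, 15), {"station": "C", "temp_c": 9.8})

# An analysis run on 1 April 2012 would cite only the first two records.
snapshot = ds.as_of(date(2012, 4, 1))
print(len(snapshot))
```

This only covers data that grows by addition. Data that is corrected or deleted in place would need a fuller change history, which is part of what makes mutability hard.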
10. Data Complexity – Provenance
• How can you be certain
that the data set you are
looking at hasn’t been
altered or changed?
• Can you trust the data
source?
• Has the data been
updated/added to/or
appended since the
analysis was initially run?
• Dealing with issues of
abstraction
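A common partial answer to "has this data set been altered?" is a fixity check: the publisher records a cryptographic digest when the data is deposited, and a reader recomputes it later. The sketch below uses SHA-256 from Python's standard library. The function names and the sample bytes are assumptions made for illustration.

```python
import hashlib


def sha256_digest(data: bytes) -> str:
    """Return the hexadecimal SHA-256 digest of the given bytes."""
    return hashlib.sha256(data).hexdigest()


def is_unaltered(data: bytes, published_digest: str) -> bool:
    """True if the data still matches the digest recorded at deposit time."""
    return sha256_digest(data) == published_digest


# Made-up sample content, for illustration only.
original = b"station,temp_c\nA,4.1\nB,6.3\n"
published = sha256_digest(original)  # recorded alongside the deposit

tampered = b"station,temp_c\nA,4.1\nB,6.4\n"  # one value changed
print(is_unaltered(original, published))
print(is_unaltered(tampered, published))
```

A matching digest shows only that the bytes are unchanged since the digest was recorded. It says nothing about whether the source itself can be trusted.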
12. We all know what a citation looks like?
Right?
13. But what about citing a data
set?
Source: Citations for SEER Databases
Source: International Polar Year
Source: Global Land Cover Facility Source: The Economist
Source: ICPSR
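To make the question concrete, here is a rough sketch of assembling a dataset citation from a handful of elements (creators, year, title, version, publisher, identifier, access date). The choice of elements and their order is an assumption for illustration, not a published standard, and every value in the example, including the DOI-like string, is a placeholder.

```python
from dataclasses import dataclass


@dataclass
class DatasetCitation:
    """Hypothetical container for the elements of a dataset citation."""
    creators: list
    year: int
    title: str
    publisher: str
    identifier: str
    version: str = ""   # optional: which release of the data was used
    accessed: str = ""  # optional: when the data was retrieved

    def render(self) -> str:
        parts = [f"{'; '.join(self.creators)} ({self.year}): {self.title}."]
        if self.version:
            parts.append(f"Version {self.version}.")
        parts.append(f"{self.publisher}.")
        parts.append(self.identifier)
        text = " ".join(parts)
        if self.accessed:
            text += f" (accessed {self.accessed})"
        return text


# Placeholder values only; none of these refer to a real dataset.
citation = DatasetCitation(
    creators=["Example, A.", "Sample, B."],
    year=2012,
    title="Hypothetical Survey Data",
    publisher="Example Data Archive",
    identifier="doi:10.5555/example",
    version="2",
    accessed="2012-05-20",
)
print(citation.render())
```

The version and access date are included because, for data that changes over time, they are what tie the citation to the state of the data that was actually analyzed.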
14. CODATA Group on Data
Citation
• Approved by the CODATA 27th General Assembly in
Cape Town in October, 2010
TASKS
– Survey existing literature and existing data citation initiatives.
– Obtain input from stakeholders in library, academic, publishing,
and research communities.
– Hold at least one meeting and a workshop to help establish a
solid foundation of the state of the art and practices in this area.
– Work with the ISO and major regional and national standards
organizations to develop formal data citation standards and good
practices.
15. Issues Requiring Attention
Technical
• Metadata Standards
Identifiers
• DOI, PURL, ARC
Scientific
• Publication and peer review
Legal/Intellectual Property Rights
• Accessibility and reuse rights
Institutional
• Promotion & Tenure
• Reward infrastructure
Socio-cultural
• Culture of citing data
• Proper attribution
Financial
• Infrastructure support
Sustainability
16. Accomplishments so far
• Compiled a significant bibliography of 200+
references
• Conducted a survey for repository managers,
publishers, and funding organizations.
– More than 60 narrative responses
• Upcoming meeting in Copenhagen in June
• Final meeting in Taipei in October with final
report to follow.
17. BRDI Meeting in Berkeley
• Two-day meeting in Berkeley, CA on data
citation issues in August 2011
• Covered topics critical to data citation
– Administration -- Legal
– Publishing -- Policy
– Identification -- P&T
– Funding support -- Provenance
National Academies report to be published in June
18. What does the
future hold?
19. Standards for Data: Work
areas?
Many potential areas for work in sharing of scientific data including:
• Author/Contributor disambiguation & other issues
• Data Equivalence – How does one know that this thing and that are
equivalent (i.e., contain same data)?
• Systemic metadata
What is the form of this information?
What are its structural components?
• Archival issues
Storage, physical level, but also migration issues
• Bibliographic information for discovery, delivery and reuse
• Rights issues – Ownership, recognition, sharing, privacy
20. ORCID – Author disambiguation
Founded by CrossRef,
Thomson-Reuters,
Nature in 2009
Now 328 participant
organizations, 50 of
which have provided
sponsorship funding
Prototype technology
Full launch in Fall 2012
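Part of what makes an identifier useful for disambiguation is that machines can catch mistyped values. As far as I understand it, ORCID iDs (like ISNIs) are 16 characters whose last character is an ISO 7064 MOD 11-2 check character. The sketch below implements that calculation. The function names are my own, and the sample iD is the fictitious one widely used in ORCID's documentation.

```python
def mod_11_2_check_char(base_digits: str) -> str:
    """Compute the ISO 7064 MOD 11-2 check character for a string of digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_well_formed(identifier: str) -> bool:
    """Check length, digits, and the trailing check character of a 16-character iD."""
    compact = identifier.replace("-", "").upper()
    if len(compact) != 16 or not compact[:15].isdigit():
        return False
    return mod_11_2_check_char(compact[:15]) == compact[15]


print(is_well_formed("0000-0002-1825-0097"))  # sample iD with a valid check digit
print(is_well_formed("0000-0002-1825-0098"))  # last character altered
```

A passing check only means the string is internally consistent. It does not confirm that the identifier has been assigned to anyone.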
21. International Standard Name ID
• ISO 27729:2012
Information &
Documentation --
International Standard
Name Identifier
• Another identifier for
public identity of
parties
• Application for
institutions (NISO I2)
• Launched in Spring ’12
22. Data Equivalence
• Creation of a high-
level conceptual
model of data
description
• A “FRBR” for data
• Defines the
distinctions between
states &
transformations of
data
• Basis for identification
& description
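As a narrow illustration of the equivalence question, the sketch below fingerprints a small table so that two copies with the same values produce the same digest even when their rows or columns are in a different order. It is an assumption-laden toy, not the conceptual model described on the slide: the function name and sample rows are invented, and it treats only one very literal sense of "same data".

```python
import hashlib
import json


def content_fingerprint(rows: list) -> str:
    """Digest of a table that ignores row order and column order."""
    # Serialize each row with sorted keys, then sort the serialized rows,
    # so that neither column order nor row order affects the result.
    canonical_rows = sorted(json.dumps(row, sort_keys=True) for row in rows)
    payload = "\n".join(canonical_rows).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


# Made-up sample rows, for illustration only.
a = [{"id": 1, "mass_g": 22}, {"id": 2, "mass_g": 40}]
b = [{"mass_g": 40, "id": 2}, {"id": 1, "mass_g": 22}]  # same data, reordered
c = [{"id": 1, "mass_g": 22}, {"id": 2, "mass_g": 41}]  # one value differs

print(content_fingerprint(a) == content_fingerprint(b))
print(content_fingerprint(a) == content_fingerprint(c))
```

Two files that differ in format, units, or precision would get different fingerprints here even if a scientist would call them the same data, which is why a higher-level model of states and transformations is needed.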
23. Too many players, doing too many
things?
Source: XKCD
24. Other Initiatives & Links
• CODATA/ICSTI – Task Group on Data Citation Standards and Practices
• DataCite – datacite.org
• NISO/NFAIS Group on Supplemental Journal Materials
• National Academies Board on Research Data and Information
– For Attribution: Developing Data Attribution and Citation Practices and Standards in
Berkeley, CA. August 2011. Report due in January
• Cite Datasets and Link to Publications by Digital Curation Centre
• ORCID - about.orcid.org
• International Standard Name ID – www.isni.org
• ScienceCommons - Scholar’s Copyright Project
• International Council for Scientific and Technical Information (ICSTI)
– Multimedia Search and Retrieval, and Interactive Journal Articles Projects
• Dublin Core Metadata Initiative - Science and Metadata Community
• DataNet projects funded by NSF, launched in 2009
• NISO/OAI ResourceSync project funded by Alfred P. Sloan Foundation
25. Information Standards Quarterly (ISQ)
Summer 2010 Issue of ISQ
Special issue on Enhanced
Journal Articles
Article-level Enhancements in the
Humanities and Social
Sciences
Hosting Supplemental Materials
Archiving Supplemental Materials
DOIs - Linking and Beyond
Supplemental Materials Survey
Report on NISO/NFAIS project
NOW AVAILABLE
OPEN ACCESS
www.niso.org/publications/isq
26. Questions?
Todd Carpenter
Executive Director
tcarpenter@niso.org
National Information Standards Organization (NISO)
One North Charles Street, Suite 1905
Baltimore, MD 21201 USA
+1 (301) 654-2512
www.niso.org
Editor's Notes
A great deal has been written of late about the growing prevalence of data, data analysis and data sharing in science. This is just one of those works, but one I highly recommend, edited by Tony Hey of Microsoft Research and others. It’s freely available online and worth the time to go through.
And as science has changed, so too have the ways in which scholarship is communicated and distributed. Increasingly paper is no longer the main medium of distributing findings. Most academic journals have already moved online and many are ceasing print publication altogether. Increasingly, the limitations of physical distribution are also falling away. We are no longer tied simply to text or static images. Science can be communicated in moving images, data visualizations, programs, and even datasets themselves, which can be separately analyzed, reprocessed and reused.
Massive in scale: terabytes or petabytes of data are difficult to manage and store.
This is just one example of the problems surrounding data publishing and use in science, on one of the vectors which I’ve just mentioned.
Finally, another critical question about data is provenance.
So I’m going to talk about two projects, out of the many, many, many initiatives underway, that NISO and I are directly involved in related to data and data management. The first is data citation.
CODATA = Committee on Data for Science and Technology of the International Council for Science (ICSU)
Unfortunately, my crystal ball doesn’t work terribly well.
Each community is doing its own thing, for its own reasons, to support its own needs. This isn’t wrong. It’s completely understandable. However, it will increasingly cause problems when we start talking about machines talking to machines and interoperability, which is what makes access to data so valuable. Sure, a human might be able to decipher the differences between biology and ecology, or between population science and voting results. Computers can’t. As just one example, we can’t do anything if we don’t have the right metadata. Data tables are filled with figures that are just numbers. Without context, a 22 is as useless a figure as 356,000, when one is tons and one is grams.
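The tons-versus-grams point can be sketched in a few lines: once each figure carries its unit as metadata, a program can put the two on a common scale and compare them. The `Measurement` class and the conversion table below are invented for illustration, and "t" is assumed to mean a metric tonne (1,000,000 g), which the note above does not specify.

```python
from dataclasses import dataclass

# Assumption: "t" is a metric tonne. A short or long ton would need another factor.
GRAMS_PER_UNIT = {"g": 1.0, "kg": 1_000.0, "t": 1_000_000.0}


@dataclass(frozen=True)
class Measurement:
    """A number together with the unit metadata that gives it meaning."""
    value: float
    unit: str

    def to_grams(self) -> float:
        # An unknown unit raises KeyError rather than silently guessing.
        return self.value * GRAMS_PER_UNIT[self.unit]


a = Measurement(22, "t")
b = Measurement(356_000, "g")

# Compared as bare numbers, 22 < 356,000; with units, the ordering reverses.
print(a.value < b.value)
print(a.to_grams() > b.to_grams())
```

Without the `unit` field, a machine has no way to know that the smaller number is the larger quantity.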