This document provides guidance on best practices for data stewardship for researchers. It discusses why data management is an important topic, including funder requirements for data sharing and increased emphasis on reproducibility. The document outlines best practices such as creating data management plans, storing data in repositories, and sharing data. Tips are provided on overcoming barriers to data sharing through education and promoting a culture shift toward recognition of data as a first-class research product.
Data Stewardship Best Practices for Researchers
1. Data Stewardship for Researchers
Tips, Tools, & Guidance
Carly Strasser, PhD
California Digital Library
@carlystrasser
carly.strasser@ucop.edu
31 July 2013
CLIR Symposium
From Calisphere, Courtesy of UC Riverside, California Museum of Photography
From Calisphere, Courtesy of Thousand Oaks Library
2. Roadmap
1. Background
2. Why you should care
3. Best practices
4. Toolbox
3. NSF funded DataNet Project
Office of Cyberinfrastructure
Two main goals:
1. Build a network for data repositories
2. Build community around data
Focus on Earth | environmental | ecological | oceanographic data
4. Why don’t people share data?
Is data management being taught?
Do attitudes about sharing differ among disciplines?
How can we promote storing data in repositories?
What barriers to sharing can we eliminate?
What role can libraries play in data education?
5.
6. Why is data management a hot topic?
From Flickr by Velo Steve
7. Back in the day…
Da Vinci, Curie, Newton, Darwin
classicalschool.blogspot.com
8. Digital data
From Flickr by Flickmor
From Flickr by US Army Environmental Command
From Flickr by DW0825
C. Strasser
Courtesy of WHOI
From Flickr by deltaMike
14. From Flickr by hyperion327
From Flickr by Redden-McAllister
15. …
Back in February:
“Federal agencies investing in research and development (more than $100 million in annual expenditures) must have clear and coordinated policies for increasing public access to research products.”
16.
1. Maximize free public access
2. Ensure researchers create data management plans
3. Allow costs for data preservation and access in proposal budgets
4. Ensure evaluation of data management plan merits
5. Ensure researchers comply with their data management plans
6. Promote data deposition into public repositories
7. Develop approaches for identification and attribution of datasets
8. Educate folks about data stewardship
From Flickr by Joe Crimmings Photography
29. What should researchers be doing?
From Flickr by whatthefeed
30. Best Practices: data management
1. Planning
2. Data collection & organization
3. Quality control & assurance
4. Metadata
5. Workflows
6. Data stewardship & reuse
From Flickr by Big Swede Guy
31. 2. Data collection & organization
Create unique identifiers
• Decide on naming scheme early
• Create a key
• Different for each sample
From Flickr by sjbresnahan
From Flickr by zebbie
32. 2. Data collection & organization
Standardize
• Consistent within columns
– only numbers, dates, or text
• Consistent names, codes, formats
Modified from K. Vanderbilt
From Pink Floyd, The Wall
themurkyfringe.com
33. 2. Data collection & organization
Standardize
• Reduce possibility of manual error by constraining entry choices
Google Docs Forms
Excel lists
Data validation
Modified from K. Vanderbilt
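Constraining entry choices can also be checked after the fact with a short script. The sketch below is illustrative only: the column names and allowed values are hypothetical, not taken from the slides, and a real project would substitute its own controlled vocabulary.

```python
# Hypothetical controlled vocabularies; replace with your project's own codes.
ALLOWED_VALUES = {
    "habitat": {"lake", "grassland"},
    "life_stage": {"egg", "juvenile", "adult"},
}


def find_invalid_entries(rows, allowed=ALLOWED_VALUES):
    """Return (row_number, column, value) for every entry outside its allowed set."""
    problems = []
    for row_number, row in enumerate(rows, start=1):
        for column, choices in allowed.items():
            value = row.get(column)
            if value not in choices:
                problems.append((row_number, column, value))
    return problems


# Made-up example rows with two typical manual-entry errors.
rows = [
    {"habitat": "lake", "life_stage": "adult"},
    {"habitat": "Lake", "life_stage": "adult"},      # inconsistent capitalization
    {"habitat": "grassland", "life_stage": "adlt"},  # typo
]
problems = find_invalid_entries(rows)
for row_number, column, value in problems:
    print(f"row {row_number}: {column} = {value!r} is not an allowed value")
```

A drop-down list in a form or spreadsheet prevents these errors at entry time; a check like this catches them in files that were typed freehand.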
34. 2. Data collection & organization
Create parameter table
From doi:10.3334/ORNLDAAC/777
Create a site table
From doi:10.3334/ORNLDAAC/777
From R Cook, ESA Best Practices Workshop 2010
35. 2. Data collection & organization
Use descriptive file names
• Unique
• Reflect contents
Bad: Mydata.xls, 2001_data.csv, best version.txt
Better: Eaffinis_nanaimo_2010_counts.xls (study organism, site name, year, what was measured)
*Not for everyone
From R Cook, ESA Best Practices Workshop 2010
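One way to keep file names consistent is to build them from their parts instead of typing them by hand. This is a minimal sketch: the function name and the letters-and-digits rule are assumptions made for illustration, and the part order simply follows the "Better" example above.

```python
import re


def build_file_name(organism, site, year, measurement, extension="csv"):
    """Join the four descriptive parts with underscores, as in the slide's example.

    The letters-and-digits restriction is an assumption of this sketch; it keeps
    spaces and punctuation out of file names.
    """
    parts = [organism, site, str(year), measurement]
    for part in parts:
        if not re.fullmatch(r"[A-Za-z0-9]+", part):
            raise ValueError(f"Use only letters and digits in name parts: {part!r}")
    return "_".join(parts) + "." + extension


name = build_file_name("Eaffinis", "nanaimo", 2010, "counts", extension="xls")
print(name)
```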
36. 2. Data collection & organization
Organize files logically
Biodiversity
  Lake
    Experiments
      Biodiv_H20_heatExp_2005to2008.csv
      Biodiv_H20_predatorExp_2001to2003.csv
      …
    Field work
      Biodiv_H20_PlanktonCount_2001toActive.csv
      Biodiv_H20_ChlAprofiles_2003.csv
      …
  Grassland
From S. Hampton
37. 2. Data collection & organization
Preserve information
• Keep raw data raw
• Use scripts to process data & save them with data
Raw data as .csv
R script for processing & analysis
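The "keep raw data raw" idea can be sketched as a script that only reads the raw file and writes its results somewhere else. The slide mentions an R script; Python is used here purely for illustration, and the file names, columns, and cleaning step are all hypothetical.

```python
import csv
import tempfile
from pathlib import Path


def process_raw_file(raw_path, processed_path):
    """Read the raw CSV, tidy the count column, and write a separate file.

    The raw file is only ever opened for reading, so it stays untouched.
    """
    raw_path, processed_path = Path(raw_path), Path(processed_path)
    if raw_path.resolve() == processed_path.resolve():
        raise ValueError("Refusing to overwrite the raw data file")
    with raw_path.open(newline="") as src, processed_path.open("w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            # Example cleaning step: strip stray spaces and store counts as integers.
            row["count"] = int(row["count"].strip())
            writer.writerow(row)


# Demonstration with a made-up raw file in a temporary directory.
work_dir = Path(tempfile.mkdtemp())
raw_file = work_dir / "raw_counts.csv"
raw_text = "site,count\nnanaimo, 12\nnanaimo,7 \n"
raw_file.write_text(raw_text)
processed_file = work_dir / "processed_counts.csv"
process_raw_file(raw_file, processed_file)
```

Saving a script like this next to the data records exactly how the processed file was produced.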
39. 3. Quality control and quality assurance
Before data collection
• Define & enforce standards
• Assign responsibility for data quality
From Flickr by StacieBee
40. 3. Quality control and quality assurance
After data entry
• Check for missing, impossible, anomalous values
• Perform statistical summaries
• Look for outliers
[Figure: example plot; axes run 0 to 60 and 0 to 40]
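The checks listed for after data entry can be scripted. The sketch below is one possible approach, not the presenter's method: the plausible range, the outlier rule (distance from the mean in standard deviations), the cutoff of 2.0, and the sample values are all assumptions chosen for illustration.

```python
import statistics


def quality_report(values, lower, upper, z_cutoff=2.0):
    """Flag missing, impossible, and outlying values by their position in the list.

    lower/upper define the physically plausible range; z_cutoff is the number of
    standard deviations from the mean beyond which a value is called an outlier.
    """
    missing = [i for i, v in enumerate(values) if v is None]
    present = [(i, v) for i, v in enumerate(values) if v is not None]
    impossible = [i for i, v in present if not (lower <= v <= upper)]
    plausible = [(i, v) for i, v in present if lower <= v <= upper]
    numbers = [v for _, v in plausible]
    mean = statistics.mean(numbers)
    sd = statistics.stdev(numbers)
    outliers = [i for i, v in plausible if sd > 0 and abs(v - mean) / sd > z_cutoff]
    return {
        "missing": missing,
        "impossible": impossible,
        "outliers": outliers,
        "mean": mean,
        "sd": sd,
    }


# Made-up temperature readings: one gap, one sentinel value, one suspicious spike.
temperatures = [12.1, 11.8, None, 12.4, -999.0, 12.0, 11.9, 25.0, 12.2, 12.3]
report = quality_report(temperatures, lower=-5.0, upper=40.0)
print(report)
```

Flagged values are candidates for review, not automatic deletions; with small samples a standard-deviation rule is blunt, so a plot is still worth a look.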
43. 4. Metadata basics
• Digital context
– Name of the data set
– The name(s) of the data file(s) in the data set
– Date the data set was last modified
– Example data file records for each data type file
– Pertinent companion files
– List of related or ancillary data sets
– Software (including version number) used to prepare/read the data set
– Data processing that was performed
• Personnel & stakeholders
– Who collected
– Who to contact with questions
– Funders
• Scientific context
– Scientific reason why the data were collected
– What data were collected
– What instruments (including model & serial number) were used
– Environmental conditions during collection
– Where collected & spatial resolution
– When collected & temporal resolution
– Standards or calibrations used
• Information about parameters
– How each was measured or produced
– Units of measure
– Format used in the data set
– Precision & accuracy if known
• Information about data
– Definitions of codes used
– Quality assurance & control measures
– Known problems that limit data use (e.g. uncertainty, sampling problems)
– How to cite the data set
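A metadata checklist like this can be kept as a small machine-readable record and checked for gaps. The following sketch is not a formal metadata standard: the field names are hypothetical labels for a few of the items above, and the values are placeholders.

```python
import json

# Hypothetical field names covering a handful of the checklist items above.
REQUIRED_FIELDS = [
    "data_set_name",
    "file_names",
    "last_modified",
    "collected_by",
    "contact",
    "units",
    "how_to_cite",
]


def missing_metadata_fields(record, required=REQUIRED_FIELDS):
    """List the required fields that are absent or left empty."""
    return [field for field in required if not record.get(field)]


# Placeholder record; every value here is invented for the example.
record = {
    "data_set_name": "Example plankton counts",
    "file_names": ["Eaffinis_nanaimo_2010_counts.xls"],
    "last_modified": "2013-07-31",
    "collected_by": "A. Researcher",
    "contact": "",
    "units": {"count": "individuals per sample"},
}
missing = missing_metadata_fields(record)
metadata_json = json.dumps(record, indent=2, sort_keys=True)
print("Still to fill in:", missing)
```

For deposit in a repository, a record like this would normally be mapped onto whichever metadata standard the repository expects.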
44. 4. Metadata basics
What is metadata?
• Provides structure to describe data
Common terms | definitions | language | structure
Select the appropriate standard
• Lots of different standards
EML, FGDC, ISO19115, DarwinCore,…
• Tools for creating metadata files
Morpho (EML), Metavist (FGDC), NOAA MERMaid (CSGDM)
46. 5. Workflows
Workflow: how you get from the raw data to the final products of your research
Simple workflows: flow charts
[Flow chart boxes]
Temperature data
Salinity data
Data import into R
Data in R format
Quality control & data cleaning
“Clean” T & S data
Analysis: mean, SD
Summary statistics
Graph production
47. 5. Workflows
Workflow: how you get from the raw data to the final products of your research
Simple workflows: commented scripts
• R, SAS, MATLAB
• Well-documented code is…
Easier to review
Easier to share
Easier to repeat analysis
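A commented script for the temperature and salinity workflow might look like the sketch below. The slides name R, SAS, and MATLAB; Python is used here only as an illustration, and the station data, column names, and cleaning rule (drop rows with a missing reading) are assumptions.

```python
import csv
import io
import statistics

# Made-up raw data standing in for the temperature and salinity files.
RAW_CSV = """station,temperature_c,salinity_psu
A,12.1,31.5
B,11.8,
C,12.4,31.9
D,,32.1
"""


def import_data(text):
    """Step 1: import the raw data into a list of rows."""
    return list(csv.DictReader(io.StringIO(text)))


def clean(rows):
    """Step 2: quality control. Drop rows with a missing reading, convert to numbers."""
    cleaned = []
    for row in rows:
        if row["temperature_c"] == "" or row["salinity_psu"] == "":
            continue
        cleaned.append({
            "station": row["station"],
            "temperature_c": float(row["temperature_c"]),
            "salinity_psu": float(row["salinity_psu"]),
        })
    return cleaned


def summarize(rows, column):
    """Step 3: analysis. Mean and standard deviation of one column."""
    values = [row[column] for row in rows]
    return {"mean": statistics.mean(values), "sd": statistics.stdev(values)}


raw_rows = import_data(RAW_CSV)
clean_rows = clean(raw_rows)
temperature_summary = summarize(clean_rows, "temperature_c")
salinity_summary = summarize(clean_rows, "salinity_psu")
print(temperature_summary, salinity_summary)
```

Each function corresponds to one box in the flow chart, so the script doubles as a record of the workflow.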
49. 5. Workflows
Workflows enable…
Reproducibility: can someone independently validate findings?
Transparency: others can understand how you arrived at your results
Executability: others can re-run or re-use your analysis
Coming Soon: workflow sharing requirements!
From Flickr by merlinprincesse
51. 6. Data stewardship & reuse
Use stable formats: csv, txt, tiff
Create back-up copies: original, near, far
Periodically test ability to restore information
Modified from R. Cook
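Testing whether a back-up can be restored can start with comparing checksums of the original and each copy. This is a rough sketch under stated assumptions: the "near" and "far" folders are stand-ins inside a temporary directory, whereas real copies would live on separate devices or sites.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path):
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def back_up(original, backup_dir):
    """Copy the file into backup_dir and return the path of the copy."""
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    copy_path = backup_dir / Path(original).name
    shutil.copy2(original, copy_path)
    return copy_path


def backup_is_intact(original, copy_path):
    """A copy counts as intact when its checksum matches the original's."""
    return sha256_of(original) == sha256_of(copy_path)


# Demonstration with a made-up data file and two stand-in backup locations.
work_dir = Path(tempfile.mkdtemp())
original = work_dir / "counts.csv"
original.write_text("site,count\nnanaimo,12\n")
near_copy = back_up(original, work_dir / "near")
far_copy = back_up(original, work_dir / "far")
print(backup_is_intact(original, near_copy), backup_is_intact(original, far_copy))
```

Running a check like this on a schedule gives early warning of a damaged copy, though a full restore test should also confirm the file still opens in the software that needs it.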
52. 6. Data stewardship & reuse
Store your data in a repository
Institutional archive
Discipline/specialty archive
Ask a librarian
Repos of repos: databib.org, re3data.org
From Flickr by torkildr
53. 6. Data stewardship & reuse
Practice Data Citation
Allows readers to find data products
Get credit for data and publications
Promotes reproducibility
Better measure of research impact
Example: Sidlauskas, B. 2007. Data from: Testing for unequal rates of morphological diversification in the absence of a detailed phylogeny: a case study from characiform fishes. Dryad Digital Repository. doi:10.5061/dryad.20
Persistent Unique Identifier
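The parts of a data citation can be assembled in a consistent order with a few lines of code. This sketch follows the layout of the Dryad example above; the function names are hypothetical, and the only outside fact relied on is that a DOI resolves through https://doi.org/.

```python
def format_data_citation(author, year, title, repository, doi):
    """Assemble a data citation in the order used by the Dryad example."""
    return f"{author} {year}. {title}. {repository}. doi:{doi}"


def doi_to_url(doi):
    """Turn a bare DOI into a resolvable link via the doi.org resolver."""
    return "https://doi.org/" + doi


citation = format_data_citation(
    author="Sidlauskas, B.",
    year=2007,
    title=("Data from: Testing for unequal rates of morphological "
           "diversification in the absence of a detailed phylogeny: "
           "a case study from characiform fishes"),
    repository="Dryad Digital Repository",
    doi="10.5061/dryad.20",
)
link = doi_to_url("10.5061/dryad.20")
print(citation)
print(link)
```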
55. What is a data management plan?
A document that describes what you will do with your data throughout the research project
From Flickr by Barbies Land
56. DMP for funders:
A short plan submitted alongside grant applications
But they all have different requirements and express them in different ways
An outline of
– what will be collected
– methods
– standards
– metadata
– sharing/access
– long-term storage
Includes how and why
From Flickr by 401(K) 2013
57. NSF DMP Requirements
From Grant Proposal Guidelines:
DMP supplement may include:
1. the types of data, samples, physical collections, software, curriculum materials, and other materials to be produced in the course of the project
2. the standards to be used for data and metadata format and content (where existing standards are absent or deemed inadequate, this should be documented along with any proposed solutions or remedies)
3. policies for access and sharing including provisions for appropriate protection of privacy, confidentiality, security, intellectual property, or other rights or requirements
4. policies and provisions for re-use, re-distribution, and the production of derivatives
5. plans for archiving data, samples, and other research products, and for preservation of access to them
58. 1. Types of data & other information
• Types of data
• Existing data
• How/when/where created?
• How processed?
• Quality control
• Security
• Who is responsible
biology.kenyon.edu
C. Strasser
From Flickr by Lazurite
59. 2. Data & metadata standards
• Metadata needed
• How captured
• Standards
Wired.com
60. 3. Policies for access & sharing
4. Policies for re-use & re-distribution
• Obligation to share
• How/when/where available
• Getting access
• Copyright / IP
• Permission restrictions
• Embargo periods
• Ethics/privacy
• How cited
From Flickr by maryfrancesmain
61. 5. Plans for archiving & preservation
• What & where
• Metadata
• Who’s responsible
From Flickr by theManWhoSurfedTooMuch
63. NSF’s Vision*
DMPs and their evaluation will grow & change over time
Peer review will determine next steps
Community-driven guidelines
Evaluation will vary with directorate, division, & program officer
*Unofficially
67. Reproducibility
E-notebooks
Online science
http://datapub.cdlib.org/software-for-reproducibility-part-2-the-tools/
From Flickr by karindalziel
78. Articles are the butterfly pinned on the wall. Pretty but not very useful. They are only the advertisements for scholarship.
– A. Levi, U. Maryland College of Information Studies
From Flickr by LisaW123
80. Doing science is a privilege – not a right
From Flickr by dotpolka
81. There is a social contract of science: we have an obligation to ensure dissemination, validation, & advancement. To not do so is science malpractice.
– Brian Hole, Ubiquity Press at UCL
Who's responsible? Researchers, publishers, libraries, repositories…
From Flickr by mikerosebery