4. NSF funded DataNet Project
Office of Cyberinfrastructure
Community Engagement & Outreach
Cyberinfrastructure
From Flickr by wetwebwork
Courtesy of DataONE
5. What role can libraries play in data education?
Why don’t people share data?
What barriers to sharing can we eliminate?
Is data management being taught?
Do attitudes about sharing differ among disciplines?
How can we promote storing data in repositories?
6. Roadmap
5. Tools
4. DCXL
3. Best practices for scientists
2. Barriers to best practices
1. Mistakes scientists make
7. From Flickr by DW0825
From Flickr by Flickmor
From Flickr by deltaMike
Digital data
www.woodrow.org
C. Strasser
Courtesy of WHOI
From Flickr by US Army Environmental Command
15. Where data end up
From Flickr by diylibrarian
www
blog.order2disorder.com
From Flickr by csessums
Data
Metadata
From Flickr by csessums
Recreated from Klump et al. 2006
16. Who cares?
From Flickr by Redden-McAllister
From Flickr by AJC1
www.rba.gov.au
17. Where data end up
From Flickr by diylibrarian
www
Data
www
Metadata
Recreated from Klump et al. 2006
19. UGLY TRUTH
Many Earth | Environmental | Ecological scientists…
are not taught data management
don’t know what metadata are
can’t name data centers or repositories
don’t share data publicly or store it in an archive
aren’t convinced they should share data
5shortessays.blogspot.com
20. Roadmap
5. Tools
4. DCXL
3. Best practices for scientists
2. Barriers to best practices
1. Mistakes scientists make
21. Barriers
Cost
Time
Personnel
Software, hardware
cultblender.wordpress.com
22. Barriers
Cost: time, personnel, software, hardware
Culture of Science
• Not the norm
• Lack of training
• Disparate data
23. Barriers
Cost: time, personnel, software, hardware
Culture of Science
Loss of rights or benefits
Misuse of data
Missed opportunities
Conflict
24. Barriers
Cost: time, personnel, software, hardware
Culture of Science
Loss of rights or benefits
Lack of incentives
Time consuming & expensive
Reward structure
Few requirements
25. Are Undergrads Learning About Data Management?
Importance Versus Assessment
• Metadata generation
• Software choice
• File naming
• QA/QC
• Backing up
• Workflows
• Data sharing
• Data re-use
• Meta-analysis
• Reproducibility
• Notebook protocols
• Databases
[Scatter plot: Important (y-axis, 0 to 40) versus Assessed (x-axis, 0 to 40)]
If it’s important, why isn’t it taught?
26. Barriers to Teaching Data Management
Too advanced
Not a priority
Not appropriate level
Students don’t know software
Time
No lab
No training
Covered in lab
Too big
27. Roadmap
5. Tools
4. DCXL
3. Best practices for scientists
2. Barriers to best practices
1. Mistakes scientists make
28. Best Practices for Data Management
1. Planning
2. Data collection & organization
3. Quality control & assurance
4. Metadata
5. Workflows
6. Data stewardship & reuse
29. 2. Data collection & organization
Create unique identifiers
• Decide on naming scheme early
• Create a key
• Different for each sample
From Flickr by zebbie
From Flickr by sjbresnahan
30. 2. Data collection & organization
Standardize
• Consistent within columns – only numbers, dates, or text
• Consistent names, codes, formats
Modified from K. Vanderbilt
From Pink Floyd, The Wall
themurkyfringe.com
31. 2. Data collection & organization
Standardize
• Reduce possibility of manual error by constraining entry choices
Excel lists: Data validation
Google Docs: Forms
Modified from K. Vanderbilt
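Outside a spreadsheet, the same idea of constraining entry choices can be sketched as a per-row check. This is only an illustration: the column names (`site`, `date`, `count`) and the allowed site list are hypothetical and do not come from the slides.

```python
import re

# Hypothetical controlled vocabulary for a "site" column (an assumption).
ALLOWED_SITES = {"nanaimo", "victoria", "tofino"}
# Accept dates written as YYYY-MM-DD only, so one column holds one format.
DATE_PATTERN = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def validate_row(row):
    """Return a list of problems found in one record (empty list means OK)."""
    problems = []
    if row.get("site") not in ALLOWED_SITES:
        problems.append(f"unknown site: {row.get('site')!r}")
    if not DATE_PATTERN.match(str(row.get("date", ""))):
        problems.append(f"date not in YYYY-MM-DD form: {row.get('date')!r}")
    try:
        float(row.get("count", ""))
    except (TypeError, ValueError):
        problems.append(f"count is not a number: {row.get('count')!r}")
    return problems


if __name__ == "__main__":
    print(validate_row({"site": "Nanaimo ", "date": "6/1/10", "count": "twelve"}))
```

Rejecting a value at entry time is usually cheaper than repairing inconsistent codes after the field season.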
32. 2. Data collection & organization
Create parameter table
Create a site table
From doi:10.3334/ORNLDAAC/777
From doi:10.3334/ORNLDAAC/777
From R Cook, ESA Best Practices Workshop 2010
33. 2. Data collection & organization
Use descriptive file names*
• Unique
• Reflect contents
Bad: Mydata.xls, 2001_data.csv, best version.txt
Better: Eaffinis_nanaimo_2010_counts.xls
Study organism | Site name | Year | What was measured
*Not for everyone
From R Cook, ESA Best Practices Workshop 2010
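One way to keep names consistent is to build them from their components instead of typing them by hand. The helper below is a hypothetical sketch (the function name and its cleaning rule are assumptions); it follows the organism, site, year, measurement order of the "Better" example.

```python
import re


def build_file_name(organism, site, year, measured, extension="csv"):
    """Join the four descriptive components with underscores."""
    parts = [organism, site, str(year), measured]
    # Drop spaces and punctuation so the name is safe on any file system.
    cleaned = [re.sub(r"[^A-Za-z0-9]+", "", part) for part in parts]
    if not all(cleaned):
        raise ValueError("every component needs at least one letter or digit")
    return "_".join(cleaned) + "." + extension


if __name__ == "__main__":
    print(build_file_name("Eaffinis", "nanaimo", 2010, "counts", "xls"))
```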
34. 2. Data collection & organization
Organize files logically
Biodiversity
  Lake
    Experiments
      Biodiv_H20_heatExp_2005to2008.csv
      Biodiv_H20_predatorExp_2001to2003.csv
      …
    Field work
      Biodiv_H20_PlanktonCount_2001toActive.csv
      Biodiv_H20_ChlAprofiles_2003.csv
      …
  Grassland
From S. Hampton
35. 2. Data collection & organization
Preserve information
• Keep raw data raw
• Use scripts to process data & save them with data
Raw data as .csv
R script for processing & analysis
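A minimal sketch of "keep raw data raw", written in Python rather than the R script the slide mentions: the script reads the raw file, writes a cleaned copy somewhere else, and refuses to overwrite its input. The file paths and the cleaning step (trimming whitespace) are assumptions for illustration, and rows are assumed to be well formed.

```python
import csv
from pathlib import Path


def process_raw(raw_path, out_path):
    """Write a cleaned copy of raw_path to out_path; never modify raw_path."""
    raw_path, out_path = Path(raw_path), Path(out_path)
    if raw_path.resolve() == out_path.resolve():
        raise ValueError("refusing to overwrite the raw data file")
    with raw_path.open(newline="") as src, out_path.open("w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            # Example processing step: trim stray whitespace from every cell.
            writer.writerow({key: (value or "").strip() for key, value in row.items()})
```

Saving this script next to the raw file documents every change made on the way to the processed data.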
36. 2. Data collection & organization
All of the things that make Excel great for data organization are bad for archiving!
What to do?
1. Create archive-ready raw data
2. Put it somewhere special
3. Have your fun with fancy Excel techniques
4. Keep archiving in mind
37. 3. Quality control and quality assurance
Define & enforce standards
Double data entry
Document changes
Minimize manual data entry
No missing, impossible, or anomalous values
• Perform statistical summaries
• Use illegal data filter
• Look for outliers
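The three checks in the last bullet group can be sketched with the Python standard library. The legal range and the 1.5 × IQR outlier rule below are assumptions chosen for illustration; the slides do not specify a particular method.

```python
import statistics


def summarize(values):
    """Basic statistical summary of a list of numbers."""
    return {
        "n": len(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),
    }


def illegal_values(values, low, high):
    """Illegal data filter: return entries that are missing or out of range."""
    return [v for v in values if v is None or not (low <= v <= high)]


def outliers(values, k=1.5):
    """Flag values more than k interquartile ranges outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    return [v for v in values if v < q1 - k * spread or v > q3 + k * spread]


if __name__ == "__main__":
    counts = [10, 12, 11, 13, 12, 11, 55]
    print(summarize(counts))
    print(outliers(counts))
```

A flagged value is a prompt to check the field notes, not an automatic deletion.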
38. 4. Metadata basics
• Scientific context
  • Scientific reason why the data were collected
  • What data were collected
  • What instruments (including model & serial number) were used
  • Environmental conditions during collection
  • Where collected & spatial resolution
  • When collected & temporal resolution
  • Standards or calibrations used
• Information about parameters
  • How each was measured or produced
  • Units of measure
  • Format used in the data set
  • Precision & accuracy if known
• Information about data
  • Definitions of codes used
  • Quality assurance & control measures
  • Known problems that limit data use (e.g. uncertainty, sampling problems)
• How to cite the data set
• Digital context
  • Name of the data set
  • The name(s) of the data file(s) in the data set
  • Date the data set was last modified
  • Example data file records for each data type file
  • Pertinent companion files
  • List of related or ancillary data sets
  • Software (including version number) used to prepare/read the data set
  • Data processing that was performed
• Personnel & stakeholders
  • Who collected
  • Who to contact with questions
  • Funders
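A checklist like the one above can be turned into a simple completeness check before a data set is shared. The field names below are a hypothetical subset invented for this sketch; they are not taken from EML, FGDC, or any other metadata standard.

```python
import json

# Hypothetical subset of the checklist, expressed as record keys (assumed names).
REQUIRED_FIELDS = [
    "dataset_name",
    "file_names",
    "last_modified",
    "collected_by",
    "contact",
    "units",
    "citation",
]


def missing_fields(record):
    """Return the required fields that are absent or empty in the record."""
    return [field for field in REQUIRED_FIELDS if not record.get(field)]


def to_json(record):
    """Serialize a complete record; raise if required fields are missing."""
    missing = missing_fields(record)
    if missing:
        raise ValueError("metadata incomplete, missing: " + ", ".join(missing))
    return json.dumps(record, indent=2, sort_keys=True)
```

In practice a tool such as Morpho would produce standard-conformant metadata; this sketch only shows the habit of checking for gaps.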
39. 4. Metadata basics
• Provides structure to describe data
Common terms | definitions | language | structure
• Lots of different standards
EML, FGDC, ISO19115, DarwinCore,…
• Tools for creating metadata files
Morpho (EML), Metavist (FGDC), NOAA MERMaid (CSDGM)
40. 5. Workflows
Simplest workflows: commented scripts, flow charts
[Flow chart]
Temperature data, Salinity data
Data import into R
Data in R format
Quality control & data cleaning
“Clean” T & S data
Analysis: mean, SD
Summary statistics
Graph production
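The flow chart above describes an R workflow; as a rough illustration of a commented script that follows the same steps, here is a Python sketch. The column names (`temperature`, `salinity`) and the plausible-value ranges are assumptions, and graph production is left out.

```python
import csv
import io
import statistics


def import_column(csv_text, column):
    """Step 1: import one column of raw text values from CSV data."""
    return [row[column] for row in csv.DictReader(io.StringIO(csv_text))]


def clean(raw_values, low, high):
    """Step 2: quality control; keep numeric values inside [low, high]."""
    cleaned = []
    for raw in raw_values:
        try:
            value = float(raw)
        except (TypeError, ValueError):
            continue  # non-numeric entries such as "NA" are dropped
        if low <= value <= high:
            cleaned.append(value)
    return cleaned


def summarize(values):
    """Step 3: analysis; mean and sample standard deviation."""
    return {"mean": statistics.mean(values), "sd": statistics.stdev(values)}


def run_workflow(csv_text):
    """Run import, cleaning, and summary for temperature and salinity."""
    temperature = clean(import_column(csv_text, "temperature"), -2.0, 40.0)
    salinity = clean(import_column(csv_text, "salinity"), 0.0, 45.0)
    return {"temperature": summarize(temperature), "salinity": summarize(salinity)}
```

Because each step is a named function, the script doubles as documentation of what was done to the data and in what order.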
42. 5. Workflows
Workflows enable
Reproducibility: can someone independently validate findings?
Transparency: others can understand how you arrived at your results
Executability: others can re-run or re-use your analysis
From Flickr by merlinprincesse
43. 6. Data stewardship & reuse
Use stable formats: csv, txt, tiff
Create back-up copies: original, near, far
Periodically test ability to restore information
Modified from R. Cook
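Making copies and then testing them can be sketched as follows. The backup directories stand in for the "near" and "far" locations and are hypothetical; comparing SHA-256 checksums is one common way to confirm a copy is intact, not a method named in the slides.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path):
    """Checksum of a file, read in chunks so large files fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def back_up(original, backup_dirs):
    """Copy the original file into each backup directory; return the copies."""
    original = Path(original)
    copies = []
    for directory in backup_dirs:
        directory = Path(directory)
        directory.mkdir(parents=True, exist_ok=True)
        copies.append(Path(shutil.copy2(original, directory / original.name)))
    return copies


def verify_copies(original, copies):
    """True only if every copy has the same checksum as the original."""
    expected = sha256_of(original)
    return all(sha256_of(copy) == expected for copy in copies)
```

Running `verify_copies` on a schedule is a cheap stand-in for a full restore test.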
44. 6. Data stewardship & reuse
Where do I put it?
Institutional archive
Discipline/specialty archive
DataCite list of repositories: www.datacite.org/repolist
From Flickr by torkildr
45. 6. Data stewardship & reuse
Data Citation: Why everyone should do it
Allow readers to find data products
Get credit for data and publications
Promote reproducibility
Better measure of research impact
Example: Sidlauskas, B. 2007. Data from: Testing for unequal rates of morphological diversification in the absence of a detailed phylogeny: a case study from characiform fishes. Dryad Digital Repository. doi:10.5061/dryad.20
Learn more at www.datacite.org
Modified from R. Cook
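The example citation follows an author, year, title, repository, identifier pattern. A hypothetical helper that assembles a citation in that order is sketched below; the function name and parameters are assumptions, and repositories may prescribe their own formats.

```python
def format_data_citation(author, year, title, repository, doi):
    """Assemble a data citation as: Author Year. Title. Repository. doi:DOI"""
    title = title.rstrip(".")  # avoid a doubled period after the title
    return f"{author} {year}. {title}. {repository}. doi:{doi}"


if __name__ == "__main__":
    print(format_data_citation(
        "Sidlauskas, B.",
        2007,
        "Data from: Testing for unequal rates of morphological diversification "
        "in the absence of a detailed phylogeny: a case study from characiform fishes",
        "Dryad Digital Repository",
        "10.5061/dryad.20",
    ))
```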
46. Best Practices for Data Management
1. Planning
2. Data collection & organization
3. Quality control & assurance
4. Metadata
5. Workflows
6. Data stewardship & reuse
7. Planning
47. 1. Planning
What is a data management plan?
A document that describes what you will do with your data during and after you complete your research
48. 1. Planning
Why should scientists prepare a DMP?
Saves time
Increases efficiency
Easier to use data
Others can understand & use data
Credit for data products
Funders require it
50. NSF DMP Requirements
From Grant Proposal Guide:
DMP supplement may include:
1. the types of data, samples, physical collections, software, curriculum materials, and other materials to be produced in the course of the project
2. the standards to be used for data and metadata format and content (where existing standards are absent or deemed inadequate, this should be documented along with any proposed solutions or remedies)
3. policies for access and sharing including provisions for appropriate protection of privacy, confidentiality, security, intellectual property, or other rights or requirements
4. policies and provisions for re-use, re-distribution, and the production of derivatives
5. plans for archiving data, samples, and other research products, and for preservation of access to them
51. Don’t forget: Budget
• Costs of data preparation & documentation
  Hardware, software
  Personnel
  Archive fees
• How costs will be paid
  Request funding!
dorrvs.com
52. NSF’s Vision*
DMPs and their evaluation will grow & change over time (similar to broader impacts)
Peer review will determine next steps
Community-driven guidelines
– Different disciplines have different definitions of acceptable data sharing
– Flexibility at the directorate and division levels
– Tailor implementation of DMP requirement
Evaluation will vary with directorate, division, & program officer
*Unofficially
Help from Jennifer Schopf, NSF
53. NSF’s Vision*
DMPs are a good first step towards improving data stewardship
– starting discussion
– scientists learning about data management
Additional expertise on panels to effectively evaluate DMPs (?)
Working group will assess outcomes
*Unofficially
54. dmp.cdlib.org
Step-by-step wizard for generating DMP
Create | edit | re-use | share | save | generate
Open to community
Links to institutional resources
Directorate information & updates
55. Roadmap
5. Tools
4. DCXL
3. Best practices for scientists
2. Barriers to best practices
1. Mistakes scientists make
56. “A transformation in the conduct of a segment of scientific research by enabling and promoting publishing, sharing, and archiving of tabular data”
Increase
interoperability = Sharing
publishability = Publishing
archivability = Archiving
Focus on atmospheric, ecological, hydrological, and oceanographic data
57. Open Source & Free Excel Add-in
Software program that extends the capabilities of larger programs
Complements basic Excel functionality
From www.webopedia.com
www.ablebits.com
58. DCXL Project Deliverables
• Excel add-in
• Publicly available source code
• Technical documentation
• End user documentation
• Publicly available requirements
64. Requirements
1 Ensure compatibility for Excel users without the add-in
2 Check the data file for CSV compatibility
  2.1 Excel performs a CSV compatibility check on the data file
  2.2 Excel generates a Compatibility Report
3 Generate metadata that is linked to the data file
  3.1 The user opens an existing metadata document as a template
  3.2 The user initiates a new metadata document
  3.3 Excel populates Level 1 metadata fields
  3.4 The user populates Level 2 metadata fields
  3.5 The user generates labels for parameter metadata
  3.6 The user requests standards for keywords
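The requirements do not spell out what a CSV compatibility check covers. As a purely illustrative guess at the kind of problems such a check might report, the sketch below inspects a sheet already loaded as a list of rows; the function name and the three rules are assumptions and are not the DCXL add-in's behavior.

```python
def csv_compatibility_problems(rows):
    """Report structural problems in a sheet given as a list of row lists.

    The first row is treated as the header. An empty result means no
    problems were found by these (assumed) rules.
    """
    if not rows:
        return ["sheet is empty"]
    problems = []
    header = rows[0]
    width = len(header)
    if any(str(name).strip() == "" for name in header):
        problems.append("blank column name in header")
    if len(set(header)) != len(header):
        problems.append("duplicate column names in header")
    for number, row in enumerate(rows[1:], start=2):
        if len(row) != width:
            problems.append(f"row {number} has {len(row)} cells, expected {width}")
    return problems
```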
65. Requirements
4 Generate a citation for the data file
5 Deposit into a repository
  5.1 The user authenticates via an existing relationship with the designated repository
  5.2 The user is directed to establish a relationship with a repository
  5.3 The user links an identifier to the data file via the designated repository
  5.4 Excel performs Pre-Archiving Tasks
  5.5 The user submits the Excel file for deposition
6 Appendix A: Metadata Types
7 Appendix B: Citation Format
8 Appendix C: Dictionary of Terms
66. Process
Assess needs
Gather requirements
Build requirements document
Build community
Libraries
Scientists
Repositories
Programmers/Developers
67. Why are you promoting Excel?
• Everyone uses it
• Features that make it good for data organization make it bad for archiving
• Stopgap measure
69. Roadmap
5. Tools
4. DCXL
3. Best practices for scientists
2. Barriers to best practices
1. Mistakes scientists make
70. UC3 Services
Where should I put my data?
Data Repository
Deposit | Manage | Share | Preserve
www.cdlib.org/services/uc3
71. UC3 Services
How do I get a unique identifier?
Create & manage persistent identifiers
• Precise identification of a dataset
• Credit to data producers and data publishers
• A link from the traditional literature to the data
• Research metrics for datasets
www.cdlib.org/services/uc3
72. DataONE
www.dataone.org
• Data Education Tutorials
• Database of best practices & software tools
• Links to DMPTool
• Primer on data management
From Flickr by Robert Hruzek