2. Programme
Session 1 (14:30 – 15:30): “meta - data - value - …”
1. Round of introduction: who is who and why this workshop?
2. Short intro 3TU.DC
3. Background information
4. Case: Traffic flow observations
5. Warming-up: Graphs
Break
Session 2 (16:00 – 17:00): “producers - consumers - attitudes - …”
6. ‘Discipline’ differences (researchers & repositories)
7. Dotmocracy ‘Lite’
8. Conclusions
3. 1. Who is who?
• Who are you?
• Why interested in this topic?
4. 2. 3TU.Datacentrum = …
• 3 Dutch TU’s: Delft, Eindhoven, Twente
• Project 2008-2011, going concern 2012-
• Data archive
– 2008 -
– “finished” data
– preserve but do not forget usability
– metadata harvestable (OAI-PMH)
– crawlable (OAI-ORE linked data)
– data citation information (incl. DataCite DOI’s)
• Data labs
– Just starting (hosting)
– Unfinished data + software/scripts
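The “harvestable (OAI-PMH)” bullet above refers to the standard OAI Protocol for Metadata Harvesting. As a minimal sketch, the snippet below parses a ListRecords response in the common oai_dc format; the endpoint query shown in the comment and the sample record content are invented for illustration and do not come from the actual 3TU.Datacentrum feed.

```python
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"
DC = "{http://purl.org/dc/elements/1.1/}"

# A harvester would issue a request such as
#   <base-url>?verb=ListRecords&metadataPrefix=oai_dc
# and receive XML shaped like this invented sample response:
SAMPLE = """<?xml version="1.0"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Traffic flow observations</dc:title>
          <dc:identifier>doi:10.4121/uuid:example</dc:identifier>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

def harvest(xml_text):
    """Yield (oai_identifier, title) pairs from a ListRecords response."""
    root = ET.fromstring(xml_text)
    for rec in root.iter(OAI + "record"):
        oai_id = rec.find(OAI + "header/" + OAI + "identifier").text
        title = rec.find(".//" + DC + "title").text
        yield oai_id, title

records = list(harvest(SAMPLE))
```

In a real harvester the XML would come over HTTP and be paged with resumption tokens; the parsing step stays the same.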
5. Website & Data-archive
• http://datacentrum.3tu.nl
• Information: news, announcements, publications, links and tutorials
• http://data.3tu.nl
• Dataset download and ‘management’
• ‘Use’ data with Google Maps/Earth, OPeNDAP, …
6. Data archiving options
• ‘Simple’ sets (Do It Yourself)
Standard (self)upload form and descriptive information, single file
per object (can be a ‘zipped’ collection), single DOI, …
E.g.: Zandvliet, H.J.W. et al. (2010): Diffusion driven concerted
motion of surface atoms: Ge on Ge(001). MESA+ Institute For
Nanotechnology, University of Twente.
doi:10.4121/uuid:3f71549c-6097-4bb8-bc00-6db77deb161d
• Special collections (Do It Together)
Negotiate: deposit procedure, description (xml, picture, preview),
data model, level of DOI assignment, query online, …
E.g.: Otto, T., Russchenberg, H.W.J. (2010): IDRA weather radar
measurements - all data. TU Delft - Delft University of
Technology.
doi:10.4121/uuid:5f3bcaa2-a456-4a66-a67b-1eec928cae6d
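The DataCite DOIs in the citations above resolve through the Handle proxy at doi.org, so a repository can emit citation links mechanically. A small sketch (Python 3.9+ for `str.removeprefix`):

```python
def doi_url(doi: str) -> str:
    """Turn a cited DOI ('doi:10.4121/...' or bare '10.4121/...')
    into a resolvable URL via the doi.org Handle proxy."""
    return "https://doi.org/" + doi.removeprefix("doi:")

# The IDRA dataset cited above resolves at:
url = doi_url("doi:10.4121/uuid:5f3bcaa2-a456-4a66-a67b-1eec928cae6d")
```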
7. Training & Data-labs
• http://dataintelligence.3tu.nl
• Reference, news & events for training library staff.
• OpenEarth, SHARE, …?
9. 3. Background information
• Workshop scope
– Need for change?/!
– Questions (for now)
• Report inputs
– NSF/NSB: Definitions
– RIN: Discipline/Data Differences
– DANS/3TU.DC: Value/selection/DSA/…???
10. Data Deluge
• Data in 2015: approx. 18 million times the Library of Congress (in size).
• Video data in 2005: half of all digital data.
• According to Eric Sieverts: at the current growth rate, in 2210 the number of bytes will equal the number of atoms on planet Earth (he predicts that before that happens something will change ;-)).
• CERN-LHC: 10-15 PB/yr.
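Sieverts’ 2210 extrapolation can be sanity-checked with rough numbers. Assuming about 1.2 zettabytes of digital data around 2010 and about 1.33 × 10^50 atoms in the Earth (both figures are illustrative assumptions, not from the talk), the implied constant growth rate works out to roughly 40% per year, which is in the range of commonly quoted data-deluge growth figures:

```python
# Back-of-the-envelope check of the 2210 extrapolation.
# Both constants are illustrative assumptions, not figures from the talk:
ATOMS_ON_EARTH = 1.33e50   # rough literature estimate
BYTES_IN_2010 = 1.2e21     # ~1.2 zettabytes of digital data around 2010
YEARS = 2210 - 2010

# Constant annual growth factor g with BYTES_IN_2010 * g**YEARS == ATOMS_ON_EARTH
g = (ATOMS_ON_EARTH / BYTES_IN_2010) ** (1 / YEARS)
annual_growth_pct = (g - 1) * 100   # roughly 40% per year
```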
11. Workshop scope
Preconditions
• Challenge: Too much data (to keep).
Technology (storage capacity, cooling, energy), organizations (strategies, budgets) and
people (awareness, training) can’t keep (this) up!
• Upside: Not all data is valuable in the future.
Some relevant (de)selection experience in archiving, some efficiency improvements, ‘some’ increase in storage capacity, …
Questions
1. Which research output to share and preserve?
2. Who are the players involved?
3. How to collect and preserve the research output?
Roles of University Libraries…
Conclusions on differences between documents and research data?
12. NSF/NSB - 1/3
• Data.
For the purposes of this document, data are any and all complex data
entities from observations, experiments, simulations, models, and
higher order assemblies, along with the associated documentation
needed to describe and interpret the data.
• Metadata.
Metadata are a subset of data, and are data about data. Metadata
summarize data content, context, structure, interrelationships, and
provenance (information on history and origins). They add
relevance and purpose to data, and enable the identification of
similar data in different data collections.
13. NSF/NSB - 2/3
3 functional types of data collections:
•Research Collections
Authors are individual investigators and investigator teams.
Research collections are usually maintained to serve immediate group
participants only for the life of a project, and are typically subjected to limited
processing or curation. Data may not conform to any data standards.
•Resource Collections
Resource collections are authored by a community of investigators, often within
a domain of science or engineering, and are often developed with community
level standards. Budgets are often intermediate in size.
Lifetime is between the mid- and long-term.
14. NSF/NSB - 3/3
• Reference Collections
Reference collections are authored by and serve large segments of the
science and engineering community and conform to robust, well-
established and comprehensive standards, which often lead to a
universal standard. Budgets are large and are often derived from
diverse sources with a view to indefinite support.
[NSF, Originally: National Science Board report on
Long-Lived Digital Data Collections, …]
Differences:
• Community size
• Collection lifetime
• Level of standardization
• Amount of processing
• Budget size & sources
• …
15. RIN
• Many different kinds and categories of data:
– scientific experiments;
– models or simulations; and
– observations of specific phenomena at a specific time or location; …
• Datasets are generated for different purposes and through different
processes.
• Data may undergo various stages of transformation.
• The quality of metadata provided for research datasets is very
variable.
• Varying degrees of data management, efforts, resources and
expertise.
• There are significant variations, as well as commonalities, in
researchers’ attitudes, behaviors and needs, in the available
infrastructure, and in the nature and effect of policy initiatives, in
different disciplines and subject areas.
• …
16. DANS/3TU.DC
Key findings
•No solid definition of “research data” found
•Lot of literature on selection process, but…
•Not a single case of selection policy of digital data found
Apparently a lot of implicit selection going on considering the available
digital research data
Reasons for preserving research data:
a) Obligation to enable re-use (by funder, publisher)
b) Other arguments: inter- or intradisciplinary value, hard to repeat, value
for historic research
c) Obligation for verification (by code of conduct, employer, publisher)
d) Non-scientific arguments (heritage, responsibility to society)
17. Docs vs. Data (Differences)
• Object sizes (capacity)
• Collection sizes/granularity (number of objects)
• Metadata (type, standards and distinction from object)
• Heterogeneity of collections (not discipline differences)
– Data category (experiment, model/simulation, observation)
– Data generation process (man made vs. machine made or …)
– File formats
• Attitudes to ‘publishing’
• Resources, expertise, efforts on
data management
• Selection inevitable
• Value?
• …
• … Anything to add?
(list to be expanded in workshop)
19. 4. Case: Traffic flow observations
• Case
Researchers needed to clear the disk space and offered data which
were “expensive to gather and had required quite a lot of
computation to process.”
Project was already finished.
• Content
Pictures of highway stretches shot from helicopter.
Shoulder open/closed, several flights, raw/stabilized, several dates,
calibration image, calibration software and settings.
20. Questions for case
• Which data to ingest?
raw pictures, stabilized pictures, movies or … vectors and type of cars?
GPS logs
calibration image
stabilisation software/data
• Who are involved?
data-producer (researcher)
research funder (owner)
data repository
• How to preserve?
GPS logs: as data or metadata; all flight data or only when recording?
the software (code or executable?)
picture formats (tiff, png, jpeg2000, …)?
granularity (per flight, per location, per recording, …)?
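One concrete ingest decision behind the picture-formats question is simply recognising what was deposited. Below is a minimal sketch of a file-signature (“magic bytes”) check a repository might run at ingest; the signature table is limited to the formats named above and the workflow around it is hypothetical.

```python
# Format check at ingest via file signatures ("magic bytes").
# The table is illustrative and covers only the formats named in the case.
SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "png",
    b"II*\x00": "tiff (little-endian)",
    b"MM\x00*": "tiff (big-endian)",
    b"\x00\x00\x00\x0cjP  \r\n\x87\n": "jpeg2000",
}

def identify(first_bytes: bytes) -> str:
    """Name the image format from the first bytes of a deposited file."""
    for magic, name in SIGNATURES.items():
        if first_bytes.startswith(magic):
            return name
    return "unknown"
```

Such a check catches mislabelled files early, before any migration or preview generation is attempted.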
27. Docs vs. Data (Differences)
• Object sizes (capacity)
• Collection sizes/granularity (number of files)
• Metadata (type, standards and distinction from object)
• Heterogeneity of collections
– Data category (experiment, model/simulation, observation)
– Data generation process (man made vs. machine made or …)
– File formats
• Attitudes to ‘publishing’
• Resources, expertise, efforts on
data management
• Selection inevitable
• Value?
• Citation practice
• …
• … Anything to add?
36. What our account managers ‘sell’…
The benefits for data producers and data consumers
• Increased visibility of research output
(metadata in repository networks, assigned DOIs, facilitating
increased citation rates for ‘enhanced publications’, …);
• Improved quality of datasets (quality assurance for multi-user
setups, checks on ingest, …);
• (Long-term) preservation of, and accessibility to,
valuable research data;
• Distribution of research data for reuse, including
administration and usage statistics;
• Advice on data management, rights, formats,
metadata, etc.
37. Value
• Secure research data
• Cite/claim (DOIs)
• Quality assurance (support)
• Data exchange
• Data visibility
• Support EU projects, communities
• Extra show window
• Relation with non-academic research, society
• Prepare for paradigm shift
• Enable verification
38. What do data producers say? 1/2
“Only for long-term continuous data”
“No time!”
“Our research is once only”
“Interesting but not for me”
“Nobody needs my data”
“Datasets are stored by publisher”
“Our datasets are confidential”
“Data transfer not needed, every PhD does own project”
39. What do data producers say? 2/2
“Very useful, essential”
“When can I store my datasets?”
“Metadata often missing”
“Much to improve in reuse of data”
“Good opportunity to share datasets we bought”
“Would like to publish data”
“Surprising our university had no facility for data preservation”
“Transfer of data between PhDs can be improved”
41. Workshop results
• Confirmed:
– Different domains have commonalities
– Need for support on research data management
exists
• There are strong differences depending on
– Research type
– Data types
– Individual attitudes
42. ‘Conclusions’ on valuable data
Which data to preserve? And why?
• Data of ‘enhanced publications’ (underlying data and visualisations
linked to publications).
Increase publication value (stronger basis, more citations, …);
• Data generated by ‘hard to repeat’ processes.
E.g. high cost, (environmental) observations, complex or
continuous experiments, …;
• Data collected with public funding.
Conditions by funding organisations or publishers like Nature
Publishing Group, NWO, governmental organisations, universities,
…;
• Preferably open access data with potential for reuse (verification,
new research, …).
Increase visibility, efficiency and quality of research efforts.
• … Anything to add?
43. Docs vs. Data (Differences)
• Object sizes (capacity)
• Collection sizes/granularity (number of files)
• Metadata (type, standards and distinction from object)
• Heterogeneity of collections
– Data category (experiment, model/simulation, observation)
– Data generation process (man made vs. machine made or …)
– File formats
• Attitudes to ‘publishing’
• Resources, expertise, efforts on
data management
• Selection inevitable (due to size)
• Value of research data higher
• Readability of research data is lower (zero without metadata)
• Citation practice
• …
• … Anything to add?
44. The End
In one line:
“Challenge is to find the ready, able and willing
(researchers)”
45. To Dotmocracy…
• 15 min. to select or define new propositions
(approx. 3) and write them on a sheet.
• 15 min. to ‘vote’ on every sheet.
• 15 min. for plenary discussion on opposing
opinions.
46. Responsibility Propositions 1/4
• All research data should be stored in disciplinary
archives.
• Research institutes must register data produced
by their researchers.
• Libraries are the best departments at universities
to take on research data archiving.
47. Obligation Propositions 2/4
• Data-producers should be obliged to publish their
(anonymous) research data as open data.
• High cost research facilities should be obliged to
share (and preserve) their data.
• Users should log in to download data
• Data-repositories should never accept data in
proprietary file formats
48. Value Propositions 3/4
• Only datasets which are linked to publications
need to be preserved for the long term.
• Not simulation results but algorithms and
boundary conditions should be stored.
• Each dataset should also include the data in its
rawest form.
49. Misc. Propositions 4/4
• University libraries have a harder job to attract
datasets from exact sciences than from
humanities.
• Researchers are sloppy (they regard
documentation as irrelevant and annoying).
• Session #4 should be on the beach with lots of
beer.
50. Docs vs. Data (Differences)
• Object sizes (capacity)
• Collection sizes/granularity (number of files)
• Metadata (type, standards and distinction from object)
• Heterogeneity of collections
– Data category (experiment, model/simulation, observation)
– Data generation process (man made vs. machine made or …)
– File formats
• Attitudes to ‘publishing’
• Resources, expertise, efforts on data management
• Selection inevitable (due to size)
• Value of research data higher
• Readability of research data is lower (zero without metadata)
• Citation practice
• (A document is data)
• Boundaries of data (sets) are less clear than for documents
• Assigned responsibilities and tasks
• Legal status
• …
51. All Propositions 1/1
• All research data should be stored in disciplinary archives.
• Research institutes must register data produced by their researchers.
• Libraries are the best departments at universities to take on research data
archiving.
• Data-producers should be obliged to publish their (anonymous) research data
as open data.
• High cost research facilities should be obliged to share (and preserve) their
data.
• Users should log in to download data
• Data-repositories should never accept data in proprietary file formats
• Only datasets which are linked to publications need to be preserved for the long
term.
• Not simulation results but algorithms and boundary conditions should be stored.
• Each dataset should also include the data in its rawest form.
• University libraries have a harder job to attract datasets from exact sciences
than from humanities.
• Researchers are sloppy (they regard documentation as irrelevant and
annoying).
• Session #4 should be on the beach with lots of beer.
52. Dotmocracy results 1/3
“Users should log in to download data”
Str. Agree Agree Neutral Disagree Str. Disagree
xx xx x
+ Should be for some data types (sensitive)
+ It helps to get an idea of usage
+ Anonymity(?) on the net is a ‘2000’ thought anyway
+ Accept license
+ Trace of use for data-producers
- Raise threshold for re-use
53. Dotmocracy results 2/3
“Data repositories should never accept files in
proprietary formats”
Str. Agree Agree Neutral Disagree Str. Disagree
xxxxxx xxxxxx xxxxxx xx
+ Easy to reuse data in open formats
- Better to have proprietary data than none at all
- May preclude data if we insist on open formats
- Can be migrated to open formats (sometimes)
54. Dotmocracy results 3/3
“Libraries are the best departments at
universities to take on research data archiving”
Str. Agree Agree Neutral Disagree Str. Disagree
xx xxxxxxxxxxxx xx
xxxxxx
+ Co-operation already with researchers
+ Librarians have good metadata skills
o The library’s vendor should deliver the service(?)
+ Full control and close to researcher(?)
- Challenge too big: long-term sustainability
+ Builds on metadata knowledge of libraries
- Must have IT in co-operation
- Archiving skills
55. Responsibility
• All research data should be / is best stored in disciplinary
archives.
– Bigger bodies of (mono)disciplinary data for consumers
– Discipline-specific metadata, guidelines and support
– Sustainability of data-archive organisations
– Research data ownership at research institutes
• Research institutes should register data
– …
• Libraries are the best departments at universities to take
on research data archiving.
– Accessibility
– Archiving
– IT knowledge
– Infrastructure
56. Obligations
• Data producers / High cost facilities should be obliged
to publish their (anonymised) research data.
– Risk: “Garbage in …”
– Funding consequences WOULD make a difference
• Login/registration of data-consumers
– Accept license
– User statistics for archive funding
– Trace of use for data-producers
– Raise threshold for re-use
• Data-repositories should refuse proprietary formats
– …
57. Value
• Only data linked to publications
– Data can be measured faster than it can be analyzed
– Accepted article proof of value AND documentation
– Possible future value without present publication?
– …
• Not simulation results but algorithms
– Software more difficult to authentically reproduce
– Data calculation can be very time/resource consuming
– Simulation datasets can be very large
– Ability to calculate higher resolutions faster is increasing
– …
• At least data in its rawest form
– Interpretation (processing) might be done wrong
– Interpretation (processing) only for super-experts and generally accepted
– Raw data can be very large (PIV, IDRA, …)
– …
Editor’s notes
Metadata - not ‘just’ bibliographic: very domain-specific, distinction more fuzzy
Group interaction - 3 groups. Draw 3 graphs (multiple lines per graph are allowed, but explain what they are) in the next … min. Select 3 more or define your own graphs. Present & discuss the most interesting one from every group - …
Explain; form 3 groups. .. min. to select or define new propositions, at least 3 (not from the same proposition group). .. min. to walk around and ‘vote’ on every proposition; please write comments and ‘sign off’. .. min. to discuss the opposing opinions.