Research Data Management for Econometrics
1. Econometrics of Panel Data and Network Analysis
Research Data
Management
Module 1
Dr. Peter Löwe
Berlin, 03. 08. 2017
2. Agenda
1. Why bother: A crisis, horror stories & a Panda-Oncologist
2. Size is relative: Doctor House, Big Data, and a long tail
3. Reality Check: Doing science in the 21st century
4. Research Data Management according to Gollum and XKCD
5. Persistent Identifiers: Digital dog tags for everything and everyone !
6. Research Data Repositories & good reads
7. Conclusion: Culture change & happy Pandas
Peter Löwe 2017-08-02
Research Data Management: Module 1
3. 1 Today's menu
• Why Research Data Management matters and how it should work (perfect world)
• How stuff currently works (state of the art)
• How stuff will work soon (outlook)
• How to get started (self help)
4. 1 Drivers for Research Data Management
https://www.kent.ac.uk/library/research/data-management/manage.html
Why you should care (internal motivation)
• Increase the efficiency of your research process
• Avoid losing data
• Enable data re-use and sharing
Why you are going to care (external motivation)
• Meet the requirements of research funders and your institute
• Comply with the policies of a growing number of journal publishers on making the data underlying publications available
• Increase your visibility (citations)
5. 1 Research Data includes
• Questionnaires/surveys
• Raw experimental data
• Analysed data
• Databases
• Simulations and research code (software)
• Audio-visual materials
• Laboratory and field notes
• Clinical data, including clinical records
• Images and photographs
6. 1 The Research Data Spectrum
Physical → Digital
• Handwritten letters → scanned & OCR version
• Images or photos → scanned digital version
• Soil samples → analysed result of samples
• Tissue samples → analysed result of samples
• Archaeological dig sites → 3D models of the dig site
• …
7. 1 Issue: The Reproducibility Crisis
Nature 533, 452–454 (26 May 2016) doi:10.1038/533452a
https://www.slideshare.net/AustralianNationalDataService/research-data-management-in-practice-ria-data-management-workshop-brisbane-2017
• A methodological crisis in science
• The phrase was coined in the early 2010s as part of a growing awareness of the problem
• 2016: in a poll of 1,500 scientists, 70% of them had failed to reproduce at least one other scientist's experiment
• The results of many scientific studies are difficult or impossible to replicate on subsequent investigation
https://en.wikipedia.org/wiki/Replication_crisis
8. 1 Data Sharing and Management Snafu in 3 Short Acts
[Snafu: "Situation normal, all f***ed up"]
10. 1 Discussion
Have you encountered something similar?
How would you deal with such a situation?
Where do you store your data?
How much data would you lose if your laptop was stolen?
11. 1 Reproducibility decreases over time due to increasing data loss
http://www.nature.com/news/scientists-losing-data-at-a-rapid-rate-1.14416
“In their parents' attic, in boxes in the garage, or stored on now-defunct
floppy disks — these are just some of the inaccessible places in which
scientists have admitted to keeping their old research data. Such practices
mean that data are being lost to science at a rapid rate, a study has now
found.”
12. 1 Night of the Living Data
http://www.eweek.com/database/5-data-management-horror-stories-to-avoid
14. 1 Way Out: Keep Science FAIR (perfect world)
Principles to ensure research data is FAIR: Findable, Accessible, Interoperable, Reusable
"The problem the FAIR Principles address is the lack of widely shared, clearly articulated, and broadly applicable best practices around the publication of scientific data"
"FAIRness is a prerequisite for proper data management and data stewardship"
Mark D. Wilkinson et al. The FAIR Guiding Principles for scientific data management and
stewardship, Scientific Data (2016). DOI: 10.1038/sdata.2016.18
16. 2 Life Expectancy of Digital Storage Media
http://www.zeit.de/wissen/2013-10/s37-infografik-speichermedien.pdf
https://homsum.files.wordpress.com/2014/04/dr_house_hugh_laurie_desktop_1152x864_wallpaper-83467.jpg
17. 2 Life Expectancy of Digital Storage Media
Storage capacity grows, but not the lifespan.
Average life-span: about 10-30 years
18. 2 Big Data Buzzwords: The Four V's
19. 2
Size is not everything:
Big Data and the Long Tail of Science
http://www.nature.com/neuro/journal/v17/n11/full/nn.3838.html
Big data from small data:
data-sharing in the 'long tail' of neuroscience
Long Tail of Science:
• {Astro|Nuclear}-physics
• Genome studies
• Remote Sensing
The overall amount continues to increase due to "Big Data" (Volume | Velocity)
20. 3 Data-driven Science
http://www.allthingsdistributed.com/2007/02/help_find_jim_gray.html
Paradigms of Science:
1. empirical
2. theoretical
3. computational
4. data-driven
21. 3 The Fourth Paradigm
"It's the data, stupid"
Dr Gray's call-to-arms was [..] “to have a world
in which
• all of the science literature is online,
• all of the science data is online, and they
• interoperate with each other.”
22. 3 Innovation in Science travels at different velocities
• Science in general is affected by digital innovation.
• Every field of science is different,
• but some are further ahead in embracing different aspects of change.
• Exchange of lessons learned across disciplines is needed.
http://i.quoteaddicts.com/media/q1/1487862.png
23. The Lifecycle of a Scientific Idea (Elegant High-Level Perspective)
Influenced by computer-driven science and "Big Data"?
24. The Lifecycle of a Scientific Idea: Reality Check
1. Formulate a theory
2. Gather data
3. Learn about data storage
4. Learn about data movement protocols
5. Lose data
6. Check out of rehab
7. Learn about backup and replication
8. Gather data
9. Learn about versioning
10. Start preliminary analysis
11. Buy a newer laptop
12. Buy more memory
13. Buy a desktop with more memory
14. Buy a bigger monitor & GPUs "for work"
15. Google "250GB Excel Spreadsheet"
16. Learn about batch processing
17. Learn about batch schedulers
18. Learn about patience
19. Learn more about data storage
20. Learn about distributed systems
21. Go back through notes to remember the science question
22. Learn R & Python
23. Learn Linux admin
24. Finish preliminary analysis
25. Grow a ponytail
26. Write a paper
27. Learn about data publishing
28. Learn about reproducibility
29. Plot the death of your advisor/dept. head
30. Apply for grants & research allocations on public systems
31. Wait to apply next time
32. Finish analyzing data
33. Reformulate your theory
34. Goto 1
Source: John Fonner (2016) Jupyter Ascending, http://bit.ly/2vmTwCR
Reality Check:
Science is green, IT & the rest is blue, data-wrangling is red.
Many data-wrangling challenges!
25. 4
Data Wrangling:
Research Data Management (RDM)
http://www.oclc.org/content/dam/research/images/publications/rdm-framework-4-with-cc.png
Today's menu:
YOU
Infrastructure (is there one yet?)
26. 4
RDM
Responsibilities before, during and after a research project
data/assets/pdf_file/0009/394056/research-data-management-in-practice.pdf
YOU
27. 4 Data Curation Continuum
[Figure: the Data Curation Continuum divided into four domains of responsibility: personal domain → (transfer) → group domain → (transfer) → persistent domain → (publication) → access domain, spanning the pre-research, research and post-research phases. In the process of data transfer, the existing metadata is enriched with further elements. (After Klump, 2009)]
28. 4 Pre Research: Institutional Requirements
Institutional Data Management Framework:
• Institutional policy and procedures
• Support services: people and other means of providing advice and support
• IT infrastructure: the hardware, software and other facilities
• Metadata management: so that data records can be meaningful and fit for purpose
29. 4 Pre Research: Data Management Plan (perfect world)
data organisation and storage;
metadata standards and guidelines;
backups;
archiving for long-term preservation;
version control and derived data products;
data sharing or publishing intentions, including licensing;
ensuring security of confidential data;
data synchronisation; and
governance, roles and responsibilities.
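A plan along these lines can be sketched as a machine-checkable checklist. The section names below come from this slide; the function and the sample answer are illustrative, not a funder template:

```python
# A data management plan as a checklist (illustrative structure).
DMP_SECTIONS = [
    "data organisation and storage",
    "metadata standards and guidelines",
    "backups",
    "archiving for long-term preservation",
    "version control and derived data products",
    "data sharing or publishing intentions, including licensing",
    "ensuring security of confidential data",
    "data synchronisation",
    "governance, roles and responsibilities",
]

def unanswered(plan: dict) -> list:
    """Sections of the plan that still lack an answer."""
    return [s for s in DMP_SECTIONS if not plan.get(s)]

plan = {s: "" for s in DMP_SECTIONS}
plan["backups"] = "Nightly copy to institutional storage; weekly offsite."
```

Filling in each section before data collection starts is the "perfect world" the slide title refers to.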
30. 4 Documentation 101
a) Document your data sets.
b) Ask your data repository how to document correctly (metadata!).
c) If you do not document, you're wasting an opportunity to receive credit through citation and reuse.
d) Not to be missed:
Topic (keywords, controlled vocabulary, abstract)
Observation unit (countries, people, etc.)
Data basis (random sampling, complete survey, etc.)
Sampling method
Extent
Access: limitations, embargo, POC
31. 4 Metadata 101
Metadata (structured data about the data)
• Who collected the data?
• Who funded the research project?
• When (and where) was it collected?
• Instruments and setting for collecting the data?
• Title of the dataset
• Methods used to process the data
• Etc. etc.
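As a minimal sketch, the questions above can be captured as a structured record and stored alongside the dataset, e.g. as JSON. All field names and values here are illustrative, not a formal metadata standard:

```python
import json

# Minimal structured metadata record answering the questions on this slide.
# All field names and values are illustrative, not a formal standard.
metadata = {
    "title": "Household survey 2016, wave 3",            # title of the dataset
    "creator": "Jane Doe",                               # who collected the data
    "funder": "Example Research Council",                # who funded the project
    "collected": {"start": "2016-01", "end": "2016-06",  # when (and where)
                  "location": "Berlin"},
    "instrument": "CAPI questionnaire, version 2.1",     # instruments and setting
    "processing": "Outliers winsorised at the 1% level", # processing methods
}

def to_metadata_json(record: dict) -> str:
    """Serialise the record as JSON for a sidecar file next to the data."""
    return json.dumps(record, ensure_ascii=False, indent=2)
```

A repository will usually prescribe its own schema; the point is that the answers live in a structured file, not only in someone's head.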
32. 4 Appropriate File Formats
• Open and non-proprietary
• Human readable, non-binary
• Patent-free
• ISO-standards
• textual data: XML, TXT, HTML, PDF/A (Archival PDF)
• Tabular data (spreadsheets): CSV
• Databases: XML, CSV
• Images: TIFF, PNG, JPEG*
• Audio: FLAC, WAV, MP3
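For instance, tabular data can be written to CSV with nothing but the standard library. The helper and the sample rows below are illustrative:

```python
import csv
import io

def rows_to_csv(rows, fieldnames):
    """Write tabular data as CSV: open, non-proprietary and human readable."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

# Illustrative rows; in practice these would come from your spreadsheet.
rows = [{"county": "A", "gdp": 1.2}, {"county": "B", "gdp": 0.9}]
csv_text = rows_to_csv(rows, ["county", "gdp"])
```

Unlike a proprietary spreadsheet file, the result can still be read decades later with any text editor.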
33. 4 Include a Manifest / readme File!
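A manifest typically pairs each file name with a fixity checksum, while the readme explains what the files contain. A minimal sketch using SHA-256 (the function and file names are illustrative):

```python
import hashlib

def checksum(data: bytes) -> str:
    """SHA-256 hex digest, used as a fixity check for a file's content."""
    return hashlib.sha256(data).hexdigest()

def manifest_line(name: str, data: bytes) -> str:
    """One 'digest  filename' line, the format produced by sha256sum."""
    return f"{checksum(data)}  {name}"

# An accompanying readme then explains what each listed file contains.
readme = ("Dataset: example survey\n"
          "Files: data.csv (observations), codebook.txt (variable definitions)\n")
```

Anyone receiving the dataset can recompute the digests and detect silently corrupted or missing files.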
34. 4 Data Life Cycle: Personal Domain Perspective
http://cdn.ttgtmedia.com/informationsecurity/images/vol4iss7/ism_v4i7_f4_DataLifecycle.gif
The most critical stage in the research data lifecycle is the completion of the research project. In most cases there is no follow-up funding to maintain the research data, and the scientist has to focus on the next project.
!!!
35. 4 Publishing and Sharing Data
Publishing and Sharing data ≠ Open Access to data
• “Open” and “Closed” are relative concepts.
• “Closed” ≈ conditional access based on individual
permission
• “Closed” ≈ conditional access based on roles
Metadata / Research Data combinations:
• Open metadata, open data
• Open metadata, closed data
• Closed metadata, open data
• Closed metadata, closed data
36. 4 Continual data curation across domains
37. 4 Data Curation Continuum: Visibility and Circulation
[Figure: the Data Curation Continuum: personal domain → (transfer) → group domain → (transfer) → persistent domain → (publication) → access domain; visibility rises from low to high along the way.]
38. 4 Data Delay Strategies?
https://www.explainxkcd.com/wiki/index.php/1805:_Unpublished_Discoveries
39. 4 The Grant Cycle according to XKCD (and Machiavelli?)
http://phdcomics.com/comics/archive.php?comicid=1431
40. 4 The Reputation Economy
Open Access to Data:
• Science has become a reputation economy.
• The fundamental difference between disciplines is the trade-off between reputation and collaboration at points of the reputation economy where changes in the form of capital occur.
• Sharing data as a form of collaboration must be balanced by a similar gain in reputation.
• "[…] collaborative disciplines enforce data sharing as a social norm where non-compliance will result in some form of penalty […]"
41. 4
Research Parasites Paradigm:
Open Access for Data is evil
https://media.tenor.com/images/236ee382fdf16973567dc3bb44c21b51/tenor.gif
Lego Gollum
43. 4
A Solution for the Crisis
Open Science enables Reproducible Science
https://en.wikipedia.org/wiki/Open_science#/media/File:Open_Science_-_Prinzipien.png
Open Science is the movement to make scientific research and data accessible to all.
Benefits:
• Greater availability and accessibility of publicly funded scientific research outputs
• Possibility for rigorous peer-review processes
• Greater reproducibility and transparency of scientific works
• Greater impact of scientific research
44. 4 Reality check: Gollum (still) beats Prometheus by 10:1
https://s-media-cache-ak0.pinimg.com/originals/21/94/ed/2194ed6879d5bfd93679326508d382cd.jpg
• Gift culture still prevails
• It's not the technology
• It's not the generational change
• How to trigger cultural change?
Science Technology Medicine (STM), 2006-2016: ~30 million papers published vs. ~3 million data publications (Klump 2017): 10:1
45. 4
Paradigm Change induced by Funding Agencies:
Watering hole approach instead of stick & carrot
http://i.dailymail.co.uk/i/pix/2016/01/14/17/3025C04C00000578-3398562-image-a-16_1452793763082.jpg
Carrot & stick
did not work
Control the watering hole:
Works (for now)
46. 4 FAIR principles: As guidelines
https://commons.wikimedia.org/wiki/File:FAIR_data_principles.jpg
http://www.macs.hw.ac.uk/~ajg33/wp-content/uploads/2016/03/FAIR-Article-Poster.jpg
"The problem the FAIR Principles address is the lack of widely shared, clearly articulated, and broadly applicable best practices around the publication of scientific data"
47. 5 Technical Requirements for FAIR
• Easy and permanent access to research data via the internet
• Enhanced discovery, retrieval and management of data to enable data reuse and verification of research results
48. 5 Benefits of Citation
• Including citable data in related publications increases the citation rate of those publications
• Only cited data can be counted and tracked (in a similar manner to journal articles) to measure impact
• Routine citation of data will assist in gaining acknowledgement of data as a first-class research output
• Citations for published data can be included in CVs along with journal articles, reports and conference papers
49. 5
Technical Challenge:
Unbreakable internet-based Citation
Stable linking needed:
• Data will move; URL links to webpages will break.
• An unbreakable alternative is needed!
50. 5 Digital Object Identifiers (DOI)
• The International DOI Foundation was founded in 1998.
• The DOI system offers long-term persistence and accessibility of data.
• It is based on the Handle system.
• In May 2012 the DOI system ISO standard 26324 was published.
• Part of the quality control is mandatory metadata for each object registered with a DOI.
51. 5 What is a DOI?
DOI: acronym for "digital object identifier".
A DOI name is an identifier (not a location) of an entity on digital
networks.
What you see: alphanumeric string (never changes)
Associated with: location (such as URL)
Accompanied with: who, what, when… (metadata)
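The split between an unchanging name and a changeable location can be illustrated in a few lines. The regular expression is a simplified pattern for common DOI names, not the full ISO 26324 syntax:

```python
import re

# Simplified pattern for common DOI names such as 10.5446/31036;
# the full DOI syntax (ISO 26324) is more permissive.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")

def is_doi(name: str) -> bool:
    """Check whether a string looks like a DOI name (an identifier, not a location)."""
    return DOI_PATTERN.match(name) is not None

def resolver_url(doi: str) -> str:
    """The associated location is looked up via the doi.org resolver."""
    return f"https://doi.org/{doi}"
```

The DOI name printed in a citation never changes; only the location it resolves to may be updated behind the resolver.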
52. 5
DataCite Metadata Schema
Mandatory properties
Part of the quality control is mandatory metadata for each
object registered with a DOI:
• Identifier (with type attribute)
• Creator (with type and nameIdentifier attributes)
• Title (with optional type attribute)
• Publisher
• PublicationYear
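A repository-side check for these mandatory properties might look like the following sketch (field names are lowercased for illustration; the sample record reuses a dataset cited in this module):

```python
# DataCite's five mandatory properties (field names simplified for illustration).
MANDATORY = ("identifier", "creator", "title", "publisher", "publicationYear")

def missing_mandatory(record: dict) -> list:
    """Return the mandatory properties absent from a metadata record."""
    return [field for field in MANDATORY if not record.get(field)]

# Sample record; values taken from a dataset citation used in this module.
record = {
    "identifier": "10.5684/soep.v29",
    "creator": "Schupp, Jürgen et al.",
    "title": "Sozio-oekonomisches Panel (SOEP), Daten der Jahre 1984-2012",
    "publisher": "SOEP",
    "publicationYear": "2013",
}
```

A registration request missing any of these properties would be rejected before a DOI is minted.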
53. 5 DOI is a quality label for data
Datasets with a DOI have to be:
• Stable (i.e. not going to be modified)
• Complete (i.e. not going to be updated)
• Permanent: by assigning a DOI we're committing to make the dataset available for posterity
• Good quality: by assigning a DOI it's receiving the data centre's stamp of approval, saying that it's complete and all the metadata is available
DOI: Seal of Approval
54. 5 DOI for Research Data
https://support.datacite.org/docs/doi-basics
55. 5 DOI Citation Examples
Fahrenberg, Jochen (2010): Freiburger Beschwerdenliste FBL. Primärdaten der Normierungsstichprobe 1993. Version 1.0.0. ZPID Leibniz-Zentrum für Psychologische Information und Dokumentation. Dataset. doi:10.5160/psychdata.fgjn05an08
Rattinger, Hans; Roßteutscher, Sigrid; Schmitt-Beck, Rüdiger; Weßels, Bernhard (2012): Wahlkampf-Panel (GLES 2009). Version: 3.0.0. GESIS Datenarchiv. Dataset. doi:10.4232/1.11131
Schupp, Jürgen; Kroh, Martin; Goebel, Jan; Bartsch, Simone; Giesselmann, Marco et al. (2013): Sozio-oekonomisches Panel (SOEP), Daten der Jahre 1984-2012. Version: 29. SOEP Sozio-oekonomisches Panel. Dataset. doi:10.5684/soep.v29
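Citations like these can be assembled mechanically from the mandatory metadata. A sketch in the style of the examples above (the function name is illustrative):

```python
def cite_dataset(creator, year, title, version, publisher, doi):
    """Assemble a dataset citation in the style of the examples above."""
    return (f"{creator} ({year}): {title}. Version: {version}. "
            f"{publisher}. Dataset. doi:{doi}.")

citation = cite_dataset(
    "Schupp, Jürgen et al.", 2013,
    "Sozio-oekonomisches Panel (SOEP), Daten der Jahre 1984-2012",
    "29", "SOEP", "10.5684/soep.v29")
```

This is exactly why the DataCite mandatory properties exist: every citation element maps to a registered metadata field.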
56. 5 DOI System Architecture
58. 5 Upcoming: Search DOI-registered datasets by ORCID
Find any DOI-registered publication by ORCID
http://dashboard.project-thor.eu
Example: Löwe / Loewe / Lowe? Which of the four Peter Löwes?
59. 6 Data Curation Continuum: Research Data Repositories
[Figure: the Data Curation Continuum once more: personal domain → (transfer) → group domain → (transfer) → persistent domain → (publication) → access domain; research data repositories occupy the high-visibility end.]
60. 6 re3data: Registry of Research Data Repositories
1,500 research data repositories, described by tags:
61. 6 re3data: Search options
62. 6 Research Data Repository (RDR) Development and Services
Currently, the DFG funds two RDR-related projects:
1. SowiDataNet: addressing the social sciences
2. RADAR: addressing the long tail of science
Technology and metadata are compatible.
RADAR is a service offering by FIZ Karlsruhe (testing phase).
Near future:
• SowiDataNet will become a service offering (GESIS)
• Datorium will merge with SowiDataNet
63. 6 RADAR: Research Data Repository Services
Van den Broel K, Furtado F, Engel T (2015): RADAR – A Research Data Repository for the “Long-Tail of Science”
66. 6 Datorium: Data Set Description
67. 6 Datorium: Terms of Access
68. 4 Where NOT to "publish" your Data
Required:
Professional repositories which enable
• long term access,
• search,
• retrieval,
• thorough metadata
69. 6
Alternative (Self help):
All-purpose Repositories
Rueda, Laura. (2017, May). Introduction to DataCite. Zenodo.
http://doi.org/10.5281/zenodo.571808
70. 6 OPENAIRE: RDM on the European Level
https://www.openaire.eu/
https://www.slideshare.net/OpenAIRE_eu/enabling-better-science-results-and-vision-of-the-openaire-infrastructure-and-rda-data-publishing-working-group-55075375
71. 6 Adoption of Open Science in Europe
https://www.fosteropenscience.eu/
72. 6
Research Data in the Social and Economic Sciences
http://dx.doi.org/10.4232/10.fisuzida2014.1
http://auffinden-zitieren-dokumentieren.de
74. 6 Rat für Sozial- und Wirtschaftsdaten / DFG
http://www.dfg.de/download/pdf/foerderung/antragstellung/forschungsdaten/basisinformationen_forschungsdatenmanagement.pdf
76. 6 RESEARCH DATA ALLIANCE
https://www.rd-alliance.org/
77. 6 Data Carpentry Workshops
http://www.datacarpentry.org/
78. 7 AUSTRALIAN NATIONAL DATA SERVICE (ANDS)
79. 7 Wise Advice
https://nicolahemmings.wordpress.com/2016/04/05/mistakes-ive-made-as-an-early-career-researcher/
Mistakes I’ve made as an early career researcher
APRIL 5, 2016
Nicola Hemmings (post-doc, University of Sheffield)
Failing to organise my data adequately (circa 2007).
“Prepare your datasets like you would if you were giving them to a
stranger who knew nothing about them. Label, annotate and
meticulously file your R scripts. Incorporate read-me files into everything
and write them for the monkey that will be you in five years, when you
return to your data and/or analyses for some unforeseen but vitally
important reason. Don't get this wrong. You will regret it."
80. 7
Back to the start:
Snafu? Things are getting better
• This film is scientific non-textual information.
• It is available on the AV-Portal of TIB Hannover, a data portal for scientific audiovisual content.
• DOI link: https://doi.org/10.5446/31036
81. Thank you for your attention.
DIW Berlin — Deutsches Institut
für Wirtschaftsforschung e.V.
Mohrenstraße 58, 10117 Berlin
www.diw.de
Editor
Peter Löwe (ploewe@diw.de)
http://dilbert.com/strip/2010-08-24
Based on the works of
• Paul Wong (2017) ANDS, Research Integrity Advisor Data Management Workshop
• 3TU.Datacentre (2014): Data citation and DOIs
• and others
82. Thank you for your attention.
DIW Berlin — Deutsches Institut
für Wirtschaftsforschung e.V.
Mohrenstraße 58, 10117 Berlin
www.diw.de
Editor
Peter Löwe (ploewe@diw.de)
https://doi.org/10.5446/31036
Let's look at it the other way around: post-science
In their parents' attic, in boxes in the garage, or stored on now-defunct floppy disks — these are just some of the inaccessible places in which scientists have admitted to keeping their old research data. Such practices mean that data are being lost to science at a rapid rate, a study has now found.
The authors of the study, which is published today in Current Biology1, looked for the data behind 516 ecology papers published between 1991 and 2011. The researchers selected studies that involved measuring characteristics associated with the size and form of plants and animals, something that has been done in the same way for decades. By contacting the authors of the papers, they found that, whereas data for almost all studies published just two years ago were still accessible, the chance of them being so fell by 17% per year. Availability dropped to as little as 20% for research from the early 1990s.
“Most of the time, researchers said ‘it’s probably in this or that location’, such as their parents' attic, or on a zip drive for which they haven’t seen the hardware in 15 years," says Timothy Vines, the lead author on the study and an evolutionary ecologist at the University of British Columbia in Vancouver. "In theory, the data still exist, but the time and effort required by the researcher to get them to you is prohibitive.”
Apparently it's an issue.
From a personal perspective: icky.
Best practices for data handling
Should I store my data at home?
The basic idea is that our capacity for collecting scientific data has far outstripped our present capacity to analyze it, and so our focus should be on developing technologies that will make sense of this "Deluge of Data"
Replicable: results can be reproduced from an independent analysis (different lab, model system, software…)
Reproducible: results can be reproduced using your code and data
OCLC, currently incorporated as OCLC Online Computer Library Center, Incorporated, is an American nonprofit cooperative organization "dedicated to the public purposes of furthering access to the world's information and reducing information costs". It was founded in 1967 as the Ohio College Library Center. OCLC and its member libraries cooperatively produce and maintain WorldCat, the largest online public access catalog (OPAC) in the world. OCLC is funded mainly by the fees that libraries have to pay for its services (around $200 million annually as of 2016).
Dies geschieht mit der Unterstützung von Informationsfachleuten und mit informationstechnischen Werkzeugen. (Abbildung Klump, 2009)
Datensätze dokumentieren
Den eigenen Datensatz sinnvoll zu dokumentieren sollte dem Datenproduzenten in Hinblick auf die gute wissenschaftliche Praxis sowie aufgrund von Reproduzierbarkeit und Transparenz gegenüber Dritten eine Herzensangelegenheit sein.
Fragen der Dokumentation von Forschungsdaten in den Sozial- und Wirtschaftswissenschaften noch zu wenig in der akademischen Lehre verankert.
Datenproduzenten sollte freilich klar sein: Eine gute Dokumentation macht es externen Datennutzern einfacher, die Daten zu re-analysieren und die vom Datenproduzenten geleistete Arbeit mit einer Referenz, also einem Zitat, zu honorieren. Fehlt die Dokumentation, verschenkt der Datenproduzent eine mögliche Anerkennung seiner Arbeit („credit“) durch Dritte.
Hauptziel einer Dokumentation ist es, die Entstehung des Datensatzes nachvollziehbar zu machen und ihn so zu beschreiben, dass Dritte damit arbeiten können. Der Aufwand, der dafür nötig ist, hängt zum einen vom Umfang des Datensatzes selber ab.
Zudem gibt es einige übergeordnete Informationen zu Datensätzen, die pauschal zur Verfügung gestellt werden sollten. Diese Informationen helfen den möglichen Nachnutzern bei der Entscheidung, ob die Daten relevant sein können. Folgende Punkte lassen sich darunter fassen:
Inhalt
Potentielle Nachnutzer eines Datensatzes werden im Allgemeinen versuchen, Angaben und Informationen über den Inhalt eines Datensatzes zu finden. Hilfreich dafür sind schlagwortartige Beschreibungen (z.B. „Arbeitsmarkt“, „Partnerschaften“, „Wahlen“, „Xenophobie“, „Investitionsgüter“) ebenso wie die Angabe von standardisierten inhaltsbezogenen Codes, z.B. JEL-Codes (ein Kder US-Ökonomenvereinigung American Economic Association),kreispfeil die eine Einordnung in bestimmte Forschungsfelder erlauben.
Der Nachteil dieser spezifischen Codes ist allerdings, dass ein Datenproduzent manchmal nicht abschätzen kann, in welchen ihm unbekannten bzw. wenig vertrauten Forschungsfeldern seine Daten für andere nutzbar sein könnten
Daher empfiehlt es sich, ein Abstract zu schreiben, das den Dateninhalt genauer spezifiziert als es ein einzelnes Schlagwort kann. Hier findet sich ein gutes Beispiel für das Abstract eines Datensatzes.lassifikationsschema für Forschungsinhalte
2. Beobachtungseinheit
Die Beobachtungseinheit ist die kleinste Ebene, die im Datensatz vorhanden ist. Sie muss in der Dokumentation klar benannt und beschrieben werden.
Im sozial- und wirtschaftswissenschaftlichen Kontext können dies Länder, Personen oder Güter sein.
3. Datengrundlage
Als Nächstes muss der potenzielle Nutzer informiert werden, ob es sich bei den Daten um eine Vollerhebung oder um eine Stichprobe aus einer Grundgesamtheit handelt.
Hierdurch erhält er im Idealfall direkt die Information darüber, welche Aussagen aufgrund der Daten überhaupt möglich sind.
Bei Stichproben ist eine Definition der Grundgesamtheit sowie die Frage, wie versucht wurde, die Stichprobe aus der Grundgesamtheit abzuleiten, essentiell
Bei einer Stichprobe stellt sich deswegen immer die Frage, wie sie erhoben wurde. Handelt es sich um eine Zufallsstichprobe, um eine Quotenstichprobe oder um eine Ziehung ohne
Die Art der Stichprobe hat wiederum Einfluss auf die Aussagekraft der Daten – und somit auch auf die Breite der Fragestellungen, für die eine Nachnutzung der Daten sinnvoll ist. Zur Einschätzung der Validität der Daten sind Angaben zum Prozess der Erhebung essentiell. So sollte z.B. dokumentiert werden, wie viele Einheiten (etwa Personen oder Betriebe) ursprünglich befragt werden sollten („Bruttosample“) und wie viele letztendlich teilgenommen haben („Nettosample“).
4. Erhebungsmethode
Daten können ganz unterschiedlich gewonnen werden und in verschiedenen Formen vorliegen. Dies genau darzulegen ist wichtig, um die Daten richtig interpretieren sowie deren Reliabilität (Messgenauigkeit) und Validität (Aussagekraft) einschätzen zu können. Beispielsweise lassen sich Zeitungsauschnitte zu einem Thema als Daten erfassen, Interviews mit Personen (die quantitativ oder qualitativ sein können) oder Suchanfragen auf Internetseiten können dabei eine Datengrundlage bilden. Insbesondere durch die fortschreitende Digitalisierung unseres Alltags lassen sich immer mehr Wege finden, an Daten zu kommen und diese zu wissenschaftlichen Zwecken zu nutzen. Umso wichtiger wird in diesem Zusammenhang die Dokumentation der Erhebungsmethode (für Standarderhebungsmethoden in persönlichen Interviews, siehe z.B. Schnell, 2012), so dass zusätzliche Informationen auch aus Fragebögen, Skalenhandbüchern, Testbeschreibungen, Kodierungsvorschriften, Übersetzungshilfen, oder Anschreiben gezogen werden können – kurzum alles, was den Prozess der Datenerstellung für den Nutzer konkretisiert.
5. Umfang
Der Umfang der Daten ist wesentlich, wenn über den weiteren Gebrauch entschieden wird. Dabei geht es zum einen um die Anzahl an Beobachtungen.
Wesentlich wichtiger ist aber, wie der in Punkt 1 angegebene Inhalt erfasst wird, also wie viele Variablen im Datensatz enthalten sind und was sie konkret messen. Hier kann eine veröffentlichte Aufsatz-Dokumentation, die den Lesern einen ersten Überblick geben soll, in der Regel nicht weit ins Detail gehen.
Weiterführende Dokumentationen sind dann für die tatsächlichen Nutzer gedacht, die Genaueres über die Erhebung erfahren möchten.
Hierfür ist die Erstellung eines so genannten Codebuches bzw. Datenhandbuches sinnvoll. Ein Beispiel für ein sehr ausführliches Codebuch findet sich beim SOEP: „Codebook: Household level questionnaires“.
6. Zugang
Finally, it is important to state whether and how a secondary user can obtain the data in question. First, a contact person or an institution must be named that is responsible for access to and distribution of the data (if distribution is intended). Most datasets cannot simply be made publicly available, since data protection regulations must be observed even for self-collected data.
The first point of contact for such questions is the data protection officer of the respective institution, who should always be consulted before any study with self-collected data when in doubt.
Increasingly, it is possible to provide data via download and to issue dedicated certificates for this purpose (usually on the basis of a usage agreement). Note the distinction between commercial and academic users, for whom different conditions usually apply. Any costs of reuse, which can arise from shipping alone even for data that are free in principle, should also be stated. Particularly in a university setting, it matters whether a teaching version of the data exists that is less sensitive under data protection law and that may be offered to students at a reduced price or free of charge (e.g. via download).
Technical aspects alone do not cover everything: social aspects matter too!
Conversion of social capital (credibility) into other forms of capital: funding, access to equipment, data, new arguments, and publication, resulting in a reputation gain through reception and recognition by peers. Success is measured by the efficiency of converting one form of capital into another (modified after Latour and Woolgar, 1982).
The value of making research data available is broadly accepted. Policies concerning open access to research data try to implement new norms calling for researchers to make their data more openly available. These policies either appeal to the common good or focus on publication and citation as an incentive to bring about a cultural change in how researchers share their data with their peers. But when we compare the total number of publications in the fields of science, technology and medicine with the number of data publications from the same time period, the number of openly available datasets is rather small. This indicates that current policies on data sharing are not effective in changing behaviours and bringing about the desired cultural change. By looking at research communities that are more open to data sharing, we can study the social patterns that influence data sharing and identify possible points for intervention and change.
Open Science is the movement to make scientific research and data accessible to all. It includes practices such as publishing open scientific research, campaigning for open access and generally making it easier to publish and communicate scientific knowledge. Additionally, it includes other ways to make science more transparent and accessible during the research process. This includes open notebook science, citizen science, and aspects of open source software and crowdfunded research projects.
The many advantages of this movement include:
Greater availability and accessibility of publicly funded scientific research outputs;
Possibility for rigorous peer-review processes;
Greater reproducibility and transparency of scientific works;
Greater impact of scientific research.
(http://www.unesco.org/new/en/communication-and-information/portals-and-platforms/goap/open-science-movement/)
On March 15, 2016, the FAIR Guiding Principles for scientific data management and stewardship were formally published in the Nature Publishing Group journal Scientific Data. The problem the FAIR Principles address is the lack of widely shared, clearly articulated, and broadly applicable best practices around the publication of scientific data. While the history of scholarly publication in journals is long and well established, the same cannot be said of formal data publication. Yet data could be considered the primary output of scientific research, and its publication and reuse are necessary to ensure validity and reproducibility, and to drive further discoveries. The FAIR Principles address these needs by providing a precise and measurable set of qualities a good data publication should exhibit – qualities that ensure that the data is Findable, Accessible, Interoperable, and Reusable (FAIR).
The principles were formulated after a Lorentz Center workshop in January, 2014 where a diverse group of stakeholders, sharing an interest in scientific data publication and reuse, met to discuss the features required of contemporary scientific data publishing environments. The first-draft FAIR Principles were published on the Force11 website for evaluation and comment by the wider community – a process that lasted almost two years. This resulted in the clear, concise, broadly-supported principles that were published today. The principles support a wide range of new international initiatives, such as the European Open Science Cloud and the NIH Big Data to Knowledge (BD2K), by providing clear guidelines that help ensure all data and associated services in the emergent ‘Internet of Data’ will be Findable, Accessible, Interoperable and Reusable, not only by people, but notably also by machines.
The recognition that computers must be capable of accessing a data publication autonomously, unaided by their human operators, is core to the FAIR Principles. Computers are now an inseparable companion in every research endeavour. Contemporary scientific datasets are large, complex, and globally-distributed, making it almost impossible for humans to manually discover, integrate, inspect and interpret them. This (re)usability barrier has, until now, prevented us from maximizing the return-on-investment from the massive global financial support of big data research and development projects, especially in the life and health sciences. This wasteful barrier has not gone unnoticed by key agencies and regulatory bodies. As a result, rigorous data management stewardship – applicable to both human and computational “users” – will soon become a funded, core activity within modern research projects. In fact, FAIR-oriented data management activities will increasingly be made mandatory by public funding bodies.
The high level of abstraction of the FAIR Principles, sidestepping controversial issues such as the technology or approach used in the implementation, has already made them acceptable to a variety of research funding bodies and policymakers. Examples include FAIR Data workshops from EU-ELIXIR, the inclusion of FAIR in the future plans of Horizon 2020, and advocacy from the American National Institutes of Health. As such, it seems assured that these principles will rapidly become a key basis for innovation in the global move towards Open Science environments. Accordingly, the timing of the Principles' publication is aligned with the Open Science Conference in April 2016.
With respect to Open Science, the FAIR Principles advocate being "intelligently open", rather than "religiously open". The Principles do not propose that all data should be freely available – in particular with respect to privacy-sensitive data. Rather, they propose that all data should be made available for reuse under clearly defined conditions and licenses, through a well-defined process, and with proper and complete acknowledgement and citation. This will allow much wider participation of players from, for instance, the biomedical domain and industry, where rigorous and transparent data usage conditions are a core requirement for data reuse.
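What "FAIR" means for a single dataset can be made concrete in its metadata record. The sketch below is purely illustrative: the field names loosely follow the DataCite metadata schema, and the identifier, URL, and creator are placeholders, not real records. It shows how each of the four FAIR qualities can be stated explicitly and checked mechanically:

```python
# Illustrative only: a minimal metadata record making the four FAIR
# qualities explicit. Field names loosely follow the DataCite schema;
# the DOI, URL, and creator below are placeholders.
record = {
    # Findable: a globally unique, persistent identifier
    "identifier": {"identifierType": "DOI", "value": "10.9999/example.2016.001"},
    # Accessible: a well-defined retrieval location
    "url": "https://repository.example.org/datasets/001",
    # Interoperable: an open, standard format
    "formats": ["text/csv"],
    # Reusable: an explicit, machine-readable licence
    "rightsList": [{"rights": "CC BY 4.0"}],
    "creators": [{"creatorName": "Doe, Jane"}],
    "titles": ["Example survey microdata"],
}

def is_fair_ready(rec):
    """Check that the record declares an ID, a location, a format, and a licence."""
    return all([
        rec.get("identifier", {}).get("value"),
        rec.get("url"),
        rec.get("formats"),
        rec.get("rightsList"),
    ])

print(is_fair_ready(record))
```

The point of the check is the machine-actionability the Principles emphasise: a computer, not only a human reader, can verify that the conditions for reuse are stated.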
"I am very proud that, just over two years after the meeting where we came up with the early FAIR Principles, they play such an important role in many forward-looking policy documents around the world, and that the authors of this paper are also in positions that allow them to follow these Principles. I sincerely hope that FAIR data will become a 'given' in the future of Open Science, in the Netherlands and globally", says Barend Mons, Professor in Biosemantics at the Leiden University Medical Center.
DOI is an acronym for "digital object identifier", meaning a "digital identifier of an object".
A DOI name is an identifier (not a location) of an entity on digital networks.
It provides a system for persistent and actionable identification and interoperable exchange of managed information on digital networks. A DOI name can be assigned to any entity — physical, digital or abstract — primarily for sharing with an interested user community or managing as intellectual property.
The DOI system is designed for interoperability; that is to use, or work with, existing identifier and metadata schemes.
A Digital Object Identifier (DOI) is an alphanumeric string assigned to uniquely identify an object. It is tied to a metadata description of the object as well as to a digital location, such as a URL, where all the details about the object are accessible.
In order to create new DOIs and assign them to your content, it is necessary to become a DataCite member or work with one of the current members.
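A DOI name has the syntactic form "10.&lt;registrant code&gt;/&lt;suffix&gt;" and resolves through the https://doi.org/ proxy. The sketch below (Python, stdlib only) validates that basic shape and builds the resolution URL; the validation pattern here is a simplified assumption, not the full DOI syntax specification:

```python
import re
from urllib.parse import quote

# Simplified shape check: "10." + registrant code + "/" + non-empty suffix.
# The real DOI syntax (ISO 26324) is more permissive; this is a sketch.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")

def resolution_url(doi):
    """Return the doi.org resolution URL for a syntactically valid DOI name."""
    if not DOI_PATTERN.match(doi):
        raise ValueError(f"not a valid DOI name: {doi!r}")
    prefix, suffix = doi.split("/", 1)
    # Percent-encode the suffix so reserved characters survive in the URL.
    return f"https://doi.org/{prefix}/{quote(suffix, safe='')}"

# Resolving the DOI of the FAIR Principles paper in Scientific Data:
print(resolution_url("10.1038/sdata.2016.18"))
```

Because the DOI name identifies the object rather than a location, the repository behind the doi.org redirect can move without breaking the citation: only the redirect target is updated.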
Technical aspects alone do not cover everything: social aspects matter too!
A network of Open Access repositories, archives and journals that support Open Access policies. The OpenAIRE Consortium is a Horizon 2020 (FP8) project aimed at supporting the implementation of the EC and ERC Open Access policies.
Its successor, OpenAIREplus, aims to link the aggregated research publications to the accompanying research and project information, datasets and author information.
Open access to scientific peer reviewed publications has evolved from a pilot project with limited scope in FP7 to an underlying principle in the Horizon 2020 funding scheme, obligatory for all H2020 funded projects. The goal is to make as much European funded research output as possible available to all, via the OpenAIRE portal.
— openaire.eu FAQ
The Zenodo research data repository is a product of OpenAIRE. The OpenAIRE portal is online.