Research Data Management for Econometrics
1. Econometrics of Panel Data and Network Analysis
Research Data
Management
Module 1
Dr. Peter Löwe
Berlin, 03. 08. 2017
2. Agenda
1. Why bother: A crisis, horror stories & a Panda-Oncologist
2. Size is relative: Doctor House, Big Data, and a long tail
3. Reality Check: Doing science in the 21st century
4. Research Data Management according to Gollum and XKCD
5. Persistent Identifiers: Digital dog tags for everything and everyone !
6. Research Data Repositories & good reads
7. Conclusion: Culture change & happy Pandas
Peter Löwe 2017-08-02
Research Data Management: Module 1
3. 1 Today's menu
• Why Research Data Management matters and how it should work (perfect world)
• How stuff currently works (state of the art)
• How stuff will work soon (outlook)
• How to get started (self help)
4. 1 Drivers for Research Data Management
https://www.kent.ac.uk/library/research/data-management/manage.html
Why you should care (internal motivation)
• Increase the efficiency of your research process
• Avoid losing data
• Enable data re-use and sharing
Why you are going to care (external motivation)
• Meet the requirements of research funders and your institute
• Comply with the policies of a growing number of journal publishers on making the data underlying publications available
• Increase your visibility (citations)
5. 1 Research Data includes
• Questionnaires/surveys
• Raw experimental data
• Analysed data
• Databases
• Simulations and research code (software)
• Audio-visual materials
• Laboratory and field notes
• Clinical data, including clinical records
• Images and photographs
6. 1 The Research Data Spectrum
Physical → Digital
• Handwritten letters → scanned & OCR version
• Images or photos → scanned digital version
• Soil samples → analysed result of samples
• Tissue samples → analysed result of samples
• Archaeological dig sites → 3D models of the dig site
• …
7. 1 Issue: The Reproducibility Crisis
Nature 533, 452–454 (26 May 2016) doi:10.1038/533452a
https://www.slideshare.net/AustralianNationalDataService/research-data-management-in-practice-ria-data-management-workshop-brisbane-2017
• A methodological crisis in science
• The phrase was coined in the early 2010s as part of a growing awareness of the problem
• 2016: in a poll of 1,500 scientists, 70% of them had failed to reproduce at least one other scientist's experiment
• The results of many scientific studies are difficult or impossible to replicate on subsequent investigation
https://en.wikipedia.org/wiki/Replication_crisis
8. 1 Data Sharing and Management Snafu in 3 Short Acts
[Snafu: "Situation normal, all f***ed up"]
10. 1 Discussion
Have you encountered something similar?
How would you deal with such a situation?
Where do you store your data?
How much data would you lose if your laptop was stolen?
11. 1 Reproducibility decreases over time due to increasing data loss
http://www.nature.com/news/scientists-losing-data-at-a-rapid-rate-1.14416
“In their parents' attic, in boxes in the garage, or stored on now-defunct
floppy disks — these are just some of the inaccessible places in which
scientists have admitted to keeping their old research data. Such practices
mean that data are being lost to science at a rapid rate, a study has now
found.”
12. 1 Night of the Living Data
http://www.eweek.com/database/5-data-management-horror-stories-to-avoid
14. 1 Way Out: Keep Science FAIR (perfect world)
Principles to ensure research data is FAIR: Findable, Accessible, Interoperable, Reusable
"The problem the FAIR Principles address is the lack of widely shared, clearly articulated, and broadly applicable best practices around the publication of scientific data"
"FAIRness is a prerequisite for proper data management and data stewardship"
Mark D. Wilkinson et al. The FAIR Guiding Principles for scientific data management and
stewardship, Scientific Data (2016). DOI: 10.1038/sdata.2016.18
16. 2 Life Expectancy of Digital Storage Media
http://www.zeit.de/wissen/2013-10/s37-infografik-speichermedien.pdf
https://homsum.files.wordpress.com/2014/04/dr_house_hugh_laurie_desktop_1152x864_wallpaper-83467.jpg
17. 2 Life Expectancy of Digital Storage Media
Storage capacity grows, but not the lifespan.
Average life-span: about 10-30 years
18. 2 Big Data Buzzwords: The Four V's
19. 2
Size is not everything:
Big Data and the Long Tail of Science
http://www.nature.com/neuro/journal/v17/n11/full/nn.3838.html
Big data from small data:
data-sharing in the 'long tail' of neuroscience
Long Tail of Science:
• {Astro|Nuclear}-physics
• Genome studies
• Remote Sensing
The overall amount continues to increase due to "Big Data" (Volume | Velocity)
20. 3 Data-driven Science
http://www.allthingsdistributed.com/2007/02/help_find_jim_gray.html
Paradigms of Science:
1. empirical
2. theoretical
3. computational
4. data-driven
21. 3 The Fourth Paradigm
"It's the data, stupid"
Dr Gray's call-to-arms was [..] “to have a world
in which
• all of the science literature is online,
• all of the science data is online, and they
• interoperate with each other.”
22. 3 Innovation in Science travels at different velocities
• Science in general is affected by digital innovation.
• Every field of science is different,
• but some are further ahead in embracing different aspects of change.
• Exchange of lessons learned across disciplines is needed.
http://i.quoteaddicts.com/media/q1/1487862.png
23. The Lifecycle of a Scientific Idea (Elegant High-Level Perspective)
Influenced by computer-driven science and "Big Data"?
24. The Lifecycle of a Scientific Idea: Reality Check
1. Formulate a theory
2. Gather data
3. Learn about data storage
4. Learn about data movement protocols
5. Lose data
6. Check out of rehab
7. Learn about backup and replication
8. Gather data
9. Learn about versioning
10. Start preliminary analysis
11. Buy a newer laptop
12. Buy more memory
13. Buy a desktop with more memory
14. Buy a bigger monitor & GPUs "for work"
15. Google "250GB Excel Spreadsheet"
16. Learn about batch processing
17. Learn about batch schedulers
18. Learn about patience
19. Learn more about data storage
20. Learn about distributed systems
21. Go back through notes to remember the science question
22. Learn R & Python
23. Learn Linux admin
24. Finish preliminary analysis
25. Grow a ponytail
26. Write a paper
27. Learn about data publishing
28. Learn about reproducibility
29. Plot the death of your advisor/dept. head
30. Apply for grants & research allocations on public systems
31. Wait to apply next time
32. Finish analyzing data
33. Reformulate your theory
34. Goto 1
Source: John Fonner (2016) Jupyter Ascending, http://bit.ly/2vmTwCR
Reality Check:
Science is green, IT & the rest is blue, data-wrangling is red.
Many data-wrangling challenges!
25. 4
Data Wrangling:
Research Data Management (RDM)
http://www.oclc.org/content/dam/research/images/publications/rdm-framework-4-with-cc.png
Today's menu:
YOU
Infrastructure (is there one yet?)
26. 4
RDM
Responsibilities before, during and after a research project
data/assets/pdf_file/0009/394056/research-data-management-in-practice.pdf
YOU
27. 4 Data Curation Continuum
[Figure: the Data Curation Continuum divided into four domains of responsibility: personal domain → (transfer) → group domain → (transfer) → persistent domain → (publication) → access domain, spanning the pre-research, research and post-research phases. In the process of data transfer, the existing metadata is enriched with further elements. (After Klump, 2009)]
28. 4 Pre Research: Institutional Requirements
Institutional Data Management Framework:
• Institutional policy and procedures
• Support services: people and other means of providing advice and support
• IT infrastructure: the hardware, software and other facilities
• Metadata management: so that data records can be meaningful and fit for purpose
29. 4 Pre Research: Data Management Plan (perfect world)
data organisation and storage;
metadata standards and guidelines;
backups;
archiving for long-term preservation;
version control and derived data products;
data sharing or publishing intentions, including licensing;
ensuring security of confidential data;
data synchronisation; and
governance, roles and responsibilities.
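A plan along these lines can be sketched as a machine-checkable checklist. The section names below come from this slide; the function and the sample answer are illustrative, not a funder template:

```python
# A data management plan as a checklist (illustrative structure).
DMP_SECTIONS = [
    "data organisation and storage",
    "metadata standards and guidelines",
    "backups",
    "archiving for long-term preservation",
    "version control and derived data products",
    "data sharing or publishing intentions, including licensing",
    "ensuring security of confidential data",
    "data synchronisation",
    "governance, roles and responsibilities",
]

def unanswered(plan: dict) -> list:
    """Sections of the plan that still lack an answer."""
    return [s for s in DMP_SECTIONS if not plan.get(s)]

plan = {s: "" for s in DMP_SECTIONS}
plan["backups"] = "Nightly copy to institutional storage; weekly offsite."
```

Filling in each section before data collection starts is the "perfect world" the slide title refers to.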
30. 4 Documentation 101
a) Document your data sets.
b) Ask your data repository how to document correctly (metadata!).
c) If you do not document, you're wasting an opportunity to receive credit through citation and reuse.
d) Not to be missed:
Topic (keywords, controlled vocabulary, abstract)
Observation unit (countries, people, etc.)
Data basis (random sampling, complete survey, etc.)
Sampling method
Extent
Access: limitations, embargo, POC
31. 4 Metadata 101
Metadata (structured data about the data)
• Who collected the data?
• Who funded the research project?
• When (and where) was it collected?
• Instruments and setting for collecting the data?
• Title of the dataset
• Methods used to process the data
• Etc. etc.
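As a minimal sketch, the questions above can be captured as a structured record and stored alongside the dataset, e.g. as JSON. All field names and values here are illustrative, not a formal metadata standard:

```python
import json

# Minimal structured metadata record answering the questions on this slide.
# All field names and values are illustrative, not a formal standard.
metadata = {
    "title": "Household survey 2016, wave 3",            # title of the dataset
    "creator": "Jane Doe",                               # who collected the data
    "funder": "Example Research Council",                # who funded the project
    "collected": {"start": "2016-01", "end": "2016-06",  # when (and where)
                  "location": "Berlin"},
    "instrument": "CAPI questionnaire, version 2.1",     # instruments and setting
    "processing": "Outliers winsorised at the 1% level", # processing methods
}

def to_metadata_json(record: dict) -> str:
    """Serialise the record as JSON for a sidecar file next to the data."""
    return json.dumps(record, ensure_ascii=False, indent=2)
```

A repository will usually prescribe its own schema; the point is that the answers live in a structured file, not only in someone's head.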
32. 4 Appropriate File Formats
• Open and non-proprietary
• Human readable, non-binary
• Patent-free
• ISO-standards
• textual data: XML, TXT, HTML, PDF/A (Archival PDF)
• Tabular data (spreadsheets): CSV
• Databases: XML, CSV
• Images: TIFF, PNG, JPEG*
• Audio: FLAC, WAV, MP3
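For instance, tabular data can be written to CSV with nothing but the standard library. The helper and the sample rows below are illustrative:

```python
import csv
import io

def rows_to_csv(rows, fieldnames):
    """Write tabular data as CSV: open, non-proprietary and human readable."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

# Illustrative rows; in practice these would come from your spreadsheet.
rows = [{"county": "A", "gdp": 1.2}, {"county": "B", "gdp": 0.9}]
csv_text = rows_to_csv(rows, ["county", "gdp"])
```

Unlike a proprietary spreadsheet file, the result can still be read decades later with any text editor.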
33. 4 Include a Manifest / readme File!
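A manifest typically pairs each file name with a fixity checksum, while the readme explains what the files contain. A minimal sketch using SHA-256 (the function and file names are illustrative):

```python
import hashlib

def checksum(data: bytes) -> str:
    """SHA-256 hex digest, used as a fixity check for a file's content."""
    return hashlib.sha256(data).hexdigest()

def manifest_line(name: str, data: bytes) -> str:
    """One 'digest  filename' line, the format produced by sha256sum."""
    return f"{checksum(data)}  {name}"

# An accompanying readme then explains what each listed file contains.
readme = ("Dataset: example survey\n"
          "Files: data.csv (observations), codebook.txt (variable definitions)\n")
```

Anyone receiving the dataset can recompute the digests and detect silently corrupted or missing files.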
34. 4 Data Life Cycle: Personal Domain Perspective
http://cdn.ttgtmedia.com/informationsecurity/images/vol4iss7/ism_v4i7_f4_DataLifecycle.gif
The most critical stage in the research data lifecycle is the completion of the research project. In most cases there is no follow-up funding to maintain the research data, and the scientist has to focus on the next project.
!!!
35. 4 Publishing and Sharing Data
Publishing and Sharing data ≠ Open Access to data
• “Open” and “Closed” are relative concepts.
• “Closed” ≈ conditional access based on individual
permission
• “Closed” ≈ conditional access based on roles
Metadata / Research Data combinations:
• Open metadata, open data
• Open metadata, closed data
• Closed metadata, open data
• Closed metadata, closed data
36. 4 Continual data curation across domains
37. 4 Data Curation Continuum: Visibility and Circulation
[Figure: the Data Curation Continuum: personal domain → (transfer) → group domain → (transfer) → persistent domain → (publication) → access domain; visibility rises from low to high along the way.]
38. 4 Data Delay Strategies?
https://www.explainxkcd.com/wiki/index.php/1805:_Unpublished_Discoveries
39. 4 The Grant Cycle according to XKCD (and Machiavelli?)
http://phdcomics.com/comics/archive.php?comicid=1431
40. 4 The Reputation Economy
Open Access to Data:
• Science has become a reputation economy.
• The fundamental difference between disciplines is the trade-off between reputation and collaboration at points of the reputation economy where changes in the form of capital occur.
• Sharing data as a form of collaboration must be balanced by a similar gain in reputation.
• "[…] collaborative disciplines enforce data sharing as a social norm where non-compliance will result in some form of penalty […]"
41. 4
Research Parasites Paradigm:
Open Access for Data is evil
https://media.tenor.com/images/236ee382fdf16973567dc3bb44c21b51/tenor.gif
Lego Gollum
43. 4
A Solution for the Crisis
Open Science enables Reproducible Science
https://en.wikipedia.org/wiki/Open_science#/media/File:Open_Science_-_Prinzipien.png
Open Science is the movement to make scientific research and data accessible to all.
Benefits:
• Greater availability and accessibility of publicly funded scientific research outputs
• Possibility for rigorous peer-review processes
• Greater reproducibility and transparency of scientific works
• Greater impact of scientific research
44. 4 Reality check: Gollum (still) beats Prometheus by 10:1
https://s-media-cache-ak0.pinimg.com/originals/21/94/ed/2194ed6879d5bfd93679326508d382cd.jpg
• Gift culture still prevails
• It's not the technology
• It's not the generational change
• How to trigger cultural change?
Science Technology Medicine (STM), 2006-2016: ~30 million papers published vs. ~3 million data publications (Klump 2017): 10:1
45. 4
Paradigm Change induced by Funding Agencies:
Watering hole approach instead of stick & carrot
http://i.dailymail.co.uk/i/pix/2016/01/14/17/3025C04C00000578-3398562-image-a-16_1452793763082.jpg
Carrot & stick
did not work
Control the watering hole:
Works (for now)
46. 4 FAIR principles: As guidelines
https://commons.wikimedia.org/wiki/File:FAIR_data_principles.jpg
http://www.macs.hw.ac.uk/~ajg33/wp-content/uploads/2016/03/FAIR-Article-Poster.jpg
"The problem the FAIR Principles address is the lack of widely shared, clearly articulated, and broadly applicable best practices around the publication of scientific data"
47. 5 Technical Requirements for FAIR
• Easy and permanent access to research data via the internet
• Enhanced discovery, retrieval and management of data to enable data reuse and verification of research results
48. 5 Benefits of Citation
• Including citable data in related publications increases the citation rate of those publications
• Only cited data can be counted and tracked (in a similar manner to journal articles) to measure impact
• Routine citation of data will assist in gaining acknowledgement of data as a first-class research output
• Citations for published data can be included in CVs along with journal articles, reports and conference papers
49. 5
Technical Challenge:
Unbreakable internet-based Citation
Stable linking needed:
• Data will move; URL links to webpages will break.
• An unbreakable alternative is needed!
50. 5 Digital Object Identifiers (DOI)
• The International DOI Foundation was founded in 1998.
• The DOI system offers long-term persistence and accessibility of data.
• It is based on the Handle system.
• In May 2012 the DOI system ISO standard 26324 was published.
• Part of the quality control is mandatory metadata for each object registered with a DOI.
51. 5 What is a DOI?
DOI: acronym for "digital object identifier".
A DOI name is an identifier (not a location) of an entity on digital
networks.
What you see: alphanumeric string (never changes)
Associated with: location (such as URL)
Accompanied with: who, what, when… (metadata)
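The split between an unchanging name and a changeable location can be illustrated in a few lines. The regular expression is a simplified pattern for common DOI names, not the full ISO 26324 syntax:

```python
import re

# Simplified pattern for common DOI names such as 10.5446/31036;
# the full DOI syntax (ISO 26324) is more permissive.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")

def is_doi(name: str) -> bool:
    """Check whether a string looks like a DOI name (an identifier, not a location)."""
    return DOI_PATTERN.match(name) is not None

def resolver_url(doi: str) -> str:
    """The associated location is looked up via the doi.org resolver."""
    return f"https://doi.org/{doi}"
```

The DOI name printed in a citation never changes; only the location it resolves to may be updated behind the resolver.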
52. 5
DataCite Metadata Schema
Mandatory properties
Part of the quality control is mandatory metadata for each
object registered with a DOI:
• Identifier (with type attribute)
• Creator (with type and nameIdentifier attributes)
• Title (with optional type attribute)
• Publisher
• PublicationYear
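A repository-side check for these mandatory properties might look like the following sketch (field names are lowercased for illustration; the sample record reuses a dataset cited in this module):

```python
# DataCite's five mandatory properties (field names simplified for illustration).
MANDATORY = ("identifier", "creator", "title", "publisher", "publicationYear")

def missing_mandatory(record: dict) -> list:
    """Return the mandatory properties absent from a metadata record."""
    return [field for field in MANDATORY if not record.get(field)]

# Sample record; values taken from a dataset citation used in this module.
record = {
    "identifier": "10.5684/soep.v29",
    "creator": "Schupp, Jürgen et al.",
    "title": "Sozio-oekonomisches Panel (SOEP), Daten der Jahre 1984-2012",
    "publisher": "SOEP",
    "publicationYear": "2013",
}
```

A registration request missing any of these properties would be rejected before a DOI is minted.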
53. 5 DOI is a quality label for data
Datasets with a DOI have to be:
• Stable (i.e. not going to be modified)
• Complete (i.e. not going to be updated)
• Permanent: by assigning a DOI we're committing to make the dataset available for posterity
• Good quality: by assigning a DOI it's receiving the data centre's stamp of approval, saying that it's complete and all the metadata is available
DOI: Seal of Approval
54. 5 DOI for Research Data
https://support.datacite.org/docs/doi-basics
55. 5 DOI Citation Examples
Fahrenberg, Jochen (2010): Freiburger Beschwerdenliste FBL. Primärdaten der Normierungsstichprobe 1993. Version 1.0.0. ZPID Leibniz-Zentrum für Psychologische Information und Dokumentation. Dataset. doi:10.5160/psychdata.fgjn05an08
Rattinger, Hans; Roßteutscher, Sigrid; Schmitt-Beck, Rüdiger; Weßels, Bernhard (2012): Wahlkampf-Panel (GLES 2009). Version: 3.0.0. GESIS Datenarchiv. Dataset. doi:10.4232/1.11131
Schupp, Jürgen; Kroh, Martin; Goebel, Jan; Bartsch, Simone; Giesselmann, Marco et al. (2013): Sozio-oekonomisches Panel (SOEP), Daten der Jahre 1984-2012. Version: 29. SOEP Sozio-oekonomisches Panel. Dataset. doi:10.5684/soep.v29
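Citations like these can be assembled mechanically from the mandatory metadata. A sketch in the style of the examples above (the function name is illustrative):

```python
def cite_dataset(creator, year, title, version, publisher, doi):
    """Assemble a dataset citation in the style of the examples above."""
    return (f"{creator} ({year}): {title}. Version: {version}. "
            f"{publisher}. Dataset. doi:{doi}.")

citation = cite_dataset(
    "Schupp, Jürgen et al.", 2013,
    "Sozio-oekonomisches Panel (SOEP), Daten der Jahre 1984-2012",
    "29", "SOEP", "10.5684/soep.v29")
```

This is exactly why the DataCite mandatory properties exist: every citation element maps to a registered metadata field.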
56. 5 DOI System Architecture
58. 5 Upcoming: Search DOI-registered datasets by ORCID
Find any DOI-registered publication by ORCID
http://dashboard.project-thor.eu
Example: Löwe / Loewe / Lowe? Which of the four Peter Löwes?
59. 6 Data Curation Continuum: Research Data Repositories
[Figure: the Data Curation Continuum once more: personal domain → (transfer) → group domain → (transfer) → persistent domain → (publication) → access domain; research data repositories occupy the high-visibility end.]
60. 6 re3data: Registry of Research Data Repositories
1,500 research data repositories, described by tags:
61. 6 re3data: Search options
62. 6 Research Data Repository (RDR) Development and Services
Currently, the DFG funds two RDR-related projects:
1. SowiDataNet: addressing the social sciences
2. RADAR: addressing the long tail of science
Technology and metadata are compatible.
RADAR is a service offering by FIZ Karlsruhe (testing phase).
Near future:
• SowiDataNet will become a service offering (GESIS)
• Datorium will merge with SowiDataNet
63. 6 RADAR: Research Data Repository Services
Van den Broel K, Furtado F, Engel T (2015): RADAR – A Research Data Repository for the “Long-Tail of Science”
66. 6 Datorium: Data Set Description
67. 6 Datorium: Terms of Access
68. 4 Where NOT to "publish" your Data
Required:
Professional repositories which enable
• long term access,
• search,
• retrieval,
• thorough metadata
69. 6
Alternative (Self help):
All-purpose Repositories
Rueda, Laura. (2017, May). Introduction to DataCite. Zenodo.
http://doi.org/10.5281/zenodo.571808
70. 6 OPENAIRE: RDM on the European Level
https://www.openaire.eu/
https://www.slideshare.net/OpenAIRE_eu/enabling-better-science-results-and-vision-of-the-openaire-infrastructure-and-rda-data-publishing-working-group-55075375
71. 6 Adoption of Open Science in Europe
https://www.fosteropenscience.eu/
72. 6
Research Data in the Social and Economic Sciences
http://dx.doi.org/10.4232/10.fisuzida2014.1
http://auffinden-zitieren-dokumentieren.de
74. 6 Rat für Sozial- und Wirtschaftsdaten / DFG
http://www.dfg.de/download/pdf/foerderung/antragstellung/forschungsdaten/basisinformationen_forschungsdatenmanagement.pdf
76. 6 RESEARCH DATA ALLIANCE
https://www.rd-alliance.org/
77. 6 Data Carpentry Workshops
http://www.datacarpentry.org/
78. 7 AUSTRALIAN NATIONAL DATA SERVICE (ANDS)
79. 7 Wise Advice
https://nicolahemmings.wordpress.com/2016/04/05/mistakes-ive-made-as-an-early-career-researcher/
Mistakes I’ve made as an early career researcher
APRIL 5, 2016
Nicola Hemmings (post-doc, University of Sheffield)
Failing to organise my data adequately (circa 2007).
“Prepare your datasets like you would if you were giving them to a
stranger who knew nothing about them. Label, annotate and
meticulously file your R scripts. Incorporate read-me files into everything
and write them for the monkey that will be you in five years, when you
return to your data and/or analyses for some unforeseen but vitally
important reason. Don't get this wrong. You will regret it."
80. 7
Back to the start:
Snafu? Things are getting better
• This film is scientific non-textual information.
• It is available on the AV-Portal of TIB Hannover, a data portal for scientific audiovisual content.
• DOI link: https://doi.org/10.5446/31036
81. Thank you for your attention.
DIW Berlin — Deutsches Institut
für Wirtschaftsforschung e.V.
Mohrenstraße 58, 10117 Berlin
www.diw.de
Editor
Peter Löwe (ploewe@diw.de)
http://dilbert.com/strip/2010-08-24
Based on the works of
• Paul Wong (2017) ANDS, Research Integrity Advisor Data Management Workshop
• 3TU.Datacentre (2014): Data citation and DOIs
• and others
82. Thank you for your attention.
DIW Berlin — Deutsches Institut
für Wirtschaftsforschung e.V.
Mohrenstraße 58, 10117 Berlin
www.diw.de
Editor
Peter Löwe (ploewe@diw.de)
https://doi.org/10.5446/31036
Let's look at it the other way around: post-science
In their parents' attic, in boxes in the garage, or stored on now-defunct floppy disks — these are just some of the inaccessible places in which scientists have admitted to keeping their old research data. Such practices mean that data are being lost to science at a rapid rate, a study has now found.
The authors of the study, which is published today in Current Biology1, looked for the data behind 516 ecology papers published between 1991 and 2011. The researchers selected studies that involved measuring characteristics associated with the size and form of plants and animals, something that has been done in the same way for decades. By contacting the authors of the papers, they found that, whereas data for almost all studies published just two years ago were still accessible, the chance of them being so fell by 17% per year. Availability dropped to as little as 20% for research from the early 1990s.
“Most of the time, researchers said ‘it’s probably in this or that location’, such as their parents' attic, or on a zip drive for which they haven’t seen the hardware in 15 years," says Timothy Vines, the lead author on the study and an evolutionary ecologist at the University of British Columbia in Vancouver. "In theory, the data still exist, but the time and effort required by the researcher to get them to you is prohibitive.”
Apparently it's an issue.
From a personal perspective: icky.
Best practices for data handling
Should I store my data at home?
The basic idea is that our capacity for collecting scientific data has far outstripped our present capacity to analyze it, and so our focus should be on developing technologies that will make sense of this "Deluge of Data"
Replicable: results can be reproduced from an independent analysis (different lab, model system, software…)
Reproducible: results can be reproduced using your code and data
OCLC, currently incorporated as OCLC Online Computer Library Center, Incorporated, is an American nonprofit cooperative organization "dedicated to the public purposes of furthering access to the world's information and reducing information costs". It was founded in 1967 as the Ohio College Library Center. OCLC and its member libraries cooperatively produce and maintain WorldCat, the largest online public access catalog (OPAC) in the world. OCLC is funded mainly by the fees that libraries have to pay for its services (around $200 million annually as of 2016).
Dies geschieht mit der Unterstützung von Informationsfachleuten und mit informationstechnischen Werkzeugen. (Abbildung Klump, 2009)
Datensätze dokumentieren
Den eigenen Datensatz sinnvoll zu dokumentieren sollte dem Datenproduzenten in Hinblick auf die gute wissenschaftliche Praxis sowie aufgrund von Reproduzierbarkeit und Transparenz gegenüber Dritten eine Herzensangelegenheit sein.
Fragen der Dokumentation von Forschungsdaten in den Sozial- und Wirtschaftswissenschaften noch zu wenig in der akademischen Lehre verankert.
Datenproduzenten sollte freilich klar sein: Eine gute Dokumentation macht es externen Datennutzern einfacher, die Daten zu re-analysieren und die vom Datenproduzenten geleistete Arbeit mit einer Referenz, also einem Zitat, zu honorieren. Fehlt die Dokumentation, verschenkt der Datenproduzent eine mögliche Anerkennung seiner Arbeit („credit“) durch Dritte.
Hauptziel einer Dokumentation ist es, die Entstehung des Datensatzes nachvollziehbar zu machen und ihn so zu beschreiben, dass Dritte damit arbeiten können. Der Aufwand, der dafür nötig ist, hängt zum einen vom Umfang des Datensatzes selber ab.
Zudem gibt es einige übergeordnete Informationen zu Datensätzen, die pauschal zur Verfügung gestellt werden sollten. Diese Informationen helfen den möglichen Nachnutzern bei der Entscheidung, ob die Daten relevant sein können. Folgende Punkte lassen sich darunter fassen:
Inhalt
Potentielle Nachnutzer eines Datensatzes werden im Allgemeinen versuchen, Angaben und Informationen über den Inhalt eines Datensatzes zu finden. Hilfreich dafür sind schlagwortartige Beschreibungen (z.B. „Arbeitsmarkt“, „Partnerschaften“, „Wahlen“, „Xenophobie“, „Investitionsgüter“) ebenso wie die Angabe von standardisierten inhaltsbezogenen Codes, z.B. JEL-Codes (ein Kder US-Ökonomenvereinigung American Economic Association),kreispfeil die eine Einordnung in bestimmte Forschungsfelder erlauben.
Der Nachteil dieser spezifischen Codes ist allerdings, dass ein Datenproduzent manchmal nicht abschätzen kann, in welchen ihm unbekannten bzw. wenig vertrauten Forschungsfeldern seine Daten für andere nutzbar sein könnten
Daher empfiehlt es sich, ein Abstract zu schreiben, das den Dateninhalt genauer spezifiziert als es ein einzelnes Schlagwort kann. Hier findet sich ein gutes Beispiel für das Abstract eines Datensatzes.lassifikationsschema für Forschungsinhalte
2. Beobachtungseinheit
Die Beobachtungseinheit ist die kleinste Ebene, die im Datensatz vorhanden ist. Sie muss in der Dokumentation klar benannt und beschrieben werden.
Im sozial- und wirtschaftswissenschaftlichen Kontext können dies Länder, Personen oder Güter sein.
3. Datengrundlage
Als Nächstes muss der potenzielle Nutzer informiert werden, ob es sich bei den Daten um eine Vollerhebung oder um eine Stichprobe aus einer Grundgesamtheit handelt.
Hierdurch erhält er im Idealfall direkt die Information darüber, welche Aussagen aufgrund der Daten überhaupt möglich sind.
Bei Stichproben ist eine Definition der Grundgesamtheit sowie die Frage, wie versucht wurde, die Stichprobe aus der Grundgesamtheit abzuleiten, essentiell
Bei einer Stichprobe stellt sich deswegen immer die Frage, wie sie erhoben wurde. Handelt es sich um eine Zufallsstichprobe, um eine Quotenstichprobe oder um eine Ziehung ohne
Die Art der Stichprobe hat wiederum Einfluss auf die Aussagekraft der Daten – und somit auch auf die Breite der Fragestellungen, für die eine Nachnutzung der Daten sinnvoll ist. Zur Einschätzung der Validität der Daten sind Angaben zum Prozess der Erhebung essentiell. So sollte z.B. dokumentiert werden, wie viele Einheiten (etwa Personen oder Betriebe) ursprünglich befragt werden sollten („Bruttosample“) und wie viele letztendlich teilgenommen haben („Nettosample“).
4. Erhebungsmethode
Daten können ganz unterschiedlich gewonnen werden und in verschiedenen Formen vorliegen. Dies genau darzulegen ist wichtig, um die Daten richtig interpretieren sowie deren Reliabilität (Messgenauigkeit) und Validität (Aussagekraft) einschätzen zu können. Beispielsweise lassen sich Zeitungsauschnitte zu einem Thema als Daten erfassen, Interviews mit Personen (die quantitativ oder qualitativ sein können) oder Suchanfragen auf Internetseiten können dabei eine Datengrundlage bilden. Insbesondere durch die fortschreitende Digitalisierung unseres Alltags lassen sich immer mehr Wege finden, an Daten zu kommen und diese zu wissenschaftlichen Zwecken zu nutzen. Umso wichtiger wird in diesem Zusammenhang die Dokumentation der Erhebungsmethode (für Standarderhebungsmethoden in persönlichen Interviews, siehe z.B. Schnell, 2012), so dass zusätzliche Informationen auch aus Fragebögen, Skalenhandbüchern, Testbeschreibungen, Kodierungsvorschriften, Übersetzungshilfen, oder Anschreiben gezogen werden können – kurzum alles, was den Prozess der Datenerstellung für den Nutzer konkretisiert.
5. Umfang
Der Umfang der Daten ist wesentlich, wenn über den weiteren Gebrauch entschieden wird. Dabei geht es zum einen um die Anzahl an Beobachtungen.
Wesentlich wichtiger ist aber, wie der in Punkt 1 angegebene Inhalt erfasst wird, also wie viele Variablen im Datensatz enthalten sind und was sie konkret messen. Hier kann eine veröffentlichte Aufsatz-Dokumentation, die den Lesern einen ersten Überblick geben soll, in der Regel nicht weit ins Detail gehen.
Weiterführende Dokumentationen sind dann für die tatsächlichen Nutzer gedacht, die Genaueres über die Erhebung erfahren möchten.
Hierfür ist die Erstellung eines so genannten Codebuches bzw. Datenhandbuches sinnvoll. Ein Beispiel für ein sehr ausführliches Codebuch findet sich beim SOEP: „Codebook: Household level questionnaires“.
6. Zugang
Finally, it is important to state whether and how a secondary user can obtain the data in question. First, a contact person or an institution must be named that is responsible for access to and distribution of the data (if distribution is intended). Most datasets cannot simply be made publicly available, since data protection regulations must be observed even for self-collected data.
The first point of contact for such questions is the data protection officer of the respective institution, who should always be consulted before any study with self-collected data when in doubt.
Increasingly, it is possible to provide data via download and to issue dedicated certificates for this purpose (usually on the basis of a usage agreement). Note the distinction between commercial and academic users, for whom different conditions usually apply. Any costs of reuse, which can arise from shipping alone even for data that are free in principle, should also be stated. Particularly in a university setting, it matters whether a teaching version of the data exists that is less sensitive under data protection law and that may be offered to students at a reduced price or free of charge (e.g. via download).
Technical aspects alone do not cover everything: social aspects matter too!
Conversion of social capital (credibility) into other forms of capital: funding, access to equipment, data, new arguments, and publication, resulting in a reputation gain through reception and recognition by peers. Success is measured by the efficiency of converting one form of capital into another (modified after Latour and Woolgar, 1982).
The value of making research data available is broadly accepted. Policies concerning open access to research data try to implement new norms calling for researchers to make their data more openly available. These policies either appeal to the common good or focus on publication and citation as an incentive to bring about a cultural change in how researchers share their data with their peers. But when we compare the total number of publications in the fields of science, technology and medicine with the number of data publications from the same time period, the number of openly available datasets is rather small. This indicates that current policies on data sharing are not effective in changing behaviours and bringing about the desired cultural change. By looking at research communities that are more open to data sharing, we can study the social patterns that influence data sharing and identify possible points for intervention and change.
Open Science is the movement to make scientific research and data accessible to all. It includes practices such as publishing open scientific research, campaigning for open access and generally making it easier to publish and communicate scientific knowledge. Additionally, it includes other ways to make science more transparent and accessible during the research process. This includes open notebook science, citizen science, and aspects of open source software and crowdfunded research projects.
The many advantages of this movement include:
Greater availability and accessibility of publicly funded scientific research outputs;
Possibility for rigorous peer-review processes;
Greater reproducibility and transparency of scientific works;
Greater impact of scientific research.
(http://www.unesco.org/new/en/communication-and-information/portals-and-platforms/goap/open-science-movement/)
On March 15, 2016, the FAIR Guiding Principles for scientific data management and stewardship were formally published in the Nature Publishing Group journal Scientific Data. The problem the FAIR Principles address is the lack of widely shared, clearly articulated, and broadly applicable best practices around the publication of scientific data. While the history of scholarly publication in journals is long and well established, the same cannot be said of formal data publication. Yet data could be considered the primary output of scientific research, and its publication and reuse are necessary to ensure validity and reproducibility, and to drive further discoveries. The FAIR Principles address these needs by providing a precise and measurable set of qualities a good data publication should exhibit – qualities that ensure that the data is Findable, Accessible, Interoperable, and Reusable (FAIR).
The principles were formulated after a Lorentz Center workshop in January, 2014 where a diverse group of stakeholders, sharing an interest in scientific data publication and reuse, met to discuss the features required of contemporary scientific data publishing environments. The first-draft FAIR Principles were published on the Force11 website for evaluation and comment by the wider community – a process that lasted almost two years. This resulted in the clear, concise, broadly-supported principles that were published today. The principles support a wide range of new international initiatives, such as the European Open Science Cloud and the NIH Big Data to Knowledge (BD2K), by providing clear guidelines that help ensure all data and associated services in the emergent ‘Internet of Data’ will be Findable, Accessible, Interoperable and Reusable, not only by people, but notably also by machines.
The recognition that computers must be capable of accessing a data publication autonomously, unaided by their human operators, is core to the FAIR Principles. Computers are now an inseparable companion in every research endeavour. Contemporary scientific datasets are large, complex, and globally-distributed, making it almost impossible for humans to manually discover, integrate, inspect and interpret them. This (re)usability barrier has, until now, prevented us from maximizing the return-on-investment from the massive global financial support of big data research and development projects, especially in the life and health sciences. This wasteful barrier has not gone unnoticed by key agencies and regulatory bodies. As a result, rigorous data management stewardship – applicable to both human and computational “users” – will soon become a funded, core activity within modern research projects. In fact, FAIR-oriented data management activities will increasingly be made mandatory by public funding bodies.
The high level of abstraction of the FAIR Principles, sidestepping controversial issues such as the technology or approach used in the implementation, has already made them acceptable to a variety of research funding bodies and policymakers. Examples include FAIR Data workshops from EU-ELIXIR, the inclusion of FAIR in the future plans of Horizon 2020, and advocacy from the American National Institutes of Health. As such, it seems assured that these principles will rapidly become a key basis for innovation in the global move towards Open Science environments. Accordingly, the timing of the Principles' publication is aligned with the Open Science Conference in April 2016.
With respect to Open Science, the FAIR Principles advocate being "intelligently open", rather than "religiously open". The Principles do not propose that all data should be freely available – in particular with respect to privacy-sensitive data. Rather, they propose that all data should be made available for reuse under clearly defined conditions and licenses, through a well-defined process, and with proper and complete acknowledgement and citation. This will allow much wider participation of players from, for instance, the biomedical domain and industry, where rigorous and transparent data usage conditions are a core requirement for data reuse.
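What "FAIR" means for a single dataset can be made concrete in its metadata record. The sketch below is purely illustrative: the field names loosely follow the DataCite metadata schema, and the identifier, URL, and creator are placeholders, not real records. It shows how each of the four FAIR qualities can be stated explicitly and checked mechanically:

```python
# Illustrative only: a minimal metadata record making the four FAIR
# qualities explicit. Field names loosely follow the DataCite schema;
# the DOI, URL, and creator below are placeholders.
record = {
    # Findable: a globally unique, persistent identifier
    "identifier": {"identifierType": "DOI", "value": "10.9999/example.2016.001"},
    # Accessible: a well-defined retrieval location
    "url": "https://repository.example.org/datasets/001",
    # Interoperable: an open, standard format
    "formats": ["text/csv"],
    # Reusable: an explicit, machine-readable licence
    "rightsList": [{"rights": "CC BY 4.0"}],
    "creators": [{"creatorName": "Doe, Jane"}],
    "titles": ["Example survey microdata"],
}

def is_fair_ready(rec):
    """Check that the record declares an ID, a location, a format, and a licence."""
    return all([
        rec.get("identifier", {}).get("value"),
        rec.get("url"),
        rec.get("formats"),
        rec.get("rightsList"),
    ])

print(is_fair_ready(record))
```

The point of the check is the machine-actionability the Principles emphasise: a computer, not only a human reader, can verify that the conditions for reuse are stated.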
"I am very proud that, just over two years after the meeting where we came up with the early FAIR Principles, they play such an important role in many forward-looking policy documents around the world, and that the authors of this paper are also in positions that allow them to follow these Principles. I sincerely hope that FAIR data will become a 'given' in the future of Open Science, in the Netherlands and globally", says Barend Mons, Professor in Biosemantics at the Leiden University Medical Center.
DOI is an acronym for "digital object identifier", meaning a "digital identifier of an object".
A DOI name is an identifier (not a location) of an entity on digital networks.
It provides a system for persistent and actionable identification and interoperable exchange of managed information on digital networks. A DOI name can be assigned to any entity — physical, digital or abstract — primarily for sharing with an interested user community or managing as intellectual property.
The DOI system is designed for interoperability; that is to use, or work with, existing identifier and metadata schemes.
A Digital Object Identifier (DOI) is an alphanumeric string assigned to uniquely identify an object. It is tied to a metadata description of the object as well as to a digital location, such as a URL, where all the details about the object are accessible.
In order to create new DOIs and assign them to your content, it is necessary to become a DataCite member or work with one of the current members.
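A DOI name has the syntactic form "10.&lt;registrant code&gt;/&lt;suffix&gt;" and resolves through the https://doi.org/ proxy. The sketch below (Python, stdlib only) validates that basic shape and builds the resolution URL; the validation pattern here is a simplified assumption, not the full DOI syntax specification:

```python
import re
from urllib.parse import quote

# Simplified shape check: "10." + registrant code + "/" + non-empty suffix.
# The real DOI syntax (ISO 26324) is more permissive; this is a sketch.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")

def resolution_url(doi):
    """Return the doi.org resolution URL for a syntactically valid DOI name."""
    if not DOI_PATTERN.match(doi):
        raise ValueError(f"not a valid DOI name: {doi!r}")
    prefix, suffix = doi.split("/", 1)
    # Percent-encode the suffix so reserved characters survive in the URL.
    return f"https://doi.org/{prefix}/{quote(suffix, safe='')}"

# Resolving the DOI of the FAIR Principles paper in Scientific Data:
print(resolution_url("10.1038/sdata.2016.18"))
```

Because the DOI name identifies the object rather than a location, the repository behind the doi.org redirect can move without breaking the citation: only the redirect target is updated.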
Technical aspects alone do not cover everything: social aspects matter too!
A network of Open Access repositories, archives and journals that support Open Access policies. The OpenAIRE Consortium is a Horizon 2020 (FP8) project aimed at supporting the implementation of the EC and ERC Open Access policies.
Its successor, OpenAIREplus, aims to link the aggregated research publications to the accompanying research and project information, datasets and author information.
Open access to scientific peer reviewed publications has evolved from a pilot project with limited scope in FP7 to an underlying principle in the Horizon 2020 funding scheme, obligatory for all H2020 funded projects. The goal is to make as much European funded research output as possible available to all, via the OpenAIRE portal.
— openaire.eu FAQ
The Zenodo research data repository is a product of OpenAIRE. The OpenAIRE portal is online.