Research data management for historians
1. CSC – Finnish ICT competence centre for research, education and public administration
Get the most out of your data! Data publication, tracking and citation
Jessica PvE 28.2.2018
2. Working with data: amount of work
Data Science Report 2016, http://visit.crowdflower.com/rs/416-ZBE-142/images/CrowdFlower_DataScienceReport_2016.pdf
3. Monya Baker: 1,500 scientists lift the lid on reproducibility.
Survey sheds light on the ‘crisis’ rocking research. Nature 533,
2016. doi:10.1038/533452a
4. Data citation
• Project by Finnish Committee for Research
• Research data should be FAIR
• Data should have: creator, title, publisher, publication time, identifier
• Recommended additional information: version, resource type, copyright status
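As a minimal sketch, the citation elements listed above could be collected in a small record and joined into one citation line. The field order, the punctuation and all sample values are illustrative assumptions, not a format prescribed by the slides.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DataCitation:
    # Required elements named on the slide
    creator: str
    title: str
    publisher: str
    publication_time: str
    identifier: str
    # Recommended additional information
    version: Optional[str] = None
    resource_type: Optional[str] = None
    copyright_status: Optional[str] = None

    def to_text(self) -> str:
        """Join the elements into one line (the ordering is an assumption)."""
        title = self.title
        if self.version:
            title += f", version {self.version}"
        if self.resource_type:
            title += f" [{self.resource_type}]"
        parts = [f"{self.creator} ({self.publication_time})", title, self.publisher]
        if self.copyright_status:
            parts.append(self.copyright_status)
        parts.append(self.identifier)
        return ". ".join(parts)


# Hypothetical example values, not a real dataset
citation = DataCitation(
    creator="Doe, Jane",
    title="Parish records sample",
    publisher="Example Data Archive",
    publication_time="2018",
    identifier="urn:example:1234",
    version="1.0",
    resource_type="dataset",
)
print(citation.to_text())
```

Keeping the elements as separate fields, rather than one free-text string, makes it easier to check that none of the required ones is missing.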
6. Recommendations
• Include principles of data as evidence and data transparency in the next version of the Finnish RCR guideline by TENK.
• Recognise data authorship as a distinct issue for discussion in the TENK authorship guideline: http://www.tenk.fi/sites/tenk.fi/files/TENK_suositus_tekijyys.pdf
• Create a multi-institutional, multidisciplinary working group to define principles for data authorship, coordinated e.g. by TENK, or assign national representation to a relevant international activity with the same goal.
• Organise multidisciplinary discussion on data management and citation, with the aim of creating interoperable practices.
• Promote the use of the data reference model also when referring to the author's own primary source data.
• Define a field-specific level of granularity for data citation.
7. Findable, Accessible, Interoperable, Reusable
• data licensing
• metadata standards
• open data formats
• shared vocabularies
• informed decisions to share data
• access rights
• descriptive metadata
• persistent identifiers
force11.org/group/fairgroup/fairprinciples
9. Findable
1. (Meta)data are assigned a globally unique and persistent identifier
2. Data are described with rich metadata
3. Metadata clearly and explicitly include the identifier of the data it describes
4. (Meta)data are registered or indexed in a searchable resource
URN
10. Accessible
1. (Meta)data are retrievable by their identifier using a standardized communications protocol
2. Metadata are accessible, even when the data are no longer available
11. Interoperable
1. (Meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation
2. (Meta)data use vocabularies that follow FAIR principles
3. (Meta)data include qualified references to other (meta)data
12. Reusable
(Meta)data are richly described with a plurality of accurate and relevant attributes
13. Persistent Identifiers (DOI, URN)
[Diagram labels: PID, Resolver, Data Catalog, Landing page, Data file, License, Configuration files, Read me]
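To illustrate the step from PID via resolver to landing page, here is a minimal sketch that turns an identifier into the web address of a resolver. It assumes that DOIs are resolved through https://doi.org/ and Finnish URN:NBN identifiers through http://urn.fi/. The function name and the sample URN are hypothetical; the DOI is the one cited on slide 3.

```python
def resolver_url(pid: str) -> str:
    """Return the resolver address for a DOI or a Finnish URN:NBN identifier."""
    pid = pid.strip()
    lowered = pid.lower()
    if lowered.startswith("doi:"):
        # Drop the "doi:" prefix and hand the rest to the DOI resolver
        return "https://doi.org/" + pid[4:]
    if lowered.startswith("10."):
        # Bare DOI without a prefix
        return "https://doi.org/" + pid
    if lowered.startswith("urn:nbn:fi:"):
        # Assumed resolver for Finnish URN:NBN identifiers
        return "http://urn.fi/" + pid
    raise ValueError(f"No resolver known for identifier: {pid!r}")


print(resolver_url("doi:10.1038/533452a"))
print(resolver_url("URN:NBN:fi:example-0001"))  # hypothetical identifier
```

The point of the indirection is that the identifier stays the same even if the landing page moves; only the resolver's record has to be updated.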
14. Nanopublications, linking data and compact identifiers
• https://www.go-fair.org/
• http://identifiers.org/
15. Schofield et al (2015) http://digitalhumanities.org:8081/dhq/vol/9/3/000227/000227.html
Rutgers lib guide
17. Organising your data
• Sort and classify your information
• For instance, don’t mix different types of information in Excel columns: it is usually easier to combine datasets than to sort out ill-structured data later
• Think about granularity (file size) and metadata
• Decide on formats, units, codes, etc., and be consistent
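As an illustration of not mixing different types of information in one column, the sketch below splits entries that combine a number and a unit into two separate fields. The sample entries, the pattern and the function name are assumptions made for this example only.

```python
import re

# Hypothetical raw entries where one spreadsheet column mixes number and unit
raw_entries = ["12 barrels", "3.5 bushels", "7Barrels"]

# A number (optionally with decimals), optional spaces, then a unit word
ENTRY_PATTERN = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*([A-Za-z]+)\s*$")


def split_value_and_unit(cell: str) -> dict:
    """Split a mixed cell into a numeric value and a lower-cased unit."""
    match = ENTRY_PATTERN.match(cell)
    if match is None:
        # Fail loudly instead of guessing, so odd entries get checked by hand
        raise ValueError(f"Unrecognised entry: {cell!r}")
    return {"value": float(match.group(1)), "unit": match.group(2).lower()}


rows = [split_value_and_unit(cell) for cell in raw_entries]
for row in rows:
    print(row["value"], row["unit"])
```

With value and unit in separate columns, sorting, filtering and unit conversion no longer require re-parsing text.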
18. Organising your data
• Write a code book and document your data
• Think about intelligibility
• Be careful when rearranging, reformatting, sorting or copy-pasting data
• Use common file formats, preferably open ones
• Have processes in place for checking data quality and completeness
• Be clear about master copies and other copies
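One possible process for checking quality and completeness is to validate a CSV file against the code book. The sketch below assumes a code book expressed as a Python dictionary; the column names, the codes and the sample data are all hypothetical.

```python
import csv
import io

# Hypothetical code book: column name -> allowed codes (None means free text)
CODE_BOOK = {
    "person_id": None,
    "sex": {"F", "M", "U"},
    "source_type": {"letter", "register", "newspaper"},
}


def check_rows(csv_text: str) -> list:
    """Return a list of problems: missing columns, empty cells, unknown codes."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = [c for c in CODE_BOOK if c not in (reader.fieldnames or [])]
    if missing:
        return ["missing columns: " + ", ".join(missing)]
    problems = []
    # The header is line 1, so the first data row is line 2
    for line_no, row in enumerate(reader, start=2):
        for column, allowed in CODE_BOOK.items():
            value = (row[column] or "").strip()
            if not value:
                problems.append(f"line {line_no}: {column} is empty")
            elif allowed is not None and value not in allowed:
                problems.append(f"line {line_no}: {column} has unknown code {value!r}")
    return problems


sample = "person_id,sex,source_type\np1,F,letter\np2,X,register\np3,M,\n"
for problem in check_rows(sample):
    print(problem)
```

Running such a check each time the master copy changes gives an early warning when a sort or copy-paste has damaged the data.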
19. Organising your data
• Be careful and plan well for sensitive data and anonymization
• Think about security and access rights
• Plan and agree on which versions of a dataset will be archived and/or published
• Think about reproducibility and citing data
20. Files and folders
• Create and agree on a system for naming files and folders, and be consistent
• Avoid very deep folder structures, since they can be difficult to handle
• Use meaningful, unique file and folder names
• Keep names as short as possible but relevant. 25 characters is usually considered the maximum.
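The rules above can be checked automatically. This is a minimal sketch under stated assumptions: the 25-character limit is applied to the name without its extension, the depth threshold of four folders is an arbitrary choice (the slide gives no number), and the paths are hypothetical.

```python
from pathlib import PurePosixPath

MAX_NAME_LENGTH = 25  # limit mentioned on the slide
MAX_DEPTH = 4         # assumed threshold; the slide gives no number


def naming_issues(paths):
    """Return (path, problem) pairs for names that break the naming rules."""
    issues = []
    seen_names = set()
    for text in paths:
        path = PurePosixPath(text)
        if len(path.stem) > MAX_NAME_LENGTH:
            issues.append((text, "name too long"))
        # parts includes the file name itself, so folders = parts - 1
        if len(path.parts) - 1 > MAX_DEPTH:
            issues.append((text, "folder too deep"))
        if path.name in seen_names:
            issues.append((text, "duplicate name"))
        seen_names.add(path.name)
    return issues


example_paths = [
    "letters/1809/letter_001.txt",
    "a/b/c/d/e/notes.txt",
    "letters/1810/letter_001.txt",
    "x/this_is_a_very_long_file_name_indeed.txt",
]
for path, problem in naming_issues(example_paths):
    print(path, "->", problem)
```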
21. Files and folders
• Dates in YYYY-MM-DD format allow you to sort and search your files
• Avoid special characters such as % & / : ; * . ? < > ^ ! " ( ) and Scandinavian letters (å, ä, ö)
• Number files with three digits (or four if you have a large number of files), e.g. 001, 002, …, 201, 202 (not 1, 2, 21)
• Use underscores (_) instead of spaces
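The conventions above can be combined in a small helper that builds file names. This is a sketch, not a prescribed scheme: the order of the parts (date, label, number) and the function names are assumptions, and the sample label is hypothetical.

```python
import re
import unicodedata
from datetime import date


def safe_name(text: str) -> str:
    """Replace spaces with underscores and drop special and non-ASCII characters."""
    # Decompose accented letters (e.g. Å -> A + ring) and drop the non-ASCII parts;
    # letters without a decomposition (e.g. ø) are dropped entirely
    ascii_text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    ascii_text = re.sub(r"\s+", "_", ascii_text.strip())
    return re.sub(r"[^A-Za-z0-9_-]", "", ascii_text)


def build_file_name(day: date, label: str, number: int, extension: str, digits: int = 3) -> str:
    """Combine an ISO date, a cleaned label and a zero-padded running number."""
    return f"{day.isoformat()}_{safe_name(label)}_{number:0{digits}d}.{extension}"


print(build_file_name(date(2018, 2, 28), "Åbo brev (utkast)", 7, "txt"))
# Zero-padded numbers sort correctly as plain text
print(sorted(f"{n:03d}" for n in (21, 2, 1)))
```

Zero padding matters because file managers sort names as text: without it, 21 would be listed before 3.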
22. Files and folders
• If a file name includes a personal name, give the surname first, followed by the first name
• However, be very careful with personal data when naming files and folders
• Indicate the version number with ‘V’ or ‘version’ followed by a number (with additional digits for subversions after minor changes)
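A version suffix can be written and read back programmatically. The sketch below assumes a "_V<major>" suffix with an optional "_<minor>" part for minor changes; this exact pattern is an assumption, since the slide only asks for ‘V’ plus a number. The base name is hypothetical and uses the surname-first order.

```python
import re

# Assumed suffix pattern: _V2 for a major version, _V2_1 for a minor revision
VERSION_PATTERN = re.compile(r"_V(\d+)(?:_(\d+))?$")


def with_version(base: str, major: int, minor: int = 0) -> str:
    """Append a version tag to a file name without its extension."""
    tag = f"V{major}" if minor == 0 else f"V{major}_{minor}"
    return f"{base}_{tag}"


def parse_version(stem: str):
    """Return (major, minor) from a versioned name, or None if there is no tag."""
    match = VERSION_PATTERN.search(stem)
    if match is None:
        return None
    minor = int(match.group(2)) if match.group(2) else 0
    return int(match.group(1)), minor


name = with_version("Doe_Jane_interview", 2, 1)  # hypothetical base name
print(name)
print(parse_version(name))
```

An underscore separates the minor number because dots are among the characters the previous slide advises against.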
23. Types of data in research
Open Data
• Description:
  • Public data for any use, commercial
  • May be dynamic
  • PSI
  • Mature data products
  • Not produced/monitored by researchers
  • Ex. Weather data, public transport
• Format: Stable, standardised
Generic Research Data
• Description:
  • Produced by/with/for researchers
  • Validated, good quality
  • Well documented
  • Raw or processed
  • Datasets stable/version controlled
  • Ex. Corpus, SMEAR
• Format: May vary over time
Fixed Research Datasets
• Description:
  • Produced for specific research question
  • Underpins research, for replication
  • Might be very processed
  • Reuse value might be low unless mature field
• Format: Must be preserved according to plan
24. License and publish your data
• FSD (Tietoarkisto, the Finnish Social Science Data Archive)
• Language Bank of Finland
• Figshare
• Zenodo
• EUDAT
• IDA and Etsin