Seminar Presentation for PMB Department, UC Berkeley for Love Data Week. Subject is how to prepare publications and associated data sets for maximum reuse.
2. Good Data Stewardship
• Publish data with the paper
• Describe data to your fullest ability
• Use the right words to identify Data
• Deposit data in the right Data Repository
• Budget time for Data Management
• Don’t think of it as YOUR data
3. What’s in it for YOU?
We all benefit from data sharing.
• More citations of YOUR work, increasing your visibility in the research community.
• Easily comply with journal and funding requirements.
• Less time spent fulfilling requests for data.
7. Data re-use leads to new insights
[Figure: reuse workflow. Stages: Data Processing, Quality Control, Validation, Statistical Analysis, Additional Experiments. Dataset counts shown: 503 and 314.]
Yu Zhang et al. PNAS doi:10.1073/pnas.1716300115
NOVEL DISCOVERY
MET1 and CMT3 are independently required for the maintenance of
asymmetric CHH methylation at CMT2 target sites
8. Credit: Melissa Haendel
Wilkinson et al. (2016) The FAIR Guiding Principles for scientific data management and stewardship. doi:10.1038/sdata.2016.18. https://www.nature.com/articles/sdata201618
• Findable means data is human and machine readable
and attached to persistent identifiers
• Accessible means data can be found and retrieved by
humans and machines using standard formats
• Interoperable means data can be exchanged and used
between systems.
• Reusable means data can be used by others
9. How to Make Your Published Data FAIR
• Use standard formats
• Supply complete metadata
• Embrace Ontologies
• Use persistent and unambiguous identifiers
• Put your data in a long term stable repository
• Cite, share freely and encourage others
10. Use Standard formats: SNP example
SNP (Single Nucleotide Polymorphism): A base, a chromosome number and genome position, and a reference to the genome assembly used, and the genotypes of lines tested.

CHROM  POS    REF  ALT  Line 1  Line 2
1      12345  A    C    A       A
3      67891  C    T    H       C
10     23456  G    T    T       U

CHROM  POS    REF  ALT  Line 1  Line 2
Gm01   12345  A    C    0/0     0/0
Gm03   67891  C    T    0/1     0/0
Gm10   23456  G    T    1/1     ./.

CHROM  POS    REF  ALT  Line 1  Line 2
Chr01  12345  A    C    AA      AA
Chr03  67891  C    T    C/T     CC
Chr10  23456  G    T    TT      NN

ALL MEAN THE SAME! BUT ARE NOT THE SAME
VCF (Variant Call Format) is the STANDARD
Use the STANDARD file format for your data type
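The three tables above carry the same calls in different encodings. As a rough sketch of what normalizing one of them involves, the snippet below converts nucleotide-pair calls (the third table's style) into VCF-style GT strings. The function name `to_vcf_gt` is hypothetical, and the code assumes diploid, biallelic sites; a real conversion would normally rely on established VCF tooling.

```python
def to_vcf_gt(call: str, ref: str, alt: str) -> str:
    """Convert a nucleotide-pair call (e.g. 'AA', 'C/T', 'NN')
    into a VCF GT string (e.g. '0/0', '0/1', './.').

    Assumes a diploid, biallelic site; anything else is reported as missing.
    """
    alleles = call.replace("/", "").upper()
    if len(alleles) != 2:
        return "./."
    index = {ref.upper(): "0", alt.upper(): "1"}
    codes = [index.get(base) for base in alleles]
    if None in codes:  # 'N' or an allele that is neither REF nor ALT
        return "./."
    return "/".join(sorted(codes))


# Rows copied from the third table on the slide.
rows = [
    ("Chr01", 12345, "A", "C", ["AA", "AA"]),
    ("Chr03", 67891, "C", "T", ["C/T", "CC"]),
    ("Chr10", 23456, "G", "T", ["TT", "NN"]),
]
for chrom, pos, ref, alt, calls in rows:
    gts = [to_vcf_gt(c, ref, alt) for c in calls]
    print(chrom, pos, ref, alt, *gts, sep="\t")
```

The converted genotypes match the GT columns of the second table, which is the point of the slide: one standard encoding removes the guesswork.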
12. Use Standard formats: Beware of Excel
If you use EXCEL, look out for data corruption and hidden Microsoft characters that impede parsing
Ziemann et al., 2016. doi:10.1186/s13059-016-1044-7
Fig. 1: Prevalence of gene name errors in Supplementary Excel files. Panels: percentage of papers with gene lists affected; increase in supplementary files with gene name errors per year.
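One way to catch this kind of corruption before publishing is to scan gene-name columns for values that look like dates or floating-point numbers. The sketch below is only illustrative: the patterns cover conversions of the kind reported in the paper cited above (for example a gene symbol turned into "2-Sep"), the function name is hypothetical, and the sample column is made up.

```python
import re

# Patterns typical of spreadsheet auto-conversion: dates such as "2-Sep"
# or "1-Mar", and scientific notation such as "2.31E+13".
DATE_LIKE = re.compile(
    r"^\d{1,2}-(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)$",
    re.IGNORECASE,
)
FLOAT_LIKE = re.compile(r"^\d+(\.\d+)?E\+\d+$", re.IGNORECASE)


def suspicious_gene_names(names):
    """Return the entries that look like dates or scientific notation."""
    flagged = []
    for name in names:
        value = name.strip()
        if DATE_LIKE.match(value) or FLOAT_LIKE.match(value):
            flagged.append(name)
    return flagged


# Made-up column mixing real-looking symbols with mangled values.
gene_column = ["MET1", "2-Sep", "CMT3", "1-Mar", "2.31E+13", "CMT2"]
print(suspicious_gene_names(gene_column))
```

A check like this only flags damage after the fact; keeping gene lists in plain text files avoids the conversion in the first place.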
14. Supply Complete Metadata
Metadata:
Species = xxx
Germplasm = xxx
Field location = xxx
Environment = xxx
Measurement method = xxx
Phenotype (Data): Plant is 170cm tall
Metadata is data about the data, and allows understanding of the data
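A phenotype value and its metadata can travel together in one record. The sketch below shows one possible layout and a simple completeness check. The field names, the `REQUIRED_FIELDS` list, and the function name are assumptions chosen to mirror the slide, not a published standard; the "xxx" placeholders are kept from the slide.

```python
import json

# Hypothetical required fields, mirroring the metadata items on the slide.
REQUIRED_FIELDS = [
    "species",
    "germplasm",
    "field_location",
    "environment",
    "measurement_method",
]


def missing_metadata(record: dict) -> list:
    """List required metadata fields that are absent or empty."""
    metadata = record.get("metadata", {})
    return [f for f in REQUIRED_FIELDS if not str(metadata.get(f, "")).strip()]


record = {
    "phenotype": {"trait": "plant height", "value": 170, "unit": "cm"},
    "metadata": {
        "species": "xxx",
        "germplasm": "xxx",
        "field_location": "xxx",
        "environment": "",  # left empty on purpose to show the check
        "measurement_method": "xxx",
    },
}

print(json.dumps(record, indent=2))
print("Missing:", missing_metadata(record))
```

Without the metadata block, "170 cm" cannot be compared with anyone else's measurement.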
15. Supply Complete Metadata
• Write your Materials and Methods as if you wanted someone else to be able to reproduce your work.
• Be accurate and complete about your bench and field work; include samples/stocks/lines used, accession numbers, sources of materials, exact measuring techniques, etc.
• Be AS accurate and complete about your computational pipelines. Include the raw data files you created and their versions. If you use reference data (e.g., a sequence assembly), include the version number, download date, and download source.
• Include names of software applications, versions, platforms and sources. If you use CyVerse, use its metadata reporting tools.
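Software versions and reference-data details are easiest to report when the pipeline records them itself. The following is a minimal sketch using only the Python standard library. The package names, the reference-data entry, and the function names are placeholders for illustration; substitute whatever your pipeline actually uses.

```python
import json
import platform
import sys
from datetime import date
from importlib import metadata


def package_version(name: str) -> str:
    """Return the installed version of a package, or 'not installed'."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return "not installed"


def provenance(packages, reference):
    """Collect platform, interpreter, package and reference-data details."""
    return {
        "recorded_on": date.today().isoformat(),
        "platform": platform.platform(),
        "python": sys.version.split()[0],
        "packages": {p: package_version(p) for p in packages},
        "reference_data": reference,
    }


# Placeholder values: replace with the real assembly name, version,
# download source and download date.
report = provenance(
    packages=["numpy", "pandas"],
    reference={
        "name": "example_assembly",
        "version": "v0",
        "source": "https://example.org",
        "downloaded": "YYYY-MM-DD",
    },
)
print(json.dumps(report, indent=2))
```

Saving this report next to the results gives you most of the computational Materials and Methods for free.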
19. Supply Complete Metadata
597 Possible Attributes
At least 50 Attributes
Genome Sequence Assembly At least 100 Attributes
20. Budget TIME
to provide Metadata
The metadata in public databases is often confusing; a test case
with Zea mays mRNA seq data reveals a high proportion of
missing, misleading or incomplete metadata. 2018.
https://doi.org/10.1016/j.plantsci.2017.10.014
22. Metadata Standards for Various Data Types
• Established: Genomic Standards Consortium (http://gensc.org)
  • Minimum Information about any (x) Sequence (MIxS)
• Emerging
  • Minimum Information About a Plant Phenotyping Experiment (MIAPPE)
Supply Complete Metadata
Ask For Help from Database People
25. Embrace Ontologies
An Ontology is a set of precisely defined terms in a logical hierarchy, in which the relationships between terms can be understood by computers.
26. Ontologies: Hierarchy of terms and explicit relationship among terms
Plant Ontology (PO) terms shown in the figure:
Ligule PO:0020105
Vascular leaf PO:0009025
Leaf sheath PO:0020104
Flag leaf PO:0020103
Adult vascular leaf PO:0020103
Leaf PO:0025034
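To see why explicit relationships matter to a computer, here is a small sketch that stores a few terms as a graph and walks upward from one of them. The term labels and IDs are the ones listed on the slide; the relationships between them are illustrative assumptions for this example and should not be read as a copy of the Plant Ontology.

```python
# Term IDs and labels as listed on the slide.
TERMS = {
    "PO:0025034": "leaf",
    "PO:0009025": "vascular leaf",
    "PO:0020104": "leaf sheath",
    "PO:0020105": "ligule",
}

# child -> list of (relationship, parent).
# These edges are assumptions made for illustration only.
RELATIONS = {
    "PO:0009025": [("is_a", "PO:0025034")],
    "PO:0020104": [("part_of", "PO:0009025")],
    "PO:0020105": [("part_of", "PO:0009025")],
}


def ancestors(term_id):
    """Return every term reachable by following relationships upward."""
    found = []
    stack = [term_id]
    while stack:
        current = stack.pop()
        for _relation, parent in RELATIONS.get(current, []):
            if parent not in found:
                found.append(parent)
                stack.append(parent)
    return found


for term in ancestors("PO:0020105"):
    print(term, TERMS[term])
```

With the hierarchy in place, a search for data about "leaf" can also return records annotated with a more specific term such as ligule.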
27. Data from diverse types of experiments and organisms can be compared
Franssen, H. J. et al. (2015) doi:10.1242/dev.120774 (Medicago)
Li, S. et al. (2016) doi:10.1016/j.devcel.2016.10.012 (Arabidopsis)
Zhou, X-F. et al. (2014) doi:10.1104/pp.114.243808
28. Embracing ontologies
• Ontologies provide a POWERFUL, MACHINE READABLE utility for data
• Find and use existing ontologies (http://www.obofoundry.org/, Planteome)
  • Gene Function = Gene Ontology (GO)
  • Sequences = Sequence Ontology (SO)
  • Plant Anatomy and Development = Plant Ontology (PO)
  • Phenotypes = Phenotype and Trait Ontology (PATO)
  • … many, many others
• Apply them consistently
  • To datasets (e.g. in metadata)
  • In publications (e.g. TAIR GO/PO submission)
• Ask Questions!
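Applying terms consistently usually means replacing free-text labels with ontology IDs at the point where metadata is written. A minimal sketch, assuming a hand-made lookup table: the IDs are those shown earlier in these slides, while the synonym entries and the function name `annotate` are invented for the example. In practice the lookup would come from the ontology itself.

```python
# Free-text labels mapped to term IDs. The IDs appear earlier in the
# slides; the synonym "sheath" is a made-up example entry.
LABEL_TO_TERM = {
    "ligule": "PO:0020105",
    "leaf sheath": "PO:0020104",
    "sheath": "PO:0020104",
    "flag leaf": "PO:0020103",
}


def annotate(label):
    """Return the ontology ID for a free-text label, or None if unknown."""
    key = " ".join(label.lower().split())  # normalize case and spacing
    return LABEL_TO_TERM.get(key)


samples = ["Ligule", "leaf  sheath", "Flag Leaf", "blade"]
for sample in samples:
    print(f"{sample!r} -> {annotate(sample)}")
```

Labels that return None are the ones worth a question to a curator.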
36. Problem: Data is not findable because it is not available
Piwowar HA, Vision TJ. (2013) Data reuse and the open data citation advantage. PeerJ 1:e175. https://doi.org/10.7717/peerj.175
Gibney and Van Noorden. doi:10.1038/nature.2013.14416
37. Put your data in a stable public
repository
Large International Repositories for many data
types for all species. ALL sequence data goes here
Large but specialized databases serving many species
SoyBase
Specialized databases serving specific communities
38. Submitting to a repository: SNP example
As of 9/2017, all non-human SNPs are processed by EMBL-EBI through the European Variation Archive (EVA, https://www.ebi.ac.uk/eva/).
NCBI’s dbSNP will only process human SNPs.
EVA will require:
• Data in (standard) Variant Call Format (VCF), including allele frequencies
• SUBMITTED Genome or Transcriptome assembly
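Allele frequencies can be derived from the genotype columns when they are not already in the file. The sketch below computes the ALT allele frequency from VCF GT strings and formats it as an `AF` entry, which is the reserved VCF INFO key for allele frequency. It assumes biallelic sites and hypothetical function names, and it says nothing about what the archive itself validates; check the repository's current submission guidelines.

```python
def alt_allele_frequency(genotypes):
    """Compute the ALT allele frequency from VCF GT strings.

    Missing alleles ('.') are skipped. Returns None if nothing was called.
    Assumes a biallelic site with alleles coded 0 (REF) and 1 (ALT).
    """
    called = []
    for gt in genotypes:
        for allele in gt.replace("|", "/").split("/"):
            if allele != ".":
                called.append(allele)
    if not called:
        return None
    return called.count("1") / len(called)


def info_field(genotypes):
    """Format the frequency as a VCF INFO entry, or '.' when unknown."""
    af = alt_allele_frequency(genotypes)
    return "." if af is None else f"AF={af:.4g}"


# Genotypes from the VCF-style table shown earlier in the slides.
print(info_field(["0/0", "0/0"]))
print(info_field(["0/1", "0/0"]))
print(info_field(["1/1", "./."]))
```

Missing calls are excluded from the denominator, so a site with one homozygous ALT line and one missing line reports a frequency of 1.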
39. What if there is no specialized database? Or no recommendations from journals?
You should get a Digital Object Identifier (DOI)
• http://datadryad.org (curated, with metadata)
• https://zenodo.org/
• https://figshare.com/
And just for you folks at UC: https://datashare.ucsf.edu/stash
40. But... please, don’t forget to actually complete your submission*
*And you never have to spend time fielding requests or transferring huge data files again
43. Cite, share freely and encourage others to be FAIR
Include searchable and citable identifiers for your data in your papers
Release your data with clearly defined terms of use, e.g. Creative Commons (CC) CC-0, CC-BY
If you do not specify terms, restrictions are implied, limiting reuse
Cite all of your data sources
Enhances reproducibility… and also shows value to funders!
When reviewing papers, check them for FAIRness
45. A few simple things to remember when preparing your paper
• Include unambiguous identifiers
• Format data according to defined standards
• Keep data in (parseable) tables or text
• Include meaningful metadata
• Deposit data in a long term stable public repository and get a DOI
• It is never too early to think about (meta)data; the best time to start is BEFORE you are writing
46. You can get help structuring,
organizing and managing your data
● Contact your Community Database
● Don’t have one? Contact a curator
(Leonore, Lisa… we live amongst you)
● UCB Research Data Management Librarians
(http://researchdata.berkeley.edu/)
48. What YOU can do right now to support FAIR data
Ask your funders for increased access to FAIR data
When you review papers, look at the data and be sure it is well described (Metadata is great)
Change your attitude a little: Your data will be more cited and more important if you make it FAIR
Deposit your Data and get a DOI
Ask your institution to value good data submission and good data recycling