Primary data collected during a research study is increasingly shared and may be re-used for new research. The aim of this project was to assess how often summary statistics from primary human genome-wide association studies (GWAS) are shared, as an example of data sharing under favourable circumstances in a particular discipline, and whether such checks can be automated. This presentation summarises the findings of the project and demonstrates a tool to extract information from data availability statements.
1. Data availability and
feasibility of validation –
A genomics case study
Mike Thelwall, Marcus Munafò, Amalia Mas Bleda, Emma
Stuart, Meiko Makita, Verena Weigert, Chris Keene,
Nushrat Khan, Katie Drax, Kayvan Kousha
University of Wolverhampton, University of Bristol & UK
Reproducibility Network & JISC
2. Data sharing experiment goals
• Find out how often data is shared in a field with
apparently ideal conditions
• Write a program to automatically identify shared
data of a specified type
• Write a program to validate the quality of shared
data of a specified type
• As a step towards more general automatic shared
data discovery and quality control
3. The ideal case study topic? GWAS
• Genome Wide Association Study (GWAS) summary
statistics
• Variation likelihood at large sets of locations of the human
genome for measurable traits (e.g. disease susceptibility)
• Data is high value and expensive to collect
• Often stored in a standard format for internal sharing
by consortia
• An international repository exists for hosting it,
emphasising its importance
• NHGRI-EBI Catalog of published genome-wide association
studies
• Meta-analyses benefit from shared files – increased
power and population triangulation
• Genomics has a reputation for data sharing
5. Methods
• Medline search for articles that could be primary
human GWAS
"Molecular Epidemiology"[Majr] AND "Genome-Wide Association Study"[Majr]
• Restriction to 2010 and 2017 to identify trends
• Three human coders classified 1799 articles for
being (a) primary human GWAS and (b) publicly
sharing complete primary human GWAS summary
statistics
• MT and MM follow-up checks of results
https://www.biorxiv.org/content/10.1101/622795v1
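As a rough illustration of the search step, the query above could be sent to the NCBI E-utilities `esearch` endpoint once per year. This is only a sketch: the slides give the query and the two years, not how the search was actually run, so the use of E-utilities, the publication-date filter (`datetype=pdat`), the `retmax` value, and the helper name `build_esearch_url` are all assumptions. The code only builds the request URLs and does not contact the server.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint (assumed route; the slides do not say
# which interface was used to query Medline).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Query string as given on the Methods slide.
QUERY = '"Molecular Epidemiology"[Majr] AND "Genome-Wide Association Study"[Majr]'


def build_esearch_url(query, year, retmax=1000):
    """Build an esearch URL restricted to one publication year.

    Hypothetical helper: filtering by publication date is an assumption.
    """
    params = {
        "db": "pubmed",
        "term": query,
        "datetype": "pdat",    # filter on publication date
        "mindate": str(year),
        "maxdate": str(year),
        "retmax": str(retmax),
        "retmode": "json",
    }
    return ESEARCH + "?" + urlencode(params)


# One request URL for each of the two years compared in the study.
urls = {year: build_esearch_url(QUERY, year) for year in (2010, 2017)}
for year, url in urls.items():
    print(year, url)
```

The returned PubMed IDs would still need manual screening, as described above, because the query only finds articles that could be primary human GWAS.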
6. Results
Data availability information                   2010  2017  Total  Percent
GWAS location not stated in article              156   139    295    89.4%
Broken link or not findable at stated location     3     1      4     1.2%
On request to the authors                          0     8      8     2.4%
On request via dbGaP                               2     5      7     2.1%
On request via EGA                                 1     3      4     1.2%
On request via another portal                      0     3      3     0.9%
Free online without login, proprietary format      1     0      1     0.3%
Free online without login, plain text              0     8      8     2.4%
10.6% reported sharing GWAS summary statistics in some form
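The headline percentage can be rechecked from the counts in the table. The short sketch below treats every category except "GWAS location not stated in article" as reported sharing in some form, which is how the 10.6% figure works out; the variable and function names are illustrative only.

```python
# Counts copied from the results table: (category, 2010, 2017).
ROWS = [
    ("GWAS location not stated in article", 156, 139),
    ("Broken link or not findable at stated location", 3, 1),
    ("On request to the authors", 0, 8),
    ("On request via dbGaP", 2, 5),
    ("On request via EGA", 1, 3),
    ("On request via another portal", 0, 3),
    ("Free online without login, proprietary format", 1, 0),
    ("Free online without login, plain text", 0, 8),
]
NOT_STATED = "GWAS location not stated in article"


def percent_reporting_sharing(counts):
    """Return (total articles, articles reporting sharing, percent)."""
    total = sum(counts.values())
    shared = total - counts[NOT_STATED]
    return total, shared, round(100 * shared / total, 1)


by_2010 = {label: a for label, a, _ in ROWS}
by_2017 = {label: b for label, _, b in ROWS}
overall = {label: a + b for label, a, b in ROWS}

for name, counts in (("2010", by_2010), ("2017", by_2017), ("Total", overall)):
    total, shared, pct = percent_reporting_sharing(counts)
    print(f"{name}: {shared} of {total} articles ({pct}%)")
```

Run as written, this gives 35 of 330 articles (10.6%) overall, matching the slide. The per-year figures it prints, 7 of 163 (4.3%) for 2010 and 28 of 167 (16.8%) for 2017, are derived here from the table rather than reported in the slides.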
7. Article descriptions of the availability
of GWAS summary statistics
• Usually in a Data Availability article section (26 out of
35).
• Data availability more difficult to identify from the
methods (4 articles) and results (3 articles).
• Only five data sharing statements described the shared
data as GWAS summary statistics, and all five used
different phrases
• “full GWAS summary statistics”, “Case Oncoarray GWAS data”,
“Summary GWAS estimates”, “Summary statistics for the
genome-wide association study”, “genome-wide set of
summary association statistics”
• Descriptions are therefore hard to automatically
identify from articles.
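A small sketch shows why this variation matters for automation. It compares an exact phrase match with a looser keyword rule on the five phrases quoted above. Both rules and the extra test sentence are hypothetical illustrations, not the project's method.

```python
import re

# The five phrases quoted on the slide.
PHRASES = [
    "full GWAS summary statistics",
    "Case Oncoarray GWAS data",
    "Summary GWAS estimates",
    "Summary statistics for the genome-wide association study",
    "genome-wide set of summary association statistics",
]

# Rule 1 (hypothetical): look for one fixed phrase.
EXACT = re.compile(r"gwas summary statistics", re.IGNORECASE)


def loose_match(text):
    """Rule 2 (hypothetical): a study keyword plus a data keyword."""
    t = text.lower()
    has_study = "gwas" in t or "genome-wide" in t
    has_data = any(word in t for word in ("summary", "statistics", "data"))
    return has_study and has_data


exact_hits = [p for p in PHRASES if EXACT.search(p)]
loose_hits = [p for p in PHRASES if loose_match(p)]
print(len(exact_hits), "of", len(PHRASES), "found by the exact phrase")
print(len(loose_hits), "of", len(PHRASES), "found by the loose rule")

# A made-up sentence about individual-level data, not summary statistics,
# that the loose rule also accepts.
FALSE_POSITIVE = "GWAS genotype data are available on request"
print("loose rule accepts unrelated sentence:", loose_match(FALSE_POSITIVE))
```

The exact phrase finds only one of the five descriptions. The loose rule finds all five but also accepts the unrelated sentence, so neither rule would be reliable on full articles.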
8. Conclusions
• Data sharing is unlikely to become near-universal
when it is optional.
• Policy initiatives or mandates are needed to
promote data sharing.
• Automatically identifying shared data is difficult or
impossible in practice because of:
• the complexity of articles (multiple data sources and
article structures)
• a lack of standardisation of terminology
• However, data availability statements help
9. Follow-up study: Investigating
data availability statements
• A program was written to extract data sharing
statements from full text articles in XML
• Free software Webometric Analyst
(http://lexiurl.wlv.ac.uk/), menu: Citations > PMC full
text > Data availability statements extract
• Manual content analysis for types of information in
extracted PMC Open Access Subset data availability
statements (n=500)
• Test machine learning for classifying data sharing
methods from data availability statements
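The extraction step can be sketched as follows. This is not the Webometric Analyst implementation, which the slides do not describe. It is a minimal example that assumes the statement sits in a `<sec>` element marked with `sec-type="data-availability"` or titled "Data Availability". Real PMC articles mark statements in several ways, so a production extractor would need more rules. The sample XML and the function name are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified article in a JATS-like layout.
SAMPLE_XML = """<article>
  <body>
    <sec><title>Methods</title><p>We ran a GWAS.</p></sec>
  </body>
  <back>
    <sec sec-type="data-availability">
      <title>Data Availability</title>
      <p>All relevant data are within the paper.</p>
    </sec>
  </back>
</article>"""


def extract_statements(xml_text):
    """Return the text of sections that look like data availability statements."""
    root = ET.fromstring(xml_text)
    statements = []
    for sec in root.iter("sec"):
        title = (sec.findtext("title") or "").strip().lower()
        marked = sec.get("sec-type") == "data-availability"
        if marked or title.startswith("data availability"):
            # Join the paragraph text and normalise whitespace.
            paragraphs = [
                " ".join("".join(p.itertext()).split())
                for p in sec.findall("p")
            ]
            statements.append(" ".join(paragraphs))
    return statements


print(extract_statements(SAMPLE_XML))
```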
10. Result - how is data shared?
Almost all papers with a data sharing
statement claim to share data.
Standardised wordings are common,
e.g., “All relevant data are within
the paper.”
11. Results – what data is shared?
38% of data sharing
statements specify that all
data is shared
12. Results – why is data [not] shared?
91% of data sharing
statements give no
explanation for their
data sharing policy
13. Machine learning
• Simple support vector machines (SVM) test for
detecting sharing methods from data sharing
statements
• 87% accurate for: how is the data shared?
• 89% accurate for: is all the data shared? (binary)
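The slides do not say which features or SVM implementation were used. As a rough sketch of the idea, using only the standard library, the code below trains a linear SVM on binary bag-of-words features for the binary "is all the data shared" question. The model minimises hinge loss with L2 regularisation by subgradient descent. The training statements after the first, all labels, the function names, and the hyperparameters are hypothetical, and this toy does not reproduce the accuracies reported above. In practice a library such as scikit-learn would be the usual choice.

```python
import re
from collections import defaultdict


def features(text):
    """Binary bag-of-words: the set of lowercase word tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def train_linear_svm(texts, labels, lam=0.01, eta=0.1, epochs=200):
    """Fit weights by subgradient descent on L2-regularised hinge loss.

    labels must be +1 (all data shared) or -1 (not all data shared).
    """
    w = defaultdict(float)
    b = 0.0
    data = [(features(t), y) for t, y in zip(texts, labels)]
    for _ in range(epochs):
        for x, y in data:
            margin = y * (sum(w[t] for t in x) + b)
            # Regularisation step: shrink every weight slightly.
            for t in list(w):
                w[t] *= 1.0 - eta * lam
            # Hinge-loss step: only examples inside the margin push the weights.
            if margin < 1.0:
                for t in x:
                    w[t] += eta * y
                b += eta * y
    return dict(w), b


def decision(model, text):
    """Signed score: positive means 'all data shared'."""
    w, b = model
    return sum(w.get(t, 0.0) for t in features(text)) + b


def predict(model, text):
    return 1 if decision(model, text) >= 0 else -1


# Toy training set. Only the first statement is quoted from the slides;
# the rest, and all labels, are invented for illustration.
TRAIN_TEXTS = [
    "All relevant data are within the paper.",
    "All relevant data are within the paper and its supporting files.",
    "All data are available in the supporting information.",
    "Data are available on request from the corresponding author.",
    "Data cannot be shared publicly because of patient confidentiality.",
    "The data are available from the authors upon reasonable request.",
]
TRAIN_LABELS = [1, 1, 1, -1, -1, -1]

model = train_linear_svm(TRAIN_TEXTS, TRAIN_LABELS)
new_statement = "Data are available from the author on request."
print(predict(model, new_statement), round(decision(model, new_statement), 3))
```

With only six statements this shows the mechanics, not the performance. A realistic test would use a few hundred manually coded statements, such as the 500 analysed in the follow-up study, and report accuracy on held-out data.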
14. Summary
• Data sharing seems to need mandates to become
widespread, even in otherwise best case fields
• Shared data is hard to detect precisely because of
article complexity and language variation.
• Basic information about whether data is shared and
where can be extracted automatically from data
availability statements.
Editor's notes
“A single-nucleotide polymorphism, often abbreviated to SNP, is a substitution of a single nucleotide that occurs at a specific position in the genome, where each variation is present to some appreciable degree within a population (e.g. > 1%).” https://en.wikipedia.org/wiki/Single-nucleotide_polymorphism