Supporting researchers in the molecular life sciences Jeff Christiansen
1. Supporting Researchers in the Molecular Life Sciences
Jeff Christiansen
UQ RCC Health and Life Sciences Program Manager
QCIF Health and Life Sciences Program Manager
EMBL-ABR Key Areas Coordinator
3. The central dogma of biology
DNA > mRNA > protein > metabolites
Diagram labels: folding; large molecules; (small molecules); enzymatic catalysis
Cell type 1 vs cell type 2: same genes but different mRNAs, proteins and metabolites (and at different levels)
Traditionally, researchers would focus on a small number of genes, proteins, etc. due to technical constraints
4. Global biomolecular profiling: the data explosion
• DNA (genomics): 20,005 ‘protein coding’ genes
• RNA (transcriptomics): ~200,000(?) transcripts; abundance?
• protein (proteomics): 16,518 identified; abundance?
• metabolites (metabolomics): >24,597 compounds; abundance?
Sources: https://hupo.org/HPP-Q&A and https://www.ebi.ac.uk/metabolights/reference
5. The data explosion: challenges
• Data storage
  • non-complex organisms (bacteria): 12 GB raw data per sample (genomic, transcriptomic, proteomic, metabolomic)
  • globally, an estimated 100 PB is used by the 20 largest institutions for genomic storage alone (Stephens et al. 2015)
• Tools
  • to convert data from raw > processed
  • for comparative analyses on processed data (e.g. genome v. genome, transcriptome v. proteome)
  • documenting methods (i.e. tool use – versions used, workflows applied)
• Compute
  • resource intensive (e.g. a single human:mouse genome alignment consumes ~100 CPU hours)
• Data management
  • context surrounding the specimen (e.g. healthy vs diseased) and experiment
  • context surrounding the data itself (provenance, state {raw, processed}, formats, etc.)
  • managing sharing within the research team
  • publishing data to international repositories at project end
• Skills development
  • enabling biologists to utilise bioinformatics approaches (expert [cmd line] > novice [GUI])
  • enabling biologists to use storage, tools, compute and data management effectively
Stephens et al (2015) Astronomical or Genomical? PLOS Biology https://doi.org/10.1371/journal.pbio.1002195
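The data management and storage points above can be made concrete with a small sketch. This is only an illustration under stated assumptions: the `DatasetRecord` layout, its field names, the tool name `example-aligner` and the helper `estimate_storage_gb` are all hypothetical and do not come from any standard or from the slides. The only figure taken from the slide is the ~12 GB of raw data per bacterial sample.

```python
import json
from dataclasses import dataclass, field, asdict


# Hypothetical record layout: field names are illustrative, not a standard.
@dataclass
class DatasetRecord:
    sample_id: str
    specimen_context: str   # e.g. "healthy" or "diseased"
    data_type: str          # e.g. "genomic", "transcriptomic"
    state: str              # "raw" or "processed"
    file_format: str        # e.g. "FASTQ", "BAM"
    tools: dict = field(default_factory=dict)  # tool name -> version used

    def to_json(self) -> str:
        """Serialise the record so it can travel with the data files."""
        return json.dumps(asdict(self), sort_keys=True)


def estimate_storage_gb(n_samples: int, gb_per_sample: float = 12.0) -> float:
    """Rough raw-data estimate using the ~12 GB/sample figure quoted above."""
    return n_samples * gb_per_sample


# Example record; the tool name and version are placeholders.
record = DatasetRecord(
    sample_id="sample-001",
    specimen_context="healthy",
    data_type="transcriptomic",
    state="processed",
    file_format="BAM",
    tools={"example-aligner": "1.2.3"},
)
```

Keeping specimen context, data state, format and tool versions in one record alongside the files is one simple way to cover the "documenting methods" and "context surrounding the data" bullets; a real project would more likely adopt the metadata schema of its target repository.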
6. Unmet Needs for Analysing Biological Big Data: A Survey of 704 NSF Principal Investigators
Barone L, Williams J, Micklos D; BioRxiv (2017)
[Bar chart: percent responding negatively (318 ≤ n ≤ 510) for each of the following needs]
• Training on integration of multiple data types
• Training on data management and metadata
• Training on scaling analysis to cloud/HPC
• Multi-step analysis workflows or pipelines
• Cloud computing
• Search for data & discover relevant datasets
• Support for bioinformatics and analysis
• Publish data to the community
• Updated analysis software
• Share data with colleagues
• Training on basic computing and scripting
• Sufficient data storage
• High-performance computing
90% indicated they are currently or will soon be analysing large digital datasets
7. Australian needs
Survey charts: "The Most Useful" and "Biggest bioinformatics difficulty"
BRAEMBL community survey report, 2013 (N=210): https://www.embl-abr.org.au/news/braembl-community-survey-report-2013/
12. Organise training material and events around research-relevant tasks, not the tools themselves
Training in how to perform tasks is required
Genome Annotation using Apollo
15. Involve a wide variety of users in usability testing
Building more intuitive tools is imperative
14 users (novice to expert bioinformaticians, student to CI)
5 tests (representing broad task types)
47 usability issues found – 38 addressed
17. Build/provide functionality that supports users with differing informatics skill levels
Building more intuitive tools is imperative
20. Australia is geographically challenging:
leverage technology, international and local expertise to help
deliver training to a wider audience
Genome Annotation using Apollo
Dr Monica Muñoz-Torres
Project Lead, Apollo Project, Berkeley
9 EMBL-ABR Nodes, 92 registrants
QLD: QCIF, JCU (TSV+CNS)
NSW: UNSW, SCU
VIC: Monash, UniMelb
SA: UniAdel
TAS: UTas