The exponential increase in data has caused an analysis bottleneck: managing the data and developing complex analysis pipelines now takes more effort than collecting the data itself. I discuss some of the major techniques we used to turn our research pipelines into a production system able to analyze diverse datasets with minimal failures. I highlight the importance of valid metadata, the adaptation of research software, and the surrounding infrastructure, including workflow systems.
Using research software in a production environment
1. Using research software in a production environment
Morgan Taschuk @morgantaschuk
Senior Manager, Genome Sequence Informatics
Ontario Institute for Cancer Research
2. ONTARIO INSTITUTE FOR CANCER RESEARCH
Genome Sequence Informatics
• Primary analysis and QC at OICR
• 8100 cores
• 2 petabytes of disk
• Supports dozens of projects and hundreds of publications
• Half bioinformaticians
• Half software developers/engineers
• Established 2008
7. Big Data
• Scale: one sequenced human whole genome is between 30 and 45 GB
• Genomics England’s 100,000 Genomes Project will take ~20 PB of disk to store
• Need to sequence between 5,000 and 20,000 cases to confidently link rare variants with disease
8. Data is too big!!
[Figure: costs of whole genome sequencing (grey line) and computer power (Moore's law, black line).]
Source: Clinical and Translational Radiation Oncology 2017; 3: 16-20. DOI: 10.1016/j.ctro.2017.03.002
9. Translation to the clinic
• Only 10-25% of research can be translated into clinical practice
• Example: recommended laboratory test turnaround time is 14 days
• Genomics tests take ~35 days between biopsy and results
Aung et al. Clin Cancer Res. 2018 doi: 10.1158/1078-0432.
10. Growing pains
OICR acquires a lot of sequencing instruments
13. GSI in 2017
• 17 staff, but only ~2 monitor this system
• 90,098 analysis workflows executed on human whole genome, exome, targeted panel, and RNA sequencing data
• 1 successful workflow every 6 minutes
• Vast majority of data never needs human intervention
• My goal is/was to reduce turnaround time… stay tuned for the end of the talk
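As a rough check of the throughput figure above, the sketch below spreads 90,098 workflows evenly over one year. The 365-day year and the uniform spacing are assumptions made for illustration; the slide states neither.

```python
# Rough check: how often does a workflow finish if 90,098 run in one year?
# Assumes a 365-day year and evenly spaced completions (illustrative only).
WORKFLOWS_2017 = 90_098
MINUTES_PER_YEAR = 365 * 24 * 60

minutes_per_workflow = MINUTES_PER_YEAR / WORKFLOWS_2017
print(f"{minutes_per_workflow:.2f} minutes per workflow")
```

Under those assumptions the result is a little under six minutes, consistent with the "every 6 minutes" figure.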
14. Our Current Approach
1. Nothing should be on fire
15. Our Current Approach
1. Control our inputs (data and metadata)
2. As little human intervention as possible
3. Fail fast, fail loudly
4. Totally traceable and reproducible
16. Total assimilation
• Borg’ed out on supply chain management
• Assimilate all aspects of metadata and data management to ensure consistent quality
18. Our Approach
[Diagram: Monitoring, Valid metadata, Workflow system, Automation, Genomics, Reports, HPC, Research software]
Valid metadata entering an automated system running on robust software with reproducible results, with everything tracked and monitored.
19. Total assimilation
[Diagram: Valid metadata, Workflow system, Automation, Reports/Data, Genomics, SCIENCE!!]
20. Only good metadata enter
• Control and validate metadata as far upstream as we can
• Laboratory Information Management System (LIMS)
22. MISO as the metadata solution
• Since 2017, MISO LIMS
  • open source
  • completely customizable
• Validate data at entry
• Sanity checks
• Reduce data entry and thus reduce data entry errors
https://github.com/TGAC/miso-lims
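To illustrate "validate data at entry" and "sanity checks", here is a minimal sketch of a record validator. It is not MISO's API: the `LibraryRecord` fields, the allowed library types, and the insert-size limits are all hypothetical and chosen only for the example.

```python
from dataclasses import dataclass

# Hypothetical controlled vocabulary and limits, for illustration only.
ALLOWED_LIBRARY_TYPES = {"whole_genome", "exome", "targeted_panel", "rna"}
MIN_INSERT_BP, MAX_INSERT_BP = 50, 2000


@dataclass(frozen=True)
class LibraryRecord:
    sample_id: str
    library_type: str
    insert_size_bp: int


def validate(record: LibraryRecord) -> list[str]:
    """Return a list of problems; an empty list means the record is accepted."""
    problems = []
    if not record.sample_id.strip():
        problems.append("sample_id is empty")
    # Controlled vocabulary: reject free-text values downstream code cannot interpret.
    if record.library_type not in ALLOWED_LIBRARY_TYPES:
        problems.append(f"unknown library_type: {record.library_type!r}")
    # Sanity check: catch obvious typos such as a missing or extra digit.
    if not MIN_INSERT_BP <= record.insert_size_bp <= MAX_INSERT_BP:
        problems.append(f"implausible insert_size_bp: {record.insert_size_bp}")
    return problems


print(validate(LibraryRecord("S1", "exome", 300)))
print(validate(LibraryRecord(" ", "chipseq", 5)))
```

Rejecting a record at entry, with every problem listed at once, keeps bad metadata from reaching the automated steps further downstream.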
23. Automation
• Deciders:
  • take in metadata and data
  • decide what analysis to perform using rules (if-then; map-reduce; etc.)
  • check whether data has previously been analyzed
  • check whether the system is at capacity
• Difficult to write
  • especially when metadata is poor
  • software needs to understand all metadata
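The decider responsibilities listed above can be sketched as a single function. This is an illustrative toy, not GSI's implementation: the rule table, the workflow names, and the dictionary keys are hypothetical.

```python
# Hypothetical if-then rules mapping a library type to a workflow name.
RULES = {
    "whole_genome": "whole_genome_alignment",
    "rna": "rna_alignment",
}


def decide(samples, already_analyzed, running_jobs, capacity):
    """Return (workflow, sample_id) pairs that should be launched now."""
    free_slots = max(capacity - running_jobs, 0)
    launches = []
    for sample in samples:
        if len(launches) >= free_slots:
            break  # system is at capacity; leave the rest for the next pass
        workflow = RULES.get(sample["library_type"])
        if workflow is None:
            continue  # no rule understands this metadata, so do nothing
        key = (workflow, sample["sample_id"])
        if key in already_analyzed:
            continue  # never repeat an analysis that already exists
        launches.append(key)
    return launches


samples = [
    {"sample_id": "S1", "library_type": "whole_genome"},
    {"sample_id": "S2", "library_type": "rna"},
    {"sample_id": "S3", "library_type": "unknown"},
    {"sample_id": "S4", "library_type": "rna"},
]
done = {("rna_alignment", "S2")}
print(decide(samples, done, running_jobs=0, capacity=10))
```

The lookup in `RULES` is where poor metadata hurts: any value the rules do not recognize silently yields no analysis, which is one reason validating upstream matters.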
24. Monitoring
• Track everything before you need it
• Silence on success
  • but make sure you detect when systems go offline!
• Dashboards and tickets instead of emails
• Fail fast, fail loudly
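A minimal sketch of "silence on success" combined with offline detection: successful jobs produce no output, failures produce an alert, and any system that has not sent a heartbeat recently is flagged. The function names, the job and system names, and the 600-second threshold are assumptions for illustration.

```python
def silent_systems(last_heartbeat, now, max_silence_s):
    """Names of systems whose most recent heartbeat is older than the limit."""
    return sorted(
        name for name, seen in last_heartbeat.items() if now - seen > max_silence_s
    )


def build_alerts(job_results, last_heartbeat, now, max_silence_s=600):
    """Alerts for failures and silent systems only; successes add nothing."""
    alerts = [
        f"FAILED {job}: {error}" for job, error in job_results if error is not None
    ]
    alerts += [
        f"OFFLINE {name}" for name in silent_systems(last_heartbeat, now, max_silence_s)
    ]
    return alerts


# Timestamps are plain seconds so the example is deterministic.
alerts = build_alerts(
    job_results=[("align_S1", None), ("align_S2", "disk full")],
    last_heartbeat={"sequencer_watcher": 1000.0, "workflow_engine": 100.0},
    now=1200.0,
)
print(alerts)
```

The heartbeat check is what makes silence safe: without it, a system that has stopped entirely looks identical to one that is succeeding quietly.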
27. Tickets and alerts instead of emails
Automatic, of course
28. Workflows
• Workflow systems:
  • take in input data and parameters
  • run the data through analysis steps
  • produce data
• Analysis steps:
  • Good research software
  • Absolutely critical and integral to all other systems discussed so far
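The three things a workflow system does can be sketched in a few lines. This toy runner is an assumption-laden illustration, not any particular workflow engine: the step functions and parameter names are hypothetical. It stops at the first failing step and records which steps ran, in the spirit of "fail fast, fail loudly" and traceability.

```python
def run_workflow(steps, data, params):
    """Run data through each step in order; return the output and a step log."""
    provenance = []
    for step in steps:
        try:
            data = step(data, params)
        except Exception as exc:
            # Fail fast and loudly: stop here and name the step that broke.
            raise RuntimeError(f"step {step.__name__!r} failed: {exc!r}") from exc
        provenance.append(step.__name__)
    return data, provenance


# Two toy analysis steps standing in for real research software.
def drop_short_reads(reads, params):
    return [read for read in reads if len(read) >= params["min_length"]]


def count_reads(reads, params):
    return {"read_count": len(reads)}


result, log = run_workflow(
    [drop_short_reads, count_reads],
    data=["ACGT", "AC", "ACGTA"],
    params={"min_length": 4},
)
print(result, log)
```

Calling `run_workflow` with `params={}` raises a `RuntimeError` naming `drop_short_reads` instead of continuing with partial output.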
29. Having good software is not enough
[Diagram: Monitoring, Metadata validation, Automation, Workflow systems, Software]
30. Turnaround time
• Sequencing-to-alignment time has dropped from about 20 days to 7 days for HiSeq whole genome lanes
• Anecdotal: variability reduced, hands-on time reduced
31. Current/future work
• Automation
  • make it simpler
  • more complete
  • (never going to be done)
• Research is a changing field by nature
• Flexibility versus robustness
• Hot new things: sc-seq, ct-seq, immuno-onco-genomics
• Underlying assumptions change over time
32. We’re investing in good infrastructure
Turonno! Entry-level! Look for GSI! Software dev! Report to me!!
Apply! http://bit.ly/oicr-gsi-dev
33. Conclusions
• The FUTURE is
  • hundreds of thousands of samples
  • expediting clinical results
  • no loss of reproducibility or quality
• Everyone needs a little production-style infrastructure, even if you’re not production
  • control your metadata!
  • automate!
  • standardize your analysis!
  • monitor all the things!
34. Acknowledgements
• Lars Jorgensen
• Lawrence Heisler
• Michael Laszloffy
• Heather Armstrong
• Dillan Cooke
• Andre Masella
• Iain Bancarz
• Timothy Beck
• Peter Ruzanov
• Prisni Rath
• Jonathan Torchia
• Richard Jovelin
• Yogi Sundaravadanam
• Xuemei Luo
• Many excellent co-op students
To past and current GSI members.
35. OICR Technology Programs enable cancer research in Ontario by providing value-added expertise, training and access to high-end infrastructure and technologies.
Find out more at oicr.on.ca
This project was supported by the OICR Adaptive Oncology Program
36. Funding for the Ontario Institute for Cancer Research is provided by the Government of Ontario
37. Attributions
Jensflorian CC BY-SA 3.0
Timothy Dilich - Noun Project, CC0
http://andrewjrobinson.github.io/training_docs/tutorials/variant_calling_galaxy_1/variant_calling_galaxy_1/
By David pogrebeshsky [CC BY-SA 4.0 (https://creativecommons.org/licenses/by-sa/4.0)], from Wikimedia Commons
Star Trek ® Paramount Pictures
38. GSI on the web
https://github.com/oicr-gsi
https://gsi.oicr.on.ca