Unprecedented data volumes and pressure on turnaround time, driven by commercial applications, require bioinformatics solutions to evolve to meet these new demands. New compute paradigms and cloud-based IT solutions enable this transition. Here I present two solutions capable of meeting these demands: VariantSpark for genomic variant analysis and GT-Scan2 for genome engineering applications.
VariantSpark classifies 3000 individuals with 80 million genomic variants each in under 30 minutes. This Hadoop/Spark solution for machine learning applications on genomic data is hence capable of scaling up to population-size cohorts.
GT-Scan2 identifies CRISPR target sites by minimizing off-target effects and maximizing on-target efficiency. This optimization is powered by AWS Lambda functions, which offer an “always-on” web service that can instantaneously recruit enough compute resources to keep runtime stable even for queries with several thousand potential target sites.
How novel compute technology transforms life science research
1. How novel compute technology
transforms life science research
From Hadoop Spark to cloud-based micro-services
HEALTH & BIOSECURITY
Dr Denis Bauer | Bioinformatics | @allPowerde
6 Dec 2016 – Cloudera Public Sector Government Forum, Canberra
stuckincustoms
2. Overview
Transformational Bioinformatics | Denis C. Bauer | @allPowerde
GT-Scan2
How can genome
engineering be
made safer?
VariantSpark
How to find
disease genes in
population-size
cohorts?
CSIRO
How to facilitate
better
collaborations?
3. Team CSIRO
5319
talented staff
$1billion+
budget
Working
with over
2800+
industry
partners
55
sites across
Australia
Top 1%
of global
research
agencies
Each year
6 CSIRO
technologies
contribute
$5 billion to
the economy
4. Big ideas start here
EXTENDED
WEAR
CONTACTS
POLYMER
BANKNOTES
RELENZA
FLU TREATMENT
Fast WLAN
Wireless Local
Area Network
AEROGARD
TOTAL
WELLBEING
DIET
RAFT
POLYMERISATION
BARLEYmax™
SELF
TWISTING
YARN
SOFTLY
WASHING
LIQUID
HENDRA
VACCINE
NOVACQ™
PRAWN FEED
Convenient cardiac rehabilitation
Enhancing relationship between patient and mentor
Digital data collection
Equitable access
World's first, clinically validated smartphone based Cardiac
Rehab: uptake + 30% and completion +70%
5. Preparation for and recovery from
a Total Knee Replacement
o Remote monitoring by
Clinician
o Physiotherapy
o Wearable Technology
o Gamification
6. Genomic sequencing is revolutionizing
Health Care today. It offers up to 50%
more diagnoses than standard of care
and is on average 96% cheaper
Bauer et al. Trends Mol Med. 2014 PMID: 24801560
7. Advances in sequencing technology have
generated the capacity to sequence the
Earth’s Genome in just 10 days
The human genome is 3 billion letters long
need 3 billion samples to robustly analyze
8. 100,000 Genomes project
70,000 individuals
by 2017
The cancer genome atlas
11,000 samples 2015
Genomics projects hence are getting bigger
The HapMap Project
270 samples 2002
Human genome
~1 sample
1000 Genome Project
1097 samples 2012
ASPREE
4000 healthy 70+ year olds
Project MinE
15,000 people with ALS
Single samples are around 200GB in size
9. New demands on sequence analysis
• The sheer volume of new data
necessitates new approaches.
Computational genomics must
progress from file formats to APIs,
from local hardware to the elasticity
of the cloud, from a cottage industry
of poorly maintained academic
software to professional-grade,
scalable code, and from one-time
evaluation by publication to
continuous evaluation by online
benchmarks.
Paten et al. The NIH BD2K center for big data in
translational genomics JAMIA 2015
10. Elasticity in the Cloud
Elastic cloud compute… is like an In-room sound system
Benefits:
• Instant availability of adequately powered system
• Images can be shared and everything on it is automatically version controlled
11. Efficient scalability
Kelly et al. Churchill: an ultra-fast, deterministic, highly scalable and balanced parallelization strategy for the discovery of
human genetic variation in clinical and population-scale genomics Genome Biology 2015
Bespoke parallelization
e.g. Churchill
Chromosomal split
e.g. NGSANE
MapReduce
e.g. GATK queue
Beunder 2010 Embedded
12. Population-scale genomic data analysis requires BigData
solutions
                           Desktop compute   High-performance compute cluster   Hadoop/Spark compute cluster
Focus                      small data        Compute-intensive                  Data-intensive
Fault tolerant             No                No                                 Yes
Node-bound                 Yes               Yes                                No
Parallelization            10 CPU            100 CPU                            1000 CPU
Parallelization procedure  bespoke           bespoke                            standardized
CSIRO solution
13.
Spark Summit 2016 (June) by Frank Austin Nothaft (UC Berkeley)
(70TB – 300 individuals)
One human genome analyzed (variant called) every 3.2 hours
14. Still not fast enough…
Clinical genomics facilities expect to deal with >18,000 genomes a
year, so a 3.2h TAT would accumulate 6.5 years of compute.
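The turnaround arithmetic above can be checked with a short calculation. This is only a sketch: the function names are made up for illustration, and a year is taken as 365 days. With the slide's inputs it gives about 6.6 years, in line with the roughly 6.5 years quoted.

```python
import math

HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years (assumption)


def serial_compute_years(genomes_per_year: int, hours_per_genome: float) -> float:
    """Compute time in years if genomes were analysed one after another."""
    return genomes_per_year * hours_per_genome / HOURS_PER_YEAR


def parallel_streams_needed(genomes_per_year: int, hours_per_genome: float) -> int:
    """Minimum number of concurrent analyses needed to clear a year's
    intake within that same year."""
    return math.ceil(genomes_per_year * hours_per_genome / HOURS_PER_YEAR)


if __name__ == "__main__":
    # Inputs taken from the slide: >18,000 genomes a year, 3.2 h per genome.
    years = serial_compute_years(18_000, 3.2)
    streams = parallel_streams_needed(18_000, 3.2)
    print(f"serial compute: {years:.1f} years; concurrent streams needed: {streams}")
```

Read the other way round, the same numbers say that a facility would need several analyses running side by side all year just to keep up, which is the motivation for the options listed next.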
CSIRO, along with other prominent research institutes (MIT,
Berkeley), partnered with Cloudera and AWS to investigate
• HPC-based solutions
• GATKspark (The Spark reimplementation of the accepted gold
standard)
• ADAM
15. Setup
• Instances
– 5 worker
– 3 Hadoop scheduler
– one Cloudera manager
• Why we chose to go with a
Cloudera solution
– Setup and deployment are automated,
e.g. no manual IP-address
matching
– No need for admin support, e.g.
preconfigured
– Set up is portable to other
providers and on-premise
16. All humans carry between 200 and 800
mutations that disrupt the function of a
gene.
Which needle is the right one?
http://science.sciencemag.org/content/335/6070/823.full
https://waynealliance.wordpress.com/2010/06/02/all-needles-no-hay/
17.
BMC Genomics 2015, 16:1052 PMID: 26651996 (IF=4)
[Figure: runtime in seconds (axis 0 to 2000) per method (Python, R, Hadoop, ADAM, ADMIXTURE, VariantSpark), split by task (binary-conversion, clustering, pre-processing)]
It can classify 3000 individuals and 80 million variants in
under 30 minutes
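The benchmark above lists clustering as one of the timed tasks. As a rough illustration of what clustering individuals by their variants means, here is a minimal pure-Python k-means sketch. It is not VariantSpark's code, which runs on Spark; the 0/1/2 genotype encoding (count of alternate alleles per variant), the toy data and all function names are assumptions made for this example.

```python
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two genotype vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_centroid(point: Sequence[float], centroids: List[List[float]]) -> int:
    """Index of the centroid closest to the given individual."""
    distances = [squared_distance(point, c) for c in centroids]
    return distances.index(min(distances))


def kmeans(points: List[Sequence[float]],
           initial_centroids: List[Sequence[float]],
           max_iter: int = 20) -> List[int]:
    """Plain k-means with caller-supplied starting centroids (deterministic)."""
    centroids = [list(c) for c in initial_centroids]
    labels: List[int] = []
    for _ in range(max_iter):
        new_labels = [nearest_centroid(p, centroids) for p in points]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels


if __name__ == "__main__":
    # Toy data: 4 individuals x 4 variants, encoded as alternate-allele counts.
    individuals = [[0, 0, 1, 0], [0, 1, 0, 0], [2, 2, 1, 2], [2, 1, 2, 2]]
    print(kmeans(individuals, [individuals[0], individuals[2]]))
```

At population scale the same assign-and-update loop is what gets distributed: each worker handles a slice of the data, which is the kind of standardized parallelization a Hadoop/Spark cluster provides.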
18. • Collaboration between CSIRO, NCI and the John Curtin School of
Medical Research (JCSMR)
• Reuse of the AWS cluster setup on the NCI on-premise cluster.
– Cluster built by joint effort by CSIRO Hadoop administrator and local
Cloudera staff
– VariantSpark deployed and running within only 3 days
• Demonstration of the lower risk for organisations with proof of
concept
Setup
19. NHMRC: Dementia
Research Teams Grant led
by Ian Blair (MQ)
Developing insight into the
molecular origins of
familial and sporadic
frontotemporal dementia
and amyotrophic lateral
sclerosis
Affected
900 WGS
Normal
1400 WGS
Identify causative
mutations
Cluster Individuals on
disease progression
Application cases for a VariantSpark cluster
Kidney disease: Simon
Foote (JCSMR)
Uncover genetic cause of
early onset kidney failure.
20. Genome Engineering is currently
being developed for medical
treatments in humans, such as
cancer, blindness, HIV treatment.
However, the molecular
technology, CRISPR, is not 100%
efficient.
Aim: Develop computational
guidance framework to enable
edits the first time; every time.
21. Achieving the first time; every time
1. Better understanding of the science
2. Higher powered computational tools
• Super-computing-scale analysis
• Interactive real time analysis (query style research)
lauren riddoch
iconfinder
GT-Scan2
Ranked choices
22. • We tested GT-Scan2.0 against two publicly available models:
• sgRNAscorer (Chari et al 2015, Nature Methods)
• WU-CRISPR (Wong et al 2015, Genome Biology)
• Tested on 2 independent datasets (>4000 sgRNAs)
• Our chromatin-aware model consistently outperformed the other models
Better Science
[Figure: precision/recall curves for Validation Set 1 (x-axis: recall, y-axis: precision), with models compared by area under the precision/recall curve]
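The models on this slide are compared by the area under the precision/recall curve. The slides do not say how that area was computed; one common estimate is average precision, sketched below. The function name, the toy labels and the toy scores are assumptions for illustration, and tied scores are not given any special treatment.

```python
from typing import Sequence


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Average precision: mean of the precision values measured at the rank
    of each true positive, after sorting predictions by descending score."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            precision_sum += hits / rank  # precision at this recall step
    if hits == 0:
        raise ValueError("need at least one positive label")
    return precision_sum / hits


if __name__ == "__main__":
    # Toy example: 1 = efficient sgRNA, 0 = inefficient (made-up data).
    labels = [1, 0, 1, 1, 0]
    scores = [0.9, 0.6, 0.8, 0.4, 0.2]
    print(round(average_precision(labels, scores), 3))
```

A model that ranks every efficient guide above every inefficient one scores 1.0, so higher values mean better-ranked choices.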
23. Higher powered instantaneous compute
                           Desktop compute   High-performance compute   Hadoop/Spark     Microservices
Focus                      small data        Compute-intensive          Data-intensive   Agility
Fault tolerant             No                No                         Yes              (Yes)
Node-bound                 Yes               Yes                        No               No
Parallelization            10 CPU            100 CPU                    1000 CPU         1000 CPU
Parallelization procedure  bespoke           bespoke                    standardized     standardized
Overhead in the cloud      NA                spin-up lag                spin-up lag      instantaneously
CSIRO solution
24.
stuckincustoms
Area Under the Precision/Recall Curve International Recognition
25. Implementation
• GT-Scan2.0 is implemented as
an AWS Lambda function
• Server-less function:
• Does not require users to
have high-compute power
• Scalable:
• Can be easily scaled to
whole genome analysis
• Also intend to implement as a
“stand-alone”
• Can be run on local servers
• Can incorporate your own ChIP-seq
data rather than public data
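To make the fan-out idea concrete, the sketch below enumerates candidate target sites and scores them in parallel. It is a local stand-in under stated assumptions, not the GT-Scan2 implementation: a thread pool plays the role of independent Lambda invocations, the score is a placeholder (GC fraction) rather than the chromatin-aware model, only the forward strand is scanned, and the site definition (20-nt protospacer followed by an NGG PAM) is the usual SpCas9 convention rather than something stated in the slides.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple

# 20-nt protospacer followed by an NGG PAM; the lookahead lets
# overlapping candidate sites be reported as well.
SITE_PATTERN = re.compile(r"(?=([ACGT]{20})[ACGT]GG)")


def find_target_sites(sequence: str) -> List[Tuple[int, str]]:
    """Return (start, protospacer) for every candidate on the forward strand."""
    return [(m.start(), m.group(1)) for m in SITE_PATTERN.finditer(sequence.upper())]


def toy_score(protospacer: str) -> float:
    """Placeholder score (GC fraction). NOT the GT-Scan2 model."""
    return sum(base in "GC" for base in protospacer) / len(protospacer)


def rank_sites(sequence: str, max_workers: int = 8) -> List[Tuple[int, str, float]]:
    """Score every candidate independently, then return them best first.
    Each pool task stands in for one serverless function invocation."""
    sites = find_target_sites(sequence)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        scores = list(pool.map(toy_score, [proto for _, proto in sites]))
    ranked = sorted(zip(sites, scores), key=lambda item: item[1], reverse=True)
    return [(pos, proto, score) for (pos, proto), score in ranked]


if __name__ == "__main__":
    demo = "A" * 20 + "TGG" + "C" * 20 + "AGG"
    for pos, proto, score in rank_sites(demo):
        print(pos, proto, score)
```

Because each site is scored without reference to the others, the work splits cleanly into many small tasks, which is what makes a serverless back end a natural fit for queries with thousands of candidate sites.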
26. On-demand instances vs Lambda
               Pro                          Con
Lambda         Instantaneously available    Rel. small processing power
Spark-cluster  Unlimited processing power   Spin-up time
There is a sweet spot beyond which a large number of “nimble” small processors performs
worse than a powerful cluster with spin-up overhead, especially when that overhead is
reduced by managers like Cloudera Director.
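The trade-off can be sketched with a toy runtime model. Everything here is an assumption chosen for illustration: tasks are equal-sized and run in waves, Lambda has no start-up delay but limited concurrency, the cluster has more cores but pays a fixed spin-up time, and all numbers in the usage example are made up rather than measured.

```python
import math
from typing import Optional


def lambda_runtime(n_tasks: int, task_seconds: float, concurrency: int) -> float:
    """Wall-clock seconds when tasks run in waves of `concurrency` functions."""
    return math.ceil(n_tasks / concurrency) * task_seconds


def cluster_runtime(n_tasks: int, task_seconds: float, cores: int,
                    spin_up_seconds: float) -> float:
    """Wall-clock seconds on a cluster: fixed spin-up plus waves of `cores` tasks."""
    return spin_up_seconds + math.ceil(n_tasks / cores) * task_seconds


def break_even_tasks(lambda_task_seconds: float, lambda_concurrency: int,
                     cluster_task_seconds: float, cluster_cores: int,
                     spin_up_seconds: float,
                     max_tasks: int = 1_000_000) -> Optional[int]:
    """Smallest task count at which the cluster finishes strictly sooner,
    or None if that never happens up to `max_tasks`."""
    for n in range(1, max_tasks + 1):
        if (lambda_runtime(n, lambda_task_seconds, lambda_concurrency)
                > cluster_runtime(n, cluster_task_seconds, cluster_cores, spin_up_seconds)):
            return n
    return None


if __name__ == "__main__":
    # Hypothetical numbers: 2 s per task, 100 concurrent functions,
    # a 1000-core cluster that needs 300 s to spin up.
    print(break_even_tasks(2.0, 100, 2.0, 1000, 300.0))
```

Shrinking `spin_up_seconds` in this model moves the break-even point to smaller workloads, which is the effect attributed above to cluster managers.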
27. Three things to remember
• Large volumes of detailed data?
VariantSpark, bringing bigLearning to genomics, can
classify 3000 individuals and 80 million variants in under
30 minutes using Spark
• Parallelizable tasks that need persistent cloud availability?
GT-Scan2, computationally guiding genome engineering,
uses chromatin information and the latest in cloud
compute to improve CRISPR target site identification
• CSIRO specializes in using the latest advances in
compute technology to push the boundary on
bioinformatics problems
28. Natalie Twine
Acknowledgements
Denis Bauer Oscar Luo Rob Dunne Piotr Szul
Transformational Bioinformatics Team
Aidan O’Brien Laurence Wilson
Adrian White
Mia Champion
Gaetan Burgio
Collaborators
David Levy Ian Blair
Kelly Williams
News
Software
Open Position
Dan Andrews
Editor's Notes
Staff # as at 3 March 2016 = 5319
2014–15 budget = $1.2 billion
--------------------
Today we have around 5300 talented people working out of 50-plus centres in Australia and internationally.
We are a billion dollar organisation
We generate $485+ million in external revenue – essentially, nearly 40 per cent of our revenue is externally sourced
Our people work closely with industry and communities to leave a lasting legacy.
Our ability to achieve results is shown by the quality of our research. We are in the top 1% of global research institutions in 15 of 22 research fields and in the top 0.1% in four research fields.
CSIRO is the key connector of institutions in the Australian system for some areas. CSIRO is the most central Australian institution in 6 research fields – Agricultural Sciences, Environment/Ecology, Plant and Animal Sciences, Geosciences, Chemistry and Materials Science.
CSIRO works with 1208 SMEs and 2,877 customers each year. We’re always looking for ways we can help business and industry.
Our work has impacted the daily lives of Australians and those around the world. These are some of our top inventions.
We invented the world’s best wireless technology for our homes and offices
We developed the Total wellbeing diet – a higher protein, low-fat diet that’s nutritious, and facilitates sustainable weight loss
We developed Softly washing liquid – the first formula to successfully wash wool at high temperatures, killing bacteria while not shrinking the wool
We developed Barleymax a high fibre wholegrain, which has four times the resistant starch and twice the dietary fibre of regular grains
We invented Relenza, a treatment for flu
We kept flies off Her Majesty Queen Elizabeth II by creating Aerogard
We invented plastic (polymer) banknotes which are now exported to 25 countries with more than 3 billion notes currently in circulation
We invented Raft (Reverse Addition Fragmentation chain Transfer) technology enabling companies to develop new and advanced materials
We developed self-twisting yarn and made children's clothing safer than anywhere in the world
We invented contact lenses that can be worn for a month at a time
We invented Equivac HeV Vaccine for Hendra virus to protect Australian horse owners and the equine industry
Novacq prawn feed – need words
http://www.nature.com/nature/journal/v462/n7276/fig_tab/nature08645_F1.html
Bauer et al. Trends Mol Med. 2014 PMID: 24801560.