ASU
1. Agenda:
• Research Computing @ Arizona State University
• Program, Vision and Mission
• Emphasis on Open Source
• Evolution in Genomic Analysis (HPC > MRv2 > Spark)
J.A. Etchings
RC@ASU Innovation
2.
Arizona State University has become the foundational
model for the “New American University”, a new
paradigm for the public Research University that
transforms higher education. ASU is committed to
Excellence, Access and Impact in everything that it
does.
3.
4. Open-source Data Driven Infrastructure
Google | Open-source | Function
GFS | HDFS | Distributed file system
MapReduce | MapReduce | Batch distributed data processing
Bigtable | HBase | Distributed DB/key-value store
Protobuf/Stubby | Thrift & Avro | Data serialization/RPC
Pregel | Giraph | Distributed graph processing
Dremel/F1 | Impala | Scalable interactive SQL (MPP)
FlumeJava | Crunch | Abstracted data pipelines on Hadoop
In Memory | Spark | In-memory computation
Data Intensive
5. TransCORE Framework Knowledge Engine
Context
Ontologies
Data Elements
Information Models
Middleware
Transact
Clinical Research
Life Science Research
Qualitative Research
Analytic
In-Memory Analysis
Genomic Data
Machine Learning
Meta-Data Management
Data Resources
Open Big Data File System
Relational Key/Value
HPC Parallel
HPC SMA
Transactional
Data Reservoir
Big Data Scratch Space
Internet 2 / SDN Connectivity
6.
7. The entire human genome of a single man
3 billion letters, 262,000 printed pages, 3.3GB
@rikisabatini #TED2016
8. Clarification & Limitations:
• Yes, we can sequence a genome for $1000
– Unfortunately, this does not include analysis
• The haploid human genome is about 3 billion base pairs, so each person carries about 6 billion base pairs across two copies
– Half come from mom and half from dad, and assembling those haplotypes (especially determining which SNPs lie on the same haplotype) is going to be instrumental in future medical advances
• Other limitations:
– Batch effects (in physical sequencing and in sequencing technology)
– Different software, different versions of software, and different infrastructure (Standardization Gap)
– Batch effects can significantly impede variant discovery (false positives are high)
9. “NEED TO FOCUS NOT ON BIG DATA,
BUT BIG ANSWERS”
Harper Reed – CTO Obama for America 2012
10. Tumors are not composed of identical cells:
There is likely extreme intratumor heterogeneity
Macro heterogeneity: > 10% frequency in the tumor
Micro heterogeneity: < 10% frequency in the tumor
11. Use simulations to ask:
• What are the population dynamics of cancer cell populations?
• What is the role of genetic drift in cancer initiation and progression?
• What is the extent of subclonal variation within a tumor at the time of diagnosis?
• Are resistant subclones present in a tumor before the start of therapy?
12. Model parameters and their values
• Probability of division, b_n, which depends on the fitness of each cell
• Mean selection coefficient, s, used to generate the exponential distribution of selection coefficients
s = [0.1; 0.01; 0.005]
• Average driver mutation rate per cell division, u
u = [10^-8; 10^-7; 10^-6; 10^-5]
• Generation time: average division time = 4 days*
*S Jones et al. Comparative lesion sequencing provides insights into tumor evolution. PNAS (2008)
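The parameters above can be made concrete with a minimal Python sketch. It assumes only what the slide states (selection coefficients are exponentially distributed with mean s, and one division takes 4 days on average). The function names and the seed are hypothetical, chosen for illustration, and are not taken from the authors' code.

```python
import random

DIVISION_TIME_DAYS = 4  # average division time quoted on the slide


def draw_selection_coefficients(mean_s, count, seed=0):
    """Draw `count` selection coefficients from an exponential distribution with mean `mean_s`."""
    rng = random.Random(seed)
    # random.expovariate expects the rate (1 / mean), not the mean itself.
    return [rng.expovariate(1.0 / mean_s) for _ in range(count)]


def generations_to_years(generations, days_per_division=DIVISION_TIME_DAYS):
    """Convert a number of cell generations into years of tumor growth."""
    return generations * days_per_division / 365.0


coefficients = draw_selection_coefficients(0.01, 10000)
sample_mean = sum(coefficients) / len(coefficients)
print("sample mean of s:", sample_mean)
print("years for 365 generations:", generations_to_years(365))
```

The sample mean should land close to the chosen mean selection coefficient, and the conversion shows why growth over thousands of generations corresponds to years or decades.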
13. The model: A branching evolutionary process
The process starts in a single cell with one driver mutation. Each cell then undergoes one of:
• Death, with probability 1 - b_n
• Division, with probability (1 - u) b_n
• Division + driver mutation, with probability u b_n
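The three outcomes and their probabilities can be sketched as a single sampling step. This is an illustrative Python sketch, not the authors' MATLAB or Scala code. The function names are hypothetical, and the values passed for b_n and u are example inputs only, since the slides do not give the formula that links b_n to fitness.

```python
import random


def fate_probabilities(b_n, u):
    """Probabilities of the three outcomes named on the slide."""
    return {
        "death": 1.0 - b_n,
        "division": (1.0 - u) * b_n,
        "division+driver": u * b_n,
    }


def cell_fate(b_n, u, rng):
    """Sample one outcome for a single cell using one uniform draw."""
    r = rng.random()
    if r < 1.0 - b_n:
        return "death"
    if r < 1.0 - b_n + (1.0 - u) * b_n:
        return "division"
    return "division+driver"


# Example inputs (assumed for illustration, not values from the slides).
rng = random.Random(1)
fates = [cell_fate(0.505, 1e-5, rng) for _ in range(100000)]
death_fraction = fates.count("death") / len(fates)
print("observed death fraction:", death_fraction)
```

Because u is tiny compared with b_n, almost every division is a plain division, which is why a simulation has to follow very many cells before new driver mutations appear.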
15. ≈ 98% of starting mutant clones die out early
Mean selection coefficient | Driver mutation rate per cell division | Number of realizations | Realizations that reached 10^9 cells | Percentage that reached 10^9 cells | Average time to detection (years)
0.1 | 10^-8 | 10155 | 162 | 1.6% | 17.50
0.1 | 10^-7 | 1948 | 112 | 5.7% | 5.21
0.1 | 10^-6 | 748 | 134 | 17.9% | 1.74
0.1 | 10^-5 | 748 | 111 | 14.8% | 1.62
0.01 | 10^-8 | 6867 | 125 | 1.8% | 19.80
0.01 | 10^-7 | 6866 | 113 | 1.6% | 15.41
0.01 | 10^-6 | 6866 | 120 | 1.7% | 13.85
0.01 | 10^-5 | 6865 | 115 | 1.7% | 11.16
0.005 | 10^-8 | 11951 | 102 | 0.9% | 27.97
0.005 | 10^-7 | 11751 | 112 | 1.0% | 27.91
0.005 | 10^-6 | 11750 | 126 | 1.1% | 22.43
0.005 | 10^-5 | 11750 | 100 | 0.9% | 18.28
Total completed | | 88265 | 1432 | 1.6% |
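As a rough cross-check on why so few realizations survive, the extinction probability of a simple binary branching process can be computed directly. If each cell dies with probability 1 - b or divides in two with probability b, a clone started by one cell dies out with probability min(1, (1 - b) / b). The link b = (1 + s) / 2 used below is an assumption made only for this illustration. The slides do not state how b_n depends on s, and this sketch ignores later driver mutations and any simulation time limit.

```python
def extinction_probability(b):
    """Eventual extinction probability for a clone started by one cell.

    Each cell independently dies (probability 1 - b) or divides in two
    (probability b). The result is the smallest fixed point of
    z = (1 - b) + b * z**2.
    """
    if not 0.0 < b <= 1.0:
        raise ValueError("b must be in (0, 1]")
    return min(1.0, (1.0 - b) / b)


def assumed_division_probability(s):
    """ASSUMPTION for illustration only: b = (1 + s) / 2 (not from the slides)."""
    return 0.5 * (1.0 + s)


survival = {}
for s in (0.1, 0.01, 0.005):
    q = extinction_probability(assumed_division_probability(s))
    survival[s] = 1.0 - q
    print(f"s = {s}: extinction = {q:.4f}, survival = {1.0 - q:.4f}")
```

Under this assumed link, survival is roughly 2s, the same order of magnitude as the percentages reported for s = 0.01 and s = 0.005.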
16. Some tumors develop very quickly
(mimics childhood cancers)
(Same results table as slide 15.)
17. Some tumors take decades to develop
(mimics many adult cancers, like melanoma)
(Same results table as slide 15.)
18. Computationally Intensive
• Running until 10^9 cells was not efficient on a laptop
• Most tumors die out before reaching a detectable limit
• Need to reduce run-time and track all mutations and subclone sizes (massively)
19. eQTL Analysis
Generates trillions of hypothesis tests
• 10^7 loci x 10^4 phenotypes x 10s of tissues = 10^12 p-values
• Tested below on 120 billion associations
Example queries:
• “Given 5 genes of interest, find top 20 most significant eQTLs (cis and/or trans)”
o Finishes in several seconds
• “Find all cis-eQTLs across the entire genome”
o Finishes in a couple of minutes
o Limited by disk throughput
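The first example query can be sketched in plain Python as a filter followed by a top-k selection on p-value. This is only an illustration of the query's shape, not the system that was benchmarked. The record layout, gene names, locus identifiers and p-values below are all made up.

```python
import heapq
from collections import namedtuple

# Hypothetical record layout for one tested association.
Association = namedtuple("Association", ["gene", "locus", "tissue", "p_value"])


def top_eqtls(associations, genes_of_interest, k=20):
    """Return the k most significant (smallest p-value) associations for the given genes."""
    genes = set(genes_of_interest)
    candidates = (a for a in associations if a.gene in genes)
    return heapq.nsmallest(k, candidates, key=lambda a: a.p_value)


# Scale of the full problem, as stated on the slide.
total_tests = 10**7 * 10**4 * 10

# Tiny made-up data set.
example = [
    Association("GENE_A", "locus_1", "liver", 1e-8),
    Association("GENE_A", "locus_2", "liver", 0.3),
    Association("GENE_B", "locus_3", "blood", 1e-12),
    Association("GENE_C", "locus_4", "blood", 1e-20),
]
top = top_eqtls(example, ["GENE_A", "GENE_B"], k=2)
for hit in top:
    print(hit.gene, hit.locus, hit.tissue, hit.p_value)
```

heapq.nsmallest keeps only k candidates in memory at a time, so the same pattern works on a stream of associations that is too large to sort in full.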
21. • Took a day to get a tumor to 10^7 cells
– (still 2 orders of magnitude too small)
• Convert code from MATLAB to Scala (Spark)
• Takes seconds to simulate a single tumor
• Ability to generate tens of thousands of possible tumors and thousands of measurable tumors, and to observe their dynamics
22. Standard Output
[Figure: 3 x 4 grid of density plots. x-axis: subclone size (number of cells), 10^2 to 10^10; y-axis: density. One panel per parameter combination:]
s = 0.1: μd = 10^-8 (N = 162), μd = 10^-7 (N = 112), μd = 10^-6 (N = 134), μd = 10^-5 (N = 111)
s = 0.01: μd = 10^-8 (N = 125), μd = 10^-7 (N = 113), μd = 10^-6 (N = 120), μd = 10^-5 (N = 115)
s = 0.005: μd = 10^-8 (N = 102), μd = 10^-7 (N = 112), μd = 10^-6 (N = 126), μd = 10^-5 (N = 100)
23. Standard Output
[Figure: 3 x 2 grid of density plots. x-axis: resistant subclone size (number of cells), 10^2 to 10^10; y-axis: density. One panel per parameter combination:]
s = 0.1: μd = 10^-6 (N = 134), μd = 10^-5 (N = 111)
s = 0.01: μd = 10^-6 (N = 120), μd = 10^-5 (N = 115)
s = 0.005: μd = 10^-6 (N = 126), μd = 10^-5 (N = 100)
25. Minor subclones that harbor mutations resistant to treatment can result in relapse
[Figure: response to vemurafenib (a V600E BRAF inhibitor); panels labeled "4 months on drug" and "6 months on drug"]
N. Wagle et al., Journal of Clinical Oncology (2011)
26. Subclonal variation of simulated tumor-1 at diagnosis
s = 0.005, u = 10^-5 per cell division, and mean division time = 4 days
[Figure: population dynamics of cancer cells (number of cells vs. time in years) and subclonal composition]
Subclonal composition: 17% 1 driver mutation; 80% 2 driver mutations
Subclone with a resistance mutation: N = 2,682 cells
Resistant mutation rate = (value missing in source)
27. Subclonal variation of simulated tumor-2 at diagnosis
s = 0.01, u = 10^-5 per cell division, and mean division time = 4 days
[Figure: population dynamics of cancer cells (number of cells vs. time in years) and subclonal composition]
Subclonal composition: 19% 2 driver mutations; 10% 2 driver mutations; 41% 1 driver mutation
Subclone with a resistance mutation: N = 224,502 cells
Resistant mutation rate = 10^-8
28.
29. Conclusions:
• These results constitute an argument for the development and application of more sensitive
technologies for the detection of rare pre-existing subclones that might plant the seeds for
rapid clinical relapse.
• Based on the predicted extent of standing subclonal variation, drug-resistant subclones are
almost certain to exist before the initiation of treatment.
• Greater subclonal diversity in a tumor may predict a higher likelihood of pre-existing
resistance to any conceivable targeted therapy
• Subclonal diversity itself may be a marker of the potential to evolve drug resistance, and
therefore may be an important prognostic indicator
• Reducing the time to research output with Apache Spark increases the success probability of
targeted therapies
30. The extent of subclonal variation is predicted by number of distinct dominant clones
Diego Chowell (a,b), James Napier (c), Rohan Gupta (c), Karen S. Anderson (b,d), Carlo C. Maley (b,d,f,1), and Melissa A. Wilson Sayres (b,d,e,1)
(a) Mathematical, Computational and Modeling Sciences Center, (b) Biodesign Institute, (c) Research Computing Center, (d) School of Life Sciences, (e) Center for Evolution and Medicine, Arizona State University, Tempe, Arizona 85281, USA; (f) Center for Evolution and Cancer, University of California San Francisco, San Francisco, California 94158, USA
(1) To whom correspondence may be addressed
E-mail: maley@asu.edu or melissa.wilsonsayres@asu.edu (wilsonsayreslab.org | @mwilsonsayres )
Editor's notes
Quick Facts:
Founded in 1885 as the Territorial Normal School
Renamed to Arizona State University in 1958
In 1994 ASU was classified as a Research I institute
Largest public university in the United States by enrollment
83K Students enrolled in Academic year 2013-2014
20K Degrees completed
Ranked #4 in the world for US patents in universities w/o a medical school
Research Expenditures = $405 Million in 2013
Currently, Arizona State University is ranked among the Top 25 research institutes in the U.S. in terms of research output, innovation, development, research expenditures, number of awarded patents, and awarded research grant proposals.
ASU is measured not by who it excludes but by whom it includes.
Organizational Chart Updated 10/20/2015
Mostly through the Apache Software Foundation
Transdisciplinary Common Ontological Representational Framework
Hybridized Cloud model
All elements (once siloed) now exist on a seamless fabric without need for complicated ETL mechanisms
It should be noted that although we can sequence a human genome for $1000, this does not include any analysis of it.
The haploid human genome is about 3 billion base pairs, so each person carries about 6 billion base pairs across two copies
(half come from mom and half from dad, and assembling those haplotypes, especially determining which SNPs lie on the same haplotype, is going to be instrumental in future medical advances).
Other limitations: batch effects (in physical sequencing, in sequencing technology, in using different software, and even different versions of software). Batch effects can significantly impede variant discovery (false positives are high).
Given that the detectable tumor burden is estimated to be approximately 10^9 tumor cells at the time of diagnosis, the level of resolution of conventional DNA-sequencing methods is clearly insufficient to assess pre-existing rare subclones that may harbor resistant mutations before therapy.
Pressure Testing with HPC, MRv1 and Apache Spark
In this scenario, treatment usually removes the dominant sub-clones, shifting the evolutionary landscape in favor of one or more of the rare sub-clones, and allowing these treatment-resistant clones to thrive.
The frequency of clonal mutations has been examined comprehensively for most cancer types, whereas the extent of subclonal heterogeneity within the DNA-sequences of individual tumors has not. Greater subclonal diversity in a tumor may predict a higher likelihood of pre-existing resistance to any conceivable targeted therapy.