8. The First $1,000 Genome – Illumina HiSeq X Ten
http://systems.illumina.com/systems/hiseq-x-sequencing-system.html
10/31/2014 Confidential | Copyright 2012 Trend Micro Inc.
9. Expectation of Data Processing Power for Illumina HiSeq X Ten
• A cluster of 10 HiSeq X instruments
• Capable of sequencing up to 18,000 whole human genomes each year
– Has a run cycle of ~3 days and produces ~150 genomes per run cycle
– Running the industry-standard BWA+GATK analysis pipeline on a reasonably high-end compute server (dual Intel Xeon E5-2697 v2 CPUs, 12-core, 2.7 GHz, with 96 GB DRAM) takes ~24 hours per genome
– To achieve the required throughput of 150 genomes every three days, at least 50 of these servers are required
• Data processing should meet a target of ~28 minutes for completing the mapping, aligning, sorting, de-duplication, and variant calling of each genome
http://www.edicogenome.com/dragen/
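The server count and the per-genome time budget on this slide follow from simple arithmetic. A minimal sketch, assuming the instruments run continuously and that work divides evenly across servers (the function names are illustrative, not from the slides):

```python
import math

# Figures quoted on the slide.
GENOMES_PER_RUN = 150    # genomes produced per run cycle
RUN_CYCLE_DAYS = 3       # length of one run cycle
HOURS_PER_GENOME = 24    # BWA+GATK on one dual-Xeon server


def servers_needed(genomes_per_run, run_cycle_days, hours_per_genome):
    """Servers required to finish one run's genomes before the next run ends."""
    server_hours = genomes_per_run * hours_per_genome
    return math.ceil(server_hours / (run_cycle_days * 24))


def minutes_per_genome_budget(genomes_per_run, run_cycle_days):
    """Wall-clock minutes available per genome if genomes are processed one at a time."""
    return run_cycle_days * 24 * 60 / genomes_per_run


def genomes_per_year(genomes_per_run, run_cycle_days):
    """Yearly output, assuming back-to-back run cycles with no downtime (an assumption)."""
    return genomes_per_run * 365 / run_cycle_days


if __name__ == "__main__":
    print(servers_needed(GENOMES_PER_RUN, RUN_CYCLE_DAYS, HOURS_PER_GENOME))
    print(minutes_per_genome_budget(GENOMES_PER_RUN, RUN_CYCLE_DAYS))
    print(genomes_per_year(GENOMES_PER_RUN, RUN_CYCLE_DAYS))
```

Under these assumptions the sketch gives 50 servers and a budget of about 28.8 minutes per genome, in line with the ~28 minute target, and roughly 18,250 genomes per year, close to the quoted figure of up to 18,000.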
13. Algorithm of CloudBurst
Seed-and-Extend Algorithm
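A minimal single-machine sketch of seed-and-extend read mapping with at most k mismatches. It relies on the pigeonhole argument: if a read is cut into k+1 non-overlapping seeds, an alignment with at most k mismatches must contain at least one seed that matches exactly. The function names and the in-memory index are illustrative assumptions; this is not CloudBurst's MapReduce implementation.

```python
from collections import defaultdict


def build_seed_index(reference, seed_len):
    """Map every substring of length seed_len to its start positions in the reference."""
    index = defaultdict(list)
    for pos in range(len(reference) - seed_len + 1):
        index[reference[pos:pos + seed_len]].append(pos)
    return index


def seed_and_extend(read, reference, max_mismatches):
    """Return sorted (start, mismatches) pairs for alignments within max_mismatches."""
    seed_len = len(read) // (max_mismatches + 1)
    if seed_len == 0:
        raise ValueError("read is too short for the requested number of mismatches")
    index = build_seed_index(reference, seed_len)

    hits = {}
    checked = set()  # candidate start positions already extended
    for i in range(max_mismatches + 1):
        offset = i * seed_len
        seed = read[offset:offset + seed_len]
        # Seed step: exact lookup of this seed in the reference index.
        for pos in index.get(seed, ()):
            start = pos - offset
            if start < 0 or start + len(read) > len(reference) or start in checked:
                continue
            checked.add(start)
            # Extend step: compare the whole read against the candidate window.
            window = reference[start:start + len(read)]
            mismatches = sum(1 for a, b in zip(read, window) if a != b)
            if mismatches <= max_mismatches:
                hits[start] = mismatches
    return sorted(hits.items())


if __name__ == "__main__":
    reference = "GATTACAGATTTCAGGA"
    print(seed_and_extend("ATTACA", reference, 1))
```

Rebuilding the index on every call keeps the sketch short; a real mapper would index the reference once and reuse it across reads.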
14. Experiments
Performance of CloudBurst
Scalability
[Figure: Running Time vs Number of Reads on Chr 1. x-axis: Millions of Reads (0 to 8); y-axis: Runtime (s) (0 to 16000); legend: 0, 1, 2, 3, 4]
15. Speedup over Serial RMAP
EECS 584 – Fall 2013
[Figure: Speedup over serial RMAP. x-axis: Number of Mismatches (0 to 4); y-axis: Speedup (0 to 40); legend: chr1, chr22]
16. Experiments
Speedup on EC2
[Figure: Running Time on EC2, High-CPU Medium Instance Cluster. x-axis: Number of Cores (24, 48, 72, 96); y-axis: Running time (s) (0 to 1800)]
33. Summary
• NGS opens a new chapter in the Big Data era
• More CS experts are needed to solve scalability and performance issues
• More data scientists are also needed to discover the secrets/insights of the human genome
From the figure, we can see that CloudAligner is 60 to 80% faster than CloudBurst.
We mapped different subsets of the accession SRR035459 to human chromosome 22 (50 Mbp), allowing up to 3 mismatches.
From the figure, we can see that the execution time of both CloudBurst and CloudAligner is proportional to the number of reads, and CloudAligner outperforms CloudBurst by 35 to 67%.
With CloudBurst, the limitation of its approach is the network bandwidth. With CloudAligner, the limitation is the computation power of the Hadoop workers. Consequently, if we run CloudAligner on a cluster of legacy machines with a high-speed network, we would probably lose the performance advantage over CloudBurst.