1. Files, Tools, and Bioinformatics in the Cloud
Thomas Keane
Vertebrate Resequencing Informatics
WTSI
thomas.keane@sanger.ac.uk
Vertebrate Resequencing Informatics 17th November, 2009
2. DATA is the problem!
NGS means large volumes of raw data
Previously SRF (~8-10 bytes per bp), now BAM (~1.6 bytes per bp)
How much data can a sequencing machine produce?
20 Gbp per lane, 16 lanes per run (1 run = 1.5 weeks) => ~11 Tbp/year per machine
Small sequencing center: 4 machines?
44 Tbp per year!
Raw data in BAM: ~70 Tbytes
Processed calls much smaller
1000G pilot VCF < 1 Gbyte
Alignment + BAM improvement
SV calling: SVMerge
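
The figures on this slide can be checked with a few lines of arithmetic. The sketch below is only an illustration: the function names are hypothetical, and it assumes 52 weeks of continuous operation per year and decimal units (1 Tbp = 1000 Gbp), neither of which is stated on the slide.

```python
WEEKS_PER_YEAR = 52  # assumption: machines run back-to-back all year


def yearly_output_tbp(gbp_per_lane, lanes_per_run, weeks_per_run, machines=1):
    """Sequence output per year in Tbp (assuming 1 Tbp = 1000 Gbp)."""
    runs_per_year = WEEKS_PER_YEAR / weeks_per_run
    gbp_per_year = gbp_per_lane * lanes_per_run * runs_per_year * machines
    return gbp_per_year / 1000


def storage_tbytes(tbp, bytes_per_bp):
    """Storage in Tbytes for a given volume of sequence and format density."""
    return tbp * bytes_per_bp


per_machine_tbp = yearly_output_tbp(20, 16, 1.5)           # about 11 Tbp
center_tbp = yearly_output_tbp(20, 16, 1.5, machines=4)    # about 44 Tbp
bam_tbytes = storage_tbytes(center_tbp, 1.6)               # about 70 Tbytes
srf_tbytes = storage_tbytes(center_tbp, 8)                 # lower SRF bound

print(f"Per machine: {per_machine_tbp:.1f} Tbp/year")
print(f"4 machines:  {center_tbp:.1f} Tbp/year")
print(f"As BAM:      {bam_tbytes:.0f} Tbytes/year")
print(f"As SRF:      at least {srf_tbytes:.0f} Tbytes/year")
```

The unrounded results land slightly above the slide's rounded values, because the slide rounds to 11 Tbp before multiplying by 4 machines and 1.6 bytes per bp.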
3. Simplistic Model: Cloud as compute resource
Sequencing Center/Institute
SRF/Fastq/BAM (2 Mbps)
Processes
1. Align
2. Variant calling (n x SNP callers, n x indel callers, SV callers)
BAM + VCF (2 Mbps)
3,240 days to upload!
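
The 3,240-day figure follows from the 70 Tbytes of BAM on the previous slide and the 2 Mbps link. A minimal sketch of that calculation, assuming decimal units (1 Tbyte = 10^12 bytes, 1 Mbps = 10^6 bits per second) and a link that is fully used with no protocol overhead; the function name is hypothetical:

```python
SECONDS_PER_DAY = 86_400


def transfer_days(size_bytes, link_bits_per_second):
    """Days needed to move size_bytes over a link of the given bit rate."""
    seconds = size_bytes * 8 / link_bits_per_second
    return seconds / SECONDS_PER_DAY


bam_bytes = 70e12  # ~70 Tbytes of BAM per year (from the previous slide)
link_bps = 2e6     # 2 Mbps, read as megabits per second

upload_days = transfer_days(bam_bytes, link_bps)
print(f"Upload time: {upload_days:,.0f} days (~{upload_days / 365:.1f} years)")
```

Reading the link speed as megabits per second is what reproduces the slide's number; at 2 megabytes per second the upload would take one eighth as long.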
4. Move the raw data generation to the compute
Sequencing Center/Institute
BAM
Variant calling (n x SNP callers, n x indel callers, SV callers)
VCF
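
To see why this model helps, compare what has to cross the network when only the calls leave the compute site. The sketch below is a rough illustration: it takes the "< 1 Gbyte" size of the 1000G pilot VCF as an upper bound for a call set, keeps the same 2 Mbps link, and assumes decimal units. Whether a given center's call set is of similar size is an assumption.

```python
vcf_bytes = 1e9    # upper bound: the 1000G pilot VCF is under 1 Gbyte
bam_bytes = 70e12  # ~70 Tbytes of BAM per year for a 4-machine center
link_bps = 2e6     # 2 Mbps, read as megabits per second

# Time to move the VCF over the link, in minutes.
vcf_minutes = vcf_bytes * 8 / link_bps / 60

# How many times smaller the call set is than the raw BAM data.
size_ratio = bam_bytes / vcf_bytes

print(f"VCF transfer: at most about {vcf_minutes:.0f} minutes")
print(f"BAM is about {size_ratio:,.0f}x larger than the VCF")
```

Under these assumptions the calls move in about an hour, while the BAM files never need to leave the compute site.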
5. Large Collaborative Projects: Cloud centric model
VCF
Analysis groups