1. Large Scale Resequencing: Approaches and Challenges
Thomas Keane
Vertebrate Resequencing Informatics group
Wellcome Trust Sanger Institute
Hinxton, Cambridge, UK
thomas.keane@sanger.ac.uk
AGBT Tutorial Workshop 15th February, 2012
4. Vertebrate Resequencing Informatics Group
Established in 2008 with Jim Stalker
PIs: Richard Durbin and David Adams
Initial projects
1000 Genomes project (http://www.1000genomes.org)
Data processing, releases, aligner evaluation, sequencing
Pilot 2008-2009: ~5Tbp (Nature 2010;467)
Phase 1 2009-2011: ~30Tbp
Phase 2 2011-: ~36.9Tbp (low-coverage Illumina only)
Mouse Genomes Project (http://www.sanger.ac.uk/mousegenomes)
Sequencing 17 laboratory mouse strains
SNPs, indels, SVs, de novo assembly
~1.2Tbp (Nature 2011;477)
5. UK10K
Investigating the role of rare genetic variants in health and disease
Whole genome cohorts: 4,000 individuals across two well-established and deeply
phenotyped UK cohorts with ongoing longitudinal phenotype collection:
TWINSUK – 2,000
ALSPAC – 2,000
6x (18Gbp) per sample
Exomes: 6,000 exomes from 3 sets of extreme phenotype individuals
Neurodevelopmental diseases – 3,000
e.g. schizophrenia, autism spectrum disorders
Obesity – 2,000
e.g. severe childhood onset obesity
Rare diseases – 1,000
e.g. severe insulin resistance, congenital heart disease, ciliopathies
5Gbp per sample
Expect to generate ~100Tbp by end 2012
~40Tbp from BGI
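As a rough cross-check of the ~100Tbp expectation, the sample counts and per-sample yields on this slide can be multiplied out. This is a minimal sketch that uses only the slide's numbers; the variable names are illustrative.

```python
# Planned UK10K sequence yield, using only the figures on this slide.
GBP_PER_TBP = 1000

whole_genome_samples = 2000 + 2000      # TWINSUK + ALSPAC
exome_samples = 3000 + 2000 + 1000      # neurodevelopmental + obesity + rare diseases

whole_genome_tbp = whole_genome_samples * 18 / GBP_PER_TBP   # 6x = 18Gbp per sample
exome_tbp = exome_samples * 5 / GBP_PER_TBP                  # 5Gbp per sample
total_tbp = whole_genome_tbp + exome_tbp

print(f"Whole genomes: {whole_genome_tbp:.0f} Tbp")
print(f"Exomes: {exome_tbp:.0f} Tbp")
print(f"Total: {total_tbp:.0f} Tbp")
```

The planned total of about 102 Tbp is consistent with the ~100Tbp expectation.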
6. Current Status
Recently passed the 1000 Genomes Project in terms of total Gbp
7. What are the challenges?
NGS challenges: Storage, Software/Workflows, Compute Power
10. Storage Challenges
Expect ~200Tbp of sequence in 2011-2012
Working estimate including processing, release, and variant calling: 10 bytes per bp
Storage considerations
Scalability – can we easily add more storage units?
Backup and disaster recovery – what do we really need to keep?
Performance – sufficient I/O throughput to serve compute nodes
Cost
Data Formats
Standardised formats – BAM & VCF
Minimise the number of copies
Aim for two copies at most – original lanes + release (stripped) BAM
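The working estimate of 10 bytes per bp translates directly into a disk footprint for the expected ~200Tbp. A minimal sketch of that arithmetic, assuming decimal units (1 PB = 10^15 bytes); the variable names are illustrative.

```python
# Disk footprint implied by the working estimate on this slide.
expected_sequence_bp = 200 * 10**12   # ~200Tbp expected in 2011-2012
bytes_per_bp = 10                     # includes processing, release, and variant calling

total_bytes = expected_sequence_bp * bytes_per_bp
total_pb = total_bytes / 10**15       # assumption: decimal petabytes

print(f"Estimated storage: {total_pb:.1f} PB")
```

Under these assumptions the estimate comes to roughly 2 PB, which is why scalability, cost, and the number of copies kept all matter.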
11. A Tiered Storage Solution
[Diagram: three storage levels with cost and size rankings, a CPU farm, link speeds of 3Gb/sec and 800Mb/sec, and off-site copies]
Level 1
Data: Current release vertical BAMs
Processes: BAM merging + splitting, Variant calling (SNPs, indels, SVs)
Level 2
Data: Lane level BAMs
Processes: Alignment, recalibration, local realignment
Level 3
Data: Previous release BAMs + variant calls backup
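The level descriptions above can be written down as a small lookup table, so that a pipeline step can ask which level holds its data. This is only an illustrative sketch: `StorageTier`, `TIERS`, and `tier_for_process` are hypothetical names, not part of any Sanger software.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class StorageTier:
    level: int
    data: str
    processes: Tuple[str, ...]


# Contents taken from the slide; the structure itself is hypothetical.
TIERS = (
    StorageTier(1, "Current release vertical BAMs",
                ("BAM merging", "BAM splitting", "variant calling")),
    StorageTier(2, "Lane level BAMs",
                ("alignment", "recalibration", "local realignment")),
    StorageTier(3, "Previous release BAMs + variant calls backup", ()),
)


def tier_for_process(process: str) -> StorageTier:
    """Return the storage level that a processing step works against."""
    for tier in TIERS:
        if process in tier.processes:
            return tier
    raise KeyError(f"no storage level lists process {process!r}")


print(tier_for_process("alignment").level)
print(tier_for_process("variant calling").data)
```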
12. Data release + archiving: iRODS
Integrated Rule-Oriented Data System (iRODS)
Open source – origins in particle physics world
Most important feature of iRODS is the Rule Engine
Akin to source control system
Customise own application level metadata
e.g. run, lane, plex, sample, library….
[Diagram labels: nfs01, nfs02, nfs03, nfs20, off-site]
Stores/searches key-value metadata on files:
List all files from UK10K studies:
imeta -z seq qu -d study like 'UK10K_%'
/seq/5363/5363_1.bam
/seq/5363/5363_2.bam (.....and a whole lot more)
Get metadata about a file:
imeta ls -d /seq/6534/6534_3#7.bam sample
attribute: sample
value: QTL191953
Sanger production: BAM files from runs per lane per plex deposited
BMC Bioinformatics 2011, 12:361
Recently adopted for UK10K internal data release and archiving
Users use meta-data queries to find their data
Files can be part of multiple releases
http://www.irods.org
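The metadata queries shown on this slide are easy to script. The sketch below builds the same `imeta` argument list as the UK10K example and parses `key: value` lines such as the `attribute`/`value` pair shown above. Both helper functions are hypothetical, nothing here contacts an iRODS server, and real `imeta ls` output may contain extra lines that this simple parser skips or overwrites.

```python
from typing import Dict, List


def imeta_query_command(zone: str, attribute: str, pattern: str) -> List[str]:
    """Build the argument list for a data-object query like the UK10K example."""
    return ["imeta", "-z", zone, "qu", "-d", attribute, "like", pattern]


def parse_imeta_ls(output: str) -> Dict[str, str]:
    """Collect non-empty 'key: value' lines into a dict (later keys overwrite earlier ones)."""
    record: Dict[str, str] = {}
    for line in output.splitlines():
        key, separator, value = line.partition(":")
        if separator and value.strip():
            record[key.strip()] = value.strip()
    return record


# The list form can be passed to subprocess.run() without shell quoting.
command = imeta_query_command("seq", "study", "UK10K_%")
print(command)

# Output lines copied from the slide.
sample_output = "attribute: sample\nvalue: QTL191953\n"
print(parse_imeta_ls(sample_output))
```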
13. Compute Pipeline Management: VRPipe
VRPipe
Managed and automated execution of sequences of arbitrary
software against massive datasets across large compute clusters
Error handling, optimal memory requests, batching of jobs, retrying
failures, failure reporting, highly extendable, detailed job statistics
1000 Genomes Phase 2 processed through VRPipe
Tracked ~1 million jobs
Total serial wall time: 9886 days, 3 hrs, 43 mins, 25 secs
bwa_aln_fastq: ~2443 days total serial wall time
Mean memory: 941MB/job (max 5637MB)
Contact: sb10@sanger.ac.uk
2012
Fully migrate all NGS processes to VRPipe (data processing, SNP/indel/SV variant calling, and RNA-seq/ChIP-Seq pipelines)
Management front-ends
Create distributable VM for cloud rollout
http://www.github.com/VertebrateResequencing/vr-pipe/wiki
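Two of the features listed on this slide, batching of jobs and retrying failures with adjusted memory requests, can be illustrated in a few lines. This is a generic sketch and not VRPipe's implementation: the function names, the doubling policy, and the fake job are all assumptions made for illustration.

```python
from typing import Callable, Iterable, Iterator, List


def batches(items: Iterable[str], size: int) -> Iterator[List[str]]:
    """Group items into lists of at most `size`, e.g. lanes submitted as one job."""
    batch: List[str] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def run_with_retries(job: Callable[[int], None], memory_mb: int,
                     max_attempts: int = 3) -> int:
    """Run job(memory_mb); on MemoryError double the request and retry.

    Returns the memory request that succeeded; re-raises after max_attempts.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            job(memory_mb)
            return memory_mb
        except MemoryError:
            if attempt == max_attempts:
                raise
            memory_mb *= 2
    raise ValueError("max_attempts must be at least 1")


def fake_alignment(memory_mb: int) -> None:
    # Hypothetical stand-in for a cluster job that needs about 3000MB.
    if memory_mb < 3000:
        raise MemoryError(f"{memory_mb}MB is not enough")


print(list(batches(["lane1", "lane2", "lane3"], size=2)))
print(run_with_retries(fake_alignment, memory_mb=941))
```

Starting from the mean request of 941MB, the fake job succeeds on the third attempt at 3764MB. A production system would also record each failure for reporting, as the slide describes.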
14. Even more scale up in 2012 – HiSeq 2500
Currently takes 1-2 weeks to sequence a human genome
High depth human genomes in a single day – Illumina HiSeq 2500
Caucasian family with a severe T-cell deficiency in affected sibling
Single run on HiSeq 2500 by Illumina per individual
Sample    PF Yield (Gbp)  % ≥Q30  % Align  Mismatch R1 (%)  Mismatch R2 (%)  Run time (hrs)
Father    117.7           89      92.6     0.4              0.5              25.5
Mother    125.7           90.2    92.8     0.4              0.5              25.5
Affected  124.4           90.3    92.4     0.4              0.5              25.5
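To put these yields in terms of depth, they can be divided by the genome size. The sketch below assumes a 3Gbp genome, the same convention as the "6x (18Gbp)" figure on the UK10K slide, and it ignores duplicates and uneven coverage, so the results are only approximate. The function and dictionary names are illustrative.

```python
GENOME_SIZE_GBP = 3.0  # assumption: same convention as "6x (18Gbp)" on the UK10K slide

# PF yield and % aligned, copied from the table above.
runs = {
    "Father": {"pf_yield_gbp": 117.7, "pct_aligned": 92.6},
    "Mother": {"pf_yield_gbp": 125.7, "pct_aligned": 92.8},
    "Affected": {"pf_yield_gbp": 124.4, "pct_aligned": 92.4},
}


def approximate_depth(pf_yield_gbp, pct_aligned, genome_gbp=GENOME_SIZE_GBP):
    """Return (raw, aligned) fold coverage, ignoring duplicates and coverage bias."""
    raw = pf_yield_gbp / genome_gbp
    return raw, raw * pct_aligned / 100


for sample, run in runs.items():
    raw, aligned = approximate_depth(**run)
    print(f"{sample}: ~{raw:.0f}x raw, ~{aligned:.0f}x aligned")
```

Under these assumptions each single run gives roughly 40x raw depth per individual.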
15. What does the data look like?
16. Upcoming Changes in 2012
We cannot keep all of the data
2007-2008: Keep everything including images from runs
2009: BAM/Fastq – all of the base quality information
2010-2011: Stripping original qualities and other unused tags
2012-: Current formats contain lots of repetition
Reference based compression
Reducing quality information e.g. quality binning or quality
budgets
Potential formats: CRAM and/or Reduced BAM
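Quality binning, one of the options named above, replaces each base quality with a representative value for its bin so that far fewer distinct symbols need to be stored. The sketch below is illustrative only: the bin boundaries are hypothetical and do not come from CRAM, Reduced BAM, or any vendor scheme, and Phred+33 encoding is assumed.

```python
# Hypothetical bins: (lowest quality, highest quality, representative value).
BINS = ((0, 9, 5), (10, 19, 15), (20, 29, 25), (30, 93, 35))


def bin_quality(quality: int) -> int:
    """Map a Phred quality to the representative value of its bin."""
    for low, high, representative in BINS:
        if low <= quality <= high:
            return representative
    raise ValueError(f"quality {quality} is outside the supported range")


def bin_quality_string(quality_string: str, offset: int = 33) -> str:
    """Apply binning to a Phred+33 quality string (this step is lossy)."""
    return "".join(chr(bin_quality(ord(symbol) - offset) + offset)
                   for symbol in quality_string)


original = "IIIIHGF@95+#"
binned = bin_quality_string(original)
print(original)
print(binned)
print(len(set(original)), "distinct symbols reduced to", len(set(binned)))
```

Fewer distinct symbols and longer runs of identical values are what make the binned string compress better, at the cost of losing the exact original qualities.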
17. CRAM Format
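[Figure: CRAM models for compression]
Example read: TGAGCTCTAAGTACC with qualities 329183050298757
Quality treatment options: do nothing, lossless, quality lossy
Horizontal lossy model: TGAGCTCTAAGTACC with qualities 002020010022212
Vertical lossy model: TGAGCTCTAAGTACC with qualities -2---30---9---7
[Chart: axis ticks 100, 10, 1, 0.1; bars for untreated, CRAM lossless, CRAM combination model, CRAM substitutions/insertions model, and CRAM current performance]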
CRAM v0.6 released 13.2.12:
• Pairing information preservation regardless of distance
• Revised and improved lossless mode
• Option to preserve all unmapped reads
• Performance and bug fixes
• Arbitrary tags
http://www.ebi.ac.uk/ena/about/cram_toolkit
Source: Ewan Birney/Guy Cochrane, EBI
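The idea behind reference based compression is that an aligned read mostly repeats the reference, so only its position, length, and differences need to be stored. The sketch below shows that idea for ungapped reads with substitutions only. It is not the CRAM encoding: the functions, the reference sequence, and the introduced mismatch are hypothetical, and indels, qualities, and pairing are left out.

```python
from typing import List, Tuple

Encoded = Tuple[int, int, List[Tuple[int, str]]]


def encode_read(reference: str, position: int, read: str) -> Encoded:
    """Encode an ungapped read as (position, length, substitutions)."""
    window = reference[position:position + len(read)]
    if len(window) != len(read):
        raise ValueError("read extends past the end of the reference")
    substitutions = [(offset, base)
                     for offset, (ref_base, base) in enumerate(zip(window, read))
                     if ref_base != base]
    return position, len(read), substitutions


def decode_read(reference: str, encoded: Encoded) -> str:
    """Rebuild the read from the reference plus the stored substitutions."""
    position, length, substitutions = encoded
    bases = list(reference[position:position + length])
    for offset, base in substitutions:
        bases[offset] = base
    return "".join(bases)


# Hypothetical reference containing the slide's example sequence at position 4.
reference = "ACGT" + "TGAGCTCTAAGTACC" + "GGA"
# The slide's read with one substitution (C to G at offset 6) added for illustration.
read = "TGAGCTGTAAGTACC"

encoded = encode_read(reference, 4, read)
print(encoded)
print(decode_read(reference, encoded) == read)
```

Here a 15-base read is reduced to a position, a length, and a single substitution, and decoding against the same reference recovers it exactly.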
18. Any questions?
Richard Durbin
David Adams
URLs
• VRPipe: https://github.com/VertebrateResequencing/vr-pipe
• iRODS@Sanger: BMC Bioinformatics 2011, 12:361
• http://www.slideshare.net/thomaskeane