Large Scale Resequencing: Approaches and Challenges


    Thomas Keane
    Vertebrate Resequencing Informatics group
    Wellcome Trust Sanger Institute
    Hinxton, Cambridge, UK

    thomas.keane@sanger.ac.uk



AGBT Tutorial Workshop   15th February, 2012
Sanger total sequence (2007-2009)
[Figure: Sanger total sequence output, 2007-2009, in Gbp]




Sanger total sequence to-date
[Figure: Sanger total sequence output to date, in Gbp]




Vertebrate Resequencing Informatics Group

     Established in 2008 with Jim Stalker
         PIs: Richard Durbin and David Adams
     Initial projects
         1000 Genomes project (http://www.1000genomes.org)
               Data processing, releases, aligner evaluation, sequencing
               Pilot 2008-2009: ~5Tbp (Nature 2010;467)
               Phase 1 2009-2011: ~30Tbp
               Phase 2 2011-: ~36.9Tbp (low-coverage Illumina only)
         Mouse Genomes Project (http://www.sanger.ac.uk/mousegenomes)
               Sequencing 17 laboratory mouse strains
               SNPs, indels, SVs, de novo assembly
               ~1.2Tbp (Nature 2011;477)


UK10K

 Investigating the role of rare genetic variants in health and disease
 Whole genome cohorts: 4,000 individuals across two well-established and deeply
 phenotyped UK cohorts with ongoing longitudinal phenotype collection:
     TWINSUK – 2,000
     ALSPAC – 2,000
     6x (18Gbp) per sample

 Exomes: 6,000 exomes from 3 sets of extreme phenotype individuals
    Neurodevelopmental diseases – 3,000
        e.g. schizophrenia, autism spectrum disorders
    Obesity – 2,000
        e.g. severe childhood onset obesity
    Rare diseases – 1,000
        e.g. severe insulin resistance, congenital heart disease, ciliopathies
    5Gbp per sample

 Expect to generate ~100Tbp by end 2012
    ~40Tbp from BGI
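
The ~100Tbp figure can be cross-checked from the cohort sizes and per-sample yields listed above. The sketch below is only a back-of-the-envelope check; the variable names are illustrative, and the 3Gbp genome size is inferred from the slide's own "6x (18Gbp)" rather than stated on it.

```python
# Back-of-the-envelope check of the UK10K yield, using only numbers from the slide.
GBP_PER_TBP = 1_000

wgs_samples = 2_000 + 2_000            # TWINSUK + ALSPAC
wgs_gbp_per_sample = 18                # 6x coverage per sample
exome_samples = 3_000 + 2_000 + 1_000  # neurodevelopmental + obesity + rare diseases
exome_gbp_per_sample = 5

wgs_tbp = wgs_samples * wgs_gbp_per_sample / GBP_PER_TBP
exome_tbp = exome_samples * exome_gbp_per_sample / GBP_PER_TBP
total_tbp = wgs_tbp + exome_tbp

# Genome size implied by "6x (18Gbp)"; an inference, not a figure from the slide.
implied_genome_gbp = wgs_gbp_per_sample / 6

print(f"Whole genomes: {wgs_tbp:.0f} Tbp")
print(f"Exomes:        {exome_tbp:.0f} Tbp")
print(f"Total:         {total_tbp:.0f} Tbp")
print(f"Implied genome size: {implied_genome_gbp:.0f} Gbp")
```

Whole genomes dominate the total even at 6x, which is consistent with the slide's rounded ~100Tbp estimate.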


Current Status




 Recently passed 1000 Genomes in terms of total Gbp
What are the challenges?



 [Diagram: NGS at the centre, surrounded by four challenges]
   Storage
   Software/Workflows
   Compute
   Power

Data Production Workflow


 [Diagram: per-lane data flows upward from Fastq files to per-sample BAMs]
   Import: Fastq files, one per lane
   Alignment (bwa, smalt etc): Fastq -> BAM
   BAM improvement: BAM -> improved BAM
   Library merge: lane BAMs -> library BAMs
   Sample/platform merge: library BAMs -> sample BAMs (e.g. NA34842, NA87465)
   Other labels shown: Import + Improvement, Library, Merge Up, Freeze

Data Production Workflow

 [Diagram: sample BAMs (NA19294, NA18943, NA19305, ... NA19309) are merged
  across samples into per-chromosome cross-sample BAMs (Chr1, Chr2, Chr3, ...),
  with read groups (RG:NA19294, RG:NA18943, RG:NA19305) identifying each sample]

 Variant calling on the cross-sample BAMs
   SNPs/indels: samtools, GATK
   SVs: Genome STRiP, SVMerge
   VQSR
   BEAGLE/Impute2
   VEP Annotation
   Final VCF

Storage Challenges

 Expect ~200Tbp of sequence in 2011-2012
   Working storage estimate: 10 bytes per bp, including processing, release,
   and variant calling
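
Taken together, the two figures above imply a total storage footprint. The sketch below simply multiplies them out; it assumes decimal units (1 PB = 10^15 bytes), which the slide does not specify.

```python
# Storage implied by the slide's working estimate.
# Assumption: decimal units (1 Tbp = 10**12 bp, 1 PB = 10**15 bytes).
sequence_tbp = 200   # expected sequence in 2011-2012
bytes_per_bp = 10    # working estimate covering processing, release, variant calling

total_bytes = sequence_tbp * 10**12 * bytes_per_bp
total_pb = total_bytes / 10**15

print(f"Estimated storage: {total_pb:.1f} PB")
```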

 Storage considerations
   Scalability – can we easily add more storage units?
   Backup and disaster recovery – what do we really need to keep?
   Performance – sufficient I/O throughput to serve compute nodes
   Cost

 Data Formats
   Standardised formats – BAM & VCF

 Minimise the number of copies
   Aim for two copies at most – original lanes + release (stripped) BAM

A Tiered Storage Solution


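 [Diagram: three storage tiers feeding the CPU farm, annotated with cost and
  size rankings; rows listed top to bottom as drawn]

   Cost   Size   Connection
   2      1      3Gb/sec to CPU farm
   1      3      800Mb/sec to CPU farm
   2      2      Off-site (shown twice)
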
       Level 1
           Data: Current release vertical BAMs
           Processes: BAM merging + splitting, Variant calling (SNPs, indels, SVs)
       Level 2
           Data: Lane level BAMs
           Processes: Alignment, recalibration, local realignment
       Level 3
           Data: Previous release BAMs + variant calls backup

Data release + archiving: iRODS

 Integrated Rule-Oriented Data System (iRODS)
     Open source – origins in particle physics world
     Most important feature of iRODS is the Rule Engine
     Akin to source control system
 Customise own application level metadata
     e.g. run, lane, plex, sample, library….
 Stores/searches key-value metadata on files:
     List all files from UK10K studies:
         imeta -z seq qu -d study like 'UK10K_%'
             /seq/5363/5363_1.bam
             /seq/5363/5363_2.bam (.....and a whole lot more)
     Get metadata about a file:
         imeta ls -d /seq/6534/6534_3#7.bam sample
             attribute: sample
             value: QTL191953

 Sanger production: BAM files from runs per lane per plex deposited
     BMC Bioinformatics 2011, 12:361

 Recently adopted for UK10K internal data release and archiving
     Users use meta-data queries to find their data
     Files can be part of multiple releases

 [Diagram: iRODS spanning servers nfs01, nfs02, nfs03, nfs20 and an off-site copy]

 http://www.irods.org
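
To make the key-value metadata idea concrete, here is a minimal in-memory sketch of the kind of query shown above. It is not iRODS code and does not call any iRODS API: the `catalog` dictionary, the `query` helper, and the study names are hypothetical, and the wildcard handling assumes the `like` operator follows SQL LIKE rules (`%` for any run of characters, `_` for a single character).

```python
import re

# Hypothetical catalogue: file path -> key-value metadata.
# Paths and the sample ID come from the slide; study names are invented.
catalog = {
    "/seq/5363/5363_1.bam": {"study": "UK10K_EXAMPLE_A", "lane": "1"},
    "/seq/5363/5363_2.bam": {"study": "UK10K_EXAMPLE_B", "lane": "2"},
    "/seq/6534/6534_3#7.bam": {"study": "EXAMPLE_OTHER", "sample": "QTL191953"},
}


def like_to_regex(pattern):
    """Translate a SQL LIKE pattern into a compiled regular expression."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")   # any run of characters
        elif ch == "_":
            parts.append(".")    # exactly one character
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.DOTALL)


def query(catalog, attribute, pattern):
    """Return sorted paths whose `attribute` value matches the LIKE pattern."""
    regex = like_to_regex(pattern)
    return sorted(
        path
        for path, meta in catalog.items()
        if attribute in meta and regex.fullmatch(meta[attribute])
    )


uk10k_files = query(catalog, "study", "UK10K_%")
sample_value = catalog["/seq/6534/6534_3#7.bam"]["sample"]

for path in uk10k_files:
    print(path)
print("sample:", sample_value)
```

Because files are located by metadata rather than by path, the same file can satisfy queries for several releases without being copied.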

Compute Pipeline Management: VRPipe

 VRPipe
   Managed and automated execution of sequences of arbitrary software against massive datasets across large compute clusters
   Error handling, optimal memory requests, batching of jobs, retrying failures, failure reporting, highly extendable, detailed job statistics
 1000 Genomes Phase 2 processed through VRPipe
   Tracked ~1 million jobs
   Total serial wall time: 9886 days, 3 hrs, 43 mins, 25 secs
   bwa_aln_fastq: ~2443 days total serial wall time
   Mean memory: 941MB/job (max 5637MB)
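
The job statistics above can be turned into per-job averages. The sketch below is plain arithmetic on the slide's numbers; it treats "~1 million jobs" as exactly 1,000,000, which is an approximation, and the variable names are illustrative.

```python
from datetime import timedelta

# Figures from the slide; the job count is approximate ("~1 million").
total_wall = timedelta(days=9886, hours=3, minutes=43, seconds=25)
bwa_wall = timedelta(days=2443)
jobs = 1_000_000

mean_job_minutes = total_wall.total_seconds() / jobs / 60
bwa_share = bwa_wall / total_wall          # timedelta / timedelta gives a float
serial_years = total_wall.days / 365.25

print(f"Mean wall time per job: {mean_job_minutes:.1f} minutes")
print(f"bwa_aln_fastq share of wall time: {bwa_share:.1%}")
print(f"Serial wall time: {serial_years:.1f} years")
```

Under these assumptions the average job is short, so per-job scheduling overhead matters, which is one reason batching of jobs appears in the feature list.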
 2012
   Fully migrate all NGS processes to VRPipe (data processing, SNP/indel/SV variant calling, and RNA-seq/ChIP-Seq pipelines)
   Management front-ends
   Create distributable VM for cloud rollout
 http://www.github.com/VertebrateResequencing/vr-pipe/wiki
 Contact: sb10@sanger.ac.uk

Even more scale up in 2012 – HiSeq 2500

 Currently takes 1-2 weeks to sequence a human genome
   High depth human genomes in a single day – Illumina HiSeq 2500
   Caucasian family with a severe T-cell deficiency in affected sibling
   Single run on HiSeq 2500 by Illumina per individual

   Sample     PF yield (Gbp)   % Align   % ≥Q30   Mismatch R1 (%)   Mismatch R2 (%)   Run time (hrs)
   Father     117.7            89        92.6     0.4               0.5               25.5
   Mother     125.7            90.2      92.8     0.4               0.5               25.5
   Affected   124.4            90.3      92.4     0.4               0.5               25.5
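
The yields in the table can be converted to approximate fold coverage. This is a rough sketch only: it assumes a 3Gbp genome, the size implied by the "6x (18Gbp)" figure on the UK10K slide, and it ignores duplicates and uneven coverage. The `depth` helper is illustrative.

```python
# Approximate fold coverage from the HiSeq 2500 table.
# Assumption: 3 Gbp genome, as implied by "6x (18Gbp)" on the UK10K slide.
GENOME_GBP = 3.0

# sample -> (PF yield in Gbp, % aligned), copied from the table
runs = {
    "Father": (117.7, 89.0),
    "Mother": (125.7, 90.2),
    "Affected": (124.4, 90.3),
}


def depth(yield_gbp, pct_aligned=100.0):
    """Fold coverage for a given yield, optionally scaled by the aligned fraction."""
    return yield_gbp * pct_aligned / 100 / GENOME_GBP


raw_depth = {name: depth(y) for name, (y, _) in runs.items()}
aligned_depth = {name: depth(y, pct) for name, (y, pct) in runs.items()}

for name in runs:
    print(f"{name}: ~{raw_depth[name]:.0f}x raw, ~{aligned_depth[name]:.0f}x aligned")
```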




What does the data look like?




Upcoming Changes in 2012

 We cannot keep all of the data
   2007-2008: Keep everything including images from runs
   2009: BAM/Fastq – all of the base quality information
   2010-2011: Stripping original qualities and other unused tags
   2012-: Current formats contain lots of repetition
       Reference based compression
       Reducing quality information e.g. quality binning or quality budgets
       Potential formats: CRAM and/or Reduced BAM
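
The two ideas listed above, reference-based compression and quality binning, can be illustrated with a toy sketch. This is not the CRAM or Reduced BAM algorithm: the function names, the reference string, and the bin width of 10 are all assumptions made for illustration, and only substitutions (no indels) are handled.

```python
def encode_against_reference(read, ref, pos):
    """Keep only (offset, base) pairs where the read differs from the reference."""
    window = ref[pos:pos + len(read)]
    if len(window) != len(read):
        raise ValueError("read extends past the end of the reference")
    return [(i, b) for i, (b, r) in enumerate(zip(read, window)) if b != r]


def decode_against_reference(diffs, ref, pos, length):
    """Rebuild the read from the reference window plus the stored differences."""
    bases = list(ref[pos:pos + length])
    for offset, base in diffs:
        bases[offset] = base
    return "".join(bases)


def bin_qualities(quals, bin_width=10):
    """Lossy step: map each Phred score to the lower edge of its bin."""
    return [(q // bin_width) * bin_width for q in quals]


# Toy reference containing the read sequence shown on the CRAM slide.
reference = "ACGT" + "TGAGCTCTAAGTACC" + "GGA"
read = "TGAGCGCTAAGTACC"          # one substitution at offset 5 (T -> G)

diffs = encode_against_reference(read, reference, pos=4)
restored = decode_against_reference(diffs, reference, pos=4, length=len(read))
binned = bin_qualities([2, 17, 25, 38, 40])

print("stored differences:", diffs)
print("round trip ok:", restored == read)
print("binned qualities:", binned)
```

A read that matches the reference stores nothing but its position, which is why most of the remaining cost sits in the quality values and why binning them is attractive.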




CRAM Format

 CRAM models for compression
   [Figure: example read TGAGCTCTAAGTACC with qualities 329183050298757,
    shown under four treatments along a scale marked 100, 10, 1, 0.1]
     Do nothing
     Lossless
     Quality lossy, horizontal: qualities 002020010022212
     Quality lossy, vertical: qualities -2---30---9---7

 CRAM current performance
   [Figure: positions on the same scale for]
     Untreated
     CRAM lossless
     CRAM combination model
     CRAM substitutions/insertions model

 CRAM v0.6 released 13.2.12:
   •  Pairing information preservation regardless of distance
   •  Revised and improved lossless mode
   •  Option to preserve all unmapped reads
   •  Performance and bug fixes
   •  Arbitrary tags

 http://www.ebi.ac.uk/ena/about/cram_toolkit
 Source: Ewan Birney/Guy Cochrane, EBI

Any questions?




 Richard Durbin, David Adams

 URLs
  •  VRPipe: https://github.com/VertebrateResequencing/vr-pipe
  •  iRODS@Sanger: BMC Bioinformatics 2011, 12:361
  •  http://www.slideshare.net/thomaskeane

Large Scale Resequencing: Approaches and Challenges

  • 1. Large Scale Resequencing: Approaches and Challenges Thomas Keane Vertebrate Resequencing Informatics group Wellcome Trust Sanger Institute Hinxton, Cambridge, UK thomas.keane@sanger.ac.uk AGBT Tutorial Workshop 15th February, 2012
  • 2. Sanger total sequence (2007-2009) Gbp AGBT Tutorial Workshop 15th February, 2012
  • 3. Sanger total sequence to-date Gbp AGBT Tutorial Workshop 15th February, 2012
  • 4. Vertebrate Resequencing Informatics Group  Established in 2008 with Jim Stalker  PIs: Richard Durbin and David Adams  Initial projects  1000 Genomes project (http://www.1000genomes.org)  Data processing, releases, aligner evaluation, sequencing  Pilot 2008-2009: ~5Tbp (Nature 2011;467)  Phase 1 2009-2011: ~30Tbp  Phase 2 2011-: ~36.9Tbp (LowCov ilmn only)  Mouse Genomes Project (http://www.sanger.ac.uk/ mousegenomes)  Sequencing 17 laboratory mouse strains  SNPs, indels, SVs, de novo assembly  Approx. ~1.2Tbp (Nature 2011;477) AGBT Tutorial Workshop 15th February, 2012
  • 5. UK10K Investigating the role of rare genetic variants in health and disease Whole genome cohorts: 4,000 individuals across two well-established and deeply phenotyped UK cohorts with ongoing longitudinal phenotype collection:   TWINSUK – 2,000   ALSPAC – 2,000   6x (18Gbp) per sample Exomes: 6,000 exomes from 3 sets of extreme phenotype individuals   Neurodevelopmental diseases – 3,000  e.g. schizophrenia, autism spectrum disorders   Obesity – 2,000  e.g. severe childhood onset obesity   Rare diseases – 1,000  e.g. severe insulin resistance, congenital heart disease, ciliopathies   5Gbp per sample Expect to generate ~100Tbp by end 2012   ~40Tbp from BGI AGBT Tutorial Workshop 15th February, 2012
  • 6. Current Status Recently passed 1000 genomes in terms of total Gbp AGBT Tutorial Workshop 15th February, 2012
  • 7. What are the challenges? Storage Software/Workflows NGS Compute Power AGBT Tutorial Workshop 15th February, 2012
  • 8. Data Production Workflow Sample NA34842 NA87465 Sample/Platform merge Merge Up BAM BAM BAM Library merge Library Freeze BAM BAM BAM BAM …… BAM BAM Improvement BAM …… Alignment BAM BAM BAM BAM Import (bwa, smalt etc) Fastq Fastq Fastq …… Fastq Fastq + Improvement AGBT Tutorial Workshop 15th February, 2012
  • 9. Data Production Workflow Chr1 Chr2 Chr3 NA19294 … NA18943 … Merge NA19305 . . . . . . . . . . . across NA19309 … RG:NA19294 RG:NA18943 RG:NA19305 Cross-sample BAMs SNPs/indels SVMerge samtools GATK Genome STRiP VQSR Variant BEAGLE/ Impute2 Calling VEP Annotation Final VCF  AGBT Tutorial Workshop 15th February, 2012
  • 10. Storage Challenges Expect ~200Tbp of sequence in 2011-2012  Working estimate including processing, release, and variant calling  10bytes per bp Storage considerations  Scalability – can we easily add more storage units?  Backup and disaster recovery – what do we really need to keep?  Performance – sufficient I/O throughput to serve compute nodes  Cost Data Formats  Standardised formats – BAM & VCF  Minimise the number of copies  Aim for two copies at most – original lanes + release (stripped) BAM AGBT Tutorial Workshop 15th February, 2012
  • 11. A Tiered Storage Solution Cost Size 2 1 3Gb/sec CPU Farm 1 3 800Mb/sec Off- Off- 2 2 site site Level 1   Data: Current release vertical BAMs   Processes: BAM merging + splitting, Variant calling (SNPs, indels, SVs) Level 2   Data: Lane level BAMs   Processes: Alignment, recalibration, local realignment Level 3   Data: Previous release BAMs + variant calls backup AGBT Tutorial Workshop 15th February, 2012
  • 12. Data release + archiving: iRODS
    Rule-Oriented Data management systems: iRODS
        Open source – origins in particle physics world
        Most important feature of iRODS is the Rule Engine
        Akin to source control system
    Customise own application level metadata
        e.g. run, lane, plex, sample, library…
    Stores/searches key-value metadata on files:
        List all files from UK10K studies:
            imeta -z seq qu -d study like 'UK10K_%'
            /seq/5363/5363_1.bam
            /seq/5363/5363_2.bam
            (.....and a whole lot more)
        Get metadata about a file:
            imeta ls -d /seq/6534/6534_3#7.bam sample
            attribute: sample
            value: QTL191953
    Sanger production: BAM files from runs per lane per plex deposited
        BMC Bioinformatics 2011, 12:361
    Recently adopted for UK10K internal data release and archiving
        Users use meta-data queries to find their data
        Files can be part of multiple releases
    [Figure labels: nfs01, nfs02, nfs03, nfs20, Off-site]
    http://www.irods.org
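The `imeta ls` call on this slide prints metadata as `attribute:` / `value:` lines, which is easy to consume from a script. The following is a minimal sketch, not Sanger's tooling: `parse_imeta_ls` and `imeta_ls` are hypothetical helper names, and the parser assumes only the line layout shown on the slide (an optional `units:` line is also accepted as an assumption). `imeta_ls` needs a configured iRODS client on the PATH.

```python
import subprocess


def parse_imeta_ls(text: str) -> list[dict[str, str]]:
    """Turn 'attribute:' / 'value:' / 'units:' lines into one dict per entry."""
    entries: list[dict[str, str]] = []
    current: dict[str, str] = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        key = key.strip()
        # Skip headers, separators, and anything that is not a known field.
        if not sep or key not in ("attribute", "value", "units"):
            continue
        # A new 'attribute' line starts the next entry.
        if key == "attribute" and current:
            entries.append(current)
            current = {}
        current[key] = value.strip()
    if current:
        entries.append(current)
    return entries


def imeta_ls(path: str, attribute: str) -> list[dict[str, str]]:
    """Run the slide's 'imeta ls -d <path> <attribute>' and parse its output."""
    result = subprocess.run(
        ["imeta", "ls", "-d", path, attribute],
        check=True, capture_output=True, text=True,
    )
    return parse_imeta_ls(result.stdout)


example = "attribute: sample\nvalue: QTL191953\n"
print(parse_imeta_ls(example))
```

Keeping the parser separate from the subprocess call means the parsing logic can be checked without access to an iRODS zone.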
  • 13. Compute Pipeline Management: VRPipe
    VRPipe
        Managed and automated execution of sequences of arbitrary software against massive datasets across large compute clusters
        Error handling, optimal memory requests, batching of jobs, retrying failures, failure reporting, highly extendable, detailed job statistics
    1000 Genomes Phase 2 processed through VRPipe
        Tracked ~1 million jobs
        Total serial wall time: 9886 days, 3 hrs, 43 mins, 25 secs
        bwa_aln_fastq: ~2443 days total serial wall time
        Mean memory: 941MB/job (max 5637MB)
    2012
        Fully migrate all NGS processes to VRPipe (data processing, SNP/indel/SV variant calling, and RNA-seq/ChIP-Seq pipelines)
        Management front-ends
        Create distributable VM for cloud rollout
    Contact: sb10@sanger.ac.uk
    http://www.github.com/VertebrateResequencing/vr-pipe/wiki
  • 14. Even more scale up in 2012 – HiSeq 2500
    Currently takes 1-2 weeks to sequence a human genome
        High depth human genomes in a single day – Illumina HiSeq 2500
        Caucasian family with a severe T-cell deficiency in affected sibling
        Single run on HiSeq 2500 by Illumina per individual

    Sample     PF Yield (Gbp)   % ≥Q30   % Align   Mismatch R1 (%)   Mismatch R2 (%)   Run time (hrs)
    Father     117.7            89       92.6      0.4               0.5               25.5
    Mother     125.7            90.2     92.8      0.4               0.5               25.5
    Affected   124.4            90.3     92.4      0.4               0.5               25.5
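To put the yields in the table into the "x coverage" terms used for UK10K, yield can be divided by genome size. This is a rough sketch: the ~3 Gbp genome size is an assumption (it matches the "6x (18Gbp)" figure on the UK10K slide), and because it uses raw PF yield rather than aligned bases it gives an upper bound on mean depth.

```python
GENOME_SIZE_GBP = 3.0  # assumption: ~3 Gbp human genome, as implied by "6x (18Gbp)"

# PF yield per individual, taken from the slide's table.
yields_gbp = {"Father": 117.7, "Mother": 125.7, "Affected": 124.4}


def mean_depth(yield_gbp: float, genome_gbp: float = GENOME_SIZE_GBP) -> float:
    """Approximate mean depth: total sequenced bases divided by genome size."""
    return yield_gbp / genome_gbp


for sample, gbp in yields_gbp.items():
    print(f"{sample}: ~{mean_depth(gbp):.0f}x")
```

On these assumptions each single run delivers on the order of 40x per individual.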
  • 15. What does the data look like?
  • 16. Upcoming Changes in 2012
    We cannot keep all of the data
        2007-2008: Keep everything including images from runs
        2009: BAM/Fastq – all of the base quality information
        2010-2011: Stripping original qualities and other unused tags
        2012-: Current formats contain lots of repetition
            Reference based compression
            Reducing quality information e.g. quality binning or quality budgets
            Potential formats: CRAM and/or Reduced BAM
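The idea behind reference based compression is that an aligned read mostly repeats the reference, so only its position, length, and differences need to be stored. The sketch below illustrates that idea only; it is not the CRAM encoding, it handles substitutions but not insertions or deletions, and the function names and the toy reference are hypothetical.

```python
def encode_read(reference: str, pos: int, read: str):
    """Store a read as (position, length, list of (offset, base) mismatches)."""
    window = reference[pos:pos + len(read)]
    if len(window) != len(read):
        raise ValueError("read extends past the end of the reference")
    diffs = [(i, b) for i, (r, b) in enumerate(zip(window, read)) if r != b]
    return pos, len(read), diffs


def decode_read(reference: str, record) -> str:
    """Rebuild the read from the reference plus the stored mismatches."""
    pos, length, diffs = record
    bases = list(reference[pos:pos + length])
    for offset, base in diffs:
        bases[offset] = base
    return "".join(bases)


reference = "ACGTTGAGCTCTAAGTACCGGA"  # toy reference, for illustration only
read = "TGAGCTGTAAGTACC"              # differs from the reference at one base
record = encode_read(reference, 4, read)
print(record)
assert decode_read(reference, record) == read
```

A read that matches the reference exactly is stored with an empty mismatch list, which is where most of the saving comes from.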
  • 17. CRAM Format
    CRAM models for compression
        [Figure: read TGAGCTCTAAGTACC with per-base qualities 329183050298757, reduced under two quality models]
            Horizontal: 002020010022212
            Vertical: -2---30---9---7
        [Chart: scale 100, 10, 1, 0.1; categories: Do nothing, Lossless, Quality lossy; labels: Untreated, CRAM lossless, CRAM current performance, CRAM substitutions/insertions model, CRAM combination model]
    CRAM v0.6 released 13.2.12:
        Option to preserve all unmapped reads
        Pairing information preservation regardless of distance
        Performance and bug fixes
        Revised and improved lossless mode
        Arbitrary tags
    http://www.ebi.ac.uk/ena/about/cram_toolkit
    Source: Ewan Birney/Guy Cochrane, EBI
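The two lossy quality models on this slide can be reproduced on its own example string. This is an illustrative sketch, not the CRAM toolkit's implementation: the bin boundaries and the kept positions are assumptions chosen so that the output matches the slide, and the pairing of each output string with the labels "Horizontal" and "Vertical" follows their order on the slide.

```python
def horizontal(quals: str, bins=((4, "0"), (6, "1"), (9, "2"))) -> str:
    """Bin every quality value into a coarser symbol (assumed boundaries)."""
    out = []
    for q in quals:
        for upper, symbol in bins:
            if int(q) <= upper:
                out.append(symbol)
                break
    return "".join(out)


def vertical(quals: str, keep: set) -> str:
    """Keep qualities only at selected positions; mask the rest with '-'."""
    return "".join(q if i in keep else "-" for i, q in enumerate(quals))


quals = "329183050298757"
print(horizontal(quals))                    # 002020010022212
print(vertical(quals, {1, 5, 6, 10, 14}))   # -2---30---9---7
```

In practice the kept positions would be chosen by some rule, for example bases that disagree with the reference, rather than listed by hand.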
  • 18. Any questions?
    Richard Durbin
    David Adams
    URLs
        VRPipe: https://github.com/VertebrateResequencing/vr-pipe
        iRODS@Sanger: BMC Bioinformatics 2011, 12:361
        http://www.slideshare.net/thomaskeane