Race against the sequencing machine: processing of raw DNA sequence data at the Genomics Core
by Dr. Luc Dehaspe - Genomics Core, UZ Leuven
To grow and function, living organisms unconsciously and continuously read instructions from the DNA sequence in each cell. Thanks to advances in DNA sequencing technology, scientists are increasingly able to consciously read along. In 2001, sequencing efforts resulted in a first draft of the human genome. Since then, the capacity of the DNA reading machines has doubled every six months on average. While the first human genome sequencing project took years of worldwide collaboration, multiple genomes can now be sequenced in 10 days on a single machine at a service facility such as the Genomics Core.
Each sequencing run gives rise to a few terabytes of raw data that must be processed, using bioinformatics techniques, in time, before the next batch of data arrives.
I will discuss bioinformatics techniques that are commonly used at the Genomics Core and that have a chance to survive another generation of sequencing machines. A crucial feature of these techniques is that they keep up with the sequencing machines by creating sub-tasks that are distributed over an extensible network of computers.
1. Luc Dehaspe, Genomics Core, UZ Leuven. WOUD - Onderzoeksgroep Associatie Universiteit Gent, 28 Sept 2011. Race against the sequencing machine: processing of raw DNA sequence data at the Genomics Core.
2. DNA sequencing determines the order of nucleotide bases in a genome. The cell's DNA replication machinery copies the human genome (2 x 3 billion bases) in hours. A (final-generation) sequencing machine turns that genome into text. A computer's copy function duplicates the human genome as text (2 x 800 MB) in minutes.
3. Next-generation sequencing. Read quality deteriorates after 100-1000 base pairs. Solution: cut genomes into readable fragments, sequence the fragments into reads (in text format), and use bioinformatics to reconstruct the genome (2 x 3 billion bases, i.e. 2 x 800 MB of text) from the reads.
4. Sequencers vs. bioinformatics. Throughput per machine: HiSeq 2000 v3, 55 billion bases per day (6 human genomes in 10 days); HiSeq 2000 v2, 18 billion bases per day; Roche GS FLX, 1 billion bases per day. Either scale up the bioinformatics or pile up the sequencer output.
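A back-of-the-envelope check of the slide's own numbers, assuming "genome" means the 2 x 3 billion base diploid genome; the resulting per-genome coverage figure is my inference, not stated on the slide:

```python
# Throughput arithmetic from the slide's figures (HiSeq 2000 v3).
HISEQ_V3_BASES_PER_DAY = 55e9      # 55 billion bases per day (from the slide)
RUN_DAYS = 10                      # "6 Human Genomes in 10 days"
GENOMES_PER_RUN = 6
DIPLOID_GENOME_BASES = 2 * 3e9     # human genome, both copies

total_bases = HISEQ_V3_BASES_PER_DAY * RUN_DAYS       # 550 billion bases
bases_per_genome = total_bases / GENOMES_PER_RUN      # ~92 billion per genome
coverage = bases_per_genome / DIPLOID_GENOME_BASES    # ~15x redundancy

print(f"{total_bases / 1e9:.0f} Gb per run, ~{coverage:.0f}x coverage per genome")
```

The ~15-fold redundancy is what makes the fragment-based reconstruction of the previous slide workable: each locus is covered by many overlapping reads.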
5. Case: human exome; raw data = 1.1 billion reads of 2 x 100 bp (HiSeq 2000 v3, half a run). Bioinformatics pipeline, first steps: Demultiplex - sort indexed reads per sample. Alignment - align reads per sample to the reference genome.
6. The same case, one step further: Variant calling - compare the pileup of reads at a given locus to the reference and identify SNPs, insertions, and deletions.
7. A bioinformatics pipeline. Case: human exome; raw data = 1.1 billion reads of 2 x 100 bp (HiSeq 2000 v3, half a run). Demultiplex - sort indexed reads per sample. Alignment - align reads per sample to the reference genome. Variant calling - compare to the reference, identify SNPs, insertions, and deletions. Annotation - annotate variants (gene, effect on protein sequence, conservation, frequency, predicted effect on protein function, ...). Sequencing takes 10 days; the above pipeline takes over 60 days on 1 CPU. Scale up or pile up.
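To make the first pipeline step concrete, here is a minimal demultiplexing sketch: the barcode-to-sample table and the read tuples are invented for illustration, not the Genomics Core's actual data format.

```python
# Demultiplexing: sort indexed reads per sample, using the index barcode
# attached to each read. Barcodes and samples here are hypothetical.
from collections import defaultdict

BARCODES = {"ACGT": "sample_A", "TTAG": "sample_B"}  # illustrative index table

def demultiplex(reads):
    """Group reads into per-sample bins by their index barcode.

    `reads` is an iterable of (index_sequence, read_sequence) pairs.
    Reads with an unrecognised barcode land in an 'undetermined' bin.
    """
    bins = defaultdict(list)
    for index, sequence in reads:
        sample = BARCODES.get(index, "undetermined")
        bins[sample].append(sequence)
    return dict(bins)

reads = [("ACGT", "TTGCA"), ("TTAG", "GGATC"), ("NNNN", "CCGTA")]
print(demultiplex(reads))
```

Because each read is binned independently, this step parallelises in exactly the way the next slides describe.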
8. Favourable race conditions: the same task is performed on many reads or loci. FOR each of the 1.1 billion indexed reads DO: identify its sample. FOR each of the 3 billion human genome loci DO: compare the locus in the aligned reads to the reference and identify homo- and heterozygous SNPs. The result for one read/locus is independent of the results for all other reads/loci, which suggests a natural scale-up strategy ...
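The second FOR loop can be sketched as a pure per-locus function, which makes the independence property visible; the classification thresholds below are illustrative, not the ones used in a real variant caller.

```python
# Per-locus variant calling: each locus is decided from its own pileup
# only, independently of every other locus. Thresholds are made up.
def call_locus(ref_base, pileup):
    """Classify one locus from the bases of the reads aligned over it."""
    depth = len(pileup)
    alt = [b for b in pileup if b != ref_base]
    if depth == 0 or len(alt) / depth < 0.2:
        return "ref"          # reads essentially agree with the reference
    if len(alt) / depth > 0.8:
        return "hom_snp"      # homozygous SNP: nearly all reads differ
    return "het_snp"          # heterozygous SNP: a mixed pileup

# Synthetic (reference base, pileup) pairs for three loci.
loci = [("A", "AAAAAAAA"), ("C", "TTTTCTTT"), ("G", "GGGGCCGC")]
print([call_locus(ref, pile) for ref, pile in loci])  # ['ref', 'hom_snp', 'het_snp']
```

Since `call_locus` touches no shared state, the 3-billion-iteration loop can be split arbitrarily across machines.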
9. Data parallelism: reads or loci are partitioned among the nodes of a computer cluster; each node demultiplexes, aligns, etc. on its local partition; the speed-up is (near) linear in the number of cluster nodes. Example: variant calling over the 3 billion human genome loci, split by chromosome (Chr1 ... ChrY) over a cluster of 24 computers (nodes).
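A minimal sketch of that partitioning scheme, using Python's standard `multiprocessing.Pool`; the per-partition work is a stand-in for the real demultiplex/align/call steps, and the data is synthetic.

```python
# Data parallelism: partition loci by chromosome and hand each partition
# to a worker process. Each worker touches only its own slice.
from multiprocessing import Pool

def process_partition(partition):
    """Stand-in for per-partition pipeline work; returns (chromosome, #loci)."""
    chrom, loci = partition
    return chrom, len(loci)

if __name__ == "__main__":
    # Four synthetic chromosome partitions of different sizes.
    partitions = [(f"chr{i}", list(range(1000 * i))) for i in range(1, 5)]
    with Pool(processes=4) as pool:          # one worker per partition
        results = pool.map(process_partition, partitions)
    print(dict(results))
```

The number of workers is just a parameter, mirroring the slide's point that the speed-up scales (near) linearly with cluster nodes as long as partitions are independent.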
11. Favourable race conditions - MapReduce: data parallelism made easy. Developed and extensively used at Google. An open-source library (C++) takes care of parallelization, fault tolerance, data distribution, and load balancing; no knowledge of parallel systems is required. The user implements the functions Map() and Reduce().
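A toy version of the pattern, mirroring the user-supplied Map() and Reduce() functions the slide mentions (the real library distributes the shuffle across a cluster; here it is a few lines of Python). The example counts nucleotide frequencies over a set of reads.

```python
# Toy MapReduce: the user writes map_fn and reduce_fn; the "framework"
# shuffles mapped (key, value) pairs by key before reducing.
from collections import defaultdict
from itertools import chain

def map_fn(read):
    """Emit a (base, 1) pair for every base in one read."""
    return [(base, 1) for base in read]

def reduce_fn(key, values):
    """Sum the counts collected for one base."""
    return key, sum(values)

def map_reduce(inputs, map_fn, reduce_fn):
    groups = defaultdict(list)
    for key, value in chain.from_iterable(map(map_fn, inputs)):
        groups[key].append(value)                 # the shuffle phase
    return dict(reduce_fn(k, v) for k, v in groups.items())

reads = ["ACGT", "AACC", "GGTT"]
print(map_reduce(reads, map_fn, reduce_fn))  # {'A': 3, 'C': 3, 'G': 3, 'T': 3}
```

Both map and reduce calls are independent per key, so the framework can run them on as many nodes as are available.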
13. Favourable race conditions - GATK: MapReduce for sequencing projects. The Genome Analysis Toolkit, developed and used extensively at the Broad Institute (Harvard and MIT). Open source, Java 1.6 framework. Provides common data access patterns: traversal by read and traversal by locus.
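The shape of those two access patterns can be sketched in plain Python; this is not GATK's actual API (which is a Java walker framework), just an illustration of the difference between the two traversals, on synthetic aligned reads.

```python
# Two data access patterns over aligned reads: per-read vs. per-locus.
from collections import defaultdict

# Each read: (start_position, sequence). Synthetic alignment data.
READS = [(0, "ACGT"), (2, "GTAA"), (3, "TAAC")]

def traverse_by_read(reads, walker):
    """Call the walker once per read (e.g. for demultiplexing or read QC)."""
    return [walker(start, seq) for start, seq in reads]

def traverse_by_locus(reads, walker):
    """Call the walker once per locus with the pileup of overlapping bases
    (e.g. for variant calling)."""
    pileups = defaultdict(list)
    for start, seq in reads:
        for offset, base in enumerate(seq):
            pileups[start + offset].append(base)
    return {locus: walker(locus, bases) for locus, bases in sorted(pileups.items())}

print(traverse_by_read(READS, lambda start, seq: len(seq)))     # read lengths
print(traverse_by_locus(READS, lambda locus, bases: len(bases)))  # depth per locus
```

Providing these traversals once, in the framework, is what lets tool authors write only the per-read or per-locus logic, exactly as with Map() and Reduce() above.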
14. Favourable race conditions - data parallelism is supported by many (open-source) bioinformatics tools, with the number of nodes as a parameter. Full analysis pipelines are widely available: GATK, CASAVA, ...
15. Conclusion. Data parallelism is key: scale up by buying extra cluster nodes (the Genomics Core recently added 400 shared nodes). Canned solutions exist for common bioinformatics tasks, and established programming frameworks - MapReduce, GATK - support custom solutions.