Race against the sequencing machine: processing of raw DNA sequence data at the Genomics Core
by Dr. Luc Dehaspe - Genomics Core, UZ Leuven
To grow and function, living organisms unconsciously and continuously read instructions from the DNA sequence in each cell. Thanks to advances in DNA sequencing technology, scientists are increasingly able to consciously read along. In 2001, sequencing efforts resulted in a first draft of the human genome. Since then, the capacity of the DNA reading machines has doubled every six months on average. While the first human genome sequencing project took years of worldwide collaboration, multiple genomes can now be sequenced in 10 days on a single machine at a service facility such as the Genomics Core.
Each sequencing run gives rise to a few terabytes of raw data that must be processed, using bioinformatics techniques, in time, before the next batch of data arrives.
I will discuss bioinformatics techniques that are commonly used at the Genomics Core and that have a chance to survive another generation of sequencing machines. A crucial feature of these techniques is that they keep up with the sequencing machines by creating sub-tasks that are distributed over an extensible network of computers.
1. Luc Dehaspe, Genomics Core, UZ Leuven. WOUD - Onderzoeksgroep Associatie Universiteit Gent, 28 Sept 2011. Race against the sequencing machine: processing of raw DNA sequence data at the Genomics Core.
2. DNA sequencing determines the order of nucleotide bases in a genome. The cell's DNA replication machinery copies the human genome (2 x 3 billion bases) in hours. A (final-generation) sequencing machine turns that genome into text. A computer's copy function duplicates the human genome as text (2 x 800 MB) in minutes.
3. Next-generation sequencing. Read quality deteriorates after 100-1000 base pairs. Solution: cut genomes into readable fragments, sequence the fragments into reads (in text format), and use bioinformatics to reconstruct the genome (2 x 3 billion bases, i.e. 2 x 800 MB of text) from the reads.
4. Sequencers vs. bioinformatics. Throughput per machine: HiSeq 2000 v3, 55 billion bases per day (6 human genomes in 10 days); HiSeq 2000 v2, 18 billion bases per day; Roche GS FLX, 1 billion bases per day. Either scale up the bioinformatics or pile up the sequencer output.
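A back-of-the-envelope check of the slide's own numbers, assuming "genome" means the 2 x 3 billion base diploid genome; the resulting per-genome coverage figure is my inference, not stated on the slide:

```python
# Throughput arithmetic from the slide's figures (HiSeq 2000 v3).
HISEQ_V3_BASES_PER_DAY = 55e9      # 55 billion bases per day (from the slide)
RUN_DAYS = 10                      # "6 Human Genomes in 10 days"
GENOMES_PER_RUN = 6
DIPLOID_GENOME_BASES = 2 * 3e9     # human genome, both copies

total_bases = HISEQ_V3_BASES_PER_DAY * RUN_DAYS       # 550 billion bases
bases_per_genome = total_bases / GENOMES_PER_RUN      # ~92 billion per genome
coverage = bases_per_genome / DIPLOID_GENOME_BASES    # ~15x redundancy

print(f"{total_bases / 1e9:.0f} Gb per run, ~{coverage:.0f}x coverage per genome")
```

The ~15-fold redundancy is what makes the fragment-based reconstruction of the previous slide workable: each locus is covered by many overlapping reads.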
5. Case: human exome; raw data = 1.1 billion reads of 2 x 100 bp (HiSeq 2000 v3, half a run). Bioinformatics pipeline, first steps: Demultiplex - sort indexed reads per sample. Alignment - align reads per sample to the reference genome.
6. The same case, one step further: Variant calling - compare the pileup of reads at a given locus to the reference and identify SNPs, insertions, and deletions.
7. A bioinformatics pipeline. Case: human exome; raw data = 1.1 billion reads of 2 x 100 bp (HiSeq 2000 v3, half a run). Demultiplex - sort indexed reads per sample. Alignment - align reads per sample to the reference genome. Variant calling - compare to the reference, identify SNPs, insertions, and deletions. Annotation - annotate variants (gene, effect on protein sequence, conservation, frequency, predicted effect on protein function, ...). Sequencing takes 10 days; the above pipeline takes over 60 days on 1 CPU. Scale up or pile up.
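To make the first pipeline step concrete, here is a minimal demultiplexing sketch: the barcode-to-sample table and the read tuples are invented for illustration, not the Genomics Core's actual data format.

```python
# Demultiplexing: sort indexed reads per sample, using the index barcode
# attached to each read. Barcodes and samples here are hypothetical.
from collections import defaultdict

BARCODES = {"ACGT": "sample_A", "TTAG": "sample_B"}  # illustrative index table

def demultiplex(reads):
    """Group reads into per-sample bins by their index barcode.

    `reads` is an iterable of (index_sequence, read_sequence) pairs.
    Reads with an unrecognised barcode land in an 'undetermined' bin.
    """
    bins = defaultdict(list)
    for index, sequence in reads:
        sample = BARCODES.get(index, "undetermined")
        bins[sample].append(sequence)
    return dict(bins)

reads = [("ACGT", "TTGCA"), ("TTAG", "GGATC"), ("NNNN", "CCGTA")]
print(demultiplex(reads))
```

Because each read is binned independently, this step parallelises in exactly the way the next slides describe.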
8. Favourable race conditions: the same task is performed on many reads or loci. FOR each of the 1.1 billion indexed reads DO: identify its sample. FOR each of the 3 billion human genome loci DO: compare the locus in the aligned reads to the reference and identify homo- and heterozygous SNPs. The result for one read/locus is independent of the results for all other reads/loci, which suggests a natural scale-up strategy ...
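The second FOR loop can be sketched as a pure per-locus function, which makes the independence property visible; the classification thresholds below are illustrative, not the ones used in a real variant caller.

```python
# Per-locus variant calling: each locus is decided from its own pileup
# only, independently of every other locus. Thresholds are made up.
def call_locus(ref_base, pileup):
    """Classify one locus from the bases of the reads aligned over it."""
    depth = len(pileup)
    alt = [b for b in pileup if b != ref_base]
    if depth == 0 or len(alt) / depth < 0.2:
        return "ref"          # reads essentially agree with the reference
    if len(alt) / depth > 0.8:
        return "hom_snp"      # homozygous SNP: nearly all reads differ
    return "het_snp"          # heterozygous SNP: a mixed pileup

# Synthetic (reference base, pileup) pairs for three loci.
loci = [("A", "AAAAAAAA"), ("C", "TTTTCTTT"), ("G", "GGGGCCGC")]
print([call_locus(ref, pile) for ref, pile in loci])  # ['ref', 'hom_snp', 'het_snp']
```

Since `call_locus` touches no shared state, the 3-billion-iteration loop can be split arbitrarily across machines.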
9. Data parallelism: reads or loci are partitioned among the nodes of a computer cluster; each node demultiplexes, aligns, etc. on its local partition; the speed-up is (near) linear in the number of cluster nodes. Example: variant calling over the 3 billion human genome loci, split by chromosome (Chr1 ... ChrY) over a cluster of 24 computers (nodes).
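A minimal sketch of that partitioning scheme, using Python's standard `multiprocessing.Pool`; the per-partition work is a stand-in for the real demultiplex/align/call steps, and the data is synthetic.

```python
# Data parallelism: partition loci by chromosome and hand each partition
# to a worker process. Each worker touches only its own slice.
from multiprocessing import Pool

def process_partition(partition):
    """Stand-in for per-partition pipeline work; returns (chromosome, #loci)."""
    chrom, loci = partition
    return chrom, len(loci)

if __name__ == "__main__":
    # Four synthetic chromosome partitions of different sizes.
    partitions = [(f"chr{i}", list(range(1000 * i))) for i in range(1, 5)]
    with Pool(processes=4) as pool:          # one worker per partition
        results = pool.map(process_partition, partitions)
    print(dict(results))
```

The number of workers is just a parameter, mirroring the slide's point that the speed-up scales (near) linearly with cluster nodes as long as partitions are independent.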
11. Favourable race conditions - MapReduce: data parallelism made easy. Developed and extensively used at Google. An open-source library (C++) takes care of parallelization, fault tolerance, data distribution, and load balancing; no knowledge of parallel systems is required. The user implements the functions Map() and Reduce().
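A toy version of the pattern, mirroring the user-supplied Map() and Reduce() functions the slide mentions (the real library distributes the shuffle across a cluster; here it is a few lines of Python). The example counts nucleotide frequencies over a set of reads.

```python
# Toy MapReduce: the user writes map_fn and reduce_fn; the "framework"
# shuffles mapped (key, value) pairs by key before reducing.
from collections import defaultdict
from itertools import chain

def map_fn(read):
    """Emit a (base, 1) pair for every base in one read."""
    return [(base, 1) for base in read]

def reduce_fn(key, values):
    """Sum the counts collected for one base."""
    return key, sum(values)

def map_reduce(inputs, map_fn, reduce_fn):
    groups = defaultdict(list)
    for key, value in chain.from_iterable(map(map_fn, inputs)):
        groups[key].append(value)                 # the shuffle phase
    return dict(reduce_fn(k, v) for k, v in groups.items())

reads = ["ACGT", "AACC", "GGTT"]
print(map_reduce(reads, map_fn, reduce_fn))  # {'A': 3, 'C': 3, 'G': 3, 'T': 3}
```

Both map and reduce calls are independent per key, so the framework can run them on as many nodes as are available.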
13. Favourable race conditions - GATK: MapReduce for sequencing projects. The Genome Analysis Toolkit, developed and used extensively at the Broad Institute (Harvard and MIT). Open source, Java 1.6 framework. Provides common data access patterns: traversal by read and traversal by locus.
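The shape of those two access patterns can be sketched in plain Python; this is not GATK's actual API (which is a Java walker framework), just an illustration of the difference between the two traversals, on synthetic aligned reads.

```python
# Two data access patterns over aligned reads: per-read vs. per-locus.
from collections import defaultdict

# Each read: (start_position, sequence). Synthetic alignment data.
READS = [(0, "ACGT"), (2, "GTAA"), (3, "TAAC")]

def traverse_by_read(reads, walker):
    """Call the walker once per read (e.g. for demultiplexing or read QC)."""
    return [walker(start, seq) for start, seq in reads]

def traverse_by_locus(reads, walker):
    """Call the walker once per locus with the pileup of overlapping bases
    (e.g. for variant calling)."""
    pileups = defaultdict(list)
    for start, seq in reads:
        for offset, base in enumerate(seq):
            pileups[start + offset].append(base)
    return {locus: walker(locus, bases) for locus, bases in sorted(pileups.items())}

print(traverse_by_read(READS, lambda start, seq: len(seq)))     # read lengths
print(traverse_by_locus(READS, lambda locus, bases: len(bases)))  # depth per locus
```

Providing these traversals once, in the framework, is what lets tool authors write only the per-read or per-locus logic, exactly as with Map() and Reduce() above.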
14. Favourable race conditions - data parallelism is supported by many (open-source) bioinformatics tools, with the number of nodes as a parameter. Full analysis pipelines are widely available: GATK, CASAVA, ...
15. Conclusion. Data parallelism is key: scale up by buying extra cluster nodes (the Genomics Core recently added 400 shared nodes). Canned solutions exist for common bioinformatics tasks, and established programming frameworks - MapReduce, GATK - support custom solutions.