Race against the sequencing machine: processing of raw DNA sequence data at the Genomics Core
Luc Dehaspe, Genomics Core, UZ Leuven
WOUD – Onderzoeksgroep Associatie Universiteit Gent, 28 Sept 2011

DNA sequencing determines the order of nucleotide bases in a genome
  • DNA replication machinery: human genome (2 x 3 billion bases) -> human genome (2 x 3 billion bases), in hours
  • Final-generation sequencing machine: human genome (2 x 3 billion bases) -> human genome as text (2 x 800 Mb)
  • Computer's copy function: human genome as text (2 x 800 Mb) -> human genome as text (2 x 800 Mb), in minutes

Next-generation sequencing
  • Quality deteriorates after 100-1000 base pairs
  • Solution: cut genomes into readable fragments
  • Sequence fragments -> reads
  • Use bioinformatics to reconstruct genomes from reads
  • Human genome (2 x 3 billion bases) -> next-generation sequencing machine -> reads in text format -> bioinformatics -> human genome as text (2 x 800 Mb)

Sequencers vs bioinformatics
  • HiSeq 2000 v3: 55 billion bases per day (6 human genomes in 10 days)
  • HiSeq 2000 v2: 18 billion bases per day
  • Roche GS FLX: 1 billion bases per day
  • Human genome (2 x 3 billion bases) -> sequencer -> bioinformatics -> human genome as text (2 x 800 Mb)
  • Scale up bioinformatics or pile up sequencer output

Bioinformatics pipeline
Case: human exome, raw data = 1.1 billion reads, 2 x 100 bp, HiSeq 2000 v3, ½ run
  • Demultiplex: sort indexed reads per sample
  • Alignment: align reads per sample to reference genome

Bioinformatics pipeline
Case: human exome, raw data = 1.1 billion reads, 2 x 100 bp, HiSeq 2000 v3, ½ run
  • Demultiplex: sort indexed reads per sample
  • Alignment: align reads per sample to reference genome
  • Variant calling: compare pileup of reads at a given locus to reference; identify SNPs, insertions and deletions

A bioinformatics pipeline
Case: human exome, raw data = 1.1 billion reads, 2 x 100 bp, HiSeq 2000 v3, ½ run
  • Demultiplex: sort indexed reads per sample
  • Alignment: align reads per sample to reference genome
  • Variant calling: compare to reference; identify SNPs, insertions and deletions
  • Annotation: annotate variants (gene, effect on protein sequence, conservation, frequency, predicted effect on protein function, …)
  • Sequencing: 10 days
  • Above pipeline: > 60 days on 1 CPU
  • Scale up or pile up

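The gap between 10 days of sequencing and more than 60 days of single-CPU analysis can be turned into a rough sizing estimate. The sketch below is only back-of-the-envelope arithmetic: it assumes the pipeline can be split evenly across CPUs, and the `efficiency` parameter is a hypothetical knob for less-than-perfect speed-up, not a figure from the slides.

```python
import math

def cpus_to_keep_pace(pipeline_days_on_one_cpu, sequencing_days, efficiency=1.0):
    """Smallest number of CPUs that finishes the pipeline within one sequencing run.

    efficiency is an assumed parallel efficiency (1.0 = perfectly linear speed-up).
    """
    return math.ceil(pipeline_days_on_one_cpu / (sequencing_days * efficiency))

# Figures from the slide: > 60 days on 1 CPU versus 10 days of sequencing.
minimum_cpus = cpus_to_keep_pace(60, 10)               # lower bound, linear speed-up
cautious_cpus = cpus_to_keep_pace(60, 10, efficiency=0.8)
print(minimum_cpus, cautious_cpus)
```

Because the slide gives "> 60 days", the result is a lower bound on the hardware needed.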
Favourable race conditions
  • Same task performed on many reads or loci
  • FOR 1.1 billion indexed reads DO identify sample
  • FOR 3 billion human genome loci DO compare locus in aligned reads to reference and identify homo- and heterozygous SNPs
  • Results for one read/locus are independent of results for other reads/loci
  • Suggests a natural scale-up strategy …

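The per-locus loop can be illustrated with a toy genotype call. This is a minimal sketch, not the method of any real variant caller: the function name `call_genotype`, the 20% allele-fraction threshold and the example pileups are all assumptions made for illustration. The point is that each call looks only at its own locus, so loci can be processed in any order or in parallel.

```python
from collections import Counter

def call_genotype(ref_base, pileup, min_fraction=0.2):
    """Toy genotype call for a single locus (threshold is illustrative only).

    pileup is the string of bases that aligned reads show at this locus.
    """
    depth = len(pileup)
    if depth == 0:
        return "no-call"
    counts = Counter(pileup)
    # Keep alleles supported by at least min_fraction of the reads.
    alleles = sorted(b for b, n in counts.items() if n / depth >= min_fraction)
    if not alleles:
        return "no-call"
    if alleles == [ref_base]:
        return "hom-ref"
    if len(alleles) == 1:
        return "hom-alt"   # homozygous SNP
    return "het"           # heterozygous SNP

# Each locus is called independently of all others.
reference = "ACGT"
pileups = ["AAAA", "CCTT", "TTTT", ""]
calls = [call_genotype(ref, pile) for ref, pile in zip(reference, pileups)]
print(calls)
```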
Data parallelism
  • Reads or loci are partitioned among the nodes of a computer cluster
  • Each node demultiplexes, aligns, etc. on its local partition
  • Speed-up is (near) linear in the number of cluster nodes
  • Example: variant calling on 3 billion human genome loci, split into variant calling on Chr1 … ChrY on a cluster of 24 computers (nodes)

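The partitioning by chromosome might be sketched as follows. This is an illustration under stated assumptions: worker threads stand in for cluster nodes, `call_variants` is a hypothetical placeholder for the real per-chromosome work, and the mapping of 24 nodes to chromosomes 1-22, X and Y is inferred from the slide rather than stated on it.

```python
from concurrent.futures import ThreadPoolExecutor

# 22 autosomes plus X and Y give 24 partitions, one per node.
CHROMOSOMES = [f"chr{i}" for i in range(1, 23)] + ["chrX", "chrY"]

def call_variants(chromosome):
    """Hypothetical placeholder: a real node would call variants on its partition."""
    return chromosome, f"{chromosome}.vcf"

def run_partitioned(chromosomes, n_nodes):
    # Worker threads stand in for cluster nodes in this sketch.
    with ThreadPoolExecutor(max_workers=n_nodes) as pool:
        return dict(pool.map(call_variants, chromosomes))

results = run_partitioned(CHROMOSOMES, n_nodes=24)
print(len(results), results["chr1"], results["chrY"])
```

Since chromosomes differ in length, an even split of loci would balance the load better than one chromosome per node; the chromosome split is simply the one shown on the slide.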
Data parallelism: demultiplex a HiSeq 2000 microplate
  • 1 microplate on 1 node: 1.1 billion reads at 1600 reads per second, 8 days
  • 8 lanes on nodes 1 … 8: 1 day
  • 384 tiles on nodes 1 … 384: ½ hour

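The three timings follow from the read count and the per-node rate, provided the speed-up is perfectly linear. The short calculation below reproduces them; the linear speed-up is an assumption, and the constant and function names are made up for this sketch.

```python
READS = 1.1e9             # reads in the case study
READS_PER_SECOND = 1600   # demultiplexing rate of one node

def demultiplex_seconds(n_nodes):
    """Wall-clock time assuming perfectly linear speed-up over n_nodes."""
    return READS / (READS_PER_SECOND * n_nodes)

for label, nodes in [("1 microplate", 1), ("8 lanes", 8), ("384 tiles", 384)]:
    seconds = demultiplex_seconds(nodes)
    print(f"{label}: {seconds / 86400:.2f} days = {seconds / 3600:.1f} hours")
```

The results come out at roughly 8 days, 1 day and half an hour, matching the figures on the slide.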
Favourable race conditions
MapReduce: data parallelism made easy
  • Developed and extensively used at Google
  • Open source library (C++) takes care of parallelization, fault tolerance, data distribution and load balancing
  • No knowledge of parallel systems required
  • User implements functions Map() and Reduce()

MapReduce: demultiplex reads
  • 8 lanes, 8 Map tasks (1 … 8)
  • Map: sort the reads of each lane by sample (Sample1, Sample2, Sample3)
  • Wait until map has finished
  • Reduce: deduplicate reads, one Reduce task per sample (Sample1 reads, Sample2 reads, Sample3 reads)
  • Output: Sample1.fastq.gz, Sample2.fastq.gz, Sample3.fastq.gz

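The map and reduce steps above can be sketched in plain Python, without any MapReduce library. Everything named here is hypothetical: the `INDEX_TO_SAMPLE` table, the toy reads, and the choice to deduplicate on exact sequence identity are assumptions for illustration, and the sketch runs sequentially where a real framework would distribute the tasks.

```python
from collections import defaultdict

# Hypothetical index-to-sample table; real barcodes come from the run's sample sheet.
INDEX_TO_SAMPLE = {"ACGT": "Sample1", "TTAG": "Sample2", "GGCA": "Sample3"}

def map_lane(lane_reads):
    """Map: sort the (index, sequence) reads of one lane into per-sample buckets."""
    buckets = defaultdict(list)
    for index, sequence in lane_reads:
        sample = INDEX_TO_SAMPLE.get(index)
        if sample is not None:          # reads with unknown indexes are dropped
            buckets[sample].append(sequence)
    return buckets

def shuffle(map_outputs):
    """Group the buckets of all lanes by sample, once every map task has finished."""
    merged = defaultdict(list)
    for buckets in map_outputs:
        for sample, sequences in buckets.items():
            merged[sample].extend(sequences)
    return merged

def reduce_sample(sequences):
    """Reduce: drop exact duplicate reads, keeping first-seen order."""
    return list(dict.fromkeys(sequences))

lanes = [
    [("ACGT", "AAAC"), ("TTAG", "CCCG"), ("ACGT", "AAAC")],
    [("GGCA", "GGGT"), ("ACGT", "TTTA"), ("NNNN", "ACAC")],
]
map_outputs = [map_lane(lane) for lane in lanes]   # one map task per lane
per_sample = shuffle(map_outputs)
demultiplexed = {s: reduce_sample(seqs) for s, seqs in per_sample.items()}
print(demultiplexed)
```

Each entry of `demultiplexed` corresponds to one per-sample output file on the slide.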
Favourable race conditions
GATK: MapReduce for sequencing projects
  • Genome Analysis Toolkit
  • Developed by and used extensively at the Broad Institute (Harvard and MIT)
  • Open source, Java 1.6 framework
  • Provides common data access patterns: traversal by read, traversal by locus

Favourable race conditions
  • Data parallelism is supported by many (open source) bioinformatics tools
  • Number of nodes is a parameter
  • Full analysis pipelines are widely available: GATK, CASAVA, …

Conclusion
  • Data parallelism is key
  • Scale up by buying extra cluster nodes: the Genomics Core recently added 400 nodes (shared)
  • Canned solutions for common bioinformatics tasks
  • Established programming frameworks for custom solutions: MapReduce, GATK

Conclusion
  • Bioinformaticians enjoy favourable conditions for keeping pace with the sequencer …
  • Next-generation sequencing machine: human genome (2 x 3 billion bases) -> reads in text format -> bioinformatics using data parallelism -> human genome as text (2 x 800 Mb)
  • Final-generation sequencing machine: human genome (2 x 3 billion bases) -> human genome as text (2 x 800 Mb)
