SlideShare ist ein Scribd-Unternehmen logo
1 von 22
Downloaden Sie, um offline zu lesen
Toolkit for variation comparison and analysis


Brad Chapman, Bioinformatics Core at Harvard School of
                    Public Health



 Bioinformatics Open Source Conference, 13 July 2012
Variation

   1   1156131 rs2887286   T   C   1714.07 PASS
   AB=0;ABP=0;AC=1;AC1=2;AF=1.0;AF1=1;AN=1;AO=56;
   BVAR;CIGAR=1X;DB;DP=59;DP4=0,0,27,26;DPRA=0;Dels=0.00;
   EPP=6.88793;EPPR=0;FQ=-187;FS=0.000;GC=55.45;HRun=0;
   HaplotypeScore=0.0000;LEN=1;MEANALT=2;MQ=70.00;MQ0=0;
   MQM=70;MQMR=0;NBQ=27.76;NS=1;NUMALT=1;ODDS=308.76;
   PAIRED=0.928571;QD=30.61;RO=0;RPP=12.937;RUN=1;
   SAP=3.16541;SNPEFF_EFFECT=INTRON;SNPEFF_GENE_NAME=SDF4;
   SNPEFF_FUNCTIONAL_CLASS=NONE;SNPEFF_IMPACT=MODIFIER;
   SNPEFF_GENE_BIOTYPE=protein_coding;
   SNPEFF_TRANSCRIPT_ID=ENST00000263741;
   Samples=NA19239_illumina;TYPE=snp;VDB=0.0416;
   VQSLOD=16.7453;XAI=0.00018797;XAM=0.00154116
   GT:AO:DP:GL:GQ:QA:QR:RO
   1:56:58:-141.12,-3.61:99:1527:0:0
Answer biological questions; help people




   http://cmtproject.blogspot.com/
Solutions

   Comparisons

        Multiple callers: GATK, FreeBayes, samtools
        Multiple technologies: Illumina, SOLiD, IonTorrent

   Identify real variants

        Summarize associated metrics
        Remove false positives

   Scale

        Millions of variants
        Thousands of samples
http://genomics.xprize.org/
Clinical grade genome




      98 percent genome coverage
      1 error per million bases (SNPs + small indels)
      Full haplotype phasing
      Structural variations
Sequencing for patients




   http://matt.might.net/articles/my-sons-killer/
Technology overview



      Clojure
      http://clojure.org/
      Genome Analysis Toolkit
      http://www.broadinstitute.org/gsa/wiki/index.php/
      Home_Page
      GenomeSpace
      http://www.genomespace.org/

   https://github.com/chapmanb/bcbio.variation
Why Clojure?



      Hosted on Java Virtual Machine
          Interoperability with existing libraries: GATK, GenomeSpace
          Wonderful build, deployment and testing tools
      Functional and immutable
          Easier to write correct code
          Small functions: concise and refactorable
      Community
          Smart people working on hard problems
      Ecosystem
          Multiple backends: ClojureScript
Example analysis pipeline




       Variant files from two different callers
           GATK UnifiedGenotyper
           FreeBayes: https://github.com/ekg/freebayes
       Compare, identifying:
           Identical variants
           Different variants in each caller
           Metrics to help discriminate
High level description

   experiments:
     - sample: Test1
       ref: test/data/GRCh37.fa
       intervals: test/data/target-regions.bed
       align: test/data/aligned-reads.bam
       calls:
         - name: gatk
           file: test/data/gatk-calls.vcf
         - name: freebayes
           file: test/data/freebayes-calls.vcf
           prep: true
           annotate: true
           filters:
              - QD < 2.0
              - MQRankSum < -12.5
Simple to run

   With a new blank machine, get Java and automated Clojure build
   tool:

   $ sudo apt-get install openjdk-7-jdk openjdk-7-jre
   $ wget https://raw.github.com/technomancy/leiningen/
          preview/bin/lein
   $ chmod 755 lein && sudo mv lein /usr/local/bin

   Get code and dependencies, then run:

   $   git pull git://github.com/chapmanb/bcbio.variation.git
   $   cd bcbio.variation
   $   lein deps
   $   lein variant-compare comparison_description.yaml
What happened?
Establishing true variants


   X Prize: haploid gold standard genome

       Public genomes from HapMap/1000 genomes
            NA12878: Caucasian female from Utah.
            NA19239: Yoruba male from Ibadan, Nigeria.
       Multiple technologies
            Illumina
            SOLiD
            IonTorrent
       Multiple callers
            GATK
            FreeBayes
            samtools mpileup
True variant workflow – comparisons
True variant workflow – finalize
Comparison web interface




   https://validationprotocol.org/
Summary web interface
Next web steps – metrics




   http://keminglabs.com/
Answering biological questions



       Establish set of true variants
            Understand boundaries of certainty
            Make patient decisions
       Comparison architecture
            Cancer: tumor/normal pairs
            Mendelian inherited diseases: father/mother/child
       Prioritize based on known data
            Gemini: https://github.com/arq5x/gemini
            1000 genomes frequencies
            Mappability
            Clinical information

Weitere ähnliche Inhalte

Ähnlich wie B Chapman - Toolkit for variation comparison and analysis

BioThings API: Building a FAIR API Ecosystem for Biomedical Knowledge
BioThings API: Building a FAIR API Ecosystem for Biomedical KnowledgeBioThings API: Building a FAIR API Ecosystem for Biomedical Knowledge
BioThings API: Building a FAIR API Ecosystem for Biomedical KnowledgeChunlei Wu
 
Biomart Update
Biomart UpdateBiomart Update
Biomart Updatebosc
 
Making Use of NGS Data: From Reads to Trees and Annotations
Making Use of NGS Data: From Reads to Trees and AnnotationsMaking Use of NGS Data: From Reads to Trees and Annotations
Making Use of NGS Data: From Reads to Trees and AnnotationsJoão André Carriço
 
Genome in a Bottle - Towards new benchmarks for the “dark matter” of the huma...
Genome in a Bottle - Towards new benchmarks for the “dark matter” of the huma...Genome in a Bottle - Towards new benchmarks for the “dark matter” of the huma...
Genome in a Bottle - Towards new benchmarks for the “dark matter” of the huma...GenomeInABottle
 
Cool Informatics Tools and Services for Biomedical Research
Cool Informatics Tools and Services for Biomedical ResearchCool Informatics Tools and Services for Biomedical Research
Cool Informatics Tools and Services for Biomedical ResearchDavid Ruau
 
Legacy Analysis: How Hadoop Streaming Enables Software Reuse – A Genomics Cas...
Legacy Analysis: How Hadoop Streaming Enables Software Reuse – A Genomics Cas...Legacy Analysis: How Hadoop Streaming Enables Software Reuse – A Genomics Cas...
Legacy Analysis: How Hadoop Streaming Enables Software Reuse – A Genomics Cas...StampedeCon
 
Applying the Scientific Method to Simulation Experiments
Applying the Scientific Method to Simulation ExperimentsApplying the Scientific Method to Simulation Experiments
Applying the Scientific Method to Simulation ExperimentsFrank Bergmann
 
Computational Resources In Infectious Disease
Computational Resources In Infectious DiseaseComputational Resources In Infectious Disease
Computational Resources In Infectious DiseaseJoão André Carriço
 
XabTracker & SeqAgent: Integrated LIMS & Sequence Analysis Tools for Antibody...
XabTracker & SeqAgent: Integrated LIMS & Sequence Analysis Tools for Antibody...XabTracker & SeqAgent: Integrated LIMS & Sequence Analysis Tools for Antibody...
XabTracker & SeqAgent: Integrated LIMS & Sequence Analysis Tools for Antibody...Mark Evans
 
Deep learning for medical imaging
Deep learning for medical imagingDeep learning for medical imaging
Deep learning for medical imaginggeetachauhan
 
Promiscuous patterns and perils in PubChem and the MLSCN
Promiscuous patterns and perils in PubChem and the MLSCNPromiscuous patterns and perils in PubChem and the MLSCN
Promiscuous patterns and perils in PubChem and the MLSCNJeremy Yang
 
Static Analysis
Static AnalysisStatic Analysis
Static Analysisalice yang
 
Modware
ModwareModware
Modwarebosc
 
Mar2013 Performance Metrics Working Group
Mar2013 Performance Metrics Working GroupMar2013 Performance Metrics Working Group
Mar2013 Performance Metrics Working GroupGenomeInABottle
 
Ganeti Web Manager: Cluster Management Made Simple
Ganeti Web Manager: Cluster Management Made SimpleGaneti Web Manager: Cluster Management Made Simple
Ganeti Web Manager: Cluster Management Made SimpleOSCON Byrum
 
Using VarSeq to Improve Variant Analysis Research Workflows
Using VarSeq to Improve Variant Analysis Research WorkflowsUsing VarSeq to Improve Variant Analysis Research Workflows
Using VarSeq to Improve Variant Analysis Research WorkflowsDelaina Hawkins
 

Ähnlich wie B Chapman - Toolkit for variation comparison and analysis (20)

BioThings API: Building a FAIR API Ecosystem for Biomedical Knowledge
BioThings API: Building a FAIR API Ecosystem for Biomedical KnowledgeBioThings API: Building a FAIR API Ecosystem for Biomedical Knowledge
BioThings API: Building a FAIR API Ecosystem for Biomedical Knowledge
 
Biomart Update
Biomart UpdateBiomart Update
Biomart Update
 
Making Use of NGS Data: From Reads to Trees and Annotations
Making Use of NGS Data: From Reads to Trees and AnnotationsMaking Use of NGS Data: From Reads to Trees and Annotations
Making Use of NGS Data: From Reads to Trees and Annotations
 
Genome in a Bottle - Towards new benchmarks for the “dark matter” of the huma...
Genome in a Bottle - Towards new benchmarks for the “dark matter” of the huma...Genome in a Bottle - Towards new benchmarks for the “dark matter” of the huma...
Genome in a Bottle - Towards new benchmarks for the “dark matter” of the huma...
 
Cool Informatics Tools and Services for Biomedical Research
Cool Informatics Tools and Services for Biomedical ResearchCool Informatics Tools and Services for Biomedical Research
Cool Informatics Tools and Services for Biomedical Research
 
Legacy Analysis: How Hadoop Streaming Enables Software Reuse – A Genomics Cas...
Legacy Analysis: How Hadoop Streaming Enables Software Reuse – A Genomics Cas...Legacy Analysis: How Hadoop Streaming Enables Software Reuse – A Genomics Cas...
Legacy Analysis: How Hadoop Streaming Enables Software Reuse – A Genomics Cas...
 
Applying the Scientific Method to Simulation Experiments
Applying the Scientific Method to Simulation ExperimentsApplying the Scientific Method to Simulation Experiments
Applying the Scientific Method to Simulation Experiments
 
BioWeka
BioWekaBioWeka
BioWeka
 
Eccmid meet the expert 2015
Eccmid meet the expert 2015Eccmid meet the expert 2015
Eccmid meet the expert 2015
 
Computational Resources In Infectious Disease
Computational Resources In Infectious DiseaseComputational Resources In Infectious Disease
Computational Resources In Infectious Disease
 
XabTracker & SeqAgent: Integrated LIMS & Sequence Analysis Tools for Antibody...
XabTracker & SeqAgent: Integrated LIMS & Sequence Analysis Tools for Antibody...XabTracker & SeqAgent: Integrated LIMS & Sequence Analysis Tools for Antibody...
XabTracker & SeqAgent: Integrated LIMS & Sequence Analysis Tools for Antibody...
 
Deep learning for medical imaging
Deep learning for medical imagingDeep learning for medical imaging
Deep learning for medical imaging
 
Promiscuous patterns and perils in PubChem and the MLSCN
Promiscuous patterns and perils in PubChem and the MLSCNPromiscuous patterns and perils in PubChem and the MLSCN
Promiscuous patterns and perils in PubChem and the MLSCN
 
Static Analysis
Static AnalysisStatic Analysis
Static Analysis
 
Modware
ModwareModware
Modware
 
Mar2013 Performance Metrics Working Group
Mar2013 Performance Metrics Working GroupMar2013 Performance Metrics Working Group
Mar2013 Performance Metrics Working Group
 
Folker Meyer: Metagenomic Data Annotation
Folker Meyer: Metagenomic Data AnnotationFolker Meyer: Metagenomic Data Annotation
Folker Meyer: Metagenomic Data Annotation
 
HUG @ NGCLE@e-Novia 15.11.2017
HUG @ NGCLE@e-Novia 15.11.2017HUG @ NGCLE@e-Novia 15.11.2017
HUG @ NGCLE@e-Novia 15.11.2017
 
Ganeti Web Manager: Cluster Management Made Simple
Ganeti Web Manager: Cluster Management Made SimpleGaneti Web Manager: Cluster Management Made Simple
Ganeti Web Manager: Cluster Management Made Simple
 
Using VarSeq to Improve Variant Analysis Research Workflows
Using VarSeq to Improve Variant Analysis Research WorkflowsUsing VarSeq to Improve Variant Analysis Research Workflows
Using VarSeq to Improve Variant Analysis Research Workflows
 

Mehr von Jan Aerts

VIZBI 2014 - Visualizing Genomic Variation
VIZBI 2014 - Visualizing Genomic VariationVIZBI 2014 - Visualizing Genomic Variation
VIZBI 2014 - Visualizing Genomic VariationJan Aerts
 
Visual Analytics in Omics - why, what, how?
Visual Analytics in Omics - why, what, how?Visual Analytics in Omics - why, what, how?
Visual Analytics in Omics - why, what, how?Jan Aerts
 
Visual Analytics in Omics: why, what, how?
Visual Analytics in Omics: why, what, how?Visual Analytics in Omics: why, what, how?
Visual Analytics in Omics: why, what, how?Jan Aerts
 
Visual Analytics talk at ISMB2013
Visual Analytics talk at ISMB2013Visual Analytics talk at ISMB2013
Visual Analytics talk at ISMB2013Jan Aerts
 
Visualizing the Structural Variome (VMLS-Eurovis 2013)
Visualizing the Structural Variome (VMLS-Eurovis 2013)Visualizing the Structural Variome (VMLS-Eurovis 2013)
Visualizing the Structural Variome (VMLS-Eurovis 2013)Jan Aerts
 
Humanizing Data Analysis
Humanizing Data AnalysisHumanizing Data Analysis
Humanizing Data AnalysisJan Aerts
 
Intro to data visualization
Intro to data visualizationIntro to data visualization
Intro to data visualizationJan Aerts
 
L Fu - Dao: a novel programming language for bioinformatics
L Fu - Dao: a novel programming language for bioinformaticsL Fu - Dao: a novel programming language for bioinformatics
L Fu - Dao: a novel programming language for bioinformaticsJan Aerts
 
J Wang - bioKepler: a comprehensive bioinformatics scientific workflow module...
J Wang - bioKepler: a comprehensive bioinformatics scientific workflow module...J Wang - bioKepler: a comprehensive bioinformatics scientific workflow module...
J Wang - bioKepler: a comprehensive bioinformatics scientific workflow module...Jan Aerts
 
S Cain - GMOD in the cloud
S Cain - GMOD in the cloudS Cain - GMOD in the cloud
S Cain - GMOD in the cloudJan Aerts
 
B Temperton - The Bioinformatics Testing Consortium
B Temperton - The Bioinformatics Testing ConsortiumB Temperton - The Bioinformatics Testing Consortium
B Temperton - The Bioinformatics Testing ConsortiumJan Aerts
 
J Goecks - The Galaxy Visual Analysis Framework
J Goecks - The Galaxy Visual Analysis FrameworkJ Goecks - The Galaxy Visual Analysis Framework
J Goecks - The Galaxy Visual Analysis FrameworkJan Aerts
 
S Cain - GMOD in the cloud
S Cain - GMOD in the cloudS Cain - GMOD in the cloud
S Cain - GMOD in the cloudJan Aerts
 
P Rocca-Serra - The open source ISA metadata tracking framework: from data cu...
P Rocca-Serra - The open source ISA metadata tracking framework: from data cu...P Rocca-Serra - The open source ISA metadata tracking framework: from data cu...
P Rocca-Serra - The open source ISA metadata tracking framework: from data cu...Jan Aerts
 
J Klein - KUPKB: sharing, connecting and exposing kidney and urinary knowledg...
J Klein - KUPKB: sharing, connecting and exposing kidney and urinary knowledg...J Klein - KUPKB: sharing, connecting and exposing kidney and urinary knowledg...
J Klein - KUPKB: sharing, connecting and exposing kidney and urinary knowledg...Jan Aerts
 
S Cheng - eagle-i: development and expansion of a scientific resource discove...
S Cheng - eagle-i: development and expansion of a scientific resource discove...S Cheng - eagle-i: development and expansion of a scientific resource discove...
S Cheng - eagle-i: development and expansion of a scientific resource discove...Jan Aerts
 
A Kanterakis - PyPedia: a python crowdsourcing development environment for bi...
A Kanterakis - PyPedia: a python crowdsourcing development environment for bi...A Kanterakis - PyPedia: a python crowdsourcing development environment for bi...
A Kanterakis - PyPedia: a python crowdsourcing development environment for bi...Jan Aerts
 
A Kalderimis - InterMine: Embeddable datamining components
A Kalderimis - InterMine: Embeddable datamining componentsA Kalderimis - InterMine: Embeddable datamining components
A Kalderimis - InterMine: Embeddable datamining componentsJan Aerts
 
E Afgan - Zero to a bioinformatics analysis platform in four minutes
E Afgan - Zero to a bioinformatics analysis platform in four minutesE Afgan - Zero to a bioinformatics analysis platform in four minutes
E Afgan - Zero to a bioinformatics analysis platform in four minutesJan Aerts
 
B Kinoshita - Creating biology pipelines with BioUno
B Kinoshita - Creating biology pipelines with BioUnoB Kinoshita - Creating biology pipelines with BioUno
B Kinoshita - Creating biology pipelines with BioUnoJan Aerts
 

Mehr von Jan Aerts (20)

VIZBI 2014 - Visualizing Genomic Variation
VIZBI 2014 - Visualizing Genomic VariationVIZBI 2014 - Visualizing Genomic Variation
VIZBI 2014 - Visualizing Genomic Variation
 
Visual Analytics in Omics - why, what, how?
Visual Analytics in Omics - why, what, how?Visual Analytics in Omics - why, what, how?
Visual Analytics in Omics - why, what, how?
 
Visual Analytics in Omics: why, what, how?
Visual Analytics in Omics: why, what, how?Visual Analytics in Omics: why, what, how?
Visual Analytics in Omics: why, what, how?
 
Visual Analytics talk at ISMB2013
Visual Analytics talk at ISMB2013Visual Analytics talk at ISMB2013
Visual Analytics talk at ISMB2013
 
Visualizing the Structural Variome (VMLS-Eurovis 2013)
Visualizing the Structural Variome (VMLS-Eurovis 2013)Visualizing the Structural Variome (VMLS-Eurovis 2013)
Visualizing the Structural Variome (VMLS-Eurovis 2013)
 
Humanizing Data Analysis
Humanizing Data AnalysisHumanizing Data Analysis
Humanizing Data Analysis
 
Intro to data visualization
Intro to data visualizationIntro to data visualization
Intro to data visualization
 
L Fu - Dao: a novel programming language for bioinformatics
L Fu - Dao: a novel programming language for bioinformaticsL Fu - Dao: a novel programming language for bioinformatics
L Fu - Dao: a novel programming language for bioinformatics
 
J Wang - bioKepler: a comprehensive bioinformatics scientific workflow module...
J Wang - bioKepler: a comprehensive bioinformatics scientific workflow module...J Wang - bioKepler: a comprehensive bioinformatics scientific workflow module...
J Wang - bioKepler: a comprehensive bioinformatics scientific workflow module...
 
S Cain - GMOD in the cloud
S Cain - GMOD in the cloudS Cain - GMOD in the cloud
S Cain - GMOD in the cloud
 
B Temperton - The Bioinformatics Testing Consortium
B Temperton - The Bioinformatics Testing ConsortiumB Temperton - The Bioinformatics Testing Consortium
B Temperton - The Bioinformatics Testing Consortium
 
J Goecks - The Galaxy Visual Analysis Framework
J Goecks - The Galaxy Visual Analysis FrameworkJ Goecks - The Galaxy Visual Analysis Framework
J Goecks - The Galaxy Visual Analysis Framework
 
S Cain - GMOD in the cloud
S Cain - GMOD in the cloudS Cain - GMOD in the cloud
S Cain - GMOD in the cloud
 
P Rocca-Serra - The open source ISA metadata tracking framework: from data cu...
P Rocca-Serra - The open source ISA metadata tracking framework: from data cu...P Rocca-Serra - The open source ISA metadata tracking framework: from data cu...
P Rocca-Serra - The open source ISA metadata tracking framework: from data cu...
 
J Klein - KUPKB: sharing, connecting and exposing kidney and urinary knowledg...
J Klein - KUPKB: sharing, connecting and exposing kidney and urinary knowledg...J Klein - KUPKB: sharing, connecting and exposing kidney and urinary knowledg...
J Klein - KUPKB: sharing, connecting and exposing kidney and urinary knowledg...
 
S Cheng - eagle-i: development and expansion of a scientific resource discove...
S Cheng - eagle-i: development and expansion of a scientific resource discove...S Cheng - eagle-i: development and expansion of a scientific resource discove...
S Cheng - eagle-i: development and expansion of a scientific resource discove...
 
A Kanterakis - PyPedia: a python crowdsourcing development environment for bi...
A Kanterakis - PyPedia: a python crowdsourcing development environment for bi...A Kanterakis - PyPedia: a python crowdsourcing development environment for bi...
A Kanterakis - PyPedia: a python crowdsourcing development environment for bi...
 
A Kalderimis - InterMine: Embeddable datamining components
A Kalderimis - InterMine: Embeddable datamining componentsA Kalderimis - InterMine: Embeddable datamining components
A Kalderimis - InterMine: Embeddable datamining components
 
E Afgan - Zero to a bioinformatics analysis platform in four minutes
E Afgan - Zero to a bioinformatics analysis platform in four minutesE Afgan - Zero to a bioinformatics analysis platform in four minutes
E Afgan - Zero to a bioinformatics analysis platform in four minutes
 
B Kinoshita - Creating biology pipelines with BioUno
B Kinoshita - Creating biology pipelines with BioUnoB Kinoshita - Creating biology pipelines with BioUno
B Kinoshita - Creating biology pipelines with BioUno
 

Kürzlich hochgeladen

Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontologyjohnbeverley2021
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native ApplicationsWSO2
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Victor Rentea
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxRemote DBA Services
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesrafiqahmad00786416
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWERMadyBayot
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Jeffrey Haguewood
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamUiPathCommunity
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusZilliz
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Zilliz
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDropbox
 

Kürzlich hochgeladen (20)

Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontology
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptx
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 

B Chapman - Toolkit for variation comparison and analysis

  • 1. Toolkit for variation comparison and analysis Brad Chapman, Bioinformatics Core at Harvard School of Public Health Bioinformatics Open Source Conference, 13 July 2012
  • 2. Variation 1 1156131 rs2887286 T C 1714.07 PASS AB=0;ABP=0;AC=1;AC1=2;AF=1.0;AF1=1;AN=1;AO=56; BVAR;CIGAR=1X;DB;DP=59;DP4=0,0,27,26;DPRA=0;Dels=0.00; EPP=6.88793;EPPR=0;FQ=-187;FS=0.000;GC=55.45;HRun=0; HaplotypeScore=0.0000;LEN=1;MEANALT=2;MQ=70.00;MQ0=0; MQM=70;MQMR=0;NBQ=27.76;NS=1;NUMALT=1;ODDS=308.76; PAIRED=0.928571;QD=30.61;RO=0;RPP=12.937;RUN=1; SAP=3.16541;SNPEFF_EFFECT=INTRON;SNPEFF_GENE_NAME=SDF4; SNPEFF_FUNCTIONAL_CLASS=NONE;SNPEFF_IMPACT=MODIFIER; SNPEFF_GENE_BIOTYPE=protein_coding; SNPEFF_TRANSCRIPT_ID=ENST00000263741; Samples=NA19239_illumina;TYPE=snp;VDB=0.0416; VQSLOD=16.7453;XAI=0.00018797;XAM=0.00154116 GT:AO:DP:GL:GQ:QA:QR:RO 1:56:58:-141.12,-3.61:99:1527:0:0
  • 3.
  • 4. Answer biological questions; help people http://cmtproject.blogspot.com/
  • 5. Solutions Comparisons Multiple callers: GATK, FreeBayes, samtools Multiple technologies: Illumina, SOLiD, IonTorrent Identify real variants Summarize associated metrics Remove false positives Scale Millions of variants Thousands of samples
  • 7. Clinical grade genome 98 percent genome coverage 1 error per million bases (SNPs + small indels) Full haplotype phasing Structural variations
  • 8. Sequencing for patients http://matt.might.net/articles/my-sons-killer/
  • 9. Technology overview Clojure http://clojure.org/ Genome Analysis Toolkit http://www.broadinstitute.org/gsa/wiki/index.php/ Home_Page GenomeSpace http://www.genomespace.org/ https://github.com/chapmanb/bcbio.variation
  • 10. Why Clojure? Hosted on Java Virtual Machine Interoperability with existing libraries: GATK, GenomeSpace Wonderful build, deployment and testing tools Functional and immutable Easier to write correct code Small functions: concise and refactorable Community Smart people working on hard problems Ecosystem Multiple backends: ClojureScript
  • 11. Example analysis pipeline Variant files from two different callers GATK UnifiedGenotyper FreeBayes: https://github.com/ekg/freebayes Compare, identifying: Identical variants Different variants in each caller Metrics to help discriminate
  • 12. High level description experiments: - sample: Test1 ref: test/data/GRCh37.fa intervals: test/data/target-regions.bed align: test/data/aligned-reads.bam calls: - name: gatk file: test/data/gatk-calls.vcf - name: freebayes file: test/data/freebayes-calls.vcf prep: true annotate: true filters: - QD < 2.0 - MQRankSum < -12.5
  • 13. Simple to run With a new blank machine, get Java and automated Clojure build tool: $ sudo apt-get install openjdk-7-jdk openjdk-7-jre $ wget https://raw.github.com/technomancy/leiningen/ preview/bin/lein $ chmod 755 lein && sudo mv lein /usr/local/bin Get code and dependencies, then run: $ git pull git://github.com/chapmanb/bcbio.variation.git $ cd bcbio.variation $ lein deps $ lein variant-compare comparison_description.yaml
  • 15. Establishing true variants X Prize: haploid gold standard genome Public genomes from HapMap/1000 genomes NA12878: Caucasian female from Utah. NA19239: Yoruba male from Ibadan, Nigeria. Multiple technologies Illumina SOLiD IonTorrent Multiple callers GATK FreeBayes samtools mpileup
  • 16. True variant workflow – comparisons
  • 17. True variant workflow – finalize
  • 18. Comparison web interface https://validationprotocol.org/
  • 20. Next web steps – metrics http://keminglabs.com/
  • 21.
  • 22. Answering biological questions Establish set of true variants Understand boundaries of certainty Make patient decisions Comparison architecture Cancer: tumor/normal pairs Mendelian inherited diseases: father/mother/child Prioritize based on known data Gemini: https://github.com/arq5x/gemini 1000 genomes frequencies Mappability Clinical information