The Genome Analysis Toolkit
A MapReduce framework for analyzing next-generation
DNA sequencing data




  Matt Hanna and Mark DePristo
  Genome Sequencing and Analysis Group
  Medical and Population Genetics Program
  Broad Institute of Harvard and MIT
The Genome Analysis Toolkit
    Agenda



    •  GATK Overview and Concepts
    •  GATK Workflow
    •  Example: A Simple Bayesian Genotyper




GATK: Overview and Concepts
    Motivation



           Figure: Coverage in xMHC region of JPT individuals




        •  Dataset size greatly increases analysis complexity.
        •  Implementation issues can prematurely terminate
           long-running jobs or introduce subtle bugs.

GATK: Overview
    Simplifying the process of writing analysis tools for resequencing data


    •  The framework is designed to support most common paradigms of analysis algorithms
         –  Provides structured access to reads in BAM format, reference context, as well as reference-associated metadata
    •  General-purpose
         –  Optimized for ease of use and completeness of functionality within scope
    •  Efficient
         –  Engineering investment on performance of critical data structures and manipulation routines
    •  Convenient
         –  Structured plug-in model makes developing in Java against the framework relatively pain-free

GATK: Overview
    The MapReduce design philosophy



         Data elements:                  a   b   c   d   e
         Map, X = f(x):                  A   B   C   D   E     Operations are independent of each other
         Reduce, R = r(A, r(B, …, E)):   R                     Result depends on all sites

     Result is:

            Map              Function f applied to each element of list

          Reduce             Function r recursively reduced over each f(…)
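
The map and reduce definitions above can be sketched in a few lines of Java. This is only an illustration of the general pattern, not GATK code: the class name `MapReduceSketch`, the helper `reduceFrom`, and the string-length example are assumptions made for this sketch.

```java
import java.util.List;
import java.util.function.BinaryOperator;
import java.util.function.Function;

final class MapReduceSketch {
    // Apply f to every element independently, then combine the mapped
    // values right-to-left as R = r(A, r(B, ..., E)).
    static <X, Y> Y mapReduce(List<X> data, Function<X, Y> f, BinaryOperator<Y> r) {
        if (data.isEmpty()) {
            throw new IllegalArgumentException("need at least one element");
        }
        return reduceFrom(data, 0, f, r);
    }

    private static <X, Y> Y reduceFrom(List<X> data, int i,
                                       Function<X, Y> f, BinaryOperator<Y> r) {
        Y mapped = f.apply(data.get(i));   // map step: depends only on element i
        if (i == data.size() - 1) {
            return mapped;                 // last element: nothing left to combine
        }
        return r.apply(mapped, reduceFrom(data, i + 1, f, r)); // reduce step
    }

    public static void main(String[] args) {
        // Hypothetical example: map each element to its length, reduce by summing.
        List<String> elements = List.of("a", "bb", "ccc", "dddd", "eeeee");
        int total = mapReduce(elements, String::length, Integer::sum);
        System.out.println(total); // 1 + 2 + 3 + 4 + 5 = 15
    }
}
```

Because each call to `f` touches a single element, the map step could be run in any order; only the reduce step ties the results together.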


GATK: Overview
    Rapid development of efficient and robust analysis tools




           Provides the boilerplate code required to perform any NGS analysis.

           Diagram components (top to bottom):
              Genome Analysis Toolkit (GATK) infrastructure
              Traversal engine
              Analysis tool

           Legend: Provided by framework / Implemented by user
GATK: Workflow
    Introduction



    •  GATK Overview and Concepts
    •  GATK Workflow
       •  An example of one of the GATK’s most common workflows
       •  Data access pattern: by locus
       •  Inputs: reads, reference, dbSNP
    •  Example: A Simple Bayesian Genotyper


GATK: Workflow
     The sharding system: dividing data into processor-sized pieces




    Reads
    Reference
    dbSNP




                •  Divides data into small chunks that can be
                   processed independently
                •  Handles extraction of subsets of data
                •  Groups small intervals together to avoid
                   repetitive decompression
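
As a rough illustration of the first bullet only, the sketch below splits one genomic interval into fixed-size chunks. It is not the GATK sharding code: `ShardingSketch`, `Interval`, `shard`, and the shard size of 1000 bases are all hypothetical, and the grouping of small intervals to avoid repeated decompression is not modeled.

```java
import java.util.ArrayList;
import java.util.List;

final class ShardingSketch {
    // Closed interval [start, stop] on one contig, using 1-based coordinates.
    static final class Interval {
        final String contig;
        final int start;
        final int stop;

        Interval(String contig, int start, int stop) {
            this.contig = contig;
            this.start = start;
            this.stop = stop;
        }

        @Override
        public String toString() {
            return contig + ":" + start + "-" + stop;
        }
    }

    // Split a region into consecutive shards of at most shardSize bases each.
    static List<Interval> shard(Interval region, int shardSize) {
        if (shardSize <= 0) {
            throw new IllegalArgumentException("shardSize must be positive");
        }
        List<Interval> shards = new ArrayList<>();
        for (int start = region.start; start <= region.stop; start += shardSize) {
            int stop = Math.min(start + shardSize - 1, region.stop);
            shards.add(new Interval(region.contig, start, stop));
        }
        return shards;
    }

    public static void main(String[] args) {
        // Prints chr1:1-1000, chr1:1001-2000, chr1:2001-2500 on separate lines.
        for (Interval s : shard(new Interval("chr1", 1, 2500), 1000)) {
            System.out.println(s);
        }
    }
}
```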



GATK: Workflow
    Traversal engines: preparing data for processing




                   Builds data structures
                   easily consumed by the
                          analysis




GATK: Workflow
     Interaction between sharding system and traversal engines




     •  Datasets are split into shards, which can be processed sequentially or in parallel
     •  When processing sequentially, the reduce value of each shard is used to
        bootstrap the next shard.
     •  When processing in parallel, the result of each shard is computed independently
        and then “tree-reduced” together.
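
The two modes can be compared with a small sketch. It assumes a reduce that simply sums per-locus integers, and it simulates the parallel case on one thread; `ShardReduceSketch`, `sequential`, and `treeReduced` are hypothetical names, not GATK API.

```java
import java.util.ArrayList;
import java.util.List;

final class ShardReduceSketch {
    // Sequential mode: the running reduce value from one shard seeds the next shard.
    static long sequential(List<int[]> shards) {
        long sum = 0L; // initial reduce value
        for (int[] shard : shards) {
            for (int value : shard) {
                sum += value; // continues from the previous shard's result
            }
        }
        return sum;
    }

    // Parallel-style mode: every shard is reduced from a fresh initial value,
    // then the per-shard results are combined pairwise (a tree reduce).
    static long treeReduced(List<int[]> shards) {
        List<Long> partial = new ArrayList<>();
        for (int[] shard : shards) {
            long sum = 0L;
            for (int value : shard) {
                sum += value;
            }
            partial.add(sum);
        }
        while (partial.size() > 1) {
            List<Long> next = new ArrayList<>();
            for (int i = 0; i < partial.size(); i += 2) {
                // Combine neighbours; an unpaired last element is carried up unchanged.
                next.add(i + 1 < partial.size()
                        ? partial.get(i) + partial.get(i + 1)
                        : partial.get(i));
            }
            partial = next;
        }
        return partial.isEmpty() ? 0L : partial.get(0);
    }

    public static void main(String[] args) {
        // Hypothetical map output: 1 where a call was made at a locus, 0 otherwise.
        List<int[]> shards = List.of(new int[] {1, 0, 1}, new int[] {1, 1}, new int[] {0}, new int[] {1});
        System.out.println(sequential(shards));  // 5
        System.out.println(treeReduced(shards)); // 5
    }
}
```

Both modes agree here because addition is associative; a reduce that depends on the order in which shards are combined would not be safe to tree-reduce.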

GATK: Workflow
     Walkers: Analyses written by end-users


      [Diagram: tracks labeled dbsnp, exons, ref, and reads; a column of
       bases (A, A, C, C, A, C) at one locus; analysis tool]

     •  Walkers (analyses) can easily be written by end users. The GATK is
        distributed with a significant library of walkers.
     •  Only the reads, reference, and reference metadata applicable to a
        single-base location are presented to the analysis tool.
     •  The GATK provides tools to filter the pileup automatically or on demand.


GATK: Workflow
     Other data access patterns


     Other data access patterns:
           Traversal Type      Description
           Reads               Call map per read, along with the reference
                               and reference-ordered metadata spanning
                               that read.
           Duplicates          Call map for each set of duplicate reads.
           Read pair (naïve)   Call map for each read and its mate (naïve,
                               requires the input BAM to be sorted in
                               query name order).


        Straightforward (but not necessarily easy) to add any new
        access pattern involving streaming data.




GATK: Additional features
     Additional inputs and outputs


         Reference metadata
         •    Support for additional input data that is sorted in reference
              order can easily be added to the GATK.
         •    Input types can be added by creating two new classes: a
              feature (data access object) and a codec (parser).
         •    New file formats are indexed automatically.
         •    New data types are autodiscovered via a classpath search.
         •    Joint initiative with IGV.
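
The feature-plus-codec split can be pictured with the sketch below. The interfaces `SketchFeature` and `SketchCodec`, the `NamedRegion` type, and the four-column tab-separated line format are all hypothetical stand-ins invented for illustration; the real GATK interfaces and their signatures are not shown in these slides.

```java
final class FeatureCodecSketch {
    // Hypothetical data access object: one record of reference-ordered data.
    interface SketchFeature {
        String contig();
        int start();
        int end();
    }

    // Hypothetical parser contract: turn one line of text into a feature.
    interface SketchCodec<T extends SketchFeature> {
        T decode(String line);
    }

    static final class NamedRegion implements SketchFeature {
        private final String contig;
        private final int start;
        private final int end;
        private final String name;

        NamedRegion(String contig, int start, int end, String name) {
            this.contig = contig;
            this.start = start;
            this.end = end;
            this.name = name;
        }

        @Override public String contig() { return contig; }
        @Override public int start() { return start; }
        @Override public int end() { return end; }
        String name() { return name; }
    }

    // Parses an assumed format: contig <TAB> start <TAB> end <TAB> name.
    static final class NamedRegionCodec implements SketchCodec<NamedRegion> {
        @Override
        public NamedRegion decode(String line) {
            String[] fields = line.split("\t");
            if (fields.length < 4) {
                throw new IllegalArgumentException("expected 4 tab-separated fields: " + line);
            }
            return new NamedRegion(fields[0], Integer.parseInt(fields[1]),
                                   Integer.parseInt(fields[2]), fields[3]);
        }
    }

    public static void main(String[] args) {
        NamedRegion region = new NamedRegionCodec().decode("chr1\t100\t200\texonA");
        System.out.println(region.contig() + " " + region.start() + " " + region.end() + " " + region.name());
    }
}
```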


         Additional I/O
         •    Analysis parameters can be added to a walker by annotating a
              field in the walker with an @Argument annotation.
         •    Command-line argument types can become very sophisticated.
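
One way an annotation-driven argument system can work is sketched below with plain Java reflection. The annotation `SketchArgument`, the class `ToyWalker`, and the `populate` helper are hypothetical and only mirror the attribute names used on the genotyper slide; the actual GATK argument parser is not described in this deck and is certainly more capable.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;

final class ArgumentSketch {
    // Hypothetical stand-in for an argument annotation, kept at runtime for reflection.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.FIELD)
    @interface SketchArgument {
        String fullName();
        String shortName();
        String doc();
        boolean required() default true;
    }

    static final class ToyWalker {
        @SketchArgument(fullName = "log_odds_score", shortName = "LOD",
                        doc = "The LOD threshold", required = false)
        private double lodScore = 3.0;

        double lodScore() { return lodScore; }
    }

    // Fill every annotated double field that is named on the command line.
    static void populate(Object target, String[] args) {
        for (Field field : target.getClass().getDeclaredFields()) {
            SketchArgument meta = field.getAnnotation(SketchArgument.class);
            if (meta == null || field.getType() != double.class) {
                continue; // this sketch only handles annotated double fields
            }
            boolean found = false;
            for (int i = 0; i + 1 < args.length; i++) {
                if (args[i].equals("--" + meta.fullName()) || args[i].equals("-" + meta.shortName())) {
                    try {
                        field.setAccessible(true);
                        field.setDouble(target, Double.parseDouble(args[i + 1]));
                    } catch (IllegalAccessException e) {
                        throw new IllegalStateException(e);
                    }
                    found = true;
                }
            }
            if (meta.required() && !found) {
                throw new IllegalArgumentException("missing --" + meta.fullName());
            }
        }
    }

    public static void main(String[] args) {
        ToyWalker walker = new ToyWalker();
        populate(walker, new String[] {"--log_odds_score", "5.0"});
        System.out.println(walker.lodScore()); // 5.0
    }
}
```

When the argument is absent and not required, the field keeps its default value, which matches the `= 3.0` initializer style used on the genotyper slide.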




Walkers: Example
     A simple Bayesian genotyper



     •  GATK Overview and Concepts
     •  GATK Workflow
     •  Example: A Simple Bayesian Genotyper
         •  A functional genotyper in under 150 lines of code
         •  A minimal example: calls are much lower in quality than the UnifiedGenotyper




Walkers: Example
     A simple Bayesian genotyper: the model


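     Bayesian model:

     $$L(G \mid D) = P(G)\,P(D \mid G), \qquad P(D \mid G) = \prod_{b \in \{\text{good bases}\}} P(b \mid G)$$

     L(G | D): likelihood for the genotype. P(G): prior for the genotype.
     P(D | G): likelihood of the data given the genotype, computed with the
     independent base model.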

     •  Likelihood of data computed using pileup of bases and associated quality scores at given locus
     •  Only “good bases” are included: those satisfying minimum base quality, mapping read quality, pair mapping quality, NQS
     •  L(G|D) computed for all 10 genotypes

                See http://www.broadinstitute.org/gsa/wiki/index.php/Unified_genotyper
                                    for a more complete approach
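
The per-base term P(b | G) is not spelled out on this slide. Reading it off the walker code shown later in the deck, and assuming a diploid genotype so that the code's division by the genotype length is a division by 2, it can be written as below. The worked numbers use an illustrative base quality of Q = 20, which is an assumption for the example only.

```latex
\begin{align*}
\epsilon_b &= 10^{-Q_b/10} \\
P(b \mid G) &= \frac{1}{2} \sum_{r \in G}
  \begin{cases}
    1 - \epsilon_b & \text{if } r = b \\
    \epsilon_b / 3 & \text{otherwise}
  \end{cases} \\[6pt]
\text{With } Q_b = 20,\ \epsilon_b = 0.01,\ b = \mathrm{A}: \quad
P(\mathrm{A} \mid \mathrm{AA}) &= \tfrac{1}{2}(0.99 + 0.99) = 0.99 \\
P(\mathrm{A} \mid \mathrm{AC}) &= \tfrac{1}{2}\left(0.99 + \tfrac{0.01}{3}\right) \approx 0.4967 \\
P(\mathrm{A} \mid \mathrm{CC}) &= \tfrac{1}{2}\left(\tfrac{0.01}{3} + \tfrac{0.01}{3}\right) \approx 0.0033
\end{align*}
```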

Walkers: Example
     A simple Bayesian genotyper


     •    Walker specifies the data access pattern and
          declares command-line arguments.
     •    Inheritance defines traversal type.
     •    Annotation defines command-line argument.

     public class GATKPaperGenotyper extends LocusWalker<Integer,Long> {	

          @Argument(fullName = "log_odds_score", 	
                    shortName = "LOD", 	
                    doc = "The LOD threshold", 	
                    required = false)	
          private double LODScore = 3.0;	




Walkers: Example
     A simple Bayesian genotyper


     •  Walker prepares the input dataset.
     •  ReadBackedPileup utility can be used to filter pileup on
        demand.
         public Integer map(RefMetaDataTracker tracker,	
                            ReferenceContext ref,	
                            AlignmentContext context) {	

           double likelihoods[] =	
                DiploidGenotypePriors.getReferencePolarizedPrior(	
                          ref.getBase(),	
                          DiploidGenotypePriors.HUMAN_HETEROZYGOSITY,	
                          0.01);	

           // get the bases and qualities from the pileup           	
           ReadBackedPileup pileup = context.getBasePileup().	
                getPileupWithoutMappingQualityZeroReads();	
           byte bases[] = pileup.getBases();	
           byte quals[] = pileup.getQuals();	
           …	




Walkers: Example
     A simple Bayesian genotyper


     •  Calculate the likelihood for each possible genotype.
     •  Determine the best of the calculated genotypes.


         for (GENOTYPE genotype : GENOTYPE.values())
           for (int index = 0; index < bases.length; index++) {
             // our epsilon is the de-Phred scored base quality
             double epsilon = Math.pow(10, quals[index] / -10.0);

             byte pileupBase = bases[index];
             double p = 0;
             // sum the probability of seeing this base from each allele of the genotype
             for (char r : genotype.toString().toCharArray())
               p += r == pileupBase ? 1 - epsilon : epsilon / 3;
             // average over the alleles and accumulate in log10 space
             likelihoods[genotype.ordinal()] += Math.log10(p / genotype.length());
           }

         Integer sortedList[] = MathUtils.sortPermutation(likelihoods);	
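
For readers who want to run the calculation outside the GATK, here is a standalone sketch of the same loop. It is not the walker itself: the class `GenotypeLikelihoodSketch`, the `GENOTYPES` array, and the `best` helper are hypothetical, priors are left out (a flat prior is assumed), and a plain argmax replaces the sort.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class GenotypeLikelihoodSketch {
    // The 10 unordered diploid genotypes over {A, C, G, T}.
    static final String[] GENOTYPES = {"AA", "AC", "AG", "AT", "CC", "CG", "CT", "GG", "GT", "TT"};

    // log10 P(D | G) for each genotype under the independent base model.
    static double[] log10Likelihoods(byte[] bases, byte[] quals) {
        double[] result = new double[GENOTYPES.length];
        for (int g = 0; g < GENOTYPES.length; g++) {
            for (int i = 0; i < bases.length; i++) {
                double epsilon = Math.pow(10, quals[i] / -10.0); // de-Phred the base quality
                double p = 0;
                for (char allele : GENOTYPES[g].toCharArray()) {
                    p += allele == bases[i] ? 1 - epsilon : epsilon / 3;
                }
                result[g] += Math.log10(p / GENOTYPES[g].length());
            }
        }
        return result;
    }

    // Genotype with the highest log10 likelihood (first one wins on ties).
    static String best(double[] log10Likelihoods) {
        int argmax = 0;
        for (int g = 1; g < log10Likelihoods.length; g++) {
            if (log10Likelihoods[g] > log10Likelihoods[argmax]) {
                argmax = g;
            }
        }
        return GENOTYPES[argmax];
    }

    public static void main(String[] args) {
        // Hypothetical pileup: three A and three C bases, all at base quality 30.
        byte[] bases = "AACCAC".getBytes(StandardCharsets.US_ASCII);
        byte[] quals = new byte[bases.length];
        Arrays.fill(quals, (byte) 30);
        System.out.println(best(log10Likelihoods(bases, quals))); // AC
    }
}
```

With an even mix of A and C, the heterozygous genotype pays about log10(0.5) per base, while either homozygous genotype pays about log10(epsilon / 3) for each mismatching base, so AC is selected.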




Walkers: Example
     A simple Bayesian genotyper


     •  Conditionally output the results.
     •  Use reduce to calculate number of genotypes called.
     •  Writing to provided output stream is guaranteed to be
        thread-safe.
          …	
               if (lod > LODScore) {
                  // tab-separated: location, genotype, LOD score, reference base
                  out.printf("%s\t%s\t%.4f\t%c%n", context.getLocation(),
                             selectedGenotype, lod, (char)ref.getBase());
                  return 1;
               }
               return 0; // assumed: nothing is counted when no call is made
          }
          // end of map() function

          public Long reduce(Integer value, Long sum) {
             return value + sum;
          }

          public void onTraversalDone(Long result) {
             out.printf("Simple Genotyper genotyped %d loci.", result);
          }


Walkers: Threading performance
     A simple Bayesian genotyper




     GATK performance improves nearly linearly as processors are added




Genome Analysis Toolkit
     1000 Genomes Project


     •  Supports any BAM-compatible aligner
     •  All of these tools have been developed in the GATK
     •  They are memory and CPU efficient, cluster friendly and are easily parallelized
     •  They are now publicly available and are being used at many sites around the world

     Pipeline stages (diagram): Initial alignment, MSA realignment, Q-score recalibration, Base error modeling, Genotyping, SNP filtering

     More info: http://www.broadinstitute.org/gsa/wiki/
     Support  : http://www.getsatisfaction.com/gsa/
Acknowledgments
                                         	
  
       Genome sequencing and           Broad postdocs, staff,   1000 Genomes project
         analysis group (MPG)                and faculty         In general but notably:
     Kiran Garimella (Analysis Lead)     Anthony Philippakis           Matt Hurles
             Michael Melgar               Vineeta Agarwala           Philip Awadalla
                Chris Hartl                  Manny Rivas             Richard Durbin
              Sherman Jia                   Jared Maguire          Goncalo Abecasis
     Eric Banks (Development lead)         Carrie Sougnez            Richard Gibbs
              Ryan Poplin                    David Jaffe              Gabor Marth
           Guillermo del Angel             Nick Patterson            Thomas Keane
            Aaron McKenna                  Steve Schaffner             Gil McVean
              Khalid Shakir               Shamil Sunyaev              Gerton Lunter
             Brett Thomas                  Paul de Bakker                Heng Li
              Corin Boyko
                                        Copy number group          Cancer genome
                                          Bob Handsaker                analysis
     Genome Sequencing Platform             Jim Nemesh             Kristian Cibulskis
          In general but notably:            Josh Korn            Andrey Sivachenko
             Lauren Ambrogio              Steve McCarroll              Gad Getz
       Illumina Production Team
               Tim Fennell             Integrative Genomics
             Kathleen Tibbetts              Viewer (IGV)          MPG directorship
              Alec Wysoker                  Jim Robinson           Stacey Gabriel
              Ben Weisburd                 Jesse Whitworth         David Altshuler
               Toby Bloom               Helga Thorvaldsdottir        Mark Daly

Weitere ähnliche Inhalte

Was ist angesagt?

HadoopXML: A Suite for Parallel Processing of Massive XML Data with Multiple ...
HadoopXML: A Suite for Parallel Processing of Massive XML Data with Multiple ...HadoopXML: A Suite for Parallel Processing of Massive XML Data with Multiple ...
HadoopXML: A Suite for Parallel Processing of Massive XML Data with Multiple ...Kyong-Ha Lee
 
Next generation analytics with yarn, spark and graph lab
Next generation analytics with yarn, spark and graph labNext generation analytics with yarn, spark and graph lab
Next generation analytics with yarn, spark and graph labImpetus Technologies
 
Beyond Map/Reduce: Getting Creative With Parallel Processing
Beyond Map/Reduce: Getting Creative With Parallel ProcessingBeyond Map/Reduce: Getting Creative With Parallel Processing
Beyond Map/Reduce: Getting Creative With Parallel ProcessingEd Kohlwey
 
IRJET- Review of Existing Methods in K-Means Clustering Algorithm
IRJET- Review of Existing Methods in K-Means Clustering AlgorithmIRJET- Review of Existing Methods in K-Means Clustering Algorithm
IRJET- Review of Existing Methods in K-Means Clustering AlgorithmIRJET Journal
 
Implementation of p pic algorithm in map reduce to handle big data
Implementation of p pic algorithm in map reduce to handle big dataImplementation of p pic algorithm in map reduce to handle big data
Implementation of p pic algorithm in map reduce to handle big dataeSAT Publishing House
 

Was ist angesagt? (7)

3rd presentation
3rd presentation3rd presentation
3rd presentation
 
HadoopXML: A Suite for Parallel Processing of Massive XML Data with Multiple ...
HadoopXML: A Suite for Parallel Processing of Massive XML Data with Multiple ...HadoopXML: A Suite for Parallel Processing of Massive XML Data with Multiple ...
HadoopXML: A Suite for Parallel Processing of Massive XML Data with Multiple ...
 
Zaharia spark-scala-days-2012
Zaharia spark-scala-days-2012Zaharia spark-scala-days-2012
Zaharia spark-scala-days-2012
 
Next generation analytics with yarn, spark and graph lab
Next generation analytics with yarn, spark and graph labNext generation analytics with yarn, spark and graph lab
Next generation analytics with yarn, spark and graph lab
 
Beyond Map/Reduce: Getting Creative With Parallel Processing
Beyond Map/Reduce: Getting Creative With Parallel ProcessingBeyond Map/Reduce: Getting Creative With Parallel Processing
Beyond Map/Reduce: Getting Creative With Parallel Processing
 
IRJET- Review of Existing Methods in K-Means Clustering Algorithm
IRJET- Review of Existing Methods in K-Means Clustering AlgorithmIRJET- Review of Existing Methods in K-Means Clustering Algorithm
IRJET- Review of Existing Methods in K-Means Clustering Algorithm
 
Implementation of p pic algorithm in map reduce to handle big data
Implementation of p pic algorithm in map reduce to handle big dataImplementation of p pic algorithm in map reduce to handle big data
Implementation of p pic algorithm in map reduce to handle big data
 

Andere mochten auch

The Maridien MD with DOR
The Maridien MD with DORThe Maridien MD with DOR
The Maridien MD with DORMarifil Ramirez
 
Bader bosc2010 cytoweb
Bader bosc2010 cytowebBader bosc2010 cytoweb
Bader bosc2010 cytowebBOSC 2010
 
C:\fakepath\消費者行動論(小松崎班)beta
C:\fakepath\消費者行動論(小松崎班)betaC:\fakepath\消費者行動論(小松崎班)beta
C:\fakepath\消費者行動論(小松崎班)betayahohsoaho
 
最好的東西是免費的
最好的東西是免費的最好的東西是免費的
最好的東西是免費的t828vp
 
Library resources at your fingertips 2012 slide share
Library resources at your fingertips 2012 slide share Library resources at your fingertips 2012 slide share
Library resources at your fingertips 2012 slide share Naz Torabi
 
The Intellectual Property Quagmire, or, The Perils of Libertarian Creationism
The Intellectual Property Quagmire, or, The Perils of Libertarian CreationismThe Intellectual Property Quagmire, or, The Perils of Libertarian Creationism
The Intellectual Property Quagmire, or, The Perils of Libertarian CreationismStephan Kinsella
 
Finesse 12 18th aug 2013
Finesse 12 18th aug 2013Finesse 12 18th aug 2013
Finesse 12 18th aug 2013Rishi Kashyap
 
Social Media London Presentation 5th April 2011
Social Media London Presentation 5th April 2011Social Media London Presentation 5th April 2011
Social Media London Presentation 5th April 2011iohann Le Frapper
 
Educational Model to Illustrate HIV Infection Cycle
Educational Model to Illustrate HIV Infection CycleEducational Model to Illustrate HIV Infection Cycle
Educational Model to Illustrate HIV Infection Cyclekcmurphy3
 
мифы и правда об инвестировании
мифы и правда об инвестированиимифы и правда об инвестировании
мифы и правда об инвестированииАльберт Коррч
 
Conhecendo os netbooks 2º A Prof Eliane
Conhecendo os netbooks 2º A Prof ElianeConhecendo os netbooks 2º A Prof Eliane
Conhecendo os netbooks 2º A Prof Elianedalvanice
 
How digital can deliver your business goals
How digital can deliver your business goalsHow digital can deliver your business goals
How digital can deliver your business goalsChris Woods
 

Andere mochten auch (20)

The Maridien MD with DOR
The Maridien MD with DORThe Maridien MD with DOR
The Maridien MD with DOR
 
Bm1 pmr 2009
Bm1 pmr 2009Bm1 pmr 2009
Bm1 pmr 2009
 
Bader bosc2010 cytoweb
Bader bosc2010 cytowebBader bosc2010 cytoweb
Bader bosc2010 cytoweb
 
Fishy Bitto Blutto Laddy
Fishy Bitto Blutto Laddy Fishy Bitto Blutto Laddy
Fishy Bitto Blutto Laddy
 
Hey
HeyHey
Hey
 
Trouble shooting
Trouble shootingTrouble shooting
Trouble shooting
 
C:\fakepath\消費者行動論(小松崎班)beta
C:\fakepath\消費者行動論(小松崎班)betaC:\fakepath\消費者行動論(小松崎班)beta
C:\fakepath\消費者行動論(小松崎班)beta
 
最好的東西是免費的
最好的東西是免費的最好的東西是免費的
最好的東西是免費的
 
Library resources at your fingertips 2012 slide share
Library resources at your fingertips 2012 slide share Library resources at your fingertips 2012 slide share
Library resources at your fingertips 2012 slide share
 
The Intellectual Property Quagmire, or, The Perils of Libertarian Creationism
The Intellectual Property Quagmire, or, The Perils of Libertarian CreationismThe Intellectual Property Quagmire, or, The Perils of Libertarian Creationism
The Intellectual Property Quagmire, or, The Perils of Libertarian Creationism
 
Finesse 12 18th aug 2013
Finesse 12 18th aug 2013Finesse 12 18th aug 2013
Finesse 12 18th aug 2013
 
Social Media London Presentation 5th April 2011
Social Media London Presentation 5th April 2011Social Media London Presentation 5th April 2011
Social Media London Presentation 5th April 2011
 
Educational Model to Illustrate HIV Infection Cycle
Educational Model to Illustrate HIV Infection CycleEducational Model to Illustrate HIV Infection Cycle
Educational Model to Illustrate HIV Infection Cycle
 
CherryAideFDN
CherryAideFDNCherryAideFDN
CherryAideFDN
 
мифы и правда об инвестировании
мифы и правда об инвестированиимифы и правда об инвестировании
мифы и правда об инвестировании
 
INTEF
INTEFINTEF
INTEF
 
1.2 Lisa French
1.2 Lisa French1.2 Lisa French
1.2 Lisa French
 
Cau kien 36 70
Cau kien 36 70Cau kien 36 70
Cau kien 36 70
 
Conhecendo os netbooks 2º A Prof Eliane
Conhecendo os netbooks 2º A Prof ElianeConhecendo os netbooks 2º A Prof Eliane
Conhecendo os netbooks 2º A Prof Eliane
 
How digital can deliver your business goals
How digital can deliver your business goalsHow digital can deliver your business goals
How digital can deliver your business goals
 

Ähnlich wie Hanna bosc2010

Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習 Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習 Herman Wu
 
Greg Hogan – To Petascale and Beyond- Apache Flink in the Clouds
Greg Hogan – To Petascale and Beyond- Apache Flink in the CloudsGreg Hogan – To Petascale and Beyond- Apache Flink in the Clouds
Greg Hogan – To Petascale and Beyond- Apache Flink in the CloudsFlink Forward
 
The Performance of MapReduce: An In-depth Study
The Performance of MapReduce: An In-depth StudyThe Performance of MapReduce: An In-depth Study
The Performance of MapReduce: An In-depth StudyKevin Tong
 
Pycvf
PycvfPycvf
Pycvftranx
 
Large scale computing with mapreduce
Large scale computing with mapreduceLarge scale computing with mapreduce
Large scale computing with mapreducehansen3032
 
Large Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache SparkLarge Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache SparkCloudera, Inc.
 
2012 sept 18_thug_biotech
2012 sept 18_thug_biotech2012 sept 18_thug_biotech
2012 sept 18_thug_biotechAdam Muise
 
Dynamically Optimizing Queries over Large Scale Data Platforms
Dynamically Optimizing Queries over Large Scale Data PlatformsDynamically Optimizing Queries over Large Scale Data Platforms
Dynamically Optimizing Queries over Large Scale Data PlatformsINRIA-OAK
 
Fedbench - A Benchmark Suite for Federated Semantic Data Processing
Fedbench - A Benchmark Suite for Federated Semantic Data ProcessingFedbench - A Benchmark Suite for Federated Semantic Data Processing
Fedbench - A Benchmark Suite for Federated Semantic Data ProcessingPeter Haase
 
Mapreduce script
Mapreduce scriptMapreduce script
Mapreduce scriptHaripritha
 
Microsoft R - Data Science at Scale
Microsoft R - Data Science at ScaleMicrosoft R - Data Science at Scale
Microsoft R - Data Science at ScaleSascha Dittmann
 
LDBC 8th TUC Meeting: Introduction and status update
LDBC 8th TUC Meeting: Introduction and status updateLDBC 8th TUC Meeting: Introduction and status update
LDBC 8th TUC Meeting: Introduction and status updateLDBC council
 
HBaseCon 2012 | Building a Large Search Platform on a Shoestring Budget
HBaseCon 2012 | Building a Large Search Platform on a Shoestring BudgetHBaseCon 2012 | Building a Large Search Platform on a Shoestring Budget
HBaseCon 2012 | Building a Large Search Platform on a Shoestring BudgetCloudera, Inc.
 
Large-Scale Text Processing Pipeline with Spark ML and GraphFrames: Spark Sum...
Large-Scale Text Processing Pipeline with Spark ML and GraphFrames: Spark Sum...Large-Scale Text Processing Pipeline with Spark ML and GraphFrames: Spark Sum...
Large-Scale Text Processing Pipeline with Spark ML and GraphFrames: Spark Sum...Spark Summit
 
MapReduce
MapReduceMapReduce
MapReduceKavyaGo
 
Scaling Application on High Performance Computing Clusters and Analysis of th...
Scaling Application on High Performance Computing Clusters and Analysis of th...Scaling Application on High Performance Computing Clusters and Analysis of th...
Scaling Application on High Performance Computing Clusters and Analysis of th...Rusif Eyvazli
 

Ähnlich wie Hanna bosc2010 (20)

Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習 Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
 
Yarn spark next_gen_hadoop_8_jan_2014
Yarn spark next_gen_hadoop_8_jan_2014Yarn spark next_gen_hadoop_8_jan_2014
Yarn spark next_gen_hadoop_8_jan_2014
 
Greg Hogan – To Petascale and Beyond- Apache Flink in the Clouds
Greg Hogan – To Petascale and Beyond- Apache Flink in the CloudsGreg Hogan – To Petascale and Beyond- Apache Flink in the Clouds
Greg Hogan – To Petascale and Beyond- Apache Flink in the Clouds
 
The Performance of MapReduce: An In-depth Study
The Performance of MapReduce: An In-depth StudyThe Performance of MapReduce: An In-depth Study
The Performance of MapReduce: An In-depth Study
 
Pycvf
PycvfPycvf
Pycvf
 
Large scale computing with mapreduce
Large scale computing with mapreduceLarge scale computing with mapreduce
Large scale computing with mapreduce
 
Large Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache SparkLarge Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache Spark
 
2012 sept 18_thug_biotech
2012 sept 18_thug_biotech2012 sept 18_thug_biotech
2012 sept 18_thug_biotech
 
Dynamically Optimizing Queries over Large Scale Data Platforms
Dynamically Optimizing Queries over Large Scale Data PlatformsDynamically Optimizing Queries over Large Scale Data Platforms
Dynamically Optimizing Queries over Large Scale Data Platforms
 
Fedbench - A Benchmark Suite for Federated Semantic Data Processing
Fedbench - A Benchmark Suite for Federated Semantic Data ProcessingFedbench - A Benchmark Suite for Federated Semantic Data Processing
Fedbench - A Benchmark Suite for Federated Semantic Data Processing
 
Mapreduce script
Mapreduce scriptMapreduce script
Mapreduce script
 
Streams on wires
Streams on wiresStreams on wires
Streams on wires
 
Microsoft R - Data Science at Scale
Microsoft R - Data Science at ScaleMicrosoft R - Data Science at Scale
Microsoft R - Data Science at Scale
 
LDBC 8th TUC Meeting: Introduction and status update
LDBC 8th TUC Meeting: Introduction and status updateLDBC 8th TUC Meeting: Introduction and status update
LDBC 8th TUC Meeting: Introduction and status update
 
HBaseCon 2012 | Building a Large Search Platform on a Shoestring Budget
HBaseCon 2012 | Building a Large Search Platform on a Shoestring BudgetHBaseCon 2012 | Building a Large Search Platform on a Shoestring Budget
HBaseCon 2012 | Building a Large Search Platform on a Shoestring Budget
 
Big Data Analysis Starts with R
Big Data Analysis Starts with RBig Data Analysis Starts with R
Big Data Analysis Starts with R
 
Apache Spark
Apache SparkApache Spark
Apache Spark
 
Large-Scale Text Processing Pipeline with Spark ML and GraphFrames: Spark Sum...
Large-Scale Text Processing Pipeline with Spark ML and GraphFrames: Spark Sum...Large-Scale Text Processing Pipeline with Spark ML and GraphFrames: Spark Sum...
Large-Scale Text Processing Pipeline with Spark ML and GraphFrames: Spark Sum...
 
MapReduce
MapReduceMapReduce
MapReduce
 
Scaling Application on High Performance Computing Clusters and Analysis of th...
Scaling Application on High Performance Computing Clusters and Analysis of th...Scaling Application on High Performance Computing Clusters and Analysis of th...
Scaling Application on High Performance Computing Clusters and Analysis of th...
 

Mehr von BOSC 2010

Mercer bosc2010 microsoft_framework
Mercer bosc2010 microsoft_frameworkMercer bosc2010 microsoft_framework
Mercer bosc2010 microsoft_frameworkBOSC 2010
 
Hanna bosc2010

  • 1. The Genome Analysis Toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data
      Matt Hanna and Mark DePristo
      Genome Sequencing and Analysis Group, Medical and Population Genetics Program, Broad Institute of Harvard and MIT
  • 2. The Genome Analysis Toolkit: Agenda
      – GATK Overview and Concepts
      – GATK Workflow
      – Example: A Simple Bayesian Genotyper
  • 3. GATK: Overview and Concepts. Motivation
      [Figure: Coverage in xMHC region of JPT individuals]
      – Dataset size greatly increases analysis complexity.
      – Implementation issues can prematurely terminate long-running jobs or introduce subtle bugs.
  • 4. GATK: Overview. Simplifying the process of writing analysis tools for resequencing data
      – The framework is designed to support the most common paradigms of analysis algorithms
          – Provides structured access to reads in BAM format, reference context, and reference-associated metadata
      – General-purpose
          – Optimized for ease of use and completeness of functionality within scope
      – Efficient
          – Engineering investment in the performance of critical data structures and manipulation routines
      – Convenient
          – Structured plug-in model makes developing in Java against the framework relatively pain-free
  • 5. GATK: Overview. The MapReduce design philosophy
      – Data elements: a, b, c, d, e
      – Map: X = f(x) yields A, B, C, D, E. The operations are independent of each other.
      – Reduce: R = r(A, R(B, …, E)). The result depends on all sites.
      – The result is the map function f applied to each element of the list, with the reduce function r recursively reduced over each f(…).
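
The map and reduce steps on this slide can be sketched without any framework. The following is a minimal illustration, not GATK code: the class `MapReduceSketch` and the methods `mapReduce` and `countAs` are hypothetical names introduced here. The type parameters mirror the `<Integer, Long>` pairing used by the walker later in the deck (map produces an `Integer`, reduce accumulates a `Long`).

```java
import java.util.List;
import java.util.function.BiFunction;
import java.util.function.Function;

/** Framework-free illustration of map followed by a recursive reduce (hypothetical class). */
final class MapReduceSketch {

    /** Applies map to every element independently, then folds the mapped values with reduce. */
    static <T, M, R> R mapReduce(List<T> elements,
                                 Function<T, M> map,
                                 BiFunction<M, R, R> reduce,
                                 R initial) {
        R result = initial;
        for (T element : elements) {
            M mapped = map.apply(element);          // X = f(x): depends on this element only
            result = reduce.apply(mapped, result);  // R = r(X, R so far): depends on all elements seen
        }
        return result;
    }

    /** Hypothetical example: count the 'A' bases in a list of bases. */
    static long countAs(List<Character> bases) {
        return MapReduceSketch.<Character, Integer, Long>mapReduce(
                bases,
                base -> base == 'A' ? 1 : 0,   // map: 1 if the base is an A, else 0
                (value, sum) -> value + sum,   // reduce: running total
                0L);
    }

    public static void main(String[] args) {
        System.out.println(countAs(List.of('A', 'C', 'A', 'G', 'T')));
    }
}
```

Because each map call sees only its own element, the map step can be distributed freely; only the reduce step has to combine results.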
  • 6. GATK: Overview. Rapid development of efficient and robust analysis tools
      – The Genome Analysis Toolkit (GATK) infrastructure provides the boilerplate code required to perform any NGS analysis.
      – Traversal engine: provided by framework
      – Analysis tool: implemented by user
  • 7. GATK: Workflow. Introduction
      – GATK Overview and Concepts
      – GATK Workflow
          – An example of one of the GATK's most common workflows
          – Data access pattern: by locus
          – Inputs: reads, reference, dbSNP
      – Example: A Simple Bayesian Genotyper
  • 8. GATK: Workflow. The sharding system: dividing data into processor-sized pieces
      [Diagram: Reads, Reference, dbSNP]
      – Divides data into small chunks that can be processed independently
      – Handles extraction of subsets of data
      – Groups small intervals together to avoid repetitive decompression
  • 9. GATK: Workflow. Traversal engines: preparing data for processing
      – Builds data structures that are easily consumed by the analysis
  • 10. GATK: Workflow. Interaction between sharding system and traversal engines
      – Datasets are split into shards, which can be processed sequentially or in parallel.
      – When processing sequentially, the reduce value of each shard is used to bootstrap the next shard.
      – When processing in parallel, the result of each shard is computed independently and then “tree-reduced” together.
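
The two processing modes described above can be sketched as follows. This is an illustration under stated assumptions, not the GATK's sharding code: `ShardReduceSketch`, `shard`, `sequential`, and `treeReduce` are hypothetical names, the per-locus map results are plain integers, and the reduce operation is addition. The tree-reduce loop runs on one thread here; the comment marks where independent workers could take over.

```java
import java.util.ArrayList;
import java.util.List;

/** Sequential versus tree-reduced processing of shards (hypothetical illustration). */
final class ShardReduceSketch {

    /** Splits the data into consecutive shards of at most shardSize elements. */
    static <T> List<List<T>> shard(List<T> data, int shardSize) {
        if (shardSize <= 0) throw new IllegalArgumentException("shardSize must be positive");
        List<List<T>> shards = new ArrayList<>();
        for (int i = 0; i < data.size(); i += shardSize) {
            shards.add(data.subList(i, Math.min(i + shardSize, data.size())));
        }
        return shards;
    }

    /** Sequential mode: the running reduce value from one shard seeds the next shard. */
    static long sequential(List<List<Integer>> shards) {
        long sum = 0L;
        for (List<Integer> shard : shards) {
            for (int value : shard) sum += value;
        }
        return sum;
    }

    /** Parallel mode: reduce each shard independently, then merge the partial results pairwise. */
    static long treeReduce(List<List<Integer>> shards) {
        List<Long> partials = new ArrayList<>();
        for (List<Integer> shard : shards) {   // each iteration could run on its own worker
            long partial = 0L;
            for (int value : shard) partial += value;
            partials.add(partial);
        }
        while (partials.size() > 1) {          // one level of the reduction tree per pass
            List<Long> next = new ArrayList<>();
            for (int i = 0; i < partials.size(); i += 2) {
                long left = partials.get(i);
                long right = i + 1 < partials.size() ? partials.get(i + 1) : 0L;
                next.add(left + right);
            }
            partials = next;
        }
        return partials.isEmpty() ? 0L : partials.get(0);
    }
}
```

The two modes agree here because addition is associative; a reduce function without that property would not give the same answer under tree reduction.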
  • 11. GATK: Workflow. Walkers: Analyses written by end-users
      [Diagram: dbSNP, exon, reference, and read tracks at a single locus feeding the analysis tool]
      – Walkers (analyses) can easily be written by end users. The GATK is distributed with a significant library of walkers.
      – Only the reads, reference, and reference metadata applicable to a single-base location are presented to the analysis tool.
      – The GATK provides tools to filter the pileup automatically or on demand.
  • 12. GATK: Workflow. Other data access patterns
      Traversal type: Description
      – Reads: Call map per read, along with the reference and reference-ordered metadata spanning that read.
      – Duplicates: Call map for each set of duplicate reads.
      – Read pair (naïve): Call map for each read and its mate (naïve, requires the input BAM to be sorted in query name order).
      It is straightforward (but not necessarily easy) to add any new access pattern involving streaming data.
  • 13. GATK: Additional features. Additional inputs and outputs
      Reference metadata
      – Support for additional input data that is sorted in reference order can easily be added to the GATK.
      – Input types can be added by creating two new classes: a feature (data access object) and a codec (parser).
      – New file formats are indexed automatically.
      – New data types are autodiscovered via a classpath search.
      – Joint initiative with IGV.
      Additional I/O
      – Analysis parameters can be added to a walker by annotating a field in the walker with an @Argument annotation.
      – Command-line argument types can become very sophisticated.
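
One way an annotation-driven argument system like this can work is through reflection. The sketch below is an assumption-laden illustration, not the GATK's implementation: the `Argument` annotation defined here is a local stand-in for the GATK's own `@Argument`, and `ToyWalker` and `bind` are hypothetical names. It handles only `double` fields.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.Map;

/** Binding command-line values to annotated fields via reflection (hypothetical illustration). */
final class ArgumentSketch {

    /** Local stand-in annotation; NOT the GATK's own @Argument. */
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.FIELD)
    @interface Argument {
        String fullName();
        String doc() default "";
    }

    /** Hypothetical walker-like class with one annotated parameter and its default value. */
    static final class ToyWalker {
        @Argument(fullName = "log_odds_score", doc = "The LOD threshold")
        double lodScore = 3.0;
    }

    /** Copies already-parsed option values into the matching annotated double fields. */
    static void bind(Object target, Map<String, String> options) {
        for (Field field : target.getClass().getDeclaredFields()) {
            Argument argument = field.getAnnotation(Argument.class);
            if (argument == null || !options.containsKey(argument.fullName())) {
                continue;   // not an argument field, or not supplied: keep the default
            }
            try {
                field.setAccessible(true);
                field.setDouble(target, Double.parseDouble(options.get(argument.fullName())));
            } catch (IllegalAccessException e) {
                throw new IllegalStateException("Cannot set " + field.getName(), e);
            }
        }
    }
}
```

With this design the analysis author only declares a field; parsing, defaults, and documentation live in one place outside the analysis code.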
  • 14. Walkers: Example. A simple Bayesian genotyper
      – GATK Overview and Concepts
      – GATK Workflow
      – Example: A Simple Bayesian Genotyper
          – A functional genotyper in under 150 lines of code
          – A minimal example: its calls are much lower in quality than those of the UnifiedGenotyper
  • 15. Walkers: Example. A simple Bayesian genotyper: the model
      – Bayesian model: $L(G \mid D) = P(G)\, P(D \mid G)$, where $L(G \mid D)$ is the likelihood for the genotype, $P(G)$ is the prior for the genotype, and $P(D \mid G)$ is the likelihood of the data given the genotype.
      – Independent base model: $P(D \mid G) = \prod_{b \in \{\text{good bases}\}} P(b \mid G)$
      – Likelihood of data computed using pileup of bases and associated quality scores at given locus
      – Only “good bases” are included: those satisfying minimum base quality, read mapping quality, pair mapping quality, NQS
      – L(G|D) computed for all 10 genotypes
      See http://www.broadinstitute.org/gsa/wiki/index.php/Unified_genotyper for a more complete approach
  • 16. Walkers: Example. A simple Bayesian genotyper
      – Walker specifies the data access pattern and declares command-line arguments.
      – Inheritance defines traversal type.
      – Annotation defines command-line argument.

        // Extending LocusWalker selects the by-locus traversal; map returns Integer, reduce returns Long
        public class GATKPaperGenotyper extends LocusWalker<Integer,Long> {
            @Argument(fullName = "log_odds_score", shortName = "LOD",
                      doc = "The LOD threshold", required = false)
            private double LODScore = 3.0;
  • 17. Walkers: Example. A simple Bayesian genotyper
      – Walker prepares the input dataset.
      – ReadBackedPileup utility can be used to filter pileup on demand.

            public Integer map(RefMetaDataTracker tracker, ReferenceContext ref, AlignmentContext context) {
                double likelihoods[] = DiploidGenotypePriors.getReferencePolarizedPrior(
                        ref.getBase(), DiploidGenotypePriors.HUMAN_HETEROZYGOSITY, 0.01);
                // get the bases and qualities from the pileup
                ReadBackedPileup pileup = context.getBasePileup().getPileupWithoutMappingQualityZeroReads();
                byte bases[] = pileup.getBases();
                byte quals[] = pileup.getQuals();
                …
  • 18. Walkers: Example. A simple Bayesian genotyper
      – Calculate the likelihood for each possible genotype.
      – Determine the best of the calculated genotypes.

                for (GENOTYPE genotype : GENOTYPE.values())
                    for (int index = 0; index < bases.length; index++) {
                        // our epsilon is the de-Phred scored base quality
                        double epsilon = Math.pow(10, quals[index] / -10.0);
                        byte pileupBase = bases[index];
                        double p = 0;
                        // sum over the alleles of the genotype: match with 1 - epsilon, mismatch with epsilon / 3
                        for (char r : genotype.toString().toCharArray())
                            p += r == pileupBase ? 1 - epsilon : epsilon / 3;
                        likelihoods[genotype.ordinal()] += Math.log10(p / genotype.length());
                    }
                Integer sortedList[] = MathUtils.sortPermutation(likelihoods);
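
The likelihood loop on this slide can be tried out in isolation. The following self-contained sketch reproduces the same per-base arithmetic under simplifying assumptions: genotypes are plain two-letter strings, the prior is flat (the slide code starts from a reference-polarized prior instead), and no base filtering is applied. `GenotypeLikelihoodSketch`, `log10Likelihood`, and `bestGenotype` are hypothetical names, not GATK classes.

```java
/** Stand-alone version of the independent-base likelihood calculation (hypothetical illustration). */
final class GenotypeLikelihoodSketch {

    /** The ten unordered diploid genotypes over A, C, G, T. */
    static final String[] GENOTYPES =
            {"AA", "AC", "AG", "AT", "CC", "CG", "CT", "GG", "GT", "TT"};

    /** Returns log10 P(D | G) for a pileup of bases with Phred-scaled qualities. */
    static double log10Likelihood(String genotype, char[] bases, int[] quals) {
        double total = 0.0;
        for (int i = 0; i < bases.length; i++) {
            // de-Phred the base quality: Q30 becomes an error probability of 0.001
            double epsilon = Math.pow(10, quals[i] / -10.0);
            double p = 0.0;
            for (char allele : genotype.toCharArray()) {
                // a matching allele emits the base with 1 - epsilon; a mismatch with epsilon / 3
                p += allele == bases[i] ? 1 - epsilon : epsilon / 3;
            }
            // average over the alleles of the genotype, accumulate in log space
            total += Math.log10(p / genotype.length());
        }
        return total;
    }

    /** Returns the genotype with the highest likelihood, assuming a flat prior. */
    static String bestGenotype(char[] bases, int[] quals) {
        String best = GENOTYPES[0];
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String genotype : GENOTYPES) {
            double score = log10Likelihood(genotype, bases, quals);
            if (score > bestScore) {
                bestScore = score;
                best = genotype;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        char[] bases = {'A', 'A', 'C', 'C'};
        int[] quals = {30, 30, 30, 30};
        System.out.println(bestGenotype(bases, quals));
    }
}
```

For a pileup split evenly between A and C at Q30, the heterozygote AC wins because each homozygote must explain half of the bases as sequencing errors.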
  • 19. Walkers: Example. A simple Bayesian genotyper
      – Conditionally output the results.
      – Use reduce to calculate number of genotypes called.
      – Writing to provided output stream is guaranteed to be thread-safe.

                …
                if (lod > LODScore)
                    out.printf("%s\t%s\t%.4f\t%c%n", context.getLocation(), selectedGenotype, lod, (char)ref.getBase());
                return 1;
            } } // end of map() function

            public Long reduce(Integer value, Long sum) {
                return value + sum;
            }

            // the traversal result has the reduce type (Long)
            public void onTraversalDone(Long result) {
                out.printf("Simple Genotyper genotyped %d loci.", result);
            }
  • 20. Walkers: Threading performance. A simple Bayesian genotyper
      [Figure: threading performance of the simple Bayesian genotyper]
      – GATK performance improves nearly linearly as processors are added.
  • 21. Genome Analysis Toolkit. 1000 Genomes Project
      Pipeline: Initial alignment, MSA realignment, Q-score recalibration, Base error modeling, Genotyping, SNP filtering
      – Supports any BAM-compatible aligner
      – All of these tools have been developed in the GATK
      – They are memory and CPU efficient, cluster friendly, and easily parallelized
      – They are now publicly available and are being used at many sites around the world
      More info: http://www.broadinstitute.org/gsa/wiki/
      Support: http://www.getsatisfaction.com/gsa/
  • 22. Acknowledgments
      – Genome sequencing and analysis group (MPG): Kiran Garimella (Analysis Lead), Michael Melgar, Chris Hartl, Sherman Jia, Eric Banks (Development lead), Ryan Poplin, Guillermo del Angel, Aaron McKenna, Khalid Shakir, Brett Thomas, Corin Boyko
      – Broad postdocs, staff, and faculty: Anthony Philippakis, Vineeta Agarwala, Manny Rivas, Jared Maguire, Carrie Sougnez, David Jaffe, Nick Patterson, Steve Schaffner, Shamil Sunyaev, Paul de Bakker
      – 1000 Genomes project, in general but notably: Matt Hurles, Philip Awadalla, Richard Durbin, Goncalo Abecasis, Richard Gibbs, Gabor Marth, Thomas Keane, Gil McVean, Gerton Lunter, Heng Li
      – Copy number group: Bob Handsaker, Jim Nemesh, Josh Korn, Steve McCarroll
      – Cancer genome analysis: Kristian Cibulskis, Andrey Sivachenko, Gad Getz
      – Genome Sequencing Platform, in general but notably: Lauren Ambrogio, Illumina Production Team, Tim Fennell, Kathleen Tibbetts, Alec Wysoker, Ben Weisburd, Toby Bloom
      – Integrative Genomics Viewer (IGV): Jim Robinson, Jesse Whitworth, Helga Thorvaldsdottir
      – MPG directorship: Stacey Gabriel, David Altshuler, Mark Daly