SlideShare ist ein Scribd-Unternehmen logo
1 von 66
Downloaden Sie, um offline zu lesen
A Grammar of Graphics for Genomics
                                 The ggbio Package


                                  Michael Lawrence

                                         Genentech


                                   August 29, 2012




Michael Lawrence (Genentech)   A Grammar of Graphics for Genomics   August 29, 2012   1 / 18
Outline




1   Motivation


2   High-level Plots


3   Grammar Components




    Michael Lawrence (Genentech)   A Grammar of Graphics for Genomics   August 29, 2012   2 / 18
Outline




1   Motivation


2   High-level Plots


3   Grammar Components




    Michael Lawrence (Genentech)   A Grammar of Graphics for Genomics   August 29, 2012   3 / 18
Data on the Genome
  • Comes in two flavors:
      • Annotations (genes, TF binding sites, ...)
      • Experimental measurements (sequence reads)
  • Both types are tied to genomic coordinates, providing a common axis
    that permits cross-dataset comparison and inference
  • Typically stored as a table, with the range as a fundamental variable
    type, plus metadata




 Michael Lawrence (Genentech)   A Grammar of Graphics for Genomics   August 29, 2012   4 / 18
Data on the Genome
  • Comes in two flavors:
      • Annotations (genes, TF binding sites, ...)
      • Experimental measurements (sequence reads)
  • Both types are tied to genomic coordinates, providing a common axis
    that permits cross-dataset comparison and inference
  • Typically stored as a table, with the range as a fundamental variable
    type, plus metadata




                  120.928 Mb    120.93 Mb     120.932 Mb    120.934 Mb      120.936 Mb   120.938 Mb




 Michael Lawrence (Genentech)          A Grammar of Graphics for Genomics                  August 29, 2012   4 / 18
Data on the Genome
  • Comes in two flavors:
      • Annotations (genes, TF binding sites, ...)
      • Experimental measurements (sequence reads)
  • Both types are tied to genomic coordinates, providing a common axis
    that permits cross-dataset comparison and inference
  • Typically stored as a table, with the range as a fundamental variable
    type, plus metadata
            60

            50

            40

            30

            20

            10

             0
                    120928000   120930000     120932000    120934000      120936000   120938000




 Michael Lawrence (Genentech)        A Grammar of Graphics for Genomics                August 29, 2012   4 / 18
Data on the Genome
  • Comes in two flavors:
      • Annotations (genes, TF binding sites, ...)
      • Experimental measurements (sequence reads)
  • Both types are tied to genomic coordinates, providing a common axis
    that permits cross-dataset comparison and inference
  • Typically stored as a table, with the range as a fundamental variable
    type, plus metadata
             60
             50
             40
             30
             20
             10
              0




                     120.928 Mb   120.93 Mb    120.932 Mb    120.934 Mb    120.936 Mb   120.938 Mb




 Michael Lawrence (Genentech)         A Grammar of Graphics for Genomics                August 29, 2012   4 / 18
Data on the Genome
  • Comes in two flavors:
      • Annotations (genes, TF binding sites, ...)
      • Experimental measurements (sequence reads)
  • Both types are tied to genomic coordinates, providing a common axis
    that permits cross-dataset comparison and inference
  • Typically stored as a table, with the range as a fundamental variable
    type, plus metadata


 seqnames        start          end    strand   exon id    tx id
 10         120927215     120928045    -        129230     14886,14887
 10         120928689     120928854    -        129229     14886,14887
 10         120931894     120931997    -        129228     14886,14887
 10         120933249     120933384    -        129227     14886,14887
 10         120933963     120934069    -        129226     14886
 10         120933963     120934104    -        119757     14887
 10         120936533     120936665    -        119756     14887
 10         120936552     120936665    -        129225     14886
 10         120938267     120938345    -        129224     14886,14887




 Michael Lawrence (Genentech)         A Grammar of Graphics for Genomics   August 29, 2012   4 / 18
Challenges
Big data, wide spaces




   • Need summaries that are efficiently computed, communicate more
      with less and expose the most interesting aspects of the data
   • Need different ways of viewing the data, depending on the density and
      scale, from whole genome to single basepair
  Michael Lawrence (Genentech)   A Grammar of Graphics for Genomics   August 29, 2012   5 / 18
Challenges
Big data, wide spaces




                    120.928 Mb   120.93 Mb    120.932 Mb    120.934 Mb     120.936 Mb   120.938 Mb




   • Need summaries that are efficiently computed, communicate more
      with less and expose the most interesting aspects of the data
   • Need different ways of viewing the data, depending on the density and
      scale, from whole genome to single basepair
  Michael Lawrence (Genentech)        A Grammar of Graphics for Genomics                August 29, 2012   5 / 18
Challenges
Big data, wide spaces

              60
              50
              40
              30
              20
              10
               0




                      120.928 Mb   120.93 Mb    120.932 Mb    120.934 Mb    120.936 Mb   120.938 Mb


   • Need summaries that are efficiently computed, communicate more
      with less and expose the most interesting aspects of the data
   • Need different ways of viewing the data, depending on the density and
      scale, from whole genome to single basepair
  Michael Lawrence (Genentech)         A Grammar of Graphics for Genomics                August 29, 2012   5 / 18
Challenges
Big data, wide spaces

              300000
              250000
              200000
              150000
              100000
               50000
                  0




                       0 Mb               50 Mb                       100 Mb




   • Need summaries that are efficiently computed, communicate more
      with less and expose the most interesting aspects of the data
   • Need different ways of viewing the data, depending on the density and
      scale, from whole genome to single basepair
  Michael Lawrence (Genentech)   A Grammar of Graphics for Genomics            August 29, 2012   5 / 18
Existing Tools
UCSC                      IGB               IGV                      Circos                 GViz




 Michael Lawrence (Genentech)   A Grammar of Graphics for Genomics            August 29, 2012   6 / 18
Existing Tools
UCSC                      IGB               IGV                      Circos                 GViz




 Michael Lawrence (Genentech)   A Grammar of Graphics for Genomics            August 29, 2012   6 / 18
Existing Tools
UCSC                      IGB               IGV                      Circos                 GViz




 Michael Lawrence (Genentech)   A Grammar of Graphics for Genomics            August 29, 2012   6 / 18
Existing Tools
UCSC                      IGB               IGV                      Circos                 GViz




 Michael Lawrence (Genentech)   A Grammar of Graphics for Genomics            August 29, 2012   6 / 18
Existing Tools
UCSC                      IGB               IGV                      Circos                 GViz




 Michael Lawrence (Genentech)   A Grammar of Graphics for Genomics            August 29, 2012   6 / 18
Existing Tools
UCSC                      IGB               IGV                      Circos                 GViz




 Michael Lawrence (Genentech)   A Grammar of Graphics for Genomics            August 29, 2012   6 / 18
Existing Tools
UCSC                      IGB               IGV                      Circos                 GViz


Limitations
  • Limited to one type of view (linear or circular)
  • Not tightly integrated with an analysis environment through
     standard, abstract data structures (except GViz)
  • No low-level toolkit for prototyping new types of graphics




 Michael Lawrence (Genentech)   A Grammar of Graphics for Genomics            August 29, 2012   6 / 18
Grammars of Graphics

    • A grammar of graphics is a
       language for expressing plots
    • Graphics are constructed
       through the combination of
       various types of primitives;
       like legos for graphics
    • The most prominent
       grammar was introduced by
       Wilkinson’s book The
       Grammar of Graphics
    • Wilkinson’s grammar was
       extended by Wickham and
       the ggplot2 package


 Michael Lawrence (Genentech)   A Grammar of Graphics for Genomics   August 29, 2012   7 / 18
Grammars of Graphics

    • A grammar of graphics is a
       language for expressing plots
    • Graphics are constructed
       through the combination of
       various types of primitives;
       like legos for graphics
    • The most prominent
       grammar was introduced by
       Wilkinson’s book The
       Grammar of Graphics
    • Wilkinson’s grammar was
       extended by Wickham and
       the ggplot2 package


 Michael Lawrence (Genentech)   A Grammar of Graphics for Genomics   August 29, 2012   7 / 18
Grammars of Graphics

    • A grammar of graphics is a
       language for expressing plots
    • Graphics are constructed
       through the combination of
       various types of primitives;
       like legos for graphics
    • The most prominent
       grammar was introduced by
       Wilkinson’s book The
       Grammar of Graphics
    • Wilkinson’s grammar was
       extended by Wickham and
       the ggplot2 package


 Michael Lawrence (Genentech)   A Grammar of Graphics for Genomics   August 29, 2012   7 / 18
Grammars of Graphics

    • A grammar of graphics is a
       language for expressing plots
    • Graphics are constructed
       through the combination of
       various types of primitives;
       like legos for graphics
    • The most prominent
       grammar was introduced by
       Wilkinson’s book The
       Grammar of Graphics
    • Wilkinson’s grammar was
       extended by Wickham and
       the ggplot2 package


 Michael Lawrence (Genentech)   A Grammar of Graphics for Genomics   August 29, 2012   7 / 18
The ggbio Package



  • An R/Bioconductor package that extends the Wilkinson/Wickham
     grammar for applications in genomics
  • Integrated with Bioconductor
       • Operates on standard, abstract genomic data structures
       • Leverages efficient range-based algorithms
  • Programming interface has two levels of abstraction:
         autoplot Maps Bioconductor data structures to plots
          grammar Mix and match to create custom plots




 Michael Lawrence (Genentech)   A Grammar of Graphics for Genomics   August 29, 2012   8 / 18
Outline




1   Motivation


2   High-level Plots


3   Grammar Components




    Michael Lawrence (Genentech)   A Grammar of Graphics for Genomics   August 29, 2012   9 / 18
Basic Plots
Gene Structures                 Read Alignments                    Sequence             Multiple




 Michael Lawrence (Genentech)     A Grammar of Graphics for Genomics          August 29, 2012   10 / 18
Basic Plots
Gene Structures                    Read Alignments                         Sequence                     Multiple




               120.928 Mb   120.93 Mb     120.932 Mb   120.934 Mb   120.936 Mb   120.938 Mb




 Michael Lawrence (Genentech)           A Grammar of Graphics for Genomics                    August 29, 2012   10 / 18
Basic Plots
Gene Structures                     Read Alignments                           Sequence                     Multiple
          60


          50


          40


          30


          20


          10


           0

                  120928000     120930000       120932000   120934000    120936000   120938000




 Michael Lawrence (Genentech)               A Grammar of Graphics for Genomics                   August 29, 2012   10 / 18
Basic Plots
Gene Structures                 Read Alignments                      Sequence                      Multiple

                                                                                     A
                                                                                     C
                                                                                     G
                                                                                     T

                120928700       120928750          120928800             120928850




 Michael Lawrence (Genentech)       A Grammar of Graphics for Genomics                   August 29, 2012   10 / 18
Basic Plots
Gene Structures                    Read Alignments                       Sequence                Multiple


               CGTAGGAGAATCCGGTGTCCAGTTCGCTGGGCAGACTTCTCCATGTGTTT


              120928690     120928700       120928710     120928720      120928730   120928740




 Michael Lawrence (Genentech)           A Grammar of Graphics for Genomics             August 29, 2012   10 / 18
Basic Plots
Gene Structures                    Read Alignments                         Sequence                    Multiple
            60



            50



            40



            30



            20



            10



             0




                   120.928 Mb   120.93 Mb    120.932 Mb   120.934 Mb   120.936 Mb   120.938 Mb




 Michael Lawrence (Genentech)           A Grammar of Graphics for Genomics                   August 29, 2012   10 / 18
Overview Plots
Grand Linear                               Karyogram                           Circular




 Michael Lawrence (Genentech)   A Grammar of Graphics for Genomics   August 29, 2012   11 / 18
Overview Plots
Grand Linear                               Karyogram                           Circular




 Michael Lawrence (Genentech)   A Grammar of Graphics for Genomics   August 29, 2012   11 / 18
Overview Plots
Grand Linear                                     Karyogram                                                                                   Circular




                                                                          1
                                                                          2
                                                                          3
                                                                          4
                                                                          5
                                                                          6
                                                                          7
                                                                          8
                                                                          9 10 11 12 13 14 15 16 17 18 19 20 21 22 X
                                                                                                                       seqReg
                                                                                                                          Exon
                                                                                                                          Intron
                                                                                                                          Other




                       5.0e+07   1.0e+08     1.5e+08    2.0e+08




 Michael Lawrence (Genentech)        A Grammar of Graphics for Genomics                                                            August 29, 2012   11 / 18
Overview Plots
Grand Linear                                                                                                                                                            Karyogram                                                                                                                                         Circular

                                                                                                      21                     22
                                                                                                                                                                               1
                                                                                  20




                                                                                                                                      50M




                                                                                                                                                            50M
                                                                                                                            0M




                                                                                                                                            0M




                                                                                                                                                                        100M
                                                                                                      0M




                                                                                                                                                                                    150M
                                                                                            50M
                                                                     19




                                                                                                                                                                                             200M
                                                                                  0M
                                                                           M
                                                                           50




                                                                                                                                                                                                                 0M
                                                                      0M
                                                     18                                                                                                                                                                                     2




                                                                                                                                                                                                                              M
                                                                                                                                                                                                                          50
                                                            M
                                                            50




                                                                                                                                                                                                                                    0M
                                                                                                                                                                                                                                   10
                                                      0M



                                                                                                                                                                                                                                                 0M
                                                                                                                                                                                                                                            15
                                                                                                                                                 q
                                17




                                                                                                                                                                                                                                                        0M
                                                                                                                                                                                                                                                      20
                                               M
                                                50
                                                                                                                                            q               q

                                          0M
                                                                                                                                                                          q

                                                                                                                                                                                                                                                                 0M
                                                                                                                                                                                                                      q
                   16




                                                                                                                                                                                                                                                                        M
                            50M                                                                                                                                                                                                                                       50




                                                                                                                                                                                                                                                                              3
                                                                                                                                                                                                             q
                           0M                                                                                                                                                                                                                                              100M
                                                                                                                                                                                                             q


                     100M                                                                                                                                                                                            q
                                                                                                                                                                                                                                                                            150M             rearrangements
              15




                    50M
                                                                 q
                                                                                                                                                                                                                          q
                                                                                                                                                                                                                                                                                                  interchromosomal
                                                                                                                                                                                                                               q        q
                                                                                                                                                                                                                          q
                                                                                                                                                                                                                          q
                                                                                                                                                                                                                                                                                  0M
                                                                                                                                                                                                                                                                                                  intrachromosomal
                    0M


                                                                                                                                                                                                                                                                                  50M
                  100M                                                                                                                                                                                                                                                                       tumreads




                                                                                                                                                                                                                                                                                         4
                                                                                                                                                                                                                               q
                                                                                                                                                                                                                                                                                                  4
             14




                                                                                                                                                                                                                                                                                              q
                  50M                                                                                                                                                                                                                                                             100M


                    0M                                                                                                                                                                                                                                                            150M
                                                                                                                                                                                                                                                                                              q   6

                                                                                                                                                                                                                          q
                                                                                                                                                                                                                                                                                              q   8
                                                            q
                   100M
                                                                                                                                                                                                                                                                             0M               q   10
              13




                                                                                                                                                                                                                      q
                                                                                                                                                                                                                                                                                              q   12
                     50M
                                                                                                                                                                                                                                                                            50M
                          0M                                                                                                                                                                                 q

                                                                                                                                                                                                                                                                       100M




                                                                                                                                                                                                                                                                                  5
                                                                                                                                                                                                  q
                                                                                        q                                                                                                                                                                         15
                            0M
                                 10                                                                                                                                                                                                                                 0M

                                       50                                                         q
                          12




                                      M
                                                                                                                                                                                                                                                            0M
                                               0M
                                                                                                                                                                                                                                                      50
                                                                                                                                                                                                                                                        M
                                                                                                                                                                                    q                                                       10
                                                                                                                                                                                                                                              0M
                                                     10




                                                                                                                                                                                                                                                           6
                                                     M  0




                                                                                                                                                                                                                                   15
                                                            50




                                                                                                                                                                                                                                    0M
                                                            M




                                                     11
                                                                 0M




                                                                                                                                                                                                                          0M
                                                                                                                                                                                                                 50
                                                                           10




                                                                                                                                                                                                                 M
                                                                                                                                                                                                      100M
                                                                             0M


                                                                                  50M




                                                                                                                                                                                           150M




                                                                                                                                                                                                                 7
                                                                                            0M




                                                                                                                                                                               0M




                                                                            10
                                                                                                                                                                  50M
                                                                                                           100M




                                                                                                                                                     100M
                                                                                                                      50M


                                                                                                                                 0M




                                                                                                                  9                                               8




 Michael Lawrence (Genentech)                                                                               A Grammar of Graphics for Genomics                                                                                                                                                                  August 29, 2012   11 / 18
Specialized Plots
Mismatch summary + VCF                                               Edge-linked Intervals




 Michael Lawrence (Genentech)   A Grammar of Graphics for Genomics         August 29, 2012   12 / 18
Specialized Plots
Mismatch summary + VCF                                                                      Edge-linked Intervals

                        15




                        10
                                                                                                                read
   mismatch




                                                                                                                     A
               Counts




                                                                                                                     C
                                                                                                                     G
                                                                                                                     T


                         5




                         0
   snp




                                                              T
   reference




A T T A A G A A A G T A C C G T G T G A C A T C A C A G G C T G G G A G C T T G A G A G

                   25235720   25235725   25235730       25235735     25235740    25235745   25235750      25235755



  Michael Lawrence (Genentech)                      A Grammar of Graphics for Genomics                 August 29, 2012   12 / 18
Specialized Plots
Mismatch summary + VCF                                                           Edge-linked Intervals
                  1000

                  800
     Expression




                                                                                           group
                  600
                                                                                              GM12878
                  400
                                                                                              K562
                  200
                   0




  uc002raw.2
   uc010yjh.1
   uc002rav.2
   uc010yjg.1
  uc002rau.2

                         10930000   10940000   10950000   10960000   10970000   10980000




 Michael Lawrence (Genentech)            A Grammar of Graphics for Genomics                 August 29, 2012   12 / 18
Outline




1   Motivation


2   High-level Plots


3   Grammar Components




    Michael Lawrence (Genentech)   A Grammar of Graphics for Genomics   August 29, 2012   13 / 18
The Wilkinson/Wickham Grammar of Graphics




        Geom       The shape used for drawing the data
          Stat     Transforms the data before plotting
         Scale     Maps data to geom aesthetics, guides like legends and axes
        Coord      Maps from geom space to device space
        Facet      Small multiples of data subsets (trellis)
 Michael Lawrence (Genentech)   A Grammar of Graphics for Genomics   August 29, 2012   14 / 18
The Wilkinson/Wickham Grammar of Graphics




        Geom       The shape used for drawing the data
          Stat     Transforms the data before plotting
         Scale     Maps data to geom aesthetics, guides like legends and axes
        Coord      Maps from geom space to device space
        Facet      Small multiples of data subsets (trellis)
 Michael Lawrence (Genentech)   A Grammar of Graphics for Genomics   August 29, 2012   14 / 18
The Wilkinson/Wickham Grammar of Graphics




        Geom       The shape used for drawing the data
          Stat     Transforms the data before plotting
         Scale     Maps data to geom aesthetics, guides like legends and axes
        Coord      Maps from geom space to device space
        Facet      Small multiples of data subsets (trellis)
 Michael Lawrence (Genentech)   A Grammar of Graphics for Genomics   August 29, 2012   14 / 18
The Wilkinson/Wickham Grammar of Graphics




        Geom       The shape used for drawing the data
          Stat     Transforms the data before plotting
         Scale     Maps data to geom aesthetics, guides like legends and axes
        Coord      Maps from geom space to device space
        Facet      Small multiples of data subsets (trellis)
 Michael Lawrence (Genentech)   A Grammar of Graphics for Genomics   August 29, 2012   14 / 18
The Wilkinson/Wickham Grammar of Graphics




        Geom       The shape used for drawing the data
          Stat     Transforms the data before plotting
         Scale     Maps data to geom aesthetics, guides like legends and axes
        Coord      Maps from geom space to device space
        Facet      Small multiples of data subsets (trellis)
 Michael Lawrence (Genentech)   A Grammar of Graphics for Genomics   August 29, 2012   14 / 18
The Wilkinson/Wickham Grammar of Graphics




        Geom       The shape used for drawing the data
          Stat     Transforms the data before plotting
         Scale     Maps data to geom aesthetics, guides like legends and axes
        Coord      Maps from geom space to device space
        Facet      Small multiples of data subsets (trellis)
 Michael Lawrence (Genentech)   A Grammar of Graphics for Genomics   August 29, 2012   14 / 18
A Grammar of Graphics for Genomics
Extensions are marked in red




                                     geometric object:
        Y scale: discrete            chevron                                     geometric object: rect
        from stepping


                                                                                                                  color scale: discrete
                         2                                                                                        from strand

                                                                                                                 strand
statistical transformation:                                                                                         +
stepping
                                                                                                                    −
                                                                                                                     geometric object:
                         1                                                                                           alignment




                              48.245 Mb                              48.260 Mb        48.265 Mb      48.270 Mb      X scale: sequence
                                           48.250 Mb     48.255 Mb




   Michael Lawrence (Genentech)                     A Grammar of Graphics for Genomics                            August 29, 2012         15 / 18
Components of the Genomic Grammar
  Geom:         alignment chevron arch arrow arrowrect
   Stat:        gene reduce stepping coverage mismatch table
  Scale:        sequence genome fold-change giemsa
  Coord:        truncate-gaps
 Layout:        tracks range-facet




 Michael Lawrence (Genentech)   A Grammar of Graphics for Genomics   August 29, 2012   16 / 18
Components of the Genomic Grammar
  Geom:          alignment chevron arch arrow arrowrect
   Stat:         gene reduce stepping coverage mismatch table
  Scale:         sequence genome fold-change giemsa
  Coord:         truncate-gaps
 Layout:         tracks range-facet

      NM_014098(GeneID:10935)



      NM_006793(GeneID:10935)


                                120.928 Mb     120.93 Mb   120.932 Mb   120.934 Mb   120.936 Mb   120.938 Mb




 Michael Lawrence (Genentech)                A Grammar of Graphics for Genomics                         August 29, 2012   16 / 18
Components of the Genomic Grammar
  Geom:          alignment chevron arch arrow arrowrect
   Stat:         gene reduce stepping coverage mismatch table
  Scale:         sequence genome fold-change giemsa
  Coord:         truncate-gaps
 Layout:         tracks range-facet

      NM_014098(GeneID:10935)



      NM_006793(GeneID:10935)


                                120.928 Mb   120.93 Mb   120.932 Mb   120.934 Mb   120.936 Mb   120.938 Mb




 Michael Lawrence (Genentech)            A Grammar of Graphics for Genomics                          August 29, 2012   16 / 18
Components of the Genomic Grammar
  Geom:          alignment chevron arch arrow arrowrect
   Stat:         gene reduce stepping coverage mismatch table
  Scale:         sequence genome fold-change giemsa
  Coord:         truncate-gaps
 Layout:         tracks range-facet

       NM_014098(GeneID:10935)



       NM_006793(GeneID:10935)


                                 120.928 Mb   120.93 Mb   120.932 Mb   120.934 Mb   120.936 Mb   120.938 Mb




 Michael Lawrence (Genentech)            A Grammar of Graphics for Genomics                           August 29, 2012   16 / 18
Components of the Genomic Grammar
  Geom:          alignment chevron arch arrow arrowrect
   Stat:         gene reduce stepping coverage mismatch table
  Scale:         sequence genome fold-change giemsa
  Coord:         truncate-gaps
 Layout:         tracks range-facet

        NM_014098(GeneID:10935)



        NM_006793(GeneID:10935)


                                  120.928 Mb   120.93 Mb   120.932 Mb   120.934 Mb   120.936 Mb   120.938 Mb




 Michael Lawrence (Genentech)             A Grammar of Graphics for Genomics                          August 29, 2012   16 / 18
Components of the Genomic Grammar
  Geom:          alignment chevron arch arrow arrowrect
   Stat:         gene reduce stepping coverage mismatch table
  Scale:         sequence genome fold-change giemsa
  Coord:         truncate-gaps
 Layout:         tracks range-facet

        NM_014098(GeneID:10935)



        NM_006793(GeneID:10935)


                                  120.928 Mb   120.93 Mb   120.932 Mb   120.934 Mb   120.936 Mb   120.938 Mb




 Michael Lawrence (Genentech)            A Grammar of Graphics for Genomics                          August 29, 2012   16 / 18
Components of the Genomic Grammar
  Geom:             alignment chevron arch arrow arrowrect
   Stat:            gene reduce stepping coverage mismatch table
  Scale:            sequence genome fold-change giemsa
  Coord:            truncate-gaps
 Layout:            tracks range-facet
         original
         reduced




                        120.928 Mb   120.93 Mb   120.932 Mb   120.934 Mb     120.936 Mb   120.938 Mb




 Michael Lawrence (Genentech)           A Grammar of Graphics for Genomics                   August 29, 2012   16 / 18
Components of the Genomic Grammar
  Geom:           alignment chevron arch arrow arrowrect
   Stat:          gene reduce stepping coverage mismatch table
  Scale:          sequence genome fold-change giemsa
  Coord:          truncate-gaps
 Layout:          tracks range-facet
       stepping




                  120.928 Mb    120.93 Mb       120.932 Mb    120.934 Mb    120.936 Mb   120.938 Mb




 Michael Lawrence (Genentech)               A Grammar of Graphics for Genomics              August 29, 2012   16 / 18
Components of the Genomic Grammar
  Geom:                 alignment chevron arch arrow arrowrect
   Stat:                gene reduce stepping coverage mismatch table
  Scale:                sequence genome fold-change giemsa
  Coord:                truncate-gaps
 Layout:                tracks range-facet
                   60


                   50


                   40
        Coverage




                   30


                   20


                   10


                    0

                         120928000   120930000    120932000     120934000        120936000   120938000




 Michael Lawrence (Genentech)               A Grammar of Graphics for Genomics                 August 29, 2012   16 / 18
Components of the Genomic Grammar
  Geom:                alignment chevron arch arrow arrowrect
   Stat:               gene reduce stepping coverage mismatch table
  Scale:               sequence genome fold-change giemsa
  Coord:               truncate-gaps
 Layout:               tracks range-facet
                  60


                  50


                  40
                                                                                                      A
         Counts




                  30                                                                                  C
                                                                                                      G
                                                                                                      T
                  20


                  10


                   0

                       120928000   120930000      120932000   120934000   120936000   120938000




 Michael Lawrence (Genentech)                  A Grammar of Graphics for Genomics                 August 29, 2012   16 / 18
Components of the Genomic Grammar
  Geom:                alignment chevron arch arrow arrowrect
   Stat:               gene reduce stepping coverage mismatch table
  Scale:               sequence genome fold-change giemsa
  Coord:               truncate-gaps
 Layout:               tracks range-facet
                  60

                  50

                  40
                                                                                                             A
         Counts




                                                                                                             C
                  30
                                                                                                             G
                  20                                                                                         T

                  10

                   0




                        120.928 Mb   120.93 Mb      120.932 Mb   120.934 Mb   120.936 Mb   120.938 Mb




 Michael Lawrence (Genentech)                    A Grammar of Graphics for Genomics                     August 29, 2012   16 / 18
Components of the Genomic Grammar
  Geom:         alignment chevron arch arrow arrowrect
   Stat:        gene reduce stepping coverage mismatch table
  Scale:        sequence genome fold-change giemsa
  Coord:        truncate-gaps
 Layout:        tracks range-facet
  original
  truncated




 Michael Lawrence (Genentech)   A Grammar of Graphics for Genomics   August 29, 2012   16 / 18
Components of the Genomic Grammar
  Geom:         alignment chevron arch arrow arrowrect
   Stat:        gene reduce stepping coverage mismatch table
  Scale:        sequence genome fold-change giemsa
  Coord:        truncate-gaps
 Layout:        tracks range-facet
         2000




         1500




                                                                     normal
         1000




          500




            0                                                                 score
         2000
                                                                                 500
                                                                                 1000
         1500                                                                    1500

                                                                              novel




                                                                     tumor
         1000                                                                    FALSE
                                                                                 TRUE

          500




            0




 Michael Lawrence (Genentech)   A Grammar of Graphics for Genomics   August 29, 2012     16 / 18
Components of the Genomic Grammar
  Geom:                  alignment chevron arch arrow arrowrect
   Stat:                 gene reduce stepping coverage mismatch table
  Scale:                 sequence genome fold-change giemsa
  Coord:                 truncate-gaps
 Layout:                 tracks range-facet
                                             10                                        11



                     6e+05
          Coverage




                     4e+05




                     2e+05




                     0e+00

                             0e+00   5e+07         1e+08         0e+00         5e+07        1e+08




 Michael Lawrence (Genentech)                 A Grammar of Graphics for Genomics               August 29, 2012   16 / 18
Components of the Genomic Grammar
  Geom:                  alignment chevron arch arrow arrowrect
   Stat:                 gene reduce stepping coverage mismatch table
  Scale:                 sequence genome fold-change giemsa
  Coord:                 truncate-gaps
 Layout:                 tracks range-facet


                      6e+05
           Coverage




                      4e+05




                      2e+05




                      0e+00

                                      10                                   11




 Michael Lawrence (Genentech)         A Grammar of Graphics for Genomics        August 29, 2012   16 / 18
Components of the Genomic Grammar
  Geom:                    alignment chevron arch arrow arrowrect
   Stat:                   gene reduce stepping coverage mismatch table
  Scale:                   sequence genome fold-change giemsa
  Coord:                   truncate-gaps
 Layout:                   tracks range-facet
                     200




                     150
                                                                                  value
          Features




                                                                                      5000
                     100
                                                                                      0

                                                                                      −5000


                      50




                              1      2         3             4        5       6
                                                   Samples




 Michael Lawrence (Genentech)            A Grammar of Graphics for Genomics       August 29, 2012   16 / 18
Components of the Genomic Grammar
  Geom:               alignment chevron arch arrow arrowrect
   Stat:              gene reduce stepping coverage mismatch table
  Scale:              sequence genome fold-change giemsa
  Coord:              truncate-gaps
 Layout:              tracks range-facet
           chr10




                   chr10                                                      chr10




 Michael Lawrence (Genentech)      A Grammar of Graphics for Genomics   August 29, 2012   16 / 18
Components of the Genomic Grammar
  Geom:                      alignment chevron arch arrow arrowrect
   Stat:                     gene reduce stepping coverage mismatch table
  Scale:                     sequence genome fold-change giemsa
  Coord:                     truncate-gaps
 Layout:                     tracks range-facet
           chr10




                   chr10                                                                                      chr10



                            60

                            50

                            40                                                                                        A
                   Counts




                                                                                                                      C
                            30
                                                                                                                      G
                            20                                                                                        T
                            10

                            0




                                 120.928 Mb   120.93 Mb   120.932 Mb   120.934 Mb   120.936 Mb   120.938 Mb




 Michael Lawrence (Genentech)                      A Grammar of Graphics for Genomics                     August 29, 2012   16 / 18
Summary



  • The ggbio package is a toolkit for plotting genomic data and
     annotations
  • Available as part of the Bioconductor project
  • Easy to use and flexible enough to handle the diverse use cases
     encountered in genomics
  • Useful plots are automatically generated from Bioconductor data
     structures using reasonable defaults
  • New types of plots can be constructed from grammar primitives
     specially designed for genomics




 Michael Lawrence (Genentech)   A Grammar of Graphics for Genomics   August 29, 2012   17 / 18
Acknowledgements




Tengfei Yin
Di Cook

Robert Gentleman




 Michael Lawrence (Genentech)   A Grammar of Graphics for Genomics   August 29, 2012   18 / 18

Weitere ähnliche Inhalte

Ähnlich wie Michael 2011 japan

SIAM Annual Meeting 2012: Streaming Graph Analytics for Massive Graphs
SIAM Annual Meeting 2012: Streaming Graph Analytics for Massive GraphsSIAM Annual Meeting 2012: Streaming Graph Analytics for Massive Graphs
SIAM Annual Meeting 2012: Streaming Graph Analytics for Massive GraphsJason Riedy
 
STING: Spatio-Temporal Interaction Networks and Graphs for Intel Platforms
STING: Spatio-Temporal Interaction Networks and Graphs for Intel PlatformsSTING: Spatio-Temporal Interaction Networks and Graphs for Intel Platforms
STING: Spatio-Temporal Interaction Networks and Graphs for Intel PlatformsJason Riedy
 
2011Field talk at iEVOBIO 2011
2011Field talk at iEVOBIO 20112011Field talk at iEVOBIO 2011
2011Field talk at iEVOBIO 2011MIBBI Checklists
 
Globus Genomics: Democratizing NGS Analysis
Globus Genomics: Democratizing NGS AnalysisGlobus Genomics: Democratizing NGS Analysis
Globus Genomics: Democratizing NGS AnalysisRavi Madduri
 
Learning, Training,  Classification,  Common Sense and Exascale Computing
Learning, Training,  Classification,  Common Sense and Exascale ComputingLearning, Training,  Classification,  Common Sense and Exascale Computing
Learning, Training,  Classification,  Common Sense and Exascale ComputingJoel Saltz
 
Bls 303 l1.phylogenetics
Bls 303 l1.phylogeneticsBls 303 l1.phylogenetics
Bls 303 l1.phylogeneticsBruno Mmassy
 
SIAM PP 2012: Scalable Algorithms for Analysis of Massive, Streaming Graphs
SIAM PP 2012: Scalable Algorithms for Analysis of Massive, Streaming Graphs SIAM PP 2012: Scalable Algorithms for Analysis of Massive, Streaming Graphs
SIAM PP 2012: Scalable Algorithms for Analysis of Massive, Streaming Graphs Jason Riedy
 
Higher-order spectral graph clustering with motifs
Higher-order spectral graph clustering with motifsHigher-order spectral graph clustering with motifs
Higher-order spectral graph clustering with motifsAustin Benson
 
Dawn Field: the Genomics Standards Consortium (GSC)
Dawn Field: the Genomics Standards Consortium (GSC)Dawn Field: the Genomics Standards Consortium (GSC)
Dawn Field: the Genomics Standards Consortium (GSC)GigaScience, BGI Hong Kong
 
Jillian ms defense-4-14-14-ja
Jillian ms defense-4-14-14-jaJillian ms defense-4-14-14-ja
Jillian ms defense-4-14-14-jaJillian Aurisano
 
The Impact of Information Technology on Chemistry and Related Sciences
The Impact of Information Technology on Chemistry and Related SciencesThe Impact of Information Technology on Chemistry and Related Sciences
The Impact of Information Technology on Chemistry and Related SciencesAshutosh Jogalekar
 
Amia tb-review-09
Amia tb-review-09Amia tb-review-09
Amia tb-review-09Russ Altman
 
Franhouder july2013
Franhouder july2013Franhouder july2013
Franhouder july2013CS, NcState
 
Introduction to Systemics with focus on Systems Biology
Introduction to Systemics with focus on Systems BiologyIntroduction to Systemics with focus on Systems Biology
Introduction to Systemics with focus on Systems BiologyMrinal Vashisth
 
Franz ludaescher tdwg 2016 an update on taxonomic concept reasoning
Franz ludaescher tdwg 2016 an update on taxonomic concept reasoningFranz ludaescher tdwg 2016 an update on taxonomic concept reasoning
Franz ludaescher tdwg 2016 an update on taxonomic concept reasoningtaxonbytes
 
math bio for 1st year math students
math bio for 1st year math studentsmath bio for 1st year math students
math bio for 1st year math studentsBen Bolker
 
Scott Edmunds: GigaScience - a journal or a database? Lessons learned from th...
Scott Edmunds: GigaScience - a journal or a database? Lessons learned from th...Scott Edmunds: GigaScience - a journal or a database? Lessons learned from th...
Scott Edmunds: GigaScience - a journal or a database? Lessons learned from th...GigaScience, BGI Hong Kong
 
Mdst3703 culturomics-2012-11-01
Mdst3703 culturomics-2012-11-01Mdst3703 culturomics-2012-11-01
Mdst3703 culturomics-2012-11-01Rafael Alvarado
 

Ähnlich wie Michael 2011 japan (20)

SIAM Annual Meeting 2012: Streaming Graph Analytics for Massive Graphs
SIAM Annual Meeting 2012: Streaming Graph Analytics for Massive GraphsSIAM Annual Meeting 2012: Streaming Graph Analytics for Massive Graphs
SIAM Annual Meeting 2012: Streaming Graph Analytics for Massive Graphs
 
STING: Spatio-Temporal Interaction Networks and Graphs for Intel Platforms
STING: Spatio-Temporal Interaction Networks and Graphs for Intel PlatformsSTING: Spatio-Temporal Interaction Networks and Graphs for Intel Platforms
STING: Spatio-Temporal Interaction Networks and Graphs for Intel Platforms
 
2011Field talk at iEVOBIO 2011
2011Field talk at iEVOBIO 20112011Field talk at iEVOBIO 2011
2011Field talk at iEVOBIO 2011
 
Sensors1(1)
Sensors1(1)Sensors1(1)
Sensors1(1)
 
Globus Genomics: Democratizing NGS Analysis
Globus Genomics: Democratizing NGS AnalysisGlobus Genomics: Democratizing NGS Analysis
Globus Genomics: Democratizing NGS Analysis
 
Learning, Training,  Classification,  Common Sense and Exascale Computing
Learning, Training,  Classification,  Common Sense and Exascale ComputingLearning, Training,  Classification,  Common Sense and Exascale Computing
Learning, Training,  Classification,  Common Sense and Exascale Computing
 
Approaches to Mining Large-Scale Heterogeneous Data: Old and New
Approaches to Mining Large-Scale Heterogeneous Data: Old and NewApproaches to Mining Large-Scale Heterogeneous Data: Old and New
Approaches to Mining Large-Scale Heterogeneous Data: Old and New
 
Bls 303 l1.phylogenetics
Bls 303 l1.phylogeneticsBls 303 l1.phylogenetics
Bls 303 l1.phylogenetics
 
SIAM PP 2012: Scalable Algorithms for Analysis of Massive, Streaming Graphs
SIAM PP 2012: Scalable Algorithms for Analysis of Massive, Streaming Graphs SIAM PP 2012: Scalable Algorithms for Analysis of Massive, Streaming Graphs
SIAM PP 2012: Scalable Algorithms for Analysis of Massive, Streaming Graphs
 
Higher-order spectral graph clustering with motifs
Higher-order spectral graph clustering with motifsHigher-order spectral graph clustering with motifs
Higher-order spectral graph clustering with motifs
 
Dawn Field: the Genomics Standards Consortium (GSC)
Dawn Field: the Genomics Standards Consortium (GSC)Dawn Field: the Genomics Standards Consortium (GSC)
Dawn Field: the Genomics Standards Consortium (GSC)
 
Jillian ms defense-4-14-14-ja
Jillian ms defense-4-14-14-jaJillian ms defense-4-14-14-ja
Jillian ms defense-4-14-14-ja
 
The Impact of Information Technology on Chemistry and Related Sciences
The Impact of Information Technology on Chemistry and Related SciencesThe Impact of Information Technology on Chemistry and Related Sciences
The Impact of Information Technology on Chemistry and Related Sciences
 
Amia tb-review-09
Amia tb-review-09Amia tb-review-09
Amia tb-review-09
 
Franhouder july2013
Franhouder july2013Franhouder july2013
Franhouder july2013
 
Introduction to Systemics with focus on Systems Biology
Introduction to Systemics with focus on Systems BiologyIntroduction to Systemics with focus on Systems Biology
Introduction to Systemics with focus on Systems Biology
 
Franz ludaescher tdwg 2016 an update on taxonomic concept reasoning
Franz ludaescher tdwg 2016 an update on taxonomic concept reasoningFranz ludaescher tdwg 2016 an update on taxonomic concept reasoning
Franz ludaescher tdwg 2016 an update on taxonomic concept reasoning
 
math bio for 1st year math students
math bio for 1st year math studentsmath bio for 1st year math students
math bio for 1st year math students
 
Scott Edmunds: GigaScience - a journal or a database? Lessons learned from th...
Scott Edmunds: GigaScience - a journal or a database? Lessons learned from th...Scott Edmunds: GigaScience - a journal or a database? Lessons learned from th...
Scott Edmunds: GigaScience - a journal or a database? Lessons learned from th...
 
Mdst3703 culturomics-2012-11-01
Mdst3703 culturomics-2012-11-01Mdst3703 culturomics-2012-11-01
Mdst3703 culturomics-2012-11-01
 

Michael 2011 japan

  • 1. A Grammar of Graphics for Genomics The ggbio Package Michael Lawrence Genentech August 29, 2012 Michael Lawrence (Genentech) A Grammar of Graphics for Genomics August 29, 2012 1 / 18
  • 2. Outline 1 Motivation 2 High-level Plots 3 Grammar Components Michael Lawrence (Genentech) A Grammar of Graphics for Genomics August 29, 2012 2 / 18
  • 3. Outline 1 Motivation 2 High-level Plots 3 Grammar Components Michael Lawrence (Genentech) A Grammar of Graphics for Genomics August 29, 2012 3 / 18
  • 4. Data on the Genome • Comes in two flavors: • Annotations (genes, TF binding sites, ...) • Experimental measurements (sequence reads) • Both types are tied to genomic coordinates, providing a common axis that permits cross-dataset comparison and inference • Typically stored as a table, with the range as a fundamental variable type, plus metadata Michael Lawrence (Genentech) A Grammar of Graphics for Genomics August 29, 2012 4 / 18
  • 5. Data on the Genome • Comes in two flavors: • Annotations (genes, TF binding sites, ...) • Experimental measurements (sequence reads) • Both types are tied to genomic coordinates, providing a common axis that permits cross-dataset comparison and inference • Typically stored as a table, with the range as a fundamental variable type, plus metadata 120.928 Mb 120.93 Mb 120.932 Mb 120.934 Mb 120.936 Mb 120.938 Mb Michael Lawrence (Genentech) A Grammar of Graphics for Genomics August 29, 2012 4 / 18
  • 6. Data on the Genome • Comes in two flavors: • Annotations (genes, TF binding sites, ...) • Experimental measurements (sequence reads) • Both types are tied to genomic coordinates, providing a common axis that permits cross-dataset comparison and inference • Typically stored as a table, with the range as a fundamental variable type, plus metadata 60 50 40 30 20 10 0 120928000 120930000 120932000 120934000 120936000 120938000 Michael Lawrence (Genentech) A Grammar of Graphics for Genomics August 29, 2012 4 / 18
  • 7. Data on the Genome • Comes in two flavors: • Annotations (genes, TF binding sites, ...) • Experimental measurements (sequence reads) • Both types are tied to genomic coordinates, providing a common axis that permits cross-dataset comparison and inference • Typically stored as a table, with the range as a fundamental variable type, plus metadata 60 50 40 30 20 10 0 120.928 Mb 120.93 Mb 120.932 Mb 120.934 Mb 120.936 Mb 120.938 Mb Michael Lawrence (Genentech) A Grammar of Graphics for Genomics August 29, 2012 4 / 18
  • 8. Data on the Genome • Comes in two flavors: • Annotations (genes, TF binding sites, ...) • Experimental measurements (sequence reads) • Both types are tied to genomic coordinates, providing a common axis that permits cross-dataset comparison and inference • Typically stored as a table, with the range as a fundamental variable type, plus metadata seqnames start end strand exon id tx id 10 120927215 120928045 - 129230 14886,14887 10 120928689 120928854 - 129229 14886,14887 10 120931894 120931997 - 129228 14886,14887 10 120933249 120933384 - 129227 14886,14887 10 120933963 120934069 - 129226 14886 10 120933963 120934104 - 119757 14887 10 120936533 120936665 - 119756 14887 10 120936552 120936665 - 129225 14886 10 120938267 120938345 - 129224 14886,14887 Michael Lawrence (Genentech) A Grammar of Graphics for Genomics August 29, 2012 4 / 18
  • 9. Challenges Big data, wide spaces • Need summaries that are efficiently computed, communicate more with less and expose the most interesting aspects of the data • Need different ways of viewing the data, depending on the density and scale, from whole genome to single basepair Michael Lawrence (Genentech) A Grammar of Graphics for Genomics August 29, 2012 5 / 18
  • 10. Challenges Big data, wide spaces 120.928 Mb 120.93 Mb 120.932 Mb 120.934 Mb 120.936 Mb 120.938 Mb • Need summaries that are efficiently computed, communicate more with less and expose the most interesting aspects of the data • Need different ways of viewing the data, depending on the density and scale, from whole genome to single basepair Michael Lawrence (Genentech) A Grammar of Graphics for Genomics August 29, 2012 5 / 18
  • 11. Challenges Big data, wide spaces 60 50 40 30 20 10 0 120.928 Mb 120.93 Mb 120.932 Mb 120.934 Mb 120.936 Mb 120.938 Mb • Need summaries that are efficiently computed, communicate more with less and expose the most interesting aspects of the data • Need different ways of viewing the data, depending on the density and scale, from whole genome to single basepair Michael Lawrence (Genentech) A Grammar of Graphics for Genomics August 29, 2012 5 / 18
  • 12. Challenges Big data, wide spaces 300000 250000 200000 150000 100000 50000 0 0 Mb 50 Mb 100 Mb • Need summaries that are efficiently computed, communicate more with less and expose the most interesting aspects of the data • Need different ways of viewing the data, depending on the density and scale, from whole genome to single basepair Michael Lawrence (Genentech) A Grammar of Graphics for Genomics August 29, 2012 5 / 18
  • 13. Existing Tools UCSC IGB IGV Circos GViz Michael Lawrence (Genentech) A Grammar of Graphics for Genomics August 29, 2012 6 / 18
  • 14. Existing Tools UCSC IGB IGV Circos GViz Michael Lawrence (Genentech) A Grammar of Graphics for Genomics August 29, 2012 6 / 18
  • 15. Existing Tools UCSC IGB IGV Circos GViz Michael Lawrence (Genentech) A Grammar of Graphics for Genomics August 29, 2012 6 / 18
  • 16. Existing Tools UCSC IGB IGV Circos GViz Michael Lawrence (Genentech) A Grammar of Graphics for Genomics August 29, 2012 6 / 18
  • 17. Existing Tools UCSC IGB IGV Circos GViz Michael Lawrence (Genentech) A Grammar of Graphics for Genomics August 29, 2012 6 / 18
  • 18. Existing Tools UCSC IGB IGV Circos GViz Michael Lawrence (Genentech) A Grammar of Graphics for Genomics August 29, 2012 6 / 18
  • 19. Existing Tools UCSC IGB IGV Circos GViz Limitations • Limited to one type of view (linear or circular) • Not tightly integrated with an analysis environment through standard, abstract data structures (except GViz) • No low-level toolkit for prototyping new types of graphics Michael Lawrence (Genentech) A Grammar of Graphics for Genomics August 29, 2012 6 / 18
  • 20. Grammars of Graphics • A grammar of graphics is a language for expressing plots • Graphics are constructed through the combination of various types of primitives; like legos for graphics • The most prominent grammar was introduced by Wilkinson’s book The Grammar of Graphics • Wilkinson’s grammar was extended by Wickham and the ggplot2 package Michael Lawrence (Genentech) A Grammar of Graphics for Genomics August 29, 2012 7 / 18
  • 21. Grammars of Graphics • A grammar of graphics is a language for expressing plots • Graphics are constructed through the combination of various types of primitives; like legos for graphics • The most prominent grammar was introduced by Wilkinson’s book The Grammar of Graphics • Wilkinson’s grammar was extended by Wickham and the ggplot2 package Michael Lawrence (Genentech) A Grammar of Graphics for Genomics August 29, 2012 7 / 18
  • 22. Grammars of Graphics • A grammar of graphics is a language for expressing plots • Graphics are constructed through the combination of various types of primitives; like legos for graphics • The most prominent grammar was introduced by Wilkinson’s book The Grammar of Graphics • Wilkinson’s grammar was extended by Wickham and the ggplot2 package Michael Lawrence (Genentech) A Grammar of Graphics for Genomics August 29, 2012 7 / 18
  • 23. Grammars of Graphics • A grammar of graphics is a language for expressing plots • Graphics are constructed through the combination of various types of primitives; like legos for graphics • The most prominent grammar was introduced by Wilkinson’s book The Grammar of Graphics • Wilkinson’s grammar was extended by Wickham and the ggplot2 package Michael Lawrence (Genentech) A Grammar of Graphics for Genomics August 29, 2012 7 / 18
  • 24. The ggbio Package • An R/Bioconductor package that extends the Wilkinson/Wickham grammar for applications in genomics • Integrated with Bioconductor • Operates on standard, abstract genomic data structures • Leverages efficient range-based algorithms • Programming interface has two levels of abstraction: autoplot Maps Bioconductor data structures to plots grammar Mix and match to create custom plots Michael Lawrence (Genentech) A Grammar of Graphics for Genomics August 29, 2012 8 / 18
  • 25. Outline 1 Motivation 2 High-level Plots 3 Grammar Components Michael Lawrence (Genentech) A Grammar of Graphics for Genomics August 29, 2012 9 / 18
  • 26. Basic Plots Gene Structures Read Alignments Sequence Multiple Michael Lawrence (Genentech) A Grammar of Graphics for Genomics August 29, 2012 10 / 18
  • 27. Basic Plots Gene Structures Read Alignments Sequence Multiple 120.928 Mb 120.93 Mb 120.932 Mb 120.934 Mb 120.936 Mb 120.938 Mb Michael Lawrence (Genentech) A Grammar of Graphics for Genomics August 29, 2012 10 / 18
  • 28. Basic Plots Gene Structures Read Alignments Sequence Multiple 60 50 40 30 20 10 0 120928000 120930000 120932000 120934000 120936000 120938000 Michael Lawrence (Genentech) A Grammar of Graphics for Genomics August 29, 2012 10 / 18
  • 29. Basic Plots Gene Structures Read Alignments Sequence Multiple A C G T 120928700 120928750 120928800 120928850 Michael Lawrence (Genentech) A Grammar of Graphics for Genomics August 29, 2012 10 / 18
  • 30. Basic Plots Gene Structures Read Alignments Sequence Multiple CGTAGGAGAATCCGGTGTCCAGTTCGCTGGGCAGACTTCTCCATGTGTTT 120928690 120928700 120928710 120928720 120928730 120928740 Michael Lawrence (Genentech) A Grammar of Graphics for Genomics August 29, 2012 10 / 18
  • 31. Basic Plots Gene Structures Read Alignments Sequence Multiple 60 50 40 30 20 10 0 120.928 Mb 120.93 Mb 120.932 Mb 120.934 Mb 120.936 Mb 120.938 Mb Michael Lawrence (Genentech) A Grammar of Graphics for Genomics August 29, 2012 10 / 18
  • 32. Overview Plots Grand Linear Karyogram Circular Michael Lawrence (Genentech) A Grammar of Graphics for Genomics August 29, 2012 11 / 18
  • 33. Overview Plots Grand Linear Karyogram Circular Michael Lawrence (Genentech) A Grammar of Graphics for Genomics August 29, 2012 11 / 18
  • 34. Overview Plots Grand Linear Karyogram Circular 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X seqReg Exon Intron Other 5.0e+07 1.0e+08 1.5e+08 2.0e+08 Michael Lawrence (Genentech) A Grammar of Graphics for Genomics August 29, 2012 11 / 18
  • 35. Overview Plots Grand Linear Karyogram Circular 21 22 1 20 50M 50M 0M 0M 100M 0M 150M 50M 19 200M 0M M 50 0M 0M 18 2 M 50 M 50 0M 10 0M 0M 15 q 17 0M 20 M 50 q q 0M q 0M q 16 M 50M 50 3 q 0M 100M q 100M q 150M rearrangements 15 50M q q interchromosomal q q q q 0M intrachromosomal 0M 50M 100M tumreads 4 q 4 14 q 50M 100M 0M 150M q 6 q q 8 q 100M 0M q 10 13 q q 12 50M 50M 0M q 100M 5 q q 15 0M 10 0M 50 q 12 M 0M 0M 50 M q 10 0M 10 6 M 0 15 50 0M M 11 0M 0M 50 10 M 100M 0M 50M 150M 7 0M 0M 10 50M 100M 100M 50M 0M 9 8 Michael Lawrence (Genentech) A Grammar of Graphics for Genomics August 29, 2012 11 / 18
  • 36. Specialized Plots Mismatch summary + VCF Edge-linked Intervals Michael Lawrence (Genentech) A Grammar of Graphics for Genomics August 29, 2012 12 / 18
  • 37. Specialized Plots Mismatch summary + VCF Edge-linked Intervals 15 10 read mismatch A Counts C G T 5 0 snp T reference A T T A A G A A A G T A C C G T G T G A C A T C A C A G G C T G G G A G C T T G A G A G 25235720 25235725 25235730 25235735 25235740 25235745 25235750 25235755 Michael Lawrence (Genentech) A Grammar of Graphics for Genomics August 29, 2012 12 / 18
  • 38. Specialized Plots Mismatch summary + VCF Edge-linked Intervals 1000 800 Expression group 600 GM12878 400 K562 200 0 uc002raw.2 uc010yjh.1 uc002rav.2 uc010yjg.1 uc002rau.2 10930000 10940000 10950000 10960000 10970000 10980000 Michael Lawrence (Genentech) A Grammar of Graphics for Genomics August 29, 2012 12 / 18
  • 39. Outline 1 Motivation 2 High-level Plots 3 Grammar Components Michael Lawrence (Genentech) A Grammar of Graphics for Genomics August 29, 2012 13 / 18
  • 40. The Wilkinson/Wickham Grammar of Graphics Geom The shape used for drawing the data Stat Transforms the data before plotting Scale Maps data to geom aesthetics, guides like legends and axes Coord Maps from geom space to device space Facet Small multiples of data subsets (trellis) Michael Lawrence (Genentech) A Grammar of Graphics for Genomics August 29, 2012 14 / 18
  • 41. The Wilkinson/Wickham Grammar of Graphics Geom The shape used for drawing the data Stat Transforms the data before plotting Scale Maps data to geom aesthetics, guides like legends and axes Coord Maps from geom space to device space Facet Small multiples of data subsets (trellis) Michael Lawrence (Genentech) A Grammar of Graphics for Genomics August 29, 2012 14 / 18
  • 42. The Wilkinson/Wickham Grammar of Graphics Geom The shape used for drawing the data Stat Transforms the data before plotting Scale Maps data to geom aesthetics, guides like legends and axes Coord Maps from geom space to device space Facet Small multiples of data subsets (trellis) Michael Lawrence (Genentech) A Grammar of Graphics for Genomics August 29, 2012 14 / 18
  • 43. The Wilkinson/Wickham Grammar of Graphics Geom The shape used for drawing the data Stat Transforms the data before plotting Scale Maps data to geom aesthetics, guides like legends and axes Coord Maps from geom space to device space Facet Small multiples of data subsets (trellis) Michael Lawrence (Genentech) A Grammar of Graphics for Genomics August 29, 2012 14 / 18
  • 44. The Wilkinson/Wickham Grammar of Graphics Geom The shape used for drawing the data Stat Transforms the data before plotting Scale Maps data to geom aesthetics, guides like legends and axes Coord Maps from geom space to device space Facet Small multiples of data subsets (trellis) Michael Lawrence (Genentech) A Grammar of Graphics for Genomics August 29, 2012 14 / 18
  • 45. The Wilkinson/Wickham Grammar of Graphics Geom The shape used for drawing the data Stat Transforms the data before plotting Scale Maps data to geom aesthetics, guides like legends and axes Coord Maps from geom space to device space Facet Small multiples of data subsets (trellis) Michael Lawrence (Genentech) A Grammar of Graphics for Genomics August 29, 2012 14 / 18
  • 46. A Grammar of Graphics for Genomics Extensions are marked in red geometric object: Y scale: discrete chevron geometric object: rect from stepping color scale: discrete 2 from strand strand statistical transformation: + stepping − geometric object: 1 alignment 48.245 Mb 48.260 Mb 48.265 Mb 48.270 Mb X scale: sequence 48.250 Mb 48.255 Mb Michael Lawrence (Genentech) A Grammar of Graphics for Genomics August 29, 2012 15 / 18
  • 47. Components of the Genomic Grammar Geom: alignment chevron arch arrow arrowrect Stat: gene reduce stepping coverage mismatch table Scale: sequence genome fold-change giemsa Coord: truncate-gaps Layout: tracks range-facet Michael Lawrence (Genentech) A Grammar of Graphics for Genomics August 29, 2012 16 / 18
  • 48. Components of the Genomic Grammar Geom: alignment chevron arch arrow arrowrect Stat: gene reduce stepping coverage mismatch table Scale: sequence genome fold-change giemsa Coord: truncate-gaps Layout: tracks range-facet NM_014098(GeneID:10935) NM_006793(GeneID:10935) 120.928 Mb 120.93 Mb 120.932 Mb 120.934 Mb 120.936 Mb 120.938 Mb Michael Lawrence (Genentech) A Grammar of Graphics for Genomics August 29, 2012 16 / 18
  • 49. Components of the Genomic Grammar Geom: alignment chevron arch arrow arrowrect Stat: gene reduce stepping coverage mismatch table Scale: sequence genome fold-change giemsa Coord: truncate-gaps Layout: tracks range-facet NM_014098(GeneID:10935) NM_006793(GeneID:10935) 120.928 Mb 120.93 Mb 120.932 Mb 120.934 Mb 120.936 Mb 120.938 Mb Michael Lawrence (Genentech) A Grammar of Graphics for Genomics August 29, 2012 16 / 18
  • 50. Components of the Genomic Grammar Geom: alignment chevron arch arrow arrowrect Stat: gene reduce stepping coverage mismatch table Scale: sequence genome fold-change giemsa Coord: truncate-gaps Layout: tracks range-facet NM_014098(GeneID:10935) NM_006793(GeneID:10935) 120.928 Mb 120.93 Mb 120.932 Mb 120.934 Mb 120.936 Mb 120.938 Mb Michael Lawrence (Genentech) A Grammar of Graphics for Genomics August 29, 2012 16 / 18
  • 51. Components of the Genomic Grammar Geom: alignment chevron arch arrow arrowrect Stat: gene reduce stepping coverage mismatch table Scale: sequence genome fold-change giemsa Coord: truncate-gaps Layout: tracks range-facet NM_014098(GeneID:10935) NM_006793(GeneID:10935) 120.928 Mb 120.93 Mb 120.932 Mb 120.934 Mb 120.936 Mb 120.938 Mb Michael Lawrence (Genentech) A Grammar of Graphics for Genomics August 29, 2012 16 / 18
  • 52. Components of the Genomic Grammar Geom: alignment chevron arch arrow arrowrect Stat: gene reduce stepping coverage mismatch table Scale: sequence genome fold-change giemsa Coord: truncate-gaps Layout: tracks range-facet NM_014098(GeneID:10935) NM_006793(GeneID:10935) 120.928 Mb 120.93 Mb 120.932 Mb 120.934 Mb 120.936 Mb 120.938 Mb Michael Lawrence (Genentech) A Grammar of Graphics for Genomics August 29, 2012 16 / 18
  • 53. Components of the Genomic Grammar Geom: alignment chevron arch arrow arrowrect Stat: gene reduce stepping coverage mismatch table Scale: sequence genome fold-change giemsa Coord: truncate-gaps Layout: tracks range-facet original reduced 120.928 Mb 120.93 Mb 120.932 Mb 120.934 Mb 120.936 Mb 120.938 Mb Michael Lawrence (Genentech) A Grammar of Graphics for Genomics August 29, 2012 16 / 18
  • 54. Components of the Genomic Grammar Geom: alignment chevron arch arrow arrowrect Stat: gene reduce stepping coverage mismatch table Scale: sequence genome fold-change giemsa Coord: truncate-gaps Layout: tracks range-facet stepping 120.928 Mb 120.93 Mb 120.932 Mb 120.934 Mb 120.936 Mb 120.938 Mb Michael Lawrence (Genentech) A Grammar of Graphics for Genomics August 29, 2012 16 / 18
  • 55. Components of the Genomic Grammar Geom: alignment chevron arch arrow arrowrect Stat: gene reduce stepping coverage mismatch table Scale: sequence genome fold-change giemsa Coord: truncate-gaps Layout: tracks range-facet 60 50 40 Coverage 30 20 10 0 120928000 120930000 120932000 120934000 120936000 120938000 Michael Lawrence (Genentech) A Grammar of Graphics for Genomics August 29, 2012 16 / 18
  • 56. Components of the Genomic Grammar Geom: alignment chevron arch arrow arrowrect Stat: gene reduce stepping coverage mismatch table Scale: sequence genome fold-change giemsa Coord: truncate-gaps Layout: tracks range-facet 60 50 40 A Counts 30 C G T 20 10 0 120928000 120930000 120932000 120934000 120936000 120938000 Michael Lawrence (Genentech) A Grammar of Graphics for Genomics August 29, 2012 16 / 18
  • 57. Components of the Genomic Grammar Geom: alignment chevron arch arrow arrowrect Stat: gene reduce stepping coverage mismatch table Scale: sequence genome fold-change giemsa Coord: truncate-gaps Layout: tracks range-facet 60 50 40 A Counts C 30 G 20 T 10 0 120.928 Mb 120.93 Mb 120.932 Mb 120.934 Mb 120.936 Mb 120.938 Mb Michael Lawrence (Genentech) A Grammar of Graphics for Genomics August 29, 2012 16 / 18
  • 58. Components of the Genomic Grammar Geom: alignment chevron arch arrow arrowrect Stat: gene reduce stepping coverage mismatch table Scale: sequence genome fold-change giemsa Coord: truncate-gaps Layout: tracks range-facet original truncated Michael Lawrence (Genentech) A Grammar of Graphics for Genomics August 29, 2012 16 / 18
  • 59. Components of the Genomic Grammar Geom: alignment chevron arch arrow arrowrect Stat: gene reduce stepping coverage mismatch table Scale: sequence genome fold-change giemsa Coord: truncate-gaps Layout: tracks range-facet 2000 1500 normal 1000 500 0 score 2000 500 1000 1500 1500 novel tumor 1000 FALSE TRUE 500 0 Michael Lawrence (Genentech) A Grammar of Graphics for Genomics August 29, 2012 16 / 18
  • 60. Components of the Genomic Grammar Geom: alignment chevron arch arrow arrowrect Stat: gene reduce stepping coverage mismatch table Scale: sequence genome fold-change giemsa Coord: truncate-gaps Layout: tracks range-facet 10 11 6e+05 Coverage 4e+05 2e+05 0e+00 0e+00 5e+07 1e+08 0e+00 5e+07 1e+08 Michael Lawrence (Genentech) A Grammar of Graphics for Genomics August 29, 2012 16 / 18
  • 61. Components of the Genomic Grammar Geom: alignment chevron arch arrow arrowrect Stat: gene reduce stepping coverage mismatch table Scale: sequence genome fold-change giemsa Coord: truncate-gaps Layout: tracks range-facet 6e+05 Coverage 4e+05 2e+05 0e+00 10 11 Michael Lawrence (Genentech) A Grammar of Graphics for Genomics August 29, 2012 16 / 18
  • 62. Components of the Genomic Grammar Geom: alignment chevron arch arrow arrowrect Stat: gene reduce stepping coverage mismatch table Scale: sequence genome fold-change giemsa Coord: truncate-gaps Layout: tracks range-facet 200 150 value Features 5000 100 0 −5000 50 1 2 3 4 5 6 Samples Michael Lawrence (Genentech) A Grammar of Graphics for Genomics August 29, 2012 16 / 18
  • 63. Components of the Genomic Grammar Geom: alignment chevron arch arrow arrowrect Stat: gene reduce stepping coverage mismatch table Scale: sequence genome fold-change giemsa Coord: truncate-gaps Layout: tracks range-facet chr10 chr10 chr10 Michael Lawrence (Genentech) A Grammar of Graphics for Genomics August 29, 2012 16 / 18
  • 64. Components of the Genomic Grammar Geom: alignment chevron arch arrow arrowrect Stat: gene reduce stepping coverage mismatch table Scale: sequence genome fold-change giemsa Coord: truncate-gaps Layout: tracks range-facet chr10 chr10 chr10 60 50 40 A Counts C 30 G 20 T 10 0 120.928 Mb 120.93 Mb 120.932 Mb 120.934 Mb 120.936 Mb 120.938 Mb Michael Lawrence (Genentech) A Grammar of Graphics for Genomics August 29, 2012 16 / 18
  • 65. Summary • The ggbio package is a toolkit for plotting genomic data and annotations • Available as part of the Bioconductor project • Easy to use and flexible enough to handle the diverse use cases encountered in genomics • Useful plots are automatically generated from Bioconductor data structures using reasonable defaults • New types of plots can be constructed from grammar primitives specially designed for genomics Michael Lawrence (Genentech) A Grammar of Graphics for Genomics August 29, 2012 17 / 18
  • 66. Acknowledgements Tengfei Yin Di Cook Robert Gentleman Michael Lawrence (Genentech) A Grammar of Graphics for Genomics August 29, 2012 18 / 18