Copy-number Variations in Lymphoblastoid Cell Lines

Fei Yu
Carnegie Mellon University
April 4, 2012

Advisors: Bernie Devlin, Kathryn Roeder, Chad Schafer

Motivation

  • Advancement in DNA sequencing technology and rare genetic diseases such as autism.
  • Data collection rush. 100,000 samples in 15 years.
  • Money. Time. Logistics.

             A decade ago, researchers had few successes in finding the genetic
             variants that cause rare diseases. One challenge was that they could
             afford to examine only the small regions of the genome that they thought
             were linked to the disease.
             Today, as DNA sequencing technology develops, cheap and fast
             whole-genome sequencing has become available. Researchers can now look
             at all the genes.

             The graph shows the cost of sequencing a genome over the past decade.
             In 2001, the cost was $100 million, which was prohibitively high.
             Today, a company called Illumina offers the service at $5,000 per genome,
             with a 20% discount on orders of 50 genomes or more.
             The drastic drop in cost triggered a rush to collect as many DNA
             samples as possible. It is projected that in 15 years we will have over
             100,000 samples.

             Despite the relatively low cost per genome, it still costs hundreds of
             millions of dollars to gather so many samples.
             Building the infrastructure to store, maintain, and distribute the data
             can cost as much as the sequencing itself.
             Furthermore, because these experiments involve human subjects, the
             researchers must also obtain permission from the patients and safeguard
             their privacy.
             All in all, it is a huge investment of our society's resources.

Motivation

But there is one problem: most DNA sequencing projects use
lymphoblastoid cell lines instead of peripheral blood.

  Cell line   - Immortal(!)
              - Cultivated from peripheral blood

  Blood       - Obtained from peripheral blood cells consisting of red blood
                cells, white blood cells, and platelets
              - Best source of the DNA
              - Mortal

             But there is one problem: most DNA sequencing projects use
             lymphoblastoid cell lines instead of peripheral blood.
             Cell lines are immortal, so they are suitable for permanent storage.
             However, they are cultivated from peripheral blood rather than taken
             directly from the subject.
             Blood data are obtained directly from peripheral blood cells, which
             consist of red blood cells, white blood cells, and platelets. They are the
             best source of the DNA.
             Because blood cells are mortal, however, it is not practical to store
             them and use them at a later time.
             That is why people use cell lines for sequencing.

Motivation

Are cell line data truthful representations of the DNA?

In other words, how close are cell line data to blood data?

             Our concern is whether cell line data are truthful representations of the
             DNA. In other words, we want to know how close cell line data are to
             blood data.
             If the cell lines are corrupted, any subsequent analysis loses its
             basis, and all the time, money, and effort invested in collecting these
             DNA samples will have gone to waste.

Outline

1 Motivation
    A Strange Scenario
2 Data
    Pipeline
3 How to detect CNVs
    Filtering: Step I
    Filtering: Step II
    Filtering: Step III
    Filtering: Step IV
    Results
4 Conclusions

Inference from Blood and Cell: A Strange Scenario

For a diploid organism (human):

                        Chromosome p1
                         A       B
  Chromosome p2    A     AA      AB
                   B     BA      BB

Homozygous if AA or BB.
Heterozygous if AB or BA.

             For diploid organisms such as humans, chromosomes come in pairs. Each
             chromosome of a pair carries one copy of a gene.
             An allele is one of two or more forms of a gene.
             If both alleles on a pair of chromosomes are the same, we call the genetic
             locus homozygous; if the alleles are different, we call the genetic locus
             heterozygous.

Inference from Blood and Cell: A Strange Scenario

1 = Heterozygous
0 = Homozygous

  Locations   ......   150   ......
  Blood                 1
  Cell                  0

             Denote a heterozygous locus by 1 and a homozygous locus by 0. The
             picture shows that at location 150, blood is heterozygous and cell line is
             homozygous.
             If we only look at loci at which blood is heterozygous, we may encounter
             a situation depicted by this picture. There are consecutive homozygous
             loci in the cell line but they are heterozygous in the blood.
             This looks suspicious.
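
The scan described in these notes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 0/1 lists, the function name `discordant_runs`, and the `min_length` threshold are hypothetical and are not taken from the talk's own scripts.

```python
from itertools import groupby

def discordant_runs(blood, cell, min_length=2):
    """Return (start, end) index pairs, inclusive, of runs of consecutive
    blood-heterozygous loci that the cell line calls homozygous.

    blood, cell: equal-length lists of 0/1 (1 = heterozygous, 0 = homozygous).
    """
    # Keep only loci where blood is heterozygous, remembering original indices.
    het = [(i, c) for i, (b, c) in enumerate(zip(blood, cell)) if b == 1]
    runs = []
    # Group neighbouring blood-heterozygous loci by whether the cell line is homozygous.
    for cell_is_hom, group in groupby(het, key=lambda item: item[1] == 0):
        group = list(group)
        if cell_is_hom and len(group) >= min_length:
            runs.append((group[0][0], group[-1][0]))
    return runs

# Toy data: loci 2-4 are heterozygous in blood but homozygous in the cell line.
blood = [1, 0, 1, 1, 1, 1, 0, 1, 1]
cell = [1, 0, 0, 0, 0, 1, 0, 1, 0]
print(discordant_runs(blood, cell))  # [(2, 4)]
```

An isolated discordant locus (index 8 above) is ignored because a single disagreement is more plausibly a calling error than a deleted region; the threshold of 2 is an arbitrary choice for the toy data.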

Detour: What is Copy-number Variation?

Copy-number variations (CNVs) correspond to relatively large
regions of the genome that have been deleted on a chromosome.

             Now we take a detour and define copy-number variation.
             Copy-number variations (CNVs) correspond to relatively large regions of
             the genome that have been deleted on a chromosome. (In general, CNVs
             also include duplicated regions; this talk focuses on deletions.)
             This picture shows the black region deleted from the chromosome.

Inference from Blood and Cell: A Strange Scenario

What a CNV in cell line looks like:

[Figure: chromosome pair in Blood (left) and in Cell (right)]

             In this picture, the blood, which can be thought of as a representation of
             the DNA, is heterozygous. On the other hand, the cell line has the red
             region deleted.
             When we sequence the samples, we look at both chromosomes. But in
             this case, because the red region in the cell line is deleted, we can only
             sequence the remaining chromosome.
             As a result of the deletion, the cell line will always tell us this genetic
             locus is homozygous even though the DNA is heterozygous.
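
A toy simulation can make this argument concrete. It is a sketch under strong simplifying assumptions that are not part of the talk: reads are error-free, each read copies the allele of one randomly chosen surviving chromosome, and a locus is called heterozygous whenever both alleles are observed. All names are hypothetical.

```python
import random

def sequence_locus(chromosome_alleles, n_reads, rng):
    """Draw n_reads reads, each copied from a randomly chosen surviving chromosome."""
    return [rng.choice(chromosome_alleles) for _ in range(n_reads)]

def call_zygosity(reads):
    """Return 1 (heterozygous) if more than one allele is observed, else 0."""
    return 1 if len(set(reads)) > 1 else 0

rng = random.Random(0)

# Blood: both chromosomes are present, carrying alleles A and B.
blood_reads = sequence_locus(["A", "B"], 50, rng)

# Cell line: the region on the chromosome carrying B is deleted,
# so every read comes from the remaining chromosome.
cell_reads = sequence_locus(["A"], 50, rng)

blood_call = call_zygosity(blood_reads)
cell_call = call_zygosity(cell_reads)
print("blood call:", blood_call, "cell line call:", cell_call)
```

No matter how many reads are drawn, the cell line can only ever show allele A, so its call is homozygous by construction.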

Inference from Blood and Cell: A Strange Scenario

This could be a CNV!

             Let’s go back to this picture. This scenario fits the profile of a CNV. If
             this indeed happens in the cell line, we know the cell line is corrupted at
             that region.

Goal

Having CNVs in the cell line means the cell line is locally
corrupted. The goal of this project is to use the amount of
CNVs to quantify how reliable the cell line is as a source of
DNA.

Data

The data we have:

  • 16 individuals' entire exomes sequenced by next-generation
    sequencing (NGS) technology.
  • Each individual is sequenced twice: once using blood
    samples and the other time using cell line samples.

             The data we have allow us to compare cell line data with blood data and
             answer the question of whether they are the same.

Pipeline

  blood and cell line samples  --[NGS]-->             BAM files
  BAM files                    --[GATK]-->            VCF files
  BAM files                    --[Samtools]-->        additional locus-specific information
  VCF files + additional locus-specific information
                               --[Python scripts]-->  Data ready for analysis

Pipeline: NGS

Next-generation sequencing (NGS) technology

Advantages:
  • Fast
  • Cost-effective

Disadvantages:
  • Short DNA read fragments are randomly located, which makes
    fragment assembly and mapping a great challenge

Pipeline: BAM files

Our raw data are BAM files. Their sizes are huge:

  • encode the whole genome's nucleotide alignments
  • also encode quality of each read for a given locus (a locus
    can be covered by as many as 1000 reads)

                                            Mt. Sinai    Vanderbilt
  # of subjects                                 7            12
  # of subjects that have corrupted data        1             2
  Average file size                          7.4 GiB       17 GiB
  Total size                                ≈ 85 GiB     ≈ 340 GiB

Pipeline: GATK, Samtools

  • Genome Analysis Toolkit (GATK):
      - make inference from the BAM files and determine whether
        a locus is homozygous or heterozygous.
      - apply different filters to obtain desired results.
  • Samtools: extract read-level information such as
    sequencing quality, alignment quality, read direction.

Pipeline: GATK, Samtools

Processing time: ∼1 day.

GATK outputs:

  [HEADER LINES]
  #CHROM POS    ID           REF   ALT   QUAL      FILTER    INFO            FORMAT           NA12878
  chr1   873762 .            T     G     5231.78   PASS      [ANNOTATIONS]   GT:AD:DP:GQ:PL   0/1:173,141:282:99
  chr1   877664 rs3828047    A     G     3931.66   PASS      [ANNOTATIONS]   GT:AD:DP:GQ:PL   1/1:0,105:94:99:25
  chr1   899282 rs28548431   C     T     71.77     PASS      [ANNOTATIONS]   GT:AD:DP:GQ:PL   0/1:1,3:4:25.92:10
  chr1   974165 rs9442391    T     C     29.84     LowQual   [ANNOTATIONS]   GT:AD:DP:GQ:PL   0/1:14,4:14:60.91:

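
To illustrate how a zygosity call can be read off one of these records, here is a minimal Python sketch. It is not the talk's actual script: the function name is hypothetical, and the code assumes whitespace-separated columns as printed on the slide (real VCF files are tab-separated) and a single sample column. The genotype is the `GT` entry of the sample column, where `0/1` means two different alleles and `1/1` means two identical ones.

```python
def zygosity_from_vcf_line(line):
    """Return (chrom, pos, G), with G = 1 for a heterozygous call and 0 for homozygous."""
    fields = line.split()
    chrom, pos = fields[0], int(fields[1])
    # Column 9 (index 8) names the per-sample keys; column 10 holds their values.
    format_keys = fields[8].split(":")
    sample_values = fields[9].split(":")
    genotype = sample_values[format_keys.index("GT")]
    # Alleles are separated by "/" (unphased) or "|" (phased).
    alleles = genotype.replace("|", "/").split("/")
    return chrom, pos, int(len(set(alleles)) > 1)

# Two of the records shown on the slide, sample column as it appears there.
het_line = "chr1 873762 . T G 5231.78 PASS [ANNOTATIONS] GT:AD:DP:GQ:PL 0/1:173,141:282:99"
hom_line = "chr1 877664 rs3828047 A G 3931.66 PASS [ANNOTATIONS] GT:AD:DP:GQ:PL 1/1:0,105:94:99:25"

print(zygosity_from_vcf_line(het_line))  # ('chr1', 873762, 1)
print(zygosity_from_vcf_line(hom_line))  # ('chr1', 877664, 0)
```

The sketch ignores the FILTER column and missing genotypes such as `./.`; a real script would need to decide how to treat `LowQual` records.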
Pipeline: tidy up

[Figure: pipeline diagram. Blood and cell line samples → NGS → BAM files. BAM files → GATK → VCF files. BAM files → Samtools → additional locus-specific information. VCF files and locus-specific information → Python scripts → data ready for analysis.]

Python scripts:
  • extract useful information from GATK's and Samtools' outputs
  • prepare data for analysis in R
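As a rough illustration of the extraction step, the sketch below parses one data line of the VCF output shown on the GATK slide and derives a 0/1 zygosity call from the GT field. It is not the script used in the talk: the function name `parse_vcf_record` and the returned keys are hypothetical, and the line is assumed to be tab-separated with a single sample column. The sample field is copied as it appears on the slide (truncated), which `zip` tolerates.

```python
def parse_vcf_record(line):
    """Parse one tab-separated VCF data line with a single sample column."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, _id, ref, alt, qual, filt, _info, fmt, sample = fields[:10]
    # Pair FORMAT keys with sample values; zip() stops at the shorter list,
    # so a truncated sample field simply yields fewer keys.
    sample_fields = dict(zip(fmt.split(":"), sample.split(":")))
    alleles = sample_fields["GT"].replace("|", "/").split("/")
    if "." in alleles:
        zygosity_call = None  # missing genotype
    else:
        # 1 = heterozygous call (alleles differ), 0 = homozygous call
        zygosity_call = int(len(set(alleles)) > 1)
    return {
        "chrom": chrom,
        "pos": int(pos),
        "ref": ref,
        "alt": alt,
        "qual": float(qual),
        "filter": filt,
        "genotype": sample_fields["GT"],
        "zygosity_call": zygosity_call,
    }


# First two data lines from the slide, rebuilt with tab separators.
het_line = "\t".join(["chr1", "873762", ".", "T", "G", "5231.78", "PASS",
                      "[ANNOTATIONS]", "GT:AD:DP:GQ:PL", "0/1:173,141:282:99"])
hom_line = "\t".join(["chr1", "877664", "rs3828047", "A", "G", "3931.66", "PASS",
                      "[ANNOTATIONS]", "GT:AD:DP:GQ:PL", "1/1:0,105:94:99:25"])
het_record = parse_vcf_record(het_line)
hom_record = parse_vcf_record(hom_line)
print(het_record["pos"], het_record["zygosity_call"])
print(hom_record["pos"], hom_record["zygosity_call"])
```

Writing the resulting dictionaries to a delimited text file would be one way to hand the data to R.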
1 Motivation
    A Strange Scenario
2 Data
    Pipeline
3 How to detect CNVs
    Filtering: Step I
    Filtering: Step II
    Filtering: Step III
    Filtering: Step IV
    Results
4 Conclusions
Notations

Let T denote the zygosity of a genetic locus:

\[
T = \begin{cases} 1 & \text{if the locus is heterozygous} \\ 0 & \text{if the locus is homozygous} \end{cases}
\]

Let G denote the zygosity called by GATK:

\[
G = \begin{cases} 1 & \text{if the call is heterozygous} \\ 0 & \text{if the call is homozygous} \end{cases}
\]

Let
    f+ = P(G = 1 | T = 0)    [false positive]
    f− = P(G = 0 | T = 1)    [false negative]
Distribution of (GB, GC) I

We can describe the distribution of the observations (GB, GC) in four cases:

(I) TB = TC = 0

                          Cell call
                      0                 1
  Blood call   0   (1 − f+)^2        (1 − f+) f+
               1   f+ (1 − f+)       f+^2

(II) TB = 0, TC = 1 (i.e., a mutation)

                          Cell call
                      0                 1
  Blood call   0   (1 − f+) f−       (1 − f+)(1 − f−)
               1   f+ f−             f+ (1 − f−)
Distribution of (GB, GC) II

(III) TB = 1, TC = 0 (i.e., a deletion)

                          Cell call
                      0                     1
  Blood call   0   f− (1 − f+)           f− f+
               1   (1 − f−)(1 − f+)      (1 − f−) f+

(IV) TB = TC = 1 (i.e., not a deletion)

                          Cell call
                      0                 1
  Blood call   0   f−^2              f− (1 − f−)
               1   (1 − f−) f−       (1 − f−)^2
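The four tables share one structure: each entry is the product of a blood-call probability and a cell-call probability. The sketch below rebuilds any of the tables from (TB, TC), f+ and f−, assuming, as the product form of the tables implies, that the blood and cell-line calls are independent given the true zygosities. The function name and the numeric error rates are illustrative only.

```python
def joint_call_distribution(t_blood, t_cell, f_pos, f_neg):
    """Return table[gb][gc] = P(GB = gb, GC = gc | TB = t_blood, TC = t_cell)."""

    def p_call(g, t):
        # P(G = 1 | T = 1) = 1 - f_neg and P(G = 1 | T = 0) = f_pos
        p_het = 1.0 - f_neg if t == 1 else f_pos
        return p_het if g == 1 else 1.0 - p_het

    return [
        [p_call(gb, t_blood) * p_call(gc, t_cell) for gc in (0, 1)]
        for gb in (0, 1)
    ]


# Illustrative error rates, not estimates from the talk.
F_POS, F_NEG = 0.05, 0.01
case_three = joint_call_distribution(1, 0, F_POS, F_NEG)  # deletion
for gb in (0, 1):
    print(gb, case_three[gb])
```

Rows index the blood call and columns the cell call, matching the layout of the tables above.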
Probability of observing (GB = 1, GC = 0) in each of the four possible cases.

[Figure: tree of the four cases.]
  TB = 0
    TC = 0: Case I
    TC = 1: Case II
  TB = 1
    TC = 0 (deletion): Case III
    TC = 1 (no deletion): Case IV
             Let’s focus on the (GB = 1, GC = 0) observations and find out which
             observations indeed come from CNVs.
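For reference, the (GB = 1, GC = 0) entry can be read off each of the four distribution tables. The block below only restates those entries in the f+ / f− notation; nothing new is derived.

```latex
\[
P(G_B = 1, G_C = 0) =
\begin{cases}
f_+ (1 - f_+)       & \text{Case I: } T_B = 0,\ T_C = 0 \\
f_+ f_-             & \text{Case II: } T_B = 0,\ T_C = 1 \\
(1 - f_-)(1 - f_+)  & \text{Case III: } T_B = 1,\ T_C = 0 \\
(1 - f_-) f_-       & \text{Case IV: } T_B = 1,\ T_C = 1
\end{cases}
\]
```

When f− is close to 0, the Case II and Case IV probabilities are close to 0 as well, so (1, 0) observations come almost entirely from Case I (a false positive call in blood) or Case III (a deletion).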
More on GATK

GATK takes into account the count of each type of nucleotide, the read quality, and the mapping quality at a genetic locus to infer its true zygosity.

But the inference is not always accurate. Luckily, we can control how GATK makes mistakes, which I will explain in a moment.
Filtering

  TB=0, TC=0   TB=0, TC=1   TB=1, TC=0   TB=1, TC=1
    Case I       Case II     Case III      Case IV

Outline:
  1 Use GATK to minimize Case II and Case IV by controlling threshold parameters that reduce f− at the expense of allowing a larger f+.
  2 Filter the variants called in the previous step and eliminate calls with lower quality metrics. By reducing f+, we can eliminate many variants in Case I.
  3 Use hypothesis tests to pick out Case III candidate loci.
  4 Fit the candidate loci to a hidden Markov model to pick out the most likely candidate loci.
Filtering: Step I

  TB=0, TC=0   TB=0, TC=1   TB=1, TC=0   TB=1, TC=1
    Case I       Case II     Case III      Case IV

• Run GATK with low threshold parameters to obtain a crude set of loci.
• Effects: f− ≈ 0, f+ increases.
• f− ≈ 0 ⟹ minimize Case II and Case IV.
• f+ is bounded above by a small number:

\[
\hat{f}_+ = \frac{\#(1, 0) + \#(0, 1)}{\#(1, 0) + \#(0, 1) + \#(G_B = 0, G_C = 0)} \approx 0.05
\]

• Minimize Case II and Case IV. Retain Case I and Case III. Number of loci retained = 15,971.
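The bound on f+ is a ratio of call counts, which can be sketched as follows. The function name is hypothetical and the counts are made up so that the ratio comes out at 0.05; they are not the counts behind the figure quoted on the slide.

```python
from collections import Counter


def estimate_f_pos_bound(calls):
    """Estimate the bound on f+ from a list of (GB, GC) call pairs.

    Discordant pairs, (1, 0) and (0, 1), are divided by discordant pairs
    plus (0, 0) pairs, following the formula on the slide.
    """
    counts = Counter(calls)
    discordant = counts[(1, 0)] + counts[(0, 1)]
    denominator = discordant + counts[(0, 0)]
    if denominator == 0:
        raise ValueError("no (1, 0), (0, 1) or (0, 0) calls to estimate from")
    return discordant / denominator


# Made-up counts for illustration only.
toy_calls = [(1, 0)] * 3 + [(0, 1)] * 2 + [(0, 0)] * 95 + [(1, 1)] * 50
f_pos_bound = estimate_f_pos_bound(toy_calls)
print(f_pos_bound)
```

Concordant heterozygous calls, (1, 1), do not enter the ratio.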
Figure: KS tests for runs of 1s against the gamma distribution. Shape and scale parameters for the gamma distribution are estimated for each chromosome and for each individual. Cells with fewer than 20 runs are indicated by "-". Cells with p-value > 0.05 are colored grey.

• Runs of 1s are interrupted randomly by short runs of 0s.
• Many of the 0 calls are just random noise.
Filtering: Step II

  TB=0, TC=0   TB=1, TC=0
    Case I       Case III

• Run GATK's Variant Quality Score Recalibration (VQSR) to filter out the false positive calls (loci in Case I).
• VQSR: fit a Gaussian mixture model to known variants and novel variants; filter based on the score of the variants.
• Effect: decrease f+.
• Eliminate most of Case I. Retain Case III. Number of loci retained = 380.
An important covariate for VQSR is strand bias.

DNA's double helix structure: forward and backward strands

Definition
Strand bias is the tendency to make more variant calls in one strand direction than in the other.
Quantifying Strand Bias

• Fisher's exact test:

\[
p = \binom{n_{1\cdot}}{n_{11}} \binom{n_{2\cdot}}{n_{21}} \bigg/ \binom{n_{\cdot\cdot}}{n_{\cdot 1}}
\]

                 Forward   Backward
  Reference        n11       n12       n1·
  Alternative      n21       n22       n2·
                   n·1       n·2       n··
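The formula above is the hypergeometric probability of the observed 2×2 table with its margins held fixed. The sketch below computes it with the standard library, and also sums the probabilities of all tables that are no more likely than the observed one, which is the usual two-sided Fisher p-value. Whether GATK's strand-bias annotation is computed exactly this way is not stated on the slide; the function names and the counts are illustrative.

```python
from math import comb


def table_probability(n11, n12, n21, n22):
    """Hypergeometric probability of a 2x2 table given its margins."""
    row1, row2 = n11 + n12, n21 + n22
    col1 = n11 + n21
    return comb(row1, n11) * comb(row2, n21) / comb(row1 + row2, col1)


def fisher_two_sided(n11, n12, n21, n22):
    """Sum the probabilities of all tables no more likely than the observed one."""
    row1, row2 = n11 + n12, n21 + n22
    col1 = n11 + n21
    p_observed = table_probability(n11, n12, n21, n22)
    p_value = 0.0
    # k runs over every feasible value of the top-left cell.
    for k in range(max(0, col1 - row2), min(row1, col1) + 1):
        p_k = table_probability(k, row1 - k, col1 - k, row2 - (col1 - k))
        if p_k <= p_observed * (1 + 1e-9):  # small tolerance for float ties
            p_value += p_k
    return min(p_value, 1.0)


# Toy counts: reference reads 3 forward / 1 backward,
# alternative reads 1 forward / 3 backward.
point_probability = table_probability(3, 1, 1, 3)
two_sided = fisher_two_sided(3, 1, 1, 3)
print(point_probability, two_sided)
```

A small p-value suggests that the alternative allele is supported mostly by reads from one strand.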
Filtering: Step III

  TB=0, TC=0   TB=1, TC=0
    Case I       Case III

• For each locus, do a hypothesis test:

      H0: TB = TC        H1: TB ≠ TC

• Logistic regression:

\[
I_{\{G = 1\}} \sim I_{\text{is blood}} + \text{strand direction} + \text{base quality} + \text{mapping quality}
\]

• Find loci for which the blood indicator I_{is blood} is significant at the 10% level. Number of Case III candidates = 126.
Features of the Data

• from blood or cell line
• strand direction (forward or backward)
• sequencing quality
Sequencing Quality

Quality is inversely related to P(error).
  • base quality: quality of a read at a genetic locus; determined by the sequencing equipment.
  • mapping quality: alignment quality of a read; calculated from base qualities and the reference sequence.

base quality + mapping quality ⟹ genotype likelihood: the likelihood of a locus being homozygous or heterozygous.
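The slide does not say which scale the qualities use. In SAM/BAM files, base and mapping qualities are stored on the Phred scale, Q = −10 log10 P(error), and the sketch below assumes that convention. The function names are illustrative.

```python
import math


def phred_to_error_prob(quality):
    """Convert a Phred-scaled quality score to an error probability."""
    return 10 ** (-quality / 10)


def error_prob_to_phred(p_error):
    """Convert an error probability (0 < p_error <= 1) to a Phred-scaled quality."""
    return -10 * math.log10(p_error)


for quality in (10, 20, 30):
    print(quality, phred_to_error_prob(quality))
```

Each increase of 10 in quality divides the error probability by 10.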
Logistic Regression

\[
I_{\{G = 1\}} \sim I_{\text{is blood}} + \text{strand direction} + \text{base quality} + \text{mapping quality}
\]

• A logistic regression model is fit at each locus.
• The deviance χ² goodness-of-fit test is performed for each model; only 2.4% of the tests are significant at the 5% level.

[Figure: Histogram of p-values from the χ² tests of the residual deviance. x-axis: p-values (0.0 to 1.0); y-axis: Frequency (0 to 600).]
Filtering: Step IV

Did the Case III candidates come from CNVs?

Define the length of a run of 0s as the number of consecutive (GB, GC) = (1, 0) calls.
Filtering: Step IV

[Figure: Density estimate of the lengths of runs of (GB, GC) = (1, 0) calls. N = 3286, bandwidth = 0.127. x-axis: run length (2 to 14); y-axis: Density (0.0 to 2.5). The upper tail is marked "> 95% quantile".]
Filtering: Step IV

10 loci come from runs of 0s of length at least 3:

  1101111111|000|1111111111
  1010111011|000|1111111111
  1111111011|000|1111110111
  0011111011|000|1101111111
  1111111111|000|1111111111
  1101110111|000|1111111111
  1011111001|000|1111111010
  1111111111|000|1111011110
  1111111111|000|1111111111
  1111111111|000|1101111011

Notice short runs of 1s. Are they errors?
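Runs like the ones listed above can be located with a run-length pass over the 0/1 call string. This is a minimal sketch, not the talk's code; it assumes the calls for one individual are already ordered along the chromosome and encoded as a string, and the function name is hypothetical.

```python
from itertools import groupby


def zero_runs(calls, min_length=1):
    """Return (start_index, length) for each run of '0' in a string of 0/1 calls."""
    runs = []
    position = 0
    for symbol, group in groupby(calls):
        length = len(list(group))
        if symbol == "0" and length >= min_length:
            runs.append((position, length))
        position += length
    return runs


# One of the sequences from the slide, with the '|' markers removed.
sequence = "1011111001|000|1111111010".replace("|", "")
all_runs = zero_runs(sequence)
long_runs = zero_runs(sequence, min_length=3)
print(all_runs)
print(long_runs)
```

With `min_length=3` only the central run is reported; the shorter runs of 0s in the flanks are the kind of isolated calls the earlier slides attribute to noise.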
(Future Work) Filtering: Step IV

Find probability of < 1011111001|000|1111111010 > using hidden Markov model:

[Figure (hmm.svg): hidden states "CNV" and "not CNV"; observed symbols "mismatched (0)" and "matched (1)".]

Transition matrix, with rows and columns both ordered (CNV, not CNV):

$$P_{i,i+1} = \begin{pmatrix} \gamma & 1-\gamma \\ 1-\lambda & \lambda \end{pmatrix}$$

where γ and λ are big.
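
One standard way to obtain the probability of an observed 0/1 sequence under a two-state hidden Markov model is the forward algorithm. The sketch below is only an illustration under stated assumptions: the slide gives the transition matrix in terms of γ and λ but no numerical values, emission probabilities, or initial distribution, so `GAMMA`, `LAM`, `EMIT`, and `START` are made-up placeholders, and `sequence_probability` is a hypothetical name.

```python
def sequence_probability(obs, gamma, lam, emit, start):
    """Forward algorithm for a two-state HMM.

    State 0 = CNV, state 1 = not CNV.
    obs is a list of 0/1 symbols (0 = mismatched, 1 = matched).
    emit[s][o] is the probability that state s emits symbol o.
    start[s] is the probability of starting in state s.
    """
    # Transition matrix from the slide: rows and columns ordered (CNV, not CNV).
    trans = [[gamma, 1 - gamma], [1 - lam, lam]]
    # alpha[s] = P(observations so far, current state = s)
    alpha = [start[s] * emit[s][obs[0]] for s in (0, 1)]
    for o in obs[1:]:
        alpha = [
            sum(alpha[r] * trans[r][s] for r in (0, 1)) * emit[s][o]
            for s in (0, 1)
        ]
    return sum(alpha)


# All values below are illustrative assumptions, not estimates from the data.
GAMMA = 0.9                        # P(CNV -> CNV), assumed "big"
LAM = 0.99                         # P(not CNV -> not CNV), assumed "big"
EMIT = [[0.9, 0.1], [0.05, 0.95]]  # CNV mostly emits 0; not CNV mostly emits 1
START = [0.01, 0.99]               # assumed initial state distribution

row = "1011111001|000|1111111010"
observations = [int(c) for c in row.replace("|", "")]
print(sequence_probability(observations, GAMMA, LAM, EMIT, START))
```

For long sequences the raw product underflows, so a practical version would work with log probabilities or rescale `alpha` at each step.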
Results

                       • After a series of filtering steps, only 10 loci in the pool of 16
                         individuals are found to be CNV candidates.

                       • Those 10 loci fall into short runs of 0s. They are unlikely
                         to be CNVs.

                       • We will fit the HMM when there are more reliable signals.
Conclusions

                       • No CNV is good news. We now know that a great amount of
                         time, money, and effort has not gone to waste.

                       • This is a useful assessment procedure when labs create cell lines.

                       • In separate work, we extended this procedure to finding
                         mutations in cell lines, i.e., TB = 0, TC = 1.

09 apr2012 presentation

  • 1. Copy-number Variations in Lymphoblastoid Cell Lines. Fei Yu, Carnegie Mellon University. April 4, 2012. Advisors: Bernie Devlin, Kathryn Roeder, Chad Schafer
  • 2. Motivation: • Advancement in DNA sequencing technology and rare genetic diseases such as autism. • Data collection rush. 100,000 samples in 15 years. • Money. Time. Logistics.
  • 3. Motivation (notes): A decade ago, people had few successes in finding genetic variants that cause rare diseases. One of the challenges was that they could only afford to look at small regions of the genome that they thought were linked to the disease. Today, as DNA sequencing technology develops, cheap and fast whole-genome sequencing has become available. Now, people can look at all the genes.
  • 5. Motivation (notes): The graph shows the cost of sequencing a genome over the past decade. In 2001, the cost was $100 million, which is prohibitively high. Today, a company called Illumina offers the service at $5000 per genome. They even give you a 20% discount when you place an order of 50 genomes or more. The drastic drop in cost triggered a rush to collect as many DNA samples as possible. It is projected that in 15 years, we will have over 100,000 samples.
  • 7. Motivation (notes): Despite the relatively low cost per genome, it still costs hundreds of millions to gather so many samples. Also, building the infrastructure to store, maintain, and distribute the data can cost as much as the sequencing itself. Furthermore, because these experiments involve human subjects, the researchers also have to obtain permission from the patients and safeguard their privacy. All in all, it is a huge investment of our society's resources.
  • 8. Motivation: But there is one problem: most DNA sequencing projects use lymphoblastoid cell lines instead of peripheral blood. Cell line: immortal(!); cultivated from peripheral blood. Blood: obtained from peripheral blood cells consisting of red blood cells, white blood cells, and platelets; best source of the DNA; mortal.
  • 9. Motivation (notes): But there is one problem: most DNA sequencing projects use lymphoblastoid cell lines instead of peripheral blood. Cell lines are immortal, so they are suitable for permanent storage. But they are products of peripheral blood cultivation. Blood data are obtained directly from peripheral blood cells consisting of red blood cells, white blood cells, and platelets. They are the best source of the DNA. However, because they are mortal, it is not practical to store them and use them at a later time. That is why people use cell lines for sequencing.
  • 10. Motivation: Are cell line data truthful representations of the DNA? In other words, how close are cell line data to blood data?
  • 11. Motivation (notes): Our concern is whether cell line data are truthful representations of the DNA. In other words, we want to know how close cell line data are to blood data. If the cell lines are corrupted, any subsequent analyses will lose their basis, and all the time, money, and effort invested in collecting these DNA samples will have gone to waste.
  • 12. Outline: 1 Motivation (A Strange Scenario). 2 Data (Pipeline). 3 How to detect CNVs (Filtering: Step I; Filtering: Step II; Filtering: Step III; Filtering: Step IV; Results). 4 Conclusions.
  • 13. Inference from Blood and Cell: A Strange Scenario. For a diploid organism (human), genotypes by allele on chromosome p2 (rows) and chromosome p1 (columns): row A: AA, AB; row B: BA, BB. Homozygous if AA or BB. Heterozygous if AB or BA.
  • 14. Inference from Blood and Cell: A Strange Scenario (notes): For diploid organisms such as humans, chromosomes come in pairs. Each chromosome contains one copy of a gene. An allele is one of two or more forms of a gene. If both alleles on a pair of chromosomes are the same, we call the genetic locus homozygous; if the alleles are different, we call the genetic locus heterozygous.
  • 15. Inference from Blood and Cell: A Strange Scenario. 1 = Heterozygous, 0 = Homozygous. [Figure: at location 150, Blood = 1 and Cell = 0.]
  • 16. Inference from Blood and Cell: A Strange Scenario (notes): Denote a heterozygous locus by 1 and a homozygous locus by 0. The picture shows that at location 150, the blood is heterozygous and the cell line is homozygous.
  • 17. Inference from Blood and Cell: A Strange Scenario (figure only)
  • 18. Inference from Blood and Cell: A Strange Scenario (notes): If we only look at loci at which the blood is heterozygous, we may encounter the situation depicted by this picture. There are consecutive loci that are homozygous in the cell line but heterozygous in the blood. This looks suspicious.
  • 19. Copy-number Variations in Detour: What is Copy-number Variation? Lymphoblas- toid Cell Lines Fei Yu Copy-number variations (CNVs) correspond to relatively large Motivation regions of the genome that have been deleted on a A Strange Scenario chromosome. Data Pipeline How to detect CNVs Filtering: Step I Filtering: Step II Filtering: Step III Filtering: Step IV Results Conclusions
  • 20. Detour: What is Copy-number Variation? Now we take a detour and define copy-number variation. Copy-number variations (CNVs) correspond to relatively large regions of the genome that have been deleted on a chromosome. This picture shows the black region deleted from the chromosome.
  • 21. Inference from Blood and Cell: A Strange Scenario. What a CNV in the cell line looks like: [Figure: blood and cell line chromosomes side by side.]
  • 22. Inference from Blood and Cell: A Strange Scenario. In this picture, the blood, which can be thought of as a representation of the DNA, is heterozygous. The cell line, on the other hand, has the red region deleted. When we sequence the samples, we look at both chromosomes. But because the red region in the cell line is deleted, we can only sequence the remaining chromosome. As a result of the deletion, the cell line will always tell us this genetic locus is homozygous even though the DNA is heterozygous.
  • 23. Inference from Blood and Cell: A Strange Scenario. This could be a CNV!
  • 24. Inference from Blood and Cell: A Strange Scenario. Let’s go back to this picture. This scenario fits the profile of a CNV. If this indeed happens in the cell line, we know the cell line is corrupted at that region.
  • 25. Goal. Having CNVs in the cell line means the cell line is locally corrupted. The goal of this project is to use the number of CNVs to quantify how reliable the cell line is as a source of DNA.
  • 26. Data. The data we have: (1) 16 individuals’ entire exomes sequenced by next-generation sequencing (NGS) technology. (2) Each individual is sequenced twice: once using blood samples and once using cell line samples.
  • 27. Data. The data we have allow us to compare the cell line data with the blood data and answer the question of whether they are the same.
  • 28. Pipeline. NGS of blood and cell line samples → BAM files → GATK (produces VCF files) and Samtools (produces additional locus-specific information) → Python scripts → data ready for analysis.
  • 29. Pipeline: NGS. [Figures: the pipeline diagram (pipeline1.svg) and an NGS demonstration (ngs_demo_short.svg).]
  • 30. Pipeline: NGS. Next-generation sequencing (NGS) technology. Advantages: fast; cost-effective. Disadvantages: the short DNA read fragments are randomly located, which makes fragment assembly and mapping a great challenge.
  • 31. Pipeline: BAM files. Our raw data are BAM files. Their sizes are huge: they encode the whole genome’s nucleotide alignments, and they also encode the quality of each read for a given locus (a locus can be covered by as many as 1000 reads).
                                             Mt. Sinai     Vanderbilt
      # of subjects                          7             12
      # of subjects with corrupted data      1             2
      Average file size                      7.4 GiB       17 GiB
      Total size                             ≈ 85 GiB      ≈ 340 GiB
  • 33. Pipeline: BAM files.
  • 34. Pipeline: GATK, Samtools. Genome Analysis Toolkit (GATK): makes inference from the BAM files and determines whether a locus is homozygous or heterozygous; applies different filters to obtain the desired results. Samtools: extracts read-level information such as sequencing quality, alignment quality, and read direction.
  • 36. Pipeline: GATK, Samtools. Processing time: ∼1 day. GATK outputs (sample columns are cut off on the slide):
      [HEADER LINES]
      #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA12878
      chr1 873762 . T G 5231.78 PASS [ANNOTATIONS] GT:AD:DP:GQ:PL 0/1:173,141:282:99
      chr1 877664 rs3828047 A G 3931.66 PASS [ANNOTATIONS] GT:AD:DP:GQ:PL 1/1:0,105:94:99:25
      chr1 899282 rs28548431 C T 71.77 PASS [ANNOTATIONS] GT:AD:DP:GQ:PL 0/1:1,3:4:25.92:10
      chr1 974165 rs9442391 T C 29.84 LowQual [ANNOTATIONS] GT:AD:DP:GQ:PL 0/1:14,4:14:60.91:
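The GT field in each record is what gets turned into a 0/1 zygosity call. The following is a minimal sketch of that step, not the talk's actual Python scripts: the function names are hypothetical, the record is assumed to be a single-sample VCF line, and the INFO column is replaced by the placeholder "." because the slide elides it as [ANNOTATIONS].

```python
def zygosity_from_gt(gt):
    """Return 1 for a heterozygous genotype (e.g. 0/1), 0 for a homozygous one (e.g. 1/1)."""
    alleles = gt.replace("|", "/").split("/")
    if len(alleles) != 2 or "." in alleles:
        raise ValueError(f"unsupported genotype: {gt!r}")
    return int(alleles[0] != alleles[1])


def parse_vcf_line(line):
    """Extract (chromosome, position, zygosity) from one single-sample VCF record."""
    fields = line.split()
    chrom, pos = fields[0], int(fields[1])
    format_keys = fields[8].split(":")      # e.g. GT:AD:DP:GQ:PL
    sample_values = fields[9].split(":")    # values in the same order as the keys
    gt = sample_values[format_keys.index("GT")]
    return chrom, pos, zygosity_from_gt(gt)


# INFO is "." here only because the slide does not show the annotations.
record = "chr1 873762 . T G 5231.78 PASS . GT:AD:DP:GQ:PL 0/1:173,141:282:99"
print(parse_vcf_line(record))
```

Running the same function on the blood and cell line VCF files of one individual would give the paired calls that the rest of the talk compares.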
  • 37. Pipeline: tidy up. Python scripts: extract useful information from GATK and Samtools’ outputs; prepare data for analysis in R.
  • 38. Outline. 1 Motivation: A Strange Scenario. 2 Data: Pipeline. 3 How to detect CNVs: Filtering Step I, Filtering Step II, Filtering Step III, Filtering Step IV, Results. 4 Conclusions.
  • 39. Notations. Let $T$ denote the zygosity of a genetic locus: $T = 1$ if the locus is heterozygous and $T = 0$ if the locus is homozygous. Let $G$ denote the zygosity called by GATK: $G = 1$ if the call is heterozygous and $G = 0$ if the call is homozygous. Let $f_+ = P(G = 1 \mid T = 0)$ [false positive] and $f_- = P(G = 0 \mid T = 1)$ [false negative].
  • 42. Distribution of $(G_B, G_C)$ I. We can describe the distribution of the observations $(G_B, G_C)$, the blood and cell line calls, in four cases:
      (I) $T_B = T_C = 0$
                         Cell call 0               Cell call 1
      Blood call 0       $(1 - f_+)^2$             $(1 - f_+) f_+$
      Blood call 1       $f_+ (1 - f_+)$           $f_+^2$
      (II) $T_B = 0$, $T_C = 1$ (i.e., a mutation)
                         Cell call 0               Cell call 1
      Blood call 0       $(1 - f_+) f_-$           $(1 - f_+)(1 - f_-)$
      Blood call 1       $f_+ f_-$                 $f_+ (1 - f_-)$
  • 43. Distribution of $(G_B, G_C)$ II.
      (III) $T_B = 1$, $T_C = 0$ (i.e., a deletion)
                         Cell call 0               Cell call 1
      Blood call 0       $f_- (1 - f_+)$           $f_- f_+$
      Blood call 1       $(1 - f_-)(1 - f_+)$      $(1 - f_-) f_+$
      (IV) $T_B = T_C = 1$ (i.e., not a deletion)
                         Cell call 0               Cell call 1
      Blood call 0       $f_-^2$                   $f_- (1 - f_-)$
      Blood call 1       $(1 - f_-) f_-$           $(1 - f_-)^2$
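All four tables follow one rule: the blood call and the cell line call are treated as independent given the true zygosities, so each cell is a product of two marginal call probabilities. The sketch below rebuilds the tables from that rule. The function names and the numeric error rates (f_pos = 0.05, f_neg = 0.01) are illustrative assumptions, not values from the talk.

```python
def call_probs(true_zygosity, f_pos, f_neg):
    """Return (P(G=0 | T), P(G=1 | T)) for a true zygosity T of 0 or 1."""
    if true_zygosity == 0:
        return (1 - f_pos, f_pos)   # a homozygous locus is miscalled with prob f_pos
    return (f_neg, 1 - f_neg)       # a heterozygous locus is miscalled with prob f_neg


def joint_table(t_blood, t_cell, f_pos, f_neg):
    """2x2 table indexed as table[G_B][G_C], assuming independent calls."""
    p_blood = call_probs(t_blood, f_pos, f_neg)
    p_cell = call_probs(t_cell, f_pos, f_neg)
    return [[p_blood[gb] * p_cell[gc] for gc in (0, 1)] for gb in (0, 1)]


# Illustrative error rates only.
f_pos, f_neg = 0.05, 0.01
cases = {"I": (0, 0), "II": (0, 1), "III": (1, 0), "IV": (1, 1)}
for name, (t_b, t_c) in cases.items():
    table = joint_table(t_b, t_c, f_pos, f_neg)
    print(f"Case {name}: P(G_B=1, G_C=0) = {table[1][0]:.6f}")
```

With small error rates, the observation (G_B = 1, G_C = 0) is common only under Case III, which is why the filtering steps concentrate on it.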
  • 44. Probability of observing $(G_B = 1, G_C = 0)$ in each of the four possible cases. [Tree diagram: $T_B = 0$ branches into $T_C = 0$ (Case I) and $T_C = 1$ (Case II); $T_B = 1$ branches into deletion, $T_C = 0$ (Case III), and no deletion, $T_C = 1$ (Case IV).]
  • 45. How to detect CNVs. Let’s focus on the $(G_B = 1, G_C = 0)$ observations and find out which of them indeed come from CNVs.
  • 46. More on GATK. GATK takes into account the number of each type of nucleotide, the read quality, and the mapping quality at a genetic locus to make inference on its true zygosity $T$. But the inference is not always accurate. Luckily, we can control how GATK makes mistakes, which I will explain in a moment.
  • 48. Filtering. Cases: Case I ($T_B = 0$, $T_C = 0$), Case II ($T_B = 0$, $T_C = 1$), Case III ($T_B = 1$, $T_C = 0$), Case IV ($T_B = 1$, $T_C = 1$). Outline:
      1. Use GATK to minimize Case II and Case IV by controlling threshold parameters that reduce $f_-$ at the expense of allowing a larger $f_+$.
      2. Filter the variants called in the previous step and eliminate calls with lower quality metrics. By reducing $f_+$, we can eliminate many variants in Case I.
      3. Use hypothesis tests to pick out Case III candidate loci.
      4. Fit the candidate loci to a hidden Markov model to pick out the most likely candidate loci.
  • 49. Filtering: Step I. Run GATK with low threshold parameters to obtain a crude set of loci. Effects: $f_- \approx 0$, while $f_+$ increases. $f_- \approx 0$ implies that Case II and Case IV are minimized. $f_+$ is bounded above by a small number:
      $\hat{f}_+ = \dfrac{\#(1, 0) + \#(0, 1)}{\#(1, 0) + \#(0, 1) + \#(G_B = 0, G_C = 0)} \approx 0.05$
      Minimize Case II and Case IV. Retain Case I and Case III. Number of loci retained = 15,971.
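The bound on $f_+$ is a ratio of discordant call pairs to all pairs containing at least one homozygous call. A minimal sketch of that arithmetic follows; the function name is hypothetical and the counts are made up so that the ratio lands on 0.05, since the talk does not report the underlying counts.

```python
def estimate_f_pos(n_10, n_01, n_00):
    """Estimate of f+ from counts of (G_B, G_C) pairs.

    n_10: blood heterozygous, cell line homozygous
    n_01: blood homozygous, cell line heterozygous
    n_00: both homozygous
    """
    discordant = n_10 + n_01
    return discordant / (discordant + n_00)


# Hypothetical counts, chosen only to illustrate the formula.
print(estimate_f_pos(n_10=30, n_01=20, n_00=950))
```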
  • 54. Figure: KS-tests for runs of 1s against the gamma distribution. Shape and scale parameters for the gamma distribution are estimated for each chromosome and for each individual. Cells with fewer than 20 runs are indicated by “-”. Cells with p-value > 0.05 are colored grey. Runs of 1s are interrupted randomly by short runs of 0s. Many of the 0 calls are just random noise.
  • 55. Filtering: Step II. Remaining cases: Case I ($T_B = 0$, $T_C = 0$) and Case III ($T_B = 1$, $T_C = 0$). Run GATK’s Variant Quality Score Recalibration (VQSR) to filter out the false positive calls (loci in Case I). VQSR: fit a Gaussian mixture model to known variants and novel variants; filter based on the score of the variants. Effect: decrease $f_+$. Eliminate most of Case I. Retain Case III. Number of loci retained = 380.
  • 57. An important covariate for VQSR is strand bias. DNA’s double helix structure: forward and backward strands. [Figure] Definition: strand bias is the tendency to make more variant calls in one direction than in the other.
  • 58. Quantifying Strand Bias. Fisher’s exact test: $p = \binom{n_{1\cdot}}{n_{11}} \binom{n_{2\cdot}}{n_{21}} \Big/ \binom{n_{\cdot\cdot}}{n_{\cdot 1}}$
                       Forward          Backward         Total
      Reference        $n_{11}$         $n_{12}$         $n_{1\cdot}$
      Alternative      $n_{21}$         $n_{22}$         $n_{2\cdot}$
      Total            $n_{\cdot 1}$    $n_{\cdot 2}$    $n_{\cdot\cdot}$
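The formula on this slide is the hypergeometric probability of the observed 2×2 table with its margins held fixed. A short sketch using only the standard library is given below; the function name and the read counts are illustrative assumptions. A full Fisher's exact test p-value would additionally sum this quantity over all tables at least as extreme as the observed one.

```python
from math import comb


def table_probability(n11, n12, n21, n22):
    """Hypergeometric probability of a 2x2 table with fixed margins.

    Rows: reference / alternative allele. Columns: forward / backward strand.
    """
    row1, row2 = n11 + n12, n21 + n22
    col1 = n11 + n21
    total = row1 + row2
    return comb(row1, n11) * comb(row2, n21) / comb(total, col1)


# Hypothetical read counts: reference reads mostly forward, alternative mostly backward.
print(table_probability(n11=8, n12=2, n21=1, n22=5))
```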
  • 61. Filtering: Step III. Remaining cases: Case I ($T_B = 0$, $T_C = 0$) and Case III ($T_B = 1$, $T_C = 0$). For each locus, do a hypothesis test: $H_0: T_B = T_C$ versus $H_1: T_B \neq T_C$. Logistic regression: $I_{G=1} \sim I_{\text{isblood}}$ + strand direction + base quality + mapping direction. Find loci for which $I_{\text{isblood}}$ is significant at the 10% level. Number of Case III candidates = 126.
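Written out, the regression formula above corresponds to a model of the following form. This is one reading of the slide, under the assumption that the model is fit per locus with one observation per read; the coefficient names $\beta_0, \dots, \beta_4$ and the covariate symbols $x_{\text{strand}}$, $x_{\text{base}}$, $x_{\text{map}}$ are notation introduced here for the three remaining covariates in the order the slide lists them.

```latex
\[
\log \frac{P(G = 1)}{1 - P(G = 1)}
  = \beta_0
  + \beta_1 I_{\text{isblood}}
  + \beta_2 x_{\text{strand}}
  + \beta_3 x_{\text{base}}
  + \beta_4 x_{\text{map}}
\]
```

Under this reading, testing $H_0: T_B = T_C$ amounts to testing $\beta_1 = 0$: if the source of the sample (blood or cell line) does not change the odds of a heterozygous call once the quality covariates are accounted for, there is no evidence that the two true zygosities differ.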
  • 63. Features of the Data. From blood or cell line; strand direction (forward or backward); sequencing quality.
  • 65. Sequencing Quality. Quality is inversely related to P(error). Base quality: quality of a read at a genetic locus; determined by the sequencing equipment. Mapping quality: alignment quality of a read; calculated from base qualities and the reference sequence. Base quality + mapping quality =⇒ genotype likelihood, i.e., the likelihood of a locus being homozygous or heterozygous.
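The inverse relation between quality and P(error) can be made concrete. The sketch below assumes the Phred convention, Q = −10 log10 P(error), which is the scale on which SAM/BAM files store base and mapping qualities; the slide itself does not name the scale, and the function names are hypothetical.

```python
import math


def phred_to_error_prob(quality):
    """Convert a Phred-scaled quality score to an error probability."""
    return 10 ** (-quality / 10)


def error_prob_to_phred(p_error):
    """Convert an error probability (0 < p_error <= 1) to a Phred-scaled quality."""
    return -10 * math.log10(p_error)


for q in (10, 20, 30):
    print(q, phred_to_error_prob(q))
```

On this scale, every additional 10 points of quality corresponds to a tenfold drop in the probability that the base call or the alignment is wrong.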
  • 67. Logistic Regression. $I_{G=1} \sim I_{\text{isblood}}$ + strand direction + base quality + mapping direction. Each locus is fit to a logistic regression model. We perform the deviance $\chi^2$ goodness-of-fit test for each model and see that only 2.4% of the tests are significant at the 5% level. [Figure: histogram of p-values from the $\chi^2$ tests of the residual deviance; x-axis: p-values, y-axis: frequency.]
  • 69. Filtering: Step IV. Did the Case III candidates come from CNVs? [Figure] Define the length of a run of 0s as the number of consecutive $(G_B, G_C) = (1, 0)$ calls.
  • 71. Filtering: Step IV. [Figure: density estimate of the lengths of runs of $(G_B, G_C) = (1, 0)$ calls, with the region above the 95% quantile marked; N = 3286, bandwidth = 0.127.]
  • 72. Filtering: Step IV. 10 loci come from runs of 0s of length at least 3:
      1101111111|000|1111111111
      1010111011|000|1111111111
      1111111011|000|1111110111
      0011111011|000|1101111111
      1111111111|000|1111111111
      1101110111|000|1111111111
      1011111001|000|1111111010
      1111111111|000|1111011110
      1111111111|000|1111111111
      1111111111|000|1101111011
      Notice short runs of 1s. Are they errors?
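Locating such runs is a simple scan over the 0/1 string. A minimal sketch follows, assuming the calls are already coded as a string in which 0 marks a $(G_B, G_C) = (1, 0)$ locus; the function name and the default threshold of 3 (taken from this slide) are illustrative choices.

```python
from itertools import groupby


def zero_runs(calls, min_length=3):
    """Return (start_index, length) for each run of '0' at least min_length long."""
    runs = []
    position = 0
    for symbol, group in groupby(calls):
        length = len(list(group))
        if symbol == "0" and length >= min_length:
            runs.append((position, length))
        position += length
    return runs


# One of the sequences above, with the '|' separators removed.
sequence = "1011111001" + "000" + "1111111010"
print(zero_runs(sequence))
```

Shorter runs of 0s in the flanking regions, such as the pair at positions 7 and 8 of this sequence, fall below the threshold and are ignored.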
  • 73. (Future Work) Filtering: Step IV. Find the probability of <1011111001|000|1111111010> using a hidden Markov model with hidden states CNV and not CNV, and observed symbols mismatched (0) and matched (1). [Figure: HMM state diagram.] The transition matrix between loci $i$ and $i+1$, with rows and columns ordered (CNV, not CNV), is
      $P_{i,i+1} = \begin{pmatrix} \gamma & 1 - \gamma \\ 1 - \lambda & \lambda \end{pmatrix}$
      where $\gamma$ and $\lambda$ are large (close to 1).
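The probability of an observed sequence under such a model can be computed with the forward algorithm. The sketch below is only an illustration of that computation, not the fitted model, which the talk leaves as future work. The transition matrix follows the slide; the emission probabilities, the initial state probability, and every numeric value are assumptions introduced here.

```python
def sequence_likelihood(calls, gamma, lam, p_mismatch_cnv, p_mismatch_bg, p_start_cnv):
    """Forward algorithm for a two-state HMM (state 0 = CNV, state 1 = not CNV).

    calls: non-empty string of '0' (mismatched) and '1' (matched) symbols.
    """
    transition = [[gamma, 1 - gamma], [1 - lam, lam]]
    p_mismatch = [p_mismatch_cnv, p_mismatch_bg]

    def emit(state, symbol):
        # Probability of the observed symbol given the hidden state.
        return p_mismatch[state] if symbol == "0" else 1 - p_mismatch[state]

    start = [p_start_cnv, 1 - p_start_cnv]
    alpha = [start[s] * emit(s, calls[0]) for s in (0, 1)]
    for symbol in calls[1:]:
        alpha = [
            sum(alpha[prev] * transition[prev][s] for prev in (0, 1)) * emit(s, symbol)
            for s in (0, 1)
        ]
    return sum(alpha)


# All parameter values below are hypothetical.
sequence = "1011111001" + "000" + "1111111010"
print(sequence_likelihood(sequence, gamma=0.9, lam=0.99,
                          p_mismatch_cnv=0.95, p_mismatch_bg=0.05,
                          p_start_cnv=0.01))
```

For much longer sequences the same recursion would normally be run in log space or with per-step rescaling to avoid numerical underflow.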
  • 74. Results. After a series of filtering steps, only 10 loci in the pool of 16 individuals are found to be CNV candidates. Those 10 loci fall into short runs of 0s. They are unlikely to be CNVs. We will fit the HMM when there are more reliable signals.
  • 76. Conclusions. No CNV is good news. We now know that a great amount of time, money, and effort has not gone to waste. This is a useful assessment procedure when labs create cell lines. In separate work, we extended this procedure to finding mutations in the cell line, i.e., $T_B = 0$, $T_C = 1$.