09 apr2012 presentation
1. Copy-number Variations in Lymphoblastoid Cell Lines
Fei Yu
Carnegie Mellon University
April 4, 2012
Advisors: Bernie Devlin, Kathryn Roeder, Chad Schafer
2. Motivation
• Advancement in DNA sequencing technology and rare genetic diseases such as autism.
• Data collection rush. 100,000 samples in 15 years.
• Money. Time. Logistics.
3. Motivation (speaker notes)
A decade ago, people had few successes in finding genetic variants that cause rare diseases. One of the challenges was that they could only afford to look at small regions of the genome that they thought were linked to the disease.
Today, as DNA sequencing technology develops, cheap and fast whole-genome sequencing has become available. Now, people can look at all the genes.
5. Motivation (speaker notes)
The graph shows the cost of sequencing a genome over the past decade.
In 2001, the cost was $100 million, which is prohibitively high.
Today, a company called Illumina offers the service at $5000 per genome. They even give you a 20% discount when you place an order of 50 genomes or more.
The drastic drop in cost triggered a rush to collect as many DNA samples as possible. It is projected that in 15 years, we will have over 100,000 samples.
7. Motivation (speaker notes)
Despite the relatively low cost per genome, it still costs hundreds of millions of dollars to gather so many samples.
Also, building the infrastructure to store, maintain, and distribute the data can cost as much as the sequencing itself.
Furthermore, because these experiments involve human subjects, the researchers also have to obtain permission from the patients and safeguard their privacy.
All in all, it is a huge investment of our society's resources.
8. Motivation
But there is one problem: most DNA sequencing projects use lymphoblastoid cell lines instead of peripheral blood.
Cell line
- Immortal(!)
- Cultivated from peripheral blood
Blood
- Obtained from peripheral blood cells consisting of red blood cells, white blood cells, and platelets
- Best source of the DNA
- Mortal
9. Motivation (speaker notes)
But there is one problem: most DNA sequencing projects use lymphoblastoid cell lines instead of peripheral blood.
Cell lines are immortal, so they are suitable for permanent storage. But they are products of peripheral blood cultivation.
Blood data are obtained directly from peripheral blood cells consisting of red blood cells, white blood cells, and platelets. They are the best source of the DNA.
However, because they are mortal, it is not practical to store them and use them at a later time.
That's why people use cell lines for sequencing.
10. Motivation
Are cell line data truthful representations of the DNA?
In other words, how close are cell line data to blood data?
11. Motivation (speaker notes)
Our concern is whether cell line data are truthful representations of the DNA. In other words, we want to know how close cell line data are to blood data.
If the cell lines are corrupted, any subsequent analyses will lose their basis, and all the time, money, and effort invested in collecting these DNA samples will have gone to waste.
12. Outline
1 Motivation
    A Strange Scenario
2 Data
    Pipeline
3 How to detect CNVs
    Filtering: Step I
    Filtering: Step II
    Filtering: Step III
    Filtering: Step IV
    Results
4 Conclusions
13. Inference from Blood and Cell: A Strange Scenario
For a diploid organism (human):
                       Chromosome p1
                       A      B
Chromosome p2    A     AA     AB
                 B     BA     BB
Homozygous if AA or BB.
Heterozygous if AB or BA.
14. Inference from Blood and Cell: A Strange Scenario (speaker notes)
For diploid organisms such as humans, chromosomes come in pairs. Each chromosome contains one copy of a gene.
An allele is one of two or more forms of a gene.
If both alleles on a pair of chromosomes are the same, we call the genetic locus homozygous; if the alleles are different, we call the genetic locus heterozygous.
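The definition above can be sketched in a few lines of Python. This is only an illustration: the function name `zygosity` and the use of the letters A and B for alleles follow the slide's table, not any code from the project.

```python
def zygosity(allele_p1: str, allele_p2: str) -> int:
    """Return 1 if the locus is heterozygous, 0 if it is homozygous.

    allele_p1 and allele_p2 are the alleles carried on the two
    chromosomes of a pair (hypothetical single-letter labels).
    """
    return 0 if allele_p1 == allele_p2 else 1


# Rebuild the 2x2 table from the slide: AA and BB are homozygous (0),
# AB and BA are heterozygous (1).
table = {p1 + p2: zygosity(p1, p2) for p1 in "AB" for p2 in "AB"}
print(table)
```

The 0/1 coding matches the convention used on the following slides (1 = heterozygous, 0 = homozygous).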
15. Inference from Blood and Cell: A Strange Scenario
[Figure: zygosity (1 = heterozygous, 0 = homozygous) plotted against location for blood and cell line; at location 150 the blood is at 1 and the cell line is at 0.]
16. Inference from Blood and Cell: A Strange Scenario (speaker notes)
Denote a heterozygous locus by 1 and a homozygous locus by 0. The picture shows that at location 150, the blood is heterozygous and the cell line is homozygous.
17. Inference from Blood and Cell: A Strange Scenario
[Figure: zygosity of blood and cell line across consecutive loci.]
18. Inference from Blood and Cell: A Strange Scenario (speaker notes)
If we only look at loci at which the blood is heterozygous, we may encounter the situation depicted by this picture: consecutive loci that are homozygous in the cell line but heterozygous in the blood.
This looks suspicious.
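The scenario described above can be sketched as a scan for runs. The snippet below is a minimal illustration under stated assumptions, not the project's code: the function name `suspicious_runs`, the `min_length` parameter, and the toy data are all hypothetical. It keeps only loci where the blood call is heterozygous and then reports stretches of consecutive kept loci where the cell line call is homozygous.

```python
from itertools import groupby


def suspicious_runs(locations, blood, cell, min_length=2):
    """Find runs of consecutive blood-heterozygous loci that are
    homozygous in the cell line.

    locations: locus positions, in genomic order
    blood, cell: 0/1 zygosity calls (1 = heterozygous, 0 = homozygous)
    min_length: hypothetical cutoff for how long a run must be to report
    Returns a list of (first_location, last_location, run_length).
    """
    # Restrict attention to loci at which the blood is heterozygous.
    kept = [(loc, c) for loc, b, c in zip(locations, blood, cell) if b == 1]

    runs = []
    # groupby collapses consecutive loci with the same cell-line call.
    for cell_call, group in groupby(kept, key=lambda item: item[1]):
        group = list(group)
        if cell_call == 0 and len(group) >= min_length:
            runs.append((group[0][0], group[-1][0], len(group)))
    return runs


# Toy data (made up for illustration only).
locations = [100, 120, 150, 160, 170, 200, 230]
blood = [1, 1, 1, 1, 1, 0, 1]
cell = [1, 1, 0, 0, 0, 0, 1]
print(suspicious_runs(locations, blood, cell))
```

In the toy data the locus at 200 is dropped because the blood itself is homozygous there, so it neither extends nor breaks the run.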
19. Detour: What is Copy-number Variation?
Copy-number variations (CNVs) correspond to relatively large regions of the genome that have been deleted on a chromosome.
20. Detour: What is Copy-number Variation? (speaker notes)
Now we take a detour and define copy-number variation.
Copy-number variations (CNVs) correspond to relatively large regions of the genome that have been deleted on a chromosome.
The picture shows the black region deleted from the chromosome.
21. Inference from Blood and Cell: A Strange Scenario
What a CNV in cell line looks like:
[Figure: a chromosome pair from blood next to a chromosome pair from the cell line, in which a region has been deleted.]
22. Inference from Blood and Cell: A Strange Scenario (speaker notes)
In this picture, the blood, which can be thought of as a representation of the DNA, is heterozygous. On the other hand, the cell line has the red region deleted.
When we sequence the samples, we look at both chromosomes. But in this case, because the red region in the cell line is deleted, we can only sequence the remaining chromosome.
As a result of the deletion, the cell line will always tell us this genetic locus is homozygous even though the DNA is heterozygous.
23. Inference from Blood and Cell: A Strange Scenario
This could be a CNV!
24. Inference from Blood and Cell: A Strange Scenario (speaker notes)
Let's go back to this picture. This scenario fits the profile of a CNV. If this is indeed what happened in the cell line, we know the cell line is corrupted in that region.
25. Goal
Having CNVs in the cell line means the cell line is locally corrupted. The goal of this project is to use the amount of CNVs to quantify how reliable the cell line is as a source of DNA.
26. Data
The data we have:
• 16 individuals' entire exomes sequenced by next-generation sequencing (NGS) technology.
• Each individual is sequenced twice: once using blood samples and the other time using cell line samples.
27. Data (speaker notes)
The data we have allow us to compare cell line data with blood data and answer the question of whether they are the same.
28. Pipeline
Blood and cell line samples → NGS → BAM files
BAM files → GATK → VCF files
BAM files → Samtools → additional locus-specific information
VCF files + additional locus-specific information → Python scripts → data ready for analysis
29. Pipeline: NGS
[Figures: pipeline diagram (pipeline1.svg); NGS illustration (ngs_demo_short.svg).]
30. Pipeline: NGS
Next-generation sequencing (NGS) technology
Advantages:
• Fast
• Cost-effective
Disadvantages:
• Short DNA read fragments are randomly located ⟹ great challenge for fragment assembly and mapping
31. Pipeline: BAM files
Our raw data are BAM files. Their sizes are huge:
• encode the whole genome's nucleotide alignments
• also encode quality of each read for a given locus (a locus can be covered by as many as 1000 reads)

                                          Mt. Sinai    Vanderbilt
# of subjects                             7            12
# of subjects that have corrupted data    1            2
Average file size                         7.4 GiB      17 GiB
Total size                                ≈ 85 GiB     ≈ 340 GiB
33. Pipeline: BAM files
34. Pipeline: GATK, Samtools
[Figure: pipeline diagram (pipeline2.svg).]
• Genome Analysis Toolkit (GATK):
  - makes inferences from the BAM files and determines whether a locus is homozygous or heterozygous.
  - applies different filters to obtain desired results.
• Samtools: extracts read-level information such as sequencing quality, alignment quality, and read direction.
36. Pipeline: GATK, Samtools
Processing time: ∼1 day.
GATK outputs:
[HEADER LINES]
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA12878
chr1 873762 . T G 5231.78 PASS [ANNOTATIONS] GT:AD:DP:GQ:PL 0/1:173,141:282:99
chr1 877664 rs3828047 A G 3931.66 PASS [ANNOTATIONS] GT:AD:DP:GQ:PL 1/1:0,105:94:99:25
chr1 899282 rs28548431 C T 71.77 PASS [ANNOTATIONS] GT:AD:DP:GQ:PL 0/1:1,3:4:25.92:10
chr1 974165 rs9442391 T C 29.84 LowQual [ANNOTATIONS] GT:AD:DP:GQ:PL 0/1:14,4:14:60.91:
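To make the GATK output above concrete, here is a rough sketch of how a zygosity call could be read off one VCF data line. It is not the project's Python script: the function name `gatk_call` is hypothetical, and the line is split on whitespace only because the slide shows the fields that way (real VCF files are tab-delimited). The sample column is matched to the FORMAT keys, and the GT entry is coded 1 when the two alleles differ and 0 when they are the same.

```python
import re


def gatk_call(vcf_line):
    """Return (chrom, pos, g) for one VCF data line, where g is
    1 for a heterozygous genotype, 0 for a homozygous one, and
    None when the genotype is missing.
    """
    fields = vcf_line.split()
    chrom, pos = fields[0], int(fields[1])

    # Column 9 lists the per-sample keys, column 10 the sample's values.
    # zip() tolerates a truncated sample column such as the slide's.
    sample = dict(zip(fields[8].split(":"), fields[9].split(":")))

    # GT looks like "0/1" (unphased) or "0|1" (phased).
    alleles = re.split(r"[/|]", sample["GT"])
    if "." in alleles:
        return chrom, pos, None
    g = 1 if len(set(alleles)) > 1 else 0
    return chrom, pos, g


line_het = "chr1 873762 . T G 5231.78 PASS [ANNOTATIONS] GT:AD:DP:GQ:PL 0/1:173,141:282:99"
line_hom = "chr1 877664 rs3828047 A G 3931.66 PASS [ANNOTATIONS] GT:AD:DP:GQ:PL 1/1:0,105:94:99:25"
print(gatk_call(line_het))
print(gatk_call(line_hom))
```

Note that a 1/1 genotype is homozygous for the alternative allele, so it is coded 0 here even though it differs from the reference.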
37. Pipeline: tidy up
[Figure: pipeline diagram (pipeline3.svg).]
Python scripts:
• extract useful information from GATK and Samtools' outputs
• prepare data for analysis in R
39. Notations
Let $T$ denote the zygosity of a genetic locus:
$$T = \begin{cases} 1 & \text{if the locus is heterozygous} \\ 0 & \text{if the locus is homozygous} \end{cases}$$
Let $G$ denote the zygosity called by GATK:
$$G = \begin{cases} 1 & \text{if the call is heterozygous} \\ 0 & \text{if the call is homozygous} \end{cases}$$
Let
$$f_+ = P(G = 1 \mid T = 0) \quad \text{[false positive]}$$
$$f_- = P(G = 0 \mid T = 1) \quad \text{[false negative]}$$
42. Distribution of $(G_B, G_C)$ I
We can describe the distribution of the observations $(G_B, G_C)$ in four cases:

(I) $T_B = T_C = 0$
                  Cell call 0           Cell call 1
Blood call 0      $(1 - f_+)^2$         $(1 - f_+) f_+$
Blood call 1      $f_+ (1 - f_+)$       $f_+^2$

(II) $T_B = 0$, $T_C = 1$ (i.e., a mutation)
                  Cell call 0           Cell call 1
Blood call 0      $(1 - f_+) f_-$       $(1 - f_+)(1 - f_-)$
Blood call 1      $f_+ f_-$             $f_+ (1 - f_-)$
43. Distribution of $(G_B, G_C)$ II

(III) $T_B = 1$, $T_C = 0$ (i.e., a deletion)
                  Cell call 0              Cell call 1
Blood call 0      $f_- (1 - f_+)$          $f_- f_+$
Blood call 1      $(1 - f_-)(1 - f_+)$     $(1 - f_-) f_+$

(IV) $T_B = T_C = 1$ (i.e., not a deletion)
                  Cell call 0              Cell call 1
Blood call 0      $f_-^2$                  $f_- (1 - f_-)$
Blood call 1      $(1 - f_-) f_-$          $(1 - f_-)^2$
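The four tables can be generated from the two error rates alone. The sketch below assumes, as the product form of every table entry suggests, that the blood call and the cell line call are independent given the true zygosities; the function names and the example error rates are hypothetical and chosen only for illustration.

```python
def call_prob(g, t, f_pos, f_neg):
    """P(G = g | T = t) for a single GATK call.

    f_pos = P(G = 1 | T = 0), f_neg = P(G = 0 | T = 1).
    """
    if t == 0:
        return f_pos if g == 1 else 1 - f_pos
    return f_neg if g == 0 else 1 - f_neg


def joint_table(t_blood, t_cell, f_pos, f_neg):
    """Distribution of (G_B, G_C) given (T_B, T_C), treating the two
    calls as independent given the truth (an assumption)."""
    return {
        (gb, gc): call_prob(gb, t_blood, f_pos, f_neg)
        * call_prob(gc, t_cell, f_pos, f_neg)
        for gb in (0, 1)
        for gc in (0, 1)
    }


# Hypothetical error rates, for illustration only.
f_pos, f_neg = 0.05, 0.01
cases = {"I": (0, 0), "II": (0, 1), "III": (1, 0), "IV": (1, 1)}
for name, (tb, tc) in cases.items():
    table = joint_table(tb, tc, f_pos, f_neg)
    # Probability of the observation of interest, (G_B = 1, G_C = 0).
    print(name, table[(1, 0)])
```

With small error rates, the observation (G_B = 1, G_C = 0) is likely only under Case III, which is what the filtering steps exploit.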
44. Probability of observing $(G_B = 1, G_C = 0)$ in each of the four possible cases.
[Figure: tree splitting on $T_B$. Under $T_B = 0$: $T_C = 0$ (Case I) and $T_C = 1$ (Case II). Under $T_B = 1$: deletion, $T_C = 0$ (Case III) and no deletion, $T_C = 1$ (Case IV).]
45. How to detect CNVs (speaker notes)
Let's focus on the $(G_B = 1, G_C = 0)$ observations and find out which observations indeed come from CNVs.
46. More on GATK
GATK takes into account the number of each type of nucleotide, read quality, and mapping quality of a genetic locus to make inference on its true zygosity.
But the inference is not always accurate. Luckily, we can control how GATK makes mistakes, which I will explain in a moment.
48. Filtering
Case I: $T_B = 0, T_C = 0$ · Case II: $T_B = 0, T_C = 1$ · Case III: $T_B = 1, T_C = 0$ · Case IV: $T_B = 1, T_C = 1$
Outline:
1 Use GATK to minimize Case II and Case IV by controlling threshold parameters that reduce $f_-$ at the expense of allowing a larger $f_+$.
2 Filter the variants called in the previous step and eliminate calls with lower quality metrics. By reducing $f_+$, we can eliminate many variants in Case I.
3 Use hypothesis tests to pick out Case III candidate loci.
4 Fit the candidate loci to a hidden Markov model to pick out the most likely candidate loci.
49. Filtering: Step I
Case I: $T_B = 0, T_C = 0$ · Case II: $T_B = 0, T_C = 1$ · Case III: $T_B = 1, T_C = 0$ · Case IV: $T_B = 1, T_C = 1$
• Run GATK with low threshold parameters to obtain a crude set of loci.
• Effects: $f_- \approx 0$, increase $f_+$.
• $f_- \approx 0 \implies$ minimize Case II and Case IV.
• $f_+$ is bounded above by a small number:
$$\hat{f}_+ = \frac{\#(1, 0) + \#(0, 1)}{\#(1, 0) + \#(0, 1) + \#(G_B = 0, G_C = 0)} \approx 0.05$$
• Minimize Case II and Case IV. Retain Case I and Case III. Number of loci retained = 15,971.
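The ratio used to bound the false positive rate in Step I is simple to compute from call counts. The snippet below is a sketch only: the function name and the counts are made up for illustration and are not the project's data; `n10`, `n01`, and `n00` stand for the numbers of loci with (G_B, G_C) equal to (1, 0), (0, 1), and (0, 0).

```python
def estimate_f_pos_bound(n10, n01, n00):
    """Ratio of discordant calls to discordant plus (0, 0) calls,
    as in the Step I formula for the estimated bound on f+.
    """
    discordant = n10 + n01
    total = discordant + n00
    if total == 0:
        raise ValueError("no loci to estimate from")
    return discordant / total


# Hypothetical counts chosen so the ratio comes out at 0.05.
bound = estimate_f_pos_bound(n10=30, n01=20, n00=950)
print(bound)
```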
54. Filtering: Step I
Figure: KS-tests for runs of 1s against the gamma distribution. Shape and scale parameters for the gamma distribution are estimated for each chromosome and for each individual. Cells with fewer than 20 runs are indicated by "-". Cells with p-value > 0.05 are colored grey.
• Runs of 1s are interrupted randomly by short runs of 0s.
• Many of the 0 calls are just random noise.
55. Filtering: Step II
Case I: $T_B = 0, T_C = 0$ · Case III: $T_B = 1, T_C = 0$
• Run GATK's Variant Quality Score Recalibration (VQSR) to filter out the false positive calls (loci in Case I).
• VQSR: fit a Gaussian Mixture Model to known variants and novel variants; filter based on the score of the variants.
• Effect: decrease $f_+$.
• Eliminate most of Case I. Retain Case III. Number of loci retained = 380.
57. Strand Bias
An important covariate for VQSR is strand bias.
DNA's double helix structure: forward and backward strands
Definition
Strand bias is the tendency to make more variant calls in one direction than in the other.
58. Quantifying Strand Bias
• Fisher's exact test:
$$p = \binom{n_{1\cdot}}{n_{11}} \binom{n_{2\cdot}}{n_{21}} \bigg/ \binom{n_{\cdot\cdot}}{n_{\cdot 1}}$$

                 Forward          Backward
Reference        $n_{11}$         $n_{12}$         $n_{1\cdot}$
Alternative      $n_{21}$         $n_{22}$         $n_{2\cdot}$
                 $n_{\cdot 1}$    $n_{\cdot 2}$    $n_{\cdot\cdot}$
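The formula above gives the hypergeometric probability of one 2x2 table with fixed margins. The sketch below computes that probability with the standard library and, as one common convention (an assumption here, since the slide does not say how the tail is formed), sums the probabilities of all tables no more likely than the observed one to get a two-sided p-value. Function names and counts are hypothetical.

```python
from math import comb


def table_probability(n11, n12, n21, n22):
    """Probability of a 2x2 table given its margins:
    C(n1., n11) * C(n2., n21) / C(n.., n.1)."""
    row1, row2 = n11 + n12, n21 + n22
    col1 = n11 + n21
    return comb(row1, n11) * comb(row2, n21) / comb(row1 + row2, col1)


def fisher_two_sided(n11, n12, n21, n22):
    """Two-sided p-value: total probability of every table with the
    same margins that is at most as probable as the observed one."""
    row1, row2 = n11 + n12, n21 + n22
    col1 = n11 + n21
    observed = table_probability(n11, n12, n21, n22)
    p_value = 0.0
    # n11 can range over all values compatible with the margins.
    for k in range(max(0, col1 - row2), min(row1, col1) + 1):
        p_k = table_probability(k, row1 - k, col1 - k, row2 - (col1 - k))
        if p_k <= observed * (1 + 1e-9):  # tolerance for float ties
            p_value += p_k
    return min(p_value, 1.0)


# Made-up read counts: reference reads mostly backward, alternative
# reads mostly forward, i.e. a strongly strand-biased locus.
print(table_probability(1, 9, 11, 3))
print(fisher_two_sided(1, 9, 11, 3))
```

A small p-value indicates that the split of reads between strands differs between reference and alternative alleles, which is the strand-bias signal.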
61. Filtering: Step III
Case I: $T_B = 0, T_C = 0$ · Case III: $T_B = 1, T_C = 0$
• For each locus, do hypothesis test:
$$H_0: T_B = T_C \qquad H_1: T_B \neq T_C$$
• Logistic regression:
$$I_{G=1} \sim I_{\text{isblood}} + \text{strand direction} + \text{base quality} + \text{mapping direction}$$
• Find loci for which $I_{\text{isblood}}$ is significant at the 10% level. Number of Case III candidates = 126.
63. Features of the Data
• from blood or cell line
• strand direction (forward or backward)
• sequencing quality
65. Sequencing Quality
Quality is inversely related to P(error).
• base quality: quality of a read at a genetic locus; determined by the sequencing equipment.
• mapping quality: alignment quality of a read; calculated from base qualities and the reference sequence.
base quality + mapping quality ⟹ genotype likelihood: the likelihood of a locus being homozygous or heterozygous.
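To make the link from quality to genotype likelihood concrete, here is a minimal sketch. It assumes the usual Phred scaling, Q = −10·log10 P(error), spreads errors evenly over the three other bases, and treats each read as drawn from either allele with probability 1/2. Mapping quality is left out for brevity, and the reads are made up.

```python
import math

def phred_to_error_prob(q):
    """Convert a Phred-scaled quality Q to an error probability: P = 10^(-Q/10)."""
    return 10.0 ** (-q / 10.0)

def read_likelihood(base, quality, allele):
    """P(observed base | true allele), with errors split evenly over the 3 other bases."""
    err = phred_to_error_prob(quality)
    return 1.0 - err if base == allele else err / 3.0

def genotype_log_likelihood(reads, genotype):
    """Log-likelihood of a diploid genotype given a list of (base, quality) pairs."""
    a1, a2 = genotype
    total = 0.0
    for base, quality in reads:
        # Each read comes from allele a1 or a2 with equal probability.
        p = 0.5 * read_likelihood(base, quality, a1) + 0.5 * read_likelihood(base, quality, a2)
        total += math.log(p)
    return total

# Hypothetical pileup at one locus: three A reads and two G reads.
reads = [("A", 30), ("A", 30), ("G", 30), ("G", 20), ("A", 40)]
genotypes = [("A", "A"), ("A", "G"), ("G", "G")]
scores = {g: genotype_log_likelihood(reads, g) for g in genotypes}
best = max(scores, key=scores.get)
```

Higher base quality means a smaller error probability, so a mismatching high-quality read counts strongly against a homozygous genotype.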
67. Logistic Regression
I_{G=1} ∼ I_{isblood} + strand direction + base quality + mapping quality
• A logistic regression model is fit at each locus.
• Perform the deviance χ² goodness-of-fit test for each model; only 2.4% of the tests are significant at the 5% level.
[Figure: histogram of p-values from the χ² tests of the residual deviance; x-axis: p-values (0.0 to 1.0), y-axis: frequency (0 to 600)]
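For reference, the deviance goodness-of-fit test can be written in its standard textbook form. The symbols n (number of observations or covariate patterns) and k (number of fitted coefficients) are notation introduced here, not taken from the slides.

```latex
D = 2\left[\ell_{\text{saturated}} - \ell(\hat{\beta})\right],
\qquad
p\text{-value} = \Pr\left(\chi^2_{\,n-k} \ge D\right)
```

If the models fit, the p-values should be roughly uniform on [0, 1], so about 5% of tests would be significant at the 5% level by chance alone. An observed rate of 2.4% therefore gives no sign of systematic lack of fit.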
69. Filtering: Step IV
Did the Case III candidates come from CNVs?
Define the length of a run of 0s as the number of consecutive (GB, GC) = (1, 0) calls.
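The run-length definition above can be sketched as follows, assuming each locus is coded as "0" for a (GB, GC) = (1, 0) call and "1" otherwise. The function name is hypothetical.

```python
from itertools import groupby

def zero_run_lengths(calls):
    """Return the lengths of maximal runs of '0' in a string of 0/1 indicators."""
    return [len(list(group)) for value, group in groupby(calls) if value == "0"]

# Example string: ten flanking loci, a run of three 0s, ten more flanking loci.
example = "1011111001" + "000" + "1111111010"
lengths = zero_run_lengths(example)
```

Here `groupby` splits the string into maximal blocks of identical characters, so each block of 0s contributes exactly one run length.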
71. Filtering: Step IV
[Figure: density estimate of the lengths of runs of (GB, GC) = (1, 0) calls; x-axis: run length (ticks 2 to 14), y-axis: density (0.0 to 2.5); N = 3286, bandwidth = 0.127; the region above the 95% quantile is marked]
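The "> 95% quantile" marker suggests a cutoff on run length. One plausible way to compute such a cutoff is sketched below; the quantile convention, the function names, and the run lengths are assumptions for illustration, not the values behind the plot.

```python
import math

def empirical_quantile(values, q):
    """Smallest observed value v such that at least a fraction q of values are <= v."""
    ordered = sorted(values)
    index = max(math.ceil(q * len(ordered)) - 1, 0)
    return ordered[index]

def long_runs(run_lengths, q=0.95):
    """Return the q-quantile cutoff and the runs strictly longer than it."""
    cutoff = empirical_quantile(run_lengths, q)
    return cutoff, [r for r in run_lengths if r > cutoff]

# Hypothetical run lengths: mostly isolated mismatches, a few longer runs.
run_lengths = [1] * 14 + [2] * 5 + [3]
cutoff, selected = long_runs(run_lengths)
```

Only runs above the cutoff would be carried forward as possible CNV signals, since isolated mismatches are more likely to be calling errors.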
72. Filtering: Step IV
10 loci come from runs of 0s of length at least 3:
1101111111|000|1111111111
1010111011|000|1111111111
1111111011|000|1111110111
0011111011|000|1101111111
1111111111|000|1111111111
1101110111|000|1111111111
1011111001|000|1111111010
1111111111|000|1111011110
1111111111|000|1111111111
1111111111|000|1101111011
Notice short runs of 1s. Are they errors?
73. (Future Work) Filtering: Step IV
Find the probability of <1011111001|000|1111111010> using a hidden Markov model:
[Figure: HMM diagram with hidden states "CNV" and "not CNV" and observed symbols "mismatched (0)" and "matched (1)"]
Transition matrix P_{i,i+1} (rows: state at locus i; columns: state at locus i+1):
            CNV     not CNV
CNV         γ       1−γ
not CNV     1−λ     λ
where γ and λ are large.
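Given that transition matrix, the probability of an observed 0/1 sequence can be computed with the forward algorithm, sketched below. The slides give neither emission probabilities nor an initial state distribution, so `p_match_cnv`, `p_match_normal`, `start_cnv`, and every numeric value here are placeholders.

```python
def sequence_probability(obs, gamma, lam, p_match_cnv, p_match_normal, start_cnv=0.5):
    """Forward algorithm for a two-state HMM (hidden states: CNV, not CNV).

    obs is a non-empty string of '0' (mismatched) and '1' (matched).
    """
    trans = {"cnv": {"cnv": gamma, "normal": 1.0 - gamma},
             "normal": {"cnv": 1.0 - lam, "normal": lam}}
    p_match = {"cnv": p_match_cnv, "normal": p_match_normal}

    def emit(state, symbol):
        return p_match[state] if symbol == "1" else 1.0 - p_match[state]

    # alpha[s] = P(observations so far, current hidden state = s)
    alpha = {"cnv": start_cnv * emit("cnv", obs[0]),
             "normal": (1.0 - start_cnv) * emit("normal", obs[0])}
    for symbol in obs[1:]:
        alpha = {s: emit(s, symbol) * sum(alpha[r] * trans[r][s] for r in alpha)
                 for s in ("cnv", "normal")}
    return sum(alpha.values())

# One of the ten candidate patterns, with the separators removed.
pattern = "1011111001" + "000" + "1111111010"
prob = sequence_probability(pattern, gamma=0.9, lam=0.95,
                            p_match_cnv=0.1, p_match_normal=0.9)
```

Large γ and λ make both states "sticky", so a genuine CNV would tend to produce a long stretch of 0s rather than isolated mismatches.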
74. Results
• After a series of filtering steps, only 10 loci in the pool of 16 individuals are found to be CNV candidates.
• Those 10 loci fall into short runs of 0s. They are unlikely to be CNVs.
• We will fit an HMM when there are more reliable signals.
76. Conclusions
• No CNV is good news. We now know that a great amount of time, money, and effort has not gone to waste.
• A useful assessment procedure when labs create cell lines.
• In separate work, we extended this procedure to finding mutations in cell lines, i.e., TB = 0, TC = 1.