This document analyzes structural variants (SVs) identified from next generation sequencing data. It describes challenges in identifying true SVs and the development of methods to analyze mapped sequencing reads from multiple technologies. Validation parameters are extracted from sequencing data and reference sequences to annotate SVs. These parameters are used to distinguish true positive SVs from false positives by comparing the parameter values of known SVs to randomly generated non-SVs. Graphical visualization clearly separates SVs from non-SVs based on the parameter values. This allows integration of multiple datasets to identify high-confidence SV and non-SV calls for benchmarking and evaluation.
Aug 2014 NIST structural variant integration
1. ANALYSIS OF STRUCTURAL VARIANTS
FROM NEXT GENERATION SEQUENCING
Hemang Parikh, Ph.D.
NIST
2. Challenges for identifying true SVs
This Venn diagram shows the numbers of unique and shared structural variants (SVs) found by the different sequencing-based discovery approaches that have been used in the 1000 Genomes Project (from Alkan et al., 2011).
Hence we decided to develop methods that look for evidence of SVs in mapped sequencing reads from multiple sequencing technologies.
3. Validation parameters for each SV
• Coverage (mean and standard deviation)
• Paired-end distance/insert size (mean and standard deviation)
• Number of discordant paired-end reads
• Soft clipping of the reads (mean and standard deviation)
• Mapping quality (mean and standard deviation)
• Number of heterozygous and homozygous SNP genotype calls
• GC content (%)
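As a rough illustration of how the validation parameters on this slide could be assembled for one region, here is a minimal Python sketch. It assumes the per-base depths, insert sizes, mapping qualities, per-read soft-clipped base counts, and the discordant-read count have already been extracted from the aligned reads. The function names (`summarize`, `gc_percent`, `annotate_region`) and the dictionary keys are hypothetical and do not come from the slides. SNP genotype counts are left out because they would come from a separate call set.

```python
from statistics import mean, pstdev


def summarize(values):
    """Return (mean, population standard deviation); (0.0, 0.0) if empty."""
    if not values:
        return (0.0, 0.0)
    return (float(mean(values)), float(pstdev(values)))


def gc_percent(ref_seq):
    """Percent G+C among the unambiguous A/C/G/T bases of a sequence."""
    seq = ref_seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / acgt


def annotate_region(depths, insert_sizes, mapqs, clipped_bases,
                    n_discordant, ref_seq):
    """Collect the per-region annotation parameters into one dictionary.

    All inputs are assumed to be pre-extracted for a single region:
    lists of numbers, an integer count, and the reference sequence string.
    """
    cov_mean, cov_sd = summarize(depths)
    ins_mean, ins_sd = summarize(insert_sizes)
    mq_mean, mq_sd = summarize(mapqs)
    clip_mean, clip_sd = summarize(clipped_bases)
    return {
        "coverage_mean": cov_mean, "coverage_sd": cov_sd,
        "insert_size_mean": ins_mean, "insert_size_sd": ins_sd,
        "mapq_mean": mq_mean, "mapq_sd": mq_sd,
        "soft_clip_mean": clip_mean, "soft_clip_sd": clip_sd,
        "n_discordant": n_discordant,
        "gc_percent": gc_percent(ref_seq),
    }
```

Whether the original analysis used the population or the sample standard deviation is not stated; `pstdev` is used here only as a placeholder choice.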
5. NA12878 Data Sets—RM for GIAB
• Illumina (250 bp long sequences with 50X coverage)
• Illumina NIST (150 bp long sequences with 300X coverage)
• Illumina Platinum Genome (100 bp long sequences with 200X coverage)
• Illumina Moleculo
• Pacific Biosciences
6. Deletions Gold Sets for NA12878
• Personalis (n=2,306)
• The 1000 Genomes pilot (n=2,773)
• Complete Genomics (n=2,032)
• Conrad et al. (n=515)
• Kidd et al. (n=317)
• McCarroll et al. (n=128)
• The 1000 Genomes—aCGH array based (n=3,901)
• Roche NimbleGen 42 million—aCGH array based (n=719)
• Randomly generated (n=2,306)
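The slides do not say how the randomly generated set was built. One possible approach, sketched below under that caveat, draws one random interval per gold-set deletion size on a single chromosome and rejects any draw that overlaps a known SV. The function names, the half-open coordinate convention, and the retry limit are all assumptions made for illustration.

```python
import random


def overlaps(start, end, intervals):
    """True if half-open [start, end) overlaps any (s, e) in intervals."""
    return any(start < e and s < end for s, e in intervals)


def random_regions(sizes, chrom_length, known_svs, seed=0, max_tries=1000):
    """Draw one random non-SV interval per requested size.

    sizes        : iterable of interval lengths (e.g. gold-set deletion sizes)
    chrom_length : length of the chromosome being sampled
    known_svs    : list of (start, end) intervals to avoid
    """
    rng = random.Random(seed)  # fixed seed keeps the draw reproducible
    regions = []
    for size in sizes:
        for _ in range(max_tries):
            start = rng.randrange(0, chrom_length - size + 1)
            end = start + size
            if not overlaps(start, end, known_svs):
                regions.append((start, end))
                break
        else:
            raise RuntimeError(f"no free interval of size {size} found")
    return regions
```

Matching the size distribution of the random regions to the gold set keeps size-dependent parameters comparable between the two groups; a real implementation would presumably also need to skip assembly gaps and handle multiple chromosomes.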
7. Personalis deletions call set (n=2,306)
[Figure: histogram of deletion sizes; x-axis: Log10 (SV Size), from 2 to 5; y-axis: Counts, from 0 to 600]
• BAM-level evidence in the vicinity of each SV, in most of the 19 CEPH pedigree samples
• SV breakpoints were identified
• Some SVs were validated with PCR
9. Identifying likely SVs and likely non-SVs
[Figure: "Random genome" histogram; x-axis: Log10 (M coverage), from -3 to 2; y-axis: Counts, from 0 to 400]
• Identify the 99th percentile value of an annotation parameter in the random genome regions
• Compare this value with the same annotation parameter from the SV Gold Set
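The percentile comparison described on this slide can be sketched as follows. This is a minimal illustration, not the authors' code: it uses a nearest-rank percentile, and it flags gold-set values that lie above the cutoff. Which tail is informative for a given parameter is not stated in the slides, so the upper-tail comparison is an assumption, as are the function names.

```python
import math


def percentile(values, q):
    """Nearest-rank percentile of a non-empty list, for integer 0 < q <= 100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def flag_exceeding(random_values, sv_values, q=99):
    """Return (cutoff, flags), where flags[i] is True when sv_values[i]
    exceeds the q-th percentile of the random (non-SV) distribution."""
    cutoff = percentile(random_values, q)
    return cutoff, [value > cutoff for value in sv_values]


# Usage with made-up numbers: 100 random-region values and 4 gold-set values.
random_values = list(range(1, 101))
cutoff, flags = flag_exceeding(random_values, [50, 99, 100, 250])
print(cutoff, flags)
```

A gold-set site whose parameter value falls beyond the cutoff shows evidence that is rare among random regions, which is what makes the parameter useful for separating likely SVs from likely non-SVs.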
10. Annotating with Illumina NIST and Illumina Moleculo
[Figure, two panels: "Personalis SV Gold Set for Illumina NIST annotation parameters" and "Personalis SV Gold Set for Illumina Moleculo annotation parameters"]
Parameter labels, first group: L Insert size, L Soft Clipped, L # of discordant paired-end reads, M Coverage, M Coverage SD, M Mapping quality, M Insert size, M Soft Clipped, M # of discordant paired-end reads
Parameter labels, second group: L Soft Clipped, M Coverage, M Coverage SD, M Mapping quality, M Soft Clipped
12. Conclusions
• Graphical visualization of the annotation parameters shows a clear distinction between true positive and false positive SVs
• A key advantage of the proposed method is its simplicity and its flexibility to generate various annotation parameters from aligned sequence data, based on different sequencing datasets from the same genome
• This allows integration of multiple sequencing datasets to identify high-confidence SV and non-SV calls that can be used as a benchmark to assess false positive and false negative rates
• We are currently testing classification methods based on the annotation parameters to generate both high-confidence SV calls and high-confidence non-SV calls for NA12878
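The slides do not specify which classification methods are being tested. Purely as a stand-in, the sketch below combines per-dataset evidence flags for one candidate site with a simple agreement rule. The function name `classify_site`, the three labels, and the 0.75 agreement fraction are all hypothetical.

```python
def classify_site(flags_by_dataset, min_fraction=0.75):
    """Label one candidate site from per-dataset evidence flags.

    flags_by_dataset : dict mapping dataset name -> bool
                       (True means the dataset shows SV-like evidence)
    min_fraction     : agreement needed for a confident label (assumed value)
    """
    if not flags_by_dataset:
        return "uncertain"
    support = sum(flags_by_dataset.values()) / len(flags_by_dataset)
    if support >= min_fraction:
        return "likely SV"
    if support <= 1 - min_fraction:
        return "likely non-SV"
    return "uncertain"


# Usage with made-up flags for three of the datasets named earlier.
site = {"Illumina NIST": True, "Illumina Moleculo": True,
        "Pacific Biosciences": True}
print(classify_site(site))
```

Keeping an explicit "uncertain" label means that sites with conflicting evidence stay out of both benchmark sets instead of being forced into one of them.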
13. Acknowledgements
NIST
Marc Salit
Justin Zook
Hariharan Iyer
Desu Chen
Sumona Sarkar
Jennifer McDaniel
Lindsay Vang
David Catoe
Nathanael Olson
Genome in a Bottle Consortium
Personalis Inc.
Mark Pratt
Gabor Bartha
Jason Harris
Illumina Inc.
Michael Eberle
Stanford University
Michael Snyder
Amin Zia
Somalee Datta
Cuiping Pan
Sean Michael Boyle
Rajini Haraksingh
Natalie Jaeger