Aug2014 nist structural variant integration

ANALYSIS OF STRUCTURAL VARIANTS
FROM NEXT GENERATION SEQUENCING
Hemang Parikh, Ph.D.
NIST

Challenges for identifying true SVs
[Figure: Venn diagram, from Alkan et al. (2011)]
This Venn diagram shows the numbers of unique and shared structural variants (SVs) found by different sequencing-based discovery approaches that have been used in the 1000 Genomes Project.
Hence we decided to develop methods to look for evidence of SVs in mapped sequencing reads from multiple sequencing technologies.

Validation parameters for each SV
• Coverage (mean and standard deviation)
• Paired-end distance/insert size (mean and standard deviation)
• Number of discordant paired-end reads
• Soft clipping of the reads (mean and standard deviation)
• Mapping quality (mean and standard deviation)
• Number of heterozygous and homozygous SNP genotype calls
• GC content (%)
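Several of these parameters are summary statistics over a region. A minimal Python sketch of two of them follows, assuming per-base depths and the reference bases for a region have already been extracted. The helper names `mean_sd` and `gc_percent` and the example values are hypothetical. The slides do not say whether a sample or population standard deviation is used; the sample form is assumed here.

```python
import statistics


def mean_sd(values):
    """Return (mean, sample standard deviation) of a list of numbers.

    Assumption: sample SD (n - 1 denominator); the slides do not specify.
    """
    if not values:
        return (0.0, 0.0)
    mean = statistics.fmean(values)
    sd = statistics.stdev(values) if len(values) > 1 else 0.0
    return (mean, sd)


def gc_percent(sequence):
    """Percentage of G and C among unambiguous bases (A, C, G, T)."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    acgt = gc + seq.count("A") + seq.count("T")
    return 100.0 * gc / acgt if acgt else 0.0


# Hypothetical per-base depths and reference bases for one SV region.
depths = [30, 32, 28, 0, 0, 0, 31, 29]
region_seq = "ACGTGGCCNNAT"

cov_mean, cov_sd = mean_sd(depths)
gc = gc_percent(region_seq)
```

The same `mean_sd` helper could be reused for insert size, soft clipping, and mapping quality once those values are collected per read.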

Inputs:
• Reference sequence
• RepeatMasker data
• Aligned sequence data (BAM file)
• List of structural variants (BED file)
Processing: Perl script
Output: about 180 annotations per SV
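The Perl script itself is not shown in the slides. As a rough Python sketch of the input side only, the following reads a list of SVs in BED form, assuming the standard BED convention (0-based start, exclusive end, whitespace-separated fields). The names `SV` and `parse_bed` are hypothetical and are not taken from the original script.

```python
from typing import NamedTuple


class SV(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive (BED convention)
    end: int    # exclusive (BED convention)

    @property
    def size(self):
        # With half-open coordinates the length is simply end - start.
        return self.end - self.start


def parse_bed(lines):
    """Parse BED lines into SV records, skipping blanks and header lines."""
    svs = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split()
        svs.append(SV(fields[0], int(fields[1]), int(fields[2])))
    return svs


# Hypothetical example input; real input would come from a BED file.
example = ["# deletions", "chr1\t100\t600", "", "chr2\t5\t15\tdel"]
svs = parse_bed(example)
```

Each record would then be passed to whatever code computes the per-SV annotations from the BAM file, the reference sequence, and the RepeatMasker data.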

NA12878 Data Sets—RM for GIAB
• Illumina (250 bp long sequences with 50X coverage)
• Illumina NIST (150 bp long sequences with 300X coverage)
• Illumina Platinum Genome (100 bp long sequences with 200X coverage)
• Illumina Moleculo
• Pacific Biosciences

Deletions Gold Sets for NA12878
• Personalis (n=2,306)
• The 1000 Genomes pilot (n=2,773)
• Complete Genomics (n=2,032)
• Conrad et al. (n=515)
• Kidd et al. (n=317)
• McCarroll et al. (n=128)
• The 1000 Genomes—aCGH array based (n=3,901)
• Roche NimbleGen 42 million—aCGH array based (n=719)
• Randomly generated (n=2,306)
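The slides do not describe how the randomly generated set was built; its size (n=2,306) matches the Personalis set. One possible approach, sketched below purely as an assumption, is to draw one random region per gold-set deletion with the same length. The function name `random_regions`, the chromosome names, and the lengths are all hypothetical.

```python
import random


def random_regions(sizes, chrom_lengths, seed=0):
    """Draw one random region per requested size.

    Assumptions (not from the slides): chromosomes are sampled in
    proportion to their length, and start positions are uniform.
    """
    rng = random.Random(seed)
    chroms = list(chrom_lengths)
    weights = [chrom_lengths[c] for c in chroms]
    longest = max(weights)
    regions = []
    for size in sizes:
        if size > longest:
            raise ValueError("region size exceeds every chromosome length")
        # Redraw until the chosen chromosome can hold the region.
        while True:
            chrom = rng.choices(chroms, weights=weights, k=1)[0]
            if chrom_lengths[chrom] >= size:
                break
        start = rng.randint(0, chrom_lengths[chrom] - size)
        regions.append((chrom, start, start + size))
    return regions


# Hypothetical toy genome and deletion sizes.
toy_lengths = {"chrA": 10_000, "chrB": 20_000}
regions = random_regions([100, 5_000, 250], toy_lengths, seed=1)
```

A real implementation would probably also exclude assembly gaps and regions that overlap known SVs; the slides do not say whether this was done.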

Personalis deletions call set (n=2,306)
[Figure: histogram of counts versus Log10 (SV size)]
• BAM-level evidence in the vicinity of each SV, in most of the 19 CEPH pedigree samples
• SV breakpoints were identified
• Some SVs were validated with PCR
Illumina NIST
[Figure: histograms of counts versus Log10 (M coverage), with panels for Personalis and Random genome]

Identifying likely SVs and likely non-SVs
[Figure: histogram of counts versus Log10 (M coverage) for the random genome]
• Identify the 99th percentile value of an annotation parameter in the random genome
• Compare this value with the same annotation parameter from the SV Gold Set
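The thresholding step can be sketched in Python as follows. This is a minimal illustration, assuming a nearest-rank percentile and a one-sided test on the upper tail; the slides specify neither, and which tail is relevant may depend on the parameter. The function names and example values are hypothetical.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least
    pct percent of the data at or below it."""
    if not values:
        raise ValueError("percentile of empty data")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def exceeds_threshold(gold_values, random_values, pct=99):
    """Flag gold-set values above the random-genome percentile cutoff."""
    cutoff = percentile(random_values, pct)
    return [value > cutoff for value in gold_values]


# Hypothetical values of one annotation parameter.
random_values = list(range(1, 101))   # random-genome regions
gold_values = [50, 99, 100, 150]      # SV Gold Set regions

cutoff = percentile(random_values, 99)
flags = exceeds_threshold(gold_values, random_values)
```

Repeating this for each annotation parameter gives, per SV, a count of parameters that look unlike the random genome.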

Annotating with Illumina NIST and Illumina Moleculo
Personalis SV Gold Set for Illumina NIST annotation parameters:
• L Insert size
• L Soft Clipped
• L # of discordant paired-end reads
• M Coverage
• M Coverage SD
• M Mapping quality
• M Insert size
• M Soft Clipped
• M # of discordant paired-end reads
Personalis SV Gold Set for Illumina Moleculo annotation parameters:
• L Soft Clipped
• M Coverage
• M Coverage SD
• M Mapping quality
• M Soft Clipped

(A) Personalis (rows: Moleculo; columns: Illumina NIST)
     0    1    2    3    4    5    6    7    8    9   10
0   21   96  323  350  231  126   80   40   10    2    1
1    4   19   45   59   61   29   16    9    9    0    1
2    1   22  108  200  214  111   69   36    8    3    0
3    0    0    0    1    1    0    0    0    0    0    0
4    0    0    0    0    0    0    0    0    0    0    0
5    0    0    0    0    0    0    0    0    0    0    0

(B) Random genome (rows: Moleculo; columns: Illumina NIST)
     0    1    2    3    4    5    6    7    8    9   10
0 2059   94   18    6    2    3    1    0    0    0    0
1   62   15   12    5    1    3    2    0    0    1    0
2   13    3    5    0    0    0    0    1    0    0    0
3    0    0    0    0    0    0    0    0    0    0    0
4    0    0    0    0    0    0    0    0    0    0    0
5    0    0    0    0    0    0    0    0    0    0    0
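The slides do not label what the indices 0 to 10 and 0 to 5 mean. Assuming they count, for each SV, how many annotation parameters fall beyond the random-genome 99th percentile in each dataset, a cross-tabulation of this shape could be built as in the Python sketch below. The function name `cross_tabulate` and the example counts are hypothetical.

```python
def cross_tabulate(nist_counts, moleculo_counts, n_cols=11, n_rows=6):
    """Count SVs by (Moleculo count, Illumina NIST count).

    Rows are indexed by the Moleculo count, columns by the Illumina
    NIST count, matching the layout of the tables in the slides.
    """
    if len(nist_counts) != len(moleculo_counts):
        raise ValueError("both lists need one entry per SV")
    table = [[0] * n_cols for _ in range(n_rows)]
    for nist, moleculo in zip(nist_counts, moleculo_counts):
        table[moleculo][nist] += 1
    return table


# Hypothetical per-SV counts of parameters beyond the threshold.
nist_counts = [0, 0, 3, 10]
moleculo_counts = [0, 0, 2, 5]
table = cross_tabulate(nist_counts, moleculo_counts)
```

Under this reading, a set dominated by the (0, 0) cell behaves like the random genome, while mass spread across higher indices indicates SV-like evidence in one or both datasets.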

Conclusions
• Graphical visualization of the annotation parameters shows a clear distinction between true positive and false positive SVs
• A key advantage of the proposed method is its simplicity and its flexibility to generate various annotation parameters from aligned sequence data based on different sequencing datasets from the same genome
• This allows integration of multiple sequencing datasets to identify high-confidence SV and non-SV calls that can be used as a benchmark to assess false positive and false negative rates
• We are currently testing classification methods based on the annotation parameters to generate both high-confidence SV calls and high-confidence non-SV calls for NA12878

Acknowledgements
NIST
Marc Salit
Justin Zook
Hariharan Iyer
Desu Chen
Sumona Sarkar
Jennifer McDaniel
Lindsay Vang
David Catoe
Nathanael Olson
Genome in a Bottle Consortium
Personalis Inc.
Mark Pratt
Gabor Bartha
Jason Harris
Illumina Inc.
Michael Eberle
Stanford University
Michael Snyder
Amin Zia
Somalee Datta
Cuiping Pan
Sean Michael Boyle
Rajini Haraksingh
Natalie Jaeger
