2. 2
Community resources and data we use for testing
• Platinum Genomes: WGS data for the Platinum Genomes pedigree
- 6 samples available on ENA (HiSeq2000 2x100bp and soon 10X, HiSeqX & NovaSeq)
- 11 samples available on EGA soon (10X, HiSeqX & NovaSeq)
- 17 samples available on dbGaP (HiSeq2000 2x100bp)
- https://github.com/Illumina/PlatinumGenomes
• Polaris: WGS data for a larger cohort
- 150 1kGP samples available on ENA (HiSeqX 2x150bp)
- 51 1kGP samples to complete trios on above data soon (HiSeqX 2x150bp)
- 70 samples available on ENA (HiSeqX 2x150bp and soon 10X)
- Insertion/deletion variant calls validated with population statistics
- https://github.com/illumina/polaris
• Paragraph: graph realigner for SV breakpoints
- Our targeted validation tools: https://github.com/illumina/paragraph
For Research Use Only. Not for use in Diagnostic Procedures.
3. 3
How we validate structural variants: targeted joint calling
• Given a putative SV, we can genotype it across samples using targeted software
• Start with >1,000 unrelated samples for hypothesis-based testing
- Population datasets let us look at most variants rather than just those in NA12877 & NA12878
- Additionally genotype the variants in the 220 unrelated samples, 51 trios and the Platinum Genomes
• Validate the calls:
- Population-level metrics such as Hardy-Weinberg equilibrium (HWE)
- Mendelian consistency in the Platinum Genomes and trios
• SVs can come from:
- Aggregated calls within any sample
- Other projects (e.g. GiaB)
- We share information on variants that are common / observable in publicly available datasets.
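The two validation checks named on this slide, HWE and Mendelian consistency, can be sketched roughly as follows. This is a minimal illustration, not the Paragraph implementation: the function names are hypothetical, the HWE test shown is a plain chi-square test with one degree of freedom (the slides do not say which HWE test is used), and genotypes are assumed to be diploid, autosomal and bi-allelic.

```python
import math


def hwe_chi2_pvalue(n_ref_ref, n_ref_alt, n_alt_alt):
    """Chi-square (1 d.o.f.) p-value for HWE at a bi-allelic site.

    Hypothetical helper: takes genotype counts over unrelated samples.
    """
    n = n_ref_ref + n_ref_alt + n_alt_alt
    if n == 0:
        raise ValueError("no genotyped samples")
    p = (2 * n_ref_ref + n_ref_alt) / (2 * n)  # reference allele frequency
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_ref_ref, n_ref_alt, n_alt_alt)
    if min(expected) == 0:
        return 1.0  # monomorphic site: no deviation can be measured
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square variable with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2.0))


def is_mendelian_consistent(child, mother, father):
    """True if the child can inherit one allele from each parent.

    Genotypes are 2-tuples of allele indices, e.g. (0, 1) for a het call.
    """
    a, b = child
    return (a in mother and b in father) or (b in mother and a in father)


if __name__ == "__main__":
    # Genotype counts in exact HWE proportions versus an all-het site.
    print(hwe_chi2_pvalue(25, 50, 25))
    print(hwe_chi2_pvalue(0, 100, 0))
    # A hom-alt child cannot come from a hom-ref mother.
    print(is_mendelian_consistent((1, 1), (0, 0), (0, 1)))
```

A variant would pass the HWE filter in this sketch when the p-value exceeds a chosen threshold such as 0.05. An exact test is often preferred for small counts, so treat the chi-square version as a stand-in.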
4. 4
Validation of GiaB SV candidates using paragraph
5. 5
Validation of GiaB SV candidates using paragraph
Validation Summary
Event Type      Count
Bi-allelic      6232 (65%)
HWE-P > 0.05    3614 (58%)
Note: contains duplicates with different representations
6. 6
Comparing GiaB (Ashkenazi trio) with the Polaris callset
• 738 variants overlap between the Polaris set and the GiaB test set
• Over 70% of the overlapping variants have different descriptions, but most of them fail HWE in one or both call sets, or are likely STRs
• ~60 SVs have different descriptions in Polaris and GiaB, but both descriptions pass the HWE test
These provide test cases for how to better validate the calls, i.e. we want to validate both the variant and the representation
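As a rough illustration of this kind of callset comparison, the sketch below pairs up variants from two sets and separates identical descriptions from differing ones. The slides do not say how overlap was defined, so everything here is an assumption: the `SV` record, the half-open coordinates, the "share at least one base" overlap rule, and the toy records are all hypothetical.

```python
from collections import namedtuple

# Hypothetical record: half-open interval [start, end) plus the ALT description.
SV = namedtuple("SV", "chrom start end alt")


def overlapping_pairs(set_a, set_b):
    """Return all (a, b) pairs whose intervals share at least one base."""
    pairs = []
    for a in set_a:
        for b in set_b:
            if a.chrom == b.chrom and a.start < b.end and b.start < a.end:
                pairs.append((a, b))
    return pairs


def split_by_representation(pairs):
    """Separate pairs with identical descriptions from those that differ."""
    same = [(a, b) for a, b in pairs if a == b]
    different = [(a, b) for a, b in pairs if a != b]
    return same, different


if __name__ == "__main__":
    # Toy records, not real calls from either dataset.
    polaris = [SV("chr1", 100, 200, "<DEL>"), SV("chr2", 500, 650, "<DEL>")]
    giab = [
        SV("chr1", 100, 200, "<DEL>"),
        SV("chr2", 520, 640, "ACGT"),
        SV("chr3", 10, 20, "<DEL>"),
    ]
    same, different = split_by_representation(overlapping_pairs(polaris, giab))
    print(len(same), "identical,", len(different), "with different descriptions")
```

The nested loop is quadratic and only suitable for small sets. Pure insertions have zero-length intervals under this convention and would need padding before comparison.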
7. 7
Working to improve representation with joint mapping
• For each variant, we remap reads to a graph consisting of the reference and the two alternative paths (as defined by the Polaris set and GiaB).
• The path with more uniquely mapped reads is more likely to be the better one.
[Figure: number of reads mapped to Polaris versus number of reads mapped to GiaB]
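The decision rule on this slide can be sketched as a simple count comparison. This is not the Paragraph API: the input format (a mapping from read name to the set of graph paths the read aligns to best) and the path labels "polaris", "giab" and "ref" are hypothetical, and a read is treated as uniquely mapped when exactly one path is in its set.

```python
from collections import Counter


def count_unique_support(read_to_paths):
    """Count reads whose best alignment is to exactly one path."""
    counts = Counter()
    for paths in read_to_paths.values():
        if len(paths) == 1:
            counts[next(iter(paths))] += 1
    return counts


def better_representation(counts, a="polaris", b="giab"):
    """Return the path label with more uniquely mapped reads, or None on a tie."""
    if counts[a] > counts[b]:
        return a
    if counts[b] > counts[a]:
        return b
    return None


if __name__ == "__main__":
    # Toy input: read name -> set of paths the read aligns to best.
    reads = {
        "r1": {"polaris"},
        "r2": {"polaris", "giab"},  # ambiguous, not counted
        "r3": {"giab"},
        "r4": {"giab"},
        "r5": {"ref"},
    }
    counts = count_unique_support(reads)
    print(dict(counts), better_representation(counts))
```

Reads that align equally well to both alternative paths carry no information about which description is better, which is why only uniquely mapped reads are counted.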
8. 8
Example: reads supporting a GiaB representation
• In the Ashkenazi (GiaB) call set, the event is described as a swap, while in Polaris it is a pure deletion.
• More reads are uniquely mapped to the Ashkenazi description than to the Polaris one.
[Figure: Mummerplot of alternative allele sequences between the two descriptions; labels: REF + GiaB, REF + Polaris; annotation: presence of short insertion]
9. 9
Example: SV with "better" description in Polaris than GiaB
• More reads are uniquely mapped to the Polaris description.
• Small insertions were observed in both representations, indicating that neither is fully correct.
[Figure: Mummerplot of alternative allele sequences between the two descriptions; labels: REF + GiaB, REF + Polaris; annotation: presence of short insertions highlights need for improvement]
10. 10
Future plans
• Run graph realignment and validation genome-wide
- Our tools have gotten faster, so we can now run on more samples and on all events.
- We will share the results and genotypes on the Polaris samples and Platinum Genomes (PG).
• Improve our targeted validation tools
https://github.com/illumina/paragraph
• Make graph visualisation for paragraph publicly available
- Based on https://github.com/vgteam/sequencetubemap, extended to use inputs from the paragraph tool.
• Share more data for our population datasets.
https://github.com/illumina/polaris
11. 11
Thank you! Any questions?
• Mike Eberle
• Egor Dolzhenko
• Sai Chen
• Mitchell Bekritsky
• Subramanian S Ajay
• Vani Rajan
• Sean Humphray
• Ryan J Taft
• David R Bentley
• Justin Zook