GIAB Sep 2016 Lightning Talk: Chen Sun, VarMatch
1. VarMatch:
robust matching of small variant datasets
using flexible scoring schemes
Chen Sun, Paul Medvedev
Penn State
2. Variant Matching
• Different pipelines tend to report the same variants in different representations
• Need to compare VCF files in order to:
  • Evaluate variant callers
  • Use the overlap between call sets as high-confidence variants
  • Add variants to a database
• Two variant sets are equivalent if applying them separately to the reference genome results in the same donor genome.
• Variant Matching Problem: given two call sets, identify the largest equivalent subsets.
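The equivalence definition above can be checked directly by building the donor sequence for each variant set. The following is a minimal sketch for a single haplotype, not VarMatch's implementation. The function names and the `(pos, ref, alt)` tuple layout are illustrative assumptions, and the positions assigned to the example variants are one possible reading of the AGCCGG example from slide 3.

```python
def apply_variants(reference, variants):
    """Apply non-overlapping (pos, ref, alt) variants to a reference string.

    `pos` is 1-based, as in VCF. Haploid only; genotypes are ignored.
    """
    donor = []
    cursor = 0  # 0-based index of the next reference base not yet copied
    for pos, ref, alt in sorted(variants):
        start = pos - 1
        if start < cursor:
            raise ValueError("overlapping variants")
        if reference[start:start + len(ref)] != ref:
            raise ValueError("REF allele does not match the reference")
        donor.append(reference[cursor:start])  # unchanged bases before the variant
        donor.append(alt)                      # substituted allele
        cursor = start + len(ref)
    donor.append(reference[cursor:])           # unchanged tail
    return "".join(donor)


def equivalent(reference, set_a, set_b):
    """Two variant sets are equivalent if they produce the same donor sequence."""
    return apply_variants(reference, set_a) == apply_variants(reference, set_b)


# Assumed positions for two representations of the same change.
reference = "AGCCGG"
one_complex_variant = [(2, "GCCG", "CCGA")]
deletion_plus_insertion = [(1, "AG", "A"), (5, "G", "GA")]
```

Under these assumed positions, both sets turn AGCCGG into ACCGAG, so `equivalent(reference, one_complex_variant, deletion_plus_insertion)` is true even though no individual record in one set equals a record in the other.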
3. The Variant Matching problem
Seq:     A G C C G G
1  REF:  G C C G
   ALT:  C C G A
2  REF:  G C G
   ALT:  C G A
3  REF:  A G G
   ALT:  A G A
Donor:   A C C G A G
• Naïve approach
  • Match two variants only if their locations and alleles are exactly the same
• Normalization (Tan et al 15)
  • Guarantees that equivalent singletons are matched
• Complex Variants
  • One variant matches multiple variants
  • Multiple variants match multiple variants
• Decomposition (Li 14, Zook et al 14)
  • Creates fractional matches
  • Does not always work (see the example above)
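To illustrate why normalization matches equivalent singletons, here is a minimal sketch of the trim and left-align procedure described by Tan et al. for a single biallelic variant. It is not the vt implementation. The function name, the tuple layout, and the CA-repeat test sequence are illustrative assumptions, and variants that would need padding at the first reference base are simply rejected.

```python
def normalize(reference, pos, ref, alt):
    """Left-align and trim one biallelic variant (1-based pos)."""
    if ref == alt:
        raise ValueError("REF and ALT are identical")
    while True:
        changed = False
        # Drop a shared trailing base.
        if ref and alt and ref[-1] == alt[-1]:
            ref, alt = ref[:-1], alt[:-1]
            changed = True
        # If an allele became empty, extend both alleles one base to the left.
        if not ref or not alt:
            if pos == 1:
                raise ValueError("cannot left-extend at the first base")
            pos -= 1
            base = reference[pos - 1]
            ref, alt = base + ref, base + alt
            changed = True
        if not changed:
            break
    # Drop shared leading bases while both alleles keep at least one base.
    while len(ref) >= 2 and len(alt) >= 2 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


# Two ways of writing the same CA deletion inside a CA repeat.
repeat_reference = "GGGCACACACAGGG"
right_shifted = (8, "CAC", "C")
middle = (5, "ACA", "A")
```

Both records normalize to `(3, "GCA", "G")`, so an exact comparison after normalization matches them. A single variant that is equivalent only to a combination of several variants is outside what this per-record procedure can resolve.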
4. VarMatch Algorithm Overview
• Separator on reference genome sequence
  • Variants to the left of a separator cannot be equivalent to variants to its right
  • Linear scan of the reference genome to identify separators
  • Solve each independent small problem separately
• Branch and bound method for each small problem
  • Similar to the algorithm of Cleary et al., 2015
  • Problem size is small
  • Requires less memory and time
• Theorem for identifying separators
Software: https://github.com/medvedevgroup/varmatch
Preprint: VarMatch: robust matching of small variant datasets using flexible
scoring schemes (bioRxiv)
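Once separators have cut the genome into small independent problems, each one asks for the largest pair of subsets that produce the same donor sequence. The sketch below solves one such small problem by exhaustive enumeration, a brute-force stand-in for the branch and bound search named on the slide, and it does not implement the separator theorem. All names are hypothetical, genotypes are ignored, and the variant positions are an assumed reading of the slide 3 example.

```python
from itertools import combinations


def apply_variants(reference, variants):
    """Return the donor sequence, or None if the variants overlap or mismatch."""
    donor, cursor = [], 0
    for pos, ref, alt in sorted(variants):
        start = pos - 1  # pos is 1-based
        if start < cursor or reference[start:start + len(ref)] != ref:
            return None
        donor.append(reference[cursor:start] + alt)
        cursor = start + len(ref)
    return "".join(donor) + reference[cursor:]


def nonempty_subsets(variants):
    for size in range(1, len(variants) + 1):
        yield from combinations(variants, size)


def best_match(reference, baseline, query):
    """Return (baseline_subset, query_subset) maximizing the total matched calls."""
    # Remember the largest baseline subset that produces each donor sequence.
    largest_by_donor = {}
    for subset in nonempty_subsets(baseline):
        donor = apply_variants(reference, subset)
        if donor is None:
            continue
        if len(subset) > len(largest_by_donor.get(donor, ())):
            largest_by_donor[donor] = subset
    best = ((), ())
    for subset in nonempty_subsets(query):
        donor = apply_variants(reference, subset)
        partner = largest_by_donor.get(donor)
        if partner and len(partner) + len(subset) > len(best[0]) + len(best[1]):
            best = (partner, subset)
    return best


segment = "AGCCGG"
three_snps = [(2, "G", "C"), (4, "C", "G"), (5, "G", "A")]
indel_pair = [(1, "AG", "A"), (5, "G", "GA")]
```

Enumeration is exponential in the number of variants, which is why it is only plausible here because the separators keep each problem small. A branch and bound search would prune subsets whose partial donor sequences can no longer agree.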
5. VarMatch supports flexible scoring schemes
• Maximize the total number of matched variants, or only those in the baseline?
• Maximize the number of calls or the total edit distance?
  • e.g. one call changing 10 bases vs. 10 calls each changing 1 base.
• Require genotypes to match, or just detect that a variant is present?
Others possible?
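The first two choices above can be pictured as parameters of a scoring function applied to a candidate pair of matched subsets. This is a hedged sketch only: the parameter names `count_both` and `unit` are hypothetical and are not VarMatch options, and edit distance is taken here to mean the Levenshtein distance between REF and ALT.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def score(baseline_subset, query_subset, count_both=True, unit="calls"):
    """Score a matched pair of subsets of (pos, ref, alt) variants."""
    counted = list(baseline_subset)
    if count_both:
        counted += list(query_subset)
    if unit == "calls":
        return len(counted)
    if unit == "edit_distance":
        return sum(edit_distance(ref, alt) for _, ref, alt in counted)
    raise ValueError("unit must be 'calls' or 'edit_distance'")


# One call changing 10 bases vs. 10 calls each changing 1 base.
one_long_call = [(1, "A" * 10, "C" * 10)]
ten_snps = [(i, "A", "C") for i in range(1, 11)]
```

With `unit="calls"` and `count_both=False`, the single long call scores 1 and the ten SNPs score 10, while `unit="edit_distance"` gives 10 for both, which is the distinction the slide's example draws.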
6. Benchmark
Number of Matched Variants

                CHM1 + bowtie (Li 14)       NA12878 + bowtie (Li 14)
                Freebayes    GATK-HC        Freebayes    GATK-UG
Vt normalize    2,778,372    2,778,372      4,092,161    4,092,161
RTG Tools       2,843,396    2,912,641      4,197,070    4,321,997
VarMatch        2,843,396    2,912,641      4,197,138    4,322,083

Memory and Running Time Evaluation

             RAM (GB)    Time (s)
RTG Tools    48          456
VarMatch     5           302
7. Matching in low-complexity regions
• Comparison of (1) BWA+FreeBayes and (2) Bowtie2+Platypus NA12878 callsets (Li 14)
  • Using Bowtie2+GATK as baseline
• Focus on low-complexity regions
• 12% more equivalent variants identified using VarMatch than using normalization
[Figure: two panels, "Results of Vt-normalize" and "Results of VarMatch"]
8. Matching in dense regions
• Comparison of Freebayes vs. Platypus NA12878 callsets (Li 14)
  • Using GIAB Gold Standard (Zook et al 14) as baseline
• Focus on "dense regions"
  • 10-base regions that contain an INDEL and another variant
• Assessment genome wide differs from that in dense regions

Number of Matched Variants in Baseline

                 Freebayes    Platypus
genome wide      2,896,841    2,891,849
dense regions    24,188       24,522
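The dense-region definition can be made concrete with a small filter over a sorted call set. This is one possible reading rather than the talk's exact procedure: it assumes a region is dense when two variants start within the same 10-base window and at least one of them is an indel, and it compares start positions only. The function names and the sample calls are hypothetical.

```python
def is_indel(ref, alt):
    """Treat any length-changing variant as an indel."""
    return len(ref) != len(alt)


def dense_variants(variants, window=10):
    """Return variants sharing a `window`-base span with a neighbor,
    where at least one of the two is an indel."""
    ordered = sorted(variants)
    dense = set()
    for i, first in enumerate(ordered):
        for second in ordered[i + 1:]:
            if second[0] - first[0] >= window:
                break  # later variants start even further away
            if is_indel(first[1], first[2]) or is_indel(second[1], second[2]):
                dense.add(first)
                dense.add(second)
    return sorted(dense)


calls = [
    (100, "A", "AT"),  # insertion
    (105, "G", "C"),   # SNP within 10 bases of the insertion
    (200, "C", "T"),   # two nearby SNPs, no indel involved
    (204, "G", "A"),
    (300, "AT", "A"),  # isolated deletion
]
```

Here only the calls at positions 100 and 105 qualify. Restricting a comparison to such variants is what makes the dense-region row differ from the genome-wide row.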
11. VarMatch Highlights
• Uses less memory and running time
• Better performance matching complex variants
• Better performance in low-complexity regions
• Better performance in dense regions
• Flexible scoring schemes