Graph-based non-linear reference structures such as variation graphs and colored de Bruijn graphs enable incorporation of full genomic diversity within a population. However, transitioning from a simple string-based reference to graphs requires addressing many computational challenges, one of which concerns accurately mapping sequencing read sets to graphs. Illumina paired-end sequencing is a commonly used protocol in genomics, where the paired-end distance constraints allow disambiguation of repeats. Many recent works have explored provably good index-based and alignment-based strategies for mapping individual reads to graphs. However, validating distance constraints efficiently over graphs is not trivial, and existing sequence-to-graph mappers rely on heuristics. We introduce a mathematical formulation of the problem and provide a new algorithm to solve it exactly. We take advantage of the high sparsity of reference graphs and use sparse matrix-matrix multiplications (SpGEMM) to build an index which can be queried efficiently by a mapping algorithm for validating the distance constraints. The effectiveness of the algorithm is demonstrated using real reference graphs, including a human MHC variation graph and a pan-genome de Bruijn graph built using genomes of 20 B. anthracis strains. While the one-time indexing time can vary from a few minutes to a few hours using our algorithm, answering a million distance queries takes less than a second.
Paired-end alignments in sequence graphs
1. Validating paired-end alignments in sequence graphs
Chirag Jain (1), Haowen Zhang (2), Alexander Dilthey (3), Srinivas Aluru (2)
(1) National Institutes of Health (2) Georgia Institute of Technology (3) University Hospital of Dusseldorf
WABI 2019, Niagara Falls
3. Mapping reads
[Figure: sequences (reads) mapped to a reference sequence; two read types are shown: single molecule sequencing and Illumina paired-end sequencing]
6. Applications
• Genotyping
  MHC-PRG [Dilthey et al. 2015]
  vg [Garrison et al. 2018]
  Graph-Aligner [Rakocevic et al. 2018]
• RNA-seq
  ASGAL [Denti et al. 2018]
  HISAT2 [Kim et al. 2019]
• Graph-guided assembly
  Kourami [Lee and Kingsford 2018]
• Hybrid genome assembly
  Unicycler [Wick et al. 2017]
  WhatsHap [Garg et al. 2018]
All of these require alignment of reads to sequence graphs.
7. Sequence to graph alignment
• Sequence to sequence: $O(|R||Q|)$ [Smith and Waterman 1981, JMB]
    Q: ACCATGTTTA-G
    R: -CCAAG-TTAAG
• Sequence to acyclic graphs: $O(|V| + |E||Q|)$ [Navarro 2001, TCS]
• Sequence to general graphs $G(V, E)$: $O(|V| + |E||Q|)$ * [Jain et al. 2019, RECOMB]
* (edits allowed only in query)
[Figures: the query Q: ACCATGTTTAG aligned to an acyclic sequence graph and to a general sequence graph]
8. Sequence to graph alignment
• Sequence to sequence: $O(|R||Q|)$ [Smith and Waterman 1981, JMB]
• Sequence to acyclic graphs: $O(|V| + |E||Q|)$ [Navarro 2001, TCS]
• Sequence to general graphs $G(V, E)$: $O(|V| + |E||Q|)$ * [Jain et al. 2019, RECOMB]
* (edits allowed only in query)
Slide labels: "Single-end Illumina reads or long reads" and "Paired-end reads"
9. Using paired-end sequencing
• Dominant sequencing protocol
• Paired-end information allows
  • repeat disambiguation
  • SV discovery
[Figure: a paired-end read and its read mappings on a linear reference, with the inner distance marked]
10. Using paired-end sequencing
• Dominant sequencing protocol
• Paired-end information allows
  • repeat disambiguation
  • SV discovery
How to evaluate mapping candidates?
• vg, HISAT2, HLA-PRG, deBGA use heuristics and lack guarantees
[Figure: a paired-end read mapped to a sequence graph, with the inner distance marked]
11. Contributions
• Problem formulation for paired-end validation in graphs
• First index-based exact algorithm
  • a million queries take < 1 sec
  • can be plugged into any graph mapper
  • better accuracy and runtime than a BFS-based heuristic
github.com/ParBLiSS/PairG
13. Sequence graph
• A directed graph with character-labeled vertices
• Good abstraction for commonly used graphs in genomics
[Figure: a small sequence graph whose vertices are labeled with nucleotides]
14. Paired-end validation problem
[Figure: a paired-end read mapped to a sequence graph, one end at vertex u (+ve strand) and the other at vertex v (-ve strand); the inner distance ranges from d1 to d2]
Does there exist any path of length $d \in [d_1, d_2]$ from $u$ to $v$?
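For a single pair (u, v), the question on this slide can be answered directly by advancing a frontier one edge at a time. The sketch below is only an illustrative baseline, not the indexed method presented in the talk. The function name `path_exists_in_range`, the adjacency-dictionary input, and the convention that path length counts edges (with vertices allowed to repeat in cyclic graphs) are assumptions made for this example.

```python
def path_exists_in_range(adj, u, v, d1, d2):
    """Return True if some path from u to v has length d with d1 <= d <= d2.

    adj maps each vertex to an iterable of its out-neighbours (assumed layout).
    Length is counted in edges; vertices may repeat when the graph has cycles.
    """
    frontier = {u}  # vertices reachable from u in exactly 0 steps
    for d in range(d2 + 1):
        if d >= d1 and v in frontier:
            return True
        # Advance one step: vertices reachable in exactly d + 1 steps.
        frontier = {w for x in frontier for w in adj.get(x, ())}
        if not frontier:
            break
    return False


# Hypothetical toy graph: the only paths from 1 to 5 have length 3.
adj = {1: [2, 3], 2: [4], 3: [4], 4: [5], 5: []}
in_range = path_exists_in_range(adj, 1, 5, 3, 4)    # True
too_short = path_exists_in_range(adj, 1, 5, 1, 2)   # False
```

Each call costs up to d2 frontier expansions, which is what motivates precomputing an index when many candidate pairs must be checked.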
15. Related problems
Our problem: Does there exist any path of length $d \in [d_1, d_2]$ from $u$ to $v$?
• All pairs shortest path
• Exact-path length problem
  In a weighted directed graph, is there a path of length d from vertex u to v? (NP-complete) [Nykanen and Ukkonen 2002]
• Transitive closure
Slide table columns: "Solves our problem?" and "Time"; the time entries shown are $O(|V||E|)$, $O(|V||E|)$ and $O(d|E|)$.
17. An index-based algorithm
$Ind[i, j] = 1$ iff $\exists$ path of length $\in [d_1, d_2]$ from vertex $i$ to $j$
$A$ (boolean adjacency matrix) is mapped to $Ind$ (boolean index matrix) by $f(A, d_1, d_2)$
[Figure: a sequence graph G(V, E) with vertices numbered 1 to 9, shown next to its 9 x 9 boolean adjacency matrix and the resulting 9 x 9 boolean index matrix]
18. An index-based algorithm
$Ind[i, j] = 1$ iff $\exists$ path of length $\in [d_1, d_2]$ from vertex $i$ to $j$, i.e. length $d_1$ or $d_1 + 1$ or … or $d_2$
$A$ (adjacency matrix) is mapped to $Ind$ (index matrix) by $f(A, d_1, d_2)$
$Ind = A^{d_1} \lor A^{d_1 + 1} \lor \dots \lor A^{d_2} = A^{d_1} \cdot (I \lor A)^{d_2 - d_1}$
• $O(|V|^3 \log d_2)$ time using general matrix-matrix multiplication
• Input graph's sparsity and near-linear topology: use SpGEMM
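The closed form for the index can be sketched with boolean sparse matrices stored as dictionaries of sets. This is a minimal pure-Python illustration, not the KokkosKernels-based implementation used by PairG; the helper names (`bool_matmul`, `bool_power`, `build_index`) and the 0-based vertex numbering are assumptions made for this example.

```python
def bool_matmul(X, Y):
    """Boolean product of sparse matrices stored as {row: set(columns)}."""
    Z = {}
    for i, cols in X.items():
        out = set()
        for k in cols:
            out |= Y.get(k, set())
        if out:
            Z[i] = out
    return Z


def bool_power(M, p, n):
    """M^p by repeated squaring; M^0 is the n x n identity."""
    result = {i: {i} for i in range(n)}
    base = M
    while p > 0:
        if p & 1:
            result = bool_matmul(result, base)
        p >>= 1
        if p:
            base = bool_matmul(base, base)
    return result


def build_index(edges, n, d1, d2):
    """Ind = A^d1 . (I or A)^(d2 - d1) over the boolean semiring."""
    A = {}
    for i, j in edges:
        A.setdefault(i, set()).add(j)
    I_or_A = {i: {i} | A.get(i, set()) for i in range(n)}
    return bool_matmul(bool_power(A, d1, n), bool_power(I_or_A, d2 - d1, n))


# Hypothetical toy graph on vertices 0..4 with a two-way bubble.
edges = [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4)]
index = build_index(edges, 5, 2, 3)
# index == {0: {3, 4}, 1: {4}, 2: {4}}
```

The identity works because, over the boolean semiring, $(I \lor A)^k$ expands to $I \lor A \lor \dots \lor A^k$, so multiplying by $A^{d_1}$ yields every power from $d_1$ to $d_2$ with only logarithmically many products.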
19. Using SpGEMM
• Compressed format to store matrices (CSR)
  • Space dictated by non-zeros
• Runtime dictated by non-zero scalar products
[Gustavson 1978]
CSR example for a matrix with rows 0 to 3:
  row pointer:   ( 0 1 3 4 7 )       $O(|V|)$
  column index:  ( 0 0 2 0 1 2 3 )   $O(|E|)$
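To make the CSR layout concrete, the sketch below builds the two arrays from a list of non-zero positions. The helper name `to_csr` and the (row, column) input format are assumptions for illustration; the entries were chosen so that the output reproduces the row pointer and column index arrays shown on the slide.

```python
def to_csr(n_rows, entries):
    """Build CSR arrays for a boolean matrix from (row, col) non-zero positions."""
    rows = [[] for _ in range(n_rows)]
    for r, c in entries:
        rows[r].append(c)
    row_ptr, col_idx = [0], []
    for cols in rows:
        # Columns of each row are stored sorted, which enables binary search.
        col_idx.extend(sorted(set(cols)))
        row_ptr.append(len(col_idx))
    return row_ptr, col_idx


# Non-zeros of a 4 x 4 boolean matrix, row by row.
entries = [(0, 0), (1, 0), (1, 2), (2, 0), (3, 1), (3, 2), (3, 3)]
row_ptr, col_idx = to_csr(4, entries)
# row_ptr == [0, 1, 3, 4, 7]
# col_idx == [0, 0, 2, 0, 1, 2, 3]
```

Row i occupies `col_idx[row_ptr[i]:row_ptr[i + 1]]`, so the row pointer has one entry per row plus one, and the column index has one entry per non-zero.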
20. Indexing cost using SpGEMM
$Ind = A^{d_1} \cdot (I \lor A)^{d_2 - d_1}$, with $0 \le d_1, d_2 \le |V|$
• Worst-case (dense): $O(|V|^3 \log d_2)$ time and $O(|V|^2)$ space
• When input is a chain: $\Theta(|V|((d_2 - d_1)^2 + \log d_1))$ time and $\Theta(|V|(d_2 - d_1 + 1))$ space
• Lower bound
  Assume $G_c(V_c, E_c)$ = longest chain in $G(V, E)$.
  Lemma. Computing the index for $G(V, E)$ takes $\Omega(|V_c|((d_2 - d_1)^2 + \log d_1))$ time and $\Omega(|V_c|(d_2 - d_1 + 1))$ space.
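As a sanity check on the space bound for a chain, the non-zeros of the index can be counted directly. The derivation below assumes the chain is $v_1 \to v_2 \to \dots \to v_{|V|}$ and that $d_2 < |V|$ (both assumptions for illustration). Under these assumptions $A^d$ has exactly $|V| - d$ non-zeros, and different powers never share a non-zero position:

```latex
\mathrm{nnz}(Ind)
  = \sum_{d = d_1}^{d_2} \left( |V| - d \right)
  = (d_2 - d_1 + 1) \left( |V| - \frac{d_1 + d_2}{2} \right)
  \le |V| \, (d_2 - d_1 + 1)
```

When $d_2$ is small relative to $|V|$, this count is within a constant factor of $|V|(d_2 - d_1 + 1)$, in line with the space bound stated on the slide.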
21. Query cost
• Requires a simple lookup
  • binary search in column index
[Figure: a paired-end read mapped to a sequence graph, next to the index matrix with rows and columns numbered 1 to 9]
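A lookup of this kind can be sketched with Python's standard `bisect` module, assuming the index is stored in CSR form with sorted column indices in each row. The function name `query` and the small index arrays are hypothetical and exist only for this example.

```python
from bisect import bisect_left


def query(row_ptr, col_idx, i, j):
    """Return True if entry (i, j) of a CSR boolean matrix is non-zero."""
    lo, hi = row_ptr[i], row_ptr[i + 1]
    # Binary search for column j inside row i's slice of col_idx.
    pos = bisect_left(col_idx, j, lo, hi)
    return pos < hi and col_idx[pos] == j


# Hypothetical 5 x 5 index with non-zeros (0,3), (0,4), (1,4), (2,4).
row_ptr = [0, 2, 3, 4, 4, 4]
col_idx = [3, 4, 4, 4]

hit = query(row_ptr, col_idx, 0, 4)    # True
miss = query(row_ptr, col_idx, 0, 2)   # False
```

Each query costs time logarithmic in the number of non-zeros of one row, independent of the size of the rest of the graph.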
23. Setup
SEQUENCE GRAPHS
  ACYCLIC: from human genome, GRCh37 + variations from 1KG Project
  CYCLIC: pan-genomic de-Bruijn graphs, B. anthracis strains
INSERT SIZES: mean 300, 500, 700 bp (allowed range: ±150 bp)
IMPLEMENTATION: C++, using 'KokkosKernels' linear algebra library
  Intel Xeon CPU: 28 cores and 256 GB RAM

Jain et al.
Table 1 Directed sequence graphs used for evaluation. In these graphs, each ve[…] with a DNA nucleotide. Four acyclic graphs are derived from segments of human geno[…] files from the 1000 Genomes Project (Phase 3). Three cyclic graphs are de Bruijn gra[…] whole-genome sequences of Bacillus anthracis strains, with k-mer length 25.

Id      Graph                       |V|     |E|     Type
VG1     mitochondrial-DNA           21K     27K     acyclic
VG2     BRCA1                       83K     85K     acyclic
VG3     LRC_KIR                     1.1M    1.2M    acyclic
VG4     MHC                         5.1M    5.3M    acyclic
DBG1    B. anthracis (1 strain)     5.2M    5.2M    cyclic
DBG5    B. anthracis (5 strains)    10.4M   10.4M   cyclic
DBG20   B. anthracis (20 strains)   11.2M   11.3M   cyclic

[…] tested PairG using d1 = 0, d2 = 250. Similarly, for insert-size configurations […] 700 bp, we tested PairG using inner distance limits (d1 = 150, d2 = 450) and (d[…] 650), respectively. There may be insert size configurations where allowing read […]
24. Index construction
Table 2 Performance measured in terms of wall-clock time and memory usage for building the index matrix using all input graphs and different distance constraints. nnz represents the number of non-zero elements in the index matrix, to indicate its size. Our implementation uses 4 bytes to store each non-zero of a matrix in memory.

Graph    Insert size 300         Insert size 500         Insert size 700
         Time   Mem   nnz        Time   Mem    nnz       Time    Mem    nnz
VG1      0.2s   0.1G  6.8M       0.4s   0.2G   7.8M      0.4s    0.2G   7.6M
VG2      0.4s   0.3G  21M        0.9s   0.5G   26M       0.9s    0.5G   26M
VG3      5.6s   3.8G  0.3B       12s    6.2G   0.3B      12s     6.2G   0.3B
VG4      25s    17G   1.3B       53s    28G    1.6B      53s     28G    1.6B
DBG1     25s    17G   1.3B       54s    28G    1.6B      54s     28G    1.7B
DBG5     52s    35G   2.8B       2m     60G    3.6B      2m      60G    4.2B
DBG20    1.3h   56G   4.6B       7.2h   118G   8.9B      17.2h   129G   14B

Rows VG1 to VG4: GRCh37 + variations
25. Index construction
Table 2 as on the previous slide, with both row groups labeled:
Rows VG1 to VG4: GRCh37 + variations
Rows DBG1 to DBG20: Pan-genomic de-Bruijn graphs
26. Querying the index is super-fast!
• Simulated random vertex pairs $[i, j]$
• One million queries take <1 second
vs. BFS-based heuristic
• Return true if BFS-distance from source $\le d_2$
• Index lookups are two to three orders of magnitude faster
• Heuristic accuracy ranged from 98% to 100%

Table 3 Time to execute a million queries using all the graphs and distance constraints. Each query is a random pair of vertices in the graph.

Graph    Time (sec) by insert size
         300    500    700
VG1      0.1    0.1    0.1
VG2      0.2    0.2    0.2
VG3      0.4    0.4    0.5
VG4      0.5    0.5    0.5
DBG1     0.4    0.5    0.5
DBG5     0.4    0.5    0.5
DBG20    0.5    0.5    0.6

[…] be uniformly distributed over the graph, we tested the querying perfo[…] a million random vertex pairs (u, v), u, v ∈ [1, |V|]. For all the seve[…] million vertex pairs finished in less than a second (Table 3). Even th[…]
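The BFS-based heuristic described on this slide accepts a pair whenever the shortest-path distance from the source is at most d2. The sketch below shows one way such a heuristic can disagree with the exact answer: it ignores the lower limit d1. Both function names, the adjacency-dictionary input and the toy graph are assumptions for illustration, and the exact check is a simple layered search rather than the index lookup used by PairG.

```python
from collections import deque


def bfs_heuristic(adj, u, v, d2):
    """Accept if the shortest-path distance from u to v is at most d2."""
    dist = {u: 0}
    queue = deque([u])
    while queue:
        x = queue.popleft()
        if x == v:
            return True
        if dist[x] == d2:
            continue  # do not explore beyond distance d2
        for w in adj.get(x, ()):
            if w not in dist:
                dist[w] = dist[x] + 1
                queue.append(w)
    return False


def exact_check(adj, u, v, d1, d2):
    """Accept only if some path from u to v has length in [d1, d2]."""
    frontier = {u}
    for d in range(d2 + 1):
        if d >= d1 and v in frontier:
            return True
        frontier = {w for x in frontier for w in adj.get(x, ())}
    return False


# Hypothetical chain 0 -> 1 -> 2: the only path from 0 to 1 has length 1.
adj = {0: [1], 1: [2], 2: []}
heuristic_answer = bfs_heuristic(adj, 0, 1, d2=3)     # True
exact_answer = exact_check(adj, 0, 1, d1=2, d2=3)     # False
```

Here the heuristic reports a valid pair although no path of length 2 or 3 exists, which is the kind of error an exact index rules out.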
27. Conclusions
• First formulation for P.E. distance validation in graphs
• First index-based exact algorithm
• Practical for pan-genome graphs
• A useful module for graph mappers
github.com/ParBLiSS/PairG
[Figure: a paired-end read mapped to a sequence graph, next to the index matrix with rows and columns numbered 1 to 9]