Paired-end alignments in sequence graphs

Chirag Jain (1), Haowen Zhang (2), Alexander Dilthey (3), Srinivas Aluru (2)
(1) National Institutes of Health (2) Georgia Institute of Technology (3) University Hospital of Dusseldorf
Validating paired-end alignments in sequence graphs
WABI 2019, Niagara Falls

!2
1. Introduction
2. Problem Statement
3. Proposed Algorithm
4. Results

!3
Mapping reads
[Figure: sequences (reads) from single molecule sequencing and Illumina paired-end sequencing mapped to a reference sequence]

!4
Graph-based reference
[Figure: a linear representation of the reference contrasted with a graph-based pan-genome reference [Beyer et al. 2019]]

!6
Applications
• Genotyping
  • MHC-PRG [Dilthey et al. 2015]
  • vg [Garrison et al. 2018]
  • Graph-Aligner [Rakocevic et al. 2018]
• RNA-seq
  • ASGAL [Denti et al. 2018]
  • HISAT2 [Kim et al. 2019]
• Graph-guided assembly
  • Kourami [Lee and Kingsford 2018]
• Hybrid genome assembly
  • Unicycler [Wick et al. 2017]
  • WhatsHap [Garg et al. 2018]
All of these require alignment of reads to sequence graphs

!7
Sequence to graph alignment
• sequence to sequence: $O(|R||Q|)$ [Smith and Waterman JMB 1981]
• sequence to acyclic graphs: $O(|V| + |E||Q|)$ [Navarro 2001 TCS]
• sequence to general graphs: $O(|V| + |E||Q|)$ * [Jain et al. 2019 RECOMB]
*(edits allowed only in query)
[Figure: query Q = ACCATGTTTAG aligned to a reference sequence R (Q: ACCATGTTTA-G, R: -CCAAG-TTAAG), to an acyclic graph, and to a general graph G(V, E)]

!8
Sequence to graph alignment
[Figure: the three alignment settings (sequence to sequence, sequence to acyclic graphs, sequence to general graphs) with their $O(|R||Q|)$, $O(|V| + |E||Q|)$ and $O(|V| + |E||Q|)$ * bounds, grouped by read type]
• Single-end Illumina reads or long reads
• Paired-end reads
[Smith and Waterman JMB 1981] [Navarro 2001 TCS] [Jain et al. 2019 RECOMB]
*(edits allowed only in query)

!9
Using paired-end sequencing
[Figure: a paired-end read and its two read mappings on a linear reference, separated by the inner distance]
• Dominant sequencing protocol
• Paired-end information allows
  • repeat disambiguation
  • SV discovery

!10
Using paired-end sequencing
[Figure: a paired-end read mapped to a sequence graph, with the inner distance between its two ends]
• Dominant sequencing protocol
• Paired-end information allows
  • repeat disambiguation
  • SV discovery
How to evaluate mapping candidates?
• vg, HISAT2, HLA-PRG, deBGA use heuristics, and lack guarantees

!11
Contributions
• Problem formulation for paired-end validation in graphs
• First index-based exact algorithm
• A million queries take < 1 sec
• Can be plugged into any graph mapper
• Better accuracy and runtime than a BFS-based heuristic
github.com/ParBLiSS/PairG

!12
1. Introduction
2. Problem Statement
3. Proposed Algorithm
4. Results

!13
Sequence graph
[Figure: a small directed graph whose vertices are labeled A, G, T, C]
• A directed graph with character-labeled vertices
• Good abstraction for commonly used graphs in genomics
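The definition above can be made concrete with a small sketch. The class name `SequenceGraph`, its methods, and the two-allele "bubble" used as an example are illustrative assumptions, not part of PairG.

```python
from dataclasses import dataclass, field


@dataclass
class SequenceGraph:
    """Directed graph in which every vertex carries one character label."""
    labels: list = field(default_factory=list)     # labels[v] = character of vertex v
    out_edges: list = field(default_factory=list)  # out_edges[v] = successors of v

    def add_vertex(self, label):
        """Append a vertex with the given label and return its id."""
        self.labels.append(label)
        self.out_edges.append([])
        return len(self.labels) - 1

    def add_edge(self, u, v):
        """Add the directed edge u -> v."""
        self.out_edges[u].append(v)

    def spell(self, path):
        """Concatenate vertex labels along a path, checking each step is an edge."""
        for a, b in zip(path, path[1:]):
            if b not in self.out_edges[a]:
                raise ValueError(f"no edge {a} -> {b}")
        return "".join(self.labels[v] for v in path)


# Hypothetical example: a SNP "bubble" A -> (G | T) -> C.
graph = SequenceGraph()
a = graph.add_vertex("A")
g = graph.add_vertex("G")
t = graph.add_vertex("T")
c = graph.add_vertex("C")
for u, v in [(a, g), (a, t), (g, c), (t, c)]:
    graph.add_edge(u, v)
```

Each path through the graph spells one sequence, so the two branches of the bubble spell the two alleles.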

!14
Paired-end validation problem
[Figure: a paired-end read (+ve strand and -ve strand ends) mapped to vertices $u$ and $v$ of a sequence graph; inner distance ranging from $d_1$ to $d_2$]
Does there exist any path of length $d \in [d_1, d_2]$ from $u$ to $v$?
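For a single query, the question can be answered directly by expanding the set of vertices reachable in exactly 0, 1, 2, ... steps. The sketch below is an illustrative baseline, not the indexed algorithm of the talk. It assumes that path length counts edges and that a path may revisit vertices; the function name and the toy graph are hypothetical.

```python
def has_path_in_range(out_edges, u, v, d1, d2):
    """Return True if some path from u to v has a length within [d1, d2].

    out_edges[x] lists the successors of vertex x. Length is counted in
    edges and vertices may repeat (both are assumptions of this sketch).
    """
    frontier = {u}  # vertices reachable from u in exactly `length` steps
    for length in range(d2 + 1):
        if length >= d1 and v in frontier:
            return True
        frontier = {w for x in frontier for w in out_edges[x]}
        if not frontier:
            break
    return False


# Toy graph: 0 -> 1 -> 3 (length 2) and 0 -> 2 -> 4 -> 3 (length 3).
toy = {0: [1, 2], 1: [3], 2: [4], 4: [3], 3: []}
```

Each query costs up to $d_2$ frontier expansions, which is the per-query work an index is meant to avoid.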

!15
Related problems
Our problem: Does there exist any path of length $d \in [d_1, d_2]$ from $u$ to $v$?
• All pairs shortest path
• Exact-path length problem: in a weighted directed graph, is there a path of length $d$ from vertex $u$ to $v$? (NP-complete) [Nykanen and Ukkonen 2002]
• Transitive closure
[Table columns: "Solves our problem?" and "Time"; time entries shown: $O(|V||E|)$, $O(|V||E|)$, $O(d|E|)$]

!16
1. Introduction
2. Problem Statement
3. Proposed Algorithm
4. Results

!17
An index-based algorithm
$\mathrm{Ind}[i, j] = 1$ iff $\exists$ path of length $\in [d_1, d_2]$ from vertex $i$ to $j$
$\mathrm{Ind} = f(A, d_1, d_2)$
[Figure: sequence graph $G(V, E)$ with vertices 1 to 9, its 9 x 9 boolean adjacency matrix $A$, and the 9 x 9 boolean index matrix $\mathrm{Ind}$]

!18
An index-based algorithm
$\mathrm{Ind}[i, j] = 1$ iff $\exists$ path of length $\in [d_1, d_2]$ from vertex $i$ to $j$, i.e. length $d_1$ or $d_1 + 1$ or … or $d_2$
$A$ (adjacency matrix), $\mathrm{Ind} = f(A, d_1, d_2)$ (index matrix)
$\mathrm{Ind} = A^{d_1} \vee A^{d_1 + 1} \vee \dots \vee A^{d_2} = A^{d_1} \cdot (I \vee A)^{d_2 - d_1}$
$O(|V|^3 \log d_2)$ time using general matrix-matrix multiplication
Input graph's sparsity and near-linear topology: use SpGEMM
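The identity between the OR of powers and the factored product can be checked on a small graph. The sketch below stores a boolean matrix as a list of column-index sets (one set per row) and uses repeated squaring for the powers. All function names and the example graph are assumptions made for illustration; this is not the PairG implementation.

```python
def bool_matmul(X, Y):
    """Boolean product of two matrices stored as lists of column-index sets."""
    return [set().union(*(Y[k] for k in row)) for row in X]


def bool_power(X, e):
    """X raised to the power e over the boolean semiring, by repeated squaring."""
    result = [{i} for i in range(len(X))]  # identity matrix
    base = X
    while e > 0:
        if e & 1:
            result = bool_matmul(result, base)
        base = bool_matmul(base, base)
        e >>= 1
    return result


def build_index(A, d1, d2):
    """Ind = A^d1 * (I or A)^(d2 - d1)."""
    i_or_a = [row | {i} for i, row in enumerate(A)]
    return bool_matmul(bool_power(A, d1), bool_power(i_or_a, d2 - d1))


def build_index_naive(A, d1, d2):
    """Ind = A^d1 or A^(d1+1) or ... or A^d2, one power at a time."""
    ind = [set() for _ in A]
    power = bool_power(A, d1)
    for _ in range(d1, d2 + 1):
        ind = [a | b for a, b in zip(ind, power)]
        power = bool_matmul(power, A)
    return ind


# Hypothetical cyclic graph: 0->1, 0->2, 1->3, 2->4, 4->3, 3->0.
cyclic = [{1, 2}, {3}, {4}, {0}, {3}]
chain = [{1}, {2}, {3}, set()]
```

The factored form needs a number of products logarithmic in $d_1$ and $d_2 - d_1$, whereas the naive form needs one product per distance value.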

!19
Using SpGEMM
• Compressed format to store matrices (CSR)
  • space dictated by non-zeros
  • row pointer ( 0 1 3 4 7 ): $O(|V|)$
  • column index ( 0 0 2 0 1 2 3 ): $O(|E|)$
• Runtime dictated by non-zero scalar products
[Gustavson 1978]
[Figure: example matrix with rows 0 to 3 and its CSR arrays]
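The CSR arrays on this slide describe a 4-row matrix with non-zeros at columns {0}, {0, 2}, {0} and {1, 2, 3}. The sketch below rebuilds those arrays and multiplies two CSR matrices row by row, in the spirit of Gustavson's algorithm. Using a Python set as the per-row accumulator is a simplification chosen here for brevity, and both function names are hypothetical.

```python
def to_csr(rows):
    """Build (row_ptr, col_idx) from rows[i] = column indices of non-zeros in row i."""
    row_ptr, col_idx = [0], []
    for cols in rows:
        col_idx.extend(sorted(cols))
        row_ptr.append(len(col_idx))
    return row_ptr, col_idx


def spgemm_bool(a_ptr, a_col, b_ptr, b_col):
    """Boolean product C = A * B of two CSR matrices, computed one row at a time."""
    c_ptr, c_col = [0], []
    for i in range(len(a_ptr) - 1):
        seen = set()  # columns with a non-zero in row i of C
        for k in a_col[a_ptr[i]:a_ptr[i + 1]]:
            # Row i of A has a non-zero in column k, so row k of B contributes.
            seen.update(b_col[b_ptr[k]:b_ptr[k + 1]])
        c_col.extend(sorted(seen))
        c_ptr.append(len(c_col))
    return c_ptr, c_col


row_ptr, col_idx = to_csr([[0], [0, 2], [0], [1, 2, 3]])
sq_ptr, sq_col = spgemm_bool(row_ptr, col_idx, row_ptr, col_idx)
```

Only stored non-zeros are touched, so the work tracks the number of non-zero scalar products rather than the matrix dimensions.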

!20
Indexing cost using SpGEMM
$\mathrm{Ind} = A^{d_1} \cdot (I \vee A)^{d_2 - d_1}$, with $0 \le d_1, d_2 \le |V|$
• Worst-case (dense): $O(|V|^3 \log d_2)$ time and $O(|V|^2)$ space
• When input is a chain: $\Theta(|V|((d_2 - d_1)^2 + \log d_1))$ time and $\Theta(|V|(d_2 - d_1 + 1))$ space
• Lower bound
  Assume $G_c(V_c, E_c)$ = longest chain in $G(V, E)$
  Lemma. Computing the index for $G(V, E)$ takes $\Omega(|V_c|((d_2 - d_1)^2 + \log d_1))$ time and $\Omega(|V_c|(d_2 - d_1 + 1))$ space
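The space bound for a chain can be sanity-checked by counting index non-zeros directly: in a chain, vertex $i$ reaches exactly one vertex, $i + d$, by a path of length $d$. The helper below is a hypothetical illustration that assumes 0-based vertex ids and length counted in edges.

```python
def chain_index_nnz(n, d1, d2):
    """Count non-zeros of the index for the chain 0 -> 1 -> ... -> n-1."""
    # Entry (i, i + d) is set for every d in [d1, d2] that stays inside the chain.
    return sum(1 for i in range(n) for d in range(d1, d2 + 1) if i + d < n)


nnz = chain_index_nnz(10, 2, 4)
upper_bound = 10 * (4 - 2 + 1)  # |V| * (d2 - d1 + 1)
```

Every row holds at most $d_2 - d_1 + 1$ non-zeros, and only rows near the end of the chain hold fewer.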

!21
Query cost
• requires a simple lookup
• binary-search in column index
[Figure: a paired-end read on a sequence graph and the corresponding entry of the 9 x 9 index matrix]
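A lookup of this kind can be sketched as a binary search inside one row of a CSR matrix. The sketch assumes that the column indices within each row are sorted; the function name and the example arrays are illustrative.

```python
from bisect import bisect_left


def query(row_ptr, col_idx, i, j):
    """Return True if entry (i, j) of a CSR boolean matrix is set.

    Assumes the column indices within each row are sorted.
    """
    lo, hi = row_ptr[i], row_ptr[i + 1]
    pos = bisect_left(col_idx, j, lo, hi)  # binary search restricted to row i
    return pos < hi and col_idx[pos] == j


# Hypothetical 4 x 4 index with rows {0}, {0, 2}, {0}, {1, 2, 3}.
example_ptr = [0, 1, 3, 4, 7]
example_col = [0, 0, 2, 0, 1, 2, 3]
```

The cost of one query is logarithmic in the number of non-zeros of row $i$.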

!22
1. Introduction
2. Problem Statement
3. Proposed Algorithm
4. Results

!23
Setup
SEQUENCE GRAPHS
• ACYCLIC: from human genome, GRCh37 + variations from 1KG Project
• CYCLIC: pan-genomic de-Bruijn graphs, B. anthracis strains
INSERT SIZES mean 300, 500, 700 bp (allowed range: ± 150 bp)
IMPLEMENTATION C++, using ‘KokkosKernels’ linear algebra library
Intel Xeon CPU: 28 cores and 256 GB RAM

Table 1 Directed sequence graphs used for evaluation. In these graphs, each vertex is labeled with a DNA nucleotide. Four acyclic graphs are derived from segments of human geno… files from the 1000 Genomes Project (Phase 3). Three cyclic graphs are de Bruijn gra… whole-genome sequences of Bacillus anthracis strains, with k-mer length 25.
Id      Graph                       |V|     |E|     Type
VG1     mitochondrial-DNA           21K     27K     acyclic
VG2     BRCA1                       83K     85K     acyclic
VG3     LRC_KIR                     1.1M    1.2M    acyclic
VG4     MHC                         5.1M    5.3M    acyclic
DBG1    B. anthracis (1 strain)     5.2M    5.2M    cyclic
DBG5    B. anthracis (5 strains)    10.4M   10.4M   cyclic
DBG20   B. anthracis (20 strains)   11.2M   11.3M   cyclic

… tested PairG using $d_1 = 0$, $d_2 = 250$. Similarly, for insert-size configurations … 700 bp, we tested PairG using inner distance limits ($d_1 = 150$, $d_2 = 450$) and (d… 650), respectively. There may be insert size configurations where allowing read …
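One way to relate the insert sizes on this slide to inner distance limits is sketched below. It assumes that the insert size spans both reads, that each read is 100 bp long, and that negative lower limits are clipped to zero. The read length and the function name are assumptions made for illustration and are not stated on the slide.

```python
def inner_distance_limits(insert_size, read_length=100, tolerance=150):
    """Derive (d1, d2) from a mean insert size.

    Assumes insert size = read + inner distance + read, and a hypothetical
    read length of 100 bp; `tolerance` is the allowed +/- range in bp.
    """
    mean_inner = insert_size - 2 * read_length
    d1 = max(0, mean_inner - tolerance)  # a distance cannot be negative
    d2 = mean_inner + tolerance
    return d1, d2


limits = {size: inner_distance_limits(size) for size in (300, 500, 700)}
```

Under these assumptions the 300 bp and 500 bp configurations give the limits quoted in the excerpt above.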

!24
Index construction
Table 2 Performance measured in terms of wall-clock time and memory-usage for building index matrix using all input graphs and different distance constraints. nnz represents number of non-zero elements in the index matrix, to indicate its size. Our implementation uses 4 bytes to store each non-zero of a matrix in memory.
Graph   Insert size 300        Insert size 500        Insert size 700
        Time   Mem    nnz      Time   Mem    nnz      Time    Mem    nnz
VG1     0.2s   0.1G   6.8M     0.4s   0.2G   7.8M     0.4s    0.2G   7.6M
VG2     0.4s   0.3G   21M      0.9s   0.5G   26M      0.9s    0.5G   26M
VG3     5.6s   3.8G   0.3B     12s    6.2G   0.3B     12s     6.2G   0.3B
VG4     25s    17G    1.3B     53s    28G    1.6B     53s     28G    1.6B
DBG1    25s    17G    1.3B     54s    28G    1.6B     54s     28G    1.7B
DBG5    52s    35G    2.8B     2m     60G    3.6B     2m      60G    4.2B
DBG20   1.3h   56G    4.6B     7.2h   118G   8.9B     17.2h   129G   14B
VG1 to VG4: GRCh37 + variations
!25
Index construction
[Table 2 shown again, annotated: VG1 to VG4 are GRCh37 + variations; DBG1 to DBG20 are pan-genomic de-Bruijn graphs]

!26
Querying the index is super-fast!
• Simulated random vertex pairs $[i, j]$
• One million queries take <1 second
vs. BFS-based heuristic
• Return true if BFS-distance from source $\le d_2$
• Index lookups are two-three orders of magnitude faster
• Heuristic accuracy ranged from 98%-100%

Table 3 Time to execute a million queries using all the graphs and distance constraints. Each query is a random pair of vertices in the graph.
Graph   Time (sec) by insert size
        300    500    700
VG1     0.1    0.1    0.1
VG2     0.2    0.2    0.2
VG3     0.4    0.4    0.5
VG4     0.5    0.5    0.5
DBG1    0.4    0.5    0.5
DBG5    0.4    0.5    0.5
DBG20   0.5    0.5    0.6

… be uniformly distributed over the graph, we tested the querying performance … a million random vertex pairs $(u, v)$, $u, v \in [1, |V|]$. For all the seve… million vertex pairs finished in less than a second (Table 3). Even th…
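A BFS-based check of the kind described on this slide accepts a pair whenever the shortest distance is at most $d_2$, so it can accept pairs for which no path length falls inside $[d_1, d_2]$. The sketch below shows one such disagreement on a chain. Both functions are illustrative reconstructions with hypothetical names, and they assume path length is counted in edges.

```python
from collections import deque


def bfs_heuristic(out_edges, u, v, d2):
    """Heuristic: accept if the shortest distance from u to v is at most d2."""
    dist = {u: 0}
    queue = deque([u])
    while queue:
        x = queue.popleft()
        if x == v:
            return True
        if dist[x] == d2:
            continue  # do not explore beyond distance d2
        for w in out_edges[x]:
            if w not in dist:
                dist[w] = dist[x] + 1
                queue.append(w)
    return False


def exact_check(out_edges, u, v, d1, d2):
    """Exact answer: is there a path from u to v of length within [d1, d2]?"""
    frontier = {u}  # vertices reachable in exactly `length` steps
    for length in range(d2 + 1):
        if length >= d1 and v in frontier:
            return True
        frontier = {w for x in frontier for w in out_edges[x]}
    return False


chain = [[1], [2], [3], []]  # 0 -> 1 -> 2 -> 3
```

On this chain, vertex 1 is one step from vertex 0, so the heuristic accepts the pair for $d_2 = 3$ even though no path of length 2 or 3 connects them.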

!27
Conclusions
• First formulation for paired-end (P.E.) distance validation in graphs
• First index-based exact algorithm
• Practical for pan-genome graphs
• A useful module for graph mappers
github.com/ParBLiSS/PairG
[Figure: a paired-end read on a sequence graph and the corresponding entry of the 9 x 9 index matrix]

!28
Future directions
• Performance
  • Scale to human genomes
  • Index compression (e.g., run-length encoding)
• Applications
  • Clustering adjacent seed matches
  • End-to-end graph read mapper
