Validating paired-end alignments in sequence graphs
Chirag Jain (1), Haowen Zhang (2), Alexander Dilthey (3), Srinivas Aluru (2)
1) National Institutes of Health 2) Georgia Institute of Technology 3) University Hospital of Dusseldorf
WABI 2019, Niagara Falls
1. Introduction
2. Problem Statement
3. Proposed Algorithm
4. Results
Mapping reads
[Figure: sequences (reads) mapped to a reference sequence (… G T C C G T C G C C T A A T C G C A C G T C …); read types shown: single molecule sequencing and Illumina paired-end sequencing]
Graph-based reference
• Linear representation: … G T C C G T C G C C T A A T C G C A C G T C C G T C G C C T A A T C G C A C G T C …
• Graph-based pan-genome reference [Beyer et al. 2019]
Applications
• Genotyping
  • MHC-PRG [Dilthey et al. 2015]
  • vg [Garrison et al. 2018]
  • Graph-Aligner [Rakocevic et al. 2018]
• RNA-seq
  • ASGAL [Denti et al. 2018]
  • HISAT2 [Kim et al. 2019]
• Graph-guided assembly
  • Kourami [Lee and Kingsford 2018]
• Hybrid genome assembly
  • Unicycler [Wick et al. 2017]
  • Whatshap [Garg et al. 2018]
All of these require alignment of reads to sequence graphs.
Sequence to graph alignment
Setting                       Time                   Reference
sequence to sequence          $O(|R||Q|)$            [Smith and Waterman JMB 1981]
sequence to acyclic graphs    $O(|V| + |E||Q|)$      [Navarro 2001 TCS]
sequence to general graphs    $O(|V| + |E||Q|)$ *    [Jain et al. 2019 RECOMB]
*(edits allowed only in query)
Q is the query sequence, R the reference sequence, and G(V, E) the reference graph.
[Figure: the query Q: ACCATGTTTAG aligned to a sequence (ACCATGTTTA-G over -CCAAG-TTAAG), to an acyclic graph, and to a general graph]
Read types named on the slide: single-end Illumina reads or long reads, and paired-end reads.
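The $O(|R||Q|)$ bound for the sequence-to-sequence case comes from filling a dynamic-programming table with one cell per pair of positions. The sketch below is only an illustration of that table: it uses unit-cost edit distance instead of the Smith-Waterman local-alignment scoring cited on the slide, and the function name `edit_distance` is hypothetical. The two strings are the ones shown on the slide with gap symbols removed.

```python
def edit_distance(r, q):
    """Unit-cost edit distance between r and q in O(|r||q|) time.

    Assumption: substitutions, insertions and deletions all cost 1.
    This is a simplification of the scoring used by real aligners.
    """
    prev = list(range(len(q) + 1))          # row for the empty prefix of r
    for i in range(1, len(r) + 1):
        cur = [i] + [0] * len(q)
        for j in range(1, len(q) + 1):
            cur[j] = min(
                prev[j] + 1,                           # delete r[i-1]
                cur[j - 1] + 1,                        # insert q[j-1]
                prev[j - 1] + (r[i - 1] != q[j - 1]),  # match or substitute
            )
        prev = cur
    return prev[len(q)]


if __name__ == "__main__":
    print(edit_distance("ACCATGTTTAG", "CCAAGTTAAG"))
```

Only two rows of the table are kept at a time, so the sketch needs $O(|Q|)$ memory while still visiting all $|R||Q|$ cells.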
Using paired-end sequencing
• Dominant sequencing protocol
• Paired-end information allows
  • repeat disambiguation
  • SV discovery
[Figure: a paired-end read and its read mappings on a linear reference, separated by the inner distance]

Using paired-end sequencing
How to evaluate mapping candidates?
• vg, HISAT2, HLA-PRG, deBGA use heuristics, and lack guarantees
[Figure: a paired-end read mapped to a sequence graph, with the inner distance between its two ends]
Contributions
• Problem formulation for paired-end validation in graphs
• First index-based exact algorithm
  • a million queries take < 1 sec
  • can be plugged into any graph mapper
  • better accuracy and runtime than a BFS-based heuristic
github.com/ParBLiSS/PairG
Problem Statement
Sequence graph
• A directed graph with character-labeled vertices
• Good abstraction for commonly used graphs in genomics
[Figure: example sequence graph with vertices labeled A, G, T, C, C, A, G]
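As a small illustration of the definition above, a sequence graph can be held as a vertex-label map plus adjacency lists. The toy graph below is hypothetical (it is not the graph drawn on the slide): it encodes a single G/T variant, and the helper names `spell` and `paths` are assumptions made for this sketch.

```python
# Toy sequence graph: each vertex carries one character, edges are directed.
# Vertices 1 and 2 are two alternative alleles (G or T) between A and C.
labels = {0: "A", 1: "G", 2: "T", 3: "C", 4: "A"}
adj = {0: [1, 2], 1: [3], 2: [3], 3: [4], 4: []}


def spell(path):
    """Return the string spelled by the vertex labels along a path."""
    return "".join(labels[v] for v in path)


def paths(src, dst):
    """Enumerate all paths from src to dst (terminates on acyclic graphs only)."""
    if src == dst:
        yield [src]
        return
    for w in adj[src]:
        for rest in paths(w, dst):
            yield [src] + rest


if __name__ == "__main__":
    for p in paths(0, 4):
        print(p, spell(p))
```

Each source-to-sink path spells one sequence represented by the graph, which is why the same structure can stand in for variation graphs and de Bruijn graphs alike.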
Paired-end validation problem
Does there exist any path of length $d \in [d_1, d_2]$ from u to v?
[Figure: a paired-end read whose two ends, one on the +ve strand and one on the -ve strand, map to vertices u and v of a sequence graph; the inner distance ranges from $d_1$ to $d_2$]
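The question can be answered directly from its definition by expanding a frontier one edge at a time. The sketch below is a reference check only, not the index-based method of this talk. It assumes path length is counted in edges and that vertices may repeat along a path in cyclic graphs; the function name `path_exists_in_range` is hypothetical.

```python
def path_exists_in_range(adj, u, v, d1, d2):
    """Return True iff some path from u to v has length in [d1, d2].

    adj maps a vertex to an iterable of its out-neighbours.
    Assumptions: length is the number of edges; vertices may repeat.
    """
    frontier = {u}                       # vertices reachable in exactly 0 edges
    for length in range(d2 + 1):
        if length >= d1 and v in frontier:
            return True
        # advance to the vertices reachable in exactly length + 1 edges
        frontier = {w for x in frontier for w in adj.get(x, ())}
        if not frontier:                 # no longer paths exist
            return False
    return False


if __name__ == "__main__":
    chain = {0: [1], 1: [2], 2: [3], 3: []}
    print(path_exists_in_range(chain, 0, 3, 2, 4))
    print(path_exists_in_range(chain, 0, 1, 3, 5))
```

Each call costs up to $O(d_2 |E|)$ time, which is the per-query cost that precomputing an index is meant to avoid.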
Related problems
Our problem: Does there exist any path of length $d \in [d_1, d_2]$ from u to v?
• All pairs shortest path
• Transitive closure
• Exact-path length problem [Nykanen and Ukkonen 2002]: in a weighted directed graph, is there a path of length d from vertex u to v? (NP-complete)
The slide compares these problems in two columns, "Solves our problem?" and "Time"; the time entries listed are $O(|V||E|)$, $O(|V||E|)$ and $O(d|E|)$.
Proposed Algorithm
An index-based algorithm
• Input: sequence graph G(V, E), stored as a boolean adjacency matrix A
• Output: boolean index matrix $Ind = f(A, d_1, d_2)$
• $Ind[i, j] = 1$ iff $\exists$ a path of length $\in [d_1, d_2]$ from vertex i to j
[Figure: an example sequence graph on 9 vertices, with its 9 × 9 adjacency matrix and 9 × 9 index matrix]
An index-based algorithm
$Ind = f(A, d_1, d_2)$, where A is the adjacency matrix and Ind the index matrix.
$Ind[i, j] = 1$ iff $\exists$ a path of length $\in [d_1, d_2]$ from vertex i to j, i.e. of length $d_1$ or $d_1 + 1$ or … or $d_2$.
$$Ind = A^{d_1} \lor A^{d_1+1} \lor \dots \lor A^{d_2} = A^{d_1} \cdot (I \lor A)^{d_2 - d_1}$$
• $O(|V|^3 \log d_2)$ time using general matrix-matrix multiplication
• Input graph's sparsity and near-linear topology: use SpGEMM (sparse matrix-matrix multiplication)
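The closed form $A^{d_1} \cdot (I \lor A)^{d_2 - d_1}$ can be sketched with boolean matrices stored as one set of column indices per row, with powers taken by repeated squaring. This is a pure-Python illustration under that assumed storage scheme, not the KokkosKernels-based implementation used by PairG; all function names are hypothetical.

```python
def bool_matmul(X, Y):
    """Boolean product of sparse matrices given as lists of column-index sets."""
    return [set().union(*(Y[k] for k in row)) for row in X]


def bool_power(M, p):
    """Boolean matrix power M^p by repeated squaring (M^0 is the identity)."""
    result = [{i} for i in range(len(M))]
    base = [set(row) for row in M]
    while p > 0:
        if p & 1:
            result = bool_matmul(result, base)
        base = bool_matmul(base, base)
        p >>= 1
    return result


def build_index(A, d1, d2):
    """Return Ind = A^d1 . (I or A)^(d2 - d1) as a list of column-index sets."""
    i_or_a = [A[i] | {i} for i in range(len(A))]
    return bool_matmul(bool_power(A, d1), bool_power(i_or_a, d2 - d1))


if __name__ == "__main__":
    # chain 0 -> 1 -> 2 -> 3 -> 4
    A = [{1}, {2}, {3}, {4}, set()]
    print(build_index(A, 2, 3))
```

Row i of the result lists exactly the vertices j with a path of length in $[d_1, d_2]$ from i, so on the chain above vertex 0 reaches vertices 2 and 3.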
Using SpGEMM [Gustavson 1978]
• Compressed format to store matrices (CSR)
  • space dictated by non-zeros
  • row pointer: ( 0 1 3 4 7 ), with $O(|V|)$ entries
  • column index: ( 0 0 2 0 1 2 3 ), with $O(|E|)$ entries
• Runtime dictated by non-zero scalar products
[Figure: a 4-row example matrix (rows 0 to 3) with its row pointer and column index arrays]
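The CSR arrays on the slide can be reproduced, and multiplied row by row, with a few lines of code. The sketch below follows the row-wise idea attributed to Gustavson but uses a Python set as the per-row accumulator for brevity, which is a simplification. The function names are hypothetical and the 4 × 4 example matrix is the one implied by the slide's arrays.

```python
def csr_from_rows(rows):
    """Build (row_ptr, col_idx) from a list of column-index sets, one per row."""
    row_ptr, col_idx = [0], []
    for cols in rows:
        col_idx.extend(sorted(cols))
        row_ptr.append(len(col_idx))
    return row_ptr, col_idx


def spgemm_bool(a_ptr, a_idx, b_ptr, b_idx):
    """Row-wise boolean sparse product C = A . B on CSR arrays.

    For each non-zero A[i, k], row k of B is merged into row i of C, so the
    work is proportional to the number of non-zero scalar products.
    """
    c_ptr, c_idx = [0], []
    for i in range(len(a_ptr) - 1):
        acc = set()
        for k in a_idx[a_ptr[i]:a_ptr[i + 1]]:
            acc.update(b_idx[b_ptr[k]:b_ptr[k + 1]])
        c_idx.extend(sorted(acc))
        c_ptr.append(len(c_idx))
    return c_ptr, c_idx


if __name__ == "__main__":
    ptr, idx = csr_from_rows([{0}, {0, 2}, {0}, {1, 2, 3}])
    print(ptr, idx)
    print(spgemm_bool(ptr, idx, ptr, idx))
```

Keeping each row's column indices sorted is what later allows a query to be answered by binary search within a row.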
Indexing cost using SpGEMM
$Ind = A^{d_1} \cdot (I \lor A)^{d_2 - d_1}$, with $0 \le d_1, d_2 \le |V|$
• Worst-case (dense): $O(|V|^3 \log d_2)$ time and $O(|V|^2)$ space
• When input is a chain: $\Theta(|V|((d_2 - d_1)^2 + \log d_1))$ time and $\Theta(|V|(d_2 - d_1 + 1))$ space
• Lower bound
  Assume $G_c(V_c, E_c)$ = longest chain in $G(V, E)$.
  Lemma. Computing the index for $G(V, E)$ takes $\Omega(|V_c|((d_2 - d_1)^2 + \log d_1))$ time and $\Omega(|V_c|(d_2 - d_1 + 1))$ space.
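The chain-case space bound can be checked by counting index non-zeros directly. On a chain, vertex i reaches vertex j by a path of length exactly j - i, so each row of the index holds at most $d_2 - d_1 + 1$ entries. The sketch below counts them by brute force; the function name and the example sizes are illustrative assumptions, not values from the talk.

```python
def chain_index_nnz(n, d1, d2):
    """Count pairs (i, j) on the chain 0 -> 1 -> ... -> n-1 with d1 <= j - i <= d2."""
    return sum(1 for i in range(n) for d in range(d1, d2 + 1) if i + d < n)


if __name__ == "__main__":
    n, d1, d2 = 1000, 150, 450
    nnz = chain_index_nnz(n, d1, d2)
    # every row contributes at most d2 - d1 + 1 non-zeros
    print(nnz, n * (d2 - d1 + 1))
```

Rows near the end of the chain hold fewer entries, so the count stays somewhat below $|V|(d_2 - d_1 + 1)$ while growing at the same rate when $d_2$ is small relative to $|V|$.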
Query cost
• A query requires a simple lookup
• binary-search in column index
[Figure: a paired-end read mapped to a sequence graph, next to the 9 × 9 index matrix]
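With the index in CSR form and each row's column indices sorted, a query for $Ind[i, j]$ reduces to a binary search within row i. The sketch below uses Python's standard `bisect` module; the function name `query` is hypothetical, and the arrays are the example row pointer and column index shown on the SpGEMM slide.

```python
from bisect import bisect_left


def query(row_ptr, col_idx, i, j):
    """Return True iff entry (i, j) is non-zero in a CSR matrix with sorted rows."""
    lo, hi = row_ptr[i], row_ptr[i + 1]
    pos = bisect_left(col_idx, j, lo, hi)   # search only inside row i
    return pos < hi and col_idx[pos] == j


if __name__ == "__main__":
    row_ptr = [0, 1, 3, 4, 7]
    col_idx = [0, 0, 2, 0, 1, 2, 3]
    print(query(row_ptr, col_idx, 1, 2))
    print(query(row_ptr, col_idx, 3, 0))
```

The cost per query is logarithmic in the number of non-zeros of one row, independent of the distance limits used to build the index.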
Results
Setup
SEQUENCE GRAPHS
• ACYCLIC: from human genome, GRCh37 + variations from 1KG Project
• CYCLIC: pan-genomic de-Bruijn graphs, B. anthracis strains
INSERT SIZES: mean 300, 500, 700 bp (allowed range: ±150 bp)
IMPLEMENTATION: C++, using 'KokkosKernels' linear algebra library
Intel Xeon CPU: 28 cores and 256 GB RAM

Table 1 Directed sequence graphs used for evaluation. In these graphs, each vertex is labeled with a DNA nucleotide. Four acyclic graphs are derived from segments of human geno[…] files from the 1000 Genomes Project (Phase 3). Three cyclic graphs are de Bruijn gra[…] whole-genome sequences of Bacillus anthracis strains, with k-mer length 25.
Id      Graph                        |V|     |E|     Type
VG1     mitochondrial-DNA            21K     27K     acyclic
VG2     BRCA1                        83K     85K     acyclic
VG3     LRC_KIR                      1.1M    1.2M    acyclic
VG4     MHC                          5.1M    5.3M    acyclic
DBG1    B. anthracis (1 strain)      5.2M    5.2M    cyclic
DBG5    B. anthracis (5 strains)     10.4M   10.4M   cyclic
DBG20   B. anthracis (20 strains)    11.2M   11.3M   cyclic

[…] tested PairG using $d_1 = 0$, $d_2 = 250$. Similarly, for insert-size configurations […] 700 bp, we tested PairG using inner distance limits ($d_1 = 150$, $d_2 = 450$) and (d[…] 650), respectively. There may be insert size configurations where allowing read[…]
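The inner distance limits quoted above are consistent with subtracting both read lengths from the mean insert size and then applying the ±150 bp allowed range. The sketch below spells out that arithmetic. The read length of 100 bp is an assumption chosen because it reproduces the quoted limits; it is not stated in the excerpt, and the function name is hypothetical.

```python
def inner_distance_limits(mean_insert, read_len=100, slack=150):
    """Return (d1, d2) for a mean insert size in bp.

    Assumptions: insert size = 2 * read_len + inner distance, read_len is
    100 bp (not stated in the source), and d1 is clamped at 0.
    """
    inner = mean_insert - 2 * read_len
    return max(0, inner - slack), inner + slack


if __name__ == "__main__":
    for insert in (300, 500, 700):
        print(insert, inner_distance_limits(insert))
```

Under these assumptions the first two configurations give the limits stated in the text; the pair for 700 bp is what the same arithmetic yields and should be read as an inference, since the source line is truncated.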
Index construction
Table 2 Performance measured in terms of wall-clock time and memory-usage for building index matrix using all input graphs and different distance constraints. nnz represents number of non-zero elements in the index matrix, to indicate its size. Our implementation uses 4 bytes to store each non-zero of a matrix in memory.
Graph   Insert size 300          Insert size 500          Insert size 700
        Time    Mem    nnz       Time    Mem    nnz       Time    Mem    nnz
VG1     0.2s    0.1G   6.8M      0.4s    0.2G   7.8M      0.4s    0.2G   7.6M
VG2     0.4s    0.3G   21M       0.9s    0.5G   26M       0.9s    0.5G   26M
VG3     5.6s    3.8G   0.3B      12s     6.2G   0.3B      12s     6.2G   0.3B
VG4     25s     17G    1.3B      53s     28G    1.6B      53s     28G    1.6B
DBG1    25s     17G    1.3B      54s     28G    1.6B      54s     28G    1.7B
DBG5    52s     35G    2.8B      2m      60G    3.6B      2m      60G    4.2B
DBG20   1.3h    56G    4.6B      7.2h    118G   8.9B      17.2h   129G   14B
VG1 to VG4: GRCh37 + variations. DBG1 to DBG20: pan-genomic de-Bruijn graphs.
Querying the index is super-fast!
• Simulated random vertex pairs $[i, j]$
• One million queries take <1 second
vs. BFS-based heuristic
• Return true if BFS-distance from source $\le d_2$
• Index lookups are two to three orders of magnitude faster
• Heuristic accuracy ranged from 98%-100%

Table 3 Time to execute a million queries using all the graphs and distance constraints. Each query is a random pair of vertices in the graph.
Graph   Time (sec), by insert size
        300    500    700
VG1     0.1    0.1    0.1
VG2     0.2    0.2    0.2
VG3     0.4    0.4    0.5
VG4     0.5    0.5    0.5
DBG1    0.4    0.5    0.5
DBG5    0.4    0.5    0.5
DBG20   0.5    0.5    0.6

[…] be uniformly distributed over the graph, we tested the querying performance […] a million random vertex pairs $(u, v)$, $u, v \in [1, |V|]$. For all the seven […] million vertex pairs finished in less than a second (Table 3). Even th[…]
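The BFS-based heuristic described above accepts a pair whenever the shortest distance from the source is at most $d_2$, which ignores the lower limit $d_1$. The sketch below contrasts it with an exact check on a tiny chain to show how a false positive can arise. Both function names are hypothetical, path length is assumed to be counted in edges, and the exact check is a brute-force reference rather than the index lookup.

```python
from collections import deque


def bfs_heuristic(adj, u, v, d2):
    """Return True iff the shortest u -> v distance (in edges) is at most d2."""
    dist = {u: 0}
    queue = deque([u])
    while queue:
        x = queue.popleft()
        if x == v:
            return True
        if dist[x] == d2:            # do not explore beyond d2 edges
            continue
        for w in adj.get(x, ()):
            if w not in dist:
                dist[w] = dist[x] + 1
                queue.append(w)
    return False


def exact_check(adj, u, v, d1, d2):
    """Return True iff some path from u to v has length in [d1, d2]."""
    frontier = {u}
    for length in range(d2 + 1):
        if length >= d1 and v in frontier:
            return True
        frontier = {w for x in frontier for w in adj.get(x, ())}
    return False


if __name__ == "__main__":
    chain = {0: [1], 1: [2], 2: []}
    # the only path from 0 to 1 has length 1, which is below d1 = 3
    print(bfs_heuristic(chain, 0, 1, 5), exact_check(chain, 0, 1, 3, 5))
```

On this chain the heuristic answers true while no path of length in [3, 5] exists, which is the kind of disagreement behind a heuristic accuracy below 100%.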
Conclusions
• First formulation for paired-end (P.E.) distance validation in graphs
• First index-based exact algorithm
• Practical for pan-genome graphs
• A useful module for graph mappers
github.com/ParBLiSS/PairG
Future directions
• Performance
  • Scale to human genomes
  • Index compression (e.g., run-length encoding)
• Applications
  • Clustering adjacent seed matches
  • End-to-end graph read mapper

Paired-end alignments in sequence graphs

  • 1. Validating paired-end alignments in sequence graphs
      Chirag Jain (1), Haowen Zhang (2), Alexander Dilthey (3), Srinivas Aluru (2)
      (1) National Institutes of Health, (2) Georgia Institute of Technology, (3) University Hospital of Dusseldorf
      WABI 2019, Niagara Falls
  • 3. Mapping reads
      • Sequences (reads) are mapped to a reference sequence
      • Read types shown: single molecule sequencing, Illumina paired-end sequencing
      Figure: reads placed along a linear reference sequence (… G T C C G T C G C C T A A T C G C A C G T C …)
  • 4. Graph-based reference
      • Linear representation: … G T C C G T C G C C T A A T C G C A C G T C …
      • Graph-based pan-genome reference [Beyer et al. 2019]
  • 6. Applications
      • Genotyping: MHC-PRG [Dilthey et al. 2015], vg [Garrison et al. 2018], Graph-Aligner [Rakocevic et al. 2018]
      • RNA-seq: ASGAL [Denti et al. 2018], HISAT2 [Kim et al. 2019]
      • Graph-guided assembly: Kourami [Lee and Kingsford 2018]
      • Hybrid genome assembly: Unicycler [Wick et al. 2017], Whatshap [Garg et al. 2018]
      All of these require alignment of reads to sequence graphs.
  • 7. Sequence to graph alignment
      • Cases, in order: sequence to sequence; sequence to acyclic graphs; sequence to general graphs
      • Time, respectively: $O(|R||Q|)$; $O(|V| + |E||Q|)$; $O(|V| + |E||Q|)$ *
      • References, as listed: [Smith and Waterman JMB 1981] [Navarro 2001 TCS] [Jain et al. 2019 RECOMB]
      *(edits allowed only in query)
      Example (sequence to sequence), with query Q = ACCATGTTTAG:
          Q: ACCATGTTTA-G
          R: -CCAAG-TTAAG
      Figure: the same query aligned to an acyclic graph and to a general graph G(V, E)
  • 8. Sequence to graph alignment
      • Same three cases, time bounds, and references as on slide 7 (*edits allowed only in query)
      • Read types shown: single-end Illumina reads or long reads; paired-end reads
  • 9. Using paired-end sequencing
      • Dominant sequencing protocol
      • Paired-end information allows
          • repeat disambiguation
          • SV discovery
      Figure: a paired-end read and its read mappings on a linear reference, separated by the inner distance
  • 10. Using paired-end sequencing
      • Dominant sequencing protocol
      • Paired-end information allows
          • repeat disambiguation
          • SV discovery
      • How to evaluate mapping candidates?
          • vg, HISAT2, HLA-PRG, deBGA use heuristics, and lack guarantees
      Figure: a paired-end read and its inner distance, mapped to a sequence graph
  • 11. Contributions
      • Problem formulation for paired-end validation in graphs
      • First index-based exact algorithm
      • A million queries in < 1 sec
      • Can be plugged into any graph mapper
      • Better accuracy and runtime than a BFS-based heuristic
      github.com/ParBLiSS/PairG
  • 13. Sequence graph
      • A directed graph with character-labeled vertices
      • Good abstraction for commonly used graphs in genomics
      Figure: example graph with vertices labeled A, G, T, C, C, A, G
  • 14. Paired-end validation problem
      • A paired-end read has an inner distance ranging from $d_1$ to $d_2$
      • Its two ends (+ve strand, -ve strand) map to vertices $u$ and $v$ of the sequence graph
      • Does there exist any path of length $\in [d_1, d_2]$ from $u$ to $v$?
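The question on this slide can be checked directly, without any index, by expanding the set of vertices reachable in exactly k steps for k = 0 … d2. The sketch below is only an illustration of the problem statement, not the authors' algorithm. It assumes that path length is counted in edges and that vertices may repeat (needed for cyclic graphs); the function name and the toy graph are hypothetical.

```python
def path_exists_in_range(adj, u, v, d1, d2):
    """Return True if some path from u to v has length in [d1, d2].

    adj maps a vertex to a list of its out-neighbours.
    Assumption: length = number of edges; vertices may be revisited.
    """
    frontier = {u}  # vertices reachable from u in exactly 0 steps
    for length in range(d2 + 1):
        if length >= d1 and v in frontier:
            return True
        # vertices reachable in exactly length + 1 steps
        frontier = {w for x in frontier for w in adj.get(x, ())}
        if not frontier:
            break
    return False


# Toy graph: a chain 0 -> 1 -> 2 -> 3 plus a shortcut edge 0 -> 3,
# so vertex 3 is reachable from 0 by paths of length 1 and 3 only.
adj = {0: [1, 3], 1: [2], 2: [3]}
print(path_exists_in_range(adj, 0, 3, 1, 1))  # True
print(path_exists_in_range(adj, 0, 3, 2, 2))  # False
print(path_exists_in_range(adj, 0, 3, 3, 4))  # True
```

Each such query costs time proportional to d2 times the number of edges touched, which is what motivates precomputing an index.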
  • 15. Related problems
      Our problem: does there exist any path of length $\in [d_1, d_2]$ from $u$ to $v$?
      Related problems (does any of them solve our problem?) and their time:
      • All pairs shortest path: $O(|V||E|)$
      • Transitive closure: $O(|V||E|)$
      • Exact-path length problem: in a weighted directed graph, is there a path of length $d$ from vertex $u$ to $v$? $O(d|E|)$ (NP-complete) [Nykanen and Ukkonen 2002]
  • 17. An index-based algorithm
      • $A$: boolean adjacency matrix of the sequence graph $G(V, E)$
      • $\mathrm{Ind} = f(A, d_1, d_2)$: boolean index matrix
      • $\mathrm{Ind}[i, j] = 1$ iff $\exists$ a path of length $\in [d_1, d_2]$ from vertex $i$ to $j$
      Figure: example sequence graph on vertices 1 to 9, with its 9 × 9 matrices $A$ and $\mathrm{Ind}$
  • 18. An index-based algorithm …
      • $\mathrm{Ind} = f(A, d_1, d_2)$, where $A$ is the adjacency matrix and $\mathrm{Ind}$ the index matrix
      • $\mathrm{Ind}[i, j] = 1$ iff $\exists$ a path of length $\in [d_1, d_2]$ from vertex $i$ to $j$, i.e. of length $d_1$ or $d_1 + 1$ or … or $d_2$
      • $\mathrm{Ind} = A^{d_1} \lor A^{d_1+1} \lor \dots \lor A^{d_2} = A^{d_1} \cdot (I \lor A)^{d_2 - d_1}$
      • $O(|V|^3 \log d_2)$ time using general matrix-matrix multiplication
      • The input graph's sparsity and near-linear topology suggest using SpGEMM
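The formula Ind = A^d1 · (I ∨ A)^(d2 − d1) can be sketched in a few lines with boolean sparse matrices stored as one set of column indices per row, and powers computed by repeated squaring. This is a minimal illustration under stated assumptions, not the PairG implementation (which, per the deck, is C++ on KokkosKernels); the function names, the dict-of-sets representation, and the toy graph are all hypothetical.

```python
def bool_matmul(X, Y):
    """Boolean product of sparse matrices stored as {row: set(columns)}."""
    Z = {}
    for i, cols in X.items():
        out = set()
        for k in cols:
            out |= Y.get(k, set())  # OR together the rows of Y selected by row i of X
        if out:
            Z[i] = out
    return Z


def bool_power(M, p, n):
    """M^p over the boolean semiring by repeated squaring (n = number of vertices)."""
    result = {i: {i} for i in range(n)}  # identity matrix
    base = M
    while p > 0:
        if p & 1:
            result = bool_matmul(result, base)
        base = bool_matmul(base, base)
        p >>= 1
    return result


def build_index(adj, n, d1, d2):
    """Ind = A^d1 . (I v A)^(d2 - d1); vertices are assumed to be 0 .. n-1."""
    A = {i: set(adj.get(i, ())) for i in range(n)}
    I_or_A = {i: A[i] | {i} for i in range(n)}
    return bool_matmul(bool_power(A, d1, n), bool_power(I_or_A, d2 - d1, n))


# Toy graph: chain 0 -> 1 -> 2 -> 3 plus a shortcut edge 0 -> 3.
adj = {0: [1, 3], 1: [2], 2: [3]}
index = build_index(adj, 4, 2, 3)
print(index)  # {0: {2, 3}, 1: {3}}
```

In the toy graph, the pairs with a path of length 2 or 3 are (0, 2), (0, 3) and (1, 3), which is exactly the set of non-zeros printed.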
  • 19. Using SpGEMM
      • Compressed format to store matrices (CSR); space dictated by non-zeros
          • row pointer: ( 0 1 3 4 7 ), $O(|V|)$ space
          • column index: ( 0 0 2 0 1 2 3 ), $O(|E|)$ space
      • Runtime dictated by non-zero scalar products [Gustavson 1978]
      Figure: example matrix with rows 0 to 3
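The two CSR arrays on this slide can be reproduced from a small dense matrix. The matrix picture itself is not in the transcript, so the 4 × 4 boolean matrix below is an assumption: it is the one matrix with four columns that is consistent with the row pointer and column index shown. The helper name `to_csr` is hypothetical.

```python
def to_csr(dense):
    """Convert a dense boolean matrix (list of rows) into CSR arrays.

    row_ptr[i] .. row_ptr[i + 1] delimits the slice of col_idx that holds
    the column indices of the non-zeros in row i.
    """
    row_ptr, col_idx = [0], []
    for row in dense:
        col_idx.extend(j for j, x in enumerate(row) if x)
        row_ptr.append(len(col_idx))
    return row_ptr, col_idx


# Assumed example matrix, reconstructed from the arrays on the slide.
dense = [
    [1, 0, 0, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 1, 1, 1],
]
row_ptr, col_idx = to_csr(dense)
print(row_ptr)  # [0, 1, 3, 4, 7]
print(col_idx)  # [0, 0, 2, 0, 1, 2, 3]
```

The row pointer has one entry per row plus one, and the column index has one entry per non-zero, which is where the $O(|V|)$ and $O(|E|)$ space figures for an adjacency matrix come from.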
  • 20. Indexing cost using SpGEMM
      $\mathrm{Ind} = A^{d_1} \cdot (I \lor A)^{d_2 - d_1}$, with $0 \le d_1, d_2 \le |V|$
      • Worst case (dense): $O(|V|^3 \log d_2)$ time and $O(|V|^2)$ space
      • When the input is a chain: $\Theta(|V|((d_2 - d_1)^2 + \log d_1))$ time and $\Theta(|V|(d_2 - d_1 + 1))$ space
      • Lower bound: assume $G_c(V_c, E_c)$ = longest chain in $G(V, E)$
          Lemma. Computing the index for $G(V, E)$ takes $\Omega(|V_c|((d_2 - d_1)^2 + \log d_1))$ time and $\Omega(|V_c|(d_2 - d_1 + 1))$ space
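One way to see where the chain bounds come from is to count non-zero scalar products, which is what governs SpGEMM runtime. The accounting below is a sketch and not the proof from the paper. It assumes both powers are computed by repeated squaring, that boundary effects at the end of the chain are ignored, and that $d_2 - d_1 \ge 1$. On a chain, $A^k$ has at most one non-zero per row, while $(I \lor A)^k$ has about $k + 1$ non-zeros per row.

```latex
\begin{align*}
T\big(A^{d_1}\big)
  &= \Theta(\log d_1)\ \text{products} \times \Theta(|V|)\ \text{scalar products each}
   = \Theta\big(|V| \log d_1\big) \\
T\big((I \lor A)^{d_2 - d_1}\big)
  &= \sum_{k \in \{1, 2, 4, \dots\},\ k \le d_2 - d_1} \Theta\big(|V|\,(k + 1)^2\big)
   = \Theta\big(|V|\,(d_2 - d_1)^2\big) \\
T\big(A^{d_1} \cdot (I \lor A)^{d_2 - d_1}\big)
  &= \Theta\big(|V|\,(d_2 - d_1 + 1)\big) \\
\mathrm{nnz}(\mathrm{Ind})
  &\le |V|\,(d_2 - d_1 + 1)
\end{align*}
```

Summing the three time terms gives $\Theta(|V|((d_2 - d_1)^2 + \log d_1))$, and the non-zero count gives the $\Theta(|V|(d_2 - d_1 + 1))$ space figure stated on the slide.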
  • 21. Query cost
      • Requires a simple lookup
      • Binary search in the column index
      Figure: a paired-end read queried against the 9 × 9 index matrix
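A lookup of this kind can be sketched as a binary search within one row's slice of the CSR column index. The sketch assumes the column indices within each row are sorted; the function name and the small index used here are hypothetical, not taken from PairG.

```python
from bisect import bisect_left


def query(row_ptr, col_idx, i, j):
    """Return True if Ind[i, j] is non-zero, given the index in CSR form.

    Assumption: column indices within each row are stored in sorted order.
    """
    lo, hi = row_ptr[i], row_ptr[i + 1]    # slice of col_idx belonging to row i
    pos = bisect_left(col_idx, j, lo, hi)  # binary search restricted to that slice
    return pos < hi and col_idx[pos] == j


# Hypothetical 4 x 4 index with non-zeros at (0, 2), (0, 3) and (1, 3).
row_ptr = [0, 2, 3, 3, 3]
col_idx = [2, 3, 3]
print(query(row_ptr, col_idx, 0, 3))  # True
print(query(row_ptr, col_idx, 0, 1))  # False
print(query(row_ptr, col_idx, 2, 3))  # False
```

Each query therefore costs time logarithmic in the number of non-zeros in row i, independent of the graph size.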
  • 23. Setup
      • Sequence graphs
          • Acyclic: from human genome GRCh37 + variations from 1KG Project
          • Cyclic: pan-genomic de-Bruijn graphs, B. anthracis strains
      • Insert sizes: mean 300, 500, 700 bp (allowed range: ±150 bp)
      • Implementation: C++, using 'KokkosKernels' linear algebra library
      • Intel Xeon CPU: 28 cores and 256 GB RAM

      Table 1. Directed sequence graphs used for evaluation. In these graphs, each vertex is labeled with a DNA nucleotide. Four acyclic graphs are derived from segments of human genome … files from the 1000 Genomes Project (Phase 3). Three cyclic graphs are de Bruijn graphs … whole-genome sequences of Bacillus anthracis strains, with k-mer length 25.

      Id      Graph                       |V|      |E|      Type
      VG1     mitochondrial-DNA           21K      27K      acyclic
      VG2     BRCA1                       83K      85K      acyclic
      VG3     LRC_KIR                     1.1M     1.2M     acyclic
      VG4     MHC                         5.1M     5.3M     acyclic
      DBG1    B. anthracis (1 strain)     5.2M     5.2M     cyclic
      DBG5    B. anthracis (5 strains)    10.4M    10.4M    cyclic
      DBG20   B. anthracis (20 strains)   11.2M    11.3M    cyclic

      Paper excerpt shown on the slide: "… tested PairG using d1 = 0, d2 = 250. Similarly, for insert-size configurations … 700 bp, we tested PairG using inner distance limits (d1 = 150, d2 = 450) and (d… 650), respectively. There may be insert size configurations where allowing read …"
  • 24. Index construction

      Table 2. Performance measured in terms of wall-clock time and memory usage for building the index matrix using all input graphs and different distance constraints. nnz represents the number of non-zero elements in the index matrix, to indicate its size. Our implementation uses 4 bytes to store each non-zero of a matrix in memory.

              Insert size 300        Insert size 500        Insert size 700
      Graph   Time   Mem    nnz      Time   Mem    nnz      Time    Mem    nnz
      VG1     0.2s   0.1G   6.8M     0.4s   0.2G   7.8M     0.4s    0.2G   7.6M
      VG2     0.4s   0.3G   21M      0.9s   0.5G   26M      0.9s    0.5G   26M
      VG3     5.6s   3.8G   0.3B     12s    6.2G   0.3B     12s     6.2G   0.3B
      VG4     25s    17G    1.3B     53s    28G    1.6B     53s     28G    1.6B
      DBG1    25s    17G    1.3B     54s    28G    1.6B     54s     28G    1.7B
      DBG5    52s    35G    2.8B     2m     60G    3.6B     2m      60G    4.2B
      DBG20   1.3h   56G    4.6B     7.2h   118G   8.9B     17.2h   129G   14B

      Highlighted on this slide: GRCh37 + variations (VG1 to VG4)
  • 25. Index construction
      Same Table 2 as on slide 24. Highlighted on this slide: pan-genomic de-Bruijn graphs (DBG1 to DBG20), alongside GRCh37 + variations (VG1 to VG4)
  • 26. Querying the index is super-fast!
      • Simulated random vertex pairs $[i, j]$
      • One million queries take < 1 second
      vs. BFS-based heuristic
      • Heuristic: return true if BFS-distance from source $\le d_2$
      • Index lookups are two to three orders of magnitude faster
      • Heuristic accuracy ranged from 98% to 100%

      Table 3. Time to execute a million queries using all the graphs and distance constraints. Each query is a random pair of vertices in the graph.

              Time (sec) by insert size
      Graph   300    500    700
      VG1     0.1    0.1    0.1
      VG2     0.2    0.2    0.2
      VG3     0.4    0.4    0.5
      VG4     0.5    0.5    0.5
      DBG1    0.4    0.5    0.5
      DBG5    0.4    0.5    0.5
      DBG20   0.5    0.5    0.6

      Paper excerpt shown on the slide: "… be uniformly distributed over the graph, we tested the querying performance … a million random vertex pairs (u, v), u, v ∈ [1, |V|]. For all the seven … million vertex pairs finished in less than a second (Table 3). Even th…"
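The baseline described on this slide, "return true if BFS-distance from source ≤ d2", can be sketched as a depth-bounded breadth-first search. The details below (function name, early exit, toy graph) are assumptions for illustration and may differ from the heuristic the authors benchmarked. The example also shows why such a heuristic is not exact: it ignores the lower limit d1 and considers only the shortest path.

```python
from collections import deque


def bfs_heuristic(adj, u, v, d2):
    """Return True if the shortest-path (BFS) distance from u to v is at most d2."""
    dist = {u: 0}
    queue = deque([u])
    while queue:
        x = queue.popleft()
        if x == v:
            return True
        if dist[x] == d2:
            continue  # do not expand beyond depth d2
        for w in adj.get(x, ()):
            if w not in dist:
                dist[w] = dist[x] + 1
                queue.append(w)
    return False


# Toy graph: chain 0 -> 1 -> 2 -> 3 plus a shortcut edge 0 -> 3.
# Paths from 0 to 3 have length 1 or 3, so no path has length exactly 2.
adj = {0: [1, 3], 1: [2], 2: [3]}
print(bfs_heuristic(adj, 0, 3, 2))  # True, although the exact answer for [d1, d2] = [2, 2] is no
```

An exact index lookup would answer no for this pair with $d_1 = d_2 = 2$, which is the kind of disagreement behind an accuracy below 100%.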
  • 27. Conclusions
      • First formulation for paired-end distance validation in graphs
      • First index-based exact algorithm
      • Practical for pan-genome graphs
      • A useful module for graph mappers
      github.com/ParBLiSS/PairG
  • 28. Future directions
      • Performance
          • Scale to human genomes
          • Index compression (e.g., run-length encoding)
      • Applications
          • Clustering adjacent seed matches
          • End-to-end graph read mapper