SlideShare ist ein Scribd-Unternehmen logo
1 von 62
Downloaden Sie, um offline zu lesen
Using BioNano Maps to Improve 
an Insect Genome Assembly 
! 
! 
! 
! 
! 
! 
! 
! 
! 
Sue Brown 
Oct 23, 2014 
1
Tribolium castaneum, the red flour beetle 
Genetic model organism for developmental, physiology and toxicology studies 
! 
• Easily cultured 
• Short generation time 
• Small genome size 
• Molecular and visible marker genetic maps 
• Genetic tools: balancers, deficiencies 
• Genomic libraries: lambda and BAC 
• cDNA libraries 
• Mutant analysis and RNAi 
• Transformation 
• 7x Sanger draft genome (Nature, 2008) 
2
Tribolium Genome 
! 
! 
! 
! 
Genome size: 200 (Mb) Cot Analysis 
9 autosomes, X and Y 
Low methylation 
Long period interspersion 
! 
! 
! 
! 
! 
! 
Jeff Stuart, Purdue University 
3
Molecular map markers used to anchor scaffolds to Chromosome builds 
Few X markers, no Y, variable marker density 
4
Tcas 3.0 Reference Genome stats from NCBI 
Input file N50 (Mb) Number Cumulative 
Length (Mb) 
Genome contigs 0.04 8814 160.74 
Genome scaffolds 0.98 481 152.53 
Unmapped scaffolds 352 
Unmapped contigs 1884 
5
Algorithms and filters used to improve the Tribolium 
draft Assembly with Physical Maps Based on 
Imaging Ultra-Long Single DNA Molecules 
! 
Jennifer Shelton 
2014 
6
Data formats 
BNX molecule 1 
BNX - text file of 
molecules 
7
Data formats 
BNX molecule 1 
BNX - text file of 
molecules 
CMAP - text file of 
consensus maps 
7
Data formats 
in silico CMAP - 
from genome 
FASTA 
BNX molecule 1 
in silico CMAP 1 
BNX - text file of 
molecules 
CMAP - text file of 
consensus maps 
7
Data formats 
in silico CMAP - 
from genome 
FASTA 
BNG CMAP - 
from assembled 
molecules 
BNX molecule 1 
in silico CMAP 1 
BNG CMAP 1 
BNX - text file of 
molecules 
CMAP - text file of 
consensus maps 
7
Data formats 
in silico CMAP - 
from genome 
FASTA 
BNG CMAP - 
from assembled 
molecules 
XMAP - text file of 
alignment of two 
CMAPs 
BNX molecule 1 
in silico CMAP 1 
BNG CMAP 1 
in silico CMAP 1 in silico CMAP 2 
BNG CMAP 1 BNG CMAP 2 
BNX - text file of 
molecules 
CMAP - text file of 
consensus maps 
7
Assembly Pipeline 
BBNNXX BNX 
scanBNX 
4) QC graphs for 
each flowcell 
adjBNX 
adjBNX 
5) Merge all 
flowcells 
6) First 
assemblies 
Strict 
p-value 
threshold 
Default 
p-value 
threshold 
Assembly workflow:! ! 
1) The Irys produces tiff files that are converted into BNX text files.! 
2) Each chip produces one BNX file for each of two flowcells.! 
3) BNX files are split by scan and aligned to the sequence reference. Stretch (bases per pixel) is 
3) use sequence reference to adjust molecule stretch for each scan 
recalculated from the alignment.! 
4) Quality check graphs are created for each pre-adjusted flowcell BNX.! 
5) Adjusted flowcell BNXs are merged.! 
6) The first assemblies are run with a variety of p-value thresholds.! 
7) The best of the first assemblies (red oval) is chosen and a version of this assembly is produced 
with a variety of minimum molecule length filters. 
adjBNX 
adjBNX 
1) The Irys 
produces tiff files 
3) Scan BNX are 
adjusted 
7) Second 
assemblies 
Strict 
minimum 
molecule 
length 
Relaxed 
minimum 
molecule 
length 
mergeBNX 
Relaxed 
p-value 
threshold 
BBNNXX BNX 
scanBNX 
BBNNXX BNX 
scanBNX 
BBNNXX BNX 
scanBNX 
2) Each chip 
produces flowcell 
BNX files 
BNX 
BNX 
BNX 
BNX 
8
Assembly Pipeline 
In recent datasets when SNR is low and alignment is good we see a spike in 
bases per pixel (bpp) in the first scan, a plateau and a lower plateau 
First scan in a 
flow cell 
9
Assembly Pipeline 
BBNNXX BNX 
scanBNX 
4) QC graphs for 
each flowcell 
adjBNX 
adjBNX 
5) Merge all 
flowcells 
6) First 
assemblies 
Strict 
p-value 
threshold 
Default 
p-value 
threshold 
Assembly workflow:! ! 
1) The Irys produces tiff files that are converted into BNX text files.! 
2) Each chip produces one BNX file for each of two flowcells.! 
3) BNX files are split by scan and aligned to the sequence reference. Stretch (bases per pixel) is 
5) Use sequence reference to determine assembly noise parameters. 
Estimated recalculated from genome the alignment.! 
size is used to set the p-value threshold. 
4) Quality check graphs are created for each pre-adjusted flowcell BNX.! 
5) Adjusted flowcell BNXs are merged.! 
6) The first assemblies are run with a variety of p-value thresholds.! 
7) The best of the first assemblies (red oval) is chosen and a version of this assembly is produced 
with a variety of minimum molecule length filters. 
adjBNX 
adjBNX 
1) The Irys 
produces tiff files 
3) Scan BNX are 
adjusted 
7) Second 
assemblies 
Strict 
minimum 
molecule 
length 
Relaxed 
minimum 
molecule 
length 
mergeBNX 
Relaxed 
p-value 
threshold 
BBNNXX BNX 
scanBNX 
BBNNXX BNX 
scanBNX 
BBNNXX BNX 
scanBNX 
2) Each chip 
produces flowcell 
BNX files 
BNX 
BNX 
BNX 
BNX 
10
Assembly Pipeline 
BBNNXX BNX 
scanBNX 
4) QC graphs for 
each flowcell 
adjBNX 
adjBNX 
5) Merge all 
flowcells 
6) First 
assemblies 
Strict 
p-value 
threshold 
Default 
p-value 
threshold 
Assembly workflow:! ! 
1) The Irys produces tiff files that are converted into BNX text files.! 
2) Each chip produces one BNX file for each of two flowcells.! 
3) BNX files are split by scan and aligned to the sequence reference. Stretch (bases per pixel) is 
6/7) Variants of the starting p-value and default minimum molecule length are 
explored in nine assemblies. 
recalculated from the alignment.! 
4) Quality check graphs are created for each pre-adjusted flowcell BNX.! 
5) Adjusted flowcell BNXs are merged.! 
6) The first assemblies are run with a variety of p-value thresholds.! 
7) The best of the first assemblies (red oval) is chosen and a version of this assembly is produced 
with a variety of minimum molecule length filters. 
adjBNX 
adjBNX 
1) The Irys 
produces tiff files 
3) Scan BNX are 
adjusted 
7) Second 
assemblies 
Strict 
minimum 
molecule 
length 
Relaxed 
minimum 
molecule 
length 
mergeBNX 
Relaxed 
p-value 
threshold 
BBNNXX BNX 
scanBNX 
BBNNXX BNX 
scanBNX 
BBNNXX BNX 
scanBNX 
2) Each chip 
produces flowcell 
BNX files 
BNX 
BNX 
BNX 
BNX 
11
Current Tribolium sequence-based assembly 
Input file N50 (Mb) Number 
of 
Scaffolds 
Cumulative 
Length (Mb) 
Genome FASTA 1.16 2240 160.74 
in silico CMAP from FASTA 1.20 223 152.53 
223 scaffolds from the sequence-based assembly were longer than 20 (kb) 
with more than 5 labels and were converted into in silico CMAPs 
12
Assembly Results 
Input file N50 (Mb) Number Cumulative 
Length (Mb) 
Genome FASTA 1.16 2240 160.74 
in silico CMAP from FASTA 1.20 223 152.53 
CMAP from assembled BNG 
molecules (BNG CMAP) 
1.35 216 200.47 
BNG assembled molecules had a higher N50 and longer cumulative length 
than the sequence assembly 
! 
The estimated size of the Tribolium genome is ~200 (Mb) 
13
Simplest XMAP alignment description 
1 (Mb) 
1.1 (Mb) 
1.1 (Mb) 1.3 (Mb) 
Breadth of alignment coverage for in silico CMAP: 2.1 (Mb) 
Total alignment length for in silico CMAP: 2.1 (Mb) 
! 
Breadth of alignment coverage for BNG CMAP: 2.4 (Mb) 
Total alignment length for BNG CMAP: 2.4 (Mb) 
in silico CMAP 
from genome 
FASTA 
CMAP from 
assembled 
molecules 
in silico CMAP 1 in silico CMAP 2 
BNG CMAP 1 BNG CMAP 2 
14
Complex XMAP alignment description 
1 (Mb) 
in silico CMAP 1 
BNG CMAP 1 BNG CMAP 2 
1.1 (Mb) 1.3 (Mb) 
Breadth of alignment coverage for in silico CMAP: 1 (Mb) 
Total alignment length for in silico CMAP: 2 (Mb) 
! 
Breadth of alignment coverage for BNG CMAP: 2.4 (Mb) 
Total alignment length for BNG CMAP: 2.4 (Mb) 
in silico CMAP 
from genome 
FASTA 
CMAP from 
assembled 
molecules 
15
Alignment of CMAPs 
1 (Mb) 
in silico CMAP 1 
BNG CMAP 1 BNG CMAP 2 
1.1 (Mb) 1.3 (Mb) 
Breadth of alignment coverage compared to total aligned length can indicate 
relevant relationships between assemblies 
! 
In this example differences between "breadth" and "total" length could be due to: 
! 
Genomic duplications in sample molecules were extracted from 
Assembly of alternate haplotypes 
Mis-assembly creating redundant contigs 
Collapsed repeat in sequence assembly 
in silico CMAP 
from genome 
FASTA 
CMAP from 
assembled 
molecules 
16
Alignment of BNG assembly to reference genome 
CMAP name Breadth of alignment 
coverage for CMAP 
(Mb) 
Length of total 
alignment for 
CMAP (Mb) 
Percent of CMAP 
aligned 
in silico CMAP from FASTA 124.04 132.40 81 
CMAP from assembled BNG 
molecules (BNG CMAP) 
131.64 132.34 67 
Close to 4% of the alignment of the in silico CMAP appears to be redundant 
! 
Overall 81% of the in silico CMAP aligns to the BNG consensus map 
17
ChLG 9 super! 
Alignment of BNG assembly to reference genome 
scaffold 
BNG consensus 
maps 
ChLG 9! 
scaffolds 
130 131 133 134 132 129 135 127 136 137 BNG consensus 
Typically where redundant alignments occur two BNG consensus maps 
aligned suggesting they represent haplotypes although this has not been 
verified 
maps 
18
Potential haplotypes where overlapping BNG cmaps align 
ChLG 9 super! 
scaffold 
BNG consensus 
maps 
ChLG 9! 
scaffolds 
128 130 131 133 134 132 BNG consensus 
maps 
19
Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium 
scaffolds 
+ in silico CMAP 1 + in silico CMAP 4 
Stitch.pl estimates super scaffolds using alignments of scaffolds and 
assembled BNG molecules using BNG Refaligner 
in silico CMAP 
aligned as 
reference 
+ in silico CMAP 2 - in silico CMAP 3 
BNG CMAP 1 BNG CMAP 2 
20
Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium 
scaffolds 
+ in silico CMAP 1 + in silico CMAP 4 
BNG CMAP 1 BNG CMAP 2 
+ in silico CMAP 1 + in silico CMAP 4 
Stitch.pl estimates super scaffolds using alignments of scaffolds and 
assembled BNG molecules using BNG Refaligner 
in silico CMAP 
aligned as 
reference 
alignment is 
inverted and 
used as input for 
stitch 
+ in silico CMAP 2 - in silico CMAP 3 
BNG CMAP 1 BNG CMAP 2 
+ in silico CMAP 2 - in silico CMAP 3 
20
Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium 
scaffolds 
+ in silico CMAP 1 + in silico CMAP 4 
BNG CMAP 1 BNG CMAP 2 
+ in silico CMAP 1 + in silico CMAP 4 
BNG CMAP 1 BNG CMAP 2 
Stitch.pl estimates super scaffolds using alignments of scaffolds and 
assembled BNG molecules using BNG Refaligner 
in silico CMAP 
aligned as 
reference 
alignment is 
inverted and 
used as input for 
stitch 
+ in silico CMAP 2 - in silico CMAP 3 
+ in silico CMAP 4 
alignments are 
filtered based on 
alignment length 
relative total 
possible 
alignment length 
and confidence 
+ in silico CMAP 2 - in silico CMAP 3 
BNG CMAP 1 BNG CMAP 2 
+ in silico CMAP 2 - in silico CMAP 3 
+ in silico CMAP 1 
20
Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium 
scaffolds 
BNG CMAP 1 BNG CMAP 2 
+ in silico CMAP 1 + in silico CMAP 4 
+ in silico CMAP 2 - in silico CMAP 3 
BNG CMAP 1 
+ in silico CMAP 1 
Stitch.pl checks alignment length against potential alignment lengths to find 
relevant global rather than local alignments 
alignment 
passes because 
the alignment 
length is greater 
than 30% of the 
potential 
alignment length 
21
Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium 
BNG CMAP 1 BNG CMAP 2 
+ in silico CMAP 1 + in silico CMAP 4 
+ in silico CMAP 2 - in silico CMAP 3 
BNG CMAP 1 
scaffolds 
+ in silico CMAP 2 
Stitch.pl checks alignment length against potential alignment lengths to find 
relevant global rather than local alignments 
alignment 
passes because 
the alignment 
length is greater 
than 30% of the 
potential 
alignment length 
22
Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium 
scaffolds 
BNG CMAP 1 BNG CMAP 2 
+ in silico CMAP 1 + in silico CMAP 4 
+ in silico CMAP 2 - in silico CMAP 3 
- in silico CMAP 2 
BNG CMAP 2 
Stitch.pl checks alignment length against potential alignment lengths to find 
relevant global rather than local alignments 
alignment 
passes because 
the alignment 
length is greater 
than 30% of the 
potential 
alignment length 
23
Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium 
scaffolds 
BNG CMAP 1 BNG CMAP 2 
+ in silico CMAP 1 + in silico CMAP 4 
+ in silico CMAP 2 - in silico CMAP 3 
- in silico CMAP 2 
BNG CMAP 2 
Stitch.pl checks alignment length against potential alignment lengths to find 
relevant global rather than local alignments 
alignment fails 
because the 
alignment length 
is less than 30% 
of the potential 
alignment length 
24
Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium 
scaffolds 
BNG CMAP 1 BNG CMAP 2 
+ in silico CMAP 1 + in silico CMAP 4 
+ in silico CMAP 2 - in silico CMAP 3 
+ in silico CMAP 2 
BNG CMAP 2 
Stitch.pl checks alignment length against potential alignment lengths to find 
relevant global rather than local alignments 
alignment fails 
because the 
alignment length 
is less than 30% 
of the potential 
alignment length 
25
Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium 
scaffolds 
BNG CMAP 1 BNG CMAP 2 
+ in silico CMAP 1 + in silico CMAP 4 
+ in silico CMAP 2 - in silico CMAP 3 
BNG CMAP 2 
Stitch.pl checks alignment length against potential alignment lengths to find 
relevant global rather than local alignments 
alignment 
passes because 
the alignment 
length is greater 
than 30% of the 
potential 
alignment length 
- in silico CMAP 3 
26
Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium 
BNG CMAP 1 BNG CMAP 2 
+ in silico CMAP 1 + in silico CMAP 4 
+ in silico CMAP 2 - in silico CMAP 3 
BNG CMAP 2 
scaffolds 
Stitch.pl checks alignment length against potential alignment lengths to find 
relevant global rather than local alignments 
alignment fails 
because the 
alignment length 
is less than 30% 
of the potential 
alignment length 
- in silico CMAP 3 
27
Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium 
BNG CMAP 1 BNG CMAP 2 
+ in silico CMAP 1 + in silico CMAP 4 
+ in silico CMAP 2 - in silico CMAP 3 
BNG CMAP 2 
scaffolds 
Stitch.pl checks alignment length against potential alignment lengths to find 
relevant global rather than local alignments 
alignment 
passes because 
the alignment 
length is greater 
than 30% of the 
potential 
alignment length 
+ in silico CMAP 4 
28
Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium 
scaffolds 
BNG CMAP 1 BNG CMAP 2 
+ in silico CMAP 2 - in silico CMAP 3 
+ in silico CMAP 4 
high quality 
scaffolding 
alignments... 
+ in silico CMAP 1 
29
Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium 
scaffolds 
are filtered for 
longest and 
highest 
confidence 
alignment for 
each in silico 
CMAP 
BNG CMAP 1 BNG CMAP 2 
+ in silico CMAP 2 - in silico CMAP 3 
BNG CMAP 1 BNG CMAP 2 
+ in silico CMAP 2 - in silico CMAP 3 
+ in silico CMAP 4 
+ in silico CMAP 1 + in silico CMAP 4 
high quality 
scaffolding 
alignments... 
+ in silico CMAP 1 
29
Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium 
scaffolds 
are filtered for 
longest and 
highest 
confidence 
alignment for 
each in silico 
CMAP 
Passing 
alignments are 
used to super 
scaffold 
BNG CMAP 1 BNG CMAP 2 
+ in silico CMAP 2 - in silico CMAP 3 
BNG CMAP 1 BNG CMAP 2 
+ in silico CMAP 2 - in silico CMAP 3 
+ in silico CMAP 4 
+ in silico CMAP 1 + in silico CMAP 4 
BNG CMAP 1 BNG CMAP 2 
+ in silico CMAP 2 - in silico CMAP 3 
+ in silico CMAP 1 + in silico CMAP 4 
high quality 
scaffolding 
alignments... 
+ in silico CMAP 1 
29
Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium 
scaffolds 
Stitch is iterated 
and additional 
super 
scaffolding 
alignments are 
found 
BNG CMAP 1 BNG CMAP 2 
+ in silico CMAP 2 - in silico CMAP 3 
+ in silico CMAP 1 + in silico CMAP 4
Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium 
scaffolds 
Stitch is iterated 
and additional 
super 
scaffolding 
alignments are 
found 
BNG CMAP 1 BNG CMAP 2 
+ in silico CMAP 2 - in silico CMAP 3 
+ in silico CMAP 1 + in silico CMAP 4 
Until all super 
scaffolds are 
BNG CMAP 1 BNG CMAP 2 
joined + in silico CMAP 2 - in silico CMAP 3 
+ in silico CMAP 1 + in silico CMAP 4
Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium 
scaffolds 
BNG CMAP 1 BNG CMAP 2 
- in silico CMAP 3 
+ in silico CMAP 2 
+ in silico CMAP 4 
+ in silico CMAP 1 
If gap length is estimated to be negative gaps are represented by 100 (bp) 
spacers 
31
Gap lengths 
Distribution of gap lengths for automated output 
Gap length (bp) 
Of the automated stitch.pl Tribolium super-scaffolds there were 66 gaps had 
known lengths and 26 had negative lengths (set to 100 (bp)) 
! 
Of the manually edited Tribolium super-scaffolds there were 66 gaps had 
known lengths and 24 had negative lengths (set to 100 (bp)) 
Count 
−1500000 −1000000 −500000 0 500000 1000000 
0 5 10 15 20 
Negative gap lengths 
Positive gap lengths 
32
Gap lengths 
Distribution of gap lengths for automated output 
Gap length (bp) 
Of the automated stitch.pl Tribolium super-scaffolds there were 66 gaps had 
known lengths and 26 had negative lengths (set to 100 (bp)) 
! 
Of the manually edited Tribolium super-scaffolds there were 66 gaps had 
known lengths and 24 had negative lengths (set to 100 (bp)) 
Count 
−1500000 −1000000 −500000 0 500000 1000000 
0 5 10 15 20 
Negative gap lengths 
Positive gap lengths 
32
Negative gap lengths 
Is part of scaffold_23 connected to 136?! 
I went with the second alignment (21-26 together and 136-137 together because it is supported by genetic maps) but we should 
check these assemblies. ! ! In bottom alignment of 136 you can see that a large section of the BNG map 32 (which joins 23 to 136) is a duplicate in the BNG 
assembly? 
22 23 129 136 137 
The longest negative gap length is from a BNG consenus map joining in silico 
23 and 136 
33
Negative gap lengths 
Is part of scaffold_23 connected to 136?! 
I went with the second alignment (21-26 together and 136-137 together because it is supported by genetic maps) but we should 
check these assemblies. ! ! In bottom alignment of 136 you can see that a large section of the BNG map 32 (which joins 23 to 136) is a duplicate in the BNG 
assembly? 
22 23 129 136 137 
! 
Because the same region of 136 aligns to another BNG consensus map that 
aligns to its chromosome linkage group this alignment was rejected and stitch 
was re-run 
34
scaffold_133 aligns but is not visible in IrysView?) why is Super_scaffold_65 backwards? 
Negative gap lengths 
ChLG 2 super! 
scaffold 
ChLG 2 super! 
scaffold 
BNG consensus 
maps 
BNG consensus 
maps 
ChLG 2! 
scaffolds 
133 134 132 129 135 127 136 137 139 138 140 141 142 143 144 145 
Two new super scaffolds were created and the sequence similarity is being 
evaluated 
min confidence 10 
scaffold_133 aligns but is not visible in IrysView?) why is Super_scaffold_65 backwards? 
ChLG 2! 
scaffolds 
130 131 133 134 132 129 135 127 136 137 139 138 140 141 142 143 144 145 
U 18 14 16 19 20 21 22 23 24 25 26 27 28 30 
BNG consensus 
maps 
U 18 14 16 19 20 21 22 23 24 25 26 27 28 30 
BNG consensus 
maps 
35
Gap lengths 
Distribution of gap lengths for automated output 
Gap length (bp) 
This negative alignment also indicated a potential assembly issue 
Count 
−1500000 −1000000 −500000 0 500000 1000000 
0 5 10 15 20 
Negative gap lengths 
Positive gap lengths 
36
Negative gap lengths 
This negative gap length is from a BNG consenus map joining in silico 81 and 
102 and 103 
Half of scaffold_81 aligns with ChLG7 
37
Negative gap lengths 
Half of scaffold_81 aligns with ChLG7 
79 80 81 82 83 
Because the other half of 81 aligns to another BNG consensus map that aligns 
to its chromosome linkage group this alignment was rejected and stitch was re-run 
! 
The BNG maps suggest a mis-assembly of in silico 81 at a sequence level 
38
Distribution of gap lengths for automated output 
Gap length (bp) 
Count 
−1500000 −1000000 −500000 0 500000 1000000 
0 5 10 15 20 
Negative gap lengths 
Positive gap lengths 
Gap lengths 
All extremely small negative gap lengths, < -20,000 (bp) (shaded), were 
independently flagged as potential sequence mis-assemblies to be checked at 
the sequence-level 
39
Distribution of gap lengths for automated output 
Gap length (bp) 
Count 
−1500000 −1000000 −500000 0 500000 1000000 
0 5 10 15 20 
Negative gap lengths 
Positive gap lengths 
Gap lengths 
All gaps from the shaded regions were also manually rejected and stitch.pl 
was rerun without them for the current super-scaffolded assembly 
! 
We suspect extremely small negative gap sizes may be useful in locating 
sequence mis-assemblies 
! 
stitch.pl version 1.4.5 rejects alignments if negative gap lengths < -20,000 (bp) 
but lists them in data summary 
40
Tribolium super-scaffolds 
Input file N50 (Mb) Number of 
Scaffolds 
Cumulative 
Length (Mb) 
genome FASTA 1.16 2240 160.74 
super-scaffold 
FASTA 
4.46 2150 165.92 
N50 of the super-scaffolded genome was ~4 times greater than the original 
! 
Super-scaffolds tend to agree with the Tribolium genetic map 
41
Tribolium super-scaffolds 
Input file N50 (Mb) Number of 
Scaffolds 
genome FASTA 1.16 2240 160.74 
4.46 2150 165.92 
For Tribolium : 
first minimum percent aligned = 30% 
first minimum confidence = 13 
Cumulative 
Length (Mb) 
second minimum percent aligned = 90% 
second minimum confidence = 8 
! 
super-scaffold 
FASTA 
Lower quality alignments were manually selected if genetic map also supported 
the order 
Complex scaffolds were broken manually for sequence level evaluation 
42
Tribolium super-scaffolds 
min confidence 10 
From ChLGX, 11 of the previous 13 scaffolds were joined with two unplaced scaffolds (U) into one super scaffold. 
ChLG X was reduced from 13 scaffolds to 2 with one scaffold being moved to 
ChLG 3 
ChLG X super! 
scaffold 
BNG consensus 
maps 
ChLG X! 
scaffolds 
BNG consensus 
maps 
U 3 4 5 6 7 U 8 9 10 11 12 13 
43
Tribolium super-scaffolds 
min confidence 10 
51 U 43 45 44 46 
The second scaffold from ChLG X aligned to scaffolds from a portion of 
ChLG 3 
ChLG 3 super! 
scaffold 
BNG consensus 
maps 
ChLG 3! 
scaffolds 
BNG consensus 
maps 
32 33 34 35 36 2 37 38 39 40 41 42 
ChLG 3 super! 
scaffold 
BNG consensus 
maps 
ChLG 3 super! 
scaffold 
BNG consensus 
44
Tribolium super-scaffolds 
min confidence 10 
From ChLGX, 11 of the previous 13 scaffolds were joined with two unplaced scaffolds (U) into one super scaffold. 
Two unplaced scaffolds aligned to ChLG X 
ChLG X super! 
scaffold 
BNG consensus 
maps 
ChLG X! 
scaffolds 
BNG consensus 
maps 
U 3 4 5 6 7 U 8 9 10 11 12 13 
45
Tribolium super-scaffolds 
min confidence 10 
From ChLGX, 11 of the previous 13 scaffolds were joined with two unplaced scaffolds (U) into one super scaffold. 
4% Redundancy in alignment may be from assembly of haplotypes (generally 
observed as two BNG consensus maps aligning to the same in silico map) 
ChLG X super! 
scaffold 
BNG consensus 
maps 
ChLG X! 
scaffolds 
BNG consensus 
maps 
U 3 4 5 6 7 U 8 9 10 11 12 13 
46
Potential haplotypes where overlapping BNG cmaps align 
min confidence 10 
From ChLGX, 11 of the previous 13 scaffolds were joined with two unplaced scaffolds (U) into one super scaffold. 
ChLG X super! 
scaffold 
BNG consensus 
maps 
ChLG X! 
scaffolds 
BNG consensus 
maps 
U 3 4 5 6 7 U 8 9 10 11 12 13 
47
Tribmoinl icuonmfid esnucep 1e0r-scaffolds 
ChLG9 currently (2150 scaffold_133 aligns but is not visible in IrysView?) why is Super_scaffold_65 backwards? 
128 130 131 133 134 132 129 135 127 136 137 139 138 140 141 142 143 144 145 
For ChLG 9 21 scaffolds were reduced to 9 
ChLG 9 super! 
scaffold 
BNG consensus 
maps 
ChLG 9! 
scaffolds 
BNG consensus 
maps 
48
min confidence 10 
Tribolium super-scaffolds 
For ChLG 5 17 scaffolds were reduced to 4 
ChLG 5 super! 
scaffold 
BNG consensus 
maps 
ChLG 5! 
scaffolds 
BNG consensus 
maps 
69 68 70 71 72 73 74 U 75 76 77 78 79 80 81 82 83 
49
Future directions: Structural Variant (SV) 
Use SV-detect pipelines to resize existing gaps in scaffolds and identify 
mis-assemblies 
50
Acknowledgements 
K-INBRE Bioinformatics Core! 
Susan Brown - PI 
Nic Herndon - script development 
Nanyan Lu - manual evaluation 
Michelle Coleman - extractions and running the Irys! 
Zachary Sliefert - metric summaries 
! 
Bionano Genomics! 
Ernest Lam - assembly pipeline best practices assistance 
Weiping Wang - assistance with data formats 
Palak Sheth - collaboration to standardize analysis 
! 
Script availability! 
https://github.com/i5K-KINBRE-script-share/Irys-scaffolding 
BNG scripts available by request from BNG 
! 
Slide availability! 
http://www.slideshare.net/kstatebioinformatics/using-bionano-maps-to-improve-an-insect-genome- 
Physical 
Molecules! 
University, Warren 
Kansas State University 
assembly 
! 
This project was supported by grants from the National Center for Research Resources 
(5P20RR016475) and the National Institute of General Medical Sciences (8P20GM103418) from 
the National Institutes of Health. 
were constructed by mapping molecular markers from the 
the assembly scaffolds, anchoring greater than 90% 
51
Gap lengths 
Distribution of gap lengths for automated output 
Gap length (bp) 
Of the automated stitch.pl Tribolium super-scaffolds there were 66 gaps had 
known lengths and 26 had negative lengths (set to 100 (bp)) 
! 
Of the manually edited Tribolium super-scaffolds there were 66 gaps had 
known lengths and 24 had negative lengths (set to 100 (bp)) 
Count 
−1500000 −1000000 −500000 0 500000 1000000 
0 5 10 15 20 
Negative gap lengths 
Positive gap lengths

Weitere ähnliche Inhalte

Was ist angesagt?

20110524zurichngs 1st pub
20110524zurichngs 1st pub20110524zurichngs 1st pub
20110524zurichngs 1st pub
sesejun
 
20110524zurichngs 2nd pub
20110524zurichngs 2nd pub20110524zurichngs 2nd pub
20110524zurichngs 2nd pub
sesejun
 
Overview of methods for variant calling from next-generation sequence data
Overview of methods for variant calling from next-generation sequence dataOverview of methods for variant calling from next-generation sequence data
Overview of methods for variant calling from next-generation sequence data
Thomas Keane
 
Anne_Vaittinen_advanced_seminar_presentation
Anne_Vaittinen_advanced_seminar_presentationAnne_Vaittinen_advanced_seminar_presentation
Anne_Vaittinen_advanced_seminar_presentation
Anne Vaittinen
 
Wellcome Trust Advances Course: NGS Course - Lecture1
Wellcome Trust Advances Course: NGS Course - Lecture1Wellcome Trust Advances Course: NGS Course - Lecture1
Wellcome Trust Advances Course: NGS Course - Lecture1
Thomas Keane
 

Was ist angesagt? (20)

Ashg2017 workshop tg
Ashg2017 workshop tgAshg2017 workshop tg
Ashg2017 workshop tg
 
Improving and validating the Atlantic Cod genome assembly using PacBio
Improving and validating the Atlantic Cod genome assembly using PacBioImproving and validating the Atlantic Cod genome assembly using PacBio
Improving and validating the Atlantic Cod genome assembly using PacBio
 
101717.kh miga ashg_grc
101717.kh miga ashg_grc101717.kh miga ashg_grc
101717.kh miga ashg_grc
 
Creating Reference-Grade Human Genome Assemblies
Creating Reference-Grade Human Genome AssembliesCreating Reference-Grade Human Genome Assemblies
Creating Reference-Grade Human Genome Assemblies
 
20110524zurichngs 1st pub
20110524zurichngs 1st pub20110524zurichngs 1st pub
20110524zurichngs 1st pub
 
NGS Data Preprocessing
NGS Data PreprocessingNGS Data Preprocessing
NGS Data Preprocessing
 
20110524zurichngs 2nd pub
20110524zurichngs 2nd pub20110524zurichngs 2nd pub
20110524zurichngs 2nd pub
 
Overview of methods for variant calling from next-generation sequence data
Overview of methods for variant calling from next-generation sequence dataOverview of methods for variant calling from next-generation sequence data
Overview of methods for variant calling from next-generation sequence data
 
Ashg grc workshop2014_tg
Ashg grc workshop2014_tgAshg grc workshop2014_tg
Ashg grc workshop2014_tg
 
Bng presentation draft
Bng presentation draftBng presentation draft
Bng presentation draft
 
Next-generation sequencing data format and visualization with ngs.plot 2015
Next-generation sequencing data format and visualization with ngs.plot 2015Next-generation sequencing data format and visualization with ngs.plot 2015
Next-generation sequencing data format and visualization with ngs.plot 2015
 
Anne_Vaittinen_advanced_seminar_presentation
Anne_Vaittinen_advanced_seminar_presentationAnne_Vaittinen_advanced_seminar_presentation
Anne_Vaittinen_advanced_seminar_presentation
 
Wellcome Trust Advances Course: NGS Course - Lecture1
Wellcome Trust Advances Course: NGS Course - Lecture1Wellcome Trust Advances Course: NGS Course - Lecture1
Wellcome Trust Advances Course: NGS Course - Lecture1
 
Updated: New High Throughput Sequencing technologies at the Norwegian Sequenc...
Updated: New High Throughput Sequencing technologies at the Norwegian Sequenc...Updated: New High Throughput Sequencing technologies at the Norwegian Sequenc...
Updated: New High Throughput Sequencing technologies at the Norwegian Sequenc...
 
New data from giab genomes pacbio ccs
New data from giab genomes   pacbio ccsNew data from giab genomes   pacbio ccs
New data from giab genomes pacbio ccs
 
Discovery and annotation of variants by exome analysis using NGS
Discovery and annotation of variants by exome analysis using NGSDiscovery and annotation of variants by exome analysis using NGS
Discovery and annotation of variants by exome analysis using NGS
 
Cpf1-based genome editing using ribonucleoprotein complexes
Cpf1-based genome editing using ribonucleoprotein complexesCpf1-based genome editing using ribonucleoprotein complexes
Cpf1-based genome editing using ribonucleoprotein complexes
 
Optimized methods to use Cas9 nickases in genome editing
Optimized methods to use Cas9 nickases in genome editingOptimized methods to use Cas9 nickases in genome editing
Optimized methods to use Cas9 nickases in genome editing
 
Understanding the reference assembly: CSHL Hackathon
Understanding the reference assembly: CSHL HackathonUnderstanding the reference assembly: CSHL Hackathon
Understanding the reference assembly: CSHL Hackathon
 
ABGT 2016 Workshop Schneider
ABGT 2016 Workshop SchneiderABGT 2016 Workshop Schneider
ABGT 2016 Workshop Schneider
 

Ähnlich wie Using BioNano Maps to Improve an Insect Genome Assembly​

Ngs microbiome
Ngs microbiomeNgs microbiome
Ngs microbiome
jukais
 
2013 pag-equine-workshop
2013 pag-equine-workshop2013 pag-equine-workshop
2013 pag-equine-workshop
c.titus.brown
 
picard_poster_12_16_15
picard_poster_12_16_15picard_poster_12_16_15
picard_poster_12_16_15
David E. Kling
 

Ähnlich wie Using BioNano Maps to Improve an Insect Genome Assembly​ (20)

Creating a SNP calling pipeline
Creating a SNP calling pipelineCreating a SNP calling pipeline
Creating a SNP calling pipeline
 
20150601 bio sb_assembly_course
20150601 bio sb_assembly_course20150601 bio sb_assembly_course
20150601 bio sb_assembly_course
 
Toward A Better Understanding Of Plant Genome Structure: Combining NGS, Optic...
Toward A Better Understanding Of Plant Genome Structure: Combining NGS, Optic...Toward A Better Understanding Of Plant Genome Structure: Combining NGS, Optic...
Toward A Better Understanding Of Plant Genome Structure: Combining NGS, Optic...
 
Genome assembly from three sequencing platforms: minION, MiSeq and PacBio
Genome assembly from three sequencing platforms: minION, MiSeq and PacBioGenome assembly from three sequencing platforms: minION, MiSeq and PacBio
Genome assembly from three sequencing platforms: minION, MiSeq and PacBio
 
Miten Generating high-quality reference human genomes using Promethion nanopo...
Miten Generating high-quality reference human genomes using Promethion nanopo...Miten Generating high-quality reference human genomes using Promethion nanopo...
Miten Generating high-quality reference human genomes using Promethion nanopo...
 
Complementing Computation with Visualization in Genomics
Complementing Computation with Visualization in GenomicsComplementing Computation with Visualization in Genomics
Complementing Computation with Visualization in Genomics
 
ChipSeq Data Analysis
ChipSeq Data AnalysisChipSeq Data Analysis
ChipSeq Data Analysis
 
iMate Protocol Guide version 3.0
iMate Protocol Guide version 3.0 iMate Protocol Guide version 3.0
iMate Protocol Guide version 3.0
 
iMate Protocol Guide version 2.0
iMate Protocol Guide version 2.0iMate Protocol Guide version 2.0
iMate Protocol Guide version 2.0
 
Generating high-quality human reference genomes using PromethION nanopore seq...
Generating high-quality human reference genomes using PromethION nanopore seq...Generating high-quality human reference genomes using PromethION nanopore seq...
Generating high-quality human reference genomes using PromethION nanopore seq...
 
How to sequence a large eukaryotic genome
How to sequence a large eukaryotic genomeHow to sequence a large eukaryotic genome
How to sequence a large eukaryotic genome
 
Ngs microbiome
Ngs microbiomeNgs microbiome
Ngs microbiome
 
RSEM and DE packages
RSEM and DE packagesRSEM and DE packages
RSEM and DE packages
 
iMate Protocol Guide version 1.2
iMate Protocol Guide version 1.2iMate Protocol Guide version 1.2
iMate Protocol Guide version 1.2
 
Jan2016 bio nano han cao
Jan2016 bio nano han caoJan2016 bio nano han cao
Jan2016 bio nano han cao
 
2013 pag-equine-workshop
2013 pag-equine-workshop2013 pag-equine-workshop
2013 pag-equine-workshop
 
picard_poster_12_16_15
picard_poster_12_16_15picard_poster_12_16_15
picard_poster_12_16_15
 
Ashg grc workshop2015_tg
Ashg grc workshop2015_tgAshg grc workshop2015_tg
Ashg grc workshop2015_tg
 
RNA-Seq transcriptome analysis of Gonium pectorale cell cycle
RNA-Seq transcriptome analysis of Gonium pectorale cell cycleRNA-Seq transcriptome analysis of Gonium pectorale cell cycle
RNA-Seq transcriptome analysis of Gonium pectorale cell cycle
 
Scaling Genomic Analyses
Scaling Genomic AnalysesScaling Genomic Analyses
Scaling Genomic Analyses
 

Mehr von Jennifer Shelton

Applied Bioinformatics Journal Club Pacbio RNA-Seq
Applied Bioinformatics Journal Club Pacbio RNA-SeqApplied Bioinformatics Journal Club Pacbio RNA-Seq
Applied Bioinformatics Journal Club Pacbio RNA-Seq
Jennifer Shelton
 
RNASeq DE methods review Applied Bioinformatics Journal Club
RNASeq DE methods review Applied Bioinformatics Journal ClubRNASeq DE methods review Applied Bioinformatics Journal Club
RNASeq DE methods review Applied Bioinformatics Journal Club
Jennifer Shelton
 
Translocation detection in lung cancer using mate-pair sequencing and iVIGS
Translocation detection in lung cancer using mate-pair sequencing and iVIGSTranslocation detection in lung cancer using mate-pair sequencing and iVIGS
Translocation detection in lung cancer using mate-pair sequencing and iVIGS
Jennifer Shelton
 

Mehr von Jennifer Shelton (14)

Bioinformatic core facilities discussion
Bioinformatic core facilities discussionBioinformatic core facilities discussion
Bioinformatic core facilities discussion
 
Structural Variation Detection
Structural Variation DetectionStructural Variation Detection
Structural Variation Detection
 
Lecture1: NGS Analysis on Beocat and an introduction to Perl programming for ...
Lecture1: NGS Analysis on Beocat and an introduction to Perl programming for ...Lecture1: NGS Analysis on Beocat and an introduction to Perl programming for ...
Lecture1: NGS Analysis on Beocat and an introduction to Perl programming for ...
 
Journal club slides to discuss "Differential analysis of gene regulation at t...
Journal club slides to discuss "Differential analysis of gene regulation at t...Journal club slides to discuss "Differential analysis of gene regulation at t...
Journal club slides to discuss "Differential analysis of gene regulation at t...
 
Hub gene selection_ds
Hub gene selection_dsHub gene selection_ds
Hub gene selection_ds
 
Applied Bioinformatics Journal Club Pacbio RNA-Seq
Applied Bioinformatics Journal Club Pacbio RNA-SeqApplied Bioinformatics Journal Club Pacbio RNA-Seq
Applied Bioinformatics Journal Club Pacbio RNA-Seq
 
RNASeq DE methods review Applied Bioinformatics Journal Club
RNASeq DE methods review Applied Bioinformatics Journal ClubRNASeq DE methods review Applied Bioinformatics Journal Club
RNASeq DE methods review Applied Bioinformatics Journal Club
 
Bionano genome maps_feb2014
Bionano genome maps_feb2014Bionano genome maps_feb2014
Bionano genome maps_feb2014
 
Translocation detection in lung cancer using mate-pair sequencing and iVIGS
Translocation detection in lung cancer using mate-pair sequencing and iVIGSTranslocation detection in lung cancer using mate-pair sequencing and iVIGS
Translocation detection in lung cancer using mate-pair sequencing and iVIGS
 
Summary slides by Prabhakar Chalise of the Oberg et al. 2012 article "Technic...
Summary slides by Prabhakar Chalise of the Oberg et al. 2012 article "Technic...Summary slides by Prabhakar Chalise of the Oberg et al. 2012 article "Technic...
Summary slides by Prabhakar Chalise of the Oberg et al. 2012 article "Technic...
 
RNA-Seq transcriptome analysis of Gonium pectorale cell cycle.
RNA-Seq transcriptome analysis of Gonium pectorale cell cycle.RNA-Seq transcriptome analysis of Gonium pectorale cell cycle.
RNA-Seq transcriptome analysis of Gonium pectorale cell cycle.
 
Multi-k-mer de novo transcriptome assembly and assembly of assemblies using 4...
Multi-k-mer de novo transcriptome assembly and assembly of assemblies using 4...Multi-k-mer de novo transcriptome assembly and assembly of assemblies using 4...
Multi-k-mer de novo transcriptome assembly and assembly of assemblies using 4...
 
Param selection phase1summary_v2
Param selection phase1summary_v2Param selection phase1summary_v2
Param selection phase1summary_v2
 
Bioinformatic jc 08_14_2013_formal
Bioinformatic jc 08_14_2013_formalBioinformatic jc 08_14_2013_formal
Bioinformatic jc 08_14_2013_formal
 

Using BioNano Maps to Improve an Insect Genome Assembly​

  • 1. Using BioNano Maps to Improve an Insect Genome Assembly ! ! ! ! ! ! ! ! ! Sue Brown Oct 23, 2014 1
  • 2. Tribolium castaneum, the red flour beetle Genetic model organism for developmental, physiology and toxicology studies ! • Easily cultured • Short generation time • Small genome size • Molecular and visible marker genetic maps • Genetic tools: balancers, deficiencies • Genomic libraries: lambda and BAC • cDNA libraries • Mutant analysis and RNAi • Transformation • 7x Sanger draft genome (Nature, 2008) 2
  • 3. Tribolium Genome ! ! ! ! Genome size: 200 (Mb) Cot Analysis 9 autosomes, X and Y Low methylation Long period interspersion ! ! ! ! ! ! Jeff Stuart, Purdue University 3
  • 4. Molecular map markers used to anchor scaffolds to Chromosome builds Few X markers, no Y, variable marker density 4
  • 5. Tcas 3.0 Reference Genome stats from NCBI Input file N50 (Mb) Number Cumulative Length (Mb) Genome contigs 0.04 8814 160.74 Genome scaffolds 0.98 481 152.53 Unmapped scaffolds 352 Unmapped contigs 1884 5
  • 6. Algorithms and filters used to improve the Tribolium draft Assembly with Physical Maps Based on Imaging Ultra-Long Single DNA Molecules ! Jennifer Shelton 2014 6
  • 7. Data formats BNX molecule 1 BNX - text file of molecules 7
  • 8. Data formats BNX molecule 1 BNX - text file of molecules CMAP - text file of consensus maps 7
  • 9. Data formats in silico CMAP - from genome FASTA BNX molecule 1 in silico CMAP 1 BNX - text file of molecules CMAP - text file of consensus maps 7
  • 10. Data formats in silico CMAP - from genome FASTA BNG CMAP - from assembled molecules BNX molecule 1 in silico CMAP 1 BNG CMAP 1 BNX - text file of molecules CMAP - text file of consensus maps 7
  • 11. Data formats in silico CMAP - from genome FASTA BNG CMAP - from assembled molecules XMAP - text file of alignment of two CMAPs BNX molecule 1 in silico CMAP 1 BNG CMAP 1 in silico CMAP 1 in silico CMAP 2 BNG CMAP 1 BNG CMAP 2 BNX - text file of molecules CMAP - text file of consensus maps 7
  • 12. Assembly Pipeline BBNNXX BNX scanBNX 4) QC graphs for each flowcell adjBNX adjBNX 5) Merge all flowcells 6) First assemblies Strict p-value threshold Default p-value threshold Assembly workflow:! ! 1) The Irys produces tiff files that are converted into BNX text files.! 2) Each chip produces one BNX file for each of two flowcells.! 3) BNX files are split by scan and aligned to the sequence reference. Stretch (bases per pixel) is 3) use sequence reference to adjust molecule stretch for each scan recalculated from the alignment.! 4) Quality check graphs are created for each pre-adjusted flowcell BNX.! 5) Adjusted flowcell BNXs are merged.! 6) The first assemblies are run with a variety of p-value thresholds.! 7) The best of the first assemblies (red oval) is chosen and a version of this assembly is produced with a variety of minimum molecule length filters. adjBNX adjBNX 1) The Irys produces tiff files 3) Scan BNX are adjusted 7) Second assemblies Strict minimum molecule length Relaxed minimum molecule length mergeBNX Relaxed p-value threshold BBNNXX BNX scanBNX BBNNXX BNX scanBNX BBNNXX BNX scanBNX 2) Each chip produces flowcell BNX files BNX BNX BNX BNX 8
  • 13. Assembly Pipeline In recent datasets when SNR is low and alignment is good we see a spike in bases per pixel (bpp) in the first scan, a plateau and a lower plateau First scan in a flow cell 9
  • 14. Assembly Pipeline BBNNXX BNX scanBNX 4) QC graphs for each flowcell adjBNX adjBNX 5) Merge all flowcells 6) First assemblies Strict p-value threshold Default p-value threshold Assembly workflow:! ! 1) The Irys produces tiff files that are converted into BNX text files.! 2) Each chip produces one BNX file for each of two flowcells.! 3) BNX files are split by scan and aligned to the sequence reference. Stretch (bases per pixel) is 5) Use sequence reference to determine assembly noise parameters. Estimated recalculated from genome the alignment.! size is used to set the p-value threshold. 4) Quality check graphs are created for each pre-adjusted flowcell BNX.! 5) Adjusted flowcell BNXs are merged.! 6) The first assemblies are run with a variety of p-value thresholds.! 7) The best of the first assemblies (red oval) is chosen and a version of this assembly is produced with a variety of minimum molecule length filters. adjBNX adjBNX 1) The Irys produces tiff files 3) Scan BNX are adjusted 7) Second assemblies Strict minimum molecule length Relaxed minimum molecule length mergeBNX Relaxed p-value threshold BBNNXX BNX scanBNX BBNNXX BNX scanBNX BBNNXX BNX scanBNX 2) Each chip produces flowcell BNX files BNX BNX BNX BNX 10
  • 15. Assembly Pipeline BBNNXX BNX scanBNX 4) QC graphs for each flowcell adjBNX adjBNX 5) Merge all flowcells 6) First assemblies Strict p-value threshold Default p-value threshold Assembly workflow:! ! 1) The Irys produces tiff files that are converted into BNX text files.! 2) Each chip produces one BNX file for each of two flowcells.! 3) BNX files are split by scan and aligned to the sequence reference. Stretch (bases per pixel) is 6/7) Variants of the starting p-value and default minimum molecule length are explored in nine assemblies. recalculated from the alignment.! 4) Quality check graphs are created for each pre-adjusted flowcell BNX.! 5) Adjusted flowcell BNXs are merged.! 6) The first assemblies are run with a variety of p-value thresholds.! 7) The best of the first assemblies (red oval) is chosen and a version of this assembly is produced with a variety of minimum molecule length filters. adjBNX adjBNX 1) The Irys produces tiff files 3) Scan BNX are adjusted 7) Second assemblies Strict minimum molecule length Relaxed minimum molecule length mergeBNX Relaxed p-value threshold BBNNXX BNX scanBNX BBNNXX BNX scanBNX BBNNXX BNX scanBNX 2) Each chip produces flowcell BNX files BNX BNX BNX BNX 11
  • 16. Current Tribolium sequence-based assembly Input file N50 (Mb) Number of Scaffolds Cumulative Length (Mb) Genome FASTA 1.16 2240 160.74 in silico CMAP from FASTA 1.20 223 152.53 223 scaffolds from the sequence-based assembly were longer than 20 (kb) with more than 5 labels and were converted into in silico CMAPs 12
  • 17. Assembly Results Input file N50 (Mb) Number Cumulative Length (Mb) Genome FASTA 1.16 2240 160.74 in silico CMAP from FASTA 1.20 223 152.53 CMAP from assembled BNG molecules (BNG CMAP) 1.35 216 200.47 BNG assembled molecules had a higher N50 and longer cumulative length than the sequence assembly ! The estimated size of the Tribolium genome is ~200 (Mb) 13
  • 18. Simplest XMAP alignment description 1 (Mb) 1.1 (Mb) 1.1 (Mb) 1.3 (Mb) Breadth of alignment coverage for in silico CMAP: 2.1 (Mb) Total alignment length for in silico CMAP: 2.1 (Mb) ! Breadth of alignment coverage for BNG CMAP: 2.4 (Mb) Total alignment length for BNG CMAP: 2.4 (Mb) in silico CMAP from genome FASTA CMAP from assembled molecules in silico CMAP 1 in silico CMAP 2 BNG CMAP 1 BNG CMAP 2 14
  • 19. Complex XMAP alignment description 1 (Mb) in silico CMAP 1 BNG CMAP 1 BNG CMAP 2 1.1 (Mb) 1.3 (Mb) Breadth of alignment coverage for in silico CMAP: 1 (Mb) Total alignment length for in silico CMAP: 2 (Mb) ! Breadth of alignment coverage for BNG CMAP: 2.4 (Mb) Total alignment length for BNG CMAP: 2.4 (Mb) in silico CMAP from genome FASTA CMAP from assembled molecules 15
  • 20. Alignment of CMAPs 1 (Mb) in silico CMAP 1 BNG CMAP 1 BNG CMAP 2 1.1 (Mb) 1.3 (Mb) Breadth of alignment coverage compared to total aligned length can indicate relevant relationships between assemblies ! In this example differences between "breadth" and "total" length could be due to: ! Genomic duplications in sample molecules were extracted from Assembly of alternate haplotypes Mis-assembly creating redundant contigs Collapsed repeat in sequence assembly in silico CMAP from genome FASTA CMAP from assembled molecules 16
  • 21. Alignment of BNG assembly to reference genome CMAP name Breadth of alignment coverage for CMAP (Mb) Length of total alignment for CMAP (Mb) Percent of CMAP aligned in silico CMAP from FASTA 124.04 132.40 81 CMAP from assembled BNG molecules (BNG CMAP) 131.64 132.34 67 Close to 4% of the alignment of the in silico CMAP appears to be redundant ! Overall 81% of the in silico CMAP aligns to the BNG consensus map 17
  • 22. ChLG 9 super! Alignment of BNG assembly to reference genome scaffold BNG consensus maps ChLG 9! scaffolds 130 131 133 134 132 129 135 127 136 137 BNG consensus Typically where redundant alignments occur two BNG consensus maps aligned suggesting they represent haplotypes although this has not been verified maps 18
  • 23. Potential haplotypes where overlapping BNG cmaps align ChLG 9 super! scaffold BNG consensus maps ChLG 9! scaffolds 128 130 131 133 134 132 BNG consensus maps 19
  • 24. Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium scaffolds + in silico CMAP 1 + in silico CMAP 4 Stitch.pl estimates super scaffolds using alignments of scaffolds and assembled BNG molecules using BNG Refaligner in silico CMAP aligned as reference + in silico CMAP 2 - in silico CMAP 3 BNG CMAP 1 BNG CMAP 2 20
  • 25. Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium scaffolds + in silico CMAP 1 + in silico CMAP 4 BNG CMAP 1 BNG CMAP 2 + in silico CMAP 1 + in silico CMAP 4 Stitch.pl estimates super scaffolds using alignments of scaffolds and assembled BNG molecules using BNG Refaligner in silico CMAP aligned as reference alignment is inverted and used as input for stitch + in silico CMAP 2 - in silico CMAP 3 BNG CMAP 1 BNG CMAP 2 + in silico CMAP 2 - in silico CMAP 3 20
  • 26. Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium scaffolds + in silico CMAP 1 + in silico CMAP 4 BNG CMAP 1 BNG CMAP 2 + in silico CMAP 1 + in silico CMAP 4 BNG CMAP 1 BNG CMAP 2 Stitch.pl estimates super scaffolds using alignments of scaffolds and assembled BNG molecules using BNG Refaligner in silico CMAP aligned as reference alignment is inverted and used as input for stitch + in silico CMAP 2 - in silico CMAP 3 + in silico CMAP 4 alignments are filtered based on alignment length relative total possible alignment length and confidence + in silico CMAP 2 - in silico CMAP 3 BNG CMAP 1 BNG CMAP 2 + in silico CMAP 2 - in silico CMAP 3 + in silico CMAP 1 20
  • 27. Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium scaffolds BNG CMAP 1 BNG CMAP 2 + in silico CMAP 1 + in silico CMAP 4 + in silico CMAP 2 - in silico CMAP 3 BNG CMAP 1 + in silico CMAP 1 Stitch.pl checks alignment length against potential alignment lengths to find relevant global rather than local alignments alignment passes because the alignment length is greater than 30% of the potential alignment length 21
  • 28. Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium BNG CMAP 1 BNG CMAP 2 + in silico CMAP 1 + in silico CMAP 4 + in silico CMAP 2 - in silico CMAP 3 BNG CMAP 1 scaffolds + in silico CMAP 2 Stitch.pl checks alignment length against potential alignment lengths to find relevant global rather than local alignments alignment passes because the alignment length is greater than 30% of the potential alignment length 22
  • 29. Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium scaffolds BNG CMAP 1 BNG CMAP 2 + in silico CMAP 1 + in silico CMAP 4 + in silico CMAP 2 - in silico CMAP 3 - in silico CMAP 2 BNG CMAP 2 Stitch.pl checks alignment length against potential alignment lengths to find relevant global rather than local alignments alignment passes because the alignment length is greater than 30% of the potential alignment length 23
  • 30. Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium scaffolds BNG CMAP 1 BNG CMAP 2 + in silico CMAP 1 + in silico CMAP 4 + in silico CMAP 2 - in silico CMAP 3 - in silico CMAP 2 BNG CMAP 2 Stitch.pl checks alignment length against potential alignment lengths to find relevant global rather than local alignments alignment fails because the alignment length is less than 30% of the potential alignment length 24
  • 31. Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium scaffolds BNG CMAP 1 BNG CMAP 2 + in silico CMAP 1 + in silico CMAP 4 + in silico CMAP 2 - in silico CMAP 3 + in silico CMAP 2 BNG CMAP 2 Stitch.pl checks alignment length against potential alignment lengths to find relevant global rather than local alignments alignment fails because the alignment length is less than 30% of the potential alignment length 25
  • 32. Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium scaffolds BNG CMAP 1 BNG CMAP 2 + in silico CMAP 1 + in silico CMAP 4 + in silico CMAP 2 - in silico CMAP 3 BNG CMAP 2 Stitch.pl checks alignment length against potential alignment lengths to find relevant global rather than local alignments alignment passes because the alignment length is greater than 30% of the potential alignment length - in silico CMAP 3 26
  • 33. Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium BNG CMAP 1 BNG CMAP 2 + in silico CMAP 1 + in silico CMAP 4 + in silico CMAP 2 - in silico CMAP 3 BNG CMAP 2 scaffolds Stitch.pl checks alignment length against potential alignment lengths to find relevant global rather than local alignments alignment fails because the alignment length is less than 30% of the potential alignment length - in silico CMAP 3 27
  • 34. Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium BNG CMAP 1 BNG CMAP 2 + in silico CMAP 1 + in silico CMAP 4 + in silico CMAP 2 - in silico CMAP 3 BNG CMAP 2 scaffolds Stitch.pl checks alignment length against potential alignment lengths to find relevant global rather than local alignments alignment passes because the alignment length is greater than 30% of the potential alignment length + in silico CMAP 4 28
  • 35. Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium scaffolds BNG CMAP 1 BNG CMAP 2 + in silico CMAP 2 - in silico CMAP 3 + in silico CMAP 4 high quality scaffolding alignments... + in silico CMAP 1 29
  • 36. Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium scaffolds are filtered for longest and highest confidence alignment for each in silico CMAP BNG CMAP 1 BNG CMAP 2 + in silico CMAP 2 - in silico CMAP 3 BNG CMAP 1 BNG CMAP 2 + in silico CMAP 2 - in silico CMAP 3 + in silico CMAP 4 + in silico CMAP 1 + in silico CMAP 4 high quality scaffolding alignments... + in silico CMAP 1 29
  • 37. Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium scaffolds are filtered for longest and highest confidence alignment for each in silico CMAP Passing alignments are used to super scaffold BNG CMAP 1 BNG CMAP 2 + in silico CMAP 2 - in silico CMAP 3 BNG CMAP 1 BNG CMAP 2 + in silico CMAP 2 - in silico CMAP 3 + in silico CMAP 4 + in silico CMAP 1 + in silico CMAP 4 BNG CMAP 1 BNG CMAP 2 + in silico CMAP 2 - in silico CMAP 3 + in silico CMAP 1 + in silico CMAP 4 high quality scaffolding alignments... + in silico CMAP 1 29
  • 38. Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium scaffolds Stitch is iterated and additional super scaffolding alignments are found BNG CMAP 1 BNG CMAP 2 + in silico CMAP 2 - in silico CMAP 3 + in silico CMAP 1 + in silico CMAP 4
  • 39. Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium scaffolds Stitch is iterated and additional super scaffolding alignments are found BNG CMAP 1 BNG CMAP 2 + in silico CMAP 2 - in silico CMAP 3 + in silico CMAP 1 + in silico CMAP 4 Until all super scaffolds are BNG CMAP 1 BNG CMAP 2 joined + in silico CMAP 2 - in silico CMAP 3 + in silico CMAP 1 + in silico CMAP 4
  • 40. Alignment of BNG assembly to reference genome was used to super-scaffold the Tribolium scaffolds BNG CMAP 1 BNG CMAP 2 - in silico CMAP 3 + in silico CMAP 2 + in silico CMAP 4 + in silico CMAP 1 If gap length is estimated to be negative gaps are represented by 100 (bp) spacers 31
  • 41. Gap lengths Distribution of gap lengths for automated output Gap length (bp) Of the automated stitch.pl Tribolium super-scaffolds there were 66 gaps had known lengths and 26 had negative lengths (set to 100 (bp)) ! Of the manually edited Tribolium super-scaffolds there were 66 gaps had known lengths and 24 had negative lengths (set to 100 (bp)) Count −1500000 −1000000 −500000 0 500000 1000000 0 5 10 15 20 Negative gap lengths Positive gap lengths 32
  • 42. Gap lengths Distribution of gap lengths for automated output Gap length (bp) Of the automated stitch.pl Tribolium super-scaffolds there were 66 gaps had known lengths and 26 had negative lengths (set to 100 (bp)) ! Of the manually edited Tribolium super-scaffolds there were 66 gaps had known lengths and 24 had negative lengths (set to 100 (bp)) Count −1500000 −1000000 −500000 0 500000 1000000 0 5 10 15 20 Negative gap lengths Positive gap lengths 32
  • 43. Negative gap lengths Is part of scaffold_23 connected to 136?! I went with the second alignment (21-26 together and 136-137 together because it is supported by genetic maps) but we should check these assemblies. ! ! In bottom alignment of 136 you can see that a large section of the BNG map 32 (which joins 23 to 136) is a duplicate in the BNG assembly? 22 23 129 136 137 The longest negative gap length is from a BNG consenus map joining in silico 23 and 136 33
  • 44. Negative gap lengths Is part of scaffold_23 connected to 136?! I went with the second alignment (21-26 together and 136-137 together because it is supported by genetic maps) but we should check these assemblies. ! ! In bottom alignment of 136 you can see that a large section of the BNG map 32 (which joins 23 to 136) is a duplicate in the BNG assembly? 22 23 129 136 137 ! Because the same region of 136 aligns to another BNG consensus map that aligns to its chromosome linkage group this alignment was rejected and stitch was re-run 34
  • 45. scaffold_133 aligns but is not visible in IrysView?) why is Super_scaffold_65 backwards? Negative gap lengths ChLG 2 super! scaffold ChLG 2 super! scaffold BNG consensus maps BNG consensus maps ChLG 2! scaffolds 133 134 132 129 135 127 136 137 139 138 140 141 142 143 144 145 Two new super scaffolds were created and the sequence similarity is being evaluated min confidence 10 scaffold_133 aligns but is not visible in IrysView?) why is Super_scaffold_65 backwards? ChLG 2! scaffolds 130 131 133 134 132 129 135 127 136 137 139 138 140 141 142 143 144 145 U 18 14 16 19 20 21 22 23 24 25 26 27 28 30 BNG consensus maps U 18 14 16 19 20 21 22 23 24 25 26 27 28 30 BNG consensus maps 35
  • 46. Gap lengths Distribution of gap lengths for automated output Gap length (bp) This negative alignment also indicated a potential assembly issue Count −1500000 −1000000 −500000 0 500000 1000000 0 5 10 15 20 Negative gap lengths Positive gap lengths 36
  • 47. Negative gap lengths This negative gap length is from a BNG consenus map joining in silico 81 and 102 and 103 Half of scaffold_81 aligns with ChLG7 37
  • 48. Negative gap lengths Half of scaffold_81 aligns with ChLG7 79 80 81 82 83 Because the other half of 81 aligns to another BNG consensus map that aligns to its chromosome linkage group this alignment was rejected and stitch was re-run ! The BNG maps suggest a mis-assembly of in silico 81 at a sequence level 38
  • 49. Distribution of gap lengths for automated output Gap length (bp) Count −1500000 −1000000 −500000 0 500000 1000000 0 5 10 15 20 Negative gap lengths Positive gap lengths Gap lengths All extremely small negative gap lengths, < -20,000 (bp) (shaded), were independently flagged as potential sequence mis-assemblies to be checked at the sequence-level 39
  • 50. Distribution of gap lengths for automated output Gap length (bp) Count −1500000 −1000000 −500000 0 500000 1000000 0 5 10 15 20 Negative gap lengths Positive gap lengths Gap lengths All gaps from the shaded regions were also manually rejected and stitch.pl was rerun without them for the current super-scaffolded assembly ! We suspect extremely small negative gap sizes may be useful in locating sequence mis-assemblies ! stitch.pl version 1.4.5 rejects alignments if negative gap lengths < -20,000 (bp) but lists them in data summary 40
  • 51. Tribolium super-scaffolds Input file N50 (Mb) Number of Scaffolds Cumulative Length (Mb) genome FASTA 1.16 2240 160.74 super-scaffold FASTA 4.46 2150 165.92 N50 of the super-scaffolded genome was ~4 times greater than the original ! Super-scaffolds tend to agree with the Tribolium genetic map 41
  • 52. Tribolium super-scaffolds Input file N50 (Mb) Number of Scaffolds genome FASTA 1.16 2240 160.74 4.46 2150 165.92 For Tribolium : first minimum percent aligned = 30% first minimum confidence = 13 Cumulative Length (Mb) second minimum percent aligned = 90% second minimum confidence = 8 ! super-scaffold FASTA Lower quality alignments were manually selected if genetic map also supported the order Complex scaffolds were broken manually for sequence level evaluation 42
  • 53. Tribolium super-scaffolds min confidence 10 From ChLGX, 11 of the previous 13 scaffolds were joined with two unplaced scaffolds (U) into one super scaffold. ChLG X was reduced from 13 scaffolds to 2 with one scaffold being moved to ChLG 3 ChLG X super! scaffold BNG consensus maps ChLG X! scaffolds BNG consensus maps U 3 4 5 6 7 U 8 9 10 11 12 13 43
  • 54. Tribolium super-scaffolds min confidence 10 51 U 43 45 44 46 The second scaffold from ChLG X aligned to scaffolds from a portion of ChLG 3 ChLG 3 super! scaffold BNG consensus maps ChLG 3! scaffolds BNG consensus maps 32 33 34 35 36 2 37 38 39 40 41 42 ChLG 3 super! scaffold BNG consensus maps ChLG 3 super! scaffold BNG consensus 44
  • 55. Tribolium super-scaffolds min confidence 10 From ChLGX, 11 of the previous 13 scaffolds were joined with two unplaced scaffolds (U) into one super scaffold. Two unplaced scaffolds aligned to ChLG X ChLG X super! scaffold BNG consensus maps ChLG X! scaffolds BNG consensus maps U 3 4 5 6 7 U 8 9 10 11 12 13 45
  • 56. Tribolium super-scaffolds min confidence 10 From ChLGX, 11 of the previous 13 scaffolds were joined with two unplaced scaffolds (U) into one super scaffold. 4% Redundancy in alignment may be from assembly of haplotypes (generally observed as two BNG consensus maps aligning to the same in silico map) ChLG X super! scaffold BNG consensus maps ChLG X! scaffolds BNG consensus maps U 3 4 5 6 7 U 8 9 10 11 12 13 46
  • 57. Potential haplotypes where overlapping BNG cmaps align min confidence 10 From ChLGX, 11 of the previous 13 scaffolds were joined with two unplaced scaffolds (U) into one super scaffold. ChLG X super! scaffold BNG consensus maps ChLG X! scaffolds BNG consensus maps U 3 4 5 6 7 U 8 9 10 11 12 13 47
  • 58. Tribmoinl icuonmfid esnucep 1e0r-scaffolds ChLG9 currently (2150 scaffold_133 aligns but is not visible in IrysView?) why is Super_scaffold_65 backwards? 128 130 131 133 134 132 129 135 127 136 137 139 138 140 141 142 143 144 145 For ChLG 9 21 scaffolds were reduced to 9 ChLG 9 super! scaffold BNG consensus maps ChLG 9! scaffolds BNG consensus maps 48
  • 59. min confidence 10 Tribolium super-scaffolds For ChLG 5 17 scaffolds were reduced to 4 ChLG 5 super! scaffold BNG consensus maps ChLG 5! scaffolds BNG consensus maps 69 68 70 71 72 73 74 U 75 76 77 78 79 80 81 82 83 49
  • 60. Future directions: Structural Variant (SV) Use SV-detect pipelines to resize existing gaps in scaffolds and identify mis-assemblies 50
  • 61. Acknowledgements K-INBRE Bioinformatics Core! Susan Brown - PI Nic Herndon - script development Nanyan Lu - manual evaluation Michelle Coleman - extractions and running the Irys! Zachary Sliefert - metric summaries ! Bionano Genomics! Ernest Lam - assembly pipeline best practices assistance Weiping Wang - assistance with data formats Palak Sheth - collaboration to standardize analysis ! Script availability! https://github.com/i5K-KINBRE-script-share/Irys-scaffolding BNG scripts available by request from BNG ! Slide availability! http://www.slideshare.net/kstatebioinformatics/using-bionano-maps-to-improve-an-insect-genome- Physical Molecules! University, Warren Kansas State University assembly ! This project was supported by grants from the National Center for Research Resources (5P20RR016475) and the National Institute of General Medical Sciences (8P20GM103418) from the National Institutes of Health. were constructed by mapping molecular markers from the the assembly scaffolds, anchoring greater than 90% 51
  • 62. Gap lengths Distribution of gap lengths for automated output Gap length (bp) Of the automated stitch.pl Tribolium super-scaffolds there were 66 gaps had known lengths and 26 had negative lengths (set to 100 (bp)) ! Of the manually edited Tribolium super-scaffolds there were 66 gaps had known lengths and 24 had negative lengths (set to 100 (bp)) Count −1500000 −1000000 −500000 0 500000 1000000 0 5 10 15 20 Negative gap lengths Positive gap lengths